PHP: setting the page encoding. Solving problems with incorrect web page encoding. Character encoding and language

Vlad Merzhevich

Meta tags are used to store information intended for browsers and search engines. For example, search engines read meta tags to obtain a site's description, keywords, and other data.

Meta tags for search engines

There is an opinion among website developers that correctly written meta tags allow you to rise to the top of search engines. In fact, this is not true; meta tags alone will not help you rise high, but poorly executed meta tag content can worsen the site’s ranking.

Two meta tags are designed specifically for search engines: description and keywords. Some webmasters used to fill the keywords tag with words that had nothing to do with the topic of the site but were popular among search engine users. Over time, however, search engines learned to deal with this practice and now check the content of a web page against the stated keywords.

Some principles related to meta tags:

do not include keywords that are not contained on your pages;
do not repeat keywords;
use meta tags for their intended purpose;
make the description and list of keywords different for each page of the site, taking into account the content.

description

Most search engines display the contents of the description field (example 1) when displaying search results. If this tag is not on the page, then the search engine will simply list the first words found on the page, which, as a rule, are not very relevant to the topic.

Example 1: Using Description

<meta name="description" content="A short description of the page">

keywords

This meta tag was intended to list the keywords that appear on the page (example 2). However, it has been discredited by people who tried to get to the top of search results by any means necessary. As a result, many search engines now ignore this parameter.

Example 2: Using Keywords

<meta name="keywords" content="first keyword, second keyword">

Keywords can be listed separated by spaces or commas. Search engines themselves will convert the entry to the form they use.

Autoloading pages

To automatically load a new document after a certain period of time, use the http-equiv="refresh" instruction (example 3), for instance with content="5; URL=...", where the address of the new page follows URL=.

The browser understands such an entry as an instruction to wait 5 seconds and then load the page specified in the URL parameter.

This meta tag allows you to create a redirection to another site. If no URL is specified, the current page will automatically refresh after the number of seconds specified in the content attribute.

Encoding

To tell the browser which encoding the characters on a web page use, set the charset parameter in a meta tag. For Cyrillic text, and for the Windows operating system in particular, the charset usually takes the value utf-8 or windows-1251 (example 4).

Example 4. Selecting the current encoding

<meta charset="utf-8">
<title>Encoding</title>
<p>Cyrillic</p>

If no encoding is specified, the browser tries to determine which characters the document uses and selects an encoding automatically. It does not always guess correctly and in some cases will suggest a Vietnamese encoding instead of a Cyrillic one. For this reason, it is better to always specify the encoding. However, there are circumstances where specifying the encoding can do harm. For example, the web server automatically transcodes data to KOI-8, while the browser, on encountering the charset=windows-1251 parameter, interprets the text as Windows-encoded. The characters are thus converted twice, and the resulting text is hard to read. Fortunately, this problem is largely a thing of the past; in any case, it can be easily identified and fixed at the server level.
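The mismatch between what the server sends and what the page declares can be reproduced in a few lines. This is only a minimal Python sketch, assuming the server sends KOI8-R bytes while the page declares windows-1251; the sample word and the helper name misread are made up for illustration.

```python
def misread(text: str, actual: str, declared: str) -> str:
    """Encode text as `actual`, then decode the same bytes as `declared`."""
    return text.encode(actual).decode(declared, errors="replace")

original = "кодировка"
garbled = misread(original, actual="koi8_r", declared="cp1251")
print(garbled)  # Cyrillic letters, but not the ones that were written

# Reading the same bytes with the encoding that was really used restores the text.
restored = garbled.encode("cp1251").decode("koi8_r")
print(restored == original)
```

The bytes themselves are never damaged here; only the table used to read them is wrong, which is why the problem can be fixed by correcting the declared encoding.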

Hello, dear readers of my blog. Today we will talk about encoding. If you have read my earlier article on the subject, you know that no document on the Internet is stored in the form in which we are used to seeing it. It is written using symbols and signs incomprehensible to humans. It's exactly the same with text.

There are several encodings, which is why you sometimes see strange characters when opening a book in a mobile application or uploading an article to a website. Change a setting or two, and the familiar alphabet appears again.

Windows-1251 encoding - what is it, what significance does it have when creating a website, what characters will be available and is it the best solution today? About all this in today's article. As always, in simple language, as clear as possible and with a minimum number of terms.

A little theory

Any document on a computer or on the Internet, as I said, is stored in the form of binary code. For example, in the DOS encoding CP866 the Cyrillic letter “К” is written as 10001010, while in windows-1251 this number stands for the symbol Љ. As a result, if a browser or program consults the wrong table and reads the windows-1251 codes instead of the CP866 ones, the reader will see a symbol that makes no sense to them.
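Reading one byte through two different tables is easy to try out. Here is a small Python sketch, assuming the byte 10001010 and the DOS CP866 and windows-1251 tables (cp866 and cp1251 are the names Python's standard codecs use for them):

```python
byte = bytes([0b10001010])         # the single byte 0x8A

as_cp866 = byte.decode("cp866")    # DOS Cyrillic table
as_cp1251 = byte.decode("cp1251")  # Windows Cyrillic table

# The same byte gives two different letters, depending on the table.
print(as_cp866, as_cp1251)
```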

The logical question is, why bother coming up with so many tables with codes? The fact is that in addition to the Russian alphabet, there is also English, German, and Chinese. By some estimates, there are about 200,000 characters. Although, I don’t really trust these statistics, remembering Japanese.

Don't forget that for capital and lowercase letters you need to come up with your own code, there are commas, dashes, and so on.

The more symbols in the table, the longer the code for each of them, which means the weight of the document becomes greater.

Imagine if one book weighed 4 GB! It would take a very long time to load and take up all the free space on the computer. The decision to download would not seem easy.

If you think about websites, it’s scary to imagine what would happen. Each page would take more than an hour to open even on high-speed fiber optics! Mobile phones could safely be thrown away. Could you use them outdoors even with 4G? I doubt it.

For these reasons, programmers at one time kept coming up with their own symbol tables, each trying to make the table convenient to use while keeping document size optimal.

Microsoft, for example, created windows-1251 for the Russian-language segment. It, of course, has its advantages and disadvantages. Just like any other product.

Nowadays, only 2% of all pages on the Internet are written in 1251. Most webmasters use UTF-8. Why is that?

Disadvantages and advantages

UTF-8, unlike windows-1251, is a universal encoding; it contains letters of various alphabets. There are also UTF-16 and UTF-32. All of them cover every language in Unicode: Telugu, Swahili, Lao, Maltese and so on.

UTF-8 is also economical: Latin letters, digits and basic punctuation take up only one byte each, as in 1251, while Cyrillic letters take two bytes. UTF-8 also contains rare characters from other languages and special characters. They take three or four bytes, but are used extremely rarely in a document.
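How many bytes a character really takes can be measured directly. A short Python sketch follows; the sample characters are my own choice, and the only assumption is that UTF-8, as currently specified, uses between one and four bytes per character.

```python
samples = {
    "Latin letter": "K",
    "Cyrillic letter": "К",
    "euro sign": "€",
    "emoji": "😀",
}

# Length of each character once encoded as UTF-8.
utf8_sizes = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
for label, size in utf8_sizes.items():
    print(f"{label}: {size} byte(s) in UTF-8")

# The same Cyrillic letter needs a single byte in windows-1251.
cp1251_size = len("К".encode("cp1251"))
print(f"Cyrillic letter: {cp1251_size} byte(s) in windows-1251")
```

So a purely Russian text is roughly twice as large in UTF-8 as in windows-1251, while markup, digits and Latin text cost the same in both.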

This encoding is better thought out, and therefore most applications use it by default. That is, if you do not tell the program what encoding you are using, then the first thing it will check is UTF-8.

When you create an HTML document for a website, you tell browsers which table to use when decoding the text.

To do this, insert a meta tag into the head section. After “charset=” comes either utf-8 or windows-1251, for example <meta charset="utf-8">.

If your page uses windows-1251 and you later want to insert a phrase in Albanian, nothing will work, because this encoding does not support that language. UTF‑8 will allow you to do this without any problems.
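The Albanian case can be checked without building a page. Below is a small Python sketch; the greeting is just an example word I picked because it contains the letter ë, which windows-1251 has no code for.

```python
phrase = "Mirëdita"  # Albanian greeting containing the letter ë

try:
    phrase.encode("cp1251")
    fits_cp1251 = True
except UnicodeEncodeError:
    # windows-1251 has no code for ë, so the text cannot be stored
    fits_cp1251 = False

utf8_bytes = phrase.encode("utf-8")  # works for any Unicode text
print(fits_cp1251, len(utf8_bytes))
```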

If you are interested in building a website properly, I can recommend the course by Mikhail Rusakov, “Website creation and promotion from A to Z”.

It contains a lot: 256 lessons covering JavaScript and XML. In addition to programming languages, you will learn how to monetize a site, that is, how to earn more and sooner. It is one of the few courses that explains everything you need in such detail.

I have been studying for a year now at Alexander Borisov's school of bloggers. It takes many times longer and the end is not yet in sight, but it is no less thorough and disciplined. It motivates me to keep developing.

And if questions arise, there is no need to search the Internet. There is always a competent mentor.

Somehow I went off topic. Let's get back to encodings.

Databases

When it comes to PHP, things get truly scary. I have already talked about databases; they are used to speed up a website. Usually you never touch them, but when you need to move a site, you start to feel uneasy.

Difficulties happen to everyone, no matter how much experience you have. Some records in the database may be stored in Windows-1251, while others, for example page templates, are in a different encoding.

Until a move is needed, everything works, although not entirely correctly. After the move, the trouble begins. Ideally, you should use either UTF-8 only or Windows-1251 only, but in practice such inconsistencies happen to everyone.

For the decoding to be consistent, run the query mysql_query("SET NAMES cp1251"). The data exchanged with the database will then be treated as cp1251. Note that the mysql_* functions were removed in PHP 7; with the mysqli extension the same effect is achieved with mysqli_set_charset($link, "cp1251"), where $link is your connection.

Htaccess

If you are determined to use 1251 on your site, you should find or create an .htaccess file. It is responsible for configuration settings. You will have to add three lines to it for everything to come together.

DefaultLanguage ru
AddDefaultCharset windows-1251
php_value default_charset "cp1251"

I still strongly recommend that you consider using UTF-8. It is more popular, simple and rich. Whatever decisions you make now, it is important that you can correct everything later. Adding an English version of the site using this encoding will be much easier. Nothing needs to be fixed.

The decision is yours.

See you again and good luck in your endeavors.

1. We have a file: Myfile.html.
2. You need to save it in Unicode -> UTF-8 encoding.

Solution 1.

Open Myfile.html in the Notepad text editor.
Select “Save as...”.
Select UTF-8 encoding.
Click the Save button.

Solution 2.

Open Myfile.html in the Notepad++ text editor (the PSPad editor also works).
Menu -> Encodings.
Here we see the encoding of the opened file (Notepad++ detects it automatically).
Choose Convert to UTF-8 without BOM (BOM stands for Byte Order Mark).
(The "UTF-8 without BOM" encoding is preferred and differs from plain "UTF-8".)
Menu -> File -> Save.
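If there are many files, the same conversion can be scripted instead of clicked through. This is a rough Python sketch, assuming the source files are in windows-1251; the function names convert_to_utf8 and has_bom are my own and not part of any editor or library.

```python
import codecs
from pathlib import Path

def convert_to_utf8(path: str, source_encoding: str = "cp1251") -> None:
    """Re-save a text file as UTF-8 without a BOM."""
    file = Path(path)
    text = file.read_bytes().decode(source_encoding)
    # The plain "utf-8" codec never writes a byte order mark.
    file.write_bytes(text.encode("utf-8"))

def has_bom(path: str) -> bool:
    """Report whether the file starts with the UTF-8 byte order mark."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)
```

A call such as convert_to_utf8("Myfile.html") rewrites the file in place, so it is worth keeping a backup copy first. Running it twice on the same file will damage the text, because the second run would treat UTF-8 bytes as windows-1251.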

Browser encoding detection

We ourselves tell the browser what encoding is used for a given HTML file.
This is done using the META tag.

1) A tag such as <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> tells the browser that the downloaded HTML file is saved in utf-8 encoding. If the HTML file is saved in windows-1251 encoding, the tag should specify charset=windows-1251 instead.

2) Important!
When converting files to another encoding, don't forget to update the directive in the META tag to match.
If one encoding is specified in the META tag and the file is saved in another, we will see gibberish on the screen.

3) If the META tag contains the correct encoding but the site still displays gibberish, check the site settings on the hosting (web server).
Usually the hosting sets the encoding to utf-8 in the site settings.
If the hosting settings specify windows-1251, change the setting to utf-8.
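A mismatch between the META tag and the saved file can sometimes be caught automatically. The Python sketch below is only an illustration under my own assumptions: the regular expression and the function names are hypothetical, and the check works in one direction only. It catches a windows-1251 file that is declared as utf-8, but a utf-8 file declared as windows-1251 usually decodes without an error and simply looks like gibberish.

```python
import re

# Finds charset=... inside the first <meta ...> tag that has one.
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(html: bytes):
    """Return the charset named in the META tag, or None if there is none."""
    match = META_CHARSET.search(html)
    return match.group(1).decode("ascii").lower() if match else None

def matches_declaration(html: bytes) -> bool:
    """True if the bytes decode without errors using the declared charset."""
    charset = declared_charset(html)
    if charset is None:
        return False
    try:
        html.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False
```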

Later ASCII was extended (initially it did not use all 8 bits), so that it became possible to use not 128 but 256 (2 to the 8th power) different characters that fit in one byte.
This improvement made it possible to add the symbols of national languages to ASCII, in addition to the existing Latin alphabet.
There are many extended ASCII variants, because there are many languages in the world. Many of you have probably heard of KOI8 (Code of Information Exchange, 8 bits), which is also an extended ASCII encoding. KOI8 included digits, letters of the Latin and Russian alphabets, punctuation marks, special characters and pseudographics.
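The idea of "the same first 128 codes, different second 128" can be seen by comparing a few extended tables. Here is a small Python sketch; the four encodings and the sample byte 0xC1 are my own choice for illustration.

```python
encodings = ["cp1251", "koi8_r", "iso8859_5", "latin_1"]

lower = bytes(range(128))     # the 2**7 codes of the original ASCII
upper_sample = bytes([0xC1])  # one code from the extended half

# Every extended table keeps the ASCII half unchanged...
ascii_preserved = all(lower.decode(enc) == lower.decode("ascii") for enc in encodings)

# ...but assigns its own character to a code between 128 and 255.
upper_chars = {enc: upper_sample.decode(enc) for enc in encodings}
print(ascii_preserved, upper_chars)
```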

ISO encoding

The International Standards Organization has created a range of encodings for different alphabets/languages.

ISO 8859 series encodings

Encoding	Description
ISO 8859-1 (Latin-1)	Extended Latin, including characters from most Western European languages (English, Danish, Irish, Icelandic, Spanish, Italian, German, Norwegian, Portuguese, Romansh, Faroese, Swedish, Scottish Gaelic and parts of Dutch, Finnish, French), as well as some Eastern European (Albanian) and African languages (Afrikaans, Swahili). Latin-1 lacks the euro sign and the capital letter Ÿ. This code page is considered the default encoding for HTML documents and email messages. Also, the first 256 Unicode characters correspond to this code page.
ISO 8859-2 (Latin-2)	Extended Latin, including characters from Central and Eastern European languages (Bosnian, Hungarian, Polish, Slovak, Slovenian, Croatian, Czech). Latin-2, like Latin-1, lacks the euro sign.
ISO 8859-3 (Latin-3)	Extended Latin, including characters from southern European languages (Maltese, Turkish and Esperanto).
ISO 8859-4 (Latin-4)	Extended Latin, including characters from Northern European languages (Greenlandic, Estonian, Latvian, Lithuanian and Sami languages).
ISO 8859-5 (Latin/Cyrillic)	Cyrillic, including characters from Slavic languages (Belarusian, Bulgarian, Macedonian, Russian, Serbian and partly Ukrainian).
ISO 8859-6 (Latin/Arabic)	Symbols used in Arabic. Characters from other Arabic-based languages are not supported. Support for bidirectional writing and context-sensitive character forms is required to display ISO 8859-6 text correctly.
ISO 8859-7 (Latin/Greek)	Symbols of modern Greek language. Can also be used to write ancient Greek texts in monotonic orthography.
ISO 8859-8 (Latin/Hebrew)	Symbols of modern Hebrew. It is used in two versions: with a logical order of characters (requires support for bidirectional writing) and with a visual order of characters.
ISO 8859-9 (Latin-5)	A variant of Latin-1 that replaces rarely used Icelandic characters with Turkish ones. Used for Turkish and Kurdish languages.
ISO 8859-10 (Latin-6)	A Latin-4 variant more suitable for Scandinavian languages.
ISO 8859-11 (Latin/Thai)	Symbols of the Thai language.
ISO 8859-13 (Latin-7)	Latin-4 variant, more suitable for Baltic languages.
ISO 8859-14 (Latin-8)	An extended Latin script that includes characters from Celtic languages such as Scots Gaelic and Breton.
ISO 8859-15 (Latin-9)	A variant of Latin-1 that replaces rarely used characters with those needed to fully support Finnish, French and Estonian. In addition, the euro sign was added to Latin-9.
ISO 8859-16 (Latin-10)	Extended Latin, including characters from Southern European and Eastern European languages (Albanian, Hungarian, Italian, Polish, Romanian, Slovenian, Croatian), as well as some Western European languages (Irish in the new spelling, German, Finnish, French). Like Latin-9, Latin-10 added the euro sign.
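Two of the statements in the table are easy to verify: that Latin-1 coincides with the first 256 Unicode characters, and that the euro sign and Ÿ appear only in the later variants. A minimal Python sketch follows, assuming Python's codec names latin_1, iso8859_2, iso8859_15 and iso8859_16 for the corresponding ISO 8859 parts; the helper can_encode is my own.

```python
all_bytes = bytes(range(256))

# Latin-1 maps byte N straight to Unicode code point N.
latin1_is_unicode_prefix = all_bytes.decode("latin_1") == "".join(map(chr, range(256)))

def can_encode(char: str, encoding: str) -> bool:
    """True if `char` exists in the given code page."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

euro_support = {enc: can_encode("€", enc)
                for enc in ("latin_1", "iso8859_2", "iso8859_15", "iso8859_16")}
print(latin1_is_unicode_prefix, euro_support)
```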

For documents in English and most other Western European languages, the widely supported ISO-8859-1 encoding is used.

In HTML, ISO-8859-1 is the default encoding (in XHTML and HTML5 the default encoding is UTF-8).
When using a page encoding other than ISO-8859-1, you need to indicate this in the <meta> tag.

For HTML4: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For HTML5: <meta charset="utf-8">

An example of ANSI encoding is the well-known Windows-1251.

Windows-1251 differs favorably from other 8-bit Cyrillic encodings (such as CP866 and ISO 8859-5) by the presence of almost all the characters used in Russian typography for plain text (only the accent mark is missing). It also contains all the symbols for other Slavic languages: Ukrainian, Belarusian, Serbian, Macedonian and Bulgarian.
Below are the Unicode code points, in hexadecimal, of the Windows-1251 characters.

To display a character from the table in an HTML document, use the following syntax:

&#x + code + ;

Windows-1251 (CP1251) encoding

	.0	.1	.2	.3	.4	.5	.6	.7	.8	.9	.A	.B	.C	.D	.E	.F
8.	Ђ 402	Ѓ 403	‚ 201A	ѓ 453	„ 201E	… 2026	† 2020	‡ 2021	€ 20AC	‰ 2030	Љ 409	‹ 2039	Њ 40A	Ќ 40C	Ћ 40B	Џ 40F
9.	ђ 452	‘ 2018	’ 2019	“ 201C	” 201D	• 2022	– 2013	(em dash) 2014	(unused)	™ 2122	љ 459	› 203A	њ 45A	ќ 45C	ћ 45B	џ 45F
A.	(nbsp) A0	Ў 40E	ў 45E	Ј 408	¤ A4	Ґ 490	¦ A6	§ A7	Ё 401	© A9	Є 404	« AB	¬ AC	(shy) AD	® AE	Ї 407
B.	° B0	± B1	І 406	і 456	ґ 491	µ B5	¶ B6	· B7	ё 451	№ 2116	є 454	» BB	ј 458	Ѕ 405	ѕ 455	ї 457
C.	А 410	Б 411	В 412	Г 413	Д 414	Е 415	Ж 416	З 417	И 418	Й 419	К 41A	Л 41B	М 41C	Н 41D	О 41E	П 41F
D.	Р 420	С 421	Т 422	У 423	Ф 424	Х 425	Ц 426	Ч 427	Ш 428	Щ 429	Ъ 42A	Ы 42B	Ь 42C	Э 42D	Ю 42E	Я 42F
E.	а 430	б 431	в 432	г 433	д 434	е 435	ж 436	з 437	и 438	й 439	к 43A	л 43B	м 43C	н 43D	о 43E	п 43F
F.	р 440	с 441	т 442	у 443	ф 444	х 445	ц 446	ч 447	ш 448	щ 449	ъ 44A	ы 44B	ь 44C	э 44D	ю 44E	я 44F
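The codes in the table can be turned into HTML character references programmatically. This is a small Python sketch, assuming the hexadecimal form &#x...; of a numeric character reference; the helper name reference_for is hypothetical.

```python
import html

def reference_for(byte_value: int) -> str:
    """Build an HTML numeric reference for one windows-1251 byte."""
    char = bytes([byte_value]).decode("cp1251")
    return f"&#x{ord(char):X};"

ref = reference_for(0x80)  # row 8., column .0 of the table
# html.unescape turns the reference back into the character itself.
print(ref, html.unescape(ref))
```

The row and column of the table give the byte value in windows-1251, and the number next to each character is what goes into the reference.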

UNICODE standard encodings

Unicode is a character encoding standard that makes it possible to represent the characters of almost all written languages in the world, as well as special characters. Characters in Unicode are encoded as unsigned integers. Unicode has several forms for representing characters on a computer, called Unicode transformation formats (UTF): UTF-8, UTF-16 (UTF-16BE, UTF-16LE) and UTF-32 (UTF-32BE, UTF-32LE).
UTF-8 is a currently common encoding that is widely used in operating systems and on the web. Unicode characters numbered below 128 (code range U+0000 to U+007F) are the ASCII characters with the same codes. Next come ranges of characters for various scripts, punctuation marks and technical symbols. The ranges U+0400 to U+052F, U+2DE0 to U+2DFF and U+A640 to U+A69F are allocated to Cyrillic characters.
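The difference between the transformation formats shows up as soon as one character is encoded in each of them. Below is a short Python sketch; the sample letter and the helper is_cyrillic are my own, and the ranges are the three Cyrillic ranges listed above.

```python
char = "Я"  # U+042F, inside the U+0400..U+052F range

# Size of the same character in each Unicode transformation format.
sizes = {form: len(char.encode(form))
         for form in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")}

CYRILLIC_RANGES = [(0x0400, 0x052F), (0x2DE0, 0x2DFF), (0xA640, 0xA69F)]

def is_cyrillic(ch: str) -> bool:
    """True if the code point lies in one of the Cyrillic ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in CYRILLIC_RANGES)

print(sizes, is_cyrillic(char), is_cyrillic("K"))
```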

The UTF-8 encoding is universal and has an impressive reserve for the future. This makes it the most convenient encoding for use on the Internet.
