Character encoding on web sites

Kristina edited this page Nov 2, 2017 · 1 revision

On a computer, a character is any letter, digit, or symbol used to write text in a language. The digits 0-9, the letters A-Z, and even Japanese kanji like 日本 (Japan) are all characters. A character set is a collection of characters (letters and symbols) for a specific writing system.

On a computer, each character is assigned a number called a code point. A code point is stored in computer memory as one or more bytes (a byte is a unit of data used by computer memory).
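
As a rough illustration, here is a small Python sketch (the language and the variable names are assumptions made for this example, not part of the original page). The built-in ord() returns a character's code point, and str.encode() returns the bytes that store it, here in UTF-8:

```python
# Look up the code point of a few characters and the bytes that store them in UTF-8.
samples = ["A", "9", "日"]

rows = []
for ch in samples:
    code_point = ord(ch)              # the number assigned to the character
    utf8_bytes = ch.encode("utf-8")   # how that number is stored as bytes in UTF-8
    rows.append((ch, code_point, utf8_bytes))
    print(f"{ch!r}: code point U+{code_point:04X}, UTF-8 bytes {utf8_bytes.hex(' ')}")
```

The letter A fits in a single byte, while the kanji 日 needs three bytes in UTF-8.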

A character encoding works like a dictionary: it lists which byte combination corresponds to which text character. There are several different character encodings, and many of the older ones were designed for a particular language or group of languages.
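
To make the dictionary idea concrete, the Python sketch below decodes the same single byte with two different legacy encodings. The choice of byte and of the two encodings (Latin-1 and Windows-1251) is only an assumption for the example:

```python
# One byte, two "dictionaries": the result depends on which encoding is used to look it up.
raw = bytes([0xE9])  # a single byte with the value 233

as_latin1 = raw.decode("latin-1")  # Western European encoding
as_cp1251 = raw.decode("cp1251")   # Cyrillic encoding (Windows-1251)

print(as_latin1)  # é
print(as_cp1251)  # й
```

The byte itself never changes; only the lookup table does.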

This video explains how the ASCII character encoding works. ASCII contains the letters, digits, and a limited set of symbols and punctuation used in English. The video also explains how computers know what text to display on the screen.

When you create an HTML website, you must specify the encoding that the page uses. Providing no encoding or the wrong one can result in text being displayed incorrectly on the page, or in the data not being read correctly by a search engine.
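
The garbled text that a wrong encoding produces can be reproduced with a short Python sketch. The sample text is made up for this example; the point is only that bytes written as UTF-8 and read as Windows-1252 come out wrong:

```python
# Text saved as UTF-8 but read back with the wrong encoding turns into garbage.
text = "© 2017 Café"
stored = text.encode("utf-8")       # the bytes as the author saved them

misread = stored.decode("cp1252")   # a reader that wrongly assumes Windows-1252
correct = stored.decode("utf-8")    # a reader that was told the right encoding

print(misread)  # Â© 2017 CafÃ©
print(correct)  # © 2017 Café
```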

You should always use UTF-8, a Unicode character encoding that can encode every Unicode character. Unicode was created to replace legacy encodings like ASCII or Windows-1252, which only cover the characters of one language or a small group of languages.
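
A minimal Python sketch of the difference, using a made-up sample string: ASCII cannot represent these characters at all, while UTF-8 encodes them and decodes them back unchanged.

```python
# ASCII has no entry for these characters; UTF-8 handles all of them.
text = "日本 © €"

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False   # ASCII only covers 128 characters

utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok)            # False
print(round_trip == text)  # True
```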

To use UTF-8 encoding on your web page, put <meta charset="utf-8"> or <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the head section of your web page, as in the example below.

<html lang="en">
   <head>
     <meta charset="utf-8">
   </head>
</html>

Also you must save your HTML file using UTF-8 encoding.
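
Most text editors let you pick the encoding when saving. If a script generates the page instead, the encoding can be set explicitly when writing the file. The Python sketch below is one way to do that; the file name index.html, the page content, and the use of a temporary folder are assumptions made for the example:

```python
import tempfile
from pathlib import Path

page = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>日本</title>
  </head>
  <body>© 2017</body>
</html>
"""

# "index.html" is a hypothetical file name; a temporary folder keeps the sketch side-effect free.
out_file = Path(tempfile.mkdtemp()) / "index.html"

# Pass the encoding explicitly instead of relying on the platform default.
out_file.write_text(page, encoding="utf-8")

# The raw bytes on disk should contain the UTF-8 form of the symbols.
saved = out_file.read_bytes()
print("©".encode("utf-8") in saved)  # True
```

This way the bytes on disk match what the meta tag declares.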

In the past you had to use entities like &copy; to display the © symbol on a web page. Because all browsers now support Unicode, you no longer have to use entities to display the copyright symbol, currency symbols, math symbols, or arrows. Just type the symbol in your HTML code and it will be displayed correctly on your website.

The only exceptions to this rule are the <, >, &, and " characters, which have a special meaning in HTML markup, and the non-breaking space, which looks like an ordinary space in the source. For these characters, use the entity name or entity number to display them on your page.

Result  Description            Entity Name  Entity Number
        non-breaking space     &nbsp;       &#160;
<       less than              &lt;         &#60;
>       greater than           &gt;         &#62;
&       ampersand              &amp;        &#38;
"       double quotation mark  &quot;       &#34;
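
If the page text is produced by a program, the replacements in the table do not have to be typed by hand. As a sketch, Python's standard html module can do the substitution; the sample string is made up for this example:

```python
import html

# Text that contains characters with a special meaning in HTML.
raw = 'if a < b && b > c, print "ok"'

# html.escape replaces &, < and >, and with quote=True also quotation marks.
escaped = html.escape(raw, quote=True)
print(escaped)
# if a &lt; b &amp;&amp; b &gt; c, print &quot;ok&quot;

# html.unescape goes the other way and understands both entity names and numbers.
restored = html.unescape(escaped)
symbols = html.unescape("&copy; &#169;")
print(symbols)  # © ©
```

Note that html.escape does not touch ordinary spaces; a non-breaking space still has to be written as &nbsp; or &#160; where it is wanted.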

You can read more about entities here.
