Character Sets and Encoding in HTML

By Colin Ramsden, December 2006.

Four character sets (or families of sets) are commonly used for English-language content on the web: US-ASCII, Windows-1252, the ISO-8859-* family, and UTF-8. This topic explains the differences between them.

The character encoding of an HTML document is declared as part of the document's 'content type' in the 'Head' section. The user agent (e.g. a browser) uses it to determine which symbol to display on the page for each encoded byte sequence.

For example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The characters (letters, numbers and symbols) you see on a page are not stored as characters, but as bytes.
A single byte consists of 8 bits, and can take any one of 256 different values (2x2x2x2x2x2x2x2).
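As a rough illustration of these two points, the Python sketch below counts the values one byte can hold and then reads the same stored byte under several character sets. Python is used here only for illustration, and the variable names are hypothetical.

```python
# Number of distinct values that 8 bits can hold.
values_per_byte = 2 ** 8  # 256

# One stored byte, interpreted under three different character sets.
raw = bytes([0xE9])

as_cp1252 = raw.decode("cp1252")      # Windows-1252: e with acute accent
as_latin1 = raw.decode("iso-8859-1")  # ISO-8859-1: the same character
as_greek = raw.decode("iso-8859-7")   # ISO-8859-7: Greek small letter iota

# In UTF-8 a lone 0xE9 byte is only the start of a longer sequence,
# so on its own it does not decode to any character.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ascii() keeps the output printable on any console.
print(values_per_byte, ascii(as_cp1252), ascii(as_latin1), ascii(as_greek), valid_utf8)
```

The byte itself never changes; only the declared character set decides which symbol it stands for.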

The evolution of character codes

Character codes have evolved since the days of early telegraphy, beginning with Morse code. To represent a character, a series of short dots and long dashes was arranged between pauses, much as written words are separated by spaces on a page. Telegraph operators learnt Morse code and wrote out telegrams by hand as they were received, one character at a time. As you can imagine, this took a long time. Only one telegram could be sent or received at a time, so messages (on paper) were queued at the sender's end of the telegraph line.

The development of the teletype machine, which combined a serial telecommunications device (modem) with an electro-mechanical typewriter, vastly sped up the telegram business. From then on, only typists and messengers were needed to send and receive messages. No knowledge of Morse code was required.

The introduction of teletype (TTY) necessitated character codes for the remote control of the mechanical typewriter, to trigger mechanical actions such as inserting a tab space, returning the carriage, and moving to the next line.

The most commonly used character set was the Baudot code, also known as the International Telegraph Alphabet No. 2 (ITA2) code, which used 5 bits to represent 32 code combinations (2x2x2x2x2). The English alphabet consists of 26 letters (a-z), and the decimal numbering system consists of 10 digits (0-9). So how were these 36 characters, plus punctuation and machine codes, represented with only 32 codes?

International Telegraph Alphabet No 2 (ITA2) hex code table
Note that each code can represent two characters, either a letter or a figure.

The answer is that this system used two special codes to switch between modes, one for letters and one for figures.
This nearly doubled the character code count, providing 58 unique character codes: the 26 letter codes doubled (52), plus six codes common to both modes, namely NUL (0), LF (2), SP (4), CR (8), FIGS (27), and LTRS (31). Among the control characters, LF is the (printer) paper line feed and CR is the (printer) carriage return. BEL (11) sounds a bell on the receiving teletype machine to alert the operator to an incoming message, and ENQ (9) requests the receiving machine to respond.
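The shift mechanism can be sketched in a few lines of Python. This is a minimal sketch, not a full ITA2 implementation: the tables are deliberately partial (a handful of letters and the digits), the code values assume the common hex-chart ordering in which NUL is 0x00, FIGS is 0x1B and LTRS is 0x1F, and the names `decode_ita2`, `LETTERS`, `FIGURES` and `COMMON` are hypothetical.

```python
# Shift codes (assumed hex-chart ordering).
LTRS, FIGS = 0x1F, 0x1B

# Partial tables: the same 5-bit code means a letter or a figure,
# depending on which shift code was received last.
LETTERS = {0x01: "E", 0x06: "I", 0x07: "U", 0x0A: "R", 0x10: "T",
           0x13: "W", 0x14: "H", 0x15: "Y", 0x16: "P", 0x17: "Q", 0x18: "O"}
FIGURES = {0x01: "3", 0x06: "8", 0x07: "7", 0x0A: "4", 0x10: "5",
           0x13: "2", 0x15: "6", 0x16: "0", 0x17: "1", 0x18: "9"}

# Codes that mean the same thing in both modes.
COMMON = {0x00: "", 0x02: "\n", 0x04: " ", 0x08: "\r"}


def decode_ita2(codes):
    """Decode a sequence of 5-bit codes, starting in letters mode."""
    table = LETTERS
    out = []
    for code in codes:
        if code == LTRS:
            table = LETTERS          # switch to letters mode
        elif code == FIGS:
            table = FIGURES          # switch to figures mode
        elif code in COMMON:
            out.append(COMMON[code])
        else:
            out.append(table.get(code, "?"))  # "?" for codes left out of this sketch
    return "".join(out)


# 26 shifted pairs plus 6 codes common to both modes.
unique_codes = 2 * 26 + 6

with_shift = decode_ita2([0x14, 0x06, 0x04, FIGS, 0x0A, 0x13])
without_shift = decode_ita2([0x14, 0x06, 0x04, 0x0A, 0x13])
print(unique_codes, with_shift, without_shift)
```

The two calls differ only by the FIGS code, which is enough to turn the trailing letters into digits.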

Teletype machines were often fitted with punched paper tape writers and readers, similar to the ticker tape used by stock ticker machines for stock market traders at the start of the 20th century. The tape provided an accurate record of messages received, which could then be reread and resent without the need to retype the message.

Teletype machines required a dedicated twisted-pair copper phone cable that was not connected to the public telephone exchange, often a "private" line leased from the telecommunication company.

Teletype machines tended to be large, heavy, and extremely robust. They were capable of running non-stop for months at a time, needed only occasional oiling and cleaning, and had an eventual lifetime of tens of thousands of hours before they were completely worn out.

For all these reasons—accurate message relay, secure communications, and reliable operation—complex networks of teletype machines were established for military and commercial communications which predominated throughout the 20th century until the advent of personal computers and the internet. Message centres had rows of teleprinters and large racks for paper tapes awaiting transmission.

ITA2 is still used in Telecommunications Devices for the Deaf (TDD) and in some amateur radio applications, such as radio-teletype (RTTY).

The American Standard Code for Information Interchange (ASCII) character codes were created to

Back in the days of DOS,

A knowledge of character encoding is useful when formatting the layout of your documents. For example, the use of the emdash as a grammatical element—to set apart a tangential thought or idea from the main thrust of the sentence—instead of the use of commas or brackets, can make your writing better structured and more readily understandable.

However, you need to know how to insert an emdash, and have it display properly online and in print. This is where an awareness of character encoding becomes useful. Depending on variables such as which writing/editing tool you are using and which operating system (OS) you are running, entering an emdash can be as easy as typing its numeric character code. On Windows, the keyboard combination is Alt+0151: hold the Alt key while you enter 0151 in sequence on the numeric keypad (not on the number keys across the top of the keyboard).
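To make the number 0151 less mysterious, the sketch below assumes that the Alt code refers to position 151 in the Windows-1252 character set (the usual case on Western-language Windows systems). It then shows how the same character is stored under other encodings. The variable names are hypothetical.

```python
EM_DASH = "\u2014"  # the emdash as a Unicode character (U+2014)

# Code 151 (hex 97) in Windows-1252 is the emdash.
from_alt_code = bytes([151]).decode("cp1252")

# The same character is stored as different bytes in different encodings.
cp1252_bytes = EM_DASH.encode("cp1252")  # one byte: 97
utf8_bytes = EM_DASH.encode("utf-8")     # three bytes: e2 80 94

# US-ASCII has no emdash at all.
try:
    EM_DASH.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False

# Reading UTF-8 bytes as if they were Windows-1252 garbles the character
# into three unrelated symbols.
garbled = utf8_bytes.decode("cp1252")

print(cp1252_bytes.hex(), utf8_bytes.hex(), in_ascii, ascii(garbled))
```

The last step is the typical failure when the bytes on disk and the declared character set disagree: one emdash turns into three stray symbols.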

That enters the character, but does not guarantee that it will display or print correctly. This is where the page language encoding comes into play. For an XML or HTML document, the meta tag setting carries
