Saturday, January 29, 2005

XML Character Sets

A character set defines a range of characters, where each character can be thought of as a unit of text. The character set itself does not define a way to refer to the characters. Instead, each character is referenced by assigning it a number, called a code point. Any character set in which the characters are mapped to code points is called a coded character set (CCS).
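As a rough illustration of the character-to-code-point mapping, here is a minimal Python sketch. The helper name `code_points` and the sample string are made up for this example; `ord` and `chr` are Python built-ins that convert between a character and its Unicode code point.

```python
# Hypothetical helper: list the code point of each character in a string.
def code_points(text):
    return [ord(ch) for ch in text]

# Sample text (an assumption for illustration): 'A', e-acute, and the euro sign.
sample = "A\u00e9\u20ac"
points = code_points(sample)

# Show each code point in the conventional U+XXXX notation.
print([f"U+{p:04X}" for p in points])

# chr() goes the other way, from code point back to character.
restored = "".join(chr(p) for p in points)
print(restored == sample)
```

Note that the code points are just numbers; nothing here says how they are stored as bytes.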

Character encoding is the representation of characters in binary form. The designated character set for XML documents is Unicode, a CCS whose goal is to provide a unique number for every character, regardless of the platform or the language being represented. The Universal Character Set (UCS) is an ISO standard that encompasses most of the world's writing systems. UCS uses multi-octet characters, which are not compatible with many existing applications and protocols. The UCS Transformation Formats (UTF) standards were developed to overcome this compatibility issue. The two most widely used encoding schemes for Unicode are UTF-8 and UTF-16.
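To make the difference between the two encoding schemes concrete, the following Python sketch prints the bytes produced for a few characters. The helper name `encoded_bytes` is invented for this example, and the big-endian variant of UTF-16 (`utf-16-be`) is chosen here only so that no byte order mark appears in the output.

```python
# Hypothetical helper: encode text and return its bytes as spaced hex pairs.
def encoded_bytes(text, encoding):
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

# Sample characters (an assumption for illustration): 'A', e-acute, euro sign.
for ch in ["A", "\u00e9", "\u20ac"]:
    utf8 = encoded_bytes(ch, "utf-8")
    utf16 = encoded_bytes(ch, "utf-16-be")
    print(f"U+{ord(ch):04X}  UTF-8: {utf8:<10} UTF-16: {utf16}")
```

For these samples, UTF-8 uses one, two, and three bytes respectively, while UTF-16 uses two bytes for each. The single-byte form for "A" is the same as its ASCII byte, which is one reason UTF-8 fits well with older applications and protocols.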
