Saturday, January 29, 2005

Choice and Identification of XML Character Encodings

Author: Dan Chiba, Globalization Specialist, Oracle Corp.
Date: February 2004

When text is stored, transferred or processed, it is essential that the character encoding be known. This principle also applies to XML, because XML is text based.

Despite the fact that XML may be encoded in various ways, there are no guidelines about which encoding to use and how to identify it. Thus, this Technical Note explains best practices for applying character encodings to XML.

Choosing the Character Encoding of an XML Entity

When choosing a character encoding to use, it is necessary to first know which, if any, options are available. Certain XML application scenarios involving standards specifications and the execution environment may dictate the encoding to some extent. In fact, it's possible that given a specific scenario, only one choice will be available.

The term "mandated encoding" refers to the encoding that must be used in such a circumstance. For example, a string datatype of a programming environment may have to be in a predetermined encoding, as in UTF-16 in Java character or the database character set in SQL character datatypes. A unique encoding is often mandated in different ways. Table 1 shows some other examples of such cases.

Table 1. Examples of Mandated Unique Encodings

Unique encoding mandated by a protocol:

* UDDI (UTF-8)
* LDAP (UTF-8)

Unique encoding mandated by an API specification:

* DOM (UTF-16)*
* Win32 wide character

Unique encoding mandated by the execution environment:

* Java (UTF-16)
* C# (UTF-16)
* Database character set


Note: Oracle XDK for C/C++ provides a special mode that allows the DOM tree to be built in an arbitrary single-byte character encoding, and the API works in the specified encoding. This feature is provided for optimization purposes and is considered an exception, as it is intended for scenarios where the data is known to consist of characters in the specified character set. It is highly recommended that you use the xmlinitenc initialization function or specify the data_encoding property on the XML context at all times.

Using a mandated encoding is a best practice because you do not have to rely on in-document or external encoding information--the encoding is always known to the consumer. Thus, the chance of errors is reduced and efficiency may improve.

But even if an encoding is mandated, the XML processor may not "know" that--in which case the application has to make sure that the mandated encoding is used. For example, if a Java application has a DOM tree that must be serialized to an output stream in UTF-8 [RFC-3629], you can ensure this by explicitly specifying UTF-8 where the character output of the Writer is converted to bytes on the OutputStream. The following pseudo code is an example of specifying the output encoding in a Java Servlet:

/* response is an http servlet response object */
response.setCharacterEncoding("UTF-8"); // set the output encoding to UTF-8
PrintWriter w = response.getWriter(); // get the output stream mandated to UTF-8
:
/* doc is an instance of an XML document */
doc.print(w); // the document printed in the specified encoding
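Outside a servlet, the same result can be achieved by wrapping the byte stream in a writer with an explicit character encoding. The following is a minimal sketch, not the only way to do it; the variable names (out for the target byte stream, doc for the XML document) are assumed for illustration:

/* out is an OutputStream mandated to UTF-8; doc is an XML document instance */
Writer w = new OutputStreamWriter(out, "UTF-8"); // characters written to w are encoded as UTF-8
doc.print(new PrintWriter(w)); // serialize the document through the UTF-8 writer
w.flush(); // make sure all encoded bytes reach the stream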

Similarly, if your input must be in UTF-8, your application should be coded to accept input in UTF-8 only. For example, in Java you may want to create an InputSource object and specify the input encoding on it. Alternatively, you may create an InputStreamReader from the input stream, specifying UTF-8 as the input encoding. The following pseudo code shows how to specify the input encoding in Java.

InputSource is = new InputSource(); // create an input source
is.setByteStream(request.getInputStream()); // set the input stream mandated to UTF-8
is.setEncoding("UTF-8"); // set the mandated encoding on the input source
parser.parse(is); // the parser will parse in the specified encoding
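The InputStreamReader alternative mentioned above looks like this (a sketch; request and parser are assumed as before). Because the parser receives characters rather than bytes, the decoding decision is made before parsing begins:

Reader r = new InputStreamReader(request.getInputStream(), "UTF-8"); // decode the byte stream as UTF-8
InputSource is = new InputSource(r); // a character stream input source; no auto-detection occurs
parser.parse(is); // the parser consumes already-decoded characters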

Most string datatypes in popular programming environments dictate the character encoding to a certain extent. Even if multiple choices are available, some constraints usually exist. For example, the character encoding used for the char type in C/C++ must be supported by the standard and Oracle libraries. Use the FORCE_INCODING flag or the input_encoding property to specify the mandated or externally specified input encoding. The following pseudo code demonstrates how to specify a mandated encoding to XDK for C.

// parse an input stream in UTF-8 with DOM
XmlLoadDom(ctx, &err, "stream", in, "input_encoding", "UTF-8", NULL);
// parse an input stream in UTF-8 with SAX
XmlLoadSax(ctx, &err, "stream", in, "input_encoding", "UTF-8", NULL);
// print the document in UTF-8
XmlSaveDom(ctx, &err, doc, "stream", out, "output_encoding", "UTF-8", NULL);

Choosing a Mandated Character Encoding

If an application does not need to support multiple encodings, it can mandate a unique encoding on its own. If one encoding is mandated, it should be UTF-8 or UTF-16 [RFC-2781]--otherwise, interoperability will suffer severely because other encodings may not be supported by the XML processor that consumes the documents. If compatibility with US-ASCII [RFC-20] is desired, or for a serialization format for transfer or storage purpose, UTF-8 is recommended. In other situations, UTF-16 may be appropriate.

Supporting Multiple Encodings

An application with a requirement to support multiple encodings can support any encoding the XML processor supports. All XML processors support UTF-8 and UTF-16. Typically several commonly used native encodings are also supported.

Although Oracle XML processors support all popular encodings and many others, it is advisable to allow multiple encodings only when necessary. An application should not compose an XML document in an encoding other than UTF-8 or UTF-16 unless it is known that the encoding is supported by the consumer and the content can be represented in the encoding. For example, if the database character set is not Unicode, composing an XML document in the database and serving it in the database character set to an unknown audience is discouraged.

To receive input entities in various encodings, the input stream should be read unmodified by the XML processor as a byte stream. Make sure the externally provided encoding information (the charset parameter in the content-type HTTP header, for example) is passed to the XML processor to force the specified encoding when the XML processor parses the input. This should happen as if the specified encoding were the mandated encoding.
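For example, in a servlet the charset parameter that arrived with the request can be handed to the parser as follows (a sketch with assumed variable names; if no charset was supplied, the parser falls back to auto-detection):

InputSource is = new InputSource(request.getInputStream()); // pass the byte stream through unmodified
String charset = request.getCharacterEncoding(); // charset parameter of the Content-Type header, if any
if (charset != null)
    is.setEncoding(charset); // force the externally specified encoding
parser.parse(is); // otherwise the parser auto-detects from the BOM or encoding declaration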

To produce output in an arbitrary encoding, make sure the entity is accompanied by character encoding information either via external tagging, such as the charset parameter on the HTTP header and the character set property in a repository such as Oracle Files or Oracle XML DB, or via embedded tagging (namely, the encoding declaration).
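For instance, when serving XML over HTTP, the charset parameter can be set on the response Content-Type header so the consumer receives the encoding information externally. A sketch, reusing the servlet variables from the earlier example:

response.setContentType("text/xml; charset=ISO-8859-1"); // external tagging via the HTTP header
PrintWriter w = response.getWriter(); // the writer encodes output in the declared charset
doc.print(w); // the bytes on the wire match the externally declared charset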

External tagging is preferred over the internal variety because it is more reliable and easier to handle; it is sensible to do without internal tagging wherever possible. In fact, as discussed previously, internal tagging is generally not required because the encoding is usually known.

Often the declared encoding and the actual encoding disagree because of required character set conversion. For example, if you insert a document with the encoding declaration into a database column of type CLOB or read it through a Java character stream, the declaration will not magically change to the actual value. This situation is easily avoided by maintaining the correct encoding using a higher protocol such as the NLS_LANG setting and Java character datatype. (Oracle's XMLType datatype addresses anticipated scenarios to handle various character encodings.)
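One way to keep the declaration and the actual encoding in agreement is to let the serializer write the declaration at output time. A minimal JAXP sketch (an identity transform; doc and out are assumed to be a parsed DOM document and an output byte stream):

Transformer t = TransformerFactory.newInstance().newTransformer(); // identity transformation
t.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // the serializer encodes and declares UTF-8 together
t.transform(new DOMSource(doc), new StreamResult(out)); // the emitted declaration matches the actual bytes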

External vs. Internal Encoding Information

Sources of character encoding information can be put into two categories: external or internal. This section discusses the important contrast between the two.

External Encoding Information

Figure 1 depicts how the parser determines character encoding from the content-type header of HTTP, an external source of encoding information. Note that internal encoding information is not used.


When input comes over HTTP or in any other form whose encoding is externally recognizable, the application should do one of the following things:

* pass the value of the charset parameter on to the parser
* convert the input stream to Unicode based on the charset parameter
* instruct the parser to parse the URI
* use a datatype that meets the encoding requirement

Internal Encoding Information

Figure 2 shows how the parser determines character encoding by auto-detection. Note that the document must have a correct Byte Order Mark (BOM) and/or encoding declaration.


Auto-detection may be used in scenarios where external encoding information is not available. For example:

* XML is stored as a file on a file system.
* The sender did not provide external encoding information and no encoding is mandated.

Details of Sources of Internal Character Encoding Information

Now that we have discussed which encoding to use and how it is specified, transferred, and determined at runtime, let's explore the byte order mark and the encoding declaration, the devices for auto-detection of character encoding.

Byte Order Mark (BOM)

BOM (the Unicode character U+FEFF, ZERO WIDTH NO-BREAK SPACE) may appear at the beginning of an XML entity. In XML, BOM is not only used to indicate the byte order of the input text stream but also as a hint to help detect character encoding. XML processors typically examine the first few bytes of the input stream to figure out if the encoding is UTF-16 or ASCII based, so they can read the encoding declaration that may be present in the XML header. Entities in UTF-16 are required to have BOM for auto-detection to work, because the UTF-16 encoding has two forms: UTF-16BE (big endian: fe ff) and UTF-16LE (little endian: ff fe). Entities in UTF-8 may have BOM (ef bb bf), although there is no byte order issue. BOM is not part of the document, so it cannot be read from user code. Usually XML processors add or remove BOM as necessary.
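The detection step can be illustrated with a few lines of Java (a simplified sketch using java.io.PushbackInputStream; real processors also recognize other byte patterns, such as the first characters of the XML declaration, and support more encodings). The variable rawIn is an assumed, undecoded input byte stream:

PushbackInputStream in = new PushbackInputStream(rawIn, 3); // rawIn is the undecoded byte stream
byte[] b = new byte[3];
int n = in.read(b);
String detected = null; // null means no BOM; fall back to other detection or assume UTF-8
if (n >= 2 && (b[0] & 0xff) == 0xfe && (b[1] & 0xff) == 0xff) detected = "UTF-16BE"; // BOM fe ff
else if (n >= 2 && (b[0] & 0xff) == 0xff && (b[1] & 0xff) == 0xfe) detected = "UTF-16LE"; // BOM ff fe
else if (n == 3 && (b[0] & 0xff) == 0xef && (b[1] & 0xff) == 0xbb && (b[2] & 0xff) == 0xbf) detected = "UTF-8"; // BOM ef bb bf
if (n > 0) in.unread(b, 0, n); // push the bytes back so the parser reads the entity from the start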

The Encoding Declaration

The encoding declaration is one of the parameters in the XML header (which is specifically called the XML or Text Declaration), e.g.:

<?xml version="1.0" encoding="ISO-8859-1"?>

The encoding declaration was introduced to provide character encoding information for entities that are in an encoding other than UTF-8 or UTF-16 and are parsed in the absence of external character encoding information. It is a common misconception that an XML entity must have an encoding declaration if it uses an encoding other than UTF-8 or UTF-16. In fact, the encoding declaration is useless if the encoding is known, which is often the case. An XML parser provides a way to specify the character encoding of each input entity. If the parser knows the encoding of an entity, the value in the declaration is insignificant. Usually the value is ignored if encoding information is provided externally as a parameter passed to the parser. In other cases it is not necessary to have an encoding identifier at all. For example, if the entity comes in as a Java character stream through a java.io.Reader, encoding auto-detection will not take place, because the input datatype mandates the character encoding. If and only if the parser does not know the encoding does it read the encoding declaration to figure out the encoding.
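The following sketch illustrates the point: because the input arrives as a character stream, the encoding is already decided and the declared value plays no part in decoding (the document string and the parser variable are assumed for illustration):

String xml = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?><doc>text</doc>"; // declaration claims Shift_JIS
InputSource is = new InputSource(new StringReader(xml)); // but the data is already Java characters
parser.parse(is); // parses normally; the declared encoding is not used for decoding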

It's a good practice to use BOM and the encoding declaration to determine the character encoding if an XML entity is stored as a file, because a file system does not have a mechanism to identify the character encoding of stored files. Oracle Files, Content Management SDK, and XML DB Repository are exceptions, because you can maintain the encoding information of stored files as a property.

Sidebar: Character Set vs. Encoded Character Set

Understanding the distinction between character set and character encoding is very important. In the context of computing, character set usually refers to a set of characters in which a unique numeric value is assigned to each character. Unicode is a character set that enables worldwide interchange of text information by supporting the diverse languages from around the world. The numeric value assigned to each Unicode character is called its "code point," which is then encoded in a specific character encoding such as UTF-8 and UTF-16. In XML, Unicode characters may be represented by mapping a Unicode code point to an encoded character in a non-Unicode character encoding. This concept is referred to as the Document Character Set.
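For example, the same Unicode code point has different byte representations under different encodings while remaining the same character. A small illustrative Java sketch:

String s = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, Unicode code point U+00E9
byte[] utf8 = s.getBytes("UTF-8"); // c3 a9 : two bytes in UTF-8
byte[] utf16 = s.getBytes("UTF-16BE"); // 00 e9 : two bytes in UTF-16 (big endian)
byte[] latin1 = s.getBytes("ISO-8859-1"); // e9 : a single byte in ISO-8859-1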

The XML specification and related standards define that external encoding information takes precedence over the embedded declaration. Therefore it is likely that the internal declaration will simply be ignored. One of the key motivations for giving this weak precedence to the encoding declaration is that it becomes incorrect when a character set conversion occurs. Character set conversion is usually inevitable when data travels from one environment to another, and it is impractical to keep the declaration correct through every conversion.

The encoding declaration does not indicate the character set of the entity. It merely indicates the physical encoding of the document character set (see sidebar), which is always the universal character set defined by ISO/IEC 10646 and Unicode. Thus, a document may contain any valid character from the universal character set regardless of the physical encoding.

For example, a document with the prologue shown above may still contain characters not available in the Latin-1 (ISO/IEC 8859-1) character set, by using an entity reference or an external entity, which may be in an encoding different from that of the referencing Latin-1 entity.

Figure 3 illustrates an XML document comprising several entities in different character encodings.


Enabling auto-detection of character encoding in the absence of external character encoding information is the sole purpose of the encoding declaration. That is, the value is consumed by the parser, and in principle XML applications should not be interested in it. However, a way to read the value of an encoding declaration may be provided so you can serialize an entity in the original encoding. An XSL processor typically retains the original encoding when the document is serialized after processing, although usually an arbitrary output encoding may be specified on the API or in the stylesheet to override this behavior.

The valid values of an encoding declaration are defined in the IANA character set registry. However, it is recommended to use UTF-8 or UTF-16, in which case the encoding declaration is not necessary and all processors are guaranteed to support the encoding. If you elect to use a non-Unicode encoding, you should make sure that all processors that will consume your documents support that encoding.

Summary

Recommended practice can be summarized as follows:

* Often a protocol, format, or API specification will mandate a specific encoding.
* If not, consider mandating UTF-8 or UTF-16.
* Specify the encoding information externally unless there is no means to do so.
* Correctly specify BOM and/or the encoding declaration when using auto-detection.
