Character Set Types
Windows 2000 supports a range of character sets, including ASCII, OEM 8-bit, ANSI, DBCS, and Unicode/ISO 10646. Character sets are formats used to compose code pages. For example, code page 1252 refers to a set of characters known as Latin I. For a complete description of all code pages supported by Windows 2000, go to the <<Microsoft Character Code Reference>> link on the Web Resources page at http://windows.microsoft.com/windows2000/reskit/webresources.
ASCII
American Standard Code for Information Interchange (ASCII) is a 7-bit character set providing 128 characters. ASCII covers upper- and lowercase English letters, American English punctuation, the base-10 digits, a few control characters, and little else. Note that ASCII is the common denominator contained in all the other common character sets, making it the only means of interchanging data across all major languages without risk of character mapping loss.
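The "common denominator" property can be checked with a short Python sketch. The code pages listed are assumptions picked for illustration (one OEM, two ANSI, and one DBCS code page); the sample text is arbitrary.

```python
# Illustrative selection of code pages, not a complete list.
CODE_PAGES = ["ascii", "cp437", "cp1252", "cp1251", "cp932"]

text = "Hello, World 123"

# Encode the same ASCII-only text under every code page.
encodings = {cp: text.encode(cp) for cp in CODE_PAGES}

# Every code page should produce exactly the same byte sequence.
all_identical = len(set(encodings.values())) == 1

# Every byte fits in 7 bits, that is, it is below 0x80.
fits_in_7_bits = all(b < 0x80 for b in encodings["ascii"])

print(all_identical, fits_in_7_bits)
```

Text restricted to these 128 characters therefore survives a transfer between any of the code pages above without remapping.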
OEM 8-bit
In the past, separate Original Equipment Manufacturer (OEM) code pages were created so that text-based computers could display and print line-drawing characters. These character sets are still used today for direct FAT access, and for accessing data files created by Microsoft® MS-DOS-based applications. OEM code pages typically have a three-digit label, such as CP 437 for American English.
Because each hardware manufacturer was free to set its own character standards, characters can be scrambled or lost even within the same language if two OEM code pages assign different characters to the same code point.
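A minimal Python sketch of how a character gets scrambled between two OEM code pages follows. The choice of code pages 437 and 850 and of the cent sign is an assumption made for illustration; the source names only CP 437.

```python
# One raw byte, interpreted under two different OEM code pages.
raw = b"\x9b"
as_437 = raw.decode("cp437")
as_850 = raw.decode("cp850")

# Text written under CP 437 and then read back under CP 850.
original = "\u00a2"                 # cent sign
stored = original.encode("cp437")   # bytes as an MS-DOS program would store them
misread = stored.decode("cp850")    # same bytes, wrong code page

print(repr(as_437), repr(as_850), misread == original)
```

The bytes on disk never change; only the table used to interpret them does, which is why the code page of legacy data files has to be known before they are read.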
ANSI
Windows ANSI (American National Standards Institute) character sets support international characters and publishing symbols. An assortment of 256-character Windows ANSI character sets covers all the 8-bit languages supported by Windows. Each Windows ANSI character set is composed of a lower 128 characters and an upper 128 characters. The lower 128 characters are identical to ASCII, and the upper 128 characters are different for each ANSI character set. The upper 128 contains the distinct international characters for each code page.
The European Union includes languages with more characters than a single standard code page can support, even though one code page was originally intended to cover all European Union font needs. Switching entirely to Unicode allows coverage of all EU languages in one character set, but the conversion is not automatic, and requires every text-related algorithm to be inspected and perhaps rewritten. As an interim solution, multiple code pages are provided for European character set needs.
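The split between a shared lower half and a code-page-specific upper half can be illustrated in Python. The three code pages used here (1252, 1251, and 1253) and the sample byte are assumptions chosen for the example.

```python
ANSI_PAGES = ["cp1252", "cp1251", "cp1253"]  # Latin I, Cyrillic, Greek

# Lower 128 values: every code page decodes them to the same ASCII text.
lower_half = bytes(range(128))
lower_decoded = {cp: lower_half.decode(cp) for cp in ANSI_PAGES}
lower_identical = len(set(lower_decoded.values())) == 1

# Upper 128 values: the same byte maps to a different character per code page.
sample = b"\xe9"
upper_decoded = {cp: sample.decode(cp) for cp in ANSI_PAGES}

print(lower_identical)
for cp, ch in upper_decoded.items():
    print(cp, hex(ord(ch)))
```

Byte 0xE9 decodes to a Latin letter under code page 1252, a Cyrillic letter under 1251, and a Greek letter under 1253, so no single 8-bit code page can hold text that mixes all three scripts.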
DBCS
Double-byte character set (DBCS) is actually a multibyte encoding system that uses a mix of 8-bit and 16-bit characters, allowing for a wider range of characters. For example, modern writing systems used in the Far East region might require a minimum of 15,000 characters, which DBCS can accommodate. Representing characters with two bytes increases the number of possible values from 256 to 65,536, although in practice some values are reserved for special purposes, such as indicating leading bytes or trailing bytes.
Far East editions of Windows, including Windows 95, Windows 98, and Windows NT, support several DBCS character sets. A leading byte indicates that the following byte is the trailing byte of a 16-bit character unit, rather than the start of the next character. There are multiple DBCS code pages, each of which has a different leading byte and trailing byte range.
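The leading-byte rule can be sketched in Python as a scanner that walks a byte string and groups bytes into characters. The example assumes code page 932 (Japanese) with leading-byte ranges 0x81 to 0x9F and 0xE0 to 0xFC; the helper names are hypothetical, and other DBCS code pages would need their own ranges.

```python
# Assumed leading-byte ranges for code page 932; other DBCS pages differ.
CP932_LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]

def is_lead_byte(value, ranges=CP932_LEAD_RANGES):
    """Return True if the byte value starts a two-byte character."""
    return any(low <= value <= high for low, high in ranges)

def split_characters(data, ranges=CP932_LEAD_RANGES):
    """Split encoded bytes into one-byte and two-byte character units."""
    units = []
    i = 0
    while i < len(data):
        # A leading byte consumes the next byte as its trailing byte.
        width = 2 if is_lead_byte(data[i], ranges) and i + 1 < len(data) else 1
        units.append(data[i:i + width])
        i += width
    return units

encoded = "A\u65e5\u672cB".encode("cp932")   # "A", two kanji, "B"
units = split_characters(encoded)
decoded = [unit.decode("cp932") for unit in units]

print([len(unit) for unit in units])
```

Because a trailing byte can fall in the ASCII range, a scanner has to start from a known character boundary; inspecting an arbitrary byte in isolation cannot tell whether it is a character or the second half of one.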
Unicode
Unicode is a 16-bit character set that contains all of the characters commonly used in information processing, including Latin, Greek, Cyrillic, Indic, Thai, Kana, and Hangul characters, punctuation marks, and ideographs. Unicode is a standard supported by members of the Unicode Consortium. Unicode is not a technology in itself, and does not by itself solve international engineering issues.
Unicode is language-independent, which helps conserve space in the character map. Characters are not assigned to specific languages; for example, "a" can be used in French, German, or English. Similarly, a particular Han ideograph might map to a character used in Chinese, Japanese, or Korean. Unicode text may not appear correct to readers of a particular language because characters or ideographs are abstracted. To solve this issue, use a font that recreates a language's particular representation of the character, rather than seeking an alternate Unicode character.
Although the majority of the Unicode character space is used, approximately a third of the 64,000 possible code points are still unassigned, allowing for additional characters in the future, and for private use and compatibility issues.
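As a rough illustration of the 16-bit model and of language independence, the Python sketch below encodes one sample character from several of the scripts listed above. The sample characters, the sample words, and the use of UTF-16 little-endian as the 16-bit storage form are assumptions; the source does not name an encoding form.

```python
# One sample character per script (illustrative choices).
samples = {
    "Latin": "a",
    "Greek": "\u03a9",
    "Cyrillic": "\u0416",
    "Thai": "\u0e01",
    "Kana": "\u304b",
    "Hangul": "\ud55c",
}

# Each sample occupies one 16-bit unit, that is, two bytes.
widths = {name: len(ch.encode("utf-16-le")) for name, ch in samples.items()}

# The letter "a" is a single code point whatever the language of the word.
words = {"French": "ami", "German": "Tag", "English": "cat"}
code_points = {lang: ord(word[word.index("a")]) for lang, word in words.items()}

print(widths)
print(code_points)
```

The same code point serves every language that uses the character, which is why language-specific appearance is left to the font rather than to separate code points.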