Unicode
From Wikipedia, the free encyclopedia
For the 1889 Universal Telegraphic Phrase-book, see Commercial code (communications).
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. Developed in conjunction with the Universal Character Set standard and published in book form as The Unicode Standard, the latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts. The standard consists of a set of code charts for visual reference, an encoding methodology and set of standard character encodings, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).[1] As of September 2012, the most recent version is Unicode 6.2. The standard is maintained by the Unicode Consortium.Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, the Java programming language, and the Microsoft .NET Framework.
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using two 16-bit units (4 × 8 bit) to handle each of the additional characters.
History[edit]
The origins of Unicode date to 1987, when Joe Becker from Xerox and Lee Collins and Mark Davis from Apple started investigating the practicalities of creating a universal character set.[2] In August 1988, Joe Becker published a draft proposal for an "international/multilingual text character encoding system, tentatively called Unicode". Although the term "Unicode" had previously been used for other purposes, such as the name of a programming language developed for the UNIVAC in the late 1950s,[3] and most notably a universal telegraphic phrase-book that was first published in 1889,[4] Becker may not have been aware of these earlier usages, and he explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding".[5]In this document, entitled Unicode 88, Becker outlined a 16-bit character model:[5]
Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded:[5]
Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 214 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of RLG, and Glenn Wright of Sun Microsystems, and in 1990 Michel Suignard and Asmus Freytag from Microsoft and Rick McGowan of NeXT joined the group. By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready.
The Unicode Consortium was incorporated on January 3, 1991, in California, and in October 1991, the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992.
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g. Egyptian Hieroglyphs) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Among the characters not originally intended for Unicode are rarely-used Kanji or Chinese characters, many of which are part of personal and place names, making them rarely used, but much more essential than envisioned in the original architecture of Unicode.[6]
Architecture and terminology[edit]
Unicode defines a codespace of 1,114,112 code points in the range 0hex to 10FFFFhex.[7] Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g. U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD). Older versions of the standard used similar notations but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits to indicate a code point, and allowed "U+" to be used only with exactly four digits to indicate a code unit, such as a single byte of a multibyte UTF-8 encoding of a code point.[8]
No comments:
Post a Comment