■ 文字コード means the numerical codes that are used to represent written characters in computers and telecommunications. There are code systems for many writing systems, with several different systems used for Japanese alone.
The main code for Japanese characters is laid down in the JIS X 0208-1997 standard. It has the kana, 6,355 kanji, a lot of symbols, and the more-or-less complete Latin (known as JIS-ASCII), Greek and Cyrillic alphabets.
Let's take as an example the very first kanji in JIS X 0208, which is 亜. In the standard it is defined by a pair of decimal numbers which place it in the 94 x 94 matrix used in that standard. This is known as the 区点(くてん) code of the kanji:
Kuten for 亜: 16-01
The raw "JIS" code is formed by adding 32 (hexadecimal 20) to each number of the kuten pair. This moves the code up into the "printable ASCII" range, where it won't upset software by pretending to be escapes, line-feeds, etc.
(Raw) JIS for 亜: 3021 (hex) or 0! (ASCII)
The EUC (originally Extended Unix Code) coding is formed by turning on bit 8 (MSB) of each byte of the raw JIS code:
EUC for 亜: B0A1
The Shift-JIS coding is formed by putting the 14 bits of the raw JIS code through a transformation to make a pair of bytes. The most significant bit (MSB) of the first byte is always set, and the second byte always lies in or above the printable ASCII range.
Shift-JIS for 亜: 889F
For the sake of completeness, the JIS/ISO-2022-JP coding wraps the "raw" JIS in escape sequences:
ISO-2022-JP for 亜: ^[ $ B 0 ! ^[ ( B
Here ^[ stands for the escape character (hexadecimal 1B). Note that as many raw JIS codes as you like can be in the wrapper, although the RFC for Japanese in email limits it to 72.
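The chain of conversions described above, from kuten to raw JIS to EUC, Shift-JIS and ISO-2022-JP, can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the Shift-JIS arithmetic is the commonly used formula for the transformation, which the text above does not spell out.

```python
ESC = b"\x1b"  # the escape character, written ^[ in the text


def kuten_to_jis(ku: int, ten: int) -> bytes:
    """Add 32 (hex 20) to each number of the kuten pair."""
    return bytes([ku + 0x20, ten + 0x20])


def jis_to_euc(jis: bytes) -> bytes:
    """Turn on the most significant bit of each raw JIS byte."""
    return bytes(b | 0x80 for b in jis)


def jis_to_sjis(jis: bytes) -> bytes:
    """Commonly used raw JIS to Shift-JIS transformation (an assumption here)."""
    j1, j2 = jis
    s1 = ((j1 + 1) >> 1) + 0x70
    if s1 > 0x9F:          # skip the range 0xA0-0xDF
        s1 += 0x40
    if j1 & 1:             # odd first byte: second byte lands in 0x40-0x9E
        s2 = j2 + 0x1F
        if s2 >= 0x7F:     # 0x7F is never used as a second byte
            s2 += 1
    else:                  # even first byte: second byte lands in 0x9F-0xFC
        s2 = j2 + 0x7E
    return bytes([s1, s2])


def jis_to_iso2022jp(jis: bytes) -> bytes:
    """Wrap raw JIS bytes in the escape sequences shown above."""
    return ESC + b"$B" + jis + ESC + b"(B"


jis = kuten_to_jis(16, 1)            # 亜 is kuten 16-01
print(jis.hex(), jis.decode("ascii"))
print(jis_to_euc(jis).hex())
print(jis_to_sjis(jis).hex())
print(jis_to_iso2022jp(jis))
```

As a cross-check, the results for 亜 can be compared with Python's built-in `euc_jp`, `shift_jis` and `iso2022_jp` codecs, for example `"亜".encode("shift_jis")`.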
In a Web page, a META tag such as the following both tells the browser which coding the page uses and, more importantly, tells it how to code Japanese text typed into the input fields of a form before sending it to a server:
<META HTTP-EQUIV='Content-Type' CONTENT='text/html;CHARSET=x-sjis'>
■ 文字コード have been the subject of some controversy in Japan. The
original version of JIS X 0208 (JIS C 6226-1978) was compiled using several
industrial "standards", and included almost every kanji used in the
official names of towns. In this the standards committee made some blunders,
misreading several hand-written kanji and thus inventing several kanji
such as 墸, which cannot be found in Morohashi.
Soon after the establishment of the original JIS character set, objections were
raised
to the omission or abbreviation of some fairly common characters. One often
cited in objections is 鴎(かもめ), which appears in the name of the
author 森鴎外(もりおうがい). Traditionally, this character has been
written 鷗 (shown in the original page as a GIF image rather than by a
character code).
The JIS committee got into more hot water with the 1983 revision of the
standard, when they replaced several kanji with simplified forms
hitherto unknown in Japan. Various fiddles were done at this time to
handle the expansion of the 常用漢字 and the 人名用漢字.
In 1990, in order to address the problem of missing kanji, a further standard, JIS X 0212-1990, described as the 補助漢字 (supplementary kanji), was introduced. This added a further 5,801 kanji, plus some other characters missing from JIS X 0208, such as odd-ball kana moras and Latin characters with diacritics. While JIS X 0212 went a long way towards answering complaints about missing kanji, it has been a total non-event. Since there is no room in Shift-JIS to carry these extra characters, virtually no-one has actually implemented the standard (you can see them using WWWJDIC). The traditional かもめ kanji mentioned above is in JIS X 0212.
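The difference between the codings can be seen with a small Python sketch. It assumes Python's built-in `euc_jp` and `shift_jis` codecs, and it relies on the fact that EUC-JP, unlike Shift-JIS, reserves a three-byte form (starting with hex 8F) for JIS X 0212. The variable names are only illustrative.

```python
# U+9DD7 is the traditional form of the かもめ kanji, which is in
# JIS X 0212 but not in JIS X 0208.
kamome = "\u9dd7"

# EUC-JP carries JIS X 0212 characters as three bytes: 0x8F, then the
# two JIS X 0212 bytes with their most significant bits set.
euc = kamome.encode("euc_jp")
print(euc.hex())

# Shift-JIS has no room for the character, so encoding fails.
try:
    kamome.encode("shift_jis")
    in_sjis = True
except UnicodeEncodeError:
    in_sjis = False
print(in_sjis)
```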
The early 1990s saw the emergence of the Unicode and ISO-10646 standards, which attempt to pull all the national code-sets into a single compatible set. A major part of these has been the "Han Unification", in which the kanji/hanzi/hanja from Japan, the PRC, Taiwan and Korea were merged into a single block of about 21,000 characters. The unification rules were quite strict, and thus every distinct kanji in JIS X 0208 and 0212 was included, even the cases such as 劒 and 劔 which are clearly orthographical variants. Kanji not in those standards, and not in the Chinese sets such as Big5, missed out totally.
The controversy over codes heated up in the late 1990s when a group of
writers and
critics headed by the late 江藤淳(えとうじゅん) made statements and held
symposia to protest the limited number of kanji available on personal
computers and to decry the Unicode standard, which has been seen by
some Japanese to be dominated by Chinese or American interests.
Their sometimes xenophobic arguments met
sharp rebuttals from other critics, particularly those with experience in
typesetting or desktop publishing.
Some groups in Japan and China unhappy with current coding systems have
built operating systems using alternative character sets such as the
huge CCCII (Chinese Character Code for Information Interchange) that seeks
to remedy the
perceived deficiencies in the current standards, but there has been little
sign that the new systems will be widely adopted. The controversy seems
likely to be settled by the adoption of Unicode as the de facto standard,
especially since Microsoft is publicly committed to its use on all
platforms.
This entry was created by Jim Breen. The original version of the first part of the document is available here. The section on the 文字コード controversy was created by Tom Gally and extensively amended by Jim Breen.
Created 2000-08-15.