Orthographic Variation in Chinese


Jack Halpern
CEO
The CJK Dictionary Institute, Inc.
株式会社日中韓辭典研究所



Index to This Document
  1. Introduction
  2. Simplified and Traditional Chinese
  3. Traditional Chinese Variants
  4. Synonym and Cross-Language Expansion
  5. Lexical Databases and Conversion Software
  6. Documents for Reference

1. Introduction

This report provides an overview of the linguistic and orthographic issues related to the processing of Chinese texts, especially webpages. It aims to enable software developers to acquire a basic knowledge of the complexities of the Chinese script, especially orthographic variation, from the point of view of building linguistic tools such as intelligent information retrieval tools, input method editors, and machine translation systems.

Abbreviations used in this document
SC    Simplified Chinese
TC    Traditional Chinese
STC   "Simplified" Traditional Chinese (explained below)
TTC   "Traditional" Traditional Chinese (explained below)
C2C   Chinese to Chinese (SC-to/from-TC) conversion

The complexity of the Chinese writing system is well known. Some of the linguistic factors that contribute to this include the large number of characters in common use, the complexity of the character forms, the major differences between Traditional Chinese and Simplified Chinese along various dimensions (orthography, phonology, semantics), the presence of numerous orthographic variants in Traditional Chinese, and others.

From an information processing point of view (which is beyond the scope of this report), there are complex issues such as the use of multiple character sets, multiple encodings, the incompatibility between character sets, a plethora of input methods, and more.

Of crucial importance to text processing is the fact that there is a significant amount of orthographic variation. The existence of numerous orthographic variants, especially in Traditional Chinese, and the high degree of sophistication required for performing accurate conversion between SC and TC, pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways.

This report focuses on the major types of orthographic variation in Chinese, and provides a brief analysis of the linguistic issues to be considered by software developers.


2. Simplified and Traditional Chinese

As a result of the large-scale language reforms undertaken in the PRC in the postwar period, thousands of character forms underwent drastic simplification. Chinese written in these simplified forms is called Simplified Chinese (SC). Taiwan, Hong Kong, and most overseas Chinese communities continue to use the old, complex forms, referred to as Traditional Chinese (TC).

The process of automatically converting SC to/from TC, referred to as C2C conversion, is full of complexities and pitfalls. The conversion can be implemented on three levels of increasing sophistication, briefly described below. For more details, see the author's The Pitfalls and Complexities of Chinese to Chinese Conversion.

2.1 Code Conversion

The easiest, but most unreliable, way to perform C2C conversion is on a codepoint-to-codepoint basis by looking the source up in a mapping table, such as the one shown below. This is referred to as Code Conversion or transcoding.


Table 1
Code Conversion
SC | TC1 | TC2 | TC3 | TC4 | Remarks
   one-to-one
   one-to-one
  one-to-many
  one-to-many
one-to-many

Because of the numerous one-to-many ambiguities (which occur in both the SC-to-TC and the TC-to-SC directions), the rate of conversion failure is significant. Without time-consuming human proofreading, the results of code conversion are unacceptable. For more details, see Code Conversion.
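The ambiguity of code conversion can be illustrated with a minimal sketch. The table `SC_TO_TC_CHARS` and the function `code_convert_candidates` are hypothetical names introduced only for this illustration, and the handful of character mappings is a tiny hand-picked sample, not a real conversion table. The sketch enumerates every TC string that a character-by-character lookup permits, which shows why a single one-to-many character is enough to make the result uncertain.

```python
from itertools import product

# Illustrative character-level SC-to-TC table (hypothetical, tiny sample).
# A real table covers thousands of characters.
SC_TO_TC_CHARS = {
    "电": ["電"],
    "话": ["話"],
    "出": ["出", "齣"],   # one-to-many
    "发": ["發", "髮"],   # one-to-many
}

def code_convert_candidates(sc_text):
    """Return every TC string allowed by character-by-character lookup."""
    # Characters missing from the table are passed through unchanged.
    options = [SC_TO_TC_CHARS.get(ch, [ch]) for ch in sc_text]
    return ["".join(combo) for combo in product(*options)]

print(code_convert_candidates("电话"))  # a single candidate
print(code_convert_candidates("出发"))  # four candidates, only one correct
```

With only two ambiguous characters the lookup already yields four candidates for 出发; code conversion alone has no way of knowing that 出發 is the intended form.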

2.2 Orthographic Conversion

The next level of sophistication in C2C conversion is referred to as orthographic conversion, because the items being converted are orthographic units, rather than mere codepoints in a character set. That is, they are meaningful linguistic units such as single-character free words, bound morphemes (such as affixes), and multi-character compound words. As shown in Table 1 above, code conversion is ambiguous because of the numerous one-to-many mappings. Successful C2C conversion depends on the context and requires orthographic mapping tables on the word level, as shown below.


Table 2
Orthographic Conversion
English | SC | TC1 | TC2 | Incorrect Candidates | Comments
telephone | 电话 | 電話 | | | unambiguous
we | 我们 | 我們 | | | unambiguous
start-off | 出发 | 出發 | | 出髮, 齣髮, 齣發 | one-to-many
dry | 干燥 | 乾燥 | | 干燥, 幹燥, 榦燥 | one-to-many
 | 阴干 | 陰乾 | 陰干 | | depends on context


As can be seen, the ambiguities inherent in code conversion are resolved by using an orthographic mapping table, which avoids false conversions such as those shown in the Incorrect Candidates column above. It is important to note that such conversion must be done with the aid of a Chinese Morphological Analyzer (CMA) that can segment the text stream into meaningful units (such as lexemes). For more details, see Orthographic Conversion.
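A word-level lookup can be sketched as follows. This is only an approximation under stated assumptions: greedy longest-match segmentation stands in for a real Chinese Morphological Analyzer, which it cannot replace, and `ORTHOGRAPHIC_MAP` and `orthographic_convert` are hypothetical names holding a few entries from Table 2.

```python
# Word-level (orthographic) SC-to-TC table; entries taken from Table 2.
# Hypothetical, tiny sample for illustration only.
ORTHOGRAPHIC_MAP = {
    "电话": "電話",
    "我们": "我們",
    "出发": "出發",
    "干燥": "乾燥",
}
MAX_WORD_LEN = max(len(word) for word in ORTHOGRAPHIC_MAP)

def orthographic_convert(sc_text):
    """Convert SC to TC word by word using greedy longest-match lookup.

    Greedy matching is a crude stand-in for a morphological analyzer;
    it will mis-segment real text in cases a proper CMA handles.
    """
    out = []
    i = 0
    while i < len(sc_text):
        # Try the longest possible word first, then shorter ones.
        for length in range(min(MAX_WORD_LEN, len(sc_text) - i), 0, -1):
            chunk = sc_text[i:i + length]
            if chunk in ORTHOGRAPHIC_MAP:
                out.append(ORTHOGRAPHIC_MAP[chunk])
                i += length
                break
        else:
            # No word matched: pass the character through unchanged.
            out.append(sc_text[i])
            i += 1
    return "".join(out)

print(orthographic_convert("我们出发"))
```

Because 出发 is looked up as a unit, the incorrect candidates produced by character-level lookup never arise. A production system would fall back to a character-level table, rather than pass-through, for text not covered by the word-level table.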

2.3 Lexemic Conversion

A more sophisticated, and far more challenging, approach to C2C conversion is called lexemic conversion, which maps SC and TC words that are semantically, not just orthographically, equivalent. For example, the SC word 信息 (xìnxī) 'information' is converted to the semantically equivalent TC 資訊 (zīxùn). This is similar to the difference between lorry in British English and truck in American English.

There are many lexemic differences between SC and TC words, especially in technical terms and proper nouns. To complicate matters, the correct TC is sometimes locale-dependent, as is shown in the table below.


Table 4
Lexemic Conversion
English | SC | Taiwan TC | Hong Kong TC | Other TC | Incorrect (orthographic) | Comments
Software | 软件 | 軟體 | 軟件 | | 軟件 | lexemic
Taxi | 出租汽车 | 計程車 | 的士 | 德士 | 出租汽車 | lexemic
Kennedy | 肯尼迪 | 甘迺迪 | 堅尼地 | | 肯尼迪 | lexemic proper noun
Oahu | 瓦胡岛 | 歐胡島 | | | 瓦胡島 | lexemic proper noun

Studying the above table carefully should make most of the issues clear. For a detailed discussion and more examples, see Lexemic Conversion.
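Locale-dependent lexemic lookup might be sketched as below. The locale keys "tw" and "hk", the table `LEXEMIC_MAP`, and the function `lexemic_convert` are assumptions made for this illustration; the entries are taken from Table 4.

```python
# Lexemic SC-to-TC table keyed by target locale; entries from Table 4.
# The locale codes "tw" and "hk" are hypothetical labels.
LEXEMIC_MAP = {
    "软件": {"tw": "軟體", "hk": "軟件"},
    "出租汽车": {"tw": "計程車", "hk": "的士"},
    "肯尼迪": {"tw": "甘迺迪", "hk": "堅尼地"},
}

def lexemic_convert(sc_word, locale):
    """Return the lexemic TC equivalent for a locale, or None if unknown.

    A None result signals that the caller should fall back to
    orthographic conversion for this word.
    """
    entry = LEXEMIC_MAP.get(sc_word)
    if entry is None:
        return None
    return entry.get(locale)

print(lexemic_convert("出租汽车", "tw"))
print(lexemic_convert("出租汽车", "hk"))
```

Returning None rather than the input word keeps the lexemic and orthographic levels separate, so a converter can try the lexemic table first and only then apply the orthographic one.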


3. Traditional Chinese Variants

Unlike Simplified Chinese, Traditional Chinese does not have a stable orthography. There are numerous TC variant forms, and much confusion prevails. It is necessary to normalize or expand these variants based on hard-coded mapping tables, such as the ones shown below. For a more detailed discussion, see Variation in Traditional Chinese Orthography.

3.1 TC Variants in Taiwan and Hong Kong

Traditional Chinese dictionaries often disagree on the choice of the standard TC form. There are various reasons for the existence of TC variants:

  1. Some TC forms are not available in the Big Five character set.
  2. Some forms have coexisted historically.
  3. Certain glyphs are unavailable in some fonts.
  4. Simplified character forms are used, especially in handwriting.

TC variants can be classified into various types, as illustrated in the table below.


Table 5
TC Variants
Variant 1 | Variant 2 | English | Comment
 | | inside | 100% interchangeable
 | | teach | 100% interchangeable
 | | particle | Variant 2 not in Big5
 | | for | Variant 2 not in Big5
 | | sink; surname | partially interchangeable
 | | leak; divulge | partially interchangeable
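Normalization and expansion based on a hard-coded variant table can be sketched as follows. The variant pairs used here (裡/裏 and 台/臺) are examples chosen for this illustration, and the choice of the first member of each group as the normalized form is arbitrary; `VARIANT_GROUPS`, `normalize_variants`, and `expand_variants` are hypothetical names. The sketch treats every pair as fully interchangeable, so it does not handle the partially interchangeable type, which requires word-level context.

```python
from itertools import product

# Hypothetical groups of interchangeable TC variants.
# The first member of each group is (arbitrarily) the normalized form.
VARIANT_GROUPS = [
    ("裡", "裏"),
    ("台", "臺"),
]
# Map every variant to its normalized form and to its whole group.
NORMAL_FORM = {v: group[0] for group in VARIANT_GROUPS for v in group}
GROUP_OF = {v: group for group in VARIANT_GROUPS for v in group}

def normalize_variants(text):
    """Replace each variant character with its group's normalized form."""
    return "".join(NORMAL_FORM.get(ch, ch) for ch in text)

def expand_variants(text):
    """Return all spellings of text obtainable by swapping variants."""
    options = [GROUP_OF.get(ch, (ch,)) for ch in text]
    return ["".join(combo) for combo in product(*options)]

print(normalize_variants("臺灣"))
print(expand_variants("台灣"))
```

Normalization suits indexing, where all spellings should collapse to one key; expansion suits query processing, where a search for one spelling should also match the others.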


3.2 Mainland vs. Taiwanese Variants

To a limited extent, the traditional forms are still used in the PRC for some classical literature, newspapers for overseas Chinese, etc., based on a standard that maps the SC forms (GB 2312-80) to their corresponding TC forms (GB 12345-90). However, these mappings do not necessarily agree with those widely used in Taiwan. We will refer to the former as Simplified Traditional Chinese (STC), and to the latter as Traditional Traditional Chinese (TTC).


Table 6
STC vs. TTC Variants
Pinyin | SC | STC | TTC
xiàn | 线 | |
bēng | | |


4. Synonym and Cross-Language Expansion

Synonym expansion and cross-language expansion are advanced forms of variant expansion. The details are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants; cross-language expansion generates a picklist of Chinese equivalents to a foreign language source term. A brief example of this is shown in the table below. For a more detailed treatment, see the author's paper Cross-Synonym and Cross-Language Searching in Japanese.


Table 7
Synonym and Cross-Language Expansion
Synonyms | 时钟 | 時鐘 | 時計 | 鐘錶 | |
 | 國家 | 国家 | 故乡 | 故鄉 | 国土 | 國土
Cross-language | country | 國家 | 国家 | | |
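Picklist generation for both kinds of expansion can be sketched with simple lookup tables. The structures `SYNONYM_GROUPS` and `CROSS_LANGUAGE` and the two functions are hypothetical names for this illustration; the entries are those of Table 7, and treating each table row as one group of related terms is an assumption about how the table should be read.

```python
# Groups of semantically related terms (SC and TC mixed), from Table 7.
SYNONYM_GROUPS = [
    ["时钟", "時鐘", "時計", "鐘錶"],
    ["國家", "国家", "故乡", "故鄉", "国土", "國土"],
]
# Foreign-language source term -> Chinese equivalents, from Table 7.
CROSS_LANGUAGE = {
    "country": ["國家", "国家"],
}

def expand_synonyms(term):
    """Return the picklist of terms related to term (including itself)."""
    for group in SYNONYM_GROUPS:
        if term in group:
            return list(group)
    return [term]

def expand_cross_language(term):
    """Return the picklist of Chinese equivalents of a foreign term."""
    return list(CROSS_LANGUAGE.get(term.lower(), []))

print(expand_synonyms("時鐘"))
print(expand_cross_language("Country"))
```

The two steps can be chained: each Chinese equivalent returned by `expand_cross_language` can be passed to `expand_synonyms` to widen the picklist further.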


5. Lexical Databases and Conversion Software

In 1996, the CJK Dictionary Institute (CJKI) launched a project to investigate C2C conversion issues in depth, and to build a comprehensive SC↔TC database (now at 1.2 million SC and 1.2 million TC items and growing) whose goal is to achieve nearly 100% conversion accuracy. We have collaborated with Basis Technology Corporation, a leading provider of CJK software technology, in developing advanced word segmentation technology and a highly accurate C2C conversion engine, which have been released as successful commercial products. These are described in detail at CJK Products.

One of the central components necessary for building tools for processing Chinese orthographic, lexemic and other variants is a database of hard-coded mapping tables fine-tuned to the needs of C2C conversion, variant expansion and normalization. Below is a list of components required for building such a database developed by our Institute, most of which have been incorporated into the sophisticated CJK tools developed by Basis Technology.

  1. Traditional Chinese single character variants
  2. Traditional Chinese compound variants
  3. Single character SC-to-TC code-level mapping table
  4. Single character TC-to-SC code-level mapping table
  5. STC-to/from-TTC character mapping table
  6. SC-to/from-TC orthographic mapping table for general vocabulary
  7. SC-to/from-TC orthographic mapping table for proper nouns
  8. SC-to/from-TC lexemic mapping tables
  9. Chinese thesaurus for synonym expansion
  10. Basic Chinese-English dictionary for cross-language expansion

See the author's The Pitfalls and Complexities of Chinese to Chinese Conversion for more details.


6. Documents for Reference

See the following links for more information: