Jack Halpern loves two things that he wants you to love too. He loves to ride unicycles, including the thirteen-wheeler shown here, and never misses a chance to teach others to ride or to promote unicycling as a sport. More than anything, Halpern loves languages, especially Japanese and Chinese. As a result, he has devoted many years to learning these languages, compiling Japanese and Chinese dictionaries, and providing large-scale lexical databases to developers of CJK language technology applications.
Halpern came by his first half-dozen languages naturally while
growing up. He was born in
After the 1967 Arab-Israeli war, he went to
live in a kibbutz in
In 1973, after exhausting the materials available to him for
studying Japanese in Israel, Halpern moved with his family to Japan, where he
has remained ever since. Over the years his “language collection” expanded to
include Chinese, both Traditional and Simplified, and he
learned various Japanese dialects including the Kansai (western) and
Halpern has also learned Spanish, Judezmo (aka Ladino), and the international languages Esperanto and Interlingua. More recently, he began to study Arabic intensely, and during a brief vacation in Aruba he literally “picked up the basics” and became hooked on his 14th language, Papiamento, a local Creole based on Spanish and Dutch.
Halpern has also put a lot of effort into explaining Judaism and
Jews to Japanese.
Halpern has published some twenty books and hundreds of
articles and papers, and presented his work at dozens of academic conferences,
but he is best known for his dictionaries, produced
through the Kanji Dictionary Publishing Society (http://www.kanji.org).
The New Japanese-English Character Dictionary exists in three editions.
These consist of two printed versions, one for
The design of Halpern’s dictionaries grew out of his own experience
in learning Japanese in
The result is a mass of unorganized material that must just be
memorized. Halpern assumed at first that in
Halpern solved this problem in his dictionaries by tracing the various meanings of each character back to one or more core meanings, and finding a concise, easy-to-memorize English keyword for each. Understanding the core meaning makes it much easier to remember the various derived meanings.
Consider the core meaning of 留, which has the on pronunciations ryū and ru in Chinese compounds, and the kun readings tomeru and todomeru, among others, in Japanese verbs. Halpern relates all of its uses to the one key concept KEEP:
1. KEEP in place, fix
2. KEEP in custody, detain
3. KEEP for future use, reserve
4. KEEP in mind, pay attention to
An important feature of any kanji dictionary is indexing. No single method of organizing characters is completely satisfactory. It must be possible to look up characters by reading and also by shape. The traditional shape-based method identifies a component of each character, called a radical, then requires counting the strokes in the rest of the character. This method is clumsy and slow.
Halpern spent some seven years developing a more efficient indexing method for kanji based on pattern recognition, called SKIP (System of Kanji Indexing by Patterns). SKIP is quick and effective, is particularly helpful to learners, and is widely used by online kanji dictionaries.
It is widely assumed that conversion between Traditional Chinese (TC) characters such as 馬 and Simplified Chinese (SC) characters such as 马 is simply a matter of table lookup. After all, the PRC government created its repertoire of Simplified Characters by taking Traditional Characters and defining substitutes for them. What could be the problem?
The problem comes from several sources, which Jack Halpern and Jouni Kerman, also of The CJK Dictionary Institute, explain in a conference paper reproduced at http://www.cjk.org/cjk/c2c/c2centry.htm. For one thing, the PRC simplifications are not entirely consistent: some distinct TC characters correspond to the same simplified form. For example, both TC 面 (face) and TC 麵 (noodles) map to SC 面, so converting SC 面 back to TC is a one-to-many, ambiguous relation.
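The ambiguity can be made concrete with a minimal Python sketch. The three-entry table is a toy built from the characters mentioned in this article, and the function names are invented for illustration; this is not CJKI's data or method.

```python
# Toy TC -> SC table. Only these three pairs are taken from the article;
# a real table would have thousands of entries.
TC_TO_SC = {"馬": "马", "麵": "面", "面": "面"}


def invert(table):
    """Invert a TC -> SC table into SC -> list of possible TC characters."""
    candidates = {}
    for tc, sc in table.items():
        candidates.setdefault(sc, []).append(tc)
    return candidates


SC_TO_TC = invert(TC_TO_SC)


def to_simplified(text):
    """TC -> SC is a plain lookup: each TC character has one SC form here."""
    return "".join(TC_TO_SC.get(ch, ch) for ch in text)


def traditional_candidates(ch):
    """SC -> TC may return several candidates; unknown characters map to themselves."""
    return SC_TO_TC.get(ch, [ch])


print(to_simplified("麵"))            # the 'noodles' character collapses to 面
print(traditional_candidates("马"))   # a single candidate
print(traditional_candidates("面"))   # two candidates: face or noodles
```

Going from TC to SC never needs a decision in this toy table, while going from SC to TC requires choosing among candidates, which a character table alone cannot do.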
Moreover, vocabulary in the PRC and other Chinese-speaking countries has diverged significantly since the Communist regime took over in 1949. As a result, replacing simplified characters with the original traditional characters can produce something meaningless. Finding the correct TC equivalent for an SC word is not always a matter of character-based code substitution (code conversion), or even of character-based word substitution, called orthographic conversion (like colour vs. color in British and American English). It often requires meaning-based word translation, called lexemic conversion (like truck vs. lorry). The problem is particularly acute for computer technology and proper nouns. For example, Internet is written 因特网 in SC and 網際網路 in TC, while Osama bin Ladin is written as shown below.
Osama bin Ladin
أسامة بن لادن
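The difference between code conversion and lexemic conversion can be sketched in a few lines of Python. The word pair 因特网 / 網際網路 comes from the text above; the tiny character table, the function names, and the longest-match lookup strategy are assumptions made for illustration, not a description of CJKI's systems.

```python
# Word-level (lexemic) table: whole SC words that need a different TC word.
LEXEMIC_SC_TO_TC = {"因特网": "網際網路"}

# Character-level (code conversion) table, deliberately tiny and one-to-one.
CHAR_SC_TO_TC = {"马": "馬", "网": "網"}


def code_convert(text):
    """Character-by-character substitution only."""
    return "".join(CHAR_SC_TO_TC.get(ch, ch) for ch in text)


def lexemic_convert(text):
    """Try the longest word-level match first, then fall back to characters."""
    max_len = max(len(word) for word in LEXEMIC_SC_TO_TC)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in LEXEMIC_SC_TO_TC:
                out.append(LEXEMIC_SC_TO_TC[word])
                i += n
                break
        else:
            # No word-level entry starts here: convert a single character.
            out.append(CHAR_SC_TO_TC.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(code_convert("因特网"))     # 因特網: valid characters, but not the TC word
print(lexemic_convert("因特网"))  # 網際網路: the word used in TC
```

Code conversion yields a string of legitimate traditional characters that is not the TC term, which is why a word-level lexicon has to be consulted before any character table.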
The foundation for Halpern’s dictionaries and for his larger business is his massive accumulation of lexical data in Chinese, Japanese, and Korean. Halpern licenses portions of this data for use in dictionaries, commercial software products, and free software. These databases are highly regarded for both academic and commercial uses. The China Lexicographic Society at Guangdong University of Foreign Studies has written, “We were deeply impressed by the high linguistic standards of [Halpern’s] work and his profound knowledge of Chinese linguistics and lexicography. We were especially impressed with the technical and linguistic sophistication of the dictionaries and systems the [CJKI] Institute developed for converting between Simplified and Traditional Chinese.”
Licensing for this database is managed by The CJK Dictionary Institute, Inc. (CJKI), which has a website at http://www.cjk.org. CJKI consists of a group of linguists and other experts specializing in CJK lexicography, under Halpern’s direction. Major search engine and technology companies, including Google, Verity, Microsoft, and Fujitsu, use CJKI’s data to power their Japanese and Chinese morphological analyzers or other applications.
collaborates on research into Chinese with the
CJKI’s comprehensive lexical databases have a rich set of attributes to support morphological analyzers. These are sophisticated computational linguistic tools that segment a text into lexemes (sometimes morphemes) and support other procedures such as POS-tagging (identifying syntactic categories), stemming, and indexing. This analysis is fundamental to machine translation and natural language processing. CJKI’s data is used to power the morphological analyzers of major portals and search engines, such as Google, Lycos, Infoseek, Excalibur, and others, while its Japanese dictionaries are used by MT and IME products from Fujitsu, Sharp, and Sony.
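To show why a lexicon matters for segmentation, here is a minimal Python sketch of forward maximum matching, a common textbook baseline for segmenting text written without spaces. The toy lexicon and the function name are assumptions for illustration; production morphological analyzers are far more sophisticated, and nothing here describes how CJKI's customers actually use its data.

```python
def segment(text, lexicon):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, or a single character if none does."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            # n == 1 is the fallback, so the loop always makes progress.
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


toy_lexicon = {"日本", "日本語", "辞書"}
print(segment("日本語辞書", toy_lexicon))  # longest match prefers 日本語 over 日本
print(segment("日本の辞書", toy_lexicon))  # の is not in the lexicon, so it stands alone
```

The quality of the output depends directly on lexicon coverage: any word missing from the lexicon falls apart into single characters, which is one reason large databases of general vocabulary and proper nouns are valuable.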
A major focus of CJKI, and one of its greatest strengths, is the development of large-scale databases of proper nouns. CJKI has been engaged in intense efforts to develop the world’s largest Chinese (3 million entries) and Japanese (2.5 million entries) databases of proper nouns, fine-tuned to the needs of NER (named entity recognition), one of the hottest topics in computational linguistics and IR technology today.
One example of a product using CJKI’s dictionaries is the Quicktionary II from Wizcom Technologies (http://www.wizcomtech.com), a pen-shaped scanner with an LCD screen and
speaker built in. Users can scan a word or a line of text and immediately read
a translation. Versions are available with
three different CJKI dictionaries (English to Japanese, SC and TC).
Halpern also gave permission for Jim Breen to include SKIP coding in his free KANJIDIC Japanese-English dictionary file, part of the EDICT project at http://www.csse.monash.edu.au/~jwb/japanese.html, which Halpern also helped build. Royalty-free licensing of SKIP may be available from CJKI for use in non-commercial free software. Vacs Corporation, whose Chinese IME is powered by CJKI’s data, offers a popular line of IME software, called VJE-Delta, for the Japanese market. It improves on the Microsoft IME offerings in the various versions of Windows.
After more than a decade in
Halpern and his team have been working intensely in the last several years on a comprehensive English-Chinese Dictionary of Computer and IT Terminology, which will be published by the well-known Shanghai Cishu Chubanshe. A unique feature of this dictionary is that it will cover both Simplified Chinese and Traditional Chinese on both the orthographic and lexemic levels. It will also, it seems, be the first dictionary ever published for the Chinese market whose editor-in-chief is not Chinese.
CJKI is now working intensely on expanding its Arabic lexical resources. These include a database of Arabic-English personal and place names; a comprehensive database of over 200,000 Arabic name variants (there are over 100 ways to spell Mu'ammar Qadhafi!), based on authoritative resources and huge corpora and of great interest to security agencies; a database of broken plurals; an application for accurately transcribing to and from Arabic; and more.
Last but not least, there is one thing that Halpern loves even more than CJK and unicycles: Brazilian Portuguese (BP), which he and his polyglot friends call “God’s mother tongue,” a language for which he has a burning passion and which brings him great joy and excitement. He has hundreds of books in and on it, and spends endless hours researching its grammar, phonology and dialects. Though CJKI has no customers for it now, Halpern has launched a project to build the world’s largest lexical database of BP, which now stands at about 250,000 entries of general vocabulary; proper nouns and technical terms will follow.
Driven by his passion for languages, Halpern and his devoted staff plow ahead with one dictionary project after another, feeding his insatiable craving for building large-scale lexical databases. These databases bring great benefit to the language technology community by powering IR, MT, and various NLP tools and applications.