Linguistic and Technical Documents
This page brings together some linguistic and technical documents written by Jack Halpern, aimed at introducing the CJK languages, with emphasis on the linguistic issues to be addressed in developing CJK linguistic tools.
- Japanese information processing
- The Japanese language
- Chinese information processing
- Korean information processing
- Arabic information processing
- Other languages
Japanese Information Processing
- Pedagogical Lexicography Applied to Chinese and Japanese Learner's Dictionaries
Introduces The CJKI Chinese Learner's Dictionary, designed to satisfy the needs of learners and to overcome the shortcomings of existing Chinese dictionaries. Presented at ASIALEX 2011. See also slide show.
- The Role of Lexical Resources in CJK Natural Language Processing
A linguistic description of the principal challenges to be overcome by developers of CJK NLP applications. This paper was presented at workshops of COLING/ACL 2006 in Sydney as well as at other conferences, and appears in various proceedings and journals, such as Lecture Notes in Computer Science.
- The Role of Phonetics and Phonetic Databases in Japanese Speech Technology
Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI.
- The Challenges of Japanese Speech Technology
A linguistic description of the principal challenges to be overcome by developers of Japanese speech technology and the role of phonological databases.
- Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval
Presented at COLING 2002 (Taipei), this paper analyzes the linguistic issues of CJK orthographic variation, including Japanese, and discusses why lexical databases should play a central role in NLP.
- The Challenges of Intelligent Japanese Searching
This paper analyzes in detail the linguistic issues related to orthographic variation in Japanese, and discusses advanced information retrieval technologies such as cross-script and cross-orthographic searching for use in intelligent IR.
- Orthographic Variation in Japanese
The highly irregular orthography and morphological complexity of Japanese pose formidable challenges to software developers. This report focuses on orthographic variation and analyzes the linguistic issues in developing Japanese linguistic tools.
- The Complexities of Japanese Homophones
Explains the subtle distinctions between the numerous homophones in Japanese, and shows why homophone processing deserves special attention in Japanese information retrieval.
- Cross-Synonym and Cross-Language Searching in Japanese
Describes the linguistic issues to be addressed by advanced Japanese information retrieval technologies, focusing on cross-language and cross-synonym searching.
- Morphological Attributes in Japanese
Describes the derivational affixes and binding valency in our Japanese lexical database, particularly useful for disambiguating Japanese lexemes in such applications as search engine query processing.
The Japanese Language
- Outline of Japanese Writing System
A fairly detailed introduction to the Japanese writing system, including the birth of the Chinese characters, the function of kanji in Japanese, and a description of the various scripts used in Japanese.
- Building a Comprehensive Chinese Character Database
Presented at Euralex '94, this paper describes how we began to develop DESK, our comprehensive CJK lexical databases, on the basis of the New Japanese-English Character Dictionary.
- Kana and Romanization
A detailed introduction to the hiragana, katakana, and romaji scripts, which together with kanji constitute the complex Japanese writing system.
- A Brief Introduction to Japanese Morphology
Describes the principal word-formation processes in Japanese, with special emphasis on the function of kanji as word elements and bound affixes.
Chinese Information Processing
- Pedagogical Lexicography Applied to Chinese and Japanese Learner's Dictionaries
Introduces The CJKI Chinese Learner's Dictionary, designed to satisfy the needs of learners and to overcome the shortcomings of existing Chinese dictionaries. Presented at ASIALEX 2011. See also slide show.
- The Role of Lexical Resources in CJK Natural Language Processing
A linguistic description of the principal challenges to be overcome by developers of Chinese NLP applications.
- Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval
This paper analyzes the linguistic issues of CJK orthographic variation, and discusses why lexical databases should play a central role in disambiguation.
- The Pitfalls and Complexities of Chinese to Chinese Conversion
Presented at several international conferences, this academic paper presents an in-depth analysis of the linguistic and technical issues related to converting Simplified Chinese to/from Traditional Chinese.
- Orthographic Variation in Chinese
This report focuses on the complexities of orthographic variation in Chinese, analyzes the linguistic issues in developing Chinese linguistic tools, and describes the major differences between Traditional and Simplified Chinese.
- Variation in Traditional Chinese Orthography
Traditional Chinese does not have a stable orthography. This short document describes the various types of character form variants and how they relate to each other.
Korean Information Processing
- Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval
This paper analyzes the linguistic issues of CJK orthographic variation, including Korean, and discusses why lexical databases should play a central role in NLP.
- Orthographic Variation in Korean
This report focuses on Korean orthographic variation and analyzes the linguistic issues to be addressed when developing Korean linguistic tools, especially intelligent information retrieval tools.
Arabic Information Processing
- Applying Smartphone Technology to Compile Innovative Arabic Learner’s Dictionaries
Presented at the 2012 International Conference on Asian Languages Processing (Hanoi), this paper describes some of the methodology used in compiling two innovative Arabic learner's dictionaries that are fine-tuned to the special needs of learners and present abundant lexicographic information in a user-friendly manner.
- Pedagogical Lexicography Applied to Arabic Dictionaries and Smartphone Applications
Introduces a new type of Arabic-English dictionary and smartphone app fine-tuned to the special needs of learners, and describes the ultimate verb conjugator smartphone app that provides instant access to verb conjugation paradigms.
- CJKI Arabic Romanization System (CARS)
An innovative phonemic transcription system developed mainly for ease of use by learners of Modern Standard Arabic, with several unique features including indication of word stress and vowel neutralization. Presented at the Towards A Transliteration Standard of Arabic: Challenges and Solutions conference in Abu Dhabi in 2009. See also slide show.
- Lexicon-Driven Approach to the Recognition of Arabic Named Entities
This paper describes the techniques used to compile the Database of Arabic Names (DAN), the world's largest Arab name resource containing millions of names and their variants. Presented at the 2nd International Conference on Arabic Language Resources and Tools in Cairo in 2009.
- Word Stress and Vowel Neutralization in Modern Standard Arabic
This paper presents word stress and neutralization rules that are both linguistically accurate and pedagogically useful, based on how spoken MSA is actually pronounced. Presented at the 2nd International Conference on Arabic Language Resources and Tools in Cairo in 2009.
- Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants
This paper analyzes the principal linguistic issues of Arabic and CJK orthographic variation and argues that linguistic knowledge supported by large-scale lexical databases is essential for accurate disambiguation. Presented at LREC 2008.
- The Challenges and Pitfalls of Arabic Romanization and Arabization
Presented at the Second Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2), held at Stanford University, this paper focuses on the linguistic issues encountered in developing unique systems for the automatic romanization of Arabic names and the arabization of non-Arabic names, including the direct arabization of CJK names.
- The Typology of Arabic Proper Nouns
A 50-page report with in-depth analysis of the etymology, structure, and typology of Arabic proper nouns, highly useful for Arabic information processing and name recognition, with several appendixes.
Other languages
- Is English Segmentation Trivial?
Describes the principal word-formation processes in English, and demonstrates that word segmentation in English, contrary to popular belief, is far from trivial.
- Criteria for Inclusion of Multiword Lexical Units in Dictionaries
Coming Soon.
- European and Semitic languages
Coming Soon. A series of reports describing the features of the major European and Semitic languages, focusing on orthographic variation, and describing the linguistic issues to be addressed in developing linguistic tools.