Linguistic and Technical Documents

This page brings together some linguistic and technical documents written by Jack Halpern, aimed at introducing the CJK languages, in addition to Arabic, with emphasis on the linguistic issues to be addressed in developing both CJK and Arabic linguistic tools.

Japanese Information Processing

Current State of JMWEL: a Comprehensive Japanese MWE Lexicon and its Applications

A paper co-authored by Masahito Takahashi, Toshifumi Tanabe, Kosho Shudo, and Jack Halpern on JMWEL, a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based NLP applications such as machine translation and information retrieval. Presented at EUROPHRAS 2019: Computational and Corpus-based Phraseology in Malaga, Spain, in September 2019.

Very Large-scale Lexical Resources to Enhance Chinese and Japanese Machine Translation

This paper, presented at the TAUS Executive Forum Tokyo 2017, looks at the linguistic issues related to orthographic variation, showing how Very Large-scale Lexical Resources (VLSLR) can significantly enhance the accuracy of NLP tools, with a focus on machine translation (MT), named entity recognition (NER), and named entity translation (NET). See the slide show.

Some Linguistic Issues in the Machine Transliteration of Chinese, Japanese, and Arabic Names

This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See the slide show.

Pedagogical Lexicography Applied to Chinese and Japanese Learner’s Dictionaries

Introduces The CJKI Chinese Learner’s Dictionary, designed to satisfy the needs of learners and to overcome the shortcomings of existing Chinese dictionaries. Presented at ASIALEX 2011. See also the slide show.

The Role of Lexical Resources in CJK Natural Language Processing

A linguistic description of the principal challenges to be overcome by developers of CJK NLP applications, this paper was presented at workshops of COLING/ACL 2006 in Sydney as well as other conferences. See the slide show.

The Role of Phonetics and Phonetic Databases in Japanese Speech Technology

Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI. Presentation slides here.

The Challenges of Japanese Speech Technology

A linguistic description of the principal challenges to be overcome by developers of Japanese speech technology and the role of phonological databases.

Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval

Presented at COLING 2002 (Taipei), this paper analyzes the linguistic issues of CJK orthographic variation, including Japanese, and discusses why lexical databases should play a central role in NLP.

The Challenges of Intelligent Japanese Searching

This paper analyzes in detail the linguistic issues related to orthographic variation in Japanese, and discusses advanced information retrieval technologies such as cross-script and cross-orthographic searching for use in intelligent IR.

Orthographic Variation in Japanese

The highly irregular orthography and morphological complexity of Japanese pose formidable challenges to software developers. This report focuses on orthographic variation and analyzes the linguistic issues in developing Japanese linguistic tools.

The Complexities of Japanese Homophones

Explains the subtle distinctions between the numerous homophones in Japanese, and shows why homophone processing deserves special attention in Japanese information retrieval.

Cross-Synonym and Cross-Language Searching in Japanese

Describes the linguistic issues to be addressed by advanced Japanese information retrieval technologies, focusing on cross-language and cross-synonym searching.

Morphological Attributes in Japanese

Describes the derivational affixes and binding valency in our Japanese lexical database, particularly useful for disambiguating Japanese lexemes in such applications as search engine query processing.

Mobile Language Learning

Enhancing Mobile Learning by Linking Japanese Dictionary Apps

This paper, presented at eLex 2019 in Sintra, Portugal, describes how four mobile apps exploit the unique features of the mobile platform to help learners study Japanese effectively in previously unavailable ways. (Abstract | Presentation)

Groundbreaking Mobile Technology to Enhance Chinese and Japanese Language Learning

This paper, presented at ACLL2017: The Asian Conference on Language Learning in Kobe, Japan, describes our groundbreaking Libera platform that combines the strengths of traditional bilingual parallel texts with the educational potential of the smart tablet platform. (Abstract | Presentation)

Exploiting Mobile Technology to Enhance EFL

This paper, presented at the JALT2016 Annual Conference in Nagoya, describes our groundbreaking Libera platform that combines the strengths of traditional bilingual parallel texts with the educational potential of the smart tablet platform. (Abstract | Presentation)

Exploiting Mobile-Assisted Language Learning Technology to Enhance Japanese Language Education

This poster presentation, given at the 2016 Pacific Second Language Research Forum in Tokyo, describes two applications that leverage mobile technology to help learners study Japanese more effectively than ever before. (Abstract | Poster)

Dictionaries and Mobile Tools for Effective Language Learning

A workshop presentation, sponsored by Kodansha USA, given at the 2015 ACTFL Annual Convention and World Languages Expo in San Diego, CA.

Interactive Parallel Text: A New Paradigm for Language Learning

A presentation on an exciting new language learning platform, given at the 2015 ACTFL Annual Convention and World Languages Expo in San Diego, CA.

The Japanese Language

Compilation Techniques for Pedagogically Effective Bilingual Learners’ Dictionaries

This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.

Major Issues in Compiling Multilingual Kanji Dictionaries

The aim of this presentation, given at the Second Wordnet Bahasa Workshop in Singapore in January 2016, was to examine several key issues in pedagogical lexicography both from the lexicographer’s and from the kanji learner’s points of view, focusing on compilation and design innovations that increase learner usability. (Abstract | Presentation)

Outline of Japanese Writing System

A fairly detailed introduction to the Japanese writing system, including the birth of the Chinese characters, the function of kanji in Japanese, and a description of the various scripts used in Japanese.

Building a Comprehensive Chinese Character Database

Presented at Euralex ’94, this paper describes how we began to develop DESK, our comprehensive CJK lexical databases, on the basis of the New Japanese-English Character Dictionary.

Kana and Romanization

A detailed introduction to the hiragana, katakana, and romaji scripts, which together with kanji constitute the complex Japanese writing system.

A Brief Introduction to Japanese Morphology

Describes the principal word-formation processes in Japanese, with special emphasis on the function of kanji as word elements and bound affixes.

Chinese Information Processing

Very Large-scale Lexical Resources to Enhance Chinese and Japanese IR and NLP

This paper looks at the linguistic issues related to orthographic variation, showing how Very Large-scale Lexical Resources (VLSLR) can significantly enhance the accuracy of NLP tools, with a focus on information retrieval (IR), named entity recognition (NER), and named entity translation (NET).

Some Linguistic Issues in the Machine Transliteration of Chinese, Japanese, and Arabic Names

This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See also the slide show.

Compilation Techniques for Pedagogically Effective Bilingual Learners’ Dictionaries

This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.

Pedagogical Lexicography Applied to Chinese and Japanese Learner’s Dictionaries

The Role of Lexical Resources in CJK Natural Language Processing

A linguistic description of the principal challenges to be overcome by developers of Chinese NLP applications, this paper was presented at COLING/ACL 2006 in Sydney as well as other conferences. See the slide show.

Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval

Presented at COLING 2002 (Taipei), this paper analyzes the linguistic issues of CJK orthographic variation, including Chinese, and discusses why lexical databases should play a central role in NLP.

The Pitfalls and Complexities of Chinese to Chinese Conversion

Presented at several international conferences, this academic paper presents an in-depth analysis of the linguistic and technical issues related to converting Simplified Chinese to/from Traditional Chinese.

Orthographic Variation in Chinese

This report focuses on the complexities of orthographic variation in Chinese, analyzes the linguistic issues in developing Chinese linguistic tools, and describes the major differences between Traditional and Simplified Chinese.

Variation in Traditional Chinese Orthography

Traditional Chinese does not have a stable orthography. This short document describes the various types of character form variants and how they relate to each other.

Korean Information Processing

Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval

This paper, presented at COLING/ACL 2006 in Sydney as well as other conferences, analyzes the linguistic issues of CJK orthographic variation, including Korean, and discusses why lexical databases should play a central role in NLP. See the slide show.

Orthographic Variation in Korean

This report focuses on Korean orthographic variation and analyzes the linguistic issues to be addressed when developing Korean linguistic tools, especially intelligent information retrieval tools.

Arabic Information Processing

A Comprehensive Full-Form Lexicon for Arabic NLP and Speech Technology

An academic paper presented orally at LREC 2026 about our ArabLEX database, a full-form lexicon (includes all wordforms, i.e., fully inflected/cliticized members of a lexeme class) comprising approximately 570 million entries with fully inflected forms and detailed morphological, phonetic, and orthographic attributes (paper | slide show | video).

DiaLEX: Arabic Dialects Full-Form Lexicon

A presentation (slide show | poster) at LREC 2026’s Industry Day highlighting DiaLEX, which covers Egyptian, Hejazi, Emirati, Syrian, Lebanese, and Palestinian Arabic.

ARABLEX: Comprehensive Arabic Full-Form Lexicon

A white paper about our ArabLEX database, a comprehensive Arabic lexical resource that provides a rich set of grammatical, morphological and phonological features.

Some Linguistic Issues in the Machine Transliteration of Chinese, Japanese, and Arabic Names

Compilation Techniques for Pedagogically Effective Bilingual Learners’ Dictionaries

This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.

Compilation Strategies for Pedagogically Effective Bilingual Learner’s Dictionaries

This presentation at ASIALEX2016 in the Philippines describes three bilingual learner’s dictionaries. (Abstract | Presentation slides)

Exploiting Mobile Technology and Computational Lexicography to Enhance Arabic Pedagogy

A panel discussion organized by our director Jack Halpern for the Middle East Studies Association (MESA) 2014 Annual Meeting focused on methodologies to create pedagogically effective language learning and dictionary applications by harnessing the vast potential of the mobile platform. View Mr. Halpern’s presentation abstract here and slide show here.

Headword Selection in Arabic Lexicography

Discusses key issues related to the selection of headwords in Arabic dictionaries, in particular learner’s dictionaries, and briefly touches on criteria for selecting word senses.

Applying Smartphone Technology to Compile Innovative Arabic Learner’s Dictionaries

Presented at the 2012 International Conference on Asian Languages Processing (Hanoi), this paper describes some of the methodology used in compiling two innovative Arabic learner’s dictionaries that are fine-tuned to the special needs of learners and present abundant lexicographic information in a user-friendly manner.

Pedagogical Lexicography Applied to Arabic Dictionaries and Smartphone Applications

Introduces a new type of Arabic-English dictionary and smartphone app fine-tuned to the special needs of learners, and describes the ultimate verb conjugator smartphone app that provides instant access to verb conjugation paradigms.

CJKI Arabic Romanization System (CARS)

An innovative phonemic transcription system developed mainly for ease of use by learners of Modern Standard Arabic, with several unique features including an indication of word stress and vowel neutralization. Presented at the Towards A Transliteration Standard of Arabic: Challenges and Solutions conference in Abu Dhabi in 2009. See also the slide show.

Lexicon-Driven Approach to the Recognition of Arabic Named Entities

This paper describes the techniques used to compile the Database of Arabic Names (DAN), the world’s largest Arab name resource containing millions of names and their variants. Presented at the 2nd International Conference on Arabic Language Resources and Tools in Cairo in 2009.

Word Stress and Vowel Neutralization in Modern Standard Arabic

This paper presents word stress and neutralization rules that are both linguistically accurate and pedagogically useful based on how spoken MSA is actually pronounced. Presented at the 2nd International Conference on Arabic Language Resources and Tools in Cairo in 2009.

Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants

This paper analyzes the principal linguistic issues of Arabic and CJK orthographic variation and argues that linguistic knowledge supported by large-scale lexical databases is essential for accurate disambiguation. Presented at LREC 2008.

The Challenges and Pitfalls of Arabic Romanization and Arabization

Presented at the Second Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2), held at Stanford University, this paper focuses on the linguistic issues encountered in developing unique systems for the automatic romanization of Arabic names and the arabization of non-Arabic names that can arabize CJK names directly. See the slide show.

Others

The Role of Paper Dictionaries in Language Learning

This paper explores the evidence supporting the continued existence and unique benefits of paper dictionaries for language learners and enthusiasts.

Parallel Annotated Synthetic Corpora (PASC)

The Parallel Annotated Synthetic Corpora (PASC) project focuses on creating comprehensive synthetic corpora for various applications in natural language processing and speech translation. By providing fully aligned and accurate synthetic corpora along with precise annotations, the quality of language models, including Neural Machine Translation, Automatic Speech Recognition, and Text to Speech, can be enhanced. (White Paper | Summary | Data Sample)

What is a multiword expression? Lexicographic criteria for including MWEs in bilingual dictionaries

This paper, presented at the Collocations in Lexicography: existing solutions and future challenges workshop at eLex 2019 in Sintra, Portugal, discusses some of the fundamental principles for the selection of headwords in bilingual dictionaries. (Abstract | Presentation)

Lexicographic Criteria for Selecting Multiword Units for MT Lexicons

This paper, presented at The 4th Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2019) in Malaga, Spain, discusses the fundamental principles for identifying and selecting MWUs for inclusion in bilingual dictionaries, both for humans and for MT systems (MT lexicons). See the slide show.

Is English Segmentation Trivial?

Describes the principal word-formation processes in English, and demonstrates that word segmentation in English, contrary to popular belief, is far from trivial.

Criteria for Inclusion of Multiword Lexical Units in Dictionaries

Coming Soon.

European and Semitic Languages

Coming Soon. A series of reports describing the features of the major European and Semitic languages, focusing on orthographic variation, and describing the linguistic issues to be addressed in developing linguistic tools.