Linguistic and Technical Documents
This page brings together linguistic and technical documents written by Jack Halpern that introduce the CJK languages and Arabic, with emphasis on the linguistic issues to be addressed in developing CJK and Arabic linguistic tools.
Japanese Information Processing
A paper co-authored by Masahito Takahashi, Toshifumi Tanabe, Kosho Shudo, and Jack Halpern on JMWEL, a comprehensive lexicon of Japanese Multiword Expressions (MWEs) with a rich set of grammatical attributes fine-tuned for phrase-based NLP applications such as machine translation and information retrieval. Presented at EUROPHRAS 2019: Computational and Corpus-based Phraseology in Malaga, Spain, in September 2019.
This paper, presented at the TAUS Executive Forum Tokyo 2017, looks at the linguistic issues related to orthographic variation, showing how Very Large-scale Lexical Resources (VLSLR) can significantly enhance the accuracy of NLP tools, with a focus on machine translation (MT), named entity recognition (NER), and named entity translation (NET). See the slide show.
This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See the slide show.
A linguistic description of the principal challenges to be overcome by developers of CJK NLP applications, this paper was presented at workshops of COLING/ACL 2006 in Sydney as well as at other conferences. See the slide show.
Presented at the 11th Oriental COCOSDA Workshop held in Kyoto in 2008, this paper summarizes the complex allophonic variations that need to be considered in developing Japanese speech technology applications, and introduces the 130,000-entry Japanese Phonetic Database (JPD) developed by CJKI. Presentation slides here.
Describes the linguistic issues to be addressed by advanced Japanese information retrieval technologies, focusing on cross-language and cross-synonym searching.
Mobile Language Learning
Enhancing Mobile Learning by Linking Japanese Dictionary Apps
This paper, presented at eLex 2019 in Sintra, Portugal, describes how four mobile apps exploit the unique features of the mobile platform to help learners study Japanese effectively in previously unavailable ways. (Abstract | Presentation)
Groundbreaking Mobile Technology to Enhance Chinese and Japanese Language Learning
This paper, presented at ACLL2017: The Asian Conference on Language Learning in Kobe, Japan, describes our groundbreaking Libera platform, which combines the strengths of traditional bilingual parallel texts with the educational potential of the smart tablet platform. (Abstract | Presentation)
Exploiting Mobile Technology to Enhance EFL
Exploiting Mobile-Assisted Language Learning Technology to Enhance Japanese Language Education
A presentation on an exciting new language learning platform, given at the 2015 ACTFL Annual Convention and World Languages Expo in San Diego, CA.
The Japanese Language
This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.
The aim of this presentation, given at the Second Wordnet Bahasa Workshop in Singapore in January 2016, was to examine several key issues in pedagogical lexicography both from the lexicographer’s and from the kanji learner’s points of view, focusing on compilation and design innovations that increase learner usability. (Abstract | Presentation)
Presented at Euralex ’94, this paper describes how we began to develop DESK, our comprehensive CJK lexical databases, on the basis of the New Japanese-English Character Dictionary.
A detailed introduction to the hiragana, katakana, and romaji scripts, which together with kanji constitute the complex Japanese writing system.
Chinese Information Processing
This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See also the slide show.
This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.
Introduces The CJKI Chinese Learner’s Dictionary, designed to satisfy the needs of learners and to overcome the shortcomings of existing Chinese dictionaries. Presented at ASIALEX 2011. See also the slide show.
A linguistic description of the principal challenges to be overcome by developers of Chinese NLP applications, this paper was presented at COLING/ACL 2006 in Sydney as well as at other conferences. See the slide show.
Presented at COLING 2002 (Taipei), this paper analyzes the linguistic issues of CJK orthographic variation, including Chinese, and discusses why lexical databases should play a central role in NLP.
Presented at several international conferences, this academic paper presents an in-depth analysis of the linguistic and technical issues related to converting Simplified Chinese to/from Traditional Chinese.
Korean Information Processing
This paper, presented at COLING/ACL 2006 in Sydney as well as at other conferences, analyzes the linguistic issues of CJK orthographic variation, including Korean, and discusses why lexical databases should play a central role in NLP. See the slide show.
Arabic Information Processing
This keynote address, given at the 6th NEWS Named Entities Workshop in Berlin in August 2016, focuses on the special characteristics of Chinese, Japanese, and Arabic scripts that impact machine translation, on the role played by lexical resources such as personal name dictionaries, and on how these resources can be used to enhance the accuracy of name transliteration systems. See also the slide show.
This article was published in a special issue of the International Journal of Lexicography (Volume 29, Issue 3) on “Bilingual Learners’ Dictionaries”.
This presentation at ASIALEX 2016 in the Philippines describes three bilingual learner’s dictionaries. (Abstract | Presentation slides)
A panel discussion organized by our director Jack Halpern for the Middle East Studies Association (MESA) 2014 Annual Meeting focused on methodologies to create pedagogically effective language learning and dictionary applications by harnessing the vast potential of the mobile platform. View Mr. Halpern’s presentation abstract here and slide show here.
An innovative phonemic transcription system developed mainly for ease of use by learners of Modern Standard Arabic, with several unique features including an indication of word stress and vowel neutralization. Presented at the Towards A Transliteration Standard of Arabic: Challenges and Solutions conference in Abu Dhabi in 2009. See also the slide show.
This paper analyzes the principal linguistic issues of Arabic and CJK orthographic variation and argues that linguistic knowledge supported by large-scale lexical databases is essential for accurate disambiguation. Presented at LREC 2008.
Presented at the Second Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2), held at Stanford University, this paper focuses on the linguistic issues encountered in developing systems for the automatic romanization of Arabic names and for the arabization of non-Arabic names, including the direct arabization of CJK names. See the slide show.
Others
This paper explores the evidence supporting the continued existence and unique benefits of paper dictionaries for language learners and enthusiasts.
Parallel Annotated Synthetic Corpora (PASC)
The Parallel Annotated Synthetic Corpora (PASC) project focuses on creating comprehensive synthetic corpora for various applications in natural language processing and speech translation. Fully aligned, accurate synthetic corpora with precise annotations can enhance the quality of language models, including Neural Machine Translation, Automatic Speech Recognition, and Text to Speech. (White Paper | Summary | Data Sample)
This paper, presented at the Collocations in Lexicography: existing solutions and future challenges workshop at eLex 2019 in Sintra, Portugal, discusses some of the fundamental principles for the selection of headwords in bilingual dictionaries. (Abstract | Presentation)
This paper, presented at The 4th Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2019) in Malaga, Spain, discusses the fundamental principles for identifying and selecting MWUs for inclusion in bilingual dictionaries, both for humans and for MT systems (MT lexicons). See the slide show.