DeepLEX: Lexical Resources for Deep Learning

The CJK Dictionary Institute (CJKI) is actively developing very large-scale lexical resources, referred to as DeepLEX Resources, to support Deep Learning in such diverse fields as named entity recognition (NER), cybersecurity, neural machine translation (NMT), and speech technology.

Selected Resources


Our DeepLEX Resources include tens of millions of CJK named entities specifically designed to support NLP applications such as NER and speech technology. They are used by the world’s largest IT companies in NLP and AI applications including speech technology, machine translation, and natural language generation. (See the resources listed below for more details on each database.)


Chinese Personal Name Variants

7.6 million Chinese personal names and their romanized variants

Japanese Personal Name Variants

3.5 million Japanese personal names and their romanized variants

Arabic Full-Form Lexicon

1.2 billion entries, including all inflected, declined, and conjugated forms

Japanese Orthographical Database

Orthographic variants for core Japanese vocabulary, covering 126,000 entries

Japanese-Multilingual Place Names and POIs

A multilingual database of 3.1 million Japanese and Western place names

Database of Arabic Names

6.5 million Arabic personal names and their romanized variants


Use Cases

Our DeepLEX Resources can benefit the development of Deep Learning systems and technology platforms in the following ways:

Named Entity Recognition (NER)
NER traditionally uses rule-based approaches, but in data-rich domains such as romanized personal name variants in Chinese and Arabic, these approaches do not always achieve adequate recall and precision. Integrating comprehensive, hard-coded lexicons covering tens or hundreds of millions of entries, such as those maintained by CJKI, offers the most practical route to high accuracy.
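As a rough illustration of lexicon-driven NER, the sketch below tags romanized name variants by greedy longest-match lookup and maps each hit to a canonical form. The table `VARIANT_TO_CANONICAL`, its entries, and the function `tag_names` are hypothetical and chosen only for illustration; they do not reflect the actual DeepLEX data format, and a production system would need proper tokenization and a far larger lexicon.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: romanized variant -> canonical name.
# Entries and layout are illustrative only, not the DeepLEX format.
VARIANT_TO_CANONICAL: Dict[str, str] = {
    "mao zedong": "毛泽东",
    "mao tse-tung": "毛泽东",
    "muhammad": "محمد",
    "mohammed": "محمد",
    "mohamed": "محمد",
}

# Longest lexicon entry, measured in whitespace-separated tokens.
MAX_TOKENS = max(len(key.split()) for key in VARIANT_TO_CANONICAL)
PUNCTUATION = ".,;:!?"


def tag_names(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, canonical name) pairs using greedy longest match."""
    tokens = text.split()
    matches: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for span in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + span]).strip(PUNCTUATION)
            canonical = VARIANT_TO_CANONICAL.get(surface.lower())
            if canonical is not None:
                matches.append((surface, canonical))
                i += span
                break
        else:
            # No lexicon entry starts at this token; move on.
            i += 1
    return matches


if __name__ == "__main__":
    print(tag_names("Mao Tse-tung met Mohamed."))
```

Because every variant points to one canonical entry, recall depends on lexicon coverage rather than on hand-written rules, which is the trade-off the paragraph above describes.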

Neural Machine Translation (NMT)
NMT performs poorly on low-frequency content words, especially named entities. Integrating DeepLEX data into NMT systems can substantially increase translation accuracy scores.
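One common way to combine an entity lexicon with an NMT system is placeholder substitution: known entities are masked before translation and replaced with their lexicon translations afterwards. The sketch below assumes this approach; it is not necessarily how DeepLEX data is integrated in practice. `ENTITY_LEXICON`, its entries, and the `translate` callable (a stand-in for whatever NMT engine is used) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical bilingual entity lexicon (Japanese -> English), illustrative only.
ENTITY_LEXICON: Dict[str, str] = {
    "東京都": "Tokyo Metropolis",
    "山田太郎": "Taro Yamada",
}


def protect_entities(source: str) -> Tuple[str, List[str]]:
    """Replace known entities with numbered placeholders.

    Returns the masked text and the list of target-language entity strings,
    indexed by placeholder number.
    """
    targets: List[str] = []
    # Match longer entries first so a short entry never splits a longer one.
    for entity in sorted(ENTITY_LEXICON, key=len, reverse=True):
        while entity in source:
            source = source.replace(entity, f"__ENT{len(targets)}__", 1)
            targets.append(ENTITY_LEXICON[entity])
    return source, targets


def restore_entities(translated: str, targets: List[str]) -> str:
    """Substitute each placeholder with its lexicon translation."""
    for index, target in enumerate(targets):
        translated = translated.replace(f"__ENT{index}__", target)
    return translated


def translate_with_lexicon(source: str, translate: Callable[[str], str]) -> str:
    """Translate `source`, taking entity translations from the lexicon."""
    protected, targets = protect_entities(source)
    return restore_entities(translate(protected), targets)
```

This only works if the NMT engine passes the placeholder tokens through unchanged, which would have to be verified (or enforced) for the engine in question.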

Large-scale entity lexicons can also play a major role in cybersecurity, but extraction models tend to ignore entities specific to the cybersecurity domain, such as names of hackers and viruses. Cybersecurity can benefit significantly both from traditional CRF-based NER using ordinary entity lexicons and from security entity lexicons fine-tuned to specific entities.

Regularization algorithms must perform optimally not only on training data but also on unseen input data such as orthographic variants of named entities. Large-scale entity lexicons can significantly enhance accuracy by compressing vector data and computing meaningful values for each variant.
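One reading of this point is that a variant lexicon lets unseen spellings share the vector of a form the model has already seen. The sketch below illustrates that idea only; `CANONICAL_FORM`, `EMBEDDINGS`, the vector values, and `lookup_vector` are hypothetical and do not describe any actual DeepLEX interface.

```python
from typing import Dict, List, Optional

# Hypothetical variant table: orthographic variant -> canonical form.
CANONICAL_FORM: Dict[str, str] = {
    "受付": "受け付け",
    "受付け": "受け付け",
    "受け付": "受け付け",
    "うけつけ": "受け付け",
}

# Toy embedding table containing only the canonical form (values are made up).
EMBEDDINGS: Dict[str, List[float]] = {
    "受け付け": [0.12, -0.40, 0.88],
}


def lookup_vector(token: str) -> Optional[List[float]]:
    """Return the vector for `token`, falling back to its canonical form.

    Tokens absent from the variant table are looked up as-is; None is
    returned when no vector is available.
    """
    canonical = CANONICAL_FORM.get(token, token)
    return EMBEDDINGS.get(canonical)
```

Storing one vector per canonical form rather than one per spelling is also what keeps the table compact.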

Pre-trained Models
Building pre-trained word association models using our DeepLEX Resources and combining them with other resources such as annotated corpora can lead to satisfactory results, especially for morphologically complex languages like Arabic.