DeepLEX: Lexical Resources for Deep Learning

The CJK Dictionary Institute is actively developing very large-scale lexical resources, referred to as DeepLEX Resources, to support Deep Learning in such diverse fields as named entity recognition (NER), cybersecurity, neural machine translation (NMT), and speech technology.

Selected Resources

Our DeepLEX Resources include tens of millions of CJK named entities specifically designed to support NLP applications such as NER and speech technology. The world’s largest IT companies use them in NLP and AI applications, including speech technology, machine translation, and natural language generation.

CNV

Chinese Personal Name Variants

7.6 million Chinese personal names and their romanized variants

JOD

Japanese Orthographic Database

Orthographic variants for core Japanese vocabulary, covering 126,000 entries

JNV

Japanese Personal Name Variants

3.5 million Japanese personal names and their romanized variants

JMP

Japanese-Multilingual Place Names and POIs

Multilingual database of 3.1 million Japanese and Western place names

ArabLEX

Arabic Full-Form Lexicon

530 million entries, including all inflected, declined, and conjugated forms

DAN

Database of Arabic Names

6.5 million Arabic personal names and their romanized variants

Use Cases

Our DeepLEX Resources can benefit the development of Deep Learning systems and technology platforms in the following ways:

Named Entity Recognition (NER)

NER traditionally uses rule-based approaches, but in data-rich domains such as romanized personal name variants in Chinese and Arabic, these approaches do not always achieve adequate recall and precision. Integrating comprehensive, hard-coded lexicons covering tens or hundreds of millions of entries, such as those provided by CJKI, offers the most practical way to achieve high accuracy.
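As a rough illustration of lexicon-driven NER, the sketch below does a greedy longest-match lookup of token spans against a small in-memory table of romanized name variants. The `LEXICON` contents, the `tag_entities` function, and the tuple-keyed layout are assumptions made for this example; they do not reflect the actual DeepLEX data format, and a production system would use a far larger table and a proper tokenizer.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: lowercased token sequence -> canonical name.
# The real DeepLEX format is not described in the source text.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("mao", "zedong"): "Mao Zedong",
    ("mao", "tse-tung"): "Mao Zedong",
    ("muhammad",): "Muhammad",
    ("mohammed",): "Muhammad",
}
MAX_LEN = max(len(key) for key in LEXICON)


def tag_entities(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, canonical name) pairs found by longest match."""
    # Naive whitespace tokenization with edge punctuation stripped.
    stripped = [t.strip(".,;:!?") for t in text.split()]
    lowered = [s.lower() for s in stripped]
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(stripped):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(stripped) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in LEXICON:
                found.append((" ".join(stripped[i:i + n]), LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return found


print(tag_entities("Mao Tse-tung met Mohammed."))
```

Because every variant maps to one canonical entry, recall depends on lexicon coverage rather than on hand-written rules for each romanization scheme.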

Neural Machine Translation (NMT)

NMT performs poorly on low-frequency content words, especially named entities. Integrating DeepLEX data into NMT systems can substantially increase translation accuracy scores.
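The text does not say how lexicon data is integrated into an NMT system. One common approach, shown here purely as an assumption, is to mask known entities with placeholders before translation and restore their lexicon translations afterwards. The `NAME_LEXICON` entries and the `mask_entities` and `unmask_entities` helpers are hypothetical, and the translation step itself is represented by a hand-written string.

```python
from typing import Dict, List, Tuple

# Hypothetical bilingual name lexicon (Japanese -> English).
NAME_LEXICON: Dict[str, str] = {
    "山田太郎": "Taro Yamada",
    "大阪": "Osaka",
}


def mask_entities(source: str, lexicon: Dict[str, str]) -> Tuple[str, List[str]]:
    """Replace known entities with numbered placeholders."""
    slots: List[str] = []
    # Check longer entries first so they win over shorter substrings.
    for surface in sorted(lexicon, key=len, reverse=True):
        if surface in source:
            source = source.replace(surface, f"__ENT{len(slots)}__")
            slots.append(lexicon[surface])
    return source, slots


def unmask_entities(translated: str, slots: List[str]) -> str:
    """Put the lexicon translations back in place of the placeholders."""
    for i, target in enumerate(slots):
        translated = translated.replace(f"__ENT{i}__", target)
    return translated


masked, slots = mask_entities("山田太郎は大阪に住んでいる。", NAME_LEXICON)
# Stand-in for the NMT system's output on the masked sentence.
translated = "__ENT0__ lives in __ENT1__."
print(unmask_entities(translated, slots))
```

This keeps rare names out of the model's vocabulary entirely, at the cost of relying on the model to copy placeholders through unchanged.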

Cybersecurity

Large-scale entity lexicons can also play a major role in cybersecurity, where extraction models tend to ignore domain-specific entities such as the names of hackers and viruses. Cybersecurity can benefit significantly both from traditional CRF-based NER using ordinary entity lexicons and from security entity lexicons fine-tuned to specific entities.

Regularization

Regularization aims to make models perform well not only on training data but also on unseen input such as orthographic variants of named entities. Large-scale entity lexicons can significantly enhance accuracy by compressing vector data and computing meaningful values for each variant.
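One way to read "compressing vector data" is that all variants of a name share a single vocabulary index, and therefore a single embedding row. The sketch below illustrates that reading only; the `VARIANTS` table and the `VariantVocab` class are hypothetical and are not taken from any CJKI resource.

```python
from typing import Dict, List

# Hypothetical variant table: lowercased variant -> canonical form.
VARIANTS: Dict[str, str] = {
    "muhammad": "muhammad",
    "mohammed": "muhammad",
    "mohamed": "muhammad",
    "hussein": "hussein",
    "husayn": "hussein",
}


class VariantVocab:
    """Maps every known variant to the index of its canonical form."""

    def __init__(self, variants: Dict[str, str]) -> None:
        self.canonical = dict(variants)
        self.index: Dict[str, int] = {}
        for canon in sorted(set(variants.values())):
            # Index 0 is reserved for tokens missing from the lexicon.
            self.index[canon] = len(self.index) + 1

    def encode(self, tokens: List[str]) -> List[int]:
        return [
            self.index.get(self.canonical.get(tok.lower(), ""), 0)
            for tok in tokens
        ]


vocab = VariantVocab(VARIANTS)
print(vocab.encode(["Mohamed", "Husayn", "Smith"]))  # → [2, 1, 0]
```

Here five surface forms collapse to two indices, so a variant that never appeared in training still receives the vector learned for its canonical form.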

Pre-trained Models

Building pre-trained word association models using our DeepLEX Resources and combining them with other resources such as annotated corpora can lead to satisfactory results, especially for morphologically complex languages like Arabic.

Reference Documents