DeepLEX

DeepLEX: Lexical Resources for Deep Learning

The CJK Dictionary Institute (CJKI) is actively developing very large-scale lexical resources, referred to as DeepLEX Resources, to support Deep Learning in such diverse fields as named entity recognition (NER), cybersecurity, neural machine translation (NMT), and speech technology.

Selected Resources

Our DeepLEX Resources include tens of millions of CJK named entities specifically designed to support NLP applications such as NER and speech technology. They are used by the world’s largest IT companies in NLP and AI applications, including speech technology, machine translation, and natural language generation. (See the resources listed below for more details on each database.)

Use Cases

Our DeepLEX Resources can benefit the development of Deep Learning systems and technology platforms in the following ways:

Named Entity Recognition (NER)

NER traditionally uses rule-based approaches, but in data-rich domains such as romanized personal name variants in Chinese and Arabic, these approaches do not always achieve adequate recall and precision. Integrating comprehensive, hard-coded lexicons covering tens or hundreds of millions of entries, such as those maintained by CJKI, offers the most practical way to achieve high accuracy.
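As an illustration of lexicon-driven NER, the following is a minimal sketch of greedy longest-match lookup against an entity lexicon. The tiny `LEXICON` table, its entity types, and the `find_entities` function are hypothetical stand-ins invented for this example; they do not reflect the actual DeepLEX data format or any CJKI tooling.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> entity type.
# A production lexicon would hold millions of entries.
LEXICON: Dict[str, str] = {
    "東京": "PLACE",
    "東京都": "PLACE",
    "山田太郎": "PERSON",
}


def find_entities(text: str, lexicon: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Scan text left to right, preferring the longest lexicon entry at each position.

    Returns a list of (start, end, surface, entity_type) tuples.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    results: List[Tuple[int, int, str, str]] = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so "東京都" wins over "東京".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = (i, i + length, candidate, lexicon[candidate])
                break
        if match:
            results.append(match)
            i = match[1]  # continue after the matched entity
        else:
            i += 1
    return results


entities = find_entities("山田太郎は東京都に住む", LEXICON)
```

Because the lookup works on character offsets, it needs no prior word segmentation, which is convenient for unsegmented scripts such as Japanese and Chinese. In practice, matches like these are often fed to a statistical or neural tagger as features instead of being used as final decisions.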

Neural Machine Translation (NMT)

NMT performs poorly on low-frequency content words, especially named entities. Integrating DeepLEX data into NMT systems can substantially increase translation accuracy scores.
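The text does not say how lexicon data is integrated into an NMT system. One commonly used approach, shown here only as a sketch, is to replace known entities with placeholder tokens before translation and to restore their lexicon translations afterwards. The `BILINGUAL_LEXICON` table, the placeholder format, and both function names are assumptions made for this example, and the translation step itself is left to whatever NMT system is in use.

```python
from typing import Dict, List, Tuple

# Hypothetical bilingual entity lexicon: source surface form -> target form.
BILINGUAL_LEXICON: Dict[str, str] = {
    "山田太郎": "Taro Yamada",
}


def mask_entities(source: str, lexicon: Dict[str, str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace lexicon entities in the source with placeholder tokens.

    Returns the masked sentence and a list of (placeholder, target_form) pairs.
    """
    mapping: List[Tuple[str, str]] = []
    masked = source
    # Longer entries first, so a short entry never splits a longer one.
    for surface in sorted(lexicon, key=len, reverse=True):
        if surface in masked:
            token = f"__ENT{len(mapping)}__"
            masked = masked.replace(surface, token)
            mapping.append((token, lexicon[surface]))
    return masked, mapping


def unmask_entities(translated: str, mapping: List[Tuple[str, str]]) -> str:
    """Put the lexicon translations back in place of the placeholders."""
    for token, target in mapping:
        translated = translated.replace(token, target)
    return translated


masked, mapping = mask_entities("山田太郎は医者です", BILINGUAL_LEXICON)
# The masked sentence would be passed to the NMT system here; the string
# below stands in for its output, assuming it copies placeholders through.
restored = unmask_entities("__ENT0__ is a doctor", mapping)
```

This design relies on the NMT system copying placeholder tokens through unchanged, which usually has to be ensured during training or through the system's own constraint mechanism.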

Cybersecurity

Large-scale entity lexicons can also play a major role in cybersecurity, where extraction models tend to ignore entities specific to the cybersecurity domain, such as the names of hackers and viruses. Cybersecurity can benefit significantly both from traditional CRF-based NER using ordinary entity lexicons and from security entity lexicons fine-tuned to specific entities.

Regularization

Regularization algorithms must perform optimally not only on training data but also on unseen input data such as orthographic variants of named entities. Large-scale entity lexicons can significantly enhance accuracy by compressing vector data and computing meaningful values for each variant.
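One possible reading of the point about orthographic variants is that variants can be mapped to a single canonical form before vocabulary lookup, so that all variants share one representation instead of each rare spelling being treated as an unknown token. The sketch below illustrates that idea only; the `VARIANT_TO_CANONICAL` table and the `normalize_tokens` function are hypothetical and are not taken from DeepLEX.

```python
from typing import Dict, List

# Hypothetical variant table: lowercased variant -> canonical form.
VARIANT_TO_CANONICAL: Dict[str, str] = {
    "mohammed": "muhammad",
    "mohamed": "muhammad",
    "muhammad": "muhammad",
}


def normalize_tokens(tokens: List[str], variants: Dict[str, str]) -> List[str]:
    """Map each known variant to its canonical form; leave other tokens unchanged."""
    return [variants.get(token.lower(), token) for token in tokens]


normalized = normalize_tokens(["Mohamed", "met", "Mohammed"], VARIANT_TO_CANONICAL)
```

With this normalization, both spellings in the example resolve to the same vocabulary entry, so a model sees one token for the name regardless of which romanization appears in the input.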

Pre-trained Models

Building pre-trained word association models using our DeepLEX Resources and combining them with other resources such as annotated corpora can lead to satisfactory results, especially for morphologically complex languages like Arabic.

Selected Resources

Chinese Personal Name Variants

Japanese Orthographic Database

Japanese Personal Name Variants

Japanese-Multilingual Place Names and POIs

Arabic Full-Form Lexicon

Database of Arabic Names

Reference Documents

DeepLEX: Lexical Resources for Deep Learning

DeepLEX: 深層学習用辞書データベース (Japanese version)

DeepLEX: 用于深度学习的词库资源 (Chinese version)