DeepLEX: Lexical Resources for Deep Learning
The CJK Dictionary Institute (CJKI) is actively developing very large-scale lexical resources, referred to as DeepLEX Resources, to support Deep Learning in such diverse areas as named entity recognition (NER), cybersecurity, neural machine translation (NMT), and speech technology.
Our DeepLEX Resources include tens of millions of CJK named entities specifically designed to support NLP applications such as NER and speech technology. They are being used by the world's largest IT companies in NLP and AI applications, including speech technology, machine translation, and natural language generation. (Please click through the resources listed below for more details on each database.)
Orthographic variants for core Japanese vocabulary, covering 126,000 entries
A multilingual database of 3.1 million Japanese and Western place names
Our DeepLEX Resources can benefit the development of Deep Learning systems and technology platforms in the following ways:
Named Entity Recognition (NER)
NER traditionally uses rule-based approaches, but in data-rich domains such as romanized personal name variants in Chinese and Arabic, these approaches do not always achieve adequate recall and precision. The integration of comprehensive, hard-coded lexicons covering tens or hundreds of millions of entries, such as those provided by CJKI, offers the most practical solution to achieving high accuracy.
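How a lexicon integrates into an NER pipeline can be illustrated with a minimal gazetteer lookup. The tiny two-entry lexicon below is purely a placeholder standing in for a large-scale resource of the kind described above; a production system would combine such lookups with statistical tagging.

```python
# Minimal sketch of gazetteer-based entity lookup. The lexicon here is a
# toy placeholder, not actual DeepLEX data.

def build_gazetteer(entries):
    """Map each surface form to its entity type."""
    return {surface: etype for surface, etype in entries}

def tag_entities(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup over a token sequence."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in gazetteer:
                matches.append((span, gazetteer[span]))
                i += n
                break
        else:
            i += 1
    return matches

gaz = build_gazetteer([("Mao Zedong", "PERSON"), ("Beijing", "PLACE")])
print(tag_entities("Mao Zedong was born near Beijing".split(), gaz))
```

Longest-match-first lookup matters for names, since many entity strings are prefixes of longer ones.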
Neural Machine Translation (NMT)
NMT performs poorly on low-frequency content words, especially named entities. Integrating DeepLEX data into NMT systems can substantially improve translation accuracy.
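One common way a bilingual entity lexicon assists NMT is placeholder substitution: entities are masked before translation and restored from the lexicon afterwards, so the model never has to translate rare names itself. The sketch below assumes this technique; the two-entry lexicon and the surrounding workflow are illustrative, not an actual DeepLEX integration.

```python
# Sketch of lexicon-assisted NMT pre/post-processing. Entities found in a
# bilingual lexicon are replaced with placeholder tags before translation
# and restored afterwards. ENTITY_LEXICON is toy placeholder data.

ENTITY_LEXICON = {"東京": "Tokyo", "大阪": "Osaka"}

def mask_entities(sentence, lexicon):
    """Replace known source-side entities with numbered placeholder tags."""
    mapping = {}
    for i, (src, tgt) in enumerate(lexicon.items()):
        if src in sentence:
            tag = f"<NE{i}>"
            sentence = sentence.replace(src, tag)
            mapping[tag] = tgt
    return sentence, mapping

def unmask(translated, mapping):
    """Restore target-side entity translations after NMT decoding."""
    for tag, tgt in mapping.items():
        translated = translated.replace(tag, tgt)
    return translated

masked, mapping = mask_entities("東京から大阪まで", ENTITY_LEXICON)
# The masked sentence would be passed to the NMT model; its output is
# then post-processed with unmask().
print(unmask("from <NE0> to <NE1>", mapping))
```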
Cybersecurity
Large-scale entity lexicons can also play a major role in cybersecurity, but extraction models tend to ignore entities specific to the cybersecurity domain, such as names of hackers and viruses. Cybersecurity can benefit significantly both from traditional CRF-based NER using general-purpose entity lexicons and from security entity lexicons fine-tuned to domain-specific entities.
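In CRF-based NER, a domain lexicon typically enters the model as a per-token membership feature. The sketch below shows such a feature function; the malware names are invented stand-ins for a real security entity lexicon, and a full system would feed these dictionaries into a CRF toolkit rather than use them alone.

```python
# Sketch of gazetteer features for a CRF-based NER tagger: each token gets
# a binary feature marking membership in a domain lexicon. The lexicon
# entries are illustrative placeholders.

SECURITY_LEXICON = {"WannaCry", "Stuxnet", "Emotet"}

def token_features(tokens, i, lexicon):
    """Build a feature dict for token i, including a lexicon-membership flag."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),
        "in_security_lexicon": w in lexicon,
    }

print(token_features("The WannaCry worm spread".split(), 1, SECURITY_LEXICON))
```

The lexicon flag lets the CRF recognize domain entities it has rarely or never seen in the annotated training data.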
Regularization
Regularization algorithms must perform well not only on training data but also on unseen input such as orthographic variants of named entities. Large-scale entity lexicons can significantly enhance accuracy by compressing vector data and computing meaningful values for each variant.
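The simplest form of variant handling with a lexicon is normalization: every orthographic variant is mapped to one canonical form, so a downstream model sees a single representation per entity instead of fragmenting its statistics across spellings. The variant entries below are illustrative examples, not actual database content.

```python
# Sketch of lexicon-driven variant normalization. The mapping is toy data;
# a real variant database would cover entries on a far larger scale.

VARIANTS = {
    "Macdonald": "McDonald's",
    "MacDonald": "McDonald's",
    "マクドナルド": "McDonald's",
}

def normalize(token, variants):
    """Return the canonical form of a token, or the token itself if unknown."""
    return variants.get(token, token)

print(normalize("Macdonald", VARIANTS))
print(normalize("Tokyo", VARIANTS))
```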
Word Embeddings
Building pre-trained word association models from our DeepLEX Resources and combining them with other resources, such as annotated corpora, can yield strong results, especially for morphologically complex languages like Arabic.
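One way a lexicon contributes to such models is at segmentation time: multiword entities are pre-joined into single tokens before association statistics (or embeddings) are computed, so each entity receives its own vector. The sketch below uses simple co-occurrence counts as a stand-in for real embedding training, with an invented one-entry lexicon.

```python
# Sketch of lexicon-aware preprocessing for word association models.
# Multiword entities from a lexicon are joined into single tokens, then
# co-occurrence counts are gathered; a real system would train embeddings
# (e.g. word2vec-style) on far larger data.

from collections import Counter

def join_entities(tokens, lexicon, max_len=3):
    """Pre-join multiword lexicon entries into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in lexicon:
                out.append(span.replace(" ", "_"))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

def cooccurrence(sentences, window=2):
    """Count ordered word pairs within a small context window."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for c in tokens[i + 1:i + 1 + window]:
                counts[(w, c)] += 1
    return counts

lex = {"New York"}
print(join_entities("I love New York city".split(), lex))
```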