Announcing World's First Comprehensive
SPANISH FULL FORM LEXICON
©2011 The CJK Dictionary
The CJK Dictionary Institute (CJKI), which specializes in the compilation of large-scale CJK (Chinese, Japanese and Korean), Spanish and Arabic lexical resources, is pleased to announce the release of the Spanish Full Form Lexicon (S-FULEX). Developed by a team of experts at our institute over many years for a US government-sponsored project aimed at achieving human-quality Spanish-English machine translation, this large-scale bilingual Spanish dictionary with over 26 million records is finally being made available to the machine translation (MT) and natural language processing (NLP) communities for research and product development.
What is a Full Form Lexicon?
A full form lexicon is a comprehensive lexical database that contains all inflected, declined and conjugated forms of a language. Unlike an ordinary dictionary, which lists only canonical forms (base lexemes) such as eat, this new type of lexicon includes all inflected forms, such as eating, eaten and ate. In English, each word has only a handful of such wordforms, but languages like Japanese, Spanish and Arabic can have thousands of inflected forms for each verb. Moreover, some languages have clitics, such as pronominal suffixes and reflexive particles, that further increase the number of forms. For example, Spanish lavarse has cliticized forms such as lavándome, and decir has doubly cliticized forms such as decírselo.
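The idea can be illustrated with a minimal Python sketch. The entries, the field names and the `analyze` helper are hypothetical and hand-written for illustration; they do not reflect the actual S-FULEX data format.

```python
# Toy full form lexicon: every surface form is stored explicitly,
# so analysis is a dictionary lookup rather than rule application.
# All entries and field names are illustrative assumptions.
FULL_FORMS = {
    "eat":       {"lemma": "eat", "form": "base"},
    "eating":    {"lemma": "eat", "form": "present participle"},
    "eaten":     {"lemma": "eat", "form": "past participle"},
    "ate":       {"lemma": "eat", "form": "past"},
    "lavándome": {"lemma": "lavarse", "form": "gerund", "clitics": ["me"]},
    "decírselo": {"lemma": "decir", "form": "infinitive", "clitics": ["se", "lo"]},
}


def analyze(wordform):
    """Return the stored analysis for a wordform, or None if it is not listed."""
    return FULL_FORMS.get(wordform.lower())


print(analyze("ate"))
print(analyze("decírselo"))
```

An ordinary dictionary would contain only the `eat` entry; here the irregular form `ate` and the cliticized Spanish forms are found without any morphological rules.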
Why a Full Form Lexicon?
Traditionally, morphological analyzers, MT and other NLP systems have been (and often still are) rule-based. More recently, statistical methods, such as those provided by Language Weaver and Google, have become increasingly prevalent. For some language pairs, the results of Statistical Machine Translation (SMT) are comparable to those of rule-based systems, but there is still much room for improvement, especially for languages for which large bilingual corpora are not available. Another MT paradigm is Example-Based Machine Translation (EBMT), an approach that uses bilingual parallel texts to translate by analogy (a kind of machine learning).
A new approach to MT, which has resulted in a dramatic increase in translation quality, combines EBMT with a full form bilingual lexicon. Monolingual full form lexicons, such as those pioneered by Dr. Franz Günthner of the University of Munich, have demonstrated their effectiveness in the popular enterprise search product Fast ESP. At CJKI we have gone one step further and brought full form lexicons to the realm of bilingual MT. That is, we have developed a full form bilingual lexicon for Spanish (S-FULEX), which has been used in conjunction with an MT system to produce translations of such high quality as to be virtually indistinguishable from human translation.
Because of their large size, full form lexicons require significant computational resources, both memory and processing power, and are very expensive and time-consuming to develop. Thanks to the dramatic advances in hardware technology in recent years, these lexicons have finally come of age.
Full form lexicons can bring the following benefits to MT and other NLP applications:
- Greatly enhanced translation quality for MT and other NLP applications (higher recall).
- Significantly simplified algorithms for morphological analysis.
- Dramatically improved named entity recognition and entity extraction.
- Support for query processing for information retrieval applications.
- Automatic conjugation systems for pedagogical and NLP applications.
- Part-of-speech (POS) analysis and POS tagging.
- Greater accuracy in determining the root of inflected and derived forms.
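As a rough illustration of the query processing and root determination items above, the following Python sketch expands a search term to every listed form of the same lemma. The records and the `expand_query` helper are hypothetical; a real lexicon would also need to handle forms that belong to more than one lemma.

```python
from collections import defaultdict

# Hypothetical (wordform, lemma) records; not actual S-FULEX data.
RECORDS = [
    ("hablo", "hablar"),
    ("hablas", "hablar"),
    ("habló", "hablar"),
    ("hablando", "hablar"),
    ("comen", "comer"),
    ("comió", "comer"),
]

# Index the records in both directions.
FORM_TO_LEMMA = dict(RECORDS)
LEMMA_TO_FORMS = defaultdict(list)
for form, lemma in RECORDS:
    LEMMA_TO_FORMS[lemma].append(form)


def expand_query(term):
    """Return all listed forms sharing the term's lemma, or [term] if unlisted."""
    lemma = FORM_TO_LEMMA.get(term)
    if lemma is None:
        return [term]
    return sorted(LEMMA_TO_FORMS[lemma])


print(expand_query("habló"))
```

Because the root is looked up rather than derived by stripping suffixes, an irregular form is handled in exactly the same way as a regular one.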
For about eight years, CJKI's team of lexicographers and software engineers has been engaged in the development of full form lexicons for Spanish, Arabic and Japanese. To this end, we have analyzed the grammar and morphology of these languages in great depth, to a degree well beyond that found in comprehensive descriptive grammars for these languages.
A distinctive feature of the FULEX series is that the lexicons are fully bilingual, so that every inflected form is given equivalents, usually several, in the target language (see next section). Detailed documents on the FULEX series will eventually be released to the public. Below are its most important features.
- Comprehensive coverage.
- Includes all inflected and declined wordforms.
- Rich set of useful attributes, such as conjugation patterns and orthographic variants.
- Dozens of data fields with supplementary information for each entry, such as cross references to canonical forms.
- Detailed part-of-speech and other grammatical codes.
- Fully bilingual: every entry is mapped to its (often multiple) English equivalent(s).
S-FULEX is extremely comprehensive. Our mission was to create the largest Spanish-English lexicon for general vocabulary in existence. S-FULEX surpasses the coverage of the most comprehensive Spanish dictionaries ever published, including the prestigious Diccionario de la Lengua Española published by the Real Academia Española, and of all commercial bilingual Spanish dictionaries including unabridged ones published by HarperCollins and Oxford.
S-FULEX contains over 26,000,000 records. This does not include technical terms and proper nouns, except for the most common ones, since S-FULEX is designed to cover general vocabulary rather than specialized domains. The reason for this huge number of records lies in the nature of bilingual full form lexicons. That is, not only are all forms included (inflected, plural, feminine and affixed forms), but all English equivalents for each of these forms are given as well.
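The way the record count grows can be sketched in a few lines of Python. The data below is a hypothetical toy sample (the second entry is invented for illustration), and the one-record-per-pair layout is an assumption about how such a lexicon might be stored.

```python
# Hypothetical toy data: each Spanish form maps to several English equivalents.
EQUIVALENTS = {
    "(yo) hablo": ["(I) speak", "(I) am speaking", "(I) talk", "(I) am talking"],
    "(tú) hablas": ["(you) speak", "(you) talk"],
}


def to_records(equivalents):
    """Flatten form -> [equivalents] into one (Spanish, English) record per pair."""
    return [(form, eng) for form, engs in equivalents.items() for eng in engs]


records = to_records(EQUIVALENTS)
record_count = len(records)
print(record_count)
```

Two forms already yield six records here. Applied to every form of every lemma, the same multiplication accounts for totals in the millions.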
A normal dictionary for the entry hablar might list the following equivalents: speak, talk, converse, discuss, call, and phone. S-FULEX includes 3429 Spanish-English pairs for hablar (the full list can be found at hablar_samp.xls).
Though it may seem strange that such forms as yo hablo are translated as I would have spoken, the English equivalents are based on in-depth analysis of bilingual tense mappings and other translational equivalence classes between Spanish and English. The goal is to provide as many equivalents as possible, ranked by importance, and let other algorithms determine which candidate is most appropriate to the context. A small subset of the pairs for hablar is shown below:
|(yo) hablo||(I) speak|
|(yo) hablo||(I) am speaking|
|(yo) hablo||(I) spoke|
|(yo) hablo||(I) will speak|
|(yo) hablo||(I) shall speak|
|(yo) hablo||(I) would have spoken|
|(yo) hablo||(I) am going to speak|
|(yo) hablo||(I) talk|
|(yo) hablo||(I) am talking|
|(yo) hablo||(I) talked|
|(yo) hablo||(I) will talk|
|(yo) hablo||(I) shall talk|
|(yo) hablo||(I) would have talked|
|(yo) hablo||(I) am going to talk|
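The division of labor described here, in which the lexicon supplies ranked candidates and another component picks one, might be sketched as follows. The list order is assumed to stand for the importance ranking, and the `top_candidates` and `choose` helpers, along with the acceptance test passed to `choose`, are hypothetical stand-ins for a real context model.

```python
# Candidates for one Spanish form, in assumed rank order (most important first).
RANKED = {
    "(yo) hablo": [
        "(I) speak",
        "(I) am speaking",
        "(I) spoke",
        "(I) will speak",
    ],
}


def top_candidates(form, n=3):
    """Return the n highest-ranked English equivalents for a form."""
    return RANKED.get(form, [])[:n]


def choose(form, is_acceptable):
    """Return the highest-ranked candidate accepted by a downstream check."""
    for candidate in RANKED.get(form, []):
        if is_acceptable(candidate):
            return candidate
    return None


# A trivial stand-in for context analysis: require a future-tense rendering.
print(choose("(yo) hablo", lambda c: "will" in c))
```

The lexicon itself makes no contextual decision; it only guarantees that the right equivalent is among the candidates offered.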