
On July 22, 2007, Jack Halpern presented an academic paper on ARAN/NANA at the Second Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2), held at Stanford University. The full paper is:
The Challenges and Pitfalls of Arabic Romanization and Arabization (.pdf file, 293K). See also the PowerPoint slide show presentation (240K) given at CAASL2.
| /qaabuus/ | slashes for phonemic transcription |
| [qɑːbuːs] | square brackets for phonetic transcription (in IPA) |
| Qaboos | italics for English and Latin popular transcriptions |
| \qAbws\ | back slashes for Buckwalter transliteration |
| 'writer' | single quotes for English equivalents of Arabic words |
The process of automatically converting unvocalized Arabic to a Roman script representation, called romanization, and such related operations as adding vowels to unvocalized Arabic, called vocalization, are challenging tasks to which there is no definitive solution. This document describes a system for automatically romanizing Arabic names, called ARAN for Automatic Romanizer of Arabic Names, and some of the relevant linguistic issues. For example, ARAN can romanize a name like قابوس into a large variety of systems, such as /qaabuus/ (phonemic), Qabous (popular), \qAbws\ (Buckwalter), and [qɑːbuːs] (IPA).
Developed by a team of experts on Arabic orthography and phonology at The CJK Dictionary Institute, ARAN is a versatile system that performs a full range of computational linguistic tasks required for processing Arabic names. Though the focus is on processing Arabic names, it can for the most part be applied to processing Arabic texts in general. ARAN consists of multiple modules that perform such tasks as phonetic and phonemic transcription, transliteration, name variant generation, vocalization, code conversion and language identification.
A sister package of ARAN, called NANA for Non-Arabic Name Arabizer (نعنع 'mint' in Arabic), performs the opposite operation: it arabizes, with a high degree of accuracy, non-Arabic names. This is not limited to Latin names, such as Jack Halpern to جـاك هلبرن , but also includes a truly unique technology: the direct arabization of CJK names. This means that the arabization rules and algorithms are specifically adjusted to names written in their original scripts -- romanized readings (shown in parentheses below) are unnecessary. For example, the Japanese place name 埼玉 is arabized as سايتاما (Saitama), the Chinese name 杨海洋 as يانغ هاييانغ (Yang Haiyang), and the Korean city of 부산 as بوسان (Busan).
We have spared no effort to tackle every aspect of the many tough linguistic challenges by doing meticulous research and analysis, by writing sophisticated algorithms, and by building comprehensive mapping tables. We are confident that the ARAN and NANA systems offered here represent the best of romanization and arabization technology today.
Ultimately, to fully disambiguate an Arabic string, a software tool must "understand" the text through a semantic and syntactic analysis of the context. Though ARAN does not do that yet, it is nonetheless a highly practical tool that meets the needs of identifying, processing and normalizing names and their numerous variants in a variety of real-world applications, such as:
Romanization of Arabic has such uses as:
The Arabic script is a member of a class of Semitic scripts known as abjads. A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels. This is referred to as unvocalized Arabic (or unvoweled Arabic). Though diacritics, and some consonants, are used to indicate vowels, these are sparsely used. On the whole, unvocalized Arabic is ambiguous, in some cases highly ambiguous, posing significant challenges to Arabic information processing.
For example, the two letters مو \mw\ can theoretically represent 25 legitimate consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc. Humans can normally disambiguate this by context, but for a computer program the task is formidable. An example of an ambiguous unvocalized word is كاتب \kAtb\ , which can represent any of the seven vocalized wordforms below:
- كَاتِب /kaatib/
- كَاتَبَ /kaataba/
- كَاتِبٍ /kaatibin/
- كَاتِبٌ /kaatibun/
- كَاتِبَ /kaatiba/
- كَاتِبِ /kaatibi/
- كَاتِبُ /kaatibu/
The main reason for this ambiguity is that Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3487 (out of a theoretical 10541) vocalized forms, including identical forms of distinct function (called inflectional syncretism) and sense.
There is much confusion surrounding such terms as transliteration, transcription, and romanization. It is important to understand these concepts correctly. In the definitions below, the common name Muhammad, written محمد in Arabic script, is used for illustration. More information is available at Transliteration and Transcription Technology.
4.1 Romanization
The representation of a language written in a non-Roman script, such as Chinese or Arabic, in the Roman or Latin alphabet. This includes transliteration and the various types of transcription described below.
4.2 Transliteration
A representation of the script of a source language by using the characters of another script. It aims to represent the letters (graphemes), rather than the sounds (phonemes), of the source language, by one (sometimes multiple) characters in an unambiguous way. For example, محمد is transliterated as \mHmd\, with each Arabic letter represented unambiguously by one Roman character, as shown below:
Transliteration of محمد

| Arabic Letter | Contextual Form | Transliteration | Letter Name |
|---|---|---|---|
| م | ﻣ | m | miim |
| ح | ﺤ | H | Haa' |
| م | ﻤ | m | miim |
| د | ﺩ | d | daal |

In good transliteration systems there is a one-to-one correspondence that enables round-trip conversion. A widely used system for transliterating Arabic on a letter-by-letter basis is the excellent Buckwalter transliteration.
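The one-to-one correspondence and round-trip property described above can be illustrated with a minimal Python sketch. The mapping follows the published Buckwalter table for the basic Arabic letters and diacritics; the function names `to_buckwalter` and `from_buckwalter` are hypothetical, and this is not the AXAN implementation.

```python
# Buckwalter table for the basic Arabic block: Unicode character -> ASCII symbol.
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062A": "t", "\u062B": "v", "\u062C": "j",
    "\u062D": "H", "\u062E": "x", "\u062F": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063A": "g", "\u0640": "_", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u0649": "Y",
    "\u064A": "y", "\u064B": "F", "\u064C": "N", "\u064D": "K",
    "\u064E": "a", "\u064F": "u", "\u0650": "i", "\u0651": "~",
    "\u0652": "o", "\u0670": "`", "\u0671": "{",
}

# Because every Arabic character maps to a distinct symbol, the table inverts cleanly.
REVERSE = {symbol: char for char, symbol in BUCKWALTER.items()}


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic script letter by letter; other characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Convert a Buckwalter string back to Arabic script."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


name = "\u0645\u062D\u0645\u062F"             # the name Muhammad, unvocalized
print(to_buckwalter(name))                    # mHmd
print(from_buckwalter("mHmd") == name)        # True
```

Note that the reverse direction is only meaningful for strings that are already in Buckwalter notation; ordinary Latin text passed to `from_buckwalter` would be converted letter by letter as well.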
Note that the term transliteration is often misleadingly used in the sense of transcription, which is very confusing and should be avoided.
4.3 Transcription
A representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic (character-to-character) correspondence. There are three kinds of transcription:
4.4 Vocalization
The process of automatically adding vowels to unvocalized Arabic. For example, the unvocalized محمد \mHmd\ is vocalized as مُحَمَّد \muHam~ad\. Note the four diacritics that were added in the vocalized version. This is difficult to do even for native speakers unless trained in Arabic phonology. For a computer program, the high level of ambiguity makes it extremely challenging.
4.5 Arabization
As used here, arabization refers to the process of automatically converting an Arabic or non-Arabic name written in the Latin or CJK native script into Arabic script. For example, Muhammad → محمد , Jack → جاك, and 埼玉 (Saitama) → سايتاما.
4.6 Vocalization Modes
Arabic is written mostly in unvocalized script, which is why it is so difficult to transcribe and is the raison d'être for the ARAN system. Vocalized Arabic is found in the Koran, children's books, and didactic materials such as dictionaries. The Koran is fully vocalized (explicit short vowels, gemination, nunation etc.), but in other cases one often encounters partially vocalized or semivocalized texts.
ARAN supports three modes of vocalization: unvocalized, semivocalized, and fully vocalized, as illustrated below:
| Mode | Arabic | Transcription | Transliteration |
|---|---|---|---|
| Unvocalized | كتب | /kutiba/ | \ktb\ |
| Semivocalized | كُتب | /kutiba/ | \kutb\ |
| Fully vocalized | كُتِبَ | /kutiba/ | \kutiba\ |

Transcribing vocalized and semivocalized Arabic is considerably easier than transcribing unvocalized Arabic. However, it requires a different set of rules. Similarly, vocalizing unvocalized Arabic is just as difficult as transcribing it, but again requires a different set of rules. Each ARAN module has a knowledge base that captures the precise rules for the different vocalization modes.
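The three modes can be told apart mechanically by looking at the Unicode diacritics in a word. The sketch below is only an illustration under stated assumptions: the names `unvocalize` and `vocalization_mode` are hypothetical, and the "every letter carries a mark" test is a crude heuristic that will label words with long vowels (whose alif, waaw or yaa' carry no mark) as semivocalized. It is not how ARAN classifies its input.

```python
import re

# Tanwin, fatha, damma, kasra, shadda, sukun (U+064B..U+0652) and dagger alif (U+0670).
MARKS = "\u064B-\u0652\u0670"
DIACRITICS = re.compile(f"[{MARKS}]")
LETTER_WITH_MARK = re.compile(f"[^{MARKS}][{MARKS}]+")


def unvocalize(word: str) -> str:
    """Strip all vowel signs and related diacritics, leaving the bare letters."""
    return DIACRITICS.sub("", word)


def vocalization_mode(word: str) -> str:
    """Classify a single word as unvocalized, semivocalized, or fully vocalized."""
    bare = unvocalize(word)
    if bare == word:
        return "unvocalized"
    # Count base letters that are directly followed by at least one mark.
    marked = len(LETTER_WITH_MARK.findall(word))
    return "fully vocalized" if marked == len(bare) else "semivocalized"


full = "\u0643\u064F\u062A\u0650\u0628\u064E"   # kutiba with all three vowels
semi = "\u0643\u064F\u062A\u0628"               # only the first vowel written
print(vocalization_mode(full))                  # fully vocalized
print(vocalization_mode(semi))                  # semivocalized
print(vocalization_mode(unvocalize(full)))      # unvocalized
```

Going from vocalized to unvocalized text is a simple deletion, as `unvocalize` shows; the reverse direction is the hard problem this document describes.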
5.1 Basic Goals and Methodology
ARAN aims to provide a robust solution to the difficult task of romanizing Arabic names, including all the transcription subtypes described above. CJKI is engaged in ongoing research and development efforts to enhance the functionality of the various ARAN modules, especially ATAN, ARAN's core module for generating phonemic and popular transcriptions. The main emphasis is on automatically transcribing unvocalized Arabic names into as many popular romanized variants as possible.
The most difficult challenge, the core problem to which ARAN provides a solution, is to make an intelligent guess at determining the vowels of unvocalized Arabic names and generating a list of likely candidates on the basis of statistical models and in-depth analysis of Arabic orthography. If a name is not found in our comprehensive Database of Arabic Names (DAN), variants are generated in various romanization systems by linguistically advanced algorithms using a sophisticated knowledge base that captures the rules of Arabic orthography. DAN, which now has approximately one and a half million entries, is undergoing major expansion and is expected to grow substantially in size by the end of 2008 to cover all the major countries in the Middle East.
5.2 ARAN Modules
ARAN consists of the following components, described in more detail in the sections below:
The table below illustrates the conversion processes performed by the principal ARAN modules using the Arabic name Qaboos (قابوس) as an example. It shows the data input to each module and the resulting output after processing. Each module is described in more detail in the sections below. To get an overview of ARAN's features and capabilities, please study this table carefully.
| Conversion process | ARAN module | Input | Output | Remarks |
|---|---|---|---|---|
| Phonemic Transcription | ATAN | قابوس | /qaabuus/ | linguistic representation of phonemes |
| English Transcription | ATAN | قابوس | Qaboos | "Standard" English spelling |
| Popular Transcriptions | ATAN | قابوس | Qabuus, Qabus, Qabous, Qabooss, Qaaboos, Kaboos, Kabuus, Gabous... | some of the many popular variants |
| Phonetic Transcription | APAN | قابوس | [qɑːbuːs] | scientific transcription in IPA |
| Unvocalized Transliteration | AXAN | قابوس | \qAbws\ | Buckwalter transliteration of unvocalized Arabic |
| Vocalized Transliteration | AXAN | قَابُوس | \qaAbuws\ | Buckwalter transliteration of vocalized Arabic |
| Diacriticization | ADAN | قابوس | قَابُوس | adding vowels (vocalization) and diacritics to unvocalized Arabic |
| Arabization | NANA | Qabuus, Qabus, etc. | قابوس | converting non-Arabic to Arabic script |
The Automatic Transcriber of Arabic Names, or ATAN for short, is ARAN's core module for generating phonemic and popular transcriptions of Arabic personal names.
Because of the inconsistent nature of the various popular Arabic romanization systems, there are often many, sometimes dozens or even hundreds, of romanizations for the same name. ATAN supports most of the commonly used systems, and has a flexible architecture that enables the user to configure the system to support user-defined systems.
The table below shows some of the major romanization systems. Though transcription is handled by the ATAN module and transliteration by the AXAN module, for convenience examples of both are given below (follow the links for more details).
| System | Example | Description |
|---|---|---|
| ALA-LC | shwlwkh | Romanization standard of the American Library Association--Library of Congress. ⇒ More info. |
| IC | Shulukh | Intelligence Community Standard. ⇒ More info. |
| DIN | šūlūḫ | DIN 31635, the DIN standard for Arabic transliteration. |
| BGN/PCGN | Shūlūkh | The official system adopted by the U.S. Board on Geographic Names (BGN) and the Permanent Committee on Geographical Names (PCGN). ⇒ More info. |
| IPA | ʃuːluːx | International Phonetic Alphabet, a scientific system of representing speech sounds. ⇒ More info. |
| English | Shoulokh | One of many possible popular transcriptions ⇒ More info. |
| Buckwalter | $wlwx | A strict transliteration system widely used in information processing. ⇒ More info. |
In addition to the systems shown above, there are others not shown here, such as Deutsche Morgenländische Gesellschaft, ISO/R 233 and SATTS, many of which will be supported by the ATAN and AXAN modules.
The Automatic Transliterator of Arabic Names, or AXAN for short, generates transliterations of Arabic names or any other Arabic text. There are few strict transliteration systems; that is, systems that use unique symbols for each letter and allow for round-trip conversion. The excellent and widely used Buckwalter transliteration system is not only supported by AXAN, but is also used for internal processing in all ARAN databases and algorithms. AXAN can be configured to support other transliteration systems, including Cyrillization, by adding custom mapping tables. Examples are shown in the table in Section 6 (ATAN).
A table comparing romanization systems can be found at Romanization of Arabic.
The Automatic Phoneticizer of Arabic Names, or APAN for short, generates phonetic transcriptions of Arabic names in IPA. This represents the actual pronunciation in Modern Standard Arabic (MSA), including distinctions between the major allophones. APAN can be configured to generate transcriptions in various flavors of MSA pronunciation, e.g. the Saudi, Egyptian and Levantine flavors. "Flavors" here refers to variations in the pronunciation of MSA in various regions of the Arab world, and is not to be confused with Arabic dialects.
For example, the name قابوس Qaboos is transcribed phonetically as [qɑːbuːs]. Note that the phonemic transcription /qaabuus/ generated by ATAN indicates the long vowel a by /aa/ and does not indicate the phonetic details of that vowel other than that it is long, a phonemic distinction. In contrast, the IPA phonetic transcription generated by APAN for this vowel is [ɑː], distinguishing it from its more common realization [æː], since [ɑː] is an allophonic variant of /aa/ that occurs after the uvular stop [q]. Thus the phonemic transcription /aa/ represents a single phoneme, which can be realized phonetically as [æː] or [ɑː].
This is further illustrated by the table below.
| Arabic | English | Phonemic | Phonetic Gulf | Phonetic Egyptian | Phonetic Levantine |
|---|---|---|---|---|---|
| قابوس | Qaboos | /qaabuus/ | [qɑːbuːs] | [ʔɑːbuːs] | [qɑːbuːs] |
| جمال | Jamal | /jamaal/ | [dʒɛ̈mɛ̈ːl] | [gɛ̈mɛ̈ːl] | [ʒɛmɛ̈ːl] |
The Automatic Diacriticizer of Arabic Names, or ADAN for short, performs automatic diacriticization; that is, it automatically vocalizes unvocalized or semivocalized Arabic by adding the appropriate vowel signs and other diacritics. For example, the well-known name Muhammad, written محمد \mHmd\ in unvocalized Arabic, is converted into the vocalized version مُحَمَّد \muHam~ad\ (/muHammad/) by adding the diacritics damma, fatha and shadda. This is related to, but distinct from, the equally difficult task of automatically generating a romanized phonemic transcription, which is done by the ATAN module.
Below are some examples of the output from the ADAN module.
| Unvocalized | Vocalized | Transcription | English |
|---|---|---|---|
| محمد | مُحَمَّد | muHammad | Muhammad |
| إبراهيم | إِبْرَاهِـيم | 'ibraahiim | Abraham |
| إسحاق | إِسْحَاق | 'isHaaq | Isaac |
| الرياض | الرِّيـَاض | arriyaaD | Riyadh |
| مكة | مَـكـَّة | makkah | Mecca |
| القاهرة | الْقـَاهِـرَة | alqaahirah | Cairo |
10.1 Romanization Variants
The many popular transcriptions of Arabic names result in a very large number of variants. One of the main factors contributing to this is that several Arabic consonants do not exist in European languages. These sounds are difficult to pronounce and are rendered in different ways when romanized. Another factor is the vowels, which are transcribed in a bewildering variety of ways, partially due to dialectal variation. For example, the Arabic vowel /u/ in /usama/ is transcribed in such different ways as Usama, Ousama, Osama and Oosama.
| Arabic | Buckwalter Transliteration | Popular Transcription |
|---|---|---|
| معـمر | mEmr | Moammar |
| معـمر | mEmr | Muammar |
| معـمر | mEmr | Mu'ammar |
| معـمر | mEmr | Mu`ammar |
| معـمر | mEmr | Mo'ammar |
| معـمر | mEmr | Moammar |
| معـمر | mEmr | Moamer |
| معـمر | mEmr | Moamar |
| معـمر | mEmr | Mohamar |
For more details on romanized variants of Arabic names, see our Database of Arabic Name Variants.
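The combinatorial growth of popular variants can be sketched by expanding each phoneme of a phonemic transcription into its common spellings. The following Python example is a minimal illustration only: the `ALTERNATIVES` table and the function names are hypothetical, chosen to reproduce some of the variants of Qaboos listed in the ARAN module table, and they do not reflect ATAN's actual rules or statistical models.

```python
from itertools import product

# Hypothetical spelling alternatives for a few phonemes (illustration only).
ALTERNATIVES = {
    "aa": ["a", "aa"],
    "uu": ["uu", "u", "ou", "oo"],
    "q": ["q", "k", "g"],
    "s": ["s", "ss"],
}


def segment(phonemic: str) -> list[str]:
    """Split a phonemic string into segments, preferring the longest known key."""
    keys = sorted(ALTERNATIVES, key=len, reverse=True)
    segments, i = [], 0
    while i < len(phonemic):
        for key in keys:
            if phonemic.startswith(key, i):
                segments.append(key)
                i += len(key)
                break
        else:
            # No alternative known for this character: keep it unchanged.
            segments.append(phonemic[i])
            i += 1
    return segments


def popular_variants(phonemic: str) -> list[str]:
    """Generate every combination of the per-segment spellings, capitalized."""
    options = [ALTERNATIVES.get(seg, [seg]) for seg in segment(phonemic)]
    return sorted({"".join(combo).capitalize() for combo in product(*options)})


variants = popular_variants("qaabuus")
print(variants)
```

Even this tiny table yields dozens of spellings for a five-phoneme name, which is why a real system needs frequency data to rank the candidates rather than simply enumerate them.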
10.2 Arabic Orthographic Variants
The second kind of variant is variation in the Arabic name itself. These variants are of three kinds:
Though the difference between variants and errors cannot be rigorously defined (there may be differences of opinion among native speakers as to what constitutes an error), they are both based on deep statistical and linguistic analysis of contemporary Arabic orthography, and provide fairly exhaustive coverage of Arabic orthographic variation. It should also be noted that the standard form, though linguistically correct, is not necessarily the most common one (we have statistics for the occurrence of each form).
| Standard | Buckwalter | English | Variant | Error | Remarks |
|---|---|---|---|---|---|
| أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى / ابو ظبى | V: omit hamza; E: alif maqsura replaces yaa' |
| الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | V: omit hamza; E: haa' replaces taa' marbuuTa |
| جدة | jdp | Jeddah | جدّة | جده | V: explicit shadda; E: haa' replaces taa' marbuuTa |
| الأردن | Al>rdn | Jordan | الاردن | | V: omit hamza |
| بالو ألتو | bAlw >ltw | Palo Alto | بالو التو / بالو آلتو | | V1: omit hamza; V2: madda replaces hamza |
| الرياض | AlryAD | Riyadh | الرّياض | | V: explicit shadda |
| طوكيو | Twkyw | Tokyo | | توكيو | E: taa' replaces Taa' |
For details see our Dictionary of Arabic Place Name Variants.
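One common way to match the kinds of variants and errors shown in the table above is to fold each name to a normalized key before comparison. The Python sketch below is an assumption-laden illustration, not the method used in ARAN: the `fold` function and its folding table are hypothetical, and they cover only hamza-on-alif, madda, alif maqsura, taa' marbuuTa, tatweel and diacritics.

```python
# Folding table: map orthographic variants to one canonical character,
# or to None to delete the character entirely.
FOLD = {
    ord("\u0623"): "\u0627",   # alif with hamza above -> bare alif
    ord("\u0625"): "\u0627",   # alif with hamza below -> bare alif
    ord("\u0622"): "\u0627",   # alif with madda       -> bare alif
    ord("\u0649"): "\u064A",   # alif maqsura          -> yaa'
    ord("\u0629"): "\u0647",   # taa' marbuuTa         -> haa'
    ord("\u0640"): None,       # tatweel (elongation)  -> removed
}
# Remove tanwin, short vowels, shadda and sukun (U+064B..U+0652).
FOLD.update({cp: None for cp in range(0x064B, 0x0653)})


def fold(name: str) -> str:
    """Return a normalized key under which spelling variants compare equal."""
    return name.translate(FOLD)


standard = "\u0623\u0628\u0648 \u0638\u0628\u064A"   # Abu Dhabi, standard form
error = "\u0627\u0628\u0648 \u0638\u0628\u0649"      # no hamza, alif maqsura
print(fold(standard) == fold(error))                  # True
```

Folding of this kind deliberately loses information, so it suits lookup keys rather than display. It also does not catch substitutions between distinct consonants, such as the taa'/Taa' error in the Tokyo row, which would need a separate similarity measure.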
AEAN is a code conversion module that supports various legacy encodings for Arabic, re-encoding the text into UTF-8 or UTF-16. It supports the following encodings:
This module enables the automatic identification of a language written in the Arabic script. There are dozens of non-Arabic languages that are or have been written in the Arabic script, referred to as Arabic Script Based Languages (ASBL). The most important of these are:
Others include Shahmukhi (Pakistani version of Punjabi), Kashmiri (India and Pakistan), and Uyghur (northwest China).
ARAN will eventually be expanded to romanize to/from the major Arabic Script Based Languages (ASBL), described in Section 12 above.