
The CJK Dictionary Institute is engaged in the development and continuous expansion of comprehensive lexical databases for CJK languages and Arabic consisting of approximately eight million entries (see CJK Lexical Resources for details). This document describes our database of Arabic place names. We also maintain a large database of Arabic Personal Names and their variants.
Though Arabic has become a world language of critical importance, lexical resources for it, especially for proper nouns, are either scarce or exist only on a small scale. Because place names play an important role in natural language applications such as named entity recognition (NER) and machine translation, we are continuously expanding and revising our Dictionary of Arabic Place Name Variants, which provides systematic coverage of Arabic orthographic variants and common orthographic errors.
It is important to note that although there are a handful of machine translation packages and data providers that offer Arabic place names, their coverage is poor, their data contains many machine-generated errors, and they do not cover variants. Our project may well be the first attempt to build a comprehensive database of Arabic place names that covers the entire world, is accurate and validated, and is based on state-of-the-art techniques in computational lexicography. See the data sample in 3. Data Sample below.
Identifying, processing and normalizing place names and their numerous variants is useful in a variety of applications, such as:
- Improving the accuracy of English-to-Arabic machine translation by providing the standard, correct Arabic form.
- Improving the accuracy of Arabic-to-English machine translation by identifying variants and errors in the original Arabic text.
- Place name dictionaries for human translators.
- Entity and information extraction.
- Segmentation and morphological analysis of Arabic texts.
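As a rough illustration of the translation and extraction use cases above, the sketch below maps any surface form (standard, variant, or error) back to its standard Arabic form and English name. The record layout, the field names, and the `build_index` and `lookup` helpers are assumptions made for this example, not the actual database schema; the three records are taken from the data sample in this document.

```python
# Hypothetical in-memory records; field names are illustrative only.
ENTRIES = [
    {"arabic": "أبو ظبي", "english": "Abu Dhabi",
     "variants": ["ابو ظبي"], "errors": ["أبو ظبى", "ابو ظبى"]},
    {"arabic": "الإسكندرية", "english": "Alexandria",
     "variants": ["الاسكندرية"], "errors": ["الإسكندريه"]},
    {"arabic": "جدة", "english": "Jeddah",
     "variants": ["جدّة"], "errors": ["جده"]},
]


def build_index(entries):
    """Map every known surface form to (entry, kind)."""
    index = {}
    for entry in entries:
        index[entry["arabic"]] = (entry, "standard")
        for form in entry["variants"]:
            index[form] = (entry, "variant")
        for form in entry["errors"]:
            index[form] = (entry, "error")
    return index


def lookup(index, surface):
    """Return the standard form, English name, and kind, or None if unknown."""
    hit = index.get(surface)
    if hit is None:
        return None
    entry, kind = hit
    return {"standard": entry["arabic"], "english": entry["english"], "kind": kind}


INDEX = build_index(ENTRIES)
# An erroneous spelling found in source text resolves to the standard entry.
print(lookup(INDEX, ENTRIES[0]["errors"][0]))
```

An exact-match dictionary like this only works when every variant is listed explicitly, which is why systematic variant coverage matters for these applications.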
Our database covers both the Arab and non-Arab world, including variants. Only the most common variants are shown in the sample below; see the next section for more.
| Arabic | Buckwalter Transliteration | English | Variant | Error | Country |
|---|---|---|---|---|---|
| أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى, ابو ظبى | UAE |
| الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | Egypt |
| الجزائر | AljzA}r | Algiers | الجزاير | | Algeria |
| برازيليا | brAzylyA | Brasilia | برازيلية | برازيليه | Brazil |
| القاهرة | AlqAhrp | Cairo | | القاهره | Egypt |
| الشرق الاقصى | Al$rq AlAqSY | Far East | | الشرق الاقصي | N/A |
| ألمانيا | >lmAnyA | Germany | المانيا | | Germany |
| الجيزة | Aljyzp | Giza | | الجيزه | Egypt |
| حيفا | HyfA | Haifa | حيفة | | Israel |
| جدة | jdp | Jeddah | جدّة | جده | Saudi Arabia |
| القدس | Alqds | Jerusalem | | | Israel |
| المنامة | AlmnAmp | Manama | | المنامه | Bahrain |
| مكة | mkp | Mecca | | مكه | Saudi Arabia |
| نابلس | nAbls | Nablus | | | Palestinian Territory |
| نانجينغ | nAnjyng | Nanjing | | | China |
| بالو ألتو | bAlw >ltw | Palo Alto | بالو التو, بالو آلتو | | USA |
| الرياض | AlryAD | Riyadh | الرّياض | | Saudi Arabia |
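The Buckwalter column in the table above is a one-to-one mapping from Arabic code points to ASCII characters, so it can be generated mechanically. The following is a minimal sketch of the standard Buckwalter scheme; the `to_buckwalter` function and the block layout are assumptions for this example and do not describe the tooling actually used to build the database.

```python
# Each tuple gives a starting Unicode code point and the Buckwalter
# characters for the consecutive code points that follow it.
_BLOCKS = [
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghayn
    (0x0640, "_"),                            # tatweel
    (0x0641, "fqklmnhwYy"),                   # faa' .. yaa'
    (0x064B, "FNKaui~o"),                     # tanwin, short vowels, shadda, sukun
    (0x0670, "`{"),                           # dagger alif, alif wasla
]

BUCKWALTER = {
    chr(start + offset): symbol
    for start, symbols in _BLOCKS
    for offset, symbol in enumerate(symbols)
}


def to_buckwalter(text):
    """Transliterate Arabic script; characters outside the map pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


print(to_buckwalter("أبو ظبي"))
print(to_buckwalter("القاهرة"))
```

Because the mapping is lossless, every orthographic difference between a standard form and its variants (hamza, shadda, taa' marbuuta) remains visible in the transliteration.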
The table below shows various orthographic variants and common errors for الإسكندرية, the Egyptian city of Alexandria, along with their Google hit counts (there are many other variants involving partial vocalization). Our databases are now being expanded to systematically include all orthographic variants and errors, based on statistical analysis of Arabic orthography as it currently occurs in corpora, and often include the fully vocalized versions as well (see Database of Arabic Proper Nouns for a sample).
Our Arabic place names are carefully proofread to ensure strict adherence to the complex rules of hamza orthography, which are often ignored outside publications of the highest editorial standards. As a result of this strict editorial policy, we can provide not only the linguistically correct standard MSA version but also all common non-standard and incorrect versions, carefully flagged to distinguish between them, as shown in the table below.
V = variant, E = error, S = standard, N = normalized
| Rank | Type | Arabic | Buckwalter Transliteration | Google Hits | Remarks |
|---|---|---|---|---|---|
| 1 | N | الاسكندرية | AlAskndryp | 2,930,000 | Normalized, no hamza |
| 2 | S | الإسكندرية | Al<skndryp | 690,000 | Standard form, with hamza |
| 3 | E | الاسكندريه | AlAskndryh | 89,200 | No hamza, taa' marbuuta replaced by haa' |
| 4 | V | الإسكندريّة | Al<skndry~p | 954 | Explicit shadda |
| 5 | E | الإسكندريه | Al<skndryh | 897 | Taa' marbuuta replaced by haa' |
| 6 | V | الاسكندريّة | AlAskndry~p | 245 | No hamza, shadda explicit |
| 7 | E | الاسكندريا | AlAskndryA | 80 | Hamza omitted, taa' marbuuta replaced by alif |
| 8 | V | الإسْكَنْدَريَّة | Al<sokanodary~ap | 24 | Fully vocalized |
| 9 | E | الاسكندريّه | AlAskndry~h | 12 | No hamza, shadda explicit, taa' marbuuta replaced by haa' |
| 10 | E | الإسكندريا | Al<skndryA | 7 | Taa' marbuuta replaced by alif tawiila |
| 11 | E | الإسكندريّه | Al<skndry~h | 5 | Taa' marbuuta replaced by haa', shadda explicit |
In addition to the above, our database contains many other variants, such as those with partial and full vocalization, covering all actual and potential variants. The full set of Alexandria variants includes 35 entries.
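The Type and Remarks columns in the Alexandria table follow a recognizable pattern, which the sketch below tries to capture. It is a heuristic inferred from the eleven rows shown, not the institute's actual editorial criteria, and the `normalize`, `describe`, and `flag` helpers are hypothetical names introduced for illustration.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")  # tanwin, short vowels, shadda, sukun
HAMZA_ALIFS = "\u0623\u0625"                # alif with hamza above / below
ALIF, HAA, TAA_MARBUUTA, SHADDA = "\u0627", "\u0647", "\u0629", "\u0651"
# Map hamza and madda alifs to bare alif.
_TO_BARE_ALIF = str.maketrans("\u0623\u0625\u0622", ALIF * 3)


def normalize(form):
    """Strip diacritics and reduce hamza/madda alifs to bare alif."""
    return DIACRITICS.sub("", form).translate(_TO_BARE_ALIF)


def describe(standard, form):
    """List how `form` departs from `standard`, echoing the Remarks column."""
    remarks = []
    std_hamza = any(c in standard for c in HAMZA_ALIFS)
    if std_hamza and not any(c in form for c in HAMZA_ALIFS):
        remarks.append("no hamza")
    if SHADDA in form and SHADDA not in standard:
        remarks.append("shadda explicit")
    bare = normalize(form)
    if standard.endswith(TAA_MARBUUTA):
        if bare.endswith(HAA):
            remarks.append("taa' marbuuta replaced by haa'")
        elif bare.endswith(ALIF):
            remarks.append("taa' marbuuta replaced by alif")
    return remarks


def flag(standard, form):
    """Assumed rule: S = standard, N = normalized, E = final letter changed, else V."""
    if form == standard:
        return "S"
    if form == normalize(standard):
        return "N"
    if standard.endswith(TAA_MARBUUTA) and not normalize(form).endswith(TAA_MARBUUTA):
        return "E"
    return "V"


standard = "\u0627\u0644\u0625\u0633\u0643\u0646\u062f\u0631\u064a\u0629"      # rank 2
rank9 = "\u0627\u0644\u0627\u0633\u0643\u0646\u062f\u0631\u064a\u0651\u0647"  # rank 9
print(flag(standard, rank9), describe(standard, rank9))
```

A rule-based pass like this can propose flags for candidate forms, but it does not replace the proofreading described above, since it knows nothing about hamza rules beyond the alif seat.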