Database of Arabic Place Names
D A P


©2004-2008 The CJK Dictionary Institute, Inc.


1. Introduction

The CJK Dictionary Institute is engaged in the development and continuous expansion of comprehensive lexical databases for CJK languages and Arabic consisting of approximately eight million entries (see CJK Lexical Resources for details). This document describes our database of Arabic place names. We also maintain a large database of Arabic Personal Names and their variants.

Though Arabic has become a world language of critical importance, lexical resources for it, especially for proper nouns, are scarce or exist only on a small scale. Because place names play an important role in natural language applications such as named entity recognition (NER) and machine translation, we are continuously expanding and revising our Dictionary of Arabic Place Name Variants, which provides systematic coverage of Arabic orthographic variants and common orthographic errors.

Although a handful of machine translation packages and data providers offer Arabic place names, their coverage is poor, their data contains many machine-generated errors, and they do not cover variants. Our project may well be the first attempt to build a comprehensive database of Arabic place names that covers the entire world, is accurate and validated, and is based on state-of-the-art techniques in computational lexicography. See Section 3, Data Sample, below for a sample of the data.

2. Why are variants useful?

Identifying, processing and normalizing place names and their numerous variants is useful in a variety of applications, such as:

  1. Improving the accuracy of English-to-Arabic machine translation by providing the standard, correct Arabic form.
  2. Improving the accuracy of Arabic-to-English machine translation by identifying variants and errors in the original Arabic text.
  3. Place name dictionaries for human translators.
  4. Entity and information extraction.
  5. Segmentation and morphological analysis of Arabic texts.

3. Data Sample

Our database covers both the Arab and non-Arab world, including variants. Only the most common variants are shown in the sample below -- see the next section for more.

| Arabic | Buckwalter Transliteration | English | Variant | Error | Country |
|---|---|---|---|---|---|
| أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى, ابو ظبى | UAE |
| الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | Egypt |
| الجزائر | AljzA}r | Algiers | | الجزاير | Algeria |
| برازيليا | brAzylyA | Brasilia | برازيلية | برازيليه | Brazil |
| القاهرة | AlqAhrp | Cairo | | القاهره | Egypt |
| الشرق الاقصى | Al$rq AlAqSY | Far East | | الشرق الاقصي | N/A |
| ألمانيا | >lmAnyA | Germany | المانيا | | Germany |
| الجيزة | Aljyzp | Giza | | الجيزه | Egypt |
| حيفا | HyfA | Haifa | | حيفة | Israel |
| جدة | jdp | Jeddah | جدّة | جده | Saudi Arabia |
| القدس | Alqds | Jerusalem | | | Israel |
| المنامة | AlmnAmp | Manama | | المنامه | Bahrain |
| مكة | mkp | Mecca | | مكه | Saudi Arabia |
| نابلس | nAbls | Nablus | | | Palestinian Territory |
| نانجينغ | nAnjyng | Nanjing | | | China |
| بالو ألتو | bAlw >ltw | Palo Alto | بالو التو, بالو آلتو | | USA |
| الرياض | AlryAD | Riyadh | الرّياض | | Saudi Arabia |
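
The Buckwalter column in the sample above follows the widely used Buckwalter scheme, which maps each Arabic character to one ASCII symbol. The following is a minimal sketch of that mapping in Python, offered only as an illustration; the names `BUCKWALTER` and `to_buckwalter` are hypothetical and are not part of the database or its tooling. Characters outside the table (spaces, Latin letters, punctuation) are passed through unchanged.

```python
def _build_buckwalter_table():
    """Build a code point -> Buckwalter symbol table for str.translate."""
    table = {}
    # Each string lists the Buckwalter symbols for a run of consecutive
    # Unicode code points, starting at the given code point.
    runs = (
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghayn
        (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun
    )
    for start, symbols in runs:
        for offset, symbol in enumerate(symbols):
            table[start + offset] = symbol
    table[0x0670] = "`"  # dagger (superscript) alif
    table[0x0671] = "{"  # alif wasla
    return table


BUCKWALTER = _build_buckwalter_table()


def to_buckwalter(text):
    """Transliterate Arabic script to Buckwalter; other characters pass through."""
    return text.translate(BUCKWALTER)
```

Applied to القاهرة, the function should return AlqAhrp, which agrees with the Cairo row of the sample.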

4. Place Name Variants

The table below shows various orthographic variants and common errors for الإسكندرية, the Egyptian city of Alexandria, along with the number of Google hits for each (there are many other variants involving partial vocalization). Our databases are now being expanded to systematically include all orthographic variants and errors, based on statistical analysis of Arabic orthography as it currently occurs in corpora, and often include fully vocalized versions as well (see Database of Arabic Proper Nouns for a sample).

Our Arabic place names are carefully proofread to ensure strict adherence to the complex rules of hamza orthography, which are often ignored outside publications of the highest editorial standards. As a result of this strict editorial policy, we can provide not only the linguistically correct standard MSA version but also all common non-standard and incorrect versions, carefully flagged to distinguish between them, as shown in the table below.

V = variant, E = error, S = standard, N = normalized

Some Orthographic Variants and Common Errors for الإسكندرية (Alexandria)

| Rank | Type | Arabic | Buckwalter Transliteration | Google Hits | Remarks |
|---|---|---|---|---|---|
| 1 | N | الاسكندرية | AlAskndryp | 2,930,000 | Normalized, no hamza |
| 2 | S | الإسكندرية | Al<skndryp | 690,000 | Standard form, with hamza |
| 3 | E | الاسكندريه | AlAskndryh | 89,200 | No hamza, taa' marbuuta replaced by haa' |
| 4 | V | الإسكندريّة | Al<skndry~p | 954 | Explicit shadda |
| 5 | E | الإسكندريه | Al<skndryh | 897 | taa' marbuuta replaced by haa' |
| 6 | V | الاسكندريّة | AlAskndry~p | 245 | no hamza, shadda explicit |
| 7 | E | الاسكندريا | AlAskndryA | 80 | hamza omitted, taa' marbuuta replaced by alif |
| 8 | V | الإسْكَنْدَريَّة | Al<sokanodary~ap | 24 | fully vocalized |
| 9 | E | الاسكندريّه | AlAskndry~h | 12 | no hamza, shadda explicit, taa' marbuuta replaced by haa' |
| 10 | E | الإسكندريا | Al<skndryA | 7 | taa' marbuuta replaced by alif tawiila |
| 11 | E | الإسكندريّه | Al<skndry~h | 5 | taa' marbuuta replaced by haa', shadda explicit |

In addition to the above, our database contains many other variants, such as those with partial and full vocalization, covering all actual and potential variants. The full set of Alexandria variants includes 35 entries.
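
To illustrate why such variants can be hard to match without a dedicated resource, the sketch below folds several of the differences listed in the table (hamza on alif, explicit shadda and vowel marks, taa' marbuuta written as haa') into a single comparison key. It is a simplified illustration under our own assumptions, not the method used to build the database; the names `match_key` and `group_variants` are hypothetical. The folding rules are deliberately crude: they would also merge unrelated words that differ only in these characters, and they do not catch errors such as a final alif in place of taa' marbuuta, which is one reason an explicit, flagged variant list is useful.

```python
import re
from collections import defaultdict

# Short vowels, tanwin, shadda, sukun (U+064B..U+0652), dagger alif, tatweel.
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Character folding applied after the marks are removed (an assumption of
# this sketch, chosen to mirror the differences shown in the table).
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0629": "\u0647",  # taa' marbuuta         -> haa'
    "\u0649": "\u064A",  # alif maqsuura         -> yaa'
})


def match_key(name):
    """Return a crude comparison key for an Arabic place name."""
    return _MARKS.sub("", name).translate(_FOLD)


def group_variants(names):
    """Group names that share the same comparison key."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return dict(groups)
```

With this key, the standard form الإسكندرية, the error الاسكندريّه, and the fully vocalized form all fall into one group, while الاسكندريا (final alif) stays separate and would need a dictionary entry to be linked to the standard form.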