Database of Arab Names in Arabic
- Every Arabic name is normalized and vocalized.
The complexity of the Arabic script gives rise to many spelling variants and spelling errors, which cause problems in Arabic information processing. To deal with this issue, the CJKI has created the Database of Arab Names in Arabic (DANA), a one-of-a-kind resource that covers several hundred thousand Arabic script variants and common spelling mistakes, as illustrated in the data sample below.
A key feature of DANA is that every Arabic name is normalized and vocalized to produce a database of error-free, fully sanitized Arabic canonical forms. The vocalization is performed by a team of editors with the aid of tools and interfaces designed to achieve maximum efficiency. The canonical forms serve as the basis both for creating accurate romanized variants for our Database of Arab Names (DAN) -- which contains over seven million romanized variants of Arab names -- and for generating Arabic orthographic variants for DANA.
Arabic names may be spelled with or without a hamza over the alif; a shadda is sometimes written and sometimes omitted; a madda is sometimes left off the alif; and so on. Beyond such variants, there are also common errors, such as yaa' being replaced by alif maqsuura and taa' marbuuta being replaced by haa'.
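To make the kinds of variation described above concrete, here is a minimal Python sketch that folds them into a single lookup key, so that spellings differing only in hamza, madda, shadda, alif maqsuura, or taa' marbuuta compare as equal. It is an illustration only and not how DANA is built: the function name `match_key` and the folding table are assumptions made for this example, and the result is a lossy matching key, not one of DANA's editor-vocalized canonical forms.

```python
import re

# Arabic diacritics U+064B..U+0652: tanwin, short vowels, shadda, sukun.
_DIACRITICS = re.compile("[\u064B-\u0652]")

# Hypothetical folding table covering only the cases named in the text.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0649": "\u064A",  # alif maqsuura         -> yaa'
    "\u0629": "\u0647",  # taa' marbuuta         -> haa'
})


def match_key(name: str) -> str:
    """Return a lossy key under which common spelling variants collide."""
    without_marks = _DIACRITICS.sub("", name)  # drop shadda and vowel marks
    return without_marks.translate(_FOLD)      # fold confusable letters


if __name__ == "__main__":
    # Ahmad written with and without the hamza over the initial alif.
    with_hamza = "\u0623\u062D\u0645\u062F"
    without_hamza = "\u0627\u062D\u0645\u062F"
    print(match_key(with_hamza) == match_key(without_hamza))
```

Because the folding discards information, a key like this is suited to grouping candidate variants for lookup, not to restoring the correct spelling; that step requires a curated list of canonical forms.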
You can see the breadth of coverage in DANA by trying out our ANTE demo.
Below are examples of Arabic variants for two male surnames. A larger sample is also available.