Jack Halpern


   Data Licensing

The CJKI Database of Arabic Plurals

The CJKI Database of Arabic Plurals (DAP) is the first truly modern, fully up-to-date database covering both regular and irregular Arabic plurals. Developed by experts over a period of several years, this database includes various grammatical attributes such as part-of-speech, collectivity codes, gender codes, and full vocalization. CJKI is now making this database available for use in software development, machine translation, and Arabic language education.

The Database of Arabic Plurals has been assembled by a team of experts in Arabic grammar through meticulous attention and research to ensure accuracy and to avoid the errors found in other works. In an era in which accurate processing of Arabic text is critical, this database represents a major step forward for natural language processing, machine translation, lexicography, and pedagogy.

Linguistic Background

Arabic regular plurals, also called sound plurals, are formed with the regular suffixes ُونَ uuna for masculine plurals and َاتْ aat for feminine plurals, as shown below:

Male singular مُدَرِّسٌ mudarrisun male teacher
Male plural مُدَرِّسُونَ mudarrisuuna male teachers
Female singular مُدَرِّسَةٌ mudarrisatun female teacher
Female plural مُدَرِّسَاتٌ mudarrisaatun female teachers
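The suffixation rule above is simple enough to sketch in code. The following Python function is only an illustration and is not part of DAP: it works on the transliterations used in the table, assumes nominative indefinite singulars ending in -un (masculine) or -atun (feminine), and keeps the case ending on the feminine plural, matching the script form مُدَرِّسَاتٌ. The function name and the gender codes "m" and "f" are assumptions made for this example.

```python
def sound_plural(singular: str, gender: str) -> str:
    """Form a nominative sound plural from a transliterated indefinite singular.

    Hypothetical helper: handles only the two regular cases shown in the table.
    """
    if gender == "m":
        if not singular.endswith("un"):
            raise ValueError(f"expected a masculine singular ending in -un: {singular!r}")
        # Replace the case ending -un with the masculine plural suffix -uuna.
        return singular[:-2] + "uuna"
    if gender == "f":
        if not singular.endswith("atun"):
            raise ValueError(f"expected a feminine singular ending in -atun: {singular!r}")
        # Replace -atun with the feminine plural suffix -aat plus the case ending -un.
        return singular[:-4] + "aatun"
    raise ValueError(f"unknown gender code: {gender!r}")


print(sound_plural("mudarrisun", "m"))    # mudarrisuuna
print(sound_plural("mudarrisatun", "f"))  # mudarrisaatun
```

Broken plurals cannot be produced by a suffix rule of this kind, which is the difficulty discussed next.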

For language learners and language processing software alike, the irregular, or broken, plurals present one of the greatest challenges in the Arabic language. These morphologically irregular plurals are distinct in that they are not formed with the regular plural suffixes shown above. Instead, they are formed by modifying the vowels of the consonant-vowel pattern (CV template) of the singular form, a phenomenon known as "nonconcatenative morphology". Each of these plurals follows one of dozens of complicated morphological patterns derived from the root of the singular form by adding suffixes, adding prefixes, and changing the vowels.

English also has irregular plurals such as geese for goose and loaves for loaf, but the problem in Arabic is significantly more challenging since, in Arabic, such plurals are far more numerous and the patterns far more complex. In fact, the vast majority of Arabic plurals are broken. Furthermore, some nouns have distinct plural forms for different senses of the noun. For instance, the word بَيْتٌ baytun has two plural forms: بُيُوت in the sense of 'houses' and أَبْيَات in the sense of 'verses'. To add to this complexity, while many broken plurals are formed by inserting the root into a fixed pattern or template, the formation of some broken plurals requires complex morphophonological changes that cannot be predicted from the singular form, at least not without the aid of sophisticated algorithms and computational models.

Let us examine some of the broken plurals in the table below:

Singular  English  Plural  English  Singular pattern  Plural pattern
كِتَاب  book  كُتُب  books  CiCaaC  CuCuC
غُرْفَة  room  غُرَف  rooms  CuCCa  CuCaC
قَلْب  heart  قُلُوب  hearts  CaCC  CuCuuC
وَلَد  boy  أَوْلَاد  boys  CaCaC  ʼaCCaaC

The last column shows the broken plural pattern. For example, CuCuC means that the first two consonants of the root, ك /k/ and ت /t/, are each followed by the vowel /u/, whereas the last consonant ب /b/ has no vowel. There are several dozen such broken plural patterns, and which pattern a given noun takes cannot generally be predicted from its singular.
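The pattern notation can be made concrete with a short sketch. The Python function below fills each C slot of a template with the next consonant of the root, using the transliteration from the table. It is a hypothetical helper written for this explanation, not DAP code. It does not choose the pattern: deciding which template a noun takes still requires a lexicon. Roots are passed as a sequence of consonants so that multi-letter transliterations such as "gh" could be handled.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot of a CV template with the next root consonant.

    root    -- sequence of transliterated consonants, e.g. ("k", "t", "b")
    pattern -- template such as "CuCuC"; every non-'C' character is copied as is
    """
    slots = pattern.count("C")
    if slots != len(root):
        raise ValueError(
            f"pattern {pattern!r} has {slots} slots, root has {len(root)} consonants"
        )
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


KTB = ("k", "t", "b")
print(apply_pattern(KTB, "CiCaaC"))              # kitaab (singular)
print(apply_pattern(KTB, "CuCuC"))               # kutub  (broken plural)
print(apply_pattern(("q", "l", "b"), "CuCuuC"))  # quluub
```

The same function covers patterns with a prefix, such as ʼaCCaaC, because any character other than C is copied through unchanged.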

Broken Plurals in Information Processing

Since, on the whole, the formation of the broken plural cannot be easily predicted from the singular form, such plurals pose a daunting challenge for machine translation, natural language processing, and language learning. This is immediately clear from the table in the previous section. The lack of regular patterns means that learners must learn each plural individually, while software for processing Arabic texts should ideally have a hard-coded database of broken plurals to determine the plural form from the singular or vice versa.
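A lookup of this kind can be sketched as a pair of in-memory indexes. The Python example below is an assumption about how such a table might be used, not a description of DAP's actual format or API. The word pairs are unvoweled forms taken from the data sample later in this document, and all names (`PAIRS`, `build_indexes`, `plurals`, `singulars`) are hypothetical.

```python
from collections import defaultdict

# (singular, plural) pairs in unvoweled form, taken from the data sample.
# A singular may appear more than once because some nouns have several plurals.
PAIRS = [
    ("كتاب", "كتب"),
    ("غرفة", "غرف"),
    ("جمل", "جمال"),
    ("عمود", "عمد"),
    ("عمود", "أعمدة"),
]


def build_indexes(pairs):
    """Build singular-to-plurals and plural-to-singulars lookup tables."""
    plurals_of = defaultdict(list)
    singulars_of = defaultdict(list)
    for singular, plural in pairs:
        plurals_of[singular].append(plural)
        singulars_of[plural].append(singular)
    return dict(plurals_of), dict(singulars_of)


PLURALS_OF, SINGULARS_OF = build_indexes(PAIRS)


def plurals(singular):
    """Return all known plurals of a singular, or an empty list."""
    return PLURALS_OF.get(singular, [])


def singulars(plural):
    """Return all known singulars of a plural, or an empty list."""
    return SINGULARS_OF.get(plural, [])


print(plurals("عمود"))   # two plurals for the same singular
print(singulars("كتب"))  # reverse lookup from plural to singular
```

Both indexes return lists because the mapping is not one-to-one: a noun can have several plurals, and an unvoweled plural form could in principle correspond to more than one singular.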

Lack of consideration for irregular plurals would be fatal even in an English natural language processor, let alone in the more complex case of Arabic. Given the prevalence of Arabic broken plurals, it is critical to have an accurate database of irregular forms in order to process text meaningfully.

Key Features of DAP

Below are the key features that make the CJKI Database of Arabic Plurals (DAP) a highly useful resource:

Data sample

A sample of the data is provided below. You can also download the sample as a text or PDF file.

Broken Plurals in Arabic
ID Plural Voweled Singular Voweled Plural Unvoweled Singular Unvoweled Gender POS Meaning
1 أَسْيَاف سَيْف أَسياف سيف m N Sword
2 أَقـْلام قـَلَم أَقلام قلم m N Pen
3 أَرْغِـفَة رَغِيف أَرغفة رغيف m N Bread
4 عُمُد عَمُود عمد عمود m N Mast, Column
5 حُمْر أَحْمَر حمر أَحمر m A Red
6 كـُتـُب كِـتـَاب كتب كتاب m N Book
7 سُرُر سَرِير سرر سرير m N Bed
8 غُرَف غـُرْفـَة غرف غرفة f N Room
9 رُمَاة رَامِي رماة رامي m N Lancer, Spear shooter
10 قـُضَاة قـَاضِي قضاة قاضي m N Judge
11 كـَمَلَة كـَامِل كملة كامل m A Complete
12 قِـرَدَة قِـرْد قردة قرد m N Monkey
13 كِعَاب كـَعْب كعاب كعب m N Heel bone
14 نِعَاج نـَعْجَة نعاج نعجة f N Ewe
15 رِمَاح رُمْح رماح رمح m N Spear
16 ذِئاب ذِئب ذئاب ذئب m N Wolf
17 جِمَال جَمَل جمال جمل m N Camel
18 ضُرُوس ضُرْس ضروس ضرس m N Tooth
19 جُـنـُود جُنْدِيّ جنود جندي m N Soldier
20 أُسُود أَسَد أسود أَسد m N Lion
21 حِـيتـَان حُوت حيتان حوت m N Whale
22 قِـيعَان قـَاع قيعان قاع m N Bottom
23 كـُرَمَاء كـَرِيم كرماء كريم m A Generous
24 عُـقـَلاء عَاقِـل عقلاء عاقل m A Wise, Rational
25 أَوْلِـيَاء وَلِيّ أَولياء وليّ m N Parent
26 أَشِـدَّاء شَدِيد أَشداء شديد m A Powerful
27 جَوَاهِر جَوْهَرَة جواهر جوهرة f N Jewel
28 طَوَابِع طَابـِع طوابع طابع m N Stamp
29 عَجَائِز عَجُوز عجائز عجوز f A Oldster
30 عَذَارَى عَذْرَاء عذارى عذراء f A Virgin
31 صَحَارَى صَحْرَاء صحارى صحراء f N Desert
32 كـَرَاسِيّ كـُرْسِيّ كراسيّ كرسيّ m N Chair
33 جَعَافِر جَعْفَر جعافر جعفر m PN Jaafar
34 عَصَافِـير عُصْـفُور عصافير عصفور m N Bird
35 قـَنَادِيل قِنـْدِيل قناديل قنديل m N Lamp

1 Many nouns have several plurals; e.g., entry No. 4 has two plurals (the one given is not used in Sudan, but it is well known and correct).
2 A broken plural can depend on the meaning. Some English words are of Arabic origin, as in entry No. 35: its meaning was originally given as Candela (its origin is qindiil), a type of flame torch, but is now given as Lamp.
3 Some nouns in Arabic can hardly be rendered in English; e.g., entry No. 25 (walii) is mentioned frequently in the Qur’an. The noun is generally used for Islamic clergymen, but it may also be taken to mean parent(s).
4 Many nouns in the Arabic language have multiple plurals, which may be a mixture of regular and broken plurals.
4 عمود أعمدة عَمُود أَعْمِدَة m N Mast, Column
25 وليّ أَولياء وَلِيّ أَوْلِـيَاء m N Parent
35 قنديل قناديل قِنـْدِيل قـَنَادِيل m N Candela
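Rows in the main sample table above can be read into structured records with a few lines of code. The Python sketch below assumes the layout shown in the table header (eight whitespace-separated fields, with the free-text meaning last) and assumes that the tatweel character (U+0640) in the voweled forms is typographic stretching that can be dropped. The actual DAP delivery format is not described here, so the class and function names are hypothetical.

```python
from dataclasses import dataclass

TATWEEL = "\u0640"  # Arabic elongation character (kashida)


@dataclass
class PluralEntry:
    entry_id: int
    plural_voweled: str
    singular_voweled: str
    plural_unvoweled: str
    singular_unvoweled: str
    gender: str   # "m" or "f" in the sample
    pos: str      # "N", "A" or "PN" in the sample
    meaning: str  # free text, may contain spaces and commas


def parse_row(line: str) -> PluralEntry:
    """Parse one row of the sample table into a PluralEntry."""
    # Split on whitespace at most 7 times so the meaning stays in one piece.
    fields = line.strip().split(maxsplit=7)
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    entry_id, plural_v, singular_v, plural_u, singular_u, gender, pos, meaning = fields
    return PluralEntry(
        entry_id=int(entry_id),
        plural_voweled=plural_v.replace(TATWEEL, ""),
        singular_voweled=singular_v.replace(TATWEEL, ""),
        plural_unvoweled=plural_u,
        singular_unvoweled=singular_u,
        gender=gender,
        pos=pos,
        meaning=meaning,
    )


entry = parse_row("4 عُمُد عَمُود عمد عمود m N Mast, Column")
print(entry.entry_id, entry.gender, entry.pos, entry.meaning)  # 4 m N Mast, Column
```

The three supplementary rows that follow the notes list the unvoweled forms first, so they would need a different field order from the one assumed here.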