Historically, at my former employers Sakhr and Harf, I believed you had to support Arabic morphological analysis in order to have a good Arabic full-text search engine. Recently I was developing an Arabic application in Java, and we used the Apache Lucene search component. As you know, Java is already internationalized, so Lucene supports Arabic without much effort. We then studied improving the search by adding morphological analysis.
What does it mean, from the user’s perspective, to have morphological analysis in search? It means you search for كتب (write) and get results containing many variations, like كتابة، سيكتب، فكتب (writing, wrote, books, …) and so on. What a stupid idea. Look at Google: if you search for the keyword “reading”, it finds only that word. Why should we slow down our search algorithm and just add confusion for the user? Google even searches for misspelled words as typed and merely offers to search using the correct words. What does that mean? It means we should respect the user’s input and use it as is; the search engine’s effort should concentrate on getting the hits most relevant to the user’s query to the top.
My conclusion is not that morphological analysis is useless. I believe it is important in machine translation and many other natural language processing applications, but not in search engines. The key features of a search engine are accuracy, ranking and, of course, speed.
If you want to understand Arabic morphological analysis, I suggest looking at Buckwalter’s work. You can also download his GPL version to learn how things work. His code is written in Perl, but you can easily follow his nice readme file. Thanks, Buckwalter [1] [2].
You can also develop a shallow morphological analyzer using the techniques suggested by Kareem M. Darwish. Look at his web site [3]; he has a nice research paper titled “Building a Shallow Arabic Morphological Analyzer in One Day” [4]. Free software tools are also available.
Research on Arabic morphological analyzers has increased nowadays; you can find an amazing number of research papers and dissertations online at Google Scholar [5].
Note: Buckwalter’s code was later ported to Java under a GNU license; look at:
http://www.nongnu.org/aramorph
References:
1. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
2. http://www.qamous.org/
3. http://www.glue.umd.edu/%7Ekareem/research/Publish
4. http://www.cs.um.edu.mt/~mros/WSL/papers/darwish.pdf
5. http://scholar.google.com/
6. http://www.nongnu.org/aramorph
* This post was originally written on Jan 16, 2006
0 responses to “Arabic Morphological Analysis”
I don’t think it’s a stupid idea to add morphological analysis at all. What Sakhr are specifically doing by using the “root” concept is stupid, though. But on a practical level, yes, the most important factor is to get exactly what the user is searching for. Because Arabic is a highly inflectional as well as highly ambiguous language, search results will be incomplete and include a lot of irrelevant results if no linguistic analysis is used and only “exact” match is used.
Using a morphological analyzer with search engines is neither useless nor stupid. Search quality is measured by two factors: recall and precision. Without a morphological analyzer you will get poor “recall” and high “precision”, which is not good enough for you as an end user. Suppose you are searching for the word “سيارة” (car): the search results won’t include “والسيارة، كالسيارة، للسيارة، بالسيارة”, etc., although they are needed results. In English you won’t get similarly poor results, because prepositions are separate words. So to increase the “recall” when searching Arabic full text you have to use a morphological analyzer.
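The recall argument in the comment above can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration: the class name `PrefixRecallDemo`, the tiny hand-written prefix list, and the six one-word "documents". It is not Lucene's API, nor the analyzers by Buckwalter or Darwish; it only strips one attached particle before comparing.

```java
import java.util.Arrays;
import java.util.List;

class PrefixRecallDemo {

    // Hypothetical, deliberately tiny prefix list: attached particles with and
    // without the article "ال". Longer prefixes come first so "وال" wins over "و".
    static final String[] PREFIXES = {"وال", "بال", "كال", "لل", "ال", "و", "ب", "ك", "ل"};

    // Toy "index": one token per document. The first five are relevant to سيارة.
    static final List<String> DOCS = Arrays.asList(
            "والسيارة", "كالسيارة", "للسيارة", "بالسيارة", "سيارة", "مدرسة");

    // Number of documents in DOCS that a user searching for سيارة would want.
    static final int RELEVANT = 5;

    /** Removes at most one known prefix, keeping at least three letters. */
    static String stripPrefix(String token) {
        for (String prefix : PREFIXES) {
            if (token.startsWith(prefix) && token.length() - prefix.length() >= 3) {
                return token.substring(prefix.length());
            }
        }
        return token;
    }

    /** Counts documents whose token equals the query exactly. */
    static int countExact(List<String> docs, String query) {
        int hits = 0;
        for (String doc : docs) {
            if (doc.equals(query)) hits++;
        }
        return hits;
    }

    /** Counts documents that match once both sides have been prefix-stripped. */
    static int countStripped(List<String> docs, String query) {
        String normalizedQuery = stripPrefix(query);
        int hits = 0;
        for (String doc : docs) {
            if (stripPrefix(doc).equals(normalizedQuery)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        String query = "سيارة";
        // Recall = relevant documents retrieved / all relevant documents.
        System.out.println("exact recall    = " + (double) countExact(DOCS, query) / RELEVANT);
        System.out.println("stripped recall = " + (double) countStripped(DOCS, query) / RELEVANT);
    }
}
```

On this toy data, exact matching retrieves one of the five relevant documents, while prefix stripping retrieves all five. The same rule would also turn كتاب into تاب by removing its first letter, which is the kind of over-stripping that can cost precision and that the original post worries about.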
It is useful to try everything in practice anyway, and I like that it's always possible to find something new here. 🙂
Any comment on using the "light" stemmer for Arabic in search applications? An implementation is in Lucene core v3.1 nowadays; see "Light Stemming for Arabic Information Retrieval".