Indexing and Searching Documents in Multiple Languages (Part I)

The good thing about dedicated workshops like the Search Workshops we ran last week in Brussels and Copenhagen is that you come away with a lot of questions that were answered during the course and that deserve to end up in blog postings, articles or whatever. Finding the time to do this is of course always a problem. But I'll do my best, certainly when it adds to the material covered in our latest book:

Inside the Index and Search Engines: Microsoft® Office SharePoint® Server 2007
by Patrick Tisseghem, Lars Fastrup


One of the interesting questions concerned indexing and searching documents written in a specific language. I'll cover part of the answer in this first posting and continue later this week.

How does the crawler detect the language of the content of the document?

First of all, language detection depends on the IFilter that is used to index the content of the document. Chapter 9 of the book gives a full explanation of the internals of IFilters, along with a guide to building your own.

The built-in IFilter that is part of the MOSS indexing architecture is capable of looking at an Office document and collecting plenty of information. This information gathering is actually the task of one of the internal plug-ins, named the Metadata Extraction plug-in. It relies on an internal language detection algorithm (developed by Microsoft Research) to determine the language of the content. Once it has identified the language (represented by a number), it stores this information in a hidden managed property called DetectedLanguage.

How do I search for a document in a specific language?

Let's first have a look at the out-of-the-box experience. As an example, I have a document library that stores several documents, each authored in a different language. I configured a content source so that all of this data gets indexed.

image

The advanced search page lets us filter on the language very easily using the language picker. By default it offers only a couple of options, but if you open the tool pane and edit the XML that is set as the value of the Properties property of the AdvancedSearchBox Web Part, you can offer more choices.

In the XML you find a list of LangDef elements, each one representing a language and its number. Note that it is not clear how Microsoft arrived at these numbers (they do not match the LCID numbers, for example).

    <LangDefs>
        <LangDef DisplayName="Arabic" LangID="1"/>
        <LangDef DisplayName="Bengali" LangID="69"/>
        <LangDef DisplayName="Bulgarian" LangID="2"/>
        <LangDef DisplayName="Catalan" LangID="3"/>
        <LangDef DisplayName="Chinese" LangID="4"/>
        <LangDef DisplayName="Croatian/Serbian" LangID="26"/>
        <LangDef DisplayName="Czech" LangID="5"/>
        <LangDef DisplayName="Danish" LangID="6"/>
        <LangDef DisplayName="Dutch" LangID="19"/>
        <LangDef DisplayName="Finnish" LangID="11"/>
        <LangDef DisplayName="French" LangID="12"/>
        <LangDef DisplayName="German" LangID="7"/>
        <LangDef DisplayName="Greek" LangID="8"/>

The language picker will show all the languages that are defined within the Languages element:

    <Languages>
        <Language LangRef="12"/>
        <Language LangRef="7"/>
        <Language LangRef="17"/>
        <Language LangRef="10"/>
        <Language LangRef="19"/>
        <Language LangRef="25"/>
        <Language LangRef="22"/>
    </Languages>
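
To check which names the picker will offer before pasting edited XML back into the Web Part, you could resolve each LangRef against the LangDef list yourself. The following is only a rough Python sketch that runs outside SharePoint: the `<root>` wrapper element, the trimmed sample data and the `picker_languages` helper are my own assumptions for illustration, not something taken from the product.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the Properties XML. The <root> wrapper is an assumption;
# the LangDef and Language values are copied from the listings above.
PROPERTIES_XML = """
<root>
  <LangDefs>
    <LangDef DisplayName="Dutch" LangID="19"/>
    <LangDef DisplayName="French" LangID="12"/>
    <LangDef DisplayName="German" LangID="7"/>
    <LangDef DisplayName="Greek" LangID="8"/>
  </LangDefs>
  <Languages>
    <Language LangRef="12"/>
    <Language LangRef="7"/>
    <Language LangRef="19"/>
  </Languages>
</root>
"""


def picker_languages(properties_xml):
    """Return (LangID, DisplayName) pairs in the order of the Languages element."""
    root = ET.fromstring(properties_xml)
    # Map each LangID (kept as a string) to its display name.
    names = {d.get("LangID"): d.get("DisplayName") for d in root.iter("LangDef")}
    result = []
    for ref in root.iter("Language"):
        lang_ref = ref.get("LangRef")
        # A LangRef without a matching LangDef is flagged instead of dropped.
        result.append((int(lang_ref), names.get(lang_ref, "(undefined)")))
    return result


for lang_id, name in picker_languages(PROPERTIES_XML):
    print(lang_id, name)
```

With the sample data this prints French, German and Dutch with their numbers. Greek is defined but not referenced, so it would not show up in the picker.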

A query that uses the language picker includes a match on the DetectedLanguage managed property, as shown here:

image

image

The advanced search page is not the only place where you can use this managed property. You can also type it directly into the search box as part of a keyword syntax query. You just have to look up the number of the language (see the XML above).

image
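
If you build such queries in code, a small helper can do the lookup from language name to number. This is a hedged sketch only: the `LANG_IDS` table holds three entries copied from the LangDefs excerpt above, `language_query` is a hypothetical helper, and the `property:value` filter form is my assumption about the keyword syntax, so check it against what your own search box accepts.

```python
# Three entries copied from the LangDefs excerpt; extend as needed.
LANG_IDS = {"Dutch": 19, "French": 12, "German": 7}


def language_query(keywords, language):
    """Append a DetectedLanguage filter to a keyword query (assumed property:value form)."""
    lang_id = LANG_IDS[language]  # raises KeyError for a language not in the table
    return f"{keywords} DetectedLanguage:{lang_id}".strip()


print(language_query("sharepoint", "French"))  # prints: sharepoint DetectedLanguage:12
```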

In the next posting I'll show you how you can customize the search experience using the language information.