U2U Blog

for developers and other creative minds

Indexing and Searching Documents in Multiple Languages (Part I)

The good thing about dedicated workshops, like the Search Workshops we ran last week in Brussels and Copenhagen, is that you come away with a lot of questions that were answered during the course and that really should end up in blog postings, articles or whatever. Finding the time to do this is of course always a problem. But I'll do my best, certainly when it adds to the material covered in our latest book:

Inside the Index and Search Engines: Microsoft® Office SharePoint® Server 2007
by Patrick Tisseghem, Lars Fastrup

One of the interesting questions concerned indexing and searching documents written in a specific language. I'll cover part of it in this first posting and continue later this week.

How does the crawler detect the language of the content of the document?

First of all, language detection depends on the IFilter that is used to index the content of the document. Chapter 9 of the book has a full explanation of the internals of IFilters, along with a guide to building your own.

The built-in IFilter that is part of the MOSS indexing architecture is capable of looking at an Office document and collecting plenty of information. This information gathering is actually the task of one of the internal plug-ins, the Metadata Extraction plug-in. It relies on an internal language detection algorithm (developed by Microsoft Research) to determine the language of the content. When it is able to determine the language (represented by a number), it stores this information in a hidden managed property called DetectedLanguage.

How do I search for a document in a specific language?

Let's first have a look at the out-of-the-box experience. As an example, I have a document library here that stores a number of documents, each authored in a different language. I configured a content source to index all of this data.

image

The advanced search page lets us filter on the language very easily using the language picker. By default there are only a couple of options, but if you open the tool pane and edit the XML that is set as the value of the Properties property of the AdvancedSearchBox Web Part, you can offer more choices.

The XML contains a list of LangDef elements, each one representing a language and the number for it (the listing below shows the first entries). Note that it is not clear how Microsoft arrived at these numbers (they do not match the LCID numbers, for example).

<LangDefs>
    <LangDef DisplayName="Arabic" LangID="1"/>
    <LangDef DisplayName="Bengali" LangID="69"/>
    <LangDef DisplayName="Bulgarian" LangID="2"/>
    <LangDef DisplayName="Catalan" LangID="3"/>
    <LangDef DisplayName="Chinese" LangID="4"/>
    <LangDef DisplayName="Croatian/Serbian" LangID="26"/>
    <LangDef DisplayName="Czech" LangID="5"/>
    <LangDef DisplayName="Danish" LangID="6"/>
    <LangDef DisplayName="Dutch" LangID="19"/>
    <LangDef DisplayName="Finnish" LangID="11"/>
    <LangDef DisplayName="French" LangID="12"/>
    <LangDef DisplayName="German" LangID="7"/>
    <LangDef DisplayName="Greek" LangID="8"/>

The language picker will show all the languages that are defined within the Languages element:

<Languages>
    <Language LangRef="12"/>
    <Language LangRef="7"/>
    <Language LangRef="17"/>
    <Language LangRef="10"/>
    <Language LangRef="19"/>
    <Language LangRef="25"/>
    <Language LangRef="22"/>
</Languages>
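
To make the relationship between the two elements concrete, here is a minimal Python sketch that resolves each LangRef in the Languages element to the DisplayName of the matching LangDef. It is only an illustration: the `<root>` wrapper element and the `picker_languages` helper are my own assumptions, and the sample uses just the three IDs that appear in both listings above (French, German, Dutch).

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the Properties XML. The <root> wrapper is a hypothetical
# stand-in; the LangDef and Language entries are copied from the listings above.
PROPERTIES_XML = """
<root>
  <LangDefs>
    <LangDef DisplayName="Dutch" LangID="19"/>
    <LangDef DisplayName="French" LangID="12"/>
    <LangDef DisplayName="German" LangID="7"/>
  </LangDefs>
  <Languages>
    <Language LangRef="12"/>
    <Language LangRef="7"/>
    <Language LangRef="19"/>
  </Languages>
</root>
"""


def picker_languages(xml_text):
    """Return (LangID, DisplayName) pairs in the order the Languages element lists them."""
    root = ET.fromstring(xml_text)
    # Map every defined LangID to its display name.
    names = {d.get("LangID"): d.get("DisplayName") for d in root.iter("LangDef")}
    # Resolve each LangRef; flag references that have no LangDef entry.
    return [
        (ref.get("LangRef"), names.get(ref.get("LangRef"), "(not defined)"))
        for ref in root.iter("Language")
    ]


languages = picker_languages(PROPERTIES_XML)
for lang_id, name in languages:
    print(lang_id, name)
```

A check like this is handy after editing the XML by hand, since a LangRef without a matching LangDef shows up as "(not defined)".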

A query that uses the language picker includes a match on the DetectedLanguage managed property, as shown here:

image

image

The Advanced Search page is not the only place where you can use this managed property. You can also type it directly into the search box as part of a keyword syntax query. You just have to find out the number of the language (see the XML above).

image
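
If you generate such queries from code, the idea can be sketched in a few lines of Python. This is a sketch under assumptions: it uses the `property:value` filter form of the keyword syntax, the search term "sharepoint" is just an example, and the `LANG_IDS` table and `language_query` helper are hypothetical names that only carry three of the IDs from the LangDefs listing.

```python
# Language numbers taken from the LangDefs listing (not LCIDs).
LANG_IDS = {"French": 12, "German": 7, "Dutch": 19}


def language_query(keywords, language):
    """Append a DetectedLanguage property filter to a keyword query."""
    lang_id = LANG_IDS[language]  # raises KeyError for languages not in the table
    return f"{keywords} DetectedLanguage:{lang_id}"


query = language_query("sharepoint", "French")
print(query)
```

Keeping the numbers in one lookup table means the rest of your code can work with language names and never has to know the internal IDs.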

In the next posting I'll show you how you can customize the search experience using the language information.
