06 April 2009

Oracle Text: diacritic search

For the client where I'm currently working, the need arose that we needed to find lastnames regardless of diacritical characters. In Dutch, we have lastnames (and firstname too) where characters like ü, ä, ö, é occur.

The "usual" way to handle this was to add a column to the table, and store the name there without the diacritical characters. Meaning the name "müller" would be in one column, while another column would contain the name "muller", the same letters but without the double dots (called a trema in Dutch) over the "u". A common complaint with this approach, is that when you query with the double dots, you won't be able to find what you are looking for.
There is also the possibility of using Oracle Text to handle this. In the past I have been to a presentation on Oracle Text, but never used it before. It scared me in a certain way. I thought it would be quite complicated to use it, but it turned out to be real easy.

Let's start with the table

create table text_test
(name varchar2(255)
);

insert into text_test values ('muller');
insert into text_test values ('müller');
insert into text_test values ('MULLER');
insert into text_test values ('MÜLLER');
insert into text_test values ('mueller');
insert into text_test values ('MUELLER');
--
insert into text_test values ('möller');
insert into text_test values ('moller');
--
insert into text_test values ('mäller');
insert into text_test values ('maller');
--
insert into text_test values ('Médar');

commit;

In this table are some variations of "Muller", upper- and lowercase and with and without the diacritical character.
In order to use Oracle Text to search for the name regardless of diacritic, we need to create a Custom Lexer. This custom lexer is needed, because we need to change the Base Letter setting. From the documentation:
With base-letter conversions enabled, letters with umlauts, acute accents, cedillas, and the like are converted to their basic forms for indexing, so fiancé is indexed both as fiancé and as fiance, and a query of fiancé returns documents containing either form.

And this is exactly what we're after.

To be able to search using the Base Letter conversion, as described in the quote from the ducmentation, we need to create a Preference.
To change the setting of the Base Letter, the default is NO, we to set this attribute to YES.

begin
ctxsys.ctx_ddl.create_preference ('cust_lexer','BASIC_LEXER');
ctxsys.ctx_ddl.set_attribute ('cust_lexer','base_letter','YES'); -- removes diacritics
end;
/


In the above code, we create a preference, called CUST_LEXER. Because the main language will be Dutch, I use the BASIC_LEXER. More information regarding this in the Oracle documentation.
The attribute that I want to override is the BASE_LETTER, so set that attribute to YES. And that should take care of it.
Now the only thing to do is create the Oracle Text index and specify that we want to use our preference set.

CREATE INDEX text_test_idx ON text_test(name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER cust_lexer');


Now that we are all set, let's run a query and verify the results.

SQL> select name
2 from text_test
3 where contains (name, 'muller') > 0
4 ;

NAME
-----------------------------------------
müller
MULLER
muller
MÜLLER

And there you have it, it works as advertised.

While writing this blog, a colleague of mine pointed out that in Germany it is accepted to spell the name "müller" as "mueller". Oracle Text can even handle that.
Add this to your preference set, and you're good to go.

ctxsys.ctx_ddl.set_attribute ('cust_lexer','alternate_spelling','GERMAN');


With a little demo:

SQL> select name
2 from text_test
3 where contains (name, 'muller') > 0
4 ;

NAME
-----------------------------------------
muller
müller
MULLER
MÜLLER
mueller
MUELLER

This also works when searching for "Mueller"

SQL> select name
2 from text_test
3 where contains (name, 'mueller') > 0
4 ;

NAME
-----------------------------------------
muller
müller
MULLER
MÜLLER
mueller
MUELLER


The problem with Oracle Text used to be the synchronization of the Text indexes with DML actions. In the old days, you needed to take care of this yourself. Schedule a "make sure the Text index is updated to be in sync with the table". This could mean that after you added a name like "Désiré" to your table, you wouldn't be able to find because it wasn't in the text index.
Starting with Oracle 10g (release 2) you can indicate that the index needs to be synchronized during a COMMIT. Just specify it with the creation of the index, and that's it.

CREATE INDEX text_test_idx ON text_test(name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER cust_lexer SYNC (ON COMMIT)');


This Oracle Text is really cool stuff, and this is just the beginning. The Oracle Documentation is a great source of information. A link is included at the bottom.

Finally the cleanup:

drop table text_test cascade constraints
/
begin
ctxsys.ctx_ddl.drop_preference ('cust_lexer');
end;
/


Oracle Text Documentation

11 comments:

  1. Without Oracle Tex:

    select *
    from text_test
    where instr( lower( convert( name, 'US7ASCII' ) ), 'muller') > 0

    ReplyDelete
  2. Very creative, but why not take advantage of what Oracle has to offer? How would you index that? It won't hold up with the alternate spelling, though... unless you have another trick up your sleeve? ;)

    ReplyDelete
  3. Nice to be visiting your blog again, Generic Metformin has been months for me. Well this article that i've been waited for so long. Generic Ultracet need this article to complete my assignment in the college, and it has same topic with your article. Thanks, great share

    ReplyDelete
  4. Simple article but makes a difference because it adds something to what oracle official doc provides.
    Clarified to me that the preference must be created before the index is re/created.
    thanks a lot, Goog Job ! :-)

    ReplyDelete
  5. What about if want to search for contains(name,'mü%') >0
    and we want to have result as :

    müller first row

    ReplyDelete
    Replies
    1. select * from text_test where name like 'mü%'

      Delete
  6. thanks for ur help...but I want all other data related to 'mü%'....means i want result like below :
    müller
    MÜLLER
    muller

    So I want to use diacritic search
    I mean data should return in the order in which search is written.


    ReplyDelete
    Replies
    1. There is no such thing as "the order in which search is written", it is just a where clause. When you want to sort the results a certain way, you will need an ORDER BY. You decide how you want the results ordered. You might want to take a look at the SCORE() function

      Delete
  7. Score function doesnt work with this....

    I have the below query
    If possible , when the search is with an accent, display the results with the accent before displaying the results without the accent.
    And vice-versa, when the serch isn't with an accent, display the results with no accent before displaying the results with accents.
    As an Eg:

    I have a situation like below:

    select p.last_name from person p where user_id = 'mprakash' and contains(last_name , 'fré%') > 0

    FreNormal Frederic Frédéric Frêdéric

    result comes as above......accented characters always comes last even if i am giving the search condition with using accented characters 'fré%'

    Is there something through which we can get result below: Frédéric Frêdéric FreNormal Frederic

    or if I give search as contains(last_name , 'frê%') > 0 then result comes as below :

    Frêdéric Frédéric FreNormal Frederic

    and if i search as contains(last_name , 'fre%') > 0 then result would be FreNormal Frederic Frêdéric Frédéric

    ReplyDelete
    Replies
    1. not that I know of... still this needs to take place in the ORDER BY of the query.

      Delete