Notes on Oracle: Oracle Text: diacritic search

06 April 2009

Oracle Text: diacritic search

For the client where I'm currently working, the need arose that we needed to find lastnames regardless of diacritical characters. In Dutch, we have lastnames (and firstname too) where characters like ü, ä, ö, é occur.

The "usual" way to handle this was to add a column to the table, and store the name there without the diacritical characters. Meaning the name "müller" would be in one column, while another column would contain the name "muller", the same letters but without the double dots (called a trema in Dutch) over the "u". A common complaint with this approach, is that when you query with the double dots, you won't be able to find what you are looking for.
There is also the possibility of using Oracle Text to handle this. In the past I have been to a presentation on Oracle Text, but never used it before. It scared me in a certain way. I thought it would be quite complicated to use it, but it turned out to be real easy.

Let's start with the table


create table text_test
(name varchar2(255)
);

insert into text_test values ('muller');
insert into text_test values ('müller');
insert into text_test values ('MULLER');
insert into text_test values ('MÜLLER');
insert into text_test values ('mueller');
insert into text_test values ('MUELLER');
--
insert into text_test values ('möller');
insert into text_test values ('moller');
--
insert into text_test values ('mäller');
insert into text_test values ('maller');
--
insert into text_test values ('Médar');

commit;

In this table are some variations of "Muller", upper- and lowercase and with and without the diacritical character.
In order to use Oracle Text to search for the name regardless of diacritic, we need to create a Custom Lexer. This custom lexer is needed, because we need to change the Base Letter setting. From the documentation:

With base-letter conversions enabled, letters with umlauts, acute accents, cedillas, and the like are converted to their basic forms for indexing, so fiancé is indexed both as fiancé and as fiance, and a query of fiancé returns documents containing either form.

And this is exactly what we're after.

To be able to search using the Base Letter conversion, as described in the quote from the ducmentation, we need to create a Preference.
To change the setting of the Base Letter, the default is NO, we to set this attribute to YES.


begin
   ctxsys.ctx_ddl.create_preference ('cust_lexer','BASIC_LEXER');
   ctxsys.ctx_ddl.set_attribute ('cust_lexer','base_letter','YES'); -- removes diacritics
end;
/

In the above code, we create a preference, called CUST_LEXER. Because the main language will be Dutch, I use the BASIC_LEXER. More information regarding this in the Oracle documentation.
The attribute that I want to override is the BASE_LETTER, so set that attribute to YES. And that should take care of it.
Now the only thing to do is create the Oracle Text index and specify that we want to use our preference set.


CREATE INDEX text_test_idx ON text_test(name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER cust_lexer');

Now that we are all set, let's run a query and verify the results.


SQL> select name
  2    from text_test
  3   where contains (name, 'muller') > 0
  4  ;

NAME
-----------------------------------------
müller
MULLER
muller
MÜLLER

And there you have it, it works as advertised.

While writing this blog, a colleague of mine pointed out that in Germany it is accepted to spell the name "müller" as "mueller". Oracle Text can even handle that.
Add this to your preference set, and you're good to go.


ctxsys.ctx_ddl.set_attribute ('cust_lexer','alternate_spelling','GERMAN');

With a little demo:


SQL> select name
  2    from text_test
  3   where contains (name, 'muller') > 0
  4  ;

NAME
-----------------------------------------
muller
müller
MULLER
MÜLLER
mueller
MUELLER

This also works when searching for "Mueller"


SQL> select name
  2    from text_test
  3   where contains (name, 'mueller') > 0
  4  ;

NAME
-----------------------------------------
muller
müller
MULLER
MÜLLER
mueller
MUELLER

The problem with Oracle Text used to be the synchronization of the Text indexes with DML actions. In the old days, you needed to take care of this yourself. Schedule a "make sure the Text index is updated to be in sync with the table". This could mean that after you added a name like "Désiré" to your table, you wouldn't be able to find because it wasn't in the text index.
Starting with Oracle 10g (release 2) you can indicate that the index needs to be synchronized during a COMMIT. Just specify it with the creation of the index, and that's it.


CREATE INDEX text_test_idx ON text_test(name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER cust_lexer SYNC (ON COMMIT)');

This Oracle Text is really cool stuff, and this is just the beginning. The Oracle Documentation is a great source of information. A link is included at the bottom.

Finally the cleanup:


drop table text_test cascade constraints
/
begin
   ctxsys.ctx_ddl.drop_preference ('cust_lexer');
end;
/

Oracle Text Documentation

10 comments:

blogjeApril 17, 2009 at 11:55 PM
Without Oracle Tex:

select *
from text_test
where instr( lower( convert( name, 'US7ASCII' ) ), 'muller') > 0
ReplyDelete
Replies
Alex NuijtenApril 24, 2009 at 10:42 AM
Very creative, but why not take advantage of what Oracle has to offer? How would you index that? It won't hold up with the alternate spelling, though... unless you have another trick up your sleeve? ;)
ReplyDelete
Replies
AnonymousAugust 31, 2010 at 8:41 PM
hi
ReplyDelete
Replies
Gilberto Pe-CurtoAugust 24, 2012 at 1:32 PM
Simple article but makes a difference because it adds something to what oracle official doc provides.
Clarified to me that the preference must be created before the index is re/created.
thanks a lot, Goog Job ! :-)
ReplyDelete
Replies
UnknownSeptember 17, 2014 at 11:42 AM
What about if want to search for contains(name,'mü%') >0
and we want to have result as :

müller first row
ReplyDelete
Replies
UnknownSeptember 17, 2014 at 12:19 PM
thanks for ur help...but I want all other data related to 'mü%'....means i want result like below :
müller
MÜLLER
muller

So I want to use diacritic search
I mean data should return in the order in which search is written.

ReplyDelete
Replies
UnknownSeptember 17, 2014 at 1:35 PM
Score function doesnt work with this....

I have the below query
If possible , when the search is with an accent, display the results with the accent before displaying the results without the accent.
And vice-versa, when the serch isn't with an accent, display the results with no accent before displaying the results with accents.
As an Eg:

I have a situation like below:

select p.last_name from person p where user_id = 'mprakash' and contains(last_name , 'fré%') > 0

FreNormal Frederic Frédéric Frêdéric

result comes as above......accented characters always comes last even if i am giving the search condition with using accented characters 'fré%'

Is there something through which we can get result below: Frédéric Frêdéric FreNormal Frederic

or if I give search as contains(last_name , 'frê%') > 0 then result comes as below :

Frêdéric Frédéric FreNormal Frederic

and if i search as contains(last_name , 'fre%') > 0 then result would be FreNormal Frederic Frêdéric Frédéric
ReplyDelete
Replies

Add comment

Notes on Oracle

Pages

06 April 2009

Oracle Text: diacritic search

10 comments:

Blog Archive

Non Oracle Links