Removing accents (and diacritics) in any language from Java

Java 6 contains a few well-hidden gems for language researchers and developers who work with multilingual text, for example on search engines or dictionaries (like Deect in my case). With just a few lines of code, you can remove accents in most languages without much hassle: you are not required to define a translation table or to detect the language.

For the one-minute readers, here is the utility code you are looking for:

import java.text.Normalizer;
import java.text.Normalizer.Form;

// ...

public static String removeAccents(String text) {
    return text == null ? null
        : Normalizer.normalize(text, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
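As a quick usage check, here is a minimal, self-contained sketch that wraps the method above in a class and runs it on a Hungarian test phrase. The class name AccentDemo is only an assumption for illustration; the method body is the one shown above.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical wrapper class, used only to make the snippet runnable.
final class AccentDemo {

    // Same logic as the utility method above: decompose, then drop the marks.
    static String removeAccents(String text) {
        return text == null ? null
            : Normalizer.normalize(text, Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // prints: arvizturo tukorfurogep
        System.out.println(removeAccents("árvíztűrő tükörfúrógép"));
        // null input is passed through unchanged
        System.out.println(removeAccents(null) == null);
    }
}
```

The source file has to be compiled with the right encoding (for example UTF-8) for the string literal to survive intact.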

What is under the hood?

Although the Hungarian phrase árvíztűrő tükörfúrógép contains every accented letter of the Hungarian language, and is a typical and excellent test case for this domain, I will use only the accented letters next to their base letters to show you what happens inside. The code is roughly the following (details are removed for clarity):

String original = "aáeéiíoóöőuúüű AÁEÉIÍOÓÖŐUÚÜŰ";
for (int i = 0; i < original.length(); i++) {
    // we will report on each separate character, to show you how this works
    String text = original.substring(i, i + 1);
    // normalizing
    String decomposed = Normalizer.normalize(text, Form.NFD);
    // removing diacritics
    String removed = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

    // checking the inside content
    System.out.println(text + " (" + asHex(text) + ") -> "
                + decomposed + " (" + asHex(decomposed) + ") -> "
                + removed + " (" + asHex(removed) + ")");
}

// further methods are removed for clarity

And the result is:

a (0061     ) -> a (0061     ) -> a (0061     )
á (00e1     ) -> á (0061 0301) -> a (0061     )
e (0065     ) -> e (0065     ) -> e (0065     )
é (00e9     ) -> é (0065 0301) -> e (0065     )
i (0069     ) -> i (0069     ) -> i (0069     )
í (00ed     ) -> í (0069 0301) -> i (0069     )
o (006f     ) -> o (006f     ) -> o (006f     )
ó (00f3     ) -> ó (006f 0301) -> o (006f     )
ö (00f6     ) -> ö (006f 0308) -> o (006f     )
ő (0151     ) -> ő (006f 030b) -> o (006f     )
u (0075     ) -> u (0075     ) -> u (0075     )
ú (00fa     ) -> ú (0075 0301) -> u (0075     )
ü (00fc     ) -> ü (0075 0308) -> u (0075     )
ű (0171     ) -> ű (0075 030b) -> u (0075     )
  (0020     ) ->   (0020     ) ->   (0020     )
A (0041     ) -> A (0041     ) -> A (0041     )
Á (00c1     ) -> Á (0041 0301) -> A (0041     )
E (0045     ) -> E (0045     ) -> E (0045     )
É (00c9     ) -> É (0045 0301) -> E (0045     )
I (0049     ) -> I (0049     ) -> I (0049     )
Í (00cd     ) -> Í (0049 0301) -> I (0049     )
O (004f     ) -> O (004f     ) -> O (004f     )
Ó (00d3     ) -> Ó (004f 0301) -> O (004f     )
Ö (00d6     ) -> Ö (004f 0308) -> O (004f     )
Ő (0150     ) -> Ő (004f 030b) -> O (004f     )
U (0055     ) -> U (0055     ) -> U (0055     )
Ú (00da     ) -> Ú (0055 0301) -> U (0055     )
Ü (00dc     ) -> Ü (0055 0308) -> U (0055     )
Ű (0170     ) -> Ű (0055 030b) -> U (0055     )
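The asHex helper is not part of the listing. One possible version that would produce the hex columns shown above is sketched here; the class name and the padding to nine characters are my assumptions, inferred only from the layout of the output.

```java
// Hypothetical stand-in for the omitted asHex helper.
final class HexDump {

    // Renders each UTF-16 code unit as four lowercase hex digits,
    // separated by spaces, and pads the result to the width of two
    // code units (9 characters) so the columns line up.
    static String asHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) text.charAt(i)));
        }
        while (sb.length() < 9) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // precomposed a-acute is one code unit: 00e1
        System.out.println("[" + asHex("\u00e1") + "]");
        // decomposed form is two code units: 0061 0301
        System.out.println("[" + asHex("a\u0301") + "]");
    }
}
```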

The Normalizer decomposes each original character into a base character followed by a combining diacritical mark (in some languages a single letter can carry several marks). á, é and í share the same mark: U+0301, the combining acute accent. Likewise, ö and ü share 0308 (the diaeresis), and ő and ű share 030b (the double acute).
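To illustrate the case of several marks on one letter, here is a small sketch using the Vietnamese letter ệ (U+1EC7), which is not in the Hungarian test set. The class and method names are assumptions for illustration only.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class: one letter that decomposes into a base
// character plus two combining marks.
final class StackedMarks {

    static String decompose(String text) {
        return Normalizer.normalize(text, Form.NFD);
    }

    static String strip(String text) {
        return decompose(text)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String decomposed = decompose("\u1ec7");
        StringBuilder hex = new StringBuilder();
        for (char c : decomposed.toCharArray()) {
            hex.append(String.format("%04x ", (int) c));
        }
        // prints: 0065 0323 0302 (e, dot below, circumflex)
        System.out.println(hex.toString().trim());
        // prints: e
        System.out.println(strip("\u1ec7"));
    }
}
```

Because the regular expression ends with +, both marks are removed in a single match.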

The \p{InCombiningDiacriticalMarks}+ regular expression matches every run of characters from the Unicode Combining Diacritical Marks block (U+0300 to U+036F), and we replace each match with an empty string.
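Two limits of this approach are worth knowing, and the sketch below shows both. First, letters such as ø or Ł have no canonical decomposition, so they pass through unchanged. Second, combining marks outside that one block are not matched; a broader variant, given here as an alternative and not as the method of this article, uses the general category \p{M} instead. The class and method names are assumptions.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical comparison of the block-based pattern with \p{M}.
final class MarkStripping {

    // The approach from this article: only the U+0300..U+036F block.
    static String stripBlock(String text) {
        return Normalizer.normalize(text, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Broader alternative: every character in the Mark category.
    static String stripAllMarks(String text) {
        return Normalizer.normalize(text, Form.NFD)
            .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // o-slash (U+00F8) does not decompose, so it is kept as it is
        System.out.println(stripBlock("sm\u00f8rrebr\u00f8d").equals("sm\u00f8rrebr\u00f8d"));
        // U+20D7 (combining right arrow above) lies outside the block
        System.out.println(stripBlock("a\u20d7").length());    // prints: 2
        System.out.println(stripAllMarks("a\u20d7").length()); // prints: 1
    }
}
```

Note that \p{M} also removes marks that carry meaning in other scripts (vowel signs, for instance), so the broader pattern is not always the better choice.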

Simple and elegant, isn't it?

Timestamp: 2011-02-22 22:04
Author
István Soós
technology expert, trainer, business consultant and agile coach