Removing accents (and diacritics) in any language with Java
Java 6 contains a few well-hidden gems for language researchers and developers who work with multilingual text, for example in search engines or dictionaries (like Deect in my case). With just a few lines of code, you can remove accents in most languages without much hassle: you are not required to define a translation table or to run language detection.
For the one-minute readers, here is the utility code you are looking for:
import java.text.Normalizer;
import java.text.Normalizer.Form;
// ...
public static String removeAccents(String text) {
    return text == null ? null
            : Normalizer.normalize(text, Form.NFD)
                    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
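If you want to try it right away, here is a minimal, self-contained sketch that wraps the utility in a class and runs it on a Hungarian test phrase (árvíztűrő tükörfúrógép). The class name AccentDemo is only an assumption for illustration, and the phrase is written with Unicode escapes so that the source compiles regardless of file encoding:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical wrapper class, used only for this illustration.
final class AccentDemo {

    // Same logic as the utility above: decompose, then drop the combining marks.
    static String removeAccents(String text) {
        return text == null ? null
                : Normalizer.normalize(text, Form.NFD)
                        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // Hungarian test phrase, written with Unicode escapes to keep the source ASCII-only.
        String phrase = "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p";
        System.out.println(removeAccents(phrase)); // prints "arvizturo tukorfurogep"
        System.out.println(removeAccents(null));   // prints "null"
    }
}
```

The null check matters in practice: Normalizer.normalize throws a NullPointerException when it receives null, so the utility returns null early instead.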
What is under the hood?
Although the Hungarian phrase árvíztűrő tükörfúrógép contains all the accented letters of the Hungarian language, and it is a typical and excellent test case for this domain, I will use only the accented letters next to their base letters to show you what happens inside. The code is the following (helper methods are omitted for clarity):
String original = "aáeéiíoóöőuúüű AÁEÉIÍOÓÖŐUÚÜŰ";
for (int i = 0; i < original.length(); i++) {
    // we will report on each separate character, to show you how this works
    String text = original.substring(i, i + 1);
    // normalizing
    String decomposed = Normalizer.normalize(text, Form.NFD);
    // removing diacritics
    String removed = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    // checking the inside content
    System.out.println(text + " (" + asHex(text) + ") -> "
            + decomposed + " (" + asHex(decomposed) + ") -> "
            + removed + " (" + asHex(removed) + ")");
}
// further methods are removed for clarity
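The asHex helper is not shown above. One possible version, purely as an assumption about what such a helper could look like, prints every UTF-16 code unit as four lowercase hex digits separated by spaces. The class name HexDump is hypothetical, and the exact spacing may differ from the output listed below:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical helper class; the original asHex implementation is not shown in the post.
final class HexDump {

    // Formats each UTF-16 code unit as four lowercase hex digits, separated by single spaces.
    static String asHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E1 is the precomposed letter a with acute accent.
        String decomposed = Normalizer.normalize("\u00e1", Form.NFD);
        System.out.println(asHex("\u00e1"));   // prints "00e1"
        System.out.println(asHex(decomposed)); // prints "0061 0301"
    }
}
```

Note that this works on char values, so a character outside the Basic Multilingual Plane would show up as two surrogate code units. That is fine for the Hungarian letters used here.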
And the result is:
a (0061 ) -> a (0061 ) -> a (0061 )
á (00e1 ) -> á (0061 0301) -> a (0061 )
e (0065 ) -> e (0065 ) -> e (0065 )
é (00e9 ) -> é (0065 0301) -> e (0065 )
i (0069 ) -> i (0069 ) -> i (0069 )
í (00ed ) -> í (0069 0301) -> i (0069 )
o (006f ) -> o (006f ) -> o (006f )
ó (00f3 ) -> ó (006f 0301) -> o (006f )
ö (00f6 ) -> ö (006f 0308) -> o (006f )
ő (0151 ) -> ő (006f 030b) -> o (006f )
u (0075 ) -> u (0075 ) -> u (0075 )
ú (00fa ) -> ú (0075 0301) -> u (0075 )
ü (00fc ) -> ü (0075 0308) -> u (0075 )
ű (0171 ) -> ű (0075 030b) -> u (0075 )
(0020 ) -> (0020 ) -> (0020 )
A (0041 ) -> A (0041 ) -> A (0041 )
Á (00c1 ) -> Á (0041 0301) -> A (0041 )
E (0045 ) -> E (0045 ) -> E (0045 )
É (00c9 ) -> É (0045 0301) -> E (0045 )
I (0049 ) -> I (0049 ) -> I (0049 )
Í (00cd ) -> Í (0049 0301) -> I (0049 )
O (004f ) -> O (004f ) -> O (004f )
Ó (00d3 ) -> Ó (004f 0301) -> O (004f )
Ö (00d6 ) -> Ö (004f 0308) -> O (004f )
Ő (0150 ) -> Ő (004f 030b) -> O (004f )
U (0055 ) -> U (0055 ) -> U (0055 )
Ú (00da ) -> Ú (0055 0301) -> U (0055 )
Ü (00dc ) -> Ü (0055 0308) -> U (0055 )
Ű (0170 ) -> Ű (0055 030b) -> U (0055 )
The Normalizer decomposes each original character into a base character followed by a combining diacritical mark (some languages stack several marks on one letter). á, é and í share the same mark: U+0301, the combining acute accent. Likewise, ö and ü share U+0308 (the diaeresis), and ő and ű share U+030B (the double acute accent).
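To illustrate the case of multiple marks, here is a small sketch using a letter that is not part of the Hungarian test set: the Vietnamese letter U+1EC7 (e with circumflex and dot below). The class name MultiMarkDemo is an assumption for illustration only:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class showing a letter that decomposes into more than one mark.
final class MultiMarkDemo {

    // Canonical decomposition: base characters followed by their combining marks.
    static String decompose(String text) {
        return Normalizer.normalize(text, Form.NFD);
    }

    // Removes the combining marks left behind by the decomposition.
    static String strip(String text) {
        return decompose(text).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String letter = "\u1ec7"; // one precomposed character
        String decomposed = decompose(letter);
        System.out.println(letter.length());     // prints 1
        System.out.println(decomposed.length()); // prints 3: the base letter e plus two marks
        System.out.println(strip(letter));       // prints "e"
    }
}
```

Because the regular expression ends with +, both marks are removed in a single match.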
The \p{InCombiningDiacriticalMarks}+ regular expression matches one or more characters from the Unicode Combining Diacritical Marks block (U+0300 to U+036F), and we replace every such match with an empty string.
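The following sketch makes the scope of that expression a bit more tangible. It compares the block name with an explicit range, and it shows that letters without a canonical decomposition, such as Ł (U+0141) or ø (U+00F8), pass through unchanged. The class and method names are assumptions for illustration:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class exploring what the block-based expression does and does not remove.
final class BlockScopeDemo {

    // Strips marks using the named Unicode block, as in the article.
    static String stripBlock(String text) {
        return Normalizer.normalize(text, Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Strips marks using the explicit code point range of the same block.
    static String stripRange(String text) {
        return Normalizer.normalize(text, Form.NFD)
                .replaceAll("[\\u0300-\\u036f]+", "");
    }

    public static void main(String[] args) {
        // The Polish city name Lodz with its diacritics: U+0141, U+00F3, d, U+017A.
        String city = "\u0141\u00f3d\u017a";
        // The accents on o and z are removed, but U+0141 has no decomposition and stays.
        System.out.println(stripBlock(city).equals("\u0141odz"));       // prints true
        System.out.println(stripBlock(city).equals(stripRange(city)));  // prints true
        // U+00F8 (o with stroke) is also left as it is.
        System.out.println(stripBlock("\u00f8").equals("\u00f8"));      // prints true
    }
}
```

If such letters matter for your use case, you would still need a small replacement table for them on top of the normalizer-based approach.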
Simple and elegant, isn't it?
