Removing Unicode accents and diacritics with Java
Working with multilingual text can be tricky: language researchers, search engines and dictionaries all need to process the content they work with. One part of that processing is removing accents, which is language- and charset-specific. With the Unicode support in Java 6, this is easy to do without defining a translation table (which might be limited to a set of languages).
Here is the short version:
import java.text.Normalizer;
import java.text.Normalizer.Form;

public static String removeAccents(String text) {
    return text == null ? null :
        Normalizer.normalize(text, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
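The method above is shown without an enclosing class. As a minimal sketch of how it might be tried out, the same logic can be wrapped in a small class with a main method. The class name AccentRemoverDemo is an assumption made for this illustration, not something from the original code.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical wrapper class, used only so the snippet compiles on its own.
class AccentRemoverDemo {

    // Same logic as the method above: decompose, then drop the combining marks.
    static String removeAccents(String text) {
        return text == null ? null
                : Normalizer.normalize(text, Form.NFD)
                        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // The Hungarian test phrase, written with Unicode escapes so that the
        // encoding of the source file does not matter.
        String phrase = "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p";
        System.out.println(removeAccents(phrase)); // prints "arvizturo tukorfurogep"
        System.out.println(removeAccents(null));   // prints "null"
    }
}
```

The null check means callers can pass optional values through without guarding them first.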
What is under the hood?
The Hungarian phrase árvíztűrő tükörfúrógép contains every accented letter of the Hungarian language, which makes it a typical and excellent test case for this kind of work. To show what happens inside, though, I will use only the accented letters next to their base letters. The code is the following (details are removed for clarity):
String original = "aáeéiíoóöőuúüű AÁEÉIÍOÓÖŐUÚÜŰ";
for (int i = 0; i < original.length(); i++) {
    // we will report on each separate character, to show you how this works
    String text = original.substring(i, i + 1);
    // normalizing
    String decomposed = Normalizer.normalize(text, Form.NFD);
    // removing diacritics
    String removed = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    // checking the inside content
    System.out.println(text + " (" + asHex(text) + ") -> "
        + decomposed + " (" + asHex(decomposed) + ") -> "
        + removed + " (" + asHex(removed) + ")");
}
// further methods are removed for clarity
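The asHex helper is one of the methods left out of the listing. One possible shape for it, purely as an assumption about what the omitted code does, is to print every UTF-16 code unit of the string as four hex digits separated by spaces. The class name HexDump is hypothetical, and the exact spacing of the original helper may differ.

```java
// Hypothetical stand-in for the omitted asHex helper.
class HexDump {

    // Renders each UTF-16 code unit as four lowercase hex digits,
    // separated by single spaces, e.g. "0061 0301" for a + U+0301.
    static String asHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }
}
```

With a helper of this kind in scope, the loop above compiles once it is placed inside a method.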
And the result is:
a (0061 ) -> a (0061 ) -> a (0061 )
á (00e1 ) -> á (0061 0301) -> a (0061 )
e (0065 ) -> e (0065 ) -> e (0065 )
é (00e9 ) -> é (0065 0301) -> e (0065 )
i (0069 ) -> i (0069 ) -> i (0069 )
í (00ed ) -> í (0069 0301) -> i (0069 )
o (006f ) -> o (006f ) -> o (006f )
ó (00f3 ) -> ó (006f 0301) -> o (006f )
ö (00f6 ) -> ö (006f 0308) -> o (006f )
ő (0151 ) -> ő (006f 030b) -> o (006f )
u (0075 ) -> u (0075 ) -> u (0075 )
ú (00fa ) -> ú (0075 0301) -> u (0075 )
ü (00fc ) -> ü (0075 0308) -> u (0075 )
ű (0171 ) -> ű (0075 030b) -> u (0075 )
(0020 ) -> (0020 ) -> (0020 )
A (0041 ) -> A (0041 ) -> A (0041 )
Á (00c1 ) -> Á (0041 0301) -> A (0041 )
E (0045 ) -> E (0045 ) -> E (0045 )
É (00c9 ) -> É (0045 0301) -> E (0045 )
I (0049 ) -> I (0049 ) -> I (0049 )
Í (00cd ) -> Í (0049 0301) -> I (0049 )
O (004f ) -> O (004f ) -> O (004f )
Ó (00d3 ) -> Ó (004f 0301) -> O (004f )
Ö (00d6 ) -> Ö (004f 0308) -> O (004f )
Ő (0150 ) -> Ő (004f 030b) -> O (004f )
U (0055 ) -> U (0055 ) -> U (0055 )
Ú (00da ) -> Ú (0055 0301) -> U (0055 )
Ü (00dc ) -> Ü (0055 0308) -> U (0055 )
Ű (0170 ) -> Ű (0055 030b) -> U (0055 )
The Normalizer decomposes each original character into a base character followed by a combining diacritical mark (characters in other languages can decompose into several marks). á, é and í share the same mark: 0301, the combining acute accent. Likewise, 0308 is the combining diaeresis (ö, ü) and 030b is the combining double acute accent (ő, ű). The \p{InCombiningDiacriticalMarks}+ regular expression matches every mark in the Combining Diacritical Marks block (U+0300 to U+036F), and we replace each match with an empty string.
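The two steps can also be checked in isolation. The sketch below (the class and method names are assumptions for illustration) looks at three cases: a letter with one mark, a Vietnamese letter that decomposes into a base letter plus two marks, and ø (U+00F8), which has no canonical decomposition and therefore passes through unchanged. The last case is worth keeping in mind if the text contains such letters.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class; the names are not from the original article.
class DecompositionDemo {

    // Step 1: canonical decomposition (NFD).
    static String decompose(String text) {
        return Normalizer.normalize(text, Form.NFD);
    }

    // Step 2: remove everything in the Combining Diacritical Marks block.
    static String stripMarks(String decomposed) {
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) becomes e followed by U+0301.
        String eAcute = decompose("\u00e9");
        System.out.println(eAcute.length());             // prints 2
        System.out.println(stripMarks(eAcute));          // prints e

        // U+1EBF (e with circumflex and acute) becomes e + U+0302 + U+0301;
        // the + quantifier removes both marks in one match.
        String eCircAcute = decompose("\u1ebf");
        System.out.println(eCircAcute.length());         // prints 3
        System.out.println(stripMarks(eCircAcute));      // prints e

        // U+00F8 (o with stroke) has no canonical decomposition,
        // so neither step changes it.
        String oStroke = "\u00f8";
        System.out.println(stripMarks(decompose(oStroke)).equals(oStroke)); // prints true
    }
}
```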