Removing Unicode accents and diacritics with Java

February 22, 2011

Working with multilingual text can be tricky: language researchers, search engines, and dictionaries all need to process the content they work with. One part of that processing is removing accents, which is normally language- and charset-specific. With the Unicode support in Java 6 (the java.text.Normalizer class), this is easy to do without defining a translation table, which might cover only a limited set of languages.

My dictionary project Deect used this method to index and query words. It is surprisingly easy to normalize text this way.

Unicode characters

Here is the short version:

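import java.text.Normalizer;
import java.text.Normalizer.Form;

// Returns the text with combining diacritical marks removed; null stays null.
public static String removeAccents(String text) {
    return text == null ? null :
        // NFD splits each accented letter into a base letter plus combining mark(s)
        Normalizer.normalize(text, Form.NFD)
            // strip the combining marks, keeping only the base letters
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}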
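To make the indexing and querying use case concrete, here is a minimal sketch of an accent-insensitive lookup built on the method above. The class name AccentInsensitiveIndex, the key helper, and the choice to lower-case keys with Locale.ROOT are assumptions made for illustration; this is not how Deect itself is implemented.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: a word index keyed by accent-free, lower-cased words.
class AccentInsensitiveIndex {
    private final Map<String, String> entries = new HashMap<String, String>();

    // Same method as in the article.
    static String removeAccents(String text) {
        return text == null ? null :
            Normalizer.normalize(text, Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Assumption: folding accents and case is enough to build a lookup key.
    // Expects a non-null word.
    static String key(String word) {
        return removeAccents(word).toLowerCase(Locale.ROOT);
    }

    void put(String word, String definition) {
        entries.put(key(word), definition);
    }

    String lookup(String query) {
        return entries.get(key(query));
    }

    public static void main(String[] args) {
        AccentInsensitiveIndex index = new AccentInsensitiveIndex();
        index.put("tükör", "mirror");
        // Both queries fold to the same key as the stored word.
        System.out.println(index.lookup("tukor"));
        System.out.println(index.lookup("TÜKÖR"));
    }
}
```

The same key function is applied when storing and when querying, so a user who types the word without accents still finds the entry.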

What is under the hood?

Although the Hungarian phrase árvíztűrő tükörfúrógép contains every accented letter of the Hungarian language and is a typical, excellent test case for this domain, I will use only the accented letters alongside their base letters to show what happens inside. The code is the following (helper methods are omitted for clarity):

String original = "aáeéiíoóöőuúüű AÁEÉIÍOÓÖŐUÚÜŰ";
for (int i = 0; i < original.length(); i++) {
    // we will report on each separate character, to show you how this works
    String text = original.substring(i, i + 1);
    // normalizing
    String decomposed = Normalizer.normalize(text, Form.NFD);
    // removing diacritics
    String removed = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

    // checking the inside content
    System.out.println(text + " (" + asHex(text) + ") -> "
                + decomposed + " (" + asHex(decomposed) + ") -> "
                + removed + " (" + asHex(removed) + ")");
}
// further methods are removed for clarity
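The asHex helper is not shown in the article. A minimal stand-in, assuming it simply prints each UTF-16 code unit as four hex digits separated by spaces, might look like the sketch below. The HexDump class name is hypothetical, and unlike the output shown next, this version does not pad single-character results to a fixed width.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical stand-in for the omitted asHex helper; not the author's original code.
class HexDump {
    // Formats each UTF-16 code unit of the text as a 4-digit lowercase hex value.
    static String asHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asHex("á"));                                    // 00e1
        System.out.println(asHex(Normalizer.normalize("á", Form.NFD)));    // 0061 0301
    }
}
```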

And the result is:

a (0061     ) -> a (0061     ) -> a (0061     )
á (00e1     ) -> á (0061 0301) -> a (0061     )
e (0065     ) -> e (0065     ) -> e (0065     )
é (00e9     ) -> é (0065 0301) -> e (0065     )
i (0069     ) -> i (0069     ) -> i (0069     )
í (00ed     ) -> í (0069 0301) -> i (0069     )
o (006f     ) -> o (006f     ) -> o (006f     )
ó (00f3     ) -> ó (006f 0301) -> o (006f     )
ö (00f6     ) -> ö (006f 0308) -> o (006f     )
ő (0151     ) -> ő (006f 030b) -> o (006f     )
u (0075     ) -> u (0075     ) -> u (0075     )
ú (00fa     ) -> ú (0075 0301) -> u (0075     )
ü (00fc     ) -> ü (0075 0308) -> u (0075     )
ű (0171     ) -> ű (0075 030b) -> u (0075     )
  (0020     ) ->   (0020     ) ->   (0020     )
A (0041     ) -> A (0041     ) -> A (0041     )
Á (00c1     ) -> Á (0041 0301) -> A (0041     )
E (0045     ) -> E (0045     ) -> E (0045     )
É (00c9     ) -> É (0045 0301) -> E (0045     )
I (0049     ) -> I (0049     ) -> I (0049     )
Í (00cd     ) -> Í (0049 0301) -> I (0049     )
O (004f     ) -> O (004f     ) -> O (004f     )
Ó (00d3     ) -> Ó (004f 0301) -> O (004f     )
Ö (00d6     ) -> Ö (004f 0308) -> O (004f     )
Ő (0150     ) -> Ő (004f 030b) -> O (004f     )
U (0055     ) -> U (0055     ) -> U (0055     )
Ú (00da     ) -> Ú (0055 0301) -> U (0055     )
Ü (00dc     ) -> Ü (0055 0308) -> U (0055     )
Ű (0170     ) -> Ű (0055 030b) -> U (0055     )

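The Normalizer decomposes each original character into a base character followed by a combining diacritical mark (in some languages a single letter can carry several marks). á, é and í share the same mark: U+0301, the combining acute accent. Likewise, ö and ü share U+0308 (combining diaeresis), and ő and ű share U+030B (combining double acute accent).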
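As an illustration of a letter carrying more than one mark, the sketch below decomposes the Vietnamese letter ệ (U+1EC7, e with circumflex and dot below). The MultipleMarks class and its helper names are made up for this example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class: one precomposed letter that decomposes into several marks.
class MultipleMarks {
    static String decompose(String text) {
        return Normalizer.normalize(text, Form.NFD);
    }

    static String strip(String decomposed) {
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String decomposed = decompose("ệ");
        // One base letter followed by two combining marks.
        System.out.println(decomposed.length());
        // The '+' in the pattern removes the whole run of marks in one match.
        System.out.println(strip(decomposed));
    }
}
```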

The \p{InCombiningDiacriticalMarks}+ regular expression matches every run of characters from the Unicode Combining Diacritical Marks block (U+0300 to U+036F), and we replace each match with an empty string.
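One consequence worth knowing: the approach only affects letters that Unicode decomposes into a base letter plus combining marks. Letters such as Ł have no such decomposition, so they pass through unchanged. The sketch below, with a hypothetical class name, checks both behaviours.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical demo class showing what the pattern does and does not match.
class DiacriticBlockDemo {
    static String removeAccents(String text) {
        return text == null ? null :
            Normalizer.normalize(text, Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // U+0301 (combining acute accent) belongs to the block; a plain letter does not.
        System.out.println("\u0301".matches("\\p{InCombiningDiacriticalMarks}")); // true
        System.out.println("a".matches("\\p{InCombiningDiacriticalMarks}"));      // false

        // ó and ź decompose and lose their marks; Ł has no decomposition and is kept.
        System.out.println(removeAccents("Łódź")); // Łodz
    }
}
```

If such letters also need to be folded to ASCII, a small explicit mapping for them would still be required on top of the normalization step.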

Last updated: August 29, 2014

István Soós

software engineer, business advisor

Advocates for the maker-movement, self-directed learning and agile methods. His regular topics include: machine intelligence, data and risk analysis, distributed systems and knowledge management.
