I’ve come across a situation where I have improperly encoded UTF-8 in a text file. I need to remove lines with improper encoding from the file.
Caveat: Keep in mind that Unicode and UTF-8 are different things. Unicode is best described as a character set while UTF-8 is a digital encoding, analogous respectively to the English alphabet and cursive writing. Just as you can write the alphabet in different ways (cursive, print, shorthand, etc.), you can write Unicode characters in various ways … UTF-8 has simply become the most popular.
Finding Multi-Byte Characters
One of the oddities of UTF-8 is that it is a variable-length encoding. ASCII characters require only a single byte, but other characters require two, three, or four bytes. With GNU grep, whose -P option enables Perl-compatible regular expressions, you can list every line containing a byte outside the ASCII range, that is, every line with a multi-byte character, using the following shell command.
grep -P '[^\x00-\x7f]' filename
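To make the variable-length idea concrete, here is a small sketch that counts the bytes in the UTF-8 encoding of four sample characters. The characters are arbitrary examples, written as octal escapes so the snippet does not depend on the encoding of the script file itself.

```shell
# Count the bytes in the UTF-8 encoding of a few sample characters.
printf 'a' | wc -c                  # U+0061 LATIN SMALL LETTER A
printf '\303\251' | wc -c           # U+00E9 LATIN SMALL LETTER E WITH ACUTE
printf '\342\202\254' | wc -c       # U+20AC EURO SIGN
printf '\360\237\230\200' | wc -c   # U+1F600 GRINNING FACE
```

The four counts should come out as 1, 2, 3, and 4 bytes respectively. Only the first character falls in the \x00-\x7f range, so the grep pattern above would match the other three.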
Removing Improperly Encoded UTF-8
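If a file contains improperly encoded UTF-8, the invalid byte sequences can be stripped with the following command. Note that the -c option drops only the invalid bytes themselves, not the whole line that contains them.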
iconv -c -f UTF-8 -t UTF-8 -o outputfile inputfile
If you diff the input and output files, any difference marks a place where invalid bytes were dropped. Hence, when the diff is empty, you know that the input contains only valid UTF-8.
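Since the original goal was to remove whole lines rather than individual bytes, a per-line check is one possible approach. The sketch below is an illustration, not a tuned solution: the function name keep_valid_utf8_lines is hypothetical, and it relies on iconv exiting with a non-zero status when it meets invalid input and -c is not given. It starts one iconv process per line, so it will be slow on large files.

```shell
# Hypothetical helper: print only the lines of file "$1" that are valid UTF-8.
keep_valid_utf8_lines() {
    # The "|| [ -n "$line" ]" part also handles a last line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Without -c, iconv fails on an invalid or truncated sequence,
        # so its exit status tells us whether this line is clean.
        if printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$line"
        fi
    done < "$1"
}
```

It could be used as: keep_valid_utf8_lines inputfile > outputfile. Lines that pass the check are copied unchanged, and lines that fail are skipped entirely.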