I’ve come across a situation where I have improperly encoded UTF-8 in a text file. I need to remove lines with improper encoding from the file.

Caveat: Keep in mind that Unicode and UTF-8 are different things. Unicode is best described as a character set, while UTF-8 is a byte-level encoding of it, analogous respectively to the English alphabet and cursive writing. Just as you can write the alphabet in different ways (cursive, print, shorthand, etc.), you can encode Unicode characters in various ways (UTF-8, UTF-16, UTF-32, etc.); UTF-8 has simply become the most popular.
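
To make the distinction concrete, here is a quick sketch showing the same code point, U+00E9 (é), producing different bytes under different encodings. It assumes a UTF-8 terminal locale and the xxd and iconv tools:

printf 'é' | xxd                                # UTF-8:  c3 a9
printf 'é' | iconv -f UTF-8 -t UTF-16BE | xxd   # UTF-16: 00 e9
printf 'é' | iconv -f UTF-8 -t UTF-32BE | xxd   # UTF-32: 00 00 00 e9

One character set, three different byte sequences.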

Finding Multi-Byte Characters

One of the oddities of UTF-8 is that it uses a variable-length byte encoding. ASCII characters require only a single byte, but everything else requires two to four bytes. You can grep a file and find anything encoded with more than a single byte using the following shell command.

grep -P '[^\x00-\x7f]' filename
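
As a quick illustration of the variable-length encoding (assuming a UTF-8 locale, since printf passes the characters through as raw bytes), wc -c counts how many bytes each character occupies:

printf 'a' | wc -c      # 1 byte  (ASCII)
printf 'é' | wc -c      # 2 bytes
printf '€' | wc -c      # 3 bytes
printf '😀' | wc -c     # 4 bytes

Note that -P (Perl-compatible regular expressions) is a GNU grep feature; the stock grep on macOS and some BSDs may not support it.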

Removing Improperly Encoded UTF-8

If a file contains improperly encoded UTF-8, the invalid characters can be found and removed with the following command. The -c flag tells iconv to silently discard any characters that cannot be converted, so re-encoding from UTF-8 to UTF-8 drops everything that isn't valid UTF-8.

iconv -c -f UTF-8 -t UTF-8 -o outputfile inputfile

If you diff the input and output files, you'll see exactly what was removed; when the diff is empty, you know the input contained only valid UTF-8.
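
Here is a minimal end-to-end sketch. It assumes bash's printf (for the \x escape, used to plant a byte that is never valid on its own in UTF-8) and GNU iconv's -o flag; with other iconv implementations, redirect stdout instead:

printf 'good line\nbad \x80 byte\n' > inputfile    # 0x80 is a bare continuation byte: invalid UTF-8
iconv -c -f UTF-8 -t UTF-8 -o outputfile inputfile
diff inputfile outputfile                          # shows the line where the stray byte was dropped

Note that -c drops only the offending bytes and keeps the rest of the line. If you really do want to remove entire lines, GNU grep can do it: grep -ax '.*' inputfile prints only the lines that are fully valid in the current (UTF-8) locale, and adding -v inverts that to show the invalid ones.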

