Detecting UTF BOM (byte order mark) using PHP

When integrating systems with many different data sources across Europe, you are bound to eventually run into issues with UTF-8 and national character sets such as ISO-8859-1, which is commonly used for Swedish. Even when parsing simple UTF-8 files with comma-separated values, things might pop up to bite you.

One such thing is the UTF byte order mark, or BOM. The byte order mark is the Unicode code point U+FEFF, which UTF-8 encodes as three bytes (0xEF, 0xBB and 0xBF) that sit at the very beginning of the text file. For UTF-16 it indicates the byte order. For UTF-8, which has no byte order ambiguity, it is not really necessary.
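
To make the difference between the encodings concrete, here is a minimal sketch that checks the first bytes of a string against the UTF-8 and UTF-16 BOMs. The function name detectBom() is my own invention for illustration, the code assumes PHP 7.1 or later, and it deliberately ignores UTF-32, so treat it as a starting point rather than a complete detector.

```php
<?php
// Hypothetical helper (not part of PHP): returns the encoding implied by a
// leading BOM, or null when none of the known BOMs is present.
// Assumption: only UTF-8 and UTF-16 are of interest. A UTF-32LE file also
// starts with FF FE and would be reported as UTF-16LE by this sketch.
function detectBom(string $data): ?string
{
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF", // U+FEFF encoded as UTF-8
        'UTF-16BE' => "\xFE\xFF",     // U+FEFF, big endian
        'UTF-16LE' => "\xFF\xFE",     // U+FEFF, little endian
    ];

    foreach ($boms as $encoding => $bom) {
        // strncmp() is binary safe and compares only the first strlen($bom) bytes.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return null;
}
```

Calling detectBom() on the contents of a file saved as "UTF-8 with BOM" should give 'UTF-8', while a plain ASCII or BOM-less UTF-8 file gives null.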

But for UTF-8, especially on Windows, it has become more and more common to use it to indicate that the file is indeed UTF-8. Most text editors handle this well and you won’t ever see these bytes. As it should be.

The problems start when you use PHP's binary-safe string functions such as strcmp() and substr(), which work on raw bytes. Then these three bytes, which are not visible even in var_dump() output, can become bothersome. (You would, however, see that the string length reported by var_dump() is correct and counts the invisible bytes.)
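
A small sketch of the symptom, using an inline string in place of real file contents (the variable names are just for illustration):

```php
<?php
$bom     = "\xEF\xBB\xBF";
$withBom = $bom . "id"; // what you get from a file saved with a BOM
$plain   = "id";        // what you expected to get

var_dump(strlen($withBom));                 // int(5): three BOM bytes plus "id"
var_dump(strcmp($withBom, $plain) !== 0);   // bool(true): the strings differ
var_dump(substr($withBom, 0, 2) === "id");  // bool(false): you get 0xEF 0xBB instead
var_dump(bin2hex(substr($withBom, 0, 3)));  // string(6) "efbbbf"
```

Printing $withBom in a terminal or browser would typically look exactly like "id", which is what makes this kind of bug tedious to find. Dumping the first bytes with bin2hex() is a quick way to check.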

So you need to detect the three bytes and remove the BOM. Below is a simplified example of how to detect and remove them.

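// Read the whole file; file_get_contents() returns false on failure.
$str = file_get_contents('yourfile.utf8.csv');
if ($str === false) {
    die("Could not read file\n");
}
// The UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF.
$bom = pack("CCC", 0xef, 0xbb, 0xbf);
if (0 === strncmp($str, $bom, 3)) {
    echo "BOM detected - file is UTF-8\n";
    // Drop the first three bytes so the rest of the code sees clean data.
    $str = substr($str, 3);
}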
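
If you need this in more than one place, the same check can be wrapped in a small function. The sketch below is one way to do it, assuming PHP 7 or later. The name stripUtf8Bom() and the inline sample data are my own, not anything built into PHP. It also shows why this matters for CSV files: without stripping, the BOM ends up glued to the first column name.

```php
<?php
// Hypothetical helper: returns $text without a leading UTF-8 BOM, if any.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Inline sample standing in for file_get_contents('yourfile.utf8.csv').
$raw = "\xEF\xBB\xBFid,name\n1,Anna\n";

// Without stripping, the first header field still carries the BOM bytes.
$dirtyLines  = explode("\n", trim($raw));
$dirtyHeader = str_getcsv($dirtyLines[0], ',', '"', '\\');

// With stripping, the header is what you would expect.
$cleanLines  = explode("\n", trim(stripUtf8Bom($raw)));
$cleanHeader = str_getcsv($cleanLines[0], ',', '"', '\\');

var_dump($dirtyHeader[0] === 'id'); // bool(false)
var_dump($cleanHeader[0] === 'id'); // bool(true)
```

A lookup like $row['id'] built from the dirty header would fail, even though the header looks perfectly fine when printed.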

Simple. Let me know what you think.

3 thoughts on “Detecting UTF BOM (byte order mark) using PHP”

  1. Hi Anupam,
    I am facing the BOM problem in my PHP code (classic PHP, version 5.2 with MySQL 5). I have a form and a process page that fetches the data from that form, does some calculation and inserts the date into the database when the form is submitted. It then redirects the user to another page using PHP's header() function with a Location header. As you know, header() does not work if any other output has been generated before it is called. Somehow this process page is emitting a BOM, those three invisible bytes, so the page is not redirected as expected. One point I should mention: the db connection page is included at the beginning of this process page. Hoping to get help from you. Thanks in advance.
