Archive for February, 2010

Dealing with PHP and character encoding

How many times have you been presented with a character encoding glitch on a PHP site that you thought was working properly? You finished the site months ago with no issues, but now you are being told some weird character is showing up on that page? Chances are, that character is either Ã, Â or Ä, possibly followed by some other random character. If I’m right, keep reading. If not, I offer my deepest condolences, for you are knee deep in a character encoding issue that I cannot help you with.

So what is it that causes that Ã to show up? In short – you have UTF-8 encoded data being displayed as ISO-8859-1. But the more important question is how to prevent this from happening in the future. The easy answer to that – store all data in the same encoding as the final display. Convert the encoding as soon as you receive the data, whether through an input field, reading a file or any other method. Don’t put anything in your database unless you are sure it is in the right encoding.

First, an introduction to the three important character encodings that you are likely to deal with – ASCII, UTF-8 and ISO-8859-1. Character encoding is, in a nutshell, the exact method by which characters are stored and displayed by a computer. Since computers only deal in binary, there must be some way to convert from binary into a character. Character encodings come in two flavors (at least, two flavors I’ll be discussing) – single byte and multi byte encodings. Single byte encodings, such as ASCII and ISO-8859-1, use a single byte to represent a character. For instance, the letter “M” in ASCII is represented as 0x4D, whereas the digit “4” is 0x34.

A single byte is limited to only 256 possible values, which rapidly becomes a limitation when you want to use characters not typically seen in English, such as “ç” or “ß”, and when you begin thinking about Asian languages, you can quickly see there simply isn’t room for all characters. Multi byte encodings use one or more bytes to represent a character. Each byte still only has 256 possibilities, but 2 bytes have 65,536 possibilities. But what if there is confusion as to whether you are using a single or multi byte encoding? Is that short string “0xC3 0xA5” supposed to be the UTF-8 character å or the ISO-8859-1 string Ã¥? And there we are – a sudden bug has shown up.
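The confusion over those two bytes can be reproduced in a few lines. This is a minimal sketch, assuming the mbstring extension is loaded; the variable names are only for illustration. It takes the UTF-8 bytes for “å” and deliberately misreads them as ISO-8859-1, which is what a browser does when it is told the wrong charset.

```php
<?php
// "å" encoded as UTF-8 is the two bytes 0xC3 0xA5.
$utf8 = "\xC3\xA5";

// Simulate a reader that wrongly treats those bytes as ISO-8859-1:
// each byte becomes its own character. The result is re-encoded as
// UTF-8 so that it can be printed on a UTF-8 page or terminal.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";               // c3a5
echo $misread, "\n";                     // Ã¥
echo mb_strlen($utf8, 'UTF-8'), "\n";    // 1
echo mb_strlen($misread, 'UTF-8'), "\n"; // 2
```

One character went in and two came out, which is exactly the glitch described at the top of this post.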

ASCII is the oldest of the three and the most basic. The ASCII character set is just 128 characters (7 bits), and very English-centric. You can see that the basic ASCII table is very restricted to the English speaking world. The so-called extended ASCII tables, which use the eighth bit to add another 128 characters, go off into some weird directions and vary from system to system. Either way, every character fits neatly into a single byte, and so it is a single byte encoding.

ISO-8859-1 is also a single byte encoding, but you can see that it is a bit better in the extended range, which it uses to support many Latin based languages. Unlike ASCII, ISO-8859-1 can handle most Western European languages without too much issue. It fails when you go outside of those languages, however.

The third character encoding is UTF-8, which is designed to support over 1.1 million characters, more than enough for all existing languages (and a few dead ones). Unlike ASCII and ISO-8859-1, UTF-8 is not limited to a single byte, but can vary between 1 and 4 bytes, depending on the character being used.
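To make the variable width concrete, here is a small sketch, again assuming the mbstring extension. The sample characters are written as explicit byte sequences, so the result does not depend on how the file is saved; the labels are just illustrative array keys.

```php
<?php
// One sample character for each possible UTF-8 length.
$samples = array(
    'letter M'      => "\x4D",             // U+004D, 1 byte
    'c with cedilla' => "\xC3\xA7",        // U+00E7, 2 bytes
    'euro sign'     => "\xE2\x82\xAC",     // U+20AC, 3 bytes
    'grinning face' => "\xF0\x9F\x98\x80", // U+1F600, 4 bytes
);

foreach ($samples as $label => $char) {
    // strlen() counts bytes; mb_strlen() counts characters.
    printf(
        "%-15s %d byte(s), %d character(s)\n",
        $label,
        strlen($char),
        mb_strlen($char, 'UTF-8')
    );
}
```

Note that strlen() reports the byte count in every case, which is why it is unsafe for measuring UTF-8 text by character.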

One thing to take note of is that the first 128 characters of all three encodings are identical. This is why a character encoding issue on an English website can take so long to spot. You can go months not knowing that you have a character encoding issue because the only characters you used were the same in all 3 encodings – 0x4D is an M in ASCII, UTF-8 and ISO-8859-1.
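The overlap is easy to check for yourself. The following sketch (assuming mbstring is available) converts every byte from 0x00 to 0x7F from ISO-8859-1 to UTF-8 and counts how many come out unchanged, then shows a byte above that range where the two encodings part ways.

```php
<?php
// Every byte from 0x00 to 0x7F should convert to itself, meaning that
// pure ASCII text is byte-for-byte the same in ISO-8859-1 and UTF-8.
$identical = 0;
for ($i = 0; $i < 128; $i++) {
    if (mb_convert_encoding(chr($i), 'UTF-8', 'ISO-8859-1') === chr($i)) {
        $identical++;
    }
}
echo $identical, "\n"; // 128

// Above 0x7F the encodings diverge: 0xE7 is "ç" in ISO-8859-1,
// but UTF-8 needs two bytes for the same character.
$cedilla = mb_convert_encoding("\xE7", 'UTF-8', 'ISO-8859-1');
echo bin2hex($cedilla), "\n"; // c3a7

// A lone 0xE7 byte is not valid UTF-8 at all.
var_dump(mb_check_encoding("\xE7", 'UTF-8')); // bool(false)
```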

On to the meat of this article – dealing with character encodings in PHP. Namely, how to detect and convert encodings.

The magic bullet of dealing with character encodings in PHP can be summed up in three functions – mb_detect_encoding(), mb_detect_order() and iconv(). The mb functions are designed for working with multibyte encodings, and iconv() converts a string from one encoding to another. One nonobvious restriction is that detection is unreliable for ISO-8859-1: it is a single byte encoding in which any sequence of bytes is valid, so mb_detect_encoding() cannot positively identify it, only fall back to it. mb_detect_order() can be used to specify the order of encodings to scan when attempting to detect an encoding on a string, so list ISO-8859-1 last.

Character encoding conversion in PHP boils down to a simple three step process:

  1. Set the encoding types to scan (ASCII, UTF-8, ISO-8859-1). Luckily, this only has to be done once.
  2. Identify the encoding of the string to convert.
  3. Convert to the desired encoding type (I prefer UTF-8).

But let’s cut to the chase. Sample code!

// tell the multibyte library which encodings to try, in this order
// (ISO-8859-1 goes last, since any byte sequence is valid ISO-8859-1)
mb_detect_order('ASCII, UTF-8, ISO-8859-1');

// tell the browser that we are using UTF-8
header('Content-Type: text/plain; charset=UTF-8');

// assumes this source file is itself saved as UTF-8
$string = "Fün wîth èñcøding!";

// UTF-8 string displays properly
echo $string . "\r\n";

// convert to ISO-8859-1, which will confuse the browser
$string = iconv(
    mb_detect_encoding($string, mb_detect_order(), true), // strict detection
    'ISO-8859-1',
    $string
);
echo $string . "\r\n";

// convert back to UTF-8, which the browser understands
$string = iconv(
    mb_detect_encoding($string, mb_detect_order(), true), // strict detection
    'UTF-8',
    $string
);
echo $string . "\r\n";
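The three steps can also be wrapped in a small helper that is called on every piece of incoming data. The sketch below is not the code from this post: to_utf8() is a hypothetical name, and it swaps mb_detect_encoding() for a simple validity check with mb_check_encoding(). It rests on a loudly stated assumption, namely that your data is only ever ASCII, UTF-8 or ISO-8859-1.

```php
<?php
/**
 * Hypothetical helper: convert a string of unknown encoding to UTF-8.
 * Assumes the input is ASCII, UTF-8 or ISO-8859-1 and nothing else.
 */
function to_utf8($string)
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged,
    // so calling this twice never double-encodes a string.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "F\xFCn";     // "Fün" in ISO-8859-1
$utf8   = "F\xC3\xBCn"; // "Fün" in UTF-8

var_dump(to_utf8($latin1) === $utf8); // bool(true)
var_dump(to_utf8($utf8) === $utf8);   // bool(true)
```

The validity check works because ISO-8859-1 text containing accented characters is very rarely valid UTF-8 by accident. It will still guess wrong if the input is in some other single byte encoding, such as Windows-1252.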


How to unchunk data received through PHP sockets

When using the PHP socket functions to pull data from another website, you’ll often find yourself dealing with chunked data (HTTP/1.1 responses sent with Transfer-Encoding: chunked). Chunked data often looks like this:

1e
this line is 30 characters lon
2c
g, while this line is 44 characters long.
an
21
d this line is 33 characters long

As you can see, each chunk is preceded by a hex string. This hex string corresponds to the length of the following chunk. Line breaks are included in the chunk, as you can see in the second chunk. So to unchunk the string, you have to find every hex string that is preceded by, and followed by, a line break (or the start or end of the string), and if the following block of text is equal in length to the hex value, strip out the hex value.

I accomplished this with a regular expression and preg_replace_callback. In essence, it finds each block of text that follows this pattern:

  1. start of string / hex string on its own line
  2. anything
  3. hex string on its own line / end of string

Once it has found all of those patterns, it replaces only the ones where the hex value of #1 is equal to the string length of #2.

My unchunk function:

function unchunk($result) {
    return preg_replace_callback(
        '/(?:(?:\r\n|\n)|^)([0-9A-F]+)(?:\r\n|\n){1,2}(.*?)'
        .'((?:\r\n|\n)(?:[0-9A-F]+(?:\r\n|\n))|$)/si',
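        // keep only the chunk body when the hex size matches its length;
        // otherwise leave the match untouched
        // (a closure replaces create_function(), which was removed in PHP 8)
        function ($matches) {
            return hexdec($matches[1]) == strlen($matches[2]) ?
                $matches[2] :
                $matches[0];
        },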
        $result
    );
}
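If the regular expression ever trips over a chunk whose body happens to contain a hex-looking line, a length-driven loop is a possible alternative. The following is only a sketch, not the function above: decode_chunked() is a hypothetical name, and it assumes the body uses CRLF line endings, as HTTP/1.1 sends them over the wire, and ends with a zero-size chunk.

```php
<?php
/**
 * Hypothetical alternative to unchunk(): walk the body chunk by chunk,
 * using each hex size to decide how many bytes to copy.
 * Assumes CRLF line endings and a terminating zero-size chunk.
 */
function decode_chunked($body)
{
    $decoded = '';
    $offset  = 0;
    $length  = strlen($body);

    while ($offset < $length) {
        // The size line ends at the next CRLF.
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // truncated input; return what has been decoded so far
        }

        // Drop any chunk extension after ";" and parse the hex size.
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $parts    = explode(';', $sizeLine, 2);
        $size     = hexdec(trim($parts[0]));
        if ($size === 0) {
            break; // a zero-size chunk marks the end of the body
        }

        // Copy exactly $size bytes of chunk data.
        $decoded .= substr($body, $lineEnd + 2, $size);

        // Skip past the chunk data and the CRLF that follows it.
        $offset = $lineEnd + 2 + $size + 2;
    }

    return $decoded;
}

$chunked = "5\r\nHello\r\n8\r\n, world\n\r\n0\r\n\r\n";
echo decode_chunked($chunked); // Hello, world (followed by a newline)
```

Because the loop trusts the size rather than searching for the next hex line, a line break or a hex-looking line inside a chunk cannot confuse it.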


Greased PMA – Stripping the suck out of phpMyAdmin’s SQL query window

Anyone who has used phpMyAdmin’s query window knows how frustratingly small the textarea is, and how difficult editing SQL query blocks can be. I wrote a Greasemonkey script to de-suck this query window. As of now, Greased PMA is version 0.1 and has the following features:

  • Automatic resizing of the textarea to make best use of the window size
  • Ctrl+Enter submits the query
  • Tab inserts 4 spaces into the query

Click here to download Greased PMA version 0.1.

Update: While messing around at home on Vista / Chrome, I noticed that Greased PMA partially works on Chrome. I wasn’t expecting this to work outside of Firefox / Greasemonkey, but Chrome has a similar extension feature, it seems. The textarea resizing works, but submitting the query with Ctrl+Enter and the Tab modification do not. Perhaps this weekend I will debug those two functions on Chrome.
