Dealing with PHP and character encoding

How many times have you been presented with a character encoding glitch on a PHP site that you thought was working properly? You finished the site months ago with no issues, but now you are being told some weird character is showing up on that page? Chances are, that character is either Ã, Â or Ä, possibly followed by some other random character. If I’m right, keep reading. If not, I offer my deepest condolences, for you are knee deep in a character encoding issue that I cannot help you with.

So what is it that causes that à to show up? In short – you have UTF-8 encoded data set to display as ISO-8859-1. But the more important question is how to prevent this from happening in the future. The easy answer to that – store all data in the same format as the final display. Convert the encoding as soon as you receive the data, either through an input field, reading a file or any other method. Don’t put anything in your database unless you are sure it is in the right format.

First, an introduction to the three important character encodings that you are likely to deal with – ASCII, UTF-8 and ISO-8859-1. Character encoding is, in a nutshell, the exact method by which characters are stored and displayed by a computer. Since computers only deal in binary, there must be some way to convert from binary into a character. Character encodings come in two flavors (at least, two flavors I’ll be discussing) – single byte and multi byte encodings. Single byte encodings, such as ASCII and ISO-8859-1, use a single byte to represent a character. For instance, the letter “M” in ASCII is represented as 0x4D, whereas the number 4 is 0x34.

A single byte is limited to only 256 possible values, which rapidly becomes a limitation when you want to use characters not typically seen in English, such as “ç” or “ß”, and when you begin thinking about Asian languages, you can quickly see there simply isn’t room for all characters. Multi byte encodings use one or more bytes to represent a character. Each byte still only has 256 possibilities, but 2 bytes have 65,536 possibilities. But what if there is confusion as to whether you are using a single or multi byte encoding? Is the byte sequence “0xC3 0xA5” supposed to be the UTF-8 character å or the ISO-8859-1 string Ã¥? And there we are – a sudden bug has shown up.
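To make the confusion concrete, here is a minimal sketch of what happens when the two UTF-8 bytes for å are read as ISO-8859-1. It uses iconv() to play the part of the confused reader; the variable names are only for illustration.

```php
<?php
// "å" as UTF-8 is the two bytes 0xC3 0xA5 (written as escapes so the
// example does not depend on how this file itself is saved).
$utf8 = "\xC3\xA5";

echo bin2hex($utf8), "\n";    // c3a5
echo strlen($utf8), "\n";     // 2 (strlen counts bytes, not characters)

// Simulate a reader that wrongly assumes ISO-8859-1: each byte is treated
// as its own character, and re-encoding those two characters as UTF-8
// produces the familiar "Ã¥".
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($misread), "\n"; // c383c2a5
echo $misread, "\n";
```

The single character has become two, and the first of them is the Ã from the start of this article.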

ASCII is the oldest of the three and the most basic. ASCII defines just 128 characters and is very English-centric; you can see that the basic ASCII table is very restricted to the English speaking world. The various “extended ASCII” tables, which use the remaining 128 values of a byte, go off into some weird directions, differ from one another and are rarely used. Either way, every character fits neatly into a single byte, and so it is a single byte encoding.

ISO-8859-1 is also a single byte encoding, but you can see that it is a bit better in the extended range to support many Latin based languages. Unlike ASCII, ISO-8859-1 can handle the majority of Latin based languages without too much issue. It fails when you go outside of the Latin based languages, however.

The third character encoding is UTF-8, which is designed to support over 1.1 million characters, more than enough for all existing languages (and a few dead ones). Unlike ASCII and ISO-8859-1, UTF-8 is not limited to a single byte, but can vary between 1 and 4 bytes, depending on the character being used.
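As a rough illustration of the 1 to 4 byte range, the sketch below prints the byte count and character count for four sample characters. The sample characters are my own choice, and iconv_strlen() is used here only as one way to count characters rather than bytes.

```php
<?php
// Each character is written as raw UTF-8 byte escapes so the result does
// not depend on the encoding of this file.
$samples = [
    'U+004D (M)'      => "\x4D",
    'U+00E7 (c with cedilla)' => "\xC3\xA7",
    'U+20AC (euro sign)'      => "\xE2\x82\xAC",
    'U+1F600 (emoji)'         => "\xF0\x9F\x98\x80",
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes; iconv_strlen() counts characters.
    printf(
        "%s: %d byte(s), %d character(s)\n",
        $label,
        strlen($char),
        iconv_strlen($char, 'UTF-8')
    );
}
```

Only the first sample, which is plain ASCII, fits in one byte.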

One thing to take note of is that the first 128 characters of all three encodings are identical. This is why a character encoding issue on an English website can take so long to spot. You can go months not knowing that you have a character encoding issue because the only characters you used were the same in all 3 encodings – 0x4D is an M in ASCII, UTF-8 and ISO-8859-1.

On to the meat of this article – dealing with character encodings in PHP. Namely, how to detect and convert encodings.

The magic bullet of dealing with character encodings in PHP can be summed up in three functions – mb_detect_encoding(), mb_detect_order() and iconv(). The mb functions are designed for working with multibyte encodings and iconv() converts a string from one encoding to another. One nonobvious restriction is that a single byte encoding such as ISO-8859-1 cannot be detected reliably: any sequence of bytes is valid ISO-8859-1, so it only makes sense as the last entry in the detection order, as a fallback. mb_detect_order() can be used to specify the order of encodings to scan when attempting to detect an encoding on a string.

The basics of dealing with character encoding conversion in PHP is a simple three step process:

  1. Set the encoding types to scan (ASCII, UTF-8, ISO-8859-1). Luckily, this only has to be done once.
  2. Identify the encoding of the string to convert.
  3. Convert to the desired encoding type (I prefer UTF-8).

But, let’s cut the crap. Sample code!

// tell the multibyte library which encodings to use
mb_detect_order('ASCII, UTF-8, ISO-8859-1');

// tell the browser that we are using UTF-8
header('content-type: text/plain; charset=UTF-8');

$string = "Fün wîth èñcøding!";

// UTF string displays properly
echo $string . "\r\n";

// convert to ISO-8859-1, which will confuse the browser
$string = iconv(
    mb_detect_encoding($string),
    'ISO-8859-1',
    $string
);
echo $string . "\r\n";

// convert back to UTF-8, which the browser understands
$string = iconv(
    mb_detect_encoding($string),
    'UTF-8',
    $string
);
echo $string . "\r\n";
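The advice from earlier, converting data as soon as you receive it, could be wrapped in a small helper along these lines. This is only a sketch: to_utf8() is a hypothetical name, and it assumes the input is either already UTF-8 (or plain ASCII) or else ISO-8859-1. It checks validity with preg_match() and the /u modifier; mb_check_encoding() would be a more direct check if the mbstring extension is loaded.

```php
<?php
// Hypothetical helper, not part of the sample above: normalize a string to
// UTF-8, assuming it is either valid UTF-8 already or ISO-8859-1.
function to_utf8($input)
{
    // With the /u modifier PCRE validates the subject as UTF-8 before
    // matching, so an empty pattern doubles as a validity check.
    // Pure ASCII passes too, since ASCII is a subset of UTF-8.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }

    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return iconv('ISO-8859-1', 'UTF-8', $input);
}

$fromLatin1 = to_utf8("F\xFCn");     // "Fün" as ISO-8859-1 bytes
$fromUtf8   = to_utf8("F\xC3\xBCn"); // "Fün" already in UTF-8

var_dump($fromLatin1 === $fromUtf8); // bool(true)
```

Checking for valid UTF-8 first avoids converting a string twice, which is exactly how the Ã characters end up in a database.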


How to unchunk data received through PHP sockets

When using the PHP socket functions to pull data from another website, you’ll often find yourself dealing with chunked data. Chunked data often looks like this:

1e
this line is 30 characters lon
2c
g, while this line is 44 characters long.
an
21
d this line is 33 characters long

As you can see, each chunk is preceded by a hex string. This hex string corresponds to the string length of the following chunk. Line breaks are included in the chunk, as you can see in the second chunk. So to unchunk the string, you have to find every hex string that is preceded by, and followed by, a line break (or the start or end of the string), and if the following block of text is equal in length to the hex value, strip out the hex value.

I accomplished this with a regular expression and preg_replace_callback. In essence, it finds each block of text that follows this pattern:

  1. start of string / hex string on its own line
  2. anything
  3. hex string on its own line / end of string

Once it has found all of those patterns, it replaces only the ones where the hex value of #1 is equal to the string length of #2.

My unchunk function:

function unchunk($result) {
    return preg_replace_callback(
        '/(?:(?:\r\n|\n)|^)([0-9A-F]+)(?:\r\n|\n){1,2}(.*?)'
        .'((?:\r\n|\n)(?:[0-9A-F]+(?:\r\n|\n))|$)/si',
        // create_function() was removed in PHP 8.0, so use a closure instead
        function ($matches) {
            // only strip the hex value when it matches the chunk length
            return hexdec($matches[1]) == strlen($matches[2])
                ? $matches[2]
                : $matches[0];
        },
        $result
    );
}
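For comparison, here is a sketch of a different technique: instead of a regular expression, it reads each hex length and then takes exactly that many bytes. The function name unchunk_by_length() is hypothetical, and the sketch assumes each size line holds nothing but the hex length and that a zero-length chunk ends the data.

```php
<?php
// Hypothetical alternative to the regex approach: walk the data, read each
// hex length, then copy exactly that many bytes.
function unchunk_by_length($data)
{
    $out = '';
    $pos = 0;
    $len = strlen($data);

    while ($pos < $len) {
        // The chunk size is the hex string up to the next line break.
        $eol = strpos($data, "\n", $pos);
        if ($eol === false) {
            break;
        }
        $sizeLine = trim(substr($data, $pos, $eol - $pos));
        if (!preg_match('/^[0-9a-fA-F]+$/', $sizeLine)) {
            break; // not a chunk header; stop rather than guess
        }
        $size = hexdec($sizeLine);
        if ($size === 0) {
            break; // assumed: a zero-length chunk marks the end
        }

        $out .= substr($data, $eol + 1, $size);

        // Skip past the chunk and the line break that follows it.
        $pos = $eol + 1 + $size;
        if (substr($data, $pos, 2) === "\r\n") {
            $pos += 2;
        } elseif (substr($data, $pos, 1) === "\n") {
            $pos += 1;
        }
    }

    return $out;
}

// The sample from above, with a terminating zero-length chunk added.
$chunked = "1e\r\nthis line is 30 characters lon\r\n"
         . "2c\r\ng, while this line is 44 characters long.\nan\r\n"
         . "21\r\nd this line is 33 characters long\r\n"
         . "0\r\n\r\n";

echo unchunk_by_length($chunked);
```

Because the length decides where a chunk ends, a line inside the data that happens to look like a hex number is not mistaken for a chunk header.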


Greased PMA – Stripping the suck out of phpmyadmin’s SQL query window

Anyone who has used phpmyadmin’s query window knows how frustratingly small the textarea is, and how difficult editing SQL query blocks can be. I wrote a Greasemonkey script to de-suck this query window. As of now, Greased PMA is version 0.1 and has the following features:

  • Automatic resizing of the textarea to make best use of the window size
  • Ctrl+Enter submits the query
  • Tab inserts 4 spaces into the query

Click here to download Greased PMA version 0.1.

Update: While messing around at home on Vista / Chrome, I noticed that Greased PMA partially works on Chrome. I wasn’t expecting this to work outside of Firefox / Greasemonkey, but Chrome has a similar extension feature, it seems. The textarea resizing works, but the Ctrl+Enter submitting query and tab modification do not work. Perhaps this weekend I will debug those two functions on Chrome.


What makes a user want to buy digital content?

As you can learn from a cursory glance at this site, I am a professional computer programmer. I also have side projects that I would like to one day monetize. One such side project is my planned Android app remind@home. On the other hand, I am a content consumer in the digital age, using the internet to acquire new content, sometimes in ways the content creator might not find appealing. Sometimes I pay for content (such as my recent Steam purchase of Dragon Age: Origins) and sometimes I do not pay for content (such as all apps I’ve installed on my Droid, except my podcast app BeyondPod).

Do I acquire content for free because I am cheap? Absolutely not. I buy when the content creator has provided sufficient incentive to buy. As a content creator, I strive to understand those reasons, both within myself and within others. For this reason, I have begun asking friends what a content creator can do to encourage a purchase. I am not concerned with preventing unauthorized downloads or free software that is asking for donations. I am only concerned with finding out what a content creator can do to encourage users to pay for their content. After my discussions with other techies, here are some of the ways a content creator can encourage users to pay:

Release a sample

Release a free sample that can be easily upgraded to the full version. This allows me to find out if I like your content before I commit money. iTunes and Amazon MP3 Store are examples of this from a music perspective, and game demos accomplish the same thing for game developers. If I am unsure of the quality of your product, I am unlikely to blindly commit money. Through Amazon MP3, I can sample every song on an album before I decide to buy. If you do not provide an easy to locate sample, I will still find a free way to sample the content. The alternative is an unauthorized acquisition of your content, and in that scenario I am unlikely to pay if I like the content. After all, why pay for what I already have? By placing barriers to sampling, you aren’t hindering my ability to sample, you are just putting yourself in a bad place if I like what I’ve sampled.

For PC game developers, this is even more critical. Unlike console games, PC games are not always playable on every PC. A demo not only lets me find out if I like the game, but it lets me find out if I can even run the game. There is nothing worse than paying $50 to find out my video card can’t handle your game. Just like in the music sampling example above, instead I can just acquire the game through other channels. If it doesn’t work, I still have my $50. If it does work, where does that leave you?

Continuously update your content

This rule applies more to software than any other form of content, but it is an important rule nonetheless. If I purchase software that is constantly updated, and only authorized copies can be updated directly from the developer, I have an incentive to pay. Fear not, all of the updates will be leaked through unauthorized channels, but the convenience of a built in updater cannot be overstated. Imagine downloading a program to manage a database of clients and receiving a weekly update with bug fixes and occasional new features. These updates can come through two channels – an authorized update system that came with the purchase of the software, or a weekly hunt through P2P networks for the newest version. The convenience of a built in updater is easily worth the $50, assuming the updates are worthwhile. Releasing your software and then abandoning it gives me no incentive to purchase a copy. That just guarantees that the version available on a P2P network is the same as a legitimate copy, permanently.

Be easy to acquire

I’ve purchased over a dozen games from Steam simply because it is easy to pay for and acquire my game. The user interface for acquiring your content needs to be easy to use. If I am having trouble locating your content digitally through authorized channels, I’ll simply start looking through unauthorized channels. Make it easy enough that I never consider looking for alternate methods to get your content. Sometimes the issue isn’t how much your content costs, but how much of a hassle it is to locate and acquire.

As a content provider, make use of services like Amazon MP3 store, Audible and Steam. You don’t have to create your own system of distribution, but use a reliable distribution system. Users want convenience above all else when it comes to acquiring new content. The more convenient your distribution system is, the less likely a person is to seek alternate methods. There will always be those that simply don’t want to pay. There is almost nothing you can do about that crowd. Instead, focus on the crowd that is willing to pay. Find out what barriers are in their way and remove those barriers.

Offer your content digitally

This should go without saying. If you want people to pay you for your digital content, then offer that digital content. Rather than offer an explanation, I will just provide one of the conversations I had about this subject:

Steve: How often do you acquire unauthorized digital content?

Anonymous: Maybe twice a month, it really depends if there’s an HBO or Showtime show I want to watch.

Steve: What could a content creator do to encourage you to buy?

Anonymous: Offer a place to buy it online. Just make it possible. Do it the day after, not months later. I mean, I pay to torrent now, since I use newsgroups. It’s access that’s the issue, not price.

Steve: So the only reason HBO and Showtime aren’t getting your money is because they aren’t giving you a way to give it to them? Someone is getting the money, and if HBO or Showtime set up a good system, they’d have it. But, they are making it impossible for you to get, through them, what you can certainly get elsewhere.

Anonymous: Yeah.


Dragon Age: Origins has me psyched for Mass Effect 2

I got burned out from World of Warcraft last weekend and decided to go snag Dragon Age: Origins to waste some time before the much awaited Mass Effect 2 is rocking my face in ways I once thought unrockable. Although I had a hell of a time getting it installed due to Physx issues with my ATi Radeon 3870, once the game was going, it sure did impress the hell out of me. I ended up having to upgrade to an nVidia Geforce GTX 275 to get past my issues.

Dragon Age: Origins, while fantastic in its own right, is really just hyping me up for Mass Effect 2 on Jan 26. I really liked Mass Effect, Bioware fixed all of my gripes with the game, and Dragon Age: Origins shows great progress in many areas. I took a few days off work for Mass Effect 2 and I know I won’t regret it.


This week on twitter – 2010-01-17

  • Brewing my stout again today. Try number three. #
  • I like the new WoW armory. The model viewer and activity feed are both really cool. I think I'll add my char's activity feed onto my site. #
  • Khourys, my beer store, just got more awesome. They were playing The Decemberists on the speakers. #
  • I had been hearing reports of an earthquake in Haiti all day, but I just now read an article on it. I had no idea it was this bad… #
  • I've been having a number of blue screen problems lately… If this persists, I might go get Windows 7. #
  • I just bought Speaker for the Dead #

Powered by Twitter Tools


This week on twitter – 2010-01-10

  • By the way, that last comment about car stereos was not about my car. I heard a car last night where the bass sounded like a clutch grinding #
  • When your car stereo can be mistaken for engine trouble, turn it down or upgrade it. #


Storing on/off switches through binary representation

As part of the Nightlife Project, you will have to store some data through binary representation in the payment_rates table. While doing the writeup on the MySQL schema, I decided that this section was too large to easily fit inside of that article and it should be its own article. The goal of this article is to explain how database schemas employ the principles of binary notation to store a sequence of on/off flags in a single field.

Imagine this scenario: You are the head of maintenance at an office and are trying to determine which lightbulbs get the most usage and should be replaced with high efficiency bulbs. In order to determine which gets the most usage, you walk through the building in the morning, at lunch and again at night and make a note of which bulbs are on and which are off. Once you’ve gathered your data, you store the information in a database and after 2 months, you will view the results and determine the top used bulbs and replace them with higher efficiency bulbs to save electricity.

After each walk through the building, you have a list of lights and a corresponding on/off value for each one. Your list may look something like this:

Room        State
Reception   On
Break Room  Off
Bathroom    Off
Office 1    Off
Office 2    On
Office 3    On
Office 4    Off
Hallway     On
Copy Room   Off

The obvious way to store this data is to create a table called ‘lights’ and give it 10 fields – a timestamp (we need to know when this walk through occurred, after all) and a field for each of the 9 rooms. However, this obvious way has a few nasty shortcomings. What if you decide to add additional lights to be checked, such as desk lamps or the front walkway? Adding additional fields rarely ends well. This is also extremely inefficient, as you now have a 10 field table when a 2 field table would suffice.

Instead of storing it through this obvious but inefficient method, you can convert that list of on/off switches into a binary string. Using the same data above, the binary string would look like this: 100011010. Each of the lights that are on is a 1 and each light that is off is a 0. Each character represents a room – the first position is reception, the second position is break room, etc. Converting this binary string to decimal, 100011010 becomes 282 (256 + 16 + 8 + 2, one power of two for each light that is on).

Room        Bit
Reception   1
Break Room  0
Bathroom    0
Office 1    0
Office 2    1
Office 3    1
Office 4    0
Hallway     1
Copy Room   0
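The conversion can be sketched in a few lines of PHP. The array below simply restates the walk-through data, and the variable names are only for illustration.

```php
<?php
// Room order follows the list above, leftmost bit first.
$lights = [
    'Reception'  => true,
    'Break Room' => false,
    'Bathroom'   => false,
    'Office 1'   => false,
    'Office 2'   => true,
    'Office 3'   => true,
    'Office 4'   => false,
    'Hallway'    => true,
    'Copy Room'  => false,
];

// Build the binary string from the on/off values.
$bits = '';
foreach ($lights as $isOn) {
    $bits .= $isOn ? '1' : '0';
}

// 256 + 16 + 8 + 2 = 282
$stored = bindec($bits);
echo $bits, ' = ', $stored, "\n"; // 100011010 = 282

// Reading it back: pad to one digit per light so leading zeros survive.
$restored = str_pad(decbin($stored), count($lights), '0', STR_PAD_LEFT);
echo $restored, "\n";             // 100011010
```

Only the integer 282 and the timestamp need to go into the database.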

The key thing behind storing binary information in a database is this: no other combination of binary will equal that number. There is simply no other set of on/off values that will add up to 282. This lets us store many on/off values in a single integer field (as many as the integer type has bits, for example 32 in a 32-bit integer), without ambiguity and always available for reading. As far as how you will read it, look into your language’s “bitwise and” and “bitwise or” operators. An example in PHP would be:

$user_permission = 6; // binary 0110

$view            = 1; // binary 0001
$edit            = 2; // binary 0010
$create          = 4; // binary 0100
$delete          = 8; // binary 1000

if($user_permission & $view)
    echo 'user can view';

if($user_permission & $edit)
    echo 'user can edit';

if($user_permission & $create)
    echo 'user can create';

if($user_permission & $delete)
    echo 'user can delete';

You can use a system like this for storing user permissions (which is how *nix permissions work), which days of the week an event occurs (the nightlife project will be doing this for storing payment rates), or any other piece of information that can be summed up in a series of on/off switches. Using binary storage is extremely efficient and flexible. Remember, however, that any additional fields to be added should be added to the left side of the string to preserve old data.
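Reading flags uses “bitwise and”; switching them on and off uses “bitwise or” and “bitwise and” with an inverted mask. The sketch below reuses the permission values from the example above, with hypothetical PERM_ constant names.

```php
<?php
// Same values as the permission example; the PERM_ names are hypothetical.
const PERM_VIEW   = 1; // binary 0001
const PERM_EDIT   = 2; // binary 0010
const PERM_CREATE = 4; // binary 0100
const PERM_DELETE = 8; // binary 1000

$permission = PERM_EDIT | PERM_CREATE; // 6, binary 0110

$permission |= PERM_DELETE;            // switch a flag on:  14, binary 1110
$permission &= ~PERM_EDIT;             // switch a flag off: 12, binary 1100

$canView   = ($permission & PERM_VIEW) !== 0;   // false
$canDelete = ($permission & PERM_DELETE) !== 0; // true

echo $permission, "\n"; // 12
```

A new flag would take the next power of two (16 here), which is the same thing as adding a digit on the left side of the binary string.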


This week on twitter – 2010-01-03

  • I've got a few Google voice invites if anyone wants one. #


Configuring an Android development platform on Ubuntu 9.04

Since I thoroughly dig Linux for all things development, I’m going to use a Virtual Machine running Ubuntu 9.04 to work on my Droid app, remind@home. I’ll later keep track of the project in a Subversion repository. I still don’t know where I will host my repository, but the short list is likely to be either Sourceforge or Google Code.

I’m using the directions on the Android Developer page to configure my SDK using Eclipse as the Java IDE. Here are the steps I had to follow to set up Ubuntu 9.04 as an Android development platform:

  1. Install Java 6 JDK by typing ‘sudo apt-get install sun-java6-jdk‘ on the command line.
  2. Install the Eclipse IDE. I initially tried this with apt to install the version of Eclipse from the repository, but this is version 3.2 (as of 2009-12-30) and the ADT recommends version 3.4 or newer. Later on in the configuration process, I had some difficulty installing ADT and ended up installing Eclipse 3.5 manually from the Eclipse website. Here’s how:
    1. Download the latest version of Eclipse classic (3.5.1 as of 2009-12-30) from the Eclipse download page.
    2. Extract the folder into your home directory. This does not appear to need compilation or any such fun stuff.
    3. Open the eclipse folder and run eclipse.
  3. Install the Android Development Tools (ADT) for Eclipse. Open Eclipse and go to Help -> Install New Software. Inside the dialog, click ‘Add’ and add the following:
    ADT : https://dl-ssl.google.com/android/eclipse/
    Check the box next to ‘Developer Tools’ and hit next. Follow the series of dialog boxes to install ADT. Restart Eclipse when done.
  4. Download the Android SDK for Linux from the Android developer page. Extract the folder into your home directory.
  5. Open Eclipse and configure the Android SDK’s location within Eclipse. Go to Window -> Preferences. Select the Android menu on the left and click Browse to choose the SDK location. Navigate to the /home/<user>/android-sdk-linux_86/ folder and hit OK. Hit OK to close the Preferences dialog.
  6. Install the Android platforms that you want to develop applications on. Go to Window -> Android SDK and AVD Manager. Click Available Packages on the left and choose which packages you want to install. Since this is my first app and I’m not concerned about it being usable on anything other than my Motorola Droid, I only chose Android 2.0 and Android 2.0.1. In the future, I can install more SDKs to ensure my software works through multiple devices. Click ‘Install Selected’ when done to install the desired SDKs.
  7. Create an Android Virtual Device (AVD) to begin work on. While still in the Android SDK and AVD Manager you opened in step 6, click ‘Virtual Devices’ on the left and click ‘New’. Name your device and choose your SDK. I named my device ‘motorola-droid’, chose Android 2.0.1, left all other settings the same and clicked ‘Create AVD’.
  8. From here, all that is left to do is start writing your app. I followed the Hello World tutorial available on the Android developer site. If you followed the directions for configuring Eclipse 3.5 on Ubuntu 9.04 with the Android Development Tools, you should have no problem compiling and running this app. However, the directions say it should be as easy as going to Run -> Run to launch the program, but some errors came up when I tried that method. Instead, go to the projects window on the left side, right click your project name ‘HelloAndroid’ and go to Run As -> Android Application to launch the program. It appears that the Run menu contains configuration settings to add a new configuration type (going to Run -> Run Configurations pulls up this menu) but I have yet to figure out how to add Android Application to that menu.

Now that I have a working development environment, I can get started on my remind@home application. I will continue to blog as that project moves forward.
