Posts Tagged mysql

MySQL dump of every country and state in the world

As part of a recent project, I needed to compile a list of every country and first level administrative subdivision in the world. I started with the ISO 3166-2 list, but later cleaned up certain countries that had inaccurate data. I also needed timezones for every state, so I put all of those in as well.

This zip file contains a single SQL dump for two tables – region and subregion.

Region contains 248 entries, with the following data for each country: ISO code, 3 digit ISO code, fips code, country name, continent, currency code, currency name, phone prefix, postal code regex, languages and geonameid. Subregion contains region ID, name and timezone. The timezone format is “America/Los_Angeles”, “Europe/Madrid”, etc. Unfortunately, some states span multiple timezones, and this is not taken into account: each state is stored with a single timezone.

There might still be some issues with some countries not having an accurate state list, but this list is more comprehensive than anything I was able to find online.

Download the list: region.sql

Dealing with PHP and character encoding

How many times have you been presented with a character encoding glitch on a PHP site that you thought was working properly? You finished the site months ago with no issues, but now you are being told some weird character is showing up on that page? Chances are, that character is either Ã, Â or Ä, possibly followed by some other random character. If I’m right, keep reading. If not, I offer my deepest condolences, for you are knee deep in a character encoding issue that I cannot help you with.

So what is it that causes that Ã to show up? In short – you have UTF-8 encoded data set to display as ISO-8859-1. But the more important question is how to prevent this from happening in the future. The easy answer to that – store all data in the same encoding as the final display. Convert the encoding as soon as you receive the data, whether through an input field, reading a file or any other method. Don’t put anything in your database unless you are sure it is in the right encoding.

First, an introduction to the three important character encodings that you are likely to deal with – ASCII, UTF-8 and ISO-8859-1. Character encoding is, in a nutshell, the exact method by which characters are stored and displayed by a computer. Since computers only deal in binary, there must be some way to convert from binary into a character. Character encodings come in two flavors (at least, two flavors I’ll be discussing) – single byte and multi byte encodings. Single byte encodings, such as ASCII and ISO-8859-1, use a single byte to represent a character. For instance, the letter “M” in ASCII is represented as 0x4D, whereas the number 4 is 0x34.

A single byte is limited to only 256 possible values, which rapidly becomes a limitation when you want to use characters not typically seen in English, such as “ç” or “ß”, and when you begin thinking about Asian languages, you can quickly see there simply isn’t room for all characters. Multi byte encodings use one or more bytes to represent a character. Each byte still only has 256 possibilities, but 2 bytes have 65,536 possibilities. But what if there is confusion as to whether you are using a single or multi byte encoding? Is that short string “0xC3 0xA5” supposed to be the UTF-8 character å or the ISO-8859-1 string Ã¥? And there we are – a sudden bug has shown up.
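To make that ambiguity concrete, here is a minimal sketch that takes the two bytes of a UTF-8 “å” and reads them the way an ISO-8859-1 consumer would. It assumes the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// "å" encoded as UTF-8 is the two bytes 0xC3 0xA5.
$utf8 = "\xC3\xA5";
echo bin2hex($utf8), "\n";

// Simulate a reader that wrongly assumes ISO-8859-1: each byte is treated
// as its own character, and the result is re-encoded as UTF-8 for display.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";

// One character went in, two characters came out.
echo mb_strlen($utf8, 'UTF-8'), ' vs ', mb_strlen($misread, 'UTF-8'), "\n";
```

The second line of output is the familiar two-character glitch: the same bytes, interpreted under the wrong encoding.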

ASCII is the oldest of the three and the most basic. The ASCII character set is just 128 characters (7 bits), and very English-centric. You can see that the basic ASCII table is very restricted to the English speaking world. The so-called extended ASCII tables, which use the eighth bit for another 128 characters, go off into some weird directions and are rarely used. Even an extended table is only 256 characters, and thus fits neatly into a single byte, so ASCII is a single byte encoding.

ISO-8859-1 is also a single byte encoding, but you can see that it is a bit better in the extended range to support many Latin based languages. Unlike ASCII, ISO-8859-1 can handle the majority of Latin based languages without too much issue. It fails when you go outside of the Latin based languages, however.

The third character encoding is UTF-8, which is designed to support over 1.1 million characters, more than enough for all existing languages (and a few dead ones). Unlike ASCII and ISO-8859-1, UTF-8 is not limited to a single byte, but can vary between 1 and 4 bytes, depending on the character being used.
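A quick way to see the variable width is to compare strlen(), which counts bytes, with mb_strlen(), which counts characters. This is only a sketch: the sample characters and labels are my own choice, and the byte sequences are written out explicitly so the result does not depend on how the file is saved.

```php
<?php
// One sample character for each UTF-8 length, given as raw bytes.
$samples = array(
    'M'         => "M",                // U+004D, plain ASCII
    'sharp s'   => "\xC3\x9F",         // U+00DF
    'euro sign' => "\xE2\x82\xAC",     // U+20AC
    'emoji'     => "\xF0\x9F\x98\x80", // U+1F600
);

$byteCounts = array();
foreach ($samples as $label => $char) {
    // strlen() counts bytes, mb_strlen() counts characters
    $byteCounts[$label] = strlen($char);
    printf(
        "%-10s %d character, %d byte(s)\n",
        $label,
        mb_strlen($char, 'UTF-8'),
        strlen($char)
    );
}
```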

One thing to take note of is that the first 128 characters of all three encodings are identical. This is why a character encoding issue on an English website can take so long to spot. You can go months not knowing that you have a character encoding issue because the only characters you used were the same in all 3 encodings – 0x4D is an M in ASCII, UTF-8 and ISO-8859-1.
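The sketch below illustrates why an English-only site hides the problem: converting a pure ASCII string between UTF-8 and ISO-8859-1 changes nothing, while a single accented character changes the bytes. It assumes the iconv and mbstring extensions are available; the sample strings are made up.

```php
<?php
// Pure ASCII: the bytes are the same in all three encodings.
$english = 'Months of English-only text';
$englishUnchanged = iconv('UTF-8', 'ISO-8859-1', $english) === $english;
var_dump(mb_check_encoding($english, 'ASCII'), $englishUnchanged);

// "Fün" as UTF-8 bytes: the ü is where the encodings stop agreeing.
$accented = "F\xC3\xBCn";
$latin1   = iconv('UTF-8', 'ISO-8859-1', $accented);

echo bin2hex($accented), "\n"; // UTF-8 bytes
echo bin2hex($latin1), "\n";   // ISO-8859-1 bytes
```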

On to the meat of this article – dealing with character encodings in PHP. Namely, how to detect and convert encodings.

The magic bullet of dealing with character encodings in PHP can be summed up in three functions – mb_detect_encoding(), mb_detect_order() and iconv(). The mb functions are designed for working with multibyte encodings, and iconv() converts a string from one encoding to another. One nonobvious restriction of the multibyte library is that it cannot reliably detect ISO-8859-1: since that is a single byte encoding, every possible byte sequence is valid in it, so it should come last in the detection order. mb_detect_order() can be used to specify the order of encodings to scan when attempting to detect the encoding of a string.

Character encoding conversion in PHP boils down to a simple three step process:

  1. Set the encoding types to scan (ASCII, UTF-8, ISO-8859-1). Luckily, this only has to be done once.
  2. Identify the encoding of the string to convert.
  3. Convert to the desired encoding type (I prefer UTF-8).

But let’s cut the crap. Sample code!

// tell the multibyte library which encodings to use
mb_detect_order('ASCII, UTF-8, ISO-8859-1');

// tell the browser that we are using UTF-8
header('content-type: text/plain; charset=UTF-8');

$string = "Fün wîth èñcøding!";

// UTF string displays properly
echo $string . "\r\n";

// convert to ISO-8859-1, which will confuse the browser
// (strict detection rejects encodings in which the bytes are invalid)
$string = iconv(
    mb_detect_encoding($string, mb_detect_order(), true),
    'ISO-8859-1',
    $string
);
echo $string . "\r\n";

// convert back to UTF-8, which the browser understands
$string = iconv(
    mb_detect_encoding($string, mb_detect_order(), true),
    'UTF-8',
    $string
);
echo $string . "\r\n";
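The three steps listed earlier can also be wrapped in one small function that is called wherever data enters the application. This is a minimal sketch, not a drop-in library: the function name toUtf8() is hypothetical, it only knows the three encodings discussed in this article, and it assumes the mbstring and iconv extensions are loaded.

```php
<?php
/**
 * Hypothetical helper: detect the encoding of $string and return it as UTF-8.
 * Returns null if none of the candidate encodings matches.
 */
function toUtf8($string)
{
    // Steps 1 and 2: scan the candidates in order, in strict mode.
    // ISO-8859-1 goes last because any byte sequence is valid in it.
    $candidates = array('ASCII', 'UTF-8', 'ISO-8859-1');
    $detected = mb_detect_encoding($string, $candidates, true);

    if ($detected === false) {
        return null;
    }

    // ASCII is a subset of UTF-8, so nothing needs converting.
    if ($detected === 'ASCII' || $detected === 'UTF-8') {
        return $string;
    }

    // Step 3: convert to the desired encoding.
    return iconv($detected, 'UTF-8', $string);
}

// "Fün" arriving as ISO-8859-1 bytes and as UTF-8 bytes ends up identical.
echo bin2hex(toUtf8("F\xFCn")), "\n";
echo bin2hex(toUtf8("F\xC3\xBCn")), "\n";
```

Detection is still a guess. A string that is valid in more than one candidate encoding cannot be told apart with certainty, which is why converting as soon as the data is received, while its source is still known, is the safer habit.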

Greased PMA – Stripping the suck out of phpmyadmin’s SQL query window

Anyone who has used phpmyadmin’s query window knows how frustratingly small the textarea is, and how difficult editing SQL query blocks can be. I wrote a Greasemonkey script to de-suck this query window. As of now, Greased PMA is version 0.1 and has the following features:

  • Automatic resizing of the textarea to make best use of the window size
  • Ctrl+Enter submits the query
  • Tab inserts 4 spaces into the query

Click here to download Greased PMA version 0.1.

Update: While messing around at home on Vista / Chrome, I noticed that Greased PMA partially works on Chrome. I wasn’t expecting this to work outside of Firefox / Greasemonkey, but Chrome has a similar extension feature, it seems. The textarea resizing works, but submitting the query with Ctrl+Enter and the tab modification do not. Perhaps this weekend I will debug those two functions on Chrome.

Storing on/off switches through binary representation

As part of the Nightlife Project, you will have to store some data through binary representation in the payment_rates table. While doing the writeup on the MySQL schema, I decided that this section was too large to easily fit inside of that article and it should be its own article. The goal of this article is to explain how database schemas employ the principles of binary notation to store a sequence of on/off flags in a single field.

Imagine this scenario: You are the head of maintenance at an office and are trying to determine which lightbulbs get the most usage and should be replaced with high efficiency bulbs. In order to determine which gets the most usage, you walk through the building in the morning, at lunch and again at night and make a note of which bulbs are on and which are off. Once you’ve gathered your data, you store the information in a database and after 2 months, you will view the results and determine the top used bulbs and replace them with higher efficiency bulbs to save electricity.

After each walk through the building, you have a list of lights and a corresponding on/off value for each one. Your list may look something like this:

Room         Status
Reception    On
Break Room   Off
Bathroom     Off
Office 1     Off
Office 2     On
Office 3     On
Office 4     Off
Hallway      On
Copy Room    Off

The obvious way to store this data is to create a table called ‘lights’ and give it 10 fields – a timestamp (we need to know when this walk through occurred, after all) and a field for each of the 9 rooms. However, this obvious way has a few nasty shortcomings. What if you decide to add additional lights to be checked, such as desk lamps or the front walkway? Adding additional fields rarely ends well. This is also extremely inefficient, as you now have a 10 field table when a 2 field table would suffice.

Instead of storing it through this obvious but inefficient method, you can convert that list of on/off switches into a binary string. Using the same data above, the binary string would look like this: 100011010. Each of the lights that are on is a 1 and each light that is off is a 0. Each character represents a room – the first position is reception, the second position is break room, etc. Converting this binary string to decimal, 100011010 becomes 282 (256 + 16 + 8 + 2).

Room         Bit
Reception    1
Break Room   0
Bathroom     0
Office 1     0
Office 2     1
Office 3     1
Office 4     0
Hallway      1
Copy Room    0
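As a rough sketch of how the walk through above could be turned into that single number and back again, the two functions below pack a list of on/off values into an integer and unpack it. The names packFlags() and unpackFlags() are made up for this illustration.

```php
<?php
// Pack a list of on/off values (first room first) into one integer.
function packFlags(array $states)
{
    $value = 0;
    foreach ($states as $on) {
        // shift what we have one place to the left, then append the new bit
        $value = ($value << 1) | ($on ? 1 : 0);
    }
    return $value;
}

// Turn the integer back into $count on/off values, first room first.
function unpackFlags($value, $count)
{
    $states = array();
    for ($bit = $count - 1; $bit >= 0; --$bit) {
        $states[] = (bool) (($value >> $bit) & 1);
    }
    return $states;
}

// Reception, Break Room, Bathroom, Office 1-4, Hallway, Copy Room
$walkthrough = array(true, false, false, false, true, true, false, true, false);

$stored = packFlags($walkthrough);
echo $stored, "\n";         // the value that goes into the database
echo decbin($stored), "\n"; // the same value as a binary string
```

Note that unpackFlags() needs to be told how many rooms there are, since leading zeros (lights that are off at the start of the list) are not visible in the stored number.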

The key thing behind storing binary information in a database is this: no other combination of binary will equal that number. There is simply no other set of on/off values that will add up to 282. This lets us store a whole series of on/off values in a single integer slot (up to 32 in a MySQL INT, or 64 in a BIGINT), without ambiguity and always available for reading. As for how you will read it, look into your language’s “bitwise and” and “bitwise or” operators. An example in PHP would be:

$user_permission = 6; // binary 0110

$view            = 1; // binary 0001
$edit            = 2; // binary 0010
$create          = 4; // binary 0100
$delete          = 8; // binary 1000

if($user_permission & $view)
    echo 'user can view';

if($user_permission & $edit)
    echo 'user can edit';

if($user_permission & $create)
    echo 'user can create';

if($user_permission & $delete)
    echo 'user can delete';

You can use a system like this for storing user permissions (which is how *nix permissions work), which days of the week an event occurs (the nightlife project will be doing this for storing payment rates), or any other piece of information that can be summed up in a series of on/off switches. Using binary storage is extremely efficient and flexible. Remember, however, that any additional fields to be added should be added to the left side of the string to preserve old data.
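Reading flags is only half of the job. The sketch below shows one way to switch flags on and off, and why a flag added on the left leaves old values untouched. The constant names, including the CAN_EXPORT flag, are invented for this example.

```php
<?php
const CAN_VIEW   = 1; // binary 00001
const CAN_EDIT   = 2; // binary 00010
const CAN_CREATE = 4; // binary 00100
const CAN_DELETE = 8; // binary 01000

// start with edit + create, the same value 6 used in the example above
$permission = CAN_EDIT | CAN_CREATE;

// switch a flag on with "bitwise or"
$permission |= CAN_DELETE;
$afterGrant = $permission;

// switch a flag off with "bitwise and" against the inverted flag
$permission &= ~CAN_EDIT;
$afterRevoke = $permission;

echo str_pad(decbin($afterGrant), 5, '0', STR_PAD_LEFT), "\n";
echo str_pad(decbin($afterRevoke), 5, '0', STR_PAD_LEFT), "\n";

// A flag added later takes the next free position on the left ...
const CAN_EXPORT = 16; // binary 10000

// ... so values stored before it existed still mean what they meant.
$oldStoredValue = 6;
var_dump((bool) ($oldStoredValue & CAN_EDIT));   // still on
var_dump((bool) ($oldStoredValue & CAN_EXPORT)); // off by default
```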

The Nightlife Project – Part 1 – Introduction to the Problem

“A carelessly planned project takes three times longer to complete than expected; a carefully planned project takes only twice as long.”

The Nightlife Project is a novice PHP/MySQL tutorial. While you don’t have to know a whole lot to get started, you should understand the basics of PHP and MySQL. Knowledge of variables, functions and flow control is required. If the following block of code makes sense to you, you are ready to start the project:

<?php
repeat('foo', 15);
function repeat($string, $count) {
     for($i = 0; $i < $count; ++$i) {
          mysql_query("INSERT INTO `data` VALUES ('$string')");
     }
}
?>

Now, an explanation is due.

You work for Vision Nightlife, a nightlife promotion company working out of Las Vegas. Your company hires promoters to hand out nightclub flyers for the various clubs that have hired your company. These promoters give flyers to tourists for things like free entry to Domi Lounge, a free drink at Club Septuro, etc. Each flyer is stamped with a unique code identifying the promoter that drove the tourists to the club. The club will then pay your company, based on how many people your promoters drove to their club. The amount is based on how many people and what day of the week. For example, Hoodoo Lounge pays $1 per person on Friday or Saturday nights, if you bring between 1 and 10 people. If you bring between 11 and 20 people, it is $1.50 per person, and 21 or more people is $2.25 per person. On a Wednesday night, those amounts are cut in half. Your company wants you to write software to keep track of all of this information.
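To get a feel for the kind of rule the software has to encode, here is a rough sketch of the Hoodoo Lounge example as plain PHP, before any database is involved. The function names and the three letter day codes are assumptions for illustration, and because the example gives no rate for the other days of the week, the sketch returns null for them rather than guessing.

```php
<?php
// Rate per person for the Hoodoo Lounge example, or null when no rate is known.
function hoodooRatePerPerson($people, $day)
{
    if ($people < 1) {
        return 0.0;
    }

    // weekend tiers: 1-10 people, 11-20 people, 21 or more
    if ($people <= 10) {
        $rate = 1.00;
    } elseif ($people <= 20) {
        $rate = 1.50;
    } else {
        $rate = 2.25;
    }

    if ($day === 'Fri' || $day === 'Sat') {
        return $rate;
    }
    if ($day === 'Wed') {
        return $rate / 2; // Wednesday amounts are cut in half
    }

    return null; // no rate given for this day
}

// Total owed by the club for one night, or null when no rate is known.
function hoodooPayout($people, $day)
{
    $rate = hoodooRatePerPerson($people, $day);
    return $rate === null ? null : $people * $rate;
}

printf("%.2f\n", hoodooPayout(15, 'Sat'));
printf("%.2f\n", hoodooPayout(15, 'Wed'));
```

Hard-coding the tiers like this works for one club, but every club will have its own days and amounts, which is why the payment rates belong in a table.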

Your software needs to track clubs, promoters, payment rates, referral amounts and payout amounts.

This is just an introduction to the nightlife project. Part 2 will begin looking at the database structure and writing the ideal schema for each table.

mysql_insert_id and insert ignore

While working on some code recently, I realized that mysql_insert_id fails when using insert ignore. With a plain insert, mysql_insert_id returns the primary key of the newly inserted row. However, it returns 0 with insert ignore if a key conflict prevents a record from being inserted. If you want to get the key that the conflict was hit against, in as dynamic a way as possible, you can use this script to find the primary key when insert ignore does not enter a record. This is useful if you are setting up a many-to-many pivot table and don’t want duplicate data on either side. When you attempt to insert a new record, it either gives you the new key or the key of the one that already existed.

$db->query($sql);

// if an insert id exists, use it
if ($db->insert_id != 0) {
    $id = $db->insert_id;
// if there is no insert id and there was no error and insert ignore was used
} elseif($db->insert_id == 0 &&
    empty($db->error) &&
    preg_match('/^\s*insert\s+ignore/si', $sql)) {

    // find the table that was queried
    preg_match('/^\s*insert\s+ignore\s+into\s+([-`a-zA-Z0-9_]+)/si', $sql, $extract);
    $table = trim($extract[1], '`');

    // change insert ignore  to insert
    $Sql = preg_replace('/^\s*insert\s+ignore/si', 'insert', $sql);

    // query and scan the error for the key conflict
    $db->query($Sql);
    $error = $db->error;
    preg_match('/^Duplicate entry \'(.*)\' for key (\d+)$/', $error, $extract);

    $value = $extract[1];
    $key = $extract[2];

    // in the case of multi column keys, figure out what the keys actually are
    if(strstr($value, '-') && !strstr($sql, $value)) {
        $values = explode('-', $value);
        $finished = false;
        while(!$finished) {
            // assume we are done unless a pair gets merged on this pass
            $finished = true;
            foreach($values as $k=>$v) {
                // a dash that appears in the query itself belongs to a single value
                if(isset($values[$k+1]) && strstr($sql, $v.'-'.$values[$k+1])) {
                    $values[$k] = $v.'-'.$values[$k+1];
                    unset($values[$k+1]);
                    $values = array_values($values);
                    $finished = false;
                    break;
                }
            }
        }
        $value = $values;
    }

    // look up all keys on the table, isolating the primary key
    $keySql = "show keys from `$table`";
    $keyResult = $db->query($keySql);
    $keys = array();
    while($row = $keyResult->fetch_assoc()) {
        if(strtolower($row['Key_name']) == 'primary') {
            $primary = $row['Column_name'];
        }
        $keys[$row['Key_name']][] = $row;
    }

    // build a where clause to find the primary based on key conflicts
    $keys = array_values($keys);
    // a single column key leaves $value as a plain string, so normalize it
    $conflictValues = (array) $value;
    $whereParts = array();
    foreach($keys[$key-1] as $column) {
        $whereParts[] = "`{$column['Column_name']}` = '".$db->real_escape_string($conflictValues[$column['Seq_in_index']-1])."'";
    }
    $where = implode(' and ', $whereParts);

    // get the primary key that conflicted with the insert
    $sql = "select `$primary` from `$table` where $where";
    $result = $db->query($sql);
    if(is_object($result)) {
        $result = $result->fetch_assoc();
        $id = $result[$primary];
    }
}

echo $id;
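The script above expects the error message to name the conflicting key by number. As far as I know, some MySQL versions report the key by name instead (for example for key 'email'), so treat the exact message format as an assumption and check what your server returns. The sketch below is a hypothetical parsing helper, parseDuplicateKeyError(), that accepts both shapes and can be tried without a database connection.

```php
<?php
/**
 * Hypothetical helper: pull the conflicting value and the key (a number or
 * a name) out of a "Duplicate entry" error message.
 * Returns null when the message has a different shape.
 */
function parseDuplicateKeyError($error)
{
    // the key may or may not be wrapped in single quotes
    $pattern = "/^Duplicate entry '(.*)' for key '?([^']+)'?$/";
    if (!preg_match($pattern, $error, $match)) {
        return null;
    }
    return array('value' => $match[1], 'key' => $match[2]);
}

// sample messages written by hand, not captured from a live server
$byNumber = parseDuplicateKeyError("Duplicate entry 'foo-3' for key 2");
$byName   = parseDuplicateKeyError("Duplicate entry 'a@b.c' for key 'email'");
$other    = parseDuplicateKeyError("Unknown column 'x' in 'field list'");

var_dump($byNumber, $byName, $other);
```

If the key comes back as a name rather than a number, the lookup against the result of show keys would have to match on Key_name instead of on the position in the list.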
