Getting PHP to play nicely with Unicode

PHP does not play nicely with Unicode, so how can we properly use it in a PHP-powered site? First off, as I’ve mentioned before, read Spolsky’s guide on encoding schemes and Unicode to see why Unicode is worthwhile. PHP isn’t aware of any of this: its core string functions assume one byte equals one character. The mbstring extension improves on this and should be used when possible. MySQL presents a further problem, because its original utf8 character set stores at most three bytes per character and so cannot hold every Unicode character. It now supports the utf8mb4 character set, though, and Mathias Bynens has an excellent write-up on using it. So now, as long as everything in our app is configured to use UTF-8, we can enter Unicode characters in our web forms and they'll be saved in our database and output correctly by PHP.
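As a rough illustration of the byte-versus-character problem, here is a minimal sketch. It assumes PHP 7 or later (for the \u{} string escape) with the mbstring extension loaded, and the variable names are only for illustration.

```php
<?php
// "é" is one character but two bytes in UTF-8 (0xC3 0xA9).
$word = "caf\u{00E9}"; // "café", written with PHP 7's \u{} escape

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

echo $byteCount, "\n"; // 5
echo $charCount, "\n"; // 4

// substr() cuts on byte boundaries and can split a character in half;
// mb_substr() cuts on character boundaries.
$firstFour = mb_substr($word, 0, 4, 'UTF-8');
echo bin2hex($firstFour), "\n"; // 636166c3a9
```

The difference only shows up once a string contains something outside ASCII, which is why this kind of bug tends to slip through testing.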

How do we input these esoteric characters though? Different operating systems handle this differently, so I went about creating my own way to enter them. Using the appropriate keyboard shortcut or simply copying and pasting the character will still work, but I can now also enter a character using its Unicode code-point in the form \uXXXXX\. Then, when my text gets parsed to be output to the browser, the code-point gets converted to a UTF-8 encoded character.

Here’s the code:

[php]
/**
 * This takes a regex match for a codepoint of the form \uXXXX(X)\ and
 * uses PHP's chr() function to output a raw UTF-8 encoded character
 *
 * @param array $cp the match array, $cp[1] holds the hex digits
 * @return string
 */
public function utf8CPtoHex($cp) {
	$num = hexdec($cp[1]); // the codepoint as an integer
	$bin = base_convert($cp[1], 16, 2); // the codepoint as a string of bits
	if($num <= 0x7F) { //U+0000 - U+007F -- 1 byte
		$bin = str_pad($bin, 7, "0", STR_PAD_LEFT);
		$returnbin = '0' . $bin;
		$utf8hex = chr(bindec($returnbin));
		return $utf8hex;
	}
	if($num <= 0x7FF) { //U+0080 - U+07FF -- 2 bytes
		$bin = str_pad($bin, 11, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 5); $returnbin1 = '110' . $bin1;
		$bin2 = substr($bin, 5); $returnbin2 = '10' . $bin2;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2));
		return $utf8hex;
	}
	if($num <= 0xFFFF) { //U+0800 - U+FFFF -- 3 bytes
		$bin = str_pad($bin, 16, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 4); $returnbin1 = '1110' . $bin1;
		$bin2 = substr($bin, 4, 6); $returnbin2 = '10' . $bin2;
		$bin3 = substr($bin, 10); $returnbin3 = '10' . $bin3;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3));
		return $utf8hex;
	}
	if($num <= 0x10FFFF) { //U+10000 - U+10FFFF -- 4 bytes
		$bin = str_pad($bin, 21, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 3); $returnbin1 = '11110' . $bin1;
		$bin2 = substr($bin, 3, 6); $returnbin2 = '10' . $bin2;
		$bin3 = substr($bin, 9, 6); $returnbin3 = '10' . $bin3;
		$bin4 = substr($bin, 15); $returnbin4 = '10' . $bin4;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3)) . chr(bindec($returnbin4));
		return $utf8hex;
	}
	return $cp[0]; // not a valid codepoint, leave the text untouched
}

/**
 * This parses a string for any occurrence of \uXXXXX\ - a Unicode
 * codepoint - and uses utf8CPtoHex() as the callback that outputs
 * the raw Unicode character
 *
 * @param string $input
 * @return string
 */
public function convertUnicodeCodepoints($input) {
	// The capture group holds only the hex digits, not the closing backslash
	$output = preg_replace_callback('/\\\\u([0-9a-f]{4,5})\\\\/i', array($this, 'utf8CPtoHex'), $input);
	return $output;
}
[/php]

The second function, convertUnicodeCodepoints(), simply looks for the aforementioned \uXXXXX\ and then calls utf8CPtoHex(), where the real action happens.

Writing this function took a little analysis of the UTF-8 Wikipedia page, in particular the chart in the Description section. Here we see how cleverly the multi-byte encoding scheme has been designed for UTF-8. When you look at a single byte you can see exactly where it belongs. If it’s a single-byte character the byte starts with a 0. No other byte starts with a 0, so that’s unique. Then the starting sequence of the first byte of a multi-byte character is unique for the number of bytes in that character. That means if you are parsing your document and you come across a byte starting 1110 then this is the first byte of a three-byte character. It can’t be anything else. Then any trailing bytes start 10. You literally cannot mess up the parsing of a correctly encoded UTF-8 document.
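To make the leading-bit rules concrete, here is a small sketch that names the role of a byte from its first few bits. utf8ByteRole() is a hypothetical helper, not part of my site's code, and it only looks at the bit pattern; it does not check that a whole sequence is valid.

```php
<?php
/**
 * Describe the role of one byte in a UTF-8 stream from its leading bits.
 * utf8ByteRole() is a hypothetical helper used only for illustration.
 */
function utf8ByteRole(int $byte): string {
	if (($byte & 0x80) === 0x00) { return 'single'; }   // 0xxxxxxx
	if (($byte & 0xC0) === 0x80) { return 'trailing'; } // 10xxxxxx
	if (($byte & 0xE0) === 0xC0) { return 'lead2'; }    // 110xxxxx
	if (($byte & 0xF0) === 0xE0) { return 'lead3'; }    // 1110xxxx
	if (($byte & 0xF8) === 0xF0) { return 'lead4'; }    // 11110xxx
	return 'invalid'; // 11111xxx never appears in UTF-8
}

// "A" is the single byte 41; the euro sign U+20AC is the bytes E2 82 AC.
$roles = array_map('utf8ByteRole', [0x41, 0xE2, 0x82, 0xAC]);
echo implode(' ', $roles), "\n"; // single lead3 trailing trailing
```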

If you read the code above you’ll see this is exactly what I’m doing. Once I know how many bytes a character should be, and thus the start of each of those bytes, the rest of the bits are simply the binary representation of the hexadecimal number that is the codepoint.
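A worked example may help here. This sketch walks one codepoint, U+20AC (the euro sign), through the three-byte case by hand; the variable names are just for illustration.

```php
<?php
// Worked example: U+20AC (the euro sign) falls in the three-byte range.
$bits = str_pad(decbin(0x20AC), 16, '0', STR_PAD_LEFT);
echo $bits, "\n"; // 0010000010101100

// Split the 16 bits as 4 + 6 + 6 and prepend the fixed prefixes.
$byte1 = '1110' . substr($bits, 0, 4);  // 11100010
$byte2 = '10'   . substr($bits, 4, 6);  // 10000010
$byte3 = '10'   . substr($bits, 10, 6); // 10101100

$utf8 = chr(bindec($byte1)) . chr(bindec($byte2)) . chr(bindec($byte3));
echo bin2hex($utf8), "\n"; // e282ac
```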

And now I can enter a Unicode character if I know its codepoint. No messing around with keyboard shortcuts.
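For comparison, on PHP 7.2 or later with mbstring loaded, the same conversion could probably be handed to the built-in mb_chr() instead of assembling the bytes by hand. This is a sketch of that alternative rather than the code my site runs, and replaceCodepointMarkers() is a hypothetical name.

```php
<?php
/**
 * Replace every \uXXXX\ or \uXXXXX\ marker with the UTF-8 character
 * for that codepoint. replaceCodepointMarkers() is a hypothetical name.
 */
function replaceCodepointMarkers(string $input): string {
	return preg_replace_callback(
		'/\\\\u([0-9a-f]{4,5})\\\\/i',
		function (array $match): string {
			$char = mb_chr(hexdec($match[1]), 'UTF-8');
			// mb_chr() returns false when the value cannot be encoded;
			// in that case leave the marker as typed.
			return $char === false ? $match[0] : $char;
		},
		$input
	) ?? $input; // preg_replace_callback() returns null on a regex error
}

echo bin2hex(replaceCodepointMarkers('\u20AC\ 5')), "\n"; // e282ac2035
```

Doing it by hand is still the better way to understand the encoding, though.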

Mark Shuttleworth “fixes” bug 1

Mark Shuttleworth:

There is a social element to this bug report as well, of course. It served for many as a sort of declaration of intent. But it's better for us to focus our intent on excellence in our own right, rather than our impact on someone else's product.

Finally Twitter has two-step authentication

About time too. The issue now is how we can expand this kind of security to everyone. Google Authenticator is one possible way forward.

Laravel 4 and <code>composer.lock</code>

The stable release of Laravel 4 is soon upon us. If you use git to work with Laravel like I do then there is a possible improvement to how you deploy your code.

The default .gitignore file includes the composer.lock file, so the lock file never gets committed. If you want to know how composer works, Dayle Rees wrote an excellent primer. Essentially a project will have a composer.json file which details the dependencies. The true power of composer lies in the cascading nature of the dependency resolution, i.e. a dependency can have its own dependencies and composer will sort all this out for you.

When composer goes about resolving these dependencies, initiated through composer update, it retrieves the libraries/projects, normally from GitHub, and saves them to the ./vendor folder. Composer then creates a new file called composer.lock, or updates said file if it already exists. This file is a list of the exact versions of the dependencies installed.
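To show what that list looks like, here is a small sketch that reads the pinned versions out of a lock file. The JSON is a trimmed, made-up sample (the real file carries many more fields), the package "acme/widgets" and its version are hypothetical, and lockedVersions() is a helper invented for this example, not part of composer.

```php
<?php
// A trimmed, made-up composer.lock: the real file has many more fields,
// and "acme/widgets" and its version are hypothetical.
$lockJson = '{"packages": [{"name": "acme/widgets", "version": "v1.2.3"}]}';

/**
 * Map each locked package name to its pinned version.
 * lockedVersions() is a hypothetical helper, not part of composer.
 */
function lockedVersions(string $json): array {
	$lock = json_decode($json, true);
	$versions = [];
	foreach ($lock['packages'] ?? [] as $package) {
		$versions[$package['name']] = $package['version'];
	}
	return $versions;
}

print_r(lockedVersions($lockJson));
```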

Once you are sure all your code works as expected, including the dependencies, you commit your code and deploy it to the server. Our composer.lock file allows us to tie our project to dependencies we know work: when we run composer install, composer reads the contents of the composer.lock file and installs exactly those dependencies, down to the exact version. This way we can safeguard against unwanted surprises when deploying our code in production. You have to be careful when you live on the bleeding-edge of code.

Unfortunately Laravel doesn’t promote this practice. Maybe I'll open an issue about it.

Free Will

I recently read a book by Sam Harris called Free Will. The idea that the book aims to convey is that free will as we like to think of it is an illusion, which has obvious and deep ramifications for a whole array of issues such as religion, morality and politics.

Don’t make the mistake of thinking this is a form of pre-determinism. The world isn’t pre-determined, just look at the weather system. The idea is my decision making is deterministic in nature. That given enough information you could predict every decision I made. The science here gets somewhat contentious. Though I feel the morality of the situation doesn’t. As mentioned before the weather is an example of a system that is chaotic. At a neural level this idea of randomness may also hold. Certain synapses will simply be active due to random chance.

So my decisions fall into two categories: those that are deterministic and out of my control, and those that are the result of random chance and also out of my control. Morally speaking the result is the same either way. It is unjustifiable to hold me personally responsible for my decisions, my actions. My consciousness is just as much an observer of my decision making as you are.

Given this idea certain values and assumptions we hold dear are now unconscionable. When a person murders someone else they didn’t really have a choice not to. We like to think they did, we like to think that they could simply have made a different decision. This can lead to a justice system that aims to punish people for their actions, after all it’s their actions, they’re responsible.

But as we can see, they’re not really responsible. This doesn’t mean we shouldn’t incarcerate people. Clearly some individuals are more likely, more pre-disposed, to violent crimes. Thus to protect the greater society we should remove these individuals, but we should have empathy for these people that commit crimes. Empathy for how terribly unlucky they’ve been.

At the end of the day that’s the main driving force behind how people behave, luck.