Getting PHP to play nicely with Unicode

PHP does not play nicely with Unicode, so how can we properly use it in a PHP-powered site? First off, as I’ve mentioned before, read Spolsky’s guide on encoding schemes and Unicode to see why Unicode is worthwhile. PHP isn’t aware of any of this; it thinks one byte equals one character. The mbstring extension improves matters and should be used whenever possible. MySQL presents a further problem: its original utf8 character set stores at most three bytes per character, so it can’t hold every Unicode character. It now supports the utf8mb4 character set, though, and Mathias Bynens has an excellent write-up on using it. So now, as long as everything in our app is configured to use UTF-8, we can enter Unicode characters in our web forms and they’ll be saved in our database and output correctly by PHP.
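To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded; the sample word is only an illustration, written with byte escapes so the result does not depend on how the file is saved.

[php]
<?php
// Assumes the mbstring extension is enabled.
$word = "na\xC3\xAFve"; // "naïve": 5 characters, but the ï takes 2 bytes in UTF-8

echo strlen($word), "\n";             // counts bytes: 6
echo mb_strlen($word, 'UTF-8'), "\n"; // counts characters: 5

// Byte-oriented functions can cut a character in half:
$broken = substr($word, 0, 3);             // "na" plus only the first byte of ï
$intact = mb_substr($word, 0, 3, 'UTF-8'); // "naï"

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
[/php]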

How do we input these esoteric characters, though? Different operating systems handle this differently, so I went about creating my own way to enter them. Using the appropriate keyboard shortcut or simply copying and pasting the character will still work, but I can now also enter a character using its Unicode codepoint in the form \uXXXXX\. Then, when my text gets parsed for output to the browser, the codepoint is converted to a UTF-8 encoded character.

Here’s the code:

[php]
/**
 * This takes a regex match for a codepoint of the form \uXXXX(X)\ and
 * uses PHP's chr() function to output a raw UTF-8 encoded character
 *
 * @param array $cp match array; $cp[1] holds the hex digits
 * @return string
 */
public function utf8CPtoHex($cp) {
	// Keep only the hex digits, in case the match captured a delimiter too
	$hex = preg_replace('/[^0-9a-f]/i', '', $cp[1]);
	$num = hexdec($hex);
	$bin = decbin($num);
	if($num <= 0x7F) { //U+0000 - U+007F -- 1 byte
		$bin = str_pad($bin, 7, "0", STR_PAD_LEFT);
		$returnbin = '0' . $bin;
		return chr(bindec($returnbin));
	}
	if($num <= 0x7FF) { //U+0080 - U+07FF -- 2 bytes
		$bin = str_pad($bin, 11, "0", STR_PAD_LEFT);
		$returnbin1 = '110' . substr($bin, 0, 5);
		$returnbin2 = '10' . substr($bin, 5);
		return chr(bindec($returnbin1)) . chr(bindec($returnbin2));
	}
	if($num <= 0xFFFF) { //U+0800 - U+FFFF -- 3 bytes
		$bin = str_pad($bin, 16, "0", STR_PAD_LEFT);
		$returnbin1 = '1110' . substr($bin, 0, 4);
		$returnbin2 = '10' . substr($bin, 4, 6);
		$returnbin3 = '10' . substr($bin, 10);
		return chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3));
	}
	if($num <= 0x10FFFF) { //U+10000 - U+10FFFF -- 4 bytes
		$bin = str_pad($bin, 21, "0", STR_PAD_LEFT);
		$returnbin1 = '11110' . substr($bin, 0, 3);
		$returnbin2 = '10' . substr($bin, 3, 6);
		$returnbin3 = '10' . substr($bin, 9, 6);
		$returnbin4 = '10' . substr($bin, 15);
		return chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3)) . chr(bindec($returnbin4));
	}
	// Beyond the Unicode range: leave the matched text untouched
	return $cp[0];
}

/**
 * This parses a string for any occurrence of \uXXXXX\ (a Unicode
 * codepoint) and calls utf8CPtoHex() as a callback to output
 * the raw Unicode character
 *
 * @return string
 */
public function convertUnicodeCodepoints($input) {
	// In the pattern, each \\\\ matches one literal backslash; the group captures only the hex digits
	$output = preg_replace_callback('/\\\\u([0-9a-f]{4,5})\\\\/i', [$this, 'utf8CPtoHex'], $input);
	return $output;
}
[/php]

The second function, convertUnicodeCodepoints(), simply looks for the aforementioned \uXXXXX\ and then calls utf8CPtoHex(), where the real action happens.

Writing this function took a little analysis of the UTF-8 Wikipedia page, in particular the chart in the Description section. Here we see how cleverly the multi-byte encoding scheme has been designed. When you look at a single byte you can see exactly where it belongs. If it’s a single-byte character, the byte starts with a 0. No other byte starts with a 0, so that’s unique. The leading bits of the first byte of a multi-byte character are then unique to the number of bytes in that character. That means if you are parsing your document and you come across a byte starting 1110, this is the first byte of a three-byte character. It can’t be anything else. Any trailing bytes start with 10. You literally cannot mess up the parsing of a correctly encoded UTF-8 document.
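The prefix rules described here can be sketched as a small classifier. The helper name utf8ByteRole() is hypothetical and not part of the code above, and it only inspects the leading bits of one byte; it does not check that a whole sequence is valid.

[php]
<?php
/**
 * Hypothetical helper: reports what role a single byte plays in a
 * UTF-8 stream by looking only at its leading bits.
 */
function utf8ByteRole($byte) {
	$value = ord($byte);
	if (($value & 0x80) === 0x00) { return 'single-byte character'; }     // 0xxxxxxx
	if (($value & 0xC0) === 0x80) { return 'continuation byte'; }         // 10xxxxxx
	if (($value & 0xE0) === 0xC0) { return 'start of 2-byte character'; } // 110xxxxx
	if (($value & 0xF0) === 0xE0) { return 'start of 3-byte character'; } // 1110xxxx
	if (($value & 0xF8) === 0xF0) { return 'start of 4-byte character'; } // 11110xxx
	return 'not a UTF-8 prefix';
}

// The euro sign U+20AC is encoded as the three bytes E2 82 AC.
foreach (str_split("\xE2\x82\xAC") as $byte) {
	printf("%08b  %s\n", ord($byte), utf8ByteRole($byte));
}
[/php]

Run against the euro sign, this prints one start byte beginning 1110 followed by two continuation bytes beginning 10.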

If you read the code above you’ll see this is exactly what I’m doing. Once I know how many bytes a character should be, and thus the leading bits of each of those bytes, the remaining bits are simply the binary representation of the hexadecimal number that is the codepoint, filled in from left to right.
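As a worked example of the three-byte case, take U+20AC (the euro sign). Its 16 bits, 0010 0000 1010 1100, split into groups of 4, 6 and 6 bits: 0010, 000010 and 101100. The sketch below does that split with shifts and masks instead of binary strings. The function name encodeThreeByte() is hypothetical, and this is another way of expressing the same bit layout rather than the code above.

[php]
<?php
// Hypothetical illustration: build the 3-byte UTF-8 form of one codepoint
// (only meaningful for U+0800 to U+FFFF) using bitwise operators.
function encodeThreeByte($codepoint) {
	$byte1 = 0xE0 | ($codepoint >> 12);         // 1110 + top 4 bits
	$byte2 = 0x80 | (($codepoint >> 6) & 0x3F); // 10 + middle 6 bits
	$byte3 = 0x80 | ($codepoint & 0x3F);        // 10 + last 6 bits
	return chr($byte1) . chr($byte2) . chr($byte3);
}

$euro = encodeThreeByte(0x20AC);
echo strtoupper(bin2hex($euro)), "\n"; // E282AC
[/php]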

And now I can enter a Unicode character if I know its codepoint. No messing around with keyboard shortcuts.
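For a quick cross-check of the hand-rolled encoder, the same \uXXXXX\ replacement can be sketched with mbstring’s mb_chr(). This assumes PHP 7.2 or later with mbstring enabled; the function name convertCodepointsWithMbChr() is hypothetical and stands apart from the methods above.

[php]
<?php
// Assumes PHP 7.2+ with the mbstring extension, which provides mb_chr().
function convertCodepointsWithMbChr($input) {
	return preg_replace_callback(
		'/\\\\u([0-9a-f]{4,5})\\\\/i', // a backslash, u, 4-5 hex digits, a backslash
		function ($match) {
			// hexdec() turns the captured digits into the integer codepoint
			return mb_chr(hexdec($match[1]), 'UTF-8');
		},
		$input
	);
}

echo convertCodepointsWithMbChr('Price: \\u20AC\\5'), "\n"; // Price: €5
[/php]

If both approaches produce the same bytes for a given codepoint, the bit-twiddling version is doing its job.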