Getting PHP to play nicely with Unicode

PHP does not play nicely with Unicode, so how can we properly use it in a PHP-powered site? First off, as I’ve mentioned before, read Spolsky’s guide on encoding schemes and Unicode to see why Unicode is worthwhile. PHP isn’t aware of all this; it thinks one byte equals one character. This is improved by the mbstring extension, which should be used when possible. A further problem is presented by MySQL, though it now supports the character set utf8mb4. Mathias Bynens has an excellent write-up of using utf8mb4. So now, as long as everything in our app is configured to use UTF-8, we can enter Unicode characters in our web forms and they'll be saved in our database and output correctly by PHP.
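To make the "one byte equals one character" problem concrete, here is a minimal sketch, assuming PHP 7 or later (for the `\u{...}` escape) with mbstring enabled. The sample string is my own choice for illustration.

```php
<?php
// "é" (U+00E9) followed by the smiling cat face with heart-shaped eyes (U+1F63B).
$text = "\u{00E9}\u{1F63B}";

// strlen() counts bytes: 2 for the é plus 4 for the cat.
$bytes = strlen($text);

// mb_strlen() counts characters once it is told the encoding.
$chars = mb_strlen($text, 'UTF-8');

echo "$bytes bytes, $chars characters\n"; // 6 bytes, 2 characters
```

The four-byte cat is the kind of character that needs utf8mb4, since MySQL's older utf8 character set stores at most three bytes per character.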

How do we input these esoteric characters though? Different operating systems handle this differently, so I went about creating my own way to enter them. Using the appropriate keyboard shortcut or simply copying and pasting the character will still work, but I can now also enter a character using its Unicode code-point in the form \uXXXXX\. Then, when my text gets parsed to be output to the browser, the code-point gets converted to a UTF-8 encoded character.

Here’s the code:

/**
 * This takes a codepoint of the form \uXXXX(X)\ and uses
 * PHP's chr() function to output a raw UTF-8 encoded character
 *
 * @param  array $cp the regex match, $cp[1] holds the hex digits
 * @return string
 */
public function utf8CPtoHex($cp) {
	$num = hexdec($cp[1]); // the codepoint as an integer
	$bin = base_convert($cp[1], 16, 2); // the codepoint as a string of bits
	if($num <= 0x7F) { //U+0000 - U+007F -- 1 byte
		$bin = str_pad($bin, 7, "0", STR_PAD_LEFT);
		$returnbin = '0' . $bin;
		$utf8hex = chr(bindec($returnbin));
		return $utf8hex;
	}
	if($num <= 0x7FF) { //U+0080 - U+07FF -- 2 bytes
		$bin = str_pad($bin, 11, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 5); $returnbin1 = '110' . $bin1;
		$bin2 = substr($bin, 5); $returnbin2 = '10' . $bin2;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2));
		return $utf8hex;
	}
	if($num <= 0xFFFF) { //U+0800 - U+FFFF -- 3 bytes
		$bin = str_pad($bin, 16, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 4); $returnbin1 = '1110' . $bin1;
		$bin2 = substr($bin, 4, 6); $returnbin2 = '10' . $bin2;
		$bin3 = substr($bin, 10); $returnbin3 = '10' . $bin3;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3));
		return $utf8hex;
	}
	if($num <= 0x10FFFF) { //U+10000 - U+10FFFF -- 4 bytes
		$bin = str_pad($bin, 21, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 3); $returnbin1 = '11110' . $bin1;
		$bin2 = substr($bin, 3, 6); $returnbin2 = '10' . $bin2;
		$bin3 = substr($bin, 9, 6); $returnbin3 = '10' . $bin3;
		$bin4 = substr($bin, 15); $returnbin4 = '10' . $bin4;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3)) . chr(bindec($returnbin4));
		return $utf8hex;
	}
	return $cp[0]; // out of range, leave the original text untouched
}

/**
 * This parses a string for any occurrence of
 * \uXXXXX\ - a unicode codepoint, and then calls the utf8CPtoHex
 * function as a callback to output the raw unicode character
 *
 * @return string
 */
public function convertUnicodeCodepoints($input) {
	// the capture group holds only the hex digits, not the closing backslash
	$output = preg_replace_callback('/\\\\u([0-9a-f]{4,5})\\\\/i', array($this, 'utf8CPtoHex'), $input);
	return $output;
}

The second function, convertUnicodeCodepoints(), simply looks for the aforementioned \uXXXXX\ and then calls utf8CPtoHex(), where the real action happens.
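The matching step on its own can be sketched as a standalone function. This is not the code from my class: the name convertCodepoints() is made up for the example, and mb_chr() (PHP 7.2 or later, with mbstring) stands in for utf8CPtoHex() so the snippet runs by itself.

```php
<?php
// Standalone sketch: mb_chr() stands in for utf8CPtoHex() here.
function convertCodepoints(string $input): string
{
    $result = preg_replace_callback(
        '/\\\\u([0-9a-f]{4,5})\\\\/i',
        function (array $match) {
            // $match[1] holds only the hex digits, e.g. "00F6"
            return mb_chr(hexdec($match[1]), 'UTF-8');
        },
        $input
    );

    // preg_replace_callback() returns null on failure, so fall back to the input.
    return $result ?? $input;
}

// Single quotes keep the backslashes literal.
$output = convertCodepoints('Schr\u00F6\dinger');
echo $output, "\n"; // Schrödinger
```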

Writing this function took a little analysis of the UTF-8 Wikipedia page, in particular the chart in the Description section. Here we see how cleverly the multi-byte encoding scheme has been designed for UTF-8. When you look at a single byte you can see exactly where it belongs. If it’s a single-byte character the byte starts with a 0. No other byte starts with a 0 so that’s unique. Then the starting sequence of the first byte of a multi-byte character is unique for the corresponding number of bytes for that character. That means if you are parsing your document and you come across a byte starting 1110 then this is the first byte of a three-byte character. It can’t be anything else. Then any trailing bytes start 10. You literally cannot mess up the parsing of a correctly encoded UTF-8 document.
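The "look at the leading bits" idea can be sketched in a few lines. The function name utf8ByteKind() is hypothetical, and the sketch only inspects bit patterns; it does not check for overlong or otherwise invalid sequences.

```php
<?php
/**
 * Hypothetical helper: classify a single byte by its leading bits.
 * Returns the sequence length (1-4) for a lead byte, 0 for a
 * continuation byte, and -1 for a byte matching none of the patterns.
 */
function utf8ByteKind(string $byte): int
{
    $value = ord($byte);
    if (($value & 0x80) === 0x00) { return 1; }  // 0xxxxxxx
    if (($value & 0xC0) === 0x80) { return 0; }  // 10xxxxxx
    if (($value & 0xE0) === 0xC0) { return 2; }  // 110xxxxx
    if (($value & 0xF0) === 0xE0) { return 3; }  // 1110xxxx
    if (($value & 0xF8) === 0xF0) { return 4; }  // 11110xxx
    return -1;
}

// The euro sign U+20AC is encoded as the three bytes E2 82 AC.
foreach (str_split("\xE2\x82\xAC") as $byte) {
    printf("%08b => %d\n", ord($byte), utf8ByteKind($byte));
}
```

The first byte reports 3 and the two that follow report 0, which is exactly the "1110 then 10, 10" pattern described above.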

If you read the code above you’ll see this is exactly what I’m doing. Once I know how many bytes a character should be, and thus the start of those bytes, the rest of the bits are simply the binary representation of the hexadecimal number that is the codepoint.
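The same packing can also be done with bit shifts and masks instead of strings of 0s and 1s. This is an alternative sketch, not the method I actually use; codepointToUtf8() is a made-up name, and it makes no attempt to reject surrogate codepoints.

```php
<?php
/**
 * Hypothetical alternative to utf8CPtoHex(): encode an integer
 * codepoint as UTF-8 using bit shifts and masks.
 */
function codepointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Codepoint out of range');
    }
    if ($cp <= 0x7F) {   // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {  // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepointToUtf8(0x20AC)), "\n";  // e282ac
echo bin2hex(codepointToUtf8(0x1F63B)), "\n"; // f09f98bb
```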

And now I can enter any Unicode character if I know its codepoint. No messing around with keyboard shortcuts.

Mark Shuttleworth “fixes” bug 1

Mark Shuttleworth:

There is a social element to this bug report as well, of course. It served for many as a sort of declaration of intent. But it's better for us to focus our intent on excellence in our own right, rather than our impact on someone else's product.

Laravel 4 and composer.lock

The stable release of Laravel 4 is soon upon us. If you use git to work with Laravel like I do then there is a possible improvement to how you deploy your code.

The default .gitignore file includes the composer.lock file. If you want to know how composer works, Dayle Rees wrote an excellent primer. Essentially a project will have a composer.json file which details the dependencies. The true power of composer lies in the cascading nature of the dependency resolution, i.e. a dependency can have its own dependencies and composer will sort all this out for you.

When composer goes about resolving these dependencies, initiated through composer update, it retrieves the libraries/projects, normally from GitHub, and saves them to the ./vendor folder. Composer then creates a new file called composer.lock, or updates said file if it already exists. This file is a list of the exact versions of the dependencies installed.

Once you are sure all your code works as expected, including that the dependencies work as they should, you commit your code and deploy it to the server. Our composer.lock file allows us to tie our project to dependencies we know work: when we run composer install, composer will read the contents of the composer.lock file and install exactly those dependencies, down to the exact version. This way we can safeguard against unwanted surprises when deploying our code in production. You have to be careful when you live on the bleeding-edge of code.
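To see what the lock file pins, here is a rough sketch that reads the packages list out of a lock file. The JSON is a trimmed, made-up sample (a real composer.lock carries many more fields per package), and lockedVersions() is a hypothetical helper of my own.

```php
<?php
// A trimmed, made-up composer.lock; the package names are invented.
$lockJson = <<<'JSON'
{
    "packages": [
        {"name": "example/alpha", "version": "v1.2.3"},
        {"name": "example/beta", "version": "2.0.1"}
    ]
}
JSON;

/**
 * Hypothetical helper: map each locked package name to its pinned version.
 */
function lockedVersions(string $json): array
{
    $lock = json_decode($json, true);
    $versions = [];
    foreach ($lock['packages'] ?? [] as $package) {
        $versions[$package['name']] = $package['version'];
    }
    return $versions;
}

foreach (lockedVersions($lockJson) as $name => $version) {
    echo "$name is pinned to $version\n";
}
```

Pointing the same function at file_get_contents('composer.lock') in a real project should list exactly what composer install will fetch.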

Unfortunately Laravel doesn’t promote this practice. Maybe I'll open an issue about it.

Free Will

I recently read a book by Sam Harris called Free Will. The idea that the book aims to convey is that free will as we like to think of it is an illusion, which has obvious and deep ramifications for a whole array of issues such as religion, morality or politics.

Don’t make the mistake of thinking this is a form of pre-determinism. The world isn’t pre-determined, just look at the weather system. The idea is my decision making is deterministic in nature. That given enough information you could predict every decision I made. The science here gets somewhat contentious. Though I feel the morality of the situation doesn’t. As mentioned before the weather is an example of a system that is chaotic. At a neural level this idea of randomness may also hold. Certain synapses will simply be active due to random chance.

So my decisions fall into two categories: those that are deterministic and out of my control, and those that are the result of random chance and also out of my control. Morally speaking the result is the same either way. It is unjustifiable to hold me personally responsible for my decisions, my actions. My conscious mind is just as much an observer of my decision making as you are.

Given this idea certain values and assumptions we hold dear are now unconscionable. When a person murders someone else they didn’t really have a choice not to. We like to think they did, we like to think that they could simply have made a different decision. This can lead to a justice system that aims to punish people for their actions, after all it’s their actions, they’re responsible.

But as we can see, they’re not really responsible. This doesn’t mean we shouldn’t incarcerate people. Clearly some individuals are more likely, more pre-disposed, to violent crimes. Thus to protect the greater society we should remove these individuals, but we should have empathy for these people that commit crimes. Empathy for how terribly unlucky they’ve been.

At the end of the day that’s the main driving force behind how people behave, luck.

On the Prohibition of Drugs

## What is a drug?

A drug is a substance which affects you in some way. Most drugs are used for medicinal purposes and prescribed by a registered physician. This isn’t all drugs are used for. Many affect the mental state of the person taking them. This could be that they relax the person in some way, depending on what we mean by “relax”; or they could cause hallucinations of some description.

## What drugs are prohibited?

Currently, in the U.K., we do prohibit most drugs, but is this the right policy? Alcohol and tobacco are legal and regulated. That is to say you require a licence to sell them and you have to be over 18 years of age to buy them. Then there is a whole class of pharmaceutical drugs which you can only obtain with a prescription from a doctor. Further, we have some so-called over-the-counter drugs, such as paracetamol: basic pain-killers which aren’t particularly addictive, so it has been deemed safe enough to allow people easy access without the hassle of visiting the doctor. The idea is then that everything else is illegal. There are however substances not covered by the relevant laws and thus legal to take. They are used for the same purpose as more well known drugs such as cocaine or cannabis. These are often referred to as legal highs.

Most people take drugs, particularly alcohol and tobacco, because they enjoy the altered state that the drug produces. It seems to me the current policy is geared to this idea that drug users are addicts who can't control themselves and are thus a danger to society. The UN, in a report (page 7), suggests that globally only roughly 10% of drug users are “problem” users. As I said, most users are normal people, like you and me, going about their daily lives and enjoying the occasional responsible experience of drugs.

## Why are we prohibiting most drugs?

People put forward several arguments for the prohibition of drugs.

  • That drugs are bad for your health.
  • That drugs cause you to act in an uninhibited and antisocial way, thus harming society.
  • That drugs are linked to organised crime, and so lead to an increase in crime and gang activity.

These are potentially valid reasons to prohibit drugs, but let's look closer. In particular, I want to note society’s acceptance of alcohol and tobacco. We have the idea that drugs are bad for you. This does seem reasonable: we discovered that asbestos is a carcinogen and so banned its use as an insulator in construction. We stopped lead being used in paints to prevent heavy metal poisoning. But shouldn’t we be rational and consistent about this? Alcohol and cigarettes are also damaging to your health. Alcoholism is a real problem that has taken many lives over the years. Worse, we have cigarettes, which are a known carcinogen. These are still legal. Perhaps then it should be a question of severity; after all, everything gives you cancer.

I feel this question of severity is the right one to ask. Absolutely there are highly dangerous drugs like heroin. There are also drugs that most people can enjoy in a responsible fashion, like alcohol. A sliding scale where only the genuinely dangerous drugs are prohibited is something I feel we should move toward. Some people more liberal than I will say even this is wrong and who is the government to say what I can or cannot do with my body. That all drugs should be legal. The thought is a nice one, if perhaps a little too naïve for the real world.

## Where should we stand on the slippery slope?

Given the position that some drugs cannot reasonably be taken responsibly, because they are simply too addictive or dangerous, or fail any other criteria we may have, the question must then become the clichéd “where do we draw the line?”

Neither emotion nor tradition should be allowed to determine the answer to this question. This is an ethical question that we must try to answer rationally through reasonable discourse, taking into account the actual effects that drugs have, both on individuals and on society as a whole. People don't seem willing to do this, though things appear to be turning in this regard.

Prof. David Nutt is someone whose opinion I respect on the matter. He voices a view similar to mine, that policy that is detached from the reality of drug use isn’t going to be an effective policy. If we want our policy to genuinely reduce the harm and suffering that drugs can and do cause then we need to take an honest look at drug usage.

George Orwell reviews Mein Kampf

A quite fascinating read. I love the way Orwell writes, it's so personal it makes me ashamed when I read my own writing.

Schrödinger’s Cat causing problems

The Linux distribution Fedora have announced their new version release name, Schrödinger’s Cat, inspired by the famous thought-experiment trying to conceptualise quantum mechanics. But this has brought about some perhaps unforeseen issues regarding the handling of text. The name isn’t simple and contains non-ASCII characters. This is where a slight digression on what a string is would be advisable if you aren’t already comfortable. May I recommend Spolsky’s excellent What Every Programmer Should Know About Unicode. If software hasn’t taken this into account then things can go wrong. Things went wrong. The folks over at LWN have detailed the events.

What interests me is the behaviour of Chrome on Mac, given the suggestion of naming the release “Schrödinger’s 😻”. That final symbol is the Unicode character for a cat, specifically a smiling cat face with heart-shaped eyes. This leads to certain problems. For me that character will render correctly in tab titles in Chrome, but not in the document body. This isn’t a problem with Firefox however, which handles the symbol without issue. Further, Chrome on either Linux or iOS seems to behave properly as well. This appears to have been a known problem for some time.

It’s 2013 and Unicode is still a problem, sigh.

Git workflows with Laravel

I have my local dev box and my VPS which this site resides on. I don't want to make changes directly to the PHP code on the server, mistakes happen and that would be bad. So I write code on my dev box, and when I'm happy everything works, put the code on to the VPS. Git is a godsend for tasks like this. It also helps that Laravel is developed with Git.

First we set up a git repo on the dev box tracking the develop branch of the Laravel git repo. Then we change the origin to our own GitHub repo, and call Laravel's repo upstream. Obviously we first need to go to GitHub and set up a new empty repo. This means any changes we make to the code will get pushed to our repo, not Laravel's (not that we could push changes directly to them, that requires a pull request).

$ git clone -b develop git:// my-laravel
$ cd my-laravel
$ git remote rename origin upstream
$ git remote add origin
$ git push -u origin develop

When the guys at Laravel update their code we need to incorporate those changes into our codebase, which is easily done.

$ git fetch upstream
$ git merge upstream/develop

Once we have made the changes we want, it is a simple case of committing and pushing them.

$ git add <file>
$ git commit -a
$ git push -u origin develop

We also need to run the last command once we've merged any changes from upstream. We can then pull the changes from our GitHub repo onto the VPS (Virtual Private Server). This way we can make sure nothing is going to break before we put any code in production.

Using Laravel 4

I've updated the blog to use Laravel 4. Laravel 4 is currently in beta development, a complete rewrite of the PHP framework to use Composer to be more modular and interoperable.

Currently I'm not doing things the best way as my Git knowledge isn't up to scratch. I followed the instructions on the Laravel beta docs page. I also installed Composer system-wide by putting the phar file in /usr/local/bin. Thus all I need to do is go into my Laravel directory and run composer update and all the components are updated to the latest version; including any extra packages I've added to the composer.json file myself.

I find this incredibly exciting. It's simply a case of finding a package, thinking “I like the sound of that”, and then all I have to do is add one line to one file and I can use it. For example, I'm using Markdown in my posts, so I use this package by dflydev. Admittedly it's not as simple as just adding a line to the composer.json file. Due to how packages are loaded, anywhere I want to use the package I need to add the line

use dflydev\markdown\MarkdownExtraParser;

Though I am sure there are better ways of doing this. I believe this can be done by editing the providers and aliases arrays in app/config/app.php, but I'm yet to look into this properly.

I'm also doing a basic redesign of the site, focusing slightly on the layout, but mainly the colour-scheme, incorporating the Solarized colour palette. Maybe even a little SASS/LESS magic to switch between dark/light schemes.

My name is Jonny Barnes, and is my site. I’m from Manchester, UK .
