Porn Finds a Way

David Cameron has decided the Government needs to do something to protect the innocence of children. How horribly misguided he is. There are two main problems that I can see. The first is an ideological problem, the second a logistical one.

## Why?

The main issue to be dealt with is child pornography. This is less of a filtering issue. In terms of online consumption it’s mostly adults, who can just turn any filter off. The current strategy of taking any servers found containing CP offline is effective, and we should continue and expand upon it.

Then we get legal porn. Filmed with consenting adults acting out various fantasies or fetishes. This is now to be censored by default.

## How?

Quite, how?

IndieWeb and Short URLs

Here I shall use the terms URI and URL interchangeably to mean the same thing. I appreciate there are subtle differences.

The IndieWeb is a fantastic idea. The web itself is inherently open. No one owns it, no one directly controls it. However, if you aren’t careful what services you use on the web then it can effectively end up that way. We all use the web in a primarily social way these days, social networks if you will. The big three players on the social web are Facebook, Twitter, and Google with their Google+ service. They want you to spend as much of your time as possible on their services in order to maximise their advertising revenues.

This doesn’t play nicely with the inherently open and interoperable foundations of the web. Foundations without which these big players wouldn’t exist.

And thus the IndieWeb is born. A desire to own your social identity and share however little or much of that social presence with these big players you want. Which I think is absolutely right.

But what of URL shortening services?

Some people seem to think that having your very own short URL helps in this cause. Which I suppose it does, but I think only to a small degree. Why do we even need to shorten web addresses? The only situation where I think it would be necessary is posting/syndicating to Twitter. Any other service has ample character space to post the full URL, or is sensible and uses annotations. On Twitter however, any link, regardless of how short it is, gets wrapped up in their t.co service.

I therefore don’t currently see any compelling reason to run your own URL shortening service other than simply because you can.

*[URL]: Uniform Resource Locator
*[URI]: Uniform Resource Identifier

The “Failed” State

A thorough examination of yet another way that the U.S. attempts to justify its foreign policy. Namely by claiming a sovereign state is failing and needs saving.

Luckily, we can pinpoint exactly where it all began – right down to the words on the page. The failed state was invented in late 1992 by Gerald Helman and Steven Ratner, two US state department employees, in an article in – you guessed it – Foreign Policy, suggestively entitled Saving failed states. With the end of the cold war, they argued, “a disturbing new phenomenon is emerging: the failed nation state, utterly incapable of sustaining itself as a member of the international community”. And with that, the beast was born.

Hoefler & Frere-Jones release their webfont service

This looks like an awesome service. The only issue I have is the cost. It’s $99+ for the subscription and then you have to license the fonts you want to use.

The service does however make others look like amateurs.

The Science of Why We Don’t Believe Science

A great article that I think hints at a lot of things about being human. That we have these feelings that sometimes conflict with a cold scientific view of the world. My favourite quote being:

Head-on attempts to persuade can sometimes trigger a backfire effect, where people not only fail to change their minds when confronted with the facts—they may hold their wrong views more tenaciously than ever.

Introducing Dumbquotes

This is slightly re-inventing the wheel, but I have released a new package called Dumbquotes. The idea is to replace simple typographic techniques with their more correct forms. Such as replacing a ' with ‘ or ’. This also gave me the chance to try and write a package. So dealing with making sure it’s PSR-0 compliant and has associated unit tests to run with PHPUnit.

The package will deal with apostrophes, quotes, dashes, and ellipses. There are certain issues. Ultimately this is designed to deal with plain text such as a markdown document. It does not work with HTML. Trying to parse HTML with regex will bring the return of Cthulhu. However, once you deal with HTML directly things get a little complicated.

Consider the following sentence that could appear in some HTML: <p>Mary said "How <em>did</em> she do that?"</p>. We want to turn this into <p>Mary said “How <em>did</em> she do that?”</p>. This is complicated by the fact we can't just search for a string of text containing two double quotes like so: /"(.*?)"/. The sentence doesn't actually appear in the HTML DOM. We actually have three blocks of text:

  • Mary said "How
  • did
  • she do that?"

To concatenate that into a single string, and then put the tags back in the right place seems a very difficult task. So I have decided to write the dumbquotes parser to be applied before the markdown transform is applied.
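To give a feel for what such a plain-text pass might look like, here is a minimal sketch. It is not the actual Dumbquotes code: the function name `smartenPlainText()` and the three rules are assumptions made purely for illustration, and dashes and single-quote pairs are left out.

```php
<?php
/**
 * Hypothetical helper, not part of the Dumbquotes package.
 * Applies a few typographic replacements to plain text (no HTML).
 */
function smartenPlainText(string $text): string
{
    // Three full stops become a single ellipsis character.
    $text = preg_replace('/\.{3}/', "\u{2026}", $text);
    // A pair of straight double quotes becomes opening and closing curly quotes.
    $text = preg_replace('/"(.*?)"/u', "\u{201C}\$1\u{201D}", $text);
    // A straight apostrophe between two word characters becomes a curly one.
    $text = preg_replace('/(?<=\w)\'(?=\w)/u', "\u{2019}", $text);

    return $text;
}

echo smartenPlainText('Mary said "How did she do that?" I can\'t say...'), PHP_EOL;
```

Because this runs on the markdown source, the emphasis around "did" is still just `*did*` at this point, so the whole quoted sentence is one string and the simple pattern can match it.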

*[HTML]: HyperText Markup Language
*[DOM]: Document Object Model

Google buys Waze

I’ve been using Waze for a while now and it’s generally a very good app. Being bought out by Google makes me uneasy given the track record Google has in sunsetting products.

Getting PHP to play nicely with Unicode

PHP does not play nicely with Unicode, so how can we properly use it in a PHP-powered site? First off, as I’ve mentioned before, read Spolsky’s guide on encoding schemes and Unicode to see why Unicode is worthwhile. PHP isn’t aware of all this; it thinks one byte equals one character. This is improved by the mbstring extension, which should be used when possible. A further problem is presented by MySQL, though it now supports the utf8mb4 character set. Mathias Bynens has an excellent write-up of using utf8mb4. So now, as long as everything is configured in our app to use UTF-8, we can enter Unicode characters in our web forms and they’ll be saved in our database and output correctly by PHP.

How do we input these esoteric characters though? Different operating systems handle this differently, so I went about creating my own way to enter these characters. Using the appropriate keyboard shortcut or simply copying and pasting the character will still work. But I can now enter the character using its Unicode codepoint in the form \uXXXXX\. Then when my text gets parsed to be output to the browser the codepoint gets converted to a UTF-8 encoded character.
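The “one byte equals one character” assumption is easy to see for yourself. The snippet below is only an illustration: the sample word is my own choice, and `mb_strlen()` is wrapped in a check because mbstring is an optional extension that may not be installed.

```php
<?php
// "é" is U+00E9, which UTF-8 encodes as the two bytes C3 A9.
$text = "caf\u{00E9}";

// strlen() counts bytes, so the four-character word is reported as 5.
echo strlen($text), PHP_EOL;

// mb_strlen() counts characters once it is told the encoding.
if (function_exists('mb_strlen')) {
    echo mb_strlen($text, 'UTF-8'), PHP_EOL;
}
```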

Here’s the code:

[php]
/**
 * This takes a regex match for a codepoint of the form \uXXXX(X)\ and
 * uses PHP's chr() function to output a raw UTF-8 encoded character
 *
 * @param array $cp the match array, with the hex digits in $cp[1]
 * @return string
 */
public function utf8CPtoHex($cp) {
	$num = hexdec($cp[1]); // the codepoint as an integer
	$bin = base_convert($cp[1], 16, 2); // the codepoint as a string of bits
	if($num <= 0x7F) { //U+0000 - U+007F -- 1 byte
		$bin = str_pad($bin, 7, "0", STR_PAD_LEFT);
		$returnbin = '0' . $bin;
		$utf8hex = chr(bindec($returnbin));
		return $utf8hex;
	}
	if($num <= 0x7FF) { //U+0080 - U+07FF -- 2 bytes
		$bin = str_pad($bin, 11, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 5); $returnbin1 = '110' . $bin1;
		$bin2 = substr($bin, 5); $returnbin2 = '10' . $bin2;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2));
		return $utf8hex;
	}
	if($num <= 0xFFFF) { //U+0800 - U+FFFF -- 3 bytes
		$bin = str_pad($bin, 16, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 4); $returnbin1 = '1110' . $bin1;
		$bin2 = substr($bin, 4, 6); $returnbin2 = '10' . $bin2;
		$bin3 = substr($bin, 10); $returnbin3 = '10' . $bin3;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3));
		return $utf8hex;
	}
	if($num <= 0x10FFFF) { //U+10000 - U+10FFFF -- 4 bytes
		$bin = str_pad($bin, 21, "0", STR_PAD_LEFT);
		$bin1 = substr($bin, 0, 3); $returnbin1 = '11110' . $bin1;
		$bin2 = substr($bin, 3, 6); $returnbin2 = '10' . $bin2;
		$bin3 = substr($bin, 9, 6); $returnbin3 = '10' . $bin3;
		$bin4 = substr($bin, 15); $returnbin4 = '10' . $bin4;
		$utf8hex = chr(bindec($returnbin1)) . chr(bindec($returnbin2)) . chr(bindec($returnbin3)) . chr(bindec($returnbin4));
		return $utf8hex;
	}
	return $cp[0]; // not a valid codepoint, so leave the text untouched
}

/**
 * This parses a string for any occurrence of \uXXXXX\ - a unicode
 * codepoint - and uses utf8CPtoHex as the callback to output the
 * raw unicode character
 *
 * @param string $input
 * @return string
 */
public function convertUnicodeCodepoints($input) {
	// [$this, 'method'] avoids the 'self::method' callable string, deprecated in PHP 8.2
	$output = preg_replace_callback('/\\\\u([0-9a-f]{4,5})\\\\/i', [$this, 'utf8CPtoHex'], $input);
	return $output;
}
[/php]

The second function, convertUnicodeCodepoints(), simply looks for the aforementioned \uXXXXX\ and then calls utf8CPtoHex(), where the real action happens.

Writing this function took a little analysis of the UTF-8 Wikipedia page, in particular the chart in the Description section. Here we see how cleverly the multi-byte encoding scheme has been designed for UTF-8. When you look at a single byte you can see exactly where it belongs. If it’s a single-byte character the byte starts with a 0. No other byte starts with a 0, so that’s unique. Then the starting sequence of the first byte of a multi-byte character is unique for the corresponding number of bytes for that character. That means if you are parsing your document and you come across a byte starting 1110 then this is the first byte of a three-byte character. It can’t be anything else. Then any trailing bytes start 10. You literally cannot mess up the parsing of a correctly encoded UTF-8 document.
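The prefix rules can be checked with a few bit masks. This is only a sketch to illustrate the point; `classifyUtf8Byte()` and its labels are hypothetical names of my own, not something from the code above.

```php
<?php
/**
 * Hypothetical helper: says what role a single byte plays in a
 * UTF-8 stream, judged purely by its leading bits.
 */
function classifyUtf8Byte(int $byte): string
{
    if (($byte & 0x80) === 0x00) { return 'single'; }       // 0xxxxxxx
    if (($byte & 0xC0) === 0x80) { return 'continuation'; } // 10xxxxxx
    if (($byte & 0xE0) === 0xC0) { return 'lead-2'; }       // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 'lead-3'; }       // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 'lead-4'; }       // 11110xxx

    return 'invalid'; // 11111xxx never appears in UTF-8
}

// The euro sign, U+20AC, is stored as the three bytes E2 82 AC.
foreach ([0xE2, 0x82, 0xAC] as $byte) {
    echo sprintf('%02X', $byte), ' => ', classifyUtf8Byte($byte), PHP_EOL;
}
```

The first byte comes out as the lead of a three-byte character and the other two as trailing bytes, which is all a parser needs to know to find character boundaries.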

If you read the code above you’ll see this is exactly what I’m doing. Once I know how many bytes a character should be, and thus the start of those bytes, the rest of the bits are simply the binary representation of the hexadecimal number that is the codepoint.
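As a worked example, take the euro sign, U+20AC. In 16 bits that is 0010 000010 101100, so the three bytes are 1110 + 0010, 10 + 000010 and 10 + 101100, or E2 82 AC. The sketch below does the same bit layout with shifts and masks instead of strings of binary digits. The function name `encodeCodePoint()` is hypothetical, and it assumes the input is an integer between 0 and 0x10FFFF.

```php
<?php
/**
 * Hypothetical alternative to the string-based approach: builds the
 * UTF-8 bytes for a codepoint using shifts and masks.
 * Assumes 0 <= $cp <= 0x10FFFF; no validation is done.
 */
function encodeCodePoint(int $cp): string
{
    if ($cp <= 0x7F) { // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) { // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encodeCodePoint(0x20AC)), PHP_EOL; // e282ac
```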

And now I can enter a Unicode character if I know its codepoint. No messing around with keyboard shortcuts.

Mark Shuttleworth “fixes” bug 1

Mark Shuttleworth:

There is a social element to this bug report as well, of course. It served for many as a sort of declaration of intent. But it's better for us to focus our intent on excellence in our own right, rather than our impact on someone else's product.

My name is Jonny Barnes, and jonnybarnes.uk is my site. I’m from Manchester, UK.

I am active to varying degrees on several silos:

My usual online nickname on other services is jonnybarnes. I also syndicate my content to the IndieWeb-friendly site micro.blog. Here’s a profile pic. You can email me at hi@jonnybarnes.uk, or message me on Matrix: @jonny:jonnybarnes.uk.