Emoji and text analytics
If you write software that handles text, you should expect to have your fundamental assumptions about how text works shaken up about once per decade.
It's happening again around now. Suddenly, a large volume of text contains characters that not all software is ready to support. Three years ago, these characters didn't exist, but they've been adopted so quickly that, on Twitter, they're now more common than hyphens.
These characters aren't produced by geeks who are trying to push your software to the limit, or by foreigners who won't speak proper (insert your native language here). They are produced by non-technical users who are speaking your language and who have probably never heard of Unicode. They're just trying to say things like:
The steady march of progress 🐌
People used to write programs that assumed that all text was in English (or Russian, or Japanese, or whatever language the programmer spoke). Then they bolted on extensions to their language's character set to account for a few other languages. All of these extensions were, of course, incompatible with each other.
The need to share text worldwide made this unsustainable. In the 2000s, many programmers finally transitioned to using Unicode and representing it using UTF-8. And then, except for the need to drag some recalcitrant programmers kicking and screaming into the 21st century, all was well. Now every computer can represent any text in any language in a consistent way. Right?
There are MORE characters?! 😧
What I've seen is that people who used to assume incorrectly that there were only 256 possible characters now assume incorrectly that there are only 65,536 possible characters. This was true in Unicode for a brief window of time, but the standard quickly expanded far beyond that, once people realized that you can find more than 65,536 different characters in East Asia alone. The first 65,536 code points are called the "Basic Multilingual Plane", and the regions above that are referred to informally as "astral planes" inhabited by "astral characters".
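To make the distinction concrete, here is a minimal sketch of how you might tell the two kinds of characters apart. It assumes Python 3.3 or later, where every code point is a single item in a string; the helper names `is_astral` and `plane_of` are made up for this illustration.

```python
def is_astral(char):
    """Return True if a single character lies above the Basic Multilingual Plane."""
    return ord(char) > 0xFFFF


def plane_of(char):
    """Return the Unicode plane number: 0 is the BMP, 1 and up are astral."""
    return ord(char) >> 16


# Escapes are used so this file stays readable even in an editor without emoji fonts.
samples = ["A", "\u00e9", "\u8a9e", "\U0001F627"]
for ch in samples:
    print("U+%04X  plane %d  astral=%s" % (ord(ch), plane_of(ch), is_astral(ch)))
```

The last sample is U+1F627, the emoji in this section's heading. It sits in plane 1, so any code that stores a character in 16 bits cannot hold it in one piece.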
There is a lot of software out there that does not understand astral characters. Even in my preferred programming language, Python, the support for astral characters is inconsistent up until version 3.3 (which not many people use yet).
I've seen discussions among developers who are competent at using Unicode that dismiss astral characters as somebody else's problem. A StackOverflow question got a response along the lines of, "But seriously, what are you using astral characters for? Are you a linguist studying dead languages or something?"
Welcome to 2013. In 2010, the Unicode standard added a large range of pictographic characters called "emoji". Emoji were originally used on Japanese cell phones, but now they're used at a higher rate outside of Japan, mostly because they've been embraced by iOS. Almost all emoji are outside of the Basic Multilingual Plane.
Some chat software has supported replacing ASCII smileys with pictures for a long time (often unexpectedly and irritatingly), but that's not as good a solution as having them be real Unicode text. When they're Unicode, you can copy and paste them, save them, and generally use them in any way you would expect text to work. If they're images that are added in by the software, you can't be sure how they'll work.
If you write or depend on software that thinks the Basic Multilingual Plane is all there is, then emoji will break your code. The code may replace them with nothing, or meaningless boxes, or garbage characters (mojibake), or it may even crash.
Emoji are no longer rare edge cases. On Twitter -- which is of course not representative of everything, but is a really good public sample of how people communicate worldwide -- emoji in the astral plane appear in 1 out of 20 tweets, and more frequently than 1 in 600 characters. You can see for yourself on emojitracker, a site that catalogues all the emoji on Twitter as people tweet them.
To put that in perspective:
Astral characters representing emoji are, in total, more common than hyphens.
They're also more common than the digit 5, or the capital letter V.
They are half as common as the # symbol. Yes, on Twitter.
The character 😂 alone is more common than the tilde.
"That's not possible," you might say. "I have a ~ key right here on my keyboard, and I don't have a 😂 key." But iPhones do have a 😂 key. It's easier to find than ~.
👉 Why this is important
People are expressing their emotions in a single character, in a way that is understandable in any language. It's apparently 1/600 of the way people want to communicate. Don't just throw that away.
If you think you're too serious and professional to worry about emoji, consider business software such as Campfire, Trello, and GitHub, which have all added some joy to their user experience with excellent support for emoji (including shortcuts to help type them on desktop computers).
If your main interest in text is to consume it -- for example, if you too are using the Twitter streaming API -- then losing astral characters means you're losing a lot of content.
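If you want a rough idea of how much of your own data is at stake, you can count it. The sketch below is not how the Twitter figures in this post were computed. It is just one simple way to measure the share of astral characters in a list of messages, and the function name `astral_stats` and the sample messages are invented for the example.

```python
from collections import Counter


def astral_stats(messages):
    """Count how often astral characters occur in an iterable of strings."""
    total_chars = 0
    astral_chars = 0
    total_messages = 0
    messages_with_astral = 0
    counts = Counter()
    for message in messages:
        total_messages += 1
        # Assumes Python 3.3+, where each code point is one item of the string.
        astral = [ch for ch in message if ord(ch) > 0xFFFF]
        total_chars += len(message)
        astral_chars += len(astral)
        if astral:
            messages_with_astral += 1
        counts.update(astral)
    return {
        "total_chars": total_chars,
        "astral_chars": astral_chars,
        "total_messages": total_messages,
        "messages_with_astral": messages_with_astral,
        "most_common": counts.most_common(5),
    }


# Hypothetical sample data standing in for a real stream of messages.
sample = ["so funny \U0001F602", "plain text", "\U0001F602\U0001F639"]
stats = astral_stats(sample)
print(stats)
```

Dividing `astral_chars` by `total_chars` gives the per-character rate, and `messages_with_astral` divided by `total_messages` gives the per-message rate.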
And if you're particularly unprepared for astral characters, they may crash your code. If your code crashes given unexpected input, that's a denial-of-service attack in the making. No matter what you think of emoji, you need to fix that anyway.
Can you imagine software whose developers decided that supporting the capital V (which is also about 1 character in 600) wasn't worth it? After all, it's not like you really need that letter unless you're some nerdy fan of obscure characters. If you know someone named Victoria, why not get used to calling her "victoria" or maybe "Wictoria" to make the programmer's life easier?
You might not notice right away that this hypothetical software was bad at handling capital Vs. Perhaps the software would even let you type 'Vernor Vinge', but when anyone else sees that text it would say '�ernor �inge' or 'ï¼¶ernor ï¼¶inge'. And if you typed the capital V when you were searching for something, you might just not get any results.
💥 How is this going wrong?
One symptom shows up in software that represents strings internally as UTF-16, where each astral character is stored as a pair of 16-bit "surrogates". When you need to interoperate with the large amount of code that uses UTF-8, you might think you're producing UTF-8, but you're really producing CESU-8, the nonstandard encoding that results from applying UTF-8 to each UTF-16 surrogate separately instead of to the whole character. CESU-8 looks just like UTF-8, except that every astral character is messed up. (Astral characters are four bytes long in UTF-8, and six bytes long in CESU-8.)
From what I have seen in testing ftfy, CESU-8 is more common in the wild than legitimate UTF-8. Nobody really intends to produce CESU-8, so that indicates a large amount of code that has not been tested on astral characters.
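To see the four-byte versus six-byte difference for yourself, here is a small sketch that produces CESU-8 on purpose. Python has no built-in CESU-8 codec, so `encode_cesu8` is a hypothetical helper written for this illustration: it splits each astral character into its UTF-16 surrogate pair and encodes each half as if it were a character of its own, which is the mistake that buggy software makes by accident.

```python
def encode_cesu8(text):
    """Encode text as CESU-8: UTF-8 applied to UTF-16 code units (illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the astral code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # 'surrogatepass' lets Python emit the three-byte form of a lone
                # surrogate, which strict UTF-8 would refuse to produce.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


cat = "\U0001F639"
real_utf8 = cat.encode("utf-8")
fake_utf8 = encode_cesu8(cat)
print(len(real_utf8), real_utf8)  # 4 bytes: F0 9F 98 B9
print(len(fake_utf8), fake_utf8)  # 6 bytes: ED A0 BD ED B8 B9
```

For text that stays inside the Basic Multilingual Plane, both encoders return identical bytes, which is exactly why the bug goes unnoticed. A strict UTF-8 decoder, such as Python 3's, will reject the six-byte form.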
This points to almost the entire reason that astral characters are hard. There's a right way and a wrong way to handle them, and the two look really similar. It's like the problem with plugging in a USB connector, except you don't even notice when you've done it upside down.
Understand your tools 🔨
Emoji aren't really different from other Unicode characters. You shouldn't have to care whether something is in an astral plane or not, just like you no longer care whether a particular character is ASCII or not.
The libraries you already depend on to handle Unicode strings should be able to do their job. Although Unicode has assigned meanings to more astral characters recently, the standards that tell you how to encode and decode these characters have not changed since 1996.
(Okay, UTF-8 changed in 2003, but it was only to explicitly forbid characters with code points above 0x10ffff, making the number of possible astral characters smaller. This is of no concern to you unless you're writing a UTF-8 decoder from scratch.)
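If you're curious what that upper limit looks like from the outside, this short sketch pokes at it using Python 3's built-in UTF-8 codec. The variable names are just for illustration, and the byte sequence in `too_high` is what U+110000 would look like if it were allowed.

```python
# The highest code point Unicode allows, U+10FFFF, encodes to four bytes.
highest = chr(0x10FFFF).encode("utf-8")
print(highest)  # b'\xf4\x8f\xbf\xbf'

# One step higher would be U+110000, which UTF-8 has forbidden since 2003.
too_high = b"\xf4\x90\x80\x80"
try:
    too_high.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print("decoder rejected U+110000:", rejected)
```

A conforming decoder does this check for you, which is the point: you only have to think about it if you are writing the decoder.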
Programming languages and libraries have had over 17 years to adopt Unicode and get it right. The fact that bugs remain reflects the assumption that astral characters are unimportant and only weird people use them, which I hope the statistics in this post can help to dispel.
They say "it is a poor craftsman who blames his tools", but go ahead and blame your tools, because it seems they really have earned much of the blame. A string representation in which the Basic Multilingual Plane works, but weird things happen to other Unicode characters, is a leaky abstraction. As a programmer who works with strings in 2013, then, you need to find out whether these abstractions are leaking or not, and figure out the right way to use your system's Unicode representations so that they work correctly.
☑ Just check your code
The answer to all of these issues is to put astral characters in your test cases. If your code supports Unicode and doesn't support astral characters, then either your code or code you depend on is making a bad assumption. That code may be hiding other bugs as well.
So, try giving your code input that includes the character 😹. (That's U+1F639 CAT FACE WITH TEARS OF JOY. In UTF-8, its bytes are F0 9F 98 B9.) Does it come out unharmed? When you write it to a file in UTF-8, does it come out as the same four bytes (six is right out)? If not, you've got a bug to either fix or report. If your code explodes given the input 😹, you should find out before your users do, by putting it in your test cases.
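Here is one way such a test case might look in Python 3. It is a sketch, not a complete test suite: `roundtrip_through_file` is a hypothetical stand-in for whatever your own code does with text, and you would replace it with a call into the code you actually want to check.

```python
import os
import tempfile

CAT = "\U0001F639"                  # U+1F639 CAT FACE WITH TEARS OF JOY
EXPECTED_UTF8 = b"\xf0\x9f\x98\xb9"  # the four bytes it should become


def roundtrip_through_file(text):
    """Write text to a temporary UTF-8 file; return the raw bytes and the text read back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, encoding="utf-8") as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded


def test_astral_roundtrip():
    raw, decoded = roundtrip_through_file(CAT)
    assert raw == EXPECTED_UTF8   # four bytes, not six
    assert decoded == CAT         # comes out unharmed
    assert len(decoded) == 1      # one character, not a surrogate pair


test_astral_roundtrip()
```

The same three assertions apply to any path text takes through your system: a database, a web form, a JSON API, a log file.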
😕 So, uh, why were all those weird boxes in this blog post?
I included an emoji character in every sub-heading. In practice, they were probably only visible if you're on Mac OS or iOS, or maybe on Windows 8.1 if you're using the beta version or if you're here from the future. Other recent versions of Windows and Linux are prepared to display emoji, but come without any fonts that can actually do so.
Apparently even some up-to-date versions of Google Chrome will not display emoji, on an OS that otherwise supports them. Seriously, Google?
If your browser itself isn't getting in the way, you should be able to see the characters if you install a free font called "Symbola". They won't be beautifully rendered, but at least they'll be there. You can get this font from a page called Unicode Fonts for Ancient Scripts.
And what a telling anachronism that title is! You need a page that was intended to be about "ancient scripts" to get emoji that were invented in the last decade. The universe of text is very different than it was three years ago.