I didn't know exactly how Unicode strings are encoded in UTF-8 until I read this post by James Coglan. Basically, it uses a variable number of bytes (from 1 to 4) to represent each character. If you are developing in C, correctly allocating and parsing UTF-8 strings can be annoying without a dedicated library. It would have been better (and more elegant) if there were always 3 bytes for each char, don't you think?
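
To make the variable width concrete, here is a minimal sketch in C of how a parser can tell from the first byte how long a character is. The function names `utf8_seq_len` and `utf8_count` are my own, made up for illustration, and this is not a full validator: it does not reject every overlong or surrogate encoding, which a dedicated library would do.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` cannot start a sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* 0xxxxxxx: plain ASCII   */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx                */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx                */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx                */
    return 0; /* continuation byte (10xxxxxx) or a byte UTF-8 never uses */
}

/* Hypothetical helper: count the characters (code points) in a
   NUL-terminated string. Returns (size_t)-1 if the bytes are malformed. */
static size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != 0) {
        size_t n = utf8_seq_len(*p);
        if (n == 0) return (size_t)-1;
        /* Every byte after the first must look like 10xxxxxx. A NUL fails
           this check, so a truncated sequence never reads past the end. */
        for (size_t i = 1; i < n; i++) {
            if ((p[i] & 0xC0) != 0x80) return (size_t)-1;
        }
        p += n;
        count++;
    }
    return count;
}
```

With these helpers, the string `"h\xC3\xA9llo"` ("héllo") takes 6 bytes according to `strlen` but `utf8_count` reports 5 characters. That gap between bytes and characters is exactly what makes allocation and indexing annoying.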

But, just like in the real world, elegance has its tradeoffs. In this case there is mainly one: compatibility with ASCII. UTF-8 ensures that every ASCII string is also a valid UTF-8 string. A great achievement in my opinion, and a far more tangible benefit than elegance.
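
A small sketch in C of what that compatibility means in practice. The name `is_ascii` is hypothetical, chosen just for this example. Every ASCII byte is below 0x80, and UTF-8 encodes those values as the very same single byte, so a string that passes this check can be handed to UTF-8 code unchanged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte of the NUL-terminated string is
   in the ASCII range (0x00..0x7F). Such a string is already valid UTF-8,
   with exactly one byte per character. */
static bool is_ascii(const char *s)
{
    /* Go through unsigned char: plain char may be signed, and bytes
       >= 0x80 would then compare as negative values. */
    const unsigned char *p = (const unsigned char *)s;

    for (size_t i = 0; p[i] != 0; i++) {
        if (p[i] >= 0x80) return false;
    }
    return true;
}
```

For example, `is_ascii("hello")` is true, while `is_ascii("h\xC3\xA9llo")` is false because "é" needs two bytes. With a fixed 3 bytes per character, even `"hello"` would have to be rewritten before any Unicode-aware code could read it.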

When developing an application, chasing elegance in code, data structures or classes may make you miss the key point: you are creating applications for users, not to make other developers think that you're smart.


Cover image by Kent Wang taken from Flickr licensed under a Creative Commons Attribution-ShareAlike 2.0 Generic license.