An extremely common programming exercise – popping up usually as an interview question – is to write a function that turns all characters in a string into uppercase. As you may know or suspect, such a task is not really about the problem stated explicitly. Instead, it’s meant to probe whether you can program at all, and whether you remember to handle special subsets of input data. That’s right: the actual problem is almost insignificant; it’s all about the necessary plumbing. Without the need for it, the task becomes awfully easy, especially in certain kinds of languages:
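In Python, for instance, the whole exercise might come down to something like the sketch below. The function name and the decision to treat a missing input as an empty string are assumptions made purely for illustration.

```python
def to_upper(text):
    """Return text with every character uppercased."""
    # The "plumbing": cope with a missing (None) or empty input.
    if not text:
        return ""
    # The actual problem, solved by a single built-in call.
    return text.upper()


print(to_upper("hello, world"))  # HELLO, WORLD
```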
This simplicity may give rise to the misconception that the whole problem of letter case is similarly trivial. Actually, I would not be surprised if the notion of having any sort of real ‘problem’ here is baffling to some. After all, every self-respecting language has those toUpperCase functions built in, right?…
Sure it does. But even assuming they work correctly, they are usually the only case-related transformations available out of the box. As it turns out, it’s hardly uncommon to need something way more sophisticated.
Let’s do a quick check first. Fire up a REPL for your favorite programming language (or an editor with a compiler nearby) and try to uppercase all characters in any of the following strings:
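A few well-known sentences that would serve the purpose are given here as a Python 3 snippet, where str holds Unicode text. The selection is only an illustration; any text with letters outside a–z will do.

```python
# Illustrative sample sentences with letters outside the a-z range.
pangrams = {
    "Polish": "Pchnąć w tę łódź jeża lub ośm skrzyń fig",
    "German": "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg",
    "Russian": "Съешь же ещё этих мягких французских булок, да выпей чаю",
}

for language, sentence in pangrams.items():
    print(f"{language}: {sentence.upper()}")
```

Note that the German sentence comes out one character longer, because ß has no single-character uppercase form in this mapping and becomes SS.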
These are pangrams: short and often nonsensical sentences which contain all the letters of their respective languages. Since many of those letters are outside of the a–z range, their uppercase counterparts cannot be obtained by simply subtracting the magic number 32 from their ASCII codes. Because they are scattered over a much wider span of the character table, this simple approach doesn’t work.
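The limitation is easy to see in a small Python 3 sketch. The helper name ascii_upper is hypothetical, and it deliberately implements only the subtract-32 trick for the letters a–z, to be compared with the built-in, Unicode-aware str.upper().

```python
def ascii_upper(text):
    """Uppercase by subtracting 32, but only for the letters a-z."""
    result = []
    for ch in text:
        if "a" <= ch <= "z":
            result.append(chr(ord(ch) - 32))
        else:
            result.append(ch)  # everything else passes through unchanged
    return "".join(result)


sample = "zażółć gęślą jaźń"
print(ascii_upper(sample))  # ZAżółć GęśLą JAźń
print(sample.upper())       # ZAŻÓŁĆ GĘŚLĄ JAŹŃ
```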
But programming languages usually promise to handle all the complexities behind upper- and lowercasing – if not for strings, then at least for individual characters. Many of them, however, fail to deliver on that promise in their standard string APIs. It’s Unicode, of course, that is supposed to deal with such issues. Using Unicode strings by default is rare among popular languages, though; Python 3.x, Java and Haskell come to mind as honorable exceptions. Others typically offer alternate string types (such as std::wstring in C++ or unicode in Python 2.x) and various encoding and decoding mechanisms to cooperate with code that uses standard, ANSI-only strings.
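The difference between the two kinds of strings can be sketched in Python 3, with bytes standing in for a "standard" byte-oriented string and str for a Unicode one. This is only an analogy for the split described above, not a statement about how any other language behaves.

```python
text = "zażółć"

# Byte string: upper() touches only the ASCII letters a-z,
# so the UTF-8 encoded Polish letters are left alone.
as_bytes = text.encode("utf-8")
print(as_bytes.upper().decode("utf-8"))  # ZAżółć

# Unicode string: upper() knows about the non-ASCII letters as well.
print(text.upper())                      # ZAŻÓŁĆ
```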
Still, we may consider this an appropriate abstraction when it comes to handling the letter case of individual characters. However, it isn’t nearly enough for actual texts. The issue of proper capitalization is a grammatical one; and whenever the grammar of natural languages is involved, we can be sure that any related computing problems will be notoriously difficult to solve.
I stumbled upon this very topic when making the switch to blogging in English. It turns out that English has quite amusing rules regarding capitalization of titles – or title case for short. Its simplest variant is to capitalize every single word:
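A minimal Python sketch of that simplest variant follows; the function name is hypothetical, and words are assumed to be separated by whitespace.

```python
def capitalize_every_word(title):
    """Uppercase the first letter of each whitespace-separated word."""
    return " ".join(word[:1].upper() + word[1:] for word in title.split())


print(capitalize_every_word("a tale of two cities"))  # A Tale Of Two Cities
```

Python's built-in str.title() does roughly the same, but it treats apostrophes as word boundaries, so "don't" becomes "Don'T"; hence the manual approach here.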
The conventions actually in use are much more complex: they introduce several exceptions, and exceptions to exceptions, so it all contributes to a rather fine mess. For example, it is common to leave certain “small” words lowercased: prepositions (for, by, on, …), articles (a, an, the) and conjunctions (and, or, but, etc.). Nevertheless, the first word is always capitalized – and depending on the particular convention, the last one may be too. Additionally, if the “small” word grows big enough, it might also warrant capitalizing; an excessively long preposition such as without could be a good example. Finally, punctuation may also play its role, forcing any word after certain marks (such as a colon) to be capitalized as well.
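To get a feel for how these rules stack up, here is a rough Python sketch of one possible convention. The word list, the length threshold of four letters, the set of "forcing" punctuation marks and all names are assumptions chosen for illustration, not a rendering of any particular style guide.

```python
SMALL_WORDS = {
    "a", "an", "the",                                   # articles
    "and", "or", "but", "nor",                          # conjunctions
    "for", "by", "on", "in", "of", "to", "at", "with",  # prepositions
    "without",
}
MAX_SMALL_LENGTH = 4    # assumed: longer "small" words get capitalized anyway
FORCING_MARKS = (":",)  # assumed: the word after a colon is always capitalized


def capitalize(word):
    return word[:1].upper() + word[1:]


def title_case(title, capitalize_last=True):
    words = title.split()
    result = []
    for i, word in enumerate(words):
        is_first = i == 0
        is_last = capitalize_last and i == len(words) - 1
        after_mark = i > 0 and words[i - 1].endswith(FORCING_MARKS)
        # Compare without surrounding punctuation and ignoring case.
        bare = word.strip(".,:;!?\"'").lower()
        is_small = bare in SMALL_WORDS and len(bare) <= MAX_SMALL_LENGTH
        if is_first or is_last or after_mark or not is_small:
            result.append(capitalize(word))
        else:
            result.append(word.lower())
    return " ".join(result)


print(title_case("the lord of the rings: the return of the king"))
# The Lord of the Rings: The Return of the King
print(title_case("life without a plan"))
# Life Without a Plan
```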
That’s rather convoluted, isn’t it? Setting aside any concerns about how arbitrary these rules are, we can observe that even an undisputed standard built upon them is unlikely to be easy to implement as an algorithm. Anything that requires recognizing parts of speech is almost certain to be AI-hard. Although a simple exclusion list (['a', 'an', 'but', ...]) works correctly for many inputs, it fails for texts containing e.g. certain phrasal verbs. So in order to properly capitalize a title, one should know the semantic context of every word in it. That’s easy for humans, but not so for computers.
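The phrasal verb problem shows up even in a toy version of the exclusion-list approach. In the Python sketch below, the list contents and the function name are hypothetical.

```python
EXCLUDED = {"a", "an", "but", "in", "on", "to", "the"}  # hypothetical list


def naive_title(title):
    """Capitalize every word except excluded ones (the first word always)."""
    words = title.split()
    return " ".join(
        w.lower() if i > 0 and w.lower() in EXCLUDED else w[:1].upper() + w[1:]
        for i, w in enumerate(words)
    )


print(naive_title("a walk in the park"))      # A Walk in the Park
print(naive_title("turn on the lights now"))  # Turn on the Lights Now
```

The first result is fine. In the second, "on" belongs to the phrasal verb "turn on" rather than acting as a preposition, so common style conventions would expect "Turn On the Lights Now", and no word list alone can tell the two uses apart.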
I hope I have successfully made the point that changing letter case may be almost as far from trivial as a computing problem can get. At the very least, we should be extremely wary of any related operations on texts in foreign languages – including, for example, a case-insensitive comparison. And even for English texts, the task may border on impossible if it’s something just slightly more complicated than a straightforward substitution of characters.
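Case-insensitive comparison is a good place to see this in practice. The Python 3 sketch below contrasts the usual lowercase-both-sides idiom with str.casefold(), which applies Unicode case folding; the helper name is hypothetical.

```python
def equal_ignoring_case(a, b):
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()


# Lowercasing keeps the German sharp s, so the two spellings differ.
print("Straße".lower() == "STRASSE".lower())     # False
# Case folding maps ß to ss, so they compare as equal.
print(equal_ignoring_case("Straße", "STRASSE"))  # True
```

Even this is not the whole story: case folding is not locale-aware and does not normalize combining characters, so it should be seen as a better default rather than a complete answer.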
For more information about the phenomenon of letter case, it’s probably best to consult the indispensable Wikipedia.