A Curious Case of Letter Case

2012-01-08 18:32

An extremely common programming exercise, one that usually pops up as an interview question, is to write a function that turns all characters in a string into uppercase. As you may know or suspect, such a task is not really about the problem stated explicitly. Instead, it’s meant to probe whether you can program at all, and whether you remember to handle special subsets of input data. That’s right: the actual problem is almost insignificant; it’s all about the necessary plumbing. Without the need for it, the task becomes awfully easy, especially in certain kinds of languages:

  import Data.Char (toUpper)

  toUpperCase :: String -> String
  toUpperCase s = map toUpper s

This simplicity may give rise to the misconception that the whole problem of letter case is similarly trivial. Actually, I would not be surprised if the notion of there being any sort of real ‘problem’ here is baffling to some. After all, every self-respecting language has those toLowerCase/toUpperCase functions built in, right?…

Sure it does. But even assuming they work correctly, they are usually the only case-related transformations available out of the box. As it turns out, it’s hardly uncommon to need something far more sophisticated.

How Unicode helps (sort of)

Let’s do a quick check first. Fire up a REPL for your favorite programming language (or an editor with a compiler nearby) and try to uppercase all characters in any of the following strings:

  • Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.
  • Voix ambiguë d’un cœur qui au zéphyr préfère les jattes de kiwi.
  • Pchnąć w tę łódź jeża lub ośm skrzyń fig.
  • Экс-граф? Плюш изъят. Бьём чуждый цен хвощ!

These are pangrams: short and often nonsensical sentences which contain all the letters of their respective languages. Since many of those letters are outside of the a-z range, their uppercase counterparts cannot be obtained by simply subtracting the magic number 32 from their ASCII codes. They are scattered over a much wider span of the character table, so this simple approach doesn’t work.
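To make the failure concrete, here is a minimal sketch in Python 3. The helper name ascii_upper is made up for this illustration; it applies the subtract-32 trick to the a-z range and leaves everything else untouched, which is exactly the problem.

```python
def ascii_upper(text):
    """Uppercase by subtracting 32 from the code of each a-z character.

    Hypothetical helper for illustration only; anything outside
    the a-z range is passed through unchanged.
    """
    return ''.join(
        chr(ord(ch) - 32) if 'a' <= ch <= 'z' else ch
        for ch in text
    )


polish = "Pchnąć w tę łódź jeża lub ośm skrzyń fig."

naive = ascii_upper(polish)   # 'PCHNąć W Tę łóDź JEżA LUB OśM SKRZYń FIG.'
proper = polish.upper()       # 'PCHNĄĆ W TĘ ŁÓDŹ JEŻA LUB OŚM SKRZYŃ FIG.'
```

The naive version produces a mix of uppercase ASCII letters and untouched lowercase diacritics, while the built-in str.upper of Python 3 handles the whole sentence.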

But programming languages usually promise to handle all the complexities behind upper- and lowercasing, if not for strings, then at least for individual characters. Many of them, however, fail to deliver on that promise in their standard string API. It’s Unicode, of course, that is supposed to deal with such issues. Using Unicode strings by default is rare among popular languages, though; Python 3.x, Java and Haskell come to mind as honorable exceptions. Others typically offer alternate string types (such as std::wstring in C++ or unicode in Python 2.x) and various encoding and decoding mechanisms to cooperate with code that uses standard, ANSI-only strings.
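The difference between the two kinds of strings can be seen within a single language. The sketch below uses Python 3, with its bytes type standing in for the "standard" non-Unicode string; that stand-in is my assumption for the sake of illustration, not a claim about how any other language behaves.

```python
text = "größeren Zwerg"

# A byte string only knows about ASCII letters;
# the bytes that encode 'ö' and 'ß' pass through unchanged.
raw = text.encode("utf-8")
bytes_result = raw.upper().decode("utf-8")   # 'GRößEREN ZWERG'

# A Unicode string consults the Unicode case mappings instead.
unicode_result = text.upper()                # 'GRÖSSEREN ZWERG'
```

Note that the Unicode result is one character longer than the input, because the uppercase form of ß is SS. Even the "correct" operation cannot be expressed as a one-to-one mapping of characters.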

Case in point

Still, we may consider Unicode an appropriate abstraction when it comes to handling the letter case of individual characters. However, it isn’t nearly enough for actual texts. The issue of proper capitalization is a grammatical one; and whenever the grammar of natural languages is involved, we can be sure that any related computing problems will be notoriously difficult to solve.

I stumbled upon this very topic when making the switch to blogging in English. It turns out that English has quite amusing rules regarding capitalization of titles – or title case for short. Its simplest variant is to capitalize every single word:

  def title_case(text):
      words = (word.capitalize() for word in text.split())
      return ' '.join(words)

  >>> title_case("This is a title")
  'This Is A Title'

The conventions actually in use are much more complex: they introduce several exceptions, and exceptions to exceptions, all of which adds up to a rather fine mess. For example, it is common to leave certain “small” words lowercased: prepositions (for, by, on, …), articles (a, an, the) and conjunctions (and, or, but, etc.). Nevertheless, the first word is always capitalized and, depending on the particular convention, the last one may be too. Additionally, if a “small” word grows big enough, it might also warrant capitalizing; an excessively long preposition such as without is a good example. Finally, punctuation may also play a role, forcing any word after certain marks (such as a colon) to be capitalized as well.
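The rules listed above can be roughly sketched in Python as follows. Everything specific here is an assumption made for the example: the function name smart_title_case, the (incomplete) contents of SMALL_WORDS, the length threshold of four letters, and the decision to always capitalize the last word. Real style guides differ on all of these.

```python
# Hypothetical and deliberately incomplete list of "small" words.
SMALL_WORDS = {
    'a', 'an', 'the',                        # articles
    'and', 'or', 'but', 'nor',               # conjunctions
    'for', 'by', 'on', 'in', 'of', 'to',     # prepositions
    'at', 'with', 'without',
}
MAX_SMALL_LENGTH = 4  # assumed threshold: longer "small" words get capitalized


def smart_title_case(text):
    words = text.split()
    result = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation, e.g. "rings:" -> "rings".
        bare = word.strip('.,:;!?"\'').lower()
        is_edge = i == 0 or i == len(words) - 1
        after_colon = i > 0 and words[i - 1].endswith(':')
        is_small = bare in SMALL_WORDS and len(bare) <= MAX_SMALL_LENGTH
        if is_small and not (is_edge or after_colon):
            result.append(word.lower())
        else:
            result.append(word.capitalize())
    return ' '.join(result)


lotr = smart_title_case("the lord of the rings: the return of the king")
# 'The Lord of the Rings: The Return of the King'
life = smart_title_case("life without a computer")
# 'Life Without a Computer'
```

Even this small sketch needs four separate conditions, and it still knows nothing about what the words actually mean.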

That’s rather convoluted, isn’t it? Setting aside any concerns about how arbitrary these rules are, we can observe that even an undisputed standard built upon them would be unlikely to be easy to implement as an algorithm. Anything that requires recognizing parts of speech is almost certain to be AI-hard. Although a simple exclusion list (['a', 'an', 'but', ...]) works correctly for many inputs, it fails for texts containing e.g. certain phrasal verbs. So in order to properly capitalize a title, one has to know the semantic context of every word in it. That’s easy for humans, but not so for computers.
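Here is a small Python illustration of the phrasal verb problem. The function naive_title_case and its exclusion list are hypothetical, and the "expected" capitalization in the comments reflects conventions that capitalize the particle of a phrasal verb; not every style guide agrees on that.

```python
# Hypothetical exclusion list; real ones are longer.
SMALL_WORDS = {'a', 'an', 'the', 'and', 'or', 'but', 'for', 'by', 'on', 'in'}


def naive_title_case(text):
    """Capitalize every word except listed ones (the first word always)."""
    words = text.split()
    return ' '.join(
        w.capitalize() if i == 0 or w.lower() not in SMALL_WORDS else w.lower()
        for i, w in enumerate(words)
    )


# "on" is a preposition here, so lowercase is what we want.
preposition = naive_title_case("dancing on the ceiling")
# 'Dancing on the Ceiling'

# "on" belongs to the phrasal verb "turn on" here; conventions that
# capitalize such particles would want 'Turn On the Radio' instead.
phrasal = naive_title_case("turn on the radio")
# 'Turn on the Radio'
```

Both inputs contain the very same word in the very same position, so no list-based rule can tell them apart. Only the grammatical role of the word does.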

In any case…

I hope I have successfully made the point that changing letter case may be almost as far from trivial as a computing problem can get. At the very least, we should be extremely wary of any related operations on texts in foreign languages, and that includes, for example, case-insensitive comparison. And even for English texts, the task may border on impossible if it’s something just slightly more complicated than a straightforward substitution of characters.
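Case-insensitive comparison is a good place to see this wariness pay off. The following Python sketch (requires 3.3 or later for str.casefold) compares a word from the German pangram with its uppercase form; the two helper names are made up for the example.

```python
def naive_equal(a, b):
    """Compare after lowercasing, as is commonly done."""
    return a.lower() == b.lower()


def folded_equal(a, b):
    """Compare using Unicode case folding, which is meant for matching."""
    return a.casefold() == b.casefold()


lower_result = naive_equal("größeren", "GRÖSSEREN")     # False
folded_result = folded_equal("größeren", "GRÖSSEREN")   # True
```

Lowercasing leaves ß as it is, so it never matches the SS of the uppercase form, while case folding turns both into ss. Even so, case folding is not a complete answer: it is locale-independent, so language-specific rules (the Turkish dotted and dotless i being the usual example) are still outside its reach.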

For more information about the phenomenon of letter case, it’s probably best to consult the indispensable Wikipedia.

Author: Xion, posted under Computer Science & IT, Culture



 


© 2017 Karol Kuczmarski "Xion".