The Great Plural Experiment

2013-03-31 22:24

You probably know very well that internationalization is hard. The mere act of translating the UI texts is actually one of the easiest parts, even though it’s not a pushover either. As one example: if your messages include quantities, you need to have some logic in place to choose different forms of nouns to go with your numbers. Fortunately, most frameworks already have that, as it’s a standard i18n feature.
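As a rough illustration of that standard feature, here is a minimal sketch using Python’s built-in gettext module. The message text and the rows_created_message helper are made up for this example, and no translation catalog is loaded, so NullTranslations simply applies the English rule (singular for exactly 1, plural otherwise):

```python
import gettext

# Assumption: no real catalog is available, so a NullTranslations object
# stands in for one. It returns the first form for n == 1, the second otherwise.
translations = gettext.NullTranslations()

def rows_created_message(count):
    # Hypothetical helper: pick the noun form that matches the count.
    template = translations.ngettext("%d row created", "%d rows created", count)
    return template % count

print(rows_created_message(1))    # 1 row created
print(rows_created_message(236))  # 236 rows created
```

With a real catalog (loaded via gettext.translation), the same call can choose among as many plural forms as the target language defines.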

Not every string message in your code is something to localize, of course. Log messages that are not visible to the user can be left alone in English – they should be, in fact. Coincidentally, though, those messages are also very likely to contain many numbers, often used as numerical quantities: things to do, things done, error count, and so on:

    INFO: database.py:442/init_database - 236 rows created

What if that number is 1?…

    INFO: files.py:132/process_files - 1 files processed

Oh well. That’s hardly the end of the world, is it? Anyway, let’s just make the message slightly more universal:

    INFO: files.py:132/process_files - 1 file(s) processed

There, problem solved!

No worries, I haven’t gone insane. I know that no real-world software would put such gold plating on something as irrelevant as the grammar of its log messages. But it’s spring break, and we can be silly, so let’s have some fun with the idea.
Here I pose the question:

How hard would it be to construct the plural form of an English noun from the singular one?

Consulting the largest repository of human knowledge (well, second largest) reveals that the rules of building English plurals are not exactly trivial – but not very complex either. There are exceptions to almost every rule, though, and a large body of exceptions in general. Still, you could expect to achieve at least some success by just disregarding them completely, and following the simple rules to the letter.

How high would that success ratio be, though?

To estimate this, I crafted a totally insolent function that attempts to capture the various intricacies of the English language with half a screen’s worth of very simple code:

    def pluralize(singular):
        """Returns a plural form of given English noun,
        or more specifically, an attempt at something
        that can sometimes pass as a plural form... maybe.
        """
        plural = None

        # -ff, -fe and -f endings are replaced with -ves
        if not plural:
            for suffix in ("ff", "fe", "f"):
                if singular.endswith(suffix):
                    plural = singular[:-len(suffix)] + "ves"
                    break
        # -s, -sh, -x and -z endings take -es
        if not plural:
            for suffix in ("s", "sh", "x", "z"):
                if singular.endswith(suffix):
                    plural = singular + "es"
                    break
        # single letters take 's; consonant + y becomes -ies
        if not plural:
            if len(singular) < 2:
                plural = singular + "'s"
            elif singular.endswith("y") and singular[-2] not in "aeiouy":
                plural = singular[:-1] + "ies"

        # everything else just gets an -s
        if not plural:
            plural = singular + "s"
        return plural

Loosely speaking, it’s just a smarter version of a completely foolish approach, which is to slap an "s" at the end and pray the reader won’t notice. No irregularities whatsoever, just a few simple mappings.
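To make the "simple mappings" concrete, the same rules can be written down as a table of suffixes. This is only a hypothetical restatement (SUFFIX_RULES and pluralize_by_table are names invented here), intended to behave like the function above:

```python
# Each rule: (suffix to match, characters to strip, ending to append).
# Order matters: the first matching rule wins, as in the original loops.
SUFFIX_RULES = [
    ("ff", 2, "ves"),
    ("fe", 2, "ves"),
    ("f", 1, "ves"),
    ("s", 0, "es"),
    ("sh", 0, "es"),
    ("x", 0, "es"),
    ("z", 0, "es"),
]

def pluralize_by_table(singular):
    for suffix, strip, ending in SUFFIX_RULES:
        if singular.endswith(suffix):
            return singular[:len(singular) - strip] + ending
    if len(singular) < 2:
        return singular + "'s"
    if singular.endswith("y") and singular[-2] not in "aeiouy":
        return singular[:-1] + "ies"
    return singular + "s"

# Words the rules get right...
print([pluralize_by_table(w) for w in ("file", "knife", "box", "city", "day")])
# ['files', 'knives', 'boxes', 'cities', 'days']

# ...and a few they get wrong, since exceptions are ignored on purpose.
print([pluralize_by_table(w) for w in ("man", "sheep", "cliff")])
# ['mans', 'sheeps', 'clives']
```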

To test that code, I obviously wanted to get my hands on some large set of singular-plural noun pairs. That turned out to be surprisingly hard to find, however, especially in a form that is easily processable. Failing to procure that, I opted for a manual solution using just the singulars, sampled from the WordNet (http://wordnet.princeton.edu/) noun corpus that is included with the well-known nltk (http://nltk.org/) library.

And so I wrote a script (https://gist.github.com/Xion/5281703) that draws a sample of a given size and presents the noun pairs one by one for a human to judge. After several sessions and a few thousand words, I arrived at a pretty surprising success rate:

    Results: 259 out of 350 words correct (74.0%)
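The judging loop can be sketched roughly as follows. This is not the linked script, just a minimal approximation: run_session, ask_human and format_results are hypothetical names, the pluralizer and the judge are passed in as parameters, and the demo uses a scripted judge with a tiny hand-made answer key instead of a person at the keyboard:

```python
import random

def run_session(nouns, pluralize, judge, sample_size, seed=None):
    """Draw a random sample of nouns, show each guess to `judge`,
    and return (number judged correct, number shown)."""
    pool = list(nouns)
    rng = random.Random(seed)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    correct = sum(1 for noun in sample if judge(noun, pluralize(noun)))
    return correct, len(sample)

def ask_human(singular, plural):
    """Interactive judge: any answer starting with 'y' counts as correct."""
    answer = input("%s -> %s   correct? [y/n] " % (singular, plural))
    return answer.strip().lower().startswith("y")

def format_results(correct, total):
    return "Results: %d out of %d words correct (%.1f%%)" % (
        correct, total, 100.0 * correct / total)

# Demo with a scripted judge; pass ask_human instead for a manual session.
known = {"file": "files", "box": "boxes", "man": "men", "fish": "fish"}

def naive(noun):
    # Stand-in pluralizer for the demo: the "just slap an s on it" baseline.
    return noun + "s"

def scripted_judge(singular, plural):
    return known[singular] == plural

print(format_results(*run_session(known, naive, scripted_judge, sample_size=4)))
# Results: 1 out of 4 words correct (25.0%)
```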

Moreover, all the failures fell precisely into one of these two buckets:

- words whose singular is equal to the plural (like fish)
- foreign words imported verbatim, such as taxonomic Latin names (antennariidae?!...) or various French loanwords

While you could improve the algorithm a bit by adding the -ae case, it’s clear that no significant improvement can be achieved without employing some heavy dictionary lookups.

That’s assuming a uniform probability distribution over the whole space of words, of course. Because no real text exhibits that, the list of exceptions that are actually worthy of consideration includes mostly everyday irregular nouns – a third class, not included above.
There aren’t that many of them, but they are relatively common – among the most frequent words:

    >>> import nltk
    >>> nouns = (w for (w, t) in nltk.corpus.brown.tagged_words() if t == 'NN')
    >>> nouns_freq = nltk.FreqDist(nouns)
    >>> [w for (w, count) in nouns_freq.most_common(100)]
    ['time', 'man', 'Af', 'way', 'world', 'life', 'year', 'day', 'work', 'state',
    'place', 'course', 'number', ...

they constitute at least a few percent. With some 30 or 40 such words hard-coded into our simple function, I suspect going above a 90% success ratio in practice is firmly within our grasp.
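A minimal sketch of that hard-coding idea, assuming a small hand-picked table (IRREGULAR_PLURALS and pluralize_with_exceptions are hypothetical names, and the ten entries are only examples, not the full 30 or 40 words). The rule-based function is passed in as fallback, so a trivial stand-in is used here to keep the snippet self-contained:

```python
# A few everyday irregular and uninflected nouns (illustrative, not exhaustive).
IRREGULAR_PLURALS = {
    "man": "men",
    "woman": "women",
    "child": "children",
    "person": "people",
    "foot": "feet",
    "tooth": "teeth",
    "mouse": "mice",
    "fish": "fish",
    "sheep": "sheep",
    "series": "series",
}

def pluralize_with_exceptions(singular, fallback):
    """Check the exception table first, then fall back to the simple rules."""
    irregular = IRREGULAR_PLURALS.get(singular.lower())
    if irregular is not None:
        return irregular
    return fallback(singular)

def add_s(noun):
    # Stand-in for the rule-based function; it only appends "s".
    return noun + "s"

for word in ("man", "fish", "year"):
    print(word, "->", pluralize_with_exceptions(word, add_s))
# man -> men
# fish -> fish
# year -> years
```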

All in O(1) time and without any noticeable memory footprint. Who said that natural language processing must be hard? :)

Author: Xion, posted under Computer Science & IT


4 comments for post “The Great Plural Experiment”.
  1. Kos:
    March 31st, 2013 at 22:58

    Doing that with English is playing on Bring It On.

  2. Krever:
    April 1st, 2013 at 0:37

    Try with Polish :)

  3. TeMPOraL:
    April 1st, 2013 at 13:19

    It’s not internationalization, it’s using proper English. Try doing it in English and Polish at the same time :D.

  4. Asm:
    April 4th, 2013 at 14:15

    ReSharper uses a similar operation to suggest variable names for collection types. Also, Entity Framework generates plural table names from singular object names. It’s fast, and it has about 90% correctness; I assume the method is similar or even identical.

    Still, what about other languages? ;-)



© 2017 Karol Kuczmarski "Xion".