Counting Lines in Multiple Files

2012-04-06 13:40

It looks like using Linux is really bound to slowly – but steadily – improve your commandline-fu. As evidence, today I want to share a little piece of shell acolyte’s magic that I managed to craft without much trouble. It’s about counting lines in files – code lines in code files, to be specific.

For a single file, getting the number of text lines is very simple:

    $ wc -l some.file
      142 some.file

Although the name wc comes from “word count”, the -l switch makes it count lines instead. The flexibility of this little program doesn’t end here; for example, it can also read piped input (on stdin):

    $ cat some.file | wc -l
    142

as well as multiple files:

    $ wc -l some.file other.file
      142 some.file
       54 other.file
      196 total

Wildcards such as wc -l *.file work too, although that is the shell's doing rather than a wc feature: the shell expands the pattern into a list of file names before wc even starts. With these we could rather easily count the number of lines of code in our project:

    $ wc -l **/*.py
        3 foo/__init__.py
      189 foo/main.py
       89 foo/utils.py
       24 setup.py
      133 tests.py
      438 total

Unfortunately, the exact interpretation of the **/* wildcard varies between shells. In zsh it works as shown above, but in bash with default settings ** behaves like a single *, so the files from the current directory were omitted. While that might make some sense here (it would give a total without the setup script and tests), I’m sure it won’t be the case for all projects.
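For the record, bash 4.0 and later can match ** recursively once its globstar option is turned on. Here is a minimal sketch of the difference, assuming bash is installed; the sample files are made up for illustration, and the -O/+O switches just start a child bash with the option on or off (interactively you would type shopt -s globstar instead):

```shell
# Build a tiny throwaway tree (hypothetical files, not the project above).
demo=$(mktemp -d)
mkdir -p "$demo/foo"
printf 'a\nb\n' > "$demo/setup.py"        # 2 lines, top-level directory
printf 'a\nb\nc\n' > "$demo/foo/main.py"  # 3 lines, one level down
cd "$demo" || exit 1

# globstar off: ** acts like *, so only foo/main.py matches.
without=$(bash +O globstar -c 'cat **/*.py | wc -l')

# globstar on: ** also matches zero directories, so setup.py is included.
with=$(bash -O globstar -c 'cat **/*.py | wc -l')

echo "globstar off: $without lines, globstar on: $with lines"
```

This only helps in sufficiently recent bash, so it is not something a script can rely on in every shell.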

And so we need something smarter.

A reliable way to list all files matching a given property (e.g. name) in the current directory and, recursively, its subdirectories is to use the find command:

    $ find -name "*.py"
    ./foo/__init__.py
    ./foo/main.py
    ./foo/utils.py
    ./setup.py
    ./tests.py

Can we feed such a list into a command that takes multiple files, like wc? As it turns out, it is perfectly possible, and the utility that lets us do this is called xargs. It has numerous features, of course, but the simplest usage needs no options at all; we only need to supply the target command and pipe our list to xargs’ standard input:

    $ find -name "*.py" | xargs cat
    # ...many, many lines of code from all the *.py files...

This is how we get all the lines printed, so counting them is trivial now:

    $ find -name "*.py" | xargs cat | wc -l
    438
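One caveat with the option-less form: xargs splits its input on whitespace, so a path containing a space gets torn into separate, bogus arguments. A common workaround is to delimit names with NUL bytes instead. The sketch below assumes a find and xargs that support -print0 and -0 (the GNU and BSD versions do); the directory and file names are invented for the demonstration:

```shell
# Throwaway tree with a space in a directory name (hypothetical files).
demo=$(mktemp -d)
mkdir -p "$demo/my pkg"
printf 'x\ny\n' > "$demo/my pkg/mod.py"   # 2 lines
printf 'z\n' > "$demo/setup.py"           # 1 line
cd "$demo" || exit 1

# -print0 ends each file name with a NUL byte and -0 makes xargs split on
# NUL only, so "./my pkg/mod.py" reaches cat as a single argument.
total=$(find . -name "*.py" -print0 | xargs -0 cat | wc -l)

echo "$total"
```

If your file names never contain whitespace or quotes, the plain find | xargs form used in the rest of this post behaves the same.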

The real power of this technique lies in the fact that we can inject additional modifiers, or filters, at any stage. We can, for instance, eliminate some files we are not interested in by using grep -v:

    $ find -name "*.py" | grep -v "test" | xargs wc -l
        3 ./foo/__init__.py
      189 ./foo/main.py
       89 ./foo/utils.py
       24 ./setup.py
      305 total
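Keep in mind that grep -v "test" looks at the whole path, so it also throws away anything that merely contains those four letters. If that is too eager, the exclusion can be pushed into find itself with a negated -name test. A small sketch of the difference, using made-up file names:

```shell
# Throwaway directory with a deliberately awkward name (hypothetical files).
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'a\n' > latest.py        # not a test, but "latest" contains "test"
printf 'a\nb\n' > tests.py      # the file we actually want to skip
printf 'a\nb\nc\n' > main.py

# grep -v filters on the full path, so latest.py is dropped as well.
by_grep=$(find . -name "*.py" | grep -v "test" | xargs cat | wc -l)

# ! -name only excludes files whose base name starts with "test".
by_find=$(find . -name "*.py" ! -name "test*.py" | xargs cat | wc -l)

echo "grep filter: $by_grep lines, find filter: $by_find lines"
```

Which of the two is right depends on how the tests are named in a given project; the grep variant is handier when they live in a tests/ directory.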

Likewise, we can get rid of comment lines if we push the output of cat through another regex-based filter:

    $ find -name "*.py" | xargs cat | grep -v '^\s*#' | wc -l
    312

Obviously, both greps can be present at once:

    $ find -name "*.py" | grep -v "test" | xargs cat | grep -v '^\s*#' | wc -l
    223
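If you find yourself retyping this, the whole chain fits into a small shell function. This is only a sketch: the name count_loc and its two arguments are my own invention, it treats every line starting with optional whitespace and # as a comment, and it uses the portable [[:space:]] class in place of \s:

```shell
# count_loc DIR EXT (hypothetical helper): count non-comment lines in all
# *.EXT files under DIR, skipping paths that contain "test".
count_loc() {
    (
        cd "$1" || exit 1
        find . -name "*.$2" | grep -v "test" | xargs cat \
            | grep -v '^[[:space:]]*#' | wc -l
    )
}

# Try it on a throwaway directory (sample files made up for the demo).
demo=$(mktemp -d)
printf '# comment\nx = 1\n    # indented comment\ny = 2\n' > "$demo/main.py"
printf 'assert True\n' > "$demo/tests.py"

result=$(count_loc "$demo" py)
echo "$result"
```

Running find from inside the directory keeps the paths relative, so the "test" filter cannot be tripped by the name of a parent directory.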

The complexity of this command likely exceeds many typical use cases for find or grep, although seasoned shell hackers may think otherwise. In any case, I think the power of this technique is evident, and not only for counting lines.

Author: Xion, posted under Computer Science & IT »


4 comments for post “Counting Lines in Multiple Files”.
  1. pax:
    April 6th, 2012 at 16:26

    or even wildcards, such as wc -l *.file.

    nope! wildcards are expanded by shell, it’s not a wc feature.

  2. Sebastian:
    April 6th, 2012 at 22:46

    Pax is right. Unescaped wildcards are expanded by the shell; for this reason, find with the -name or -path option must be given quoted or escaped wildcards.

  3. agentj:
    April 7th, 2012 at 7:51

    “Pipes are good for simple hacks, like passing around simple text streams,
    but not for building robust software.”

  4. Xion:
    April 10th, 2012 at 10:18

    When it comes to pipes, it’s actually PowerShell that does it better. Since what’s being pushed through pipes there is not text but objects, you don’t have to pay attention to (or specify) what format a particular part of the chain should output its data in. Unfortunately, in the case of *nix there is no standardized object model (like COM or .NET for Windows) that would allow that.

© 2017 Karol Kuczmarski "Xion".