Many are the quirks of shell scripting. Most are related to confusing syntax, but some come from certain surprising semantics of Bash as a language, as well as the way scripts are executed.
Consider, for example, that you’d like to list files that are within a certain size range. This is something you cannot do with `ls` alone. And while there’s certainly some `awk` incantation that makes it trivial, let’s assume you’re a rare kind of scripter who actually likes their hacks readable:
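A sketch of what such a script might look like – my own reconstruction, not the article’s verbatim code: I’m assuming the bounds arrive as positional parameters, and the size lookup tries GNU `stat -c %s` first with BSD `stat -f %z` as a fallback (the listings later in the post use the BSD form). It’s wrapped in a function purely to make it easy to try out:

```shell
# ls_between (sketch): print files in the current directory whose size
# in bytes falls strictly between the two given bounds.
ls_between() {
    local min=$1 max=$2
    ls | while read -r filename; do
        # GNU stat first, BSD stat (-f %z) as a fallback
        size=$(stat -c %s "$filename" 2>/dev/null || stat -f %z "$filename")
        if [ "$size" -gt "$min" ] && [ "$size" -lt "$max" ]; then
            echo "$filename"
        fi
    done
}
```

For example, `ls_between 1024 4096` would list files between 1 KiB and 4 KiB.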
So you use an explicit `while` loop, obtain the file size using `stat`, and compare it to the given bounds using a straightforward `if` statement. Pretty simple code that shouldn’t cause any trouble later on… right?
But as your needs grow, you find that you also want to count how many files fall within your range, and how many do not. Given that you already have an explicit `if`, it appears to be a simple addition (in a quite literal sense):
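Mirroring the final listings, the change might look like this (again my sketch: the example bounds and the final summary `echo` are assumed for illustration; note that the loop is still fed through a pipe, which turns out to be the crux of the problem):

```shell
min=4        # example bounds, assumed for illustration
max=1000
matches=0
misses=0

ls | while read -r filename; do
    size=$(stat -c %s "$filename" 2>/dev/null || stat -f %z "$filename")
    if [ "$size" -gt "$min" ] && [ "$size" -lt "$max" ]; then
        echo "$filename"
        matches=$((matches + 1))
    else
        misses=$((misses + 1))
    fi
done

echo "matches: $matches, misses: $misses"   # prints "matches: 0, misses: 0"
```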
Why doesn’t it work, then? Because running `ls_between` (our script here) clearly doesn’t produce the output we’re looking for.
It seems that neither `matches` nor `misses` are counted properly, even though it’s clear from the printed list that everything is fine with our `if` statement and loop. Wherein lies the problem?
Surprising as it may be, the issue is within the very first line of our loop – the `ls | while read filename; do` pipeline itself.
The `while` statement itself is completely fine; however, the way we’re feeding it input data is not adequate to our needs. We use piping, which turns the output of one process into the input of another. A cornerstone of Unix programming and automation, it has one very useful property that allows us to manipulate even very big chunks of data: it doesn’t use any intermediate storage. Piping happens on a line-by-line basis between processes running in parallel, and no temporary files are involved.
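That no-intermediate-storage property is easy to witness. In this snippet (my own example), `yes` would happily produce output forever, yet the pipeline finishes immediately: `head` reads just three lines as they stream through the pipe, then both processes shut down – no temporary file ever exists:

```shell
# yes repeats its argument endlessly; head stops the stream after 3 lines
greeting=$(yes 'hello' | head -n 3)
echo "$greeting"   # three lines of "hello"
```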
But why would such an efficient method of data processing be inadequate here? Well, the whole “processes running in parallel” part is what gets us. We establish a pipe between `ls` and a `while` loop, which requires the latter to run in a process of its own. That process is simply another instance of the executing shell – a subshell – spawned specifically to capture the output of `ls`, read it, and execute our loop’s body.
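You can observe that subshell directly. A small demonstration of my own, using Bash’s `$BASHPID` (which, unlike `$$`, reports the real PID even inside a subshell); the temporary file is just a way to smuggle the value out:

```shell
parent=$BASHPID
tmp=$(mktemp)

printf 'one line\n' | while read -r line; do
    echo "$BASHPID" > "$tmp"    # PID of the subshell running the loop
done

loop_pid=$(cat "$tmp")
rm -f "$tmp"
echo "parent: $parent, loop: $loop_pid"   # two different PIDs
```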
And this works fine, as long as we don’t try to communicate data back to our main script. The loop lives in a child process, which makes it unable to alter anything in the environment of the parent process – the one executing our complete script. This means, among other things, that it cannot modify variables from the “outer scope” (such as `matches` and `misses`) and have the changes persist after the loop exits. More precisely, it doesn’t even have access to those parent variables: all it gets is a mere local copy of the environment containing them. Sure, it can increment them perfectly well for its own use, but in the parent environment those variables will still be set to `0`.
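This is the whole quirk in a few lines (my minimal reproduction):

```shell
count=0

printf 'a\nb\nc\n' | while read -r line; do
    count=$((count + 1))    # increments only the subshell's private copy
done

echo "count: $count"   # prints "count: 0" -- the parent never saw the increments
```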
The solution is to avoid creating the subshell in the first place and execute the loop in the same process as the rest of our script. For that, we need to leave piping aside and find another way of feeding the loop with data.
Fortunately, the `while`…`done` syntax permits input redirection, much in the same way as ordinary command execution, i.e. through the `<` operator. The crucial difference is that unlike external commands, the shell’s own constructs – such as the `while` loop – execute within the running shell’s process. As a result, they have access to the “outer” environment and variables.
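To check that claim, the counting loop from before does work when its input comes from a redirection rather than a pipe (the temporary file here exists only for the sake of the demonstration):

```shell
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"

count=0
while read -r line; do
    count=$((count + 1))    # runs in the current shell -- no subshell
done < "$tmp"
rm -f "$tmp"

echo "count: $count"   # prints "count: 3"
```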
You may recall, though, that using the redirection operator `<` requires a file:

$ cat <./hello.py
print "Hello, world!"
But we don’t want to store the results of `ls` in a file just to pass them to a loop; that would be cumbersome, slow, and require additional cleanup. Thankfully, in Bash & co. you can use process substitution – the `<(...)` syntax – to make a temporary “file” (a file descriptor, really) that maps to the output of a given command:
while read -r filename; do
    size=$(stat -f %z "$filename")
    if [ "$size" -gt "$min" ] && [ "$size" -lt "$max" ]; then
        echo "$filename"
        ((matches++))
    else
        ((misses++))
    fi
done < <(ls) # note the space
There is also an alternative: store the output in a variable and redirect from that variable using the dedicated `<<<` (“here string”) operator. (Note that `<<<` is not POSIX either – it’s a Bash/ksh/zsh extension.)
LS=$(ls)
while read -r filename; do
    size=$(stat -f %z "$filename")
    if [ "$size" -gt "$min" ] && [ "$size" -lt "$max" ]; then
        echo "$filename"
        ((matches++))
    else
        ((misses++))
    fi
done <<< "$LS"
I actually find this variant more readable, probably because the source command is placed before the loop itself. Either one should work in most cases, though.
I think it should be `min=$1`, `max=$2`. Your `stat` command doesn’t work for me either, I had to use `stat --printf="%s" "$filename"`.
Anyways, thanks for the weird/nice tip about this feature :) But in this particular case, I would write a `for f in *; do` loop, which is a cooler way to iterate over files.
Right, the `ls` was mostly for illustration purposes. If I recall correctly, it was some result of `grep` that I wanted to iterate over when I found out about this quirk.
As for `stat`: that’s a portability issue. Turns out `-f` works in BSD flavors of Unix, while `--printf` does in Linux. I tested this on OS X and didn’t really think to check the Linux man page, as I hadn’t been bitten by any such differences yet. I’ll know better now :)