sed / regular expressions / optimizing

  • Written by Walter Doekes

Last week, we were looking at using sed(1) to remove line feeds from CSV files. The regular expressions in that post could use some further inspection.

I'm sorry that this is a rather dry post. I wanted to get the numbers out there because the differences are significant. But without the example runs you're just left with numbers.

Feel free to skip right to the conclusions.

To recap: we were working on multiline CSV files. Counting even versus odd numbers of double quotes is relevant, because the count tells us whether or not we're still inside the same row.
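For example, in this made-up two-column CSV (my own illustration, not data from the original post), the second row spans two physical lines; its first physical line contains an odd number of double quotes, which tells us the row isn't finished yet:

id,comment
1,"one line"
2,"two
lines"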

We start off by creating two sample files:

$ python3 -c 'for x in range(100000): print("A"*1500)' >many-alpha1500.dat
$ python3 -c 'for x in range(100000): print("\""*1500)' >many-dquotes1500.dat

many-alpha1500.dat has 100k lines with 1500 letters A: zero double quotes.

many-dquotes1500.dat has 100k lines with 1500 double quotes: basically everything is a double quote.
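Each line is 1500 characters plus a newline, so both files weigh in at roughly 150 MB:

$ wc -c many-alpha1500.dat many-dquotes1500.dat
150100000 many-alpha1500.dat
150100000 many-dquotes1500.dat
300200000 total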

We want to compare the speed of a match versus a non-match, and take the printing of lines out of the equation. For sed we add d (delete) or !d (delete non-matching) to whichever case matches, so that all output is suppressed:

$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.950s (all user time)

The astute reader will recognise d41d8cd98f00b204e9800998ecf8427e as the MD5 hash of empty input. We want no output, because printing would affect the timing.
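You can verify that with an empty input:

$ printf '' | md5sum
d41d8cd98f00b204e9800998ecf8427e  -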

The regular expression from the previous post, which was quickly cobbled together, looked like this:

/^[^"]*("[^"]*"[^"]*)*$/

That is: zero-or-more non-double-quotes, then a pair of double quotes with non-double-quotes around and between them, repeated as often as possible. Or, in plain English: an even number of double quotes.
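A quick sanity check with two tiny sample lines of my own, one with an even and one with an odd number of double quotes; only the even one should survive:

$ printf '%s\n' 'a"b"c' 'a"b' | grep -E '^[^"]*("[^"]*"[^"]*)*$'
a"b"c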

Reducing backtracking

Regular expression backtracking wisdom says you should keep the number of quantifiers (?, *, +) down. Rewriting the expression slightly to /^([^"]|"[^"]*")*$/ makes sense, both for clarity and for having fewer quantifiers. For speed it doesn't seem to matter though.
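A quick check that the rewrite accepts and rejects the same two sample lines as the original expression:

$ printf '%s\n' 'a"b"c' 'a"b' | grep -E '^([^"]|"[^"]*")*$'
a"b"c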

$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.901s (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.844s (all user time)
$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.071s (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.098s (all user time)

The choice of regular expression did not seem to matter, but the dataset did. For some reason, the lines full of double quotes take a slightly faster path.

Here it's hard to tell whether it's slow because of backtracking, or because of something else in the regex. It's not the address code itself, because replacing the regex with /./ or /(.*)/ makes it fast. And when feeding it double quotes (the "hard" case), it was in fact faster (0m4.071s/0m4.098s); maybe because of the larger match between parentheses?

What we do observe, though, is that if we make it non-matching (looking for an odd count of double quotes), it's suddenly fast for the all-A case:

$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.365s (all user time)

My instinct says those two matches should take the same amount of time on a long string of A characters. But apparently I'm wrong. Perhaps the now-mandatory double quote gives the engine something cheap to scan for, so that lines containing no double quotes at all can be rejected early; but that is speculation.

The lots-of-double-quotes case is now (expectedly) slow, even though it doesn't match either.

$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.490s (all user time)

grep and perlre

We take a quick detour using grep.

grep differentiates between our old and our new regular expression:

$ time grep -vE '^[^"]*("[^"]*"[^"]*)*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m13.411s (all user time)
$ time grep -vE '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m21.086s (all user time)

… but in a bad way: the expression that is (in my opinion) better performs worse. And both are horribly slow compared to sed.

grep can also do Perl-compatible regular expressions (PCRE, using the -P option). Let's see how that compares:

$ time grep -vE '^.*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.403s (all user time)
$ time grep -vP '^.*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.106s (all user time)

Using -P is roughly four times as fast for even the most basic expression. Let's try a slightly less simple, but still trivial, one:

$ time grep -vE '^[^"]*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.062s (all user time)
$ time grep -vP '^[^"]*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.113s (all user time)

The above example tells us that even the simplest regular expressions can be slow with the old POSIX engine in grep. Perl-compatible or other modern implementations will be slightly faster in trivial cases, and insanely much faster in slightly more complicated ones:

$ time grep -vE '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m21.072s (all user time)
$ time grep -vP '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.267s (all user time)

And no, before you ask, memory usage is only slightly higher when using the Perl engine.
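If you want to verify that yourself, GNU time (the /usr/bin/time binary, not the shell builtin) reports peak memory usage with -v; look for "Maximum resident set size" in the output:

$ /usr/bin/time -v grep -vE '^([^"]|"[^"]*")*$' many-alpha1500.dat >/dev/null
$ /usr/bin/time -v grep -vP '^([^"]|"[^"]*")*$' many-alpha1500.dat >/dev/null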

Summary

Here's a quick summary of the results from above, although it doesn't show all the nuance between the different inputs and matching vs. non-matching.

regex                     sed -E  grep -E  grep -P
^.*$                        2.40     0.40     0.11
^[^"]*$                     4.25     4.15     0.11
^([^"]|"[^"]*")*$           5.00    21.20     0.32
^[^"]*("[^"]*"[^"]*)*$      5.00    13.70     0.11

(Times in seconds; lower is better.)

And what about Python? Yes, there is that too, with yet more and different results. It's getting hard to keep track of all the dimensions. And this time there is even system time involved.

$ time python -c "$(printf '%s\n' \
    'import re;import sys;c=re.compile(r'\''^[^"]*("[^"]*"[^"]*)*$'\'')' \
    'for l in sys.stdin:' ' if not c.match(l):' '  print(l)')" \
    < many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.122s
user  0m0.086s
sys   0m0.036s
$ time python -c "$(printf '%s\n' \
    'import re;import sys;c=re.compile(r'\''^[^"]*("[^"]*"[^"]*)*$'\'')' \
    'for l in sys.stdin:' ' if not c.match(l):' '  print(l)')" \
    < many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.176s
user  0m2.768s
sys   0m1.369s

I give up trying to explain any of this.
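For the record, the quoting-heavy one-liner is easier on the eyes as a standalone script (equivalent code; the filename even_quotes.py is mine):

#!/usr/bin/env python3
# even_quotes.py -- same filter as the python -c one-liner above
import re
import sys

# Match lines containing an even number of double quotes.
matcher = re.compile(r'^[^"]*("[^"]*"[^"]*)*$')

for line in sys.stdin:
    if not matcher.match(line):
        print(line)

Run it the same way: python3 even_quotes.py < many-alpha1500.dat | md5sum.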

Conclusions

The conclusions I draw from the above are:

  1. test your regular expressions on exaggerated (larger) datasets instead of assuming;
  2. before you try to optimize a regular expression, consider switching to a different regular expression engine instead (see the perl example below); and
  3. if speed matters, you'll probably want to avoid both grep -E and sed altogether.
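As an illustration of point 2: the same filter can be handed to perl itself, whose engine is what grep -P imitates. I haven't timed this variant here, but the expected output is again empty:

$ perl -ne 'print unless /^([^"]|"[^"]*")*$/' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -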
