sed / remove line feeds / csv

Written by Walter Doekes

Can you use sed(1) to remove line feeds?

Yes you can. Even though sed, the stream editor, is line based, you can use it to remove line feeds from output.
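
In fact, if all you want is to flatten every line feed, GNU sed can do it in one go with -z, which makes it read NUL-separated records so that the whole input ends up in a single pattern space (a quick sketch, GNU sed only):

$ seq 3 | sed -z 's/\n/ /g; s/ $/\n/'
1 2 3

For the CSV task below, though, we only want to remove the line feeds inside quoted strings, so we'll have to work harder than that.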

In this post we'll be exploring how that is done and look into how a slight refactoring step can provide a big speed gain. The task at hand is removing line feeds from a CSV file with multiline strings. Why? Because I want to use grep and awk on it, and multiline strings would complicate that a lot. Why sed? Because I wanted to know if it's as easy as I think it should be.

Let's dig in.

sample.csv

We'll start off with a small sample.csv that we can play around with, to get us going:

idx,name,year,summary,url
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Eastnor Castle Deer Park
Eastnor
Herefordshire (UK)",https://www.emfcamp.org/schedule/2018
2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/
3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp.
Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO
4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-

As you can see, this CSV file has one header row and four CSV data rows. But because some rows are split over multiple lines, doing a simple grep, cut or awk will not work as intended.

Here's a feeble attempt to get all the URLs from the fifth column, by asking awk to split by comma:

$ awk -F, '{print $5}' sample.csv
url




https://www.osso.nl/blog/2019/recap2019/


-

As expected, awk does not do what we want: it operates on individual lines and has no idea where a record/row begins or ends.

Quick sidenote: using cut -d, -f5 sample.csv would produce different, and even more wrong, output. If there is no fifth field, you'll get an earlier field instead. See this example:

$ echo how | cut -d, -f2,3,4
how
$ echo how,are,you | cut -d, -f2,3,4
are,you

That's unexpected behaviour that awk avoids.
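
(If you do want cut to discard such lines instead, it has a -s flag that suppresses lines without any delimiter:

$ echo how | cut -d, -s -f2,3,4
$ echo how,are,you | cut -d, -s -f2,3,4
are,you

But that still doesn't help with the multiline rows.)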

The goal

Selecting the URLs from the CSV sample would have worked fine, had it looked like this:

idx,name,year,summary,url
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor Herefordshire (UK)",https://www.emfcamp.org/schedule/2018
2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/
3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp. Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO
4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-

Here we have one header and four rows, and they add up to five lines. Selecting the fifth column is easy now:

$ awk -F, '{print $5}' goal.csv
url
https://www.emfcamp.org/schedule/2018
https://www.osso.nl/blog/2019/recap2019/
https://wiki.mch2022.org/Village:OSSO
-

As you may have noticed, a comma inside a CSV column/string would trip us up as well. Ignore that for now. You can argue that using regular unix utilities to work with CSV files is the wrong approach, but I think that oftentimes the command line is king. If you're working with new/varied data, using the command line to quickly explore and prototype is unbeatable. You can switch to a real programming language when you're done prototyping.

So, we have the input (sample.csv), and we have the output (goal.csv with md5 hash d7a5eba9e0a80a1d8c1158a1430927a3). Can we coax sed into producing goal.csv? We can. (I used the md5 hash to confirm that all the versions of goal.csv are equal.)
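
If you're following along and saved the flattened version as goal.csv, you can check it like this:

$ md5sum goal.csv
d7a5eba9e0a80a1d8c1158a1430927a3  goal.csv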

Attempt #1: making it work

Here's attempt one, which I have to admit took me longer to produce than I wanted. And I have to say ChatGPT was of no help at all. I can tell you its sed skills are abysmal.

#!/bin/sed -Ef

# Append line to hold space, clear pattern
# space, put hold space in pattern space.
H
z
x

# If pattern space contains an uneven number of
# double quotes, place everything back into the
# hold space and don't display.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    h
    d
}

# If pattern space contains an even number of
# double quotes, we're done collecting.
# The first hold added a linefeed to void,
# remove it now.
# Replace the others with a single space.
/^[^"]*("[^"]*"[^"]*)*$/ {
    s/^\n//
    s/\n/ /g
}

That's my first working attempt:

$ ./attempt-1.sed sample.csv | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -

It works! But it has a couple of odd quirks.

Taking you through it one by one:

  1. sed has two buffers: the pattern buffer and the hold buffer. You're always working on the pattern buffer. The first H appends the input (the current pattern) to the hold buffer.
  2. The z wipes (zeroes) the pattern buffer, and the x swaps the two buffers. Now the pattern buffer has all the collected input so far.
  3. The /^[^"]*("[^"]*"[^"]*)*"[^"]*$/ regex checks whether there is an uneven number of double quotes: if there is, the pattern buffer is copied back into the hold buffer (h), the line is deleted (not printed), and the cycle restarts.
  4. If we get to the next regex (/^[^"]*("[^"]*"[^"]*)*$/) we've confirmed that the quotes are even. We remove the first linefeed (which got added by the first H) and replace the rest with a single space.
  5. If we get here, we don't copy the pattern buffer back into the hold buffer (no h), so the next cycle starts with an empty hold buffer.
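
Two of those steps are easy to verify in isolation. The uneven-double-quotes regex from step 3 can be tested with grep -E, which uses the same extended regex syntax:

$ printf '%s\n' 'one " quote' 'two " quotes "' | grep -E '^[^"]*("[^"]*"[^"]*)*"[^"]*$'
one " quote

Only the line with the uneven number of double quotes matches. And the stray linefeed from steps 1 and 2 shows up in a minimal H;z;x run (GNU sed, because z is a GNU extension):

$ printf 'a\nb\n' | sed 'H;z;x'

a

b

The hold buffer starts out empty, so the first H appends the pattern after a separating linefeed, which ends up at the front of the output. That's the linefeed that s/^\n// removes.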

So. That works. And after looking more into the sed info pages, it turns out that the d command restarts the cycle. That means that the second regex match isn't even needed. It could instead just be:

# If we get here, the pattern space contains an
# even number of double quotes. The first hold
# added a linefeed to void, remove it now.
# Replace the others with a single space.
s/^\n//
s/\n/ /g
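
Putting those pieces together, the whole simplified attempt #1 would read:

#!/bin/sed -Ef

# Append line to hold space, clear pattern
# space, put hold space in pattern space.
H
z
x

# If pattern space contains an uneven number of
# double quotes, place everything back into the
# hold space and don't display. The d restarts
# the cycle.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    h
    d
}

# If we get here, the pattern space contains an
# even number of double quotes. The first hold
# added a linefeed to void, remove it now.
# Replace the others with a single space.
s/^\n//
s/\n/ /g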

Attempt #2: use only the pattern buffer

When working with real data, it turned out that this was rather slow. For a simple 1000-row, 1 MiB file, having the script take the better part of a second seemed inappropriate.

The thing that immediately stands out is the swapping between pattern and hold space. Step one is avoiding that.

#!/bin/sed -Ef

:x

# If pattern space contains an uneven number of
# double quotes, add more to the line and
# continue at x.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    N
    bx
}

# Apparently we have an even number of double
# quotes. Remove LFs and print (by ending the
# cycle).
s/\n/ /g

This looks better already. We don't need the hold buffer at all.

  1. The N reads a new line and appends it to the pattern buffer.
  2. bx jumps back to the :x label.
  3. This repeats for as long as there is an uneven amount of double quotes in the pattern buffer.
  4. Once the double quotes are even, we end up on the last line.
  5. There we replace the line feeds with spaces and let the cycle restart with the automatic printing of the pattern buffer.
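
The behaviour of N on its own is easy to see with a two-line join. (A GNU sed nicety: when N finds no next input line, GNU sed prints the pending pattern space before exiting, where POSIX sed would discard it.)

$ printf 'a\nb\nc\n' | sed 'N;s/\n/ /'
a b
c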

Is it faster?

Well... according to a couple of sample runs with a 1.3 MiB dataset, we go from maybe 690 ms to 670 ms. Definitely not an earth-shattering speed gain. What's next?

Attempt #3: stop doing double work

On to fixing the real problem: the repetitive regular expression matching. The dataset I'm working with at the moment looks like this:

$ wc dataset.csv
  23679  156039 1389021 dataset.csv
$ ./attempt-2.sed dataset.csv | wc
   1001  156039 1389021

A 1.3 MiB file, 1001 rows, 23679 lines. On average, that's about 24 lines per row and about 1388 bytes per row.

That means that, on average, it has to rerun the regular expression about 24 times before it stops matching (because the double quotes have become even). And it has to do so on a pattern space that grows to an average of 1388 characters:

1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:

Uneven amount of quotes? Yes, 3. Restart.

1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Eastnor Castle Deer Park

Still uneven amount of quotes? Yes, 3. Restart.

1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Eastnor Castle Deer Park
Eastnor

Still uneven amount of quotes? Yes, 3. Restart again.

You can see how that's inefficient. With about 24 lines per row, the regex scans 1 line, then 2, then 3, and so on: roughly 24 × 25 / 2 = 300 line scans per row instead of 24. It would be better if it just counted the double quotes on the latest input line.

Attempt #3 of the script fixes that:

#!/bin/sed -nEf

# Find an uneven amount of double quotes.
# If found, populate hold space, load next line
# and jump to inquote.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    h
    n
    b inquote
}

# Not found? Jump to done.
b done

# Loop here until the input line also has an
# uneven amount of quotes: uneven + uneven is
# even. In the mean time, add to the hold
# buffer and fetch the next line.
:inquote
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    H
    x
    b done
}
H
n
b inquote

:done
s/\n/ /g
# sed -n and manual printing is needed,
# otherwise the 'n' will print the pattern
# space.
p

Admittedly, this looks a lot less like a sed one-liner, and more like an obscure programming language. But, boy, it is so much faster. With the same dataset, we go down to mere milliseconds.

It works like this:

  1. Find an uneven amount of double quotes. If not found, jump to the done label.
  2. At the done label, we replace the line feeds, and print the output. Restart the cycle.
  3. If we did find an uneven amount of double quotes, store the pattern in the hold space, get the next pattern, and go to the inquote label.
  4. At inquote we also look for an uneven amount of quotes. If not found, it means that we're not done yet (uneven + even = uneven). We append the pattern to the hold buffer using H and load up the next pattern with n. Go back to inquote.
  5. If we did find an uneven amount of double quotes, we're done (uneven + uneven = even). Append this line as well with H, and then exchange the hold and the pattern buffer with x so we can work on the entire line. (Using H;x is the same as x;G in this case.) Go to done (see above).
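
A quick check against the sample confirms it still produces the same output as the earlier attempts:

$ ./attempt-3.sed sample.csv | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -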

Timings

For completeness' sake, here are some representative timing outputs:

$ time ./attempt-1.sed dataset.csv >/dev/null

real  0m0.696s (all user time)
$ time ./attempt-2.sed dataset.csv >/dev/null

real  0m0.664s (all user time)
$ time ./attempt-3.sed dataset.csv >/dev/null

real  0m0.033s (all user time)

As you can see, that's much better.

In the next post, we'll see if we can improve the regular expressions.

