hpr2184 :: Gnu Awk - Part 5

In this episode, I describe how to use regular expressions with Awk.

Hosted by Mr. Young on Thursday, 2016-12-15. The episode is flagged as Clean and is released under a CC-BY-SA license.
Tags: awk, bash, command-line, cli. Comments: 4.

Duration: 00:39:54

Part of the series: Learning Awk.

Episodes about using Awk, the text manipulation language. It comes in various forms called awk, nawk, mawk and gawk, but the standard version on Linux is GNU Awk (gawk). It's a programming language optimised for the manipulation of delimited text.

GNU AWK - Part 5

Regular Expressions in AWK

The syntax for using regular expressions to match lines in AWK is as follows:

word ~ /match/

Or for not matching, use the following:

word !~ /match/
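
As a minimal sketch of the two operators, the following runs both against three made-up input lines (the words and the variable names are assumptions chosen only for illustration). A pattern with no action prints the matching record.

```shell
# Three hypothetical input lines, chosen only for illustration.
input='apple
banana
cherry'

# ~ keeps the lines that match /an/; !~ keeps the lines that do not.
matching=$(printf '%s\n' "$input" | awk '$0 ~ /an/')
non_matching=$(printf '%s\n' "$input" | awk '$0 !~ /an/')

echo "match:"
echo "$matching"
echo "no match:"
echo "$non_matching"
```

Here `$0` is the whole record; any field such as `$1` can be used in its place.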

Remember the following file from the previous episodes:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

We can run the following Awk program against it:

$1 ~ /p[elu]/ {print $0}

We will get the following output:

apple      red    4
grape      purple 10
apple      green  8
plum       purple 2
pineapple  yellow 5
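
For anyone following along at a terminal, here is one way the program above might be run. The file name fruit.txt is an assumption; the notes do not say what the file is called.

```shell
# Recreate the sample data. The file name fruit.txt is an assumption.
cat > fruit.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# Keep records whose first field contains "p" followed by e, l, or u.
matched=$(awk '$1 ~ /p[elu]/ {print $0}' fruit.txt)
printf '%s\n' "$matched"
```

The single quotes stop the shell from expanding `$1` and `$0` before Awk sees them.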

In another example:

$2 ~ /e{2}/ {print $0}

This produces the following output:

apple      green  8

Regular expression basics

Certain characters have special meaning when using regular expressions.

Anchors

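^ - beginning of the string being matched (the whole record, or a single field such as $1)
$ - end of the string being matched
\` - beginning of a string (gawk only; \A is not supported)
\' - end of a string (gawk only; \z is not supported)
\y - on a word boundary (gawk only; in Awk, \b is the backspace character)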
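
A short sketch of anchoring against individual fields, using the sample data from earlier in these notes. It sticks to ^ and $, which behave the same way in any POSIX awk. The file name fruit.txt and the variable names are assumptions.

```shell
# Recreate the sample data. The file name fruit.txt is an assumption.
cat > fruit.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# ^ anchors at the start of field 1: names beginning with "p".
starts_with_p=$(awk '$1 ~ /^p/ {print $1}' fruit.txt)

# $ anchors at the end of field 2: colors ending in "n".
ends_with_n=$(awk '$2 ~ /n$/ {print $1, $2}' fruit.txt)

echo "$starts_with_p"   # plum, potato, pineapple (one per line)
echo "$ends_with_n"     # apple green, kiwi brown, potato brown (one per line)
```

Because the match is against a field rather than the whole line, the anchors refer to the start and end of that field.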

Characters

[ad] - a or d
[a-d] - any character a through d
[^a-d] - not any character a through d
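\w - any word character (letter, digit, or underscore; gawk only)
\s - any white-space character (gawk only)
[0-9] or [[:digit:]] - any digit (\d is not supported by gawk)

The capital versions, \W and \S, are negations.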

Or, you can reference characters the POSIX standard way. These classes are only valid inside a bracket expression, for example [[:digit:]] or [^[:space:]]:

[:alnum:] - Alphanumeric characters
[:alpha:] - Alphabetic characters
[:blank:] - Space and TAB characters
[:cntrl:] - Control characters
[:digit:] - Numeric characters
[:graph:] - Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
[:lower:] - Lowercase alphabetic characters
[:print:] - Printable characters (characters that are not control characters)
[:punct:] - Punctuation characters (characters that are not letters, digits, control characters, or space characters)
[:space:] - Space characters (such as space, TAB, and formfeed, to name a few)
[:upper:] - Uppercase alphabetic characters
[:xdigit:] - Characters that are hexadecimal digits
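
The following sketch shows a bracket range and a POSIX class at work on the sample data from earlier in these notes. The file name fruit.txt and the variable names are assumptions. Note the doubled brackets around the class name.

```shell
# Recreate the sample data. The file name fruit.txt is an assumption.
cat > fruit.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# Bracket range: names whose first letter is between a and k.
a_to_k=$(awk '$1 ~ /^[a-k]/ {print $1}' fruit.txt)

# POSIX class inside a bracket expression: amounts made up only of digits.
# This also skips the header row, whose third field is "amount".
numeric=$(awk '$3 ~ /^[[:digit:]]+$/ {print $3}' fruit.txt)

echo "$a_to_k"
echo "$numeric"
```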

Quantifiers

. - match any character
+ - match preceding one or more times
* - match preceding zero or more times
? - match preceding zero or one time
{n} - match preceding exactly n times
{n,} - match preceding n or more times
{n,m} - match preceding between n and m times
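
A small sketch of the most common operators from this list, run against the sample data from earlier in these notes. It deliberately uses only ., +, * and ?, since support for the {n} interval forms has historically varied between awk implementations. The file name fruit.txt and the variable names are assumptions.

```shell
# Recreate the sample data. The file name fruit.txt is an assumption.
cat > fruit.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# + (one or more): amounts with at least two digits.
two_plus_digits=$(awk '$3 ~ /^[0-9][0-9]+$/ {print $1, $3}' fruit.txt)

# . and * (any run of characters): names that start with p and end with e.
p_to_e=$(awk '$1 ~ /^p.*e$/ {print $1}' fruit.txt)

# ? (zero or one): "plum" with an optional trailing "s".
plum=$(awk '$1 ~ /^plums?$/ {print $1}' fruit.txt)

echo "$two_plus_digits"   # grape 10
echo "$p_to_e"            # pineapple
echo "$plum"              # plum
```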

Grouped Matches

(...) - Parentheses are used for grouping
| - Alternation: matches either the expression on its left or the one on its right, most often inside a group
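
As an illustration, grouping and alternation can be combined with anchors to match a field against a fixed set of values. This sketch uses the sample data from earlier in these notes; the file name fruit.txt and the variable name are assumptions.

```shell
# Recreate the sample data. The file name fruit.txt is an assumption.
cat > fruit.txt <<'EOF'
name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5
EOF

# The group limits the alternation, so the anchors apply to both choices:
# the color must be exactly "red" or exactly "green".
red_or_green=$(awk '$2 ~ /^(red|green)$/ {print $1, $2}' fruit.txt)

echo "$red_or_green"
```

Without the parentheses, `/^red|green$/` would mean "starts with red" or "ends with green", which is a looser test.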

Replacement

The sub function substitutes the replacement string for the match. This only applies to the first match.
The gsub function substitutes all matches.
The gensub function (gawk only) substitutes in a similar way to sub and gsub, but with extra functionality: it returns the modified string instead of changing the target in place, and the replacement can reference parenthesized groups as \1 through \9.
The & character in the replacement string references the matched text. To replace the match with a literal & character, escape it as \& (written "\\&" inside a string constant).
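
To make the difference between sub and gsub concrete, and to show both uses of &, here is a minimal sketch on the single word "banana". The input word and the variable names are assumptions chosen for illustration. When the third argument is left out, both functions operate on $0.

```shell
# sub replaces only the first match of /an/.
first=$(echo "banana" | awk '{ sub(/an/, "AN"); print }')

# gsub replaces every match.
all=$(echo "banana" | awk '{ gsub(/an/, "AN"); print }')

# & in the replacement stands for the matched text.
wrapped=$(echo "banana" | awk '{ gsub(/an/, "[&]"); print }')

# "\\&" in a string constant yields a literal ampersand.
literal=$(echo "banana" | awk '{ gsub(/an/, "\\&"); print }')

echo "$first"     # bANana
echo "$all"       # bANANa
echo "$wrapped"   # b[an][an]a
echo "$literal"   # b&&a
```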

Example:

{ sub(/apple/, "nut", $1);
    print $1}

The output is:

name
nut
banana
strawberry
grape
nut
plum
kiwi
potato
pinenut

Another example:

{ sub(/.+(pp|rr)/, "test-&", $1);
    print $1}

This produces the following output:

name
test-apple
banana
test-strawberry
grape
test-apple
plum
kiwi
potato
test-pineapple

Resources

Regex 101 - Advanced online regex tester
RegExr - Simple online regex tester
Grymoire's Awk Tutorial - Great Resource for understanding Awk
GNU Awk's Regex User Guide - Official Gawk user guide

Comments

Comment #1 posted on 2016-12-15 01:00:28 by Clinton Roy

Lots of useful info, great notes as well :)

There were a few times where the plosive Ps made it hard to listen to. What recording setup are you using?

Comment #2 posted on 2016-12-16 00:15:54 by Mr. Young

:re Lots of useful info

Yes I know. I don't always use that Plantronics USB headset because of that reason, but it does the best at reducing background noise. I have to remember to position it correctly and do some tests before recording.

Comment #3 posted on 2017-12-11 10:43:02 by ZZ

GNU Awk part 5

PLEASE do something about your sound quality. It is just painful to listen to constant pops, clicks, squeaks, booms... etc...

Comment #4 posted on 2017-12-11 15:49:16 by Ken Fallon

Re: Audio

Hi ZZ,

I had a listen to this show again, and the content came through loud and clear. Sure there were some artifacts in this show, but if you listen to other shows from b-yeezi, you'll see that this is not typical of his setup.

We all have a "bad audio day" but I would prefer to get shows that are imperfect, over not getting perfect shows. "Our golden rule is Any audio is better than no audio."

Thanks for listening, and taking the time to comment. We are always interested in hearing from our listeners. Perhaps you could do a show and tell us your tech story, or any other story you like "as long as it's of interest to hackers".

Ken.
