Mastering AWK: A Comprehensive Guide with Advanced Examples

By gerald, 3 April, 2023


Introduction

AWK is a command-line tool and small scripting language for processing text files on Unix/Linux operating systems. It reads input line by line, splits each line into fields, and lets you match patterns and transform data with very little code. In this blog post, we will explore the basics of AWK and some advanced examples that demonstrate its power.

AWK Anatomy

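AWK scripts are composed of patterns and actions. A pattern is a condition that is tested against each line of input, and an action is the set of commands that are executed when the pattern matches. Either part may be omitted: without a pattern, the action runs for every line, and without an action, each matching line is printed. The basic syntax for an AWK command is: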

awk 'pattern { action }' input_file

Where pattern is the condition to match, { action } is the set of commands to execute, and input_file is the file to be processed.
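To make the pattern/action split concrete, here is a minimal sketch. The sample data (names and scores) and the variable names are made up for illustration, and the input is piped in with printf instead of being read from a file:

```shell
# Hypothetical sample input: a name and a score on each line.
data='alice 12
bob 7
carol 30'

# Pattern only: lines whose second field is greater than 10 are printed as-is.
high=$(printf '%s\n' "$data" | awk '$2 > 10')

# Pattern plus action: for lines starting with "b", print only the first field.
b_names=$(printf '%s\n' "$data" | awk '/^b/ { print $1 }')

printf '%s\n' "$high" "$b_names"
```

The first command should print the alice and carol lines, and the second should print only bob.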

AWK comes with a number of built-in variables that can be used in AWK scripts. For example:

NR: The number of the current line (record), counted from the start of the input.
NF: The number of fields in the current line.
$0: The entire current line.
$1, $2, $3, ...: The individual fields of the current line.
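The variables above can be seen together in a short sketch. The input lines are invented for the example, and $NF (the last field, found by using NF as a field index) is an addition that is not covered in the list above:

```shell
# Hypothetical input: two lines with a different number of fields.
data='one two three
four five'

# For each line, print its number, its field count, and its last field.
# $NF uses the field count as an index, so it refers to the last field.
out=$(printf '%s\n' "$data" | awk '{ print NR ": " NF " fields, last = " $NF }')

printf '%s\n' "$out"
```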

BEGIN and END

In addition to patterns that are matched against each line of input, AWK provides two special patterns, BEGIN and END. The action attached to BEGIN runs once before the first line of input is read, and the action attached to END runs once after the last line has been processed.

The BEGIN pattern is executed before the first line of input is processed. This is useful for initializing variables, setting up counters, or performing any other operations that need to be done before the main processing begins. The syntax for using the BEGIN pattern is:

awk 'BEGIN { action } pattern { action } END { action }' input_file

Where BEGIN { action } is the set of commands to be executed before the first line of input is processed.

The END pattern is executed after the last line of input is processed. This is useful for printing out final results, displaying summary statistics, or performing any other operations that need to be done after the main processing is finished. The syntax for using the END pattern is:

awk 'BEGIN { action } pattern { action } END { action }' input_file

Where END { action } is the set of commands to be executed after the last line of input is processed.
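As a rough illustration of how the three parts fit together, the sketch below prints a heading in BEGIN, accumulates a total in the main block, and reports a summary in END. The numbers and the variable name total are arbitrary choices for the example:

```shell
# Hypothetical input: one number per line.
data='3
4
5'

out=$(printf '%s\n' "$data" | awk '
BEGIN { print "values:"; total = 0 }            # runs once, before any input
      { total += $1; print "  " $1 }            # runs for every input line
END   { print "total = " total ", mean = " (total / NR) }   # runs once, at the end
')

printf '%s\n' "$out"
```

In END, NR still holds the number of lines read, which is why it can serve as the divisor. A real script would probably guard against empty input, where NR is 0.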

AWK Field Separators

AWK uses whitespace (spaces, tabs, and newlines) as the default field separator. However, it also allows you to specify a custom field separator using the -F option. For example, to use a comma as the field separator, you can use:

awk -F',' '{ print $1 }' input_file

This will print the first field of each line of the file, using a comma as the field separator.

In awk, FS stands for Field Separator. It is a built-in variable that specifies the character or regular expression that separates fields in each line of input. Its default value is a single space, which awk treats specially: fields are split on runs of spaces and tabs, and leading and trailing whitespace is ignored.

You can change the value of FS to specify a different field separator for your input. For example, if your input file uses a comma as the field separator, you can set FS to a comma using the following command:

awk 'BEGIN { FS = "," } { print $1 }' file.txt

In this example, the BEGIN block sets the FS variable to a comma. This means that when awk reads each line of file.txt, it will use a comma as the field separator. The $1 in the second block refers to the first field in each line, which is printed to the console.
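The -F option and an FS assignment in BEGIN are two ways of doing the same thing. The following sketch, using a made-up two-line CSV sample, runs both and should produce identical output:

```shell
# Hypothetical CSV input with a header row.
data='id,name
1,alice'

# Separator given on the command line.
with_flag=$(printf '%s\n' "$data" | awk -F',' '{ print $2 }')

# Separator set inside the script, before any input is read.
with_begin=$(printf '%s\n' "$data" | awk 'BEGIN { FS = "," } { print $2 }')

printf '%s\n' "$with_flag" "$with_begin"
```

Setting FS in BEGIN keeps the separator inside the script itself, which can be convenient when the awk program is stored in its own file.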

You can also set FS to a regular expression to use multiple characters as the field separator. For example, the following command sets FS to a regular expression that matches one or more spaces or tabs:

awk 'BEGIN { FS = "[ \t]+" } { print $1 }' file.txt

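In this example, the regular expression [ \t]+ matches one or more spaces or tabs, so awk treats any run of spaces and tabs as a single separator. Unlike the default setting, a regular-expression separator does not strip leading whitespace, so a line that begins with spaces or tabs has an empty first field.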
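One practical difference between the default separator and the [ \t]+ regular expression shows up on lines that start with whitespace. The sketch below uses an invented line with two leading spaces and wraps $1 in brackets so that an empty field is visible:

```shell
# Hypothetical input line with leading spaces.
line='  x   y'

# Default FS: leading whitespace is ignored, so the first field is "x".
default_out=$(printf '%s\n' "$line" | awk '{ print NF " [" $1 "]" }')

# Regex FS: the leading spaces count as a separator, so the first field is empty.
regex_out=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ \t]+" } { print NF " [" $1 "]" }')

printf '%s\n' "$default_out" "$regex_out"
```

With the default separator the line has two fields, while the regular expression yields three, the first of which is empty.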

AWK If/Else Statement

In awk, the if/else statement is used to execute different code based on a condition. The general syntax of an if/else statement in awk is as follows:

if (condition) {
    # code to execute if condition is true
} else {
    # code to execute if condition is false
}

The condition is an expression that evaluates to either true or false. If the condition is true, the code inside the first set of curly braces is executed. If the condition is false, the code inside the second set of curly braces is executed.

Here is an example of an if/else statement in awk:

awk '{ if ($1 > 10) { print "The first field is greater than 10"; } else { print "The first field is less than or equal to 10"; } }' file.txt

This code reads from file.txt and checks if the first field in each line is greater than 10. If it is, it prints "The first field is greater than 10". If it is not, it prints "The first field is less than or equal to 10".
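The same check can be tried without a file by piping a few values into awk. In this sketch the three numbers are sample data, and the messages are shortened and prefixed with the value so that each output line can be traced back to its input:

```shell
# Hypothetical input: one number per line, including the boundary value 10.
data='5
10
42'

out=$(printf '%s\n' "$data" | awk '{
    if ($1 > 10) {
        print $1 ": greater than 10"
    } else {
        print $1 ": less than or equal to 10"
    }
}')

printf '%s\n' "$out"
```

Spreading the script over several lines inside the single quotes changes nothing for awk, and tends to be easier to read once a condition has more than one branch.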

Basic AWK Examples

Print each line of a file:

awk '{ print }' input_file

Print the first field of each line of a file:

awk '{ print $1 }' input_file

Print the number of lines in a file:

awk 'END { print NR }' input_file

Print the sum of the second column of a CSV file:

awk -F, '{ sum += $2 } END { print sum }' input_file
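CSV files often start with a header row. A header such as "price" is not a number and adds 0 to the sum, so the command above still gives the right total, but skipping the row explicitly makes the intent clearer. A small sketch, assuming a made-up item/price file with one header line:

```shell
# Hypothetical CSV input with a header row.
data='item,price
pen,2
book,10
bag,30'

# NR > 1 is a pattern that skips the first line, so only data rows are summed.
sum=$(printf '%s\n' "$data" | awk -F, 'NR > 1 { sum += $2 } END { print sum }')

# Line count, for comparison: the header is included here.
lines=$(printf '%s\n' "$data" | awk 'END { print NR }')

printf 'sum=%s lines=%s\n' "$sum" "$lines"
```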

Advanced AWK Examples

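Find the most frequent word in a file (words are whitespace-separated fields, matching is case-sensitive, and ties are broken arbitrarily):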

awk '{ for (i=1; i<=NF; i++) { words[$i]++; } } END { max=0; for (w in words) { if (words[w] > max) { max = words[w]; max_word = w; } } print max_word; }' input_file
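The word-count one-liner is easier to follow when laid out over several lines. The sketch below is a variation rather than the same command: it lowercases each word with tolower so that "The" and "the" are counted together, and it prints the count next to the word. The sample text is invented:

```shell
# Hypothetical input text.
text='the cat and the dog
The end'

top=$(printf '%s\n' "$text" | awk '
{
    # Count every whitespace-separated field, ignoring case.
    for (i = 1; i <= NF; i++) words[tolower($i)]++
}
END {
    # Scan the counts and remember the largest one seen so far.
    for (w in words) if (words[w] > max) { max = words[w]; max_word = w }
    print max_word, max
}')

printf '%s\n' "$top"
```

Punctuation is still treated as part of a word here, so "dog." and "dog" would be counted separately.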

Extract the lines between two patterns (the lines matching the start and end patterns are included in the output):

awk '/start_pattern/,/end_pattern/' input_file
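A range pattern prints the delimiter lines as well. When only the lines in between are wanted, one common approach is to track the state with a flag variable. The sketch below shows both, using invented BEGIN_BLOCK and END_BLOCK markers and a flag called inside:

```shell
# Hypothetical input with a marked block in the middle.
data='header
BEGIN_BLOCK
line 1
line 2
END_BLOCK
footer'

# Range pattern: includes the two marker lines.
inclusive=$(printf '%s\n' "$data" | awk '/BEGIN_BLOCK/,/END_BLOCK/')

# Flag variable: the rule order ensures that neither marker line is printed.
inner=$(printf '%s\n' "$data" | awk '
/END_BLOCK/   { inside = 0 }
inside        { print }
/BEGIN_BLOCK/ { inside = 1 }
')

printf '%s\n' "$inclusive" "$inner"
```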

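Count the number of occurrences of a word in a file (grep -o also matches inside longer words, so 'word' is counted in 'wordy' as well; awk is used here only to strip the padding that some wc implementations add):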

grep -o 'word' input_file | wc -l | awk '{ print $1 }'
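The count can also be done in awk alone. The sketch below shows two variants on made-up input: one counts substring matches, relying on the fact that gsub returns the number of replacements it made, and the other counts only fields that are exactly equal to the word:

```shell
# Hypothetical input: "word" appears twice on its own and once inside "wordy".
data='word wordy
a word
none'

# Substring matches: replacing each match with itself ("&") leaves the text
# unchanged, and the return value of gsub is the number of matches.
substr_count=$(printf '%s\n' "$data" | awk '{ n += gsub(/word/, "&") } END { print n + 0 }')

# Whole-word matches: compare each whitespace-separated field to the word.
field_count=$(printf '%s\n' "$data" | awk '{ for (i = 1; i <= NF; i++) if ($i == "word") n++ } END { print n + 0 }')

printf 'substring=%s whole=%s\n' "$substr_count" "$field_count"
```

Adding 0 in the END block makes awk print 0 instead of an empty line when nothing matched.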

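Replace all occurrences of a word with another word (the result is written to standard output and the file itself is not modified; awk '{ print }' simply passes sed's output through):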

sed 's/old_word/new_word/g' input_file | awk '{ print }'
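The replacement can be done without sed by using awk's built-in gsub function, which substitutes every match in the current line. A minimal sketch on invented input:

```shell
# Hypothetical input: the first line contains the word twice.
data='old_word and old_word
nothing here'

# gsub edits the current line ($0) in place; print then outputs the result.
replaced=$(printf '%s\n' "$data" | awk '{ gsub(/old_word/, "new_word"); print }')

printf '%s\n' "$replaced"
```

As with the sed version, the result goes to standard output, so it has to be redirected to a new file if it should be kept.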

Conclusion

AWK is a powerful tool for text processing and data manipulation in Unix/Linux operating systems. In this blog post, we have explored its basics (patterns and actions, built-in variables, BEGIN and END, field separators, and if/else statements) along with some advanced examples that demonstrate its power. We have also seen how AWK can be used in conjunction with other Unix/Linux commands like grep, sed, and wc to perform more complex text processing tasks.