Analyzing Data Using Unix Tools
• Analyzing data using Unix command-line tools
is a powerful and efficient approach,
especially for handling large text-based
datasets.
• These tools allow you to process, filter,
transform, and summarize data directly from
the terminal, often eliminating the need for
more complex software.
Core Unix Tools for Data Analysis
• These foundational tools are typically pre-installed
on Unix-like systems and can be combined to
perform complex data manipulations:
• cat: Concatenates and displays file contents.
• head / tail: Displays the beginning or end of files,
useful for previewing data.
• grep: Searches for patterns using regular
expressions.
• cut: Extracts specific columns from structured data.
• sort: Sorts lines in files.
• uniq: Removes duplicate lines, often used after sort.
• wc: Counts lines, words, and characters.
• find: Searches for files in a directory hierarchy.
• awk: A versatile language for pattern scanning and processing.
• sed: Stream editor for filtering and transforming text.
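• A few quick sketches of these tools in isolation (assuming a hypothetical log file app.log):
head -n 5 app.log
– Previews the first five lines of the file.
grep 'ERROR' app.log | wc -l
– Counts how many lines contain the pattern ERROR.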
• By chaining these tools with pipes (|), you can create powerful
one-liners for data processing.
For example, to count unique entries in a specific column:
cut -d',' -f2 data.csv | sort | uniq -c | sort -nr
This command extracts the second column from a CSV file, counts
unique occurrences, and sorts them in descending order.
• cut -d',' -f2 data.csv:
– cut extracts specific fields from each line of a file.
– -d',' sets the delimiter to a comma, suitable for CSV files.
– -f2 selects the second field (column).
This command outputs the second column from data.csv.
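• Using the sample data.csv shown later in this section, this step alone produces:
cut -d',' -f2 data.csv
Output:
name
Alice
Bob
Alice
Charlie
Bob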
• sort:
– Sorts the extracted column alphabetically.
– Necessary for uniq to correctly identify duplicate lines, as it
only removes adjacent duplicates.
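• Piping the same sample through sort puts identical names on adjacent lines, which is exactly what uniq needs:
cut -d',' -f2 data.csv | sort
Output:
Alice
Alice
Bob
Bob
Charlie
name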
• uniq -c:
– uniq filters out repeated lines that are adjacent.
– -c prefixes each line with the number of times it occurred.
This step counts the occurrences of each unique entry in the
second column.
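• Continuing with the sample, uniq -c collapses each run of identical lines into one line prefixed with its count:
cut -d',' -f2 data.csv | sort | uniq -c
Output:
2 Alice
2 Bob
1 Charlie
1 name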
• sort -nr:
– sort again, this time with options:
• -n sorts numerically.
• -r reverses the sort order (descending).
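• Without -n, sort compares the counts character by character rather than numerically, so 9 would outrank 10 in descending order. A quick illustration with hypothetical counts:
printf '10 apples\n9 bananas\n' | sort -r
Output:
9 bananas
10 apples
printf '10 apples\n9 bananas\n' | sort -nr
Output:
10 apples
9 bananas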
Advanced Command-Line Tools
• Beyond the basic utilities, several specialized tools enhance
data analysis capabilities:
• jq: Processes JSON data, allowing for filtering, transformation,
and formatting.
• q: Executes SQL-like queries on CSV/TSV files directly from the
command line.
• dsq: Runs SQL queries on various data formats, including JSON
and Excel.
• csvkit: A suite of tools for converting and processing CSV files.
• gnuplot: Generates 2D and 3D plots from data, useful for
visualization.
These tools can be combined with standard Unix utilities to
create complex data processing pipelines, as the sketches below illustrate.
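• A few illustrative sketches (users.json and its name field are hypothetical; data.csv is the sample shown next):
jq -r '.[].name' users.json | sort | uniq -c | sort -nr
– Extracts the name field from every object in a JSON array, then counts and ranks the values.
q -H -d ',' "SELECT name, COUNT(*) FROM data.csv GROUP BY name"
– Runs an SQL aggregation directly on the CSV; -H reads column names from the header row, -d sets the delimiter.
csvcut -c name data.csv | tail -n +2 | sort | uniq -c
– Uses csvkit to select the name column, then standard utilities to count.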
data.csv
id,name,score
1,Alice,85
2,Bob,90
3,Alice,78
4,Charlie,92
5,Bob,88
cut -d',' -f2 data.csv | sort | uniq -c | sort -nr
Output:
2 Alice
2 Bob
1 Charlie
1 name
• Note: The line 1 name appears because the
header row is included. To exclude headers, you
can use:
tail -n +2 data.csv | cut -d',' -f2 | sort | uniq -c | sort -nr
• tail: Outputs the last part of files.
• -n +2: With a leading +, the argument is a starting line
number rather than a count from the end, so output begins
at line 2, skipping the header row.
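• Alternatively, awk can skip the header and extract the column in one step; a sketch equivalent to the pipeline above:
awk -F',' 'NR > 1 { print $2 }' data.csv | sort | uniq -c | sort -nr
– -F',' sets the field separator, NR > 1 skips the first (header) line, and print $2 emits the second column.
Output:
2 Alice
2 Bob
1 Charlie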