Analysing Data Using Unix Tools

• Analyzing data using Unix command-line tools is a powerful and efficient approach, especially for handling large text-based datasets.
• These tools allow you to process, filter, transform, and summarize data directly from the terminal, often eliminating the need for more complex software.
Core Unix Tools for Data Analysis

• These foundational tools are typically pre-installed on Unix-like systems and can be combined to perform complex data manipulations (a few standalone examples follow the list):
• cat: Concatenates and displays file contents.
• head / tail: Displays the beginning or end of files, useful for previewing data.
• grep: Searches for patterns using regular expressions.
• cut: Extracts specific columns from structured data.
• sort: Sorts lines in files.
• uniq: Removes duplicate lines, often used after sort.
• wc: Counts lines, words, and characters.
• find: Searches for files in a directory hierarchy.
• awk: A versatile language for pattern scanning and processing.
• sed: Stream editor for filtering and transforming text.
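For example, assuming a hypothetical plain-text log file named server.log (used here only for illustration), several of these tools can be run on their own:
head -n 5 server.log      # preview the first five lines of the file
grep "ERROR" server.log   # print every line containing the string ERROR
wc -l server.log          # count the number of lines in the file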
• By chaining these tools with pipes (|), you can create powerful one-liners for data processing.
For example, to count unique entries in a specific column:
cut -d',' -f2 data.csv | sort | uniq -c | sort -nr

This command extracts the second column from a CSV file, counts
unique occurrences, and sorts them in descending order.

• cut -d',' -f2 data.csv:
– cut extracts specific fields from each line of a file.
– -d',' sets the delimiter to a comma, suitable for CSV files.
– -f2 selects the second field (column).
This command outputs the second column from data.csv.
• sort:
– Sorts the extracted column alphabetically.
– Necessary for uniq to correctly identify duplicate lines, as it
only removes adjacent duplicates.
• uniq -c:
– uniq filters out repeated lines that are adjacent.
– -c prefixes each line with the number of times it occurred.
This step counts the occurrences of each unique entry in the
second column.
• sort -nr:
– sort again, this time with options:
• -n sorts numerically.
• -r reverses the sort order (descending).
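A roughly equivalent one-liner can be written in awk alone (a sketch assuming the same comma-delimited data.csv; the relative order of tied counts may differ from the sort-based pipeline):
awk -F',' '{count[$2]++} END {for (k in count) print count[k], k}' data.csv | sort -nr
Here awk builds the per-value counts in a single pass over the file, and sort -nr is only needed for the final descending ordering.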
Advanced Command-Line Tools

• Beyond the basic utilities, several specialized tools enhance data analysis capabilities:
• jq: Processes JSON data, allowing for filtering, transformation,
and formatting.
• q: Executes SQL-like queries on CSV/TSV files directly from the
command line.
• dsq: Runs SQL queries on various data formats, including JSON
and Excel.
• csvkit: A suite of tools for converting and processing CSV files.
• gnuplot: Generates 2D and 3D plots from data, useful for
visualization.
These tools can be combined with standard Unix utilities to
create complex data processing pipelines.
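For instance, a minimal jq sketch (assuming a hypothetical users.json file containing an array of objects, each with a name field) mirrors the CSV counting pipeline above:
jq -r '.[].name' users.json | sort | uniq -c | sort -nr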
data.csv
id,name,score
1,Alice,85
2,Bob,90
3,Alice,78
4,Charlie,92
5,Bob,88
cut -d',' -f2 data.csv | sort | uniq -c | sort -nr

Output:
2 Alice
2 Bob
1 Charlie
1 name
• Note: The line 1 name appears because the
header row is included. To exclude headers, you
can use:
tail -n +2 data.csv | cut -d',' -f2 | sort | uniq -c | sort -nr
• tail: Outputs the last part of files.
• -n +2: Specifies the starting line number for output. The +2 indicates that output should begin from line 2, which skips the header row.
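With the sample data.csv shown above, this header-free pipeline would produce (the relative order of the two tied counts can vary between sort implementations):
2 Alice
2 Bob
1 Charlie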
