10 Command-line Tools for Data Analysis in Linux _ Opensource.com
10 Command-line Tools for Data Analysis in Linux _ Opensource.com
com
Image by : Opensource.com
So you've landed on some data you want to analyze.
Where do you begin?
You can learn a lot about a dataset without ever leaving your
terminal.
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 1/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
I'm talking about the Linux command line. Just using some tools that you've
probably already got installed on your computer, you can learn a lot about a
dataset without ever leaving your terminal. Long-time Linux users will of course
laugh—they've been using many of these tools for years to parse logs and
understand configuration tools. But for the Linux newcomer, the revelation that
you've got a whole data analysis toolkit already at your fingertips can be a
welcomed surprise.
Most of these tools aren't strictly speaking limited to Linux, either. Most
hearken back to the days of Unix, and users of other Unix-like operating
systems likely have them installed already or can do so with ease. Many are a
part of the GNU Coreutils package, while a few are individually maintained,
and with some work, you can even use them on Windows.
So let's try out a few of the many simple open source tools for data analysis
and see how they work! If you'd like to follow along with these examples, go
ahead and download this sample data file, from GitHub, which is a CSV
(comma separated value) list of articles we published to Opensource.com in
January.
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 2/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
First, let's get started by getting a handle on the file. What's in it? What does its
format look like? You can use the cat command to display a file in the terminal,
but that's not going to do us much good if you're working with files more than a
few dozen lines.
Enter head and tail. Both are utilities for showing a specified number of lines
from the top or bottom of the file. If you don't specify the number of lines you
want to see, you'll get 10. Let's try it with our file.
$ tail -n 3 jan2017articles.csv
02 Jan 2017,Article,Scott Nesbitt,3 tips for effectively using wikis for documenta
02 Jan 2017,Article,Jen Wike Huger,The Opensource.com preview for January,0,/artic
02 Jan 2017,Poll,Jason Baker,What is your open source New Year's resolution?,1,/po
Looking at those last three lines, I can pick out a date, author name, title, and a
few other pieces of information immediately. But I don't know what every
column is. Let's look at the top of the file and see if it has headers to explain
what each column means:
$ head -n 1 jan2017articles.csv
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count
Okay, that all makes sense now. Looks like we've got a list of articles with the
date they were published, the type of content for each one, the author, title,
number of comments, relative URL, the tags each article has, and the word
count.
wc
That's great, but how big is this file? Are we talking about dozens of articles we
want to analyze, or hundreds, or even thousands? The wc command can help
with that. Short for "word count," wc can count the number of bytes,
characters, words, or lines in the file. In our case, we want to know the number
of lines.
$ wc -l jan2017articles.csv
93 jan2017articles.csv
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 3/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
And, there it is. 93 lines in this file; since we know the first row contained
headers, we can surmise that this is a list of 92 articles.
grep
Okay, now let's ask ourselves: Out of these 92 articles, how many of them
were about a security topic? For our purposes, let's say we're interested in
articles that mention security anywhere in the entry, whether in the title, the list
of tags, or somewhere else. The grep tool can help us with that. With grep,
you can search a file or other input for a particular pattern of
characters. grep is an incredibly powerful tool, thanks to the regular
expressions you can build to match very precise patterns. But for now, let's just
search for a simple string.
The format we used was grep, followed by the -i flag (which tells grep not to
be case sensitive), followed by the pattern we wanted to search for, and then
the file in which we were searching. It looks like we had four security-related
articles last month. But let's imagine we got a much longer list than we could
easily count. Using a pipe, we could combine grep with the wc command we
just learned about above, to get a count of the total lines mentioning security.
In this case, wc took the output of our grep command, and used it as its input,
without ever having to worry about saving it anywhere first. This is why piping
input and output, in particular when combined with a little shell scripting, makes
the terminal such a powerful tool for data analysis.
tr
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 4/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
A CSV file is a pretty helpful format for many analysis scenarios, but what if
you need to convert the file to a different format for use in a different
application? Maybe you need tab separators instead of commas, or maybe you
want to change them to some HTML so that you can use the data output in a
table. The tr command can help you with that, by translating from one type of
character to another. Like the other examples, you can also pipe input and
output to this command.
Let's try another multi-part example, by creating a TSV (tab separated values)
file with just the articles that published on January 20.
What's going on here? First, we searched for the date in question, using grep.
We piped this output to the tr command, which we used to replace the
commas with tabs (denoted with '\t'). But where did it go? Well, the > character
redirected the output to our new file instead of the screen. All of this work in
one command sequence. We can then verify that the jan20only.tsv file contains
the data that we expected.
$ cat jan20only.tsv
20 Jan 2017 Article Kushal Das 5 ways to expand your project's contributo
20 Jan 2017 Article D Ruth Bavousett How to write web apps in R with Sh
20 Jan 2017 Article Jason Baker "Top 5: Shell scripting the Cinnamon Linu
20 Jan 2017 Article Tracy Miranda How is your community promoting diversity?
sort
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 5/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
This is another long example, so let's break down what's happening. FIrst,
we're using the sort command to sort by the number of words. The -nr option
tells sortto do a numeric sort, and to return the results in reverse order (largest
to smallest). The next -t$'\t' tells sort that the delimiter is the tab ('\t'). (You can
read why you need the dollar sign here; in short, it's to tell the shell that this is
a string that needs processing to turn the \n into an actual tab). The -k8 portion
of the command tells sort to use the eighth column, which is the column for
word count in our example.
Finally, the whole output is piped to head with instructions just to show the top
line, which is our result, the article from this file with the highest word count.
sed
You might want to select specific lines of a file. sed, short for stream editor, is
one way to do this. What if you wanted to combine multiple files that all had
headers? You would only want one set of headers to appear for the whole file,
so you would need a way to scrape out the extras. Or what if you wanted to
grab only a particular range of lines? sed is your tool. It's also a great way to
do a bulk find and replace a file.
Let's create a new file with no headers from our list of articles, suitable for
combining with other files (if, for example, I had a different file for every month
and wanted to put them together).
The '1 d' option tells sed to delete the first line. sed is far more powerful than
this, and I'd recommend reading up further on its replacement powers.
cut
In this simple example, we told cut with -d',' that this is a comma-delimited file,
that we wanted the third column (-f3), and to send the output to a new file
called authors.txt.
uniq
That last example left us with a list of authors, but, how many unique authors
are on that list? How many articles did each author write? Enter uniq.
With uniq, you can easily find out. Let's sort the file, find uniques, then outputs
a file that has a count of the number of articles written by each author.
Glimpsing at the file, we can now see how many articles each author had. Let's
just look at the last three lines to be sure it worked.
awk
Let's look at one more tool in our command-line data analysis toolbelt
today, awk. awk is another one of those tools that I'm going to give far too little
credit to; it's really a powerhouse worth exploring on its own. It is another great
tool for replacement, but also much more. Let's go back to the TSV file we
made earlier of just the January 20 articles, and use that to create a new list of
just the authors of those articles, along with the number of words each author
wrote.
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 7/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
What's going on here? The -F "\t" we pass to awk simply tells it that we're
working with tab-separated data. Within the braces, we're actually
telling awk to execute just a little bit of code. We're telling it to print the third
column with $3, and then the last column with $NF (the "number of fields"), and
place two spaces between them to make it a little more legible.
You could choose to load the files into a database tool and work with the data
there. But is that the right choice? It might be overkill. What if you're just
examining the data to get a sense of what it contains? With these simple tools
and a little scripting to recurse through a directory, you can work with large
amounts of data with ease. Professionals and amateurs alike who work with
data on a regular basis would do well to spend some time learning these and
other command line data analysis tools.
This introduction only scratches the surface of each of these tools. They are
far, far more powerful than these simple examples would let on, which is why
volumes of books have been written about most. I hope you'll take the time to
read a man page, do a little work in the search engine, or pick up a book and
learn more about this interesting suite of tools that you have ready at your
fingertips.
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 8/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
Jason Baker - I use technology to make the world more open. Linux desktop
enthusiast. Map/geospatial nerd. Raspberry Pi tinkerer. Data analysis and
visualization geek. Occasional coder. Cloud nativist. Civic tech and open
government booster.
More about me
Recommended reading
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 9/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
11 Comments
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 10/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
perl gives you all the power of sed and awk and grep combined and extended. It works very
well in a pipeline.
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 11/12
2/5/2020 10 command-line tools for data analysis in Linux | Opensource.com
a url or an IP address, you can summarize the logs and get a clearer picture. I have found it
useful when I see a script hit a page multiple times and want a list of IP's for that one URL.
https://opensource.com/article/17/2/command-line-tools-data-analysis-linux?intcmp=7016000000127cYAAQ 12/12