This article was written by Brandon Burton, aka @solarce.
As system administrators, we are often faced with tasks that need to run against a number of things: files, users, servers, and so on. In most cases, we resort to a loop of some sort, often just a for loop in the shell. The drawback to this seemingly obvious approach is that it is serial, so the time it takes grows linearly with the number of things we are running the task against.
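For example, a completely serial approach to compressing a directory of log files might look like this (the path here is just for illustration):
for f in /var/log/myapp/*.log; do
    gzip "$f"
done
Each gzip runs only after the previous one finishes, no matter how many CPUs the machine has.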
I am here to tell you there is a better way: the path of going parallel!
Tools for your shell scripts
The first place to start is with tools that can replace the for loop you usually use and add some parallelism to the task you are running. The two most well-known tools available are xargs and GNU parallel.
xargs is a tool used to build and execute command lines from standard input, but one of its great features is that it can execute those command lines in parallel via its -P argument. A quick example of this is:
seq 10 20 | xargs -n 1 -P 5 sleep
This will send a sequence of numbers to xargs and divide it into chunks of one argument at a time (-n 1), forking off 5 parallel processes (-P 5) to execute each. You can see it in action:
$ ps -eaf | grep sleep
baron 5830 5482 0 11:12 pts/2 00:00:00 xargs -n 1 -P 5 sleep
baron 5831 5830 0 11:12 pts/2 00:00:00 sleep 10
baron 5832 5830 0 11:12 pts/2 00:00:00 sleep 11
baron 5833 5830 0 11:12 pts/2 00:00:00 sleep 12
baron 5834 5830 0 11:12 pts/2 00:00:00 sleep 13
baron 5835 5830 0 11:12 pts/2 00:00:00 sleep 14
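As a more practical, if hypothetical, example, the same log-compression task from the serial loop above could be run four files at a time:
find /var/log/myapp -name '*.log' -print0 | xargs -0 -n 1 -P 4 gzip
xargs keeps four gzip processes running until the list of files is exhausted.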
Some further reading on xargs is available at:
- http://www.semicomplete.com/blog/articles/week-of-unix-tools/day-5-xargs.html
- http://www.xaprb.com/blog/2009/05/01/an-easy-way-to-run-many-tasks-in-parallel/
- http://www.linuxask.com/questions/run-tasks-in-parallel-with-xargs
- http://stackoverflow.com/questions/3321738/shell-scripting-using-xargs-to-execute-parallel-instances-of-a-shell-function
GNU parallel is a lesser-known tool, but it has been gaining popularity recently. It is written with a specific focus on executing processes in parallel. From the home page description: "GNU parallel is a shell tool for executing jobs in parallel locally or using remote machines. A job is typically a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables."
A quick example of using parallel is:
% cat offlineimap-cron5min.plist | parallel --max-procs=8 --group 'echo "Thing: {}"'
Thing: <string>offlineimap-cron5min</string>
Thing: <key>Label</key>
Thing: <string>solarce</string>
Thing: <key>UserName</key>
Thing: <dict>
Thing: <key>ProgramArguments</key>
Thing: <string>admin</string>
...
This plist file is XML, but the output of parallel above is unordered because each line of input is processed by one of the 8 workers, and output is emitted (--group) as each worker finishes its input (line), not necessarily in the order of the input.
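The same hypothetical log-compression task from the xargs example could also be written with parallel, which will keep at most four jobs running at once:
ls /var/log/myapp/*.log | parallel --max-procs=4 gzip {}
Unlike the plist example above, this is a case where output ordering doesn't matter at all, which is exactly where this kind of parallelism shines.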
Some further reading on parallel is available at:
- http://www.gnu.org/software/parallel/man.html
- http://psung.blogspot.com/2010/08/gnu-parallel.html
- http://unethicalblogger.com/posts/2010/11/gnuparallelchangedmy_life
Additionally, there is a great screencast on it.
Tools for multiple machines
The next step in our journey is to progress from just running parallel processes to running our tasks in parallel on multiple machines.
A common approach to this is to use something like the following:
for server in $(cat list_of_servers.txt); do
ssh $server command argument
done
While this approach is fine for small tasks on a small number of machines, the drawback is that it executes serially: the total time the job takes is roughly the time the task takes on one machine multiplied by the number of machines you are running it on. That could take a while, so you'd better get a Snickers.
Fortunately, people have recognized this problem, and a number of tools have been developed to solve it by running your SSH commands in parallel.
These include pssh and sshpt, among others.
I'll illustrate how these work with a few examples.
First, here is a basic example of pssh (on Ubuntu the package is 'pssh,' but the command is 'parallel-ssh'):
# cat hosts-file
p1
p2
# pssh -h hosts-file -l ben date
[1] 21:12:55 [SUCCESS] p2 22
[2] 21:12:55 [SUCCESS] p1 22
# pssh -h hosts-file -l ben -P date
p2: Thu Oct 16 21:14:02 EST 2008
p2: [1] 21:13:00 [SUCCESS] p2 22
p1: Thu Sep 25 15:44:36 EST 2008
p1: [2] 21:13:00 [SUCCESS] p1 22
Second, here is an example of using sshpt:
./sshpt -f ../testhosts.txt "echo foo" "echo bar"
Username: myuser
Password:
"devhost","SUCCESS","2009-02-20 16:20:10.997818","0: echo foo
1: echo bar","0: foo
1: bar"
"prodhost","SUCCESS","2009-02-20 16:20:11.990142","0: echo foo
1: echo bar","0: foo
1: bar"
As you can see, these tools simplify and parallelize your SSH commands, decreasing the execution time that your tasks take and improving your efficiency.
Some further reading on this includes:
- What is a good modern parallel SSH tool?
- Linux - Running The Same Command on Many Machines at Once
- Effective adhoc commands in clusters
Smarter tools for multiple machines
Once you've adopted the mindset that your tasks can be done in parallel and you've started using one of the parallel SSH tools for executing ad-hoc commands, you may find yourself wanting to execute tasks in parallel in a more repeatable, extensible, and organized fashion.
If you were thinking this, you are in luck. There is a class of tools commonly classified as Command and Control or Orchestration tools. These include MCollective, Func, Fabric, and Capistrano.
These tools are built to be frameworks within which you can build repeatable systems automation. MCollective and Capistrano are written in Ruby, while Func and Fabric are written in Python, so you have options for whichever language you prefer. Each has strengths and weaknesses. I'm a big fan of MCollective in particular, because it has the strength of being closely integrated with Puppet, and its primary author, R.I. Pienaar, has a vision for it to become an extremely versatile tool for the kinds of needs that fall within the realm of Command and Control or Orchestration.
As it's always easiest to grasp a tool by seeing it in action, here are basic examples of using each tool:
mcollective
% mc-package install zsh
* [ ============================================================> ] 3 / 3
web2.my.net version = zsh-4.2.6-3.el5
web3.my.net version = zsh-4.2.6-3.el5
web1.my.net version = zsh-4.2.6-3.el5
---- package agent summary ----
Nodes: 3 / 3
Versions: 3 * 4.2.6-3.el5
Elapsed Time: 16.33 s
func
% func client15.example.com call hardware info
{'client15.example.com': {'bogomips': '7187.63',
'cpuModel': 'Intel(R) Pentium(R) 4 CPU 3.60GHz',
'cpuSpeed': '3590',
'cpuVendor': 'GenuineIntel',
'defaultRunlevel': '3',
...
'systemSwap': '8191',
'systemVendor': 'Dell Inc.'}}
fabric
% fab -H localhost,linuxbox host_type
[localhost] run: uname -s
[localhost] out: Darwin
[linuxbox] run: uname -s
[linuxbox] out: Linux
Done.
Disconnecting from localhost... done.
Disconnecting from linuxbox... done.
capistrano
# cap invoke COMMAND="yum -y install zsh"
* executing `invoke'
* executing "yum -y install zsh"
servers: ["web1", "web2", "web3"]
[web2] executing command
[web1] executing command
[web3] executing command
[out :: web3] Nothing to do
[out :: web2] Nothing to do
[out :: web1] Complete!
command finished
As you can see from these brief examples, each of these tools accomplishes similar things, but each has its own ecosystem, available plugins, and strengths and weaknesses, a full description of which is beyond the scope of this post.
Taking your own script(s) multithreaded
The kernel of this article was a post I recently wrote for my employer's blog, Taking your script multithreaded, in which I detailed how I wrote a Python script to make an rsync job multithreaded and cut the execution time of a task from approximately 6 hours down to 45 minutes.
I've created a git repo out of the script, so you can take my code and poke at it. If you end up using the script and make improvements, feel free to send me patches!
Thanks to David Grieser, there is also a Ruby port of the script up on GitHub.
These are two good examples of how you can easily implement a multithreaded version of your own scripts to help parallelize your tasks.
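If you'd like to see the general shape of the approach before digging into the repo, here is a minimal sketch of the pattern: a queue of work items and a pool of worker threads, each launching its own rsync. Note that this is not the actual script from the repo; the paths, destination host, and thread count are made up for illustration.
import subprocess
import threading
from queue import Queue, Empty

NUM_WORKERS = 8  # how many rsyncs to keep in flight at once

# Each work item is a (source, destination) pair to hand to rsync.
# These paths and the destination host are purely illustrative.
jobs = Queue()
for subdir in ("photos", "videos", "music", "documents"):
    jobs.put(("/srv/data/%s/" % subdir, "backup01:/srv/data/%s/" % subdir))

def worker():
    while True:
        try:
            src, dest = jobs.get_nowait()
        except Empty:
            return  # nothing left to do, this worker exits
        # One rsync per work item; with NUM_WORKERS threads running,
        # up to NUM_WORKERS rsyncs execute in parallel.
        subprocess.call(["rsync", "-a", "--delete", src, dest])
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Because each thread spends nearly all of its time waiting on an external rsync process, Python's GIL isn't a bottleneck here; the wall-clock win comes from keeping several transfers going at once.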
Conclusion
There are clearly many steps you can take along the path to going parallel. I've tried to highlight how you can begin by using tools to execute commands in a more parallel fashion, progress to tools that help you execute ad-hoc and then repeatable tasks across many hosts, and finally, I've given some examples of how to make your own scripts more parallel.