This article was written by Brandon Burton, aka @solarce.
As system administrators, we are often faced with tasks that need to run against a number of things: files, users, servers, and so on. In most cases, we resort to a loop of some sort, often just a for loop in the shell. The drawback to this seemingly obvious approach is that it is serial, so the time it takes grows linearly with the number of things we are running the task against.
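For example, a completely serial approach to compressing a directory of log files might look like this (the path here is just for illustration):
for f in /var/log/myapp/*.log; do
    gzip "$f"
done
Each gzip runs only after the previous one finishes, no matter how many CPUs the machine has.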
I am here to tell you there is a better way: the path of going parallel!
Tools for your shell scripts
The first place to start is with tools that can replace the for loop you usually use and add some parallelism to the task you are running. The two most well-known tools available are xargs and GNU parallel.
xargs is a tool used to build and execute command lines from standard input, but one of its great features is that it can execute those command lines in parallel via its -P argument. A quick example of this is:
seq 10 20 | xargs -n 1 -P 5 sleep
This will send a sequence of numbers to xargs and divide it into chunks of one argument at a time (-n 1), forking off 5 parallel processes (-P 5) to execute each. You can see it in action:
$ ps -eaf | grep sleep
baron 5830 5482 0 11:12 pts/2 00:00:00 xargs -n 1 -P 5 sleep
baron 5831 5830 0 11:12 pts/2 00:00:00 sleep 10
baron 5832 5830 0 11:12 pts/2 00:00:00 sleep 11
baron 5833 5830 0 11:12 pts/2 00:00:00 sleep 12
baron 5834 5830 0 11:12 pts/2 00:00:00 sleep 13
baron 5835 5830 0 11:12 pts/2 00:00:00 sleep 14
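As a more practical, if hypothetical, example, the same log-compression task from the serial loop above could be run four files at a time:
find /var/log/myapp -name '*.log' -print0 | xargs -0 -n 1 -P 4 gzip
xargs keeps four gzip processes running until the list of files is exhausted.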
Some further reading on xargs is available at:
- http://www.semicomplete.com/blog/articles/week-of-unix-tools/day-5-xargs.html
- http://www.xaprb.com/blog/2009/05/01/an-easy-way-to-run-many-tasks-in-parallel/
- http://www.linuxask.com/questions/run-tasks-in-parallel-with-xargs
- http://stackoverflow.com/questions/3321738/shell-scripting-using-xargs-to-execute-parallel-instances-of-a-shell-function
GNU parallel is a lesser-known tool, but it has been gaining popularity recently. It is written with a specific focus on executing processes in parallel. From the home page description: "GNU parallel is a shell tool for executing jobs in parallel locally or using remote machines. A job is typically a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables."
A quick example of using parallel is:
% cat offlineimap-cron5min.plist | parallel --max-procs=8 --group 'echo "Thing: {}"'
Thing: <string>offlineimap-cron5min</string>
Thing: <key>Label</key>
Thing: <string>solarce</string>
Thing: <key>UserName</key>
Thing: <dict>
Thing: <key>ProgramArguments</key>
Thing: <string>admin</string>
...
This plist file is XML, but the output of parallel above is unordered because each line of input is processed by one of the 8 workers, and output is emitted (--group) as each worker finishes its input (line), not necessarily in the order of the input.
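The same hypothetical log-compression task from the xargs example could also be written with parallel, which will keep at most four jobs running at once:
ls /var/log/myapp/*.log | parallel --max-procs=4 gzip {}
Unlike the plist example above, this is a case where output ordering doesn't matter at all, which is exactly where this kind of parallelism shines.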
Some further reading on parallel is available at:
- http://www.gnu.org/software/parallel/man.html
- http://psung.blogspot.com/2010/08/gnu-parallel.html
- http://unethicalblogger.com/posts/2010/11/gnuparallelchangedmy_life
Additionally, there is a great screencast on it.
Tools for multiple machines
The next step in our journey is to progress from just running parallel processes to running our tasks in parallel on multiple machines.
A common approach to this is to use something like the following:
for server in $(cat list_of_servers.txt); do
ssh $server command argument
done
While this approach is fine for small tasks on a small number of machines, the drawback is that it executes serially: the total time the job takes is roughly the time the task takes on one machine multiplied by the number of machines you are running it on. That could take a while, so you'd better get a Snickers.
Fortunately, people have recognized this problem, and a number of tools have been developed to solve it by running your SSH commands in parallel.
These include pssh and sshpt, among others.
I'll illustrate how these work with a few examples.
First, here is a basic example of pssh (on Ubuntu the package is 'pssh,' but the command is 'parallel-ssh'):
# cat hosts-file
p1
p2
# pssh -h hosts-file -l ben date
[1] 21:12:55 [SUCCESS] p2 22
[2] 21:12:55 [SUCCESS] p1 22
# pssh -h hosts-file -l ben -P date
p2: Thu Oct 16 21:14:02 EST 2008
p2: [1] 21:13:00 [SUCCESS] p2 22
p1: Thu Sep 25 15:44:36 EST 2008
p1: [2] 21:13:00 [SUCCESS] p1 22
Second, here is an example of using sshpt:
./sshpt -f ../testhosts.txt "echo foo" "echo bar"
Username: myuser
Password:
"devhost","SUCCESS","2009-02-20 16:20:10.997818","0: echo foo
1: echo bar","0: foo
1: bar"
"prodhost","SUCCESS","2009-02-20 16:20:11.990142","0: echo foo
1: echo bar","0: foo
1: bar"
As you can see, these tools simplify and parallelize your SSH commands, decreasing the execution time that your tasks take and improving your efficiency.
Some further reading on this includes:
- What is a good modern parallel SSH tool?
- Linux - Running The Same Command on Many Machines at Once
- Effective adhoc commands in clusters
Smarter tools for multiple machines
Once you've adopted the mindset that your tasks can be done in parallel and you've started using one of the parallel SSH tools for executing ad-hoc commands, you may find yourself wanting to execute tasks in parallel in a more repeatable, extensible, and organized fashion.
If you were thinking this, you are in luck. There is a class of tools commonly classified as Command and Control or Orchestration tools. These include MCollective, Func, Fabric, and Capistrano.
These tools are built to be frameworks within which you can build repeatable systems automation. MCollective and Capistrano are written in Ruby, while Func and Fabric are written in Python, so you have options for whichever language you prefer. Each has strengths and weaknesses. I'm a big fan of MCollective in particular, because it has the strength of being closely integrated with Puppet, and its primary author, R.I. Pienaar, has a vision for it to become an extremely versatile tool for the kinds of needs that fall within the realm of Command and Control or Orchestration.
As it's always easiest to grasp a tool by seeing it in action, here are basic examples of using each tool:
mcollective
% mc-package install zsh
* [ ============================================================> ] 3 / 3
web2.my.net version = zsh-4.2.6-3.el5
web3.my.net version = zsh-4.2.6-3.el5
web1.my.net version = zsh-4.2.6-3.el5
---- package agent summary ----
Nodes: 3 / 3
Versions: 3 * 4.2.6-3.el5
Elapsed Time: 16.33 s
func
% func client15.example.com call hardware info
{'client15.example.com': {'bogomips': '7187.63',
'cpuModel': 'Intel(R) Pentium(R) 4 CPU 3.60GHz',
'cpuSpeed': '3590',
'cpuVendor': 'GenuineIntel',
'defaultRunlevel': '3',
...
'systemSwap': '8191',
'systemVendor': 'Dell Inc.'}}
fabric
% fab -H localhost,linuxbox host_type
[localhost] run: uname -s
[localhost] out: Darwin
[linuxbox] run: uname -s
[linuxbox] out: Linux
Done.
Disconnecting from localhost... done.
Disconnecting from linuxbox... done.
capistrano
# cap invoke COMMAND="yum -y install zsh"
* executing `invoke'
* executing "yum -y install zsh"
servers: ["web1", "web2", "web3"]
[web2] executing command
[web1] executing command
[web3] executing command
[out :: web3] Nothing to do
[out :: web2] Nothing to do
[out :: web1] Complete!
command finished
As you can see from these brief examples, each of these tools accomplishes similar things, but each has its own ecosystem, available plugins, and strengths and weaknesses, a full description of which is beyond the scope of this post.
Taking your own script(s) multithreaded
The kernel of this article was a post I recently wrote for my employer's blog, Taking your script multithreaded, in which I detailed how I wrote a Python script to make an rsync job multithreaded and cut the execution time of a task from approximately 6 hours down to 45 minutes.
I've created a git repo out of the script, so you can take my code and poke at it. If you end up using the script and make improvements, feel free to send me patches!
Thanks to David Grieser, there is also a Ruby port of the script up on GitHub.
These are two good examples of how you can easily implement a multithreaded version of your own scripts to help parallelize your tasks.
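If you'd like to see the general shape of the approach before digging into the repo, here is a minimal sketch of the pattern: a queue of work items and a pool of worker threads, each launching its own rsync. Note that this is not the actual script from the repo; the paths, destination host, and thread count are made up for illustration.
import subprocess
import threading
from queue import Queue, Empty

NUM_WORKERS = 8  # how many rsyncs to keep in flight at once

# Each work item is a (source, destination) pair to hand to rsync.
# These paths and the destination host are purely illustrative.
jobs = Queue()
for subdir in ("photos", "videos", "music", "documents"):
    jobs.put(("/srv/data/%s/" % subdir, "backup01:/srv/data/%s/" % subdir))

def worker():
    while True:
        try:
            src, dest = jobs.get_nowait()
        except Empty:
            return  # nothing left to do, this worker exits
        # One rsync per work item; with NUM_WORKERS threads running,
        # up to NUM_WORKERS rsyncs execute in parallel.
        subprocess.call(["rsync", "-a", "--delete", src, dest])
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Because each thread spends nearly all of its time waiting on an external rsync process, Python's GIL isn't a bottleneck here; the wall-clock win comes from keeping several transfers going at once.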
Conclusion
There are clearly many steps you can take along the path to going parallel. I've tried to highlight how you can begin by using tools to execute commands in a more parallel fashion, progress to tools that help you execute ad-hoc and then repeatable tasks across many hosts, and finally, I've given some examples of how to make your own scripts more parallel.