Introduction
Cumin - An automation and orchestration framework
Cumin provides a flexible and scalable automation framework to execute multiple commands on multiple hosts in parallel.
It makes it easy to perform complex selections of hosts through a user-friendly query language which can interface with different backend modules and combine their results for a fine-grained selection. The transport layer can also be selected, and can provide multiple execution strategies. The outputs of the executed commands are automatically grouped for an easy-to-read result.
It can be used both via its command line interface (CLI) cumin and as a Python 3 only library. Cumin was Python 2 only before the 3.0.0 release, due to ClusterShell not yet being Python 3 compatible.
The documentation is available on Wikimedia Documentation and Read the Docs. The details on how Cumin is used at the Wikimedia Foundation are available on Wikitech.
Main components
Query language
Cumin provides a user-friendly generic query language that combines the results of subqueries from multiple backends. The details of the main grammar are:
- Each query part can be composed with any other query part using boolean operators: and, or, and not, xor.
- Multiple query parts can be grouped together with parentheses: ( and ).
- Each query part can be one of:
  - Specific backend query: I{backend-specific query syntax} (where I is an identifier for the specific backend).
  - Alias replacement, according to the aliases defined in the configuration: A:group1.
If a default_backend is set in the configuration, Cumin first tries to execute the query directly with the default backend, and only if the query is not parsable with that backend does it parse it with the main grammar.
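To make the operator semantics concrete, here is a minimal sketch in plain Python that models the result of each query part as a set of hostnames. It does not use Cumin's parser. The hostnames and variable names are made up, and the mapping of each operator to a set operation is an assumption based on the description above.

```python
# Hypothetical results of two query parts, modelled as sets of hostnames.
backend_hosts = {"host1", "host2", "host3"}
alias_hosts = {"host2", "host3", "host4"}

# Assumed mapping of the boolean operators to set operations.
combined = {
    "and": backend_hosts & alias_hosts,      # hosts matched by both parts
    "or": backend_hosts | alias_hosts,       # hosts matched by either part
    "and not": backend_hosts - alias_hosts,  # hosts matched only by the first part
    "xor": backend_hosts ^ alias_hosts,      # hosts matched by exactly one part
}

for operator, hosts in combined.items():
    print(operator, sorted(hosts))
```

Parentheses in a query would then simply control the order in which these set operations are applied.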
Backends
The backends are the components that select the target hosts. Each backend is free to define its own grammar. The available backends are:
- PuppetDB: selects hosts by querying the PuppetDB API for Puppet facts or resources. See the cumin.backends.puppetdb.PuppetDBQuery class documentation for the details.
- OpenStack: selects hosts by querying the OpenStack APIs, based on project, instance name and so on. See the cumin.backends.openstack.OpenStackQuery class documentation for the details. This is an optional backend.
- KnownHosts: selects hosts listed in multiple SSH known hosts files that are not hashed. See the cumin.backends.knownhosts.KnownHostsQuery class documentation for the details.
- Direct: a fallback backend without external dependencies, with ClusterShell.NodeSet.NodeSet group expansion capabilities. See the cumin.backends.direct.DirectQuery class documentation for the details.
- Custom: it is possible to plug in custom backends developed externally from Cumin, as long as they:
  - are included in the Python PATH.
  - define a GRAMMAR_PREFIX module constant that doesn't conflict with the other backend prefixes.
  - define a query_class module variable that points to a class that inherits from cumin.backends.BaseQuery.
  - are listed in the configuration file in the plugins->backends section, see config.yaml.
An example of an external backend can be found in the source code as part of the tests in the cumin.tests.unit.backends.external.ok module.
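The two module-level requirements for a custom backend can be illustrated with a small self-contained check. This is only a sketch: the BaseQuery class below is a stand-in for cumin.backends.BaseQuery so that the code runs without Cumin installed, and check_backend_module, ExampleQuery, the example_backend module and the prefix values are all hypothetical names. It does not cover the Python PATH or configuration file requirements.

```python
import types


class BaseQuery:
    """Stand-in for cumin.backends.BaseQuery, used only to keep this sketch runnable."""


def check_backend_module(module, taken_prefixes):
    """Return a list of problems found in a candidate backend module (hypothetical helper)."""
    problems = []

    prefix = getattr(module, "GRAMMAR_PREFIX", None)
    if not isinstance(prefix, str) or not prefix:
        problems.append("GRAMMAR_PREFIX module constant is missing")
    elif prefix in taken_prefixes:
        problems.append(f"GRAMMAR_PREFIX {prefix!r} conflicts with another backend")

    query_class = getattr(module, "query_class", None)
    if not (isinstance(query_class, type) and issubclass(query_class, BaseQuery)):
        problems.append("query_class must point to a BaseQuery subclass")

    return problems


class ExampleQuery(BaseQuery):
    """Made-up query class standing in for a real external backend implementation."""


# Build a fake external backend module in memory instead of importing one from disk.
external = types.ModuleType("example_backend")
external.GRAMMAR_PREFIX = "X"
external.query_class = ExampleQuery

print(check_backend_module(external, taken_prefixes={"P", "D", "A"}))  # no conflicts
print(check_backend_module(external, taken_prefixes={"X"}))            # prefix already taken
```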
Transports
The transport layer is the one used to convey the commands to be executed to the selected hosts. The transport abstraction allows specifying different execution strategies. The available transports are:
- ClusterShell: SSH transport using the ClusterShell Python library. See the cumin.transports.clustershell.ClusterShellWorker class documentation for the details. It's possible to set all SSH-related options in the configuration file, or to pass an existing ssh_config file directly.
Examples
CLI
Simple example without fine-tuning the options:
Execute the single command systemctl is-active nginx in parallel on all the hosts matching the query for the alias cp-esams, as defined in the aliases.yaml configuration file.
$ sudo cumin 'A:cp-esams' 'systemctl is-active nginx'
23 hosts will be targeted:
cp[3007-3008,3010,3030-3049].esams.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(23) cp[3007-3008,3010,3030-3049].esams.wmnet
----- OUTPUT of 'systemctl is-active nginx' -----
active
================
PASS: |████████████████████████████████████████████████| 100% (23/23) [00:01<00:00, 12.61hosts/s]
FAIL: | | 0% (0/23) [00:01<?, ?hosts/s]
100.0% (23/23) success ratio (>= 100.0% threshold) for command: 'systemctl is-active nginx'.
100.0% (23/23) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
More complex example fine-tuning many of the parameters using the long form of the options for clarity:
Execute two commands on each host in sequence, in a moving window of 2 hosts at a time, moving to the next host 5 seconds after the previous one has finished.
Each command will be considered timed out if it takes more than 30 seconds to complete.
If the percentage of successful hosts goes below 95% at any point it will not schedule any more hosts for execution.
$ sudo cumin --batch-size 2 --batch-sleep 5 --success-percentage 95 --timeout 30 --mode async \
'(P{R:class = role::puppetmaster::backend} or P{R:class = role::puppetmaster::frontend}) and not D{rhodium.eqiad.wmnet}' \
'date' 'ls -la /tmp/foo'
4 hosts will be targeted:
puppetmaster[2001-2002].codfw.wmnet,puppetmaster[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(2) puppetmaster[2001-2002].codfw.wmnet
----- OUTPUT -----
Thu Nov 2 18:45:18 UTC 2017
===== NODE GROUP =====
(1) puppetmaster2002.codfw.wmnet
----- OUTPUT -----
ls: cannot access /tmp/foo: No such file or directory
===== NODE GROUP =====
(1) puppetmaster2001.codfw.wmnet
----- OUTPUT -----
-rw-r--r-- 1 root root 0 Nov 2 18:44 /tmp/foo
================
PASS: |████████████▌ | 25% (1/4) [00:05<00:01, 2.10hosts/s]
FAIL: |████████████▌ | 25% (1/4) [00:05<00:01, 2.45hosts/s]
25.0% (1/4) of nodes failed to execute command 'ls -la /tmp/foo': puppetmaster2002.codfw.wmnet
25.0% (1/4) success ratio (< 95.0% threshold) of nodes successfully executed all commands. Aborting.: puppetmaster2001.codfw.wmnet
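The interaction between the moving window and the success threshold can be sketched with a small simulation. This is not Cumin's implementation: run_moving_window and its parameters are hypothetical, hosts are assumed to finish in the order they were scheduled, the sleep between hosts is left out, and the stop rule (hosts not yet run count as potential successes) is an assumption chosen to be consistent with the run above.

```python
from collections import deque


def run_moving_window(hosts, run_on_host, batch_size, success_threshold):
    """Simulate scheduling hosts in a moving window with a success threshold.

    Assumption (not taken from Cumin's code): scheduling of new hosts continues
    only while 1 - failed / total >= success_threshold.
    """
    pending = deque(hosts)
    total = len(pending)
    # Fill the initial window with the first batch_size hosts.
    window = deque(pending.popleft() for _ in range(min(batch_size, total)))
    succeeded, failed = [], []

    while window:
        host = window.popleft()  # pretend this host is the next one to finish
        (succeeded if run_on_host(host) else failed).append(host)
        still_ok = 1 - len(failed) / total >= success_threshold
        if pending and still_ok:
            window.append(pending.popleft())  # slide the window forward by one host

    return succeeded, failed, list(pending)


hosts = ["host1", "host2", "host3", "host4"]
# Hypothetical outcome: the first host to finish fails, every other host would succeed.
succeeded, failed, not_run = run_moving_window(
    hosts, run_on_host=lambda host: host != "host1", batch_size=2, success_threshold=0.95
)
print(succeeded, failed, not_run)
```

With these inputs one host passes, one fails and two are never scheduled, which mirrors the 25% pass and 25% fail figures in the output above.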
Library
Simple example without fine-tuning of optional parameters:
import cumin
from cumin import query, transport, transports
# Load configuration files /etc/cumin/config.yaml and /etc/cumin/aliases.yaml (if present).
config = cumin.Config()
# Assuming default_backend: direct is set in config.yaml, select with the direct backend 5 hosts.
hosts = query.Query(config).execute('host[1-5]')
target = transports.Target(hosts)
worker = transport.Transport.new(config, target)
worker.commands = ['systemctl is-active nginx']
worker.handler = 'sync'
exit_code = worker.execute() # Execute the command on all hosts in parallel
for nodes, output in worker.get_results():  # Cycle over the results
    print(nodes)
    print(output.message().decode())
    print('-----')
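Each iteration of the results loop covers a group of nodes rather than a single host, in line with the grouped output described in the introduction. The grouping idea itself can be sketched in plain Python. The hostnames and outputs below are made up, and this is an illustration of the concept rather than the code Cumin or ClusterShell actually use.

```python
from collections import defaultdict

# Hypothetical per-host outputs, as raw bytes like a remote command would return.
outputs = {
    "host1": b"active",
    "host2": b"active",
    "host3": b"inactive",
}

# Group together the hosts that produced exactly the same output.
groups = defaultdict(list)
for host, output in outputs.items():
    groups[output].append(host)

# Print one block per distinct output instead of one block per host.
for output, nodes in groups.items():
    print(sorted(nodes))
    print(output.decode())
    print('-----')
```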
More complex example fine-tuning many of the parameters:
import cumin
from cumin import query, transport, transports
from cumin.transports.clustershell import NullReporter
config = cumin.Config(config='/path/to/custom/cumin/config.yaml')
hosts = query.Query(config).execute('A:nginx') # Match hosts defined by the query alias named 'nginx'.
# Needed only if SSH is authenticated via Kerberos and the related configuration flags are set
# (see also the example configuration).
cumin.ensure_kerberos_ticket(config)
# Moving window of 5 hosts at a time with 30s sleep before adding a new host once the previous one has finished.
target = transports.Target(hosts, batch_size=5, batch_sleep=30.0)
worker = transport.Transport.new(config, target)
worker.commands = [
    transports.Command('systemctl is-active nginx'),
    # In each host, for this command apply a timeout of 30 seconds and consider successful an exit code of 0 or 42.
    transports.Command('depool_command', timeout=30, ok_codes=[0, 42]),
    transports.Command('systemctl restart nginx'),
    transports.Command('systemctl is-active nginx'),
    transports.Command('repool_command', ok_codes=[0, 42]),
]
# On each host perform the above commands in a sequence, only if the previous command was successful.
worker.handler = 'async'
# Change the worker's default reporter from the current default that outputs to stdout all commands stdout/err
# outputs to the empty reporter that does nothing.
worker.reporter = NullReporter
# Suppress the progress bars during execution
worker.progress_bars = False
exit_code = worker.execute()
for nodes, output in worker.get_results():
    print(nodes)
    print(output.message().decode())
    print('-----')