Installs/Configures slurm workload manager
cookbook 'slurm', '= 1.4.1', :supermarket
knife supermarket install slurm
knife supermarket download slurm
slurm
Wrapper cookbook that can prepare a full slurm cluster: controller, compute and accounting nodes.
Requirements
Requires the following cookbooks:
- mariadb (~> 2.0)
- shifter (~> 1.0)
Platforms
The following platforms are supported:
- Ubuntu 18.04
- Debian 9
Other Debian family distributions are assumed to work, as long as the slurm version from the package tree
is at least 17.02 due to hostname behaviour of slurmdbd.
Chef
- Chef 14.0+
TODO
- Support for RHEL family
- Make cgroup.conf file dynamic
- Add recipe to setup a dynamic resource allocation cluster
- Install slurm from static stable sources, e.g. 17.11-latest, 18.08-latest
- Refactor and remove code that can be used as a resource instead of a recipe
- Remove static types of nodes and partitions and support dynamic generation, maybe by passing the Hash directly
- Complete spec files
Usage
Check the .kitchen.yml file for the run_list, this can be applied with:
$ kitchen converge [debian|ubuntu|all]
The use case for this run_list is to set up a monolith which contains all of the slurm components.
Recipes
slurm::_disable_ipv6
- Disable ipv6 on a Linux system.
slurm::_systemd_daemon_reload
- Makes it possible to force a daemon-reload on systemd, in order to refresh service unit files.
slurm::accounting
- Installs and configures slurmdbd, Slurm's accounting service.
slurm::cluster
- TODO: sets up a dynamic resource allocation cluster.
slurm::compute
- Installs and configures slurmd, Slurm's compute service.
slurm::database
- Installs and configures a MariaDB service.
slurm::default
- Sets up slurm user and group
- Installs packages common to all of Slurm's services.
slurm::munge
- Sets up munge user and group
- Installs and configures munge authentication service.
slurm::plugin_shifter
- Sets up shifter plugin for slurm.
slurm::server
- Installs and configures slurmctld, Slurm's controller service.
This is where the common configuration file shared between the slurmctld and slurmd services is generated.
Take a close look at the attributes below.
Attributes
The attributes are presented here in order of importance for assembling a whole infrastructure.
Common
# ========================= Data bag configuration =========================
default['slurm']['secret']['secrets_data_bag'] # The name of the encrypted data bag that stores slurm secrets
default['slurm']['secret']['service_passwords_data_bag'] # The name of the encrypted data bag that stores service user passwords, with
# each key in the data bag corresponding to a named Slurm service, like
# "slurmdbd", "slurmctl", "slurmd" (this may not be needed for slurm).
default['slurm']['secret']['db_passwords_data_bag'] # The name of the encrypted data bag that stores database passwords, with
# each key in the data bag corresponding to a named Slurm database, like
# "slurmdbd", "slurmctl", "slurmd"
default['slurm']['secret']['user_passwords_data_bag'] # The name of the encrypted data bag that stores general user passwords, with
# each key in the data bag corresponding to a user (this may not be needed for slurm).
# ========================= Slurm specific configuration =========================
default['slurm']['common']['conf_dir'] # slurm configuration directory, usually '/etc/slurm-llnl'
default['slurm']['custom_template_banner'] # String that is prepended to each slurm configuration file
default['slurm']['user'] # username to configure slurm as, usually 'slurm'
default['slurm']['group'] # group to configure slurm as, usually 'slurm'
default['slurm']['uid'] # Slurm user ID, common to all nodes, our default is 999, just below the user land IDs
default['slurm']['gid'] # Slurm group ID, common to all nodes, our default is 999, just below the user land IDs
default['proxy']['http'] # proxy address for use with apt, mariadb, and system environment
Munge
default['slurm']['munge']['key'] # munge key location
default['slurm']['munge']['env_file'] # munge environment file, to be used by systemd
default['slurm']['munge']['auth_socket'] # munge communication socket location
default['slurm']['munge']['user'] # username to configure munge as, usually 'munge'
default['slurm']['munge']['group'] # group name to configure munge as, usually 'munge'
default['slurm']['munge']['uid'] # MUNGE user ID, common to all nodes, our default is 998, just before Slurm's
default['slurm']['munge']['gid'] # MUNGE group ID, common to all nodes, our default is 998, just before Slurm's
Monolith
default['slurm']['control_machine'] # fqdn of the machine where slurmctld is running
default['slurm']['nfs_apps_server'] # fqdn of the machine where the apps directory is made available through nfs
default['slurm']['nfs_homes_server'] # fqdn of the machine where the home directory is made available through nfs
default['slurm']['apps_dir'] # path to the apps directory
default['slurm']['homes_dir'] # path to the home directory
default['slurm']['monolith_testing'] # tells the cookbook if the setup should be that of a monolith or not, usually for testing, either true or false
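As a rough illustration, the monolith attributes above might be set from a wrapper cookbook's attributes file along these lines. The snippet is plain Ruby: the default object is a hypothetical auto-vivifying Hash that stands in for Chef's attribute DSL so the example runs outside Chef, and every hostname and path is a made-up placeholder.

```ruby
# Stand-in for Chef's `default` attribute object (assumption: plain Ruby,
# not the real Chef DSL). Missing keys create nested hashes on access.
default = Hash.new { |hash, key| hash[key] = Hash.new(&hash.default_proc) }

# Hypothetical values; replace them with the hosts and paths of your site.
default['slurm']['control_machine']  = 'controller.example.org'
default['slurm']['nfs_apps_server']  = 'nfs.example.org'
default['slurm']['nfs_homes_server'] = 'nfs.example.org'
default['slurm']['apps_dir']         = '/opt/apps'
default['slurm']['homes_dir']        = '/home'
default['slurm']['monolith_testing'] = false

# Print what was set, one attribute per line.
default['slurm'].each { |name, value| puts "slurm.#{name} = #{value.inspect}" }
```

Inside a real attributes file only the assignment lines would be needed, since Chef provides default itself.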
Database
default['mysql']['bind_address'] # address on which the mariadb server listens for connections, defaults to '0.0.0.0'
default['mysql']['port'] # port on which the mariadb server listens for connections, defaults to '3306'
default['mysql']['version'] # MariaDB version lock, defaults to '10.1'
default['mysql']['character-set-server'] # database character set, defaults to 'utf8'
default['mysql']['collation-server'] # database collation, defaults to 'utf8_general_ci'
default['mysql']['user']['slurm'] # user which slurm accounting service uses to connect to the database
Accounting
default['slurm']['accounting']['conf_file'] # path to the slurmdbd configuration file, defaults to '/etc/slurm-llnl/slurmdbd.conf'
default['slurm']['accounting']['env_file'] # path to the slurmdbd environment file location, defaults to '/etc/default/slurmdbd'
default['slurm']['accounting']['bin_file'] # path to the slurmdbd binary, defaults to '/usr/sbin/slurmdbd'
default['slurm']['accounting']['pid_file'] # path to the slurmdbd pid file, defaults to '/var/run/slurm-llnl/slurmdbd.pid'
default['slurm']['accounting']['systemd_file'] # path to the slurmdbd systemd service unit file, defaults to '/lib/systemd/system/slurmdbd.service'
default['slurm']['accounting']['debug'] # debug level, valid values from 0-7, defaults to '3'
default['slurm']['accounting']['conf'] # Hash representing the slurmdbd configuration options
The default for ['slurm']['accounting']['conf'] is:
{
AuthType: 'auth/munge',
AuthInfo: node['slurm']['munge']['auth_socket'],
DbdHost: node['hostname'],
DebugLevel: node['slurm']['accounting']['debug'],
LogFile: '/var/log/slurm-llnl/slurmdbd.log', # default is syslog
MessageTimeout: '10',
PidFile: node['slurm']['accounting']['pid_file'],
SlurmUser: node['mysql']['user']['slurm'],
StorageHost: node['hostname'],
StorageLoc: 'slurm_acct_db',
StoragePort: node['mysql']['port'],
StorageType: 'accounting_storage/mysql',
StorageUser: node['mysql']['user']['slurm'],
}
Take into account that when you override ['slurm']['accounting']['conf'], you override all of its options.
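Because the Hash is replaced as a whole, one way to change a single option is to merge it into a copy of the defaults. The following is a minimal sketch in plain Ruby, assuming an abbreviated copy of the defaults with placeholder values; the render_conf helper is a hypothetical illustration of the Key=Value format used by slurmdbd.conf, not the cookbook's own template.

```ruby
# Abbreviated copy of the default options (placeholder values, assumption).
defaults = {
  AuthType: 'auth/munge',
  DbdHost: 'controller',
  DebugLevel: '3',
  StorageType: 'accounting_storage/mysql',
  StoragePort: '3306',
}

# Merge the one changed option into the defaults instead of assigning a
# brand new Hash, which would drop every option that is not repeated.
conf = defaults.merge(DebugLevel: '5')

# Hypothetical helper: render the Hash as "Key=Value" lines.
def render_conf(options)
  options.map { |key, value| "#{key}=#{value}" }.join("\n")
end

puts render_conf(conf)
```

In a wrapper cookbook the merged Hash would then be assigned to the ['slurm']['accounting']['conf'] attribute.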
Server
default['slurm']['cluster']['name'] # Name for the cluster, defaults to 'slurm-test'
default['slurm']['server']['conf_file'] # path to the slurmctld and slurmd configuration file, defaults to '/etc/slurm-llnl/slurm.conf'
default['slurm']['server']['env_file'] # path to the slurmctld environment file, defaults to '/etc/default/slurmctld'
default['slurm']['server']['bin_file'] # path to the slurmctld binary file, defaults to '/usr/sbin/slurmctld'
default['slurm']['server']['pid_file'] # path to the slurmctld pid file, defaults to '/var/run/slurm-llnl/slurmctld.pid'
default['slurm']['server']['systemd_file'] # path to the slurmctld systemd service unit file, defaults to '/lib/systemd/system/slurmctld.service'
default['slurm']['server']['service_req'] # name of the storage service(s) that the slurm service should depend on to start
# this should be either empty or the name of the storage service client(s) that slurm might depend on (ceph, beegfs, lustre)
default['slurm']['server']['cgroup_dir'] # path to the cgroup plugin directory, defaults to '/etc/slurm-llnl/cgroup'
default['slurm']['server']['cgroup_conf_file'] # path to the cgroup configuration file, defaults to '/etc/slurm-llnl/cgroup.conf'
default['slurm']['server']['plugstack_dir'] # path to the slurm plugin directory, defaults to '/etc/slurm-llnl/plugstack.conf.d'
default['slurm']['server']['plugstack_conf_file'] # path to the slurm plugin configuration file, defaults to '/etc/slurm-llnl/plugstack.conf'
default['slurm']['shifter'] # Boolean, if true shifter will be installed
default['shifter']['imagegw'] # Boolean, if true the shifter image gateway will be installed and configured (assumes default['slurm']['shifter'] == true)
default['shifter']['imagegw_fqdn'] # String, Image Gateway FQDN, accessible hostname or ip address, defaults to node['slurm']['control_machine']
Compute nodes
In the computes.rb attribute file you can see an example for the various slurm cluster settings.
For now we assume three types of partitions (and nodes):
- small
- medium
- large
representing the capacity (memory) for each group. The nodes in each group are assumed to be homogeneous.
Each group's properties can be set via the following attributes:
default['slurm']['conf']['nodes'][type]['count']
default['slurm']['conf']['nodes'][type]['properties']['cpus'] # amount of CPUs available in the node group, Integer
default['slurm']['conf']['nodes'][type]['properties']['mem'] # amount of RAM available in the node group, Megabytes
default['slurm']['conf']['nodes'][type]['properties']['sockets'] # number of sockets in node group, on private cloud systems it is usually the number of cpus
default['slurm']['conf']['nodes'][type]['properties']['cores_per_socket'] # number of cores per socket, on private cloud systems it is usually one
default['slurm']['conf']['nodes'][type]['properties']['threads_per_core'] # number of threads per core, on private cloud systems it is usually one
default['slurm']['conf']['nodes'][type]['properties']['weight'] # preference for being allocated work to, the lower the weight the higher the preference
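To make the mapping concrete, here is a minimal sketch in plain Ruby of how one group's attributes could be turned into a slurm.conf node definition. The node naming scheme (small[1-4]), the helper names and the sample values are assumptions made for illustration, not the cookbook's actual template. The keywords NodeName, CPUs, RealMemory, Sockets, CoresPerSocket, ThreadsPerCore and Weight are standard slurm.conf node parameters.

```ruby
# Hypothetical helper: build the host list expression for a node group.
# A single node gets a plain name rather than a "[1-1]" range.
def node_names(type, count)
  count = Integer(count) # accepts 4 as well as '4'
  raise ArgumentError, 'count must be at least 1' if count < 1

  count == 1 ? "#{type}1" : "#{type}[1-#{count}]"
end

# Hypothetical helper: render one NodeName line from a group's settings.
def node_line(type, group)
  props = group['properties']
  [
    "NodeName=#{node_names(type, group['count'])}",
    "CPUs=#{props['cpus']}",
    "RealMemory=#{props['mem']}",
    "Sockets=#{props['sockets']}",
    "CoresPerSocket=#{props['cores_per_socket']}",
    "ThreadsPerCore=#{props['threads_per_core']}",
    "Weight=#{props['weight']}",
  ].join(' ')
end

# Sample values for a 'small' group (made up for this example).
small = {
  'count' => '4',
  'properties' => {
    'cpus' => 2, 'mem' => 4096, 'sockets' => 2,
    'cores_per_socket' => 1, 'threads_per_core' => 1, 'weight' => 10,
  },
}

puts node_line('small', small)
```

Converting the count with Integer() and special-casing a count of one mirrors the kind of node list issues recorded in the changelog entries for 1.1.0 to 1.1.2.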
At this time, this cookbook is designed to work either as a monolith (PoC) or to be deployed in a private cloud environment.
Data Bags
From the previous section we can see which data bags are required to exist. Each of the items must have a key with the same name as the data bag, where the secret value should be stored.
Within those data bags we have to create the following items:
DataBag | Item | Keys |
---|---|---|
slurm_db_passwords | mysqlroot | --- |
slurm_db_passwords | node['mysql']['user']['slurm'] | --- |
slurm_secrets | munge | --- |
Any of the slurm_db_passwords items should be text passwords, generated with your favorite tool.
The munge key should be a base64 key, based on binary data generated from running one of the following:
- $ create-munge-key -r
  on a system with munge installed (note that it will try to overwrite any existing key in /etc/munge/munge.key)
- $ dd if=/dev/random bs=1 count=1024 > munge.key
- $ dd if=/dev/urandom bs=1 count=1024 > munge.key
For more information on generating a munge key see the munge documentation.
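The same kind of key can also be produced from Ruby. This is a minimal sketch, assuming the data bag stores the key as base64 text and that it is decoded back into binary before being written to disk (as the 0.3.4 changelog entry suggests). The 1024-byte length mirrors the dd commands above; the variable names are illustrative.

```ruby
require 'securerandom'

# 1024 random bytes, the same amount the dd commands above produce.
raw_key = SecureRandom.random_bytes(1024)

# Base64 text without line breaks ('m0' is strict base64 in Array#pack),
# suitable for storing as the secret value of the munge item.
encoded_key = [raw_key].pack('m0')

# Decoding restores the original binary key.
decoded_key = encoded_key.unpack1('m0')

puts encoded_key.length
puts decoded_key == raw_key
```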
Authors
- Manuel Torrinha manuel.torrinha@tecnico.ulisboa.pt
Dependent cookbooks
- mariadb ~> 2.0
- shifter ~> 1.0
Contingent cookbooks
There are no cookbooks that are contingent upon this one.
slurm CHANGELOG
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
This file is used to list changes made in each version of the slurm cookbook.
1.4.1
Fixed
- issue with apt-key add not allowing to parse output
1.4.0
Fixed
- issue with missing apt key for mariadb (temporary hotfix)
Changed
- slurm user gid to 998, systemd-coredump has 999 in debian buster
- mariadb version to current stable (10.4)
1.3.3
Fixed
- documentation formatting
1.3.2
Fixed
- documentation formatting
1.3.1
Fixed
- absence of node['slurm']['cluster']['name'] attribute
1.3.0
Added
- system_name property so we can properly use shifter in the compute nodes
1.2.4
Fixed
- node['shifter'] being hardcoded and overwriting the attribute set by wrapper cookbooks
1.2.3
Changed
- ruby code to match a non-empty attribute
1.2.2
Changed
- ruby code to match a Boolean
1.2.1
Changed
- how we decide if the image manager is installed. It is now via an attribute
1.1.2
Fixed
- count can now be a String, an integer or some type that can be converted to an integer
1.1.1
Fixed
- wrong count comparison.
1.1.0
Removed
- support for Ubuntu 18.04, see Known Issues
Fixed
- slurm.conf node list appeared [1-1] when the type count was 1, still worked but not very appealing
- slurm.conf node list appeared [1-0] when the type count was 1, which made the slurmctld service not start
1.0.6
Added
- forgotten with_slurm option to shifter resources to generate the shifter_slurm.so file
1.0.5
Added
- Edge case to not export nfs shares if testing monolith, there seem to be some issues with nfs exports when using dokken
Changed
- compute verifications to more friendly boolean expressions
- reordered resource notifications
1.0.4
Fixed
- chef service resource action
1.0.3
Added
- NFS Kernel service explicit start, it is a bad practice to expect services to be running after the respective packages are installed
1.0.2
Changed
- control machine address should be just the hostname, the name resolution is assumed to be solved locally in each node
1.0.1
Removed
- support for Ubuntu Xenial
1.0.0
Changed
- shifter dependency to major version 1
0.6.2
Fixed
- Linting
0.6.1
Changed
- Slurm and munge users are now regular user so that we can force the uid and gid values
Added
- home directory for both slurm and munge users
0.6.0
Added
- MUNGE user and group with pre-established uid and gid
- SLURM user and group with pre-established uid and gid
- Updated documentation
0.5.6
Removed
- munge service nfs mount due to user uid mismatch between the controller and the compute nodes
0.5.5
Changed
- Now using supermarket sources for all dependent cookbooks
0.5.4
Added
- Chef logging (info) for compute information on mount stage
0.5.3
Fixed
- Ruby syntax error on assignment
0.5.2
Added
- subnet filtering to exports file, via the node['slurm']['nfs_network'] attribute
- enabled option to chef mount resources
- proper update to exportfs
Fixed
- slurm.conf newlines and definitions
- exports file generation
- slurm variable apps_dir deprecated
0.5.1
Changed
- apps directory is now slurm directory, making nodes mount the nfs share to the correct path
0.4.1
Added
- TESTING.md
0.4.0
Added
- Shifter support and dependency
- Kitchen suite with shifter support
- Older Ubuntu/Debian images
0.3.9
Changed
- proxy is now passed as attribute
- action for slurm services to :start
0.3.8
Changed
- proxy string not ending with ";" anymore, gave false negatives in InSpec
0.3.7
Fixed
- plugin_shifter recipe, had default instead of node
0.3.6
Changed
- now using appropriate attribute names instead of node['fqdn']
0.3.5
Changed
- now passing root password to reflect changes in mariadb cookbook, node['mariadb']['server_root_password'] is no longer used as default.
0.3.4
Changed
- translating base64 munge key into binary
0.3.3
Removed
- support for Ubuntu 16.04. The slurm version from apt repos is < 16 so slurmdbd fails to start because of hostname issues.
0.3.2
Added
- support for monolith testing, setting node['slurm']['monolith_testing'] attribute to true configures slurm.conf file with an entry for the slurmctl too
Fixed
- cgroup_allowed_devices_file.conf missing error
- nfs mount resource does not apply to monolith
- typo in slurm.conf property
- Service resource commands for Slurm server
0.3.1
Added
- Added apt_repository variable to mariadb_repository, changed its mirror to http://mirrors.up.pt/pub/mariadb/repo
Removed
- Fully removed support for CentOS
0.3.0
Added
- slurm controller automatic registration with the slurm accounting
- NFS package installation for the slurm controller and compute nodes
- NFS configuration for the slurm controller and compute nodes
Removed
- disable ipv6 on the chef run list
Modified
- .kitchen.yml sets up a mariadb database, a slurmdb daemon and a slurm controller in one single controller machine
- changed proxy address to its fqdn, so it will resolve in either ipv4 or ipv6
Fixed
- added some redundant apt update commands as in some cases the apt cache didn't seem to be updated
0.2.0
Added
- working database recipe
- recipe to disable ipv6 on linux systems
0.1.0
Initial release.
Added
- created skeleton for the recipes of the different slurm components
- created initial inspec tests
- created initial chefspec tests
- created a modified version of openstack-common get_password library
- created test data bag skeleton and changed usual location for them, as well as the data bag secret
- created some attributes, the data structure is still not set in stone
Known Issues
1.1.0
- when running in travis, Ubuntu 18.04 vms do not start the munge service:
dokken systemd[1]: Starting MUNGE authentication service...
-- Subject: Unit munge.service has begun start-up
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit munge.service has begun starting up.
dokken systemd[1]: munge.service: New main PID 3335 does not belong to service, and PID file is not owned by root. Refusing.
dokken systemd[1]: munge.service: New main PID 3335 does not belong to service, and PID file is not owned by root. Refusing.
dokken systemd[1]: munge.service: Start operation timed out. Terminating.
dokken systemd[1]: munge.service: Failed with result 'timeout'.
dokken systemd[1]: Failed to start MUNGE authentication service.
The user is created properly and has the right uid and gid, and the systemd unit file runs with the user defined by name.
When running locally, with docker, vagrant or launching on openstack it runs fine...
Besides, the Debian 9 run in travis runs just fine. A mystery...