Operation
Nodepool has two components which run as daemons. The nodepool-builder daemon is responsible for building diskimages and uploading them to providers, and the nodepool-launcher daemon is responsible for launching and deleting nodes.
Both daemons frequently re-read their configuration files after starting to support adding or removing new images and providers, or otherwise altering the configuration.
These daemons communicate with each other via a ZooKeeper database. You must run ZooKeeper and at least one of each of these daemons to have a functioning Nodepool installation.
Nodepool-builder
The nodepool-builder daemon builds and uploads images to providers. It may be run on the same host as the main nodepool daemon or on a separate one. Multiple instances of nodepool-builder may be run on the same or separate hosts in order to speed up image builds across many machines, or to supply high availability or redundancy. However, since nodepool-builder allows specification of the number of both build and upload threads, it is usually not advantageous to run more than a single instance on one machine. Note that while diskimage-builder (which is responsible for building the underlying images) generally supports executing multiple builds on a single machine simultaneously, some of the elements it uses may not. To be safe, it is recommended to run a single instance of nodepool-builder on a machine, and configure that instance to run only a single build thread (the default).
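For example, a conservative single-host deployment following this advice might start the builder with the default single build worker and parallelize only uploads (the config path and worker counts are illustrative; see the option reference below):
# one build worker (the default) avoids concurrent diskimage-builder runs,
# while image uploads to providers may safely run in parallel
nodepool-builder -c /etc/nodepool/nodepool.yaml --build-workers 1 --upload-workers 4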
Nodepool-launcher
The main nodepool daemon is named nodepool-launcher and is responsible for managing cloud instances launched from the images created and uploaded by nodepool-builder.
When a new image is created and uploaded, nodepool-launcher will immediately start using it when launching nodes (Nodepool always uses the most recent image for a given provider in the ready state). Nodepool will delete images if they are not the most recent or second most recent ready images. In other words, Nodepool will always make sure that in addition to the current image, it keeps the previous image around. This way, if you find that a newly created image is problematic, you may simply delete it and Nodepool will revert to using the previous image.
Daemon usage
To start the main Nodepool daemon, run nodepool-launcher:
usage: nodepool-launcher [-h] [-l LOGCONFIG] [--version] [-p PIDFILE] [-d]
[-f] [-c CONFIG] [-s SECURE] [--no-webapp] [--repl]
Node pool.
options:
-h, --help show this help message and exit
-l LOGCONFIG path to log config file (default: None)
--version show nodepool version
-p PIDFILE path to pid file (default: /var/run/nodepool/nodepool.pid)
-d do not run as a daemon with debug logging (default: False)
-f do not run as a daemon (default: False)
-c CONFIG path to config file (default: /etc/nodepool/nodepool.yaml)
-s SECURE path to secure file (default: None)
--no-webapp
--repl Start a REPL on port 3000 (default: False)
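For example, when validating a new deployment it can be convenient to run the launcher in the foreground with debug logging instead of daemonizing (the config path below is the documented default):
# run in the foreground with debug logging; stop with Ctrl-C (SIGINT)
nodepool-launcher -d -c /etc/nodepool/nodepool.yaml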
To start the nodepool-builder daemon, run nodepool-builder:
usage: nodepool-builder [-h] [-l LOGCONFIG] [--version] [-p PIDFILE] [-d] [-f]
[-c CONFIG] [-s SECURE]
[--build-workers BUILD_WORKERS]
[--upload-workers UPLOAD_WORKERS] [--repl]
NodePool Image Builder.
options:
-h, --help show this help message and exit
-l LOGCONFIG path to log config file (default: None)
--version show nodepool version
-p PIDFILE path to pid file (default: /var/run/nodepool/nodepool-
builder.pid)
-d do not run as a daemon with debug logging (default:
False)
-f do not run as a daemon (default: False)
-c CONFIG path to config file (default:
/etc/nodepool/nodepool.yaml)
-s SECURE path to secure config file (default: None)
--build-workers BUILD_WORKERS
number of build workers (default: 1)
--upload-workers UPLOAD_WORKERS
number of upload workers (default: 4)
--repl Start a REPL on port 3000 (default: False)
To stop a daemon, send SIGINT to the process.
Sending SIGUSR2 to a daemon causes it to log debugging information about its threads; when yappi (Yet Another Python Profiler) is available, additional function and thread statistics are emitted as well. The first SIGUSR2 enables yappi; on the second SIGUSR2 the daemon dumps the information collected, resets all yappi state, and stops profiling. This minimizes the impact of yappi on a running system.
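As a sketch, profiling a running launcher over a short window might look like this, assuming the default pid file location shown above:
# first SIGUSR2: enable yappi profiling (if yappi is installed)
kill -USR2 $(cat /var/run/nodepool/nodepool.pid)
# let the workload of interest run for a while
sleep 60
# second SIGUSR2: dump collected stats to the debug log, reset, stop profiling
kill -USR2 $(cat /var/run/nodepool/nodepool.pid)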
Metadata
When Nodepool creates instances, it will assign the following nova metadata:
- groups
A comma separated list containing the name of the image and the name of the provider. This may be used by the Ansible OpenStack inventory plugin.
- nodepool_image_name
The name of the image as a string.
- nodepool_provider_name
The name of the provider as a string.
- nodepool_node_id
The nodepool id of the node as an integer.
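As an illustration, on an OpenStack provider these properties can be inspected with the openstack client; the server name and all values below are hypothetical:
openstack server show my-nodepool-node -f json -c properties
# {
#   "properties": {
#     "groups": "fedora,providera",
#     "nodepool_image_name": "fedora",
#     "nodepool_provider_name": "providera",
#     "nodepool_node_id": "123"
#   }
# }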
Common Management Tasks
In the course of running a Nodepool service you will find that there are some common operations that will be performed. Like the services themselves, these are split into two groups: image management and instance management.
Image Management
Before Nodepool can launch any cloud instances it must have images to boot off of. nodepool dib-image-list will show you which images are available locally on disk. These images on disk are then uploaded to clouds; nodepool image-list will show you what images are bootable in your various clouds.
If you need to force a new image to be built to pick up a new feature more quickly than the normal rebuild cycle (which defaults to 24 hours), you can manually trigger a rebuild. Using nodepool image-build you can tell Nodepool to begin a new image build now. Note that depending on work that the nodepool-builder is already performing, this may queue the build. Check nodepool dib-image-list to see the current state of the builds. Once the image is built, it is automatically uploaded to all of the clouds configured to use that image.
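A minimal sketch of that workflow, assuming a diskimage named fedora is defined in your configuration:
# request an immediate rebuild of the fedora diskimage
nodepool image-build fedora
# re-run until the new build shows up as ready
nodepool dib-image-list
# confirm the new build has been uploaded to each configured provider
nodepool image-list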
At times you may need to stop using an existing image because it is broken. Your two major options here are to build a new image to replace the existing image, or to delete the existing image and have Nodepool fall back on using the previous image. Rebuilding and uploading can be slow, so typically the best option is to simply nodepool image-delete the most recent image, which will cause Nodepool to fall back on using the previous image. However, if you do this without "pausing" the image it will be immediately reuploaded. You will want to pause the image if you need to further investigate why the image is not being built correctly. If you know the image will be built correctly, you can simply delete the built image and remove it from all clouds using nodepool dib-image-delete, which will cause it to be rebuilt.
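A sketch of both options follows; the image name, provider name, build id, upload id, and dib image id are hypothetical and should be taken from nodepool dib-image-list and nodepool image-list (image-pause is assumed to take the image name as its argument):
# Option 1: pause the image so the broken upload is not immediately
# re-uploaded, then delete the upload and investigate at leisure
nodepool image-pause fedora
nodepool image-delete --provider providera --image fedora --build-id 0000000002 --upload-id 0000000001
# Option 2: the image definition is known to be good, so delete the built
# image from disk and from all clouds and let it be rebuilt
nodepool dib-image-delete fedora-0000000002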
Command Line Tools
Usage
The general options that apply to all subcommands are:
usage: nodepool [-h] [-l LOGCONFIG] [--version] [-c CONFIG] [-s SECURE]
[--debug]
{list,image-list,dib-image-list,image-status,image-build,alien-image-list,delete,hold,image-delete,dib-image-delete,config-validate,request-list,info,erase,image-pause,image-unpause,export-image-data,import-image-data}
...
Node pool.
options:
-h, --help show this help message and exit
-l LOGCONFIG path to log config file (default: None)
--version show nodepool version
-c CONFIG path to config file (default:
/etc/nodepool/nodepool.yaml)
-s SECURE path to secure file (default: None)
--debug show DEBUG level logging (default: False)
commands:
valid commands
{list,image-list,dib-image-list,image-status,image-build,alien-image-list,delete,hold,image-delete,dib-image-delete,config-validate,request-list,info,erase,image-pause,image-unpause,export-image-data,import-image-data}
additional help
list list nodes
image-list list images from providers
dib-image-list list images built with diskimage-builder
image-status list image status
image-build build image using diskimage-builder
alien-image-list list images not accounted for by nodepool
delete place a node in the DELETE state
hold place a node in the HOLD state e.g. for running
maintenance tasks
image-delete delete an image
dib-image-delete Delete a dib built image from disk along with all
cloud uploads of this image
config-validate Validate configuration file
request-list list the current node requests
info Show provider data from zookeeper
erase Erase provider data from zookeeper
image-pause pause an image
image-unpause unpause an image
export-image-data Export image data from ZooKeeper
import-image-data Import image data to ZooKeeper
The following subcommands deal with nodepool images:
dib-image-list
usage: nodepool dib-image-list [-h]
options:
-h, --help show this help message and exit
image-status
usage: nodepool image-status [-h]
options:
-h, --help show this help message and exit
image-list
usage: nodepool image-list [-h]
options:
-h, --help show this help message and exit
image-build
usage: nodepool image-build [-h] image
positional arguments:
image image name
options:
-h, --help show this help message and exit
dib-image-delete
usage: nodepool dib-image-delete [-h] id
positional arguments:
id dib image id
options:
-h, --help show this help message and exit
image-delete
usage: nodepool image-delete [-h] --provider PROVIDER --image IMAGE
--upload-id UPLOAD_ID --build-id BUILD_ID
options:
-h, --help show this help message and exit
--provider PROVIDER provider name
--image IMAGE image name
--upload-id UPLOAD_ID
image upload id
--build-id BUILD_ID image build id
The following subcommands deal with nodepool nodes:
list
usage: nodepool list [-h] [--detail]
options:
-h, --help show this help message and exit
--detail Output detailed node info
delete
usage: nodepool delete [-h] [--now] id
positional arguments:
id node id
options:
-h, --help show this help message and exit
--now delete the node in the foreground
hold
usage: nodepool hold [-h] id
positional arguments:
id node id
options:
-h, --help show this help message and exit
The following subcommands deal with ZooKeeper data management:
info
usage: nodepool info [-h] PROVIDER
positional arguments:
PROVIDER Provider name
options:
-h, --help show this help message and exit
erase
usage: nodepool erase [-h] [--force] PROVIDER
positional arguments:
PROVIDER Provider name
options:
-h, --help show this help message and exit
--force Bypass the warning prompt
If Nodepool’s database gets out of sync with reality, the following commands can help identify compute instances or images that are unknown to Nodepool:
alien-image-list
usage: nodepool alien-image-list [-h] [provider]
positional arguments:
provider provider name
options:
-h, --help show this help message and exit
Image builds and uploads can take a lot of time, so there is a pair of commands to export and import the image build and upload metadata from Nodepool’s internal storage in ZooKeeper. These can be used to backup and restore data in case the ZooKeeper cluster is lost. Note that these commands do not save or restore the actual image data, only the records in ZooKeeper. If the data are important, consider backing them up as well. Even without the local image builds, restoring the image metadata will allow nodepool-launcher to continue to operate while new builds are created.
These commands do not export or import any node information. It is expected that any existing nodes will be detected as leaked and automatically deleted if the ZooKeeper storage is reset.
export-image-data
usage: nodepool export-image-data [-h] path
positional arguments:
path Export file path
options:
-h, --help show this help message and exit
import-image-data
usage: nodepool import-image-data [-h] path
positional arguments:
path Import file path
options:
-h, --help show this help message and exit
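A sketch of a backup and restore cycle for this image metadata (the export file path is arbitrary):
# back up the image build and upload records from ZooKeeper
nodepool export-image-data /var/backups/nodepool-image-data.json
# later, after the ZooKeeper cluster has been rebuilt, restore the records
nodepool import-image-data /var/backups/nodepool-image-data.json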
Removing a Provider
Removing a provider from nodepool involves two separate steps: removing from the builder process, and removing from the launcher process.
Warning
Since the launcher process depends on images being present in the provider, you should follow the process for removing a provider from the launcher before doing the steps to remove it from the builder.
Removing from the Launcher
To remove a provider from the launcher, set that provider’s max-servers
value to 0 (or any value less than 0). This disables the provider and will
instruct the launcher to stop booting new nodes on the provider. You can then
let the nodes go through their normal lifecycle. Once all nodes have been
deleted, you may remove the provider from the launcher configuration file entirely,
although leaving it in this state is effectively the same and makes it easy
to turn the provider back on.
Note
There is currently no way to force the launcher to immediately begin deleting any unused instances from a disabled provider. If urgency is required, you can delete the nodes directly instead of waiting for them to go through their normal lifecycle, but the effect is the same.
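If you do need to drain a disabled provider quickly, the node commands described below can be used directly; the node id shown is hypothetical:
# find nodes still allocated in the disabled provider
nodepool list --detail
# delete a specific node immediately, in the foreground
nodepool delete --now 0000000123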
For example, if you want to remove ProviderA from a launcher with a configuration file defined as:
providers:
  - name: ProviderA
    region-name: region1
    cloud: ProviderA
    boot-timeout: 120
    diskimages:
      - name: centos
      - name: fedora
    pools:
      - name: main
        max-servers: 100
        labels:
          - name: centos
            min-ram: 8192
            flavor-name: Performance
            diskimage: centos
            key-name: root-key
Then you would need to alter the configuration to:
providers:
  - name: ProviderA
    region-name: region1
    cloud: ProviderA
    boot-timeout: 120
    diskimages:
      - name: centos
      - name: fedora
    pools:
      - name: main
        max-servers: 0
        labels:
          - name: centos
            min-ram: 8192
            flavor-name: Performance
            diskimage: centos
            key-name: root-key
Note
The launcher process will automatically notice any changes in its configuration file, so there is no need to restart the service to pick up the change.
Removing from the Builder
The builder controls image building, uploading, and on-disk cleanup. The builder needs a chance to properly manage these resources for a removed provider. To do this, you need to first set the diskimages configuration section for the provider you want to remove to an empty list.
Warning
Make sure the provider is disabled in the launcher before disabling in the builder.
For example, if you want to remove ProviderA from a builder with a configuration file defined as:
providers:
  - name: ProviderA
    region-name: region1
    diskimages:
      - name: centos
      - name: fedora

diskimages:
  - name: centos
    pause: false
    elements:
      - centos-minimal
      ...
    env-vars:
      ...
Then you would need to alter the configuration to:
providers:
  - name: ProviderA
    region-name: region1
    diskimages: []

diskimages:
  - name: centos
    pause: false
    elements:
      - centos-minimal
      ...
    env-vars:
      ...
By keeping the provider defined in the configuration file, but changing its diskimages to an empty list, you signal the builder to clean up resources for that provider, including any images already uploaded, any on-disk images, and any image data stored in ZooKeeper. After those resources have been cleaned up, it is safe to remove the provider from the configuration file entirely, if you wish to do so.
Note
The builder process will automatically notice any changes in its configuration file, so there is no need to restart the service to pick up the change.
Web interface
If configured (see webapp), a nodepool-launcher instance can provide a range of endpoints that return information in text and JSON formats. Note that if there are multiple launchers, all will provide the same information.
- GET /image-list
The status of uploaded images
- Query Parameters:
fields – comma-separated list of fields to display
- Request Headers:
Accept – application/json or text/*
- Response Headers:
Content-Type – application/json or text/plain depending on the Accept header
- GET /dib-image-list
The status of images built by
diskimage-builder
- Query Parameters:
fields – comma-separated list of fields to display
- Request Headers:
Accept – application/json or text/*
- Response Headers:
Content-Type – application/json or text/plain depending on the Accept header
- GET /image-status
The paused and manual build status of images
- Query Parameters:
fields – comma-separated list of fields to display
- Request Headers:
Accept – application/json or text/*
- Response Headers:
Content-Type – application/json or text/plain depending on the Accept header
- GET /node-list
The status of currently active nodes
- Query Parameters:
node_id – restrict to a specific node
fields – comma-separated list of fields to display
- Request Headers:
Accept – application/json or text/*
- Response Headers:
Content-Type – application/json or text/plain depending on the Accept header
- GET /request-list
Outstanding requests
- Query Parameters:
fields – comma-separated list of fields to display
- Request Headers:
Accept – application/json or text/*
- Response Headers:
Content-Type – application/json or text/plain depending on the Accept header
- GET /label-list
All available labels as reported by all launchers
- Query Parameters:
fields – comma-separated list of fields to display
- Request Headers:
Accept – application/json or text/*
- Response Headers:
Content-Type – application/json or text/plain depending on the Accept header
- GET /ready
Responds with status code 200 as soon as all configured providers are fully started. During startup it returns 500. This can be used as a readiness probe in a Kubernetes-based deployment.
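A minimal example of querying a launcher's webapp with curl; the host name and field names are hypothetical, and a port of 8005 is assumed (see the webapp configuration for the actual listening port):
# list nodes as JSON, restricted to a few fields
curl -H "Accept: application/json" "http://launcher.example.com:8005/node-list?fields=id,label,state"
# readiness probe endpoint, useful for Kubernetes deployments
curl -i http://launcher.example.com:8005/ready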
Monitoring
Nodepool provides monitoring information to statsd. See Statsd and Graphite to learn how to enable statsd support. Currently, these metrics are supported:
Nodepool builder
The following metrics are produced by a nodepool-builder
process:
- nodepool.dib_image_build.<diskimage_name>.<ext>.size (gauge)
This stat reports the size of the built image in bytes. ext is based on the formats of the images created for the build, for example qcow2, raw, vhd, etc.
- nodepool.dib_image_build.<diskimage_name>.status.rc (gauge)
Return code of the last DIB run. Zero is successful, non-zero is unsuccessful.
- nodepool.dib_image_build.<diskimage_name>.status.duration (timer)
Time the last DIB run for this image build took, in ms
- nodepool.dib_image_build.<diskimage_name>.status.last_build (gauge)
The UNIX timestamp of the last time a build for this image returned. This can be useful for presenting a relative time (“X hours ago”) in a dashboard.
- nodepool.image_update.<image name>.<provider name> (counter, timer)
Number of image uploads to a specific provider in the cloud plus the time in ms spent to upload the image.
- nodepool.image_build_requests (gauge)
Number of manual build requests outstanding (does not include currently running builds).
- nodepool.image.<diskimage name>.image_build_requests (gauge)
Number of manual build requests outstanding (does not include currently running builds) for the specified image.
- nodepool.builder.<hostname>.current_builds (gauge)
The number of builds currently in progress.
- nodepool.builder.<hostname>.current_uploads (gauge)
The number of uploads currently in progress.
- nodepool.builder.<hostname>.build_workers (gauge)
The number of simultaneous build workers configured for this builder.
- nodepool.builder.<hostname>.upload_workers (gauge)
The number of simultaneous upload workers configured for this builder.
- nodepool.builder.<hostname>.image.<image name>.build.state (gauge)
Indicates whether a builder is currently building an image. The value will be one of the following constants:
0: idle
1: building
3: paused
- nodepool.builder.<hostname>.image.<image name>.provider.<provider name>.upload.state (gauge)
Indicates whether a builder is currently uploading an image. The value will be one of the following constants:
0: idle
2: uploading
3: paused
Nodepool launcher
The following metrics are produced by a nodepool-launcher
process:
- nodepool.nodes.<state> (counter)
Number of nodes in a specific state.
state can be:
building
deleting
failed
in-use
ready
used
- nodepool.label.<label>.nodes.<state> (counter)
Number of nodes with a specific label in a specific state. See nodepool.nodes for a list of possible states.
- nodepool.tenant_limits.<tenant>.<limit> (gauge)
The currently configured resource limits of a tenant.
limit can be:
cores
instances
ram
Provider Metrics
- nodepool.provider.<provider>.max_servers (gauge)
Current setting of the max-servers configuration parameter for the respective provider.
- nodepool.provider.<provider>.nodes.<state> (gauge)
Number of nodes per provider that are in one specific state. See nodepool.nodes for a list of possible states.
- nodepool.provider.<provider>.leaked
This hierarchy supplies driver-dependent information about leaked resource cleanup. Non-zero values indicate an error situation as resources should be cleaned up automatically.
- nodepool.provider.<provider>.leaked.amis (counter)
Drivers: AWS
Number of leaked AMIs removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.disks (counter)
Drivers: Azure
Number of leaked disks removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.floatingips (counter)
Drivers: OpenStack, IBMVPC
Number of unattached floating IPs removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.images (counter)
Drivers: Azure, IBMVPC
Number of leaked images removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.instances (counter)
Drivers: AWS, Azure, GCE, IBMVPC, OpenStack
Number of nodes not correctly recorded in Zookeeper that Nodepool has cleaned up automatically.
- nodepool.provider.<provider>.leaked.nics (counter)
Drivers: Azure
Number of leaked NICs removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.objects (counter)
Drivers: AWS, IBMVPC
Number of leaked storage objects removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.pips (counter)
Drivers: Azure
Number of leaked public IPs removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.ports (counter)
Drivers: OpenStack
Number of ports in the DOWN state that have been removed.
- nodepool.provider.<provider>.leaked.snapshots (counter)
Drivers: AWS
Number of leaked snapshots removed automatically by Nodepool.
- nodepool.provider.<provider>.leaked.volumes (counter)
Drivers: AWS
Number of leaked volumes removed automatically by Nodepool.
- nodepool.provider.<provider>.pool.<pool>.addressable_requests (gauge)
Number of open node requests a provider pool can address.
Launch metrics
- nodepool.launch.<result> (counter, timer)
Number of launches, categorized by the launch result plus the duration of the launch.
result can be:
ready: launch was successful
error.zksession: Zookeeper session was lost
error.quota: Quota of the provider was reached
error.unknown: Some other error during launch
- nodepool.launch.provider.<provider>.<az>.<result> (counter, timer)
Number of launches per provider per availability zone, categorized by the launch result plus duration of the launch.
See nodepool.launch for a list of possible results.
- nodepool.launch.image.<image>.<result> (counter, timer)
Number of launches per image, categorized by the launch result plus duration of the launch.
See nodepool.launch for a list of possible results.
- nodepool.launch.requestor.<requestor>.<result> (counter, timer)
Number of launches per requestor, categorized by the launch result plus the duration of the launch.
See nodepool.launch for a list of possible results.
OpenStack API metrics
Low level details on the timing of OpenStack API calls will be logged by openstacksdk. These calls are logged under nodepool.task.<provider>.<api-call>. The API call name is of the generic format <service-type>.<method>.<operation>. For example, the GET /servers call to the compute service becomes compute.GET.servers.
Since these calls reflect the internal operations of the openstacksdk, the exact keys logged may vary across providers and releases.
Internal metrics
The following metrics are low-level performance metrics of the launcher itself, primarily of interest to Nodepool developers, and are subject to change in the future as development needs change:
- nodepool.launcher.<hostname>.zk.client.connection_queue (gauge)
ZooKeeper client connection queue length.
- nodepool.launcher.<hostname>.zk.node_cache.event_queue (gauge)
Node cache event queue length.
- nodepool.launcher.<hostname>.zk.node_cache.playback_queue (gauge)
Node cache playback queue length.
- nodepool.launcher.<hostname>.zk.request_cache.event_queue (gauge)
Request cache event queue length.
- nodepool.launcher.<hostname>.zk.request_cache.playback_queue (gauge)
Request cache playback queue length.
- nodepool.launcher.<hostname>.zk.image_cache.event_queue (gauge)
Image cache event queue length.
- nodepool.launcher.<hostname>.zk.image_cache.playback_queue (gauge)
Image cache playback queue length.