BOSH is an Open Source tool for orchestrating deployment, lifecycle management, and monitoring of distributed systems. To learn more about BOSH, visit Ultimate Guide to BOSH, or the official BOSH documentation.
Before you can run any BOSH commands, you should set up an alias for the director in your local BOSH configuration and authenticate.
Via Genesis, this is done with the alias addon:
$ genesis do my-env alias
Running alias addon for my-env
Using environment 'https://10.128.80.0:25555' as user 'admin'
Name my-env-bosh
UUID 1f7de7f1-bb35-4b12-9f2a-556c1dd77958
Version 269.0.1 (00000000)
Director Stemcell ubuntu-xenial/315.34
CPI vsphere_cpi
Features compiled_package_cache: disabled
config_server: enabled
local_dns: enabled
power_dns: disabled
snapshots: disabled
Succeeded
Creating this alias allows future BOSH commands to specify just -e my-env to target this BOSH director, instead of the full URL and certificate authority certificate.
Now that we have an alias, we can similarly use the login addon to log in:
$ genesis do my-env login
Running login addon for my-env
Logging you in as user 'admin'...
Using environment 'https://10.128.80.0:25555'
Email (): admin
Password ():
Successfully authenticated with UAA
Succeeded
You can use the bosh env command to verify that you are logged in:
$ bosh -e my-env env
Using environment 'https://10.128.80.0:25555' as user 'admin'
Name my-env-bosh
UUID 1f7de7f1-bb35-4b12-9f2a-556c1dd77958
Version 269.0.1 (00000000)
Director Stemcell ubuntu-xenial/315.34
CPI vsphere_cpi
Features compiled_package_cache: disabled
config_server: enabled
local_dns: enabled
power_dns: disabled
snapshots: disabled
User (not logged in)
Succeeded
You only need to log in for interactive (i.e. jumpbox) use. For automated scripts, you can set the BOSH_CLIENT_ID and BOSH_CLIENT_SECRET environment variables to admin and the password:
$ export BOSH_CLIENT_ID=admin
$ export BOSH_CLIENT_SECRET=$(safe read secret/your/env/bosh/users/admin:password)
$ bosh env
... etc ...
The Genesis BOSH Kit makes the admin client secret and the admin user password identical, so this works in the general case.
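If the script runs somewhere that does not have the alias configured, the director's address and CA certificate can also be supplied through environment variables. A minimal sketch, assuming the CA certificate has been saved to a local file (the address and path below are placeholders):
$ export BOSH_ENVIRONMENT=https://10.128.80.0:25555
$ export BOSH_CA_CERT=/path/to/director-ca.pem
$ bosh deployments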
When BOSH goes to deploy software, it does so by way of BOSH Releases, a native packaging format specific to BOSH. Genesis Kits take care of uploading their own releases, but for custom add-ons, or manual deployments, you may need to upload a release or two yourself.
To upload a release:
$ bosh upload-release path/to/release-1.2.3.tar.gz
You can also upload by URL:
$ bosh upload-release https://some-host/path/to/release-1.2.3.tar.gz
To see what releases have been uploaded, use bosh releases:
$ bosh releases
Using environment 'https://10.200.130.1' as client 'admin'
Name Version Commit Hash
binary-buildpack 1.0.14* cdf2d3ff+
~ 1.0.11 60f6b0e9+
bosh 264.6.0 930eb48+
~ 264.5.0* e522d81+
bosh-vsphere-cpi 45.1.0 45d0f21
~ 45* 857f3d2
(*) Currently deployed
(+) Uncommitted changes
6 releases
Succeeded
If a release has a +
next to its commit hash, that means that the release
was created while there were still local changes to its git repository. For
in-house releases, this could indicate that the release cannot be properly
recreated, because changes may not have been committed after they were
incorporated into the release.
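If you want to see exactly what a particular uploaded release contains, bosh inspect-release takes a name/version pair and lists its jobs and packages (the release shown here is one from the listing above):
$ bosh inspect-release bosh/264.6.0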
BOSH only handles image-based deployment. The images it uses are called Stemcells, because they can specialize into whatever VM type you need through BOSH releases. Each Cloud / IaaS has its own set of Stemcells that are tailored to its peculiarities.
Before you can deploy anything, you will need to upload a stemcell for your platform.
Genesis has an interactive addon for this as well:
$ genesis do my-env upload-stemcells
Running upload-stemcells addon for my-env
Select the release family for the vsphere-esxi ubuntu-xenial stemcell you wish to upload:
...
To upload a stemcell manually, or to upload a stemcell not supplied by the kit:
$ bosh upload-stemcell path/to/stemcell.tgz
or, specify a remote URL:
$ bosh upload-stemcell https://some-host/path/to/stemcell.tgz
To see what stemcells have already been uploaded:
$ bosh stemcells
Using environment 'https://10.200.130.1' as client 'admin'
Name Version OS CPI CID
bosh-vsphere-esxi-ubuntu-trusty-go_agent 3468.21* ubuntu-trusty - sc-cf483483-1be8-4a53-a244-378e89addf74
~ 3468.13* ubuntu-trusty - sc-0fe9bcd7-6010-4e30-812f-49d69c71aed2
~ 3445.24 ubuntu-trusty - sc-3c54878e-7161-41f4-b8f6-d24f7d037bd7
(*) Currently deployed
3 stemcells
Succeeded
If you attempt to upload a stemcell or BOSH release with the same name and version as one that already exists on the BOSH director, nothing will happen.
Occasionally, however, you need to overwrite a stemcell or release with a better copy. Perhaps the file didn’t download successfully and you uploaded a corrupt copy. Perhaps the BOSH director ran out of disk space and only partially processed the file upload.
Whatever the reason, the upload-stemcell
and upload-release
commands sport a --fix
flag for just this situation:
$ genesis do my-env upload-stemcells --fix
$ bosh upload-stemcell --fix path/to/stemcell.tgz
$ bosh upload-release --fix path/to/release.tgz
Over time, your BOSH director will accumulate releases and stemcells that it no longer needs. If you are diligent about patching systems when new stemcells come out, a lot of director disk space will be used by older stemcells that you no longer need. Likewise, if you update your deployments to the latest and greatest releases regularly, you’ll have a lot of unused release archives on-disk.
To clean them up, use the bosh clean-up
command:
$ bosh clean-up
Yes, there’s a hyphen in the middle there.
The clean-up
command deletes most of the unused stemcells and
releases. Stemcells will be removed from the underlying cloud /
IaaS; releases will be removed from the BOSH blobstore. The most
recent two unused releases and stemcells will remain, in case you
need to downgrade a deployment to a previous revision.
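If you want to reclaim as much space as possible and remove everything that is not currently in use, clean-up also accepts an --all flag; use it with care, since re-uploading large stemcells and releases later can be time-consuming:
$ bosh clean-up --all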
BOSH cloud config is a YAML file that defines IaaS-specific configuration properties used by the director for deployments. These include things like VM types, networking, availability zones, etc.
For full details on all the fun IaaS-specific options, refer to the BOSH Cloud Config documentation.
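As a rough illustration, a minimal cloud config has this general shape (the names, sizes, and network ranges below are illustrative placeholders, and the cloud_properties you need depend entirely on your IaaS):
azs:
- name: z1
vm_types:
- name: small
  cloud_properties:
    cpu: 2
    ram: 4096
    disk: 20480
disk_types:
- name: 10GB
  disk_size: 10240
networks:
- name: default
  type: manual
  subnets:
  - range: 10.200.130.0/24
    gateway: 10.200.130.1
    azs: [z1]
    dns: [10.200.130.1]
compilation:
  workers: 3
  az: z1
  vm_type: small
  network: default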
You can get the current cloud-config like this:
$ bosh cloud-config
$ bosh cloud-config > cloud.yml
Saving your cloud config to a file is a great way to make changes
to it. Download the current config, modify it, and then upload
the new version to BOSH. That last step is handled by update-cloud-config:
$ bosh cloud-config > cloud.yml
$ vim cloud.yml
$ bosh update-cloud-config cloud.yml
Every time you give BOSH a new cloud-config, it will mark all deployments as outdated until they are re-deployed with the latest configuration.
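To see which deployments still need to be re-deployed, bosh deployments includes a Cloud Config column that reports whether each deployment was last deployed with the latest or an outdated cloud config:
$ bosh deployments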
BOSH uses a facility called runtime configs to inject configuration and software into its deployments, without having to modify the existing deployment manifest. These addons can be anything: extra utilities, a virus scanner, firewall and intrusion detection software, monitoring agents etc.
The following runtime configuration deploys the excellent Toolbelt BOSH release to all VMs, enriching the on-box troubleshooting experience:
addons:
- name: toolbelt
  jobs:
  - name: toolbelt
    release: toolbelt
releases:
- name: toolbelt
  version: 3.4.2
  url: https://github.com/cloudfoundry-community/toolbelt-boshrelease/releases/download/v3.4.2/toolbelt-3.4.2.tgz
  sha1: 2b4debac0ce6115f8b265ac21b196dda206e93ed
Genesis has an interactive addon to help generate the runtime-config:
$ genesis do my-env runtime-config
Running runtime-config addon for my-env
...
You can get the current runtime-config like this:
$ bosh runtime-config
$ bosh runtime-config > runtime.yml
As with cloud configs, saving your runtime config to a file is a
great way to make changes to it. Download the current config,
modify it, and then upload the new version to BOSH. That last
step is handled by update-runtime-config:
$ bosh runtime-config > runtime.yml
$ vim runtime.yml
$ bosh update-runtime-config runtime.yml
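Newer directors can also hold several named configs at once. To see every config (cloud, runtime, and cpi) currently known to the director:
$ bosh configs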
For more information, check the BOSH Runtime Config documentation.
There are a few interesting bits of information you can get out of BOSH, with respect to the health of the VMs it has deployed.
First up, bosh vms shows you the agent status:
$ bosh -d vault vms
Instance Process State AZ IPs VM CID VM Type
vault/0 running z1 10.200.130.6 vm-98627dfd small
vault/1 failing z1 10.200.130.5 vm-5c9638b1 small
vault/2 running z1 10.200.130.4 vm-a59d7f16 small
The possible values for Process State are:
running - Everything is OK
failing - The VM is up, but the deployed software isn't
unresponsive agent - The BOSH director hasn't heard from the agent on the VM in a while.
You can also get system vitals out of BOSH:
$ bosh -d vault vms --vitals
The newer bosh instances provides similar information:
$ bosh -d vault instances
Instance Process State AZ IPs
vault/0 running z1 10.200.130.6
vault/1 failing z1 10.200.130.5
vault/2 running z1 10.200.130.4
To get detailed information about each instance, pass --ps:
$ bosh -d vault instances --ps
Instance Process Process State AZ IPs
vault/0 - running z1 10.200.130.6
~ consul running - -
~ strongbox running - -
~ vault running - -
vault/1 - running z1 10.200.130.5
~ consul running - -
~ strongbox running - -
~ vault failing - -
vault/2 - running z1 10.200.130.4
~ consul running - -
~ strongbox running - -
~ vault running - -
Something is wrong with the actual Vault process on vault/1.
Persistent disks are vital to any deployments that involve durable
data, like databases and storage solutions. If you need to figure
out which instances in a deployment have been assigned persistent
disks, you can use the --details flag to bosh instances:
$ bosh -d vault instances --details
Instance Process State AZ IPs State VM CID VM Type Disk CIDs
vault/0 running z1 10.200.130.6 started vm-98627dfd small disk-8970b8d2
vault/1 running z1 10.200.130.5 started vm-5c9638b1 small disk-d0ccdc58
vault/2 running z1 10.200.130.4 started vm-a59d7f16 small disk-204c8403
If there is a value in the Disk CIDs column, that instance has been given a persistent disk.
To get a remote shell on a BOSH-deployed instance, you can use the bosh ssh command:
$ bosh -d vault ssh vault/1
BOSH will provision you a temporary user account with sudo
access, and then run the appropriate ssh
commands to log into
the instance, remotely, as that user.
From there, you can look at logs, restart jobs and processes, and otherwise diagnose and troubleshoot.
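You can also run a one-off command on a single instance (or across a whole instance group) without opening an interactive shell, which is handy for quick checks; the commands below are just examples:
$ bosh -d vault ssh vault/1 -c 'uptime'
$ bosh -d vault ssh vault -c 'df -h /var/vcap/store'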
If BOSH is deploying instances behind a NAT device, you may need a
gateway to bounce through for SSH access. All this gateway
needs is SSH access itself. The --gw-* options take care of the configuration:
$ bosh -d vault ssh \
--gw-host your-gateway-ip \
--gw-user username \
--gw-private-key path/to/user/key \
vault/1
Once you’ve SSHed onto a deployed instance, you can see what the
software you’re trying to deploy has been up to by perusing the
logs. BOSH releases almost always store logs under
/var/vcap/sys/log, instead of more traditional places.
Often, each component of the deployment will have a directory
under /var/vcap/sys/log
; logs live under those. Often, releases
will split their standard output and standard error streams into
separate log files, suffixed .stdout.log and .stderr.log.
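For example, on one of the Vault instances from earlier, you might start with something like this (the directory and file names depend entirely on the jobs the release installs):
$ ls /var/vcap/sys/log
$ tail -n 50 /var/vcap/sys/log/vault/vault.stderr.log
$ tail -f /var/vcap/sys/log/*/*.log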
BOSH uses a system called Monit to supervise the processes that
make up the software it deploys. In addition to restarting
defunct processes, Monit also informs the BOSH director of the
health and state of the pieces of each deployment. This is where
the Process State values in bosh vms
/ bosh instances
output
come from.
If you SSH into an instance, you can use the monit
command (as
the root user) to see what’s going on and restart processes.
$ monit summary
The Monit daemon 5.2.5 uptime: 1d 1h 10m
Process 'shield-agent' failing
Process 'vault' running
Process 'shieldd' running
Process 'nginx' running
System 'system_localhost' running
To restart a failing process:
$ monit restart shield-agent
It’s usually best to follow that up with:
$ watch monit summary
The watch
command will run monit summary
every 2 seconds, and
keep its output on the screen in the same place, making it easy to
notice when the process flips from initializing to running.
If you suspect that the IaaS / Cloud layer is acting up, either by removing VMs or losing disk attachments, you can run a cloud check against a deployment.
When BOSH runs a cloud check, it takes inventory of the VMs and disks that it ought to have, and compares that with what it actually has. If it finds any discrepancies, you'll be asked to resolve each one individually.
$ bosh -d vault cloud-check
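If you just want a report of the problems without being prompted to fix anything, or conversely want to accept the default resolutions unattended, cloud-check has flags for both (be careful with --auto):
$ bosh -d vault cloud-check --report
$ bosh -d vault cloud-check --auto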
For more information, check out the BOSH Cloud Check documentation.
Errands are a special type of one-off task that a BOSH release can define. Errands can do things like apply database migrations, initialize systems, run smoke tests, conduct an inventory of a cluster, and more.
To see what errands are available, consult with your Genesis Kit documentation, or the documentation that came with the BOSH releases you are deploying.
To see what errands are runnable:
$ bosh errands
To run an errand, specify it by name:
$ bosh run-errand my-errand
When errands fail, they print error logs to standard error, and then exit. BOSH then deprovisions the errand VM, making it difficult to diagnose things like connectivity issues or authentication problems.
If you specify the --keep-alive
flag when you run the errand,
however, BOSH will not perform this cleanup step. You can then
bosh ssh
into the VM to perform your troubleshooting.
$ bosh run-errand my-errand --keep-alive
... wait for the failure ...
$ bosh ssh my-errand
Once you figure out the problem and correct it, you will want to
run the errand again without the --keep-alive
flag to get BOSH
to clean up the errand VM one last time.
If BOSH lists an instance as unresponsive agent
, it means it
hasn’t heard from the agent, via the NATS message bus, in a while.
BOSH doesn’t initiate conversation with deployed instances; it waits for them to contact it via the message bus. Sometimes this fails because of networking configuration between the instance and the director. Other times, TLS certificates get in the way.
Often, a single unresponsive agent in an otherwise healthy deployment will clear up on its own. Unless it's an emergency, give the system some time to coalesce and see if it recovers.
If lots of agents become unresponsive, it could point to a systemic or network-wide issue, like a bad route, failing router, misconfigured firewall, etc. In these cases, start troubleshooting at the network and work your way back to the BOSH director.
As a last resort, you can have BOSH forcibly recreate the instances via the bosh recreate command. With --fix, it will recreate instances with unresponsive agents instead of erroring out.
$ bosh -d vault recreate --fix
This will detach any persistent disks (so that they survive), delete the running virtual machine instances, and bring up new copies.
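If only one instance is affected, you can scope the command to just that instance instead of recreating the whole deployment, for example:
$ bosh -d vault recreate vault/1 --fix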
Everything BOSH does it does via tasks. You can use the bosh tasks
command to get a list of currently executing and recently executed tasks:
$ bosh tasks
$ bosh tasks -r
Each task has a number, and you can use that number to identify that task and interact with it via the BOSH command-line utility.
To view a task and follow its output (à la tail -f):
$ bosh task 12345
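If the task's regular output isn't enough to see what went wrong, you can ask for the director's more verbose log of the same task:
$ bosh task 12345 --debug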
You can also cancel a task if you know its ID:
$ bosh cancel-task 12345
However, be aware that BOSH often cannot interrupt a task to cancel it immediately; instead, it has to wait for a “lull” in task processing. For example, when a deployment task is canceled, it will finish deploying to the instances it is currently working on, and only then will BOSH actually cancel it. You may not even be able to cancel a task to delete a deployment, depending on when you get to it.
For more details, refer to the BOSH Tasks documentation.
As of v263, BOSH directors can support multiple CPIs, which lets a single director manage VMs across several different IaaS endpoints. You could, for example, deploy one BOSH director that talks to three different vCenters.
This is a more advanced subject, but it’s really neat, so we want to include it in the runbook. A full write-up can be found on the Stark & Wayne blog, but here’s a summary:
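At a high level, the director is given a CPI config that names each CPI, and the cloud config's availability zones then reference those names via a cpi key. A rough sketch, with purely illustrative names and properties (consult the CPI's own documentation for the exact property names it expects):
cpis:
- name: vsphere-dc1
  type: vsphere
  properties:
    host: vcenter-dc1.example.com
    user: svc-bosh
    password: ((vcenter_dc1_password))
- name: vsphere-dc2
  type: vsphere
  properties:
    host: vcenter-dc2.example.com
    user: svc-bosh
    password: ((vcenter_dc2_password))
Upload it with bosh update-cpi-config cpis.yml, then set cpi: vsphere-dc1 (or vsphere-dc2) on each AZ in your cloud config.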
We recommend that you back up the database on the BOSH director. We have a tool called shield to help you back up your BOSH director and more; this runbook also has a section on shield to help you get started.
Since BOSH only allows the database to be accessed from the director itself, you have to colocate the shield agent on the BOSH director.
If your BOSH is deployed by another BOSH, you can add the shield agent through the runtime config (see the sketch below); if your BOSH is deployed via bosh create-env, you can add the shield agent to the instance groups in its manifest.
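Following the same addon pattern as the toolbelt runtime config shown earlier, a runtime config entry for the agent might look roughly like this; the job name, release name, version, and required properties all depend on the shield release you actually use, so treat this as a sketch rather than copy-and-paste configuration:
addons:
- name: shield
  include:
    deployments: [my-env-bosh]   # only colocate on the director's deployment
  jobs:
  - name: shield-agent
    release: shield
releases:
- name: shield
  version: x.y.z   # the version you uploaded to the director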
We do not recommend backing up the BOSH blobstore unless you absolutely need to.
Shield has a local file plugin which can be used to back up the local dav blobstore. However, since the database is backed up separately and there is no locking during the backup process, you cannot guarantee that the database and blobstore backups are in a consistent state.
Shield also has a bbr plugin which you can use to back up the database, Credhub, UAA, and blobstore of the BOSH director. Be aware that during the backup process, Credhub and UAA will be locked and read-only, and the backup can take hours to run, depending on the size of your BOSH blobstore.