Platform Maintenance

Debugging Random Latency in API Response Times

When we observe random latency in API response times, we can perform the following steps to better capture the issue. First, we need to explore the patterns:

  1. Is there any pattern to the commands or API calls that experience the response latency? More specifically, does it occur when calling /v2/info, running cf login, or making API calls related to one app or to all apps?

  2. What is the frequency of the issue? Is it at the minute level, the hour level, or the day level?

  3. Is there any pattern in the time-of-day, day-of-week, or day-of-month that this issue occurs?

Next, in order to get more details when the issue occurs, the following debugging methods can be used:

  1. Repeatedly run commands, following the pattern observed in step one, to reproduce the latency issue (see the sketch after this list).

  2. Set CF_TRACE=true when running the commands.

  3. Look at the output to locate where the latency occurs. For example, when you run cf login -a url -p pass -u user, the CLI first calls /v2/info, gets the UAA endpoint from the response, and then sends a login request to UAA. Examine the HTTP traffic in the trace output to see where the latency happens. Suppose it happens after /v2/info; you can then keep breaking the flow down, for example by running cf login -a api_url on its own, to observe whether the latency happens before or after you provide credentials. Based on where the latency happens, you will check different components to diagnose it.

  4. Check the status and resource usage of the components involved, for example, Gorouters, API nodes, HAProxy, etc.

  5. Look at the logs on the related components. For example, if the latency only happens on /v2/info, you will want to check the logs on the API nodes, as well as any components the request passes through before reaching the API nodes, such as DNS, your load balancer, or HAProxy.
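
As a concrete illustration of the first two methods, the loop below repeatedly times a lightweight API call with tracing enabled and keeps the trace output only for slow responses. It is a minimal sketch: it assumes the cf CLI is already targeted at your API endpoint, and the 2-second threshold, 10-second interval, and /tmp paths are arbitrary choices to adjust for your environment.

    while true; do
      start=$(date +%s%N)
      # /v2/info is a cheap, unauthenticated endpoint, so it makes a good probe
      CF_TRACE=true cf curl /v2/info > /tmp/v2info-trace.log 2>&1
      elapsed_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))
      if [ "$elapsed_ms" -gt 2000 ]; then
        # keep the full trace for any response slower than 2 seconds
        echo "$(date -u +%FT%TZ) slow /v2/info response: ${elapsed_ms}ms" >> /tmp/slow-v2info.log
        cp /tmp/v2info-trace.log "/tmp/v2info-trace-$(date +%s).log"
      fi
      sleep 10
    done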

CF Push App: ERR Downloading Failed

There are many possible causes of “CF push app: ERR Downloading Failed”. This guide walks through an example to show how to debug such problems.

The following error messages were printed out when we started an app.

[cell/0] Creating container for app xx, container successfully created
[cell/0] ERR Downloading Failed
[cell/0] OUT cell-xxxxx stopping instance, destroying the container
[api/0] OUT process crashed with type: "web"
[api/0] OUT app instance exited

The first step is to figure out what failed to download. Knowing how CF pushes, stages, and runs applications, we know that the container had already been created; the next step would have been downloading the droplet from the blobstore so it could be run in that container.

Since it is the cell node that needs to fetch the droplet, we ran bosh ssh to the cell node to look for more detailed logs. Exploring the logs on the cell node, we found a bad TLS error message in the log entries. This told us that the certificates were probably the issue.
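
The exact commands depend on your deployment and instance group names; the following is a sketch that assumes a deployment named cf, a diego-cell instance group, and the rep job's standard BOSH log location.

    bosh -d cf ssh diego-cell/0
    # once on the cell, search the rep job's logs for TLS-related errors
    sudo grep -i "tls" /var/vcap/sys/log/rep/rep.stdout.log | tail -20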

safe has a command, safe x509 validate [path to the cert], which we can use to inspect and validate certificates. With a simple script, we looped through all of the certificates used in the misbehaving CF environment with the safe validate command. The output showed us all of the certificates that had expired.

We then ran safe x509 renew against all of the expired certificates. After double-checking that all of the expired certificates were successfully renewed, we redeployed CF in order to update the certificates.

The redeployment went well for the most part, except that when it came to the cell instances, it hung on the first one indefinitely. We then tried redeploying with the --skip-drain flag; unfortunately, this did not completely solve our issue.

We ran bosh ssh to the hanging cell, manually replaced all of the expired certificates in the config files, and then ran monit restart all on the cell. This nudged the redeployment forward, and we got a healthy, running CF back.
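
A sketch of that manual workaround, assuming a deployment named cf and a diego-cell instance group (the GUID placeholder stands for the hanging instance; the job config path is the standard BOSH layout):

    bosh -d cf ssh diego-cell/<guid-of-hanging-cell>
    # on the cell, after manually replacing the expired certificates in the
    # job config files under /var/vcap/jobs/*/config:
    sudo monit restart all
    sudo monit summary   # wait until all processes report "running"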

Deal with Certs Expiration

This guide is for the case where you use Vault and safe to manage the credentials for your BOSH and CF deployments.

safe x509 validate [OPTIONS] path/to/cert validates a certificate stored in Vault, checking CA signatories, expiration, name applicability, etc.

safe x509 renew [OPTIONS] path/to/certificate renews the cert at the specified path. The -t option can be used to define how long the renewed cert will be valid for; it defaults to the last TTL used to issue or renew the certificate.

A script can be written to iterate over all the certs that need to be validated and renewed, based on the above safe commands.
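
A minimal sketch of such a script is shown below. It assumes your certificates live under a common Vault prefix (secret/my-env here, which is hypothetical) and that the secret paths contain "cert" in their names; adjust the prefix, the filter, and the -t value for your environment.

    # list all secret paths under the prefix, keep the ones that look like certs,
    # and validate each one
    for path in $(safe paths secret/my-env | grep -i cert); do
      echo "=== $path"
      safe x509 validate "$path"
      # once you have confirmed which certs are expired, renew them, e.g.:
      # safe x509 renew "$path" -t 1y
    done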

To take it a step further, you can also use Doomsday to monitor your certs so that you can take action before they expire.

Migrate Your CF From One vSphere Cluster to Another

If you need to migrate your CF from one vSphere cluster to another, you can follow the major steps below, which cover two different scenarios:

vMotion Works When VMs Are Alive

  1. Check that your CF backup completed successfully, if you have one configured

  2. Turn off BOSH resurrection; otherwise BOSH will try to self-recover/recreate VMs that appear to be down while you migrate (commands shown after this list)

  3. Create a new cluster in the same vCenter

  4. vMotion the CF VMs to the new cluster

  5. Delete or rename the old cluster

  6. Rename the new cluster to the old cluster’s name

  7. Turn BOSH resurrection back on

Everything should work as normal in the new cluster after this process.
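
For steps 2 and 7, the BOSH CLI v2 exposes resurrection as a global toggle; a quick sketch:

    bosh update-resurrection off   # step 2: disable resurrection before migrating
    # ...perform the vMotion steps...
    bosh update-resurrection on    # step 7: re-enable resurrection afterwards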

vMotion Does Not Work When VMs Are Alive

vMotion between the two clusters while the VMs are running may not work due to CPU compatibility or other issues between the two clusters. In this case, you have to power off the VMs before you vMotion them. The steps for migration are as follows:

  1. Check that your CF backup completed successfully, if you have one configured

  2. Turn off BOSH resurrection; otherwise BOSH will try to self-recover/recreate VMs that appear to be down while you migrate

  3. Create a new cluster in the same vCenter

  4. Run bosh stop on a subgroup of the VMs so that VMs of the same type are still running to keep the platform working (see the sketch after this list). By default, bosh stop without the --hard flag stops the processes on the VM while keeping the VM and its persistent disk.

  5. Power off the stopped VMs and vMotion them to the new cluster

  6. After vMotion, bring the VMs back up in the new cluster

  7. Repeat the above process until you migrate all the VMs over to the new cluster

  8. Delete or rename the old cluster

  9. Rename the new cluster to the old cluster’s name

  10. Turn BOSH resurrection back on

Everything should work as normal in the new cluster after this process.
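
A sketch of steps 4 through 6 for one batch of cells, assuming a deployment named cf and an instance group named diego-cell (names vary by environment; <guid> is a placeholder for the instance ID):

    # soft stop: processes are stopped, the VM and its persistent disk are kept
    bosh -d cf stop diego-cell/<guid>
    # power the VM off in vCenter, vMotion it to the new cluster, power it back on,
    # then start the processes again
    bosh -d cf start diego-cell/<guid>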

Migrate vSphere Datastore for BOSH and CF

It is extremely important to check that the disks are successfully attached to the new datastore(s) you would like to use before you move forward with your deployments. To migrate your BOSH and CF deployments to a new datastore, you can follow the steps below.

  1. Attach new datastore(s) to the hosts where the BOSH and CF VMs are running (Do not detach the old datastores)

  2. Change the deployment manifest for the BOSH Director to configure the vSphere CPI to reference the new datastore(s), for example:

    host: your_host
    user: root
    password: something_secret
    datacenters:
    - name: BOSH_DC
      vm_folder: sandbox-vms
      template_folder: sandbox-templates
      disk_path: sandbox-disks
      datastore_pattern: '\Anew-sandbox\z' # <---
      persistent_datastore_pattern: '\Anew-sandbox\z' # <---
      clusters: [SANDBOX]
  3. Redeploy the BOSH Director

  4. Verify that the BOSH Director VM’s root, ephemeral, and persistent disks are all now on the new datastore(s)

  5. Run bosh deploy --recreate for the CF deployment so that VMs are recreated and persistent disks are reattached (see the sketch below)

  6. Verify that the persistent disks and VMs were moved to the new datastore(s) and that there are no remaining disks in the old datastore(s)
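
How you redeploy the Director in step 3 depends on how it was created (for example, re-running bosh create-env against the updated manifest). For step 5, a minimal sketch, assuming a deployment named cf and a hypothetical manifest path:

    # recreate all VMs in the cf deployment so they land on the new datastore(s);
    # persistent disks are reattached as part of the recreate
    bosh -d cf deploy cf-manifest.yml --recreate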