Cloud Pak Clone and reinstate tool
Special thanks to Deepak Rangarao for producing this game-changing utility
Introduction
This tool can be used to clone (take a snapshot of) the current state of a cluster at the OpenShift namespace/project level and save it off to an S3 repository. You can subsequently select which clone in S3 you want to reinstate (restore) onto a new OpenShift cluster. The clone process captures all services installed in the cluster, the underlying data, and the assets that have been configured. The utility can run anywhere (laptop, server, cloud, etc.).
Note: This is a container-based delivery, which requires a Docker or Podman environment to execute in.
The supported K8S/OpenShift objects for cloning/reinstatement include:
Security Context Constraint | Service Account | Role
Role Binding | Project/Namespace | Secret
ConfigMap | Persistent Volume | Persistent Volume Claim
Deployment | StatefulSet | Job
CronJob | Service | Custom Resource Definition
Custom Resource
Note: We have tested a clone/reinstate on IBM's TEC, AWS and ROKS clusters with success. The clone/reinstate will work on any OpenShift cluster and with any storage option (currently tested with Portworx, NFS, IBMC-FILE-GOLD-GID and IBMC-FILE-CUSTOM-GOLD-GID). Our focus has been on Cloud Pak for Data, but the utility is architected so that there is no dependency on a specific Cloud Pak, and it should work for other Cloud Paks as well (not yet tested).
There are two main actions in this utility: Clone and Reinstate.
Clone captures a snapshot of your environment. This includes all the persistent data, metadata, and OpenShift artifacts, which are saved off to an S3 bucket.
- Quiesces the system by scaling down all the OpenShift-defined StatefulSets in the desired project.
- It captures all the OpenShift artifacts.
- It captures all the content in the persistent volumes.
- Scales the StatefulSets back up to resume activity.
- Transfers the scripts, logs and data up to the S3 bucket you defined.
- Jobs and pods are cleaned up if CLEANUP=1
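The quiesce and resume steps above can be sketched with plain oc commands. This is a hypothetical illustration of the idea, not the utility's actual code; the StatefulSet names are examples, and DRYRUN=1 only prints the commands that would run:

```shell
# Sketch of the quiesce step: scale StatefulSets down to 0, do the capture,
# then scale them back. DRYRUN=1 prints commands instead of running them.
PROJECT="${PROJECT:-zen}"
DRYRUN="${DRYRUN:-1}"

run() {
  if [ "$DRYRUN" = "1" ]; then echo "DRYRUN: $*"; else "$@"; fi
}

# Quiesce: scale every StatefulSet in the project down to 0 replicas.
# In a real run the list would come from: oc get sts -n "$PROJECT" -o name
for sts in zen-metastoredb zookeeper; do
  run oc scale statefulset "$sts" --replicas=0 -n "$PROJECT"
done

# ... capture artifacts and persistent-volume content here ...

# Resume: scale the StatefulSets back up (real code would restore the
# replica counts recorded before the scale-down, not a fixed 1).
for sts in zen-metastoredb zookeeper; do
  run oc scale statefulset "$sts" --replicas=1 -n "$PROJECT"
done
```

Note that while the StatefulSets are at 0 replicas the services are unavailable, which is why a clone implies an outage.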
Reinstate performs these steps in reverse.
- Creates a namespace in OpenShift
- Deploys OpenShift Jobs to the OpenShift cluster namespace
- Jobs rebuild the Persistent Volumes and Persistent Volume Claims pulling from the clone in S3.
- Jobs rebuild OpenShift objects, including deployments, pulling from the clone in S3.
- Deployments restart and scale up the statefulSets.
- Jobs and pods are cleaned up if CLEANUP=1
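While the reinstate jobs run, you can watch their progress by counting incomplete jobs in the output of oc get jobs. A small helper along these lines (the job names and column layout are illustrative, mirroring the clone-job-NN jobs shown later in this document):

```shell
# Count jobs that have not completed yet, given the plain-text
# output of `oc get jobs -n <project>`.
incomplete_jobs() {
  # Skip the header line; a job is done when COMPLETIONS reads "1/1".
  awk 'NR > 1 && $2 != "1/1" { n++ } END { print n + 0 }'
}

# Demo with captured output; in real use:
#   oc get jobs -n "$PROJECT" | incomplete_jobs
job_sample='NAME           COMPLETIONS   DURATION   AGE
clone-job-32   1/1           2m         5m
clone-job-33   0/1           4m         4m
clone-job-34   0/1           1m         1m'
printf '%s\n' "$job_sample" | incomplete_jobs   # prints 2
```

When the count reaches 0, the rebuild jobs are finished and the pods should be standing up.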
Usage:
The Cloner utility runs as follows (you can use either podman or docker):
docker run <environment variables> quay.io/drangar_us/cpc:cp4d.3.5
podman run <environment variables> quay.io/drangar_us/cpc:cp4d.3.5
The utility gets its arguments via environment variables. There are three ways you can pass these environment variables to the container; how you feed the container its variables is up to you.
- Using an --env-file=<file name> where all arguments are in a file. This is the easiest but requires file editing.
- Using a combination of --env-file=<file name> and -e ARG=<value>. This is easy and flexible.
- Using all -e ARG=<value> arguments. This is by far the most typing.
I am providing a sample cloner.env file, which has all the variables that need to be filled out. It is fairly well documented, but there is a section below with links on how to get the values. Over time this can be expanded to multiple clouds, but for now it's IBM Cloud. Each variable is used either in the file or on the command line. The container will ignore variables not associated with the ACTION, so it is fine to populate all fields in the file.
To clone a Cloud Pak cluster, you will first validate that the environment variables look correct. This command will pull the docker container if it doesn't already exist, then dump the environment variables which will be used. I call these the default actions.
Once you have reinstated your Cloud Pak for Data cluster, you will need to go in and validate a few items. Are your node/hostname names the same for your database connections? Did you use node affinity? If so, did you label the new nodes?
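One way to check the node/hostname question is to compare the node lists of the source and target clusters. A hedged sketch, assuming you captured oc get nodes -o name output from each cluster into a file (the node names below are made up):

```shell
# Nodes present in the old cluster but missing from the new one; database
# connections that referenced these hostnames will need updating.
# Both input files must be sorted (comm requires sorted input).
missing_nodes() {
  comm -23 "$1" "$2"
}

# Demo with fabricated node lists; in real use, capture each with:
#   oc get nodes -o name | sort > old-nodes.txt   (per cluster)
old_nodes=$(mktemp); new_nodes=$(mktemp)
printf 'node/10.93.2.4\nnode/10.93.2.5\n' > "$old_nodes"
printf 'node/10.45.0.7\nnode/10.93.2.5\n' > "$new_nodes"

missing_nodes "$old_nodes" "$new_nodes"   # prints node/10.93.2.4
```

For the node-affinity question, oc get nodes --show-labels on the new cluster will show whether the labels your workloads expect are present.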
Running with just the env file, in the current directory.
- First dump the environment variables by adding env as the final argument:
docker run --env-file=./cloner.env quay.io/drangar_us/cpc:cp4d.3.5 env
tjm$ docker run --env-file=./cloner.env quay.io/drangar_us/cpc:cp4d.3.5 env
Unable to find image 'quay.io/drangar_us/cpc:cp4d.3.5' locally
cp4d.3.5: Pulling from drangar_us/cpc
7a0437f04f83: Already exists
....
....
Digest: sha256:3ca278cfadd6a6b8eea6c8b5810ec8649c65a7ec52c99082cdbd4081b20e984f
Status: Downloaded newer image for quay.io/drangar_us/cpc:cp4d.3.5
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=8e660213f8c8
ACTION=CLONE
COSBUCKET=clone-cpd-lite-3-15-2021
PROJECT=zen
INTERNALREGISTRY=0
ICR_KEY=eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJJQk0gTWFyaK7a7Dafcrg
TARGET_REGISTRY_USER=iamapikey
TARGET_REGISTRY_PASSWORD=2qm6gbuiYBb_oBnLBgAbpE6Gjbn61YVdODRNZRa
REGISTRYPROJECT=partner
TARGET_REGISTRY=us.icr.io
SERVER=https://c114-e.us-south.containers.cloud.ibm.com:30744
TOKEN=UtqtiJSAz4amSrzxDxWKaRYFxZz-966vB0Y
COSAPIKEY=f3669921b8a4451e21b38ab2ba1f
COSENDPOINT=s3.us-south.cloud-object-storage.appdomain.cloud
COSREGION=us-south
COSSECRETKEY=68b44f315c357792d4210df2e25545ab4a6b7fc3b08
SINCRIMAGE=quay.io/drangar_us/sincr:v2
HOME=/root
- Execute the clone
docker run --env-file=./cloner.env quay.io/drangar_us/cpc:cp4d.3.5
Using a combination of environment file and command line arguments
This is easy with flexibility.
- First dump the environment variables by adding env as the final argument:
docker run --env-file=./cloner.env -e ACTION=CLONE -e COSBUCKET=clone-cpd-lite-3-17-2021 quay.io/drangar_us/cpc:cp4d.3.5 env
NOTE: The COSBUCKET now reads clone-cpd-lite-3-17-2021 vs the clone-cpd-lite-3-15-2021 that the file lists. Each clone requires a unique bucket.
- Execute the clone
docker run --env-file=./cloner.env -e ACTION=CLONE -e COSBUCKET=clone-cpd-lite-3-17-2021 quay.io/drangar_us/cpc:cp4d.3.5
Using all command line arguments
- First dump the environment variables by adding env as the final argument:
docker run -e ACTION=CLONE -e SINCRIMAGE=[SINCR image path] -e SERVER=[OCP login server URL] -e OCUSER=[Openshift user name] -e OCPASSWORD=[Openshift user password] -e INTERNALREGISTRY=[0 or 1] -e PROJECT=[Openshift project name] -e COSBUCKET=[S3 bucket name] -e COSAPIKEY=[S3 access key] -e COSSECRETKEY=[S3 secret key] -e COSREGION=[S3 region] -e COSENDPOINT=[S3 end point] quay.io/drangar_us/cpc:cp4d.3.5 env
- Execute the clone
docker run -e ACTION=CLONE -e SINCRIMAGE=[SINCR image path] -e SERVER=[OCP login server URL] -e OCUSER=[Openshift user name] -e OCPASSWORD=[Openshift user password] -e INTERNALREGISTRY=[0 or 1] -e PROJECT=[Openshift project name] -e COSBUCKET=[S3 bucket name] -e COSAPIKEY=[S3 access key] -e COSSECRETKEY=[S3 secret key] -e COSREGION=[S3 region] -e COSENDPOINT=[S3 end point] quay.io/drangar_us/cpc:cp4d.3.5
Setup
Required steps
- Docker or Podman installed.
- Terminal window on your laptop (Mac/Linux) or PowerShell on Windows
- Internet access to the Quay repo where the containers are stored.
- Execute
docker pull quay.io/drangar_us/sincr:v2
If this image is not pulled, REINSTATE will fail based on the faulty CLONE, but you may never know why.
- Execute
docker pull quay.io/drangar_us/cpc:cp4d.3.5
NOTE: The tool will do this for you, but in the interest of time, pull the container.
- If you are running on IBM Cloud, you should run the following prior to installing.
- If you provide an API key from the cloud account and the name of the cluster, you should be able to simply run the following command from a terminal:
curl -X POST -H "Content-Type: application/json" "http://cloudpak-provisioner.ibmcloudroks.net:8000/api/v1/cp4d_preinstallation/" -d '{"apikey": "<API KEY>", "cluster_name": "<CLUSTER NAME>"}'
(make sure when you copy and paste this command that the single and double quotes are correct)
- If your clone used storage class ibmc-file-custom-gold-gid, then you need to add this storageClass following these instructions.
- If you are using DB2 or WKC, you will need to apply the kernel parameter changes, steps 3 and 4.
Verify that you have both containers pulled:
tjm$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
quay.io/drangar_us/cpc cp4d.3.5 51f701144a74 4 days ago 381MB
quay.io/drangar_us/sincr v2 ed3fb3d69bec 8 months ago 399MB
Setting up the environment file: I am providing a cloner.env file with some detailed documentation in the file. However, there are some trickier aspects that need explaining.
We will work from the bottom of the file upwards. The bottom needs the fewest changes and the top the most.
Last Section
- SINCRIMAGE is merely the job cloner image and should rarely, if ever, change.
Moving on to the Cloud Object Storage section
- COSREGION is the S3 region where you will store the cloned files. I had issues at first with us-east, so I know that us-south works. More investigation will happen here.
- COSENDPOINT is the S3 regional endpoint. Using the Cloud Object Storage endpoints URL, pick the public endpoint that aligns with your COSREGION selection. You can use ctrl+f to find the proper endpoint. I chose the public endpoint. More to investigate with private endpoints.
- Steps to take to get COSAPIKEY and COSSECRETKEY values.
- Go to IBM Cloud > Resource list, expand Storage, and click the Cloud Object Storage service listed.
- On the left side, Click Service credentials.
- If this is your first time, Click the New Credentials button.
- Enter a name you will remember
- Keep Writer permissions
- Expand the Advanced Options
- Slide the button to enable Include HMAC Credential
- Click Add
- This will create the new service credential; expand the new credential.
Here is where you will find the values you need:
"cos_hmac_keys": {
    "access_key_id": "a52292cde94547968d4a8ad0767da668",
    "secret_access_key": "582013659ef15ff2fd06e7ef4faed5f0508aa3a5f87a0379"
},
- COSAPIKEY grants access to the Cloud Object Storage service where you will create a bucket for each clone. On IBM Cloud, this is the value under cos_hmac_keys: access_key_id:
- COSSECRETKEY is part of the service credential for your COS or S3 bucket. On IBM Cloud, this is the value of cos_hmac_keys: secret_access_key:
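If you save the service-credential JSON to a file, you can pull the two values out with sed instead of copying them by hand. A sketch under the assumption that the credential JSON is saved as credentials.json (the file name is my choice, and the sample keys mirror the snippet above):

```shell
# Write a sample credential JSON so the extraction below is demonstrable;
# in real use you would paste the credential from the IBM Cloud console.
cat > credentials.json <<'EOF'
{
  "cos_hmac_keys": {
    "access_key_id": "a52292cde94547968d4a8ad0767da668",
    "secret_access_key": "582013659ef15ff2fd06e7ef4faed5f0508aa3a5f87a0379"
  }
}
EOF

# Extract the HMAC pair into the variables the cloner.env file wants.
COSAPIKEY=$(sed -n 's/.*"access_key_id": *"\([^"]*\)".*/\1/p' credentials.json)
COSSECRETKEY=$(sed -n 's/.*"secret_access_key": *"\([^"]*\)".*/\1/p' credentials.json)
echo "COSAPIKEY=$COSAPIKEY"
echo "COSSECRETKEY=$COSSECRETKEY"
```

The two echoed lines can be pasted straight into cloner.env.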
OpenShift Login section
- SERVER This is the OCP cluster login URL, which includes host and port. On IBM Cloud this can change from day to day and region to region, so it may be a good candidate for a -e SERVER=https://c100-e.us-east.containers.cloud.ibm.com:30891 on the command line.
- TOKEN This is the OpenShift cluster credential. On IBM Cloud Managed OpenShift, you will use the results of $(oc whoami -t). This is not needed if you can or do log in using a username and password. If you do use username and password, uncomment OCUSER and OCPASSWORD and comment out TOKEN with a #. This changes often, so it may be a good candidate for a -e TOKEN=<oc login token> on the command line.
- OCUSER This will be used to authenticate to the OpenShift cluster if you commented out TOKEN. You will use the results of $(oc whoami). However, this doesn't work in the env file, so it may be a good candidate for a -e OCUSER=$(oc whoami) on the command line.
- OCPASSWORD This will be used to authenticate to the OpenShift cluster if you commented out TOKEN. You will use the results of $(oc whoami -t). However, this doesn't work in the env file, so it may be a good candidate for a -e OCPASSWORD=$(oc whoami -t) on the command line.
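Since TOKEN and SERVER change often and $(...) does not expand inside an env file, a small wrapper that injects them at run time can be handy. This is my own hypothetical convenience script, not part of the utility; it falls back to placeholders when oc is not on the PATH, and ECHO_ONLY=1 (the default here) prints the command instead of running it:

```shell
# Keep the stable settings in cloner.env and inject the short-lived
# OpenShift credentials on the command line.
if command -v oc >/dev/null 2>&1; then
  TOKEN=$(oc whoami -t)
  SERVER=$(oc whoami --show-server)
else
  # Placeholders so the sketch still runs without an oc login.
  TOKEN="<oc login token>"
  SERVER="<OCP login server URL>"
fi

CMD="docker run --env-file=./cloner.env -e TOKEN=$TOKEN -e SERVER=$SERVER quay.io/drangar_us/cpc:cp4d.3.5"
if [ "${ECHO_ONLY:-1}" = "1" ]; then
  echo "$CMD"   # show what would run
else
  $CMD
fi
```

The same pattern works for OCUSER/OCPASSWORD or TARGET_REGISTRY_USER/TARGET_REGISTRY_PASSWORD when using the internal registry.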
Container Registry Section
This section should not change very often.
- INTERNALREGISTRY This offers 2 options, 1 or 0
- A value of 1 uses the internal registry in your OCP cluster. This option will pull images from IBM Container Registry and push them into the OpenShift cluster internal registry, which will increase reinstate time. You will want to plan accordingly for registry disk space on each cluster.
- A value of 0 uses an external registry outside your OCP cluster. It is becoming a better practice to control your own registry across multiple clusters. Another benefit is that you only load images once and save on disk space on all the other clusters. Instructions on building out your own external registry.
- ICR_KEY This is your entitlement APIKey from the entitlement registry, the same as you would use in repo.yaml. This is found in step 3 of Getting Entitlement to Cloud Pak Software.
- TARGET_REGISTRY_USER This will depend on whether you are using INTERNALREGISTRY=1 or 0.
If INTERNALREGISTRY=1, then you will use the results of $(oc whoami). However, this doesn't work in the env file, so it may be a good candidate for a -e TARGET_REGISTRY_USER=$(oc whoami) on the command line.
If INTERNALREGISTRY=0, then you will use iamapikey from the IAM APIKey you created.
- TARGET_REGISTRY_PASSWORD This will depend on whether you are using INTERNALREGISTRY=1 or 0.
If INTERNALREGISTRY=1, then you will use the results of $(oc whoami -t). However, this doesn't work in the env file, so it may be a good candidate for a -e TARGET_REGISTRY_PASSWORD=$(oc whoami -t) on the command line.
If INTERNALREGISTRY=0, then you will use the APIKey from the IAM APIKey you created.
- REGISTRYPROJECT This will depend on whether you are using INTERNALREGISTRY=1 or 0.
If INTERNALREGISTRY=1, then you will use the same value as repo.yaml, or cp/cpd.
If INTERNALREGISTRY=0, then you will use the values from when you created the external registry.
- TARGET_REGISTRY This will depend on whether you are using INTERNALREGISTRY=1 or 0.
If INTERNALREGISTRY=1, then you will use the same value as registry: -url in repo.yaml, or cp.icr.io.
If INTERNALREGISTRY=0, then you will use the values from when you created the external registry.
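Putting the registry section together, here are two hypothetical cloner.env fragments, one per INTERNALREGISTRY mode. All bracketed values are placeholders, and the $(oc whoami ...) values still have to go on the command line as noted above:

```shell
# --- INTERNALREGISTRY=1: push images into the OCP internal registry ---
# TARGET_REGISTRY_USER / TARGET_REGISTRY_PASSWORD come from
# $(oc whoami) / $(oc whoami -t) on the command line, not from this file.
INTERNALREGISTRY=1
ICR_KEY=<entitlement APIKey>
REGISTRYPROJECT=cp/cpd
TARGET_REGISTRY=cp.icr.io

# --- INTERNALREGISTRY=0: use your own external registry ---
INTERNALREGISTRY=0
ICR_KEY=<entitlement APIKey>
TARGET_REGISTRY_USER=iamapikey
TARGET_REGISTRY_PASSWORD=<IAM APIKey>
REGISTRYPROJECT=<your registry namespace>
TARGET_REGISTRY=<your registry host, e.g. us.icr.io>
```

Only one of the two fragments belongs in your file at a time.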
Top section
- ACTION Offers 2 main options, CLONE or REINSTATE
- CLONE will scale down the StatefulSets, capture a backup of the environment, scale the StatefulSets back up, then copy the backup to the COS bucket you define with COSBUCKET. While the StatefulSets are scaled down there will be a service outage.
- REINSTATE will run a host of oc jobs that will first rebuild the PVs and PVCs, restore the data, then rebuild the deployments, which will start the pods. Depending on the number of services and where you are executing this, the PVC/PV portion can take 10+ minutes, the deployments 5 minutes, plus time for the pods to stand up.
- COSBUCKET This is self-defined and does not need to be created ahead of time. It should be unique enough for you to remember what, when, who and why it was taken.
- PROJECT This is the namespace that your Cloud Pak is installed into. For CPD, I use clonerdemo instead of my 'zen' out of habit, but it is a better practice to use something other than zen or cpd for better results when searching for things. You can look through the projects using oc projects to see which is being used. oc get pods -n <project name> will show you what is there.
- CLEANUP Offers 2 main options, 1 and 0
A value of 1 cleans up the CLONE and REINSTATE jobs that are created during each invocation. I suggest setting this as the default value in this file.
A value of 0 leaves the CLONE and REINSTATE jobs that are created during each invocation. You may want to leave these around to see what was happening. I would suggest overriding the CLEANUP=1 in this file with a -e CLEANUP=0 on the command line. If you try to re-run CLONE on the same cluster/namespace while these jobs are still present, it will not work.
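Because every clone needs a unique COSBUCKET, a date-stamped naming convention saves thinking of a new name each time. A minimal sketch; the clone-<project>-<date> pattern is just my suggestion, not something the tool requires:

```shell
# Build a unique, descriptive bucket name: clone-<project>-<date>.
# S3 bucket names must be lowercase, so squash the project name.
PROJECT="${PROJECT:-zen}"
COSBUCKET="clone-$(echo "$PROJECT" | tr '[:upper:]' '[:lower:]')-$(date +%Y-%m-%d)"
echo "$COSBUCKET"

# Then override the value in cloner.env on the command line:
#   docker run --env-file=./cloner.env -e COSBUCKET="$COSBUCKET" \
#     quay.io/drangar_us/cpc:cp4d.3.5
```

This keeps the bucket name tied to the what and when of the clone, as suggested for COSBUCKET above.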
Troubleshooting
Stopping the utility during execution
- Find out which containers are running using docker ps
tjm$ docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED              STATUS   PORTS   NAMES
86e12bef2e7b   quay.io/drangar_us/cpc:cp4d.3.5   "/bin/sh -c /cloneto…"   About a minute ago   1m               eloquent_bouman
- Issue a docker stop
tjm$ docker stop eloquent_bouman
eloquent_bouman
tjm$
Access the Container
When you execute the container, as long as it is running, you can exec into the container and tail the logs.
- Start by running the command to run the container.
docker run <environment variables> quay.io/drangar_us/cpc:cp4d.3.5
- Find out the name of the running container
docker ps
Toms-MBP:cloner tjm$ docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED          STATUS          PORTS   NAMES
a5a46f2f4e47   quay.io/drangar_us/cpc:cp4d.3.5   "/bin/sh -c /cloneto…"   14 seconds ago   Up 13 seconds           deepak_the_de
- Exec into the running pod using the name on the right. In this sample, the name is deepak_the_de. Run
docker exec -it deepak_the_de bash
Toms-MBP:cloner tjm$ docker exec -it deepak_the_de bash
[root@a5a46f2f4e47 clonetool]#
- List the files.
[root@a5a46f2f4e47 clonetool]# ls
clone  clone.sh  cpc_execution.txt.2021.03.18-01.51.27  reinstate  reinstate.sh  runcpc.sh
Viewing Logs
- Access the container
- List the files.
[root@a5a46f2f4e47 clonetool]# ls
clone  clone.sh  cpc_execution.txt.2021.03.18-01.51.27  reinstate  reinstate.sh  runcpc.sh
- Tail, more or cat the logs
[root@a5a46f2f4e47 clonetool]# tail -f cpc_execution.txt.2021.03.18-01.51.27
deployment.apps/redis-ha-haproxy scaled
deployment.apps/spaces scaled
deployment.apps/spawner-api scaled
...
...
zen-metastoredb statefulset.apps/zen-metastoredb scaled
zookeeper statefulset.apps/zookeeper scaled
Info: Exporting manifests
Info: Exporting project to clone/specs/0-projects.json
Info: Exporting secrets to clone/specs/1-secrets.json
Info: Exporting serviceaccount to clone/specs/2-serviceaccounts.json
Info: Exporting scc to clone/specs/3-securitycontextconstraints.json
Info: Exporting roles to clone/specs/4-roles.json
...
...
job.batch/clone-job-32 created
Info: Starting data backup for user-home-pvc
job.batch/clone-job-33 created
Info: Starting data backup for zookeeper-data-zookeeper-0
job.batch/clone-job-34 created
Info: Wating for all clone job's to complete [Elapsed time:60]...
Info: Wating for all clone job's to complete [Elapsed time:120]...
S3 connectivity issues
- Access the container
- Using s3cmd, verify that you can list buckets
s3cmd ls
[root@243b0735ba65 clonetool]# s3cmd ls
2021-03-18 02:59  s3://clone-cpd-lite-wkc-3-17-2021
2021-03-16 02:45  s3://clone1-datacp-mar15-lite
- Verify that you can make a bucket
s3cmd mb s3://<name>
[root@243b0735ba65 clonetool]# s3cmd mb s3://testus-south-bucket
Bucket 's3://testus-south-bucket/' created
- If these fail, review the S3 parameters that were passed, using cat ~/.s3cfg. Also verify with the actual COS admin or configuration. I have used the APIKey as the access_key. I also had an issue with the s3.us-east.cloud endpoint. Will need to look into the config for use_https=true in the future.
[root@243b0735ba65 clonetool]# cat ~/.s3cfg
host_base=s3.us-south.cloud-object-storage.appdomain.cloud
host_bucket=s3.us-south.cloud-object-storage.appdomain.cloud
use_https=False
bucket_location=us-south
access_key=f3669921b8a44a91b38ab2ba1f
secret_key=68b44f315c357792ddf2e25545ab4a6b7fc3b08
signature_v2=False
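The list-and-make-bucket checks above can be wrapped into one sanity check. A hedged sketch meant to be run inside the cpc container (where s3cmd is configured); it degrades to a message when s3cmd is not installed, and the throwaway bucket name is my own convention:

```shell
# Quick S3 sanity check: list buckets, then create and remove a
# throwaway bucket to prove write access.
s3_sanity_check() {
  if ! command -v s3cmd >/dev/null 2>&1; then
    echo "s3cmd not found - run this inside the cpc container"
    return 0
  fi
  s3cmd ls || { echo "listing buckets failed - check ~/.s3cfg"; return 1; }
  bucket="s3://connectivity-check-$(date +%s)"
  s3cmd mb "$bucket" && s3cmd rb "$bucket"
}

s3_sanity_check
```

If the make/remove step fails while listing works, the credentials are read-only; recheck that the service credential was created with Writer permissions as described in the COS section.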
Clone failing to upload to S3
I had one scenario where ACTION=CLONE was running and would not upload to the S3 bucket. After looking through the source cluster, it turned out I did not have a secret set up for norootsquash, which delayed the StatefulSets from starting, so the clone timed out before the upload happened. This was not a utility issue but a cluster configuration issue. It was resolved by running the following:
oc project kube-system
oc create secret docker-registry cpregistrysecret --docker-server=cp.icr.io/cp/cpd --docker-username=cp --docker-password=<apikey> --docker-email=<email associated with apikey>
- apikey can be generated here
- Download the norootsquash.yaml file.
oc apply -f norootsquash.yaml
Reinstate fails to create namespace
When I was executing the ACTION=REINSTATE, I was passing it a bogus bucket name.
[root@243b0735ba65 clonetool]# more cpc_execution.txt.2021.03.18-12.25.49
Info: S3 config does not exist, create one ..
Info: -rw-r--r--
Info: Start Time : 2021.03.18-12.25.49
Info: Checking to see if the S3 connection parameters are correct
Info: Downloading template from COS
ERROR: S3 error: 404 (NoSuchBucket): The specified bucket does not exist.
ERROR: S3 error: 404 (NoSuchBucket): The specified bucket does not exist.
Info: Update any spec as required
/clonetool/reinstate.sh: line 122: /clonetool/reinstate/specs/fixspec.sh: No such file or directory
Info: reinstante specs not found
cat: reinstate/specs/cloneddomain.txt: No such file or directory
Info: Reinstate project and apply prerequisites
When looking at the logs I didn't see output like the following, which would indicate a healthy download. Instead the run skips over this and tries to execute the oc build, which it can't, and hence doesn't create the namespace. NOTE: A future enhancement is to check whether downloads are successful and, if not, error out.
[root@81d90ecb4856 clonetool]# more cpc_execution.txt.2021.03.18-13.09.31
Info: S3 config does not exist, create one ..
Info: -rw-r--r--
Info: Start Time : 2021.03.18-13.09.31
Info: Checking to see if the S3 connection parameters are correct
Info: Downloading template from COS
download: 's3://clone-cpd-lite-wkc-3-17-2021/state/container_image_list.txt' -> 'reinstate/state/container_image_list.txt' (2940 bytes in 0.3 seconds, 10.14 KB/s) [1 of 5]
download: 's3://clone-cpd-lite-wkc-3-17-2021/state/deployment_replicas.csv' -> 'reinstate/state/deployment_replicas.csv' (1501 bytes in 0.4 seconds, 3.78 KB/s) [2 of 5]
download: 's3://clone-cpd-lite-wkc-3-17-2021/state/deployments.json' -> 'reinstate/state/deployments.json' (3515304 bytes in 1.7 seconds, 2027.11 KB/s) [3 of 5]
download: 's3://clone-cpd-lite-wkc-3-17-2021/state/statefulset_replicas.csv' -> 'reinstate/state/statefulset_replicas.csv' (362 bytes in 0.1 seconds, 5.19 KB/s) [4 of 5]
download: 's3://clone-cpd-lite-wkc-3-17-2021/state/statefulsets.json' -> 'reinstate/state/statefulsets.json' (766003 bytes in 0.7 seconds, 1066.76 KB/s) [5 of 5]
Done. Downloaded 4286110 bytes in 3.1 seconds, 1334.25 KB/s.
download: 's3://clone-cpd-lite-wkc-3-17-2021/specs/0-projects.json' -> 'reinstate/specs/0-projects.json' (1772 bytes in 0.2 seconds, 7.16 KB/s) [1 of 20]
Reinstating a cloned environment which used ibmc-file-custom-gold-gid
When cloning an existing environment, pay attention to the storage classes. By default, Managed OpenShift provides a certain set of storageClasses; ibmc-file-custom-gold-gid is not created by default. The utility will proceed, but after a while the reinstate jobs will fail or you will see no progression, because the PVs and PVCs are not being provisioned. To resolve this:
Create the storage class:
- Download storageclass-ibmc-file-custom-gold-gid.yaml
- Log into the new target OpenShift cluster.
- run
kubectl apply -f storageclass-ibmc-file-custom-gold-gid.yaml
You should start seeing the storage being provisioned and progression continuing.
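You can confirm that provisioning has resumed by counting PVCs still in Pending state. A small helper that parses oc get pvc output (the sample output and PVC names below are illustrative):

```shell
# Count PVCs in Pending state from `oc get pvc -n <project>` output.
pending_pvcs() {
  # Skip the header line; STATUS is the second column.
  awk 'NR > 1 && $2 == "Pending" { n++ } END { print n + 0 }'
}

# Demo with captured output; in real use:
#   oc get pvc -n "$PROJECT" | pending_pvcs
pvc_sample='NAME            STATUS    VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
user-home-pvc   Bound     pvc-9c1a   100Gi      RWX            ibmc-file-custom-gold-gid
datadir-db2-0   Pending                                        ibmc-file-custom-gold-gid'
printf '%s\n' "$pvc_sample" | pending_pvcs   # prints 1
```

Once the count drops to 0, the storage class fix has taken effect and the reinstate jobs should progress again.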
Reinstated WKC or DB2 pods fail to start
This occurs when the prerequisite kernel parameters and secrets for DB2/WKC are not in place. Please refer back to the required steps. Running the following command on the target cluster will show these: oc get pods | grep '0/1' | grep -v Completed
In this example, the culprit is wdp-db2-0.