IBMPartnerDemo


Cloud Pak Clone and reinstate tool

Special thanks to Deepak Rangarao for producing this game-changing utility

Introduction

This tool can be used to clone (take a snapshot of) the current state of a cluster at the OpenShift namespace/project level and save it off to an S3 repository. You can subsequently select which clone in S3 you want to reinstate (restore) onto a new OpenShift cluster. The clone process captures all services installed in the cluster, the underlying data, and the assets that have been configured. The utility can run anywhere (laptop, server, cloud, etc.).

Note: This is a container-based delivery, which requires a Docker or Podman environment to execute in.

The supported K8S/OpenShift objects for cloning/reinstatement include:

  • Security Context Constraint
  • Service Account
  • Role
  • Role Binding
  • Project/Namespace
  • Secret
  • ConfigMap
  • Persistent Volume
  • Persistent Volume Claim
  • Deployment
  • StatefulSet
  • Job
  • CronJob
  • Services
  • Custom Resource Definitions
  • Custom Resources
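
As a quick sanity check before a clone, most of these objects can be listed in the source project with standard oc queries (an illustration only, not part of the utility; SCCs, Persistent Volumes, and Custom Resource Definitions are cluster-scoped, so they are listed without -n):

    oc get serviceaccounts,roles,rolebindings,secrets,configmaps,pvc,deployments,statefulsets,jobs,cronjobs,services -n <project name>
    oc get scc,pv,crd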


Note: We have tested a clone/reinstate on IBM's TEC, AWS, and ROKS clusters with success. The clone/reinstate will work on any OpenShift cluster and with any storage option (currently tested with Portworx, NFS, IBMC-FILE-GOLD-GID and IBMC-FILE-CUSTOM-GOLD-GID). Our focus has been on Cloud Pak for Data, but the utility is architected so that there is no dependency on a specific Cloud Pak, and it should work for other Cloud Paks as well (not yet tested for other Cloud Paks).

There are two main actions in this utility: Clone and Reinstate.
Clone captures a snapshot of your environment, including all the persistent data, metadata, and OpenShift artifacts, and saves these off to an S3 bucket. It performs the following steps:

  1. Quiesces the system by scaling down all the StatefulSets defined in the desired OpenShift project.
  2. It captures all the OpenShift artifacts.
  3. It captures all the content in the persistent volumes.
  4. Scales the StatefulSets back up so they resume activity.
  5. Transfers the scripts, logs, and data up to the S3 bucket you defined.
  6. Jobs and pods are cleaned up if CLEANUP=1
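
While a clone is running, you can watch the StatefulSets being scaled down and back up in the project with a standard oc command (an illustration, not part of the utility):

    oc get statefulsets -n <project name>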

Reinstate performs the process in reverse.

  1. Creates a namespace in OpenShift
  2. Deploys OpenShift Jobs up to the OpenShift cluster namespace
  3. Jobs rebuild the Persistent Volumes and Persistent Volume Claims pulling from the clone in S3.
  4. Jobs rebuild the OpenShift objects, including deployments, pulling from the clone in S3.
  5. Deployments restart and the StatefulSets scale back up.
  6. Jobs and pods are cleaned up if CLEANUP=1
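
While a reinstate is running, you can monitor the jobs it creates on the target cluster with a standard oc command (an illustration, not part of the utility):

    oc get jobs -n <project name> -w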

Usage:

The Cloner utility runs as follows (you can use either podman or docker):

  • docker run <environment variables> quay.io/drangar_us/cpc:cp4d.3.5
  • podman run <environment variables> quay.io/drangar_us/cpc:cp4d.3.5


The utility gets its arguments via environment variables. There are 3 ways you can pass these environment variables to the container. How you feed the container its variables is up to you.

  • Using an --env-file=<file name> where all arguments are in a file. This is the easiest but requires file editing.
  • Using a combination of --env-file=<file name> and -e ARG=<value>. This gives you ease with flexibility.
  • Using all -e ARG=<value> arguments. This is by far the most typing.


I am providing a sample cloner.env file, which has all the variables that need to be filled out. It is fairly well documented, but there is a section below with links on how to get the values. Over time, this can be expanded to multiple clouds, but for now it covers IBM Cloud. Each variable is used either in the file or on the command line. The container will ignore variables not associated with the chosen ACTION, so it is fine to populate all fields in the file.
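
As an illustration only (the placeholder values are hypothetical; the cloner.env distributed with the utility is the authoritative version), here is a minimal sketch of the file, ordered the way the sections are described below:

    # Top section
    ACTION=CLONE
    COSBUCKET=<unique bucket name for this clone>
    PROJECT=zen
    CLEANUP=1
    # Container Registry section
    INTERNALREGISTRY=0
    ICR_KEY=<entitlement registry APIKey>
    TARGET_REGISTRY_USER=iamapikey
    TARGET_REGISTRY_PASSWORD=<IAM APIKey>
    REGISTRYPROJECT=partner
    TARGET_REGISTRY=us.icr.io
    # OpenShift login section
    SERVER=https://<cluster login host>:<port>
    TOKEN=<oc login token>
    #OCUSER=<OpenShift user name>
    #OCPASSWORD=<OpenShift user password>
    # Cloud Object Storage section
    COSAPIKEY=<cos_hmac_keys access_key_id>
    COSSECRETKEY=<cos_hmac_keys secret_access_key>
    COSREGION=us-south
    COSENDPOINT=s3.us-south.cloud-object-storage.appdomain.cloud
    # Last section
    SINCRIMAGE=quay.io/drangar_us/sincr:v2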


To clone a Cloud Pak cluster, you will first validate that the environment variables look correct. The env dump command (step 1 in each example below) will pull the docker container if it doesn't already exist, then print the environment variables that will be used. I call these the default actions.


Once you have reinstated your Cloud Pak for Data cluster, you will need to go in and validate a few items. Are your node/hostname names the same for your database connections? Did you use node affinity? If so, did you label the new nodes?
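
A quick way to review node names and labels on the new cluster (standard oc command):

    oc get nodes --show-labels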

Running with just the env file, in the current directory.

  1. First dump the environment variables by adding env as the final argument:
    docker run --env-file=./cloner.env quay.io/drangar_us/cpc:cp4d.3.5 env
       tjm$ docker run --env-file=./cloner.env quay.io/drangar_us/cpc:cp4d.3.5 env
       Unable to find image 'quay.io/drangar_us/cpc:cp4d.3.5' locally
       cp4d.3.5: Pulling from drangar_us/cpc
       7a0437f04f83: Already exists
       ....
       ....
       Digest: sha256:3ca278cfadd6a6b8eea6c8b5810ec8649c65a7ec52c99082cdbd4081b20e984f
       Status: Downloaded newer image for quay.io/drangar_us/cpc:cp4d.3.5
       PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
       HOSTNAME=8e660213f8c8
       ACTION=CLONE
       COSBUCKET=clone-cpd-lite-3-15-2021
       PROJECT=zen
       INTERNALREGISTRY=0
       ICR_KEY=eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJJQk0gTWFyaK7a7Dafcrg 
       TARGET_REGISTRY_USER=iamapikey
       TARGET_REGISTRY_PASSWORD=2qm6gbuiYBb_oBnLBgAbpE6Gjbn61YVdODRNZRa
       REGISTRYPROJECT=partner
       TARGET_REGISTRY=us.icr.io
       SERVER=https://c114-e.us-south.containers.cloud.ibm.com:30744
       TOKEN=UtqtiJSAz4amSrzxDxWKaRYFxZz-966vB0Y
       COSAPIKEY=f3669921b8a4451e21b38ab2ba1f
       COSENDPOINT=s3.us-south.cloud-object-storage.appdomain.cloud
       COSREGION=us-south
       COSSECRETKEY=68b44f315c357792d4210df2e25545ab4a6b7fc3b08
       SINCRIMAGE=quay.io/drangar_us/sincr:v2
       HOME=/root
    
  2. Execute the clone
    docker run --env-file=./cloner.env quay.io/drangar_us/cpc:cp4d.3.5

Using a combination of environment file and command-line arguments

This gives you ease with flexibility.

  1. First dump the environment variables by adding env as the final argument:
    docker run --env-file=./cloner.env -e ACTION=CLONE -e COSBUCKET=clone-cpd-lite-3-17-2021 quay.io/drangar_us/cpc:cp4d.3.5 env
    NOTE: The COSBUCKET now reads clone-cpd-lite-3-17-2021 instead of the clone-cpd-lite-3-15-2021 listed in the file. Each clone requires a unique bucket.
  2. Execute the clone
    docker run --env-file=./cloner.env -e ACTION=CLONE -e COSBUCKET=clone-cpd-lite-3-17-2021 quay.io/drangar_us/cpc:cp4d.3.5

Using all command line arguments

  1. First dump the environment variables by adding env as the final argument:
    docker run -e ACTION=CLONE -e SINCRIMAGE=[SINCR image path] -e SERVER=[OCP login server URL] -e OCUSER=[Openshift user name] -e OCPASSWORD=[Openshift user password] -e INTERNALREGISTRY=[0 or 1] -e PROJECT=[Openshift project name] -e COSBUCKET=[S3 bucket name] -e COSAPIKEY=[S3 access key] -e COSSECRETKEY=[S3 secret key] -e COSREGION=[S3 region] -e COSENDPOINT=[S3 end point] quay.io/drangar_us/cpc:cp4d.3.5 env
  2. Execute the clone
    docker run -e ACTION=CLONE -e SINCRIMAGE=[SINCR image path] -e SERVER=[OCP login server URL] -e OCUSER=[Openshift user name] -e OCPASSWORD=[Openshift user password] -e INTERNALREGISTRY=[0 or 1] -e PROJECT=[Openshift project name] -e COSBUCKET=[S3 bucket name] -e COSAPIKEY=[S3 access key] -e COSSECRETKEY=[S3 secret key] -e COSREGION=[S3 region] -e COSENDPOINT=[S3 end point] quay.io/drangar_us/cpc:cp4d.3.5

Setup

Required steps

  • Docker or Podman installed.
  • Terminal window on your laptop (Mac/Linux) or PowerShell on Windows
  • Internet access to the Quay repo where the containers are stored.
  • Execute docker pull quay.io/drangar_us/sincr:v2. If this image is not available, the CLONE will be faulty, the later REINSTATE will fail, and you may never know why.
  • Execute docker pull quay.io/drangar_us/cpc:cp4d.3.5. NOTE: The tool will do this for you, but in the interest of time, pull the container ahead of time.
  • If you are running on IBM Cloud, you should run the following prior to installing.
  • If you provide an API key from the cloud account and the name of the cluster, you should be able to simply run the following command from a terminal: curl -X POST -H "Content-Type: application/json" "http://cloudpak-provisioner.ibmcloudroks.net:8000/api/v1/cp4d_preinstallation/" -d '{"apikey": "<API KEY>", "cluster_name": "<CLUSTER NAME>"}' (make sure the single and double quotes are still correct after you copy and paste this command)
  • If your clone used the storage class ibmc-file-custom-gold-gid, then you need to add this storageClass on the target cluster by following these instructions.
  • If you are using DB2 or WKC, you will need to apply the kernel parameter changes in steps 3 and 4.

Verify that you have both containers pulled:

 tjm$ docker images
 REPOSITORY                 TAG        IMAGE ID       CREATED        SIZE
 quay.io/drangar_us/cpc     cp4d.3.5   51f701144a74   4 days ago     381MB
 quay.io/drangar_us/sincr   v2         ed3fb3d69bec   8 months ago   399MB

Setting up the environment file: I am providing a cloner.env file with some detailed documentation in the file itself. However, there are some trickier aspects that need some explaining.

We will work from the bottom of the file upwards; the bottom needs the fewest changes and the top the most.

Last Section

  • SINCRIMAGE is merely the job cloner image and should rarely, if ever, change.

Moving on to the Cloud Object Storage section

  • COSREGION is the S3 region where you will store the cloned files. I had issues on the first run with us-east, so I know that us-south works; more investigation will happen here.
  • COSENDPOINT is the S3 regional endpoint. Using the Cloud Object Storage endpoints URL, pick the public endpoint that aligns with your COSREGION selection (you can use Ctrl+F to find the proper endpoint). I chose the public endpoint; there is more to investigate with private endpoints.
  • Steps to get the COSAPIKEY and COSSECRETKEY values:
    1. Go to the IBM Cloud > Resource list, expand Storage, and click on the Cloud Object Storage service listed.
    2. On the left side, click Service credentials.
    3. If this is your first time, click the New Credentials button.
      • Enter a name you will remember
      • Keep Writer permissions
      • Expand the Advanced Options
      • Slide the button to enable Include HMAC Credential
      • Click Add
    4. This will create the new service credential. Expand the new credential; this is where you will find the values you need:
       "cos_hmac_keys": {
       "access_key_id": "a52292cde94547968d4a8ad0767da668",
       "secret_access_key": "582013659ef15ff2fd06e7ef4faed5f0508aa3a5f87a0379"
        },
      
  • COSAPIKEY gives access to the Cloud Object Storage service, where you will create a bucket for each clone. On IBM Cloud, this is the value of cos_hmac_keys: access_key_id.
  • COSSECRETKEY is the other part of the service credential for your COS or S3 bucket. On IBM Cloud, this is the value of cos_hmac_keys: secret_access_key.
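
If you save the service credential JSON to a local file and have jq installed (the file name credentials.json below is hypothetical), a quick way to pull out the two values:

    jq -r '.cos_hmac_keys.access_key_id'     credentials.json
    jq -r '.cos_hmac_keys.secret_access_key' credentials.json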

OpenShift Login section

  • SERVER is the OCP cluster login URL, which includes host and port. On IBM Cloud this can change from day to day and region to region, so it may be a good candidate for a -e SERVER=https://c100-e.us-east.containers.cloud.ibm.com:30891 on the command line.
  • TOKEN is the OpenShift cluster credential. On IBM Cloud Managed OpenShift, you will use the result of $(oc whoami -t). This is not needed if you log in using a username and password; if you do, uncomment OCUSER and OCPASSWORD and comment out TOKEN with a #. The token changes often, so it may be a good candidate for a -e TOKEN=<oc login token> on the command line.
  • OCUSER will be used to authenticate to the OpenShift cluster if you commented out TOKEN. You will use the result of $(oc whoami). However, command substitution doesn't work in the env file, so it may be a good candidate for a -e OCUSER=$(oc whoami) on the command line.
  • OCPASSWORD will be used to authenticate to the OpenShift cluster if you commented out TOKEN. You will use the result of $(oc whoami -t). However, command substitution doesn't work in the env file, so it may be a good candidate for a -e OCPASSWORD=$(oc whoami -t) on the command line.
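
Putting those suggestions together, one way to supply the frequently changing login values on the command line while you are logged in with oc (a sketch; --show-server is a standard oc option, and the remaining variables come from the env file):

    docker run --env-file=./cloner.env \
      -e SERVER=$(oc whoami --show-server) \
      -e TOKEN=$(oc whoami -t) \
      quay.io/drangar_us/cpc:cp4d.3.5 env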

Container Registry Section

This entire section should not change very often.

  • INTERNALREGISTRY offers two options, 1 or 0:
    • A value of 1 uses the internal registry in your OCP cluster. This option will pull images from the IBM Container Registry and push them into the OpenShift cluster's internal registry, which will increase reinstate time. You will want to plan accordingly for registry disk space on each cluster.
    • A value of 0 uses a registry external to your OCP cluster. It is becoming a better practice to control your own registry across multiple clusters. Another benefit is that you only load images once and save on disk space on all the other clusters. Instructions on building out your own external registry are available.
  • ICR_KEY is your entitlement APIKey from the entitlement registry, the same as you would use in repo.yaml. This is found in step 3 of Getting Entitlement to Cloud Pak Software.
  • TARGET_REGISTRY_USER depends on whether you are using INTERNALREGISTRY=1 or 0 (see the example after this section).
    If INTERNALREGISTRY=1, then you will use the result of $(oc whoami). However, command substitution doesn't work in the env file, so it may be a good candidate for a -e TARGET_REGISTRY_USER=$(oc whoami) on the command line.
    If INTERNALREGISTRY=0, then you will use iamapikey from the IAM APIKey you created.
  • TARGET_REGISTRY_PASSWORD depends on whether you are using INTERNALREGISTRY=1 or 0.
    If INTERNALREGISTRY=1, then you will use the result of $(oc whoami -t). However, command substitution doesn't work in the env file, so it may be a good candidate for a -e TARGET_REGISTRY_PASSWORD=$(oc whoami -t) on the command line.
    If INTERNALREGISTRY=0, then you will use the APIKey from the IAM APIKey you created.
  • REGISTRYPROJECT depends on whether you are using INTERNALREGISTRY=1 or 0.
    If INTERNALREGISTRY=1, then you will use the same value as in repo.yaml, or cp/cpd.
    If INTERNALREGISTRY=0, then you will use the values from when you created the external registry.
  • TARGET_REGISTRY depends on whether you are using INTERNALREGISTRY=1 or 0.
    If INTERNALREGISTRY=1, then you will use the same value as the registry url in repo.yaml, or cp.icr.io.
    If INTERNALREGISTRY=0, then you will use the values from when you created the external registry.
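
For the INTERNALREGISTRY=1 case, the two oc whoami substitutions described above would be passed on the command line, for example (a sketch assuming you are currently logged in to the target cluster with oc; all other variables come from the env file):

    docker run --env-file=./cloner.env \
      -e INTERNALREGISTRY=1 \
      -e TARGET_REGISTRY_USER=$(oc whoami) \
      -e TARGET_REGISTRY_PASSWORD=$(oc whoami -t) \
      quay.io/drangar_us/cpc:cp4d.3.5 env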

Top section

  • ACTION offers two main options: CLONE or REINSTATE.
    • CLONE will scale down the StatefulSets, capture a backup of the environment, scale the StatefulSets back up, and then copy the backup to the COS bucket you define with COSBUCKET. While the StatefulSets are scaled down, there will be a service outage.
    • REINSTATE will run a host of oc jobs that first rebuild the PVs and PVCs, restore the data, then rebuild the deployments, which starts the pods. Depending on the number of services and where you are executing this, the PVC/PV portion can take 10+ minutes, the deployments 5 minutes, plus the time for the pods to stand up.
  • COSBUCKET is self-defined and does not need to be created ahead of time. It should be unique enough for you to remember what, when, who, and why it was taken.
  • PROJECT is the namespace that your Cloud Pak is installed into. For CPD, I use clonerdemo instead of my 'zen' out of habit, but it is a better practice to use something other than zen or cpd for better results when searching for things. You can look through the projects using oc projects to see which is being used. oc get pods -n <project name> will show you what is there.
  • CLEANUP offers two main options, 1 or 0:
    A value of 1 cleans up the CLONE and REINSTATE jobs that are created during each invocation. I suggest setting this as the default value in this file.
    A value of 0 leaves the CLONE and REINSTATE jobs that are created during each invocation. You may want to leave these around to see what was happening. I would suggest overriding the CLEANUP=1 in this file with a -e CLEANUP=0 on the command line (see the example below). If you try to re-run CLONE on the same cluster/namespace while the old jobs are still there, it will not work.
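
For example, keeping CLEANUP=1 in the file and overriding it for a single troubleshooting run:

    docker run --env-file=./cloner.env -e CLEANUP=0 quay.io/drangar_us/cpc:cp4d.3.5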

Troubleshooting

Stopping the utility during execution

  1. Find out which containers are running using docker ps
       tjm$ docker ps
       CONTAINER ID   IMAGE                           COMMAND                  CREATED              STATUS PORTS    NAMES
       86e12bef2e7b   quay.io/drangar_us/cpc:cp4d.3.5 "/bin/sh -c /cloneto…"   About a minute ago   1m              eloquent_bouman
    
  2. Issue a docker stop
       tjm$ docker stop  eloquent_bouman
       eloquent_bouman
       tjm$
    

Access the Container

When you execute the container, as long as it is running, you can exec into the container and tail the logs.

  1. Start by running the command to run the container: docker run <environment variables> quay.io/drangar_us/cpc:cp4d.3.5
  2. Find out the name of the running container with docker ps
    Toms-MBP:cloner tjm$ docker ps
    CONTAINER ID   IMAGE                            COMMAND                  CREATED          STATUS       PORTS  NAMES
    a5a46f2f4e47   quay.io/drangar_us/cpc:cp4d.3.5  "/bin/sh -c /cloneto…"   14 seconds ago   Up 13 seconds       deepak_the_de
    
  3. Exec into the running container using the name on the right. In this sample, the name is deepak_the_de. Run docker exec -it deepak_the_de bash
     Toms-MBP:cloner tjm$ docker exec -it deepak_the_de bash
     [root@a5a46f2f4e47 clonetool]#
    
  4. List the files.
    [root@a5a46f2f4e47 clonetool]# ls
    clone  clone.sh  cpc_execution.txt.2021.03.18-01.51.27	reinstate  reinstate.sh  runcpc.sh
    

Viewing Logs

  1. Access the container
  2. List the files.
     [root@a5a46f2f4e47 clonetool]# ls
     clone  clone.sh  cpc_execution.txt.2021.03.18-01.51.27	reinstate  reinstate.sh  runcpc.sh
    
  3. Tail, more or cat the logs
     [root@a5a46f2f4e47 clonetool]# tail -f cpc_execution.txt.2021.03.18-01.51.27
     deployment.apps/redis-ha-haproxy scaled
     deployment.apps/spaces scaled
     deployment.apps/spawner-api scaled
     ...
     ...
     zen-metastoredb
     statefulset.apps/zen-metastoredb scaled
     zookeeper
     statefulset.apps/zookeeper scaled
     Info: Exporting manifests
     Info: Exporting project to clone/specs/0-projects.json
     Info: Exporting secrets to clone/specs/1-secrets.json
     Info: Exporting serviceaccount to clone/specs/2-serviceaccounts.json
     Info: Exporting scc to clone/specs/3-securitycontextconstraints.json
     Info: Exporting roles to clone/specs/4-roles.json
     ...
     ...
     job.batch/clone-job-32 created
     Info: Starting data backup for user-home-pvc
     job.batch/clone-job-33 created
     Info: Starting data backup for zookeeper-data-zookeeper-0
     job.batch/clone-job-34 created
     Info: Wating for all clone job's to complete [Elapsed time:60]...
     Info: Wating for all clone job's to complete [Elapsed time:120]...
    

S3 connectivity issues

  1. Access the container
  2. Using s3cmd, verify that you can list buckets: s3cmd ls
    [root@243b0735ba65 clonetool]# s3cmd ls
    2021-03-18 02:59  s3://clone-cpd-lite-wkc-3-17-2021
    2021-03-16 02:45  s3://clone1-datacp-mar15-lite
    
  3. Verify that you can make a bucket: s3cmd mb s3://<name>
    [root@243b0735ba65 clonetool]# s3cmd mb s3://testus-south-bucket
    Bucket 's3://testus-south-bucket/' created
    
  4. If these fail, review the S3 parameters that were passed, using cat ~/.s3cfg, and verify them with the actual COS admin or configuration. I have used the APIKey as the access_key. I also had an issue with the s3.us-east.cloud endpoint. Will need to look into configuring use_https=true in the future.
    [root@243b0735ba65 clonetool]# cat ~/.s3cfg
    host_base=s3.us-south.cloud-object-storage.appdomain.cloud
    host_bucket=s3.us-south.cloud-object-storage.appdomain.cloud
    use_https=False
    bucket_location=us-south
    access_key=f3669921b8a44a91b38ab2ba1f
    secret_key=68b44f315c357792ddf2e25545ab4a6b7fc3b08
    signature_v2=False
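
Once connectivity is confirmed, you can remove the test bucket (s3cmd rb deletes an empty bucket):

    s3cmd rb s3://testus-south-bucket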
    

Clone failing to upload to S3

I had one scenario where ACTION=CLONE was running but would not upload to the S3 bucket. After looking through the source cluster, it turned out I did not have the secret set up for norootsquash, which delayed the StatefulSet from starting and caused the clone to time out before the upload happened. This was not a utility issue but a cluster configuration issue. It was resolved by running the following:

  1. oc project kube-system
  2. oc create secret docker-registry cpregistrysecret --docker-server=cp.icr.io/cp/cpd --docker-username=cp --docker-password=<apikey> --docker-email=<email associated with apikey>
  3. Download the norootsquash.yaml file.
  4. oc apply -f norootsquash.yaml

Reinstate fails to create namespace

When I was executing ACTION=REINSTATE, I was passing it a bogus bucket name.

 [root@243b0735ba65 clonetool]# more cpc_execution.txt.2021.03.18-12.25.49
  Info: S3 config does not exist, create one ..
  Info: -rw-r--r--
  Info: Start Time : 2021.03.18-12.25.49
  Info: Checking to see if the S3 connection parameters are correct
  Info: Downloading template from COS
  ERROR: S3 error: 404 (NoSuchBucket): The specified bucket does not exist.
  ERROR: S3 error: 404 (NoSuchBucket): The specified bucket does not exist.
  Info: Update any spec as required
  /clonetool/reinstate.sh: line 122: /clonetool/reinstate/specs/fixspec.sh: No such file or directory
  Info: reinstante specs not found
  cat: reinstate/specs/cloneddomain.txt: No such file or directory
  Info: Reinstate project and apply prerequisites


When looking at the logs, I did not see output like the following, which would indicate a healthy download. Instead the utility skips over this and tries to execute the oc build, which it can't, and hence it does not create the namespace. NOTE: A future enhancement is to check whether downloads are successful and, if not, error out. For comparison, here is what a healthy download looks like:

 [root@81d90ecb4856 clonetool]# more cpc_execution.txt.2021.03.18-13.09.31
 Info: S3 config does not exist, create one ..
 Info: -rw-r--r--
 Info: Start Time : 2021.03.18-13.09.31
 Info: Checking to see if the S3 connection parameters are correct
 Info: Downloading template from COS
 download: 's3://clone-cpd-lite-wkc-3-17-2021/state/container_image_list.txt' -> 'reinstate/state/container_image_list.txt' (2940 bytes in 0.3 seconds, 10.14 KB/s) [1 of 5]
 download: 's3://clone-cpd-lite-wkc-3-17-2021/state/deployment_replicas.csv' -> 'reinstate/state/deployment_replicas.csv' (1501 bytes in 0.4 seconds, 3.78 KB/s) [2 of 5]
 download: 's3://clone-cpd-lite-wkc-3-17-2021/state/deployments.json' -> 'reinstate/state/deployments.json' (3515304 bytes in 1.7 seconds, 2027.11 KB/s) [3 of 5]
 download: 's3://clone-cpd-lite-wkc-3-17-2021/state/statefulset_replicas.csv' -> 'reinstate/state/statefulset_replicas.csv' (362 bytes in 0.1 seconds, 5.19 KB/s) [4 of 5]
 download: 's3://clone-cpd-lite-wkc-3-17-2021/state/statefulsets.json' -> 'reinstate/state/statefulsets.json' (766003 bytes in 0.7 seconds, 1066.76 KB/s) [5 of 5]
 Done. Downloaded 4286110 bytes in 3.1 seconds, 1334.25 KB/s.
 download: 's3://clone-cpd-lite-wkc-3-17-2021/specs/0-projects.json' -> 'reinstate/specs/0-projects.json' (1772 bytes in 0.2 seconds, 7.16 KB/s) [1 of 20]

Reinstating a cloned environment which used ibmc-file-custom-gold-gid

When cloning an existing environment, pay attention to the storage classes. By default, Managed OpenShift provides a certain set of storageClasses; ibmc-file-custom-gold-gid is not created by default. The utility will proceed, but after a while the reinstate jobs will fail or you will not see progression, because the PVs and PVCs are not being provisioned. To resolve this:

Create the storage class:
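
The exact definition comes from the IBM Cloud File Storage documentation; as a rough, hedged sketch (the parameter values here are assumptions and should be verified against the IBM Cloud docs or an existing ibmc-file-gold-gid class on your cluster, e.g. oc get sc ibmc-file-gold-gid -o yaml), it would be applied with oc roughly like this:

    oc apply -f - <<EOF
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ibmc-file-custom-gold-gid
    provisioner: ibm.io/ibmc-file
    parameters:
      classVersion: "2"
      type: "Performance"
      gidAllocate: "true"
      iopsPerGB: "10"
      sizeRange: "[20-4000]Gi"
    reclaimPolicy: Delete
    EOF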



You should start seeing the storage being provisioned and progression continuing.

Reinstated WKC or DB2 pods fail to start

This occurs when the prerequisite kernel parameters and secrets for DB2/WKC are not in place. Please refer back to the required steps. Running the following command on the target cluster will show the affected pods: oc get pods | grep '0/1' | grep -v Completed. The culprit is wdp-db2-0.
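
To see why a pod such as wdp-db2-0 is stuck, a standard oc command to inspect its events:

    oc describe pod wdp-db2-0 -n <project name>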