IBMPartnerDemo

View project on GitHub

Data Lineage API tutorial

This is a short tutorial using the Watson Data API to query Catalogs, Getting Assets and creating relationship mappings between the assets to populate the lineage graph on the asset in Watson Knowledge Catalog. The Column based asset lineage below is a subset of what can be done with Business Asset Lineage. If there is a cataloged asset, we can add it to the lineage. Whether this is a PowerBI dashboard or Cognos report, or Jupytper notebook.

Setting up Environment variables

For simplicity, I will used some environment variables along the way. To set the CPD_URL, you can find this value using OpenShift commands or cutting and pasting from the Browser.

If you are using OpenShift CLI, the command would look something like this. After logging into OpenShift CLI, you will issue a command to get the route in the namespace/project which CPD instance is installed. In my case, it is cpd. The command is oc get routes cpd -n cpd You will want to collect the second column value.

oc get routes cpd -n cpd

NAME   HOST/PORT                                                                                                PATH   SERVICES        PORT                   TERMINATION          WILDCARD
cpd    cpd-cpd.cpd46-wdc07-c3c-16x16-3cfa7977c157a3e8d61067e13964bdf8-0000.us-east.containers.appdomain.cloud          ibm-nginx-svc   ibm-nginx-https-port   reencrypt/Redirect   None

If you want to be fancy then the follow you can send right to the variable. oc get routes cpd -n cpd | grep cpd | awk '{print $2}'

Next you will export the CPD_URL variable export CPD_URL=cpd-cpd.cpd46-wd...3964bdf8-0000.us-east.containers.appdomain.cloud

Authorization

First thing that is needed is to get the Bearer token. You will need the following:

  • User Id
  • Password
  • Base Cloud Pak for Data URL, which was set using CPD_URL
    curl -k -X POST https://${CPD_URL}/icp4d-api/v1/authorize     -H 'cache-control: no-cache'     -H 'content-type: application/json'     -d '{"username":"admin","password":"PASSWORD"}'
    

Getting Catalogs

You will first want to list out all availabe catalogs, so that you can retrive the GUID for the catalog which your assets are located. In my case, I am looking for Business Catalog. To pretty up the output, I am using jq to help format and parse the output. Download

curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v2/catalogs?limit=25 | jq .
{
    "catalogs": [
        {
            "metadata": {
                "guid": "aea20795-3c0b-4d25-a99d-c089c6c20dd8",
                "url": "/v2/catalogs/aea20795-3c0b-4d25-a99d-c089c6c20dd8",
                "creator_id": "1000330999",
                "create_time": "2022-12-02T17:41:53Z",
                "update_time": "2023-01-25T14:39:17Z"
            },
            "entity": {
                "name": "Business Catalog",
                "description": "",
                "generator": "Catalog-UI-Service",
                "bss_account_id": "999",
                "capacity_limit": 0,
                "is_governed": true,
                "auto_profiling": true,
                "configurations": {
                    "duplicate_action": "UPDATE",
                    "default_duplicate_strategy": "DUPLICATE_DETECTION_BY_NAME",
                    "duplicate_strategies": [
                        {
                            "asset_type": "data_asset",
                            "strategy": "DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY"
                        }
                    ]
                }
            },
            "href": "/v2/catalogs/aea20795-3c0b-4d25-a99d-c089c6c20dd8"
        },
        {
            "metadata": {
                "guid": "2c50fa47-049e-4d72-a24d-6839028a27bd",
                "url": "/v2/catalogs/2c50fa47-049e-4d72-a24d-6839028a27bd",
                "creator_id": "1000330999",
                "create_time": "2022-12-01T17:28:26Z",
                "update_time": "2023-01-25T14:39:17Z"
            },
            "entity": {
                "name": "Platform assets catalog",
                "description": "The catalog where all platform users can share assets and connections.",
                "generator": "wkc post install",
                "bss_account_id": "999",
                "capacity_limit": 0,
                "is_governed": false,
                "auto_profiling": false,
                "uid": "ibm-global-catalog"
            },
            "href": "/v2/catalogs/2c50fa47-049e-4d72-a24d-6839028a27bd"
        }
    ],
    "nextBookmark": "g1AAAAG8eJy10DtOA0EMBuARG-5BjeTB8e5m1g0KFKnCGWDG3ihaTR7KCzquwjGSKldBouQQy0BQlBIJpbH9N9anPxpjsnGmxslsLWMN_Wedw-8NaE-Slbna5UasxPVyVS9snImPMT24CE9phte2bZvvnA3uX4afW-qbSUo3JCWOfOEAC66hUEfgqVDoVTkjVZ5c0McuIuY5MrM3zb9B006a5i2tZNofUQ-3H5d4QPnaEzouIRcMCUUleGYFwYqlJ4Sq1VlQ7wfUaVO7u6sf1PWfm8Lkab4A5MOEXw",
    "nextSkip": 0
}

Parse out the names and UID you want.

curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v2/catalogs?limit=25 | jq -r '.catalogs[] | .entity.name + " " +  .metadata.guid'

Note you will see this lay out in an index. Business Catalog is index 0.

Business Catalog aea20795-3c0b-4d25-a99d-c089c6c20dd8
Platform assets catalog 2c50fa47-049e-4d72-a24d-6839028a27bd

Store the Catalog ID for later usage:

export CATALOG=$(curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v2/catalogs?limit=25 | jq -r .catalogs[0].metadata.guid)

Verify that the correct GUID is stored for the Catalog

echo $CATALOG
aea20795-3c0b-4d25-a99d-c089c6c20dd8

Searching for assets

Lets search on a column and extract the artifact_id. You will notice that the artifact_id for a data_asset_column is for the data_asset UID and the column name is appended to it. In this case, I am searching for CREDITCARD then parsing the return to list to print out name, artifact_id, artifact_type and catalog _id. This will start to give me the basis to build my json to store relationship information.

curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?query=CREDITCARD | jq -r '.rows[] | .metadata.name + " " +  .artifact_id + " " + .metadata.artifact_type + " catalog_id: " + .entity.assets.catalog_id'

CREDITCARD_CVV a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_CVV data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_CVV af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_CVV data_asset_column catalog_id:
CREDITCARD_EXP 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_EXP data_asset_column catalog_id:
CREDITCARD_CVV 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_CVV data_asset_column catalog_id:
CREDITCARD_NUMBER af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_NUMBER data_asset_column catalog_id:
CREDITCARD_EXP af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_EXP data_asset_column catalog_id:
CREDITCARD_NUMBER a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_EXP a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_EXP data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_TYPE 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_TYPE data_asset_column catalog_id:
CREDITCARD_TYPE af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_TYPE data_asset_column catalog_id:
CREDITCARD_TYPE a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_TYPE data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_NUMBER 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_NUMBER data_asset_column catalog_id:
Customers_LIFE_s.csv a7451e4d-1e3f-483b-b582-245c057b99f9 data_asset catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
Customers_LIFE_s.csv af416dea-4c6e-4b85-b20d-fd068c9f6a9c data_asset catalog_id:
Customers_LIFE_s.csv 4f3726d2-9474-4e71-a18d-3af269239297 data_asset catalog_id:

Lets query again for CREDITCARD_NUMBER, this time adding auth_scope=$CATALOG.

curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?query=CREDITCARD_NUMBER | jq -r '.rows[] | .artifact_id'

a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_CVV
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_EXP
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_TYPE
a7451e4d-1e3f-483b-b582-245c057b99f9

For the API we need the actual asset root, not the column

export ASSET=$(curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?auth_scope=$CATALOG&query=Customers_LIFE_s.csv | jq -r '.rows[0] | .artifact_id')

echo $ASSET
a7451e4d-1e3f-483b-b582-245c057b99f9

Just randomly picked another asset to link together:

Toms-MBP:aunt-kitty tjm$ curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?query=email | jq -r '.rows[0] | .artifact_id'

5d7ef46a-3437-4f99-aedb-b44bae377ac7:EMAIL

Now lets link two assets together:

lets build out the linkage between the two assets, which is layed out in this attached json file

{
  "name": "data_lineage",
  "entity": {
      "assets": [
         {
          "internal_id": "creditcard_number",
          "lookup": {
              "producer_properties": {
                  "id": "a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER"
              }
          }
        },
        {
          "internal_id": "email",
          "lookup": {
              "producer_properties": {
                  "id": "5d7ef46a-3437-4f99-aedb-b44bae377ac7:EMAIL"
              }
          }
        }
      ],
      "lineage_relationships": [
        {
          "flow_type": "flow_design",
          "flows": [
            {
              "sources": [
                "creditcard_number"
              ],
              "targets": [
                "email"
              ]
            }
          ]
        }
      ]
   }
}

Executing the lineage api:

curl -k -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d @datalineage.json https://$CPD_URL/v2/assets/$ASSET/attributes?catalog_id=$CATALOG | jq .

{
  "asset_id": "715f7fdb-73be-4f3e-8a25-0268ad5e7cfe",
  "data_lineage": {
    "assets": [
      {
        "internal_id": "creditcard_number",
        "lookup": {
          "producer_properties": {
            "id": "a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER"
          }
        }
      },
      {
        "internal_id": "email",
        "lookup": {
          "producer_properties": {
            "id": "715f7fdb-73be-4f3e-8a25-0268ad5e7cfe:EMAIL"
          }
        }
      }
    ],
    "lineage_relationships": [
      {
        "flow_type": "flow_design",
        "flows": [
          {
            "sources": [
              "creditcard_number"
            ],
            "targets": [
              "email"
            ]
          }
        ]
      }
    ]
  },
  "href": "/v2/assets/715f7fdb-73be-4f3e-8a25-0268ad5e7cfe/attributes/data_lineage?catalog_id=aea20795-3c0b-4d25-a99d-c089c6c20dd8"

End result in the Catalog looks like this:

Validate Data Linage

You can use the GET command on an asset to get an attribute. In the command below, you are looking up the ASSET in the CATALOG and bringing back the data_lineage attribute.

curl -k  -X GET -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" https://${CPD_URL}/v2/assets/${ASSET}/attributes/data_lineage?catalog_id=${CATALOG} | jq .
{
  "href": "/v2/assets/a7451e4d-1e3f-483b-b582-245c057b99f9/attributes/data_lineage?catalog_id=aea20795-3c0b-4d25-a99d-c089c6c20dd8",
  "asset_id": "a7451e4d-1e3f-483b-b582-245c057b99f9",
  "data_lineage": {
    "assets": [
      {
        "internal_id": "creditcard_number",
        "lookup": {
          "producer_properties": {
            "id": "a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER"
          }
        }
      },
      {
        "internal_id": "email",
        "lookup": {
          "producer_properties": {
            "id": "5d7ef46a-3437-4f99-aedb-b44bae377ac7:EMAIL"
          }
        }
      }
    ],
    "lineage_relationships": [
      {
        "flow_type": "flow_design",
        "flows": [
          {
            "sources": [
              "creditcard_number"
            ],
            "targets": [
              "email"
            ]
          }
        ]
      }
    ]
  }
}

Deleting lineage

If you happen to make a mistake, or if lineage has changed, you can remove the old using DELETE

curl -k  -X DELETE -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" https://${CPD_URL}/v2/assets/${ASSET}/attributes/data_lineage?catalog_id=${CATALOG} | jq .

In theory, passing this file to the -d argument in the curl command.

curl -k -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d @datalineage-multi.json https://$CPD_URL/v2/assets/$ASSET/attributes?catalog_id=$CATALOG | jq .

When you look at the file, the json parser seems to map left to right. a1 maps to a2, a3 maps to a4 and a5 maps to a6. These can be arbitrary values but I think they should be more meaningful names. The second format might be more work, but it might be more descriptive, for anyone that comes behind you, if you are retaining these n source control.

"lineage_relationships": [
  {
    "flow_type": "flow_design",
    "flows": [
      {
        "sources": [
          "creditcard_number","phone_number","customerid"
        ],
        "targets": [
          "email","phone_1","cust_id"
        ]
      }
    ]  

This can also be laid out the following way.

"lineage_relationships": [
  {
    "flow_type": "flow_design",
    "flows": [
      {
        "sources": [
          "creditcard_number"
        ],
        "targets": [
          "email"
        ]
      },      
      {
        "sources": [
          "phone_number","customerid"
        ],
        "targets": [
          "phone_1","cust_id"
        ]
      },
      {
        "sources": [
          "customerid"
        ],
        "targets": [
          "cust_id"
        ]
      }
    ]  

Here is the view in the lineage tab. There is an open “IDEA” or RFE with development to let you add an edge label to describe the link between the two assets.

Building into a utility

With Python we can write a small program to make all these steps easier. I am working with Gabe Green ggreen@techd.com from Technology Dyanmics to build this into a utility to ingest CSV much like Extension Mapping Documents. More to come on this topic.