Data Lineage API tutorial
This is a short tutorial using the Watson Data API to query Catalogs, Getting Assets and creating relationship mappings between the assets to populate the lineage graph on the asset in Watson Knowledge Catalog. The Column based asset lineage below is a subset of what can be done with Business Asset Lineage. If there is a cataloged asset, we can add it to the lineage. Whether this is a PowerBI dashboard or Cognos report, or Jupytper notebook.
Setting up Environment variables
For simplicity, I will used some environment variables along the way. To set the CPD_URL, you can find this value using OpenShift commands or cutting and pasting from the Browser.
If you are using OpenShift CLI, the command would look something like this. After logging into OpenShift CLI, you will issue a command to get the route in the namespace/project which CPD instance is installed. In my case, it is cpd. The command is oc get routes cpd -n cpd
You will want to collect the second column value.
oc get routes cpd -n cpd
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
cpd cpd-cpd.cpd46-wdc07-c3c-16x16-3cfa7977c157a3e8d61067e13964bdf8-0000.us-east.containers.appdomain.cloud ibm-nginx-svc ibm-nginx-https-port reencrypt/Redirect None
If you want to be fancy then the follow you can send right to the variable. oc get routes cpd -n cpd | grep cpd | awk '{print $2}'
Next you will export the CPD_URL variable
export CPD_URL=cpd-cpd.cpd46-wd...3964bdf8-0000.us-east.containers.appdomain.cloud
Authorization
First thing that is needed is to get the Bearer token. You will need the following:
- User Id
- Password
- Base Cloud Pak for Data URL, which was set using CPD_URL
curl -k -X POST https://${CPD_URL}/icp4d-api/v1/authorize -H 'cache-control: no-cache' -H 'content-type: application/json' -d '{"username":"admin","password":"PASSWORD"}'
Getting Catalogs
You will first want to list out all availabe catalogs, so that you can retrive the GUID for the catalog which your assets are located. In my case, I am looking for Business Catalog. To pretty up the output, I am using jq to help format and parse the output. Download
curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v2/catalogs?limit=25 | jq .
{
"catalogs": [
{
"metadata": {
"guid": "aea20795-3c0b-4d25-a99d-c089c6c20dd8",
"url": "/v2/catalogs/aea20795-3c0b-4d25-a99d-c089c6c20dd8",
"creator_id": "1000330999",
"create_time": "2022-12-02T17:41:53Z",
"update_time": "2023-01-25T14:39:17Z"
},
"entity": {
"name": "Business Catalog",
"description": "",
"generator": "Catalog-UI-Service",
"bss_account_id": "999",
"capacity_limit": 0,
"is_governed": true,
"auto_profiling": true,
"configurations": {
"duplicate_action": "UPDATE",
"default_duplicate_strategy": "DUPLICATE_DETECTION_BY_NAME",
"duplicate_strategies": [
{
"asset_type": "data_asset",
"strategy": "DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY"
}
]
}
},
"href": "/v2/catalogs/aea20795-3c0b-4d25-a99d-c089c6c20dd8"
},
{
"metadata": {
"guid": "2c50fa47-049e-4d72-a24d-6839028a27bd",
"url": "/v2/catalogs/2c50fa47-049e-4d72-a24d-6839028a27bd",
"creator_id": "1000330999",
"create_time": "2022-12-01T17:28:26Z",
"update_time": "2023-01-25T14:39:17Z"
},
"entity": {
"name": "Platform assets catalog",
"description": "The catalog where all platform users can share assets and connections.",
"generator": "wkc post install",
"bss_account_id": "999",
"capacity_limit": 0,
"is_governed": false,
"auto_profiling": false,
"uid": "ibm-global-catalog"
},
"href": "/v2/catalogs/2c50fa47-049e-4d72-a24d-6839028a27bd"
}
],
"nextBookmark": "g1AAAAG8eJy10DtOA0EMBuARG-5BjeTB8e5m1g0KFKnCGWDG3ihaTR7KCzquwjGSKldBouQQy0BQlBIJpbH9N9anPxpjsnGmxslsLWMN_Wedw-8NaE-Slbna5UasxPVyVS9snImPMT24CE9phte2bZvvnA3uX4afW-qbSUo3JCWOfOEAC66hUEfgqVDoVTkjVZ5c0McuIuY5MrM3zb9B006a5i2tZNofUQ-3H5d4QPnaEzouIRcMCUUleGYFwYqlJ4Sq1VlQ7wfUaVO7u6sf1PWfm8Lkab4A5MOEXw",
"nextSkip": 0
}
Parse out the names and UID you want.
curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v2/catalogs?limit=25 | jq -r '.catalogs[] | .entity.name + " " + .metadata.guid'
Note you will see this lay out in an index. Business Catalog is index 0.
Business Catalog aea20795-3c0b-4d25-a99d-c089c6c20dd8
Platform assets catalog 2c50fa47-049e-4d72-a24d-6839028a27bd
Store the Catalog ID for later usage:
export CATALOG=$(curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v2/catalogs?limit=25 | jq -r .catalogs[0].metadata.guid)
Verify that the correct GUID is stored for the Catalog
echo $CATALOG
aea20795-3c0b-4d25-a99d-c089c6c20dd8
Searching for assets
Lets search on a column and extract the artifact_id. You will notice that the artifact_id for a data_asset_column is for the data_asset UID and the column name is appended to it. In this case, I am searching for CREDITCARD then parsing the return to list to print out name, artifact_id, artifact_type and catalog _id. This will start to give me the basis to build my json to store relationship information.
curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?query=CREDITCARD | jq -r '.rows[] | .metadata.name + " " + .artifact_id + " " + .metadata.artifact_type + " catalog_id: " + .entity.assets.catalog_id'
CREDITCARD_CVV a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_CVV data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_CVV af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_CVV data_asset_column catalog_id:
CREDITCARD_EXP 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_EXP data_asset_column catalog_id:
CREDITCARD_CVV 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_CVV data_asset_column catalog_id:
CREDITCARD_NUMBER af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_NUMBER data_asset_column catalog_id:
CREDITCARD_EXP af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_EXP data_asset_column catalog_id:
CREDITCARD_NUMBER a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_EXP a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_EXP data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_TYPE 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_TYPE data_asset_column catalog_id:
CREDITCARD_TYPE af416dea-4c6e-4b85-b20d-fd068c9f6a9c:CREDITCARD_TYPE data_asset_column catalog_id:
CREDITCARD_TYPE a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_TYPE data_asset_column catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
CREDITCARD_NUMBER 4f3726d2-9474-4e71-a18d-3af269239297:CREDITCARD_NUMBER data_asset_column catalog_id:
Customers_LIFE_s.csv a7451e4d-1e3f-483b-b582-245c057b99f9 data_asset catalog_id: aea20795-3c0b-4d25-a99d-c089c6c20dd8
Customers_LIFE_s.csv af416dea-4c6e-4b85-b20d-fd068c9f6a9c data_asset catalog_id:
Customers_LIFE_s.csv 4f3726d2-9474-4e71-a18d-3af269239297 data_asset catalog_id:
Lets query again for CREDITCARD_NUMBER, this time adding auth_scope=$CATALOG.
curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?query=CREDITCARD_NUMBER | jq -r '.rows[] | .artifact_id'
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_CVV
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_EXP
a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_TYPE
a7451e4d-1e3f-483b-b582-245c057b99f9
For the API we need the actual asset root, not the column
export ASSET=$(curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?auth_scope=$CATALOG&query=Customers_LIFE_s.csv | jq -r '.rows[0] | .artifact_id')
echo $ASSET
a7451e4d-1e3f-483b-b582-245c057b99f9
Just randomly picked another asset to link together:
Toms-MBP:aunt-kitty tjm$ curl -k -X GET --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer ${TOKEN}" https://$CPD_URL/v3/search?query=email | jq -r '.rows[0] | .artifact_id'
5d7ef46a-3437-4f99-aedb-b44bae377ac7:EMAIL
Now lets link two assets together:
lets build out the linkage between the two assets, which is layed out in this attached json file
{
"name": "data_lineage",
"entity": {
"assets": [
{
"internal_id": "creditcard_number",
"lookup": {
"producer_properties": {
"id": "a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER"
}
}
},
{
"internal_id": "email",
"lookup": {
"producer_properties": {
"id": "5d7ef46a-3437-4f99-aedb-b44bae377ac7:EMAIL"
}
}
}
],
"lineage_relationships": [
{
"flow_type": "flow_design",
"flows": [
{
"sources": [
"creditcard_number"
],
"targets": [
"email"
]
}
]
}
]
}
}
Executing the lineage api:
curl -k -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d @datalineage.json https://$CPD_URL/v2/assets/$ASSET/attributes?catalog_id=$CATALOG | jq .
{
"asset_id": "715f7fdb-73be-4f3e-8a25-0268ad5e7cfe",
"data_lineage": {
"assets": [
{
"internal_id": "creditcard_number",
"lookup": {
"producer_properties": {
"id": "a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER"
}
}
},
{
"internal_id": "email",
"lookup": {
"producer_properties": {
"id": "715f7fdb-73be-4f3e-8a25-0268ad5e7cfe:EMAIL"
}
}
}
],
"lineage_relationships": [
{
"flow_type": "flow_design",
"flows": [
{
"sources": [
"creditcard_number"
],
"targets": [
"email"
]
}
]
}
]
},
"href": "/v2/assets/715f7fdb-73be-4f3e-8a25-0268ad5e7cfe/attributes/data_lineage?catalog_id=aea20795-3c0b-4d25-a99d-c089c6c20dd8"
End result in the Catalog looks like this:
Validate Data Linage
You can use the GET command on an asset to get an attribute. In the command below, you are looking up the ASSET in the CATALOG and bringing back the data_lineage attribute.
curl -k -X GET -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" https://${CPD_URL}/v2/assets/${ASSET}/attributes/data_lineage?catalog_id=${CATALOG} | jq .
{
"href": "/v2/assets/a7451e4d-1e3f-483b-b582-245c057b99f9/attributes/data_lineage?catalog_id=aea20795-3c0b-4d25-a99d-c089c6c20dd8",
"asset_id": "a7451e4d-1e3f-483b-b582-245c057b99f9",
"data_lineage": {
"assets": [
{
"internal_id": "creditcard_number",
"lookup": {
"producer_properties": {
"id": "a7451e4d-1e3f-483b-b582-245c057b99f9:CREDITCARD_NUMBER"
}
}
},
{
"internal_id": "email",
"lookup": {
"producer_properties": {
"id": "5d7ef46a-3437-4f99-aedb-b44bae377ac7:EMAIL"
}
}
}
],
"lineage_relationships": [
{
"flow_type": "flow_design",
"flows": [
{
"sources": [
"creditcard_number"
],
"targets": [
"email"
]
}
]
}
]
}
}
Deleting lineage
If you happen to make a mistake, or if lineage has changed, you can remove the old using DELETE
curl -k -X DELETE -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" https://${CPD_URL}/v2/assets/${ASSET}/attributes/data_lineage?catalog_id=${CATALOG} | jq .
How can I link multiple sources to a target?
In theory, passing this file to the -d argument in the curl command.
curl -k -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d @datalineage-multi.json https://$CPD_URL/v2/assets/$ASSET/attributes?catalog_id=$CATALOG | jq .
When you look at the file, the json parser seems to map left to right. a1 maps to a2, a3 maps to a4 and a5 maps to a6. These can be arbitrary values but I think they should be more meaningful names. The second format might be more work, but it might be more descriptive, for anyone that comes behind you, if you are retaining these n source control.
"lineage_relationships": [
{
"flow_type": "flow_design",
"flows": [
{
"sources": [
"creditcard_number","phone_number","customerid"
],
"targets": [
"email","phone_1","cust_id"
]
}
]
This can also be laid out the following way.
"lineage_relationships": [
{
"flow_type": "flow_design",
"flows": [
{
"sources": [
"creditcard_number"
],
"targets": [
"email"
]
},
{
"sources": [
"phone_number","customerid"
],
"targets": [
"phone_1","cust_id"
]
},
{
"sources": [
"customerid"
],
"targets": [
"cust_id"
]
}
]
Here is the view in the lineage tab. There is an open “IDEA” or RFE with development to let you add an edge label to describe the link between the two assets.
Building into a utility
With Python we can write a small program to make all these steps easier. I am working with Gabe Green ggreen@techd.com from Technology Dyanmics to build this into a utility to ingest CSV much like Extension Mapping Documents. More to come on this topic.