Databricks: Change Data Feed with Unity Catalog and Delta Sharing

Eric Tome
6 min readAug 29, 2022

--

I recently came across a customer use case that was very interesting. They wanted to share Change Data Capture (CDC) data with external partners who may or may not be Databricks customers. B2B data sharing has been on the rise as companies seek to get the most value out of their data. In a 2021 study, Gartner states that companies who share data externally generate three times more measurable economic benefits than those not involved in data sharing. CDC data is a process that captures the movement of data values over time, usually as part of Change Data Feeds (CDF). Depending on the data volume, CDC data can be hard to capture, and some companies still use older patterns, such as SFTP or proprietary platforms, to share CDC data. With the release of Delta Sharing and leveraging the capabilities of Delta Lake and Unity Catalog, I can share a scalable way to include data sharing for anyone wanting to modernize their processes. The code for this is in the GitHub repository here; make sure to check it out.

A couple of things to note, to capture CDC metadata in your Delta tables, you’ll need to set the table property delta.enableChangeDataFeed to true. To learn how to do that, please see the documentation here.Then to share data, you’ll need a Databricks Unity Catalog enabled workspace.

I wrote a script to create the initial Delta table and the table we will share externally. Let’s walk through the important code. We’ll need to create a Delta table with some data in it, I used a combination of faker and Databricks Lab Data Generator to do this, you can skip this step as you’ll most likely be using your own data for this exercise.

I’m going to show two different ways to share CDF data. The first will be sharing ONLY the CDF data, not the table where that data comes from. A recipient can use this to consume only the CDF data. The second will be sharing the base table with CDF enabled.

CDF Only Option

We’ve just created a CDF enabled table which will add metadata columns to track how the table has changed over time. These columns are _change_type with the values insert, update_preimage, update_postimage, or delete. Next, _commit_version will contain a numeric value that increments with each update to the Delta table. Last, _commit_timestamp is the date time associated with the commit change. The recipient consuming the data will use these values to help understand how to modify a data store in their environment.

But to do that, we need to create a table we’ll use as the basis for our share and load it with the CDF data from the table we just made. Since we can easily switch between languages in Databricks, I’ll use SQL to do this. We are using the CDF function table_changes and loading all commit versions starting from 0 from our first table, erictome_cdf_delta_sharing.share_data.

Notice that we are partitioning the table by COMPANYNAME. One of the nice things we can do with Delta Sharing is to share only the data on each partition with our recipients. I’ve added Company1, Company2, and Company3 to our data because we share with three companies and don’t want them to access the other’s data. Hence the partition on table creation will keep all the records for each company in separate locations.

Now that we have our tables created and loaded with data, we want to load new data into that table that will insert, update, and delete records so that we can provide this data to our recipient via Delta Sharing.

The above code will create a data set to update records created in our previous table and store it in the data frame cdc_inc_data_df. Next, we’ll use Delta Lake’s merge functionality to either update or insert new records.

To make the CDF feed complete, I want to also include some deletions in our feed, so let’s do that with this line:

deltaTable.delete("RECID BETWEEN 200000 AND 299999 ")

The last set of data prep we need to do is to load these new changes into our share table. We’ll again switch to SQL to do this with the following code.

This SQL code will insert new CDF records into our share table. We make sure to only insert records that have a commit version greater than what is currently in the shared table. This way we only add in new change records. We’re also going to exclude records with a change type of update_preimage as we’re only concerned with the update_postimage, insert, and delete records.

Now that we’ve got our data ready, we’re all set to do some data sharing! This is where Unity Catalog with Delta Sharing comes in. We will use some basic SQL commands to create a share, add a table to that share, create a recipient, and grant them access to the share.

In this case, we’re only sharing data with Company2, and want to restrict their access to just the partition with Company2 data (line 10). Once a recipient is created, you will see an activation link when you run DESC RECIPIENT <name>. You must share this link with the recipient, from which they will get access to a credential file that they can download one time only. This file will contain an endpoint and a bearer token giving them access to the data share. The URL’s page looks like this:

The great thing about Delta Sharing is that the recipient isn’t locked into any specific technology to access the data. They can use PowerBI, Tableau, Spark, Pandas, etc. To the recipient, the underlaying data storage is abstracted from them. Here is an example of using Python and Pandas to consume the data we just created.

There is a delta_sharing client used to connect to the data share via our credential file (config.share on line 5) that we just downloaded from the activation link URL. When running the above code, we get the following results:

Data + CDF Option

Now, let’s say you want to share the data table (updated with all changes) and the CDF data. To do this, we will share the source table and skip the step where we create a new table to share the CDF data. We will again use SQL to do this:

When shared with a recipient, we can then run another Python client to get the changes from the table:

Which results in the following records printed to the terminal:

We’ve shown two ways to share CDC data, one with just the CDF feed, and another with the full table + CDF feed. Both of these were consumed by an external to Databricks recipient. We truly live in amazing times!

I want to thank my colleague David Radford for showing me faker, dbldatagen, and sharing the Pandas client code I used in this tutorial.

Also, a big thank you to Hayley Horn, Div Saini, and Tanveer Shaikh for reviewing this blog.

--

--

Eric Tome
Eric Tome

Written by Eric Tome

Solutions Architect at Databricks. I love working with data, math, coding, music, soccer, working out, and video games.

No responses yet