Authors: Pulkit Mittal, Utkarsh Dixit, Shikha Bhagat
Change Data Capture
Change Data Capture, or CDC, is a set of design patterns used to detect and track changes made to data in a database. Each change acts as an event-based trigger so that an action can be taken on it. Instead of bulk data loads, CDC performs incremental data loads in near real-time. This data integration approach lets high-velocity data be replicated with low latency in a reliable and scalable way. Using Change Data Capture (CDC), incoming data changes can be delivered to BI (Business Intelligence) tools and coworkers in near real-time.
Types of CDC
- Date Modified
Many applications keep metadata in every record, including the date it was last modified. This approach to CDC tracks when changes are made to a row; a subsequent run filters on the LAST_MODIFIED column to fetch only the rows that changed since the last time data was extracted for replication.
This type of CDC works well for traditional applications that are populated using Extract, Transform and Load (ETL) jobs, provided the source tables never delete data, since a deleted row leaves no LAST_MODIFIED value behind to filter on.
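For example, a minimal sketch of an incremental extraction based on a modified date might look like the following (the table, column, and bind-variable names are illustrative):

```sql
-- Fetch only rows changed since the previous extraction.
-- :last_extraction_time is the high-water mark saved by the previous run.
SELECT *
FROM   orders
WHERE  last_modified > :last_extraction_time;
```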
- Diff
The diff method for CDC compares the current state of the data with its previous state to identify what has changed.
Unlike the Date Modified method, the diff method has no difficulty detecting deleted rows. However, it only works well for low data volumes, because resource consumption grows at least linearly with the volume of data being compared.
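A minimal sketch of the idea, assuming two snapshot tables with identical structure (the names are illustrative; MINUS is Oracle syntax, other databases use EXCEPT):

```sql
-- Rows that are new or changed since the previous snapshot
SELECT * FROM orders_current
MINUS
SELECT * FROM orders_previous;

-- Rows that were deleted, or whose old version was changed
SELECT * FROM orders_previous
MINUS
SELECT * FROM orders_current;
```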
- Triggers
In this approach, database triggers are used to perform CDC by writing changes to shadow tables. The shadow tables may store the entire row to track changes in any column, or only the primary key along with the operation type (insert, update, or delete).
Using database triggers for CDC decreases the overhead of extracting the changes but increases the overhead of recording them, since every change to the source table also fires the trigger and writes to the shadow table.
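As an illustrative sketch in Oracle syntax (the table and shadow-table names are hypothetical), a trigger that records only the primary key and the operation type could look like this:

```sql
CREATE OR REPLACE TRIGGER orders_cdc_trg
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  -- Record the key, the operation type, and a timestamp in the shadow table
  IF INSERTING THEN
    INSERT INTO orders_shadow (order_id, op_type, changed_at)
    VALUES (:NEW.order_id, 'I', SYSTIMESTAMP);
  ELSIF UPDATING THEN
    INSERT INTO orders_shadow (order_id, op_type, changed_at)
    VALUES (:NEW.order_id, 'U', SYSTIMESTAMP);
  ELSE
    INSERT INTO orders_shadow (order_id, op_type, changed_at)
    VALUES (:OLD.order_id, 'D', SYSTIMESTAMP);
  END IF;
END;
/
```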
- Log-based Capture
Databases store all changes in transaction logs so that the committed state of the database can be recovered if it crashes for any reason. Log-based CDC takes advantage of this by reading the changes directly from the logs.
Log-based capture is generally considered the best CDC method and can be applied to any scenario, including systems with very high transaction volumes. Its biggest benefit is its asynchronous nature: changes are captured independently of the source application that performs them.
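To illustrate the principle on Oracle (the source database used later in this article), the transaction log can be inspected with the built-in LogMiner package. This is only a rough sketch of what log-based tools automate; the log file path, schema, and table names are hypothetical:

```sql
-- Register an archived redo log and start a LogMiner session using the online dictionary
EXECUTE DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/u01/arch/arch_1_123.arc', OPTIONS => DBMS_LOGMNR.NEW);
EXECUTE DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);

-- Read the captured changes for a table of interest
SELECT scn, operation, sql_redo
FROM   v$logmnr_contents
WHERE  seg_owner = 'APP' AND table_name = 'ORDERS';

EXECUTE DBMS_LOGMNR.END_LOGMNR;
```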
HVR
HVR is a software tool that performs near-real-time heterogeneous and homogeneous data replication. It uses various CDC methods to replicate changes between databases and directories. HVR also has a compare feature that performs real-time verification to ensure that the source and target locations are in sync. Its replication monitoring lets users actively track the status of replication, including real-time data flow statistics, and the event audit feature securely monitors and logs all actions.
HVR Architecture Diagram
Source: https://www.hvr-software.com/docs/
Connection setup to the HVR demo via Snowflake Partner Connect
1: Log in to Snowflake and change your role to ACCOUNTADMIN.
2: Go to Partner Connect and click on HVR. An account is generated for you with a preconfigured user and role.
3: You will be redirected to the HVR website to sign in using your Snowflake email address and password.
4: Upon logging in successfully, the HVR environment is created. Click on the link, which will take you to the HVR tool environment.
5: Connect the Oracle database as the source database. In the locations configuration section, right-click and select the New Location option, then fill in the necessary information for your source database. We are using an Oracle database for this article.
6: Test your connection with the Oracle database: after entering the required information, scroll down to the TEST CONNECTION option.
7: Connect your Snowflake account to HVR as the target database. Following similar steps as before, go to Locations > New and add the details for your Snowflake account, user, role, and database. Test the connection by scrolling down and clicking the TEST CONNECTION button.
Sample replication test cases
- Slicing large tables using the Modulo refresh feature of HVR.
When replicating very large tables, it is considered good practice to divide them into smaller slices and run them sequentially. This puts less load on the database and performs better than extracting the whole data set at once (a sketch of what each slice selects follows these steps).
- In HVR, go to the channel name and right-click to select the HVR Refresh option.
- Select the table you want to refresh and click on SCHEDULING.
- Select the ‘Schedule classic job’ checkbox to create jobs.
- Select the ‘Slice table’ checkbox and the ‘Modulo’ radio button.
- Enter the number of slices/refresh jobs you want to create.
- Select the column on which the slices should be made, generally the primary key of the table, using the drop-down menu, and click 'Schedule'.
- On clicking Schedule, a popup appears stating that the jobs have been created.
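Conceptually, each slice job covers a disjoint subset of rows selected by a modulo condition on the chosen column. A hedged illustration of what one slice effectively selects (HVR generates the actual slice conditions internally; the table and column names are illustrative):

```sql
-- Slice 0 of 4 over the primary key column ORDER_ID;
-- the other three jobs use MOD(order_id, 4) = 1, 2 and 3.
SELECT *
FROM   orders
WHERE  MOD(order_id, 4) = 0;
```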
- Changing the data type of a column, and adding a new column that stores the name of the source table.
- Go to Channel definitions
- Click on tables
- Select the table which you want to edit
- Right-click on the table name > New Action > Column Properties.
- Select the Group as your Target Group.
- Choose the column name for which you want to change the data type.
- Select Data Type and Length.
- Run HVR Initialize so that the changes are reflected in the target group (Snowflake).
- Reload the table using HVR Refresh and the newly replicated table on Snowflake should contain the changed data type for the column.
In this example, we have changed the data type of our column from char to varchar2.
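To confirm the change on the Snowflake side after the refresh, the replicated table definition can be inspected (the table and column names are illustrative):

```sql
-- Show the replicated table's columns and their data types in Snowflake
DESC TABLE orders;

-- Or query the information schema for a single column
SELECT column_name, data_type, character_maximum_length
FROM   information_schema.columns
WHERE  table_name = 'ORDERS' AND column_name = 'CUSTOMER_NAME';
```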
Adding a column in the target database that stores the name of the current source table
- In HVR, select the table > New Action > Column Properties.
- In the Column Properties menu, select the following options:
- Group (your target group)
- /Name (the name of the new column you want to create)
- /Extra
- /IntegrateExpression ({hvr_tbl_name}, HVR's substitution for the name of the current table)
- /Datatype (as appropriate for your column)
- /Length (for character data types)
- Now we need to initialize the channel for the changes to take effect.
- Now we need to do an HVR Refresh so that the changes are reflected in our target database (Snowflake). Go to the channel > right-click > HVR Refresh.
Select the following options in the HVR Refresh window.
- Source (Oracle)
- Target (Snowflake database)
- Table name
- Create Absent Tables
- Once you hit Refresh, a window shows that the refresh operation has completed, along with details about it.
- Now you can see the changes reflected in Snowflake.
- We have successfully added a column to our target table that stores the current table name from the source.
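A quick way to verify the result in Snowflake (the table name and the name of the new column are illustrative; the new column is the one defined with /Name above):

```sql
-- Every replicated row should carry the source table name in the extra column
SELECT source_table_name, COUNT(*) AS row_count
FROM   orders
GROUP BY source_table_name;
```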
- Concatenate two columns to create a new column in the target table.
- We will first manually create a column in our table. Since we are combining a varchar2 column and a char column, we will use varchar2 as the data type of the target column as well.
- Now create a new action: go to the table > New Action > ColumnProperties.
- On this screen:
- Select /Name (the name of the target column).
- Select Capture Expression and enter the following expression: trim({column1})||'-'||trim({column2}). Here column1 and column2 are the names of the columns you want to concatenate.
- Select the data type as varchar2 and the length as 400.
- Now we need to initialize our channel.
- After we have initialized the channel, we need to create a Refresh job. Once the job has run, the changes will be reflected in our target database.
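To sanity-check the result, the new column on the target can be compared with the same expression evaluated on the source (all table and column names are illustrative):

```sql
-- On the Oracle source: what the capture expression produces for a few rows
SELECT trim(column1) || '-' || trim(column2) AS expected_value
FROM   orders
WHERE  ROWNUM <= 10;

-- On the Snowflake target: the replicated concatenated column
SELECT concat_column
FROM   orders
LIMIT  10;
```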