Thursday, November 19, 2020

Change Data Capture (CDC)

The concept of incremental extraction is strict forward. We just extract the changed data. Today's post is going to tell you how. How can we identify changed data?

We call the identification process as the change detection. And the process of extracting changed data is called Changed Data Capture (CDC). As far as I know, there are four methods to implement it. 

1. Trigger

Use DB Trigger to capture those changed records. Some people may copy the changed (inserted/updated/deleted) records into another table pending for extraction with an extra column to mark their operation (I/U/D). Other people may copy only the primary key without copying all columns. There are different implementation of this method. 

2. Audit columns

People use the audit columns (e.g. the "last modified date") to identify the changed records. This audit columns may be built-in and controlled by the corresponding system. Or, it may be added and controlled by a DB trigger.

3. Snapshot Compare

We capture the full snapshot of the table regularly. And do a table comparison between snapshots. So, this method is still expensive, just like a full extraction. We may use this method for smaller tables only.

4. From transaction log

It means to scrap the changes from the DB transaction log (aka redo log). We usually won't implement such a scrapping program by ourselves. So, it was technically a method but not practical. However, many brands offer these solutions in recent 15 years. (e.g. SQL Server CDC, IBM CDC replication, AWS Data Migration Service)

No comments:

Post a Comment

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...