Tuesday, November 24, 2020

ETL: Interface files/tables or pipeline

We have different approaches to implementing the ETL process. One is to build an end-to-end ETL pipeline that propagates the data straight from extract through transform to load. Another common approach is to save the extracted data in intermediate interface files or tables before transforming and loading it.

The former approach clearly performs better: it avoids the write and read I/O on the intermediate interface files or tables.
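To make the contrast concrete, here is a minimal sketch of the end-to-end pipeline approach, with extract, transform, and load chained as Python generators so each record streams through memory with no intermediate file. The file names and the "amount" column are hypothetical placeholders, not anything from a real system.

```python
import csv

def extract(path):
    """Yield raw rows from a source CSV file."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def transform(rows):
    """Apply transformations row by row, e.g. normalizing a column."""
    for row in rows:
        row["amount"] = float(row["amount"])  # assumed numeric column
        yield row

def load(rows, out_path):
    """Write transformed rows straight to the target file."""
    rows = iter(rows)
    try:
        first = next(rows)
    except StopIteration:
        return  # nothing extracted, nothing to load
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=first.keys())
        writer.writeheader()
        writer.writerow(first)
        writer.writerows(rows)

# One pass from source to target: no intermediate IO,
# but also no restart point if anything fails midway.
load(transform(extract("source.csv")), "target.csv")
```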

The latter approach is better for troubleshooting and restartability. It makes a partial rerun easy to implement with the existing code: we don't need to restart the whole pipeline when it halts on an error. We can inspect the extracted data file to troubleshoot, fix the transformation job by adding the corresponding handling, and finally restart the ETL partially from the transformation step. Although it doesn't achieve the best possible performance, we usually recommend this approach because it decouples the systems. It is also why we recommended the pull and publish/subscribe models discussed in a previous post.
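Below is a minimal sketch of the staged approach, assuming a local staging directory and a trivial CSV transformation for illustration. Because each step persists its output as an interface file, a failed run can be restarted at the transform step without re-extracting; the file names, step functions, and "amount" column are all assumptions.

```python
import csv
import os

STAGING = "staging"  # assumed location for the interface files
EXTRACTED = os.path.join(STAGING, "extracted.csv")
TRANSFORMED = os.path.join(STAGING, "transformed.csv")

def extract_step(source="source.csv"):
    """Copy raw data from the source into the staging interface file."""
    os.makedirs(STAGING, exist_ok=True)
    with open(source, newline="") as src, open(EXTRACTED, "w", newline="") as dst:
        dst.writelines(src)

def transform_step():
    """Read the staged extract, clean the rows, write the transformed file."""
    with open(EXTRACTED, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return
    for row in rows:
        row["amount"] = float(row["amount"])  # assumed numeric column
    with open(TRANSFORMED, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def load_step(target="target.csv"):
    """Load the transformed interface file into the target."""
    with open(TRANSFORMED, newline="") as src, open(target, "w", newline="") as dst:
        dst.writelines(src)

STEPS = [("extract", extract_step), ("transform", transform_step), ("load", load_step)]

def run_etl(start_from="extract"):
    """Run the pipeline, optionally skipping steps that already succeeded."""
    names = [name for name, _ in STEPS]
    for _, step in STEPS[names.index(start_from):]:
        step()

# Normal run:
#   run_etl()
# After fixing the transformation job, rerun only from the failed step:
#   run_etl(start_from="transform")
```

The staged files do cost extra IO, but they double as both a troubleshooting artifact and a restart checkpoint, which is the decoupling benefit described above.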

The upstream application owners may not always be aware of our extraction jobs. When they also own the extraction jobs (e.g., pushing or publishing the data), they are always aware of and accountable for these jobs in their regular operations. That ensures they will notify the subscribers and receivers of any scheduled downtime, and they are unlikely to miss the extraction interface in their impact analysis when they make system changes. (See also our best practice advice.)
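As a rough sketch of that publish/subscribe idea, the upstream owner can publish a "file ready" event that the downstream ETL consumes, instead of the ETL silently pulling from an interface it hopes still exists. Python's in-process queue stands in for a real message broker here, and the event fields are illustrative assumptions.

```python
import json
import queue

broker = queue.Queue()  # stand-in for a real messaging system

def publish_extract_ready(path, business_date):
    """Upstream side: the application owner announces a finished extract,
    so the interface is part of their regular operations."""
    broker.put(json.dumps({"event": "extract_ready",
                           "path": path,
                           "business_date": business_date}))

def consume_one():
    """Downstream side: the ETL reacts to the notification rather than polling."""
    event = json.loads(broker.get())
    if event["event"] == "extract_ready":
        print(f"start transform for {event['path']} ({event['business_date']})")

publish_extract_ready("staging/extracted.csv", "2020-11-24")
consume_one()
```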

In some cases, we may still choose the pipelining approach. For example, it can be suitable for a small project, or wherever the internal communication between stages is reliably stable.
