Tuesday, November 17, 2020

What is ETL process?

ETL is the short form of Extract, Transform and Load. They are the processes to propagate data from single or multiple sources to the destination system. Firstly, data are extracted from the source system(s). Transform is the process to convert the extracted data into the required format. It may include data cleansing, combining, splitting, deriving, grouping, sorting, aggregating and lookup(ing). Finally, the transformed data are loaded into the destination system.

The most common use cases of ETL processes is in the Data Warehouse. ETL processes are developed to extract source data from various systems. And then transform and load into the data warehouse. Other common use cases are data migration or other data propagation between systems.

In this series, I am going to introduce some ETL techniques and practices.

Extract:

Transform:

  • Data Transformation
  • Data cleansing
  • Late-Arriving Dimension - Strategy 1 - Reject
  • Late-Arriving Dimension - Strategy 2 - Load as unknown dimension
  • Late-Arriving Dimension - Strategy 3 - Load with dimension augmentation
  • Validation
  • Performance Tips

Load:

  • Loading Strategies
  • Loading Performance - General
  • Loading Performance - Insert
  • Loading Performance - Update
  • Loading Performance - Delete
  • Loading by multiple steps
  • Loading Type 2 dimension
  • Loading partitioned table

Overall ETL:

  • Modularized and reusable
  • Restart-able and recovery
  • Error Handling
  • Data integrity
  • Error Reports
  • Audit Log and notification
  • Notification - Exception only?
  • Parallel
  • Dependency
  • Automated maintenance
  • Troubleshooting Performance Problems
  • Common tuning tips in ETL engine
  • ETL tool or hand-coded
  • Reliable workflow
  • Teamwork
  • ETL Framework

No comments:

Post a Comment

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...