ETL stands for Extract, Transform and Load: the process of propagating data from one or more source systems to a destination system. First, data is extracted from the source system(s). Transform converts the extracted data into the required format; it may include cleansing, combining, splitting, deriving, grouping, sorting, aggregating and looking up. Finally, the transformed data is loaded into the destination system.
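As a minimal sketch of the three steps in Python, assuming a hypothetical CSV source file and a SQLite destination (the file layout, table and column names are illustrative, not from any particular system):

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse and convert fields before loading."""
    out = []
    for row in rows:
        name = row["name"].strip().title()       # cleansing
        if not name:                             # drop invalid rows
            continue
        amount = round(float(row["amount"]), 2)  # type conversion
        out.append((name, amount))
    return out

def load(rows, conn):
    """Load: insert the transformed rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
```

Real pipelines differ mainly in scale and robustness, not in this basic shape: each step stays separate so it can be tested, restarted and tuned on its own.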
The most common use case of ETL processes is the data warehouse: ETL processes extract data from various source systems, then transform and load it into the warehouse. Other common use cases include data migration and other data propagation between systems.
In this series, I am going to introduce some ETL techniques and practices.
Extract:
- Full or Incremental (Delta)
- Change Data Capture (CDC)
- Pull, Push or Publish/Subscribe
- Interface files/tables or pipeline
- Transfer extracted data
- Staging area
- Format and encoding of extracted data
- Performance Tips
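As a taste of the first topic, an incremental (delta) extract can be sketched by keeping a high-water mark of the last extracted change timestamp, so each run pulls only rows changed since the previous run (the table and column names are illustrative assumptions):

```python
import sqlite3

def extract_incremental(conn, watermark):
    """Delta extract: pull only rows changed since the last run."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest row seen, so the
    # next run starts where this one left off.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark
```

The returned watermark must be persisted between runs; losing it forces a full extract.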
Transform:
- Data Transformation
- Data cleansing
- Late-Arriving Dimension - Strategy 1 - Reject
- Late-Arriving Dimension - Strategy 2 - Load as unknown dimension
- Late-Arriving Dimension - Strategy 3 - Load with dimension augmentation
- Validation
- Performance Tips
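To give a flavour of the late-arriving-dimension strategies above, here is a sketch of Strategy 2: a fact row whose dimension member has not arrived yet is pointed at a reserved "unknown" member instead of being rejected (the key value -1 is an assumed convention, not a standard):

```python
UNKNOWN_KEY = -1  # reserved surrogate key for the "unknown" member

def resolve_dimension(natural_key, dim_lookup):
    """Return the surrogate key for a fact row's natural key.

    Strategy 2: if the dimension row has not arrived yet, map
    the fact to the reserved unknown member; a later pass can
    re-point it once the real dimension row lands.
    """
    return dim_lookup.get(natural_key, UNKNOWN_KEY)
```

The trade-off versus Strategy 1 (reject) is that the fact table stays complete, at the cost of a follow-up job to re-point the unknown rows.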
Load:
- Loading Strategies
- Loading Performance - General
- Loading Performance - Insert
- Loading Performance - Update
- Loading Performance - Delete
- Loading by multiple steps
- Loading Type 2 dimension
- Loading partitioned table
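Loading a Type 2 (slowly changing) dimension, one of the topics above, can be sketched as expiring the current version of a row and inserting a new one, so history is preserved rather than overwritten (the table and column names are illustrative):

```python
import sqlite3

def load_scd2(conn, natural_key, name, effective_date):
    """Apply a Type 2 change: close the current version, insert a new one."""
    cur = conn.execute(
        "SELECT name FROM dim_customer WHERE natural_key = ? AND is_current = 1",
        (natural_key,),
    ).fetchone()
    if cur is not None and cur[0] == name:
        return  # attribute unchanged, nothing to do
    if cur is not None:
        # Expire the old version instead of overwriting it.
        conn.execute(
            "UPDATE dim_customer SET is_current = 0, end_date = ? "
            "WHERE natural_key = ? AND is_current = 1",
            (effective_date, natural_key),
        )
    conn.execute(
        "INSERT INTO dim_customer "
        "(natural_key, name, start_date, end_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (natural_key, name, effective_date),
    )
    conn.commit()
```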
Overall ETL:
- Modularized and reusable
- Restartability and recovery
- Error Handling
- Data integrity
- Error Reports
- Audit Log and notification
- Notification - Exception only?
- Parallelism
- Dependency
- Automated maintenance
- Troubleshooting Performance Problems
- Common tuning tips in ETL engine
- ETL tool or hand-coded
- Reliable workflow
- Teamwork
- ETL Framework
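As a small illustration of restartability from the list above, a job can record a checkpoint after each completed step so that a rerun after a failure skips work already done (the JSON checkpoint file is an assumed format for this sketch):

```python
import json
import os

def run_with_checkpoints(steps, checkpoint_path):
    """Run named steps in order, skipping those already checkpointed."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    for name, func in steps:
        if name in done:
            continue  # completed in a previous run
        func()
        done.add(name)
        # Persist progress after each step so a crash can resume here.
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)
```

For this to be safe, each step must be idempotent or the checkpoint written atomically with the step's own output; both options will come up again in the restartability post.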