Showing posts with label ETL. Show all posts

Wednesday, December 16, 2020

Extract: Performance Tips

Listed below are some common performance tips for extract queries.
  1. Extract only the required columns and name them explicitly in the query; avoid SELECT *
  2. Extract only the required rows, i.e. the delta only
  3. Avoid unnecessary sorting operations
    • UNION sorts away duplicates; prefer UNION ALL when duplicates are acceptable
    • DISTINCT, GROUP BY, ORDER BY: use only when necessary
  4. Avoid transformation logic in extract queries. This logic should be offloaded to the ETL engine.
  5. Indexes
    • write a proper WHERE clause:
      • use AND
      • avoid OR/NOT/<>/NOT IN
      • be careful using wildcards and ranges
    • create proper indexes
    • maintain indexes regularly
  6. Consider "read uncommitted" (AKA dirty read, NOLOCK). Learn it and use it carefully.
  7. Use the native utility and driver to execute your query; they are usually tuned for better performance.
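Tips 1, 2, 3 and 5 can be illustrated in a few lines. The sketch below uses an in-memory SQLite table; the table, column names and cut-off date are invented for illustration only — your source system, schema and delta column will differ.

```python
import sqlite3

# Hypothetical source table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount REAL,
    updated_at TEXT)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 101, 9.99, "2020-12-01"),
     (2, 102, 25.00, "2020-12-15"),
     (3, 101, 5.50, "2020-12-16")])

# Tip 5: an index on the delta column supports the WHERE clause below.
conn.execute("CREATE INDEX idx_orders_updated ON orders(updated_at)")

# Tips 1-2: name only the columns needed and extract only the delta rows;
# no ORDER BY / DISTINCT (tip 3) and no transformation logic (tip 4).
rows = conn.execute(
    "SELECT order_id, customer_id, amount "
    "FROM orders WHERE updated_at > ?",
    ("2020-12-10",)).fetchall()
print(rows)  # only the rows changed since the last extract
```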

Wednesday, December 2, 2020

Different file encoding and formats

Our ETL programs need to handle different formats and encodings coming from different sources. Listed below are some common technical challenges.

  • Encoding: ASCII vs EBCDIC
  • Character Set: UTF-8, Unicode, etc.
  • Format: Fixed length, Delimited, JSON, XML, Excel Spreadsheet and more
  • Numeric Format: Binary, Packed-decimal, Zoned-decimal and more
Some of the above formats and encodings can be challenging to handle in hand-coded ETL. However, most modern ETL engines can deal with them out of the box.
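Two of the mainframe-flavoured items above can be sketched in a few lines of Python: EBCDIC text (via the standard cp037 codec, one common EBCDIC code page) and packed-decimal (COMP-3) numbers. The decoder below is a simplified illustration, not a full COMP-3 implementation.

```python
# EBCDIC: Python's cp037 codec covers a common EBCDIC code page.
ebcdic_bytes = "HELLO".encode("cp037")
print(ebcdic_bytes.decode("cp037"))

def unpack_comp3(data: bytes) -> int:
    """Decode a packed-decimal field: two decimal digits per byte,
    with the sign in the low nibble of the last byte (0xD = negative)."""
    digits = ""
    for b in data[:-1]:
        digits += f"{b >> 4}{b & 0x0F}"
    last = data[-1]
    digits += str(last >> 4)                     # last byte: one digit...
    sign = -1 if (last & 0x0F) == 0x0D else 1    # ...plus the sign nibble
    return sign * int(digits)

print(unpack_comp3(bytes([0x12, 0x34, 0x5C])))  # 0x...C sign = positive
print(unpack_comp3(bytes([0x00, 0x7D])))        # 0x...D sign = negative
```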

Monday, November 30, 2020

Transfer extracted data and checking

As mentioned in a previous post, the push or publish/subscribe models are preferred. Today's topic is how to transfer the extracted files.
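One part of "checking" a transferred file is verifying its integrity: the receiver recomputes a digest and compares it with the one published by the sender. A minimal sketch, using local files to stand in for the two ends of the transfer:

```python
import hashlib
import pathlib
import tempfile

def sha256_of(path):
    """Digest a file in chunks so large extract files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate sender-side and receiver-side copies of an extract file.
tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "extract.dat"
dst = tmp / "extract_received.dat"
src.write_bytes(b"order_id,amount\n1,9.99\n")
dst.write_bytes(src.read_bytes())  # the "transfer"

# The receiver verifies the digest before processing the file.
transfer_ok = sha256_of(src) == sha256_of(dst)
print("transfer verified" if transfer_ok else "transfer corrupted")
```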

Tuesday, November 24, 2020

ETL: Interface files/tables or pipeline

We have different approaches to implement the ETL process. We may create an end-to-end ETL pipeline that propagates the data straight from extract through transform to load. Another common approach is to save the extracted data in intermediate files or tables before transforming and loading.
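The contrast between the two approaches can be sketched with generators; the data and the staging file here are invented for illustration.

```python
import pathlib
import tempfile

def extract():
    yield from ["1", "2", "3"]          # stand-in for the source query

def transform(rows):
    for r in rows:
        yield int(r) * 10               # stand-in transformation

# Pipeline approach: data flows straight through without touching disk.
pipeline_result = list(transform(extract()))

# Staged approach: extract to an intermediate file first, then
# transform and load from that file in a separate step.
staging = pathlib.Path(tempfile.mkdtemp()) / "stage.txt"
staging.write_text("\n".join(extract()))
staged_result = list(transform(staging.read_text().splitlines()))

print(pipeline_result == staged_result)  # same output, different plumbing
```

The staged variant costs extra I/O but leaves a restart point between extract and transform, which matters for the fault tolerance discussed in a later post.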

Monday, November 23, 2020

Extract: Push, Pull or Publish/Subscribe

There are different extraction models: the push, pull and publish/subscribe models.
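The publish/subscribe model can be sketched with a toy in-memory broker; in practice this role is played by messaging middleware, and the topic name below is invented.

```python
class Broker:
    """Toy stand-in for a message broker."""
    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers.get(topic, []):
            callback(message)

received = []
broker = Broker()
# The ETL process subscribes to changes instead of polling the source.
broker.subscribe("orders", received.append)
# The source system publishes each change as it happens.
broker.publish("orders", {"order_id": 1, "amount": 9.99})
print(received)
```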

Thursday, November 19, 2020

Change Data Capture (CDC)

The concept of incremental extraction is straightforward: we just extract the changed data. Today's post tells you how. How can we identify changed data?
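One common CDC technique when the source offers no reliable audit column is to compare the current snapshot against the previous one, keyed on the primary key. A minimal sketch with invented data:

```python
# Each snapshot maps primary key -> row values.
previous = {1: ("Alice", 100), 2: ("Bob", 200)}           # last extract
current  = {1: ("Alice", 100), 2: ("Bob", 250), 3: ("Carol", 50)}

inserted = {k: v for k, v in current.items() if k not in previous}
updated  = {k: v for k, v in current.items()
            if k in previous and previous[k] != v}
deleted  = {k: v for k, v in previous.items() if k not in current}

print(inserted)  # {3: ('Carol', 50)}
print(updated)   # {2: ('Bob', 250)}
print(deleted)   # {}
```

Snapshot comparison is expensive for large tables; when the source has a reliable last-updated timestamp, log or trigger mechanism, those are usually preferred.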

Wednesday, November 18, 2020

Extract: Full or Incremental (Delta)

Extract is the first step in the ETL process, which obtains data from the source systems. This raises our first question: how much data do we need to extract?

Tuesday, November 17, 2020

What is ETL process?

ETL is the short form of Extract, Transform and Load. These are the processes that propagate data from one or multiple sources to the destination system. Firstly, data are extracted from the source system(s). Transform is the process of converting the extracted data into the required format. It may include data cleansing, combining, splitting, deriving, grouping, sorting, aggregating and looking up. Finally, the transformed data are loaded into the destination system.
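The three steps can be walked through with a toy example. All data, field names and the lookup table below are invented for illustration:

```python
# Extract: rows as they arrive from a source system.
extracted = [
    {"name": " alice ", "country": "US", "amount": "10.5"},
    {"name": "BOB",     "country": "DE", "amount": "3.0"},
]

country_names = {"US": "United States", "DE": "Germany"}  # lookup table

# Transform: cleanse (trim, normalize case), convert types,
# and enrich via the lookup table.
transformed = [
    {
        "name": row["name"].strip().title(),
        "country": country_names[row["country"]],
        "amount": float(row["amount"]),
    }
    for row in extracted
]

# Load: a list stands in for the destination table.
target_table = []
target_table.extend(transformed)
print(target_table[0])
```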

Monday, September 28, 2020

BI/DW Best Practice #14 - Flexible, rerunable and fault tolerance ETL processes

There are several different source systems, so we may encounter errors for different reasons. We have three options to handle them: halt; proceed but reject failed records; and proceed with notification.
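The three options can be sketched in one function; the record format and the `int()` stand-in for the real transform step are invented for illustration.

```python
def process(records, on_error="reject"):
    """on_error: 'halt', 'reject', or 'notify'."""
    loaded, rejected = [], []
    for rec in records:
        try:
            loaded.append(int(rec))          # stand-in transform step
        except ValueError:
            if on_error == "halt":
                raise                        # option 1: stop the batch
            rejected.append(rec)             # options 2-3: set record aside
            if on_error == "notify":
                # stand-in for a real alert (email, monitoring hook, ...)
                print(f"warning: rejected {rec!r}")
    return loaded, rejected

loaded, rejected = process(["1", "x", "3"], on_error="reject")
print(loaded, rejected)  # [1, 3] ['x']
```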

Sunday, September 27, 2020

BI/DW Best Practice #12 - Manage data integrity

The data warehouse / data mart is usually implemented in an RDBMS. The RDBMS provides many features to manage data that are useful for both operational and analytical systems. However, given the different natures of operational and analytical systems, we have different practices to manage our data.

Saturday, September 26, 2020

BI/DW Best Practice #11 - Decouple with source systems

Traditionally, there are data file interfaces between the source system and the DW. The extraction process extracts the data into data files, and the subsequent ETL processes read the extracted files. However, with the support of many ETL tools today, it is possible to extract the source data through a direct connection and stream it through transform and load into the target database. This lets the whole ETL process stream data in memory without intermediate file I/O, which maximizes performance. Should we take this approach? My answer is no.
