Monday, November 30, 2020

Transfer extracted data and checking

As mentioned in previous post, pushing or the publish/subscribe model are preferred. Todays topic is how to transfer the extracted files.

Tuesday, November 24, 2020

ETL: Interface files/tables or pipeline

We have different approaches to implement the ETL process. We may create an end-to-end ETL pipeline to propagate the data from extract, transform to load. Another common approach is to save the extracted data in some intermediate files or tables before transforming and loading.

Monday, November 23, 2020

Extract: Push, Pull or Publish/Subscribe

There are different extraction models. They are pushing, pulling and publish/subscribe models.

Thursday, November 19, 2020

Change Data Capture (CDC)

The concept of incremental extraction is strict forward. We just extract the changed data. Today's post is going to tell you how. How can we identify changed data?

Wednesday, November 18, 2020

Extract: Full or Incremental (Delta)

Extract is the first step in the ETL process which obtains data from the source systems. This is our first question. How much data we need to extract?

Tuesday, November 17, 2020

What is ETL process?

ETL is the short form of Extract, Transform and Load. They are the processes to propagate data from single or multiple sources to the destination system. Firstly, data are extracted from the source system(s). Transform is the process to convert the extracted data into the required format. It may include data cleansing, combining, splitting, deriving, grouping, sorting, aggregating and lookup(ing). Finally, the transformed data are loaded into the destination system.

Tuesday, November 10, 2020

Table Partitioning in Fact tables

Table partition is to break down the big table into smaller sub-tables, called partitions. Without table distribution, although there may be still single processing node, we may get performance gain from multi-threading and parallel IO. We may achieve better performance from using a multi-cored CPU and distribute the partitions in different or parallel disk device (e.g. SAN, RAID 10, etc.).

Monday, November 9, 2020

Table Distribution in dimensional data modeling

Some advanced RDBMS supports table distribution and partition. They are great features to help managing huge volume of data. This post is focusing in the table distribution.

The data are distributed across multiple processing nodes. Ideally, when we have huge amount of data, each processing node keeps portions of data and computes independently in parallel to give better overall performance. Practically, data redistribution may be happened to shuffling data across nodes if required for the computation. So, a proper distribution key and distribution style are important. They ensure the data are evenly distributed (load balancing) and reduce the data shuffling across nodes. Be remember that a wrong distribution key and style selected may be even worse to the query performance. 

Common distribution styles include round-robin, random, hash and replicated. In DW, we usually replicate the entire dimension tables (small tables) in all nodes and hash distribute the fact table (huge size table) across nodes. So, there will be portion of fact tables and entire dimension tables in each node. Under the star or snowflake schema structure, it is the simplest way to avoid shuffling.

A good distribution key can help data distributed evenly. We also usually choose a less frequently used filtering key. Otherwise, for example, if we use date key as the distribution key, all of the processing nodes, except one, will be idled when we analyst sales transaction of a single day. With proper selected distribution style and key in your data model, you may get benefit from distributing huge size dimension table across node.

Some quick guidelines for Table Distribution:

  •  Choose proper Distribution style
    •  Hash for fact table
    •  Replicated for dimension table
  •  Choose proper Distribution key for fact table
    •  Evenly distribution perspective
      •  many unique values
      •  No or less null value
    •  Avoid data shuffling between nodes or node idle
      •  choose field is usually for sorting (distinct, group by, order by)
      •  choose field is not usually in filtering (where) condition (e.g. don't use date key)

Wednesday, November 4, 2020

Constraints and indexes of Fact and Dimension Table

We usually implement the multi-dimensional data model in RDBMS or CUBE. Today, we focus on the Constraints and indexes in multi-dimensional data modeling in RDBMS.

In the market, there are different RDBMS providers and their internal design and implementation are different. So, today, we are just sharing the high level concepts. In practice, we should spend time to understand our selected RDBMS and adjust our implementation.

In OLTP applications, constraints are used to maintain the data integrity and quality (e.g. business rules) of the data in the database. Indexes are added to improve the querying performance. In OLAP applications, we usually have much more data and our design strategies are focus on the performance. The first concern is the querying performance, the second is the loading performance (which reduce the refresh downtime). For data integrity and quality, they are still important. However, constraints usually don't improve querying performance but impact the data loading performance. Instead, as the batch processing nature of DW, we likely centralize the validation in batches without enforcing constraints in DB. 

Tuesday, November 3, 2020

Outrigger Dimension

It is a dimension table doesn't associate with fact table directly, but other dimension tables. We may have different scenarios to make a outrigger dimension. 

Monday, November 2, 2020

Audit Dimension

In traditional data modeling, we usually add "created date", "last modified date", "created by" and "last modified by" columns in all tables to capture the audit information. In dimensional data modeling, such design may still be ok for the dimension table since they are usually short. However, it could be a problem if we manage our fact table in this way. It is why the audit dimension is introduced. Fact table contains millions of records or even more. These columns share the same values in all the rows from the same batch of data loading. So, we could have a better design to consolidate the audit information by audit dimension. 

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...