
Tuesday, November 10, 2020

Table Partitioning in Fact tables

Table partitioning breaks a big table down into smaller sub-tables, called partitions. Even without table distribution, where there may still be only a single processing node, we may gain performance from multi-threading and parallel I/O. We can achieve better performance by using a multi-core CPU and spreading the partitions across different or parallel disk devices (e.g. SAN, RAID 10, etc.).
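As a minimal sketch, this is how monthly range partitioning might look in PostgreSQL 10+ declarative syntax (the table and column names here are hypothetical, and other RDBMSs use different syntax):

    -- Parent table is partitioned by the sale date.
    CREATE TABLE fact_sales (
        sale_date  date          NOT NULL,
        store_key  int           NOT NULL,
        amount     numeric(12,2)
    ) PARTITION BY RANGE (sale_date);

    -- One partition per month; queries filtering on sale_date can be
    -- pruned to the matching partitions and scanned in parallel.
    CREATE TABLE fact_sales_2020_10 PARTITION OF fact_sales
        FOR VALUES FROM ('2020-10-01') TO ('2020-11-01');
    CREATE TABLE fact_sales_2020_11 PARTITION OF fact_sales
        FOR VALUES FROM ('2020-11-01') TO ('2020-12-01');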

Monday, November 9, 2020

Table Distribution in dimensional data modeling

Some advanced RDBMSs support table distribution and partitioning. They are great features that help manage huge volumes of data. This post focuses on table distribution.

The data are distributed across multiple processing nodes. Ideally, when we have a huge amount of data, each processing node keeps a portion of the data and computes independently, in parallel, to give better overall performance. In practice, data redistribution may happen: data are shuffled across nodes when the computation requires it. So, a proper distribution key and distribution style are important. They ensure the data are evenly distributed (load balancing) and reduce data shuffling across nodes. Remember that a poorly chosen distribution key and style may even hurt query performance.

Common distribution styles include round-robin, random, hash and replicated. In a DW, we usually replicate the entire dimension tables (small tables) on all nodes and hash-distribute the fact table (the huge table) across nodes. So, each node holds a portion of the fact table plus complete copies of the dimension tables. Under a star or snowflake schema structure, this is the simplest way to avoid shuffling (see the DDL sketch after the guidelines below).

A good distribution key helps the data distribute evenly. We also usually choose a key that is rarely used for filtering. Otherwise, for example, if we use the date key as the distribution key, all of the processing nodes except one will sit idle when we analyze the sales transactions of a single day. With a properly selected distribution style and key in your data model, you may even benefit from distributing a huge dimension table across nodes.

Some quick guidelines for Table Distribution:

  •  Choose a proper distribution style
    •  Hash for fact tables
    •  Replicated for dimension tables
  •  Choose a proper distribution key for the fact table
    •  Even distribution perspective
      •  Many unique values
      •  No or few null values
    •  Avoid data shuffling between nodes and idle nodes
      •  Choose a field that is often used for sorting (distinct, group by, order by)
      •  Choose a field that is rarely used in filtering (where) conditions (e.g. don't use the date key)
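As a sketch of the replicate-dimension / hash-fact pattern described above, here is Azure Synapse-style DDL; the WITH (DISTRIBUTION = ...) clause is Synapse-specific (Amazon Redshift expresses the same idea with DISTSTYLE ALL and DISTSTYLE KEY), and the table names are hypothetical:

    -- Small dimension table: a full copy on every node, so joins to
    -- it never need to move data between nodes.
    CREATE TABLE dim_store (
        store_key int NOT NULL,
        region    varchar(50)
    )
    WITH ( DISTRIBUTION = REPLICATE );

    -- Huge fact table: hash-distributed on store_key, a column with
    -- many unique values that is rarely used as a filter.
    CREATE TABLE fact_sales (
        date_key  int NOT NULL,
        store_key int NOT NULL,
        amount    decimal(12,2)
    )
    WITH ( DISTRIBUTION = HASH (store_key) );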

Wednesday, November 4, 2020

Constraints and indexes of Fact and Dimension Tables

We usually implement the multi-dimensional data model in an RDBMS or a CUBE. Today, we focus on constraints and indexes for multi-dimensional data modeling in an RDBMS.

In the market, there are different RDBMS providers, and their internal designs and implementations differ. So, today, we are just sharing the high-level concepts. In practice, we should spend time understanding our selected RDBMS and adjust our implementation accordingly.

In OLTP applications, constraints maintain the integrity and quality (e.g. business rules) of the data in the database, and indexes are added to improve query performance. In OLAP applications, we usually have much more data, and our design strategies focus on performance. The first concern is query performance; the second is loading performance (which reduces refresh downtime). Data integrity and quality are still important. However, constraints usually don't improve query performance, yet they do slow down data loading. Instead, given the batch-processing nature of a DW, we typically centralize validation in batches without enforcing constraints in the database.
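As a sketch of that batch-validation approach (the staging and dimension table names are hypothetical): instead of enforcing a foreign key on the fact table, check the whole batch for orphan keys once, before loading.

    -- Rows in the staging fact table whose store_key has no match in
    -- the dimension; the load proceeds only when this returns no rows.
    SELECT s.store_key, COUNT(*) AS orphan_rows
    FROM stg_fact_sales s
    LEFT JOIN dim_store d ON d.store_key = s.store_key
    WHERE d.store_key IS NULL
    GROUP BY s.store_key;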

Wednesday, October 7, 2020

Hash join and dimensional data modeling

You may have heard that we always recommend using a star schema instead of a snowflake. We recommend limiting the dimension table size (e.g. short dimension tables). We also recommend limiting the number of dimension tables associated with the fact table (e.g. a thin fact table). The reason behind these recommendations is to prepare for the hash join algorithm: we optimize the data model to ensure high-performance hash joins in user queries.
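As an illustration (the schema is hypothetical), a typical star-join query looks like this; with short dimension tables, the optimizer can build a small in-memory hash table per dimension and stream the thin fact table through them in a single pass:

    -- Each small dimension becomes a hash-join build side;
    -- the fact table is the probe side, scanned once.
    SELECT d.year, p.category, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key    = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category;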

Tuesday, October 6, 2020

Basics of joining algorithms

This article introduces three basic joining algorithms: nested loop join, merge join and hash join. It gives the basic ideas of the algorithms without going into the fine details and their variations. Pay particular attention to the hash join: it is a useful algorithm for dimensional data modeling, and many modeling techniques aim to optimize the data model for a high-performance hash join.
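If you want to see which join algorithm your optimizer actually picks, most RDBMSs can show the query plan. A PostgreSQL-style sketch (table names hypothetical):

    EXPLAIN
    SELECT f.amount, s.region
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key;
    -- In PostgreSQL, the plan shows a "Hash Join" node when the
    -- planner builds a hash table on the smaller input (the build
    -- side) and probes it with rows from the larger one (the probe
    -- side).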

Extract: Performance Tips

Below are some common performance tips for extract queries. Extract required columns only and specify them in the query; avoid select *. Extrac...
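As a quick illustration of the first tip (the table is hypothetical), name the needed columns explicitly instead of pulling every column across the wire:

    -- Prefer this:
    SELECT order_id, sale_date, amount FROM sales;
    -- Over this:
    SELECT * FROM sales;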