Monday, November 9, 2020

Table Distribution in dimensional data modeling

Some advanced RDBMS supports table distribution and partition. They are great features to help managing huge volume of data. This post is focusing in the table distribution.

The data are distributed across multiple processing nodes. Ideally, when we have huge amount of data, each processing node keeps portions of data and computes independently in parallel to give better overall performance. Practically, data redistribution may be happened to shuffling data across nodes if required for the computation. So, a proper distribution key and distribution style are important. They ensure the data are evenly distributed (load balancing) and reduce the data shuffling across nodes. Be remember that a wrong distribution key and style selected may be even worse to the query performance. 

Common distribution styles include round-robin, random, hash and replicated. In DW, we usually replicate the entire dimension tables (small tables) in all nodes and hash distribute the fact table (huge size table) across nodes. So, there will be portion of fact tables and entire dimension tables in each node. Under the star or snowflake schema structure, it is the simplest way to avoid shuffling.

A good distribution key can help data distributed evenly. We also usually choose a less frequently used filtering key. Otherwise, for example, if we use date key as the distribution key, all of the processing nodes, except one, will be idled when we analyst sales transaction of a single day. With proper selected distribution style and key in your data model, you may get benefit from distributing huge size dimension table across node.

Some quick guidelines for Table Distribution:

  •  Choose proper Distribution style
    •  Hash for fact table
    •  Replicated for dimension table
  •  Choose proper Distribution key for fact table
    •  Evenly distribution perspective
      •  many unique values
      •  No or less null value
    •  Avoid data shuffling between nodes or node idle
      •  choose field is usually for sorting (distinct, group by, order by)
      •  choose field is not usually in filtering (where) condition (e.g. don't use date key)

No comments:

Post a Comment

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...