Wednesday, October 7, 2020

Hash join and dimensional data modeling

You may have heard that we always recommend using star schema instead of snowflake. We recommend limiting the dimension table size (e.g. short dimension table). We also recommend limiting the number of dimension tables associating to the fact table (e.g. thin fact table). The reasons behind are preparing for using the hash join algorithm. We optimize the data model to ensure the high performance hash join in user queries.

Shown in previous post, hash join takes O(N) to join a big and a smaller table. How it works in the query with many tables under the multi-dimensional data model? In hash join, ideally, query optimizer put the smaller tables into memory (without paging) to build hash tables. Dimension tables are these smaller tables. There should be M dimension tables and 1 fact table in any query under multi-dimensional data model. So, it loads M dimension tables into memory, fetch and compare each row with each hash table from fact table. It takes O(M*N) where M is the number of dimension tables and N is the row count in fact table.

There are three key points from the above illustration. Firstly, the data model should be simple enough to avoid any surprise in the joining algorithm selected by the query optimizer. It is why we always recommend using star schema. Secondly, the size of tables are important. Huge size fact table is allowed. However, the dimension tables should be small enough to be fitted into memory. Otherwise, paging is required which impact on the performance. Lastly, the number of dimension tables is a matter. The complexity is O(M*N). The fewer number of dimension tables, the better the run time performance. These three points are how we measure the efficiency of a dimensional data model.

  • star schema
  • short dimension table (it is ok to be a little wide)
  • thin fact table (it is ok to be long)

Modeling techniques are developed to make them happen. For example, flattened hierarchy is helpful to keep the data model de-normalized in star schema. Mini-dimension (type 4) is recommended for implementing rapidly changing dimensions which avoid the dimension growing too big. Junk dimension is useful to keep the fact table thin.

After understanding the hash join, we may notice how nice the modeling techniques are. And we may also understand some traps and challenges in dimensional data modeling. For example,

  • Why we avoid the trap of "fact table joins fact table"
  • Why huge size dimensions is challenging (e.g. customer dimension is sometime naturally huge in size in some business)

No comments:

Post a Comment

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...