You may have heard that we always recommend using a star schema instead of a snowflake schema. We recommend limiting the size of dimension tables (i.e. short dimension tables). We also recommend limiting the number of dimension tables joined to the fact table (i.e. a thin fact table). The reason behind all of these recommendations is the hash join algorithm: we optimize the data model so that user queries can run as high-performance hash joins.
There are three key points from the above illustration. Firstly, the data model should be simple enough to avoid any surprises in the join algorithm selected by the query optimizer; this is why we always recommend a star schema. Secondly, table size matters. A huge fact table is fine, but each dimension table should be small enough to fit into memory; otherwise the join has to page or spill to disk, which hurts performance. Lastly, the number of dimension tables matters too: the join cost is roughly O(M*N), where M is the number of fact rows and N is the number of dimension tables, so the fewer dimension tables, the better the run-time performance (see the sketch after the list below). These three points are how we measure the efficiency of a dimensional data model:
- star schema
- short dimension table (it is ok to be a little wide)
- thin fact table (it is ok to be long)
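To make the build/probe mechanics concrete, here is a minimal sketch of a hash join in Python under some simplifying assumptions: tables are plain lists of dicts, and the table and column names (fact_sales, dim_product, product_key) are hypothetical.

```python
# A minimal hash-join sketch: tables are plain lists of dicts.
# Table and column names here are made up for illustration.

def hash_join(fact_rows, dim_rows, key):
    # Build phase: hash the small dimension table into memory.
    # This is why we want the dimension table to stay small.
    dim_index = {row[key]: row for row in dim_rows}

    # Probe phase: stream the (possibly huge) fact table once and
    # look up each row's key in the in-memory hash table.
    for fact in fact_rows:
        dim = dim_index.get(fact[key])
        if dim is not None:
            yield {**fact, **dim}

# Joining N dimensions means N build/probe passes over M fact rows,
# which is where the rough O(M*N) cost mentioned above comes from.
dim_product = [{"product_key": 1, "category": "Books"},
               {"product_key": 2, "category": "Games"}]
fact_sales = [{"product_key": 1, "amount": 9.5},
              {"product_key": 2, "amount": 20.0},
              {"product_key": 1, "amount": 4.0}]

for joined in hash_join(fact_sales, dim_product, "product_key"):
    print(joined)
```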
Modeling techniques have been developed to make these properties happen. For example, a flattened hierarchy helps keep the data model de-normalized in a star schema. A mini-dimension (type 4) is recommended for implementing rapidly changing dimensions, which prevents the dimension table from growing too big. A junk dimension is useful for keeping the fact table thin.
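As a rough illustration of the junk-dimension idea (not tied to any specific tool), the sketch below collapses a few hypothetical low-cardinality flags into a single surrogate key, so the fact table carries one small key instead of several flag columns.

```python
# Junk-dimension sketch: the flag names and values are made up.
from itertools import product

flags = {
    "is_gift": [True, False],
    "payment_type": ["card", "cash"],
    "is_online": [True, False],
}

# Enumerate every flag combination once and assign it a surrogate key.
junk_dim = {}
for i, combo in enumerate(product(*flags.values()), start=1):
    junk_dim[combo] = i

def junk_key(fact_row):
    # Replace several flag columns on the fact table with one small key.
    combo = tuple(fact_row[name] for name in flags)
    return junk_dim[combo]

print(junk_key({"is_gift": True, "payment_type": "cash", "is_online": False}))
```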
After understanding the hash join, we may notice how well these modeling techniques work. We may also understand some of the traps and challenges in dimensional data modeling. For example,
- Why we avoid the trap of "fact table joins fact table"
- Why huge dimensions are challenging (e.g. the customer dimension is naturally huge in some businesses); see the sketch below
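As a rough illustration of the second point, the back-of-envelope check below (with made-up row counts, row widths, and memory budget) estimates whether a dimension's hash table fits in memory; a huge customer dimension easily blows the budget and forces the join to spill to disk.

```python
# Back-of-envelope check with made-up numbers: does the dimension's
# hash table fit in the memory budget available to the join?

def hash_table_fits(row_count, avg_row_bytes, memory_budget_bytes, overhead=1.5):
    # Hash tables carry some per-entry overhead; 1.5x is an assumption.
    estimated_bytes = row_count * avg_row_bytes * overhead
    return estimated_bytes <= memory_budget_bytes

# A 10,000-row product dimension fits comfortably in a 1 GiB budget...
print(hash_table_fits(10_000, 200, 1 * 1024**3))        # True
# ...while a 100-million-row customer dimension does not.
print(hash_table_fits(100_000_000, 200, 1 * 1024**3))   # False
```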