Raymond Law Blog: Dimension


Monday, November 9, 2020

Table Distribution in dimensional data modeling

Some advanced RDBMSs support table distribution and partitioning. They are great features for managing huge volumes of data. This post focuses on table distribution.

The data are distributed across multiple processing nodes. Ideally, when we have a huge amount of data, each processing node keeps a portion of the data and computes independently in parallel to give better overall performance. In practice, data may need to be redistributed (shuffled across nodes) when the computation requires it. So, a proper distribution key and distribution style are important. They ensure the data are evenly distributed (load balancing) and reduce data shuffling across nodes. Remember that a badly chosen distribution key or style can make query performance even worse.

Common distribution styles include round-robin, random, hash and replicated. In a DW, we usually replicate the entire dimension tables (small tables) to all nodes and hash-distribute the fact table (the huge table) across nodes. So, each node holds a portion of the fact table and a full copy of every dimension table. Under a star or snowflake schema, this is the simplest way to avoid shuffling.

A good distribution key helps distribute the data evenly. We also usually choose a key that is rarely used for filtering. Otherwise, for example, if we use the date key as the distribution key, all of the processing nodes except one will sit idle when we analyze the sales transactions of a single day. With a properly selected distribution style and key in your data model, you may also benefit from distributing a huge dimension table across nodes.

Some quick guidelines for Table Distribution:

Choose a proper distribution style

Hash for the fact table
Replicated for the dimension tables

Choose a proper distribution key for the fact table

For even distribution

Many unique values
No or few null values

To avoid data shuffling between nodes and idle nodes

Choose a field that is commonly used in sort-like operations (distinct, group by, order by)
Choose a field that is not commonly used in filtering (where) conditions (e.g. don't use the date key)
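
The effect of the key choice can be illustrated with a small simulation. This is only a sketch under stated assumptions: the four-node cluster, the CRC32-based hash and the sample rows are hypothetical, and a real RDBMS uses its own hash function.

```python
import zlib
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size


def node_for(key) -> int:
    """Map a distribution-key value to a node using a stable hash."""
    return zlib.crc32(str(key).encode()) % NUM_NODES


# Hypothetical fact rows: (transaction_id, date_key) spread over 7 days.
rows = [(txn_id, 20201101 + txn_id % 7) for txn_id in range(10_000)]

# Row counts per node for each candidate distribution key.
rows_per_node_by_txn = Counter(node_for(txn_id) for txn_id, _ in rows)
rows_per_node_by_date = Counter(node_for(date_key) for _, date_key in rows)

# Which nodes have work to do when we analyze a single day?
one_day = [row for row in rows if row[1] == 20201103]
busy_nodes_by_txn = {node_for(txn_id) for txn_id, _ in one_day}
busy_nodes_by_date = {node_for(date_key) for _, date_key in one_day}

print("rows per node, keyed by transaction id:", dict(rows_per_node_by_txn))
print("rows per node, keyed by date key:", dict(rows_per_node_by_date))
print("busy nodes for one day, keyed by transaction id:", sorted(busy_nodes_by_txn))
print("busy nodes for one day, keyed by date key:", sorted(busy_nodes_by_date))
```

With the date key as the distribution key, every row of a given day hashes to the same node, so a single-day query keeps only that node busy. A key with many unique values spreads the same rows over several nodes.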

Wednesday, November 4, 2020

Constraints and indexes of Fact and Dimension Table

We usually implement the multi-dimensional data model in an RDBMS or a cube. Today, we focus on constraints and indexes in multi-dimensional data modeling in an RDBMS.

There are different RDBMS vendors in the market, and their internal designs and implementations differ. So, today, we are only sharing the high-level concepts. In practice, we should spend time understanding our selected RDBMS and adjust our implementation accordingly.

In OLTP applications, constraints are used to maintain the integrity and quality (e.g. business rules) of the data in the database. Indexes are added to improve query performance. In OLAP applications, we usually have much more data and our design strategies focus on performance. The first concern is query performance; the second is loading performance (which reduces the refresh downtime). Data integrity and quality are still important. However, constraints usually don't improve query performance, and they slow down data loading. Instead, given the batch-processing nature of a DW, we are likely to centralize the validation in the batch jobs without enforcing constraints in the DB.
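
As a minimal sketch of what validating in the batch could look like, the check below finds fact rows whose foreign key has no matching dimension row, which is the job a foreign key constraint would otherwise do. The function name, field names and sample data are hypothetical.

```python
def find_orphans(fact_rows, dim_keys, fk_field):
    """Return fact rows whose foreign key has no matching dimension key."""
    valid = set(dim_keys)
    return [row for row in fact_rows if row[fk_field] not in valid]


# Hypothetical staging batch checked before it is loaded into the fact table.
store_keys = [1, 2, 3]
batch = [
    {"store_key": 1, "amount": 10.0},
    {"store_key": 9, "amount": 5.0},  # no such store in the dimension
    {"store_key": 3, "amount": 7.5},
]

orphans = find_orphans(batch, store_keys, "store_key")
clean = [row for row in batch if row not in orphans]

print("rejected rows:", orphans)
print("rows to load:", len(clean))
```

The rejected rows can be logged or routed to an error table, depending on how the batch is designed.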

Outrigger Dimension

It is a dimension table that is not associated with the fact table directly, but with other dimension tables. Different scenarios may lead us to create an outrigger dimension.

Audit Dimension

In traditional data modeling, we usually add "created date", "last modified date", "created by" and "last modified by" columns to all tables to capture the audit information. In dimensional data modeling, this design may still be fine for dimension tables since they are usually small. However, it could be a problem if we manage our fact table in this way. A fact table contains millions of records or even more, and these columns share the same values in all the rows loaded in the same batch. This is why the audit dimension was introduced: it gives us a better design by consolidating the audit information in one place.
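
A minimal sketch of the idea, assuming one audit row per load batch: each fact row carries only a small audit key instead of four repeated audit columns. The attribute names (`loaded_at`, `loaded_by`, `source`) and the helper functions are hypothetical.

```python
from datetime import datetime, timezone
from itertools import count

_audit_keys = count(1)  # simple surrogate key generator for the sketch
audit_dim = {}          # audit_key -> audit attributes of one load batch


def open_batch(loaded_by, source):
    """Create one audit dimension row for a load batch and return its key."""
    key = next(_audit_keys)
    audit_dim[key] = {
        "loaded_at": datetime.now(timezone.utc),
        "loaded_by": loaded_by,
        "source": source,
    }
    return key


def load_facts(rows, audit_key):
    """Attach the batch's audit key to every fact row of that batch."""
    return [{**row, "audit_key": audit_key} for row in rows]


batch_key = open_batch(loaded_by="etl_job", source="pos_export.csv")
facts = load_facts([{"amount": 10.0}, {"amount": 7.5}, {"amount": 3.0}], batch_key)

print(len(audit_dim), "audit row for", len(facts), "fact rows")
```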

Multivalued Dimension

In the structure of a dimensional schema, each dimension attached to a fact table has a single value. However, in reality, there are a number of situations in which a dimension may contain multiple values. For example, in a family insurance policy, multiple insured persons (family members) are attached to one policy. There are some techniques to handle such situations.

1. Keep a simple dimension and ignore the multiple values

We simply eliminate the multi-valued dimension challenge by ignoring it. We usually use this method if the multi-valued attributes are not useful in the analysis. We may simply choose one particular value and drop the others. Or we may insert a dimension row that represents the mixed values and leave out their individual attributes.

2. Add multiple columns to represent the individual values if the number of values is known and limited

If the number of values is known and limited, we may use this method. We add a fixed number of columns to store the individual attributes. However, it is not really recommended, because the data become hard to analyze when attributes are stored both horizontally (across columns) and vertically (across rows).

3. Use bridge table with proper weighting factors.

This method uses a bridge table, which is an N-to-N relationship table between the fact and dimension tables. It is how we maintain multiple values in the dimensional schema. To manage the risk of double counting, we use a weighting factor field to represent the contribution ratio of each value. For each transaction (row of the fact table), the corresponding weighting factors should sum to 1. This method should be used carefully: every measure calculation must apply the weighting factors. Otherwise, double counting will still happen.

Figure: Multivalued Dimension using Bridge Table
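
The double counting risk can be shown with a small worked example. This is a sketch with hypothetical table and column names: one policy with a premium of 1200 and three insured persons whose weighting factors sum to 1.

```python
# Hypothetical fact table: one family policy.
fact_policy = [{"policy_key": 1, "premium": 1200.0}]

# Hypothetical bridge table: three insured persons on that policy.
bridge = [
    {"policy_key": 1, "person_key": 101, "weight": 0.5},
    {"policy_key": 1, "person_key": 102, "weight": 0.25},
    {"policy_key": 1, "person_key": 103, "weight": 0.25},
]


def premium_by_person(use_weights):
    """Join fact to bridge and allocate the premium to each insured person."""
    totals = {}
    for fact in fact_policy:
        for link in bridge:
            if link["policy_key"] == fact["policy_key"]:
                factor = link["weight"] if use_weights else 1.0
                share = fact["premium"] * factor
                totals[link["person_key"]] = totals.get(link["person_key"], 0.0) + share
    return totals


weighted = premium_by_person(use_weights=True)
unweighted = premium_by_person(use_weights=False)

print("weighted total:", sum(weighted.values()))      # matches the fact total
print("unweighted total:", sum(unweighted.values()))  # premium counted three times
```

Without the weighting factor, the join repeats the full premium once per insured person, so the grand total is three times the real premium.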

Monday, October 19, 2020

Hierarchy Dimension

We always recommend flattening the (fixed-depth) hierarchy information into the dimension. This is how we keep the star schema simple and clean.
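
A minimal sketch of flattening, assuming a three-level product hierarchy stored as a parent lookup. The level names and sample values are hypothetical.

```python
# Hypothetical parent-child hierarchy: child -> parent.
parents = {
    "Laptop": "Computers",
    "Computers": "Electronics",
    "Electronics": None,
}

# Fixed depth: one dimension column per level, from leaf to root.
LEVELS = ("product", "subcategory", "category")


def flatten(leaf):
    """Walk up the hierarchy and return one flat dimension row."""
    row = {}
    node = leaf
    for level in LEVELS:
        row[level] = node
        node = parents.get(node) if node is not None else None
    return row


print(flatten("Laptop"))
```

Each level becomes an ordinary column of the dimension row, so filtering and grouping by any level needs no extra join.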

Role Playing Dimension

A conformed dimension consolidates different contexts in a single dimension. We may need to show or map different semantic content in reports and other BI tools. For this purpose, we implement multiple views on top of the conformed dimension, so we can map and show specific semantic content (e.g. selling date vs. ordering date) for different analyses. We call each such view a role-playing dimension.
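
The views can be sketched with an in-memory SQLite database from Python's standard library. The table, view and column names below are hypothetical, and the syntax may need adjusting for your selected RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        calendar_year INTEGER,
        calendar_month INTEGER
    );
    INSERT INTO dim_date VALUES (20201016, 2020, 10), (20201019, 2020, 10);

    -- One view per role, each renaming the columns for its context.
    CREATE VIEW dim_order_date AS
        SELECT date_key AS order_date_key, calendar_month AS order_month
        FROM dim_date;
    CREATE VIEW dim_selling_date AS
        SELECT date_key AS selling_date_key, calendar_month AS selling_month
        FROM dim_date;

    CREATE TABLE fact_sales (
        order_date_key INTEGER,
        selling_date_key INTEGER,
        amount REAL
    );
    INSERT INTO fact_sales VALUES (20201016, 20201019, 99.5);
""")

# The same physical dimension is joined twice, once per role.
row = conn.execute("""
    SELECT o.order_date_key, s.selling_date_key, f.amount
    FROM fact_sales AS f
    JOIN dim_order_date AS o ON o.order_date_key = f.order_date_key
    JOIN dim_selling_date AS s ON s.selling_date_key = f.selling_date_key
""").fetchone()

print(row)
```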

Friday, October 16, 2020

Date Dimension

A special type of dimension that represents the dates of business process events. Because the calendar is stable, we may simply use a meaningful integer YYYYMMDD as its primary key. I believe this is the only scenario in which we accept a meaningful primary key. It contains the date elements used for display. It also contains flattened hierarchy information for filtering and grouping (e.g. calendar month, calendar year, fiscal month, fiscal year). We may also add some tailor-made fields to the date dimension for our business (e.g. holiday flag).
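
A minimal sketch of generating such a dimension. The fiscal calendar (starting in April and named after the calendar year in which it starts) and the holiday list are assumptions for illustration only.

```python
from datetime import date, timedelta

FISCAL_YEAR_START_MONTH = 4      # assumption: fiscal year starts in April
HOLIDAYS = {date(2020, 12, 25)}  # hypothetical business holiday list


def date_row(d):
    """Build one date dimension row with a YYYYMMDD integer key."""
    fiscal_year = d.year if d.month >= FISCAL_YEAR_START_MONTH else d.year - 1
    fiscal_month = (d.month - FISCAL_YEAR_START_MONTH) % 12 + 1
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,
        "full_date": d.isoformat(),
        "calendar_year": d.year,
        "calendar_month": d.month,
        "fiscal_year": fiscal_year,
        "fiscal_month": fiscal_month,
        "holiday_flag": d in HOLIDAYS,
    }


def build_date_dim(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    days = (end - start).days + 1
    return [date_row(start + timedelta(days=i)) for i in range(days)]


date_dim = build_date_dim(date(2020, 1, 1), date(2020, 12, 31))
print(len(date_dim), "rows")
print(date_row(date(2020, 12, 25)))
```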

Rapidly Changing Dimension

Rapidly Changing Dimension is also known as Fast Changing Dimension. We usually implement a Slowly Changing Dimension using type 2 to keep history. However, implementing a rapidly changing dimension with type 2 is a problem: every change adds a row, so the dimension table quickly grows to a huge number of rows. Instead, we use type 4 to manage a rapidly changing dimension.

Junk Dimension

Junk Dimension is also known as Garbage Dimension. A transaction may carry a lot of miscellaneous, low-cardinality flags and indicators (e.g. TransactionType, TransactionStatus, VoidFlag, ActiveFlag, etc.).
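
One common way to handle these flags is to combine them into a single dimension with one row per combination of values, so the fact table stores one key instead of many flag columns. The sketch below assumes that approach; the flag values are hypothetical.

```python
from itertools import product

# Hypothetical low-cardinality flags and their possible values.
FLAG_VALUES = {
    "transaction_type": ["SALE", "REFUND"],
    "transaction_status": ["OPEN", "CLOSED"],
    "void_flag": [False, True],
}

# One junk dimension row (surrogate key) per combination of flag values.
junk_dim = {
    combo: key
    for key, combo in enumerate(product(*FLAG_VALUES.values()), start=1)
}


def junk_key(transaction_type, transaction_status, void_flag):
    """Look up the junk dimension key for one transaction's flags."""
    return junk_dim[(transaction_type, transaction_status, void_flag)]


print(len(junk_dim), "rows in the junk dimension")
print(junk_key("REFUND", "CLOSED", True))
```

Pre-building every combination only makes sense while the number of combinations stays small. Otherwise, rows can be added as new combinations appear in the data.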

Degenerated Dimension

For some dimensions, we may find that there are no dimension attributes at all, only the business key. Transaction number is an example: the other attributes of the transaction header are logically organized in other dimensions (e.g. the transaction date goes to the Date dimension). It doesn't make sense to create a tiny dimension with just a surrogate key and a business key. So, we may consider degenerating the dimension and storing the business key (e.g. transaction number) in the fact table. The size and nature of the content guide this decision (e.g. we degenerate the invoice number but not the invoice remarks). We call such a field in the fact table a degenerated dimension.

Monday, October 12, 2020

Slowly Changing Dimension

The context of a business process may change over time. Slowly Changing Dimension is the technique for managing the change history of a dimension table whose context changes slowly.
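
As an illustration, here is a minimal sketch of the widely used type 2 approach, which keeps history by closing the current row and adding a new one when an attribute changes. The column names, the open-ended date marker and the sample customer are hypothetical.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # hypothetical marker for the current version


def apply_type2(dim_rows, business_key, attrs, effective):
    """Close the current version if it changed, then append a new version."""
    current = next(
        (r for r in dim_rows
         if r["business_key"] == business_key and r["valid_to"] == OPEN_END),
        None,
    )
    if current is not None:
        if current["attrs"] == attrs:
            return  # nothing changed, keep the current version
        current["valid_to"] = effective
    dim_rows.append({
        "surrogate_key": len(dim_rows) + 1,
        "business_key": business_key,
        "attrs": attrs,
        "valid_from": effective,
        "valid_to": OPEN_END,
    })


customer_dim = []
apply_type2(customer_dim, "C001", {"city": "Hong Kong"}, date(2020, 1, 1))
apply_type2(customer_dim, "C001", {"city": "Singapore"}, date(2020, 10, 12))
apply_type2(customer_dim, "C001", {"city": "Singapore"}, date(2020, 11, 1))  # no change

for version in customer_dim:
    print(version)
```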

Conformed Dimension

Different dimension tables may contain the same column names and domain content. They may be dimensions defined from different business processes. We conform them into a single dimension table if appropriate (e.g. same grain). Such a conformed dimension can be reused in different analyses. A conformed dimension also makes it possible to build queries that drill across different business processes. We enjoy a reduced cost of data preparation and data governance while keeping the data model clean and easy to understand.

For example, Stores, Outlets, Warehouses and e-commerce sites may be conformed into a TouchPoint dimension. SalesPersons and Buyers may be conformed into a Personnel dimension.

Thursday, October 8, 2020

What is Dimension Table?

Dimensions provide the descriptive context (e.g. when, where, who, which) of a business process event. They provide the descriptive attributes for filtering, displaying and grouping purposes in BI reports or other BI tools.
