Raymond Law Blog: October 2020

Friday, October 30, 2020

Aggregated Fact Table

Aggregated Fact Table is the summarized table (or materialized view) propagated from the atomic transaction fact table. The aggregated fact table contains fewer rows from the summarization. Query performance gain can be obtained through it when less details are required in some user analysis. We may build multiple aggregated fact tables in different levels (Daily and Monthly).The aggregated fact tables and their atomic transaction fact table can be co-exists and accessible by users. Some aggregate navigation technologies are available in some DB optimizers / BI Tools (e.g. aggregate aware in BO) to pick the best table for user query.

Thursday, October 29, 2020

Consolidated Fact Table

Transactional Fact and Snapshot Fact tables are usually designed from the corresponding individual business process. Consolidated Fact table is usually designed from particular analysis (report) which required consolidation of data (derive) from multiple business processes, which they can be presented in a same grain.

Factless Fact Table

Factless fact table is the fact table which log an business event but does not contain any measure fields. Typical use cases are the attendance and event log. Their row count (log count) is usually the key measurement in analysis.

We recommend deriving some measures from them to avoid factless if possible. For example, the measure "number of days to close" may be derived from an Issue Log Fact table.

Tuesday, October 27, 2020

Accumulating Snapshot Fact Table

This is the fact table to capture stepwise (fixed steps) business processes. It captures the life cycle of a well-defined process. So, the fact table record is first inserted in the first step of the business process and is updated later after completion of the next step(s). (e.g. from ordering to shipping). Typically, there are multiple date dimension fields (e.g. ordering date and shipping date). The number of days between them could be derived as its measure (e.g. days to close).

Monday, October 26, 2020

Periodic Snapshot Fact Table

This kind of fact table represents the snapshot status (i.e. the measures of the state at some point-in-time) periodically (e.g. daily, weekly, monthly). Typically, it is useful for recording account/stock balance. It is the preferred way to capture balance figures because the balance can't be aggregated from the transactions from an individual single period but from the very beginning.

Transactional Fact Table

This is the most common form of fact table. It represents an business process event happened in a particular time and space. Sales fact table is a classic example. Typically, we insert records into the transactional fact table without updating since the transactions never changed after happened. Sometimes, we may also allow updating in the ETL programs for enabling data fixing and reloading.

This kind of table also usually acts as the atomic grain fact table.

Thursday, October 22, 2020

Grain of the Fact Table

The primary key of fact table can simply be all its dimension keys (including degenerated dimensions) which are all columns except measures. Sometimes, it doesn't necessary to be all dimension keys since there may be some redundant information in dimensions. We usually don't need to generate a surrogate key for the fact table but simply use the composite primary key from those dimension keys columns. This primary key is also aligned to the granularity of the fact table.

Wednesday, October 21, 2020

Measure: Additive, Semi-additive and non-additive

Measures are the values in your fact tables which provide numeric meaning to the business process event. There are three kinds of numeric measures.

Multivalued Dimension

Per the structure of dimensional schema, each dimension attached to a fact table has a single value. However, in reality, there are number of situations that dimension may contain multiple values. For example, in family insurance policy, multiple insured-person (family members) are attached to a policy. So, there are some techniques to handle such situation.

1. Keep a simple dimension and ignore the multiple values

We simply eliminate the multi-valued dimension challenge by ignoring it. We usually use this method if such multi-valued attributes are not useful in the analysis. We may simply choose one particular value and leave the others. Or we may insert a dimension row to represent the mixed values while leaving their individual attributes.

2. Add multiple columns to represent the individual values if values are known limited

If the number of values are known limited, we may use this method. We add fixed number of columns to store their individual attributes. However, it is not really recommended. It is because the data will be hard to be analyzed when attributes stored in both way horizontally and vertically.

3. Use bridge table with proper weighting factors.

This method uses the bridge table which is a N-to-N relationship table between the fact and dimension tables. It is how we maintain multiple values in the dimensional schema. To manage the risk of double counting, we use a weighting factors field to represent the contribution ratio of each value. In each transaction (row of fact table), the sum of the corresponding weighting factors should be equal to 1. This method should be used carefully, all the measure calculation should associate with the weighting factors. Otherwise, double counting will still be happened.

Figure: Multivalued Dimension using Bridge Table

Monday, October 19, 2020

Hierarchy Dimension

We always recommend to flatten the (fixed-depth) hierarchy information into the dimension. It is how we keep the star schema simple and nice.

Role Playing Dimension

Conformed dimension consolidates different context in a single dimension. We may need to show/map different semantic content in reports and other BI tools. For these purpose, we implement multiple views on top of the conformed dimension. So, we can map and show specific semantic content (e.g. selling date vs ordering date) for different analysis. We call it the role-playing dimension.

Friday, October 16, 2020

Date Dimension

A special type of dimension to represent dates of business process event. Per the stability nature of calendar, we may simply use meaningful integer YYYYMMDD as its primary key. I believe that it is the only exceptional scenario that we accept meaningful primary key. It contains those date elements for displaying. It also contains flattened hierarchy information for filtering and grouping (e.g. calendar month, calendar year, fiscal month, fiscal year). We may also add some tailor made fields to the date dimension for our business. (e.g. holiday flag)

Rapidly Changing Dimension

Rapidly Changing Dimension is also known as Fast Changing Dimension. We usually implement the Slowly Changing Dimension using type 2 to keep history. However, it would be an issue to implement a rapidly changing dimension using type 2. Type 2 implementation would be a disaster to grow the dimension table rapidly (dimension becomes huge number of rows). Instead, we use type 4 to manage rapidly changing dimension.

Junk Dimension

Junk Dimension is also known as Garbage Dimension. There may be a lot of miscellaneous, low-cardinality flags and indicators (e.g. TransactionType, TransactionStatus, VoidFlag, ActiveFlag, etc.) in a transaction.

Degenerated Dimension

In some dimension, we may find that we don't have any dimension attributes but only its business key. Transaction Number is an example, other transaction header's attributes are logically organized in other dimensions. (e.g. Transaction date is organized to Date dimension) It doesn't make sense to create a tiny dimension with just the surrogate key and business key. So, we may consider to degenerate the dimension and store the business key (e.g. transaction number) in the fact table. Size and the content nature would be the way of our consideration (e.g. we degenerate the invoice number but not the invoice remarks). We call such a degenerated field in the fact table as the degenerated dimension.

Monday, October 12, 2020

Slowly Changing Dimension

The context of the business process may change over time. Slowly Changing Dimension is the technique to manage the change history of dimension table which the context changes slowly.

Conformed Dimension

Different dimension tables may contain same column names and domain content. They may be the dimensions defined from different business processes. We conform them into a single dimension table if appropriate (e.g. same grain). Such a conformed dimension can be reused in different analysis. Conformed Dimension also ensures the capability to build query to drill across different business processes. We can enjoy the reduced cost for its data preparation and data governance while keep the data model clean and easily understandable.

For example, Stores, Outlets, Warehouses and e-commerce sites may be conformed into TouchPoint dimension. SalesPersons and Buyers may be conformed into Personnel dimension.

Friday, October 9, 2020

What is Fact Table?

Fact table stores the measures of the business process event and associates them with the descriptive context (dimension tables). So, there are two kinds of elements contained in the fact table. They are the measures (usually numeric, e.g. UnitSold, AmountSold) and the dimension key.

What is Dimension Table?

Dimensions provide the descriptive context (e.g. when, where, who, which) of the business process event. They provide the descriptive attributes for filtering, displaying and grouping purpose in the BI Reports or other BI tools.

Hash join and dimensional data modeling

You may have heard that we always recommend using star schema instead of snowflake. We recommend limiting the dimension table size (e.g. short dimension table). We also recommend limiting the number of dimension tables associating to the fact table (e.g. thin fact table). The reasons behind are preparing for using the hash join algorithm. We optimize the data model to ensure the high performance hash join in user queries.

Basic of joining algorithms

This article introduces three basic joining algorithms, nested loop join, merge join and hash join. It gives basic ideas of the joining algorithms without going into the very details and their variations. We may pay more attention to the hash join. It is a useful algorithm for dimensional data modeling. Lots of modeling techniques are aimed to optimize the data model to perform a high performance hash join.

Star Schema and Snowflake Schema

Both Star Schema and Snowflake Schema are the forms of dimensional data model. They consist the fact and dimension tables which the fact tables are surrounded by dimensions.

Each dimension is completely de-normalized in single table in the Star Schema. However, in Snowflake Schema, some dimensions are normalized (e.g. Hierarchy relationship).

Friday, October 2, 2020

What is Dimensional Data Modeling?

It is the concept developed by Ralph Kimball. Dimensional data modeling is the technique to optimize the data storage in the dimensional data warehouse. Such an optimization is focused to the querying performance instead of volume of the data storage. So, it is known as a de-normalized data modeling technique to enable fast data query.

BI/DW Best Practice #17 - Think carefully before violating best practices

Remember that, we may be likely trapped ourselves with some troubles when we violate the best practices. So, I usually measure the project risk level by counting how many best practices violated. And try to keep it low to near zero.

Friday, October 30, 2020

Thursday, October 29, 2020

Wednesday, October 28, 2020

Tuesday, October 27, 2020

Monday, October 26, 2020

Friday, October 23, 2020

Thursday, October 22, 2020

Wednesday, October 21, 2020

Tuesday, October 20, 2020

1. Keep a simple dimension and ignore the multiple values

2. Add multiple columns to represent the individual values if values are known limited

3. Use bridge table with proper weighting factors.

Monday, October 19, 2020

Saturday, October 17, 2020

Friday, October 16, 2020

Thursday, October 15, 2020

Wednesday, October 14, 2020

Tuesday, October 13, 2020

Monday, October 12, 2020

Sunday, October 11, 2020

Friday, October 9, 2020

Thursday, October 8, 2020

Wednesday, October 7, 2020

Tuesday, October 6, 2020

Saturday, October 3, 2020

Friday, October 2, 2020

Thursday, October 1, 2020