Wednesday, November 4, 2020

Constraints and indexes of Fact and Dimension Table

We usually implement the multi-dimensional data model in RDBMS or CUBE. Today, we focus on the Constraints and indexes in multi-dimensional data modeling in RDBMS.

In the market, there are different RDBMS providers and their internal design and implementation are different. So, today, we are just sharing the high level concepts. In practice, we should spend time to understand our selected RDBMS and adjust our implementation.

In OLTP applications, constraints are used to maintain the data integrity and quality (e.g. business rules) of the data in the database. Indexes are added to improve the querying performance. In OLAP applications, we usually have much more data and our design strategies are focus on the performance. The first concern is the querying performance, the second is the loading performance (which reduce the refresh downtime). For data integrity and quality, they are still important. However, constraints usually don't improve querying performance but impact the data loading performance. Instead, as the batch processing nature of DW, we likely centralize the validation in batches without enforcing constraints in DB. 

Primary Key of Dimension Table:

  • As discussed earlier, we use surrogate key as the primary key of the dimension table (except date dimension). And yes, we enforce the primary key constraint in dimension table. It is because the primary key (the corresponding unique index) improve the performance of user queries. Querying performance is prioritized even though the primary key constraint may impact the data loading performance. And don't worry. It doesn't matter to data loading since the volume of changed data is usually low in dimension table.

Indexes of Dimension Table:

  • Unique index on the Primary Key (usually already comes with the Primary Key constraint)
  • Unique index on the business (natural) key (e.g. composite with the effective date if type 2 SCD)
  • Single key index on the frequently used filtering column
  • (Optional) composite key index on the combined frequently used filtering columns

Primary Key of Fact Table:

  • As discussed in previous post, we have primary key in the fact table which is composed from all (or the subset) dimension foreign key (including degenerated dimensions). So, yes, conceptually, we have primary key in fact table but we usually won't enforce the primary key constraint of fact table in DB if possible. The primary key of the fact table never improve the query performance as we won't associate the primary key in user queries. (e.g. we won't join between fact tables) Moreover, the primary key constraint impacts to the data loading performance. The constraint validation takes time when the data are loading into the table. Instead, the data integrity is maintained and validated in the ETL jobs or validation reports.
    • We don't need surrogate key for fact table
    • We don't need primary key constraint in fact table

Indexes of Fact Table:

  • Single Key index on the dimension foreign key column

Foreign Key Constraints:

  • Conceptually, we have foreign key relationship between fact and dimension tables. However, similarly, we won't enforce those foreign key constraints in DB. The same reason that it won't improve querying performance and impacts to the data loading performance. The data integrity should be maintained and validated in the ETL jobs or validation reports.
    • We don't need foreign key constraints

Other custom constraints: (e.g. business rules)

  • For the same reason, we won't enforce other custom constraints to check for any business rules in DB. The business rules should be maintained and validated in the ETL jobs or validation reports.
    • We don't need other custom constraints

No comments:

Post a Comment

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...