Wednesday, December 16, 2020

Extract: Performance Tips

Listed below are some common performance tips for extract queries; a sample query illustrating several of them follows the list.
  1. Extract only the required columns and list them explicitly in the query; avoid SELECT *.
  2. Extract only the required rows (i.e. the delta only).
  3. Avoid unnecessary sorting operations
    • prefer UNION ALL over UNION when duplicates need not be removed
    • use DISTINCT, GROUP BY and ORDER BY only when necessary
  4. Avoid transformation logic in the extract queries. Such logic should be offloaded to the ETL engine.
  5. Indexes
    • write a proper WHERE clause
      • prefer AND conditions
      • avoid OR/NOT/<>/NOT IN
      • be careful with wildcards and ranges
    • create proper indexes
    • maintain indexes regularly
  6. Consider "read uncommitted" (a.k.a. dirty read, NOLOCK). Learn it well and use it carefully.
  7. Use the native utility and driver to execute your query. They are usually tuned for better performance.
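Putting several of these tips together, a well-behaved extract query might look like the sketch below. The table, column and parameter names (sales_txn, last_modified_ts, @last_extract_ts) are illustrative, and the NOLOCK hint is SQL Server syntax:

  -- Explicit column list (no SELECT *), delta filter, no sorting or
  -- transformation logic; read uncommitted via NOLOCK (SQL Server).
  SELECT txn_id,
         txn_date,
         store_id,
         product_id,
         quantity,
         amount
  FROM   dbo.sales_txn WITH (NOLOCK)
  WHERE  last_modified_ts >  @last_extract_ts   -- incremental (delta) extract
    AND  last_modified_ts <= @current_extract_ts;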

Wednesday, December 2, 2020

Different file encoding and formats

Our ETL programs need to handle different formats and encodings coming from different sources. Listed below are some common technical challenges.

  • Encoding: ASCII vs EBCDIC
  • Character Set: UTF8, Unicode, etc.
  • Format: Fixed length, Delimited, JSON, XML, Excel Spreadsheet and more
  • Numeric Format: Binary, Packed-decimal, Zoned-decimal and more
Some of these formats and encodings may be challenging to handle in hand-coded ETL. However, most modern ETL engines can deal with them.

Monday, November 30, 2020

Transfer extracted data and checking

As mentioned in a previous post, the push or publish/subscribe models are preferred. Today's topic is how to transfer the extracted files.

Tuesday, November 24, 2020

ETL: Interface files/tables or pipeline

We have different approaches to implement the ETL process. We may create an end-to-end ETL pipeline that propagates the data from extract through transform to load. Another common approach is to save the extracted data in intermediate files or tables before transforming and loading.

Thursday, November 19, 2020

Change Data Capture (CDC)

The concept of incremental extraction is straightforward: we extract only the changed data. Today's post is about how. How can we identify the changed data?
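When the source cannot provide a reliable change timestamp, one common technique is to diff full snapshots. The sketch below assumes we stage today's and yesterday's copies of the source table (all names are illustrative):

  -- Full-compare CDC: rows that are new or changed since yesterday.
  SELECT customer_id, name, address
  FROM   stage_customer_today
  EXCEPT
  SELECT customer_id, name, address
  FROM   stage_customer_yesterday;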

Wednesday, November 18, 2020

Extract: Full or Incremental (Delta)

Extract is the first step of the ETL process; it obtains data from the source systems. It raises our first question: how much data do we need to extract?

Tuesday, November 17, 2020

What is ETL process?

ETL is short for Extract, Transform and Load. It is the process of propagating data from one or more sources to a destination system. First, data are extracted from the source system(s). Transform is the process of converting the extracted data into the required format; it may include data cleansing, combining, splitting, deriving, grouping, sorting, aggregating and lookups. Finally, the transformed data are loaded into the destination system.

Tuesday, November 10, 2020

Table Partitioning in Fact tables

Table partitioning breaks a big table down into smaller sub-tables called partitions. Even without table distribution, i.e. still on a single processing node, we may gain performance from multi-threading and parallel IO: a multi-core CPU can process partitions in parallel, and the partitions can be spread across different or parallel disk devices (e.g. SAN, RAID 10).
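For example, range partitioning on the date key might look like this sketch in PostgreSQL syntax (other RDBMSs differ; all names are illustrative):

  -- Parent table partitioned by range on the integer date key.
  CREATE TABLE sales_fact (
      date_key    integer NOT NULL,
      store_key   integer NOT NULL,
      product_key integer NOT NULL,
      amount      numeric(12,2)
  ) PARTITION BY RANGE (date_key);

  -- One partition per quarter; queries touching one quarter scan only
  -- that partition, and partitions can sit on different disk devices.
  CREATE TABLE sales_fact_2020q4 PARTITION OF sales_fact
      FOR VALUES FROM (20201001) TO (20210101);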

Monday, November 9, 2020

Table Distribution in dimensional data modeling

Some advanced RDBMSs support table distribution and partitioning. They are great features for managing huge volumes of data. This post focuses on table distribution.

The data are distributed across multiple processing nodes. Ideally, when we have a huge amount of data, each processing node keeps a portion of the data and computes independently in parallel, giving better overall performance. In practice, data redistribution (shuffling across nodes) may happen when a computation requires it. So a proper distribution key and distribution style are important: they ensure the data are evenly distributed (load balancing) and reduce data shuffling across nodes. Remember that a badly chosen distribution key and style may make query performance even worse.

Common distribution styles include round-robin, random, hash and replicated. In a DW, we usually replicate the entire dimension tables (small tables) on all nodes and hash-distribute the fact table (the huge table) across nodes. So each node holds a portion of the fact table and complete copies of the dimension tables. Under a star or snowflake schema, this is the simplest way to avoid shuffling.

A good distribution key helps distribute the data evenly. We also usually choose a key that is rarely used for filtering; otherwise, for example, if we use the date key as the distribution key, all processing nodes except one will sit idle when we analyze the sales transactions of a single day. With a properly selected distribution style and key in your data model, you may even benefit from distributing a huge dimension table across nodes.

Some quick guidelines for table distribution (a DDL sketch follows the list):

  •  Choose a proper distribution style
    •  hash for fact tables
    •  replicated for dimension tables
  •  Choose a proper distribution key for the fact table
    •  Even distribution perspective
      •  many unique values
      •  no or few null values
    •  Avoid data shuffling between nodes and node idling
      •  choose a field that is frequently used for sorting (DISTINCT, GROUP BY, ORDER BY)
      •  choose a field that is not frequently used in filtering (WHERE) conditions (e.g. don't use the date key)
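As a sketch in Amazon Redshift syntax (one MPP database offering these styles; table and column names are illustrative):

  -- Hash-distribute the fact table on a high-cardinality key.
  CREATE TABLE sales_fact (
      customer_key int NOT NULL,
      date_key     int NOT NULL,
      amount       decimal(12,2)
  )
  DISTSTYLE KEY
  DISTKEY (customer_key);

  -- Replicate the small dimension table on every node.
  CREATE TABLE store_dim (
      store_key  int NOT NULL,
      store_name varchar(100)
  )
  DISTSTYLE ALL;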

Wednesday, November 4, 2020

Constraints and indexes of Fact and Dimension Table

We usually implement the multi-dimensional data model in an RDBMS or a cube. Today we focus on constraints and indexes for multi-dimensional data modeling in an RDBMS.

There are different RDBMS providers in the market, and their internal designs and implementations differ. So today we are just sharing the high-level concepts. In practice, we should spend time understanding our selected RDBMS and adjust our implementation accordingly.

In OLTP applications, constraints are used to maintain the integrity and quality (e.g. business rules) of the data in the database, and indexes are added to improve query performance. In OLAP applications, we usually have much more data and our design strategies focus on performance: the first concern is query performance, the second is loading performance (which reduces the refresh downtime). Data integrity and quality are still important. However, constraints usually don't improve query performance but do hurt data loading performance. Instead, given the batch-processing nature of a DW, we typically centralize the validation in batches without enforcing constraints in the DB.
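One common loading pattern that follows from this, sketched below in SQL Server syntax with illustrative names, is to leave foreign keys unenforced, disable the fact table's non-clustered indexes before a bulk load, and rebuild them once afterwards:

  -- Disable a secondary index before the bulk load...
  ALTER INDEX ix_sales_fact_date ON dbo.sales_fact DISABLE;

  -- ...load the validated batch (validation was done in the ETL,
  -- not by DB constraints)...
  BULK INSERT dbo.sales_fact FROM 'C:\staging\sales_fact.dat';

  -- ...then rebuild the index once instead of maintaining it row by row.
  ALTER INDEX ix_sales_fact_date ON dbo.sales_fact REBUILD;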

Tuesday, November 3, 2020

Outrigger Dimension

An outrigger dimension is a dimension table that is not associated with the fact table directly, but with other dimension tables. Different scenarios may lead us to create an outrigger dimension.

Monday, November 2, 2020

Audit Dimension

In traditional data modeling, we usually add "created date", "last modified date", "created by" and "last modified by" columns to all tables to capture audit information. In dimensional data modeling, such a design may still be fine for dimension tables since they are usually short. However, it becomes a problem if we manage our fact table this way; that is why the audit dimension is introduced. A fact table contains millions of records or more, and these columns share the same values across all rows from the same batch of data loading. So we can do better by consolidating the audit information into an audit dimension.
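A minimal sketch with illustrative names: one audit row per load batch, referenced from every fact row of that batch:

  CREATE TABLE audit_dim (
      audit_key     int PRIMARY KEY,   -- surrogate key
      batch_id      int,
      load_datetime timestamp,
      source_system varchar(50),
      loaded_by     varchar(50)
  );

  -- The fact table carries one small key instead of repeating
  -- the audit columns on millions of rows.
  CREATE TABLE sales_fact (
      date_key    int NOT NULL,
      product_key int NOT NULL,
      amount      decimal(12,2),
      audit_key   int NOT NULL          -- references audit_dim
  );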

Friday, October 30, 2020

Aggregated Fact Table

An aggregated fact table is a summarized table (or materialized view) propagated from the atomic transaction fact table. It contains fewer rows due to the summarization, so queries gain performance whenever an analysis needs less detail. We may build multiple aggregated fact tables at different levels (e.g. daily and monthly). The aggregated fact tables and their atomic transaction fact table can co-exist and both be accessible to users. Aggregate navigation technologies are available in some DB optimizers / BI tools (e.g. aggregate awareness in BO) to pick the best table for a user query.
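A monthly aggregate might be built like this sketch (CREATE TABLE AS syntax and all names are illustrative; a materialized view would look similar):

  -- Roll the atomic sales fact up to month/product grain.
  CREATE TABLE sales_fact_monthly AS
  SELECT date_key / 100 AS month_key,   -- integer YYYYMMDD -> YYYYMM
         product_key,
         SUM(quantity)  AS total_quantity,
         SUM(amount)    AS total_amount
  FROM   sales_fact
  GROUP BY date_key / 100, product_key;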

Thursday, October 29, 2020

Consolidated Fact Table

Transactional fact and snapshot fact tables are usually designed around a single corresponding business process. A consolidated fact table is usually designed around a particular analysis (report) that requires consolidating (deriving) data from multiple business processes which can be presented at the same grain.

Wednesday, October 28, 2020

Factless Fact Table

A factless fact table is a fact table that logs a business event but does not contain any measure fields. Typical use cases are attendance records and event logs. The row count (log count) is usually the key measurement in analysis.

We recommend deriving some measures where possible to avoid a factless design. For example, the measure "number of days to close" may be derived in an Issue Log fact table.

Tuesday, October 27, 2020

Accumulating Snapshot Fact Table

This is the fact table that captures stepwise (fixed-step) business processes: it captures the life cycle of a well-defined process. The fact table record is first inserted at the first step of the business process and is updated as subsequent steps complete (e.g. from ordering to shipping). Typically there are multiple date dimension fields (e.g. ordering date and shipping date), and the number of days between them can be derived as a measure (e.g. days to close).
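A sketch of the insert-then-update life cycle (table, column and key values are illustrative):

  -- Row created when the order is placed; shipping is still unknown.
  INSERT INTO order_fulfillment_fact
      (order_id, order_date_key, ship_date_key, days_to_ship)
  VALUES (1001, 20201124, -1, NULL);    -- -1 = 'unknown' date member

  -- The same row is updated when the order ships.
  UPDATE order_fulfillment_fact
  SET    ship_date_key = 20201127,
         days_to_ship  = 3
  WHERE  order_id = 1001;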

Monday, October 26, 2020

Periodic Snapshot Fact Table

This kind of fact table records the snapshot status (i.e. the measures of the state at some point in time) periodically (e.g. daily, weekly, monthly). Typically, it is useful for recording account/stock balances. It is the preferred way to capture balance figures, because a balance cannot be aggregated from the transactions of a single period alone but only from the very beginning.
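A daily balance load might roll the previous snapshot forward with the day's net movement, as in this sketch (all names and the hard-coded date keys are illustrative):

  INSERT INTO account_balance_snapshot (date_key, account_key, balance)
  SELECT 20201127,
         s.account_key,
         s.balance + COALESCE(t.net_amount, 0)
  FROM   account_balance_snapshot s
  LEFT JOIN (SELECT account_key, SUM(amount) AS net_amount
             FROM   account_txn_fact
             WHERE  date_key = 20201127
             GROUP BY account_key) t
         ON t.account_key = s.account_key
  WHERE  s.date_key = 20201126;   -- yesterday's snapshot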

Friday, October 23, 2020

Transactional Fact Table

This is the most common form of fact table. It represents a business process event that happened at a particular time and place; a sales fact table is the classic example. Typically, we insert records into the transactional fact table without updating them, since transactions never change after they happen. Sometimes we may also allow updates in the ETL programs to enable data fixing and reloading.

This kind of table also usually acts as the atomic grain fact table.

Thursday, October 22, 2020

Grain of the Fact Table

The primary key of a fact table can simply be all of its dimension keys (including degenerate dimensions), i.e. all columns except the measures. Sometimes it doesn't need to be all the dimension keys, since some dimensions may carry redundant information. We usually don't generate a surrogate key for the fact table but simply use the composite primary key formed from the dimension key columns. This primary key is also aligned with the granularity of the fact table.
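A sketch of a fact table whose composite primary key is its dimension keys (all names are illustrative):

  CREATE TABLE sales_fact (
      date_key    int NOT NULL,
      store_key   int NOT NULL,
      product_key int NOT NULL,
      txn_number  varchar(20) NOT NULL,   -- degenerate dimension
      quantity    int,
      amount      decimal(12,2),
      -- the composite key defines the grain: one row per product
      -- per transaction per store per day
      PRIMARY KEY (date_key, store_key, product_key, txn_number)
  );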

Wednesday, October 21, 2020

Measure: Additive, Semi-additive and non-additive

Measures are the values in your fact tables which provide numeric meaning to the business process event. There are three kinds of numeric measures.

Tuesday, October 20, 2020

Multivalued Dimension

By the structure of a dimensional schema, each dimension attached to a fact table takes a single value per fact row. In reality, however, there are a number of situations where a dimension may carry multiple values. For example, in a family insurance policy, multiple insured persons (family members) are attached to one policy. So there are some techniques to handle this situation.

1. Keep a simple dimension and ignore the multiple values

We simply eliminate the multi-valued dimension challenge by ignoring it. We usually use this method when the multi-valued attributes are not useful in the analysis. We may simply choose one particular value and drop the others, or insert a dimension row representing the mixed values while omitting their individual attributes.

2. Add multiple columns to represent the individual values if the number of values is known and limited

If the number of values is known and limited, we may use this method: we add a fixed number of columns to store the individual attributes. However, it is not really recommended, because the data become hard to analyze when attributes are stored both horizontally and vertically.

3. Use bridge table with proper weighting factors. 

This method uses a bridge table, an N-to-N relationship table between the fact and dimension tables. It is how we maintain multiple values in the dimensional schema. To manage the risk of double counting, we add a weighting factor field representing the contribution ratio of each value; within each transaction (fact row), the corresponding weighting factors should sum to 1. This method must be used carefully: every measure calculation should apply the weighting factors, otherwise double counting will still happen. A query sketch follows the figure below.

Figure: Multivalued Dimension using Bridge Table
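A weighted query over the bridge might look like this sketch (all names are illustrative):

  -- Allocate each policy's premium across its insured persons using
  -- the bridge weighting factors (which sum to 1 per policy).
  SELECT p.person_name,
         SUM(f.premium_amount * b.weighting_factor) AS allocated_premium
  FROM   policy_fact f
  JOIN   policy_person_bridge b ON b.policy_key = f.policy_key
  JOIN   person_dim p           ON p.person_key = b.person_key
  GROUP BY p.person_name;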


Monday, October 19, 2020

Hierarchy Dimension

We always recommend flattening (fixed-depth) hierarchy information into the dimension. This is how we keep the star schema simple and clean.

Saturday, October 17, 2020

Role Playing Dimension

A conformed dimension consolidates different contexts into a single dimension. We may need to show/map different semantic content in reports and other BI tools. For this purpose, we implement multiple views on top of the conformed dimension, so we can map and show specific semantic content (e.g. selling date vs ordering date) for different analyses. We call this the role-playing dimension; a sketch follows.
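For example, one date dimension can play two roles through views (view and column names are illustrative):

  -- Each role gets its own view and column aliases over the
  -- single conformed date dimension.
  CREATE VIEW order_date_dim AS
  SELECT date_key       AS order_date_key,
         full_date      AS order_date,
         calendar_month AS order_calendar_month
  FROM   date_dim;

  CREATE VIEW ship_date_dim AS
  SELECT date_key       AS ship_date_key,
         full_date      AS ship_date,
         calendar_month AS ship_calendar_month
  FROM   date_dim;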

Friday, October 16, 2020

Date Dimension

This is a special type of dimension representing the dates of business process events. Given the stability of the calendar, we may simply use a meaningful integer YYYYMMDD as its primary key; I believe it is the only scenario where we accept a meaningful primary key. It contains the date elements for display, and flattened hierarchy information for filtering and grouping (e.g. calendar month, calendar year, fiscal month, fiscal year). We may also add some tailor-made fields for our business (e.g. a holiday flag).
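A minimal sketch (the columns are illustrative):

  CREATE TABLE date_dim (
      date_key       int PRIMARY KEY,   -- meaningful YYYYMMDD, e.g. 20201016
      full_date      date NOT NULL,
      day_of_week    varchar(10),
      calendar_month int,               -- flattened hierarchy, e.g. 202010
      calendar_year  int,
      fiscal_month   int,
      fiscal_year    int,
      holiday_flag   char(1)            -- tailor-made business field
  );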

Thursday, October 15, 2020

Rapidly Changing Dimension

Rapidly Changing Dimension is also known as Fast Changing Dimension. We usually implement a Slowly Changing Dimension as type 2 to keep history. However, implementing a rapidly changing dimension as type 2 would be a disaster: the dimension table would rapidly grow to a huge number of rows. Instead, we use type 4 to manage a rapidly changing dimension.
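A sketch of the type 4 (mini-dimension) idea: the rapidly changing attributes are split out of the main dimension into their own small, banded dimension keyed directly from the fact table (names and bandings are illustrative):

  -- Stable attributes stay in the main dimension.
  CREATE TABLE customer_dim (
      customer_key  int PRIMARY KEY,
      customer_name varchar(100)
  );

  -- Rapidly changing attributes live in a mini-dimension with one
  -- row per banded combination, so it stops growing with history.
  CREATE TABLE customer_profile_dim (
      profile_key       int PRIMARY KEY,
      income_band       varchar(20),   -- e.g. '40K-60K'
      credit_score_band varchar(20)    -- e.g. '600-650'
  );

  -- Each fact row then carries both customer_key and profile_key.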

Wednesday, October 14, 2020

Junk Dimension

Junk Dimension is also known as Garbage Dimension. A transaction may carry a lot of miscellaneous, low-cardinality flags and indicators (e.g. TransactionType, TransactionStatus, VoidFlag, ActiveFlag, etc.).

Tuesday, October 13, 2020

Degenerate Dimension

In some dimensions, we may find that we have no dimension attributes at all, only the business key. Transaction Number is an example: the other transaction header attributes are logically organized into other dimensions (e.g. the transaction date goes to the Date dimension). It doesn't make sense to create a tiny dimension with just a surrogate key and a business key, so we may degenerate the dimension and store the business key (e.g. the transaction number) in the fact table. Size and the nature of the content guide this decision (e.g. we degenerate the invoice number but not the invoice remarks). We call such a field in the fact table a degenerate dimension.

Monday, October 12, 2020

Slowly Changing Dimension

The context of a business process may change over time. Slowly Changing Dimension is the technique for managing the change history of a dimension table whose context changes slowly.

Sunday, October 11, 2020

Conformed Dimension

Different dimension tables may contain the same column names and domain content; they may be dimensions defined from different business processes. We conform them into a single dimension table where appropriate (e.g. same grain). Such a conformed dimension can be reused across different analyses. A conformed dimension also preserves the ability to build queries that drill across different business processes. We enjoy reduced data preparation and data governance costs while keeping the data model clean and easy to understand.

For example, Stores, Outlets, Warehouses and e-commerce sites may be conformed into a TouchPoint dimension; SalesPersons and Buyers may be conformed into a Personnel dimension.

Friday, October 9, 2020

What is Fact Table?

The fact table stores the measures of the business process event and associates them with the descriptive context (the dimension tables). So there are two kinds of elements in a fact table: the measures (usually numeric, e.g. UnitSold, AmountSold) and the dimension keys.

Thursday, October 8, 2020

What is Dimension Table?

Dimensions provide the descriptive context (e.g. when, where, who, which) of the business process event. They provide the descriptive attributes used for filtering, displaying and grouping in BI reports and other BI tools.

Wednesday, October 7, 2020

Hash join and dimensional data modeling

You may have heard that we always recommend a star schema instead of a snowflake, that we recommend limiting dimension table size (short dimension tables), and that we recommend limiting the number of dimension tables associated with the fact table (a thin fact table). The reason behind all of these is to prepare for the hash join algorithm: we optimize the data model to ensure high-performance hash joins in user queries.

Tuesday, October 6, 2020

Basic of joining algorithms

This article introduces three basic joining algorithms: nested loop join, merge join and hash join. It gives the basic ideas without going into the fine details and variations. Pay particular attention to the hash join; it is a useful algorithm for dimensional data modeling, and many modeling techniques aim to optimize the data model for a high-performance hash join.

Saturday, October 3, 2020

Star Schema and Snowflake Schema

Both the Star Schema and the Snowflake Schema are forms of the dimensional data model. They consist of fact and dimension tables, with the fact tables surrounded by dimensions.

In the Star Schema, each dimension is completely de-normalized into a single table. In the Snowflake Schema, however, some dimensions are normalized (e.g. hierarchy relationships).

Friday, October 2, 2020

What is Dimensional Data Modeling?

This is a concept developed by Ralph Kimball. Dimensional data modeling is a technique to optimize the data storage in a dimensional data warehouse. The optimization focuses on query performance rather than the volume of data stored, so it is known as a de-normalized data modeling technique that enables fast data queries.

Thursday, October 1, 2020

BI/DW Best Practice #17 - Think carefully before violating best practices

Remember that we will likely trap ourselves in trouble when we violate the best practices. So I usually measure a project's risk level by counting how many best practices are violated, and I try to keep that count near zero.

Wednesday, September 30, 2020

BI/DW Best Practice #16 - Start small, release in batches

Try to release reports batch by batch, in smaller portions. Users and production support will go crazy if you release too much too fast. We should give users enough time to adapt their working procedures to the new reports and environment; otherwise, they may stick to their old procedures instead of switching to the new system.

Tuesday, September 29, 2020

BI/DW Best Practice #15 - Be careful about being a pilot project

Application defects and the stability of the tools we use always impact our projects, so try to avoid them as much as possible.

Monday, September 28, 2020

BI/DW Best Practice #14 - Flexible, rerunnable and fault-tolerant ETL processes

There are several different source systems, so we may encounter errors for different reasons. We have three options to handle them: halt; proceed but reject the failed records; or proceed with notification.

BI/DW Best Practice #13 - Keep measures additive if possible

BI tools roll up the measures, so we keep the measures additive wherever possible. For example, we won't pre-calculate the KPI "Units per Transaction (UPT)" and keep it as a single ratio measure column. Instead, we keep two measures, "Number of Units" and "Number of Transactions", and compute the ratio at query time, as in the sketch below.
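A sketch with illustrative names:

  -- UPT computed from two additive measures after rollup; averaging
  -- a pre-computed ratio column would give the wrong answer.
  SELECT store_key,
         SUM(units) * 1.0 / SUM(transactions) AS units_per_transaction
  FROM   sales_fact
  GROUP BY store_key;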

Sunday, September 27, 2020

BI/DW Best Practice #12 - Manage data integrity

The data warehouse / data mart is usually implemented in an RDBMS, which provides many features to manage data that are useful for both operational and analytical systems. However, given the different natures of operational and analytical systems, we adopt different practices to manage our data.

Saturday, September 26, 2020

BI/DW Best Practice #11 - Decouple with source systems

Traditionally, there are data file interfaces between the source system and the DW: an extraction process extracts the data into data files, and the subsequent ETL process reads the extracted files. However, many ETL tools today make it possible to extract the source data through a direct connection and stream it, transforming and loading into the target database. This lets the whole ETL process stream the data in memory without intermediate file IO, which maximizes performance. Should we take this approach? My answer is no.

Wednesday, September 16, 2020

BI/DW Best Practice #10 - Handle unknown and null

NULL columns sometimes give users unexpected query results. It happens in filtering (e.g. the condition [COLUMN1 = 0 OR COLUMN1 != 0] does not return all records; the rows where COLUMN1 is NULL are missing). We should avoid such unexpected outcomes by avoiding NULL in all dimension columns and dimension foreign keys; the usual fix is sketched below.
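The common technique is to substitute an explicit "unknown" member during the ETL. In this sketch, the -1 key and all table/column names are illustrative:

  -- The dimension carries an explicit 'Unknown' row...
  INSERT INTO store_dim (store_key, store_code, store_name)
  VALUES (-1, 'N/A', 'Unknown');

  -- ...and the load never writes NULL into the fact's foreign keys.
  INSERT INTO sales_fact (date_key, store_key, amount)
  SELECT s.date_key,
         COALESCE(d.store_key, -1),   -- unknown member instead of NULL
         s.amount
  FROM   staging_sales s
  LEFT JOIN store_dim d ON d.store_code = s.store_code;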

Tuesday, September 15, 2020

BI/DW Best Practice #9 - Contain date dimension in all fact tables

If we are using a dimensional data model, make sure to include a date dimension (at least one) in every fact table. It must be an item on the checklist for our dimensional model design; I can't imagine a business activity that could do without a date.

Monday, September 14, 2020

BI/DW Best Practice #8 - Use Surrogate key

We sometimes hear the argument over using a surrogate key versus the business key. There are a number of reasons why we use surrogate keys; I am not going through them one by one here. I would just emphasize my key reason: to make the model capable of absorbing changes.

Thursday, September 10, 2020

BI/DW Best Practice #7 - Atomic grain base fact table

If we are using a dimensional data model, make sure to build atomic grain base fact tables. It is true that they take more space and perform slower than aggregated fact tables. We may build aggregated fact tables on top to improve performance, but don't skip the atomic grain base fact table. It can partly act as the single version of truth in your solution, and it will likely reduce the effort of rebuilding everything from zero when future changes come.

BI/DW Best Practice #6 - Denormalize the data in the user access layer

Before I started writing this article, I wanted simply to suggest using a star schema. However, some readers may not build a dimensional data warehouse, so I generalize my advice to denormalization. For performance, we always use a denormalized data model to support fast user queries.

Saturday, September 5, 2020

BI/DW Best Practice #5 - Consolidate entities while keep correct grain

We may sometimes consolidate similar analyses into one fact table. Such a shared fact table may reduce the total size, complexity and effort. However, we must make sure we don't violate the correct granularity of the fact table.

Conforming fact tables at the wrong granularity will lead to wrong results in analysis. Feel free to create a new fact table if the grains differ; it is fine to have multiple fact tables in one subject area.

BI/DW Best Practice #4 - Organization wide participation to ensure data quality

The technologies are not intelligent on their own. The people are the intelligence, and the BI solution assists them. We should introduce best practices across the organization to maintain data quality.

Tuesday, September 1, 2020

BI/DW Best Practice #3 - Understand the business processes

We need to understand the business processes and ask the right questions. Some managers will ask for whatever information they can get without thinking carefully about what they really need, and too much information can be as harmful as too little. Other managers may omit things they ought to know, or may not know to ask for some types of information they should have.

BI/DW Best Practice #2 - Use common business languages

In BI/DW projects, try to use common, aligned business language in the deliverables, including reports and dashboards. It avoids misunderstandings of terms between business units.

Sometimes the business language may still be ambiguous between business units. For example, don't confuse the terms revenue, net profit and income: are different business units using different words for the same thing? That is why we recommend organization-wide participation to define, document and spread the business terminology within the organization. Usually, we may talk to the organization's accountants first; they can give us great input before we propose the business terminology standard.

BI is not only a software solution, but also a set of practices to facilitate business decision making. The alignment of business terminology is one of them.

Monday, August 31, 2020

BI/DW Best Practice #1 - Principles

We always consider the business need, performance, capability to change and reality. I believe these are the top design principles for a BI/DW project. They also represent the corresponding qualities, referring to the traps I discussed in my last article.

Sunday, August 30, 2020

BI/DW Best Practices

There are several general rules for designing and implementing BI/DW projects that help ensure their success and quality. I am trying to consolidate some of the important best practices from what I have learned and experienced.

In this series, I will focus on the practices related to my role, experience and exposure, so it is not a complete list of best practices in the BI/DW industry.

BI/DW Best Practice #0 - What may happen if we don't follow the best practices

BI best practices guide us in developing BI projects by introducing principles, recommended methodologies and suggestions. By following them, we can avoid some common troubles. Before introducing the practices, I would like to describe those common troubles first. I classify them into three major categories, in my own words: the Performance Trap, the Change Trap and the Information Trap.

Saturday, August 29, 2020

What is Data Warehouse?

A data warehouse (DW) is the architecture for collecting and managing business data from various sources to provide meaningful business insight. It is different from the databases of operational systems: it collects data from different business processes, while an operational database focuses on a particular process; its purpose is querying and analysis, while an operational database supports operations (e.g. frequent DML for sales transaction entries).

There are two well-known approaches to implement data warehouse.

Friday, August 28, 2020

What is Business Intelligence

Business Intelligence refers to the technologies, applications and practices used to collect, integrate and convert business data into information that supports analysis and business decision making.

Wednesday, August 26, 2020

Introduction

I was a Business Intelligence Consultant. I quit my job in Hong Kong last year, and I am now a student in Calgary, Canada. I plan to go back to the BI industry afterwards. Before I enter the job market again, I would like to take this chance to consolidate my working experience and knowledge, including the skills and industry best practices I have learned. I also want to document some challenges I faced and the methodologies used to overcome them. I believe this will be helpful in my future career: I may get hints by reviewing these articles when I face a new challenge. And if you reach my articles through a search engine, I would be happy if they give you some ideas.
