Raymond Law Blog: Basic of joining algorithms

This article introduces three basic joining algorithms, nested loop join, merge join and hash join. It gives basic ideas of the joining algorithms without going into the very details and their variations. We may pay more attention to the hash join. It is a useful algorithm for dimensional data modeling. Lots of modeling techniques are aimed to optimize the data model to perform a high performance hash join.

Nested Loop Join

It joins two tables using two nested loops. The loops goes for all rows combination, and checks if the join predicate satisfied.

For example,

Select *

From TableA inner join TableB

On TableA.ID = TableB.ID

In nested loop join, the pseudo code likes below

For each row in TableA

For each row in TableB

If TableA.ID = TableB.ID

Return (TableA.row,TableB.row)

The algorithms runs in O(A*B). It works effectively when tables are small.

Merge Join

Merge join is an efficient joining algorithm requiring both the input data are pre-sorted on the merge (join predicate) columns. It runs in a single loop with two pointers and checks if the join predicate satisfied. Its pseudo code look like below

While Not End TableA and Not End TableB

Begin

if TableA.ID = TableB.ID

return (TableA.row,TableB.row); TableA.nextrow;

if TableA.ID > TableB.ID

then TableA.nextrow

else if TableA.ID < TableB.ID

then TableB.nextrow

End

The algorithm runs in O(A+B), so it is very efficient. However, the pre-sorted input requirement is harsh. Otherwise, extra efforts, O(A.log(A)) and O(B.log(B)) are required to sort the input. The overall outcomes will be O(A.log(A) + B.log(B)).

Hash Join

Hash Join builds a hash table in memory to hold the smaller table. It loops to fetch each row from the larger table, checks if the join predicate satisfied using a hash function. Its pseudo code is simplified as follow.

Build hash for TableA (TableA.IDs, TableA.rows)

For each row in TableB

Begin

rowB = hash_lookup(TableA.ID)

If rowB is not null

return(TableA.row,TableB.rows)

End

It is an efficient joining algorithm if the smaller table can be fitted in the memory. Otherwise, paging (virtual memory using disk) is required which slow down the whole process. In ideal scenario, it takes a constant time to build the hash table A and takes O(B) to fetch each row.

Raymond Law Blog

Tuesday, October 6, 2020

Basic of joining algorithms

Nested Loop Join

Merge Join

Hash Join

No comments:

Post a Comment

Extract: Performance Tips

Report Abuse