This article introduces three basic joining algorithms, nested loop join, merge join and hash join. It gives basic ideas of the joining algorithms without going into the very details and their variations. We may pay more attention to the hash join. It is a useful algorithm for dimensional data modeling. Lots of modeling techniques are aimed to optimize the data model to perform a high performance hash join.
Nested Loop Join
It joins two tables using two nested loops. The loops goes for all rows combination, and checks if the join predicate satisfied.
For example,
Select *
From TableA inner join TableB
On TableA.ID = TableB.ID
In nested loop join, the pseudo code likes below
For each row in TableA
For each row in TableB
If TableA.ID = TableB.ID
Return (TableA.row,TableB.row)
The algorithms runs in O(A*B). It works effectively when tables are small.
Merge Join
Merge join is an efficient joining algorithm requiring both the input data are pre-sorted on the merge (join predicate) columns. It runs in a single loop with two pointers and checks if the join predicate satisfied. Its pseudo code look like below
While Not End TableA and Not End TableB
Begin
if TableA.ID = TableB.ID
return (TableA.row,TableB.row); TableA.nextrow;
if TableA.ID > TableB.ID
then TableA.nextrow
else if TableA.ID < TableB.ID
then TableB.nextrow
End
The algorithm runs in O(A+B), so it is very efficient. However, the pre-sorted input requirement is harsh. Otherwise, extra efforts, O(A.log(A)) and O(B.log(B)) are required to sort the input. The overall outcomes will be O(A.log(A) + B.log(B)).
Hash Join
Hash Join builds a hash table in memory to hold the smaller table. It loops to fetch each row from the larger table, checks if the join predicate satisfied using a hash function. Its pseudo code is simplified as follow.
Build hash for TableA (TableA.IDs, TableA.rows)
For each row in TableB
Begin
rowB = hash_lookup(TableA.ID)
If rowB is not null
return(TableA.row,TableB.rows)
End
It is an efficient joining algorithm if the smaller table can be fitted in the memory. Otherwise, paging (virtual memory using disk) is required which slow down the whole process. In ideal scenario, it takes a constant time to build the hash table A and takes O(B) to fetch each row.
No comments:
Post a Comment