Tuesday, October 6, 2020

Basic of joining algorithms

This article introduces three basic joining algorithms, nested loop join, merge join and hash join. It gives basic ideas of the joining algorithms without going into the very details and their variations. We may pay more attention to the hash join. It is a useful algorithm for dimensional data modeling. Lots of modeling techniques are aimed to optimize the data model to perform a high performance hash join.

Nested Loop Join

It joins two tables using two nested loops. The loops goes for all rows combination, and checks if the join predicate satisfied.

For example,

    Select *

    From   TableA inner join TableB

    On     TableA.ID = TableB.ID

In nested loop join, the pseudo code likes below

    For each row in TableA

      For each row in TableB

        If TableA.ID = TableB.ID

          Return (TableA.row,TableB.row)

The algorithms runs in O(A*B). It works effectively when tables are small.

Merge Join

Merge join is an efficient joining algorithm requiring both the input data are pre-sorted on the merge (join predicate) columns. It runs in a single loop with two pointers and checks if the join predicate satisfied. Its pseudo code look like below

    While Not End TableA and Not End TableB

    Begin

      if TableA.ID = TableB.ID

        return (TableA.row,TableB.row); TableA.nextrow;

      if TableA.ID > TableB.ID

        then TableA.nextrow

      else if TableA.ID < TableB.ID

        then TableB.nextrow

    End

The algorithm runs in O(A+B), so it is very efficient. However, the pre-sorted input requirement is harsh. Otherwise, extra efforts, O(A.log(A)) and O(B.log(B)) are required to sort the input. The overall outcomes will be O(A.log(A) + B.log(B)).

Hash Join

Hash Join builds a hash table in memory to hold the smaller table. It loops to fetch each row from the larger table, checks if the join predicate satisfied using a hash function. Its pseudo code is simplified as follow.

    Build hash for TableA (TableA.IDs, TableA.rows)

    For each row in TableB

    Begin

      rowB = hash_lookup(TableA.ID)

      If rowB is not null

        return(TableA.row,TableB.rows)

    End

It is an efficient joining algorithm if the smaller table can be fitted in the memory. Otherwise, paging (virtual memory using disk) is required which slow down the whole process. In ideal scenario, it takes a constant time to build the hash table A and takes O(B) to fetch each row.

No comments:

Post a Comment

Extract: Performance Tips

Below listed some common performance tips for extract queries. Extract required columns only and specify in the query, avoid select * Extrac...