Traditionally, there are data file interfaces between the source system and the DW. Extraction process extract the data into data files. The subsequence ETL process read the extracted data file. However, supported by many ETL tools today, it is possible to extract the source data through direct connection and streamline the data, transform and load into the target database. It enables the whole ETL process streaming the data in memory without intermediate file IO which maximize the performance. Should we take this approach? My answer is no.
We always recommend to decouple DW and source systems. It is how we reduce the challenge of future changes. A file interface are usually the reliable way to exchange data while reduce coupling between system. If there are any upstream system upgrade/replace, it is usually easy to change the extraction program to export in required interface file format. It works even if the source system moved from relational DB to NOSQL environment.
You may argue that it may still work with similar complexity in the streamline approach under most modern ETL tools connectors. Technically yes, but I would say no in the next consideration which is in term of ownership and impact analysis. Firstly, we recommend that the source system owner takes the ownership of the extraction process. It is sometime ok that we develop the extraction process for them but they should take the ownership. It would help for the maintenance. If the source system owner own the extraction processes, they would aware of the existing of the extraction program when they have any scheduled downtime. They would also aware of an impact analysis for their system changes. Otherwise, if the DW team takes the ownership of the extraction programs. Unplanned extraction fails may happen if you are not notified for the scheduled downtime or changes of source systems. Please remember that there are multiple source systems for a DW and don't overestimate the quality of communication within an organization (or within the IT team). It is why we recommend that the source system owner should take the ownership of the extraction programs. This is the key reason of the system decoupling. We are avoiding to trap ourselves in trouble. Other reasons may include the flexibility of partial ETL run, etc. I am not going to go through them in details here.
No comments:
Post a Comment