An ETL import graph is build on logical dependencies of the jobs to each other. So typically a SQL transformation job depends on all the previous jobs that create the tables used in the query. But once there are a certain number of jobs, dependencies often get a bit more complicated and some of them become redundant in the process. A simple example can be seen in the dependency graph from figure, where the three red edges are redundant.

Continue reading

To speed up the ETL data pipeline, you should try to run jobs in parallel. Obviously, not all jobs can run at the same time in most cases, since there are dependency constraints between the jobs and limits of the servers capacity (number of processors and/or IO bandwidth). So assuming the server allows you to run n jobs in parallel, often there is the situation that the dependencies give you the option to run any of a set of m different jobs with m > n.

Continue reading

Author's picture

Christian Stade-Schuldt

Data Engineer @ HERE IoT innovation lab| Full-time geek | Cyclist | Learning from data

Data Engineer @ HERE

Berlin, Germany