Machine Learning Data Engineer Spotlight: Agile Data Transformations with dbt and Airflow

Machine Learning Data Engineer Spotlight: Agile Data Transformations with dbt and Airflow

As a machine learning data engineer at ActiveCampaign, a big part of my job is taking data from our application or third party services and loading it into our Snowflake data lake. Regardless of what transformations take place during this initial extract, transform, load (ETL) process, we frequently need further downstream processing to do things like expensive aggregations or pre-joins to other datasets.

This continuous update and re-alignment of data is part of our agile process as we add new machine learning models, custom reports, or innovative product features.

We use dbt to handle these “post-load” tasks that are part of our ETL pipeline.

dbt (data build tool)

dbt (short for data build tool) is an open-source tool to transform data inside a database for analytics or machine learning purposes. It has a lot of great features for our use-case:

  1. It’s mostly just SQL (structured query languages). The structured query languages relational database systems use for querying and updating data. This means that it’s easy to have analysts contribute to and understand the codebase since it is in a language they use every day.
  2. It can store transformed data in a variety of ways depending on the size of the data and the way that it changes from day to day. dbt can store transformations as database views – a named, stored SQL query – for cases where calculating the transformation is relatively quick. It can create fully-regenerated tables that are deleted and replaced on each run for data that changes frequently. It can also create incrementally-generated tables – a database table where new rows are inserted overtime – for large data sets that are written append-only. Often, switching from a view to an incrementally-generated table takes only a few new lines of code. Incremental tables can significantly reduce costs in Snowflake for very large datasets.
  3. It handles dependencies where some generated tables depend on other generated tables. This lets us create complicated database transformations in a series on well-named, understandable steps.
  4. It allows code sharing via macros. Macros in dbt are SQL snippets that can be included in any query. By using macros instead of copying and pasting SQL, updates and bug fixes can be made in one place but still update in all the queries that need them.
  5. It generates useful documentation. The dbt documentation system generates a web page that lists all the source tables and all the generated tables or views along with their columns. In addition, dbt gives us a convenient place to add comments about the use of a table or column to help others understand and use the data.

The missing piece from dbt, though, is scheduling it to run – that’s why we also use Airflow.

Together, dbt and Airflow make a great data team.

Airflow

Although dbt offers the spiffy-looking dbt Cloud, ActiveCampaign schedules dbt runs through Airflow (since all of our ETL code is scheduled via Airflow). This lets us run the appropriate dbt models immediately after new data is loaded as part of the same Airflow directed acyclic graph (DAG) collection of tasks.

While there are some python software packages that help integrate dbt with Airflow, none of them met all our needs. We wanted something that:

  1. Used an Airflow connection to manage the credentials to connect to the database.
  2. Had minimal operations overhead (that means that, ideally, it would just run on the Airflow worker itself instead of on a separate server).
  3. Could run a subset of dbt models easily. This lets us share code between models easily since they are all part of the same project, but still connect the running of the relevant models to the process that pulls in its data.

We ended up writing our own hook and operator that takes an Airflow Snowflake hook (the object that Airflow uses to look up database credentials) and turns its credentials into environment variables available to a call to dbt run in a subprocess.

This requires referencing environment variables in the dbt profiles.yml file instead of including the values directly. This can be done like this:

94fjh8sg image2020 11 05at12.34.47pm

This is nice because it means database credentials aren’t living in plaintext anywhere. We also ended up needing to install dbt in its own virtual python environment separate from the Airflow code because Airflow requires different versions of some python packages than those dbt requires.

This gives us an easy-to-run, comprehensive, customizable, and self-documenting codebase of “post-load” transformations on our data so we can enable other teams more quickly avoid errors and take advantage of this open-source tool’s development to have better performance with less work.

Never miss an update