Author: Rini Shiny M
This blog describes the integration of DBT with Snowflake.
Fig 1. Role of DBT in data analytics
Introduction to DBT
DBT (Data Build Tool), a Python application is a transformation tool, which focuses on the Transformation part in ELT (Extract, Load, Transform) processes. It aims at performing data engineering activities before the data is used for analytics. DBT performs transformation using select statements, which in turn converts into tables and views, thus making the transformation process simple and effective.
DBT comes in two variants
i) DBT CLI – Command Line Interface, which is run by a terminal.
ii)DBT Cloud – A web based application along with an IDE.
Fig 2. Data Build Tool
Key features of DBT
The documentation offered by DBT is updated as to when models are developed. It is also easily accessible throughout the development. The documentation is generated based on the descriptions provided, dependencies between models, SQL files, sources, and the tests defined.
2. Data Lineage
The data pipeline in DBT is represented in the form of lineage graphs. This gives proper visibility of the data and how the data maps with the business logic. The complete flow of data from the source to the target system is shown in the graph, thus providing transparency.
3. Version Control
Integration of GIT with DBT has made version control come hand in hand. All the models, tests generated, sources, packages used, and other configurations used in the project are versioned in the connected GIT repository.
DBT comes with prebuilt unique, not null, referential integrity, and accepted value testing. Jinja and SQL can be used to write custom test cases to add more testing features apart from the default tests.
In DBT, it is possible to develop models that can be reused by using the Jinja framework.
6. Data refreshes within dbt Cloud
Using DBT, it is possible to schedule refresh at the production environment according to the business requirement without the need of an orchestration tool.
Why DBT and Snowflake?
Snowflake is a data warehouse hosted as a Software-as-a-Service (SaaS) which is faster, user friendly, and more flexible than a traditional data warehouse. Snowflake’s data warehouse is developed using a SQL database engine with an architecture specifically designed for the cloud environment. Snowflake is built using multi-cluster, shared data architecture- This makes the Data Storage and Compute layer, the query processing layer separated. DBT is an open source tool that is used to manage the ELT load in Snowflake. DBT can be used with Snowflake for the following features.
- Converting tables to views- It is sufficient to change the materialization in a single config file to change a table to a view.
- Stored Procedure- The stored procedures created in dbt are shown in the models, which could be accessed and modified.
- Combining transformation logic- DBT groups similar transformation logic together using dbt tags.
Version Control- DBT supports version control by integrating with GitHub.
- Open-source community- Could enhance the development by sharing experiences from fellow developers rather than starting from scratch.
Integration of DBT with Snowflake
Step 1: Snowflake account creation
Create an account with Snowflake and note the highlighted details which will be used for database connection with DBT.
Fig 3. Snowflake Worksheet
Step 2: DBT Cloud sign up
Create an account with DBT using Try dbt Cloud Free. Once signed in, create a new project. To set up the created project, refer to the below steps.
Fig 4. DBT Project
Click on begin to proceed. Give an appropriate DBT project name in the next screen and hit continue. In the next screen, select Snowflake from the list of data warehouses.
Step 3: Connecting to Snowflake instance
The following fields are required when creating a Snowflake connection:
- Account- The Snowflake account to connect to. If the URL for the Snowflake account is like abc12345.east-us-2.azure.snowflakecomputing.com, then the account name should be abc12345.east-us-2.azure.
- Role- Role to be used after connecting to Snowflake. (Optional field)
- Database- Establish a connection with this logical database in the data warehouse to run queries.
- Warehouse- The virtual warehouse to use for running queries.
- Auth method-
- Username / Password – Enter the Snowflake username (specifically, the login_name) and the corresponding user’s Snowflake password, which will be used to authenticate dbt Cloud to run queries in Snowflake on behalf of a Snowflake user.
- Key Pair – The Keypair auth method is based on Snowflake’s Key Pair Authentication to authenticate the credentials when accessed from a DBT Cloud project. After generating an encrypted key pair, rsa_public_key should be set for the Snowflake user for the authentication process.
Fig 5. Snowflake Connection with DBT
Once the credentials are given, test the connection. On a successful connection, proceed to connect DBT with an empty GIT repository.