Quick start

Prerequisites

Note: Bauplan requires Python 3.10 or higher. Make sure you have a compatible Python version installed before proceeding.

Create a branch

Create an isolated data branch to work in. It behaves like a Git branch, but for your lakehouse:

bauplan checkout -b <YOUR_USERNAME>.quickstart # create a new branch, and switch to it

Explore the data

The Bauplan sandbox comes with pre-loaded public datasets. The following command lists the tables in your checked-out branch:

bauplan table ls

You will see a list of tables because a newly created branch inherits all existing tables from its parent branch. Before getting into the project, explore the schema of the table we're going to work with:

bauplan table get titanic

You'll see the table's columns and their types. This is the input our pipeline will transform.

Scaffold a project

Run bauplan init to generate a ready-to-run project:

bauplan init

This creates three files:

  • bauplan_project.yaml - project metadata (id and name); secrets and parameters are also stored here.
  • models.py - a pipeline with one model (survival_rate_by_age) that reads from the titanic table, plus one expectation test (test_age) that validates the output.
  • pyproject.toml - Python dependencies, which in this case is just bauplan itself.

Run the pipeline

The generated project defines a small pipeline: it reads from the titanic table, computes survival rates by age, and validates the output with an expectation test. Let's run it:

bauplan run

Looking at models.py, you'll find two functions:

  • survival_rate_by_age: a model decorated with @bauplan.model(). It reads the Age and Survived columns from the titanic table, groups passengers by age, and computes the average survival rate per age group.
  • test_age: an expectation decorated with @bauplan.expectation(). Expectations are Bauplan's built-in way to run data quality checks. This one verifies that the Age column in the output has no duplicate values.
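To make the logic of these two functions concrete, here is a minimal sketch in plain Python. It is not the generated models.py and does not use the bauplan package: the function check_age_unique, the row format (a list of dicts), the sample rows, and the choice to skip rows with a missing Age are all assumptions made for illustration, as is the idea that Survived holds 0 or 1.

```python
from collections import defaultdict


def survival_rate_by_age(rows):
    """Group passengers by Age and average the Survived flag per group."""
    totals = defaultdict(lambda: [0, 0])  # age -> [survivors, passengers]
    for row in rows:
        age = row["Age"]
        if age is None:  # assumption: passengers with unknown age are skipped
            continue
        totals[age][0] += row["Survived"]
        totals[age][1] += 1
    return [
        {"Age": age, "Survived": survived / count}
        for age, (survived, count) in sorted(totals.items())
    ]


def check_age_unique(output_rows):
    """Return True when no Age value appears more than once in the output."""
    ages = [row["Age"] for row in output_rows]
    return len(ages) == len(set(ages))


# Made-up rows for illustration only, not actual rows from the titanic table.
sample = [
    {"Age": 22, "Survived": 0},
    {"Age": 22, "Survived": 1},
    {"Age": 38, "Survived": 1},
    {"Age": None, "Survived": 0},
]

result = survival_rate_by_age(sample)
print(result)
print(check_age_unique(result))
```

Because the model groups by age, each age appears exactly once in its output, which is why a uniqueness check on Age is a natural expectation for this pipeline.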

Materialize the output as a table

If you check the output of bauplan run, you'll see that the survival_rate_by_age model executed successfully, and the expectation test passed. However, if you look at your tables again:

bauplan table ls --name survival_rate_by_age

You'll see that no new table was created for the model's output. By default, the materialization_strategy parameter of @bauplan.model() is set to NONE, so the result stays in memory and nothing is written to the catalog. To persist the output as a table, make this change to models.py:

@bauplan.model(materialization_strategy='REPLACE')

This tells Bauplan to write (and fully replace on each run) the model's output as an Iceberg table in your branch. Run the pipeline again:

bauplan run
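The difference between the default NONE strategy and REPLACE can be pictured with a toy simulation. This is only an illustration of the behavior described above, not Bauplan's API or implementation: run_model, the dict standing in for the catalog, and the sample values are all hypothetical.

```python
def run_model(catalog, name, compute, materialization_strategy="NONE"):
    """Run `compute`; persist its result only when the strategy is REPLACE."""
    output = compute()  # the model executes regardless of the strategy
    if materialization_strategy == "REPLACE":
        catalog[name] = output  # overwrite any previous version of the table
    return output


# Toy stand-in for the tables visible in a branch (hypothetical).
catalog = {"titanic": ["...source rows..."]}

# Default strategy: the model runs, but no table is written.
run_model(catalog, "survival_rate_by_age", lambda: [(22, 0.5)])
tables_after_default_run = sorted(catalog)

# REPLACE: each run writes the table and fully replaces the previous contents.
run_model(catalog, "survival_rate_by_age", lambda: [(22, 0.5)], "REPLACE")
run_model(catalog, "survival_rate_by_age", lambda: [(38, 1.0)], "REPLACE")

print(tables_after_default_run)
print(catalog["survival_rate_by_age"])
```

In this sketch the second REPLACE run leaves only its own rows in the table, mirroring the "fully replace on each run" behavior.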

You can confirm the table is persisted by running:

bauplan query "SELECT * FROM survival_rate_by_age" # only 10 rows are shown by default

You should see two columns, Age and Survived, showing the average survival rate for each age group.

To learn more, see the Bauplan documentation on materialization strategies and models.

Congratulations, you've just built and run your first Bauplan pipeline!