Quick start
Prerequisites
- Install bauplan.
Bauplan requires Python 3.10 or higher. Make sure you have a compatible Python version installed before proceeding.
Create a branch
Create an isolated data branch to work in - like a git branch, but for your lakehouse:
bauplan checkout -b <YOUR_USERNAME>.quickstart # create a new branch, and switch to it
Explore the data
The Bauplan sandbox comes with pre-loaded public datasets. This command shows the tables in your checked-out branch:
bauplan table ls
You will see a list of tables because a newly created branch inherits all existing tables from its parent branch. Before getting into the project, let's explore the schema of the table we're going to work with:
bauplan table get titanic
You'll see the table's columns and their types - this is the input our pipeline will transform.
Scaffold a project
Run bauplan init to generate a ready-to-run project:
bauplan init
This creates three files:
- bauplan_project.yaml - project metadata (id and name). This is also where secrets and parameters are stored.
- models.py - a pipeline with one model (survival_rate_by_age) that reads from the titanic table, plus one expectation test (test_age) that validates the output.
- pyproject.toml - Python dependencies, which in this case is just bauplan itself.
Run the pipeline
The generated project defines a small pipeline: it reads from the titanic table, computes survival rates by age, and validates the output with an expectation test. Let's run it:
bauplan run
Looking at models.py, you'll find two functions:
- survival_rate_by_age: a model decorated with @bauplan.model(). It reads the Age and Survived columns from the titanic table, groups passengers by age, and computes the average survival rate per age group.
- test_age: an expectation decorated with @bauplan.expectation(). Expectations are Bauplan's built-in way to run data quality checks. This one verifies that the Age column in the output has no duplicate values.
Materialize the output as a table
If you check the output of bauplan run, you'll see that the survival_rate_by_age model executed successfully, and the expectation test passed. However, if you look at your tables again:
bauplan table ls --name survival_rate_by_age
You'll see that no new table was created for the model's output.
By default, the materialization_strategy parameter of @bauplan.model() is set to NONE, which keeps the results in memory without writing anything to the catalog. To persist the output as a table, change the decorator in models.py:
@bauplan.model(materialization_strategy='REPLACE')
This tells Bauplan to write (and fully replace on each run) the model's output as an Iceberg table in your branch. Run the pipeline again:
bauplan run
You can confirm the table is persisted by running:
bauplan query "SELECT * FROM survival_rate_by_age" # only 10 rows are shown by default
You should see two columns, Age and Survived, showing the average survival rate for each age group.
Learn more about materialization strategies and models here.
Congratulations - you've just built and run your first Bauplan pipeline!
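If you want to repeat the re-run and verification steps from the last section without retyping them, they can be wrapped in a small script. The sketch below only shells out to the commands already shown in this guide. The helper name run_and_verify is hypothetical, and it assumes the bauplan CLI reports failure with a non-zero exit code.

```python
import shutil
import subprocess

# Commands copied from this guide. Run them from the project directory,
# after models.py has been changed to materialization_strategy='REPLACE'.
VERIFY_COMMANDS = [
    ["bauplan", "run"],
    ["bauplan", "table", "ls", "--name", "survival_rate_by_age"],
    ["bauplan", "query", "SELECT * FROM survival_rate_by_age"],
]


def run_and_verify(commands=VERIFY_COMMANDS, runner=None):
    """Run each command in order and stop at the first failure.

    Assumption: the bauplan CLI exits with a non-zero status on failure,
    so check=True is enough to abort the sequence.
    `runner` exists only so the function can be exercised without the CLI.
    """
    if runner is None:
        if shutil.which("bauplan") is None:
            raise RuntimeError("bauplan CLI not found on PATH")
        runner = subprocess.run
    for argv in commands:
        runner(argv, check=True)  # subprocess.run raises CalledProcessError on failure
    return len(commands)
```

Calling run_and_verify() with no arguments executes the three commands against your currently checked-out branch.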