Pipelines¶
In this guide, we will explain how to set up a bauplan project and run a pipeline.
Organizing projects¶
Pipelines are organized into folders, each of which must contain a bauplan_project.yml file with a unique project id and a project name:
project:
    id: 40d21649-a47h-437b-09hn-plm75edc1bn
    name: quick_start
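As a rough illustration of the file above, the following Python sketch renders a bauplan_project.yml with a freshly generated id. It assumes that any unique string, such as a UUID, is acceptable as the project id. The helper name render_project_file is hypothetical and not part of bauplan.

```python
import uuid


def render_project_file(name: str) -> str:
    """Return the text of a bauplan_project.yml for a project called `name`.

    Assumption: a random UUID is an acceptable unique project id.
    """
    project_id = uuid.uuid4()
    return f"project:\n    id: {project_id}\n    name: {name}\n"


if __name__ == "__main__":
    # Print the file content; redirect it into bauplan_project.yml if desired.
    print(render_project_file("quick_start"), end="")
```

Generating the id in code rather than typing it by hand is one simple way to keep ids unique across project folders.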
Ensure you are in the 01-quick_start folder, which contains a simple pipeline composed of two functions.
To run a bauplan pipeline, execute bauplan run in your terminal from the pipeline folder. For now, we will continue running in memory (details below).
While the pipeline runs remotely, you can monitor its progress in real-time through the terminal. Any print statements in your code will appear directly in the terminal, which is very useful during development.
To see a preview of the pipeline’s tables in the terminal, add --head:
bauplan run --dry-run --head
bauplan models¶
The pipeline begins with two tables from the data lake: taxi_fhvhv and taxi_zones. The subsequent nodes, called “models,” are expressed as Python functions (we will explore combining Python and SQL later). To designate a Python function as a model, we use the @bauplan.model() decorator.
Examine the code in models.py:
The first model, trips_and_zones, takes taxi_fhvhv and taxi_zones as input and joins them on PULocationID and DOLocationID.
The second model, normalized_taxi_trips, takes the output table from the previous model, performs data cleaning and normalization using Pandas, and returns a final table.
The relationship between the nodes is expressed through naming conventions: the first model is passed as an input argument to the second in the code, so the model trips_and_zones serves as input for normalized_taxi_trips.
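To make the naming convention concrete, here is a toy sketch in plain Python of how a run order could be derived from function and parameter names. It is not bauplan's implementation: the model decorator, the MODELS registry, and execution_order are hypothetical stand-ins, and the model bodies are left empty on purpose.

```python
import inspect
from graphlib import TopologicalSorter

# Hypothetical registry: NOT part of bauplan, only a stand-in for this sketch.
MODELS = {}


def model(func):
    """Toy decorator that registers a function as a pipeline node by name."""
    MODELS[func.__name__] = func
    return func


@model
def trips_and_zones(taxi_fhvhv, taxi_zones):
    """First model: reads the two source tables."""


@model
def normalized_taxi_trips(trips_and_zones):
    """Second model: its parameter name points at the first model."""


def execution_order():
    """Order the registered models so every input runs before its consumer."""
    # Map each model to the names of its parameters, i.e. its upstream nodes.
    graph = {
        name: set(inspect.signature(func).parameters)
        for name, func in MODELS.items()
    }
    ordered = TopologicalSorter(graph).static_order()
    # Source tables (taxi_fhvhv, taxi_zones) are not models, so drop them.
    return [name for name in ordered if name in MODELS]
```

Calling execution_order() returns trips_and_zones first and normalized_taxi_trips second, mirroring the dependency described above.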
Note
Models in bauplan are functions that transform tabular objects into tabular objects. They should not be confused with ML models.
Python environments¶
Python functions often require packages and libraries. For example, the normalized_taxi_trips function requires pandas 2.2.0.
Bauplan allows you to express environments and dependencies entirely in code, eliminating the need to build and upload containers for each change. You can specify Python environments and dependencies using the @bauplan.python() decorator.
Functions in a pipeline are fully containerized in bauplan. This means each function can have its own dependencies without concerns about environment consistency.
For example, to change the Pandas version required by normalized_taxi_trips from 2.2.0 to 1.5.3, simply modify the decorator:
@bauplan.python('3.11', pip={'pandas': '1.5.3', 'numpy': '1.23.2'})
Run the pipeline again, and the system will handle all necessary adjustments.
Materialization¶
Data pipelines create new tables that can be used downstream by other people and systems. With bauplan, you can create new tables in your data lake in any branch of your data catalog by running a pipeline in a target branch.
To specify which tables should be materialized in the data catalog, use the materialization_strategy flag in the @bauplan.model() decorator. By default (when not set), artifacts will not be materialized. When set to REPLACE, the decorated model’s output will be materialized as an Iceberg table in the data catalog.
In models.py, set the materialization_strategy flag for the normalized_taxi_trips model to REPLACE:
@bauplan.model(materialization_strategy='REPLACE') # Other options are 'NONE' or 'APPEND'
Then check out your hello_bauplan branch and run the pipeline:
bauplan checkout <YOUR_USERNAME>.hello_bauplan
bauplan run
👏👏 Congratulations, you just created a new table by running a pipeline!
The table normalized_taxi_trips was materialized in your branch. You can now inspect the table, query it, or merge it into main as needed.
Materialization Strategy and Memory Execution¶
In-Memory Execution¶
By default, if no artifacts have the materialization_strategy flag set, bauplan operates entirely in memory. This approach significantly accelerates execution since the system doesn’t need to persist tables. In-memory execution is particularly efficient for rapid pipeline development and iteration.
Using --dry-run¶
There are two scenarios where you might want to use the --dry-run flag with bauplan run:
When artifacts have materialization_strategy set but you want to test changes in memory.
When your active branch is main (since materialization in main is not permitted).
Note
Materialization is not allowed in the main branch.
Always use --dry-run when working in the main branch to avoid errors.
The --dry-run flag forces in-memory execution regardless of materialization_strategy settings.
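The rules above can be summarized as a small decision function. This is only a sketch of the behavior as described in this guide, not bauplan code: resolve_execution is a hypothetical name, and raising PermissionError for a materializing run on main is an assumption about how the error surfaces.

```python
def resolve_execution(strategy="NONE", dry_run=False, branch="main"):
    """Return "in-memory" or "materialize" for one model, per the rules above."""
    if strategy not in {"NONE", "REPLACE", "APPEND"}:
        raise ValueError(f"unknown materialization strategy: {strategy!r}")
    # --dry-run wins over everything; NONE never persists anything.
    if dry_run or strategy == "NONE":
        return "in-memory"
    # Assumption: a materializing run on main is rejected with an error.
    if branch == "main":
        raise PermissionError("materialization is not allowed in main")
    return "materialize"
```

For example, resolve_execution("REPLACE", dry_run=True, branch="main") stays in memory, while the same call without dry_run raises an error.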
SQL models¶
Bauplan supports both SQL and Python, allowing you to combine them in your pipelines. For example, to add an SQL step that computes the top pickup locations for taxi trips in NY:
Create an SQL file named top_pickup_locations.sql in the pipeline folder with this code:
-- bauplan: materialization_strategy=REPLACE
SELECT
    COUNT(pickup_datetime) AS number_of_trips,
    Borough,
    Zone
FROM
    bauplan.normalized_taxi_trips
GROUP BY
    Borough, Zone
ORDER BY COUNT(pickup_datetime) DESC
Run the pipeline again, and the system will automatically handle the Python-to-SQL transitions. A new table called top_pickup_locations will appear in your branch.
To inspect and query the table:
bauplan table get bauplan.top_pickup_locations
bauplan query "SELECT * FROM bauplan.top_pickup_locations"
Note
Naming conventions are enforced implicitly in the query (SELECT ... FROM bauplan.normalized_taxi_trips).
SQL files require a comment specifying the materialization strategy.
Bauplan uses the DuckDB SQL dialect; refer to the DuckDB syntax documentation for details.
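To see what the aggregation in top_pickup_locations.sql computes, the sketch below runs the same query locally on a few made-up rows. It uses Python's built-in sqlite3 module instead of DuckDB or bauplan, attaches an in-memory database under the name bauplan so the qualified table name resolves, and assumes text columns. The sample rows and the helper top_pickup_locations are invented for illustration.

```python
import sqlite3

QUERY = """
SELECT
    COUNT(pickup_datetime) AS number_of_trips,
    Borough,
    Zone
FROM
    bauplan.normalized_taxi_trips
GROUP BY
    Borough, Zone
ORDER BY COUNT(pickup_datetime) DESC
"""

# Made-up sample rows: (pickup_datetime, Borough, Zone).
SAMPLE_ROWS = [
    ("2023-01-01 08:00:00", "Manhattan", "Midtown Center"),
    ("2023-01-01 08:05:00", "Manhattan", "Midtown Center"),
    ("2023-01-01 09:00:00", "Queens", "JFK Airport"),
]


def top_pickup_locations(rows):
    """Run the aggregation over `rows` in a throwaway SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        # Attach a second in-memory database named "bauplan" so that
        # bauplan.normalized_taxi_trips is a valid qualified name.
        conn.execute("ATTACH DATABASE ':memory:' AS bauplan")
        conn.execute(
            "CREATE TABLE bauplan.normalized_taxi_trips "
            "(pickup_datetime TEXT, Borough TEXT, Zone TEXT)"
        )
        conn.executemany(
            "INSERT INTO bauplan.normalized_taxi_trips VALUES (?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for number_of_trips, borough, zone in top_pickup_locations(SAMPLE_ROWS):
        print(number_of_trips, borough, zone)
```

With these sample rows, the Manhattan zone comes first with two trips, followed by the Queens zone with one.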
This concludes our introductory tutorial to bauplan. You should now understand the platform’s core functions, including data branching, running, and querying. For more advanced capabilities and information about using our platform within your stack via our Python SDK, explore our examples section.