Pipelines


In this guide, we will explain how to set up a bauplan project and run a pipeline.

Organizing projects

Pipelines are organized into folders, each of which must contain a bauplan_project.yml file with:

  • a unique project id

  • a project name

For example:

project:
    id: 40d21649-a47h-437b-09hn-plm75edc1bn
    name: quick_start
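If you create project folders often, the file above can be generated by a script. The following is a minimal sketch, assuming that any freshly generated UUID is acceptable as a project id (the source only says the id must be unique); `render_project_file` is a hypothetical helper, not part of bauplan.

```python
import uuid
from typing import Optional


def render_project_file(name: str, project_id: Optional[str] = None) -> str:
    """Return the text of a bauplan_project.yml with the two required fields.

    Assumption: a random UUID is a valid project id.
    """
    project_id = project_id or str(uuid.uuid4())
    return f"project:\n    id: {project_id}\n    name: {name}\n"


# Print the file content; write it to <folder>/bauplan_project.yml if it looks right.
print(render_project_file("quick_start"))
```

The id is random on every call, so keep the generated file under version control rather than regenerating it.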

Ensure you are in the 01-quick_start folder, which contains a simple pipeline composed of two functions.

flowchart LR
    id0[(taxi_fhvhv)] --> id2[models.trips_and_zones]
    id1[(taxi_zones)] --> id2[models.trips_and_zones]
    id2[models.trips_and_zones] --> id3[models.normalized_taxi_trips]

To run a bauplan pipeline, execute bauplan run in your terminal from the pipeline folder. For now, we will continue running in memory (details below).

While the pipeline runs remotely, you can monitor its progress in real-time through the terminal. Any print statements in your code will appear directly in the terminal, which is very useful during development.

To see a preview of the pipeline’s tables in the terminal, add --head (combined here with --dry-run, which keeps the run in memory):

bauplan run --dry-run --head
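The same command can be launched from a script, for instance in a local check before committing. This is a sketch that only wraps the CLI invocation shown above with the standard library; `build_run_command` and `run_pipeline` are hypothetical helper names, and the only flags used are the two that appear in this guide.

```python
import shutil
import subprocess
from typing import List


def build_run_command(dry_run: bool = False, head: bool = False) -> List[str]:
    """Assemble the argument list for `bauplan run` with optional flags."""
    command = ["bauplan", "run"]
    if dry_run:
        command.append("--dry-run")
    if head:
        command.append("--head")
    return command


def run_pipeline(project_dir: str, dry_run: bool = False, head: bool = False) -> int:
    """Run the pipeline from project_dir and return the CLI exit code."""
    if shutil.which("bauplan") is None:
        raise RuntimeError("bauplan CLI not found on PATH")
    completed = subprocess.run(build_run_command(dry_run, head), cwd=project_dir)
    return completed.returncode


print(" ".join(build_run_command(dry_run=True, head=True)))
```

Passing the arguments as a list avoids shell quoting issues, and `cwd` mirrors the requirement to run from the pipeline folder.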

bauplan models

The pipeline begins with two tables from the data lake: taxi_fhvhv and taxi_zones. The subsequent nodes, called “models,” are expressed as Python functions (we will explore combining Python and SQL later). To designate a Python function as a model, we use the @bauplan.model() decorator.

Examine the code in models.py:

  • The first model, trips_and_zones, takes taxi_fhvhv and taxi_zones as input and joins them on PULocationID and DOLocationID

  • The second model, normalized_taxi_trips, takes the output table from the previous model, performs data cleaning and normalization using Pandas, and returns a final table

The relationship between the nodes is expressed in the code itself, by name: the second function declares the first model as one of its input arguments, so trips_and_zones serves as input for normalized_taxi_trips.
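To make the idea of name-based dependencies concrete, here is a small sketch of how a run order can be derived from function signatures alone. The `model` decorator and `Model` class are simplified stand-ins written for this illustration; they are not the bauplan API and say nothing about how bauplan resolves its graph internally.

```python
import inspect
from dataclasses import dataclass
from graphlib import TopologicalSorter

MODELS = {}  # registry: model name -> function (illustration only)


@dataclass(frozen=True)
class Model:
    """Stand-in reference to an upstream table or model, identified by name."""
    name: str


def model(func):
    """Stand-in decorator: register the function under its own name."""
    MODELS[func.__name__] = func
    return func


@model
def trips_and_zones(trips=Model("taxi_fhvhv"), zones=Model("taxi_zones")):
    ...


@model
def normalized_taxi_trips(data=Model("trips_and_zones")):
    ...


def build_graph(models):
    """Map each model to the names referenced by its input arguments."""
    graph = {}
    for name, func in models.items():
        params = inspect.signature(func).parameters.values()
        graph[name] = {p.default.name for p in params if isinstance(p.default, Model)}
    return graph


graph = build_graph(MODELS)
# Source tables come first in the sorted order; keep only the models.
run_order = [n for n in TopologicalSorter(graph).static_order() if n in MODELS]
print(run_order)  # ['trips_and_zones', 'normalized_taxi_trips']
```

Because the graph is read from the declarations, no separate orchestration file is needed: renaming or rewiring an input changes the pipeline shape directly.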

Note

Models in bauplan are functions that transform tabular objects into tabular objects. They should not be confused with ML models.

Python environments

Python functions often require packages and libraries. For example, the normalized_taxi_trips function requires pandas 2.2.0. Bauplan lets you express environments and dependencies entirely in code with the @bauplan.python() decorator, so there is no need to build and upload a container for each change.

Every function in a pipeline is fully containerized, so each function can have its own dependencies without concerns about environment consistency. For example, to change the pandas version required by normalized_taxi_trips from 2.2.0 to 1.5.3, modify the decorator:

@bauplan.python('3.11', pip={'pandas': '1.5.3', 'numpy': '1.23.2'})

Run the pipeline again, and the system will handle all necessary adjustments.
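One way to confirm which versions a function actually sees is to print them from inside the function body, since print statements surface in the terminal during a run. The sketch below uses only the standard library; `installed_version` and `check_pins` are hypothetical helpers, and the result depends on the environment where the code executes.

```python
from importlib import metadata
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str]) -> Dict[str, bool]:
    """Report, per package, whether the installed version equals the pin."""
    return {pkg: installed_version(pkg) == wanted for pkg, wanted in pins.items()}


# Same pins as in the decorator above; call this inside a model to verify them.
print(check_pins({"pandas": "1.5.3", "numpy": "1.23.2"}))
```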

Materialization

Data pipelines create new tables that can be used downstream by other people and systems. With bauplan, you can create new tables in your data lake in any branch of your data catalog by running a pipeline in a target branch.

To specify which tables should be materialized in the data catalog, use the materialization_strategy flag in the @bauplan.model() decorator. By default (when the flag is not set), artifacts are not materialized. When it is set to REPLACE, the decorated model’s output is materialized as an Iceberg table in the data catalog.

In models.py, set the materialization_strategy flag for the normalized_taxi_trips model to REPLACE:

@bauplan.model(materialization_strategy='REPLACE')  # Other options are 'NONE' or 'APPEND'

Then check out your hello_bauplan branch and run the pipeline:

bauplan checkout <YOUR_USERNAME>.hello_bauplan
bauplan run

👏👏 Congratulations, you just created a new table by running a pipeline!

The table normalized_taxi_trips was materialized in your branch. You can now inspect the table, query it, or merge it into main as needed.

Materialization Strategy and Memory Execution

In-Memory Execution

By default, if no artifacts have the materialization_strategy flag set, bauplan operates entirely in memory. This approach significantly accelerates execution since the system doesn’t need to persist tables. In-memory execution is particularly efficient for rapid pipeline development and iteration.

Using --dry-run

There are two scenarios where you might want to use the --dry-run flag with bauplan run:

  1. When artifacts have materialization_strategy set but you want to test changes in memory

  2. When your active branch is main (since materialization in main is not permitted)

Note

  • Materialization is not allowed in the main branch

  • Always use --dry-run when working in the main branch to avoid errors

  • The --dry-run flag forces in-memory execution regardless of materialization_strategy settings
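The rules in this section can be summarized as a small decision function. This is a sketch of the behavior as described above, not bauplan code: `plan_run` is a hypothetical name, and treating a pipeline with no materialization strategy as in-memory on every branch is one reading of the text.

```python
def plan_run(strategy: str, branch: str, dry_run: bool) -> str:
    """Predict the outcome of a run from the rules stated in this guide.

    strategy: 'NONE', 'REPLACE' or 'APPEND' (the model's materialization_strategy)
    branch:   name of the active branch
    dry_run:  whether --dry-run was passed
    """
    if dry_run or strategy == "NONE":
        # --dry-run forces in-memory execution; NONE never persists anything.
        return "in-memory"
    if branch == "main":
        # Materialization is not allowed in main.
        return "error"
    return "materialize"


for args in [("REPLACE", "main", True), ("REPLACE", "main", False),
             ("REPLACE", "me.hello_bauplan", False), ("NONE", "main", False)]:
    print(args, "->", plan_run(*args))
```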

SQL models

Bauplan supports both SQL and Python, allowing you to combine them in your pipelines. For example, to add an SQL step that computes the top pickup locations for taxi trips in NY:

flowchart LR
    id0[(taxi_fhvhv)] --> id2[models.trips_and_zones]
    id1[(taxi_zones)] --> id2[models.trips_and_zones]
    id2[models.trips_and_zones] --> id3[models.normalized_taxi_trips]
    id3[models.normalized_taxi_trips] --> id4[top_pickup_locations.sql]

Create an SQL file named top_pickup_locations.sql in the pipeline folder with this code:

-- bauplan: materialization_strategy=REPLACE
SELECT
    COUNT(pickup_datetime) as number_of_trips,
    Borough,
    Zone
FROM
    bauplan.normalized_taxi_trips
GROUP BY
    Borough, Zone
ORDER BY COUNT(pickup_datetime) DESC

Run the pipeline again, and the system will automatically handle the Python-to-SQL transitions. A new table called top_pickup_locations will appear in your branch. To inspect and query the table:

bauplan table get bauplan.top_pickup_locations
bauplan query "SELECT * FROM bauplan.top_pickup_locations"

Note

  • The dependency on the upstream model is implicit in the query: the FROM clause references it by name (SELECT ... FROM bauplan.normalized_taxi_trips)

  • SQL files require a comment specifying the materialization strategy

  • Bauplan uses DuckDB SQL dialect; refer to the DuckDB syntax documentation for details
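To see what the aggregation returns before running it on real data, you can try the same query on a few hand-made rows. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (bauplan itself uses the DuckDB dialect), drops the bauplan. namespace prefix, and runs on invented sample rows.

```python
import sqlite3

QUERY = """
SELECT
    COUNT(pickup_datetime) AS number_of_trips,
    Borough,
    Zone
FROM
    normalized_taxi_trips
GROUP BY
    Borough, Zone
ORDER BY COUNT(pickup_datetime) DESC
"""

# Invented sample rows: (pickup_datetime, Borough, Zone)
rows = [
    ("2023-01-01 08:00:00", "Manhattan", "Midtown Center"),
    ("2023-01-01 08:05:00", "Manhattan", "Midtown Center"),
    ("2023-01-01 09:00:00", "Manhattan", "Midtown Center"),
    ("2023-01-01 08:10:00", "Queens", "JFK Airport"),
    ("2023-01-01 08:30:00", "Queens", "JFK Airport"),
    ("2023-01-01 09:15:00", "Brooklyn", "Williamsburg"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE normalized_taxi_trips (pickup_datetime TEXT, Borough TEXT, Zone TEXT)"
)
conn.executemany("INSERT INTO normalized_taxi_trips VALUES (?, ?, ?)", rows)
result = conn.execute(QUERY).fetchall()
conn.close()

for number_of_trips, borough, zone in result:
    print(number_of_trips, borough, zone)
# 3 Manhattan Midtown Center
# 2 Queens JFK Airport
# 1 Brooklyn Williamsburg
```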

This concludes our introductory tutorial to bauplan. You should now understand the platform’s core functions, including data branching, running, and querying. For more advanced capabilities and information about using our platform within your stack via our Python SDK, explore our examples section.