Run a pipeline¶
In this guide, we will explain how to set up a bauplan project and run a pipeline.
Organizing projects¶
Pipelines are organized into folders, each of which must contain a bauplan_project.yml file with:

- a unique project id
- a project name

project:
  id: 40d21649-a47h-437b-09hn-plm75edc1bn
  name: quick_start
Note
For information about managing sensitive information in your projects, see Managing secrets. For best practices in organizing Bauplan projects, see here.
Ensure you are in the 01-quick_start folder, which contains a simple pipeline composed of two functions.
To run a Bauplan pipeline, execute bauplan run in your terminal from the pipeline folder. For now, we will continue running in memory (details below).
While the pipeline runs remotely, you can monitor its progress in real-time through the terminal. Any print statements in your code will appear directly in the terminal, which is very useful during development.
To see a preview of the pipeline's tables in the terminal, add --preview head:
bauplan run --dry-run --preview head
Bauplan models¶
The pipeline begins with two tables from the data lake: taxi_fhvhv and taxi_zones. The subsequent nodes, called "models," are expressed as Python functions (we will explore combining Python and SQL later). To designate a Python function as a model, we use the @bauplan.model() decorator.
Examine the code in models.py:
- The first model, trips_and_zones, takes taxi_fhvhv and taxi_zones as input and joins them on PULocationID and DOLocationID.
- The second model, normalized_taxi_trips, takes the output table from the previous model, performs data cleaning and normalization using Pandas, and returns a final table.
The relationship between the nodes is expressed through naming conventions by passing the first model as an input argument to the second in the code: the model trips_and_zones serves as input for normalized_taxi_trips.
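Since models are simply functions from tables to tables, the two transformations described above can be sketched in plain pandas. This is an illustration of the logic only, not the actual models.py: the toy DataFrames, the trip_miles column, and the LocationID key in taxi_zones are invented for the example, and the real models use the bauplan decorators described in this guide.

```python
import pandas as pd

# Toy stand-ins for the two data-lake tables (invented data)
taxi_fhvhv = pd.DataFrame({
    "PULocationID": [1, 2],
    "DOLocationID": [2, 1],
    "trip_miles": [3.5, -1.0],
})
taxi_zones = pd.DataFrame({
    "LocationID": [1, 2],
    "Zone": ["Newark Airport", "Jamaica Bay"],
})

# trips_and_zones: join trips with zone info on pickup and drop-off IDs
trips_and_zones = (
    taxi_fhvhv
    .merge(taxi_zones, left_on="PULocationID", right_on="LocationID")
    .merge(taxi_zones, left_on="DOLocationID", right_on="LocationID",
           suffixes=("_pickup", "_dropoff"))
)

# normalized_taxi_trips: simple cleaning, e.g. dropping impossible values
normalized_taxi_trips = trips_and_zones[trips_and_zones["trip_miles"] > 0]
print(len(normalized_taxi_trips))  # 1
```

In the real pipeline, the same chaining happens by naming: normalized_taxi_trips receives the output of trips_and_zones as its input argument.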
Important
Models in Bauplan are functions that transform tabular objects into tabular objects. They should not be confused with ML models.
Python environments¶
Python functions often require packages and libraries. For example, the normalized_taxi_trips function requires pandas 2.2.0. Bauplan allows you to express environments and dependencies entirely in code, eliminating the need to build and upload containers for each change. You can specify Python environments and dependencies using the @bauplan.python() decorator. Functions in a pipeline are fully containerized in Bauplan, so each function can have its own dependencies without concerns about environment consistency. For example, to change the Pandas version required by normalized_taxi_trips from 2.2.0 to 1.5.3, simply modify the decorator:
@bauplan.python('3.11', pip={'pandas': '1.5.3', 'numpy': '1.23.2'})
Run the pipeline again, and the system will handle all necessary adjustments.
Materialization¶
Data pipelines create new tables that can be used downstream by other people and systems. With Bauplan, you can create new tables in your data lake in any branch of your data catalog by running a pipeline in a target branch. To specify which tables should be materialized in the data catalog, use the materialization_strategy flag in the @bauplan.model() decorator. By default (when not set), artifacts are not materialized. When set to REPLACE, the decorated model's output will be materialized as an Iceberg table in the data catalog. In models.py, set the materialization_strategy flag for the normalized_taxi_trips model to REPLACE:
@bauplan.model(materialization_strategy='REPLACE') # Other options are 'NONE' or 'APPEND'
Then check out your hello_bauplan branch and run the pipeline:
bauplan checkout <YOUR_USERNAME>.hello_bauplan
bauplan run
👏👏 Congratulations, you just created a new table by running a pipeline!
The table normalized_taxi_trips was materialized in your branch. You can now inspect and query it:
bauplan table get normalized_taxi_trips
bauplan query "SELECT * FROM normalized_taxi_trips LIMIT 5"
Remember, Bauplan uses the DuckDB SQL dialect, so check the DuckDB documentation for syntax details.
For an extensive explanation of Bauplan models, see Models.
In-memory Execution and --dry-run¶
To iterate quickly in your terminal, we suggest you run your pipelines in memory. This approach significantly accelerates execution since the system doesn't need to write tables to object storage at each run. To run in memory, simply set the materialization_strategy flag to NONE.
Alternatively, you can use the --dry-run flag with bauplan run. This flag will run in memory even when your models have materialization_strategy set to REPLACE or APPEND. Moreover, --dry-run allows you to run pipelines directly in the main branch (since by default materialization in main is not permitted).
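Putting the commands from this guide together, a safe iteration loop on main might look like the following sketch (actual output depends on your pipeline):

```shell
# Switch to main, then run the whole pipeline in memory:
# --dry-run forces in-memory execution, so no tables are
# written to the catalog, and --preview head shows a sample
# of each table in the terminal.
bauplan checkout main
bauplan run --dry-run --preview head
```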
Note
- Materialization is not allowed in the main branch
- Always use --dry-run when working in the main branch to avoid errors
- The --dry-run flag forces in-memory execution regardless of materialization_strategy settings
This concludes our introductory tutorial to Bauplan. You should now understand the platform’s core functions, including data branching, running, and querying. For more advanced capabilities and information about using our platform within your stack via our Python SDK, explore our Examples section.