Pipelines

In this guide, we will explain how to set up a Bauplan project and run a pipeline.

Organizing projects

Pipelines are organized into folders, each of which must contain a bauplan_project.yaml file with:

  • a unique project id,

  • a project name,

  • a default Python interpreter.

project:
    id: 40d21649-a47h-437b-09hn-plm75edc1bn
    name: hello_bauplan

defaults:
    python_version: 3.11
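When starting a new project, the file above can be generated rather than written by hand. The sketch below is a minimal illustration, not part of Bauplan: the helper name render_project_file is hypothetical, and it assumes that a standard random UUID is an acceptable project id (the guide does not state the required id format).

```python
import uuid


def render_project_file(name: str, python_version: str = "3.11") -> str:
    """Return the text of a bauplan_project.yaml with a freshly generated id.

    Assumption: any standard UUID works as the project id.
    """
    project_id = uuid.uuid4()  # random UUID, unique for practical purposes
    return (
        "project:\n"
        f"    id: {project_id}\n"
        f"    name: {name}\n"
        "\n"
        "defaults:\n"
        f"    python_version: {python_version}\n"
    )


if __name__ == "__main__":
    # Print the file content; redirect it to bauplan_project.yaml if desired.
    print(render_project_file("hello_bauplan"))
```

The layout mirrors the example file shown above, with the same three fields and indentation.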

Make sure you are in the quick_start folder, which contains a simple pipeline composed of two functions.

flowchart LR
    id0[(taxi_fhvhv)] --> id2[models.trips_and_zones]
    id1[(taxi_zones)] --> id2[models.trips_and_zones]
    id2[models.trips_and_zones] --> id3[models.normalized_taxi_trips]

To run a bauplan pipeline, simply execute bauplan run in your terminal from the pipeline folder.

Note that even though the pipeline runs remotely, we can see what happens in real time directly in the terminal. Every print statement in your code is printed in the terminal, which is very useful while developing.

To see a quick preview of the pipeline's tables in the terminal, add the --head flag:

bauplan run --head

Bauplan models

The pipeline starts from two tables in the data lake, taxi_fhvhv and taxi_zones. The subsequent nodes are called “models” and in this case are expressed as Python functions (we will see how to mix and match Python and SQL later). To tell the system which Python functions are models, we use the decorator @bauplan.model().

Check out the code in the file models.py:

  • The first model is the function trips_and_zones, which takes taxi_fhvhv and taxi_zones and joins them on PULocationID and DOLocationID.

  • The second model is the function normalized_taxi_trips, which takes the output table of the previous model, does some data cleaning and normalization using Pandas, and returns a final table.

The relationship between the nodes is expressed through a naming convention: the first model is passed as the input argument to the second, so the output of trips_and_zones becomes the input of normalized_taxi_trips.
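To make the naming convention concrete, here is a toy sketch in plain Python. It is not Bauplan's implementation and does not use the Bauplan SDK. The model decorator, the MODELS registry, and the run function are hypothetical stand-ins; the sample columns LocationID and trip_miles are invented; and the join is simplified to the pickup location only. It shows how matching a parameter name to an upstream model name is enough to derive the execution order.

```python
import inspect
from graphlib import TopologicalSorter

MODELS = {}  # model name -> function; a stand-in registry, not Bauplan's


def model(func):
    """Toy stand-in for a model decorator: register the function by name."""
    MODELS[func.__name__] = func
    return func


@model
def trips_and_zones(taxi_fhvhv, taxi_zones):
    # Join each trip to its pickup zone (simplified: pickup side only).
    zones = {z["LocationID"]: z for z in taxi_zones}
    return [{**t, **zones[t["PULocationID"]]} for t in taxi_fhvhv]


@model
def normalized_taxi_trips(trips_and_zones):
    # The parameter name matches the name of the upstream model.
    return [{**r, "Zone": r["Zone"].strip().lower()} for r in trips_and_zones]


def run(sources):
    """Run every registered model after the tables it depends on."""
    deps = {n: set(inspect.signature(f).parameters) for n, f in MODELS.items()}
    tables = dict(sources)
    for name in TopologicalSorter(deps).static_order():
        if name in MODELS:  # source tables are already in `tables`
            tables[name] = MODELS[name](**{p: tables[p] for p in deps[name]})
    return tables


if __name__ == "__main__":
    out = run({
        "taxi_fhvhv": [{"PULocationID": 1, "trip_miles": 2.5}],  # invented row
        "taxi_zones": [{"LocationID": 1, "Zone": " Newark Airport "}],
    })
    print(out["normalized_taxi_trips"])
```

Because the dependency graph is read from the function signatures, adding a third model only requires naming its parameter after the table it consumes.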

Note

Models in Bauplan are essentially functions from tabular objects to tabular objects. They are not to be confused with ML models.

Python environments

Python functions often need packages and libraries. For instance, the function normalized_taxi_trips of our example pipeline requires pandas 2.1.0.

One very cool thing about Bauplan is that it lets us express environments and dependencies entirely in code, without having to build and upload a container every time.

We do this by simply using the decorator @bauplan.python().

The functions that make up a pipeline are fully containerized in Bauplan. This means that each function can have its own dependencies and we don’t have to worry about environment consistency.

For instance, we can change the version of Pandas required by normalized_taxi_trips from 2.1.0 to 1.5.3 by simply editing the decorator:

@bauplan.python('3.11', pip={'pandas': '1.5.3'})

Run the pipeline again and the system will figure out everything for us.
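One way to confirm which package version a function actually sees is to print it from inside the function, since print output is streamed to the terminal during a run. The sketch below uses only the Python standard library; the helper name installed_version is hypothetical and not part of Bauplan.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Placed inside a model function, a line like this reports the version
# available in that function's environment.
print("pandas version:", installed_version("pandas"))
```

The helper returns None instead of raising when the package is missing, so the same line works in any environment.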

Materialization

Data pipelines are used to create new tables that can be used by other people and systems downstream. With Bauplan we can create new tables in our data lake in any branch of our data catalog by simply running a pipeline in a target branch.

To decide which individual tables are materialized in the data catalog, we use the materialize flag in the decorator @bauplan.model(). When we set the flag to True, the output of the decorated model is materialized as an Iceberg table in the data catalog.

Go into the file models.py and set the materialize flag of the model normalized_taxi_trips to True:

@bauplan.model(materialize=True)

Then check out your hello_bauplan branch and run the pipeline:

bauplan branch checkout <YOUR_USERNAME>.hello_bauplan
bauplan run

👏👏 Congratulations, you just created a new table by running a pipeline!

The table normalized_taxi_trips has been materialized in your branch. Now you can inspect the table, query it, or merge it into main as you please.

Note

If none of your artifacts have the materialize flag set to True, Bauplan will operate entirely in memory. This significantly speeds up the run, as the system doesn’t need to write your tables to disk, which makes it an efficient way to iterate quickly on your pipelines during development.

SQL models

What if we want to use SQL in our pipelines? Well, we’re glad you asked. Bauplan supports both SQL and Python and allows you to mix and match between them.

Let’s say we want to add an SQL step that computes the top pickup locations for taxi trips in New York, so that the pipeline looks like this:

flowchart LR
    id0[(taxi_fhvhv)] --> id2[models.trips_and_zones]
    id1[(taxi_zones)] --> id2[models.trips_and_zones]
    id2[models.trips_and_zones] --> id3[models.normalized_taxi_trips]
    id3[models.normalized_taxi_trips] --> id4[top_pickup_locations.sql]

We can simply create an SQL file named top_pickup_locations.sql directly in the pipeline folder and copy this code into it:

-- bauplan: materialize=True
SELECT
    COUNT(pickup_datetime) as number_of_trips,
    Borough,
    Zone
FROM
    normalized_taxi_trips
GROUP BY
    Borough, Zone
ORDER BY COUNT(pickup_datetime) DESC

Now, let’s run the pipeline again: the system automatically takes care of going from Python to SQL and vice versa. We should now have a new table in our branch called top_pickup_locations.

As usual, we can inspect the table and interactively query it:

bauplan table get top_pickup_locations
bauplan query "SELECT * FROM top_pickup_locations"

A few small notes:

  • The naming convention is enforced implicitly in the query (SELECT ... FROM normalized_taxi_trips).

  • We need a comment in the SQL file to tell Bauplan whether this table needs to be materialized or not.

  • Bauplan’s SQL dialect is DuckDB, so double-check the syntax.
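To sanity-check the logic of the query on a few rows before running the pipeline, a local sketch like the following can help. It uses Python's built-in sqlite3 module rather than DuckDB, so it only validates this simple aggregation and says nothing about DuckDB-specific syntax. The function name and the sample rows are invented for illustration.

```python
import sqlite3

QUERY = """
SELECT
    COUNT(pickup_datetime) AS number_of_trips,
    Borough,
    Zone
FROM
    normalized_taxi_trips
GROUP BY
    Borough, Zone
ORDER BY COUNT(pickup_datetime) DESC
"""

SAMPLE = [  # invented rows, for illustration only
    ("2023-01-01T08:00:00", "Manhattan", "Midtown Center"),
    ("2023-01-01T09:00:00", "Manhattan", "Midtown Center"),
    ("2023-01-01T09:30:00", "Queens", "JFK Airport"),
]


def top_pickup_locations(rows):
    """Run the tutorial query against in-memory sample rows (SQLite, not DuckDB)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE normalized_taxi_trips "
        "(pickup_datetime TEXT, Borough TEXT, Zone TEXT)"
    )
    con.executemany("INSERT INTO normalized_taxi_trips VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for number_of_trips, borough, zone in top_pickup_locations(SAMPLE):
        print(number_of_trips, borough, zone)
```

Each result row is a tuple of trip count, borough, and zone, ordered from the busiest pickup location down.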

This concludes our introductory tutorial to Bauplan. You should have a grasp of how the platform works and how you can use its core functions, like data branching, running, and querying.

There are many things that we can build using Bauplan. Check out our examples section to learn about more advanced capabilities and how to use our platform inside your stack using our Python SDK. The examples will give you a good understanding of:

  • how to implement specific use cases, like Machine Learning pipelines, complex ETL, and Data Quality workflows

  • how to use Bauplan inside your stack and work with orchestrators, data visualization tools, interactive notebooks, and other computational engines, like Data Warehouses.