Models¶
In Bauplan, models are the core unit of data manipulation. They are declarative functions written in Python or SQL that transform one or more input tables into a single output table. Models are designed to provide a straightforward way to express data transformations without dealing with containerization, data movement and runtime configuration.
Models can be chained together to form pipelines, where downstream models depend on the outputs of upstream ones.
Anatomy of a Bauplan Model¶
A typical Python model consists of two decorators, @bauplan.model() and @bauplan.python(), and a function:
@bauplan.model()
@bauplan.python('3.11')
def my_model(
    data=bauplan.Model(
        'input_table',
        columns=[
            'col_1',
            'col_2',
            'col_3',
        ],
        filter="timestamp >= '2022-12-15T00:00:00-05:00'",
    )
):
    # transformation logic goes here
    ...
    return output_table
Bauplan Models are fully declarative, allowing for explicit column selection and filter pushdown. In practice, this means your code only requires the name of the inputs and, optionally, the desired columns and filters.
For example, you don't need to specify the type of input_table, whether it's an Iceberg Table, PyArrow Table, Pandas DataFrame, or Polars DataFrame. This approach makes the code fully portable, easier to reproduce across environments, and simpler to maintain.
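To make this concrete, here is a minimal sketch (reusing the pandas setup introduced in the next section) in which the same declarative input is simply converted to whatever in-memory representation the function body needs:
@bauplan.model()
@bauplan.python('3.11', pip={'pandas': '2.2.0'})
def my_model(data=bauplan.Model('input_table')):
    import pandas
    # the declaration above stays the same regardless of the
    # representation chosen here; to_pandas() converts the input
    df = data.to_pandas()
    return df[df['col_1'].notnull()]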
Importing Python Packages¶
Bauplan models are fully containerized, with each model running in the cloud as its own isolated environment (similar to Function-as-a-Service frameworks like AWS Lambda). To isolate environments, Bauplan uses an optimized version of Docker containers.
Each model in Bauplan runs in its own isolated Python environment, defined entirely in code through the @bauplan.python() decorator. The decorator specifies the Python interpreter version (e.g., 3.11 or 3.12) and any additional libraries and their versions, installed via pip.
@bauplan.model()
# specify the package and version - in this case Pandas 2.2.0
@bauplan.python('3.11', pip={'pandas': '2.2.0'})
def my_model(data=bauplan.Model('input_table')):
    # import the package declared in the decorator
    import pandas
    # use the package
    df = data.to_pandas()
    df = df[['col_1']]  # keep a single-column DataFrame (not a Series)
    return df
This approach lets you treat each Bauplan model as an independent, fully declarative unit: all you need to run a model deterministically in the cloud is the code itself.
This enhances reproducibility, prevents unintended cross-contamination between environments on the same machine, and provides the freedom to introduce new Python packages without worrying about backward compatibility. In fact, you can run a pipeline where different models use different versions of the same packages and different versions of the Python interpreter.
@bauplan.model()
@bauplan.python('3.11', pip={'pandas': '2.2.0'})
def step_1(data=bauplan.Model('input_table')):
    import pandas
    ...

@bauplan.model()
@bauplan.python('3.10', pip={'pandas': '1.5.3'})
def step_2(data=bauplan.Model('step_1')):
    import pandas
    ...
Materializing Tables with Models¶
To write the output of a model as a table into the data catalog, use the materialization_strategy parameter in the @bauplan.model() decorator. The parameter accepts three values: REPLACE, APPEND, and NONE (the default).
| Strategy | Description | Example |
|---|---|---|
| "REPLACE" | Fully overwrites the table on each run. | Running a pipeline twice on a model that writes 1,000 rows results in a final table with 1,000 rows, as each run replaces the previous table. |
| "APPEND" | Appends new rows to an existing table. | Running a pipeline twice with a model that writes 1,000 rows results in a final table with 2,000 rows, as each run adds data to the existing table. |
| "NONE" | Streams model output in memory as an Arrow table without persisting to object storage. | Running the pipeline does not write a table to the catalog. |
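For example, to persist a model's output and fully rewrite the table on every run, pass the strategy to the decorator. A minimal sketch (the model name and pass-through body are illustrative):
@bauplan.model(materialization_strategy='REPLACE')
@bauplan.python('3.11')
def my_output_table(data=bauplan.Model('input_table')):
    # the returned table is written to the data catalog,
    # replacing any previous version on each run
    return data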
Regardless of the materialization_strategy parameter, running models with the dry-run flag (bauplan run --dry-run) forces the system to run in memory (see using --dry-run in the tutorial).
Using SQL¶
Many transformations, especially filtering, joining, and basic aggregations, are easier in SQL than in Python.
While Bauplan supports SQL models directly, the recommended best practice today for most use cases is to embed your SQL in a Python function using DuckDB.
@bauplan.model()
@bauplan.python("3.11", pip={'duckdb': '1.2.0'})
def my_model(
    data=bauplan.Model(
        "input_table",
        columns=["col_1", "col_2", "col_3"],
        filter="timestamp >= '2022-12-15T00:00:00-05:00'",
    )
):
    import duckdb
    # DuckDB's replacement scan exposes the Arrow input as a
    # SQL-accessible table under the local variable name `data`
    output_table = duckdb.sql("""
        SELECT col_1, COUNT(*) AS count
        FROM data
        GROUP BY col_1
    """).arrow()
    return output_table
Best Practices¶
- Input Declaration
  - Specify required columns explicitly.
  - Use filter pushdown whenever feasible for efficiency.
- Output Docstrings
  - Specify the shape of a model's output table in its docstring.
@bauplan.model()
@bauplan.python("3.11")
def my_model(
    data=bauplan.Model(
        "input_table",
        columns=["col_1", "col_2", "col_3"],
        filter="timestamp >= '2022-12-15T00:00:00-05:00'",
    )
):
    """
    This model aggregates and counts the data by col_1.
    The output table will look like this:

    | col_1 | count |
    |-------|-------|
    | A     | 100   |
    """
- Model Environment Management
  - Include only minimal dependencies.
  - Pin specific dependency versions for consistency.
- Code Organization
  - Group models that form a pipeline within a single file.
  - Separate function bodies into external modules and call them from within the Bauplan models (see the sketch after this list). This keeps business logic neatly separated from the DAG and environment declarations, making refactoring easier and the code more future-proof.
- Testing and Development
  - Use the bauplan run --dry-run flag for quick iteration.
  - Employ print statements for real-time terminal feedback during development.
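As a sketch of the Code Organization points above, assuming a hypothetical helper module transformations.py shipped in the same folder as the models file:
# transformations.py -- plain business logic, no Bauplan imports
def clean_and_count(df):
    # drop incomplete rows, then count occurrences per col_1
    return df.dropna().groupby('col_1').size().reset_index(name='count')

# models.py -- only the DAG and environment declarations live here
import bauplan
from transformations import clean_and_count

@bauplan.model()
@bauplan.python('3.11', pip={'pandas': '2.2.0'})
def my_model(data=bauplan.Model('input_table')):
    import pandas
    return clean_and_count(data.to_pandas())
Because clean_and_count is an ordinary function operating on a DataFrame, it can be unit-tested locally without running the pipeline.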