Models¶
A model in bauplan is a function that takes tabular data as input and outputs tabular data. Models are the foundational building blocks of bauplan pipelines, enabling both simple transformations and complex data processing workflows.
bauplan models provide a straightforward way to express data transformations in either pure Python or SQL without dealing with containerization and data movement.
Models are Python functions decorated with `@bauplan.model()` and `@bauplan.python()`:
```python
@bauplan.model()
@bauplan.python('3.11')
def my_model(
    data=bauplan.Model(
        'input_table',
        columns=[
            'col_1',
            'col_2',
            'col_3',
        ],
        filter="timestamp >= '2022-12-15T00:00:00-05:00'",
    )
):
    ...
    return output_table
```
Models in bauplan are fully declarative, allowing for explicit column selection and filter pushdown. In practice, this means our code only requires the name of the inputs and, optionally, the desired columns and filters.
For example, we don't need to specify the type of `input_table`: whether it's an Iceberg Table, PyArrow Table, Pandas DataFrame, or Polars DataFrame. This approach makes the code fully portable, easier to reproduce across environments, and simpler to maintain.
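Conceptually, the `columns` and `filter` declarations above mean that only the requested columns and matching rows ever reach the model. A hedged sketch of that behavior in plain pandas (the toy rows are hypothetical; the column names and timestamp mirror the example above):

```python
import pandas as pd

# Toy stand-in for 'input_table' (hypothetical rows).
raw = pd.DataFrame({
    'col_1': [1, 2],
    'col_2': ['a', 'b'],
    'col_3': [0.1, 0.2],
    'timestamp': pd.to_datetime(
        ['2022-12-14T00:00:00-05:00', '2022-12-16T00:00:00-05:00']
    ),
    'unused_col': [True, False],
})

# What columns=[...] and filter=... amount to: a projection and a
# predicate applied at scan time, before the model body runs.
cutoff = pd.Timestamp('2022-12-15T00:00:00-05:00')
data = raw.loc[raw['timestamp'] >= cutoff, ['col_1', 'col_2', 'col_3']]
```

Because the projection and predicate are pushed down to the scan, the model never pays to move or hold the columns and rows it does not need.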
Building Pipelines¶
Models are chained together to form pipelines, which take the form of Directed Acyclic Graphs (DAGs). To chain models into a DAG, simply pass a previous bauplan model as an input.
```python
@bauplan.model()
@bauplan.python('3.11')
def step_1(data=bauplan.Model('input_table')):
    ...

@bauplan.model()
@bauplan.python('3.11')
def step_2(data=bauplan.Model('step_1')):
    ...

@bauplan.model()
@bauplan.python('3.11')
def step_3(data=bauplan.Model('step_2')):
    ...
```
Note that models can take multiple tabular inputs but can only return a single tabular output.
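A model with multiple inputs simply declares several `bauplan.Model` parameters in its signature, e.g. `def joined(orders=bauplan.Model('step_1'), users=bauplan.Model('step_2'))`, and must still return one table. A hedged sketch of the body logic such a two-input model might run, written as a plain pandas function (the model names, key `user_id`, and sample rows are hypothetical):

```python
import pandas as pd

# Body logic for a two-input model: combine both upstream tables
# into the single tabular output the model must return.
def join_inputs(orders: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    # inner join on a shared key produces one output table
    return orders.merge(users, on='user_id', how='inner')

# Toy upstream tables standing in for the two inputs.
orders = pd.DataFrame({'order_id': [1, 2], 'user_id': [10, 20]})
users = pd.DataFrame({'user_id': [10, 30], 'name': ['ada', 'bob']})
result = join_inputs(orders, users)
```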
Importing Python Packages¶
bauplan models are fully containerized, with each model running in the cloud as its own isolated environment - similar to FaaS frameworks like AWS Lambda. To isolate environments, bauplan uses a customized version of Docker containers.
Model environments are expressed entirely in code using the decorator `@bauplan.python()`. This decorator specifies the Python interpreter version (e.g., `'3.11'` or `'3.12'`) and any additional libraries and their versions using `pip`.
```python
@bauplan.model()
# specify the package and version - in this case Pandas 2.2.0
@bauplan.python('3.11', pip={'pandas': '2.2.0'})
def my_model(data=bauplan.Model('input_table')):
    # import the package declared in the decorator
    import pandas
    # use the package
    df = data.to_pandas()
    # keep a single column (double brackets keep the result tabular)
    df = df[['col_1']]
    return df
```
This approach allows us to run each bauplan model as an independent unit in a fully declarative way - all we need to run a bauplan model deterministically in the cloud is the code.
This enhances reproducibility, prevents unintended cross-contamination between environments on the same machine, and provides the freedom to introduce new Python packages without worrying about backward compatibility. In fact, we can run a pipeline where different models use different versions of the same packages and different interpreters, as shown below:
```python
@bauplan.model()
@bauplan.python('3.11', pip={'pandas': '2.2.0'})
def step_1(data=bauplan.Model('input_table')):
    import pandas
    ...

@bauplan.model()
@bauplan.python('3.10', pip={'pandas': '1.5.3'})
def step_2(data=bauplan.Model('step_1')):
    import pandas
    ...
```
Writing Tables into Object Storage¶
The `materialization_strategy` parameter controls how model output is stored in object storage as an Iceberg table. This parameter accepts three values: `REPLACE`, `APPEND`, and `NONE`.
- `REPLACE`: Writes a new Iceberg table. If a table with the same name exists on the target branch, it will be dropped and reconstructed. Example: running a DAG twice on a model that writes 1,000 rows will result in a final table with 1,000 rows, as each run replaces the previous table.
- `APPEND`: Appends output to an existing table. Example: running a DAG twice with a model that writes 1,000 rows results in a final table with 2,000 rows, as each run adds data to the existing table.
- `NONE`: Streams model output in memory as an Arrow table without persisting to object storage.
```python
@bauplan.model(materialization_strategy='NONE')
@bauplan.python('3.11')
def my_model(
    data=bauplan.Model(
        'input_table',
    )
):
    ...
    return output_table
```
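The row-count arithmetic in the `REPLACE` and `APPEND` examples above can be sketched as a toy simulation. This is plain Python modeling the semantics, not the bauplan runtime:

```python
def final_row_count(strategy: str, rows_per_run: int, n_runs: int) -> int:
    """Toy model of the final table size under each strategy."""
    table_rows = 0
    for _ in range(n_runs):
        if strategy == 'REPLACE':
            table_rows = rows_per_run      # drop and rebuild the table
        elif strategy == 'APPEND':
            table_rows += rows_per_run     # add rows to the existing table
        elif strategy == 'NONE':
            pass                           # nothing is persisted
    return table_rows

print(final_row_count('REPLACE', 1000, 2))  # 1000
print(final_row_count('APPEND', 1000, 2))   # 2000
```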
Regardless of the `materialization_strategy` parameter, the `--dry-run` flag will force the system to run in memory.
See Using `--dry-run` for more information about dry run mode.
Best Practices¶
- Input Declaration:
  - Specify required columns explicitly
  - Use filter pushdown whenever feasible for efficiency
- Environment Management:
  - Pin specific dependency versions for consistency
  - Include only minimal dependencies
  - Document environment requirements within docstrings
- Code Organization:
  - Group models that form a pipeline within a single file
  - Separate function bodies into external modules and call them within the bauplan models. This keeps business logic neatly separated from the DAG and environment declarations, making refactoring easier and the code future-proof.
- Testing and Development:
  - Use the `--dry-run` flag for quick iteration
  - Use print statements for real-time terminal feedback during development