bauplan.decorators module

Bauplan functions are normal Python functions enriched by a few key decorators. This module contains the decorators used to define Bauplan models, expectations and Python environments, with examples of how to use them.

bauplan.decorators.expectation(**kwargs: Any) → Callable

An expectation is a function from one (or more) dataframe-like object(s) to a boolean. It is commonly used for data validation and data quality checks when running a pipeline. Expectations take as input the table(s) they validate and return a boolean indicating whether the expectation is met. A Python expectation needs a Python environment to run, which is defined using the python decorator, e.g.:

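import bauplan
# expect_column_no_nulls is assumed here to come from bauplan.standard_expectations
from bauplan.standard_expectations import expect_column_no_nulls

@bauplan.expectation()
@bauplan.python('3.10')
def test_joined_dataset(
    data=bauplan.Model(
        'join_dataset',
        columns=['anomaly']
    )
):
    # your data validation code here
    return expect_column_no_nulls(data, 'anomaly')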

Parameters:

f – The function to decorate.
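
The contract described above (table in, boolean out) can be illustrated without Bauplan at all. The following is a minimal sketch, assuming a plain dict of lists as a stand-in for a dataframe-like object; `Table` and `column_has_no_nulls` are hypothetical names for illustration and are not Bauplan's own `expect_column_no_nulls`.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a dataframe-like object: column name -> values.
Table = Dict[str, List[Any]]


def column_has_no_nulls(table: Table, column: str) -> bool:
    """Return True only if every value in the given column is non-null."""
    return all(value is not None for value in table[column])


# Two tiny tables: one that meets the expectation and one that does not.
good = {'anomaly': [0, 1, 0]}
bad = {'anomaly': [0, None, 1]}

good_result = column_has_no_nulls(good, 'anomaly')
bad_result = column_has_no_nulls(bad, 'anomaly')
```

In a real pipeline the same boolean-returning logic would sit inside a function decorated with the expectation and python decorators, operating on the table passed in by Bauplan.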

bauplan.decorators.model(name: str | None = None, columns: List[str] | None = None, materialize: bool | None = None, internet_access: bool | None = None, **kwargs: Any) → Callable

A model is a function from one (or more) dataframe-like object(s) to another dataframe-like object: it is used to define a transformation in a pipeline. Models are chained together implicitly by using them as inputs to their children. A Python model needs a Python environment to run, which is defined using the python decorator, e.g.:

import bauplan

@bauplan.model(
    columns=['*'],
    materialize=False
)
@bauplan.python('3.11')
def source_scan(
    data=bauplan.Model(
        'iot_kaggle',
        columns=['*'],
        filter="motion='false'"
    )
):
    # your code here
    return data

Parameters:

name – the name of the model (e.g. 'users'); if omitted, the function name is used.
columns – the columns of the output dataframe after the model runs (e.g. ['id', 'name', 'email']). Use ['*'] as a wildcard.
materialize – whether the model should be materialized.
internet_access – whether the model requires internet access.
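
To make the idea of implicit chaining concrete, here is a conceptual sketch of how a run order can be derived purely from the inputs each function declares. It is not Bauplan's implementation: `Ref`, `parents_of` and `run_order` are hypothetical names, with `Ref` standing in for a model reference such as `bauplan.Model('name')`.

```python
import inspect
from graphlib import TopologicalSorter


class Ref:
    """Hypothetical stand-in for a reference to another model by name."""

    def __init__(self, name):
        self.name = name


def parents_of(fn):
    """Return the names of the models a function declares as inputs."""
    params = inspect.signature(fn).parameters.values()
    return [p.default.name for p in params if isinstance(p.default, Ref)]


def run_order(functions):
    """Order names so that each model comes after the models it reads from."""
    graph = {fn.__name__: parents_of(fn) for fn in functions}
    return list(TopologicalSorter(graph).static_order())


def clean(data=Ref('raw')):
    return data


def report(data=Ref('clean')):
    return data


# The functions are listed out of order on purpose; the declared inputs
# alone determine that 'raw' precedes 'clean', which precedes 'report'.
order = run_order([report, clean])
```

Nothing links `clean` to `report` except the name used in the child's default argument, which is the sense in which models are chained implicitly.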

bauplan.decorators.pyspark(version: str | None = None, conf: Dict[str, str] | None = None, **kwargs: Any) → Callable

Makes a PySpark session available to the function. The decorated model function should declare a spark=None parameter among its arguments.

bauplan.decorators.python(version: str | None = None, pip: Dict[str, str] | None = None, **kwargs: Any) → Callable

Define a Python environment for a Bauplan function (e.g. a model or expectation). It is used to specify directly in code the configuration of the Python environment required to run the function, i.e. the Python version and the Python packages required.

Parameters:

version – The python version for the interpreter (e.g. '3.11').
pip – A dictionary of dependencies (and versions) required by the function (e.g. {'requests': '2.26.0'}).
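
The signature above follows the decorator-factory pattern: calling it with a version and a pip dictionary returns a callable that is then applied to the function. The sketch below shows that pattern in isolation. `python_env` and `pinned_requirements` are hypothetical toy functions, not Bauplan code, and rendering the dictionary as `name==version` pins is an assumption made only to show how such a mapping reads in pip requirement syntax.

```python
from typing import Callable, Dict, List, Optional


def python_env(version: Optional[str] = None,
               pip: Optional[Dict[str, str]] = None) -> Callable:
    """Toy decorator factory (hypothetical): records an environment spec."""
    def decorate(fn: Callable) -> Callable:
        # Attach the declared environment to the function as plain metadata.
        fn.env = {'version': version, 'pip': dict(pip or {})}
        return fn
    return decorate


def pinned_requirements(pip: Dict[str, str]) -> List[str]:
    """Render a {package: version} mapping as pip-style exact pins."""
    return [f'{name}=={ver}' for name, ver in sorted(pip.items())]


@python_env('3.11', pip={'requests': '2.26.0'})
def fetch_rows():
    return []


requirements = pinned_requirements(fetch_rows.env['pip'])
```

Keeping the environment next to the function definition means the interpreter version and package versions travel with the code that depends on them.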

Define the resources required by a Bauplan function (e.g. a model or expectation). It is used to specify directly in code the configuration of the resources required to run the function.

Parameters:

cpus – The number of CPUs required by the function (e.g. `0.5`).
memory – The amount of memory required by the function (e.g. `1G`, `1000`).
memory_swap – The amount of swap memory required by the function (e.g. `1G`, `1000`).
timeout – The maximum time the function is allowed to run (e.g. `60`).