Partitioning

Overview

Bauplan exposes Iceberg’s partitioning capabilities through a declarative interface: you state how a table should be partitioned, and the platform handles partition management for you. This approach builds on Iceberg’s hidden partitioning, in which partition values are derived from column values, so queries filter on the original columns instead of on separate partition columns.

Key features of Bauplan’s partitioning support:

  • Declarative Definition: Specify partitioning when creating tables or generating data assets

  • Automatic Management: Writing to and reading from partitioned tables are handled by the system

  • Multiple Partition Types:
    • Time-based partitioning (year, day, hour)

    • Bucket partitioning (e.g., by client_id)
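
To make the time-based transforms concrete, here is a minimal sketch of how the Iceberg specification defines year, day, and hour: each maps a timestamp to a whole number of units since 1970-01-01 UTC. The function names are hypothetical and this is not Bauplan code. Bucket partitioning is left out because it depends on the hash function used by the engine.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def year_transform(ts: datetime) -> int:
    """Whole years since 1970."""
    return ts.year - 1970

def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01 (UTC)."""
    return (ts - EPOCH).days

def hour_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01T00:00:00 (UTC)."""
    return int((ts - EPOCH).total_seconds() // 3600)

# Two pickups in the same hour land in the same hour partition.
first = datetime(2022, 12, 23, 14, 5, tzinfo=timezone.utc)
second = datetime(2022, 12, 23, 14, 55, tzinfo=timezone.utc)

print(year_transform(first), day_transform(first), hour_transform(first))
print(hour_transform(first) == hour_transform(second))
```

The coarser the transform, the fewer and larger the partitions: every row from 2022 shares one year value, while each hour of pickups gets its own hour value.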

Note

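Partitioning can increase write times to object storage, because data must be organized by partition value as it is written. In return, it significantly improves read performance for queries that filter on the partitioned columns, since files whose partitions cannot match the filter are skipped.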
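
The trade-off in this note can be illustrated with a toy simulation. It is only a sketch of the idea, not how Bauplan or Iceberg is implemented: "files" are in-memory lists, and write_partitioned and scan are hypothetical names. The write step does extra work to group rows by partition value, and the read step then opens only the group that matches the filter.

```python
from collections import defaultdict
from datetime import date

rows = [
    {"pickup_day": date(2022, 12, 23), "trip_miles": 1.2},
    {"pickup_day": date(2022, 12, 23), "trip_miles": 4.0},
    {"pickup_day": date(2022, 12, 24), "trip_miles": 2.5},
    {"pickup_day": date(2022, 12, 25), "trip_miles": 7.1},
]

def write_partitioned(rows, key):
    """Group rows into one 'file' per partition value (the extra write work)."""
    files = defaultdict(list)
    for row in rows:
        files[row[key]].append(row)
    return dict(files)

def scan(files, wanted):
    """Read only the files whose partition value matches the filter."""
    matches, files_read = [], 0
    for partition_value, file_rows in files.items():
        if partition_value != wanted:
            continue  # pruned: this file is never opened
        files_read += 1
        matches.extend(file_rows)
    return matches, files_read

files = write_partitioned(rows, "pickup_day")
matches, files_read = scan(files, date(2022, 12, 23))
print(len(files), files_read, len(matches))  # 3 1 2
```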

Importing Partitioned Data

When importing data into Bauplan, you can specify partitioning in two ways:

1. Direct Table Creation

Use the create_table method with the partitioned_by parameter:

import bauplan

client = bauplan.Client()

# Create a partitioned table
table = client.create_table(
    table='my_partitioned_table',
    search_uri='s3://your-bucket/data/*.parquet',
    partitioned_by="hour(tpep_pickup_datetime), PULocationID",
    branch='my_branch'
)

The equivalent CLI command is:

bauplan table create --name my_partitioned_table \
    --partitioned-by "hour(tpep_pickup_datetime), PULocationID" \
    --search-uri 's3://your-bucket/data/*.parquet'
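
Both forms above take the same comma-separated partition expression. As a sketch of how such a string decomposes, the helper below splits it into (transform, argument) pairs, treating a bare column name as an identity partition. parse_partition_spec is a hypothetical helper for illustration, for example to sanity-check an expression before submitting it. It is not part of the Bauplan SDK and says nothing about how Bauplan parses the value internally.

```python
import re

# Matches terms of the form name(args), e.g. hour(tpep_pickup_datetime).
_TERM = re.compile(r"^(?P<name>\w+)\((?P<args>[^()]*)\)$")

def split_top_level(spec: str) -> list[str]:
    """Split on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in spec:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]

def parse_partition_spec(spec: str) -> list[tuple[str, str]]:
    """Return (transform, argument) pairs; bare columns map to 'identity'."""
    parsed = []
    for term in split_top_level(spec):
        match = _TERM.match(term)
        if match:
            parsed.append((match["name"], match["args"].strip()))
        else:
            parsed.append(("identity", term))
    return parsed

print(parse_partition_spec("hour(tpep_pickup_datetime), PULocationID"))
# [('hour', 'tpep_pickup_datetime'), ('identity', 'PULocationID')]
```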

2. Using Import Plans

For more complex scenarios, particularly when the schema must be modified alongside partitioning, generate an import plan, edit it, and then apply it:

import bauplan

client = bauplan.Client()

# Generate import plan
plan_state = client.plan_table_creation(
    table='my_partitioned_table',
    search_uri='s3://your-data/*.parquet',
    branch='my_branch'
)

# Modify plan to add partitioning
plan = plan_state.plan
plan['schema_info']['partitions'] = [
    {
        'from_column_name': 'datetime_column',
        'transform': {'name': 'year'}
    }
]

# Apply the modified plan
client.apply_table_creation_plan(plan)
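
Because the plan is a plain dictionary, the edit step can be wrapped in a small helper. The sketch below assumes only the structure shown above (schema_info, partitions, from_column_name, transform). with_partition is a hypothetical function, not an SDK call, and the stand-in plan is made up for the demonstration. It returns a modified copy, so the original plan stays available for comparison.

```python
from copy import deepcopy

def with_partition(plan: dict, column: str, transform: str) -> dict:
    """Return a copy of the plan with one more partition entry appended."""
    updated = deepcopy(plan)
    schema_info = updated.setdefault("schema_info", {})
    # Tolerate a missing or empty 'partitions' entry.
    partitions = list(schema_info.get("partitions") or [])
    partitions.append(
        {"from_column_name": column, "transform": {"name": transform}}
    )
    schema_info["partitions"] = partitions
    return updated

# Stand-in for plan_state.plan; a real plan contains more fields.
plan = {"schema_info": {"partitions": []}}
new_plan = with_partition(plan, "datetime_column", "year")

print(new_plan["schema_info"]["partitions"])
print(plan["schema_info"]["partitions"])  # the original is unchanged
```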

Creating Partitioned Tables in Pipelines

You can create partitioned tables directly in your data pipelines using either Python or SQL models.

Python Models

Use the partitioned_by parameter in the @bauplan.model decorator:

import bauplan

@bauplan.python('3.11')
@bauplan.model(
    partitioned_by=['day(pickup_datetime)', 'PULocationID'],
    materialization_strategy='REPLACE'
)
def create_partitioned_table(
    data=bauplan.Model(
        'taxi_fhvhv',
        columns=[
            'PULocationID',
            'trip_miles',
            'tips',
            'pickup_datetime'
        ],
        filter="pickup_datetime >= '2022-12-23T00:00:00-05:00'"
    )
):
    """Creates a partitioned table from taxi data"""
    return data
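
With two partition expressions, each distinct combination of values becomes its own partition. The sketch below counts the combinations for a handful of made-up rows, using the calendar date as a stand-in for day(pickup_datetime). The sample data and partition_key are illustrative assumptions, not Bauplan code.

```python
from datetime import datetime

# (pickup_datetime, PULocationID) pairs, invented for this example.
rows = [
    (datetime(2022, 12, 23, 8, 15), 132),
    (datetime(2022, 12, 23, 19, 40), 132),
    (datetime(2022, 12, 23, 9, 5), 48),
    (datetime(2022, 12, 24, 7, 30), 132),
]

def partition_key(pickup_datetime: datetime, pu_location_id: int) -> tuple:
    """Composite key: calendar day plus the raw location ID."""
    return (pickup_datetime.date(), pu_location_id)

partitions = {partition_key(ts, loc) for ts, loc in rows}
print(len(partitions))  # 3
```

The number of partitions can grow up to the product of the distinct days and the distinct location IDs, which is worth keeping in mind when combining a fine-grained time transform with a high-cardinality column.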

SQL Models

Add partitioning information using SQL comments:

-- bauplan: materialization_strategy=REPLACE
-- bauplan: partitioned-by="day(pickup_datetime), PULocationID"

SELECT
    PULocationID,
    trip_miles,
    tips,
    pickup_datetime
FROM taxi_fhvhv
WHERE
    pickup_datetime >= '2022-12-25T00:00:00-05:00'
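
The Python decorator takes the partition expressions as a list, while the SQL comment takes them as one comma-separated string. As a small sketch of that correspondence, the hypothetical helper below renders the two comment lines shown above from a list. It is not part of Bauplan, and it assumes the comment format is exactly as written in this section.

```python
def sql_model_header(partitioned_by, materialization_strategy="REPLACE"):
    """Render the '-- bauplan:' comment lines for a SQL model."""
    spec = ", ".join(partitioned_by)
    return "\n".join([
        f"-- bauplan: materialization_strategy={materialization_strategy}",
        f'-- bauplan: partitioned-by="{spec}"',
    ])

print(sql_model_header(["day(pickup_datetime)", "PULocationID"]))
```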