Partitioning
Overview
Bauplan exposes Iceberg's partitioning capabilities through a declarative interface for creating and managing partitioned tables. It builds on Iceberg's hidden partitioning and abstracts away the details of partition management.
Key features of Bauplan's partitioning support:
- Declarative Definition: Specify partitioning when creating tables or generating data assets
- Automatic Management: Writing to and reading from partitioned tables are handled by the system
- Multiple Partition Types:
  - Time-based partitioning (year, day, hour)
  - Bucket partitioning (e.g., by client ID)
While partitioning can increase write times to object storage, it significantly improves read performance when filtering by partitioned columns.
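To make the read-side benefit concrete, here is a minimal, self-contained sketch of how time-based partition values can be derived and used to skip data. It follows the Iceberg definitions of the year, day, and hour transforms (counts since the Unix epoch); the in-memory `partitions` dictionary and the `scan` helper are illustrative stand-ins, not part of Bauplan or Iceberg.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_transform(ts: datetime) -> int:
    """Years since 1970, as in Iceberg's `year` transform."""
    return ts.year - 1970


def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01, as in Iceberg's `day` transform."""
    return (ts - EPOCH) // timedelta(days=1)


def hour_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01T00:00, as in Iceberg's `hour` transform."""
    return (ts - EPOCH) // timedelta(hours=1)


# Toy rows standing in for a table; the values are made up for illustration.
rows = [
    {"pickup_datetime": datetime(2022, 12, 23, 8, 15, tzinfo=timezone.utc), "tips": 2.0},
    {"pickup_datetime": datetime(2022, 12, 23, 21, 40, tzinfo=timezone.utc), "tips": 0.0},
    {"pickup_datetime": datetime(2022, 12, 25, 9, 5, tzinfo=timezone.utc), "tips": 5.5},
]

# "Write" path: group rows by their day partition value.
partitions = defaultdict(list)
for row in rows:
    partitions[day_transform(row["pickup_datetime"])].append(row)


def scan(partitions, start: datetime, end: datetime):
    """Read only the partitions whose day value overlaps [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    touched = [key for key in partitions if lo <= key <= hi]
    matched = [
        row
        for key in touched
        for row in partitions[key]
        if start <= row["pickup_datetime"] <= end
    ]
    return touched, matched


touched, matched = scan(
    partitions,
    datetime(2022, 12, 25, tzinfo=timezone.utc),
    datetime(2022, 12, 25, 23, 59, tzinfo=timezone.utc),
)
print(f"read {len(touched)} of {len(partitions)} partitions, {len(matched)} matching rows")
```

The filter is expressed on the original timestamp column; the partition value is derived from it, which is the idea behind hidden partitioning.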
Importing Partitioned Data
When importing data into Bauplan, you can specify partitioning in two ways:
1. Direct Table Creation
Use the create_table method with the partitioned_by parameter:
import bauplan

client = bauplan.Client()

# Create a partitioned table
table = client.create_table(
    table='my_partitioned_table',
    search_uri='s3://your-bucket/data/*.parquet',
    partitioned_by="hour(tpep_pickup_datetime), PULocationID",
    branch='my_branch'
)
The equivalent CLI command is:
bauplan table create --name my_partitioned_table \
    --partitioned-by "hour(tpep_pickup_datetime), PULocationID" \
    --search-uri 's3://your-bucket/data/*.parquet'
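The partitioning expression in both forms is a comma-separated list in which each item is either a bare column name or a transform applied to a column. The sketch below shows one way to read such a string. It is an illustration only: `parse_partition_spec` is a hypothetical helper, not part of the Bauplan SDK, and labeling bare columns as `identity` borrows Iceberg's name for partitioning by the raw column value. It handles single-argument transforms only.

```python
import re

# Matches items of the form transform(column), e.g. hour(tpep_pickup_datetime).
_TRANSFORM = re.compile(r"^(?P<name>\w+)\((?P<column>\w+)\)$")


def parse_partition_spec(spec: str) -> list[tuple[str, str]]:
    """Split a partitioning expression into (transform, column) pairs.

    Hypothetical helper for illustration; Bauplan's own parsing may differ.
    """
    fields = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        match = _TRANSFORM.match(part)
        if match:
            fields.append((match["name"], match["column"]))
        else:
            # Bare column: partition by the column value itself.
            fields.append(("identity", part))
    return fields


print(parse_partition_spec("hour(tpep_pickup_datetime), PULocationID"))
```

Read this way, the example above partitions first by the hour of tpep_pickup_datetime and then by the value of PULocationID.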
2. Using Import Plans
For more complex scenarios, particularly when schema modifications are needed alongside partitioning, generate an import plan, modify it, and then apply it:
import bauplan

client = bauplan.Client()

# Generate import plan
plan_state = client.plan_table_creation(
    table='my_partitioned_table',
    search_uri='s3://your-data/*.parquet',
    branch='my_branch'
)

# Modify plan to add partitioning
plan = plan_state.plan
plan['schema_info']['partitions'] = [
    {
        'from_column_name': 'datetime_column',
        'transform': {'name': 'year'}
    }
]

# Apply the modified plan
client.apply_table_creation_plan(plan)
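When several columns need partitioning, building the entries by hand gets repetitive. The following is a minimal sketch of a helper that produces entries in the same shape as the example above and returns a modified copy of the plan. `with_time_partitions` is a hypothetical name, not a Bauplan function, and the set of accepted transform names is an assumption taken from the time-based transforms listed in the overview.

```python
import copy

# Assumption: limited to the time-based transforms named in the overview.
ALLOWED_TRANSFORMS = {"year", "day", "hour"}


def with_time_partitions(plan: dict, partitions: list[tuple[str, str]]) -> dict:
    """Return a copy of `plan` with partition entries set.

    `partitions` is a list of (column_name, transform_name) pairs.
    Hypothetical helper; only the entry shape comes from the example above.
    """
    entries = []
    for column, transform in partitions:
        if transform not in ALLOWED_TRANSFORMS:
            raise ValueError(f"unsupported transform: {transform!r}")
        entries.append(
            {"from_column_name": column, "transform": {"name": transform}}
        )
    new_plan = copy.deepcopy(plan)  # leave the caller's plan untouched
    new_plan.setdefault("schema_info", {})["partitions"] = entries
    return new_plan


# A stand-in dict for a generated plan; a real plan carries more fields.
example_plan = {"schema_info": {"partitions": []}}
updated = with_time_partitions(example_plan, [("datetime_column", "year")])
print(updated["schema_info"]["partitions"])
```

The returned dictionary can then be passed to apply_table_creation_plan in place of the hand-edited plan.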
Creating Partitioned Tables in Pipelines
You can create partitioned tables directly in your data pipelines using either Python or SQL models.
Python Models
Use the partitioned_by parameter in the @bauplan.model decorator:
import bauplan

@bauplan.python('3.11')
@bauplan.model(
    partitioned_by=['day(pickup_datetime)', 'PULocationID'],
    materialization_strategy='REPLACE'
)
def create_partitioned_table(
    data=bauplan.Model(
        'taxi_fhvhv',
        columns=[
            'PULocationID',
            'trip_miles',
            'tips',
            'pickup_datetime'
        ],
        filter="pickup_datetime >= '2022-12-23T00:00:00-05:00'"
    )
):
    """Creates a partitioned table from taxi data"""
    return data
SQL Models
Add partitioning information using SQL comments:
-- bauplan: materialization_strategy=REPLACE
-- bauplan: partitioned-by="day(pickup_datetime), PULocationID"
SELECT
    PULocationID,
    trip_miles,
    tips,
    pickup_datetime
FROM taxi_fhvhv
WHERE
    pickup_datetime >= '2022-12-25T00:00:00-05:00'