Partitioning
Overview
Bauplan exposes Apache Iceberg's partitioning through a declarative interface for creating and managing partitioned tables. It builds on Iceberg's hidden partitioning, in which partition values are derived from ordinary column values, so queries filter on the original columns and the partition layout is managed for you.
Key features of Bauplan's partitioning support:
- Declarative Definition: Specify partitioning when creating tables or generating data assets
- Automatic Management: Writing to and reading from partitioned tables is handled by the system
- Multiple Partition Types:
  - Time-based partitioning (year, day, hour)
  - Bucket partitioning (e.g., by client_id)
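To make the partition types above concrete, here is a minimal sketch of how time-based and bucket transforms turn a column value into a partition value. It follows the general shape of Iceberg's transforms (years, days, or hours since the Unix epoch; hash modulo bucket count), but it is an illustration and not Bauplan's or Iceberg's implementation. In particular, Iceberg specifies a 32-bit Murmur3 hash for buckets, and this sketch swaps in `zlib.crc32` to stay dependency-free, so the bucket numbers will differ from a real table. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
import zlib

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_transform(ts: datetime) -> int:
    """Years since 1970 for a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).year - 1970


def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01 UTC."""
    return (ts.astimezone(timezone.utc) - EPOCH) // timedelta(days=1)


def hour_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01T00:00 UTC."""
    return (ts.astimezone(timezone.utc) - EPOCH) // timedelta(hours=1)


def bucket_transform(value: str, num_buckets: int) -> int:
    """Stand-in bucket transform: crc32 replaces Iceberg's Murmur3 hash."""
    hashed = zlib.crc32(value.encode("utf-8")) & 0x7FFFFFFF
    return hashed % num_buckets


ts = datetime(2022, 12, 23, 5, 30, tzinfo=timezone.utc)
print(year_transform(ts))   # 52
print(day_transform(ts))    # 19349
print(hour_transform(ts))   # 464381
print(bucket_transform("client-42", 16))  # some value in range(16)
```

Coarser transforms (year) produce fewer, larger partitions; finer ones (hour) produce many small partitions. Bucketing spreads a high-cardinality column such as client_id over a fixed number of partitions.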
Note
Partitioning can increase write times to object storage, but it can significantly improve read performance for queries that filter on the partition columns.
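The read-side benefit comes from skipping data that cannot match a filter. The toy sketch below imitates this with plain Python: rows are grouped into one in-memory "file" per day, and a read for a single day opens only the matching group. The rows, function names, and one-file-per-partition layout are all assumptions made for illustration; a real table keeps this information in Iceberg metadata.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical rows: (pickup_datetime, PULocationID, trip_miles)
rows = [
    (datetime(2022, 12, 23, 8), 132, 4.2),
    (datetime(2022, 12, 23, 9), 138, 1.1),
    (datetime(2022, 12, 24, 8), 132, 7.5),
    (datetime(2022, 12, 25, 18), 132, 2.0),
]


def write_partitioned(rows):
    """Group rows into one 'file' per day, mimicking day(pickup_datetime)."""
    files = defaultdict(list)
    for row in rows:
        files[row[0].date()].append(row)
    return dict(files)


def read_day(files, day: date):
    """Return matching rows and how many files were opened vs. skipped."""
    opened = [d for d in files if d == day]
    matched = [row for d in opened for row in files[d]]
    return matched, len(opened), len(files) - len(opened)


files = write_partitioned(rows)
matched, opened, skipped = read_day(files, date(2022, 12, 23))
print(f"rows={len(matched)} files_opened={opened} files_skipped={skipped}")
# rows=2 files_opened=1 files_skipped=2
```

The write step does extra work (grouping), which mirrors the write-time cost mentioned in the note, while the read step touches one group out of three.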
Importing Partitioned Data
When importing data into Bauplan, you can specify partitioning in two ways:
1. Direct Table Creation
Use the create_table method with the partitioned_by parameter:
import bauplan

client = bauplan.Client()

# Create a partitioned table
table = client.create_table(
    table='my_partitioned_table',
    search_uri='s3://your-bucket/data/*.parquet',
    partitioned_by="hour(tpep_pickup_datetime), PULocationID",
    branch='my_branch'
)
The equivalent CLI command is:
bauplan table create --name my_partitioned_table \
    --partitioned-by "hour(tpep_pickup_datetime), PULocationID" \
    --search-uri 's3://your-bucket/data/*.parquet'
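The partitioned_by value used in both examples is a comma-separated list in which each entry is either a bare column name or a transform applied to a column. As a rough sketch of how such a string can be read, the hypothetical helper below splits it into (transform, column) pairs, labeling a bare column as an identity partition (the Iceberg term for partitioning on the raw value). This is not Bauplan's parser: it only handles the single-argument forms shown on this page and would need changes for transforms that take extra arguments.

```python
import re
from typing import NamedTuple


class PartitionField(NamedTuple):
    transform: str  # 'identity' when the column is used as-is
    column: str


# Matches either `transform(column)` or a bare `column`.
_FIELD = re.compile(r"^(?:(?P<transform>\w+)\((?P<column>\w+)\)|(?P<plain>\w+))$")


def parse_partitioned_by(spec: str) -> list[PartitionField]:
    """Split a partitioned_by string into (transform, column) pairs."""
    fields = []
    for part in spec.split(","):
        match = _FIELD.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized partition field: {part.strip()!r}")
        if match.group("plain"):
            fields.append(PartitionField("identity", match.group("plain")))
        else:
            fields.append(
                PartitionField(match.group("transform"), match.group("column"))
            )
    return fields


for field in parse_partitioned_by("hour(tpep_pickup_datetime), PULocationID"):
    print(field.transform, field.column)
# hour tpep_pickup_datetime
# identity PULocationID
```

A check like this can catch a malformed string locally before a table creation job is submitted.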
2. Using Import Plans
For more complex scenarios, particularly when schema modifications are needed alongside partitioning, generate an import plan, add the partition fields to it, and then apply the modified plan:
import bauplan

client = bauplan.Client()

# Generate import plan
plan_state = client.plan_table_creation(
    table='my_partitioned_table',
    search_uri='s3://your-data/*.parquet',
    branch='my_branch'
)

# Modify plan to add partitioning
plan = plan_state.plan
plan['schema_info']['partitions'] = [
    {
        'from_column_name': 'datetime_column',
        'transform': {'name': 'year'}
    }
]

# Apply the modified plan
client.apply_table_creation_plan(plan)
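When several partition fields are needed, building the list by hand gets repetitive. The sketch below wraps the dictionary edit from the example above in a small helper. It assumes the plan is a plain nested dict with a `schema_info` / `partitions` entry shaped exactly as shown above; `with_partitions` is a hypothetical name, and transform names other than `year` are assumptions that should be checked against the plan Bauplan actually generates.

```python
import copy


def with_partitions(plan: dict, partitions: list[tuple[str, str]]) -> dict:
    """Return a copy of `plan` whose partitions are set from
    (column_name, transform_name) pairs. The input plan is not mutated."""
    updated = copy.deepcopy(plan)
    updated.setdefault("schema_info", {})["partitions"] = [
        {"from_column_name": column, "transform": {"name": transform}}
        for column, transform in partitions
    ]
    return updated


# Stand-in for plan_state.plan; a real plan carries more keys than this.
example_plan = {"schema_info": {"partitions": []}}
new_plan = with_partitions(example_plan, [("datetime_column", "year")])
print(new_plan["schema_info"]["partitions"])
# [{'from_column_name': 'datetime_column', 'transform': {'name': 'year'}}]
```

Working on a copy keeps the original plan available for comparison before anything is applied.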
Creating Partitioned Tables in Pipelines
You can create partitioned tables directly in your data pipelines using either Python or SQL models.
Python Models
Use the partitioned_by parameter in the @bauplan.model decorator:
import bauplan

@bauplan.python('3.11')
@bauplan.model(
    partitioned_by=['day(pickup_datetime)', 'PULocationID'],
    materialization_strategy='REPLACE'
)
def create_partitioned_table(
    data=bauplan.Model(
        'taxi_fhvhv',
        columns=[
            'PULocationID',
            'trip_miles',
            'tips',
            'pickup_datetime'
        ],
        filter="pickup_datetime >= '2022-12-23T00:00:00-05:00'"
    )
):
    """Creates a partitioned table from taxi data"""
    return data
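With two partition fields, as in the model above, each distinct (day, PULocationID) combination in the data becomes its own partition. Before materializing, it can be useful to estimate how many combinations a sample contains, since a very large count means many small files. The sketch below does this with plain Python on made-up sample rows; `count_partitions` is a hypothetical helper and is not part of Bauplan.

```python
from datetime import datetime


def count_partitions(rows):
    """Count distinct (day, PULocationID) pairs in (timestamp, location) rows."""
    return len({(ts.date(), location) for ts, location in rows})


# Made-up sample of (pickup_datetime, PULocationID) values.
sample = [
    (datetime(2022, 12, 23, 8, 0), 132),
    (datetime(2022, 12, 23, 21, 15), 132),  # same day and location as above
    (datetime(2022, 12, 23, 9, 30), 138),
    (datetime(2022, 12, 24, 7, 45), 132),
]
print(count_partitions(sample))  # 3
```

If the count grows too quickly, a coarser time transform or fewer partition columns may be worth considering.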
SQL Models
Add partitioning information using SQL comments:
-- bauplan: materialization_strategy=REPLACE
-- bauplan: partitioned-by="day(pickup_datetime), PULocationID"
SELECT
    PULocationID,
    trip_miles,
    tips,
    pickup_datetime
FROM taxi_fhvhv
WHERE
    pickup_datetime >= '2022-12-25T00:00:00-05:00'