Bauplan: A Python-first Serverless Lakehouse

What does it mean?

Bauplan is designed to let you build complex data transformation, analytics, and machine learning pipelines directly on your object storage in the most streamlined way possible.

We take your files from object storage, convert them into Iceberg tables, and provide serverless functions to handle data querying and transformations in SQL and Python. Your data is automatically versioned and organized into branches that you can use to develop safely against production data without impacting the production environment.

All you need is pure Python code. We handle all the complex infrastructure challenges, making your workflow feel local and seamless, without the complications of Spark jobs or orchestrators.

What can I do with it?

  • Work with Iceberg tables: If you have Parquet files in an S3 bucket, you can import them into an Iceberg catalog with a single line of code (see the sketch after this list). Working with Iceberg tables instead of files provides many benefits, including transactions, schema and partition evolution, time travel, and significantly faster query performance.

  • Create branches of your data: You can create branches of your data lake in seconds as a zero-copy operation. This allows you to work and collaborate on your data without worrying about overwriting tables used by downstream applications.

  • Run complex pipelines: You can execute pipelines remotely within seconds as serverless Python functions by defining containers directly in your code (see the pipeline sketch below).

  • Run OLAP jobs in SQL: You can query data directly on your object storage using our optimized DuckDB runtime.

  • Program everything: Everything is code, so you can use our Python SDK to integrate with CI/CD flows, visualization tools, and orchestrators.
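
The sketches below make the bullets above concrete. This first one uses the Python SDK to branch the data lake, register Parquet files from S3 as an Iceberg table, query it with SQL, and merge the result back. The bucket, table, and branch names are placeholders, and method signatures can differ between SDK releases, so treat it as a sketch and check the SDK reference for your version:

    import bauplan

    # the client picks up your API key from your Bauplan configuration
    client = bauplan.Client()

    # create a zero-copy branch of the data lake to work in safely
    client.create_branch('my_user.import_branch', from_ref='main')

    # register the Parquet files in S3 as an Iceberg table on the branch
    # (bucket and table names are placeholders)
    client.create_table(
        table='trips',
        search_uri='s3://my-bucket/trips/*.parquet',
        branch='my_user.import_branch',
    )
    client.import_data(
        table='trips',
        search_uri='s3://my-bucket/trips/*.parquet',
        branch='my_user.import_branch',
    )

    # query the new table with SQL; results come back as an Arrow table
    rows = client.query(
        'SELECT COUNT(*) AS n FROM trips',
        ref='my_user.import_branch',
    )
    print(rows.to_pandas())

    # once you are happy with the data, merge the branch back into main
    client.merge_branch(source_ref='my_user.import_branch', into_branch='main')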

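A pipeline step is just a decorated Python function: the bauplan.python decorator declares the interpreter and the pip packages, so the container is defined directly in the code, and the function reads and returns tables. The parent table and column names below are placeholders, and the decorator arguments follow the patterns used throughout the examples in this repository rather than a definitive API reference:

    import bauplan

    @bauplan.model()
    @bauplan.python('3.11', pip={'pandas': '2.2.0'})
    def trips_by_zone(
        # reads an existing Iceberg table from the active branch,
        # pushing down the column selection (names are placeholders)
        trips=bauplan.Model(
            'trips',
            columns=['pickup_datetime', 'pickup_zone'],
        ),
    ):
        # inside the function it is plain pandas: aggregate and return a
        # dataframe, which Bauplan materializes as a table in your branch
        df = trips.to_pandas()
        return df.groupby('pickup_zone', as_index=False).size()

Running the pipeline is then a single bauplan run from the project folder, or the equivalent call through the Python SDK, which is how the same code slots into orchestrators and CI jobs.
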
Why should I care?

  • Simpler abstractions: Bauplan handles containerization, runtime optimization, schema evolution, data versioning, and partitioning, enabling any Python developer to build data-intensive applications directly in the cloud—not just specialized MLOps and data engineers.

  • Lower costs: By building on object storage, you reduce costs and maintain the flexibility to choose your computational engine based on your use case. Stop using your warehouse for data transformation and ETL: save those credits for large analytical queries.

  • More robust workflows: Reliable software is built on automation, testing, and programmability. Using data branches, automatic versioning, and our Python SDK makes it easy to build CI/CD and orchestration flows that enable extensive testing, rollbacks, and incident recovery.

Use cases

In this repository, you’ll find numerous examples demonstrating how our customers use the platform to solve real-world problems.
