Bauplan: A Python-first Serverless Lakehouse

Bauplan is a Pythonic data platform that provides functions as a service for large-scale data pipelines and Git-for-data over S3 data lakes. It handles the provisioning, runtime and data-versioning work that would typically require an entire infrastructure team. Our goal is to let you and your team run large-scale ML workflows, AI applications and data transformation pipelines in the cloud without managing any data infrastructure.

Why we built it. We are a team of ML and data engineers, and we built Bauplan because we’ve experienced firsthand the frustration of spending too much time wrestling with cloud infrastructure. We wanted a Python-first platform that is both extremely simple and robust.

Simple. Our serverless functions allow you to write pipelines as plain Python functions chained together, without dealing with containerization, runtime configuration or specialized big-data frameworks like Spark.
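As a rough illustration of what "Python functions chained together" means, here is a minimal sketch in plain Python. It does not use the Bauplan SDK: the function names, the `run_pipeline` helper and the sample rows are all hypothetical, and on the platform the chaining and execution would be handled for you rather than by a local loop.

```python
from typing import Callable, Dict, List

# A "table" here is just a list of rows; this stands in for a real table in S3.
Row = Dict[str, float]
Table = List[Row]


def clean_trips(trips: Table) -> Table:
    """First step: drop rows with a non-positive distance."""
    return [row for row in trips if row["miles"] > 0]


def add_fare_per_mile(trips: Table) -> Table:
    """Second step: consumes the output of clean_trips and adds a column."""
    return [{**row, "fare_per_mile": row["fare"] / row["miles"]} for row in trips]


def run_pipeline(source: Table, steps: List[Callable[[Table], Table]]) -> Table:
    """Hypothetical local runner: feed each step's output into the next step."""
    table = source
    for step in steps:
        table = step(table)
    return table


# Made-up sample data, for illustration only.
raw_trips: Table = [
    {"miles": 2.0, "fare": 10.0},
    {"miles": 0.0, "fare": 3.0},
    {"miles": 5.0, "fare": 20.0},
]

result = run_pipeline(raw_trips, [clean_trips, add_fare_per_mile])
```

Each step is an ordinary function that takes a table and returns a table, so it can be unit-tested and developed in any IDE like the rest of your code.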

Robust. Using Git-for-data and our system of Refs, we make sure that every pipeline run, table and model is automatically versioned, reproducible and auditable.

Main features

  • Pythonic by design. Build workflows using native Python in your favorite IDE—no DSLs, no YAML, no Spark required.

  • Work with tables directly in S3. Convert your Parquet and CSV files into Apache Iceberg tables with a single line of code. Get ACID transactions, schema and partition evolution, time travel, and optimized queries—without leaving your S3 bucket.

  • Git-for-data. Create zero-copy branches of your data lake instantly. Safely collaborate on real data without risking downstream breakage.

  • Serverless pipelines. Run fast, stateless Python functions in the cloud. Chain them together to build full pipelines—no containers, no runtime headaches.

  • SQL everywhere. Run interactive or async SQL queries across branches and tables in S3, with full support for versioned data.

  • CI/CD for data. Automate testing and deployment of data pipelines using data branches and our Python SDK—just like your code, with instant feedback loops.

  • Version and reproduce with Refs. Every pipeline run is tracked through data and code versioning. Use Refs to reproduce results, audit changes, and roll back with confidence.
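The Git-for-data and Refs features above can be pictured with a toy model. The sketch below is a conceptual illustration only, not Bauplan's implementation or API: `ToyCatalog`, its methods and the S3 paths are hypothetical. It assumes that a branch is a named pointer to an immutable snapshot and that a ref is the identifier of one snapshot. Under that assumption, creating a branch copies a pointer rather than any data files.

```python
import hashlib
import json
from typing import Dict, List

Snapshot = Dict[str, List[str]]  # table name -> list of data file paths


class ToyCatalog:
    """Hypothetical in-memory catalog: branches point at immutable snapshots."""

    def __init__(self) -> None:
        self.commits: Dict[str, Snapshot] = {}
        self.branches: Dict[str, str] = {"main": self._store({})}

    def _store(self, tables: Snapshot) -> str:
        # Derive the ref from the snapshot contents (a simplification:
        # real systems also record parents, authors and timestamps).
        payload = json.dumps(tables, sort_keys=True).encode()
        ref = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[ref] = tables
        return ref

    def create_branch(self, name: str, from_branch: str = "main") -> str:
        # Zero-copy: only the pointer is duplicated, never the data files.
        self.branches[name] = self.branches[from_branch]
        return self.branches[name]

    def commit(self, branch: str, table: str, files: List[str]) -> str:
        # Build a new snapshot; the previous one stays untouched.
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot[table] = list(files)
        self.branches[branch] = self._store(snapshot)
        return self.branches[branch]

    def read(self, ref_or_branch: str) -> Snapshot:
        # Accept either a branch name or a ref pointing at an older snapshot.
        ref = self.branches.get(ref_or_branch, ref_or_branch)
        return self.commits[ref]

    def merge(self, source: str, into: str = "main") -> None:
        # Fast-forward only; assumes `into` has not moved since branching.
        self.branches[into] = self.branches[source]


catalog = ToyCatalog()
main_v1 = catalog.commit("main", "trips", ["s3://bucket/trips/part-0.parquet"])
dev_start = catalog.create_branch("dev")
dev_v1 = catalog.commit(
    "dev",
    "trips",
    ["s3://bucket/trips/part-0.parquet", "s3://bucket/trips/part-1.parquet"],
)
main_after_dev_commit = catalog.branches["main"]  # still main_v1
catalog.merge("dev", into="main")
```

In this model, work on `dev` never touches `main` until the merge, and the old ref `main_v1` remains readable afterwards, which is the intuition behind rolling back and reproducing earlier results.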

Use cases

Run AI applications, ML workloads and data pipelines. Here, you’ll find examples showing how our customers use the platform to solve real-world problems.