bauplan: the programmable lakehouse
bauplan takes your files in S3, turns them into Iceberg tables in a data catalog, and lets you query them and build complex pipelines in Python and SQL within minutes.
All you need is your code; we solve the hard infrastructure problems for you so you can concentrate on more valuable things. You can think of it as a Data Lakehouse in your CLI. The whole point of Bauplan is to be very simple: all you need to know are the Bauplan APIs and your own business logic.
What can I do with it?
Turn files into tables: if you have parquet files in an S3 bucket, you can import them into a data catalog with one line of code. Your files will be represented as Iceberg tables, which adds benefits such as schema evolution, time travel, and transactional writes.
Branch your data lake: you can create branches of your data catalog as a zero-copy operation, allowing you to work with your data without worrying about overwriting the tables in your data lake.
Run queries: you can query the tables in your data catalog across branches in real time with SQL. Our SQL dialect is DuckDB's.
Run pipelines: you can run pipelines remotely within seconds. You can write them in SQL, Python, or both.
Program everything: each operation above is a single API call, so you can meta-program the whole system using our Python SDK.
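As a sketch of what the branch-then-query workflow above might look like with the Python SDK (this is illustrative, not a definitive implementation: it assumes `bauplan` is installed and configured with an API key, and the branch and table names are hypothetical; consult the Reference section for exact method signatures):

```python
# Sketch: branch the data catalog, then query it with DuckDB-flavored SQL.
# Assumes `pip install bauplan` and a configured API key; `my_table` and
# the branch name below are hypothetical placeholders.
import bauplan

client = bauplan.Client()

# Create a zero-copy branch of the catalog to experiment in safely.
client.create_branch("myname.dev_branch", from_ref="main")

# Query a table on that branch; results come back as an Arrow table.
table = client.query(
    "SELECT * FROM my_table LIMIT 10",
    ref="myname.dev_branch",
)
print(table.num_rows)
```

Because branching is zero-copy, the branch is cheap to create and discard, so a workflow like this can run on every experiment without duplicating data in S3.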
But why should I care?
It’s very simple: more freedom and lower costs.
More freedom. If you already use Spark, Trino, Flink, Presto, Impala, or Snowflake, you can use them to query every Iceberg table produced with Bauplan; if you don't, you can use our runtime out of the box.
Lower costs. By building on object storage, you spend less and are free to choose a compute engine per use case. Stop using your Warehouse for data transformation and ETL: keep those credits for big analytical queries.
No migration costs and no lock-in. We plug directly into your S3, so you don't have to move your data into Bauplan as you would with a Warehouse, and you can always leave with no switching costs.
Contents
- Home
- Getting Started
- Examples
- CLI Cheatsheet
- Datasets
- Reference
- FAQ
- I need some help. Who do I call?
- I found a bug. Who do I tell?
- Does Bauplan optimize my code?
- How much infrastructure do I need to know to use Bauplan?
- I don’t want to send my data outside my infrastructure. Can I have Bauplan in my VPC?
- Can I use the table I write with Bauplan somewhere else?
- My workloads are not very big individually, but I have an initial table that is quite large, does Bauplan scale enough for that?
- I have an orchestrator and I like it. Do I have to ditch it to use Bauplan?
- Does it scale?
- What does Bauplan mean?