FAQ

I need some help. Who do I call?

Use Slack. If you need anything, just drop a message in the Slack channel or DM a member of our team, and we’ll come running. This goes for everything: questions, suggestions, complaints, improvements. Don’t be shy: reach out!

I found a bug. Who do I tell?

If you found a bug, we strongly encourage you to open an issue in this repository. When you do, please include the JobId printed by the system (see “Deterministic reruns and debugging” above). You can also message us on Slack, either in the channel or by DM.

Does Bauplan optimize my code?

Not really. Bauplan provides a lightning-fast feedback loop in the cloud while you build complex data workloads, by removing many of the bottlenecks you would normally face: network bandwidth, caching, containerization and data passing between functions. However, Bauplan does not change your code in any way and does not force you to learn new syntax (e.g. PySpark). If a function runs on your laptop, it will run in Bauplan, more robustly and possibly faster. The example pipeline in scikit_pipeline shows that four different ways of implementing the same functional step lead to different performance.

How much infrastructure do I need to know to use Bauplan?

As little as humanly possible. Bauplan removes unnecessary boilerplate code and infrastructure maintenance. A Data Lakehouse is a complex organism that requires several moving pieces to operate in harmony. With Bauplan, you don’t need to write, build or maintain a query engine, a general-purpose runtime, a versioned data catalog, an auditing platform, a Docker repository and a Docker client.

I don’t want to send my data outside my infrastructure. Can I have Bauplan in my VPC?

Yes. We’ve been on the other side, so we know how important it is to have this option ready from day 1.

Can I use the table I write with Bauplan somewhere else?

Yes. Every artifact produced by Bauplan is available to other engines, and any other engine can provide input to Bauplan, as long as it supports Iceberg-compatible tables.

My workloads are not very big individually, but I have an initial table that is quite large. Does Bauplan scale enough for that?

There are limitations in terms of data volume, so if your table is larger than 1TB it might not be seamless. However, since Bauplan uses Iceberg tables as the point of contact with other systems, you could use Spark or Dremio to go from TB scale to an initial aggregation of hundreds of GB, and use Bauplan from there.

I have an orchestrator and I like it. Do I have to ditch it to use Bauplan?

No, please keep it. Bauplan is not an orchestrator: it’s a serverless Data Lakehouse. We are good at being really fast, running pipelines, and querying and branching data. We are not terribly good at scheduling, retries and fan-out. We integrate with the outermost layer of orchestration, so you can keep your favorite frameworks and retain the capabilities that really matter in an orchestrator. For instance, you can call Bauplan functions and DAGs as Airflow tasks; they will simply be executed by Bauplan’s optimized runtime.

Does it scale?

TL;DR: it probably does, for what you care about, but people mean very different things by this question:

  • “scale as in running larger DAGs, or a lot of them at the same time”. The short answer is yes. In principle, there is no limit to the number of nodes in a DAG, and there is no barrier to running a lot of DAGs. While the system currently does not load balance / queue requests, the full immutability and independence of each node/run makes it easy to distribute DAGs if necessary. It is even possible to run functions from one DAG on different hosts with no code changes, thanks to the built-in abstraction on data passing.

  • “scale as in more people using it at the same time”. The short answer is also yes. Bauplan optimizes the allocation of a certain reserved compute capacity across multiple concurrent users, using clever caching and predictive AI models. In addition to well-known strategies based on temporal and spatial locality, we believe in “team locality”: users working in the same team will re-use similar data, packages and functions. In this way, Bauplan for teams can become even better than Bauplan for a single practitioner.

  • “scale as in running in production, not just as a development tool”. Yes, yes and yes. Bauplan has been designed to make it easy for people to develop, and only incidentally for computers to execute. However, its modular nature makes it trivial to schedule entire DAGs (or even isolated functions) using existing workflow tools (such as Airflow or Prefect) or Bauplan’s own scheduler (forthcoming). In particular, scheduling can take advantage of spot instances to minimize cloud spending with no code changes.

  • “scale as in running on larger datasets”. The short answer is… it depends. For now, we assume that no node requires more than 200GB of RAM, where each function runs in a single container. The longer answer is that, since Bauplan allows you to run any code you like, the type of processing you do may change the details quite a bit (for example, frameworks like DuckDB and Polars support out-of-core processing). Moreover, while currently not supported, Bauplan could easily support map-reduce style computations as long as users define mappers and reducers, or parallel workloads if they provide a sharding function.
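To illustrate what “users define mappers and reducers” and “provide a sharding function” could look like, here is a minimal plain-Python sketch. It is not a Bauplan API (the text above notes this is not currently supported): `shard`, `mapper`, `reducer` and the sample rows are hypothetical names and data chosen for the example. The idea is that each shard is small enough to be processed by one function in one container, and partial results are merged afterwards.

```python
from collections import defaultdict
from functools import reduce
from typing import Callable, Dict, Iterable, List


def shard(rows: Iterable[dict], key: Callable[[dict], str],
          n_shards: int) -> List[List[dict]]:
    """User-provided sharding function: route each row to a shard by key."""
    shards: List[List[dict]] = [[] for _ in range(n_shards)]
    for row in rows:
        # Stable bucket choice (built-in hash() of str varies across processes).
        bucket = sum(key(row).encode("utf-8")) % n_shards
        shards[bucket].append(row)
    return shards


def mapper(rows: List[dict]) -> Dict[str, float]:
    """Map step: compute partial totals per country within one shard."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


def reducer(left: Dict[str, float], right: Dict[str, float]) -> Dict[str, float]:
    """Reduce step: merge two partial results by summing matching keys."""
    merged = dict(left)
    for country, amount in right.items():
        merged[country] = merged.get(country, 0.0) + amount
    return merged


# Made-up sample data, for illustration only.
rows = [
    {"country": "IT", "amount": 10.0},
    {"country": "US", "amount": 5.0},
    {"country": "IT", "amount": 2.5},
    {"country": "DE", "amount": 1.0},
]

# Each shard could in principle be mapped on a different host.
partials = [mapper(s) for s in shard(rows, key=lambda r: r["country"], n_shards=2)]
totals = reduce(reducer, partials, {})
```

Because the reducer is associative and each mapper only sees its own shard, the map step can be parallelized without changing the user's logic.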

What does Bauplan mean?

Bauplan is a term from evolutionary biology meaning ground plan or structural plan. It is used to identify common sets of morphological features in organisms, such as symmetry, layers, segmentation, and nerve and limb configuration. Different phyla of animals can be grouped based on their bauplan. For instance, the vertebrates share the same Bauplan, while invertebrates have many Baupläne. We wanted a name that could convey our passion for the structural optimization of complex systems. What makes a bauplan successful in the history of evolution? How many ways are there to optimize the structure of an organism for its environment?

If you want to know more about this kind of stuff, check out this amazing book by Sean B. Carroll.