Data Catalog

This guide walks you through the core features of the Bauplan data catalog:

  • Import new data as Iceberg tables.

  • Create new data branches.

  • Merge branches.

With Bauplan, every new table that you upload or that your business logic defines can be materialized and persisted in the data catalog as an Iceberg table.

Bauplan also gives you an easy way to create branches of your data lake and to write data artifacts into them by running your pipelines. We call these data branches. You can think of them as sandboxed data environments in which you can manipulate production data without altering the primary production environment. Data branches let your team explore, validate, and debug data artifacts and pipelines before merging them into the main production environment, similar to what we do with code.

Create a branch

The Bauplan data catalog contains Iceberg tables. Creating a branch of your data lake gives you an instant, zero-copy duplicate of its tables: an isolated environment where you can work with production data without affecting the primary production setup.

Data branches are namespaced by your username, so to create a data branch you will have to prefix it with your username.

bauplan branch create <YOUR_USERNAME>.<YOUR_BRANCH_NAME>

For example, let’s create a branch named hello_bauplan and check it out.

bauplan branch create <YOUR_USERNAME>.hello_bauplan
bauplan branch checkout <YOUR_USERNAME>.hello_bauplan
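Because every branch name must carry the username prefix, a small guard can catch a missing prefix before the create command is run. The following is a minimal sketch in POSIX shell. BAUPLAN_USER, the value jane, and the helper is_namespaced are hypothetical names introduced for illustration; they are not part of the Bauplan CLI.

```shell
# Hypothetical username; replace with your own Bauplan username.
BAUPLAN_USER="jane"

# Succeeds only if the name has the form <username>.<non-empty branch name>.
is_namespaced() {
  case "$1" in
    "${BAUPLAN_USER}."?*) return 0 ;;
    *) return 1 ;;
  esac
}

BRANCH="${BAUPLAN_USER}.hello_bauplan"

if is_namespaced "$BRANCH"; then
  # Print the command from this guide rather than running it.
  echo "bauplan branch create ${BRANCH}"
else
  echo "branch name must start with ${BAUPLAN_USER}." >&2
fi
```

The check is purely local string matching, so it only verifies the naming convention, not whether the branch already exists.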

To see which branch you are on, run bauplan branch. It lists all your branches and marks the active one with a star.

bauplan branch

To see the content of your newly created data branch, you can run bauplan table.

bauplan table

Note that even if you haven’t written any new table into the branch yet, the branch is not empty. Because the newly created branch is a zero-copy clone of the main branch, it contains all the tables that main contains.

Import data

We will now import some data into the newly created branch. In this tutorial, to import your files into the Bauplan catalog you will need a public S3 bucket (with ListObject allowed).

bauplan import plan 'your_S3_bucket'

If you don’t have one yet, don’t worry: we have a public bucket and an open dataset you can use. We are going to import one month of the Green Taxi Trip Records, namely February 2023:

bauplan import plan 's3://alpha-hello-bauplan/green-taxi/*.parquet'

This command generates an import plan in the file bauplan_import_plan.yaml, listing the files to be imported, an inferred schema, and any potential conflicts. Inspect the file and check the conflicts field. If it reads conflicts: [], there are no conflicts.

If there are conflicts, edit the file to make the table schema consistent, and make sure that conflicts ends up as an empty list (i.e. conflicts: []).
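The conflicts check can also be done from the command line. The sketch below assumes the plan file contains the empty list written literally on one line as conflicts: [], as described above; if the YAML is serialized differently, the pattern would need adjusting. The function name plan_has_no_conflicts is a hypothetical helper, not a Bauplan command.

```shell
# Succeeds if the file has a line of the form "conflicts: []"
# (optionally indented, with flexible spacing around the brackets).
plan_has_no_conflicts() {
  grep -Eq '^[[:space:]]*conflicts:[[:space:]]*\[[[:space:]]*\][[:space:]]*$' "$1"
}

PLAN_FILE="bauplan_import_plan.yaml"

# Only report when the plan file is present in the current directory.
if [ -f "$PLAN_FILE" ]; then
  if plan_has_no_conflicts "$PLAN_FILE"; then
    echo "No conflicts found in ${PLAN_FILE}."
  else
    echo "Conflicts may be present: review ${PLAN_FILE} before applying." >&2
  fi
fi
```

This is only a text match on one field, so it does not replace reading the inferred schema in the plan before you apply it.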

Once you have reviewed the file, you can import the data into the data catalog by specifying the branch and the name of the table to create. We will call the table green_taxi_table and prefix it with your username and an underscore.

bauplan import apply --branch <YOUR_USERNAME>.hello_bauplan <YOUR_USERNAME>_green_taxi_table

👏👏 Congratulations, you just created your first data branch in the data catalog and you imported some data into it!

Merge a branch

Let us merge the new table imported in your hello_bauplan branch into the main branch.

First, let’s display the differences between your current branch and main to make sure you are merging the right tables.

bauplan branch diff main

You will see your newly imported table listed as present in your active branch but not in the main branch.

To merge the new table into main, we will use a process similar to git’s. First, check out main, then merge your source branch.

bauplan branch checkout main
bauplan branch merge <YOUR_USERNAME>.hello_bauplan

If you now inspect the content of your active branch, main, you will see your new table.

bauplan table
bauplan table get <YOUR_USERNAME>_green_taxi_table

In fact, you can even query your new table using bauplan query as we did in the quick_start. For instance, you can calculate the top pickup locations for green taxi trips.

bauplan query "SELECT COUNT(lpep_pickup_datetime) as number_of_trips, PULocationID FROM <YOUR_USERNAME>_green_taxi_table GROUP BY PULocationID ORDER BY number_of_trips DESC"

👏👏 Congratulations, you just merged a data branch into the main data catalog!
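As a recap, the commands used in this guide can be collected in one place. The sketch below only prints them in order with your username filled in; it does not call Bauplan. BAUPLAN_USER, its fallback value your_username, and the helper functions run and workflow are hypothetical names introduced for illustration.

```shell
# Placeholder username; export BAUPLAN_USER to use your own.
BAUPLAN_USER="${BAUPLAN_USER:-your_username}"
BRANCH="${BAUPLAN_USER}.hello_bauplan"
TABLE="${BAUPLAN_USER}_green_taxi_table"

# Print a command instead of executing it.
run() {
  echo "$*"
}

workflow() {
  run bauplan branch create "$BRANCH"
  run bauplan branch checkout "$BRANCH"
  run bauplan import plan 's3://alpha-hello-bauplan/green-taxi/*.parquet'
  # Review bauplan_import_plan.yaml here before applying the plan.
  run bauplan import apply --branch "$BRANCH" "$TABLE"
  run bauplan branch diff main
  run bauplan branch checkout main
  run bauplan branch merge "$BRANCH"
  run bauplan table get "$TABLE"
}

workflow
```

Printing rather than executing keeps the manual review of the import plan as a separate step between import plan and import apply. Note that the printed import plan line drops the quotes around the S3 path, so add them back if you paste it into a shell.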

There are more things we can do with the catalog and its branches, like dropping tables and branches. For a more comprehensive overview of all the commands, please see our docs.
