Data Catalog

This guide walks you through the core features of the Bauplan data catalog:

  • Import new data as Iceberg tables.

  • Create new data branches.

  • Merge branches.

With Bauplan, every new table, whether uploaded or defined by your business logic, can be materialized and persisted in the data catalog as an Iceberg table.

Bauplan also makes it easy to create branches of your data lake and write data artifacts into them by running your pipelines. We call these data branches. You can think of them as sandboxed data environments in which you can manipulate production data without altering the primary production environment. Data branches let your team explore, validate, and debug data artifacts and pipelines before merging them into the main production environment, much as we do with code.

Create a branch

Data branches are namespaced by username, so the name of every data branch you create must be prefixed with your username.

bauplan branch create <YOUR_USERNAME>.<YOUR_BRANCH_NAME>
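Because every branch name must carry the username prefix, a small shell helper can build the full name for you and catch empty parts early. This is only a sketch: `branch_name` is a hypothetical helper, not part of the Bauplan CLI.

```shell
# Hypothetical helper (not a Bauplan command): prints "<username>.<branch>".
branch_name() {
    # Refuse empty parts so we never produce names like ".hello_bauplan".
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: branch_name <username> <branch>" >&2
        return 1
    fi
    printf '%s.%s\n' "$1" "$2"
}
```

You could then pass the result to the create command, for example `bauplan branch create "$(branch_name "$MY_USERNAME" hello_bauplan)"`, where `MY_USERNAME` is an assumed shell variable holding your Bauplan username.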

Let’s create, for example, a branch named hello_bauplan and check it out.

bauplan branch create <YOUR_USERNAME>.hello_bauplan
bauplan branch checkout <YOUR_USERNAME>.hello_bauplan

To find out which branch you are on, run bauplan branch. It lists all your branches and marks the active one with a star.

bauplan branch

To see the content of your newly created data branch, you can run bauplan table.

bauplan table

Note that even if you haven’t written any new table into the branch yet, the branch is not empty. Because a newly created branch is a zero-copy of the main branch, it already contains every table in main.

Import data

We will now import some data into the newly created branch. By default, you cannot import new data into your active branch, so let’s switch back to the main branch first.

bauplan branch checkout main

Now, to import your files into the Bauplan catalog, you will need a public S3 bucket (with ListObject permission enabled).

bauplan import plan --table tablename 's3://your_S3_bucket'

If you don’t have a bucket yet, don’t worry. We have a public bucket and an open dataset you can use. We are going to import one month of the Green Taxi Trip Records (February 2023) into a new table named green_taxi_table, prefixed with your username and an underscore:

bauplan import plan --table <YOUR_USERNAME>_green_taxi_table 's3://alpha-hello-bauplan/green-taxi/*.parquet'

This command generates an import plan in the file bauplan_import_plan.yaml (the default name; change it with --file filename.yaml). The plan lists the files to be imported, the inferred schema, the table name, and any conflicts. Inspect the file and check the conflicts field: conflicts: [] means that there are no conflicts.

If there are conflicts, edit the file to make the table schema consistent, and make sure that conflicts ends up as an empty list (i.e., conflicts: []).
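A quick automated check can complement the manual review before you apply a plan. The sketch below assumes only what is described above, namely that a clean plan contains a line of the form `conflicts: []`. The helper name `plan_is_clean` is hypothetical, and the check is a plain text match rather than a real YAML parse, so treat it as a sanity check and not a substitute for reading the file.

```shell
# Hypothetical helper: succeeds only if the plan has at least one
# "conflicts:" key and every such key is an empty list.
plan_is_clean() {
    plan_file="${1:-bauplan_import_plan.yaml}"
    if [ ! -f "$plan_file" ]; then
        echo "plan file not found: $plan_file" >&2
        return 2
    fi
    # A plan without any "conflicts:" key is treated as not verified.
    grep -Eq '^[[:space:]]*conflicts:' "$plan_file" || return 1
    # Fail if any "conflicts:" line holds something other than [].
    if grep -E '^[[:space:]]*conflicts:' "$plan_file" \
        | grep -Evq '^[[:space:]]*conflicts:[[:space:]]*\[[[:space:]]*\][[:space:]]*$'; then
        return 1
    fi
    return 0
}
```

For example, `plan_is_clean && echo "no conflicts found"` checks the default plan file in the current directory.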

Once you have reviewed the file, you can import the data into the data catalog by specifying the branch to which the import plan should be applied. Remember, by default, you cannot import new data into your active branch, so apply the plan to the branch created in the previous step.

bauplan import apply --branch <YOUR_USERNAME>.hello_bauplan

👏👏 Congratulations, you have just created your first data branch in the data catalog and imported some data into it!

Merge a branch

Let us merge the new table imported in your hello_bauplan branch into the main branch.

First, let’s display the differences between the main branch and the import branch to make sure you are merging the right tables.

bauplan branch diff <YOUR_USERNAME>.hello_bauplan

You will see your newly imported table listed as present in your hello_bauplan branch but not in the main branch.

To merge the new table into main, we follow a process similar to git's: make sure you are on main, then merge your source branch.

bauplan branch merge <YOUR_USERNAME>.hello_bauplan
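The checkout, diff, and merge steps can be bundled into one shell function so they always run in the same order. This is a sketch, not an official workflow: `merge_into_main` is a hypothetical wrapper, and it uses only the bauplan commands shown in this guide.

```shell
# Hypothetical wrapper around the three commands used in this section.
merge_into_main() {
    branch="$1"
    if [ -z "$branch" ]; then
        echo "usage: merge_into_main <username>.<branch>" >&2
        return 1
    fi
    # The guide merges while on main, so switch to main first.
    bauplan branch checkout main || return 1
    # Print the differences so they appear in the output before the merge.
    bauplan branch diff "$branch" || return 1
    bauplan branch merge "$branch"
}
```

Note that the function prints the diff but does not pause for confirmation; if you want to review the diff before merging, run the commands one at a time instead.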

If you now inspect the content of your active branch, main, you will see your new table.

bauplan table
bauplan table get bauplan.<YOUR_USERNAME>_green_taxi_table

In fact, you can even query your new table using bauplan query as we did in the quick_start. For instance, you can calculate the top pickup locations for green taxi trips.

bauplan query "SELECT COUNT(lpep_pickup_datetime) AS number_of_trips, PULocationID FROM bauplan.<YOUR_USERNAME>_green_taxi_table GROUP BY PULocationID ORDER BY number_of_trips DESC"

👏👏 Congratulations, you just merged a data branch into the main data catalog!

There are more things we can do with the catalog and its branches, like dropping tables and branches. For a more comprehensive overview of all the commands, please see our docs.
