Data Catalog¶
This guide walks you through the core features of the bauplan data catalog:
Import new data as Iceberg tables.
Create new data branches.
Merge branches.
With bauplan, all tables created through uploads or business logic can be materialized and persisted in a data catalog as Iceberg tables. A distinctive feature of bauplan is its ability to create branches of your data lake and write data artifacts within them. We call these data branches. Think of them as sandboxed data environments where you can manipulate production data without affecting the primary production environment. Data branches let your team explore, validate, and debug data artifacts and pipelines before merging them into the main production environment, much like branches in code version control.
Create a branch¶
Data branches are namespaced by your username, so you must prefix branch names with your username.
bauplan branch create <YOUR_USERNAME>.<YOUR_BRANCH_NAME>
For example, to create a branch named hello_bauplan and switch to it:
bauplan branch create <YOUR_USERNAME>.hello_bauplan
bauplan checkout <YOUR_USERNAME>.hello_bauplan
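The naming rule above can be sketched as a small shell helper. This is an illustrative sketch only: `qualified_branch` is a hypothetical function, not part of the bauplan CLI, and it simply builds a `<username>.<branch>` string, leaving names that already carry the prefix unchanged.

```shell
# Hypothetical helper (not part of the bauplan CLI): builds a
# username-prefixed branch name of the form <username>.<branch>.
qualified_branch() {
    user="$1"
    branch="$2"
    if [ -z "$user" ] || [ -z "$branch" ]; then
        echo "usage: qualified_branch <username> <branch>" >&2
        return 1
    fi
    case "$branch" in
        "$user".*) printf '%s\n' "$branch" ;;        # already prefixed, keep as-is
        *)         printf '%s\n' "$user.$branch" ;;  # add the username prefix
    esac
}
```

It could then be combined with the command shown earlier, for example `bauplan branch create "$(qualified_branch <YOUR_USERNAME> hello_bauplan)"`.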
To see your current branch, run bauplan branch. This command displays all your branches, marking your active branch with a green star.
bauplan branch
To see the content of your newly created data branch:
bauplan table
Note: Even before you write any new tables, your branch isn’t empty. Because it is a zero-copy clone of the main branch, it contains every table that exists in main.
Import data¶
To import data, you’ll need a public S3 bucket with ListObject permission enabled (here is an example of JSON S3 permissions).
We provide a public bucket with an open dataset to get started.
Let’s import February 2023 Green Taxi Trip Records into a new table:
bauplan table create --name <YOUR_USERNAME>_green_taxi_table --search-uri 's3://alpha-hello-bauplan/green-taxi/*.parquet'
This command creates an empty table whose schema is derived from the Parquet files. You can then import the data into the newly created table:
bauplan table import --name <YOUR_USERNAME>_green_taxi_table --search-uri 's3://alpha-hello-bauplan/green-taxi/*.parquet'
To verify the table creation:
bauplan table get <YOUR_USERNAME>_green_taxi_table
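The create, import, and verify steps above share the same table name and search URI, so they can be generated together. The sketch below is a hypothetical helper (`print_import_commands` is not a bauplan command) that only prints the three commands, using exactly the subcommands and flags shown in this guide; nothing is executed until you choose to run the printed lines.

```shell
# Hypothetical helper: prints the create, import, and verify commands from
# this guide for a given table name and search URI. It only prints text;
# it does not call bauplan itself.
print_import_commands() {
    table="$1"
    uri="$2"
    if [ -z "$table" ] || [ -z "$uri" ]; then
        echo "usage: print_import_commands <table> <search-uri>" >&2
        return 1
    fi
    # The URI is single-quoted in the output so the shell does not expand '*'.
    printf "bauplan table create --name %s --search-uri '%s'\n" "$table" "$uri"
    printf "bauplan table import --name %s --search-uri '%s'\n" "$table" "$uri"
    printf "bauplan table get %s\n" "$table"
}
```

Reviewing the printed commands before running them is a cheap way to catch a mistyped table name or URI.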
Sometimes, an import operation may fail due to schema conflicts between the table you created and the Parquet files you are trying to import. When such conflicts occur, the import will not proceed. In these situations, you can generate a plan file to help resolve the conflicts:
bauplan table create-plan --name <YOUR_USERNAME>_green_taxi_table --search-uri 's3://alpha-hello-bauplan/green-taxi/*.parquet' --save-plan table_creation_plan.yaml
Review the table_creation_plan.yaml file for conflicts (example). The conflicts field should be empty (it should look like this: conflicts: []).
If there are conflicts, edit the column schema directly in the file so that it is consistent with the table schema, and make sure that the conflicts field is an empty list (i.e., conflicts: []).
After reviewing, apply the plan:
bauplan table create-plan-apply --plan table_creation_plan.yaml
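The manual review of the conflicts field can be backed by a quick automated check. The following is a minimal sketch under one assumption: that the plan file contains the field on a single line reading `conflicts: []`, as described above. The real plan file layout may differ, and `plan_has_no_conflicts` is a hypothetical helper, not a bauplan command.

```shell
# Hypothetical helper: succeeds (exit 0) only if the plan file contains a
# line reading `conflicts: []`.
# Assumption: the field is written on one line in exactly that form.
plan_has_no_conflicts() {
    plan_file="$1"
    # Exit code 2 distinguishes "file missing" from "conflicts present".
    [ -f "$plan_file" ] || return 2
    grep -Eq '^[[:space:]]*conflicts:[[:space:]]*\[][[:space:]]*$' "$plan_file"
}
```

It could gate the apply step, for example `plan_has_no_conflicts table_creation_plan.yaml && bauplan table create-plan-apply --plan table_creation_plan.yaml`.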
👏👏 Congratulations, you have just created your first data branch in the data catalog and imported data into it!
Merge a branch¶
To merge your hello_bauplan branch into the main branch:
Review the differences between branches:
bauplan branch diff main
This compares your active branch with the main branch and shows which tables exist in one branch but not the other.
Switch to main and merge:
bauplan branch checkout main
bauplan branch merge <YOUR_USERNAME>.<YOUR_BRANCH_NAME>
Check the schema of the merged table:
bauplan table
bauplan table get <YOUR_TABLE> # in this tutorial the table should be <YOUR_USERNAME>_green_taxi_table
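The merge steps above can be summarized as a short command sequence. The sketch below is a hypothetical helper (`print_merge_commands` is not part of the bauplan CLI) that prints the diff, checkout, and merge commands from this guide for a given branch. As an added safety check of our own, it rejects `main` and names without a `<username>.` prefix.

```shell
# Hypothetical helper: prints the merge sequence from this guide for a
# given data branch. It only prints text; it does not call bauplan.
# Assumption: the diff is run while the data branch is still active.
print_merge_commands() {
    branch="$1"
    case "$branch" in
        main|"")
            echo "pass a <username>.<branch> name, not main" >&2
            return 1 ;;
        *.*) ;;  # looks like <username>.<branch>
        *)
            echo "branch name must be prefixed with your username" >&2
            return 1 ;;
    esac
    printf 'bauplan branch diff main\n'
    printf 'bauplan branch checkout main\n'
    printf 'bauplan branch merge %s\n' "$branch"
}
```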
You can now query the table. For example, to find out how many records are in the table:
bauplan query "SELECT COUNT(lpep_pickup_datetime) as number_of_trips FROM <YOUR_USERNAME>_green_taxi_table"
👏👏 Congratulations, you just merged a data branch into the main data catalog!
Note
Data branches are user-specific; always prefix branch names with your username.
For a complete command reference, please consult our reference documentation.
The bauplan data catalog supports additional features like dropping tables and branches.