Import data

With Bauplan, data lives in versioned branches, just like code in Git.

All imports happen in data branches, not main, so you can safely explore, validate, and transform data without affecting production. Each branch is a zero-copy snapshot of its origin. Tables you create or modify are isolated to that branch until you merge.

This guide shows how to import data into a branch, inspect it, and merge your changes into main.

What you’ll need

To import data with the CLI, you will need:

  • A Bauplan environment (sandbox or production).

  • An S3 bucket with Parquet or CSV files.

Upload a dataset in the public sandbox

If you are in the public Sandbox environment, open Import Data in the left menu of the Bauplan UI and drag and drop your files. This uploads your dataset to the public Sandbox and makes it ready to be imported into a table.


Once your data is uploaded, follow the steps below: create an import branch, create a table, and import the data into it.

Warning

Public Sandbox Notice

This is a shared public environment. All data you upload will be visible to other users. Please do not upload sensitive or private data: use only public datasets.

Create an import branch

Create an import branch prefixed with your username:

bauplan branch create <your_username>.import_branch
bauplan checkout <your_username>.import_branch

Verify your active branch by running:

bauplan branch
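If you script this step, it can help to build the branch name in one place so the username prefix is never forgotten. The sketch below is a hypothetical helper (import_branch_name is not part of Bauplan); it only assumes the <your_username>.<branch> naming convention shown above.

```python
def import_branch_name(username: str, suffix: str = "import_branch") -> str:
    """Return '<username>.<suffix>', the naming convention used in this guide."""
    username = username.strip()
    suffix = suffix.strip()
    if not username or not suffix:
        raise ValueError("username and suffix must be non-empty")
    # Leave the name alone if it already carries the username prefix.
    if suffix.startswith(f"{username}."):
        return suffix
    return f"{username}.{suffix}"
```

The returned string can be passed to bauplan branch create and bauplan checkout as shown above.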

Create a table and import a dataset

To import a new dataset, first create a new table in your data catalog, then import the data into it. Both steps use the table API.

Use the table create command to define and create your table:

bauplan table create --name <your_table_name> --search-uri 's3://your/s3/bucket/*.parquet'

Confirm that the table was created and inspect the table’s schema:

bauplan table get <your_table_name>

Use the table import command to populate your new table with the data:

bauplan table import --name <your_table_name> --search-uri 's3://your/s3/bucket/*.parquet'

Run a quick exploration query to make sure that the data was imported correctly:

bauplan query "SELECT COUNT(*) AS row_count FROM <your_table_name>"
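The CLI steps in this section can also be driven from a script. The following is a minimal sketch that assembles the same four commands as argument lists and runs them in order. The function names and the runner parameter are illustrative, not part of Bauplan, and the sketch assumes the bauplan CLI is installed and your import branch is already checked out.

```python
import subprocess
from typing import Callable, List


def table_import_commands(table: str, search_uri: str) -> List[List[str]]:
    """Build the CLI calls from this section as argv lists.

    Passing argv lists (rather than one shell string) means the S3 glob
    is handed to bauplan unchanged, without shell expansion.
    """
    return [
        ["bauplan", "table", "create", "--name", table, "--search-uri", search_uri],
        ["bauplan", "table", "get", table],
        ["bauplan", "table", "import", "--name", table, "--search-uri", search_uri],
        ["bauplan", "query", f"SELECT COUNT(*) AS row_count FROM {table}"],
    ]


def run_all(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

Calling run_all(table_import_commands("your_table_name", "s3://your/s3/bucket/*.parquet")) would execute the create, get, import, and query steps in sequence and raise on the first command that exits with an error.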

Merge into main

Before merging, review what has changed in your branch compared to main by running branch diff. This command always compares your active branch against the branch you pass as an argument, so you can see exactly what changes a merge will introduce. For example:

bauplan branch diff main

This will show:

  • New tables in your branch that don’t exist in main

  • Schema or data differences in existing tables

  • Tables deleted or renamed in your branch

Once you’ve verified your data, merge your import branch into main:

bauplan checkout main
bauplan branch merge <your_username>.import_branch

Import data programmatically

You can also perform imports programmatically using the bauplan Python SDK:

import bauplan

# instantiate a Bauplan client
client = bauplan.Client()

# Create the import branch from main
client.create_branch(
    branch='your_import_branch',
    from_ref='main',
)

# Create the table
client.create_table(
    table='your_table_name',
    search_uri='s3://your/s3/bucket/*.parquet',
    branch='your_import_branch'
)

# Import the data
state = client.import_data(
    table='your_table_name',
    search_uri='s3://your/s3/bucket/*.parquet',
    branch='your_import_branch'
)
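
The calls above can be wrapped in one function that also checks the import result and merges the branch. This is a sketch under assumptions, not a reference implementation: import_and_merge is a hypothetical helper, and it assumes that the state returned by import_data exposes an error attribute and that the client offers merge_branch(source_ref=..., into_branch=...). Check both against the SDK reference for your version before relying on them.

```python
def import_and_merge(client, table, search_uri, branch, into_branch="main"):
    """Create a branch, import data into a new table, then merge.

    `client` is expected to be a bauplan.Client() instance.
    """
    # Same three calls as in the guide, parameterized.
    client.create_branch(branch=branch, from_ref=into_branch)
    client.create_table(table=table, search_uri=search_uri, branch=branch)
    state = client.import_data(table=table, search_uri=search_uri, branch=branch)

    # Assumption: the returned state carries an `error` attribute that is
    # empty or None on success.
    error = getattr(state, "error", None)
    if error:
        raise RuntimeError(f"import into {table!r} failed: {error}")

    # Assumption: merge_branch(source_ref, into_branch) merges the import
    # branch into the target. It is only reached when the import succeeded.
    client.merge_branch(source_ref=branch, into_branch=into_branch)
    return state
```

With a real client this would be called as import_and_merge(bauplan.Client(), 'your_table_name', 's3://your/s3/bucket/*.parquet', 'your_import_branch'). Raising before the merge keeps a failed import isolated in the branch, so main is untouched.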