Data Versioning: Tags and Commits

What is double versioning?

A fundamental Bauplan principle is that every operation on the lake is doubly versioned.

This means:

  • You should make data changes—writes, replacements, deletes—through code. This code may be versioned by you (using Git or similar tools), and it is always immutably stored by us.

  • The data themselves are always managed through our catalog. Bauplan tracks the underlying files in object storage via evolving table snapshots, where each data change creates a new commit—just like Git does for your code.

With this setup, every data operation is recorded, uniquely identified, and deterministically linked to the code that produced it (e.g. “who and with which code, run when, produced this change to that table?”). That gives you immense power in terms of debugging, reproducibility, and auditing—right out of the box.

Using refs to navigate space and time

Bauplan “execution” commands—i.e. running a pipeline or querying a table—accept a ref parameter. This lets you control exactly which version of the data to use: a branch (e.g. main), a specific commit hash (e.g. apo@aoidaodh1qnodq), or a semantic tag. This enables time travel and reproducibility in a few lines of code.
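For instance, the same query can be pointed at any of these refs. Here is a minimal sketch, assuming a bauplan.Client and table and tag names of our own invention:

import bauplan

client = bauplan.Client()
query = 'SELECT COUNT(*) AS c FROM my_table'  # illustrative table name

# Head of a branch
rows_now = client.query(query, ref='main').to_pylist()

# A specific commit: grab the latest commit object on main and use its ref
last_commit = client.get_commits('main', limit=1)[0]
rows_then = client.query(query, ref=last_commit.ref).to_pylist()

# A semantic tag (illustrative tag name, created beforehand)
rows_release = client.query(query, ref='my_release_tag').to_pylist()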

Every run is a transaction

In Bauplan every pipeline run is executed as a transaction—with full atomicity and isolation guarantees.

Behind the scenes, Bauplan handles this by creating a temporary branch for the run. If the pipeline succeeds, its effects are committed to your working branch. If it fails, your development branch remains untouched. This behavior ensures that incomplete or failed jobs never corrupt your data state, making development safe and predictable.

This transactional behavior applies during development as well as when you run on a schedule: you can perform table changes (and iterate!) without worrying about polluting your canonical branches.

To show how this works in practice, let’s walk through a real example; the full working code is in this repo (the snippets below may omit imports and logging for clarity). Let’s see how that plays out in practice 👇

Walkthrough: navigating commits, branches, and tags
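The snippets below assume a Bauplan client and a handful of illustrative names along these lines (a minimal setup sketch; the actual names in the repo may differ):

import bauplan

# Instantiate the client (credentials come from your local Bauplan profile)
client = bauplan.Client()

# Illustrative names used throughout the walkthrough
my_branch = 'username.versioning_walkthrough'
my_test_table_name = 'my_test_table'
run_id_query = f'SELECT run_id FROM {my_test_table_name}'
my_compliance_dataset_tag = 'compliance_approved_v1'
full_name = 'Jane Doe'  # author name used in the audit example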

If we want to develop in Bauplan, we always start by creating a new development branch from main:

client.create_branch(my_branch, from_ref='main')

Since nothing has changed in our development branch yet, both branches point to the same commit. We can verify that, and get to know the get_commits API along the way:

my_branch_last_commit = client.get_commits(my_branch, limit=1)[0]
source_branch_last_commit = client.get_commits('main', limit=1)[0]
assert my_branch_last_commit.ref.hash == source_branch_last_commit.ref.hash

Now let’s run a pipeline on our branch. This pipeline materializes a table based on its parameter (so if run_id=1 we will find a table with that value), therefore generating a new commit:

run_1 = client.run(..., ref=my_branch, parameters={'run_id': 1})

The branch head has now changed:

my_branch_run_1_commit = client.get_commits(my_branch, limit=1)[0]
assert my_branch_run_1_commit.ref.hash != source_branch_last_commit.ref.hash

Crucially, each commit records the job that generated it. This means we can always trace back which run created what data:

job_id_in_the_commit = my_branch_run_1_commit.properties['bpln_job_id']
assert job_id_in_the_commit == run_1.job_id

Let’s run the pipeline again with a different DAG parameter. This creates a new commit, since run_2 comes after run_1 on our branch:

run_2 = client.run(..., ref=my_branch, parameters={'run_id': 2})

Now if we query the materialized table, it will reflect the latest value:

rows = client.query(run_id_query, ref=my_branch).to_pylist()
assert rows == [{'run_id': 2}]

But we can just as easily query the previous state by passing the older commit (the object my_branch_run_1_commit we got just after run 1):

rows = client.query(run_id_query, ref=my_branch_run_1_commit.ref).to_pylist()
assert rows == [{'run_id': 1}]

Tags: giving names to important commits

To simplify navigation, we can tag specific commits. For instance, we might want to mark a dataset version that passed compliance checks:

tag_1_ref = client.create_tag(my_compliance_dataset_tag, my_branch_run_1_commit.ref)

Now we can use the tag as a permanent reference in our operations, for example when querying:

target_tag = client.get_tag(my_compliance_dataset_tag)
rows = client.query(run_id_query, ref=target_tag).to_pylist()
assert rows == [{'run_id': 1}]

Auditing and debugging

Who did what?

Since every commit is tracked, we can also filter by author to audit recent changes:

my_author_commit_history = client.get_commits(my_branch, filter_by_author_name=full_name, limit=5)
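For a quick audit view, we can print a compact trail using the same commit attributes we relied on earlier (the hash and the originating job id):

for commit in my_author_commit_history:
    # ref.hash and the bpln_job_id property are the same fields used above
    print(commit.ref.hash, commit.properties.get('bpln_job_id'))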

Inspecting failed runs

All runs are transactional by default: a pipeline run either succeeds and its effects land on the branch, or it fails and the branch is untouched. In practice, a temporary branch is created from the current branch and all the intermediate artifacts are materialized there. In other words, not only main but our development branch itself is untouched by this failed run, which we can easily check with our APIs:

failed_run = client.run(..., ref=my_branch, parameters={'something_that_fails_the_DAG': ...})

# This confirms the current branch still reflects the last successful run
assert client.query(run_id_query, ref=my_branch).to_pylist() == [{'run_id': 2}]

You can now inspect the failed run’s job metadata, logs, and any intermediate tables that were materialized:

logs = client.get_job_logs(failed_run.job_id)
for log_line in logs:
    print(f'[{log_line.stream.name}] {log_line.message}')

If you re-run the pipeline now with a new parameter (that does not trigger an error), you will get a new materialization in the development branch, while the old failed run stays available as part of your audit trail. Try it yourself!
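Here is a sketch of that re-run, mirroring the earlier snippets (the parameter value is arbitrary):

# A new, successful run creates a fresh commit on the development branch...
run_3 = client.run(..., ref=my_branch, parameters={'run_id': 3})

# ...and the materialized table now reflects the latest value
rows = client.query(run_id_query, ref=my_branch).to_pylist()
assert rows == [{'run_id': 3}]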

Reverting: jumping back in time

Suppose that we now decide that the original version—the one tagged as “compliant”—is the one we want. We can revert the table to that state with a simple API:

revert_ref = client.revert_table(
    table=my_test_table_name,
    source_ref=target_tag,
    into_branch=my_branch,
    # this message gets appended to the standard commit body for clarity!
    commit_body=f'Revert to tag {my_compliance_dataset_tag}',
    replace=True
)

We can confirm it worked by querying the table again and checking that we got the table from the first run back:

rows = client.query(run_id_query, ref=my_branch).to_pylist()
assert rows == [{'run_id': 1}]

Just like that, we’ve reverted to a known-good state!

Summary

With Bauplan, every data operation is:

  • Code-driven: you own the logic (and possibly your own code versioning tool, e.g. Git), and Bauplan stores pipeline code irrespective of its Git status (i.e. you can “forget” to commit to Git, but Bauplan still knows exactly which code ran in each job).

  • Snapshot-tracked: Bauplan maintains the full lineage and history of all data changes in your lakehouse.

  • Auditable and reversible: you can inspect, trace, tag, and revert any change.

This “doubly versioned” approach brings the best of Git-like workflows to your data. And the best part? It all works out of the box.