Data Versioning: Tags and Commits¶
What is double versioning?¶
A fundamental Bauplan principle is that every operation on the lake is doubly versioned.
This means:
You should make data changes—writes, replacements, deletes—through code. This code may be versioned by you (using Git or similar tools), and it is always immutably stored by us.
The data themselves are always managed through our catalog. Bauplan tracks the underlying files in object storage via evolving table snapshots, where each data change creates a new commit—just like Git does for your code.
With this setup, every data operation is recorded, uniquely identified, and deterministically linked to the code that produced it (e.g. “who and with which code, run when, produced this change to that table?”). That gives you immense power in terms of debugging, reproducibility, and auditing—right out of the box.
Using refs to navigate space and time¶
Bauplan “execution” commands—i.e. running a pipeline or querying a table—accept a ref parameter. This lets you control exactly which version of the data to use: a branch (e.g. main), a specific commit hash (e.g. apo@aoidaodh1qnodq), or a semantic tag. This enables time travel and reproducibility in a few lines of code.
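For instance, the same query can be pointed at different versions of the data just by changing the ref. The following is a minimal sketch: the client instantiation follows the SDK's usual pattern, and the table and tag names are illustrative placeholders rather than real objects.
import bauplan

client = bauplan.Client()
query = 'SELECT * FROM my_table'  # illustrative table name

# a branch ref: read the current state of main
rows_on_main = client.query(query, ref='main').to_pylist()

# a commit ref: pin the query to an exact point in history
latest_commit = client.get_commits('main', limit=1)[0]
rows_at_commit = client.query(query, ref=latest_commit.ref).to_pylist()

# a tag ref: a human-readable name for an important commit
rows_at_tag = client.query(query, ref=client.get_tag('my_tag')).to_pylist()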
Every Run Is a Transaction¶
In Bauplan every pipeline run is executed as a transaction—with full atomicity and isolation guarantees.
Behind the scenes, Bauplan handles this by creating a temporary branch for the run. If the pipeline succeeds, its effects are committed to your working branch. If it fails, your development branch remains untouched. This behavior ensures that incomplete or failed jobs never corrupt your data state, making development safe and predictable.
This transactional behavior applies during development as well as when you run on a schedule: you can perform table changes (and iterate!) without worrying about polluting your canonical branches.
To show how this works in practice, let’s walk through a real example. You can find the full working code in this repo (the snippets below may omit imports and logging for clarity!).
Let’s see how that plays out 👇
Walkthrough: navigating commits, branches, and tags¶
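Before diving in, here is roughly the setup the snippets below assume. This is a minimal sketch: the client instantiation follows the SDK's usual pattern, and all the names are illustrative placeholders rather than taken from the repo.
import bauplan

client = bauplan.Client()

# illustrative names used throughout the snippets below
my_branch = 'username.my_dev_branch'         # development branch
my_branch_name = my_branch                   # some snippets use this alias
my_test_table_name = 'my_test_table'
run_id_query = f'SELECT run_id FROM {my_test_table_name}'
my_compliance_dataset_tag = 'compliance_checked_v1'
full_name = 'Jane Doe'                       # used later to filter commits by author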
If we want to develop in Bauplan, we always start by creating a new development branch from main:
client.create_branch(my_branch, from_ref='main')
Since nothing has changed in our development branch yet, both branches point to the same commit. We can verify that, and learn the get_commits API:
my_branch_last_commit = client.get_commits(my_branch, limit=1)[0]
source_branch_last_commit = client.get_commits('main', limit=1)[0]
assert my_branch_last_commit.ref.hash == source_branch_last_commit.ref.hash
Now let’s run a pipeline on our branch. This pipeline materializes a table based on its parameter (so if run_id=1 we will find a table with that value), thereby generating a new commit:
run_1 = client.run(..., ref=my_branch, parameters={'run_id': 1})
The branch head has now changed:
my_branch_run_1_commit = client.get_commits(my_branch, limit=1)[0]
assert my_branch_run_1_commit.ref.hash != source_branch_last_commit.ref.hash
Crucially, each commit records the job that generated it. This means we can always trace back which run created what data:
job_id_in_the_commit = my_branch_run_1_commit.properties['bpln_job_id']
assert job_id_in_the_commit == run_1.job_id
Let’s run the pipeline again with a different DAG parameter. This creates a new commit, since run_2 comes after run_1 on our branch:
run_2 = client.run(..., ref=my_branch, parameters={'run_id': 2})
Now if we query the materialized table, it will reflect the latest value:
rows = client.query(run_id_query, ref=my_branch).to_pylist()
assert rows == [{'run_id': 2}]
But we can just as easily query the previous state by passing the older commit (the object my_branch_run_1_commit we got just after run 1):
rows = client.query(run_id_query, ref=my_branch_run_1_commit.ref).to_pylist()
assert rows == [{'run_id': 1}]
Tags: giving names to important commits¶
To simplify navigation, we can tag specific commits. For instance, we might want to mark a dataset version that passed compliance checks:
tag_1_ref = client.create_tag(my_compliance_dataset_tag, my_branch_run_1_commit.ref)
Now we can use the tag as a permanent reference in our operations, for example when querying:
target_tag = client.get_tag(my_compliance_dataset_tag)
rows = client.query(run_id_query, ref=target_tag).to_pylist()
assert rows == [{'run_id': 1}]
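The tag now gives us a stable handle on that version, independent of where the branch head moves. Reusing the snippets above, the branch still reflects the latest run while the tag pins the earlier one:
rows_at_head = client.query(run_id_query, ref=my_branch).to_pylist()
rows_at_tag = client.query(run_id_query, ref=target_tag).to_pylist()
assert rows_at_head == [{'run_id': 2}]
assert rows_at_tag == [{'run_id': 1}]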
Auditing and debugging¶
Who did what?¶
Since every commit is tracked, we can also filter by author to audit recent changes:
my_author_commit_history = client.get_commits(my_branch, filter_by_author_name=full_name, limit=5)
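Each returned commit carries its metadata, so we can print a compact audit trail that links commits back to the jobs that produced them (a short sketch using only the fields we met earlier in the walkthrough):
# one line per commit: the commit hash and the job that produced it
for commit in my_author_commit_history:
    print(commit.ref.hash, commit.properties.get('bpln_job_id'))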
Inspecting Failed Runs¶
All runs are transactional by default: a pipeline run either succeeds and its changes land on the branch, or fails and the branch is untouched. In practice, a temporary branch is created from the current branch and all the intermediate artifacts are materialized there. In other words, not only main, but our development branch itself is untouched by this failed run - which we can easily check with our APIs:
failed_run = client.run(..., ref=my_branch, parameters={'something_that_fails_the_DAG': ...})
# This confirms the current branch still reflects the last successful run
assert client.query(run_id_query, ref=my_branch_name).to_pylist() == [{'run_id': 2}]
You can now inspect the failed run’s job metadata, logs, and any intermediate tables that were materialized:
logs = client.get_job_logs(failed_run.job_id)  # fetch the logs of the failed job
for log_line in logs:
    print(f'[{log_line.stream.name}] {log_line.message}')
If you re-run the pipeline now with a new parameter (that does not trigger an error), you will get a new materialization in the development branch, while the old failed run stays available as part of your audit trail. Try it yourself!
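Following the same pattern as the earlier runs, that might look like this (a sketch reusing the names above):
run_3 = client.run(..., ref=my_branch, parameters={'run_id': 3})
# the branch head now reflects the newest successful run...
assert client.query(run_id_query, ref=my_branch).to_pylist() == [{'run_id': 3}]
# ...while the failed job stays in the history, inspectable through its logs and metadata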
Reverting: jumping back in time¶
Suppose that we now decide that the original version—the one tagged as “compliant”—is the one we want. We can revert the table to that state with a simple API:
revert_ref = client.revert_table(
    table=my_test_table_name,
    source_ref=target_tag,
    into_branch=my_branch_name,
    # this custom body is added to the standard commit message for clarity!
    commit_body=f'Revert to tag {my_compliance_dataset_tag}',
    replace=True
)
We confirm it worked by querying the table again and checking that we got the table from the first run back:
rows = client.query(run_id_query, ref=my_branch_name).to_pylist()
assert rows == [{'run_id': 1}]
Just like that, we’ve reverted to a known-good state!
Summary¶
With Bauplan, every data operation is:
Code-driven: you own the logic (and possibly your own code versioning tool, e.g. Git), and Bauplan stores pipeline code irrespective of its Git status (i.e. you can “forget” to commit to Git, but Bauplan still knows exactly what code ran in a given job).
Snapshot-tracked: Bauplan maintains the full lineage and history of all data changes in your lakehouse.
Auditable and reversible: you can inspect, trace, tag, and revert any change.
This “doubly versioned” approach brings the best of Git-like workflows to your data. And the best part? It all works out of the box.