Quick start
These are the only things you need to do:
Clone this repo.
Install bauplan.
pip install bauplan --upgrade
Set up your username and authentication key (please make sure you have both before starting).
bauplan config set api_key "your_bauplan_key"
Alternatively, you can manually create a ~/.bauplan/config.yml file with the following structure:
profiles:
  default:
    api_key: <YOUR_API_KEY_HERE>
    project_dir: .
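If you would rather script the manual setup, the file can be written with a few lines of Python. This is only a sketch: the helper name `write_bauplan_config` is hypothetical, and the nesting of `api_key` and `project_dir` under `profiles` and `default` is an assumption based on the structure shown above.

```python
from pathlib import Path
from typing import Optional

# Assumed layout: both keys nested under profiles -> default.
CONFIG_TEMPLATE = """\
profiles:
  default:
    api_key: {api_key}
    project_dir: .
"""


def write_bauplan_config(api_key: str, home: Optional[Path] = None) -> Path:
    """Write <home>/.bauplan/config.yml and return its path (hypothetical helper)."""
    base = home if home is not None else Path.home()
    config_path = base / ".bauplan" / "config.yml"
    # Create ~/.bauplan if it does not exist yet.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(CONFIG_TEMPLATE.format(api_key=api_key), encoding="utf-8")
    # The file holds a secret, so restrict it to the current user where supported.
    config_path.chmod(0o600)
    return config_path
```

Calling `write_bauplan_config("your_bauplan_key")` writes the file under your home directory; passing `home=` lets you try it against a scratch directory first.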
Explore the data catalog
We pre-loaded some data for you, so let’s start by looking at it in the data catalog using the CLI. The following command lists the Iceberg tables in the main branch of the data lake.
bauplan branch get main
We can then explore the schema of the tables in the data catalog. The important tables for this tutorial are taxi_fhvhv and taxi_zones (explore our datasets here). Here, bauplan corresponds to the default namespace.
bauplan table get taxi_fhvhv
bauplan table get taxi_zones
Run a query
You can query the data directly in the data lake using the CLI:
bauplan query "SELECT max(tips) FROM taxi_fhvhv WHERE pickup_datetime = '2023-01-01T00:00:00-05:00'"
The results are displayed in your terminal (we will show how to use interfaces other than the CLI later in the tutorial).
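If you want to issue the same query from a script, one simple option is to shell out to the CLI. The sketch below uses only the `bauplan query "<sql>"` form shown above. The helper names `build_query_command` and `run_query` are hypothetical, and the code assumes the `bauplan` executable is on your PATH and already configured.

```python
import subprocess
from typing import List


def build_query_command(sql: str) -> List[str]:
    """Return the argument list equivalent to: bauplan query "<sql>"."""
    # Passing a list (not a shell string) avoids quoting problems in the SQL.
    return ["bauplan", "query", sql]


def run_query(sql: str) -> str:
    """Run the query through the bauplan CLI and return whatever it prints."""
    completed = subprocess.run(
        build_query_command(sql),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
    return completed.stdout
```

For example, `run_query("SELECT max(tips) FROM taxi_fhvhv WHERE pickup_datetime = '2023-01-01T00:00:00-05:00'")` returns the same text the CLI would show in the terminal.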
Run a pipeline
Go into the folder 01-quick-start and run our demo pipeline. For now, we don’t need to write any new tables to the data lake, so we will just run in memory (dry run).
cd 01-quick-start
bauplan run --dry-run
👏👏 Congratulations, you just ran your first bauplan pipeline! In this example, you ran a very simple pipeline composed of two Python functions.
What just happened?
When you ran bauplan run, bauplan parsed the code in the file models.py, built a logical plan based on the implicit dependencies between the nodes, and ran the nodes of the pipeline as isolated functions in the cloud, while streaming the output back to your terminal in real time.
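To make "implicit dependencies" more concrete, here is a toy sketch of how a plan could be derived from plain Python functions. It is not bauplan's planner. It assumes each node is a function whose parameter names refer to its upstream inputs, and the node names `trips_and_zones` and `normalized_trips`, as well as the helpers `build_plan` and `run_plan`, are hypothetical.

```python
import inspect
from graphlib import TopologicalSorter


# Two hypothetical nodes: each parameter name stands for an upstream input.
def trips_and_zones(taxi_fhvhv, taxi_zones):
    return f"join({taxi_fhvhv}, {taxi_zones})"


def normalized_trips(trips_and_zones):
    return f"normalize({trips_and_zones})"


def build_plan(nodes):
    """Infer the dependency graph from signatures and return a valid run order."""
    graph = {fn.__name__: set(inspect.signature(fn).parameters) for fn in nodes}
    # static_order() yields every name after all of its inputs.
    return list(TopologicalSorter(graph).static_order())


def run_plan(nodes, sources):
    """Execute the nodes in dependency order, feeding each one its inputs."""
    by_name = {fn.__name__: fn for fn in nodes}
    results = dict(sources)
    for name in build_plan(nodes):
        if name in by_name:  # source tables are inputs, not functions
            fn = by_name[name]
            args = {p: results[p] for p in inspect.signature(fn).parameters}
            results[name] = fn(**args)
    return results
```

Nothing in the code declares the order explicitly: `normalized_trips` runs after `trips_and_zones` only because it names that node as its input, which is the sense in which the dependencies are implicit.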