Quick start


These are the only things you need to do:

  • Install the latest version of bauplan:

pip install bauplan --upgrade

  • Set up your username and authentication key (please make sure you have both before starting):

bauplan config set api_key "your_bauplan_key"

Alternatively, you can manually create a ~/.bauplan/config.yml file with the following structure:

profiles:
  default:
    api_key: <YOUR_API_KEY_HERE>
    project_dir: .
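
If you prefer to script this step, the same file can be generated from Python. The following is a minimal sketch using only the standard library; the helper names render_config and write_config are hypothetical, and the layout simply mirrors the structure shown above. It assumes the key contains no characters that would need YAML quoting.

```python
from pathlib import Path


def render_config(api_key: str, project_dir: str = ".") -> str:
    """Return config text with the same structure as the YAML shown above."""
    return (
        "profiles:\n"
        "  default:\n"
        f"    api_key: {api_key}\n"
        f"    project_dir: {project_dir}\n"
    )


def write_config(api_key: str, config_path: Path) -> Path:
    """Write the config file, creating the parent directory if needed."""
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(render_config(api_key), encoding="utf-8")
    return config_path


if __name__ == "__main__":
    # The location ~/.bauplan/config.yml is the one named in the text above.
    target = Path.home() / ".bauplan" / "config.yml"
    # Only preview the content here; nothing is written to disk.
    print(render_config("<YOUR_API_KEY_HERE>"), end="")
    print(f"# call write_config(<key>, {target!s}) to create the file")
```

Running the script only prints a preview; call write_config explicitly once you are happy with the content.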

Explore the data catalog

We pre-loaded some data for you, so let’s start by looking at it in the data catalog using the CLI. The following command lists the Iceberg tables on the main branch of the data lake.

bauplan branch get main

We can then explore the schema of the tables in the data catalog. The important tables for this tutorial are taxi_fhvhv and taxi_zones (explore our datasets here). Here bauplan corresponds to the default namespace.

bauplan table get taxi_fhvhv
bauplan table get taxi_zones

Run a query

You can query the data directly in the data lake using the CLI:

bauplan query "SELECT max(tips) FROM taxi_fhvhv WHERE pickup_datetime = '2023-01-01T00:00:00-05:00'"
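
If you want to issue the same query from a script, one option is to call the CLI through Python's subprocess module. This is a sketch that relies only on the bauplan query command shown above; the helper names build_query_command and run_query are hypothetical, and the output is returned as whatever text the CLI prints.

```python
import shutil
import subprocess


def build_query_command(sql: str) -> list[str]:
    """Return the argument list for `bauplan query "<sql>"`."""
    return ["bauplan", "query", sql]


def run_query(sql: str) -> str:
    """Run the query through the CLI and return the text it prints."""
    if shutil.which("bauplan") is None:
        raise RuntimeError("bauplan CLI not found on PATH; install it first")
    result = subprocess.run(
        build_query_command(sql),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Same SQL as in the command above.
SQL = (
    "SELECT max(tips) FROM taxi_fhvhv "
    "WHERE pickup_datetime = '2023-01-01T00:00:00-05:00'"
)

if __name__ == "__main__":
    # Show the command that would be executed, without running it.
    print(build_query_command(SQL))
```

Passing the arguments as a list, rather than as a single shell string, means the quotes inside the SQL need no extra escaping.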

The results are displayed in your terminal (later in the tutorial we will show how to use interfaces other than the CLI).

Run a pipeline

Go into the folder 01-quick-start and run our demo pipeline. For now, we don’t need to write any new tables to the data lake, so we will just run in memory (dry run).

cd 01-quick-start
bauplan run --dry-run

👏👏 Congratulations, you just ran your first bauplan pipeline! In this example, you ran a very simple pipeline composed of two Python functions:

flowchart LR
  id0[(taxi_fhvhv)]-->id2[models.trips_and_zones]
  id1[(taxi_zones)]-->id2[models.trips_and_zones]
  id2[models.trips_and_zones]-->id3[models.normalized_taxi_trips]

What just happened?
  • when you executed bauplan run, bauplan parsed the code in the file models.py,

  • built a logical plan based on the implicit dependencies between the nodes,

  • and ran the nodes of the pipeline as isolated functions in the cloud, while streaming the output back to your terminal in real time.
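
To make the idea of a plan built from implicit dependencies more concrete, here is a small, purely illustrative sketch. It is not bauplan's implementation: it assumes, for the sake of the example, that a function's inputs are identified by its parameter names, and it uses the standard library's graphlib to order the nodes. The function bodies are placeholders and do not reproduce the logic in models.py; build_plan and run_plan are hypothetical names.

```python
import inspect
from graphlib import TopologicalSorter


# Hypothetical stand-ins for the two models in the diagram.
def trips_and_zones(taxi_fhvhv, taxi_zones):
    return f"join({taxi_fhvhv}, {taxi_zones})"


def normalized_taxi_trips(trips_and_zones):
    return f"normalize({trips_and_zones})"


def build_plan(functions):
    """Map each function to the names of its inputs, then order the nodes."""
    graph = {
        fn.__name__: set(inspect.signature(fn).parameters) for fn in functions
    }
    # static_order() yields each node only after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())


def run_plan(functions, sources):
    """Execute functions in dependency order, feeding outputs downstream."""
    by_name = {fn.__name__: fn for fn in functions}
    results = dict(sources)  # source tables are available from the start
    for name in build_plan(functions):
        if name in by_name:
            fn = by_name[name]
            args = [results[p] for p in inspect.signature(fn).parameters]
            results[name] = fn(*args)
    return results


if __name__ == "__main__":
    # The functions are listed out of order on purpose.
    models = [normalized_taxi_trips, trips_and_zones]
    print(build_plan(models))
```

Because the order is derived from the declared inputs, the functions can be listed in any order and the two source tables are always scheduled before the nodes that read them.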