Quick start

These are the only things you need to do:

  • Install the Bauplan CLI:

```shell
pip install bauplan --upgrade
```

  • Set up your username and authentication key (please make sure you have both before starting):

```shell
bauplan --profile default config set api_key "your_bauplan_key"
```

Alternatively, you can manually create a ~/.bauplan/config.yml file with the following structure:

```yaml
profiles:
  default:
    env: prod
    api_key: "my_bauplan_key"
    project_dir: .
```

Explore the data catalog

We pre-loaded some data for you, so let’s start by looking at it in the data catalog using the CLI.

This command will show you the Iceberg tables in the main branch of the data lake:

```shell
bauplan branch get main
```

We can then explore the schema of the tables in the data catalog. The important tables for this tutorial are taxi_fhvhv and taxi_zones (here you can have a look at the datasets).

```shell
bauplan table get taxi_fhvhv
bauplan table get taxi_zones
```

Run a query

You can query the data directly in the data lake using the CLI:

```shell
bauplan query "SELECT max(tips) FROM taxi_fhvhv WHERE pickup_datetime = '2023-01-01T00:00:00-05:00'"
```
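To make the query's semantics concrete, here is the same aggregation reproduced over a tiny in-memory table using Python's built-in sqlite3 module. This is purely illustrative: the rows below are made up, and Bauplan executes the real query against the Iceberg tables in your data lake, not a local database.

```python
import sqlite3

# Toy stand-in for the taxi_fhvhv table (columns and values are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_fhvhv (pickup_datetime TEXT, tips REAL)")
conn.executemany(
    "INSERT INTO taxi_fhvhv VALUES (?, ?)",
    [
        ("2023-01-01T00:00:00-05:00", 5.0),
        ("2023-01-01T00:00:00-05:00", 12.5),
        ("2023-01-02T09:30:00-05:00", 3.0),
    ],
)

# Same shape as the CLI query: filter on pickup_datetime, take max(tips).
(max_tip,) = conn.execute(
    "SELECT max(tips) FROM taxi_fhvhv "
    "WHERE pickup_datetime = '2023-01-01T00:00:00-05:00'"
).fetchone()
print(max_tip)  # 12.5
```

The aggregation returns a single scalar: the largest tip among the rows matching the timestamp filter.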

The results will be displayed in your terminal (we will show how to use interfaces other than the CLI later in the tutorial).

Run a pipeline

Go into the quick-start folder and run our demo pipeline. You should see the terminal updating in real time as the code is executed.

```shell
cd quick-start
bauplan run
```

👏👏 Congratulations, you just ran your first Bauplan pipeline! In this example, you ran a very simple pipeline composed of two Python functions:

```mermaid
flowchart LR
    id0[(taxi_fhvhv)] --> id2[models.trips_and_zones]
    id1[(taxi_zones)] --> id2[models.trips_and_zones]
    id2[models.trips_and_zones] --> id3[models.normalized_taxi_trips]
```
What happens when we do bauplan run?
  • Bauplan parsed the code in the file models.py,

  • built a logical plan based on the implicit dependencies between the nodes,

  • and ran the nodes of the pipeline as isolated functions in the cloud, streaming logs back to your terminal in real time.
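The second step, inferring a plan from implicit dependencies, can be illustrated with a toy sketch. This is not Bauplan's actual implementation: only the model and table names are taken from the demo pipeline, and everything else is a hypothetical simplification. The idea is that a function's parameter names refer to other models or source tables, which is enough to derive an execution order:

```python
import inspect

# Two toy "models" mirroring the demo DAG: parameter names declare inputs.
def trips_and_zones(taxi_fhvhv, taxi_zones):
    return f"joined({taxi_fhvhv},{taxi_zones})"

def normalized_taxi_trips(trips_and_zones):
    return f"normalized({trips_and_zones})"

# Source tables already in the lake (placeholder values).
SOURCES = {"taxi_fhvhv": "raw_trips", "taxi_zones": "raw_zones"}
MODELS = {f.__name__: f for f in (trips_and_zones, normalized_taxi_trips)}

def run_pipeline():
    results = dict(SOURCES)
    pending = dict(MODELS)
    while pending:
        for name, fn in list(pending.items()):
            # Read the dependency names straight from the function signature.
            deps = list(inspect.signature(fn).parameters)
            if all(d in results for d in deps):  # all inputs are ready
                results[name] = fn(*(results[d] for d in deps))
                del pending[name]
    return results

out = run_pipeline()
print(out["normalized_taxi_trips"])
```

Here trips_and_zones runs first because both of its inputs are source tables, and normalized_taxi_trips runs once trips_and_zones has produced its output, matching the flowchart above.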