In this example, we will illustrate how to use expectation tests. Expectations are statistical and quality tests run against Bauplan models to ensure that the shape and the values of the data are what we expect. These tests allow us to catch data quality issues early, and can be programmed in multiple ways in our workflows. We can use expectations to validate the output of our models as a step in a data pipeline and catch data quality issues before bringing artifacts and pipelines to production. A very important use case is to ensure the validity of data at ingestion, when our system sits on the receiving end of data produced by external systems, such as other teams and vendors. Expectations also embody domain-specific knowledge about the data: if well documented, they can be used to develop thorough data documentation, which facilitates the adoption of best practices like data contracts. Technically, an expectation is a function from a tabular artifact to a boolean: it returns True when the expectation is met and the test passes, and False when the data does not comply with our expectations. Because Bauplan leverages vectorized Single-Instruction-Multiple-Data (SIMD) operations, expectations can process quite large tables extraordinarily fast.

The pipeline

Go into the folder 04-expectations and check that you have a bauplan_project.yml file with a unique project id, a project name, and a default Python interpreter.

    id: bde138c0-0c48-4f37-a2be-cc55c8e8504a
    name: simple-data-app

    python_version: 3.11

In this example we will run a pipeline that computes the average waiting time for a yellow taxi across all the neighborhoods in NYC. As usual, all functions are fully containerized, so we can use different libraries to compute each step, like PyArrow for the first two models and DuckDB for the last one. These are arbitrary implementation choices; feel free to refactor the pipeline as you wish. The pipeline looks like this:

  • The model normalized_taxi_trips gets some raw data from the table taxi_fhvhv in the data catalog and does some cleaning where needed.

  • The model taxi_trip_waiting_time calculates the time between calling a cab and its arrival for each row.

  • The model zone_avg_waiting_times computes the average waiting times aggregated by pickup zones.

flowchart LR
    id0[(taxi_fhvhv)] --> id2[models.normalized_taxi_trips]
    id2[models.normalized_taxi_trips] --> id3[models.taxi_trip_waiting_time]
    id3[models.taxi_trip_waiting_time] --> id4[models.zone_avg_waiting_times]

The expectation test

The file expectations.py in the project folder contains an expectation test from bauplan.standard_expectations. Bauplan’s library comes with a number of standard tests to cover the most common use cases (e.g. expect a column mean to be greater/smaller than a value, expect some/no/all nulls, expect all/no unique values, etc.). Of course, you don’t have to use our expectations: you can write your own or use other libraries like Great Expectations. For a more detailed description of Bauplan’s expectation library, please refer to our documentation. To calculate the waiting times for each trip, it is crucial that the on_scene_datetime column has no null values. So we write a simple Python function that runs the expect_column_no_nulls test on the normalized_taxi_trips model, and returns True if it passes and False if it fails. Note that we need a special decorator, @bauplan.expectation.

As you can see, the expectation test in this example will stop the pipeline if it fails by asserting the test result with _is_expectation_correct. However, this isn’t always necessary. We can change the behavior by removing the assert and printing the test result instead (see the commented code in the function). This flexibility allows us to decide what to do depending on the severity of data quality issues. Minor issues might not need to stop the workflow, while critical ones should halt the pipeline to prevent catastrophic data quality problems. Depending on the use case, we can program different workflows.
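As a sketch of the two behaviors (the function name and the severity flag are illustrative, not Bauplan API):

```python
def handle_expectation(passed: bool, critical: bool = True) -> None:
    if critical:
        # Hard stop: a failed assert raises and halts the pipeline
        assert passed, "expectation test failed"
    else:
        # Soft check: report the outcome and let the pipeline continue
        print(f"expectation test passed: {passed}")

handle_expectation(True, critical=True)    # passes silently, pipeline continues
handle_expectation(False, critical=False)  # only prints the failed result
```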

Try it yourself

In this example, we will make things a bit more interactive. Let us run the pipeline as is and see what happens:

bauplan branch create <YOUR_USERNAME>.expectations
bauplan branch checkout <YOUR_USERNAME>.expectations
bauplan run

Oh, shoot! Something is wrong with our data: clearly there are null values in the column on_scene_datetime.

expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |  ===> We are now checking for null values in on_scene_datetime
expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |  ---> Exception occurred: AssertionError(expectation test failed: we expected on_scene_datetime to have no null values)
expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |  ---> Stack trace
expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |  --->
expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |  ---> File '/bpln/bbff4e04/76a82274/snapshot/expectations.py', line 45, in test_null_values_on_scene_datetime
expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |  --->     assert _is_expectation_correct, "expectation test failed: we expected on_scene_datetime to have no null values"
expectation.exp.test_null_values_on_scene_datetime @ expectations.py:22 |
2024-06-03 11:11:59 ERR Task failed
2024-06-03 11:12:00 ERR runtime task (exp[3.11].test_null_values_on_scene_datetime @ expectations.py:22) failed due to: AssertionError: expectation test failed: we expected on_scene_datetime to have no null values jobID=ee6c19c3-dc27-4b19-a8c9-53be47a84cbb

What do we do now? Well, we can go into the pipeline code and exclude the rows causing the problem. To do that, simply open models.py, go to the function normalized_taxi_trips, and uncomment this line:

pickup_location_table = pc.drop_null(pickup_location_table)

This line drops all the rows containing null values, which should solve our problem. To make sure that it works, run the pipeline again. Does it work? Hurray!

👏👏Congratulations: you just used your first bauplan expectation test.👏👏


With this example we have demonstrated how:

  • to use Bauplan to create and run expectation tests to ensure data quality in a data pipeline. By leveraging Bauplan’s standard expectations library, we validated that the on_scene_datetime column contains no null values, which is crucial for accurate waiting time calculations.

  • expectation tests can halt the pipeline when critical data quality issues are detected, allowing developers to address problems before they impact production. This example highlights Bauplan’s ability to integrate robust data quality checks seamlessly into data workflows, ensuring reliable and accurate data processing.