Data Quality and Expectations¶
In this example, we illustrate how to use expectation tests. These tests are statistical and quality checks applied to Bauplan models to ensure that the structure and values of the data meet our expectations. Expectation tests help detect data quality issues early and can be incorporated seamlessly into various workflows. We can validate the output of our models by adding expectations as steps in a data pipeline, allowing us to catch quality issues before artifacts or pipelines are deployed to production. A key use case is ensuring the validity of data at ingestion—when the system receives data from external sources, such as other teams or vendors. Expectations also capture domain-specific knowledge about the data. When well-documented, they contribute to comprehensive data documentation and promote the adoption of best practices, such as data contracts. From a technical perspective, expectations are functions that take a tabular artifact (like a table or dataframe) as input and return a boolean value:
True if the data meets the expectation and the test passes.
False if the data fails to meet the expectation, indicating a compliance issue.
This approach ensures data quality throughout the pipeline and improves reliability by enforcing standards and facilitating smoother collaboration. Because bauplan leverages vectorized Single Instruction, Multiple Data (SIMD) operations, expectation tests can process large tables very quickly.
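As a minimal sketch of this contract, the snippet below implements a no-nulls check in plain Python. The table is modeled as a dictionary of column lists and the function name column_has_no_nulls is hypothetical; it is an illustrative stand-in, not bauplan's own implementation.

```python
from typing import Any

# Hypothetical stand-in for a tabular artifact: column name -> list of values.
Table = dict[str, list[Any]]


def column_has_no_nulls(table: Table, column: str) -> bool:
    """Return True if no value in `column` is None, False otherwise."""
    return all(value is not None for value in table[column])


clean = {"on_scene_datetime": ["2024-01-01 10:00", "2024-01-01 10:05"]}
dirty = {"on_scene_datetime": ["2024-01-01 10:00", None]}

print(column_has_no_nulls(clean, "on_scene_datetime"))  # True
print(column_has_no_nulls(dirty, "on_scene_datetime"))  # False
```

The caller decides what to do with the boolean: log it, raise an error, or stop the run.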
Set up¶
Install bauplan.
pip install bauplan --upgrade
Make sure you have a bauplan_project.yml file in your pipeline folder with a unique project id and a project name.
project:
    id: bde138c0-0c48-4f37-a2be-cc55c8e8504a
    name: data-quality-expectations
The pipeline¶
In this example we will run a pipeline that computes the average waiting times for a yellow taxi for all the neighborhoods in NYC. As usual, all functions are fully containerized, so we can use different libraries to compute each step, like PyArrow for the first two models and DuckDB for the last one. These are arbitrary implementation choices, so feel free to refactor the pipeline as you wish. The pipeline looks like this:
The model normalized_taxi_trips gets some raw data from the table taxi_fhvhv in the data catalog and does some cleaning where needed.
The model taxi_trip_waiting_time calculates the time between calling a cab and its arrival for each row.
The model zone_avg_waiting_times computes the average waiting times aggregated by pickup zones.
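The logic of the last two steps can be sketched in plain Python as follows. This is only an illustration, not the code in models.py: the column names request_datetime and PULocationID and the sample rows are assumptions, and the actual pipeline uses PyArrow and DuckDB.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample rows; request_datetime and PULocationID are assumed column names.
trips = [
    {"PULocationID": 1, "request_datetime": datetime(2024, 1, 1, 10, 0),
     "on_scene_datetime": datetime(2024, 1, 1, 10, 4)},
    {"PULocationID": 1, "request_datetime": datetime(2024, 1, 1, 11, 0),
     "on_scene_datetime": datetime(2024, 1, 1, 11, 8)},
    {"PULocationID": 2, "request_datetime": datetime(2024, 1, 1, 12, 0),
     "on_scene_datetime": datetime(2024, 1, 1, 12, 3)},
]


def add_waiting_time(rows):
    """Add the minutes between the ride request and the cab's arrival to each row."""
    out = []
    for row in rows:
        delta = row["on_scene_datetime"] - row["request_datetime"]
        out.append({**row, "waiting_time_minutes": delta.total_seconds() / 60})
    return out


def avg_waiting_time_by_zone(rows):
    """Average the waiting time per pickup zone."""
    by_zone = defaultdict(list)
    for row in rows:
        by_zone[row["PULocationID"]].append(row["waiting_time_minutes"])
    return {zone: sum(values) / len(values) for zone, values in by_zone.items()}


zone_avg = avg_waiting_time_by_zone(add_waiting_time(trips))
print(zone_avg)  # {1: 6.0, 2: 3.0}
```

In this sketch, a None in on_scene_datetime would make the subtraction raise a TypeError, which shows why the waiting-time step depends on that column being complete.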
The expectation test¶
The file expectations.py in the project folder contains an expectation test from bauplan.standard_expectations. bauplan’s library comes with a number of standard tests that cover the most common use cases (e.g. expect a column mean to be greater/smaller than a given value, expect some/no/all nulls, expect all/no unique values, etc.). Of course, you don’t have to use our expectations. You can write your own or use other libraries like Great Expectations. For a more detailed description of the bauplan expectation library, please refer to our documentation.
To calculate the waiting times for each trip, it’s crucial that the on_scene_datetime column has no null values. So we write a simple Python function that runs the expect_column_no_nulls test on the normalized_taxi_trips model and returns True if the test passes and False if it fails. Note that the function needs a special decorator, @bauplan.expectation.
The expectation test in this example stops the pipeline if it fails, because it asserts the test result stored in _is_expectation_correct. However, this isn’t always necessary. We can change the behavior by removing the assert and printing the test result instead (see the commented code in the function). This flexibility allows us to decide what to do depending on the severity of the data quality issue. Minor issues might not need to stop the workflow, while critical ones should halt the pipeline to prevent catastrophic data quality problems. Depending on the use case, we can program different workflows.
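The two behaviors can be sketched as follows. This is a plain-Python illustration rather than the code in expectations.py: the check_no_nulls helper and its halt_on_failure flag are hypothetical, and the table is modeled as a dictionary of column lists.

```python
def check_no_nulls(table, column, halt_on_failure=True):
    """Run a no-nulls check and either halt (assert) or just report the result."""
    is_correct = all(value is not None for value in table[column])
    if halt_on_failure:
        # Critical check: a failed assertion raises and stops the run.
        assert is_correct, (
            f"expectation test failed: we expected {column} to have no null values"
        )
    else:
        # Non-critical check: report the outcome and let the run continue.
        print(f"expectation test on {column} passed: {is_correct}")
    return is_correct


table = {"on_scene_datetime": ["2024-01-01 10:04", None]}

# Lenient mode: prints the result and returns False.
result = check_no_nulls(table, "on_scene_datetime", halt_on_failure=False)

# Strict mode: raises AssertionError for the same data.
try:
    check_no_nulls(table, "on_scene_datetime")
    halted = False
except AssertionError:
    halted = True
```

A flag like this is one possible design; keeping separate strict and lenient expectation functions would work just as well.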
Try it yourself¶
In this example, we will make things a bit more interactive. Let us run the pipeline as is and see what happens:
bauplan branch create <YOUR_USERNAME>.expectations
bauplan branch checkout <YOUR_USERNAME>.expectations
bauplan run
Oh, shoot! Something is wrong with our data: clearly there are null values in the column on_scene_datetime.
expectation. @ expectations.py:24 | ---> Exception occurred: (expectation test failed: we expected on_scene_datetime to have no null values)
expectation. @ expectations.py:24 | ---> Stack trace:
expectation. @ expectations.py:24 | --->
expectation. @ expectations.py:24 | ---> File 'expectations.py', line 45, in test_null_values_on_scene_datetime
expectation. @ expectations.py:24 | ---> assert _is_expectation_correct, f"expectation test failed: we expected {column_to_check} to have no null values"
expectation. @ expectations.py:24 |
2024-10-28 16:13:38 ERR Task failed
2024-10-28 16:13:39 ERR expectation returned with exception: expectation test failed: we expected on_scene_datetime to have no null values jobID=17f24c93-d440-405f-9b3d-4f637be9d94c
What do we do now? Well, we can go into the pipeline code and make sure to exclude the rows causing the problem. To do that, simply open models.py, go into the function normalized_taxi_trips and uncomment this line:
pickup_location_table = pc.drop_null(pickup_location_table)
This line will drop all the rows where there are null values, which should solve our problem. To make sure that it works, run the pipeline again. Does it work? Hurray!
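To see why this fixes the failing test, here is a plain-Python sketch of what dropping rows with nulls does. It mimics the row-wise behavior of pc.drop_null on a table, where a row is removed if any of its columns is null. The helper name drop_null_rows, the column name PULocationID, and the sample data are made up for illustration.

```python
def drop_null_rows(table):
    """Keep only the rows in which every column has a non-null value."""
    columns = list(table)
    n_rows = len(table[columns[0]]) if columns else 0
    keep = [
        i for i in range(n_rows)
        if all(table[c][i] is not None for c in columns)
    ]
    return {c: [table[c][i] for i in keep] for c in columns}


raw = {
    "PULocationID": [1, 2, 3],
    "on_scene_datetime": ["2024-01-01 10:04", None, "2024-01-01 12:03"],
}

cleaned = drop_null_rows(raw)
print(cleaned["PULocationID"])  # [1, 3]
print(all(v is not None for v in cleaned["on_scene_datetime"]))  # True
```

Because the whole row is removed when any column is null, this kind of fix can discard more rows than the ones with a missing on_scene_datetime.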
👏👏Congratulations: you just used your first bauplan expectation test.👏👏
Summary¶
With this example we have demonstrated how:
to use bauplan to create and run expectation tests to ensure data quality in a data pipeline. By leveraging bauplan’s standard expectations library, we validated that the on_scene_datetime column contains no null values, which is crucial for accurate waiting time calculations.
expectation tests can halt the pipeline when critical data quality issues are detected, allowing developers to address problems before they impact production.
This example highlights bauplan’s ability to integrate robust data quality checks seamlessly into data workflows, ensuring reliable and accurate data processing.