From unstructured to structured data with LLMs
Overview
This is a Bauplan reference implementation demonstrating how to transform unstructured data from financial PDFs (SEC 10-Q filings) into structured, analyzable tabular datasets using Large Language Models (LLMs). The pipeline ingests raw PDFs, extracts relevant financial data, and structures it into a final dataset suitable for downstream analysis and visualization.
Use Case
Given a set of financial PDFs from different companies, we aim to convert unstructured information into structured tables that:
Reside in object storage alongside the raw files
Run without ad hoc infrastructure
Are cost-efficient, versioned, and easily replicable
To achieve this, we use an LLM-powered transformation pipeline within Bauplan, which offers:
A Python runtime optimized for LLM calls
Out-of-the-box DAG abstraction for structuring tabular dependencies
Iceberg-backed data persistence, including:
Data branching for safe experimentation
Time-travel capabilities for reproducibility
Transactional guarantees for consistency
The final dataset is explored using a simple Streamlit application that fetches data directly from Bauplan via its Python APIs.
Credits: The financial PDFs used in this example come from the Llama Index SEC 10-Q dataset.
Setup
Python Environment
To set up the environment, ensure Python >=3.10 is installed and use a virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Bauplan Setup
Join the Bauplan sandbox, sign in, and create your username and API key.
Complete the 3-minute tutorial to familiarize yourself with the platform.
Managing Secrets
The pipeline requires API keys for OpenAI and AWS (if using S3):
bauplan parameter set --name openai_api_key --value your_key --type secret
These secrets are encrypted and stored in bauplan_project.yml for secure access at runtime.
S3 Configuration
Ensure you have write access to the S3 bucket specified in run.py: the bucket will be used to store the raw PDF and metadata files. When running DAGs in the Bauplan sandbox, buckets must be publicly readable for successful data import.
Data Flow
The end-to-end use case is managed by the run.py script, which leverages the Bauplan SDK to orchestrate data processing (a sketch of this orchestration follows the list below). The workflow consists of the following steps:
Data Ingestion: Local PDF files containing financial data are uploaded to S3 object storage for durability and performance.
Metadata Management: A table in Bauplan stores metadata (S3 locations, company, quarter, etc.), ensuring efficient filtering and access. This step and all subsequent Bauplan operations occur safely within an isolated data branch.
LLM Processing: The pipeline in bpln_pipeline (see the model sketch at the end of this section) performs:
Unstructured-to-structured transformation via an LLM
Post-processing in Python to refine extracted data
Storage of the final structured table within the same namespace
Production Deployment: If no errors occur, the temporary branch is merged into production, making the result of the pipeline available for further analysis.
Data Visualization: The Streamlit app in app provides a simple web interface to explore the transformed dataset.
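For orientation, the following is a minimal sketch of the kind of orchestration run.py performs, using the Bauplan Python SDK together with boto3. The bucket name, table name, file names, and branch name are placeholders, and exact SDK method signatures can vary between versions; treat this as an illustration and refer to run.py in src for the authoritative implementation.

# Illustrative sketch only; see run.py in src for the real orchestration logic.
import boto3
import bauplan

BUCKET = "my-public-bucket"                  # assumption: a bucket you can write to
BRANCH = "your_username.ingestion_branch"    # user.branch_name pattern
METADATA_TABLE = "pdf_metadata"              # illustrative table name

# 1. Data ingestion: upload the local PDFs (and a metadata parquet file) to S3
s3 = boto3.client("s3")
for local_file in ["apple_10q.pdf", "msft_10q.pdf", "metadata/files.parquet"]:  # illustrative
    s3.upload_file(local_file, BUCKET, local_file)

client = bauplan.Client()  # uses the API key from your local Bauplan profile

# 2. Metadata management: do everything on an isolated branch, never on main
client.create_branch(BRANCH, from_ref="main")
client.create_table(table=METADATA_TABLE, search_uri=f"s3://{BUCKET}/metadata/*.parquet", branch=BRANCH)
client.import_data(table=METADATA_TABLE, search_uri=f"s3://{BUCKET}/metadata/*.parquet", branch=BRANCH)

# 3. LLM processing: run the declarative pipeline in bpln_pipeline against the branch
client.run(project_dir="./bpln_pipeline", ref=BRANCH)

# 4. Production deployment: merge the branch into main once the run succeeds
client.merge_branch(source_ref=BRANCH, into_branch="main")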
Note: The code includes extensive comments for pedagogical purposes. Contact the Bauplan team for further inquiries.
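To make the LLM Processing step concrete, here is a minimal sketch of what a Bauplan Python model performing the unstructured-to-structured transformation could look like. The upstream table name and columns, the prompt, the OpenAI model, and the way the API key is read (an environment variable here) are all assumptions for illustration; the actual models, with full comments, live in bpln_pipeline.

import bauplan

@bauplan.model(materialization_strategy="REPLACE")
@bauplan.python("3.11", pip={"openai": "1.57.0", "pandas": "2.2.0"})
def sec_10_q_analysis(
    # assumption: an upstream model/table holding the extracted PDF text and metadata
    pdfs=bauplan.Model("pdf_texts", columns=["company", "quarter", "raw_text"]),
):
    import os
    import json
    import pandas as pd
    from openai import OpenAI

    # assumption: the key is surfaced to the runtime (e.g. from the secret configured above)
    llm = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    rows = []
    for company, quarter, text in zip(
        pdfs["company"].to_pylist(), pdfs["quarter"].to_pylist(), pdfs["raw_text"].to_pylist()
    ):
        # unstructured -> structured: ask the LLM for a JSON record of financial figures
        resp = llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Extract total revenue in USD from this 10-Q excerpt. "
                    'Answer as JSON: {"metric": "revenue", "usd": <number>}.\n\n' + text[:8000]
                ),
            }],
            response_format={"type": "json_object"},
        )
        record = json.loads(resp.choices[0].message.content)
        # post-processing in Python: normalize types before persisting
        rows.append({"company": company, "quarter": quarter,
                     "metric": record.get("metric"), "usd": float(record.get("usd", 0))})

    # the returned DataFrame is materialized by Bauplan as an Iceberg table
    return pd.DataFrame(rows)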
Running the Pipeline
Execute the End-to-End Pipeline
Run the following command to process PDFs and generate a structured dataset:
cd src
python run.py --ingestion-branch your_username.ingestion_branch
Since branches in Bauplan are user-specific, use the user.branch_name pattern for isolation.
Verify Results
Check the generated table:
bauplan branch checkout main
bauplan table get my_pdfs.sec_10_q_analysis
Query the maximum value of a column:
bauplan query "SELECT MAX(usd) as max_usd FROM my_pdfs.sec_10_q_analysis"
Exploring Results in Streamlit
Launch the visualization app:
cd app
streamlit run explore_analysis.py
The app will open in your browser, displaying insights from the extracted financial data.
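As a pointer to how such an app can fetch data directly from Bauplan, here is a minimal sketch of a Streamlit page built on the Python SDK. The table name matches the query used above; the column names other than usd are assumptions, and the actual code in explore_analysis.py may differ.

# Minimal sketch; see explore_analysis.py in app for the actual application.
import bauplan
import streamlit as st

st.title("SEC 10-Q analysis")

client = bauplan.Client()
# Query the merged table on main; the SDK returns an Arrow table we convert to pandas.
df = client.query(
    "SELECT company, quarter, metric, usd FROM my_pdfs.sec_10_q_analysis",
    ref="main",
).to_pandas()

st.dataframe(df)
st.bar_chart(df, x="company", y="usd")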
Summary
This example demonstrates how Bauplan can:
Handle complex data transformations with LLMs
Efficiently manage unstructured-to-structured data pipelines
Provide safe and reproducible data processing via branching
Securely integrate with cloud services
Enable rapid prototyping and analysis through Python APIs
License
The code in this repository is released under the MIT License.