From unstructured to structured data with LLMs
Overview
This is a Bauplan reference implementation demonstrating how to transform unstructured data from financial PDFs (SEC 10-Q filings) into structured, analyzable tabular datasets using Large Language Models (LLMs). The pipeline ingests raw PDFs, extracts relevant financial data, and structures it into a final dataset suitable for downstream analysis and visualization.
Use Case
Given a set of financial PDFs from different companies, we aim to convert unstructured information into structured tables that:
- Reside in object storage alongside the raw files
- Can be produced without ad hoc infrastructure
- Are cost-efficient, versioned, and easily replicable
To achieve this, we use an LLM-powered transformation pipeline within Bauplan, which offers:
- A Python runtime optimized for LLM calls
- An out-of-the-box DAG abstraction for structuring tabular dependencies
- Iceberg-backed data persistence, including:
  - Data branching for safe experimentation
  - Time-travel capabilities for reproducibility
  - Transactional guarantees for consistency
The final dataset is explored using a simple Streamlit application that fetches data directly from Bauplan via its Python APIs.
Credits: The financial PDFs used in this example come from the Llama Index SEC 10-Q dataset.
Setup
Python Environment
To set up the environment, ensure Python >=3.10 is installed and use a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Bauplan Setup
- Join the Bauplan sandbox, sign in, and create your username and API key.
- Complete the 3-minute tutorial to familiarize yourself with the platform.
Managing Secrets
The pipeline requires API keys for OpenAI and AWS (if using S3):
```bash
bauplan parameter set --name openai_api_key --value your_key --type secret
```
These secrets are encrypted and stored in bauplan_project.yml for secure access at runtime.
S3 Configuration
Ensure you have write access to the S3 bucket specified in run.py: the bucket will be used to store raw PDF and metadata files. When running DAGs in the Bauplan sandbox, buckets must be publicly readable for successful data import.
Data Flow
The end-to-end use case is managed by the run.py script, which leverages the Bauplan SDK to orchestrate data processing. The workflow consists of the following steps:
1. Data Ingestion: Local PDF files containing financial data are uploaded to S3 object storage for durability and performance.
2. Metadata Management: A table in Bauplan stores metadata (S3 locations, company, quarter, etc.), ensuring efficient filtering and access. This step and all subsequent Bauplan operations occur safely within an isolated data branch.
3. LLM Processing: The pipeline in bpln_pipeline performs:
   - Unstructured-to-structured transformation via an LLM
   - Post-processing in Python to refine extracted data
   - Storage of the final structured table within the same namespace
4. Production Deployment: If no errors occur, the temporary branch is merged into production, making the result of the pipeline available for further analysis.
5. Data Visualization: The Streamlit app in app provides a simple web interface to explore the transformed dataset.
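Step 3 hinges on turning free-form LLM output into clean rows. Below is a minimal sketch of that kind of post-processing, assuming (hypothetically) the LLM is prompted to reply with one JSON object per filing containing `company`, `quarter`, and `usd` fields; the real prompt and schema live in the bpln_pipeline models.

```python
import json

def parse_llm_reply(reply: str) -> dict:
    """Validate one LLM reply and coerce it into a structured row.

    The field names (company, quarter, usd) are illustrative; the
    actual schema is defined by the models in bpln_pipeline.
    """
    record = json.loads(reply)
    row = {
        "company": str(record["company"]).strip(),
        "quarter": str(record["quarter"]).strip(),
        # LLMs often return numbers as strings with thousands separators.
        "usd": float(str(record["usd"]).replace(",", "")),
    }
    if row["usd"] < 0:
        raise ValueError(f"negative USD amount: {row['usd']}")
    return row

reply = '{"company": "Uber", "quarter": "Q3 2022", "usd": "8,343,000,000"}'
print(parse_llm_reply(reply)["usd"])  # 8343000000.0
```

Validating LLM output eagerly like this means malformed replies fail inside the branch, before anything is merged to production.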
Note: The code includes extensive comments for pedagogical purposes. Contact the Bauplan team for further inquiries.
Running the Pipeline
Execute the End-to-End Pipeline
Run the following command to process PDFs and generate a structured dataset:
```bash
cd src
python run.py --ingestion-branch your_username.ingestion_branch
```
Since branches in Bauplan are user-specific, use the user.branch_name pattern for isolation.
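If you invoke the pipeline from your own scripts, a tiny helper (hypothetical, not part of run.py) can build branch names that follow the user.branch_name convention:

```python
def ingestion_branch(username: str, name: str = "ingestion_branch") -> str:
    """Build a user-scoped branch name (Bauplan's user.branch_name pattern)."""
    return f"{username}.{name}"

print(ingestion_branch("jdoe"))  # jdoe.ingestion_branch
```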
Verify Results
Check the generated table:
```bash
bauplan branch checkout main
bauplan table get my_pdfs.sec_10_q_analysis
```
Query the maximum value of a column:
```bash
bauplan query "SELECT MAX(usd) as max_usd FROM my_pdfs.sec_10_q_analysis"
```
Exploring Results in Streamlit
Launch the visualization app:
```bash
cd app
streamlit run explore_analysis.py
```
The app will open in your browser, displaying insights from the extracted financial data.
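Once rows come back from Bauplan's Python APIs, the app only needs plain-Python aggregation to render its charts. A sketch of the kind of per-company rollup it might compute, with hypothetical column names matching the query above:

```python
from collections import defaultdict

def max_usd_by_company(rows: list[dict]) -> dict[str, float]:
    """Compute the largest reported USD figure per company.

    `rows` mimics the list-of-dicts shape of a query result converted
    to Python objects; the column names are illustrative.
    """
    best: dict[str, float] = defaultdict(float)
    for row in rows:
        best[row["company"]] = max(best[row["company"]], row["usd"])
    return dict(best)

rows = [
    {"company": "Uber", "usd": 8.3e9},
    {"company": "Lyft", "usd": 1.1e9},
    {"company": "Uber", "usd": 9.0e9},
]
print(max_usd_by_company(rows))  # {'Uber': 9000000000.0, 'Lyft': 1100000000.0}
```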
Summary
This example demonstrates how Bauplan can:
- Handle complex data transformations with LLMs
- Efficiently manage unstructured-to-structured data pipelines
- Provide safe and reproducible data processing via branching
- Securely integrate with cloud services
- Enable rapid prototyping and analysis through Python APIs
License
The code in this repository is released under the MIT License.