Import
This guide explains how to import data into bauplan's data catalog as Iceberg tables.
General Requirements
- Data must be in Parquet or CSV format
- Data must be in S3 - local files cannot be imported directly
- S3 bucket must have proper permissions configured:
  - For bauplan's clients, this is a one-off operation done during onboarding when pairing your data with the system
  - For sandbox users, see sandbox requirements below
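As a pre-flight check for the first two requirements, a small helper can reject search URIs that are not on S3 or that do not name Parquet or CSV files. This is a minimal sketch: `is_importable_uri` is a hypothetical helper, not part of bauplan, and it assumes the file format can be inferred from the file extension in the URI.

```python
from urllib.parse import urlparse

# Formats listed in the requirements above; matching on the file extension
# is an assumption made for this sketch.
SUPPORTED_EXTENSIONS = (".parquet", ".csv")


def is_importable_uri(search_uri: str) -> bool:
    """Return True if the URI points at S3 and names Parquet or CSV files."""
    parsed = urlparse(search_uri)
    # Local paths and other schemes (file://, https://) are rejected.
    if parsed.scheme != "s3" or not parsed.netloc:
        return False
    return parsed.path.lower().endswith(SUPPORTED_EXTENSIONS)
```

For example, `is_importable_uri('s3://your-bucket/*.parquet')` is true, while a local path such as `/tmp/data.csv` is rejected.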
Sandbox Environment Requirements
When using the bauplan sandbox (beta environment), additional requirements apply:
- S3 bucket must be publicly readable
- S3 bucket must have ListObject permission enabled
These additional requirements exist because the sandbox runs in an isolated EC2 instance that can only access public data. In a production environment, the EC2 instance would be privately linked to your bucket through IAM.
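One way to meet both sandbox requirements is a bucket policy that grants anonymous read and list access. The sketch below builds such a policy as a Python dictionary. It assumes that the "ListObject" permission mentioned above corresponds to the `s3:ListBucket` IAM action and that public reads map to `s3:GetObject`; the bucket name and the `public_read_policy` helper are placeholders. The account's Block Public Access settings may also need to permit public policies.

```python
import json


def public_read_policy(bucket: str) -> dict:
    """Build a bucket policy allowing anonymous object reads and listing."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Anonymous download of every object in the bucket.
                "Sid": "PublicReadObjects",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Anonymous listing of the bucket's keys (assumed to be the
                # "ListObject" permission the sandbox requires).
                "Sid": "PublicListBucket",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


# Print the policy as JSON so it can be pasted into the bucket's permissions.
print(json.dumps(public_read_policy("your-bucket"), indent=2))
```

A policy like this makes the whole bucket world-readable, so it is only suitable for data that is meant to be public.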
Import Process Overview
The import process in bauplan consists of two main steps:
- Create Table: Define the table schema based on your data files
- Import Data: Load the data into your newly created table
Step 1: Create Table
Create an empty table using the create command:
bauplan table create --name <YOUR_TABLE_NAME> --search-uri 's3://your-bucket/*.parquet'
This command will:
- Analyze your Parquet/CSV files to determine the schema
- Create an empty table with the appropriate structure
- Not yet import any data
Step 2: Import Data
After creating the table, import the data:
bauplan table import --name <YOUR_TABLE_NAME> --search-uri 's3://your-bucket/*.parquet'
You can also perform imports programmatically using the bauplan Python SDK:
import bauplan
client = bauplan.Client()
# Create the table
client.create_table(
    table='my_table_name',
    search_uri='s3://path/to/my/files/*.parquet',
    branch='my_branch_name'
)
# Import the data
state = client.import_data(
    table='my_table_name',
    search_uri='s3://path/to/my/files/*.parquet',
    branch='my_branch_name'
)
# Check for errors during import
if state.error:
    print(f"Import failed: {state.error}")
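The two SDK calls can also be wrapped in a single function that fails loudly instead of printing. The following is a minimal sketch, not an official bauplan helper: `create_and_import` is a hypothetical name, and it relies only on the `create_table` and `import_data` calls and the `state.error` attribute shown above.

```python
def create_and_import(client, table: str, search_uri: str, branch: str) -> None:
    """Create the table, load the data, and raise if the import reports an error."""
    # Step 1: create the empty table from the schema of the source files.
    client.create_table(table=table, search_uri=search_uri, branch=branch)
    # Step 2: load the data into the newly created table.
    state = client.import_data(table=table, search_uri=search_uri, branch=branch)
    # Turn a reported import error into an exception the caller must handle.
    if state.error:
        raise RuntimeError(f"Import of {table!r} failed: {state.error}")
```

It would be called as `create_and_import(bauplan.Client(), 'my_table_name', 's3://path/to/my/files/*.parquet', 'my_branch_name')`. Raising an exception makes a failed import harder to overlook in scripts and scheduled jobs.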
Handling Schema Conflicts
If schema conflicts occur between files during import:
- Generate an import plan:
bauplan table create-plan --name <YOUR_TABLE_NAME> \
--search-uri 's3://your-bucket/*.parquet' \
--save-plan table_creation_plan.yml
- Review the table_creation_plan.yml file for conflicts
- Modify the schema as needed
- Ensure the conflicts field is empty (conflicts: [])
- Apply the modified plan:
bauplan table create-plan-apply --plan table_creation_plan.yml
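Before applying an edited plan, the "conflicts field is empty" condition can be checked with a quick script. The sketch below is a rough heuristic under stated assumptions: the layout of the plan file is not described in this guide, so it only looks for lines of the form `conflicts: []` rather than parsing the YAML, and `conflicts_resolved` is a hypothetical helper.

```python
def conflicts_resolved(plan_text: str) -> bool:
    """Return True if every `conflicts:` field in the plan text is an empty list."""
    values = [
        line.split(":", 1)[1].strip()
        for line in plan_text.splitlines()
        if line.strip().startswith("conflicts:")
    ]
    # A plan with no conflicts field at all is treated as unverified, not as success.
    return bool(values) and all(value == "[]" for value in values)
```

A typical use is to read the saved plan with `open('table_creation_plan.yml').read()` and run the apply command only when the function returns true.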
For a complete import reference, including error handling and advanced import options, consult the reference documentation.