Import

This guide explains how to import data into bauplan's data catalog as Iceberg tables.

General Requirements

  • Data must be in Parquet or CSV format
  • Data must be in S3; local files cannot be imported directly
  • S3 bucket must have proper permissions configured:
    • For bauplan's clients, this is a one-off operation done during onboarding when pairing your data with the system
    • For sandbox users, see sandbox requirements below
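The first two requirements can be checked locally before calling bauplan. The sketch below is not part of bauplan: `check_search_uri` is a hypothetical helper, and it assumes that the file extension in the URI reflects the file format, which the guide does not state.

```python
# Hypothetical pre-flight check; not part of the bauplan CLI or SDK.
# Assumption: the extension in the URI reflects the actual file format.
SUPPORTED_EXTENSIONS = (".parquet", ".csv")


def check_search_uri(search_uri: str) -> list[str]:
    """Return a list of problems found in search_uri (empty if none)."""
    problems = []
    # Requirement: data must be in S3, so the URI needs the s3:// scheme.
    if not search_uri.startswith("s3://"):
        problems.append("data must be in S3: URI should start with 's3://'")
    # Requirement: data must be in Parquet or CSV format.
    if not search_uri.lower().endswith(SUPPORTED_EXTENSIONS):
        problems.append("files should be Parquet or CSV (.parquet or .csv)")
    return problems


for uri in ("s3://your-bucket/*.parquet", "/tmp/data.parquet"):
    print(uri, "->", check_search_uri(uri) or "ok")
```

This only inspects the URI string. It cannot tell whether the bucket permissions are configured correctly.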

Sandbox Environment Requirements

When using the bauplan sandbox (beta environment), additional requirements apply:

  • S3 bucket must be publicly readable
  • S3 bucket must have ListObject permission enabled

Note: These additional requirements exist because the sandbox runs in an isolated EC2 instance that can only access public data. In a production environment, the EC2 instance would be privately linked to your bucket through IAM.

Import Process Overview

The import process in bauplan consists of two main steps:

  1. Create Table: Define the table schema based on your data files
  2. Import Data: Load the data into your newly created table

Step 1: Create Table

Create an empty table using the create command:

bauplan table create --name <YOUR_TABLE_NAME> --search-uri 's3://your-bucket/*.parquet'

This command will:

  • Analyze your Parquet/CSV files to determine the schema
  • Create an empty table with the appropriate structure
  • Not import any data yet

Step 2: Import Data

After creating the table, import the data:

bauplan table import --name <YOUR_TABLE_NAME> --search-uri 's3://your-bucket/*.parquet'

You can also perform imports programmatically using the bauplan Python SDK:

import bauplan

client = bauplan.Client()

# Create the table
client.create_table(
    table='my_table_name',
    search_uri='s3://path/to/my/files/*.parquet',
    branch='my_branch_name',
)

# Import the data
state = client.import_data(
    table='my_table_name',
    search_uri='s3://path/to/my/files/*.parquet',
    branch='my_branch_name',
)

# Check for errors during import
if state.error:
    print(f"Import failed: {state.error}")
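If you run both steps often, they can be wrapped in one function that raises on failure instead of printing. This is a minimal sketch: `create_and_import` is a hypothetical helper name, and it relies only on the `create_table`, `import_data`, and `state.error` usage shown above.

```python
def create_and_import(client, table: str, search_uri: str, branch: str) -> None:
    """Create `table` from the files at `search_uri`, then load them into it.

    `client` is expected to behave like bauplan.Client() as used in this guide.
    """
    # Step 1: infer the schema from the files and create the empty table.
    client.create_table(table=table, search_uri=search_uri, branch=branch)

    # Step 2: load the data into the newly created table.
    state = client.import_data(table=table, search_uri=search_uri, branch=branch)

    # Surface import errors as an exception so callers cannot ignore them.
    if state.error:
        raise RuntimeError(f"Import of {table!r} failed: {state.error}")


# Usage, assuming the bauplan package is installed and configured:
#   import bauplan
#   create_and_import(
#       bauplan.Client(),
#       table='my_table_name',
#       search_uri='s3://path/to/my/files/*.parquet',
#       branch='my_branch_name',
#   )
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object in tests.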

Handling Schema Conflicts

If schema conflicts occur between files during import:

  1. Generate an import plan:

     bauplan table create-plan --name <YOUR_TABLE_NAME> \
       --search-uri 's3://your-bucket/*.parquet' \
       --save-plan table_creation_plan.yml

  2. Review the table_creation_plan.yml file for conflicts
  3. Modify the schema as needed
  4. Ensure the conflicts field is empty (conflicts: [])
  5. Apply the modified plan:

     bauplan table create-plan-apply --plan table_creation_plan.yml
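Before applying a plan, you may want to confirm that no conflicts remain in it. The sketch below is an illustration, not a bauplan feature: `find_conflicts` and `load_plan` are hypothetical helpers, and the layout of the plan file is not documented here, so the code simply searches the whole parsed document for any non-empty `conflicts` entry. Loading the file assumes the third-party PyYAML package.

```python
def find_conflicts(node, path="plan"):
    """Yield (location, conflicts) for every non-empty `conflicts` entry.

    Assumption: the plan's layout is unknown, so every nesting level is searched.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}"
            if key == "conflicts" and value:
                yield here, value
            else:
                yield from find_conflicts(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_conflicts(item, f"{path}[{index}]")


def load_plan(plan_path):
    """Parse the saved plan file (requires PyYAML: pip install pyyaml)."""
    import yaml  # imported here so find_conflicts works without PyYAML

    with open(plan_path) as f:
        return yaml.safe_load(f)


# Usage, after running `bauplan table create-plan ... --save-plan table_creation_plan.yml`:
#   remaining = list(find_conflicts(load_plan("table_creation_plan.yml")))
#   if remaining:
#       print("Resolve these before applying the plan:", remaining)
```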

Note: For a complete import reference, including error handling and advanced import options, consult the reference documentation.