S3 Data Connector Deployment Guide
Production operating guide for the S3 data connector covering IAM authentication, credential chains, file-format tuning, metrics, and observability.
Authentication & Secrets
S3 authentication is selected via s3_auth:
| Value | Behavior |
|---|---|
| (unset) | Default AWS credential chain (IAM-based). Equivalent to iam_role with iam_role_source: auto. |
| iam_role | Load credentials from the AWS credential chain; the source is further narrowed by iam_role_source. |
| key | Use the explicit s3_key / s3_secret pair. Required for S3-compatible stores that do not speak IAM (MinIO, Cloudflare R2 with keys, Backblaze B2, etc.). |
| public | Unauthenticated access for public buckets. |
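As a minimal sketch of how an s3_auth value from the table above might appear in a dataset definition, the fragment below reads a public bucket without credentials. The bucket path and dataset name are hypothetical, and the surrounding datasets / from / name layout and the file_format parameter are assumptions about the Spicepod structure rather than something this guide specifies.

```yaml
datasets:
  # Hypothetical public bucket and dataset name, for illustration only.
  - from: s3://example-public-bucket/events/
    name: public_events
    params:
      file_format: parquet   # assumed file-format parameter
      s3_auth: public        # unauthenticated access; no key or role is consulted
```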
IAM Role Source
When s3_auth is unset or iam_role, the credential source is controlled by iam_role_source:
| Value | Behavior |
|---|---|
| auto | Default AWS credential chain (env vars → shared credentials file → IMDS/ECS/IRSA). |
| metadata | Restrict to instance/container metadata only: IMDS (EC2), ECS task role, EKS IRSA (pod role). |
| env | Restrict to environment variables only (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN). |
For production on EKS or ECS, prefer iam_role_source: metadata to guarantee the runtime only draws credentials from the workload identity, never from ambient environment variables.
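The recommendation above could be expressed roughly as follows. This is a sketch under assumptions: the bucket, dataset name, and region are placeholders, and the datasets / from / name layout and file_format parameter are assumed rather than taken from this guide.

```yaml
datasets:
  # Hypothetical bucket and dataset name.
  - from: s3://example-prod-bucket/orders/
    name: orders
    params:
      file_format: parquet        # assumed file-format parameter
      s3_auth: iam_role           # use the AWS credential chain
      iam_role_source: metadata   # only IMDS / ECS task role / EKS IRSA
      s3_region: us-east-1        # placeholder; use the bucket's region
```

With iam_role_source: metadata, AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY values present in the pod environment are not consulted, so a leftover debugging credential cannot silently override the workload identity.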
Key Auth for S3-Compatible Stores
For MinIO, R2, B2, or on-prem S3 gateways:
params:
  s3_auth: key
  s3_key: ${secrets:s3_key}
  s3_secret: ${secrets:s3_secret}
  s3_endpoint: https://minio.internal:9000
  s3_region: us-east-1
Keys must be sourced from a secret store in production. See Secret Stores.
Region Validation
s3_region is validated against AWS's known region set and must be lowercase. Invalid regions are rejected at startup. Custom S3-compatible endpoints still require a valid-looking AWS region code.
Resilience Controls
Retry Behavior
S3 I/O uses the AWS SDK's default retry strategy: standard adaptive backoff with retries on throttling (SlowDown, 503) and transient network errors. Per-operation retry parameters are not currently exposed at the Spice layer.
Permanent Failures
Authentication failures (401, 403) and missing buckets (404) surface immediately as query errors. Unlike the Databricks connector, the S3 connector does not permanently disable itself: each subsequent query re-attempts authentication, so transient IAM or network issues self-heal.
Capacity & Sizing
- Object store throughput: S3 scales horizontally per prefix. For large Parquet workloads, partition data by date or tenant to maximize parallel reads.
- Hive partitioning: Enable hive_partitioning_enabled: true when listing partitioned datasets so DataFusion can prune irrelevant partitions at plan time instead of listing and filtering at execution time.
- Schema inference cost: On first registration, Spice samples files to infer schema. Provide an explicit schema in the dataset definition for large datasets to avoid repeated list/head operations.
- DataFusion batch size: Object-store reads yield 8192-row record batches by default. Increase via runtime tuning for CPU-bound scans over compressed formats.
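To illustrate the partitioning and Hive-pruning points above, here is a sketch of a partitioned layout and a dataset that enables pruning. The bucket, prefixes, partition keys, and dataset name are hypothetical, and the datasets / from / name layout and file_format parameter are assumptions; only hive_partitioning_enabled comes from this guide.

```yaml
# Hypothetical bucket layout, partitioned by date and tenant:
#   s3://example-bucket/events/date=2024-01-01/tenant=acme/part-0000.parquet
#   s3://example-bucket/events/date=2024-01-02/tenant=acme/part-0000.parquet
datasets:
  - from: s3://example-bucket/events/
    name: events
    params:
      file_format: parquet              # assumed file-format parameter
      hive_partitioning_enabled: true   # expose key=value path segments for pruning
```

With a layout like this, a query that filters on date should only need to touch the matching date=... prefixes rather than every object under events/.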
Metrics
S3 I/O metrics are collected via the shared runtime-object-store layer (request counts, retries, bytes read) and are exposed through Spice's runtime metrics. See Component Metrics for configuration.
The connector does not currently register S3-specific dataset-level instruments. Monitor S3 health via:
- Standard AWS CloudWatch metrics on the bucket (AllRequests, 4xxErrors, 5xxErrors, TotalRequestLatency).
- Spice's query-execution metrics (query_duration_ms, query_processed_rows) from runtime.metrics.
Task History
S3 object reads participate in Spice task history through DataFusion's object-store plan nodes. Individual object GETs are attributed to their enclosing sql_query or accelerated_table_refresh task via the DataFusion execution plan.
Known Limitations
- Writes are not supported; the S3 connector is read-only.
- S3 Express One Zone directory buckets are supported transparently via s3:// URIs when the region and endpoint match.
- Server-side encryption with customer-provided keys (SSE-C) is not exposed; SSE-S3 and SSE-KMS work transparently when the role/user has KMS decrypt permission.
- Requester-pays buckets are not currently supported.
- Cross-region access incurs AWS data-transfer charges; place Spice in the same region as the bucket for best cost and latency.
Troubleshooting
| Symptom | Likely cause | Resolution |
|---|---|---|
| The request signature we calculated does not match the signature you provided | Clock skew or wrong s3_key/s3_secret. | Verify secret values; check system clock (AWS tolerates only ~15 min drift). |
| Access Denied | IAM policy lacks s3:GetObject or s3:ListBucket. | Attach a policy granting read on the bucket and prefix. Cross-account buckets also need a bucket policy. |
| NoSuchBucket | Bucket does not exist in the configured region. | Confirm bucket name and s3_region. |
| EnvCredentialsNotSet on EKS | iam_role_source: env while running under IRSA. | Set iam_role_source: metadata or auto. |
| InvalidSignatureException against MinIO/R2 | s3_endpoint not set or AWS SDK trying to sign for AWS S3. | Set s3_endpoint and s3_region to match the S3-compatible provider. |
| Slow queries on large partitioned datasets | Hive partitioning not enabled; every scan lists all files. | Set hive_partitioning_enabled: true and encode partitions as key=value/ in the path. |
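For the Access Denied row above, a read-only policy might look like the following CloudFormation fragment. It is a sketch, not a vetted policy: the resource name, bucket, and prefix are hypothetical, and attaching the policy to the role Spice runs under is left out.

```yaml
Resources:
  # Hypothetical managed policy; attach it to the role used by the Spice runtime.
  SpiceS3ReadPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          # Listing is granted on the bucket ARN itself.
          - Effect: Allow
            Action: s3:ListBucket
            Resource: arn:aws:s3:::example-bucket
          # Object reads are granted on object ARNs under the prefix.
          - Effect: Allow
            Action: s3:GetObject
            Resource: arn:aws:s3:::example-bucket/events/*
```

Note that s3:ListBucket applies to the bucket ARN while s3:GetObject applies to object ARNs; granting both on the same resource string is a common reason a policy appears correct but still returns Access Denied.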
