
Cloudflare Pipelines + R2 Data Catalog Terraform 2026: Complete Apache Iceberg Pipeline Setup

Step-by-step guide to building data pipelines with Apache Iceberg tables using Terraform — zero egress fees, managed catalog, and SQL transformations
May 5, 2026, 17:27 Eastern Daylight Time

Cloudflare Pipelines now supports Terraform for full lifecycle management. You can create streams, configure R2 Data Catalog sinks, and manage the pipelines that write Apache Iceberg tables entirely through Infrastructure as Code, with no egress fees on the stored data and no catalog service to operate yourself.

What You Will Learn

  • How to configure an R2 bucket with Data Catalog using Terraform
  • Creating pipeline streams, sinks, and SQL transformations
  • Authenticating with scoped API tokens for Iceberg tables
  • Querying data with R2 SQL, Spark, PyIceberg, and DuckDB

Why Cloudflare Pipelines + R2 Data Catalog?

Cloudflare Pipelines lets you ingest streaming data via Workers or HTTP endpoints, transform it with SQL, and write it to R2 as Apache Iceberg tables. The R2 Data Catalog, now available in the Cloudflare Terraform provider (v5.19.0+), manages those Iceberg tables with built-in compaction, time travel, and ACID transaction support.

The key advantage? Zero egress fees. Your analytics queries run against data stored in R2, and you never pay to move that data out. Combined with Terraform for Infrastructure as Code, you can version-control your entire data pipeline stack.

Professional Recommendation

For production workloads, always use scoped API tokens instead of global account tokens. The Terraform configuration below demonstrates the least-privilege approach with specific permission groups for the pipeline sink.

Prerequisites

  • Cloudflare account with R2 and Pipelines enabled
  • Terraform v1.6+ installed locally
  • Cloudflare provider v5.19.0+ in your Terraform configuration
  • API token with R2 Admin Read & Write permissions

Terraform Configuration: Complete Pipeline Setup

This end-to-end Terraform configuration creates a complete data pipeline: an R2 bucket with the data catalog enabled, a scoped API token for the sink, and the stream, sink, and pipeline resources that ingest JSON data into an Apache Iceberg table.

Provider Setup

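terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 5.19"
    }
  }
}

# Input variables referenced throughout this configuration
variable "cloudflare_api_token" {
  type      = string
  sensitive = true
}

variable "account_id" {
  type = string
}

provider "cloudflare" {
  api_token = var.cloudflare_api_token
}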
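
The `~> 5.19` constraint above is Terraform's pessimistic operator: only the rightmost listed component may increase, so it accepts 5.19.x and later 5.x releases but rejects 6.0. A minimal Python sketch of that rule follows; `parse_version` and `satisfies_pessimistic` are hypothetical helper names, and pre-release suffixes are not handled.

```python
def parse_version(text):
    """Turn a dotted version such as '5.19.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_pessimistic(version, constraint):
    """Check `version` against a '~> X.Y[.Z]' style constraint."""
    base = parse_version(constraint)
    if len(base) < 2:
        raise ValueError("'~>' needs at least two version components")
    candidate = parse_version(version)
    # Upper bound: bump the second-to-last component and drop the last one.
    upper = base[:-2] + (base[-2] + 1,)
    width = max(len(candidate), len(base), len(upper))

    def pad(parts):
        # Right-pad with zeros so tuples of different lengths compare cleanly.
        return parts + (0,) * (width - len(parts))

    return pad(base) <= pad(candidate) < pad(upper)


if __name__ == "__main__":
    for release in ("5.18.9", "5.19.0", "5.25.1", "6.0.0"):
        print(release, satisfies_pessimistic(release, "5.19"))
```

Pinning with `~>` lets provider patch and minor releases flow in while keeping a major-version upgrade an explicit decision.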

R2 Bucket with Data Catalog Enabled

resource "cloudflare_r2_bucket" "data_lake" {
  account_id = var.account_id
  name       = "analytics-data-lake"
}

resource "cloudflare_r2_data_catalog" "iceberg_catalog" {
  account_id = var.account_id
  bucket     = cloudflare_r2_bucket.data_lake.name
}

Scoped API Token for Pipeline Sink

data "cloudflare_account_api_token_permission_groups_list" "sink_permissions" {
  account_id = var.account_id

  filter {
    name = "R2"
  }
}

resource "cloudflare_account_api_token" "pipeline_token" {
  account_id = var.account_id
  name       = "pipeline-sink-token"

  policies = [{
    effect = "allow"
    resources = {
      # Restrict the token to the data lake bucket
      "com.cloudflare.r2.bucket.${cloudflare_r2_bucket.data_lake.id}" = "*"
    }
    permission_groups = data.cloudflare_account_api_token_permission_groups_list.sink_permissions.permission_groups[*].id
  }]
}

Pipeline Stream and Sink

resource "cloudflare_pipeline_stream" "events_stream" {
  account_id = var.account_id
  name       = "analytics-events-stream"
}

resource "cloudflare_pipeline_sink" "iceberg_sink" {
  account_id = var.account_id
  name       = "iceberg-sink"
  sink_type  = "r2_data_catalog"

  sink_config {
    r2_data_catalog {
      bucket    = cloudflare_r2_bucket.data_lake.name
      namespace = "analytics"
      table     = "events"
      format    = "parquet"
    }
  }
}

resource "cloudflare_pipeline" "main_pipeline" {
  account_id = var.account_id
  name       = "analytics-pipeline"
  stream_id  = cloudflare_pipeline_stream.events_stream.id
  sink_ids   = [cloudflare_pipeline_sink.iceberg_sink.id]
  sql        = <<-EOT
    SELECT
      event_id,
      event_type,
      timestamp,
      user_id,
      properties,
      created_at
    FROM stream
  EOT
}

This configuration creates a pipeline that receives JSON events via HTTP, runs them through the SQL statement (here a plain projection of six columns), and writes the results to an Apache Iceberg table in R2 Data Catalog. The sink automatically creates the namespace and table if they do not exist.
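
To make the ingest side concrete, here is a rough Python sketch of a client that builds events carrying the six columns selected in the SQL and posts them as a JSON array. The endpoint URL, the bearer-token header, the JSON-array body shape, and the helper names `build_event`, `encode_batch`, and `send_events` are all assumptions for illustration; take the real ingest URL and authentication settings from your stream's configuration.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

# Columns selected by the pipeline SQL in this guide.
EVENT_FIELDS = (
    "event_id", "event_type", "timestamp", "user_id", "properties", "created_at",
)


def build_event(event_type, user_id, properties=None):
    """Return one event dict containing every column the pipeline selects."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "timestamp": now,
        "user_id": user_id,
        "properties": properties or {},
        "created_at": now,
    }


def encode_batch(events):
    """Reject events with missing columns, then serialize the batch as JSON."""
    for event in events:
        missing = [name for name in EVENT_FIELDS if name not in event]
        if missing:
            raise ValueError(f"event is missing fields: {missing}")
    return json.dumps(events).encode("utf-8")


def send_events(endpoint_url, token, events):
    """POST a batch to the stream endpoint (URL and auth scheme are assumed)."""
    request = urllib.request.Request(
        endpoint_url,
        data=encode_batch(events),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Validating the batch on the client keeps malformed events from reaching the pipeline, where a missing column would be harder to trace back to its producer.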

  • Zero egress fees
  • ACID transactions
  • Infrastructure as Code with Terraform

Querying Iceberg Tables

Once your data lands in Iceberg tables, you can query it using multiple engines. R2 Data Catalog exposes a standard Iceberg REST catalog interface, compatible with:

  • R2 SQL: built-in SQL query interface in the Cloudflare dashboard
  • PyIceberg: Python library for Iceberg table operations
  • Apache Spark: Scala and PySpark connectors available
  • DuckDB: fast OLAP queries with Iceberg catalog support

Example: Connecting with PyIceberg

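import os

from pyiceberg.catalog import load_catalog

# Copy the catalog URI and warehouse name from the bucket's Data Catalog
# settings. The environment variable names below are placeholders.
catalog = load_catalog(
    "r2_data_catalog",
    **{
        "type": "rest",
        "uri": os.environ["R2_CATALOG_URI"],
        "warehouse": os.environ["R2_CATALOG_WAREHOUSE"],
        "token": os.environ["R2_CATALOG_TOKEN"],
    },
)

# Table identifiers are "<namespace>.<table>", matching the sink configuration
table = catalog.load_table("analytics.events")

# Materialize the scan as a pandas DataFrame (requires pandas and pyarrow)
print(table.scan().to_pandas())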

R2 Sink vs Iceberg Sink

You can choose between two sink types depending on your data format requirements:

Sink Type        | Format Support | Use Case
R2 Data Catalog  | Parquet only   | Analytics, time travel queries, schema evolution
R2 Bucket (Raw)  | Parquet, JSON  | Raw data storage, backup, ML pipelines

To write raw Parquet or JSON files to R2 instead of Iceberg tables, replace the sink resource with an R2 sink. This requires R2 S3-compatible credentials instead of a catalog token.

Common Mistake to Avoid

R2 Data Catalog sinks only support Parquet format. JSON format is not supported for Iceberg tables. If you need JSON output, use the R2 bucket sink instead.
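
This format rule can be checked before a plan is ever applied. The sketch below encodes the comparison table from this section as a lookup. The `r2` key for the raw bucket sink and the function name `check_sink_format` are illustrative labels, not constants confirmed by the provider.

```python
# Formats each sink type accepts, taken from the comparison table in this guide.
SUPPORTED_FORMATS = {
    "r2_data_catalog": {"parquet"},
    "r2": {"parquet", "json"},
}


def check_sink_format(sink_type, output_format):
    """Raise ValueError if the sink type cannot write the requested format."""
    if sink_type not in SUPPORTED_FORMATS:
        raise ValueError(f"unknown sink type: {sink_type!r}")
    allowed = SUPPORTED_FORMATS[sink_type]
    if output_format.lower() not in allowed:
        raise ValueError(
            f"{sink_type} sinks support {sorted(allowed)}, not {output_format!r}"
        )
    return True


if __name__ == "__main__":
    check_sink_format("r2_data_catalog", "parquet")  # passes silently
    check_sink_format("r2", "json")                  # passes silently
    check_sink_format("r2_data_catalog", "json")     # raises ValueError
```

A check like this fits naturally into a pre-commit hook or CI step that inspects the planned sink settings.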

Clean Up Resources

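When you no longer need the pipeline, tear the stack down. Running terraform destroy with no targets resolves the dependency order on its own; the targeted commands below are only needed when you want to remove resources one at a time, in reverse order of creation: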

# Destroy in reverse order of creation
terraform destroy -target cloudflare_pipeline.main_pipeline
terraform destroy -target cloudflare_pipeline_sink.iceberg_sink
terraform destroy -target cloudflare_pipeline_stream.events_stream
terraform destroy -target cloudflare_account_api_token.pipeline_token
terraform destroy -target cloudflare_r2_data_catalog.iceberg_catalog
terraform destroy -target cloudflare_r2_bucket.data_lake
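
The ordering above follows from the dependency graph: a resource has to go before anything it depends on. As a sketch, Python's standard `graphlib` module can derive such an order. The dependency map below is read off the resource references in this guide and is an assumption about how Terraform resolves them; `destroy_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Each resource mapped to the resources it references (its dependencies).
DEPENDS_ON = {
    "cloudflare_r2_bucket.data_lake": set(),
    "cloudflare_r2_data_catalog.iceberg_catalog": {"cloudflare_r2_bucket.data_lake"},
    "cloudflare_account_api_token.pipeline_token": {"cloudflare_r2_bucket.data_lake"},
    "cloudflare_pipeline_stream.events_stream": set(),
    "cloudflare_pipeline_sink.iceberg_sink": {"cloudflare_r2_bucket.data_lake"},
    "cloudflare_pipeline.main_pipeline": {
        "cloudflare_pipeline_stream.events_stream",
        "cloudflare_pipeline_sink.iceberg_sink",
    },
}


def destroy_order(depends_on):
    """Return resources so that each one appears before its dependencies."""
    # static_order() yields dependencies first, which is the creation order.
    creation_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(creation_order))


if __name__ == "__main__":
    for address in destroy_order(DEPENDS_ON):
        print(f"terraform destroy -target {address}")
```

Because the order is derived from the graph rather than written by hand, adding a resource only means adding one entry to the map.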

Final Verdict

Cloudflare Pipelines + R2 Data Catalog with Terraform provides a production-ready data pipeline architecture. You get managed Iceberg tables with ACID transactions, zero egress fees, and full Infrastructure as Code support. The scoped API tokens ensure least-privilege security, while the REST catalog interface lets you connect your preferred query engines.

Last Updated: May 06, 2026 | Source: Cloudflare Developers Documentation (Official Website)

Frequently Asked Questions

What is R2 Data Catalog?
R2 Data Catalog is a managed Apache Iceberg data catalog built directly into your R2 bucket. It exposes a standard Iceberg REST catalog interface, so you can connect engines like Spark, Snowflake, PyIceberg, and DuckDB. It provides ACID transactions, time travel, and schema evolution without egress fees.

Can Cloudflare Pipelines and R2 Data Catalog be managed with Terraform?
Yes, Terraform support for Cloudflare Pipelines and R2 Data Catalog was added in April 2026 via Cloudflare provider v5.19.0. You can now create streams, sinks, pipelines, and manage Iceberg tables entirely through Infrastructure as Code.

Which output formats do R2 Data Catalog sinks support?
R2 Data Catalog sinks only support Parquet format. JSON format is not supported for Iceberg tables. If you need JSON output, you must use the R2 bucket sink instead of the Data Catalog sink.

What API token permissions are required?
You need an API token with R2 Data Catalog permissions and R2 storage permissions. For production, use scoped API tokens with specific permission groups rather than global account tokens. The token must have R2 Admin Read & Write permissions.

Can I query the Iceberg tables with SQL?
Yes, R2 SQL provides a built-in SQL query interface in the Cloudflare dashboard. You can also connect external engines like Apache Spark (Scala/PySpark), PyIceberg, DuckDB, Snowflake, and Apache Trino to query your Iceberg tables.

Why use Apache Iceberg tables instead of raw Parquet files?
Apache Iceberg tables provide ACID transactions, time travel (query historical snapshots), schema evolution (add/remove columns without rewriting), and better query performance through hidden partitioning. Raw Parquet files lack these features.