OLake PoC: Real-Time PostgreSQL to Apache Iceberg CDC Without Kafka

This document summarizes a PoC in which OLake — a lightweight tool that performs CDC directly from PostgreSQL to Apache Iceberg without Kafka — was set up in a local Docker environment and tested for performance, stability, and ease of use in comparison with Debezium.


Introduction

In a previous blog post, we built a CDC pipeline in the PAASUP DIP environment that synchronizes change data from PostgreSQL to StarRocks using Kafka and the Debezium Source Connector. While Debezium is a proven CDC tool, it comes with the operational overhead of managing supplementary infrastructure such as a Kafka cluster and Schema Registry.

Meanwhile, as data lakehouse architectures have gained traction, demand for CDC pipelines targeting Apache Iceberg has grown. For such use cases, Debezium requires an additional path via a Kafka → Iceberg Sink Connector, whereas OLake is a "Lakehouse First" lightweight CDC tool that pushes data directly from the source database to Iceberg — without Kafka.

In this blog post, we set up an OLake-based PostgreSQL → Apache Iceberg CDC pipeline in a local Docker environment and document the process of testing it for performance, stability, and ease of use. This test is part of a PoC to determine whether OLake should be included in the DIP catalog.


Table of Contents

  1. Debezium vs. OLake
  2. OLake Architecture
  3. Infrastructure Setup
  4. Loading Test Data into source-postgresql
  5. Configuring the OLake CDC Pipeline
  6. Querying Iceberg Tables
  7. OLake CDC Metadata Structure
  8. CDC Columns in Iceberg Tables
  9. Test Results
  10. Conclusion

1. Debezium vs. OLake

Debezium, used in a previous blog post, and OLake, tested this time, are both CDC tools — but their design philosophies and architectures are fundamentally different.

Design Philosophy

  • OLake — Takes a "Lakehouse First" approach. It is optimized for pushing data as fast as possible directly from a source database to a data lakehouse such as Apache Iceberg, without the need for a complex Kafka infrastructure.
  • Debezium — Takes an "Event-Driven" approach. It treats database changes as "events" and streams them to Kafka, enabling various downstream systems (search engines, caches, other databases, etc.) to subscribe to those events. It is a general-purpose CDC framework.

Architecture Comparison

| Feature | OLake | Debezium |
| --- | --- | --- |
| Core structure | Single binary/container-based; connects the source DB directly to the destination with no message queue in between. | Kafka ecosystem-based (requires Kafka, ZooKeeper, and Kafka Connect). |
| Data flow | DB → OLake → S3 (Iceberg/Parquet) | DB → Debezium → Kafka → Sink Connector → Destination |
| State management | Stores checkpoints in local or designated storage. | Stores state in internal Kafka topics (offsets, config, status). |
| Scaling | Horizontal scaling is possible, with a simpler overall structure. | Distributed processing by scaling Kafka Connect cluster worker nodes. |

Pros and Cons

| Category | OLake | Debezium |
| --- | --- | --- |
| Pros | 4–10x faster bulk data loads by bypassing Kafka; no ZooKeeper/Kafka to manage, reducing operational burden; automatic schema evolution and Iceberg optimization; lower infrastructure resource usage, reducing cloud costs. | Can deliver data to virtually any destination, not just lakehouses; familiar to organizations already using Kafka, with strong community support; per-message transformation via SMT (Single Message Transforms); widely referenced across enterprise environments. |
| Cons | Currently focused on data lakehouse targets such as Apache Iceberg; smaller community and fewer references compared to Debezium. | Requires Kafka operational expertise (Java, JVM tuning, distributed systems understanding); possible latency at the Kafka storage stage and performance degradation during large-scale snapshots; complex connector configuration and Schema Registry management. |

When to Choose Which?

  • OLake — Best when the primary goal is to quickly move operational DB data to an S3/Iceberg lakehouse for analytics; when there is insufficient staff or time to build and operate Kafka infrastructure; or when initial full load performance is critical.
  • Debezium — Best when event delivery between microservices (MSA) or real-time search indexing is required; when a stable Kafka cluster is already in operation within the organization; or when a general-purpose CDC connecting diverse sources and destinations is needed.

2. OLake Architecture

PostgreSQL (Source)
    │  WAL (Logical Replication)
    ▼
OLake Worker (CDC Connector)
    │  olakego/source-postgres → olake-iceberg-java-writer
    ▼
Lakekeeper (REST Catalog)
    │
    ▼
Iceberg (MinIO, S3-compatible Storage)

OLake runs on top of a Temporal-based workflow engine. It reads the WAL (Write-Ahead Log) from PostgreSQL and converts it into Iceberg tables, using Lakekeeper as the Iceberg REST Catalog and MinIO as the object storage.


3. Infrastructure Setup

All services are deployed as Docker containers and connected via the olake-network bridge network.

Service Configuration

| Service | Container Name | Port | Role |
| --- | --- | --- | --- |
| OLake UI | olake-ui | 8000 | Web UI for managing CDC pipelines |
| OLake Worker | olake-temporal-worker | - | CDC sync execution engine |
| Temporal | temporal | - | Workflow orchestration |
| Temporal PostgreSQL | temporal-postgresql | - | Storage for Temporal and OLake metadata |
| Source PostgreSQL | source-postgresql | 5432 | CDC source database |
| MinIO | minio | 9000 / 9001 | S3-compatible object storage |
| Lakekeeper | lakekeeper | 8181 | Iceberg REST Catalog |

Installing OLake

OLake is installed using the official one-liner script. Temporal, Worker, UI, PostgreSQL, and Elasticsearch are all deployed together.

curl -sfL https://olake.dev/install | sh

Installing Source PostgreSQL and Configuring CDC

Create a separate PostgreSQL container for the source database, independent of OLake.

docker run -d \
  --name source-postgresql \
  -e POSTGRES_USER=admin \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=mydb \
  -p 5432:5432 \
  postgres:16

Enable Logical Replication for CDC.

-- Requires a server restart to take effect
-- (ALTER SYSTEM writes to postgresql.auto.conf, not postgresql.conf)
ALTER SYSTEM SET wal_level = 'logical';

-- Create a Replication Slot and Publication
SELECT pg_create_logical_replication_slot('olake_slot', 'pgoutput');
CREATE PUBLICATION olake_pub FOR ALL TABLES;

Installing MinIO and Creating a Bucket

docker run -d \
  --name minio \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD=password \
  -p 9000:9000 -p 9001:9001 \
  minio/minio server /data --console-address ":9001"

Create an iceberg bucket from the MinIO Console (http://localhost:9001).


Installing Lakekeeper

Lakekeeper is an Iceberg REST Catalog implementation that uses MinIO as its storage backend.

docker run -d \
  --name lakekeeper \
  -p 8181:8181 \
  -e LAKEKEEPER__LISTEN_PORT=8181 \
  -e LAKEKEEPER__BASE_URI=http://host.docker.internal:8181 \
  lakekeeper/catalog:latest

After installation, create the iceberg warehouse. This can also be done from the Lakekeeper UI (http://localhost:8181).

# Create a warehouse (S3 path: s3://iceberg/warehouse/)
curl -X POST http://localhost:8181/management/v1/warehouse \
  -H "Content-Type: application/json" \
  -d '{
    "warehouse-name": "iceberg",
    "storage-profile": {
      "type": "s3",
      "bucket": "iceberg",
      "endpoint": "http://host.docker.internal:9000",
      "path-style-access": true,
      "region": "us-east-1",
      "sts-enabled": true
    },
    "storage-credential": {
      "type": "s3",
      "credential-type": "access-key",
      "aws-access-key-id": "admin",
      "aws-secret-access-key": "password"
    }
  }'


Connecting to the Docker Network

Connect all services to olake-network to enable inter-container communication.

docker network connect olake-network source-postgresql
docker network connect olake-network minio
docker network connect olake-network lakekeeper

4. Loading Test Data into source-postgresql

Three datasets are loaded into the Source PostgreSQL to test the CDC pipeline.

Dataset Summary

| Table | Row Count | Description | Primary Key |
| --- | --- | --- | --- |
| card_subway_month | 36,406 | Seoul subway monthly boarding/alighting data | line_name, station_name, use_date |
| who_covid19 | 514,560 | WHO COVID-19 daily country-level data | date_reported, country_code |
| orders | 10,000,000 | Test order data | order_id |

Loading Subway Data

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://admin:password@localhost:5432/mydb')

for file in ['CARD_SUBWAY_MONTH_202501.csv', 'CARD_SUBWAY_MONTH_202502.csv']:
    df = pd.read_csv(file, encoding='euc-kr')
    # Rename Korean column names to English (for Iceberg compatibility)
    df.columns = ['use_date', 'line_name', 'station_name',
                   'ride_count', 'alight_count']
    df.to_sql('card_subway_month', engine, if_exists='append', index=False)

Note: OLake's Iceberg Writer replaces each non-ASCII character in a column name with _, so Korean column names all collapse into identical underscore-only names (e.g., ______), causing a schema conflict. Always use English (ASCII) column names.
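To catch this before loading, column names can be normalized up front. The sketch below is illustrative and not part of OLake; the Korean headers in the example mapping are assumptions about the source CSV and may differ from the actual file.

```python
import re

def sanitize_columns(columns, mapping=None):
    """Map non-ASCII column names to ASCII before loading into PostgreSQL.

    `mapping` holds explicit translations (e.g. Korean -> English).
    Anything still non-ASCII after mapping raises, instead of letting
    the Iceberg writer silently collapse it into underscores.
    """
    mapping = mapping or {}
    out = []
    for col in columns:
        name = mapping.get(col, col)
        if not name.isascii():
            raise ValueError(f"non-ASCII column name needs a mapping: {col!r}")
        # Normalize to lowercase snake_case-style identifiers
        out.append(re.sub(r"\W+", "_", name).strip("_").lower())
    return out

# Hypothetical Korean -> English mapping for the subway dataset
korean_map = {"사용일자": "use_date", "역명": "station_name"}
print(sanitize_columns(["사용일자", "역명"], korean_map))  # → ['use_date', 'station_name']
```

Applying this to `df.columns` before `to_sql` gives the same result as the manual rename above, but fails loudly if a new non-ASCII column appears.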

Loading WHO COVID-19 Data

Load the 514,560 rows in chunks of 100,000.

for chunk in pd.read_csv('WHO-COVID-19-global-daily-data.csv', chunksize=100_000):
    chunk.to_sql('who_covid19', engine, if_exists='append', index=False)

Loading Orders Test Data

Load 10 million order rows in chunks of 100,000.

for chunk in pd.read_csv('orders_10M.csv', chunksize=100_000):
    chunk.to_sql('orders', engine, if_exists='append', index=False)
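The post does not show how orders_10M.csv was produced. A minimal generator sketch is below; the column schema (order_id, customer_id, amount, created_at) is an assumption for illustration only.

```python
import csv
import random
from datetime import datetime, timedelta

def write_orders_csv(path, n_rows, seed=42):
    """Write n_rows of synthetic order records to a CSV file.

    Schema is assumed: a sequential order_id primary key plus a few
    plausible payload columns. Seeded RNG makes runs reproducible.
    """
    rng = random.Random(seed)
    start = datetime(2025, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount", "created_at"])
        for i in range(1, n_rows + 1):
            writer.writerow([
                i,                                      # unique primary key
                rng.randint(1, 100_000),                # random customer
                round(rng.uniform(1.0, 500.0), 2),      # order amount
                (start + timedelta(seconds=i)).isoformat(),
            ])

# write_orders_csv("orders_10M.csv", 10_000_000)  # full-size file, takes a while
```

The resulting file can then be loaded with the chunked `read_csv` / `to_sql` loop shown above.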

5. Configuring the OLake CDC Pipeline

Configure the Source, Destination, and Job from the OLake UI (http://localhost:8000).

5-1. Registering a Source

| Field | Value |
| --- | --- |
| Name | pgsource |
| Type | postgres |
| Host | host.docker.internal:5432 |
| Database | mydb |
| Update Method | CDC (olake_pub / olake_slot) |

When a Source is registered in the OLake UI, it uses the olakego/source-postgres:v0.3.16 image to run a connection test (CHECK) and schema discovery (DISCOVER).


5-2. Registering a Destination

| Field | Value |
| --- | --- |
| Name | target_iceberg |
| Type | iceberg (REST Catalog) |
| REST Catalog URL | http://host.docker.internal:8181/catalog |
| S3 Endpoint | http://host.docker.internal:9000 |
| Catalog Name | olake_iceberg |
| S3 Path | iceberg |


5-3. Creating a CDC Job

Link the Source and Destination, select the streams (tables) to synchronize, and create the Job. Each table's Ingestion Mode (Upsert/Append) can be selected individually. OLake also includes a Data Filter feature.

| Field | Value |
| --- | --- |
| Job Name | cdc_test |
| Source | pgsource |
| Destination | target_iceberg |
| Schedule | * * * * * (every minute) |
| Sync Mode | Full Refresh (initial load) + CDC |
| Ingestion Mode | Upsert |


Selected Streams

| Stream | Primary Key | Iceberg Table |
| --- | --- | --- |
| public.orders | order_id | cdc_test_mydb:public.orders |
| public.card_subway_month | line_name, station_name, use_date | cdc_test_mydb:public.card_subway_month |
| public.who_covid19 | date_reported, country_code | cdc_test_mydb:public.who_covid19 |

Once the Job is created, the OLake Worker executes a CDC sync every minute via a Temporal workflow. Each sync spawns a Docker container (olakego/source-postgres) to read the PostgreSQL WAL and writes to the Iceberg table via the Iceberg Java Writer (olake-iceberg-java-writer.jar).

How CDC Works

1. OLake Worker → Triggers a Temporal Workflow (cron every minute)
2. Docker container created (olakego/source-postgres:v0.3.16)
3. Reads WAL from the PostgreSQL Logical Replication Slot (olake_slot)
4. Delivers change records to the Iceberg Java Writer via gRPC
5. Iceberg Writer → Commits tables via the Lakekeeper REST Catalog API
6. Data files → Saved as Parquet to MinIO S3
7. State (LSN) saved → Resumes from this point in the next sync

CDC state is managed based on PostgreSQL LSN (Log Sequence Number), which tracks synchronization progress.

{
  "type": "STREAM",
  "global": {
    "state": { "lsn": "0/9AE53D30" },
    "streams": ["public.who_covid19", "public.orders", "public.card_subway_month"]
  }
}
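The lsn string in the state above is PostgreSQL's standard hi/lo hexadecimal pair. A small helper (not part of OLake, just illustrating the checkpoint comparison) shows how such values can be turned into comparable integers:

```python
def lsn_to_int(lsn: str) -> int:
    """Parse a PostgreSQL LSN like '0/9AE53D30' into a single integer.

    The part before the slash is the high 32 bits of the WAL position,
    the part after is the low 32 bits; both are hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def is_newer(candidate: str, checkpoint: str) -> bool:
    """True if `candidate` lies past the saved checkpoint LSN."""
    return lsn_to_int(candidate) > lsn_to_int(checkpoint)

checkpoint = "0/9AE53D30"  # the lsn from the saved state above
print(is_newer("0/9AE53D40", checkpoint))  # → True: this change would be synced
```

A sync that resumes from the stored state only has to process WAL entries whose LSN compares greater than the checkpoint.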

6. Querying Iceberg Tables

Set up a Spark + Jupyter environment using Docker Compose.

# docker-compose.spark.yml
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    ports:
      - "8888:8888"  # Jupyter
      - "8080:8080"  # Spark UI
    environment:
      AWS_ACCESS_KEY_ID: admin
      AWS_SECRET_ACCESS_KEY: password
      AWS_REGION: us-east-1

Connect to the Lakekeeper catalog from Spark and run queries.

from pyspark.sql import SparkSession

# Stop any existing session (to avoid conflicts with the tabulario image's demo session)
existing = SparkSession.getActiveSession()
if existing:
    existing.stop()

spark = SparkSession.builder \
    .appName("iceberg-reader") \
    .config("spark.sql.defaultCatalog", "lakekeeper") \
    .config("spark.sql.catalog.lakekeeper", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.lakekeeper.type", "rest") \
    .config("spark.sql.catalog.lakekeeper.uri", "http://host.docker.internal:8181/catalog") \
    .config("spark.sql.catalog.lakekeeper.warehouse", "iceberg") \
    .config("spark.sql.catalog.lakekeeper.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.lakekeeper.s3.endpoint", "http://host.docker.internal:9000") \
    .config("spark.sql.catalog.lakekeeper.s3.access-key-id", "admin") \
    .config("spark.sql.catalog.lakekeeper.s3.secret-access-key", "password") \
    .config("spark.sql.catalog.lakekeeper.s3.path-style-access", "true") \
    .getOrCreate()

# Query data synchronized via CDC
table_name = "orders"
df = spark.table(f"lakekeeper.cdc_test_mydb_public.{table_name}")

total = df.count()
print(f"✓ Total: {total:,} records")
print()
df.printSchema()
df.show(10, truncate=False)



7. OLake CDC Metadata Structure

OLake stores metadata in the postgres database inside the temporal-postgresql container.

| Table | Role |
| --- | --- |
| olake-dev-source | Source connection information (host, DB, CDC configuration) |
| olake-dev-destination | Destination connection information (Catalog URL, S3 path) |
| olake-dev-job | Job definition (stream selection, schedule, CDC state) |
| olake-dev-catalog | Available source/destination types |

The lsn value in the CDC state represents the position in the PostgreSQL WAL. OLake updates the job state with this value after each sync completes. In subsequent syncs, only changes from this LSN onward are read and processed, implementing incremental synchronization.


8. CDC Columns in Iceberg Tables

Iceberg tables created by OLake include CDC metadata columns in addition to the original data columns.

| Column | Type | Description |
| --- | --- | --- |
| _cdc_timestamp | timestamptz | Timestamp of the change event in the source DB |
| _olake_timestamp | timestamptz | Timestamp when OLake processed the record |
| _op_type | string | Change type: r (read/initial), c (insert), u (update), d (delete) |
| _olake_id | string | OLake internal unique ID (equality delete key) |

DELETE operations are recorded as equality delete files, with the deletion target identified by _olake_id. To exclude deleted records in Spark SQL, use the condition WHERE _op_type != 'd'.
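The effect of _op_type and _olake_id on an Upsert stream can be illustrated by replaying a change log in plain Python. This is a conceptual sketch of the merge semantics, not OLake's actual implementation:

```python
def replay(events):
    """Reduce a CDC event stream to the current table state.

    Each event carries _olake_id (the equality-delete key), _op_type
    (r/c/u/d), and the row payload. Later events win, and 'd' removes
    the row by key, mirroring how an equality delete file masks
    earlier data files for that key.
    """
    state = {}
    for ev in events:
        key = ev["_olake_id"]
        if ev["_op_type"] == "d":
            state.pop(key, None)   # equality delete: drop the row by key
        else:                      # r / c / u all upsert the payload
            state[key] = ev["row"]
    return state

log = [
    {"_olake_id": "a", "_op_type": "r", "row": {"order_id": 1, "amount": 10}},
    {"_olake_id": "a", "_op_type": "u", "row": {"order_id": 1, "amount": 15}},
    {"_olake_id": "b", "_op_type": "c", "row": {"order_id": 2, "amount": 7}},
    {"_olake_id": "b", "_op_type": "d", "row": None},
]
print(replay(log))  # only order 1 remains, with the updated amount
```

In the Iceberg table itself this resolution happens at read time, which is why an Append-mode stream (or a raw scan that ignores delete files) would still need the WHERE _op_type != 'd' filter described above.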


9. Test Results

Performance

  • Initial Full Load: Even in a resource-constrained Windows environment, the initial synchronization of over 10 million records from PostgreSQL to Iceberg completed in just 162 seconds, demonstrating strong performance.

    10,552,212 Records / 162 Seconds = 65,137 RPS (Records Per Second)

    Job log — sync start

    Job log — sync completed

  • Incremental Sync (CDC): With a sync interval of one minute, syncs completed quickly when there were no WAL changes, and resource consumption was low since only the changed data was processed.

  • Large-Scale Data Processing: Tables of 510,000 rows (who_covid19) and 10 million rows (orders) both maintained stable CDC synchronization after the initial load.

  • Sync Container Overhead: Since a Docker container is created and destroyed for each sync, container creation costs occur even when there are no changes. For large-scale production use, adjusting the sync interval will be necessary.

Ease of Use

  • Simplified Configuration: The CDC pipeline can be set up in just three steps — Source → Destination → Job — from the OLake UI, which significantly lowers the barrier to entry compared to Debezium's Kafka Connect JSON configuration.
  • Built-in Connection Testing: Connection tests can be run directly from the UI when registering a Source or Destination, allowing configuration errors to be caught early.
  • Automatic Stream Discovery: The Discover feature automatically retrieves the list of tables from PostgreSQL for selection, eliminating the need to manually enter table information.

Stability

  • Non-ASCII Column Name Support: Non-ASCII column names (e.g., Korean) are replaced with _ by the Iceberg Writer, causing schema conflicts. Column names in the source DB must be converted to English in advance.
  • Long-Term Operational Stability: After running CDC continuously for four days, the card_subway_month stream completed more than 535 syncs without failures.

10. Conclusion

With OLake, a CDC pipeline from PostgreSQL to Apache Iceberg can be configured without writing any code. By registering a Source, Destination, and Job in the OLake UI, the Temporal workflow automatically handles periodic synchronization.

| Component | Choice | Role |
| --- | --- | --- |
| CDC Source | PostgreSQL + Logical Replication | WAL-based change capture |
| CDC Engine | OLake (Temporal Worker) | Change data extraction and delivery |
| Table Format | Apache Iceberg | ACID transactions, schema evolution, time travel |
| REST Catalog | Lakekeeper | Iceberg table metadata management |
| Object Storage | MinIO | Parquet data file storage |
| Query Engine | Spark SQL | Iceberg table querying |

Compared to Kafka-based Debezium, OLake offers simpler infrastructure and easier initial setup. However, it still has areas of immaturity, including limited monitoring capabilities and lack of support for non-ASCII column names. It can be a strong choice for teams that need to quickly build a lakehouse-centric data pipeline.
