DIP Platform TPC-DS Benchmark Test Plan

A test plan for quantitatively measuring the DIP platform's batch ELT, real-time CDC, and analytical query performance using the TPC-DS standard benchmark. It compares and validates Spark, StarRocks, OLake, and Kafka CDC pipelines.


Table of Contents

  1. Test Objectives
  2. Test Environment
  3. Test Scenarios
  4. TPC-DS Data Generation and Loading
  5. Performance Measurement Tools and Automation
  6. Test Execution Schedule
  7. Report Template
  8. Pass/Fail Criteria
  9. Risks and Considerations
  10. References

1. Test Objectives

This plan quantitatively measures the data processing performance of the DIP platform using the TPC-DS standard benchmark. Tests are divided into three areas — batch loading, real-time CDC, and analytical queries — with the following key metrics defined for each.

Test Area Measurement Target DIP Components
ELT — Batch Bulk data loading throughput Spark → Iceberg → StarRocks (External Catalog)
ELT — Real-time (CDC) Change data propagation latency and initial load throughput Path A: OLake → Iceberg / Path B: Debezium → Kafka → Iceberg
Query Analytical query response time StarRocks (Iceberg External Catalog)

2. Test Environment

2.1 Test Procedure

The diagram below illustrates the data flow across the entire benchmark pipeline. TPC-DS data generated by DuckDB is loaded into PostgreSQL, then written to Iceberg via batch (Spark) and real-time (OLake/Kafka) paths. Analytical queries are executed through the StarRocks External Catalog.

┌──────────────────────────────────────────────────────────────────────────────┐
│                     DIP TPC-DS Benchmark Pipeline                            │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ① Data Generation    ② Source Loading      ③ Batch ELT                     │
│  ┌──────────┐        ┌──────────┐         ┌──────────────┐                  │
│  │  DuckDB  │Parquet │PostgreSQL│  JDBC   │ Spark-Iceberg│                  │
│  │ TPC-DS   │──────▶│  tpcds   │───────▶│   Batch ELT  │                  │
│  │ dsdgen() │ 24tbl  │  _sf1    │  Read   │              │                  │
│  └──────────┘        └────┬─────┘         └──────┬───────┘                  │
│                           │                       │ Write                    │
│                           │                       ▼                          │
│  ④ Query Benchmark       │               ┌──────────────┐                   │
│  ┌──────────────┐  Ext.  │               │   Iceberg    │                   │
│  │  StarRocks   │◀───────┼───────────────│  (MinIO S3)  │                   │
│  │  TPC-DS 99Q  │ Catalog│               │  Lakekeeper  │                   │
│  └──────────────┘        │               └──────────────┘                   │
│                           │                       ▲                          │
│  ⑤ CDC Real-time Load   │                       │                          │
│  ┌──────────────┐        │               ┌───────┴──────┐                   │
│  │ Path A: OLake │◀───────┤               │  Iceberg     │                   │
│  │ (Kafka-less)  │───────┼──────────────▶│  Write       │                   │
│  └──────────────┘   WAL  │               └──────────────┘                   │
│  ┌──────────────┐        │                       ▲                          │
│  │ Path B: Kafka │◀───────┘               ┌───────┴──────┐                   │
│  │ Debezium+Sink│────────────────────────▶│  Iceberg     │                   │
│  └──────────────┘                         │  Sink        │                   │
│                                           └──────────────┘                   │
└──────────────────────────────────────────────────────────────────────────────┘

2.2 HW/SW Specifications

2.2.1 Hardware (Google Cloud)

Node Machine Type vCPU RAM Boot Disk Data Disk
rke2-node × 3 n2-standard-8 8 32 GB 100 GB pd-ssd 1,000 GB pd-ssd
pg × 1 e2-standard-8 8 32 GB 50 GB pd-ssd 1,000 GB pd-ssd
  • rke2-node data disk mount: /var/lib/longhorn (Longhorn distributed storage)
  • PostgreSQL data disk mount: /data

2.2.2 Software

Component Version
PostgreSQL 16.x
Spark 3.5.x
StarRocks 3.5.2
Kafka 3.8.1 (KRaft, single broker)
Debezium 2.7 (PostgreSQL Source Connector)
Iceberg Sink Tabular 0.6.19
OLake v0.3.16 (olakego/source-postgres)
Lakekeeper latest (Iceberg REST Catalog)

2.3 TPC-DS Data Scale Factors

Phase Scale Factor Data Size (Parquet) Row Count (all 24 tables) Purpose
SF-1 1 ~264 MB ~19.6 million Functional validation, pipeline debugging
SF-100 100 ~100 GB ~1 billion Primary performance measurement

SF-100 and above will be executed selectively based on available cluster resources.

2.4 TPC-DS Schema Overview

  • Fact tables (7): store_sales, catalog_sales, web_sales, inventory, store_returns, catalog_returns, web_returns
  • Dimension tables (17): customer, item, date_dim, store, warehouse, customer_demographics, etc.
  • Total tables: 24

3. Test Scenarios

3.1 ELT Performance — Batch ELT

Objective: Measure bulk data loading performance from PostgreSQL to Iceberg.

3.1.1 Test Path

PostgreSQL ──(Spark JDBC Read)──▶ Iceberg (MinIO S3) ──▶ StarRocks (External Catalog)

3.1.2 Metrics

Metric Unit Description
Total Load Time seconds (s) Total elapsed time to load all 24 tables
Throughput rows/s Rows processed per second
Table Load Time seconds (s) Per-table load time (especially fact tables)
Resource Utilization % CPU, memory, disk I/O, and network utilization

3.1.3 Execution Steps

  1. Generate TPC-DS data (DuckDB dsdgen)
  2. Create schema and load data into PostgreSQL (DuckDB postgres extension)
  3. Execute Spark job — start timing
  4. Verify load completion and validate source-target row counts
  5. Stop timing, collect resource metrics
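Step 4 (source-target row count validation) can be scripted. A minimal sketch in pure Python, assuming the per-table counts have already been fetched (e.g. via JDBC from PostgreSQL and Spark SQL from Iceberg); the example table names and counts are illustrative:

```python
def validate_row_counts(source_counts: dict[str, int],
                        target_counts: dict[str, int]) -> list[str]:
    """Compare per-table row counts; return a list of mismatch descriptions.

    An empty return value means a 100% source-target match, which is the
    pass criterion for Batch ELT data consistency (Section 8.1).
    """
    mismatches = []
    for table, src in sorted(source_counts.items()):
        tgt = target_counts.get(table)  # None if the table never arrived
        if tgt != src:
            mismatches.append(f"{table}: source={src} target={tgt}")
    return mismatches

# Illustrative counts (store_sales / date_dim at SF-1)
src = {"store_sales": 2_880_404, "date_dim": 73_049}
tgt = {"store_sales": 2_880_404, "date_dim": 73_049}
assert validate_row_counts(src, tgt) == []
```

In a real run the two dicts would be populated by `SELECT COUNT(*)` loops against PostgreSQL and the Iceberg catalog, and any non-empty result would fail the test.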

3.2 ELT Performance — Real-time CDC

Objective: Measure end-to-end latency for real-time change propagation from the source DB, and compare initial load throughput between the two CDC paths (OLake vs Kafka).

3.2.1 Test Paths

Path A: PostgreSQL (WAL) ──▶ OLake Worker ──▶ Iceberg (Lakekeeper)
Path B: PostgreSQL (WAL) ──▶ Debezium Source ──▶ Kafka ──▶ Iceberg Sink Connector ──▶ Iceberg (Lakekeeper)
Path CDC Engine Characteristics
Path A OLake Direct PostgreSQL → Iceberg loading without Kafka. Purpose-built for Iceberg targets with UI-based configuration
Path B Debezium + Kafka Kafka-based event streaming. Extensible to various targets but relatively complex to configure

3.2.2 Metrics

Metric Unit Description Applicable Path
Initial Load Time seconds (s) Time to complete the full initial load A, B
Initial Load Throughput rows/s Rows processed per second during initial load A, B
End-to-End Latency ms Time from source INSERT/UPDATE to Iceberg reflection A, B
P50 / P95 / P99 Latency ms Percentile-based latency A, B
CDC Throughput events/s CDC events processed per second A, B
Kafka Consumer Lag events Kafka consumer lag (Kafka path only) B
Data Consistency % Source-target data match rate A, B

3.2.3 Test Cases

Initial Load Comparison:

ID Test Name Description Path SF
R-01 OLake Initial Load Full table initial load via OLake A 100
R-02 Kafka CDC Initial Load Snapshot-based initial load via Debezium B 100
R-03 Throughput Comparison Comparative analysis of initial load throughput based on R-01 and R-02 A, B 100

CDC Real-time Change Propagation:

ID Test Name Description Path Load
R-04 Single Row Latency Latency for a single INSERT to reflect in Iceberg A, B 1 row
R-05 Burst Insert Total reflection time after bulk INSERT of 1M rows A, B 1M rows
R-06 Sustained Load Continuous INSERT (100 rows/s × 10 min) A, B 60K rows
R-07 Mixed DML Mixed workload: 70% INSERT / 20% UPDATE / 10% DELETE A, B 10K rows
R-08 Peak Load Incremental load increase until maximum throughput is reached A, B 600K rows
R-09 Schema Evolution Verify CDC continues to function correctly after column addition A, B
R-10 Recovery Test Validate data consistency after CDC process failure and recovery A, B
R-11 Large Transaction Process 100K changes within a single transaction A, B 100K rows
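The R-07 mixed workload (70% INSERT / 20% UPDATE / 10% DELETE) can be driven by a small generator. A sketch under stated assumptions: the table and column names (`store_sales_cdc`, `ss_item_sk`, `ss_quantity`) are illustrative stand-ins, and real runs would execute the statements through the Python/pgbench load scripts listed in Section 5:

```python
import random

def mixed_dml_statements(n: int, seed: int = 42):
    """Yield n DML statements in a ~70/20/10 INSERT/UPDATE/DELETE mix.

    A fixed seed makes the workload reproducible across Path A and Path B
    runs, so both CDC pipelines see the same statement sequence.
    """
    rng = random.Random(seed)
    for i in range(n):
        r = rng.random()
        if r < 0.70:
            yield (f"INSERT INTO store_sales_cdc (ss_item_sk, _benchmark_ts) "
                   f"VALUES ({i}, clock_timestamp())")
        elif r < 0.90:
            yield (f"UPDATE store_sales_cdc SET ss_quantity = ss_quantity + 1 "
                   f"WHERE ss_item_sk = {i}")
        else:
            yield f"DELETE FROM store_sales_cdc WHERE ss_item_sk = {i}"

stmts = list(mixed_dml_statements(10_000))
inserts = sum(s.startswith("INSERT") for s in stmts)
# With a fixed seed the mix lands close to the 70/20/10 target
```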

3.2.4 Latency Measurement Method

A timestamp column (_benchmark_ts) is added to the source table, and latency is calculated as the difference between the source timestamp and the Iceberg reflection time.

-- PostgreSQL source: INSERT with timestamp
INSERT INTO store_sales_cdc (ss_sold_date_sk, ..., _benchmark_ts)
VALUES (..., clock_timestamp());

-- Iceberg target: Check reflection time (Spark SQL)
-- Cast to double to keep sub-second precision
-- (unix_timestamp() truncates to whole seconds, which would quantize latency_ms to multiples of 1000)
SELECT _benchmark_ts,
       current_timestamp() AS received_ts,
       (cast(current_timestamp() AS double) - cast(_benchmark_ts AS double)) * 1000 AS latency_ms
FROM lakekeeper.<namespace>.store_sales_cdc
WHERE _benchmark_ts > '2026-03-10 10:00:00';

Path A (OLake): The Iceberg namespace is automatically created by the OLake job configuration.
Path B (Kafka): The Iceberg namespace is determined by the Sink connector's dynamic routing (e.g., debezium_tpcds_sf1).

⚠️ Note: NTP time synchronization between source and target systems is essential for accurate latency measurement. Measuring from the same server is recommended whenever possible.
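Once per-row latencies have been collected, the P50/P95/P99 summary used in the report tables (Section 7.3) reduces to a percentile computation. A minimal sketch using the standard library; the sample values are illustrative, not measured results:

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Compute the P50/P95/P99/Max latency summary for the report tables.

    statistics.quantiles with n=100 returns the 99 percentile cut points;
    index i corresponds to the (i+1)-th percentile.
    """
    qs = statistics.quantiles(sorted(latencies_ms), n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98],
            "max": max(latencies_ms)}

sample = [120.0, 150.0, 180.0, 900.0, 2500.0]  # illustrative latencies in ms
summary = latency_percentiles(sample)
```

The `method="inclusive"` option interpolates between observed values, which is the common choice when the sample already covers the full measurement window.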

3.2.5 Execution Steps

Path A (OLake):

  1. Register source in OLake UI — PostgreSQL, tpcds_sf1
  2. Register destination — Iceberg REST Catalog + MinIO
  3. Create job (Full Refresh + CDC, Upsert mode)
  4. Verify initial load completion and validate source-target row counts
  5. Execute load generation scripts (INSERT / UPDATE / DELETE)
  6. Measure latency by checking Iceberg reflection time via Spark SQL
  7. Collect metrics and record results

Path B (Kafka CDC):

  1. Start Kafka + Connect cluster
  2. Run register-connectors.sh (idempotent connector registration)
  3. Wait for Debezium Source snapshot completion (initial load)
  4. Validate Iceberg row counts (via StarRocks External Catalog)
  5. Execute load generation scripts (INSERT / UPDATE / DELETE)
  6. Measure latency by checking Iceberg reflection time
  7. Monitor Consumer Lag via Kafka UI
  8. Collect metrics and record results

3.3 Query Performance (Analytical Query)

Objective: Measure analytical query response times against TPC-DS data loaded into Iceberg, queried through the StarRocks External Catalog.

3.3.1 Test Target

  • Engine: StarRocks 3.5.2 (Iceberg External Catalog)
  • Query count: 99 TPC-DS queries (converted to StarRocks-compatible syntax)
  • Data: Iceberg tables loaded via Batch ELT or CDC

3.3.2 Query Classification

Category Example TPC-DS Queries Characteristics
Simple Scan/Filter Q3, Q7, Q19, Q42, Q52, Q55 Single fact table, simple joins
Multi-way Join Q13, Q17, Q25, Q29, Q34 Joins across 3+ tables
Aggregation/Grouping Q1, Q6, Q11, Q27, Q46 GROUP BY, HAVING, window functions
Subquery/CTE Q4, Q14, Q23, Q31, Q39 Complex subqueries, WITH clauses
Large-scale Join Q64, Q72, Q78, Q80, Q82 Joins between large fact tables

3.3.3 Metrics

Metric Unit Description
Query Response Time seconds (s) Execution time per query
Geometric Mean seconds (s) TPC-DS standard geometric mean performance metric
Cold / Warm Run seconds (s) Without cache (Cold) vs with cache (Warm)
Throughput queries/hr Queries processed per hour (Power + Throughput test)
Concurrency — Performance impact as concurrent users increase
Query Profile — Execution plan, rows scanned, memory usage

3.3.4 Test Cases

ID Test Name Description SF
Q-01 Power Test (Cold Run) Sequential execution of 99 queries, 1st run after cache flush 100
Q-02 Warm Run Sequential execution of 99 queries, 2nd run (cache utilized) 100
Q-03 Category Drill-down Detailed profiling of representative queries per category 100
Q-04 Concurrency — 2 users Query execution with 2 concurrent sessions 100
Q-05 Concurrency — 4 users Query execution with 4 concurrent sessions 100
Q-06 Concurrency — 8 users Query execution with 8 concurrent sessions 100
Q-07 Mixed Workload Measure impact of concurrent batch loading during query execution 100
Q-08 Top-10 Slowest Before/after comparison of tuning the 10 slowest queries 100
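The concurrency cases Q-04 through Q-06 can share one harness that replays the query set once per simulated user. A minimal sketch, assuming `run_query` wraps the real StarRocks call (e.g. a pymysql `cursor.execute`); a sleep stub is injected here so the sketch stays self-contained:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrency_test(run_query, queries: list[str], users: int) -> dict:
    """Run the full query set once per simulated user and report wall-clock time.

    max_workers=users caps parallelism at the simulated user count, so each
    "user" effectively works through its own copy of the query stream.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(run_query, q)
                   for _ in range(users) for q in queries]
        for f in futures:
            f.result()  # re-raise any query failure
    return {"users": users,
            "wall_s": time.perf_counter() - start,
            "queries": users * len(queries)}

# Stub runner standing in for StarRocks
result = run_concurrency_test(lambda q: time.sleep(0.001),
                              [f"Q{i}" for i in range(1, 100)], users=4)
```

Comparing `wall_s` across 1/2/4/8 users yields the degradation column in the Section 7.5 report template.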

3.3.5 Execution Steps

  1. Extract standard queries with DuckDB tpcds_queries() → Convert to StarRocks syntax
  2. Validate StarRocks compatibility and fix incompatible queries
  3. Flush cache (for Cold Run)
  4. Execute queries sequentially and measure individual execution times
  5. Collect query profiles (EXPLAIN ANALYZE)
  6. Calculate Geometric Mean
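Steps 4 and 6 above can be sketched in Python, in the spirit of the pymysql automation listed in Section 5. A minimal sketch; `run_query` stands in for the real StarRocks call and is an assumption of this illustration:

```python
import math
import time

def benchmark_queries(run_query, queries: dict[str, str]) -> dict[str, float]:
    """Time each query individually; run_query abstracts the StarRocks call."""
    timings = {}
    for qid, sql in queries.items():
        start = time.perf_counter()
        run_query(sql)
        timings[qid] = time.perf_counter() - start
    return timings

def geometric_mean(times_s: list[float]) -> float:
    """TPC-DS style geometric mean: exp(mean(log(t))) over per-query times."""
    return math.exp(sum(math.log(t) for t in times_s) / len(times_s))

# The geometric mean dampens outliers relative to the arithmetic mean:
assert abs(geometric_mean([0.5, 2.0]) - 1.0) < 1e-9
```

The geometric mean is the TPC-DS convention because it weights every query equally on a log scale, preventing a handful of slow large-join queries from dominating the headline number.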

4. TPC-DS Data Generation and Loading

4.1 Data Generation

TPC-DS data is generated using the DuckDB TPC-DS extension (dsdgen) and exported in Parquet format with ZSTD compression.

# Generate TPC-DS data (SF-1)
python benchmark/data/generate_tpcds.py 1

# Generate TPC-DS data (SF-100)
python benchmark/data/generate_tpcds.py 100

4.2 PostgreSQL Loading

Parquet files are loaded into PostgreSQL at high speed using the DuckDB postgres extension. The script is idempotent, safely replacing existing data on repeated runs.

# Load into PostgreSQL (idempotent)
python benchmark/data/load_to_postgres.py 1

4.3 StarRocks External Catalog Setup

StarRocks queries Iceberg tables directly through an External Catalog without additional data loading.

CREATE EXTERNAL CATALOG bmt_catalog
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://lakekeeper:8181/catalog",
    "iceberg.catalog.warehouse" = "iceberg",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.region" = "us-east-1",
    "aws.s3.enable_path_style_access" = "true"
);

-- Example query
SELECT COUNT(*) FROM bmt_catalog.tpcds_sf1.store_sales;

5. Performance Measurement Tools and Automation

Purpose Tool Description
Data generation DuckDB dsdgen TPC-DS standard data generation (Python script)
Query generation DuckDB tpcds_queries() TPC-DS standard query extraction and StarRocks syntax conversion
Batch ELT Spark + Lakekeeper JDBC Read → Iceberg Write
CDC (Path A) OLake Direct PostgreSQL → Iceberg loading
CDC (Path B) Debezium + Kafka + Iceberg Sink Kafka-based CDC pipeline
CDC load generation Python scripts / pgbench INSERT / UPDATE / DELETE workloads
Query benchmark Python + pymysql Automated StarRocks query execution and timing
Kafka monitoring Kafka UI (Provectus) Consumer Lag, Topic, and Connector status monitoring
Resource collection sar, iostat, vmstat OS-level CPU / Memory / Disk I/O monitoring

6. Test Execution Schedule

Phase Activity Key Tasks
Step 1 Environment Setup Infrastructure provisioning, TPC-DS data generation, PostgreSQL loading, component configuration
Step 2 Batch ELT Execute and measure Spark → Iceberg loading performance
Step 3 Query Performance Execute Q-01 through Q-08 (Cold/Warm Run), fix query compatibility issues
Step 4 Real-time CDC — Initial Load Execute R-01 through R-03, compare OLake vs Kafka initial load throughput
Step 5 Real-time CDC — Change Propagation Execute R-04 through R-11, measure latency and data consistency
Step 6 Analysis and Reporting Consolidate results, analyze bottlenecks, produce final report

7. Report Template

The tables below serve as report templates. Actual values will be populated after test execution.

7.1 Batch ELT Results Summary

SF Tables Total Rows Total Size Load Time Throughput
1 24 19,557,579 ~264 MB 324s 57K rows/s
100 24 ~100 GB

7.2 CDC Initial Load Comparison

Path Method Data Elapsed Time Throughput Notes
A OLake 10,550,966 rows (3 tbl) 125s 84K rows/s Direct loading without Kafka
B Kafka CDC (optimized) 19,557,579 rows (24 tbl) ~600s ~33K rows/s Debezium + Kafka
B Kafka CDC (default) 19,557,579 rows (24 tbl) ~1,740s ~11K rows/s Default settings

⚠️ Path A (OLake) and Path B (Kafka CDC) target different numbers of tables (3 vs 24), so comparison should be based on per-row throughput (rows/s). For a fair comparison, the SF-100 test will target the same set of tables across both paths.

7.3 CDC Real-time Results Summary

Test Row Count P50 (ms) P95 (ms) P99 (ms) Max (ms) Path
Single Row 1 A, B
Burst 1M 1,000,000 A, B
Sustained 60,000 A, B
Mixed DML 10,000 A, B
Peak Load 600,000 A, B

7.4 Query Performance Results Summary

SF Queries Success Rate Geometric Mean Min Max Cold / Warm
1 99 100% 1.044s 0.150s 21.793s 1.044s / —
100 99 — / —

7.5 Concurrency Test Results

Concurrent Users Geometric Mean Throughput (queries/hr) Performance Degradation
1 baseline
2
4
8

8. Pass/Fail Criteria

8.1 Batch ELT

Criteria Threshold
Data consistency 100% source-target row count match
SF-100 load time Under 3 hours
Throughput Minimum 10K rows/s

8.2 Real-time CDC

Criteria Threshold Applicable Path
Initial load data consistency 100% source-target row count match A, B
Initial load throughput Confirm Path A (OLake) outperforms Path B (Kafka CDC) A, B
Single-row E2E latency (P95) Under 3 seconds A, B
Sustained load latency (P95) Under 10 seconds A, B
CDC data consistency 100% source-target match for INSERT / UPDATE / DELETE A, B
Post-recovery consistency Zero data loss A, B

8.3 Query Performance

Criteria Threshold
SF-1 Geometric Mean Under 5 seconds
SF-100 Geometric Mean Under 60 seconds
4-user concurrency degradation Within 50% of single-user baseline

Pass/fail thresholds may be adjusted based on actual hardware specifications and cluster resource availability.


9. Risks and Considerations

Risk Impact Mitigation
TPC-DS query incompatibility with StarRocks Some queries cannot be measured Manual query rewrite (Q49, Q70, Q86 already fixed)
Insufficient resources (SF-100) Test failure or OOM Fall back to SF-10, then evaluate resource scaling
OLake non-ASCII column name limitation Schema conflicts with non-Latin column names Pre-convert to ASCII column names
OLake community maturity Potential unexpected bugs Fall back to Kafka CDC path if issues arise
Kafka CDC configuration complexity Setup delays Use idempotent scripts and connector JSON templates
Debezium tasks.max=1 constraint Source Connector cannot parallelize Use snapshot.max.threads for snapshot-phase parallelism
Kafka Sink topic discovery delay Cannot consume new topics during initial load Set metadata.max.age.ms=5000 to shorten metadata refresh interval
Cross-system time synchronization Increased latency measurement error Enforce NTP synchronization; prefer same-server measurement
Network bottleneck Skewed test results Secure dedicated network or use same-host configuration
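The Debezium-side mitigations in the table map directly onto connector configuration. A hedged fragment for the Source connector; the connector name, host, database, and table list are illustrative, and only the two highlighted properties are the mitigations themselves:

```json
{
  "name": "tpcds-postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "snapshot.max.threads": "4",
    "database.hostname": "pg",
    "database.dbname": "tpcds_sf1",
    "table.include.list": "public.store_sales_cdc"
  }
}
```

`tasks.max` stays at 1 (the Debezium source constraint from the table), while `snapshot.max.threads` parallelizes the snapshot phase. The `metadata.max.age.ms=5000` mitigation belongs on the Sink side, e.g. via Kafka Connect's per-connector `consumer.override.metadata.max.age.ms` client override (which requires the override policy to be enabled on the Connect cluster).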

10. References
