DIP Platform TPC-DS Benchmark — Part 1. Batch ELT & Query Performance

Can a data platform prove its performance before you commit? We put DIP through the TPC-DS SF-100 standard benchmark. From batch ELT throughput to analytical query response times and concurrency limits—here are the numbers that speak for themselves.

DIP Platform TPC-DS Benchmark — Part 1. Batch ELT & Query Performance

Table of Contents


Series Overview

This is a series that quantitatively validates the data processing capabilities of the DIP platform using the TPC-DS standard benchmark. The full benchmark is conducted in four stages, and this post covers the results of ① Batch ELT and ② Query Performance.

# Test Area DIP Components Blog
Batch ELT — Bulk data ingestion throughput Spark → Iceberg → StarRocks This post
Query — Analytical query response time · Concurrency StarRocks (Iceberg External Catalog) This post
Kafka CDC — Real-time change data capture Kafka Connect → Iceberg Part 2 (upcoming)
OLake CDC — Lightweight CDC pipeline OLake → Iceberg Part 2 (upcoming)

Part 2 will compare Kafka CDC and OLake CDC in terms of initial load speed, real-time latency, and operational convenience.


Key Results at a Glance

Batch ELT — 960M rows ingested in 54 min (296K rows/sec) · 3.3× faster than the 3-hour target

Query — 100% pass on all 99 TPC-DS queries, Geometric Mean 1.474s · 40× faster than the 60s target

QphDS@100 = 244,234 · Global top-tier for comparable specs · QphDS/$ ~60,900/hr (outstanding ROI)

Concurrency — 4-user degradation at 2.48× · Met the target of under 3×

All 6 predefined success criteria: PASS


1. Benchmark Purpose and Background

When adopting a data platform, the first question to answer is "How fast does it perform at our scale?" To objectively verify whether the DIP platform meets expected performance in production environments, we conducted the industry-standard TPC-DS SF-100 benchmark.

TPC-DS is a decision support benchmark modeled on a retail scenario, consisting of 99 analytical queries that include complex joins, subqueries, and window functions. SF-100 corresponds to approximately 100GB of raw data (~30GB after compression), making it a suitable scale for evaluating mid-size analytical workloads.

Success Criteria

The following criteria were defined before the benchmark began.

Metric Criteria Rationale
SF-100 Batch Ingestion Time Under 3 hours ELT must complete within half the typical nightly batch window (4–6 hours) to leave headroom for downstream tasks (aggregation, report generation)
Batch Ingestion Throughput Minimum 10K rows/sec Processing SF-100 (960M rows) in 3 hours requires at least ~89K rows/sec, but 10K rows/sec is set as a universal floor covering small daily incremental loads (millions of rows)
SF-100 Query Geometric Mean Under 60 seconds Upper limit for interactive analytical query response. Beyond 1 minute, exploratory analysis flow is disrupted and productivity declines
4-User Concurrency Degradation Within 50% of single-user (under 3×) Assumes a small analytics team (3–5 members) on a 3-CN-node environment. Must sustain at least 25% better efficiency than linear degradation (4×) to be considered practical concurrency
Data Integrity 100% source-target match Fundamental requirement for any data platform. Row count or checksum mismatch between source and target fundamentally undermines analytical reliability—non-negotiable
TPC-DS Query Pass Rate 99/99 (100%) The 99 TPC-DS queries extensively use advanced ANSI SQL features (CTEs, window functions, ROLLUP, etc.), making a full pass rate an indicator of SQL compatibility completeness

2. Test Environment

2.1 Infrastructure (Google Cloud)

Role Machine Type vCPU RAM Boot Disk Data Disk
rke2-node × 3 n2-standard-8 8 32 GB 100 GB pd-ssd 1,000 GB pd-ssd
PostgreSQL × 1 e2-standard-8 8 32 GB 50 GB pd-ssd 1,000 GB pd-ssd

2.2 Software Stack

Component Version Role
PostgreSQL 16.x Source RDBMS
Spark 3.5.x Batch ELT Engine
StarRocks 3.5.2 (FE 1 + CN 3) Analytical Query Engine
Lakekeeper latest Iceberg REST Catalog
GCS S3-compatible Object Storage

2.3 StarRocks CN Node Configuration

Parameter Value
CN Node Count 3
MemLimit per Node 16 GB
exec_mem_limit (Single User) 16 GB
exec_mem_limit (Concurrency Test) 16 GB ÷ concurrent users (per node)

3. Data Pipeline Architecture

The overall benchmark pipeline consists of four stages.

┌─────────────────────────────────────────────────────────────────────────────┐
│                     DIP TPC-DS Benchmark Pipeline                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ① Data Generation      ② Source Loading       ③ Batch ELT                │
│  ┌──────────┐           ┌──────────┐           ┌───────────────┐           │
│  │  DuckDB  │  Parquet  │PostgreSQL│   JDBC    │ Spark-Iceberg │           │
│  │ dsdgen() │─────────▶│  tpcds   │─────────▶│   Batch ELT   │           │
│  │  SF-100  │  24 tbl   │  _sf100  │   Read    │               │           │
│  └──────────┘           └──────────┘           └───────┬───────┘           │
│                                                         │ Write             │
│                                                         ▼                   │
│  ④ Query Benchmark                               ┌───────────────┐         │
│  ┌───────────────┐   External Catalog            │    Iceberg    │         │
│  │   StarRocks   │◀─────────────────────────────│   (MinIO S3)  │         │
│  │  TPC-DS 99Q   │  bmt_catalog.tpcds_sf100      │  Lakekeeper   │         │
│  └───────────────┘                                └───────────────┘         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
  • ① Data Generation: Generates SF-100 data in Parquet (ZSTD) format using DuckDB's TPC-DS extension (dsdgen).
  • ② Source Loading: Bulk-loads the generated Parquet files into PostgreSQL, simulating a real-world scenario where the source system is an RDBMS.
  • ③ Batch ELT: Reads from PostgreSQL via Spark JDBC and ingests into Lakekeeper-based Iceberg tables.
  • ④ Query Benchmark: Queries Iceberg tables directly through StarRocks External Catalog, executing all 99 TPC-DS queries.

4. Data Generation and Source Loading

4.1 TPC-DS SF-100 Data Generation

24 tables (7 Fact + 17 Dimension) were generated using DuckDB dsdgen().

Category Tables Total Rows Total Size
Fact 7 953,722,450 ~29.3 GB
Dimension 17 5,297,158 ~96 MB
Total 24 ~960M ~30.1 GB

The scale of the major Fact tables is as follows.

Table Rows Size
inventory 399,330,000 885 MB
store_sales 287,997,024 12.0 GB
catalog_sales 143,997,065 9.2 GB
web_sales 72,001,237 4.5 GB
store_returns 28,795,080 1.4 GB
catalog_returns 14,404,374 869 MB
web_returns 7,197,670 435 MB

4.2 PostgreSQL Loading Results

Data was loaded directly from Parquet into PostgreSQL using the DuckDB postgres extension, setting up the source RDBMS environment.

Metric Value
Total Rows Loaded ~960M
Total Loading Time ~24.5 min (1,470s)
Avg. Throughput 652K rows/sec

5. Batch ELT: Spark → Iceberg Ingestion

5.1 Ingestion Results Summary

Results of reading from PostgreSQL via Spark JDBC and ingesting into Lakekeeper-based Iceberg tables.

Metric Value
Tables Succeeded 24/24 (100%)
Total Rows Ingested 959,019,608 (~960M)
Wall-clock Time 3,240s (54 min)
Phase 1 — Dimension (17 tables) 20.4s
Phase 2 — Fact (7 tables) 3,219.6s (99.4% of total)
Throughput 296K rows/sec

Ingestion was executed using a 2-phase strategy. Phase 1 loads the 17 lightweight Dimension tables in parallel, followed by Phase 2 which loads the 7 large Fact tables in parallel. 99% of total time was spent on Fact table ingestion.

5.2 Per-Fact-Table Ingestion Performance

Table Rows Time (s) Throughput (rows/sec) JDBC Partitions
inventory 399,330,000 425.7 938,028 12
store_sales 287,997,024 1,166.1 246,972 12
catalog_sales 143,997,065 860.1 167,419 12
web_sales 72,001,237 488.5 147,394 8
store_returns 28,795,080 136.8 210,507 4
catalog_returns 14,404,374 91.2 158,020 4
web_returns 7,197,670 51.2 140,533 4

The largest bottleneck was store_sales, accounting for approximately 36% (1,166s) of total ingestion time. This table contains ~290M rows at 12GB and was split into 12 JDBC partitions for parallel reads, but processing efficiency was limited since the partition count exceeded the available Executor cores (6).

5.3 Spark Tuning Details

OOM errors and performance degradation occurred during SF-100 ingestion. These were resolved by tuning three key areas.

Spark Resource Configuration

Parameter Value Notes
Executors 3 1 per K8s node
Executor Memory 18g Allocated from 32GB per node
Executor Overhead 4g Off-heap (22g total/node)
Executor Cores 2
Driver Memory 4g
Shuffle Partitions 48 executors × cores × 8

JDBC Read Optimization

Parameter Default Tuned Effect
fetchsize 0 (full load) 100,000 Streaming reads reduce GC pressure, prevent OOM
JDBC Partitions (large Fact) 1 12 Range-split on item_sk, parallel reads
JDBC Partitions (mid-size Fact) 1 4–8 Adjusted per table size

The critical change was fetchsize. The default value of 0 loads the entire result set into memory, causing immediate OOM on large tables. Setting it to 100,000 switched to cursor-based streaming reads, which was the key to achieving stability.

Iceberg Write Optimization

Parameter Default Tuned Effect
write.spark.fanout.enabled true false Prevents simultaneous Writer opens per partition → resolves OOM
Write Method DataFrame writeTo CTAS + ORDER BY Sort by partition key, then write sequentially
Compression Snappy ZSTD Improved compression ratio

When fanout.enabled is true, Writers are kept open for all partitions simultaneously, causing memory to spike on Fact tables with many partitions. Switching to false and applying the CTAS pattern—sorting by partition key (date_sk) before sequential writes—stabilized memory usage.

Ingestion Strategy Summary

Item Description
2-Phase Execution Phase 1: 17 Dimension tables (parallel) → Phase 2: 7 Fact tables (parallel)
Partitioning Fact tables: partitioned by date_sk
Sorting ORDER BY item_sk (sorted within partitions)
Column Name Normalization SF-100 dsdgen column0/column00 → mapped to TPC-DS standard column names
Idempotency safe_drop_table() — falls back to non-PURGE if DROP TABLE PURGE fails

5.4 Projected Scale-Out Performance

The current 3-node environment achieved 54 minutes, but further performance gains are achievable by adding nodes.

Metric Current (3 nodes) Recommended Min (5 nodes) Recommended Optimal (6 nodes)
K8s Nodes 3 × n2-standard-8 5 × n2-standard-8 6 × n2-standard-8
Total Executor Cores 6 15 18
Total Executor Memory 54g 80g 96g
Est. Ingestion Time 54 min ~35 min ~25 min
Throughput 296K rows/sec ~450K rows/sec ~550K rows/sec
Speedup baseline 1.5× 2.2×
Configuration Est. Monthly Cost Ingestion Time Cost Efficiency
Current (3 nodes) ~$730 54 min baseline
Recommended Min (5 nodes) ~$1,217 ~35 min 1.7× cost, 1.5× perf
Recommended Optimal (6 nodes) ~$1,460 ~25 min 2.0× cost, 2.2× perf

Based on n2-standard-8 at ~$243/month (GCP on-demand). The 6-node configuration offers the best cost-to-performance ratio.


6. Query Performance Benchmark

All 99 TPC-DS queries were executed against Iceberg tables via the StarRocks External Catalog (bmt_catalog.tpcds_sf100). StarRocks leverages Iceberg metadata for Data Skipping and Predicate Pushdown, querying directly through the External Catalog without any data replication.

6.1 Power Test — Cold Run / Warm Run

Results from sequential execution of all 99 queries in a single session.

Metric Cold Run Warm Run
Total Queries 99 99
Passed 99 (100%) 99 (100%)
Failed 0 0
Geometric Mean 1.474s 1.458s
Total Time 249.4s 246.5s
Min 0.131s 0.132s
Max 18.789s 18.555s
Avg 2.519s 2.490s
exec_mem_limit 16 GB 16 GB

A notable finding is that the difference between Cold and Warm runs was only 1.1%. This is due to the limited internal cache effect in External Catalog queries—which also means that the platform delivers consistent response times without relying on caching.

6.2 Concurrency Test — Multi-User Performance

Performance was measured while increasing concurrent users from 1 → 2 → 4 → 8. Each session independently executes the same 99 queries, and exec_mem_limit is allocated inversely proportional to the number of concurrent users. Wall-clock represents the average elapsed time per session, reflecting the actual perceived duration of the benchmark.

Concurrent Users exec_mem_limit Total Queries Passed Failed Geo Mean Wall-clock
1 (Cold) 16 GB 99 99 0 1.474s 249.4s
1 (Warm) 16 GB 99 99 0 1.458s 246.5s
2 8 GB 198 196 2 2.044s 397.8s
4 4 GB 396 393 3 3.660s 703.0s
8 2 GB 792 746 46 6.055s 1,194.7s

Degradation Analysis (Geometric Mean Basis)

Concurrent Users Geo Mean Degradation Assessment
1 (baseline) 1.474s
2 2.044s 1.39× Good — near-linear scaling
4 3.660s 2.48× Target met (under 3×)
8 6.055s 4.11× Frequent OOM due to memory shortage

At 4 concurrent users, the degradation ratio of 2.48× met the predefined success criteria (under 3×). The 1.39× ratio at 2 users is particularly noteworthy, indicating minimal resource contention and effective concurrency handling by StarRocks.

6.3 OOM Failure Analysis

A detailed analysis of OOM failures that occurred during the 8-user test.

Failure Summary

Concurrent Users Passed Failed Failure Rate Primary Failed Queries
2 196 2 1.0% q78 (OOM in both sessions simultaneously)
4 393 3 0.8% q23 (2), q67 (1)
8 746 46 5.8% q67 (8), q23 (6), q78 (6), q04 (6), etc.

8-User Failed Queries — Top 10

Query Failures (/8 sessions) Query Characteristics
q67 8 Large GROUP BY + window functions
q23 6 Complex subqueries, CTEs
q78 6 Joins between large Fact tables
q04 6 Multiple joins + CTEs
q72 3 Large-scale joins
q70 2 ROLLUP + grouping
q66 2 Multiple UNIONs
q65 2 Aggregation + subqueries
q38 2 INTERSECT + multiple joins

OOM Mechanism

OOM failures follow a predictable pattern. While exec_mem_limit is a per-query soft limit, the physical memory of each CN node (16GB) is a shared resource across all sessions. With 8 sessions × 2GB = 16GB equaling the node's MemLimit exactly, OOM occurs whenever memory peaks coincide.

An interesting observation is that when the first session to reach peak memory is killed, the remaining sessions gain enough headroom to complete successfully. This explains why q67 failed in all 8 sessions while q72 failed in only 3.

Mitigation Strategies

  • Short-term: Apply SET GLOBAL enable_spill = true — spills intermediate results to disk when approaching memory limits, reducing OOM occurrences
  • Mid-term: Scale CN node memory to 32GB+ per node for stable operation with 8 or more concurrent users

7. Issues and Resolutions

A summary of the key issues encountered during the benchmark and how they were resolved. This is documented in detail for reference by those setting up similar environments.

# Issue Root Cause Resolution
1 Spark OOM (Java heap space) FanoutDataWriter opens Writers for all partitions simultaneously fanout.enabled=false + CTAS ORDER BY for sequential writes
2 DROP TABLE PURGE failure S3 file-metadata mismatch causing PURGE error safe_drop_table() — falls back to non-PURGE on failure
3 Spark session conflict Resource contention with existing demo session Terminated pre-initialized session, recreated dedicated session
4 StarRocks query incompatibility (3 queries) q49: reserved word conflict; q70·q86: ORDER BY + grouping incompatibility Alias renaming + subquery wrapping → 99/99 passed
5 Concurrency test OOM Physical memory contention across CN node sessions Recommended enable_spill=true + memory scaling

8. Overall Results Summary

8.1 Batch ELT Results

Metric Value
Tables 24
Total Rows 960M
Total Size ~30 GB
Ingestion Time 54 min
Throughput 296K rows/sec

8.2 Query Performance Results

Run Queries Pass Rate Geometric Mean Total Time
Cold 99 100% 1.474s 249.4s
Warm 99 100% 1.458s 246.5s

8.3 Concurrency Test Results

Concurrent Users exec_mem_limit Pass Rate Geometric Mean Degradation
1 (baseline) 16 GB 100% 1.474s
2 8 GB 99.0% 2.044s 1.39×
4 4 GB 99.2% 3.660s 2.48×
8 2 GB 94.2% 6.055s 4.11×

8.4 Success Criteria Results

Metric Criteria Result Verdict
SF-100 Batch Ingestion Time Under 3 hours 54 min ✅ PASS
Batch Ingestion Throughput Min 10K rows/s 296K rows/sec ✅ PASS
SF-100 Geometric Mean Under 60s 1.474s ✅ PASS
4-User Concurrency Degradation Under 3× 2.48× ✅ PASS
Data Integrity 100% source-target match 100% ✅ PASS
TPC-DS Query Pass Rate 99/99 99/99 (100%) ✅ PASS

All success criteria were met.

8.5 Key Performance Metrics

The benchmark results are distilled into three metrics, each delivering a different message to a different audience.

① Aggregate Performance: QphDS (Queries per Hour at Scale Factor)

"How much analytical workload can the system process per hour?"

QphDS is the official TPC-DS summary metric, making it the only standardized figure that enables direct comparison with global vendors such as Snowflake, Databricks, and Oracle.

Item Value
Formula SF × 3,600 / GeoMean (sec)
SF 100
GeoMean (Cold Run) 1.474s
QphDS@100 244,234

Calculation: 100 × 3,600 / 1.474 = 244,234. This is a GeoMean-based Power metric reflecting the representative response time across all 99 queries.

② Cost Efficiency: QphDS/$ (Price-Performance)

"How much processing power do you get per dollar?"

This metric distinguishes whether performance was achieved by throwing expensive hardware at the problem or through an efficient architecture. DIP achieved high performance on just three n2-standard-8 nodes—a relatively modest infrastructure—giving it a significant advantage in ROI. The cost calculation includes both infrastructure (HW) and DIP platform software costs.

Item Value
Infrastructure (HW) Monthly Cost ~$730 (n2-standard-8 × 3, GCP on-demand)
DIP SW Monthly Cost ~$2,160 ($3/hr × 720hr)
Total Monthly Cost ~$2,890
Hourly Cost (HW + SW) ~$4.01
QphDS/$ (hourly) ~60,900
QphDS/$ (monthly) ~84.5

Even with both HW and SW costs fully accounted for, DIP delivers strong price-performance compared to the total spend required by other cloud DW solutions to achieve equivalent QphDS.

③ User-Facing Performance: Query Geometric Mean

"How many seconds from query submission to results?"

Even non-technical business users intuitively understand "results in 1.4 seconds." In typical cloud DW environments, the TPC-DS SF-100 GeoMean ranges from 30 to 60 seconds—DIP delivers 20–40× faster response times.

Item Value
GeoMean (Cold) 1.474s
GeoMean (Warm) 1.458s
Cold/Warm Difference 1.1% (consistent performance without cache dependency)

Three Metrics Combined

Metric Meaning DIP Result Key Message
QphDS@100 Hourly processing capacity 244,234 Global top-tier for comparable hardware specs
QphDS/$ Processing efficiency per dollar ~60,900/hr Strong ROI even with full HW+SW costs
GeoMean Perceived response time 1.474s 20–40× faster than typical 30–60s alternatives

9. Conclusion and Recommendations

9.1 Conclusion

Batch ELT completed ingestion of 24 tables with 960M rows into Iceberg in 54 minutes (296K rows/sec)—approximately 3.3× faster than the 3-hour target.

Query Performance achieved a 100% pass rate on all 99 TPC-DS queries with a Geometric Mean of 1.474s, far exceeding the 60-second target. The minimal 1.1% difference between Cold and Warm runs demonstrates consistent performance without cache dependency. Translated to the official TPC-DS metric, this yields QphDS@100 = 244,234—global top-tier for comparable hardware specs given the modest 3-node n2-standard-8 configuration—with a strong price-performance ratio of QphDS/$ ~60,900/hr inclusive of both HW and SW costs.

Concurrency remained stable up to 4 users (2.48× degradation, 0.8% failure rate). At 8 users, memory contention led to OOM failures (5.8% failure rate), requiring memory scaling or Spill configuration.

9.2 Recommendations

Area Recommendation
8-User OOM Mitigation Apply SET GLOBAL enable_spill = true to enable disk spilling
CN Memory Scaling Recommend 32GB+ per CN node for stable 8+ user operation
Batch ELT Optimization Scale out to 6 nodes for ~25 min ingestion time with optimal cost-performance ratio

9.3 Coming Up Next

Part 2 will share the Kafka CDC and OLake CDC benchmark results. We will cover initial load speed comparisons, real-time change data latency measurements, and a comparative analysis from an operational convenience perspective.


This benchmark was conducted on the DIP v1.0 platform. Results may vary depending on environment and configuration.

Subscribe to PAASUP IDEAS

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe