DIP Platform TPC-DS 1 TB Benchmark

Benchmarking the DIP platform at TPC-DS SF-1000 (1 TB, 6.3B rows): Batch ELT at 1,451K rows/sec, StarRocks query GeoMean of 2.2s, CN scaling analysis, and Trino comparison — all on S3 object storage.

DIP Platform TPC-DS 1 TB Benchmark

Table of Contents


Introduction

This is part of a series that quantitatively validates the data processing capabilities of the DIP platform using the TPC-DS standard benchmark. In the previous installment (Part 1-1), we validated Batch ELT and query performance at SF-100 (100 GB) scale. This post scales up 10× to SF-1000 (1 TB) to re-validate ingestion performance and query response times in a large-scale data environment.

# Test Area Scale Post
Batch ELT + Query (SF-100) 960M rows / 30 GB Part 1-1
Batch ELT + Query (SF-1000) 6.3B rows / 1 TB This post

Key Results Summary

Batch ELT — 6.3B rows loaded in ~73 min (1,451K rows/sec) · 4.9× throughput improvement vs. SF-100

Query — 100% pass rate on all 99 TPC-DS queries, Geometric Mean 2.198s (CN4 Warm) · 3.1× faster than Trino

QphDS@1000 = 1,637,853 · Global competitiveness confirmed even on S3 Object Storage

CN Scaling — CN3→CN5 expansion yields 44.8% improvement in Warm Total; CN4 is the optimal cost-performance balance point

vs. Trino — Faster on 97 of 99 queries, up to 63× faster on complex queries


1. Benchmark Purpose and Background

While the SF-100 benchmark validated DIP platform performance, real enterprise environments typically handle analytical workloads in the hundreds of GB to multi-TB range. TPC-DS SF-1000 (1 TB) covers approximately 6.3 billion rows and is an appropriate scale for evaluating medium-to-large analytical workload performance.

The objectives of this benchmark are as follows:

  1. Large-scale ingestion validation: Confirm Batch ELT pipeline throughput for loading 6.3B rows into Iceberg
  2. Query response time validation: Measure response times for all 99 TPC-DS queries at SF-1000 scale
  3. CN node scaling effects: Quantify performance changes across CN3/CN4/CN5 configurations
  4. External comparison: Performance comparison against Trino

2. Test Environment

2.1 Infrastructure (Google Cloud)

Component Machine Type vCPU RAM Boot Disk Data Disk
rke2-node × 5 n2-standard-16 16 64 GB 200 GB pd-ssd 1,000 GB pd-ssd
PostgreSQL × 1 e2-standard-16 16 64 GB 50 GB pd-ssd 1,000 GB pd-ssd

2.2 Software Stack

Component Version Role
PostgreSQL 16.x Source RDBMS
Spark 3.5.x Batch ELT engine
StarRocks 4.0.8 (FE 1 + CN 3~5) Analytical query engine
Lakekeeper latest Iceberg REST Catalog
GCS S3-compatible object storage
DuckDB Data generation & Parquet → PostgreSQL loading

2.3 StarRocks CN Node Configuration

Item Value
CN node count 3~5 (scaling test)
vCPU per node 15 cores
MemLimit per node 50.859 GiB
be.conf mem_limit 90%

3. Data Pipeline Architecture

The full benchmark pipeline consists of four stages:

┌──────────────────────────────────────────────────────────────────────────────┐
│                    DIP TPC-DS SF-1000 Benchmark Pipeline                     │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ① Data Generation      ② Source Load          ③ Batch ELT                 │
│  ┌───────────┐          ┌───────────┐           ┌───────────────┐           │
│  │  DuckDB   │ Parquet  │PostgreSQL │   JDBC    │ Spark-Iceberg │           │
│  │  dsdgen() │────────▶│  tpcds    │─────────▶│   Batch ELT   │           │
│  │  SF-1000  │ 24 tbl   │ _sf1000   │ ctid-based│               │           │
│  │ (8 splits)│          └───────────┘ chunk read└───────┬───────┘           │
│  └───────────┘                                           │ Write             │
│                                                           ▼                   │
│  ④ Query Benchmark                              ┌───────────────┐           │
│  ┌───────────────┐   External Catalog           │    Iceberg    │           │
│  │   StarRocks   │◀──────────────────────────── │   (GCS)       │           │
│  │  TPC-DS 99Q   │  bmt_catalog.tpcds_sf1000    │  Lakekeeper   │           │
│  └───────────────┘                               └───────────────┘           │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
  • ① Data Generation: SF-1000 data is generated using DuckDB's TPC-DS extension (dsdgen). Fact tables are split into 8 parts during generation to prevent OOM, then converted to Parquet (ZSTD).
  • ② Source Load: Parquet files are loaded from DuckDB into PostgreSQL, reproducing a scenario where the source system is an RDBMS.
  • ③ Batch ELT: PostgreSQL is read via Spark JDBC and loaded into Iceberg tables backed by Lakekeeper. A ctid-based chunk-split parallel read approach is used to eliminate I/O bottlenecks.
  • ④ Query Benchmark: StarRocks External Catalog queries Iceberg tables directly, executing all 99 TPC-DS queries.

4. Data Generation and Source Loading

4.1 TPC-DS SF-1000 Data Generation

Using DuckDB dsdgen(), 24 tables (7 Fact + 17 Dimension) were generated. SF-1000 is approximately 10× larger than SF-100.

Category Tables Total Rows Notes
Fact 7 6,326,962,833 Generated in 8 splits
Dimension 17 20,313,689
Total 24 6,347,276,522 (~6.3B) ~1 TB

Key Fact table sizes:

Table Rows vs. SF-100
store_sales 2,879,987,999 ×10.0
catalog_sales 1,439,980,416 ×10.0
inventory 783,000,000 ×2.0
web_sales 720,000,376 ×10.0
store_returns 287,999,764 ×10.0
catalog_returns 143,996,756 ×10.0
web_returns 71,997,522 ×10.0

4.2 PostgreSQL Load Results

Split Parquet files were loaded directly into PostgreSQL using the DuckDB postgres extension.

Item Value
DuckDB memory_limit 40 GB
DuckDB threads 12
chunk_size 50,000,000 rows
Total tables loaded 24 (17 Dimension + 7 Fact)

5. Batch ELT: Spark → Iceberg Loading

5.1 Load Results Summary

Results of reading PostgreSQL via Spark JDBC and loading into Iceberg tables backed by Lakekeeper:

Item Value
Successful tables 24/24 (100%)
Total rows loaded 6,347,276,522 (~6.3B)
Phase 1 — Dimension (17 tables) 67.1s
Phase 2 — Fact (7 tables) 4,305.9s (~71.8 min)
Wall-clock time 4,373.0s (72.9 min)
Sequential sum 4,450.7s
Concurrency effect 1.0×
Throughput 1,451K rows/sec

Throughput improved from 296K rows/sec at SF-100 to 1,451K rows/sec at SF-1000 — approximately 4.9×. This is attributed to the introduction of ctid-based chunk-split parallel reads and a significant increase in JDBC partition count for large tables (up to 240).

Loading was executed in a 2-phase strategy: Phase 1 loads the 17 lightweight Dimension tables in parallel, followed by Phase 2 loading the 7 large Fact tables in parallel. Fact table loading accounts for 98.5% of the total time.

5.2 Fact Table Loading Performance

Table Rows Time (s) Throughput (rows/s) Partition Column ctid_chunks
store_sales 2,879,987,999 1,824.2 1,578,776 ss_sold_date_sk 240
catalog_sales 1,439,980,416 1,296.1 1,111,044 cs_sold_date_sk 240
web_sales 720,000,376 671.6 1,072,138 ws_sold_date_sk 120
inventory 783,000,000 111.1 7,046,857 inv_date_sk 72
store_returns 287,999,764 207.1 1,390,741 sr_returned_date_sk 24
catalog_returns 143,996,756 125.4 1,148,081 cr_returned_date_sk 24
web_returns 71,997,522 70.5 1,021,434 wr_returned_date_sk 24

The biggest bottleneck was store_sales, taking 1,824s to process ~2.9B rows. With ctid_chunks set to 240 for maximum parallelism, throughput reached 1,578K rows/sec. The inventory table, with only 4 columns, recorded an exceptionally high throughput of 7,046K rows/sec.

5.3 Spark Tuning Details

SF-1000 scale required additional tuning beyond what was needed for SF-100.

ctid-Based Chunk-Split Reads

At SF-100, JDBC reads used ORDER BY on a date column. At SF-1000, applying ORDER BY on date columns over multi-billion-row Fact tables caused PostgreSQL to access the entire heap via random I/O, causing a massive spike in block I/O. This was resolved by switching to ctid (physical row location) range-based chunked sequential reads, which eliminated random I/O and significantly improved speed.

Spark Executor Temp Directory Space

Temporary files generated during large Iceberg writes exceeded the default ephemeral storage limit of Kubernetes executor pods. This was resolved by mounting a dedicated temp directory and explicitly specifying ephemeral storage limits in the Spark configuration:

conf.set("spark.kubernetes.executor.request.ephemeral-storage", "30Gi")
conf.set("spark.kubernetes.executor.limit.ephemeral-storage", "50Gi")

Validation Results

Row count comparison between PostgreSQL and Iceberg confirmed all 24 tables OK — 100% data integrity achieved.


6. Query Performance Benchmark

All 99 TPC-DS queries were executed against Iceberg tables via StarRocks External Catalog (bmt_catalog.tpcds_sf1000). StarRocks performs Data Skipping and Predicate Pushdown using Iceberg metadata, querying directly via External Catalog without any data replication.

6.1 Power Test — Cold Run / Warm Run

Results from 1 Cold + 3 Warm runs on CN4 (4-node) configuration. q78 was run with enable_spill=true due to high Hash Join memory usage; all other queries used enable_spill=false.

Item Cold Run Warm Run (3-run avg)
Total queries 99 99
Passed 99 (100%) 99 (100%)
Failed 0 0
Geometric Mean 4.376s 2.198s
Total Time 837.7s 481.1s
Min 0.131s (q41) 0.134s (q41)
Max 61.4s (q23) 60.5s (q23)
Avg 8.463s 4.859s
query_mem_limit 35 GB 35 GB

Despite a 10× data increase from SF-100 to SF-1000, response time increased by only 1.49× (GeoMean: 1.474s → 2.198s), demonstrating that StarRocks' MPP architecture and Iceberg partition pruning remain effective at large scale.

The 2× GeoMean difference between Cold and Warm runs (vs. only 1.1% at SF-100) is due to DataCache covering only ~19% of the full ~1 TB dataset, meaning 81% must still be re-read from GCS on Warm runs. The root cause is a low datacache_max_flying_memory_mb setting that bottlenecks disk cache population speed.

6.2 CN Node Scaling Benchmark (CN3 / CN4 / CN5)

Performance was measured by varying the CN node count from 3 → 4 → 5. All configurations ran 1 Cold + 3 Warm runs, with all 99 queries passing.

Overall Summary

Metric CN3 CN4 CN5 CN3→CN4 CN4→CN5 CN3→CN5
Cold Total (s) 892.4 837.7 474.9 -6.1% -43.3% -46.8%
Cold Geo Mean (s) 3.435 4.376 2.500 +27.4% -42.9% -27.2%
Warm avg Total (s) 732.0 481.1 404.0 -34.3% -16.0% -44.8%
Warm avg Geo Mean (s) 2.565 2.198 2.022 -14.3% -8.0% -21.2%
Warm avg Avg (s) 7.394 4.859 4.081 -34.3% -16.0% -44.8%
Warm avg Max (s) 164.75 (q67) 60.52 (q23) 46.62 (q23)

Why CN3 Cold GeoMean (3.435) is lower than CN4 (4.376): q67 cold takes 180.8s on CN3, heavily inflating the Total, but most simple queries are faster on CN3 cold. In Warm runs, CN3's CPU/memory limitations become apparent and the rankings reverse.

Scaling Efficiency Analysis

Range Node Increase Ideal Improvement Actual Improvement (Warm Total) Efficiency
CN3 → CN4 +33% 25.0% +34.3% 137% (super-linear)
CN4 → CN5 +25% 20.0% +16.0% 80% (diminishing returns)
CN3 → CN5 +67% 40.0% +44.8% 112%

The super-linear improvement (137%) from CN3→CN4 is because large queries under memory pressure on CN3 (q67: 164s → 43s) normalized on CN4. From CN4→CN5, S3 I/O-bound queries hit a ceiling with the additional node, leading to diminishing returns (80%).

Top 15 Queries by Absolute Improvement CN3→CN5 (warm_avg)

Query CN3 CN4 CN5 CN3→CN5 Abs CN3→CN5 %
q67 164.750 42.965 33.869 -130.9s -79.4%
q23 93.381 60.517 46.623 -46.8s -50.1%
q78 60.817 47.162 32.424 -28.4s -46.7%
q09 32.543 24.829 20.361 -12.2s -37.4%
q28 20.870 14.915 12.573 -8.3s -39.8%
q88 20.198 14.614 12.593 -7.6s -37.7%
q04 18.302 13.438 11.293 -7.0s -38.3%
q75 16.476 13.602 10.490 -6.0s -36.3%
q65 15.772 12.367 9.773 -6.0s -38.0%
q50 15.334 12.079 9.804 -5.5s -36.1%
q97 13.412 10.849 8.461 -5.0s -36.9%
q14 13.280 10.942 8.578 -4.7s -35.4%
q11 10.084 7.508 6.336 -3.7s -37.2%
q24 10.030 7.239 6.477 -3.6s -35.4%
q38 6.432 5.107 4.296 -2.1s -33.2%

Queries Slower on CN5 than CN4 (★, warm_avg, +5% or more)

Query CN4 CN5 Diff Characteristic
q44 8.327 10.799 +29.7% store_sales full scan + window func
q22 5.062 6.214 +22.8% inventory 12-month aggregation
q21 0.420 0.488 +16.2% inventory simple join
q33 1.218 1.348 +10.7% store_sales + date filter
q69 1.139 1.234 +8.3% multi-fact subquery
q58 0.908 0.983 +8.3% catalog/store/web 3-way
q98 0.863 0.929 +7.6% store_sales date range
q77 1.095 1.176 +7.4% multi-fact UNION
q68 1.007 1.076 +6.9% store_sales + date
q41 0.134 0.143 +6.7% no fact table (item only)
q53 1.098 1.171 +6.6% store_sales monthly
q42 0.400 0.423 +5.7% store_sales simple
q43 0.806 0.847 +5.1% store_sales simple

These are mostly simple/small queries under 10 seconds where inter-node coordination overhead (network orchestration, distributed planner cost) exceeds actual computation time. The additional CN5 node introduces more coordination overhead that burdens small queries.

q67 Outlier Analysis

Config Cold Warm 1 Warm 2 Warm 3 Warm avg
CN3 180.786 165.210 163.808 165.232 164.750
CN4 43.278 42.335 44.067 42.494 42.965
CN5 33.301 34.234 33.618 33.756 33.869

q67 (store_sales 12-month multi-dimensional rollup) remained stuck at 164s on CN3 even in Warm runs. On CN4 it dropped to 43s (~4×), and further to 34s on CN5. The suspected cause is intermediate result spilling due to insufficient memory on CN3. CN3 is considered unfit for production at SF-1000 scale; CN4 or higher is the minimum recommendation.

Per-Query CN3 / CN4 / CN5 Comparison (Warm avg, in seconds)

★ = CN5 slower than CN4 (warm_avg, +5% or more)
▲ = CN5 improvement vs. CN3 ≥ 50%

Query CN3 CN4 CN5 CN3→CN4 CN4→CN5 CN3→CN5 Notes
q01 1.696 1.578 1.299 -7.0% -17.7% -23.4%
q02 2.735 1.999 1.757 -26.9% -12.1% -35.8%
q03 0.924 0.807 0.704 -12.7% -12.8% -23.8%
q04 18.302 13.438 11.293 -26.6% -16.0% -38.3%
q05 2.259 1.846 1.832 -18.3% -0.8% -18.9%
q06 0.756 0.725 0.755 -4.1% +4.1% -0.1%
q07 2.456 1.759 1.788 -28.4% +1.6% -27.2%
q08 0.720 0.744 0.741 +3.3% -0.4% +2.9%
q09 32.543 24.829 20.361 -23.7% -18.0% -37.4%
q10 1.225 1.166 1.169 -4.8% +0.3% -4.6%
q11 10.084 7.508 6.336 -25.6% -15.6% -37.2%
q12 0.471 0.552 0.461 +17.2% -16.5% -2.1%
q13 3.216 2.513 2.120 -21.9% -15.6% -34.1%
q14 13.280 10.942 8.578 -17.6% -21.6% -35.4%
q15 1.075 2.378 2.305 +121.2% -3.1% +114.4%
q16 1.370 1.236 1.247 -9.8% +0.9% -9.0%
q17 3.130 2.583 2.115 -17.5% -18.1% -32.4%
q18 2.203 2.120 1.884 -3.8% -11.1% -14.5%
q19 0.920 0.805 0.807 -12.5% +0.2% -12.3%
q20 0.491 0.481 0.487 -2.0% +1.2% -0.8%
q21 0.434 0.420 0.488 -3.2% +16.2% +12.4%
q22 6.223 5.062 6.214 -18.7% +22.8% -0.1%
q23 93.381 60.517 46.623 -35.2% -23.0% -50.1%
q24 10.030 7.239 6.477 -27.8% -10.5% -35.4%
q25 2.471 2.067 1.805 -16.4% -12.7% -26.9%
q26 1.277 1.126 0.992 -11.8% -11.9% -22.3%
q27 1.973 1.547 1.396 -21.6% -9.8% -29.2%
q28 20.870 14.915 12.573 -28.5% -15.7% -39.8%
q29 6.057 5.096 4.200 -15.9% -17.6% -30.7%
q30 1.455 1.433 1.392 -1.5% -2.9% -4.3%
q31 2.465 1.990 1.857 -19.3% -6.7% -24.7%
q32 0.507 0.462 0.472 -8.9% +2.2% -6.9%
q33 1.270 1.218 1.348 -4.1% +10.7% +6.1%
q34 1.237 1.042 1.058 -15.8% +1.5% -14.5%
q35 2.262 2.042 2.014 -9.7% -1.4% -11.0%
q36 2.066 1.657 1.377 -19.8% -16.9% -33.3%
q37 1.792 1.444 1.272 -19.4% -11.9% -29.0%
q38 6.432 5.107 4.296 -20.6% -15.9% -33.2%
q39 0.949 0.875 0.813 -7.8% -7.1% -14.3%
q40 0.710 2.371 0.800 +233.9% -66.3% +12.7%
q41 0.133 0.134 0.143 +0.8% +6.7% +7.5%
q42 0.434 0.400 0.423 -7.8% +5.7% -2.5%
q43 1.105 0.806 0.847 -27.1% +5.1% -23.3%
q44 13.563 8.327 10.799 -38.6% +29.7% -20.4%
q45 1.186 1.142 1.085 -3.7% -5.0% -8.5%
q46 2.161 1.765 1.663 -18.3% -5.8% -23.0%
q47 5.396 4.752 3.721 -11.9% -21.7% -31.0%
q48 2.157 1.656 1.680 -23.2% +1.4% -22.1%
q49 2.591 1.953 1.829 -24.6% -6.3% -29.4%
q50 15.334 12.079 9.804 -21.2% -18.8% -36.1%
q51 8.072 6.118 4.908 -24.2% -19.8% -39.2%
q52 0.434 0.420 0.437 -3.2% +4.0% +0.7%
q53 1.425 1.098 1.171 -22.9% +6.6% -17.8%
q54 1.224 1.044 1.072 -14.7% +2.7% -12.4%
q55 0.430 0.419 0.407 -2.6% -2.9% -5.3%
q56 0.965 0.956 0.999 -0.9% +4.5% +3.5%
q57 3.919 3.640 2.854 -7.1% -21.6% -27.2%
q58 0.979 0.908 0.983 -7.3% +8.3% +0.4%
q59 5.342 3.841 3.237 -28.1% -15.7% -39.4%
q60 1.226 1.219 1.176 -0.6% -3.5% -4.1%
q61 1.652 1.574 1.464 -4.7% -7.0% -11.4%
q62 1.456 1.193 1.017 -18.1% -14.8% -30.1%
q63 1.420 1.081 1.081 -23.9% +0.0% -23.9%
q64 12.278 9.353 8.643 -23.8% -7.6% -29.6%
q65 15.772 12.367 9.773 -21.6% -21.0% -38.0%
q66 1.685 1.423 1.284 -15.5% -9.8% -23.8%
q67 164.750 42.965 33.869 -73.9% -21.2% -79.4%
q68 1.162 1.007 1.076 -13.3% +6.9% -7.4%
q69 1.223 1.139 1.234 -6.9% +8.3% +0.9%
q70 5.319 3.763 3.854 -29.3% +2.4% -27.5%
q71 2.001 1.910 1.882 -4.6% -1.5% -5.9%
q72 8.816 7.998 6.291 -9.3% -21.3% -28.6%
q73 0.769 0.832 0.770 +8.2% -7.5% +0.1%
q74 8.351 6.147 5.246 -26.4% -14.7% -37.2%
q75 16.476 13.602 10.490 -17.4% -22.9% -36.3%
q76 11.943 10.518 10.004 -11.9% -4.9% -16.2%
q77 1.191 1.095 1.176 -8.1% +7.4% -1.3%
q78 60.817 47.162 32.424 -22.4% -31.3% -46.7%
q79 2.283 2.118 1.741 -7.2% -17.8% -23.7%
q80 5.779 4.609 4.016 -20.2% -12.9% -30.5%
q81 1.560 1.453 1.420 -6.9% -2.3% -9.0%
q82 3.302 2.392 2.219 -27.6% -7.2% -32.8%
q83 0.848 0.811 0.809 -4.4% -0.2% -4.6%
q84 1.299 1.268 1.239 -2.4% -2.3% -4.6%
q85 2.136 2.048 1.925 -4.1% -6.0% -9.9%
q86 1.616 1.144 1.163 -29.2% +1.7% -28.0%
q87 6.678 5.091 4.505 -23.8% -11.5% -32.5%
q88 20.198 14.614 12.593 -27.6% -13.8% -37.7%
q89 1.515 1.378 1.122 -9.0% -18.6% -25.9%
q90 1.567 1.298 1.218 -17.2% -6.2% -22.3%
q91 0.852 0.805 0.814 -5.5% +1.1% -4.5%
q92 0.437 0.400 0.411 -8.5% +2.7% -5.9%
q93 8.565 6.319 5.448 -26.2% -13.8% -36.4%
q94 1.627 1.335 1.311 -17.9% -1.8% -19.4%
q95 3.700 3.047 2.948 -17.6% -3.2% -20.3%
q96 2.514 1.871 1.693 -25.6% -9.5% -32.7%
q97 13.412 10.849 8.461 -19.1% -22.0% -36.9%
q98 0.980 0.863 0.929 -11.9% +7.6% -5.2%
q99 2.297 1.913 1.714 -16.7% -10.4% -25.4%

Scaling Conclusions and Recommendations

Config Warm Total Warm GeoMean Assessment
CN3 732.0s 2.565s Memory pressure, q67 bottleneck
CN4 481.1s 2.198s Optimal balance point
CN5 404.0s 2.022s Additional improvement, some regressions
  • CN4 → CN5 ROI: 16.0% improvement in warm_avg Total; 80% efficiency vs. +25% node cost. Absolute time reduction: 77.1s.
  • CN3 is unfit for production: q67 warm at 164s is SLA-violating. q23 (93s) and q78 (61s) are also 2× worse than CN4.
  • CN5 recommended when: workloads are heavy on complex aggregation/join queries. For simple small queries (under 1s), slight regressions on q44, q22, q21 etc. require checking applicable SLA requirements.

6.3 Trino Comparison

A comparison against the Trino Official benchmark was performed on the same SF-1000 scale.

Environment Comparison

Item Our (CN4) Trino Official
Engine version StarRocks 4.0.8 Trino 475
Node configuration CN × 4 Coordinator 1 + Worker 4
vCPU/node 15 cores 16 cores
Memory/node 50.9 GiB 64 GiB
Storage S3 Object Storage NVMe SSD
Table format Iceberg (Lakekeeper REST) Iceberg Catalog
query_mem_limit 35 GB/CN 35 GB/Worker

Key difference: Trino Official uses local NVMe SSD, while our setup uses S3 Object Storage (remote I/O). StarRocks substantially outperforms Trino even under S3's I/O disadvantage.

Overall Summary Comparison

Metric Our CN4 (Warm avg) Trino Official Ratio
Total 481.1s 2,552.1s Our 5.30× faster
Geo Mean 2.198s 6.782s Our 3.09× faster
Min 0.134s (q41) 0.271s (q41)
Max 60.5s (q23) 504.6s (q72)

Our CN4 is 5.30× faster than Trino (Total basis). We had an advantage on 97 of 99 queries.

Top 10 Queries with Largest Advantage over Trino

Query Our (s) Trino (s) Speedup Characteristic
q72 7.998 504.573 63.1× catalog_sales + inventory + catalog_returns 3-way join
q54 1.044 31.864 30.5× catalog/store/web_sales complex subquery
q11 7.508 133.230 17.7× year-over-year self-join
q04 13.438 200.249 14.9× 3 fact tables, 3-year range
q14 10.942 143.172 13.1× store_sales 3× self-join
q74 6.147 78.071 12.7× store_sales + web_sales year-over-year
q16 1.236 14.035 11.4× catalog_returns + catalog_sales
q02 1.999 21.289 10.7× web_sales + catalog_sales monthly comparison
q95 3.047 29.607 9.7× web_sales + web_returns
q47 4.752 42.704 9.0× store_sales window function

Queries Where Trino is Faster

Query Our (s) Trino (s) Ratio
q40 2.371 1.554 Trino 1.53× faster
q15 2.378 2.158 Trino 1.10× faster

Both q40 and q15 are simple catalog_sales date-filter queries. Trino's Iceberg file pruning appears to work more favorably for these specific access patterns.

Per-Query Our vs. Trino Comparison (Warm avg, in seconds)

Tri/Our: Higher = Our advantage

Query Our (s) Trino (s) Tri/Our Notes
q01 1.578 2.276 1.44×
q02 1.999 21.289 10.65×
q03 0.807 1.764 2.19×
q04 13.438 200.249 14.90×
q05 1.846 5.574 3.02×
q06 0.725 2.909 4.01×
q07 1.759 3.638 2.07×
q08 0.744 2.534 3.41×
q09 24.829 36.125 1.45×
q10 1.166 2.471 2.12×
q11 7.508 133.230 17.75×
q12 0.552 1.067 1.93×
q13 2.513 18.297 7.28×
q14 10.942 143.172 13.08×
q15 2.378 2.158 0.91× Trino faster
q16 1.236 14.035 11.36×
q17 2.583 5.487 2.12×
q18 2.120 4.736 2.23×
q19 0.805 1.870 2.32×
q20 0.481 1.257 2.61×
q21 0.420 0.945 2.25×
q22 5.062 7.871 1.55×
q23 60.517 488.653 8.07×
q24 7.239 43.436 6.00×
q25 2.067 4.347 2.10×
q26 1.126 2.961 2.63×
q27 1.547 4.219 2.73×
q28 14.915 24.678 1.65×
q29 5.096 10.640 2.09×
q30 1.433 2.770 1.93×
q31 1.990 10.260 5.16×
q32 0.462 1.035 2.24×
q33 1.218 2.743 2.25×
q34 1.042 2.455 2.36×
q35 2.042 8.023 3.93×
q36 1.657 3.342 2.02×
q37 1.444 6.178 4.28×
q38 5.107 38.408 7.52×
q39 0.875 5.696 6.51×
q40 2.371 1.554 0.66× Trino faster
q41 0.134 0.271 2.02×
q42 0.400 0.996 2.49×
q43 0.806 4.108 5.10×
q44 8.327 16.468 1.98×
q45 1.142 2.212 1.94×
q46 1.765 4.780 2.71×
q47 4.752 42.704 8.99×
q48 1.656 12.464 7.53×
q49 1.953 3.827 1.96×
q50 12.079 21.771 1.80×
q51 6.118 12.247 2.00×
q52 0.420 1.007 2.40×
q53 1.098 1.768 1.61×
q54 1.044 31.864 30.52×
q55 0.419 1.187 2.83×
q56 0.956 2.133 2.23×
q57 3.640 26.042 7.15×
q58 0.908 3.032 3.34×
q59 3.841 20.819 5.42×
q60 1.219 2.749 2.26×
q61 1.574 2.054 1.30×
q62 1.193 5.104 4.28×
q63 1.081 1.719 1.59×
q64 9.353 16.990 1.82×
q65 12.367 20.616 1.67×
q66 1.423 3.511 2.47×
q67 42.965 89.430 2.08×
q68 1.007 2.883 2.86×
q69 1.139 2.181 1.91×
q70 3.763 21.499 5.71×
q71 1.910 2.578 1.35×
q72 7.998 504.573 63.09×
q73 0.832 1.933 2.32×
q74 6.147 78.071 12.70×
q75 13.602 26.270 1.93×
q76 10.518 11.266 1.07×
q77 1.095 2.785 2.54×
q78 47.162 51.497 1.09×
q79 2.118 5.572 2.63×
q80 4.609 4.762 1.03×
q81 1.453 4.040 2.78×
q82 2.392 11.622 4.86×
q83 0.811 2.274 2.80×
q84 1.268 3.086 2.43×
q85 2.048 5.424 2.65×
q86 1.144 3.422 2.99×
q87 5.091 41.300 8.11×
q88 14.614 30.738 2.10×
q89 1.378 2.227 1.62×
q90 1.298 3.890 3.00×
q91 0.805 2.035 2.53×
q92 0.400 0.994 2.48×
q93 6.319 22.926 3.63×
q94 1.335 5.884 4.41×
q95 3.047 29.607 9.72×
q96 1.871 4.905 2.62×
q97 10.849 22.726 2.09×
q98 0.863 1.911 2.21×
q99 1.913 10.970 5.73×

7. Iceberg Partition Strategy Comparison

We tested whether the existing *_date_sk-based partitioning (1,823 partitions, ~20 MB per file) causes a Small Files Problem, and measured the performance difference against a no-partition + date_sk-sorted approach (Union method).

Experimental Conditions

Item Original Union
Partitioning *_date_sk (1,823 partitions) None
File sort order PostgreSQL ctid order date_sk ASC
File count (store_sales) 1,823 × ~20 MB ~56 × ~500 MB

Overall Summary Results

Metric Original Union Difference
Warm avg Total 481.1s 538.6s +11.9%
Warm avg Geo Mean 2.198s 2.396s +9.0%
Per-query classification (±5%) Improved 18 / Neutral 35 / Regressed 46

The Union method is generally worse than Original. The Small Files Problem is less severe than expected for two reasons.

First, DataCache absorbs small file I/O overhead starting from Warm runs. Since DataCache caches at the file level, even small files are fully cached on re-execution. The "large file" advantage of Union in Cold runs disappears in Warm runs.

Second, StarRocks' DPP (Dynamic Partition Pruning) operates effectively even with 1,823 partitions. In the Union approach, the absence of partition metadata forces fallback to Iceberg file-level statistics (min/max) for pruning, which is significantly less efficient. In particular, the inventory table (4 columns) has only 4–10 files under Union, making file-level pruning nearly ineffective, resulting in severe regressions on q21 (+149%), q39 (+101%), and q72 (+162%).

Conclusion: The Original partition strategy (*_date_sk) is optimal for the current environment.


8. Issues and Resolutions

Key issues encountered during the benchmark and how they were resolved:

# Issue Cause Resolution
1 OOM during dat generation Insufficient memory generating SF-1000 Fact table dat files Split Fact table dat files into 8 parts
2 PostgreSQL load interrupted catalog_sales load stopped at chunk 14 Used resume_sf1000.py to resume from chunk 15
3 PG→Iceberg block I/O spike ORDER BY date column causing full heap random access in PostgreSQL Switched to ctid-based chunk-split parallel read
4 Spark Executor out of disk space K8s pod ephemeral storage limit exceeded Explicitly set ephemeral-storage 30Gi/50Gi
5 CN node memory config variable exec_mem_limit variable does not exist in this version Changed to query_mem_limit (already applied)
6 q78 OOM Hash Join memory overflow between large Fact tables Applied enable_spill=true for q78 only
7 DataCache coverage at 19% datacache_max_flying_memory_mb=2 setting Recommend increasing to 2048 (requires kubectl access)

9. Full Results Summary

9.1 Batch ELT Results

Item SF-100 SF-1000
Tables 24 24
Total rows 960M 6.3B
Load time 54 min 72.9 min
Throughput 296K rows/sec 1,451K rows/sec
Data integrity 100% 100%

9.2 Query Performance Results (CN4 baseline)

Run Queries Pass Rate Geometric Mean Total Time
Cold 99 100% 4.376s 837.7s
Warm (3-run avg) 99 100% 2.198s 481.1s

9.3 CN Scaling Results

CN Config Warm Geo Mean Warm Total vs. CN3
CN3 2.565s 732.0s baseline
CN4 2.198s 481.1s -34.3%
CN5 2.022s 404.0s -44.8%

9.4 Trino Comparison Results

Comparison Total Geo Mean Primary Cause
Our vs. Trino Our 5.30× faster Our 3.09× faster StarRocks query engine optimization

9.5 Key Performance Metrics

Benchmark results summarized from three perspectives:

① Composite Performance Metric: QphDS@1000

Item Value
Formula SF × 3,600 / GeoMean (seconds)
SF 1,000
GeoMean (Warm avg, CN4) 2.198s
QphDS@1000 1,637,853

Calculation: 1,000 × 3,600 / 2.198 = 1,637,853. This is approximately 6.7× higher than QphDS@100 (244,234) from SF-100, demonstrating that performance degraded only 1.5× despite a 10× scale factor increase.

② Perceived Performance Metric: Query Geometric Mean

Item Value
GeoMean (Cold, CN4) 4.376s
GeoMean (Warm avg, CN4) 2.198s
GeoMean (Warm avg, CN5) 2.022s
vs. Trino 3.09× faster

Achieving under 2.2s average response on Warm runs against 1 TB on S3 Object Storage is a level that fully supports interactive analyst query workflows.

③ Scale Efficiency Metric: SF-100 → SF-1000

Metric SF-100 SF-1000 Ratio
Data scale 960M rows / 30 GB 6.3B rows / 1 TB ×10
Ingestion throughput 296K rows/sec 1,451K rows/sec ×4.9
Query GeoMean 1.474s 2.198s ×1.49
QphDS 244,234 1,637,853 ×6.7

With a 10× data increase, query response time grew by only 1.5×, achieving sub-linear scaling.


10. Conclusions and Recommendations

10.1 Conclusions

Batch ELT completed loading 24 tables and 6.3B rows into Iceberg in approximately 73 minutes at SF-1000 scale (1,451K rows/sec). This is a 4.9× throughput improvement over SF-100 (296K rows/sec), confirming the effectiveness of ctid-based chunk-split parallel reads and large-scale JDBC partitioning.

Query performance achieved 100% pass rate on all 99 TPC-DS queries with a Geometric Mean of 2.198s (CN4 Warm). Despite a 10× data increase, GeoMean grew only 1.5×, validating the scalability of StarRocks' MPP architecture and Iceberg partition pruning. Performance was 5.3× faster than Trino despite the inherent disadvantage of S3 storage.

CN scaling confirmed CN4 as the optimal cost-performance balance point. CN3→CN4 achieved super-linear improvement (137% efficiency), while CN4→CN5 showed diminishing returns (80% efficiency) due to S3 I/O bounds. CN3 is deemed unfit for production due to SLA-violating bottlenecks on q67 (164s).

Partition strategy: the Original (*_date_sk partitioning) is broadly superior to Union (no partitions), because DataCache effectively absorbs small file I/O overhead and DPP operates accurately across all 1,823 partitions.

10.2 Recommendations

Item Recommendation
CN node configuration Minimum CN4 recommended; CN3 is unfit for SF-1000 production use
DataCache optimization Increase datacache_max_flying_memory_mb to 2048 → expected to close Cold/Warm gap
q78 Spill Keep enable_spill=true (prevents Hash Join memory overflow)
Partition strategy Maintain *_date_sk-based partitioning
CN memory increase query_mem_limit=62 GiB

This benchmark was conducted on the DIP v1.0 platform. Results may vary depending on environment and configuration.

Subscribe to PAASUP IDEAS

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe