By dongwooshin in Global — 16 Apr 2026

DIP Platform TPC-DS 1 TB Benchmark

Benchmarking the DIP platform at TPC-DS SF-1000 (1 TB, 6.3B rows): Batch ELT at 1,451K rows/sec, StarRocks query GeoMean of 2.2s, CN scaling analysis, and Trino comparison — all on S3 object storage.

Introduction
Key Results Summary
1. Benchmark Purpose and Background
2. Test Environment
3. Data Pipeline Architecture
4. Data Generation and Source Loading
5. Batch ELT: Spark → Iceberg Loading
6. Query Performance Benchmark
7. Iceberg Partition Strategy Comparison
8. Issues and Resolutions
9. Full Results Summary
10. Conclusions and Recommendations

Introduction

This is part of a series that quantitatively validates the data processing capabilities of the DIP platform using the TPC-DS standard benchmark. In the previous installment (Part 1-1), we validated Batch ELT and query performance at SF-100 (100 GB) scale. This post scales up 10× to SF-1000 (1 TB) to re-validate ingestion performance and query response times in a large-scale data environment.

#	Test Area	Scale	Post
①	Batch ELT + Query (SF-100)	960M rows / 30 GB	Part 1-1
②	Batch ELT + Query (SF-1000)	6.3B rows / 1 TB	This post

Key Results Summary

Batch ELT — 6.3B rows loaded in ~73 min (1,451K rows/sec) · 4.9× throughput improvement vs. SF-100

Query — 100% pass rate on all 99 TPC-DS queries, Geometric Mean 2.198s (CN4 Warm) · 3.1× faster than Trino

QphDS@1000 = 1,637,853 · Global competitiveness confirmed even on S3 Object Storage

CN Scaling — CN3→CN5 expansion yields 44.8% improvement in Warm Total; CN4 is the optimal cost-performance balance point

vs. Trino — Faster on 97 of 99 queries, up to 63× faster on complex queries

1. Benchmark Purpose and Background

While the SF-100 benchmark validated DIP platform performance, real enterprise environments typically handle analytical workloads in the hundreds of GB to multi-TB range. TPC-DS SF-1000 (1 TB) covers approximately 6.3 billion rows and is an appropriate scale for evaluating medium-to-large analytical workload performance.

The objectives of this benchmark are as follows:

Large-scale ingestion validation: Confirm Batch ELT pipeline throughput for loading 6.3B rows into Iceberg
Query response time validation: Measure response times for all 99 TPC-DS queries at SF-1000 scale
CN node scaling effects: Quantify performance changes across CN3/CN4/CN5 configurations
External comparison: Performance comparison against Trino

2. Test Environment

2.1 Infrastructure (Google Cloud)

Component	Machine Type	vCPU	RAM	Boot Disk	Data Disk
rke2-node × 5	n2-standard-16	16	64 GB	200 GB pd-ssd	1,000 GB pd-ssd
PostgreSQL × 1	e2-standard-16	16	64 GB	50 GB pd-ssd	1,000 GB pd-ssd

2.2 Software Stack

Component	Version	Role
PostgreSQL	16.x	Source RDBMS
Spark	3.5.x	Batch ELT engine
StarRocks	4.0.8 (FE 1 + CN 3~5)	Analytical query engine
Lakekeeper	latest	Iceberg REST Catalog
GCS	—	S3-compatible object storage
DuckDB	—	Data generation & Parquet → PostgreSQL loading

2.3 StarRocks CN Node Configuration

Item	Value
CN node count	3~5 (scaling test)
vCPU per node	15 cores
MemLimit per node	50.859 GiB
`be.conf` mem_limit	90%

3. Data Pipeline Architecture

The full benchmark pipeline consists of four stages:

┌──────────────────────────────────────────────────────────────────────────────┐
│                    DIP TPC-DS SF-1000 Benchmark Pipeline                     │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ① Data Generation      ② Source Load          ③ Batch ELT                 │
│  ┌───────────┐          ┌───────────┐           ┌───────────────┐           │
│  │  DuckDB   │ Parquet  │PostgreSQL │   JDBC    │ Spark-Iceberg │           │
│  │  dsdgen() │────────▶│  tpcds    │─────────▶│   Batch ELT   │           │
│  │  SF-1000  │ 24 tbl   │ _sf1000   │ ctid-based│               │           │
│  │ (8 splits)│          └───────────┘ chunk read└───────┬───────┘           │
│  └───────────┘                                           │ Write             │
│                                                           ▼                   │
│  ④ Query Benchmark                              ┌───────────────┐           │
│  ┌───────────────┐   External Catalog           │    Iceberg    │           │
│  │   StarRocks   │◀──────────────────────────── │   (GCS)       │           │
│  │  TPC-DS 99Q   │  bmt_catalog.tpcds_sf1000    │  Lakekeeper   │           │
│  └───────────────┘                               └───────────────┘           │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

① Data Generation: SF-1000 data is generated using DuckDB's TPC-DS extension (dsdgen). Fact tables are split into 8 parts during generation to prevent OOM, then converted to Parquet (ZSTD).
② Source Load: Parquet files are loaded from DuckDB into PostgreSQL, reproducing a scenario where the source system is an RDBMS.
③ Batch ELT: PostgreSQL is read via Spark JDBC and loaded into Iceberg tables backed by Lakekeeper. A ctid-based chunk-split parallel read approach is used to eliminate I/O bottlenecks.
④ Query Benchmark: StarRocks External Catalog queries Iceberg tables directly, executing all 99 TPC-DS queries.

4. Data Generation and Source Loading

4.1 TPC-DS SF-1000 Data Generation

Using DuckDB dsdgen(), 24 tables (7 Fact + 17 Dimension) were generated. SF-1000 is approximately 10× larger than SF-100.

Category	Tables	Total Rows	Notes
Fact	7	6,326,962,833	Generated in 8 splits
Dimension	17	20,313,689	—
Total	24	6,347,276,522 (~6.3B)	~1 TB

Key Fact table sizes:

Table	Rows	vs. SF-100
store_sales	2,879,987,999	×10.0
catalog_sales	1,439,980,416	×10.0
inventory	783,000,000	×2.0
web_sales	720,000,376	×10.0
store_returns	287,999,764	×10.0
catalog_returns	143,996,756	×10.0
web_returns	71,997,522	×10.0

4.2 PostgreSQL Load Results

Split Parquet files were loaded directly into PostgreSQL using the DuckDB postgres extension.

Item	Value
DuckDB memory_limit	40 GB
DuckDB threads	12
chunk_size	50,000,000 rows
Total tables loaded	24 (17 Dimension + 7 Fact)

5. Batch ELT: Spark → Iceberg Loading

5.1 Load Results Summary

Results of reading PostgreSQL via Spark JDBC and loading into Iceberg tables backed by Lakekeeper:

Item	Value
Successful tables	24/24 (100%)
Total rows loaded	6,347,276,522 (~6.3B)
Phase 1 — Dimension (17 tables)	67.1s
Phase 2 — Fact (7 tables)	4,305.9s (~71.8 min)
Wall-clock time	4,373.0s (72.9 min)
Sequential sum	4,450.7s
Concurrency effect	1.0×
Throughput	1,451K rows/sec

Throughput improved from 296K rows/sec at SF-100 to 1,451K rows/sec at SF-1000 — approximately 4.9×. This is attributed to the introduction of ctid-based chunk-split parallel reads and a significant increase in JDBC partition count for large tables (up to 240).

Loading was executed in a 2-phase strategy: Phase 1 loads the 17 lightweight Dimension tables in parallel, followed by Phase 2 loading the 7 large Fact tables in parallel. Fact table loading accounts for 98.5% of the total time.

5.2 Fact Table Loading Performance

Table	Rows	Time (s)	Throughput (rows/s)	Partition Column	ctid_chunks
store_sales	2,879,987,999	1,824.2	1,578,776	ss_sold_date_sk	240
catalog_sales	1,439,980,416	1,296.1	1,111,044	cs_sold_date_sk	240
web_sales	720,000,376	671.6	1,072,138	ws_sold_date_sk	120
inventory	783,000,000	111.1	7,046,857	inv_date_sk	72
store_returns	287,999,764	207.1	1,390,741	sr_returned_date_sk	24
catalog_returns	143,996,756	125.4	1,148,081	cr_returned_date_sk	24
web_returns	71,997,522	70.5	1,021,434	wr_returned_date_sk	24

The biggest bottleneck was store_sales, taking 1,824s to process ~2.9B rows. With ctid_chunks set to 240 for maximum parallelism, throughput reached 1,578K rows/sec. The inventory table, with only 4 columns, recorded an exceptionally high throughput of 7,046K rows/sec.

5.3 Spark Tuning Details

SF-1000 scale required additional tuning beyond what was needed for SF-100.

ctid-Based Chunk-Split Reads

At SF-100, JDBC reads used ORDER BY on a date column. At SF-1000, applying ORDER BY on date columns over multi-billion-row Fact tables caused PostgreSQL to access the entire heap via random I/O, causing a massive spike in block I/O. This was resolved by switching to ctid (physical row location) range-based chunked sequential reads, which eliminated random I/O and significantly improved speed.

Spark Executor Temp Directory Space

Temporary files generated during large Iceberg writes exceeded the default ephemeral storage limit of Kubernetes executor pods. This was resolved by mounting a dedicated temp directory and explicitly specifying ephemeral storage limits in the Spark configuration:

conf.set("spark.kubernetes.executor.request.ephemeral-storage", "30Gi")
conf.set("spark.kubernetes.executor.limit.ephemeral-storage", "50Gi")

Validation Results

Row count comparison between PostgreSQL and Iceberg confirmed all 24 tables OK — 100% data integrity achieved.

6. Query Performance Benchmark

All 99 TPC-DS queries were executed against Iceberg tables via StarRocks External Catalog (bmt_catalog.tpcds_sf1000). StarRocks performs Data Skipping and Predicate Pushdown using Iceberg metadata, querying directly via External Catalog without any data replication.

6.1 Power Test — Cold Run / Warm Run

Results from 1 Cold + 3 Warm runs on CN4 (4-node) configuration. q78 was run with enable_spill=true due to high Hash Join memory usage; all other queries used enable_spill=false.

Item	Cold Run	Warm Run (3-run avg)
Total queries	99	99
Passed	99 (100%)	99 (100%)
Failed	0	0
Geometric Mean	4.376s	2.198s
Total Time	837.7s	481.1s
Min	0.131s (q41)	0.134s (q41)
Max	61.4s (q23)	60.5s (q23)
Avg	8.463s	4.859s
query_mem_limit	35 GB	35 GB

Despite a 10× data increase from SF-100 to SF-1000, response time increased by only 1.49× (GeoMean: 1.474s → 2.198s), demonstrating that StarRocks' MPP architecture and Iceberg partition pruning remain effective at large scale.

The 2× GeoMean difference between Cold and Warm runs (vs. only 1.1% at SF-100) is due to DataCache covering only ~19% of the full ~1 TB dataset, meaning 81% must still be re-read from GCS on Warm runs. The root cause is a low datacache_max_flying_memory_mb setting that bottlenecks disk cache population speed.

6.2 CN Node Scaling Benchmark (CN3 / CN4 / CN5)

Performance was measured by varying the CN node count from 3 → 4 → 5. All configurations ran 1 Cold + 3 Warm runs, with all 99 queries passing.

Overall Summary

Metric	CN3	CN4	CN5	CN3→CN4	CN4→CN5	CN3→CN5
Cold Total (s)	892.4	837.7	474.9	-6.1%	-43.3%	-46.8%
Cold Geo Mean (s)	3.435	4.376	2.500	+27.4%	-42.9%	-27.2%
Warm avg Total (s)	732.0	481.1	404.0	-34.3%	-16.0%	-44.8%
Warm avg Geo Mean (s)	2.565	2.198	2.022	-14.3%	-8.0%	-21.2%
Warm avg Avg (s)	7.394	4.859	4.081	-34.3%	-16.0%	-44.8%
Warm avg Max (s)	164.75 (q67)	60.52 (q23)	46.62 (q23)	—	—	—

Why CN3 Cold GeoMean (3.435) is lower than CN4 (4.376): q67 cold takes 180.8s on CN3, heavily inflating the Total, but most simple queries are faster on CN3 cold. In Warm runs, CN3's CPU/memory limitations become apparent and the rankings reverse.

Scaling Efficiency Analysis

Range	Node Increase	Ideal Improvement	Actual Improvement (Warm Total)	Efficiency
CN3 → CN4	+33%	25.0%	+34.3%	137% (super-linear)
CN4 → CN5	+25%	20.0%	+16.0%	80% (diminishing returns)
CN3 → CN5	+67%	40.0%	+44.8%	112%

The super-linear improvement (137%) from CN3→CN4 is because large queries under memory pressure on CN3 (q67: 164s → 43s) normalized on CN4. From CN4→CN5, S3 I/O-bound queries hit a ceiling with the additional node, leading to diminishing returns (80%).

Top 15 Queries by Absolute Improvement CN3→CN5 (warm_avg)

Query	CN3	CN4	CN5	CN3→CN5 Abs	CN3→CN5 %
q67	164.750	42.965	33.869	-130.9s	-79.4%
q23	93.381	60.517	46.623	-46.8s	-50.1%
q78	60.817	47.162	32.424	-28.4s	-46.7%
q09	32.543	24.829	20.361	-12.2s	-37.4%
q28	20.870	14.915	12.573	-8.3s	-39.8%
q88	20.198	14.614	12.593	-7.6s	-37.7%
q04	18.302	13.438	11.293	-7.0s	-38.3%
q75	16.476	13.602	10.490	-6.0s	-36.3%
q65	15.772	12.367	9.773	-6.0s	-38.0%
q50	15.334	12.079	9.804	-5.5s	-36.1%
q97	13.412	10.849	8.461	-5.0s	-36.9%
q14	13.280	10.942	8.578	-4.7s	-35.4%
q11	10.084	7.508	6.336	-3.7s	-37.2%
q24	10.030	7.239	6.477	-3.6s	-35.4%
q38	6.432	5.107	4.296	-2.1s	-33.2%

Queries Slower on CN5 than CN4 (★, warm_avg, +5% or more)

Query	CN4	CN5	Diff	Characteristic
q44	8.327	10.799	+29.7%	store_sales full scan + window func
q22	5.062	6.214	+22.8%	inventory 12-month aggregation
q21	0.420	0.488	+16.2%	inventory simple join
q33	1.218	1.348	+10.7%	store_sales + date filter
q69	1.139	1.234	+8.3%	multi-fact subquery
q58	0.908	0.983	+8.3%	catalog/store/web 3-way
q98	0.863	0.929	+7.6%	store_sales date range
q77	1.095	1.176	+7.4%	multi-fact UNION
q68	1.007	1.076	+6.9%	store_sales + date
q41	0.134	0.143	+6.7%	no fact table (item only)
q53	1.098	1.171	+6.6%	store_sales monthly
q42	0.400	0.423	+5.7%	store_sales simple
q43	0.806	0.847	+5.1%	store_sales simple

These are mostly simple/small queries under 10 seconds where inter-node coordination overhead (network orchestration, distributed planner cost) exceeds actual computation time. The additional CN5 node introduces more coordination overhead that burdens small queries.

q67 Outlier Analysis

Config	Cold	Warm 1	Warm 2	Warm 3	Warm avg
CN3	180.786	165.210	163.808	165.232	164.750
CN4	43.278	42.335	44.067	42.494	42.965
CN5	33.301	34.234	33.618	33.756	33.869

q67 (store_sales 12-month multi-dimensional rollup) remained stuck at 164s on CN3 even in Warm runs. On CN4 it dropped to 43s (~4×), and further to 34s on CN5. The suspected cause is intermediate result spilling due to insufficient memory on CN3. CN3 is considered unfit for production at SF-1000 scale; CN4 or higher is the minimum recommendation.

Per-Query CN3 / CN4 / CN5 Comparison (Warm avg, in seconds)

★ = CN5 slower than CN4 (warm_avg, +5% or more)
▲ = CN5 improvement vs. CN3 ≥ 50%

Query	CN3	CN4	CN5	CN3→CN4	CN4→CN5	CN3→CN5	Notes
q01	1.696	1.578	1.299	-7.0%	-17.7%	-23.4%
q02	2.735	1.999	1.757	-26.9%	-12.1%	-35.8%
q03	0.924	0.807	0.704	-12.7%	-12.8%	-23.8%
q04	18.302	13.438	11.293	-26.6%	-16.0%	-38.3%
q05	2.259	1.846	1.832	-18.3%	-0.8%	-18.9%
q06	0.756	0.725	0.755	-4.1%	+4.1%	-0.1%
q07	2.456	1.759	1.788	-28.4%	+1.6%	-27.2%
q08	0.720	0.744	0.741	+3.3%	-0.4%	+2.9%
q09	32.543	24.829	20.361	-23.7%	-18.0%	-37.4%
q10	1.225	1.166	1.169	-4.8%	+0.3%	-4.6%
q11	10.084	7.508	6.336	-25.6%	-15.6%	-37.2%
q12	0.471	0.552	0.461	+17.2%	-16.5%	-2.1%
q13	3.216	2.513	2.120	-21.9%	-15.6%	-34.1%
q14	13.280	10.942	8.578	-17.6%	-21.6%	-35.4%
q15	1.075	2.378	2.305	+121.2%	-3.1%	+114.4%	★
q16	1.370	1.236	1.247	-9.8%	+0.9%	-9.0%
q17	3.130	2.583	2.115	-17.5%	-18.1%	-32.4%
q18	2.203	2.120	1.884	-3.8%	-11.1%	-14.5%
q19	0.920	0.805	0.807	-12.5%	+0.2%	-12.3%
q20	0.491	0.481	0.487	-2.0%	+1.2%	-0.8%
q21	0.434	0.420	0.488	-3.2%	+16.2%	+12.4%	★
q22	6.223	5.062	6.214	-18.7%	+22.8%	-0.1%	★
q23	93.381	60.517	46.623	-35.2%	-23.0%	-50.1%	▲
q24	10.030	7.239	6.477	-27.8%	-10.5%	-35.4%
q25	2.471	2.067	1.805	-16.4%	-12.7%	-26.9%
q26	1.277	1.126	0.992	-11.8%	-11.9%	-22.3%
q27	1.973	1.547	1.396	-21.6%	-9.8%	-29.2%
q28	20.870	14.915	12.573	-28.5%	-15.7%	-39.8%
q29	6.057	5.096	4.200	-15.9%	-17.6%	-30.7%
q30	1.455	1.433	1.392	-1.5%	-2.9%	-4.3%
q31	2.465	1.990	1.857	-19.3%	-6.7%	-24.7%
q32	0.507	0.462	0.472	-8.9%	+2.2%	-6.9%
q33	1.270	1.218	1.348	-4.1%	+10.7%	+6.1%	★
q34	1.237	1.042	1.058	-15.8%	+1.5%	-14.5%
q35	2.262	2.042	2.014	-9.7%	-1.4%	-11.0%
q36	2.066	1.657	1.377	-19.8%	-16.9%	-33.3%
q37	1.792	1.444	1.272	-19.4%	-11.9%	-29.0%
q38	6.432	5.107	4.296	-20.6%	-15.9%	-33.2%
q39	0.949	0.875	0.813	-7.8%	-7.1%	-14.3%
q40	0.710	2.371	0.800	+233.9%	-66.3%	+12.7%
q41	0.133	0.134	0.143	+0.8%	+6.7%	+7.5%	★
q42	0.434	0.400	0.423	-7.8%	+5.7%	-2.5%	★
q43	1.105	0.806	0.847	-27.1%	+5.1%	-23.3%	★
q44	13.563	8.327	10.799	-38.6%	+29.7%	-20.4%	★
q45	1.186	1.142	1.085	-3.7%	-5.0%	-8.5%
q46	2.161	1.765	1.663	-18.3%	-5.8%	-23.0%
q47	5.396	4.752	3.721	-11.9%	-21.7%	-31.0%
q48	2.157	1.656	1.680	-23.2%	+1.4%	-22.1%
q49	2.591	1.953	1.829	-24.6%	-6.3%	-29.4%
q50	15.334	12.079	9.804	-21.2%	-18.8%	-36.1%
q51	8.072	6.118	4.908	-24.2%	-19.8%	-39.2%
q52	0.434	0.420	0.437	-3.2%	+4.0%	+0.7%
q53	1.425	1.098	1.171	-22.9%	+6.6%	-17.8%	★
q54	1.224	1.044	1.072	-14.7%	+2.7%	-12.4%
q55	0.430	0.419	0.407	-2.6%	-2.9%	-5.3%
q56	0.965	0.956	0.999	-0.9%	+4.5%	+3.5%
q57	3.919	3.640	2.854	-7.1%	-21.6%	-27.2%
q58	0.979	0.908	0.983	-7.3%	+8.3%	+0.4%	★
q59	5.342	3.841	3.237	-28.1%	-15.7%	-39.4%
q60	1.226	1.219	1.176	-0.6%	-3.5%	-4.1%
q61	1.652	1.574	1.464	-4.7%	-7.0%	-11.4%
q62	1.456	1.193	1.017	-18.1%	-14.8%	-30.1%
q63	1.420	1.081	1.081	-23.9%	+0.0%	-23.9%
q64	12.278	9.353	8.643	-23.8%	-7.6%	-29.6%
q65	15.772	12.367	9.773	-21.6%	-21.0%	-38.0%
q66	1.685	1.423	1.284	-15.5%	-9.8%	-23.8%
q67	164.750	42.965	33.869	-73.9%	-21.2%	-79.4%	▲
q68	1.162	1.007	1.076	-13.3%	+6.9%	-7.4%	★
q69	1.223	1.139	1.234	-6.9%	+8.3%	+0.9%	★
q70	5.319	3.763	3.854	-29.3%	+2.4%	-27.5%	★
q71	2.001	1.910	1.882	-4.6%	-1.5%	-5.9%
q72	8.816	7.998	6.291	-9.3%	-21.3%	-28.6%
q73	0.769	0.832	0.770	+8.2%	-7.5%	+0.1%
q74	8.351	6.147	5.246	-26.4%	-14.7%	-37.2%
q75	16.476	13.602	10.490	-17.4%	-22.9%	-36.3%
q76	11.943	10.518	10.004	-11.9%	-4.9%	-16.2%
q77	1.191	1.095	1.176	-8.1%	+7.4%	-1.3%	★
q78	60.817	47.162	32.424	-22.4%	-31.3%	-46.7%
q79	2.283	2.118	1.741	-7.2%	-17.8%	-23.7%
q80	5.779	4.609	4.016	-20.2%	-12.9%	-30.5%
q81	1.560	1.453	1.420	-6.9%	-2.3%	-9.0%
q82	3.302	2.392	2.219	-27.6%	-7.2%	-32.8%
q83	0.848	0.811	0.809	-4.4%	-0.2%	-4.6%
q84	1.299	1.268	1.239	-2.4%	-2.3%	-4.6%
q85	2.136	2.048	1.925	-4.1%	-6.0%	-9.9%
q86	1.616	1.144	1.163	-29.2%	+1.7%	-28.0%
q87	6.678	5.091	4.505	-23.8%	-11.5%	-32.5%
q88	20.198	14.614	12.593	-27.6%	-13.8%	-37.7%
q89	1.515	1.378	1.122	-9.0%	-18.6%	-25.9%
q90	1.567	1.298	1.218	-17.2%	-6.2%	-22.3%
q91	0.852	0.805	0.814	-5.5%	+1.1%	-4.5%
q92	0.437	0.400	0.411	-8.5%	+2.7%	-5.9%
q93	8.565	6.319	5.448	-26.2%	-13.8%	-36.4%
q94	1.627	1.335	1.311	-17.9%	-1.8%	-19.4%
q95	3.700	3.047	2.948	-17.6%	-3.2%	-20.3%
q96	2.514	1.871	1.693	-25.6%	-9.5%	-32.7%
q97	13.412	10.849	8.461	-19.1%	-22.0%	-36.9%
q98	0.980	0.863	0.929	-11.9%	+7.6%	-5.2%	★
q99	2.297	1.913	1.714	-16.7%	-10.4%	-25.4%

Scaling Conclusions and Recommendations

Config	Warm Total	Warm GeoMean	Assessment
CN3	732.0s	2.565s	Memory pressure, q67 bottleneck
CN4	481.1s	2.198s	Optimal balance point
CN5	404.0s	2.022s	Additional improvement, some regressions

CN4 → CN5 ROI: 16.0% improvement in warm_avg Total; 80% efficiency vs. +25% node cost. Absolute time reduction: 77.1s.
CN3 is unfit for production: q67 warm at 164s is SLA-violating. q23 (93s) and q78 (61s) are also 2× worse than CN4.
CN5 recommended when: workloads are heavy on complex aggregation/join queries. For simple small queries (under 1s), slight regressions on q44, q22, q21 etc. require checking applicable SLA requirements.

6.3 Trino Comparison

A comparison against the Trino Official benchmark was performed on the same SF-1000 scale.

Environment Comparison

Item	Our (CN4)	Trino Official
Engine version	StarRocks 4.0.8	Trino 475
Node configuration	CN × 4	Coordinator 1 + Worker 4
vCPU/node	15 cores	16 cores
Memory/node	50.9 GiB	64 GiB
Storage	S3 Object Storage	NVMe SSD
Table format	Iceberg (Lakekeeper REST)	Iceberg Catalog
query_mem_limit	35 GB/CN	35 GB/Worker

Key difference: Trino Official uses local NVMe SSD, while our setup uses S3 Object Storage (remote I/O). StarRocks substantially outperforms Trino even under S3's I/O disadvantage.

Overall Summary Comparison

Metric	Our CN4 (Warm avg)	Trino Official	Ratio
Total	481.1s	2,552.1s	Our 5.30× faster
Geo Mean	2.198s	6.782s	Our 3.09× faster
Min	0.134s (q41)	0.271s (q41)	—
Max	60.5s (q23)	504.6s (q72)	—

Our CN4 is 5.30× faster than Trino (Total basis). We had an advantage on 97 of 99 queries.

Top 10 Queries with Largest Advantage over Trino

Query	Our (s)	Trino (s)	Speedup	Characteristic
q72	7.998	504.573	63.1×	catalog_sales + inventory + catalog_returns 3-way join
q54	1.044	31.864	30.5×	catalog/store/web_sales complex subquery
q11	7.508	133.230	17.7×	year-over-year self-join
q04	13.438	200.249	14.9×	3 fact tables, 3-year range
q14	10.942	143.172	13.1×	store_sales 3× self-join
q74	6.147	78.071	12.7×	store_sales + web_sales year-over-year
q16	1.236	14.035	11.4×	catalog_returns + catalog_sales
q02	1.999	21.289	10.7×	web_sales + catalog_sales monthly comparison
q95	3.047	29.607	9.7×	web_sales + web_returns
q47	4.752	42.704	9.0×	store_sales window function

Queries Where Trino is Faster

Query	Our (s)	Trino (s)	Ratio
q40	2.371	1.554	Trino 1.53× faster
q15	2.378	2.158	Trino 1.10× faster

Both q40 and q15 are simple catalog_sales date-filter queries. Trino's Iceberg file pruning appears to work more favorably for these specific access patterns.

Per-Query Our vs. Trino Comparison (Warm avg, in seconds)

Tri/Our: Higher = Our advantage

Query	Our (s)	Trino (s)	Tri/Our	Notes
q01	1.578	2.276	1.44×
q02	1.999	21.289	10.65×
q03	0.807	1.764	2.19×
q04	13.438	200.249	14.90×
q05	1.846	5.574	3.02×
q06	0.725	2.909	4.01×
q07	1.759	3.638	2.07×
q08	0.744	2.534	3.41×
q09	24.829	36.125	1.45×
q10	1.166	2.471	2.12×
q11	7.508	133.230	17.75×
q12	0.552	1.067	1.93×
q13	2.513	18.297	7.28×
q14	10.942	143.172	13.08×
q15	2.378	2.158	0.91×	Trino faster
q16	1.236	14.035	11.36×
q17	2.583	5.487	2.12×
q18	2.120	4.736	2.23×
q19	0.805	1.870	2.32×
q20	0.481	1.257	2.61×
q21	0.420	0.945	2.25×
q22	5.062	7.871	1.55×
q23	60.517	488.653	8.07×
q24	7.239	43.436	6.00×
q25	2.067	4.347	2.10×
q26	1.126	2.961	2.63×
q27	1.547	4.219	2.73×
q28	14.915	24.678	1.65×
q29	5.096	10.640	2.09×
q30	1.433	2.770	1.93×
q31	1.990	10.260	5.16×
q32	0.462	1.035	2.24×
q33	1.218	2.743	2.25×
q34	1.042	2.455	2.36×
q35	2.042	8.023	3.93×
q36	1.657	3.342	2.02×
q37	1.444	6.178	4.28×
q38	5.107	38.408	7.52×
q39	0.875	5.696	6.51×
q40	2.371	1.554	0.66×	Trino faster
q41	0.134	0.271	2.02×
q42	0.400	0.996	2.49×
q43	0.806	4.108	5.10×
q44	8.327	16.468	1.98×
q45	1.142	2.212	1.94×
q46	1.765	4.780	2.71×
q47	4.752	42.704	8.99×
q48	1.656	12.464	7.53×
q49	1.953	3.827	1.96×
q50	12.079	21.771	1.80×
q51	6.118	12.247	2.00×
q52	0.420	1.007	2.40×
q53	1.098	1.768	1.61×
q54	1.044	31.864	30.52×
q55	0.419	1.187	2.83×
q56	0.956	2.133	2.23×
q57	3.640	26.042	7.15×
q58	0.908	3.032	3.34×
q59	3.841	20.819	5.42×
q60	1.219	2.749	2.26×
q61	1.574	2.054	1.30×
q62	1.193	5.104	4.28×
q63	1.081	1.719	1.59×
q64	9.353	16.990	1.82×
q65	12.367	20.616	1.67×
q66	1.423	3.511	2.47×
q67	42.965	89.430	2.08×
q68	1.007	2.883	2.86×
q69	1.139	2.181	1.91×
q70	3.763	21.499	5.71×
q71	1.910	2.578	1.35×
q72	7.998	504.573	63.09×
q73	0.832	1.933	2.32×
q74	6.147	78.071	12.70×
q75	13.602	26.270	1.93×
q76	10.518	11.266	1.07×
q77	1.095	2.785	2.54×
q78	47.162	51.497	1.09×
q79	2.118	5.572	2.63×
q80	4.609	4.762	1.03×
q81	1.453	4.040	2.78×
q82	2.392	11.622	4.86×
q83	0.811	2.274	2.80×
q84	1.268	3.086	2.43×
q85	2.048	5.424	2.65×
q86	1.144	3.422	2.99×
q87	5.091	41.300	8.11×
q88	14.614	30.738	2.10×
q89	1.378	2.227	1.62×
q90	1.298	3.890	3.00×
q91	0.805	2.035	2.53×
q92	0.400	0.994	2.48×
q93	6.319	22.926	3.63×
q94	1.335	5.884	4.41×
q95	3.047	29.607	9.72×
q96	1.871	4.905	2.62×
q97	10.849	22.726	2.09×
q98	0.863	1.911	2.21×
q99	1.913	10.970	5.73×

7. Iceberg Partition Strategy Comparison

We tested whether the existing *_date_sk-based partitioning (1,823 partitions, ~20 MB per file) causes a Small Files Problem, and measured the performance difference against a no-partition + date_sk-sorted approach (Union method).

Experimental Conditions

Item	Original	Union
Partitioning	`*_date_sk` (1,823 partitions)	None
File sort order	PostgreSQL ctid order	`date_sk` ASC
File count (store_sales)	1,823 × ~20 MB	~56 × ~500 MB

Overall Summary Results

Metric	Original	Union	Difference
Warm avg Total	481.1s	538.6s	+11.9%
Warm avg Geo Mean	2.198s	2.396s	+9.0%
Per-query classification (±5%)	—	Improved 18 / Neutral 35 / Regressed 46	—

The Union method is generally worse than Original. The Small Files Problem is less severe than expected for two reasons.

First, DataCache absorbs small file I/O overhead starting from Warm runs. Since DataCache caches at the file level, even small files are fully cached on re-execution. The "large file" advantage of Union in Cold runs disappears in Warm runs.

Second, StarRocks' DPP (Dynamic Partition Pruning) operates effectively even with 1,823 partitions. In the Union approach, the absence of partition metadata forces fallback to Iceberg file-level statistics (min/max) for pruning, which is significantly less efficient. In particular, the inventory table (4 columns) has only 4–10 files under Union, making file-level pruning nearly ineffective, resulting in severe regressions on q21 (+149%), q39 (+101%), and q72 (+162%).

Conclusion: The Original partition strategy (*_date_sk) is optimal for the current environment.

8. Issues and Resolutions

Key issues encountered during the benchmark and how they were resolved:

#	Issue	Cause	Resolution
1	OOM during dat generation	Insufficient memory generating SF-1000 Fact table dat files	Split Fact table dat files into 8 parts
2	PostgreSQL load interrupted	catalog_sales load stopped at chunk 14	Used `resume_sf1000.py` to resume from chunk 15
3	PG→Iceberg block I/O spike	ORDER BY date column causing full heap random access in PostgreSQL	Switched to ctid-based chunk-split parallel read
4	Spark Executor out of disk space	K8s pod ephemeral storage limit exceeded	Explicitly set ephemeral-storage 30Gi/50Gi
5	CN node memory config variable	`exec_mem_limit` variable does not exist in this version	Changed to `query_mem_limit` (already applied)
6	q78 OOM	Hash Join memory overflow between large Fact tables	Applied `enable_spill=true` for q78 only
7	DataCache coverage at 19%	`datacache_max_flying_memory_mb=2` setting	Recommend increasing to `2048` (requires kubectl access)

9. Full Results Summary

9.1 Batch ELT Results

Item	SF-100	SF-1000
Tables	24	24
Total rows	960M	6.3B
Load time	54 min	72.9 min
Throughput	296K rows/sec	1,451K rows/sec
Data integrity	100%	100%

9.2 Query Performance Results (CN4 baseline)

Run	Queries	Pass Rate	Geometric Mean	Total Time
Cold	99	100%	4.376s	837.7s
Warm (3-run avg)	99	100%	2.198s	481.1s

9.3 CN Scaling Results

CN Config	Warm Geo Mean	Warm Total	vs. CN3
CN3	2.565s	732.0s	baseline
CN4	2.198s	481.1s	-34.3%
CN5	2.022s	404.0s	-44.8%

9.4 Trino Comparison Results

Comparison	Total	Geo Mean	Primary Cause
Our vs. Trino	Our 5.30× faster	Our 3.09× faster	StarRocks query engine optimization

9.5 Key Performance Metrics

Benchmark results summarized from three perspectives:

① Composite Performance Metric: QphDS@1000

Item	Value
Formula	`SF × 3,600 / GeoMean (seconds)`
SF	1,000
GeoMean (Warm avg, CN4)	2.198s
QphDS@1000	1,637,853

Calculation: 1,000 × 3,600 / 2.198 = 1,637,853. This is approximately 6.7× higher than QphDS@100 (244,234) from SF-100, demonstrating that performance degraded only 1.5× despite a 10× scale factor increase.

② Perceived Performance Metric: Query Geometric Mean

Item	Value
GeoMean (Cold, CN4)	4.376s
GeoMean (Warm avg, CN4)	2.198s
GeoMean (Warm avg, CN5)	2.022s
vs. Trino	3.09× faster

Achieving under 2.2s average response on Warm runs against 1 TB on S3 Object Storage is a level that fully supports interactive analyst query workflows.

③ Scale Efficiency Metric: SF-100 → SF-1000

Metric	SF-100	SF-1000	Ratio
Data scale	960M rows / 30 GB	6.3B rows / 1 TB	×10
Ingestion throughput	296K rows/sec	1,451K rows/sec	×4.9
Query GeoMean	1.474s	2.198s	×1.49
QphDS	244,234	1,637,853	×6.7

With a 10× data increase, query response time grew by only 1.5×, achieving sub-linear scaling.

10. Conclusions and Recommendations

10.1 Conclusions

Batch ELT completed loading 24 tables and 6.3B rows into Iceberg in approximately 73 minutes at SF-1000 scale (1,451K rows/sec). This is a 4.9× throughput improvement over SF-100 (296K rows/sec), confirming the effectiveness of ctid-based chunk-split parallel reads and large-scale JDBC partitioning.

Query performance achieved 100% pass rate on all 99 TPC-DS queries with a Geometric Mean of 2.198s (CN4 Warm). Despite a 10× data increase, GeoMean grew only 1.5×, validating the scalability of StarRocks' MPP architecture and Iceberg partition pruning. Performance was 5.3× faster than Trino despite the inherent disadvantage of S3 storage.

CN scaling confirmed CN4 as the optimal cost-performance balance point. CN3→CN4 achieved super-linear improvement (137% efficiency), while CN4→CN5 showed diminishing returns (80% efficiency) due to S3 I/O bounds. CN3 is deemed unfit for production due to SLA-violating bottlenecks on q67 (164s).

Partition strategy: the Original (*_date_sk partitioning) is broadly superior to Union (no partitions), because DataCache effectively absorbs small file I/O overhead and DPP operates accurately across all 1,823 partitions.

10.2 Recommendations

Item	Recommendation
CN node configuration	Minimum CN4 recommended; CN3 is unfit for SF-1000 production use
DataCache optimization	Increase `datacache_max_flying_memory_mb` to 2048 → expected to close Cold/Warm gap
q78 Spill	Keep `enable_spill=true` (prevents Hash Join memory overflow)
Partition strategy	Maintain `*_date_sk`-based partitioning
CN memory increase	`query_mem_limit=62 GiB`

This benchmark was conducted on the DIP v1.0 platform. Results may vary depending on environment and configuration.

Table of Contents