Benchmark

90 Metrics, Complete Transparency

Every claim on this site is backed by reproducible benchmarks. We show our wins, our losses, and even our overfitting analysis.

ColorBench: deterministic, float64, 13 categories. All data, scripts, and checkpoints are open source. No cherry-picking. Full train/test split analysis included.

66 GenSpace wins
15 ties
9 OKLab wins

83 internal + 7 independent validation = 90 total metrics

MetricSpace

Color Difference Accuracy

MetricSpace predicts human color perception more accurately than any existing standard, including the industry-standard CIEDE2000 formula.

STRESS (CIE 217:2016) evaluated on COMBVD (3,813 pairs from 6 sub-datasets), MacAdam 1974 (128 pairs), and Human Feedback (3,552 judgements). Lower is better.
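For reference, the STRESS index has a simple closed form; a minimal NumPy sketch of the CIE 217:2016 definition (function and variable names are ours, not the ColorBench code):

```python
import numpy as np

def stress(delta_e, delta_v):
    """STRESS (CIE 217:2016). delta_e: formula-predicted color differences;
    delta_v: visual (human-judged) differences. 0 means perfect agreement
    after optimal scaling; lower is better."""
    de = np.asarray(delta_e, dtype=float)
    dv = np.asarray(delta_v, dtype=float)
    f = np.sum(de ** 2) / np.sum(de * dv)   # optimal scaling factor
    return 100.0 * np.sqrt(np.sum((de - f * dv) ** 2) / np.sum((f * dv) ** 2))

# A formula exactly proportional to the visual data scores 0, at any scale:
print(stress([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # → 0.0
```

Because of the optimal scaling factor, STRESS is invariant to the overall scale of a ΔE formula, which is what makes metrics with different magnitudes comparable.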

GenSpace

Generation Benchmark: 66-9

GenSpace is purpose-built for creating colors: gradients, palettes, gamut mapping. Head-to-head against OKLab (the current CSS standard), it wins 66 out of 90 metrics.

83 internal metrics (deterministic, float64) + 7 independent validation metrics. Opponent: OKLab with standard Euclidean deltaE. Same test harness, same precision.
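The opponent baseline is easy to reproduce from Björn Ottosson's published OKLab definition; a sketch of sRGB → OKLab plus the Euclidean ΔE (our own transcription, not the ColorBench harness):

```python
import math

def srgb_to_oklab(r, g, b):
    """Convert sRGB components in [0, 1] to OKLab (Ottosson's published matrices)."""
    # 1. Undo the sRGB transfer function to get linear light.
    def lin(u):
        return u / 12.92 if u <= 0.04045 else ((u + 0.055) / 1.055) ** 2.4
    r, g, b = lin(r), lin(g), lin(b)
    # 2. Linear RGB -> LMS-like cone response.
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    # 3. Cube-root compression, then the final linear map to (L, a, b).
    l, m, s = l ** (1 / 3), m ** (1 / 3), s ** (1 / 3)
    return (0.2104542553 * l + 0.7936177850 * m - 0.0040720468 * s,
            1.9779984951 * l - 2.4285922050 * m + 0.4505937099 * s,
            0.0259040371 * l + 0.7827717662 * m - 0.8086757660 * s)

def delta_e_oklab(c1, c2):
    """Plain Euclidean distance in OKLab -- the 'standard deltaE' baseline."""
    return math.dist(srgb_to_oklab(*c1), srgb_to_oklab(*c2))
```

A quick sanity check on the matrices: sRGB white maps to L ≈ 1 with a ≈ b ≈ 0.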

66
GenSpace wins
9
OKLab wins
15
Ties
13
Categories
By Category

Category Breakdown

Click any card to expand and see every individual metric in that category.

83 internal metrics in 12 categories + 7 independent validation metrics. All deterministic, float64.

Gamut

25W · 2T · 0L

How well the space maps to real device screens

Cusp validity, boundary smoothness, clipping across sRGB/P3/Rec2020

27 metrics

Gradient

7W · 1T · 3L

How smooth colors blend between two endpoints

CV of perceptual step size, hue drift, banding metrics

11 metrics
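The headline gradient metric, CV of perceptual step size, is simple to state: interpolate between two endpoints, measure the perceptual distance of each successive step, and take the coefficient of variation (std/mean). A sketch, with a caller-supplied `delta_e` standing in for either space's distance function (the actual ColorBench sampling and interpolation space may differ):

```python
import numpy as np

def gradient_step_cv(c1, c2, delta_e, n=64):
    """CV of perceptual step sizes along a straight-line gradient.
    0 means perfectly even steps; lower is better.
    delta_e(a, b) is any perceptual distance function (supplied by caller)."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    ts = np.linspace(0.0, 1.0, n)
    stops = [(1 - t) * c1 + t * c2 for t in ts]            # linear interpolation
    steps = [delta_e(a, b) for a, b in zip(stops, stops[1:])]
    return np.std(steps) / np.mean(steps)

# With Euclidean distance in the same coordinates, steps are uniform -> CV ~ 0:
cv = gradient_step_cv([0, 0, 0], [1, 1, 1],
                      lambda a, b: float(np.linalg.norm(a - b)))
```

In the benchmark, the gradient is interpolated in the space under test and the steps are measured with a perceptual ΔE, so a high CV means visibly uneven blending.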

Application

9W · 3T · 0L

Real-world tasks like palettes, tints, and accessibility

Palette generation, gamut mapping, WCAG contrast, animation

12 metrics
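The WCAG contrast item above uses the standard WCAG 2.x formula, which is independent of either color space; a sketch (relative luminance from linearized sRGB):

```python
def wcag_contrast(rgb1, rgb2):
    """WCAG 2.x contrast ratio between two sRGB colors (components in [0, 1])."""
    def luminance(rgb):
        def lin(u):  # undo the sRGB transfer function
            return u / 12.92 if u <= 0.04045 else ((u + 0.055) / 1.055) ** 2.4
        r, g, b = (lin(u) for u in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b
    hi, lo = sorted((luminance(rgb1), luminance(rgb2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

print(round(wcag_contrast((0, 0, 0), (1, 1, 1)), 1))  # black vs white → 21.0
```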

Perceptual

5W · 0T · 0L

Agreement with how humans actually see color

Munsell, MacAdam, Hung-Berns hue linearity validation

5 metrics

Structural

4W · 2T · 2L

Mathematical properties that affect reliability

Hue reversals, OOG excursion, chroma amplification, LMS

8 metrics

Hue

2W · 0T · 0L

Whether hue labels match human expectation

Hue RMS vs Munsell, primary lightness range

2 metrics

Achromatic

2W · 0T · 0L

Perfect grays without color contamination

Gray ramp chroma residual under sRGB and D65

2 metrics

Advanced

2W · 4T · 0L

Edge cases and stress tests

1000-trip roundtrip, Jacobian condition, 8-bit precision

6 metrics

Special

2W · 0T · 1L

Problem areas where OKLab is known to struggle

Yellow chroma, blue-to-white midpoint, red-to-white shift

3 metrics

Banding

1W · 1T · 0L

Visible stepping artifacts in gradients

Invisible step ratio, duplicate 8-bit bucket count

2 metrics

Accessibility

1W · 0T · 1L

Usability for colorblind viewers

CVD simulation minimum step deltaE (protan/deutan)

2 metrics

Numerical

0W · 2T · 1L

Mathematical precision of conversions

Round-trip error across sRGB, P3, Rec2020 (float64)

3 metrics

Full Data

Metric Explorer

Search and filter all 83 internal benchmark metrics. Every number is reproducible.

Sortable, filterable table of all metrics. Values are from ColorBench HEAD running GenSpace v10-BH vs OKLab, both at float64 precision.

Metric | Unit | Category | OKLab | GenSpace | Winner
CVD deutan min step ΔE | ΔE | Accessibility | 0.16 | 0.11 | OKLab
CVD protan min step ΔE | ΔE | Accessibility | 0.13 | 0.13 | GenSpace
Gray ramp pure D65 C* | C* | Achromatic | 7.61e-7 | 1.88e-15 | GenSpace
Gray ramp sRGB C* | C* | Achromatic | 5.57e-7 | 6.30e-13 | GenSpace
1000-trip RT | max ΔE | Advanced | 5.77e-13 | 6.97e-14 | GenSpace
8-bit exact/10K | count | Advanced | 10,000 | 10,000 | Tie
Animation frame CV | % | Advanced | 62.1 | 60.1 | GenSpace
Channel mono violations | count | Advanced | 0 | 0 | Tie
Cross-gamut amplification | × | Advanced | 1.0× | 1.0× | Tie
Jacobian condition | - | Advanced | 6.49 | 6.47 | Tie
Chroma preservation (no mud) | - | Application | 0.414 | 0.41 | Tie
Data viz min pairwise ΔE | ΔE | Application | 14.34 | 14.5 | GenSpace
Eased animation CV | % | Application | 64.1 | 64.5 | Tie
Muddy gradients (C drop >50%) | count | Application | 12 | 12 | Tie
Multi-stop gradient CV | % | Application | 37.7 | 37.3 | GenSpace
Palette harmony accuracy | ° | Application | 11.7 | 9.1 | GenSpace
Palette L* spacing | % | Application | 78.9 | 76.5 | GenSpace
Photo gamut map fidelity | ° | Application | 0.98 | 0.96 | GenSpace
Shade palette hue drift | ° | Application | 8.6 | 6 | GenSpace
Shade palette worst hue drift | ° | Application | 20.9 | 20.4 | GenSpace
Tint/shade hue preservation | ° | Application | 8.8 | 7.9 | GenSpace
WCAG midpoint contrast | ratio | Application | 2.73 | 2.88 | GenSpace
Duplicate 8-bit steps | % | Banding | 16.1 | 13.8 | GenSpace
Invisible gradient steps | % | Banding | 99.7 | 99.8 | Tie
Cusp smoothness (max jump) | - | Gamut | 0.805 | 0.072 | GenSpace
Gamut volume fill | % | Gamut | 1 | 1 | Tie
P3 boundary bad hues | count | Gamut | 121 | 4 | GenSpace
P3 boundary continuity | - | Gamut | 0.444 | 0.079 | GenSpace
P3 boundary mean jump | - | Gamut | 0.02 | 0.003 | GenSpace
P3 cliff max | % | Gamut | 0.16 | 0.1 | GenSpace
P3 cusp mean smoothness | - | Gamut | 0.008 | 0.005 | GenSpace
P3 cusp smoothness | - | Gamut | 0.778 | 0.039 | GenSpace
P3 invalid cusps | count | Gamut | 52 | 0 | GenSpace
P3 mono violations | count | Gamut | 71 | 0 | GenSpace
P3 valid cusps | cusps | Gamut | 308/360 | 360/360 | GenSpace
Rec2020 boundary bad hues | count | Gamut | 130 | 20 | GenSpace
Rec2020 boundary continuity | - | Gamut | 0.562 | 0.248 | GenSpace
Rec2020 boundary mean jump | - | Gamut | 0.025 | 0.006 | GenSpace
Rec2020 cliff max | % | Gamut | 0.72 | 0.18 | GenSpace
Rec2020 cusp mean smoothness | - | Gamut | 0.007 | 0.006 | GenSpace
Rec2020 cusp smoothness | - | Gamut | 0.756 | 0.157 | GenSpace
Rec2020 mono violations | count | Gamut | 60 | 1 | GenSpace
Rec2020 valid cusps | cusps | Gamut | 360/360 | 360/360 | Tie
sRGB boundary bad hues | count | Gamut | 123 | 15 | GenSpace
sRGB boundary continuity | - | Gamut | 0.545 | 0.301 | GenSpace
sRGB boundary mean jump | - | Gamut | 0.02 | 0.005 | GenSpace
sRGB cliff max | % | Gamut | 0.65 | 0.16 | GenSpace
sRGB cusp mean smoothness | - | Gamut | 0.009 | 0.005 | GenSpace
sRGB invalid cusps | count | Gamut | 61 | 0 | GenSpace
sRGB mono violations | count | Gamut | 88 | 0 | GenSpace
sRGB valid cusps | cusps | Gamut | 299/360 | 360/360 | GenSpace
3-color gradient CV | % | Gradient | 39.34 | 34.92 | GenSpace
Banding mean | steps | Gradient | 1.84 | 1.83 | Tie
Bright gradient CV (L>0.6) | % | Gradient | 32.18 | 32.76 | OKLab
Cross-lightness gradient CV | % | Gradient | 22.08 | 18.03 | GenSpace
Dark gradient CV (L<0.4) | % | Gradient | 47.28 | 37.24 | GenSpace
Gradient CV (mean) | % | Gradient | 38.2 | 37.45 | GenSpace
Gradient CV (p95) | % | Gradient | 136.69 | 138.78 | OKLab
High-chroma gradient CV | % | Gradient | 29.63 | 26.92 | GenSpace
Max hue drift (non-crossing) | ° | Gradient | 112.7 | 77.5 | GenSpace
Near-achromatic gradient CV | % | Gradient | 85.95 | 106.73 | OKLab
Worst-case gradient CV | % | Gradient | 412.6 | 377.7 | GenSpace
Hue RMS | ° | Hue | 30.1 | 27.5 | GenSpace
Primary L range | - | Hue | 0.516 | 0.6 | GenSpace
Round-trip P3 16.7M | max ΔE | Numerical | 1.67e-15 | 2.00e-15 | Tie
Round-trip Rec2020 2.1M | max ΔE | Numerical | 1.55e-15 | 1.78e-15 | Tie
Round-trip sRGB 16.7M | max ΔE | Numerical | 1.67e-15 | 5.64e-8 | OKLab
Hue agreement w/ CIE Lab | ° | Perceptual | 8.5 | 8.3 | GenSpace
Hue leaf constancy | ° | Perceptual | 73.3 | 59.8 | GenSpace
MacAdam isotropy | ratio | Perceptual | 1.99 | 1.78 | GenSpace
Munsell Hue spacing | % | Perceptual | 18.5 | 11.4 | GenSpace
Munsell Value uniformity | % | Perceptual | 2.8 | 0.16 | GenSpace
Blue→White midpoint G/R | ratio | Special | 1.408 | 1.513 | GenSpace
Red→White midpoint G-B | - | Special | 0.062 | 0.063 | OKLab
Yellow chroma | - | Special | 0.211 | 0.333 | GenSpace
Extreme chroma amplification | × | Structural | 5.79× | 3.79× | GenSpace
Hue reversal max angle | ° | Structural | 3 | 0.6 | GenSpace
Hue reversals | count | Structural | 80 | 66 | GenSpace
Negative LMS colors | % | Structural | 0 | 0 | Tie
OOG excursion pairs | % | Structural | 9.8 | 9.8 | Tie
OOG max distance | - | Structural | 0.11 | 0.103 | GenSpace
Primary hue disc (P3) | ° | Structural | 1.08 | 1.37 | OKLab
Primary hue disc (sRGB) | ° | Structural | 1.31 | 1.65 | OKLab
Showing 83 of 83 metrics: 60 GenSpace wins, 15 ties, 8 OKLab wins.

Independent Validation

Tested on Data We Never Trained On

Three independent datasets from published color science research (1980-1998). GenSpace wins 6 out of 7 metrics against OKLab on data it never saw.

Hung & Berns 1995 (hue linearity, 168 samples), Ebner & Fairchild 1998 (constant-hue surfaces, 321 samples), Pointer 1980 (real surface color gamut, 576 boundary points). None used in optimization.

Hung & Berns 1995

168 samples

Do straight lines in the color space match straight lines in human hue perception?

Hue linearity: angular deviation from constant-hue lines. 12 hues, 13 targets each, 9 observers.

Hue | OKLab | GenSpace
Red | 2.42 | 2.84
Red-yellow | 4.24 | 3.55
Yellow | 5.71 | 4.51
Yellow-green | 5.48 | 4.69
Green | 1.64 | 1.69
Green-cyan | 7.13 | 8.77
Cyan | 9.37 | 9.94
Cyan-blue | 7.01 | 5.58
Blue | 4.62 | 3.29
Blue-magenta | 4.52 | 3.76
Magenta | 3.59 | 3.54
Magenta-red | 3.78 | 4.43

Score: 6W · 1T · 5L
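The hue-linearity score above reduces to an angular spread per constant-hue series; a simplified NumPy sketch (our own illustration, not the benchmark implementation, which compares against observer-defined hue lines):

```python
import numpy as np

def hue_linearity_deviation(ab_pairs):
    """Mean absolute angular deviation (degrees) of a constant-hue series.
    ab_pairs: (N, 2) array of (a, b) chroma coordinates for samples humans
    judge to be the same hue. 0 = the series lies on a single hue line."""
    ab = np.asarray(ab_pairs, dtype=float)
    angles = np.degrees(np.arctan2(ab[:, 1], ab[:, 0]))
    # Circular mean as the reference hue line (handles 360-degree wrap-around).
    mean = np.degrees(np.arctan2(np.sin(np.radians(angles)).mean(),
                                 np.cos(np.radians(angles)).mean()))
    dev = (angles - mean + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    return np.abs(dev).mean()

# Points on a single ray from the origin deviate by ~0 degrees:
d = hue_linearity_deviation([(0.1, 0.1), (0.2, 0.2), (0.3, 0.3)])
```

A space with good hue linearity keeps these series on straight rays, so desaturating a color does not shift its apparent hue.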

Ebner & Fairchild 1998

321 samples

When you change lightness and chroma but keep the hue name the same, does the color space agree?

Constant perceived-hue surface deviation. 15 hues. Mean and max angular deviation from ideal.

Space | Mean | Max
CIE Lab | 2.95 | 16.0
OKLab | 2.23 | 8.1
GenSpace | 2.10 | 8.6
GenSpace wins mean deviation. OKLab wins max deviation.

Pointer's Gamut 1980

576 pts

How uniformly does each space represent real-world surface colors?

Real surface color boundary (16 L levels, 36 hue angles). Chroma CV, boundary smoothness, hue uniformity.

Space | C* CV | Smoothness | Hue CV
CIE Lab | 0.479 | 0.144 | 0.034
OKLab | 0.413 | 0.132 | 0.370
GenSpace | 0.404 | 0.125 | 0.262
CIE Lab wins hue uniformity because Pointer's data is defined in CIE Lab coordinates.

Independent Validation Total

Across 3 published datasets (1980-1998), none used in training

GenSpace 6, OKLab 1 (7 metrics)

Honesty Check

Overfitting Analysis

We optimized MetricSpace on color difference data. Could it have simply memorized the answers? We tested this honestly, with held-out data the model never saw during training, and we show the results.

80/20 stratified train-test split (seed=42, 3050/763 pairs), multiple DOF configurations tested. A train-test gap exists (+1.8), but the held-out test still beats all competitors. Cross-validated estimate: STRESS 24.3.

Model | Params (DOF) | Train STRESS | Test STRESS | Gap
v20b baseline | 0 | 27.72 | 27.57 | -0.15
v21 (full-data) | 72 | 22.14 | 23.91 | +1.77
Phase 1 train-only | 6 | 25.35 | 25.65 | +0.30
Phase 1+2 train-only | 48 | 22.78 | 24.59 | +1.82
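The split protocol is straightforward to replicate; a sketch of an 80/20 stratified split with a fixed seed (NumPy; the labels and counts here are illustrative, not the actual ColorBench schema):

```python
import numpy as np

def stratified_split(labels, test_frac=0.2, seed=42):
    """80/20 split stratified by sub-dataset label, reproducible via seed.
    Returns sorted (train_idx, test_idx) index arrays."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    labels = np.asarray(labels)
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)   # indices of this sub-dataset
        rng.shuffle(idx)
        n_test = max(1, round(test_frac * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.sort(train), np.sort(test)

# Each sub-dataset contributes ~20% of its pairs to the held-out set:
labels = ["combvd"] * 100 + ["macadam"] * 50
tr, te = stratified_split(labels)
```

Stratifying by sub-dataset keeps the held-out set representative of every data source, so the test STRESS is not dominated by whichever sub-dataset happened to land in it.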

Key Findings

  • Mild overfitting exists. +1.8 STRESS gap between train and held-out test. This is real and acknowledged.
  • The gap comes from data variance, not memorization. v21 (full-data) gap = +1.77, train-only gap = +1.82. Nearly identical.
  • The held-out test still beats all competitors. Held-out test STRESS (24.59) vs CIEDE2000 (29.20) on the full dataset = 16% better.
  • MacAdam generalizes independently. MacAdam 1974 (never trained on) = 19.12, consistent with v21's 19.51.
  • Low DOF shows a near-zero gap. 6 DOF shows only +0.30. Overfitting scales with parameters but stays controlled.

Published STRESS: 22.48 (full-data optimized) | Cross-validated: ~24.3 | Both are still #1 among all tested competitors.

MetricSpace

When to Use MetricSpace

MetricSpace is purpose-built for color difference prediction — not generation. Use it when you need to measure, not create.

Quality Control

Print, display, textile color matching. 23% lower STRESS than CIEDE2000 on COMBVD.

Color Matching Tolerance

Pair-dependent SL/SC weighting adapts to the specific lightness and chroma of each color pair.

A/B Testing

Human Feedback STRESS = 23.26 vs CIEDE2000's 62.54. 63% better at predicting real user preferences.

Accessibility Checking

Euclidean deltaE that's actually perceptually calibrated. OKLab STRESS = 47 — not designed for distance prediction.

Research

Transparent pipeline, fully invertible, open source. All parameters, datasets, and optimization scripts published.

Verdict

Are We the Best?

Color difference measurement: Yes.

MetricSpace v21 achieves the lowest published STRESS on COMBVD, MacAdam, and Human Feedback simultaneously. No other metric matches human perception this accurately across multiple datasets. Caveat: cross-validated estimate is ~24.3 (not the published 22.48), and CIEDE2000 wins on 3 of 6 COMBVD sub-datasets.

Generation tasks: Best-rounded, not best at everything.

GenSpace wins 66-9 against OKLab across 90 metrics, including 6-1 on independent third-party datasets that OKLab was optimized on. However, OKLab is better for near-achromatic gradients (by 24%), CVD deutan palettes (by 43%), and has native CSS support via oklch(). CIE Lab's hue angles remain the established industry reference for hue naming.

Overall: First to do both.

Helmlab is the first color space library to achieve state-of-the-art in both perceptual color difference measurement and visual generation quality simultaneously. No other space does both.

Methodology

How We Test

Deterministic

Every metric is computed at float64 precision with fixed seeds. Run the same code, get the same numbers. No stochastic variation.

Head-to-head

Same test harness, same input colors, same precision for both spaces. Winner is determined by the metric's natural direction (lower or higher is better).
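Concretely, the per-metric win/tie/loss call can be sketched as follows (the relative tie tolerance is our illustrative assumption; ColorBench's actual rule may differ):

```python
def judge(oklab, genspace, lower_is_better=True, rel_tol=0.01):
    """Score one metric head-to-head. Values within rel_tol (relative) of
    each other count as a tie; otherwise the metric's natural direction
    (lower or higher is better) decides the winner."""
    scale = max(abs(oklab), abs(genspace), 1e-12)  # guard against 0/0
    if abs(oklab - genspace) / scale <= rel_tol:
        return "Tie"
    better = min if lower_is_better else max
    return "GenSpace" if better(oklab, genspace) == genspace else "OKLab"

print(judge(0.805, 0.072))                        # cusp jump, lower wins → GenSpace
print(judge(2.73, 2.88, lower_is_better=False))   # contrast, higher wins → GenSpace
```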

No Cherry-picking

All metrics are reported, including our 9 losses. We do not add or remove metrics based on whether we win them.

Open Source

ColorBench source code, all data files, and checkpoint parameters are publicly available on GitHub for independent verification.

What We Did NOT Test

  • HDR color differences — no HDR psychophysical dataset available
  • Cross-surround conditions — all data is standard viewing conditions
  • Display-specific gamuts — only standard sRGB / Display P3 / Rec.2020 primaries
  • Computational performance — not benchmarked (GenSpace ~35 FLOPs, MetricSpace ~150 FLOPs per color)
  • Perceptual ranking with human observers — GenSpace metrics test geometric/mathematical properties, not direct human preference