Convert National Lab Datasets to Evals
Journal
We are building XRD-Bench: a high-quality AI evaluation benchmark derived from U.S. national laboratory X-ray datasets (SIMPOD, opXRD, LLNL CT, Materials Project XAS). The goal is to test whether models and agents can perform real materials science reasoning on powder X-ray diffraction data — from basic pattern perception through full characterization. We model our eval after MaCBench, ScienceAgentBench, MatSciBench, ChemBench, and GPQA, borrowing their best structural and methodological patterns. All eval data must trace back to real national lab datasets — no synthetic or made-up data.
Sessions
Session 1 (2026-03-20): Built a 14-structure SIMPOD dataset covering all 7 crystal systems, created the Pillar 1 Perception question generator with 6 question types (83 questions total) and a scoring module with answer extraction; found and fixed a downsampled-diffractogram resolution bug. Ready for baseline model eval. Full log
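Answer extraction is the fragile part of a scorer like this. A minimal sketch of the approach (the <answer> tag convention and function names are assumptions, not the actual XRD-Bench code): prefer an explicitly tagged answer, fall back to the last number in the response, and grade numerics with a relative tolerance.

```python
# Sketch of extraction + numeric grading; tag convention and names are
# assumptions, not the actual XRD-Bench scorer.
import re

ANSWER_TAG = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_answer(response: str) -> str | None:
    """Prefer an explicit <answer> tag; fall back to the last number."""
    tagged = ANSWER_TAG.search(response)
    if tagged:
        return tagged.group(1)
    numbers = NUMBER.findall(response)
    return numbers[-1] if numbers else None

def grade_numeric(predicted: str | None, truth: float, rel_tol: float = 0.01) -> bool:
    """Correct if the extracted value is within rel_tol of ground truth."""
    if predicted is None:
        return False
    try:
        value = float(predicted)
    except ValueError:
        return False
    return abs(value - truth) <= rel_tol * max(abs(truth), 1e-9)
```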
Session 2 (2026-03-21): Ran first baseline eval (Claude Sonnet 4): v1 scored 90.4% (75/83); trace analysis revealed 3 token-truncation bugs, 3 question-quality issues, and 1 ground-truth error; applied 5 targeted fixes (max_tokens, P1.2 prompt, P1.3 trigonal/hexagonal accept_also, P1.5 rewording, P1.6 margin check); v2 scored 98.8% (82/83) with 1 genuine model error remaining (a boundary numerical comparison). Full log
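The accept_also fix generalizes to any question with more than one defensible label: trigonal structures are routinely indexed on hexagonal axes, so both answers must pass. A hedged sketch of such a grader (the truth-record schema is hypothetical):

```python
# Sketch of an accept_also-aware categorical grader; the truth-record
# schema is hypothetical, not the actual eval format.
def grade_categorical(predicted: str, truth: dict) -> bool:
    """truth example: {"answer": "trigonal", "accept_also": ["hexagonal"]}"""
    accepted = {truth["answer"], *truth.get("accept_also", [])}
    return predicted.strip().lower() in {a.lower() for a in accepted}

assert grade_categorical("Hexagonal", {"answer": "trigonal",
                                       "accept_also": ["hexagonal"]})
```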
Session 3 (2026-03-22): Cross-model comparison: Claude 3 Haiku scored 74.7% vs Sonnet 4's 98.8% (a 24.1pp gap), validating Pillar 1's discriminative power; the gap is driven entirely by counting tasks (P1.2: -86pp, P1.6: -57pp) while all 4 lookup/comparison types sit at ceiling (100%) for both models; Haiku systematically undercounts table rows (mean ratio 0.45×) because it skips chain-of-thought reasoning. Full log
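The undercount ratio is cheap to compute once per-question counts are logged. A sketch, assuming each counting result records the predicted and true row counts (field names are hypothetical):

```python
# Sketch of the undercount metric: mean of predicted/true row counts over
# the counting questions. Field names are hypothetical.
from statistics import mean

def mean_count_ratio(results: list[dict]) -> float:
    """results: [{"predicted": 3, "truth": 7}, ...]"""
    return mean(r["predicted"] / r["truth"] for r in results if r["truth"] > 0)
```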
Session 4 (2026-03-23): Added 4 harder question variants (P1.1b Nth-rank peak, P1.3b crystal system without space group, P1.4b close-intensity comparison, P1.5b near-boundary presence), expanding v3 to 137 questions across 10 types; Haiku accuracy on the new types: P1.1b 38%, P1.4b 62%, P1.5b 86%, P1.3b 93%; discriminative question coverage more than doubled from 33% to 69% while Sonnet holds 99.3% overall. Full log
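An Nth-rank peak question forces the model to rank the whole intensity column rather than spot the maximum. A possible generator, assuming peak tables are lists of {two_theta, intensity} records (the schema is illustrative):

```python
# Hypothetical sketch of the P1.1b variant: ask for the Nth-strongest peak,
# which requires ranking the full table rather than finding the maximum.
def ordinal(n: int) -> str:
    """1 -> '1st', 2 -> '2nd', 11 -> '11th', 23 -> '23rd'."""
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(n if n < 20 else n % 10, "th")
    return f"{n}{suffix}"

def nth_rank_peak_question(peaks: list[dict], n: int) -> dict:
    """peaks: [{"two_theta": 26.6, "intensity": 100.0}, ...] (schema assumed)."""
    ranked = sorted(peaks, key=lambda p: p["intensity"], reverse=True)
    return {
        "question": f"What is the 2-theta position (in degrees) of the "
                    f"{ordinal(n)} most intense peak in the table?",
        "answer": ranked[n - 1]["two_theta"],
    }
```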
Session 5 (2026-03-24): Expanded structure set from 14 to 34 (20 real minerals via pymatgen, balanced 4-5 per crystal system), generating v4 with 335 questions across 10 types; Haiku eval on v4 scored 70.1% (235/335), stable vs v3's 71.5% — confirmed P1.2 peak-count cliff: 0% accuracy on tables with 4+ peaks; Sonnet v4 results pending. Full log
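pymatgen can produce the needed peak tables directly from structure files; a minimal sketch using its standard Cu Kα diffraction calculator (the CIF filename is a placeholder, not an actual dataset file):

```python
# Minimal sketch of peak-table generation with pymatgen's Cu K-alpha XRD
# calculator; the CIF filename is a placeholder.
from pymatgen.core import Structure
from pymatgen.analysis.diffraction.xrd import XRDCalculator

def peak_table_from_cif(cif_path: str) -> list[dict]:
    structure = Structure.from_file(cif_path)
    pattern = XRDCalculator(wavelength="CuKa").get_pattern(
        structure, two_theta_range=(10, 90)
    )
    return [
        {"two_theta": round(x, 2), "intensity": round(y, 1)}
        for x, y in zip(pattern.x, pattern.y)
    ]

peaks = peak_table_from_cif("quartz.cif")  # placeholder path
```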
Session 6 (2026-03-25): Completed v4 analysis: Sonnet 98.2% (329/335) vs Haiku 70.1% (235/335), a 28.1pp gap in line with v3's 27.8pp; discovered the P1.2 failure is token-driven (Haiku 0% on 4+ peaks, Sonnet 88%), not a question-design flaw; counting tasks (62-82pp gaps) remain the dominant discriminator; ✓ Pillar 1 approved for publication. Full log
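The peak-count cliff falls out of a simple bucketing of P1.2 results by true peak count; a sketch, again with hypothetical field names:

```python
# Sketch of the cliff analysis: per-bucket accuracy keyed by true peak
# count. Field names are hypothetical.
from collections import defaultdict

def accuracy_by_peak_count(results: list[dict]) -> dict[int, float]:
    """results: [{"n_peaks": 5, "correct": False}, ...]"""
    buckets: defaultdict[int, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r["n_peaks"]].append(r["correct"])
    return {n: sum(flags) / len(flags) for n, flags in sorted(buckets.items())}
```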
Session 7 (2026-03-26): Verified Sonnet v4 results (329/335 = 98.2%; all 6 errors are boundary cases); created PILLAR1_FINAL_SIGNOFF.md with a full validation checklist and psychometric analysis; Pillar 1 ready for external review and publication; began Pillar 2 architecture design. Full log
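For the psychometric side, classical item statistics (per-item difficulty and a top-vs-bottom discrimination index) are a reasonable starting point; a sketch, not the actual sign-off code:

```python
# Sketch of classical item analysis: difficulty = fraction correct per item;
# discrimination = top-third minus bottom-third accuracy. The response-matrix
# layout is an assumption, not the sign-off's actual format.
def item_stats(responses: list[list[bool]]) -> list[dict]:
    """responses[i][j]: did run i answer item j correctly."""
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=totals.__getitem__)
    k = max(1, len(responses) // 3)
    bottom, top = order[:k], order[-k:]
    stats = []
    for j in range(len(responses[0])):
        difficulty = sum(row[j] for row in responses) / len(responses)
        discrimination = (sum(responses[i][j] for i in top) / k
                          - sum(responses[i][j] for i in bottom) / k)
        stats.append({"item": j, "difficulty": difficulty,
                      "discrimination": discrimination})
    return stats
```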
Session 8 (2026-03-27): Groundwork for Pillar 2: verified SIMPOD metadata completeness (all 34 structures, space groups, lattice parameters), researched the opXRD dataset (92k diffractograms, 900+ with full structural labels, public and CC-BY licensed), confirmed Phase 1 data ready (P2.1 space group, P2.2 lattice parameter, P2.8 code task); created PILLAR2_IMPLEMENTATION_READINESS.md with a detailed data assessment, timeline (Sessions 9-13), and mitigation plan for minor gaps; ✓ Ready for Phase 1 implementation pending user sign-off on Pillar 2 architecture. Full log
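A completeness check like this reduces to asserting that every structure record carries the fields Pillar 2 questions will depend on; a sketch with an assumed record schema:

```python
# Sketch of the metadata completeness check; field names are assumed to
# mirror what P2.1 (space group) and P2.2 (lattice parameters) will query.
REQUIRED = ("space_group", "a", "b", "c", "alpha", "beta", "gamma")

def incomplete_records(records: list[dict]) -> list[str]:
    """Return IDs of structure records missing any required field."""
    return [
        r["id"]
        for r in records
        if any(r.get(field) in (None, "") for field in REQUIRED)
    ]
```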