Bottom Line
MLE-Bench (Machine Learning Engineering Benchmark) directly measures AI systems’ ability to complete real ML engineering tasks. GPT-5.5 scores 36%, up 13 percentage points from GPT-5.4’s 23%. This means AI can now autonomously complete about one-third of standard ML engineering tasks, while the other two-thirds still need human intervention.
What Is MLE-Bench
MLE-Bench tests AI systems across real ML engineering workflows (a code sketch of these stages follows the list):
- Data processing: Reading datasets, cleaning, feature engineering
- Model selection: Choosing algorithms based on task characteristics
- Training & tuning: Setting hyperparameters, training, monitoring convergence
- Result validation: Evaluating performance, generating reports
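To make that scope concrete, here is a minimal sketch of the four stages on a hypothetical tabular classification task. The file name, column names, and the choice of model are illustrative assumptions, not part of the benchmark; it assumes pandas and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Data processing: load, drop rows missing the label, engineer a feature
df = pd.read_csv("train.csv")                        # hypothetical dataset
df = df.dropna(subset=["target"])                    # hypothetical label column
df["log_amount"] = np.log1p(df["amount"].clip(lower=0))

X = df[["log_amount", "age"]]                        # hypothetical feature columns
y = df["target"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Model selection: gradient-boosted trees as a common tabular default
# 3. Training & tuning: small hyperparameter grid with cross-validation
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)

# 4. Result validation: evaluate on held-out data and report
print("best params:", search.best_params_)
print(classification_report(y_val, search.predict(X_val)))
```

Every step here is something MLE-Bench tasks require the model to do end to end, not merely describe.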
Unlike traditional multiple-choice benchmarks such as MMLU, MLE-Bench requires the AI to actually execute code, run experiments, and analyze results.
GPT-5.5 Performance
| Model | MLE-Bench Score | Improvement over GPT-5.4 |
|---|---|---|
| GPT-5.5 | 36% | +13pp (+56.5% relative) |
| GPT-5.4 | 23% | baseline |
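The relative figure follows directly from the two scores; a quick check:

```python
# Reproduce the table's improvement figures from the two raw scores.
gpt_5_4, gpt_5_5 = 0.23, 0.36
abs_gain = gpt_5_5 - gpt_5_4        # absolute gain in score
rel_gain = abs_gain / gpt_5_4       # gain relative to the GPT-5.4 baseline
print(f"+{abs_gain * 100:.0f}pp absolute, +{rel_gain * 100:.1f}% relative")
# -> +13pp absolute, +56.5% relative
```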
Read alongside GPT-5.5’s 82.7% score on Terminal-Bench 2.0:
- Command-line capability is maturing: at 82.7%, GPT-5.5 can stand in for a junior engineer on most standard CLI tasks
- ML engineering understanding is catching up: 36% shows AI still has a long way to go in understanding what ML tasks actually require
- The gap is knowledge, not tools: low MLE-Bench scores reflect gaps in ML domain knowledge (understanding data distributions, judging overfitting, designing experiments), not limitations in tool use (see the sketch below)
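As one concrete example of that kind of judgment, below is a minimal sketch of the train-versus-validation comparison an engineer uses to flag overfitting. The data is synthetic and the 0.10 gap threshold is an arbitrary illustrative choice; it assumes scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real task.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# An unconstrained forest is prone to memorizing the training set.
model = RandomForestClassifier(max_depth=None, random_state=0)
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)

# The judgment call: a large train/validation gap suggests memorization
# rather than generalization. The 0.10 threshold is illustrative only.
if train_acc - val_acc > 0.10:
    print(f"Likely overfitting: train={train_acc:.3f}, val={val_acc:.3f}")
else:
    print(f"Gap looks acceptable: train={train_acc:.3f}, val={val_acc:.3f}")
```

Knowing what gap is tolerable for a given dataset and model class is exactly the domain knowledge the benchmark probes.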
Selection Guide
| Role | How to use |
|---|---|
| Data scientists | Automate data processing and baseline model training, saving 30-50% of repetitive work |
| ML engineers | Use its Terminal-Bench-level CLI capability to build automated ML pipelines, but keep model selection under human review (see the gate sketch after this table) |
| Tech leads | 36% autonomy means “AI replacing ML engineers” is premature, but “AI assisting” is ready |
| Students / researchers | Use GPT-5.5 for quick baseline experiments, focus time on experimental design |
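One way to operationalize that human-review requirement is an explicit approval gate between the assistant’s model proposal and the automated training run. This is a hypothetical pattern, not an established API; every name in it is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ModelProposal:
    """Hypothetical record of an assistant's model-selection proposal."""
    name: str          # e.g. "GradientBoostingClassifier"
    rationale: str     # why the assistant picked this model
    hyperparams: dict = field(default_factory=dict)

def require_human_approval(proposal: ModelProposal) -> bool:
    """Block the pipeline until an engineer signs off on the choice."""
    print(f"Proposed model: {proposal.name}")
    print(f"Rationale:      {proposal.rationale}")
    print(f"Hyperparams:    {proposal.hyperparams}")
    return input("Approve and start training? [y/N] ").strip().lower() == "y"

proposal = ModelProposal(
    name="GradientBoostingClassifier",
    rationale="tabular data, moderate size, nonlinear feature interactions",
    hyperparams={"n_estimators": 300, "learning_rate": 0.05},
)
if require_human_approval(proposal):
    print("Approved: handing off to the automated training pipeline.")
else:
    print("Rejected: revise the proposal before training.")
```

The gate keeps the 64% of tasks the model cannot yet handle autonomously from silently entering production, while still letting it automate the rest.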