How Reliable Are Automated TOEFL Speaking Scores Compared to Human Raters?

When preparing for the TOEFL iBT Speaking section, one question often comes up: Can I really trust automated scoring? ETS uses both trained human raters and machine scoring models (like SpeechRater™) to evaluate responses. But how well do machines actually line up with human raters?

Research by Klaus Zechner (Automated Speaking Assessment) and a new independent validation from former ETS rater Nathan Mills paint a clear picture: automated scoring is surprisingly reliable, often matching human judgment to a remarkable degree.

Human vs. Machine at the Item Level

“Item level” means a single TOEFL Speaking task, scored on its own.

  • Machine–human correlation: up to r = 0.65
  • Human–human correlation: about r = 0.55–0.60

Interpretation: the machine’s scores correlate slightly more strongly with a human rater’s than a second human rater’s do. That reflects a core strength of automated scoring: consistency.

Machines don’t get tired, distracted, or swayed by accents the way humans sometimes do. They apply the same scoring model every time, which reduces variability at the single-task level.
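
What does a correlation like r = 0.65 actually measure? Here is a minimal sketch, using made-up scores purely for illustration, of how Pearson’s r is computed from a batch of double-scored responses:

```python
import numpy as np

# Hypothetical item-level scores for ten responses on the 0–4 task scale.
human   = np.array([2.0, 3.0, 3.0, 2.5, 4.0, 1.5, 3.5, 2.0, 3.0, 2.5])
machine = np.array([2.5, 2.5, 3.5, 2.5, 3.5, 2.5, 3.0, 2.0, 3.5, 2.5])

# Pearson's r measures how strongly the two score sets move together.
r = np.corrcoef(human, machine)[0, 1]
print(f"machine–human correlation: r = {r:.2f}")  # ≈ 0.73 for this toy sample
```

An r in the mid-0.6s means the two score sets rise and fall together strongly, but not in lockstep.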

Human vs. Machine at the Section Level

Section level refers to the overall TOEFL Speaking score computed from multiple tasks (historically six).

  • Human–human correlation: about r = 0.90
  • Machine–human correlation: up to r = 0.85

Here, humans come out slightly ahead. Why? Because each test taker’s responses are scored by several different raters, and when those ratings are averaged, much of the random error cancels out, producing stronger agreement overall.

Machines, by contrast, use the same scoring model for every task. While this ensures uniformity, it doesn’t benefit from the “error-canceling” effect of multiple human perspectives.

Why the Pattern Flips from Item to Section

  • At the item level: The machine is more consistent because it applies identical scoring rules every time.
  • At the section level: Humans benefit from averaging across multiple raters and tasks, which boosts reliability.
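
The averaging effect can even be quantified with the classic Spearman–Brown formula, which predicts the reliability of an average of k parallel ratings. Treating the item-level human–human figure of about 0.60 as a single-rating reliability (a simplification for illustration, not necessarily how ETS models its scores):

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Predicted reliability when k parallel measurements are averaged."""
    return k * r_single / (1 + (k - 1) * r_single)

# Item-level human–human agreement of ~0.60, averaged over the legacy
# six-task section, lands on the reported section-level figure:
print(f"{spearman_brown(0.60, 6):.2f}")  # 0.90
```

That the prediction lands on the reported ~0.90 is a tidy consistency check, not proof of the underlying scoring model.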

At-a-Glance Summary

Agreement Benchmarks for TOEFL Speaking Scoring
| Level | Comparison | Pearson r | Variance Explained (R²) | Notes |
| --- | --- | --- | --- | --- |
| Item (single task) | Machine vs. Human | 0.65 (max) | 0.42 | Higher than human–human at the item level due to model consistency |
| Item (single task) | Human vs. Human | 0.55–0.60 | 0.30–0.36 | Two trained raters; some rater-specific noise remains |
| Section (total score) | Human vs. Human | ~0.90 | 0.81 | Averaging across tasks/raters boosts reliability |
| Section (total score) | Machine vs. Human | up to ~0.85 | 0.72 | Close to human–human, but slightly lower |

Values are approximate upper bounds reported in the literature summarized by Zechner. “Section” refers to totals computed from multiple items in the legacy six-task structure.
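
The variance-explained column is simply the squared correlation (R² = r²), as a quick arithmetic check confirms:

```python
for label, r in [
    ("item, machine vs. human",    0.65),
    ("item, human vs. human",      0.60),
    ("section, human vs. human",   0.90),
    ("section, machine vs. human", 0.85),
]:
    print(f"{label}: r = {r:.2f} -> R² = {r * r:.2f}")
```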

Independent Validation: Nathan Mills vs. SpeechRater™

Research is one thing. But what happens when you put machine scoring head-to-head with a seasoned human expert?

Former TOEFL Speaking rater Nathan Mills, who spent a decade evaluating responses for ETS, conducted his own large-scale comparison. He hand-scored more than 1,600 responses that had already been processed by SpeechRater™.

The results were striking: Nathan’s scores and SpeechRater’s scores aligned closely. On every task the average difference was small, ranging from 0.06 to 0.21 points.

Nathan Mills vs. SpeechRater™ Scores (Sample Tasks)
| Task | Nathan’s Score | SpeechRater™ Score | Difference |
| --- | --- | --- | --- |
| Q1 | 3.02 | 3.08 | 0.06 |
| Q2 | 2.92 | 3.02 | 0.10 |
| Q3 | 2.78 | 2.99 | 0.21 |
| Q4 | 2.79 | 2.92 | 0.13 |

Source: Nathan Mills, former ETS TOEFL Speaking rater. Data from 1,600+ double-scored responses.

This is consistent with Zechner’s findings: automated scores reflect trained human judgment to an impressive degree, making them reliable for practice and preparation.
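
Anyone with a set of double-scored responses can run the same kind of comparison. Here is a minimal sketch, with invented placeholder scores rather than Mills’ actual data:

```python
import numpy as np

# Each row is one response on one task: (human_score, machine_score),
# both on the 0–4 task scale. Placeholder values for illustration only.
pairs = np.array([
    [3.0, 3.0],
    [2.5, 3.0],
    [3.5, 3.5],
    [2.0, 2.5],
    [3.0, 3.5],
])
human, machine = pairs[:, 0], pairs[:, 1]

bias = np.mean(machine - human)           # signed average difference
mae  = np.mean(np.abs(machine - human))   # typical size of a disagreement
r    = np.corrcoef(human, machine)[0, 1]  # consistency of the two raters
print(f"bias = {bias:+.2f}, MAE = {mae:.2f}, r = {r:.2f}")
```

The per-task gaps in the table above correspond to the bias statistic: a small positive bias means SpeechRater scored slightly above the human, as in Mills’ sample.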

Practical Takeaways for Learners

  1. Trust automated feedback for single responses. It mirrors trained rater judgments closely at the item level.
  2. Manage your averages. Your final score reflects performance across tasks. One weak response won’t determine the section result.
  3. Use patterns to improve. Machines are sensitive to recurring issues like timing, pausing, and lexical variety. Train those consistently.
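
To make the third takeaway concrete, here is a toy sketch of the kinds of surface features automated scorers are known to weigh: speaking rate, pausing, and lexical variety. These are crude stand-ins for illustration, not SpeechRater’s actual feature set:

```python
def toy_fluency_features(transcript: str, duration_sec: float, pause_sec: float) -> dict:
    """Crude proxies for timing, pausing, and lexical variety."""
    words = transcript.lower().split()
    return {
        "words_per_second": round(len(words) / duration_sec, 2),    # speaking rate
        "pause_ratio": round(pause_sec / duration_sec, 2),          # share of silence
        "type_token_ratio": round(len(set(words)) / len(words), 2), # lexical variety
    }

print(toy_fluency_features(
    "I think students need a quiet library because a quiet space helps students focus",
    duration_sec=8.0,
    pause_sec=1.0,
))
```

Drilling these dimensions deliberately targets the same qualities the model measures on every response.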

FAQ

1) Is r = 0.65 “good” agreement?
Yes. In education and language assessment, r in the mid-0.6s at the item level is meaningful. It shows moderate-to-strong alignment between machine and human raters.

2) Why does human–human beat machine–human at the section level?
Averaging multiple human judgments reduces random error. The same model scoring each item doesn’t gain that “independent judgment” benefit.

3) Does correlation mean accuracy?
Not exactly. Correlation reflects consistency with a criterion. Validity depends on whether the score accurately reflects speaking proficiency.
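
A toy example makes the distinction vivid: two score sets can be perfectly correlated while one is consistently off target.

```python
import numpy as np

human   = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
machine = human - 0.5  # every machine score is half a point low

r = np.corrcoef(human, machine)[0, 1]
print(f"r = {r:.2f}")  # 1.00: perfect correlation, yet every score is biased
```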

4) Why can item-level machine–human exceed human–human?
The machine applies the same rules every time, while two humans may interpret rubrics slightly differently.

5) How should students use automated scores?
Track trends across many practice responses. Focus on patterns that affect fluency, clarity, and timing.

6) Do these numbers still matter with the new TOEFL Speaking format?
Yes. While task types will evolve, the underlying principle remains: machines are consistent at the item level, humans gain reliability from averaging.

Final Thoughts

The evidence is clear:

  • On individual tasks, automated scoring agrees with a human rater at least as closely as a second human rater does.
  • On the overall section, multiple human raters still have a slight edge — but the gap is small.
  • Independent validation, like Nathan Mills’ study of 1,600+ responses, shows automated and human scores tracking each other closely.

For test-takers, that means you can treat automated feedback as a dependable guide to your current performance. It doesn’t replace human judgment, but it provides consistency, immediacy, and actionable data that humans alone can’t deliver.

As TOEFL Speaking continues to evolve — with the 2026 redesign ahead — automated scoring will play an even bigger role. Understanding its reliability today helps you trust it tomorrow.