The TOEFL Listen & Repeat Task:

A Data-Driven Analysis of Task Design, Collocation Load, and What Test-Takers Are Actually Being Measured On

Abstract

The TOEFL Listen & Repeat task looks simple on the surface: hear a sentence, repeat it once. In practice, it is doing much more. This paper analyzes the task using two sources: (1) the 2025 TOEFL iBT Technical Manual and (2) a dataset of 49 ETS sample prompts. The goal is to understand what the task measures, how prompts are constructed, and why learners succeed or fail.

The core finding is this: performance is not driven by word-by-word memory. It is driven by the ability to process, retain, and reproduce collocation chains under time pressure. Across 49 prompts, we observe 90 meaningful collocations, with density increasing sharply in later items (Q6–Q7). The scoring model supports this interpretation, emphasizing meaning preservation, intelligibility, and similarity to the prompt rather than exact reproduction.

The implication is practical. Most learners who “have a memory problem” actually have a chunking problem. And that distinction matters if you want to build useful feedback.

1. Introduction

Listen & Repeat sits at the foundation of the TOEFL Speaking section. It is short, structured, and highly controlled. It also carries more diagnostic information than most learners realize.

According to the technical manual, the Speaking section measures both foundational language skills and communication ability. Listen & Repeat specifically targets the ability to process spoken input and reproduce it accurately and intelligibly.

The task consists of seven prompts, delivered within a scenario, with increasing sentence length and complexity.

At first glance, this looks like a memory task. That interpretation breaks down quickly once you look at the data.

Research questions

Question	Why it matters
What does Listen & Repeat actually measure?	Clarifies construct beyond “memory”
How are prompts structured?	Explains progression and difficulty
How dense are prompts lexically?	Quantifies chunk load
Why do learners fail Q6–Q7?	Links data to performance breakdown

2. What ETS is actually trying to measure

The TOEFL is not purely a communication test. It is a hybrid model.

The manual makes this explicit: the test combines foundational skills (vocabulary, syntax, processing) with communicative ability.

Listen & Repeat sits on the foundational side. But not in a trivial way.

It was included because:

It provides rapid evidence of language proficiency
It captures processing + production simultaneously
It correlates with broader speaking ability

Construct breakdown

Construct	What it means here	Implication
Auditory processing	Understanding the sentence in real time	No replay = no recovery
Short-term retention	Holding structure long enough to repeat	Chunking becomes critical
Speech production	Delivering clearly and fluently	Pronunciation still matters

So already, “memory task” is too simple. This is a processing + retention + production task.

3. Anatomy of the task

ETS defines the structure clearly:

7 prompts
Delivered in a scenario (e.g., campus, gym, library)
Visual progression through the setting
Sentences increase in length and complexity

Design logic

Element	What ETS says	What it actually does
Scenario	Contextualized sentences	Creates predictable language patterns
Progression	Increasing complexity	Builds chunk load gradually
Single repetition	Repeat once	Forces real-time processing

That last point matters. No second attempt means:

No repair
No restructuring
No thinking time

You either processed the sentence correctly or you didn’t.

4. Data set

This paper uses:

7 ETS Listen & Repeat sets
49 prompts total

Metric	Value
Prompt sets	7
Prompts per set	7
Total prompts	49

The contexts include:

libraries
gyms
hotels
nature reserves
campus services

This matters because it drives lexical repetition patterns.

5. Collocation statistics

Using a strict definition (meaningful lexical chunks only), we identified:

90 collocations
across 49 prompts

Metric	Value
Total collocations	90
Prompts	49
Average per prompt	1.84

Distribution by position

Prompt	Typical load	Function
Q1–Q2	1–2 collocations	Orientation
Q3–Q5	2–3 collocations	Instruction / location
Q6–Q7	3–4 collocations	Policy / complex guidance

So difficulty is not random. It is systematically built through chunk density.

6. The collocation architecture

This is where the task becomes interesting.

ETS is not building sentences from individual words. It is building them from reusable chunk families.

Top collocation families

Family	Examples	Role
Welcome + place	welcome to the library	Entry chunk
Service locations	help desk, registration desk	Core noun anchors
Time & schedule	on time, due dates	Late-position chunks
Location phrases	front entrance, breakout rooms	Spatial anchors
Policy language	posted rules, late-departure fee	Complex endings

Key insight

A “sentence” is typically:

3–4 collocation units chained together

Example structure:

chunk 1: orientation
chunk 2: location
chunk 3: instruction
chunk 4: condition or policy

That is what the learner is actually processing.

7. Why Q7 breaks people

The manual says later sentences are longer and more complex.

But length is not the real issue.

Real difficulty drivers

Factor	Effect
More chunks	Higher retention load
Later chunk position	Higher drop rate
Abstract language	Lower recall probability
Conditional structure	More processing steps

So when a learner says:
“I can’t remember long sentences”

What’s actually happening is:

they lose chunk 3 or 4
not the entire sentence

8. Scoring logic

The rubric is revealing.

A score of 4:

allows changes
as long as meaning is preserved

A score of 3:

includes most content
but meaning is not accurate

Score interpretation

Score	Meaning
5	Exact and clear
4	Meaning preserved
3	Most content, meaning weakened
2	Large omissions
1	Minimal repetition

Key takeaway

The test does not require perfect memory.

It requires:

meaning preservation across chunks

9. Automated scoring

ETS uses:

fluency
intelligibility
repeat accuracy

Feature layer

Dimension	What is measured
Fluency	speed, pauses
Intelligibility	pronunciation, rhythm
Accuracy	word overlap, similarity

So even perfect memory won’t save you if:

delivery collapses
pronunciation obscures meaning

10. Reliability and validity

The Speaking section shows:

reliability: 0.94
human-machine agreement: 0.89

That is strong.

It means:

the task is stable
the scoring model is consistent

In other words, the test is not guessing.

11. Diagnostic implications

This is where things get interesting.

Most feedback today says:
“memory problem”

That’s not precise enough.

Better diagnostic model

Observed issue	Real problem	Better label
Loses sentence ending	Chunk retention failure	Final-chunk collapse
Misses service words	Weak collocation family	Service-location weakness
Confuses timing phrases	Weak time chain	Time-chain weakness
Hard to understand	Poor intelligibility	Delivery breakdown

This is where My Speaking Score adds real value.

12. Conclusion

Listen & Repeat is not a memory drill.

It is a high-speed processing task built on collocation chains.

The evidence shows:

structured progression
increasing chunk density
scoring based on meaning, not exact wording
consistent statistical behavior

So the real skill is:

hearing chunks
retaining chunks
reproducing chunks clearly

Or, in less academic terms:

You’re not remembering a sentence.
You’re rebuilding it in real time from pieces.

Appendix: Methods

Collocation definition

Counted:

noun phrases (help desk)
verb phrases (check the schedule)
fixed expressions (on time)

Excluded:

function word strings
grammatical fillers

Data

7 ETS sets
49 prompts

Limitations

sample prompts, not full operational bank
collocation identification involves judgment
does not include acoustic scoring features

TOEFL Speaking Listen & Repeat: Task Analysis