Some Speech Perception Experiments 


Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst and L. J. Gerstman (1952) Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America 24 (6), 597-606.

Delattre, P. C., A. M. Liberman and F. S. Cooper (1955) Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27 (4), 769-773.

Miller, G. A. and P. E. Nicely (1955) An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America 27, 338-352. Reprinted in J. L. Miller, R. D. Kent and B. S. Atal, eds. (1991) Papers in Speech Communication: Speech Perception. Acoustical Society of America.

Liberman, A. M., F. S. Cooper, D. P. Shankweiler and M. Studdert-Kennedy (1967) Perception of the speech code. Psychological Review 74 (6), 431-461.

Lisker, L. and A. S. Abramson (1970) The voicing dimension: some experiments in comparative phonetics. Proceedings of the Sixth International Congress of Phonetic Sciences, Prague, 1967. Reprinted in Miller et al. (1991).

1. Cooper et al. (1952)

Shortly after the invention of the sound spectrograph at Bell Telephone Laboratories, researchers at Haskins Laboratories built one of the earliest speech synthesizers, the "Pattern Playback". This had an opto-electronic read-head that enabled it to convert the pattern of light and dark areas on a spectrogram into the resonant frequencies of a synthesizer. As well as reading spectrograms of real speech, the Haskins team conducted an important series of experiments with hand-drawn, stylized pseudo-spectrograms, in order to find out which details of the spectrogram were important for the correct identification of each kind of sound. In this paper they summarize the most important findings from several such experiments.

1.1. Stop consonants followed by vowels: importance of noise burst frequency (figure 2)

"it appears that this one variable - the frequency position of the burst - provides the listener with a basis for distinguishing among p, t, and k. We see that high frequency bursts were heard as t for all vowels. Bursts at lower frequencies were heard as k when they were on a level with, or slightly above, the second formant of the vowel; otherwise they were heard as p. It is clear that for p and k the identification of the consonant depended, not solely on the frequency position of the burst of noise, but rather on this position in relation to the vowel."
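The quoted decision rule can be sketched as a toy classifier. The numeric thresholds below (a 3000 Hz cut-off for /t/, and a band from the vowel's F2 up to 600 Hz above it for /k/) are illustrative assumptions, not values reported in the paper:

```python
def classify_burst(burst_hz, vowel_f2_hz):
    """Toy version of the burst-frequency rule in Cooper et al. (1952).

    The numeric thresholds are illustrative assumptions, not values
    reported in the paper.
    """
    if burst_hz >= 3000:
        return "t"                       # high-frequency bursts -> /t/
    # /k/: burst on a level with, or slightly above, the vowel's F2
    if 0 <= burst_hz - vowel_f2_hz <= 600:
        return "k"
    return "p"                           # otherwise -> /p/

# The same burst frequency can be heard differently before different vowels:
classify_burst(1400, vowel_f2_hz=1200)   # "k" (just above F2 of an /a/-like vowel)
classify_burst(1400, vowel_f2_hz=2300)   # "p" (well below F2 of an /i/-like vowel)
```

The point of the sketch is the finding itself: for /p/ and /k/ the label depends not on the burst frequency alone but on its position relative to the vowel's second formant.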

1.2. Stop consonants followed by vowels: importance of formant frequency transitions

At the start of a vowel, rapid changes are seen in the spectrogram, reflecting the change in the sound as the vocal tract moves from the position appropriate to the consonant to the position appropriate to the vowel. Because these transitions are a short-lived and mechanical side-effect of the movement from the consonant to the vowel, they might be expected to be unimportant to perception. Cooper et al. discovered that this is not so: the direction of the transitions enables listeners to distinguish between different consonants. The shape of the first formant transition cues the voicing distinction, and the shape of the second formant transition cues place of articulation, in a way that depends on the vowel. (This work was developed further in Delattre et al. 1955.)

1.3. Nasals, /l/ and vowels

They also studied the spectrographic patterns of nasals, /l/ and vowels. For the nasals, similar transitions were observed for /m/, /n/ and /ŋ/ as for /b/, /d/ and /g/ in the previous experiments.

2. Delattre et al. (1955)

The earlier paper found that different second formant transitions are heard as stop consonants with different places of articulation. This paper explores that relationship more systematically. The three places of articulation (bilabial, alveolar and velar) are shown to be associated with three frequency regions - acoustic loci. "Since the articulatory place of production of each consonant is, for the most part, fixed, we might expect to find that there is correspondingly a fixed frequency position - or "locus" - for its second formant ... the various transitions that produce the best d with each of the seven vowels do, in fact, appear to be coming from the same general region"

Later: "The best g is produced by a second formant at 3000 cps [i.e. 3 kHz ― JC], the best d at 1800 cps, and the best b at 720 cps."
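Given these loci, the direction of the F2 transition for a given consonant follows from where the vowel's steady-state F2 sits relative to the consonant's locus. A minimal sketch, using the loci quoted above together with rough vowel F2 values that are illustrative assumptions, not figures from the paper:

```python
LOCI_HZ = {"b": 720, "d": 1800, "g": 3000}      # loci from Delattre et al. (1955)
VOWEL_F2_HZ = {"i": 2300, "a": 1200, "u": 800}  # assumed, approximate values

def f2_transition(consonant, vowel):
    """Direction of the second-formant transition: from the consonant's
    locus region toward the vowel's steady-state F2."""
    locus, target = LOCI_HZ[consonant], VOWEL_F2_HZ[vowel]
    if target > locus:
        return "rising"
    if target < locus:
        return "falling"
    return "flat"

# One place of articulation, opposite transitions before different vowels:
f2_transition("d", "i")   # "rising"  (transition starts below the vowel's F2)
f2_transition("d", "a")   # "falling" (transition starts above the vowel's F2)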

Because different formant transitions - in some cases, in opposite directions - are used to encode the same place of articulation, the Haskins researchers became sceptical about finding stable (invariant) acoustic cues for phonological units, and argued instead for a motor theory of speech perception, in which articulatory events were seen as the underlying objects of speech perception (Liberman et al. 1967). This idea has persisted, but has remained highly controversial: see, for example, the papers by Lindblom, Stevens, Ohala and Fowler in Journal of the Acoustical Society of America 99 (3).

3. Lisker and Abramson (1970)

In previous descriptive work on various languages, Lisker and Abramson identified Voice Onset Time (VOT) as an important acoustic correlate of phonation contrasts. In brief, voice onset at roughly the same time as the consonant release (VOT = 0 ms) is typical of voiceless unaspirated stops, whereas voice onset 20-25 ms or more after the release is characteristic of voiceless aspirated stops.
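The VOT dimension described here can be sketched as a simple labelling rule over the continuum; the cut-off values below are illustrative assumptions, chosen only to separate the categories just described:

```python
def vot_category(vot_ms):
    """Rough VOT labelling; the boundary values are illustrative."""
    if vot_ms < -10:
        return "voiced (prevoiced)"      # voicing begins well before release
    if vot_ms <= 20:
        return "voiceless unaspirated"   # voicing at (or near) the release
    return "voiceless aspirated"         # long voicing lag
```

Languages then differ in how many cuts they make along this dimension, and where, as the experiment below shows.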

In this paper, Lisker and Abramson used a speech synthesizer to generate /Ca/ tokens with 37 different voice onset times, ranging from 150 ms before the stop release to 150 ms after it. The stimuli varied in 10 ms steps, except in the range from –10 ms to +50 ms VOT, where 5 ms steps were used. Subjects were five native speakers of Latin American Spanish, twelve of American English and eight of Thai. Spanish has a two-way contrast between prevoiced [b] and voiceless unaspirated [p], etc. English /b/ vs. /p/ is phonetically unaspirated [p] vs. aspirated [ph]; likewise /d/ vs. /t/ ([t] vs. [th]) and /g/ vs. /k/ ([k] vs. [kh]). Thai has three-way contrasts such as [d] vs. [t] vs. [th]. Subjects were asked to identify the category of each stimulus (e.g. /d/ or /t/).
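The count of 37 stimuli is consistent with the step sizes described: 31 values at 10 ms steps from –150 to +150 ms, plus the 6 extra values contributed by the 5 ms steps between –10 and +50 ms. A quick check:

```python
vots = set(range(-150, 151, 10))   # 10 ms steps across the full range (31 values)
vots |= set(range(-10, 51, 5))     # finer 5 ms steps near the voicing boundary
assert len(vots) == 37
```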

Lisker and Abramson found that the relationship between Voice Onset Time and percentage identification of each category was strikingly non-linear (an S-shaped ― logistic ― function, in fact, for binary contrasts). At a certain point on the scale, subjects' ability to identify a category is no better than chance (50% correct). If discrimination between stimuli is sharp around this point, perception is said to be categorical. Away from the category boundary, identification of stimuli approaches 100% correct, but discrimination there is typically poor: subjects find it more difficult to tell the difference between one token of [d] and another with a slightly different VOT.
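The S-shaped identification function can be modelled as a logistic curve over VOT. In the sketch below, the category boundary (+25 ms) and the slope are assumed values chosen for illustration, not parameters fitted by Lisker and Abramson:

```python
import math

def p_voiceless(vot_ms, boundary_ms=25.0, slope=0.4):
    """Probability of a voiceless (/t/-type) response at a given VOT.
    boundary_ms and slope are illustrative assumptions."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

# At the boundary, identification is at chance; away from it, near-certain:
p_voiceless(25)    # 0.5
p_voiceless(-50)   # close to 0 (reliably heard as /d/)
p_voiceless(100)   # close to 1 (reliably heard as /t/)
```

The steepness of the curve around the boundary is what makes the identification function look like a step rather than a ramp.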

This pattern of categorical perception has been found for several other phonological contrasts, but not all. Vowel perception is less categorical: the identification functions are similar, but discrimination is not so sharp.