The autocorrelation method of f0 estimation

The idea behind the autocorrelation method is illustrated in figure 5.4.

The top panel, (a), shows a 1024-sample portion of a vowel. Consider the sample at the time marked by the vertical line, which is close to the peak of a voicing pulse in (a). Panel (b) shows the same portion, shifted to the right (i.e. later in time) by 50 samples. The sample lying on the vertical line in (a), which we will call x[t], is 50 samples later than the sample on the line in (b), x[t–50]. Those two samples have rather different magnitudes, as do x[t] and x[t–50] for most values of t. If we were to measure the difference between x[t] and x[t–50] over all t, the overall discrepancy would be large. Panel (c) shows the same portion again, this time shifted by 90 samples relative to (a), so that x[t] in (a) is aligned with x[t–90] in (c). These two values are also very different: the sample on the line in (c) is close to a dip in the signal, and has a large negative value. The overall difference between the samples in (a) and the samples in (c) is once again high.

Now compare (a) with (d), which is shifted by 129 samples. At that size of shift, the peaks of (d) are well aligned with the peaks of (a), and likewise for the troughs. There is still some difference between the two signal portions, but at this time-lag the difference is low. If we were to carry on shifting, and consider a shift of, say, 170 samples, we would find that the difference between (a) and its time-delayed copy is again greater than it is at 129 samples. The time-lag of 129 samples between (a) and (d) is the size of shift that makes a copy of (a) most like (a) itself. We say that the correlation between the copy, (d), and (a) is greatest at the lag at which the difference between the signal and its copy is smallest. Since (d) is a copy of (a), this correlation of a signal with itself is called the autocorrelation.
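The comparison of a signal with shifted copies of itself can be tried out on a small scale. The following is only a sketch under stated assumptions: the vowel from the figure is not available, so a hypothetical synthetic signal (a sum of five harmonics that repeats every 129 samples) stands in for it, and the "difference" is taken to be the mean absolute difference between x[t] and x[t–lag], a choice the text does not specify. The names synthetic_voiced and mean_abs_difference are made up for this illustration.

```python
import math

PERIOD = 129  # samples; chosen to match the lag discussed in the text


def synthetic_voiced(n_samples, period=PERIOD, n_harmonics=5):
    """Hypothetical stand-in for the vowel: harmonics repeating every `period` samples."""
    return [
        sum(math.sin(2 * math.pi * k * t / period) / k
            for k in range(1, n_harmonics + 1))
        for t in range(n_samples)
    ]


def mean_abs_difference(x, lag):
    """Average of |x[t] - x[t - lag]| over every t for which both samples exist."""
    if not 0 < lag < len(x):
        raise ValueError("lag must be between 1 and len(x) - 1")
    total = sum(abs(x[t] - x[t - lag]) for t in range(lag, len(x)))
    return total / (len(x) - lag)


signal = synthetic_voiced(1024)
for lag in (50, 90, 129, 170):
    # Print each lag next to its difference score; the smallest score
    # marks the shift at which the copy is most like the original.
    print(lag, mean_abs_difference(signal, lag))
```

Because this test signal is exactly periodic, its difference at a lag of 129 samples is essentially zero. For real speech, as in the figure, the difference at the best lag is low but not zero.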
 
A lag of 129 samples gives rise to the smallest difference between (a) and its copy because the voicing pulses in the signal recur at a 129-sample interval. At a sampling rate of 16,000 samples/s, a period of 129 samples corresponds to a frequency of 16,000 ÷ 129 = 124.031 Hz.
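The conversion from a lag in samples to a frequency in Hz is a single division, shown here as a small sketch. The function name lag_to_hz is hypothetical and used only for illustration.

```python
def lag_to_hz(lag_samples, sample_rate_hz):
    """Frequency of an event that recurs every `lag_samples` samples."""
    if lag_samples <= 0:
        raise ValueError("lag must be positive")
    return sample_rate_hz / lag_samples


f0 = lag_to_hz(129, 16_000)
period_ms = 1000 * 129 / 16_000  # the same lag expressed as a duration
print(f"{f0:.3f} Hz, period {period_ms:.4f} ms")
# prints: 124.031 Hz, period 8.0625 ms
```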

This gives us the basis of a method for accurately calculating the frequency of voicing pulses. For each sample in a signal, consider a time-window of (say) 256 samples either side of that sample. Now, calculate the overall difference between that 512-sample portion and copies of itself shifted by every time lag from –512 samples through to +512 samples. That’s an awful lot of computing: 1024 separate comparisons (excluding time lag 0!), for just one sample. Still, if we can do it fast enough it will be worth it if it yields an accurate estimate of f0, which it does.
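As a rough sketch of the procedure just described, the code below takes one centre sample, cuts out the 512-sample window around it, and scores that window against copies shifted by every lag from –512 to +512 except 0. Several details are assumptions rather than anything stated above: the score is a mean absolute difference, the shifted copy is read from the surrounding signal (so the signal must extend at least 512 samples beyond the window on both sides), and the test signal is a hypothetical sum of harmonics repeating every 129 samples. The name difference_curve is invented for this example.

```python
import math


def difference_curve(x, centre, half_window=256, max_lag=512):
    """Map each non-zero lag in [-max_lag, max_lag] to a mean absolute difference."""
    start, stop = centre - half_window, centre + half_window  # 512-sample window
    if start - max_lag < 0 or stop + max_lag > len(x):
        raise ValueError("signal too short for this centre and lag range")
    curve = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag == 0:
            continue  # a signal is always identical to its unshifted copy
        diffs = (abs(x[t] - x[t - lag]) for t in range(start, stop))
        curve[lag] = sum(diffs) / (stop - start)
    return curve


# Hypothetical test signal: five harmonics repeating every 129 samples.
signal = [
    sum(math.sin(2 * math.pi * k * t / 129) / k for k in range(1, 6))
    for t in range(2048)
]

curve = difference_curve(signal, centre=1024)
print(len(curve))  # number of comparisons made for this one sample
for lag in (50, 90, 129, 170):
    print(lag, curve[lag])
```

Note that this only produces the difference score for each lag. Turning the scores into an f0 estimate still requires picking one lag, and for an exactly periodic test signal like this one, multiples of 129 samples score just as well as 129 itself.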
