The autocorrelation method of f0 estimation
The idea behind the autocorrelation method is illustrated in figure 5.4.
The top panel, (a), shows a 1024-sample portion of a vowel. Consider the
sample at the time marked by the vertical line, which is close to the peak
of a voicing pulse in (a). Panel (b) shows the same portion, shifted to the
right (i.e. later in time) by 50 samples. The sample lying on the vertical
line in (a), call it x[t], is 50 samples later than the sample on the line
in (b), x[t-50]. Those two samples have rather different magnitudes, as
indeed do x[t] and x[t-50] for most values of t. If we were to measure the
difference between x[t] and x[t-50] over all t, the overall discrepancy
would be large.
Panel (c) shows the same portion again, shifted backwards by 90 samples
compared to (a), so that x[t] in (a) is aligned with x[t-90] in (c). These
two samples also have very different values: the sample on the line in (c)
is close to a dip in the signal and has a large negative value. The overall
difference between the samples in (a) and the corresponding samples in (c)
is once again high.
Now compare (a) with (d), which is shifted by 129 samples. At that size of
shift, the peaks of (d) are pretty well aligned with the peaks of (a), and
likewise for the troughs. There is still some degree of difference between
the two signal portions, but at this time-lag the difference is low. If we
were to carry on shifting, and consider a shift of, say, 170 samples, we
would find that the degree of difference between (a) and a time-delayed
copy is again greater than that of (d). The time-lag of 129 samples between
(a) and (d) is the size of shift that is necessary in order to make a copy
of (a) most like itself. We say that the correlation between the copy, (d),
and (a) (the autocorrelation, since (d) is a copy of (a)) is greatest at
the lag at which the difference between the signal and its copy is
smallest.
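
The comparison of a signal with shifted copies of itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vowel of figure 5.4 is not available here, so the hypothetical helper make_pulse_train builds a synthetic stand-in whose pulses recur every 129 samples, and the "difference" is taken to be the mean absolute difference between x[t] and x[t-lag], which is one plausible choice rather than a measure the text prescribes.

```python
import math

def make_pulse_train(n_samples=1024, period=129):
    """Synthetic stand-in for a voiced signal (an assumption, not real speech):
    a decaying oscillation that restarts every `period` samples."""
    signal = []
    for n in range(n_samples):
        phase = n % period
        signal.append(math.exp(-phase / 30.0) * math.cos(2 * math.pi * phase / 20.0))
    return signal

def mean_abs_difference(signal, lag):
    """Average of |x[t] - x[t-lag]| over every t for which both samples exist."""
    pairs = [(signal[t], signal[t - lag]) for t in range(lag, len(signal))]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

signal = make_pulse_train()
differences = {lag: mean_abs_difference(signal, lag) for lag in (50, 90, 129, 170)}
for lag, diff in differences.items():
    print(f"lag {lag:3d}: mean difference {diff:.4f}")
```

Because the synthetic signal is perfectly periodic, its difference at a lag of 129 samples is exactly zero; with real speech, as in panel (d), a small residual difference remains.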
The reason why a lag of 129 samples gives rise to the smallest difference
between (a) and its copy is that the voicing pulses in the signal recur at
a 129-sample interval. At a sampling rate of 16,000 samples/s, a period of
129 samples corresponds to a frequency of 16,000/129 = 124.031 Hz.
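
The conversion from a lag in samples to a frequency in Hz is a single division. A minimal sketch, in which the function name lag_to_frequency is purely illustrative:

```python
def lag_to_frequency(lag_samples, sampling_rate_hz):
    """Frequency (Hz) of an event that recurs every `lag_samples` samples."""
    return sampling_rate_hz / lag_samples

f0 = lag_to_frequency(129, 16000)
print(f"{f0:.3f} Hz")  # 124.031 Hz
```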
This gives us the basis of a method for accurately calculating the frequency
of voicing pulses. For each sample in a signal, consider a time-window of
(say) 256 samples either side of that sample. Now, calculate the overall
difference between that 512-sample portion and copies of itself shifted by
every time lag from -512 samples through to +512 samples. That's an awful
lot of computing: 1024 separate comparisons (excluding time lag 0!), for
just one sample. Still, if we can do it fast enough it will be worth it if
it yields an accurate estimate of f0, which it does.
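
The procedure for a single sample might be sketched as follows. Several details are assumptions made for the sake of a runnable example and are not taken from the text: each shifted copy is read from the surrounding signal (so the signal must extend 768 samples either side of the chosen sample), the difference measure is the mean absolute difference, lags shorter than a hypothetical min_lag are ignored when picking the winner (neighbouring samples are always similar, so very small lags would otherwise win), and ties are broken in favour of the shortest lag. The test signal is synthetic, not real speech.

```python
import math

def make_pulse_train(n_samples, period):
    """Synthetic stand-in for a voiced signal (an assumption, not real speech)."""
    return [math.exp(-(n % period) / 30.0) * math.cos(2 * math.pi * (n % period) / 20.0)
            for n in range(n_samples)]

def window_difference(signal, centre, lag, half_window):
    """Mean |x[t] - x[t-lag]| for t in the window of half_window samples either side of centre."""
    total = 0.0
    for t in range(centre - half_window, centre + half_window):
        total += abs(signal[t] - signal[t - lag])
    return total / (2 * half_window)

def estimate_f0_at(signal, centre, sampling_rate_hz,
                   half_window=256, max_lag=512, min_lag=40):
    """Return (f0 in Hz, best lag in samples, number of comparisons made)."""
    # Every lag from -max_lag to +max_lag, excluding lag 0.
    lags = [lag for lag in range(-max_lag, max_lag + 1) if lag != 0]
    differences = {lag: window_difference(signal, centre, lag, half_window)
                   for lag in lags}
    # min_lag is an illustrative guard, not part of the method as described.
    candidates = [lag for lag in lags if abs(lag) >= min_lag]
    best_lag = min(candidates, key=lambda lag: (differences[lag], abs(lag)))
    period = abs(best_lag)
    return sampling_rate_hz / period, period, len(differences)

signal = make_pulse_train(2048, 129)
f0, period, n_comparisons = estimate_f0_at(signal, centre=1024, sampling_rate_hz=16000)
print(f"{n_comparisons} comparisons, best lag {period} samples, f0 = {f0:.3f} Hz")
```

Each comparison visits 512 samples, so the 1024 comparisons amount to roughly half a million subtractions for one sample position, which is why speed matters.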