Phonetics Laboratory
Faculty of Linguistics, Philology, and Phonetics

Voicing detection

The idea behind the method of voicing detection in the program voicing.c is that voiced speech has more energy at low frequencies (below about 400 Hz, say), whereas voiceless speech has little energy in this frequency range.

This program is an extension of filter.c. The input signal is filtered by a 400 Hz low-pass Butterworth filter. As each sample of the filter output, yf[i], is calculated, its square is stored in the vector yfsqr by the line “yfsqr[i] = yf[i]*yf[i];”. The squares are then summed over each interval of 160 samples (i.e. 10 ms at 16000 samples/s). For the first 159 samples, from sample 0 to 158, fewer than 160 values are available, so the sum of squares so far (i.e. since sample 0) is calculated instead.

Now, the sum of the n numbers from x[i–n+1] to x[i] could easily be calculated by a loop: e.g. “for (sum = 0, j = i-n+1; j <= i; j++) sum = sum+x[j];”. However, if this summation is repeated for successive values of i, a lot of processor time will be wasted, as n–1 of the numbers summed on the previous iteration will be summed again, as the following example shows:

 

Sequence         14 15 16 17 18 19 20 21 22 23 24
Items 1 to 8     14 15 16 17 18 19 20 21
Items 2 to 9        15 16 17 18 19 20 21 22
                    (the seven numbers 15 to 21 are summed again)
Items 3 to 10          16 17 18 19 20 21 22 23
                       (the seven numbers 16 to 22 are summed again)
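
The direct summation in the example can be sketched as a small C function. This is only an illustration: the name window_sum is hypothetical and does not appear in voicing.c, and it uses 0-based indices, so "items 1 to 8" of the sequence are x[0] to x[7].

```c
#include <stddef.h>

/* Hypothetical helper, not part of voicing.c.
   Sums the n values x[i-n+1] .. x[i] directly, using n-1 additions
   beyond the first term every time it is called.
   Assumes n >= 1 and i >= n-1, so that every index is valid. */
long window_sum(const int *x, size_t i, size_t n)
{
    long sum = 0;
    for (size_t j = i + 1 - n; j <= i; j++)
        sum += x[j];
    return sum;
}
```

Applied to the sequence 14, 15, ..., 24 with n = 8, the calls window_sum(x, 7, 8), window_sum(x, 8, 8) and window_sum(x, 9, 8) give 140, 148 and 156, and each call re-adds seven numbers that the previous call had already added.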
   

This wasteful recalculation can be avoided: instead of summing from x[i–n+1] to x[i] on each iteration, we take the sum calculated on the previous iteration, sum[i–1], subtract the first term of that sum, x[i–n], and add the current term, x[i]. In voicing.c, the running sum of squares of yf[i], with n = 160, is calculated by the lines:

for (i = 159 ; i <= *length ; i++)
          sumsq[i] = (sumsq[i-1] - yfsqr[i-160]) + yfsqr[i];

The saving in processing is immense: for a window of 160 items, 159 additions would otherwise be repeated for each sample. Since a signal can easily be tens of thousands of samples long, millions of unnecessary additions are avoided.
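
The running-sum idea can also be written as a self-contained function. The following is a minimal sketch, not the code of voicing.c: the name running_sum and its parameters are hypothetical, the data are assumed to be doubles, and the array is assumed to hold exactly len values indexed from 0. While fewer than win values are available, it keeps the sum since sample 0, as described above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of voicing.c.
   sumsq[i] receives the sum of sq[j] over the last `win` samples ending at i.
   While fewer than `win` samples are available, the sum runs from sample 0. */
void running_sum(const double *sq, double *sumsq, size_t len, size_t win)
{
    if (len == 0)
        return;
    sumsq[0] = sq[0];
    for (size_t i = 1; i < len; i++) {
        sumsq[i] = sumsq[i - 1] + sq[i];   /* add the newest term */
        if (i >= win)
            sumsq[i] -= sq[i - win];       /* drop the term that has left the window */
    }
}
```

Each output value costs at most one addition and one subtraction, whatever the window length. With floating-point data, a running sum can accumulate small rounding errors over a very long signal, so it may be worth comparing it occasionally with a directly computed sum.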

When the sums of squares have been calculated, the output signal y[i] is calculated by dividing the sum of squares at sample i by 160, to obtain the mean square, and taking the square root. This gives the rms amplitude over the window ending at sample i, which is compared to a threshold value of 600. The expression “sqrt(sumsq[i]/160) > 600” has the value 1 (true) if sqrt(sumsq[i]/160) is greater than 600 and 0 (false) if it is not. Thus, according to the following lines, y[i] will be 1 if the rms amplitude is over 600 units (i.e. voiced) and 0 if it is not (i.e. voiceless).

for (i = 0; i <= *length; i++)
   y[i] = (sqrt(sumsq[i]/160) > 600);    /* threshold */
signal_out(length,y,outfile);
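
As a possible variation (not what voicing.c does), the same decision can be made without calling sqrt for every sample: for a non-negative threshold t, the rms over a window of win samples exceeds t whenever the sum of squares exceeds t*t*win. The sketch below assumes double-precision sums in an array of exactly len values; the name voicing_decision and its parameters are hypothetical. At values extremely close to the threshold, floating-point rounding could make the two forms disagree.

```c
#include <stddef.h>

/* Hypothetical helper, not part of voicing.c.
   Sets y[i] to 1 where the rms amplitude implied by sumsq[i] exceeds
   `threshold`, and to 0 elsewhere. Assumes threshold >= 0. */
void voicing_decision(const double *sumsq, int *y, size_t len,
                      size_t win, double threshold)
{
    /* sqrt(sumsq/win) > threshold  is the same test as  sumsq > threshold^2 * win */
    double limit = threshold * threshold * (double)win;
    for (size_t i = 0; i < len; i++)
        y[i] = (sumsq[i] > limit);
}
```

With win = 160 and threshold = 600, the limit is 57600000, so a window whose rms amplitude is 700 is marked 1 and one whose rms amplitude is 500 is marked 0.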

The value of 600 was determined by experimentation to be a good level at which to set the threshold. It assumes that the signal being analysed is normalised to a range of 32000, that is, almost the full range of short integers. A recording does not necessarily satisfy this assumption, but you can normalise a signal in filename.dat to this range using the program normalize.c. For example, once you have compiled normalize.c to produce normalize.exe, in MS-DOS you can type “normalize filename.dat normfile.dat” to produce a normalised file. Even so, it may be necessary to alter the threshold value of 600, for instance for recordings in which voicing is weak; this is easy to do.
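
The source of normalize.c is not shown here, so the following is only a sketch of one plausible way to normalise a signal, under the assumption that normalisation means scaling the samples so that the largest absolute value becomes 32000. The name normalise_peak and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper; an assumption about what normalisation involves,
   not the code of normalize.c.
   Scales the samples in place so that the largest absolute value becomes
   `target` (which must not exceed 32767). A silent signal is left unchanged. */
void normalise_peak(short *x, size_t len, int target)
{
    long peak = 0;
    for (size_t i = 0; i < len; i++) {
        long a = labs((long)x[i]);
        if (a > peak)
            peak = a;
    }
    if (peak == 0)
        return;                      /* avoid dividing by zero */
    for (size_t i = 0; i < len; i++)
        x[i] = (short)((long)x[i] * target / peak);
}
```

For example, the samples 100, -200, 50 become 16000, -32000, 8000 when scaled to a target of 32000. Integer division truncates towards zero, so scaled values may be one unit smaller in magnitude than the exact result.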
