Pitch Detection using AM Analysis Techniques

This section examines how the techniques discussed in the previous sections can be used to track the pitch of a speaker's voice. As seen in the last section, when a piece of speech was analysed for points of maximal AM, we found that during voiced speech these were regularly spaced in 'lines' between the speech formants. Closer examination showed that the spacing between the maximal AM points in a line approximates the period of the speaker's pitch.

If we examine the spread of maximal AM points in the plot in Figure 1, which shows the spectrogram and AM points for a 6 second utterance, we can see that the 'lines' are clearly marked. They distinguish themselves from the other maximal AM points, such as those found where there is no speech energy, by their regular spacing and smooth frequency contours.

Figure 1: AM Image of Speech Signal with Points of Maximal AM Depth (100-1000 Hz, 1-6 seconds)

Speech File (Sun .au format) - "The north wind and the sun were disputing which was the stronger when a traveller came along wrapped in a warm"

Therefore, if these 'maximal AM lines' can be tracked and distinguished from the maximal AM points found when there is no voicing, it should be possible to obtain a pitch estimate from each 'line'. As can be seen, up to 4 lines are usually found during voicing, allowing a number of parallel pitch estimates for any one time period.

These 'lines' are stable over both time and frequency. By extracting only those points that occur in long lines of slowly varying frequency at a relatively constant period, we should be left with only the points useful for pitch detection. Such an extraction has been carried out on the sample seen in Figure 1. The result is shown in Figure 2, which plots all the maximal AM points, with those extracted for pitch detection highlighted in green.

Figure 2: Maximal AM depth points (black) and points used as 'lines' for pitch estimation (green)

Comparing Figures 1 and 2, we can see that most of the 'lines' have been successfully extracted, leaving behind the maximal AM points that occur due to background noise or unvoiced speech.
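The extraction criteria described above (long lines, slowly varying frequency, relatively constant period) can be illustrated with a simple greedy tracker. This is only a sketch of one way such an extraction might be done, not the method actually used to produce Figure 2. The function name `extract_lines` and every threshold in it are hypothetical values chosen for illustration.

```python
from statistics import median

def extract_lines(points, max_freq_step=20.0, min_gap=0.002, max_gap=0.020,
                  min_points=5, period_tolerance=0.2):
    """Group (time_s, freq_hz) maximal AM points into 'lines'.

    All thresholds are illustrative assumptions, not values from the study.
    """
    lines = []
    for t, f in sorted(points):
        best = None
        for line in lines:
            last_t, last_f = line[-1]
            gap = t - last_t
            # A point may extend a line only if it arrives after a plausible
            # pitch period and its frequency has moved only slightly.
            if min_gap <= gap <= max_gap and abs(f - last_f) <= max_freq_step:
                if best is None or abs(f - last_f) < abs(f - best[-1][1]):
                    best = line
        if best is None:
            lines.append([(t, f)])   # start a new candidate line
        else:
            best.append((t, f))

    kept = []
    for line in lines:
        if len(line) < min_points:
            continue                 # too short to be a voiced 'line'
        gaps = [b[0] - a[0] for a, b in zip(line, line[1:])]
        mid = median(gaps)
        # Keep only lines whose point spacing stays close to its median.
        if all(abs(g - mid) <= period_tolerance * mid for g in gaps):
            kept.append(line)
    return kept
```

With these settings, isolated points from background noise never gather enough neighbours to form a line and are dropped, while a run of points spaced one pitch period apart survives.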

Now that these lines have been extracted, each one can be used to yield a pitch estimate by measuring the period between successive maximal AM points on the line. If multiple lines (and therefore multiple pitch estimates) are found for any 10ms time frame, the median estimate is used.
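As a minimal sketch of this step, the code below turns the spacing between successive points on each line into a pitch value and takes the median across lines in each 10ms frame. The function names are hypothetical, and assigning each estimate to the frame containing the midpoint of the two points is an assumption made for illustration.

```python
from statistics import median

def line_pitch_estimates(times):
    """Return (midpoint_time_s, pitch_hz) for each pair of successive points on one line."""
    estimates = []
    for t0, t1 in zip(times, times[1:]):
        period = t1 - t0
        if period > 0:
            estimates.append(((t0 + t1) / 2.0, 1.0 / period))
    return estimates

def frame_pitch(lines, frame_len=0.010):
    """Median pitch per frame across all lines.

    `lines` is a list of lists of point times (seconds), one list per line.
    Frames with no estimate are simply absent from the result.
    """
    per_frame = {}
    for times in lines:
        for t_mid, f0 in line_pitch_estimates(times):
            frame = int(t_mid / frame_len)   # assumed frame assignment
            per_frame.setdefault(frame, []).append(f0)
    return {frame: median(vals) for frame, vals in sorted(per_frame.items())}
```

For example, two lines with point times `[0.000, 0.010, 0.020]` and `[0.001, 0.009, 0.019]` give estimates of 100 Hz and 125 Hz in the first frame, so the median reported for that frame is 112.5 Hz.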

This approach yielded the pitch contour for the 6 second test utterance used in the previous figures. It can be seen, together with an 'ideal' pitch contour found via hand-corrected autocorrelogram, in Figure 4 below:

Figure 4: Pitch Estimates using maximal AM depth points (green) compared with 'ideal' values (blue)

As can be seen, whilst the pitch estimates follow the pitch contour approximately, quite a number of points are wildly inaccurate, and the estimates have a tendency to 'oscillate' around the correct pitch value in the frequency domain. A more detailed analysis found that, of the 253 frames used in this test, 26% had errors of more than 10% (gross errors), with an average error of 0.102 of the reference pitch value.

These error values are clearly too high for applications where accurate pitch estimates are required. However, naturally enough, it was found that the more 'maximal AM lines' were used to calculate the pitch values, the more accurate the result. A slightly different approach would therefore be to ignore the 'maximal' AM IF deviations and instead measure the period of all IF deviations across all frequency channels. Of course, not all of the IF deviations will be due to AM effects; in fact, those that are not will be in the majority. However, the periods of these other deviations will be spread at random, whilst the IF deviations due to AM will all have roughly the same period. By counting the number of 'hits' for each possible period value, that is, the number of estimates of that period value from the frequency channels, the pitch period should come through as a strong peak amongst the possible period values.
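The 'hit' counting idea can be sketched as a histogram over quantised period values. This is an illustrative sketch only: the function names and the 0.1ms bin width (`resolution`) are assumptions, and the input is taken to be one list of measured IF deviation periods per frequency channel for a single frame.

```python
from collections import Counter

def count_hits(channel_periods, resolution=0.0001):
    """Count how many period estimates (seconds) fall into each period bin.

    `channel_periods` holds one list of period estimates per frequency channel.
    `resolution` is an assumed bin width, not a value from the study.
    """
    hits = Counter()
    for periods in channel_periods:
        for p in periods:
            hits[round(p / resolution)] += 1
    return hits

def best_period(hits, resolution=0.0001):
    """Return (period_seconds, hit_count) for the most popular bin, or (None, 0)."""
    if not hits:
        return None, 0
    bin_index, n = max(hits.items(), key=lambda kv: kv[1])
    return bin_index * resolution, n
```

Randomly spread periods each land in a different bin and collect one or two hits, whereas the periods caused by AM at the pitch rate pile up in a single bin, which is the peak the method relies on.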

Figure 5: Graph of 'Hits' and Pitch Estimates for various margin values

Figure 5 shows the results of this method of pitch detection. The 'estimate' line shows the number of hits counted for the most popular pitch estimate. When no voiced speech is present, this value falls back to a normal background level (seen where the reference pitch value is zero). Naturally, this background level never reaches zero, so a minimum level, or margin, must be applied so that pitch estimates are not given for background. As can be seen from the graph, if this level is set too low then spurious pitch estimates can result from background noise; if it is set too high, pitch estimates for the start and end of voiced speech will be missed. A value must be chosen which minimises the combined chance of missed and spurious estimates. This can be seen in Figure 5: a margin of 13 produces many spurious pitch estimates, a margin of 28 truncates the pitch estimates, and the pitch contour that best fits the reference contour is obtained with a margin of 20.
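The trade-off in choosing a margin can be made concrete with a small sweep. The sketch below assumes that a frame is treated as voiced when its peak hit count is at least the margin, and that the best margin is the one with the fewest missed plus spurious frames against a reference contour. Both choices, and all of the function names, are assumptions for illustration.

```python
def apply_margin(peak_periods, peak_hits, margin):
    """Per-frame pitch in Hz; 0.0 where the peak hit count is below the margin."""
    return [1.0 / p if n >= margin and p else 0.0
            for p, n in zip(peak_periods, peak_hits)]

def voicing_errors(estimates, reference):
    """Return (missed, spurious) frame counts; a value of 0 means 'no pitch'."""
    missed = sum(1 for e, r in zip(estimates, reference) if r > 0 and e == 0)
    spurious = sum(1 for e, r in zip(estimates, reference) if r == 0 and e > 0)
    return missed, spurious

def choose_margin(peak_periods, peak_hits, reference, candidates):
    """Pick the candidate margin with the fewest missed + spurious frames."""
    def cost(margin):
        estimates = apply_margin(peak_periods, peak_hits, margin)
        return sum(voicing_errors(estimates, reference))
    return min(candidates, key=cost)
```

Weighting missed and spurious frames equally is the simplest possible cost. An application that tolerates missed frames better than spurious ones would weight the two counts differently.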

In order to compare this method of pitch detection with other popular techniques, a set of reference utterances from 10 different speakers (5 male, 5 female) has been analysed using the pitch detection method described above. The resulting pitch contours can then be compared against a set of hand-corrected reference contours to determine their accuracy.

The table below shows the results of this analysis. The data is set out as follows:

Frames	Number of 10ms frames in the test utterance
FineErr	Average error over all frames with an error of less than 10% of the correct pitch value
GrossE	Number of frames with an error of over 10% (gross error)
GrossP	Percentage of frames with gross errors
GNZ E	Number of gross error frames where the pitch estimator produced a spurious result
GNZ P	Percentage of frames with GNZ E
GZ E	Number of gross error frames where the pitch estimator did not produce a pitch estimate when it should have
GZ P	Percentage of frames with GZ E


Sample	Margin	Frames	FineErr	GrossE	GrossP	GNZ E	GNZ P	GZ E	GZ P
============================================================================== 
f1	18	3221	0.0121	332	10.31	10	0.31	322	10.00	
f2	15	3370	0.0151	308	9.14	36	1.07	272	8.07	
f3	18	3050	0.0142	367	12.03	3	0.10	364	11.93	
f4	18	3160	0.0159	372	11.77	20	0.63	352	11.14	
f5	20	3870	0.0143	177	4.57	14	0.36	163	4.21	
m1	13	3736	0.0164	1189	31.83	87	2.33	1102	29.50	
m2	15	3187	0.0148	801	25.13	37	1.16	764	23.97	
m3	15	2717	0.0126	436	16.05	14	0.52	422	15.53	
m4	15	3370	0.0111	1366	40.53	28	0.83	1338	39.70	
m5	15	4030	0.0113	1007	24.99	16	0.40	991	24.59	

Average Values   	FineErr GrossP
======================================
Female			0.0143	9.564
Male			0.0132	27.7
All			0.0137	18.63
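The error measures tabulated above could be computed from an estimated and a reference contour along the following lines. This is a sketch of one reading of the definitions, not the scoring code used for the tables. It assumes that a value of 0 marks a frame with no pitch, that a non-zero estimate on an unvoiced frame counts as GNZ, and that frames where both contours are 0 count as correct. The function name `score_contour` is hypothetical.

```python
def score_contour(estimates, reference, gross_threshold=0.10):
    """Score per-frame pitch estimates against a reference contour (0 = no pitch)."""
    fine_errors = []
    gnz = 0   # gross errors where a (wrong or spurious) estimate was produced
    gz = 0    # gross errors where no estimate was produced but one was due
    frames = min(len(estimates), len(reference))
    for est, ref in zip(estimates, reference):
        if est == 0 and ref == 0:
            continue                      # correctly reported as unvoiced
        if est == 0:
            gz += 1
        elif ref == 0:
            gnz += 1
        else:
            rel = abs(est - ref) / ref    # error as a fraction of the reference
            if rel > gross_threshold:
                gnz += 1
            else:
                fine_errors.append(rel)
    gross = gnz + gz

    def pct(n):
        return 100.0 * n / frames if frames else 0.0

    return {
        "Frames": frames,
        "FineErr": sum(fine_errors) / len(fine_errors) if fine_errors else 0.0,
        "GrossE": gross, "GrossP": pct(gross),
        "GNZ E": gnz, "GNZ P": pct(gnz),
        "GZ E": gz, "GZ P": pct(gz),
    }
```

Under this reading GrossE is always the sum of GNZ E and GZ E, and GrossP is GrossE as a percentage of Frames, which agrees with every row of the table (for f1, 10 + 322 = 332 and 332 / 3221 = 10.31%).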

The first conclusion to draw from the data is the major difference between the results for female and male speakers: as can be seen from the average values, male speakers have around 3 times the gross error rate of female speakers. As the values of GZ P and GNZ P show, this is because many of the pitch estimates for male speakers have been missed. Unfortunately, if the graphs of 'hit' values are examined for male speakers, we find that for much of the voiced speech the number of 'hits' does not rise much above the background level, making pitch estimation impossible. This is true of all the results to some degree: at some points there are no clear peaks in the periodicity of IF deviations to use for pitch estimation. The problem could be particularly prevalent for male speakers because the period of their voices is far longer than that of female speakers, and so is more likely to be corrupted by spurious peaks/troughs in IF.

However, looking at the results for female speakers, we see that the gross errors are relatively low, averaging less than 10%, with a mean fine error of only 1.43%. Also, because the gross errors are mostly due to missed pitch estimates, there are never the wild fluctuations due to pitch doubling/halving seen in other pitch detectors. Therefore, if the estimator does find a pitch value, it is very likely to be correct.

Figure 6: Graph of Voiced/Unvoiced Error (Gross error) to fine error percentages for various Pitch Detection Algorithms

When these results are compared with other commonly used pitch detection algorithms, we see that, due to the problems encountered with male speakers, they do not compare favourably. Figure 6 shows a graph of voiced/unvoiced errors (essentially gross errors) and fine error means for pitch contours generated using IF deviation and a variety of other techniques. As can be seen, whilst the fine error mean scores are respectable, lying somewhere in the middle of the graph, the voiced/unvoiced errors are much higher, due to the problems discussed earlier. Thanks go to Eric Mousset for comparing this technique with others when writing 'A Comparison of Several Recent Methods of Fundamental Frequency and Voicing Decision Estimation'.

Up to this point, all of the tests have taken place in clean conditions. In order to test this pitch detection method under more general conditions, a sample utterance from speaker f1 was taken and mixed with white noise at SNR levels of -6dB, 0dB and +6dB, with the following results:

S/N	Margin	Frames	FineErr	GrossE	GrossP	GNZ E	GNZ P	GZ E	GZ P	
============================================================================= 
clean	18	3221	0.0121	332	10.31	10	0.31	322	10.00	
-6 dB	13	595	0.0167	69	11.60	5	0.84	64	10.76
0 dB	11	595	0.0202	94	15.80	17	2.86	77	12.94
6 dB	9	595	0.0290	158	26.55	26	4.37	132	22.18
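For readers wishing to set up a similar noise test, mixing white noise into a signal at a target SNR can be sketched as follows. The source does not say how the noise was generated or scaled, so this assumes Gaussian white noise and the usual power-ratio definition of SNR in dB. The function name `mix_white_noise` is hypothetical.

```python
import math
import random

def mix_white_noise(signal, snr_db, seed=0):
    """Return `signal` plus Gaussian white noise scaled to the requested SNR.

    SNR is taken as 10*log10(signal_power / noise_power); this definition is
    an assumption, as the original test setup is not described.
    """
    if not signal:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Noise power needed to hit the target ratio, then the matching gain.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [x + gain * n for x, n in zip(signal, noise)]
```

Because the gain is computed from the measured power of the generated noise rather than its nominal variance, the resulting SNR matches the target for the particular noise sequence drawn.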

Figure 7: Graph of 'Hits' and Pitch Estimates for various SNR

If we compare these results with those taken from a larger utterance by f1 under clean conditions, we can see that even with large levels of white noise the pitch detection algorithm is still relatively robust. Taking a closer look at the results shown in Figure 7, we see that the pitch contour remains relatively close to the reference even at an SNR of 0 dB; it is only when we have twice as much noise as signal that we get wild fluctuations in the pitch estimates at various points. Unfortunately, it has not been possible to compare these results with the other pitch detection algorithms seen in Figure 6, because none of those algorithms have been tested under noisy conditions. So it could be true that, whilst this algorithm performs relatively inaccurately under clean conditions, it may outperform its competitors when exposed to noise.

Improvements to this technique that could offer better performance, especially for male speakers, are mainly centred on more robust peak/trough detection algorithms, since it is the low number of correct period estimates that is causing the system problems with male speakers. Another slight improvement could result from taking into account period estimates at double/half of a value when choosing the best pitch period estimate, in case pitch doubling/halving is having a serious effect on the IF deviation period.
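One way the double/half idea might be tried is to let each candidate period borrow a share of the hits found at half and at double its value. This is a speculative sketch of the suggested improvement, which has not been tested here. The function name, the `octave_weight` of 0.5, and the representation of hits as a mapping from integer period bin to hit count are all assumptions.

```python
def best_period_with_octaves(hits, octave_weight=0.5):
    """Pick the period bin with the best score, or None if `hits` is empty.

    `hits` maps an integer period bin to its hit count. Each bin is scored by
    its own hits plus a weighted share of the hits at half and double its period.
    """
    def score(b):
        half = hits.get(b // 2, 0) if b % 2 == 0 else 0
        double = hits.get(b * 2, 0)
        return hits[b] + octave_weight * (half + double)

    return max(hits, key=score) if hits else None
```

For instance, with hits of 8 at bin 80, 5 at bin 160, 4 at bin 40 and 9 at an unrelated bin 57, the plain peak is bin 57, but the supporting hits at half and double the period lift bin 80 to a score of 12.5 and it is chosen instead.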

Modified 3/6/96 by Jeremy Goslin
