Automatic Segmentation of the Clean Speech Signal

Speech segmentation is the task of detecting change points in order to
partition an input speech signal into regions, each of which
corresponds to only one speaker. In this paper, we apply two features
based on the multi-scale product (MP) of the clean speech signal,
namely the spectral centroid of the MP and the zero-crossing rate of
the MP. We focus on multi-scale product analysis as an important tool
for segment extraction. The MP is obtained by multiplying the wavelet
transform coefficients (WTC) of the speech signal across scales. We
evaluated our method on the Keele database. The results show the
effectiveness of our method: they indicate that the two features can
find word boundaries and extract the segments of the clean speech.
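
As an illustration of the two features named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a first-derivative-of-Gaussian wavelet, the scales 2, 4 and 8 samples, and simple fixed-length framing; none of these choices is stated in the abstract, and all function names are hypothetical.

```python
import numpy as np

def gaussian_derivative_wavelet(scale, width=4.0):
    """First derivative of a Gaussian sampled at `scale` (in samples).
    Assumed wavelet; the abstract does not name the one actually used."""
    half = int(np.ceil(width * scale))
    t = np.arange(-half, half + 1, dtype=float)
    psi = -t * np.exp(-t**2 / (2.0 * scale**2))
    return psi / np.sum(np.abs(psi))  # L1 normalisation (illustrative choice)

def multiscale_product(x, scales=(2, 4, 8)):
    """Pointwise product of the wavelet coefficients over the given scales."""
    mp = np.ones(len(x))
    for s in scales:
        # Wavelet coefficients at scale s, same length as x
        mp *= np.convolve(x, gaussian_derivative_wavelet(s), mode="same")
    return mp

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def spectral_centroid(frames, fs):
    """Magnitude-weighted mean frequency (Hz) of each Hann-windowed frame."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    return (mag @ freqs) / (np.sum(mag, axis=1) + 1e-12)

if __name__ == "__main__":
    fs = 8000                                  # assumed sampling rate
    t = np.arange(4000) / fs
    x = np.sin(2 * np.pi * 200 * t)            # synthetic stand-in for speech
    frames = frame_signal(multiscale_product(x), frame_len=256, hop=128)
    print(zero_crossing_rate(frames)[:3])
    print(spectral_centroid(frames, fs)[:3])
```

In a segmentation setting, boundaries would be sought where these frame-level trajectories change markedly; the decision rule itself is not described in the abstract and is left out of this sketch.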




