Jonathan Foote, John Boreczky, and Lynn Wilcox. Finding Presentations in Recorded Meetings Using Audio and Video Features. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (Phoenix, AZ), vol. 6, pp. 3029-3032, 1999.
Abstract: This paper describes a method for finding segments in video-recorded meetings that correspond to presentations. These segments serve as indexes into the recorded meeting. The system automatically detects intervals of video that correspond to presentation slides. We assume that only one person speaks during an interval when slides are detected. Thus these intervals can be used as training data for a speaker spotting system. An HMM is automatically constructed and trained on the audio data from each slide interval. A Viterbi alignment then resegments the audio according to speaker. Since the same speaker may talk across multiple slide intervals, the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. This allows the individual presentations in the video to be identified from the location of each presenter's speech. Results are presented for a corpus of six meeting videos.