1. Introduction
The typical mobile worker visits remote locations and regularly participates in meetings with different people. A common follow-up task is writing a summary of what happened during a meeting, including who said what, the ideas that were conceived, the events that occurred, and the conclusions that were reached. Often, it is not just the specific conclusions that matter but also the reasons they were reached and the points of view expressed by the various participants.
As most of us know, producing an accurate summary is an error-prone process, especially if the only record we have is our own memory, perhaps supplemented with handwritten notes. A commonly used portable memory aid is the audiocassette recorder. It can be effective, but it cannot capture visual information that could be helpful later, such as gestures, images of participants, body language, and drawings. An easy-to-use method for incorporating video data would help solve this problem.
Several meeting recorder systems based on panoramic video capture have been proposed in recent years [1]-[4]. These systems provide a non-intrusive recording technique and use subsequent analysis to generate a more user-oriented perspective view during playback. In [3], the user-oriented view is determined from speaker motion. A perhaps more intuitive solution is to compute the speaker direction, as suggested in [4]. The user interface can be made more effective by combining audio and video analysis. A multimodal approach to creating meeting records based on speech recognition, face detection, and people tracking has been reported in CMU's Meeting Room System [2]. Furthermore, techniques such as summarization and dialog analysis, aimed at providing a higher-level understanding of meetings to facilitate searching and retrieval, have been explored [6].
The proposed solution is a portable meeting recorder that captures an omni-directional audio/video recording of a meeting. We assume that only limited analysis can be performed in real time but that this is sufficient to produce a recording that can be replayed on the spot, if required. Subsequent analysis of the recorded data enables various output formats that make it easier to produce a meeting summary and allow efficient browsing and navigation of the meeting video. One output format is a viewable representation that shows an image of the person speaking, so that the video can be played like a TV program showing a sequence of talking heads. A searchable representation is also produced that provides efficient techniques for navigating the multimedia data.
Our meeting recorder system is designed with portability and compatibility with commercial hardware in mind. Although its resolution is lower than that of other panoramic systems such as FlyCam and RingCam, its simplicity and its use of existing commercial hardware make it well suited to a portable system.
Although the video capture and storage technology commonly available today requires a bulky PC-based prototype, we expect the device capabilities assumed in this work to become available in handheld devices in the near future. It is therefore essential to solve the problems inherent in a portable system now, so that solutions are available when such devices reach the market.
Technical issues addressed in this work include combining audio and video data to locate the person speaking. We have developed a novel method of four-channel sound localization that accurately computes the angle and elevation of speakers relative to the capture hardware. Combined with a face detection algorithm, this technique determines a view of the people speaking in a meeting.
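As a rough illustration of how a four-microphone array can yield a speaker direction, the Python sketch below uses a generic time-delay-of-arrival approach. It is not the method developed in this work: the array geometry (two orthogonal microphone pairs with equal spacing), the far-field assumption, the speed-of-sound constant, and the function names `best_lag` and `direction_from_delays` are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; approximate value for room-temperature air (assumed)


def best_lag(a, b, max_lag):
    """Return the lag in samples that maximizes the cross-correlation of a and b.

    A positive result means signal b lags behind signal a.
    """
    best, best_score = 0, float("-inf")
    n = len(a)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += a[i] * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def direction_from_delays(tau_x, tau_y, spacing):
    """Estimate far-field (azimuth, elevation) in degrees.

    tau_x: seconds by which the -x microphone lags the +x microphone.
    tau_y: seconds by which the -y microphone lags the +y microphone.
    spacing: distance in meters between the two microphones of each pair.
    """
    # Horizontal components of the unit vector pointing at the source.
    ux = SPEED_OF_SOUND * tau_x / spacing
    uy = SPEED_OF_SOUND * tau_y / spacing
    azimuth = math.degrees(math.atan2(uy, ux))
    # Clamp to 1.0 so measurement noise cannot push acos out of its domain.
    horizontal = min(1.0, math.hypot(ux, uy))
    elevation = math.degrees(math.acos(horizontal))
    return azimuth, elevation


if __name__ == "__main__":
    sample_rate = 48000  # Hz (hypothetical capture rate)
    spacing = 0.2        # m (hypothetical microphone spacing)

    # Synthetic test signal: a short pulse, and a copy delayed by 5 samples.
    plus_x = [0.0] * 64
    plus_x[20], plus_x[21] = 1.0, 0.5
    minus_x = [0.0] * 5 + plus_x[:-5]

    lag = best_lag(plus_x, minus_x, max_lag=10)
    tau_x = lag / sample_rate
    print(direction_from_delays(tau_x, 0.0, spacing))
```

In this sketch the delay measured on each microphone pair gives one horizontal component of the direction to the speaker; the azimuth follows from the ratio of the two components, and the elevation from how much shorter their combined length is than a unit vector. Elevation estimated this way is sensitive to delay quantization near the horizontal plane, so a practical system would need finer delay resolution than whole samples.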
Searching a recorded meeting for specific information can be tedious and time-consuming. To address this, we have developed a novel user interface that represents speaker transitions and shows when events happened during a meeting and the context in which they occurred, letting users navigate easily to those points in the video. A novel technique for compressed-domain analysis of the MPEG-2 stream finds localized motion indicative of people moving. An algorithm for audio analysis measures the intensity of a conversation and the speed of participant interaction. Both measures are represented in the user interface in a way that improves navigation of the recorded video.
An additional useful feature developed for the portable meeting recorder is the automatic detection of the room in which a meeting occurred. This can significantly speed up searches over a large collection of meeting videos. We describe a unique algorithm that clusters meeting videos and provides such a room-based search capability.






