3.1 Semantic Analysis and Extraction
Semantic document elements are the building blocks of Multimedia Thumbnails. We divide these elements into three groups depending on the element presentation type:
- Purely visual elements that can be presented to the user only through the visual channel;
- Purely audible elements that can be presented to the user only through the audio channel; and
- Audiovisual elements that can be presented to the user synchronously through both visual and audio channels.
3.1.1 Extraction of Document Elements
To generate a Multimedia Thumbnail automatically, the semantic document elements, their locations on the page, and the reading order must be extracted. If a document is a scanned image, PostScript, or PDF file, a preprocessing step applies layout analysis and optical character recognition using commercial software. The software also automatically determines a reading order based on the layout. The output of the preprocessor, a collection of document elements, is analyzed further to assign semantic labels, such as title, section heading, and figure caption, to the visual document elements. Publication name and date are generally difficult to extract automatically. If this information is not present in the file header, it can be provided to the algorithm as metadata.
In addition to identifying visual information in this way, the analysis step also determines audible document information from the document image and metadata. Examples of audible information include figure captions, keywords, publication date, and publication name, all of which can be synthesized to speech. We extract keywords from a document with TF-IDF analysis [18].
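As an illustration of the keyword extraction step, the following minimal sketch ranks the words of one document by TF-IDF against a small reference corpus. The tokenizer, the smoothed IDF formula, the function names (`tokenize`, `tfidf_keywords`), and the `top_k` parameter are assumptions made for this example; the exact weighting used in [18] may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(document, corpus, top_k=3):
    """Return the top_k words of `document` ranked by TF-IDF against `corpus`."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)
    corpus_vocab = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_vocab)

    scores = {}
    for word, count in term_counts.items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for vocab in corpus_vocab if word in vocab)
        # Smoothed IDF (assumed variant); words found in every document score 0.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[word] = (count / len(doc_tokens)) * idf

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the thumbnail shows the thumbnail title",
]
keywords = tfidf_keywords(corpus[2], corpus, top_k=3)
```

In this toy corpus, "the" occurs in every document and therefore receives a score of zero, while "thumbnail" ranks first. The selected keywords could then be passed to a text-to-speech engine.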
3.1.2 Pre-processing for Real-Time Rendered Documents
In static documents such as PostScript and PDF files, the layout of a page and the coordinates of its text and figures are already known. However, this is not the case for symbolic source documents that are rendered in real time, such as HTML pages and MSWord documents. In real-time rendered documents, the layout of a page changes depending on the page size and the selected printer properties. Nevertheless, these symbolic source representations can contain very valuable semantic information about their contents, such as the text and formatting of titles, headers, and figures, which is usually difficult to obtain accurately from image-based representations.
In most commonly used symbolic representations, document elements and their semantic tags can be extracted from the file either by simple parsing of the description (e.g., HTML) or by using the APIs that allow access to the proprietary representations (e.g., MSWord, MSPowerPoint). The extracted information is stored in an XML description file. The coordinates of those elements are not known until the file is rendered for display or printing. In some cases a Document Object Model (DOM) [19] can be used to obtain this information, but DOM is not supported by all document representation formats. Our solution is to generate a static visual representation of a symbolic source document by printing the file and storing the coordinates of the document elements in a second XML description file. This second XML description is less accurate in its semantic labels; for example, it contains no links between figures and figure captions. On the other hand, it contains accurate location information for images and text.
The information contained in the two XML files is merged by content matching, which depends on the document element type. To match figure content, bitmaps corresponding to figure areas are extracted from the original files using the location information present in the XML files, and color layout similarity is employed. To match text elements, a similarity measure based on word trigrams is used, similar to the method employed in [20]. The output of content matching is a description of the document content that contains the semantic labels of document elements, such as title, section heading, and caption, as well as their absolute coordinates on the page.
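The text-matching part of this merge can be sketched as follows. The sketch assumes that similarity is the Jaccard overlap of word-trigram sets and that each semantic element is paired with its best-scoring located element above a threshold. The dictionary fields (`label`, `text`, `bbox`), the function names, and the threshold value are hypothetical, and the measure in [20] may be defined differently.

```python
def word_trigrams(text):
    """Return the set of consecutive three-word tuples in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of word-trigram sets (an assumed similarity measure)."""
    ta, tb = word_trigrams(a), word_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_elements(semantic, located, threshold=0.5):
    """Attach coordinates to semantically labeled elements.

    `semantic` holds elements parsed from the symbolic source (label + text);
    `located` holds elements from the printed representation (text + bbox).
    """
    merged = []
    for sem in semantic:
        best, best_score = None, 0.0
        for loc in located:
            score = trigram_similarity(sem["text"], loc["text"])
            if score > best_score:
                best, best_score = loc, score
        if best is not None and best_score >= threshold:
            merged.append({"label": sem["label"], "text": sem["text"],
                           "bbox": best["bbox"]})
    return merged


semantic = [
    {"label": "title", "text": "Multimedia Thumbnails for Documents"},
    {"label": "caption", "text": "Figure 1. Overview of the proposed system"},
]
located = [
    {"text": "Figure 1. Overview of the proposed system", "bbox": (72, 400, 540, 420)},
    {"text": "Multimedia Thumbnails for Documents", "bbox": (72, 60, 540, 90)},
]
merged = match_elements(semantic, located)
```

Elements with fewer than three words produce no trigrams and are left unmatched in this sketch; a practical system would need a fallback, such as exact string comparison, for short titles and headings.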
3.1.3 Selection of Presentation Channels
Some document elements, such as a title, can be presented through the visual channel only (by zooming in and panning over the title), through the audio channel only (by synthesizing the title to speech), or through both channels. The possible and preferred presentation channels for document elements, based on our previous user study [12], are listed in Table 1. For example, users prefer to have author names presented only in the visual channel because the text-to-speech engine often mispronounces names. In contrast, users strongly prefer to have the title and figures with captions presented in both the visual and audio channels. Users also indicated that they like to hear the number of pages and page numbers in the audio channel. More details can be found in [12].
Table 1. Preferred presentation channels for document elements based on our user study in [12].
| Document element | Possible presentation channels | Preferred presentation channel |
|---|---|---|
| Title | Visual and Audio | Audiovisual |
| Figure with captions | Visual and Audio | Audiovisual |
| Figure with no captions | Visual | Visual |
| Section headings | Visual and Audio | Audiovisual |
| Abstract | Visual and Audio | Visual |
| References | Visual and Audio | Visual |
| Page thumbnail | Visual | Visual |
| Author names | Visual and Audio | Visual |
| Publication name | Visual and Audio | Audio |
| Publication date | Visual and Audio | Audio |
| Keywords | Audio | Audio |
| Page number | Visual and Audio | Audio |
| Number of pages | Audio | Audio |
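For use in an implementation, Table 1 can be encoded as a simple lookup structure. The sketch below is one possible encoding; the names `CHANNELS`, `preferred_channels`, and `group_by_presentation` are hypothetical, and only the table contents come from the user study in [12].

```python
VISUAL, AUDIO = "visual", "audio"
BOTH = frozenset({VISUAL, AUDIO})
V_ONLY = frozenset({VISUAL})
A_ONLY = frozenset({AUDIO})

# Document element -> (possible channels, preferred channels), as in Table 1.
CHANNELS = {
    "Title": (BOTH, BOTH),
    "Figure with captions": (BOTH, BOTH),
    "Figure with no captions": (V_ONLY, V_ONLY),
    "Section headings": (BOTH, BOTH),
    "Abstract": (BOTH, V_ONLY),
    "References": (BOTH, V_ONLY),
    "Page thumbnail": (V_ONLY, V_ONLY),
    "Author names": (BOTH, V_ONLY),
    "Publication name": (BOTH, A_ONLY),
    "Publication date": (BOTH, A_ONLY),
    "Keywords": (A_ONLY, A_ONLY),
    "Page number": (BOTH, A_ONLY),
    "Number of pages": (A_ONLY, A_ONLY),
}


def preferred_channels(element):
    """Return the preferred presentation channels for a document element."""
    return set(CHANNELS[element][1])


def group_by_presentation(elements):
    """Split elements into visual, audio, and audiovisual groups by preference."""
    groups = {"visual": [], "audio": [], "audiovisual": []}
    for name in elements:
        preferred = CHANNELS[name][1]
        if preferred == BOTH:
            groups["audiovisual"].append(name)
        elif preferred == V_ONLY:
            groups["visual"].append(name)
        else:
            groups["audio"].append(name)
    return groups


groups = group_by_presentation(["Title", "Author names", "Keywords"])
```

Keeping the possible channels alongside the preferred ones lets a renderer fall back to another permitted channel, for example when audio output is unavailable.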






