Online medical journal article layout analysis
Abstract
We describe a physical and logical layout analysis algorithm that segments and labels online medical journal articles (regular HTML and PDF-Converted-HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key step is therefore to render the HTML file in a Web browser, so that the visual information in zones (each composed of one or more HTML DOM nodes), especially their relative positions, can be used. The recursive X-Y cut algorithm is adopted to construct a hierarchical zone tree. Logical layout analysis uses both geometric and linguistic features: the HTML documents are modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm then finds the optimal label sequence, which completes the logical layout analysis.
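The two named techniques can be illustrated with a minimal sketch. It is not the authors' implementation: the box format, the `min_gap` threshold, the function names, and the two-state toy model in the demo are all assumptions made for illustration (the paper's model has 16 states, and its features are not given here). The first part cuts rendered bounding boxes at whitespace gaps to build a zone tree; the second runs Viterbi decoding over log-probabilities.

```python
import math
from typing import Dict, List, Tuple

# Assumed box format: (left, top, right, bottom) in rendered pixels.
Box = Tuple[int, int, int, int]


def _split(boxes: List[Box], axis: int, min_gap: int) -> List[List[Box]]:
    """Group boxes separated by a gap >= min_gap along axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # farthest right/bottom edge seen so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])  # whitespace gap found: start a new zone
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], min_gap: int = 10) -> dict:
    """Recursively cut boxes into a hierarchical zone tree."""
    node = {"boxes": boxes, "children": []}
    if len(boxes) <= 1:
        return node
    for axis in (1, 0):  # try cuts across y gaps first, then across x gaps
        groups = _split(boxes, axis, min_gap)
        if len(groups) > 1:
            node["children"] = [xy_cut(g, min_gap) for g in groups]
            break
    return node


def leaves(node: dict) -> List[List[Box]]:
    """Return the leaf zones of the tree in cut order."""
    if not node["children"]:
        return [node["boxes"]]
    return [zone for child in node["children"] for zone in leaves(child)]


def viterbi(obs: List[str], states: List[str],
            log_start: Dict[str, float],
            log_trans: Dict[str, Dict[str, float]],
            log_emit: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most likely state sequence for obs under an HMM."""
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            col[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):  # follow back-pointers from the last step
        path.append(ptr[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical page: a title zone above two text columns.
    page = [(0, 0, 100, 20), (0, 40, 45, 200), (55, 40, 100, 200)]
    print(leaves(xy_cut(page)))

    # Hypothetical two-state model with a single font-size feature.
    lg = math.log
    states = ["title", "body"]
    start = {"title": lg(0.9), "body": lg(0.1)}
    trans = {"title": {"title": lg(0.2), "body": lg(0.8)},
             "body": {"title": lg(0.05), "body": lg(0.95)}}
    emit = {"title": {"big": lg(0.9), "small": lg(0.1)},
            "body": {"big": lg(0.1), "small": lg(0.9)}}
    print(viterbi(["big", "small", "small"], states, start, trans, emit))
```

In a real pipeline the boxes would come from the browser's rendering of each DOM node, and the observations would be feature vectors combining geometric and linguistic cues rather than a single categorical value.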
Similar resources
Bibliographic data extraction from HTML medical journal articles
MEDLINE, a biomedical literature database compiled by the US National Library of Medicine, contains 15 million records from approximately 5000 selected journals, and is searched over 3 million times a day worldwide. With more journal articles being published online in hypertext markup language (HTML), the automatic extraction of bibliographic data from HTML articles is important for creating MED...
Automated Document Labeling
An increasing number of publishers are using the Internet and the World Wide Web to provide their subscribers with access to online journals. New techniques are needed to capture, classify, analyze, extract, modify, and reformat Web-based document information for computer storage, access, and processing. An R&D division of the National Library of Medicine (NLM) is developing an automated system...
Style-independent document labeling: design and performance evaluation
The Medical Article Records System or MARS has been developed at the U.S. National Library of Medicine (NLM) for automated data entry of bibliographical information from medical journals into MEDLINE®, the premier bibliographic citation database at NLM. Currently, a rule-based algorithm (called ZoneCzar) is used for labeling important bibliographical fields (title, author, affiliation, and abst...
Online analysis of local field potentials for seizure detection in freely moving rats
Objective(s): Seizure detection during online recording of electrophysiological parameters is very important in epileptic patients. In the present study, online analysis of field potential recordings was used for detecting spontaneous seizures in epileptic animals. Materials and Methods: Epilepsy was induced in rats by pilocarpine injecti...
Layout Definition of Online Magazines with Splitter Components
The capabilities of current mobile devices and the quality of their screens have reached a level where the online reading experience competes with printed media. Commercially printed magazines and newspapers commonly apply different grid-based page designs. In the case of online magazines, the variable conditions, e.g. screen resolution, user preferences and the actual content, require to provide ada...