Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web
نویسندگان
چکیده
This paper presents a lightweight method for unsupervised extraction of paraphrases from arbitrary textual Web documents. The method differs from previous approaches to paraphrase acquisition in that 1) it removes the assumptions on the quality of the input data, by using inherently noisy, unreliable Web documents rather than clean, trustworthy, properly formatted documents; and 2) it does not require any explicit clue indicating which documents are likely to encode parallel paraphrases, as they report on the same events or describe the same stories. Large sets of paraphrases are collected through exhaustive pairwise alignment of small needles, i.e., sentence fragments, across a haystack of Web document sentences. The paper describes experiments on a set of about one billion Web documents, and evaluates the extracted paraphrases in a natural-language Web search application.
منابع مشابه
Clustering and Matching Headlines for Automatic Paraphrase Acquisition
For developing a data-driven text rewriting algorithm for paraphrasing, it is essential to have a monolingual corpus of aligned paraphrased sentences. News article headlines are a rich source of paraphrases; they tend to describe the same event in various different ways, and can easily be obtained from the web. We compare two methods of aligning headlines to construct such an aligned corpus of ...
متن کاملThe Needles-in-Haystack Problem
We consider a new data mining problem of detecting the members of a rare class of data, the needles, that have been hidden in a set of records, the haystack. Besides the haystack, a single instance of a needle is given. It is assumed that members of the needle class are similar according to an unknown needle characterization. The goal is to find the needle records hidden in the haystack. This p...
متن کاملScaling Web-based Acquisition of Entailment Relations
Paraphrase recognition is a critical step for natural language interpretation. Accordingly, many NLP applications would benefit from high coverage knowledge bases of paraphrases. However, the scalability of state-of-the-art paraphrase acquisition approaches is still limited. We present a fully unsupervised learning algorithm for Web-based extraction of entailment relations, an extended model of...
متن کاملGeneralizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphra...
متن کاملGuest Editors' Introduction: Information Discovery--Needles and Haystacks
For thousands of years, people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information in electronic form — and finding useful needles in the resulting haystacks has since become one of the most important problems in information management. Many systems exist to help users navigate the considerable...
متن کامل