Extracting Product Information from Email Receipts Using Markov Logic

Authors

  • Stanley Kok
  • Wen-tau Yih
Abstract

Email receipts (e-receipts) frequently record e-commerce transactions between users and online retailers, and contain a wealth of product information. Such information could be used in a variety of applications if it could be reliably extracted. However, extracting product information from e-receipts poses several challenges. For example, the high labor cost of annotating e-receipts makes traditional supervised approaches infeasible. E-receipts may also be generated from a variety of templates, and are usually encoded in plain text rather than HTML, making it difficult to discover the regularity of how product information is presented. In this paper, we present an approach that addresses all these challenges. Our approach is based on Markov logic [22], a language that combines probability and logic. From a corpus of unlabeled e-receipts, we identify all possible templates by jointly clustering the e-receipts and the lines in them. From the non-template portions of e-receipts, we learn patterns describing how product information is laid out, and use them to extract the product information. Experiments on a corpus of real-world e-receipts demonstrate that our approach performs well. Furthermore, the extracted information can be reliably used as labeled data to bootstrap a supervised statistical model, and our experiments show that such a model is able to extract even more product information.

Similar resources

Quantifier Scope Disambiguation Using Extracted Pragmatic Knowledge: Preliminary Results

It is well known that pragmatic knowledge is useful and necessary in many difficult language processing tasks, but because this knowledge is difficult to acquire and process automatically, it is rarely used. We present an open information extraction technique for automatically extracting a particular kind of pragmatic knowledge from text, and we show how to integrate the knowledge into a Markov...

Systematic literature review of fuzzy logic based text summarization

"Information Overload" is not a new term, but the massive development in technology, which enables anytime, anywhere, easy and unlimited access, participation and publishing of information, has consequently escalated its impact. Assisting users' informational searches with reduced reading and surfing time by extracting and evaluating accurate, authentic and relevant information are the primary c...

Machine Reading Using Markov Logic Networks for Collective Probabilistic Inference

DARPA’s Machine Reading project is directed at extracting specific information from natural language text such as events from news articles. We describe a component of FAUST, a system designed for machine reading, which combines state-of-the-art information extraction (IE), based on statistical parsing and local sentence-wise analysis, with global article-wide inference using Markov Logic Network...

Statistical Models for Exploring Individual Email Communication Behavior

As digital communication devices play an increasingly prominent role in our daily lives, the ability to analyze and understand our communication patterns becomes more important. In this paper, we investigate a latent variable modeling approach for extracting information from individual email histories, focusing in particular on understanding how an individual communicates over time with recipie...

Making Travel Smarter: Extracting Travel Information From Email Itineraries Using Named Entity Recognition

The purpose of this research is to address the problem of extracting information from travel itineraries and discuss the challenges faced in the process. Business-to-customer emails like booking confirmations and e-tickets are usually machine generated by filling slots in pre-defined templates, which improve the presentation of such emails but also make the emails more complex in structure. Extra...

Publication date: 2009