Template Induction over Unstructured Email Corpora

Authors

  • Julia Proskurnia
  • Marc-Allen Cartright
  • Lluis Garcia Pueyo
  • Ivo Krka
  • James Bradley Wendt
  • Tobias Kaufmann
  • Balint Miklos

Abstract

Unsupervised template induction over email data is a central component in applications such as information extraction, document classification, and auto-reply. The benefits of automatically generating such templates are known for structured data, e.g., machine-generated HTML emails. However, much less work has been done on performing the same task over unstructured email data. We propose a technique for inducing high-quality templates from plain text emails at scale based on the suffix array data structure. We evaluate this method against an industry-standard approach for finding similar content based on shingling, running both algorithms over two corpora: a synthetically created email corpus for a high level of experimental control, as well as user-generated emails from the well-known Enron email corpus. Our experimental results show that the proposed method is more robust to variations in cluster quality than the baseline and that its templates contain more text from the emails, which would benefit extraction tasks by identifying transient parts of the emails. Our study indicates that templates induced using suffix arrays contain approximately half as much noise (measured as entropy) as templates induced using shingling. Furthermore, the suffix array approach is substantially more scalable, proving to be an order of magnitude faster than shingling even for modestly sized training clusters. Public corpus analysis shows that email clusters contain on average 4 segments of common phrases, where each of the segments contains on average 9 words, thus showing that templatization could help users reduce the email writing effort by an average of 35 words per email in an assistance or auto-reply related task.
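
To make the idea of suffix-array-based template induction concrete, here is a minimal sketch, not the authors' implementation. It assumes a cluster of plain text emails is already given, tokenizes on whitespace, builds a naive word-level suffix array over the concatenated emails, and reports the maximal phrases that occur in every email as candidate fixed segments of a template. The function name `common_phrases`, the `min_words` parameter, and the sentinel scheme are hypothetical choices made for illustration; a scalable version would use a linear-time suffix array construction instead of sorting suffixes directly.

```python
from typing import List, Tuple


def common_phrases(emails: List[str], min_words: int = 2) -> List[Tuple[str, ...]]:
    """Return maximal word sequences that occur in every email of a cluster.

    Illustrative sketch only: naive suffix sorting, whitespace tokenization.
    """
    if len(emails) < 2:
        raise ValueError("need at least two emails to find shared phrases")

    # Concatenate all emails into one token stream. A unique sentinel after
    # each email guarantees that no match can span two emails.
    tokens: List[str] = []
    owner: List[int] = []  # owner[i] = index of the email holding tokens[i]
    for doc_id, text in enumerate(emails):
        words = text.split()
        tokens.extend(words)
        owner.extend([doc_id] * len(words))
        tokens.append(f"\x00{doc_id}")
        owner.append(-1)  # sentinels belong to no email
    n = len(tokens)

    # Naive suffix array: start positions sorted by the suffix they begin.
    sa = sorted(range(n), key=lambda i: tokens[i:])

    # lcp[r] = number of leading tokens shared by suffixes sa[r - 1] and sa[r].
    lcp = [0] * n
    for r in range(1, n):
        a, b, k = sa[r - 1], sa[r], 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        lcp[r] = k

    # For each rank hi, grow a window downwards until it contains a suffix
    # from every email. The minimum LCP inside the window is the length of
    # the phrase shared by all suffixes in it.
    found = set()
    for hi in range(n):
        seen = set()
        shared = n
        lo = hi
        while lo >= 0:
            if owner[sa[lo]] != -1:
                seen.add(owner[sa[lo]])
            if len(seen) == len(emails):
                break
            shared = min(shared, lcp[lo])
            if shared < min_words:
                break
            lo -= 1
        if len(seen) == len(emails) and shared >= min_words:
            found.add(tuple(tokens[sa[hi]:sa[hi] + shared]))

    def contains(big: Tuple[str, ...], small: Tuple[str, ...]) -> bool:
        return any(big[i:i + len(small)] == small
                   for i in range(len(big) - len(small) + 1))

    # Keep only phrases that are not part of a longer shared phrase.
    maximal = [p for p in found
               if not any(p != q and contains(q, p) for q in found)]
    return sorted(maximal, key=lambda p: (-len(p), p))
```

Given two emails such as "Dear John your invoice 42 is attached best regards" and "Dear Mary your invoice 97 is attached best regards", the sketch returns the segments "is attached best regards" and "your invoice"; the words between them (the name and the invoice number) are the transient parts that an extraction task would target.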


Similar resources

A Genre Analysis of Reprint Request E-mails Written by EFL and Physics Professionals

The present study aimed to analyze reprint request e-mail messages written by postgraduates (MA students) of two fields of study, namely Physics and EFL, to realize the differences and similarities between the two email types. To investigate the purpose of the study, a sample of 100 e-mail messages, 50 Physics and 50 EFL, were analyzed according to Swales’ (1990) model for reprint requests and ...


Approach to Automatic Translation Template Acquisition Based on Unannotated Bilingual Grammar Induction

In this paper, we propose a new approach which can automatically acquire translation templates from the unannotated bilingual spoken language corpora in the domain of travel information accessing. In the approach, two basic algorithms named grammar induction algorithm and dynamic programming algorithm are adopted. Our approach is an unsupervised, statistical, data-driven method which avoids the...


Proceedings Template - WORD

Spam is unsolicited bulk email which is extremely annoying to the recipients and the ISPs. However, most of the traditional spam filtering methods commonly neglect the bulk character of spam. This paper proposes a model of cooperative anti-spam system based on multilayer agents. We compared our model to the state-of-the-art and found that our model achieved better performance and robustness on s...


Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

Generative probabilistic models have been used for content modelling and template induction, and are typically trained on small corpora in the target domain. In contrast, vector space models of distributional semantics are trained on large corpora, but are typically applied to domain-general lexical disambiguation tasks. We introduce Distributional Semantic Hidden Markov Models, a novel variant ...


Towards a Structured Representation of Generic Concepts and Relations in Large Text Corpora

Extraction of structured information from text corpora involves identifying entities and the relationship between entities expressed in unstructured text. We propose a novel iterative pattern induction method to extract relation tuples exploiting lexical and shallow syntactic pattern of a sentence. We start with a single pattern to illustrate how the method explores additional patterns and tuple...




Publication date: 2017