Classification of Corporate and Public Text
Abstract
In this project we tackle the problem of classifying a body of text as either a corporate message (text sent to a private, select audience, usually in a company setting and often related to company information) or a public message (text that is more open and can be broadcast to a larger audience). Corporate messages usually contain sensitive, private data that would put an organization or individual at risk if leaked. This includes messages about financial trades, company deals, and business meetings. Public messages, by contrast, are less secretive and more casual by nature. This problem is very similar to Data Loss Prevention (DLP), a security issue that involves systems that identify, monitor, and protect confidential data from leakage. While the details of these systems are themselves confidential, the general industry techniques involve regular expressions, keywords, and hashing. Regular expressions are used to match data such as social security numbers, telephone numbers, and addresses. Keyword matching is used to identify a few words that are marked as private. Hashing works by hashing the substrings of private documents and classifying a new document as private if it contains a substring with a matching hash. Our problem is similar to DLP, but given our data set it would be wrong to treat it as DLP. Instead of looking for confidential information, we ask whether a text would appear in a corporate message. Still, this work can help shed light on DLP, perhaps by improving the keywords used in DLP techniques. In our comparison of text classifiers, we used Naive Bayes, Logistic Regression, and Support Vector Machine (SVM) classifiers and found that SVMs showed consistently better results. However, noticing that the corporate and public messages were centered around certain topics, we used LDA to improve our logistic regression model, which is itself probabilistic in nature. Adding LDA improved the logistic regression results, but they remained slightly below those of the SVMs.
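The three DLP techniques named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' system or any vendor's implementation: the SSN pattern, the keyword set, the five-word window, and the names `fingerprint` and `dlp_flags` are all hypothetical choices made for the example.

```python
import hashlib
import re

# Hypothetical pattern and keywords, chosen only for illustration.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PRIVATE_KEYWORDS = {"confidential", "merger", "acquisition"}
WINDOW = 5  # consecutive words hashed together (an assumption)


def word_windows(text, size=WINDOW):
    """Yield every run of `size` consecutive lowercased words in `text`."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def fingerprint(text, size=WINDOW):
    """Return the set of SHA-256 digests of all word windows in `text`."""
    return {hashlib.sha256(w.encode("utf-8")).hexdigest()
            for w in word_windows(text, size)}


def dlp_flags(text, private_hashes):
    """Report which of the three checks fire for `text`."""
    words = set(re.findall(r"\w+", text.lower()))
    return {
        "regex": bool(SSN_PATTERN.search(text)),            # pattern match
        "keyword": bool(words & PRIVATE_KEYWORDS),          # keyword match
        "hash": bool(fingerprint(text) & private_hashes),   # shared substring
    }


# Fingerprint a known private document, then screen new messages against it.
private_doc = "The board approved the purchase of Acme Corp for ten million dollars"
private_hashes = fingerprint(private_doc)

leak = "fyi the board approved the purchase of Acme Corp yesterday"
casual = "Lunch at noon? The new place downtown looks good"

print(dlp_flags(leak, private_hashes))
print(dlp_flags(casual, private_hashes))
```

Here the `leak` message trips only the hash check, because it reuses a five-word run from the private document without containing any listed keyword or number pattern. A learned classifier of the kind compared in this project aims to go further and catch messages that share no exact pattern with known private text.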
Related Resources
Text and Picture Segmentation by the Distribution Analysis of Wavelet
Statistical classification is an important topic in image processing. Classification helps to interpret images, and it can be incorporated into other image processing algorithms, e.g., image compression [1], to improve performance. A particularly interesting type of classification is the segmentation of pictures and text. By pictures, we mean continuous-tone images such as photographs. By text, we...
Automatic Discovery of Document Classification Knowledge
We investigate approaches for automatic discovery of document classification knowledge from text databases. We review existing rule-based text classification learning algorithms such as SWAP-1 and RIPPER. After identifying their weakness, we propose a new technique known as the IBRI algorithm by unifying the strengths of rule-based learning and instance-based approaches and adapting to characteri...
Using Wavelet Coefficient Distributions
In this paper, an algorithm is developed for segmenting document images into four classes: background, photograph, text, and graph. Features used for classification are based on the distribution patterns of wavelet coefficients in high frequency bands. Two important attributes of the algorithm are its multiscale nature: it classifies an image at different resolutions adaptively, enabling accurate cla...
Text Classification from Labeled and Unlabeled Documents Using
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from lab...
Challenges of the Email Domain for Text Classification
Interactive classification of email into a user-defined hierarchy of folders is a natural domain for application of text classification methods. This domain presents several challenges. First, the user's changing mailing habits mandate that classification technology adapt in a dynamic environment. Second, the classification technology needs to be able to handle heterogeneity in folder content and folder ...