On the (In)effectiveness of Mosaicing and Blurring as Tools for Document Redaction

Authors

  • Steven Hill
  • Zhimin Zhou
  • Lawrence Saul
  • Hovav Shacham
Abstract

In many online communities, it is the norm to redact names and other sensitive text from posted screenshots. Sometimes solid bars are used; sometimes a blur or other image transform is used. We consider the effectiveness of two popular image transforms, mosaicing (also known as pixelization) and blurring, for redaction of text. Our main finding is that we can use a simple but powerful class of statistical models, so-called hidden Markov models (HMMs), to recover both short and indefinitely long instances of redacted text. Our approach builds on the success of HMMs for automatic speech recognition, where they are used to recover sequences of phonemes from utterances of speech. Here we use HMMs in an analogous way to recover sequences of characters from images of redacted text. We evaluate an implementation of our system against multiple typefaces, font sizes, grid sizes, pixel offsets, and levels of noise. We also decode numerous real-world examples of redacted text. We conclude that mosaicing and blurring, despite their widespread usage, are not viable approaches for text redaction.
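To make the two ideas in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' system. `mosaic` shows block-averaging pixelization on a small grayscale grid, and `viterbi` shows the standard most-likely-path decoding used with HMMs. The two-character alphabet, the "light"/"dark" observation symbols, and every probability below are made-up assumptions chosen only for illustration.

```python
import math

def mosaic(image, block):
    """Pixelize a 2-D grayscale image (list of rows) by averaging block x block cells."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            cells = [(r, c)
                     for r in range(top, min(top + block, height))
                     for c in range(left, min(left + block, width))]
            mean = sum(image[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for obs (log-space Viterbi)."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: a thin glyph ("i") tends to leave a lighter mosaic
# block than a dense glyph ("m"). All numbers are illustrative assumptions.
STATES = ["i", "m"]
START_P = {"i": 0.5, "m": 0.5}
TRANS_P = {"i": {"i": 0.5, "m": 0.5}, "m": {"i": 0.5, "m": 0.5}}
EMIT_P = {"i": {"light": 0.8, "dark": 0.2}, "m": {"light": 0.1, "dark": 0.9}}

blocks = mosaic([[0, 2, 10, 10], [4, 6, 10, 10]], 2)
decoded = viterbi(["dark", "light", "dark"], STATES, START_P, TRANS_P, EMIT_P)
```

In this toy, each mosaic block is reduced to one coarse symbol and each hidden state is a whole character. A realistic attack would need emission models learned from rendered text at the relevant typeface, font size, grid size, and pixel offset, which this sketch does not attempt.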

Similar resources

A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read many web documents and retrieve the ones most similar to the user ...


C-sanitized: a privacy model for document redaction and sanitization

Within the current context of Information Societies, large amounts of information are daily exchanged and/or released. The sensitive nature of much of this information causes a serious privacy threat when documents are uncontrollably made available to untrusted third parties. In such cases, appropriate data protection measures should be undertaken by the responsible organization, especially und...


Translation Technology Tools and Professional Translators’ Attitudes toward Them

Today technology is an integral part of professional translation; and it is generally assumed that translators’ attitudes toward translation technology tools influence their interaction with technology (Bundgaard, 2017). Therefore, the present two-phase study seeks to shed some light on what translation technology tools are and how professional translators feel toward them. The research method ...


Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions. Both deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...


Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to authors' rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. The issue is considered a serious one in higher academic institutions. There exist language-free tools, which do not yield reliable results because they ignore the special features of each language. Considering the paucit...


Journal:
  • PoPETs

Volume 2016


Publication date: 2016