Extractor Based on PDF-Renderer

Authors

  • W. Yip Lu
  • Jia-Lang Seng
Abstract

In this paper we propose a new solution for PDF (Portable Document Format) text extraction. First, we compare several PDF text extraction tools, starting with a brief presentation of available tools that have been used in research works. Second, we analyze the performance of ICEpdf and PDFBox, two open-source Java tools. Our experimental results show that none of the tools strictly subsumes another. Both have clear font and overlapping problems. To overcome these issues, we propose a new text extraction engine based on the Java PDF-Renderer, which shows good rendering compared with the previous tools. Our results can help researchers who need such a tool to understand the characteristics of each one and to choose a suitable tool for their work.

Keywords: PDF; Portable Document File; Text extractor tool; PDF-Renderer.

Similar resources

Extract Me If You Can: Abusing PDF Parsers in Malware Detectors

Owing to the popularity of the PDF format and the continued exploitation of Adobe Reader, the detection of malicious PDFs remains a concern. All existing detection techniques rely on the PDF parser to a certain extent, while the complexity of the PDF format leaves an abundant space for parser confusion. To quantify the difference between these parsers and Adobe Reader, we create a reference Jav...

Asymptote: Lifting TEX to three dimensions

Asymptote, a modern successor to the METAPOST vector graphics language that features robust floating-point numerics, high-order functions, and deferred drawing, has recently been enhanced to generate fully interactive three-dimensional output. This data can either be viewed with Asymptote’s native OpenGL-based renderer or internally converted to Adobe’s highly compressed PRC format for embedding...

A Pattern Recognition System for Malicious PDF Files Detection

Malicious PDF files have been used to harm computer security during the past two to three years, and modern antivirus products are proving not to be completely effective against this kind of threat. In this paper an innovative technique, which combines a feature extractor module strongly related to the structure of PDF files and an effective classifier, is presented. This system has proven to be more effec...

Improving frameless rendering by focusing on change (Online ID 0319)

Realtime rendering requires accurate display of a dynamic scene with minimal delay. Frameless rendering [Bishop et al. 1994] offers unique flexibility in this regard: because it samples time per pixel, it can respond to change with very little delay, and at any location in the image. However, sampling is random, resulting in blurring in changing image regions. We present an approach for improvi...

Improving the Extraction of Text in PDFs by Simulating the Human Reading Order

Text preprocessing and segmentation are critical tasks in search and text mining applications. Due to the huge number of documents that are exclusively presented in PDF format, most Data Mining (DM) and Information Retrieval (IR) systems must extract content from PDF files. On some occasions this is a difficult task: the result of the extraction process from a PDF file is plain text,...

Publication date: 2011