Vi-DIFF: Understanding Web Pages Changes

Authors

  • Zeynep Pehlivan
  • Myriam Ben Saad
  • Stéphane Gançarski
Abstract

Many applications now need to detect and discover changes on the web to help users understand page updates and, more generally, web dynamics. Web archiving is one field where detecting changes on web pages is important. Archiving institutes collect and preserve different versions of web sites for future generations. A major problem encountered by archiving systems is understanding what happened between two versions of a web page. In this paper, we address this requirement by proposing a new change detection approach that computes the semantic differences between two versions of an HTML web page. Our approach, called Vi-DIFF, detects changes on the visual representation of web pages. It detects two types of changes: content changes and structural changes. Content changes include modifications to text, hyperlinks, and images. In contrast, structural changes alter the visual appearance of the page and the structure of its blocks. Vi-DIFF can serve various applications such as crawl optimization, archive maintenance, and browsing of web changes. Experiments on Vi-DIFF were conducted and the results are promising.
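
To make the three content-change categories named in the abstract (text, hyperlinks, images) concrete, here is a minimal Python sketch. It is not the Vi-DIFF algorithm: Vi-DIFF works on the visual representation of a page and also detects structural changes, whereas this illustration only compares flat sets of items extracted from the raw HTML. The names `ContentCollector`, `collect`, and `content_changes` are hypothetical and chosen for this example only.

```python
from html.parser import HTMLParser


class ContentCollector(HTMLParser):
    """Collects text fragments, hyperlink targets, and image sources."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self.images = set(), set(), set()
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.add(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.add(attrs["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep non-empty text outside scripts/styles, with whitespace normalized.
        if not self._skip and data.strip():
            self.text.add(" ".join(data.split()))


def collect(html):
    """Parse one page version and return the filled collector."""
    parser = ContentCollector()
    parser.feed(html)
    parser.close()
    return parser


def content_changes(old_html, new_html):
    """Report inserted and deleted items per category (assumption: set-based,
    so block position and visual layout are ignored)."""
    old, new = collect(old_html), collect(new_html)
    return {
        kind: {
            "deleted": sorted(getattr(old, kind) - getattr(new, kind)),
            "inserted": sorted(getattr(new, kind) - getattr(old, kind)),
        }
        for kind in ("text", "links", "images")
    }
```

Calling `content_changes(old_html, new_html)` on two versions of a page returns a dictionary keyed by `text`, `links`, and `images`. Because the comparison is set-based, an edited sentence shows up as one deletion plus one insertion rather than as an update, and moved blocks are not detected at all.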

Similar resources

Detecting Stealth Web Pages That Use Click-Through Cloaking

Search spam is an attack on search engines' ranking algorithms that promotes spam links to top search rankings they do not deserve. Cloaking is a well-known search spam technique in which spammers serve one page to search-engine crawlers to optimize ranking, but serve a different page to browser users to maximize potential profit. In this experience report, we investigate a different and rela...

Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable attackers to hide and obfuscate malicious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...

Comparing Ontologies with ecco

In this paper we present the diff tool ecco, which detects changes to both axioms and concepts between OWL ontologies. Furthermore, the tool aligns axiom changes between each other, according to a fine-grained change categorisation, and subsequently aligns axiom changes with the concepts that each of those directly affect. The diff is open source, and made available as a standalone command-line...

Web Anomaly Detection Through Building Access Usage Profiles

Due to the increase in cyber-attacks, the need for techniques that detect attacks on web servers has drawn attention. Unfortunately, many available security solutions are inefficient at identifying web-based attacks. The main aim of this study is to detect abnormal web navigation based on web usage profiles. In this paper, comparing scrolling behavior of a normal user with an attacker, and simu...

A method for measuring the evolution of a topic on the Web: The case of "informetrics"

The World Wide Web is growing at an enormous speed, and has become an indispensable source for information and research. New pages are being added to the Web, but there are additional processes as well: pages are moved or removed and/or their content changes. In order to obtain a better understanding of these processes, we developed a method for tracking topics on the Web for long periods of ti...

Publication date: 2010