employing jaccard

Frequent-Itemset Mining Using Locality-Sensitive Hashing

2016

Debajyoti Bera, Rameshwar Pratap

The Apriori algorithm is a classical algorithm for the frequent itemset mining problem. Two significant bottlenecks in Apriori are the number of I/O operations involved and the number of candidates it generates. We investigate the role of LSH techniques to overcome these problems, without adding much computational overhead. We propose randomized variations of Apriori that are based on asymmetric LS...


Heterodera glycines in Indiana: III. 2-D Protein Patterns of Geographical Isolates.

Journal: Journal of Nematology, 1986

V R Ferris, J M Ferris, L L Murdock, J Faghihi

Protein patterns obtained by two-dimensional polyacrylamide gel electrophoresis for three isolates of Heterodera glycines from southern Indiana appear qualitatively similar and have higher pairwise Jaccard similarity coefficients with each other than with isolates from northern Indiana. Three isolates from three northern counties share proteins not present in the southern isolates, but as a gro...


Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve

Journal: Inf. Process. Manage., 2006

Leo Egghe, Ronald Rousseau

Classical information retrieval and overlap measures such as the Jaccard index, the Dice coefficient and Salton’s cosine measure can be characterized by Lorenz curves. This result demonstrates the existence of a formal link between information retrieval and the information sciences on the one hand, and concentration and diversity theory, as used, e.g., in social economics and ecology on the oth...
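As a rough illustration of the three overlap measures named in this abstract, the following minimal sketch computes them for plain finite sets. The function names and the conventions chosen for empty sets are assumptions made for this example, and the sketch does not reproduce the Lorenz-curve characterization the paper describes.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    if total == 0:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return 2 * len(a & b) / total


def cosine(a: set, b: set) -> float:
    """Salton's cosine measure for sets: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        # Assumed convention: undefined case is reported as 0.
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical document IDs: retrieved set vs. relevant set.
retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6}
print(jaccard(retrieved, relevant))  # 2 shared out of 6 distinct
print(dice(retrieved, relevant))     # 0.5
print(cosine(retrieved, relevant))   # 0.5
```

All three measures share the same numerator (the overlap) and differ only in how they normalize it.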


Ghent University-iMinds at MediaEval 2013: An Unsupervised Named Entity-based Similarity Measure for Search and Hyperlinking

2013

Tom De Nies, Wesley De Neve, Erik Mannens, Rik Van de Walle

In this paper, we describe our approach to the Search and Hyperlinking task at the MediaEval 2013 benchmark. This task focuses on video retrieval and linking in the context of a large and rich dataset provided by the BBC. Our approach makes use of one of three types of audio transcripts, enriched with Named Entities. To compute similarity, we adapt the Jaccard metric to use Named Entities. This...


Construction of weak and strong similarity measures for ordered sets of documents using fuzzy set techniques

Journal: Inf. Process. Manage., 2003

Leo Egghe, Christine Michel

Ordered sets of documents are encountered more and more in information distribution systems, such as information retrieval systems (IRS). Classical similarity measures for ordinary sets of documents hence need to be extended to these ordered sets. This is done in this paper using fuzzy set techniques. First a general similarity measure is developed which contains the classical strong similarity...


Empirical Comparisons of MASC Word Sense Annotations

2012

Gerard de Melo, Collin F. Baker, Nancy Ide, Rebecca J. Passonneau, Christiane Fellbaum

We analyze how different conceptions of lexical semantics affect sense annotations and how multiple sense inventories can be compared empirically, based on annotated text. Our study focuses on the MASC project, where data has been annotated using WordNet sense identifiers on the one hand, and FrameNet lexical units on the other. This allows us to compare the sense inventories of these lexical r...


Comparative influence of spatial scale on beta diversity within regional assemblages of birds and butterflies

2004

Ralph Mac Nally, Erica Fleishman, Lesley P. Bulluck, Christopher J. Betrus

Methods: Data on species composition for both taxonomic groups were collected using standard inventory methods for birds and butterflies in temperate regions. Data were compiled at three sampling grains: sites (average 12 ha), canyons (average 74 ha) and mountain ranges. For each sampling grain in turn, we calculated similarity of species composition using the Jaccard index. First, we investiga...


Method for Identification of Suitable Persons in Collaborators' Networks

2012

Pavla Drázdilová, Alisa Babskova, Jan Martinovic, Katerina Slaninová, Stepan Minks

Finding and recommending suitable persons based on their characteristics in social or collaboration networks is still a big challenge. The purpose of this paper is to discover and recommend suitable persons or a whole community within a developers’ network. The experiments were carried out on data collected from Codeplex.com, a specialized web portal used for collaboration among developers. Users ...


Tanimoto's Best Barbecue: Discovering Regulatory Modules using Tanimoto Scores

2007

Axel Mosig, Peter Menzel, Peter F. Stadler

We present a combinatorial method for discovering cis-regulatory modules in promoter sequences. Our approach combines “sliding window” approaches with a scoring function based on the so-called Tanimoto score. This allows us to identify sets of binding sites that tend to occur preferentially in the vicinity of each other in a given set of promoter sequences belonging to co-expressed or orthologous ...


Regular Language Distance and Entropy

2017

Austin J. Parker, Kelly B. Yancey, Matthew P. Yancey

This paper addresses the problem of determining the distance between two regular languages. It will show how to expand Jaccard distance, which works on finite sets, to potentially-infinite regular languages. The entropy of a regular language plays a large role in the extension. Much of the paper is spent investigating the entropy of a regular language. This includes addressing issues that have ...
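For orientation, the finite-set Jaccard distance that this abstract takes as its starting point can be sketched as below. This is only the standard finite case applied to small, hypothetical sets of strings; the function name is an assumption, and the entropy-based extension to potentially-infinite regular languages described in the paper is not attempted here.

```python
def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Jaccard distance on finite sets: |A Δ B| / |A ∪ B| = 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumed convention: two empty languages are at distance 0.
        return 0.0
    # a ^ b is the symmetric difference (strings in exactly one of the sets).
    return len(a ^ b) / len(union)


# Two hypothetical finite languages over the alphabet {a, b}.
lang_1 = frozenset({"a", "ab"})
lang_2 = frozenset({"ab", "abb"})
print(jaccard_distance(lang_1, lang_2))  # 2 differing strings out of 3 in total
```

The count-based ratio above is undefined when both sets are infinite, which is the gap the paper sets out to address.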
