Universal Dependency Analysis
Abstract
Most data is multi-dimensional. Discovering whether any subset of dimensions, or subspace, of such data is significantly correlated is a core task in data mining. To do so, we require a measure that quantifies how correlated a subspace is. For practical use, such a measure should be universal in the sense that it captures correlation in subspaces of any dimensionality and allows correlation scores to be compared meaningfully across different subspaces, regardless of how many dimensions they have and what specific statistical properties their dimensions possess. Further, the measure should ideally capture both linear and non-linear correlations non-parametrically and efficiently. In this paper, we propose UDS, a multivariate correlation measure that fulfills all of these desiderata. In short, we define UDS based on cumulative entropy and propose a principled normalization scheme that brings its scores across different subspaces into the same domain, enabling universal correlation assessment. UDS is purely non-parametric, as we make no assumptions about data distributions or types of correlation. To compute it on empirical data, we introduce an efficient and non-parametric method. Extensive experiments show that UDS outperforms the state of the art.
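The abstract names cumulative entropy as the building block of UDS but does not spell it out. As a rough illustration only, the sketch below computes the standard empirical cumulative entropy of a one-dimensional sample, h(X) = -∫ P(X ≤ x) log P(X ≤ x) dx, using the empirical distribution function. The function name `cumulative_entropy` is hypothetical, and this is neither the UDS measure itself nor its normalization scheme, which the abstract does not specify.

```python
import math


def cumulative_entropy(samples):
    """Empirical cumulative entropy of a 1-D sample (illustrative sketch).

    Uses the step-shaped empirical CDF: between the i-th and (i+1)-th
    smallest values the CDF equals i/n, so the integral reduces to
    -sum((x[i+1] - x[i]) * (i/n) * log(i/n)) for i = 1..n-1.
    """
    xs = sorted(samples)
    n = len(xs)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(1, n):
        p = i / n                # empirical CDF on [xs[i-1], xs[i])
        gap = xs[i] - xs[i - 1]  # width of that interval
        total -= gap * p * math.log(p)
    return total
```

Unlike differential entropy, this quantity is non-negative and needs no density estimation, which is presumably part of what makes it attractive for a non-parametric measure. It does, however, scale with the spread of the data, which suggests why some normalization is needed before scores from different subspaces can be compared.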
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
A Universal Dependencies Treebank for Marathi
This paper describes the creation of a free and open-source dependency treebank for Marathi, the first open-source treebank for Marathi following the Universal Dependencies (UD) syntactic annotation scheme. In the paper, we describe some of the syntactic and morphological phenomena in the language that required special analysis, and how they fit into the UD guidelines. We also evaluate the parsi...
Developing Universal Dependencies for Mandarin Chinese
This article proposes a Universal Dependency Annotation Scheme for Mandarin Chinese, including POS tags and dependency analysis. We identify cases of idiosyncrasy of Mandarin Chinese that are difficult to fit into the current schema which has mainly been based on the descriptions of various Indo-European languages. We discuss differences between our scheme and those of the Stanford Chinese Depe...
Classifying Languages by Dependency Structure. Typologies of Delexicalized Universal Dependency Treebanks
This paper shows how the current Universal Dependency treebanks can be used for clustering structural global linguistic features of the treebanks to reveal a purely structural syntactic typology of languages. Different uni- and multi-dimensional data extraction methods are explored and tested in order to assess both the coherence of the underlying syntactic data and the quality of the clustering ...
Towards an open-source universal-dependency treebank for Erzya
This article describes the first steps towards an open-source dependency treebank for Erzya based on universal dependency (UD) annotation standards. The treebank contains 610 sentences with 6661 tokens and is based on texts from a range of open-source and public domain original Erzya sources. This ensures its free availability and extensibility. Texts in the treebank are first morphologically an...
A Transition-based System for Universal Dependency Parsing
This paper describes the system submitted by team Wanghao-ftd-SJTU to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In this work, we design a system based on UDPipe for universal dependency parsing, where transition-based models are trained for different treebanks. Our system directly takes raw texts as input, performing several intermedia...