Web-Based Corpus Software

Author

  • Saturnino Luz
Abstract

What is a web-based corpus and what is web-based corpus software? The answer is, strictly speaking, that there is no such thing as web-based corpus software. However, one should not be discouraged by this rather negative assessment. In fact, if one examines the title closely, different bracketings of the phrase might suggest interesting possibilities. For example, if one chooses to write it as ‘(web-based corpus) software’, the emphasis falls on the idea of the World Wide Web as a large corpus. It is, however, a very chaotic one. It is chaotic in the sense that it is difficult for its users to account for and control the sort of phenomena such a large and dynamic repository might reflect when explored, say, through an ordinary search engine. This makes the task of formulating and testing hypotheses extremely difficult. All sorts of ‘noise’ might creep in: there are texts written by native and non-native writers, computer-generated text (e.g. text resulting from the ubiquitous web-page translation services currently on offer), duplication, and other forms of text which do not conform to standard norms. Little, if anything, can be done to guarantee the quality or integrity of the data being used. Still, this chaotic, noisy environment can be of some use to the statistically minded (computational) linguist. To borrow an example from Manning and Schütze (1999), one could use the web to decide which of the following word sequences to treat as a language unit: ‘strong coffee’ or ‘powerful coffee’. A quick search reveals over 30,000 occurrences of ‘strong coffee’ against just over 400 occurrences of ‘powerful coffee’, thus indicating that the former forms a collocation pattern while the latter apparently does not. In contrast, should one wish to write ‘web-based corpus software’ as ‘web-based (corpus software)’, the emphasis clearly falls on ‘corpus software’, of which web-based corpus software would simply be one type.
In other words, one could simply regard the Web as the medium through which better constructed, human-designed corpora can be searched and studied by a large
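
The hit-count comparison described in the abstract (‘strong coffee’ versus ‘powerful coffee’) can be sketched in Python. This is only a minimal illustration, not the author's method: the function name `preferred_collocation`, the `min_ratio` threshold, and the rounded counts (30,000 and 400, standing in for "over 30,000" and "just over 400") are assumptions made for the example.

```python
from typing import Dict, Optional


def preferred_collocation(hit_counts: Dict[str, int],
                          min_ratio: float = 10.0) -> Optional[str]:
    """Return the phrase whose hit count exceeds the runner-up's by at
    least `min_ratio`, or None when no phrase clearly dominates.

    `min_ratio` is a hypothetical threshold chosen for illustration.
    """
    # Rank candidate phrases from most to least frequent.
    ranked = sorted(hit_counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no evidence at all
    if len(ranked) == 1:
        return ranked[0][0]
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    # A runner-up with zero hits is trivially dominated.
    if second_n == 0 or best_n / second_n >= min_ratio:
        return best
    return None


# Illustrative counts only, rounded from the figures quoted in the abstract.
counts = {"strong coffee": 30000, "powerful coffee": 400}
winner = preferred_collocation(counts)
print(winner)
```

Raw web counts of this kind inherit all the noise the abstract warns about (duplicates, machine-generated text), so a ratio test like this is at best a rough heuristic.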

Similar resources

Corporator: A Tool For Creating RSS-Based Specialized Corpora

This paper presents a new approach and a software for collecting specialized corpora on the Web. This approach takes advantage of a very popular XML-based norm used on the Web for sharing content among websites: RSS (Really Simple Syndication). After a brief introduction to RSS, we explain the interest of this type of data sources in the framework of corpus development. Finally, we present Corp...

Web-Based Semantic Similarity: An Evaluation in the Biomedical Domain

Computation of semantic similarity between concepts is a very common problem in many language related tasks and knowledge domains. In the biomedical field, several approaches have been developed to deal with this issue by exploiting the structured knowledge available in domain ontologies (such as SNOMED-CT or MeSH) and specific, closed and reliable corpora (such as clinical data). However, in r...

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

NOMOS: A Semantic Web Software Framework for Annotation of Multimodal Corpora

We present NOMOS, an open-source software framework for annotation, processing, and analysis of multimodal corpora. NOMOS is designed for use by annotators, corpus developers, and corpus consumers, emphasizing configurability for a variety of specific annotation tasks. Its features include synchronized multi-channel audio and video playback, compatibility with several corpora, platform independ...

Slovak National Corpus tools and resources

The article presents current state of affairs in several projects conducted by the Slovak National Corpus department of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We describe the Slovak National Corpus, Corpus of Spoken Slovak, tools used for linguistics analysis and an ongoing effort to create Slovak WordNet. 1 Slovak National Corpus The Slovak National Corpus is a huge...

Publication date: 2011