Derivation in the Czech National Corpus
Authors
Abstract
The aim of this paper is to describe one of the main means of Czech word formation: derivation. New Czech words are created by composition or by derivation, using prefixes or suffixes. The suffixes, which are added to the stem, are used much more frequently than prefixes, which stand before the stem. The most frequent suffixes will be classified according to their paradigmatic and semantic properties and according to the changes they cause in the stem. The research is done on the Czech National Corpus (CNC); the frequencies of the investigated suffixes illustrate their productivity in the present-day Czech language. This research is of particular value for a highly inflected language such as Czech. Possible applications of this system are various NLP systems, e.g. spelling checkers and machine translation systems. The results of this work serve for the computational processing of Czech word formation and, in future, for the creation of the Czech derivational dictionary.
Similar resources
Word-Formation Network for Czech
In the present paper, we describe the development of the lexical network DeriNet, which captures core word-formation relations on the set of around 266 thousand Czech lexemes. The network is currently limited to derivational relations because derivation is the most frequent and most productive word-formation process in Czech. This limitation is reflected in the architecture of the network: each...
Analysis of Czech Web 1T 5-Gram Corpus and Its Comparison with Czech National Corpus Data
In this paper, newly issued Czech Web 1T 5-grams corpus created by Google and LDC is analysed and compared with reference n-gram corpus obtained from Czech National Corpus. Original 5-grams from both corpora were post-processed and statistical trigram language models of various vocabulary sizes and parameters were created. The comparison of various corpus statistics such as unique and total wor...
Balanced corpus of informal spoken Czech: compilation, design and findings
The paper presents ORAL2008, a new 1-million corpus of spoken Czech compiled within the framework of the Czech National Corpus project. ORAL2008 is designed as a representation of authentic spoken language used in informal situations and it is balanced in the main sociolinguistic categories of speakers. The paper concentrates also on the data collection, its broad coverage and the transcription...
Oral2008: New Balanced Corpus of Spoken Czech
Attention paid to spoken language has increased in the last decades, as well as its importance for linguistic research and natural language processing in general. However, compilation of spoken corpora as an indispensable source of data is very laborious and thus expensive. Nevertheless, more and more spoken corpora are being created currently. There are various approaches to their design, dept...
Shallow Parsing of Czech Sentence Based on Correct Morphological Disambiguation
The basis of such an approach is provided by a very complex and sophisticated rule-based morphological disambiguation which can disambiguate Czech sentence with a very high reliability, i.e. with a minimum number of errors. This is, of course, very important for any language and all the more so for Czech whose ambiguity rate is generally extremely high (as compared e.g. to other Slavic language...
Annotating foreign learners’ Czech
One of the challenges of contemporary corpus linguistics is the compilation and annotation of corpora consisting of texts produced by non-native speakers. In addition to morphosyntactic tagging and lemmatisation, such texts can be annotated by information relevant to the specific nonstandard use. Cases of deviant language use can be corrected and identified by a tag specifying the type of the e...