Re-analysis of publicly available datasets
Authors
Lucia Peixoto, Davide Risso, Shane G. Poplawski, Mathieu E. Wimmer, Terence P. Speed, Marcelo A. Wood and Ted Abel
Abstract
We retrieved the pre-processed data of several publicly available studies from GEO (see main text for details). In this section, we plot the PCA of each dataset using the original normalization. Starting from the data as normalized by the authors, or applying upper-quartile (UQ) scaling normalization if the authors provided only raw counts, we apply RUVs using all the genes as negative controls and choosing the value of k that gave the best-looking relative log expression (RLE) plot. For each dataset, we retained only the genes expressed in at least three replicate samples. This analysis is intended to show that published normalized datasets often contain residual unwanted variation, and that RUVs removes unwanted variation when it is present and does not compromise the data when scaling normalization is already working well. A more careful analysis of each dataset, e.g. by selecting a problem-specific set of negative control genes, could lead to better results.
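The filtering, UQ scaling and RLE steps described above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy count matrix and the expression threshold (`min_count=1`) are assumptions made for illustration, and the RUVs step itself is not reproduced here.

```python
import math
import statistics

def filter_expressed(counts, min_samples=3, min_count=1):
    """Keep genes (rows) with at least `min_count` reads in `min_samples` or more samples.
    The `min_count` threshold is an assumption; the abstract does not define 'expressed'."""
    return [row for row in counts
            if sum(1 for c in row if c >= min_count) >= min_samples]

def upper_quartile_scale(counts):
    """Divide each sample (column) by its upper quartile, then rescale
    by the mean upper quartile so values stay on a count-like scale."""
    n_samples = len(counts[0])
    uqs = []
    for j in range(n_samples):
        column = [row[j] for row in counts]
        uqs.append(statistics.quantiles(column, n=4, method="inclusive")[2])
    mean_uq = statistics.mean(uqs)
    return [[row[j] / uqs[j] * mean_uq for j in range(n_samples)]
            for row in counts]

def rle(counts):
    """Relative log expression: log(count + 1) minus the gene's median
    log count across samples."""
    logged = [[math.log(c + 1) for c in row] for row in counts]
    return [[v - statistics.median(row) for v in row] for row in logged]

def rle_sample_medians(counts):
    """Per-sample median of the RLE values; values near zero suggest
    that the samples are comparable after normalization."""
    deviations = rle(counts)
    n_samples = len(counts[0])
    return [statistics.median([row[j] for row in deviations])
            for j in range(n_samples)]

# Hypothetical toy data: rows are genes, columns are samples.
# Samples 2 and 4 were sequenced at twice the depth of samples 1 and 3.
toy_counts = [
    [10, 20, 10, 20],
    [100, 200, 100, 200],
    [0, 0, 5, 0],        # expressed in one sample only, so it is filtered out
    [50, 100, 50, 100],
    [30, 60, 30, 60],
]

filtered = filter_expressed(toy_counts)
scaled = upper_quartile_scale(filtered)
print("RLE medians before scaling:", rle_sample_medians(filtered))
print("RLE medians after scaling: ", rle_sample_medians(scaled))
```

In this toy case the only unwanted variation is sequencing depth, so UQ scaling alone centers the RLE medians at zero. The abstract's point is that real datasets often keep residual structure after such scaling, which is where a factor-based method such as RUVs comes in.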