Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles
Authors
Abstract
Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that is often not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence the reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing the provenance of ML experiments and their reproducibility using Jupyter Notebooks. We also present the ReproduceMeGit tool to analyze the reproducibility of ML pipelines described in Jupyter Notebooks.
Similar resources
Fair Pipelines
This work facilitates ensuring fairness of machine learning in the real world by decoupling fairness considerations in compound decisions. In particular, this work studies how fairness propagates through a compound decision-making process, which we call a pipeline. Prior work in algorithmic fairness only focuses on fairness with respect to one decision. However, many decision-making processes...
Knowledge Provenance in Virtual Observatories: Application to Image Data Pipelines
Scientific data services are increasing in usage and scope, and with these increases comes growing need for access to provenance information. Our goal is to design and implement an extensible provenance solution that is deployed at the science data ingest time. In this paper, we describe our work in the setting of a particular set of data services in the area of solar coronal physics. The paper...
COMPUTING SCIENCE Provenance and data differencing for workflow reproducibility analysis
One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflow, process data to generate results. The reproduction of results is ofte...
Provenance and data differencing for workflow reproducibility analysis
One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflow, process data to generate results. The reproduction of results is ofte...
Strategies and Principles of Distributed Machine Learning on Big Data
The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics (such as high-dimensional latent features, intermediate representations, and decision functions) thereupon. In order to run ML algorithms at such scales, on...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2021
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-030-80960-7_17