Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles


Abstract

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that is often not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence the reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing the provenance of ML experiments and their reproducibility using Jupyter Notebooks. We also present ReproduceMeGit, a tool to analyze the reproducibility of ML pipelines described in Jupyter Notebooks.


Similar resources

Fair Pipelines

This work facilitates ensuring fairness of machine learning in the real world by decoupling fairness considerations in compound decisions. In particular, this work studies how fairness propagates through a compound decision-making process, which we call a pipeline. Prior work in algorithmic fairness only focuses on fairness with respect to one decision. However, many decision-making processes...


Knowledge Provenance in Virtual Observatories: Application to Image Data Pipelines

Scientific data services are increasing in usage and scope, and with these increases comes a growing need for access to provenance information. Our goal is to design and implement an extensible provenance solution that is deployed at the science data ingest time. In this paper, we describe our work in the setting of a particular set of data services in the area of solar coronal physics. The paper...


Provenance and data differencing for workflow reproducibility analysis

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflows, process data to generate results. The reproduction of results is ofte...



Strategies and Principles of Distributed Machine Learning on Big Data

The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics (such as high-dimensional latent features, intermediate representations, and decision functions) thereupon. In order to run ML algorithms at such scales, on...



Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-80960-7_17