apache spark

Search results for: apache spark

Number of results: 18089

A Fast Parallel Random Forest Algorithm Based on Spark

Journal: Applied Sciences 2023

To improve computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to det...
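The binning idea in this abstract can be illustrated with a minimal sketch. It assumes the standard Gini impurity rather than the paper's redefined Gini coefficient, which the truncated abstract does not specify, and it runs on a single machine rather than on Spark. The function names (equal_frequency_split_points, gini_impurity, best_split) are hypothetical and chosen only for illustration.

```python
def equal_frequency_split_points(values, num_bins):
    """Return candidate thresholds that place roughly equal counts in each bin."""
    if num_bins < 2:
        return []
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for k in range(1, num_bins):
        idx = k * n // num_bins  # position of the k-th bin boundary
        if idx == 0 or idx >= n:
            continue
        lo, hi = ordered[idx - 1], ordered[idx]
        if lo == hi:
            continue  # no threshold can separate equal values
        midpoint = (lo + hi) / 2
        if not points or midpoint > points[-1]:
            points.append(midpoint)
    return points


def gini_impurity(labels):
    """Standard Gini impurity (an assumption; not the paper's modified version)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels, num_bins):
    """Evaluate only the binned candidates instead of every distinct value."""
    n = len(values)
    best = None
    for threshold in equal_frequency_split_points(values, num_bins):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
```

With num_bins bins, at most num_bins - 1 thresholds are scored per continuous feature, regardless of how many distinct values the feature has. This is the source of the reduction in Gini calculations that the abstract describes.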

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Journal: CoRR 2016

Alex Gittens, Aditya Devarakonda, Evan Racah, Michael F. Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn J. Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiqu...

Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms

Journal: CoRR 2015

Yuchen Zhang, Michael I. Jordan

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without ...

Cuttlefish: A Lightweight Primitive for Adaptive Query Processing

Journal: CoRR 2018

Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke

Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefis...

Scalable Forecasting Techniques Applied to Big Electricity Time Series

2017

Antonio Galicia, José F. Torres, Francisco Martínez-Álvarez, Alicia Troncoso Lora

This paper presents different scalable methods to predict time series of very long length such as time series with a high sampling frequency. The Apache Spark framework for distributed computing is proposed in order to achieve the scalability of the methods. Namely, the existing MLlib machine learning library from Spark has been used. Since MLlib does not support multivariate regression, the fo...

Digital Forensics Compute Cluster: A High Speed Distributed Computing Capability for Digital Forensics

2017

Daniel Gonzales, Zev Winkelman, Trung Tran, Ricardo Sanchez, Dulani Woods, John Hollywood

We have developed a distributed computing capability, the Digital Forensics Compute Cluster (DFORC2), to speed up the ingestion and processing of digital evidence that is resident on computer hard drives. DFORC2 parallelizes evidence ingestion and file processing steps. It can be run on a standalone computer cluster or in the Amazon Web Services (AWS) cloud. When running in a virtualized computing e...

Performance Benefits of DataMPI: A Case Study with BigDataBench

2014

Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Both are widely deployed at Internet companies. On the other hand, high-performance data analysis requirements are causing academic and industrial communities to adopt state-of-the-art technologies in HPC to solve Big Data problems. Recently, we have proposed a key-value pair based communication libra...

Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache Spark

A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Effective Prediction of Missing Data on Apache Spark over Multivariable Time Series