Search results for: apache spark

Number of results: 18089

Journal: Applied Sciences 2023

To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher accuracy. Next, to reduce the number of candidate split points and calculations for continuous features, an approximate equal-frequency binning method det...

Journal: CoRR 2016
Alex Gittens, Aditya Devarakonda, Evan Racah, Michael F. Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn J. Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiqu...

Journal: CoRR 2015
Yuchen Zhang, Michael I. Jordan

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without ...

Journal: CoRR 2018
Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke

Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefis...

2017
Antonio Galicia, José F. Torres, Francisco Martínez-Álvarez, Alicia Troncoso Lora

This paper presents different scalable methods to predict time series of very long length such as time series with a high sampling frequency. The Apache Spark framework for distributed computing is proposed in order to achieve the scalability of the methods. Namely, the existing MLlib machine learning library from Spark has been used. Since MLlib does not support multivariate regression, the fo...

2017
Daniel Gonzales, Zev Winkelman, Trung Tran, Ricardo Sanchez, Dulani Woods, John Hollywood

We have developed a distributed computing capability, Digital Forensics Compute Cluster (DFORC2) to speed up the ingestion and processing of digital evidence that is resident on computer hard drives. DFORC2 parallelizes evidence ingestion and file processing steps. It can be run on a standalone computer cluster or in the Amazon Web Services (AWS) cloud. When running in a virtualized computing e...

2014
Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Both are widely deployed at Internet companies. On the other hand, high-performance data analysis requirements are causing academic and industrial communities to adopt state-of-the-art technologies in HPC to solve Big Data problems. Recently, we have proposed a key-value pair based communication libra...
