ParSA: High-throughput scientific data analysis framework with distributed file system

Authors

  • Tao Zhang
  • Xiangzheng Sun
  • Wei Xue
  • Nan Qiao
  • Huang Huang
  • Jiwu Shu
  • Weimin Zheng
Abstract

Scientific data analysis and visualization have become key components of today's large-scale simulations. Because of rapidly increasing data volumes and the awkward I/O patterns of highly structured files, existing serial methods and tools do not scale well and usually perform poorly on traditional architectures. In this paper, we propose a new framework, ParSA (parallel scientific data analysis), for high-throughput and scalable scientific analysis on a distributed file system. ParSA presents optimization strategies for grouping and splitting logical units to exploit the distributed I/O of the file system, for scheduling the distribution of block replicas to reduce network reads, and for maximizing the overlap of data reading, processing, and transferring during computation. In addition, ParSA provides interfaces similar to those of the NetCDF Operator (NCO), which is used in most climate data diagnostic packages, making the framework easy to adopt. We use ParSA to accelerate well-known analysis methods for climate models on the Hadoop Distributed File System (HDFS). Experimental results demonstrate the high efficiency and scalability of ParSA, which reaches a maximum throughput of 1.3 GB/s on a six-node Hadoop cluster with five disks per node. Yet, it can only get 392 MB/s throughput on a RAID-6 storage
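
The overlap of data reading and processing mentioned in the abstract can be illustrated with a small producer-consumer pipeline. This is only a minimal sketch under stated assumptions, not ParSA's implementation: the names `read_blocks`, `process_block`, and `run_pipeline` are hypothetical, the "blocks" are in-memory lists standing in for file-system blocks, and the per-block mean stands in for a real analysis operator.

```python
import queue
import threading


def read_blocks(blocks, out_q):
    """Reader stage (hypothetical): hand each block to the queue in order."""
    for block in blocks:
        out_q.put(block)  # waits while the queue is full (back-pressure)
    out_q.put(None)       # sentinel: no more blocks


def process_block(block):
    """Hypothetical analysis step: the mean of the values in one block."""
    return sum(block) / len(block)


def run_pipeline(blocks, depth=2):
    """Process blocks while a background thread keeps reading ahead.

    `depth` bounds how many blocks may be buffered between the stages.
    """
    q = queue.Queue(maxsize=depth)
    reader = threading.Thread(target=read_blocks, args=(blocks, q))
    reader.start()

    results = []
    while True:
        block = q.get()
        if block is None:  # reader signalled the end of the input
            break
        results.append(process_block(block))

    reader.join()
    return results


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(run_pipeline(data))  # [2.0, 4.5, 6.0]
```

The bounded queue is the design choice that matters here: the reader can run ahead of the consumer by at most `depth` blocks, so reading and processing overlap without unbounded memory growth.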

Similar resources

Biomolecular committor probability calculation enabled by processing in network storage

Computationally complex and data intensive atomic scale biomolecular simulation is enabled via processing in network storage (PINS): a novel distributed system framework to overcome bandwidth, compute, storage, organizational, and security challenges inherent to the wide-area computation and storage grid. PINS is presented as an effective and scalable scientific simulation framework to meet the...

HADOOP: A Framework for Distributed Computing

With data growing so rapidly and the rise of unstructured data accounting for about 90% of the data today, the time has come for enterprises to re-evaluate their approach to data storage, management and its analysis. This enormously growing data has been given the name Big Data. Hadoop platform has been designed to tackle the problems associated with handling such an enormous data-that doe...

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems

Existing file systems, even the most scalable systems that store hundreds of petabytes (or more) of data across thousands of machines, store file metadata on a single server or via a shared-disk architecture in order to ensure consistency and validity of the metadata. This paper describes a completely different approach for the design of replicated, scalable file systems, which leverages a high...

Towards a Next Generation Distributed Middleware System for Many-Task Computing

Distributed computing systems have evolved over decades to support various types of scientific applications and overall computing paradigms have been categorized into HTC (High-Throughput Computing) to support bags of tasks which are usually long running, HPC (High-Performance Computing) for processing tightly-coupled communication-intensive tasks on top of dedicated clusters of workstations or...

Journal title:
  • Future Generation Comp. Syst.

Volume 51

Publication date: 2015