Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring

Authors

  • Tao Lu
  • Eric Suchyta
  • Jong Youl Choi
  • Norbert Podhorszki
  • Scott Klasky
  • Qing Liu
  • David Pugmire
  • Matthew Wolf
  • Mark Ainsworth

Abstract

High-accuracy scientific simulations on high-performance computing (HPC) platforms generate large amounts of data. To allow the data to be analyzed efficiently, simulation outputs need to be refactored, compressed, and properly mapped onto storage tiers. This paper presents Canopus, a progressive data management framework for storing and analyzing big scientific data. Canopus allows simulation results to be refactored into a much smaller base dataset along with a series of deltas at fairly low overhead. The refactored data are then compressed, mapped, and written onto storage tiers. For data analytics, the refactored data are selectively retrieved to restore the data at a specific level of accuracy that satisfies the analysis requirements. Canopus enables end users to trade off analysis speed against accuracy on the fly. Canopus is demonstrated and thoroughly evaluated using blob detection on fusion simulation data.
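
To make the base-plus-deltas idea concrete, below is a minimal Python/NumPy sketch of progressive refactoring on a 1-D array. It is an illustrative assumption, not the Canopus implementation: the actual framework refactors multi-dimensional simulation meshes and integrates compression and storage-tier mapping, all of which this sketch omits, and the names refactor, restore, and accuracy_level are invented for the example.

```python
import numpy as np

def refactor(data, levels):
    """Refactor full-accuracy data into a coarse base plus per-level deltas.

    Each level halves the resolution (power-of-two sizes assumed here).
    The delta at each level is the difference between the true values and
    a linear-interpolation prediction from the coarser level, so deltas
    stay small and compress well.
    """
    deltas = []
    current = data
    for _ in range(levels):
        coarse = current[::2]                         # decimate by 2
        predicted = np.interp(np.arange(current.size),
                              np.arange(current.size)[::2], coarse)
        deltas.append(current - predicted)            # store the correction
        current = coarse
    return current, deltas[::-1]                      # base + deltas, coarsest first

def restore(base, deltas, accuracy_level):
    """Progressively restore data up to the requested accuracy level.

    accuracy_level = 0 returns the base; each additional level fetches one
    more delta (e.g., from a slower storage tier) and refines the result.
    """
    current = base
    for delta in deltas[:accuracy_level]:
        n = delta.size
        predicted = np.interp(np.arange(n), np.arange(n)[::2], current)
        current = predicted + delta
    return current

# Example: trade accuracy for retrieval cost on the fly.
full = np.sin(np.linspace(0, 8 * np.pi, 1024))
base, deltas = refactor(full, levels=3)
coarse_view = restore(base, deltas, accuracy_level=1)   # fast, approximate
exact_view = restore(base, deltas, accuracy_level=3)    # full accuracy
assert np.allclose(exact_view, full)
```

Because each delta is merely the correction to a prediction from the coarser level, the deltas are highly compressible; placing the compact base on a fast tier and the deltas on slower tiers lets an analysis start from a cheap approximation and refine only as far as its accuracy requirement demands.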

Related articles

HPC and Big Data Convergence for Extreme Heterogeneous Systems

As the data deluge grows ever greater, large-scale data analytics workloads are quickly becoming critical computational tools within the scientific community. Recently, convergence efforts have focused on combining aspects of HPC and "big data" analytics workloads on a unified supercomputing system. This offers the opportunity to bring advanced analytical tools to scientists which enable ...

The Need for Resilience Research in Workflows of Big Compute and Big Data Scientific Applications

Projections and reports about exascale failure modes conclude that we need to protect numerical simulations and data analytics from an increasing risk of hardware and software failures and silent data corruptions (SDC) [1, 4]. At this scale, hardware and software failures could be as frequent as ten or more per day. According to [9], the semiconductor industry will have increased difficulty pre...

Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms

The Berkeley Data Analytics Stack (BDAS) is an emerging framework for big data analytics. It consists of the Spark analytics framework, the Tachyon in-memory filesystem, and the Mesos cluster manager. Spark was designed as an in-memory replacement for Hadoop that can in some cases improve performance by up to 100X. In this paper, we describe our experiences running BDAS on the new Cray Urika-XA...
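
To illustrate the in-memory reuse behind that speedup claim, here is a minimal, hypothetical PySpark sketch; the input path, application name, and filter predicate are invented for the example, and this is not code from the BDAS work itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdas-sketch").getOrCreate()

# Load once and cache in memory; subsequent actions reuse the cached
# partitions instead of re-reading from disk as a disk-based MapReduce
# pipeline would on every pass.
lines = spark.sparkContext.textFile("hdfs:///data/sim_output.txt").cache()

# Two passes over the same data both hit the in-memory copy.
total = lines.count()
errors = lines.filter(lambda l: "ERROR" in l).count()
print(total, errors)
```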

Outlook on moving of computing services towards the data sources

The Internet of Things (IoT) potentially interconnects unprecedented amounts of raw data, opening countless possibilities through two main logical layers: turning data into information, then turning information into knowledge. The former is about filtering out what is significant and presenting it in the appropriate format, while the latter provides emerging categories of the whole domain. This path of the data is a bottom...

Mero: Co-Designing an Object Store for Extreme Scale

Within the HPC community, there is consensus that exascale computing will be plagued by issues related to data I/O performance and data storage infrastructure reliability, caused primarily by the growing gap between compute and storage performance and the ever-increasing volumes of data generated by scientific simulations, instruments, and sensors. The architectural assumptions for extreme co...

Publication year: 2017