Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring
Authors
Abstract
High-accuracy scientific simulations on high performance computing (HPC) platforms generate large amounts of data. To allow data to be efficiently analyzed, simulation outputs need to be refactored, compressed, and properly mapped onto storage tiers. This paper presents Canopus, a progressive data management framework for storing and analyzing big scientific data. Canopus allows simulation results to be refactored, with fairly low overhead, into a much smaller base dataset along with a series of deltas. The refactored data are then compressed, mapped, and written onto storage tiers. For data analytics, the refactored data are selectively retrieved to restore data at a specific level of accuracy that satisfies the analysis requirements. Canopus enables end users to trade off analysis speed against accuracy on-the-fly. Canopus is demonstrated and thoroughly evaluated using blob detection on fusion simulation data.
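The base-plus-deltas idea in the abstract can be illustrated with a minimal sketch. This is not Canopus's actual refactoring algorithm (the paper's decimation and compression pipeline is more involved); it is a toy scheme, assuming simple stride-2 decimation with nearest-neighbor upsampling, where each delta records the information lost at one level and applying more deltas restores higher accuracy:

```python
import numpy as np

def refactor(data, levels):
    """Split `data` into a small coarse base plus one delta per level.
    Each level decimates by 2; the delta stores what the cheap
    upsample misses, so full reconstruction is exact."""
    deltas = []
    current = np.asarray(data, dtype=float)
    for _ in range(levels):
        coarse = current[::2]                            # keep every other sample
        approx = np.repeat(coarse, 2)[:len(current)]     # nearest-neighbor upsample
        deltas.append(current - approx)                  # information lost here
        current = coarse
    return current, deltas                               # base + series of deltas

def restore(base, deltas, accuracy_levels):
    """Apply the `accuracy_levels` coarsest deltas to the base.
    Fewer deltas -> less data retrieved, lower accuracy."""
    current = base
    for delta in deltas[::-1][:accuracy_levels]:         # coarsest delta first
        current = np.repeat(current, 2)[:len(delta)] + delta
    return current

base, deltas = refactor(np.arange(8.0), levels=2)
full = restore(base, deltas, accuracy_levels=2)          # exact reconstruction
partial = restore(base, deltas, accuracy_levels=1)       # coarser, faster
```

In this sketch, an analysis that tolerates lower accuracy reads only the base and the coarsest deltas from fast storage, mirroring the speed/accuracy trade-off described above.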
Related resources
HPC and Big Data Convergence for Extreme Heterogeneous Systems
As the data deluge grows ever greater, large-scale data analytics workloads are quickly becoming critical computational tools within the scientific community. Recently, convergence efforts have focused on combining aspects of HPC and "big data" analytics workloads on a unified supercomputing system. This presents an opportunity to bring advanced analytical tools to scientists which enable ...
The Need for Resilience Research in Workflows of Big Compute and Big Data Scientific Applications
Projections and reports about exascale failure modes conclude that we need to protect numerical simulations and data analytics from an increasing risk of hardware and software failures and silent data corruptions (SDC) [1, 4]. At this scale, hardware and software failures could be as frequent as ten or more per day. According to [9], the semiconductor industry will have increased difficulty pre...
Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms
The Berkeley Data Analytics Stack (BDAS) is an emerging framework for big data analytics. It consists of the Spark analytics framework, the Tachyon in-memory filesystem, and the Mesos cluster manager. Spark was designed as an in-memory replacement for Hadoop that can in some cases improve performance by up to 100X. In this paper, we describe our experiences running BDAS on the new Cray Urika-XA...
Outlook on Moving of Computing Services towards the Data Sources
The Internet of things (IoT) is potentially interconnecting unprecedented amounts of raw data, opening countless possibilities through two main logical layers: turning data into information, then turning information into knowledge. The former is about filtering out what is significant and presenting it in the appropriate format, while the latter derives emerging categories of the whole domain. This path of the data is a bottom...
Mero: Co-Designing an Object Store for Extreme Scale
Within the HPC community, there is consensus that Exascale computing will be plagued with issues related to data I/O performance and data storage infrastructure reliability, caused primarily by the growing gap between compute and storage performance, and the ever increasing volumes of data generated by scientific simulations, instruments and sensors. The architectural assumptions for extreme co...
Published: 2017