Starfish: A Self-tuning System for Big Data Analytics
Authors
Abstract
Modern industrial, government, and academic organizations are collecting massive amounts of data (“big data”) at an unprecedented scale and pace. The ability to perform timely and cost-effective analytical processing of such large datasets to extract deep insights is now a key ingredient for success. These insights can drive automated processes for advertisement placement, improve customer relationship management, and lead to major scientific breakthroughs. Existing database systems are adapting to the new status quo while large-scale dataflow systems (like Dryad and MapReduce) are becoming popular for analytical workloads on big data. My research interests are in ease-of-use, manageability, and automated tuning of such large-scale data processing systems. Ensuring good and robust system performance poses several new challenges. First, workloads are now analyzing big data consisting of a hybrid mix of structured and unstructured datasets stored in nontraditional data layouts. The structure and properties of the data may not be known initially, and will evolve over time. Second, complex analysis techniques and rapid development needs often require the use of both declarative and procedural programming languages. Finally, the space of tuning choices is extremely high-dimensional, with choices ranging from various workload configuration settings to cluster provisioning and data layouts.
My research work involves (1) exploring novel optimization opportunities in the MapReduce platform that range from the job level to the workload level, while considering factors like scheduling, data layouts and provisioning; (2) exploiting new data layouts and partitioning for improving performance and system manageability in both database and dataflow systems; (3) introducing a SQL-tuning-aware query optimizer that is capable of improving on current query plans by executing some subplans proactively, collecting monitoring data from the runs, and iterating; and (4) using database-style optimization strategies to enable I/O-efficient statistical computing.
Similar resources
MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish
MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical succes...
Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions
The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...
PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs
The MapReduce programming model has become widely adopted for large scale analytics on big data. MapReduce systems such as Hadoop have many tuning parameters, many of which have a significant impact on performance. The map and reduce functions that make up a MapReduce job are developed using arbitrary programming constructs, which make them black-box in nature and therefore renders it difficult...
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
Application of Big Data Analytics in Power Distribution Network
Smart grid enhances optimization in generation, distribution and consumption of the electricity by integrating information and communication technologies into the grid. Today, utilities are moving towards smart grid applications, most common one being deployment of smart meters in advanced metering infrastructure, and the first technical challenge they face is the huge volume of data generated ...