Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures
Authors
Abstract
Online service failures in production computing environments are notoriously difficult to debug. When such a failure occurs, the software developer often has little information to work with. In this paper, we present Insight, a system that reproduces the execution path of a failed service request onsite, immediately after the failure is detected. Once a request failure is detected, Insight dynamically creates a shadow copy of the production server and performs guided binary execution exploration on the shadow node to learn how the failure occurs. Insight leverages both environment data (e.g., input logs, configuration files, states of interacting components) and runtime outputs (e.g., console logs, system calls) to guide the failure path search. Insight does not require source code access or any special system recording during normal production runs. We have implemented Insight and evaluated it using 13 failures from a production cloud management system and 8 open source software systems. The experimental results show that Insight can successfully find high-fidelity failure paths within a few minutes. Insight is lightweight and unobtrusive, making it practical for online service failure inference in production computing environments.
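The idea of using runtime outputs to guide path exploration can be illustrated with a small toy sketch. This is not Insight's implementation: the abstract does not describe its algorithm in detail, and everything below (the `toy_service` model, the `infer_failure_path` function, the log strings) is hypothetical. The sketch assumes a program can be replayed under a chosen list of branch decisions, and it prunes any partial path whose emitted outputs diverge from the outputs observed in production.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical program model: given branch decisions, return the outputs
# emitted so far and whether the run finished (False = needs another decision).
Program = Callable[[List[bool]], Tuple[List[str], bool]]


def toy_service(decisions: List[bool]) -> Tuple[List[str], bool]:
    """A made-up request handler with two branches."""
    out = ["open(config)"]
    if len(decisions) < 1:
        return out, False
    if decisions[0]:
        out.append("log: cache hit")
    else:
        out.append("log: cache miss")
        out.append("connect(db)")
    if len(decisions) < 2:
        return out, False
    if decisions[1]:
        out.append("log: ERROR timeout")
    else:
        out.append("log: request ok")
    return out, True


def infer_failure_path(program: Program, observed: List[str],
                       max_depth: int = 16) -> Optional[List[bool]]:
    """Depth-first search over branch decisions, pruned by observed outputs."""
    stack: List[List[bool]] = [[]]
    while stack:
        decisions = stack.pop()
        outputs, finished = program(decisions)
        # Prune paths whose outputs are not a prefix of what was observed.
        if outputs != observed[:len(outputs)]:
            continue
        if finished:
            if outputs == observed:
                return decisions
            continue
        if len(decisions) < max_depth:
            stack.append(decisions + [False])
            stack.append(decisions + [True])
    return None


# Outputs assumed to have been collected from the failed production request.
observed = ["open(config)", "log: cache miss", "connect(db)",
            "log: ERROR timeout"]
print(infer_failure_path(toy_service, observed))
```

In this toy model the "cache hit" branch is discarded as soon as its log line disagrees with the observed one, so only paths consistent with the production outputs are explored further.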
Similar resources
A Framework for Evaluating Cloud Computing User’s Satisfaction in Information Technology Management
Cloud computing is a new topic in enterprise IT. It has already become popular as a distributed technology in some companies. It enables managers to set up and run their intended businesses while avoiding excessive spending on computers, software, and expert staff, which proves to be cost effective. Cloud computing also helps users pay for the IT services without spending massive am...
Improving Mobile Grid Performance Using Fuzzy Job Replica Count Determiner
Grid computing is a term referring to the combination of computer resources from multiple administrative domains to reach a common computational platform. Mobile computing is a generic term for the use of portable, handheld devices with wireless communication for processing data. Mobile computing focuses on providing access to data, information, services and communications anywhere an...
Automatic Failure-Path Inference: A Generic Introspection Technique for Internet Applications
Automatic Failure-Path Inference (AFPI) is an application-generic, automatic technique for dynamically discovering the failure dependency graphs of componentized Internet applications. AFPI’s first phase is invasive, and relies on controlled fault injection to determine failure propagation; this phase requires no a priori knowledge of the application and takes on the order of hours to run. Once...
Investigating the Effects of Cloud Computing on E-Learning
In the world of training, online training is introduced as a modern model of training services on the web. Cloud computing is a modern technology that provides software, infrastructure, and platforms over the internet. In this research, the impact of cloud computing on e-learning on the case of Mehralborz online university ...
Publication date: 2014