José Pergentino de Araujo Neto
Cloud data centers, recognizing that a significant share of their resources sits unused, have started offering that capacity as transient resources subject to unpredictable, irreversible revocation. Using transient resources still poses critical challenges, including security, strong and reliable connectivity, and fault tolerance. To use transient cloud servers effectively to fulfill user requests, it is necessary to define an appropriate fault-tolerance mechanism and its parameters to avoid data loss if an unexpected failure occurs. We present BRA2Cloud, an agent-based framework for integrating bag-of-tasks-enabled systems that run on unreliable transient resources. Guaranteeing application execution and making better use of idle resources requires an execution plan whose fault-tolerance definitions increase reliability. To this end, BRA2Cloud agents combine failure-prediction features in a multi-agent architecture that dynamically creates fault-tolerant multi-strategies, taking the current availability scenario into account and providing a resilient environment according to the needs of users’ applications. Our approach was validated using real data retrieved between 2017 and 2019 from Amazon Spot Instances. Exhaustive experiments achieved high accuracy, reaching a 91% survival prediction success rate, which indicates that the model is effective under realistic working conditions. We consider the results promising: total execution time decreased by up to 74.48% compared with other approaches in the literature. As the main requirements of our proposal, we defined a series of features that BRA2Cloud should have in order to address the impact of these definitions on resiliency provision, application execution time reduction, and monetary cost reduction.