Building toward more autonomous and proactive cloud technologies with AI

Cloud Intelligence/AIOps blog series

In the first blog post in this series, Cloud Intelligence/AIOps – Infusing AI into Cloud Computing Systems, we presented a brief overview of Microsoft’s research on Cloud Intelligence/AIOps (AIOps), which innovates AI and machine learning (ML) technologies to help design, build, and operate complex cloud platforms and services effectively and efficiently at scale. As cloud computing platforms have continued to emerge as one of the fundamental infrastructures of our world, both their scale and complexity have grown considerably. In our previous blog post, we discussed the three major pillars of AIOps research: AI for Systems, AI for Customers, and AI for DevOps, as well as the four major research areas that constitute the AIOps problem space: detection, diagnosis, prediction, and optimization. We also envisioned the AIOps research roadmap as building toward more autonomous, proactive, manageable, and comprehensive cloud platforms.

Vision of AIOps Research

  • Autonomous: Fully automate the operation of cloud systems to minimize system downtime and reduce manual effort.
  • Proactive: Predict the future status of the cloud, support proactive decision-making, and prevent problems before they happen.
  • Manageable: Introduce the notion of tiered autonomy to combine autonomous routine operations with deep human expertise.
  • Comprehensive: Extend AIOps to the full cloud stack for global optimization/management and to multi-cloud environments.

Starting with this blog post, we’ll take a deeper dive into Microsoft’s vision for AIOps research and our continuing efforts to realize that vision. This blog post focuses on how our researchers have leveraged state-of-the-art AIOps research to help make cloud technologies more autonomous and proactive. We’ll discuss our work to make the cloud more manageable and comprehensive in future blog posts.

Autonomous cloud

Motivation

Cloud platforms require numerous actions and decisions every second to ensure that computing resources are properly managed and failures are promptly addressed. In practice, those actions and decisions are either generated by rule-based systems built upon expert knowledge or made manually by experienced engineers. However, as cloud platforms continue to grow in both scale and complexity, it is clear that such solutions will be insufficient for the future cloud system. On one hand, rigid rule-based systems, while knowledge empowered, often involve huge numbers of rules and require frequent maintenance for better coverage and flexibility. In practice, it is often unrealistic to keep such systems up to date as cloud systems expand in both size and complexity, and even harder to ensure consistency and avoid conflicts among all the rules. On the other hand, manual engineering efforts are time-consuming, prone to errors, and difficult to scale.

To break through the limitations on the coverage and scalability of existing solutions and improve the adaptability and manageability of decision-making systems, cloud platforms must shift toward a more autonomous management paradigm. Instead of relying solely on expert knowledge, we need suitable AI/ML models to fuse operational data and expert knowledge together to enable efficient, reliable, and autonomous management decisions. However, it will take significant research and engineering effort to overcome the various barriers to developing and deploying autonomous solutions on cloud platforms.

Toward an autonomous cloud

In the journey toward an autonomous cloud, there are two major challenges. The first lies in the heterogeneity of cloud data. In practice, cloud platforms deploy a huge variety of monitors to collect data in various formats, including telemetry signals, machine-generated log files, and human input from engineers and users. The patterns and distributions of these data generally exhibit a high degree of diversity and are subject to change over time. To ensure that the adopted AIOps solutions can function autonomously in such an environment, it is essential to empower the management system with robust and extensible AI/ML models capable of learning useful information from heterogeneous data sources and drawing the right conclusions in various scenarios.

The complex interaction between different components and services presents another major challenge in deploying autonomous solutions. While it can be easy to implement autonomous features for one or a few components/services, building end-to-end systems capable of automatically navigating the complex dependencies in cloud systems presents the true challenge for both researchers and engineers. To address this challenge, it is important to leverage both domain knowledge and data to optimize the automation paths in application scenarios. Researchers and engineers should also implement reliable decision-making algorithms in every decision stage to improve the efficiency and stability of the whole end-to-end decision-making process.

Over the past few years, Microsoft research groups have developed many new models and methods for overcoming those challenges and improving the level of automation in various cloud application scenarios across the AIOps problem spaces. Notable examples include:

  • Detection: Gandalf and ATAD for the early detection of problematic deployments; HALO for hierarchical fault localization; and Onion for detecting incident-indicating logs.
  • Diagnosis: SPINE and UniParser for log parsing; Logic and Warden for regression and incident diagnosis; and CONAN for batch failure diagnosis.
  • Prediction: TTMPred for predicting time to mitigate incidents; LCS for predicting the low-capacity status in cloud servers; and Eviction Prediction for predicting the eviction of spot virtual machines.
  • Optimization: MLPS for optimizing the reallocation of containers; and RESIN for the management of memory leaks in cloud infrastructure.

These solutions not only improve service efficiency and reduce management time with a more autonomous design, but also result in better performance and reliability with fewer human errors. As an illustration of our work toward a more autonomous cloud, we discuss below our exploration of supporting automatic safe deployment services.

Exemplary scenario: Automatic safe deployment

In online services, the continuous integration and continuous deployment (CI/CD) of new patches and builds are critical for the timely delivery of bug fixes and feature updates. Because new deployments with undetected bugs or incompatibility issues can cause severe service outages and create significant customer impact, cloud platforms implement strict safe-deployment procedures before releasing each new deployment to the production environment. Such procedures typically involve multi-stage testing and verification in a sequence of canary environments with increasing scopes. When a deployment-related anomaly is identified in one of these stages, the responsible deployment is rolled back for further diagnosis and fixing. Owing to the challenges of identifying deployment-related anomalies with heterogeneous patterns and managing an enormous number of deployments, safe-deployment systems administered manually can be extremely costly and error prone.
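To make the staged rollout concrete, here is a minimal sketch of such a procedure. The stage names and the injected helper functions (roll_out, collect_signals, detect_anomaly, roll_back) are hypothetical placeholders, not any actual Azure pipeline.

```python
# Minimal sketch of a staged safe-deployment loop.
# Stage names and the injected helper functions are hypothetical placeholders.
CANARY_STAGES = ["internal-ring", "canary-region", "pilot-regions", "production"]

def run_safe_deployment(build, roll_out, collect_signals, detect_anomaly, roll_back):
    """Promote a build through canary stages; stop and roll back on a deployment-related anomaly."""
    for stage in CANARY_STAGES:
        roll_out(build, stage)                   # release the build to the current, limited scope
        signals = collect_signals(stage)         # gather telemetry and failure signals for this scope
        if detect_anomaly(signals, build):       # anomaly attributed to this deployment?
            roll_back(build, stage)              # undo the release and hand off for diagnosis
            return f"rolled back at {stage}"
    return "fully deployed"
```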

To support automatic and reliable anomaly detection in safe deployment, we proposed a general methodology named ATAD for the effective detection of deployment-related anomalies in time-series signals. This method addresses the challenges of capturing changes with various patterns in time-series signals and the lack of labeled anomaly samples due to the heavy cost of labeling. Specifically, it combines ideas from both transfer learning and active learning to make good use of the temporal information in the input signal and reduce the number of labeled samples required for model training. Our experiments have shown that ATAD can outperform other state-of-the-art anomaly detection approaches, even with just 1%-5% of labeled data.
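The general recipe of combining transfer from already-labeled data with active querying of a few new labels can be sketched as follows. This is a simplified illustration of the idea using scikit-learn-style components, not the actual ATAD implementation; the query_label callback and the uncertainty-sampling heuristic are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_with_transfer_and_active_learning(source_X, source_y, target_X, query_label, budget=20):
    """Sketch: warm-start on labeled data from related signals (transfer), then spend a small
    labeling budget on the most uncertain windows of the new signal (active learning)."""
    model = RandomForestClassifier(n_estimators=100)
    model.fit(source_X, source_y)                          # reuse knowledge from the labeled source domain

    labeled_X, labeled_y, queried = list(source_X), list(source_y), set()
    for _ in range(budget):
        proba = model.predict_proba(target_X)[:, 1]        # P(anomaly) for each target window
        proba[list(queried)] = 2.0                         # never re-query the same sample
        idx = int(np.argmin(np.abs(proba - 0.5)))          # most uncertain remaining sample
        queried.add(idx)
        labeled_X.append(target_X[idx])
        labeled_y.append(query_label(idx))                 # ask an engineer for this single label
        model.fit(np.array(labeled_X), np.array(labeled_y))
    return model
```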

At the same time, we collaborated with product teams in Azure to develop and deploy Gandalf, an end-to-end automatic safe-deployment system that reduces deployment time and increases the accuracy of detecting bad deployments in Azure. As a data-driven system, Gandalf monitors a large array of data, including performance metrics, failure signals, and deployment records. It also detects anomalies in various patterns throughout the whole safe-deployment process. After detecting anomalies, Gandalf applies a vote-veto mechanism to reliably determine whether each detected anomaly is attributable to a specific new deployment. Gandalf then automatically decides whether the relevant new deployment should be stopped for a fix or whether it is safe enough to proceed to the next stage. Since rolling out in Azure, Gandalf has been effective at helping to capture bad deployments, achieving more than 90% precision and near 100% recall in production over a period of 18 months.
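At a high level, the vote-veto idea can be pictured as follows: nodes that carry the new build and show the anomaly "vote" to blame the deployment, while anomalies observed where the build is absent act as a "veto." The sketch below is our own simplified rendering with made-up thresholds, not Gandalf's production logic.

```python
def attribute_to_deployment(anomalous_nodes, deployed_nodes, vote_threshold=0.6, veto_threshold=0.1):
    """Decide whether an anomaly should be blamed on a new deployment.

    Anomalies on nodes carrying the new build count as votes for blaming it;
    anomalies on nodes *without* the new build act as a veto signal."""
    deployed = set(deployed_nodes)
    anomalous = set(anomalous_nodes)

    votes = len(anomalous & deployed) / max(len(deployed), 1)    # anomaly rate where the build landed
    veto = len(anomalous - deployed) / max(len(anomalous), 1)    # share of anomalies unrelated to the build

    if votes >= vote_threshold and veto <= veto_threshold:
        return "stop-and-rollback"
    return "proceed-to-next-stage"
```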

Flow of the automatic safe deployment system

Proactive cloud

Motivation

Traditional decision-making in the cloud focuses on optimizing immediate resource usage and addressing emerging issues. While this reactive design is not unreasonable in a relatively static system, it can lead to short-sighted decisions in a dynamic environment. In cloud platforms, both the demand for and utilization of computing resources undergo constant change, including regular periodic patterns, unexpected spikes, and gradual shifts in both temporal and spatial dimensions. To improve the long-term efficiency and reliability of cloud platforms, it is critical to adopt a proactive design that takes the future status of the system into consideration in the decision-making process.

A proactive design leverages data-driven models to predict the future status of cloud platforms and enable downstream proactive decision-making. Conceptually, a typical proactive decision-making system consists of two modules: a prediction module and a decision-making module, as shown in the following diagram.

Diagram: the prediction module and decision-making module of a proactive decision-making system on the cloud platform

In the prediction module, historical data are collected and processed for training and fine-tuning the prediction model before deployment. The deployed prediction model takes in the online data stream and generates prediction results in real time. In the decision-making module, both the current system status and the predicted system status, together with other information such as domain knowledge and past decision history, are taken into account when making decisions that balance present and future benefits.
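In code, this two-module structure might look roughly like the skeleton below; the class and method names are illustrative only, and the prediction model and decision policy stand in for whatever forecaster and optimizer a given scenario uses.

```python
from dataclasses import dataclass

@dataclass
class ProactiveDecisionSystem:
    """Skeleton of the two-module design: a prediction model feeds a decision policy."""
    prediction_model: object      # trained and fine-tuned offline on historical data, e.g. a demand forecaster
    decision_policy: object       # maps (current state, predicted state, context) to an action

    def decide(self, current_state, online_stream, context):
        predicted_state = self.prediction_model.predict(online_stream)   # real-time forecast of future status
        # Balance present and future benefits using domain knowledge and past decisions (context).
        return self.decision_policy.choose(current_state, predicted_state, context)
```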

Toward proactive design

Proactive design, while creating new opportunities for improving the long-term efficiency and reliability of cloud systems, does expose the decision-making process to additional risks. On one hand, because of the inherent randomness in the daily operation of cloud platforms, proactive decisions are always subject to the uncertainty arising from stochastic elements in both the running systems and their environments. On the other hand, the reliability of prediction models adds another layer of risk to making proactive decisions. Therefore, to guarantee the performance of a proactive design, engineers must put mechanisms in place to address those risks.

To manage uncertainty risk, engineers need to reformulate decision-making in proactive design to account for the uncertain elements. They can often use methodological frameworks, such as prediction+optimization and optimization under chance constraints, to incorporate uncertainties into the objective functions of optimization problems. Well-designed ML/AI models can also learn uncertainty from data to harden proactive decisions against uncertain elements. As for risks related to the prediction model, modules for improving data quality, including quality-aware feature engineering, robust data imputation, and data rebalancing, should be applied to reduce prediction errors. Engineers should also make continuous efforts to improve and update the robustness of prediction models. Furthermore, safeguarding mechanisms are essential to prevent decisions that may cause harm to the cloud system.
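As a generic illustration (not the exact formulation used in any specific system mentioned in this post), optimization under chance constraints replaces a hard constraint with a probabilistic one over an uncertain quantity ξ, such as the future demand produced by the prediction module:

```latex
\min_{x}\; \mathbb{E}_{\xi}\!\left[\, c(x, \xi) \,\right]
\quad \text{subject to} \quad
\Pr_{\xi}\!\left[\, g(x, \xi) \le 0 \,\right] \;\ge\; 1 - \epsilon
```

Here x is the management decision (for example, a placement or scheduling plan), c is its cost, g encodes capacity or reliability requirements, and ε is the tolerated probability of violating them; in a prediction+optimization setup, the distribution of ξ is supplied by the prediction module.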

Microsoft’s AIOps research has pioneered the transition from reactive decision-making to proactive decision-making, especially in the problem spaces of prediction and optimization. Our efforts not only result in significant improvements in many application scenarios traditionally supported by reactive decision-making, but also create many new opportunities. Notable proactive design solutions include Narya and Nenya for hardware failure mitigation, UAHS and CAHS for intelligent virtual machine provisioning, CUC for the predictive scheduling of workloads, and UCaC for bin packing optimization under chance constraints. In the discussion below, we use hardware failure mitigation as an example to illustrate how proactive design can be applied in cloud scenarios.

Exemplary scenario: Proactive hardware failure mitigation

A key threat to cloud platforms is hardware failure, which can cause interruptions to the hosted services and significantly impact the customer experience. Traditionally, hardware failures are only resolved reactively after the failure occurs, which usually involves temporary interruptions of hosted virtual machines and the repair or replacement of the impacted hardware. Such a solution provides limited help in reducing negative customer experiences.

Narya is a proactive disk-failure mitigation service capable of taking mitigation actions before failures occur. Specifically, Narya leverages ML models to predict potential disk failures and then makes decisions accordingly. To control risks related to uncertainty, Narya evaluates candidate mitigation actions based on the estimated impact on customers and chooses the action with minimal impact. A feedback loop also exists for collecting follow-up assessments to improve the prediction and decision modules.
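Conceptually, the decision step weighs each candidate action by its estimated customer impact once the predicted failure risk is high enough to act on. The following is a minimal sketch under that framing; the threshold, the example action names in the comment, and the predict_proba-style failure model are illustrative assumptions, not Narya's actual interface.

```python
def choose_mitigation(node_features, failure_model, impact_estimates, act_threshold=0.3):
    """Pick the candidate mitigation with the smallest estimated customer impact.

    failure_model predicts the probability that the node's disk will fail soon;
    impact_estimates maps each candidate action to its estimated customer impact
    (which a Narya-style system would keep refining with A/B testing and bandit feedback)."""
    p_fail = failure_model.predict_proba([node_features])[0][1]   # predicted disk-failure probability
    if p_fail < act_threshold:
        return "no-action"                                        # risk too low to justify disturbing the node
    # Among candidate actions (e.g. live-migrating VMs, blocking new allocations, a soft reboot),
    # choose the one whose estimated customer impact is smallest.
    return min(impact_estimates, key=impact_estimates.get)
```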

Hardware failures in cloud systems are often highly interdependent. Therefore, to reduce the impact of prediction errors, Narya introduces a novel dependency-aware model that encodes the dependency relationships between nodes to improve the failure prediction model. Narya also implements an adaptive approach that uses A/B testing and bandit modeling to improve its ability to estimate the impact of actions. Several safeguarding mechanisms in different stages of Narya are also in place to eliminate the chance of taking unsafe mitigation actions. Implementing Narya in Azure’s production environment has reduced the node hardware interruption rate for virtual machines by more than 26%.

Narya's Feedback loop

Our recent work, Nenya, is another example of proactive failure mitigation. Under a reinforcement learning framework, Nenya fuses the prediction and decision-making modules into an end-to-end proactive decision-making system. It can weigh both mitigation costs and failure rates to better prioritize cost-effective mitigation actions under uncertainty. Furthermore, traditional failure mitigation methods often suffer from data imbalance issues: failure cases form only a very small portion of all cases, most of which represent healthy situations. Such data imbalance would introduce bias into both the prediction and the decision-making process. To address this problem, Nenya adopts a cascading framework to ensure that mitigation decisions are not made with heavy costs. Experiments with Microsoft 365 data sets on database failures have shown that Nenya can reduce both mitigation costs and database failure rates compared with existing methods.
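One way to picture the cascading idea: a lightweight screening stage lets the overwhelming majority of healthy cases pass at negligible cost, and only the small flagged remainder reaches a cost-aware mitigation policy. The sketch below is our own simplification under that assumption (with an invented screening threshold), not Nenya's actual architecture.

```python
def cascaded_mitigation(cases, screen_model, policy, screen_threshold=0.05):
    """Two-stage cascade: a cheap screen filters out the (dominant) healthy cases,
    so the cost-aware policy only reasons about the rare, risky ones."""
    decisions = {}
    for case_id, features in cases.items():
        risk = screen_model.predict_proba([features])[0][1]   # cheap first-stage risk score
        if risk < screen_threshold:
            decisions[case_id] = "no-action"                   # the overwhelming majority ends here
        else:
            # Second stage trades off mitigation cost against the expected cost of a failure.
            decisions[case_id] = policy.choose(features)
    return decisions
```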

Future work

As management systems become more automated and proactive, it is important to pay special attention to both the safety of cloud systems and the responsibility to cloud customers. Autonomous and proactive decision systems will depend heavily on advanced AI/ML models with little manual effort. How to ensure that the decisions made by those approaches are both safe and responsible is an important question that future work should answer.

The autonomous and proactive cloud relies on effective data usage and feedback loops across all stages of the management and operation of cloud platforms. On one hand, high-quality data on the status of cloud systems are needed to enable downstream autonomous and proactive decision-making. On the other hand, it is important to monitor and analyze the impact of every decision on the whole cloud platform in order to improve the management system. Such feedback loops can exist concurrently for many related application scenarios. Therefore, to better support an autonomous and proactive cloud, a unified data plane responsible for the processing and feedback loop can take a central role in the entire system design and should be a key area of investment.

As such, the future of the cloud relies not only on adopting more autonomous and proactive solutions, but also on improving the manageability of cloud systems and on comprehensively infusing AIOps technologies across all layers of the cloud stack. In future blog posts, we’ll discuss how to work toward a more manageable and comprehensive cloud.

Stay tuned!
