Network assurance with machine reasoning and machine learning
Assurance is a key function of Intent-Based Networking (IBN). It is responsible for validating that the network is actively satisfying the business intent and objectives set forth by the operator. Network Assurance involves three main aspects:
- continuous verification of system behavior by examining run time states and events,
- providing visibility and insights into the operation of the network for validation, understanding and prediction,
- remediation of issues including faults and policy violations, thereby enabling a closed-loop control paradigm.
Artificial Intelligence (AI) is playing an increasing role in IT operations and is poised to change how networks are managed. The three aspects of Network Assurance listed above can be realized through the use of AI, and more specifically through a combination of Machine Reasoning (MR) and Machine Learning (ML). I consider MR and ML to be the Yin and the Yang of AIOps (the application of AI to IT operations). But before diving into how MR and ML harmoniously enable Network Assurance, let’s start with an overview of these technologies and their core strengths. So, what is Machine Reasoning and what is Machine Learning?
Machine Reasoning is a branch of AI that relies on capturing human knowledge using semantic languages which formally codify concepts, relationships and rules. The captured knowledge can be used to compute inferences (new facts) from asserted facts using rules and symbolic logic. Any workflow that is decidable in nature can be mechanized using MR. A workflow is considered “decidable” if it can be modeled as a deterministic decision tree based on facts. MR offers 100% accuracy for inferences that fall within the captured knowledge and 0% accuracy for anything outside it, because a reasoner cannot infer what it has not been told. MR is well-suited for solving problems that require deep domain expertise. Humans need to explicitly capture all the knowledge a priori in order for a reasoner to be able to operate on new data. Early incarnations of MR systems were referred to as “Expert Systems”. Basically, they were programs designed to solve problems within a specialized domain that typically requires a human expert. The technology has evolved significantly since the expert systems of the 1980s, propelled by advances in Semantic Technologies and Linked Data.
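To make the idea of computing inferences from asserted facts concrete, here is a minimal forward-chaining sketch in Python. It is an illustration only: the fact names and rules are hypothetical, and real MR systems use semantic languages and dedicated reasoners rather than a hand-written loop.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion): when every premise is a known fact,
# the conclusion is inferred as a new fact.
Rule = Tuple[FrozenSet[str], str]

# Hypothetical captured knowledge, invented for this example.
RULES: List[Rule] = [
    (frozenset({"interface_down"}), "link_unavailable"),
    (frozenset({"link_unavailable", "no_backup_path"}), "site_unreachable"),
    (frozenset({"site_unreachable"}), "intent_violated"),
]


def forward_chain(asserted: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply the rules repeatedly until no new fact can be inferred."""
    facts = set(asserted)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


asserted = {"interface_down", "no_backup_path"}
inferred = forward_chain(asserted, RULES) - asserted
print(sorted(inferred))
# ['intent_violated', 'link_unavailable', 'site_unreachable']

# A fact that no rule mentions yields no inference at all.
print(forward_chain({"high_latency"}, RULES) - {"high_latency"})
# set()
```

The second call shows the flip side described above: with knowledge that was never captured, the reasoner has nothing to say.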
Machine Learning is another branch of AI that relies on building statistical models from large data sets comprising input and output values. The models approximate the functions that govern the relationship between inputs and outputs without relying on any domain knowledge. In a sense, it is a black-box approach for building a model that describes the behavior of an unknown system. It can also be used in a white-box approach for well-known systems. Given the statistical nature of the models, the prediction accuracy of ML systems falls somewhere between 0 and 100%, depending on the size and quality of the training data set. ML is well-suited for solving problems such as dynamic baselining of system behavior, trend analysis, anomaly detection, classification, recommendation and prediction. Depending on the problem at hand, humans need to train the model a priori before the ML system can be used, or the ML system could learn from trial and error while on the job (reinforcement learning). Of course, the latter may not be viable in many scenarios (think network configuration).
Now let’s return to the world of networking to examine how MR and ML apply and complement one another in the context of Network Assurance. As discussed above, one main aspect of assurance is the continuous verification that the network state and behavior are consistent with the desired intent. ML, when applied to network telemetry, provides this capability. Using ML, it is possible to establish dynamic baselines of what constitutes normal operating conditions for a given intent. An assurance system employing ML can then detect when the operating conditions deviate from the normal baseline for a given deployment and trigger root cause analysis and remediation.
Assurance using ML outperforms a system that relies on static thresholds, especially when the monitored key performance indicators (KPIs) vary widely between deployments, e.g. wireless client on-boarding times.
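The difference between a static threshold and a per-deployment baseline can be sketched in a few lines of Python. This is a deliberately simple stand-in for ML-based baselining: it flags a sample that lies more than `k` standard deviations from the mean of that deployment's own history. The sample values, the 10-second static threshold and the choice of `k = 3` are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List


def is_anomalous(history: List[float], value: float, k: float = 3.0) -> bool:
    """Flag `value` if it is more than k standard deviations from the history mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


# Hypothetical wireless client on-boarding times, in seconds, for two deployments.
site_a = [2.0, 2.2, 1.9, 2.1, 2.0, 2.3, 1.8, 2.1]  # normally fast
site_b = [8.0, 8.5, 7.9, 8.2, 8.4, 8.1, 7.8, 8.3]  # normally slower

STATIC_THRESHOLD_S = 10.0  # one global limit applied to every deployment

# 6.0 s is far from normal at site A, yet it passes the static threshold.
print(6.0 > STATIC_THRESHOLD_S, is_anomalous(site_a, 6.0))  # False True

# 8.6 s is ordinary at site B, and the baseline does not flag it.
print(8.6 > STATIC_THRESHOLD_S, is_anomalous(site_b, 8.6))  # False False
```

A single threshold low enough to catch the problem at site A would raise constant false alarms at site B, which is why a baseline learned per deployment is attractive.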
Another main aspect of assurance is providing insights and visibility into the operation of the network. This is another area where ML shines: by employing analytics based on ML algorithms, it is possible to analyze relationships between various network KPIs in order to gain a better understanding of the performance and behavior of the network. It is even possible to predict, based on trend analysis, when an anomalous condition that violates the expressed intent is imminent. Not only can ML detect network issues, it can also help identify potential root causes based on the correlation between KPIs. This is where MR complements the power of ML by applying the mechanized expert knowledge in order to narrow down the potential causes to the actual root cause of the network problem. MR does that by mechanizing the troubleshooting workflows that an expert network operator goes through when faced with those types of issues. The collective knowledge of networking technologies, protocols, features and products can be captured and automated with MR.
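As a rough illustration of how correlation between KPIs can point at potential root causes, the sketch below ranks candidate KPIs by the strength of their Pearson correlation with a degraded KPI. Plain correlation is only a simple proxy for the ML analytics described above, and the KPI names and values are hypothetical. Correlation suggests candidates; it does not establish the cause, which is the step left to MR.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)


def rank_candidates(
    target: List[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Order candidate KPIs by absolute correlation with the target KPI."""
    scored = [(name, pearson(series, target)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical telemetry sampled over the same six intervals.
onboarding_time_s = [2.0, 2.1, 2.0, 4.5, 5.0, 2.2]
candidates = {
    "controller_cpu_pct": [30.0, 55.0, 42.0, 38.0, 47.0, 51.0],
    "dhcp_latency_s": [0.1, 0.1, 0.1, 0.9, 1.1, 0.15],
}

for name, r in rank_candidates(onboarding_time_s, candidates):
    print(f"{name}: r = {r:.2f}")
```

With this sample data, DHCP latency tracks the on-boarding time closely and ranks first, while controller CPU shows almost no relationship.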
The last main aspect of assurance involves the remediation of network problems or intent violations. This is a second area where MR excels: MR can automatically identify the corrective actions and steps required in order to address a network fault or policy violation. MR achieves this by relying on knowledge bases that capture, in a machine-interpretable form, the decision trees and remediation workflows associated with the specific networking technologies and products. Such actions and steps can then be provided to a human operator as recommendations, and the operator can carry out the remediation. Alternatively, the remediation actions can be executed by the machine reasoner, by invoking the Activation functions of the Intent-Based Network. The latter approach enables a closed-loop control paradigm, where remediation is completely automatic.
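The two remediation modes described above, recommendation to an operator and closed-loop execution, can be sketched with a small decision tree. Everything here is hypothetical: the tree, the fact names, the action names and the `activate` callback, which merely stands in for whatever Activation function a real Intent-Based Network would expose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union


@dataclass
class Node:
    """A decision point: test one boolean fact, then follow a branch."""
    fact: str
    if_true: Union["Node", str]   # a leaf is the name of a remediation action
    if_false: Union["Node", str]


# Hypothetical remediation knowledge for a client on-boarding failure.
TREE = Node(
    fact="dhcp_server_reachable",
    if_true=Node(
        fact="dhcp_pool_exhausted",
        if_true="expand_dhcp_pool",
        if_false="escalate_to_operator",
    ),
    if_false="restore_path_to_dhcp_server",
)


def recommend(tree: Node, facts: Dict[str, bool]) -> str:
    """Walk the tree deterministically until a remediation action is reached."""
    node: Union[Node, str] = tree
    while isinstance(node, Node):
        node = node.if_true if facts[node.fact] else node.if_false
    return node


def remediate(
    tree: Node,
    facts: Dict[str, bool],
    activate: Optional[Callable[[str], None]] = None,
) -> str:
    """Recommend the action, or execute it when an activation callback is given."""
    action = recommend(tree, facts)
    if activate is None:
        return f"recommend: {action}"  # a human operator carries it out
    activate(action)                    # closed loop: the machine carries it out
    return f"executed: {action}"


facts = {"dhcp_server_reachable": True, "dhcp_pool_exhausted": True}
print(remediate(TREE, facts))  # recommend: expand_dhcp_pool

executed = []
print(remediate(TREE, facts, activate=executed.append))  # executed: expand_dhcp_pool
```

Keeping the decision logic in `recommend` and the execution in an optional callback means the same knowledge base serves both the operator-in-the-loop and the fully automatic case.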
| | Machine Reasoning | Machine Learning |
| --- | --- | --- |
| Technology Approach | Capturing human knowledge, symbolic logic | Mathematical model learned from large data sets |
| Applicability | Mechanizing decidable workflows | Trend analysis, anomaly detection and classification |
| Network Assurance Function | Automatic troubleshooting, root cause identification, automatic remediation | Dynamic baselining, issue identification, insights and visibility |
To close, the AI technology landscape includes both ML and MR. The two provide complementary capabilities, and combined they enable AIOps to cover all aspects of network assurance. ML captures the patterns in the data, and MR captures human expertise, facts and logic.
By Samer Salam
Published with permission from blogs.cisco.com.