There has been some debate about whether a big data approach is relevant to application performance monitoring (APM). Some experts claim that sophisticated analytics and sampled data are more than sufficient when it comes to monitoring and diagnostics, and a big data approach is unnecessary.
As an APM practitioner, I strongly disagree with such claims, so I decided to write this article to explore the subject further. I believe the purpose of APM goes beyond monitoring and alerting: it is to help us understand and improve application performance. A big data approach provides the complete and correct set of data and analytics to help us continuously improve application performance.
The primary effect of big data is that it enables us to gain immediate insight without needing to come up with hypotheses, design sampling strategies, and run experiments to test a set of theories. With big data we observe the entire “universe” of the problem, and the resulting analysis is complete and correct because we have removed the sampling/selection bias from the process.
Any time data is sampled, filtered, or aggregated the resulting record represents only some percentage of the truth.
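To make the sampling point concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the synthetic transactions, the three slow outliers, the 1,000 ms threshold, and the 1-in-10 sampling scheme. It is not taken from any particular APM product.

```python
# Synthetic transactions: 1,000 requests at 100 ms, with three slow outliers.
# The indices and timings are invented for illustration.
SLOW_INDICES = {137, 512, 903}
transactions = [
    {"id": i, "duration_ms": 5000 if i in SLOW_INDICES else 100}
    for i in range(1000)
]

def slow(records, threshold_ms=1000):
    """Return the records whose duration exceeds the threshold."""
    return [r for r in records if r["duration_ms"] > threshold_ms]

# Full capture: every transaction is kept.
full_capture = transactions

# 1-in-10 sampling: keep every tenth transaction (ids 0, 10, 20, ...).
sampled = transactions[::10]

print(len(slow(full_capture)))  # 3
print(len(slow(sampled)))       # 0
```

In this contrived data set none of the slow transactions falls on a sampled id, so the sample contains no trace of them. A different sampling offset might catch one or two, but only full capture is certain to contain all three.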
Below I address the main ways that big data improves the mean time to resolution (MTTR) for application problems. It’s important to note that reducing MTTR is the main reason that companies purchase APM solutions.
Classifying performance problems
Applications are often plagued by multiple performance problems. A big data approach helps IT divide and conquer the long tail of problems more efficiently.
Definitive analysis
Big data removes the “would/could/should” from performance analysis. In the absence of precise data, performance analysis starts hinging on conjecture and becomes misleading. It is just as important to determine what is NOT the cause of a problem. Often, in the absence of forensic detail, the team is tempted to rely on prior knowledge (“Last time we had a performance problem it was our logging code”) and frequently goes down the wrong path. With big data we can quickly say “it’s not the logging code,” because we are capturing everything and there is no record that the logging code was used here, and move on without wasting time and effort.
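Ruling out a suspect can be sketched in a few lines of Python. The call records, component names, and helper functions below are hypothetical, and the check is only meaningful under the assumption that capture is complete. With sampled data, a missing record proves nothing.

```python
# Hypothetical call records captured for one slow transaction.
call_records = [
    {"component": "web.checkout", "duration_ms": 40},
    {"component": "db.orders", "duration_ms": 4200},
    {"component": "cache.session", "duration_ms": 15},
]

def was_involved(records, component_prefix):
    """True if any record belongs to a component with the given prefix."""
    return any(r["component"].startswith(component_prefix) for r in records)

def time_spent_in(records, component_prefix):
    """Total time recorded for components with the given prefix."""
    return sum(
        r["duration_ms"]
        for r in records
        if r["component"].startswith(component_prefix)
    )

print(was_involved(call_records, "logging"))  # False
print(time_spent_in(call_records, "db."))     # 4200
```

Here the logging code never appears in the transaction, so the team can drop that theory and turn to the database calls, which account for nearly all of the recorded time.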
Diagnosing intermittent problems
Intermittent performance problems tend to be the most challenging to diagnose for several reasons:
A big data approach addresses all these challenges and enables IT to quickly diagnose these problems. With big data, it is not necessary to understand the failure conditions up front as diagnostics data is continuously captured in full detail. For the same reason, there is always forensic data available regardless of when the problem transpires and how the environment changed.
Analyzing ephemeral environments
A big data approach is very effective in diagnosing problems in cloud, virtualized, or containerized environments. In these ephemeral application environments, the application infrastructure is constantly changing, and a triggered or sampled approach misses the state changes as components come to life and disappear.
Understanding the user journey
Understanding the user population is invaluable for drawing insight into global performance trends, but it is sometimes insufficient for understanding the steps that lead to big performance problems. A single user action can cause performance problems for the entire application. A big data approach guarantees that all forensic data is available to reconstruct the breadcrumb trail of the incident.
Forensic exploration and code audit
Forensic exploration is one of my favorite aspects of APM big data. You can find problems you were not even looking for!
Who has time for that, you say? People who are persnickety about application performance! Often the rich historical transaction detail or high-resolution environment data unveils completely unforeseen behaviors and corner cases of how users use or break the application. I have lost count of how many times I have heard the statement “It shouldn’t be doing that,” but the facts say otherwise.
Continuous performance improvement
With big data we can do more than just monitoring and diagnostics. We can start methodically reducing performance bloat. The availability of deep performance data allows us to focus on continuous performance improvement.
Utilization analysis
Applications are constantly changing with new feature releases, and tend to accumulate technical and performance debt. The result is that, over time, a well-performing application starts degrading. Big data provides the insight to understand which components of an application are taking the most time, so that we can focus optimization efforts where they matter.
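As a rough sketch of this kind of utilization analysis, the Python below totals recorded time per component and ranks the components by their share. The component names and timings are invented, and the `time_by_component` helper is a hypothetical name rather than part of any APM tool.

```python
from collections import defaultdict

# Hypothetical (component, duration in ms) records from captured transactions.
records = [
    ("auth", 120), ("search", 900), ("auth", 80),
    ("render", 300), ("search", 1100), ("render", 500),
]

def time_by_component(records):
    """Sum time per component and return (component, total) pairs, largest first."""
    totals = defaultdict(int)
    for component, duration_ms in records:
        totals[component] += duration_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = time_by_component(records)
grand_total = sum(total for _, total in ranking)

for component, total in ranking:
    print(f"{component}: {total} ms ({total / grand_total:.0%} of recorded time)")
```

With this made-up data, `search` comes out on top, so that is where optimization effort would go first. The ranking is only as trustworthy as the data behind it, which is the argument for capturing every transaction rather than a sample.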
Application ecosystem analysis
In an enterprise setting, applications are never designed, built, or operated in complete isolation. In some cases different applications may share systems, networks, or infrastructure. In other cases applications may share common libraries, data, or APIs. Sharing components or resources has a lot of benefits, but it also leads to performance problems that often affect multiple applications.
Leveraging big data helps application support teams uncover performance problems and patterns across the entire application environment, not just a single application component. Once a problem is discovered in a single app, big data analytics helps us look for other applications that have the same problem or are at risk.
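One way to picture that cross-application search is the Python sketch below. The applications, shared components, and slow-call counts are all hypothetical, and `apps_using` is an illustrative helper rather than a real product feature.

```python
from collections import defaultdict

# Hypothetical observations: (application, component, slow call count).
observations = [
    ("billing", "shared-auth-lib", 42),
    ("billing", "orders-db", 0),
    ("storefront", "shared-auth-lib", 17),
    ("storefront", "catalog-api", 0),
    ("reporting", "orders-db", 0),
    ("reporting", "shared-auth-lib", 0),
]

def apps_using(observations, component):
    """Map each application that uses the component to its slow call count."""
    apps = defaultdict(int)
    for app, comp, slow_calls in observations:
        if comp == component:
            apps[app] += slow_calls
    return dict(apps)

# Suppose the problem was first found in "shared-auth-lib" within billing.
usage = apps_using(observations, "shared-auth-lib")
affected = sorted(app for app, n in usage.items() if n > 0)
at_risk = sorted(app for app, n in usage.items() if n == 0)

print(affected)  # ['billing', 'storefront']
print(at_risk)   # ['reporting']
```

Applications that already show slow calls in the shared component have the same problem. Those that use it without symptoms so far are the ones at risk.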
I want to summarize my experience from hundreds of performance engineering engagements:
We have a performance situation on a critical app. Maybe a QA test is not passing or a production environment is degraded. We quickly mobilize the tiger team to analyze the problem and come up with a recommendation. I have found that the success of the triage effort depends heavily on the quality of the forensic data. Incomplete evidence divides the team, and multiple root cause candidates then need to be researched. On the other hand, complete and accurate forensic data removes ambiguity, rallies the team, and leads to faster resolution.
No tiger team ever said “We need less detail to find the root cause!”
Published with permission from Riverbed.