Why a Big Data Approach is Key for APM

There has been some debate about whether a big data approach is relevant to application performance monitoring (APM). Some experts claim that sophisticated analytics and sampled data are more than sufficient when it comes to monitoring and diagnostics, and a big data approach is unnecessary.

As an APM practitioner, I strongly disagree with such claims, so I decided to write this article to explore the subject further. I believe the purpose of APM goes beyond monitoring and alerting: it is to help us understand and improve application performance. A big data approach provides the complete and correct set of data and analytics to help us continuously improve application performance.

The primary effect of big data is that it enables us to gain immediate insight without needing to come up with hypotheses, design sampling strategies, and run experiments to test a set of theories. With big data we observe the entire “universe” of the problem, and the resulting analysis is complete and correct because we have removed sampling and selection bias from the process.

Any time data is sampled, filtered, or aggregated the resulting record represents only some percentage of the truth.
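
To make the sampling argument concrete, here is a minimal sketch in Python. The workload is synthetic, and the names and numbers (`transactions`, `SLOW_MS`, the 1-in-100 sampling rate, the three slow transaction ids) are assumptions chosen for illustration, not taken from any particular APM product.

```python
# Synthetic workload: 10,000 transactions, three of which are pathologically slow.
SLOW_MS = 1000                 # hypothetical threshold for a "slow" transaction
slow_ids = {1234, 5678, 9001}  # hypothetical rare outliers

transactions = [
    {"id": i, "duration_ms": 5000 if i in slow_ids else 50}
    for i in range(10_000)
]

def find_slow(records):
    """Return the ids of records whose duration exceeds the threshold."""
    return [r["id"] for r in records if r["duration_ms"] > SLOW_MS]

# Systematic 1% sample: keep every 100th transaction.
sampled = transactions[::100]

full_hits = find_slow(transactions)
sampled_hits = find_slow(sampled)

print(len(sampled), "sampled records,", len(sampled_hits), "slow ones seen")
print(len(transactions), "full records,", len(full_hits), "slow ones seen")
```

In this contrived case the sample contains none of the slow transactions, while the full capture contains all three. A random sample might catch some of them, but with rare events there is no guarantee that it will.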

Below I address the main ways that big data improves the mean time to resolution (MTTR) for application problems. It’s important to note that reducing MTTR is the main reason that companies purchase Application Performance Monitoring solutions.

How big data helps with MTTR

Classifying performance problems

Applications are often plagued by multiple performance problems. A big data approach helps IT divide and conquer the long tail of problems more efficiently.

Definitive analysis

Big data removes the “would/could/should” from performance analysis. In the absence of precise data, performance analysis starts hinging on conjecture and becomes misleading. It is just as important to determine what is NOT the cause of a problem. Often the team is tempted to fall back on prior knowledge in the absence of forensic detail (“Last time we had a performance problem it was our logging code”) and frequently goes down the wrong path. With big data we can quickly say “it’s not the logging code” because we are capturing everything and there is no record of the logging code being involved, and we can move on without wasting time and effort.
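
Ruling out a suspect can be sketched as a simple query over the captured traces. The trace structure, transaction names, and component names below are hypothetical, and the sketch assumes every slow transaction was captured along with every component it touched.

```python
# Hypothetical complete capture: each slow transaction lists every component it touched.
slow_traces = [
    {"txn": "checkout-17", "components": ["web", "cart-service", "db"]},
    {"txn": "checkout-42", "components": ["web", "cart-service", "db"]},
    {"txn": "search-08", "components": ["web", "search-service", "db"]},
]

def implicated(traces, component):
    """Return the transactions whose call path includes the given component."""
    return [t["txn"] for t in traces if component in t["components"]]

print(implicated(slow_traces, "logging"))  # no matches: logging is ruled out
print(implicated(slow_traces, "db"))       # every slow transaction touches the database
```

The empty result for `logging` is only conclusive because the capture is assumed to be complete. With sampled traces, the same empty result would just mean "not seen".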

Diagnosing intermittent problems

Intermittent performance problems tend to be the most challenging to diagnose for several reasons:

The conditions of the failure are often elusive
Re-occurrence is unpredictable
There are few opportunities to observe the problem
The environment itself changes over the course of these long-running problems

A big data approach addresses all these challenges and enables IT to quickly diagnose these problems. With big data, it is not necessary to understand the failure conditions up front, because diagnostic data is continuously captured in full detail. For the same reason, forensic data is always available regardless of when the problem occurs and how the environment has changed.

Analyzing ephemeral environments

A big data approach is very effective in diagnosing problems in cloud, virtualized, or containerized environments. In these ephemeral application environments, the application infrastructure is constantly changing, and a triggered or sampled approach misses the state changes as components come to life and disappear.
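
The blind spot can be illustrated with a small simulation. The container names, lifetimes, and the 60-second polling interval are all assumptions made up for this sketch.

```python
# Hypothetical container lifetimes, in seconds since the start of the window.
containers = {
    "worker-a": (0, 300),
    "worker-b": (65, 110),   # lives 45 s between two polls
    "worker-c": (125, 170),  # lives 45 s between two polls
}

POLL_INTERVAL = 60  # assumed polling period in seconds

def seen_by_polling(lifetimes, interval, window):
    """Names of containers that are alive at one or more polling instants."""
    seen = set()
    for t in range(0, window + 1, interval):
        for name, (start, stop) in lifetimes.items():
            if start <= t < stop:
                seen.add(name)
    return seen

def seen_by_events(lifetimes):
    """Names of containers when every start/stop event is recorded."""
    return set(lifetimes)

polled = seen_by_polling(containers, POLL_INTERVAL, 300)
print("seen by polling:", sorted(polled))
print("missed by polling:", sorted(seen_by_events(containers) - polled))
```

Here the two short-lived workers start and stop entirely between polls, so the polling view never learns they existed. Recording every lifecycle event keeps them in the record.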

Understanding the user journey

Understanding the user population is invaluable for drawing insight about global performance trends, but it is sometimes insufficient for understanding the specific steps that lead to a big performance problem. A single user action can cause performance problems for the entire application. A big data approach ensures that all the forensic data is available to reconstruct the breadcrumb trail of the incident.
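
Reconstructing a breadcrumb trail amounts to grouping captured events by user and ordering them by time. The event tuples, user names, and action names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical captured events: (timestamp_s, user, action)
events = [
    (100, "alice", "login"),
    (101, "bob", "login"),
    (105, "alice", "search"),
    (107, "bob", "export_all_records"),
    (108, "alice", "view_item"),
    (109, "bob", "timeout"),
]

def journeys(captured):
    """Group events by user and order each user's actions by time."""
    by_user = defaultdict(list)
    for ts, user, action in sorted(captured):
        by_user[user].append((ts, action))
    return dict(by_user)

def trail_before(captured, user, incident_action):
    """Return the user's actions up to and including the incident."""
    trail = []
    for _ts, action in journeys(captured).get(user, []):
        trail.append(action)
        if action == incident_action:
            break
    return trail

print(trail_before(events, "bob", "timeout"))
```

In this made-up data, the trail for `bob` shows an `export_all_records` action right before the timeout, which is exactly the kind of step an aggregate view of the user population would hide.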

Forensic exploration and code audit

Forensic exploration is one of my favorite aspects of APM big data. You can find problems you were not even looking for!

Who has time for that, you say? People who are persnickety about application performance! Often the rich historical transaction detail or high-resolution environment data unveils completely unforeseen behaviors and corner cases of how users use or break the application. I have lost count of how many times I have heard the statement, “It shouldn’t be doing that,” but the facts say otherwise.

Continuous performance improvement

With big data we can do more than just monitoring and diagnostics. We can start methodically reducing performance bloat. The availability of deep performance data allows us to focus on continuous performance improvement.

Utilization analysis

Applications are constantly changing with new feature releases, and they tend to accumulate technical and performance debt. The result is that, over time, a well-performing application starts degrading. Big data provides the insight to understand which components of an application are taking the most time, so that we can focus optimization efforts where they matter.
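
A minimal sketch of this kind of utilization analysis: sum the elapsed time per component across all captured calls and rank the result. The component names and timings are invented for the example.

```python
from collections import Counter

# Hypothetical timing records: (component, elapsed_ms) for each captured call.
calls = [
    ("db", 120), ("cache", 5), ("db", 200),
    ("render", 40), ("auth", 15), ("render", 60), ("db", 80),
]

def time_by_component(records):
    """Sum elapsed time per component and rank from most to least expensive."""
    totals = Counter()
    for component, elapsed_ms in records:
        totals[component] += elapsed_ms
    return totals.most_common()

ranking = time_by_component(calls)
grand_total = sum(ms for _, ms in ranking)
for component, ms in ranking:
    print(f"{component:8s} {ms:5d} ms  {ms / grand_total:6.1%}")
```

With these numbers the `db` component accounts for most of the total time, so that is where optimization effort would go first.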

Application ecosystem analysis

In an enterprise setting, applications are never designed, built, or operated in complete isolation. In some cases different applications share systems, networks, or infrastructure. In other cases applications share common libraries, data, or APIs. Sharing components or resources has a lot of benefits, but it also leads to performance problems that often affect multiple applications.

Leveraging big data helps application support teams uncover performance problems and patterns across the entire application environment, not just in a single application component. Once a problem is discovered in a single app, big data analytics helps us look for other applications that have the same problem or are at risk.
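
As a rough sketch, once a faulty shared component is identified in one application, finding the others at risk is a lookup across the dependency data. The application names and component names below are hypothetical.

```python
# Hypothetical inventory: which shared components each application depends on.
dependencies = {
    "billing": {"orders-db", "auth-lib-2.1", "payments-api"},
    "storefront": {"orders-db", "auth-lib-2.1", "search-api"},
    "reporting": {"warehouse-db", "auth-lib-3.0"},
}

def at_risk(deps, faulty_component, known_app):
    """Applications other than known_app that also use the faulty component."""
    return sorted(
        app for app, components in deps.items()
        if faulty_component in components and app != known_app
    )

print(at_risk(dependencies, "auth-lib-2.1", "billing"))
```

If the problem was first found in `billing`, this flags `storefront` as sharing the same library version, while `reporting` uses a different one.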

The human effect of big data

I want to summarize my experience from hundreds of performance engineering engagements:

We have a performance situation on a critical app. Maybe a QA test is not passing or a production environment is degraded. We quickly mobilize the tiger team to analyze the problem and come up with a recommendation. I have found that the success of the triage effort depends heavily on the quality of the forensic data. Incomplete evidence divides the team, and multiple possible root cause candidates then need to be researched. Complete and accurate forensic data, on the other hand, removes ambiguity, rallies the team, and leads to faster resolution.

No tiger team ever said “We need less detail to find the root cause!”

Wrap up

So you see, big data is not only useful but necessary when it comes to APM. Sampling doesn’t come close to providing you with the completeness and depth of data necessary to identify and resolve problems in dynamic, composite application environments.