How to find the reason for each anomaly in your metrics
We use metrics and KPIs to monitor the health of our products: to make sure that everything is stable and that the product is growing as expected. But sometimes, metrics change suddenly. Conversion may rise by 10% in one day, or revenue may drop slightly for a couple of quarters. In such situations, it’s critical for businesses to understand not only what is happening but also why and what actions we should take. And that’s where analysts come into play.
My first data analytics role was as a KPI analyst. Anomaly detection and root cause analysis were my main focus for nearly three years. I’ve found the key drivers behind dozens of KPI changes and developed a methodology for approaching such tasks.
In this article, I would like to share my experience with you, so the next time you face unexpected metric behaviour, you’ll have a guide to follow.
Before moving on to the analysis, let’s define our main goal: what we would like to achieve. So what’s the purpose of anomaly root cause analysis?
The most straightforward answer is understanding the key drivers of the metric change. And it goes without saying that it’s a correct answer from an analyst’s standpoint.
But let’s look at it from the business side. The main reason to spend resources on this research is to minimise the potential negative impact on our customers. For example, if conversion has dropped due to a bug in the new app version released yesterday, it’s better to find that out today rather than in a month, when hundreds of customers will have already churned.
Our main goal is to minimise the potential negative impact on our customers.
As an analyst, I like having optimization metrics even for my own work tasks. Minimizing potential adverse effects feels like the right mindset to help us focus on the right things.
So, keeping the main goal in mind, I’d try to find answers to the following questions:
- Is it a real problem affecting our customers’ behaviour or just a data issue?
- If our customers’ behaviour has actually changed, can we do anything about it? What would be the potential effect of the different options?
- If it’s a data issue, can we use other tools to monitor the same process? How can we fix the broken process?
In my experience, the best first action is to reproduce the affected customer journey. For example, suppose the number of orders in the e-commerce app decreased by 10% on iOS. In that case, it’s worth trying to purchase something and double-checking whether there are any product issues: buttons are not visible, a banner can’t be closed, etc.
Also, don’t forget to look at logging to make sure that information is captured correctly. Everything may be fine with the customer experience, yet we may still be losing data about purchases.
I consider this an essential first step of any anomaly investigation. First of all, after doing it yourself, you’ll better understand the affected part of the customer journey: what the steps are and how the data is logged. Secondly, you may find the root cause right away and save yourself hours of analysis.
Tip: You’re more likely to reproduce the issue if the anomaly’s magnitude is significant, which means the problem affects many customers.
As we discussed earlier, first of all, it’s essential to understand whether customers are actually affected or it’s just a data anomaly.
I definitely advise you to check that the data is up-to-date. You may see a 50% decrease in yesterday’s revenue simply because the report captured only the first half of the day. You can look at the raw data or ask your Data Engineering team.
If there are no known data-related problems, you can double-check the metric using different data sources. In many cases, products have both client-side data (for example, Google Analytics or Amplitude) and back-end data (for example, application logs, access logs or API gateway logs). So we can use different data sources to verify the KPI dynamics. If you see an anomaly in only one data source, your problem is likely data-related and doesn’t affect customers.
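A minimal sketch of such a cross-check (assuming `backend_orders` and `client_side_orders` are hypothetical pandas Series of daily order counts, indexed by date):

```python
import pandas as pd

# Put the two independent sources side by side.
comparison = pd.DataFrame({
    'backend': backend_orders,
    'client_side': client_side_orders,
})
# If both series drop together, customers are likely affected; if only
# the ratio shifts, the problem is probably in one data pipeline.
comparison['ratio'] = comparison['backend'] / comparison['client_side']
print(comparison.tail(14))
```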
The other thing to keep in mind is time windows and data delays. Once, a product manager came to me saying activation was broken because conversion from registration to the first successful action (i.e. purchase in the case of e-commerce) had been decreasing for three weeks. However, it turned out to be a perfectly ordinary situation.
The root cause of the decrease was the time window. We track activation within the first 30 days after registration, so cohorts registered 4+ weeks ago had the whole month to make the first action. But customers from the latest cohort had only one week to convert, so their conversion is expected to be much lower. If you want to compare conversions across these cohorts, change the time window to one week or wait.
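Here is a minimal sketch of such a fixed-window comparison in pandas (the data and column names are made up):

```python
import pandas as pd

# Hypothetical user-level data: registration date and the date of the
# first purchase (NaT if the customer hasn't converted yet).
users = pd.DataFrame({
    'registration_date': pd.to_datetime(['2024-05-01', '2024-05-02', '2024-05-20']),
    'first_purchase_date': pd.to_datetime(['2024-05-03', None, '2024-05-24']),
})

# Apply the same 7-day window to every cohort, so recent cohorts that
# haven't had the full 30 days yet remain comparable to older ones.
window = pd.Timedelta(days=7)
users['converted_7d'] = (
    users['first_purchase_date'] - users['registration_date']
) <= window
conversion_by_cohort = users.groupby(
    users['registration_date'].dt.to_period('W')
)['converted_7d'].mean()
print(conversion_by_cohort)
```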
In the case of data delays, you may see a similar decreasing trend in recent days. For example, our mobile analytics system used to send events in batches when the device was on a Wi-Fi network, so on average, it took 3–4 days to receive all the events from all devices. Seeing fewer active devices for the last 3–4 days was therefore perfectly normal.
A good practice for such cases is trimming the last period from your graphs. It will prevent your team from making incorrect decisions based on incomplete data. Still, people may accidentally bump into such inaccurate metrics, so you should spend some time understanding how methodologically sound your metrics are before diving deep into root cause analysis.
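A minimal sketch of such trimming with pandas’ built-in plotting (assuming `daily_metric` is a Series with a DatetimeIndex):

```python
import pandas as pd

# Events arrive with up to a 4-day delay, so the most recent points are
# incomplete; trim them before plotting to avoid misleading the team.
DELAY_DAYS = 4
cutoff = daily_metric.index.max() - pd.Timedelta(days=DELAY_DAYS)
complete_part = daily_metric[daily_metric.index <= cutoff]
complete_part.plot(title='Active devices (incomplete days trimmed)')
```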
The next step is to look at the trends more globally. I like to zoom out and look at longer-term trends to get the whole picture.
For example, let’s look at the number of purchases. The number of orders has been growing steadily week after week, with an expected decrease at the end of December (Christmas and New Year time). But then, at the beginning of May, the KPI dropped significantly and continued decreasing. Should we start panicking?
Actually, most likely, there’s no reason to panic. Looking at the metric trends for the last three years, we can notice that the number of purchases decreases every summer. So it’s a case of seasonality. For many products, engagement is lower during the summer because customers go on vacation. However, this seasonality pattern isn’t universal: for example, travel or summer festival sites may show the opposite seasonal trend.
Let’s look at one more example — the number of active customers for another product. We can see a decrease since June: monthly active users used to be 380K–400K, and now it’s only 340K–360K (around a 10% decrease). We’ve already checked that there were no such changes in summer in the previous years. Should we conclude that something is broken in our product?
Wait, not yet. In this case, zooming out can help, too. Taking long-term trends into account, we can see that the values of the last three weeks are close to those in February and March. The real anomaly is the 1.5 months of unusually high customer numbers from the beginning of April till mid-May. We could have wrongly concluded that the KPI dropped, but it has just returned to the norm. Considering it was spring 2020, the higher traffic on our site was likely due to COVID isolation: customers were sitting at home and spending more time online.
Last but not least, your initial analysis should define the exact time when the KPI changed. In some cases, the change occurs suddenly, within 5 minutes; in others, it can be a very slight shift in trend. For example, active users used to grow +5% WoW (week-over-week), but now it’s just +3%.
It’s worth trying to define the change point as accurately as possible (even with minute precision) because it will help you pick the most plausible hypotheses later.
How fast the metric changed can also give you clues. For example, if conversion changed within 5 minutes, it can’t be due to the rollout of a new app version (it usually takes days for customers to update their apps) and is more likely due to back-end changes (for example, an API).
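Pinpointing the change point doesn’t have to be done by eye: for example, the third-party ruptures library implements several change point detection algorithms. A hedged sketch, assuming `values` is a hypothetical 1-D array of metric values at a fine granularity:

```python
import numpy as np
import ruptures as rpt  # pip install ruptures

# Hypothetical metric values, e.g. a per-minute conversion rate.
signal = np.asarray(values, dtype=float)

# PELT with an RBF cost detects distribution shifts; the penalty value
# controls how many change points get reported.
algo = rpt.Pelt(model='rbf').fit(signal)
change_points = algo.predict(pen=10)  # indices where segments end
print(change_points)
```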
Understanding the whole context of what’s going on may be crucial for the investigation.
Here’s what I usually check to see the whole picture:
- Internal changes. It goes without saying that internal changes can influence KPIs, so I usually look up all the releases, experiments, infrastructure incidents, product changes (e.g. a new design or price changes) and vendor updates (for example, an upgrade to the latest version of the BI tool we use for reporting).
- External factors may differ depending on your product. Currency exchange rates in fintech can affect customers’ behaviour, while big news or weather changes can influence a search engine’s market share. You can brainstorm similar factors for your product; try to be creative when thinking about external factors. For example, we once discovered that a decrease in site traffic was due to network issues in our most important region.
- Competitors’ activities. Try to find out whether your main competitors are doing something right now — a massive marketing campaign, an incident that makes their product unavailable, or a market closure. The easiest way to do this is to look for mentions on Twitter, Reddit or in the news. Also, there are a number of sites monitoring services’ issues and outages (for example, DownDetector or DownForEveryoneOrJustMe) where you can check your competitors’ health.
- Customers’ voice. You can learn about problems with your product from your customer support team, so don’t hesitate to ask them whether there are any new complaints or an increase in customer contacts of a particular type. However, keep in mind that only a few people may contact customer support (especially if your product isn’t essential for everyday life). For example, many years ago, our search engine was completely broken for ~100K users of old versions of the Opera browser. The issue persisted for a couple of days, but fewer than ten customers reached out to support.
Since we’ve already defined the anomaly time, it’s pretty easy to get all the events that happened around it. These events are your hypotheses.
Tip: If you suspect internal changes (a release or an experiment) are the root cause of your KPI drop-off, the best practice is to revert these changes (if possible) and only then try to understand the exact problem. This will help you reduce the potential negative effect on customers.
At this point, you hopefully already have an understanding of what was happening around the time of the anomaly and some hypotheses about the root causes.
Let’s start by looking at the anomaly from a higher level. For example, if there’s an anomaly in conversion on Android for US customers, it’s worth checking iOS and web, as well as customers from other regions. Then you’ll be able to adequately understand the scale of the problem.
After that, it’s time to dive deep and try to localize the anomaly (to define the narrowest possible segment or segments affected by the KPI change). The most straightforward way is to look at your product’s KPI trends across different dimensions.
The list of meaningful dimensions can differ significantly depending on your product, so it’s worth brainstorming with your team. I’d suggest the following groups of factors:
- technical features: for example, platform, operating system, app version;
- customer features: for example, new or existing customer (cohorts), age, region;
- customer behaviour: for example, product features adopted, experiment flags, marketing channels.
When examining KPI trends split by different dimensions, it’s better to look only at sufficiently significant segments. For example, if revenue has dropped by 10%, there’s no reason to look at countries that contribute less than 1% of total revenue. Metrics tend to be more volatile in smaller groups, so insignificant segments may add too much noise. I like to group all the small slices into an `other` group to avoid losing this signal completely, as in the sketch below.
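A rough sketch of this grouping in pandas (assuming `revenue_by_country` is a hypothetical Series of total revenue per country):

```python
# Segments contributing less than 1% of total revenue are too noisy on
# their own, but their combined signal is still worth keeping.
share = revenue_by_country / revenue_by_country.sum()
small_segments = share[share < 0.01].index

grouped = revenue_by_country.drop(small_segments)
grouped['other'] = revenue_by_country[small_segments].sum()
```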
For example, we can look at revenue split by platform. The absolute numbers for different platforms can differ significantly, so I normalized all the series on the first point to compare their dynamics over time. Sometimes it’s better to normalize on the average of the first N points, for example, the first seven days, to capture weekly seasonality.
Here’s how to do it in Python:

```python
import plotly.express as px

# df is a DataFrame of daily revenue with one column per platform.
# Normalize each series on the average of its first 7 points to make
# the platforms comparable and smooth out weekly seasonality.
norm_value = df[:7].mean()
norm_df = df.apply(lambda x: x / norm_value, axis=1)
px.line(norm_df, title='Revenue by platform normalized on the first week')
```
The graph tells us the whole story: before May, the revenue trends for the different platforms were pretty close, but then something happened on iOS, and iOS revenue decreased by 10–20%. So the iOS platform is the one mainly affected by this change, while the others are pretty stable.
After determining the main segments affected by the anomaly, let’s try to decompose our KPI. It can give us a better understanding of what’s going on.
We usually use two types of KPIs in analytics: absolute numbers and ratios. So let’s discuss the decomposition approach for each case.
We can decompose an absolute number into a product of metrics. For example, let’s look at the total time spent in the service (a common KPI for content products). We can decompose it into two separate metrics: total time spent = number of active customers × average time spent per customer.
Then we can look at the dynamics of both metrics. In the example below, the number of active customers is stable while the time spent per customer has dropped, which means we haven’t lost customers entirely, but for some reason they have started to spend less time on our service.
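A sketch of this decomposition on hypothetical event-level data (the `events` DataFrame and its columns `date`, `user_id` and `time_spent` are assumptions):

```python
# Aggregate event-level data into the two component metrics.
daily = events.groupby('date').agg(
    active_users=('user_id', 'nunique'),
    total_time=('time_spent', 'sum'),
)
daily['time_per_user'] = daily['total_time'] / daily['active_users']

# Plot both components to see which one drives the change.
daily[['active_users', 'time_per_user']].plot(subplots=True)
```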
For ratio metrics, we can look at the numerator and denominator dynamics separately. For example, let’s use conversion from registration to the first purchase within 30 days. We can decompose it into two metrics (see the sketch after the list):
- the number of customers who made a purchase within 30 days after registration (numerator),
- the number of registrations (denominator).
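A minimal sketch (assuming a daily DataFrame `conv_df` with hypothetical columns `converted` and `registrations`):

```python
import plotly.express as px

# Plot the numerator and denominator separately: this often reveals
# which side of the ratio actually moved.
conv_df['conversion'] = conv_df['converted'] / conv_df['registrations']
px.line(conv_df[['converted', 'registrations']],
        title='Converted customers vs registrations').show()
```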
In the example below, the conversion rate decreased from 43.5% to 40% in April, while both the number of registrations and the number of converted customers increased. This means we gained additional customers with lower conversion, which may happen for various reasons:
- a new marketing channel or campaign bringing lower-quality users;
- technical changes in the data (for example, we changed the definition of regions and are now taking more customers into account);
- fraud or bot traffic on the site.
Tip: If we saw a drop in converted users while the total number of users was stable, that would indicate a problem either in the product or in the data capturing the fact of conversion.
For conversions, it may also be helpful to turn the metric into a funnel. For example, in our case, we can look at the conversions for the following steps (a toy sketch follows the list):
- completed registration;
- viewing the product catalogue;
- adding an item to the basket;
- placing an order;
- successful payment.
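Here the funnel counts are made up, purely for illustration:

```python
# Hypothetical number of customers reaching each funnel step on a given day.
funnel = {
    'registration': 10_000,
    'catalogue': 7_200,
    'basket': 2_100,
    'order': 1_300,
    'payment': 1_150,
}
steps = list(funnel)

# Step-to-step conversion shows where in the journey the drop happens.
for prev_step, next_step in zip(steps, steps[1:]):
    rate = funnel[next_step] / funnel[prev_step]
    print(f'{prev_step} -> {next_step}: {rate:.1%}')
```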
The conversion dynamics for each step can show us the stage of the customer journey where the change happened.
As a result of all the analysis stages mentioned above, you should have a fairly complete picture of the current situation:
- what exactly changed;
- which segments are affected;
- what is happening around it.
Now it’s time to sum it all up. I like to put all the information down in a structured way, describing the tested hypotheses, the conclusions we’ve made, the current understanding of the primary root cause, and the next steps (if they are needed).
Tip: It’s worth writing down all the tested hypotheses (not just the proven ones), as it will save you from duplicating work later.
The essential thing to do now is to verify that our primary root cause can fully explain the KPI change. I usually model what the situation would look like without the known effects.
For example, in the case of conversion from registration to the first purchase, suppose we have discovered a fraud attack and we know how to identify bot traffic using IP addresses and user agents. Then we can look at the conversion rate without the effect of the known primary root cause — the fraud traffic, as in the sketch below.
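A sketch of this check (assuming a user-level DataFrame `users` with hypothetical boolean flags `is_bot` and `converted`):

```python
# Compare conversion with and without the identified fraud traffic to see
# how much of the drop the primary root cause actually explains.
overall_cr = users['converted'].mean()
clean_cr = users.loc[~users['is_bot'], 'converted'].mean()
print(f'Conversion overall: {overall_cr:.1%}, excluding fraud: {clean_cr:.1%}')
```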
As you can see, the fraud traffic explains only around 70% of the drop, so there could be other factors affecting the KPI. That’s why it’s better to double-check that you’ve found all the significant factors.
Sometimes it can be difficult to prove your hypothesis, for example, for changes in price or design that you couldn’t properly A/B test. We all know that correlation doesn’t imply causation.
Possible ways to check the hypothesis in such cases:
- Look at similar situations in the past, for example, previous price changes, and check whether there was a similar correlation with the KPI.
- Try to identify customers with changed behaviour, such as those who started spending much less time in our app, and conduct a survey.
Even after this analysis, you may still have doubts about the effects, but it can increase your confidence that you’ve found the right answer.
Tip: A survey can also help if you are stuck: you’ve checked all your hypotheses and still haven’t found an explanation.
At the end of an extensive investigation, it’s time to think about how to make it easier and better next time.
My best practices after years of dealing with anomaly investigations:
- It’s super-helpful to have a checklist specific to your product — it can save you and your colleagues hours of work. It’s worth putting together a list of hypotheses and tools to check them (links to dashboards, external sources of information on your competitors, etc.). Keep in mind that writing the checklist isn’t a one-time activity: you should add new knowledge whenever you face new types of anomalies, so it stays up-to-date.
- The other useful artifact is a changelog with all the meaningful events for your product, for example, price changes, launches of competing products, or new feature releases. The changelog lets you find all the significant events in one place instead of searching through multiple chats and wiki pages. It can be hard to remember to update the changelog, so you can make it part of the analytical on-call duties to establish clear ownership.
- Usually, you need input from different people to understand the whole context of the situation. A working group and a channel for KPI anomaly investigations, prepared in advance, can save precious time and keep all stakeholders updated.
- Last but not least, to minimise the potential negative impact on customers, we should have a monitoring system in place to learn about anomalies as soon as possible and start looking for root causes. So invest some time in setting up and improving your alerting and monitoring.
The key messages I would like you to remember:
- When doing root cause analysis, you should focus on minimizing the potential negative impact on customers.
- Try to be creative and look broadly: gather the full context of what’s happening in your product and infrastructure, and what the potential external factors are.
- Dig deep: look at your metrics from different angles, examining different segments and decomposing your metrics.
- Be prepared: it’s much easier to deal with such research if you already have a checklist for your product, a changelog, and a working group to brainstorm with.
Thanks a lot for reading this article. I hope you won’t be stuck the next time you face a root cause analysis task, since you now have a guide at hand. If you have any follow-up questions or comments, please don’t hesitate to leave them in the comments section.