I believe that the primary goal of analysts is to help their product teams make the right decisions based on data. It means that the main result of analysts' work shouldn't be just producing numbers or dashboards but influencing reasonable data-driven decisions. So, presenting the outcomes of our research is a critical part of analysts' day-to-day work.
Have you ever not noticed an obvious anomaly until you created a graph? You are not alone. Almost no one can extract insights from dry tables of numbers. That's why we need visualisations to unveil the insights in the data. Serving as a bridge between data and product teams, a data analyst must excel in visualisation.
That's why I would like to discuss data visualisations, starting with a framework for choosing the most suitable chart type for your use case.
It can be tempting to look at data using only summary statistics. You can compare datasets by mean values and variance and never look at the data at all. However, this can lead to misinterpretations of your data and wrong decisions.
One of the most famous examples is Anscombe's quartet. It was created by statistician Francis Anscombe, and it consists of four datasets with almost identical descriptive statistics: means, variances and correlations. But once we look at the data, we can see how different the datasets are.
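If you want to reproduce this yourself, here's a quick sketch (assuming seaborn and pandas are available) using the copy of Anscombe's quartet that ships with seaborn:

import seaborn as sns

# the built-in copy of Anscombe's quartet
anscombe = sns.load_dataset('anscombe')

# near-identical means and variances for all four datasets
print(anscombe.groupby('dataset')[['x', 'y']].agg(['mean', 'var']).round(2))

# near-identical x-y correlations as well
print(anscombe.groupby('dataset')[['x', 'y']].corr().round(3))

# yet the scatter plots look completely different
sns.lmplot(data = anscombe, x = 'x', y = 'y', col = 'dataset', col_wrap = 2)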
You can find more mind-blowing examples (even a dinosaur) with the same descriptive statistics here.
This example clearly shows how outliers can skew your summary statistics and why we need to visualise our data.
Besides catching outliers, visualisations are also a better way to present the results of your research. Graphs are easier to comprehend and can consolidate a considerable amount of information. So, it's an essential domain for analysts to pay attention to.
When we start thinking about a visualisation for our task, we need to define its primary goal and the context for the visualisation.
There are two significant use cases for creating charts: exploratory and explanatory analytics.
Exploratory visualisations are your "private talk" with the data when trying to find insights and understand its internal structure. For such visualisations, you might pay less attention to design and details, e.g., omit titles or skip consistent color schemes across charts, since these visualisations are only for your eyes.
I usually start with a bunch of quick chart prototypes. However, even in this case, you still need to think about the most suitable chart type. The right visualisation can help you find insights, while the wrong one can hide the clues. So, choose wisely.
Explanatory visualisations are intended to convey information to your audience. In this case, you need to focus more on the details and the context to achieve your goal.
When I am working on explanatory visualisations, I usually think about the following questions to define my goal crystal-clearly:
- Who is my audience? What context do they have? What information do I need to explain to them? What are they interested in?
- What do I want to achieve? What concerns might my audience have? What information can I show them to achieve my goal?
- Am I showing the whole picture? Do I need to look at the question from the opposite standpoint to give the audience all the information they need to make an informed decision?
Also, your visualisation decisions might depend on the medium, whether you will make a live presentation or just send it in Slack or via e-mail. Here are a few examples:
- In the case of a live presentation, you can have fewer comments on charts since you can talk through all the needed context, while in an e-mail, it's better to provide all the details.
- A table with many numbers won't work for live presentations since a slide with so much information might distract the audience from your speech. At the same time, it's absolutely fine for written communication, where the audience can go through all the numbers at their own pace.
So, when selecting a chart type, we shouldn't think about visualisations in isolation. We need to consider our primary goal and audience. Please keep it in mind.
How many different types of charts do you know? I bet you can name quite a few of them: line charts, bar charts, Sankey diagrams, heat maps, box plots, bubble charts, etc. But have you ever thought about visualisations more granularly: what are the building blocks, and how are they perceived by your readers?
William S. Cleveland and Robert McGill investigated this question in their article "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods" in the Journal of the American Statistical Association, September 1984. The article focuses on visual perception — the ability to decode information presented in a chart. The authors identified a set of building blocks for visualisations — visual encodings — for example, position, length, area or color saturation. Unsurprisingly, different visual encodings have different levels of difficulty for people to interpret.
The authors formed hypotheses and tested them via experiments on how accurately people can extract information from a graph depending on the elements used. Their goal was to test how valid people's judgements are.
They used previous psychological research and experiments to rank the different visualisation building blocks from the most accurately perceived to the least. Here's the list:
- position — for example, a scatter plot;
- length — for example, a bar chart;
- direction or slope — for example, a line chart;
- angle — for example, a pie chart;
- area — for example, a bubble chart;
- volume — for example, a 3D chart;
- color hue and saturation — for example, a heat map.
I've highlighted only the elements from the article that are most common in day-to-day analytical tasks.
As we discussed earlier, the primary goal of visualisation is to convey information, and we need to focus on our audience and how they perceive the message. So, we are interested in people understanding it correctly. That's why I usually try to use visual encodings from the top of the list since they're easier for people to interpret.
We'll see many chart examples below, so let's quickly discuss the tools I use for them.
There are many options for visualisation:
- Excel or Google Sheets,
- BI tools like Tableau or Superset,
- Libraries in Python or R.
Typically, I prefer the Plotly library for Python because it lets you create nice-looking interactive charts easily. In rare cases, I use Matplotlib or Seaborn. For example, I prefer Matplotlib for histograms (as you will see below) because, by default, it gives me exactly what I want, which is not the case with Plotly.
Now, let's move to practice and discuss the use cases and how to select the best visualisations to address them.
You might often get stuck thinking about what chart to use for your use case since so many of them exist.
There are useful tools, such as the pretty handy Chart Chooser described in the "Storytelling with Data" blog. It can help you get some ideas of what to start with.
Stephen Few proposed another approach I find pretty helpful. He has an article, "Eenie, Meenie, Minie, Moe: Choosing the Right Graph for Your Message". In this article, he identifies seven common use cases for data visualisations and proposes visualisation types to address them.
Here is the list of these use cases:
- Time series
- Nominal comparison
- Deviation
- Ranking
- Part-to-whole
- Frequency distribution
- Correlation
We'll go through all of them and discuss some examples of visualisations for each case. I don't entirely agree with the author's proposals regarding visualisation types, so I'll share my view as well.
The graph examples below are based on synthetic data unless explicitly mentioned otherwise.
Time series
What's the use case? It's the most common use case for visualisation. We very often want to look at changes in one or several metrics over time.
Chart recommendations
The most straightforward option (especially if you have several metrics) is to use a line chart. It highlights the trend and gives the audience a complete overview of the data.
For example, I used a line chart to show how the number of sessions on each platform changes over time. We can see that iOS is the fastest-growing segment, while the others are pretty stagnant.
Using a line plot (not a scatter plot) is crucial here since the line plot emphasises trends via slopes.
You can get such a graph quite effortlessly using Plotly. We have a dataset like this with the monthly number of sessions.
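The source table isn't reproduced here, but as a minimal synthetic stand-in, a ts_df like the one used below could look as follows (the index name month_date, the columns name os and the platform columns are assumptions matching the labels in the next snippet):

import pandas as pd
import numpy as np

# synthetic monthly sessions per platform, in wide format (one column per OS)
months = pd.date_range('2023-01-01', '2023-12-01', freq='MS')
rng = np.random.default_rng(42)

ts_df = pd.DataFrame(
    {
        'Android': 10000 + rng.integers(-500, 500, len(months)),
        'Windows': 8000 + rng.integers(-500, 500, len(months)),
        'iOS': np.linspace(5000, 12000, len(months)).astype(int) # the growing segment
    },
    index = months
)
ts_df.index.name = 'month_date' # used as the x-axis by Plotly Express
ts_df.columns.name = 'os' # used as the legend title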
Then, we can use Plotly Express to create a line chart, passing the data, a title and overriding the labels.
import plotly.express as px

px.line(
    ts_df,
    title = 'Sessions by platforms',
    labels = {'value': 'sessions', 'os': 'platform', 'month_date': 'month'},
    color_discrete_map={
        'Android': px.colors.qualitative.Vivid[1],
        'Windows': px.colors.qualitative.Vivid[2],
        'iOS': px.colors.qualitative.Vivid[4]
    }
)
We won't discuss design in detail and how to tweak it in Plotly here because it's quite a huge topic that deserves a separate article.
We normally put time on the x-axis for line charts and use equal periods between data points.
There's a common misunderstanding that we must make the y-axis zero-based (i.e., it must include 0). However, that's not true for line charts. In some cases, such an approach might even hide the insights in your data.
For example, compare the two charts below. On the first chart, the number of sessions looks pretty stable, while on the second, the drop-off in the middle of December is quite apparent. However, it's the exact same dataset, and only the y-ranges differ.
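If you want to reproduce such a comparison, you can override the y-axis range explicitly in Plotly. Here's a minimal sketch (daily_df with a sessions column is a hypothetical stand-in for the daily data above):

# assuming daily_df is a hypothetical dataframe with a 'sessions' column indexed by day
fig = px.line(daily_df, y = 'sessions', title = 'Sessions')

# zero-based y-axis: the line looks almost flat and the drop is easy to miss
fig.update_yaxes(range = [0, daily_df.sessions.max() * 1.1])
fig.show()

# letting Plotly fit the range to the data (the default) makes the drop apparent
fig.update_yaxes(autorange = True)
fig.show()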
Your options for time series data are not limited to line charts. Sometimes, a bar chart can be a better option, for example, if you have few data points and want to emphasise individual values rather than trends.
Making a bar chart in Plotly is also pretty straightforward.
fig = px.bar(
    df,
    title = 'Sessions',
    labels = {'value': 'sessions', 'os': 'platform', 'month_date': 'month'},
    text_auto = ',.6r' # specifying the format for bar labels
)
fig.update_layout(xaxis_type='category') # to prevent converting strings to dates
fig.update_layout(showlegend = False) # hiding the legend since we don't need it
Nominal comparison
What's the use case? It's the case when you need to compare one or several metrics across segments.
Chart recommendations
If you have only a couple of data points, you can use just numbers in text instead of a chart. I like this approach because it's concise and uncluttered.
In many cases, bar charts are handy for comparing the metrics. Though vertical bar charts are more common, horizontal ones can be a better option when you have long segment names.
For example, we can compare the annual GMV (Gross Merchandise Value) per customer for different regions.
To make a bar chart horizontal, you just need to pass orientation = "h".
fig = px.bar(df,
    text_auto = ',.6r',
    title = 'Average annual GMV (Gross Merchandise Value)',
    labels = {'country': 'region', 'value': 'average GMV in GBP'},
    orientation = 'h'
)
fig.update_layout(showlegend = False)
fig.update_xaxes(visible = False) # to hide the x-axis
Important note: always use zero-based axes for bar charts. Otherwise, you might mislead your audience.
When there are too many numbers for a bar chart, I prefer a heat map. In this case, we use color saturation to encode the numbers, which is not very accurate, so we also keep the labels. For example, let's add another dimension to our average GMV view.
Unsurprisingly, you can create a heat map in Plotly as well.
fig = px.imshow(
    table_df.values,
    x = table_df.columns, # labels for x-axis
    y = table_df.index, # labels for y-axis
    text_auto=',.6r', aspect="auto",
    labels=dict(x="age group", y="region", color="GMV in GBP"),
    color_continuous_scale='pubugn',
    title = 'Average annual GMV (Gross Merchandise Value) in GBP'
)
fig.show()
Deviation
What's the use case? It's the case when we want to highlight the differences between values and a baseline (for example, a benchmark or a forecast).
Chart recommendations
For the case of comparing metrics for different segments against a baseline, the best way to convey this idea is a combination of a bar chart and a baseline line.
We did such a visualisation in one of my previous articles in our research on topic modelling for hotel reviews. I compared the share of customer reviews mentioning a particular topic for each hotel chain against the baseline (the average rate across all the reviews). I've also highlighted segments that are significantly different with color.
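The code for that chart isn't reproduced in this article, but a minimal sketch of the bar-chart-plus-baseline idea in Plotly could look like this (topic_df, its columns and baseline_share are hypothetical names):

# hypothetical data: share of reviews mentioning a topic per hotel chain
fig = px.bar(
    topic_df,
    x = 'hotel_chain',
    y = 'topic_share',
    title = 'Share of reviews mentioning the topic'
)

# dashed horizontal line for the baseline (average share across all reviews)
fig.add_hline(
    y = baseline_share,
    line_dash = 'dash',
    line_color = 'grey',
    annotation_text = 'baseline'
)
fig.show()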
Also, we often have a task to show deviation from a prediction. We can use line plots comparing the dynamics of the forecast and the actual data. I prefer to show the forecast as a dotted line to emphasise that it's not as solid as fact.
This line chart is a bit more complicated than the ones we discussed above. So, instead of Plotly Express, we'll need to use Plotly Graph Objects to customise the chart.
import plotly.graph_objects as go

# creating a figure
fig = go.Figure()

# adding a dashed line trace for the forecast
fig.add_trace(
    go.Scatter(
        mode='lines',
        x=df.index,
        y=df.forecast,
        line=dict(color='#696969', dash='dot', width = 3),
        showlegend=True,
        name = 'forecast'
    )
)

# adding a solid line trace for the actual data
fig.add_trace(
    go.Scatter(
        mode='lines',
        x=df.index,
        y=df.fact,
        marker=dict(size=6, opacity=1, color = 'navy'),
        showlegend=True,
        name = 'fact'
    )
)

# setting the title and the size of the layout
fig.update_layout(
    width = 800,
    height = 400,
    title = 'Daily Active Users: forecast vs fact'
)

# specifying axis labels
fig.update_xaxes(title = 'day')
fig.update_yaxes(title = 'number of users')
Ranking
What's the use case? This task is similar to the nominal comparison. We also want to compare metrics across several segments, but we would like to emphasise the ranking — the order of the segments. For example, it could be the top 3 regions with the highest average annual GMV or the top 3 marketing campaigns with the highest ROI.
Chart recommendations
Unsurprisingly, we can use bar charts similar to the nominal comparison. The only important nuance to consider is ordering the segments in your chart by the metric you're interested in. For example, we can visualise the top 3 regions by annual Gross Merchandise Value.
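Here's a minimal sketch of how the ordering could be done (gmv_df and its column names are made up for illustration):

# sort the data and take the top 3 regions before plotting
top_df = gmv_df.sort_values('avg_gmv', ascending = False).head(3)

fig = px.bar(
    top_df, x = 'region', y = 'avg_gmv',
    title = 'Top 3 regions by average annual GMV'
)
# alternatively, Plotly can order the categories itself:
# fig.update_xaxes(categoryorder = 'total descending')
fig.show()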
Part-to-whole
What's the use case? The goal is to understand how the total splits across subdivisions. You might need to do it for one segment or for several at the same time to compare their structures.
Chart recommendations
The most straightforward solution would be to use a bar chart to show the share of each category or subdivision. It's worth ordering your categories in descending order to make the visualisation easier to interpret.
The above approach works both for one or several segments. However, sometimes it's easier to compare structures using a stacked bar chart. For example, we can look at the share of customers by age in different regions.
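A stacked bar chart like that can be produced with px.bar by mapping the subdivision to colour; here's a minimal sketch (age_df and its column names are assumptions):

# hypothetical long-format data: one row per (region, age_group) with its share of customers
fig = px.bar(
    age_df,
    x = 'region',
    y = 'share', # share of customers in %, so each bar adds up to 100
    color = 'age_group', # the subdivisions are stacked by default
    title = 'Customer age structure by region'
)
fig.show()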
Pie charts are often used in such cases, but I wouldn't recommend it. As we know from visual perception research, comparing angles or areas is harder than comparing lengths. So, bar charts would be preferable.
Also, we might need to look at the structure over time. The best option would be an area chart. It will show you both the split across subdivisions and the trends via slopes (that's why it's a better option than just a bar chart with months as categories).
To create an area chart, you can use the px.area function in Plotly.
px.area(
    df,
    title = 'Customer age in Switzerland',
    labels = {'value': 'share of users, %',
              'age_group': 'customer age', 'month': 'month'},
    color_discrete_sequence=px.colors.diverging.balance
)
Frequency distribution
What's the use case? I usually start with such a visualisation when working with new data. The goal is to understand how a value is distributed:
- Is it normally distributed?
- Is it unimodal?
- Do we have any outliers in our data?
Chart recommendations
The first choice for frequency distributions is a histogram (a vertical bar chart, normally without margins between categories). I typically prefer normalised histograms since they're easier to interpret than absolute values.
If you need to see frequency distributions for several metrics, you can draw several histograms on the same chart. In this case, it's crucial to use normalised histograms. Otherwise, you won't be able to compare the distributions if the number of objects differs between groups.
For example, we can visualise the distributions of annual GMV for customers from the UK and Switzerland.
For this visualisation, I used matplotlib. I prefer matplotlib to Plotly for histograms because I like its default design.
import numpy as np
from matplotlib import pyplot

hist_range = [0, 10000]
hist_bins = 100

pyplot.hist(
    distr_df[distr_df.region == 'United Kingdom'].value.values,
    label = 'United Kingdom',
    alpha = 0.5, range = hist_range, bins = hist_bins,
    color = 'navy',
    # calculating weights to get a normalised histogram
    weights = np.ones_like(distr_df[distr_df.region == 'United Kingdom'].index)*100/distr_df[distr_df.region == 'United Kingdom'].shape[0]
)

pyplot.hist(
    distr_df[distr_df.region == 'Switzerland'].value.values,
    label = 'Switzerland',
    color = 'red',
    alpha = 0.5, range = hist_range, bins = hist_bins,
    weights = np.ones_like(distr_df[distr_df.region == 'Switzerland'].index)*100/distr_df[distr_df.region == 'Switzerland'].shape[0]
)

pyplot.legend(loc = 'upper right')
pyplot.title('Distribution of customers GMV')
pyplot.xlabel('annual GMV in GBP')
pyplot.ylabel('share of users, %')
pyplot.show()
If you need to compare distributions across many categories, reading many histograms on the same graph would be difficult. So, I'd recommend you use box plots. They show less information (only medians, quartiles and outliers) and require some education for the audience. However, in the case of many categories, it could be your only option.
For example, let's look at the distributions of time spent on the site by region.
If you don't remember how to read a box plot, here's a scheme that gives some clues.
So, let's go through all the building blocks of the box plot visualisation (a small numeric sketch follows this list):
- the box on the visualisation shows the IQR (interquartile range) — the 25th and 75th percentiles,
- the line in the middle of the box marks the median (the 50th percentile),
- the whiskers extend to 1.5 * IQR beyond the box, or to the min/max values in the dataset if they are less extreme,
- if you have any values more extreme than 1.5 * IQR beyond the box (outliers), they will be depicted as points on the graph.
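To make these definitions concrete, here's a small numeric sketch (with made-up values) of how the box's elements can be computed with numpy; note that Plotly's default quartile algorithm may differ slightly:

import numpy as np

values = np.array([12, 15, 14, 10, 8, 11, 13, 95]) # 95 is an outlier

q25, median, q75 = np.percentile(values, [25, 50, 75])
iqr = q75 - q25

# whiskers: at most 1.5 * IQR away from the box, clipped to actual data points
lower_whisker = values[values >= q25 - 1.5 * iqr].min()
upper_whisker = values[values <= q75 + 1.5 * iqr].max()

# everything beyond the whiskers is drawn as individual points
outliers = values[(values < lower_whisker) | (values > upper_whisker)]
print(q25, median, q75, lower_whisker, upper_whisker, outliers)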
Here is the code to generate a box plot in Plotly. I used Graph Objects instead of Plotly Express to remove the outliers from the visualisation. It is useful when you have extreme outliers or too many of them in your dataset.
fig = go.Figure()

fig.add_trace(go.Box(
    y=distr_df[distr_df.region == 'United Kingdom'].value,
    name="United Kingdom",
    boxpoints=False, # no data points
    marker_color=px.colors.qualitative.Prism[0],
    line_color=px.colors.qualitative.Prism[0]
))

fig.add_trace(go.Box(
    y=distr_df[distr_df.region == 'Germany'].value,
    name="Germany",
    boxpoints=False, # no data points
    marker_color=px.colors.qualitative.Prism[1],
    line_color=px.colors.qualitative.Prism[1]
))

fig.add_trace(go.Box(
    y=distr_df[distr_df.region == 'France'].value,
    name="France",
    boxpoints=False, # no data points
    marker_color=px.colors.qualitative.Prism[2],
    line_color=px.colors.qualitative.Prism[2]
))

fig.add_trace(go.Box(
    y=distr_df[distr_df.region == 'Switzerland'].value,
    name="Switzerland",
    boxpoints=False, # no data points
    marker_color=px.colors.qualitative.Prism[3],
    line_color=px.colors.qualitative.Prism[3]
))

fig.update_layout(title = 'Time spent on site monthly')
fig.update_yaxes(title = 'time spent in minutes')
fig.update_xaxes(title = 'region')
fig.show()
Correlation
What's the use case? The goal is to understand the relation between two numeric metrics, i.e., whether one value increases with the other or not.
Chart recommendations
A scatter plot is the best way to show a correlation between values. You might also want to add a trend line to highlight the relation between the metrics.
If you have many data points, you might face a problem with a scatter plot: it's impossible to see the data structure with too many points because they overlay one another. In this case, reducing the opacity might help you reveal the relation.
For example, compare the two graphs below. The second gives a better understanding of the data distribution.
We'll use Plotly Graph Objects for this graph since it's pretty custom. To create such a graph, we need to specify two traces — one for the scatter plot and one for the regression line.
import plotly.graph_objects as go

# scatter plot
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        mode='markers',
        x=corr_df.x,
        y=corr_df.y,
        marker=dict(size=6, opacity=0.1, color = 'grey'),
        showlegend=False
    )
)

# regression line
fig.add_trace(
    go.Scatter(
        mode='lines',
        x=linear_corr_df.x,
        y=linear_corr_df.linear_regression,
        line=dict(color='navy', dash='dash', width = 3),
        showlegend=False
    )
)

fig.update_layout(width = 600, height = 400,
    title = 'Correlation between revenue and customer tenure')
fig.update_xaxes(title = 'months since registration')
fig.update_yaxes(title = 'monthly revenue, GBP')
It's essential to put the regression line as the second trace; otherwise, it would be overlaid by the scatter plot.
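If you don't need that level of control, a quicker alternative is a short Plotly Express sketch with a built-in trend line (trendline='ols' requires the statsmodels package to be installed):

# quicker version: scatter with reduced opacity and an OLS trend line
fig = px.scatter(
    corr_df,
    x = 'x', y = 'y',
    opacity = 0.1,
    trendline = 'ols', # needs statsmodels installed
    title = 'Correlation between revenue and customer tenure'
)
fig.show()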
Also, it could be insightful to show the frequency distributions for both variables. It doesn't sound effortless, but you can easily do it using a joint plot from the seaborn library. Here's the code for it.
import seaborn as sns

sns.set_theme(style="darkgrid")
g = sns.jointplot(
    x="x", y="y", data=corr_df,
    kind="reg", truncate=False,
    joint_kws = {'scatter_kws':dict(alpha=0.15), 'line_kws':{'color':'navy'}},
    color="royalblue", height=7
)
g.set_axis_labels('months since registration', 'monthly revenue, GBP')
We've now covered all the use cases for data visualisations.
Are these all the visualisation types I need to know?
I have to confess that from time to time, I face tasks where the above suggestions are not enough, and I need other graphs.
Here are some examples:
- Sankey diagrams or sunburst charts for customer journey maps (see the minimal Sankey sketch after this list),
- Choropleth maps when you need to show geographical data,
- Word clouds to give a very high-level view of texts,
- Sparklines when you need to see trends for multiple lines.
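For example, here's a minimal sketch of a Sankey diagram in Plotly with a made-up three-step journey (all step names and flow values are invented purely for illustration):

import plotly.graph_objects as go

fig = go.Figure(go.Sankey(
    node = dict(label = ['visit', 'sign up', 'purchase', 'churn']),
    link = dict(
        source = [0, 0, 1, 1], # indices into the node labels above
        target = [1, 3, 2, 3],
        value = [60, 40, 35, 25] # made-up flow sizes
    )
))
fig.update_layout(title = 'Simplified customer journey (synthetic data)')
fig.show()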
For inspiration, I usually use the galleries of popular visualisation libraries, for example, Plotly or seaborn.
Also, you can always ask ChatGPT about possible options to present your data. It provides quite reasonable guidance.
In this article, we've discussed the fundamentals of data visualisation:
- Why do we need to visualise data?
- What questions should you ask yourself before you start working on a visualisation?
- What are the basic building blocks, and which of them are the easiest for the audience to perceive?
- What are the common use cases for data visualisation, and what chart types can you use to address them?
I hope the provided framework helps you avoid getting stuck among the many options and create better visualisations for your audience.
Thanks a lot for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.