By: Jan de Leeuw
Data visualization is trending in data science and can help a company thrive. It can convey clear messages to stakeholders who are less familiar with the data, such as a company’s board. It can lead to valuable insights that help improve customer satisfaction, increase profits and improve processes. However, misinterpreting data can lead to bad decisions. This article discusses some of the most common pitfalls in data visualization. Avoiding them helps to convey the right message clearly.
The first and foremost pitfall in data visualization is the use of bad data. The expression ‘Garbage in, Garbage out’ surely applies to data visualization. When incorrect data is used to visualize demand, a manager can make bad decisions. A mistake as simple as a misplaced decimal point can lead to false conclusions. Did you know that Popeye allegedly grows strong from spinach only because of a misplaced decimal point? According to the story, the error attributed ten times more iron to spinach than it actually contains, and it boosted the sales of spinach.
To ensure that correct and consistent data is available, its quality should be monitored continuously. Setting up clear rules for consistent notation can prevent big mistakes. A single mix-up in date notation, dd/mm/yyyy versus mm/dd/yyyy, can lead to misleading results. One way to monitor the data is to perform consistency checks. If a company’s January sales show a growth of 35% whereas the other months show a growth of 3% to 4%, there might be a mistake in the January data.
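The growth check described above can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive method: the function name `flag_suspicious_growth`, the tolerance of 10 percentage points and the sample figures are all assumptions made for illustration.

```python
from statistics import median

def flag_suspicious_growth(growth_by_month, tolerance=10.0):
    """Return the months whose growth (in %) deviates from the median
    growth by more than `tolerance` percentage points.

    The tolerance is an assumed threshold; tune it to your own data.
    """
    typical = median(growth_by_month.values())
    return [
        month
        for month, growth in growth_by_month.items()
        if abs(growth - typical) > tolerance
    ]

# Illustrative figures only, loosely following the example in the text.
growth = {"Jan": 35.0, "Feb": 3.2, "Mar": 3.8, "Apr": 3.5, "May": 4.0}
suspicious = flag_suspicious_growth(growth)
print(suspicious)
```

A flagged month is not necessarily wrong. It is a prompt to inspect the underlying records before they end up in a chart.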
A second pitfall is using the wrong type of data visualization. The correct type of visualization depends strongly on the question it should answer. The graphs that follow illustrate this with the sales of books and magazines over time.
Stacked Bar Chart: What were the total yearly sales of books and magazines in the last fifteen years? The total sales in a given year are easy to read from this chart, whereas the sales of magazines alone are hard to read.
Bar Chart: How did the sales of books compare to the sales of magazines in the last fifteen years? This visualization makes it easy to compare the results for books and magazines in different years. Reading a total takes some extra work.
Line Graph: How did sales of books and magazines develop over time? The line graph makes the development of each category over time easy to follow. Note that the total amount of book and magazine sales is harder to determine using this graph.
To avoid this pitfall, keep in mind why the visualization is created and who its audience is. Does the visualization serve its purpose?
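To make the comparison concrete, the three chart types can be drawn from one and the same dataset. The sketch below assumes matplotlib is available for the plotting part; the sales figures and the function names `yearly_totals` and `plot_three_views` are made up for illustration and are not the data behind the graphs in this article.

```python
# Made-up yearly sales figures (in thousands of units), for illustration only.
years = [2019, 2020, 2021, 2022]
books = [120, 135, 128, 150]
magazines = [80, 70, 65, 60]

def yearly_totals(first, second):
    """Total per year: the quantity a stacked bar chart makes easy to read."""
    return [a + b for a, b in zip(first, second)]

def plot_three_views(years, books, magazines):
    """Draw the same data as a stacked bar, grouped bar and line chart."""
    # Imported here so the helper above works without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, (stacked, grouped, line) = plt.subplots(1, 3, figsize=(15, 4))

    # Stacked bars: totals are easy to read, the magazine share is not.
    stacked.bar(years, books, label="Books")
    stacked.bar(years, magazines, bottom=books, label="Magazines")
    stacked.set_title("Total sales per year")

    # Grouped bars: books and magazines sit side by side for each year.
    width = 0.4
    grouped.bar([y - width / 2 for y in years], books, width, label="Books")
    grouped.bar([y + width / 2 for y in years], magazines, width, label="Magazines")
    grouped.set_title("Books versus magazines")

    # Lines: the development over time stands out.
    line.plot(years, books, marker="o", label="Books")
    line.plot(years, magazines, marker="o", label="Magazines")
    line.set_title("Sales over time")

    for ax in (stacked, grouped, line):
        ax.set_xticks(years)
        ax.legend()
    return fig

print(yearly_totals(books, magazines))
```

Calling `plot_three_views(years, books, magazines)` returns a figure that can be saved with `fig.savefig(...)`. Placing the three panels next to each other shows quickly which one answers the question at hand.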
A third pitfall is the misuse of colors. Colors are among the hardest visual elements to interpret, and they are easily misinterpreted. For example, red is often associated with something negative. Linking the color red to data that is merely less good than an alternative, but not bad in itself, can cause misinterpretation.
The tables below show an example of sales in units. Suppose 80 units need to be sold on any given day to make a profit. With the red-yellow-green color scale, a rather black-and-white thought comes to mind: ‘We should consider closing on Monday, because sales are bad on Monday’. However, using different intensities of a single color to depict the amount of sales draws more attention to the best sales. A thought here can be: ‘Monday is not as good for sales as Thursday, can we do something to boost sales on Monday?’ Color can evoke emotions. So, carefully choose the color palette that fits the data and the message it should give. Think about where the visual will be displayed, since the perception of colors can differ between print and screen.
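A single-color scale like the one described above can be produced by interpolating between a light and a dark shade. The following is a minimal sketch under stated assumptions: the function name `single_hue_shade`, the two blue end points and the daily sales figures are hypothetical and do not come from the tables.

```python
def single_hue_shade(value, low, high):
    """Map a value to a shade of blue as a hex color string.

    Low values give a light blue, high values a dark blue, so the eye is
    drawn to the best sales without branding the weaker days as 'bad'.
    """
    if high == low:
        fraction = 1.0
    else:
        # Clamp so values outside the range still yield a valid color.
        fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    # Interpolate each RGB channel between a light and a dark blue.
    light, dark = (222, 235, 247), (8, 48, 107)
    channels = [round(l + (d - l) * fraction) for l, d in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

# Hypothetical daily sales in units; not the figures from the tables.
sales = {"Mon": 82, "Tue": 95, "Wed": 101, "Thu": 130, "Fri": 118}
low, high = min(sales.values()), max(sales.values())
shades = {day: single_hue_shade(units, low, high) for day, units in sales.items()}
print(shades)
```

Because every day stays within one hue, a profitable but weaker Monday reads as "lighter" rather than "red".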
Adding too much data to a single visualization is a pitfall as well. A lot of information in a single visualization can be distracting and can confuse the audience. Keep it simple and remove all elements that do not help to convey the message. For example, suppose someone kept track of the total fruit consumption of the last ten years. The left bar chart shows all the available data. Presenting all the data like this is distracting, so the unnecessary data should be filtered out. It would be difficult to compare the consumption of apples, bananas and cranberries over the last few years with the left chart. If that is the objective, simplifying the chart and showing only the fruits of interest conveys a much clearer message.
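Filtering the data before plotting is often the simplest way to declutter a chart. The sketch below shows one way to do it; the function name `select_series`, the data layout and the consumption figures are assumptions made for illustration.

```python
def select_series(consumption, fruits, last_n_years):
    """Keep only the chosen fruits and the most recent years.

    `consumption` maps a fruit name to a {year: amount} dict.
    """
    selected = {}
    for fruit in fruits:
        by_year = consumption[fruit]
        recent_years = sorted(by_year)[-last_n_years:]
        selected[fruit] = {year: by_year[year] for year in recent_years}
    return selected

# Made-up consumption figures (kg per year), for illustration only.
consumption = {
    "apples": {2020: 14, 2021: 15, 2022: 13, 2023: 16},
    "bananas": {2020: 11, 2021: 12, 2022: 12, 2023: 10},
    "cranberries": {2020: 2, 2021: 3, 2022: 3, 2023: 4},
    "dates": {2020: 1, 2021: 1, 2022: 2, 2023: 1},
    "figs": {2020: 1, 2021: 2, 2022: 1, 2023: 1},
}
focus = select_series(consumption, ["apples", "bananas", "cranberries"], last_n_years=3)
print(focus["apples"])
```

Only the filtered result is passed on to the chart, so the audience sees the three fruits of interest and nothing else.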
We strongly believe that data visualization can add value to a company, but only if it is done with great care. To avoid these pitfalls, the most important questions data scientists should ask themselves before creating a visualization are:
Which audience do I target and what do I need them to understand? Whereas a scatterplot can easily be misinterpreted by a less quantitatively oriented audience, the slope of a line showing a trend is easier to interpret.
What is the purpose of the visualization? Do the graphs provide insightful information?
Does the visualization represent reality? Are there mistakes in the data? Do we have all relevant information?
To conclude, visualization should never be about making fancy charts or expressing the scientist’s creative capabilities. It should be about helping your audience reach their goals. Avoiding the pitfalls should enable the data scientist to create meaningful charts that convey the correct message clearly.