Plotting Multiple Variables at Once

Updated Sep 22, 2021

Overview

To visualize many variables at once, you can use plot types designed for the purpose: pair plots, correlation heatmaps, and parallel coordinates plots.

Using Pair Plots

Pair plots are effective for visualizing up to ten variables at once. They display the distribution of each variable and the relationships between each pair.

The pair plot above shows LA home prices using four variables, arranged in a grid of panels. Each panel highlights a different aspect of the data.

Diagonal Panels

Panels along the diagonal show the distribution of variables. For categorical variables, like city, the distribution is displayed as a bar plot.

Continuous variables, such as number of beds, price, and area, are shown with histograms.

Off-Diagonal Panels

Panels off the diagonal reveal the relationships between pairs of variables. Continuous variable pairs are represented by scatter plots, showing their correlation. For instance, a scatter plot might show the relationship between the number of beds and area, with a correlation value indicated.

Categorical and Continuous Variables

When one variable is categorical and the other is continuous, the corresponding panels show a box plot and a histogram. For example, the box plot shows price by city, and the histogram shows the same price data split by city.
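The panel rules described above can be summarized in a few lines of Python. This is only a sketch of the layout logic, not the code behind the figure: the column names and the panel_type helper are hypothetical, and the case of two categorical variables is left open because the text does not cover it.

```python
# Hypothetical columns, loosely modelled on the LA home prices example.
COLUMN_KINDS = {
    "city": "categorical",
    "beds": "continuous",
    "price": "continuous",
    "area": "continuous",
}


def panel_type(row_var, col_var, kinds):
    """Return the kind of panel drawn for one cell of the pair plot grid."""
    row_kind, col_kind = kinds[row_var], kinds[col_var]
    if row_var == col_var:
        # Diagonal panels show the distribution of a single variable.
        return "bar plot" if row_kind == "categorical" else "histogram"
    if row_kind == col_kind == "continuous":
        return "scatter plot"
    if {row_kind, col_kind} == {"categorical", "continuous"}:
        return "box plot / histogram"
    # Two categorical variables: not described in the text, so left open.
    return "not described"


def pair_plot_layout(kinds):
    """Build the full grid of panel types, one row per variable."""
    names = list(kinds)
    return [[panel_type(r, c, kinds) for c in names] for r in names]


layout = pair_plot_layout(COLUMN_KINDS)
for name, row in zip(COLUMN_KINDS, layout):
    print(f"{name:>5}: {row}")
```

To draw a comparable grid in Python, seaborn's pairplot function handles the numeric columns and accepts a categorical column through its hue argument. It does not produce the box plot panels described here.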

Correlation Heatmaps

Pair plots are useful for quickly exploring datasets. For datasets with many continuous variables, a correlation heatmap, a variant of the pair plot, offers a more compact and scalable alternative.

Correlation heatmaps display the relationships between continuous variables using color rather than numbers. They are useful for comparing a large number of variables.
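As a rough illustration of what a correlation heatmap encodes, the sketch below computes a Pearson correlation matrix and then maps each value to a color. The ratings data is made up for the example, the helper names are hypothetical, and the drawing step assumes matplotlib is installed.

```python
from math import sqrt

# Made-up survey ratings for illustration (not the dataset discussed here).
ratings = {
    "price": [1, 2, 3, 4, 5],
    "value": [2, 4, 6, 8, 10],   # rises with price
    "design": [5, 4, 3, 2, 1],   # falls as price rises
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


def correlation_matrix(columns):
    """Return (names, matrix); matrix[i][j] correlates column i with column j."""
    names = list(columns)
    matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix


def draw_heatmap(names, matrix):
    """Show the matrix as colored cells (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt  # local import: the numeric part runs without it

    ticks = list(range(len(names)))
    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(ticks)
    ax.set_xticklabels(names, rotation=45)
    ax.set_yticks(ticks)
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    plt.show()


names, matrix = correlation_matrix(ratings)
```

Calling draw_heatmap(names, matrix) renders the grid. Fixing the color scale to the range -1 to 1 keeps colors comparable between heatmaps of different datasets.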

Heatmap Example

In a customer satisfaction survey dataset, the correlation heatmap reveals strong relationships between various product features. Bright red areas indicate high correlation among price-related aspects.

Reference: Rossi, Allenby, and McCulloch (2005). Bayesian Statistics & Marketing

Parallel Coordinates Plots

For datasets with many continuous variables, parallel coordinates plots are effective for understanding relationships or grouping data into clusters.

The parallel coordinates plot above shows multiple variables, including the human development index score, with each line representing a country. Variables are arranged along the x-axis, and the y-axis for each variable runs from its lowest to its highest value.

Splitting the data by continent reveals distinct patterns. For example, South American countries show consistent metrics, while European countries have high values across all metrics. African countries display low GNI (gross national income) with a broad range in other metrics. This plot allows for easy comparison of many variables simultaneously.
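The key step in a parallel coordinates plot is rescaling every variable so that its axis runs from that variable's lowest to highest value. The sketch below shows one way to do this. The country figures are invented, the helper names are hypothetical, and the drawing step assumes matplotlib is installed.

```python
# Invented values for three hypothetical countries (not real statistics).
table = {
    "hdi": [0.5, 0.7, 0.9],
    "gni": [2000, 11000, 20000],
    "life_expectancy": [60, 75, 80],
}


def min_max_scale(values):
    """Rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; place it mid-axis.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_polylines(columns):
    """Return one list of scaled values per row, ordered by column."""
    scaled = {name: min_max_scale(col) for name, col in columns.items()}
    n_rows = len(next(iter(scaled.values())))
    return [[scaled[name][i] for name in scaled] for i in range(n_rows)]


def draw_parallel_coordinates(axis_names, lines):
    """Draw one line per row (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt  # local import: scaling runs without it

    positions = list(range(len(axis_names)))
    fig, ax = plt.subplots()
    for line in lines:
        ax.plot(positions, line)
    ax.set_xticks(positions)
    ax.set_xticklabels(axis_names)
    plt.show()


lines = to_polylines(table)
```

Calling draw_parallel_coordinates(list(table), lines) draws the plot. For data already in a pandas DataFrame, pandas.plotting.parallel_coordinates draws a similar chart colored by a class column, but it does not rescale the columns for you.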

Case Study: Fatty Acid Levels in Olive Oils

This dataset contains fatty acid levels from olive oil samples sourced from six different regions in Italy. Each line in the plot corresponds to a single oil sample. Because the region is a categorical variable, there are six parallel coordinates plots, with one plot for each region.

Reference: Graphics of Large Datasets

Observations:

  • Sardinian oils have low eicosenoic acid and high linoleic acid levels.
  • Calabrian oils show a wide range of eicosenoic and stearic acid levels.
  • Northern oils have high oleic acid levels.
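The one-plot-per-region layout used in this case study amounts to grouping the samples by the categorical variable and drawing each group in its own panel. A minimal sketch follows, assuming the values have already been scaled to the range 0 to 1 for each fatty acid. The region labels and numbers are placeholders, not values from the olive oil dataset, and the drawing step assumes matplotlib is installed.

```python
from collections import defaultdict

ACIDS = ["oleic", "linoleic", "stearic", "eicosenoic"]

# Placeholder samples: (region label, scaled value per fatty acid).
samples = [
    ("region_a", [0.9, 0.2, 0.4, 0.1]),
    ("region_a", [0.8, 0.3, 0.5, 0.2]),
    ("region_b", [0.3, 0.9, 0.6, 0.0]),
    ("region_c", [0.5, 0.5, 0.1, 0.9]),
    ("region_c", [0.4, 0.6, 0.9, 0.3]),
]


def group_by_region(rows):
    """Collect the value lists of all samples that share a region."""
    groups = defaultdict(list)
    for region, values in rows:
        groups[region].append(values)
    return dict(groups)


def draw_small_multiples(groups, axis_names):
    """Draw one parallel coordinates panel per region (assumes matplotlib)."""
    import matplotlib.pyplot as plt  # local import: grouping runs without it

    positions = list(range(len(axis_names)))
    fig, axes = plt.subplots(1, len(groups), sharey=True, squeeze=False)
    for ax, (region, lines) in zip(axes[0], groups.items()):
        for line in lines:
            ax.plot(positions, line, alpha=0.6)
        ax.set_title(region)
        ax.set_xticks(positions)
        ax.set_xticklabels(axis_names, rotation=90)
    plt.show()


groups = group_by_region(samples)
```

Calling draw_small_multiples(groups, ACIDS) draws the panels side by side. Sharing the y-axis across panels keeps the regions directly comparable.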