Getting Insights

Updated Sep 22, 2021

Overview

There are three primary methods for deriving insights from a dataset. Each has its own advantages, and they are often used together:

- Calculate summary statistics like mean, median, and standard deviation
- Run statistical models such as linear and logistic regression
- Visualize data with plots like scatter plots or histograms

The Datasaurus Dozen

The Datasaurus Dozen is a collection of 13 datasets, each with x and y coordinates. These datasets illustrate the importance of visualizing data.

- The term "variable" refers to a column of data
- Examples include datasets named "away" and "bullseye"

If you calculate the mean of the x values in each dataset, the results are nearly identical across all 13 datasets. The same is true of the y values.

- Standard deviation measures how spread out the values are
- The standard deviations of the x values and of the y values are also nearly the same in every dataset
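
As a minimal sketch of this effect, the Python snippet below computes the mean and sample standard deviation of x and y for two tiny datasets. The points and the names `rising` and `falling` are made up for illustration and are not part of the real Datasaurus Dozen; only the standard library's `statistics` module is used.

```python
from statistics import mean, stdev

# Made-up (x, y) points for illustration only; NOT the real Datasaurus data.
datasets = {
    "rising": [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)],
    "falling": [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)],
}


def summarize(points):
    """Return the mean and sample standard deviation of x and y."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "sd_x": stdev(xs),
        "sd_y": stdev(ys),
    }


summaries = {name: summarize(points) for name, points in datasets.items()}
for name, stats in summaries.items():
    print(name, stats)
```

Both toy datasets produce the same four numbers, even though one set of points slopes upward and the other slopes downward.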

Visualizing the datasets reveals differences that summary statistics miss. A scatter plot of each dataset shows a distinct pattern: the shapes include lines, circles, a star, and a dinosaur.
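
To illustrate how a plot can separate datasets that share the same summary statistics, here is a rough sketch that draws two made-up point sets on a character grid. The data, the names `rising` and `falling`, and the helper `text_scatter` are all hypothetical. The text grid is only a stand-in for a real scatter plot, which would normally come from a plotting library.

```python
# Made-up points for illustration only; NOT the real Datasaurus data.
# Both sets have the same mean and standard deviation for x and for y.
rising = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
falling = [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]


def text_scatter(points, size=5):
    """Draw integer (x, y) points with 1 <= x, y <= size on a character grid."""
    rows = []
    for y in range(size, 0, -1):  # the top row holds the largest y value
        row = "".join("*" if (x, y) in points else "." for x in range(1, size + 1))
        rows.append(row)
    return "\n".join(rows)


print(text_scatter(rising))
print()
print(text_scatter(falling))
```

The first grid shows a line climbing from bottom left to top right, and the second shows the mirror image, a difference that the means and standard deviations cannot show.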

Continuous and Categorical Variables

Different types of data require different types of plots. Choosing the right plot depends on whether your variables are continuous or categorical.

Continuous variables
- Numbers like heights, temperatures, and revenues
- Arithmetic operations can be performed on them
Categorical variables
- Text-based classifications like eye color, country, and industry

Some variables can be treated as either continuous or categorical, depending on the context.

- Age: A number (continuous) but often grouped in surveys (categorical)
- Time: Naturally continuous but can be categorized (e.g., months in a report)

The choice depends on the specific question you’re trying to answer.
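
The age example can be sketched in Python as follows. The ages, the group boundaries, and the helper name `age_group` are assumptions chosen for illustration. Treated as continuous, the ages support arithmetic such as an average. Treated as categorical, they are grouped and counted.

```python
from collections import Counter
from statistics import mean

# Made-up ages for illustration only.
ages = [19, 23, 31, 38, 45, 52, 72]


def age_group(age):
    """Map a numeric age to a survey-style category (boundaries are arbitrary)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"


# Continuous view: arithmetic on the raw numbers.
average_age = mean(ages)

# Categorical view: count how many people fall into each group.
group_counts = Counter(age_group(age) for age in ages)

print(average_age)
print(dict(group_counts))
```

An average answers a question like "how old is the typical respondent?", while the group counts answer "how many respondents are in each age bracket?".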
