Exploratory Data Analysis

Updated May 12, 2023 ·

Overview

Exploratory Data Analysis (EDA) is the process of analyzing data to identify patterns, trends, and potential issues. It provides insights to guide further steps in a project, such as building a machine learning model.

Steps in the EDA Process

EDA begins by exploring the dataset to get a better understanding of its structure and content.

Examine and analyze the dataset
Identify missing or unusual values
Visualize the dataset
Characterize/classify the dataset

Inspecting Data

To better understand the data, start by looking at its structure. You can quickly access key information like data types and missing values.

head() displays the first few rows of data
info() gives a summary, including data types and non-null counts

Example:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())  # Shows the first 5 rows
print(df.info())  # Shows summary info about the DataFrame

Checking Class Imbalance

Class imbalance refers to situations where one category of data is underrepresented. This can affect model performance.

value_counts() - to inspect distribution of different classes
Calculate the proportion of each class to identify imbalances

Example:

# Print class imbalance
print(df['sex'].value_counts(normalize=True))

Sample outputL

sex
1    0.691
0    0.309
Name: proportion, dtype: float64 

Handling Missing Values

Missing data can introduce bias into a model, so it's important to identify and handle it properly.

Use isnull() to detect missing values
Decide how to handle missing data: remove, impute, or leave as is

Example:

print(df['age'].isnull().sum())  # Counts missing values in 'age' column

Identifying Outliers

Outliers are data points that are far outside the typical range. These can affect model accuracy and need to be addressed.

Rare events, caused by measurement or data entry errors
Detect outliers using box plots or the interquartile range (IQR)
Remove outliers or keep them based on context

Outliers can skew the model's performance by focusing on extreme values that don't reflect the overall trend. Sometimes, though, they can be useful if they represent rare but important cases.

Visualizing Data

Visualizations help reveal trends, patterns, and potential issues in the data. They are essential for gaining deeper insights.

Visualizations include histograms or scatter plots
Identify trends, distributions, and relationships between variables

To create basic visualizations, we can use plot() :

# Creates a histogram for the 'age' column
df['age'].plot(kind='hist', bins=10) 

To set the graph title and axis labels:

plt.title('This is the graph title')
plt.xlabel('Years')                     ## X-axis
plt.ylabel('Frequency')                 ## Y-axis
plt.show

Goals of EDA

The purpose of EDA is to explore the dataset, uncover patterns, and prepare for further analysis.

Understand data characteristics and trends
Detect outliers or unexpected values
Form hypotheses and check assumptions
Inform decisions on feature selection and modeling

Overview​

Steps in the EDA Process​

Inspecting Data​

Checking Class Imbalance​

Handling Missing Values​

Identifying Outliers​

Visualizing Data​

Goals of EDA​

Overview

Steps in the EDA Process

Inspecting Data

Checking Class Imbalance

Handling Missing Values

Identifying Outliers

Visualizing Data

Goals of EDA