Procedure to do Exploratory Data Analysis…

Praveen Varukolu
2 min read · Jun 18, 2021

EDA: Exploratory Data Analysis

“Data is the new oil” is a common saying nowadays in the areas of marketing, medical science, economics, finance, any research field, and the IT industry.

This new oil, data, is a collection of numbers, words, events, facts, measurements, and observations. Processing the data gives us information, and that information leads to useful knowledge. To get there, we need to analyze the data.

Exploratory Data Analysis: It is a process of examining or understanding the data and extracting insights from the data.

EDA is essential because it is good practice to first understand the problem statement and the relationships between the data features before getting your hands dirty with modeling.

1) First, check whether the dataset poses a supervised problem (classification or regression, where a target variable is given) or an unsupervised problem (such as clustering, where there is no target).

2) Next, observe the variables and their data types, i.e., how many are quantitative and how many are qualitative.
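One possible way to do this split with pandas is sketched below. The DataFrame and its column names are made up for illustration, and treating every numeric dtype as quantitative is an assumption (a numeric ID or a 0/1 flag would need to be reclassified by hand).

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30000.0, 42000.0, 58000.0, 61000.0],
    "gender": ["male", "female", "female", "male"],
    "education": ["school", "graduate", "graduate", "postgraduate"],
})

# Assumption: numeric dtypes are quantitative, everything else is qualitative.
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()

print(df.dtypes)
print("quantitative:", quantitative)
print("qualitative:", qualitative)
```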

3) Plot histograms and boxplots for quantitative variables, bar plots for qualitative variables, and make some observations.
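The numbers behind these plots can be computed directly, which is sometimes enough for a first look. The sketch below uses made-up values: it bins a quantitative variable (what a histogram draws), computes its quartiles (what a boxplot draws), and counts the categories of a qualitative variable (what a bar plot draws).

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
income = pd.Series([30, 42, 58, 61, 45, 39, 120], name="income")
gender = pd.Series(
    ["male", "female", "female", "male", "female", "male", "male"], name="gender"
)

# Histogram: how many values fall into each of 3 equal-width bins.
counts, bin_edges = np.histogram(income, bins=3)

# Boxplot: the quartiles and the interquartile range (IQR).
q1, median, q3 = income.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Bar plot: the frequency of each category.
category_counts = gender.value_counts()

print("bin edges:", bin_edges, "counts:", counts)
print("Q1, median, Q3, IQR:", q1, median, q3, iqr)
print(category_counts)
```

If matplotlib is installed, income.plot(kind="hist"), income.plot(kind="box"), and category_counts.plot(kind="bar") draw the corresponding charts.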

4) Observe how each variable relates to the target variable.

5) For each variable, observation means checking the following:

(a) Null values:

(i) Are null values present or not

(ii) If present, what percentage of the values are null

(iii) Impute the null values using the mean, median, or mode (mean or median for quantitative variables, mode for qualitative ones).
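A minimal sketch of these null-value checks with pandas follows. The data is invented, and the choice of median for the numeric column and mode for the text column is just one reasonable option, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "city": ["Pune", "Mumbai", None, "Pune"],
})

# (i) and (ii): percentage of null values in each column.
null_percent = df.isna().mean() * 100
print(null_percent)

# (iii): impute with the median (quantitative) and the mode (qualitative).
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```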

(b) Outliers:

(i) Do outliers exist or not

(ii) If they exist, are they influential or not

(iii) Apply a Z-transformation: values with a large absolute z-score (a common rule of thumb is above 3) are extreme and worth checking for influence
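As a rough illustration, the z-score check can be written in a few lines. The values and the cut-off of 3 are assumptions made for this sketch. A large z-score only flags an extreme value; whether it actually influences a model still has to be judged separately.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 12, 95])

# Z-transformation: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()

# Assumed rule of thumb: flag values more than 3 standard deviations away.
threshold = 3
outliers = values[z_scores.abs() > threshold]

print(outliers)
```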

(c) Transformation:

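(i) Is the data skewed or not

(ii) Do the variables have different units or scales

(iii) If so, it is good to apply transformation techniques such as a log transform (for skewed data), min-max scaling, or z-score standardization (for variables on different scales).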
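The three transformations mentioned here can be sketched as follows. The income values are invented, and log1p (the log of 1 + x) is used instead of a plain log only so that zeros would not break it; that choice is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable.
income = pd.Series([1000.0, 2000.0, 4000.0, 8000.0, 16000.0, 32000.0, 64000.0])

# Log transform: compresses large values and reduces right skew.
log_income = np.log1p(income)

# Min-max scaling: rescales the values to the range [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: mean 0 and standard deviation 1.
z = (income - income.mean()) / income.std()

print("skew before log:", income.skew())
print("skew after log:", log_income.skew())
```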

6) Sometimes it is difficult to analyze every variable against the target variable. In that case, use common sense (domain knowledge) to decide which variables are likely to affect your target variable the most.

7) There may be a dependence or relationship between one variable and another.

For example: as education increases, income increases.

In such cases, analyze these variables together against the target variable.

8) Compute the correlations between variables. If two variables are strongly correlated, consider removing one of them.
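One simple way to act on this is sketched below: compute the correlation matrix and drop the second variable of every strongly correlated pair. The data is made up, and the 0.9 cut-off is an assumption, not a standard value.

```python
import pandas as pd

# Hypothetical dataset: the two height columns carry the same information.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 45, 28, 39],
})

# Absolute Pearson correlation between every pair of columns.
corr = df.corr().abs()

threshold = 0.9  # assumed cut-off for "strongly correlated"
to_drop = []
cols = list(corr.columns)
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if first in to_drop or second in to_drop:
            continue
        # Keep the first column of a correlated pair, drop the second.
        if corr.loc[first, second] > threshold:
            to_drop.append(second)

reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```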

9) If many variables are correlated, we can also apply PCA (principal component analysis) to reduce the number of variables (dimensions).
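To show the idea, here is a small PCA sketch written with NumPy's singular value decomposition rather than a dedicated library. The data is synthetic (three features built from one underlying signal), and keeping a single component is a choice made for this example.

```python
import numpy as np

# Synthetic data: three strongly correlated features from one signal.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=100),
    -x + rng.normal(scale=0.1, size=100),
])

# Standardize each feature so that units do not dominate the result.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# PCA via SVD: the rows of `components` are the principal directions.
_, singular_values, components = np.linalg.svd(standardized, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Project onto the first principal component: 3 dimensions become 1.
n_components = 1
reduced = standardized @ components[:n_components].T

print("explained variance ratio:", explained_variance_ratio)
print("reduced shape:", reduced.shape)
```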

10) Next, convert all qualitative variables into quantitative variables. This is commonly called dummy encoding (or one-hot encoding).

For example, if the gender column in a dataset has the values male and female, we can replace them with male=1 and female=0.
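A short pandas sketch of this conversion is given below. The gender mapping follows the example above; the city column and the other names are hypothetical and only show how a variable with more than two categories could be handled.

```python
import pandas as pd

# Hypothetical dataset with two qualitative columns.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "city": ["Pune", "Mumbai", "Delhi", "Pune"],
})

# Two categories: map them directly to 1 and 0.
df["gender_encoded"] = df["gender"].map({"male": 1, "female": 0})

# More than two categories: one 0/1 column per category.
city_dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

encoded = pd.concat([df[["gender_encoded"]], city_dummies], axis=1)
print(encoded)
```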

The steps above are the basic EDA steps. Depending on the problem, we will do some more analysis.

Written by Praveen Varukolu

Data Scientist, AI & ML Enthusiast. Working at Poonawalla Fincorp, Pune.