Exploratory Data Analysis, as the name suggests, is analysis done to explore the data. It consists of several components; not all of them are essential every time, nor do they all carry equal importance. Below, I'm listing a number of components based on my experience. *Please note that this is in no way an exhaustive list, but a guiding framework.*

## 1. Understand the lay of the land.

*You don't know what you don't know, but you can explore!*

The first thing to do is to get a feel for the data: look at the data entries, eyeball the column values, and note how many rows and columns you have.

- A retail dataset might tell you:
*Mr X visited store #2000 on the 1st of Aug 2023 and bought a can of Coke and a pack of Walkers crisps.*
- A social media dataset might tell you:
*Mrs Y logged onto the social networking site at 09:00 am on the 3rd of June, browsed sections A, B, and C, searched for her friend Mr A, and then logged out after 20 mins.*

It's useful to get the business context of the data you have by knowing the source and mechanism of data collection *(e.g. survey data vs. digitally collected data)*.
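The first-look checks above can be sketched in a few lines of pandas. The tiny retail dataset here is hypothetical, invented purely to have something to inspect:

```python
import pandas as pd

# A tiny hypothetical retail dataset, just to illustrate the first-look checks.
df = pd.DataFrame({
    "customer": ["Mr X", "Mrs Y", "Mr Z"],
    "store_id": [2000, 2000, 1047],
    "item": ["Coke", "Walkers Crisps", "Coke"],
    "spend": [1.20, 0.85, 1.20],
})

print(df.shape)   # how many rows and columns you have
print(df.head())  # eyeball the first few data entries
print(df.dtypes)  # what kind of values each column holds
```

Three lines like these are usually enough to get the texture of a new dataset before any deeper analysis.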

## 2. Double-click into variables

Variables are the language of a dataset; they are constantly talking to you. You just need to ask the right questions and listen carefully.

**→ Questions to ask:**

– What do the variables mean/represent?

– Are the variables continuous or categorical? Do they have any inherent order?

– What are the possible values they can take?

**→ ACTION:**

- For continuous variables: check distributions using histograms and box plots, and carefully study the mean, median, standard deviation, etc.
- For categorical/ordinal variables: find their unique values, and build a frequency table to check the most and least common ones.
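In pandas, both actions come down to two one-liners. The `spend`/`channel` data below is hypothetical, chosen to show one continuous and one categorical variable:

```python
import pandas as pd

# Hypothetical data: one continuous and one categorical variable.
df = pd.DataFrame({
    "spend": [1.2, 3.4, 2.2, 150.0, 2.8, 3.1],
    "channel": ["store", "online", "store", "store", "online", "store"],
})

# Continuous: distribution summary (mean, median, spread, extremes).
print(df["spend"].describe())

# Categorical: unique values and a frequency table.
print(df["channel"].unique())
print(df["channel"].value_counts())
```

Note how `describe()` already hints at trouble here: a max of 150 against a median near 3 is exactly the kind of thing Section 4 digs into.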

You may or may not understand all variables, labels, and values, but try to gather as much information as you can.

## 3. Look for patterns/relationships in your data

Through EDA, you can discover patterns, trends, and relationships within the data.

**→ Questions to ask:**

*– Do you have any prior assumptions/hypotheses about relationships between variables?*

– Is there any business reason for some variables to be related to one another?

– Do variables follow any particular distributions?


Data visualisation techniques, summaries, and correlation analysis help reveal hidden patterns that may not be apparent at first glance. Understanding these patterns can provide valuable insights for decision-making or hypothesis generation.

**→ ACTION:** Think visual bivariate analysis.

- For two continuous variables: use scatter plots, and create correlation matrices / heat maps.
- For a mix of continuous and ordinal/categorical variables: consider plotting bar or pie charts, and create good old contingency tables to visualise the co-occurrence.
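A minimal sketch of both bivariate actions, again on made-up data (`spend`, `visits`, `channel`, `member` are all hypothetical column names):

```python
import pandas as pd

# Hypothetical data with two continuous and two categorical variables.
df = pd.DataFrame({
    "spend":   [10, 20, 30, 40, 50],
    "visits":  [1, 2, 3, 4, 5],
    "channel": ["store", "online", "store", "online", "store"],
    "member":  ["yes", "yes", "no", "no", "yes"],
})

# Continuous vs. continuous: correlation matrix (feed this to a heat map later).
print(df[["spend", "visits"]].corr())

# Categorical vs. categorical: the good old contingency table.
print(pd.crosstab(df["channel"], df["member"]))
```

The correlation matrix feeds naturally into a heat map (e.g. with seaborn), and `pd.crosstab` is the quickest route to a contingency table of co-occurrences.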

EDA also lets you validate statistical assumptions, such as normality, linearity, or independence, before analysis or data modelling.

## 4. Detecting anomalies.

Here's your chance to become Sherlock Holmes with your data and look for anything out of the ordinary! Ask yourself:

**– Are there any duplicate entries in the dataset?**

Duplicates are entries that represent the same sample point multiple times. Duplicates are not useful in most cases, as they don't give any additional information. They might be the result of an error, and they can skew your mean, median, and other statistics.

→ Check with your stakeholders and remove such errors from your data.
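Spotting and removing exact duplicates is a two-call job in pandas; the repeated row below is deliberate, to give the check something to find:

```python
import pandas as pd

# Hypothetical data containing one exact duplicate row.
df = pd.DataFrame({
    "customer": ["Mr X", "Mrs Y", "Mr X"],
    "spend": [1.2, 3.4, 1.2],
})

print(df.duplicated().sum())    # how many rows repeat an earlier row
deduped = df.drop_duplicates()  # keep only the first occurrence of each
print(deduped.shape)
```

Do confirm with stakeholders first, as the text says: sometimes a "duplicate" is a genuine repeat purchase, not an error.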

**– Are there labelling errors in categorical variables?**

Look at the unique values of each categorical variable and create a frequency chart. Watch for misspellings and labels that may represent the same thing.
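A frequency chart of the raw labels exposes near-duplicates immediately. The city names below are invented, with deliberate misspellings and case inconsistencies:

```python
import pandas as pd

# Hypothetical categorical column with misspellings and case inconsistencies.
s = pd.Series(["London", "london", "Londn", "Paris", "PARIS"])

# Frequency table of the raw labels exposes the near-duplicates.
print(s.value_counts())

# A first normalisation pass: trim whitespace and lower-case everything.
cleaned = s.str.strip().str.lower()
print(cleaned.value_counts())
```

Lower-casing merges the case variants; genuine misspellings like "londn" still need a manual mapping or fuzzy matching afterwards.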

**– Do some variables have missing values?**

This can happen to both numeric and categorical variables. Check:

**Are there rows that have missing values for a large number of variables (columns)?** This means there are data points that are blank across the majority of columns → they are not very useful, and we may have to drop them.

**Are there variables (columns) that have missing values across multiple rows?** This means there are variables that don't have values/labels for most data points → they can't add much to our understanding, and we may have to drop them.

**→ ACTION:**

– Count the proportion of NULL or missing values for each variable. Anything above 15%-20% missing should make you suspicious.

– Filter the rows with missing values in a column and check how the rest of the columns look. Do the majority of columns tend to have missing values together? Is there a pattern?
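Both missing-value checks reduce to `isna().mean()`, taken column-wise and row-wise. The data and the 50% threshold below are illustrative choices, not a rule:

```python
import pandas as pd
import numpy as np

# Hypothetical data with gaps in both a column and a row.
df = pd.DataFrame({
    "age":   [25, np.nan, 40, np.nan],
    "city":  ["Leeds", None, "York", "Bath"],
    "notes": [None, None, None, "vip"],
})

# Share of missing values per variable (column).
print(df.isna().mean())

# Share of missing values per data point (row):
# rows blank across most columns are candidates for dropping.
print(df.isna().mean(axis=1))

# Flag columns above an illustrative 50% threshold as suspicious.
suspicious = df.columns[df.isna().mean() > 0.5]
print(list(suspicious))
```

Here the `notes` column is 75% empty and gets flagged, while the second row is blank in two of three columns, exactly the two situations described above.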

**– Are there outliers in my dataset?**

Outlier detection is about identifying data points that don't fit the norm. You may see very high or extremely low values for certain numerical variables, or unusually high/low frequencies for certain categorical classes.

**What seems an outlier can be a data error.** While outliers are data points that are unusual for a given feature distribution, unwanted entries or recording errors are samples that shouldn't be there in the first place.

**What seems an outlier can just be an outlier.** In other cases, we may simply have data points with extreme values and perfectly fine reasoning behind them.

**→ ACTION:**

Study the histograms, scatter plots, and frequency bar charts to see whether a few data points sit far from the rest. Think through:

– Can they be true and genuinely take these extreme values?

– Is there a business reason or justification for these extremes?

– Would they add value to your analysis at a later stage?
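Alongside eyeballing the plots, a quick numeric screen helps. The classic IQR rule (flagging points beyond 1.5 × IQR from the quartiles, the same fence box plots draw) is one common choice, sketched here on invented numbers:

```python
import pandas as pd

# Hypothetical spend values with one extreme entry.
s = pd.Series([10, 12, 11, 13, 12, 11, 14, 500])

# Classic IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```

The rule only raises candidates; whether 500 is an error or a genuine big-ticket purchase is exactly the business question posed above.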

## 5. Data Cleansing.

Data cleansing refers to the process of removing unwanted variables and values from your dataset and eliminating any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results of any analysis built on it.

Remember: Garbage In, Garbage Out

## – Course-correct your data.

- Remove any duplicate entries you find, along with missing values and outliers that don't add value to your dataset. Eliminate unnecessary rows/columns.
- Correct any misspellings or mislabelling you observe in the data.
- Any data errors that are not adding value to the data also need to be removed.

## – Cap outliers, or let them be.

- In some data modelling scenarios, we may have to cap outliers at either end. Capping is usually done at the 99th/95th percentile for the upper end and the 1st/5th percentile for the lower end.
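Percentile capping (also called winsorising) is one `clip` call in pandas. The 5th/95th percentiles used here are one of the common choices the text mentions:

```python
import pandas as pd

# Hypothetical continuous variable with extremes at both ends.
s = pd.Series([1, 5, 6, 7, 8, 9, 10, 200])

# Cap at the 5th and 95th percentiles (1st/99th is the other common choice).
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)
print(capped.min(), capped.max())
```

Capping keeps every row (unlike dropping outliers) while bounding their influence on means and model fits.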

**– Treat missing values.**

We generally drop data points (rows) with missing values across a large number of variables. Similarly, we drop variables (columns) that have missing values across a large number of data points.

If there are only a few missing values, we might look to plug those gaps, or simply leave them as they are.

- For continuous variables with missing values, we can plug them using the mean or median (perhaps computed within a particular stratum).
- For categorical missing values, we might assign the most frequent class, or perhaps create a new 'not defined' class.
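Both imputation strategies can be sketched with `fillna`; the `income`/`segment` columns are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical data with one numeric and one categorical gap.
df = pd.DataFrame({
    "income": [30000.0, np.nan, 45000.0, 52000.0],
    "segment": ["A", "B", None, "A"],
})

# Continuous: plug the gap with the median.
df["income"] = df["income"].fillna(df["income"].median())

# Categorical: the most frequent class is one option...
most_common = df["segment"].mode()[0]
print(most_common)

# ...or an explicit 'not defined' class, which keeps the gap visible.
df["segment"] = df["segment"].fillna("not defined")
print(df)
```

The 'not defined' route is often the safer default: it preserves the fact that the value was missing, which can itself be predictive.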

## – Data enrichment.

Based on the needs of future analysis, you can add more features (variables) to your dataset, such as (but not restricted to):

- Creating binary variables indicating the presence or absence of something.
- Creating additional labels/classes using IF-THEN-ELSE clauses.
- Scaling or encoding your variables as per your future analytics needs.
- Combining two or more variables using mathematical functions like sum, difference, mean, log, and many other transformations.
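The enrichment ideas above can be sketched in a few lines; every column name and threshold here is hypothetical, invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical base data to enrich.
df = pd.DataFrame({
    "spend":  [120.0, 15.0, 300.0, 45.0],
    "visits": [4, 1, 10, 3],
})

# Binary variable indicating the presence/absence of something.
df["high_spender"] = (df["spend"] > 100).astype(int)

# IF-THEN-ELSE style labels.
df["tier"] = np.where(df["spend"] > 200, "gold",
             np.where(df["spend"] > 50, "silver", "bronze"))

# Combine two variables, plus a log transform to tame skew.
df["spend_per_visit"] = df["spend"] / df["visits"]
df["log_spend"] = np.log(df["spend"])
print(df)
```

Which features to add depends entirely on the downstream analysis; the point is that vectorised pandas/NumPy operations make each of these one-liners.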