Steps of Data Cleansing in Data Science
The underlying goal of collecting data is to examine it from every angle, surface what is hidden, and arrange it into patterns that contextual algorithms can learn easily. During data cleansing, many rounds of conditional checks are run to isolate the most accurate data patterns. Their accuracy matters because this data will train algorithms; in other words, it feeds directly into artificial intelligence.
Here are the common data cleansing steps, in the order they are usually applied:

- Importing Data: Market research firms and many other organisations rely on web scraping, also known as data extraction, to capture the raw data that will later be cleansed. This is the importing step. Data is usually imported in its original file format, such as Excel, CSV, SQL or SAP, and converted into a workable file afterwards. A minimal sketch of this step follows.
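As an illustration of the importing step, here is a minimal pandas sketch; the file names and the commented-out SQL connection string are hypothetical placeholders, not part of any real project.

```python
import pandas as pd

# Hypothetical source files; substitute your own paths and credentials.
df_csv = pd.read_csv("survey_responses.csv")      # CSV export
df_xlsx = pd.read_excel("survey_responses.xlsx")  # Excel workbook

# SQL sources can be read through a SQLAlchemy connection:
# from sqlalchemy import create_engine
# engine = create_engine("sqlite:///research.db")   # hypothetical database
# df_sql = pd.read_sql("SELECT * FROM responses", engine)

# A first look at what was imported: row/column counts and column types.
print(df_csv.shape)
print(df_csv.dtypes)
```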
- Conversion: Once imported, data cleansing service providers bring all of the data into a uniform format or pattern, so that every entry is consolidated into an identical shape. A sketch of this consolidation follows.
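A minimal sketch of such a conversion, assuming a hypothetical schema in which every source must share a `signup_date` column:

```python
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Bring one source's frame into the shared format."""
    out = df.copy()
    # Uniform column names: trimmed, lower-case, underscores for spaces.
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Uniform types for the fields every source must share.
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")
    return out

# Two sources with different conventions, consolidated into one format.
a = pd.DataFrame({"Signup Date": ["2023-01-05"], "City": ["Paris"]})
b = pd.DataFrame({"signup_date": ["2023-02-06"], "city": ["Lyon"]})
combined = pd.concat([standardize(a), standardize(b)], ignore_index=True)
```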
- Filtering Unwanted Observations: Unwanted entries show up in two forms, duplicate observations and irrelevant observations; a sketch covering both follows after this list.
Duplicate Observations: These are entries that conflict because they repeat the same data, class or features. Duplication may arise:
- when data is compiled from multiple places
- when data is scraped
- when clients transfer files
Irrelevant Observations: These are entries that do not fit the criteria of the problem the data is meant to solve. They could be:
- models that do not meet the requirements
- classes that should not be there
- features that do not serve the goal
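Both kinds of unwanted observations can be filtered in a few lines; this sketch uses toy data and a hypothetical study that only concerns the "retail" class:

```python
import pandas as pd

df = pd.DataFrame({
    "model":   ["A", "A", "B", "C"],
    "class":   ["retail", "retail", "retail", "wholesale"],
    "revenue": [100, 100, 250, 900],
})

# Duplicate observations: identical rows left over from compiling,
# scraping or re-transferring the same data.
df = df.drop_duplicates()

# Irrelevant observations: rows outside the problem's criteria
# (here, everything that is not the hypothetical "retail" class).
df = df[df["class"] == "retail"]
```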
- Structural Errors due to poor warehousing: These deserve serious attention because inconsistent capitalization and typographical errors can steer an analysis in the wrong direction; even trivial errors can cause blunders. A sketch of the usual fixes follows these items.
- Typos: Typographical errors, such as briliant (which should be brilliant) or Hotlier (which should be Hotelier).
- Inconsistent capitalization: Capital initial letters used inconsistently across entries.
- Mislabeled classes: This oddity arises when abbreviations and full forms of the same class name coexist, such as Not Applicable and NA.
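A sketch of the usual fixes, with a hypothetical correction table that in practice would be built by inspecting `value_counts()` on the column:

```python
import pandas as pd

s = pd.Series(["Hotelier", "Hotlier", "HOTELIER", "Not Applicable", "NA"])

# Inconsistent capitalization: normalise case and surrounding whitespace.
s = s.str.strip().str.lower()

# Typos and mislabeled classes: map known variants onto one canonical label.
corrections = {"hotlier": "hotelier", "na": "not applicable"}
s = s.replace(corrections)
```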
Outliers (Reasons): Outliers stem from an incorrect data modeling method or pattern. They are typically singled out for one of two reasons, as sketched below:
- to improve model performance
- because the measurement itself is suspicious
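One common screen for suspicious measurements is the interquartile-range rule; this is a sketch on toy numbers, not a universal recipe:

```python
import pandas as pd

revenue = pd.Series([100, 120, 95, 110, 105, 9000])  # one obvious outlier

# Flag points lying more than 1.5x the interquartile range outside it.
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
mask = (revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)

# Investigate before dropping: remove an outlier only if the measurement
# itself is suspect, not merely because it is inconvenient for the model.
suspicious = revenue[mask]
```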
- Verification: This step verifies the entries by tallying them against internal and external data sources, as in the sketch below.
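A minimal sketch of tallying entries against an internal reference list (the set of valid codes here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"country": ["DE", "FR", "XX"], "revenue": [100, 250, 80]})

# Internal reference data: the set of codes known to be valid.
valid_countries = {"DE", "FR", "ES", "IT"}

# Rows that fail the tally are set aside for checking against
# an external source rather than silently dropped.
unverified = df[~df["country"].isin(valid_countries)]
print(unverified)
```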
- Normalization or Filling Missing Data: Simply ignoring missing values interferes with the consistency of a dataset. Most data today is scraped and cleansed to train algorithms, and those algorithms cannot accept missing values. So you have to deal with them by either:
- filling the missing values, or
- discarding the observations that have missing values.
Ways to deal with missing data (a sketch for both cases follows after this list):
Missing categorical data:
- Label the entry as “Missing”
- Add a new class for the feature
- Tell the algorithm that the value is missing
Missing numeric data:
- Flag and fill the missing values
- Assign an indicator variable to the missing data
- Fill the original missing value with ‘0’
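A sketch covering both cases on a toy frame: the categorical gap becomes its own “Missing” label, and the numeric gap is flagged and then filled with 0:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "segment": ["retail", None, "wholesale"],  # categorical column
    "revenue": [100.0, np.nan, 250.0],         # numeric column
})

# Missing categorical data: label the gap as its own class.
df["segment"] = df["segment"].fillna("Missing")

# Missing numeric data: flag the gap first, then fill the original value.
df["revenue_missing"] = df["revenue"].isna()
df["revenue"] = df["revenue"].fillna(0)
```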
- Export Data: This step transmits the cleansed data to its destination or data warehouse, so the file should match the destination's format: CSV, a SQL database, XML, TIFF, PDF or any other. A sketch follows below.
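A sketch of the export step; the file name and the commented-out database connection are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"segment": ["retail"], "revenue": [100.0]})

# Export in the format the destination warehouse expects.
df.to_csv("cleaned.csv", index=False)

# Or write straight into a SQL database via SQLAlchemy:
# from sqlalchemy import create_engine
# engine = create_engine("sqlite:///warehouse.db")   # hypothetical target
# df.to_sql("cleaned", engine, if_exists="replace", index=False)
```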