Data is the key to every statistical analysis. Data cleansing involves removing errors and inconsistencies from data to improve its quality. If the data is not clean, your results will not be clean. Simply put, Garbage In = Garbage Out.
From the instant data is recorded to the moment it is extracted, a lot can go wrong.
Data may be entered incorrectly: zeroes can be added or dropped, classification codes can be entered incorrectly, customer identifiers can be mistyped, background information may be omitted, dates may be recorded in different formats, and the list goes on and on. Just think of one of your employees trying to give an impatient customer good service by rushing through a form that needs to be filled in, and you will get the picture.
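One concrete example of such entry errors is dates recorded in mixed formats. The sketch below shows a minimal normalization pass; the sample values and the set of accepted formats are illustrative assumptions, not from any real system.

```python
from datetime import datetime

# Hypothetical raw entries: three formats for the same date, plus a typo.
raw_dates = ["2023-04-01", "01/04/2023", "April 1, 2023", "2023-13-45"]

# Formats the entry form might plausibly have allowed (an assumption).
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def normalize_date(value):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None

normalized = [normalize_date(d) for d in raw_dates]
# The first three entries all normalize to "2023-04-01"; the typo maps to None
# and can be flagged for review instead of silently polluting the data.
```

Returning `None` rather than raising keeps the pass usable as a screening step: valid rows flow through, and the flagged ones get a human look.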
Mistakes may also occur when the data is manipulated: field names may be changed, mathematical operations may be applied incorrectly, figures may be rounded inappropriately, dates may be converted wrongly, units of measurement may be inconsistent, and so forth.
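Inconsistent units are a good example of a manipulation error that is easy to catch early. Here is a minimal sketch that converts measurements to a common unit and fails loudly on anything unrecognized; the unit labels, conversion factors, and records are illustrative assumptions.

```python
# Conversion factors to a common unit (kilograms) -- assumed schema.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.453592}

def to_kilograms(value, unit):
    """Convert a measurement to kilograms, raising on unknown units
    rather than silently passing bad data through."""
    if unit not in TO_KG:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_KG[unit]

# Mixed-unit records as they might arrive from different systems.
records = [(2.0, "kg"), (2000.0, "g"), (4.4, "lb")]
in_kg = [to_kilograms(v, u) for v, u in records]
```

Failing loudly on an unknown unit is the key design choice: a silent default would be exactly the kind of quiet contamination this section warns about.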
When multiple data sources are integrated, even more errors creep in, mostly because different sources often hold the same information in different representations (think of how many times you have had to fill in different forms with the same information when, for example, applying for a loan). When the sources are combined, these inconsistencies have to be reconciled one way or another, which may contaminate the data even further.
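A common reconciliation step is canonicalizing the keys before merging, so that the same customer recorded slightly differently in two sources collapses into one record. The names, field names, and records below are illustrative assumptions.

```python
def normalize_name(name):
    """Canonicalize a name: trim, collapse internal whitespace, lowercase."""
    return " ".join(name.split()).lower()

# The same customer, represented differently in two hypothetical sources.
source_a = {"Jane  Doe": {"loan_id": 101}}
source_b = {"jane doe": {"phone": "555-0100"}}

merged = {}
for source in (source_a, source_b):
    for name, record in source.items():
        # setdefault creates the canonical entry once, then updates merge
        # later fields into it.
        merged.setdefault(normalize_name(name), {}).update(record)
# merged["jane doe"] now combines the fields from both sources.
```

Note that `update` silently lets the later source win on conflicting fields; in practice you would want to detect and report such conflicts rather than overwrite, which is precisely the contamination risk described above.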
Many things can go wrong, and with enough data over enough time many things will go wrong. So, what is the best time for data cleansing?
When it comes to data checking and verification, earlier is better than later. That said, it is (almost) never too late. Many practitioners underestimate its importance, but just imagine the implications of making critical decisions based on incorrect information. So even if you catch an error late in the process, it is better to pause and fix the problem than to keep relying on incorrect information.
Remember, Garbage In = Garbage Out.
Clean data is vital for reliable decision making, and a lot can be done to keep data clean. Still, some errors will typically sneak in. Thus, once data is extracted, it must be inspected carefully. While this process is time-consuming, it prevents data-related problems down the road. Moreover, it is often the first step of the analysis itself: it is when intuition develops, initial patterns are identified, and more refined questions emerge.
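As a sketch of what that careful first inspection might look like, the snippet below counts missing and suspicious values per field in freshly extracted rows. The field names, rows, and checks are illustrative assumptions; a real inspection would be tailored to your own schema.

```python
# Hypothetical extracted rows with a missing amount and a blank identifier.
rows = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "C002", "amount": None},
    {"customer_id": "",     "amount": -5.0},
]

def inspect(rows):
    """Tally missing identifiers, missing amounts, and negative amounts."""
    report = {"missing_customer_id": 0, "missing_amount": 0,
              "negative_amount": 0}
    for row in rows:
        if not row["customer_id"]:
            report["missing_customer_id"] += 1
        if row["amount"] is None:
            report["missing_amount"] += 1
        elif row["amount"] < 0:
            report["negative_amount"] += 1
    return report

report = inspect(rows)
```

Even a crude tally like this often surfaces the first patterns and prompts the more refined questions mentioned above: why are identifiers blank, and can an amount legitimately be negative?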