Data analytics involves many transformations and therefore requires careful attention to detail. Data generally contains inconsistencies, and the most common one is missing values. Even a modest amount of missing values scattered throughout the data set can significantly reduce the usable sample once incomplete records are dropped. For example, if each of 10 variables is independently missing in 5% of records, only about 60% of records are complete. There are various methods for handling missing values in the data. Filling them in with estimated values is known as imputation.
1) When the dependent variable contains missing values, simply eliminate the records.
2) Identify the right slices of data and substitute a measure of central tendency such as the median, mean or mode. Choosing the right slice matters: you can group by various parameters and take the central tendency within each group. Choose the grouping with the highest chi-square statistic, that is, the one most strongly associated with the variable being imputed.
3) If the variable follows a normal distribution, generate the missing values with the inverse normal (quantile) function.
4) Treat the variable with missing values as the dependent variable in a regression equation. Use multiple linear regression to impute the missing values. You can also try other methods in place of regression, such as classification or decision trees.
5) Use business logic to understand the missing values.
6) Check the data capturing process, as there could be an error at the source of data entry. This also helps identify whether the data points are missing at random or not at random. If they are missing at random, you can use simple imputation; if they are not, you need advanced techniques to impute values. Also look at the bias in the particular column: if the bias is significant you need advanced techniques, and if it is minimal you can proceed with simple imputation.
7) Identify the list of possible values for the missing data. Substitute each possible value in turn, create a separate data set for each, and build the model on each one. Compare the accuracy and consistency obtained with the different substitutes. This way you can even introduce variation into the imputed values and reduce bias.
8) Use regression to determine the distribution of the values in place of missing values. Create a what-if scenario by imputing every range of values.
9) Do not impute: remove the records with missing values and duplicate records from the remaining sample to increase the size of the data set.
10) Treat records as vectors and measure their similarity. Use the cosine similarity between records to find the complete records most similar to the one with missing values, and take the missing values from them.
11) Use logistic regression to measure the likelihood that a value is observed or missing. If the value is missing the output is 0, otherwise 1. The remaining (non-missing) variables act as independent variables. This does not predict the value itself, only the likelihood of finding the variable missing. Records with the same or the closest probability are considered similar, and the missing value is donated by the similar complete record.
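To make two of the simpler techniques above concrete, here is a minimal sketch of method 2 (central tendency within a slice) and method 10 (cosine similarity between records) using only the Python standard library. The function names, the `region` and `income` fields and the sample numbers are hypothetical and chosen for illustration only. The sketch assumes missing values are stored as `None`, and that donor records are complete.

```python
import math
from statistics import median


def impute_group_median(rows, group_key, target):
    """Return copies of rows with missing target values (None) filled by the group median."""
    observed = {}
    for row in rows:
        if row[target] is not None:
            observed.setdefault(row[group_key], []).append(row[target])
    # Fallback for groups that have no observed value at all.
    overall = median(v for values in observed.values() for v in values)
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None:
            values = observed.get(new_row[group_key])
            new_row[target] = median(values) if values else overall
        filled.append(new_row)
    return filled


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def impute_from_nearest(records, target_index):
    """Fill a missing field with the value from the most similar complete record."""
    def others(record):
        # All fields except the one being imputed.
        return [v for i, v in enumerate(record) if i != target_index]

    donors = [r for r in records if r[target_index] is not None]
    result = []
    for record in records:
        new_record = list(record)
        if new_record[target_index] is None:
            donor = max(donors, key=lambda d: cosine_similarity(others(record), others(d)))
            new_record[target_index] = donor[target_index]
        result.append(new_record)
    return result


# Hypothetical sample data: income is missing for one record in each region.
rows = [
    {"region": "north", "income": 40},
    {"region": "north", "income": 50},
    {"region": "north", "income": None},
    {"region": "south", "income": 70},
    {"region": "south", "income": None},
]
filled = impute_group_median(rows, "region", "income")

# Hypothetical numeric records: the last field is missing in the third record.
records = [
    [1.0, 2.0, 10.0],
    [2.0, 1.0, 20.0],
    [2.1, 3.9, None],
]
imputed = impute_from_nearest(records, target_index=2)
```

Note that cosine similarity compares only the direction of two records and ignores their magnitude, so it is usually worth scaling the variables first. Whether the median, mean or mode is the right statistic for a slice depends on the variable.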
Multiple imputation generally yields better results, but it is more demanding computationally, so in practice it is carried out with the help of statistical software.
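The core idea of multiple imputation can still be sketched in a few lines, even though real analyses would normally rely on a dedicated package. The example below is a simplified illustration and not a full implementation: it imputes each missing value several times from a simple linear regression plus random noise, estimates the mean on every completed data set, and pools the results with Rubin's rules. The function names and the sample numbers are hypothetical. A proper procedure would also draw the regression parameters themselves from their distribution, which this sketch skips.

```python
import random
from statistics import mean, variance


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # n - 2 degrees of freedom for a model with two parameters.
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def multiple_imputation_mean(xs, ys, m=20, seed=0):
    """Impute missing ys (None) m times and pool the estimated mean of y."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in observed], [y for _, y in observed])

    estimates, within = [], []
    for _ in range(m):
        # Each pass fills the gaps with a prediction plus fresh random noise.
        completed = [
            y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)
        ]
        estimates.append(mean(completed))
        # Squared standard error of the mean for this completed data set.
        within.append(variance(completed) / len(completed))

    pooled = mean(estimates)
    between = variance(estimates)
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    total_var = mean(within) + (1 + 1 / m) * between
    return pooled, total_var


# Hypothetical sample data: y is missing for x = 4 and x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, None, 9.8, None, 14.1, 16.0]
pooled, total_var = multiple_imputation_mean(xs, ys)
```

The between-imputation term is what a single imputation cannot provide: it reflects how much the estimate moves when the missing values are filled in differently.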
This article was originally published on Praveen Kodur's blog: http://www.praveenkodur.com/blog/.
