Data is at the core of any machine learning system. Before a model can learn from it, raw input data must be fetched and converted into a usable form; this process is called Data Preprocessing. Data preprocessing involves a variety of techniques, and the right choice depends on the context of the data. In this post I will discuss the most common data preprocessing practices. I work with Python for machine learning, so the techniques below are shown in Python, but similar techniques exist in other languages such as R.
Import the Libraries
The first step is to import the libraries. In Python, we rely on libraries like numpy, matplotlib, and pandas. The pandas library is commonly used to import and manage datasets. So the first step is to import the libraries and then to import the dataset. To execute the code I am writing, make sure to place your Python file inside the same folder as your input dataset file.
#Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
You can download the complete code as well as the dataset file here.
Import the Dataset
#Import the dataset
dataset = pd.read_csv('Data.csv')
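The exact Data.csv file is not reproduced in this post, but the code that follows implies a four-column layout: a categorical Country column, two numeric columns with some missing entries, and a dependent Purchased column. A hypothetical excerpt in that layout, for reference:

Country   Age   Salary   Purchased
France    44    72000    No
Spain     27    48000    Yes
Germany   30             No
France    38    61000    Yes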
Once we have loaded the complete dataset into our Python variable, we need to separate the independent and dependent variables. This can be accomplished using the iloc indexer. If you are using Spyder (Anaconda), you can hover your cursor on iloc and press Ctrl+I to see what parameters it accepts. The variable X will hold the matrix of independent variables (also called observations or features) and Y will hold the dependent variable vector.
#Fetch independent variable (observation) matrix
#iloc[index for rows, index for columns]
X = dataset.iloc[:,:-1].values
#Fetch dependent variable vector
Y = dataset.iloc[:,3].values
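To make the slicing explicit: `:` selects all rows, `:-1` selects every column except the last, and `3` selects the fourth column (indices are zero-based). A minimal sketch of what this yields, assuming the hypothetical four-column layout shown above:

#Illustration of iloc slicing on a small hypothetical DataFrame
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Spain'],
                   'Age': [44, 27],
                   'Salary': [72000, 48000],
                   'Purchased': ['No', 'Yes']})

X = df.iloc[:, :-1].values  #all rows, all columns except the last
Y = df.iloc[:, 3].values    #all rows, column index 3 (Purchased)
print(X)  #[['France' 44 72000] ['Spain' 27 48000]]
print(Y)  #['No' 'Yes']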
The Case of Missing Data
Now let's assume the dataset we obtained has some missing values. These can interfere with the accuracy of the predictions that our machine learning program will generate. Handling missing data might not be required in every case, but it is still an important concept to understand. The most common technique for missing data is to replace each missing value with the mean of its column. To perform this mean calculation we need another Python library: scikit-learn.
#Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
Here we have created an imputer object that looks for entries marked as 'NaN' and replaces them. The strategy used to fill in the missing value is the mean. axis = 0 means the mean is calculated down each column, using the values available in that column; axis = 1 would calculate the mean across each row.
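If the axis convention feels confusing, a quick numpy sketch makes it concrete (np.nanmean ignores missing values, just as the imputer does):

#Column means vs. row means with numpy
import numpy as np

data = np.array([[40., 72000.],
                 [27., np.nan],
                 [30., 54000.]])

print(np.nanmean(data, axis = 0))  #mean of each column -> [32.33..., 63000.]
print(np.nanmean(data, axis = 1))  #mean of each row    -> [36020., 27., 27015.]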
Once the imputer object is created, we need to fit it to the data and transform the missing values in the dataset. Since a mean can only be computed on numeric columns, we apply the imputer to the numeric columns only. The complete code to replace missing values with the mean of the column is below.
#Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
X[:,1:] = imputer.fit_transform(X[:,1:])
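Note that Imputer was deprecated in scikit-learn 0.20 and later removed. If you are on a newer version, the equivalent is SimpleImputer; a minimal sketch (SimpleImputer has no axis parameter, as it always works column-wise):

#Equivalent for scikit-learn >= 0.20
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
X[:,1:] = imputer.fit_transform(X[:,1:])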
Categorical Variables
As we just observed above, only numeric data can be used in mathematical calculations, while string data is left untouched. However, if we need to process string data, it must first be categorized and converted into numeric values. For example, France will be represented by 0, Germany by 1 and Spain by 2. To perform this conversion, we again use the scikit-learn library's encoding classes.
#Categorizing data
from sklearn.preprocessing import LabelEncoder

#Encoding the independent variable (column 0 holds the country names)
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

#Encoding the dependent variable with its own encoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
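A quick illustration of what LabelEncoder produces, using the three countries from the example above (classes are assigned in alphabetical order):

#LabelEncoder assigns an integer to each distinct label
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['France', 'Spain', 'Germany', 'France']))
#-> [0 2 1 0]  (France = 0, Germany = 1, Spain = 2)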
After this step, the first column of our data, which held country names such as Spain and France, has been converted into integers 0, 1, and so on. If we look closely, there was originally no ordering between Spain and France; after the conversion, however, an artificial relationship such as 1 > 0 has appeared, which can mislead the calculations. To overcome this issue, we use the approach of dummy variables: we create as many dummy variables as there are categories from the previous step.
Values obtained in Dummy Variables:

Category   France   Spain   Germany
France        1        0        0
Spain         0        1        0
France        1        0        0
Germany       0        0        1
To achieve this we use the OneHotEncoder class, also found in scikit-learn's preprocessing module.
#Create Dummy Variables
from sklearn.preprocessing import OneHotEncoder

#Here [0] is the index of the column for which dummy variables are created
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
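As with Imputer, newer scikit-learn versions removed the categorical_features parameter; the modern idiom is to route the categorical column through a ColumnTransformer. A minimal sketch of the equivalent:

#Equivalent for scikit-learn >= 0.20
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X)

With this approach the LabelEncoder step for X is no longer needed, since modern OneHotEncoder can encode string columns directly.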
Test and Train Sets
This is again an important part of data preprocessing. Out of the total data, we set aside some data as the test set and keep the remaining data as the training set of observations (or features). The training data will be used by the program to learn the behavior and predict the answer, while the test data will be used to cross-verify those answers. To generate the test and train sets, we use the train_test_split function from scikit-learn's cross_validation module.
#Test and Train Data
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
test_size refers to the fraction of data to be moved into the test set from the parent dataset; 0.2 keeps 20% of the rows for testing. random_state fixes the random seed so the split is reproducible.
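One caveat: the cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20. On newer versions the same function lives in model_selection, with an identical signature:

#Equivalent import for scikit-learn >= 0.18
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)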
Feature Scaling
Scaling is required in machine learning to bring all features to a common scale for mathematical calculations. For example, take age = 40 and salary = 90000. Salary is orders of magnitude larger than age, which creates a huge gap when performing calculations on the two variables together. To bring them to smaller, easily comparable values, we scale them using feature scaling. Feature scaling might not be required every time, but it is needed in many cases. To perform feature scaling, we import StandardScaler from the scikit-learn Python library.
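Under the hood, StandardScaler standardizes each column to zero mean and unit variance: z = (x - mean) / std. A minimal numpy sketch of the same computation:

#Standardization by hand: z = (x - mean) / std
import numpy as np

ages = np.array([40., 27., 30., 38.])
print((ages - ages.mean()) / ages.std())
#-> roughly [ 1.16, -1.25, -0.69,  0.79]; the result has mean 0 and std 1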
# Feature Scaling
from sklearn.preprocessing import StandardScaler

scale_X = StandardScaler()
#Fit the scaler on the training set only, then reuse it on the test set
X_train = scale_X.fit_transform(X_train)
X_test = scale_X.transform(X_test)

#Scaling Y is only needed when the target itself is numeric (e.g. regression)
scale_Y = StandardScaler()
Y_train = scale_Y.fit_transform(Y_train)
Y_test = scale_Y.transform(Y_test)
With this, our walkthrough of data preprocessing is complete, and we can have a look at the complete data preprocessing template. The template is very useful and saves a lot of time. In most cases it can be used directly by changing just a few values as per your requirements, such as the dataset name, the column index values, and the test size.
Data Preprocessing Template
#Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Import the dataset
dataset = pd.read_csv('Data.csv')

#Fetch independent variable (observation) matrix and dependent variable vector
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values

""" -- This block is optional to use --
#Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
X[:,1:] = imputer.fit_transform(X[:,1:])
"""

""" -- This block is optional to use --
#Categorizing data
from sklearn.preprocessing import LabelEncoder
#Encoding the independent variable
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
#Encoding the dependent variable
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

#Create Dummy Variables
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
"""

#Test and Train Data
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

""" -- This block is optional to use --
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scale_X = StandardScaler()
X_train = scale_X.fit_transform(X_train)
X_test = scale_X.transform(X_test)
scale_Y = StandardScaler()
Y_train = scale_Y.fit_transform(Y_train)
Y_test = scale_Y.transform(Y_test)
"""