Table of contents
- Introduction
- Data Preprocessing
- Splitting Training set and Test set
- Feature Scaling
- Resources
import numpy as np
import pandas as pd

#Load the dataset (np is needed later for SimpleImputer & np.array)
dataset = pd.read_csv("data.csv")
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
.iloc[]
allowed inputs are:
- An integer, e.g.
dataset.iloc[0]
> returns row 0 as a <class 'pandas.core.series.Series'>
Country       France
Age             44.0
Salary       72000.0
Purchased         No
Name: 0, dtype: object
- A list or array of integers, e.g.
dataset.iloc[[0]]
> returns row 0 in DataFrame format
  Country   Age   Salary Purchased
0  France  44.0  72000.0        No
- A slice object with ints, e.g.
dataset.iloc[:3]
> returns rows 0 up to (but not including) row 3 in DataFrame format
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
- Select the first 3 rows & all columns up to the last one (not included)
X = dataset.iloc[:3, :-1]
   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
- DataFrame.values
: Returns a NumPy representation of the DataFrame (i.e. only the values in the DataFrame are returned; the axis labels are removed). For example:
X = dataset.iloc[:3, :-1].values
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]]
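Note: the imputation & encoding steps below operate on the full dataset, not just the first 3 rows; although the notes never show it explicitly, X and y are presumably built like this:
#Presumed setup for the rest of the section (all 10 rows)
X = dataset.iloc[:, :-1].values  #features: Country, Age, Salary
y = dataset.iloc[:, -1].values   #label: Purchased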
- sklearn.impute.SimpleImputer(missing_values=np.nan, strategy=...)
: missing_values should be set to np.nan; strategy is one of "mean", "median", "most_frequent", ...
- imputer.fit(X[:, 1:3])
: Fit the imputer on X (compute the replacement statistics).
- imputer.transform(X[:, 1:3])
: Impute all missing values in X.
from sklearn.impute import SimpleImputer
#Create an instance of class SimpleImputer: np.nan is the missing value in the dataset
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#Fit on the numerical columns: Col 1 'Age', Col 2 'Salary' (computes the column means)
imputer.fit(X[:, 1:3])
#transform replaces the missing values & returns the updated columns
X[:, 1:3] = imputer.transform(X[:, 1:3])
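A quick check of the result (a sketch, based on the 10-row table above; each mean is computed over the 9 non-missing values):
#Row 6 'Age' (was NaN) > mean of the known ages: 349/9 ~ 38.78
print(X[6, 1])
#Row 4 'Salary' (was NaN) > mean of the known salaries: 574000/9 ~ 63777.78
print(X[4, 2])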
- Since the independent variable 'Country' is categorical, we convert it into vectors of 0s & 1s
- Using the ColumnTransformer class & OneHotEncoder
: an encoding technique for features that are nominal (do not have any order)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformers
: specifies what kind of transformation to apply, and on which columns - a tuple
('encoder', OneHotEncoder(), [cols to transform]) > (kind of encoding transformation, instance of class OneHotEncoder, columns to transform)
remainder="passthrough"
> keeps the columns which are not transformed. Otherwise, the remaining columns would not be included in the output
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder="passthrough")
- Fit and transform the instance ct of class ColumnTransformer with input = X
#fit and transform with input = X
#np.array(): convert the output of fit_transform() from a matrix to a NumPy array
X = np.array(ct.fit_transform(X))
- Before converting the categorical column [0] 'Country':
  Country   Age   Salary Purchased
0  France  44.0  72000.0        No
1   Spain  27.0  48000.0       Yes
- After converting, France = [1.0, 0.0, 0.0] vector (first three rows shown)
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 ...]
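To see which dummy column corresponds to which country, a small optional check (assuming scikit-learn >= 1.0, where OneHotEncoder exposes get_feature_names_out):
#The fitted encoder inside ct maps dummy columns to the original categories (alphabetical order)
encoder = ct.named_transformers_['encoder']
print(encoder.get_feature_names_out(['Country']))
#['Country_France' 'Country_Germany' 'Country_Spain']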
- For the dependent variable, since it is the label > we use
LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#output of fit_transform of Label Encoder is already a Numpy Array
y = le.fit_transform(y)
#y = [0 1 0 0 1 1 0 1 0 1]
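The mapping is stored in classes_ and can be reversed with inverse_transform (both part of LabelEncoder's standard API):
print(le.classes_)                   #['No' 'Yes'] > No = 0, Yes = 1
print(le.inverse_transform([0, 1]))  #['No' 'Yes']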
- Using the train_test_split function from sklearn.model_selection
- Recommended split: test_size = 0.2 (80% training set / 20% test set)
- random_state = 1
: fixes the seed of the random state so that we get the same training & test sets every time
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
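A quick sanity check (a sketch based on the 10-row dataset above, where X has 5 columns after one-hot encoding):
#80/20 split of 10 rows > 8 observations in the Training set, 2 in the Test set
print(X_train.shape, X_test.shape)  #(8, 5) (2, 5)
print(y_train.shape, y_test.shape)  #(8,) (2,)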
- What? Feature Scaling (FS): scales all the features to the same scale, to prevent one feature from dominating the others & then being neglected by the ML model
- Note #1: FS does not need to be applied in all ML models (e.g. multiple linear regression)
- Why FS is not needed for a multiple linear regression model: y = b0 + b1 * x1 + b2 * x2 + b3 * x3; since each feature has its own coefficient (b0, b1, b2, b3) to compensate for its scale, there is no need for FS.
- Note #2: For the dummy variables produced by categorical feature encoding, there is no need to apply FS
- Note #3: FS MUST be done AFTER splitting into the Training & Test sets
- Why?
- The test set is supposed to be a brand-new set, which we are not supposed to work with while training
- FS is a technique that uses the mean & standard deviation of the features in order to scale them
- If we applied FS before splitting into Training & Test sets, the scaler would include the mean & standard deviation of both the Training set and the Test set
- FS MUST be done AFTER splitting => otherwise, we would cause information leakage from the Test set
- There are 2 main feature scaling techniques: Standardisation & Normalisation
Standardisation
: centers the dataset at 0 (i.e. mean = 0) and rescales the standard deviation to 1. - Usage: works in all situations
Normalisation
: rescales the dataset into the range [0, 1]. - Usage: recommended when all the features in the dataset follow a normal distribution
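In formulas, these are the standard definitions (with the statistics computed on the training set only):

$$x_{stand} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} \qquad x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$$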
- We will use StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
- For X_train: apply StandardScaler by using fit_transform
#fit & transform only the numerical columns (Age & Salary); the dummy variables in cols 0-2 are left as-is
X_train[:,3:] = sc.fit_transform(X_train[:,3:])
- For X_test: apply StandardScaler by using only transform, because we want to apply the SAME scale as X_train
#only use transform, to reuse the SAME scaler (fitted on the Training set) for the Test set
X_test[:,3:] = sc.transform(X_test[:,3:])
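A final sanity-check sketch: after fit_transform, the scaled training columns should have mean ~ 0 & standard deviation ~ 1, while the test columns generally will not, since they were scaled with the Training set's statistics:
#Training columns: mean ~0, std ~1 (the scaler was fit on these exact values)
print(X_train[:, 3:].astype(float).mean(axis=0))  #~[0. 0.]
print(X_train[:, 3:].astype(float).std(axis=0))   #~[1. 1.]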