We use real data to implement a binary classification task: income prediction from a census survey. Given a resident's circumstances, such as work, family, and education, we predict whether the resident's annual income is greater than 50K per year. Residents earning more than 50K are positive examples, and those earning 50K or less are negative examples.
This dataset is extracted from the 1994 Census database [1].
Label values: `>50K`, `<=50K`.
Attributes:

- `age`: continuous value
- `workclass`, the nature of work: enumerated type, such as private enterprise, government, etc.
- `fnlwgt`, weight: continuous value
- `education`, education level: enumerated type, such as bachelor, master, etc.
- `education-num`, duration of education: continuous value
- `marital-status`: enumerated type: married, unmarried, divorced, etc.
- `occupation`: enumerated type, including many types, such as technical support, repair, sales, farming, fishing, armed forces, etc.
- `relationship`, family role: enumerated type: husband, wife, etc.
- `sex`, gender: enumerated type
- `capital-gain`: continuous value
- `capital-loss`: continuous value
- `hours-per-week`, working hours per week: continuous value
- `native-country`, country of origin: enumerated type
Data analysis and data processing would really merit a separate course and are beyond the scope of this book, so we only do some simple data processing so that the data can be used to train the neural network.
For continuous values, we can use the original data directly. Enumerated types need to be converted into numeric values. Take gender as an example: Female=0, Male=1. For other enumerated types, integer mappings starting from 0 can be used in the same way.
A handy trick in Python is to use the index of an element in a list as its integer code:
```python
sex_list = ["Female", "Male"]
array_x[0,9] = sex_list.index(row[9].strip())
```
`strip()` trims the leading space: because the data is in CSV format, each string field is read with a space in front of it, such as " Female". `index()` returns the position of the element in the list, so the string "Female" is encoded as 0 and the string "Male" as 1.
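Putting these pieces together, here is a minimal sketch of converting one raw CSV row into a numeric feature vector. The `workclass_list` values and the `to_feature_vector` helper are illustrative assumptions, not the book's actual code:

```python
import numpy as np

# Illustrative category lists; the real dataset has more categories per field.
sex_list = ["Female", "Male"]
workclass_list = ["Private", "Self-emp-not-inc", "Local-gov", "State-gov"]

def to_feature_vector(row):
    # row is a list of strings from one CSV line, e.g. from csv.reader.
    x = np.zeros(3)
    x[0] = float(row[0])                         # age: continuous, parse directly
    x[1] = workclass_list.index(row[1].strip())  # workclass: enumerated -> integer code
    x[2] = sex_list.index(row[9].strip())        # sex: enumerated -> 0 or 1
    return x
```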
All samples are stored row by row in a `numpy` array, which is finally saved in `npz` format:

```python
np.savez(data_npz, data=self.XData, label=self.YData)
```
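To verify the saved file, the arrays can be read back with `np.load`; the keys match the keyword arguments passed to `np.savez`. The file name below is illustrative:

```python
import numpy as np

# Load the archive and retrieve the arrays by the names used in np.savez().
npz = np.load("Income_Train.npz")
XData = npz["data"]    # feature matrix, one sample per row
YData = npz["label"]   # 0/1 labels
print(XData.shape, YData.shape)
```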
The original data already separates the training and test sets, so we run the data processing procedure on the two sets separately and save them as `Income_Train.npz` and `Income_Test.npz`.
```python
train_file = "../../Data/ch14.Income.train.npz"
test_file = "../../Data/ch14.Income.test.npz"

def LoadData():
    dr = DataReader_2_0(train_file, test_file)
    dr.ReadData()
    dr.NormalizeX()
    dr.Shuffle()
    dr.GenerateValidationSet()
    return dr
```
The `NormalizeX()` function must be called first for normalization, because there are many attribute fields and their value ranges vary greatly. Since this is a binary classification problem, we already marked incomes greater than 50K as 1 and incomes of 50K or less as 0 during data processing, so the label values need no normalization.
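As a rough idea of what `NormalizeX()` does, here is a minimal sketch of column-wise min-max normalization; whether the book's `DataReader_2_0` uses exactly this scheme is an assumption:

```python
import numpy as np

def normalize_min_max(X):
    # Scale each column (attribute) independently into [0, 1], so fields
    # like age (tens) and fnlwgt (tens of thousands) end up on comparable ranges.
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```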
We build a model like the one in Section 14.2, but to complete the binary classification task, we append a logistic function at the end.
Figure 14-10 Completed abstract model of a real application of the binary classification task
```python
def model(dr):
    num_input = dr.num_feature
    num_hidden1 = 32
    num_hidden2 = 16
    num_hidden3 = 8
    num_hidden4 = 4
    num_output = 1

    max_epoch = 100
    batch_size = 16
    learning_rate = 0.1

    params = HyperParameters_4_0(
        learning_rate, max_epoch, batch_size,
        net_type=NetType.BinaryClassifier,
        init_method=InitialMethod.MSRA,
        stopper=Stopper(StopCondition.StopDiff, 1e-3))

    net = NeuralNet_4_0(params, "Income")

    fc1 = FcLayer_1_0(num_input, num_hidden1, params)
    net.add_layer(fc1, "fc1")
    a1 = ActivationLayer(Relu())
    net.add_layer(a1, "relu1")
    ......
    fc5 = FcLayer_1_0(num_hidden4, num_output, params)
    net.add_layer(fc5, "fc5")
    logistic = ClassificationLayer(Logistic())
    net.add_layer(logistic, "logistic")

    net.train(dr, checkpoint=1, need_test=True)
    return net
```
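The logistic function at the end squashes the last layer's output into (0, 1) so it can be read as the probability of the positive class. A minimal sketch of the idea (not the book's `Logistic` class itself):

```python
import numpy as np

def logistic(z):
    # Sigmoid: maps any real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# A sample is classified as positive (income > 50K) when the probability >= 0.5.
prediction = (logistic(np.array([0.8, -1.2])) >= 0.5).astype(int)  # -> [1, 0]
```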
Hyperparameter description:

- Learning rate = 0.1
- Maximum epoch = 100
- Batch size = 16
- Binary classification network type
- MSRA initialization (see the sketch after this list)
- Relative-error stop condition = 1e-3
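MSRA (He) initialization draws weights from a zero-mean Gaussian with variance 2/n, where n is the number of input units of the layer, which keeps activation variances stable under ReLU. A minimal sketch, assuming this is what `InitialMethod.MSRA` implements:

```python
import numpy as np

def msra_init(num_input, num_output):
    # He et al. (2015): std = sqrt(2 / fan_in), suited to ReLU activations.
    std = np.sqrt(2.0 / num_input)
    return np.random.randn(num_input, num_output) * std
```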
The `net.train()` function is a blocking function; it returns only when training is complete.
The left side of the figure below shows the loss function, and the right side shows the accuracy. Ignore the fluctuation of the test data and look at the trend of the red validation-set curve: the loss value decreases and the accuracy increases.
Why not set `max_epoch` to a larger value, such as 1000, to get better results? Training for more epochs does not yield better results because of the risk of overfitting. Interested readers can try it for themselves.
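The `StopDiff` condition in the hyperparameters complements this: training halts once the loss stops changing meaningfully. A minimal sketch, assuming the stopper compares the loss at successive checkpoints against the 1e-3 threshold (the actual `Stopper` logic may differ):

```python
def should_stop(prev_loss, curr_loss, threshold=1e-3):
    # Stop when the loss improvement between checkpoints falls below the threshold.
    return abs(prev_loss - curr_loss) < threshold
```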
Figure 14-11 Changes in loss function and accuracy values during training
Here is the final output:
```
......
epoch=99, total_iteration=169699
loss_train=0.296219, accuracy_train=0.800000
loss_valid=nan, accuracy_valid=0.838859
time used: 29.866002321243286
testing...
0.8431606905710491
```
Lastly, the accuracy obtained on the independent test set is about 84.3%, which is a good result compared with other published work on this dataset.
Code location: ch14, Level4
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[2] Ronny Kohavi and Barry Becker. Data Mining and Visualization, Silicon Graphics. E-mail ronnyk '@' sgi.com for questions.