14.4 A real application of the binary classification task

We now use real data to implement a binary classification task: predicting income from a survey. Given a resident's circumstances, such as work, family, and education, the task is to predict whether the resident's annual income exceeds 50K. Residents who earn more than 50K are positive examples, and those who earn 50K or less are negative examples.

14.4.1 Prepare the data

This data set is extracted from the 1994 Census database $^{[1]}$.

Data field interpretation

Label values: >50K, <=50K.

Attributes:

  • age: continuous value
  • workclass, the nature of work: enumerated type, such as private enterprise, government, etc.
  • fnlwgt, final weight: continuous value
  • education, education level: enumerated type, such as bachelor, master, etc.
  • education-num, duration of education: continuous value
  • marital-status, marital status: enumerated type, such as married, unmarried, divorced, etc.
  • occupation: enumerated type with many values, such as technical support, maintenance workers, sales, farmers, fishermen, soldiers, etc.
  • relationship, family role: enumerated type, such as husband, wife, etc.
  • race: enumerated type
  • sex, gender: enumerated type
  • capital-gain, capital gain: continuous value
  • capital-loss, capital loss: continuous value
  • hours-per-week, working hours per week: continuous value
  • native-country, country of origin: enumerated type

Data processing

Data analysis and data processing really deserve a separate course and are beyond the scope of this book, so we only do some simple processing to make the data usable for training a neural network.

Continuous values can be used directly. Enumerated types must be converted into numeric values. Take gender as an example: Female=0, Male=1. Other enumerated types can likewise be mapped to integers starting from 0.

A handy trick in Python is to use the index of an element in a list as its integer code:

sex_list = ["Female", "Male"]
array_x[0,9] = sex_list.index(row[9].strip())

strip() removes leading and trailing whitespace. Because the fields in the csv file are separated by a comma followed by a space, each value is read with a space in front of it, for example " Female". index() returns the position of an element in the list, so the string "Female" maps to 0 and the string "Male" maps to 1.

Store all the data row by row in a numpy array, and finally save it in the npz format:
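The list-index trick can be sketched end to end as follows. This is a minimal illustration, not the book's data processing script: the two rows are illustrative records in the same ", "-separated layout as the raw file, and the helper name `encode` and the `label_list` mapping are assumptions introduced for the example.

```python
import csv
import io

import numpy as np

# Two illustrative rows in the same ", "-separated layout as the raw file.
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, "
    "Wife, Black, Female, 0, 0, 40, Cuba, <=50K\n"
)

sex_list = ["Female", "Male"]
label_list = ["<=50K", ">50K"]  # assumed mapping: negative=0, positive=1


def encode(value, categories):
    """Return the integer code of a raw field after stripping whitespace."""
    return categories.index(value.strip())


rows = list(csv.reader(io.StringIO(raw)))
# csv.reader splits on "," only, so row[9] is " Male" with a leading space.
sex_codes = np.array([encode(row[9], sex_list) for row in rows])
label_codes = np.array([encode(row[14], label_list) for row in rows])

print(sex_codes, label_codes)
```

Note that `list.index` raises a `ValueError` for a value that is not in the list, so any unexpected category has to be added to the list or filtered out beforehand.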

np.savez(data_npz, data=self.XData, label=self.YData)
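As a small sketch of the npz round trip, the snippet below saves two arrays under the same keys (`data` and `label`) and reads them back. The array contents and the temporary file name are hypothetical and only stand in for the processed data set.

```python
import os
import tempfile

import numpy as np

# Hypothetical processed data: 2 samples with 2 features each, plus labels.
x_data = np.array([[39.0, 1.0], [28.0, 0.0]])
y_data = np.array([[0.0], [1.0]])

# Write both arrays into one .npz archive, keyed by keyword name.
path = os.path.join(tempfile.mkdtemp(), "income_demo.npz")
np.savez(path, data=x_data, label=y_data)

# np.load returns an archive object; arrays are looked up by key.
with np.load(path) as archive:
    x_loaded = archive["data"]
    y_loaded = archive["label"]

print(x_loaded.shape, y_loaded.shape)
```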

The original data is already split into a training set and a test set, so we run the data processing procedure on each of them separately and save the results as Income_Train.npz and Income_Test.npz.

Load the data

train_file = "../../Data/ch14.Income.train.npz"
test_file = "../../Data/ch14.Income.test.npz"

def LoadData():
    dr = DataReader_2_0(train_file, test_file)
    dr.ReadData()
    dr.NormalizeX()
    dr.Shuffle()
    dr.GenerateValidationSet()
    return dr

NormalizeX() must be called before training to normalize the features, because there are many attribute fields and their value ranges differ greatly. Since this is a binary classification problem, incomes greater than 50K were labeled 1 and incomes less than or equal to 50K were labeled 0 during data processing, so the label values do not need to be normalized.
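To illustrate why normalization matters here, the sketch below rescales each feature column to the range [0, 1]. It assumes column-wise min-max scaling; the actual implementation of NormalizeX() is not shown in this section, and the function name `min_max_normalize` and the sample values are hypothetical.

```python
import numpy as np


def min_max_normalize(x):
    """Scale each column of x to [0, 1]; also return the statistics used."""
    x_min = x.min(axis=0)
    x_range = x.max(axis=0) - x_min
    x_range[x_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return (x - x_min) / x_range, x_min, x_range


# Hypothetical columns: age, capital-gain, hours-per-week.
x_train = np.array([
    [25.0, 0.0, 40.0],
    [50.0, 5000.0, 60.0],
    [75.0, 10000.0, 20.0],
])

x_norm, x_min, x_range = min_max_normalize(x_train)
print(x_norm)
```

In this kind of scheme, the minimum and range computed from the training set are usually reused to scale the validation and test sets, so that all three are transformed identically.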

14.4.2 Build a model

We build the model the same way as in 14.2, but to perform binary classification we append a logistic function after the last layer.
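The role of the final logistic function can be sketched in a few lines: it squashes the last layer's output into a probability, which is then compared with the 0/1 label through the binary cross-entropy loss. The 0.5 decision threshold and the sample values below are assumptions for illustration, not taken from the book's code.

```python
import numpy as np


def logistic(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(prob, label):
    """Mean cross-entropy loss between predicted probabilities and 0/1 labels."""
    return float(-np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob)))


z = np.array([-2.0, 0.0, 3.0])     # hypothetical outputs of the last linear layer
label = np.array([0.0, 0.0, 1.0])  # hypothetical true labels

prob = logistic(z)
pred = (prob > 0.5).astype(int)    # assumed threshold: class 1 if probability > 0.5
loss = binary_cross_entropy(prob, label)

print(prob, pred, loss)
```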

Figure 14-10 Completed abstract model of a real application of the binary classification task

def model(dr):
    num_input = dr.num_feature
    num_hidden1 = 32
    num_hidden2 = 16
    num_hidden3 = 8
    num_hidden4 = 4
    num_output = 1

    max_epoch = 100
    batch_size = 16
    learning_rate = 0.1

    params = HyperParameters_4_0(
        learning_rate, max_epoch, batch_size,
        net_type=NetType.BinaryClassifier,
        init_method=InitialMethod.MSRA,
        stopper=Stopper(StopCondition.StopDiff, 1e-3))

    net = NeuralNet_4_0(params, "Income")

    fc1 = FcLayer_1_0(num_input, num_hidden1, params)
    net.add_layer(fc1, "fc1")
    a1 = ActivationLayer(Relu())
    net.add_layer(a1, "relu1")
    ......
    fc5 = FcLayer_1_0(num_hidden4, num_output, params)
    net.add_layer(fc5, "fc5")
    logistic = ClassificationLayer(Logistic())
    net.add_layer(logistic, "logistic")

    net.train(dr, checkpoint=1, need_test=True)
    return net

Hyperparameter description:

  1. Learning rate = 0.1
  2. Maximum epochs = 100
  3. Batch size = 16
  4. Binary classification network type
  5. MSRA initialization
  6. Relative error stop condition = 1e-3
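As a rough illustration of item 5, MSRA (also known as He) initialization draws each weight from a zero-mean normal distribution whose standard deviation is sqrt(2 / fan_in), which suits ReLU layers. The sketch below applies it to the layer sizes used above. The function name `msra_init` is hypothetical, the input width of 14 is an assumption (the real value comes from dr.num_feature), and the library's InitialMethod.MSRA may differ in detail.

```python
import numpy as np


def msra_init(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) weight matrix with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


rng = np.random.default_rng(0)

# Assumed input width of 14, followed by the hidden and output sizes above.
layer_sizes = [14, 32, 16, 8, 4, 1]

weights = [
    msra_init(n_in, n_out, rng)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

for w in weights:
    print(w.shape, round(float(w.std()), 3))
```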

The net.train() function is a blocking function. It only returns when the training is complete.

14.4.3 Training result

In the figure below, the left plot shows the loss and the right plot shows the accuracy. Ignore the fluctuation of the test data and just look at the trend of the red validation curve: the loss decreases and the accuracy increases.

Why not set max_epoch to a larger value, such as 1000, to get better results? Training for more epochs does not necessarily give better results, because the network may start to overfit the training set. Interested readers can try it out for themselves.
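One common way to guard against overfitting when training longer is to track the validation loss and stop once it has not improved for several epochs. The sketch below shows this patience-based idea on made-up loss values. It is not the Stopper used in the model code above, and the function name and the `patience` parameter are hypothetical.

```python
def best_epoch_with_patience(valid_losses, patience=3):
    """Return (epoch, loss) of the best validation loss seen before stopping."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss


# Made-up validation losses: improving at first, then rising again.
valid_losses = [0.40, 0.35, 0.33, 0.32, 0.33, 0.34, 0.36, 0.39]

best_epoch, best_loss = best_epoch_with_patience(valid_losses)
print(best_epoch, best_loss)
```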

Figure 14-11 Changes in loss function and accuracy values during training

Here is the final output:

......
epoch=99, total_iteration=169699
loss_train=0.296219, accuracy_train=0.800000
loss_valid=nan, accuracy_valid=0.838859
time used: 29.866002321243286
testing...
0.8431606905710491

Finally, the accuracy obtained on the independent test set is about 84%. This is a good result compared with other papers that use this data set.
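For reference, an accuracy figure like the one above is simply the fraction of examples whose thresholded prediction matches the label. The snippet below is a minimal sketch with made-up probabilities and labels; the 0.5 threshold and the function name `accuracy` are assumptions, not the library's test routine.

```python
import numpy as np


def accuracy(prob, label, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the 0/1 label."""
    pred = (prob > threshold).astype(int)
    return float(np.mean(pred == label))


# Made-up network outputs and true labels for five test samples.
prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
label = np.array([1, 0, 0, 0, 1])

acc = accuracy(prob, label)
print(acc)
```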

Code location

ch14, Level4

Reference

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Data set donors: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics (e-mail: ronnyk '@' sgi.com for questions).

https://archive.ics.uci.edu/ml/datasets/Census+Income