
Implementing Logistic Regression from Scratch

Kaggle Notebook


Classification

In statistics and machine learning, classification is a type of supervised learning. For this task, training data with known class labels are given and used to develop a classification rule for assigning new, unlabeled data to one of the classes. A special case of the task is binary classification, which involves only two classes. Some examples:

  • Classifying an email as spam or non-spam
  • Classifying a tumor as benign or malignant
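As a minimal illustration of this workflow (the feature values and the thresholding rule below are hypothetical, chosen only to show the idea of learning a rule from labeled data), a classifier is fit on labeled examples and then assigns labels to unseen points:

```python
import numpy as np

# Hypothetical labeled training data: one feature per email
# (e.g. a spam score), with known labels 1 = spam, 0 = non-spam.
X_train = np.array([0.2, 0.4, 1.8, 2.5])
y_train = np.array([0, 0, 1, 1])

# A deliberately simple classification rule learned from the data:
# threshold at the midpoint between the two class means.
threshold = (X_train[y_train == 0].mean() + X_train[y_train == 1].mean()) / 2

# Assign new, unlabeled data to one of the two classes.
X_test = np.array([0.3, 2.0])
predictions = (X_test > threshold).astype(int)
print(predictions)  # [0 1]
```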

The algorithms that sort unlabeled data into labeled classes are called classifiers. Loosely speaking, the Sorting Hat from Hogwarts can be thought of as a classifier that sorts incoming students into four distinct houses. In real life, some common classifiers are logistic regression, k-nearest neighbors, decision trees, random forests, support vector machines, naive Bayes, linear discriminant analysis, stochastic gradient descent, XGBoost, AdaBoost, and neural networks.

The Purpose of the Notebook

Many advanced libraries, such as scikit-learn, make it possible to train various models on labeled training data and predict on unlabeled test data with a few lines of code. While this is very convenient for day-to-day practice, it gives no insight into what really happens under the hood when we run that code. In the present notebook, we implement a logistic regression model from scratch, without using any advanced library, to understand how it works in the context of binary classification. The basic idea is to segment the computations into pieces and write a function to compute each piece in sequence, so that each function builds on the previously defined ones. Wherever applicable, we complement a function constructed with for loops with a much faster vectorized implementation of the same computation, as sketched below.
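As a rough sketch of that loop-versus-vectorized pattern (the function names here are illustrative, not necessarily those used in the notebook), the logistic regression prediction can be built from a small sigmoid piece, first with explicit for loops and then as a single matrix-vector product:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_loop(X, w, b):
    """Probability of class 1 for each row of X, via explicit for loops."""
    m, n = X.shape
    probs = np.zeros(m)
    for i in range(m):
        z = b
        for j in range(n):
            z += w[j] * X[i, j]
        probs[i] = sigmoid(z)
    return probs

def predict_vectorized(X, w, b):
    """The same computation expressed as one matrix-vector product."""
    return sigmoid(X @ w + b)

# The two implementations agree; the vectorized one is much faster.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
b = 0.1
assert np.allclose(predict_loop(X, w, b), predict_vectorized(X, w, b))
```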