Supervised Learning: Part I


Getting Started with Machine Learning

The classic definition of machine learning, usually attributed to Arthur Samuel, is: machine learning is the field of computer science that gives computers the ability to learn without being explicitly programmed.

Machine learning is a core sub-area of artificial intelligence; it enables computers to get into a mode of self-learning without being explicitly programmed. When exposed to new data, these programs can learn, grow, change, and develop by themselves. While the concept of machine learning has been around for a long time (an early and notable example: Alan Turing's WWII work on breaking the German Enigma cipher), the ability to apply complex mathematical calculations to big data automatically, iteratively, and quickly has been gaining momentum over the last several years.

To better understand the uses of machine learning, consider some of the places it is applied: the self-driving Google car, cyber fraud detection, and online recommendation engines such as friend suggestions on Facebook, the movies and shows Netflix thinks you might like, and the “more items to consider” and “get yourself a little something” sections on Amazon.

Machine Learning comes in different categories, each with its own importance. The categories of Machine Learning algorithms are:

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Semi-Supervised Machine Learning
  • Reinforcement Learning


Let’s start with Supervised Learning in this blog.

Supervised Learning

This kind of learning is possible when the inputs and outputs are clearly identified, and algorithms are trained using labeled examples. Supervised Learning is broadly divided into two parts:

  • Regression: If the desired output consists of the prediction of one or more continuous variables, the task is called regression. An example of a regression problem would be predicting the length of a salmon as a function of its age and weight. Common regression algorithms include:
    • Linear Regression
    • Polynomial Regression
    • Stepwise Regression
    • Ridge Regression
    • Lasso Regression
  • Classification: Classification is a learning approach in which the computer program learns from the labeled data given to it and then uses this learning to classify new observations. Common classification algorithms include:
    • Logistic Regression
    • Decision Trees
    • Naive Bayes
    • K-Nearest Neighbours
    • Neural Networks

Before we start, we need to clarify some ML notation.

Attributes or Features: An attribute is a property of an instance that may be used to determine its classification. In the IRIS dataset, the attributes are the petal and sepal length and width. They are also known as Features.

Target variable: In the machine learning context, the target variable is the variable that is, or should be, the output. In the IRIS dataset, the target variable is the flower species, which takes one of three values.
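To make these two terms concrete, here is a quick look at the IRIS dataset, which ships with scikit-learn (no download needed):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # the four attributes (features)
print(iris.target_names)    # the three species (target classes)
print(iris.data.shape)      # 150 instances x 4 features
```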

Now machine learning algorithms can be applied in 2 ways:

1) Implement your own algorithm from scratch.

2) Use third-party libraries like scikit-learn.

While learning, it is highly recommended that you code your algorithms from scratch to gain a thorough understanding, but by industry standards you are generally not expected to implement individual algorithms yourself; most practitioners use scikit-learn or some other library for that work. This blog will contain code of both types, i.e., from scratch and with sklearn.

Now let's get started with Regression.

Regression

Linear Regression

What is Linear Regression?

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:

  • Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?
  • Which variables in particular are significant predictors of the outcome variable, and in what way do they impact the outcome variable?

These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.

The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula y = c + b*x, where

  • y = estimated dependent variable score,
  • c = constant (y intercept),
  • b = regression coefficient (slope), and
  • x = score on the independent variable.

For example, in the task of predicting house prices, the different attributes of a house, such as the number of bedrooms, the carpet area, and the proximity to a hospital, become the independent variables (X: x1, x2, x3, …), while the price estimate of the house becomes the dependent variable (Y).
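As a toy illustration of y = c + b*x (all the numbers below are made up for the example, not estimated from data):

```python
# price = intercept + slope * carpet_area
c = 50_000      # hypothetical base price (intercept)
b = 120         # hypothetical price increase per sq. ft. (slope)
x = 1_500       # carpet area of a house, in sq. ft.

y = c + b * x   # estimated price for this house
print(y)        # 230000
```

With more than one independent variable, the same idea extends to y = c + b1*x1 + b2*x2 + …, one coefficient per attribute.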

Three major uses for regression analysis:

  • First, the regression might be used to identify the strength of the effect that the independent variable(s) have on a dependent variable. Typical questions are what is the strength of relationship between dose and effect, sales and marketing spending, or age and income.
  • Second, it can be used to forecast effects or impact of changes. That is, the regression analysis helps us to understand how much the dependent variable changes with a change in one or more independent variables. A typical question is, “how much additional sales income do I get for each additional $1000 spent on marketing?”
  • Third, regression analysis predicts trends and future values. The regression analysis can be used to get point estimates. A typical question is, “what will the price of gold be in 6 months?”


Linear Regression From scratch

import numpy as np
from statistics import mean
import matplotlib.pyplot as plt
from matplotlib import style
import random

style.use('fivethirtyeight')

xs = np.array([1,2,3,4,5,6])
ys = np.array([5,4,6,5,6,7])

#xs = np.array([1,2,3,4,5,6])
#ys = np.array([1,2,3,4,5,6])

# least-squares estimates of the slope (m) and intercept (b)
def best_fit_slope_and_intercept(xs, ys):
    m = ( ((mean(xs)*mean(ys)) - mean(xs*ys)) /
          ((mean(xs)**2) - mean(xs*xs)) )
    b = mean(ys) - m*mean(xs)
    return m, b

def sq_error(y_orig, y_line):
    return sum((y_line - y_orig)**2)

#R-square error
def coeff_of_determination(y_orig, y_line):
    y_mean_line = [mean(y_orig) for y in y_orig]
    sq_err_reg = sq_error(y_orig, y_line)
    sq_err_mean = sq_error(y_orig, y_mean_line)
    return 1 - (sq_err_reg/sq_err_mean)

def linearreg_predictor(xs, ys, pred_x):
    m, b = best_fit_slope_and_intercept(xs, ys)
    reg_line = [(m*x) + b for x in xs]
    pred_y = (m*pred_x) + b
    plt.scatter(xs, ys)
    plt.plot(xs, reg_line)
    plt.scatter(pred_x, pred_y)
    plt.show()

    r_sq = coeff_of_determination(ys, reg_line)
    print(r_sq)

linearreg_predictor(xs, ys, 4.5)

0.584415584416

The points on the graph are the input data points and the line is the best-fit line for the given input data.

R-Square: It determines how much of the total variation in Y (dependent variable) is explained by the variation in X (independent variable).
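As a sanity check, R-square can be computed directly from its definition (1 minus the ratio of residual to total sum of squares) and compared against scikit-learn's r2_score; the y_true/y_pred values below are made-up numbers for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([5, 4, 6, 5, 6, 7])
y_pred = np.array([4.8, 4.5, 5.5, 5.2, 6.1, 6.9])

# 1 - SS_res / SS_tot, the same quantity the from-scratch
# coeff_of_determination function computes
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
print(1 - ss_res / ss_tot)
print(r2_score(y_true, y_pred))  # same value
```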

Scikit-learn Code

Steps To Be followed When Applying an Algorithm using Sklearn (ML Pipeline):

  • Split the dataset into training and testing datasets. The testing dataset is generally smaller than the training one, since more training data helps the model learn better.
  • Select an algorithm based on the problem (classification or regression). Not every algorithm is suitable for every problem.
  • Pass the training dataset to the algorithm to train it, using the .fit() method.
  • Pass the testing data to the trained algorithm to predict the outcome, using the .predict() method.
  • Check the accuracy by comparing the predicted outcome with the actual output.
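These five steps can be sketched end to end on randomly generated data (the synthetic dataset and model choice here are only illustrative stand-ins for a real problem):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# synthetic data: 100 samples, 2 features, a known linear relationship
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.05, 100)

# 1. split; 2. pick an algorithm; 3. fit; 4. predict; 5. score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(model.score(X_test, y_test))  # R-square on the held-out data
```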
# importing basic libraries

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split

#importing dataset

train = pd.read_csv('Train_BigMartSales.csv',encoding = "ISO-8859-1")
test = pd.read_csv('Test_BigMartSales.csv',encoding = "ISO-8859-1")

# importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
test.head()
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type
0 FDW58 20.750 Low Fat 0.007565 Snack Foods 107.8622 OUT049 1999 Medium Tier 1 Supermarket Type1
1 FDW14 8.300 reg 0.038428 Dairy 87.3198 OUT017 2007 NaN Tier 2 Supermarket Type1
2 NCN55 14.600 Low Fat 0.099575 Others 241.7538 OUT010 1998 NaN Tier 3 Grocery Store
3 FDQ58 7.315 Low Fat 0.015388 Snack Foods 155.0340 OUT017 2007 NaN Tier 2 Supermarket Type1
4 FDY38 NaN Regular 0.118599 Dairy 234.2300 OUT027 1985 Medium Tier 3 Supermarket Type3
#Now follow the steps according to the ML Pipeline
#splitting into training and cv for cross validation

X = train.loc[:,['Outlet_Establishment_Year','Item_MRP']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales)

#training the model
clf = LinearRegression()
clf.fit(x_train,y_train)

#predicting on cv

pred = clf.predict(x_cv)

#calculating mse

mse = np.mean((pred - y_cv)**2)

r_sq = clf.score(x_cv,y_cv)

print("R-square ",r_sq)


## calculating coefficients

coeff = DataFrame(x_train.columns)

coeff['Coefficient Estimate'] = Series(clf.coef_)
print(coeff)
R-square  0.342111766614
                           0  Coefficient Estimate
0  Outlet_Establishment_Year            -13.386884
1                   Item_MRP             15.426249

We hope this post was helpful. Feel free to comment in case of doubts and do let us know your feedback. Stay tuned for more!