# NET ID: FILL IN HERE

##### This week we're introducing linear classifiers, namely the **Perceptron**, and then delving deeper into **Model Validation**.

### Exercises 1, 2, 3, and 5 are required. Exercises 4, 6, and 7 are optional.


# Lecture 8: Linear Classifiers and Model Validation

## Perceptron

The Perceptron was developed by American psychologist Frank Rosenblatt in 1957 at the **Cornell** Aeronautical Laboratory. Shout-out to one of our greatest alumni!

The Perceptron is a linear binary classifier. It assumes that the dataset has exactly two labels (conventionally +1 and -1) and that the two classes can be separated by a linear hyperplane. (Keep in mind that the [Multi-Layer Perceptron (MLP)](https://en.wikipedia.org/wiki/Multilayer_perceptron) can handle data that is not linearly separable. We won't cover MLPs in this course, since they belong to Artificial Neural Networks (ANN), which are outside our scope.)



The Perceptron "learns" a set of weights, one for each input feature (i.e. each column of X). For example, suppose we are given a dataset of dogs and cats whose features are three columns: the weight $x_1$, height $x_2$, and length $x_3$ of each animal. The Perceptron then keeps track of three weights: $w_1$, $w_2$, and $w_3$. Each feature is multiplied by its weight and the products are summed: $s = w_1 x_1 + w_2 x_2 + w_3 x_3$. If $s$ is greater than a certain threshold, we predict one class; otherwise, we predict the other. For example, with a threshold of 0: if $s > 0$, we predict the (+1) label (i.e. a dog), and if $s < 0$, we predict the (-1) label (i.e. a cat).

The final step is to check whether the predictions were correct. If some were not, the weights are updated, with the size of each update scaled by a learning rate. This process repeats for a certain number of iterations, known as "epochs." The end goal is to find weights whose linear decision boundary classifies every training point correctly.

<img src="https://cdn-images-1.medium.com/max/1600/1*n6sJ4yZQzwKL9wnF5wnVNg.png" alt="perceptron.png" style="width: 50%;"/>
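
To make the weighted sum and threshold rule concrete, here is a minimal sketch. The function names `perceptron_predict` and `perceptron_update`, the example numbers, and the default learning rate of 1 are all assumptions made for illustration; they are not taken from a particular library.

```python
import numpy as np

def perceptron_predict(w, x, threshold=0.0):
    """Return +1 if the weighted sum w . x exceeds the threshold, else -1."""
    s = np.dot(w, x)
    return 1 if s > threshold else -1

def perceptron_update(w, x, label, learning_rate=1.0):
    """Nudge the weights toward classifying the point (x, label) correctly."""
    return w + learning_rate * label * x

# Hypothetical weights and one hypothetical animal (weight, height, length)
w = np.array([0.5, -1.0, 0.25])
x = np.array([4.0, 1.0, 2.0])

# s = 0.5*4 - 1.0*1 + 0.25*2 = 1.5, which is above the threshold of 0
prediction = perceptron_predict(w, x)
print("prediction before update:", prediction)

# Suppose the true label is -1 (a cat): the prediction was wrong, so we update
w_new = perceptron_update(w, x, label=-1)
print("updated weights:", w_new)
print("prediction after update:", perceptron_predict(w_new, x))
```

After the update the weighted sum for this point becomes negative, so the same animal is now classified as -1.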

## Demo: Perceptron Learning Algorithm
This algorithm is simple and provides great intuition for how to use your data to find a good linear binary classifier. The perceptron algorithm is an __iterative__ algorithm: we repeatedly update our classifier __w__ until it performs well on our training data. Intuitively, we want to use the points that our classifier gets wrong to develop a better classifier. Let's see how our model improves across iterations:

In [None]:
# import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Plots our data points and the classifier's decision boundary
def plot_perceptron(w):
    plt.scatter(X[:,0], X[:,1], color = c)
    left = min(X[:,0])
    right = max(X[:,0])
    # the boundary w[0]*x1 + w[1]*x2 + w[2] = 0 can only be solved for x2 when w[1] != 0
    if w[1] != 0:
        xs = np.linspace(left, right, num=50)
        plt.plot(xs, [-(w[2] + w[0]*x)/w[1] for x in xs])
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.title('Perceptron Learning Example')
    plt.xlim([-3,3])
    plt.ylim([-4,4])
    plt.show()
    plt.close()

In [None]:
n = 100 # number of data points
X = np.random.randn(n,2)
offset = np.ones((n,1))
X = np.hstack((X, offset))
w_true = np.random.randn(3,1)

y = np.sign(X.dot(w_true))
c = []
for i in range(n):
    if y[i] > 0:
        c.append('r')
    else:
        c.append('b')
plot_perceptron(w_true)

#### As you can see above, our goal is to find a line that separates our blue data points from our red data points. Let's use the perceptron algorithm to do this:

In [None]:
# initialize a random normal weight vector w
w = np.random.randn(3)
# function that returns the index of a misclassified point, or None if 10000 random draws find none
def find_missclassified(w):
    for it in range(10000):
        i = np.random.randint(0, n)
        if y[i]*(X[i,:].dot(w)) <= 0:
            return i
    return None
plot_perceptron(w)

In [None]:
# Run this cell repeatedly, one update at a time, to see how our classifier improves at each iteration
point = find_missclassified(w)
print(point)
if point is None:
    print("Perfect Classifier!!!")
    plot_perceptron(w)
else:
    print("Updating perceptron")
    w = w + y[point]*X[point, :]
    plot_perceptron(w)

Slowly but surely our classifier is getting better! The intuition behind the perceptron is that we use the incorrectly classified points to change __w__ in order to make better guesses.
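
The cell-by-cell updates above can also be run in a single loop. The following is a minimal sketch, not the demo's exact code: the function name `train_perceptron`, the zero initialization, the fixed hidden weights, and the choice to drop points very close to the true boundary (so that the loop is guaranteed to finish quickly) are all assumptions made for illustration.

```python
import numpy as np

def train_perceptron(X, y, max_updates=10000, seed=0):
    """Repeat the perceptron update until no training point is misclassified."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for n_updates in range(max_updates):
        # a point is misclassified when its label and its score disagree in sign
        wrong = np.flatnonzero(y * (X @ w) <= 0)
        if wrong.size == 0:
            return w, n_updates
        i = rng.choice(wrong)      # pick one misclassified point at random
        w = w + y[i] * X[i]        # same update rule as in the demo cell
    return w, max_updates

# Synthetic, linearly separable data with a constant offset column
rng = np.random.default_rng(42)
X_demo = np.hstack((rng.standard_normal((100, 2)), np.ones((100, 1))))
w_hidden = np.array([1.0, -2.0, 0.5])   # hypothetical "true" separator
scores = X_demo @ w_hidden
keep = np.abs(scores) > 0.5             # assumption: enforce a margin around the boundary
X_demo, y_demo = X_demo[keep], np.sign(scores[keep])

w_learned, n_updates = train_perceptron(X_demo, y_demo)
train_acc = np.mean(np.sign(X_demo @ w_learned) == y_demo)
print("updates needed:", n_updates)
print("training accuracy:", train_acc)
```

Because the data is separable with a margin, the loop stops once every training point is classified correctly. On data that is not linearly separable it would instead run until `max_updates` is reached.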

# Problem 1: Perceptron Learning (4 points)



Let's create a perceptron to predict whether someone has breast cancer. This is no different from what you've done with models before, but we're going to write our code inside a function so we can reuse it later.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

"""
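classifier: an sklearn classifier to train
features, goal: the data points and their labels
n_training_points: the number of points in the train set
n_testing_points: the number of points in the test set (defaults to all remaining points)
Returns (train_accuracy, test_accuracy)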
"""
def classifier_accuracy(classifier, features, goal, n_training_points, n_testing_points=None):

    if n_training_points > len(goal):
        raise ValueError("bad input to classifier_accuracy: number of training points requested is greater than length of dataset")
    if n_training_points <= 0:
        raise ValueError("bad input to classifier_accuracy: number of training points requested must be greater than 0")

    if n_testing_points is None:
        n_testing_points = len(goal) - n_training_points

    if n_testing_points > len(goal):
        raise ValueError("bad input to classifier_accuracy: number of testing points requested is greater than length of dataset")
    if n_testing_points < 0:
        raise ValueError("bad input to classifier_accuracy: number of testing points requested must be at least 0")

    if n_training_points + n_testing_points > len(goal):
        raise ValueError("bad input to classifier_accuracy: number of training + testing points requested is greater than length of dataset")

    n_total_points = n_training_points + n_testing_points

    if n_training_points + n_testing_points < len(goal):
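        # sample without replacement so the same point cannot land in both the train and test sets
        indices = np.random.choice(len(features), n_total_points, replace=False)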
        features = features[indices,:]
        goal = goal[indices]

    test_size = n_testing_points / n_total_points

    ###############################################################################
    ##### Don't touch anything in this cell above this line! Only add code below.
    ###############################################################################

    #FILL HERE: Make a train test split with a test size of test_size


    #FILL HERE: train the classifier
    classifier.fit("FILL HERE", "FILL HERE")


    #FILL HERE: Compute your model's train and test accuracy using accuracy_score
    train_accuracy = "FILL HERE"
    test_accuracy = "FILL HERE"

    return train_accuracy, test_accuracy


X, y = load_breast_cancer(return_X_y=True)
# we train on 400 of the 569 data points, leaving the remaining 169 (about 30%) as the test set
train, test = classifier_accuracy(Perceptron(), X, y, 400)
print("train accuracy:\t", train)
print("test accuracy:\t", test)


# Problem 2: Data Limitations of Cross Validation Pt. 1 - Train Size (1 point)

Cross validation is an extraordinarily powerful technique and is used in almost every supervised learning problem. But, it does have its limitations -- some problems are inherent to supervised learning, and cross validation can't fix those issues.

One such limitation has to do with the size of the train set (i.e. how many points we pass into `model.fit`). Let's explore this limitation.

The code block below may include Python constructs that you're not familiar with. That's okay -- **we don't expect you to be able to understand all the code**, and that's not the point of this example. Just **set n_training_points** by replacing "FILL IN HERE", and **analyze the average accuracy** outputted. If you get an error, read the error message and change your number accordingly.

**Make sure to try both small (single digits) and big (triple digits) values of n_training_points!**

In [None]:
X, y = load_breast_cancer(return_X_y=True)

# TODO try setting n_training_points to a few different numbers. If you get an error, read the error message and change your number accordingly.
n_training_points = "FILL IN HERE"

n_testing_points = 100
accs = []
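for i in range(1000):
    # retry up to 50 times if a run fails with a ValueError (e.g. from an unlucky random sample)
    for _ in range(50):
        try:
            train_acc, test_acc = classifier_accuracy(Perceptron(), X, y, n_training_points, n_testing_points)
            accs.append(test_acc)
            break
        except ValueError as e:
            # errors caused by a bad n_training_points are re-raised so that you can read them
            if "bad input to classifier_accuracy:" in str(e):
                raise
            continue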

print("avg. accuracy:", round(np.mean(accs),4))


## Analysis

##### What is the relationship between # of training points and average accuracy?

# Problem 3: Data Limitations of Cross Validation Pt. 2 - Test Set Size (3 points)

What about test size (i.e. how many points we pass into our scoring function)? Do we also need a large test set?

Let's see what happens when we vary the size of the test set.

**In this problem, you will produce and analyze some scatterplots. We've provided most of the code; the bulk of your work will be in interpreting graphs.**

The dataset is about phone prices. We're trying to classify each phone into one of four price bins.

The following block of code loops through test sizes from 5 points to 45 points in steps of 5 (with a fixed number of training points). For _each_ of these test sizes, it runs `classifier_accuracy` 500 times and computes summary statistics of those 500 runs. Recall that this is similar to what we did with cross-validation, except that instead of producing one value, we loop through different test sizes and produce a value for each test size.

In [None]:
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

# setting up the configuration
num_iters_per_test_size = 500 # for each test size, we'll collect 500 accuracies
test_sizes = list(range(5,50,5))
cellphone_df = pd.read_csv("phone.csv")
goal = cellphone_df["price_range"]
features = cellphone_df.drop(columns=["price_range"]).to_numpy()

# will store summary statistics about train and test accuracy
train_acc_stats = []
test_acc_stats = []
for test_size in test_sizes:
    tree = DecisionTreeClassifier()
    accs = [classifier_accuracy(tree, features, goal, 1500, test_size) for i in range(num_iters_per_test_size)]
    train_acc_stats.append(stats.describe([x[0] for x in accs])) # train accuracy statistics
    test_acc_stats.append(stats.describe([x[1] for x in accs])) # test accuracy statistics

In [None]:
"""
Pass in one of the following:
- "nobs"
- "min"
- "max"
- "mean"
- "variance"
- "kurtosis"

Returns two arrays: one for train accuracy statistics and one for test accuracy statistics.
For example, get_summary_statistics("mean") will return:
1) the cross-validation train accuracy for each test size in `test_sizes`
2) the cross-validation test accuracy for each test size in `test_sizes`
"""
def get_summary_statistics(summary_statistic_name):
    index = 0
    if summary_statistic_name == "nobs":
        index = 0
    elif summary_statistic_name == "min":
        index = 1
    elif summary_statistic_name == "max":
        index = 1
    elif summary_statistic_name == "mean":
        index = 2
    elif summary_statistic_name == "variance":
        index = 3
    elif summary_statistic_name == "kurtosis":
        index = 5
    elif summary_statistic_name == "constant":
        return [1 for i in train_acc_stats], [1 for i in test_acc_stats]
    else:
        return None, None
    train_statistics = [stat[index] for stat in train_acc_stats]
    test_statistics = [stat[index] for stat in test_acc_stats]
    if summary_statistic_name == "min":
        train_statistics = [minmax[0] for minmax in train_statistics]
        test_statistics = [minmax[0] for minmax in test_statistics]
    elif summary_statistic_name == "max":
        train_statistics = [minmax[1] for minmax in train_statistics]
        test_statistics = [minmax[1] for minmax in test_statistics]
    return train_statistics, test_statistics
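
The indices used in `get_summary_statistics` follow the field order of the result returned by `scipy.stats.describe`. As a small sketch, here is that result for a handful of made-up accuracies (the values are hypothetical and chosen only for illustration):

```python
from scipy import stats

# Hypothetical accuracies from four runs
accuracies = [0.80, 0.85, 0.90, 0.95]
summary = stats.describe(accuracies)

# The result can be read by field name or by position
print("nobs     (index 0):", summary.nobs)
print("minmax   (index 1):", summary.minmax)    # a (min, max) pair
print("mean     (index 2):", summary.mean)
print("variance (index 3):", summary.variance)  # sample variance (ddof=1 by default)
print("skewness (index 4):", summary.skewness)
print("kurtosis (index 5):", summary.kurtosis)
```

This is why both "min" and "max" map to index 1 in the function above: they are the two entries of the `minmax` pair, which the function then unpacks.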

Below is some starter code for producing the scatterplots. All you need to do is change "constant" to one of the following summary statistic names:
- "nobs"
- "min"
- "max"
- "mean"
- "variance"
- "kurtosis"

and that summary statistic will be graphed below for different test sizes.

In [None]:
summary_statistic_name = "constant" # TODO replace this with the name of a summary statistic

train_yaxis, test_yaxis = get_summary_statistics(summary_statistic_name)

if train_yaxis is None:
    print("\"" + summary_statistic_name + "\" is not a valid statistic name! Make sure you spelled it correctly")
else:
    fig, axes = plt.subplots(1,2, sharex=True, sharey=True)
    fig.set_figwidth(15)
    fig.subplots_adjust(hspace=0.5)
    axes[0].scatter(test_sizes,train_yaxis)
    axes[0].set_title("Train accuracy " + summary_statistic_name, fontsize=16)
    axes[0].set_xlabel("Test size", fontsize=16)
    axes[0].set_ylabel(summary_statistic_name, fontsize=16)

    axes[1].scatter(test_sizes,test_yaxis)
    axes[1].set_title("Test accuracy " + summary_statistic_name, fontsize=16)
    axes[1].set_xlabel("Test size", fontsize=16)
    axes[1].set_ylabel(summary_statistic_name, fontsize=16)

    plt.show()


## Analysis

##### Which is better, a small test set or a large test set? Why is it better? What could be causing this problem in small/large test sets?


# Problem 4: Application to final project (OPTIONAL)

Try finding the variance of your accuracies for a model in your final project, just like in Problem 3.

# Problem 5: Selection Bias in Cross Validation (2 points)

Below, we use a decision tree to try to predict whether a phone both has talk time >= 20 and does not have 3G.

In [None]:
cellphone_df = pd.read_csv("phone.csv")
features = cellphone_df[['battery_power']]
goal = (cellphone_df["talk_time"] >= 20) & (cellphone_df["three_g"] == False)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

accuracies, positive_predictive_values, negative_predictive_values = [],[],[]
n_folds = 10
kf = KFold(n_folds)
for train_index, test_index in kf.split(features):
    # KFold yields positional indices, so select rows by position with .iloc
    X_train, X_test = features.iloc[train_index], features.iloc[test_index]
    y_train, y_test = goal.iloc[train_index], goal.iloc[test_index]
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)

    #true negative, false positive, false negative, true positive
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[False,True]).ravel()

    accuracies.append((tn + tp) / len(y_test))
    # fraction of truly positive samples predicted correctly (skipped if the fold has no positives)
    if tp + fn > 0:
        positive_predictive_values.append(tp / (tp + fn))
    # fraction of truly negative samples predicted correctly (skipped if the fold has no negatives)
    if tn + fp > 0:
        negative_predictive_values.append(tn / (tn + fp))

cv_accuracy = np.mean(accuracies)

print("Accuracy:", cv_accuracy)

That's an amazing accuracy! Now, let's look closer at our accuracy breakdown.

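We'll break the accuracy down by true label. The code above calls these two scores the **positive predictive value** and the **negative predictive value**: the first is the accuracy score if we only look at samples with a true label of "True", i.e. tp / (tp + fn), and the second is the accuracy score if we only look at samples with a true label of "False", i.e. tn / (tn + fp). (Elsewhere you will usually see these two quantities called the true positive rate, or recall, and the true negative rate, or specificity; "positive predictive value" more commonly refers to precision, tp / (tp + fp).)

So, the question _"Out of the phones that have a talk time >= 20 but don't have 3G, what percent do we predict correctly?"_ corresponds to the first of these scores.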
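
To see how overall accuracy and the per-label scores can differ, here is a minimal sketch on ten hand-made labels. The labels and predictions are hypothetical and are not taken from the phone data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth (3 positives, 7 negatives) and hypothetical predictions
y_true = np.array([True, True, True, False, False, False, False, False, False, False])
y_pred = np.array([True, False, False, False, False, False, False, False, False, True])

# With labels=[False, True], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[False, True]).ravel()

accuracy = (tn + tp) / len(y_true)
accuracy_on_positives = tp / (tp + fn)   # only samples whose true label is True
accuracy_on_negatives = tn / (tn + fp)   # only samples whose true label is False

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("overall accuracy:", accuracy)
print("accuracy on positive samples:", accuracy_on_positives)
print("accuracy on negative samples:", accuracy_on_negatives)
```

In this toy example the overall accuracy is 0.7 even though only one of the three positive samples is predicted correctly, because the negative samples outnumber the positive ones.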

In [None]:
cv_positive_predictive_value = np.mean(positive_predictive_values)
cv_negative_predictive_value = np.mean(negative_predictive_values)

print("accuracy for positive samples",  cv_positive_predictive_value)
print("accuracy for negative samples", cv_negative_predictive_value)

### Uh oh. We have a positive predictive value of 0%. How could this be, when our accuracy score is so high?

_Hint 1_: We use just `battery_power` to predict, and get a 98% accuracy. That seems wrong. How did our model have such a high accuracy? What contributed to that accuracy?

_Hint 2_: How do our predictions look? How does the goal look?

# Problem 6: A Common Mistake (optional)

The following piece of code contains a (fatal) issue. Find it! (No need to correct the error -- just state what it is.)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

KNN = KNeighborsClassifier()
KNN.fit(X_train, y_train)

y_pred = KNN.predict(X_test)

error = mean_squared_error(y_pred, y_test)

print("Error:", error)

### What's wrong?

_Hint: Think about what error is appropriate to use in regression vs. classification problems._


# Problem 7: Model Complexity and Performance Measures (optional)

As our models get more complex, we often need to **adjust how we measure and report model performance**. In this example, we will investigate how model complexity affects our performance measures. We will use KNN Regressors, which use a very similar concept to KNN Classifiers, to predict the battery power of a phone.

## Hypothesis: What happens to our performance measures when models get more complex?

FILL IN HERE

## Experimentation

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import scale
# use just 500 data points to help make this demo have more obvious results
cellphone_df = pd.read_csv("phone.csv").sample(n=500, random_state=42).reset_index()
goal = cellphone_df["battery_power"]

# we need to scale for KNN
simple_features = scale(cellphone_df[["blue","ram"]])
complex_features = scale(cellphone_df.drop(columns=["index","battery_power","price_range","fc","n_cores","m_dep"]))

#### A simple model

Create a [KNN Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html). Then, use 15-fold cross validation to measure its error (look at the examples near the bottom of [KFold's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) for reference). Since this is a regression problem, we measure the mean squared error (MSE) rather than accuracy.

In addition to calculating the average of the MSEs across the folds, also calculate the variance of those MSEs.
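
As a reference for the general pattern, here is a minimal sketch of K-fold cross validation that collects one error per fold and then reports the mean and variance. It deliberately uses a different setup from this problem: synthetic data, a `LinearRegression` model, and 5 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((60, 2))
y_toy = X_toy @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(60)

# shuffle=True is needed for random_state to have an effect
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X_toy):
    model = LinearRegression()
    model.fit(X_toy[train_idx], y_toy[train_idx])
    pred = model.predict(X_toy[test_idx])
    fold_errors.append(mean_squared_error(y_toy[test_idx], pred))

print("mean MSE across folds:", np.mean(fold_errors))
print("variance of MSE across folds:", np.var(fold_errors))
```

Each fold contributes one MSE, so the mean summarizes typical error while the variance shows how much that error changes from fold to fold.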

In [None]:
# create a KNN regressor that uses 10 neighbors

# create KFold object with 15 "splits". Use shuffle=True and random_state=42 (similar to how we pass random_state=42 into train_test_split).

simple_errors = []
# follow examples to iterate through folds. Use simple_features here, so that we create a simpler model
    # fit the regressor
    # use mean_squared_error to get the mse
    # append mse on this fold to the "simple_errors" list

# calculate and print out the average mse and the variance of the mse
print("average mse:", "FILL IN HERE")
print("variance of mse:", "FILL IN HERE")


#### A complex model

Do the same thing as above, but use complex_features instead of simple_features

In [None]:
# create a KNN regressor that uses 10 neighbors

# create KFold object with 15 "splits". Use shuffle=True and random_state=42 again.

complex_errors = []
# follow examples to iterate through folds. Use complex_features here, so that we create a more complex model
    # fit the regressor
    # use mean_squared_error to get the mse
    # append mse on this fold to the "complex_errors" list
# calculate and print out the average mse and the variance of the mse
print("average mse:", "FILL IN HERE")
print("variance of mse:", "FILL IN HERE")


In [None]:
plt.boxplot([simple_errors, complex_errors],labels=["simple model","complex model"])
plt.ylabel("MSE",fontsize=16)
plt.title("MSE of simple vs. complex KNNRegressor in 15-Fold CV")
plt.show()


## Analysis

#### Let's compare the simple and complex models. What did you observe about the mean?


FILL IN HERE

#### What did you observe about the variance?

FILL IN HERE

#### What could cause these results?


FILL IN HERE

#### Describe a situation in which a model with lower accuracy is more desirable than a model with higher accuracy. Be as specific as possible.


FILL IN HERE   

## Further Musings
So, we've encountered an issue with how we've been measuring performance, especially with complex models.

#### How could we address this issue? How can we give a more holistic report of a model's performance?

(This part is graded on completion and effort.)

FILL IN HERE