# NETID: <fill in here\>

### Problems
- Problem 1 (3 points total)
  - 1a (2 points)
  - 1b (1 point)
- Problem 2 (6 points total)
  - 2a (2 points)
  - 2b (2 points)
  - 2c (2 points)
- Problem 3 (1 point)
- Bonus (2 points)

# Applications of Supervised Learning

Last class we covered a popular machine learning model used for classification: K-Nearest Neighbors (KNN). In this lecture, we went over two more classification models: Decision Trees and Logistic Regression. Like KNN, each of these models has its own underlying assumptions and advantages.

### Import necessary packages

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

# Decision Trees

The decision tree algorithm can be used for both classification and regression. It has the advantage of not assuming a linear model*. Decision trees are usually easy to represent visually, which makes it easy to understand how the model actually works.

\* Decision trees are piecewise linear - they have linear boundaries for components, but having multiple branches/layers makes the overall boundary non-linear.

### Geometric Intuition

In the image below, we can interpret decision trees abstractly as tree figures. The tree branches from a deciding condition (node) into another deciding condition or a final classification. For instance, at the top node, a sample can either have \<61k income or \>=61k income. If it is the former, we classify the sample as not having attended the Burning Man festival. Otherwise, we continue branching.

These condition nodes can be visualized as linear boundaries on a coordinate system. Notice which conditions correspond to which lines. Also note how some boundaries do not stretch the entire plane, because they were branched off other nodes.

![image](https://docs.microsoft.com/en-us/azure/machine-learning/studio/media/algorithm-choice/image5.png)
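
One way to read such a tree is as nested `if`/`else` statements. The sketch below hard-codes a tiny two-level tree. Only the 61k income threshold comes from the example above; the age threshold, the second-level leaf labels, and the function name `attended_burning_man` are hypothetical stand-ins for the remaining nodes.

```python
def attended_burning_man(income, age):
    """Hand-written two-level decision tree (illustrative thresholds only)."""
    # Root node: the income split described in the text.
    if income < 61_000:
        return False  # leaf: classified as not having attended
    # Hypothetical second node: the real tree's next condition may differ.
    if age < 29:
        return True
    return False


# Each call follows exactly one path from the root to a leaf.
low_income = attended_burning_man(50_000, 25)
young_high_income = attended_burning_man(80_000, 25)
older_high_income = attended_burning_man(80_000, 40)
```

Because the age condition is only checked when income is at least 61k, its boundary covers only part of the plane, which is why some lines in the plot do not stretch across the whole figure.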

### Mathematical Intuition
To understand the mathematical basis behind decision trees, we ask a motivating question: How do we know which conditions to split upon?

We can create a measure to determine the quality of a condition node. This measure can be based on what feature we choose, as well as the specific point we decide to split upon. Data scientists refer to this quantity as *entropy*.

The goal of the full decision tree algorithm is to take the necessary steps to minimize entropy, choosing the right features at every stage to accomplish this.
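
As a rough illustration, the sketch below computes the entropy of the labels at a node and the weighted entropy after a candidate split. It assumes the base-2 Shannon definition; the helper names `entropy` and `split_entropy` and the toy labels are made up for this example. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default and only uses entropy when `criterion="entropy"` is passed.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def split_entropy(left, right):
    """Entropy after a split: the size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


# Toy node with three benign ("b") and three malignant ("m") samples.
parent = ["b", "b", "b", "m", "m", "m"]
parent_entropy = entropy(parent)

# A split that leaves both children mixed barely reduces entropy ...
mixed_split = split_entropy(["b", "m", "b"], ["m", "b", "m"])
# ... while a split that separates the classes perfectly removes it entirely.
pure_split = split_entropy(["b", "b", "b"], ["m", "m", "m"])
```

A tree-building algorithm would prefer the second split here, since it produces the lower entropy.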

### Example: Breast Cancer Diagnosis
The following dataset contains information about digitized images of a fine needle aspirate (FNA) of a breast mass. Each row in our dataset contains data for a patient. The 'diagnosis' column tells us whether a patient's diagnosis was benign (b) or malignant (m).

In [2]:
df = pd.read_csv('lecture7example.csv')
X=df.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
Y=df['diagnosis']
df.head()


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=1998)

Last week, we built a KNN classifier for this problem. We created a train-test split of our data above, and the code below trains a KNN classifier. As we learned last class, accuracy_score() calculates the fraction of our predictions that are correct.
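
To make the definition concrete, here is a minimal sketch on made-up labels (the lists `y_true` and `y_pred` are illustrative and are not taken from the dataset). It compares `accuracy_score` with the same fraction computed by hand.

```python
from sklearn.metrics import accuracy_score

# Toy ground-truth labels and predictions; one of the four predictions is wrong.
y_true = ["b", "m", "m", "b"]
y_pred = ["b", "m", "b", "b"]

# Fraction of positions where the prediction matches the true label.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# accuracy_score takes the true labels first, then the predictions.
acc = accuracy_score(y_true, y_pred)
```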

In [None]:
# K-nearest neighbors
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
knn_pred_train = knn.predict(X_train)
knn_pred_test = knn.predict(X_test)
print("Train Accuracy: ", accuracy_score(Y_train, knn_pred_train))
print("Test Accuracy: ", accuracy_score(Y_test, knn_pred_test))

## Problem 1a (2 pts)
Our KNN classifier performed pretty well at predicting which cases are malignant and which are benign. Now we are going to see how a decision tree performs. In the next cell, train the decision tree classifier on our training data, and then calculate the training accuracy and testing accuracy.

In [None]:
# This is the function we use to create the decision tree model
model = tree.DecisionTreeClassifier(max_depth = 5)

# TODO: train the model
# FILL IN HERE

# TODO: Calculate the training and testing accuracy
dtree_pred_train = "FILL IN HERE"
dtree_pred_test = "FILL IN HERE"
print("Train Accuracy: ", "FILL IN HERE")
print("Test Accuracy: ", "FILL IN HERE")


## Problem 1b (1 pt)
Interpret the accuracy values you found with the DecisionTreeClassifier. Please make sure to answer the following questions:
1. How do these scores differ from the scores of the KNN classifier?
2. Is the model underfitting or overfitting our data?
3. How do the scores change as we vary the max_depth of our tree?

Fill in answer here:

# Logistic Regression

Logistic regression, like linear regression, is a generalized linear model. However, the final prediction of a logistic regression model is not continuous; it is binary (0 or 1). The model first produces a value between 0 and 1, which is then converted into one of the two classes. The following sections will explain how this works.

### A Mathematical Overview (for the brave)
The goal of logistic regression is to take a set of datapoints and classify them. This means that we expect to have discrete outputs representing a set of classes. In simple logistic regression, this must be a binary set: our classes must be one of only two possible values. Here are some things that are sometimes modeled as binary classes:

<li> Sick or Not Sick </li>
<li> Rainy or Dry </li>
<li> Democrat or Republican </li>

The objective is to find an equation that is able to take input data and classify it into one of the two classes. Luckily, the logistic equation is well suited to just such a task.

The <b>logistic equation</b> is the basis of the logistic regression model. It looks like this:

![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/5e648e1dd38ef843d57777cd34c67465bbca694f)

The t in the equation is some linear combination of n variables, or a linear function in an n-dimensional feature space. The formulation of t therefore has the form ax+b. In fitting a logistic regression model, the goal is to minimize the error of the logistic equation's outputs on the training data by tuning a and b.


The logistic equation (also known as the sigmoid function) works as follows:
1. Takes an input of n variables
2. Takes a linear combination of the variables as parameter t (this is another way of saying t has the form ax+b)
3. Outputs a value given the input and parameter t

The output of the logistic equation is always between 0 and 1.
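
The steps above can be sketched for a single input variable as follows. The coefficients `a` and `b` are hypothetical values chosen for illustration rather than fitted ones, and thresholding at 0.5 is the common convention assumed here for turning the output into a class.

```python
from math import exp


def sigmoid(t):
    """Logistic equation: maps any real number t to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-t))


def predict_proba(x, a, b):
    """Form the linear combination t = a*x + b, then apply the logistic equation."""
    return sigmoid(a * x + b)


# Hypothetical coefficients; fitting a model means tuning these to the data.
a, b = 2.0, -1.0

probs = [predict_proba(x, a, b) for x in (-3.0, 0.5, 3.0)]

# Convert each output into a binary class by thresholding at 0.5.
labels = [1 if p >= 0.5 else 0 for p in probs]
```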

A visualization of the outputs of the logistic equation is shown below (note that this is but one possible output of a logit regression model):
![image](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)

### Income Prediction
We'll use logistic regression to predict whether annual income is greater than $50k based on census data. You can read more about the dataset <a href="https://www.kaggle.com/uciml/adult-census-income">here</a>.

In [4]:
inc_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income'])
# drop null values
inc_data = inc_data.dropna()
inc_data.head()


| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Our goal is to predict whether a person's income is <=50K or >50K. Right now the data in the income column is stored as a string, but we want to look at it as binary data.

Below, we have converted the data in that column so that an income value of <=50K would be a 0, and an income value of >50K would be a 1. We iterate over the dataframe and use an if/else statement with " <=50K" and " >50K" (notice the spaces), but we can alternatively use `pd.get_dummies()`.
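
For comparison, the `pd.get_dummies()` alternative mentioned above might look like the following sketch. It runs on a small made-up Series rather than the real column, and the names `income` and `income_binary` are illustrative. Stripping the leading spaces first is an optional convenience assumed here.

```python
import pandas as pd

# Toy stand-in for the income column, including the leading spaces.
income = pd.Series([" <=50K", " >50K", " <=50K", " >50K"], name="income")

# get_dummies creates one indicator column per distinct label.
dummies = pd.get_dummies(income.str.strip())

# Keep the ">50K" indicator as the binary target: 1 for >50K, 0 for <=50K.
income_binary = dummies[">50K"].astype(int)
```

This avoids the explicit loop and does not assign into the original DataFrame.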

In [None]:
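# work on a copy so that we do not assign into a slice of inc_data
inc = inc_data['income'].copy()

# map the string labels (note the leading spaces) to 0/1 by position
for i in range(len(inc)):
    if inc.iloc[i] == " <=50K":
        inc.iloc[i] = 0
    elif inc.iloc[i] == " >50K":
        inc.iloc[i] = 1
print(inc)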



Instead of manually converting all categorical data to quantitative data, we will use scikit-learn's LabelEncoder class.
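
As a small illustration of what the encoder does, the sketch below applies it to a few made-up `workclass` values (the list and variable names are illustrative). Each distinct label is replaced by its position in the sorted list of labels, so the resulting integers carry no meaningful order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy categorical column.
workclass = ["Private", "State-gov", "Private", "Self-emp-not-inc"]

encoder = LabelEncoder()
codes = encoder.fit_transform(workclass)

# classes_ holds the sorted unique labels; each code indexes into it.
classes = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded = list(encoder.inverse_transform(codes))
```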

In [3]:
# the column is present in both categorical and numeric form
del inc_data['education']

# convert all features to categorical integer values
enc = LabelEncoder()
for i in inc_data.columns:
    inc_data[i] = enc.fit_transform(inc_data[i])


## Problem 2a (2 pts)

Build a logistic regression model predicting income based on other income related factors (e.g. `education.num`). You should split the dataset into a training set and a test set as covered previously in the course, fit the model on the observations in the training set, and predict the target variable for the test set. Save your predictions in a variable named `predictions`.

In [None]:
# we separate X (features) and income Y (target)
incX = inc_data[['education.num']]
incY = inc_data['income']

# TODO: train test split your data with 20% being used for testing
incX_train, incX_test, incY_train, incY_test = "FILL IN HERE"

# This is the function we use to create the logistic regression model
model = LogisticRegression()

# TODO: fit the model using the train data
# FILL IN HERE

# TODO store the predictions for the training and test set
pred_train = "FILL IN HERE"
pred_test = "FILL IN HERE"

print("Test Accuracy: ", accuracy_score("FILL IN HERE", "FILL IN HERE"))
print("Training Accuracy: ", accuracy_score("FILL IN HERE", "FILL IN HERE"))




## Problem 2b (2 pts):
Let's see how a decision tree classifier performs with different `max_depth` values.

Complete the following code so that we find the `max_depth` that gives us the best test accuracy. We can do this by iterating over various values of depth `k`, training a tree at each depth, and keeping track of the depth with the highest test accuracy.

In [None]:
best_depth = 1      # Keep track of depth that produces tree with highest accuracy
best_accuracy = 0   # The best accuracy from a given tree
for k in range(1, 100):
    # TODO: create and fit a model of depth k
    model = "FILL IN HERE"

    # TODO: find the accuracy of the model's predictions (pred_test)
    # compared to the actual samples, and store the accuracy in acc_test.
    pred_test = "FILL IN HERE"
    acc_test = "FILL IN HERE"

    # TODO: compare the accuracy found with the best current depth/accuracy found
    # and update if necessary

    # YOUR CODE GOES HERE

print(best_accuracy)
print(best_depth)

## Problem 2c (2 pts):
Using the most accurate depth value found in part (b), estimate the ERROR (not accuracy) of your model by using 5-fold cross validation, where error = 1 - accuracy. Refer to the documentation found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).
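
To see what `KFold.split` returns before using it, here is a minimal sketch on a made-up six-row array with three folds. The names `toy_X`, `kf_demo`, and `folds` are illustrative; no model is trained here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples with a single feature each.
toy_X = np.arange(6).reshape(6, 1)

# Without shuffling, the folds are consecutive blocks of rows.
kf_demo = KFold(n_splits=3)

# split() yields one (train indices, test indices) pair per fold.
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in kf_demo.split(toy_X)
]
```

Every sample appears in exactly one test fold, and the indices are positions, so a pandas DataFrame would be sliced with `.iloc`.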

In [None]:
from sklearn.model_selection import KFold

# Fill in code here
kf = "FILL IN HERE"
errors = 0
for train_index, test_index in kf.split(incX):
    # TODO: define X_train, X_test, Y_train, Y_test


    # TODO: define model
    model = "FILL IN HERE"

    # TODO: fit model
    # FILL IN HERE

    # TODO: compare predictions with actual targets, add the error (1 - accuracy_score) to errors
    # FILL IN HERE
errors /= 5
print(errors)


## Problem 3 (1 pt):
How does the depth of a decision tree affect overfitting?


###### fill in here

## Extra credit: Random Forests (possible + 2 pts)
Random Forests are essentially many decision trees combined. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
2. Train a classification or regression tree fb on Xb, Yb.

After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x', or, for classification trees, by taking a majority vote.

Implement a random forest classifier by creating and training 20 decision trees with max_depth 5. Let the predictions be chosen through majority voting on the total training data. Does your model perform better than a single decision tree?

#### **Note: sampling with "replacement" is important**  
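
The sketch below illustrates why replacement matters, using only Python's standard library. The sample size `n = 1000` and the seed are arbitrary choices for the illustration. Drawing n indices with replacement leaves out some rows and repeats others, so each tree sees a different dataset. Drawing n indices without replacement is just a reordering of the full training set.

```python
import random

rng = random.Random(0)  # local generator so the global random state is untouched
n = 1000

# Bootstrap sample: n draws with replacement, so duplicates are expected.
bootstrap = [rng.randint(0, n - 1) for _ in range(n)]
unique_fraction = len(set(bootstrap)) / n

# Without replacement, n draws from n items always cover every item once.
without_replacement = rng.sample(range(n), n)
covers_everything = len(set(without_replacement)) == n
```

For large n, the fraction of distinct rows in a bootstrap sample approaches 1 - 1/e, or roughly 63%.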

In [None]:
import random

# Draw `size` random indices with replacement (a bootstrap sample) for each tree
def rand_sample(size):
    indices = []
    for i in range(size):
        indices.append(random.randint(0,size-1))
    return indices

# Load the whole dataset into X_train and Y_train and initialize a variable tree_preds to contain each tree's prediction
X_train = X
Y_train = Y
tree_preds = []

# Create 20 Decision Trees for the lecture 7 dataset
for t in range(20):
    model = tree.DecisionTreeClassifier(max_depth=5)
    sample = rand_sample(df.shape[0])
    X_train_tree = X_train.iloc[sample]
    Y_train_tree = Y_train.iloc[sample]
    # TODO: FILL In Code Here

print("Accuracy of one decision tree: ", "FILL IN HERE")
print("Accuracy of the random decision forest: ", "FILL IN HERE")
