{"cells":[{"cell_type":"markdown","metadata":{"id":"bB7k4iHZVyRp"},"source":["\n","NETID: PLEASE FILL ME IN\n"]},{"cell_type":"markdown","metadata":{"id":"t10Fmd_VVyRv"},"source":["# Introduction to Classifiers"]},{"cell_type":"markdown","metadata":{"id":"zJFjeouTgrai"},"source":["### Problems\n","- Problem 1 (4 points)\n","- Problem 2 (3 points)\n","- Problem 3 (2 points)\n","- Problem 4 (1 point)"]},{"cell_type":"markdown","metadata":{"id":"WhGluyWlVyRx"},"source":["Two lectures ago we covered linear regression and predicting the value of a continuous variable. We use __classifiers__ to predict binary or categorical variables. Classifiers can help us answer yes/no questions or categorize an observation into one of several categories.\n","\n","## kNN Classifier\n","\n","There are various classification algorithms, each of which is better suited to some situations than others. In this lecture we will learn about __kNN__ (k-nearest neighbors), one of these classifiers."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jXnamTZwVyRy","outputId":"9bbbee22-4708-4365-abfb-a6bfc8978404","scrolled":true},"outputs":[{"data":{"text/html":["
"],"text/plain":[" diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n","0 M 17.99 10.38 122.80 1001.0 \n","1 M 20.57 17.77 132.90 1326.0 \n","2 M 19.69 21.25 130.00 1203.0 \n","3 M 11.42 20.38 77.58 386.1 \n","4 M 20.29 14.34 135.10 1297.0 \n","\n"," smoothness_mean compactness_mean concavity_mean concave points_mean \\\n","0 0.11840 0.27760 0.3001 0.14710 \n","1 0.08474 0.07864 0.0869 0.07017 \n","2 0.10960 0.15990 0.1974 0.12790 \n","3 0.14250 0.28390 0.2414 0.10520 \n","4 0.10030 0.13280 0.1980 0.10430 \n","\n"," symmetry_mean ... radius_worst texture_worst perimeter_worst \\\n","0 0.2419 ... 25.38 17.33 184.60 \n","1 0.1812 ... 24.99 23.41 158.80 \n","2 0.2069 ... 23.57 25.53 152.50 \n","3 0.2597 ... 14.91 26.50 98.87 \n","4 0.1809 ... 22.54 16.67 152.20 \n","\n"," area_worst smoothness_worst compactness_worst concavity_worst \\\n","0 2019.0 0.1622 0.6656 0.7119 \n","1 1956.0 0.1238 0.1866 0.2416 \n","2 1709.0 0.1444 0.4245 0.4504 \n","3 567.7 0.2098 0.8663 0.6869 \n","4 1575.0 0.1374 0.2050 0.4000 \n","\n"," concave points_worst symmetry_worst fractal_dimension_worst \n","0 0.2654 0.4601 0.11890 \n","1 0.1860 0.2750 0.08902 \n","2 0.2430 0.3613 0.08758 \n","3 0.2575 0.6638 0.17300 \n","4 0.1625 0.2364 0.07678 \n","\n","[5 rows x 31 columns]"]},"execution_count":16,"metadata":{},"output_type":"execute_result"}],"source":["import numpy as np\n","import pandas as pd\n","from sklearn.model_selection import train_test_split\n","from sklearn.neighbors import KNeighborsClassifier\n","\n","df = pd.read_csv('lecture6data.csv')\n","df=df.drop('Unnamed: 32',axis=1)\n","df=df.drop('id',axis=1)\n","df.head()"]},{"cell_type":"markdown","metadata":{"id":"9akLHuF-VyR0"},"source":["## _Problem 1 (4 points)_\n","\n","Build a kNN model predicting whether an observation is benign or malignant. You should split the dataset into a training set and a test set as covered previously in the course, fit the model on the observations in the training set, and predict the target variable for the test set.\n","\n","There are a couple of things to note for this problem. First, you are free to choose whichever features you want to predict the target feature, but you should not use id or the target variable itself. 
Second, you can optionally choose the k parameter for the kNN model (the default value is 5).\n","\n","Save your predictions in a variable named \"predictions\".\n","\n","**Please do not change the variable names already provided as they are used later in the demo**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PWhmOP4zVyR0"},"outputs":[],"source":["# TODO separate your X (features) and your Y (target).\n","\n","# TODO train test split your data with 20% being used for testing\n","x_train, x_test, y_train, y_test = \"FILL IN HERE\"\n","\n","# This is the function we use to create the kNN model (default k=5)\n","model = KNeighborsClassifier()\n","\n","# TODO fit the model using the train data\n","\n","# TODO store the predictions for the test set\n","predictions = \"FILL IN HERE\"\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"CJ4MAE58VyR1"},"outputs":[],"source":["# TODO find the accuracy score of your predictions\n","from sklearn.metrics import accuracy_score\n","print(\"sklearn's accuracy score for diagnosis:\", accuracy_score(\"FILL IN HERE\", \"FILL IN HERE\"))\n"]},{"cell_type":"markdown","metadata":{"id":"f7ZJD1i4VyR2"},"source":["### _end of Problem 1_"]},{"cell_type":"markdown","metadata":{"id":"Ea8mH3RFVyR2"},"source":["## Measuring Accuracy\n","\n","Measuring the accuracy of a classifier is more intuitive than calculating the accuracy of a linear regression model. When we predict categorical values, our accuracy score is simply the proportion of observations that we predicted correctly. For example, if we have a test set of size 100 and we predict 93 of the observations correctly, we have an accuracy score of 93 percent."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"oxEhC3VjVyR3","outputId":"beea10db-fb8a-4397-8d29-b2764f1334ed"},"outputs":[{"name":"stdout","output_type":"stream","text":["accuracy: 0.956140350877193\n","baseline: 0.6228070175438597\n","improvement: 0.5352112676056336\n"]}],"source":["# Compute the accuracy score of the model created above\n","accuracy = accuracy_score(y_test, predictions)\n","print('accuracy:', accuracy)\n","\n","# Compute the accuracy of predicting that every diagnosis is benign ('B')\n","y_train.describe()\n","y_test.describe()\n","base_array = np.full(len(y_test), 'B')  # one 'B' prediction per test observation\n","\n","baseline = accuracy_score(y_test, base_array)\n","print('baseline:', baseline)\n","\n","# Compute the percent improvement from the baseline\n","improvement = (accuracy - baseline) / baseline\n","print('improvement:', improvement)"]},{"cell_type":"markdown","metadata":{"id":"6_hroBPcVyR4"},"source":["The above improvement shows just how beneficial the kNN model can be. It also suggests that we have chosen a reasonable value for k, since the model clearly outperforms the baseline of predicting the majority class ('B', benign) for every observation.\n","\n","## Fit/Overfitting\n","\n","Below are accuracy scores of the same kNN model, but with different values of k. Note how the accuracy changes as k increases.\n","
As mentioned during the lecture, a higher value of k can improve the accuracy of the model, but if k is too high the model essentially predicts the majority class of the whole training set for every observation."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"cfUXCvzEVyR4"},"outputs":[],"source":["# Model when k=1\n","model1 = KNeighborsClassifier(n_neighbors=1)\n","model1.fit(x_train, y_train)\n","predictions1 = model1.predict(x_test)\n","\n","# Model when k=10\n","model10 = KNeighborsClassifier(n_neighbors=10)\n","model10.fit(x_train, y_train)\n","predictions10 = model10.predict(x_test)\n","\n","# Model when k=100\n","model100 = KNeighborsClassifier(n_neighbors=100)\n","model100.fit(x_train, y_train)\n","predictions100 = model100.predict(x_test)\n","\n","print(\"accuracy score when k=1:\", accuracy_score(y_test, predictions1))\n","print(\"accuracy score when k=10:\", accuracy_score(y_test, predictions10))\n","print(\"accuracy score when k=100:\", accuracy_score(y_test, predictions100))\n"]},{"cell_type":"markdown","metadata":{"id":"JOdpzyS0VyR5"},"source":["## _Problem 2 (3 points)_\n","\n","Now we are going to plot the relationship between the value of k and the accuracy score of the model for this data set.\n","\n","Using a loop, create models with k ranging from 1 to 30. Find the accuracy for each of these models and graph the results with the number of neighbors on the x-axis and accuracy on the y-axis. Please label your axes, and add a title to your plot.\n","\n","**You do not need to redo the train test split.**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"vmMKbkXCVyR6"},"outputs":[],"source":["import matplotlib.pyplot as plt\n","%matplotlib inline\n","from sklearn.metrics import accuracy_score"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"8OQjmhePVyR7"},"outputs":[],"source":["# TODO find the accuracy of the model with each value of k from 1-30 inclusive"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zDyQauwRVyR7"},"outputs":[],"source":["# TODO create the plot of the accuracy array determined in the previous cell"]},{"cell_type":"markdown","metadata":{"id":"mgXa0kpoVyR8"},"source":["### _end of Problem 2_\n","\n","## Confusion Matrix\n","\n","**Reminder**: The confusion matrix is depicted below.\n","\n","
\n","|   | Positive (Predicted) | Negative (Predicted) |\n","| --- | --- | --- |\n","| **Positive (Actual)** | True Positive | False Negative |\n","| **Negative (Actual)** | False Positive | True Negative |
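\n","\n","As a quick aside, scikit-learn can build this matrix for us. The next cell is a minimal sketch (it assumes the `y_test` and `predictions` variables from Problem 1, and treats 'M' (malignant) as the positive class):"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.metrics import confusion_matrix\n","\n","# Rows are actual classes, columns are predicted classes;\n","# labels=['M', 'B'] puts the assumed positive class ('M') first\n","cm = confusion_matrix(y_test, predictions, labels=['M', 'B'])\n","tp, fn, fp, tn = cm.ravel()\n","\n","print('sensitivity:', tp / (tp + fn))\n","print('specificity:', tn / (tn + fp))"]},{"cell_type":"markdown","metadata":{},"source":["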
\n","\n","Here are the equations specified in the lecture for your convenience with the next problem.\n","\n","**Sensitivity** = True Positive / (True Positive + False Negative)\n","\n","**Specificity** = True Negative / (True Negative + False Positive)\n","\n","**Accuracy** = (True Positive + True Negative) / Total\n","\n","**Error** = (False Positive + False Negative) / Total\n","\n","**Precision** = True Positive / (True Positive + False Positive)"]},{"cell_type":"markdown","metadata":{"id":"bOoPfT6oVyR8"},"source":["## _Problem 3 (2 points)_\n","\n","Given the table below, calculate the **Specificity**, **Sensitivity**, **Overall Error Rate**, **Overall Accuracy**, and **Precision** of the classifier. (Show us the calculations, don't just hard-code the answers!)\n","\n","
\n","|   | Positive (Predicted) | Negative (Predicted) |\n","| --- | --- | --- |\n","| **Positive (Actual)** | 146 | 32 |\n","| **Negative (Actual)** | 21 | 590 |
"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tomE5mxCVyR8","outputId":"b629f495-5216-4585-b11c-69a334ee03d7"},"outputs":[{"name":"stdout","output_type":"stream","text":["sensitivity: FILL IN HERE\n","specificity: FILL IN HERE\n","accuracy: FILL IN HERE\n","error: FILL IN HERE\n","precision: FILL IN HERE\n"]}],"source":["print(\"sensitivity:\", \"FILL IN HERE\")\n","print(\"specificity:\", \"FILL IN HERE\")\n","print(\"accuracy:\", \"FILL IN HERE\")\n","print(\"error:\", \"FILL IN HERE\")\n","print(\"precision:\", \"FILL IN HERE\")"]},{"cell_type":"markdown","metadata":{"id":"TiszM_7pVyR9"},"source":["### _end of Problem 3_"]},{"cell_type":"markdown","metadata":{"id":"xk7BetDjhqVW"},"source":["## _Problem 4_ (1 point)\n","\n","Let's say you want to run kNN on a dataset with both continuous features and binary features. Can you think of any potential issues that might arise from mixing these data types? Also, how might you preprocess categorical data to use in a kNN? Ordinal data?\n","\n"]},{"cell_type":"markdown","metadata":{"id":"01g3Z8-3iRUR"},"source":["#### fill in here"]},{"cell_type":"markdown","metadata":{"id":"6DoYMM9ujvTA"},"source":["### Just a reminder about the mid semester feedback form posted on ED. We'd really appreciate it if you could fill it out!!"]},{"cell_type":"markdown","metadata":{"id":"cFuBOiB3VyR9"},"source":["## _Problem 5_ (extra credit)"]},{"cell_type":"markdown","metadata":{"id":"ZQdHNetaVyR9"},"source":["\n","Before running kNN, which of the following kinds of preprocessing should we do? Choose all that apply.\n","\n","1) Scale\n","\n","2) Center\n","\n","3) Remove correlated features\n","\n","4) Remove outliers\n"]},{"cell_type":"markdown","metadata":{"id":"P-W21H7BVyR9"},"source":[]},{"cell_type":"markdown","metadata":{"id":"D5tvjsHHVyR-"},"source":["### _end of Problem 5_"]},{"cell_type":"markdown","metadata":{"id":"XO2XXoKNVyR-"},"source":["## _Problem 6_ (extra credit)\n","\n","We've talked about sensitivity and specificity. Recall these high level intuitions:\n","- high sensitivity -> able to correctly identify positives\n","- high specificty -> able to correctly identify negatives"]},{"cell_type":"markdown","metadata":{"id":"DIVMXZQHVyR-"},"source":["### Part a\n","Identify a model that has have 100% sensitivity, no matter what dataset it is run on. Similarly, identify a model with 100% specificity.\n","\n","**Hint**: Recall that a \"model\" is just a function, meaning that it takes in an input and *spits out an output*. Your job is to figure out, if you get an input x, should the output for that x be 0 or should it be 1?\n"]},{"cell_type":"markdown","metadata":{"id":"U9j4qBSfVyR-"},"source":[]},{"cell_type":"markdown","metadata":{"id":"vihFMmwgVyR-"},"source":["### Part b\n","In Problem 2, you plotted kNN accuracy vs `k`. Now, make a plot kNN sensitivity vs. `k` and another plot for kNN specificity. 
Use the same dataset as in Problem 2, and go from k=1 to k=30."]},{"cell_type":"markdown","metadata":{"id":"pbdYqCNYVyR-"},"source":["### Part c\n","Now, plot the average of specificity and sensitivity against the number of neighbors."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mqjwOrhVVyR_"},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{"id":"owWv0IpUVyR_"},"source":["### Part d (just for fun)\n","Prove the following statement:\n","\n","*If the testing set has the same number of positive and negative examples, then the accuracy is equal to the average of the specificity and sensitivity.*"]},{"cell_type":"markdown","metadata":{"id":"1F0bAXg8VyR_"},"source":["### Part f (even more fun???)\n","In Part d, we saw a special case where accuracy equals the average of specificity and sensitivity. Let's look at the general case -- how different is accuracy from the average of specificity and sensitivity? Investigate this question by proving the following statement:\n","\n","Let `p` = # of false positives, and `N` = # of true negatives. Suppose that in the testing set, there are `x` times as many positive examples as there are negative examples. Also suppose that the number of true positives is equal to the number of false positives. Then, `acc` = `R` * `avg`, where `acc` = accuracy, `avg` = the average of specificity and sensitivity, and `R` = $\\frac{2(p+N)x}{Nx^2+\\left(p+N\\right)x+p}$."]},{"cell_type":"markdown","metadata":{"id":"Yji96Z5HVySA"},"source":["### _end of Problem 6_"]},{"cell_type":"markdown","metadata":{"id":"BtxBPCYAVySA"},"source":["## _Problem 7_ (0 points)\n","\n","In INFO1998, we focus on the high-level concepts and applications, and as a result, we rarely delve into the computations that go into our machine learning algorithms. This isn't to say that those computations are unimportant. When working with **enormous** datasets, some algorithms become **infeasible** since they need to do so much *computation* that training a model could take several days. When working with **complex** datasets with **special properties**, we can sometimes **adapt** algorithms based on our understanding of the algorithm's *computations*.\n","\n","This question challenges you to think about the computations involved in kNN.\n","\n","Also, this is worth 0 points, so feel free to look up the answer if you want."]},{"cell_type":"markdown","metadata":{"id":"mDsC4JtVVySA"},"source":["### Part a\n","When a kNN model is making a prediction for a sample, what does it need to do? Be specific."]},{"cell_type":"markdown","metadata":{"id":"O-fu3D0OVySA"},"source":[]},{"cell_type":"markdown","metadata":{"id":"0fPJ-7KSVySA"},"source":["### Part b\n","In a past class, we saw that the linear regression model is just a linear function; that is, the whole model can be represented by just a couple of numbers (the weights/coefficients). Based on Part a, what data is necessary to represent a trained kNN model?"]},{"cell_type":"markdown","metadata":{"id":"AeWzRE7eVySA"},"source":[]},{"cell_type":"markdown","metadata":{"id":"329Xg_-JVySA"},"source":["### Part c\n","Based on Part b, describe the training algorithm for a kNN model. Recall that a training algorithm is how you go from a training set to a representation of a model.\n","
Hint: it's super simple."]},{"cell_type":"markdown","metadata":{"id":"8c_Z_-E6VySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"ljIpMUZ3VySB"},"source":["In parts D and E, write your answers in terms of the quantities:\n","- number of training samples, T\n","- number of samples in dataset, N\n","- number of features in dataset, F\n","- number of neighbors, k"]},{"cell_type":"markdown","metadata":{"id":"ivit14hWVySB"},"source":["### Part d\n","Write down an expression estimating the memory needed to represent a trained kNN model."]},{"cell_type":"markdown","metadata":{"id":"9XUSKNqmVySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"t4b7pL5-VySB"},"source":["### Part e\n","Write down an expression estimating the time needed for kNN to make a prediction for a single point."]},{"cell_type":"markdown","metadata":{"id":"Q1SSxap_VySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"KB32mppQVySB"},"source":["### Part f\n","Estimate the memory space needed to represent a kNN with the specifications below. Also estimate how long it would take to predict 1,000 test samples.\n","\n","- number of training samples: 1,000,000\n","- number of samples in dataset: 100,000,000\n","- number of features in dataset: 50\n","- number of neighbors: 5\n","- size of one feature of one sample: 8 bytes\n","- time to calculate distance for x features: x/10,000 seconds"]},{"cell_type":"markdown","metadata":{"id":"j5aJbNCCVySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"ymet0kjZVySB"},"source":["### Part g\n","There are many variations on kNN that aim to speed up kNN predictions. This might involve saving less of the train set, checking only a subset of the saved data, encoding data differently, or organizing data differently. Look up two of these variations, and compare their advantages/disadvantages."]},{"cell_type":"markdown","metadata":{"id":"sG5nQX90VySC"},"source":[]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"},"notebookId":"PAG=GBo}Q}it}"},"nbformat":4,"nbformat_minor":0}