{"cells":[{"cell_type":"markdown","metadata":{"id":"bB7k4iHZVyRp"},"source":["\n","NETID: PLEASE FILL ME IN\n"]},{"cell_type":"markdown","metadata":{"id":"t10Fmd_VVyRv"},"source":["# Introduction to Classifiers"]},{"cell_type":"markdown","metadata":{"id":"zJFjeouTgrai"},"source":["### Problems\n","- Problem 1 (4 points)\n","- Problem 2 (3 points)\n","- Problem 3 (2 points)\n","- Problem 4 (1 point)"]},{"cell_type":"markdown","metadata":{"id":"WhGluyWlVyRx"},"source":["Two lectures ago we covered linear regression and predicting the value of a continuous variable. We use __classifiers__ to predict binary or categorical variables. Classifiers can help us answer yes/no questions or categorize an observation into one of several categories.\n","\n","## kNN Classifier\n","\n","There are various classification algorithms, each of which is better suited to some situations than others. In this lecture we will learn about __kNN__ (k-nearest neighbors), one of these classifiers."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jXnamTZwVyRy","outputId":"9bbbee22-4708-4365-abfb-a6bfc8978404","scrolled":true},"outputs":[{"data":{"text/html":["
"],"text/plain":[" diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n","0 M 17.99 10.38 122.80 1001.0 \n","1 M 20.57 17.77 132.90 1326.0 \n","2 M 19.69 21.25 130.00 1203.0 \n","3 M 11.42 20.38 77.58 386.1 \n","4 M 20.29 14.34 135.10 1297.0 \n","\n"," smoothness_mean compactness_mean concavity_mean concave points_mean \\\n","0 0.11840 0.27760 0.3001 0.14710 \n","1 0.08474 0.07864 0.0869 0.07017 \n","2 0.10960 0.15990 0.1974 0.12790 \n","3 0.14250 0.28390 0.2414 0.10520 \n","4 0.10030 0.13280 0.1980 0.10430 \n","\n"," symmetry_mean ... radius_worst texture_worst perimeter_worst \\\n","0 0.2419 ... 25.38 17.33 184.60 \n","1 0.1812 ... 24.99 23.41 158.80 \n","2 0.2069 ... 23.57 25.53 152.50 \n","3 0.2597 ... 14.91 26.50 98.87 \n","4 0.1809 ... 22.54 16.67 152.20 \n","\n"," area_worst smoothness_worst compactness_worst concavity_worst \\\n","0 2019.0 0.1622 0.6656 0.7119 \n","1 1956.0 0.1238 0.1866 0.2416 \n","2 1709.0 0.1444 0.4245 0.4504 \n","3 567.7 0.2098 0.8663 0.6869 \n","4 1575.0 0.1374 0.2050 0.4000 \n","\n"," concave points_worst symmetry_worst fractal_dimension_worst \n","0 0.2654 0.4601 0.11890 \n","1 0.1860 0.2750 0.08902 \n","2 0.2430 0.3613 0.08758 \n","3 0.2575 0.6638 0.17300 \n","4 0.1625 0.2364 0.07678 \n","\n","[5 rows x 31 columns]"]},"execution_count":16,"metadata":{},"output_type":"execute_result"}],"source":["import numpy as np\n","import pandas as pd\n","from sklearn.model_selection import train_test_split\n","from sklearn.neighbors import KNeighborsClassifier\n","\n","df = pd.read_csv('lecture6data.csv')\n","df=df.drop('Unnamed: 32',axis=1)\n","df=df.drop('id',axis=1)\n","df.head()"]},{"cell_type":"markdown","metadata":{"id":"9akLHuF-VyR0"},"source":["## _Problem 1 (4 points)_\n","\n","Build a kNN model predicting whether an observation is benign or malignant. You should split the dataset into a training set and a test set as covered previously in the course, fit the model on the observations in the training set, and predict the target variable for the test set.\n","\n","There are a couple of things to note for this problem. First, you are free to choose whichever features you want to predict the target feature, but you should not use id or the target variable itself. 
Second, you can optionally choose the k parameter for the kNN model (the default value is 5).\n","\n","Save your predictions in a variable named \"predictions\".\n","\n","**Please do not change the variable names already provided as they are used later in the demo**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PWhmOP4zVyR0"},"outputs":[],"source":["# TODO separate your X (features) and your Y (target).\n","\n","# TODO train test split your data with 20% being used for testing\n","x_train, x_test, y_train, y_test = \"FILL IN HERE\"\n","\n","# This is the function we use to create the kNN model (default k=5)\n","model = KNeighborsClassifier()\n","\n","# TODO fit the model using the train data\n","\n","# TODO store the predictions for the test set\n","predictions = \"FILL IN HERE\"\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"CJ4MAE58VyR1"},"outputs":[],"source":["# TODO find the accuracy score of your predictions\n","from sklearn.metrics import accuracy_score\n","print(\"sklearn's accuracy score for diagnosis:\", accuracy_score(\"FILL IN HERE\", \"FILL IN HERE\"))\n"]},{"cell_type":"markdown","metadata":{"id":"f7ZJD1i4VyR2"},"source":["### _end of Problem 1_"]},{"cell_type":"markdown","metadata":{"id":"Ea8mH3RFVyR2"},"source":["## Measuring Accuracy\n","\n","Measuring the accuracy of a classifier is more intuitive than calculating the accuracy of a linear regression model. When we predict categorical values, our accuracy score is simply the proportion of observations that we predicted correctly. For example, if we have a test set of size 100 and we predict 93 of the observations correctly, we have an accuracy score of 93 percent."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"oxEhC3VjVyR3","outputId":"beea10db-fb8a-4397-8d29-b2764f1334ed"},"outputs":[{"name":"stdout","output_type":"stream","text":["accuracy: 0.956140350877193\n","baseline: 0.6228070175438597\n","improvement: 0.5352112676056336\n"]}],"source":["# Compute the accuracy score of the model created above\n","accuracy = accuracy_score(y_test, predictions)\n","print('accuracy:', accuracy)\n","\n","# Compute the accuracy of predicting that every diagnosis is benign ('B')\n","y_train.describe()\n","y_test.describe()\n","base_array = np.full(len(y_test), 'B')  # one 'B' prediction per test observation\n","\n","baseline = accuracy_score(y_test, base_array)\n","print('baseline:', baseline)\n","\n","# Compute the percent improvement from the baseline\n","improvement = (accuracy - baseline) / baseline\n","print('improvement:', improvement)"]},{"cell_type":"markdown","metadata":{"id":"6_hroBPcVyR4"},"source":["The above improvement shows just how beneficial the kNN model can be. It also suggests that we have chosen a reasonable value for k, since the model clearly outperforms the baseline of predicting the majority class ('B', benign) for every observation.\n","\n","## Fit/Overfitting\n","\n","Below are accuracy scores of the same kNN model, but with different values of k. Note how the accuracy changes as k increases.\n","
As mentioned during the lecture, a higher value of k can improve the accuracy of the model, but if k is too high the model essentially predicts the majority class of the whole training set for every observation."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"cfUXCvzEVyR4"},"outputs":[],"source":["# Model when k=1\n","model1 = KNeighborsClassifier(n_neighbors=1)\n","model1.fit(x_train, y_train)\n","predictions1 = model1.predict(x_test)\n","\n","# Model when k=10\n","model10 = KNeighborsClassifier(n_neighbors=10)\n","model10.fit(x_train, y_train)\n","predictions10 = model10.predict(x_test)\n","\n","# Model when k=100\n","model100 = KNeighborsClassifier(n_neighbors=100)\n","model100.fit(x_train, y_train)\n","predictions100 = model100.predict(x_test)\n","\n","print(\"accuracy score when k=1:\", accuracy_score(y_test, predictions1))\n","print(\"accuracy score when k=10:\", accuracy_score(y_test, predictions10))\n","print(\"accuracy score when k=100:\", accuracy_score(y_test, predictions100))\n"]},{"cell_type":"markdown","metadata":{"id":"JOdpzyS0VyR5"},"source":["## _Problem 2 (3 points)_\n","\n","Now we are going to plot the relationship between the value of k and the accuracy score of the model for this data set.\n","\n","Using a loop, create models with k ranging from 1 to 30. Find the accuracy for each of these models and graph the results with the number of neighbors on the x-axis and accuracy on the y-axis. Please label your axes, and add a title to your plot.\n","\n","**You do not need to redo the train test split.**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"vmMKbkXCVyR6"},"outputs":[],"source":["import matplotlib.pyplot as plt\n","%matplotlib inline\n","from sklearn.metrics import accuracy_score"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"8OQjmhePVyR7"},"outputs":[],"source":["# TODO find the accuracy of the model with each value of k from 1-30 inclusive"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zDyQauwRVyR7"},"outputs":[],"source":["# TODO create the plot of the accuracy array determined in the previous cell"]},{"cell_type":"markdown","metadata":{"id":"mgXa0kpoVyR8"},"source":["### _end of Problem 2_\n","\n","## Confusion Matrix\n","\n","**Reminder**: The confusion matrix is depicted below.\n","\n","
\n","|   | Positive (Predicted) | Negative (Predicted) |\n","| --- | --- | --- |\n","| **Positive (Actual)** | True Positive | False Negative |\n","| **Negative (Actual)** | False Positive | True Negative |
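\n","\n","As a quick aside, scikit-learn can build this matrix for us. The next cell is a minimal sketch (it assumes the `y_test` and `predictions` variables from Problem 1, and treats 'M' (malignant) as the positive class):"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.metrics import confusion_matrix\n","\n","# Rows are actual classes, columns are predicted classes;\n","# labels=['M', 'B'] puts the assumed positive class ('M') first\n","cm = confusion_matrix(y_test, predictions, labels=['M', 'B'])\n","tp, fn, fp, tn = cm.ravel()\n","\n","print('sensitivity:', tp / (tp + fn))\n","print('specificity:', tn / (tn + fp))"]},{"cell_type":"markdown","metadata":{},"source":["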
\n","\n","Here are the equations specified in the lecture for your convenience with the next problem.\n","\n","**Sensitivity** = True Positive / (True Positive + False Negative)\n","\n","**Specificity** = True Negative / (True Negative + False Positive)\n","\n","**Accuracy** = (True Positive + True Negative) / Total\n","\n","**Error** = (False Positive + False Negative) / Total\n","\n","**Precision** = True Positive / (True Positive + False Positive)"]},{"cell_type":"markdown","metadata":{"id":"bOoPfT6oVyR8"},"source":["## _Problem 3 (2 points)_\n","\n","Given the table below, calculate the **Specificity**, **Sensitivity**, **Overall Error Rate**, **Overall Accuracy**, and **Precision** of the classifier. (Show us the calculations, don't just hard-code the answers!)\n","\n","
\n","|   | Positive (Predicted) | Negative (Predicted) |\n","| --- | --- | --- |\n","| **Positive (Actual)** | 146 | 32 |\n","| **Negative (Actual)** | 21 | 590 |
"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tomE5mxCVyR8","outputId":"b629f495-5216-4585-b11c-69a334ee03d7"},"outputs":[{"name":"stdout","output_type":"stream","text":["sensitivity: FILL IN HERE\n","specificity: FILL IN HERE\n","accuracy: FILL IN HERE\n","error: FILL IN HERE\n","precision: FILL IN HERE\n"]}],"source":["print(\"sensitivity:\", \"FILL IN HERE\")\n","print(\"specificity:\", \"FILL IN HERE\")\n","print(\"accuracy:\", \"FILL IN HERE\")\n","print(\"error:\", \"FILL IN HERE\")\n","print(\"precision:\", \"FILL IN HERE\")"]},{"cell_type":"markdown","metadata":{"id":"TiszM_7pVyR9"},"source":["### _end of Problem 3_"]},{"cell_type":"markdown","metadata":{"id":"xk7BetDjhqVW"},"source":["## _Problem 4_ (1 point)\n","\n","Let's say you want to run kNN on a dataset with both continuous features and binary features. Can you think of any potential issues that might arise from mixing these data types? Also, how might you preprocess categorical data to use in a kNN? Ordinal data?\n","\n"]},{"cell_type":"markdown","metadata":{"id":"01g3Z8-3iRUR"},"source":["#### fill in here"]},{"cell_type":"markdown","metadata":{"id":"6DoYMM9ujvTA"},"source":["### Just a reminder about the mid semester feedback form posted on ED. We'd really appreciate it if you could fill it out!!"]},{"cell_type":"markdown","metadata":{"id":"cFuBOiB3VyR9"},"source":["## _Problem 5_ (extra credit)"]},{"cell_type":"markdown","metadata":{"id":"ZQdHNetaVyR9"},"source":["\n","Before running kNN, which of the following kinds of preprocessing should we do? Choose all that apply.\n","\n","1) Scale\n","\n","2) Center\n","\n","3) Remove correlated features\n","\n","4) Remove outliers\n"]},{"cell_type":"markdown","metadata":{"id":"P-W21H7BVyR9"},"source":[]},{"cell_type":"markdown","metadata":{"id":"D5tvjsHHVyR-"},"source":["### _end of Problem 5_"]},{"cell_type":"markdown","metadata":{"id":"XO2XXoKNVyR-"},"source":["## _Problem 6_ (extra credit)\n","\n","We've talked about sensitivity and specificity. Recall these high level intuitions:\n","- high sensitivity -> able to correctly identify positives\n","- high specificty -> able to correctly identify negatives"]},{"cell_type":"markdown","metadata":{"id":"DIVMXZQHVyR-"},"source":["### Part a\n","Identify a model that has have 100% sensitivity, no matter what dataset it is run on. Similarly, identify a model with 100% specificity.\n","\n","**Hint**: Recall that a \"model\" is just a function, meaning that it takes in an input and *spits out an output*. Your job is to figure out, if you get an input x, should the output for that x be 0 or should it be 1?\n"]},{"cell_type":"markdown","metadata":{"id":"U9j4qBSfVyR-"},"source":[]},{"cell_type":"markdown","metadata":{"id":"vihFMmwgVyR-"},"source":["### Part b\n","In Problem 2, you plotted kNN accuracy vs `k`. Now, make a plot kNN sensitivity vs. `k` and another plot for kNN specificity. 
Use the same dataset as in Problem 2, and go from k=1 to k=30."]},{"cell_type":"markdown","metadata":{"id":"pbdYqCNYVyR-"},"source":["### Part c\n","Now, plot the average of specificity and sensitivity against the number of neighbors."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mqjwOrhVVyR_"},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{"id":"owWv0IpUVyR_"},"source":["### Part d (just for fun)\n","Prove the following statement:\n","\n","*If the testing set has the same number of positive and negative examples, then the accuracy is equal to the average of the specificity and sensitivity.*"]},{"cell_type":"markdown","metadata":{"id":"1F0bAXg8VyR_"},"source":["### Part f (even more fun???)\n","In Part d, we saw a special case where accuracy equals the average of specificity and sensitivity. Let's look at the general case -- how different is accuracy from the average of specificity and sensitivity? Investigate this question by proving the following statement:\n","\n","Let `p` = # of false positives, and `N` = # of true negatives. Suppose that in the testing set, there are `x` times as many positive examples as there are negative examples. Also suppose that the number of true positives is equal to the number of false positives. Then, `acc` = `R` * `avg`, where `acc` = accuracy, `avg` = the average of specificity and sensitivity, and `R` = $\\frac{2(p+N)x}{Nx^2+\\left(p+N\\right)x+p}$."]},{"cell_type":"markdown","metadata":{"id":"Yji96Z5HVySA"},"source":["### _end of Problem 6_"]},{"cell_type":"markdown","metadata":{"id":"BtxBPCYAVySA"},"source":["## _Problem 7_ (0 points)\n","\n","In INFO1998, we focus on the high-level concepts and applications, and as a result, we rarely delve into the computations that go into our machine learning algorithms. This isn't to say that those computations are unimportant. When working with **enormous** datasets, some algorithms become **infeasible** since they need to do so much *computation* that training a model could take several days. When working with **complex** datasets with **special properties**, we can sometimes **adapt** algorithms based on our understanding of the algorithm's *computations*.\n","\n","This question challenges you to think about the computations involved in kNN.\n","\n","Also, this is worth 0 points, so feel free to look up the answer if you want."]},{"cell_type":"markdown","metadata":{"id":"mDsC4JtVVySA"},"source":["### Part a\n","When a kNN model is making a prediction for a sample, what does it need to do? Be specific."]},{"cell_type":"markdown","metadata":{"id":"O-fu3D0OVySA"},"source":[]},{"cell_type":"markdown","metadata":{"id":"0fPJ-7KSVySA"},"source":["### Part b\n","In a past class, we saw that the linear regression model is just a linear function; that is, the whole model can be represented by just a couple of numbers (the weights/coefficients). Based on Part a, what data is necessary to represent a trained kNN model?"]},{"cell_type":"markdown","metadata":{"id":"AeWzRE7eVySA"},"source":[]},{"cell_type":"markdown","metadata":{"id":"329Xg_-JVySA"},"source":["### Part c\n","Based on Part b, describe the training algorithm for a kNN model. Recall that a training algorithm is how you go from a training set to a representation of a model.\n","
Hint: it's super simple."]},{"cell_type":"markdown","metadata":{"id":"8c_Z_-E6VySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"ljIpMUZ3VySB"},"source":["In parts D and E, write your answers in terms of the quantities:\n","- number of training samples, T\n","- number of samples in dataset, N\n","- number of features in dataset, F\n","- number of neighbors, k"]},{"cell_type":"markdown","metadata":{"id":"ivit14hWVySB"},"source":["### Part d\n","Write down an expression estimating the memory needed to represent a trained kNN model."]},{"cell_type":"markdown","metadata":{"id":"9XUSKNqmVySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"t4b7pL5-VySB"},"source":["### Part e\n","Write down an expression estimating the time needed for kNN to make a prediction for a single point."]},{"cell_type":"markdown","metadata":{"id":"Q1SSxap_VySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"KB32mppQVySB"},"source":["### Part f\n","Estimate the memory space needed to represent a kNN with the specifications below. Also estimate how long it would take to predict 1,000 test samples.\n","\n","- number of training samples: 1,000,000\n","- number of samples in dataset: 100,000,000\n","- number of features in dataset: 50\n","- number of neighbors: 5\n","- size of one feature of one sample: 8 bytes\n","- time to calculate distance for x features: x/10,000 seconds"]},{"cell_type":"markdown","metadata":{"id":"j5aJbNCCVySB"},"source":[]},{"cell_type":"markdown","metadata":{"id":"ymet0kjZVySB"},"source":["### Part g\n","There are many variations on kNN that aim to speed up kNN predictions. This might involve saving less of the train set, checking only a subset of the saved data, encoding data differently, or organizing data differently. Look up two of these variations, and compare their advantages/disadvantages."]},{"cell_type":"markdown","metadata":{"id":"sG5nQX90VySC"},"source":[]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"},"notebookId":"PAG=GBo}Q}it}"},"nbformat":4,"nbformat_minor":0}