{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cars 4 - Not a Pixar Movie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## INFO 1998 Final Project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grant Rineheimer, Benjamin Tang, and Dylan Tom" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project performs exploratory data analysis and predictive analysis on a used car dataset.\n", "\n", "We want to answer the following questions:\n", "1. What features can we use to predict the price of a used car? (Regression)\n", "2. Given certain features, can we predict the manufacturer of the car? (Classification)\n", "\n", "The approach is outlined as follows:\n", "1. Preprocessing and cleaning the dataset\n", "2. Data Visualization\n", "3. Machine Learning Models\n", "4. Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Source of Data: https://www.kaggle.com/austinreese/craigslist-carstrucks-data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#Import Necessary Packages\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn import tree\n", "from sklearn import datasets" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(426880, 26)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Import Dataset\n", "data = pd.read_csv('vehicles.csv')\n", "df = pd.DataFrame(data)\n", "df.shape" ] }, {
"cell_type": "markdown", "metadata": {}, "source": [ "This dataset has 426,880 instances and 26 features. Shown below are the attribute names and the first 5 rows of the dataframe. Many rows contain NaN entries, which carry no information for finding correlations or making predictions. In addition, columns such as 'url', 'region_url', 'image_url', and 'description' are outdated or not useful for prediction and can be removed to make the dataframe easier to read." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pretty printing has been turned OFF\n" ] }, { "data": { "text/plain": [ "['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'description', 'county', 'state', 'lat', 'long', 'posting_date']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%pprint\n", "list(df.columns)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ... | size | type | paint_color | image_url | description | county | state | lat | long | posting_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | https://prescott.craigslist.org/cto/d/prescott... | prescott | https://prescott.craigslist.org | 6000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | az | NaN | NaN | NaN |
| 1 | 7218891961 | https://fayar.craigslist.org/ctd/d/bentonville... | fayetteville | https://fayar.craigslist.org | 11900 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | ar | NaN | NaN | NaN |
| 2 | 7221797935 | https://keys.craigslist.org/cto/d/summerland-k... | florida keys | https://keys.craigslist.org | 21000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | fl | NaN | NaN | NaN |
| 3 | 7222270760 | https://worcester.craigslist.org/cto/d/west-br... | worcester / central MA | https://worcester.craigslist.org | 1500 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | ma | NaN | NaN | NaN |
| 4 | 7210384030 | https://greensboro.craigslist.org/cto/d/trinit... | greensboro | https://greensboro.craigslist.org | 4900 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | nc | NaN | NaN | NaN |

5 rows × 26 columns
| | id | price | year | odometer | lat | long |
|---|---|---|---|---|---|---|
| count | 2.480840e+05 | 2.480840e+05 | 248084.000000 | 2.465850e+05 | 242819.000000 | 242819.000000 |
| mean | 7.311672e+09 | 6.629931e+04 | 2010.330275 | 1.045731e+05 | 38.652560 | -95.509484 |
| std | 4.308408e+06 | 1.242535e+07 | 9.815420 | 2.114410e+05 | 5.932695 | 18.644916 |
| min | 7.301584e+09 | 0.000000e+00 | 1900.000000 | 0.000000e+00 | -84.122245 | -159.827728 |
| 25% | 7.308403e+09 | 5.500000e+03 | 2007.000000 | 4.700000e+04 | 34.746623 | -114.465026 |
| 50% | 7.312860e+09 | 1.199900e+04 | 2013.000000 | 9.439500e+04 | 39.338500 | -89.600000 |
| 75% | 7.315307e+09 | 2.499000e+04 | 2016.000000 | 1.400000e+05 | 42.484503 | -81.152649 |
| max | 7.317101e+09 | 3.736929e+09 | 2022.000000 | 1.000000e+07 | 82.252826 | 173.885502 |
| | id | region | price | year | manufacturer | model | cylinders | odometer | title_status | VIN | ... | condition_salvage | condition_nan | transmission_automatic | transmission_manual | transmission_other | transmission_nan | drive_4wd | drive_fwd | drive_rwd | drive_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7316814884 | auburn | 33590 | 2014 | gmc | sierra 1500 crew cab slt | 8 | 57923.0 | clean | 3GTP1VEC4EG551563 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 7316814758 | auburn | 22590 | 2010 | chevrolet | silverado 1500 | 8 | 71229.0 | clean | 1GCSCSE06AZ123805 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 7316814989 | auburn | 39590 | 2020 | chevrolet | silverado 1500 crew | 8 | 19160.0 | clean | 3GCPWCED5LG130317 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 7316743432 | auburn | 30990 | 2017 | toyota | tundra double cab sr | 8 | 41124.0 | clean | 5TFRM5F17HX120972 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 7316356412 | auburn | 15000 | 2013 | ford | f-150 xlt | 6 | 128000.0 | clean | NaN | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

5 rows × 42 columns
\n", "