Glossary

On this page you can find some common terminology referenced in this course. Please note that this glossary may not be comprehensive. When in doubt, ask on Ed or refer back to lectures.

Common INFO 1998 Terminology

Term	Definition	Link
Data Science	Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.	Lecture 1

Machine Learning	Machine learning is a subfield of artificial intelligence (broadly defined as the capability of a machine to imitate intelligent human behavior) in which models learn patterns from data rather than following explicitly programmed rules.	Lecture 1

Concatenating	Combines two data frames by stacking them either row-wise or column-wise. Unlike joining, which matches rows on a key, concatenating simply places the data frames next to or below each other.	Lecture 2

Imputation	Compensates for missing values in a dataset. Three main techniques: randomly replacing NaNs; using summary statistics; and using regression, clustering, or other advanced techniques.	Lecture 2

Ordering	Converts ordered categorical data to a numerical scale to facilitate analysis	Lecture 2

Dummy variables	Creates a binary variable for each category in a categorical variable	Lecture 2

Filtering	Filtering means looking at only certain rows, based on the values in columns.	Lecture 2

Feature engineering	Generates new features which provide additional information to the user and to the model	Lecture 2

Joining	Combines two data frames on a specified key, filling in NaN where a row has no match. Four common types: inner, outer, left, and right	Lecture 2

Binning	Makes continuous data categorical by lumping ranges of data into discrete “levels”	Lecture 2

Subsetting	Subsetting means dropping unnecessary columns in order to focus on certain characteristics.	Lecture 2

Summarizing	Summarizing gives us a quantitative overview of the dataset. For numerical columns it reports the count, mean, standard deviation, minimum, maximum, and quartiles. For non-numerical columns it reports other useful overall statistics.	Lecture 2

Standardizing	Rescales data to have a mean of 0 and a standard deviation of 1 (z-scores). This shifts and scales the data but does not change the shape of its distribution.	Lecture 2

Normalizing	Rescales data to values between 0 and 1 (min-max scaling) for easy comparison between different features	Lecture 2

Dataframe	A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table; it provides functions for selecting and manipulating data.	Lecture 2

Pandas	Open-source data analysis library built on top of the Python programming language to manage data in an orderly way. API reference: https://pandas.pydata.org/docs/reference/index.html	Lecture 2

Correlation plots	2D matrix with all variables on each axis; each entry is the correlation coefficient between the corresponding pair of variables	Lecture 3

Error bars	A line through a point on a graph parallel to one of the axes which represents the estimated error in a measurement (the uncertainty)	Lecture 3

Boxplot	Also known as a box-and-whisker plot. Provides a summary of the data: range, interquartile range, median, and outlier information	Lecture 3

Seaborn	Python visualization package built on Matplotlib that provides high-level plotting commands	Lecture 3

Violin plot	Combination of a boxplot and a density plot that shows the spread and shape of the data. Can help show whether the data look approximately normal	Lecture 3

Heatmap	Shows the density or intensity of variables and helps visualize patterns and anomalies. Varying degrees of one metric are represented using color	Lecture 3

Matplotlib	Python data visualization package inspired by MATLAB. Capable of handling most data visualization needs	Lecture 3

Residual Plot	Scatter plot of residual values, with the residuals on the vertical axis and the independent variable on the horizontal axis. Helps assess how well the line of best fit describes the data.	Lecture 3

Linear Regression	Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. Can be configured with hyperparameters.	Lecture 3

Overfitting	Overfitting refers to a model that models the training data too well. An overfitted model cannot accurately generalize predictions to new data	Lecture 4

Model training	The model learns a relationship/program. The first phase in creating the model.	Lecture 4

Model validation	Validates whether the learned relationship is accurate on other data. This is very important to determine if the model is overfitted.	Lecture 4

Variance	Error that results from sensitivity to fluctuations in the training data. High variance is a sign of overfitting.	Lecture 5

Bias	Error that results from incorrect assumptions in the algorithm. High bias is a sign of underfitting.	Lecture 5

Underfitting	A model that can neither model the training data nor generalize to new data	Lecture 5

Classification	A supervised machine learning method where the model tries to predict the correct label of given input data	Lecture 5

Error	Error measures how accurately our model predicts both on the data it uses to learn and on new, unseen data. Based on the error, we choose the machine learning model that performs best for a particular dataset.	Lecture 5

Collinear	When two features have a correlation near -1 or 1, which makes one of them redundant. If a feature is strongly correlated with the target, however, it is a good choice of predictor for linear regression	Lecture 5

Classifier	A model that predicts the class/category (based on the target variable) of a set of data points	Lecture 5

K-nearest neighbor (KNN) classifier	Uses the k nearest data points (k is a user-specified hyperparameter) to predict the label of an unknown point.	Lecture 6

Confusion matrix	Table used to describe the performance of a binary classifier on a set of test data for which the true values are known	Lecture 6

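Sensitivity	True positive rate: the proportion of actual positives correctly identified as positives (true positives / (true positives + false negatives))	Lecture 6

Specificity	True negative rate: the proportion of actual negatives correctly identified as negatives (true negatives / (true negatives + false positives))	Lecture 6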

Overall accuracy	Proportion of correct predictions ((true positives + true negatives) / total)	Lecture 6

Overall error rate	Proportion of incorrect predictions ((false positives + false negatives) / total)	Lecture 6

Precision	Proportion of correct positive predictions among all positive predictions (true positives / (true positives + false positives))	Lecture 6

Decision Trees	Supervised ML model that predicts a target by learning decision rules from the features	Lecture 7

Classification and regression trees (CART)	Used for classification/regression; models a non-linear relationship	Lecture 7

Logistic Regression	Used for binary classification; maps a linear combination of the features to a probability between 0 and 1 using the sigmoid function	Lecture 7

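K-fold cross validation	Split the training data into k equally sized partitions (folds). For each fold, train the model on the other k-1 folds and compute the error on the held-out fold. The average of these k errors is the validation error	Lecture 7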

Linear classifiers	Classifiers that use a hyperplane (decision boundary) to separate data points into classes	Lecture 8

Linearly separable	Occurs when you can partition a dataset's classes with a linear decision boundary. Datasets are often not linearly separable because of outliers	Lecture 8

Perceptron Learning Algorithm	Algorithm that finds a normal vector w whose hyperplane perfectly classifies all the points in a linearly separable data set	Lecture 8

Support Vector Machine (SVM)	A supervised machine learning model that, given a training set of labeled data, learns a decision boundary that separates two different classes with the largest possible margin	Lecture 8

Margins	The distance between the decision boundary and the nearest data points. Soft margins use a cost function to penalize misclassified points	Lecture 8

Kernels	Map data into a higher-dimensional space (for example, 2-dimensional data onto 3 dimensions), which can make it easier to find a separating hyperplane	Lecture 8

Unsupervised learning	Training data is unlabeled; algorithm tries to learn by itself. E.g. clustering, dimensionality reduction, etc.	Lecture 8

Clustering Algorithms	Unsupervised algorithms that group similar data points together. Examples: Hierarchical Cluster Analysis (HCA), k-Means clustering, Gaussian Mixture Models (GMMs)	Lecture 9
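
Several of the definitions above are really formulas, so a small worked example may help. The sketch below is not taken from the lectures: the function names (standardize, normalize, confusion_counts, classification_metrics) and the sample data are made up for illustration, and it uses only the Python standard library rather than pandas or scikit-learn. It assumes binary labels coded as 1 (positive) and 0 (negative), and it uses the population standard deviation when standardizing.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    # Assumes the values are not all identical, otherwise sigma is 0.
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Rescale values to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo, otherwise the denominator is 0.
    return [(v - lo) / (hi - lo) for v in values]


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Compute the Lecture 6 metrics from the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    # Assumes each denominator is nonzero (both classes appear in y_true
    # and at least one positive prediction was made).
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
    }


# Made-up sample data, for illustration only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z_scores = standardize(heights)
scaled = normalize(heights)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)

print(scaled)
print(metrics)
```

With this sample data there are 2 true positives, 3 true negatives, 1 false positive, and 2 false negatives, so accuracy and error rate sum to 1, as the definitions imply. In coursework you would more likely reach for library routines, but writing the formulas out by hand makes it easier to see what each term measures.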

References:

https://www.datarobot.com/wiki/data-science/

https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained

https://pandas.pydata.org/docs/user_guide/dsintro.html#:~:text=DataFrame%20is%20a%202%2Ddimensional,most%20commonly%20used%20pandas%20object.

https://www.biologyforlife.com/interpreting-error-bars.html

http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/

https://www.datacamp.com/blog/classification-machine-learning#:~:text=Classification%20is%20a%20supervised%20machine,prediction%20on%20new%20unseen%20data.

https://www.simplilearn.com/tutorials/machine-learning-tutorial/bias-and-variance#:~:text=Errors%20in%20Machine%20Learning,-We%20can%20describe&text=In%20Machine%20Learning%2C%20error%20is,best%20for%20a%20particular%20dataset.