90 Words to Machine Learning Mastery
A compilation from multiple sources
Accuracy: Accuracy is a fundamental metric for evaluating the performance of a classification model. It is defined as the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances in the dataset. While accuracy provides a straightforward measure of a model’s overall performance, it can be misleading in cases of imbalanced datasets, where one class significantly outnumbers the other. In such situations, a high accuracy might simply reflect the model’s ability to predict the majority class correctly, without truly capturing the model’s effectiveness in distinguishing between classes. Thus, accuracy should be considered alongside other metrics like precision, recall, and the F1 score for a more comprehensive evaluation.
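As a quick illustration (a minimal sketch assuming scikit-learn and made-up labels), accuracy is just the fraction of predictions that match the ground truth:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = (true positives + true negatives) / total instances.
print(accuracy_score(y_true, y_pred))  # 6 of 8 predictions correct -> 0.75
```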
Activation Function: A function used in neural networks to introduce non-linearity into the model, allowing it to learn complex patterns. Common examples include ReLU, sigmoid, and tanh.
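The sketch below (NumPy, with arbitrary example inputs) shows what these three activation functions actually compute:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # zero for negative inputs, identity otherwise

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes the real line into (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes the real line into (-1, 1)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x))
```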
Appropriate Fitting: The best fit of the training data by a model, capturing the underlying structure without overfitting or underfitting.
Backpropagation: The algorithm used to train neural networks, involving the propagation of error gradients backward through the network to update weights.
Bagging (Bootstrap Aggregating): An ensemble method that improves the stability and accuracy of machine learning algorithms by training multiple models on different random subsets of the training data.
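A minimal sketch with scikit-learn, using synthetic data in place of a real dataset; the number of estimators is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 25 base learners (decision trees by default), each trained on a bootstrap
# sample of the data; their predictions are combined by majority vote.
bagged = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
print(bagged.score(X, y))
```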
Batch Normalization: A technique to normalize the inputs of each layer in a neural network, improving training speed and stability.
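A minimal NumPy sketch of the training-time normalization step only (it ignores the running statistics used at inference time); gamma and beta stand in for the learnable scale and shift parameters:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature of the mini-batch to zero mean and unit variance,
    # then rescale and shift with the learnable parameters gamma and beta.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 4) * 10 + 5   # a mini-batch of 32 examples, 4 features
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1 per feature
```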
Bias: A prediction error. Bias describes how inaccurate our predictions are: a high bias means the predictions will be systematically wrong. Intuitively, bias can be thought of as holding a 'bias' towards people: if you are highly biased, you are more likely to make wrong assumptions about them, and an oversimplified mindset creates an unjust dynamic in which you label them according to that 'bias.'
Forman described bias as: “Bias is the algorithm’s tendency to consistently learn the wrong thing by not considering all the information in the data (underfitting).”
Bias-Variance Tradeoff — Bias is how far predictions are from the correct class or value, and variance is how different a model’s predictions are from each other. In statistics and machine learning, the bias-variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples and vice versa.
Boosting: An ensemble learning technique that combines the predictions of several base learners to improve robustness and accuracy. Examples include AdaBoost and Gradient Boosting.
Confusion Matrix: A table used to evaluate the performance of a classification model, showing the true positives, true negatives, false positives, and false negatives. It provides a comprehensive view of how well the model is performing across all classes.
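A minimal sketch with scikit-learn and hypothetical labels, reading precision, recall, and the F1 score off the same predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for a binary problem (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```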
Convexity: A measure of the curvature of a function. Convexity guarantees that the function has no more than one local minimum, which is therefore also the global minimum.
Collinearity- a phenomenon in which one feature variable in a regression model is highly linearly correlated with another feature variable.
Cross-Validation: A statistical method used to estimate the skill of machine learning models, used mainly when the dataset is relatively small. When there is not enough data to hold out a separate validation set, cross-validation lets us reuse the data: each fold serves as validation data once, and the averaged result is a good approximation of what we would have obtained with a full-sized validation set. This helps alleviate, at least to some degree, the problem of small datasets failing to represent the true data distribution. Cross-validation performance is generally used to tune hyperparameters.
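A minimal sketch with scikit-learn, using the iris dataset purely as an illustration; 5 folds is an arbitrary but common choice:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, and each fold serves
# once as the validation set while the other 4 folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```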
Data Augmentation: A technique used to increase the diversity of a training dataset without actually collecting new data. Commonly used in image processing, it involves transformations such as rotations, flips, and translations to create modified versions of existing data points.
Deep Neural Network — When we have multiple hidden layers, the neural network is called a deep neural network and the process of learning is deep learning.
Decision Boundary — A hyperplane that separates a feature space into distinct classes.
Decision tree — A flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes).
Distance Function: In mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set. The most commonly used measures are Euclidean distance (L2 distance), Manhattan distance (L1 distance), and Hamming distance. If we have integer data, it may make sense to use Manhattan distance, or Hamming distance for bit data. If, for example, all our measurements are lengths in centimeters, no adjustment is needed and Euclidean distance makes a lot of sense, since it measures a kind of length and the unweighted combination of features is also sound. For KNeighborsClassifier, the default distance measure, or metric (as the argument is called in scikit-learn), is the Lp measure or Minkowski distance (metric='minkowski' in the scikit-learn call) with p=2 (the p=2 argument in KNeighborsClassifier), which is nothing other than the L2 measure, i.e. the Euclidean distance.
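A minimal sketch of the scikit-learn call described above, using the iris dataset as an illustration (its measurements are lengths in centimeters); metric='minkowski' with p=2 is simply the default Euclidean distance spelled out:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# metric='minkowski' with p=2 is the default, i.e. plain Euclidean (L2) distance.
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```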
Dropout: A regularization technique for neural networks that randomly drops neurons during training to prevent overfitting.
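A minimal NumPy sketch of one common formulation ('inverted' dropout); the rate and the shape of the activations are arbitrary:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True):
    # Randomly zero out a fraction `rate` of the activations during training and
    # rescale the survivors so their expected value is unchanged.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= rate).astype(activations.dtype)
    return activations * mask / (1.0 - rate)

h = np.ones((2, 8))
print(dropout(h, rate=0.5))   # roughly half the units zeroed, the rest scaled up
```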
Early Stopping: A form of regularization used to prevent overfitting in iterative training processes, such as gradient descent. It involves monitoring the performance of the model on a validation set and stopping the training process when performance begins to degrade.
Embedding: A representation of data in a lower-dimensional space. Embeddings are often used in natural language processing to convert words or phrases into dense vectors that capture semantic relationships.
Ensemble Learning: A method that combines multiple models to produce a better predictive performance than any single model.
Epoch: A full pass through the entire training dataset during the training process of a neural network.
F1 Score: The F1 score is a performance metric used to evaluate the accuracy of a classification model, particularly when dealing with imbalanced datasets. It is the harmonic mean of precision and recall, providing a single measure that balances the trade-off between the two metrics. A high F1 score indicates that the model has both high precision and high recall, meaning it is both accurate in its positive predictions and successful in identifying the majority of positive cases. This makes the F1 score a useful metric in scenarios where both false positives and false negatives have significant consequences, such as in medical testing or fraud detection.
Features — In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification, and regression.
- Feature Engineering: The process of using domain knowledge to create features that make machine learning algorithms work better.
- Feature Selection: Techniques used to select a subset of relevant features for building robust learning models.
Generalization- refers to your model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
Gradient Descent: Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient (it is an iterative method). In machine learning, we use gradient descent to update the parameters of our model. Plain gradient descent requires the function to be differentiable: it works directly with a smooth L2 penalty, but not with an L1 penalty, which is not differentiable at zero and calls for subgradient or coordinate-descent methods instead.
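A minimal sketch on a toy one-dimensional function (the function, learning rate, and step count are arbitrary choices):

```python
# Minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
w = 0.0
learning_rate = 0.1

for step in range(50):
    gradient = 2 * (w - 3)               # gradient of the loss at the current w
    w = w - learning_rate * gradient     # step in the direction of steepest descent

print(w)   # converges towards the minimizer w = 3
```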
Grid Search: A hyperparameter optimization technique that exhaustively searches through a specified subset of the hyperparameter space. It trains and evaluates the model using different combinations of hyperparameters to find the best performing set.
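A minimal sketch with scikit-learn's GridSearchCV; the model and grid values are arbitrary illustrations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, scoring each with 5-fold cross-validation.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```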
Hinge Loss Function — This is a surrogate for misclassification loss and has some very nice properties for finding the best decision boundary for classification. In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for “maximum-margin” classification, most notably for support vector machines (SVMs).
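A minimal NumPy sketch of the hinge loss for labels encoded as -1/+1; the scores are hypothetical decision-function values:

```python
import numpy as np

def hinge_loss(y_true, scores):
    # A point contributes zero loss only if it is on the correct side of the
    # boundary with a margin of at least 1; otherwise the penalty grows linearly.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y = np.array([+1, -1, +1, -1])
scores = np.array([2.0, -0.5, 0.3, 1.5])   # hypothetical model outputs
print(hinge_loss(y, scores))
```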
Hyperparameters: In simple terms, these are the parameters that have to be set by us rather than learned from the data (although some can change during execution, so they are not always constant). Typically, each learning algorithm has a set of parameters that are not learned while training; we call these hyperparameters to distinguish them from the parameters, such as weights, that the learning algorithm chooses itself, at least for an algorithm with a parameterized hypothesis space. Hyperparameters are set based on the problem being solved and the domain, not just on the data the learning algorithm is training on. They tend to determine factors such as how quickly an algorithm should converge, as in the case of the learning rate, or how much complexity should be penalized, as in the case of regularizers.
IID (Independent and Identically Distributed random variables): In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. If each data point depends on the previous one, the data are NOT IID.
K-Nearest Neighbors: K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g. distance functions). K is the number of neighbors to look at. We don't want K to be too large, because that means listening to very distant neighbors and lumping together things that are actually quite dissimilar. On the other hand, we don't want K to be too small, because then random, meaningless local variations can have too much influence on our choice. The two key factors in using KNN are the distance measure and the value of K.
Kernel Function — In machine learning, a “kernel” is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem. The kernel function is what is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
L1 Regularizer: Also known as the L1 norm or Lasso regularization, this is a technique used to prevent overfitting in machine learning models by adding a penalty equal to the absolute value of the magnitude of the coefficients. This regularization method encourages sparsity, meaning it tends to drive some coefficients to be exactly zero, effectively performing feature selection. By penalizing large coefficients, the L1 regularizer helps in simplifying the model, making it more interpretable and less prone to overfitting on the training data. This is particularly useful in high-dimensional datasets where many features may be irrelevant.
L2 Regularizer: Also known as the L2 norm or Ridge regularization, this is a technique used to reduce overfitting by adding a penalty equal to the square of the magnitude of the coefficients to the loss function. Unlike L1 regularization, which can drive coefficients to zero, L2 regularization tends to shrink the coefficients towards zero but not exactly to zero, leading to smaller and more evenly distributed values. This helps in stabilizing the model by discouraging large coefficients, thus improving its generalization ability on unseen data. L2 regularization is particularly effective in situations where all features contribute to the prediction and none should be completely excluded.
Lasso Regression: Linear regression with an L1 regularizer; it adds the absolute value of each coefficient's magnitude as a penalty term to the loss function.
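A minimal sketch with scikit-learn on synthetic data where only two of five features matter; alpha (the penalty strength) is an arbitrary choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100) * 0.1   # only 2 informative features

# alpha controls the strength of the L1 penalty; larger alpha means more zeros.
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients of the uninformative features shrink to (near) zero
```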
Latent Variable: A variable that is not directly observed but is inferred from other variables in the model. Latent variables are often used in models such as factor analysis, hidden Markov models, and topic models.
Learning Curve: A plot that shows the performance of a machine learning model on both the training and validation datasets as a function of the number of training iterations or the size of the training dataset. It is used to diagnose whether a model is overfitting or underfitting.
Learning Rate: A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
Linear Function: Can be simply defined as a function that always follows the principle of input/output = constant. A linear equation is always a polynomial of degree 1 (for example, x + 2y + 3 = 0). In the two-dimensional case such equations always form lines; in other dimensions they might also form planes, points, or hyperplanes. Their "shape" is always perfectly straight, with no curves of any kind. This is why we call them linear equations.
Linear Regression — In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). In other words, this means finding the line of best fit.
Linearly Separable — If the data can be completely separable by a line it is known as linearly separable.
Logistic Regression: Contrary to its name, logistic regression doesn't actually perform regression; that is, it doesn't answer questions with a real-valued number. Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist (classification using regression).
Logistic Transfer Function (Logistic Function): The logistic transfer function, or simply the logistic function, maps the entire real line into the interval between 0 and 1. So the function our model finds converts any real number, from the entire space of possible values, into something we can consider a probability.
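A minimal NumPy sketch of the logistic function itself, evaluated at a few arbitrary points:

```python
import numpy as np

def logistic(z):
    # Maps any real number z into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(logistic(z))   # close to 0, below 0.5, exactly 0.5, above 0.5, close to 1
```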
Loss Function: Functions that quantify mistakes are known as loss functions. A loss function is a method of evaluating how well a specific algorithm models the given data: if predictions deviate too much from the actual results, it returns a very large number. Gradually, with the help of an optimization procedure, the model's parameters are adjusted to reduce the loss and hence the error in prediction.
Matching Loss: Fortunately, and thanks to the power of mathematics, for particular transfer functions you can actually pick a complementary loss so that you do get a convex loss landscape. These are called matching losses. If you have a matching loss for your particular transfer function, then the minimum point is the global minimum, which means you can find the best line.
Metadata- Metadata is simply data about data. This means a description and context for the data. Metadata helps organize, find, and understand data.
E.g., consider taking a photo with your mobile device: the phone doesn't just capture the image but also records where and when it was taken. This is what enables the apps on your phone to map your images or sort them by the date they were taken. It can also store a lot of other information; professional cameras will record information about the camera itself, shutter speed and aperture, even copyright information. When you convert images from one format to another, say from your camera's raw format to GIF, or from JPEG to PNG, you might actually lose metadata.
Meta-Learning: Learning algorithms that tune hyperparameters or model structures. To be honest, we don't always carefully maintain the distinction between hyperparameters and model parameters, and sometimes the line does get blurry.
Model Complexity- In machine learning, model complexity often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on. It can also refer to the algorithmic learning complexity or computational complexity.
Model Fitting (model.fit()): Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. During the fitting process, you run an algorithm on data for which you know the target variable, known as "labeled" data, and produce a machine learning model.
Model Parameters- Model parameters are the internal variables within a machine learning model that are learned from the training data. These parameters define the model’s structure and behavior, determining how the input data is transformed into the output predictions. In algorithms like linear regression, model parameters are the coefficients of the features. In neural networks, they are the weights and biases. These parameters are adjusted during the training process to minimize the loss function, which measures the difference between the predicted and actual values. Proper tuning of model parameters is crucial for the model to generalize well to new, unseen data. Unlike hyperparameters, which are set prior to training, model parameters are learned and optimized during the training process.
Neuron (Artificial Neuron): To us, it's a mathematical operator that takes some numbers as inputs and gives a number as its output. Yes, that's a function just like the ones we usually talk about; what makes neurons special is that we interconnect them. Typically, each neuron involves two separate simple functions: a linear function, then a nonlinear transformation. Within each neuron in our network, the linear function does what all linear functions do, multiplies its weights against the input values and returns the sum. That value is then passed to the activation function, which outputs some other number. Activation functions you might hear of include sigmoids, hyperbolic tangents, and rectified linear units (ReLU).
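A minimal NumPy sketch of a single neuron with hypothetical inputs and weights, using ReLU as the activation:

```python
import numpy as np

def neuron(x, weights, bias):
    # 1) Linear part: weighted sum of the inputs plus a bias.
    z = np.dot(weights, x) + bias
    # 2) Nonlinear part: pass that sum through an activation function (ReLU here).
    return max(0.0, z)

x = np.array([0.5, -1.0, 2.0])    # hypothetical inputs
w = np.array([0.8, 0.2, -0.5])    # hypothetical learned weights
print(neuron(x, w, bias=0.1))
```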
Neural Network: A series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature. A neural network is a number of interconnected neurons, and the default way to connect them is in layers, so the output of one batch of neurons serves as the input to another. The very first input is the example being classified (when we have a classification problem), and we call that layer the input layer. The input layer is then fed into the next layer, some set of neurons that, yes, are just functions. This is the first hidden layer, so called because we have no clear idea what is going on inside; it holds numbers that probably mean something. Between the input and output layers there can be any number of hidden layers, and in the case of recurrent neural networks they can even loop back on themselves. But the default is a feed-forward neural network that feeds each hidden layer with the output of the layer before. One of the open problems around neural networks is explainability, which basically means understanding why a neural network gives the output it does.
No Free Lunch Theorem: Coined by David Wolpert. What the no free lunch theorem states is that there is no universal learning algorithm that works better than any other on all machine learning prediction tasks or application domains. Remember, the prediction tasks and domains are in principle defined by the data-generating distribution or distributions underlying them. The theorem says that, when averaged over all possible data-generating distributions, every learning algorithm has the same performance, including the stupidest ones. According to the no free lunch theorem, there is no universal solution that we can promise will perform best on everything we care about, because we don't really know what the underlying distributions are for everything we care about.
Non-Linear Functions — Any function that is not linear is simply put, Non-linear. Higher degree polynomials are nonlinear. Trigonometric functions (like sin or cos) are nonlinear. Square roots are nonlinear.
Normalization: A preprocessing step to scale input features so that they have a mean of zero and a standard deviation of one. Normalization can improve the convergence of gradient descent and the overall performance of the model.
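A minimal sketch with scikit-learn's StandardScaler, which performs exactly this zero-mean, unit-variance scaling; the data is made up:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Rescale each column to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # roughly [0, 0] and [1, 1]
```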
One-Hot Encoding: A method used to convert categorical variables into a numerical format that can be provided to machine learning algorithms.
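A minimal sketch using pandas (one of several ways to do this); the categories are made up:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'])

# Each category becomes its own binary column.
print(pd.get_dummies(colors, prefix='color'))
```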
Overfitting: A modeling error that occurs when a function fits a limited set of data points too closely; the model learns the detail and noise in the training data to the extent that it negatively impacts its performance on new data. This is like memorizing the letters of the answers to an MCQ: it is of no use, since the individual will perform poorly on a different paper. If overfitting is like memorizing the sequence of letters that will give you a perfect score on one particular exam, then not separating training and test data is like a teacher giving you all the answers to that exam ahead of time.
Perceptron Classifier: Used when classes are linearly separable. In machine learning, the perceptron is an algorithm for the supervised learning of binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The training procedure: randomly choose a point and see how the current line classifies it; if it is correct, do nothing, and if it is incorrect, adjust the line, moving it up or down depending on the misclassification.
Predict (model.predict()): Given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
Precision: Precision is a crucial metric in evaluating the performance of a classification model, especially in scenarios where the cost of false positives is high. It measures the accuracy of the positive predictions made by the model by calculating the ratio of true positive predictions to the total number of positive predictions (true positives plus false positives). High precision indicates that the model produces a low number of false positives, meaning that when the model predicts a positive class, it is likely to be correct. This metric is particularly important in applications such as medical diagnosis, where incorrectly predicting a condition (false positive) can lead to unnecessary treatments and anxiety for patients.
Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is used to reduce the dimensionality of large datasets while preserving as much variance as possible.
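A minimal sketch with scikit-learn, projecting the four iris measurements onto two principal components as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional measurements down to 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
```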
Pruning: Pruning is a technique used in decision tree algorithms to reduce the size of the tree by removing sections that provide little predictive power. The main goal of pruning is to enhance the generalization ability of the model by reducing overfitting. Pruning can be done in two main ways: pre-pruning (early stopping) and post-pruning (pruning after the tree is fully grown). Pre-pruning involves setting conditions to stop the tree from growing further during the training process, while post-pruning involves cutting back the tree after it has been fully grown by removing branches that have little importance or do not improve model performance significantly. This process helps in creating a more robust model that performs better on unseen data.
Pre-pruning: Pre-pruning is a technique used in decision tree algorithms to prevent the tree from growing too large and overfitting the training data. It involves setting criteria to stop the tree from splitting further during the training process. Common criteria include limiting the maximum depth of the tree, requiring a minimum number of samples in a node for it to be split, or setting a threshold for the minimum improvement in the loss function required for a split. By stopping the growth of the tree early, pre-pruning helps in maintaining a simpler and more generalized model.
Post-pruning: Post-pruning, also known as pruning or backward pruning, is a technique applied after a decision tree has been fully grown. The idea is to remove sections of the tree that provide little power in predicting target variables to reduce complexity and improve generalization. This is done by evaluating the impact of removing a subtree and replacing it with a leaf node, usually based on cross-validation performance or a separate validation set. Post-pruning can involve methods such as reduced error pruning and cost complexity pruning, which aim to simplify the model by cutting back overfitted branches.
Random Forests (Random Decision Forests): An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Recall: Recall, also known as sensitivity or the true positive rate, is an essential metric for evaluating the completeness of a classification model’s positive predictions. It calculates the ratio of true positive predictions to the total number of actual positive instances (true positives plus false negatives). High recall indicates that the model successfully identifies most of the actual positive cases, minimizing the number of false negatives. This metric is especially critical in applications where missing a positive instance has severe consequences, such as in fraud detection or identifying diseases. A model with high recall ensures that most of the positive cases are captured, even if it means including some false positives.
Regularization: A technique used to penalize models for being too complex; it is used to control the bias-variance trade-off. A regularizer is an extra mathematical term added to the objective function which penalizes complexity. The objective you want the learning algorithm to minimize then becomes a combination of the penalty for making mistakes and the penalty for using a complex model. L1 and L2 regularizers penalize the magnitude of the weights in the loss function.
Reinforcement Learning: An area of machine learning where agents learn to make decisions by taking actions in an environment to maximize cumulative reward.
Representation Learning- We can manually identify a bunch of features that are appropriate for the question we care about, but we can also use various machine learning techniques to automatically find features. This process of transforming raw data into a more useful form is sometimes known as representation learning, and there are lots of standard feature extraction practices.
Residuals: The differences between observed and predicted values in a regression model.
Ridge Regression- Linear Regression with an L2 regularizer. Adds “squared magnitude” of coefficient as penalty term to the loss function.
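A minimal sketch with scikit-learn on synthetic data; the alpha values are arbitrary and only illustrate that a stronger L2 penalty shrinks the coefficients further:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([3.0, -2.0, 1.0, 0.5, -0.5]) + rng.randn(100) * 0.1

# The L2 penalty shrinks coefficients towards zero (larger alpha, more shrinkage),
# but unlike Lasso it rarely makes any of them exactly zero.
print(Ridge(alpha=1.0).fit(X, y).coef_)
print(Ridge(alpha=100.0).fit(X, y).coef_)   # heavier shrinkage
```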
ROC Curve (Receiver Operating Characteristic): A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
Semi-Supervised Learning: A class of machine learning that makes use of both labeled and unlabeled data for training.
Smoothing Parameter: The parameter that controls the smoothness of the fitted curve, e.g. K in KNN.
Stratified Split: When splitting a dataset into training and test data, it's not always good to do it randomly. For example, if we have a dataset consisting of spam and non-spam emails and all the spam emails end up in the test data, this would not give the expected results. Therefore, we need stratified splitting: the data is split so that the proportions of the chosen features (typically the class labels) stay consistent across the subsets. So, if 700 out of 1,000 emails are not spam and we want 30 percent of the data in the test set, a stratified split will make sure that 490 of the non-spam emails end up in the training set and the remaining 210 in the test set. This applies when you have IID (Independent and Identically Distributed) data.
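A minimal sketch of the email example with scikit-learn; the labels are synthetic, and the per-class counts in each subset should come out approximately as described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 hypothetical emails: 700 labelled 0 (not spam) and 300 labelled 1 (spam).
y = np.array([0] * 700 + [1] * 300)
X = np.arange(1000).reshape(-1, 1)   # dummy feature, just to have something to split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# The 70/30 class ratio is preserved in both subsets.
print(np.bincount(y_train))   # approximately [490 210]
print(np.bincount(y_test))    # approximately [210  90]
```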
Support Vector Machines (SVM) — Support Vector Machines are models with a specific separating line coming from our use of the margin loss along with an L2 regularizer.
The objective of the support vector machine algorithm is to find a hyperplane in N-dimensional space (N being the number of features) that distinctly classifies the data points. Because SVMs try to maximize this distance, they are also called large-margin classifiers: the chosen line creates as much space as possible between the decision boundary and the closest points to it, of which there are at least two, one from each class. Hard-margin SVMs work only on linearly separable data and force all hinge losses to be exactly zero. In soft-margin SVMs, the smaller the C used (the "softer" the margin), the more the algorithm will prefer to put points on the wrong side of the separating line if doing so keeps the norm of the weight vector down, producing a bigger margin among the points that are on the correct side and likely increasing the generalization power. So when are SVMs a good choice of learning algorithm? Because of the kernel trick, they actually handle nonlinear relationships rather well. To be fair, so do neural networks, but because SVMs are based on convex optimization, you don't need as much data to have confidence that you're finding the best model within the hypothesis space.
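A minimal sketch with scikit-learn showing the soft-margin parameter C in action on synthetic data; the values of C are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Smaller C gives a softer margin (more training errors tolerated, often better
# generalization); the RBF kernel handles nonlinear boundaries via the kernel trick.
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel='rbf', C=C).fit(X_train, y_train)
    print(C, svm.score(X_test, y_test))
```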
Surrogate Loss — A loss used in place of the original loss, helping to make the problem solvable.
Tensor: A multi-dimensional array used in machine learning, particularly in deep learning frameworks such as TensorFlow and PyTorch.
Tuning: The process of adjusting the hyperparameters of a model to optimize its performance.
Transfer Learning: A machine learning method where a model developed for a particular task is reused as the starting point for a model on a second task.
Transfer Function: Functions that map the output of linear regression to another domain, such as the sign function and the logistic transfer function. A transfer function converts the output of a regression function to a class label. For logistic regression, the transfer function takes the number reported by our regression model and translates it into a class label.
Underfitting: A modeling error which occurs when a function is too simple to capture the underlying structure of the data, so the model performs poorly even on the training data because it misses relevant relationships between the features and the target.
Unsupervised Learning: A type of machine learning that looks for previously undetected patterns in a dataset with no pre-existing labels.
Validation Set: A subset of the dataset used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
Variance: A prediction error. Variance is how much a model's predictions differ from one another when the model is trained on different samples of the data; hence, any 'noise' in the dataset might be captured by the model. High variance tends to occur when we use complicated models that can overfit our training sets. For example, variance can be thought of as forming different stereotypes based on different demographics.
A complicated model might, for instance, treat people's names as a good predictor of the target. However, names are random and should not have any predictive power: in one dataset, people with the name 'Alex' might appear likely to be criminals, while in another dataset people with the name 'Alex' might appear likely to be graduates. Hence, names should not be used as a predictive variable.
Forman described variance as: “Variance is the algorithm’s tendency to learn random things irrespective of the real signal by fitting highly flexible models that follow the error/noise in the data too closely (overfitting).”
Vectors: Vectors are commonly used in machine learning because they provide a convenient way to organize data. Often one of the very first steps in building a machine learning model is vectorizing the data.