Math Concepts Every Machine Learning Interviewee Should Know Quiz

Explore 16 essential math concepts and problem-solving skills frequently tested in machine learning interviews. This easy-level quiz covers topics like statistics, linear algebra, probability, calculus, and their applications to fundamental AI and machine learning problems.

  1. Mean vs Median

    Consider the data set [2, 4, 4, 8, 100]. What is the median value of this set?

    1. 4
    2. 8
    3. 24
    4. 100

    Explanation: The median is the middle value after sorting the set, which is 4 here. 8 sits above the center position, 24 is roughly the mean (23.6), which the outlier inflates, and 100 is simply the largest value, not the median.
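
    A quick way to confirm this, using Python's standard statistics module:

    ```python
    import statistics

    data = [2, 4, 4, 8, 100]
    # median() sorts the values internally and returns the middle one.
    print(statistics.median(data))  # 4
    print(statistics.mean(data))    # 23.6 -- the outlier 100 pulls the mean far above the median
    ```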

  2. Basic Probability

    If the probability of an event A is 0.4, what is the probability that A does not occur?

    1. 0.6
    2. 0.4
    3. 1.4
    4. 0.2

    Explanation: The probability that event A does not occur is 1 minus the probability that it does occur, so 1 - 0.4 = 0.6. 0.4 is the probability that A does occur. 1.4 exceeds 1, which no probability can, and 0.2 does not follow from the complement rule.
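
    The complement rule is a one-line check:

    ```python
    p_a = 0.4
    p_not_a = 1 - p_a  # complement rule: P(not A) = 1 - P(A)
    print(p_not_a)     # 0.6
    ```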

  3. Standard Deviation Interpretation

    Which statement best describes the standard deviation in a data set used for training a model?

    1. It measures the spread of values around the mean.
    2. It counts how many values are in the set.
    3. It indicates whether the data is sorted.
    4. It finds the most frequent value in the set.

    Explanation: Standard deviation quantifies how far data points typically deviate from the mean. The number of values is a simple count, whether the data is sorted has no bearing on spread, and the most frequent value is the mode.
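
    One way to see the "spread" interpretation, assuming NumPy is available, is to compare two sets that share a mean of 10:

    ```python
    import numpy as np

    tight = np.array([9, 10, 10, 11])   # values clustered near the mean
    wide  = np.array([1, 5, 15, 19])    # same mean, much more spread out

    print(tight.mean(), tight.std())    # 10.0 and roughly 0.71
    print(wide.mean(), wide.std())      # 10.0 and roughly 7.28
    ```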

  4. Dot Product in Linear Algebra

    What is the dot product of two vectors [1, 2] and [3, 4]?

    1. 11
    2. 14
    3. 21
    4. 10

    Explanation: The dot product is calculated as (1*3) + (2*4) = 3 + 8 = 11. 14 comes from multiplying within each vector, (1*2) + (3*4). 21 comes from summing each vector and multiplying the totals, (1+2) * (3+4). 10 comes from pairing the cross elements, (1*4) + (2*3).
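
    The same arithmetic, checked with NumPy:

    ```python
    import numpy as np

    a = np.array([1, 2])
    b = np.array([3, 4])

    # Elementwise products summed: 1*3 + 2*4 = 11
    print(np.dot(a, b))  # 11
    print(a @ b)         # 11, the @ operator does the same for 1-D arrays
    ```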

  5. Normalization Purpose

    Why is feature normalization important before applying k-nearest neighbors to a dataset with age in years and income in dollars?

    1. To ensure each feature contributes equally to distance calculations.
    2. To make the dataset larger.
    3. To convert all values to integers.
    4. To change the order of data points.

    Explanation: Normalization puts features on the same scale so no single feature dominates the distance metric. Making the dataset larger is not a goal of normalization. Converting to integers is unnecessary for normalization. Changing the order is unrelated to normalization.
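
    A small illustration of why scale matters for distance, using made-up age/income values and z-score standardization (min-max scaling would make the same point):

    ```python
    import numpy as np

    # Two illustrative people: a large age gap, a small income gap.
    X = np.array([[25.0, 50_000.0],
                  [60.0, 51_000.0]])

    # Raw Euclidean distance is dominated by the income column.
    print(np.linalg.norm(X[0] - X[1]))    # about 1000.6 -- the 35-year age gap barely registers

    # Standardize each feature so both contribute comparably to the distance.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    print(np.linalg.norm(Xz[0] - Xz[1]))  # both features now carry equal weight
    ```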

  6. Gradient in Machine Learning

    In gradient descent, what does the gradient represent at a point on the loss curve?

    1. The direction of the steepest increase in the function.
    2. The midpoint between two points.
    3. The average value of the function.
    4. The value at which the function is minimized.

    Explanation: The gradient points in the direction of steepest increase of the function, so gradient descent steps in the opposite direction to reduce the loss. The gradient is neither a midpoint nor an average value, and it is not itself the minimum; the minimum is approached by repeatedly moving against the gradient.
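
    A minimal sketch of gradient descent on the toy loss f(x) = x**2, whose gradient is 2x (the function and learning rate here are arbitrary choices for illustration):

    ```python
    def grad(x):
        # Gradient of f(x) = x**2; it points toward increasing f.
        return 2 * x

    x = 5.0             # arbitrary starting point
    learning_rate = 0.1

    for _ in range(50):
        x -= learning_rate * grad(x)  # step against the gradient to reduce the loss

    print(x)  # roughly 7e-05, i.e. very close to 0, the minimizer of f
    ```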

  7. Matrix Transpose Definition

    Given a matrix A = [[1, 2], [3, 4]], what is the transpose of A?

    1. [[1, 3], [2, 4]]
    2. [[1, 2], [3, 4]]
    3. [[2, 1], [4, 3]]
    4. [[4, 3], [2, 1]]

    Explanation: Transposing a matrix swaps rows with columns, so the result is [[1, 3], [2, 4]]. Keeping the matrix unchanged, as in the second option, is not a transpose. The third and fourth options reverse the entries, which is not the definition of transpose.
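
    A quick NumPy confirmation:

    ```python
    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])

    # .T swaps rows and columns.
    print(A.T)
    # [[1 3]
    #  [2 4]]
    ```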

  8. Overfitting Indicator

    A model reaches 98% accuracy on its training data but only 50% on held-out test data. Which term best describes this behavior?

    1. Overfitting
    2. Underfitting
    3. Perfect Fitting
    4. Random Guessing

    Explanation: High training accuracy combined with poor test accuracy usually means the model has memorized the training data and fails to generalize, which is overfitting. Underfitting would show poor performance on both sets, a well-fitted model would perform well on both, and random guessing would not reach 98% training accuracy.

  9. Calculating Conditional Probability

    If P(A) = 0.3, P(B) = 0.5, and P(A and B) = 0.15, what is P(A|B)?

    1. 0.3
    2. 0.5
    3. 0.15
    4. 0.65

    Explanation: Conditional probability P(A|B) = P(A and B)/P(B) = 0.15/0.5 = 0.3. 0.5 is P(B), not the conditional probability. 0.15 is P(A and B) itself. 0.65 is P(A or B) from the addition rule, not the result of conditioning on B.
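
    The same calculation in code:

    ```python
    p_a_and_b = 0.15
    p_b = 0.5

    # Definition of conditional probability: P(A|B) = P(A and B) / P(B)
    p_a_given_b = p_a_and_b / p_b
    print(p_a_given_b)  # 0.3
    ```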

  10. Activation Functions

    Which function outputs 1 if the input is positive and 0 if the input is negative or zero?

    1. Heaviside step function
    2. Sigmoid function
    3. ReLU function
    4. Tanh function

    Explanation: The Heaviside step function jumps from 0 to 1 as the input crosses zero (here taking the value 0 at zero itself). The sigmoid outputs values between 0 and 1 but never reaches exactly 0 or 1. ReLU outputs 0 for negative inputs and passes positive inputs through unchanged. Tanh outputs values from -1 to 1.
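
    A sketch comparing the four functions at a few inputs, assuming NumPy; the second argument of np.heaviside sets the output at exactly zero, matching the convention in this question:

    ```python
    import numpy as np

    x = np.array([-2.0, 0.0, 2.0])

    step    = np.heaviside(x, 0.0)   # 1 for positive inputs, 0 at zero and below
    sigmoid = 1 / (1 + np.exp(-x))   # smooth values strictly between 0 and 1
    relu    = np.maximum(0, x)       # 0 for negatives, the input itself for positives
    tanh    = np.tanh(x)             # values between -1 and 1

    print(step)     # [0. 0. 1.]
    print(sigmoid)  # approximately [0.12 0.5 0.88]
    print(relu)     # [0. 0. 2.]
    print(tanh)     # approximately [-0.96 0. 0.96]
    ```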

  11. Outlier Impact

    How does an outlier, such as the value 999 in the set [1, 2, 3, 4, 999], most affect the mean and median?

    1. The mean increases much more than the median.
    2. Both the mean and median increase equally.
    3. The median increases more than the mean.
    4. Neither the mean nor the median is affected.

    Explanation: Outliers pull the mean strongly because every value enters the average, while the median depends only on the middle position and is far more resistant. Here the mean jumps from 2.5 to 201.8, while the median only moves from 2.5 to 3, so the two do not increase equally and the median certainly does not increase more. The last option is also wrong because the mean, at least, changes dramatically.
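
    A quick comparison with and without the outlier:

    ```python
    import statistics

    clean  = [1, 2, 3, 4]
    spiked = [1, 2, 3, 4, 999]

    print(statistics.mean(clean),  statistics.median(clean))   # 2.5 2.5
    print(statistics.mean(spiked), statistics.median(spiked))  # 201.8 3 -- the mean jumps, the median barely moves
    ```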

  12. Eigenvalue Terminology

    In linear algebra, what do you call a nonzero vector that is only scaled by a constant factor, not rotated off its own line, when a matrix is applied to it?

    1. Eigenvector
    2. Determinant
    3. Trace
    4. Norm

    Explanation: An eigenvector is a nonzero vector that a matrix merely scales by its eigenvalue, leaving it on the same line. The determinant is a single scalar property of a matrix, the trace is the sum of its diagonal entries, and the norm measures a vector's length; none of these is a vector of this kind.
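
    A small NumPy example with an arbitrary diagonal matrix: applying the matrix to an eigenvector only rescales it.

    ```python
    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 3.0]])

    eigenvalues, eigenvectors = np.linalg.eig(A)
    v = eigenvectors[:, 0]        # eigenvectors are returned as columns

    print(A @ v)                  # same direction as v ...
    print(eigenvalues[0] * v)     # ... just scaled by the matching eigenvalue
    ```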

  13. Types of Random Variables

    Which of the following is best described as a discrete random variable in a machine learning context?

    1. Number of spam messages in an inbox per day
    2. Temperature in Celsius recorded every minute
    3. Blood pressure measurement
    4. Time taken for a model to train

    Explanation: A discrete random variable can take only distinct, separate values, such as a count of spam messages. Temperature can take continuous values, so it's not discrete. Blood pressure and time are also continuous variables and can have fractional values.

  14. Bayes' Theorem Use

    Which concept allows updating the probability estimate for a hypothesis as more evidence is observed?

    1. Bayes' Theorem
    2. Pythagorean Theorem
    3. Linear Regression
    4. K-means Clustering

    Explanation: Bayes' Theorem is used to update the probability of a hypothesis in light of new evidence. The Pythagorean Theorem is related to geometry, not probabilities. Linear regression is used for prediction, not for updating probabilities. K-means is a clustering method.
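
    A toy spam-filter update with made-up numbers, only to show the mechanics of Bayes' Theorem, P(H|E) = P(E|H) * P(H) / P(E):

    ```python
    # Illustrative (made-up) probabilities for a toy spam filter.
    p_spam = 0.2               # prior: P(spam)
    p_word_given_spam = 0.7    # likelihood: P(word | spam)
    p_word_given_ham = 0.1     # likelihood: P(word | not spam)

    # Total probability of observing the word.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

    # Posterior: the prior belief updated after seeing the evidence.
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(round(p_spam_given_word, 3))  # 0.636 -- the 0.2 prior rises sharply
    ```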

  15. Regularization Motivation

    What is the primary mathematical motivation for adding a regularization term to a machine learning model's loss function?

    1. To penalize model complexity and prevent overfitting
    2. To improve model size regardless of data fit
    3. To always reduce training error to zero
    4. To make computation slower and harder

    Explanation: Regularization discourages excessively complex models by adding a penalty to the loss, helping prevent overfitting. Changing the model's size for its own sake is not the goal, driving training error all the way to zero typically makes overfitting worse rather than better, and slowing computation is never a motivation.
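
    A sketch of an L2 (ridge) penalty added to a squared-error loss; the data and weights are arbitrary illustrations:

    ```python
    import numpy as np

    def ridge_loss(w, X, y, lam):
        # Data-fit term (mean squared error) plus an L2 penalty on the weights.
        residual = X @ w - y
        return np.mean(residual ** 2) + lam * np.sum(w ** 2)

    # Arbitrary illustrative data.
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    y = np.array([1.0, 2.0])
    w = np.array([0.5, 0.1])

    # lam = 0 gives the plain loss; lam > 0 adds a cost that grows with the
    # squared size of the weights, discouraging overly complex solutions.
    print(ridge_loss(w, X, y, lam=0.0))  # about 0.05
    print(ridge_loss(w, X, y, lam=0.5))  # about 0.18
    ```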

  16. Correlation Coefficient Value

    What does a correlation coefficient of -1 indicate about the relationship between two variables, X and Y?

    1. A perfect negative linear relationship
    2. No relationship at all
    3. A perfect positive linear relationship
    4. The variables are unrelated

    Explanation: A correlation of -1 signifies that one variable increases as the other decreases in a perfect linear way. No relationship would show a correlation near zero. A perfect positive relationship is described by +1, not -1. 'Unrelated' is inaccurate for a -1 correlation.
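
    A quick NumPy demonstration with data that is perfectly negatively related:

    ```python
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = -2 * x + 10   # y decreases by exactly 2 whenever x increases by 1

    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
    print(np.corrcoef(x, y)[0, 1])  # -1.0 (up to floating-point precision)
    ```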