The Tree of Machine Learning Concepts

ML Paradigms & Structures

Learning Paradigm	Structure	Linearity	Key Algorithms / Concepts

Highlighting Key

Linear Algebra: Defining boundaries, transformations.

Calculus: Gradient calculation, loss minimization.

Optimization Theory: Finding best parameters, constraint satisfaction.

Classifier Comparison: Generative vs. Discriminative

A detailed comparison of core classification algorithms highlighting their underlying mathematical principles, assumptions, and complexity.

Classifier	Type	Mapper Function	Objective Function	Learning Algorithm	Assumptions	Complexity	Outliers	Convergence
Naïve Bayes	Generative	$ P(y \mid x) \propto P(y) \prod_i P(x_i \mid y) $	Maximize log-likelihood (or MAP)	Closed-form MLE/MAP	Conditional independence; distributional form (e.g., Gaussian)	Low: $ O(nd) $	Robust	Closed-form
Fisher’s LDA	Generative	$ f(x) = w^\top x $ (linear)	Maximize between/within-class scatter ratio	Eigen-decomposition	Gaussian, homoscedastic ($ \Sigma_k = \Sigma $), equal priors	Low: $ O(d^3) $	Sensitive	Closed-form
Perceptron	Discriminative	$ f(x) = \operatorname{sign}(w^\top x) $	Minimize number of misclassifications	Perceptron update rule	Data linearly separable	Low: $ O(ndT) $	Sensitive	Guaranteed if separable
Logistic Regression	Discriminative	$ P(y=1 \mid x) = \sigma(w^\top x) $	Minimize Negative Log-Likelihood	Gradient Descent / IRLS	Linear decision boundary in log-odds	Medium: iterative	Robust	Global optimum (convex)
Hard-margin SVM	Discriminative	$ f(x) = \operatorname{sign}(w^\top x + b) $	Minimize $ \frac{1}{2} \\|w\\|^2 $ s.t. $ y_i (w^\top x_i + b) \geq 1 $	Quadratic Programming (QP)	Perfectly linearly separable	Very Sensitive	Closed-form (via dual QP)
Soft-margin SVM	Discriminative	$ f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right) $	Minimize $ \frac{1}{2} \\|w\\|^2 + C \sum_i \xi_i $	QP (dual formulation)	None (soft margin handles noise)	Robust	Global optimum (convex QP)
Decision Tree	Discriminative	Piecewise constant (axis-aligned regions)	Min Gini/Entropy (class) or MSE (reg)	Greedy recursive splitting	Informative features; axis-aligned splits	Medium: $ O(nd \log n) $	Moderately robust	Heuristic (no global optimum)
Random Forest	Discriminative	Ensemble average of tree predictions	Same as base tree	Bagging + random feature subsets	Decorrelated trees improve generalization	High: $ O(n d T) $	Robust	Stabilizes as $ T \to \infty $
k-NN	Instance-based	$ \hat{y} = \operatorname{mode}(y_{\text{neighbors}}) $	None (memorization)	None (lazy learner)	Smoothness: similar $ x \to $ similar $ y $	High predict: $ O(nd) $	Sensitive	N/A (no training)
PCA	Transformation	$ z = W^\top x $	Minimize $ \\|X - X W W^\top\\|_F^2 $	Eigendecomposition / SVD	High-variance = informative; linear subspace	Medium: $ O(d^3) $	Sensitive	Closed-form
k-Means	Clustering	Assign to nearest centroid	Minimize $ \sum_{k} \sum_{x \in C_k} \\|x - \mu_k\\|^2 $	Lloyd’s algorithm (EM-like)	Spherical, same-size, convex clusters	Medium: $ O(n k d T) $	Sensitive	Local optimum (non-convex)
GMM	Generative	$ P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) $	Maximize log-likelihood	Expectation-Maximization (EM)	Data = mixture of Gaussians	Medium-High	Moderately robust	Local optimum (non-convex)

ML Project Workflow (Kanban)

This is the typical workflow for a machine learning project, from problem definition to deployment. Drag and drop the cards to move tasks between stages.

🎯 Problem Definition

Define business question (e.g., "Predict churn")

Set success metric (e.g., "85% accuracy")

📥 Data Collection

Gather data from main database

Query 3rd-party APIs for external data

🔍 EDA (Exploratory Data Analysis)

Find missing values & outliers

Plot correlation matrix

🛠️ Preprocessing

Handle missing values (Impute)

Scale numerical features (StandardScaler)

Encode categorical features (OneHot)

Run PCA for dimensionality reduction

🧠 Model Training

Train SVM model

Train Random Forest (for comparison)

📊 Model Evaluation

Test model on unseen data

Review Confusion Matrix

🚀 Deployment

Build prediction API (Flask/FastAPI)

The 10 Entities that would attribute to our decision process

A detailed comparison of core classification algorithms highlighting their underlying mathematical principles, assumptions, and complexity.

Model	Family	Param / Non-param	Learning Setting	Inductive Bias	Loss Function	Regularization	Hyperparameters	Feature Dependency	Evaluation Strategy
Linear Regression	Discriminative	Parametric	Supervised Regression	Linear relationship between features and target	MSE (Mean Squared Error)	L2 (Ridge), L1 (Lasso)	alpha (regularization), fit_intercept	Assume linear dependencies, performance degrades with complex, non-linear interactions	CV, Train/Test Split, $R_2$
Logistic Regression	Discriminative	Parametric	Supervised (Classification)	Linear Decision Boundary in log-odd space	Binary/Multinomial Cross-Entropy	L1, L2	C (inv. regularization), penalty, solver	Cannot capture non-linear interactions	CV, Accuracy, ROC-AUC, log-loss
Perceptron	Discriminative	Parametric	Supervised (Classification)	Linearly separable, hard threshold hyperplane	None. But perceptron Loss	None.	None (or max_iter, eta)	Highly dependent on linear combinations of features	Accuracy,Perceptron convergence,training mistakes, iteration counts
Naive Bayes	Generative	Parametric	Supervised (Classification)	All features are independent of each other given class	None. Trained via Likelihood (MLE) / Neg Log-likelihood (MAP)	Smoothing (Laplace)	alpha (smoothing)	Low, ignore features dependency, biased	CV, ACCURACY, ROC-AUC, log-loss
k-NN	Discriminative	Non-parametric	Supervised (Classification/Regression)	Similarity hypothesis: smoothness: similar inputs -> outputs	None. Memorizes data, not optimizing any loss functions	None. k, distance metric	k, distance mtric, weighting	Highly sensitive to all features. Require feature Scaling	CV, ACCURACY(class), MSE
Decision Tree	Discriminative	Non-param	Supervised	Axis-parallel for decison boundary, hierarchical splits	Gini/Entropy (Classification), MSE (Reg) - splitting criteria	Pruning	max_depth, min_samples_split, min_samples_leaf, ccp_alpha	Medium as it greedily captures non-linear interactions	CV, F1, ACCURACY, MSE
Random Forest	Discriminative	Non-parametric	Supervised	High variance learners can form strong (low variance) learner	Aggregated Gini / Entropy / MSE, same as tree(per base learner)	Bagging, feature subset, tree depth, depth limites	n_estimators, max_features, max_depth	Low, using max_features at each split.	CV, OOB, ACCURACY, ROC-AUC
Linear SVM	Discriminative	Parametric	Supervised Classification	Maximum Margin Hyperplane or linear separator	Hinged Loss	L2 (implicit in margin)	C, kernel type (class_weight)	High: linear; sensitive to scale, irr. features	CV, ROC-AUC, F1
Kernel SVM	Discriminative	Non-Parametric	Supervised Classification	Data linearly separable in a higher-dimensional space by kernel	Hinged Loss	L2 (in RKHS norm)	C, $ $ kernel width, gamma, degree	Sensitive to scale; RBF suffers in highD	CV, ROC-AUC
PCA	Neither, it is a transformation.	Parametric	Unsupervised	Directions of high variance are the most important	None (sometimes L2 reconstruction error)	None	n_components	High: require standardization	Explained variance, reconstruction error
k-Means	Clustering	Not sure, not a fixed function, but have fixed $k$ centroids or parameters	Unsupervised	Clusters are spherical, convex, same size/density	Sum of Squared Distances (inertia)	None	k (clusters), init method	High: needs scaling, degrades in highD	Silhoutte score, inertia
DBSCAN	Clustering	Non-parametric	Unsupervised	Clusters are continuous, high-density regions separated by low-density regions.	None (rule based)	None	eps, min_samples	Medium: need scaling, degrades in highD	Silhoutte, clusters found from visual inspection, domain validity
Gaussian Mixture Model	Generative	Parametric	Unsupervised	Data composed Gaussian distributions mixture	Neg Log-Likelihood	Covariance regularization (reg_covar)	n_components, covariance_type	High: need scaling, cov. est unstable in highD	held-out log Likelihood, BIC/AIC

Top YouTube Channels for Core Mathematical Foundations

Besides graduate coursework, youtube-university gives us opportunities to learn from these expert-led channels:

1. 3Blue1Brown

Why it stands out: Unparalleled visual and geometric intuition.

Linear Algebra: The "Essence of Linear Algebra" series is the gold standard—covers vectors, matrices, determinants, eigenvalues, and transformations with stunning animations.
Calculus: "Essence of Calculus" explains derivatives, integrals, and gradient intuition in a way that directly connects to optimization.
Optimization: Videos on neural networks and gradient descent (e.g., in the Deep Learning series) beautifully tie calculus + linear algebra to parameter optimization.

✅ Best for: Building deep geometric understanding—essential for interpreting ML algorithms.

2. Mathematical Monk (Jeffrey Miller)

Why it stands out: Rigorous yet accessible lectures from a former Duke professor.

Linear Algebra: Clear treatment of vector spaces, linear maps, and SVD.
Optimization Theory: Excellent playlist on Convex Optimization, duality, KKT conditions, and constrained optimization—rare on YouTube at this level.
Calculus & Probability: His ML playlist includes gradient-based methods, Lagrange multipliers, and loss landscapes.

✅ Best for: Graduate-level intuition with formalism—ideal if you're bridging theory and practice.

3. Steve Brunton (databrunton)

Why it stands out: Applied mathematician focused on data-driven science and engineering.

Linear Algebra: His "Linear Algebra" series covers SVD, PCA, and transformations with real-world examples.
Optimization: Dedicated videos on gradient descent, Newton’s method, and constrained optimization (including Lagrange multipliers).
Calculus: Explains gradients and Jacobians in the context of dynamical systems and ML.

✅ Best for: Seeing how these concepts power real algorithms (e.g., in control, ML, and signal processing).

4. MIT OpenCourseWare

Why it stands out: Full university courses from MIT.

Linear Algebra: Gilbert Strang’s Linear Algebra—legendary course covering matrix theory, orthogonality, and applications.
Calculus: Calculus Revisited (by Herbert Gross): Classic but clear.
Optimization: Courses like Optimization Methods in Business Analytics or Convex Optimization by Prof. Asu Ozdaglar cover duality and algorithms rigorously.

✅ Best for: Structured, semester-style depth—ideal if you want textbook-level rigor.

5. Coding Train / Daniel Shiffman

While not theoretical, his Coding Math and ML playlists show how to implement gradients, matrix transforms, and optimizers from scratch in code—great for reinforcing theory with practice.

Bonus: StatQuest with Josh Starmer

Not focused on pure math, but his videos on gradient descent, logistic regression, and PCA translate calculus and linear algebra into bio/ML contexts with unmatched clarity.

Summary Recommendation

Start with 3Blue1Brown for intuition.
Deepen with Mathematical Monk or Steve Brunton for theory + application.
Use MIT OCW when you need formal proofs or comprehensive coverage.

These channels together give you a complete mental toolkit—from visualizing a Jacobian to solving constrained optimization problems in high-dimensional spaces.

Navigation