Navigation

The Tree of Machine Learning Concepts

ML Paradigms & Structures

Learning Paradigm Structure Linearity Key Algorithms / Concepts

Highlighting Key

Linear Algebra: Defining boundaries, transformations.

Calculus: Gradient calculation, loss minimization.

Optimization Theory: Finding best parameters, constraint satisfaction.

Classifier Comparison: Generative vs. Discriminative

A detailed comparison of core classification algorithms highlighting their underlying mathematical principles, assumptions, and complexity.

Classifier Type Mapper Function Objective Function Learning Algorithm Assumptions Complexity Outliers Convergence
Naïve Bayes Generative \( P(y \mid x) \propto P(y) \prod_i P(x_i \mid y) \) Maximize log-likelihood (or MAP) Closed-form MLE/MAP Conditional independence; distributional form (e.g., Gaussian) Low: \( O(nd) \) Robust Closed-form
Fisher’s LDA Generative \( f(x) = w^\top x \) (linear) Maximize between/within-class scatter ratio Eigen-decomposition Gaussian, homoscedastic (\( \Sigma_k = \Sigma \)), equal priors Low: \( O(d^3) \) Sensitive Closed-form
Perceptron Discriminative \( f(x) = \operatorname{sign}(w^\top x) \) Minimize number of misclassifications Perceptron update rule Data linearly separable Low: \( O(ndT) \) Sensitive Guaranteed if separable
Logistic Regression Discriminative \( P(y=1 \mid x) = \sigma(w^\top x) \) Minimize Negative Log-Likelihood Gradient Descent / IRLS Linear decision boundary in log-odds Medium: iterative Robust Global optimum (convex)
Hard-margin SVM Discriminative \( f(x) = \operatorname{sign}(w^\top x + b) \) Minimize \( \frac{1}{2} \|w\|^2 \) s.t. \( y_i (w^\top x_i + b) \geq 1 \) Quadratic Programming (QP) Perfectly linearly separable Very Sensitive Closed-form (via dual QP)
Soft-margin SVM Discriminative \( f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right) \) Minimize \( \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \) QP (dual formulation) None (soft margin handles noise) Robust Global optimum (convex QP)
Decision Tree Discriminative Piecewise constant (axis-aligned regions) Min Gini/Entropy (class) or MSE (reg) Greedy recursive splitting Informative features; axis-aligned splits Medium: \( O(nd \log n) \) Moderately robust Heuristic (no global optimum)
Random Forest Discriminative Ensemble average of tree predictions Same as base tree Bagging + random feature subsets Decorrelated trees improve generalization High: \( O(n d T) \) Robust Stabilizes as \( T \to \infty \)
k-NN Instance-based \( \hat{y} = \operatorname{mode}(y_{\text{neighbors}}) \) None (memorization) None (lazy learner) Smoothness: similar \( x \to \) similar \( y \) High predict: \( O(nd) \) Sensitive N/A (no training)
PCA Transformation \( z = W^\top x \) Minimize \( \|X - X W W^\top\|_F^2 \) Eigendecomposition / SVD High-variance = informative; linear subspace Medium: \( O(d^3) \) Sensitive Closed-form
k-Means Clustering Assign to nearest centroid Minimize \( \sum_{k} \sum_{x \in C_k} \|x - \mu_k\|^2 \) Lloyd’s algorithm (EM-like) Spherical, same-size, convex clusters Medium: \( O(n k d T) \) Sensitive Local optimum (non-convex)
GMM Generative \( P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \) Maximize log-likelihood Expectation-Maximization (EM) Data = mixture of Gaussians Medium-High Moderately robust Local optimum (non-convex)

ML Project Workflow (Kanban)

This is the typical workflow for a machine learning project, from problem definition to deployment. Drag and drop the cards to move tasks between stages.

🎯 Problem Definition

Define business question (e.g., "Predict churn")
Set success metric (e.g., "85% accuracy")

📥 Data Collection

Gather data from main database
Query 3rd-party APIs for external data

🔍 EDA (Exploratory Data Analysis)

Find missing values & outliers
Plot correlation matrix

🛠️ Preprocessing

Handle missing values (Impute)
Scale numerical features (StandardScaler)
Encode categorical features (OneHot)
Run PCA for dimensionality reduction

🧠 Model Training

Train SVM model
Train Random Forest (for comparison)

📊 Model Evaluation

Test model on unseen data
Review Confusion Matrix

🚀 Deployment

Build prediction API (Flask/FastAPI)

Interactive ML Model Selector

Answer questions about your dataset to get personalized model recommendations.

Step 1: Dataset Overview

Step 2: Feature Characteristics

Continuous Categorical Mixed
Yes No (e.g., age vs. income)
Yes No (e.g., skewed)

Step 3: Domain Requirements

Yes No
Yes No
Yes (e.g., real-time) No (e.g., batch)

Step 4: Data Quality

None Low (0-5%) High (5%+)
Yes No
Balanced Imbalanced

Step 5: Model Constraints

High (GPU, large RAM) Low (CPU, limited RAM)
Yes (frequent updates) No (stable model)
Yes (e.g., geospatial, time series) No

Model Recommendations

Loading...
Analyzing your dataset...

Dataset Summary

All Model Scores

The 10 Entities that would attribute to our decision process

A detailed comparison of core classification algorithms highlighting their underlying mathematical principles, assumptions, and complexity.

Model Family Param / Non-param Learning Setting Inductive Bias Loss Function Regularization Hyperparameters Feature Dependency Evaluation Strategy
Linear Regression Discriminative Parametric Supervised Regression Linear relationship between features and target MSE (Mean Squared Error) L2 (Ridge), L1 (Lasso) alpha (regularization), fit_intercept Assume linear dependencies, performance degrades with complex, non-linear interactions CV, Train/Test Split, $R_2$
Logistic Regression Discriminative Parametric Supervised (Classification) Linear Decision Boundary in log-odd space Binary/Multinomial Cross-Entropy L1, L2 C (inv. regularization), penalty, solver Cannot capture non-linear interactions CV, Accuracy, ROC-AUC, log-loss
Perceptron Discriminative Parametric Supervised (Classification) Linearly separable, hard threshold hyperplane None. But perceptron Loss None. None (or max_iter, eta) Highly dependent on linear combinations of features Accuracy,Perceptron convergence,training mistakes, iteration counts
Naive Bayes Generative Parametric Supervised (Classification) All features are independent of each other given class None. Trained via Likelihood (MLE) / Neg Log-likelihood (MAP) Smoothing (Laplace) alpha (smoothing) Low, ignore features dependency, biased CV, ACCURACY, ROC-AUC, log-loss
k-NN Discriminative Non-parametric Supervised (Classification/Regression) Similarity hypothesis: smoothness: similar inputs -> outputs None. Memorizes data, not optimizing any loss functions None. k, distance metric k, distance mtric, weighting Highly sensitive to all features. Require feature Scaling CV, ACCURACY(class), MSE
Decision Tree Discriminative Non-param Supervised Axis-parallel for decison boundary, hierarchical splits Gini/Entropy (Classification), MSE (Reg) - splitting criteria Pruning max_depth, min_samples_split, min_samples_leaf, ccp_alpha Medium as it greedily captures non-linear interactions CV, F1, ACCURACY, MSE
Random Forest Discriminative Non-parametric Supervised High variance learners can form strong (low variance) learner Aggregated Gini / Entropy / MSE, same as tree(per base learner) Bagging, feature subset, tree depth, depth limites n_estimators, max_features, max_depth Low, using max_features at each split. CV, OOB, ACCURACY, ROC-AUC
Linear SVM Discriminative Parametric Supervised Classification Maximum Margin Hyperplane or linear separator Hinged Loss L2 (implicit in margin) C, kernel type (class_weight) High: linear; sensitive to scale, irr. features CV, ROC-AUC, F1
Kernel SVM Discriminative Non-Parametric Supervised Classification Data linearly separable in a higher-dimensional space by kernel Hinged Loss L2 (in RKHS norm) C, $ $ kernel width, gamma, degree Sensitive to scale; RBF suffers in highD CV, ROC-AUC
PCA Neither, it is a transformation. Parametric Unsupervised Directions of high variance are the most important None (sometimes L2 reconstruction error) None n_components High: require standardization Explained variance, reconstruction error
k-Means Clustering Not sure, not a fixed function, but have fixed $k$ centroids or parameters Unsupervised Clusters are spherical, convex, same size/density Sum of Squared Distances (inertia) None k (clusters), init method High: needs scaling, degrades in highD Silhoutte score, inertia
DBSCAN Clustering Non-parametric Unsupervised Clusters are continuous, high-density regions separated by low-density regions. None (rule based) None eps, min_samples Medium: need scaling, degrades in highD Silhoutte, clusters found from visual inspection, domain validity
Gaussian Mixture Model Generative Parametric Unsupervised Data composed Gaussian distributions mixture Neg Log-Likelihood Covariance regularization (reg_covar) n_components, covariance_type High: need scaling, cov. est unstable in highD held-out log Likelihood, BIC/AIC

Top YouTube Channels for Core Mathematical Foundations

Besides graduate coursework, youtube-university gives us opportunities to learn from these expert-led channels:

1. 3Blue1Brown

Why it stands out: Unparalleled visual and geometric intuition.

  • Linear Algebra: The "Essence of Linear Algebra" series is the gold standard—covers vectors, matrices, determinants, eigenvalues, and transformations with stunning animations.
  • Calculus: "Essence of Calculus" explains derivatives, integrals, and gradient intuition in a way that directly connects to optimization.
  • Optimization: Videos on neural networks and gradient descent (e.g., in the Deep Learning series) beautifully tie calculus + linear algebra to parameter optimization.

✅ Best for: Building deep geometric understanding—essential for interpreting ML algorithms.

2. Mathematical Monk (Jeffrey Miller)

Why it stands out: Rigorous yet accessible lectures from a former Duke professor.

  • Linear Algebra: Clear treatment of vector spaces, linear maps, and SVD.
  • Optimization Theory: Excellent playlist on Convex Optimization, duality, KKT conditions, and constrained optimization—rare on YouTube at this level.
  • Calculus & Probability: His ML playlist includes gradient-based methods, Lagrange multipliers, and loss landscapes.

✅ Best for: Graduate-level intuition with formalism—ideal if you're bridging theory and practice.

3. Steve Brunton (databrunton)

Why it stands out: Applied mathematician focused on data-driven science and engineering.

  • Linear Algebra: His "Linear Algebra" series covers SVD, PCA, and transformations with real-world examples.
  • Optimization: Dedicated videos on gradient descent, Newton’s method, and constrained optimization (including Lagrange multipliers).
  • Calculus: Explains gradients and Jacobians in the context of dynamical systems and ML.

✅ Best for: Seeing how these concepts power real algorithms (e.g., in control, ML, and signal processing).

4. MIT OpenCourseWare

Why it stands out: Full university courses from MIT.

✅ Best for: Structured, semester-style depth—ideal if you want textbook-level rigor.

5. Coding Train / Daniel Shiffman

While not theoretical, his Coding Math and ML playlists show how to implement gradients, matrix transforms, and optimizers from scratch in code—great for reinforcing theory with practice.

Bonus: StatQuest with Josh Starmer

Not focused on pure math, but his videos on gradient descent, logistic regression, and PCA translate calculus and linear algebra into bio/ML contexts with unmatched clarity.

Summary Recommendation

  • Start with 3Blue1Brown for intuition.
  • Deepen with Mathematical Monk or Steve Brunton for theory + application.
  • Use MIT OCW when you need formal proofs or comprehensive coverage.

These channels together give you a complete mental toolkit—from visualizing a Jacobian to solving constrained optimization problems in high-dimensional spaces.