The Tree of Machine Learning Concepts
ML Paradigms & Structures
| Learning Paradigm | Structure | Linearity | Key Algorithms / Concepts |
|---|
Highlighting Key
Linear Algebra: Defining boundaries, transformations.
Calculus: Gradient calculation, loss minimization.
Optimization Theory: Finding best parameters, constraint satisfaction.
Classifier Comparison: Generative vs. Discriminative
A detailed comparison of core classification algorithms highlighting their underlying mathematical principles, assumptions, and complexity.
| Classifier | Type | Mapper Function | Objective Function | Learning Algorithm | Assumptions | Complexity | Outliers | Convergence |
|---|---|---|---|---|---|---|---|---|
| Naïve Bayes | Generative | \( P(y \mid x) \propto P(y) \prod_i P(x_i \mid y) \) | Maximize log-likelihood (or MAP) | Closed-form MLE/MAP | Conditional independence; distributional form (e.g., Gaussian) | Low: \( O(nd) \) | Robust | Closed-form |
| Fisher’s LDA | Generative | \( f(x) = w^\top x \) (linear) | Maximize between/within-class scatter ratio | Eigen-decomposition | Gaussian, homoscedastic (\( \Sigma_k = \Sigma \)), equal priors | Low: \( O(d^3) \) | Sensitive | Closed-form |
| Perceptron | Discriminative | \( f(x) = \operatorname{sign}(w^\top x) \) | Minimize number of misclassifications | Perceptron update rule | Data linearly separable | Low: \( O(ndT) \) | Sensitive | Guaranteed if separable |
| Logistic Regression | Discriminative | \( P(y=1 \mid x) = \sigma(w^\top x) \) | Minimize Negative Log-Likelihood | Gradient Descent / IRLS | Linear decision boundary in log-odds | Medium: iterative | Robust | Global optimum (convex) |
| Hard-margin SVM | Discriminative | \( f(x) = \operatorname{sign}(w^\top x + b) \) | Minimize \( \frac{1}{2} \|w\|^2 \) s.t. \( y_i (w^\top x_i + b) \geq 1 \) | Quadratic Programming (QP) | Perfectly linearly separable | Very Sensitive | Closed-form (via dual QP) | |
| Soft-margin SVM | Discriminative | \( f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right) \) | Minimize \( \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \) | QP (dual formulation) | None (soft margin handles noise) | Robust | Global optimum (convex QP) | |
| Decision Tree | Discriminative | Piecewise constant (axis-aligned regions) | Min Gini/Entropy (class) or MSE (reg) | Greedy recursive splitting | Informative features; axis-aligned splits | Medium: \( O(nd \log n) \) | Moderately robust | Heuristic (no global optimum) |
| Random Forest | Discriminative | Ensemble average of tree predictions | Same as base tree | Bagging + random feature subsets | Decorrelated trees improve generalization | High: \( O(n d T) \) | Robust | Stabilizes as \( T \to \infty \) |
| k-NN | Instance-based | \( \hat{y} = \operatorname{mode}(y_{\text{neighbors}}) \) | None (memorization) | None (lazy learner) | Smoothness: similar \( x \to \) similar \( y \) | High predict: \( O(nd) \) | Sensitive | N/A (no training) |
| PCA | Transformation | \( z = W^\top x \) | Minimize \( \|X - X W W^\top\|_F^2 \) | Eigendecomposition / SVD | High-variance = informative; linear subspace | Medium: \( O(d^3) \) | Sensitive | Closed-form |
| k-Means | Clustering | Assign to nearest centroid | Minimize \( \sum_{k} \sum_{x \in C_k} \|x - \mu_k\|^2 \) | Lloyd’s algorithm (EM-like) | Spherical, same-size, convex clusters | Medium: \( O(n k d T) \) | Sensitive | Local optimum (non-convex) |
| GMM | Generative | \( P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \) | Maximize log-likelihood | Expectation-Maximization (EM) | Data = mixture of Gaussians | Medium-High | Moderately robust | Local optimum (non-convex) |
ML Project Workflow (Kanban)
This is the typical workflow for a machine learning project, from problem definition to deployment. Drag and drop the cards to move tasks between stages.
🎯 Problem Definition
📥 Data Collection
🔍 EDA (Exploratory Data Analysis)
🛠️ Preprocessing
🧠 Model Training
📊 Model Evaluation
🚀 Deployment
Interactive ML Model Selector
Answer questions about your dataset to get personalized model recommendations.
Step 1: Dataset Overview
Step 2: Feature Characteristics
Step 3: Domain Requirements
Step 4: Data Quality
Step 5: Model Constraints
Model Recommendations
Dataset Summary
All Model Scores
The 10 Entities that would attribute to our decision process
A detailed comparison of core classification algorithms highlighting their underlying mathematical principles, assumptions, and complexity.
| Model | Family | Param / Non-param | Learning Setting | Inductive Bias | Loss Function | Regularization | Hyperparameters | Feature Dependency | Evaluation Strategy |
|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Discriminative | Parametric | Supervised Regression | Linear relationship between features and target | MSE (Mean Squared Error) | L2 (Ridge), L1 (Lasso) | alpha (regularization), fit_intercept | Assume linear dependencies, performance degrades with complex, non-linear interactions | CV, Train/Test Split, $R_2$ |
| Logistic Regression | Discriminative | Parametric | Supervised (Classification) | Linear Decision Boundary in log-odd space | Binary/Multinomial Cross-Entropy | L1, L2 | C (inv. regularization), penalty, solver | Cannot capture non-linear interactions | CV, Accuracy, ROC-AUC, log-loss |
| Perceptron | Discriminative | Parametric | Supervised (Classification) | Linearly separable, hard threshold hyperplane | None. But perceptron Loss | None. | None (or max_iter, eta) | Highly dependent on linear combinations of features | Accuracy,Perceptron convergence,training mistakes, iteration counts |
| Naive Bayes | Generative | Parametric | Supervised (Classification) | All features are independent of each other given class | None. Trained via Likelihood (MLE) / Neg Log-likelihood (MAP) | Smoothing (Laplace) | alpha (smoothing) | Low, ignore features dependency, biased | CV, ACCURACY, ROC-AUC, log-loss |
| k-NN | Discriminative | Non-parametric | Supervised (Classification/Regression) | Similarity hypothesis: smoothness: similar inputs -> outputs | None. Memorizes data, not optimizing any loss functions | None. k, distance metric | k, distance mtric, weighting | Highly sensitive to all features. Require feature Scaling | CV, ACCURACY(class), MSE |
| Decision Tree | Discriminative | Non-param | Supervised | Axis-parallel for decison boundary, hierarchical splits | Gini/Entropy (Classification), MSE (Reg) - splitting criteria | Pruning | max_depth, min_samples_split, min_samples_leaf, ccp_alpha | Medium as it greedily captures non-linear interactions | CV, F1, ACCURACY, MSE |
| Random Forest | Discriminative | Non-parametric | Supervised | High variance learners can form strong (low variance) learner | Aggregated Gini / Entropy / MSE, same as tree(per base learner) | Bagging, feature subset, tree depth, depth limites | n_estimators, max_features, max_depth | Low, using max_features at each split. | CV, OOB, ACCURACY, ROC-AUC |
| Linear SVM | Discriminative | Parametric | Supervised Classification | Maximum Margin Hyperplane or linear separator | Hinged Loss | L2 (implicit in margin) | C, kernel type (class_weight) | High: linear; sensitive to scale, irr. features | CV, ROC-AUC, F1 |
| Kernel SVM | Discriminative | Non-Parametric | Supervised Classification | Data linearly separable in a higher-dimensional space by kernel | Hinged Loss | L2 (in RKHS norm) | C, $ $ kernel width, gamma, degree | Sensitive to scale; RBF suffers in highD | CV, ROC-AUC |
| PCA | Neither, it is a transformation. | Parametric | Unsupervised | Directions of high variance are the most important | None (sometimes L2 reconstruction error) | None | n_components | High: require standardization | Explained variance, reconstruction error |
| k-Means | Clustering | Not sure, not a fixed function, but have fixed $k$ centroids or parameters | Unsupervised | Clusters are spherical, convex, same size/density | Sum of Squared Distances (inertia) | None | k (clusters), init method | High: needs scaling, degrades in highD | Silhoutte score, inertia |
| DBSCAN | Clustering | Non-parametric | Unsupervised | Clusters are continuous, high-density regions separated by low-density regions. | None (rule based) | None | eps, min_samples | Medium: need scaling, degrades in highD | Silhoutte, clusters found from visual inspection, domain validity |
| Gaussian Mixture Model | Generative | Parametric | Unsupervised | Data composed Gaussian distributions mixture | Neg Log-Likelihood | Covariance regularization (reg_covar) | n_components, covariance_type | High: need scaling, cov. est unstable in highD | held-out log Likelihood, BIC/AIC |
Top YouTube Channels for Core Mathematical Foundations
Besides graduate coursework, youtube-university gives us opportunities to learn from these expert-led channels:
1. 3Blue1Brown
Why it stands out: Unparalleled visual and geometric intuition.
- Linear Algebra: The "Essence of Linear Algebra" series is the gold standard—covers vectors, matrices, determinants, eigenvalues, and transformations with stunning animations.
- Calculus: "Essence of Calculus" explains derivatives, integrals, and gradient intuition in a way that directly connects to optimization.
- Optimization: Videos on neural networks and gradient descent (e.g., in the Deep Learning series) beautifully tie calculus + linear algebra to parameter optimization.
✅ Best for: Building deep geometric understanding—essential for interpreting ML algorithms.
2. Mathematical Monk (Jeffrey Miller)
Why it stands out: Rigorous yet accessible lectures from a former Duke professor.
- Linear Algebra: Clear treatment of vector spaces, linear maps, and SVD.
- Optimization Theory: Excellent playlist on Convex Optimization, duality, KKT conditions, and constrained optimization—rare on YouTube at this level.
- Calculus & Probability: His ML playlist includes gradient-based methods, Lagrange multipliers, and loss landscapes.
✅ Best for: Graduate-level intuition with formalism—ideal if you're bridging theory and practice.
3. Steve Brunton (databrunton)
Why it stands out: Applied mathematician focused on data-driven science and engineering.
- Linear Algebra: His "Linear Algebra" series covers SVD, PCA, and transformations with real-world examples.
- Optimization: Dedicated videos on gradient descent, Newton’s method, and constrained optimization (including Lagrange multipliers).
- Calculus: Explains gradients and Jacobians in the context of dynamical systems and ML.
✅ Best for: Seeing how these concepts power real algorithms (e.g., in control, ML, and signal processing).
4. MIT OpenCourseWare
Why it stands out: Full university courses from MIT.
- Linear Algebra: Gilbert Strang’s Linear Algebra—legendary course covering matrix theory, orthogonality, and applications.
- Calculus: Calculus Revisited (by Herbert Gross): Classic but clear.
- Optimization: Courses like Optimization Methods in Business Analytics or Convex Optimization by Prof. Asu Ozdaglar cover duality and algorithms rigorously.
✅ Best for: Structured, semester-style depth—ideal if you want textbook-level rigor.
5. Coding Train / Daniel Shiffman
While not theoretical, his Coding Math and ML playlists show how to implement gradients, matrix transforms, and optimizers from scratch in code—great for reinforcing theory with practice.
Bonus: StatQuest with Josh Starmer
Not focused on pure math, but his videos on gradient descent, logistic regression, and PCA translate calculus and linear algebra into bio/ML contexts with unmatched clarity.
Summary Recommendation
- Start with 3Blue1Brown for intuition.
- Deepen with Mathematical Monk or Steve Brunton for theory + application.
- Use MIT OCW when you need formal proofs or comprehensive coverage.
These channels together give you a complete mental toolkit—from visualizing a Jacobian to solving constrained optimization problems in high-dimensional spaces.