Index page
100PageMlblook
Notes from 100 Page ML Book
I decided to add notes to this blog too. All such notes will be tagged with "book" for easier search.
This one is my notes while reading "Andriy Burkov : The Hundred-Page Machine Learning Book". Amazon. Do support the author if you can.
A quick note on how I make notes. I first annotate the pdf of the book. And then type down the text to make it searchable. Yes I probably could use OCR but this helps me remember more. Also, this is not meant to be comprehensive reviews but only what I find interesting from the book. I read a lot about Deep Learning so these will keep popping up.
Okay now let us get to it :)
Initial thoughts from the content
- Seems like a book which summarizes ML and tiny bit of DL
- Not in depth but more of an executive summary of sorts
- Most of the major algorithms explained in brief
- Bits of extra information scattered here and there
Notes
- I skipped making notes of anything I knew prior. So these points are things that I wanted to read again or just found interesting while I was reading the book.
- I skipped things like linear regression while making notes so if you dont know what those are better read the book :)
- Why ML -> Solve practical problems
SVM
- SVM sees feature vectors as high dimensional spaces and puts them on a n dimensional plot with an n dimensional hyperplace
- minimize euclidean norm
- kernels that make boundaries non linear
- look for largest margin
- Hinge loss -> if data is not linearly separable. penalizes the side of the decision boundary
- SVMs with hinge -> soft margin. normal -> hard margin
- largin margin : generalization
- kernel trick -> implicitly transform original space into a higher dimensional space
- lagrange multipliers -> optimization problem by finding equivalent representation -> can be solved by quadratic algos
- RBF most widely used
Random variable
- Prob distribution -> list of prov associated with each possible value -> prob mass function
- continuous random variable -> inf possible values in interval -> prob density function
- expectation -> mean of random variable
Unbiased estimator
- Unlimited no of unbiased estimators -> mean will give actual value.
Shallow learning
- Learns parameters directly from features.
- Vs DL -> learnt from outputs of previous layers
Cost func
- avg loss -> empirical risk
Decision tree
- acyclic graph
- in each branch, specific feature is examined
- choose next leaf based on threshold
- ID3 is approximated by constructing a non parametric model
- recursively continue
- Entropy is an uncertainty measure -> max when all random values have equal probability
GD
- SGD -> uses batches to compute gradient
- adagrad -> scales ¦Á for each parameter wrt history
- momentum -> accelerate SGD
Techniques
- Binning -> convert continous feature into multiple binary ones
- Normalize -> Increase speed
- Standardization -> scale between ¦Ì and ¦Ò
Data imputation
- same value outside normal range
- avg value
- use regression to fix
Regularization
- L1 -> sparse model,lasso reg
- L2 -> feature selection, ridge reg
Hyper param
- Grid search
- Bayesian optimization
- Evolutionary optimization
RNN
- Sequence
- not feed forward -> loops
- each unit gets 2 inps -> vector of outputs from prev layer, vector of states from prev time step
- backprop through time
- gated RNN -> forget gate
- store info for future use
- read write and erase info stored in units
Seq2seq
- Encoder -> generate state with meaning representation -> embedding
- decoder -> take embedding and give output
- best results with attention
Ensemble
- Train many low accuracy models and combine
Other learnings
- Active learning -> label add to those which contribute most to model. Either density (how many examples around x) or uncertainty (how uncertain prediction of model)
- SVM -> Use svm to predict differences and get them annotated
Semi supervised
- self learning
- autoencoder
- bottleneck layer -> embedding
- denoising -> corrupts left hand side with random peturbation/ normal gaussian noise
Zero shot
- use embeddings to represent input x and also output y
Combine models
- Average
- majority vote
- Stack -> Use stacked model to tune hyper params
Other stuff
- regularization -> dropout, batch norm, early stop
- avoid loops
- density estimation -> model probablity density fn -> novelty
- DBSCAN -> build clusters with arbitrary shape
- Gaussian mixture model -> member of several clusters with diff membership score
- UMAP seems to be better then tsne :o
- Ranking -> LambdaMart -> optimize lists on metric. eg Mean average precision (MAP)