

Notes from 100 Page ML Book

I decided to add notes to this blog too. All such notes will be tagged with "book" for easier search. These are my notes from reading Andriy Burkov's "The Hundred-Page Machine Learning Book" (available on Amazon). Do support the author if you can.

A quick note on how I make notes: I first annotate the PDF of the book, and then type up the text to make it searchable. Yes, I could probably use OCR, but this helps me remember more. Also, these are not meant to be comprehensive reviews, only what I find interesting from the book. I read a lot about Deep Learning, so that will keep popping up.

Okay now let us get to it :)

Initial thoughts from the content

Seems like a book which summarizes ML and a tiny bit of DL
Not in depth but more of an executive summary of sorts
Most of the major algorithms explained in brief
Bits of extra information scattered here and there

Notes

I skipped making notes of anything I knew prior. So these points are things that I wanted to read again or just found interesting while I was reading the book.
I skipped things like linear regression while making notes, so if you don't know what those are, better read the book :)
Why ML -> Solve practical problems

SVM

SVM sees every feature vector as a point in an n dimensional space and separates the classes with a hyperplane
minimize euclidean norm of w (this maximizes the margin)
kernels make the decision boundary non linear
look for largest margin
Hinge loss -> used if data is not linearly separable. penalizes points on the wrong side of the decision boundary
SVMs with hinge loss -> soft margin. original formulation -> hard margin
large margin : better generalization
kernel trick -> implicitly transform original space into a higher dimensional space
lagrange multipliers -> convert optimization problem to an equivalent representation -> can be solved by quadratic programming algos
RBF most widely used
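
To make the hinge loss and the RBF kernel from the notes above concrete, here is a minimal sketch in plain Python. The function names, the `w.x - b` sign convention and the `sigma` parameter are my own assumptions for illustration, not code from the book.

```python
import math

def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x - b)) for one example, y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return max(0.0, 1.0 - y * score)

def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel exp(-||x - z||^2 / (2 * sigma^2)); sigma is an assumed width."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# A point well on the correct side costs nothing,
# a point inside the margin or on the wrong side is penalized.
w, b = [1.0, 0.0], 0.0
print(hinge_loss(w, b, [2.0, 0.0], +1))   # correct side, outside margin
print(hinge_loss(w, b, [0.5, 0.0], +1))   # correct side, inside margin
print(hinge_loss(w, b, [2.0, 0.0], -1))   # wrong side
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))
```

The kernel only ever needs the distance between two points, which is why the higher dimensional space never has to be built explicitly.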

Random variable

Prob distribution -> list of probs associated with each possible value of a discrete random variable -> prob mass function
continuous random variable -> inf possible values in interval -> prob density function
expectation -> mean of random variable
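
A tiny sketch of the discrete case, assuming the prob mass function is stored as a dict from value to probability (the `expectation` helper and the fair die are just my illustration):

```python
def expectation(pmf):
    """Expectation of a discrete random variable: sum of value * probability."""
    return sum(value * prob for value, prob in pmf.items())

# Prob mass function of a fair six-sided die (hypothetical example).
die = {face: 1 / 6 for face in range(1, 7)}

print(sum(die.values()))   # probabilities sum to (approximately) 1
print(expectation(die))    # approximately 3.5
```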

Unbiased estimator

Estimator is unbiased if, over an unlimited no of samples, its mean (expectation) gives the actual value of the statistic.

Shallow learning

Learns parameters directly from features.
Vs DL -> learnt from outputs of previous layers

Cost func

avg loss -> empirical risk

Decision tree

acyclic graph
in each branching node, a specific feature is examined
choose left or right branch based on threshold
ID3 optimizes the criterion approximately by constructing a non parametric model
recursively continue
Entropy is an uncertainty measure -> max when all random values have equal probability
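
A quick sketch to check the entropy claim above. I am assuming log base 2 (entropy in bits); the `entropy` helper is mine.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # uniform over 2 values -> the maximum for 2 values
print(entropy([0.9, 0.1]))   # skewed -> lower
print(entropy([1.0]))        # no uncertainty at all
```

A decision tree split is good when the child nodes have lower entropy than the parent.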

GD

SGD -> minibatch SGD uses small batches (subsets) of the training data to approximate the gradient
adagrad -> scales learning rate α for each parameter wrt history of gradients
momentum -> accelerate SGD
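
A minimal sketch of the momentum and Adagrad update rules. To keep it short I use the exact gradient of a toy function f(w) = (w - 3)^2 instead of minibatches, and the learning rates, `beta` and `eps` are assumed values, so treat it as an illustration of the updates only.

```python
import math

def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gd_momentum(w, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: the velocity accumulates past gradients."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

def adagrad(w, lr=1.0, eps=1e-8, steps=200):
    """Adagrad: divide the step by the root of the summed squared gradients."""
    g2 = 0.0
    for _ in range(steps):
        g = grad(w)
        g2 += g * g
        w -= lr * g / (math.sqrt(g2) + eps)
    return w

print(gd_momentum(0.0))   # close to the minimum at w = 3
print(adagrad(0.0))       # close to the minimum at w = 3
```

With several parameters, Adagrad keeps one `g2` per parameter, which is what gives each parameter its own effective learning rate.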

Techniques

Binning -> convert continuous feature into multiple binary ones (buckets)
Normalize -> rescale feature into a standard range like [0, 1]. Can increase learning speed
Standardization -> rescale so that feature has μ = 0 and σ = 1
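
The three techniques above, sketched with the standard library only. Helper names and the age buckets are hypothetical, and I assume the feature is not constant (otherwise the divisions blow up).

```python
import statistics

def normalize(xs):
    """Min-max normalization into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: subtract the mean, divide by the (population) std dev."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def bin_one_hot(x, edges):
    """Binning: turn one continuous value into len(edges) + 1 binary features."""
    idx = sum(x >= e for e in edges)   # index of the bucket x falls into
    return [1 if i == idx else 0 for i in range(len(edges) + 1)]

print(normalize([2, 4, 6]))
print(standardize([1, 2, 3]))
print(bin_one_hot(25, [18, 40]))   # age buckets: <18, 18 to 39, 40+
```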

Data imputation

same value outside normal range
avg value
use regression to fix
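
A small sketch of the first two strategies, assuming missing values are marked with `None`. The `impute` helper and the `-1.0` sentinel are my own choices; the regression variant is left out to keep it short.

```python
def impute(values, strategy="mean", sentinel=-1.0):
    """Fill missing entries (None) with the mean or with an out-of-range value."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = sum(present) / len(present)
    else:
        fill = sentinel   # same value outside the normal range of the feature
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0]))               # mean imputation
print(impute([1.0, None, 3.0], "sentinel"))   # out-of-range imputation
```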

Regularization

L1 -> sparse model, feature selection, lasso reg
L2 -> shrinks weights but keeps all features, differentiable, ridge reg
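
One way to see why L1 gives sparse models and L2 does not is the one dimensional case, where both penalized problems have a closed form. This is a toy sketch of my own (single weight, unpenalized optimum at `a`), not something from the book.

```python
def l1_solution(a, lam):
    """argmin over w of 0.5 * (w - a)^2 + lam * |w|  (soft thresholding)."""
    mag = max(abs(a) - lam, 0.0)
    return mag if a >= 0 else -mag

def l2_solution(a, lam):
    """argmin over w of 0.5 * (w - a)^2 + lam * w^2  (plain shrinkage)."""
    return a / (1.0 + 2.0 * lam)

# A weak feature (small a) is zeroed out by L1 but only shrunk by L2.
print(l1_solution(0.3, 0.5), l2_solution(0.3, 0.5))
# A strong feature survives both.
print(l1_solution(2.0, 0.5), l2_solution(2.0, 0.5))
```

L1 sets small weights to exactly zero, which is the feature selection effect; L2 only scales every weight towards zero.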

Hyper param

Grid search
Bayesian optimization
Evolutionary optimization
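
Grid search is the simplest of the three, so here is a rough sketch of it. `grid_search` and `toy_score` are hypothetical; in practice the score function would train a model and return a validation metric.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Try every combination in the grid and keep the best scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(C, gamma):
    """Stand-in for a validation score, peaking at C=10, gamma=0.1."""
    return -((C - 10) ** 2) - (gamma - 0.1) ** 2

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params)
```

The cost grows with the product of the list lengths, which is the usual reason to move to the other two methods.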

RNN

Sequence
not feed forward -> loops
each unit gets 2 inputs -> vector of outputs from prev layer, vector of states from same layer at prev time step
backprop through time
gated RNN -> forget gate
store info for future use
read write and erase info stored in units
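
To show the "two inputs per unit" idea, here is a sketch of a plain (ungated) recurrent layer in pure Python. The weight names `W`, `U`, `b` and the tanh activation are assumptions for illustration; a gated unit adds gates on top of this.

```python
import math

def rnn_step(x, h_prev, W, U, b):
    """One time step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W[i][j] * x[j] for j in range(len(x)))            # input from prev layer
        z += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))  # state from prev time step
        h.append(math.tanh(z))
    return h

def run_rnn(xs, h0, W, U, b):
    """Feed a sequence through the layer, carrying the state along."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, W, U, b)
        states.append(h)
    return states

# One unit, one input feature, sequence of length 2.
states = run_rnn([[1.0], [0.0]], [0.0], W=[[1.0]], U=[[0.5]], b=[0.0])
print(states)
```

The second state is non zero even though the second input is zero, because the state from the first step is carried over.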

Seq2seq

Encoder -> generate state with meaning representation -> embedding
decoder -> take embedding and give output
best results with attention

Ensemble

Train many low accuracy models and combine

Other learnings

Active learning -> add labels to those examples which contribute most to model quality. Pick by density (how many examples are around x) or uncertainty (how uncertain the model's prediction is)
SVM based -> train an SVM, then get the unlabeled examples closest to the hyperplane annotated
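
The uncertainty criterion for a binary classifier can be sketched like this, assuming we already have the model's predicted P(y = 1 | x) for every unlabeled example (`most_uncertain` is a hypothetical helper):

```python
def most_uncertain(probs, k=1):
    """Indices of the k unlabeled examples whose prediction is closest to 0.5."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]

# Predicted P(y = 1 | x) for four unlabeled examples.
probs = [0.95, 0.52, 0.10, 0.40]
print(most_uncertain(probs, k=2))   # these go to the annotator first
```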

Semi supervised

self learning
autoencoder
bottleneck layer -> embedding
denoising -> corrupts left hand side (the input) with random perturbation / normal gaussian noise

Zero shot

use embeddings to represent input x and also output y

Combine models

Average
majority vote
Stack -> Train a meta-model on the outputs of the base models. Tune hyper params of the stacked model with cross-validation
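
Averaging and majority vote are simple enough to sketch directly. The helper names are mine, and the inputs stand for the predictions of several already trained models on one example.

```python
from collections import Counter

def average(predictions):
    """Combine regression outputs (or class probabilities) by averaging."""
    return sum(predictions) / len(predictions)

def majority_vote(labels):
    """Combine classifier outputs by taking the most common label."""
    return Counter(labels).most_common(1)[0][0]

print(average([0.2, 0.4, 0.9]))
print(majority_vote(["cat", "dog", "cat"]))
```

Stacking replaces these fixed rules with a learned model, which is why it needs its own training data and tuning.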

Other stuff

regularization -> dropout, batch norm, early stop
avoid loops -> use vectorized operations
density estimation -> model probability density fn -> useful for novelty detection
DBSCAN -> build clusters with arbitrary shape
Gaussian mixture model -> member of several clusters with diff membership score
UMAP seems to be better than t-SNE :o
Ranking -> LambdaMART -> optimize lists directly on a metric, e.g. Mean Average Precision (MAP)
