Protein Modeling
- Using Bayesian models
- Task: estimate the probability mass function (PMF, since the variable is discrete) of a finite, discrete distribution, given a histogram from a sample
- Large number of categories and small number of observations
- Estimate the probability distribution of amino acids in each column (site) of a protein class: one 20-dimensional PMF per site
- Sequences within a class can be aligned, so each column corresponds to the same site
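- High chance that some category (amino acid) does not appear in the data for a given site
- The MLE assigns probability 0 to any amino acid X that was not observed
- Wrong decisions are then made for the many cases that were not in the training set
- So the plain MLE cannot be used here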
- 20-dimensional PMF for the amino acid distribution: θ = (θ1, …, θ20)′ = (P(X=A), …, P(X=Y))′
- Count vectors of the amino acids found at a given site in the training data D
- These counts are distributed according to a multinomial distribution with l = 20 categories
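
The zero-probability problem can be illustrated with a small sketch. The column below is made-up data (an assumption, not from any real alignment), and `mle_pmf` is a hypothetical helper name; the MLE here is simply the relative frequency of each amino acid.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (A ... Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mle_pmf(column):
    """MLE of the 20-dim PMF: relative frequency of each amino acid."""
    counts = Counter(column)
    n = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}

# Hypothetical alignment column: only 8 observations for 20 categories.
column = "AAAGGSLV"
theta_hat = mle_pmf(column)

# Every amino acid that never appeared gets probability exactly 0.
unseen = [aa for aa, p in theta_hat.items() if p == 0.0]
print(len(unseen), "of 20 amino acids have MLE probability 0")
```

With few observations and 20 categories, most entries of the estimate are zero, so any new sequence with an unseen residue at this site would be assigned probability 0.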
Using Prior
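- Zero probabilities should not occur: H = {(θ1, …, θ20)′ ∈ ℝ²⁰ ∣ θj ∈ (0,1) and Σj θj = 1}
- H is a 19-dimensional hypervolume (the sum constraint removes one degree of freedom)
- H is a continuous space, so a PDF can be defined over it
- The Dirichlet distribution is used as the prior over H; it is parameterized by l = 20 values α1, …, α20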

- The αj are fixed beforehand, before seeing the data
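
A minimal sketch of how a fixed Dirichlet prior removes the zeros. The notes stop at the prior itself; the estimate used here is the standard posterior mean under a Dirichlet prior with multinomial counts, θj = (nj + αj) / (N + Σk αk), which is a step beyond what the notes state. The choice αj = 1 for all j, the example column, and the name `posterior_mean` are assumptions for illustration only.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (A ... Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def posterior_mean(column, alpha):
    """Posterior mean of theta under a Dirichlet(alpha) prior and multinomial counts."""
    counts = Counter(column)
    n = sum(counts[aa] for aa in AMINO_ACIDS)
    total = n + sum(alpha[aa] for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + alpha[aa]) / total for aa in AMINO_ACIDS}

# Assumption: a uniform prior with alpha_j = 1, fixed before seeing the data.
alpha = {aa: 1.0 for aa in AMINO_ACIDS}

# Hypothetical alignment column with 8 observations.
column = "AAAGGSLV"
theta_post = posterior_mean(column, alpha)

# Unseen amino acids now get a small but strictly positive probability.
print(min(theta_post.values()) > 0.0)
```

Because every αj is positive, every component of the estimate stays inside (0,1), so it lies in H as required.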