Protein Modeling

  • Using Bayesian models
  • Task: estimate the probability mass function (PMF, since the variable is discrete) of a finite, discrete distribution, given a histogram from a sample
  • Large number of categories and small number of observations
    • Estimate the probability distribution of amino acids in each column of a protein class: a 20-dimensional PMF (one for each site)
    • Sequences in a class can be aligned, so columns correspond to sites
    • High chance that some amino acid is absent from the data at a site
      • MLE will assign probability 0 to any such amino acid X
      • Wrong decisions are made for many sequences that were not in the training set
      • So MLE cannot be used
  • 20-dimensional PMF for the amino acid distribution:
    • Data: count vectors of the amino acids found at a given site in training data D
    • Counts are distributed according to a multinomial distribution with l = 20 categories
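
The zero-probability problem with MLE can be seen on a toy example. The sketch below is only an illustration: the column "LLIVLM" and the helper names count_vector and mle_pmf are made up here and are not part of the notes.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_vector(column):
    """Count how often each amino acid occurs in one alignment column."""
    counts = Counter(column)
    return [counts[a] for a in AMINO_ACIDS]


def mle_pmf(counts):
    """Maximum-likelihood estimate: relative frequencies n_i / N."""
    total = sum(counts)
    return [n / total for n in counts]


# Hypothetical data: residues observed at one site in 6 aligned sequences.
column = "LLIVLM"
counts = count_vector(column)
pmf = mle_pmf(counts)

# Amino acids never seen at this site receive probability exactly 0.
unseen = [a for a, p in zip(AMINO_ACIDS, pmf) if p == 0.0]
print(len(unseen), "of 20 amino acids have MLE probability 0")
```

With only 6 observations and 20 categories, most categories are necessarily unseen, so any new sequence carrying one of them at this site would get likelihood 0 under the MLE.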

Using Prior

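  • Zero probabilities should not occur, and the probabilities must sum to 1
    • The valid PMFs form a 19-dimensional hypervolume (the simplex)
    • This is a continuous space, so a PDF can be defined over it
    • A Dirichlet distribution is used as the prior over PMFs; it has l = 20 parameters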
  • s is fixed beforehand
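
A minimal sketch of how a Dirichlet prior removes the zeros, using the standard result that a Dirichlet prior is conjugate to the multinomial, so the posterior mean is (n_i + alpha_i) / (N + sum of alphas). The symmetric choice alpha_i = 1 and the toy column "LLIVLM" are assumptions made for illustration; the notes do not specify the prior parameters, and the function name posterior_mean_pmf is hypothetical.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def posterior_mean_pmf(counts, alphas):
    """Posterior mean of a multinomial PMF under a Dirichlet(alphas) prior."""
    total = sum(counts) + sum(alphas)
    return [(n + a) / total for n, a in zip(counts, alphas)]


# Hypothetical data: residues observed at one site in 6 aligned sequences.
counts = [0] * len(AMINO_ACIDS)
for residue in "LLIVLM":
    counts[AMINO_ACIDS.index(residue)] += 1

# Assumption: symmetric prior with one pseudo-count per amino acid.
alphas = [1.0] * len(AMINO_ACIDS)

pmf = posterior_mean_pmf(counts, alphas)

# Every amino acid now has strictly positive probability.
print(min(pmf) > 0.0)
```

The prior parameters act as pseudo-counts added to the observed counts, so unseen amino acids keep a small but nonzero probability while the estimate still sums to 1.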