Training your RBM on protein sequences
We would like you to train your RBM on protein sequences drawn from a large number of organisms. This closely follows this paper by Francesco Zamponi.
The protein sequences you are given are strings over the 20 amino acids (plus a - to indicate a gap, i.e. no amino acid at that position).
To get the data, use the following code:
!pip install biopython
import pylab as plt
import random
from Bio import SeqIO
import numpy as np
import time

# Return the integer x as a binary string, zero-padded to n digits.
get_bin = lambda x, n: format(x, 'b').zfill(n)

# First pass: give each distinct symbol a 5-bit code, in order of first appearance.
p = dict()
count = 0
input_file = 'PF00014_mgap6.fasta'
fasta_sequences = SeqIO.parse(open(input_file), 'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    for s in sequence:
        if s not in p.keys():
            p[s] = [0 if k == '0' else 1 for k in get_bin(count, 5)]
            count = count + 1

def SeqToIsing(sequence):
    # Concatenate the 5-bit codes of all symbols into one binary vector.
    visible = []
    for s in sequence:
        visible = visible + p[s]
    return np.array(visible)

# Second pass: encode every sequence as one row of a binary matrix.
input_file = 'PF00014_mgap6.fasta'
fasta_sequences = list(SeqIO.parse(open(input_file), 'fasta'))
sequences = []
for seq in fasta_sequences:
    sequences.append(SeqToIsing(str(seq.seq)))
sequences = np.array(sequences)
This converts each of the 21 symbols into a 5-bit binary code, so a sequence of N symbols becomes a vector of 5N binary (0/1) visible units.
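To read sequences back out of the RBM you will also need the inverse of this encoding. The sketch below is one possible way to do it; `build_codebook`, `seq_to_ising`, `ising_to_seq` and the toy alphabet order are illustrative names and assumptions, not part of the code you were given (in the loader above the symbol order depends on the FASTA file). Note that 5 bits give 32 patterns while only 21 are used, so a sampled visible vector can contain 5-bit blocks that correspond to no symbol; here they are decoded as `?`.

```python
import numpy as np

BITS = 5  # bits per symbol, as in the loader above

def build_codebook(symbols):
    """Give each symbol the BITS-bit binary code of its index (order of first appearance)."""
    return {s: [int(k) for k in format(i, 'b').zfill(BITS)]
            for i, s in enumerate(symbols)}

def seq_to_ising(sequence, codebook):
    """Concatenate the codes of all symbols into one binary vector."""
    return np.array([bit for s in sequence for bit in codebook[s]])

def ising_to_seq(visible, codebook, unknown='?'):
    """Invert seq_to_ising; bit patterns that match no symbol become `unknown`."""
    inverse = {tuple(code): s for s, code in codebook.items()}
    chunks = np.asarray(visible).reshape(-1, BITS)
    return ''.join(inverse.get(tuple(int(x) for x in chunk), unknown)
                   for chunk in chunks)

# Toy alphabet in a hypothetical order; with the real data, reuse the dict `p` instead.
codebook = build_codebook("-ACDEFGHIKLMNPQRSTVWY")
v = seq_to_ising("AC-Y", codebook)
decoded = ising_to_seq(v, codebook)
print(v)
print(decoded)
```

With the real data, passing the dictionary `p` from the loader as `codebook` should work unchanged, since it has the same symbol-to-bits structure.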
Please train your RBM on these sequences.
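If you do not already have a training loop from your own RBM, the following is a minimal sketch of one, assuming a binary (0/1) RBM trained with one-step contrastive divergence (CD-1). The hyperparameters (`n_hidden`, `lr`, `batch_size`, `epochs`) are placeholder values, not recommendations, and the random `toy` array only stands in for `sequences` so that the block runs on its own. The average free energy of the data is recorded after each epoch, which is the quantity you can plot later.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, a, b):
    """F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)), one value per row of v."""
    return -v @ a - np.logaddexp(0.0, v @ W + b).sum(axis=1)

def cd1_step(v0, W, a, b, lr):
    """One CD-1 update on a mini-batch v0; W, a, b are modified in place."""
    ph0 = sigmoid(v0 @ W + b)                          # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden units
    pv1 = sigmoid(h0 @ W.T + a)                        # P(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)   # sample reconstruction
    ph1 = sigmoid(v1 @ W + b)                          # P(h=1 | v1)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

def train(data, n_hidden=20, epochs=10, batch_size=32, lr=0.05):
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)   # visible biases
    b = np.zeros(n_hidden)    # hidden biases
    history = []              # mean free energy of the data after each epoch
    for epoch in range(epochs):
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            cd1_step(data[order[start:start + batch_size]], W, a, b, lr)
        history.append(free_energy(data, W, a, b).mean())
    return W, a, b, history

# Stand-in data so the sketch runs by itself; replace `toy` with `sequences`.
toy = (rng.random((200, 30)) < 0.3).astype(float)
W, a, b, history = train(toy)
```

Calling `train(sequences)` on the encoded proteins should work the same way; `history` can then be passed to `plt.plot`.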
Analyze your results
Plot the free energy as a function of epoch.
Produce 10 sequences that are one Gibbs step away from sequences already in the dataset, and check whether they seem reasonable. In the real world, the next step would be to take these sequences and try to synthesize them.
Produce a large number of sequences and compare the probability of seeing each amino acid at site 3 with the histogram of amino acids at that site in your dataset.
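The last two tasks can be sketched as follows, again assuming a binary (0/1) RBM with weight matrix `W`, visible biases `a` and hidden biases `b`. The parameters and data below are random stand-ins so the block runs on its own; substitute your trained parameters and `sequences`. Two further assumptions are made and should be checked against your own setup: "site 3" is read as 1-indexed (so `site = 2` in the code), and the chain length of 100 steps is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
BITS = 5  # bits per symbol, as in the data loader

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b):
    """One full Gibbs step v -> h -> v' for a binary RBM (rows of v are samples)."""
    ph = sigmoid(v @ W + b)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h @ W.T + a)
    return (rng.random(pv.shape) < pv).astype(float)

def site_histogram(samples, site):
    """Frequency of each 5-bit code (as an integer 0..31) at a 0-indexed site."""
    bits = samples[:, site * BITS:(site + 1) * BITS].astype(int)
    codes = bits @ (2 ** np.arange(BITS - 1, -1, -1))  # most significant bit first
    return np.bincount(codes, minlength=2 ** BITS) / len(samples)

# Random stand-ins; replace with your trained W, a, b and the real `sequences`.
n_sites, n_hidden = 8, 16
data = (rng.random((500, n_sites * BITS)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((n_sites * BITS, n_hidden))
a = np.zeros(n_sites * BITS)
b = np.zeros(n_hidden)

# Ten sequences that are one Gibbs step away from existing ones.
one_step = gibbs_step(data[:10], W, a, b)

# Many samples: run longer chains started from the data.
samples = data.copy()
for _ in range(100):
    samples = gibbs_step(samples, W, a, b)

site = 2  # "site 3", assuming 1-indexed sites
model_hist = site_histogram(samples, site)
data_hist = site_histogram(data, site)
```

The integer code at each site equals the index assigned to the symbol by the loader (the value of `count` when the symbol was first seen), so the two histograms can be compared bar by bar, for example with `plt.bar`. Codes 21 to 31 correspond to no symbol; any weight the model puts there is worth reporting.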