Training your RBM on protein sequences

We would like you to train your RBM on protein sequences collected from a large number of organisms. This closely follows this paper by Francesco Zamponi.

The protein sequences you are given are strings over an alphabet of 20 amino acids, plus a - to indicate that no amino acid is present at that position.

To get the data, use the following code:

!pip install bio
import pylab as plt
import random
from Bio import SeqIO
import numpy as np
import time

# Format the integer x as a zero-padded binary string of width n.
get_bin = lambda x, n: format(x, 'b').zfill(n)

# Read the file once; each record holds one protein sequence.
input_file='PF00014_mgap6.fasta'
fasta_sequences = list(SeqIO.parse(input_file,'fasta'))

# Give each symbol a 5-bit code, numbered in order of first appearance.
p=dict()
count=0
for fasta in fasta_sequences:
    for s in str(fasta.seq):
        if s not in p:
            p[s]=[int(k) for k in get_bin(count,5)]
            count=count+1

def SeqToIsing(sequence):
    # Concatenate the 5-bit codes of all symbols into one binary vector.
    visible=[]
    for s in sequence:
        visible.extend(p[s])
    return np.array(visible)

# One entry per protein sequence, each a vector of 0s and 1s.
sequences=np.array([SeqToIsing(str(seq.seq)) for seq in fasta_sequences])

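This converts each of the 21 symbols into a 5-bit binary code, so a sequence of N symbols becomes a binary vector of length 5N. Only 21 of the 32 possible 5-bit codes correspond to a symbol.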
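To read sampled vectors back as amino-acid strings you will need the inverse of this encoding. The following is a minimal sketch on a toy alphabet so that it runs without the FASTA file; `build_code`, `encode`, and `decode` are hypothetical helper names, not part of the code above, and `build_code` is assumed to mirror how `p` is filled.

```python
import numpy as np

def build_code(seqs, n_bits=5):
    """Give each distinct symbol an n_bits code, in order of first appearance."""
    code = {}
    for seq in seqs:
        for s in seq:
            if s not in code:
                code[s] = [int(k) for k in format(len(code), "b").zfill(n_bits)]
    return code

def encode(seq, code):
    """Concatenate the codes of all symbols into one binary vector."""
    return np.array([bit for s in seq for bit in code[s]])

def decode(visible, code, n_bits=5):
    """Invert encode(); chunks that match no symbol are shown as '?'."""
    inverse = {tuple(bits): s for s, bits in code.items()}
    chunks = np.asarray(visible).reshape(-1, n_bits)
    return "".join(inverse.get(tuple(int(k) for k in chunk), "?") for chunk in chunks)

# Toy stand-in for the protein data.
toy = ["AC-D", "CAD-"]
code = build_code(toy)
v = encode("AC-D", code)
roundtrip = decode(v, code)
```

With the real data you could pass `p` in place of `code`. The `'?'` fallback matters because a sampled 5-bit chunk may be one of the codes that belongs to no symbol.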

Please train your RBM on these sequences.
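If you want something to check your own implementation against, here is a minimal sketch of a binary RBM trained with one-step contrastive divergence (CD-1), recording the mean free energy of the data after each epoch. It assumes the energy E(v, h) = -a·v - b·h - v·W·h with 0/1 units; the function names, the hyperparameters, and the random toy data are all illustrative assumptions, and your RBM may use a different training scheme.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, a, b):
    """F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)), one value per row of v."""
    return -(v @ a) - np.logaddexp(0.0, v @ W + b).sum(axis=1)

def train_cd1(data, n_hidden=16, epochs=5, lr=0.05, batch=10, seed=0):
    """Train with CD-1; return parameters and mean free energy per epoch."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)   # visible biases
    b = np.zeros(n_hidden)    # hidden biases
    history = []
    for _ in range(epochs):
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch):
            v0 = data[order[start:start + batch]]
            # Positive phase: hidden probabilities and a sample given the data.
            ph0 = sigmoid(v0 @ W + b)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            # Negative phase: one reconstruction step.
            pv1 = sigmoid(h0 @ W.T + a)
            v1 = (rng.random(pv1.shape) < pv1).astype(float)
            ph1 = sigmoid(v1 @ W + b)
            # Gradient estimate: data statistics minus reconstruction statistics.
            W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
            a += lr * (v0 - v1).mean(axis=0)
            b += lr * (ph0 - ph1).mean(axis=0)
        history.append(free_energy(data, W, a, b).mean())
    return W, a, b, history

# Random toy data standing in for `sequences`.
rng = np.random.default_rng(1)
toy = (rng.random((40, 20)) < 0.3).astype(float)
W, a, b, history = train_cd1(toy)
```

Calling `train_cd1(sequences)` and then `plt.plot(history)` would give a free-energy-versus-epoch curve for the protein data under these assumptions.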

Analyze your results

  • Plot the free energy as a function of training epoch.

  • Produce 10 sequences that are one Gibbs step away from sequences already in the dataset. Check whether those sequences seem reasonable. In the real world, we would then take these sequences and try to produce them.

  • Produce a large number of sequences and compare the probability of seeing each amino acid at site 3 in your samples with the histogram of that amino acid at site 3 in your dataset.
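The sampling tasks above can be sketched as follows. This is a rough outline under stated assumptions: the weights and data are random stand-ins rather than a trained model, `gibbs_step` and `site_histogram` are hypothetical helper names, "one Gibbs step" is read as one full visible-to-hidden-to-visible sweep, and "site 3" is read as the third position counted from 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One sweep v -> h -> v' for a batch of binary visible vectors."""
    ph = sigmoid(v @ W + b)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h @ W.T + a)
    return (rng.random(pv.shape) < pv).astype(int)

def site_histogram(visible, site, n_bits=5):
    """Relative frequency of each n_bits code at position `site` (0-based)."""
    block = np.asarray(visible)[:, site * n_bits:(site + 1) * n_bits].astype(int)
    # Read each 5-bit chunk as an integer, most significant bit first.
    ids = block @ (2 ** np.arange(n_bits - 1, -1, -1))
    return np.bincount(ids, minlength=2 ** n_bits) / len(block)

rng = np.random.default_rng(0)
n_sites, n_hidden = 6, 8
data = rng.integers(0, 2, size=(200, 5 * n_sites))       # stand-in for `sequences`
W = 0.1 * rng.standard_normal((5 * n_sites, n_hidden))   # stand-in for trained weights
a = np.zeros(5 * n_sites)
b = np.zeros(n_hidden)

# Ten vectors that are one Gibbs step away from existing data rows.
neighbours = gibbs_step(data[:10], W, a, b, rng)

# Many samples: run several sweeps starting from the data.
samples = data.copy()
for _ in range(50):
    samples = gibbs_step(samples, W, a, b, rng)

site = 2  # third position when counting from 1
hist_data = site_histogram(data, site)
hist_model = site_histogram(samples, site)
```

With the encoding used earlier, the integer read from each chunk is the order in which that symbol first appeared, so entries 21 to 31 of the histogram count chunks that match no symbol. Plotting `hist_data` and `hist_model` side by side, for example with `plt.bar`, gives the comparison asked for.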