
Sequence Promoter Prediction with CNNs, An Example

16 December, 2019 - 7 min read

In this article, I'll walk through a reimplementation of "Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks". This isn't meant to be a faithful reproduction; rather, the workflow serves as an example of how to build a similar model, on the same dataset used in the paper, with open-source software standard to the bioinformatics and deep learning communities.

The workflow is as follows:

  1. Use lineflow to implement a dataloader over a set of FASTA sequences.
  2. Use gcgc to tokenize the sequences from their textual representation into an integer representation (which we'll use for embeddings).
  3. Use PyTorch and PyTorch Lightning to fit a CNN.

Using lineflow to load Sequences

I like lineflow. It's a library for building NLP data loaders for use with both PyTorch and TensorFlow. I also like the pattern it adopts: I implement an interface that it expects, and the rest is taken care of for me (this is also how PyTorch Lightning works).

To create our loader, we need to implement the lineflow.core.DatasetMixin interface, which is then used by lineflow's dataset object.

import lineflow

class FASTADatasetMixinImplementation(lineflow.core.DatasetMixin):
    """This class implements the DatasetMixin methods which are then extended by lineflow."""

    def __init__(self, fasta: Fasta):
        """Initialize the FASTA loader."""
        self.fasta = fasta

    def __iter__(self):
        """Iterate through the FASTA index."""
        for row in self.fasta:
            yield row

    def __len__(self):
        """Return the number of FASTA sequences."""
        return len(self.fasta.keys())

    def get_example(self, i: int):
        """Get the single example."""
        return self.fasta[i]

The reader will notice the fasta: Fasta annotation in the init method declaring the type, and wonder where the fasta comes from. Concretely, the dataset I'm using is the one the authors of the paper released; it can be found here. For the Fasta object itself, I'm using pyfaidx, a Python library that provides indexed lookups into the sequences of a FASTA file. BioPython could have been used as well.
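
As a quick illustration of what pyfaidx gives us, here's a small sketch; the path assumes a local copy of the paper's E. coli promoter file.

from pyfaidx import Fasta

# Open the (indexed) FASTA file and poke at a record.
fasta = Fasta("./Ecoli_prom.fa")

print(len(fasta.keys()))   # number of records in the file
record = fasta[0]          # indexed lookup by position
print(record.name)         # the record's identifier
print(record[:10].seq)     # first ten bases as a plain string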

With the DatasetMixin implemented, it's now a matter of implementing the lineflow.Dataset.

from pyfaidx import Fasta

class FASTADataset(lineflow.Dataset):
    """FASTA Dataset."""

    def __init__(self, fasta_path):
        """Init the dataset with the fasta path."""
        fasta = Fasta(str(fasta_path))
        dataset_mixin = FASTADatasetMixinImplementation(fasta)

        super().__init__(dataset_mixin)

And there we have it: a nice dataloader that will load sequences from our FASTA file into a dataset. It also provides nice methods for working with the dataset, like .map, which we'll see later.

Tokenizing the Dataset

This is the part of the post dedicated to shameless self-promotion.

So imagine you have a bunch of DNA sequences like:

ATCG
ATCC
ACCT

We know from other areas of ML (particularly NLP) that we need to tokenize these sequences into integer representations (one-hot or otherwise) based on some vocabulary before passing them into our model.

In NLP, the vocabulary is often generated by counting how often each word appears, keeping the top N words, and converting the rest to an unknown token (e.g. UNK). Moreover, the corpus doesn't contain all the words of interest (e.g. Shakespeare doesn't use every English word).

In genomic sequence analysis, this "open vocabulary problem" isn't as big of a problem. If the sequence of interest is DNA, then even with the IUPAC ambiguous alphabet there's a fixed set of characters, and that stays true for k-mers where k > 1. It's a combinatorics problem rather than a linguistics problem.
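
To make that concrete, here's a minimal sketch (plain Python, not gcgc) of character-level tokenization over the unambiguous DNA alphabet:

# A toy character-level tokenizer with a fixed, four-symbol vocabulary.
VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3}

def encode(sequence: str) -> list:
    """Map each base to its integer id."""
    return [VOCAB[base] for base in sequence]

print(encode("ATCG"))  # [0, 1, 2, 3]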

GCGC aims to make the tokenization step based on different alphabets straightforward to integrate with PyTorch, TensorFlow, and other Deep Learning packages.

The first step is to create the tokenizer.

import gcgc

MAX_LENGTH = 81  # We'll use the E. coli dataset, where the sequences are of length 81

tokenizer_spec = gcgc.SequenceTokenizerSpec(MAX_LENGTH, alphabet="ATCG")
tokenizer = gcgc.SequenceTokenizer(tokenizer_spec)

We then map the tokenizer over the dataset to produce the input tensors.

import torch

promoter_ds = FASTADataset("./Ecoli_prom.fa") # From the paper's dataset
promoter = promoter_ds.map(
    lambda x: {"seq": torch.LongTensor(tokenizer.encode(x)), "label": 0}
)

non_promoter_ds = FASTADataset("./Ecoli_non_prom.fa") # From the paper's dataset
non_promoter = non_promoter_ds.map(
    lambda x: {"seq": torch.LongTensor(tokenizer.encode(x)), "label": 1}
)

dataset = non_promoter + promoter

To summarize: we took the promoter dataset and mapped over it to create a dataset whose records are dictionaries, where the "seq" key holds the tokenized sequence as a PyTorch tensor and the "label" key holds the label. We then did the same with the non-promoter dataset, and combined the two into a single dataset by "adding" them -- you can see why I like lineflow.
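
As a sanity check, individual records can be pulled out of the combined dataset by index. The shapes and values in the comments are what I'd expect here, assuming the tokenizer emits one id per base with no extra special tokens.

example = dataset[0]
print(example["seq"].shape)  # e.g. torch.Size([81]) -- one id per base
print(example["label"])      # 1 -- the non-promoter records come first in the concatenation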

It's worth noting that these simple token encoders are not the state of the art in NLP; it's now more common to use Byte Pair Encoding or a unigram language model to determine the tokens (something I plan on exploring for this domain!).

Building and Fitting the Model

To build the model, I'll use PyTorch, though it'd be straightforward to do the same with TensorFlow or another framework.

When using PyTorch, I've been using PyTorch Lightning. It's similar to lineflow in that once we implement the necessary methods, the library handles the rest. That said, there's a fair amount of boilerplate, so rather than reproduce it all, I'll only sketch its shape below.
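
Very roughly, and with the caveat that the exact hook names and return conventions vary across Lightning versions (and that the dataloader hooks are omitted), the wrapper looks something like this:

import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class PromoterSystem(pl.LightningModule):
    """A rough sketch of the Lightning wrapper around the classifier."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, batch):
        return self.model(batch)

    def training_step(self, batch, batch_idx):
        log_probs = self.model(batch)
        loss = F.nll_loss(log_probs, batch["label"])  # NLL pairs with the model's log_softmax output
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)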

Otherwise, the core model is a fairly standard one: an embedding layer for the input tokens, followed by a combination of 1D CNNs and dense layers that produce log-probabilities for each class (via a final log_softmax).

import torch
from torch import nn
from torch.nn import functional as F

class ClassificationModel(torch.nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()

        self.embedding_layer = nn.Embedding(vocab_size, 150)
        self.conv1d = nn.Conv1d(150, 150, 7, stride=1)
        self.norm = nn.BatchNorm1d(150)
        self.conv1d2 = nn.Conv1d(150, 30, 8, stride=1)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(2040, 2)

    def forward(self, examples):  # pylint: disable=arguments-differ
        seq = examples["seq"]

        embedded = self.embedding_layer(seq).transpose(1, 2)
        conv1d = self.norm(F.relu(self.conv1d(embedded)))
        conv1d2 = F.relu(self.conv1d2(conv1d))
        network_out = self.linear(self.flatten(conv1d2))
        return F.log_softmax(network_out, dim=1)
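
As a quick shape check, we can push a fake batch through the model. A vocabulary size of 4 is an assumption here, but it's consistent with the 600 embedding parameters reported in the training log below (4 * 150).

import torch

model = ClassificationModel(vocab_size=4)
fake_batch = {"seq": torch.randint(0, 4, (8, 81))}  # batch of 8 sequences of length 81
print(model(fake_batch).shape)  # torch.Size([8, 2])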

This is a rough representation of what the network looks like:

[Figure: diagram of the CNN architecture]

Now we can use PyTorch Lightning to train the model.

$ python app/model.py train

                    Name                 Type Params
0                  model  ClassificationModel  198 K
1  model.embedding_layer            Embedding  600
2           model.conv1d               Conv1d  157 K
3             model.norm          BatchNorm1d  300
4          model.conv1d2               Conv1d   36 K
5          model.flatten              Flatten    0
6           model.linear               Linear    4 K
Epoch 12: 100%|████| 121/121 [00:08<00:00, 35.42batch/s, accuracy=1, batch_nb=93, loss=0.000322, v_nb=56, val_acc=0.926, val_loss=0.348, val_sensitivity=0.878

At the right of the last line of the console output are the accuracy and sensitivity. These are computed on a validation set, which is created using one of lineflow's built-in cross-validation functions.

from lineflow.cross_validation import split_dataset_random

# Using the combined dataset from earlier
train, test = split_dataset_random(dataset, 3000)

It's important to note that random cross-validation is almost never the best scheme, and it's worth thinking about the underlying problem. When I think about CV, I normally think about what the exchangeable units are. For example, in the famous 8 schools example we can exchange at the school level, but not at the student level. Here it's a little less clear, and again it depends on the problem, but you could imagine cases where the exchangeable unit isn't the individual sequence, but rather the species, or perhaps something further up the phylogenetic tree; a sketch of what a grouped split could look like is below.
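
For example, if each record carried a group label (say, the species or clade it came from), a grouped splitter such as scikit-learn's GroupShuffleSplit would keep every sequence from one group on the same side of the split. The record_group function here is hypothetical, and since this dataset is E. coli only, the example is purely illustrative.

from sklearn.model_selection import GroupShuffleSplit

# Hypothetical group labels, one per record (e.g. species or clade).
groups = [record_group(i) for i in range(len(dataset))]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(list(range(len(dataset))), groups=groups))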

Conclusion

In this note, we've seen how to use lineflow to create a simple dataloader for FASTA files, how to use gcgc to tokenize the raw sequences into integer-encoded lists, and how to build a simple CNN model that predicts whether input sequences are promoters, based on a labeled dataset.