Predicting DNA Splice Sites with Convolutional Nets

2018.04.07

In this post I'll show an example of predicting DNA splice sites with a CNN. This is an important computational task that must happen before raw genetic sequences can be used to understand gene expression.

The motivation for this post was the paper DNA Sequence Classification by Convolutional Neural Network, which uses a similar model.

Splicing and Splice Sites

In most eukaryotic genes (humans are eukaryotes), there are interspersed regions of DNA called introns and exons. During DNA-to-RNA transcription, the introns are removed and the exons are joined together to form the mature RNA. That mature RNA is what determines the protein.

(Figure: intron/exon splicing diagram, from Wikipedia)

Proteins are "doers" in that they catalyze reactions and regulate other genes' expression (among other things), so understanding which parts of the DNA contribute to their creation is important. Moreover, splice site mutations are implicated in several diseases, so finding splice sites and determining whether they're mutated could help diagnose individuals.

There are caveats to all of this, of course; the real biology is messier than this quick summary suggests.

Convolutional Nets

Convolutional Neural Nets (CNNs) have shown fantastic performance on a variety of prediction tasks, especially for image and language tasks. More recently, they've started being applied to new types of data (e.g. graphs, via graph convolutional networks) and new domains (e.g. genetics, with DeepVariant). Splice site detection is another place CNNs can be applied in genetics.

An Example of Splice Site Prediction

In this example, I'll use the Molecular Biology (Splice-junction Gene Sequences) Data Set. It's a nice dataset in that it exposes some of the idiosyncrasies of applying CNNs to genetics. For example, rather than just A, T, C, and G for the nucleotides, there are also the characters D, N, S, and R. These are used when there's ambiguity about which particular nucleotide is at a certain position, e.g. D is used if the actual base could be A, G, or T.
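
For reference, here's what those ambiguity codes mean. This is the standard IUPAC nucleotide notation written out as a Python mapping, not something shipped with the dataset:

# IUPAC ambiguity codes that appear in this dataset, mapped to the
# nucleotides each one could stand for.
AMBIGUITY_CODES = {
    "D": {"A", "G", "T"},       # anything but C
    "N": {"A", "C", "G", "T"},  # any base
    "S": {"C", "G"},            # "strong" bases (C and G)
    "R": {"A", "G"},            # purines
}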

Data Preparation

I decided to take the easy road and simply one-hot encode the DNA sequences.

For example, the sequence ATGGGGTAA is encoded as:

1 0 0 0
0 1 0 0
0 0 0 1
0 0 0 1
0 0 0 1
0 0 0 1
0 1 0 0
1 0 0 0
1 0 0 0

assuming the columns are in the order A, T, C, G. In practice the number of columns corresponds to the full vocabulary size, which also covers ambiguity codes like the D character referenced earlier.
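
A minimal encoder along those lines might look like the following. This is my own sketch rather than the paper's code, and the vocabulary ordering is an assumption:

import numpy as np

VOCAB = "ATCGDNSR"  # four bases plus the ambiguity codes in this dataset
CHAR_TO_INDEX = {char: i for i, char in enumerate(VOCAB)}

def one_hot_encode(sequence):
    # One row per position, one column per vocabulary character.
    encoded = np.zeros((len(sequence), len(VOCAB)), dtype=np.float32)
    for position, char in enumerate(sequence):
        encoded[position, CHAR_TO_INDEX[char]] = 1.0
    return encoded

# one_hot_encode("ATGGGGTAA") has shape (9, 8); its first four columns
# reproduce the matrix above.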

Model Building

To build the model I used TensorFlow's Estimator interface. The benefit is that TensorFlow handles a lot of the "extra" work that goes into model building, like logging.

So with that in mind, I'll just show the model function:

import tensorflow as tf


def cnn(features, labels, mode):

    # The one-hot encoded sequence, cast to float so it can be convolved.
    input_layer = tf.cast(features["seq"], tf.float32)

    # A single 1-d convolution over the sequence; 'same' padding keeps the
    # output the same length as the input.
    conv = tf.layers.conv1d(
        inputs=input_layer,
        filters=32,
        kernel_size=2,
        padding='same',
    )

    # Flatten the convolution output, run it through one hidden layer, and
    # project down to one logit per class.
    layer = tf.layers.flatten(conv)
    layer = tf.layers.dense(inputs=layer, units=512, activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=layer, units=3)

    predictions = {
        "classes": tf.argmax(input=logits, axis=1),
        "probabilities": tf.nn.softmax(logits, name="probabilities")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # tf.metrics.accuracy returns an (accuracy, update_op) pair; log the
    # updated value so the metric shows up in TensorBoard.
    accuracy = tf.metrics.accuracy(
        labels=labels, predictions=predictions["classes"])

    tf.summary.scalar("training_accuracy", accuracy[1])

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
            loss=loss, global_step=tf.train.get_global_step())

        return tf.estimator.EstimatorSpec(
            mode=mode, loss=loss, train_op=train_op)

    eval_metric_ops = {
        "accuracy": accuracy,
    }

    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

This is a lot of code, but the bulk of the work happens in just a few lines:

  input_layer = tf.cast(features["seq"], tf.float32)

  conv = tf.layers.conv1d(
      inputs=input_layer,
      filters=32,
      kernel_size=2,
      padding='same',
  )

  layer = tf.layers.flatten(conv)
  layer = tf.layers.dense(inputs=layer, units=512, activation=tf.nn.relu)
  logits = tf.layers.dense(inputs=layer, units=3)

  predictions = {
      "classes": tf.argmax(input=logits, axis=1),
      "probabilities": tf.nn.softmax(logits, name="probabilities")
  }

This code does a few things (the shape trace after the list makes them concrete):

  1. Create input_layer, the one-hot encoded sequence. This plays the role of the input image in a traditional CNN.
  2. Run input_layer through a 1-d convolutional layer (images with channels would use conv2d instead).
  3. Flatten the output of the convolutional layer, and run it through a dense layer with a relu activation function.
  4. Run that output through a layer with 3 nodes, corresponding to the three output categories: intron-exon, exon-intron, and neither. These are what we care about predicting.
  5. Squash the logits with a softmax function to get the probability associated with each category.
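
Here's that trace of how the tensor shapes evolve through the network, assuming the 60-base sequences in this dataset and the 8-letter vocabulary from earlier:

# features["seq"]  ->  (batch, 60, 8)    one-hot encoded input
# conv1d           ->  (batch, 60, 32)   'same' padding preserves length
# flatten          ->  (batch, 1920)     60 * 32
# dense + relu     ->  (batch, 512)
# logits           ->  (batch, 3)        intron-exon, exon-intron, neither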

Those logits are also fed to the sparse_softmax_cross_entropy loss function to evaluate the model, and that loss is used to back-propagate weight updates through the network.
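
With the model function in hand, hooking it up to an Estimator and training looks roughly like this. It's a sketch: train_seqs and train_labels are assumed to hold the one-hot encoded sequences and integer class ids, and the hyperparameters are illustrative:

import tensorflow as tf

classifier = tf.estimator.Estimator(model_fn=cnn, model_dir="/tmp/splice_cnn")

# Feed the encoded sequences in under the "seq" key the model function expects.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"seq": train_seqs},  # shape (num_examples, seq_len, vocab_size)
    y=train_labels,         # integer class ids: 0, 1, or 2
    batch_size=32,
    num_epochs=None,
    shuffle=True)

classifier.train(input_fn=train_input_fn, steps=2000)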

Results

So that's dandy and all, but how does the model actually do?

To start, the paper I referenced at the beginning of this post reports 96.18% accuracy.

(Figure: accuracy results reported in the paper)

In contrast, the results I get are slightly lower, at 95.1% -- the average over the holdout sets from 5-fold cross validation.

(Figure: cross validation results)
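
For reference, the 5-fold average can be computed with a loop along these lines. This sketch assumes scikit-learn's KFold; seqs and labels stand in for the full encoded arrays, and the exact setup behind the 95.1% number may differ:

import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

fold_accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(seqs):
    classifier = tf.estimator.Estimator(model_fn=cnn)  # fresh model per fold
    train_fn = tf.estimator.inputs.numpy_input_fn(
        x={"seq": seqs[train_idx]}, y=labels[train_idx],
        batch_size=32, num_epochs=None, shuffle=True)
    eval_fn = tf.estimator.inputs.numpy_input_fn(
        x={"seq": seqs[test_idx]}, y=labels[test_idx],
        num_epochs=1, shuffle=False)
    classifier.train(input_fn=train_fn, steps=2000)
    fold_accuracies.append(classifier.evaluate(input_fn=eval_fn)["accuracy"])

print("mean holdout accuracy:", np.mean(fold_accuracies))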

So this simple approach didn't do quite as well, but it came close, and it performed slightly better than the previous best results cited in the paper. That said, this paper was posted in 2015, and the state of the art has certainly improved since then.
