Leave One Clade Out Cross Validation

07 March, 2020 - 3 min read

Leave One Clade Out Cross Validation is the idea of using cross validation splits that are informed by the phylogeny of the underlying dataset. I should admit now that I didn't coin the term, but I also can't seem to find references to it being used (let me know if you can).

Cross validation can be implemented in lots of generic ways, and depending on the situation some may be more appropriate than others. The options range from the simplest, random shuffle split (exchangeability at the instance level), to more complex methods like group k-fold (exchangeability at the group level). These more complex methods, while better than shuffle split, often still aren't a great match for the underlying data process as dictated by the domain.
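To make the instance-level versus group-level distinction concrete, here's a minimal pure-Python sketch. The toy `groups` list and the helper names `shuffle_split` and `leave_one_group_out` are made up for illustration and don't come from any library. It splits the same six samples both ways and reports which groups end up on both sides of the split.

```python
import random

# Hypothetical toy data: six samples, each tagged with the group it came from.
groups = ["a", "a", "b", "b", "c", "c"]


def shuffle_split(n_samples, test_size, seed=0):
    """Instance-level split: shuffle the indices, then cut off a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return sorted(indices[test_size:]), sorted(indices[:test_size])


def leave_one_group_out(groups):
    """Group-level split: each group takes one turn as the whole test set."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Instance-level: nothing stops a group from landing in both train and test.
train, test = shuffle_split(len(groups), test_size=2)
shared = {groups[i] for i in train} & {groups[i] for i in test}
print(f"shuffle split: groups on both sides = {sorted(shared)}")

# Group-level: the held-out group never appears in train.
for held_out, train, test in leave_one_group_out(groups):
    shared = {groups[i] for i in train} & {groups[i] for i in test}
    print(f"hold out {held_out}: train={train} test={test} shared={sorted(shared)}")
```

With the shuffle split, a group can easily end up on both sides, so the test score partly measures how well the model memorized that group. The group-level split rules that out by construction, but it still knows nothing about how the groups relate to each other.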

Why Use Clade

One area I'm particularly interested in right now is biology, where data is often hierarchical. For example, say you were trying to predict if a protein might confer some function, or if an unknown sequence belonged to some taxonomic rank. To evaluate performance it'd be better to construct a CV method that splits examples so that the test data comes from the overall data distribution, but hopefully not exactly the training distribution.

As one attempt, the idea of Leave One Clade Out could help. In simple terms, a clade is a sub-branch of a phylogenetic tree (an ancestor plus all of its descendants) that is distinct from the other sub-branches on the same branch.

Wikipedia has a helpful visualization (Wiki Cladogram): https://en.wikipedia.org/wiki/File:Clade-grade_II.svg

Ergo, it's potentially the case that, say, sequences or organisms from one clade may be more dissimilar to those from other clades than to each other.

An Example

Say we were trying to train a model to use 16S rRNA to predict the Phylum of a sequenced organism. The Phylum would be the level to predict, and we'd hold out different clades (in this case, Classes) under each Phylum.


(Graphic design is my passion).

An Implementation

So how might we implement this so it can be used in practice? Well, many languages have some graph library that can be used to represent the hierarchical structure, and since I use Python a bunch, networkx is the de facto choice.

import random

import networkx as nx

# Add the example data.

G = nx.DiGraph()

G.add_node("Acidobacteriota", rank="Phylum")

G.add_node("Acidobacteriae", rank="Class")
G.add_edge("Acidobacteriota", "Acidobacteriae")

G.add_node("Blastocatellia", rank="Class")
G.add_edge("Acidobacteriota", "Blastocatellia")

G.add_node("Verruco", rank="Phylum")

G.add_node("Chlamydiae", rank="Class")
G.add_edge("Verruco", "Chlamydiae")

G.add_node("Lentisphaeria", rank="Class")
G.add_edge("Verruco", "Lentisphaeria")

# Iterate through the graph, only processing Phylum nodes.
for node, data in G.nodes(data=True):
    if data["rank"] != "Phylum":
        # Only process phyla, so skip every other rank.
        continue

    # Select the descendants of the node (here, its Classes).
    descendants = list(nx.descendants(G, node))

    # Grab one clade to leave out for test, then use the rest for train.
    test = random.choice(descendants)
    train = set(descendants) - {test}

    print(f"Using test clade {test} in phylum {node} and train set {train}.")

This then prints something like the following, showing which groups were used in which CV sets (the held-out clade is picked at random, so the exact output varies between runs).

Using test clade Blastocatellia in phylum Acidobacteriota and train set {'Acidobacteriae'}.
Using test clade Chlamydiae in phylum Verruco and train set {'Lentisphaeria'}.
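The loop above picks a single random clade per phylum. For actual cross validation you'd probably want every clade to take a turn as the test set, and you'd want sample indices rather than clade names. Here's a rough sketch of that, using plain dictionaries instead of networkx to stay dependency-free. The `taxonomy` and `samples` data and the `leave_one_clade_out` helper are hypothetical. Keeping the other phyla's samples in train and skipping phyla with a single class are my assumptions, not settled choices.

```python
# Hypothetical taxonomy: phylum -> classes (the clades we hold out).
taxonomy = {
    "Acidobacteriota": ["Acidobacteriae", "Blastocatellia"],
    "Verruco": ["Chlamydiae", "Lentisphaeria"],
}

# Hypothetical samples: (sequence id, class). Real data would carry features too.
samples = [
    ("seq0", "Acidobacteriae"),
    ("seq1", "Acidobacteriae"),
    ("seq2", "Blastocatellia"),
    ("seq3", "Chlamydiae"),
    ("seq4", "Lentisphaeria"),
    ("seq5", "Lentisphaeria"),
]


def leave_one_clade_out(taxonomy, samples):
    """Yield (phylum, held_out_clade, train_idx, test_idx) for every clade.

    Assumption: samples from all other clades, including other phyla,
    stay in train so the model still sees every phylum label.
    """
    for phylum, clades in taxonomy.items():
        if len(clades) < 2:
            # Holding out the only clade would remove the phylum from train.
            continue
        for held_out in clades:
            test_idx = [i for i, (_, c) in enumerate(samples) if c == held_out]
            if not test_idx:
                # No samples for this clade, so there is nothing to test on.
                continue
            train_idx = [i for i, (_, c) in enumerate(samples) if c != held_out]
            yield phylum, held_out, train_idx, test_idx


for phylum, held_out, train_idx, test_idx in leave_one_clade_out(taxonomy, samples):
    print(f"{phylum}: hold out {held_out}, train={train_idx}, test={test_idx}")
```

Each fold then tests on a class the model never saw during training, while the phylum it belongs to is still represented in train by its sibling classes.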

Anyways, that's the short and sweet of it. There are lots of edge cases to think about (only one clade?), the challenging nature of biological data, and even fundamental questions like: is a tree too rigid given the nature of evolution? Well, there's 20 lines of Python, good luck.