n50 with SQL

bioinformatics
SQL
How to use SQL to compute n50 of an assembly.
Author

Trent Hauck

Published

July 6, 2022

There was some recent discussion on twitter about how SQL is both very useful and hard to learn in school (including the highest levels), perhaps because problems are so short term, the challenges of long term data management are never made visceral. I shared in this to some extent, though I studied Accounting which is probably more amenable to data management examples than other fields.

So that tweet, someone mistakenly describing n50 as the median contig length to me today, and my recent foray in DuckDB + brrrr1; how about an example of an assembly metric with SQL, the vaunted n50.

Setup

NCBI has lots of assemblies, and lots will do, but choose something interesting (or at least more than a couple of contigs). I’ll use a Penicillium expansum assembly. It’s a fungus that causes mold in Apple trees.

Penicillium expansum plate

There are a few metrics they’ve already calculated, including an n50 to compare with: 301,353bp.

To get started, I’ll download the assembly…

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/769/745/GCF_000769745.1_ASM76974v1/GCF_000769745.1_ASM76974v1_genomic.fna.gz
gunzip GCF_000769745.1_ASM76974v1_genomic.fna.gz

and use brrrr to convert it to parquet.

brrrr fa2pq ./GCF_000769745.1_ASM76974v1_genomic.fna GCF_000769745.1_ASM76974v1_genomic.parquet

Then drop into DuckDB and let the fun (SQL) begin.

$ duckdb
v0.4.0 da9ee490d
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D

As a quick check, see if the total counts match the link to P. expansum above.

SELECT COUNT(*) FROM 'GCF_000769745.1_ASM76974v1_genomic.parquet';
┌──────────────┐
│ count_star() │
├──────────────┤
382
└──────────────┘

Calculating n50 in SQL

Wikipedia has a good example of the concept of n50, so on to SQL.

Breaking down the calculation, we need to get the length of each contig, sort by that length in descending order, then descend until we find the place where the running sum is half the total length (another value needed)… the length of the first contig after the halfway mark is our n50.

So first things first, let’s simplify by creating a table that’s just the contig id and the length of the sequence.

COPY (
  SELECT id, length(sequence) AS seq_length
  FROM 'GCF_000769745.1_ASM76974v1_genomic.parquet'
) TO 'GCF_000769745.summary.parquet' (FORMAT PARQUET);

SELECT * FROM 'GCF_000769745.summary.parquet' LIMIT 10;
┌────────────────┬────────────┐
id       │ seq_length │
├────────────────┼────────────┤
│ NW_015971303.11744
│ NW_015971337.115685
│ NW_015971232.11322
│ NW_015971233.18276
│ NW_015971234.1263183
│ NW_015971235.11033
│ NW_015971228.1104645
│ NW_015971229.11594
│ NW_015971230.1376691
│ NW_015971231.16100
└────────────────┴────────────┘

Great, so now let’s compute the running sum of lengths, and divide that by the total length. Select that along with id and the specific contig length.

SELECT SUM(seq_length*1.0) OVER (ORDER BY seq_length DESC) / SUM(seq_length) OVER () pct_of_total, id, seq_length
FROM 'GCF_000769745.summary.parquet'
ORDER BY seq_length DESC
LIMIT 5;

Everything here is pretty simple except the pct_of_total column that uses two window functions. The numerator is the running total over sequence lengths in descending order and the denominator is the sum seq_length overall all rows. The OVER keyword and parenthetical expression dictate the window. Window functions are very powerful and it’s worth reading more: https://duckdb.org/docs/sql/window_functions.

Anyways, the result looks like:

┌──────────────────────┬────────────────┬────────────┐
│     pct_of_total     │       id       │ seq_length │
├──────────────────────┼────────────────┼────────────┤
│ 0.031041275498169617 │ NW_015971275.1 │ 1004373    │
│ 0.06143951820073947  │ NW_015971324.1 │ 983567     │
│ 0.08912797384896944  │ NW_015971406.1 │ 895889     │
│ 0.114818904954029    │ NW_015971526.1 │ 831257     │
│ 0.13539530538463782  │ NW_015971309.1 │ 665771     │
└──────────────────────┴────────────────┴────────────┘

This is more or less “it”, but we need to go the final step to select the actual seq_length that breaches 50%, or in SQL land, the first row where pct_of_total is greater than .5. This uses a nested query treating the above calculation as the table to filter from.

SELECT id, seq_length FROM (
  SELECT SUM(seq_length*1.0) OVER (ORDER BY seq_length DESC) / SUM(seq_length) OVER () pct_of_total, id, seq_length
  FROM 'GCF_000769745.summary.parquet'
  ORDER BY seq_length DESC
)
WHERE pct_of_total > .5
LIMIT 1;

And there we have it, the n50 that matches the NCBI Assembly page.

┌────────────────┬────────────┐
│       id       │ seq_length │
├────────────────┼────────────┤
│ NW_015971413.1 │ 301353     │
└────────────────┴────────────┘

Footnotes

  1. Full disclosure, I write/maintain brrrr.↩︎