
Building a marine refpack for taxator-tk


taxator-tk is a program for taxonomic classification of nucleotide sequences which I wrote and which was published in Bioinformatics in 2015. There are a couple of programs around by now which do similar things, so why should you use it? Well, this article is not about the software itself, so to keep it short: it is basically a trade-off between the laziness of curating a sequence database and computation time, but taxator-tk is also specialized in conservative assignment of longer sequences (contigs or chromosomes). The program only uses nucleotide sequences with taxonomic labels as reference. Comparable programs often employ databases of marker genes which need to be curated and kept up to date for all domains of life you want the classification to work on. Although some of these marker gene sets have been extended to other domains by now, most of the genes are still for prokaryotes. So taxator-tk works well for complex communities composed of prokaryotes, eukaryotes and possibly viruses.

The flexibility comes at a cost: running a sensitive search against the nucleotide database (e.g. with NCBI BLAST) and then computing a large number of pairwise comparisons is computationally expensive. For instance, it is considerably slower than k-mer based methods like Centrifuge. However, taxator-tk generalizes better: those programs work in a nearest-neighbor-like fashion, which means that the false positive rate increases quickly when your sample taxa are underrepresented in the database. Benchmarks show that taxator-tk has an exceptionally high average precision, if you can live with sparse assignments. Having said all this, you can use this example similarly to build databases for Centrifuge and the like. In fact, it would be really nice if, at some point, everybody agreed on a common data format for refpacks, and on recipes to build them automatically.

Marine nucleotide sequences

First of all, I needed a comprehensive set of nucleotide sequences with taxonomic labels for marine microbes. taxator-tk uses the NCBI taxonomy which means that we want taxonomic affiliations given in terms of taxon IDs rather than common names. After a little research, I found these resources.

  1. marRef and marDB provided by the ELIXIR Marine Metagenomics Portal (MMP) (prokaryotes only)
  2. MATOU and OM-RGC provided by the Ocean Gene Atlas (mixed prokaryotes, eukaryotes and viruses)

The World Register of Marine Species also looked good but does not seem to link to the NCBI taxonomy or to any nucleotide data. So I stuck with the others.

Download NCBI taxonomy

The taxonomy will be used to represent all downloaded sequence data. We need to make sure that the taxonomy version is not older than any of the selected reference sequence collections because this might lead to problems with missing taxa. I downloaded the NCBI taxonomy dump files on 2018-07-23.

wget ""

It would be really nice if the NCBI taxonomy were a versioned and trackable resource. Right now, the best you can do is to record the date you downloaded the taxonomy.

Download MarRef/MarDB

This download was almost straightforward. Unfortunately, the MMP didn't provide hash sums to verify the downloads. Since I got corrupted files with lftp, I used wget instead. It's quite unlikely that the sequences still contain errors, but without hash sums this cannot be verified. Note to everybody: if you put up large files for download, please provide a way to check their integrity. I also noted that if the FASTA files for download were compressed, it would reduce download size and time considerably.
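To illustrate the point about integrity checks: if the MMP published `.md5` files next to the downloads, verification would be a one-liner. Everything below is hypothetical (the server does not provide these files); the toy download is an empty file, whose well-known MD5 sum is used.

```shell
# Hypothetical integrity check -- no such .md5 files exist on the MMP server;
# this only shows what verification would look like if they did.
: > MarRef.fa                       # toy "download" (empty file)
printf 'd41d8cd98f00b204e9800998ecf8427e  MarRef.fa\n' > MarRef.fa.md5
md5sum -c MarRef.fa.md5             # prints 'MarRef.fa: OK'
```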


First I downloaded a metadata table and the nucleotide sequences.

wget ''
wget -r -np '' -A '*_genomic.fa'

The latter command creates a copy of the folder structure on the download server with many useless directories. I simply concatenated and compressed the FASTA files later.


I did the same for the marDB dataset, which is larger.

wget ''
wget -q -r -np '' -A '*_genomic.fa'

Process FASTA

I concatenated the files (in lexicographic order), validated them, renamed the sequences to simple identifiers (the accession before the first space) using seqkit, and saved them in compressed form.

find -iname "*_genomic.fa" |
  sort -V |
  xargs cat |
  seqkit seq -v |
  seqkit replace -w 0 -p '([^ ]+\.[0-9]+)[_ ].*' -r '${1}' -o marref.fna.gz
find -iname "*_genomic.fa" |
  sort -V |
  xargs cat |
  seqkit seq -v |
  seqkit replace -w 0 -p '([^ ]+\.[0-9]+)[_ ].*' -r '${1}' -o mardb.fna.gz
# rm -r  # delete primary download

Download MATOU and OM-RGC


The Ocean Gene Atlas publication has details on this dataset. It's hosted on Genoscope. I downloaded a list of assigned taxa.

wget ''

This file only contains taxonomic annotation of the sequences.


The 2015 Science publication explains how this dataset was constructed. I downloaded the data from the EBI FTP.

wget ''

This file contains both gene sequences and taxonomic annotations in one file.

Quality of taxonomic annotation

MATOU contains longer nucleotide sequences such as scaffolds while OM-RGC only has individual gene sequences. In theory, we don't care: even if there is splicing, the local alignment algorithm will take care of it. That's one of the advantages of using taxator-tk, and it means we can also use the program to classify RNA sequences without problems. But after looking at the data, I was surprised that the taxonomic annotation of these sequences was so poor. For instance, OM-RGC lists over 20,000 sequences with the genus level annotation 'Homo' (rank 83 out of 2367 genera). This could be caused by contamination but is more likely caused by a poor taxonomic classification routine (which is exactly what we want to avoid with taxator-tk). Similarly, but to a lesser extent, MATOU has 351 'Homo' sequences (rank 257 out of 1838 genera). We don't want to use sequences with such bad annotation to classify novel sequences because that would just transfer the weaknesses of these datasets' annotation. Instead, I set a frequency threshold to remove rare annotations and filtered the list of genera by hand to derive a reliable list of marine genera. I then used that list to filter RefSeq by genus level taxon IDs.
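As a sketch, the frequency filter might look like this on toy data. The file names, the one-genus-per-sequence input format and the threshold (at least 2 here) are all made up for illustration; the real list was additionally curated by hand.

```shell
# Toy input: one genus name per annotated sequence (hypothetical file)
printf 'Prochlorococcus\nHomo\nProchlorococcus\nSynechococcus\n' > genus_per_seq.txt

sort genus_per_seq.txt |
  uniq -c |                       # count sequences per genus
  awk '$1 >= 2 {print $2}' |      # drop rarely annotated genera (toy threshold)
  sort > genus_candidates.txt

cat genus_candidates.txt          # only Prochlorococcus survives
```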

Download NCBI RefSeq

The RefSeq release catalog contains metadata like accessions and taxon IDs. Unfortunately, the given link only works for older versions of RefSeq because the current version uses a different link syntax. I've adjusted it here since I built the refpack using RefSeq v90, but the current version by now is v91.

wget ""

In theory, we could just determine the accessions we need from the release catalog and download only that subset, but NCBI does not provide a stable interface (at least none that I know of) which works for arbitrary RefSeq accessions and is not terribly slow. Therefore, the fastest way is actually to download all RefSeq sequences and post-filter.

Filter RefSeq

I derived a list of accessions from the RefSeq release catalog by matching genus level taxon IDs from MATOU and OM-RGC. I used this list to filter RefSeq sequences while downloading them from the NCBI FTP.

lftp -e 'glob cat *.genomic.fna.gz && exit' |
  seqkit seq -v |
  seqkit replace -p '([^ ]+\.[0-9]+).*' -r '${1}' |
  seqkit grep -w 0 -f accession.list -o refseq.filtered.fna.gz
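For completeness, deriving accession.list from the release catalog could look roughly like this on toy data. The column layout (taxon ID in column 1, accession in column 3) and the file names are my assumptions and must be checked against the actual catalog version before use.

```shell
# Toy catalog with taxon ID in col 1 and accession in col 3 (layout assumed)
printf '1218\tProchlorococcus marinus\tNC_000001.1\n9606\tHomo sapiens\tNC_000002.1\n' > catalog.tsv
printf '1218\n' > marine_taxids.txt   # taxon IDs expanded from the marine genus list

# keep accessions whose taxon ID is in the marine list
awk -F '\t' 'NR==FNR {keep[$1]; next} $1 in keep {print $3}' \
  marine_taxids.txt catalog.tsv > accession.list

cat accession.list                    # NC_000001.1
```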

Fix taxonomic annotation

For each of the FASTA entries we need a valid taxon ID in our taxonomy. Again, there isn't a single service or authority to do this, so I had to improvise.

Missing taxon ID

Some of the given taxon IDs were missing or not resolvable in the current taxonomy. MarRef/MarDB annotations are often only at higher ranks. So I needed to retrieve the leaf nodes in the taxonomy for those sequences.

The NCBI FTP holds large mapping files for accession to taxon ID mappings. I used the following files.

All the accessions which were still not assigned to a taxon ID could be resolved using the NCBI Entrez online service. The only way that worked for all accessions I encountered, including ones that have been deleted in GenBank, was to query the website using GET requests. The NCBI people have released a toolkit and command line client written in Perl called edirect, but these scripts were unstable, occasionally writing error messages to standard output and silently dropping accessions. So I wrote my own Python script which parses the XML returned by Entrez and generously handles delays and errors on the NCBI side, so it is quite reliable for thousands of accessions. When you use it for millions, eventually the remote side will block you. Be warned!

cat accessions.txt |
  accession2taxon.py3 > results.tsv

The file results.tsv will have three columns: the accession without version tag, the latest accession with version, and the taxon ID.
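As a sketch, the resolved entries can then be appended to the sequence-to-taxon mapping. The mapping file name and the toy content are assumptions; results.tsv follows the three-column format just described.

```shell
# Toy results.tsv: accession (no version), latest accession.version, taxon ID
printf 'NC_000001\tNC_000001.2\t1218\n' > results.tsv

# append the resolved accession/taxon pairs to the mapping (file name assumed)
cut -f 2,3 results.tsv >> seqname_taxid.tsv
tail -n 1 seqname_taxid.tsv
```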

Old taxon IDs

The file merged.dmp in the taxonomy folder maps old taxon IDs, which have been merged into other taxa, to their current IDs.
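Applying merged.dmp to the mapping file can be sketched with awk on toy data. The dump format (old ID, pipe, new ID, pipe, all TAB-separated) is the standard NCBI dump layout; the mapping file name and IDs here are made up.

```shell
# Toy merged.dmp in NCBI dump format: old_id <TAB>|<TAB> new_id <TAB>|
printf '80\t|\t155892\t|\n' > merged.dmp
printf 'ACC1.1\t80\nACC2.1\t1218\n' > seqname_taxid.tsv

# rewrite merged (old) taxon IDs to their current replacements
awk -F '\t' 'NR==FNR {new[$1]=$3; next}
             $2 in new {$2=new[$2]} {print $1 "\t" $2}' \
  merged.dmp seqname_taxid.tsv > seqname_taxid.updated.tsv

head -n 1 seqname_taxid.updated.tsv   # ACC1.1 now points to 155892
```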

Conflicting annotation

After joining the taxonomic mapping files from different sources, I found over 1k accessions linked to multiple taxa, probably representing later changes. To figure out the most up-to-date assignment, I simply used the online accession script again.

Sequence deduplication

I cleaned up the data using the following strategy. I'm only showing the basic commands; feel free to change file names, compress and parallelize as you wish.

Canonical FASTA

I converted all sequences to upper case and substituted unusual bases by N.

seqkit seq --upper-case |
  sed -e '/^[^>]/s/[^ACGT]/N/g'

Remove identical

I filtered duplicates by accession name and by 100% sequence identity. seqkit writes statistics to the given stats files.

seqkit rmdup -n -D 01.stats |
  seqkit rmdup -w 0 -s -m -D 02.stats

Family-level compression

I filtered highly similar sequences at 98% sequence identity within the same annotated family.

mkdir split-family && cd split-family
pigz -d < ../allfasta.fna.gz |
  paste ../ - - |
  awk '{printf $2 "\n" $3 "\n" >> ($1 ".fna")}'
for i in *.fna; do
  dedupe.sh in="$i" out="${i%fna}dedup98.fna" minidentity=98 pigz=t jni=t storequality=f 2> "${i%fna}dedup.log"
done

Memory can be an issue with large families; in that case, try setting threads=1 and a generous memory limit. BBMap has good online documentation.

Refpack format

I created a refpack folder called allmarine98_20181206 and created all files within.


I created a subfolder ncbi-taxonomy and copied the files names.dmp and nodes.dmp from the NCBI taxonomy into this folder. taxator-tk reports the taxonomy version as part of the bioboxes binning output format to make results more reproducible. Therefore, I made a file called version.txt in this subfolder. I decided that it would be better to identify a specific taxonomy version by a hash calculated from the tree structure in nodes.dmp than by a download date or similar. The same approach has been used in all distributed refpacks.

(awk -F '\t' '{if($1 != $3) print $1 "\t" $3}' |
LC_COLLATE=C sort -u |
md5sum |
cut -d ' ' -f 1) < nodes.dmp > ncbi-taxonomy/version.txt


I concatenated the deduplicated FASTA files and saved them as refdata.fna.

ls split-family/*.dedup98.fna |
  sort -V |
  xargs cat > refdata.fna


The mapping is a TAB-separated file with the sequence identifier (col 1) and the taxon ID (col 2). For taxator-tk, I generally specify taxon IDs at species level or higher because sub-species classification is rarely possible, and it saves a little computation time by pruning the tree before doing the classification. The program taxknife included in taxator-tk can do this.

taxknife -f 2 traverse -r species genus family order class phylum superkingdom < seqname_taxid.tsv >

I kept the mapping file in exactly the same order as the FASTA file.
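A quick way to confirm that the mapping order matches the FASTA order is to compare the extracted identifiers. The check itself is my addition (the toy file contents are made up; file names follow the text).

```shell
# Toy refpack files with matching order
printf '>A.1\nACGT\n>B.1\nGGCC\n' > refdata.fna
printf 'A.1\t1218\nB.1\t155892\n' > seqname_taxid.tsv

# compare FASTA identifier order against the mapping order
grep '^>' refdata.fna | tr -d '>' > fasta_order.txt
cut -f 1 seqname_taxid.tsv > map_order.txt
cmp -s fasta_order.txt map_order.txt && echo 'order OK'
```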


I added a README file in which I described the source, content and curation procedure of the refpack. Every refpack should also contain the name and email of the curator and curation date.

Content and distribution

The final refpack looks quite simple and is easy to understand.
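Based on the files described above, the layout is roughly the following (the name of the mapping file is my assumption from the commands shown; the taxknife output file name was not given):

```
allmarine98_20181206/
├── README
├── refdata.fna
├── seqname_taxid.tsv
└── ncbi-taxonomy/
    ├── names.dmp
    ├── nodes.dmp
    └── version.txt
```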

I packaged the refpack using tar and compressed it with plzip.

tar -cv --owner=0 --group=0 -f - allmarine98_20181206 |
  plzip > allmarine98_20181206.tar.lz

Programs used in this blog post

I didn't describe every detail, but these are the programs I used in the examples given here.