Gene Homology Pipeline Description
We downloaded proteomic data for 12 species spanning the diversity of chordates
and including outgroups, and complemented them with data produced here
for 6 new tunicate species. These 18 datasets were dereplicated
(i.e. only the longest transcript of each gene was kept) and then used as input
for the clustering software package SiLiX (Miele et al. 2012).
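
The dereplication step can be illustrated with a minimal Python sketch. It assumes each record already carries a gene identifier alongside its transcript identifier. How transcripts map to genes in each of the 18 datasets is not stated here, so the record layout and the function name `keep_longest_per_gene` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (gene_id, transcript_id, protein_sequence).
Record = Tuple[str, str, str]


def keep_longest_per_gene(records: Iterable[Record]) -> Dict[str, Tuple[str, str]]:
    """Return, for each gene, the (transcript_id, sequence) of its longest transcript.

    Ties are resolved in favour of the first transcript encountered.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, transcript_id, seq in records:
        current = best.get(gene_id)
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return best


# Illustrative toy input, not data from the study.
records = [
    ("geneA", "geneA.t1", "MKTLLV"),
    ("geneA", "geneA.t2", "MKTLLVAGQR"),
    ("geneB", "geneB.t1", "MSSP"),
]
dereplicated = keep_longest_per_gene(records)
```

In this toy input, `geneA` retains only its longer transcript `geneA.t2`, and `geneB` keeps its single transcript.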
Two rounds of clustering were run, the second one being designed
to break apart and recluster the roughly 30% of all sequences that had been assigned
to a single "mega-cluster". We then kept only the clusters that contained
either at least two tunicate sequences, or at least one tunicate and one vertebrate
sequence. This resulted in 12,885 clusters of homologous sequences, containing both
orthologs and paralogs.
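
The cluster-retention rule above can be written as a short predicate. The following Python sketch assumes that each sequence in a cluster is labelled with its species, and that a lookup table assigns every species to a taxonomic group. The species labels and the names `GROUP_OF` and `keep_cluster` are placeholders for illustration, not the species or code used in the study.

```python
from typing import Dict, List

# Hypothetical species-to-group table; labels are placeholders.
GROUP_OF: Dict[str, str] = {
    "tunicate_sp1": "tunicate",
    "tunicate_sp2": "tunicate",
    "vertebrate_sp1": "vertebrate",
    "outgroup_sp1": "other",
}


def keep_cluster(member_species: List[str], group_of: Dict[str, str] = GROUP_OF) -> bool:
    """Apply the retention rule to one cluster.

    member_species holds one species label per sequence in the cluster.
    A cluster is kept if it has at least two tunicate sequences, or at
    least one tunicate and at least one vertebrate sequence.
    """
    n_tunicate = sum(1 for sp in member_species if group_of.get(sp) == "tunicate")
    n_vertebrate = sum(1 for sp in member_species if group_of.get(sp) == "vertebrate")
    return n_tunicate >= 2 or (n_tunicate >= 1 and n_vertebrate >= 1)


# Illustrative toy clusters.
kept_two_tunicates = keep_cluster(["tunicate_sp1", "tunicate_sp2"])
kept_mixed = keep_cluster(["tunicate_sp1", "vertebrate_sp1"])
kept_tunicate_outgroup = keep_cluster(["tunicate_sp1", "outgroup_sp1"])
kept_vertebrates_only = keep_cluster(["vertebrate_sp1", "vertebrate_sp1"])
```

Under this rule, every retained cluster contains at least one tunicate sequence. Clusters made up only of vertebrate or outgroup sequences are dropped.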
These clusters were then aligned using MAFFT (Katoh & Standley 2013), and fragmented
sequences (i.e. sequences that were short both in absolute length and relative to the rest of
the alignment) were discarded. For each of these alignments, we computed a
phylogenetic tree, together with 100 bootstrap replicates to estimate node support,
using the LG+G4+F evolutionary model in RAxML (Stamatakis 2014). These trees of
homologous sequences were then analyzed with a custom program (written in C++) in
order to detect the vertebrate orthologs of each tunicate sequence.
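
The fragment filter applied to the alignments before tree building can be sketched as follows. The actual cutoffs are not given in this description, so the two thresholds below (`MIN_RESIDUES` and `MIN_FRACTION`), and the choice of the median ungapped length as the reference for "the rest of the alignment", are assumptions made purely for illustration.

```python
from statistics import median
from typing import Dict

# Hypothetical thresholds; the values used in the pipeline are not stated.
MIN_RESIDUES = 50    # absolute cutoff, in non-gap residues
MIN_FRACTION = 0.3   # cutoff relative to the median ungapped length


def drop_fragments(
    alignment: Dict[str, str],
    min_residues: int = MIN_RESIDUES,
    min_fraction: float = MIN_FRACTION,
) -> Dict[str, str]:
    """Discard sequences that are short in BOTH absolute and relative terms.

    alignment maps sequence names to aligned rows, with '-' as the gap symbol.
    """
    if not alignment:
        return {}
    ungapped = {name: len(row.replace("-", "")) for name, row in alignment.items()}
    reference = median(ungapped.values())
    return {
        name: row
        for name, row in alignment.items()
        if not (ungapped[name] < min_residues and ungapped[name] < min_fraction * reference)
    }


# Toy alignment with a lowered absolute cutoff so the example stays short.
toy_alignment = {
    "seq1": "MKTLLVAGQR",
    "seq2": "MKTLLVAGQK",
    "seq3": "MK--------",
    "seq4": "MKTL-VAGQR",
}
filtered = drop_fragments(toy_alignment, min_residues=5, min_fraction=0.3)
```

Requiring both conditions means that a short sequence is kept when the whole family is short, and is removed only when it is also short compared with the other members of its alignment.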
This phylogeny-based orthology information was subsequently used to assign a
definitive name to each tunicate gene, following recently published recommendations
for tunicate gene nomenclature (Stolfi et al. 2015).