A tree 400MYA in the making: choosing species for a phylogenetic analysis between zebrafish and human genes

One of the analyses I like to run when trying to understand the function of a specific gene is to recapitulate its evolution since the last common ancestor of zebrafish and humans. This raises a problem: which species do I need to search and download data from to convincingly fill the gap across this divergence time? How many data points do we need to bridge a 400-million-year gap that includes (at least) one full whole-genome duplication?


Species tree created with TimeTree

While there are countless publications with phylogenetic trees built from zebrafish, mouse, human, and either chicken or frog sequences, I have found these datasets to be skewed towards tetrapods: their coverage of fish species is insufficient, and they usually lead to wrong conclusions.

As of 2018, this is the list of species that have a sequenced genome (at least partially) and that I use in phylogenetic and reconciliation analyses focused on investigating genetic evolution between zebrafish and mammals:

Last common taxon     Species
--------------------  -------
Vertebrata            Lampreys
Gnathostomata         Sharks
Teleostomi
Euteleostomi          Latimeria, lungfishes & Tetrapods
Actinopterygii
Actinopteri
Neopterygii           Lepisosteus oculatus
Teleostei             Anguilla anguilla
Osteoglossocephalai   Scleropages formosus, Paramormyrops kingsleyae
Clupeocephala         Great majority of fishes with a sequenced genome:
                      Takifugu, Tetraodon, Oryzias, Gasterosteus, Poecilia, Xiphophorus, ...
Otomorpha             Clupea harengus
Ostariophysi
Otophysi              Astyanax mexicanus, Electrophorus electricus, Ictalurus punctatus,
                      Pygocentrus nattereri
Cypriniphysae
Cypriniformes
Cyprinoidea
Cyprinidae            Cyprinus carpio, Squalius pyrenaicus, Sinocyclocheilus rhinocerous,
                      S. anshuiensis, and S. grahami, Leuciscus waleckii, Pimephales promelas
Danio                 Danio rerio
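
To see how the table maps onto a tree, here is a minimal Python sketch. It assumes each row names the clade that the listed species share with zebrafish, so that walking the table from bottom to top wraps zebrafish in successively older clades. The function name `ladder_newick` and the underscore-joined labels are my own choices for illustration. Species listed on the same row are placed as an unresolved polytomy, which is a simplification and not a statement about their real relationships. Rows without species, and the elided "..." entries, are left out.

```python
# Rows copied from the table above, most distant clade first.
# Clades with no listed species are omitted; underscores replace spaces
# so the names can be used directly as Newick labels.
ROWS = [
    ("Vertebrata", ["Lampreys"]),
    ("Gnathostomata", ["Sharks"]),
    ("Euteleostomi", ["Latimeria", "Lungfishes", "Tetrapods"]),
    ("Neopterygii", ["Lepisosteus_oculatus"]),
    ("Teleostei", ["Anguilla_anguilla"]),
    ("Osteoglossocephalai", ["Scleropages_formosus", "Paramormyrops_kingsleyae"]),
    ("Clupeocephala", ["Takifugu", "Tetraodon", "Oryzias", "Gasterosteus",
                       "Poecilia", "Xiphophorus"]),
    ("Otomorpha", ["Clupea_harengus"]),
    ("Otophysi", ["Astyanax_mexicanus", "Electrophorus_electricus",
                  "Ictalurus_punctatus", "Pygocentrus_nattereri"]),
    ("Cyprinidae", ["Cyprinus_carpio", "Squalius_pyrenaicus",
                    "Sinocyclocheilus_rhinocerous", "Sinocyclocheilus_anshuiensis",
                    "Sinocyclocheilus_grahami", "Leuciscus_waleckii",
                    "Pimephales_promelas"]),
]


def ladder_newick(rows, focal="Danio_rerio"):
    """Return a ladder-shaped Newick string with `focal` at the innermost tip.

    Assumption: taxa sharing a row become an unresolved polytomy that is
    sister to everything nested below that row.
    """
    subtree = focal
    # Start at the closest clade and wrap the growing subtree outwards.
    for clade, taxa in reversed(rows):
        subtree = "(" + ",".join([subtree] + taxa) + ")" + clade
    return subtree + ";"


newick = ladder_newick(ROWS)
print(newick)
```

The resulting string carries no branch lengths; it only records the nesting order, which is enough to check at a glance whether a species selection leaves long stretches of the ladder empty.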

I often use sequences from lampreys and sharks to root the trees; the coelacanth and lungfishes help join the tetrapod and the often-diverging fish lineages; and the spotted gar is the only non-teleost ray-finned fish with a sequenced genome that I know of, which makes it useful for pinpointing duplications that emerged from the whole-genome duplication.
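
The roles described above can be turned into a quick sanity check on a species selection. The sketch below is only an illustration: the role names, the common-name labels, and the function `missing_roles` are assumptions of mine, not part of any existing tool.

```python
# Hypothetical role table: each role is filled if at least one of its
# members appears in the selection. Names are lower-case common names.
ROLES = {
    "root": {"lamprey", "shark"},
    "tetrapod-fish bridge": {"coelacanth", "lungfish"},
    "pre-duplication reference": {"spotted gar"},
}


def missing_roles(selection, roles=ROLES):
    """Return, sorted by name, the roles that no selected species fills."""
    chosen = {name.lower() for name in selection}
    return sorted(role for role, members in roles.items()
                  if not (members & chosen))


# The common tetrapod-heavy dataset fills none of the three roles.
print(missing_roles(["zebrafish", "mouse", "human", "chicken"]))

# Adding one species per role fills all of them.
print(missing_roles(["zebrafish", "human", "lamprey", "coelacanth", "spotted gar"]))
```

A check like this says nothing about how well the teleosts themselves are sampled; it only flags a selection with nothing to root the tree, bridge the tetrapod and fish lineages, or serve as a reference for the whole-genome duplication.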