Open
Description
When you create a taxdump with create-taxdump (from the ICTV taxonomy, for example), the resulting taxids are not consecutive; they "skip" numbers. For example:
$ head ictv-taxdump/names.dmp
1 | root | | scientific name |
287205 | Hoswirudivirus MRV1 | | scientific name |
287935 | Shomudavirus limadaptatum | | scientific name |
1096518 | Sclerotimonavirus betaclarireediae | | scientific name |
1138752 | Potato virus H | | scientific name |
1536674 | Rhopapillomavirus 1 | | scientific name |
1845995 | Monomorium pharaonis virus 1 | | scientific name |
1890985 | Aquamavirus A | | scientific name |
2079526 | Hylipavirus | | scientific name |
2290567 | Fattrevirus | | scientific name |
This is not a problem in itself, as the nodes are still connected. However, it causes a bug when you try to create an MMseqs2 taxonomy database from the custom taxonomy, as MMseqs2 apparently assumes that taxids are consecutive (except for those listed in delnodes.dmp and merged.dmp, I guess).
I wrote a script that remaps the taxids so that no number is skipped, and that solved the issue:
$ head ictv-taxdump/names.dmp
1 | root | | scientific name |
2 | Hoswirudivirus MRV1 | | scientific name |
3 | Shomudavirus limadaptatum | | scientific name |
4 | Sclerotimonavirus betaclarireediae | | scientific name |
5 | Potato virus H | | scientific name |
6 | Rhopapillomavirus 1 | | scientific name |
7 | Monomorium pharaonis virus 1 | | scientific name |
8 | Aquamavirus A | | scientific name |
9 | Hylipavirus | | scientific name |
10 | Fattrevirus | | scientific name |
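For anyone who wants to try the same workaround, here is a minimal sketch of such a remapping. It is not the script I actually used, and it rests on some assumptions: the .dmp files use the usual NCBI-style layout (fields separated by a tab, a pipe, and a tab; lines ending in a tab and a pipe), names.dmp has the taxid in its first column, and nodes.dmp has the taxid and parent taxid in its first two columns. The function names (`split_dmp`, `build_mapping`, `remap`) are made up for illustration.

```python
def split_dmp(line):
    """Split one .dmp line into stripped fields (assumes no '|' inside a field)."""
    line = line.rstrip("\n")
    if line.endswith("|"):
        line = line[:-1]  # drop the terminating pipe
    return [field.strip() for field in line.split("|")]


def build_mapping(nodes_lines):
    """Map every original taxid to a consecutive one, in ascending order.

    Root (taxid 1) is the smallest taxid, so it stays 1.
    """
    taxids = sorted({int(split_dmp(line)[0]) for line in nodes_lines if line.strip()})
    return {old: new for new, old in enumerate(taxids, start=1)}


def remap(lines, mapping, id_columns):
    """Rewrite the taxid columns of .dmp lines using the mapping."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        fields = split_dmp(line)
        for col in id_columns:
            fields[col] = str(mapping[int(fields[col])])
        out.append("\t|\t".join(fields) + "\t|")
    return out


if __name__ == "__main__":
    # Tiny made-up example; real input would be read from names.dmp and nodes.dmp.
    names = [
        "1\t|\troot\t|\t\t|\tscientific name\t|\n",
        "287205\t|\tHoswirudivirus MRV1\t|\t\t|\tscientific name\t|\n",
    ]
    nodes = [
        "1\t|\t1\t|\tno rank\t|\n",
        "287205\t|\t1\t|\tspecies\t|\n",
    ]
    mapping = build_mapping(nodes)
    print("\n".join(remap(names, mapping, id_columns=(0,))))
    print("\n".join(remap(nodes, mapping, id_columns=(0, 1))))
```

The same mapping would also have to be applied to any other file that refers to the old taxids, such as a sequence-to-taxid mapping file, otherwise the sequences would point at the wrong nodes.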
This is not a TaxonKit bug in any way. But because MMseqs2 is pretty popular, I thought it best to report it here in case anyone else runs into the same issue.