With the introduction of next-generation sequencing, complete bacterial genomes can be sequenced at an affordable price. Current sequencers are powerful enough to investigate bacterial communities directly from the environment, a process called metagenomic analysis, bypassing the need for cultivation. The data produced by these machines can be extremely large, yet not exhaustive, as some bacteria are present only in trace amounts. This means that many genomes will be incomplete, and even when a genome is fully represented, it might not be possible to connect its parts to each other. Despite this, metagenomic analysis and related tools are becoming increasingly important in different fields of biology.
From the study of the gut microbiome to the investigation of soils, specialized software is emerging rapidly. When developing such tools, bioinformaticians have to rely on heuristic methods and benchmark them against datasets that historically were too small to represent reality. Now that sequenced genomes are publicly available online for at least 15,000 known bacterial species, it is possible to evaluate which information in the genome is meaningful for taxonomy, especially when only a small part of the genome is available.
In this project, we aim to identify an estimator of taxonomic dissimilarity that is less affected by genome (in)completeness than existing methods and that could be used to extract meaningful information from metagenomic analyses.