What is a strain? Strain-level identification in microbiome analyses

Microbiome research has revealed the huge microbial diversity within the human body and the environment. The human gut alone contains thousands of species of bacteria, not to mention other microbes such as fungi and archaea. Each of these species can have a different function within the human body, whether that is contributing to colorectal cancer or improving intestinal barrier function. This has led scientists to embark on a continuous effort to identify and characterize all microbial life within the human body and other environments.

With increasing advances in sequencing technologies, however, it is now possible to characterize microbes at an even finer taxonomic resolution. Microbial species can be further classified into “strains”, each of which can have different functions. However, there is no clear scientific consensus on what constitutes a “strain”, making it difficult for researchers to describe and characterize microbiomes at the deepest level.

At an academic level, it is difficult to define exactly what a “strain” is due to the rapidly evolving genomes of microbes. However, in practice, it is necessary to categorize microbes into “strains” to track the source and spread of pathogens, diagnose infections and understand how mutating microbes affect the human body and other systems. This blog will discuss the concept of microbial “strains” and considerations for those wanting to analyse their microbiome samples at the strain level.

What do we mean by "strain"?

In biology, life is categorised into well-defined taxonomic ranks (Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species) which have standardized, albeit artificial, definitions. A species, for example, is often defined as the largest group of organisms that can reproduce and produce fertile offspring. Humans and chimpanzees, for instance, are classified as different species yet share 98.5% of their genomes. In microbiology, organisms with less than roughly 97% identity in their 16S rRNA gene sequences are generally considered different species, although this is just an operational definition. There is also diversity within species, however, particularly in microbiology, where different “strains” can exist within the same species. There is no consensus on the concept and definition of a strain and, like other taxonomic levels, existing definitions are largely artificial.

Historically, microbial strains were individual isolates of bacteria derived from a single ancestor or parent cell that were streaked out on a growth medium. Different strains were characterised based on observable characteristics and phenotypic traits like growth patterns, metabolic capabilities, or antigenicity. The concept of strains, however, has evolved over time with advances in gene sequencing technologies, which can help to analyse different groups of bacteria down to the single nucleotide level. Due to this, groups of microbes within the same species that exhibit different genotypic or phenotypic traits are also often referred to as “sub-species” or “variants” depending on the context and how different these groups of organisms are genetically. However, the extent of genotypic or phenotypic variation needed to define a new strain has not been universally agreed.

Why define a strain?

Putting eukaryotic life into arbitrary categories is relatively simple and can have its advantages. Putting microbes into categories based on arbitrary thresholds of genetic similarity may be more problematic, largely due to their fast mutation rate and their ability to share genetic material through horizontal gene transfer, which can blur the lines between genetically distinct organisms. Nevertheless, there are a number of practical reasons why defining “strains” is important:

Defined ecological niches: Theoretically, different “strains” of the same species have different ecological niches and therefore respond to their environment in different ways. This can be due to rapid mutation or horizontal gene transfer from other organisms. For example, some E. coli strains in the human gut can be pathogenic and cause diarrhea, whilst others are probiotics.

Source tracking: Before the revolution in genome sequencing, it was almost impossible to track the source and spread of individual microbes. Categorizing microbes into “strains” can help us to track infectious diseases, how they spread from person to person and how they mutate over time. Similarly, it is possible to identify how normal, healthy strains within the microbiota are shared between individuals and environments.

Diagnostics: Medical treatment for infectious diseases relies on diagnosis of a particular pathogen. Strain identification allows health professionals to create effective treatment guidelines for individual microorganisms.

How to define a strain

There is much debate in microbiology about how to define a strain. In the past, restriction fragment length polymorphism (RFLP) and variable number tandem repeat (VNTR) analysis were commonly used to differentiate strains and are still used today to differentiate between Saccharomyces cerevisiae strains in food production. In modern times, however, the debate centres on whether strains should be classified genetically (using next-generation sequencing) or phenotypically. Each approach comes with a number of advantages and disadvantages.

Defining Strains by Average Nucleotide Identity:

Much like methods to define different bacterial species using genetic similarity, one way to define different strains is to test the average nucleotide identity (ANI). A species is commonly defined at the 95% ANI threshold; however, there is no consensus on a threshold of genetic similarity that should be used to define whether a microbe is a different strain. One of the major challenges is the fast mutation rate of microbes, meaning their genetic sequences can change relatively quickly. In addition, small changes in a microbial genome may not necessarily translate to changes in microbial functions, so meeting an ANI threshold may say little about that strain’s function. It has been suggested, based on an analysis of thousands of bacterial genomes, that the threshold for a different strain should be set at 99.5% ANI; however, this has not yet been adopted into common scientific practice. It is therefore not clear whether two microbes that fall between these ANI thresholds should be considered different strains.

Advantages:

ANI provides a quantitative measure of genetic similarity between genomes. A higher ANI indicates a closer evolutionary relationship.
ANI is rooted in genomic data, making it a robust and objective method for strain delineation.
ANI helps identify highly conserved regions, aiding in the recognition of core genomic features.

Disadvantages:

Selecting a fixed ANI threshold to define strains may not consider biological variations, potentially leading to misclassifications.
Microbial evolution can result in variations that may fall above or below the chosen ANI threshold, causing challenges in accurate strain determination.
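
To make the threshold idea concrete, here is a minimal Python sketch. It is a toy illustration only: real ANI tools align fragments across whole genomes, whereas this example assumes the fragments have already been aligned and simply averages their percent identity. The function names and the wording of the labels are assumptions made for illustration; the 95% and 99.5% cut-offs are the species threshold and the proposed strain threshold discussed above.

```python
from statistics import mean


def percent_identity(seq_a, seq_b):
    """Percent of matching bases between two pre-aligned, equal-length sequences.

    Positions where either sequence has a gap ("-") are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def average_nucleotide_identity(fragment_pairs):
    """Toy ANI: mean percent identity over pairs of aligned fragments."""
    return mean(percent_identity(a, b) for a, b in fragment_pairs)


def classify_by_ani(ani, species_threshold=95.0, strain_threshold=99.5):
    """Label a genome pair using the two thresholds (labels are illustrative)."""
    if ani < species_threshold:
        return "different species"
    if ani < strain_threshold:
        return "same species, possibly different strains"
    return "same species, same strain under the proposed threshold"


# Two hypothetical aligned fragment pairs: one identical, one with a mismatch.
pairs = [("ACGTACGTAC", "ACGTACGTAC"), ("ACGTACGTAC", "ACGTACGTAA")]
ani = average_nucleotide_identity(pairs)
print(ani, classify_by_ani(ani))
```

Note that the middle label is deliberately hedged: a pair of genomes that lands between the two thresholds is exactly the case for which there is no agreed answer.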

Defining Strains by Phenotypic Functions:

Many would argue that something should only be considered a different strain if it is phenotypically different. The reason for this is that microbes can have small changes in their genomes that don’t result in any major change in that microbe’s functions. On the other hand, phenotypic functions may depend on the environment within which a strain resides, making it difficult to distinguish from other strains.

Advantages:

Phenotypic characteristics offer functional insights, linking strains to specific behaviours or traits important for their ecological roles or applications.

Disadvantages:

Phenotypic characterisation can be subjective, making this approach less objective for strain identification.
Extensive time and resources are required to identify meaningful differences in phenotypic traits between individual strains.
Microorganisms can exhibit varying phenotypes based on environmental conditions, making it challenging to define strains solely on phenotypic traits.

The strain concept in the real world

Despite the lack of an agreed definition for a microbial strain, the term is often used in infection biology and food processing to distinguish between isolates with particular infectious or beneficial characteristics. For example:

Seasonal flu viruses: Influenza viruses that change over time or geographically are often characterized as “strains”, some of which can be more infectious or more lethal than others.
Pathogenic E. coli: Sub-types of E. coli (EPEC, EHEC, EAEC, etc.) can be considered different strains as they demonstrate different infectious mechanisms and disease outcomes.
Lactobacillus strains in fermented food production: Various strains of different Lactobacillus species have shown different effects on food production and human health such as immune modulation and gut health improvement. Strain-specific studies enable the selection of the most effective probiotics for specific health benefits.

Analysis of strains within microbiomes

Analysing different phenotypic traits within pathogenic organisms or individual starter cultures used for food production is relatively easy. Microbiome research, however, now allows us to analyse thousands of species of microorganisms within a single environment simultaneously. Strain-level analysis can be very useful in this context, for example to study transmission of microbiome strains between individuals. Analyses of mother and infant gut microbiomes have identified vertical strain transmission following birth. Strain-level analysis has also been used to show that strains are commonly transferred between co-habiting individuals. There are a number of approaches to consider, however, in strain-level analysis of microbiomes:

Reference genomes

Reference genomes serve as crucial benchmarks in strain-level analysis within microbiome research. These genomes represent well-characterized, specific microbial strains, offering a foundation for comparing and understanding the genomic variations among strains in metagenomic data. Alignment of sequencing data from a microbiome sample to these references enables accurate strain identification and classification. Functional annotation linked to these references allows researchers to associate genomic variations with distinct phenotypic traits or biological functions. However, reliance solely on reference genomes may limit analysis to well-known strains from reference databases, potentially overlooking the broader diversity inherent in complex microbial communities. Integrating reference genomes with other methods ensures a more comprehensive understanding of microbial strain-level variations.
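
The idea of matching sample sequences to reference strains, and the limitation of doing so, can be sketched in a few lines of Python. This is not how alignment-based tools work: the example swaps in a much simpler technique, the overlap (Jaccard similarity) of short k-mers, as a stand-in for alignment. The reference names, sequences, k-mer size and similarity cut-off are all hypothetical values chosen for illustration.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(set_a, set_b):
    """Jaccard similarity: shared items divided by all items."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def best_reference(sample_seq, references, k=4, min_similarity=0.5):
    """Return (best matching reference name or None, all similarity scores).

    None means no reference is similar enough, i.e. the sample may belong
    to a strain that is missing from the reference collection.
    """
    if not references:
        return None, {}
    sample_kmers = kmers(sample_seq, k)
    scores = {
        name: jaccard(sample_kmers, kmers(ref_seq, k))
        for name, ref_seq in references.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < min_similarity:
        return None, scores
    return best, scores


# Hypothetical reference collection with two strains.
references = {"strain_A": "ACGTACGTGGCA", "strain_B": "TTGACCATGGTA"}

print(best_reference("ACGTACGTGGCA", references)[0])  # matches a known strain
print(best_reference("GGGGCCCCGGGG", references)[0])  # no close reference
```

The second call shows the limitation described above: a sequence with no close relative in the reference collection is left unassigned, however abundant it is in the sample.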

Metagenome assembled genomes

Metagenome Assembled Genomes (MAGs) are composite genomic sequences constructed from metagenomic data, representing microbial populations within a sample. MAGs play a pivotal role in strain-level analysis, allowing researchers to dissect the genomic makeup of individual strains within complex microbial communities. The advantages include obtaining genomic insights without prior isolation, aiding in strain identification, understanding functional potential, and assessing ecological roles. However, challenges lie in incomplete assemblies, potential chimerism, and the inability to capture rare or low-abundance strains accurately. Furthermore, strain assignment of MAGs can also be influenced by the databases of reference genomes that are used. Despite limitations, MAGs significantly advance our understanding of microbiome diversity and function at the strain level, facilitating a broader spectrum of research in microbial ecosystems.

Metagenomic tools to analyse “strain” variation in microbiome samples

The rapid advances in DNA sequencing technologies and the associated bioinformatic pipelines used to analyse the data have improved our ability to identify differences between microorganisms. Not only is it possible to ascertain very small differences in ANI, it is even possible to identify single nucleotide variants (SNVs) in microbial genomes. Metagenomic assembly tools and associated pipelines can be very useful to identify genetic variation within and between samples. For example, StrainPhlAn and PanPhlAn use a variety of approaches, including pangenomes (a compiled collection of genes within a species, which can come from different “strains”) and SNV identification, to group genetically similar metagenomes into “strains” and compare them with reference genomes. The strength of these bioinformatic tools can also depend on the sequencing approach. Traditional short-read sequencing (e.g. Illumina-based sequencing) can struggle to resolve highly similar strains or to distinguish repetitive genomic regions. In contrast, newer long-read sequencing technologies (e.g. PacBio and Oxford Nanopore) can sequence long, challenging genomic regions such as repetitive sequences, aiding the precise identification of strain-specific variations. Collectively, these tools can aid the conceptual understanding of genetic differences between microorganisms, allowing scientists to identify new “strains”.
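
The two ideas mentioned above, SNV identification and pangenome gene content, can be illustrated with a short Python sketch. It does not reproduce the algorithms of StrainPhlAn or PanPhlAn. It assumes the sequences are already aligned and that gene detection has already been done, and all names, sequences and gene labels are hypothetical.

```python
def call_snvs(reference, consensus):
    """List single nucleotide differences between two aligned sequences.

    Returns (1-based position, reference base, sample base) tuples.
    Gaps ("-") and ambiguous bases ("N") are ignored.
    """
    if len(reference) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    snvs = []
    for position, (ref_base, alt_base) in enumerate(
        zip(reference.upper(), consensus.upper()), start=1
    ):
        if ref_base in "-N" or alt_base in "-N":
            continue
        if ref_base != alt_base:
            snvs.append((position, ref_base, alt_base))
    return snvs


def gene_profile(pangenome, detected_genes):
    """Presence (1) or absence (0) of each pangenome gene in one sample."""
    return tuple(1 if gene in detected_genes else 0 for gene in pangenome)


def group_by_profile(pangenome, samples):
    """Group sample names that share an identical gene presence/absence profile."""
    groups = {}
    for name, detected_genes in samples.items():
        profile = gene_profile(pangenome, detected_genes)
        groups.setdefault(profile, []).append(name)
    return groups


# Hypothetical aligned sequences differing at a single position.
print(call_snvs("ACGTACGT", "ACGAACGT"))

# Hypothetical pangenome of three genes and the genes detected per sample.
pangenome = ["geneA", "geneB", "geneC"]
samples = {
    "sample_X": {"geneA", "geneB"},
    "sample_Y": {"geneA", "geneB"},
    "sample_Z": {"geneA", "geneC"},
}
print(group_by_profile(pangenome, samples))
```

In this toy example, sample_X and sample_Y carry the same set of genes and fall into one group, while sample_Z forms its own. Whether such groups deserve to be called different “strains” is, as discussed throughout this post, a matter of definition.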

Conclusion

In microbiology, a strain is not well-defined. This means that efforts in microbiome research to analyse strain-level differences can be difficult, especially if the purpose is to associate different strains with health outcomes. Despite these challenges, deeper analyses of microbiome samples to assess genetic variation within species can be extremely useful and may provide more information than is available by analysing microbiomes at higher taxonomic levels. The team at Microbiome Insights has extensive expertise, from sequencing to bioinformatics analysis, to help you with your microbiome study. If you are interested in strain-level analysis within your microbiome study, reach out to the team today; they will be happy to discuss it further.

About Microbiome Insights

Microbiome Insights, Inc. is a global leader providing end-to-end microbiome sequencing and comprehensive bioinformatic analysis...
