Chapter 3. Introduction to Microbiome Analysis#

Microorganisms that are difficult or impossible to culture have been a barrier for microbiologists for decades. Boosted by advances of high-throughput sequencing technology, in recent years, microbiome studies, has increased dramatically, which has yielded tremendous insight into the causality of microbiota functions to human health.

In this chapter, we will walk through the microbiome research, including:

  • The history of microbiome research

  • Understand the importance of next-generation sequencing in the field of microbiome research

  • Introduce common analysis based on the real microbiome data

3.1 Microbiome Research#

The Microbiome is the community of microorganisms (such as fungi, bacteria and viruses) that exists in a particular environment.

3.1.1 What is a poop transplant (Fecal Microbiota Transplantation)#

HuangLongSoup.jpg

A special “soup” with a deep yellow color and an intense aroma can cure your diarrhea. Will you take it?

And, if I tell you one of the most important component in the soup is face. Will you take it?

Fecal Microbiota Transplantation (FMT):

  • Fourth-century Chinese medical literature mentions its use, by Ge Hong among others, to treat food poisoning and severe diarrhoea.

  • Ancient practices: FMT used in ancient Chinese medicine to treat abdominal diseases, by Li ShiZhen and documented in the 16th and 17th centuries.

Find a great vedio from TED here.

3.1.2 Introduction to The Early History of Human-associated Microbiome Research#

Microbiome research originated in microbiology back in the seventeenth century. The story starts from a Letter from Antonie van Leeuwenhoek to Royal Society of London in 1683.

“I then most always saw, with great wonder, that in the said matter there were many very little living animalcules, very prettily a-moving.” — Antonie van Leeuwenhoek.

Microbiome Research

  • 1670s–1680s: Antonie van Leeuwenhoek used handcrafted microscopes to describe human-associated microbiota, identifying different kinds of bacteria in the mouth and comparing oral and fecal microbiota.

  • 1853: Joseph Leidy published “A Flora and Fauna within Living Animals,” considered an origin of microbiota research.Then, the work of Pasteur, Metchnikoff, Koch, Escherich, Kendall and a few others, laid the foundations of how we understand host–microrganism interactions.

  • 1890: Koch published his postulates to establish microorganism-disease causation. Infectious diseases became the earliest focus of interest and research, likely due to the fact that most bacterial pathogens can grow in the presence of oxygen (therefore, aerobic), whereas most members of the gut microbiota (therefore, anaerobic ) cannot and thus could not be studied at the time.

  • 1917: Alfred Nissle isolated the Escherichia coli Nissle 1917 (facultative, can survive in either aerobic or anaerobic niche) strain, establishing the concept of colonization resistance.

  • 1940s–1950s: Discovery of methods to culture anaerobic organisms, allowing in-depth study of microbial communities.

  • 1960s: Comparisons of germ-free and colonized animals led to important observations in microbiota research.

Despite progress in culturing microorganisms, it quickly became evident that there was a significant difference between the number of microbiota present and those that could be grown in the lab. Thanks to the discovery of DNA and its function, scientists proposed the Development of sequencing-based approaches: Pioneered by Woese, Pace, and Fox, enabling study of unculturable microorganisms.

Find More Interesting Story Here: A field is born

3.1.3 Introduction to Amplicon Sequencing#

In this section, you should get a basic idea of,

  1. what is a tree of life,

  2. why 16S region could be used to classify the microorganisms?

3.1.3.1 Tree of Life#

Linnaean System#

Linnaean System (1735): Carl Linnaeus formalized taxonomy by introducing a hierarchical system with specific ranks.

linnaean_system.png
Darwin’s Theory of Evolution#

Darwin’s Theory of Evolution (1859): Charles Darwin’s “On the Origin of Species” proposed natural selection as a mechanism for evolution, fundamentally changing how organisms were understood in relation to one another. Darwin’s concept of common descent began to shape the idea of a “tree of life.”

With advances in comparative anatomy and embryology, with morphological and behavioral features, scientists began to construct early phylogenetic trees that depicted evolutionary relationships.

early_tree_of_life.png

The First Darwinian Phylogenetic Tree of Plants

Eukaryote-Prokaryote Dichotomy#

Initial Classification (Late 19th Century): Early microbiologists, such as Louis Pasteur and Robert Koch, began differentiating between two major groups of life based on cellular organization: eukaryotes (organisms with complex cells containing a nucleus) and prokaryotes (simple, unicellular organisms without a nucleus).

two_domain.png
Three-Domain Theory#

In the 1970s, Carl Woese of the University of Illinois and his colleagues published the first “universal tree of life” by using the RNA tools. They presented the tree as three great trunks. Our own trunk, known as eukaryotes, includes animals, plants, fungi and protozoans. A second trunk included many familiar bacteria like Escherichia coli. The third trunk that Woese and his colleagues identified included little-known microbes that live in extreme places like hot springs and oxygen-free wetlands. Woese and his colleagues called this third trunk Archaea.

By 1990, Woese adopted the term “domain” for the three new branches of life. A major change to the Linnaean system was the addition of the “domain”. A domain is a taxon that is larger and more inclusive than the kingdom.

. . . .

In 2016, a team of scientists presented a new tree of life, illustrating the evolution of all living organisms. Their research revealed that bacteria constitute the majority of the branches in this tree. And they found that much of that diversity has been waiting in plain sight to be discovered, dwelling in river mud and meadow soils.

16s_region.png

New view of the tree of life, 2016

3.1.3.2 Human Microbiome#

  • Our human body contains about 10X as many microbes as human cells, including Bacteria, Archaea, viruses and Fungi.

  • Microbe community differs accross human body

human_micro.png

The human microbiome: at the interface of health and disease

  • First Introduction occurs at birth

  • Microbes provides enzymes for digestions and other compounds products such as Short-Chain Fatty Acids (SCFAs).

  • Carriage of microbial taxa varies while the metabolic pathways remain stable within the healthy population.

human_micro.png

Structure, function and diversity of the healthy human microbiome

3.1.3.3 Principles of Amplicon Sequencing (16S/18S/ITS)#

If I have seen further, it is by standing on the shoulders of giants. – Sir Isaac Newton

Pioneer

Main Contribution

Carl Woese

Propose the Archaea (a new domain of life) in 1977 through a pioneering phylogenetic taxonomy of 16S ribosomal RNA, a technique that has revolutionized microbiology. It is suggested that 16S rRNA gene can be used as a reliable molecular clock for Bacteria and Archaea because the 16S region contains conserved sequences for broad identification and variable regions that allow differentiation between species and strains.

George E. Fox

Collaboration with Carl Woese. Three-domain system Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya.

Pace, Norman R.

The first characterization of the composition of a bacterial community (or microbiota) via the identification of 16S rRNA gene sequences without cell culture in 1985

So What is 16S rRNA?

  • Part of the subunit of prokaryotic ribosome

  • widely conserved (Bacteria and Archaea), but have hypervariable regions for differentiation.

16s_region.png

The choice of using different region is pivotal for optimizing phylogenetic resolution, cost-effectiveness, and microbial diversity assessment. You need to make a trade-off.

You may find the article below useful:

  1. https://pmc.ncbi.nlm.nih.gov/articles/PMC11264597/

  2. https://www.nature.com/articles/sdata20197

  3. https://www.nature.com/articles/s41467-019-13036-1

  4. https://onlinelibrary.wiley.com/doi/full/10.1002/imo2.9

Other types of amplicon sequencing include 18S and ITS sequencing, which are used to study fungi.

3.1.3.4 Common Amplicon Sequencing Steps#

The Schematic diagram describe the overview of 16s sequencing Heidi H. Kong, 2011:

16s_OVERVIEW.png

Common Amplicon Sequencing can be split into several steps:

  1. Sample Preparation

  • This may involve adding enzymes to break down cell walls, applying heat to denature proteins, or using physical methods to separate cells from debris.

  1. DNA Extraction

  • This is typically done using a DNA extraction kit, which uses a combination of chemical and physical methods to separate the DNA from other cellular components.

  1. Library Preparation

  • Library preparation refers to the series of procedures that need to be taken to prepare the DNA for next generation sequencing, typically including 1) Amplification of the 16S rRNA gene. 2) Perform PCR to add molecular ‘barcodes’ to each cleaned DNA sample 3) clean DNA with magnetic microbeads

  1. Sequencing

  • During sequencing, the DNA libraries are randomly amplified and sequenced using a combination of chemistry and optics. The resulting data generates short DNA sequences that can be aligned up to databases to identify which bacteria and archaea they belong to.

This Link from the most popular platform illumina introduce the NGS sequencing in a clear way.

3.1.4 Bioinformatics challenge: Sequencing and Alignment#

Terminology

Term

Explain

Note

Reads Length

The number of bases present in one read

Longer reads can generally produce better results

Sequencing Depth (or sometimes simply depth)

The average number of times a given position in the genome is sequenced.

More important in whole genome sequencing

Number of reads

Total amount of reads produced for a sample

In 16s rRNA gene sequencing, we can consider number of reads the same as sampling depth, as we know which regrion are we targeting

Coverage

The proportion or percentage of a genome that has been sequenced at a certain depth.

More important in whole genome sequencing

3.1.4.1 Quality Control#

It is inevitable to HAVE some errors during 16S rRNA gene sequencing (Pay price for high number of information generated). But we can minimise these errors by using bioinformatics tools and proper experimental design.

challenge.png

So you must have got the basic understanding how the 16S or other amplicon sequencing can estimate and sample the microbial population in an efficient and accurate way. Now, let’s have a first glance at the sequencing output:

Find sample FASTQ file format,

16s_region.png

The quality score (Q) reflects in a logarithmic scale the probability that this particular base was called incorrectly (Perror). These scores can help identify low-confidence calls that are more likely to be erroneous. quality score equals to 20 (1 error happened in 100 bases) or 30 (more stringent, 1 error happened in 1000 bases) was normally set as the cutoff.

The question comes when we got the sequencing data,

  • Is the sequencing data with high quality?

16s_region.png

Note

This is a complex question but worth to note when you are doing the analysis on the raw sequence data. Think about the following sub-questions:

  1. Did you remove all the artifical sequence (e.g. the barcodes, the adpater for PCR)

  2. Is your sequence of high sequencing quality?

  3. Any contaminants present in my file?

  4. Did you apply specific tools to correct the error induced during PCR? (e.g. DADA2 for 16S)

You may find more details about quality control here: Fastqc

3.1.4.2 Alignment#

After you believe you have got good quality of sequencing data, you may think about the following questions:

  • 1.How to identify and classify the unknown bacteria sequence to known species (i.e. curated in database),

  • 2.How to quantify the amount (abundance) of each unknown bacteria sequence.

Note

Like the traditional classification performed based on the traits and morphology, for sequencing and molecular classification, we typically seek to find similarity (aka, identity) between sequences. This procedure, called sequence alignment, is considered as the most essential step in bioinformatics.

Two commonly used sequence alignment algorithms are global alignment and local alignment.

Global alignment: Global alignment is a method of comparing two sequences, which aligns the entire length of the sequences by maximizing the overall similarity. This method is used when comparing sequences that are of the same length.

Local alignment: In local alignment, instead of attempting to align the entire length of the sequences, only the regions with the highest density of matches are aligned. This is useful for identifying short conserved regions in protein or nucleotide sequences.

alignment

Global and Local Alignment of two sequences (Mount, D. W., 2001).

You will have practice about this mission in the practice session.

alignment

Example adopted from link

At present, you know how to identify the similarity between two sequences. Before we go deeper to the quantification part, we should first look at the possible issues we may meet.

  1. The goal for targeted sequencing is to determine its taxonomic origin based upon a potentially small number of variations relative to similar taxa. The results may be cofounded by single nucleotide variants (SNVs) caused by sequencer error, leading to a detection of a similar, but incorrect organism, or the false discovery of a new organism.

  2. For millions of reads, it is computational intensive to analysis.

alignment

otu vs asv

Scientists post a strategy to counter these two issues: find a proxy for microbial species. At the beginning, the proxy is based on the operational taxonomic units (OTUs), clusters of reads that differ by less than a fixed sequence dissimilarity threshold, most commonly 3% (i.e. 97% identity). The clustering method could be further divided into two strategies, 1) closed-reference methods - in which reads sufficiently similar to a sequence in a reference database are recruited into a corresponding OTU. 2) De novo methods - in which reads are grouped into OTUs as a function of their pairwise sequence similarities.

However, several disadvantages has been proposed:

  • Arbitrary Clustering: The 97% threshold is somewhat arbitrary and can lump distinct species together or split the same species.

  • Reproducibility Issues: Different software or parameters can produce different OTU clusters.

  • Low resolution: We can not fully believe the species-level resolution from this method.

In recent years, Amplicon Sequence Variants (ASVs) (100% idenitity sequence clustering and filter possible error by probability modelling) were proposed to minimize the effects. One of the most advantage is that ASVs can detect finer differences between sequences, providing more precise and accurate data, and do taxonomic assignment in a higher-resolution.

Tip

Difference between targeted sequencing and whole genome sequencing

In the context of whole genome sequencing, a small number of single nucleotide variants (SNVs) caused by sequencer error are unlikely to seriously confound an aligner and have little effect on the final attribution of the sequence. Targeted sequencing, where comparison of multiple similar sequences, rather than alignment across multiple genomes is the main operation, has the potential to be confounded by erroneous SNVs. This can result in misattribution of the sequence leading to either to detection of a similar, but incorrect organism, or the false discovery of a new organism.

After align the proxy (We call it Taxa in our tutorial) to the curated database, we may classify the taxa into different taxonomic ranks:

The taxonomic ranks used in this molecular classification system are:

  • Genus (About 93-97% identity)

  • Species (In general, > 99% identity depend on the targeted 16S region)

Each rank represents a level of classification

Tip

More advanced machine learning based taxonomic assignment method was also proposed e.g. naive bayes classifiers. You will learn more about machine learning in Professor Yin’s Lecture.

3.1.4.3 Pros and Cons of 16S rRNA Sequencing#

Pros:

  • Cost: Cost-effective.

  • Culture-independent: We only require DNA of the tested bacteria.

Cons:

While 16S rRNA sequencing is a powerful tool for identifying and characterizing microorganisms, it has some limitations in addition to strengths:

  • Taxonomic resolution: 16S rRNA sequencing targets a specific region of the bacterial genome that is relatively conserved across species but contains enough variation to differentiate between different taxa. However, this region may not always provide enough resolution to distinguish between closely related strains within a species due to limited genetic variability. Shotgun metagenomics sequencing works better which target the whole metagenome of the microbial community.

  • Identifying microbial functions: While 16S rRNA sequencing focuses solely on a single gene in bacteria and archaea, it is limited in its ability to reveal the full range of functional capacities exhibited by the microbes present in a sample. Shotgun metagenomics sequencing works better which target the whole metagenome of the microbial community.

Tip

The taxonomic resolution of 16S is dependent on region(s) targeted. If we can target the full length of the 16S rRNA gene (~1500bp), then we can probably increase the resolution to species-level identification. However, it can not be done with the next generation sequencing, in which we mainly provide short reads (~300bp).

Scientists are working on this issue, see Link here

3.1.5 Understand the Ecological Property of Microbiome: Alpha and Beta Diversity#

An ecological community consists of the collective populations of various species inhabiting a specific area. Each organism within this community occupies a unique ecological niche, encompassing

  • the physical environment they inhabit,

  • their utilization of available resources, and

  • their interactions with other organisms sharing the same space.

By defining their niche, organisms establish their role within the ecosystem, contributing to the intricate web of relationships that govern the functioning and dynamics of the community.

Different types of interspecific interactions have different effects on the two participants, which may be positive (+), negative (-), or neutral (0).

Interaction types

Effect on species 1

Effect on species 2

Competition

-

-

Predation

+

-

Mutualism

+

+

Commensalism

+

0

Parasitism

+

-

This is actually the area that underexplored and underdebate. Some scientists think the main interaction types between microbial species is competition rather than cooperation (Kevin R. Foster, 2012), while other disagree (Katharine Z. Coyte et al.).

One way to understand the diversity within a ecosystem or the dynamic stability of a certain microbial community is to use Alpha and Beta diversity.

Alpha diversity: The diversity within a particular niche or ecosystem (e.g. for example, human gut). Reduces the complexity of a multi-species community to a single integer that describes how many taxonomic units are present

Common Index, include:

  • the number of species (i.e., species richness) in that ecosystem.

  • The evenness (whether the microbial community was dominated by small proportion of the species.)

Beta diversity: A comparison of of diversity between ecosystems, usually measured as the amount of species change between the niche or ecosystems.

Common Index Could be split into two common types weighted and unweighted:

  • Unweighted means the Comparison between ecosystems only take the presence or absence of certain species into consideration: Jaccard Distance, Unweighted Unifrac distance

  • Weighted means the Comparison between ecosystems take the abundance information into consideration: Bray-Curtis Distance, Weighted Unifrac distance.

Jaccard Similarity = the size of the species intersection between divided by the size of the union species of the sample sets

Jaccard distance = 1 - Jaccard Similarity

AlphaAndBetaDiversity

Note

The idea that microorganisms exist solely as an independent unit evolved as it became clear that they form complex groups where species interact and communicate. High-throughput genome and metagenome analysis now offer powerful tools for studying both individual microbes and entire microbial communities in their natural environments.

Further Reading About Microbial Ecology:

3.1.5.1 Ordination#

Ordination is a statistical technique used in ecology to visualize and interpret complex multivariate data sets. It is particularly useful for exploring patterns and relationships among samples or variables in a dataset. Ordination methods aim to represent high-dimensional data in a lower-dimensional space while preserving the underlying structure and variation in the data.

here are various ordination techniques available, each with its own strengths and assumptions. Some common ordination methods include:

  • Principal Component Analysis (PCA): PCA is a widely used ordination method that reduces the dimensionality of the data by finding a set of orthogonal axes (principal components) that capture the maximum amount of variation in the data.

  • metric MultiDimensional Scaling (aka PCoA): the dissimilarities between samples are typically represented by a distance matrix, such as a Euclidean distance matrix or a Bray-Curtis dissimilarity matrix. The goal of PCoA is to find a set of axes (principal coordinates) that best capture the variation in the original dissimilarities.

  • Non-metric Multidimensional Scaling (NMDS): NMDS is a non-linear ordination method that aims to represent the dissimilarities between samples in a lower-dimensional space. It is often used when the data do not meet the assumptions of linear methods like PCA.

PCoA and NMDS is generally recommanded for microbiome data, therefore will be discussed below.

Find More information here, PCoA, NMDS, PERMANOVA

Case Exercise 6#

The figure describes the gut microbiota from our friends, honey maker - bumble bee, while different shapes represent for different bacteria. Could you calculate the alpha (species richness) for blue bee and green bee, beta diversity (Jaccard distance) between the blue bee and green bee?

Tip

Unifrac distance incorporates the phylogenetic distance between species into the comparison between ecosystems. Why take phylogenetic information into consideration can improve our understanding of the beta diversity?

3.2 Common Analysis in Microbiome study#

In this section, we will use the real microbiome data from Dr. Johnson et al. , NEEN project , which describe a control-treatment experiment design applied in mice to explore the function of first documented probiotics E.Coli Nissle in gut microbiota modulation. In this study, we applied 16s rRNA sequencing targting the v4 region.

3.2.1 Basic Experiment Design#

16s_NEEN.png

3.2.2 From Fasta File to Taxa Table#

We generated the data by the methodology published in the link.

This part will not be the main focus of this Course, but you may find the following link useful:

3.2.3 Import Data into R#

You will find the following data under documents/case7

Name

Description

Info

taxa_table.tsv

The ready-to-analysis microbial quantification matrix (row as Taxa, column as samples) generated by QIIME2 and dada2 pipeline

784 taxa and 12 samples

taxonomic_assignment_table.tsv

The taxonomic assignment (row as Taxa, column as Taxonomic information) generated with greengene2 database

784 taxa by 7 taxonomic ranks

Phylogenetic Tree

The ASVs phylogenetic tree

Not provided

sample_metadata_table.tsv

The table indicating the sample-group relationship

12 samples by Grouping variables

Install the phyloseq package. It will take some time.

# Install phyloseq from Bioconductor
# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install("phyloseq")

# Install phyloseq from Github (Recommended)
# install.packages("devtools")
# devtools::install_github("phyloseq/joey711")

library(phyloseq)
sample_meta <- read.csv("../document/case7/sample_metadata_table.tsv",sep = "\t", row.names = 1)
sample_meta
A data.frame: 12 × 4
referenceDietDescriptionGroup
<int><chr><chr><chr>
NEEN.01111ALIOS dietGroup1CTRL
NEEN.01212ALIOS dietGroup1CTRL
NEEN.01313ALIOS dietGroup1CTRL
NEEN.01414ALIOS dietGroup1CTRL
NEEN.01515ALIOS dietGroup1CTRL
NEEN.01616ALIOS dietGroup1CTRL
NEEN.02121ALIOS dietGroup2EcN
NEEN.02222ALIOS dietGroup2EcN
NEEN.02323ALIOS dietGroup2EcN
NEEN.02424ALIOS dietGroup2EcN
NEEN.02525ALIOS dietGroup2EcN
NEEN.02626ALIOS dietGroup2EcN
table(sample_meta$Group) # Count the number of sample for each Group
CTRL  EcN 
   6    6 
taxa_table <- read.csv("../document/case7/taxa_table.tsv", sep = "\t", row.names = 1) 
tail(taxa_table[1:6]) # You can find the Taxa id which is generated randomly to represent for different bacterial sequence (Taxa)
A data.frame: 6 × 6
NEEN.011NEEN.012NEEN.013NEEN.014NEEN.015NEEN.016
<int><int><int><int><int><int>
bbaad0f397a1590db5c42bf1f40c8db9 0 13 012 0 22
1e040db0048bb4067f6d355cb45648dd 9911614959121 302
a6814253a38248cacced10ba5b2e3b1d143 60109803941124
d2be1254f5e7eb97bd42ec342d9cb11d 6 0 0 6 0 0
45ec8d8b3f3f833abb943a394602ca67 0 0 0 0 0 0
2ce88e05f1919882b01289628bd3d156 0 0 9 6 21 20

You can use colSums() to calculate the Total Abundance (or called as sampling depth in our tutorial) of each sample

print(colSums(taxa_table))
NEEN.011 NEEN.012 NEEN.013 NEEN.014 NEEN.015 NEEN.016 NEEN.021 NEEN.022 
   50047    49734    52780    47521    56489    56181    51627    51627 
NEEN.023 NEEN.024 NEEN.025 NEEN.026 
   52991    42890    48576    51663 

Note

You can see that the total abundance varied accross samples. One of the important things for you to understand is that the abundance generated by the 16S rRNA sequencing (and probably other techniques based on sequencing) is highly dependent on the sampling depth (num of reads used in sequencing). Therefore, only sampling a fraction of the total population makes it invalid to compare between samples (We need some special statistical help). We will discuss it in next section 3.2.4.

taxonomic_assign_table <- read.csv("../document/case7/taxonomic_assignment_table.tsv", sep = "\t", row.names = 1)
tail(taxonomic_assign_table) # You may find NA present in the taxonomic assignment table
A data.frame: 6 × 7
KingdomPhylumClassOrderFamilyGenusSpecies
<chr><chr><chr><chr><chr><chr><chr>
bbaad0f397a1590db5c42bf1f40c8db9d__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeNA NA
1e040db0048bb4067f6d355cb45648ddd__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeNA NA
a6814253a38248cacced10ba5b2e3b1dd__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeNA NA
d2be1254f5e7eb97bd42ec342d9cb11dd__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeNA NA
45ec8d8b3f3f833abb943a394602ca67d__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeKineothrixKineothrix sp000403275
2ce88e05f1919882b01289628bd3d156d__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeNA NA
sort(unique(taxonomic_assign_table$Genus)) # Genus we have
  1. '14-2'
  2. '1XD42-69'
  3. '28-57-27'
  4. 'Acetatifactor'
  5. 'Acetitomaculum'
  6. 'Acutalibacter'
  7. 'Adlercreutzia_404199'
  8. 'Adlercreutzia_404218'
  9. 'Adlercreutzia_404257'
  10. 'Akkermansia'
  11. 'Alcanivorax_A'
  12. 'Aliidiomarina'
  13. 'Alistipes_A_871400'
  14. 'Alistipes_A_871404'
  15. 'Alkalibacterium'
  16. 'Alkalicoccus'
  17. 'Alkalimonas'
  18. 'Anaerotruncus'
  19. 'Angelakisella'
  20. 'Aquisalimonas'
  21. 'Avispirillum'
  22. 'Bacteroides_H'
  23. 'Bariatricus'
  24. 'Bifidobacterium_388775'
  25. 'Bilifractor'
  26. 'Bilophila'
  27. 'Borkfalkia'
  28. 'Butyribacter'
  29. 'Butyricicoccus_A_77030'
  30. 'BX12'
  31. 'C-19'
  32. 'C-53'
  33. 'CAG-269'
  34. 'CAG-273'
  35. 'CAG-314'
  36. 'CAG-41'
  37. 'CAG-485'
  38. 'CAG-56'
  39. 'CAG-632'
  40. 'CAG-873'
  41. 'CAG-95'
  42. 'Choladousia'
  43. 'Clostridium_AP'
  44. 'Clostridium_Q_134526'
  45. 'Clostridium_T'
  46. 'COE1'
  47. 'Conexibacter'
  48. 'Coprocola'
  49. 'Copromonas'
  50. 'Cryptobacteroides'
  51. 'D16-34'
  52. 'Demequina'
  53. 'Desulfovibrio_R_446353'
  54. 'Desulfurivibrio'
  55. 'Dethiobacter'
  56. 'Dubosiella'
  57. 'Duncaniella'
  58. 'Dysosmobacter'
  59. 'Ectothiorhodospira_607903'
  60. 'Emergencia'
  61. 'Enterenecus'
  62. 'Enterocloster'
  63. 'Enteroscipio'
  64. 'Erysipelatoclostridium'
  65. 'Escherichia_710834'
  66. 'Eubacterium_F'
  67. 'Eubacterium_G'
  68. 'Eubacterium_J'
  69. 'Eubacterium_R'
  70. 'Eubacterium_S'
  71. 'Evtepia'
  72. 'Faecalibaculum'
  73. 'Faecimonas'
  74. 'Faecousia'
  75. 'Fimenecus'
  76. 'GCA-002282575'
  77. 'GCA-2718595'
  78. 'Halofilum'
  79. 'Halolactibacillus'
  80. 'Halomonas_C_640989'
  81. 'Harryflintia'
  82. 'Holdemania'
  83. 'Hydrogenophaga_590395'
  84. 'Intestinimonas'
  85. 'Isachenkonia'
  86. 'Izemoplasma_B'
  87. 'JC017'
  88. 'Kineothrix'
  89. 'Kocuria'
  90. 'Lachnoclostridium_A_130679'
  91. 'Lachnoclostridium_B'
  92. 'Lachnospira'
  93. 'Lacrimispora'
  94. 'Lactobacillus'
  95. 'Lawsonibacter'
  96. 'Ligilactobacillus'
  97. 'Limivicinus'
  98. 'Limosilactobacillus'
  99. 'Luxibacter'
  100. 'Mailhella'
  101. 'Massilioclostridium'
  102. 'MD308'
  103. 'Merdibacter'
  104. 'Merdisoma'
  105. 'Methylophaga'
  106. 'Mucispirillum'
  107. 'Muribaculum'
  108. 'Nanosyncoccus'
  109. 'Natronospirillum'
  110. 'Nitrincola'
  111. 'NM07-P-09'
  112. 'Oceanicaulis'
  113. 'Odoribacter_865974'
  114. 'Oribacterium'
  115. 'Paludicola'
  116. 'Parabacteroides_B_862066'
  117. 'Paralachnospira'
  118. 'Paramuribaculum'
  119. 'Parasutterella'
  120. 'Pelethenecus'
  121. 'Peptococcus'
  122. 'Phocea'
  123. 'Planktosalinus'
  124. 'Planococcus'
  125. 'Porcincola'
  126. 'Prevotella'
  127. 'Pseudobutyricicoccus'
  128. 'Pseudomonas_D_641492'
  129. 'Pseudoxanthomonas_A_615337'
  130. 'QAMM01'
  131. 'QWKK01'
  132. 'Rhodobaculum'
  133. 'Rhodohalobacter'
  134. 'Robinsoniella'
  135. 'Roseinatronobacter'
  136. 'Ructibacterium'
  137. 'RUG12438'
  138. 'RUG762'
  139. 'Ruminiclostridium_E'
  140. 'Ruminococcus_B'
  141. 'Ruminococcus_C_58660'
  142. 'Ruthenibacterium'
  143. 'Salinarimonas'
  144. 'Salipaludibacillus_361384'
  145. 'Scatocola'
  146. 'Schaedlerella'
  147. 'Shewanella'
  148. 'Silanimonas_613110'
  149. 'SKVR01'
  150. 'Sodaliphilus'
  151. 'Sporofaciens'
  152. 'Staphylococcus'
  153. 'Streptococcus'
  154. 'Sulfurimonas'
  155. 'SZUA-442'
  156. 'TC1'
  157. 'Thioalkalivibrio_B'
  158. 'Thiomicrospira'
  159. 'Thiothrix_606507'
  160. 'Tindallia'
  161. 'Turicibacter'
  162. 'Turicimonas'
  163. 'UBA2658'
  164. 'UBA3263'
  165. 'UBA3282'
  166. 'UBA3789'
  167. 'UBA5905'
  168. 'UBA644'
  169. 'UBA6985'
  170. 'UBA7173'
  171. 'UBA737'
  172. 'UBA9414'
  173. 'UBA946'
  174. 'UBA9715'
  175. 'UMGS1071'
  176. 'UMGS1994'
  177. 'Ventrimonas'
  178. 'Ventrisoma'
  179. 'VRTS01'
  180. 'Wenzhouxiangella'
  181. 'Yoonia_491068'
  182. 'ZJ304'

Note

You will find some genus name, for example, “CAG-485” looks like the name generated by computer. Those are condidate genera name which we can identify the presence of certain organism by sequencing, however, we can not culture it so far, therefore, can not to do subsequent formal nomenculture.

Also, you may find the some genus name, for example, “Escherichia_710834”, looks loke a normal name plus a string of random number. Those are another example of why sequencing has revolutionized the microbiology study. GreenGene2 Database adopt the phylogenetic information from another important database GTDB. The former part from the example is, “Escherichia”. This is the name given by current formal nomenculture. However, based on the phylogenetic tree generated by the GTDB, the genera “Escherichia” might have the possibility (at least from the DNA sequence insights) to be split into subdivision or relocate to other genera.

tail(is.na(taxonomic_assign_table)) # whether the value in the matrix is NA with is.na()
A matrix: 6 × 7 of type lgl
KingdomPhylumClassOrderFamilyGenusSpecies
bbaad0f397a1590db5c42bf1f40c8db9FALSEFALSEFALSEFALSEFALSE TRUE TRUE
1e040db0048bb4067f6d355cb45648ddFALSEFALSEFALSEFALSEFALSE TRUE TRUE
a6814253a38248cacced10ba5b2e3b1dFALSEFALSEFALSEFALSEFALSE TRUE TRUE
d2be1254f5e7eb97bd42ec342d9cb11dFALSEFALSEFALSEFALSEFALSE TRUE TRUE
45ec8d8b3f3f833abb943a394602ca67FALSEFALSEFALSEFALSEFALSEFALSEFALSE
2ce88e05f1919882b01289628bd3d156FALSEFALSEFALSEFALSEFALSE TRUE TRUE

Remember that the FALSE could be considered as 0, and the TRUE could be considered as 1.

print(colSums(is.na(taxonomic_assign_table))) # See how many NA present in each Taxonomic Ranks
Kingdom  Phylum   Class   Order  Family   Genus Species 
      0       2       2       5      10     197     376 

Warning

NA is a common symbol used in R to indicate the Missing Value. In our case, the missing value is because we did not have species or even genus level assignment for some of the taxa.

na.omit(df) - clean row with na

You need to take care of the missing value with caution.

We will replace NA values with the previous rank concatenated with “current rank” + “_no_rank”.

Note

Think about how to finish this mission if you are interested.

taxonomic_assign_table <- read.csv("../document/case7/taxonomic_assignment_table_revised.tsv", sep = "\t", row.names = 1) # we  will directly use the revised one
tail(taxonomic_assign_table) 
A data.frame: 6 × 7
KingdomPhylumClassOrderFamilyGenusSpecies
<chr><chr><chr><chr><chr><chr><chr>
bbaad0f397a1590db5c42bf1f40c8db9d__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeLachnospiraceae_genus_no_rankLachnospiraceae_genus_no_rank_species_no_rank_OTU_bbaad0f397a1590db5c42bf1f40c8db9
1e040db0048bb4067f6d355cb45648ddd__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeLachnospiraceae_genus_no_rankLachnospiraceae_genus_no_rank_species_no_rank_OTU_1e040db0048bb4067f6d355cb45648dd
a6814253a38248cacced10ba5b2e3b1dd__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeLachnospiraceae_genus_no_rankLachnospiraceae_genus_no_rank_species_no_rank_OTU_a6814253a38248cacced10ba5b2e3b1d
d2be1254f5e7eb97bd42ec342d9cb11dd__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeLachnospiraceae_genus_no_rankLachnospiraceae_genus_no_rank_species_no_rank_OTU_d2be1254f5e7eb97bd42ec342d9cb11d
45ec8d8b3f3f833abb943a394602ca67d__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeKineothrix Kineothrix sp000403275
2ce88e05f1919882b01289628bd3d156d__BacteriaFirmicutes_AClostridia_258483LachnospiralesLachnospiraceaeLachnospiraceae_genus_no_rankLachnospiraceae_genus_no_rank_species_no_rank_OTU_2ce88e05f1919882b01289628bd3d156

3.2.4 Normalization Issue#

We must first normalize our samples because sampling depth variations among samples can affect distance/dissimilarity measures. In this tutorial, we will focus on a common ways applied in 16S analysis: Rarefying (Or you can call it as subsampling to a given sampling depth).

Rarefying is the process of subsampling reads without replacement to a certain sampling depth. To have a basic understanding of the subsampling used in R. Let’s simplify the case into draw some letters from two boxes:

box1 <- c("A","A","A","B","B","B","B","C","C","D") # Set box1
box2 <- c("A","A","B","B","B","C","C") # Set box2

We will draw 5 letters from each of the boxes Without replacement. Without replacement means: once an letter is drawn it cannot be drawn again.

set.seed(2024) # set random seed to make sure the result is reproducible
sort(sample(box1, size = 5, replace = FALSE))
  1. 'A'
  2. 'A'
  3. 'B'
  4. 'B'
  5. 'C'
sort(sample(box2, size = 5, replace = FALSE))
  1. 'A'
  2. 'A'
  3. 'B'
  4. 'B'
  5. 'C'

The box1 and box2 have identical result when we randomly draw 5 letters. The result will change if you choose a different seed. However, if you draw many times, the probability of whether a spcific letter will be drawn will be stable. Let’s try:

draw_result = data.frame() # We will use dataframe to store the draw result
for (i in 1:1000){
   set.seed(i) 
   tmp = sort(sample(box1, size = 5, replace = FALSE))
   draw_result = rbind(draw_result, tmp)
}
head(draw_result)
A data.frame: 6 × 5
X.A.X.A..1X.B.X.B..1X.C.
<chr><chr><chr><chr><chr>
1AABBC
2ABBCD
3AABBB
4ABBCC
5AAABC
6ABBCC
dim(draw_result)[1]*dim(draw_result)[2]
5000
sum(draw_result == "A")/(dim(draw_result)[1]*dim(draw_result)[2])
0.3106
sum(box1=="A")/length(box1)
0.3

Quite close to actual probability, right? This is an example of the power of computer simulation.

Warning

Some have argued that rarefying or rarefaction is “inadmissible” because it omits valid data (for example, you may see the missing of letter “D” in the box1 when you draw). Therefore, be aware of this situation when you are interesting in taxa with small amount. Other situations that may induced by rarefying.

  1. The introduction of artificial uncertainty in the sub-sampling step,

  2. The arbitrary selection of the minimum library size,

  3. Challenges in estimating over-dispersion parameter.

Alternatives to rarefying: normalizing transformations (decostand in vegan) possible with variance stabilization (decostand and dispweight in vegan) and methods that are not sensitive to sample sizes

Keep the controversy in mind. Let’s go ahead.

Now, let’s show how can we do rarefy on our microbiome data:

library(vegan) # load vegan package
Loading required package: permute
Loading required package: lattice
This is vegan 2.6-8

Vegan accept the transposed taxa table (row as sample, column as taxa), so lets transpose our taxa table first

transposed_taxa_table <- t(taxa_table)

raremax <- 25000 # get the minimum sampling

vegan::rarecurve(transposed_taxa_table, step = 10, sample = raremax, col = "lightgreen", cex = 0.6)

# Step size for sample sizes in rarefaction curves.
Warning message in vegan::rarecurve(transposed_taxa_table, step = 10, sample = raremax, :
“most observed count data have counts 1, but smallest count is 2”
_images/421659dc296199d8af01ac4de1fe5e3481c98f86ee37cab47b2a6d706bca8608.png

As you can see in the Figure above, the alpha diversity (species richness in this case) did not change significantly when the sequencing depth larger than 25000 (i.e. reaches a plateau). We consider the sequencing depth is sufficient to cover what we need. As we want to retain as many as the data we produced when trying to reduce the sampling bias, we will choose minimum sampling depth as the even sampling depth used accross the sample. We will imply the rarefy_even_depth function used in phyloseq package for this purpose.

3.2.5 Phyloseq Object#

Phyloseq is a widely used R object to store the 16s data as well as other metagenomics data. You will find 3 main objects embedded in the phyloseq object in this tutorial:

  1. otu_table: similar to taxa_table

  2. tax_table: similar to taxonomic_assign_table

  3. sample_data: similar to sample_meta

head(otu_table(taxa_table, taxa_are_rows = T)) # convert the taxa_table into otu_table object
A otu_table: 6 × 12 of type int
NEEN.011NEEN.012NEEN.013NEEN.014NEEN.015NEEN.016NEEN.021NEEN.022NEEN.023NEEN.024NEEN.025NEEN.026
3af64985583451cceeb3901d043bfd73 019 0 0 0 0 0 0 0 0 0 0
a57944ffff8ec2097b06e45a616f907d 0 0 0 0 0 0 0 0 0 0 0 0
a69c0a328e1b282389062478c5d49c3a 68829645 27 24 27453417 9 6
7d02f99e2ea00512b055a4e70f84c18b1116429407685132606625211415
54a43d2113c13203e813d22b17f3f370 47555025 33 56 842916181622
e127450ed2ad1088a651c6a3f8a405b3 0 5 0 3 0 0 0 0 0 0 6 0
head(tax_table(as.matrix(taxonomic_assign_table))) # convert the taxonomic_assign_table into tax_table object
A taxonomyTable: 6 × 7 of type chr
KingdomPhylumClassOrderFamilyGenusSpecies
3af64985583451cceeb3901d043bfd73d__BacteriaBacteroidotaBacteroidiaBacteroidalesMuribaculaceaeUBA7173 UBA7173 sp002491305
a57944ffff8ec2097b06e45a616f907dd__BacteriaBacteroidotaBacteroidiaBacteroidalesMuribaculaceaeSodaliphilus Sodaliphilus pleomorphus
a69c0a328e1b282389062478c5d49c3ad__BacteriaBacteroidotaBacteroidiaBacteroidalesMuribaculaceaeCAG-485 CAG-485 sp900547755
7d02f99e2ea00512b055a4e70f84c18bd__BacteriaBacteroidotaBacteroidiaBacteroidalesMarinifilaceaeOdoribacter_865974Odoribacter laneus
54a43d2113c13203e813d22b17f3f370d__BacteriaBacteroidotaBacteroidiaBacteroidalesMarinifilaceaeOdoribacter_865974Odoribacter splanchnicus
e127450ed2ad1088a651c6a3f8a405b3d__BacteriaBacteroidotaBacteroidiaBacteroidalesMarinifilaceaeOdoribacter_865974Odoribacter splanchnicus
head(sample_data(sample_meta))
A sample_data: 6 × 4
referenceDietDescriptionGroup
<int><chr><chr><chr>
NEEN.01111ALIOS dietGroup1CTRL
NEEN.01212ALIOS dietGroup1CTRL
NEEN.01313ALIOS dietGroup1CTRL
NEEN.01414ALIOS dietGroup1CTRL
NEEN.01515ALIOS dietGroup1CTRL
NEEN.01616ALIOS dietGroup1CTRL

Then Merge them into a phyloseq object

physeq <- phyloseq(otu_table(taxa_table, taxa_are_rows = T), 
                   tax_table(as.matrix(taxonomic_assign_table)),
                   sample_data(sample_meta) )
physeq # You can see a very nice summary of the current data
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 784 taxa and 12 samples ]
sample_data() Sample Data:       [ 12 samples by 4 sample variables ]
tax_table()   Taxonomy Table:    [ 784 taxa by 7 taxonomic ranks ]

phyloseq package provide many useful wrapper function to analysis the phyloseq object, like:

print(sample_sums(physeq)) # Quite familiar, right? this function output the total taxa abundance for each sample
NEEN.011 NEEN.012 NEEN.013 NEEN.014 NEEN.015 NEEN.016 NEEN.021 NEEN.022 
   50047    49734    52780    47521    56489    56181    51627    51627 
NEEN.023 NEEN.024 NEEN.025 NEEN.026 
   52991    42890    48576    51663 
sample_names(physeq)
  1. 'NEEN.011'
  2. 'NEEN.012'
  3. 'NEEN.013'
  4. 'NEEN.014'
  5. 'NEEN.015'
  6. 'NEEN.016'
  7. 'NEEN.021'
  8. 'NEEN.022'
  9. 'NEEN.023'
  10. 'NEEN.024'
  11. 'NEEN.025'
  12. 'NEEN.026'
rank_names(physeq)
  1. 'Kingdom'
  2. 'Phylum'
  3. 'Class'
  4. 'Order'
  5. 'Family'
  6. 'Genus'
  7. 'Species'
sample_variables(physeq)
  1. 'reference'
  2. 'Diet'
  3. 'Description'
  4. 'Group'

Keep only taxa with KNOWN Phylum

physeq <- subset_taxa(physeq, Phylum != "d__Bacteria_phylum_no_rank") # Phylum is one of the column from tax_table(). 
physeq
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 782 taxa and 12 samples ]
sample_data() Sample Data:       [ 12 samples by 4 sample variables ]
tax_table()   Taxonomy Table:    [ 782 taxa by 7 taxonomic ranks ]
physeq.rarefy = rarefy_even_depth(physeq, 
                                  replace = F, 
                                  sample.size = min(sample_sums(physeq)) , # sample_sums() function from ph
                                  rngseed = 20230927) # random seed, so that the result is reproducible
print(sample_sums(physeq.rarefy))                              
`set.seed(20230927)` was used to initialize repeatable random subsampling.
Please record this for your records so others can reproduce.
Try `set.seed(20230927); .Random.seed` for the full vector
...
186OTUs were removed because they are no longer 
present in any sample after random subsampling
...
NEEN.011 NEEN.012 NEEN.013 NEEN.014 NEEN.015 NEEN.016 NEEN.021 NEEN.022 
   42877    42877    42877    42877    42877    42877    42877    42877 
NEEN.023 NEEN.024 NEEN.025 NEEN.026 
   42877    42877    42877    42877 

3.2.5 Calculate the Taxa-based Diversity#

Each of the Taxa represent a microbial sequence, although some of them hasn’t been fully assigned with the taxonomic ranks. This is possibly because the unassigned taxa are new species that the database hasn’t fully curated.

Phyloseq help us to the alpha diversity analysis in a easy way.

plot_richness(physeq.rarefy, measures=c("Observed", "Shannon")) # measures could be used to determine which alpha diversity you want to include
_images/718704e52cb3ac36c344a997ba4fb900ae5c01c2286eac3336630d6a2480058c.png
plot_richness(physeq.rarefy, x="Group", color="Group", measures=c("Observed", "Shannon")) # we set x as "Group", "Group" is the information in sample_data() object 
_images/f90b8daf823790d7d48d59b459be8d08d26c788cc3dd53295271a014550be7c8.png

We could add some extra ggplot2 code to modify the output

library(ggplot2)
plot_richness(physeq.rarefy, x="Group", color="Group", measures=c("Observed", "Shannon")) + 
    geom_boxplot(aes(color=Group)) # geom_* --- is a common syntax used in ggplot2
_images/e93bb9b5f694fc5d38954a9d15b08cd6d18b90d7ef3e5cbce7fd44346cd0f18d.png

We could probably see that EcN have generally higher observed species (species richness) than CTRL group. You can also observe the Taxonomic Composition by barplot.

plot_bar(physeq.rarefy, fill="Phylum")
_images/3b19fcbbbe8361eb56d6ee4c4b4deb22ad46613a9cd19c6ebf4226051c35e60c.png

The bar plot looks quite ugly, right? Make the bar plot nicer by removing the taxa boundaries. The plot function in phyloseq actually code based on ggplot2, similarly, we can add some ggplot2 code to modify it.

library(ggplot2)
plot_bar(physeq.rarefy, fill = "Phylum") + 
  geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack") 
_images/915d4d83e7df1642d9315f41d82257ce3bf552c21c674713af4b2baa90e586fa.png
library(ggplot2)
plot_bar(physeq.rarefy, x = "Group", fill = "Phylum") + 
  geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack")  
  # stat = "identity": performs no extra computation on the data
  # position = "stack": automatically stack values in reverse order of the group aesthetic, which for bar charts is usually defined by the fill aesthetic
_images/d9800a45e0c724416d7790a5c521ed1aa3c58041e7a426b52cae7ab2c929e7e4.png

We will then do multivariate analysis based on Bray-Curtis distance and Non-metric Multidimensional Scaling (NMDS) ordination.

physeq.ord <- ordinate(physeq.rarefy, "NMDS", "bray")
Square root transformation
Wisconsin double standardization
Run 0 stress 0.08689726 
Run 1 stress 0.08689727 
... Procrustes: rmse 4.48701e-05  max resid 0.000109021 
... Similar to previous best
Run 2 stress 0.08689726 
... Procrustes: rmse 1.847964e-05  max resid 4.123059e-05 
... Similar to previous best
Run 3 stress 0.08689726 
... New best solution
... Procrustes: rmse 6.799869e-06  max resid 1.058429e-05 
... Similar to previous best
Run 4 stress 0.08689727 
... Procrustes: rmse 2.327513e-05  max resid 5.535294e-05 
... Similar to previous best
Run 5 stress 0.1213845 
Run 6 stress 0.1213846 
Run 7 stress 0.3244416 
Run 8 stress 0.08689726 
... Procrustes: rmse 5.540047e-06  max resid 8.579548e-06 
... Similar to previous best
Run 9 stress 0.08689727 
... Procrustes: rmse 1.625888e-05  max resid 3.099486e-05 
... Similar to previous best
Run 10 stress 0.08689726 
... Procrustes: rmse 7.135722e-07  max resid 1.273998e-06 
... Similar to previous best
Run 11 stress 0.1213845 
Run 12 stress 0.1213845 
Run 13 stress 0.08689726 
... Procrustes: rmse 3.255696e-06  max resid 7.64139e-06 
... Similar to previous best
Run 14 stress 0.08689726 
... Procrustes: rmse 1.430007e-06  max resid 2.54728e-06 
... Similar to previous best
Run 15 stress 0.08689726 
... Procrustes: rmse 1.181552e-05  max resid 2.89592e-05 
... Similar to previous best
Run 16 stress 0.08689726 
... New best solution
... Procrustes: rmse 2.873369e-06  max resid 6.851487e-06 
... Similar to previous best
Run 17 stress 0.08689726 
... New best solution
... Procrustes: rmse 3.405353e-06  max resid 8.257777e-06 
... Similar to previous best
Run 18 stress 0.1213845 
Run 19 stress 0.347306 
Run 20 stress 0.08689726 
... Procrustes: rmse 9.14899e-06  max resid 2.112381e-05 
... Similar to previous best
*** Best solution repeated 2 times
plot_ordination(physeq.rarefy, physeq.ord, type="samples", color="Group", shape= "Group", 
                  title="samples ordination") + geom_point(size=3)
_images/230b13746dc2543ab8757f5a98ed94cfb4adf15cc6d428c2c62db13d5093dbd1.png

Note

You can also extract the data that we used to plot this picture by

  1. first assign the result to a object plot <- plot_ordination(...)

  2. Then the data could be found using plot$data

Based on the ordination results, we can observe a clear separation between the samples in CTRL and EcN, especially in NMDS1. Remember in Section 3.1, we have discussed that the ordination techniques could condense the main variation between samples in a reduced dimension (2 in our case, i.e. NMDS1, NMDS2). The difference in NMDS1 indicating a noticable difference in the microbial composition. To gain statistical support, we will also apply PERMOANOVA to the dataset to test the significance based on the bray-curtis distance matrix.

PERmutational Multivariate ANalysis of VAriance (PERMANOVA) is a permutation-based technique – it makes no distributional assumptions about multivariate normality or homogeneity of variances. In essence, similar to ANOVA, the PERMANOVA is designed for compares the variation between groups to the variation within groups.

To do so, we need to extract the otu_table and sample data from the phyloseq object.

adonis_taxa <- t(otu_table(physeq.rarefy))
adonis_sample_data <- data.frame(sample_data(physeq.rarefy))
vegan::adonis2( formula = adonis_taxa ~ Group, data = adonis_sample_data, method = "bray") 
# In this case, the left hand side of formula is the abundance matrix, while the right hand side is the variable that you want to associated with
A anova.cca: 3 × 5
DfSumOfSqsR2FPr(>F)
<dbl><dbl><dbl><dbl><dbl>
Model 10.28306080.27955373.8802840.002
Residual100.72948480.7204463 NA NA
Total111.01254571.0000000 NA NA

Based on the result from PERMANOVA, we can confidently draw the conclusion that EcN has significantly alter the bacteria community.

3.2.6 Differential Abundant Analysis#

Indeed, our E.Coli Nissle can significantly change the microbial community. But how to gain more specific insights about which bacteria increased or decreased by our treatment? There are different tools including LEfSe, ANCOMBC, ANCOMBC2, MaAsLin2 and DESeq2 etc. Every tools rely on different assumptions about the data. In this tutorial, to keep consistent with the NEEN project, we will apply ANCOMBC to the differential abundant analysis.

# ========= Install ==========

# if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
 
# BiocManager::install("ANCOMBC", version = "2.6.0")

library("ANCOMBC")
packageVersion("ANCOMBC")  
[1] ‘2.6.0’

Note

The R version used in this tutorial is 4.4.

physeq
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 782 taxa and 12 samples ]
sample_data() Sample Data:       [ 12 samples by 4 sample variables ]
tax_table()   Taxonomy Table:    [ 782 taxa by 7 taxonomic ranks ]
# ANCOMBC designed the algorithm to fix the sampling depth bias, therefore, we could directly use the un-rarefied physeq

ancombc_result = ancombc(phyloseq = physeq,  
                tax_level = "Genus", 
                formula = "Group",   # formula can be used to add more cofounder, see ?ANCOMBC
                p_adj_method = "BH", # p value adjustment method
                prv_cut = 0.10,      # Prevalence cutoff
                lib_cut = 1000,      
                group = "Group",     # The Group variable
                struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5, 
                max_iter = 100, conserve = TRUE, alpha = 0.05, global = TRUE,
                n_cl = 1, verbose = TRUE)
'ancombc' has been fully evolved to 'ancombc2'. 
Explore the enhanced capabilities of our refined method!
Warning message:
“The group variable has < 3 categories 
The multi-group comparisons (global/pairwise/dunnet/trend) will be deactivated”
Obtaining initial estimates ...
Estimating sample-specific biases ...
Loading required package: foreach
Loading required package: rngtools
ANCOM-BC primary results ...
Merge the information of structural zeros ... 
Note that taxa with structural zeros will have 0 p/q-values and SEs
names(table(sample_data(physeq)[,"Group"]))[1]
'CTRL'
# We need to get the p value and lfc used for the group
level1 = names(table(sample_data(physeq)[,"Group"]))[1] # In this case, CTRL
level2 = names(table(sample_data(physeq)[,"Group"]))[2] # In this case, EcN
  
# typically, the column name would be varname + level2
column_name = paste0("Group", level2)
  
# Extract differentially abundant taxa 
# The fold change defined in ANCOMBC is level2/level1

diff_abundant_taxa_lfc <- ancombc_result$res$lfc 
diff_abundant_taxa_lfc$FC <-  exp(ancombc_result$res$lfc[, column_name]) # this is because the ANCOMBC employ a natural log transformation
  

diff_abundant_taxa_p <- ancombc_result$res$p_val
diff_abundant_taxa_p$P <-  ancombc_result$res$p_val[, column_name]
  

then, we will introduce a plot commonly used in the field, volcano plot.

# taxa_species_relation <- tax_tab
taxa_p <- diff_abundant_taxa_p[,c("taxon","P")]
taxa_fc <- diff_abundant_taxa_lfc[,c("taxon","FC")]
  
merged_res = merge(taxa_p,taxa_fc )

# add a column of NAs
merged_res$Change <- "Nosig"
# if log2Foldchange > 1 and pvalue < 0.05, set as "UP" 
merged_res$Change[log2(merged_res$FC) > 1 & merged_res$P < 0.05] <- "UP"
# if log2Foldchange < -0.6 and pvalue < 0.05, set as "DOWN"
merged_res$Change[log2(merged_res$FC) < -1 & merged_res$P < 0.05] <- "DOWN"

# Add a column for further Text label
merged_res$label <- merged_res$taxon
merged_res$label[merged_res$Change == "Nosig"] <- "" # Therefore, we do not want the label for not significant species

# Make sure you install package "DT"
# Display
DT::datatable(merged_res, caption = "P value and Fold Changes from the Primary Result") 
head(merged_res)
A data.frame: 6 × 5
taxonPFCChangelabel
<chr><dbl[,1]><dbl><chr><chr>
114-2 0.32943462260.7636420Nosig
21XD42-69 0.26751927971.5627751Nosig
3Acetatifactor 0.00805076282.7043133UP Acetatifactor
4Acetitomaculum 0.35326896760.7583377Nosig
5Acutalibacter 0.00023879970.6594520Nosig
6Acutalibacteraceae_genus_no_rank0.43767460161.6212742Nosig
write.csv(merged_res, file = "../document/case7/final_res.csv")
library("ggplot2")
ggplot(data=merged_res, aes(x=log2(FC), y= -log10(P + 0.000001), color = Change)) +  # + 0.000001 was set to remove p value = 0
   geom_point() +  
   geom_text(aes(label = label))  + # add taxa information
   geom_vline(xintercept=c(-1, 1), col="red") + 
   geom_hline(yintercept=-log10(0.05), col="red") +
   scale_x_continuous(limits = c(-5,5)) + # set the parameter for x axis 
   scale_y_continuous(limits = c(0,6))    # set the parameter for y axis 
_images/2cbb4ca42291cd0f2c90042b900f73712300e6eace9215a6407ca64a44874f9b.png
# You may find the graph quite ugly, try this instead

library("ggplot2")
library("ggrepel")
ggplot(data=merged_res, aes(x=log2(FC), y= -log10(P + 0.000001), color = Change)) +  # + 0.000001 was set to remove p value = 0
   geom_point() +  
   # geom_text(aes(label = label))  + 
   geom_text_repel(aes(label = label),max.overlaps = 200) + # add taxa information with flexible position
   geom_vline(xintercept=c(-1, 1), col="red") + 
   geom_hline(yintercept=-log10(0.05), col="red") +
   scale_x_continuous(limits = c(-5,5)) + # set the parameter for x axis 
   scale_y_continuous(limits = c(0,7))    # set the parameter for y axis 
_images/4d977fe2cdcad97272550a11bc3e7a6925f0a4b6b2b91fb9124493fb36f3a130.png

ggplot2 is a powerful tool to learn for visualization!

3.2.7 Correlation Analysis#

Normally, Correlation and Association are two terms can be used interchangeably. But they have some tiny but important difference that worth to note when interpret the results.

  • Association: A more vague term used to describe a relationship (linear or non-linear) between two variables.

  • Correlation: A certain type of (usually linear) association between two variables.

What is a linear relationship?

A linear relationship (or linear association) is a statistical term used to describe a straight-line relationship between two variables. In math, we can write it as y = mx + b. where: m=slope, b = intercept ​

CorrelationAndAssociaiton.jpg

Naomi Altman, 2015

Basically, we have two main types of data: One is categorical (like girl and boy; or disease and healthy), another one is continuous data (like the height or the proportion of fatty liver). The differential abundant analysis that we have discussed before is actually a kind of association analysis, which mainly focus on the association between continous (the abundance of certain bacteria) and categorical data.

In this section, we will focus on the correlation analysis which is normally used to describe the relationship between two continous data, such as the linear relationship between the height and the weight for each person.

For both association/correlation, we can describe it as either positive or negative association.

  • Negative: As one variable increases, the other will increase (move in the same direction)

  • Positive: As one variable increases, the other will decrease (move in the oppositive direction)

For correlation, we can use some statistics, i.e. correlation coefficient, to describe the STRENGTH of correlation. The range of correlation coefficient is [-1,1]. Where the -1 indicates the strongest negative correlation, 1 indicates the strongest positive correlation.

Different Statistics rely on different assumptions, Here is the assumptions for two commonly used:

  • Pearson’s r:

    • Data from both variables follow normal distributions

    • You expect a linear relationship between the two variables.

    • Paired observations

  • Spearman’s rho:

    • Not have any assumptions on the distribution, will use the rank

    • There is a monotonic relationship between the two variables.

    • Paired observations

Let’s try with the metadata provided by Dr. Johnson Lok for the NEEN data, in which, we measure the short-chain fatty acid level from either plasma and face and the body weight , etc after we sacrifice the mice.

meta_endpoints_data <- read.csv("../document/case9/Meta_endpoint_table.tsv",sep = "\t",
row.names = 1 # set the first column as the rownames index
)
head(meta_endpoints_data)
A data.frame: 6 × 24
NEEN.011NEEN.012NEEN.013NEEN.014NEEN.015NEEN.016NEEN.021NEEN.022NEEN.023NEEN.024NEEN.033NEEN.034NEEN.035NEEN.036NEEN.051NEEN.052NEEN.053NEEN.054NEEN.055NEEN.056
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
Body weight (g) 30.200000003.110000e+013.430000e+01 36.000000 43.90000000 43.00000000 34.900000 29.000000 32.90000000 37.700000003.610000e+01 32.000000 36.70000 32.500000 32.90000 31.300000 32.4000003.080000e+013.460000e+013.480000e+01
Body weight loss (%) -0.666666673.715170e+003.107345e+00 7.216495 -3.78250591 -7.50000000 4.644809 -2.836879 -0.30487805 -3.005464483.989362e+00 5.325444 4.92228 6.876791 5.45977 9.537572 5.2631586.097561e+006.738544e+004.918033e+00
Body weight loss (g) -0.200000001.200000e+001.100000e+00 2.800000 -1.60000000 -3.00000000 1.700000 -0.800000 -0.10000000 -1.100000001.500000e+00 1.800000 1.90000 2.400000 1.90000 3.300000 1.8000002.000000e+002.500000e+001.800000e+00
Liver weight (mg)1700.000000001.480000e+031.650000e+031780.0000002443.000000002197.000000001740.0000001360.0000001660.000000002010.000000001.850000e+031690.0000001850.000001620.0000001660.000001380.0000001490.0000001.740000e+031.860000e+031.700000e+03
MRI fat index 0.038152416.350756e-023.607113e-02 NA 0.12080000 0.14480000 NA NA 0.06871337 0.097883894.194388e-02 NA NA NA NA NA NA3.169522e-023.825281e-023.357185e-02
Reduction of MRI fat index 0.028847594.924434e-048.692887e-02 NA 0.07266106 0.04178158 NA NA 0.18298663 0.172716111.250561e-01 NA NA NA NA NA NA9.380478e-027.724719e-025.072815e-02
# regular check
str(as.data.frame(t(meta_endpoints_data)))
'data.frame':	24 obs. of  34 variables:
 $ Body weight (g)               : num  30.2 31.1 34.3 36 43.9 43 34.9 29 32.9 37.7 ...
 $ Body weight loss (%)          : num  -0.667 3.715 3.107 7.216 -3.783 ...
 $ Body weight loss (g)          : num  -0.2 1.2 1.1 2.8 -1.6 ...
 $ Liver weight (mg)             : num  1700 1480 1650 1780 2443 ...
 $ MRI fat index                 : num  0.0382 0.0635 0.0361 NA 0.1208 ...
 $ Reduction of MRI fat index    : num  0.028848 0.000492 0.086929 NA 0.072661 ...
 $ Reduction of MRI fat index (%): num  43.056 0.769 70.674 NA 37.558 ...
 $ Liver fat droplets (%)        : num  0.484 0.513 1.365 5.645 15.832 ...
 $ Plasma_Formic acid            : num  170 144 169 127 174 ...
 $ Plasma_Acetic acid            : num  63.9 101.3 56.3 66.5 100.1 ...
 $ Plasma_Propionic acid         : num  6.1 5.94 4.29 4.25 2.35 ...
 $ Plasma_Butyric acid           : num  NA 0.262 0.112 0.118 NA ...
 $ Plasma_Iso-butyric acid       : num  3.82 1.96 5.86 2.83 2.47 ...
 $ Plasma_Succinic acid          : num  15.06 14.23 19.65 6.78 7.29 ...
 $ Plasma_Valeric acid           : num  0.272 0.206 0.336 0.222 0.158 ...
 $ Plasma_Iso-valeric acid       : num  0.0434 0.0929 0.0973 0.0658 0.1163 ...
 $ Plasma_Caproic acid           : num  0.951 0.44 1.046 0.85 0.512 ...
 $ Cecum_Formic acid             : num  NA NA 0.0281 0.0137 NA ...
 $ Cecum_Acetic acid             : num  2.62 4.93 5.43 2.36 5.28 ...
 $ Cecum_Propionic acid          : num  0.484 0.949 1.36 0.69 1.268 ...
 $ Cecum_Butyric acid            : num  0.262 1.518 0.696 0.229 3.548 ...
 $ Cecum_Iso-butyric acid        : num  0.0368 0.0352 0.0591 0.0311 0.0573 ...
 $ Cecum_Succinic acid           : num  0.0448 0.0474 0.0605 0.0282 0.4963 ...
 $ Cecum_Valeric acid            : num  0.0327 0.0533 0.0898 0.0278 0.076 ...
 $ Cecum_Iso-valeric acid        : num  0.00582 0.00648 0.00994 0.00387 0.01563 ...
 $ Cecum_Caproic acid            : num  0.000832 0.0012 0.001087 0.000882 0.000726 ...
 $ Plasma_ALT                    : num  38.9 43.5 45.8 60.8 40.5 ...
 $ Plasma_AST                    : num  302 208 218 157 188 ...
 $ Plasma_Glucose                : num  11.7 10.3 11.5 11 14.5 ...
 $ Plasma_HDL cholesterol        : num  2.81 4.21 3.32 4.5 4.35 ...
 $ Plasma_Total cholesterol      : num  2.75 3.49 3.32 4.41 4.08 ...
 $ Plasma_LDL cholesterol        : num  0.345 0.465 0.402 0.6 0.466 0.387 0.351 0.259 0.617 0.466 ...
 $ Plasma_Triglycerides          : num  0.695 0.853 0.749 0.728 0.721 ...
 $ Log10 CFU/g feces             : num  NA NA NA NA NA ...

Great! All of the variables that we have is in a numeric variable types ! But we should also notice that we have several “NA” present, meaning that we do not have the data. Then, we will use the built-in function cor.test() for correlation analysis. Try ?cor.test()

The assumptions for correlation analysis is paired observations. Therefore, in case there is an “NA” in one variable, another variable from the same sample will be removed. cor.test() is smart enough to remove the unwanted samples by itself, but it is better to remove the sample with “NA” before we do the analysis.

na_index <- is.na(as.numeric(meta_endpoints_data["Body weight (g)", ])) | is.na(as.numeric(meta_endpoints_data["Liver fat droplets (%)", ]))
no_na_index <- !na_index
sum(na_index)
0

It is a great news, we have all the data we want for body weight and liver fat droplets.

cor.test(x = as.numeric(meta_endpoints_data["Body weight (g)", ])[no_na_index],              #input the first data vector, use as.numeric to convert one row dataframe to vector
         y = as.numeric(meta_endpoints_data["Liver fat droplets (%)", ])[no_na_index],       #input the second data vector,
         method = "spearman" )                                                   
Warning message in cor.test.default(x = as.numeric(meta_endpoints_data["Body weight (g)", :
“Cannot compute exact p-value with ties”
	Spearman's rank correlation rho

data:  as.numeric(meta_endpoints_data["Body weight (g)", ])[no_na_index] and as.numeric(meta_endpoints_data["Liver fat droplets (%)", ])[no_na_index]
S = 781.67, p-value = 0.0004477
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6601435 
# store it in an R object
cor.res <- cor.test(x = as.numeric(meta_endpoints_data["Body weight (g)", ]),         #input the first data vector, use as.numeric to convert one row dataframe to vector
         y = as.numeric(meta_endpoints_data["Liver fat droplets (%)", ]),        #input the second data vector,
         method = "spearman" )
Warning message in cor.test.default(x = as.numeric(meta_endpoints_data["Body weight (g)", :
“Cannot compute exact p-value with ties”
# The strength of the correlation
cor.res$estimate
rho: 0.660143525064217
# Whether the strength of the correlation is by chance in statistical sense
cor.res$p.value
0.000447662958586588

The body weight and proportion of liver fat droplets shows a strong positive correlation (Spearman’s rho = 0.8637, p < 0.001). Let’s visualize their relationship use ggplot2 to have a scatter plot with a fitted straight line.

# Prepare for the format that ggplot2 required
data_for_plot <- data.frame(t(meta_endpoints_data),check.names = T) # check.names will fix the name that is not proper for most of the R scripts
head(data_for_plot)
A data.frame: 6 × 34
Body.weight..g.Body.weight.loss....Body.weight.loss..g.Liver.weight..mg.MRI.fat.indexReduction.of.MRI.fat.indexReduction.of.MRI.fat.index....Liver.fat.droplets....Plasma_Formic.acidPlasma_Acetic.acidCecum_Iso.valeric.acidCecum_Caproic.acidPlasma_ALTPlasma_ASTPlasma_GlucosePlasma_HDL.cholesterolPlasma_Total.cholesterolPlasma_LDL.cholesterolPlasma_TriglyceridesLog10.CFU.g.feces
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
NEEN.01130.2-0.6666667-0.217000.038152410.028847588643.0561023 0.4838043170.4957 63.915830.0058167310.000832475638.878302.20511.7342.8092.7520.3450.695NA
NEEN.01231.1 3.7151703 1.214800.063507560.0004924434 0.7694427 0.5133340144.2435101.284240.0064800600.001200156043.488208.40010.3424.2123.4870.4650.853NA
NEEN.01334.3 3.1073446 1.116500.036071130.086928867070.6738756 1.3647282169.0280 56.314840.0099364060.001086936845.822217.83511.4663.3223.3170.4020.749NA
NEEN.01436.0 7.2164948 2.81780 NA NA NA 5.6450791126.8910 66.488680.0038654870.000881745660.835156.90711.0194.4964.4070.6000.728NA
NEEN.01543.9-3.7825059-1.624430.120800000.072661064337.558495115.8317157174.1511100.054320.0156304140.000725541740.522187.52314.4774.3534.0830.4660.721NA
NEEN.01643.0-7.5000000-3.021970.144800000.041781581222.393197116.0552065172.1734 92.837440.0136273260.001730498866.884192.10513.3684.0433.6520.3870.407NA
library(ggplot2)
# draw
ggplot(data_for_plot, aes(x = Body.weight..g., y = Liver.fat.droplets.... ) ) + # set the base
  geom_point(size = 1) +  # add  point
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth")  
_images/ae48eb805b37a3c0be75efdb9a420d4d5ecfdb1df916fea1433cd3e6075ace13.png

Not that good, as most of the mice have a small variation of liver fat droplets. As spearman’s correlation actually calculate the correlation based on the rank, let’s try another way to visualize:

library(ggplot2)
# draw
ggplot(data_for_plot, aes(x = rank(Body.weight..g.), y = rank(Liver.fat.droplets.... )) ) + # set the base
  geom_point(size = 1) +  # add  point
  stat_smooth(method = "lm", 
              formula = y ~ x, 
              geom = "smooth")  
_images/d5af2c6871d747bc0a8eb57d8e1fceda20eb888cdee103b5dc19dc2e66e4c5a0.png

Great! You can see that the body weight and liver fat droplets do move in the same direction. And it is quite strong.

Case Exercise 9. Microbiota-wise correlation#

Previously, we have introduced how to do the correlation analysis on the metadata provided. Now, you are required to:

    1. merge the otu_table with the meta_endpoint_data to form a dataframe that is ready to do correlation (hint: ?merge()).

    1. Do correlation iteratively for each bacteria and each metadata (hint: Two for loop)

    1. Store your results (most importantly your correlation coefficient and p value) in an R object for each association analysis, either a list() or a dataframe() (hint: create a new dataframe or list, and store your results within each of the for loop)

More Reading#

You may try other wrapper R packages, like microbiomeAnalyst.

Memorize the Great Contribution Made By the Giant#

Discovery of DNA#

  • 1869: Friedrich Miescher discovered a substance he called “nuclein” (now known as DNA) in the nuclei of white blood cells.

  • 1910s: Phoebus Levene identified the components of DNA as a phosphate-sugar-base structure, proposing the “tetranucleotide hypothesis.”

  • 1944: Oswald Avery, Colin MacLeod, and Maclyn McCarty demonstrated that DNA is the molecule responsible for heredity in their transformation experiments with bacteria.

  • 1950: Erwin Chargaff discovered that DNA composition varies between species and established Chargaff’s rules, noting that adenine equals thymine and guanine equals cytosine.

  • 1952: Alfred Hershey and Martha Chase confirmed that DNA, not protein, is the genetic material using bacteriophage experiments.

  • 1953: James Watson and Francis Crick, with critical contributions from Rosalind Franklin and Maurice Wilkins, proposed the double helix structure of DNA, revolutionizing biology.

Sequencing#

  • 1977: Frederick Sanger developed the chain-termination method, known as Sanger sequencing, allowing for the rapid sequencing of DNA.

  • 1980s-1990s: Automated sequencing machines were developed, increasing the speed and efficiency of DNA sequencing.

  • 2001: The first draft of the Human Genome Project was completed using Sanger sequencing.

  • 2005: Next-generation sequencing (NGS) technologies emerged, dramatically reducing the cost and time required for sequencing.

PCR (Polymerase Chain Reaction)#

  • 1983: Kary Mullis invented PCR, a revolutionary technique allowing for the amplification of specific DNA sequences.

  • 1985: The technique was first published, rapidly gaining popularity for its simplicity and effectiveness.

  • 1990s: PCR became a fundamental tool in molecular biology, used in various applications such as cloning, diagnostics, and forensic science.

Cloning Techniques#

  • 1972: Paul Berg performed the first successful gene splicing, laying the groundwork for cloning techniques.

  • 1973: Herbert Boyer and Stanley Cohen developed a method for cloning DNA in bacteria using recombinant DNA technology.

  • 1996: The first mammal, Dolly the sheep, was cloned from an adult somatic cell using somatic cell nuclear transfer (SCNT).

Common Technique Used in Molecular Biology#

Agarose Gel Electrophoresis: Arne Tiselius, a Swedish scientist who published his first study on electrophoresis. This technique separates DNA fragments based on size. By applying an electric current to a gel matrix, researchers could visualize and analyze DNA samples, allowing for the assessment of fragment length and purity.

Western blotting: A western blot experiment, or western blotting (also called immunoblotting, because an antibody is used to specifically detect its antigen) was introduced by Towbin, et al. in 1979 and is now a routine technique for protein analysis.

Restriction Enzyme Analysis: Restriction enzymes cut DNA at specific sequences, enabling scientists to analyze DNA fragments generated by these cuts. This method was crucial for mapping genomes and studying genetic variation.

Southern Blotting: Developed by Edwin Southern in 1975, this technique combines gel electrophoresis and hybridization. It allows for the detection of specific DNA sequences within a complex mixture, using labeled probes to identify complementary sequences.

Polymerase Chain Reaction (PCR): Invented in the 1980s, PCR enables the amplification of specific DNA segments, making it possible to study small amounts of DNA in detail. This technique revolutionized molecular biology and diagnostic testing.

DNA Probes and Hybridization: Researchers used labeled DNA probes to bind to complementary sequences in DNA samples, allowing for the identification of specific genes or mutations.

DNA Microarrays: Although more advanced than earlier methods, microarrays allowed for the simultaneous analysis of thousands of genes, providing insights into gene expression and regulation before the full sequencing of genomes.