Why use the 16S gene




















Further analysis of the results was performed as stated above. The data for the evaluation of the three different sequencing protocols was analyzed using only the de novo assembly approach followed by a BLAST using the NCBI database.

OTU clustering groups the reads into OTUs, which consist of representative sequences of pseudo-species, based on sequence similarity, and assigns taxonomy to them (OTU). Mapping consists of aligning reads to reference sequences based on a predefined length and similarity fraction (CLC). Trimmed reads were analyzed using the Map Reads to Reference tool with the default settings similarity fraction 0.

Afterward, the proportion of mapped reads was calculated by summing up the total read count for each species and dividing it by the total of mapped reads of the given sample.
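
The proportion calculation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the input layout (a list of species and read-count pairs for one sample) and the example species labels are all assumptions made for the example.

```python
from collections import defaultdict


def mapped_read_proportions(read_counts):
    """Return the share of mapped reads per species for one sample.

    read_counts: iterable of (species, mapped_read_count) pairs; a species
    may appear more than once (e.g. several reference sequences).
    This input layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    for species, count in read_counts:
        # Sum the total read count for each species.
        totals[species] += count

    total_mapped = sum(totals.values())
    if total_mapped == 0:
        return {}

    # Divide each species total by the total mapped reads of the sample.
    return {species: n / total_mapped for species, n in totals.items()}


# Hypothetical counts, not data from the study.
example = [("Species A", 600), ("Species B", 300), ("Species A", 100)]
proportions = mapped_read_proportions(example)
```

With these made-up counts, the two entries for "Species A" are pooled before the division, so the proportions of a sample always sum to one.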

During the data analysis, the species found in the negative control were considered potential contaminating species and were excluded from the clinical samples in all three approaches. A cut-off value was then determined for each method to define whether the bacterial species identified should be accounted as infection-causing pathogens e. The results of the three data analysis approaches were compared based on the number of species identified, the relative abundance (number of reads for a specific species), and the time to result.

All sequencing protocols showed similar identification results at the species level. The proportion of reads corresponding to bacteria identified at the genus level and at the species level is shown in Figure 1. The remaining non-identified reads belonged to other taxa, mainly human.

The main differences between sequencing kits were due to differences in the detection of contaminating species.

Figure 1. Proportion of reads corresponding to bacteria identified at the genus level and at the species level using three different sequencing protocols.

After removing the low-quality nucleotides by trimming, an average of 1,, reads Subsequent data analysis using the three different approaches shown in Figure 2 identified bacterial species in the negative control Supplementary Table S2 , and these species were considered to be contaminating species.

If these species were found in clinical samples, they were excluded from further analysis unless they were above the cut-off level defined for each tool (Supplementary Tables S2-S4 and Tables 1, 2). From this point on, samples in which clinically relevant bacteria were identified by either 16S Sanger sequencing or culturing are referred to as conventional positive samples, and those identified by NGS of the 16S-23S rRNA encoding region as NGS positive samples.
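
The contaminant rule described above (drop species seen in the negative control unless they exceed the tool-specific cut-off) can be sketched as below. This is an illustrative sketch under stated assumptions: the function name, the use of relative abundance as the compared quantity, and the cut-off value of 0.10 are hypothetical and are not taken from the study.

```python
def filter_contaminants(sample_abundance, control_species, cutoff):
    """Remove likely contaminants from one clinical sample.

    sample_abundance: dict mapping species -> relative abundance (0..1)
    control_species:  set of species detected in the negative control
    cutoff:           abundance above which a control species is still kept
                      (hypothetical value; the study defines one per tool)
    """
    kept = {}
    for species, abundance in sample_abundance.items():
        # A species seen in the negative control is excluded unless it is
        # above the cut-off level.
        if species in control_species and abundance <= cutoff:
            continue
        kept[species] = abundance
    return kept


# Hypothetical sample, not data from the study.
sample = {"Species A": 0.80, "Contaminant X": 0.05, "Contaminant Y": 0.15}
negative_control = {"Contaminant X", "Contaminant Y"}
filtered = filter_contaminants(sample, negative_control, cutoff=0.10)
```

Species that never appear in the negative control pass through unchanged; only control species at or below the cut-off are dropped.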

Table 1. Table 2.

The same bacteria were identified at the genus and species level in all samples, with two exceptions in samples 18 and 26. In sample 18, the contig with bp (Supplementary Table S2) was identified as Herbaspirillum sp. Likewise, in sample 26, the contig with bp (Supplementary Table S2) that was assigned as Undibacterium oligocarboniphilum in the NCBI database could not be identified in the local database. In most of the samples, the bacterial species were found with a slightly higher similarity score in the local database than in the NCBI database.

We concluded that the local database was accurate enough to identify and distinguish clinically relevant species. Therefore, the other two approaches (OTU clustering and mapping) were performed using the local database. The conventional methods identified bacterial species in 12 out of 28 samples.

Among them, two samples (samples 2 and 33) were positive only by culturing and nine samples (samples 10, 17, 18, 20, 21, 24, 25, 26, and 27) were positive only by 16S rDNA Sanger sequencing.

By comparing the primers with the 16S-23S rRNA encoding region sequences of this species, we found that the primers did not align with the target region. Therefore, these two samples were excluded from further statistical analysis. Sanger sequencing and NGS of the 16S-23S rRNA encoding region identified the same bacteria at the species level in 5 (samples 18, 20, 21, 25, 27) out of 10 NGS positive samples and at the genus level in one sample (sample 26).

This approach identified C. However, by doing further analysis, we found that in a subsequent sample taken from the same patient, C. Table 3. In sample 6, OTU clustering identified a low abundant 0. Additionally, further closely related bacterial species were identified in six of the NGS positive samples using the OTU clustering method. A total of seven samples were positive using the mapping approach (Table 2). As mentioned above, the bacteria identified in five of the samples (samples 18, 20, 21, 25, 27) coincided with the results of conventional methods at the species level and, in one sample (sample 26), at the genus level.

On the other hand, sample 3 was identified as Gemella sp. In six NGS positive samples, the same species was identified as the most abundant one by all three data analysis approaches. De novo assembly and BLAST using the local database took about 2 h for all 30 samples (including the positive and negative controls), whereas OTU clustering took around 4 h (including 1 h of hands-on time) and mapping about 6 h 30 min (including 4 h of hands-on time).

All the species present in the mock community sample were identified by both the OTU clustering and the de novo assembly and BLAST approaches, whereas mapping did not identify two of the bacterial species (Supplementary Table S5). Mapping also identified several additional species that were not present in the mock community sample (Supplementary Table S5). De novo assembly and BLASTN is the only approach of the three that works at the contig level, whereas both OTU clustering and mapping are performed at the read level.

Using the NCBI database for the latter two approaches would have produced spurious results: the NCBI database includes sequences that do not belong to the 16S-23S rRNA encoding region but that, because of the short read length, would share homology with our reads, leading to spurious OTUs and mapping results. The results show that de novo assembly with subsequent BLASTN analysis against the in-house developed database delivered results faster than the other two approaches.

Additionally, the 16S—23S rRNA encoding region NGS-based method was superior in distinguishing bacterial species and in the identification of additional species per sample, not detected by conventional methods. The initial evaluation study of the sequencing protocols demonstrated the potential use of a shorter read length sequencing kit compared to the longer ones.

Even though the number of sequencing reads generated was lower with the cycles kit than the cycles kit, it provided a similar resolution at the bacterial species identification level as the other two kits, with the advantage of being much faster. The use of a faster sequencing workflow may improve the implementation of the appropriate antimicrobial therapy by providing a faster diagnostic answer.

Therefore, this approach was chosen for the sequencing of the following samples. The data analysis of the mock community sample (Supplementary Table S5) showed that the mapping approach was much less sensitive and specific than the other two data analysis approaches. Specificity could be improved by using more stringent analysis parameters; however, this would have further reduced the sensitivity of a method that already underperformed, as two species could not be identified.

During the analysis, the main challenge was the presence of contaminating species. All species detected in the negative controls (Supplementary Tables S2-S4) have been previously described as contaminants of sequencing-based analyses stemming from DNA extraction kits and other laboratory reagents Salter et al.

These species were highly abundant in samples with low-abundance infectious microorganisms and in negative samples, whereas they were found at relatively lower abundance in true positive samples (conventional positive samples). This suggests that highly abundant contaminants might be masking low-abundance infectious microorganisms in some samples. In addition to being a common bacterium of the human skin and a contaminant from laboratory reagents or the environment Salter et al.

Instead, we defined cut-off values to distinguish contaminants introduced during sample handling from an infectious microorganism. Only C. Yet, like in our study, we would like to highlight that these results should be interpreted in light of other clinical data available. By creating an in-house developed database, we aimed to overcome the bias of data analysis introduced by using the public 16S rRNA gene databases.

On the other hand, the database should be as complete as possible to identify all relevant bacterial pathogens. For this reason, we compared the sequences present in our database with the emerging infectious diseases and pathogens in the Netherlands published by the Dutch National Institute for Public Health and the Environment (RIVM) de Gier et al. This demonstrated that our database contains the pathogens that are common in the Netherlands, whereas some species that are common in the United States are missing e.

This revealed that many species with few occurrences in our local database also had few genome assemblies available. The NCBI genome database provides completely sequenced genomes and also sequences that are incomplete, and these can be at the contig-, scaffold- or chromosome-level Kitts et al.

The disadvantage is that it is not always possible to find a 16S-23S rRNA encoding region amplicon, because the available sequence assemblies are incomplete. On the other hand, some species had many genome assemblies available, which means that more 16S-23S rRNA encoding region sequences can still be added to the database, despite the considerable number of sequences 23, entries already present.

As new species are identified, especially among anaerobes, more sequences need to be added and existing ones updated. Also, the same species might have a different number of 16S-23S rRNA encoding regions and different ITS sequences; hence, the database should be broad enough to represent different strains of the same species.

In sample 18, the contig with bp was identified as Herbaspirillum sp. In the same analysis, second and third hit matching the same contig was Massilia sp. As Herbaspirillum sp. On the other hand, the same contig was identified as Bordetella sp. This explains the difference between the two methods, since the closest reference in our database was the Bordetella sp.

Furthermore, the mapping approach also identified Bordetella species, but at a very low abundance below the cut-off value, and OTU clustering did not identify Bordetella species at all. To confirm the presence of the pathogen, one should use another methodology, e. In sample 26, an additional species Undibacterium oligocarboniphilum , which has been described as a common contaminant of DNA extraction kits and other laboratory reagents Salter et al.

Some gene sequences remain stable over the long course of evolution.

Nucleotide probes have been applied to the identification of clinical bacteria, sequence analysis, molecular classification of bacteria, and phylogenetic analysis. Ribosomal RNA (rRNA) is essential for the survival of all living things. At the same time, its conservation is relative. The sequence differs to varying degrees between bacterial families, genera and species, so 16S rRNA can be used both as a marker for bacterial classification and as a target molecule for the detection and identification of clinical pathogens.

PCR with the bacterial ribosomal 16S rRNA gene as the target molecule can detect a bacterial infection early and, through further analysis of the amplified products, identify the species of the pathogen, making up for the above deficiencies.

It is an important breakthrough in the diagnosis of infectious diseases and has become one of the principal research directions of bacteriologists at home and abroad.

Phylogenetic markers include the presence of specific protein coding or structural genes, the combinations of such genes and their variants, and insertion and repeat elements. Foremost, the functional constancy of this gene assures it is a valid molecular chronometer, which is essential for a precise assessment of phylogenetic relatedness of organisms.

It is present in all prokaryotic cells and has conserved and variable sequence regions evolving at very different rates, critical for the concurrent universal amplification and measurement of both close and distant phylogenetic relationships. These characteristics allow the use of 16S rRNA in the assignment of close relationships at the genus [8] and in many cases at the species level [19-21].

Moreover, dedicated 16S databases [22-24] that include near full-length sequences for a large number of strains and their taxonomic placements exist. The sequence from an unknown strain can be compared against these sequences. This last point is particularly relevant in an era where DNA sequencing is rapidly becoming a commodity. Tens to thousands of full-length 16S rRNA gene sequences can be generated using capillary sequencing of cloned PCR products while at least two orders of magnitude more short hypervariable regions to bp can be generated using next-generation sequencing technologies in a cost-effective way [25, 26].

While relying on non-full-length 16S rRNA gene sequence limits the taxonomic resolution and the specific hypervariable region dictates taxonomic coverage [27, 28], it is clear that recent advances in sequencing and 16S rRNA gene sequencing protocols [29] will make this molecular marker a more acceptable means for rapid identification.

Several studies evaluated the usefulness of 16S rRNA gene sequencing for clinical microbiology. Historically, slow-growing mycobacteria have been a major group of organisms for which a plethora of 16S studies exist [ 30 , 31 ]. Drancourt et al. Bosshard et al.

Spilker et al. Despite the existence of these studies, a systematic and broad evaluation of 16S rRNA gene for the identification of clinically relevant organisms is lacking. Moreover, even in the existing studies with a limited breadth of organisms, the identification is based on sequence alignment based similarity against databases with very limited diversity i.

Toward these aims, we assembled a culture isolate collection of some of the most common hospital-associated bacterial pathogens as well as endemic community-acquired and less common organisms associated with increased disease burden to determine the accuracy of clinical vs.

The results of our investigation provide insight into the strengths and limitations of molecular identification using 16S rRNA gene for microbiological identification of common bacterial pathogens. Overall, the isolates represented the most common bacterial pathogens with the exception of two Neisseria lactamica isolates cultured by the UCSF clinical microbiology lab as well as some less common species associated with severe disease burden such as Stenotrophomonas maltophilia and Burkholderia cepacia complex.

For Neisseria meningitidis. For each of the clinical identities represented in the repository, Table 1 summarizes the clinical identification method, the number of isolates, and the source of the isolate. Isolates obtained from the Clinical Microbiology Laboratory at University of California, San Francisco had undergone culture on relevant selective media, had been further sub-cultured, and had their biochemical profile tested per clinical microbiology laboratory protocols based on current Clinical and Laboratory Standards Institute guidelines to provide a final culture-based identification.

Typical temporal workflow of clinical microbiological laboratory to identify microbes from clinical samples based on phenotypic, biochemical, and culture-based techniques.

Neisseria spp. Streptococcus spp. Single colonies of each isolate were sub-cultured in liquid media for DNA extraction. The majority of species were sub-cultured in Luria-Bertani broth and grown at 37 °C and rpm for 24-48 hours, H. A total of 2 ml of liquid culture of each isolate was centrifuged and DNA extracted using a combination of bead-beating 5. Previous studies have shown that the quality of 16S sequences is essential to accurate phylogenetic placement [44] and taxonomic classification [45].

To obtain the longest feasible high-quality sequences, forward and reverse reads corresponding to each isolate were assembled using Phrap version 0. Training set. This set included , sequences and was filtered to obtain a set of 35, sequences corresponding to the 89 medically important bacterial genera listed in the most current edition of the Manual of Clinical Microbiology [47] (S1 Dataset).

All the species (pathogens and commensals) under these genera were included. S2 and S3 Datasets list the GenBank accession numbers for the sequences in the training set and the number of sequences for all the genera and species in the training set, respectively.

The assembled 16S rRNA sequences were classified to species level and the bootstrap confidences for the genus and species level classifications were estimated based on iterations.

The percent identity for a pair of sequences was calculated by dividing the number of matches by the total number of the remaining alignment positions.

Distributions (shown as violin plots) of 16S rRNA percent identity (y-axis of each figure) of pairs of training set sequences belonging to the same (gray) and different genera.
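
The percent identity (pid) calculation described earlier in this section can be sketched as follows. The text does not say here which alignment positions are discarded before counting, so this sketch assumes that "remaining" positions are the columns in which neither sequence has a gap; that rule, the gap character and the function name are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two sequences taken from the same alignment.

    Assumption: columns containing a gap in either sequence are dropped,
    and the remaining columns form the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (assumed filtering rule)
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    # Number of matches divided by the remaining alignment positions.
    return 100.0 * matches / compared


# Toy alignment: 5 ungapped columns, 4 of them identical.
pid = percent_identity("ACGT-A", "ACCTTA")
```

Under a different gap-handling rule the denominator changes, so pid values are only comparable when the same rule is applied to every pair.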

The genus Mycobacterium has been categorized as a gram-positive in the figure. Distributions shown as violin plots of 16S rRNA percent identity y-axis of each figure of pairs of training set sequences belonging to the same gray and different species for select gram-positive bacteria.

Distributions (shown as violin plots) of 16S rRNA percent identity (y-axis of each figure) of pairs of training set sequences belonging to the same (gray) and different species for select gram-negative bacteria.

Among 19 different clinical identities, two (Enterobacter cloacae complex and Streptococcus viridans) corresponded to heterogeneous groups of organisms. To make comparisons between the species-level 16S-based identification and the coarser clinical identification feasible, we expanded these two groups of organisms to species level.

The currently assigned E. Streptococcus viridans group consists of the following species [51]: 1 S. For these two clinical identities, the 16S-based identity was deemed to be concordant with the clinical identity if it matched any of the species within the respective group. These initial identities span 17 genera, including a wide range of gram-negative rods and gram-positive cocci as well as common gram-negative cocci. Though the training set included examples of genera and species matching and not matching clinical identities (S1 Dataset), we present here a subset of those relevant for comparison of clinical identities with 16S-based identification.
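
The concordance rule for group-level clinical identities described above can be sketched as below. The group membership lists are not reproduced in this text, so the group and species names in the example are placeholders, and the function name and data layout are assumptions made for illustration.

```python
def is_concordant(clinical_identity, predicted_species, groups):
    """Check whether a 16S-based species call agrees with a clinical identity.

    groups: dict mapping a group-level clinical identity to the set of
            species it was expanded to (placeholder content in the example).
    """
    members = groups.get(clinical_identity)
    if members is not None:
        # Group-level identity: concordant if the call matches any member.
        return predicted_species in members
    # Species-level identity: require an exact match.
    return predicted_species == clinical_identity


# Placeholder group and species names, not the real membership lists.
EXAMPLE_GROUPS = {"Group X": {"Species x1", "Species x2"}}

group_hit = is_concordant("Group X", "Species x2", EXAMPLE_GROUPS)
group_miss = is_concordant("Group X", "Species y1", EXAMPLE_GROUPS)
exact_hit = is_concordant("Species y1", "Species y1", EXAMPLE_GROUPS)
```

Keeping the expansion in a lookup table means the comparison logic stays the same if group membership is revised.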

S5 and S6 Datasets summarize basic characteristics of these distributions for genera and species respectively. At the genus level, the mean pid of sequences belonging to the same genus ranged from Since accurate identification of the genus Facklamia is not easy [ 52 ] it is possible that Greengenes database is enriched with sequences inaccurately assigned to this genus.

Streptococcus is a genus that contains many genetically highly heterogeneous species [ 51 ], which is likely to result in many pairs of sequences with low pids.

Within the large family of Enterobacteriaceae, the distributions for Escherichia and Klebsiella genera were significantly less tight than for the other genera. Two species of these two genera E.

At the species level, the mean pid of sequences belonging to the same species ranged from Comparisons of within and between species pid distributions for 53 different species Figs.

For each species, these comparisons show which other species are most likely to be confused for the species in question. For instance, for many species including S. We observed that species in the Enterobacteriaceae family in particular had a large spread in their within-species distributions Fig. Nevertheless, even for these species, the bulk of the within-species pid distribution was significantly different from that of the between-species pid for many other species of the same genus.

Alignment-based sequence similarity methods are commonly used to classify 16S rRNA sequences to the species level [35-40]. For each of 30 types of initial clinical identities, Table 2 shows the level of concordance between the clinical identity and the 16S rRNA-based Bayesian and alignment-based classifications. S4 Dataset lists, for each isolate, the initial clinical identification and the genus-species classifications based on the 16S rRNA gene using both methods.

For the Bayesian classifier, in addition to the genus and species level classifications, bootstrap estimates of classification confidence are listed for both taxonomic levels. In cases where 16SpathDB gives a multitude of identified species, all the species with the highest percent sequence similarity are listed. At the species level, the rates of concordance with the clinical identities were Bayesian taxonomic classification has a distinct advantage over the simple sequence similarity based method in terms of identification specificity.

While the Bayesian classifier predicts a single genus-species identity with the associated confidence levels, sequence similarity based approach of 16SpathDB results in up to six best-hits leading to ambiguous species identification for isolates. Among isolates whose 16SpathDB identification matched with the clinical identity, had definite identification a single genus-species.

Considering the identification specificity at the species level, using the Bayesian classifier, When stratified by whether the initial clinical identification was based on culture or molecular methods, genus-level rates of concordance were roughly similar for either type of clinical identification Table 3 : This skew in the concordance rate was primarily due to the isolates from genus Burkholderia that were clinically identified using multiphasic diagnostic tests.

As these tests are known to be highly accurate for characterization of Burkholderia genomovars [ 54 , 55 ], they are unlikely to be clinical misidentifications. A unique feature of the 16S rRNA based Bayesian classifier for the medically important organisms is that in addition to a genus-species classification, it generates bootstrap confidence estimates for both taxonomic levels.

Use of a high bootstrap cutoff corresponds to a taxonomic classification with a higher accuracy. We compared genus and species clinical and 16S rRNA-based identities taking into account the classification confidence for the latter see Fig.

The threshold for a high confidence classification was 70 out of bootstrap samples. These comparisons placed each isolate into one of 12 categories: Categories A-H correspond to cases for which there was genus-level concordance between the clinical and 16S rRNA-based identities with high (A-D) or low (E-H) bootstrap confidences.

Categories a-d correspond to cases for which the 16S rRNA classification was to a different genus with either high (a or b) or low (c or d) bootstrap confidence. Each isolate was assigned to one of 12 categories (A-H, a-d) based on the agreement between clinical and 16S rRNA-based genus and species classifications and the confidence scores.
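
The coarse part of this categorization can be sketched as follows. The text specifies only the four category blocks (A-D, E-H, a-b, c-d) and the threshold of 70, so this sketch stops at the block level and does not attempt the individual letters. Treating the threshold as inclusive and applying it to the genus-level bootstrap count are assumptions, as are the function and constant names.

```python
HIGH_CONFIDENCE = 70  # bootstrap count treated as "high" (assumed inclusive)


def category_group(clinical_genus, predicted_genus, genus_bootstrap):
    """Place an isolate into one of the four coarse category blocks.

    genus_bootstrap: bootstrap support for the genus call, as a count
    (assumed to be the quantity compared with the threshold).
    """
    high = genus_bootstrap >= HIGH_CONFIDENCE
    if clinical_genus == predicted_genus:
        # Genus-level concordance: A-D if high confidence, E-H if low.
        return "A-D" if high else "E-H"
    # Classification to a different genus: a-b if high, c-d if low.
    return "a-b" if high else "c-d"


# Placeholder genus names, not isolates from the study.
block = category_group("Genus P", "Genus P", 95)
```

Resolving a block into a single letter would additionally need the species-level agreement and confidence, which this sketch leaves out.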

For 5 out of 30 clinical identities S. The discordances between the clinical and 16S-based identities of the isolates could be due to several factors, including (1) insufficient representation of the clinical identities in the training set, (2) phenotypic misidentification, and (3) new taxonomic or phylogenetic placements.

The breadth and depth of the NB classifier training set in representing the species corresponding to the clinical identities makes (1) unlikely. The bootstrap confidence score of the best matching taxon generated by the NB classifier gives a level of confidence to the assignment it makes. As such, it is informative as to whether (2) or (3) is more likely to underlie the observed discordances.

Assignments with high confidence scores are more likely to indicate phenotypic misidentifications, while assignments with low confidence scores indicate taxonomic or phylogenetic novelty not represented in the training set. For a total of 40 isolates from 18 different clinical identities, NB classifications were to another species under the same genus (categories C and D).


