Broad Clinical Labs
https://broadclinicallabs.org/

Illumina and Broad Clinical Labs usher in new era of drug discovery with collaboration to rapidly scale single-cell solutions
https://broadclinicallabs.org/illumina-and-broad-clinical-labs-usher-in-new-era-of-drug-discovery-with-collaboration-to-rapidly-scale-single-cell-solutions/
Sun, 02 Mar 2025 16:11:14 +0000

The goal of developing a single-cell atlas of 5 billion cells within the next three years took a big step forward with the recent announcement of a new effort between longstanding partners Illumina and Broad Clinical Labs to rapidly streamline and scale single-cell workflows. Broad Clinical Labs will use Illumina’s Single Cell Prep, NovaSeq™ X Plus platform, 25B flow cell, and DRAGEN™ analysis software workflow, along with our cutting-edge Perturb-seq, CRISPR screens, and other platforms, to help researchers rapidly process and analyze single-cell samples at unprecedented volumes. This collaboration promises to unlock large-scale functional genomics studies that will accelerate the understanding of disease and propel drug development.

Read more about the collaboration here.

Broad Clinical Labs to collaborate on flagship project tapping Illumina’s new spatial technology
https://broadclinicallabs.org/broad-clinical-labs-to-collaborate-on-flagship-project-tapping-illuminas-new-spatial-technology/
Sun, 02 Mar 2025 15:57:47 +0000

Broad Clinical Labs is proud to support a collaborative effort between Illumina and Broad Institute’s Spatial Technology Platform (STP) by providing sequencing services for the recently announced Spatial Flagship Project utilizing Illumina’s new spatial technology. The project will generate large-scale spatial datasets from hundreds of samples provided by Broad Institute principal investigators and will provide external research groups early access to Illumina’s spatial technology through Broad Institute’s STP pipeline. We are excited to partner in this effort to map complex tissues and cell responses at an unprecedented scale.

Read more about the project here.

Announcing new computational and AI capabilities at Broad Clinical Labs
https://broadclinicallabs.org/announcing-new-computational-and-ai-capabilities-at-broad-clinical-labs/
Fri, 21 Feb 2025 10:50:11 +0000

We are excited to announce that Dr. Victoria Popic and her lab have joined BCL to drive research and development efforts at the intersection of AI and clinical genomics.

As the Director of Computational R&D at BCL, Victoria will lead our newly established Computational Research Lab (CRL). CRL’s mission is to pioneer novel machine learning approaches for complex large-scale multiomic data analysis. By integrating innovative computational techniques with the latest biotechnological advances – from BCL and our industry partners – CRL will foster the development of robust, scalable, and sustainable AI solutions to catalyze breakthroughs in both foundational biomedical research and clinical applications.

Working in close collaboration with molecular and clinical scientists across the Broad community, CRL will develop data-driven methods that leverage readouts from diverse but complementary molecular assays and sequencing platforms to accurately reconstruct the genome, transcriptome, proteome, and methylome – which is critical to our understanding of disease mechanisms and the design of novel therapies. As a key focus, CRL will develop new models to describe the variation in the structure and composition of the -omes, and investigate the impact of this variation on function and disease.

While these have been long-standing goals of genomics, both the latest advances in AI and the petabyte-scale sequencing datasets generated to date have enabled a paradigm shift in how we can approach these problems computationally and what we can discover from highly complex multiomic data.

Victoria’s work at CRL will build on her research as a Schmidt Fellow and Principal Investigator at the Broad Institute, where she has been leading a lab focused on the development of deep learning approaches for the characterization and interpretation of the genome and the mechanisms that drive disease. Prior to joining the Broad, Victoria worked in the Bay Area as a compiler engineer at the AI hardware startup SambaNova Systems and as an AI engineer at Illumina. She received her Ph.D. in Computer Science from Stanford University and holds an M.Eng. degree in Computer Science and B.S. degrees in Computer Science and Mathematics from MIT.

 

Important note for this blog: Posts do not equal endorsements. Opinions expressed in this blog are those of the author, on behalf of the genomics group at Broad. We make every effort to ensure the accuracy of data/figures presented here but these are not peer-reviewed and errors may occur from time to time.

 

Maximize Genetic Insights: A Cost-Effective, Combined Clinical WGS and WES Service from Broad Clinical Labs
https://broadclinicallabs.org/maximize-genetic-insights-with-cbge/
Sat, 23 Nov 2024 20:39:51 +0000

In the pursuit of cost-effective genomics testing, researchers and clinicians face a challenging trade-off. While whole genome sequencing (WGS) provides comprehensive genetic insights, its high cost remains a significant barrier, particularly for large-scale studies. Many have turned to whole exome sequencing (WES) as a more economical alternative, but its focus on protein-coding regions alone leaves gaps in our understanding. To bridge this divide, teams often combine multiple approaches – running WES alongside microarrays with imputation. However, this patchwork solution introduces its own challenges: potential data bias, increased sample requirements, and the complex task of integrating results from different platforms. 

To address these challenges, the team at Broad Clinical Labs (BCL) has developed an innovative new assay – Clinical Blended Genome-Exome (cBGE) Sequencing. This streamlined, single-sample approach combines low-coverage, PCR-free WGS with high-depth WES in a single, cost-effective assay, ideal for large-scale studies, clinical trials, and affordable patient genetic testing. This assay offers concurrent estimation of both polygenic and monogenic disease risk, enabling a more precise risk profile to inform risk models, enrich clinical outcome assessment, and provide maximum information to clinicians and patients. 

The cBGE Assay: Sample to Result in just 28 days 

The qualitative cBGE assay begins with genomic DNA extraction from a blood or saliva sample, followed by the generation of a PCR-free whole genome library. Next, an aliquot of the library undergoes PCR amplification and exome selection. The genome and exome libraries are then recombined for sequencing on the powerful Illumina® NovaSeq X Plus platform. 

The result is a single CRAM file containing 2–4X WGS and 85–100X WES data (Figure 1). Alignment and variant calling are performed using the Illumina® DRAGEN platform, with clinical reporting available upon request.

Figure 1. The Clinical Blended Genome-Exome Assay yields low-coverage WGS and high-coverage WES data from a single sample. Both data types are delivered in a single CRAM file. 
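
The yield arithmetic behind a blended design can be sketched in a few lines. This is an illustrative back-of-the-envelope estimate only, not BCL's production calculation: the genome size, the exome target size, the 30X comparison point, and the function names are all assumptions, and the sketch ignores capture efficiency, duplicate reads, and the small amount of exome depth contributed by the genome library.

```python
# Back-of-the-envelope sequencing yield for a blended genome-exome sample.
# All constants are illustrative assumptions, not BCL specifications.
GENOME_SIZE_BP = 3_100_000_000   # approximate human genome size
EXOME_TARGET_BP = 35_000_000     # hypothetical exome target; varies by capture kit


def bases_needed(mean_depth: float, target_bp: int) -> float:
    """Bases required for `mean_depth` over `target_bp`, assuming every base lands on target."""
    return mean_depth * target_bp


def blended_bases(wgs_depth: float, wes_depth: float) -> float:
    """Total bases for a low-pass genome plus a deep exome from the same sample."""
    return bases_needed(wgs_depth, GENOME_SIZE_BP) + bases_needed(wes_depth, EXOME_TARGET_BP)


if __name__ == "__main__":
    # Mid-range values taken from the 2-4X WGS and 85-100X WES figures above.
    blended = blended_bases(wgs_depth=3, wes_depth=90)
    # 30X is used here only as a familiar comparison point for a deep genome.
    deep_wgs = bases_needed(30, GENOME_SIZE_BP)
    print(f"blended: {blended / 1e9:.2f} Gb")
    print(f"30X WGS: {deep_wgs / 1e9:.2f} Gb")
    print(f"ratio:   {blended / deep_wgs:.2f}")
```

Under these assumptions the blended sample needs only a small fraction of the bases of a deep genome, which is the intuition behind the cost advantage.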

 

All testing for the cBGE assay is conducted in BCL’s CLIA-licensed and CAP-accredited laboratory by an expert team with decades of experience, ensuring the highest standards of quality and reliability. BCL’s streamlined workflow and scalable sample processing of up to hundreds of thousands of samples per year enable us to provide this service at a low cost, starting at just $120 per sample, with no sample minimums.

Creating Opportunities to Improve Patient Care 

The applications of this innovative cBGE assay are wide-ranging. It provides a cost-effective alternative to microarrays for genotyping in GWAS, enabling germline gene-disease discovery. The low-pass WGS component is well-suited for common variant detection in population genetic diversity and community health studies, while the deep WES data enables sensitive detection of rare genetic variants (SNVs, small indels, and CNVs). Together, this combined approach supports both monogenic and polygenic risk analysis.

Our cBGE assay is already being used to power studies like ProGRESS, a prostate cancer genetic risk screening trial, in partnership with Genomes2Veterans. In addition, we are utilizing the cBGE assay to provide no-cost genetic testing for patients in Alabama as part of the Catalyst program in collaboration with Southern Research and MyOme, with a goal of bringing genetics-driven clinical risk assessments to underserved communities. 

By combining the breadth of WGS with the depth of WES in a single, cost-effective assay, our aim is to unlock comprehensive genomic insights for every patient and empower researchers to conduct large-scale studies to advance personalized medicine. 

 Learn more about our Clinical Blended Genome-Exome Assay and how it can power your clinical applications on our service page.

Whole Genome Sequencing at Broad Clinical Labs Gains Additional Regulatory Approval
https://broadclinicallabs.org/whole-genome-sequencing-at-broad-clinical-labs-gains-additional-regulatory-approval/
Tue, 22 Oct 2024 17:38:22 +0000

Whole genome sequencing (WGS) is a powerful platform for discovery, screening, and diagnosis. Our human genome sequencing process builds on almost 30 years of institutional expertise dating back to the first human genome project. We have now sequenced over 675,000 genomes and over 1 million exomes. Depending on the application, we can sequence genomes at 0.1X coverage for cancer tumor fraction estimation, all the way up to 160X for somatic mosaicism detection. We sequence genomes with both long-read and short-read platforms. We also run low-pass genomes spiked with deeper exome libraries (blended genome-exome) as a cost-effective assay for risk screening.

Today we are excited to announce that the subset of our genome assays that are clinically validated WGS laboratory developed tests (LDTs) are now approved by the New York State Clinical Laboratory Evaluation Program (NYS CLEP). These LDTs run on the Illumina NovaSeq X Plus and use DRAGEN software for variant calling. NYS regulatory approval for this product spans its use as a technical genome (data and variants delivered only) and as an interpreted genome where we can deliver clinical reports for gene panels, secondary findings, or perform a whole genome analysis in a phenotype-driven approach using the Fabric Genomics Enterprise platform. For those who may not be familiar with the space: NYS approval is a big deal. Known for its exacting standards, approval from NYS has even been cited by FDA as a viable alternative to the IVD pathway for molecular assays in some scenarios.

In accordance with our mission, and through our awards and partnerships, we are driving forward the application of WGS in population programs (e.g. the NIH All of Us Research Program), newborn screening (with Nurture Genomics), NICU diagnostics (with Fabric Genomics and Intermountain Health), childhood developmental delays (collaboration with Quest Diagnostics to understand utility of WGS to detect chromosomal abnormalities), disease risk estimation (with MyOme, Southern Research, Boston VA and others using the blended genome-exome), common disease research (with the Stanley Center at Broad), rare disease research (Simons Foundation, Boston Children’s Hospital, GREGoR consortium and others), and cancer (NCI, DFCI, Gabriella Miller Kids First Pediatric Research Program, and others).

Kudos to the entire team at BCL who worked so hard on this application and review process. It was a big task and the team really stepped up. We are so proud of what we’ve built here and are delighted to bring this additional layer of regulatory approval and stamp of quality to our partners.

Broad Clinical Labs is moving!
https://broadclinicallabs.org/broad-clinical-labs-is-moving/
Thu, 01 Aug 2024 14:56:14 +0000

If you were to ask the average science enthusiast on the street where the Broad Institute is located, they would likely mumble something about Main St in Kendall Square. Indeed, Broad’s Merkin and Stanley Buildings are the most visible outposts of our storied institution; however, they are not in fact the OG Broad buildings. That honor goes to the humble 320 Charles St.

Our History

This former beer and hotdog storage facility for Fenway Park concessions was the original space that the Whitehead Institute’s Center for Genome Research leased to handle the large-scale production activities for the Human Genome Project at the end of the last century. A few years later, in June 2003, it became the home of the newly formed Broad Institute of MIT & Harvard, thanks to the prescient gift from Eli and Edythe Broad. Once the building on Main St in Kendall Square (then called 7 Cambridge Center) was completed in 2006, most of the institute leadership and administrative facilities moved there, as well as other founding labs of the Broad. What was left at 320 Charles was the sequencing group, which over time became the Genomics Platform of the Broad and also incubated the clinical laboratory that subsequently became Broad Clinical Labs.

The building at 320 Charles has been the center of some truly transformative scientific activities over the years, from the first Human Genome Project to sequencing over 600,000 genomes and over 1,000,000 exomes, and more, in service of discovery and translation across a wide swath of biomedicine (from infectious disease to cancer to rare and common germline disease). During the pandemic Broad Clinical Labs, at 320 Charles St, became one of the national epicenters for Covid-19 diagnostic testing, operating 24 hours a day for the duration and processing >37M patient samples.

This building has seen it all. It is also showing its age. When it rains, the roof has been known to spring a leak. The hallways are labyrinthine and the environmental controls struggle to keep up with our seasons. In 2022 we faced the decision on whether we should continue to pay increasing rents in one of the world’s most expensive life sciences neighborhoods, or start fresh a little further afield. We decided to build. We found a green field site a few miles outside the city and worked with a new landlord and team of architects to come up with a space that is appropriate for one of the world’s largest genome centers.

Meet 27 Blue Sky Drive: our purpose-built, 145,000 sq ft building, located on a new life sciences campus in Burlington, MA, and the new home of Broad’s Genomics Platform and Broad Clinical Labs. Broad’s Kendall Square campus will stay, of course, and is actually growing further.

Figure 1. Architect’s rendering of the new building.

 

Current status

The building is built. The furniture is being installed as I type. A dedicated team from our group and Broad’s facilities team has created a mind-bogglingly detailed move plan. Our collaborators and customers have been notified. The move starts in mid-September and will take about a month. During this phased move our intention is that clinical processing will continue uninterrupted.

We believe this new building represents an investment in the future, in our people (and their quality of workplace life), and in discovery and clinical genomics for all. Once we are settled, we would be happy to host anyone who wants to come for a tour.

 

Improving Diagnosis in Rare Diseases: The Role of ExpansionHunter in Whole Genome Sequencing
https://broadclinicallabs.org/improving-diagnosis-in-rare-diseases-the-role-of-expansionhunter-in-whole-genome-sequencing/
Thu, 16 May 2024 20:46:02 +0000

Many individuals and families with genetic disorders are all too familiar with the concept of a “diagnostic odyssey”. It refers to the long and arduous journey to find a diagnosis, which can take years or decades, require many different tests and evaluations, and even involve dead-end incorrect diagnoses along the way. Consolidating the number of genetic tests required to reach a diagnosis could significantly reduce the length of time from symptom onset to diagnosis. Whole genome sequencing (WGS) has already been used to identify multiple variant types that previously required separate tests (e.g., single nucleotide variants and copy number variants).

 

Identifying short tandem repeat expansions in WGS data

Expansions of short tandem repeats (STRs) are a type of variant responsible for many neurological disorders, including Fragile X syndrome, Friedreich ataxia, Huntington disease, myotonic dystrophy, and spinocerebellar ataxia. Testing for STR expansions traditionally required targeted methods (triplet-repeat primed PCR or Southern blot), and while these methods are effective, they were typically ordered as single-gene analyses. Enter DRAGEN ExpansionHunter, a computational tool packaged into the Illumina DRAGEN pipeline that is capable of detecting STR expansions across many different genes using PCR-free short-read WGS data. This tool allows a diagnostic WGS test to include reporting on STR expansions for the genes included in the tool.

The Rare Genomes Project (RGP) at the Broad Institute has been at the forefront of leveraging ExpansionHunter to identify variants responsible for rare conditions. Through their efforts, individuals with features of ataxia, myopathy, or muscular dystrophy, spanning a wide age range, have been diagnosed with STR expansions in various genes. This success story underscores the potential of computational tools like ExpansionHunter in unraveling the genetic mysteries behind rare diseases. Success of ExpansionHunter in RGP has led the Broad Clinical Laboratory to validate Illumina DRAGEN ExpansionHunter for clinical WGS testing.

 

Testing of DRAGEN ExpansionHunter

A cohort of 22 samples sourced from the Coriell Institute was used. The samples carry known STR expansions in six genes (FMR1, ATXN1, FXN, HTT, C9ORF72, and DMPK) and vary by repeat size, motif, inheritance pattern, and patient sex. Coriell sizing was performed by Southern blot and/or PCR analysis. The normal, premutation, and full expansion repeat ranges were determined for each gene, along with a cutoff that would be used to flag potentially expanded alleles. These samples were run on the Illumina NovaSeq 6000 system and called using DRAGEN v3.10.4 ExpansionHunter. The results of this analysis were compared to the truth data from Coriell.
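
To make the per-gene ranges and flagging cutoff concrete, here is a minimal sketch of how a called repeat count might be binned. It is not the validated BCL logic: the `RepeatRanges` structure, the function names, and every threshold value are hypothetical placeholders, and real ranges differ for each gene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatRanges:
    """Per-locus thresholds in repeat units. Values below are placeholders, not validated ranges."""
    normal_max: int    # largest repeat count still classed as normal
    full_min: int      # smallest repeat count classed as a full expansion
    flag_cutoff: int   # calls at or above this count are flagged for review


def classify(repeats: int, ranges: RepeatRanges) -> str:
    """Bin a repeat count into normal, premutation, or full expansion."""
    if repeats <= ranges.normal_max:
        return "normal"
    if repeats < ranges.full_min:
        return "premutation"
    return "full"


def is_flagged(repeats: int, ranges: RepeatRanges) -> bool:
    """True when the call reaches the locus-specific flagging cutoff."""
    return repeats >= ranges.flag_cutoff


# Hypothetical locus; real thresholds come from the literature for each gene.
EXAMPLE_LOCUS = RepeatRanges(normal_max=30, full_min=60, flag_cutoff=31)

if __name__ == "__main__":
    for count in (20, 45, 80):
        print(count, classify(count, EXAMPLE_LOCUS), is_flagged(count, EXAMPLE_LOCUS))
```

Keeping the flag separate from the class matters because a short-read caller may get the class wrong for long alleles while still correctly reporting that the allele is above the cutoff.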

As expected, sequencing read length (~150 bp) limited ExpansionHunter’s ability to call repeat sizes accurately. All repeats below 150 bp in length were called accurately (+/-1 repeat), whereas none of the repeats >150 bp were accurately called (Figures 1a and 1b). Determining the class of expansion (normal, premutation, full mutation) is also limited for expansions beyond ~150 bp. Based on these data, clinical interpretive reporting using ExpansionHunter will need to rely on the flagging cutoffs to distinguish between normal and potentially expanded alleles. All of the loci in this validation had flagging cutoffs under the 150 bp limit, but this may not be the case for all loci included in the ExpansionHunter caller.

Figure 1a and 1b: ExpansionHunter repeat number compared to the number reported by Coriell. 1a) Samples with Coriell repeats <50 (<150bp) in length. 1b) Samples with Coriell repeats >50 (>150bp) in length, using a log scale for x and y. Repeats >150bp are consistently undercalled by ExpansionHunter.
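
The comparison described above (a call counts as accurate within one repeat of truth, with results split by whether the repeat fits inside a read) can be sketched as follows. This is an illustration rather than the validation code: the function names are made up, the `(truth, called)` pairs are invented rather than taken from the Coriell data, and treating exactly 150 bp as "beyond the read" is an assumption.

```python
READ_LENGTH_BP = 150  # approximate read length reported in the post


def repeat_span_bp(repeats: int, motif_len: int) -> int:
    """Length of the repeat tract in base pairs."""
    return repeats * motif_len


def is_concordant(called: int, truth: int, tolerance: int = 1) -> bool:
    """A call is concordant when it is within `tolerance` repeats of truth."""
    return abs(called - truth) <= tolerance


def cutoff_within_read(flag_cutoff: int, motif_len: int) -> bool:
    """True when an allele at the flagging cutoff is still shorter than one read."""
    return repeat_span_bp(flag_cutoff, motif_len) < READ_LENGTH_BP


def summarize(pairs, motif_len: int = 3):
    """pairs: iterable of (truth_repeats, called_repeats).

    Returns {bin: (concordant, total)} with bins split at the read length.
    """
    counts = {"within_read": [0, 0], "beyond_read": [0, 0]}
    for truth, called in pairs:
        within = repeat_span_bp(truth, motif_len) < READ_LENGTH_BP
        key = "within_read" if within else "beyond_read"
        counts[key][1] += 1
        if is_concordant(called, truth):
            counts[key][0] += 1
    return {key: tuple(value) for key, value in counts.items()}


# Invented (truth, called) pairs for a trinucleotide repeat; not the validation data.
TOY_PAIRS = [(20, 20), (45, 44), (200, 60), (900, 55)]

if __name__ == "__main__":
    print(summarize(TOY_PAIRS))
    print(cutoff_within_read(flag_cutoff=40, motif_len=3))
```

The `cutoff_within_read` check mirrors the caveat in the text: flag-based reporting is only dependable at a locus when the cutoff itself corresponds to a tract shorter than the read.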

 

Future directions

The integration of ExpansionHunter into clinical WGS testing is not without its challenges. Validation efforts, such as those undertaken by the Broad Clinical Laboratory, are crucial to ensuring accuracy and reliability. Validation and incorporation into clinical WGS interpretation is still in progress. These validations pave the way for incorporating ExpansionHunter calls into interpretive reports, further enhancing the diagnostic utility of WGS.

One of the key advantages of incorporating ExpansionHunter into diagnostic workflows is the expanded variant types reported. This translates to increased sensitivity, particularly for neurological disorders that may be caused by STR expansions. By reducing the need for separate test orders and streamlining the diagnostic process, patients with rare diseases can potentially receive timely and accurate diagnoses, thereby minimizing the diagnostic odyssey they often face.

In conclusion, the inclusion of ExpansionHunter in diagnostic WGS represents a significant leap forward in the field of rare disease diagnostics. Its ability to detect STR expansions with high accuracy and its integration into clinical workflows hold immense promise for improving patient outcomes and reducing the burden of the diagnostic odyssey. As advancements in computational tools continue to evolve, so too will our ability to unlock the genetic mysteries underlying rare diseases, bringing hope to patients and their families worldwide.

 

References:

  1. Dolzhenko E, et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 2017 Nov;27(11):1895-1903. doi: 10.1101/gr.225672.117. Epub 2017 Sep 8. PMID: 28887402; PMCID: PMC5668946.
  2. Ibañez K, et al. Whole genome sequencing for the diagnosis of neurological repeat expansion disorders in the UK: a retrospective diagnostic accuracy and prospective clinical validation study. Lancet Neurol. 2022 Mar;21(3):234-245. doi: 10.1016/S1474-4422(21)00462-2. PMID: 35182509; PMCID: PMC8850201.
  3. Dolzhenko E, et al. REViewer: haplotype-resolved visualization of read alignments in and around tandem repeats. Genome Med. 2022 Aug 11;14(1):84. doi: 10.1186/s13073-022-01085-z. PMID: 35948990; PMCID: PMC9367089.

 

Making the Most of Genome Sequencing: The Need for Detection of Sequence Variants in Notoriously Tricky Regions
https://broadclinicallabs.org/making-the-most-of-genome-sequencing-the-need-for-detection-of-sequence-variants-in-notoriously-tricky-regions/
Wed, 20 Dec 2023 15:24:57 +0000

Genome sequencing has opened opportunities for detecting multiple variant types across the genome with a single technology. Previously, BCL has tackled evaluating copy number variant (CNV) detection from genome data using the DRAGEN™ 4.2.4 CNV pipeline. While CNV detection significantly expands the diagnostic power of genome data, there are still many known regions of the genome where sequence variant detection is difficult because of biological challenges. For example, many genes where variants are associated with genetic conditions and are reported in a clinical setting contain repetitive regions (e.g. NEB) or have high homology with a pseudogene (e.g. PMS2). These genes currently require specialized orthogonal assay design to detect variants with confidence.

With these challenges in mind, BCL has initiated efforts to evaluate new bioinformatic approaches to calling variants in these difficult genes. Illumina has designed specific targeted callers that can be run to detect sequence variants in these tricky regions using the DRAGEN software. Some of these callers are currently available. Examples include the HBA caller, which aids detection in the highly homologous genes HBA1 and HBA2 (associated with α-thalassemia), and the SMN1 caller, which detects copy number changes in the highly homologous genes SMN1 and SMN2 (associated with spinal muscular atrophy).

This blog details the clinical need for these callers and introduces our latest effort in evaluating the performance of additional DRAGEN targeted callers, which will be detailed in subsequent blog posts.

Genome-wide tests make domain-specific expertise more difficult

With the decreasing cost of sequencing, Mendelian disease genes are being discovered more rapidly than ever before [1]. Per a policy statement of the American College of Medical Genetics and Genomics [2], all genes that have a gene-disease relationship of Moderate or above per the semi-quantitative framework developed by the Clinical Genome Resource (ClinGen) [3] are reportable in a diagnostic setting. As of December 2023, there were approximately 4000 genes in the Gene Curation Coalition database (search.thegencc.org) [4] that have a strength of Moderate or above and could be reportable on a clinical test.
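
The "Moderate or above" rule is easy to express as a rank comparison. The sketch below is purely illustrative: the record layout and gene names are made up and do not reflect the GenCC export schema, and only four classification tiers are ranked, with any other term treated as not reportable.

```python
# Classification tiers from strongest to weakest; anything not listed
# here is treated as not reportable in this sketch.
RANK = {"Definitive": 3, "Strong": 2, "Moderate": 1, "Limited": 0}


def is_reportable(classification: str, minimum: str = "Moderate") -> bool:
    """True when the classification is at or above `minimum`."""
    return RANK.get(classification, -1) >= RANK[minimum]


# Made-up curation records; gene names are placeholders, not real curations.
records = [
    {"gene": "GENE_A", "classification": "Definitive"},
    {"gene": "GENE_B", "classification": "Moderate"},
    {"gene": "GENE_C", "classification": "Limited"},
    {"gene": "GENE_D", "classification": "Disputed Evidence"},
]

reportable = [r["gene"] for r in records if is_reportable(r["classification"])]

if __name__ == "__main__":
    print(reportable)
```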

If a clinical laboratory is going to offer a diagnostic test to report variants in a particular gene, the lab must understand all of the technical limitations of detecting variants in this gene. Depending on particular limitations, laboratories may be expected to develop multiple technologies to offer a highly sensitive and specific test [5]. For some genes, this is harder than for others because of the underlying biology. This blog will speak to two different biological scenarios that often require special ancillary assays for a complete clinical test. Developing these assays takes time, can be costly, and often requires detailed knowledge of the gene, disease, and underlying biology of the system. This was more feasible when labs focused on small test menus and truly developed expertise in certain gene/disease areas, but as clinical laboratories try to move to genome-wide testing technologies, this task becomes exceedingly difficult.

Homologous Regions: Where is my variant?

Next generation sequencing (NGS) involves fragmenting genomic DNA into small pieces, usually ~350 base pairs each, ligating, barcoding, sequencing, and aligning them to a reference genome, much like putting together a jigsaw puzzle. The overall success of this method, particularly because the reads are so short (usually ~150bp reads are generated), relies on the principle that many sections of the genome have a unique sequence. However, much like a puzzle that has certain colors or images where the pieces are indistinguishable, there are certain sections of the genome that are notoriously difficult to align. Regions of high homology fall into this bucket because they are regions in the genome that have near-identical sequences (Figure 1). Homologous regions may have formed during evolution of the mammalian genome through gene duplication events. One copy retains all of the functional elements of the active gene, while the other is an inactive gene copy, usually referred to as a pseudogene. Pseudogenes may contain the full gene sequence, or may be “processed”, which means that all of the introns were spliced out of that copy before it was reincorporated into the genome. Pseudogenes can be particularly subject to variation because of the lack of evolutionary constraint on the nonfunctional gene elements [6].

Figure 1: Mapping short reads is difficult in regions of high homology

 

Distinguishing between genes and pseudogenes can be critical because variants in some genes with a known pseudogene pair are associated with severe, highly actionable genetic conditions. This is illustrated by the fictitious case example below:

John Smith is a 45 year old patient with a history of colorectal cancer. Histology of a biopsy of this cancer reveals microsatellite instability (MSI). His clinical genetics team orders a comprehensive cancer panel test. The clinical lab detects a nonsense variant in exon 11 of PMS2. PMS2 is definitively associated with autosomal dominant Lynch syndrome, a syndrome with a high risk of early-onset colorectal cancer, among other cancers. The lab may have found the cause for John’s cancer. However, PMS2 has a highly homologous pseudogene, PMS2CL, that overlaps with exons 9 and 11-15 of PMS2 with more than 98% sequence identity [7]. Before the lab can report this variant, they must determine if it is located in PMS2 (likely explains his condition, can inform familial cascade testing) or PMS2CL (pseudogene variant without clinical impact, no identified cause for his condition). NGS cannot differentiate between these regions; thus the laboratory must design and validate an ancillary assay to detect bona fide PMS2 variants in the pseudogene region. This usually consists of Multiplex Ligation-dependent Probe Amplification (MLPA), long-range PCR, or other methods. Reflexing to this assay when a variant is detected can add time and cost to a clinical sequencing test.

Differentiating between variants in genes and pseudogenes is also critical in a screening situation. If you alter the fictitious clinical scenario above slightly and say that John Smith is a 45 year old seemingly healthy individual who opted to do genetic screening and a nonsense variant was identified in exon 11 of PMS2, it would also be critical for the lab to determine if this variant were in PMS2 or PMS2CL to help inform screening protocols and prophylactic measures for this individual. If the variant were in PMS2, this individual would be at risk for Lynch syndrome cancers, but this would not be the case if the variant were in PMS2CL.

Repetitive Regions: Is my variant real?

A second bucket of difficult-to-align sequence in short-read sequencing comprises regions that contain long runs of repetitive material. If we return to the puzzle analogy, these would be sections of a puzzle that have a repeating pattern, where the first repeat of the pattern is not distinguishable from the third. It may be difficult to accurately detect variants in such a region because the sequence is so repetitive (Figure 2). While it may not be as critical to place the variant precisely within a repetitive region, it is critical to determine whether a variant is actually present at all or is merely a sequencing artifact. This is illustrated in the fictitious example below:

Figure 2: Mapping short reads is difficult in repetitive regions

 

Jane Doe is a patient with nemaline myopathy and a muscle biopsy that has identified nemaline rods. Her clinical team orders a comprehensive myopathy panel and a variant in NEB is identified. However, it is in the region of exons 82-105, a highly repetitive region that is a triplication of 8 exons [8]. The lab must determine whether this variant is in fact real because it could be an explanation for Jane’s condition. However, because sequencing alone cannot determine this, an ancillary assay must be performed.
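The placement problem in repetitive sequence can be sketched with a toy tandem repeat. In the hedged Python example below, the reference, the read, and the helper `best_placements` are all made up for illustration and bear no relation to the actual NEB sequence. A short read carrying a single mismatch fits equally well at several offsets, so the read alone cannot say which repeat unit the apparent variant belongs to.

```python
def best_placements(read: str, reference: str) -> tuple[int, list[int]]:
    """Return the minimum mismatch count and every offset that achieves it.

    Ungapped comparison only (Hamming distance); a real aligner also
    considers insertions and deletions.
    """
    best = len(read) + 1
    offsets: list[int] = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best:
            best, offsets = mismatches, [start]
        elif mismatches == best:
            offsets.append(start)
    return best, offsets


# Toy reference (hypothetical): unique flank + 8 copies of "CAG" + unique flank.
reference = "GATTACA" + "CAG" * 8 + "TTGACC"

# A read spanning three repeat units with one substituted base (A -> T).
read = "CAGCTGCAG"

min_mismatches, placements = best_placements(read, reference)
print("minimum mismatches:", min_mismatches)
print("equally good placements:", placements)
```

Every tied placement implies the same apparent substitution at a different reference coordinate, which is one reason calls in such regions need orthogonal confirmation.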

Future Directions

BCL, in collaboration with Illumina, has prioritized testing certain targeted callers that would help improve the sensitivity of our current genome sequencing product without the need to perform ancillary testing to detect variants in tricky regions. We are currently sourcing samples that have been tested with orthogonal methods to validate these callers. Future blog posts will be dedicated to the performance of new targeted callers for genes like PMS2, NEB, STRC, and HBA1 and HBA2, to name a few on the road map. Iterative development of bioinformatic algorithms to leverage the power of genome sequencing will help improve variant calling and, in turn, accurate clinical reporting from WGS, pushing it further toward that “one stop shop” genetic testing method of the future.

 

References:

1. Boycott, K.M., et al., International Cooperation to Enable the Diagnosis of All Rare Genetic Diseases. Am J Hum Genet, 2017. 100(5): p. 695-705.

2. Bean, L.J.H., et al., Diagnostic gene sequencing panels: from design to report-a technical standard of the American College of Medical Genetics and Genomics (ACMG). Genet Med, 2020. 22(3): p. 453-461.

3. Strande, N.T., et al., Evaluating the Clinical Validity of Gene-Disease Associations: An Evidence-Based Framework Developed by the Clinical Genome Resource. Am J Hum Genet, 2017. 100(6): p. 895-906.

4. DiStefano, M.T., et al., The Gene Curation Coalition: A global effort to harmonize gene-disease evidence resources. Genet Med, 2022. 24(8): p. 1732-1742.

5. Rehder, C., et al., Next-generation sequencing for constitutional variants in the clinical laboratory, 2021 revision: a technical standard of the American College of Medical Genetics and Genomics (ACMG). Genet Med, 2021. 23(8): p. 1399-1415.

6. https://www.ncbi.nlm.nih.gov/books/NBK535152/

7. Li, J., et al., A Comprehensive Strategy for Accurate Mutation Detection of the Highly Homologous PMS2. J Mol Diagn, 2015. 17(5): p. 545-53.

8. Yuen, M. and C.A.C. Ottenheijm, Nebulin: big protein with big responsibilities. J Muscle Res Cell Motil, 2020. 41(1): p. 103-124.

9. Blog Thumbnail Photo by Warren Umoh on Unsplash

Important note for this blog: Posts do not equal endorsements. Opinions expressed in this blog are those of the author, on behalf of the genomics group at Broad. We make every effort to ensure the accuracy of data/figures presented here but these are not peer-reviewed and errors may occur from time to time. Broad has a collaboration agreement with Illumina that in-part funds this work.

Single-Cell Isoform Sequencing of Fluent Libraries with MAS-seq https://broadclinicallabs.org/single-cell-isoform-sequencing-of-fluent-libraries-with-mas-seq/ Fri, 27 Oct 2023 17:18:05 +0000 https://broadclinicallabs.org/?p=5289 The Methods Development Lab (MDL) of Broad Clinical Labs is always looking to innovate on the latest genomics technologies. One promising new approach is from Fluent BioSciences for single-cell RNA sequencing.

The post Single-Cell Isoform Sequencing of Fluent Libraries with MAS-seq appeared first on Broad Clinical Labs.

At the Methods Development Lab (MDL) of Broad Clinical Labs we are always looking to innovate on the latest genomics technologies. One promising new approach is from Fluent BioSciences for single-cell RNA sequencing. Their PIPseq platform can rapidly generate single-cell emulsions by vortexing a cell suspension with particles they call PIPs (Particle-templated Instant Partitions) [1]. This technology has the potential to generate very large single-cell libraries at a low cost per cell. Given these favorable characteristics, we were keen to investigate PIPseq’s compatibility with our recently developed high-throughput RNA isoform sequencing approach, Multiplexed Arrays Sequencing (MAS-seq). Together, these approaches could enable the large-scale measurement of isoform expression in 10⁴ to 10⁶ cells.

MAS-seq works by concatenating a defined number of cDNAs, typically 16, into long single molecules, termed MAS arrays [2]. The length of the MAS arrays is tailored to be in the optimal size range for the long-read PacBio sequencing platforms, approximately 10-18kb. After sequencing, the MAS arrays are deconcatenated informatically, boosting sequencing output by the number of constituent cDNAs per array. PacBio has commercialized MAS-seq and provides kits that boost output for bulk and single-cell RNA sequencing libraries by 16-fold.

The current MAS-seq protocols were developed using 10x Genomics Single-cell Gene expression libraries. In theory, MAS-seq is compatible with any method that generates full-length cDNA molecules and contains the appropriate PCR adapter handles. However, unforeseen obstacles can arise when integrating protocols. In this post, we demonstrate successful integration of MAS-seq and PIPseq in the context of peripheral blood, and show that the combination of both technologies provides high-quality single-cell isoform data.

Experiment

To test out this new method we looked at a classic single-cell sample: peripheral blood mononuclear cells (PBMCs). We used Fluent’s T20 3′ kit to encapsulate single cells into droplets according to the manufacturer’s protocol. After single-cell library construction, we had a pool of full-length cDNA molecules with barcodes and UMIs attached. Here, we separated the full-length cDNA library into two parts: one continued with the standard protocol and went on an Illumina sequencer to get short read gene expression data. The other pool was used as input for MAS-seq.

Figure 1: Overview of the experimental workflow. PBMCs were encapsulated into droplets with the PIPseq T20 3′ kit. After cDNA synthesis the pool was split in two and input to both short- and long-read protocols. Illustration created with BioRender.com.

Short-read PIPseq

First we examined the short-read data that we generated from the Fluent platform. We sequenced the library on a NovaSeq X 10B, generating 1.1 billion 246-bp cDNA reads. We used Fluent’s computational pipeline, called PIPseeker, to assign reads to barcodes, count UMIs, and map to the genome. The result is a familiar MTX-formatted file of UMI counts, along with the list of barcodes and features. These files are compatible with the standard tools and processes for analyzing single-cell sequencing data.

To identify cells, we selected cell barcodes containing a minimum of 1,000 UMIs and between 2% and 7% mitochondrial reads (Fig 2). After applying these filtering steps, 7,029 cell barcodes remained.
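The filtering rule above can be sketched in a few lines of plain Python. This is a hedged illustration, not the pipeline used for the post: the function name `select_cells`, the dictionary inputs, the toy barcodes, and the choice to treat both percentage bounds as inclusive are all assumptions.

```python
def select_cells(
    total_umis: dict[str, int],
    mito_umis: dict[str, int],
    min_umis: int = 1000,
    mito_pct_range: tuple[float, float] = (2.0, 7.0),
) -> list[str]:
    """Keep barcodes with enough UMIs and a mitochondrial fraction in range."""
    low, high = mito_pct_range
    selected = []
    for barcode, total in total_umis.items():
        if total < min_umis:
            continue  # too few UMIs: likely an empty or low-quality partition
        mito_pct = 100.0 * mito_umis.get(barcode, 0) / total
        if low <= mito_pct <= high:  # inclusive bounds are an assumption
            selected.append(barcode)
    return selected


# Toy per-barcode counts (hypothetical values).
total_umis = {"BC_A": 5200, "BC_B": 800, "BC_C": 3100, "BC_D": 2400}
mito_umis = {"BC_A": 210, "BC_B": 30, "BC_C": 400, "BC_D": 20}

cells = select_cells(total_umis, mito_umis)
print(cells)
```

In this toy input, BC_B fails the UMI threshold, BC_C has too high a mitochondrial fraction, and BC_D has too low a fraction, so only BC_A is kept.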

Figure 2: Left: knee plot showing the distribution of UMIs per barcode sorted in descending order. Selected barcodes are highlighted in orange. Right: histogram of the percent mitochondrial UMIs per cell, with dotted lines showing the cutoffs used to select barcodes.

After performing gene selection and Leiden clustering, we identified eight canonical cell types that should be familiar to people who work with peripheral blood. There is certainly more complexity in this dataset and further clustering would resolve some of the structure visible, but for our purposes it was sufficient to identify the major cell types in the data.

Figure 3: Graph embedding of short read data. The plot on the left shows the number of UMIs per cell, on a log scale. On the right the cells are colored by cell type.

Consistent with the described behavior of PIPseq’s V4 chemistry, we saw lower counts in the CD14 monocytes than in other cell types [3]. This is a known challenge when processing cells that express high levels of RNases, which is the case in the CD14 monocyte population.

Full-length cDNA using PIPseq

We next sought to determine if cDNA generated by PIPseq is compatible with MAS-seq for single-cell isoform-specific long reads. We split a portion of the pool of PIPseq full-length cDNA to be used as input into the MAS-seq library prep kit from PacBio (cat# 102-659-600) to generate 16-mer cDNA arrays.

After array creation, we sequenced the MAS-seq library on a PacBio Revio, yielding 3,062,399 HiFi reads. After segmentation with PacBio’s tool skera, we ended up with 46,116,893 s-reads, demonstrating a >15-fold boost in output. A total of 2,744,221 (89.6%) of the HiFi reads were full-length arrays, with an average length of 11,889 bp (Fig 4). The segmented reads had an average length of 725 bp.
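The headline figures in this paragraph can be re-derived directly from the read counts. The short Python check below uses only the numbers reported above; the variable names are ours.

```python
# Read counts as reported in the post.
hifi_reads = 3_062_399          # HiFi reads from the Revio run
s_reads = 46_116_893            # segmented reads after skera
full_length_arrays = 2_744_221  # HiFi reads that were full-length arrays

# Output boost: segmented reads per HiFi read.
fold_boost = s_reads / hifi_reads

# Share of HiFi reads that were full-length arrays.
full_length_pct = 100 * full_length_arrays / hifi_reads

print(f"output boost: {fold_boost:.2f}-fold")
print(f"full-length arrays: {full_length_pct:.1f}% of HiFi reads")
```

The boost lands slightly above 15 rather than at the nominal 16 because not every HiFi read is a complete 16-mer array.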

We extracted the barcodes and UMIs from these reads, and mapped the long-read cDNA to the same genome that we used for the short-read data. Following barcode extraction and mapping, we quantified isoforms with IsoQuant and associated those calls with the Fluent barcodes to get a cell-by-isoform matrix [4].
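As a rough illustration of the last step, the sketch below collapses per-read assignments into a cell-by-isoform count table. It assumes each read has already been reduced to a (cell barcode, UMI, isoform ID) tuple; that input format, the function `build_cell_by_isoform`, and the toy records are hypothetical and do not reflect IsoQuant's actual output files or API.

```python
from collections import defaultdict


def build_cell_by_isoform(
    records: list[tuple[str, str, str]],
) -> dict[str, dict[str, int]]:
    """Count unique UMIs per (cell barcode, isoform).

    Reads sharing the same barcode, UMI, and isoform are treated as PCR
    duplicates of one molecule and counted once.
    """
    seen: set[tuple[str, str, str]] = set()
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for barcode, umi, isoform in records:
        key = (barcode, umi, isoform)
        if key in seen:
            continue
        seen.add(key)
        counts[barcode][isoform] += 1
    return {barcode: dict(per_iso) for barcode, per_iso in counts.items()}


# Toy read assignments (hypothetical barcodes, UMIs, and isoform IDs).
records = [
    ("CELL1", "UMI_A", "iso_1"),
    ("CELL1", "UMI_A", "iso_1"),  # duplicate read of the same molecule
    ("CELL1", "UMI_B", "iso_1"),
    ("CELL1", "UMI_C", "iso_2"),
    ("CELL2", "UMI_A", "iso_2"),
]

matrix = build_cell_by_isoform(records)
print(matrix)
```

A sparse nested dictionary keeps the sketch short; for a real dataset of thousands of cells a sparse matrix format would be the more practical container.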

Figure 4: Quality control for MAS-seq data. Left: concatenation factor and read length distribution for HiFi reads. Right: heatmap of ligation adapter pairs.

When we looked at our long-read data we found nearly all of the cell barcodes that we selected in the short-read data: 7,023 out of 7,029, or 99.9%. Consistent with the short-read data, long-read counts for CD14 monocytes were considerably lower than for other cell types. This effect was potentially exacerbated by our long-read processing protocols that preferentially filter out shorter and likely degraded cDNAs.

Reads after segmentation: 46,116,893
Bead barcodes in whitelist: 498,380
Reads with barcodes in whitelist: 30,766,203
Total UMIs, all barcodes: 18,160,470
Total UMIs, selected cell barcodes: 12,834,983
Median UMIs/cell: 1,253
Median isoforms/cell: 462

Table 1: Key metrics describing the single-cell MAS-seq results
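Two ratios that the table implies but does not state can be computed from its entries. The snippet below is a simple arithmetic check on the reported values; the variable names are ours, and the interpretation of each ratio is our reading of the metric labels.

```python
# Values as reported in Table 1.
reads_after_segmentation = 46_116_893
reads_with_whitelisted_barcode = 30_766_203
total_umis_all_barcodes = 18_160_470
total_umis_selected_cells = 12_834_983

# Share of segmented reads that carry a whitelisted bead barcode.
barcode_rate = 100 * reads_with_whitelisted_barcode / reads_after_segmentation

# Share of all UMIs that fall in the selected cell barcodes.
umis_in_cells = 100 * total_umis_selected_cells / total_umis_all_barcodes

print(f"reads with whitelisted barcode: {barcode_rate:.1f}%")
print(f"UMIs in selected cells: {umis_in_cells:.1f}%")
```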

Isoform-level quantification in PIPseq

While RNA degradation is apparent in the CD14 monocytes, the other cell types were sequenced at high depth. We found many isoforms to be specifically or preferentially expressed in specific cell types, recapitulating the findings of Inamo et al., 2022 and other groups [5]. Below we show two examples.

Figure 5: FCGR3A expression plotted across the cell population. The left panel shows FCGR3A gene expression, while the middle and right panels show FCGR3A isoform expression.

Example 1: FCGR3A (CD16)

FCGR3A (CD16) is an Fc receptor gene and a canonical marker of non-classical monocytes. When the protein is expressed on the cell surface, it serves as a receptor for IgG antibodies and triggers antibody-dependent cell-mediated cytotoxicity. At the gene level FCGR3A is highly expressed in the associated monocyte population, with lower expression in cytotoxic T cells and innate lymphoid cells (Fig 5). When we look at the data at the isoform level, we see that the latter two cell types express an alternate isoform that is absent from the CD16 monocyte cluster. This isoform has an additional exon in the 5′ UTR, suggesting that its inclusion may serve a regulatory role in these cell types (Fig 6).

Figure 6: IGV plot of the differential isoform usage in three cell types. The top panel shows the Gencode annotation for the two isoforms identified.

Figure 7: S100A6 expression plotted across the cell population. The left panel shows S100A6 gene expression, while the middle and right panels show S100A6 isoform expression.

Example 2: S100A6

The gene S100A6 codes for a calcium-binding protein in the S100 family that is expressed in a wide range of cell types. In these data we see the gene expressed primarily as two isoforms, with one isoform particularly highly expressed in CD16 monocytes, while other cell types show a more balanced expression profile (Fig 7).

Figure 8: IGV plot of the differential isoform usage in three cell types. The top panel shows the Gencode annotation for the two isoforms identified by IsoQuant.

Again, the long-read data shows differential transcriptional start site (TSS) usage manifesting as differences in 5′ UTR sequences observed among the cell types (Fig 8). Moreover, the MAS-seq data affords us enough resolution to inspect the read pileup and see that the data are not fully explained by this combination of reference isoform annotations or indeed any other known form of this gene. This points us toward strategies for improving our isoform-calling methods, and exploring uncharted areas of the transcriptome.

Conclusion

Here we demonstrate that PIPseq is compatible with MAS-seq, expanding the validated platforms for single-cell RNA isoform sequencing. Future efforts will include cross-platform benchmarking and characterizations of single-cell platforms for RNA isoform sequencing. This early exploration highlights the transformative capability of MAS-seq and PIPseq for large-scale single-cell RNA isoform sequencing studies to efficiently identify alternative splicing in the context of development and disease.

Disclaimers

Fluent BioSciences provided PIPseq T20 3′ Single Cell RNA Kit v4.0, PIPseq starter equipment kit, and cryopreserved PBMC samples to conduct this study. PacBio provided MAS-seq kits, Revio sequencing reagents, and funding as part of a collaboration agreement.

Important note for this blog: Posts do not equal endorsements. Opinions expressed in this blog are those of the author, on behalf of the genomics group at Broad. We make every effort to ensure the accuracy of data/figures presented here but these are not peer reviewed and errors may occur from time to time.

Citations
  1. Clark, I. C. et al. Microfluidics-free single-cell genomics with templated emulsification. Nat. Biotechnol. 1–10 (2023) doi:10.1038/s41587-023-01685-z.
  2. Al’Khafaji, A. M. et al. High-throughput RNA isoform sequencing using programmed cDNA concatenation. Nat. Biotechnol. (2023) doi:10.1038/s41587-023-01815-7.
  3. Fontanez, K. et al. High–throughput single cell analysis with Particle–templated Instant Partitions (PIPseqᵀᴹ).
  4. Prjibelski, A. D. et al. Accurate isoform discovery with IsoQuant using long reads. Nat. Biotechnol. 1–4 (2023) doi:10.1038/s41587-022-01565-y.
  5. Inamo, J. et al. Immune Isoform Atlas: Landscape of alternative splicing in human immune cells. 2022.09.13.507708 Preprint at https://doi.org/10.1101/2022.09.13.507708 (2022).

Seeking Truth: Solving CNV Evaluation Challenges with T2T Genome Assembly https://broadclinicallabs.org/seeking-truth-solving-cnv-evaluation-challenges-with-t2t-genome-assembly/ Tue, 03 Oct 2023 15:48:48 +0000 https://broadclinicallabs.org/?p=5167 We are interested in evaluating the CNV calling capabilities of the latest available DRAGEN™ version with an eye toward future validation and inclusion in Broad’s existing clinical WGS pipeline.

The post Seeking Truth: Solving CNV Evaluation Challenges with T2T Genome Assembly appeared first on Broad Clinical Labs.

Copy Number Variations (CNVs) are pivotal in shaping human genetic diversity and influencing disease susceptibility. These segments, either deleted (DEL) or duplicated (DUP), can dramatically alter gene dosage, thereby modulating phenotypic outcomes [1]. Given their impact, the precision of CNV calling is not just a technical requirement but a critical factor for robust downstream analyses and accurate genetic interpretation. We are interested in evaluating the CNV calling capabilities of the latest available DRAGEN™ version with an eye toward future validation and inclusion in Broad’s existing clinical WGS pipeline (which currently uses DRAGEN™ v3.10.4).

A significant enhancement in the DRAGEN™ 4.2.4 CNV calling pipeline is that CNV calls now come with SV support. In previous versions, DRAGEN™ CNV callers primarily relied on depth signals for detecting CNV events. The underlying concept behind this approach is that short reads are randomly sampled from the genome, and copy number can be inferred from the read depth of aligned short reads [2]. However, while depth signal is reliable for detecting large CNV events, its efficacy diminishes significantly when dealing with shorter CNV events (<10kbp) due to a lower signal-to-noise ratio. Consequently, many smaller CNV events were historically left to be identified by the DRAGEN™ structural variant (SV) caller, which utilizes junction signals. In practice, this presented a challenge as researchers needed to perform an extra processing step to merge CNV events from both DRAGEN™ SV and CNV callers’ outputs, potentially introducing inaccuracies. In the latest iteration, DRAGEN™ 4.2.4 introduced an additional integration step that leverages junction signals from the SV caller and depth signal from the CNV caller to generate the cnv_sv.vcf. This newly introduced VCF includes an SVCLAIM field, which indicates how each CNV event was detected: through depth signal (D), junction signals (J), or a combination of both (DJ). Illumina has indicated that this improvement streamlines the workflow for researchers and boosts the accuracy of CNV calls, especially with shorter CNV events (<10kbp) [3].
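For readers who want to see how calls break down by evidence type, here is a minimal Python sketch that tallies SVCLAIM values from VCF text. It assumes SVCLAIM is carried as a key in the INFO column (column 8), as in the VCF 4.4 specification; the post itself only says the VCF includes the field. The function names and the three example records are invented for illustration and are not real DRAGEN™ output.

```python
from collections import Counter
from typing import Iterable, Optional


def svclaim_of(vcf_line: str) -> Optional[str]:
    """Return the SVCLAIM value from a VCF data line, or None if absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th tab-separated column
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVCLAIM":
            return value
    return None


def tally_svclaim(lines: Iterable[str]) -> Counter:
    """Count records per SVCLAIM value, skipping header and blank lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[svclaim_of(line) or "missing"] += 1
    return counts


# Hypothetical records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.4",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=3000;SVCLAIM=DJ",
    "chr2\t5000\t.\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=90000;SVCLAIM=D",
    "chr3\t7000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=7800;SVCLAIM=J",
]

claim_counts = tally_svclaim(example_vcf)
print(claim_counts)
```

For production use a dedicated VCF parser is the safer choice, since it handles compressed input and header-declared types.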

This blog introduces our latest effort in evaluating the performance of the DRAGEN™ v4.2.4 CNV caller with SV support. We have used the conventional method of comparing a query VCF file to a benchmarking VCF file, which identifies true positives (TPs) when query variants match events in the benchmark dataset. However, we have found that currently available benchmarking VCFs often omit relevant events, which hinders our ability to accurately estimate false positive rates, a pivotal metric in our clinical validation. Recognizing these limitations, we have developed a more robust and reliable alignment-based evaluation method to address shortcomings in the conventional approach while enhancing our understanding of CNV events.

Old School: Conventional CNV Caller Evaluation

Traditionally, we utilize benchmarking datasets to evaluate the performance of our variant calling pipelines. The National Institute of Standards and Technology (NIST) has provided a benchmark ‘truth’ dataset for the HG002 genome, generated using dipcall, a reference-based variant calling pipeline. Dipcall employs the HG002-T2T (Telomere-to-Telomere) v0.9 assembly as its ground truth. The T2T assembly represents one of the most accurate genome assemblies [4,5] and offers a comprehensive view of CNV events. The assembly is then compared against the hg38 reference genome to generate a robust and precise variant benchmarking dataset for HG002. However, it’s important to note that because dipcall relies on the hg38 genome as a template for variant calling, the output may be biased against some unresolved regions in hg38. Because long-read technology has superior capabilities in detecting structural variants, we also use PBSV calls from long-read HiFi data aligned to the hg38 genome as another means for benchmarking accuracy on HG002 [6]. It’s important to recognize that this dataset, like the NIST benchmarking dataset, may have its own pitfalls: HiFi has its own limitation in detecting extra-long CNV events (>20kbp). These benchmarking datasets allow us to measure the accuracy of the DRAGEN™ CNV calling pipeline. We conducted a comparative analysis using Witty.er, comparing DRAGEN™ v3.10.4 CNV calls, DRAGEN™ v3.10.4 bcftools concat CNV SV calls, and DRAGEN™ 4.2.4 CNV_SV calls against the HG002 dipcall and HiFi PBSV benchmarking datasets within the high-confidence region [7].

As illustrated in Figure 1, DRAGEN™ 4.2.4 demonstrates a strong correlation with the dipcall and HiFi benchmarking datasets, suggesting its proficiency in identifying CNV DEL events across various event lengths, especially those under the 10 kbp threshold. On an event-level assessment, the performance of the 4.2.4 cnv_sv caller is closely aligned with the 3.10.4 bcftools merged calls. However, it’s worth noting that performance outcomes vary depending on the choice of benchmarking dataset. For instance, when using dipcall as the benchmarking reference, DRAGEN™’s precision in the 20-50 kbp and >50 kbp size ranges is notably higher than when utilizing HiFi PBSV as the benchmarking dataset. One possible explanation for this is the limitation of HiFi in detecting extra-long CNV events. Our results highlight both DRAGEN™ CNV caller performance and the impact of the selected benchmarking dataset on performance evaluation.
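To show why the choice of benchmark moves the numbers, here is a simplified Python sketch of event-level precision and recall. It matches query and truth intervals on a single chromosome using a 50% reciprocal-overlap rule. That matching rule, the function names, and the toy intervals are assumptions for illustration; this is not how Witty.er matches events.

```python
Interval = tuple[int, int]  # (start, end) on one chromosome, end exclusive


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Overlap length divided by the longer interval's length (0.0 if disjoint)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def precision_recall(
    query: list[Interval], truth: list[Interval], min_ro: float = 0.5
) -> tuple[float, float]:
    """Event-level precision and recall under a reciprocal-overlap threshold."""
    matched_query = sum(
        any(reciprocal_overlap(q, t) >= min_ro for t in truth) for q in query
    )
    matched_truth = sum(
        any(reciprocal_overlap(q, t) >= min_ro for q in query) for t in truth
    )
    precision = matched_query / len(query) if query else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall


# Toy intervals (hypothetical coordinates).
query_calls = [(100, 200), (500, 900), (2000, 2100)]
truth_events = [(110, 210), (480, 880), (5000, 5200), (7000, 7100)]

precision, recall = precision_recall(query_calls, truth_events)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that the third query call counts against precision simply because no truth event overlaps it. If the truth set is missing a real event at that position, the call is penalized as a false positive even though it is correct.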

FIGURE 1. Precision and Recall of DEL Events by DRAGEN™ CNV in HG002 (hg38) Using Dipcall and HiFi PBSV as Benchmarking Datasets

For CNV detection, identifying DUP events has long been acknowledged as more challenging than identifying DEL events. This challenge can be attributed to the lack of sensitivity of both depth signal and junction signal in accurately distinguishing a change in copy number [8]. Consequently, neither of the benchmarking datasets we use differentiates between insertions and duplications, which poses a challenge in assessing the performance of the DRAGEN™ caller for DUP events. To address this limitation, we used an external tool, SVWIDEN, to annotate DUP events within the benchmarking datasets. Given the complexity of DUP event detection, it came as no surprise that our DUP event evaluation yielded less ideal results (Figure 2). Moreover, we also observed performance discrepancies across the benchmarking datasets, pointing to the incompleteness of these datasets for DUP events. This prompted us to explore an alternative alignment-based method to evaluate CNV events.

FIGURE 2. Precision and Recall of DUP Events by DRAGEN™ CNV in HG002 (hg38) Using Dipcall and HiFi PBSV as Benchmarking Datasets

A New Approach: Alignment-Based CNV Evaluation

In this method, we use an alignment search tool (LAST [9]) to identify CNV event sequences relative to both the HG002-T2T v0.9 assembly and the hg38 reference genome. We align the DNA sequence representing a DUP variant called in hg38 to the HG002-T2T reference. A TP DUP event is characterized by a higher copy number in HG002-T2T than in hg38, while a FP DUP event shows fewer copies in HG002-T2T than in hg38. Since the hg38 assembly is haploid and the HG002-T2T assembly is diploid, we assume a copy-neutral event has one copy in hg38 and two copies (one each for the maternal and paternal haplotypes) in HG002-T2T. A copy is defined as a LAST alignment where the alignment length is at least 95% of the query length and the percent identity of the alignment is at least 95%. This method allows us to identify not only whether a CNV event is correct but, in the case of duplications, also the precise number, locations, and genotype of duplication events, even if they occur on separate chromosomes from the original call on hg38.
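The decision rule described above can be sketched in Python as follows. This is a hedged reading of the text, not the code used for the post: the `Alignment` record, the function names, and the example alignments are hypothetical, and parsing of actual LAST output is left out. The rule compares the diploid copy count in HG002-T2T against twice the haploid count in hg38.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    target: str          # "hg38", "maternal", or "paternal"
    aligned_length: int  # length of the alignment in bases
    identity: float      # fraction of identical bases, 0.0 to 1.0


def is_copy(aln: Alignment, query_length: int,
            min_fraction: float = 0.95, min_identity: float = 0.95) -> bool:
    """A copy covers at least 95% of the query at 95% identity or better."""
    return (aln.aligned_length >= min_fraction * query_length
            and aln.identity >= min_identity)


def count_copies(alignments: list[Alignment], query_length: int) -> Counter:
    """Count qualifying copies per target assembly or haplotype."""
    return Counter(a.target for a in alignments if is_copy(a, query_length))


def classify_dup(copies: Counter) -> str:
    """Compare diploid HG002-T2T copies with the copy-neutral expectation."""
    expected = 2 * copies["hg38"]  # one hg38 copy implies two diploid copies
    observed = copies["maternal"] + copies["paternal"]
    if observed > expected:
        return "TP"
    if observed < expected:
        return "FP"
    return "copy-neutral"


# Toy alignments for an 11,818 bp query (hypothetical lengths and identities).
query_length = 11_818
alignments = (
    [Alignment("hg38", 11_818, 1.00)]
    + [Alignment("maternal", 11_800, 0.99)] * 7
    + [Alignment("paternal", 11_805, 0.99)]
    + [Alignment("maternal", 9_000, 0.99)]   # too short to count as a copy
    + [Alignment("paternal", 11_818, 0.90)]  # identity too low to count
)

copies = count_copies(alignments, query_length)
verdict = classify_dup(copies)
print(dict(copies), verdict)
```

Treating an exact match with the expectation as a separate "copy-neutral" outcome is our choice; the post only defines the higher and lower cases.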

By using this method, we scrutinized certain DUP events that were absent in both HG002 benchmarking datasets we used. Our objective was to figure out whether these events were indeed false positives (FPs) or if their classification was a byproduct of the limitations inherent in the benchmarking datasets we relied upon. As shown in Figure 3B, a specific DRAGEN™-called DUP event with depth signal on chromosome 10 (chr10) attracted our attention. We identified seven copies on HG002-T2T’s maternal chr10 and one copy on the paternal chr10. However, only one copy exists in the corresponding region in the hg38 genome. Although this DUP event was not present in either the dipcall or HiFi PBSV VCF, our alignment results strongly suggest that the genomic region chr10:39364454-39376272 in hg38 indeed represents a genuine DUP event in HG002. Furthermore, our alignment-based evaluation method unveiled a valuable insight into the mechanism behind this DRAGEN™-called DUP event, revealing it as a heterozygous tandem duplication event.

FIGURE 3. Detailed Analysis of a Unique DUP Event. A. An Integrated Genomics Viewer (IGV) screenshot displays a genomic region of interest. The highlighted red region represents a DUP event uniquely called by DRAGEN™, and absent in both HG002 dipcall VCF and HiFi PBSV VCF benchmarking datasets. Additionally, two green-highlighted DEL events are observed, with their benchmarking variants present in HG002 dipcall VCF but not in HiFi PBSV VCF. B. The alignment search results using LAST for the red-highlighted DUP event in 3A. Copies of this event are shown in hg38, HG002 maternal, and HG002 paternal chromosomes.

Obtaining a precise depiction of the genomic landscape of CNV events is a challenging quest. Our alignment-based evaluation method has proven itself as a powerful tool, shedding light not only on individual DUP events but also on the intricate relationships between multiple CNV events. Figure 4A depicts two DRAGEN™-called DUP events positioned approximately 70 kbp apart. Neither DUP event was found in the benchmarking datasets that we used. What amplifies the intrigue is that DUP event a (chr1:16605769-16645359) and DUP event b (chr1:16715827-16727637) challenge the conventional spatial arrangement depicted in hg38 coordinates. According to hg38 coordinates, DUP event a occurs before DUP event b. However, our alignment results reveal a different scenario: on both the maternal and paternal chromosomes of HG002-T2T, DUP event b precedes event a in sequence (Figure 4B). This intriguing twist is accompanied by a consistent presence of two copies on the maternal side and three copies on the paternal side, suggesting both DUP events are TPs and might be located on the same haplotype. Furthermore, our method reveals these copies to be dispersed, suggesting both events are homozygous dispersed DUP events. Our alignment-based CNV evaluation method not only unveils the intricate details of DUP events but also illuminates their spatial distribution, expanding our understanding of the complexity of CNV events.
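One way to read the heterozygous and homozygous labels used in these examples is to ask which haplotypes carry more copies than hg38. The Python sketch below encodes that reading. It is our interpretation rather than a rule stated in the post, the function name `dup_zygosity` is hypothetical, and the hg38 copy count of one for the chr1 events is an assumption, since the text does not state it.

```python
def dup_zygosity(hg38_copies: int, maternal_copies: int, paternal_copies: int) -> str:
    """Label a duplication by which haplotypes exceed the hg38 copy count."""
    maternal_gain = maternal_copies > hg38_copies
    paternal_gain = paternal_copies > hg38_copies
    if maternal_gain and paternal_gain:
        return "homozygous"
    if maternal_gain or paternal_gain:
        return "heterozygous"
    return "no gain"


# chr10 event: one copy in hg38, seven maternal, one paternal (from the text).
chr10_label = dup_zygosity(hg38_copies=1, maternal_copies=7, paternal_copies=1)

# chr1 events: two maternal and three paternal copies (from the text);
# one copy in hg38 is an assumption for this example.
chr1_label = dup_zygosity(hg38_copies=1, maternal_copies=2, paternal_copies=3)

print("chr10:", chr1_label if False else chr10_label)
print("chr1:", chr1_label)
```

Under this reading the chr10 event comes out heterozygous and the chr1 events homozygous, matching the labels in the text. Distinguishing tandem from dispersed duplications would additionally require the coordinates of each copy, which this sketch ignores.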

FIGURE 4. Investigation of Two Adjacent DUP Events Identified by DRAGEN™. A. An IGV screenshot displays the genomic region containing DRAGEN™-called DUP event a (chr1:16605769-16645359) and DUP event b (chr1:16715827-16727637). Three genomic tracks from top to bottom include DRAGEN™ 4.2.4 output cnv_sv.vcf, HG002-T2T v0.9 dipcall VCF, HG002 HiFi PBSV VCF. B. Alignment results for DUP event a and DUP event b. Copies of DUP event a and DUP event b are shown in hg38, HG002 maternal, and HG002 paternal chromosomes.

In addition, we confronted an interesting case involving a DUP event located at chr17:22130212-22156606, a genomic region that also lacked any support from the benchmarking datasets we relied upon (Figure 5A). Unfortunately, our alignment-based method found no supporting evidence for this 26kbp DUP event, which had depth-only signal (SVCLAIM=D). As shown in Figure 5B, we observed only one copy of this sequence on the HG002-T2T maternal chromosome and none on the paternal chromosome. This observation defied our hypothesis, as we anticipated that a genuine DUP event should have two or more copies of this sequence on either the maternal or paternal chromosome. The presence of just a single copy on the HG002-T2T maternal chromosome strongly points to a putative false positive DUP event. It’s worth noting that our alignment-based evaluation method imposes stringent requirements on the precision of genomic coordinates. Thus, while it remains plausible that this genomic region harbors a genuine DUP event, its event length might be smaller than what’s indicated by the DRAGEN™ CNV calling algorithm.

FIGURE 5. A Detailed Investigation of a DUP Event Identified by DRAGEN™ CNV. A. An IGV screenshot displays the genomic region harboring one DRAGEN™-called DUP event chr17:22130212-22156606. Three genomic tracks from top to bottom include DRAGEN™ 4.2.4 output cnv_sv.vcf, HG002-T2T v0.9 dipcall VCF, HG002 HiFi PBSV VCF. B. Alignment results for DRAGEN™-called DUP Event chr17:22130212-22156606. Copies of this DUP event are shown in hg38, HG002 maternal, and HG002 paternal chromosomes.

Overall, our alignment-based CNV evaluation method represents a pioneering approach in the intricate realm of CNV events. By harnessing the power of the HG002-T2T genome assembly, we’ve unlocked the ability to not only unveil CNV events in finer resolution but also delve deeper into spatial relationships and characteristics of adjacent CNV events.

Conclusion and Future Work

In this work, we evaluated the performance of the DRAGEN™ v4.2.4 CNV caller, utilizing dipcall events from the HG002-T2T v0.9 assembly and HiFi PBSV data as our benchmarking datasets. It’s worth noting that some of the FPs initially identified using this method ended up revealing limitations within the benchmarking datasets. The limitations inherent in conventional evaluation approaches prompted us to develop a new alignment-based methodology that harnesses the HG002-T2T genome assembly. Using this novel approach, we found that some of the query FPs suggested by the benchmarking datasets were actually TP events, highlighting the room for improvement in benchmarking datasets. Conversely, some FPs were indeed true FPs, providing valuable insights that can inform methods or filtering strategies in future CNV callers.

While we have made significant progress in understanding CNV events within the HG002 sample, a full validation of all DRAGEN™-called CNV events using our alignment-based approach is yet to be completed. As we anticipate a growing number of complete genomes emerging from the T2T effort, we hope to streamline and refine this alignment-based CNV evaluation method so that it can be used to evaluate CNV events across a broader spectrum of genomes.

References

  1. Ruderfer D, et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat Genet, 2016
  2. Yoon S, et al. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res, 2009
  3. https://www.illumina.com.cn/content/illumina-marketing/spac/en_AU/destination/Webinar-Illumina-DRAGEN-4-2-Enhanced-machine-learning-new-targeted-callers-and-more.html
  4. Wang T, et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature, 2022
  5. Rautiainen M, et al. Verkko: telomere-to-telomere assembly of diploid chromosomes. Nature, 2022
  6. Zook J, et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol, 2022
  7. Ebert P, et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science, 2021
  8. Teo SM, et al. Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics, 2012
  9. Kielbasa S, et al. Adaptive seeds tame genomic sequence comparison. Genome Res, 2011
Important note for this blog: Posts do not equal endorsements. Opinions expressed in this blog are those of the author, on behalf of the genomics group at Broad. We make every effort to ensure the accuracy of data/figures presented here but these are not peer reviewed and errors may occur from time to time.

The post Seeking Truth: Solving CNV Evaluation Challenges with T2T Genome Assembly appeared first on Broad Clinical Labs.

]]>