TileDB

 
Unlock the future of newborn genetic screening

Enhance your ability to identify rare diseases in newborns, find the right populations for drug trials and connect with leading researchers through TileDB’s powerful expansion of the BeginNGS consortium.


Schedule a discovery call

TRUSTED CUSTOMERS & PARTNERS

Quest Diagnostics
Takeda
Rady Children’s Institute
Chan Zuckerberg Initiative
Boehringer Ingelheim
Amgen
Cellarity
Alexion
PhenomicAI
AWS
Zifo

TileDB is allowing us now to do things that were hitherto not possible: It's not just a matter of running complex queries, it's a matter of running hundreds of concurrent complex queries on dramatically expanding genomic data, which is key for diagnostics in the NICU to guide the right treatments now and into the future with gene therapies.


Dr. Stephen Kingsmore
President & CEO, Rady Children’s Institute for Genomic Medicine

Efficient genomic analysis can make or break care in those critical hours after birth.

Rapid whole genome sequencing speeds diagnosis and treatment decisions for genetic diseases in infants—if used effectively.

The complexity of variant data stretches the limits of status quo and DIY compute and data management solutions, hindering your ability to find the right targets for key genetic insights during the vital early days of life.

97%
reduced variant data storage costs
 
<30 sec
to run complex variant queries
 
all within a Trusted Research Environment

Elevate Care for Rare Diseases


For Biopharmas

Access consortium data to glean novel variants for rare diseases, and guide affected populations into timely interventions.


For Hospitals

Share newborn genomic variant data with the BeginNGS consortium to expand the dataset and identify true-positive variants to power rapid diagnosis.

Run the BeginNGS federated query on TileDB


TileDB is an end-to-end modern genomics workbench

Why TileDB?

The platform for fast and scalable analysis of variant datasets

Power variant analysis at biobank scale

TileDB offers a highly scalable computational platform, combining an efficient storage format with the distributed power of the cloud.

Run complex variant queries in less than 30 seconds 

TileDB delivers analysis of newly sequenced genomes at unprecedented speed, achieving a seven-hour clinical turnaround of diagnoses that once took days.

Reduce storage costs by up to 97%

TileDB can store tables, variants, single-cell data, imaging, and proteomics data while reducing costs by up to 97% compared to file-based approaches.

Secure global collaboration

TileDB enables governed, secure collaboration across international databases, transcending individual projects and geographic boundaries while implementing FAIR practices and adhering to HIPAA and SOC 2 Type 2 compliance.

Support federated queries between namespaces and organizations

TileDB's Trusted Research Environment facilitates precise analysis of comprehensive genomic repositories such as UK Biobank while maintaining rigorous privacy protocols, safeguarding sensitive data like sample identities and individual genetic profiles. 

Explore TileDB features

 

TileDB allows research teams to collaborate on insights using dashboards, notebooks, and visualizations.

The TileDB Extensible Variant Browser is a dashboard application that enables users to build their own stack for viewing genomic variants on TileDB-VCF datasets.


Explore UK Biobank GWAS results stored in an AWS S3 TileDB array.


 

The PCA tutorial uses transformations to produce dosage matrices directly from a TileDB-VCF dataset. Using ancestry-informative marker sets, we can replicate the familiar boomerang-shaped decomposition of human world populations.
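To make the term concrete, here is a minimal sketch of what a dosage matrix is: each genotype call is reduced to a count of alternate alleles. This is plain Python over a hand-written toy dictionary, not the TileDB-VCF API. The names `CALLS`, `dosage`, and `dosage_matrix` are hypothetical, and treating an absent record as homozygous reference is an assumption made only for this illustration.

```python
from typing import Dict, List, Optional, Tuple

# A genotype is a pair of allele indices:
# 0 = reference, 1 or higher = alternate, None = missing call.
Genotype = Tuple[Optional[int], Optional[int]]

# Hypothetical toy calls keyed by (sample, variant).
CALLS: Dict[Tuple[str, str], Genotype] = {
    ("S1", "chr1:1000:A>G"): (0, 1),
    ("S1", "chr1:2000:C>T"): (1, 1),
    ("S2", "chr1:2000:C>T"): (None, None),
}


def dosage(gt: Genotype) -> Optional[int]:
    """Count alternate alleles; return None if any allele is missing."""
    if any(allele is None for allele in gt):
        return None
    return sum(1 for allele in gt if allele > 0)


def dosage_matrix(
    calls: Dict[Tuple[str, str], Genotype],
    samples: List[str],
    variants: List[str],
) -> List[List[Optional[int]]]:
    """Build a samples x variants matrix of alternate-allele counts.

    Assumption for this sketch: a (sample, variant) pair with no record
    is treated as homozygous reference, i.e. dosage 0.
    """
    return [
        [dosage(calls.get((sample, variant), (0, 0))) for variant in variants]
        for sample in samples
    ]


matrix = dosage_matrix(CALLS, ["S1", "S2"], ["chr1:1000:A>G", "chr1:2000:C>T"])
```

A matrix of this shape (samples by variants) is the usual input to PCA, which is why it is worth producing directly from the variant store.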


View notebook

The eQTL analysis uses the 1000 Genomes TileDB-VCF dataset from DRAGEN and the corresponding GEUVADIS bulk RNA-seq data, ingested as a TileDB-SOMA dataset, to perform an expression quantitative trait locus analysis using the TensorQTL tool described in: Taylor-Weiner, A., et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol 20, 228 (2019).


The rationale for BeginNGS query federation is to overcome a major impediment to genome-based newborn screening (gNBS) - namely imprecision due to variants classified as pathogenic (P) or likely pathogenic (LP) that are not SCGD causal. Federated training predicated on purifying hyper-selection provides a general framework to attain high precision in population screening. Federated training across many biobanks and clinical trials can provide a privacy-preserving mechanism for qualification of gNBS in diverse genetic ancestries.


FAQs

01. I am at a new BeginNGS enrollment site. What are my options for accessing and supporting federated queries?

There are many options for variant warehousing, including:
  • Traditional relational databases
  • Spark & Parquet based warehouses
  • Query engines
  • Distributed SQL & NoSQL
  • Cloud-vendor managed solutions
Each of these solutions presents disadvantages including lack of scalability, lack of flexibility, missing features, and support issues.

TileDB-VCF is a multidimensional array solution, an API that offers intuitive methods for ingesting, querying, and exporting variant data. TileDB-VCF is not a bespoke domain solution. It’s built on the same adaptive TileDB array model that can be used to capture any data modality. Variant data happens to be sparse. Variants themselves are surrounded by thousands of non-variant loci. TileDB is uniquely suited to represent this type of sparse data without necessitating intermediate data structures, massive transformations, or long running pre-processing steps.
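The idea of sparse storage can be sketched in a few lines. The toy class below keeps a cell only where a sample actually has a variant, so the thousands of non-variant loci in between take no space. It is an illustration of the concept only and not how TileDB is implemented; `SparseVariantStore` and its methods are hypothetical names, and the query does a simple linear scan where a real array engine would use an index.

```python
from typing import Dict, Iterable, List, Tuple


class SparseVariantStore:
    """Toy sparse 2D store keyed by (sample, position).

    Only variant sites are stored; non-variant loci occupy no memory.
    """

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, int], str] = {}

    def add(self, sample: str, pos: int, alt: str) -> None:
        """Record one variant for one sample."""
        self._cells[(sample, pos)] = alt

    def query(
        self, samples: Iterable[str], start: int, end: int
    ) -> List[Tuple[str, int, str]]:
        """Return sorted (sample, pos, alt) cells in the inclusive range."""
        wanted = set(samples)
        return sorted(
            (sample, pos, alt)
            for (sample, pos), alt in self._cells.items()
            if sample in wanted and start <= pos <= end
        )

    def __len__(self) -> int:
        return len(self._cells)


store = SparseVariantStore()
store.add("S1", 1_000, "G")
store.add("S1", 250_000, "T")
store.add("S2", 1_000, "G")

# Slice a genomic range for a subset of samples.
hits = store.query(["S1"], start=1, end=1_500)
```

Three cells are stored even though the positions span a quarter of a million loci, which is the property that makes a sparse representation a natural fit for variant data.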

Federated queries in TileDB are implemented using user-defined functions (UDFs) which allow the data owners to communicate complex, secure queries across organizational barriers.
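The pattern behind a UDF-based federated query can be sketched as follows: the same function is run against each site's data, and only aggregate results leave the site. This is a simplified simulation in plain Python, not the TileDB UDF API. The site dictionaries, `make_count_udf`, and `run_federated` are hypothetical, and the local function calls stand in for execution inside each data owner's own environment.

```python
from typing import Callable, Dict, List

# Hypothetical per-site data: variant identifier -> number of carriers.
SITE_A: Dict[str, int] = {"chr1:1000:A>G": 3, "chr2:500:C>T": 1}
SITE_B: Dict[str, int] = {"chr1:1000:A>G": 2}

# A UDF takes one site's local data and returns only aggregate counts.
Udf = Callable[[Dict[str, int]], Dict[str, int]]


def make_count_udf(variants: List[str]) -> Udf:
    """Build a function that reports carrier counts for the given variants."""

    def udf(local_data: Dict[str, int]) -> Dict[str, int]:
        # Only counts for the requested variants are returned;
        # no sample-level records leave the site.
        return {variant: local_data.get(variant, 0) for variant in variants}

    return udf


def run_federated(udf: Udf, sites: List[Dict[str, int]]) -> Dict[str, int]:
    """Run the UDF at every site and sum the per-site aggregates."""
    totals: Dict[str, int] = {}
    for site in sites:
        for variant, count in udf(site).items():
            totals[variant] = totals.get(variant, 0) + count
    return totals


query = make_count_udf(["chr1:1000:A>G", "chr3:1:G>A"])
totals = run_federated(query, [SITE_A, SITE_B])
```

The coordinator never sees which individuals carry a variant, only the combined counts, which is the property a privacy-preserving federated query relies on.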

02. I have a data analyst. Why do I need a database for this information?

A skilled data analyst still needs a proper database solution for population genomics data because:

File-based approaches collapse under the weight of hundreds of thousands of samples, each with millions of variants. TileDB-VCF efficiently manages this scale while solving the "N+1 problem" of adding new samples.

Without a database, an analyst wastes time on data management instead of insights. They'll struggle with slow queries across large cohorts and face barriers integrating variant data with clinical information or other omics data. Team collaboration becomes problematic without centralized data access, leading to inconsistencies and reproducibility challenges. Security concerns also arise when handling sensitive genomic information through files.

A database doesn't replace the data analyst—it empowers them with faster queries, better integration capabilities, and more time for actual analysis instead of data wrangling.

03. We are at a hospital, just started sequencing. Why should I use TileDB?

As your hospital begins sequencing, you'll quickly generate vast amounts of genomic data that traditional file-based approaches can't efficiently handle. TileDB-VCF provides a scalable solution specifically designed for clinical genomics, allowing you to securely manage patient variant data while maintaining HIPAA compliance.

Importantly, TileDB enables federated queries with consortiums like BeginNGS for newborn screening, letting you securely share and analyze data across institutions without compromising patient privacy. This connectivity expands your diagnostic capabilities while maintaining data governance.

TileDB also solves the technical challenges of adding new patient samples, integrating with clinical data, and providing the performance needed for timely clinical interpretation—critical requirements for implementing genomic medicine in your hospital setting.

04. What problems with traditional VCF files does TileDB-VCF solve?

Traditional VCF files become extremely inefficient at scale. They're monolithic, making it difficult to add new samples, lack proper database functionality for fast queries, struggle with large-scale storage, and don't integrate well with other data types. TileDB-VCF addresses these limitations by providing a database-like solution optimized for genomic data that maintains the complete information of VCF files without compromise.

05. How does TileDB-VCF handle the enormous amounts of variant data in population genomics?

TileDB-VCF uses a multi-dimensional array data model that efficiently compresses and indexes variant data, allowing for rapid slicing of genomic regions across any number of samples. It's optimized for cloud storage and can scale linearly with the addition of new samples, making it ideal for biobanks and large cohort studies that may include hundreds of thousands of sequenced individuals.

06. Can TileDB-VCF help with analyzing variant data alongside other types of genomic information?

Yes! TileDB-VCF can be integrated with other omics data through TileDB SOMA for multi-omic experiments. This allows researchers to link transcriptomes or other data types from the same subjects for genotype-to-phenotype analyses or gene-by-environment studies. It also supports AI/ML workflows by generating genotype or dosage matrices that can be used as features for prediction models.

07. How does TileDB-VCF perform compared to traditional approaches?

TileDB-VCF significantly outperforms traditional VCF-based approaches for large-scale variant data. It's optimized for rapidly slicing variant records by genomic regions across arbitrary numbers of samples. The implementation ensures maximum speed for ingestion and queries, and its columnar format allows for efficient compression tailored to different data types within the VCF.

08. Is TileDB-VCF suitable for sensitive human genomic data?

Yes, TileDB provides advanced security features including user-configurable encryption, configurable access policies, secure sharing capabilities, and comprehensive logging for auditing. It's SOC 2 Type 2 and HIPAA compliant, making it suitable for managing sensitive human genomic data in clinical or research settings.

09. What kind of organizations are using TileDB for population genomics?

TileDB is used across multiple sectors dealing with large-scale sequencing: biopharma companies using biobank data for drug development, rare disease consortiums conducting newborn screening, microbiologists studying infectious diseases like SARS-CoV-2, and agricultural researchers working with complex crop genomes. It has been battle-tested in production environments with hundreds of thousands of samples.

10. How does TileDB-VCF maintain the richness of variant data while improving performance?

Unlike some solutions that simplify data to achieve better performance, TileDB-VCF stores variant data in a lossless manner, preserving all the information from the original VCF files. Its columnar format allows it to compress different VCF fields with different compressors based on the data types, optimizing both storage and query performance without sacrificing data fidelity.
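A small sketch can show how a columnar layout allows per-field compression without losing information. The example below stores each field as its own column, compresses each column with its own codec, and checks that the records round-trip exactly. It uses only the Python standard library and is not TileDB's storage engine; the `CODECS` mapping is an arbitrary assumption for illustration, not TileDB's actual filter configuration.

```python
import json
import lzma
import zlib
from typing import Any, Dict, List

# Hypothetical toy variant records.
RECORDS: List[Dict[str, Any]] = [
    {"pos": 1000, "ref": "A", "alt": "G", "qual": 50.0},
    {"pos": 1010, "ref": "C", "alt": "T", "qual": 48.5},
    {"pos": 2500, "ref": "G", "alt": "A", "qual": 60.0},
]

# Assumed codec per field; the choice here is arbitrary and only shows
# that each column can be compressed independently.
CODECS = {"pos": zlib, "ref": zlib, "alt": zlib, "qual": lzma}


def encode(records: List[Dict[str, Any]]) -> Dict[str, bytes]:
    """Split records into columns and compress each with its own codec."""
    columns: Dict[str, bytes] = {}
    for field, codec in CODECS.items():
        raw = json.dumps([record[field] for record in records]).encode("utf-8")
        columns[field] = codec.compress(raw)
    return columns


def decode(columns: Dict[str, bytes]) -> List[Dict[str, Any]]:
    """Decompress every column and reassemble the original records."""
    decoded = {
        field: json.loads(CODECS[field].decompress(blob))
        for field, blob in columns.items()
    }
    n_rows = len(next(iter(decoded.values())))
    return [{field: decoded[field][i] for field in CODECS} for i in range(n_rows)]


columns = encode(RECORDS)
restored = decode(columns)
```

Because every codec used here is lossless, `restored` is identical to the input, and a reader that needs only one field can decompress that column alone.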

Ready to enhance genomics analysis for rare diseases in newborns?

Christina Pucci
Sr. Business Development Representative
George Vlastarakis
Sales Development Representative

Schedule a discovery call
with our team of experts

© TileDB, Inc.