Building a platform for managing, analyzing and sharing biomedical big data (Xây dựng nền tảng quản lý, phân tích và chia sẻ dữ liệu lớn y sinh học)

Structure

  • Table of Contents

  • CHAPTER 1.

  • CHAPTER 2.

  • CHAPTER 3.

  • CHAPTER 4.

  • CHAPTER 5.

  • CONCLUSION AND PERSPECTIVES

  • PUBLICATIONS

  • REFERENCES

Contents

Motivation

This thesis examines the challenges faced by research projects focused on the human genome. To illustrate these issues, we explore a compelling narrative surrounding ongoing research in this field.

In the early 2000s, the Human Genome Project was successfully completed after 13 years of extensive research involving around 1,000 scientists and an investment of over $3 billion. This landmark achievement significantly advanced the field of genomics and marked a pivotal moment in scientific history.

● Today, thanks to advances in sequencing technology, a human genome can be decoded in only a few days for approximately 1,000 USD.

In the coming years, the cost of decoding a human genome is expected to drop to under $100, leading to an explosion of genomic data that presents significant challenges in data analysis and management. To address these issues, various projects have been initiated, including the Database of Genomic Variants for the Vietnamese Population Project (DGV4VN), funded by Vinbigdata. This initiative has developed a comprehensive biomedical data platform called MASH, designed to effectively manage, analyze, and share large-scale biomedical data.

System’s Main Objective

When developing the MASH system, we aimed to create a scalable platform capable of storing over 1,200 terabytes of rapidly growing data. The system is essential for sharing data with our partners and the research community, ensuring that the information remains Findable, Accessible, Interoperable, and Reusable. A significant challenge for any big data platform is maintaining high performance, which is also a key objective for MASH.

System Requirements

Functional Requirements

MASH comprises four main functional groups: management, analysis, sharing, and visualization of biomedical data. The specific functions are set out as follows:

Integrate and manage diverse data from multiple projects, including the current DGV4VN initiative and future endeavors such as cancer genomics and various health-related data, as well as projects across different fields.

The system enables the updating and retrieval of data according to specific project data models, facilitates collaboration between experts and system developers to define data models in a compatible format, and allows for the integration of analysis workflow results into the system, adhering to the established data models.

Effectively manage, analyze, share, and harmonize large data sources ranging from petabytes to exabytes. Data is stored in object storage, either on-premises or in the cloud, within a system that includes essential components for data access, such as databases.

Have an access control mechanism suitable for project data sources and ensure the security requirements specific to each data field, including personal data privacy in the biomedical discipline.

To enhance user experience, it is essential to offer suitable data retrieval and display interfaces for the public, enabling users to conduct downstream analysis effectively. Additionally, advanced data search and query interfaces should be provided for biomedical experts to facilitate in-depth research and analysis.

The system is designed to seamlessly integrate standard analysis workflows for each project, enabling the automatic execution of numerous processing tasks simultaneously, handling anywhere from thousands to tens of thousands of inputs.

The essential functional requirements outlined above are crucial for addressing the challenges of a big data management system. However, to develop a system that is user-friendly and scalable for future needs, it is vital to consider the accompanying non-functional requirements detailed in the next section.

Non-functional Requirements

In distributed systems, non-functional requirements are essential, but for large-scale data management systems like MASH, the FAIR principles—Findable, Accessible, Interoperable, and Reusable—take precedence.

The system is structured as microservices to effectively manage and analyze large, diverse datasets from various projects, adhering to the FAIR principles of Findability, Accessibility, Interoperability, and Reusability.

High performance: the system must be able to receive and process a large number of data files as well as analysis pipelines within a short response time.

Scalability: the system can be easily scaled both horizontally (adding containers or VMs) and vertically (adding hardware resources) without interfering with the running services.

Portability and backward compatibility: the system can easily be moved between deployment environments without much time spent reconfiguring parameters.

Stability, high availability, and maintainability: backup services and high availability need to be implemented to ensure that system services are not affected when faults occur in the main services.

Security: the system must resist attacks such as DDoS and SQL injection, and ensure that system data cannot be accessed by unauthorized third parties.

Usability: the user interface should be designed to be user-friendly, targeting doctors, bioinformaticians, and biologists.

Error monitoring and reporting: deploy services to monitor system resources and health, and send alerts and emails to the teams when errors occur or system resources hit a threshold.

Logging: ensure that all system activities are logged, centrally manage logs for services, and visualize the logs of services in the system, making debugging easy.

Continuous Integration and Continuous Delivery (CI/CD): to achieve seamless automatic integration and delivery, CI/CD processes must be implemented to facilitate the ongoing examination, integration, and deployment of the system. The design and implementation of the system reflect these non-functional requirements: MASH is structured around a microservices architecture and deployed using container orchestration platforms to meet them.

Main Contributions

From the above objectives, this master thesis has conducted the research "Building a platform for managing, analyzing, and sharing biomedical big data".

The developed system features essential functions for managing, sharing, analyzing, and visualizing data, resulting in significant contributions. Firstly, we propose a flexible architecture that can be deployed on either cloud or on-premises environments and is capable of simultaneously handling data from various sources. Secondly, our system offers a data model specifically designed for efficient management of raw biomedical data. Lastly, we provide a method and data model that significantly reduce the time required for data insertion and retrieval.

To enable the contributions, the master thesis is presented through the following chapters:

- Chapter 2: Theoretical background on MASH construction

This chapter explores key theories in genomics and examines the big data technologies essential for research and system development. It details the various input file types used within the system and provides a review of comparable systems in the field.

- Chapter 3: MASH system design and development

The content of this chapter introduces the overall architecture of the MASH system and the technologies used in MASH

- Chapter 4: Solutions to speed up data insertion and querying

The content of this chapter introduces in detail how to speed up the process of inserting and querying data from the MASH system

- Chapter 5: MASH construction results

The content of this chapter provides some methods of system performance testing, test cases, and test results.

THEORETICAL BACKGROUND ON MASH

Distribution of Data Samples

The 1,000 Vietnamese Genomes Project analyzes genomic sequencing data from over 1,000 individuals of the Kinh ethnic group across 63 provinces in Vietnam. This extensive project is expected to generate up to 1,000 terabytes of data, encompassing files in FASTQ, BAM, and VCF formats.

(Figure 2.1 was captured from https://genome.vinbigdata.org)

To effectively process vast amounts of data, it is essential to utilize high-performance and scalable methods and tools. This topic is explored in detail in Section 2.3 on big data technologies, as well as in Chapter 3, which focuses on MASH system design and development, and Chapter 4, which presents solutions for accelerating data insertion and querying.

System Input Files

The system handles files in FASTQ, BAM, and VCF formats, storing them as data objects within its Data Lake, each assigned a globally unique identifier (GUID). These data objects are interconnected with clinical data, which is organized according to a specific data model design, while the clinical data itself is stored securely in PostgreSQL.

The actions needed for handling input data are as follows:

- Upload data to the Data Lake, and assign a GUID to each data object

- Filter and convert data from VCF files, and then upload the converted data to the Elasticsearch (www.elastic.co) database. These data sets are then displayed on the website.

FASTQ files are the output of genome sequencing machines, providing a text-based format that stores biological sequences along with their corresponding quality scores. Each read in a FASTQ file is represented by four lines, ensuring a structured presentation of the data (an illustrative record is sketched after the list below):

- The first line contains a sequence identifier of a read, initiating with the character ‘@’

- The second line is a DNA chain made up of the four nitrogenous bases A, C, G, and T

- The third line begins with the character ‘+’, and can possibly be followed by a string of identifiers, i.e., the identifier string can be repeated

- The fourth line is an ASCII string which encodes the Phred-scaled base quality score of each base of the DNA chain in the second line
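For illustration, a hypothetical FASTQ record (the identifier, sequence, and quality string below are made up) follows this four-line layout, with one ASCII-encoded quality score per base:

@SEQ_0001 sample_read/1
GATCTGGAAGTTCACCAGGT
+
IIIIHHHGGFFFEEDDCCBA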

SAM (Sequence Alignment Map) is a text format designed for storing sequence alignment data in tab-delimited ASCII columns. However, the substantial size of SAM files can lead to storage challenges and make it time-consuming to retrieve information from these large text files.

BAM (Binary Alignment Map) files are the binary equivalent of SAM files, containing identical information but in a more efficient format. By storing data in an indexed and compressed manner, BAM files are significantly lighter and allow for faster access than their SAM counterparts.

The Variant Call Format (VCF) is a text-based format used to represent genetic variants, consisting of a header and a body. The header comprises lines marked by a "#" symbol, while the body is made up of TAB-delimited fields. Each row in the body provides information about a specific variant, including annotations from the Variant Effect Predictor (VEP). The rightmost columns contain individual-specific data, while the other columns present general variant information, such as chromosome position, reference and alternate alleles, and various metrics like allele count and depth of coverage.

The following abbreviated excerpt shows part of a single VCF record: the end of its INFO field, including VEP annotations in the ANN subfield, followed by the FORMAT and per-sample genotype columns (runs of empty "|"-separated subfields are shortened to "…"):

2.072;NEGATIVE_TRAIN_SITE;QD=0.45;ReadPosRankSum=-2.751;culprit=SOR;ANN=C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305.2|transcribed_unprocessed_pseudogene|…|rs796688738|1|1881|1||insertion|1|HGNC|HGNC:37102|…|chr1:g.10131dup|…,C|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328.2|processed_transcript|…|rs796688738|1|1740|1||insertion|1|HGNC|HGNC:37102|YES||1|…|chr1:g.10131dup|…,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147.1|unprocessed_pseudogene|…|rs796688738|1|4275|-1||insertion|1|HGNC|HGNC:38034|YES|…|chr1:g.10131dup|…,C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000344264|CTCF_binding_site|…|rs796688738|1||||insertion|1|…|chr1:g.10131dup|…

GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP /.:159,.:0:367:0:PASS:.:.:.:

The content of a VCF file is relatively complicated, as it contains various pieces of information that require in-depth bioinformatics insight to fully understand.

In terms of data format, VCF is relatively similar to Tab-separated values (TSV).

Big Data Technologies

Processing and storing large volumes of data present significant challenges for any system, particularly with the rapid advancements in information technology. The amount of information generated every second, including genetic data, is growing exponentially. As the cost of next-generation sequencing has dramatically decreased, more researchers can access this technology, leading to a surge in data production. Consequently, when the number of sequenced samples accumulates into the thousands, existing systems struggle to manage and store this influx effectively.

A number of tools and methods have been introduced to address such issues, including Hadoop MapReduce [3, 4], Spark [5], and Elasticsearch.

In the map phase, Hadoop MapReduce transforms data into key-value pairs that are distributed across cluster nodes. During the reduce phase, intermediate results with identical keys are merged to produce the final output.

The MapReduce method is popular in bioinformatics, but Hadoop's reliance on reading and writing data to hard drives creates bottlenecks that hinder data processing performance.

Apache Spark was developed after Hadoop to resolve various challenges associated with it. Unlike Hadoop, which stores intermediate results on hard drives, Spark keeps data in RAM, significantly minimizing the delays caused by reading and writing to disk. As a result, Apache claims that Spark is up to 100 times faster than Hadoop.
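As a minimal sketch of this processing style (the input path and the per-chromosome count are illustrative assumptions, not part of the thesis), counting VCF records per chromosome with PySpark could look like the following:

from pyspark.sql import SparkSession

# Start a Spark session; in a cluster deployment the master URL and resources would differ.
spark = SparkSession.builder.appName("vcf-chrom-count").getOrCreate()

# Read the VCF as plain text and drop header lines, which start with '#'.
lines = spark.sparkContext.textFile("sample.vcf")  # hypothetical input path
records = lines.filter(lambda line: not line.startswith("#"))

# Map each record to (chromosome, 1) and reduce by key, MapReduce-style but held in memory.
counts = (records
          .map(lambda line: (line.split("\t")[0], 1))
          .reduceByKey(lambda a, b: a + b))

for chrom, n in counts.collect():
    print(chrom, n)

spark.stop()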

Elasticsearch is a powerful and scalable tool for data searching and analysis, functioning both as a search engine and as a NoSQL database management system. Its scalability allows for the efficient processing of large volumes of data without compromising the speed of indexing and querying.

A data lake is a repository that stores data in its raw format, accommodating unstructured, semi-structured, and structured data Typically established on-premises within a data center, a data lake utilizes digital IDs and associated metadata for data organization, but it does not require a predefined data model.

A data lake enables the efficient collection and integration of vast amounts of data from diverse sources, allowing users to collaborate and analyze information effectively This capability fosters improved and expedited decision-making processes.

Data warehouses serve as central repositories that integrate data from various sources, primarily storing relational data from transactional systems, operational databases, and business applications. The data within a data warehouse is highly transformed and structured, ensuring it is optimized for analysis. Importantly, data is only loaded into the warehouse once its intended use has been clearly defined.

The ETL (Extract, Transform, Load) process is a crucial data integration method that involves three key steps: extraction, transformation, and loading. This process is essential for merging data from various sources, primarily for the purpose of constructing a data warehouse. Initially, data is extracted from multiple data sources, followed by its transformation into a specific format. Finally, the transformed data is loaded into a data warehouse or another system for analysis and reporting.

A data mart is a specialized structure within data warehouse environments designed for retrieving client-facing data, serving as a subset focused on specific business lines or teams. Unlike data warehouses, which encompass enterprise-wide information, data marts contain data relevant to individual departments. In many cases, each department or business unit assumes ownership of its data mart, encompassing all associated hardware, software, and data.

- Data source: the data in the data source is the input to the ETL tools. These data sources can be internal file storage or public APIs.

- ETL tools: the ETL tools were developed specifically for the MASH system and must extract, transform, and load data with high performance. Chapter 4 outlines the methods used to develop these ETL tools, focusing on solutions to enhance data insertion speeds and optimize querying.

- Data warehouse: a data warehouse stores the output of the ETL tools, i.e., structured data that can be used for data querying and analysis.

- Access Tools: Tools are developed by users of the system for querying and analyzing data

Figure 2.4: Distributed Object Storage System Architecture [8]

The common architecture of large distributed storage systems, as illustrated in Figure 2.4, features a metadata server responsible for storing metadata This metadata is typically distributed using a hash derived from the file's full path or a combination of the directory path and filename.

Object-based storage devices manage data as distinct objects, unlike file storage, which organizes data in a hierarchical structure, or block storage, which handles data in blocks within sectors and tracks. Each object consists of the data, a variable amount of metadata, and a globally unique identifier, allowing for efficient data management without the need for a traditional file hierarchy. This metadata association simplifies data retrieval and enhances storage flexibility.

Object storage offers several advantages:

 Performance: Can offer better performance than a single server in some cases, for example, it can store data closer to its consumers, or enable massively parallel access to large files

 Distributing: Allow data to be stored across multiple regions

 Scalability: The primary motivation for distributed object storage is horizontal scaling, adding more storage space by adding more storage nodes to cluster

 Redundancy: Can store more than one copy of the same data for high availability, backup, and disaster recovery purposes

Object storage enables applications to manage data through programmatic interfaces that facilitate create, read, update, and delete (CRUD) operations. Additionally, certain object storage implementations offer enhanced features like replication, versioning, lifecycle management, and the ability to move objects between various storage types.

Some well-known implementations of object storage include Amazon S3, Scality RING (https://www.scality.com/products/ring/), OpenIO, and MinIO.
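As a rough sketch of this CRUD interface (the endpoint, credentials, bucket, and object names are assumptions), an S3-compatible store such as MinIO can be driven with boto3:

import boto3

# Connect to an S3-compatible endpoint; for Amazon S3 itself, endpoint_url is omitted.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # hypothetical MinIO endpoint
    aws_access_key_id="minioadmin",          # hypothetical credentials
    aws_secret_access_key="minioadmin",
)

bucket = "mash-datalake"                     # hypothetical bucket name
s3.create_bucket(Bucket=bucket)

# Create: upload a raw data file as an object.
s3.upload_file("sample_R1.fastq.gz", bucket, "fastq/sample_R1.fastq.gz")

# Read: download the object back.
s3.download_file(bucket, "fastq/sample_R1.fastq.gz", "copy_R1.fastq.gz")

# Update: in object storage an update is simply an overwrite of the same key.
s3.put_object(Bucket=bucket, Key="fastq/sample_R1.fastq.gz", Body=b"new content")

# Delete: remove the object.
s3.delete_object(Bucket=bucket, Key="fastq/sample_R1.fastq.gz")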

Literature Review

Genomic research has advanced significantly worldwide, highlighted by major initiatives such as the 1000 Genomes Project, the large-scale whole-genome sequencing of three diverse Asian populations in Singapore, and the DNA Data Bank of Japan (DDBJ), which supports genome-scale research in the life sciences. Additionally, the GenomeAsia 100K Project facilitates genetic discoveries across Asia, further enhancing our understanding of genomics.

With the large amount of data generated from such projects, these systems face a number of challenges as follows:

- The excessively large data sets cause difficulties for management and analysis

- Access control and data sharing for different analyses

- Data visualization and customization of data filtering

This review outlines typical features of systems built to address such challenges.

Developing systems on cloud-based platforms like Google Cloud Platform, Amazon Web Services, or Microsoft Azure and building a dedicated platform are comparable options for small systems. While cloud solutions eliminate capital costs, they often incur high operating expenses for processing large volumes of data. In contrast, building a separate platform involves significant initial investment but can lead to lower operating costs over time. For long-term projects, developing a dedicated platform therefore proves to be more cost-effective.


Some projects led by labs and universities have been developed on public cloud platforms, such as Galaxy Cloud [13, 14], Bionimbus Protected Data Cloud [15], and the Cancer Genome Collaboratory [16].

Data commons are designed to establish an interoperable resource for the research community by integrating data, storage, and computing infrastructure alongside widely used applications and tools for data management, sharing, and analysis. They are essential components of science-as-a-service frameworks.

Users can leverage computational resources, storage, and pre-installed software from data commons instead of managing data locally on personal computers. Notable examples of data commons include the Open Science Data Cloud (OSDC), operated by the Open Commons Consortium (OCC) since 2009, and the National Cancer Institute's Genomic Data Commons (GDC). Additionally, users can upload their own data to these platforms for analysis.

The main services of data commons mentioned above comprise:

(a) Authentication, access control and user authorization services

(b) Service to assign digital ID to data

(c) Service to link data to its corresponding metadata

(d) Data model and data integration strategy definition

(e) Workflow services that facilitate the execution of bioinformatics pipelines for data analysis, ensuring that data stored in the data commons meets the FAIR criteria—findable, accessible, interoperable, and reusable. Adhering to these principles is essential for the integrity and effectiveness of future analyses utilizing data from the data commons.

With a common data model (d), data commons are capable of curating and integrating data contributed by the community, and of running bioinformatics pipelines (e) to analyze and harmonize data.

Recent advancements in big data processing and cloud computing have led to the emergence of several innovative platforms designed for the research community. These systems enable users to efficiently manage, share, and analyze large data sets at low cost. Additionally, they foster a collaborative network where researchers can interact and exchange data sets and analysis results, enhancing the overall research experience.

The MASH system offers a versatile architecture suitable for both cloud and on-premises environments, together with solutions that enhance data insertion and retrieval speeds. Additionally, it incorporates robust mechanisms for data security and safety. We will also assess the system's performance to ensure optimal functionality.

MASH SYSTEM DESIGN AND DEVELOPMENT

Solution Overview

The system encounters challenges in managing and processing a diverse array of input data. To enhance data management efficiency, it employs graph data models. Additionally, for high-performance data processing while maintaining data integrity, the system utilizes ETL (Extract, Transform, Load) processes that include automated testing steps.

To create a versatile system compatible with both cloud and on-premises environments, a microservice architecture was implemented, utilizing object storage as a data lake for storing raw files.

Data Model

The system employs two distinct data models: the graph data model for managing raw data and the document data model for data exploration and visualization.

Utilizing the graph data model for raw data management effectively captures the relationships between data components, ensuring that all significant intermediate data and steps in the analysis pipeline are well represented. Each node in this model includes metadata and associated data files. Bioinformatic analyses, particularly genomic data processing, are resource-intensive and costly, partly because of the expert manpower needed for development and maintenance, so resources must be used carefully. By clearly defining these relationships, the system allows for the strategic storage of essential intermediate data while avoiding unnecessary retention of less meaningful data types. Ultimately, it retains final outputs and the key intermediate data that can serve as inputs for future analysis pipelines, enhancing efficiency and effectiveness in bioinformatics research.

Data linked to metadata often features many-to-many relationships among elements such as subject, study, sequencing center, sample, and analysis. These interconnections reflect real-world relationships, making graph models particularly effective for illustrating them:

 Relationships can be described visually

 Well-known algorithms, such as path finding and shortest-path finding, work effectively on graphs

Some important properties of graph data models include:

 An edge can connect any two vertices in a graph

 It is possible to efficiently identify the incoming and outgoing edges of any vertex in a graph; therefore, all nodes in the graph can be traversed

 The graph can contain a wide range of information when different labels are used, while maintaining a clean data model

Graphs are an ideal tool for creating adaptable data models, particularly when adding features to applications necessitates scaling through the incorporation of additional vertices or edges. This flexibility allows for easy modifications to align with changes in the application's data structure. In the MASH system, data models are defined by a data dictionary and represented as graphs that illustrate the relationships between data components, which has proven effective in practice.

Data models are essential for managing data and metadata consistently and comprehensively, while also being adaptable to evolving data and technology. They must accommodate complex user queries and are designed to store data using graph models and ontology-based concepts. Additionally, these models support indexing and enable data validation during user uploads, ensuring data integrity and accessibility.

Each node in the graph data model is linked to other nodes in a logical manner. The main node types are:

- Administrative: information about Program, Project, Study, and Subject is stored in Administrative nodes

- Clinical: clinical and medical history-related data is stored in clinical nodes, including the Demographic and Diagnosis nodes

- Biospecimen: this node type encompasses data related to biological specimens, including cell molecules, fluids, tissues, organs, and bodily excretions. It comprises the Sample, Read Group, and Aliquot nodes, which are essential for research, testing, diagnosis, and treatment purposes

- Data file: the data model stores only the metadata of raw data files, such as FASTQ, SAM/BAM, and VCF, in Data file nodes. These nodes include essential details like the file name, size, format, and a description of the raw data file

- Index file: like the Data file nodes, Index file nodes hold metadata related to index files; for example, the Aligned Reads Index node contains the index for a collection of aligned reads

- Analysis: information related to genomic pipeline analysis is contained in this node type

- Notation: data that does not match the other categories is stored in the Notation node. The data in this node can be used for later updating and modification of the data model
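As a hedged illustration of how one such node could be declared in a data dictionary (all field names, link names, and values below are hypothetical, not the actual MASH dictionary), a Sample node linked to a Subject might look like this in Python:

# Hypothetical data-dictionary entry for a "sample" node; names and constraints are illustrative.
sample_node = {
    "id": "sample",
    "category": "biospecimen",
    "links": [
        # A sample is collected from exactly one subject (a many-to-one edge in the graph).
        {
            "name": "subjects",
            "target_type": "subject",
            "multiplicity": "many_to_one",
            "required": True,
        },
    ],
    "properties": {
        "submitter_id": {"type": "string"},                 # human-readable identifier
        "sample_type": {"type": "string"},                  # e.g. blood, saliva, tissue
        "collection_date": {"type": "string", "format": "date"},
    },
    "required": ["submitter_id", "sample_type"],
}

A definition of this kind is what would drive validation of user uploads and the construction of graph edges between Sample and Subject nodes.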

The data model offers a detailed visualization of the connections linked to a node, simplifying the processes of designing, testing, and modifying the model. This functionality empowers non-technical users to create data models, provided they have a basic understanding of the data involved.

The final output presented to users is derived from the annotated data in the VCF file. Given the extensive information contained in a VCF file, implementing an efficient data model is essential for the effective extraction and insertion of data, as well as for ensuring rapid retrieval from the system.

NoSQL database systems excel in supporting document data models, offering high efficiency through data localization and flexible schemas. This document-based approach is especially advantageous for handling data formats such as VCF files.

Overall Architecture of the System

In system development, our software team focuses on leveraging the best available technologies to avoid redundancy and enhance efficiency. We conduct comprehensive research on technologies relevant to each feature before starting development. Consequently, we integrate various open-source components such as Gen3, Spark, Elasticsearch, Hadoop, and PostgreSQL. Additionally, we implement rigorous testing to ensure these open-source solutions are secure, effective, and meet our established requirements.

We prioritize the creation of innovative features and the enhancement of underdeveloped ones, aiming to build a user-friendly system that minimizes the learning curve for users.

According to the system requirements, the following MASH architecture is proposed:

MASH is built on a microservice architecture, incorporating Gen3 open-source services such as Indexd, Sheepdog, Peregrine, Tube, and Guppy, along with newly developed microservices.

We enhanced the existing Gen3 open-source microservices, including data-portal and guppy, to align with the new requirements of the MASH system. A detailed description of the microservices is provided below.

- Next-generation high-throughput sequencing machines, which can generate up to 6 Tb of data and 20 billion reads in dual flow cell mode, with simple, streamlined, automated workflows

- Internal analysis pipelines, which have been developed by the research team

- File storage: a file data storage device that can store up to petabytes of data and is highly scalable

- The ETL tool, which takes input VCF data from file storage, extracts and transforms the data into a structured form, and pushes the output to the data warehouse

- A service that transfers raw data from file storage to the MASH data lake

- Indexd: a microservice that assigns a digital ID to each data file, allowing users to find the physical location of a data object

- Data lake: a local object storage or Amazon Simple Storage Service (S3); a local object storage can be deployed using open-source software such as MinIO

- A microservice that allows users to perform rich data submission

- A microservice that generates metadata for the Data file nodes in the data model

- Peregrine: a service that allows users to run GraphQL queries to gain insights into the data in the MASH system

- Tube: an ETL tool that translates data from the graph data model stored in the PostgreSQL database into indexed documents in the Elasticsearch database, which offer higher query performance

- Guppy: a microservice that supports GraphQL queries on data in the Elasticsearch database

- A set of microservices that allows users to run genomics data analysis pipelines on MASH's infrastructure

- Data-portal: an interactive platform that enables users to explore, submit, and download data effortlessly; command-line interface tools are also offered for users to perform the same tasks efficiently

Currently MASH receives data primarily from two main sources:

 Gene sequencing machines of agencies, including hospitals. The data flow from this source is indicated in the system architecture by red lines.

 Data uploaded to the system through a website interface or a specialized big data transfer tool known as mash-client, which provides a command-line interface. The data flow from these sources is represented by green lines in the system architecture.

Different data sources require distinct handling methods due to variations in size and generation rates. Sequencing machine data is typically larger and produced at a higher rate than community data, which leads to differences in the system components that handle it. This data can either be processed through internal analysis pipelines, with intermediate results stored on a file server, or be saved to the server first and analyzed later. Community data is uploaded via the website interface or, for large files, a specialized command-line tool. This design separates the data sources, enhancing data quality control and access management. Ultimately, all final files are consolidated in the system's data lake, ensuring consistent management and sharing.

In the above architecture, to reduce complexity, the architecture of authentication and authorization and the architecture of the workflow are presented separately.

The system architecture is shown from the layer diagram's perspective as follows (excluding the workflow):

The MASH system consists of two primary components: the client side, which includes the data-portal/cli microservices, and the server side, encompassing the remaining microservices. These microservices communicate via an API gateway, enabling centralized monitoring of user access and system performance metrics, such as response time and user count.

The architecture of system access authentication and authorization is displayed as follows:

 User sends a request to the system

 API Gateway sends the request to a corresponding service

 If the user request needs authentication, the service passes the request to Fence to perform user authentication

Figure 3.3: Layer diagram – MASH system architecture

Figure 3.4: System authentication and authorization architecture

The system currently relies on Google’s authentication service for user verification, with authentication requests being directed to this external vendor. However, plans are in place to develop an in-house authentication service in the future, which will enable support for both Google’s service and the new system.

 After user authentication is completed, or if the initial request does not need authentication, the service sends a request to the Arborist service to verify the user's authorization for the initial request

 If the user is authorized, the request is performed; otherwise, an invalid error is returned

The Analysis Workflow Automation Service offers a comprehensive suite of features for accessing, analyzing, aggregating, and sharing bioinformatics data. Users can design workflows in the CWL language through an intuitive drag-and-drop interface, allowing for parameter modification, resource configuration, and execution environment selection. Additionally, the system ensures optimal scheduling and management based on the available resources and the user's account type.

Users can easily monitor the progress of their workflows, as the system automatically generates result messages and reports. These results are securely stored and can be downloaded or used for further analysis.

(d) The analysis results are presented in graphs and tables, allowing users to see the quality of data and analysis results

Users have the ability to share workflow or analysis results with others through email, allowing recipients to view, re-analyze, or download the shared content based on the permissions set by the owner.

SOLUTIONS TO SPEED UP DATA INSERTION AND QUERYING

Data Insertion

The graph data model is effective for storing metadata; however, it falls short in performance for analytical data storage. Consequently, data warehouses typically utilize NoSQL databases for this purpose. In the MASH system, we leverage Elasticsearch as the primary data source for data analysis and visualization tasks.

Ensuring performance for data warehouses is a very challenging but interesting problem. For the data insertion phase, I accelerate MASH by taking advantage of Spark and a pipeline structure:

● In the data processing phase, a pipeline structure is utilized, in which one process is responsible for loading data from disk while another process uploads this data to the workers (a minimal sketch of this structure follows the list below)

● Bioinformatics tools are integrated with Spark to take advantage of distributed computing platforms
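As a rough, thread-based sketch of that pipeline idea (the cluster address, index name, input path, and document layout are assumptions), a loader can keep reading records from disk while the Elasticsearch bulk helper uploads them in parallel:

import json
import queue
import threading

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")   # hypothetical cluster address
docs = queue.Queue(maxsize=10_000)            # buffer between the loader and the uploaders
SENTINEL = None

def loader(path):
    # Producer: read pre-converted variant records (one JSON object per line) from disk.
    with open(path) as handle:
        for line in handle:
            docs.put(json.loads(line))
    docs.put(SENTINEL)

def actions():
    # Generator feeding the uploaders; stops when the loader signals completion.
    while True:
        doc = docs.get()
        if doc is SENTINEL:
            return
        yield {"_index": "variant", "_source": doc}   # hypothetical index name

threading.Thread(target=loader, args=("variants.jsonl",), daemon=True).start()

# Consumers: several worker threads send bulk requests while the loader keeps reading.
for ok, info in parallel_bulk(es, actions(), thread_count=8, chunk_size=1000):
    if not ok:
        print("failed:", info)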

The effectiveness of these methods is described in Section 5.3.

Data Querying

To enhance query speed, data is denormalized and stored in a flat format rather than a deeply nested one, as nesting can significantly slow down performance. Increased levels of nesting lead to further reductions in query speed, which is why this system restricts data to a single nested level. Additionally, the use of a parent-child format can decrease query speed by several dozen times, making it unsuitable for our data storage approach.

The following examples illustrate data stored in flat and nested forms in the database, using fields such as "type_of_alcohol_used" of type string. In Elasticsearch, the flat format organizes all fields and values at the same level using the keyword data type, which, while similar to the text type, lacks full-text search capabilities; consequently, the keyword type enables quicker data insertion and querying than the text type.

In our system, we avoid deeply nested forms, yet subjects, variants, and genes are closely interrelated. Users frequently seek to compile information pertaining to these three elements, such as counting the number of male subjects, identifying variants classified as “modifier” impact and “SNV”, or listing genes associated with specific criteria such as the “lncRNA” gene biotype for male subjects.

Because the numbers of subjects, variants, and genes differ, it is common to join the subject, variant, and gene indexes to obtain the desired figures. However, this joining can be time-consuming, particularly when dealing with large indexes, which impedes overall performance.

To streamline query performance and eliminate the need for joining data across multiple indexes, each of the three indexes is structured to include the essential data fields from the other indexes within its nested fields.

Let’s take a closer look at the mapping configuration and index data.

Mapping configuration of index subject, variant, and gene:
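As a rough sketch of such a mapping (the field names are illustrative assumptions, not the actual MASH configuration), a subject index with nested variant_list and gene_list fields could be created like this:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical cluster address

# Hypothetical subject-index mapping: flat keyword fields plus one level of nested lists.
es.indices.create(
    index="subject",
    mappings={
        "properties": {
            "subject_id": {"type": "keyword"},
            "gender": {"type": "keyword"},
            "variant_list": {
                "type": "nested",
                "properties": {
                    "variant_id": {"type": "keyword"},
                    "impact": {"type": "keyword"},
                },
            },
            "gene_list": {
                "type": "nested",
                "properties": {
                    "gene_symbol": {"type": "keyword"},
                    "biotype": {"type": "keyword"},
                },
            },
        }
    },
)

The variant and gene indexes would be mapped analogously, each embedding the fields of the other two entities that are needed for filtering.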

Due to system security, only a part of the data sets in the indexes is shown rather than the entire data sets.

The subject, variant, and gene indexes, corresponding to the demographic, exposure (subject), variant, and gene panels, are displayed to users on the website:

(Figure 4.4 was captured from https://genome.vinbigdata.org/explorer)

In the subject index, the nested fields include variant_list and gene_list, while the gene index features nested fields such as subject_list and variant_list. Additionally, the variant index contains nested fields such as subject_list and gene_list. These fields give users options to effectively search and filter data. For instance, when a user is in the variant tab and selects the ‘High’ checkbox under the ‘Impact’ category, specific results are generated based on that selection.

(Figure 4.5 was captured from https://genome.vinbigdata.org/explorer)

Three queries will be generated:

(a) In the variant index: select all documents whose “impact” field has the value “high”, and count the total number of results obtained

(b) In the subject index: select and count all documents that have “high” values in the “impact” subfield of the “variant_list” nested field

(c) In the gene index: select and count all documents that have “high” values in the “impact” subfield of the “variant_list” nested field

Figure 4.4: Support data analysis and search by selecting filter options

The above design of the data model helps avoid joining indexes, thereby improving data querying performance.
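For example, query (c) can be expressed as a count with a nested term query against the gene index; the following is a hedged sketch (the cluster address is an assumption, and the index and field names follow the naming used in this section):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical cluster address

# Count gene documents having at least one variant_list entry whose impact is "high".
response = es.count(
    index="gene",
    query={
        "nested": {
            "path": "variant_list",
            "query": {"term": {"variant_list.impact": "high"}},
        }
    },
)
print(response["count"])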

Application of genetic algorithm in optimal parameter selection

In Section 4.1, several parameters that can affect the performance of data insertion into the Elasticsearch database were identified. These parameters are:

 Bulk_size: Size of bulk requests

 N_workers: Number of workers/threads which are used to insert data to Elasticsearch

 N_shards: Number of shards of each Elasticsearch index

 N_replicas: Number of replicas of each Elasticsearch index

 N_masters: Number of master nodes in the Elasticsearch cluster

 N_data: Number of data nodes in the Elasticsearch cluster

The vast number of potential combinations of these parameter values makes heuristic methods more effective than exhaustive search techniques, so genetic algorithms are an ideal choice for this task.

This section provides a concise overview of genetic algorithms (GAs) and explores their application in optimizing the parameter search for data insertion and querying within the Elasticsearch database.

The Genetic Algorithm (GA), rooted in Darwinian natural selection, is a type of Evolutionary Algorithm designed to tackle problems with extensive search spaces. Central to GA is the principle of "Survival of the Fittest", which drives its operators: mutation, crossover, and selection.

Because offspring inherit genetic traits from their parents, the number of gene pair combinations within a population would otherwise be limited. Mutation introduces genetic variation and therefore plays a significant role in driving evolution and natural selection.

Natural selection ensures that, over time, individuals lacking adaptive traits are eliminated due to challenges like competition for resources, harsh environmental conditions, and predation. This process favors those with superior adaptations, allowing them to survive and reproduce, thereby shaping the evolution of species.

During reproduction, offspring inherit traits from both parents, typically receiving half of their genes from each. This genetic crossover allows children to adapt in varying degrees compared to their parents, influencing their overall resilience and adaptability.

To achieve the optimal value using a GA, all parameters need to be tuned. The following table outlines the tuning parameters along with their value ranges:

- bulk_size: size of bulk requests; possible values: 100 – 10,000 (step: 100)

- n_workers: number of workers/threads used to insert data into Elasticsearch; possible values: 1 – 100

- n_shards: number of shards of each Elasticsearch index; possible values: 1 – 1,000 (step: 10)

- n_replicas: number of replicas of each Elasticsearch index; possible values: 0 – 3

- n_masters: number of master nodes in the Elasticsearch cluster

- n_data: number of data nodes in the Elasticsearch cluster

Every combination of parameters produces a distinct set, where each element signifies a specific value. A parameter set is represented as (bulk_size, n_workers, n_shards, n_replicas, n_masters, n_data).

Figure 4.6: Representation of a parameter set

Consequently, the value for each parameter set can be:

Figure 4.7: Specific value of a parameter set

After knowing all the parameters and procedures for selection operations, we propose an optimization procedure applying genetic algorithm as suggested in a previous study [23]:

Figure 4.8: Genetic Algorithm flow chart for Parameter tuning [30]

In Figure 4.8, after generating a population with P individuals, we need to measure the fitness of each individual by experimental methods. The results of the GA for parameter tuning are described in detail in Section 5.2.
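A compact sketch of this tuning loop is given below; the fitness function is only a stand-in for the real experimental measurement of insertion throughput, and the population size, generation count, and the ranges for n_masters and n_data are assumptions:

import random

# Search space for one individual (one parameter set).
SPACE = {
    "bulk_size": range(100, 10001, 100),
    "n_workers": range(1, 101),
    "n_shards": range(1, 1001, 10),
    "n_replicas": range(0, 4),
    "n_masters": range(1, 4),    # assumed range; not stated in the text
    "n_data": range(1, 6),       # assumed range; not stated in the text
}

def random_individual():
    return {name: random.choice(list(values)) for name, values in SPACE.items()}

def fitness(params):
    # Placeholder: the real procedure deploys Elasticsearch with `params`
    # and measures insertion throughput experimentally (see Section 5.2).
    return -abs(params["bulk_size"] - 5000) - abs(params["n_workers"] - 50)

def crossover(a, b):
    # Each gene is taken from one of the two parents.
    return {name: random.choice([a[name], b[name]]) for name in SPACE}

def mutate(individual, rate=0.1):
    # With a small probability, replace a gene by a random value from its range.
    return {name: (random.choice(list(SPACE[name])) if random.random() < rate else value)
            for name, value in individual.items()}

population = [random_individual() for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                     # selection: keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print("best parameter set:", max(population, key=fitness))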

MASH CONSTRUCTION RESULTS

Test Environment

To perform testing, the entire system was deployed on a physical server with the following parameters:

Table 5.1: Configuration parameters of the server in the test environment

Server Platform ProLiant DL360 Gen10

Processor Intel(R) Xeon(R) Silver 4214R CPU @

Local Storage SSD 3PARdata - SAN 10TB

We conduct performance tests on data insertion and querying to evaluate efficiency. Queries are generated using a specialized performance testing tool to assess querying performance. The detailed results of these tests are presented in the following sections.

Result of Parameter Optimization by Genetic Algorithm

The parameter set identified through the application of the genetic algorithm was utilized to deploy an Elasticsearch cluster, and all subsequent results were tested based on this configuration.

Table 5.2: Result of parameter optimization by Genetic Algorithm

Bulk_size N_workers N_shards N_replicas N_masters N_data

Insertion performance

The data insertion test results are presented in the table below, comparing MASH with Spark and MASH without Spark. This comparison highlights the benefits of integrating bioinformatics tools with Spark, leveraging the advantages of distributed computing.

The results are shown more clearly in the following bar chart:

Figure 5.1: Performance of data insertion phase

Integrating bioinformatics tools with Spark significantly enhances performance, achieving an approximately 5.4-fold improvement in data insertion efficiency. This advancement has a substantial impact on the data insertion phase, ultimately saving considerable time when processing large files in real-world systems.

Query performance

Elasticsearch utilizes a document model for efficient data exploration and visualization, but response times for different queries can vary significantly. As the volume of data or the number of requests to the system increases, these differences in response times become more pronounced.

Test cases are developed from the diverse queries submitted to the system, each affecting the infrastructure in a different way. Given the vast range of possible queries, it is impractical to test every scenario. Therefore, we focus on five common queries for our testing process, which are detailed in the following table:

1. Aggregation: COUNT the number of genes of all subjects whose gender is female

2. Aggregation: COUNT the number of variants of all subjects whose gender is female

3. Selection: Select all variants with a specific impact value

4. Selection: List details about all female subjects

5. Selection: Select all subjects from a specific original area

Test cases are listed in the below table:

T1 (CCR = 10): Aggregation – COUNT the number of genes of all subjects whose gender is female

T2 (CCR = 100): Aggregation – COUNT the number of genes of all subjects whose gender is female

T3 (CCR = 500): Aggregation – COUNT the number of genes of all subjects whose gender is female

T4 (CCR = 10): Aggregation – COUNT the number of variants of all subjects whose gender is female

T5 (CCR = 100): Aggregation – COUNT the number of variants of all subjects whose gender is female

T6 (CCR = 500): Aggregation – COUNT the number of variants of all subjects whose gender is female

T7 (CCR = 10): Selection – Select all variants with a specific impact value

T8 (CCR = 100): Selection – Select all variants with a specific impact value

T9 (CCR = 500): Selection – Select all variants with a specific impact value

T10 (CCR = 10): Selection – List details about all female subjects with a specific gene symbol as the condition

T11 (CCR = 100): Selection – List details about all female subjects with a specific gene symbol as the condition

T12 (CCR = 500): Selection – List details about all female subjects with a specific gene symbol as the condition

T13 (CCR = 10): Selection – Select all subjects from a specific original area

T14 (CCR = 100): Selection – Select all subjects from a specific original area

T15 (CCR = 500): Selection – Select all subjects from a specific original area

These test cases are run before and after optimization of the data and database mapping configuration, corresponding to the nested field (NF) and parent-child (PC) schema types.

Let’s look at the query results of the first five test cases:

Test cases parent_child nested

With 10 CCR, there is already a clear difference in data query performance between the two schema types: the nested type performs 1 to 11.75 times better than the parent_child type. Tests T10 and T13 do not show much difference in query performance between the two schema types, which can be explained by the relatively small number of subjects in the data (at the time of testing the dataset contained 504 subjects), while the other queries work with up to 27,019,368 variants and 60,343 genes in the dataset.

In the next five test cases, increasing the number of CCR to 100 highlights a significant disparity in data query performance between the two schema types. The nested type consistently outperforms the parent-child schema, demonstrating performance improvements ranging from 1 to 38 times.

Test case parent_child nested

In the last tests, the nested type maintained an acceptable response time even with up to 500 CCR, while the parent-child type exhibited poor performance, with the most time-consuming query taking around 14 minutes. Overall, the nested type outperformed the parent-child type by a factor of 1 to 78.4 in query performance.

Test case parent_child nested

The application of the denormalization technique led to a notable enhancement in data query performance, marking a key contribution of this thesis.

Research Questions and Outcomes

This research focuses on developing, testing, and evaluating data models to enhance data management, insertion, and retrieval efficiency. It also proposes a scalable and maintainable system architecture tailored to large biomedical data storage. Throughout the system's development, several key questions emerged; the following sections summarize potential answers to them.

(a) How to aggregate data from different sources?

System raw data is generated from sequencing machines and user uploads, while metadata must be collected from diverse sources such as sequencing machines, hospitals, and third-party APIs. Given the varied structures and frequent updates of these data sources, ETL (Extract, Transform, Load) tools are essential for maintaining data integrity in the target storage area. These centrally managed tools use monitoring methods to ensure successful task execution, and they improve performance by limiting direct database interactions and working with dumped files instead. This approach improves ETL tool performance by approximately 20 times.

(b) Does data store in MASH satisfy the criteria such as findable, accessible, interoperable, and reusable (FAIR)?

Data in data lakes can easily be overlooked and is often not reusable, because the sheer volume of information frequently lacks essential details. To enhance the findability, accessibility, interoperability, and reusability of data in the MASH datalake, each file object is assigned a GUID, which is linked to its corresponding metadata, and all metadata is organized within a graph data model. The proposed models do, however, have limitations, which are discussed below.

MASH uses two types of data models for different purposes:

The graph data model is ideal for managing raw data and effectively storing the close relationships between data elements, ensuring they are findable, accessible, interoperable, and reusable. However, its data query performance diminishes with larger datasets, leading to poor performance during data joins. Consequently, in MASH, the graph data model is used primarily for storing metadata and small-sized data.

The document model offers superior data query performance compared to the graph model, ensuring faster response times by allowing user data to be pre-seeded into indexes, which reduces the need for joins. However, this approach may lead to increased data duplication across indexes and presents challenges in maintaining data consistency between them. In MASH, the document data model is used to store VCF file data, making it accessible to users through an exploration service.

Contributions and Perspectives

● Technically, through system research and development, we have made the following major contributions:

- Proposed an architecture for a flexible platform that can be deployed on the cloud or on-premises and can handle data coming from many different sources at the same time

- Offered a data model suitable for raw biomedical data management

- Provided a method and data model to reduce the time needed to insert and retrieve data

The system is now operational, offering significant advantages to users such as researchers, students, and doctors. It ensures reliable access to high-quality data and equips users with user-friendly tools for effective searching, filtering, and analysis of information.

 Published an article in the Journal of Science and Technology of Technical Universities, ISSN: 2354-1083, vol 147, pp 14-21

MASH data is designed to be findable, accessible, interoperable, and reusable. We suggest the following future research directions to enhance user engagement with the system's data and enable more effective utilization:

(a) Allow users to develop pluggable applications and provide users with the results they want

(b) Allow the system to be updated, using the latest technologies while still ensuring backwards compatibility

1 Nguyen Thanh Huong, Dao Dang Toan, “Cluster-based Routing Approach in Hierarchical Wireless Sensor Networks toward Energy Efficiency using Genetic Algorithm”, Journal of Science and Technology of Technical Universities, ISSN: 2354-1083, vol 147, pp 14-21

[1] Miloslavskaya, N., & Tolstoy, A (2016) Big Data, Fast Data and Data Lake Concepts Procedia Computer Science, 88, 300–305 https://doi.org/10.1016/j.procs.2016.07.439

[2] McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R. S., Thormann, A., Flicek, P., & Cunningham, F. (2016). The Ensembl Variant Effect Predictor. Genome Biology, 17(1). https://doi.org/10.1186/s13059-016-0974-4

[3] Borthakur D The Hadoop Distributed File System: Architecture and Design Hadoop Project Website 2007; 11:21

[4] Dean, J., & Ghemawat, S (2008) MapReduce Communications of the ACM, 51(1), 107–113 https://doi.org/10.1145/1327452.1327492

[5] Zaharia, Matei & Chowdhury, Mosharaf & Franklin, Michael & Shenker, Scott & Stoica, Ion (2010) Spark: Cluster Computing with Working Sets Proceedings of the 2nd USENIX conference on Hot topics in cloud computing 10 10-10

[6] O’Brien, A R., Saunders, N F W., Guo, Y., Buske, F A., Scott, R J., & Bauer, D C (2015) VariantSpark: population scale clustering of genotype information BMC Genomics, 16(1) https://doi.org/10.1186/s12864-015-

[7] Atzeni, P., Bugiotti, F., Cabibbo, L., & Torlone, R (2020) Data modeling in the NoSQL world Computer Standards & Interfaces, 67, 103149 https://doi.org/10.1016/j.csi.2016.10.003

[8] Pollack, Kristal & Brandt, Scott (2005) Efficient Access Control for Distributed Hierarchical File Systems 253-260 10.1109/MSST.2005.11

[9] 1000 Genomes Project Consortium An integrated map of genetic variation from 1,092 human genomes Nature 2012; 491(7422):56–65 doi:10.1038/nature11632

[10] Wu, D., et al. (2019). Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell.

[11] Tateno, Y (2002) DNA Data Bank of Japan (DDBJ) for genome scale research in life science Nucleic Acids Research, 30(1), 27–30 https://doi.org/10.1093/nar/30.1.27

[12] (2019) The GenomeAsia 100K Project enables genetic discoveries across Asia Nature, 576(7785), 106–111 https://doi.org/10.1038/s41586-019-1793-z

[13] Afgan, E., Baker, D., Coraor, N., Chapman, B., Nekrutenko, A., & Taylor, J. (2010). Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics, 11(Suppl 12), S4. https://doi.org/10.1186/1471-2105-11-s12-s4

[14] Afgan, E., Baker, D., Coraor, N., Goto, H., Paul, I M., Makova, K D., Nekrutenko, A., & Taylor, J (2011) Harnessing cloud computing with Galaxy Cloud Nature Biotechnology, 29(11), 972–974 https://doi.org/10.1038/nbt.2028

[15] Heath, A P., Greenway, M., Powell, R., Spring, J., Suarez, R., Hanley, D., Bandlamudi, C., McNerney, M E., White, K P., & Grossman, R L

(2014) Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets Journal of the American Medical Informatics Association, 21(6), 969–975 https://doi.org/10.1136/amiajnl-2013-002155

[16] The Cancer Genome Collaboratory. In: Proceedings of the AACR Annual Meeting 2017, Washington, DC, 2017.

[17] Grossman, R. L., Heath, A., Murphy, M., Patterson, M., & Wells, W. (2016). A Case for Data Commons: Toward Data Science as a Service. Computing in Science & Engineering, 18(5), 10–20. https://doi.org/10.1109/mcse.2016.92

[18] Grossman, R. L., Greenway, M., Heath, A. P., Powell, R., Suarez, R. D., Wells, W., ... & Harvey, C. (2012). The design of a community science cloud: The open science data cloud perspective. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (pp. 1051–1057). IEEE.

[19] Jensen, M A., Ferretti, V., Grossman, R L., & Staudt, L M (2017) The NCI Genomic Data Commons as an engine for precision medicine Blood, 130(4), 453–459 https://doi.org/10.1182/blood-2017-03-735654

[20] Gaurav Kaushik, Sinisa Ivkovic, Janko Simonovic, Nebojsa Tijanic, Brandi Davis-Dusenbery, and Deniz Kural Graph theory approaches for optimizing biomedical data analysis using reproducible workflows bioRxiv, 2016 doi: 10.1101/074708

[21] K F Man, K.S Tang, S Kwong, “Genetic Algorithms: Concepts and Applications”, IEEE Transactions on Industrial Electronics, Vol 43, No.5, October 1996

[22] L Haldurai, T Madhubala, R Rajalakshmi, “A Study on Genetic Algorithm and its Applications”, International Journal of Computer Sciences and Engineering, Vol-4, Issue-10, ISSN-2347-2693.
