TECHNOLOGY AND METHOD DEVELOPMENTS FOR HIGH-THROUGHPUT TRANSLATIONAL MEDICINE

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

© 2011 by Junhee Seok. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Ronald Davis, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Stephen Boyd, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Robert Tibshirani

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in [...]


Abstract

Translation of knowledge from basic science to medicine is essential to improving both clinical research and practice. In this translation, high-throughput genomic approaches can greatly accelerate our understanding of molecular mechanisms of diseases. A successful high-throughput genomic study of disease requires, first, comprehensive and efficient platforms to collect genomic data from clinical samples and, second, computational analysis methods that utilize databases of prior biological knowledge together with experimental data to derive clinically meaningful results. In this thesis, we discuss the development of a new microarray platform as well as computational methods for knowledge-based analysis, along with their applications in clinical research.

First, we and other colleagues have developed a new high-density oligonucleotide array of the human transcriptome for high-throughput and cost-efficient analysis of patient samples in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing, and also provides assays for coding SNP detection and non-coding transcripts. Compared with high-throughput mRNA sequencing technology, we show that this array is highly reproducible in estimating gene and exon expression, and sensitive in detecting expression changes. In addition, the exon-exon junction feature of this array is shown to improve detection efficiency for mRNA alternative splicing when combined with an appropriate computational method. We implemented the use of this array in a multi[...] reproducible data. With low costs and high throughputs for sample processing, we anticipate that this array platform will have a wide range of applications in high-throughput clinical studies.

Second, we investigated knowledge-based methods that utilize prior knowledge from biology and medicine to improve analysis and interpretation of high-throughput genomic data. We have developed knowledge-based methods to enrich our prior knowledge, illustrate dynamic response to external stimulus, and identify disturbance in cellular pathways by chemical exposure, as well as discover hidden biological signatures for the prediction of patient outcomes. Finally, we applied a knowledge-based approach in a large scale genomic study of trauma patients. Cooperating with clinical information, prior knowledge improved the interpretation of common and differential genomic response to injury, and provided efficient risk assessment for patient outcomes. The clinical and genomic data as well as analysis results in this trauma study were systematically organized and provided to research communities as new knowledge of traumatic injury.

The microarray platform and knowledge-based methods presented in this thesis provide appropriate research tools for high-throughput translational medicine in a large clinical setting. This thesis is expected to advance understanding and treatment [...]


Acknowledgements

I would like to express my deepest appreciation to everyone who has supported my Ph.D. research and life at Stanford over the years.

It would not have been possible to accomplish my work in this Ph.D. thesis without the insightful guidelines and generous support of Professor Ronald W. Davis. His open mind toward other fields encouraged me, a graduate student in engineering, to study the interdisciplinary areas in bioinformatics and technology. His great intuitions concerning biology and medicine have guided me to investigate the most significant and fundamental problems in these fields. I have gained profound knowledge, scientific ways of thinking, effective communication skills, and many other lessons from every conversation I have had with Professor Davis. He has inspired me to explore research topics and conduct my own research, through which I have been able to develop a high sense of responsibility and leadership skills. I am also appreciative of the assistance of Professor Stephen P. Boyd in my department, who helped me to translate my background knowledge in engineering into an interdisciplinary study that touched on science and medicine. I also would like to thank Professor Robert Tibshirani in the department of statistics. He is a world-renowned statistician and the person whom I consult first whenever I face a statistical problem in my research. In addition, I especially thank Professor Wenzhong Xiao at Harvard Medical School for mentoring me throughout my whole Ph.D. career. He and I have had detailed discussions on [...] understanding of both biology and statistics as well as his personal kindness have supported me as I conducted my Ph.D. research.

I would like to acknowledge many other people who have supported me during my Ph.D. years. The people at the Stanford Genome Technology Center have provided the best atmosphere to complete my Ph.D. research. I would like to send many thanks to Dr. Weihong Xu, Dr. Hong Gao, Dr. Amit Kaushal, and Dr. Yuping Zhang of the bioinformatics group, who have worked on and discussed with me several different projects. I also thank Ms. Julie Wilhelmy and Dr. Michael N. Mindrinos for performing experiments and providing genomic data. Many parts of this thesis are based on collaborative research conducted with researchers in the Glue Grant, Inflammation and the Host Response to Injury. I appreciate their support of this research and their provision of financial assistance, especially from Dr. Ronald G. Tompkins at Harvard Medical School, Boston, and Dr. Lyle L. Moldawer at University of Florida, Gainesville. I also would like to acknowledge help from my other colleagues at Stanford. Professor Wing H. Wong and his group in the department of statistics helped to improve the computational methods for alternative splicing analysis described in this thesis. Professor Markus W. Covert in bioengineering directed me to build a dynamic model for human LPS response through a class he taught and his continued advice. Dr. Shanaz H. Dairkee at California Medical Research Center has provided valuable opportunities to extend my research area to breast cancer studies.

I am very happy to have had my valued friends who greatly enriched my life at Stanford. Friends and classmates from Seoul Science High School were reunited at [...] as well as listening to each other's concerns and spending our spare time together. Alumni of the Korea Advanced Institute of Science and Technology became close friends soon after I first arrived at Stanford. They helped me to settle and sustain my life at the university. I cannot forget the times when my friends in electrical engineering and I struggled with homework assignments and examinations. I am also grateful to the wonderful golfers who have willingly played with me even on rainy, windy, and cold days.

Above all things, I always appreciate the support and care of my family. I cannot think of any words to express my gratitude for my parents in Korea. I owe all of my accomplishments to their love and tears. I am always proud of my brother, Donghe. We had a lot of fun while he studied at the University of Southern California, a truly refreshing time in my life. My little boy, Colin, is a miracle I had never experienced. His very existence cheers me up and spurs on my research. Finally, I would like to send all my love to my wife, Yeong. Yeong has always been with me whenever I am glad, depressed, or sorrowful. She is the best woman and friend I have met as well as [...]

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Thesis organization
Chapter 2 Comparison of Microarray and Sequencing Technologies
2.1 The Glue Grant Human (GG-H) array
2.2 Comparison settings
2.3 Technical specifications
2.4 Estimation of sequencing equivalence to array
2.5 Summary
Chapter 3 Alternative Splicing Analysis [...] Microarrays
3.1 Introduction
[...]
Summary
Chapter 4 Knowledge-based Analysis
Self-enrichment of knowledge
A dynamic model of human LPS response
Bisphenol A effects on breast cancer patients
Gene set prediction for patient outcomes
Chapter 5 A Translational St[...]


List of Tables

Table 2.1 The contents of the GG-H array
Table 2.2 Comparison of microarray and sequencing technical specifications
Table 3.1 The performance summarization for various methods to detect alternatively spliced exons from microarray data
Table 4.1 Significantly perturbed gene sets by BPA
Table 5.1 Patient information in the trauma study
Table 5.2 Enriched pathways in significantly activated and suppressed genes after trauma injury
Table 5.3 The numbers of genes significantly associated with clinical variables at each time point after trauma injury
Table 5.4 Genes with significantly distinct expression trajectories between an uncomplicated and a complicated recovery
Table 5.5 Performance of predictors based on clinical variables, high-throughput genomic data, and prior knowledge


List of Figures

Figure 1.1 Statistics in health: cost and mortality
Figure 1.2 Translational layers in translational medical research
Figure 2.1 Array designs of a 3' gene array, an exon array, and the GG-H array
Figure 2.2 Reproducibility of array and sequencing
Figure 2.3 Dynamic ranges of array and sequencing
Figure 2.4 Detection of differential expression in array and sequencing
Figure 2.5 Coefficients of variation (CoV) of array and sequencing
Figure 3.1 A gene with three exons and three junctions
Figure 3.2 Putatively constitutive exon selection for GARNL1
Figure 3.3 Junction supremacy
Figure 3.4 Misbehaving junction filtering
Figure 3.5 Performance of various methods for alternative splicing detection according to their parameter changes
Figure 3.6 Best-junction approach vs. all-junction approach
Figure 3.7 JETTA software structure
Figure 4.1 Expression correlations of TF-TG and TG-TG
Figure 4.2 Prediction performance [...] knowledge bases
Figure 4.3 Estimated transcriptional activities in LPS response
Figure 4.4 Gene clusters by the calculated regulatory strengths in LPS response
Figure 4.5 A dynamic network for human LPS response
Figure 4.6 Breast cancer aggressiveness of patients stratified by the molecular signatures of BPA
Figure 4.7 A risk stratification example of the proposed gene set prediction method on cancer patients
Figure 5.1 Temporal expression patterns of significantly perturbed genes after trauma injury
Figure 5.2 Average gene expression trajectories of an uncomplicated and a complicated recovery
Figure 5.3 A network for the molecular signatures of a complicated recovery

Chapter 1 Introduction

1.1 Motivation

Improving public health is a significant challenge in our society. In many major developed countries, health care costs have increased for the last several decades [1] (Figure 1.1A), and the rapid increase of health costs has now become a serious burden over the world. In the United States, 16% of the gross domestic product was spent on health care in 2008, which was double the health costs in 1980. While we have spent a huge amount of money on health, diseases are not controlled well [2] (Figure 1.1B). For example, we still do not have successful treatments for cancer although significant human and financial resources have been invested in cancer research for the last several decades. In contrast, we are experiencing a significant improvement in the mortality of heart diseases, the number one cause of death in the United States. However, this improvement is mainly from a well-established emergency system of pre-hospital care [...]

Figure 1.1 Statistics in health: cost and mortality

[...] [5]. Increasing diseases such as diabetes and obesity are now new threats in the society. Consequently, there emerges a wide-spread necessity to improve the current public health towards better treatment for diseases with less cost.

For this purpose, a new research paradigm, translational medical research, has been suggested [1-6]. Translational medical research is often referred to as "from the bench to the bedside," where translated knowledge and technologies of basic science boost clinical research in understanding disease mechanisms and developing new treatments, and finally improve public health. Figure 1.2 illustrates the translation of molecular information from a science domain to a clinical domain, especially for understanding of disease mechanisms at molecular levels. Molecular signatures from genetic variants, mRNA expression, and protein abundance are measured by technologies developed in genetics and biochemistry, including polymerase chain reaction (PCR) [7], microarray [8-9], sequencing [10-11], and flow cytometry [12]. The measured molecular signatures are interpreted through several translational layers. The layer elements representing molecules, cellular functions, or disease processes interact with each other in a layer, and each layer also interacts with its upper or lower layers. These interactions are partially found from our knowledge previously accumulated in basic science and medicine, while a large portion of the whole interactions is still unknown. Molecular signatures are finally interpreted at the layer of disease processes, where we observe clinical phenotypes such as patient demographics, physiological status, disease symptoms, and clinical outcomes. The translated molecular information helps to [...]

Figure 1.2 Translational layers in translational medical research

In such translations, it is essential to develop appropriate technologies for research platforms that detect molecular signatures at the bottom layer. A clinical study often requires high quality genomic data from hundreds or thousands of patients, which can be obtained with a reliable, sensitive, and efficient genomic platform. High-throughput microarray technology [8-9], which enables the simultaneous assessment of several thousand mRNA transcripts, has provided useful research platforms for many clinical studies on complex diseases such as cancer, inflammatory and infectious diseases, transplant failures, and Alzheimer's disease [13-16]. However, many of the currently available microarray platforms are not appropriate to capture signatures from the complex human transcriptome. For example, the widely used Affymetrix U133 array [17] focuses on gene expression, and it does not provide access to individual exon signatures. The Affymetrix exon array [18] has probes for genome-wide exons; however, its [...]

Sequencing technology for mRNA transcripts provides a new platform for analyses of the complex human transcriptome [10-11], but the processing costs and sample throughputs still need to be improved, particularly for a large clinical setting.

The development of analytical methods is another essential part of translational medicine, in particular for high-throughput genomic studies in clinics. The high dimensionality of high-throughput data often causes false positive translations of molecular signatures into a clinical domain. Conventional methods shrink the data dimension in a data-driven way, such as principal component analysis and L1-regularization [19-20]. As an alternative approach, prior knowledge shrinks the data dimension in biological contexts representing our knowledge that has been collected, accumulated, and managed in science and medicine. Prior knowledge constructs the translational layers described in Figure 1.2, where high dimensional molecular signatures are summarized into fewer cellular functions and disease processes, and directly interpreted in associations with clinical phenotypes. There have been several efforts towards knowledge-based analysis in inferring significances of biological functions [21-22], estimating regulatory activities [23], and predicting new molecular interactions [24]. However, many of these methods were demonstrated on simple organisms like E. coli and yeast, or on small sets of human samples. There have been relatively few efforts to apply knowledge-based methods to large clinical studies. In addition, the current knowledge-based analysis still needs to be extended further into other significant problems such as patient outcome prediction.
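The contrast between data-driven and knowledge-based dimension reduction can be illustrated with a minimal sketch. It is not the method developed in this thesis: the expression matrix is random, the gene names and the two gene sets (`pathway_A`, `pathway_B`) are hypothetical, and averaging z-scored genes within a set is only one simple way to summarize a gene set.

```python
import numpy as np

def pca_scores(expr, n_components):
    """Data-driven reduction: project samples onto the top principal components.

    expr: (n_samples, n_genes) expression matrix.
    """
    centered = expr - expr.mean(axis=0)
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def gene_set_scores(expr, gene_names, gene_sets):
    """Knowledge-based reduction: one score per sample and per gene set.

    Each gene is z-scored across samples, then genes of a set are averaged.
    gene_sets: dict mapping a (hypothetical) set name to a list of gene names.
    """
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    z = (expr - expr.mean(axis=0)) / sd
    index = {g: i for i, g in enumerate(gene_names)}
    names = sorted(gene_sets)
    scores = np.empty((expr.shape[0], len(names)))
    for j, name in enumerate(names):
        cols = [index[g] for g in gene_sets[name] if g in index]
        # A set with no measured genes gets NaN rather than a misleading zero.
        scores[:, j] = z[:, cols].mean(axis=1) if cols else np.nan
    return names, scores

# Toy data: 8 samples, 6 genes, two hypothetical gene sets.
rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(6)]
expr = rng.normal(size=(8, 6))
sets = {"pathway_A": ["g0", "g1", "g2"], "pathway_B": ["g3", "g4", "g5"]}

pcs = pca_scores(expr, 2)
set_names, set_scores = gene_set_scores(expr, genes, sets)
```

Both routes reduce six genes to two features per sample; the difference is that the gene-set features carry a biological label, whereas principal components are defined only by the data.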

[...] research. For technology developments, we demonstrate the advantages of a microarray platform in large clinical studies, and its applications for alternative splicing analysis. For method developments, we present several biological applications and new methods for knowledge-based approaches, including an application to a large scale clinical study for traumatic patients.

1.2 Thesis organization

First, technology developments are discussed in Chapters 2 and 3. In Chapter 2, the Glue Grant Human (GG-H) array, a new microarray platform developed by us and other colleagues, is briefly introduced, and its advantages for large clinical studies are demonstrated in comparison with sequencing technology. In Chapter 3, we demonstrate that the GG-H array improves alternative splicing analysis when combined with an appropriate computational method. Second, method developments are discussed in Chapters 4 and 5. Chapter 4 introduces several applications and new method developments in knowledge-based analysis for high-throughput genomic data. In Chapter 5, we present a typical translational study for severe traumatic patients, where high-throughput genomic technology and knowledge-based analysis were applied, and significant clinical implications were found. Finally, we summarize the contributions of [...]

Chapter 2 Comparison of Microarray and Sequencing Technologies

For the last decades, microarray technology has been widely used to measure expression levels of RNA transcripts [8-9]. With the developments of high-density oligonucleotide chips such as the Glue Grant Human (GG-H) array, microarray technology provides comprehensive, reliable, and efficient platforms for large scale clinical studies. The recent development of high-throughput mRNA sequencing (mRNA-Seq) technology provides a new and promising platform to measure mRNA abundance [10-11]. However, for large clinical studies, the availability of the current sequencing technology is significantly limited by its cost, sequencing depth, sample throughput, and requirements for large amounts of patient samples.

In this chapter, we compare microarray and mRNA sequencing technologies using the GG-H array and Illumina mRNA-Seq platforms. First, we briefly introduce the GG-H array as a genomic research platform for large clinical studies. Then, we compare the GG-H array with mRNA-Seq on reproducibility, dynamic ranges, detection power for differential expression, as well as costs and sample throughputs. The overall results show that the GG-H array provides a suitable platform for a large clinical study, especially compared with the current mRNA-Seq technology.

2.1 The Glue Grant Human (GG-H) array

[...] requires high quality genomic data from hundreds of patients in a time- and cost-efficient manner. High-throughput microarray technology [8-9], which allows monitoring of genome-wide molecular signatures from mRNA transcripts, has been applied in many clinical studies to assess gene expression profiles of patients for disease predisposition, diagnostics, prognoses, and individualized medicine regimens [25-29].

While most of the currently available microarray platforms aim to measure the expression levels of genes, the human transcriptome is undoubtedly more complex. For example, alternative splicing of mRNA is a major mechanism that generates diverse mRNA transcript isoforms, and subsequently different proteins and their functions, in humans and other higher organisms [30-31]. Alternative splicing is observed at developmental stages and between distinct responses to external stimulus, as well as in various human diseases such as Alzheimer's disease, cystic fibrosis, and multiple cancers [32-35]. Moreover, variants in coding and non-coding regions of a genome cause different transcriptional regulations and protein properties [36], and potentially differentiate drug responses and risks for diseases [37-38]. In addition, the human transcriptome is even more complicated by functional and regulatory non-coding RNAs, anti-sense transcripts, and untranslated regions of genes.

Human transcriptome studies in medical research have been limited by the availability of comprehensive, high-throughput, and time/cost-efficient platforms. For example, to date, the Affymetrix Human Exon (HuEx) array [18] is the only commercially available microarray platform for exon and gene expression profiling as well as alternative splicing analysis. With 2-4 probes per probe set, the HuEx array targets known [...] annotations. Therefore, this array is easily influenced by false positive signals in alternative splicing analysis. On the other hand, customized arrays with probes targeting exon-exon junctions have been introduced to improve alternative splicing studies [39-42]; however, many of them are still experimental and improper for general transcriptome studies. As another example, currently there is no available microarray platform for genome-wide allele-specific expression profiling. Microarray platforms such as the Affymetrix SNP array [43] focus on detecting genome-wide DNA variants rather than expression profiling. Recently developed high-throughput mRNA sequencing (mRNA-Seq) is another high-throughput method to identify and quantify transcript isoforms [10-11]. However, its high processing cost, low sample throughput, and requirements of large amounts of total RNA from patients still need to be improved for large clinical studies. These technical challenges have limited comprehensive genomic studies in clinical research, and there has been a need for a new genomic platform for a large clinical setting.

Here, we and other colleagues have designed a new genomic research platform based on high-density oligonucleotide array technology. The Glue Grant Human (GG-H) array is a customized multi-purpose array for expression profiling of genes, exons, and functional RNAs as well as analysis of alternative splicing and allele-specific expression. The GG-H array provides reliable and efficient measurements for comprehensive genomic features as well as a high sample throughput through automated pipelines in microarray processing.

The GG-H array has 6.9 million oligonucleotide probes targeting 315,137 exons, [...] 50,783 antisense non-coding RNAs, and 49,937 unannotated transcribed units (Table 2.1 and Figure 2.1). While conventional 3' gene arrays such as the Affymetrix HG-U133 array [17] have 11 probes on the 3' end exon(s) of each gene, exon arrays and the GG-H array have probes tiling genome-wide exons. In contrast to conventional exon arrays with 2-4 probes for each exon [18], the GG-H array has on average ten probes per exon as well as four probes for each exon-exon junction. The exons and junctions of the GG-H array were collected from transcript annotations of RefSeq [44], Ensembl [45], and UCSC Known Genes [46], complemented by EBI's AEDB exons [47] and UCLA ASAP2 cassette exons [48]. Exon probes were designed by considering probe performance in thermodynamics, cross hybridization to off-target transcribed regions, and spread of the selected probes over an exon. Junction probes were designed by tiling exon-exon junction sites at -3, -1, +1 and +3 bases off. In addition, the GG-H array targets SNPs in coding regions collected from the dbSNP (build 126) database [49], with six probes designed for each allele at -4, [...] and +4 positions on each of the two strands. Non-coding functional RNAs were manually surveyed from several databases [50-52]. Antisense RNAs that overlap with RefSeq genes were also collected. Ten probes were designed for each non-coding RNA. Besides, we targeted 3' and 5' unannotated transcribed regions of RefSeq genes at the density of one probe per 50 bases and a minimum of six probes per region.

Figure 2.1 Array designs of a 3' gene array, an exon array, and the GG-H array
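The junction tiling described above can be sketched as follows. This is an illustration under stated assumptions, not the actual GG-H design procedure: the probe length of 25 bases is assumed (the text does not give it), an offset is interpreted as the position of the probe's central base relative to the junction (-1 is the last base of the upstream exon, +1 the first base of the downstream exon), and the function returns target sequences on the transcript strand rather than the complementary probe sequences.

```python
def junction_targets(upstream_exon, downstream_exon,
                     offsets=(-3, -1, 1, 3), probe_len=25):
    """Return {offset: target sequence} for probes tiling one exon-exon junction.

    probe_len and the meaning of an offset are assumptions of this sketch.
    """
    if probe_len % 2 == 0:
        raise ValueError("this sketch expects an odd probe length")
    spliced = upstream_exon + downstream_exon
    junction = len(upstream_exon)  # index of the first downstream base
    half = probe_len // 2
    targets = {}
    for k in offsets:
        if k == 0:
            raise ValueError("offset 0 is undefined; use -1 or +1")
        # Index of the central base: k-th base downstream (k > 0)
        # or |k|-th base upstream (k < 0) of the junction.
        center = junction + k - 1 if k > 0 else junction + k
        start, end = center - half, center + half + 1
        if start < 0 or end > len(spliced):
            raise ValueError("exons too short for this offset")
        targets[k] = spliced[start:end]
    return targets

# Toy exons: all-A upstream and all-C downstream make the split easy to read.
targets = junction_targets("A" * 20, "C" * 20)
```

With these toy exons, the offset +1 target contains 12 upstream and 13 downstream bases, and the offset -3 target contains 15 upstream and 10 downstream bases, so the four targets straddle the junction at slightly different balances.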

Table 2.1 The contents of the GG-H array

Array contents                   Num. of targets   Num. of probes
Exon                             318,137           3,292,929
Exon-exon junction               260,488           1,060,703
SNP                              5,978             98,294
Non-coding functional RNA        730               5,869
Non-coding antisense RNA         50,783            363,007
Unannotated transcribed unit     49,937            485,581
Control probes                                     498,840
Total                                              6,892,960

2.2 Comparison settings

Microarray and mRNA sequencing technologies are compared on human liver and muscle tissues. For array technology, four replicates for each tissue were hybridized on the GG-H arrays. The samples were prepared following the Ambion protocol with 50 ng of total cellular RNA as starting material. For sequencing technology, four replicated samples for each tissue were independently prepared following the standard protocol starting with 2 µg of total RNA. Each prepared sample was sequenced in three or four lanes of the Illumina Genome Analyzer II™ platform. In total, eight arrays and sequencing runs for human liver and muscle tissues were prepared for the comparison.

Raw probe intensities of microarray samples were processed through the background correction of robust multiarray analysis (RMA), median-scaling normalization, and median polish for summarization [53]. Gene and exon expression levels were estimated according to the annotations of the GG-H array collected from multiple public databases.
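Two of the steps named above, median-scaling normalization and median-polish summarization, can be sketched in a few lines. This is a simplified illustration, not the pipeline used for the GG-H array: the RMA background correction is omitted, the common target median (the mean of the per-array medians) is an assumption, and the toy probe set is purely additive on the log scale.

```python
import numpy as np

def median_scale(intensities, target=None):
    """Scale each array (column) so that all arrays share one median.

    intensities: (n_probes, n_arrays) positive probe intensities.
    target: common median; defaults to the mean of the array medians
            (an assumption of this sketch).
    """
    intensities = np.asarray(intensities, dtype=float)
    medians = np.median(intensities, axis=0)
    if target is None:
        target = medians.mean()
    return intensities * (target / medians)

def median_polish(log_values, max_iter=10, tol=1e-8):
    """Tukey's median polish for one probe set.

    log_values: (n_probes, n_arrays) log-scale intensities.
    Returns one expression estimate per array (overall + array effect).
    """
    resid = np.array(log_values, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])  # probe effects
    col = np.zeros(resid.shape[1])  # array effects
    for _ in range(max_iter):
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row += row_med
        shift = np.median(col)
        col -= shift
        overall += shift
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col += col_med
        shift = np.median(row)
        row -= shift
        overall += shift
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall + col

# Toy probe set: 3 probes x 4 arrays, additive on the log scale.
probe_effects = np.array([0.0, 1.0, 2.0])
array_levels = np.array([5.0, 6.0, 8.0, 9.0])
log_intensity = probe_effects[:, None] + array_levels[None, :]
estimates = median_polish(log_intensity)

scaled = median_scale(np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]))
```

Because medians are used at every step, a single misbehaving probe has little influence on the per-array estimate, which is the usual motivation for median polish over a plain mean.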

Sequencing reads with 36 bases were mapped over the exonic regions of GG-H array annotations. SeqMap was used for the mapping, allowing 2 mismatches [54]. About 65% of the total reads were uniquely mapped on exonic regions, and on average 36 million uniquely mapped reads were collected for each sample. Expression indices of exons were calculated as reads per kilobase per million mapped reads (RPKM) [55], and gene expression was calculated in RPKM as an average of exon expression weighted by exon lengths.
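The RPKM calculation and the length-weighted gene summary described above can be written as a short worked example. The read counts and exon lengths are made up for illustration; only the total of 36 million mapped reads is taken from the text.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count / (length_bp / 1e3) / (total_mapped_reads / 1e6)

def gene_rpkm(exon_counts, exon_lengths, total_mapped_reads):
    """Gene expression as the exon-length-weighted average of exon RPKMs."""
    exon_values = [rpkm(c, l, total_mapped_reads)
                   for c, l in zip(exon_counts, exon_lengths)]
    weighted = sum(v * l for v, l in zip(exon_values, exon_lengths))
    return weighted / sum(exon_lengths)

# Hypothetical gene with two exons in a sample with 36 million mapped reads.
total_reads = 36e6
counts = [100, 600]
lengths = [500, 1500]
exon_expr = [rpkm(c, l, total_reads) for c, l in zip(counts, lengths)]
gene_expr = gene_rpkm(counts, lengths, total_reads)
```

Weighting by exon length makes the gene value identical to the RPKM obtained by pooling all reads over the total exonic length, so long exons are not under-represented in the average.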

2.3 Technical specifications

In this section, we look at important technical specifications of array and sequencing technologies including reproducibility, dynamic ranges, and detection power for differential expression, as well as processing costs and sample throughputs. In a clinical setting, a research platform is required to have high reproducibility with low variances for reliable measurements, and high dynamic ranges and detection power for sensitive measurements. For a feasible implementation in a large scale study, low cost and a [...]

Figure 2.2 Reproducibility of array and sequencing

Reproducibility in expression levels of exons and genes in muscle samples measured by array and sequencing platforms. Each panel shows scatter plots of logged expression on the bottom-left side, and the corresponding Pearson's correlation coefficients on the top-right side.

Reproducibility of array and sequencing was evaluated with correlations of expression levels among four technical replicates. Pearson's correlation coefficients were calculated with logged expression values for both array and sequencing samples. For both technologies, the calculation of correlation was performed on ~140,000 exons and 17,000 genes that had non-zero reads in all of the four sequencing replicates. [...] reproducibility and reduce variances [53]; instead, we used median scaling normalization that scales probe intensities according to their median values, similar to the approach of RPKM in sequencing data that scales the expression values by the total number of reads. As shown in Figure 2.2, the expression levels of exons and genes in both array and sequencing platforms were highly reproducible. Across the independent replicates, the correlation coefficients of array were ≥0.99 in genes and ~0.99 in exons. The sequencing platforms showed correlation coefficients of ~0.99 in gene expression, similar to those of array, but the exon correlations were ~0.96, lower than those of array platforms. The poorer reproducibility of exons in sequencing is expected because low abundant exons with a few reads can have large variances with a difference of one or two reads. The lack of sequencing reads affects the estimation of gene expression less because a gene is on average ten times longer than an exon and is expected to have ten times more reads.
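The reproducibility measure used above (Pearson's correlation of logged expression among replicates, restricted to features that are non-zero in every replicate) can be sketched as below. The data are simulated, and the base-2 logarithm is an assumption, since the text only says "logged".

```python
import numpy as np

def replicate_correlations(expr):
    """Pairwise Pearson correlations of log2 expression between replicates.

    expr: (n_features, n_replicates) non-negative expression values.
    Features with a zero in any replicate are dropped before taking logs.
    """
    expr = np.asarray(expr, dtype=float)
    keep = (expr > 0).all(axis=1)
    logged = np.log2(expr[keep])
    # rowvar=False treats each column (replicate) as one variable.
    return np.corrcoef(logged, rowvar=False)

# Simulated data: 1,000 features, 4 replicates with 10% multiplicative noise.
rng = np.random.default_rng(1)
true_level = rng.lognormal(mean=3.0, sigma=1.5, size=1000)
reps = true_level[:, None] * rng.lognormal(mean=0.0, sigma=0.1, size=(1000, 4))
reps[:5, 0] = 0.0  # a few features with no signal in one replicate
corr = replicate_correlations(reps)
```

The result is a symmetric 4 x 4 matrix with ones on the diagonal; its off-diagonal entries are the kind of values reported in the upper triangles of Figure 2.2.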

Dynamic ranges of array and sequencing were tested with gene expression of liver and muscle tissue samples (Figure 2.3). About 20,000 genes with non-zero sequencing reads in both tissues were considered for the comparison. The signal range of gene expression measured by array was from 2^[...] to 2^[...], while that of sequencing was much larger. The expression fold changes between two tissues in array were up to 2^10, of which the dynamic range was about 1,000 folds. For the same setting, the dynamic range of sequencing was more than 10,000 folds. It is well known that the dynamic range of microarray is limited by saturated probe performance. The dynamic range of [...]

Figure 2.3 Dynamic ranges of array and sequencing

(A) Average gene expression levels measured by the GG-H array and mRNA-Seq for muscle samples. (B) Fold changes of gene expression between liver and muscle samples measured by the GG-H array and mRNA-Seq.

[...] times larger than array, the expression fold changes of significantly changed genes showed a much smaller difference between array and sequencing. For ~7,000 differentially expressed genes that were commonly detected by both array and sequencing (see next paragraph for details), the median fold change of array was two folds while that of sequencing was four folds.

To evaluate the detection power of array and sequencing, we tested the significance of differential expression of genes and exons between liver and muscle tissues. Differentially expressed genes in sequencing are often detected by parametric models assuming Poisson sampling [56], which ignores variances from sample preparation. To avoid any bias caused by assumed parametric models, we instead applied nonparametric permutation tests [57] to the array and sequencing analyses. Figure 2.4 shows [...] range of false discovery rates (FDRs). With widely accepted criteria of 1% or 0.1% FDR, similar numbers of differentially expressed genes were found in array and sequencing, while more differentially expressed exons were discovered in array. The detection power for exons in sequencing is limited by large variances of the estimated expression, especially for lowly expressed exons. It is less problematic for genes because genes have better reproducibility than exons, as shown before. The sets of genes and exons identified by the array and sequencing analyses heavily overlapped with each other (Fisher's test p-value < 10^[...]). For example, ~11,000 genes were identified by the array analysis with FDR < 0.1%, among which ~7,000 genes were also identified by the sequencing analysis.
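A nonparametric permutation test followed by FDR control can be sketched as follows. This is not necessarily the procedure of reference [57]: the sketch uses an exact per-feature permutation test on the difference in group means and the Benjamini-Hochberg adjustment, both chosen here for simplicity, and the two toy features are hypothetical.

```python
from itertools import combinations
import numpy as np

def permutation_pvalues(group_a, group_b):
    """Exact two-sided permutation p-values for a difference in means.

    group_a, group_b: (n_features, n_a) and (n_features, n_b) arrays of
    log expression. All relabelings of the samples are enumerated.
    """
    data = np.hstack([group_a, group_b]).astype(float)
    n_a, n = group_a.shape[1], data.shape[1]
    total = data.sum(axis=1)

    def abs_mean_diff(idx):
        a_sum = data[:, idx].sum(axis=1)
        return np.abs(a_sum / n_a - (total - a_sum) / (n - n_a))

    observed = abs_mean_diff(list(range(n_a)))
    hits = np.zeros(data.shape[0])
    n_perm = 0
    for idx in combinations(range(n), n_a):
        # Small tolerance so that ties with the observed statistic count.
        hits += abs_mean_diff(list(idx)) >= observed - 1e-12
        n_perm += 1
    return hits / n_perm

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity starting from the largest p-value.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Toy example: feature 0 differs between tissues, feature 1 does not.
liver = np.array([[0.0, 0.1, 0.2, 0.3], [1.0, 2.0, 3.0, 4.0]])
muscle = np.array([[5.0, 5.1, 5.2, 5.3], [1.0, 2.0, 3.0, 4.0]])
pvals = permutation_pvalues(liver, muscle)
qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

With four replicates per tissue there are only 70 relabelings, so the smallest attainable per-feature p-value is 2/70. Reaching FDR thresholds such as 1% therefore requires pooling null statistics across features, which this sketch does not do.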

Figure 2.4 Detection of differential expression in array and sequencing

(A) Genes. (B) Exons. Axes: counts versus log10(FDR).

To conduct a large-scale clinical study that preferably requires more than 1,000 samples, assay cost and sample throughput are also important technical specifications for choosing a proper platform. With similar initial instrument cost, one average core facility can be equipped with either one Affymetrix 7G scanner and four hybridization stations for array ($400,000) or one Illumina Genome Analyzer II™ (GA-II) for mRNA sequencing (~$600,000). Microarray technology requires ~$400 and one day to process one sample, and one core facility can process more than 20 samples per week. The current sequencing technology requires ~$12,000 and one week for a sequencing run. Assuming a half flow cell per sample, one core facility can process two samples per week with sequencing technology. For a typical large clinical study with 1,000 samples, microarray technology requires $400,000 for sample processing plus $400,000 for initial cost, and 1 year for sample processing. In contrast, the current sequencing technology needs 6 million dollars even excluding its initial cost, and more than 10 years for sample processing, which is practically impossible to conduct. At least at this moment, microarray technology is the only feasible solution for a large-scale clinical setting. The comparison of microarray and sequencing specifications is summarized in Table 2.2.
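The budget arithmetic above can be checked with a short sketch. The helper `study_budget` is hypothetical, and the sequencing throughput of two samples per week is an assumption derived from one run per week with half a flow cell per sample.

```python
def study_budget(n_samples, cost_per_sample, samples_per_week, initial_cost=0):
    """Return (total cost in dollars, processing time in years)."""
    total_cost = n_samples * cost_per_sample + initial_cost
    years = n_samples / samples_per_week / 52
    return total_cost, years


# Microarray: ~$400 per sample, 20 samples per week, $400,000 instrument cost.
array_cost, array_years = study_budget(1000, 400, 20, initial_cost=400_000)

# Sequencing: ~$12,000 per run, half a flow cell per sample,
# assumed throughput of two samples per week, instrument cost excluded.
seq_cost, seq_years = study_budget(1000, 12_000 / 2, 2)

print(f"array:      ${array_cost:,.0f}, {array_years:.1f} years")
print(f"sequencing: ${seq_cost:,.0f}, {seq_years:.1f} years")
```

Under these assumptions the array study takes about one year and the sequencing study roughly a decade, which matches the order of magnitude quoted in the text.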

Table 2.2 Comparison of microarray and sequencing technical specifications

                      GG-H array         mRNA sequencing
Cost/sample           ~$400              ~$6,000 (half flow cell)
Processing time       1 day              1 week
Sample requirements   50 ng              2 µg
Dynamic range         ~1,000             >10,000
Reproducibility       >0.99 for genes    0.99 for genes
                      >0.99 for exons    0.96 for exons


2.4 Estimation of sequencing equivalence to array

[…] The detection power for differentially expressed genes of sequencing […] array with 36 million reads per replicate; however, the detection power […] number of sequencing reads equivalent to one array in terms of reproducibility. Here, the reproducibility of technical replicates was measured as the coefficient of variation (CoV).

For gene expression of the four technical replicates, the overall median CoV of array was 0.062, and in sequencing the subset of 600 genes with a minimum of 210 reads (mean of 924 reads) achieved the same median CoV (Figure 2.5A). To bring the median CoV of all the ~20,000 genes with at least one read (median of 108 reads) to the same level of CoV as array while keeping the overall expression distribution, a sequencing sample requires ~305 million (= 924/108 × 36M) reads. A more conservative estimation, which only takes into consideration genes that have a minimum estimated abundance of 0.1 RPKM, or roughly 4 reads per replicate for a gene with a length of 1,000 bases, resulted in ~150 million reads. Similarly, 360 million reads were estimated for exons (Figure 2.5B), and conservatively 20 million reads for exons with a minimum abundance of 1 RPKM, or roughly 4 reads per replicate for an


Figure 2.5 Coefficients of variation (CoV) of array and sequencing

CoVs were measured in the expression of (A) genes and (B) exons. The y-axis represents median CoVs of the subset of genes/exons with a minimum number of sequencing reads on the x-axis. A black dot represents the equivalent number of reads to array. Dashed red lines represent the estimated CoVs from sequencing sample preparation, assuming Poisson sampling variances.
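The read-equivalence estimate in this section is a proportional scaling and can be reproduced with a small sketch. The function names are hypothetical. The RPKM conversion uses the standard definition (reads per kilobase per million mapped reads), and the 100-base exon length in the last call is an assumption, since the source sentence is cut off at that point.

```python
def equivalent_reads(total_reads, reads_needed, reads_observed):
    """Scale the sequencing depth so a typical gene reaches `reads_needed`."""
    return total_reads * reads_needed / reads_observed


def expected_reads(rpkm, length_bases, total_reads):
    """Expected read count for a feature of given abundance and length."""
    return rpkm * (length_bases / 1_000) * (total_reads / 1_000_000)


depth = 36_000_000  # reads per replicate in this study

# Genes needed a mean of 924 reads to match array CoV; the median gene had 108.
gene_equivalent = equivalent_reads(depth, 924, 108)
print(f"equivalent depth: {gene_equivalent / 1e6:.0f} million reads")

# 0.1 RPKM for a 1,000-base gene, and 1 RPKM for an assumed 100-base exon.
print(expected_reads(0.1, 1_000, depth))
print(expected_reads(1.0, 100, depth))
```

The scaling gives roughly 3 × 10^8 reads, in line with the figure quoted above, and both abundance thresholds correspond to about 4 reads per replicate at this depth.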

The total variance observed from four replicates of sequencing can be approximated as the sum of the sample preparation variance and the sampling variance of sequencing. The latter can be approximately estimated as 1/sqrt(average number of reads per replicate) under the assumption of Poisson sampling [56], which is 0.10 for 100 reads, 0.22 for 20 reads, 0.45 for 5 reads, and 1 for 1 read. Therefore, the sample preparation variance can be estimated as the difference between the total observed variance and the Poisson sampling variance. Interestingly, on average more than half of the observed variance is estimated to come from sample preparation for genes and exons with more than 4 reads, or roughly a minimum abundance of 0.1 RPKM and 1 RPKM, respectively. In addition, while the Poisson sampling variation decreases when the total number of reads increases in an experiment, the variations introduced by the sam
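A minimal sketch of this variance decomposition follows. It assumes that the two sources are independent, so that squared CoVs add; the function names and the example total CoV are hypothetical.

```python
import math


def poisson_cv(mean_reads):
    """CoV of a Poisson count: standard deviation / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean_reads)


def sample_prep_cv(total_cv, mean_reads):
    """Estimate the sample preparation CoV from the observed total CoV.

    Assumes independent sources, so total_cv**2 = prep_cv**2 + 1 / mean_reads.
    Negative residuals (sampling noise alone explains the total) are clipped.
    """
    residual = total_cv ** 2 - 1.0 / mean_reads
    return math.sqrt(max(residual, 0.0))


for reads in (100, 20, 5, 1):
    print(reads, round(poisson_cv(reads), 2))

# Hypothetical gene: total CoV of 0.25 observed at a mean of 100 reads.
print(round(sample_prep_cv(0.25, 100), 3))
```

The loop reproduces the Poisson values listed in the text for 100, 20, 5, and 1 reads.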


2.5 Summary

Along with increasing needs for genomic platforms in clinical studies, we and other colleagues have developed the GG-H array, a customized high-density microarray platform. The GG-H array provides comprehensive, reliable, accurate, and efficient measurements of the human transcriptome. While conventional array platforms focus on gene or exon expression profiling, the GG-H array has more comprehensive contents targeting exons, exon-exon junctions, coding SNPs, and non-coding RNAs. The rich probe design on exons and junctions of the array improves expression analysis as well as alternative splicing analysis. The array also achieves high reproducibility, reasonable dynamic ranges, and high detection power for differential expression, as well as low processing cost and high yield, especially compared with recently developed high-throughput mRNA sequencing technology. The current mRNA sequencing requires more reads to be equivalent to array, which was estimated as 150-300 million reads per one array. At this moment, the GG-H array is the only feasible solution for a large clinical study with more than 1,000 samples.

We have applied the GG-H array to an ongoing large clinical study […] patients in the Glue Grant research consortium, Inflammation and the Host Response to Injury (www.gluegrant.org). The primary focus of the study is investigating the temporal genomic response to severe injury in human blood T cells, monocytes, and neutrophils. We obtained 25-200 ng of total RNA from a patient for each cell type and each time point, with on average $7,700 for sample collection cost per subject […] within a year, with approximately $400 for processing cost per sample. Our experience demonstrates that the GG-H array is a suitable platform for a large clinical study, and we expect that this array platform will be applied to a wide range of applications in high-throughput clinical studies.


Chapter 3 Alternative Splicing Analysis in Microarrays

Alternative splicing of mRNA is a major mechanism for generating diverse mRNA transcript isoforms from a single gene, and it subsequently differentiates proteins with varying binding properties, intercellular localizations, enzymatic activities, and expression regulations [30]. Alternative splicing plays an important role in many cellular and developmental processes such as cell differentiation and apoptosis [31]. Alternative splicing has been observed across different tissue types, between distinct responses to external stimuli, and among various human diseases [38-60]. It is also known that alternative splicing is associated with heritable diseases and several types of cancer [61, …]. Moreover, recent studies reported that alternative splicing is a common event widely spread over a genome. Genome-wide studies using high-throughput mRNA sequencing demonstrated that more than 90% of genes undergo alternative splicing [55, 63].

For technology developments, in the previous chapter we introduced the GG-H array as an accurate and comprehensive research platform for a large clinical study. Comprehensive measurements of the GG-H array enable improved analyses of molecular signatures beyond gene expression, including alternative splicing, allele-specific expression, and expression of functional and regulatory RNAs. As an example of the


3.1 Introduction

The development of high-throughput microarray technology has significantly enhanced alternative splicing studies. With the increasing density of oligonucleotide probes, microarrays nowadays cover most known and predicted exons throughout a whole genome. Probes designed over exons allow measurement of exon expression differentiated by alternative splicing, which leads to detection of genome-wide alternative splicing events. During past years, there have been several genome-wide alternative splicing studies with microarrays [61-67]. These studies commonly followed an approach of detecting candidates from microarray data and verifying selected candidates with real-time polymerase chain reaction (RT-PCR) experiments. While some recent studies reported successful validation rates over 80%, many other studies have suffered from low validation rates [68]. Improving detection efficiency turned out to be a major challenge in alternative splicing detection with microarrays. For this purpose, several computational methods have been introduced. The splicing index (SI) method calculates exon expression normalized to gene expression, and detects exons with large fold changes of normalized exon expression as alternatively spliced ones [69]. The microarray detection of alternative splicing (MIDAS) method calculates t-statistics of normalized exon expression changes [70]. Instead of using summarized exon expression, the microarray analysis of differential splicing (MADS) method computes t-statistics of probe intensity changes normalized to gene expression, and it summarizes probe-level p-values into a probe-set-level p-value [68]. Detection above background (DABG) is additional-


Figure 3.1 A gene with three exons and three junctions

This gene produces two isoforms by alternative splicing of exon 2. Exons 1 and 3 are constitutive and exon 2 is alternative. Exon 2 is associated with three junctions. Junctions 12 and 23 are inclusion junctions of exon 2, and junction 13 is an exclusion junction.

expression change averaging splicing index (PECA-SI) method utilizes probe-level splicing indices instead of probe-set-level splicing indices [7
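The splicing index idea described above (exon expression normalized to gene expression, then compared between conditions) can be sketched as follows. This is a generic log2-ratio form and may differ in detail from the method of reference [69]; the function name and the intensity values are hypothetical.

```python
import math


def splicing_index(exon_a, gene_a, exon_b, gene_b):
    """log2 ratio of gene-normalized exon expression between conditions A and B."""
    normalized_a = exon_a / gene_a
    normalized_b = exon_b / gene_b
    return math.log2(normalized_a / normalized_b)


# Hypothetical intensities: the gene level is unchanged between conditions,
# but the exon signal is four times lower in condition A.
si = splicing_index(exon_a=200.0, gene_a=1000.0, exon_b=800.0, gene_b=1000.0)
print(f"splicing index = {si:.2f}")
```

An exon whose index is far from zero is a candidate for alternative splicing. Because the index depends on the gene-level estimate, a gene expression value that is itself biased by spliced exons distorts it, which motivates the constitutive exon selection proposed later in this chapter.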

Along with many efforts on exon arrays, the development of junction arrays like the GG-H array has been strengthening the power of microarrays in alternative splicing studies [39-42]. Junction arrays have probes designed over junction regions between two adjacent exons (Figure 3.1). In general, an exon is associated with several junctions according to its transcript structure. An inclusion junction of an exon targets isoforms containing the corresponding exon, and its expression change is expected to follow that of the exon. In contrast, an exclusion junction targets isoforms where the corresponding exon is not involved, and its expression change is often in the opposite direction of the exon expression change. Junction signals correlated with exon signals also reflect alternative splicing events, and therefore junctions are expected to provide a


As junction arrays are actively introduced with the potential to improve alternative splicing detection, it becomes necessary to develop a new detection method specific to junction arrays. For such a detection method, it is essential to investigate the complex properties of inclusion and exclusion junctions in relation to exons. Additionally, considering alternative splicing in the gene expression calculation is also important. Exon expression differentially contributes to gene expression by alternative splicing, which often makes gene expression correlated with the expression of alternatively spliced exons. To improve detection efficiency further, gene expression needs to be calculated over selected exons that are not likely to be spliced.

Based on an in-depth investigation of junction properties, here we propose a new computational method to detect alternatively spliced exons for junction arrays. The proposed method works through three phases: 1) putatively constitutive exon selection, 2) detection of alternatively spliced exons with supporting junctions, and 3) misbehaving junction filtering. First, the method calculates accurate gene expression with selected exons to avoid biases caused by alternative splicing. Then, it detects alternatively spliced exons supported by junctions to reduce false positives. Finally, the method saves candidates missed in the previous phase because of misbehaving junctions.

3.2 Putatively constitutive exon selection


consistent over all putatively constitutive exons of a gene. The proposed algorithm selects putatively constitutive exons through two steps. At the first step, exons absent in any one of the groups are filtered out. If any group DABG p-value, a geometric mean of the sample DABG p-values of a group, is larger than a predefined p-value p, the exon is determined to be putatively alternative. At the second step, the algorithm detects deviated exons that have significantly different expression ratios from other exons. For all exons except the ones filtered out at the first step, the mean μ and standard deviation σ of their logged expression ratios are calculated. Then, an exon whose logged expression ratio lies below or above μ by more than a predefined range parameter times σ is marked as an outlier. The mean and standard deviation are recalculated using unmarked exons, and outliers are detected again. These calculations are repeated until the outlier marking converges. Finally, non-outliers at the second step are selected as putatively constitutive exons. All other exons are considered to be putatively alternative. If the second step does not converge within ten iterations, or the finally determined putatively constitutive exons are fewer than three, all exons are considered to be putatively constitutive. Finally, gene expression is calculated over putatively constitutive exons instead of using all exons or annotated constitutive exons. The proposed method makes the calculated gene expression less correlated with the expression of alternatively spliced exons, while the gene expression calculated with all exons includes potentially alternatively spliced exons. This method also provides robust gene expression by calculating


Figure 3.2 Putatively constitutive exon selection for GARNL1

At the top, the structure of GARNL1 from UCSC Known Gene is shown. At the bottom, logged exon expression indices of liver and muscle are plotted. Black and red colors represent annotated constitutive and alternative exons, respectively. A putatively constitutive exon is marked with a circle. A putatively alternative exon is marked with a cross if selected by probe set presence, or a triangle if selected by expression ratio deviation. The blue solid line represents the average expression ratio between the two tissues, and blue dashed lines show the upper and lower boundaries for the deviated exon detection. Indicated is an alternatively spliced exon (chr14: 35,230,119-35,239,259) that was detected as a confident candidate in this work.

Figure 3.2 shows an example of putatively constitutive exon selection between liver and muscle tissue samples. Among the 28 annotated alternative and 21 annotated constitutive exons of GARNL1, 41 exons were selected as putatively constitutive exons, including 22 annotated alternative exons. In the figure, these annotated alternative exons are aligned well with constitutive exons, which implies that they were not actually spliced between the two tissues. Most alternative exons are involved in long isoforms


isoforms rarely change even though the short isoform changes significantly. Therefore, they have similar expression ratios to constitutive exons and strengthen the gene expression calculation. Among the eight putatively alternative exons, one exon (chr14: 35,230,119-35,239,259) clearly shows a different expression ratio from the other exons. It is the only alternative exon included in the short isoform, and its expression directly reflects the short isoform change. This exon was detected as a confident candidate of alternative splicing later.
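The two-step selection described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `select_constitutive`, `p_cutoff`, and `range_param` and their default values are hypothetical, the sample standard deviation is used, and every present exon is re-tested against the updated boundaries at each iteration.

```python
from statistics import mean, stdev


def select_constitutive(log_ratios, group_dabg, p_cutoff=0.05,
                        range_param=2.0, max_iter=10, min_exons=3):
    """Return the indices of putatively constitutive exons of one gene.

    log_ratios: logged expression ratio of each exon between two groups.
    group_dabg: per exon, the group-level DABG p-values (one per group).
    """
    n = len(log_ratios)
    # Step 1: drop exons that are absent (high DABG p-value) in any group.
    present = [i for i in range(n)
               if all(p <= p_cutoff for p in group_dabg[i])]

    # Step 2: iteratively mark exons whose ratio deviates from the rest.
    kept = set(present)
    converged = False
    for _ in range(max_iter):
        if len(kept) < 2:
            break  # standard deviation is undefined
        values = [log_ratios[i] for i in kept]
        mu, sigma = mean(values), stdev(values)
        low, high = mu - range_param * sigma, mu + range_param * sigma
        new_kept = {i for i in present if low <= log_ratios[i] <= high}
        if new_kept == kept:
            converged = True
            break
        kept = new_kept

    # Fallback: treat all exons as constitutive if the marking fails.
    if not converged or len(kept) < min_exons:
        return set(range(n))
    return kept


# Hypothetical gene with eight exons: exon 6 deviates, exon 7 is absent.
ratios = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 3.0, 1.02]
dabg = [[0.001, 0.002]] * 7 + [[0.001, 0.5]]
constitutive = select_constitutive(ratios, dabg)
print(sorted(constitutive))
```

Gene expression would then be summarized over the returned exons only, for example as the mean of their logged expression values.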

3.3 Detection with supporting junctions

Except for single-exon isoforms, all transcript isoforms have multiple exons and at least one exon-exon junction. If the expression of a transcript isoform changes significantly by an alternatively spliced exon, it can be observed through probes on the spliced exon as well as on junctions associated with the exon. These junctions can support the detection of the alternatively spliced exon. However, some junctions of the spliced exon may not reflect the isoform expression change if they are involved in other rarely changing isoforms. For example, in Figure 3.3A, when the expression of isoform 2 changes significantly by alternative splicing of exon 2, junction 12 reflects the isoform change while junction 23 does not. Even though it is difficult to know which junctions are supporting, an alternatively spliced exon has at least one supporting junction.
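The notion of a supporting junction can be illustrated with a small sketch based only on the direction rules stated earlier: an inclusion junction is expected to change in the same direction as its exon, and an exclusion junction in the opposite direction. The function name, the `min_change` threshold, and the example values are hypothetical; the statistical test actually used in this work is not reproduced here.

```python
def supporting_junctions(exon_change, junction_changes, min_change=1.0):
    """List junctions whose change is consistent with the exon change.

    exon_change: logged expression change of the candidate exon.
    junction_changes: name -> (kind, logged change), where kind is
        "inclusion" or "exclusion".
    """
    supporting = []
    for name, (kind, change) in junction_changes.items():
        if abs(change) < min_change:
            continue  # junction does not change enough to be informative
        same_direction = (change > 0) == (exon_change > 0)
        if kind == "inclusion" and same_direction:
            supporting.append(name)
        elif kind == "exclusion" and not same_direction:
            supporting.append(name)
    return supporting


# Hypothetical changes for exon 2 and its three junctions.
junctions = {
    "12": ("inclusion", -1.8),  # follows the exon
    "23": ("inclusion", -0.1),  # shared with a rarely changing isoform
    "13": ("exclusion", 1.5),   # opposite direction to the exon
}
support = supporting_junctions(-2.0, junctions)
print(support)
```

Requiring at least one entry in the returned list mirrors the statement above that an alternatively spliced exon has at least one supporting junction.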

In most cases, supporting junctions are superior to exons in terms of reflecting isoform expression changes. When the expression change of an isoform is observed through exon or junction probes, mixed signals from significantly changing and

Date posted: 21/02/2014, 05:20
