VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
GRADUATION PROJECT
A Comparative Study of Variational Autoencoders with Different Encoder-Decoder
Architectures for Time-Series Data Generation
Tran Ngoc Thanh Binh
Hanoi - 2024
VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
GRADUATION PROJECT
A Comparative Study of Variational Autoencoders with Different Encoder-Decoder
Architectures for Time-Series Data Generation
SUPERVISOR: Dr Nguyen Quang Thuan
STUDENT: Tran Ngoc Thanh Binh
STUDENT ID: 20070902
COHORT: QH-2020-Q
SUBJECT CODE: INS401101
MAJOR: Business Data Analytics
Hanoi - 2024
Acknowledgments

First, I want to send my sincere gratitude to my advisor, Dr Nguyen Quang Thuan, for the continuous support and guidance of my thesis, for his unwavering belief in me and this thesis, and for his immense knowledge. I want to thank all the members of the International School's Club of Science and Technology's Data Science team for allowing me to work with them for the last three years; your enthusiasm and guidance are there to be remembered. Finally, I would like to thank my family and friends who have always supported my journey in college.
Abstract

The explosive growth of Large Language Models (LLMs) in the past year has raised significant interest in acquiring as much data as possible. The main problem is that not all of the data can or should be acquired; this problem concerns the privacy and copyrights of Internet users, copyright holders, and constitutions around the world. Hence, synthetic data has been gaining traction as a powerful solution to the challenges of privacy and diversity of data. This project focuses on a subset of this problem, Time Series data generation; the need for synthetic Time Series data arises in various applications, including data augmentation and privacy preservation. This comparative study concerns the efficacy and accuracy of Time Series data generation methods, particularly those based on Variational Autoencoders.

Autoencoders are artificial neural network architectures intended for the compression and reconstruction of data. The Variational Autoencoder (VAE), in particular, achieves this by introducing a probabilistic view of encoding and using stochastic variational inference. VAEs have been widely used for data reconstruction and generation. This thesis aims to test different variational autoencoders for generating Time Series data.

This thesis investigates two novel implementations of VAE to find out the strengths and weaknesses of each architecture on different types of Time Series data. I compare the generative capability of these VAE-based architectures on various types of Time Series data. This project uses an established framework that includes a standardized preprocessing pipeline and systematic evaluations.
Contents

1 Introduction
1.1 Time Series data in different domains
1.1.1 Healthcare
1.1.2 Energy
1.2 Importance of synthetic data
1.3 Ethical Consideration for Synthetic Data Generation
1.4 Types of synthetic data generators
1.5 Existing generative methods
1.6 Chosen VAE-based methods for comparison
1.7 Contribution of this Thesis
1.8 Outline
2 Related Work and Background
2.1 Related Work
2.2 Synthetic data generation
2.3 Artificial Neural Networks
2.3.1 Recurrent Neural Networks
2.3.2 Convolutional Neural Network
2.4 Transformers
2.5 Fourier Transform
2.5.1 Short-Time Fourier Transform
2.6 Autoencoders
2.7 Variational Autoencoders
2.7.1 Formulation
2.7.2 Kullback-Leibler Divergence
2.7.3 Evidence Lower Bound (ELBO)
2.7.4 Reparameterization
3 VAE-based Time Series Generation methods
3.1 TimeVQVAE
3.1.1 Method proposed
3.1.1.1 Stage 1: Learning Vector Quantization
3.1.1.2 Stage 2: Prior Learning
3.2 TimeVAE
3.2.1 Method Proposed
3.2.1.1 Base TimeVAE Architecture
3.2.1.2 Interpretable TimeVAE
4 Method for Comparison
4.1 Disclaimer
4.2 Final methods for comparison
4.2.1 Metrics for TSGBench comparison
5 Experiments
5.1 TSG Benchmarking Results
5.2 Ablation Studies
5.2.1 TimeVAE's Ablation Study
5.2.2 TimeVQVAE's Ablation Study
5.2.3 Experiments Results
6 Conclusions and Future Work
List of Figures

2.1 Structure of a feed-forward neural network
2.2 An RNN with hidden state
2.3 Two-dimensional cross-correlation operation. The output is calculated as 0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19
2.4 Given a handwritten digit image, LeNet performs a series of computations to classify it into one of 10 categories, yielding a probability for each category as its output
2.5 The attention mechanism calculates a weighted combination of values v via attention pooling; these weights are determined by how well each query q matches or aligns with the corresponding keys ki
2.6 Multi-head attention
2.7 The Transformer architecture
2.8 Structure of an Autoencoder, extracted from [Wikipedia contributors, 2024]
2.9 Graphical model representation of VAE. Given N observed data points {x_i}, each data point is locally generated by a latent random variable z_i; θ is a global parameter, and is obtained through training
3.1 Overview of Stage 1 - Learning Vector Quantization
3.2 Stage 2 - Prior Model Training. Dark green blocks represent the masked tokens
3.3 Iterative decoding process with two passes. TF-LF and TF-HF denote the LF and HF bi-directional transformers respectively
3.4 Components of base TimeVAE
3.5 Interpretable TimeVAE components, extracted from [Desai et al., 2021]
3.6 Trend and Seasonality Blocks, extracted from [Desai et al., 2021]
5.1 TSG Benchmarking result
5.2 TSG Bench visualization by t-SNE and distribution plot, blue as Torg and orange as Tgen
5.3 Dense VAE vs Time VAE on Original vs Reconstructed Train for Air dataset
5.4 Dense VAE vs Time VAE on Original vs Reconstructed Train for Energy dataset
5.5 Dense VAE vs Time VAE on Original vs Reconstructed Train for Sine dataset
5.6 Dense VAE vs Time VAE on Original vs Reconstructed Train for Stockv dataset
5.7 Dense VAE vs Time VAE t-SNE plots Train for Air dataset
5.8 Dense VAE vs Time VAE t-SNE plots Train for Energy dataset
5.9 Dense VAE vs Time VAE t-SNE plots Train for Sine dataset
5.10 Dense VAE vs Time VAE t-SNE plots Train for Stockv dataset
5.11 Convolutional VAE vs Time VAE on Original vs Reconstructed Train for Air dataset
5.12 Convolutional VAE vs Time VAE on Original vs Reconstructed Train for Energy dataset
5.13 Dense VAE vs Time VAE on Original vs Reconstructed Train for Sine dataset
5.14 Dense VAE vs Time VAE on Original vs Reconstructed Train for Stockv dataset
5.15 Dense VAE vs Time VAE t-SNE plots Train for Air dataset
5.16 Convolutional VAE vs Time VAE t-SNE plots Train for Energy dataset
5.17 Convolutional VAE vs Time VAE t-SNE plots Train for Sine dataset
5.18 Convolutional VAE vs Time VAE t-SNE plots Train for Stockv dataset
5.19 FID and IS score for VQ-VAE and TimeVQVAE
5.20 VQ-VAE vs TimeVQVAE on reconstructing examples and generated samples
6.1 Turnitin Similarity Score of this Thesis
List of Abbreviations

TSG Time Series Generation
ANN Artificial Neural Network
VAE Variational Autoencoder
CNN Convolutional Neural Network
LLM Large Language Model
STFT Short-Time Fourier Transform
ISTFT Inverse Short-Time Fourier Transform
KL Kullback-Leibler
ELBO Evidence Lower Bound
VQ Vector Quantization
LF Low-Frequency
HF High-Frequency
Conv Convolution
FID Fréchet Inception Distance
C-FID Contextual Fréchet Inception Distance
ED Euclidean Distance
DTW Dynamic Time Warping
1 Introduction

In the past decades, with the boom of the Internet, coupled with humanity's endless effort in advancing science and technology, we have seen a rapid growth of Machine Learning and Deep Learning methods to solve problems that we have not been able to solve before, in fields including Computer Vision (CV) and Natural Language Processing (NLP). Problems involving time series have also been increasingly attempted and tackled using deep learning, including problems in classification [Ismail Fawaz et al., 2019], forecasting [Han et al., 2019], and anomaly detection [Gamboa, 2017]. The success of applying deep learning to those problems requires having a large amount of data. Unfortunately, time series tasks typically do not have enough data for such models. As we try to resolve this problem, data generation has become an effective tool to increase the size and quality of data. The main idea of applying data generation is to try to synthesize data points that are realistic. The recent boom of Large Language Models further signifies the need for time series data, as we are trying to improve LLMs' time series analysis capabilities [Zhang et al., 2024].

There are many methods for time series generation, ranging from basic methods based on the Time Domain and Frequency Domain to more advanced methods like Statistical Generative Models. However, Deep Generative Models remain less investigated for time series data generation. In this Thesis, we explore the usage and results of two such models, both based on Variational Autoencoders [Kingma and Welling, 2013].
1.1 Time Series data in different domains
Time series data is widely used in different fields and has become crucial in predicting and forecasting potential needs. Time series data can be utilized in predicting the risk of disease and providing people with medical help. Time series can also be used by governments and companies to make decisions on energy, climate, and finance. The growth is exponential, as the amount of massive data encourages people to explore various applications.

However, due to problems in the quality, quantity, and privacy of using real data, people usually do not use the original data when exploring applications in various domains. I give examples within several representative domains to show how these problems with real data drive the necessity for generated data.
1.1.1 Healthcare
Time series data is used in healthcare to make more accurate diagnoses:

• Time series data is used to effectively predict the blood glucose level of patients, which is critical for diabetes subjects [Bhimireddy et al., 2020].

• Wearable devices also collect large amounts of data and can provide suggestions to improve people's health. For example, [Sathyanarayana et al., 2016] uses wearable devices for sleep condition tracking. Data collected by wearable devices can include location, sound, and images, all of which are very sensitive and should only be stored on-device. Algorithms and methods [Bonawitz et al., 2017] [Jayaraman et al., 2018] have been developed to train machine learning models on these kinds of data without sending them to a centralized server.

Healthcare machine learning systems analyze highly sensitive data and impact critical decisions. Ensuring data quantity, quality, balance, and privacy is crucial.
1.1.2 Energy

• Time series data in this field can also be used to forecast the load of an entire district's heating system. [Gong et al., 2022] uses data from a District Heating System in Tianjin, China to perform the task.

Data in energy systems is hard to record, especially household appliances' energy consumption data, due to privacy concerns, cost, and the quality of data.
1.2 Importance of synthetic data
Despite the huge amount of money invested every year by institutions to collect time series data, real time-series data cannot always satisfy all the needed characteristics. Overall, there are time-series data-related issues in various domains:

• Quantity issue: In certain areas, the amount of data we can get is insufficient. Especially if the acquisition of such data requires people with specialized skills, for example in the healthcare domain, the amount of data we can get and the cost to get it will be a problem. If we can utilize synthetic data to supplement and enhance the existing actual data, then more applications can be built without requiring more real data than we currently possess.

• Quality issue: Quality issues with data are common. During the acquisition process, there are various factors that can cause inconsistency in the quality of data. For instance, a questionnaire can have missing values and outliers simply because people incorrectly filled in the questionnaires.

• Imbalance issue: Data imbalance in time series data is normal. This is due to the nature of real-world phenomena, where certain events or patterns naturally occur less frequently than others. Imbalance poses challenging problems when developing models. This problem can be mitigated by using synthetic data to supplement the niche parts of the dataset.

• Privacy issue: Privacy is always a problem when it comes to data acquisition. Data that contain sensitive information are usually strictly protected, and researchers often cannot get access to those data. Synthetic data can be synthesized to preserve the correlations without the sensitive information, thus mitigating or removing the privacy issues entirely.
Application of Synthetic Data. High-quality synthetic data can be utilized in important applications:

• Data Augmentation: Synthetic data can supplement limited real-world data to create larger and more diverse datasets for training machine learning models, leading to improved performance and generalization.

• Simulation: In domains where rare events (e.g., fraud, equipment failures) are needed but under-represented in historical data, synthetic data can help to simulate these events so we can train more robust models.

• Data Privacy: Models can be trained on synthetic data without exposing sensitive personal information, addressing privacy concerns in healthcare, finance, and other domains.

1.3 Ethical Consideration for Synthetic Data Generation
While synthetic data can mitigate privacy concerns in collecting users' data and enhance privacy preservation, there are still ethical and privacy concerns that need to be raised:

• While Synthetic Data Generation can be a good tool to preserve privacy, it still needs real data as input, and where and how that data is available needs to be taken into consideration.

• Synthetic data, while being mainly discussed as a tool to increase the amount of training data and enhance the quality of training data, can be made into an unethical tool to generate fake data and use them where real data and statistics are required. For example, unethical researchers can use fake data in research papers to fake the results of experiments.

There are always ethical implications when considering such generative tools. Researchers and practitioners need to keep in mind the importance of ethics in this field, such that no harm will be done with these tools.
1.4 Types of synthetic data generators
We can generally divide synthetic data generators into two groups:
• Modification-based methods: Methods in this group work with existing data points, try to modify the values to fix outliers, and use techniques to reduce privacy leaks. Since the data synthesized by this approach is constrained by the original data, these methods cannot increase the size of the dataset and also cannot provide sufficient data privacy protection.

• Generation-based methods: This category of methods tries to synthesize new data from some distributions. These distributions can either be hand-crafted or learned from real data, resulting in an arbitrary amount of data. Privacy protection methods can also be added to provide better privacy. Generating data from handcrafted distributions has been widely used, while generating data from learned distributions is an area of recent work.

The methods in the first category remain ineffective in protecting privacy, while methods in the second category offer solutions to address quantity, quality, and privacy issues through the use of synthetic data as a substitute for real data.
1.5 Existing generative methods
Great advances have been made in synthesizing time series data from distributions. Statistical and deep learning methods are used to learn the underlying distributions of real data and to create synthetic data by sampling from such distributions.

Statistical models often use a family of predefined distributions to fit a new time series dataset. For example, a Gaussian Mixture Model can be used by projecting the data to a higher-dimensional space and estimating the distribution [Eirola and Lendasse, 2013]. However, these models are limited by the available distributions. This distribution limitation is, however, less relevant to time series data, as statistical models tend to focus more on capturing temporal dependencies and patterns of the data.

Deep learning methods are in the other category. The success of deep neural networks in other domains has motivated their adoption for time series data. Deep generative models, with the likes of Variational Autoencoders (VAEs) [Kingma and Welling, 2013] and Generative Adversarial Networks (GANs) [Goodfellow et al., 2014], can learn complicated, high-dimensional probability distributions and produce high-quality samples of images or text.

In recent studies, many efforts have been made to combine deep generative models with different architectures like Transformers, the Fourier Transform, and classical RNNs or CNNs. These models can perform well and have proven themselves to outperform or be comparable to traditional statistical models on many metrics.
1.6 Chosen VAE-based methods for comparison
For this thesis, I chose two VAE-based models: TimeVAE [Desai et al., 2021] and TimeVQVAE [Lee et al., 2023]. These two methods represent the most recent advancements in VAE-based time series generation. They incorporate cutting-edge techniques that address specific challenges in time series generation, making them ideal candidates for comparative analysis.
1.7 Contribution of this Thesis
This thesis aims to provide an in-depth analysis and comparison between two advanced, state-of-the-art Time Series Generation methods that were built on Variational Autoencoders, including studies of each method's architecture, components, and the technical challenges that come with them.

This thesis could then be used as a guideline on Model Selection for Time Series Generation tasks and can serve as a baseline for further comparative studies with future methods.
1.8 Outline
Below is the structure of this work:
• Chapter 2: Related Work and Background. This beginning section commences with a concise review of relevant literature. Following this, essential background information is provided to familiarize readers with the concepts employed in this work. This encompasses an overview of synthetic data generation, along with brief introductions to artificial neural networks (ANNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), the transformer architecture, the Fourier transform, autoencoders, and variational autoencoders (VAEs).

• Chapter 3: VAE-based Time Series Generation Methods. This chapter examines various implementations of VAEs for data generation tasks, with a particular focus on the detailed architectures and implementations of TimeVAE and TimeVQVAE. We delve into how these models address the unique challenges inherent to time series data, along with their respective strengths and weaknesses.

• Chapter 4: Method for Comparison. This chapter contains the detailed method for comparison and discusses the advantages and disadvantages of the chosen method and framework; potential improvements to the comparison method will also be briefly discussed.

• Chapter 5: Experiments. This chapter discusses the experiments carried out to demonstrate the relative performance of TimeVAE and TimeVQVAE. Ablation studies of both methods will also be discussed and analyzed.

• Chapter 6: Future Works. This chapter contains the conclusion of this thesis, where drawbacks and potential improvements of this work are discussed.
2 Related Work and Background
To lay the groundwork for this thesis, we first explore the current landscape of time series generation methods, highlighting approaches similar to those examined in our research. We then define the time series generation problem, clarifying the objectives of this work. The chapter concludes with an overview of the theoretical foundations underpinning the architectures used in the models discussed.
2.1 Related Work
As mentioned in Chapter 1, data synthesis is being widely and actively developed with deep learning methods. There are two main approaches when trying to synthesize data: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

Both GANs and VAEs have had significant research focus on the reconstruction/generation of data. Images, audio, and tabular data have been the major fields of focus for such development. Some notable works that use GANs in the respective fields are Old Photos Restoration [Wan et al., 2020], WaveGAN [Donahue et al., 2018], and CTGAN [Xu et al., 2019]. The equivalents for VAE-based methods are IntroVAE [Huang et al., 2018], SadTalker (video and sound) [Zhang et al., 2023b], and TVAE [Xu et al., 2019].

Generative Adversarial Networks (GANs) are a family of generative models that learn to generate new data with distributions closely resembling those of the input data. GANs consist of two neural networks that operate as competing agents: the generator and the discriminator. The generator creates synthetic data, while the discriminator assesses their realism, providing feedback to the generator.

Since the focus of this thesis is on Time Series Generation, we will elaborate more on some of the GAN- and VAE-based models relevant to this domain.

[Yoon et al., 2019] proposed TimeGAN, using a new Stepwise Supervised Loss in addition to the standard unsupervised discriminator loss. This loss is calculated at each time step within the time series. By incorporating this loss, TimeGAN encourages the model to learn and adhere to the temporal dynamics present in the training data, ensuring that the synthesized time series accurately represents the relationships between variables over time. TimeGAN also introduces an embedding network that maps the high-dimensional time series into a lower-dimensional latent space. This embedding is then learned by the supervised and adversarial objectives of the model. This leads to better capturing of the underlying temporal dynamics, hence better generation and more representative synthetic time series data points.

[Lee et al., 2023] proposed TimeVQVAE, where they utilized Vector Quantization, compressing data into a discrete latent space, with a bi-directional transformer to learn the prior. Another implementation is TimeVAE [Desai et al., 2021], where they utilized an architecture with interpretable components like level, trend, and seasonality.

Both time series-focused methods mentioned above will be the subjects of this thesis's study.
2.2 Synthetic data generation

Time series data can be broadly categorized into two main types: univariate and multivariate.

• Univariate time series: A type of time series that is only made up of one variable. For example, daily closing stock prices, hourly temperature readings, and daily sales of goods.

• Multivariate time series: Opposite to univariate, a multivariate time series involves multiple variables, where these variables may be interrelated. For example, weather data, economic indicators, and electrocardiograms (ECG).

There are other sub-categories related to time series, but in this thesis, I will focus on the simplest cases, univariate and multivariate time series.
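As a minimal illustration of the two data types (not part of the thesis's pipeline), the NumPy sketch below builds a univariate series of shape (T,) and a multivariate series of shape (T, D); the time-steps-first shape convention and the signal choices are illustrative assumptions.

```python
import numpy as np

# Univariate series: a single variable observed over T time steps, shape (T,)
T = 24
univariate = np.sin(np.linspace(0, 2 * np.pi, T))   # e.g. an hourly temperature-like signal

# Multivariate series: D interrelated variables over the same T time steps, shape (T, D)
multivariate = np.column_stack([
    np.sin(np.linspace(0, 2 * np.pi, T)),            # variable 1
    np.cos(np.linspace(0, 2 * np.pi, T)),            # variable 2
    np.linspace(0.0, 1.0, T),                        # variable 3 (a simple trend)
])

print(univariate.shape)    # (24,)
print(multivariate.shape)  # (24, 3)
```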
2.3 Artificial Neural Networks
Artificial Neural Networks (ANNs) are computing models that drew inspiration from biological neural architecture [Jain et al., 1996]. ANNs were first described in the 1940s [McCulloch and Pitts, 1943], and have continuously received interest.

The primary objective of ANNs is to "learn" to perform a specific task. This learning process typically involves training the networks through multiple iterations, often using gradient-based optimization methods like backpropagation. The objective is to minimize the discrepancy between the network's predicted outcomes and the target values in a given dataset. By doing so, the network can generalize its learned patterns to effectively process unseen data. In other words, during training, the networks learn from labeled training data and iteratively update their parameters to minimize a loss function. A network is called a Deep Neural Network if it has two or more layers.
A feed-forward network can be described as a series of transformations:
a_j = \sum_{i=1}^{D} w^{(1)}_{j,i} x_i + w^{(1)}_{j,0}, \quad \forall j = 1, \dots, M \qquad (2.1)
Equation (2.1) represents the computation within a neuron in a layer of a feed-forward network. M linear transformations are performed on the input vector x = (x_1, x_2, ..., x_D), creating a new vector a with M elements; the number of neurons in the hidden layer is denoted as M.

The network then reaches a hidden layer z = (z_1, z_2, ..., z_M). Non-linearity is then introduced with z_j = h(a_j), with h as an activation function (sigmoid, tanh, ...). That is one iteration; the number of iterations depends on the number of hidden layers in the network.

In the output layer, the hidden layer values z are used to create K linear combinations, where K denotes the dimension of the output vector y = (y_1, y_2, ..., y_K). Equation (2.2) below represents it:
a_k = \sum_{j=1}^{M} w^{(2)}_{k,j} z_j + w^{(2)}_{k,0}, \quad \forall k = 1, \dots, K \qquad (2.2)
The prediction would be ŷ = a, and the loss function (for example, Mean Squared Error) is L = \sum_{i=1}^{K} (y_i - \hat{y}_i)^2. An example of a feed-forward, deep neural network, extracted from [Bishop, 2006], can be seen in Figure 2.1.
Figure 2.1: Structure of a feed-forward neural network
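To make Equations (2.1) and (2.2) concrete, the following is a minimal sketch of a one-hidden-layer feed-forward network in NumPy; the dimensions D, M, K, the tanh activation, and the random weights are illustrative assumptions, not values used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 4, 8, 2                      # input, hidden, and output dimensions (illustrative)

W1 = rng.normal(size=(M, D))           # weights w_{j,i}^{(1)}
b1 = np.zeros(M)                       # biases  w_{j,0}^{(1)}
W2 = rng.normal(size=(K, M))           # weights w_{k,j}^{(2)}
b2 = np.zeros(K)                       # biases  w_{k,0}^{(2)}

def forward(x):
    a = W1 @ x + b1                    # Eq. (2.1): pre-activations of the hidden layer
    z = np.tanh(a)                     # non-linearity z_j = h(a_j)
    y_hat = W2 @ z + b2                # Eq. (2.2): output-layer linear combinations
    return y_hat

x = rng.normal(size=D)                 # one input vector x = (x_1, ..., x_D)
y = rng.normal(size=K)                 # target vector
y_hat = forward(x)
loss = np.sum((y - y_hat) ** 2)        # squared-error loss L = sum_i (y_i - y_hat_i)^2
print(y_hat, loss)
```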
2.3.1 Recurrent Neural Networks
Recurrent Neural Networks (RNNs), a class of deep learning models, employ recurrent connections, allowing them to model the temporal dynamics inherent in sequential data. They have been used to solve problems involving sequential data such as image captioning, speech synthesis, NLP, and time series prediction.

RNNs are neural networks equipped with hidden states. Hidden states are different from the hidden layers that we mentioned above in Section 2.3. Hidden layers refer to the intermediate layers within a neural network architecture that are not directly exposed to input or output. Conversely, hidden states are internal variables within recurrent neural networks that store information from prior time steps and contribute to the computation at each subsequent step. Let us be more detailed about how the hidden states are calculated.
Given a minibatch of inputs X_t ∈ R^{n×d} at time step t, each row of X_t equals one example at time step t from the sequence. Denote H_t ∈ R^{n×h} as the hidden layer output of time step t. Here we save the hidden layer output H_{t-1} from the previous time step and introduce a new weight parameter W_{hh} ∈ R^{h×h} to guide the network in how to use the hidden layer output of the previous time step in the current time step. The current hidden layer output is then computed by combining the input of the current time step with the hidden layer output from the previous time step:

H_t = ϕ(X_t W_{xh} + H_{t-1} W_{hh} + b_h) \qquad (2.3)
At a time step t, the hidden state is computed as follows:
• The input X_t of the current time step t is concatenated with the hidden state H_{t-1} at the previous time step t − 1.

• After that, the result is fed into a fully connected layer with activation function ϕ. The output of the fully connected layer is the current hidden state H_t.

• The current hidden state H_t will be included to compute the next hidden state H_{t+1}.

• H_t will also be input into a fully connected layer to compute the current output O_t.

A visual representation of the process, extracted from [Zhang et al., 2023a], can be seen in Figure 2.2.
Figure 2.2: An RNN with hidden state
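As a minimal sketch of the recurrence in Equation (2.3), the NumPy snippet below updates the hidden state over a short sequence; the dimensions n, d, h, the sequence length, the tanh activation, and the random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, T = 2, 3, 5, 4                # batch size, input dim, hidden dim, sequence length (illustrative)

W_xh = rng.normal(size=(d, h))         # input-to-hidden weights
W_hh = rng.normal(size=(h, h))         # hidden-to-hidden weights
b_h = np.zeros(h)                      # hidden bias

H = np.zeros((n, h))                   # initial hidden state H_0
for t in range(T):
    X_t = rng.normal(size=(n, d))      # minibatch of inputs at time step t
    # Eq. (2.3): H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h)
    H = np.tanh(X_t @ W_xh + H @ W_hh + b_h)

print(H.shape)                         # (n, h): hidden state after the last time step
```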
2.3.2 Convolutional Neural Network
Convolutional Neural Networks (CNNs) have found tremendous success in Computer Vision. CNNs are known for their computational efficiency because they (i) require fewer parameters than fully connected layers and (ii) use convolutions that are easier to parallelize across GPU cores; (iii) CNNs also excel at finding different patterns in an image, thanks to their shift-invariant property. Due to their computational advantages, CNNs have found applications beyond computer vision, extending their utility to tasks involving one-dimensional sequence data, such as audio [Abdel-Hamid et al., 2014], text [Kalchbrenner et al., 2014], and time series analysis [LeCun et al., 1995].
Similar to ANNs, CNNs are typically composed of an input layer, one or more hidden layers, and an output layer. CNNs usually consist of (but are not limited to) three types of layers: Convolutional, Pooling, and Fully connected. These three combined make the building block of a CNN, a Convolutional Block. The basic operations in each block, as mentioned, are a convolutional layer, an activation function, and a subsequent pooling operation.
The convolution layer can be more accurately described as a cross-correlation operation. Within a convolutional layer, the cross-correlation operation is applied between the input tensor and the kernel tensor, resulting in the generation of an output tensor. Take an example of a two-dimensional tensor with shape 3 × 3 and a kernel with shape 2 × 2; the operation and its output are shown in Figure 2.3.

Figure 2.3: Two-dimensional cross-correlation operation. The output is calculated as 0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19
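A minimal NumPy sketch of the two-dimensional cross-correlation in Figure 2.3; the 3 × 3 input and 2 × 2 kernel values are taken from the figure, and the helper name corr2d is only illustrative.

```python
import numpy as np

def corr2d(X, K):
    """Two-dimensional cross-correlation: slide kernel K over input X and sum elementwise products."""
    h, w = K.shape
    out = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return out

X = np.arange(9, dtype=float).reshape(3, 3)   # 3x3 input: [[0,1,2],[3,4,5],[6,7,8]]
K = np.arange(4, dtype=float).reshape(2, 2)   # 2x2 kernel: [[0,1],[2,3]]
print(corr2d(X, K))                            # top-left entry: 0*0 + 1*1 + 3*2 + 4*3 = 19
```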
Finally, pooling layers are utilized to make the convolutional layers less sensitive to the location of features within an image and also to downsample the feature maps, which helps to control overfitting.
A visual representation of LeNet [Lecun et al., 1998], a convolutional neural network, extracted from [Zhang et al., 2023a], is shown in Figure 2.4.

Figure 2.4: Given a handwritten digit image, LeNet performs a series of computations to classify it into one of 10 categories, yielding a probability for each category as its output
2.4 Transformers
The Transformer architecture [Vaswani et al., 2017] has been dominating the field of natural language processing since its release. There are many influential, pre-trained models for NLP, for example BERT [Devlin et al., 2019] and RoBERTa [Liu et al., 2019]. In the last two years, we have also seen the boom of Large Language Models (LLMs), centered around OpenAI's GPT lineup (GPT stands for Generative Pre-training Transformer) with GPT-2 [Radford et al., 2019] and GPT-3 [Brown et al., 2020].
To understand the Transformer architecture, we first have to go through the attention mechanism. [Bahdanau et al., 2014] proposed the attention mechanism to address the limitation of fixed-length encoding vectors, which hindered the decoder's ability to access relevant input information. The core idea is that instead of considering all parts of the input equally at every stage, the decoder should select which part it needs to focus on at a particular decoding step.

The attention mechanism, the essence of the Transformer architecture, is based on the concepts of queries, keys, and values. They are three vectors that work together to help the model focus on the most relevant parts of the sequence. A compatibility score is calculated between each query and key; this score is then transformed into a weight, and the higher the score, the higher the weight. This operation is referred to as attention pooling. Figure 2.5, extracted from [Zhang et al., 2023a], demonstrates how the attention mechanism works with keys, queries, and values.
Figure 2.5: The attention mechanism calculates a weighted combination of values v via attention pooling; these weights are determined by how well each query q matches or aligns with the corresponding keys ki
Another important concept regarding the attention mechanism is multi-head attention. This was created to enable the use of multiple representation subspaces of queries, keys, and values within the attention mechanism. Instead of having a single set of queries, keys, and values, multi-head attention splits them into multiple "heads". Each head operates independently, essentially creating multiple parallel attention mechanisms. Figure 2.6, extracted from [Zhang et al., 2023a], demonstrates multi-head attention, where multiple heads are merged and then go through a linear transformation.
Self-attention, as the name suggests, applies the attention mechanism within a single sequence, measuring the similarity between elements of that sequence. For example, given a sequence of tokens, applying self-attention to that sequence would result in each token having its own query, key, and value. By doing this, we have representations of words that take into account their surrounding contexts, thus achieving better language understanding.
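The sketch below is a minimal NumPy implementation of single-head (scaled) dot-product self-attention over a toy sequence; the dimensions and the random projection matrices are illustrative assumptions, not the exact formulation of any model studied in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4                  # sequence length, model dim, query/key/value dim (illustrative)

X = rng.normal(size=(T, d_model))          # one sequence of T token embeddings
W_q = rng.normal(size=(d_model, d_k))      # projection matrices for queries, keys, values
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each token gets its own query, key, and value

scores = Q @ K.T / np.sqrt(d_k)            # compatibility score between every query and every key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: higher score -> higher weight

output = weights @ V                       # attention pooling: weighted combination of values
print(output.shape)                        # (T, d_k): a context-aware representation per token
```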
Positional Encoding is also an important concept; it is usually presented as an additional input attached to each sequence. Since self-attention does not preserve the order of tokens in a sequence (it discards order in favor of parallel computation), a "positional signal" needs to be implemented to preserve the order of input tokens.
The Transformer model is an instance of the encoder-decoder architecture, as presented in Figure 2.7, extracted from [Zhang et al., 2023a].
Figure 2.6: Multi-head attention
Figure 2.7: The Transformer architecture
The Transformer encoder is built by stacking multiple layers with the same structure; each layer performs two distinct operations through its two sublayers: multi-head self-attention pooling and a positionwise feed-forward network. In the encoder self-attention, queries, keys, and values come from the previous encoder layer. The two sublayers are connected by a residual connection; the input x and the output of a sublayer, sublayer(x), must have the same dimensionality so that the residual connection x + sublayer(x) is possible. After the residual connection, a layer normalization [Ba et al., 2016] follows. The result of the encoder is a vector representation of each token of the input sequence.
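As a minimal NumPy sketch of the residual-connection-plus-layer-normalization pattern described above; the sublayer here is only a placeholder positionwise feed-forward map, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 5, 8                          # sequence length and model dimension (illustrative)

W1 = rng.normal(size=(d_model, 16))
W2 = rng.normal(size=(16, d_model))

def sublayer(x):
    """Placeholder positionwise feed-forward sublayer; output keeps the same dimensionality d_model."""
    return np.maximum(x @ W1, 0.0) @ W2

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

x = rng.normal(size=(T, d_model))
out = layer_norm(x + sublayer(x))          # residual connection x + sublayer(x), then layer normalization
print(out.shape)                           # (T, d_model)
```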
The decoder also has stacked layers with residual connections and layer normalizations. The sublayers are the same as those of the encoder, with the addition of a masked attention layer. The masked attention forces the decoder to be autoregressive, producing one token at a time. The predicted token is then appended to the input sequence, and the decoder uses masked attention to predict the next output token; this process continues iteratively until an "end" token is generated.
2.5 Fourier Transform
The Fourier Transform is widely used in various fields including physics, engineering, and mathematics. It is an integral transform that takes a function of time x(t) to a function of frequency X(ω). There are two equations, the Forward Fourier Transform and the Inverse Fourier Transform; they also have different names, the Analysis Equation and the Synthesis Equation, respectively. The Forward Fourier Transform maps a signal from the time domain to its frequency spectrum; on the other hand, the Inverse Fourier Transform takes the frequency spectrum of a signal and reconstructs the original signal.
The Forward Fourier Transform, or Analysis Equation, is defined as follows:

F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt

Where:

• F(ω) is the Fourier transform of f(t).
• ω is the angular frequency.
• i represents the imaginary unit (√−1).
• e^{−iωt} is a complex exponential function.
The Inverse Fourier Transform, or Synthesis Equation, is defined as follows:

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{i\omega t}\, d\omega

Where:

• f(t) is the reconstructed signal.
• F(ω) is the Fourier Transform of the signal in the frequency domain.
• ω represents the frequency.
• i represents the imaginary unit (√−1).
• 1/(2π) is a scaling factor for proper scaling when inverting the frequency.
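For discretely sampled signals, the discrete analogue of this transform pair is the DFT/IDFT; the NumPy round-trip sketch below is only an illustration, with an arbitrary test signal and sampling rate.

```python
import numpy as np

fs = 100.0                                   # sampling rate in Hz (illustrative)
t = np.arange(0, 1, 1 / fs)                  # one second of samples
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)   # 5 Hz + 20 Hz components

X = np.fft.fft(x)                            # forward transform: time domain -> frequency spectrum
freqs = np.fft.fftfreq(len(x), d=1 / fs)     # frequency (in Hz) associated with each bin

x_rec = np.fft.ifft(X).real                  # inverse transform: reconstruct the original signal
print(np.allclose(x, x_rec))                 # True: the round trip recovers the signal
```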
The standard Fourier Transform works well when the frequency content of a signal remains stationary. However, many real-world signals, like music or speech, have frequency components that change over time, and the standard Fourier Transform struggles to capture this.

2.5.1 Short-Time Fourier Transform
The Short-Time Fourier Transform (STFT) is a modified Fourier Transform with the goal of better capturing non-stationary frequency content. It works by dividing a long signal into shorter, overlapping segments (windows). Then, the Fourier Transform is applied to each individual segment; this gives us a series of frequency spectra, each representing a short snapshot of the signal's frequency content at a particular time.
The mathematical representation of the Short-Time Fourier Transform is as follows:

X(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-i\omega t}\, dt

Where:

• X(τ, ω) is the STFT of the signal x(t).
• τ is the window position.
• ω is the angular frequency.
• x(t) is the input signal in the time domain.
• w(t) is the window function applied to the signal (e.g., Hamming, Hanning, Gaussian).

Unlike the Inverse Fourier Transform, the Inverse Short-Time Fourier Transform (ISTFT) is not as straightforward, as it involves different windows; solely applying the inverse Fourier Transform on each segment would introduce artifacts. To address this, the ISTFT involves an overlap-add (OLA) procedure.

In the OLA procedure, each segment is first converted back to the time domain by applying the inverse Fourier Transform. Since the windows overlap, the overlapping portions are then added together to reconstruct the original signal. A window normalization step is then involved to compensate for the windowing effects, making sure the reconstructed signal has the correct amplitude.
The mathematical representation of the Inverse Short-Time Fourier Transform is as follows:
x(t) \approx \sum_{\tau} \mathrm{ISTFT}(X(\tau, \omega)) * w(t - \tau)
Where:
• x(t) is the reconstructed time-domain signal
• ISTFT(X(τ, ω)) is the inverse Fourier Transform applied to the STFT coefficients X(τ, ω)
at a given time frame τ
• w(t − τ) is the window function shifted to the appropriate time frame τ
• The asterisk ∗ is the convolution operation, an integral part of the overlap-add procedure
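In practice, libraries implement the STFT and its overlap-add inverse directly; the sketch below uses scipy.signal with an arbitrary test signal and window length, purely as an illustration of the round trip.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 1000.0                                     # sampling rate in Hz (illustrative)
t = np.arange(0, 1, 1 / fs)
x = np.where(t < 0.5,                           # non-stationary signal: 50 Hz first, then 120 Hz
             np.sin(2 * np.pi * 50 * t),
             np.sin(2 * np.pi * 120 * t))

# Forward STFT: split the signal into overlapping Hann windows and transform each segment
f, tau, Zxx = stft(x, fs=fs, window='hann', nperseg=128)

# Inverse STFT: overlap-add the windowed segments to reconstruct the time-domain signal
_, x_rec = istft(Zxx, fs=fs, window='hann', nperseg=128)

print(np.allclose(x, x_rec[:len(x)]))           # True (up to numerical precision)
```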
2.6 Autoencoders
The Autoencoder, introduced by [Rumelhart et al., 1986], is an unsupervised deep learning algorithm. An Autoencoder has two functions: an encoding function that encodes the data into a latent representation, and a decoding function that tries to recreate the data using the encoded latent representation. The latent layer has a smaller dimension than the input.

The Autoencoder's most widely used architecture can be understood as two symmetrical neural networks acting as encoder and decoder, respectively. The encoder maps input data x into the latent representation z; due to its smaller dimensions, z is also called the bottleneck layer. The decoder will then take this latent representation z and try to produce an output x̂.

The autoencoder's objective is to minimize the difference between the output x̂ and the input x; this difference is called the reconstruction loss. Both encoder and decoder are trained jointly to minimize this loss.

Both the encoder and the decoder in Autoencoders can have a single layer, but commonly we see two or more layers in this network, incorporating deep learning techniques into the network. In this case, the Autoencoder is called a Deep Autoencoder.
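A minimal sketch of this encoder-bottleneck-decoder structure and its reconstruction loss, written here with PyTorch; the layer sizes, activation choices, and training loop details are illustrative assumptions rather than the configuration of any model studied later.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=32, latent_dim=4):
        super().__init__()
        # Encoder: maps the input x to the lower-dimensional latent representation z (the bottleneck)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU(), nn.Linear(16, latent_dim))
        # Decoder: maps z back to a reconstruction x_hat with the same dimensionality as x
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # reconstruction loss: difference between x_hat and x

x = torch.randn(64, 32)                     # a toy batch of 64 inputs
for _ in range(100):                        # encoder and decoder are trained jointly
    x_hat = model(x)
    loss = loss_fn(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```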
Commonly, there are four main types of autoencoders: sparse autoencoders, denoising autoencoders, contractive autoencoders, and variational autoencoders.

Sparse autoencoders [Makhzani and Frey, 2014] incorporate a sparsity constraint, encouraging the hidden layer neurons to be mostly inactive. The network has to minimize both the reconstruction loss and the sparsity constraint, hence forcing the autoencoder to only learn the most important features of the input data while keeping the hidden layer representations sparse.

Denoising Autoencoders [Vincent et al., 2010] are trained to reconstruct clean data from noisy or corrupted inputs, effectively learning to remove noise. The idea behind this is forcing the autoencoder to learn the key features of the data that are essential for reconstruction. The Denoising Autoencoder, as its name implies, is robust to noisy input and can be effective in denoising noisy or corrupted data.
Figure 2.8: Structure of an Autoencoder, extracted from [Wikipedia contributors, 2024]
2.7 Variational Autoencoders
The Variational Autoencoder (VAE) is a variant of the Autoencoder that uses a variational approach to learn the latent space. VAEs have found great success in image-generation tasks, but in this thesis I will focus on the application of VAEs to time-series generation.
2.7.1 Formulation
Variational Autoencoders are generative models with a probabilistic nature. The latent space is modeled with a probability distribution, and the output is drawn from the latent distribution. The VAE model can be described from a graphical model perspective, as follows.

The latent variable z_i is sampled from a distribution p(z), whereas data points x_i are sampled from a conditional probability p(x|z).
Figure 2.9: Graphical model representation of VAE. Given N observed data points {x_i}, each data point is locally generated by a latent random variable z_i; θ is a global parameter, and is obtained through training
Both the prior and the likelihood are usually assumed to be Gaussian distributions:

p(z) = \mathcal{N}(z;\, 0,\, I)

and

p(x|z) = \mathcal{N}(x;\, f(z, \theta),\, \sigma^2 \cdot I)

where f(z, θ) represents the decoder, often a neural network.
VAEs aim to find the most accurate hidden representation z given the observed data points. This is achieved by computing the posterior probability p(z|x). According to Bayes' rule:

p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}

Since the evidence p(x) in the denominator is generally intractable, VAEs approximate the true posterior with a variational distribution. The variational distribution from Eq. 2.12 introduces a new loss; the loss is the distance between the variational distribution and the true posterior distribution, and is measured by the Kullback-Leibler (KL) Divergence.
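As a minimal sketch of how this formulation is typically trained in practice, the PyTorch snippet below combines a Gaussian reconstruction term with the closed-form KL term between a diagonal-Gaussian encoder and the standard-normal prior; the layer sizes, the reparameterization step, and the loss weighting are standard illustrative choices rather than the exact configuration of the models studied later in this thesis.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=32, latent_dim=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU())
        self.mu = nn.Linear(16, latent_dim)        # mean of q(z|x)
        self.logvar = nn.Linear(16, latent_dim)    # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, input_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                 # reparameterization: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x_hat - x) ** 2).sum(dim=1)          # Gaussian reconstruction term (up to constants)
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form for a diagonal Gaussian
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)
    return (recon + kl).mean()

model = VAE()
x = torch.randn(64, 32)                            # toy batch
x_hat, mu, logvar = model(x)
print(vae_loss(x, x_hat, mu, logvar))
```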