Machine Learning
Machine learning aims to create algorithms that learn from data, addressing the limitations of traditional methods that rely on systematic calculations tailored to specific problems. Given that experimental or observational data is often noisy and high-dimensional, developing precise algorithms can be impractical. Machine learning provides a solution by enabling computers to identify patterns and learn autonomously, without the need for explicit programming.
The effectiveness of a machine learning model is determined by its ability to learn and generalize effectively. Generalization refers to the model's capacity to accurately predict outcomes for new, unseen data. Striking a balance between learning capability and generalization is crucial for successful machine learning models. In parametric models, enhancing learning capability typically improves performance on training data; however, this often results in the model memorizing specific examples rather than truly learning from them. Consequently, this can lead to a decline in generalization, negatively impacting performance on unseen data.
Unsupervised learning is a data analysis approach where the model is trained on unlabeled examples, meaning the training data is unordered and lacks prior knowledge of its underlying structure. The primary goal of this method is to identify common patterns within the data, enabling the model to effectively differentiate between individual examples.
Supervised learning involves a problem statement where the model receives labeled data for each observation. In classification tasks, the objective is to differentiate between various groups of labels rather than individual examples. Another aspect of supervised learning is regression. Typically, supervised classification problems are addressed using logistic regression models, as they provide probability outputs that can be interpreted as the likelihood of class membership.
Several mathematical methods, such as Support Vector Machines [15] or Random Forests [16,17], exist for supervised classification. However, these models often struggle with high levels of abstraction. Typically, it is necessary to create an algorithm that effectively extracts relevant features from the raw data before these models can successfully learn from it [18,19].
Image classification has emerged as a prominent machine-learning challenge in recent years, necessitating extensive manual analysis to create effective image descriptors. Due to the inherent noise and variability in images, universal descriptors are often ineffective, making feature construction highly specific to individual problems. This issue is compounded when manual classification is not feasible, leaving uncertainty about whether labels can be accurately inferred from the data.
A different approach is adopted by starting with a basic representation of the data, characterized by general and imprecise features. The goal is to utilize these low-level features to develop a model that analyzes multiple examples to identify similarities and patterns, mimicking the human brain's ability to learn abstractions. Although constructing such a model poses challenges, once established, it can be applied to a wide range of problems with ample training examples. The advent of the deep learning era has led to the introduction of models capable of achieving high levels of abstraction.
Deep Learning
In recent years, deep learning has gained significant attention in artificial intelligence, serving as a method for analyzing information through a layered structure. This approach aims to achieve higher levels of abstraction by incorporating additional layers into the model, which allows for the organization of information at varying levels of detail within a hierarchical framework.
This approach has proven to be highly successful, establishing itself as a fundamental component of machine learning. It is recognized as the gold standard for addressing challenges in handwriting analysis, image classification, and speech recognition.
Deep learning reduces the necessity for manually crafting specific features to describe observations. Instead, it focuses on developing models capable of autonomously generating essential features from basic representations or even raw inputs, such as images or audio signals. By starting with low-level data representations, the algorithm learns to create layers of features that become increasingly meaningful until the solution to the problem becomes obvious. Essentially, building features can be interpreted as a way of making the data more abstract.
Artificial Neural Networks are ideal for layered representation learning, which is the foundation of deep learning that utilizes multiple layers of neural networks. The first deep-layered feed-forward networks were introduced in the mid-2000s by G. Hinton in Toronto and Y. Bengio in Montreal, employing different methodologies. These models initially train an unsupervised model to create increasingly abstract representations of data by stacking hidden layers, with each layer trained individually to optimize the representation of the layer below. After this pre-training phase, labels are added for supervised classification on the final layer, followed by fine-tuning the model's parameters to better align with the specific labels rather than just a general data representation.
The advent of unsupervised pre-training techniques has facilitated deeper exploration into the traits of stable deep networks. Research indicates that the stability and generalization capabilities of a network are significantly influenced by the range of its parameters.
Recent research indicates that neural networks can be effectively trained without the need for a pre-training phase, provided that the network parameters are initialized correctly. This advancement enables the direct training of supervised deep-layered models.
Although deep neural networks dominate machine learning competitions, their application in bioinformatics remains limited, despite the field's wealth of untapped datasets. Advanced machine learning techniques, particularly in representation learning, hold significant promise for addressing numerous unresolved biological challenges. One area in medicine that shows considerable potential is stem cell research.
High-throughput time-lapse microscopy of murine stem cells
Stem cell research offers significant potential for developing therapies for serious diseases, including cancer, cardiovascular conditions, and dementia. Defined as biological cells with the ability to self-renew and differentiate, stem cells can generate daughter cells that retain their original characteristics while also forming specialized cell types.
Our research centers on hematopoietic stem cells (HSCs), which are essential blood stem cells located in adult bone marrow. These cells play a crucial role in regenerating blood and immune system cells. Investigations into HSCs have led to significant clinical advancements in treating leukemia and various immune system disorders.
The different classes of hematopoietic stem cells can be categorized by their potency and structured in a tree [42,45] (see Figure 1-1). At the root of the hierarchy are the hematopoietic stem cells (HSCs), which possess self-renewal capabilities and can differentiate into hematopoietic progenitor cells (HPCs) with limited division potential. Our research emphasizes the progeny of common myeloid progenitor (CMP) cells, specifically megakaryocyte-erythroid progenitors (MEPs) and granulocyte-macrophage progenitors (GMPs). MEP cells are responsible for the formation of red blood cells and megakaryocytes, whereas GMP cells differentiate into granulocytes and macrophages.
Figure 1-1: Hierarchy of hematopoietic stem cells, adapted from [45]
Currently, our understanding of the molecular mechanisms underlying the differentiation of hematopoietic progenitor cells is limited. Analyzing these processes, particularly gene regulation and expression profiles, necessitates the lysis and destruction of the cells, complicating the study. Additionally, the precise timing of differentiation remains unknown, posing further challenges to research in this area.
Analyzing data at the right moment is crucial, as conducting the analysis either too early or too late can result in the loss of valuable insights regarding the factors that led to the observed differentiation.
Non-invasive methods are essential for the continuous long-term study of cells. Live-cell imaging of in-vitro experiments presents a promising approach for this purpose. To avoid altering or killing the cells, microscopic images must be captured at short intervals of several minutes. This process involves taking repeated images over several days, ultimately creating a long-term microscopy movie that showcases the behavior of the cells.
Live-cell imaging utilizes markers such as fluorescent proteins encoded in the cell's genetic sequence or antibodies that target specific cellular molecules. Fluorescence levels are assessed by imaging at specific fluorescent wavelengths. Notably, biological markers associated with the differentiation of MEP and GMP facilitate effective cell labeling.
Figure 1-2: Example of a bright field image from long-term microscopy [53]. The cells are the black objects with the white halo in the images; the other small particles are dirt.
Analysis of tracked genealogies reveals that siblings typically align with either the MEP or the GMP lineage, as illustrated in Figure 1-3, where the left subtree contains only MEPs and the right subtree exclusively GMPs. This indicates that the decision whether a cell develops into a GMP or MEP occurs several generations before the biological markers become active. This raises the question whether there is a way to distinguish the cells before the onset of the currently used markers.
Figure 1-3: Labeled tracking tree of HSC differentiation to MEP or GMP [54]
Each branch of the tree represents a lineage made up of multiple cell cycles. When a cell divides, the branch splits as well. The green lines indicate cell cycles with an active MEP marker, while the blue lines denote cell cycles with an active GMP marker.
Bright field images and the expression levels of various biological markers represent two untapped sources of information for labeling. Typically, bright field images are utilized solely for tracking; however, preliminary studies have indicated that the morphological features of cells can help differentiate between MEPs and GMPs. Our objective is to explore these concepts further through innovative deep learning techniques.
Problem Statement
This thesis aims to employ deep learning techniques to differentiate between MEP and GMP cells. For training data, sequences of bright field images are utilized, capturing each cell over one cycle. Cells are manually labeled based on whether the fluorescent marker's intensity exceeds a specified threshold, addressing the challenge of supervised binary time series classification.
We aim to tackle this issue by employing deep recurrent neural networks, which have made significant strides in recent years. However, processing raw images for time series data remains challenging. To overcome this, we utilize features that capture the shape and texture of segmented cell images. These models must attain a high level of abstraction to effectively identify patterns and similarities over time.
We aim to extend our classifier's application to additional time-lapse movies to predict labels for unknown cells. Given that the tracking tree indicates that lineage decisions occur before the activation of biological markers, we hypothesize that our method could effectively identify cells at earlier developmental stages.
Structure of the thesis
In the initial chapters, we provide an overview of the techniques employed, beginning with basic Artificial Neural Networks and Feedforward models designed for one-dimensional input. We then delve into the structure and challenges of Recurrent Neural Networks, offering solutions to these issues. A critical aspect of neural networks is the training process, where we examine two classic optimization techniques and various regularization methods to mitigate overfitting during learning. Finally, we present and analyze the results of our studies, concluding with suggestions for future research directions.
2 Introduction to deep neural networks
Feedforward Neural Networks
Perceptron
The perceptron is a fundamental component of artificial neural networks (ANNs) and is often regarded as the first generation of neural networks. This algorithm is designed for supervised binary classification tasks.
Like other supervised classifiers, the perceptron learns the relationship between data and target outcomes from training examples. It operates by predicting a binary output based on the input data and then comparing this prediction to the actual class label. Through iterative adjustments of the model's free parameters, the difference between predicted and actual targets on the training set is minimized, enhancing classification accuracy.
In a perceptron, the features of an observation serve as input signals, each assigned a weight that indicates its influence on the output. These weighted signals are summed together, with an additional bias introduced to shift the linear combination. The resulting value is then processed through a non-linear activation function, specifically the sigmoid function, which transforms it into a value between 0 and 1 that can be interpreted as the probability of the positive class.
The complete computation of the output can be described with the following equation:

$$y = \sigma(w^T x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

The primary objective of a classifier is to reduce prediction errors by comparing its output with the expected target for each data-target pair. To measure these errors, the zero-one loss function can be used:

$$\ell_{0,1}(\theta, D) = \sum_{(x, t) \in D} \mathbb{1}_{\{f_\theta(x) \neq t\}}$$

where $\theta = \{w, b\}$ are the free parameters and D is the set of training examples.
The zero-one loss function is not differentiable, making it costly to optimize for large models and numerous data pairs. Therefore, the negative log likelihood is employed as an alternative loss function:

$$\mathrm{NLL}(\theta, D) = - \sum_{(x, t) \in D} \log P(Y = t \mid x, \theta)$$
This loss function is iteratively minimized by calculating the gradient of the loss and updating the free parameters accordingly. Optimization techniques are described in the following chapters.
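To make the above concrete, the following is a minimal NumPy sketch of a perceptron forward pass and its negative log likelihood; the function and variable names are our own illustration, not the implementation used in this thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(x, w, b):
    # weighted sum of the input features plus bias, squashed by the sigmoid
    return sigmoid(np.dot(w, x) + b)

def negative_log_likelihood(X, t, w, b):
    # X: (n_examples, n_features), t: binary targets in {0, 1}
    p = sigmoid(X @ w + b)            # predicted probability of class 1 for every example
    return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))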
The perceptron is effective for binary classification tasks, but addressing multi-class classification is more complex. This requires the use of multiple perceptrons, each designed to determine a yes or no response for individual classes. Essentially, this approach mirrors a perceptron with several output units, where the output nodes operate independently.
To obtain a probability value for each class, the softmax function is applied to the final output units of the network:

$$P(Y = i \mid x, W, b) = \mathrm{softmax}_i(W x + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}$$
Interestingly, the loss function stays the same even for the multiclass case.
Limitations of a Perceptron
If the right features are chosen by hand, a perceptron can learn almost anything. However, when using the wrong features, it can be very limited.
In binary classification with two binary features, positive examples are represented by (1,1) and (0,0), while negative examples are (0,1) and (1,0). The classification rule checks for equality: if both features are equal (both one or both zero), the observation is classified as positive; if the features differ, the classification is negative.
Figure 2-3: Limitation of a perceptron [54]. The four data points of two different classes cannot be separated linearly.
A perceptron cannot solve this problem, as illustrated in Figure 2-3, where its weight plane appears as a single line that fails to separate the four data points. To address this issue, one solution is to stack multiple perceptrons, allowing for the separation of the data using two linear lines.
In cases where data points require non-linear separation, using stacked perceptrons with linear units proves insufficient, as they can only achieve linear data separation. To effectively address this challenge, a combination of non-linear functions and multiple layers of perceptrons is essential, which will be explored in the following chapter.
Multilayer Perceptron
The Multilayer Perceptron (MLP) architecture involves stacking perceptrons in layers to enhance data modeling capabilities. While a single perceptron is limited in handling data, particularly when it comes to non-linearly separable inputs, the MLP effectively addresses this issue by combining multiple perceptrons, as illustrated in Figure 2-4. With enough perceptron units, the MLP is able to approximate any boundary function and thus has a universal approximation capability [60,61].
Figure 2-4: Structure of a Multilayer Perceptron [62]
In a Multilayer Perceptron (MLP), each layer utilizes the outputs from the previous layer as inputs for the next, facilitating a continuous flow of information. While the sigmoid function is commonly used, the hyperbolic tangent (tanh) function is often preferred as the activation function for hidden layers due to its superior learning behavior. Unlike the sigmoid function, which is not centered at zero and can saturate towards one, the tanh function provides a more balanced activation range, enhancing the network's performance.
The choice of the activation function in the last layer depends on the specific problem being addressed. For binary classification tasks, the sigmoid function is appropriate, as it outputs probability values. Similarly, in logistic regression, which also aims to predict probabilities, the sigmoid function is utilized. In contrast, for regression tasks that predict real values, such as estimating loan sizes in banking, the identity function is employed, allowing for direct output without any transformation.
In matrix notation, the equations of a Multilayer Perceptron (MLP) with one hidden layer can be written as

$$h = \tanh(W_h x + b_h)$$
$$y = s(W_y h + b_y)$$

where $h$ is the vector of hidden layer units, $y$ is the vector of output layer units, and $s$ is the activation function of the output layer, which, depending on the problem, is the sigmoid, softmax, or identity function. Each layer is characterized by a weight matrix and a bias vector whose dimensions correspond to the number of neurons in the connected layers; the weight matrix is applied to the column vector of inputs through a matrix-vector product. This scheme can be applied iteratively to multiple hidden layers, where the output of one layer serves as the input for the subsequent layer. It is important to note that each layer requires its own distinct set of parameters.
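A minimal sketch of this forward pass in NumPy follows; the layer sizes and parameter names are illustrative assumptions.

import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mlp_forward(x, W_h, b_h, W_y, b_y, output_activation=softmax):
    # hidden layer: tanh of an affine transformation of the input
    h = np.tanh(W_h @ x + b_h)
    # output layer: problem-dependent activation (sigmoid, softmax, or identity)
    return output_activation(W_y @ h + b_y)

# example with 4 input features, 8 hidden units, and 3 output classes
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W_h, b_h = rng.normal(size=(8, 4)), np.zeros(8)
W_y, b_y = rng.normal(size=(3, 8)), np.zeros(3)
print(mlp_forward(x, W_h, b_h, W_y, b_y))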
Backpropagation
To calculate the gradient of the loss function in a neural network, a technique called backpropagation [22] is used. The partial derivative of the loss $E$ with respect to a weight $w_{ij}$ connecting neuron $i$ to neuron $j$ results from using the chain rule:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \, \frac{\partial y_j}{\partial z_j} \, \frac{\partial z_j}{\partial w_{ij}}$$

The change of the weight can then be calculated as

$$\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$$

where $\eta$ is the learning rate, $z_j = \sum_i w_{ij} y_i + b_j$ is the linear combination of the neuron's inputs, including the bias term, computed before the activation function is applied, and $y_j = f(z_j)$ is the neuron's output after the activation function.
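For illustration, here is a small NumPy sketch of one backpropagation step for a network with a single tanh hidden layer and a sigmoid output unit trained on the negative log likelihood; this is a sketch under these assumptions, not the exact implementation used later in the thesis.

import numpy as np

def backprop_step(x, t, W_h, b_h, W_y, b_y, lr=0.1):
    # forward pass
    h = np.tanh(W_h @ x + b_h)
    y = 1.0 / (1.0 + np.exp(-(W_y @ h + b_y)))     # predicted probability of class 1

    # backward pass (chain rule); for a sigmoid output with NLL the output delta is y - t
    delta_y = y - t
    grad_W_y = np.outer(delta_y, h)
    grad_b_y = delta_y
    delta_h = (W_y.T @ delta_y) * (1.0 - h ** 2)   # derivative of tanh is 1 - tanh^2
    grad_W_h = np.outer(delta_h, x)
    grad_b_h = delta_h

    # gradient descent update of the free parameters
    W_y -= lr * grad_W_y
    b_y -= lr * grad_b_y
    W_h -= lr * grad_W_h
    b_h -= lr * grad_b_h
    return W_h, b_h, W_y, b_y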
Recurrent Neural Networks
Basic RNN
The fundamental architecture of a Recurrent Neural Network (RNN) typically includes a single recurrent hidden layer, where each unit in this layer is fully interconnected with both the output layer and the other units within the recurrent layer. During the analysis of time series data, feature vectors are sequentially input into the network, allowing for the computation of updated values for the output units and each hidden unit at every time step.

In the subsequent time step, the hidden unit's value is combined with the new incoming input, producing an updated output and hidden value. This process enables the network to effectively retain and use temporal information.
Figure 2-5: Structure of a Recurrent Neural Network [62]
This structure can be described with the following equations:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = s(W_{hy} h_t + b_y)$$

Note that the term $W_{hh} h_{t-1}$ is the main difference from the equations of an MLP (see chapter 2.1.3): $W_{hh}$ is the weight matrix for the hidden units of size $n_h \times n_h$, and $h_{t-1}$ are the values of the hidden units from the previous time step.
The result of processing a time series is a vector with one output for each time step.
In sequence classification, where the focus is on the overall label of the time series, it is standard practice to calculate the mean of the output units across all time steps to derive an average prediction. This approach allows for the application of the same loss functions used in Multilayer Perceptrons (MLPs) to Recurrent Neural Networks (RNNs).
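A minimal sketch of this recurrent forward pass with mean pooling over time for sequence classification; the dimensions and parameter names are illustrative assumptions.

import numpy as np

def rnn_classify(X, W_xh, W_hh, b_h, W_hy, b_y):
    # X: (n_timesteps, n_features) time series of feature vectors
    h = np.zeros(b_h.shape[0])
    outputs = []
    for x_t in X:
        # the hidden state combines the new input with the previous hidden state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_t = 1.0 / (1.0 + np.exp(-(W_hy @ h + b_y)))   # per-time-step prediction
        outputs.append(y_t)
    # average the per-step outputs to obtain one prediction for the whole sequence
    return np.mean(outputs, axis=0)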
Recurrent Neural Networks (RNNs) possess the theoretical capability to model and compute any function that a computer can, provided they have sufficient neurons and time. However, the primary challenge lies in the difficulty of training these RNNs effectively.
Vanishing Gradient Problem
A significant challenge in training recurrent neural networks (RNNs) is the vanishing gradient problem, where the influence of earlier time points diminishes exponentially over time. As the network processes more steps, new inputs tend to overwrite the information stored in the hidden units, causing the RNN to lose track of the initial inputs.
The issue becomes apparent when visualizing the temporal unfolding of an RNN, as shown in Figure 2-6. The shading in the figure represents the network's sensitivity to the input signals, revealing that the gradient diminishes significantly over time. Consequently, after approximately ten time steps, the information is entirely lost.
The vanishing gradient problem thus leads to a rapid decline in the network's ability to retain input information. As new data is introduced, the information stored in the hidden units diminishes exponentially, causing the network to forget earlier inputs. This phenomenon highlights the challenges RNNs face in maintaining long-term dependencies in sequential data.
In the forward pass of a neural network, activation functions such as the logistic function constrain the activities to specific intervals, effectively preventing the activity from exploding. However, the backward pass is completely linear: doubling the error derivatives in the final layer causes all preceding derivatives to double as well when the error is backpropagated.
Neural networks exhibit a linear system behavior during backpropagation, leading to significant challenges similar to those faced by linear systems during iteration. This results in gradients that can either vanish if they are too small, causing a loss of information, or explode if they are excessively large, ultimately erasing critical data. Consequently, this dynamic often leads to the loss of gradient information from several time steps prior, impacting the overall learning process.
Feed-forward networks typically do not encounter this issue due to their limited number of layers, while recurrent neural networks (RNNs) effectively function with as many layers as there are time steps in the training sequence, potentially reaching 100 or more. This results in gradients raised to the power of 100, complicating the training process. Various strategies can be employed to address this problem, both in terms of network architecture and training techniques, which will be explored in the subsequent chapters.
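A tiny numerical illustration of this effect (our own example, not taken from the thesis): backpropagating through many time steps repeatedly multiplies the error signal by the recurrent weight matrix, so the gradient norm shrinks or blows up depending on the matrix's spectral radius.

import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_after(steps, spectral_radius, n_hidden=50):
    W = rng.normal(size=(n_hidden, n_hidden))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to the desired spectral radius
    delta = np.ones(n_hidden)          # error signal at the last time step
    for _ in range(steps):
        delta = W.T @ delta            # linear backward pass through one time step
    return np.linalg.norm(delta)

print(gradient_norm_after(100, 0.9))   # vanishes towards zero
print(gradient_norm_after(100, 1.5))   # explodes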
Long Short-Term Memory
To tackle the vanishing gradient problem, Long Short-Term Memory (LSTM) cells offer a robust solution. Unlike standard neurons that merely transmit values to subsequent neurons, LSTM cells incorporate a unique memory block that enables them to retain and store information.
The network's dynamic state functions like a short-term memory, while LSTM cells aim to provide a long-term memory. This involves creating a dedicated module for information storage, ensuring that this memory is isolated from the rest of the network, allowing new inputs to be processed without disrupting the stored information.
The Long Short-Term Memory (LSTM) cell features a dedicated memory store that allows it to read, write, or forget information. These operations are controlled by gate functions that respond to the incoming input values, ensuring efficient data management within the cell.

The memory block is isolated from the rest of the network by logistic gates, which include input, output, and forget gates. The decision to open these gates for a cell is made by the network. When the network intends to store information, it activates the input gate, allowing new data to flow into the memory block.

The information remains in the cell while the forget gate is closed, with the state of this logistic forget gate being influenced by the rest of the network. To access the stored information, the network opens the output gate, allowing the stored value to flow out and influence the rest of the network.

A memory block thus serves as a storage unit for numerical values, maintaining the exact figure as long as the input and forget gates remain closed. To erase the stored value, the forget gate can be opened, which resets the number to zero. Subsequently, new information can be recorded by opening the input gate, allowing for efficient data management within the network.
The structure of the LSTM can be formalized as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $c_t$ is the content of the memory block, and $h_t$ are the values of the hidden units. The logistic function $\sigma$ controls the gates, and the tanh activation function squashes the values that are written to and read from the memory block.
LSTM cells are capable of retaining information across hundreds of time points, effectively overcoming the vanishing gradient problem. They excel particularly in handwriting recognition tasks. Training an LSTM network follows a similar process to that of RNNs, utilizing gradient descent methods. Typically, the structure of an LSTM network resembles that of an RNN, featuring just one recurrent layer.
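A minimal sketch of a single LSTM step in NumPy, following the gate equations above; the packed weight layout and names are our own choice.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenated [input, previous hidden state] to all four gate pre-activations
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:n])          # input gate: how much new information to write
    f = sigmoid(z[n:2*n])        # forget gate: how much of the old cell content to keep
    o = sigmoid(z[2*n:3*n])      # output gate: how much of the cell content to expose
    g = np.tanh(z[3*n:4*n])      # candidate values to be written into the memory block
    c = f * c_prev + i * g       # update of the memory cell
    h = o * np.tanh(c)           # new hidden state read from the cell
    return h, c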
Bidirectional RNN
A bidirectional RNN enhances traditional RNNs by incorporating two separate networks: one processes the time series in the forward direction, while the other operates in reverse. This dual approach allows the model to combine information from both directions at each time step, improving the accuracy of the network output.
Figure 2-8: Bidirectional RNN unfolded in time, adapted from [69]
At each time step, the network utilizes the complete sequence information by processing it in both forward and backward directions. The output layer combines the information gained from the two recurrent networks.
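A sketch of how the two directions can be combined at each time step, reusing the simple tanh RNN from above; summation into a shared output layer is one common choice and an assumption here.

import numpy as np

def rnn_pass(X, W_xh, W_hh, b_h, reverse=False):
    # returns the hidden state of a simple tanh RNN for every time step
    order = reversed(range(len(X))) if reverse else range(len(X))
    h = np.zeros(b_h.shape[0])
    H = [None] * len(X)
    for t in order:
        h = np.tanh(W_xh @ X[t] + W_hh @ h + b_h)
        H[t] = h
    return H

def birnn_outputs(X, fwd_params, bwd_params, W_fy, W_by, b_y):
    H_f = rnn_pass(X, *fwd_params)                  # processes the sequence forward in time
    H_b = rnn_pass(X, *bwd_params, reverse=True)    # processes the sequence backward in time
    # at every time step the output layer sees information from both directions
    return [W_fy @ hf + W_by @ hb + b_y for hf, hb in zip(H_f, H_b)]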
The primary benefit of this model structure is its ability to use the entire time series at each time step, processing the first part in the forward direction and the second part in reverse. This comprehensive approach enhances the significance of the output at every time step. However, the use of the complete sequence for classification poses a challenge for real-time predictions.

Utilizing a forward RNN alone can lead to inaccurate predictions, as the model lacks insight into future time steps. To address this limitation, a potential solution involves introducing a small window that grants the RNN access to subsequent time steps, thereby enhancing its predictive capabilities. However, this approach can significantly increase the number of weights, adding complexity to the model. An alternative solution is to implement a time delay in the output layer, which can mitigate this issue, although the optimal window length remains a sensitive parameter that requires manual tuning for each specific problem.

In sequence classification, the initial outputs of a basic forward RNN can be unreliable due to limited exposure to time points, making them highly dependent on the specific problem. While one might consider weighting later outputs more heavily, crucial information may reside in the earlier time steps, and the RNN's tendency to forget over time complicates this approach. A more effective and less arbitrary solution is to utilize a bidirectional RNN, which addresses these limitations and enhances the model's performance.
Training Neural Networks
Input normalization
Neural networks exhibit strong resilience to irrelevant features; however, as the input dimensionality increases, weight learning becomes more challenging. Additionally, with a small dataset, the risk of overfitting to these unnecessary features escalates.

Normalizing the mean and standard deviation of each feature in a dataset is a crucial preprocessing step, ensuring that the mean is zero and the standard deviation is one. This process maintains the integrity of the data while bringing the inputs into a range that suits the initial network parameters. As a result, learning becomes more efficient, leading to a more stable training process.
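A typical way to perform this normalization (a sketch; the thesis does not spell out the exact implementation):

import numpy as np

def normalize_features(X_train, X_test):
    # compute the mean and standard deviation per feature on the training set only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant features
    # apply the same transformation to both training and test data
    return (X_train - mean) / std, (X_test - mean) / std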
Parameter Initialization for RNN
Recent research highlights the critical importance of proper initialization for the successful training of recurrent neural networks. Inadequate parameter initialization can lead the model into local minima, resulting in overfitting to the training data rather than generalization. While a robust optimizer may help escape local minima, starting from a poor initial point significantly diminishes the likelihood of reaching an optimal solution, as numerous intermediate steps may show no improvement in error values.

The spectral radius of the hidden-to-hidden weight matrix is crucial for RNN initialization, as it significantly affects the dynamics of the hidden units, particularly with the tanh activation function. A spectral radius less than one causes the RNN to forget information rapidly, while a value much greater than one leads to prolonged information retention but can result in chaotic behavior and exploding gradients. Exploding gradients can in turn erase existing information when the network attempts to retain too much data. Conversely, a spectral radius slightly above one allows the network to remember information effectively without encountering exploding gradients. Consequently, a spectral radius of 1.1 is recommended as an optimal value for balancing memory retention and stability in RNNs.

The initial scale of the input-to-hidden weights is also important, especially when dealing with datasets that contain many uninformative features. Setting the initial weights too high can lead to valuable information being overshadowed by less important features, while weights that are too low may result in a sluggish learning process that fails to converge effectively. It is advisable to use a Gaussian distribution with a standard deviation of 0.001 for the input-to-hidden weights. However, if all features are deemed useful, increasing the standard deviation up to 0.1 can speed up learning.
Parameter type      Scale
input-to-hidden     N(0, 0.001)
hidden-to-hidden    spectral radius of 1.1
hidden-to-output    N(0, 0.1)
hidden-bias         0
output-bias         average of outputs
Table 2-1: Optimal initial parameter values for RNNs
Proper scaling and centering of both the inputs and outputs are essential for optimal performance. As highlighted in section 2.3.1, the input data should be normalized to have a mean of zero and a standard deviation of one. Furthermore, the initial outputs should match the average of the targets, which is achieved by initializing the bias of the output units to the average of the correct output values.
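A sketch of an initialization routine following Table 2-1; the layer sizes, names, and random seed are illustrative.

import numpy as np

def init_rnn_params(n_in, n_hidden, n_out, targets, seed=0):
    rng = np.random.default_rng(seed)
    W_xh = rng.normal(0.0, 0.001, size=(n_hidden, n_in))    # input-to-hidden: N(0, 0.001)
    W_hh = rng.normal(size=(n_hidden, n_hidden))
    radius = np.max(np.abs(np.linalg.eigvals(W_hh)))
    W_hh *= 1.1 / radius                                     # hidden-to-hidden: spectral radius 1.1
    W_hy = rng.normal(0.0, 0.1, size=(n_out, n_hidden))      # hidden-to-output: N(0, 0.1)
    b_h = np.zeros(n_hidden)                                 # hidden bias: 0
    b_y = np.full(n_out, np.mean(targets))                   # output bias: average of the target outputs
    return W_xh, W_hh, W_hy, b_h, b_y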
Parameter optimization methods
There is no single optimal method for training neural networks, as various scenarios highlight the importance of different aspects of the learning process.
Neural networks exhibit diverse structures. Deep networks feature numerous layers that complicate training, because the optimizer needs to detect subtle gradients in the initial layers. Similarly, recurrent networks can unfold into extensive layers when processing time sequences, sometimes exceeding a thousand steps, necessitating the identification of relationships across long intervals. In contrast, shallow networks consist of fewer layers but possess many parameters, requiring a different optimization approach; overly sensitive training can lead to significant overfitting. Instead, these parameters are optimized with less sensitivity, and training is halted early to ensure effective generalization.
The input types can also differ significantly, affecting the expected weight values. Dense inputs, such as pixel values, are rich and compact, while binary inputs, such as those representing words, tend to be sparse. This disparity means that the weights for highly informative sparse features, like words, must be finely tuned to capture their sensitivity, whereas with denser inputs, such as images or speech signals, the network has to find a good abstraction of the data first, which generally requires small, evenly distributed weights.
When choosing an optimization technique, it is essential to consider the size of the training set. Generally, algorithms can be categorized based on the number of examples available, guiding the selection of the most appropriate method for the specific scenario.
Small dataset (< 10000 examples)        Big redundant dataset
Full batch learning (e.g. RPROP)        Minibatch-SGD with momentum, Rmsprop
Table 2-2: Different optimization methods depending on the size of the data set
Small datasets typically call for full batch learning algorithms, whereas big redundant datasets are better handled with mini-batch learning methods. A notable exception is the Hessian-Free optimization technique, designed specifically for deep or recurrent neural networks: although it is primarily used on small datasets because of its speed, the algorithm itself operates on mini-batches.
Gradient descent adjusts a model's free parameters by iteratively taking small steps along the error surface defined by the loss function, in the direction indicated by the gradient.
In standard gradient descent, the error of the loss function and the gradient of the weights are computed on the entire training dataset in each iteration. Using the gradient derived from backpropagation, the weights are then adjusted according to a specified learning rate.
Stochastic gradient descent (SGD) operates similarly to traditional gradient descent, but it calculates the gradient using only a single data point at each iteration instead of the entire dataset. This approach not only accelerates computation but also introduces randomness, enabling the optimizer to more easily escape local minima.
The learning rate is a crucial hyper-parameter in gradient descent methods, as it defines the step size taken in the direction of the gradient. It is highly problem-dependent and usually needs to be adjusted throughout the training process: as training progresses, the parameter adjustments should become smaller. To address this, an additional hyper-parameter, the momentum, is commonly introduced; it accumulates a decaying average of past updates and thereby smooths and gradually dampens the steps taken after each epoch.
Mini-batch Stochastic Gradient Descent (SGD) improves the learning process by estimating gradients from a small set of training examples rather than a single one, leading to greater stability. Notably, increasing the batch size from one to five significantly reduces variance and enhances learning stability. However, increasing the batch size beyond 20 often yields diminishing returns, with improvements quickly fading.
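A minimal sketch of mini-batch SGD with momentum; the gradient function and hyper-parameter values are placeholders.

import numpy as np

def minibatch_sgd(params, grad_fn, X, t, lr=0.01, momentum=0.9,
                  batch_size=20, epochs=10, seed=0):
    # params: list of parameter arrays; grad_fn returns one gradient array per parameter
    rng = np.random.default_rng(seed)
    velocity = [np.zeros_like(p) for p in params]
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grads = grad_fn(params, X[idx], t[idx])   # gradient estimate on one mini-batch
            for p, v, g in zip(params, velocity, grads):
                v *= momentum                          # decay the accumulated velocity
                v -= lr * g                            # add the current gradient step
                p += v                                 # update the parameter in place
    return params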
Resilient Propagation (RPROP) [75] is a full batch learning method. It is very robust and a good general-purpose optimization method for both feed-forward and recurrent neural networks.
Gradients in neural networks can vary significantly, with some being large and others quite small, particularly in recurrent neural networks (RNNs). The key concept of the algorithm is to use only the sign of the gradient while disregarding its magnitude. However, this makes it hard to select a learning rate and momentum parameter that accommodate the diverse range of gradient sizes. To address this issue, the RPROP algorithm introduces a separate learning rate, referred to as the step size, for each individual parameter.

The step size is adapted over time. When the parameter is adjusted in the correct direction, the step size increases multiplicatively, while an incorrect adjustment results in a multiplicative decrease. The direction is deemed correct when the signs of the last two gradients agree.
A good value for the factor that increases the step size is 1.2, while a suitable value for the decreasing factor is 0.5. To keep the step sizes within limits, a lower bound of $10^{-6}$ and an upper bound of 50 are commonly used; a recommended initial step size is 0.0125. The choice of these hyper-parameters can be tailored to the specific problem at hand.
The algorithm's dependency on these hyper-parameters is, however, small. This contrasts with momentum-based Stochastic Gradient Descent (SGD), where variations in learning rate and momentum significantly influence performance.
IRprop- is the most robust variation of the RPROP algorithm. In the following we provide pseudo code for the update of one parameter with IRprop-:

# Inputs:
# - grad: the gradient of the error function
# - last_grad: the gradient from the previous iteration
# - delta: the current step size
# - parameter: the current values of the free parameters
# step_pos, step_neg, max_step, min_step are the hyper-parameters described above
# Output:
# - parameter: the new values for the free parameters

change = grad * last_grad
if change > 0:
    delta = min(delta * step_pos, max_step)
elif change < 0:
    delta = max(delta * step_neg, min_step)
parameter = parameter - sign(grad) * delta
if change >= 0:
    last_grad = grad
else:
    last_grad = 0
Regularization
Training a machine learning model involves more than just optimization; achieving good generalization on unseen data is crucial. A model with excessive parameters may memorize the training data, leading to poor performance on new observations, a problem known as overfitting. To combat overfitting, various regularization techniques can be employed, ensuring that the model truly understands the data rather than simply memorizing it.

Overfitting primarily arises from irregularities in the dataset, which can either genuinely belong to the input-output mapping or stem from noise due to sampling errors in the training set. Distinguishing between genuine irregularities and accidental ones is challenging. A flexible model may perfectly fit these accidental irregularities, leading to poor generalization performance.
To effectively prevent overfitting, acquiring more data is the most reliable strategy, as it allows models to capture patterns without approximating noise or restricting their capabilities. While increasing the amount of data is beneficial when computationally feasible, it is crucial to ensure that the training set is not correlated with the test set; otherwise, even a larger dataset will not mitigate the risk of overfitting.

Another way to achieve good generalization is to limit the model's capacity. This involves allowing the model sufficient flexibility to accurately capture the significant patterns while preventing it from fitting random noise. However, this is challenging and typically relies on the assumption that the noise irregularities are less prominent than the meaningful ones.
L1 and L2 regularization is a way to limit the network's capacity and works by adding an additional term to the loss function that penalizes extreme parameter values:

$$E_{\mathrm{reg}}(\theta, D) = E(\theta, D) + \lambda \lVert \theta \rVert_p^p, \qquad \lVert \theta \rVert_p = \Big( \sum_{\theta_i \in \theta} |\theta_i|^p \Big)^{1/p}$$

where $\lVert \theta \rVert_p$ is the $p$-norm, $\lambda$ is a hyper-parameter that controls the influence of the regularization term on the error function, and $p$ is usually 1 or 2.
The regularization term promotes a balanced distribution of parameter values, penalizing large parameters to restrict the modeling of excessive nonlinearity. This approach effectively mitigates overfitting and encourages simpler solutions that still fit the training data well, ultimately enhancing generalization performance.

L1 regularization outperforms L2 when there are significantly fewer training examples than features, as it effectively selects the most relevant features from the dataset. Conversely, when all features hold substantial significance, L2 regularization is likely to yield better results by smoothing the network mapping without eliminating any features.

Which of the two works better is highly dependent on the specific case at hand, and it is common practice to combine both methods with adjustable hyper-parameters. The optimal coefficients for these regularization terms are typically identified through cross-validation and empirical testing.
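A sketch of how such a penalty can be added to the loss and its gradient; the coefficient values are placeholders.

import numpy as np

def regularized_loss(data_loss, params, l1=0.0, l2=0.0):
    # add L1 and/or L2 penalties on all free parameters to the data loss
    penalty = sum(l1 * np.sum(np.abs(p)) + l2 * np.sum(p ** 2) for p in params)
    return data_loss + penalty

def regularization_gradient(param, l1=0.0, l2=0.0):
    # contribution of the penalty term to the gradient of one parameter array
    return l1 * np.sign(param) + 2.0 * l2 * param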
Early stopping is an effective technique to prevent overfitting in neural networks by closely monitoring their learning process. It utilizes a separate validation set to evaluate the model's performance on new and unseen data. If the model shows no further improvement or experiences a decline in performance on the validation set, the training process is halted. This validation set is distinct from both the training and test sets, allowing the network to assess its performance at each epoch without bias.
Figure 2-9: Example of early stopping to prevent overfitting [54]
During the training of a neural network, the error on the training set typically decreases consistently. However, the errors on the test and validation sets may begin to rise after a certain point, indicating that the model is overfitting to the training examples. This overfitting results in a loss of generalization and a decline in performance on new data.
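A sketch of an early stopping loop with a patience counter; train_epoch and validation_error are placeholder functions.

import copy

def train_with_early_stopping(model, train_epoch, validation_error,
                              max_epochs=1000, patience=20):
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch(model)                        # one pass over the training set
        error = validation_error(model)           # performance on the held-out validation set
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)     # remember the best parameters seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                 # stop once the validation error no longer improves
    return best_model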
Adding Gaussian noise to the input vectors is a way of artificially increasing the size of the training set. It is a very effective way to prevent overfitting and increase generalization, especially when working with small training data sets. Although this technique may reduce performance on the training set, it often improves performance on the validation and test sets.
The optimal level of noise to add is largely determined by the specific data set, such that the new observations still closely resemble the original data. Excessive noise can hinder the model's ability to learn effectively. Research indicates that starting with noise equal to half the variance typically yields positive results.
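A sketch of this augmentation; the noise level follows the half-the-variance rule of thumb mentioned above and should be treated as a starting point.

import numpy as np

def augment_with_noise(X, n_copies=1, noise_fraction=0.5, seed=0):
    # X: (n_examples, n_features); creates noisy copies and appends them to the data
    rng = np.random.default_rng(seed)
    noise_std = np.sqrt(noise_fraction * X.var(axis=0))   # per-feature noise level
    copies = [X + rng.normal(0.0, noise_std, size=X.shape) for _ in range(n_copies)]
    return np.vstack([X] + copies)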
One approach to regularization is to train multiple models independently on different portions of the training data, similar to the random forest method: an ensemble of classifiers that collaborate to produce a consensus prediction during evaluation. However, since training individual neural networks is resource-intensive, training numerous models separately is often impractical.
Dropout is a training technique for large neural networks that allows for the simultaneous training of multiple models without the need for separate training processes. During each training iteration, half of the hidden units are randomly deactivated, creating a unique network architecture for every training instance. Importantly, each hidden unit can be sampled multiple times across different architectures, leading to significant weight sharing among the various models. This weight sharing is a fundamental concept that underpins the effectiveness of dropout in neural network training.
Figure 2-10: Example of dropout with one hidden layer [54]
In this example, four out of nine hidden units are randomly deactivated with a 50% probability, and the network is trained solely with the active units. Consequently, the weights of the active units are, on average, twice their usual value.
In a feed-forward neural network with one hidden layer, each neuron in the hidden layer is omitted with a probability of 0.5, resulting in the deactivation of certain nodes during training. For each training example, only the active hidden units are utilized, while the inactive ones are ignored, and a new dropout architecture is sampled each time. This approach enables the generation of 2^8 = 256 unique architectures when using eight hidden units.
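A sketch of the sampling step for one training example: a forward pass with a dropout mask applied to the hidden layer (at test time all units would be used and the hidden activations halved instead).

import numpy as np

def forward_with_dropout(x, W_h, b_h, W_y, b_y, p_drop=0.5, rng=None):
    rng = rng or np.random.default_rng()
    h = np.tanh(W_h @ x + b_h)
    # sample a new architecture: each hidden unit is kept with probability 1 - p_drop
    mask = rng.random(h.shape) >= p_drop
    h = h * mask                              # dropped units do not contribute to the output
    return 1.0 / (1.0 + np.exp(-(W_y @ h + b_y)))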
Methods
Biological data
The time-lapse experiments used in this thesis were conducted at the Institut für Stammzellforschung (ISF, Helmholtz Zentrum München) in the group of Dr. Timm Schroeder.
A healthy mouse strain indistinguishable from wild type mice served as the model organism for this study. Hematopoietic stem cells (HSCs) were isolated from the bone marrow using fluorescence activated cell sorting (FACS), with identification achieved through antibodies that specifically target CD150 surface proteins. This method allows for precise targeting of HSCs and their progeny. The isolated HSCs were then cultured on a plastic slide designed to support long-term cell growth.

The growth of the cells was monitored using an AxioVert200 inverse fluorescence microscope (Zeiss), with images captured by an AxioCAM HRM (Zeiss) camera. Since the camera cannot cover an entire slide in one shot, the plate was divided into an overlapping grid of positions. To capture all areas, the camera was mounted on a motor that moved it in a programmed pattern. Consequently, bright field images were taken approximately every 90 seconds, while fluorescence images were captured at intervals of 22.5 minutes.

Following the experiments, the brightness variations in the bright field images were normalized. Cells within these images were manually tracked over time using Timm's Tracking Tool, leading to the creation of cell genealogies as outlined in chapter 1.3.

Fluorescence measurements were conducted alongside bright field imaging to annotate the expression values of biological markers. Notably, the transcription factors PU.1 and GATA, along with the surface protein FCgamma, are correlated with the differentiation of MEPs and GMPs. These correlations provide insights into the distinct characteristics of the two cell types.
These markers show characteristic trends in their expression: FCgamma levels increase in granulocyte-macrophage progenitors (GMPs), while GATA expression rises in megakaryocyte-erythroid progenitors (MEPs). Notably, PU.1 expression decreases prior to the rise of GATA, and GATA is not expressed while PU.1 levels are elevated.
MEPs can be reliably distinguished from GMPs using FCgamma and GATA as markers. Currently, PU.1 is not utilized for cell labeling due to the unclear nature of its expression profile during the differentiation of MEPs and GMPs.

This thesis explores the differentiation of MEPs from GMPs using bright field images that showcase cell morphology, building upon the findings of F. Buggenthin. Additionally, it investigates the potential of PU.1 as a distinguishing marker. Ultimately, the goal is to identify MEPs and GMPs more effectively than the currently used markers GATA and FCgamma.
3.1.1.2 Data used in this thesis
For our experiments, four time-lapse movies were utilized, featuring bright field images captured at varying time intervals. Each cell cycle is recorded as a single cell; upon division, the count begins anew with the two resulting cells. During certain cell cycles, the biological markers surpass a specific threshold, allowing for the labeling of the corresponding cell and all its descendants.
Name        Unknown   MEPs   GMPs   Time interval (sec)
120602PH5   705       213    489    153
This thesis utilizes four time-lapse movies, detailed in Table 3-1, of which 120602PH5 and 130218PH8 are the primary data sources. The remaining movies contribute minimally, featuring only a limited number of MEPs and GMPs.
To effectively distinguish cells, it is essential to consider various time windows, ranging from a single image to complete cell cycles, including those of their ancestors. Initial tests indicated that relying solely on one image was insufficient. Consequently, we explored full cell cycles and their ancestors, which yielded promising results. Ultimately, we decided to concentrate on classification based on a single cell cycle, as this approach proved to be effective in our evaluations.
A further consideration is the practical application of live prediction, which would rely on automatic cell tracking. Because handling cell divisions is difficult for tracking algorithms, we concentrate on the trajectory of a single cell over one cell cycle. However, it remains worthwhile to explore the use of complete lineages for more comprehensive insights (refer to chapter 4.6.4).
To normalize the time intervals, we omitted every second image for the dataset 120602PH5 and selected every third image for the datasets 130218PH8, 130708PH9, and 140206PH8. As a result, all movies have time intervals of 300 seconds, except for 130708PH9, which has a 360-second interval. However, the limited number of tracked cells in this movie is unlikely to impact the overall results. Fluorescence measurements were interpolated using the nearest-neighbor method at each time step where a bright field image was present.
Figure 3-1: Histogram of cell cycle length of the 120602PH5 movie [54]
Cell cycle lengths can vary significantly, averaging around 130 time points after normalization, which translates to approximately 11 hours per cycle. It is important to note that the normalization is applied only to the time interval between two images, so that experiments with different settings become comparable; the cell cycles themselves are not length normalized.
Our goal is to develop a classifier capable of understanding time and learning from cell tracks in one movie to predict labels for cell tracks in another movie. While predicting labels for different cells within the same movie would be simpler, because the classifier could learn movie-specific biases, this approach is impractical. Instead, we aim to ensure that our method can be applied for live predictions on new movies.
As we only have two movies with large amounts of tracked cells (see Table 3-1) available, we create two different pairs of training and test sets as follows:
(A) Training: 859 unlabeled, 1932 labeled cells; Test: 120602PH5
(B) Training: 1020 unlabeled, 818 labeled cells; Test: 130218PH8
Merging several movies into the training set creates a broader data set, and keeping the test movie distinct from the training movies is essential for achieving movie-independent generalization. Incorporating more movies into the training set increases training time but generally improves performance, provided the data is accurate. It is important to note that the data sets are imbalanced: data set (A) contains 70% GMPs in training and 80% in testing, while data set (B) contains 80% GMPs in training and 70% in testing. However, since the training and test sets exhibit similar imbalances, additional normalization is unnecessary.
Extracting features from images
The tracking process allows for the annotation of each cell's position in the bright field images. Previous research at the ICB concentrated on segmenting cells from their backgrounds and extracting significant features from these segmentations. By employing the maximally stable extremal regions (MSER) algorithm, cells can be effectively segmented and bounding boxes created around them. This results in a series of small cell images captured throughout the cell cycle.
Figure 3-3: Segmented bright field images of an exemplary cell over one cycle [54]. The highlighted boundary indicates the manual segmentation of a cell.
The segmentation of the cells allows for the extraction of key features that characterize their shape and texture. In F. Buggenthin's diploma thesis, a combination of various methods was utilized to compute 88 distinct features for each bright field cell segmentation:
- Shape: area, axis lengths, eccentricity, orientation, perimeter, diameter
- Colors: intensity, contrast, correlation, variance
- Histogram of Oriented Gradients features [83]
- Histogram of Zernike moments [84]: similarity to predefined shapes
- Tamura features [85]: coarseness, contrast, directionality, linelikeness, regularity, roughness
This study primarily emphasizes the analysis of bright field images, while also considering the movement speed and the intensity of PU.1 as secondary factors. The use of PU.1 intensity is limited to alternative scenarios, because it requires transgenic mouse strains that express fluorescently labeled PU.1 proteins.
During a cell's growth cycle, its area gradually increases while its movement decreases. As the cell matures, it departs from its oval shape and eventually prepares for division.

Throughout a cell cycle, cells exhibit consistent behaviors, including growth and an increase in surface area. Notably, the eccentricity of the cells rises, indicating a shift away from an oval shape, peaking in the middle of the cycle before decreasing again. We propose that MEP and GMP cells display distinct variations in these characteristics during the cell cycle, which could serve as a basis for differentiating between the two cell types.
Baseline with SVM
We conducted a test to determine whether it is feasible to differentiate MEPs from GMPs by analyzing the time series of extracted features over one cell cycle, utilizing a Support Vector Machine (SVM) for classification. However, it is important to note that SVMs inherently lack the ability to process temporal information and are limited to a fixed number of features.
To ensure consistency, each cell cycle is normalized to a uniform length, allowing the time series of a feature to be represented with a limited number of data points. Rather than relying on a single measurement for each time point, the analysis utilizes 10 evenly spaced values to estimate the curve. An alternative method involves fitting a spline, as demonstrated in previous research.
A drawback of this baseline is that normalizing the cell cycles to a uniform length is questionable: if, for example, MEPs had inherently longer cell cycles than GMPs, this difference would be obscured. In addition, the SVM cannot exploit temporal structure, so the time dimension is collapsed and the time points are treated as independent observations. In our analysis, this results in a feature vector of 880 dimensions, derived from 88 features across 10 normalized time points.
In our study, the SVM classifier is trained on the labeled time series data from the training set, enabling predictions for each cell in the test set. These predictions yield probability values indicating whether a cell is classified as MEP or GMP. To assess the classifier's performance, we utilize the Area Under the Curve (AUC) as a more reliable metric than accuracy, particularly for smaller datasets. Unlike accuracy, which only distinguishes between correct and incorrect predictions, AUC incorporates the confidence of the predictions, as reflected in the probability values.
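A sketch of this baseline with scikit-learn; the feature matrices, labels, and SVM settings are illustrative assumptions rather than the exact configuration used in the thesis.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def svm_baseline(X_train, y_train, X_test, y_test):
    # X_*: (n_cells, 880) flattened time series (88 features x 10 normalized time points)
    # y_*: binary labels, e.g. 0 = MEP, 1 = GMP
    clf = SVC(probability=True)            # probability=True enables probability estimates
    clf.fit(X_train, y_train)
    p_test = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, p_test)   # AUC accounts for the confidence of the predictions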
Recurrent Neural Networks
Despite the recent success of deep feedforward neural networks, our work focuses on Recurrent Neural Networks (RNNs) because of their ability to model temporal structure. While traditional RNNs have struggled to deliver strong results, recent advancements have emerged, including the use of dropout in classic RNNs and the development of multi-layered, bidirectional models built from Long Short-Term Memory (LSTM) cells. We aim to build on these approaches to improve model performance.
When training neural networks, a validation set is needed in addition to the training and test sets to facilitate early stopping. The validation set allows monitoring performance on data similar to the training set, helping to identify when a decrease in training error results from overfitting rather than from improved generalization. Throughout training, performance on the validation set is continuously observed, and the parameter combination that yields the best result on this set is taken as the outcome of training.
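In pseudocode terms, the early-stopping logic can be summarized as in the following minimal sketch; `model`, `train_step`, and `val_auc` are placeholders and not names from our implementation:

```python
import copy

def train_with_early_stopping(model, train_step, val_auc, max_epochs=500, patience=50):
    """Keep the parameters that score best on the validation set and stop once
    no improvement has been seen for `patience` epochs (illustrative sketch)."""
    best_score, best_params, since_best = float("-inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                  # one optimization step on the training set
        score = val_auc(model)             # monitor performance on the validation set
        if score > best_score:
            best_score = score
            best_params = copy.deepcopy(model.params)
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:     # training error may still fall, but we are overfitting
                break
    model.params = best_params             # restore the best parameters seen on the validation set
    return model, best_score
```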
The basic Recurrent Neural Network (RNN) faces significant training challenges due to the vanishing gradient problem, as discussed in chapter 2.2.2. Recently, this issue has been addressed with the Hessian-free optimization technique, which handles small gradients by approximating the second derivative of the loss function.
This, however, makes experimenting with different optimization techniques and network architectures cumbersome: each optimization step can require several hours to complete, even on a small dataset.
Recent findings indicate that the success of training neural networks is significantly influenced by their initial parameters. Using the initial values outlined in chapter 2.3.2 for the RNN allows effective training without Hessian-free optimization. Given the small training set of approximately 2000 observations, full batch learning proves to be the most effective approach. Additionally, RProp, as detailed in chapter 2.3.3.2, emerges as a highly efficient general-purpose optimization method, yielding good results across various architectures while addressing the vanishing gradient problem by considering only the sign of the gradient.
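The sign-only update rule can be sketched as follows for the iRprop- variant; which exact RProp variant is used is not specified here, so this is an assumption for illustration:

```python
import numpy as np

def rprop_update(param, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch iRprop- update for a single parameter array: every weight
    keeps its own step size and only the sign of the gradient is used, so tiny
    gradient magnitudes do not stall learning."""
    sign_change = np.sign(grad) * np.sign(prev_grad)
    step[sign_change > 0] = np.minimum(step[sign_change > 0] * eta_plus, step_max)
    step[sign_change < 0] = np.maximum(step[sign_change < 0] * eta_minus, step_min)
    grad = np.where(sign_change < 0, 0.0, grad)   # iRprop-: skip the update after a sign flip
    param -= np.sign(grad) * step                 # move by the step size, not the gradient magnitude
    return param, grad                            # grad becomes prev_grad in the next iteration
```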
During the development of this thesis, a notable advancement for recurrent neural networks was introduced with fast dropout. We adopted this technique to address overfitting in our network. Specifically, we applied dropout to both the input-to-hidden and the hidden-to-hidden connections, as outlined in the referenced study. Although concerns have been raised that dropout on the hidden-to-hidden connections might disrupt the network's ability to model sequences, we found that applying dropout only to the input-to-hidden connections was insufficient to mitigate overfitting.
Figure 3-5: RNN with dropout unfolded in time [54]
Dropout is applied to the input-to-hidden and hidden-to-hidden connections, as indicated by the dashed lines.
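To make the placement of the dropout masks concrete, the sketch below shows a single recurrent step with sampled masks on both connection types; it uses ordinary dropout for readability, whereas fast dropout replaces the sampling with a Gaussian approximation, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step_with_dropout(x_t, h_prev, W_in, W_rec, b, p_drop=0.5):
    """One recurrent step with dropout on the input-to-hidden and the
    hidden-to-hidden connections (cf. the dashed lines in Figure 3-5)."""
    mask_in = rng.binomial(1, 1.0 - p_drop, size=x_t.shape) / (1.0 - p_drop)
    mask_rec = rng.binomial(1, 1.0 - p_drop, size=h_prev.shape) / (1.0 - p_drop)
    return np.tanh((x_t * mask_in) @ W_in + (h_prev * mask_rec) @ W_rec + b)
```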
We compared the performance of the classic RNN to the implementation with fast dropout to assess the impact of this regularization on the network.
In our second approach, we implement a recurrent network with LSTM units, whose memory cells counteract the vanishing gradient problem and thereby enable the modeling of long time sequences. Such a network can be trained with stochastic gradient descent (SGD); however, we chose RProp to avoid additional hyper-parameters, which also resulted in faster convergence. All weights of the LSTM network are initialized from a Gaussian with mean zero and standard deviation 0.01.
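For reference, a standard LSTM step and the described initialization can be sketched as follows; the stacked gate layout and the exact formulation are textbook assumptions rather than details of our implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lstm(n_in, n_hidden, sigma=0.01):
    """All weights drawn from a zero-mean Gaussian with standard deviation 0.01;
    the four gates (input, forget, cell candidate, output) are stacked."""
    return {
        "W_in": rng.normal(0.0, sigma, (n_in, 4 * n_hidden)),
        "W_rec": rng.normal(0.0, sigma, (n_hidden, 4 * n_hidden)),
        "b": np.zeros(4 * n_hidden),
    }

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: the additive, gated update of the memory cell c is what
    lets information and gradients survive over long sequences."""
    z = x_t @ p["W_in"] + h_prev @ p["W_rec"] + p["b"]
    i, f, g, o = np.split(z, 4)
    i, f, g, o = sigmoid(i), sigmoid(f), np.tanh(g), sigmoid(o)
    c = f * c_prev + i * g                     # memory cell update
    h = o * np.tanh(c)                         # exposed hidden state
    return h, c
```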
Extending a basic LSTM model with a bidirectional approach has proven beneficial for frame-wise classification. In our scenario, the precise timing of predictions is less critical; instead, we aim for accurate average predictions per sequence. We found, however, that the initial outputs of the RNN can be uninformative, as the network has only processed a few time points at that stage. We therefore use a bidirectional approach for sequence classification, allowing the network to make informed predictions at every time step by accessing the entire sequence.
A recent advancement of the bidirectional LSTM is a hybrid approach in which bidirectional networks are stacked in a layered configuration. This design makes the network deep both in time and in space by duplicating the hidden layer of a bidirectional LSTM and adding it as a further layer, which has been shown to significantly improve the performance and abstraction capabilities of the recurrent network.
We observe a boost in performance for the first additional layers, but after adding three layers there is no further gain.
Figure 3-6: Deep bidirectional LSTM with two recurrent layers each, adapted from
The concept resembles a bidirectional RNN with LSTM units and stacked hidden layers. Notably, the forward and backward directions are stacked independently, and their information is only integrated at the output layer.
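The structure in Figure 3-6 can be sketched as follows; a plain tanh recurrence stands in for the LSTM units to keep the example short, and all names are illustrative rather than taken from our code:

```python
import numpy as np

def run_layer(xs, W_in, W_rec, b, reverse=False):
    """Run one recurrent layer over a sequence, optionally from the end to the
    start (a tanh cell stands in for the LSTM units of the real model)."""
    h = np.zeros(W_rec.shape[0])
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    out = [None] * len(xs)
    for t in order:
        h = np.tanh(xs[t] @ W_in + h @ W_rec + b)
        out[t] = h
    return out

def deep_bidirectional_rnn(xs, fwd_layers, bwd_layers, W_out_f, W_out_b, b_out):
    """The forward and backward stacks are run independently, layer on layer,
    and their information is merged only at the output layer (cf. Figure 3-6)."""
    f, b = xs, xs
    for W_in, W_rec, bias in fwd_layers:       # stacked forward direction
        f = run_layer(f, W_in, W_rec, bias)
    for W_in, W_rec, bias in bwd_layers:       # stacked backward direction
        b = run_layer(b, W_in, W_rec, bias, reverse=True)
    return [hf @ W_out_f + hb @ W_out_b + b_out for hf, hb in zip(f, b)]
```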
We are interested in how the normal single-layer forward LSTM performs in comparison to the deep bidirectional LSTM (DB LSTM).
Recurrent Neural Networks produce an output at each time step, which must be combined into a single output for sequence classification. This is typically achieved by summing the values of all output units across the time steps and applying a sigmoid activation function for binary classification or a softmax function for multi-class problems. The process of merging values from different time points or layers is referred to as pooling.
Our experiments revealed that applying the sigmoid function after pooling the outputs significantly hinders the network's learning. This approach is hard to optimize without overfitting for long sequences, where a single time point can heavily influence the classification. In contrast, applying the sigmoid function to each time step before pooling yields better results: it reduces the weight of each individual output, so the prediction depends more on being correct on average, and it results in a binary probability output between zero and one for the network.
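The difference between the two pooling orders can be illustrated with a small numerical sketch; the per-time-step output values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

outputs = np.array([0.3, -0.1, 8.0, 0.2])       # hypothetical raw outputs per time step

pool_then_sigmoid = sigmoid(np.sum(outputs))     # ~1.00: one extreme time step dominates
sigmoid_then_pool = np.mean(sigmoid(outputs))    # ~0.65: closer to an average "vote"

print(pool_then_sigmoid, sigmoid_then_pool)
```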
We explored various pooling methods for the sigmoid values of unidirectional RNNs, including weighting the outputs by how much of the sequence the model has already seen. Applying a linear weight to prioritize later outputs did not improve our results. Relying solely on the last output of the network also appeared promising, since the unidirectional model has processed the entire sequence by that point; however, this strategy is highly susceptible to overfitting.
3.1.4.4 Implementation details and hyper-parameters
Training deep neural networks can be extremely time-consuming, which is why we developed our methods from the ground up to exploit GPU computation. Graphics processing units (GPUs) significantly outperform central processing units (CPUs) in floating-point operations, often achieving speed-ups of up to 20 times. In our implementation, we use Theano [90], a framework that directly generates GPU-optimized C code from mathematical expressions.
To make efficient use of the GPU, our approach uses a 3-dimensional tensor rather than traditional 2-dimensional matrices. Typically, recurrent neural networks process data by iterating over matrix rows, where each row represents one training example across time points and features, leading to one vector calculation at a time. Given the small dataset and our preference for full batch training, we can instead execute the calculations for a single time step simultaneously across all sequences. The input is organized in a 3D tensor, with time points as the first dimension, sequences as the second, and features as the third. The method iterates over the time points and applies the equations collectively to the matrix containing all sequences and their features, which speeds up the overall computation.
When all calculations are finished, the first dimension is swapped with the second one to restore the original ordering.
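A minimal Theano sketch of this layout is given below; the dimensions, the plain tanh recurrence, and all variable names are assumptions for illustration and do not reflect the actual implementation:

```python
import numpy as np
import theano
import theano.tensor as T

n_features, n_hidden = 88, 50
floatX = theano.config.floatX

# input as a 3D tensor: (time points, sequences, features)
X = T.tensor3("X")

W_in = theano.shared(np.asarray(0.01 * np.random.randn(n_features, n_hidden), dtype=floatX))
W_rec = theano.shared(np.asarray(0.01 * np.random.randn(n_hidden, n_hidden), dtype=floatX))
b = theano.shared(np.zeros(n_hidden, dtype=floatX))

def step(x_t, h_prev):
    # x_t has shape (sequences, features): one time step of all sequences at once
    return T.tanh(T.dot(x_t, W_in) + T.dot(h_prev, W_rec) + b)

h0 = T.zeros((X.shape[1], n_hidden))
hiddens, _ = theano.scan(fn=step, sequences=X, outputs_info=h0)

# afterwards, swap the first two dimensions: (sequences, time points, hidden units)
f = theano.function([X], hiddens.dimshuffle(1, 0, 2))

batch = np.random.rand(120, 2000, n_features).astype(floatX)   # 120 time points, 2000 sequences
print(f(batch).shape)                                           # (2000, 120, 50)
```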