
PM2.5 Prediction Using Genetic Algorithm Based Feature Selection and Encoder-Decoder Model (predicting the PM2.5 index using deep learning and a genetic algorithm)


DOCUMENT INFORMATION

Basic information

Title: PM2.5 Prediction Using Genetic Algorithm Based Feature Selection And Encoder Decoder Model
Supervisors: Dr. Nguyen Phi Le, Dr. Nguyen Thanh Hung
Institution: BK.AI Center
Field: Air Quality Prediction
Document type: Thesis
Pages: 60
File size: 1.4 MB

Structure

  • Content

  • Introduction

  • Chapter 1. Related works

  • Chapter 2. Theoretical Background

  • Chapter 3. Proposed Forecasting Framework (OFFGED)

  • Chapter 4. Performance Evaluation

  • Conclusion

  • Published papers

  • References

Contents

PM2.5 forecasting problem

Industrialization and urbanization have significantly improved human lives but have also led to serious air pollution concerns, particularly regarding air quality in residential areas. Particulate matter 2.5 (PM2.5) is a critical indicator of air quality, closely linked to human health, as these tiny particles can penetrate deep into the lungs and contribute to conditions such as cardiovascular and respiratory diseases. Research indicates that prolonged exposure to PM2.5 can increase the risk of heart attacks and strokes. Consequently, accurate forecasting of PM2.5 levels is essential for enabling governments and citizens to implement effective strategies to mitigate or prevent adverse health effects.

Existing solutions and problems

PM2.5 forecasting is typically approached as a time series prediction challenge, often utilizing recurrent neural networks (RNNs) like LSTM, which have proven effective for air quality prediction. Various studies have explored LSTM models, including one that integrates gas and PM2.5 concentrations for air quality forecasting in Taiwan, and another that develops a hybrid neural network for multi-step PM2.5 forecasting. Additionally, innovative approaches such as combining graph convolutional networks with LSTM and employing k-nearest neighbor algorithms to extract spatial-temporal information have been investigated. Despite these advancements, existing air quality prediction models face two significant challenges: limitations on input and output lengths, which restrict the ability to predict future values beyond the input data's length, and the need for optimized feature selection among numerous influencing factors like temperature and humidity. Addressing these issues is crucial for enhancing the accuracy and efficiency of PM2.5 predictions.

Goals and approaches

This paper introduces a novel PM2.5 prediction model that integrates a genetic algorithm (GA) for optimal feature selection and an encoder-decoder (E-D) architecture to enhance prediction accuracy. The GA effectively enriches the model by selecting relevant features, while the E-D model accommodates varying input and output sizes. We validate the model's effectiveness through evaluations on the Hanoi and Taiwan datasets, demonstrating that the GA-based feature selection significantly outperforms other methods. Additionally, we compare our model to the state-of-the-art ST-DNN method, highlighting its superior performance in PM2.5 prediction.

Utilizing the Taiwan dataset, our model demonstrates a significant accuracy enhancement, ranging from 14.82% to 41.71% compared to ST-DNN. Furthermore, by integrating the GA-based feature selection algorithm with the E-D model, we achieve an additional accuracy increase of at least 3%.

Structure of thesis

This paper is structured to provide a comprehensive overview of our research, beginning with the motivations outlined in Section II. In Section III, we detail our proposal, followed by an evaluation of its performance in Section IV. Section V discusses related works, and the paper concludes with a summary in Section VI.

This section reviews various PM2.5 and air quality prediction models, highlighting the use of deep learning techniques. Kök et al. [3] developed a model that integrates an LSTM layer for training data, yielding high accuracy in predictions, although it lacks flexibility in input and output. Other models like ST-DNN [11], deep air learning (DAL) [10], and GC-DCRNN [14] utilize spatial data to analyze spatial-temporal relationships but fail to identify the key factors influencing air quality. Furthermore, these models incur significant time costs due to preprocessing requirements. The DAL model focuses on feature selection to determine the significance of input features rather than enhancing prediction accuracy, ultimately revealing critical factors affecting air quality and supporting air pollution mitigation efforts.

In the study referenced in [28], the authors employ a sequence-to-sequence model for predicting PM2.5 levels, incorporating all air pollutants without considering their relevance. This approach may lead to an "accumulation of errors," as inaccuracies in predicting individual features can adversely impact PM2.5 forecasts. Even features that do not influence PM2.5 can contribute to greater inaccuracies. Additionally, in [20], the dataset includes five features aside from PM2.5, resulting in 120 potential feature combinations. However, the authors fail to detail the process for selecting the optimal combination and only present results for seven combinations without justification for their choices.

L. Yan et al. utilize the E-D model to predict PM2.5 concentrations by incorporating various features such as monthly and daily average PM2.5 levels, PM10 concentration, AQI, and meteorological factors like temperature and humidity. However, the lack of optimal feature selection may lead to model complexity and reduced prediction accuracy. In a separate study, the authors group features and use distinct encoders for each group, but this method suffers from two significant issues: the inclusion of irrelevant features adversely affects prediction accuracy, and the complexity of the model increases with the number of feature groups due to the corresponding number of encoders required.

Related works

The study involves two training phases to predict PM2.5 levels. Initially, an auto-encoder model is employed to analyze the relationship between various climate variables and PM2.5 concentrations, while also compressing the input data to reduce complexity. The results from this first phase are subsequently used in the second phase, where a Bi-LSTM network predicts future PM2.5 levels based on historical data.

In their study, the authors leverage three key data types: recent air pollutants, meteorological information, and PM2.5 readings from nearby stations. They employ a one-dimensional convolutional neural network to capture the spatiotemporal correlations in air quality data. The resulting feature vector is subsequently processed through an LSTM layer enhanced with an attention mechanism to forecast future air quality levels. Overall, existing research explores diverse variants of the E-D model to perform PM2.5 prediction, but none considers feature selection.

The selection of hyperparameters significantly influences the performance of deep-learning models, including factors such as the number of hidden layers, neurons per layer, weights, and learning rate. Traditionally, these parameters have been determined through trial and error, a process that is both time-consuming and may not yield the best results. To improve this, leveraging search techniques to identify optimal settings is a viable solution. Among these methods, genetic algorithms (GA) stand out as a promising meta-heuristic approach for search and optimization.

A genetic algorithm (GA) is utilized to optimize the structure of deep belief networks for identifying various types of attacks in IoT networks, allowing for adaptive adjustments in the number of hidden layers and neurons based on the attack type. Bouktif et al. employ a GA to optimize the time lag and layer count of an LSTM model for predicting future electric loads. Additionally, a deep long short-term memory (DLSTM) model is proposed for forecasting petroleum production, where GA is used to determine optimal hyperparameters like epochs and hidden neurons. W. Liu et al. enhance PM2.5 prediction using support vector machines (SVM), optimizing model parameters with both GA and particle swarm optimization (PSO) to boost accuracy. Furthermore, a PM2.5 prediction model that integrates multi-resolution data employs an ensemble approach, optimizing weights through the nondominated sorting genetic algorithm (NSGA-II). Unlike these studies, our approach focuses on optimizing input feature combinations rather than just tuning model parameters.

This research employs a Genetic Algorithm (GA) for feature selection in air quality prediction models, significantly enhancing the accuracy of PM2.5 forecasts. A notable feature of our model is its capability to predict multiple time steps ahead, catering to specific forecasting needs.


Theoretical Background

Artificial Intelligence

Artificial intelligence (AI) simulates human intelligence in machines, enabling them to think and act like humans. This technology encompasses machines that demonstrate human-like traits such as learning and problem-solving. AI has gained significant attention for its accuracy and versatility across various fields, including computer vision, speech recognition, and wireless communications. Its key feature is the ability to rationalize and take actions that optimize the chances of achieving specific goals.

Artificial intelligence is categorized into two main types: narrow AI and general AI. Narrow AI refers to systems designed to perform specific tasks without explicit programming, such as Siri's speech recognition, self-driving car vision systems, and personalized recommendation engines. These systems are limited to defined tasks, hence the term "narrow." In contrast, general AI embodies the adaptable intelligence found in humans, capable of learning and executing a wide range of tasks, from haircutting to spreadsheet creation, and reasoning across various topics based on accumulated experience.

The achievements in artificial intelligence primarily arise from machine learning, which allows computer programs to learn and adapt to new data independently, without human intervention. Deep learning, a more sophisticated form of machine learning, facilitates this process by handling vast amounts of data, including text, images, and video. In the upcoming Sections 2.2 and 2.3, we will explore the subsets of artificial intelligence, focusing on machine learning and deep learning.

Deep learning overview

In the past ten years, deep learning has emerged as a leading technology in the field of artificial intelligence. As a subset of machine learning, deep learning utilizes deep artificial neural networks designed to mimic human brain functions, enabling the system to learn from a vast amount of data.

An artificial neural network is a graph-based computing system inspired by the neural networks of the human brain. It consists of many layers, which fall into three main categories: the input layer, hidden layers, and the output layer. Figure 1 shows the basic structure of an artificial neural network.

Figure 1 An example of an artificial neural network

Deep Learning excels in managing diverse problems involving large datasets and various input types. It significantly lowers the costs associated with feature engineering, which can be labor-intensive in traditional Machine Learning. By simply preparing the input data into training, testing, and evaluation datasets, Deep Learning models autonomously learn and generate results, streamlining the overall process.

Deep learning has notable disadvantages, including high training costs and the need for extensive parallel computing resources. Additionally, researchers must experiment with various parameters to achieve optimal results, as there is no universal formula or theoretical framework to guide the process.

In recent years, deep learning architectures have gained significant traction across various domains, including time series prediction, computer vision, natural language processing, and speech recognition. Specifically, in time-series analysis, models such as Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM) have demonstrated superior performance compared to traditional methods. When it comes to predicting fine dust indices for environmental quality assessment, deep learning proves to be an effective solution, especially when dealing with large datasets and the complexities of data preprocessing.

Long short-term memory

The Long Short-Term Memory (LSTM) network is a specialized type of recurrent neural network (RNN) that excels in processing sequential data. Unlike traditional neural networks, RNNs utilize the output from previous computations as input for subsequent steps, enabling them to retain and leverage previously stored information. This unique capability makes RNNs particularly advantageous for handling sequential data, allowing for more effective analysis and predictions in various applications.

In Figure 2, the input data at times \( t, t+1, t+2 \) are represented as \( x(t), x(t+1), x(t+2) \), while the corresponding hidden states are denoted as \( h(t), h(t+1), h(t+2) \), which are typically calculated using an activation function. The output data for these time points is represented as \( o(t), o(t+1), o(t+2) \). The hidden state is initialized to zero, and the weight matrices involved in the recurrent neural network (RNN) are labeled as \( W, V, \) and \( U \). The defining equation of the RNN is provided below.

Equation 1. RNN equations: \( h_t = \tanh(U x_t + W h_{t-1}) \) and \( o_t = V h_t \), where \( \tanh \) is the activation function.
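To make Equation 1 concrete, the following NumPy sketch unrolls a vanilla RNN over a short sequence; the dimensions, random initialization, and function name are illustrative assumptions, not taken from the thesis.

import numpy as np

def rnn_forward(x_seq, U, W, V):
    # Vanilla RNN: h_t = tanh(U x_t + W h_{t-1}), o_t = V h_t (Equation 1)
    h = np.zeros(W.shape[0])          # hidden state initialized to zero
    outputs = []
    for x_t in x_seq:                 # one iteration per time step t
        h = np.tanh(U @ x_t + W @ h)  # update the hidden state
        outputs.append(V @ h)         # output o_t for this time step
    return np.array(outputs), h

# Illustrative sizes: 5 time steps, 3 input features, 4 hidden units, 1 output
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
U = rng.normal(size=(4, 3))
W = rng.normal(size=(4, 4))
V = rng.normal(size=(1, 4))
outputs, h_last = rnn_forward(x_seq, U, W, V)
print(outputs.shape)  # (5, 1)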

In Recurrent Neural Networks (RNNs), training involves backpropagation through time, where the gradient is calculated at each step. This gradient is essential for updating the weights within the network.

When the influence of a previous layer on the current layer is minimal, the gradient value becomes small, leading to an even smaller gradient in subsequent layers. This phenomenon causes gradients to shrink exponentially during backpropagation, resulting in negligible weight updates. Consequently, the network struggles to learn from earlier inputs, which is known as the short-term memory problem in neural networks.

To overcome this problem, two specialized versions of the RNN were created: 1) the GRU (Gated Recurrent Unit) and 2) the LSTM (Long Short-Term Memory). LSTMs and GRUs make use of a memory cell to store the activation values of earlier elements in long sequences. This is where the concept of gates comes into the picture. Gates control the flow of information in the network; they learn which inputs in the sequence are important, store their information in the memory unit, and carry it across long sequences to make predictions.

In multistep time series forecasting, LSTM (Long Short-Term Memory) networks outperform both vanilla RNNs (Recurrent Neural Networks) and GRUs (Gated Recurrent Units). Like other recurrent networks, they use the output of the current step as input for the next and can retain previously computed information, which makes them well suited to sequential data processing. One major issue with standard RNNs is the vanishing gradient problem, which hinders the model's ability to capture long-term dependencies. LSTM effectively addresses this challenge by incorporating three specialized gates: the forget gate, update gate, and output gate, which determine what information is retained. The primary benefit of LSTM over other RNN variants is its enhanced capability to learn and maintain long-term dependencies in data.

The structure of an LSTM unit is characterized by three primary gate mechanisms: the update gate \( u_t \), the forget gate \( f_t \), and the output gate \( o_t \). An LSTM unit processes three key components: the previous cell state \( c_{t-1} \), the previous hidden state \( h_{t-1} \), and the current input data \( x_t \). The candidate cell state \( \tilde{c}_t \) is derived from the current input and the previous hidden state, serving as a potential replacement for the memory cell. Both the forget and update gates use a sigmoid function, which takes the previous hidden state and the current input as inputs, ensuring that their output values remain within a defined range.

The forget gate regulates how much previous information is discarded from the cell state \( c_{t-1} \), while the update gate influences the integration of new memory \( \tilde{c}_t \) into the updated cell state. Essentially, a value closer to 0 in the forget gate indicates greater information loss, whereas a value nearer to 1 signifies that more information is retained. Additionally, the output gate \( o_t \) determines the next hidden state. The mathematical equations governing these states are outlined below.

Equation 2. LSTM equations:
\( f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \)
\( u_t = \sigma(W_u [h_{t-1}, x_t] + b_u) \)
\( \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \)
\( o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \)
where \( W \) corresponds to the weight matrices, \( b \) is the bias coefficient, and \( \sigma \) and \( \tanh \) are activation functions.

Figure 3 Structure of the LSTM unit

The new cell state \( c_t \) is derived from the previous state \( c_{t-1} \) and the new memory candidate \( \tilde{c}_t \). This is achieved by pointwise multiplying the previous cell state \( c_{t-1} \) with the forget vector \( f_t \), which can eliminate certain values from the cell state when \( f_t \) is zero. Subsequently, the product of \( u_t \) and \( \tilde{c}_t \) is added to incorporate new values deemed important by the neural network. This results in the updated cell state \( f_t \ast c_{t-1} + u_t \ast \tilde{c}_t \).

The formula \( c_t = f_t \ast c_{t-1} + u_t \ast \tilde{c}_t \) is crucial for determining which information to retain and update in Long Short-Term Memory (LSTM) networks. The new hidden state \( h_t \) is generated by combining the output gate with the new cell state's information, \( h_t = o_t \ast \tanh(c_t) \). The new cell state and hidden state are then passed to the subsequent time step. In essence, LSTMs leverage three gates: the forget gate filters out irrelevant data from previous steps, the update gate incorporates significant information from the current step, and the output gate defines the next hidden state, enhancing the model's ability to manage long-term dependencies effectively.
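As a worked example of Equation 2 and the cell-state update above, here is a single LSTM step in NumPy; the concatenated [h_{t-1}, x_t] input convention and the weight shapes are assumptions made purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM step; W and b hold one weight matrix and bias per gate
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    u_t = sigmoid(W["u"] @ z + b["u"])      # update gate
    c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    c_t = f_t * c_prev + u_t * c_hat        # new cell state
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

# Illustrative sizes: 3 input features, 4 hidden units (so z has length 7)
rng = np.random.default_rng(1)
W = {g: rng.normal(size=(4, 7)) for g in "fuco"}
b = {g: np.zeros(4) for g in "fuco"}
h_t, c_t = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
print(h_t.shape)  # (4,)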


Encoder-Decoder model

An encoder is a neural network, such as a Fully Connected (FC), Convolutional Neural Network (CNN), or Recurrent Neural Network (RNN), that processes input data to produce a feature vector or tensor encapsulating the essential information of the input. The decoder, typically structured similarly to the encoder but oriented in the opposite direction, takes this feature vector and generates the closest match to the original input or desired output. This encoder-decoder architecture is widely utilized in various applications, including language translation and generative modeling.

Figure 4 The basic structure of the encoder-decoder model

Figure 4 shows the structure of an encoder-decoder model. The model can be constructed in three steps:

1. Encoder: It accepts a single element of the input sequence at each time step, processes it, collects information for that element, and propagates it forward.

2. Encoder vector: This is the final internal state produced from the encoder part of the model. It contains information about the entire input sequence to help the decoder make accurate predictions.

3. Decoder: Given the encoder vector, it predicts an output at each time step.

In this framework, let \( X \) represent the input data and \( Y \) denote the output data, with \( f(\cdot) \) as the encoder mapping and \( u(\cdot) \) as the decoder mapping. Mathematically, the output of the decoder can then be expressed as \( Y = u(f(X)) \).

Feature engineering

Time series prediction is crucial for implementing proactive measures to mitigate risks associated with unexpected events. Key applications include forecasting air quality, water levels, rainfall, and population trends. For accurate predictions, such as PM2.5 levels, access to relevant air quality datasets is essential.

Air quality data encompasses numerous factors beyond PM2.5, including various gas concentrations, temperature, and humidity, which may influence prediction accuracy. Properly selecting these features is crucial, as it can enhance predictive performance, while improper use may lead to decreased accuracy and increased computational time. This selection process is a key aspect of feature engineering, which transforms raw data into a more effective set of features. Effective feature engineering improves compatibility with specific prediction models and enhances overall accuracy.

Feature engineering aims to effectively represent the original dataset in a way that aligns with the chosen predictive model. The features within the dataset significantly influence the performance of the predictive model; therefore, it is essential to carefully define the structure of these features to accurately capture the characteristics of the dataset.

A lean set of features simplifies the computational complexity of models, making calculations faster and easier for users to interpret. For instance, in decision tree models, using excessive features can produce excellent results but complicates user understanding of the predictions. Feature selection addresses this issue by automatically identifying a subset of relevant features tailored to the specific problem at hand.

Attribute importance assessment is crucial for data analysts in determining which attributes to include in training data. Various methods exist to evaluate attribute importance, resulting in individual ranking scores for each attribute. Higher scores indicate greater relevance, leading to inclusion in the training process, while attributes with lower scores may be excluded.

When dealing with complex data sets containing numerous attributes, the learning process can become time-consuming and prone to inaccuracies due to irrelevant features. To address this issue, feature extraction methods are employed to reduce data dimensionality, resulting in simpler and more compact raw data that can be efficiently processed and fed into training models. By doing so, these methods significantly reduce training time and improve the overall performance of predictive models. Various attribute extraction techniques are available, and the choice of method depends on the specific problem at hand, ultimately enhancing the accuracy and efficiency of the model.

Figure 5 An example of feature extraction
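The thesis does not prescribe a particular extraction technique, so as one hedged illustration, the sketch below uses principal component analysis (PCA) from scikit-learn to compress ten synthetic attributes into four derived components; all names and sizes are assumptions.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical raw data: 1,000 samples with 10 (possibly redundant) attributes
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 10))

# Feature extraction: derive 4 compact components from the 10 attributes
pca = PCA(n_components=4)
compact = pca.fit_transform(raw)
print(compact.shape)                  # (1000, 4)
print(pca.explained_variance_ratio_)  # information retained per component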

In any predictive modeling scenario, the significance of attributes can vary, with some being essential for accurate predictions while others may be redundant and should be eliminated. Feature selection methods are crucial for identifying a relevant subset of attributes from the original set to enhance model performance. Various algorithms, including correlation evaluation techniques like Pearson's, assess the relationships between features to select the most relevant ones for the prediction model. Additionally, embedded models such as Lasso and random forest incorporate their own feature selection methods, further optimizing the selection process.

Figure 6 An example of feature selection

Creating new features is a complex task that demands significant time and effort from researchers. Unlike automated processes, feature development relies heavily on human creativity and the exploration of various scenarios. An example of this feature construction process is illustrated in Figure 7.

Figure 7. An example of feature construction: the original attribute set (Time, WIND_SPEED, WIND_DIR, TEMP, RH, BAROMETER, RADIATION, INNER_TEMP, PM10, PM1, PM2.5) is reduced to a constructed subset (Time, WIND_SPEED, TEMP, RH, BAROMETER, PM2.5).

Genetic algorithm

In nature, the survival and success of every species depend on their ability to adapt to their environment. According to Darwin's law of natural selection, species that fail to evolve quickly enough face the risk of extinction.

Chromosomes, which are structural formats of linked genes, are unique to each species and define their characteristics. As environments change, chromosome structures also adapt, ensuring that subsequent generations are more suited to their surroundings. This adaptability arises from the random exchange of information between chromosomes and the external environment.


Genetic algorithms, inspired by biological genetics, utilize terms such as hybridization, mutation, crossover, and chromosome to solve complex optimization problems across various fields, including finance and engineering. They offer significant advantages over traditional optimization methods, particularly in handling intricate problems and enabling parallel processing. However, careful consideration of the fitness function, population size, and hyper-parameters is crucial for the effective convergence of these algorithms; otherwise, the results may lack validity.

Genetic Algorithms utilize an encoding function to transform the optimization parameters into a linear sequence of bits or characters, representing chromosomes. These chromosomes undergo variation through genetic operators, while a fitness function is employed to select the most promising chromosomes, ultimately leading to an optimal solution for the given problem. The entire procedure encompasses these key steps:

1. Encoding the set of optimization parameters (cost functions).

2. Defining a fitness function to evaluate the performance of a chosen set of parameters (selection criterion).

3. Generating an initial population of candidate solutions.

4. Iterating: evaluating the individuals in the population, creating a new population using genetic operators and fitness-proportionate reproduction, and replacing the old population with the newly generated one.

5. Choosing an optimal solution and decoding the result to obtain the solution to the problem.

Figure 8 The basic structure of Genetic Algorithm

The steps of the genetic algorithm are shown in Figure 8.

Genetic algorithms have four main genetic operators: initialization, crossover, mutation, and selection. Their roles can be very different:

1. Initialization: Randomly generating a population of individuals with a predetermined size.

2. Crossover: Swapping parts of one solution with another between chromosomes or solution representations. Its main role is to mix solutions and drive convergence within a subspace.

3. Mutation: Randomly changing parts of one solution, which increases the diversity of the population and provides a mechanism for escaping a local optimum.

4. Selection: Passing solutions with high fitness on to the next generation, often carried out as some form of selection of the best solutions.

The details of these operations are presented in Sections 2.7.3 through 2.7.6.

In genetic algorithms (GAs), individuals are typically represented by binary strings of 0s and 1s, although alternative encoding methods, such as decimal, are also viable. The evolutionary process begins with a randomly generated population and progresses through successive generations. Reproduction operators select parent chromosomes from the population to undergo crossover, while each individual's fitness is assessed in every generation. The method of selecting chromosomes for reproduction can vary from random selection to fitness-biased choices. The fitness evaluation function plays a crucial role in determining each individual's quality, influencing the likelihood of a chromosome being chosen to generate new solutions in subsequent generations.

Crossover is a genetic operator used in evolutionary algorithms, facilitating the exchange of genetic material between two parent chromosomes to create two offspring. By combining the best traits of each parent, the crossover operator aims to produce superior offspring compared to their predecessors. Various types of crossover methods exist, including one-point, two-point, and uniform crossover, each designed to enhance genetic diversity and improve evolutionary outcomes.

The mutation operator plays a crucial role in preserving diversity within a population by altering the genetic makeup of individuals. It typically involves randomly selecting a position in a bit string and changing the corresponding bit. Various mutation techniques exist for binary representations, including flip-bit and interchange mutations. The flip-bit mutation modifies offspring chromosomes by toggling bits (changing 0s to 1s and vice versa) based on a randomly generated mutation pattern. In contrast, the interchange mutation rearranges genes, creating new chromosomes through permutation. These methods ensure the continuous evolution of the population across generations.

The selection process in evolutionary algorithms relies on the objective function to guide the choice of individuals from the population. Various selection methods are available, including Roulette Wheel Selection, Rank Selection, Steady State Selection, Tournament Selection, Elitism Selection, and Boltzmann Selection. While Elitism Selection guarantees the survival of the best individuals, excessive reliance on it can cause premature convergence. Therefore, incorporating alternative selection methods alongside Elitism is essential for maintaining diversity and improving algorithm performance.
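To tie the four operators together, here is a minimal, self-contained GA loop in Python on a toy maximize-the-ones fitness function; the population size, rates, and one-point crossover choice are illustrative assumptions, not the thesis's settings.

import random

def run_ga(fitness, n_genes=8, pop_size=20, generations=30, p_c=0.8, p_m=0.1):
    # Initialization: a random population of binary chromosomes
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(pop, 2)          # pick two parents
            if random.random() < p_c:             # one-point crossover
                cut = random.randrange(1, n_genes)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):                  # flip-bit mutation
                offspring.append([1 - g if random.random() < p_m else g
                                  for g in child])
        # Selection: keep the fittest pop_size individuals (elitist)
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

# Toy usage: the fittest chromosome should approach all ones
print(run_ga(fitness=sum))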

Research methods

The research methods encompass surveying the existing literature on the relevant problem, developing a deep learning model aimed at predicting environmental monitoring indicators, and evaluating the model's performance by comparing it with alternatives proposed by other researchers.

Figure 9 presents an overview of our proposed forecasting framework, OFFGED.

The Optimal Forecasting Framework integrates a GA-based feature selection method and an encoder-decoder prediction model, as detailed in Sections 3.2 and 3.3. In the feature selection module, each individual in the initial population represents a unique combination of features, and through multiple iterations, genetic operations such as crossover and mutation create new individuals. The fitness of each individual is evaluated using the Mean Absolute Error (MAE) of the prediction model's output given the selected feature combination. Over successive generations, individuals with higher fitness are retained while less fit individuals are eliminated, leading to improved fitness values. Ultimately, the individual with the highest fitness is identified. The prediction module employs an encoder-decoder architecture to forecast PM2.5 levels, utilizing the dataset of features selected by the GA-based method as input.

Figure 9 Overview of the proposed model


Proposed Forecasting Framework (OFFGED)

GA-based feature selection

In our study, we denote n as the total number of features in the dataset; each individual is represented as a binary string of length n, encoding a feature combination. Specifically, the k-th gene takes the value 1 or 0, indicating whether the k-th feature is selected. Our genetic algorithm uses a fitness value to assess the quality of each individual, defined as the mean absolute error (MAE) of the prediction model. In each generation, we generate new individuals, train the prediction model with them, and then evaluate the model to calculate the MAEs.

Figure 10. Encoding a feature combination (the white and gray cells represent selected and removed features, encoded by 1 and 0, respectively).
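The fitness evaluation might be sketched as below; since retraining the full encoder-decoder per individual is expensive, a linear model stands in for it purely to keep the example runnable, and the data arrays are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def fitness(individual, X_train, y_train, X_test, y_test):
    # individual is a binary mask over the feature columns (1 = selected)
    cols = [i for i, bit in enumerate(individual) if bit == 1]
    if not cols:
        return -np.inf                 # an empty feature set is invalid
    # Stand-in predictor; the thesis trains the E-D model here instead
    model = LinearRegression().fit(X_train[:, cols], y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test[:, cols]))
    return -mae                        # the GA maximizes fitness, so negate MAE

# Toy usage with random data and a 6-feature mask
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = X[:, 0] + rng.normal(size=200)
print(fitness([1, 0, 1, 0, 0, 1], X[:150], y[:150], X[150:], y[150:]))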

The initial population is created through a random process. The crossover operator takes two parents, A = {a_1, a_2, …, a_n} and B = {b_1, b_2, …, b_n}, and randomly selects m genes from each to exchange, generating two new offspring; here, m is a randomly chosen number ranging from 1 to n/2. Moreover, to retain good features, we propose a heuristic algorithm for selecting parents when performing the crossover, as follows. Let p_c be the crossover probability and N the population size; we then choose N×p_c of the N individuals to crossover. Half of them, the N×p_c/2 individuals with the best fitness values, are selected directly; the other parents are selected at random from the remaining N − N×p_c/2 individuals. For mutation, a gene segment from one parent is chosen at random, and the values of its genes are inverted (changing 0s to 1s and vice versa). To enhance the diversity of the population, we employ two selection methods: a best-individual selection operator, which retains the top 50%, and a random selection operator for the remaining 50%. Figure 11 illustrates the crossover and mutation processes integral to our approach.

(a) Crossover: the green and orange cells represent genes that are swapped to create two offspring.

(b) Mutation: the blue cells represent genes that are mutated.

Figure 11 Illustration of the GA’s crossover and mutation operations.
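Following the description above, the two operators might be sketched as follows; the function names are mine, and m is drawn from 1 to n/2 as stated.

import random

def crossover(parent_a, parent_b):
    # Swap m randomly chosen gene positions between the parents, m in [1, n/2]
    n = len(parent_a)
    m = random.randint(1, max(1, n // 2))
    positions = random.sample(range(n), m)
    child_a, child_b = parent_a[:], parent_b[:]
    for i in positions:
        child_a[i], child_b[i] = parent_b[i], parent_a[i]
    return child_a, child_b

def mutate(individual):
    # Invert a randomly chosen contiguous gene segment (0s become 1s and vice versa)
    n = len(individual)
    start = random.randrange(n)
    end = random.randint(start, n - 1)
    return [1 - g if start <= i <= end else g for i, g in enumerate(individual)]

# Example with 10 features (1 = selected, 0 = removed)
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
b = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
print(crossover(a, b))
print(mutate(a))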

Encoder-Decoder model-based prediction

Our model leverages data from the past l time steps to forecast PM2.5 levels h time steps ahead. The inputs to the encoder and decoder are built from the input data x, and the output of the decoder is the predicted value y. The notation is as follows:

• k ∈ {0, 1} indicates whether a feature is removed (0) or selected (1)

• i is the iteration counter, which ranges from 0 to the length of the training set

• l is the number of time steps in the input sequence

• h is the number of time steps in the output sequence

• x is the input data of the selected features

• y is the predicted PM2.5 value

We utilize an encoder-decoder architecture featuring l LSTM units in the encoder and h LSTM units in the decoder. To enhance the model's performance, we employ the Adam optimizer for automatic learning-rate optimization. We first delve into the specifics of the LSTM unit before outlining the overall structure of the encoder-decoder model.

The E-D model was first proposed to solve a natural language processing problem.

The initial input to the model is a sequence of words of length \( m \) (i.e., \( x = \{x_1, x_2, \ldots, x_m\} \)). After processing, the output sequence \( y = \{y_1, y_2, \ldots, y_n\} \) can have a length \( n \) that differs from \( m \). The model can use RNN, GRU, or LSTM architectures depending on the application. The Encoder-Decoder (E-D) model consists of an encoder and a decoder. The encoder comprises multiple recurrent units that handle input elements of arbitrary length, collecting and forwarding information to produce a hidden state and cell state known as the encoder state. This state encapsulates the information of all input elements so that the decoder, which receives the encoder state as its initial input, can make accurate predictions. The decoder's hidden state at the first time step is calculated as follows.

Equation 3. LSTM decoder equation for the first time step: \( h_1 = f(h_{enc}, \langle GO \rangle) \), where \( h_{enc} \) is the encoder state and \( \langle GO \rangle \) is the decoder's seeding value, which is zero in our model.

The decoder, like the encoder, consists of multiple recurrent units that process information sequentially. Each unit in the decoder takes the hidden state from the previous unit as input and generates an output \( y_i \) at time step \( i \). The hidden state at the \( i \)-th time step is computed as follows.

Equation 4. LSTM decoder equation: \( h_i = f(h_{i-1}, p_{i-1}) \), where \( p_{i-1} \) is the prediction result of the previous time step.

After obtaining the hidden state \( h_i \), we pass it through a regular neural network layer, called the dense layer, to obtain the final prediction, which is calculated by the following equation:

Equation 5. Prediction result of one time step: \( y_i = W_{dense} h_i + b_{dense} \).

Figure 12 Structure of the LSTM-based encoder-decoder model

The proposed E-D model, as depicted in Figure 12, features an encoder with an input length of \( l \) and a decoder with an output length of \( h \). The input data, represented as \( x_{i,k}, x_{i+1,k}, \ldots, x_{i+l-1,k} \), comprises the historical selected features spanning from time step \( i \) to \( i+l-1 \). The model predicts the PM2.5 values, denoted \( y_{i+l}, y_{i+l+1}, \ldots, y_{i+l+h-1} \), for the subsequent time steps \( i+l \) to \( i+l+h-1 \).

The encoder processes each input element through multiple LSTM units to generate a final hidden state, which is then utilized by the decoder, also comprising LSTM units, to forecast PM2.5 levels over a specified number of time steps.
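A compact PyTorch sketch of such an encoder-decoder follows; the hidden size, single-layer LSTMs, zero ⟨GO⟩ seed, and class name are assumptions for illustration, and the thesis's exact architecture and hyperparameters may differ.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    # Reads l past steps of the selected features, emits h future PM2.5 values
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(1, hidden_size, batch_first=True)
        self.dense = nn.Linear(hidden_size, 1)   # dense layer of Equation 5

    def forward(self, x, horizon):
        _, state = self.encoder(x)               # encoder state summarizes the input
        y_prev = torch.zeros(x.size(0), 1, 1, device=x.device)  # <GO> seed (zero)
        outputs = []
        for _ in range(horizon):                 # feed each prediction back in
            out, state = self.decoder(y_prev, state)
            y_prev = self.dense(out)             # PM2.5 prediction for this step
            outputs.append(y_prev)
        return torch.cat(outputs, dim=1).squeeze(-1)

# Illustrative shapes: batch of 2, l = 48 input steps, 5 selected features, h = 6
model = EncoderDecoder(n_features=5)
y_hat = model(torch.randn(2, 48, 5), horizon=6)
print(y_hat.shape)  # torch.Size([2, 6])

In training, such a model would be fitted with the Adam optimizer (torch.optim.Adam) against an MAE-style loss such as nn.L1Loss, consistent with the MAE fitness used by the GA.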

New training strategy LTS2

The fitness of each individual is evaluated based on the Mean Absolute Error (MAE) produced by the encoder-decoder model, as illustrated in Figure 9. Consequently, the overall training process takes a significant amount of time, up to 7 days. To address this issue, a novel training mechanism, LTS2 (Lightweight Time-Saving Training Strategy), is proposed.

Given a training dataset of length L, the testing data remains untouched to maintain objectivity during evaluation. The idea is to treat each individual as a worker assigned a specific task: each individual trains on a subset of the data with its selected features. However, dividing the dataset among individuals may affect the model's learning. We define n as the ratio of the length l_s of a sub-dataset to the total training data length, expressed as n = l_s / L × 100%; consequently, n can be at most 100%.

In this study, we define a set of sub-datasets, denoted S = {s_1, s_2, …, s_n}, as illustrated in Figure 13. Rather than utilizing the complete training dataset, each individual uses only n% of the data. This approach significantly reduces the time required to reach optimal fitness.

Figure 13 Notation of the proposed GA-based training mechanism

However, if we split the training data into fixed sub-datasets, the sub-datasets are reused across generations, and the relationship between the time steps at the junction of sub-datasets s_i and s_{i+1} remains unlearned, as the information at the end of s_i never connects with the beginning of s_{i+1}. To address this gap, four distinct training methods are proposed.

We introduce two boolean variables, fixed and shuffling, which determine the strategy employed. When fixed is set to true, the dataset is partitioned into fixed sub-datasets for each generation, as illustrated in Figure 14. After each generation, we compute the fitness of all individuals based on one of these sub-datasets.

When fixed is set to false, a sub-dataset of length l_s is generated from the original dataset at a random position. This process, depicted in Figure 15, shows an alternative way of deriving sub-datasets: after each generation, new sub-datasets are created from the training dataset using different starting points, referred to as pivots.

The variable shuffling controls whether all individuals in a generation share one sub-dataset: if shuffling is false, the fitness of every individual is calculated on the same sub-dataset; if it is true, each individual is evaluated on a randomly chosen sub-dataset.

Sharing the same sub-dataset yields an objective comparison among individuals, while shuffling exposes the population to more diverse data and accelerates the overall selection process. This trade-off is illustrated in Figures 16 and 17.

Below is an illustration of our training mechanism in pseudocode.

Algorithm 1 GA-based feature selection

Algorithm 2 GA-based training strategy

Algorithm 1: GA-based feature selection

Result: an optimal combination of features in a short amount of time

    Initialize a population Pop of individuals, each a different combination of features;
    Calculate the fitness of each individual using the E-D model;
    Initialize the maximum number of generations gen_max, the crossover probability p_c, and the mutation probability p_m;
    while gen < gen_max do
        Train a model with each individual in Pop following the GA-based training mechanism described in Algorithm 2;

Algorithm 2: GA-based training mechanism

    if fixed = true and shuffling = false then
        Split the whole training dataset S into fixed sub-datasets S = {s_1, s_2, ..., s_n};
        Select a sub-dataset s_i at random, i ∈ {1, ..., n};
        for each individual in the population do
            Assign the individual to s_i;
    else if fixed = true and shuffling = true then
        Split the whole training dataset S into fixed sub-datasets S = {s_1, s_2, ..., s_n};
        for each individual in the population do
            Select a sub-dataset s_i at random, i ∈ {1, ..., n};
            Assign the individual to s_i;
    else if fixed = false and shuffling = false then
        Select a pivot at random, yielding a sub-dataset s_i spanning (pivot, pivot + l_s);
        for each individual in the population do
            Assign the individual to s_i;
    else if fixed = false and shuffling = true then
        for each individual in the population do
            Select a pivot at random, yielding a sub-dataset s_i spanning (pivot, pivot + l_s);
            Assign the individual to s_i;
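In Python, the four branches of Algorithm 2 might be sketched as a single function; the function and variable names are my own, l_s is the sub-dataset length, and individuals are represented by simple labels.

import random

def assign_subdatasets(train_len, l_s, population, fixed, shuffling):
    # Returns a (start, end) training window for each individual (Algorithm 2)
    def fixed_start():
        # choose among fixed, non-overlapping sub-datasets s_1, ..., s_n
        return random.randrange(0, train_len - l_s + 1, l_s)

    def pivot_start():
        # choose a random pivot, yielding a window (pivot, pivot + l_s)
        return random.randint(0, train_len - l_s)

    pick = fixed_start if fixed else pivot_start
    if shuffling:
        # a (possibly different) sub-dataset per individual
        return {ind: (s, s + l_s) for ind in population for s in [pick()]}
    s = pick()
    return {ind: (s, s + l_s) for ind in population}  # same window for everyone

# Example: 10,000 training steps, 10% sub-datasets, four individuals
windows = assign_subdatasets(10_000, 1_000, ["i1", "i2", "i3", "i4"],
                             fixed=False, shuffling=True)
print(windows)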


Performance Evaluation

Dataset and evaluation settings

To ease the presentation, we name our proposed model ED-LSTM. This section aims to answer two questions related to prediction accuracy:

1 How much does the feature selection algorithm improve the results compared to other selection methods?

2 How much does our proposed prediction model gain compared to state-of-the-art models?

In our study, we first integrated ED-LSTM with several feature selection algorithms and subsequently evaluated their performance against our proposed GA-based algorithm on two datasets. The first, the Hanoi dataset, comprises hourly air quality data collected in Hanoi, Vietnam, from January 2016 to January 2018 for the features mentioned in Section II. The second, the Taiwan dataset, contains hourly data collected from January 2014 to September 2017 [11]. Both datasets include PM2.5 and other indicators such as time, ambient temperature, CO, NO, NO2, NOx, O3, PM10, RH, and SO2. Both datasets also contain missing data points; the number and percentage of missing values for each indicator are detailed in Table 1. Data imputation is necessary to compensate for the missing data. The missing rates are relatively small and do not bias any specific parameters in ED-LSTM; hence, we use the median value to fill in the missing entries. To further confirm this method's effectiveness, we compared it to the imputation used in [11] (the original one for the Taiwan dataset). With such low missing rates, the imputation method does not impact the models' performance. Therefore, throughout this paper, we present results obtained with median-value imputation.

In addressing the second question, we compare the ED-LSTM model with three established approaches: the AE-BiLSTM, which integrates auto-encoder and Bi-LSTM neural networks; the AC-LSTM, which combines CNN and LSTM networks; and the ST-DNN, which utilizes both spatial and temporal relationships for PM2.5 prediction.

Table 1. Details of missing data in the datasets (columns: Dataset, Indicator, Total observations, Missing, Missing rate).

Table 2 presents the hyperparameters used in the ED-LSTM model. Batch size is the number of training samples processed by the network at once, while test size indicates the proportion of the dataset allocated for testing. The parameters l and h correspond to the numbers of input and output time steps, respectively. An epoch is a complete pass through the entire training dataset. The early stopping value serves as a threshold for ending training to avoid overfitting. We adopt the standard approach of optimizing the hyperparameters of deep learning models by experimenting with different parameter sets and selecting the most effective one.
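As one common realization of such a threshold, the patience-based loop below stops training once the validation MAE has not improved for a given number of epochs; the function names and the choice to validate on MAE are assumptions consistent with the paper's metric.

def train_with_early_stopping(run_one_epoch, max_epochs, patience):
    # run_one_epoch trains for one epoch and returns the validation MAE
    best_mae, stale_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        val_mae = run_one_epoch(epoch)
        if val_mae < best_mae:
            best_mae, stale_epochs = val_mae, 0   # improvement: reset the counter
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                             # early stopping triggered
    return best_mae

# Toy usage: validation MAE stops improving after the fourth epoch
history = [5.0, 4.2, 3.9, 3.7, 3.8, 3.9, 4.0, 4.1]
print(train_with_early_stopping(lambda e: history[e], max_epochs=8, patience=2))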

Impact of the GA’s number of generations

This experiment investigates the effect of varying the number of generations in the GA-based feature selection algorithm, focusing on prediction accuracy and training time. The results indicate that increasing the number of generations generally reduces the mean absolute error (MAE) but also lengthens training time. Specifically, a significant drop in MAE occurs when the number of generations increases from 1 to 3, stabilizing thereafter. Conversely, training time increases markedly over the same range, with a slower rate of increase from 4 to 7 generations and a substantial rise beyond that. Therefore, a moderate range of 5 to 7 generations is recommended to balance high prediction accuracy with acceptable training time.

Figure 18 Impact of the number of generations.

Comparing feature selection algorithms

We compare the GA-based feature selection method against several approaches, including using only PM2.5 data, using all available features, XGBoost, and Pearson's correlation. The XGBoost package exposes a feature_importances_ attribute that ranks features by their importance coefficients. We sort these features in descending order and systematically eliminate those with low importance from the original list. The remaining features are then used to train the prediction model, and we assess the model's performance using the Mean Absolute Error (MAE).

Pearson's correlation quantifies the strength of the linear relationship between two features, with values ranging from −1 to 1. We focus on features with an absolute correlation value exceeding 0.3, indicating a moderate to strong correlation with PM2.5. Additionally, we exclude features with an absolute correlation greater than 0.9 to mitigate multicollinearity, which can hinder the model's ability to identify significant independent variables.
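A hedged sketch of these two baselines follows; the file name and column labels are hypothetical, the thresholds are those stated above, and the ranking comes from XGBoost's feature_importances_ attribute.

import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("air_quality.csv")            # hypothetical dataset with a PM2.5 column
X, y = df.drop(columns=["PM2.5"]), df["PM2.5"]

# XGBoost: rank features by importance, then drop the low-importance tail
ranked = sorted(zip(X.columns, XGBRegressor().fit(X, y).feature_importances_),
                key=lambda kv: kv[1], reverse=True)
print(ranked)

# Pearson filter: keep features with 0.3 < |r| <= 0.9 against PM2.5
corr = df.corr(method="pearson")["PM2.5"].drop("PM2.5")
selected = corr[(corr.abs() > 0.3) & (corr.abs() <= 0.9)].index.tolist()
print(selected)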

Figure 19 Comparison of feature selection algorithms

The MAE values for one-step-ahead predictions, as shown in Figure 19, indicate that our proposed GA-based method outperforms the other feature selection techniques on both the Taiwan and Hanoi datasets. Specifically, the optimal feature combination for the Hanoi dataset comprises wind speed, temperature, radiation, PM10, and PM2.5. Notably, using all available features led to a higher MAE than using only PM2.5, suggesting that more input features do not necessarily yield better results. Our GA-based method achieves 6% and 16% reductions in MAE compared to predictions using only PM2.5 and all features, respectively, and it outperforms XGBoost and Pearson's correlation by 90% and 83%. For the Taiwan dataset, the GA-based method reduces the MAE by 4%, 8%, 3%, and 4% compared to only PM2.5, all features, XGBoost, and Pearson's correlation, respectively.

In our analysis of PM2.5 predictions, we compared the performance of the GA-based feature selection method against using all available features with the Hanoi dataset. As illustrated in Figure 20, the predictions using GA-selected features consistently captured peak values more accurately. This further validates the effectiveness of the GA-based feature selection approach in enhancing prediction accuracy.

Figure 20. Comparison of GA-based feature selection and using all the features for the Hanoi dataset; panels (a) through (h) show predictions one to eight time steps ahead.

Comparing prediction models

This section presents a comparison of our ED-LSTM model with AE-BiLSTM, AC-LSTM, and ST-DNN. Because the source code of these models is unavailable, we reimplemented AE-BiLSTM and AC-LSTM following the guidelines provided in their respective papers. However, we could not replicate ST-DNN because of insufficient details, so we directly extract the numerical results from the original ST-DNN paper. We conducted two experiments to facilitate this comparison.

ED-LSTM, AE-BiLSTM, and AC-LSTM are evaluated on the Hanoi dataset, while the Taiwan dataset is employed for a fair comparison between ED-LSTM and ST-DNN, following the feature recommendations of the ST-DNN paper. The data is partitioned into a training set covering January 2014 to September 2016 and a testing set from October 2016 to September 2017, maintaining a 2:1 ratio. The experiments use historical data from the past 48 hours to predict PM2.5 values for the next 1 to 6 hours.

4.4.1 Comparing ED-LSTM, AE-BiLSTM, and AC-LSTM

We assess the performance of ED-LSTM, AE-BiLSTM, and AC-LSTM on the Hanoi dataset, presenting the Mean Absolute Error (MAE) and computation time in Tables 3 and 4. The average computation time of each model, defined as the time taken to make a single prediction, is measured as the interval between feeding in the data and receiving the prediction result. Table 3 shows the outcomes when all features are used as input, while Table 4 shows the results obtained with the features selected by our GA-based algorithm.

Table 3. ED-LSTM, AE-BiLSTM, and AC-LSTM using all features (Hanoi dataset); columns: h, ED-LSTM, AE-BiLSTM, AC-LSTM.

Table 4. ED-LSTM, AE-BiLSTM, and AC-LSTM using the features selected by GA (Hanoi dataset); columns: h, ED-LSTM, AE-BiLSTM, AC-LSTM.

The ED-LSTM model demonstrates superior prediction accuracy across both feature sets, achieving a 46% reduction in Mean Absolute Error (MAE) compared to AE-BiLSTM and 6.4% compared to AC-LSTM when utilizing all features. With the features selected by the GA-based algorithm, ED-LSTM further decreases the MAE by 53.7% and 20.1% on average against AE-BiLSTM and AC-LSTM, respectively. While AE-BiLSTM records the shortest computation time, ED-LSTM and AC-LSTM show comparable performance, with ED-LSTM's computation time being slightly higher but still acceptable for real-time PM2.5 prediction tasks.

Our analysis compares the two types of feature input across the three models, revealing that the MAE of each model is lower when using the features selected by the GA-based algorithm rather than all available features. Specifically, the ED-LSTM model achieves an average MAE reduction of 14.7%, with a maximum improvement of 18.3%. Meanwhile, the AE-BiLSTM and AC-LSTM models show average MAE reductions of 21.3% and 12.4%, respectively, when employing the features selected by the GA-based method. These findings demonstrate the effectiveness of the GA-based feature selection not only for the ED-LSTM model but also for other predictive models.

Figure 21. Comparison between models using the Hanoi dataset with all features; panels (a) through (f) show predictions one to six time steps ahead (series: Groundtruth, AC-LSTM, AE-BiLSTM, ED-LSTM).

Figure 22. Comparison between models using the Hanoi dataset with the features selected by GA; panels (a) through (f) show predictions one to six time steps ahead (series: Groundtruth, AC-LSTM, AE-BiLSTM, ED-LSTM).


Figure 23 MAE of the proposed model with different output lengths

In our analysis of the Hanoi dataset, we visualize the prediction outcomes for different time-step values, as shown in Figures 21 and 22. The results demonstrate that our proposed ED-LSTM model significantly outperforms the AE-BiLSTM model, with predictions closely aligning with the ground truth. Furthermore, the ED-LSTM model captures more peaks than the AC-LSTM model. A comparison of Figures 21 and 22 reveals that predictions using the features selected through the Genetic Algorithm (GA) are more accurate than those based on all features. This finding supports the data presented in Tables 3 and 4, where the MAE with GA-selected features consistently outperforms that of the complete attribute set.

To further validate the proposed model, we used the features selected by our GA-based algorithm with the encoder-decoder framework to forecast PM2.5 up to one month ahead. Hourly data were averaged into daily mean values; the input length was fixed at 70 days while the output length varied from 1 to 31 days to assess its impact on accuracy. As Figure 23 shows, the model maintains low and stable MAEs across all output lengths, with the last three test cases yielding the lowest values.
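
The preprocessing behind this experiment can be sketched as follows, assuming the hourly observations sit in a pandas DataFrame with a datetime index and a pm25 column (both names are illustrative): hourly values are averaged into daily means, and sliding windows pair a fixed 70-day input with an output of the desired length.

import numpy as np

def make_daily_windows(df, input_len=70, output_len=31, target="pm25"):
    # df is a pandas DataFrame indexed by timestamp. Average hourly rows
    # into daily means, then build (input, output) pairs with a fixed
    # input length and a configurable output length.
    daily = df.resample("D").mean()
    values = daily[target].to_numpy()
    xs, ys = [], []
    for start in range(len(values) - input_len - output_len + 1):
        xs.append(values[start:start + input_len])
        ys.append(values[start + input_len:start + input_len + output_len])
    return np.array(xs), np.array(ys)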

4.4.2 Comparing ED-LSTM and ST-DNN

This section compares the performance of our proposed model with that of the ST-DNN model. Since we were unable to reproduce the ST-DNN results, we took the results reported in Fig. 23 of the ST-DNN paper [11] and list them in the third column. The collected results are summarized in Table 5, where ST-DNN (A+L+C) denotes the best-performing combination of adaptive artificial neural networks (A), long short-term memory (L), and convolutional neural networks (C).

Our proposed model outperforms ST-DNN in all experimental settings; the performance gap ranges from 14.82% to 41.89% and tends to widen as the number of output time steps increases.

Table 5 Comparing the MAE of the proposed ED-LSTM model and the ST-DNN model (using the features proposed by [11])

Table 5 and Figure 24 illustrate the effectiveness of our prediction model; Figure 24 contains six subfigures showing predictions for one to six output time steps, each plotted against the ground truth. When the number of output time steps is small, as in Figures 24(a) and (b), the predictions align closely with the actual data and capture even the peak points. As the forecast horizon extends in the subsequent subfigures, prediction accuracy gradually decreases relative to Figure 24(a).

The ED-LSTM model demonstrates superior prediction accuracy compared to existing models, primarily due to the integration of a GA-based feature selection algorithm that identifies the optimal feature combination and the effective synergy of the encoder-decoder architecture with LSTM networks, which enhances the extraction of meaningful information from inputs Although ED-LSTM exhibits a slightly longer computation time than the fastest model, it remains adequate for real-time PM2.5 predictions.
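
For reference, the kind of encoder-decoder LSTM discussed here can be sketched in Keras as below; the layer sizes and the RepeatVector-based decoder are illustrative choices under stated assumptions, not the exact architecture used in our experiments.

from tensorflow.keras import layers, models

def build_ed_lstm(input_len, num_features, output_len, units=64):
    # The encoder compresses the input window into a fixed-size context
    # vector; the decoder unrolls that vector for output_len steps, which
    # is what decouples the output length from the input length.
    inputs = layers.Input(shape=(input_len, num_features))
    context = layers.LSTM(units)(inputs)                           # encoder
    repeated = layers.RepeatVector(output_len)(context)
    decoded = layers.LSTM(units, return_sequences=True)(repeated)  # decoder
    outputs = layers.TimeDistributed(layers.Dense(1))(decoded)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")
    return model

This decoupling is what allows the one-month-ahead experiment above to vary the output from 1 to 31 days without changing the encoder.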

Figure 24 Comparison between models using the Taiwan dataset with features selected by GA. Panels (a)-(f) show predictions for one to six output time steps, each plotted against the ground truth.

LTS2 evaluation

In this experiment, I evaluate the new training strategy (LTS2) against the traditional training method. Each training configuration is labeled u%_fixed_shuffling, where u is the percentage of the training data used and fixed and shuffling are the strategy's two boolean options. The label 100_any_any denotes the original training scenario in which the entire training set is used; in that case the values of fixed and shuffling (true or false) do not influence the outcome. Table 6 summarizes the hyperparameters of the new training strategy.
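
One plausible realization of such a configuration is sketched below; reading fixed as "reuse the same subset every epoch" and shuffling as "shuffle the subset's order" is my interpretation of the label format, not a verbatim excerpt of the LTS2 implementation.

import numpy as np

def select_training_subset(x_train, y_train, u=10, fixed=True, shuffling=False,
                           rng=np.random.default_rng(0)):
    # Return the training pairs used in one epoch. With u=100 everything
    # is used, so fixed/shuffling no longer matter (the 100_any_any case).
    n = len(x_train)
    k = max(1, n * u // 100)
    if fixed:
        idx = np.arange(k)                          # same block every epoch
    else:
        idx = rng.choice(n, size=k, replace=False)  # resample each epoch
    if shuffling:
        rng.shuffle(idx)
    return x_train[idx], y_train[idx]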

Table 6 Hyperparameters of training strategy

The results in Table 7 show that the new training strategy substantially reduces training time compared with the original method. Accuracy decreases only slightly: even in the 10_true_false case, the MAE differs by only about 0.05 from the 3.592 obtained in the 100_any_any case, while training time is reduced almost 25-fold.

Table 7 Results of the new training strategy for different cases

Table 8 shows that, even with the new training strategy, the GA-selected feature combination still achieves higher accuracy than related works, close to the accuracy obtained when training on the complete dataset.

Table 8 Comparing the proposed method combined with the new training strategy against related works. Columns: h | ED-LSTM | ED-LSTM with new training strategy (LTS2).

Thus, the proposed training strategy reduces the method's training time by a factor of 10 to 25 while keeping the MAE at least 10% lower than that of related works.

Discussion

Our GA-based feature selection method identifies {wind speed, temperature, radiation, PM10} as the optimal feature combination for predicting PM2.5, indicating that these factors significantly influence PM2.5 levels. To verify this, we assessed the correlation between PM2.5 and these features using Spearman's correlation (SC), the Pearson correlation (PC), and the mutual information score (MIC). The results in Table 9 show that wind speed, temperature, radiation, and PM10 all correlate strongly with PM2.5, as evidenced by the high absolute values of their SC, PC, and MIC scores.
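
These scores can be computed directly with SciPy and scikit-learn, as in the sketch below; sklearn's mutual_info_regression is used here as a stand-in for the mutual information score, and the feature names are illustrative.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression

def correlation_report(features, pm25):
    # features maps a feature name (e.g. "wind speed") to a 1-D array
    # aligned with the PM2.5 series; print SC, PC, and MIC per feature.
    for name, x in features.items():
        sc, _ = spearmanr(x, pm25)
        pc, _ = pearsonr(x, pm25)
        mic = mutual_info_regression(np.asarray(x).reshape(-1, 1), pm25)[0]
        print(f"{name}: SC={sc:.3f}, PC={pc:.3f}, MIC={mic:.3f}")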

Our findings align with previous studies. In particular, the study in [22] showed that wind speed correlates negatively with PM2.5 when it is below 3 m/s and positively above this threshold. The authors of [23] found that PM2.5 is significantly influenced by wind speed and temperature. The work in [24] demonstrated that rising temperature and decreasing radiation increase PM2.5 levels. Finally, [25]-[27] established that PM10 is strongly related to PM2.5.

In summary, our GA-based feature selection method has selected the optimal feature combination for predicting PM2.5. The optimal combination comprises features that have a considerable effect on PM2.5 concentrations.

The new training strategy LTS2 demonstrates comparable accuracy to traditional training methods while significantly enhancing training speed, achieving improvements of 10 to 25 times This allows for a practical tradeoff between accuracy and training efficiency in real-world applications Furthermore, with proper hyperparameter adjustments, LTS2 can yield even better results than training with the entire dataset.

Conclusion

This paper has introduced a novel PM2.5 prediction model that combines PM2.5 data with other air quality features. The model couples a genetic algorithm (GA) for feature selection with an encoder-decoder long short-term memory (ED-LSTM) prediction framework, which identifies near-optimal feature combinations and accommodates arbitrary input and output lengths. Experimental results show that our approach improves the MAE by up to 53.7% over existing PM2.5 prediction models, with the GA-based feature selection alone boosting accuracy by at least 13.7%. In addition, we proposed a training mechanism that reduces training time by a factor of ten while preserving high accuracy.

Although this research has developed a highly accurate prediction model for the fine dust index, several challenges remain and offer opportunities for further improvement.

First, the current work uses data from Hanoi only and therefore neglects the geographical and spatial relationships among regions. To enhance the model's applicability and accuracy, data should be collected from multiple cities and geographical factors incorporated into the analysis.

Second, incomplete data caused by unreliable sensors significantly affects both learning and prediction. I plan to develop effective techniques for filling in missing data that improve on conventional methods such as averaging.

Published papers

[1] Minh Hieu Nguyen, Phi Le Nguyen, Kien Nguyen, Van An Le, Thanh-Hung Nguyen, Yusheng Ji, ‘‘PM2.5 Prediction Using Genetic Algorithm-based Feature Selection and Encoder-Decoder Model,’’ IEEE Access, vol. 9, pp. 57338–57350, 2021.

References

[1] C. A. Pope, III, R. T. Burnett, M. J. Thun, E. E. Calle, D. Krewski, K. Ito, and G. D. Thurston, ‘‘Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution,’’ JAMA, vol. 287, no. 9, pp. 1132–1141, 2002.

[2] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[3] I. Kok, M. U. Simsek, and S. Ozdemir, ‘‘A deep learning model for air quality prediction in smart cities,’’ in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2017, pp. 1983–1990.

[4] Y.-T. Tsai, Y.-R. Zeng, and Y.-S. Chang, ‘‘Air pollution forecasting using RNN with LSTM,’’ in Proc. IEEE 16th Int. Conf. Dependable, Autonomic and Secure Computing (DASC), Aug. 2018, pp. 1074–1079.

[5] Y. Zhou, F.-J. Chang, L.-C. Chang, I.-F. Kao, and Y.-S. Wang, ‘‘Explore a deep learning multi-output neural network for regional multi-step-ahead air quality forecasts,’’ J. Cleaner Prod., vol. 209, pp. 134–145, Feb. 2019.

[6] Y. Qi, Q. Li, H. Karimian, and D. Liu, ‘‘A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory,’’ Sci. Total Environ., vol. 664, pp. 1–10, May 2019.

[7] C. Wen, S. Liu, X. Yao, L. Peng, X. Li, Y. Hu, and T. Chi, ‘‘A novel spatiotemporal convolutional long short-term neural network for air pollution prediction,’’ Sci. Total Environ., vol. 654, pp. 1091–1099, Mar. 2019.

[8] J. Ma, Z. Li, J. C. P. Cheng, Y. Ding, C. Lin, and Z. Xu, ‘‘Air quality prediction at new stations using spatially transferred bi-directional long short-term memory network,’’ Sci. Total Environ., vol. 705, Feb. 2020, Art. no. 135771.

[9] J. Wang, P. Du, Y. Hao, X. Ma, T. Niu, and W. Yang, ‘‘An innovative hybrid model based on outlier detection and correction algorithm and heuristic intelligent optimization algorithm for daily air quality index forecasting,’’ J. Environ. Manage., vol. 255, Feb. 2020, Art. no. 109855.

[10] Z. Qi, T. Wang, G. Song, W. Hu, X. Li, and Z. Zhang, ‘‘Deep air learning: Interpolation, prediction, and feature analysis of fine-grained air quality,’’ IEEE Trans. Knowl. Data Eng., vol. 30, no. 12, pp. 2285–2297, Dec. 2018.

[11] P.-W. Soh, J.-W. Chang, and J.-W. Huang, ‘‘Adaptive deep learning-based air quality prediction model using the most relevant spatial-temporal relations,’’ IEEE Access, vol. 6, pp. 38186–38199, 2018.

[12] Hanoi Dataset. Accessed: Nov. 2020. [Online]. Available: https://bit.ly/hanoi-pm

[13] M. Z. Joharestani, C. Cao, X. Ni, B. Bashir, and S. Talebiesfandarani, ‘‘PM2.5 prediction based on random forest, XGBoost, and deep learning using multisource remote sensing data,’’ Atmosphere, vol. 10, no. 7, p. 373, Jul. 2019.

[14] Y. Lin, N. Mago, Y. Gao, Y. Li, Y.-Y. Chiang, C. Shahabi, and J. L. Ambite, ‘‘Exploiting spatiotemporal patterns for accurate air quality forecasting using deep learning,’’ in Proc. 26th ACM SIGSPATIAL Int. Conf. Adv. Geographic Inf. Syst., Nov. 2018, pp. 359–368.

[15] J. Biesiada and W. Duch, ‘‘Feature selection for high-dimensional data—A Pearson redundancy based filter,’’ in Computer Recognition Systems 2, vol. 45. Berlin, Germany: Springer, 2007, pp. 242–249.

[16] D. Panda, R. Ray, A. A. Abdullah, and S. R. Dash, ‘‘Predictive systems: Role of feature selection in prediction of heart disease,’’ J. Phys., Conf. Ser., vol. 1372, Nov. 2019, Art. no. 012074.

[17] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, ‘‘Variable selection using random forests,’’ Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, Oct. 2010.

[18] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’ 2014, arXiv:1412.6980. [Online]. Available: https://arxiv.org/abs/1412.6980

[19] I. Sutskever, O. Vinyals, and Q. V. Le, ‘‘Sequence to sequence learning with neural networks,’’ 2014, arXiv:1409.3215. [Online]. Available: https://arxiv.org/abs/1409.3215

[20] B. Zhang, H. Zhang, G. Zhao, and J. Lian, ‘‘Constructing a PM2.5 concentration prediction model by combining auto-encoder with Bi-LSTM neural networks,’’ Environ. Model. Softw., vol. 124, Feb. 2020, Art. no. 104600.

[21] S. Li, G. Xie, J. Ren, L. Guo, Y. Yang, and X. Xu, ‘‘Urban PM2.5 concentration prediction via attention-based CNN–LSTM,’’ Appl. Sci., vol. 10, no. 6, p. 1953, Mar. 2020.

[22] J. Wang and S. Ogawa, ‘‘Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan,’’ Int. J. Environ. Res. Public Health, vol. 12, no. 8, pp. 9089–9101, 2015.

[23] P. D. Hien, V. T. Bac, H. C. Tham, D. D. Nhan, and L. D. Vinh, ‘‘Influence of meteorological conditions on PM2.5 and PM2.5–10 concentrations during the monsoon season in Hanoi, Vietnam,’’ Atmos. Environ., vol. 36, no. 21, pp. 3473–3484, Jul. 2002.

[24] M. Préndez, M. Egido, C. Tomás, J. Seco, A. Calvo, and H. Romero, ‘‘Correlation between solar radiation and total suspended particulate matter in Santiago, Chile—Preliminary results,’’ Atmos. Environ., vol. 29, no. 13, pp. 1543–1551, Jul. 1995.

[25] X. Zhou, Z. Cao, Y. Ma, L. Wang, R. Wu, and W. Wang, ‘‘Concentrations, correlations and chemical species of PM2.5/PM10 based on published data in China: Potential implications for the revised particulate standard,’’ Chemosphere, vol. 144, pp. 518–526, Feb. 2016.

[26] E. Maraziotis and N. Marazioti, ‘‘Statistical analysis of inhalable (PM10) and fine particles (PM2.5) concentrations in urban region of Patras, Greece,’’ Global Nest J., vol. 10, no. 2, pp. 123–131, 2008.

[27] D. Zhao, H. Chen, E. Yu, and T. Luo, ‘‘PM2.5/PM10 ratios in eight economic regions and their relationship with meteorology in China,’’ Adv. Meteorol., vol. 2019, May 2019, Art. no. 5295726.

[28] B. Liu, S. Yan, J. Li, G. Qu, Y. Li, J. Lang, and R. Gu, ‘‘A sequence-to-sequence air quality predictor based on the n-step recurrent prediction,’’ IEEE Access, vol. 7, pp. 43331–43345, 2019.

[29] L. Yan, Y. Wu, L. Yan, and M. Zhou, ‘‘Encoder-decoder model for forecast of PM2.5 concentration per hour,’’ in Proc. 1st Int. Cognit. Cities Conf. (IC3), Aug. 2019.

[30] Y. Zhang, Q. Lv, D. Gao, S. Shen, R. Dick, M. Hannigan, and Q. Liu, ‘‘Multi-group encoder-decoder networks to fuse heterogeneous data for next-day air quality prediction,’’ in Proc. 28th Int. Joint Conf. Artif. Intell. (IJCAI), Aug. 2019.

[31] Y. Zhang, P. Li, and X. Wang, ‘‘Intrusion detection for IoT based on improved genetic algorithm and deep belief network,’’ IEEE Access, vol. 7, pp. 31711–31722, 2019.

[32] S. Bouktif, A. Fiaz, A. Ouni, and M. Serhani, ‘‘Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches,’’ Energies, vol. 11, no. 7, p. 1636, 2018.
