Introduction
Motivation
Visually impaired individuals face daily challenges in navigation, prompting the development of specialized devices to enhance their independence. These innovations are categorized into three main approaches: sensor-based, computer vision-based, and smartphone-based methods. Sensor-based solutions utilize various sensors, such as ultrasonic, infrared, and laser, to detect obstacles. In contrast, computer vision-based methods employ cameras and advanced algorithms to analyze and identify barriers in the environment. Finally, smartphone-based approaches leverage smartphone cameras and sensors to gather real-world data, which is then processed to detect obstacles effectively.
However, most of the existing studies focus on navigation and obstacle avoidance, with less attention paid to context awareness and recognition of surrounding objects [3, 7].
Recent studies have primarily conducted experiments on servers or laptops, resulting in limited experimental areas and navigation capabilities. Object detection schemes have typically been employed for identifying objects in current scenes, leading to simplistic descriptions of the surrounding environment. To address these challenges, this study develops a multifunctional embedded system utilizing deep learning to enhance cognitive assistance for visually impaired individuals. The motivations behind this research are elaborated in detail.
Face recognition has gained immense popularity in computer vision due to its critical role in various applications like security systems, video surveillance, and human-computer interaction. Recent advancements in deep learning have significantly enhanced face recognition techniques, particularly through the use of convolutional neural network (CNN) architectures. This progress has led to the development of cutting-edge face recognition methods, including VGGFace, FaceNet, and ArcFace.
Face recognition technology is primarily utilized in monitoring and security systems, particularly in video surveillance. With the rise of IoT technology, there has been significant advancement in IoT-based healthcare systems, particularly for assisting visually impaired individuals. Research in this area has focused on two main approaches: cloud-based and computer-based methods. Cloud-based systems typically consist of a server and a local unit, where the local unit collects input images and communicates with the server for image recognition. However, these methods often require more processing time and depend on internet connectivity, limiting their accessibility in certain areas.
In the computer-based methods [60, 64], the processing unit is a laptop computer. The laptop serves as the central processing unit in these systems, handling tasks like gathering and processing input images and exporting results. However, its weight makes it cumbersome for visually impaired individuals to carry during movement and navigation. Consequently, there is significant potential for enhancing computer-based methods to better accommodate their needs.
(2) Gender, age and emotion classification issues
Gender classification plays a crucial role across various applications, including human-computer interaction (HCI), surveillance systems, commercial development, demographic research, and entertainment. In HCI, it enables robots to identify users' gender and tailor services accordingly. In surveillance, gender classification enhances the effectiveness of intelligent systems. In the commercial sector, it aids market research and informs business decisions. Additionally, it facilitates the efficient collection of demographic data in research contexts. In entertainment, gender classification helps design targeted game content and customize applications. Although its use in the medical field, particularly for visually impaired individuals, has been limited, this area is gaining attention and presents opportunities for future research.
Age classification plays a crucial role in various fields, including intelligent surveillance, human-computer interaction, commercial development, social media analysis, and demographic research. It is particularly valuable for enhancing security measures, such as preventing minors from accessing adult content, purchasing age-restricted items, and consuming alcohol. Additionally, store management systems utilize age classification to tailor the shopping experience according to customers' age groups and genders. This allows store managers to effectively address diverse customer preferences, track market trends, and customize products and services to better meet public demands [69]. However, in order to further assist visually impaired people, age classification is still expected to have wider applications.
Facial expressions serve as powerful, natural, and universal indicators of emotional states and intentions. Given their practical significance, numerous studies have concentrated on the classification of facial emotions, which is essential for various applications such as human-computer interaction (HCI), robotics, monitoring driver fatigue, and conducting psychological assessments.
Facial emotion recognition systems in computer vision and machine learning aim to decode emotional expressions from facial representations, focusing on six basic human emotions: anger, disgust, fear, happiness, sadness, and surprise. These emotions form the foundation for classifying human emotional states. Recent research has explored emotion classification to assist visually impaired individuals; however, these studies often overlook practical applicability and flexibility. Consequently, there is a need for an effective system that considers gender, age, and various emotional classifications.
Object detection has emerged as a vital technique in computer vision, finding applications in human-computer interaction, security monitoring, video surveillance, and autonomous vehicles. The increasing demand for precise object detection has led researchers to develop various deep learning methods, including Faster R-CNN, SSD, YOLOv3, RetinaNet, and Mask R-CNN.
In the medical field, various studies have applied object detection to help visually impaired people navigate independently and perceive the surrounding environment and objects.
Recent studies have demonstrated significant advancements in assistive systems for visually impaired individuals by integrating multiple sensors and cameras to enhance performance. For instance, Joshi et al. proposed a system that combines a camera with a distance sensor, enabling functionalities such as object detection, text recognition, and obstacle distance measurement. However, the reliance on laptop computers for these systems limits user accessibility. While Raspberry Pi has been utilized to address this limitation, it has resulted in reduced system performance. This highlights the urgent need for a more flexible and efficient solution that effectively implements object detection techniques to better assist visually impaired users.
Overview of Research
This research focuses on developing a multifunctional embedded system to assist visually impaired individuals, utilizing advanced deep learning techniques such as face recognition, gender classification, age estimation, emotion recognition, and object detection. The system is designed to perform multiple tasks, categorized into three primary functions, but controlling the system presents challenges for visually impaired users due to its reliance on various deep learning models.
This study proposes an efficient function selection process for visually impaired individuals, utilizing a remote controller as an input device. Key features include face recognition and emotion classification, which rely on critical dataset collection from videos, images, and standard datasets, supported by two algorithms for image collection and pre-processing. Notably, the system allows for the enrollment of new individuals without retraining the entire model. Additionally, gender and age classification functions enhance user experience by providing vital information about detected strangers, fostering confidence and social inclusion for visually impaired users. An object detection function is also integrated to help users recognize their surroundings, with pre-trained models tested for optimal performance. The results are organized into an object order table to simplify complex detections. Finally, a prototype built on the Jetson AGX Xavier embedded system showcases the feasibility of the proposed functions, including face recognition and real-time image processing.
Dissertation Organization
This research develops a multifunctional embedded system utilizing deep learning for face recognition; gender, age, and emotion classification; and object detection. The dissertation is organized as follows: Chapter 2 reviews relevant literature and technological applications; Chapter 3 outlines the system design and architecture; Chapter 4 details the face recognition function; Chapter 5 covers gender, age, and emotion classification; Chapter 6 describes the object detection function; Chapter 7 presents the prototype implementation; and Chapter 8 showcases the experimental results. Finally, Chapter 9 concludes the study and explores future research directions.
Related Work
Face Recognition
Recent advancements in face recognition have gained significant attention due to their critical role in computer vision applications, driven by the development of deep learning and CNN architectures that enhance methods like VGGFace, FaceNet, and ArcFace. Concurrently, IoT technology has emerged as a focal point for researchers, particularly in IoT-based healthcare systems, leading to innovative facial recognition applications for visually impaired individuals. Aza et al. proposed a real-time face recognition system designed to assist visually impaired users, utilizing the Local Binary Pattern Histogram (LBPH) algorithm for face recognition. This system leverages smartphone video capture but is limited to recognizing one face per frame and requires input images to be converted to binary or grayscale for effective processing with the LBPH algorithm.
Cloud-based technologies have greatly enhanced system performance, particularly in applications for visually impaired individuals. Chaudhry and Chandra developed a mobile face recognition system that operates on a smartphone, utilizing a server for support. The smartphone handles essential functions such as face detection and recognition, while the server aids in enrolling and identifying new individuals using a Cascade classifier and the LBPH algorithm. Similarly, Chen et al. introduced a smart wearable system that leverages cloud capabilities to recognize faces, objects, and text, thereby assisting visually impaired users in navigating their environment. This system comprises two main components: a local unit for collecting images and communicating with the cloud server, and the cloud server, which processes images and returns results. However, a notable limitation of these systems is their reliance on cloud connectivity, restricting functionality to areas with internet access.
Computer-based methods significantly improve the practicality and adaptability of face recognition technology. A real-time system named DEEP-SEE FACE, developed by Mocanu et al. [60], assists visually impaired individuals with cognition, interaction, and communication. This innovative system combines deep convolutional neural networks (CNNs) and computer vision algorithms, enabling users to detect, track, and recognize multiple individuals in their vicinity.
The traditional laptop-based system for assisting visually impaired individuals in face recognition faced challenges due to its weight, making mobility difficult. To address this issue, Neto et al. proposed a wearable system that utilizes a Kinect sensor to capture RGB-Depth images, integrating an RGB camera and an IR emitter. This innovative approach employs various techniques, including histogram of oriented gradients (HOG), principal component analysis (PCA), and k-nearest neighbor (kNN) algorithms for effective face recognition. However, the reliance on Kinect's IR sensor limits the system's usability in outdoor environments.
Gender, Age and Emotion Classification
Gender classification can be approached through various methods, including ear-based, fingerprint-based, iris-based, voice-based, and face-based techniques. Recent advancements in CNN architectures have made face-based gender classification a prominent research focus. For instance, Arriaga et al. developed a real-time CNN model, mini-Xception, which utilizes primary layers such as convolution, ReLU activation, and residual depth-wise separable convolution to classify both emotion and gender. Liew et al. introduced a simpler CNN model comprising three convolutional layers and one output layer, employing cross-correlation to minimize computational demands. Yang et al. enhanced the original SSR-Net, initially designed for age classification, by integrating a gender classification function in a 2-stream model featuring heterogeneous streams of convolution, batch normalization, and pooling layers. Dhomne et al. utilized the VGGNet architecture to create a CNN model that recognizes gender through automatic feature extraction from face images, eliminating the need for HOG and SVM techniques. Additionally, Khan et al. proposed a framework that segments face images into six components using a CRF-based model, followed by a probabilistic classification strategy to generate probability maps for gender classification.
Age classification, alongside gender classification, has garnered significant attention in recent research. Many studies have explored the combined effects of age and gender, or age and emotion classifications, relying on facial features. Levi and Hassner proposed a deep CNN model tailored for age and gender classification, structured similarly to AlexNet but streamlined to five primary layers: three convolutional layers and two fully-connected layers. Their model also incorporates essential components like max-pooling, normalization, dropout layers, ReLU activation functions, and filters. Similarly, Agbo-Ajala and Viriri introduced a face-based classification model for age and gender, also based on AlexNet, featuring convolutional layers, ReLU activation, batch normalization, max-pooling, dropout, and fully-connected layers. This model was pre-trained and fine-tuned using the comprehensive IMDb-WIKI dataset, employing advanced face detection and alignment techniques to enhance classification accuracy.
Recent advancements in age estimation have leveraged deep convolutional neural networks (CNNs) and generative adversarial networks (GANs). A notable method introduced by [63] utilizes a GAN to reconstruct high-resolution facial images from low-resolution inputs, employing VGGNet for age evaluation. Similarly, Liao et al. [47] developed a CNN model that integrates a divide-and-rule strategy for robust feature extraction, utilizing a deep CNN based on the GoogLeNet architecture. Furthermore, Zhang et al. [88] proposed a novel residual network of residual networks (RoR) architecture, grounded in ResNet, to classify age groups and gender, enhancing performance through mechanisms that consider the characteristics of different age groups.
Face emotion classification has garnered significant attention in recent literature, with various innovative models proposed to enhance accuracy. Hu et al. introduced the Supervised Scoring Ensemble (SSE), a deep CNN model that incorporates auxiliary blocks and three supervised layers to improve model precision. Cai et al. developed a novel island loss function for CNNs, aimed at increasing the pairwise distances between different class centers while reducing intra-class differences, thereby enhancing classification accuracy. Bargal et al. presented a CNN model utilizing an ensemble of VGG13, VGG16, and ResNet, which concatenates features from multiple networks into a single vector for emotion classification. Zhang et al. proposed an evolutional spatial-temporal network that employs multitask networks, utilizing a multi-signal convolutional neural network (MSCNN) for spatial feature extraction and a part-based hierarchical bidirectional recurrent neural network (PHRNN) for analyzing temporal features, significantly boosting performance. Liu et al. designed an AU-aware deep network (AUDN) based on cascaded networks, combining modules for overcomplete representation, AU-aware receptive field processing, and multilayer restricted Boltzmann machines (RBM) to learn hierarchical features for facial expression recognition.
Object Detection
Object detection is a prominent research area in computer vision, particularly beneficial for smart healthcare systems aimed at assisting visually impaired individuals. Various systems have been developed to enhance communication and social inclusion for these users. For instance, Tian et al. proposed a portable computer system that utilizes object detection and text recognition to help visually impaired people navigate by identifying key objects like doors and elevators. Similarly, Ko and Kim developed a smartphone-based wayfinding system that relies on QR code detection, though its application is limited to environments that provide QR codes. Mekhalfi et al. introduced a prototype featuring lightweight hardware components, including a camera and an IMU, to aid in navigation and object recognition. Long et al. designed a framework using millimeter-wave radar and RGB-Depth sensors for obstacle detection, employing Mask R-CNN and SSD networks to enhance performance. Khade and Dandawate created a compact, wearable system on Raspberry Pi for obstacle tracking, while Joshi et al. utilized YOLOv3 and distance sensors to provide audio feedback for obstacle avoidance. Lastly, Tapu et al. implemented an automatic cognition system on a laptop, leveraging computer vision algorithms and deep CNNs for navigation assistance; the detector of [70] is applied to detect objects, and the system sends a warning to the user through a headphone when obstacles are detected.
Smart Healthcare
IoT-based techniques are increasingly recognized as a valuable solution in healthcare, particularly for assisting visually impaired individuals. Research has focused on developing supportive systems through three main approaches: sensor-based, computer vision-based, and smartphone-based methods. Sensor-based methods utilize various sensors, such as ultrasonic, infrared, laser, and distance sensors, to detect obstacles. Notably, Katzschmann et al. introduced the ALVU device, which combines a sensor belt with an array of distance sensors and a haptic strap for feedback, enabling safe navigation for visually impaired users. Additionally, Nada et al. developed a smart stick equipped with infrared sensors that can detect obstacles within two meters, offering a cost-effective, lightweight, and user-friendly solution that provides audio alerts. Furthermore, Capi designed an intelligent robotic system that assists visually impaired individuals in navigating unfamiliar indoor spaces, utilizing a laptop alongside a camera, speaker, and laser range finder, and operating in both assisting and guiding modes.
Computer vision methods assist visually impaired individuals by capturing their surroundings through a camera and employing algorithms to detect obstacles. Kang et al. introduced the deformable grid (DG) obstacle detection method, which adapts its grid shape based on the motion of nearby objects, enhancing collision risk recognition through the degree of deformation. This innovative approach significantly boosts system accuracy. Additionally, Yang et al. proposed a deep learning framework to further advance obstacle detection for the visually impaired. Their framework enhances the perception of the surrounding environment through an efficient semantic segmentation scheme, offering crucial terrain awareness by identifying traversable areas, sidewalks, stairs, and water hazards, while also ensuring the effective avoidance of obstacles, pedestrians, and vehicles.
Smartphone-based systems serve as comprehensive solutions for various tasks, including data collection, processing, and decision-making. An embedded system developed by Tanveer et al. assists visually impaired individuals by utilizing voice commands to detect obstacles and facilitate voice calls. This system integrates GPS technology to track user locations, with data managed on a server and displayed through Android applications. Additionally, Cheraghi et al. created GuideBeacon, a wayfinding system that employs Bluetooth beacons within a designated area to enhance navigation for visually impaired users, allowing them to navigate quickly and efficiently.
System Overview
System Architecture
The system overview, depicted in Figure 3.1, features the NVIDIA Jetson AGX Xavier as its central module, interfacing with peripheral devices such as a webcam, speaker, and Bluetooth audio transmitter. This setup facilitates various functions, including image collection, processing, and system control, initiated via a remote controller. The primary functions encompass face recognition and emotion classification, age and gender classification, and object detection. The first function identifies faces and emotions, providing names and emotional states, while also offering details about unfamiliar individuals, including their gender and age. The second function generates descriptions of gender, age, and emotion, and the third function identifies objects within the image, detailing their types and quantities. Finally, the system converts the collected results into voice output, which is relayed to the user through the speaker.
Function Selection
This section presents the function selection. Subsection 3.2.1 discusses the remote controller technique, and Subsection 3.2.2 describes the function selection process.
The proposed multifunctional system is designed to enhance the cognitive support for visually impaired individuals, featuring three easily selectable functions. Users can choose their preferred function through various methods, including voice commands, computer vision, and remote control techniques. The remote controller method has been selected for its popularity and user-friendly interface, allowing for straightforward system control. This remote controller acts as a keyboard that connects to the central processing module, the Jetson AGX Xavier, to transmit control signals effectively.
Table 3.1 Pseudocode of the Key Code Testing
1 print("Please press any key ")
2 image = numpy.zeros([512,512,1], dtype=numpy.uint8)
3 while True:
4     cv2.imshow("The input key code testing (press Esc key to exit)", image)
5     key_input = cv2.waitKey(1) & 0xFF
6     if (key_input != 0xFF):    # a key was pressed
7         print(key_input)       # output the detected key code
8         if (key_input == 27):  # press Esc key to exit
9             break
To effectively utilize the remote controller within the system, it is essential to recognize the key codes associated with the keyboard. While the American Standard Code for Information Interchange (ASCII) provides a clear reference for key codes on a standard computer keyboard, determining these codes on various keyboards can prove to be quite challenging.
To address this issue, a program has been developed to capture the input key code, as detailed in Table 3.1. The algorithm used in the program is straightforward, allowing the key code to be retrieved each time a key on the keyboard is pressed.
The proposed system aims to enhance the cognitive abilities of visually impaired individuals by providing an intuitive remote controller that facilitates quick recognition of function keys. Utilizing a Logitech remote controller, as illustrated in Figure 3.2, the function keys are programmed with specific key codes: 85, 86, and 46 for function keys 1, 2, and 3, respectively, as determined with the key code testing program in Table 3.1. Users can easily select a function by pressing the designated function key, which sends the corresponding key code to the central processing module (Jetson AGX Xavier) for processing. This design allows users to seamlessly switch between functions, ensuring an efficient and user-friendly experience.
Figure 3.2 Function Key of Logitech Remote Controller
The system operates with three primary functions: face recognition and emotion classification, age and gender classification, and object detection. Effective function selection is crucial as it significantly impacts the system's efficiency. Upon logging into the operating system, the Jetson AGX Xavier automatically initiates the function selection process, which sets the program's initial parameters.
The system provides a voice notification to indicate that it is ready for function selection. It then tests the input key code and compares it with the predefined function key codes to identify the chosen function.
When the user activates the first function by entering the key code 85, the system performs face recognition and emotion classification, labeling any detected stranger as "Unknown." The output includes the names and emotions of identified individuals. By entering the key code 86, the user initiates the second function, which also begins with face recognition and emotion classification; however, if a stranger is detected, the system will further analyze and provide details such as gender and age, resulting in a comprehensive output of gender, age, and emotion. Lastly, selecting the third function with key code 46 activates object detection, allowing the system to identify and quantify various objects within an image, with the output detailing the types and quantities of these objects.
Users can effortlessly toggle between the three functions using the function keys. The system adjusts its operation based on the input key code provided, executing the designated function accordingly. This cycle continues until the system is powered down.
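As a minimal sketch of this selection loop (assuming an OpenCV window is already open as in Table 3.1, and using hypothetical placeholder handlers for the three functions), the dispatch could look like the following:

    import cv2

    FUNCTION_KEYS = {85: "face recognition and emotion classification",
                     86: "gender, age and emotion classification",
                     46: "object detection"}

    def run_function(name):
        # Placeholder for the real handlers described in Chapters 4-6.
        print("Running:", name)

    while True:
        key_input = cv2.waitKey(1) & 0xFF      # key code sent by the remote controller
        if key_input == 27:                    # Esc key shuts the selection loop down
            break
        if key_input in FUNCTION_KEYS:         # a valid function key was pressed
            run_function(FUNCTION_KEYS[key_input])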
NVIDIA Jetson AGX Xavier
This section presents the NVIDIA Jetson AGX Xavier. Subsection 3.3.1 introduces an overview of the NVIDIA Jetson family, and Subsection 3.3.2 provides the technical specification of the NVIDIA Jetson AGX Xavier.
NVIDIA Jetson is the leading embedded AI computing platform, offering complete System-on-Module (SOM) devices that integrate a CPU, GPU, power management circuitry, DRAM, and flash storage. The Jetson platform features compact modules designed for GPU-accelerated parallel processing, along with the JetPack software development kit (SDK) that includes essential developer tools and extensive libraries for AI application development. These systems deliver high-performance, low-power computing capabilities, making them ideal for deep learning and computer vision in the creation of autonomous machine software.
NVIDIA's Jetson products deliver advanced AI edge computing solutions tailored for embedded applications across diverse sectors, including medical, transportation, factory automation, retail, surveillance, and gaming. The Jetson family features a range of modules, such as the Jetson Nano, Jetson TX1, Jetson TX2 series, Jetson Xavier NX, and Jetson AGX Xavier. With these offerings, NVIDIA has established itself as the gold standard in AI edge computing technology.
The Jetson Nano module is a compact AI computer, measuring just 70 mm x 45 mm, that enables a variety of embedded IoT applications, including surveillance video and home robotics. The Jetson TX1, recognized as the first supercomputer on a module, excels in delivering high performance and power efficiency for advanced visual computing tasks. The Jetson TX2 series, which includes the TX2, TX2i, and TX2 4GB versions, offers remarkable speed and efficiency in a small form factor of 50 mm x 87 mm, making it ideal for deep learning applications. The latest addition, Jetson Xavier NX, provides high performance with low power consumption, supporting cloud-native technologies for developers to create sophisticated software-defined features for embedded and edge devices. Lastly, the Jetson AGX Xavier is tailored for autonomous machines, featuring six onboard engines for accelerated sensor data processing, delivering the necessary performance and efficiency for fully autonomous operations.
3.3.2 Technical Specification of NVIDIA Jetson AGX Xavier
The NVIDIA Jetson family comprises a range of embedded computing boards, including the Jetson Nano, TX1, TX2 series, Xavier NX, and AGX Xavier modules. This study focuses on the Jetson AGX Xavier module due to its superior performance, offering up to twice the performance of the Xavier NX, twenty times that of the TX2, and forty times that of the Nano. The Jetson AGX Xavier developer kit, illustrated in Figure 3.4, facilitates the creation and deployment of comprehensive AI robotics applications across various sectors such as manufacturing, delivery, retail, and smart cities. It is backed by the NVIDIA JetPack and DeepStream SDKs, along with essential software libraries like CUDA, cuDNN, and TensorRT, providing all necessary tools for AI edge computing. The block diagram of the Jetson AGX Xavier module is depicted in Figure 3.5.
Figure 3.4 Jetson AGX Xavier Developer Kit
Figure 3.5 Block Diagram of Jetson AGX Xavier Modules [97]
Jetson AGX Xavier includes more than 750 Gbps of high-speed input/output (I/O).
This embedded device offers exceptional bandwidth for streaming sensors and high-speed peripherals, being one of the first to support PCIe Gen 4 with 16 lanes across five PCIe Gen connections.
The Jetson AGX Xavier modules support the connection of up to four controllers and allow for simultaneous camera connections through stream aggregation with 36 virtual channels. Additionally, the module features high-speed I/O options, including three USB 3.1 ports, SLVS-EC, UFS, and RGMII for Gigabit Ethernet, as outlined in Table 3.2.
Table 3.2 Technical Specification of Jetson AGX Xavier Modules [97]
1 CPU 8-core NVIDIA Carmel 64-bit ARMv8.2 @ 2265MHz
2 GPU 512-core NVIDIA Volta @ 1377MHz with 64 TensorCores
3 DL Dual NVIDIA Deep Learning Accelerators (DLAs)
4 Memory 16GB 256-bit LPDDR4x @ 2133MHz | 137GB/s
6 Vision (2x) 7-way VLIW Vision Accelerator
7 Video Encoder Maximum throughput up to (2x) 1000MP/s – H.265 Main
8 Video Decoder Maximum throughput up to (2x) 1500MP/s – H.265 Main
9 Camera (16x) MIPI CSI-2 lanes, (8x) SLVS-EC lanes; up to 6 active sensor streams and 36 virtual channels
10 Display (3x) eDP 1.4/ DP 1.2/ HDMI 2.0 @ 4Kp60
11 Ethernet 10/100/1000 BASE-T Ethernet + MAC + RGMII interface
14 CAN Dual CAN bus controller
15 Misc I/Os UART, SPI, I2C, I2S, GPIOs
16 Socket 699-pin board-to-board connector, 100x87mm with 16mm Z-height
Face Recognition Function
Overview of Face Recognition Function
The face recognition function begins with the collection of datasets from three distinct sources, followed by pre-processing steps that include the removal of blurry images and face alignment using a multi-task cascaded convolutional neural network (MTCNN). These refined datasets serve as the training data for our convolutional neural network (CNN) model. We evaluate three face recognition models (VGGFace, FaceNet, and ArcFace), focusing on their efficiency to determine the most suitable option for our study. The chosen CNN model is then employed for face recognition, displaying the name of the recognized individual and storing it in the result description. If the system detects a new person, it identifies them as "Unknown."
Figure 4.1 Face Recognition Function Overview
Dataset Collection
Dataset collection is crucial for enhancing model quality, with this study utilizing videos, images, and standard datasets. Initially, videos featuring specific individuals are gathered, enabling the system to perform face detection and extract high-quality facial images from each frame. This method offers exceptional image quality, vast resources, and minimal noise. Next, images are sourced from Google, though preprocessing is time-consuming due to the prevalence of noisy images. Lastly, standard datasets are obtained online, simplifying comparisons with results from other studies. Additionally, the MTCNN technique can be applied to generate clearer facial images.
After collecting videos, we employ three methods for frame extraction: period time per frame, number of frames per video, and key-frame extraction. The period time per frame method extracts images at regular intervals, while the number of frames per video method calculates the extraction period by dividing the total frames by the desired number of output frames. Although these first two methods, detailed in Table 4.1, are straightforward and efficient, they risk omitting critical frames due to a lack of content consideration. In contrast, key-frame extraction addresses this issue by focusing on significant frames, although it demands more computational resources.
Table 4.1 Pseudocode of the Splitting Video
Input: The input video (inputVideo), period time per frame (n), the number of frames per video (m)
1 cap = cv2.VideoCapture(inputVideo)
2 sumFrames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
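Only the opening lines of Table 4.1 are reproduced above. As a hedged sketch of the two simple strategies (not the author's exact routine), treating n as an interval in frames and m as the desired number of output frames, the splitting step could be implemented as follows:

    import cv2

    def split_video(input_video, n=None, m=None):
        # Extract either one frame every n frames (period method) or m evenly
        # spaced frames over the whole video (frames-per-video method).
        cap = cv2.VideoCapture(input_video)
        sum_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        step = n if n is not None else max(sum_frames // m, 1)
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                frames.append(frame)
            index += 1
        cap.release()
        return frames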
After collecting the dataset, pre-processing is essential for creating the final datasets, which involves two key steps. The first step is identifying and removing blurry images, utilizing the variance of Laplacian method proposed by Pech-Pacheco et al. to assess blurriness. An image is classified as blurry if its variance does not exceed a defined threshold, set at 100 in this study. For instance, Figures 4.2(a) and 4.2(b) have variance values of 1,812 and 8, respectively, leading to Figure 4.2(a) being deemed clear and suitable for further processing, while Figure 4.2(b) is identified as blurry and excluded. The second step involves face alignment using the MTCNN technique.
Table 4.2 Pseudocode of Dataset Collection Scheme
Input: The consecutive video frames, total images per class (number_face), threshold of blurry image (threshold_blur)
Output: Face dataset (face_dataset)
1 define number_face and threshold_blur
2 current_frame = 0, face_dataset = empty list
3 initialize parameters of MTCNN scheme
4 for each image in the consecutive video frames:
5     read image
6     convert image to grayscale
7     calculate the variance of laplacian (var_laplacian) for image
8     if var_laplacian > threshold_blur:
9         face_image = MTCNN scheme detects face
10        save face_image in face_dataset
11        current_frame = current_frame + 1
12    if current_frame > number_face:
13        break
14 return face_dataset
The MTCNN method is employed for face alignment, which helps to focus and crop faces in images, thereby improving model accuracy, as detailed in Table 4.2. This proposed approach integrates the elimination of blurry images with face alignment for enhanced performance.
Table 4.2 outlines the pseudocode for the dataset collection scheme, beginning with the definition of all parameters in Lines 1-4. The number_face parameter represents the total number of images per class, while the threshold_blur is a predetermined value used to determine if an image is blurry. The current_frame parameter tracks the number of collected images, and the dataset collection process concludes when current_frame reaches number_face. Additionally, the parameters for the MTCNN scheme are initialized during this process.
For Lines (5-7), each input image is read, and the corresponding variance of Laplacian is calculated.
Lines (8-11) indicate that MTCNN is used to conduct face alignment if the image is not blurry.
Lines (12-14) show that the program will stop and return the face dataset when the dataset collection is enough.
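As an illustrative sketch of this combined blur check and alignment step (using the facenet-pytorch MTCNN implementation as an assumed stand-in for the MTCNN scheme, and the blur threshold of 100 mentioned above), one frame could be processed as follows:

    import cv2
    from facenet_pytorch import MTCNN   # assumed MTCNN implementation

    mtcnn = MTCNN(image_size=160)       # returns aligned 160x160 face crops
    THRESHOLD_BLUR = 100                # variance-of-Laplacian threshold

    def collect_face(frame_bgr):
        # Reject blurry frames first, then return an aligned face crop (or None).
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        var_laplacian = cv2.Laplacian(gray, cv2.CV_64F).var()
        if var_laplacian <= THRESHOLD_BLUR:
            return None                 # image considered blurry, skip it
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        return mtcnn(rgb)               # aligned face tensor, or None if no face found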
Model Architectures
In recent years, face recognition architectures have been extensively studied, leading to the evaluation of three different models to determine the most efficient one for this research. The VGGFace network, introduced by Parkhi et al., is based on the deep architectures developed by Simonyan and Zisserman. This model features convolutional layers, max-pooling layers, fully-connected layers, and a softmax layer, with five max-pooling layers designed to reduce input size. The architecture includes three fully-connected layers with output dimensions of 4,096 for the first two layers and 2,622 for the final layer. Additionally, the network employs a triplet loss function, which aims to minimize the distance between an anchor and a positive sample while maximizing the distance between the anchor and a negative sample, ensuring effective learning of facial embeddings.
The ArcFace model, introduced by Deng et al., emphasizes the importance of discriminative power in feature learning for deep CNN-based face recognition. To enhance feature discrimination, they proposed an additive angular margin loss (ArcFace), which is mathematically defined in Equation 2, where m is the angular margin, s is the feature scale, and θ_j denotes the angle between the j-th class weight vector and the feature.
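In its standard form, this additive angular margin loss over a batch of N samples can be written as

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}},$$

where y_i is the ground-truth class of the i-th sample and n is the number of classes.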
The study explores the implementation of the ArcFace model using various network architectures, specifically focusing on ResNet-50, introduced by He et al. This architecture is notable for its deeper bottleneck design, consisting of three convolutional layers: two 1×1 layers for dimension reduction and restoration, and a 3×3 layer serving as a bottleneck with reduced input/output dimensions. The findings demonstrate that the ArcFace model achieves state-of-the-art performance in its applications.
The FaceNet model, developed by Schroff et al., utilizes an end-to-end learning architecture based on the Inception-ResNet-v1 framework, which effectively captures essential facial features to generate embeddings. This architecture includes modules such as Inception-A, Inception-B, and Inception-C, with shortcut connections enhancing the depth of ResNet. The model efficiently encodes images into feature vectors, which are then processed through a triplet loss function for face recognition. This triplet loss function enables the FaceNet model to learn both the similarities within the same class and the dissimilarities between different classes, facilitating accurate facial recognition.
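In its standard form, the triplet loss over anchor, positive, and negative embeddings f(x^a), f(x^p), and f(x^n), with margin α, is

$$L = \sum_{i=1}^{N} \max\left( \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha,\ 0 \right).$$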
Figure 4.3 The Inception-ResNet-v1 Network [77]
Enrolling a New Person
The face recognition system identifies users and provides their names, while labeling any new individuals as "Unknown" if they are not in the database. This innovative approach allows for the registration of new individuals without the need to retrain the entire model. The system generates unique embeddings for new users, as shown in Figure 4.4, enabling seamless updates to the existing database of known faces.
Figure 4.4 Illustration for New Person Registration
For deep learning inference, a face embedding is generated from the test image using a trained model, which is then compared to other embeddings in the database. To optimize inference time, various algorithms such as support vector machines (SVM) and k-nearest neighbor (k-NN) are employed. Our model utilizes machine learning techniques, with different face embeddings serving as input. The SVM works by calculating a hyperplane in N-dimensional space to classify the data points effectively.
The support vector machine (SVM) is employed to train a multi-class classifier using diverse face embeddings as input, with each class representing an individual. This classifier's primary function is to categorize new embeddings by comparing them against support vectors, allowing for the identification of the corresponding class for each new face. This method not only enhances classification accuracy but also significantly lowers computational costs.
Figure 4.5 The Process of New Person Registration (dataset collection for the new person, followed by training a multi-class classifier using SVM)
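A minimal sketch of this enrollment step follows, assuming the face embeddings have already been produced by the chosen CNN model; scikit-learn's SVC is used here as one possible classifier, and the probability threshold for the "Unknown" label is an illustrative assumption:

    import numpy as np
    from sklearn.svm import SVC

    def train_classifier(embeddings, labels):
        # embeddings: (num_faces, embedding_dim) array from the face model
        # labels: one name per embedding; enrolling a new person only means
        # appending their embeddings and retraining this lightweight classifier.
        clf = SVC(kernel="linear", probability=True)
        clf.fit(embeddings, labels)
        return clf

    def identify(clf, embedding, threshold=0.5):
        # Return the predicted name, or "Unknown" when confidence is low.
        probs = clf.predict_proba(embedding.reshape(1, -1))[0]
        best = int(np.argmax(probs))
        return clf.classes_[best] if probs[best] >= threshold else "Unknown"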
Gender, Age and Emotion Classification Function
Overview of Gender, Age and Emotion Classification Function
Figure 5.1 Overview of Gender, Age and Emotion Classification Function
The second function of the system enhances the detection capabilities by providing additional information such as gender and age when a stranger is identified. While it operates similarly to the first function, the inclusion of gender and age classification results in a longer processing time per image. As illustrated in Figure 5.1, the process begins with capturing an input image from the user's current scene, followed by the application of pre-trained models to classify gender, age, and emotion. The final results are then compiled and saved in the function's result description.
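As a hedged sketch of how the result description differs between the first and second functions (the phrasing and example values below are illustrative assumptions, not the exact implementation):

    def describe_person(name, emotion, gender=None, age=None):
        # First function: known people are reported with name and emotion.
        # Second function: a stranger ("Unknown") additionally gets gender and age.
        if name != "Unknown":
            return f"{name} looks {emotion}"
        return f"An unknown {gender}, about {age} years old, looks {emotion}"

    print(describe_person("Alice", "happy"))
    print(describe_person("Unknown", "neutral", gender="male", age=30))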
Gender Classification Schemes
Recent advancements in gender classification using CNN methods have been introduced by various researchers. Liew et al. developed a compact CNN model designed to classify gender from facial images, featuring a simple architecture that integrates convolutional and subsampling layers. The network consists of three convolutional layers (C1, C2, and C3) along with one output layer (F4), and it processes 2D face images sized at 32 × 32 pixels. To optimize performance, cross-correlation is employed as a substitute for traditional convolution operations, while the training utilizes a second-order backpropagation algorithm combined with the stochastic diagonal Levenberg–Marquardt (SDLM) algorithm.
Duan et al. developed a hybrid model that combines convolutional neural networks (CNN) and the extreme learning machine (ELM) to classify age and gender. This innovative network consists of two main components: feature extraction and classification. The CNN is utilized for feature extraction, comprising three convolutional layers, two contrast normalization layers, and two max-pooling layers, arranged alternately. Following this, a fully connected layer transforms the feature maps into vectors, which serve as input for the ELM in the classification process. The model's effectiveness is significantly enhanced by the implementation of forward-propagation and back-propagation operations within this hybrid architecture.
The compact soft stagewise regression network (SSR-Net) proposed by Yang et al. introduces an enhanced model for age and gender classification. This 2-stream architecture features heterogeneous streams, each utilizing basic blocks that include 3x3 convolution, batch normalization, non-linear activation, and 2x2 pooling layers. Stream 1 employs ReLU activation and average pooling, while Stream 2 uses Tanh activation and max pooling. This strategic variation between the streams significantly boosts the model's performance.
Figure 5.2 SSR-Net Structure with Three Stages (K=3) [86]
Age Classification Schemes
In the field of computer vision, deep learning significantly improves tasks like age classification. Shang and Ai introduced a novel deep neural network called Cluster-CNN for this purpose. The process begins with face normalization using a landmark detector, which crops the face to a standard scale based on the distance between the eyes. The normalized face is then fed into the Cluster-CNN to extract features, which are grouped using a k-means++ algorithm. The network is subsequently retrained on each group to select a branch with a learnable cluster module, ultimately leading to age prediction.
Hu et al. [30] introduced an advanced deep CNN model aimed at enhancing age estimation accuracy. This model utilizes age-labeled images along with year-labeled pairs, where each pair consists of two images of the same individual. To assess age differences, the Kullback-Leibler divergence is employed. Furthermore, the model incorporates adaptive entropy loss and cross-entropy loss for each image, ensuring that the distribution achieves a single peak value. Three distinct loss functions are strategically designed atop the softmax layer to effectively capture the representation of age differences.
Levi and Hassner introduced a straightforward convolutional neural network (CNN) for age classification, comprising three convolutional layers and two fully-connected layers. Each convolutional layer utilizes a ReLU activation function and is succeeded by a max-pooling layer, with the first two convolutional layers also incorporating a local response normalization layer. The architecture features 96 filters of 7×7 pixels in the first layer, 256 filters of 5×5 pixels in the second layer, and 384 filters of 3×3 pixels in the third layer. The network concludes with two fully-connected layers containing 512 neurons each, followed by a ReLU activation and a dropout layer.
Figure 5.3 Illustration of the CNN Architecture for Age Classification [44]
Emotion Classification Schemes
Facial expressions serve as powerful and universal indicators of human emotions and intentions. In the realm of computer vision, face emotion classification has gained significant attention. Jain et al. introduced a deep convolutional neural network (CNN) model for this purpose, featuring six convolution layers complemented by three max-pooling layers. The architecture also includes two deep residual learning blocks, each containing four convolution layers of varying sizes. The network concludes with two fully-connected layers, employing a ReLU activation function and a dropout layer to enhance performance. Detailed specifications of the proposed model are outlined in Table 5.1.
Figure 5.4 The CNN Model for Emotion Classification [33]
Table 5.1 Details of the Network for Emotion Classification [33]
Type Filter Size/ Stride Output Size
Jaiswal et al. [34] introduced a CNN architecture designed for emotion classification, featuring two parallel sub-models that utilize identical kernel sizes. Each sub-model comprises four layer types: convolutional, local contrast normalization, max-pooling, and flatten layers. By processing the same input image through both sub-models, the model extracts high-quality features, which are then flattened into vectors and concatenated into a single extended vector matrix. The final layer employs a softmax function for emotion classification, resulting in enhanced model accuracy due to the dual sub-model structure.
Jalal et al. [35] developed an end-to-end convolutional self-attention framework for facial emotion classification, comprising four CNN blocks (C1-C4), a self-attention layer (A1), and a dense block (D1). The first CNN block (C1) features a convolutional layer with a 3x3 kernel, producing 32 output feature channels, followed by batch normalization and ReLU activation. The second block (C2) includes two convolutional layers with 3x3 and 5x5 kernels, max-pooling, and batch normalization, transitioning from 32 to 192 feature channels. The third block (C3) consists of three convolutional layers, where the first and third layers are accompanied by max-pooling and batch normalization, yielding 192 input and 128 output feature channels. Following C3, the self-attention layer (A1) captures relationships in the feature maps, leading to the dense block (D1), which features two fully-connected layers and a softmax layer, with the first layer incorporating ReLU activation and dropout for enhanced performance.
Figure 5.5 The Model for Real-Time Emotion Classification [4]
Arriaga et al. introduced a real-time CNN model for emotion classification, known as mini-Xception, as depicted in Figure 5.5. This fully convolutional network features a structure where each convolution is succeeded by batch normalization and a ReLU activation function, incorporating four residual modules of depth-wise separable convolutions to enhance performance.
Each depth-wise separable convolution consists of two distinct layers: depth-wise convolutions and point-wise convolutions. To generate predictions, the architecture incorporates a global average pooling layer followed by a softmax activation function in its final layer.
Object Detection Function
Overview of Object Detection Function
The object detection function allows users to identify and count various objects in a scene. Initially, an image is captured, and pre-trained models are utilized to ensure optimal efficiency. Key evaluation metrics for these models include accuracy and processing time. The final output presents the names and quantities of detected objects, organized in an adaptable object order table that can be modified based on specific circumstances. Figure 6.1 provides a visual overview of this process.
Figure 6.1 Object Detection Function Overview
Object Detection Schemes
Since the launch of R-CNN, the pioneering CNN-based object detector, research has significantly progressed in general object detection. This section presents key object detection architectures that exemplify these advancements.
Regions with CNN features (R-CNN) has emerged as a leading object detection method since its introduction by Girshick et al. The model utilizes a selective search algorithm to extract 2000 region proposals from a single image, which are then processed through a CNN for feature extraction. Subsequently, class-specific linear support vector machines classify each region, enabling effective localization and segmentation of objects. Despite its high capacity, a significant drawback of the R-CNN model is its inability to support real-time implementation.
Figure 6.2 R-CNN Based Object Detection Model [24]
Girshick later introduced Fast R-CNN to address limitations in the original R-CNN algorithm. In this improved version, features are extracted from the entire input image and processed through a region of interest (RoI) pooling layer, where region proposals from the feature map are resized to a uniform dimension. Subsequently, each feature vector is fed into a series of fully connected layers, enhancing the model's efficiency and accuracy. Fast R-CNN uses a multi-task loss to achieve end-to-end learning, with the network jointly trained using this multi-task loss on each labeled RoI. It has been shown that Fast R-CNN achieves significant improvements in training and testing speed as well as detection accuracy.
Ren et al. introduced the Faster R-CNN method for real-time object detection, utilizing Region Proposal Networks (RPNs) to generate region proposals with high efficiency and accuracy, ultimately enhancing detection performance. Building on this, He et al. developed the Mask R-CNN method, which not only detects objects in images but also accurately segments masks for each instance. This approach incorporates an additional branch for mask prediction alongside the existing bounding box recognition, offering easy implementation and flexible architecture options for training.
You Only Look Once (YOLO), developed by Redmon et al., is a one-stage object detection system known as YOLOv1, which enables real-time object detection. The model divides the input image into an S × S grid, where each grid cell predicts objects whose center falls within it. Each cell predicts B bounding boxes along with confidence scores, defined as the product of the probability of an object being present and the intersection over union (IOU) of the prediction. Additionally, each grid cell predicts conditional class probabilities for the objects. The predictions are structured as an S × S × (5B + C) tensor. The YOLO architecture consists of 24 convolutional layers followed by two fully connected layers, utilizing 1×1 convolutional layers to reduce the feature space from previous layers. During pre-training on the ImageNet dataset, the initial 20 convolutional layers are employed, followed by an average-pooling layer and a fully connected layer.
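As a concrete illustration, with the configuration reported in the original YOLOv1 paper (S = 7, B = 2, and C = 20 classes for PASCAL VOC), the output tensor has the shape 7 × 7 × (5·2 + 20) = 7 × 7 × 30.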
As an improved version, YOLOv2 was later proposed by Redmon and Farhadi. YOLOv2 incorporates several effective strategies that enhance its speed and precision compared to YOLOv1. The introduction of Darknet-19, featuring 19 convolutional layers and 5 max-pooling layers, allows for rapid network resizing and facilitates multi-scale training. Additionally, the implementation of batch normalization in each convolutional layer significantly improves the mean average precision (mAP) and helps to regularize the model. Furthermore, all fully connected layers are removed in YOLOv2, and anchor boxes are used to predict bounding boxes. YOLOv2 also achieved state-of-the-art results on standard detection tasks.
YOLOv3, developed by Redmon and Farhadi, enhances YOLOv2 by implementing multi-label classification for object detection in images. It utilizes independent logistic classifiers to predict multiple labels for each object, improving performance on complex datasets with overlapping labels. Additionally, YOLOv3 makes box predictions at three different scales, with the final convolutional layer outputting a 3D tensor that encodes bounding box, objectness, and class predictions. The model incorporates a new feature extraction network, Darknet-53, which contains 53 convolutional layers. Darknet-53 also achieves the highest measured floating point operations per second (FLOPS); therefore, this network structure makes better use of the GPU.
The Single Shot Detector (SSD), introduced by Liu et al., is an efficient one-stage object detection framework designed for multiple categories. Utilizing the VGG16 backbone architecture, SSD enhances the truncated base network by adding several convolutional feature layers that detect objects at various scales. By merging predictions from feature maps of differing resolutions, SSD effectively addresses the challenge of detecting objects of varying sizes. Unlike traditional methods that depend on object proposals, SSD simplifies the process by eliminating proposal generation and feature resampling, allowing all computations to occur within a single network. This streamlined approach not only facilitates easy training but also enables rapid integration into detection systems, resulting in state-of-the-art accuracy and speed in object detection tasks.
Arrangement of Result Description
To enhance user experience, an object order table is established based on the result description of a function, allowing for adjustments based on varying situations. This method facilitates quick and easy access to information, especially when dealing with complex result descriptions, without compromising the accuracy of the final detection results.
The object detection function, illustrated in Figure 6.1, utilizes a pre-trained model based on the Common Objects in Context (COCO) dataset. This dataset, sourced from natural images depicting everyday scenes, offers valuable contextual information and features labeled and segmented objects, enhancing the model's training process.
The COCO dataset includes 91 object categories, but 11 of these, such as street signs, hats, shoes, eyeglasses, plates, mirrors, windows, desks, doors, blenders, and hair brushes, are not labeled or segmented. As a result, only 80 object categories are accurately labeled and segmented within the images.
The COCO dataset categorizes objects into super categories such as person and accessory, animal, vehicle, outdoor objects, sports, kitchenware, food, furniture, appliance, electronics, and indoor objects. To establish object positioning in result descriptions, we created an object order table based on these super categories. For instance, in an outdoor setting, the top ten objects are ranked as follows: person, bicycle, motorcycle, car, bus, truck, train, traffic light, stop sign, and bench.
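A minimal sketch of this arrangement step follows, assuming the detector returns a flat list of COCO class labels for one image and using the outdoor ordering above (the priority table and example labels are illustrative):

    from collections import Counter

    # Priority of the first ten outdoor-scene objects; a lower value means the
    # object is reported earlier in the result description.
    OUTDOOR_ORDER = {name: i for i, name in enumerate([
        "person", "bicycle", "motorcycle", "car", "bus",
        "truck", "train", "traffic light", "stop sign", "bench"])}

    def describe_objects(detected_labels):
        # detected_labels: e.g. ["car", "person", "person", "dog"]
        counts = Counter(detected_labels)
        ordered = sorted(counts.items(),
                         key=lambda kv: OUTDOOR_ORDER.get(kv[0], len(OUTDOOR_ORDER)))
        return ", ".join(f"{n} {name}" for name, n in ordered)

    print(describe_objects(["car", "person", "person", "dog"]))
    # -> "2 person, 1 car, 1 dog"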