Audio Source Separation Exploiting NMF-Based Generic Source Spectral Model

DOCUMENT INFORMATION

Basic information

Title: Audio Source Separation Exploiting NMF-Based Generic Source Spectral Model
Author: Duong Thi Hien Thanh
Supervisors: Assoc. Prof. Dr. Nguyen Quoc Cuong, Dr. Nguyen Cong Phuong
Institution: Hanoi University of Science and Technology
Major: Computer Science
Document type: Doctoral dissertation
Year: 2019
City: Hanoi

Format

Pages: 133
File size: 2.47 MB

Structure

  • Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART
    • 1.1 Audio source separation: a solution for the cocktail party problem
      • 1.1.1 General framework for source separation
      • 1.1.2 Problem formulation
    • 1.2 Literature review on international research works
      • 1.2.1 Spectral models
        • 1.2.1.1 Gaussian Mixture Model
        • 1.2.1.2 Nonnegative Matrix Factorization
        • 1.2.1.3 Deep Neural Networks
      • 1.2.2 Spatial models
        • 1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)
        • 1.2.2.2 Rank-1 covariance matrix
        • 1.2.2.3 Full-rank spatial covariance model
    • 1.3 Literature review on research works in Vietnam
    • 1.4 Source separation performance evaluation
      • 1.4.1 Energy-based criteria
      • 1.4.2 Perceptually-based criteria
    • 1.5 Summary
    • 2.1 NMF introduction
      • 2.1.1 NMF in a nutshell
      • 2.1.2 Cost function for parameter estimation
      • 2.1.3 Multiplicative update rules
    • 2.2 Application of NMF to audio source separation
      • 2.2.1 Audio spectra decomposition
      • 2.2.2 NMF-based audio source separation
    • 2.3 Proposed application of NMF to unusual sound detection
      • 2.3.1 Problem formulation
      • 2.3.2 Proposed methods for non-stationary frame detection
        • 2.3.2.1 Signal energy based method
        • 2.3.2.2 Global NMF-based method
        • 2.3.2.3 Local NMF-based method
      • 2.3.3 Experiment
        • 2.3.3.1 Dataset
        • 2.3.3.2 Algorithm settings and evaluation metrics
        • 2.3.3.3 Results and discussion
    • 2.4 Summary
  • Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY
    • 3.1 General workflow of the proposed approach
    • 3.2 GSSM formulation
    • 3.3 Model fitting with sparsity-inducing penalties
      • 3.3.1 Block sparsity-inducing penalty
      • 3.3.2 Component sparsity-inducing penalty
      • 3.3.3 Proposed mixed sparsity-inducing penalty
    • 3.4 Derived algorithm in unsupervised case
    • 3.5 Derived algorithm in semi-supervised case
      • 3.5.1 Semi-GSSM formulation
      • 3.5.2 Model fitting with mixed sparsity and algorithm
    • 3.6 Experiment
      • 3.6.1 Experiment data
        • 3.6.1.1 Synthetic dataset
        • 3.6.1.2 SiSEC-MUS dataset
        • 3.6.1.3 SiSEC-BGN dataset
      • 3.6.2 Single-channel source separation performance with unsupervised setting
        • 3.6.2.1 Experiment settings
        • 3.6.2.2 Evaluation method
        • 3.6.2.3 Results and discussion
      • 3.6.3 Single-channel source separation performance with semi-supervised setting
        • 3.6.3.1 Experiment settings
        • 3.6.3.2 Evaluation method
        • 3.6.3.3 Results and discussion
    • 3.7 Computational complexity
    • 3.8 Summary
  • Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
    • 4.1 Formulation and modeling
      • 4.1.1 Local Gaussian model
      • 4.1.2 NMF-based source variance model
      • 4.1.3 Estimation of the model parameters
    • 4.2 Proposed GSSM-based multichannel approach
      • 4.2.1 GSSM construction
      • 4.2.2 Proposed source variance fitting criteria
        • 4.2.2.1 Source variance denoising
        • 4.2.2.2 Source variance separation
      • 4.2.3 Derivation of MU rule for updating the activation matrix
      • 4.2.4 Derived algorithm
    • 4.3 Experiment
      • 4.3.1 Dataset and parameter settings
      • 4.3.2 Algorithm analysis
        • 4.3.2.2 Separation results with different choices of λ and γ
      • 4.3.3 Comparison with the state of the art
    • 4.4 Computational complexity
    • 4.5 Summary

List of tables

  • 2.3 Total number of different events detected from three recordings in winter
  • 3.1 List of snip songs in the SiSEC-MUS dataset
  • 3.2 Source separation performance obtained on the Synthetic and SiSEC-MUS datasets
  • 3.3 Speech separation performance obtained on the SiSEC-BGN. ∗ indicates …
  • 3.4 Speech separation performance obtained on the Synthetic dataset with semi-supervised setting
  • 4.1 Speech separation performance obtained on the SiSEC-BGN devset …
  • 4.2 Speech separation performance obtained on the SiSEC-BGN devset …
  • 4.3 Speech separation performance obtained on the test set of the SiSEC-BGN. ∗ indicates submissions by the authors [83]

List of figures

  • 1.1 Source separation general framework
  • 1.2 Audio source separation: a solution for the cocktail party problem
  • 1.3 IID corresponding to two sources in an anechoic environment
  • 2.1 Decomposition model of NMF [36]
  • 2.2 Spectral decomposition model based on NMF (K = 2) [66]
  • 2.3 General workflow of supervised NMF-based audio source separation
  • 2.4 Image of overlapping blocks
  • 2.5 General workflow of the NMF-based nonstationary segment extraction
  • 2.6 Number of different events detected by the methods from (a) the …
  • 3.1 Proposed weakly-informed single-channel source separation approach
  • 3.2 Generic source spectral model (GSSM) construction
  • 3.3 Estimated activation matrix H: (a) without a sparsity constraint, (b) with the block sparsity-inducing penalty, (c) with the component sparsity-inducing penalty, (d) with the proposed mixed sparsity-inducing penalty
  • 3.4 Average separation performance obtained by the proposed method with …
  • 3.5 Average separation performance obtained by the proposed method with …
  • 3.6 Average speech separation performance obtained by the proposed method …
  • 3.7 Average speech separation performance obtained by the proposed method …
  • 4.2 Average separation performance obtained by the proposed method over …
  • 4.3 Average separation performance obtained by the proposed method over …
  • 4.4 Average speech separation performance obtained by the proposed method …
  • 4.5 Average speech separation performance obtained by the proposed method …
  • 4.6 Average speech separation performance obtained by the proposed method …
  • 4.7 Average speech separation performance obtained by the proposed method …
  • 4.8 Boxplot for the speech separation performance obtained by the proposed method …

Content

CHAPTER 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART

Audio source separation: a solution for the cocktail party problem

1.1.1 General framework for source separation

Audio source separation is a signal processing technique that aims to isolate individual sounds, known as sources, from a mixture that may be single-channel or multichannel. This requires a system capable of several tasks: estimating the number of sources, determining the frequency basis and convolutive parameters for each source, applying the separation algorithm, and reconstructing the original audio sources.

The separation process exploits two primary types of cues: spectral cues and spatial cues. Spectral cues describe the spectral structures of the sources, whereas spatial cues carry information about their spatial positions. As discussed in Sections 1.2.1 and 1.2.2, spectral cues alone can be inadequate for distinguishing sources with similar pitch and timbre, and spatial cues alone may fail to differentiate sources located close to each other. Most existing systems therefore integrate both spectral and spatial cues for source separation.

The source separation algorithm operates in the time-frequency domain following the short-time Fourier transform (STFT). It relies on two key models: the spectral model, which exploits the spectral characteristics of the sources, and the spatial model, which exploits spatial information. The final time-domain source estimates are obtained through the inverse short-time Fourier transform (ISTFT).

Figure 1.1: Source separation general framework.

Multichannel audio mixtures are recordings made with microphone arrays, which capture sound from several sources. In this context, the multichannel mixture signal involves sources indexed by $j = 1, \dots, J$ and channels indexed by $i = 1, \dots, I$.

This mixture signal is denoted by $\mathbf{x}(t) = [x_1(t), \dots, x_I(t)]^T \in \mathbb{R}^{I \times 1}$ and is the sum of the contributions of all sources as [87]:

$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t), \qquad (1.1)$$

where $\mathbf{c}_j(t) \in \mathbb{R}^{I \times 1}$ is the contribution of the $j$-th source at the microphone array, referred to as the spatial image of this source. Its entries are time-domain digital signals indexed by $t = 0, \dots, T-1$, where $T$ is the signal length, and $[\,\cdot\,]^T$ denotes transposition of a matrix or vector.

Sound sources are categorized into two main types: point sources and diffuse sources. Point sources emit sound from a single location, such as a solo singer or an individual speaker. In contrast, diffuse sources emit sound over a broader area, for example a choir singing together or the sound of raindrops. A diffuse source can essentially be viewed as a collection of multiple point sources.

For a point source, the spatial image can be written as a convolutive mixture:

$$\mathbf{c}_j(t) = \sum_{\tau \ge 0} \mathbf{a}_j(\tau)\, s_j(t - \tau), \qquad (1.2)$$

where $\mathbf{a}_j(\tau) = [a_{1j}(\tau), \dots, a_{Ij}(\tau)]^T \in \mathbb{R}^{I \times 1}$, $j = 1, \dots, J$, are mixing filters modeling the acoustic path from the $j$-th source to the $I$ microphones, $\tau$ is the time delay, and $s_j(t)$ is the single-channel source signal.

Audio source separation systems typically operate in the time-frequency (T-F) domain, which jointly represents the temporal and spectral characteristics of audio. A widely used T-F representation is the short-time Fourier transform (STFT), which splits the time-domain waveform into overlapping frames and applies the Fourier transform to each frame individually.
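Since the whole pipeline operates on STFT coefficients, the following minimal Python sketch shows how the T-F representation of a signal can be computed and inverted with SciPy; the 16 kHz sampling rate and 1024-sample window are illustrative assumptions, not values taken from the thesis.

```python
# Minimal STFT/ISTFT sketch with SciPy; parameter values are assumptions.
import numpy as np
from scipy.signal import stft, istft

fs = 16000                       # assumed sampling rate (Hz)
x = np.random.randn(5 * fs)      # placeholder for the mixture x(t)

# Overlapping frames + per-frame Fourier transform = STFT.
f, n, X = stft(x, fs=fs, nperseg=1024, noverlap=768)
print(X.shape)                   # (F frequency bins, N time frames)

# Back to the time domain with the inverse STFT (ISTFT).
_, x_rec = istft(X, fs=fs, nperseg=1024, noverlap=768)
```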

Switching to the T-F domain, equation (1.1) can be written as

$$\mathbf{x}(n, f) = \sum_{j=1}^{J} \mathbf{c}_j(n, f), \qquad (1.3)$$

where $\mathbf{c}_j(n, f) \in \mathbb{C}^{I \times 1}$ and $\mathbf{x}(n, f) \in \mathbb{C}^{I \times 1}$ denote the T-F representations computed from $\mathbf{c}_j(t)$ and $\mathbf{x}(t)$, respectively, $n = 1, 2, \dots, N$ is the time frame index, and $f = 1, 2, \dots, F$ is the frequency bin index.

In array signal processing, the source signal is commonly assumed to be narrowband compared with the analysis window. This narrowband assumption allows the convolutive mixing model (1.2) to be approximated by a complex-valued multiplication in each frequency bin:

$$\mathbf{c}_j(n, f) \approx \mathbf{a}_j(f)\, s_j(n, f), \qquad (1.4)$$

where $\mathbf{c}_j(n, f)$ and $s_j(n, f)$ denote the STFT coefficients of $\mathbf{c}_j(t)$ and $s_j(t)$, respectively, and $\mathbf{a}_j(f)$ is the Fourier transform of the mixing filters $\mathbf{a}_j(\tau)$.

Source separation consists in recovering either the $J$ original source signals $s_j(t)$ or their spatial images $\mathbf{c}_j(t)$ given the $I$-channel mixture signal $\mathbf{x}(t)$. Audio source separation is a key solution to the cocktail party problem, as illustrated in Figure 1.2. The objective of our research is to extract the spatial images of the sources from the observed mixture. Importantly, our study treats background noise as a separate source, and applies to both point and diffuse sources in various contexts, including live recordings and artificially mixed audio.

Literature review on international research works


A standard architecture for source separation systems comprises two key models: the spectral model, which captures the spectral characteristics of the sources, and the spatial model, which exploits spatial information. This modular architecture allows any mixing filter estimation technique to be combined with any spectral source estimation method. Some source separation approaches can also recover the sources by directly exploiting either the spectral models or the mixing filters alone. Over the past two decades, research on blind source separation (BSS) has expanded considerably, encompassing a wide array of techniques. This section reviews popular spectral and spatial models, highlighting both their combinations and their individual use in state-of-the-art algorithms.

This section examines three widely researched source spectral models: the Spectral Gaussian Mixture Model (Spectral GMM), Spectral Nonnegative Matrix Factorization (Spectral NMF), and Deep Neural Network (DNN).

The Gaussian model-based approaches, specifically Spectral Gaussian Mixture Models (GMM), leverage the redundancy and structure inherent in audio sources to facilitate effective audio source separation.

The short-time Fourier spectrum of the $j$-th source is represented as a column vector $\mathbf{s}_j(n) = [s_j(n, f)]_{f=1}^{F}$. In the Spectral GMM, $\mathbf{s}_j(n)$ is modeled as a multidimensional, zero-mean, complex-valued $K$-state Gaussian mixture whose probability density function (pdf) is

$$p(\mathbf{s}_j(n)) = \sum_{k=1}^{K} \delta_{jk}\, \mathcal{N}_c\big(\mathbf{s}_j(n);\, \mathbf{0},\, \boldsymbol{\Sigma}_{jk}\big), \qquad (1.5)$$

where $\mathbf{0}$ denotes a vector of zeros, the weights $\delta_{jk}$ satisfy $\sum_{k=1}^{K} \delta_{jk} = 1$ for all $j$, and $\boldsymbol{\Sigma}_{jk} = \mathrm{diag}([v_{jk}(f)]_f)$ is the diagonal spectral covariance matrix of the $k$-th state of the $j$-th source.

This model can be viewed as a two-step generative process for each time frame $n$ of the $j$-th source. First, a state $k$ is selected with probability $\delta_{jk}$. Second, the vector of STFT coefficients $\mathbf{s}_j(n)$ is generated from a zero-mean Gaussian distribution with covariance $\boldsymbol{\Sigma}_{jk}$. Source separation then amounts to computing the posterior probabilities of all states for each time frame.

The Spectral GMM introduces $K \times F$ free variances $v_{jk}(f)$ and exploits the global structure of the sources for estimation. However, it does not explicitly account for amplitude variation of the sound sources, so signals with similar spectral shapes but different amplitude levels lead to different estimated spectral variance templates $[v_{jk}(f)]_f$. To address this limitation, an alternative version of the GMM was introduced in 2006 [13], called the Spectral Gaussian Scaled Mixture Model (Spectral GSMM). In the Spectral GSMM, a time-varying scaling parameter $g_{jk}(n)$ is incorporated in each state, and the pdf is written as [13]

$$p(\mathbf{s}_j(n)) = \sum_{k=1}^{K} \delta_{jk}\, \mathcal{N}_c\big(\mathbf{s}_j(n);\, \mathbf{0},\, g_{jk}(n)\, \boldsymbol{\Sigma}_{jk}\big). \qquad (1.7)$$

Spectral GMM and Spectral GSMM techniques have been used for single-channel audio source separation and for stereo separation of moving sources. GMM has also been explored for multichannel instantaneous music mixtures, where the Spectral GMMs are learned from the mixture signals themselves.

Nonnegative matrix factorization (NMF) is an effective dimension reduction technique specifically designed for nonnegative data. It has found applications across various fields, including machine learning and audio signal processing. A comprehensive overview of NMF is given in Chapter 2, as it is the foundational method for our research.

In the following, we will review NMF as a structured spectral source model applied to audio source separation, known as Spectral NMF.

In the Spectral NMF model, each source $s_j(n, f)$ is represented as the sum of $K_j$ spectral basis components, also referred to as frequency bases or latent components:

$$s_j(n, f) = \sum_{k=1}^{K_j} c_k(n, f),$$

where the components $c_k(n, f)$ are assumed mutually independent within each T-F bin and zero-mean Gaussian distributed with variances $h_{nk} w_{kf}$, i.e., $c_k(n, f) \sim \mathcal{N}_c(0, h_{nk} w_{kf})$. Here $w_{kf}$ is a spectral basis capturing the spectral structures of the signal, and $h_{nk}$ is the time-varying activation of that basis. As a consequence, the source STFT coefficients $s_j(n, f)$ are themselves independent zero-mean Gaussian random variables with variances $\sum_k h_{nk} w_{kf}$.

Denoting by $\mathbf{S}_j = [s_j(n, f)]_{nf}$ the $N \times F$ matrix of STFT coefficients of the $j$-th source, $\mathbf{H}_j = [h_{nk}]_{nk}$ of dimension $N \times K_j$, and $\mathbf{W}_j = [w_{kf}]_{kf}$ of dimension $K_j \times F$, maximum-likelihood (ML) estimation of the latent variables $\mathbf{H}_j$ and $\mathbf{W}_j$ is equivalent to NMF of the power spectrogram $|\mathbf{S}_j|^2$ into $\mathbf{H}_j \mathbf{W}_j$ according to a divergence function $d$ as follows [40]:

$$-\log p(\mathbf{S}_j \mid \mathbf{H}_j, \mathbf{W}_j) \stackrel{c}{=} \sum_{n,f} d\big(|s_j(n, f)|^2 \,\big\|\, [\mathbf{H}_j \mathbf{W}_j]_{nf}\big), \qquad (1.11)$$

where the equality holds up to a constant, and the divergence may be the Kullback-Leibler (KL) divergence $d_{KL}(x \| y) = x \log(x/y) - x + y$ or the Itakura-Saito (IS) divergence $d_{IS}(x \| y) = x/y - \log(x/y) - 1$. Further details on these divergences are given in Chapter 2. Notably, NMF simplifies the estimation: instead of the $NF$ values of the power spectrogram $|\mathbf{S}_j|^2$, only the $N K_j$ entries of $\mathbf{H}_j$ and the $K_j F$ entries of $\mathbf{W}_j$ need to be estimated, and usually $N K_j + K_j F \ll NF$. Thus NMF is considered a form of dimension reduction in this context.

Spectral NMF has been applied to single-channel audio source separation [118, 146] and multichannel audio source separation [104, 106] with different settings. In recent years, several studies have investigated user-guided NMF methods [26, 30, 37, 106, 129, 161] that incorporate specific information about the sources in order to improve the efficiency of the separation algorithm.

Recent studies indicate that deep neural networks (DNNs) excel at modeling complex functions and are effective in tasks such as audio signal processing. Traditional methods like GMM and NMF first learn the characteristics of speech and noise and then use these models to separate the signals. In contrast, deep learning approaches can directly learn the separation mask or model through end-to-end training, leading to significant performance improvements.

In DNN-based methods, the mixture time-frequency representation is preprocessed to extract features, which are used as inputs to a DNN. The DNN either estimates the time-frequency mask directly or estimates the source spectra, from which the mask is derived. Time-frequency masking filters the mixture's T-F representation with a mask:

$$\hat{c}_j(n, f) = \hat{m}_j(n, f)\, x(n, f),$$

where $\hat{m}_j(n, f)$ is a real-valued scalar mask. In audio enhancement, the ideal binary mask and the ideal ratio mask are the optimal binary and soft masks, respectively.
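As a concrete illustration of T-F masking, the sketch below builds an ideal-ratio-style soft mask from oracle target and interference magnitudes and applies it to the mixture STFT; the function and variable names are ours, not the thesis's.

```python
import numpy as np

def apply_ratio_mask(X_mix, mag_target, mag_other, eps=1e-12):
    """Soft T-F masking: c_hat(n, f) = m_hat(n, f) * x(n, f).

    X_mix      : complex STFT of the mixture
    mag_target : magnitude spectrogram of the target source (oracle here)
    mag_other  : magnitude spectrogram of all remaining sources
    """
    # Ideal-ratio-style mask with values in [0, 1].
    mask = mag_target / (mag_target + mag_other + eps)
    # The mask is real-valued, so the mixture phase is left untouched.
    return mask * X_mix
```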

Research has explored various DNN architectures and training criteria [34, 154, 155]. These experiments estimate a real-valued ratio mask $\hat{m}^{\text{rat}}_{\text{targ}}(f, n)$ for the target source, whose ideal value is $m^{\text{rat}}_{\text{targ}}(f, n)$, and the networks are trained by minimizing one of three error functions, among them:

- The error of the spectra computed using the estimated mask:

$$D_{SA} = \sum_{f,n} \big(\hat{m}^{\text{rat}}_{\text{targ}}(f, n)\, |x(f, n)| - |s_{\text{targ}}(f, n)|\big)^2, \qquad (1.16)$$

where $s_{\text{targ}}(f, n)$ is the target source spectrum.

- The error of the signal in the complex-valued T-F domain computed using the estimated mask (the phase-sensitive cost $D_{PSA}$).

Research indicates that the $D_{PSA}$ cost function outperforms the other criteria, highlighting the advantage of incorporating phase information during DNN training. Although the estimated mask is real-valued and does not affect the phase, taking the phase into account in the cost proves beneficial.

Most studies have addressed the problem of single-channel source separation [18, …].

Literature review on research works in Vietnam

In recent years, various groups in Vietnam have focused on researching audio source separation techniques, which are crucial for advancements in speech recognition and audio detection.

A research team from the University of Sciences, VNU-HCM, introduced a method for blind source separation using the independent component analysis (ICA) technique. This method assumes that the sources are statistically independent and follow a non-Gaussian distribution. However, it requires a square mixing matrix, which prevents it from handling under-determined scenarios where the number of sources exceeds the number of mixing channels.

Researchers at Hanoi University of Science and Technology developed a blind speech separation method based on non-Gaussianity maximization and inverse filters. This approach uses ICA with the Fast-ICA algorithm to extract the innovation processes of the speech sources by maximizing non-Gaussianity, followed by artificial coloring using re-coloration filters. Computer simulations demonstrate the method's effectiveness in separating speech signals from convolutive mixtures. However, the impact of noise was not analyzed, and the method is not applicable in under-determined scenarios.

Nguyen Linh Trung's research team at VNU University of Engineering and Technology studied the blind separation of nonstationary sources in under-determined scenarios, where the number of sources exceeds the number of sensors. They introduced two subspace-based algorithms designed for time-frequency nondisjoint sources, one using quadratic time-frequency distributions (TFDs) and the other linear TFDs. Both approaches assume that the number of sources present at any given point in the TF domain is smaller than the number of sensors. The work was later extended to separate EEG signals for seizure detection in patients.

Research on audio source separation in Vietnam has primarily focused on classical methods such as ICA and time-frequency clustering for the determined case. More advanced techniques such as GMM, NMF, and DNN have not been explored, and challenging scenarios, including under-determined conditions and reverberation, remain under-researched.

Source separation performance evaluation

The topic of source separation performance evaluation has long been studied in the literature, both in terms of objective quality metrics and subjective listening tests. In our study, we consider two widely used families of objective evaluation criteria, namely energy ratio criteria and perceptually-motivated criteria, that can be applied to any audio mixture and algorithm without requiring knowledge of the unmixing parameters or filters. These criteria are commonly used within the community and have been central to recent evaluation campaigns.

The criteria in both families stem from the decomposition of the estimated source image $\hat{c}_{ij}(t)$ into four components:

$$\hat{c}_{ij}(t) = c_{ij}(t) + e^{\text{spat}}_{ij}(t) + e^{\text{inter}}_{ij}(t) + e^{\text{artif}}_{ij}(t), \qquad (1.25)$$

where $c_{ij}(t)$ is the true spatial image, $e^{\text{spat}}_{ij}(t)$ the spatial distortion, $e^{\text{inter}}_{ij}(t)$ the interference from the other sources, and $e^{\text{artif}}_{ij}(t)$ the burbling artifacts. The two criterion families built on this decomposition are detailed in Sections 1.4.1 and 1.4.2.

Given the decomposition (1.25), the three distortion components are computed via least-squares projections as

$$e^{\text{spat}}_{ij}(t) = P_j^L\big(\hat{c}_{ij}(t)\big) - c_{ij}(t), \qquad (1.26)$$
$$e^{\text{inter}}_{ij}(t) = P_{\text{all}}^L\big(\hat{c}_{ij}(t)\big) - P_j^L\big(\hat{c}_{ij}(t)\big), \qquad (1.27)$$
$$e^{\text{artif}}_{ij}(t) = \hat{c}_{ij}(t) - P_{\text{all}}^L\big(\hat{c}_{ij}(t)\big), \qquad (1.28)$$

where $P_j^L$ is the least-squares projector onto the subspace spanned by the delayed true images $c_{kj}(t - \tau)$ of the $j$-th source, $P_{\text{all}}^L$ is the projector onto the subspace spanned by the delayed images $c_{kl}(t - \tau)$ of all sources, and the filter length $L$ corresponds to 32 ms.

The amount of each distortion type (interference, artifacts, and spatial distortion) is measured by three energy ratio criteria expressed in decibels (dB): the Source to Interference Ratio (SIR), the Sources to Artifacts Ratio (SAR), and the source Image to Spatial distortion Ratio (ISR), defined as [144]:

• Source to Interference Ratio, which quantifies the suppression of the interfering sources:

$$\text{SIR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t \big(c_{ij}(t) + e^{\text{spat}}_{ij}(t)\big)^2}{\sum_{i=1}^{I} \sum_t \big(e^{\text{inter}}_{ij}(t)\big)^2}, \qquad (1.29)$$

• Sources to Artifacts Ratio, which estimates the artifacts introduced by the separation process:

$$\text{SAR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t \big(c_{ij}(t) + e^{\text{spat}}_{ij}(t) + e^{\text{inter}}_{ij}(t)\big)^2}{\sum_{i=1}^{I} \sum_t \big(e^{\text{artif}}_{ij}(t)\big)^2}, \qquad (1.30)$$

• Source Image to Spatial distortion Ratio, which quantifies the suppression of the spatial distortions:

$$\text{ISR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t c_{ij}(t)^2}{\sum_{i=1}^{I} \sum_t \big(e^{\text{spat}}_{ij}(t)\big)^2}. \qquad (1.31)$$

The total error, which reflects the overall performance of the source separation algorithm, is measured by the Signal to Distortion Ratio (SDR):

$$\text{SDR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t c_{ij}(t)^2}{\sum_{i=1}^{I} \sum_t \big(e^{\text{spat}}_{ij}(t) + e^{\text{inter}}_{ij}(t) + e^{\text{artif}}_{ij}(t)\big)^2}. \qquad (1.32)$$

These criteria were implemented in Matlab and distributed for public use [41]¹. They are the most commonly used metrics in the source separation community to date.
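Besides the original Matlab toolbox [41], comparable energy-ratio metrics are implemented in the Python package mir_eval; a usage sketch (with random signals standing in for true and estimated sources) might look as follows.

```python
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                    # 2 true sources
estimated = reference + 0.1 * rng.standard_normal((2, 16000))  # noisy estimates

# SDR, SIR, SAR in dB plus the best source/estimate permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(sdr, sir, sar, perm)
```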

In addition to the energy ratio criteria, we use perceptually-motivated objective criteria to evaluate the quality of the estimated source image signals. These criteria are based on the decomposition of the signals into three distortion components: target distortion, interference distortion, and artifact distortion. Using the PEMO-Q perceptual salience measure, four performance metrics are computed: the Overall Perceptual Score (OPS), the Artifacts-related Perceptual Score (APS), the Interference-related Perceptual Score (IPS), and the Target-related Perceptual Score (TPS).

These criteria score from 0 to 100 where higher values indicate better performance.

Perceptually-motivated criteria have been shown to correlate better with subjective scores than energy ratio criteria, and since 2010 they have frequently been used alongside the energy ratio criteria in the audio source separation community. The source code for these perceptually-motivated criteria is also publicly available².

Summary

This chapter presented an overview of audio source separation and formulated the general problem addressed in this thesis. We reviewed key techniques for exploiting spectral and spatial information in the separation process. Furthermore, the two widely used families of objective evaluation criteria, which will be used to assess the performance of the proposed methods in Chapters 3 and 4, have also been presented.

¹ http://bass-db.gforge.inria.fr/bss_eval/

² http://bass-db.gforge.inria.fr/peass/

CHAPTER 2. NONNEGATIVE MATRIX FACTORIZATION APPLYING …

Spectral decomposition using Non-negative Matrix Factorization (NMF) has gained prominence in various audio signal processing applications, including source separation, enhancement, and audio detection This chapter begins with an overview of the NMF formulation and its extensions, followed by an introduction to NMF-based audio spectral decomposition techniques Finally, we present innovative methods for the automatic detection of unusual sounds utilizing NMF, with the goal of achieving effective sound annotation.

NMF introduction

Nonnegative Matrix Factorization (NMF) is a powerful dimension reduction technique specifically designed for nonnegative data. Although it gained popularity after Lee and Seung's work in 1999, its roots trace back nearly two decades earlier under names such as nonnegative rank factorization and positive matrix factorization. Since then, NMF has found extensive applications across various fields, including bioinformatics, image processing, face recognition [55], speech enhancement [39, 91], direction of arrival (DoA) estimation [135], blind source separation [40, 104, 109, 125, 133, 164], and informed source separation [25, 44, 46, 48]. Comprehensive reviews of NMF can be found in [152, 165].

In the following, we present some details about NMF so as to understand what it is and how it works.

Given a data matrix $\mathbf{V} \in \mathbb{R}_+^{F \times N}$ of dimensions $F \times N$ with nonnegative entries, NMF aims at finding two nonnegative matrices $\mathbf{W}$ and $\mathbf{H}$ such that $\mathbf{W}\mathbf{H}$ is approximately equal to $\mathbf{V}$ [73]:

$$\mathbf{V} \approx \mathbf{W}\mathbf{H}, \qquad (2.1)$$

where $\mathbf{W}$ is an $F \times K$ matrix and $\mathbf{H}$ is a $K \times N$ matrix. In the statistical analysis of multivariate data, $F$ represents the number of data features and $N$ the number of observations. The decomposition is illustrated in Fig. 2.1, where $K$ is the number of basis vectors (latent components). Usually, $K$ is chosen smaller than $F$ and $N$ so that $FK + KN \ll FN$ [42, 73]. $\mathbf{W}$ and $\mathbf{H}$ are thus smaller than the original matrix $\mathbf{V}$: they form a lower-rank representation of the original data, which is why NMF is considered a dimensionality reduction technique.

Equation (2.1) can be expressed column-wise as $\mathbf{v} \approx \mathbf{W}\mathbf{h}$, where $\mathbf{v}$ and $\mathbf{h}$ are the corresponding columns of the data matrix $\mathbf{V}$ and of $\mathbf{H}$, respectively. In other words, each data vector is approximated by a linear combination of the columns of $\mathbf{W}$, weighted by the components of $\mathbf{h}$. Therefore, $\mathbf{W}$ can be regarded as a dictionary matrix optimized for the linear approximation of the data in $\mathbf{V}$, while $\mathbf{H}$, which contains the weights of the basis vectors, is referred to as the activation matrix. Typically, a small number of basis vectors is used to represent many data vectors, so a good approximation can only be achieved when the basis vectors uncover the latent structure in the data.

In summary, NMF seeks nonnegative latent factors that enable feature extraction, dimensionality reduction, removal of redundant information, and discovery of hidden patterns in a set of nonnegative vectors.

Figure 2.1: Decomposition model of NMF [36].

2.1.2 Cost function for parameter estimation

To decompose a matrix $\mathbf{V}$ into matrices $\mathbf{W}$ and $\mathbf{H}$, we want the approximation in (2.1) to be as close as possible. This is achieved by solving the optimization problem [40]

$$\min_{\mathbf{W} \ge 0,\, \mathbf{H} \ge 0} D(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H}), \qquad (2.2)$$

where $D(\mathbf{V} \| \mathbf{W}\mathbf{H})$ is the cost function. Denoting $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$, this cost function is defined by

$$D(\mathbf{V} \,\|\, \hat{\mathbf{V}}) = \sum_{f=1}^{F} \sum_{n=1}^{N} d\big([\mathbf{V}]_{fn} \,\big\|\, [\hat{\mathbf{V}}]_{fn}\big). \qquad (2.3)$$

In (2.3), $d(x \| y)$ is a divergence function, which may be the Euclidean distance (EUC), the Kullback-Leibler (KL) divergence, or the Itakura-Saito (IS) divergence. These widely used cost functions are special cases of the more general $\beta$-divergence:

$$d_\beta(x \,\|\, y) = \begin{cases} \dfrac{x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1}}{\beta(\beta - 1)}, & \beta \in \mathbb{R} \setminus \{0, 1\}, \\[1ex] x \log \dfrac{x}{y} - x + y, & \beta = 1, \\[1ex] \dfrac{x}{y} - \log \dfrac{x}{y} - 1, & \beta = 0. \end{cases} \qquad (2.4)$$

When $\beta = 2$, the $\beta$-divergence reduces to the Euclidean distance; $\beta = 1$ gives the KL divergence; and $\beta = 0$ gives the IS divergence:

• EUC distance: $d_{EUC}(x \,\|\, y) = \frac{1}{2}(x - y)^2$ (2.5)

• KL divergence: $d_{KL}(x \,\|\, y) = x \log\frac{x}{y} - x + y$ (2.6)

• IS divergence: $d_{IS}(x \,\|\, y) = \frac{x}{y} - \log\frac{x}{y} - 1$ (2.7)

The choice of the NMF cost function should be guided by the nature of the data being analyzed. The Euclidean distance is a symmetric measure that depends strongly on the magnitude of its arguments, whereas the KL and IS divergences are asymmetric functions measuring a form of relative error between two nonnegative quantities. Our research focuses on NMF with the IS divergence, a limit case of the $\beta$-divergence whose relevance to the decomposition of audio spectra has been demonstrated [40, 43, 130].
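The $\beta$-divergence family can be written compactly in code. The helper below (our own sketch and naming) evaluates (2.4) element-wise and sums over all entries; β = 0 gives the IS divergence used throughout this thesis.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of the element-wise beta-divergence d_beta(x || y), eq. (2.4)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 0:                          # Itakura-Saito divergence (2.7)
        r = x / y
        return np.sum(r - np.log(r) - 1)
    if beta == 1:                          # Kullback-Leibler divergence (2.6)
        return np.sum(x * np.log(x / y) - x + y)
    # General case; beta = 2 gives the (half) squared Euclidean distance (2.5).
    return np.sum((x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1))
                  / (beta * (beta - 1)))
```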

In 2001, Lee and Seung studied gradient descent algorithms for minimizing the cost function (2.2) and introduced the idea of converting gradient descent update rules into multiplicative update (MU) rules. Writing the gradient of the cost $D(\theta)$ with respect to a parameter $\theta$ as the difference of two nonnegative parts,

$$\nabla_\theta D(\theta) = \nabla_\theta^+ D(\theta) - \nabla_\theta^- D(\theta), \qquad (2.8)$$

the gradient descent update of $\theta$ can be turned into the MU rule

$$\theta \leftarrow \theta \cdot \frac{\nabla_\theta^- D(\theta)}{\nabla_\theta^+ D(\theta)}. \qquad (2.9)$$

Applied to the $\beta$-divergence, the derivative of $d_\beta(x \| y)$ in (2.4) with respect to $y$ is

$$\nabla_y\, d_\beta(x \,\|\, y) = y^{\beta - 2}\, (y - x). \qquad (2.10)$$

Because $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$, the partial derivatives with respect to $\mathbf{H}$ and $\mathbf{W}$, respectively, are

$$\nabla_{\mathbf{H}} D(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H}) = \mathbf{W}^T\big((\mathbf{W}\mathbf{H})^{\cdot(\beta-2)} \odot (\mathbf{W}\mathbf{H} - \mathbf{V})\big), \qquad (2.11)$$
$$\nabla_{\mathbf{W}} D(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H}) = \big((\mathbf{W}\mathbf{H})^{\cdot(\beta-2)} \odot (\mathbf{W}\mathbf{H} - \mathbf{V})\big)\mathbf{H}^T, \qquad (2.12)$$

where $\mathbf{A}^{\cdot(n)}$ denotes the matrix with entries $[\mathbf{A}]_{ij}^n$ and $\mathbf{A}^T$ is the transpose of $\mathbf{A}$. Following (2.9), the multiplicative updates of $\mathbf{H}$ and $\mathbf{W}$ are written as

$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{W}^T\big((\mathbf{W}\mathbf{H})^{\cdot(\beta-2)} \odot \mathbf{V}\big)}{\mathbf{W}^T (\mathbf{W}\mathbf{H})^{\cdot(\beta-1)}}, \qquad (2.13)$$
$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\big((\mathbf{W}\mathbf{H})^{\cdot(\beta-2)} \odot \mathbf{V}\big)\mathbf{H}^T}{(\mathbf{W}\mathbf{H})^{\cdot(\beta-1)}\, \mathbf{H}^T}, \qquad (2.14)$$

where $\odot$ denotes the element-wise Hadamard product and the division is also element-wise.

The NMF algorithm using the MU rules to estimate $\mathbf{W}$ and $\mathbf{H}$ is described in Algorithm 1. The inputs of the algorithm are the matrix $\mathbf{V}$, the number of spectral bases $K$, the parameter $\beta$ determining the divergence ($\beta = 0$ corresponds to the IS divergence, $\beta = 1$ to the KL divergence), and $n_{iter}$, the number of iterations.

Algorithm 1 (sketch): initialize $\mathbf{H}^{(0)}$ and $\mathbf{W}^{(0)}$ randomly with nonnegative values, set $t = 0$; then repeat the updates (2.13) and (2.14) for $n_{iter}$ iterations.

Lee and Seung proved that $D_\beta(\mathbf{V} \| \mathbf{W}\mathbf{H})$ is non-increasing under these updates for $\beta = 2$ (Euclidean distance) and $\beta = 1$ (KL divergence). Kompass extended this proof to the range $1 \le \beta \le 2$. Fevotte et al. later showed that the criterion is also non-increasing under slightly modified updates for $\beta < 1$ and $\beta > 2$, which covers $\beta = 0$, i.e., the IS divergence. While a general convergence proof is still lacking, the simplicity of the MU rules has contributed considerably to the popularity of NMF.
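A compact NumPy sketch of Algorithm 1 under the MU rules (2.13)-(2.14); the small eps guard and the function name nmf_mu are our additions for numerical safety and readability.

```python
import numpy as np

def nmf_mu(V, K, beta=0, n_iter=200, eps=1e-12, seed=0):
    """NMF V ~ W H with multiplicative updates for the beta-divergence."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps           # W^(0): random nonnegative init
    H = rng.random((K, N)) + eps           # H^(0)
    for _ in range(n_iter):
        WH = W @ H + eps
        # MU rule (2.13) for H
        H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1) + eps)
        WH = W @ H + eps
        # MU rule (2.14) for W
        W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
    return W, H
```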

Application of NMF to audio source separation

NMF is extensively used for supervised source separation. The process begins by applying the short-time Fourier transform (STFT) to the original time-domain signal $x(t)$ and computing the magnitude or power of the STFT coefficients, producing a nonnegative matrix $\mathbf{V}$. The fundamental idea is to represent $\mathbf{V}$ as the product of a spectral basis matrix $\mathbf{W}$ and an activation matrix $\mathbf{H}$, i.e., $\mathbf{V} \approx \mathbf{W}\mathbf{H}$.

Each spectral basis (column of $\mathbf{W}$) represents one spectral characteristic of the audio signal, while the corresponding row of $\mathbf{H}$ gives the time-varying gain of that basis. Denoting by $K$ the number of spectral bases, $F$ the number of frequency bins, and $N$ the number of time frames, a simple NMF model with two spectral bases is illustrated in Fig. 2.2. In this example, the two spectral bases, which capture distinct spectral characteristics, together form the dictionary matrix $\mathbf{W}$, and the activation matrix $\mathbf{H}$ gives the mixing proportions of the two spectral bases in each time frame.

Figure 2.2: Spectral decomposition model based on NMF (K = 2) [66].

The number of spectral bases $K$ plays a crucial role in modeling audio spectra. A larger $K$ allows more detailed modeling of the spectral characteristics but makes parameter estimation harder and increases computation time, while a smaller $K$ may miss sound characteristics that are essential for accurate modeling. $K$ is therefore a tuning parameter whose choice should be informed by prior knowledge of the sound type. Previous studies indicate that a suitable $K$ for speech is approximately 32, while for environmental noise it is around 16.

2.2.2 NMF-based audio source separation

In this section we introduce a conventional supervised audio source separation method based on NMF, one of the most popular models for audio signals [43, 107, 127]. The general pipeline, shown in Fig. 2.3, operates in the T-F domain after the STFT and involves two main phases: (1) learning NMF source spectral models from training examples, and (2) decomposing the observed mixture using the pre-learned models.

Figure 2.3: General workflow of supervised NMF-based audio source separation.

We consider a single-channel separation scenario with $J$ sources. Let $\mathbf{X} \in \mathbb{C}^{F \times N}$ denote the STFT coefficients of the observed mixture signal and $\mathbf{S}_j \in \mathbb{C}^{F \times N}$, $j = 1, \dots, J$, those of the individual source signals. The mixing model is written over the complex-valued STFT coefficients as

$$\mathbf{X} = \sum_{j=1}^{J} \mathbf{S}_j. \qquad (2.15)$$

Denoting by $\mathbf{V} = |\mathbf{X}|^2$ the power spectrogram of the mixture, where $|\mathbf{X}|^p$ is the matrix with entries $[|\mathbf{X}|^p]_{il}$, NMF aims at decomposing the $F \times N$ nonnegative matrix $\mathbf{V}$ into two nonnegative matrices $\mathbf{W}$ of size $F \times K$ and $\mathbf{H}$ of size $K \times N$ such that $\mathbf{V} \approx \mathbf{W}\mathbf{H}$, typically by minimizing the Itakura-Saito divergence, which is commonly used for audio. The parameters $\theta = \{\mathbf{W}, \mathbf{H}\}$ are usually initialized with random nonnegative values and iteratively updated via the well-known MU rules [40].

In the supervised setting, the spectral model for each source, denoted by $\mathbf{W}_j$, $j = 1, \dots, J$, is first learned from the corresponding training examples (see Algorithm 1) by optimizing the criterion (2.2). The spectral model for all sources is then obtained by concatenation:

$$\mathbf{W} = [\mathbf{W}_1, \mathbf{W}_2, \dots, \mathbf{W}_J]. \qquad (2.16)$$

During the testing (separation) phase, the spectral model $\mathbf{W}$ is kept fixed and only the time activation matrix $\mathbf{H}$ is estimated using the MU rule. Note that $\mathbf{H}$ has the corresponding block structure

$$\mathbf{H} = [\mathbf{H}_1^T, \mathbf{H}_2^T, \dots, \mathbf{H}_J^T]^T, \qquad (2.17)$$

where $\mathbf{H}_j$ denotes the block characterizing the time activations of the $j$-th source, $j = 1, \dots, J$, and $\mathbf{A}^T$ is the transpose of matrix $\mathbf{A}$.

Algorithm 2: Baseline NMF-based audio source separation

Require: mixture signal $x(t)$; training data for all sources $\{s_j(t)\}$, $j = 1, \dots, J$
Ensure: source images $\hat{c}_j(t)$ separated from $x(t)$

for $j = 1, \dots, J$ do
  Estimate the spectral basis matrix $\mathbf{W}_j$ for the $j$-th source from the training example $s_j(t)$ by Algorithm 1.
end for
Estimate $\mathbf{H}$ from the mixture signal $x(t)$ by Algorithm 1 ($\mathbf{W}$ is fixed).
Estimate $\hat{\mathbf{S}}_j$ by Wiener filtering (2.18); $\hat{s}_j(t) = \mathrm{ISTFT}(\hat{\mathbf{S}}_j)$.

Once the parameters $\theta = \{\mathbf{W}, \mathbf{H}\}$ are obtained, the source STFT coefficients are computed by Wiener filtering as

$$\hat{\mathbf{S}}_j = \frac{\mathbf{W}_j \mathbf{H}_j}{\mathbf{W}\mathbf{H}} \odot \mathbf{X}, \qquad (2.18)$$

where $\odot$ denotes the element-wise Hadamard product and the division is element-wise. The time-domain source estimates are then obtained by the inverse STFT. This algorithm, summarized in Algorithm 2, serves as a baseline for comparison with the approach proposed in Chapter 3. Since it relies on a supervised spectral basis dictionary $\mathbf{W}$ built from training data, it is not applicable when training data is unavailable.
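Reusing the nmf_mu helper sketched earlier, Algorithm 2 reduces to a few lines; the Wiener-filtering step implements (2.18). All names are ours, and the sketch omits the STFT/ISTFT steps.

```python
import numpy as np

def supervised_nmf_separation(X_mix, train_power_specs, K=32, n_iter=200, eps=1e-12):
    """Baseline supervised NMF separation (Algorithm 2, sketch).

    X_mix             : complex STFT of the mixture (F x N)
    train_power_specs : list of training power spectrograms |S_j|^2, one per source
    """
    # Phase 1: learn one spectral dictionary W_j per source (Algorithm 1).
    Ws = [nmf_mu(Vj, K, beta=0, n_iter=n_iter)[0] for Vj in train_power_specs]
    W = np.concatenate(Ws, axis=1)                 # W = [W_1, ..., W_J], eq. (2.16)

    # Phase 2: keep W fixed, estimate only H on the mixture (IS MU rule).
    V = np.abs(X_mix) ** 2
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (WH ** (-2) * V)) / (W.T @ WH ** (-1) + eps)

    # Wiener filtering, eq. (2.18), block by block.
    WH = W @ H + eps
    sources, k0 = [], 0
    for Wj in Ws:
        k1 = k0 + Wj.shape[1]
        sources.append((Wj @ H[k0:k1]) / WH * X_mix)
        k0 = k1
    return sources                                 # ISTFT of each gives s_hat_j(t)
```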

Proposed application of NMF to unusual sound detection

Our primary motivation for exploiting the audio spectral modeling capability of NMF is its application to audio source separation. In addition, we examine how NMF captures the recurring structure of long audio recordings and propose a novel unusual sound detection method based on NMF.

Audio event detection and audio scene analysis play crucial roles in acoustic signal processing and have attracted significant attention in recent years. Notably, the IEEE audio and acoustic signal processing challenges such as DCASE were held in 2016 and 2017 and continued in 2018¹. A common methodology combines feature extraction techniques, such as mel-frequency cepstral coefficients (MFCC), with classifiers such as Gaussian mixture models (GMM) through supervised training; more recently, the focus has shifted toward deep learning architectures. However, successful classification of diverse sounds hinges on well-annotated datasets, and labeling audio segments with sound type information requires extensive human effort, making the annotation process time-consuming and tedious.

To enhance sound annotation efficiency, we introduce innovative techniques for the automatic detection of non-stationary segments in an unsupervised manner, aimed at minimizing annotation costs.

1 http://dcase.community/challenge2018/index

In natural environments, persistent background sounds, such as the call of cicadas in summer parks, often accompany sporadic audio events. Such recordings typically feature a continuous stationary sound alongside brief, distinct acoustic occurrences. Consequently, identifying and annotating these short audio events within lengthy recordings, which can last one to two hours, is a time-consuming task.

NMF effectively models the spectral characteristics of audio signals via the spectral basis dictionary $\mathbf{W}$ with $K$ spectral bases. When NMF is applied with a single spectral basis ($K = 1$), the stationary background sound is expected to be well represented, while non-stationary audio events are not. The residual divergence therefore serves as an effective indicator of non-stationary segments, which correspond to distinct audio events. Human listeners can then focus on the detected non-stationary segments for annotation, yielding a diverse collection of labeled sounds that can serve as training data for supervised source separation algorithms.

2.3.1 Problem formulation

We consider a single-channel audio signal from which we wish to identify the time segments containing non-stationary acoustic events. As in Section 2.2.2, let $\mathbf{X} \in \mathbb{C}^{F \times N}$ denote the complex-valued matrix of STFT coefficients of the observed signal.

Let $n_{sec}$ be the duration of the segments we want to extract (e.g., $n_{sec}$ equal to 5 or 10 seconds, depending on the length of the targeted non-stationary acoustic events). The corresponding block of the matrix $\mathbf{V}$ has size $F \times B$, where $B = \lfloor f_s \cdot n_{sec} / n_{shift} \rfloor$, $f_s$ is the sampling rate of the audio signal, $n_{shift}$ is the STFT frame shift, and $\lfloor x \rfloor$ denotes the largest integer not greater than $x$.

The power spectrogram of the input signal is computed as $\mathbf{V} = |\mathbf{X}|^2$. To decompose $\mathbf{V}$ into $\mathbf{W}$ and $\mathbf{H}$, NMF is applied as in (2.2) with the IS divergence (2.7): the parameters are initialized with random nonnegative values and iteratively updated via the MU rules (2.13) and (2.14).

2.3.2 Proposed methods for non-stationary frame detection

This section presents three proposed methods for extracting short segments containing audio events from real-world recordings that mix environmental noise with various audio events. One method relies solely on signal energy, while the other two use NMF with a single spectral basis. The extracted segments are expected to capture the non-stationary audio events of interest.

2.3.2.1 Signal energy based method

This method rests on the premise that environmental noise, such as silence and wind, typically has lower energy than non-stationary acoustic events like human speech, car sounds, and bird songs. Assuming the background noise in the recording is low-energy, we extract the highest-energy segments of the power spectrogram matrix $\mathbf{V}$, expecting them to contain the desired non-stationary audio events.

Figure 2.4: Image of overlapping blocks.

We first calculate the total energy of each overlapping block of the matrix $\mathbf{V}$, shown in Fig. 2.4, as

$$p_t = \sum_{f=1}^{F} \sum_{b=1}^{B} V_{f,\, (t-1)B_0 + b}, \qquad (2.19)$$

where $t = 1, \dots, T$ is the block index, $b$ is the frame index within each block, and $B_0$ is the block shift. After computing the total energy vector $\mathbf{p} = [p_1, \dots, p_T]$ over all blocks, we select the audio segments with the highest energy values, as these segments are most likely to contain non-stationary audio events.
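A sketch of the block-energy computation (2.19); the block length B and shift B0 follow the definitions above, and the names are ours.

```python
import numpy as np

def block_energies(V, B, B0):
    """Total energy p_t of each overlapping F x B block of V, eq. (2.19)."""
    F, N = V.shape
    T = 1 + max(0, (N - B) // B0)            # number of complete blocks
    return np.array([V[:, t * B0 : t * B0 + B].sum() for t in range(T)])

# Usage sketch: the highest-energy blocks are the candidate event segments.
# p = block_energies(V, B, B0); candidates = np.argsort(-p)[:10]
```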

2.3.2.2 Global NMF-based method

This method uses an NMF model with a single spectral basis to model the stationary part of the signal, i.e., the background noise. Audio events then correspond to the signal segments that the model does not estimate well. The overall workflow operates in the T-F domain, as illustrated in Fig. 2.5.

Figure 2.5: General workflow of the NMF-based nonstationary segment extraction.

After the STFT, NMF is performed with the IS divergence. The residual divergence matrix between the model and the observation is then computed as

$$\mathbf{D} = \mathbf{V} ./ (\mathbf{W}\mathbf{H}) - \log\big(\mathbf{V} ./ (\mathbf{W}\mathbf{H})\big) - 1, \qquad (2.20)$$

where $./$ denotes element-by-element division of matrix entries.

As in the signal energy based method, we compute the summed divergence of each block of the matrix $\mathbf{D}$ corresponding to the segment duration $n_{sec}$ as

$$q_t = \sum_{f=1}^{F} \sum_{b=1}^{B} D_{f,\, (t-1)B_0 + b}. \qquad (2.21)$$

After computing the total divergence vector $\mathbf{q} = [q_1, \dots, q_T]$ over all blocks, we select the audio segments with the highest divergence values. These segments are poorly represented by the NMF model and are therefore likely to correspond to non-stationary audio events.

2.3.2.3 Local NMF-based method

Because the background signal may itself be non-stationary over a long recording, applying NMF to the entire recording may not model it well. We therefore consider a local variant in which NMF is applied to shorter portions of the recording, such as one- or two-minute segments. The residual divergence matrix is then built piecewise, so that it adapts to the changing acoustics of the signal.

Algorithm 3: Global/Local NMF-based non-stationary segment detection (sketch)

for m = 1, ..., M do
  Compute the power spectrogram V(m) of the m-th block.
  Initialize H(m) (a one-row matrix) and W(m) (a one-column matrix) randomly with nonnegative values.
  // Update NMF parameters
  for i = 1, ..., niter do
    Update W(m) and H(m) by the MU rules (2.13) and (2.14).
  end for
  Compute the residual divergence matrix D(m) between the model W(m)H(m) and the observation V(m) as in (2.20).
end for
Concatenate D = [D(1), ..., D(M)] and extract the segments with the highest block divergence (2.21).

Here $M$ denotes the number of short acoustic blocks (e.g., 60 seconds each) extracted from the long recording, and $\mathbf{V}^{(m)}$, $m = 1, \dots, M$, their power spectrograms. After applying the one-spectral-basis NMF to each $\mathbf{V}^{(m)}$, the residual divergence matrix $\mathbf{D}^{(m)}$ between the model and the observation is computed, and $\mathbf{D}$ for the entire long recording is obtained by concatenating the $\mathbf{D}^{(m)}$. Finally, the segments with high divergence are extracted, as described in the global NMF-based method.

Algorithm 3 is a general description of both the global and the local NMF-based methods: when $M = 1$ it is equivalent to the global NMF-based method, and when $M > 1$ it is equivalent to the local NMF-based method.
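A sketch of Algorithm 3 built on the single-basis IS-NMF (reusing the nmf_mu helper sketched earlier): with n_blocks = 1 it behaves as the global method, with n_blocks > 1 as the local one.

```python
import numpy as np

def residual_is_divergence(V, W, H, eps=1e-12):
    """Residual IS-divergence matrix D between V and the model WH, eq. (2.20)."""
    R = V / (W @ H + eps) + eps
    return R - np.log(R) - 1

def nmf_event_scores(V, n_blocks, B, B0, n_iter=100):
    """Global (n_blocks = 1) / local (n_blocks > 1) NMF detection, Algorithm 3."""
    # Split the long recording's spectrogram into M consecutive chunks.
    chunks = np.array_split(V, n_blocks, axis=1)
    # One-spectral-basis NMF (K = 1) per chunk, then residual divergence D(m).
    D = np.concatenate(
        [residual_is_divergence(Vm, *nmf_mu(Vm, K=1, beta=0, n_iter=n_iter))
         for Vm in chunks], axis=1)
    # Block-summed divergence q_t, eq. (2.21): high values flag candidate events.
    T = 1 + max(0, (D.shape[1] - B) // B0)
    return np.array([D[:, t * B0 : t * B0 + B].sum() for t in range(T)])
```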

CHAPTER 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY

General workflow of the proposed approach

Recent research in audio source separation indicates that fully blind techniques yield limited performance. In contrast, informed source separation methods that exploit specific information, such as music scores or speech transcripts, are more effective.

When such specific information is unavailable, a weakly-informed strategy can still improve separation: we only assume abstract semantic information about the types of sources present in the mixture, which is used to collect training examples. The method relies on a very small amount of training data, just a few short audio files (typically three to five recordings of about five to ten seconds each) for each source type.

From these examples, a generic source spectral model (GSSM) is learned a priori to capture the spectral characteristics of the sources, and it subsequently guides the separation process. Throughout the thesis, this model is abbreviated as GSSM.

We address a single-channel separation scenario with $J$ sources. Let $\mathbf{X} \in \mathbb{C}^{F \times N}$ and $\mathbf{S}_j \in \mathbb{C}^{F \times N}$ denote the complex-valued matrices of STFT coefficients of the observed mixture signal $x(t)$ and of the $j$-th source signal $c_j(t)$, respectively. The mixing model is formulated as in (2.15). Our objective is to estimate the source signals $c_j(t)$ from the single-channel mixture $x(t)$ without any training data for the actual sources in the mixture.

In practice, we assume the types of sources in the mixture are known and that recorded examples of such sounds are easy to find. For instance, when separating speech from a noisy mixture, the target sources are speech and noise, and examples of both are readily available in existing recordings. Several examples are needed for each source because sounds such as noise are highly variable and ambiguous, but the required amount of training data remains small: typically three speech files and four noise files, each lasting 5 to 10 seconds.

Figure 3.1: Proposed weakly-informed single-channel source separation approach.

We propose a weakly-informed single-channel audio source separation method based on NMF that exploits a small number of training examples to guide the separation. The general workflow, shown in Fig. 3.1, operates in the T-F domain after the STFT and involves two phases: (1) learning the GSSM from the training examples via NMF, and (2) decomposing the observed mixture using the pre-learned model. For training, we collect audio files of the same types as the sources, e.g., three speech recordings (one male and two female) and four environmental sounds (wind, street noise, cafeteria ambience, and birdsong). The GSSM is learned a priori by NMF, as presented in Section 3.2.

GSSM formulation

Let $s_{lj}(t)$ denote the $l$-th single-channel training example for the $j$-th source and $\mathbf{S}_{lj}$ its spectrogram obtained by the STFT. $\mathbf{S}_{lj}$ is used to learn the corresponding NMF spectral dictionary $\mathbf{W}_{lj}$ by optimizing a criterion similar to (2.2):

$$\min_{\mathbf{H}_{lj} \ge 0,\, \mathbf{W}_{lj} \ge 0} D(\mathbf{S}_{lj} \,\|\, \mathbf{W}_{lj} \mathbf{H}_{lj}), \qquad (3.1)$$

where $\mathbf{H}_{lj}$ is the time activation matrix. Given $\mathbf{W}_{lj}$ for all examples $l = 1, \dots, L_j$ of the $j$-th source, the GSSM for the $j$-th source is constructed as

$$\mathbf{U}_j = [\mathbf{W}_{1j}, \dots, \mathbf{W}_{L_j j}], \qquad (3.2)$$

and the GSSM for all sources is then

$$\mathbf{U} = [\mathbf{U}_1, \dots, \mathbf{U}_J]. \qquad (3.3)$$

As an example, for speech and noise separation one may collect several speech samples from different male and female voices (e.g., three examples in total) and several noise types, such as outdoor environments, cafeteria, waterfall, and street sounds (e.g., four examples). The GSSM is then built from these training examples as illustrated in Fig. 3.2.
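Building the GSSM amounts to learning one dictionary per training example and concatenating them as in (3.2)-(3.3). The sketch below reuses the nmf_mu helper sketched earlier; the per-example basis number K is an illustrative assumption.

```python
import numpy as np

def build_gssm(examples_per_source, K=16, n_iter=200):
    """Construct U = [U_1, ..., U_J] from training power spectrograms.

    examples_per_source : one list per source, each containing the power
                          spectrograms |S_lj|^2 of that source's examples
    """
    U_blocks = []
    for examples in examples_per_source:                       # j-th source
        # U_j = [W_1j, ..., W_Lj,j], eq. (3.2)
        Wjs = [nmf_mu(S, K, beta=0, n_iter=n_iter)[0] for S in examples]
        U_blocks.append(np.concatenate(Wjs, axis=1))
    return np.concatenate(U_blocks, axis=1)                    # U, eq. (3.3)
```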

Model fitting with sparsity-inducing penalties

The GSSM matrices $\mathbf{U}_j$ defined in (3.2) grow with the number of examples and are redundant, since different examples may share similar spectral patterns. Consequently, when fitting the model to the mixture spectrogram, a sparsity constraint is needed so that only the subset of the large matrix $\mathbf{U}$ in (3.3) that matches the actual sources in the mixture is activated [58, 74, 146]. In other words, the mixture spectrogram $\mathbf{V} = |\mathbf{X}|^2$ is decomposed by solving the optimization problem

$$\min_{\mathbf{H} \ge 0}\; D(\mathbf{V} \,\|\, \mathbf{U}\mathbf{H}) + \lambda\, \Omega(\mathbf{H}), \qquad (3.4)$$

where $\Omega(\mathbf{H})$ denotes a penalty function imposing sparsity on the activation matrix $\mathbf{H}$, and $\lambda$ is a trade-off parameter determining the contribution of the penalty. When $\lambda = 0$, $\mathbf{H}$ is not sparse and the entire generic model is used, as illustrated in Fig. 3.3a. Recent work in audio source separation has considered the two following penalty functions.

Figure 3.2: Generic source spectral model (GSSM) construction.

3.3.1 Block sparsity-inducing penalty

To filter out irrelevant examples that have no spectral similarity with the targeted source in the mixture, Sun and Mysore [131] applied a block sparsity-inducing penalty to single-channel source separation in the speaker-independent case. The penalty function is written as

$$\Omega_1(\mathbf{H}) = \sum_{g=1}^{G} \log\big(\epsilon + \|\mathbf{H}^{(g)}\|_1\big), \qquad (3.5)$$

where $G = \sum_{j=1}^{J} L_j$ is the total number of blocks (each block representing one training example), $\mathbf{H}^{(g)}$ is the subset of rows of $\mathbf{H}$ holding the activation coefficients of the $g$-th block, $\|\cdot\|_1$ denotes the $\ell_1$-norm, and $\epsilon$ is a small non-zero constant. This penalty keeps the activations of the relevant examples only, while the activations of poorly fitting examples converge toward zero, as shown in Fig. 3.3b.

Figure 3.3: Estimated activation matrix $\mathbf{H}$: (a) without a sparsity constraint, (b) with the block sparsity-inducing penalty (3.5), (c) with the component sparsity-inducing penalty (3.6), (d) with the proposed mixed sparsity-inducing penalty (3.7).

3.3.2 Component sparsity-inducing penalty

In 2014, El Badawy et al. [8] introduced the component sparsity-inducing penalty function

$$\Omega_2(\mathbf{H}) = \sum_{k=1}^{K} \log\big(\epsilon + \|\mathbf{h}_k\|_1\big), \qquad (3.6)$$

where $\mathbf{h}_k$ denotes the $k$-th row of $\mathbf{H}$. Rather than removing whole blocks, this penalty activates only the most relevant spectral components of $\mathbf{U}$, motivated by the observation that even a learned spectral model may only partially fit the targeted source in a given mixture [8]. An illustration of the matrix $\mathbf{H}$ after convergence is shown in Fig. 3.3c, which mirrors a similar figure in [8].

3.3.3 Proposed mixed sparsity-inducing penalty

The block sparsity-inducing penalty promotes sparsity at the block level: all spectral bases within a block, i.e., within one training example, are either removed or retained together. As a result, it may discard relevant spectral characteristics dispersed over the GSSM, or retain less relevant ones inside the kept blocks. The component sparsity-inducing penalty instead enforces sparsity on the individual rows of the activation matrix, so it can pick up relevant characteristics scattered over the GSSM; this is its advantage over block sparsity. However, it is slower in eliminating unsuitable examples, since it considers every row of the large matrix individually.

Inspired by the complementary advantages of these two state-of-the-art penalty functions, we propose to combine them in a more general form as

$$\Omega(H) = \gamma \sum_{g=1}^{G} \log\big(\epsilon + \|H^{(g)}\|_1\big) + (1 - \gamma) \sum_{k=1}^{K} \log\big(\epsilon + \|h_k\|_1\big), \tag{3.7}$$

where $\gamma \in [0, 1]$ weights the contribution of each term in the mixed group sparsity constraint.

The newly introduced penalty function (3.7) generalizes both the block sparsity-inducing penalty (3.5) and the component sparsity-inducing penalty (3.6): when $\gamma = 1$, (3.7) reduces to (3.5), and when $\gamma = 0$, (3.7) reduces to (3.6). Fig. 3.3d shows an example of the activation matrix $H$ after convergence when the novel penalty (3.7) is used. It can be seen that some blocks converge to zero due to the contribution of the first term in (3.7), while in the remaining blocks some components are zero due to the second term.
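As a sanity check on the formulation, the following short Python function (our own sketch; the function name and block layout are illustrative) evaluates the mixed penalty (3.7). It reduces to the block penalty (3.5) for γ = 1 and to the component penalty (3.6) for γ = 0.

import numpy as np

def mixed_sparsity_penalty(H, block_sizes, gamma, eps=1e-12):
    """Mixed group sparsity penalty, Eq. (3.7).

    H: nonnegative activation matrix of shape (K, N); block_sizes: number of
    rows per training example, with sum(block_sizes) == K; gamma: weight in [0, 1].
    """
    # component term: one log per row h_k, cf. (3.6)
    comp = np.sum(np.log(eps + H.sum(axis=1)))
    # block term: one log per group of rows H^(g), cf. (3.5)
    bounds = np.cumsum(block_sizes)[:-1]
    block = sum(np.log(eps + Hg.sum()) for Hg in np.split(H, bounds, axis=0))
    return gamma * block + (1 - gamma) * comp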

3.4 Derived algorithm in unsupervised case

This section presents the derivation of the MU rule for updating the matrix $H$ under the optimization criterion (3.4) with the new penalty function (3.7). Although the approach involves a training phase, it is considered unsupervised, since this phase only learns generic models from various example signals rather than from the actual sources in the mixture.

Let us denote the minimization criterion by $\mathcal{L}(H) = D(V \,\|\, UH) + \lambda\, \Omega(H)$, where $\Omega(H)$ is the mixed group sparsity constraint defined in (3.7) and $D(\cdot\,\|\,\cdot)$ is the IS divergence defined in (2.7). We first compute the partial derivative of $\mathcal{L}(H)$ with respect to an entry $h_{kn}$. This $\nabla_{h_{kn}} \mathcal{L}(H)$ can be written as the difference of two nonnegative parts, $\nabla^{+}_{h_{kn}} \mathcal{L}(H) \ge 0$ and $\nabla^{-}_{h_{kn}} \mathcal{L}(H) \ge 0$, given by

$$\nabla^{+}_{h_{kn}} \mathcal{L}(H) = \sum_{f} u_{fk}\, \hat{v}_{fn}^{-1} + \lambda \left( \frac{\gamma}{\epsilon + \|H^{(g)}\|_1} + \frac{1 - \gamma}{\epsilon + \|h_k\|_1} \right), \qquad \nabla^{-}_{h_{kn}} \mathcal{L}(H) = \sum_{f} u_{fk}\, v_{fn}\, \hat{v}_{fn}^{-2}, \tag{3.10}$$

where $\hat{V} = UH$ with entries $\hat{v}_{fn}$, and $g$ denotes the block containing the $k$-th row of $H$.

Following a standard approach for MU rule derivation [40, 73], $h_{kn}$ is updated as

$$h_{kn} \leftarrow h_{kn} \left( \frac{\nabla^{-}_{h_{kn}} \mathcal{L}(H)}{\nabla^{+}_{h_{kn}} \mathcal{L}(H)} \right)^{\eta}, \tag{3.11}$$

where $\eta = 0.5$ following the derivation in [42, 74], which was shown to produce an accelerated descent algorithm. Putting (3.10) into (3.11) and rewriting it in matrix form, we obtain the update of $H$ as

$$H \leftarrow H \odot \left( \frac{U^T \big( \hat{V}^{\cdot -2} \odot V \big)}{U^T \hat{V}^{\cdot -1} + \lambda \big( \gamma\, Y + (1 - \gamma)\, Z \big)} \right)^{\cdot \eta}, \tag{3.12}$$

where $\odot$ denotes the element-wise product (division and exponentiation are also element-wise), $\hat{V} = UH$, $Y = [Y_1^T, \dots, Y_G^T]^T$ is composed of uniform matrices $Y_g$, $g = 1, \dots, G$, whose entries all equal $\frac{1}{\epsilon + \|H^{(g)}\|_1}$, and $Z = [z_1^T, \dots, z_K^T]^T$, where $z_k$, $k = 1, \dots, K$, are uniform row vectors of the same size as $h_k$ whose entries all equal $\frac{1}{\epsilon + \|h_k\|_1}$.

Given the derived MU rule (3.12) and the majorization-minimization framework, the complete parameter estimation algorithm with the proposed penalty function is summarized in Algorithm 4, where $Y^{(g)}$ denotes a uniform matrix of the same size as $H^{(g)}$ and $z_k$ a uniform row vector of the same size as $h_k$.

Algorithm 4 Unsupervised NMF with mixed sparsity-inducing penalty
Require: V, U, λ, γ
Ensure: H
  Initialize H randomly with nonnegative values; V̂ = UH
  repeat
    // Taking into account block sparsity-inducing penalty
    for g = 1, ..., G do
      Y^(g) ← 1 / (ε + ‖H^(g)‖₁)
    end for
    // Taking into account component sparsity-inducing penalty
    for k = 1, ..., K do
      z_k ← 1 / (ε + ‖h_k‖₁)
    end for
    Update H by (3.12); V̂ = UH
  until convergence
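A non-optimized Python sketch of Algorithm 4 is given below, assuming the matrix update (3.12); the function and variable names are ours, not from the thesis implementation.

import numpy as np

def unsupervised_mixed_sparsity_nmf(V, U, block_sizes, lam=10.0, gamma=0.2,
                                    n_iter=100, eps=1e-12):
    """Sketch of Algorithm 4: estimate H given the pre-learned GSSM U.

    V: mixture power spectrogram (F, N); U: GSSM (F, K);
    block_sizes: number of rows of H per training example.
    """
    K, N = U.shape[1], V.shape[1]
    H = np.abs(np.random.rand(K, N))
    bounds = np.cumsum(block_sizes)[:-1]
    for _ in range(n_iter):
        V_hat = U @ H + eps
        # block sparsity term: Y^(g) uniform with entries 1/(eps + ||H^(g)||_1)
        Y = np.concatenate([np.full(Hg.shape, 1.0 / (eps + Hg.sum()))
                            for Hg in np.split(H, bounds, axis=0)], axis=0)
        # component sparsity term: z_k uniform with entries 1/(eps + ||h_k||_1)
        Z = 1.0 / (eps + H.sum(axis=1, keepdims=True)) * np.ones_like(H)
        num = U.T @ (V * V_hat ** -2)
        den = U.T @ V_hat ** -1 + lam * (gamma * Y + (1 - gamma) * Z)
        H *= (num / den) ** 0.5  # eta = 0.5, Eq. (3.12)
    return H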

3.5 Derived algorithm in semi-supervised case

This section considers a scenario different from that of Section 3.1: we now assume that clean signals are available for training some sources in the mixture, while for the remaining sources only rough examples can be collected.

Such a setting arises, for instance, in speaker-dependent speech enhancement systems widely used in practice, e.g., when speech is employed to control robots or devices. In these scenarios, the operator's voice is typically known and can be recorded in advance for training the system, whereas the noise in the operating environment is often non-stationary and changes with time and location, making it difficult to characterize during training.

3.5.1 Semi-GSSM formulation

Formally, we assume that clean training signals are available for $P$ sources, while the GSSM has to be constructed for the remaining $Q = J - P$ sources from collected examples; we refer to the resulting model as the semi-GSSM throughout this chapter.

Let $W_p$ be the spectral basis matrix learned by NMF from the clean training signal of the $p$-th source, $p = 1, \dots, P$. The spectral basis model obtained from all $P$ clean signals is

$$W = [W_1, \dots, W_P]. \tag{3.13}$$

The GSSM $U_q$ for the $q$-th source, which does not have a clean training signal, is learned from its $L_q$ examples as in (3.1) and (3.2); the GSSM for all $Q$ such sources is then computed as in (3.3):

$$U = [U_1, \dots, U_Q]. \tag{3.14}$$

Finally, the semi-GSSM for all $J$ sources is constructed by

$$U_s = [W, U]. \tag{3.15}$$

The activation matrix corresponding to Us also consists of two parts as

$$H_s = \big[H^T, \tilde{H}^T\big]^T, \tag{3.16}$$

where $H$ is the part of the activation matrix corresponding to the $P$ sources having clean training signals, and $\tilde{H}$ corresponds to the $Q$ sources whose GSSM is learned from the collected example signals.

Algorithm 5 Semi-supervised NMF with mixed sparsity-inducing penalty
Require: V, U_s, λ, γ
Ensure: H_s
  Initialize H_s randomly with nonnegative values; V̂ = U_s H_s
  repeat
    // Taking into account block sparsity-inducing penalty
    for g = 1, ..., G do
      Y^(g) ← 1 / (ε + ‖H̃^(g)‖₁)
    end for
    // Taking into account component sparsity-inducing penalty
    for k = 1, ..., K do
      z_k ← 1 / (ε + ‖h̃_k‖₁)
    end for
    Update H̃ by the penalized MU rule (cf. (3.12)) and H by the standard update (2.2); V̂ = U_s H_s
  until convergence

3.5.2 Model fitting with mixed sparsity and algorithm

As in the unsupervised case, a sparsity constraint is needed when fitting the mixture spectrogram model, but it should now apply only to the part of the activation matrix corresponding to the subset $U$ of $U_s$. The mixture spectrogram $V = |X|^{\cdot 2}$ is decomposed by solving the optimization problem

$$\min_{H_s \ge 0} D(V \,\|\, U_s H_s) + \lambda\, \Omega(\tilde{H}), \tag{3.17}$$

where $\Omega(\tilde{H})$ denotes a penalty function imposing sparsity on the subset $\tilde{H}$ of the activation matrix $H_s$. The remaining part $H$ is updated by the usual optimization formula (2.2).

The proposed mixed sparsity-inducing penalty function (3.7) is applied to $\tilde{H}$, where the number of blocks is now $G = \sum_{q=1}^{Q} L_q$, the total number of training examples for the $Q$ sources.

The semi-supervised algorithm is summarized in Algorithm 5, where $Y^{(g)}$ is a uniform matrix of the same size as $\tilde{H}^{(g)}$, and $z_k$ a uniform row vector of the same size as $\tilde{h}_k$. A non-optimized Python sketch of this procedure is given below.
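This sketch (our own illustration; function and variable names are not from the thesis code) follows Algorithm 5: the semi-GSSM stacks the clean-signal bases $W$ with the GSSM $U$, and the sparsity terms touch only the rows of $H_s$ associated with $U$.

import numpy as np

def semi_supervised_mixed_sparsity_nmf(V, W, U, block_sizes, lam=10.0,
                                       gamma=0.2, n_iter=100, eps=1e-12):
    """Sketch of Algorithm 5. W: bases of the P clean-trained sources (3.13);
    U: GSSM of the remaining Q sources (3.14); block_sizes: rows per example."""
    Us = np.hstack([W, U])                        # semi-GSSM, Eq. (3.15)
    P = W.shape[1]                                # rows of H_s left unpenalized
    Hs = np.abs(np.random.rand(Us.shape[1], V.shape[1]))  # Hs = [H; H~], (3.16)
    splits = np.cumsum(block_sizes)[:-1]
    for _ in range(n_iter):
        V_hat = Us @ Hs + eps
        Ht = Hs[P:]                               # H~: rows subject to sparsity
        Y = np.concatenate([np.full(Hg.shape, 1.0 / (eps + Hg.sum()))
                            for Hg in np.split(Ht, splits, axis=0)], axis=0)
        Z = 1.0 / (eps + Ht.sum(axis=1, keepdims=True)) * np.ones_like(Ht)
        penalty = np.zeros_like(Hs)
        penalty[P:] = lam * (gamma * Y + (1 - gamma) * Z)
        num = Us.T @ (V * V_hat ** -2)
        den = Us.T @ V_hat ** -1 + penalty
        Hs *= (num / den) ** 0.5                  # penalized MU step
    return Hs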

3.6 Experiment

3.6.1 Experiment data

To validate the proposed approach, we used audio samples from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) 1 and from the Signal Separation Evaluation Campaign (SiSEC) 2 for training the GSSM. Testing was conducted on three datasets: the first consists of artificially mixed speech and noise examples, while the other two are benchmark datasets from the SiSEC campaign, which were carefully designed by audio source separation researchers and are widely used in the community. The datasets are described below:

1 http://parole.loria.fr/DEMAND/

2 http://sisec.wiki.irisa.fr.

3.6.1.1 Synthetic dataset

• Training data: We created two separate training sets, one for speech and one for noise. The speech training set contains three different recordings, two from female voices and one from a male voice, each 10 seconds long. The noise training set contains three types of environmental sounds (kitchen, metro, and field noise), with durations ranging from 5 to 15 seconds.

• Test data: The test set contains 12 single-channel mixtures of speech and noise, artificially mixed at a 0 dB signal-to-noise ratio (SNR); a minimal mixing helper is sketched below. The mixtures contain various types of noise so as to thoroughly evaluate the proposed algorithm, and both the speech and the noise source are present during the whole duration of each mixture. The signals are sampled at 16000 Hz, with durations ranging from 5 to 10 seconds. The speech signals include male and female voices in English taken from the SiSEC dataset 3, while the noise signals were taken from one of the 16 channels of the DEMAND dataset 4. Some mixtures combine two different noise types, such as traffic with wind, ocean waves with birdsong, and restaurant noise with guitar sounds.
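For illustration, such 0 dB mixtures can be produced with a helper like the following (our own sketch, not the exact script used to build the dataset):

import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale `noise` so that `speech` is mixed at the requested SNR (in dB)."""
    noise = noise[:len(speech)]                  # align durations
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise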

3.6.1.2 SiSEC-MUS dataset

• Training data: We created two separate training sets, one for voice and one for music. The voice training set contains three different voices, one male and two female, each 10 seconds long. The music training set contains nine files: three bass sounds, three drum sounds, and three other instrument sounds, with durations ranging from 10 to 15 seconds.

• Test data: The test set contains 5 song snippets, listed in Table 3.1. They are excerpts from the professionally-produced music recordings dataset of the SiSEC 2016 MUS task 6.

3 Speech files are from the Signal Separation Evaluation Campaign (SiSEC): http://sisec.wiki.irisa.fr.

4 Some noise files are from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND): http://parole.loria.fr/DEMAND/.


Table 3.1: List of song snippets in the SiSEC-MUS dataset.

No.  Song                                    Duration (s)
2    Tamy - Que pena / Tanto faz             15
3    Another Dreamer - The Ones We Love      25
4    Fort Minor - Remember The Name          25

3.6.1.3 SiSEC-BGN dataset

• Training data: We use the training sets for speech and noise as presented in Section 3.6.1.1.

• Test data: We used the benchmark dataset of the “Two-channel mixtures of speech and real-world background noise” (BGN) task 7 within the SiSEC 2016 campaign.

The dataset comprises 29 stereo mixtures, each 10 seconds long and sampled at 16 kHz, containing male and female speech mixed with real-world noises recorded in different public environments: cafeteria (Ca), square (Sq), and subway (Su). The recordings exhibit different amounts of reverberation, higher in the cafeteria and subway than in the square. The input SNR of the mixtures was randomly set between -17 and +12 dB. The 29 mixtures are divided into two sets, named the devset and the testset:

- The devset includes 9 mixtures: three with Ca noise, four with Sq noise, and two with Su noise.

- The testset contains 20 mixtures: eight with Ca noise, eight with Sq noise, and four with Su noise.

6 https://sisec.inria.fr/sisec-2016/2016-professionally-produced-music-recordings/.

7 https://sisec.inria.fr/sisec-2016/bgn-2016/

3.6.2 Single-channel source separation performance with unsupervised setting

We assess the source separation performance of the proposed algorithm in the unsupervised setting through experiments on the three datasets described in Section 3.6.1. In each test, the training set is used to learn the GSSM as presented in Section 3.2. Then, the observed mixtures in the test set are decomposed with the guidance of the pre-learned GSSM, as described in Algorithm 4.

3.6.2.1 Experiment settings

The STFT was computed with a sliding window of length 1024 and 50% overlap. The number of NMF components was set to 32 for speech/vocals, 16 for noise, 15 for bass/drums, and 25 for the other instruments. In the training phase, 100 MU iterations were used, while in the testing phase the number of iterations was varied from 1 to 100 in order to assess the convergence of the algorithm. The sensitivity of the proposed algorithm to the trade-off parameter λ was evaluated by varying λ over the range {1, 10, 25, 50, 100, 200, 500}, and the weight γ of the mixed sparsity-inducing penalty was varied from 0 to 1 in steps of 0.2. A sketch of the analysis front-end under these settings is given below.
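The following sketch assumes SciPy's STFT; the exact windowing of the original Matlab implementation may differ.

import numpy as np
from scipy.signal import stft

fs = 16000                                   # sampling rate used in the tests
x = np.random.randn(10 * fs)                 # placeholder 10-second mixture
# frame length 1024 with 50% overlap, as in the settings above
f, t, X = stft(x, fs=fs, nperseg=1024, noverlap=512)
V = np.abs(X) ** 2                           # power spectrogram fed to NMF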

3.6.2.2 Evaluation method

The separation results were evaluated using the source-to-distortion ratio (SDR), the source-to-interference ratio (SIR), and the source-to-artifacts ratio (SAR), measured in dB and averaged over all sources; higher values indicate better performance. These criteria, known as the BSS-EVAL metrics, are the most widely used in the source separation community.
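For instance, these metrics can be computed with the mir_eval package, shown here on placeholder signals:

import numpy as np
import mir_eval  # assumes the mir_eval package is installed

# hypothetical reference and estimated sources, shape (n_sources, n_samples);
# in our experiments these would be the true and separated speech/noise signals
rng = np.random.default_rng(0)
reference_sources = rng.standard_normal((2, 16000))
estimated_sources = reference_sources + 0.1 * rng.standard_normal((2, 16000))

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
print("SDR: %.1f dB, SIR: %.1f dB, SAR: %.1f dB"
      % (sdr.mean(), sir.mean(), sar.mean()))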

We first compare the separation performance of the proposed algorithm with the following closely related baseline algorithms on the Synthetic and SiSEC-MUS datasets:

• Baseline NMF: the standard NMF algorithm described in Section 2.2, used without any training data; the spectral models for speech and noise are initialized with random nonnegative values and updated iteratively by (2.14) and (2.13).

• NMF without sparsity: the NMF-based algorithm described in Section 2.2, where the speech/vocal spectral model is learned from a single file obtained by concatenating all samples of the speech/vocal training set described in Section 3.6.1.2, and the noise/music spectral model is learned from a single file concatenating five noise samples from the noise training set.

• NMF - Block sparsity: the proposed framework combining NMF with the block sparsity constraint (3.5) [131].

• NMF - Component sparsity: the proposed framework combining NMF with the component sparsity constraint (3.6) [8].

On the SiSEC-BGN dataset, the results obtained by the proposed algorithm were submitted to the SiSEC 2016 campaign. We compare them with those of several state-of-the-art algorithms that have participated in the SiSEC campaigns since 2013:

• Martinez-Munoz's algorithm (SiSEC 2013): uses a source-filter model to separate the speech source from several types of noise, modeled as a mixture of pseudo-stationary broadband noise, impulsive noise, and pitched interferences; the parameters are estimated with multiplicative update (MU) rules as in NMF.

• Bryan's algorithm [17] (SiSEC 2013): uses an interactive approach exploiting human annotation of the mixture spectrogram to guide and refine the source separation process; it is based on probabilistic latent component analysis (PLCA), which is equivalent to NMF.

• López's algorithm [84] (SiSEC 2015): uses spectral subtraction, designing the demixing matrix and the post-filters based on a single-channel source separation method.

• Liu's method [83] (SiSEC 2016): performs Time Difference of Arrival (TDOA) clustering based on the Generalized Cross-Correlation Phase Transform (GCC-PHAT).

Table 3.2: Source separation performance obtained on the Synthetic and SiSEC-MUS datasets with the unsupervised setting.

Dataset   Method   Speech/Vocal          Noise/Music
                   SDR   SIR   SAR       SDR   SIR   SAR

Table 3.3: Speech separation performance obtained on the SiSEC-BGN dataset. * indicates submissions by the authors and "-" indicates missing information [83, 100, 102].

                                        devset                        testset
Method                        Metric   Ca1   Sq1    Su1   Avg     Ca1   Ca2   Sq1   Sq2    Su1   Su2   Avg
Martinez-Munoz* (SiSEC 2013)  SDR      5.4   9.6    1.5   6.4     3.4   3.7   9.0  10.9    5.0   2.2   6.1
                              SIR     15.4  17.3    5.8  14.1    14.6  17.1  18.6  20.5   23.2   5.9  17.1
                              SAR      6.1  10.7    5.8   7.9     4.2   4.0   9.9  11.5    5.2   6.0   7.0
Bryan* [17] (SiSEC 2013)      SDR      5.6  10.2    4.2   3.7     3.8  13.1  12.9   5.6    5.6   7.3   7.8
                              SIR     18.4  15.6   13.6  13.9    16.5  21.8  18.2  21.4   23.0  16.1  18.5
                              SAR      5.9  12.1    4.9   4.5     4.2  13.7  14.6   5.7    5.7   8.4   8.5
López* (SiSEC 2015)           SDR        -     -      -     -     4.0   4.5   5.1  11.0   -3.8   3.9   4.9
                              SIR        -     -      -     -    14.9  16.1   9.6  16.3   -1.6   8.8  12.1
                              SAR        -     -      -     -     4.7   5.0   8.6  13.0    4.3   6.3   7.3
Liu* (SiSEC 2016)             SDR      1.9    -3  -10.6  -3.1     1.6   2.7  -4.4   1.9  -12.6  -1.2  -1.0
                              SIR        4  -2.9   -9.7  -2.1     4.5   7.7  -4.3   2.4  -12.2   0.1   0.9
                              SAR      7.5  16.4    6.9  11.3     6.5   5.5  18.8  16.9   10.3     8  11.4
Proposed (SiSEC 2016)

Figure 3.4: Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of MU iterations.

1) The convergence and stability of the algorithm

The convergence of the proposed algorithm on the Synthetic dataset as a function of the number of MU iterations is shown in Figure 3.4. All three measures, SDR, SIR, and SAR, increase as the number of MU iterations increases. This confirms that the derived algorithm converges correctly; it saturates after about 20 MU iterations.

Figure 3.5 shows the average speech separation performance on the Synthetic dataset as a function of λ and γ. The proposed algorithm is less sensitive to γ than to λ: it remains stable for small values of λ, and achieves the best results for λ between 10 and 50 and γ between 0 and 0.4. Overall, this relative insensitivity to the hyper-parameter choice makes the algorithm straightforward to use in practice.

2) Comparison with the closely related baseline algorithms

Comparing the separation performance obtained by the proposed algorithm to the closely related

Computational complexity

Algorithms 4 and 5 implement the mixed group sparsity-inducing penalty, as opposed to the block and component sparsity-inducing penalties used in the baseline algorithms of [131] and [8]. In each iteration of the multiplicative update of the NMF parameters, the additional time complexity of the block sparsity-inducing penalty is O(G), where G is the number of blocks (i.e., training examples), and that of the component sparsity-inducing penalty is O(K), where K is the number of spectral bases.

The time complexity of the NMF Algorithm 1 is O(FKN) per iteration when updating the dictionary W and the activation H, where F × N is the dimension of the input power spectrogram. For large N, the multiplicative updates are dominated by the large matrix multiplications in these update steps, so the cost of the sparsity terms is negligible in comparison. The proposed algorithms therefore have a computational complexity similar to that of the baselines [8, 131]. As an example, on a laptop with an Intel Core i5 processor at 2.2 GHz and 8 GB of RAM, our non-optimized Matlab implementation of the proposed method separates a 10-second mixture in about 2.2 seconds, while the NMF-Block sparsity, NMF-Component sparsity, and Baseline NMF methods run slightly faster, at 1.8, 2.1, and 1.7 seconds, respectively.

Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
