Improvement of Microphone Array Characteristics for Speech Capturing

This paper presents a new adaptive technique for speech capturing in adverse conditions using microphone arrays. The proposed technique is based on frequency-domain alignment of the microphone signals with the output of a fixed beamformer directed to the target speaker. This alignment procedure improves the directivity pattern and reduces the sidelobes. The low complexity of the technique is achieved by a frequency-domain implementation of the algorithm, which makes it suitable for real-time applications with a large number of microphones. The technique was evaluated on speech data corrupted by noise and interference of varying levels and directions. The proposed technique improves the directivity pattern of a traditional Delay & Sum beamformer and provides additional suppression of spatially incoherent noise, diffuse noise and interference, with minimal loss of target signal quality.


Introduction
Microphone arrays (MAs) are an effective tool for many applications (speech transcription, voice/speaker identification, etc.). MAs also enable long-distance speech capture in adverse conditions (Zang et al., 2010). For speech enhancement, the main advantage of using a microphone array rather than a single microphone is that a microphone-array-based beamformer can spatially suppress multiple interfering signals and different background noises (diffuse noise, independent noise, etc.) while maintaining minimum distortion of the target signal from the look direction (Gannot et al., 2001). A great number of well-known fixed and adaptive algorithms for signal processing in MAs are described in (Widrow & Stearns, 1985) and (Brandstein & Ward, 2010). A comparison of different beamforming techniques may be found, for example, in (Van Veen & Buckley, 1988). The different approaches can be classified into three main categories (Fischer & Simmer, 1995).
However, practical application of these algorithms presents a number of difficulties: audible musical noise, target signal suppression, sensitivity to microphone mismatch (Doclo & Moonen, 2007), etc. In our previous paper (Stolbov & Aleinik, 2014) we proposed a new adaptive beamforming algorithm. It was developed for an 8-element MA, and its efficiency was confirmed by experimental studies on artificial model signals and real signals. This paper presents a modification of that algorithm, as well as a detailed study of the algorithm and its characteristics, advantages and disadvantages.

The Proposed Technique
Here we briefly describe the modification of our method proposed in (Stolbov & Aleinik, 2014).

The Description of the Modified Method
The method is based on frequency-domain alignment of the microphone signals with the output of a fixed beamformer (FBF) directed to the target speaker. The block diagram of the method (example for a 4-microphone array) is shown in Figure 1. We implement this procedure in the frame-based frequency domain (Simmer & Wasiljeff, 1992). The three initial steps are well known as a frequency-domain Delay & Sum FBF. At the first step, the input signal of each microphone is transformed into a spectrum $X_n(f,k)$ using the short-time Fourier transform (STFT), where $n$ is the microphone index, $f$ is the frequency bin index and $k$ is the frame index (Figure 1, block 1). Then each signal is delayed in block 2 by multiplication with the corresponding element of the complex steering vector $D_n(f,\theta_0)$:

$$Y_n(f,k) = X_n(f,k)\,D_n(f,\theta_0), \tag{1}$$

where $\theta_0$ is the desired source direction. At the third step the frequency-domain FBF output is calculated (block 3) using the following equation:

$$Y(f,k) = \frac{1}{N}\sum_{n=1}^{N} Y_n(f,k), \tag{2}$$

where $N$ is the number of microphones. We point out that the signals (1) and (2) represent an ideal situation in which the microphone characteristics are identical. In practice there are fluctuations in microphone sensitivity and phase (both are generally frequency dependent). To compensate for these fluctuations, we first estimate the transfer function $H_n(f,k)$:

$$H_n(f,k) = \frac{P_{Y_n Y}(f,k)}{P_{Y_n Y_n}(f,k)}, \tag{3}$$

where $P_{Y_n Y}(f,k)$ and $P_{Y_n Y_n}(f,k)$ are the estimated cross spectra and power spectra of the $Y_n(f,k)$ and $Y(f,k)$ signals. These estimates are obtained using exponential frame-by-frame averaging:

$$P_{Y_n Y}(f,k) = (1-\alpha)\,P_{Y_n Y}(f,k-1) + \alpha\,Y_n(f,k)\,Y^{*}(f,k), \qquad P_{Y_n Y_n}(f,k) = (1-\alpha)\,P_{Y_n Y_n}(f,k-1) + \alpha\,|Y_n(f,k)|^{2}, \tag{4}$$

where $\alpha$ is the decay constant of the averaging, determined by the frame duration $T$ and the adaptation time constant $\tau$. The calculation of (3) and (4) is performed in block 4 (Figure 1). Then the estimated transfer function is passed to block 5, where the microphone signals are transformed as:

$$Z_n(f,k) = H_n(f,k)\,Y_n(f,k). \tag{5}$$

After this modification the new output spectrum is calculated in block 6 as follows:

$$Z(f,k) = \frac{1}{N}\sum_{n=1}^{N} Z_n(f,k). \tag{6}$$
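The six blocks of Figure 1 can be sketched for a single frame as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `steering_vector` and `AlignedBeamformer`, the far-field linear-array geometry, the speed of sound of 343 m/s, the smoothing constant `alpha` and the regularizer `eps` are all hypothetical choices made for illustration. The transfer function uses the magnitude of the cross spectrum, which is the variant the paper adopts in (11).

```python
import numpy as np

def steering_vector(freqs_hz, mic_pos_m, theta_rad, c=343.0):
    """Delay & Sum steering weights for a linear array, shape (n_mics, n_bins).

    Assumes a far-field plane wave and microphone positions along one axis.
    """
    delays = mic_pos_m[:, None] * np.sin(theta_rad) / c  # seconds per microphone
    return np.exp(2j * np.pi * freqs_hz[None, :] * delays)

class AlignedBeamformer:
    """Frame-by-frame sketch of blocks 2 to 6 (block 1, the STFT, is external)."""

    def __init__(self, n_mics, n_bins, alpha=0.1, eps=1e-12):
        self.alpha = alpha  # assumed smoothing constant of the averaging
        self.eps = eps      # guards against division by zero
        self.p_cross = np.zeros((n_mics, n_bins), dtype=complex)
        self.p_auto = np.zeros((n_mics, n_bins))

    def process_frame(self, X, D):
        """X, D: arrays of shape (n_mics, n_bins). Returns (FBF output, new output)."""
        Y_n = X * D                                   # block 2: steering
        Y = Y_n.mean(axis=0)                          # block 3: Delay & Sum output
        a = self.alpha
        # block 4: exponentially averaged cross and power spectra
        self.p_cross = (1 - a) * self.p_cross + a * Y_n * np.conj(Y)
        self.p_auto = (1 - a) * self.p_auto + a * np.abs(Y_n) ** 2
        H = np.abs(self.p_cross) / (self.p_auto + self.eps)  # magnitude variant
        Z_n = H * Y_n                                 # block 5: modified signals
        return Y, Z_n.mean(axis=0)                    # block 6: new output
```

For a stationary unit-amplitude plane wave the smoothed ratio settles at the magnitude of the FBF response, so in this sketch the magnitude of the new output equals the square of the FBF output magnitude.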

Analysis of the Proposed Algorithm (Magnitude and Phase Influence)
We should point out that from here on we omit the frame index k for the sake of simplicity. Consider again the MA steered perpendicular to the line of microphones. The transfer function (3) is naturally a complex function, i.e. we can write:

$$H_n(f) = |H_n(f)|\,e^{j\varphi_n(f)}, \tag{7}$$

where $|H_n(f)|$ and $\varphi_n(f)$ are the magnitude and the phase of the function, respectively. It is clear that in general both components (magnitude and phase) affect the results (5) and (6). Replacing $H_n(f)$ in (5) with "only the phase of $H_n(f)$", i.e. $e^{j\varphi_n(f)}$, or with "only the magnitude of $H_n(f)$", i.e. $|H_n(f)|$, allows us to investigate these influences separately. The results of these studies are presented below.

Phase Influence
The physical meaning of the influence of the phase can be explained as follows. Consider a 3-element equidistant linear MA. In that case the middle (second) microphone represents the phase center of the MA. If the MA is steered perpendicular to the line of microphones, then every element of the steering vector is real valued and equal to 1. Accordingly, the MA output $Y(f)$ is the sum of the three microphone signals $Y_n(f)$ divided by 3. Now let the input signal be a harmonic plane wave of frequency $f_0$ arriving from some direction $\theta$.
It is clear that in this case the phase differences between $Y_n(f_0)$ and $Y(f_0)$ are equal to $-\varphi$, $0$ and $+\varphi$ for the first, second and third microphone, respectively, where $\varphi$ is the phase shift between adjacent microphones for this frequency and direction. Moreover, exactly the same phase differences are obtained when we calculate the cross spectra (4). So for our 3-element MA the phase of $H_n(f_0)$ can be written as: $\arg H_1 = -\varphi$, $\arg H_2 = 0$, $\arg H_3 = +\varphi$. Thus when we calculate (5) we additionally rotate the $Y_1$ and $Y_3$ signals by the angles $-\varphi$ and $+\varphi$, i.e. the phase differences are doubled. As a result, for the input signal frequency $f_0$ we obtain the MA directivity pattern corresponding to the frequency $2f_0$: the mainlobe and the sidelobes are two times narrower, the sidelobes are smaller, etc. Of course, this conclusion is valid for any number of microphones. Figure 2 shows the resulting directivity patterns of an 8-microphone MA for a conventional Delay & Sum FBF and for the proposed method when only the phase was used in (5). The distance between the microphones was 5 cm, and the input signal was a 2000 Hz harmonic plane wave.
It is clear that the dashed curve corresponds to an input signal with a frequency of 4000 Hz, while in reality the input signal frequency was 2000 Hz.
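The phase-doubling argument can be checked numerically. The sketch below assumes an idealized setting that is not taken from the paper's code: a noise-free unit plane wave, broadside steering (all steering weights equal to 1), a speed of sound of 343 m/s, and a steady-state transfer function whose phase equals the phase difference between each microphone signal and the FBF output. The function names are hypothetical.

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def plane_wave(freq_hz, mic_pos_m, theta_rad):
    """Unit-amplitude plane-wave spectrum at each microphone of a linear array."""
    return np.exp(-2j * np.pi * freq_hz * mic_pos_m * np.sin(theta_rad) / C)

def das_pattern(freq_hz, mic_pos_m, thetas_rad):
    """Magnitude directivity pattern of a broadside Delay & Sum beamformer."""
    return np.array([abs(plane_wave(freq_hz, mic_pos_m, t).mean())
                     for t in thetas_rad])

def phase_only_pattern(freq_hz, mic_pos_m, thetas_rad):
    """Pattern obtained when only the phase of the transfer function is applied."""
    out = []
    for t in thetas_rad:
        y_n = plane_wave(freq_hz, mic_pos_m, t)
        y = y_n.mean()                               # Delay & Sum output
        h = np.exp(1j * np.angle(y_n * np.conj(y)))  # unit-magnitude transfer function
        out.append(abs((h * y_n).mean()))            # phases are doubled here
    return np.array(out)

mic_pos = (np.arange(8) - 3.5) * 0.05                # 8 microphones, 5 cm spacing
thetas = np.deg2rad(np.linspace(-90.0, 90.0, 181))
phase_2k = phase_only_pattern(2000.0, mic_pos, thetas)
das_4k = das_pattern(4000.0, mic_pos, thetas)
```

Under these assumptions `phase_2k` coincides with `das_4k`, i.e. the phase-only pattern at 2000 Hz matches the Delay & Sum pattern at 4000 Hz.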

Magnitude Influence
If we use only the magnitude of $H_n(f)$ in (5), the result is quite different. If the input signal is a plane wave arriving from an angle $\theta$, then the amplitude of the FBF output signal can be written as:

$$A_Y(\theta) = A\,D(\theta), \tag{8}$$

where $A$ is the input signal amplitude and $D(\theta)$ is the magnitude directivity pattern of the MA. A close inspection of (5) and (6) shows that in this case

$$|H_n| = \frac{\left|E\{Y_n Y^{*}\}\right|}{E\{|Y_n|^{2}\}} = \frac{A \cdot A\,D(\theta)}{A^{2}} = D(\theta), \tag{9}$$

where $E\{\cdot\}$ denotes the mathematical expectation. Combining (8) and (9), the amplitude of the new output signal is:

$$A_Z(\theta) = |H_n|\,A_Y(\theta) = A\,D^{2}(\theta). \tag{10}$$

This means that the resulting directivity pattern is equal to the square of the initial one. These conclusions are confirmed by Figure 3, which shows the resulting directivity patterns for the same parameters of the microphone array and the input signal as in Figure 2 when only the magnitude of $H_n(f)$ is used in (5). We can see that the mainlobe of the dashed curve is slightly narrower, and that the sidelobe level of the dashed curve, expressed in dB, is two times lower than that of the solid curve.
If we compare Figures 2 and 3, we can conclude that the phase mainly reduces the width of the mainlobe, while the magnitude has a greater effect on the amplitude of the sidelobes. A narrower mainlobe is clearly useful. Moreover, our investigation has shown that using the complex $H_n(f)$ reduces both the mainlobe width and the sidelobes. However, the data in Figures 2 and 3 were obtained using harmonic signals without interfering noise, i.e. in ideal conditions. Our experiments showed, first, that the phase method works poorly in the presence of high-level additive noise and, second, that the use of the phase increases parasitic musical noise in the output speech signal. In the end we left the phase method for further research and focused on the investigation of (3) with the magnitude in the numerator, i.e. everywhere below we assume that for our proposed method:

$$H_n(f) = \frac{\left|P_{Y_n Y}(f)\right|}{P_{Y_n Y_n}(f)}, \tag{11}$$

where $P_{Y_n Y}(f)$ and $P_{Y_n Y_n}(f)$ are the cross spectra and power spectra estimated in (4).
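The squared-pattern result can be reproduced with a few lines of NumPy. This is a sketch under idealized assumptions that are not taken from the paper's code: a noise-free unit plane wave, broadside steering, a speed of sound of 343 m/s, and expectations replaced by their steady-state values. The function names are hypothetical.

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def das_response(freq_hz, mic_pos_m, theta_rad):
    """Microphone spectra and broadside Delay & Sum output for a unit plane wave."""
    y_n = np.exp(-2j * np.pi * freq_hz * mic_pos_m * np.sin(theta_rad) / C)
    return y_n, y_n.mean()

def magnitude_only_response(freq_hz, mic_pos_m, theta_rad):
    """Output when only the magnitude of the transfer function is applied."""
    y_n, y = das_response(freq_hz, mic_pos_m, theta_rad)
    h = np.abs(y_n * np.conj(y)) / np.abs(y_n) ** 2  # steady-state gain per microphone
    return (h * y_n).mean()

mic_pos = (np.arange(8) - 3.5) * 0.05                # 8 microphones, 5 cm spacing
thetas = np.deg2rad(np.linspace(-90.0, 90.0, 181))
das = np.array([abs(das_response(2000.0, mic_pos, t)[1]) for t in thetas])
mag = np.array([abs(magnitude_only_response(2000.0, mic_pos, t)) for t in thetas])
```

Here `mag` equals `das ** 2` for every angle, so on a dB scale each sidelobe of the magnitude-only pattern lies at twice the (negative) level of the corresponding Delay & Sum sidelobe.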

Analysis of the Proposed Algorithm for Coherent Noise
Consider coherent noise suppression in the presence of a target signal. The microphone output signal in this case equals the sum of the interference and the speech signal:

Equipment and Parameters
In our experiments we investigated the proposed method using an array of 8 equally spaced microphones with 5 cm inter-microphone spacing and 35 cm total aperture length. Omnidirectional microphones (Knowles Electronics) were used. The microphone signals were sampled at a frequency of 16 kHz. We used a standard Overlap-and-Add (OLA) technique with a frame length of 512 samples, 50% overlap and a Hann window.
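The framing described above can be sketched as follows. At 16 kHz a 512-sample frame lasts 32 ms and gives a bin spacing of 31.25 Hz. The sketch assumes a periodic Hann window applied at analysis only (so that windows shifted by half a frame sum to one) and no synthesis window; the paper does not state these details, and the function names are hypothetical.

```python
import numpy as np

FRAME = 512          # frame length in samples (32 ms at 16 kHz)
HOP = FRAME // 2     # 50% overlap
# Periodic Hann window: copies shifted by HOP sum exactly to 1.
WINDOW = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME)

def stft_frames(x):
    """Windowed one-sided spectra, shape (n_frames, FRAME // 2 + 1)."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    return np.array([np.fft.rfft(WINDOW * x[i * HOP:i * HOP + FRAME])
                     for i in range(n_frames)])

def overlap_add(spectra):
    """Inverse transform each frame and overlap-add the results."""
    out = np.zeros(HOP * (len(spectra) - 1) + FRAME)
    for i, spec in enumerate(spectra):
        out[i * HOP:i * HOP + FRAME] += np.fft.irfft(spec, n=FRAME)
    return out
```

With unmodified spectra, `overlap_add(stft_frames(x))` reproduces `x` everywhere except the first and last half frame, where only one window contributes.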
In practice, one of the detected issues was audible musical noise caused by random fluctuations of the estimated spectra (4). It is known that it is mostly the low-frequency fluctuations of the estimated spectra that cause this type of noise. To reduce the musical noise, we propose the following limitation of the transfer function:

$$H_{\min} \le H_n(f,k) \le H_{\max}. \tag{20}$$

We used a fixed $H_{\max} = 1$; $H_{\min}$ is set by the user. Estimated directivity patterns, as well as values of noise reduction (NR) and interference reduction (IR), were used to evaluate the performance of the proposed method. White Gaussian noise (WGN) and wide-band Gaussian noise (WBGN) with a 300-5000 Hz band were used as input signals. We investigated the method for three scenarios: incoherent, coherent and diffuse noise. Testing was conducted with both model signals and real signals recorded in an office. For comparison we also estimated the characteristics of a conventional FBF.
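The limitation of the transfer function amounts to clamping the real-valued gain in every frequency bin. A minimal sketch, assuming that the fixed value 1 is the upper bound and that the user-set value is the lower bound; the function name and the default `h_min = 0.1` are hypothetical and not taken from the paper:

```python
import numpy as np

def limit_transfer_function(h, h_min=0.1, h_max=1.0):
    """Clamp the real-valued gain into [h_min, h_max] bin by bin.

    h_max = 1.0 is assumed to be the fixed bound; h_min is the user-set
    floor, and 0.1 is only an illustrative default.
    """
    return np.clip(h, h_min, h_max)
```

A higher floor leaves less room for frame-to-frame gain fluctuations, at the cost of limiting the maximum attenuation the method can apply.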

Directivity Patterns
The directivity patterns (DPs) shown in Figures 2 and 3 were obtained using an artificial single-frequency harmonic signal and therefore do not provide complete information about the performance of the proposed method. To clarify this, we conducted a series of experiments using model and real WBGN as the input signal for the MA. Figure 4 shows the DPs when the input signal is artificial WBGN arriving from a fixed direction $\theta$. It can be seen that for WBGN the oscillating sidelobes are absent and the DPs decrease gradually as the MA steering angle moves away from the signal direction $\theta$. We can also see that the mainlobe is fairly wide. This can be explained by the low-frequency components of the WBGN: our array has a length of 35 cm, which provides poor directivity at frequencies below 1000 Hz. With our method the DP curve (dashed curve) behaves similarly, but the mainlobe is narrower and the sidelobes are smaller.
Figure 5 shows the DPs for a signal obtained in a real experiment: WBGN emitted by high-quality loudspeakers located 4 meters from the MA in an office with a reverberation time of 860 ms. We can see that, although the overall MA performance deteriorates, the proposed method has better characteristics than the conventional Delay & Sum FBF.

Noise Reduction as a Function of Frequency
We have already mentioned the well-known fact that noise reduction in an MA depends on the noise frequency. Consider Figure 6, which shows three power spectra. In this experiment, WGN emitted by high-quality loudspeakers located 6 m from the MA and hidden from the MA by a screen simulated diffuse noise in the office. A slight convergence of the curves in the region above 6000 Hz can be explained, in our opinion, by the appearance of parasitic sidelobes with a high amplitude. We should point out that the data shown in Figure 6 are very useful, as they allow us to calculate the level of diffuse noise reduction in frequency bands. For example, let us denote the power spectrum of the single microphone signal (solid curve) as $P_X(f)$ and the power spectrum of the MA output signal for the proposed method as $P_Z(f)$. Then the noise reduction in the discrete frequency band $[f_1, f_2]$ can be written as:

$$NR(f_1, f_2) = 10\log_{10}\frac{\sum_{f=f_1}^{f_2} P_X(f)}{\sum_{f=f_1}^{f_2} P_Z(f)}. \tag{21}$$
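The band-wise noise reduction can be computed directly from two power spectra sampled on the same frequency grid. The sketch below assumes the ratio-of-band-powers definition in dB; the function name and arguments are hypothetical.

```python
import numpy as np

def band_noise_reduction_db(p_in, p_out, freqs_hz, f_lo, f_hi):
    """Noise reduction in dB over the band [f_lo, f_hi].

    p_in:  power spectrum of a single microphone signal.
    p_out: power spectrum of the array output on the same frequency grid.
    """
    band = (freqs_hz >= f_lo) & (freqs_hz <= f_hi)
    return 10.0 * np.log10(p_in[band].sum() / p_out[band].sum())
```

For the setup described in the paper, `freqs_hz` could be `np.fft.rfftfreq(512, 1 / 16000)`, and the band limits 1000 and 8000 Hz correspond to the band used for the NR and IR estimates.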

Noise Reduction for Diffuse Noise and for Interference
We calculated noise reduction for three types of noise: model incoherent noise, real diffuse noise and real interference. Model incoherent noise is a set of 8 independent artificial random sequences with Gaussian distribution, generated using the "WELL1024a" algorithm described by Panneton et al. (2006). Diffuse noise is real WGN recorded by the MA in an office and generated as explained in the description of Figure 6.
Interference is also real WGN recorded in an office, arriving at the MA from an angle of 40 degrees while the MA is steered to the target direction. NR and IR were estimated in the 1000-8000 Hz frequency band. The results in Table 1 show that the proposed method has considerable advantages over the FBF.
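As a reference point for the incoherent-noise scenario, a conventional Delay & Sum FBF that averages N independent, equal-power noise sequences reduces the noise power by a factor of N, i.e. by about 9 dB for N = 8. The sketch below checks this baseline. It uses NumPy's default generator rather than the WELL1024a generator used in the paper, and the function name, sequence length and seed are hypothetical.

```python
import numpy as np

def incoherent_noise_reduction_db(n_mics=8, n_samples=160_000, seed=0):
    """Delay & Sum noise reduction for independent Gaussian noise per microphone."""
    rng = np.random.default_rng(seed)            # not WELL1024a (assumption)
    noise = rng.standard_normal((n_mics, n_samples))
    output = noise.mean(axis=0)                  # broadside Delay & Sum in the time domain
    return 10.0 * np.log10(noise[0].var() / output.var())
```

The returned value should lie close to `10 * np.log10(n_mics)`; any additional reduction reported for the proposed method comes on top of this baseline.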

Discussion
In this paper we proposed a novel frequency-domain alignment technique for speech enhancement in directional microphone array systems. The proposed technique improves the MA directivity pattern compared to the conventional Delay & Sum FBF: we show both analytically and experimentally that the proposed method narrows the mainlobe and reduces the sidelobes of the DP. Two modifications of the proposed method (the "phase" and "magnitude" modifications) were considered and studied in detail. We performed a number of experiments using real (emitted and recorded in a real office) and simulated acoustic signals. Various types of signals and noise were used in the experiments: harmonic signals, white Gaussian noise, wide-band Gaussian noise (300-5000 Hz), etc. We achieved good agreement between the theoretical and experimental results. Our experiments showed that for an 8-microphone array it is possible to achieve noise reduction of up to 12.9 dB for diffuse noise and up to 18 dB for incoherent noise.

Conclusion
The proposed technique showed good results in the processing of acoustic data. Moreover, because the technique involves a low-complexity algorithm, it can be used in real-time applications in systems with a large number of microphones (for example, a two-dimensional microphone array). We think that the technique can be combined with several types of microphone array processing algorithms, e.g. null-steering, post-filtering, etc. In a limited number of experiments we found that the combined use of the null-steering algorithm and the proposed method enhances the suppression in the null direction. We also found that a system consisting of our algorithm followed by a Zelinski post-filter performs better than either of these components separately. However, we did not conduct detailed studies in this area. We should point out that musical noise is one of the major problems limiting the performance of the proposed method. Combination with other adaptive frequency-domain algorithms, musical noise suppression and a detailed study of the performance of the phase and complex transfer functions are tasks for future work.

Acknowledgments
The work was financially supported by the Ministry of Education and Science of the Russian Federation.

Figure 1. Block diagram of the proposed technique.

Figure 2. Directivity patterns for Delay & Sum and the proposed method when only the phase of the transfer function is used in (5).

Figure 3. Directivity patterns for Delay & Sum and the proposed method when only the magnitude of the transfer function is used in (5).

Figure 4. Directivity patterns for Delay & Sum and the proposed method for artificial wide-band Gaussian noise (bandwidth 300-5000 Hz).

Figure 5. Directivity patterns for Delay & Sum and the proposed method for wide-band Gaussian noise recorded in a real office.

Figure 6. Power spectra of the input and output signals of the MA for diffuse noise.

Table 1. Estimated noise and interference reduction.