Decoding Background Noise: Enhancing Speech Detection for Smarter Voice Agents

Author: James Daniel Zabala
Published: Jan 6, 2025
Category: GenAI, Voice Agent, VAD

OneByZero: Advancing AI-Powered Voice Solutions

At OneByZero, we develop voice agents that must cope with a range of real-world challenges, including background noise. These agents recognize speech accurately, but environmental noise remains a significant hurdle. Our research suggests that reliable voice interaction takes more than speech recognition: it also requires robust mechanisms to filter out background noise and adapt to changing conditions.
For businesses keen on staying competitive in the fast-evolving voice tech landscape, the message is clear: precision matters. The right VAD (Voice Activity Detection) technology can be the difference between a frustrating experience and a seamless, intuitive interaction.
This experiment compares two approaches to VAD in noisy environments: a classical machine learning (ML) pipeline built around XGBoost, and SileroVAD. Our goal is to determine which one detects speech more accurately at low latency, so that voice agents respond faster.

VAD and Background Noise

Voice Activity Detection (VAD) is a critical component of voice agents: it lets them detect when a user starts and stops speaking, so they can respond at the right moment. However, VAD can fail in several ways:
  • Delayed Detection: Slow responses lead to awkward silences.
  • Premature Detection: False pauses cause interruptions and speech overlap.
  • Misinterpreted Context: Incorrect pauses result in missed or irrelevant responses.
Background noise complicates this task because it often mimics speech. Sounds like chatter or ambient noise may share similar frequencies and patterns, making it difficult for VAD algorithms to distinguish between actual speech and noise. Effective VAD systems must precisely differentiate between these sounds to ensure smooth, natural interactions.

Classical Machine Learning Approach

The classical ML approach in this experiment trains a model to differentiate between speech and background noise using a dataset of real phone conversations. Figure 1 shows the data processing and model training pipeline.
Figure 1. Classical ML Approach for Pause Detection

NOISE REDUCTION

The noisereduce package applies spectral gating: it attenuates frequency components associated with noise while preserving those that carry speech.
Strengths:
  • Removes consistent background sounds (e.g., static, hum).
  • Improves audio clarity for better feature extraction.
Weaknesses:
  • Struggles with background chatter that resembles speech.
  • May confuse similar frequencies and temporal patterns, leading to misclassification.
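To make the idea of spectral gating concrete, here is a minimal sketch using only the Python standard library. It is a toy illustration, not the noisereduce implementation: the helper names (`dft`, `idft`, `spectral_gate`), the non-overlapping frames, and the `margin` parameter are all assumptions made for this example. In practice the package is typically called along the lines of `noisereduce.reduce_noise(y=audio, sr=rate)`.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]

def spectral_gate(signal, noise_clip, frame_len=64, margin=1.5):
    """Zero every frequency bin whose magnitude does not exceed a noise threshold.

    Hypothetical simplification: frames do not overlap, and trailing samples
    that do not fill a whole frame are dropped.
    """
    # 1. Per-bin threshold: the loudest magnitude seen in the noise-only clip,
    #    scaled by a safety margin.
    thresholds = [0.0] * frame_len
    for i in range(0, len(noise_clip) - frame_len + 1, frame_len):
        for k, c in enumerate(dft(noise_clip[i:i + frame_len])):
            thresholds[k] = max(thresholds[k], margin * abs(c))

    # 2. Gate each frame of the signal and transform it back to the time domain.
    cleaned = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        spectrum = dft(signal[i:i + frame_len])
        gated = [c if abs(c) > thresholds[k] else 0j for k, c in enumerate(spectrum)]
        cleaned.extend(idft(gated))
    return cleaned

# Synthetic demo: a steady low-frequency hum added to a louder tone that
# stands in for speech.
hum = [0.2 * math.sin(2 * math.pi * 4 * t / 64) for t in range(256)]
tone = [math.sin(2 * math.pi * 10 * t / 64) for t in range(256)]
noisy = [h + s for h, s in zip(hum, tone)]

cleaned = spectral_gate(noisy, noise_clip=hum[:128])
residual = max(abs(c - s) for c, s in zip(cleaned, tone))
print(f"largest remaining error vs. clean tone: {residual:.2e}")
```

The hum is removed cleanly here because it occupies a bin the tone does not use. Background chatter shares bins with real speech, so a gate like this cannot separate the two, which is the weakness listed above.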
Figure 2. Comparison of audio signals before and after noise reduction. It shows how challenging it is to eliminate background noise without affecting speech.

RULE-BASED DETECTION

A threshold-based rule detects pauses after speech. A sliding window scans the model's predictions and triggers a pause only after two consecutive non-speech events, which reduces false transitions and makes pause detection more reliable.
Figure 3. Visualization of the rule-based pause detection. This helps the algorithm stay robust against short speech bursts or isolated events that could be misclassified.
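The rule can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the code used in the experiment: the function name `detect_pauses`, the 1/0 encoding of speech and non-speech, and the choice to fire only once per stretch of speech are assumptions. A run counter is used here, which is equivalent to sliding a window of size two over the predictions.

```python
def detect_pauses(predictions, min_non_speech=2):
    """Return the frame indices at which a pause is triggered.

    predictions: sequence of per-frame labels, 1 = speech, 0 = non-speech.
    A pause fires on the frame that completes `min_non_speech` consecutive
    non-speech frames, and only if speech was heard since the last pause.
    """
    pauses = []
    heard_speech = False
    silent_run = 0
    for i, label in enumerate(predictions):
        if label == 1:
            heard_speech = True
            silent_run = 0  # any speech frame resets the run
        else:
            silent_run += 1
            if heard_speech and silent_run == min_non_speech:
                pauses.append(i)
                heard_speech = False  # wait for new speech before firing again
    return pauses

# The single non-speech frame at index 4 is ignored; pauses fire at 8 and 12.
frames = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(detect_pauses(frames))  # [8, 12]
```

Raising `min_non_speech` makes the detector more conservative at the cost of extra latency, since each additional required frame delays the trigger by one frame duration.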

SileroVAD

SileroVAD is a pre-trained deep learning VAD model that separates speech from background noise, even in noisy environments. Trained on diverse datasets covering over 6,000 languages, it uses convolutional and LSTM layers. It processes each audio chunk in under 1 millisecond, which makes it well suited to real-time applications such as voice agents.
SileroVAD needs no additional noise filtering because it is trained on noisy data, which lets it detect speech in varied environments. It also exposes a tunable threshold parameter that plays a role similar to the rule-based step in the classical ML approach: speech and pause boundaries are decided from runs of consecutive speech or non-speech predictions.
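The sketch below shows how a probability threshold combined with a minimum run of silent chunks turns per-chunk speech probabilities into speech segments. It is a simplified illustration, not SileroVAD's own post-processing: the function `speech_segments`, the 32 ms chunk duration, and the sample probabilities are assumptions made for this example. The real library exposes comparable knobs (for example `threshold` and `min_silence_duration_ms` in its `get_speech_timestamps` utility).

```python
def speech_segments(probs, threshold=0.5, min_silence_chunks=2, chunk_ms=32):
    """Convert per-chunk speech probabilities into (start_ms, end_ms) segments.

    A segment opens at the first chunk whose probability reaches `threshold`
    and closes once `min_silence_chunks` consecutive chunks fall below it.
    """
    segments = []
    start = None      # index of the chunk that opened the current segment
    silent_run = 0    # consecutive below-threshold chunks inside a segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_chunks:
                end = i - silent_run + 1  # first chunk of the silent run
                segments.append((start * chunk_ms, end * chunk_ms))
                start = None
                silent_run = 0
    if start is not None:  # audio ended while a segment was still open
        segments.append((start * chunk_ms, (len(probs) - silent_run) * chunk_ms))
    return segments

# Hypothetical model outputs, one probability per 32 ms chunk.
probs = [0.1, 0.9, 0.8, 0.3, 0.85, 0.2, 0.1, 0.05, 0.95, 0.9]
print(speech_segments(probs))                 # [(32, 160), (256, 320)]
print(speech_segments(probs, threshold=0.9))  # [(32, 64), (256, 320)]
```

A higher threshold ends the first segment earlier, which means pauses are declared more aggressively. That is the same precision-versus-recall trade-off the rule-based detector faces.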

Comparison of XGBoost vs SileroVAD

The models were evaluated on a Filipino-English phone recording. Performance was assessed on:
  • Precision and Recall: Measure each model's ability to accurately identify speech and pause segments.
  • Intersection over Union (IoU): Measures the overlap between predicted and actual segments, rewarding accuracy and penalizing over-prediction.
  • Latency: The delay before a pause segment is correctly predicted.
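As a rough guide to how these metrics can be computed, the sketch below works on frame-level labels. The definitions are assumptions, since the exact evaluation code is not shown here: IoU is taken as TP / (TP + FP + FN) over frames, and latency as the time from a true pause onset to the first frame predicted as non-speech within that pause. The function names, the 1/0 label encoding, and `frame_ms` are hypothetical.

```python
def precision_recall_iou(truth, predicted, label):
    """Frame-level precision, recall and IoU for one class (`label`)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou

def pause_latencies(truth, predicted, frame_ms):
    """Delay (ms) from each true pause onset to its first detected frame.

    Pauses that are never detected before speech resumes are skipped.
    """
    latencies = []
    for i in range(1, len(truth)):
        if truth[i - 1] == 1 and truth[i] == 0:  # true pause starts at frame i
            j = i
            while j < len(truth) and truth[j] == 0:
                if predicted[j] == 0:
                    latencies.append((j - i) * frame_ms)
                    break
                j += 1
    return latencies

# 1 = speech, 0 = non-speech; the prediction lags each pause by one frame.
truth     = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]

print(precision_recall_iou(truth, predicted, label=0))  # (1.0, 0.6, 0.6)
print(pause_latencies(truth, predicted, frame_ms=100))  # [100, 100]
```

Under this definition, IoU can never exceed the smaller of precision and recall, which is why a model with high precision but low recall on a class still scores a low IoU for it.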

XGBoost Model

Testing the model on a separate phone conversation recording gave the following results:
  • Low Recall and IoU on Non Speech: The model often fails to classify background noise as non-speech.
  • High Latency: Pause classification is delayed. With a computation time of about 100 ms, expect around 300-500 ms of latency.
EVENT        PRECISION   RECALL   IoU
Non Speech   95%         60%      58%
Speech       81%         98%      80%
Figure 4. Histogram of pause prediction accuracy latency for XGBoost

SileroVAD

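SileroVAD performs markedly better on the same recording:
  • High Recall, Lower Precision on Non Speech: The model is aggressive in detecting pauses, so it catches nearly all of them but occasionally labels speech as a pause.
  • Low Latency: Most pauses were detected in real time, with a negligible computation time of about 1 ms.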
EVENT        PRECISION   RECALL   IoU
Non Speech   82%         99%      82%
Speech       100%        88%      87%
Figure 5. Histogram of pause prediction accuracy latency for SileroVAD

Phone Conversation Results

In Figure 6, we compare XGBoost and SileroVAD on a phone conversation. Around the 50-second mark, the XGBoost model misidentifies background noise as speech, causing a 10-second delay in response. This illustrates the limits of an XGBoost model trained on limited data, whereas SileroVAD tracks the ground truth much more closely.
Figure 6. Plot of voice activity detection for a phone conversation. XGBoost struggles with background noise, whereas SileroVAD closely resembles the ground truth.

Conclusion

This experiment emphasizes the importance of accurate, low-latency Voice Activity Detection (VAD) for improving voice agent performance. Key insights include:
  • Classical ML Approaches (e.g., XGBoost): Effective, but limited by small training datasets, which hurt performance, and dependent on an additional noise reduction step to deal with background noise.
  • SileroVAD: Excels at distinguishing speech from background noise, even in real-time conditions. It is also easy to deploy across a wide range of applications, as it is trained on over 6,000 languages.
The key takeaway is that in real-time applications, SileroVAD is the more reliable choice for ensuring seamless performance in dynamic, noisy environments.

References

Silero Team. (2024). Silero VAD: Pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier. GitHub Repository. Retrieved from https://github.com/snakers4/silero-vad.
Cochard, D. (2023, December 27). SileroVAD : Machine Learning Model to Detect Speech Segments. Medium. https://medium.com/axinc-ai/silerovad-machine-learning-model-to-detect-speech-segments-e99722c0dd41