OneByZero: Advancing AI-Powered Voice Solutions
At OneByZero, we develop voice agents that must work reliably in real-world conditions, including background noise. These agents recognize speech accurately, but environmental noise remains a significant hurdle. Our research suggests that the future of voice interaction goes beyond recognizing speech: it requires robust mechanisms to filter out background noise and adapt to changing conditions for more reliable and efficient performance.
For businesses keen on staying competitive in the fast-evolving voice tech landscape, the message is clear: precision matters. The right VAD (Voice Activity Detection) technology can be the difference between a frustrating experience and a seamless, intuitive interaction.
This experiment compares two approaches to enhancing VAD in noisy environments: classical machine learning (ML) and SileroVAD. Our goal is to determine the most effective solution for optimizing speech detection with low latency, ensuring faster and more responsive voice agents.
VAD and Background Noise
Voice Activity Detection (VAD) is a critical component of voice agents: it lets them detect when a user starts and stops speaking so they can respond at the right moment. However, VAD can fail in several ways:
- Delayed Detection: Slow responses lead to awkward silences.
- Premature Detection: False pauses cause interruptions and speech overlap.
- Misinterpreted Context: Incorrect pauses result in missed or irrelevant responses.
Background noise complicates this task because it often mimics speech. Sounds like chatter or ambient noise may share similar frequencies and patterns, making it difficult for VAD algorithms to distinguish between actual speech and noise. Effective VAD systems must precisely differentiate between these sounds to ensure smooth, natural interactions.
Classical Machine Learning Approach
The classical ML approach in this experiment involves training a model (XGBoost) to differentiate between speech and background noise using a dataset of real phone conversations. The data processing and model training pipeline is shown in Figure 1.
![Figure 1. Classical ML Approach for Pause Detection](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2Fa9c0440d-e4e4-40b3-adf7-7d5d006a8316%2Fimage.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=1732cc79-102c-80c6-bee3-ce958caf4168&cache=v2)
NOISE REDUCTION
The noisereduce package uses spectral noise gating to reduce noise by suppressing frequencies linked to noise while preserving speech frequencies.
Strengths:
- Removes consistent background sounds (e.g., static, hum).
- Improves audio clarity for better feature extraction.
Weaknesses:
- Struggles with background chatter that resembles speech.
- May confuse similar frequencies and temporal patterns, leading to misclassification.
![Figure 2. Comparison of audio signals before and after noise reduction. It shows how challenging it is to eliminate background noise without affecting speech.](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F1d1f7f39-a8b0-4865-8242-479f8bbfe5c8%2Fimage.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=1732cc79-102c-8006-b774-c7ce4af19ceb&cache=v2)
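To make the idea of spectral gating concrete, here is a minimal single-frame sketch in plain Python. It is not the noisereduce implementation, which works on a full spectrogram with smoothed masks. The function names, the `margin` parameter, and the synthetic hum and tone signals are all assumptions made for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (adequate for short demo frames)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_gate(frame, noise_frame, margin=1.5):
    """Zero every frequency bin whose magnitude does not exceed
    margin * (magnitude of the same bin in a noise-only frame).

    `margin` is a hypothetical tuning knob, not a noisereduce parameter.
    """
    noise_mag = [abs(c) for c in dft(noise_frame)]
    spectrum = dft(frame)
    gated = [c if abs(c) > margin * m else 0j for c, m in zip(spectrum, noise_mag)]
    return idft(gated)


# Synthetic demo: a steady low-frequency hum (noise) plus a tone standing in for speech.
N = 64
hum = [0.2 * math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
tone = [math.sin(2 * math.pi * 10 * t / N) for t in range(N)]
noisy = [a + b for a, b in zip(tone, hum)]

cleaned = spectral_gate(noisy, noise_frame=hum)
residual = max(abs(c - s) for c, s in zip(cleaned, tone))
print(f"largest remaining error after gating: {residual:.2e}")
```

The gate works here because the hum and the tone occupy different frequency bins. When the noise is other people talking, its energy falls in the same bins as the target speech, so no per-bin threshold can separate the two, which is the weakness listed above. With the actual package, the equivalent step is typically a single call such as `noisereduce.reduce_noise(y=audio, sr=sample_rate)`.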
RULE-BASED DETECTION
A threshold-based algorithm is implemented for detecting pauses after speech. A sliding window scans the predictions, and a pause is triggered after two consecutive non-speech events, reducing false transitions and making pause detection more reliable.
![Figure 3. Visualization of the rule-based pause detection. This helps the algorithm stay robust against short speech bursts or isolated events that could be misclassified.](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F44ac4efb-270e-4a95-aa44-e4738ffbf334%2Fsliding_predictions.gif%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=1732cc79-102c-80e2-beb8-c96f6e10ea04&cache=v2)
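The rule can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the exact code used in the experiment: the function name, the 1/0 label encoding, and the choice to fire only once per pause are assumptions.

```python
from collections import deque


def detect_pauses(predictions, window_size=2):
    """Return the frame indices at which a pause is triggered.

    predictions: per-frame labels, 1 = speech, 0 = non-speech (assumed encoding).
    A pause fires when the sliding window holds only non-speech frames
    and speech has been heard since the previous pause.
    """
    window = deque(maxlen=window_size)
    heard_speech = False
    pauses = []
    for i, label in enumerate(predictions):
        window.append(label)
        if label == 1:
            heard_speech = True
        elif heard_speech and len(window) == window_size and not any(window):
            pauses.append(i)
            heard_speech = False  # fire once per pause, then wait for new speech
    return pauses


frames = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(detect_pauses(frames))  # [8, 12]
```

The single non-speech frame at index 4 does not trigger a pause because it is immediately followed by speech. Only the runs of two or more non-speech frames do, which is what keeps isolated misclassifications from interrupting the user.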
SileroVAD
SileroVAD is an advanced deep learning-based VAD model that accurately separates speech from background noise, even in noisy environments. Trained on over 6,000 languages and diverse datasets, it uses Convolutional and LSTM layers to deliver exceptional performance. It processes audio in under 1 millisecond per chunk, making it highly suitable for real-time applications like voice agents.
SileroVAD does not require additional noise filtering because it is trained on noisy data, allowing it to detect speech in varied environments. It also features a tweakable threshold parameter, mimicking the rule-based approach from classical ML, enabling precise speech detection based on consecutive speech or non-speech predictions.
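The following sketch shows how a probability threshold combined with a minimum silence duration can turn per-chunk model outputs into speech segments. It does not load or call SileroVAD. The probabilities are made up, the 32 ms chunk length is an assumed value, and the function and parameter names are hypothetical rather than the library's API.

```python
def speech_segments(probs, threshold=0.5, chunk_ms=32, min_silence_ms=100):
    """Convert per-chunk speech probabilities into (start_ms, end_ms) segments.

    A chunk counts as speech when its probability is >= threshold.
    An open segment is closed only after silence has lasted at least
    min_silence_ms, so brief dips in probability do not split it.
    """
    min_silence_chunks = max(1, -(-min_silence_ms // chunk_ms))  # ceiling division
    segments = []
    start = None        # index of the first chunk of the open segment
    silence_run = 0     # consecutive silent chunks inside the open segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_chunks:
                end = i - silence_run + 1  # index of the first silent chunk
                segments.append((start * chunk_ms, end * chunk_ms))
                start = None
                silence_run = 0
    if start is not None:
        # Audio ended while a segment was open; trim any trailing silence.
        end = len(probs) - silence_run
        segments.append((start * chunk_ms, end * chunk_ms))
    return segments


probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.7]
print(speech_segments(probs))  # [(32, 160), (288, 320)]
```

Raising `threshold` makes the detector more conservative about what counts as speech, while lowering `min_silence_ms` makes it declare pauses sooner at the cost of more false pauses.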
Comparison of XGBoost vs SileroVAD
The models were evaluated on a Filipino-English phone recording using the following metrics:
- Precision and Recall: Measure the models' ability to accurately identify speech and pause segments.
- Intersection over Union (IoU): Assesses the overlap between predicted and actual segments, highlighting accuracy and penalizing over-predictions.
- Latency: Delay in correctly predicting a pause segment.
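As a rough illustration of these metrics, the sketch below computes them from frame-level labels. The article does not state how the metrics were computed, so the frame-level formulation, the 1/0 label encoding, the function names, and the sample data are assumptions.

```python
def frame_metrics(truth, pred, positive):
    """Precision, recall, and IoU for one class, computed frame by frame."""
    tp = sum(1 for t, p in zip(truth, pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(truth, pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(truth, pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # IoU counts both over-predictions (fp) and misses (fn) against the overlap.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou


def pause_latencies(truth, pred, frame_ms):
    """Delay in ms between each true pause onset and the first frame inside
    that pause where the prediction is also non-speech. Missed pauses are skipped."""
    latencies = []
    for i in range(1, len(truth)):
        if truth[i - 1] == 1 and truth[i] == 0:  # speech -> non-speech transition
            j = i
            while j < len(pred) and truth[j] == 0 and pred[j] != 0:
                j += 1
            if j < len(pred) and truth[j] == 0:
                latencies.append((j - i) * frame_ms)
    return latencies


# 1 = speech, 0 = non-speech; one label per 100 ms frame (made-up data).
truth = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]

print(frame_metrics(truth, pred, positive=0))  # (1.0, 0.4, 0.4)
print(frame_metrics(truth, pred, positive=1))  # (0.625, 1.0, 0.625)
print(pause_latencies(truth, pred, frame_ms=100))  # [100]
```

In this toy data the predictor never labels speech as non-speech, so non-speech precision is perfect, but it misses three of the five non-speech frames, which pulls recall and IoU down to 0.4. The second pause is never detected, so it contributes no latency value.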
XGBoost Model
When tested on a separate phone conversation recording, the model showed the following:
- Low Recall and IoU on Non Speech: The model often fails to classify background noise as non-speech.
- High Latency: Pause classification is delayed. With a computation time of about 100 ms, the expected latency is around 300-500 ms.
| Event | Precision | Recall | IoU |
| --- | --- | --- | --- |
| Non Speech | 95% | 60% | 58% |
| Speech | 81% | 98% | 80% |
![Figure 4. Histogram of pause prediction accuracy latency for XGBoost](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F7309bba0-93cb-4a3e-8287-e32abdcfbb4e%2Fimage.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=1732cc79-102c-807e-b608-e4452d85cd24&cache=v2)
SileroVAD
SileroVAD achieves a higher IoU on both classes and much lower latency:
- High Recall, Lower Precision on Non Speech: Aggressive in detecting pauses.
- Low Latency: Most pauses were detected in real time, with a negligible computation time of 1 ms.
| Event | Precision | Recall | IoU |
| --- | --- | --- | --- |
| Non Speech | 82% | 99% | 82% |
| Speech | 100% | 88% | 87% |
![Figure 5. Histogram of pause prediction accuracy latency for SileroVAD](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F1166ac94-f52b-4eb8-8a93-050ed5988ac3%2Fimage.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=1732cc79-102c-80d7-b4ee-c6815a971193&cache=v2)
Phone Conversation Results
In Figure 6, we compare the performance of XGBoost and SileroVAD during a phone conversation. Around the 50-second mark, the XGBoost model misidentifies background noise as speech, causing a 10-second delay in response. This highlights the limitations of XGBoost trained on limited data compared to SileroVAD’s superior performance.
![Figure 6. Plot of voice activity detection for a phone conversation. XGBoost struggles with background noise, whereas SileroVAD closely resembles the ground truth.](https://www.notion.so/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2F16ebcd74-563a-48a2-9ce7-686f30d5c337%2F2a83af44-bdb4-4707-aed0-c759d04a5667%2F273183AE-A4F1-4E52-BA20-9D878FCC71BE.png%3FspaceId%3D16ebcd74-563a-48a2-9ce7-686f30d5c337?table=block&id=1732cc79-102c-8004-ba65-e60f5f8134f1&cache=v2)
Conclusion
This experiment emphasizes the importance of accurate, low-latency Voice Activity Detection (VAD) for improving voice agent performance. Key insights include:
- Classical ML Approaches (e.g., XGBoost): Effective but limited by small training datasets, which hurts performance. They also require additional noise reduction to deal with background noise.
- SileroVAD: Excels in distinguishing speech from background noise, even in real-time conditions. It is also easy to deploy and widely applicable, as it is trained on over 6,000 languages.
The key takeaway is that in real-time applications, SileroVAD is the more reliable choice for ensuring seamless performance in dynamic, noisy environments.
References
Silero Team. (2024). Silero VAD: Pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier. GitHub Repository. Retrieved from https://github.com/snakers4/silero-vad.
Cochard, D. (2023, December 27). SileroVAD : Machine Learning Model to Detect Speech Segments. Medium. https://medium.com/axinc-ai/silerovad-machine-learning-model-to-detect-speech-segments-e99722c0dd41