As a company whose users are constantly making calls and watching videos, Aircore benefits enormously from being able to objectively analyze the audio quality a user receives. A tool that automates audio quality testing lets us easily observe how changes to our encoding process affect audio quality under various constraints, such as packet loss.
This kind of tool lets us verify how our changes affected audio quality, rather than trying to judge the output by ear. And better audio means a better user experience.
Here’s how Aircore runs audio quality tests on internal test data, ensuring crystal clear audio during in-app communication.
Want to hear it for yourself? Try a call for free with one of Aircore’s SDKs.
The numerous models for audio quality testing each have their pros and cons, and the best analysis method may be different in each use case. The most common type of audio analysis model is the media layer model. A media layer model is one that takes audio signals as inputs.
Other analysis models, which operate on inputs such as packet headers or network parameters rather than the decoded signal itself, are generally computationally cheaper than media layer models.
However, since precision is the most important factor for Aircore, a media layer model is the optimal choice.
Full-reference model: analyzes a decoded audio file relative to the original audio file. The full-reference model has seen the most research and development of any measurement model and is very accurate.
Reduced-reference model: analyzes a decoded audio file using features extracted from the original sound. In practice, this method is usually used only when access to the entire original audio sample is unavailable.
No-reference model: analyzes a standalone audio file and requires no input from an original audio file. It extracts distortion that originates from sources other than the human voice (e.g., network constraints). This is not as accurate as a full-reference model.
Fortunately, Aircore will have access to the entire original audio sample. Additionally, we only care about the difference in quality between the two audio samples, rather than the standalone audio quality: if the publisher sends out low-quality original audio to begin with, there is nothing our encoding process can do to improve the subscriber’s experience. What we can control is keeping the subscriber’s audio quality as close as possible to the original input, and that is how we give the subscriber the best possible experience.
When choosing an audio analysis method, we must also consider that audio quality tests differ for telephony-type audio (speech) and for high-fidelity audio (such as music).
For Aircore, telephony audio is more important, as real-time voice and video chat is a signature feature of the Aircore experience. However, high-fidelity audio is still relevant, as there are scenarios where audio other than speech is the focal point, such as watching a video of a concert. Ideally, we want a method whose analysis can account for both types of audio.
Full-reference audio analysis methods generally return a MOS (mean opinion score) between one and five to rate the quality of the decoded audio. Although there have been several full-reference audio analysis methods for telephony-type audio, two currently stand out as potentially the best.
POLQA has been the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) recommendation since 2011. It is the successor to PESQ, the previous ITU-T standard for full-reference audio analysis. POLQA compares the differences between the original and decoded signals using a perceptual psycho-acoustic model based on models of human perception, and it assumes a temporal alignment of the original and decoded signals.
ViSQOL is a more recent but similar audio quality testing method, developed by Google and released as open source. It uses a spectro-temporal measure of similarity between a reference and a test speech signal to produce a MOS.
The charts above show how ViSQOL performs compared to POLQA at different bitrates. The y-axis shows the MOS, and the x-axis shows the algorithmic complexity of the Opus encoding, the codec Aircore’s audio is encoded with. It’s worth noting that the complexity is set to 10 (the maximum) by default, and most modern devices can handle the CPU cost of running at maximum complexity.
We can see that POLQA is more sensitive to changes at lower bitrates, while the original ViSQOL is more sensitive to changes at higher bitrates. Although there is no subjective data for this dataset, ViSQOL’s developers expected the MOS to be less sensitive at higher bitrates, meaning POLQA matches that expectation better than the original ViSQOL, and behaves similarly to ViSQOL v3, the version we would be using.
The above chart shows the correlation coefficient and standard error of each audio analysis method when compared to subjective scores from a database of audio files. The NOIZEUS database focuses on audio files with background noise (e.g., cars driving by), while E4 focuses on IP degradations such as packet loss and jitter. We can see that PESQ performs best on the NOIZEUS database with ViSQOL close behind, while POLQA performs best on E4 with the other two performing slightly worse.
Although PESQ seems to be the best overall choice, several factors make it a non-viable option compared to the other two. POLQA is tuned to respect modern codec behavior such as error correction, while PESQ is not. PESQ cannot evaluate speech above 7kHz, yet multiple codecs deliver up to 8kHz of audio bandwidth in wideband mode.
Lastly, PESQ cannot properly resolve time-warping and will produce MOS values far lower than expected. POLQA and ViSQOL perform quite similarly overall: ViSQOL performs better on the NOIZEUS database, and POLQA performs better on E4.
More importantly, ViSQOL is an open-source library with C++ compatibility, whereas POLQA is a proprietary tool used primarily in the telecommunications industry. For telephony-type audio, both would be viable choices, but ViSQOL is more accessible.
Additionally, ViSQOL has a speech mode as well as an audio mode that can be used for high-fidelity audio. The only other tool for general audio quality analysis is PEAQ (Perceptual Evaluation of Audio Quality). Comparing ViSQOL and PEAQ, the difference in performance at lower bitrates still stands, as PEAQ would struggle more than ViSQOL.
All in all, ViSQOL seems like the best overall choice for a full-reference audio quality test. It performs extremely well, is the most accessible, and is the only tool capable of analyzing both telephony and high-fidelity audio, so we wouldn’t have to use two different tools simultaneously.
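As a rough sketch of how this could fit into a test harness, the snippet below wraps a locally built visqol binary from the open-source repo. The flag names and the MOS-LQO output line follow that repo’s documentation; the binary path and the parsing details are assumptions you would adapt to your own build.

```python
import subprocess

def visqol_mos(reference_path: str, degraded_path: str,
               speech_mode: bool = False,
               visqol_binary: str = "./visqol") -> float:
    """Run a locally built ViSQOL binary and parse the MOS from its output."""
    cmd = [
        visqol_binary,
        "--reference_file", reference_path,
        "--degraded_file", degraded_path,
    ]
    if speech_mode:
        # Speech mode expects 16 kHz input and maps NSIM to a speech MOS.
        cmd.append("--use_speech_mode")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # The binary prints a line like "MOS-LQO: 4.123"; grab the number.
    for line in result.stdout.splitlines():
        if "MOS-LQO" in line:
            return float(line.split(":")[-1])
    raise RuntimeError("Could not find a MOS-LQO line in ViSQOL's output")

# Audio (high-fidelity) mode is the default; speech mode is opt-in:
# mos = visqol_mos("original.wav", "decoded.wav", speech_mode=True)
```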
The system diagram for ViSQOL is shown below:
The first step is the global alignment of the two audio signals. We then create spectrogram representations of the signals and divide the reference signal into patches for comparison. For each reference patch, the position in the degraded signal that yields the maximum NSIM similarity score is the one that will be used. ViSQOL then handles time warp by temporally warping the spectrogram patches. Time warp is when a degraded patch is shorter or longer than its reference patch (typically by 1% to 5%), due to “compression” or “stretching” of the signal.
If a warped version of a patch has a higher similarity score, that score is used for the patch, because NSIM is more sensitive to time warping than a human listener is. The NSIM scores are then passed into a mapping function, which produces a MOS.
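To make the patch-comparison step concrete, here is a toy sketch of the idea rather than ViSQOL’s actual implementation: it builds plain STFT spectrograms (where ViSQOL uses a gammatone spectrogram), slices the reference into fixed-width patches, and slides each patch across the degraded spectrogram to find its best-matching position. A simple normalized dot product stands in for NSIM, and the time-warping and MOS-mapping steps are omitted.

```python
import numpy as np
from scipy.signal import spectrogram

def best_patch_similarities(ref: np.ndarray, deg: np.ndarray, fs: int,
                            patch_frames: int = 30) -> list[float]:
    """Toy illustration of ViSQOL-style patch alignment.

    A plain STFT spectrogram stands in for ViSQOL's gammatone
    spectrogram, and a normalized dot product stands in for NSIM.
    """
    _, _, s_ref = spectrogram(ref, fs)
    _, _, s_deg = spectrogram(deg, fs)

    scores = []
    for start in range(0, s_ref.shape[1] - patch_frames + 1, patch_frames):
        patch = s_ref[:, start:start + patch_frames]
        # Slide the reference patch across the degraded spectrogram and
        # keep the position with the maximum similarity score.
        best = 0.0
        for pos in range(s_deg.shape[1] - patch_frames + 1):
            cand = s_deg[:, pos:pos + patch_frames]
            denom = np.linalg.norm(patch) * np.linalg.norm(cand)
            if denom > 0:
                best = max(best, float(np.sum(patch * cand) / denom))
        scores.append(best)
    # ViSQOL would aggregate the per-patch NSIM scores and map them to a
    # MOS; here we simply return the raw per-patch similarities.
    return scores
```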
One problem that we must tackle is audio alignment. What would happen if our original audio file contained 10 seconds of audio, but our degraded audio file contained five seconds of audio? Would we use the first five seconds of the original audio, the last five seconds, or somewhere in between for comparison? We would need a method to align the two audio files, such that only the common portions of each file are passed into ViSQOL for comparison.
Although there are plenty of methods to find the delay between two audio files, cross-correlation seemed like the best option. Alternatives include convolution and autocorrelation, but cross-correlation is the best fit for our use case: autocorrelation measures the similarity of a signal with a time-lagged copy of itself, and convolution time-reverses one of its inputs, whereas cross-correlation measures the similarity between two different signals even when they are not identical when lined up. Since WAV files consist of periodic samples of an analog sound wave, the cross-correlation needs to be computed discretely.
The general cross-correlation formula for discrete functions is as follows:
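$$(f \star g)[n] \;=\; \sum_{m=-\infty}^{\infty} \overline{f[m]}\, g[m+n]$$

Here $\overline{f[m]}$ denotes the complex conjugate of $f[m]$ (for real-valued audio samples, $\overline{f[m]} = f[m]$), and $n$ is the lag between the two signals.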
Essentially, to find the cross-correlation at a given lag n, we compute the sum of f[m] · g[m + n] over every sample m of the array. To find the offset at which our audio signals align, we must compute the cross-correlation for every possible lag and take the lag with the maximum value.
Here’s an example of how cross-correlation works:
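The toy Python example below, with made-up sample values, computes the correlation at every possible lag and picks the lag where the two signals line up best:

```python
import numpy as np

# Made-up signals: g's content occurs two samples earlier than f's.
f = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0])
g = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])

# Brute-force cross-correlation: for every lag n, sum f[m] * g[m + n].
lags = list(range(-(len(g) - 1), len(f)))
corr = []
for n in lags:
    total = 0.0
    for m in range(len(f)):
        if 0 <= m + n < len(g):
            total += f[m] * g[m + n]
    corr.append(total)

best_lag = lags[int(np.argmax(corr))]
print(best_lag)  # -2: the signals align once g is shifted right by two samples
```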
We can observe that whether we correlate f against g or g against f, the peak correlation value is the same; only the sign of the resulting lag flips. For simplicity’s sake, we will therefore always pass the degraded file in as g. When our audio files are not of equal length, we can add padding to the shorter file: zero padding is a common method for aligning audio files of unequal length, as the cross-correlation algorithm expects both signals to have the same length.
Although there are several ways to implement cross-correlation, the optimal method for finding the delay between two audio files is to use the Fast Fourier Transform. The cross-correlation integral is equal to the convolution integral if one of the input signals is conjugated and time-reversed, so we can take the Fourier transform of both signals, multiply the conjugate of one spectrum by the other, and take the inverse Fourier transform of the product to get the cross-correlation between the two signals.
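Here is a minimal sketch using SciPy, where scipy.signal.correlate with method="fft" performs exactly this conjugate-reverse-and-transform trick under the hood; the sign convention (negative delay means the degraded signal leads) is our own choice:

```python
import numpy as np
from scipy.signal import correlate

def find_delay_samples(original: np.ndarray, degraded: np.ndarray) -> int:
    """Return the delay of `degraded` relative to `original`, in samples.

    Negative means the degraded signal leads the original; positive
    means it lags. Unequal lengths are fine: mode="full" evaluates
    every possible alignment of the two signals.
    """
    corr = correlate(degraded, original, mode="full", method="fft")
    # In "full" mode, index len(original) - 1 corresponds to zero lag.
    return int(np.argmax(corr) - (len(original) - 1))

# Converting to seconds just divides by the sample rate:
# delay_s = find_delay_samples(orig_samples, deg_samples) / sample_rate
```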
However, finding the delay is only the first step of audio alignment. The next step would be to cut off non-common parts of both signals using the delay. For example, if we have an original signal that is 8s long, and a degraded signal that is 9s long, but the degraded signal has a delay of -3s, which parts of which signals do we cut off? Below is a visual of how it would work.
1. We start with the two unaligned signals and use cross-correlation to find the delay, in this case -3s.
2. Once we find the delay of -3s, we move the degraded signal 3 seconds to the right in the time domain.
3. We must now cut off the non-common parts of each signal: the first three seconds of the original signal and the final four seconds of the degraded signal. Finally, the last five seconds of the original signal and the first five seconds of the degraded signal are passed into ViSQOL, which returns a score.
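A sketch of this trimming step, reusing the delay convention from the hypothetical find_delay_samples above (degraded[t] lines up with original[t - delay]):

```python
import numpy as np

def trim_to_common(original: np.ndarray, degraded: np.ndarray,
                   delay: int) -> tuple[np.ndarray, np.ndarray]:
    """Trim both signals to their common (overlapping) region.

    `delay` is the offset of the degraded signal relative to the
    original, in samples: degraded[t] lines up with original[t - delay].
    """
    start = max(0, delay)                            # first common degraded sample
    end = min(len(degraded), len(original) + delay)  # one past the last common sample
    if end <= start:
        raise ValueError("The signals have no overlap at this delay.")
    return original[start - delay:end - delay], degraded[start:end]

# The article's example, at a 1 Hz "sample rate" for readability:
# an 8s original, a 9s degraded signal, and a delay of -3s.
orig = np.arange(8.0)
deg = np.arange(9.0)
o, d = trim_to_common(orig, deg, delay=-3)
print(len(o), len(d))  # 5 5: last 5s of the original, first 5s of the degraded
```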
With this, we now have all the tools needed to conduct objective audio quality tests at Aircore. In an upcoming post, we’ll dive deeper into how we made this work. In the meantime, try one of Aircore’s SDKs for free here.