How We Test Audio Quality Through Objective Analysis

May 3, 2023

Previously, we discussed the most effective way to objectively test audio quality at Aircore. But how can we use this information to actually help us maintain and improve audio quality in Aircore’s audio and video SDKs? How can we ensure that your users enjoy the kind of smooth, crystal-clear connection that makes for a great, natural-feeling in-platform communication experience?

Simply feeding audio files that have gone through Aircore’s encoding process into ViSQOL, the audio quality analysis API we use, would not be sufficient. Building a robust and flexible tool that leverages ViSQOL is paramount to making beneficial and impactful changes to Aircore.

First step: Generate a mean opinion score with Huron

Huron is an application that takes an original audio file and its degraded counterpart as inputs, and outputs its results as JSON. The JSON contains the MOS (mean opinion score), indicating the quality of the degraded file relative to the original.
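For illustration, a result might look something like this (the field names here are hypothetical, not Huron’s exact schema):

```typescript
// Hypothetical shape of a Huron result; field names are illustrative.
interface HuronResult {
  original: string; // path to the reference audio sample
  degraded: string; // path to the encoded/degraded sample
  mos: number;      // ViSQOL mean opinion score, roughly 1 (bad) to 5 (excellent)
}

const example: HuronResult = {
  original: "speech_reference.wav",
  degraded: "speech_encoded.wav",
  mos: 4.43,
};
```

The workflow of Huron is as follows: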

Huron Workflow Diagram

At the beginning of execution, the Huron main application will create an AudioController, which is responsible for returning the MOS generated by ViSQOL’s API given the original and degraded audio files passed in from the main app. The AudioController will then create two WavParsers — one for the original audio sample, and one for the degraded audio sample. Each of these WavParsers will asynchronously decode the audio into a discrete array of numeric values. The WavParsers are also responsible for resampling the audio and converting it from stereo to mono if necessary.

When each WavParser finishes decoding its audio, it will send a signal back to the AudioController. When the AudioController receives a finished signal from both WavParsers, it will trim the parsed audio data so that only the common parts of both remain. Finally, the AudioController will pass the trimmed data to ViSQOL’s API and receive the MOS as a promise. The AudioController will output the result as JSON, and then fire a signal to the main app, letting it know that everything is finished and that it can exit safely.
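As an illustration, here is a minimal sketch of that flow in TypeScript. The class and function names mirror the description above, but the bodies are stand-ins; the real Huron internals, including the ViSQOL binding, are not shown here.

```typescript
// Stub standing in for ViSQOL's API, which returns the MOS as a promise.
async function runVisqol(
  reference: Float32Array,
  degraded: Float32Array
): Promise<number> {
  // A real binding would run ViSQOL's similarity model here.
  return 4.2;
}

class WavParser {
  constructor(private path: string) {}

  // Asynchronously decode the WAV file into an array of samples,
  // resampling and downmixing stereo to mono if necessary.
  async decode(): Promise<Float32Array> {
    // ... decoding, resampling, and stereo-to-mono conversion ...
    return new Float32Array(48000);
  }
}

class AudioController {
  async computeMos(originalPath: string, degradedPath: string): Promise<number> {
    // One WavParser per file; each "signals" completion by resolving its promise.
    const [original, degraded] = await Promise.all([
      new WavParser(originalPath).decode(),
      new WavParser(degradedPath).decode(),
    ]);

    // Trim so that only the common portion of both signals remains.
    const length = Math.min(original.length, degraded.length);
    const mos = await runVisqol(
      original.subarray(0, length),
      degraded.subarray(0, length)
    );

    // Report the result as JSON; the main app can then exit safely.
    console.log(JSON.stringify({ original: originalPath, degraded: degradedPath, mos }));
    return mos;
  }
}

new AudioController().computeMos("original.wav", "degraded.wav");
```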

Now that we have a tool that allows us to generate a MOS given an original audio file and a degraded audio file as inputs, we can see how network and CPU constraints affect audio quality at Aircore.

Let’s test the effect of different network and CPU constraints

We gathered data sets on an iPhone 8 by recording audio that went through Aircore’s encoding process while applying network and CPU constraints to the device. Different results are expected under different constraints, because the encoding process takes available CPU and network resources into account before generating the encoded output. We used both one-variable and two-variable tests to observe how constraints impact audio quality alone and in conjunction with one another, and we covered test scenarios for both speech and music.
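A sweep over such scenarios might be organized like the sketch below. The constraint values and the runHuron helper are hypothetical, chosen only to mirror the structure of the tests; only the 100 Kbps severe-bandwidth case appears in the results that follow.

```typescript
// Hypothetical test matrix; undefined fields mean "unconstrained".
interface Scenario {
  name: string;
  bandwidthKbps?: number;
  packetLossPct?: number;
  cpuUtilizationPct?: number;
}

const scenarios: Scenario[] = [
  { name: "no constraints" },
  { name: "severe bandwidth", bandwidthKbps: 100 },
  { name: "moderate packet loss", packetLossPct: 10 }, // illustrative value
  { name: "severe loss + moderate bandwidth", packetLossPct: 30, bandwidthKbps: 300 }, // illustrative
];

// Hypothetical harness: apply the constraints to the device, record audio
// through the encoder, then score the recording against the original.
async function runHuron(original: string, scenario: Scenario): Promise<number> {
  return 0; // placeholder
}

async function sweep() {
  for (const scenario of scenarios) {
    const mos = await runHuron("speech_reference.wav", scenario);
    console.log(`${scenario.name}: MOS ${mos.toFixed(2)}`);
  }
}

sweep();
```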

Chart showing how different bandwidth constraints impacted audio quality score.

Although the score clearly decreases as bandwidth decreases, the scores are all very similar until the last column. Due to variance, it is entirely possible for a moderately constrained output to have a higher MOS than an unconstrained one. Audio quality only starts to noticeably decrease once the bandwidth drops below a certain threshold, but past that threshold it drops extremely quickly, as the 100 Kbps example shows.

Chart showing how packet loss impacts MOS score.

Initially, the score seems to decrease as packet loss increases. However, the moderate and severe packet loss cases have a similar average, which is likely due to variance. The 1.53 outlier in the moderate packet loss example came from a run where a very large portion of the beginning of the file was missing entirely, while the outliers in the severe packet loss example (3.71 and 3.27) came from runs where only insignificant parts of the audio were cut off.

Since the packets being lost are random, there is a lot of variance in MOS, as it is unknown whether significant or insignificant packets of the audio will be lost. However, there is still an overall trend where MOS decreases as packet loss increases.

Chart showing correlation of CPU utilization constraints and MOS score.

There seems to be only a weak correlation between CPU utilization and MOS when only audio is being transmitted. However, as CPU utilization increased, there were occasional short hiccups in the degraded audio file, which likely explains the 3.32 and 3.21 scores. Overall, even at relatively severe CPU utilization, the audio quality is only sometimes affected, and when it is, the impact is mild.

Chart showing two-variable test of audio quality in which we applied both bandwidth and packet loss constraints.

The table above is an example of a two-variable test in which both bandwidth and packet loss constraints were applied. Adding a moderate bandwidth constraint has essentially no effect on the MOS, and the combination of severe packet loss with a moderate bandwidth constraint looks very similar to the severe packet loss one-variable table. The MOS appears to gravitate strongly toward the lower of the two one-variable scores.

One might expect the bottom-right cell of the table above to score significantly lower than 1.9. However, the average is still around 1.9, because the combined score gravitates toward the lower one-variable score. This is likely because MOS is not a linear scale, so the variable that individually affects the MOS less appears almost insignificant next to the other. The other two-variable tests we performed further reinforce this idea.

Chart showing how various constraints and packet loss impact MOS score when publishing videos.

We can observe from the table above that publishing a 180×180 video has essentially no impact on the MOS, and even a 1920×1080 video has minimal impact. For the moderate bandwidth and packet loss constraints, the scores are very similar to each constraint’s one-variable MOS values.

The exception here is the CPU constraint: publishing a video under a heavy CPU constraint lowers the score further than the same constraint without video. This is likely because, when there is a video, the already limited CPU has to both render the video and process the audio. Publishing a video probably doesn’t compound the other constraints to nearly the same degree because bandwidth and packet loss constraints don’t strain the CPU.

Music

Chart showing how various constraints impact audio quality in music.

The MOS drops significantly even with no constraints when a complex music sample is passed through the encoder. As with speech, bandwidth only seems to noticeably affect the MOS when the constraint is severe.

Chart showing effect of packet loss on MOS score in music.
Chart showing effect of CPU utilization constraints on MOS scores in music.
Chart showing two-variable test of audio quality where we applied both packet loss and bandwidth constraints in music.

Overall, the music MOSs don’t drop nearly as much under constraints. This could partly be because the audio quality score is already quite low with no constraints, so adding constraints can’t make it much worse. Despite the smaller effect, the scores follow the same pattern as the speech examples, and in the two-variable tests the MOS again gravitates toward the lower single-variable score.

Having learned how different constraints affect audio quality at Aircore in different ways, we now have a baseline to refer to when evaluating whether our encoding process could be improved under specific conditions. Another application of Huron is ensuring that changes to our encoding process do not negatively impact the audio received by users.

Using Huron in Automation Tests

To ensure that we maintain our current level of audio quality, we added Huron to Aircore’s automation environment. When features are added to Aircore’s encoding process, the tests in the automation environment verify that everything is working as intended.

Each new test records an audio file that has gone through our encoding process and passes it, along with the original file, into Huron to receive a MOS. The MOS is then checked against a predefined threshold based on the results gathered in the tables above. The tests added were as follows:

Huron Automation Tests

The goal is for these tests to be highly accurate without an excessively long runtime. To balance the two, we run each test case three times; if the median score is above the expected MOS, the test passes.
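A minimal sketch of that pass rule, assuming a hypothetical runTestCase helper that records the audio, scores it with Huron, and returns the MOS:

```typescript
// Run a test case three times and compare the median MOS to the
// expected threshold; runTestCase is a hypothetical helper.
async function passes(
  runTestCase: () => Promise<number>,
  expectedMos: number
): Promise<boolean> {
  const scores: number[] = [];
  for (let i = 0; i < 3; i++) {
    scores.push(await runTestCase()); // runs are sequential on the device
  }
  scores.sort((a, b) => a - b);
  return scores[1] > expectedMos; // middle score of three = the median
}
```

Using the median rather than the mean keeps a single outlier run, like the packet loss examples above, from failing an otherwise healthy build.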

These tests serve as a baseline for Aircore’s expected audio quality under different circumstances. Testing audio quality this way lets us maintain our level of audio quality whenever changes are made to Aircore’s encoding process. Additionally, if we’re looking to improve audio quality under certain constraints, these test cases are a simple and effective way to check that those optimizations are working as intended.