This blog post turned out to be quite long, so I figured I might as well give the reader a slight heads-up before we begin. The topic of the day, as it were, is all about how we at SVT P&T conduct meaningful assessments of experienced video quality through grand quantitative user studies. For those of you who are only interested in *how* we do this, you can skip the first segment and go straight to the heading called “Methodology”. If your attention span is even shorter, and you just want to know *what* our latest user study was all about, you can skip to the third and last segment, “Delivered Master vs PAM”.
However, if you are also interested in why we decided to do this to begin with, I have written a bit of a primer to preface the basis to all of this.
Preface – Some background as to the why
To say that the subject of video quality is of utmost importance to any organisation that aims to provide a relevant video streaming service is hardly an exaggeration. Furthermore, to proclaim that the topic itself resides in a state of perpetual consideration is most certainly an understatement. In a world where the amount of digital video being streamed is growing steadily (and thus constantly encroaching on the limited physical capacity of our digital infrastructure), it is safe to say that visual quality will also stay important going forward.
But the notion of increased visual quality is not the only objective that leads to constant contemplation among these organisations. It turns out that it is equally important to figure out what any particular method proposed to achieve this quality goal entails in practice.
To be more specific, the various approaches by which one goes about increasing video quality usually boil down to two distinct categories:
- Increase the bitrate of the video
- Transcode the video in a more efficient manner
In other words, there is more than one way to go about solving the issue at hand. Obviously, the most straightforward approach is to throw more bits at the problem, and while that will not in and of itself increase the visual quality, it enhances the potential for a codec to reach a given level. In most cases, however, this method also increases the cost of distribution and thus might not, despite its inherent simplicity, be as lucrative as the second option. It turns out that instead of cranking up the virtual “amount of data” knob one can, in a way, aim for better technique.
By using more efficient, and often more complex, transcoding formulas it is possible to achieve better compression at a given bitrate. Naturally, this is harder to accomplish for many technical reasons, but it also indirectly introduces potential problems further down the chain regarding device compatibility. Needless to say, the requirements of continual reprioritisation, and the accompanying demand for problem-solving, keeps the video streaming business on its toes.
As I have mentioned in an earlier blog post, public broadcasters suffer from an unusually strenuous variant of this particular conundrum. As we are constrained by a fixed budget regardless of the number of unique viewers being served, we lack the ability to counterweight rising costs with a growing revenue from subscriptions or similar forms of income. This is made even worse by the fact that even if one manages to increase visual quality by any of the above-mentioned methods, that is still far from the “be all end all” of actually achieving a better viewer experience.
You see, one of the main challenges with the overall process of increasing video quality is the inherent difficulty of determining actual benefits for the viewer. Just because a larger amount of the original information is being retained, it does not automatically translate into a better visual representation; instead, the key is to effectively retain specific parts of the information. Thus, it is clear that in order to properly increase the video quality of a transcoding procedure, one must first find a reliable method of measuring said quality to accurately assess the results of any proposed change.
In other words: We need to be able to tell for certain if our viewers are able to actually notice a particular change we make while attempting to increase visual quality.
“How can this be?”
You might ask, further suggesting that:
“Better quality should by all means translate into a better viewer experience, right?”
Which is a very reasonable thing to ask, but unfortunately it is not that easy.
It is true that at the most basic level, the quality of any transcoded media file is determined by the amount of original information that is retained in the new arrangement of data. So, on an objective level, which is to say the mathematical representation of information, it is very easy to determine retained quality. This however, does not hold true on a subjective level.
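To make that objective level concrete, a metric like PSNR can be sketched in a few lines of Python. This is a toy version over flat pixel sequences for illustration only; real measurements operate per frame over full images, and this is not our actual tooling:

```python
from math import log10

def psnr(orig, recon, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel
    sequences; higher means the reconstruction is mathematically
    closer to the original."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    if mse == 0:
        return float("inf")  # identical signals, no information lost
    return 10 * log10(max_val ** 2 / mse)
```

A lossless copy scores infinitely high, and every bit of discarded information pulls the score down, which is exactly why such a metric says nothing about which discarded bits a human would actually miss.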
The human psychovisual system is remarkable in its own right; everything from our ability to detect hidden features to our knack for recognising complex patterns is arguably something we have only just begun to truly appreciate with the advent of computer vision. Trying to discern the ins and outs of such a system calls for a multidisciplinary approach far outside the scope of digital video compression. Thus, instead of trying to predict exactly how a viewer might experience the visual quality of a video, we figured that it is better to find out directly through user tests. The main reasoning behind this decision is easy to comprehend as soon as one realises precisely what we want to accomplish. After all, considering that we deliver a finished video stream to an end user, we should not actually put too much of an emphasis on any objective quality metric. Instead, we should focus on the subjective experience of the viewers watching our content.
Having established then, that we want to objectively measure subjective experience of visual quality, a completely new set of questions emerge. How does one go about determining subjective quality? And given that a way is found to do the former, how does one go about proving such data points objectively? Now this is where it all gets interesting!
You see, at SVT P&T we utilise a method that we like to call: SVT SCACJ
Methodology – An insight into how it is done
As some of you might know SCACJ (Stimulus Comparison Adjectival Categorical Judgement), is a methodology for the subjective assessment of the quality of television pictures (described in ITU-R BT.500-11). SVT SCACJ on the other hand, is our own somewhat similar variant of this standardised method.
Enough with the abbreviations and acronyms, how does it actually work?
Say that you are interested in the rather classical case of wanting to know which one of two specific encoding methods provides better visual quality, given some arbitrary constraints. In order to test this, you would start off by choosing a video source file and then encode that file into two new video files, let us call them “A” and “B”, using the two different encoding techniques. Then you would decide upon a procedure of somehow pitting A and B against each other to determine which technique actually provided the best visual quality. SVT SCACJ enables you to do just that, in an objectively convincing manner, using data gathered from subjective judgements of experience.
SVT SCACJ is a double-blind AB test where the user is presented with a predetermined number of video pairs, each occupying half of the screen. The choice of which pair is shown is in turn randomised between the following combinations: AB, BA, AA and BB. It is double blind in the sense that neither the tester nor the user knows which one of the various combinations will be presented.
The four possible combinations per video instance
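As an illustration, the randomisation step can be sketched as follows (the function name and seed handling are ours for this example, not our actual implementation):

```python
import random

COMBINATIONS = ["AB", "BA", "AA", "BB"]

def assign_combinations(n_instances, seed=None):
    """Pick one of the four side-by-side combinations uniformly at
    random for each video instance, so that neither the tester nor
    the user knows in advance what will be shown."""
    rng = random.Random(seed)
    return [rng.choice(COMBINATIONS) for _ in range(n_instances)]

# One user's session of ten instances
session = assign_combinations(10, seed=42)
```

Note that the AA and BB pairs are just as important as the mixed ones: they let you check whether users report differences that cannot exist.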
The user then rates the video-pair along the following scale:
-3 -2 -1 0 1 2 3
- Negative values mean that the video to the left has higher visual quality
- Positive values mean that the video to the right has higher visual quality
- Zero means that the user cannot tell a difference
The sampled data is then gathered and grouped as follows: for each user, for each video instance, we store the presented combination and the user’s submitted score. Once an ample number of users have conducted the test, various forms of statistical analysis can be applied to determine an objective result at some chosen level of significance. Thus far, we have always chosen to utilise F-tests in the form of single- and two-factor ANOVA.
Summary overview of our 2-way ANOVA + T-test Compound Error Correction
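To give a flavour of the single-factor case, here is a minimal sketch that computes the ANOVA F statistic by hand. The grouping and scores below are entirely hypothetical, and in practice one would reach for a proper statistics package rather than this toy:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a single-factor ANOVA: variance between the
    group means divided by the variance within the groups."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - mean(g)) ** 2 for v in g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores grouped by the presented combination
scores_by_combo = {
    "AB": [2, 1, 3, 2],
    "BA": [-2, -1, -3, -2],
    "AA": [0, 0, 1, -1],
    "BB": [0, 1, 0, 0],
}
f_stat = one_way_anova_f(list(scores_by_combo.values()))
```

A large F relative to the critical value of the F distribution at the chosen significance level indicates that the presented combination genuinely influenced the scores.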
To make this digestible, let us use some actual numbers.
Say that you want to create an SVT SCACJ test where the user watches 10 video instances comparing encoding techniques A and B. To create these 10 video instances, one starts off by choosing 10 different source files; these are in turn used to create 10 encoded videos of alternatives A and B respectively, adding up to a total of 20 encoded videos. These 20 videos are then combined into the four pairs mentioned earlier, one set for each instance, bringing the total to 40 unique combinations. To be crystal clear: a given user will only ever watch 10 videos in succession, corresponding to the original 10 source files, and for each instance they can be presented with any one of the four combinations.
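The bookkeeping above can be sketched as follows (the clip names are placeholders for this example):

```python
from itertools import product

combinations = ["AB", "BA", "AA", "BB"]
sources = [f"clip_{i:02d}" for i in range(1, 11)]  # 10 source files

# Every (source, combination) pair that can be shown: 10 x 4 = 40
# unique side-by-side videos, even though any single user only ever
# sees one combination per source, i.e. 10 videos in total.
test_matrix = [(src, combo) for src, combo in product(sources, combinations)]
```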
Delivered Master vs PAM – What we have actually tested this time around
In our ongoing effort to improve the visual quality of video content for Online we tend to focus exclusively on the pivotal transcoding step. By improving our compression process, whilst also retaining acceptable transcoding times and sufficient decoder compatibility, we incrementally increase the quality in small steps. This time around we made an effort to explore potential quality improvements on a grander scale, looking at the whole media flow chain. Almost immediately we realised that the main production pipeline contained some very low-hanging fruit in terms of improvement, and we chose to start off with the most obvious: the handling of our input files.
Currently, the whole production asset management (PAM) system at SVT is designed to primarily accommodate the strict demands of professional broadcasting. This in and of itself is nothing strange, SVT has been conducting traditional broadcasting since the mid-fifties, as opposed to OTT streaming for example, which SVT has only really been working with for the past decade.
Overview of how we bypass the PAM, I assure you that it is not this simple in reality
One of the many consequences of this setup, however, is that every source file entering the PAM system is first and foremost transcoded into a standardised broadcast format, internally called a house format. By its nature, this transcoding step is almost always mathematically destructive, in the sense that the delivered Master file usually contains more information than the house format has capacity to contain. As one might understand, this in turn means that this first transcode could hypothetically introduce some loss of quality that actually propagates through the PAM, into our Online transcoders, and thus has an adverse effect on the end result. Going by that logic, if the lossy PAM transcode introduced a significant loss of quality, it would stand to reason that we could gain an increase in quality by using the Master file instead.
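To see why such propagation is plausible, consider a toy model where each lossy transcode is represented by rounding samples to a coarser grid. Real codecs are vastly more sophisticated, but the principle of accumulated generation loss is the same:

```python
def quantise(samples, step):
    """Toy stand-in for a lossy transcode: snap each sample to the
    nearest multiple of `step`, discarding fine detail."""
    return [round(s / step) * step for s in samples]

master = [3, 7, 12, 18, 25]              # pretend pixel values in the Master
house = quantise(master, 5)              # destructive PAM/house-format step
online_via_house = quantise(house, 4)    # Online encode of the house format
online_via_master = quantise(master, 4)  # Online encode straight from the Master

def total_error(reference, candidate):
    """Sum of absolute deviations from the reference signal."""
    return sum(abs(a - b) for a, b in zip(reference, candidate))
```

In this toy model the encode made straight from the Master ends up closer to it than the one routed through the intermediate format, which is precisely the kind of effect we wanted to test for.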
To find out whether or not this was actually the case, we conducted yet another one of our large user studies, using SVT SCACJ. In essence, we asked the public to help us decide whether there was a subjectively noticeable difference in the encoded material depending on which of the two possible source files had been used. In the end, their opinion on this matter is what actually matters after all. In order to conduct this study, we chose to use footage from one of the larger drama productions we were currently broadcasting, called Andra Åket.
The left image is the transcoded PAM-file, and the right image is the Master file delivered by the post production company
Although the mathematical difference between the two files is a given, it might not be equally obvious to the uninitiated that there is a visual difference as well. The above frame is a perfect example of this, and to make this crystal clear here is a zoomed in example:
Notice the difference in quality between these two stills, especially around the teeth
This choice of content turned out to be truly spot on for our purposes since the ten video instances, picked from scenes out of a total of four episodes, were enough to provide us with nigh on perfect test footage. As one might imagine, it was important to choose video material in such a way that each instance would stress a particular component / sub-process of the encoder, since some steps in the transcode might be more affected than others.
Thus, each video corresponded to one of the following:
- Spatial integrity (quality)
- Structural integrity luma
- Structural integrity chroma
- Temporal motion estimation
- Temporal coding of high contrast
- Shifting depth of field
- Psychovisual operations
It should be noted straight away that we did not control or filter the user group; the test was simply accessed through a link on the SVT Play site and no advertisement was made. In other words, the group was in all but name self-selecting, which traditionally is frowned upon for good reason. To counterbalance this, we calculated that we would need n > 2020 samples in order to consider the group random. This extra measure was not strictly needed in this case, but we liked the idea of having a sufficient number of samples to actually call the group “representative of the online population”.
Utilising a dedicated test site operating SVT SCACJ we ran the test continuously for three months. When those three months had passed, it turned out that we had gathered more than 5000 concluded test results (which was about twice as many samples as we needed to comfortably reach statistical significance). We normalised the data, taking into account variables such as resolution, OS, OS-version and so on, and ran our rigorous statistical analysis. This is what we discovered:
Viewers were able to distinctly differentiate between the two alternatives, preferring the video which had been encoded using the Master File as input.
The results themselves were actually quite shocking, especially in terms of statistical significance.
Our null hypothesis was that the users would not be able to tell a difference, and using an alpha level of 0.01, we calculated a p value of 0.00034472.
For anyone not fluent in statistics, this means that we are roughly 99.9% sure that the viewers did notice a difference in quality. Someone with insight into how statistics work, on the other hand, would quickly note that the above statement does not actually tell us which one of the alternatives the viewers preferred. Although such a sceptical objection would, if made, be entirely true, let me assure you that this second part was very easy to affirm once we had our initial results.
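For the curious, affirming that second part is conceptually simple: flip the sign of scores from mirrored presentations so that positive always refers to the same alternative, then test the mean against zero. Here is a minimal sketch; the helper names and sign convention are ours for this illustration, not our production analysis:

```python
from math import sqrt
from statistics import mean, stdev

def normalise(score, combo):
    """Flip the sign for BA presentations so that a positive value
    always means the same alternative was preferred, whichever side
    of the screen it appeared on."""
    return -score if combo == "BA" else score

def t_statistic(scores):
    """One-sample t statistic against the null hypothesis that the
    mean score is 0, i.e. that users cannot tell a difference."""
    n = len(scores)
    return mean(scores) / (stdev(scores) / sqrt(n))
```

The sign of the resulting mean then tells you which alternative was preferred, while the magnitude of the t statistic tells you how confidently.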
Now, one might think that we made a mountain out of a molehill with this particular study. After all, should it not be fairly obvious that a higher quality input file would retain more information, and thus translate into a higher quality output file upon transcode? The honest answer, however, is actually: “No”. Even though we can easily determine an objective difference through simple PSNR/SSIM measurements, this is ultimately irrelevant if the end user (i.e. the viewer) cannot see the improvement. Thus, relinking this to prior parts of this blog post: in order to know what is worthwhile we need to test it properly, making an objective assessment of subjective quality measurements, and on that final note:
All in all, the study gave credence to the notion that we have something to gain by improving the way we handle mezzanine files in our PAM pipeline. The larger body of work remains, however. It is far from obvious how we would go about changing our current workflow in a way such that more quality is retained from the Master file, whilst still keeping the current broadcast systems operational.
We have some ideas obviously, but that is a topic for another post.