We've used the same methods to compress video for nearly thirty years. The next change will be a big one.
I think it's time for a change. "Paradigm" might be a word that's used far too often, and mostly inaccurately, but a paradigm is exactly what we need a change of. I don't want to seem ungrateful, because, within the current paradigm, codecs have done a pretty good job. I mean, I can watch 8K videos over my telephone line. That's an achievement by any measure. It almost shouldn't be possible, and it wouldn't be without a stack of interdependent techniques which, layer upon layer, bring us fast broadband and beautiful pictures.
So why is it time for a change? I'd say for two reasons. First, the current codec paradigm is based on mathematics done with pixels. Pixels aren't a natural fit for the world we live in, nor for our brains - or, I should say, our cognitive processes. Second, codecs know nothing about what it is that they're compressing. Does that sound strange to you? I'll explain.
Codecs are content-agnostic: they don't care whether they're working on a clip of a courtroom drama or a surgical procedure. Nor do they care if the clip is about a swan in flight or a boxing match. They're all just pixels. The only sense in which codecs care about content is when that content is difficult to compress: a forest in a snowstorm, say, or a horse race with a "difficult" background. The first of these examples has too much randomness (it's hard to distinguish between a flurry of snowflakes and random noise), and the second simply has too much rapidly changing information. But neither of those concerns has anything to do with what the content is "about".
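You can see the first of those problems with nothing more than a general-purpose lossless compressor. The sketch below uses Python's zlib purely as a crude stand-in for the prediction and entropy-coding stages of a real video codec (an assumption for illustration only): a smooth, predictable frame shrinks dramatically, while a noise-like "snowstorm" frame barely shrinks at all.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# A smooth gradient: highly predictable, like a clear sky.
smooth = np.tile(np.arange(256, dtype=np.uint8), (256, 1))

# Uniform noise: statistically similar to dense falling snow on screen.
noise = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)

for name, frame in [("smooth gradient", smooth), ("random noise", noise)]:
    raw = frame.tobytes()
    packed = zlib.compress(raw, 9)
    print(f"{name}: {len(raw)} bytes -> {len(packed)} bytes "
          f"({len(packed) / len(raw):.0%} of original)")
```

The gradient compresses to a tiny fraction of its size; the noise stays close to 100%, because pixel-level maths has nothing it can predict. Neither result depends in any way on what the frames depict.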
How's AI going to help here? AI and its close cousin, machine learning, aren't trivial to understand, but by looking at some examples, it's not too hard to see how they might be useful to us.
There are already examples of AI at work on real-time video. For a while now, Nvidia has been showing a video-calling application that can make the participants seem as though they're looking at the camera, even when they're not. It's good enough not to draw attention to itself.
Alongside that, AI can now produce photorealistic images based on spoken commands. "Show me a penguin building a quantum computer" will indeed create a picture of an aquatic Antarctic bird designing plumbing for qubits.
For several years, a website called "thispersondoesnotexist.com" has been producing photorealistic images of people it has essentially just made up. Each time you refresh, you're presented with another "person". None of them exists. They're the product of a massively trained AI that just "knows" what people tend to look like.
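The mechanism behind that site (reportedly a large generative adversarial network in the StyleGAN family) boils down to something surprisingly simple at inference time: draw a random vector and push it through a trained generator, whose weights hold the "knowledge" of what faces look like. The PyTorch sketch below is only a toy - the generator here is tiny and untrained, so its output is meaningless - but the calling pattern is the same.

```python
import torch
import torch.nn as nn

# Stand-in for a trained face generator such as StyleGAN (hypothetical here:
# this toy network is untrained, so its output is noise, but the usage pattern
# mirrors the real thing).
class TinyGenerator(nn.Module):
    def __init__(self, latent_dim=128, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * image_size * image_size),
            nn.Tanh(),
        )

    def forward(self, z):
        x = self.net(z)
        return x.view(-1, 3, self.image_size, self.image_size)

generator = TinyGenerator()

# "Refreshing the page": draw a fresh latent vector and decode it.
# Every sample is a new "person" who was never photographed.
z = torch.randn(1, 128)
with torch.no_grad():
    fake_image = generator(z)   # shape: (1, 3, 64, 64)
print(fake_image.shape)
```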
New tools like Google's Imagen take the art of "imagining" a photorealistic scene still further. And as the techniques improve - rapidly, it seems, and inevitably - they work at higher and higher resolutions. I can't help wondering: if they're so good at making photorealistic images of things that have never existed, what would happen if you asked the same kind of AI to make the best possible picture of something you're currently showing it?
Wait, what? Does that even make sense? Yes, completely. There's a word for it: resynthesis.
With resynthesis, you take something that already exists and rebuild it from scratch, to match or exceed the fidelity of the original. AI would use the same methods it employs to build fantasy images and apply them to real ones. This poses an obvious question: what's the point? The answer is that AI doesn't "think" in pixels. Instead, it harnesses the power of conceptual patterns. The more detailed the patterns, the better the results, but the representation still isn't made of pixels. My expectation is that if you train an AI with enough experience of the natural world - or, more to the point, of what things are supposed to look like in the real world - then it should be able to reproduce an existing scene based on its "knowledge" of what the essence of that scene must be.
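To make the storage implication concrete, here's a deliberately minimal autoencoder sketch in PyTorch (my own illustration, not a description of any shipping system, and untrained, so its reconstruction is meaningless): the frame is squeezed into a compact latent description, and the decoder regenerates pixels from that description alone. A real learned codec would be trained on enormous amounts of footage and use a far richer architecture, but the division of labour - store the description, resynthesise the pixels - is the point.

```python
import torch
import torch.nn as nn

# Toy resynthesis "codec": untrained and purely illustrative. A real system
# would learn, from vast training data, what scenes are supposed to look like.
class ToyResynthesisCodec(nn.Module):
    def __init__(self, image_size=64, latent_dim=64):
        super().__init__()
        pixels = 3 * image_size * image_size
        self.image_size = image_size
        # Encoder: squeeze the frame down to a compact description.
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(pixels, latent_dim))
        # Decoder: rebuild the frame from that description alone.
        self.decode = nn.Sequential(nn.Linear(latent_dim, pixels), nn.Sigmoid())

    def forward(self, frame):
        latent = self.encode(frame)        # what gets stored or transmitted
        rebuilt = self.decode(latent)      # pixels regenerated at the far end
        return rebuilt.view(-1, 3, self.image_size, self.image_size), latent

codec = ToyResynthesisCodec()
frame = torch.rand(1, 3, 64, 64)           # stand-in for a real video frame
rebuilt, latent = codec(frame)
print(f"original: {frame.numel()} values, stored: {latent.numel()} values")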
An AI codec won't have a "resolution" as such. Like PostScript, it will be able to output at any resolution. Its frame rate will be "any frame rate". The advantages will be huge and transformative. Several paths are leading in this direction already. One of them is computational optics. Sony recently predicted that by 2024, smartphone cameras will outperform DSLRs (or their mirrorless equivalents). So much of computational optics is just number crunching, but increasingly, it's going to be based around AI. At which point - and I'm not even sure you'd still call it a codec - your video will be stored as concepts rather than pixels. It could be the biggest change since the dawn of digital video.
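One existing hint at what "no resolution as such" could look like is the family of coordinate-based (implicit) neural representations, where a network maps an (x, y, t) coordinate to a colour and the "video" is just the network's weights. The sketch below is untrained and purely illustrative, but it shows why such a representation has no native resolution or frame rate: you simply query it wherever, and whenever, you like.

```python
import torch
import torch.nn as nn

# Implicit "video as a function": an MLP that maps (x, y, t) coordinates to an
# RGB value. Untrained here; once trained, coordinate networks of this kind
# act as a representation with no fixed resolution or frame rate.
class VideoAsFunction(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):                 # coords: (N, 3) = (x, y, t)
        return self.net(coords)                # RGB for each query point

def render(model, width, height, t):
    """Sample the same model at whatever resolution the display asks for."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, height), torch.linspace(0, 1, width), indexing="ij"
    )
    coords = torch.stack([xs, ys, torch.full_like(xs, t)], dim=-1).view(-1, 3)
    with torch.no_grad():
        return model(coords).view(height, width, 3)

model = VideoAsFunction()
print(render(model, 320, 180, t=0.5).shape)    # a small preview frame
print(render(model, 1280, 720, t=0.5).shape)   # the same "video", queried at HD
```

The same weights answer the 320x180 query and the 1280x720 query; scaling up is a matter of asking more questions, not of storing more pixels.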