Video game graphics are now good enough to train AI systems without reference to the real world

Written by Phil Rhodes | Sep 17, 2017 6:00:00 PM

There's a lot going on with artificial intelligence learning

RedShark at IBC 2017: AI will take over the world, and cameras are now all very good. This might seem like a statement of the obvious, but a measure of how far things have come can be seen by the latest developments. Phil Rhodes gives us his roundup of day three at IBC.

There’s been a lot of talk of artificial intelligence and machine learning at IBC this year, although a good proportion of that talk has been people mentioning the fact that it’s being frequently mentioned. It’s impossible to prove that we haven’t missed anything, but so far actual working examples of these sorts of technologies are clustered quite heavily around speech-to-text transcription – that is, taking the spoken word and turning it into text. Naturally, that’s something that’s been attempted for decades and which has probably only become genuinely useful in the last few years, so it’s been a vexed topic for some time.

Any new application of tech that improves it further is something we’d all welcome, especially if we’re a news organisation with a need to make a mass of incoming material into something that can be searched for keywords. If I were a politician, I might find this a rather alarming prospect, since the current status quo is probably quite reliant on the fact that human beings struggle to muster a sufficiently good memory to pick each other up on every single inconsistency of opinion or statement.

Making every word Googleable, even just within the confines of one particular current affairs broadcaster, is potentially dangerous if you’re in front of the camera, and fantastically powerful if you’re sitting at a desk trying to make someone look like an idiot; it’s an example of the classic realisation that almost nobody’s character could withstand such an assault. Either way, that’s what’s likely to happen, given that apparently some speech-to-text engines can deal with new material something like four times faster than real time.

Probably the most impressive AI demo given today, however, was one concerned with image recognition, particularly the ability to pick human beings out of an arbitrary image. At risk of emphasising style over substance, the graphics had been designed to simulate scenes from The Terminator in which the Arnold’s-eye-view outlines people in white over a red-tinted world. Memorable as this was, the particularly interesting thing about it was the teaching technique involved.

Typically, a machine learning program needs a lot of input data – images, in this case – each with additional information about what it depicts. Teaching data intended to allow a neural network to learn how to identify people, for instance, will involve pictures containing people, each paired with a copy of the same picture in which the people are outlined in a way the computer will understand. Because this involves a lot of manual work, high-quality teaching data for AI is hard to come by. The idea presented today involved using a computer simulation (literally a game – GTA 5) to provide the training data. The graphics are apparently now close enough to reality to work as a training aid, much as computer simulations are used to train humans, and crucially the simulation is capable of separating things such as walls, floors and people without having someone manually draw around them.
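
To make the idea of automatically labelled training data a little more concrete, here is a minimal sketch in Python. It assumes a simulator can hand back, alongside each rendered frame, a grid giving the class of every pixel. The class numbers, the function names and the tiny four-by-four "frame" are all invented for illustration; nothing here reflects how the demo shown at IBC, or GTA 5 itself, actually exposes this information.

```python
# Hypothetical class IDs; a real simulator or game engine would define its own.
WALL, FLOOR, PERSON = 0, 1, 2


def person_mask(label_map, person_id=PERSON):
    """Return a binary mask (1 = person pixel) from a per-pixel class-ID map."""
    return [[1 if pixel == person_id else 0 for pixel in row] for row in label_map]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the masked pixels, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# A toy 4x4 "frame": the simulator already knows what every pixel is,
# so nobody has to draw around the person by hand.
label_map = [
    [WALL,  WALL,   WALL,   WALL],
    [WALL,  PERSON, PERSON, WALL],
    [FLOOR, PERSON, PERSON, FLOOR],
    [FLOOR, FLOOR,  FLOOR,  FLOOR],
]

mask = person_mask(label_map)
box = bounding_box(mask)
```

In a real pipeline the rendered image and its mask would be saved as a pair and fed to the network as one training example; the point is simply that the outline comes for free from the simulation rather than from manual annotation.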

Cameras are all very good now.

Hanging around in the seminar room, we discover that rarest of things: someone providing unequivocal, objective data about how good cameras are, not involving pointing them at someone holding a chart. As we might expect, there was very little discussion of specific camera manufacturers. Some of them had provided behind-the-scenes access to information on the proviso that the resulting white paper was used for academic, rather than commercial purposes, which does go some way to justifying the decision not to name names.

What the paper did show, fairly conclusively, is that single-chip cameras suffer some quite serious colorimetry problems compared to their three-chip brethren, which is another advantage of the layout Sony chose for the 8K live production camera that we discussed yesterday. Possibly, some of this is due to the tendency for manufacturers to put less-dense coloured filters on Bayer sensors, in order to allow more light through and thereby improve sought-after characteristics such as sensitivity, dynamic range and noise performance. These are things reviewers mention; colorimetry is not, but of course the camera’s ability to see colour without excessive chroma noise is dependent, ultimately, on the saturation of the colour filters on the array.

It’s important to be aware that the presentation didn’t imply that any single-chip camera (including both high and low-cost types) is unusable because of this. In fact, it was specifically mentioned that all of the types tested produce a perfectly pleasing image, so long as we don’t try to compare the colour of an object in reality with the colour of an object as recorded. It is, however, generally more difficult to match two dissimilar single-chip cameras, as opposed to any two dissimilar three-chip types.

It’s a fascinating paper, a subject we’ll look at revisiting, and certainly the sort of thing that should attract people away from the bright lights of the show floor towards the conference rooms, where much of the most original thinking takes place.
