Replay: Unless you've been living under a rock for the last few years, you know that the clock speeds of processors haven't changed much, yet performance is still going up. Most of that performance gain is coming from parallelism.
To simplify the discussion, we'll look at five approaches to parallelization: superscalar execution, simultaneous multithreading, vectorization, multiple cores, and clustering.
Most microprocessors these days, especially ones made for computers, tablets, and phones, are superscalar. Basically that means that they have multiple pipelines designed for particular tasks, like loading and storing data, integer arithmetic, and floating point arithmetic.
The challenge here is that, semantically, the instructions need to execute in a particular order to deliver correct results.
For example, consider a simple instruction stream:
C = A * B
if C < D
    E = A * D
else
    E = B * D
F = A + B
In this very simple series of instructions, the comparison and both possible assignments to E depend on the result of the first multiplication. So how can the processor optimize this?
First, it can load the data for D while it's executing A * B. It can also, since there's no other dependency, compute F even BEFORE it's finished computing C; this is called "Out of Order Execution."
Second, it can speculatively execute A * D AND B * D while it's checking C < D, and then discard the unnecessary result.
In a real processor the stream of instructions is a lot more interesting than this, as well as a lot longer, and there are more pipelines for the processor to use. The part of the processor that analyzes that stream of instructions and attempts to optimize it is the scheduler: it identifies instructions that can be re-ordered without affecting the program semantics and works out how to keep the various pipelines busy.
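To make those dependencies concrete, here's the same stream written as a minimal C++ sketch (the function wrapper and return value are only there so it compiles); the comments mark what the out-of-order machinery can overlap.

// The same stream as above, written as C++. The comments mark the
// dependencies the out-of-order scheduler has to respect and the work
// it is free to overlap.
float example(float A, float B, float D) {
    float C = A * B;   // the comparison below can't be resolved until this finishes
    float F = A + B;   // independent of C, so it can execute out of order, even first
    float E;
    if (C < D)         // while this check is pending, both A * D and B * D
        E = A * D;     //   can be computed speculatively...
    else
        E = B * D;     //   ...and the result that isn't needed is discarded
    return E + F;      // returned so the compiler keeps all of the work
}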
One way that modern processors keep the various pipelines busy is called Simultaneous Multithreading (SMT). What this amounts to is running two streams of instructions at the same time, and letting the scheduler choose instructions from either stream to execute, so it has a higher probability of being able to find instructions that don't depend on each other.
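From the software side, those two streams are simply two threads. Here's a minimal C++ sketch with made-up workloads purely for illustration: on an SMT-capable core, the hardware can interleave ready instructions from both threads to fill pipelines that would otherwise sit idle.

#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Two independent instruction streams; an SMT core can pick ready
// instructions from either one to keep its various pipelines busy.
void integer_stream(std::vector<long>& out) {
    for (long i = 0; i < 1000000; ++i) out.push_back(i * 3 + 7);         // integer pipeline work
}
void float_stream(std::vector<double>& out) {
    for (int i = 0; i < 1000000; ++i) out.push_back(std::sqrt(i * 0.5)); // floating point pipeline work
}

int main() {
    std::vector<long> a;
    std::vector<double> b;
    std::thread t1(integer_stream, std::ref(a));
    std::thread t2(float_stream, std::ref(b));
    t1.join();
    t2.join();
    return 0;
}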
Vectorization is something that used to be the domain of very high end supercomputers like the Cray Y-MP. Let's take another example.
Imagine pixels in an image processing program. We want to take the RGBA values of one color, multiply them by another's, and then add a third color's. On a standard floating point unit, that's eight operations:
R1 * R2 + R3
G1 * G2 + G3
B1 * B2 + B3
A1 * A2 + A3
However, using vector operations, it can be expressed as:
Color1 * Color2 + Color3
Now it's two instructions, a big improvement over eight.
But that combination is so common that there's an instruction for it, called FMA, or Fused Multiply-Add, which makes it a single instruction.
This is appropriately described as "Single Instruction, Multiple Data" or SIMD.
On a Skylake processor, there are two floating point units that each support AVX2. That's Advanced Vector Extensions, and the width of the registers is 256 bits. Some higher end Skylake-X models have AVX-512 units, which have 512-bit vectors.
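Here's a minimal sketch of the color example using x86 FMA intrinsics (assuming a CPU and compiler flags that enable FMA, e.g. -mfma); one fused multiply-add handles all four RGBA channels at once. The function name is made up for illustration.

#include <immintrin.h>

// Color1 * Color2 + Color3 for all four RGBA channels in one fused
// multiply-add. Each array is assumed to hold {R, G, B, A} as floats.
void blend_fma(const float c1[4], const float c2[4],
               const float c3[4], float out[4]) {
    __m128 a = _mm_loadu_ps(c1);                // load Color1 into a 128-bit register
    __m128 b = _mm_loadu_ps(c2);                // load Color2
    __m128 c = _mm_loadu_ps(c3);                // load Color3
    _mm_storeu_ps(out, _mm_fmadd_ps(a, b, c));  // a * b + c in a single instruction
}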
Vector instructions are getting used more and more, and they aren't limited to just colors; it's possible to group any set of operations this way as long as they all use the same instruction.
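Compilers will often do that grouping for you. A plain loop like this sketch (the function is just an illustrative example) applies the same multiply-add to every element, so an optimizing compiler can pack groups of elements into vector registers automatically.

// A loop the compiler can auto-vectorize: every iteration performs the same
// multiply-add, just on different data. (Build with optimization, e.g. -O2 or -O3.)
void scale_and_offset(const float* in, float* out, float scale, float offset, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * scale + offset;
}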
Symmetric multiprocessing, having more than one processor core working against the same memory, used to be a feature available only in servers and workstations, but now it's pretty much standard. Back then, in order to have two processors you'd have to use a computer with two sockets and buy Xeon processors, since the SMP (Symmetric MultiProcessing) flag was disabled in the Pentium line for market segmentation.
Now, multiple cores are built into just about every mainstream processor made for a computing device. Even most cell phones have either four or eight cores these days, and server processors have 32 or more. More on this in the next section.
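Here's a minimal C++ sketch of putting those cores to work (the chunking scheme is just illustrative): ask the runtime how many hardware threads are available and split a job across them.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Split a large summation across however many hardware threads the machine reports.
int main() {
    std::vector<double> data(10000000, 1.0);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(n, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / n;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == n) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            // each worker sums its own slice, ideally on its own core
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    (void)total;  // the partial sums together make up the whole
    return 0;
}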
Clustering was once also the domain of high end supercomputers like IBM's SP2 (Scalable POWERparallel) and Beowulf clusters.
Shared memory supercomputers like SGI's Columbia were great for tasks where the processors needed to be able to share data dynamically. There was never any concern about whether or not a piece of data was local, and that made quite a few parallel programming tasks easier. The catch was that the supercomputer was essentially a box; once you filled all of the available sockets, you were at your limit and would need a newer, bigger box to get more computing power.
Beowulf and SP2 were distributed supercomputers; a compute node had its own memory and didn't share it with the rest of the cluster. That meant that when a node needed data that resided in another node's memory, things could get complicated, especially in situations where both nodes were updating that data. Which one updated it first could significantly affect the outcome of an entire simulation. On top of that, scheduling instructions required accounting for data locality; it would take longer to get data from another node than from the current node's memory.
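Distributed-memory machines like these are typically programmed with explicit message passing, most commonly MPI. A minimal sketch (the values are placeholders; run it with at least two ranks, e.g. mpirun -np 2): each rank owns its data, and anything a neighbor needs has to travel over the interconnect, which is where the latency and ordering headaches come from.

#include <mpi.h>

// Each rank (think: compute node) owns its own memory. To use a value that
// lives on another rank, it has to be sent explicitly over the network.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank * 1.5;   // data that exists only in this rank's memory
    if (rank == 0) {
        double remote = 0.0;
        // Rank 0 needs rank 1's value, so it has to wait for the message to arrive.
        MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local += remote;         // now rank 0 can work with the remote data
    } else if (rank == 1) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}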
To address this, systems designers developed directory systems, which were basically a catalogue of which data was currently residing where. If a node updated a piece of data, the directory system would mark that piece as "invalid" so that if another node had a copy, it would know that it needed to re-fetch that data.
That process is called cache coherency, and since the latency of memory access was not uniform, such systems came to be called Cache Coherent Non-Uniform Memory Access, or ccNUMA, machines.
And now we come to the modern era.
AMD's monstrous Ryzen 2 is a perfect example of a processor that uses ALL of these methods of parallelization. It is built of wide superscalar cores, each supporting SMT. The FPUs support vector computing. The cores are grouped into chiplets of eight, and each chiplet is connected through a network fabric hosted on a separate chiplet, together comprising a 32-core ccNUMA cluster... in a single package.
And it's possible to build systems with two sockets, so EVERY SOCKET has a 32-core ccNUMA cluster installed.
Since GPUs evolved out of very specialized hardware, their architectures are quite different. In general, they're based more on smaller and more limited cores, each of which is somewhat analogous to a pipeline in a CPU. They're grouped into clusters, and organized into a hierarchy that abstracts the underlying layout from the software. They have management nodes that take incoming instructions and assign them to groups.
Because the cores are fairly simple, and do not include general-purpose processor functions like load/store and scheduling, they're very compact and efficient, making it possible to implement the thousands of cores that modern GPUs sport.
Which do software developers in video, 3D animation, compositing, and visual effects use? All of them.
Take DaVinci Resolve, for example. When rendering Redcode, it dedicates threads to extracting individual frames from the Redcode clip. The Red SDK uses OpenCL or CUDA to de-Bayer the frames, taking advantage of the GPU, and then Resolve takes those frames and injects them into its image processing pipeline. That pipeline uses OpenCL, CUDA, and now Metal to process the frames on the GPU, and then Resolve feeds those frames to threads that write the data out to an output stream. The threads that decompress the raw data and compress the final video stream use vectorized code.
While there are 3D renderers that use the GPU and others that use the CPU, there are a few, like Otoy's Octane and AMD Radeon ProRender, that can use both. Plus, most 3D renderers also support clustering in the form of render farms.
How far computing technology has advanced in the past few years is astonishing, and the sheer complexity and power available in even mobile processors now would have defied imagination just ten years ago.