Haswell, Xeon & Core Demystified - Part One. Guest author Rakesh Malik sheds light on the latest Intel chipsets, their capabilities and their suitability for video applications.
There seems to be quite a bit of confusion about what the differences are between Core and Xeon processors. There’s also a lot of confusion about what precisely Haswell is, and hence uncertainty about whether or not it’s a good choice for media applications as compared to, for example, Ivy Bridge... or Xeon. There are a lot of options as far as processors go nowadays, and even Haswell itself is available in so many variants that it’s easy to get lost.
For the past several years, Intel has been working on a tick-tock cycle. One product generation (the “tick”) is a port to a new fabrication process, also known as a die shrink, and the next (the “tock”) is a new architecture built on that process.
Ivy Bridge was a port of Sandy Bridge to Intel’s 22nm process. Haswell is a new architecture using that process. Intel heavily emphasized power consumption in the Haswell design, as it’s another step in Intel’s transition toward mobile computing.
When Apple introduced the iPad, it was basically a sexy new niche in personal computing. Since then, that niche has grown to be a major market segment. Processors are now far more powerful than most users will ever need, and that includes even the dual-core ARM processors from the earlier iPad/iPhone generation. GPUs are powerful enough to render high-definition and now even 4K video without significantly taxing the host CPU. As a result, the market for ultra-high-end processors and graphics cards has become a niche, primarily limited to content creators and hardcore gamers.
In order to maintain its dominance in the semiconductor industry, Intel has to not only continue making new processors, but it also has to sell enough of them to keep its fabs busy, no easy task for a company with such massive capacity.
So, the company has been forced to adapt. The Haswell architecture has a lot of updates oriented toward reducing power consumption, such as significantly improving the processor’s ability to change its clock speed and even turn significant parts of itself off when they’re idle. Intel has also integrated onto the chip more parts that used to be entirely separate chips, such as a GPU, an I/O controller, and a memory controller, and it has used its new, smaller-than-ever fabrication technology to cram ever more cache onto the die.
Haswell also includes some performance improvements, some of them pretty significant. One of these is its ability to adjust the clock speed beyond its official clock speed rating to respond to demand, but it also has a few less obvious updates. One is an enhancement to multi-threading, providing developers with tools that allow them to write parallelized code that shares data with significantly reduced overhead for locking data structures. The idea behind this is to enable developers to use the multiple cores more efficiently.
Another significant addition that is even more useful for media applications is an update to the floating point units. Ever since AMD forced Intel’s hand with the Athlon64, Intel as well as AMD have been continuing to add vector instructions to the processor. The big new addition to Haswell is called FMA, or “Fused Multiply-Add”.
To understand what FMA is and why it’s beneficial, it would help to understand some SIMD basics. SIMD is short for Single Instruction, Multiple Data. Essentially it’s supercomputer style vector operations.
For example, take a large pixel array, such as an image file. Imagine that you want to perform an (oversimplified) blend operation. The idea is that you need to multiply every pixel value (remember that each pixel is in reality an RGBA value, hence 4 values) by an opacity value, and add it to the layer below. Imagine that the opacity is coming from an alpha map, so we’re actually dealing with three layers here.
The traditional computing approach would be to, for each channel in each pixel, perform the multiplication and then the addition. Clearly, that adds up to a lot of operations and vector computing is a way to streamline this.
First, we pack the values from a single pixel into a 4-element array, also known as a vector. We do the same for the alpha map, and the second layer. Then with just two operations, we can blend an entire pixel.
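The difference between the two approaches can be sketched in a few lines of Python. This is only a model of the arithmetic: Python has no SIMD instructions, so the hypothetical helpers `vmul` and `vadd` each stand in for what a single vector instruction would do across all four lanes, and the pixel values are made up for illustration.

```python
def blend_scalar(top, alpha, bottom):
    """Blend one RGBA pixel channel by channel: 4 multiplies plus 4 adds."""
    result = []
    for channel in range(4):
        scaled = top[channel] * alpha[channel]    # one multiply per channel
        result.append(scaled + bottom[channel])   # one add per channel
    return result


def vmul(a, b):
    """Stand-in for a single vector multiply applied to every lane."""
    return [x * y for x, y in zip(a, b)]


def vadd(a, b):
    """Stand-in for a single vector add applied to every lane."""
    return [x + y for x, y in zip(a, b)]


def blend_vector(top, alpha, bottom):
    """Blend one RGBA pixel with two vector operations."""
    return vadd(vmul(top, alpha), bottom)


# Made-up sample data: one pixel from each of the three layers.
top = [0.8, 0.4, 0.2, 1.0]
alpha = [0.5, 0.5, 0.5, 0.5]
bottom = [0.1, 0.1, 0.1, 0.0]

# Both routes perform the same arithmetic, so the results match.
same_result = blend_scalar(top, alpha, bottom) == blend_vector(top, alpha, bottom)
```

The scalar version spends eight operations on the pixel, while the vector version expresses the same work as one multiply and one add.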
With wider vectors, like Haswell’s 256-bit vectors, it’s possible to pack 4 pixels’ worth of data into a single vector, assuming you’re using 16-bit channels. That’s an even bigger saving: now it’s 32 operations (16 multiplications and 16 additions) in two instructions.
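Fused Multiply-Add combines the multiplication and the addition into a single instruction, halving the instruction count again. Haswell’s FMA instructions operate on floating-point values, so with 32-bit floating-point channels a 256-bit vector holds eight values, or two RGBA pixels, and a single FMA instruction performs all 16 operations on them. Haswell has two execution units with this capability.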
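To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, and the count covers only the multiply and add instructions, not the loads and stores a real blend would also need. The channel width is a parameter because the number of lanes in a vector depends on it.

```python
def lanes(vector_bits, channel_bits):
    """How many channel values fit in one vector register."""
    return vector_bits // channel_bits


def blend_instructions(pixel_count, vector_bits, channel_bits, fused):
    """Arithmetic instructions needed to blend pixel_count RGBA pixels.

    Counts one multiply and one add per vector, or a single fused
    multiply-add per vector when fused is True.
    """
    channel_values = pixel_count * 4                          # R, G, B and A
    per_vector = lanes(vector_bits, channel_bits)
    vectors = (channel_values + per_vector - 1) // per_vector  # round up
    return vectors if fused else vectors * 2


# Four pixels with 16-bit channels fill exactly one 256-bit vector.
separate = blend_instructions(4, 256, 16, fused=False)
combined = blend_instructions(4, 256, 16, fused=True)

# A full 1920x1080 frame with 32-bit channels, with and without fusing.
frame_pixels = 1920 * 1080
frame_separate = blend_instructions(frame_pixels, 256, 32, fused=False)
frame_fused = blend_instructions(frame_pixels, 256, 32, fused=True)
```

Under these assumptions, a 1920x1080 frame with 32-bit channels needs about two million multiply and add instructions, or about one million fused ones, compared with roughly 16.6 million scalar operations.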
Of course, feeding such a beast requires a tremendous amount of memory bandwidth, so Haswell has been upgraded in that area as well. It has an integrated memory controller and three levels of on-die cache.
For years, Intel has had a well-deserved reputation for producing GPUs so slow that they were “affectionately” referred to as graphics decelerators. Fortunately for us, Intel is more willing to adapt than most companies of its size and age and has been putting a lot of effort into developing its own GPU technology. Haswell’s GPU has some improvements aimed at lowering power consumption, such as being able to read from and write to the third-level cache and system memory without waking up the main CPU. It also has some improvements for video playback and support for DirectX 11 as well as OpenCL 1.2. While hardly competition for discrete GPUs from AMD/ATI and NVIDIA, Intel finally has a respectable integrated GPU.
Simultaneous multi-threading, which Intel markets as Hyper-Threading, is also an often misunderstood feature that’s become quite common in processors today. To understand what it does, we need to understand a little bit about out-of-order execution.
Processors are able to issue multiple instructions at once. To maximize performance, the processor’s scheduler tries to keep as many of the available instruction slots busy as it can. So it looks past the first instruction in the program stream to find instructions that it can issue in parallel.
An oversimplified example that is both relevant and realistic is rendering a 3D animation. Since that takes place one frame at a time, it’s easy to parallelize because, as a computer scientist would describe it, it is “embarrassingly parallel.” That means that the output from one frame has no effect on the output of another frame, so the frames are entirely independent, though they operate on the same data.
Generally, on a single machine, it’s more efficient by far to use multiple threads instead of multiple processes because threads can share memory. Other than that, a thread operates essentially as a separate process; for the CPU to switch from one thread to another, even in the same memory space, it needs to perform a context switch, swapping out the current thread’s registers, stack pointer, and program counter for those of the new thread. This has lower overhead than switching processes, since threads in the same process share one address space, heap included, and the memory mappings don’t have to change.
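As a rough illustration of independent frames rendered by threads that share the same scene data, consider the Python sketch below. The `scene` dictionary and the `render_frame` function are made-up placeholders rather than a real renderer. Note also that in standard CPython, threads do not speed up CPU-bound Python code, so this shows the structure of the approach, not the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scene: read-only data that every frame needs.
# All threads see this one copy because they share memory.
scene = {"objects": [3, 5, 7], "light": 2}


def render_frame(frame_number):
    """Placeholder renderer: uses only the shared scene and the frame number."""
    return sum(obj * scene["light"] + frame_number for obj in scene["objects"])


def render_animation(frame_count, workers=4):
    """Render every frame on a pool of threads; results come back in frame order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_frame, range(frame_count)))


frames = render_animation(8)
```

Because no frame reads another frame’s output, the frames can be handed to the threads in any order without any coordination between them.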
With one thread active at a time, the CPU will look for instructions that it can execute out of order. In a production renderer, it could be computing the value of a procedural shader for one ray hit on a polygon while computing a ray hit for another ray, since the two don’t affect each other.
However, it can’t always find instructions that can work out of sequence; it can’t calculate the procedural shader’s value for a given ray hit until it knows where the ray intersects the polygon, because it needs those coordinates to feed into the procedural shader.
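A toy scheduler makes the dependency problem concrete. The Python sketch below is a heavily simplified model, not how a real processor is built: every instruction takes one cycle, the issue width is an assumed two instructions per cycle, and the instruction names are invented to match the ray tracing example.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str        # the value this instruction produces
    sources: tuple   # the values it needs before it can issue


def schedule(program, issue_width=2):
    """Return a list of cycles, each listing the instructions issued together."""
    produced = {instr.dest for instr in program}
    # Values that no instruction produces are treated as available from the start.
    available = {s for instr in program for s in instr.sources if s not in produced}
    pending = list(program)
    cycles = []
    while pending:
        ready = [i for i in pending if all(s in available for s in i.sources)]
        issued = ready[:issue_width]
        if not issued:
            raise ValueError("no instruction can issue: circular dependency")
        cycles.append([i.name for i in issued])
        available.update(i.dest for i in issued)
        pending = [i for i in pending if i not in issued]
    return cycles


two_rays = [
    Instr("intersect_a", "hit_a", ("ray_a",)),
    Instr("shade_a", "color_a", ("hit_a",)),
    Instr("intersect_b", "hit_b", ("ray_b",)),
    Instr("shade_b", "color_b", ("hit_b",)),
]
one_ray = two_rays[:2]

busy = schedule(two_rays)   # both slots are filled in both cycles
stalled = schedule(one_ray)  # one slot sits idle in each cycle
```

With two independent rays, the scheduler reaches past `shade_a` to issue `intersect_b` early. With a single ray, the shading step has to wait for the intersection, and half of the issue slots go unused.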
Simultaneous multi-threading helps the processor do this by allowing it to maintain two thread contexts at once, so it can choose instructions from either thread to schedule for execution. It can therefore take advantage of the fact that instructions from one thread won’t require the output from instructions in another thread... most of the time. This does fall down in cases when the task at hand is not embarrassingly parallel. The threads can be from separate processes, and in fact could be two independent single-threaded processes also.
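The benefit can be sketched with an equally simplified model. In the Python below, each thread is assumed to be one long chain in which every instruction depends on the previous one, so a thread can never issue more than one instruction per cycle. The function name and the issue width of two are assumptions made for illustration.

```python
def cycles_needed(chain_lengths, issue_width=2):
    """Count the cycles needed to finish every thread's instruction chain.

    Each entry in chain_lengths is one hardware thread context whose
    instructions form a strict dependency chain, so each thread can
    issue at most one instruction per cycle.
    """
    remaining = list(chain_lengths)
    cycles = 0
    while any(remaining):
        issued = 0
        for thread in range(len(remaining)):
            if remaining[thread] > 0 and issued < issue_width:
                remaining[thread] -= 1   # this thread issues one instruction
                issued += 1
        cycles += 1
    return cycles


# Two 100-instruction threads run one after the other on a single context...
one_after_another = cycles_needed([100]) + cycles_needed([100])
# ...versus both contexts feeding the scheduler at the same time.
interleaved = cycles_needed([100, 100])
```

In this model the two threads finish in half the time when interleaved, because the second thread fills the issue slot that the first thread’s dependencies leave empty. Real workloads land somewhere between the two extremes.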
Intel has added extensions to Haswell’s instruction set that allow developers to improve the efficiency of parallelized code by implementing support for locking shared resources in hardware. These instructions basically allow a developer to tell the processor not to let another thread or process modify a chunk of data until a given set of instructions is complete.
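For context, this is the kind of conventional software locking whose overhead such hardware support is meant to reduce. The Python sketch below does not use the Haswell instructions, which Python does not expose. It only shows the familiar pattern of guarding shared data with a lock, using a made-up `SharedCounter` class.

```python
import threading


class SharedCounter:
    """A counter that several threads may update at once."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the guarded update;
        # every other thread waits here until the lock is released.
        with self._lock:
            self.value += 1


def count_in_parallel(thread_count=4, increments=1000):
    """Have several threads increment one shared counter and return the total."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = count_in_parallel()
```

Every increment pays the cost of taking and releasing the lock, even when no other thread is touching the counter at that moment. That cost on uncontended data is the overhead worth reducing.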
Now that we know a bit more about it, Part Two of this article will look in detail at the Haswell variations and whether you should buy Haswell now or wait for Intel's next processor generation, Broadwell, in 2015.