Similarly, application performance requirements can vary over time, so transferring the task to a more efficient CPU when possible improves power efficiency. For specialist computation tasks, dedicated accelerators offer excellent energy efficiency but can only be used for part of the time.
So, what should you be looking for when it comes to heterogeneous processors that deliver significant benefits in terms of performance and low power consumption? Let’s look at a few important considerations.
Even with out-of-order execution, CPUs running typical workloads aren’t fully utilized on every cycle; they spend most of their time waiting for access to the memory system. However, when one portion of the program (known as a thread) is blocked, the hardware resources could be used by another thread of execution. Multi-threading makes it possible to switch to a second thread when the first is blocked, increasing overall system throughput. Filling otherwise unused CPU cycles with useful work boosts performance; depending on the application, adding a second thread to a CPU typically adds 40 percent to overall performance for an additional silicon area cost of around 10 percent. In CPU IP, hardware multi-threading is a feature unique to Imagination’s MIPS CPUs.
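To put the 40 percent and 10 percent figures above side by side, the following is a minimal arithmetic sketch. The function name perf_per_area is hypothetical, and the calculation assumes that both figures are measured relative to the same single-threaded baseline.

```python
def perf_per_area(perf_gain: float, area_cost: float) -> float:
    """Performance per unit of silicon area, relative to a baseline of 1.0.

    perf_gain and area_cost are fractions (0.40 means +40 percent).
    Assumption: both are measured against the same single-threaded core.
    """
    return (1.0 + perf_gain) / (1.0 + area_cost)


# Figures quoted in the text: +40 percent performance for +10 percent area.
ratio = perf_per_area(0.40, 0.10)
print(f"{ratio:.3f}")  # prints 1.273
```

Under these assumptions, the second thread delivers roughly 27 percent more performance per unit of area than the single-threaded core.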
A Common View
Moving a task from one processor to another requires both processors to share the same instruction set and the same view of system memory. This is accomplished through shared virtual memory (SVM). Any pointer in the program must continue to point to the same code or data, and any dirty cache line in the initial processor’s cache must be visible to the subsequent processor.
Cache coherency can be managed through software. This requires that the initial processor (CPU A) flush its cache to main memory before transferring to the subsequent processor (CPU B). CPU B then has to fetch the data and instructions back from main memory. This process can generate many memory accesses and is therefore time-consuming and power-hungry; the impact is magnified because accessing main memory typically takes significantly more energy than fetching from cache. Hardware cache coherency is therefore vital to minimize these power and performance costs. It tracks the location of cache lines and ensures that the correct data is accessed by snooping the caches where necessary.
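The software-managed hand-off can be illustrated with a toy model. The sketch below is not a model of any real cache or of the I6500; the class WriteBackCache and its methods are hypothetical, and main memory is simply a Python dictionary. It only shows why CPU A must flush before CPU B reads, and that every dirty line costs a main-memory write.

```python
class WriteBackCache:
    """Toy private write-back cache in front of a shared main memory."""

    def __init__(self, memory: dict):
        self.memory = memory   # shared main memory: address -> value
        self.lines = {}        # locally cached lines: address -> value
        self.dirty = set()     # addresses modified but not yet written back

    def write(self, addr: int, value: int) -> None:
        # Write-back policy: update only the local copy and mark it dirty.
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr: int) -> int:
        # On a miss, fetch the line from main memory.
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def flush(self) -> int:
        """Write all dirty lines to main memory; return the number of writes."""
        writes = len(self.dirty)
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()
        return writes


memory = {0x100: 1}
cpu_a = WriteBackCache(memory)
cpu_a.write(0x100, 42)                      # dirty line lives only in CPU A's cache

stale = WriteBackCache(memory).read(0x100)  # a reader without the flush sees old data
writes = cpu_a.flush()                      # software coherency: CPU A writes back
cpu_b = WriteBackCache(memory)
fresh = cpu_b.read(0x100)                   # CPU B refetches from main memory

print(stale, writes, fresh)  # prints 1 1 42
```

In this toy model each dirty line costs one write to main memory plus one later fetch by CPU B, which is the traffic that hardware coherency avoids by snooping the first cache instead.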
In many heterogeneous systems, the high-performance processors reside in one cluster, while the smaller, high-efficiency processors reside in another. Transferring a task between these different types of processors means that both the level 1 and level 2 caches of the new processor are cold. Warming them takes time and requires the previous cache hierarchy to remain active during the transition phase.
However, there is an alternative – the MIPS I6500 CPU. The I6500 supports a heterogeneous mix of external accelerators through an I/O Coherence Unit (IOCU) as well as different processor types within a cluster, allowing for a mix of high-performance, multi-threaded and power-optimized processors in the same cluster. Transferring a task from one type of processor to another is now much more efficient, as only the level 1 cache is cold, and the cost of snooping into the previous level 1 cache is much lower, so the transition time is much shorter.
Combining CPUs with Dedicated Accelerators
CPUs are general-purpose machines. Their flexibility enables them to tackle almost any task, but at the price of efficiency. A GPU such as PowerVR is optimized to process larger, highly parallel computational tasks with very high performance and good power efficiency, in exchange for some reduction in flexibility compared to CPUs, and it is bolstered by a well-supported software development ecosystem with APIs such as OpenCL and OpenVX.
The specialization provided by dedicated hardware accelerators offers a combination of performance with power efficiency that is significantly better than a CPU, but with far less flexibility.
However, applying accelerators to operations that occur frequently maximizes the potential performance and power efficiency gains. Specialized computational elements such as those for audio and video processing, as well as neural network processors used in machine learning, use similar mathematical operations.
Hardware acceleration can be coupled to the CPU by adding Single Instruction Multiple Data (SIMD) capabilities with floating point Arithmetic Logic Units (ALUs). However, while processing data through the SIMD unit, the CPU behaves as a Direct Memory Access (DMA) controller to move the data, and CPUs make very inefficient DMA controllers.
Conversely, a heterogeneous system essentially provides the best of both worlds. It contains some dedicated hardware accelerators that, coupled with a number of CPUs, offer the benefits of greater energy efficiency from dedicated hardware, while retaining much of the flexibility provided by CPUs.
These energy savings and performance boosts depend on the proportion of time that the accelerator is doing useful work. Work packages appropriate for the accelerator come in a wide range of sizes; you might expect a small number of large tasks, but many smaller tasks.
There is a cost in transferring the processing between a CPU and the accelerator, and this limits the size of the task that will save power or boost performance. For smaller tasks, the energy consumed and time taken to transfer the task exceeds the energy or time saved by using the accelerator.
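The break-even point described above can be sketched with a simple model. The assumptions are that the accelerator runs a task a fixed factor faster than the CPU (speedup) and that each offload pays a fixed transfer cost; the function names and example numbers are hypothetical and are not measurements of any real system. Offloading wins when cpu_time > transfer_cost + cpu_time / speedup, which rearranges to the threshold computed below.

```python
def break_even_cpu_time(transfer_cost: float, speedup: float) -> float:
    """Smallest CPU-side task time above which offloading saves time.

    Derived from: cpu_time > transfer_cost + cpu_time / speedup.
    All times must use the same unit (for example, microseconds).
    """
    if speedup <= 1.0:
        return float("inf")  # the accelerator is no faster, so never offload
    return transfer_cost * speedup / (speedup - 1.0)


def should_offload(cpu_time: float, transfer_cost: float, speedup: float) -> bool:
    """True if running on the accelerator, including transfer, beats the CPU."""
    return cpu_time > break_even_cpu_time(transfer_cost, speedup)


# Illustrative placeholder values only: 10 units of transfer cost, 5x speedup.
threshold = break_even_cpu_time(10.0, 5.0)
print(threshold)                        # prints 12.5
print(should_offload(20.0, 10.0, 5.0))  # prints True
print(should_offload(5.0, 10.0, 5.0))   # prints False
```

The same inequality applies to energy if the times are replaced by energy figures. Lowering the transfer cost lowers the threshold, so more of the many small tasks become worth offloading.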
Data Transfer Cost
To reduce time and energy costs, shared virtual memory with hardware cache coherency, as found in the I6500 CPU, is ideal because it addresses much of the cost of transferring the task: it eliminates the copying of data and the flushing of caches. Other techniques are available to achieve even greater reductions.
The HSA Foundation has developed an environment to support the integration of heterogeneous processing elements in a system that extends beyond CPUs and GPUs. The HSA system’s intermediate language, HSAIL, provides a common compilation path to heterogeneous Instruction Set Architectures (ISAs), which greatly simplifies system software development. HSA also defines User Mode Queues.
These queues enable tasks to be scheduled and signals to trigger tasks on other processing elements, allowing sequences of tasks to execute with very little overhead between them.
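The idea of queued tasks chained by signals can be sketched as a small single-threaded simulation. This is not the HSA runtime API; Signal, Agent, submit, step and run are hypothetical names invented for the illustration. Each processing element owns a queue, and a queued task may wait on the completion signal of a task on another element.

```python
from collections import deque


class Signal:
    """Hypothetical completion signal: unset until the producing task finishes."""

    def __init__(self):
        self.done = False


class Agent:
    """Hypothetical processing element that owns a queue of task packets."""

    def __init__(self, name: str):
        self.name = name
        self.queue = deque()

    def submit(self, fn, wait_on: Signal = None) -> Signal:
        """Enqueue a task; return the signal that is set when it completes."""
        completion = Signal()
        self.queue.append((fn, wait_on, completion))
        return completion

    def step(self, log: list) -> bool:
        """Run the head packet if its dependency is satisfied."""
        if not self.queue:
            return False
        fn, wait_on, completion = self.queue[0]
        if wait_on is not None and not wait_on.done:
            return False  # dependency not yet signalled, so stay idle
        self.queue.popleft()
        log.append((self.name, fn()))
        completion.done = True  # signalling releases any waiting task
        return True


def run(agents: list) -> list:
    """Step every agent in turn until all queues are empty."""
    log = []
    while any(agent.queue for agent in agents):
        progressed = [agent.step(log) for agent in agents]
        if not any(progressed):
            raise RuntimeError("deadlock: no packet is ready to run")
    return log


cpu, gpu = Agent("cpu"), Agent("gpu")
decoded = cpu.submit(lambda: "decode")
filtered = gpu.submit(lambda: "filter", wait_on=decoded)
cpu.submit(lambda: "encode", wait_on=filtered)

log = run([cpu, gpu])
print(log)  # prints [('cpu', 'decode'), ('gpu', 'filter'), ('cpu', 'encode')]
```

In the sketch the whole chain is submitted up front and no central scheduler sits between the stages: each task starts as soon as the signal it waits on is set, which is the source of the low overhead between tasks.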
Heterogeneous systems offer the opportunity to significantly increase system performance and reduce system power consumption, enabling systems to continue to scale beyond the limitations imposed by ever-shrinking process geometries.
Multi-threaded, heterogeneous and coherent CPU clusters such as the MIPS I6500 have the ideal characteristics to sit at the heart of these systems. As such they are well placed to efficiently power the next generation of devices.
Tim Mace is Senior Manager, Business Development, MIPS Processors, Imagination Technologies.