Asymmetric Multiprocessing with Heterogeneous Architectures: Use the Best Tool for the Job
Contributor: Arteris SA
September 6, 2013 — Often, the term “multiprocessing” is associated with tightly-coupled symmetric multiprocessing (SMP) architectures, due in large part to SMP’s prevalence in high-performance computing, x86/x64 servers, and PCs. Unfortunately, for most applications SMP’s incremental performance scaling decreases significantly as cores are added. This lack of scalability has prompted many processor companies to avoid purely SMP solutions for their mobile and consumer electronics applications. Instead, they have implemented asymmetric multiprocessing (AMP) architectures to make more efficient use of silicon.

An example of AMP is a mobile phone’s modem baseband SOC, containing an ARM processor and a DSP to handle control and signal processing, respectively. AMP architectures are also found in mobile phone application processors, which pair multiple CPU cores with discrete graphics, video, audio, and imaging cores. Heterogeneous architectures also dominate most embedded consumer applications, such as digital TVs, set-top boxes, and automotive infotainment.
Heat and power drive architecture decisions
Mobile applications face significant design constraints because of battery capacity and heat dissipation. As a result, processor designers are forced to use “the best core for the job,” and mobile architectures have always been designed around heterogeneous-core AMP from the outset.
Server and PC chips have comparatively generous power and thermal budgets, making an SMP architecture tolerable. In these applications it is often easier to add more cores of the same type, connect them using cache coherency, and reuse legacy software on top. Historically, little attention has been paid to heat dissipation and power consumption.
But PCs are becoming smaller and more mobile, and server farms are scrutinizing power consumption as well, forcing designers to reconsider SMP architectures. For the server farms that power the likes of Google and Facebook, power consumption and heat dissipation have become huge cost and environmental issues. And in the PC space, we have run into a “gigahertz wall,” where the only way to achieve a step-function increase in performance is to use different cores optimized for different workload types.
AMP architectures struggle to break into PC/server applications
Why don’t AMP architectures dominate PC and server applications? Because it’s hard to implement!
In mobile designs, each heterogeneous processing core, whether graphics, audio, DSP, etc., typically has custom firmware and a software stack associated with it. This software must be integrated to communicate with the CPU cores’ operating system, requiring coding work in the OS hardware abstraction layer and drivers. In addition, these heterogeneous cores do not have a single view of system memory, so complicated synchronization schemes are usually implemented in hardware and software. Context switching and preemption are difficult to implement. Adding to the challenge, each of these cores requires an expert programmer, conversant in that core’s instruction set and tool chains, to code it.
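The synchronization problem can be illustrated with a toy model. Because a DSP and a CPU may not share a coherent view of memory, drivers commonly exchange work through a mailbox protocol: the producer writes a descriptor to shared memory, explicitly flushes its cache, then rings a doorbell; the consumer invalidates its own cache before reading. The following Python sketch is purely illustrative (the classes, addresses, and descriptor format are invented, not from any specific SoC):

```python
# Toy model of CPU -> DSP communication without cache coherency.
# Each side has a private "cache"; shared memory only becomes
# consistent after an explicit flush (writer) and invalidate (reader).

class SharedMemory:
    def __init__(self):
        self.backing = {}              # what is actually in DRAM

class CachedView:
    """A core's possibly-stale view of shared memory."""
    def __init__(self, mem):
        self.mem = mem
        self.cache = {}

    def write(self, addr, value):
        self.cache[addr] = value       # sits in the local cache only

    def flush(self):
        self.mem.backing.update(self.cache)   # make writes visible

    def invalidate(self):
        self.cache = {}                # drop potentially stale lines

    def read(self, addr):
        return self.cache.get(addr, self.mem.backing.get(addr))

mem = SharedMemory()
cpu, dsp = CachedView(mem), CachedView(mem)

# CPU posts a work descriptor, then "rings the doorbell".
cpu.write("descriptor", {"op": "fir_filter", "len": 1024})
cpu.flush()                            # without this, the DSP sees nothing
doorbell = True

# DSP interrupt handler: invalidate first, then read the descriptor.
if doorbell:
    dsp.invalidate()
    work = dsp.read("descriptor")
    print(work["op"])                  # -> fir_filter
```

If either side omits its cache-maintenance step, the consumer silently reads stale data, which is exactly why these schemes are hard to get right in real hardware and software.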
These barriers have kept AMP in the mobile and consumer electronics realm, a closed world of low-level, close-to-the-hardware software developers. Meanwhile, SMP has flourished in the wide-open world of PCs and servers, aided by its ease of programming.
Heterogeneous system architectures (HSA) can span the chasm between mobile/consumer applications and PC/server applications, easing the design burden while delivering performance, scalability, improved heat dissipation and reduced power consumption.
Recently, a number of companies, including AMD, ARM, Imagination, MediaTek, Qualcomm, Samsung and Texas Instruments, founded the HSA Foundation. HSA defines interfaces for parallel computation utilizing CPU, GPU, and other programmable and fixed-function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general-purpose computing.
Its goals are to make parallel programming of heterogeneous devices dramatically easier, more portable across vendors, and more power-efficient.
The HSA approach requires a technical framework and architecture
Several issues must be addressed to successfully bring these two worlds together.
HSA Foundation provides key tools for unlocking heterogeneous programming
Today, CPUs and GPUs do not share a common view of system memory, requiring an application to explicitly copy data between the two devices. In addition, an application running on the CPU that wants to add work to the GPU’s queue must execute system calls that communicate through the CPU operating system’s device driver stack, and then communicate with a separate scheduler that manages the GPU’s work. This adds significant run-time latency, in addition to being very difficult to program.
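The overhead of today’s dispatch path can be made concrete with a toy accounting model. The sketch below is an illustration, not real driver code: it simply counts the explicit copies and user/kernel transitions that the article describes, and contrasts them with an HSA-style user-mode queue over shared memory, where both counts drop to zero.

```python
# Toy accounting of what it takes to run one GPU kernel today:
# explicit copies plus transitions through the driver stack,
# versus an HSA-style shared-memory, user-mode-queue dispatch.

class LegacyDispatch:
    def __init__(self):
        self.copies = 0
        self.syscalls = 0

    def run_kernel(self, data):
        self.syscalls += 1                 # system call into the GPU driver
        self.copies += 1                   # copy input CPU -> GPU memory
        result = [x * 2 for x in data]     # stand-in for the GPU kernel
        self.syscalls += 1                 # completion path back via driver
        self.copies += 1                   # copy output GPU -> CPU memory
        return result

class SharedMemoryDispatch:
    """HSA-style: single view of memory, work queued from user mode."""
    def __init__(self):
        self.copies = 0
        self.syscalls = 0

    def run_kernel(self, data):
        # The GPU reads operands in place: no copies, no system calls.
        return [x * 2 for x in data]

legacy, hsa = LegacyDispatch(), SharedMemoryDispatch()
assert legacy.run_kernel([1, 2]) == hsa.run_kernel([1, 2]) == [2, 4]
print(legacy.copies, legacy.syscalls, hsa.copies, hsa.syscalls)  # 2 2 0 0
```

Both paths compute the same result; the difference is entirely in the run-time latency and programming effort spent shuttling data and crossing the driver boundary.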
HSA addresses the need for easy software programming of GPUs to take advantage of their unique capability to crunch parallel workloads much more efficiently than x86 or ARM CPUs.
HSA solution stack: Abstracting away hardware specifics
To enable easier programming, HSA allows developers to program at a higher abstraction level using mainstream programming languages and additional libraries. This HSA solution stack includes several components.
The key to enabling one language for heterogeneous core programming is to have an intermediate run-time layer that abstracts hardware specifics away from the software developer, leaving the hardware-specific coding to be done once by the hardware vendor or IP provider. The core of this intermediate layer is the HSA Intermediate Language or “HSAIL.”
The HSA run-time stack is created by compiling a high-level language such as C++ with the HSA compilation stack. HSA’s compilation stack is based on the LLVM infrastructure, which is also used in OpenCL from the Khronos Group.
Creation of HSAIL can occur prior to run-time or during run-time. Here are two examples: The OpenCL run-time includes the compiler stack and is called at run-time to execute a program that is already in data-parallel form. Alternatively, Microsoft’s C++ AMP (C++ Accelerated Massive Parallelism) uses the compiler stack during program compilation rather than execution. The C++ AMP compiler extracts data-parallel code sections and runs them through the HSA compiler stack, and passes non-parallel code through the normal compilation path.
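The two timing models can be sketched in a few lines. In the toy Python below, `compile_to_hsail()` is a stand-in for the LLVM-based HSA compiler stack (the function and its “module” return value are invented for illustration): the ahead-of-time path compiles at build time, as C++ AMP does, while the run-time path defers compilation until the kernel is first launched and then caches the result, as an OpenCL-style run-time does.

```python
# Toy illustration of when HSAIL generation can happen.
# compile_to_hsail() stands in for the LLVM-based HSA compiler stack.

compile_count = 0

def compile_to_hsail(source):
    global compile_count
    compile_count += 1
    return ("HSAIL", source)               # pretend intermediate form

# Ahead-of-time (C++ AMP style): the data-parallel section is
# compiled to HSAIL when the program itself is built.
aot_module = compile_to_hsail("parallel_for_each(...)")

# Run-time (OpenCL style): compilation is deferred until the program
# first asks to execute the kernel, then the result is cached.
class RuntimeCompiledKernel:
    def __init__(self, source):
        self.source = source
        self.module = None

    def launch(self):
        if self.module is None:            # compile on first launch only
            self.module = compile_to_hsail(self.source)
        return self.module

jit_kernel = RuntimeCompiledKernel("__kernel void scale(...)")
jit_kernel.launch()
jit_kernel.launch()                        # cached; no recompilation
assert compile_count == 2                  # one AOT + one run-time compile
```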
Figure 3 shows the HSA compilation stack, where programming code is compiled into HSAIL using the LLVM compilation infrastructure:
The hardware-specific HSA Finalizer is a key component
A key role is played by the hardware-specific “finalizer” which converts HSAIL to the computing unit’s native instruction set. Hardware and IP vendors are responsible for creating finalizers that support their hardware. The finalizer is lightweight and can be run at compile time, installation time or run-time depending on requirements.
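What “converting HSAIL to a native instruction set” means can be shown with a deliberately tiny model. The IR opcodes and the two vendor ISAs below are invented for illustration (real HSAIL is far richer); the point is that each vendor’s finalizer lowers the *same* intermediate form differently, for example by fusing operations its hardware supports natively.

```python
# Toy finalizer: lower the same HSAIL-like IR to two different
# "native" instruction sets. The IR and both ISAs are invented.

IR = [("load", "r1"), ("mul", "r1", 2), ("store", "r1")]

def finalize_for_vendor_a(ir):
    # Vendor A's hardware has a fused load-multiply instruction,
    # so its finalizer pattern-matches load+mul pairs.
    out = []
    i = 0
    while i < len(ir):
        if (i + 1 < len(ir)
                and ir[i][0] == "load" and ir[i + 1][0] == "mul"):
            out.append(("ldmul", ir[i][1], ir[i + 1][2]))
            i += 2
        else:
            out.append(ir[i])
            i += 1
    return out

def finalize_for_vendor_b(ir):
    # Vendor B lowers each IR op 1:1 to its own mnemonics.
    mnemonic = {"load": "LD", "mul": "MUL", "store": "ST"}
    return [(mnemonic[op[0]],) + op[1:] for op in ir]

print(finalize_for_vendor_a(IR))  # [('ldmul', 'r1', 2), ('store', 'r1')]
print(finalize_for_vendor_b(IR))  # [('LD', 'r1'), ('MUL', 'r1', 2), ('ST', 'r1')]
```

This is also where the quality differences the article mentions come from: a finalizer that exploits fused or specialized instructions extracts more of the hardware’s capability from the same HSAIL input.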
Figure 4 shows the HSAIL and its path through the HSA run-time stack:
The HSA Finalizer is the point at which the specifics of different heterogeneous computing units are addressed. Initial HSA implementations will most likely support GPU compute with finalizers from GPU vendors such as AMD, Imagination, ARM, and Qualcomm. The quality and features of each vendor’s HSA Finalizer will help determine how software developers take advantage of each hardware element’s computing capabilities.
Benefiting from heterogeneous architectures requires smart scheduling
In addition to GPUs, many existing heterogeneous architectures have additional discrete processing units for functions such as audio (digital signal processing or stream processing), image and video processing (SIMD frame processing), and security. As HSA matures, hardware and IP vendors creating these processing units may want to enable HSA programmability on their hardware by creating hardware-specific finalizers.
Having multiple heterogeneous processing units will complicate workload scheduling from a system perspective. The harsh reality is that existing workload scheduling and OS scheduling algorithms are relatively simple and generally only take into account local activity on a processing unit or a cluster of homogeneous processing units (see the Linux Completely Fair Scheduler for one example of how OS scheduling is implemented).
Interconnect fabric-assisted scheduling is required to implement scalable HSA systems
Existing OS and middleware scheduling algorithms do not take into account the existing traffic throughout the system, nor a view into other processing units. This lack of a global perspective for scheduling virtually guarantees there will be contention and stalling as processing units wait for access to precious system resources, especially the DRAM. It’s like looking out the front door of your house to determine how bad the traffic will be on your commute to work: You are missing very relevant information that could help you determine the optimal route to take.
Probing current run-time data flows at critical points throughout a system’s SOC interconnect fabric can provide critical information to enhance workload scheduling. This information can then be used to assign priorities to workloads, and workloads to processing units. These priorities and assignments can be optimized based on performance requirements or power consumption requirements, as required for a particular use case. As heterogeneous processing becomes the norm, and more processing units are added to a system, this type of interconnect-assisted scheduling will be required.
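The difference between local-only and fabric-assisted scheduling can be sketched in a few lines. The units, queue depths, and congestion figures below are invented for illustration: a local-only scheduler picks the unit with the shortest queue, while a fabric-assisted one also weighs congestion reported by probes on each unit’s path to DRAM, and can reach a different (better) decision.

```python
# Toy comparison of local-only vs interconnect-assisted scheduling.
# Units report their own queue depth; the fabric additionally reports
# congestion on each unit's path to DRAM (all numbers invented).

units = {
    "gpu": {"queue": 2, "path_congestion": 8},
    "dsp": {"queue": 3, "path_congestion": 1},
}

def schedule_local_only(units):
    # Classic approach: pick the unit with the shortest local queue.
    return min(units, key=lambda u: units[u]["queue"])

def schedule_fabric_assisted(units):
    # Also weigh congestion observed by probes in the interconnect.
    return min(units, key=lambda u: units[u]["queue"]
                                    + units[u]["path_congestion"])

print(schedule_local_only(units))       # gpu  (shortest local queue)
print(schedule_fabric_assisted(units))  # dsp  (less contended path to DRAM)
```

The local-only scheduler sends work to the GPU and then stalls behind DRAM contention it never saw; the fabric-assisted scheduler routes around it, which is the “commute traffic” information the analogy above describes.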
In other words, the hardware interconnect is a key enabler to putting the heterogeneous into HSA.
George Kyriazis (AMD), “Heterogeneous System Architecture: A Technical Review,” HSA Foundation whitepaper, August 2012.
The HSA Compilation and Run-time Stack diagrams are from the whitepaper by George Kyriazis cited above.
By Kurt Shuler
Kurt Shuler is Vice President of Marketing, Arteris, Inc.
Go to the Arteris SA website to learn more.