Heterogeneous computing is emerging as a requirement for power-efficient system design: modern platforms no longer rely on a single general-purpose processor, but instead benefit from dedicated processors tailored for each task.  Traditionally, these specialized processors have been difficult to program due to separate memory spaces, kernel-driver-level interfaces, and specialized programming models.  The Heterogeneous System Architecture (HSA) aims to bridge this gap by providing a common system architecture and a basis for designing higher-level programming models for all devices.  This tutorial will bring in experts from member companies of the HSA Foundation to describe the Heterogeneous System Architecture and how it addresses the challenges of modern computing devices.  Additionally, the tutorial will show example applications and use cases that can benefit from the features of HSA.  To view a program of all ISCA 2014 tutorials, visit http://www.hsafoundation.com/isca-2014-tutorial-2/.

AGENDA
Phil Rogers, AMD Corporate Fellow, and President, HSA Foundation

Modern workloads comprise a mix of parallel and scalar processing, which does not match well to any single processor type and demands a new approach.  HSA is an architecture that embraces different kinds of processing cores, such as CPUs, GPUs, and DSPs, interoperating in coherent shared memory to perform the processing of such workloads. The HSA Foundation is a non-profit consortium that was founded in 2012 to define this architecture and bring it to market. This talk describes HSA and its key features at a high level, and presents some workload examples.

HSAIL Virtual Parallel Instruction Set
Ben Sander, Senior Fellow at AMD

Unlike the CPU world, GPUs today present a diverse set of instruction set architectures – frequently GPU products from the same vendor will utilize different instruction sets.  As popular programming languages add constructs for multi-core and GPU acceleration, the need for a portable compiler intermediate language and binary format becomes increasingly important.  In this talk we describe HSAIL – a low-level, portable compiler IR designed for efficient parallel computing.  The talk includes an explanation of the unique features in HSAIL, describes tools that can be used to generate and use HSAIL, and introduces several popular compiler tool chains and programming languages that are providing GPU acceleration through HSAIL.

HSA Runtime Specification
Yeh-Ching Chung, professor, MediaTek Lab, Department of Computer Science, National Tsing Hua University

The HSA runtime is a thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components.  In this talk, we will describe the architecture and APIs for the HSA Core Runtime including initialization, topology, signals, synchronization, memory management, architected dispatch, and program linking.  For the APIs in each category, we provide examples to explain their usage.
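The host-side lifecycle described above (initialize, discover a component, create a queue, dispatch a kernel, wait on a completion signal) can be sketched as follows. This is a minimal Python simulation of that control flow, not the actual C runtime API; all names here (`Signal`, `Queue`, `dispatch`) are illustrative stand-ins, with a worker thread playing the role of an HSA component.

```python
import threading

class Signal:
    """Illustrative stand-in for an HSA signal: a value the host can wait on."""
    def __init__(self, value):
        self._value = value
        self._cond = threading.Condition()

    def decrement(self):
        with self._cond:
            self._value -= 1
            self._cond.notify_all()

    def wait_eq(self, target):
        with self._cond:
            while self._value != target:
                self._cond.wait()

class Queue:
    """Illustrative user-mode queue: the host enqueues dispatch packets and a
    worker thread plays the role of the HSA component that consumes them."""
    def __init__(self):
        self._packets = []
        self._cond = threading.Condition()
        threading.Thread(target=self._consume, daemon=True).start()

    def dispatch(self, kernel, args, completion_signal):
        with self._cond:
            self._packets.append((kernel, args, completion_signal))
            self._cond.notify()          # "ring the doorbell"

    def _consume(self):
        while True:
            with self._cond:
                while not self._packets:
                    self._cond.wait()
                kernel, args, signal = self._packets.pop(0)
            kernel(*args)                # run the "kernel"
            signal.decrement()           # report completion to the host

# Host-side flow: initialize -> create queue -> dispatch -> wait on signal.
results = []
queue = Queue()
done = Signal(1)
queue.dispatch(lambda x: results.append(x * 2), (21,), done)
done.wait_eq(0)                          # host blocks until the kernel completes
print(results[0])                        # -> 42
```

The real runtime exposes the same shape of workflow as C entry points in the categories listed above; the point of the sketch is the thin, user-mode character of the interface: no kernel-driver round trip sits between dispatch and execution.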

HSA Memory Model
Benedict R. Gaster, architect working at Qualcomm on next-generation heterogeneous processors

The ability for developers to reason about values returned from memory loads plays a fundamental role in defining correct programs. In the presence of concurrency, as offered by HSA’s execution model, the developer must correctly deal with memory values that may have been updated by multiple concurrent actors.  Shared memory computers and programming languages divide this complexity into models: in particular, the memory model specifies safety, while the execution model specifies liveness.  Previous sections of this tutorial have focused on HSA’s execution model; in this section we address HSA’s memory model, as seen by the application developer. Like other state-of-the-art memory models, HSA’s model is a shared memory consistency model.  However, unlike C++ or Java, HSA also directly addresses issues with locality in a heterogeneous setting, adopting HRF’s notion of scopes, which allows the developer fine-grained control over store visibility.  In this talk we introduce HSA’s memory model, provide an introduction to its formal underpinnings, and show how other modern memory models, such as C++ and OpenCL 2.0, map naturally to HSA’s model. We provide motivation for the scoped memory model, also supported by OpenCL 2.0, and use examples to highlight its main features.
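The classic message-passing idiom that such a model must make safe can be shown with two threads: a producer stores data and then performs a release operation on a flag; the consumer's matching acquire on the flag guarantees it observes the data. The sketch below expresses the pattern with Python threads, where an `Event` stands in for the synchronizing release/acquire pair; in HSA the developer would additionally choose a scope (e.g. work-group vs. system) for the synchronizing operations.

```python
import threading

payload = 0
flag = threading.Event()   # plays the role of the synchronizing flag

def producer():
    global payload
    payload = 42           # ordinary store to shared data
    flag.set()             # "release": publishes the store above

def consumer(out):
    flag.wait()            # "acquire": pairs with the release
    out.append(payload)    # guaranteed to observe payload == 42

seen = []
t_cons = threading.Thread(target=consumer, args=(seen,))
t_prod = threading.Thread(target=producer)
t_cons.start(); t_prod.start()
t_prod.join(); t_cons.join()
print(seen)  # -> [42]
```

Without the release/acquire pairing, nothing would forbid the consumer from seeing the flag set but the payload still stale; scopes let the developer say how far (work-group, agent, or system) that visibility guarantee must reach.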

HSA Queuing
Håkan Persson, Senior Principal Engineer at ARM

HSA’s queuing features define an architected mechanism for different processors in the system to communicate with each other.  This talk ties together the HSA features to show how HSA enables greater efficiency in task dispatch. The tutorial starts with an overview of how GPU compute task dispatch works today and gradually introduces HSA technologies to illustrate how the overhead is reduced.  In addition, HSA signaling is reviewed and the HSA job submission mechanism is discussed in detail. The talk finishes with worked examples of task submission and dependency management between tasks.
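The submission mechanism discussed in this talk follows a claim/fill/ring pattern: the producer claims a packet slot by bumping a write index, fills the slot, and rings a doorbell; the consumer drains slots by advancing a read index. A minimal single-producer sketch in Python follows; the queue size and packet contents are illustrative, not the real AQL packet format, and Python condition variables stand in for doorbell signaling.

```python
import threading

QUEUE_SIZE = 8                       # a power of two, as in HSA queues

packets = [None] * QUEUE_SIZE        # the ring buffer of packet slots
write_index = 0
read_index = 0
doorbell = threading.Condition()
completed = []

def submit(task):
    """Producer: claim a slot, fill the packet, then ring the doorbell."""
    global write_index
    with doorbell:
        slot = write_index % QUEUE_SIZE
        packets[slot] = task         # fill the packet before publishing it
        write_index += 1             # publish: slot now visible to consumer
        doorbell.notify()            # ring the doorbell

def consume(n):
    """Consumer: drain n packets in submission order."""
    global read_index
    for _ in range(n):
        with doorbell:
            while read_index == write_index:
                doorbell.wait()      # sleep until the doorbell rings
            task = packets[read_index % QUEUE_SIZE]
            read_index += 1
        completed.append(task())

worker = threading.Thread(target=consume, args=(4,))
worker.start()
for i in range(4):
    submit(lambda i=i: i * i)
worker.join()
print(completed)  # -> [0, 1, 4, 9]
```

Because both indices and the slots live in ordinary shared memory, submission needs no system call; that is the source of the dispatch-overhead reduction the talk walks through.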

HSA Compilation Technology
Wen-mei W. Hwu, Professor, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

Most application developers will access HSA features through high-level programming languages such as C++, Java, and Python. In all cases, it is important that the same source code can benefit from HSA features while maintaining compatibility with discrete heterogeneous computing architectures. In this talk, we will describe a concrete example of a C++ AMP implementation based on Clang and LLVM that allows developers to take advantage of the cache-coherent global shared address space, efficient queues, and platform atomic operations while maintaining good performance of the same source code in traditional, discrete GPU/CPU heterogeneous computing systems. Time permitting, we will also discuss a related Java implementation.

HSA Application Programming Techniques
Wen-mei W. Hwu, Professor, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign

HSA offers several architectural features that expand the types of application algorithms that can achieve high performance in a heterogeneous computing system: cache coherence, a global shared address space, efficient task queues, and platform atomic operations. The cache-coherent global address space enables applications whose working sets are determined at run time to benefit from throughput compute devices. Efficient queues allow efficient execution of finer-granularity tasks on throughput devices. Platform atomic operations allow better coordination of host and devices when they collaborate on a computation. In this talk, we will present application algorithm code examples and benchmarking results that illustrate how these features can be effectively utilized.
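The host/device coordination enabled by platform atomics can be illustrated with dynamic work distribution: every agent pulls chunks from one shared counter, so load balancing happens at run time instead of by a static split. In the Python sketch below, threads stand in for HSA agents and a lock-protected counter stands in for a platform atomic fetch-add; chunk size and data are illustrative.

```python
import threading

N = 1000
CHUNK = 64
data = list(range(N))
out = [0] * N

counter = 0
counter_lock = threading.Lock()

def fetch_add(amount):
    """Stand-in for a platform atomic fetch-add visible to all agents."""
    global counter
    with counter_lock:
        old = counter
        counter += amount
        return old

def agent():
    """Each agent (host or device) grabs the next chunk until work runs out."""
    while True:
        start = fetch_add(CHUNK)
        if start >= N:
            return
        for i in range(start, min(start + CHUNK, N)):
            out[i] = data[i] * 2     # the "kernel" applied to this chunk

threads = [threading.Thread(target=agent) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(out[:5], sum(out))  # -> [0, 2, 4, 6, 8] 999000
```

Because the counter lives in coherent shared memory, an agent that finishes early simply takes more chunks; no agent ever idles while work remains, which is exactly the kind of host-device collaboration the talk's benchmarks examine.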


Phil Rogers, AMD Corporate Fellow and President, HSA Foundation, is the lead architect for the Heterogeneous System Architecture. He is channeling his expertise in designing highly efficient GPUs into drastically reducing the power consumed when running modern applications on heterogeneous processors. After joining ATI Technologies in 1994, Phil served in increasingly senior architecture positions in the development of DirectX® and OpenGL® software. Phil was instrumental in the development of all ATI Radeon GPUs since the introduction of the Radeon series in 2000. Phil joined AMD with the ATI acquisition in 2006 and has played a lead role in heterogeneous computing, APU architecture, and programming models during his tenure. Phil began his career at Marconi Radar Systems, where he designed digital signal processors for advanced radar systems, which is why he has a sense of urgency to bring DSPs into the HSA framework as soon as possible. Phil earned his Bachelor of Science degree in electronic and electrical engineering from the University of Birmingham.


Ben Sander is a Senior Fellow at AMD and was the spec editor for the 0.95 version of the HSAIL Programmer’s Reference Manual. Ben joined AMD in 1995 and has served in various technical and managerial roles on the CPU and GPU development teams. He previously led the CPU performance team at AMD and was deeply involved in the development of the CPU and northbridge architectures for AMD Opteron processors. In 2009, Ben switched into an individual contributor role in GPU software, optimizing OpenCL™ performance and workloads. Ben’s strong background in both CPU and GPU performance architecture led to his current role as the lead software architect for AMD’s Heterogeneous System Architecture program. Ben’s interests include compilers, programming models (including the Bolt C++ template library), performance, and computer architecture. Ben received his Master of Science and Bachelor of Science degrees from the University of Illinois in Champaign, IL.



Yeh-Ching Chung is a professor in the Department of Computer Science at National Tsing Hua University (NTHU). His research interests are in the areas of parallel and distributed processing, cloud computing, and embedded systems. He is the founder of the Taiwan Association of Cloud Computing (http://www.tacc.org.tw), the chief scientist of the UniCloud research group (https://www.unicloud.org.tw), and the deputy director of the Computer and Communication Research Center (CCRC) of NTHU. He delivered an HSA emulator, called HSAemu (hsaemu.org), in collaboration with MediaTek in 2013. Dr. Chung received his Ph.D. degree in Computer Science from Syracuse University.



Benedict R. Gaster is an architect working at Qualcomm on next-generation heterogeneous processors. Benedict has contributed extensively to OpenCL’s design and has recently worked on extending data-race-free consistency models to be heterogeneity-aware. Benedict holds a Ph.D. in computer science for his work on type systems for extensible records and variants.




Håkan Persson is a Senior Principal Engineer at ARM and is the spec editor for the HSA System Architecture workgroup. Håkan had more than 10 years of experience in mobile chipset design before joining ARM in 2012, where he is the architect for the Mali GPU job scheduling subsystem and MMU. Håkan received his Master of Science degree in Electrical Engineering from Lund University.





Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu). He is a co-founder and CTO of MulticoreWare. For his contributions to research and teaching, he received the ACM SigArch Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of the IEEE and the ACM. He is the editor-in-chief of the upcoming book “Heterogeneous Parallel Architecture for Application Programming.” Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.