by Pekka Jääskeläinen and Kati Tervo, Customized Parallel Computing group, Tampere University, Finland. IEEE Computer Society: https://www.computer.org/publications/tech-news/accelerator-framework-for-portable-computing-language
Diverse heterogeneous platforms that combine various types of resources, such as general purpose processors, customized co-processors, hardware accelerators and FPGAs, have been at the core of the Customized Parallel Computing (CPC) group’s research interests for almost two decades. The group’s mission is to research and develop technologies that make customized heterogeneous parallel platforms easier to design and program, so that their benefits reach a wider range of applications and end-users.
One of the activities of CPC has been to study, adopt and promote standards such as HSA, OpenCL and more recently C++. Along these lines, a major contribution of CPC to the heterogeneous platform community is Portable Computing Language (POCL), a flexible open source implementation framework of the OpenCL standard. The goal of POCL is to integrate as many diverse devices as possible in a single OpenCL context to allow system level optimizations, eventually harnessing all heterogeneous devices in the system for the application to use under a single API. We consider the OpenCL API a good core on top of which higher-level software layers can be added for increased engineering productivity, automatic adaptation and other purposes.
POCL already supports multiple device types, for example HSA Base Profile accelerators with native LLVM ISA based compilation, NVIDIA GPUs via libcuda, multiple CPUs, and open source application-specific processors using the TCE target. It is also known to have multiple private backends that have not (yet) been upstreamed.
The latest class of devices we want to integrate into POCL is fixed function hardware accelerators. Hardware accelerators are power efficient implementations of challenging algorithms in heterogeneous computing platforms. They are used to make key tasks in applications such as video codecs or machine vision pipelines faster and more power efficient while consuming less chip area. Programming them at a high level and integrating them into the application software in a standard and portable way is one of the interesting challenges that CPC is currently studying.
While the efficiency benefits of hardware accelerators are clear, the trade-off compared to software programmable co-processors is their post-fabrication inflexibility: the function the accelerator performs cannot be changed after the chip has been manufactured, because the accelerator’s data path cannot be freely re-programmed to implement a new function outside the supported configuration parameters. However, there is a coarser “task-level” degree of programmability that should be considered: even if the functions of the individual IP blocks in the system cannot be changed, it is essential that the accelerator functionality is integrated into the application software logic in a cohesive and efficient manner.
OpenCL 1.2 introduced a device type called custom device and the concept of built-in kernels, which brought hardware accelerators to OpenCL programmers. Using these concepts, a device driver can advertise itself as a non-programmable accelerator that supports a set of “kernels” identified only by their names.
Since the semantics of the built-in kernels implemented by the accelerators are not “centrally defined” anywhere, the end user is expected to know the accelerator by its device ID and to know the meaning of the built-in functions it provides. This, of course, reduces the portability of an OpenCL program across vendors when it uses hardware accelerators.
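The portability problem can be illustrated with a small model. OpenCL reports a device's built-in kernels as a semicolon-separated list of names (the CL_DEVICE_BUILT_IN_KERNELS device query), so an application can only match on name strings. The sketch below is plain Python rather than real OpenCL host code, and the device names and vendor-specific kernel names in it are hypothetical.

```python
def parse_builtin_kernels(names):
    """Split a semicolon-separated built-in kernel list into kernel names."""
    return [n.strip() for n in names.split(";") if n.strip()]


def devices_supporting(kernel_name, devices):
    """Return the devices whose advertised list contains kernel_name."""
    return [
        dev
        for dev, names in devices.items()
        if kernel_name in parse_builtin_kernels(names)
    ]


# Hypothetical devices and the kernel lists they advertise.
# Only the name is visible; nothing states what each kernel computes.
devices = {
    "vendor_a_accel": "pocl.vecadd_i3;vendor_a.fir",
    "vendor_b_accel": "vendor_b.vadd3",
    "cpu": "",
}

matches = devices_supporting("pocl.vecadd_i3", devices)
```

Here the second accelerator might implement the very same operation, but because it uses a different name the application cannot discover that without vendor-specific knowledge.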
As a framework for easily adopting the OpenCL standard for the diversity of heterogeneous devices including hardware accelerators, what can POCL then provide to make the application integration of hardware accelerators easier? In our first code contribution to POCL accelerator support, we rely on the following concepts which have been shown to work well together in customization of SoCs (FPGA verified) that also include hardware accelerators:
1) Define a standardized memory mapped hardware IP interface/wrapper which POCL can assume to be present in the address space of the process. We based the interface on a set of memory mapped registers for probing the device essentials (to implement “plug’n play” functionality for reconfigurable platforms) and utilized the bit-level-specified HSA AQL for kernel queueing.
2) Contribute an example custom device driver implementation in POCL upstream. The default implementation assumes the hardware accelerator has the standard interface. It integrates accelerator launches into the top-level task scheduling process of POCL to “play nice” with the other devices in the same context, for example, by running the accelerator in parallel with CPU kernel execution.
3) Provide a list of known built-in functions with pre-specified names, integer function identifiers, and argument interfaces, which can be expanded in future releases of POCL. This mitigates the portability problem if the list becomes a de facto standard, which can sometimes happen with widely adopted open source software.
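To make the kernel queueing part of item 1 more concrete, the following sketch packs a 64-byte HSA AQL kernel dispatch packet with Python's struct module. The field order and the packet type value follow the HSA specification. How the POCL driver actually fills the fields is not described here, so placing the integer built-in function identifier in the kernel_object field, using a work-group size of 1, and leaving the barrier and fence bits cleared are all assumptions made for illustration. make_dispatch_packet is a hypothetical helper.

```python
import struct

# AQL kernel dispatch packet layout (little-endian, 64 bytes):
# header, setup, workgroup_size_x/y/z, reserved0        -> 6 x uint16
# grid_size_x/y/z, private_segment_size,
# group_segment_size                                     -> 5 x uint32
# kernel_object, kernarg_address, reserved2,
# completion_signal                                      -> 4 x uint64
AQL_DISPATCH = struct.Struct("<HHHHHHIIIIIQQQQ")

PACKET_TYPE_KERNEL_DISPATCH = 2  # packet type, stored in header bits 0..7


def make_dispatch_packet(kernel_id, kernarg_address, grid_x, completion_signal=0):
    """Pack a one-dimensional kernel dispatch packet.

    Assumption: kernel_id (a built-in function identifier) is carried in
    the kernel_object field; barrier and fence bits are left at zero.
    """
    header = PACKET_TYPE_KERNEL_DISPATCH
    setup = 1  # number of grid dimensions, stored in setup bits 0..1
    return AQL_DISPATCH.pack(
        header, setup,
        1, 1, 1,          # work-group size x, y, z (assumed)
        0,                # reserved0
        grid_x, 1, 1,     # grid size x, y, z
        0, 0,             # private and group segment sizes
        kernel_id,        # kernel_object (assumed to hold the function ID)
        kernarg_address,  # address of the kernel argument buffer
        0,                # reserved2
        completion_signal,
    )


packet = make_dispatch_packet(kernel_id=0, kernarg_address=0x1000, grid_x=4)
```

Because the packet format is specified at the bit level, the same bytes can be written by any host driver and read by any accelerator wrapper that follows the interface.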
Using the list of known accelerators, multiple vendors or, for example, open source community contributors can implement the accelerator in their own way and the application writers can use it, knowing that the invoked built-in functions implement exactly the intended functionality (no matter who provided the IP block). The built-in function descriptors can be along the lines of (this textual description is implemented as software structs and enums):
A kernel called pocl.vecadd_i3 implements element-wise addition of vectors of 3-bit integers stored in byte vectors. The first argument is a physical address pointer to the beginning of the first input vector, the second argument points to the beginning of the second input vector, and the third argument points to the output (which can be the same as one of the inputs). The vector width (in rounded-up number of bytes) is specified using the grid size dimension x.
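A descriptor like the one above could be captured in software roughly as follows. This is a Python model of the idea only: the class names, the enum values and the numeric identifier are hypothetical and do not reflect POCL's actual structs and enums. The grid_size_x helper encodes one reading of "rounded-up number of bytes", namely the number of bytes needed to hold the packed 3-bit elements.

```python
from dataclasses import dataclass
from enum import IntEnum


class BuiltinKernelId(IntEnum):
    VECADD_I3 = 0  # hypothetical integer function identifier


class ArgKind(IntEnum):
    INPUT_POINTER = 0   # physical address of an input vector
    OUTPUT_POINTER = 1  # physical address of the output vector


@dataclass(frozen=True)
class BuiltinKernelDescriptor:
    name: str                   # pre-specified kernel name
    kernel_id: BuiltinKernelId  # integer function identifier
    args: tuple                 # argument interface, in order


VECADD_I3 = BuiltinKernelDescriptor(
    name="pocl.vecadd_i3",
    kernel_id=BuiltinKernelId.VECADD_I3,
    args=(ArgKind.INPUT_POINTER, ArgKind.INPUT_POINTER, ArgKind.OUTPUT_POINTER),
)

# The list of known built-in functions, indexed by name.
KNOWN_KERNELS = {d.name: d for d in (VECADD_I3,)}


def grid_size_x(num_elements, bits_per_element=3):
    """Bytes needed to hold num_elements packed elements, rounded up."""
    return (num_elements * bits_per_element + 7) // 8
```

With this reading, a vector of ten 3-bit elements occupies 30 bits, so grid_size_x(10) gives 4 bytes.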
This initial framework and example implementations of the above-mentioned concepts have now been committed to the POCL master branch. The quick start instructions along with an example accelerator created with the open ASIP tools we are developing are available here. The work is ongoing and we are happy to receive your feedback. Pull requests are especially welcome! You can reach us via the POCL GitHub or the POCL discussion channels.
We are currently looking into improving efficient asynchronous, host-autonomous execution across multiple accelerators. We are also investigating support for SoCs that have IOMMUs capable of system-shared fine-grained virtual memory, as mandated by the HSA Full Profile and OpenCL 2.0 System Shared Virtual Memory.
Finally, we would like to thank the funding sources that make our ongoing customized computing open source and academic contributions possible: the HSA Foundation, Academy of Finland (funding decision 297548) and ECSEL JU project FitOptiVis (project number 783162).
For additional information, contact Pekka Jääskeläinen at: firstname.lastname@example.org.