Prof. Jenq-Kuen Lee's research
In this work, we present efficient inference of quantized models by supporting the TVM QNN flow with RISC-V SIMD computations. As RISC-V supports both superword SIMD and subword SIMD, we compile models with TVM and replace the computation kernels with designated LLVM intrinsic functions that map to the appropriate RISC-V SIMD instructions. Experiments show a 1.79-7.58x reduction in instruction count for the quantized models compared with the FP32 implementation. The accuracy loss, evaluated on 1k images, is acceptable. The benchmarks include the MobileNet and Inception series. All experiments are executed on Spike with RISC-V SIMD support.
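As a rough illustration of why subword SIMD can lower the instruction count for quantized models, the sketch below emulates in Python a multiply-accumulate over four signed 8-bit lanes packed into one 32-bit word. The helper names (quantize, pack4, lane, dot4_packed, quantized_dot) and the symmetric per-tensor scale are assumptions made for this example; they are not the TVM QNN API, nor the LLVM intrinsics used in the work.

```python
def quantize(values, scale):
    """Symmetric int8 quantization: q = clamp(round(x / scale), -128, 127)."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def pack4(lanes):
    """Pack four signed 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for i, q in enumerate(lanes):
        word |= (q & 0xFF) << (8 * i)
    return word

def lane(word, i):
    """Extract signed 8-bit lane i from a packed 32-bit word."""
    b = (word >> (8 * i)) & 0xFF
    return b - 256 if b >= 128 else b

def dot4_packed(acc, a_word, b_word):
    """Emulate one subword multiply-accumulate: acc += sum of four 8-bit products."""
    return acc + sum(lane(a_word, i) * lane(b_word, i) for i in range(4))

def quantized_dot(a, b, scale_a, scale_b):
    """Dot product on int8 data with an integer accumulator, rescaled to float at the end."""
    qa, qb = quantize(a, scale_a), quantize(b, scale_b)
    assert len(qa) == len(qb) and len(qa) % 4 == 0
    acc = 0
    for i in range(0, len(qa), 4):
        acc = dot4_packed(acc, pack4(qa[i:i + 4]), pack4(qb[i:i + 4]))
    return acc * scale_a * scale_b

# Illustrative inputs chosen to be exactly representable at scale 1/64.
a = [0.5, -1.0, 0.25, 1.0, -0.5, 0.75, 0.0, -0.25]
b = [1.0, 0.5, -1.0, 0.25, 0.5, -0.75, 1.0, 0.5]
approx = quantized_dot(a, b, scale_a=1 / 64, scale_b=1 / 64)
exact = sum(x * y for x, y in zip(a, b))
```

Here each call to dot4_packed stands in for a single SIMD instruction that covers four multiplications and their accumulation, which is the kind of saving the instruction-count comparison measures. Real models also need zero points and requantization, which this sketch leaves out.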
In this project, we propose to construct an NNEF (Neural Network Exchange Format) compiler infrastructure with a GUI-based design and fallback engine libraries for AI NN models. NNEF is a unified network description format that addresses the fragmentation of machine learning deployment: network descriptions from different AI frameworks are not compatible with each other, and inference engines are likewise not compatible with every deep learning framework.
Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment, the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. We present a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that, relative to a system without a power-gating mechanism, the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.
more information: Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs (TODAES 2014)
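The concurrency issue described in the abstract above can be sketched in a much simplified form: within one BSP superstep, a component may be gated off only if no concurrently running thread uses it. The Python below is a minimal sketch under that assumption. The function names, the component names, and the cycle and leakage numbers are all hypothetical, and the code does not reproduce MTPGA or the predicated power-gating instructions.

```python
def gating_plan(supersteps, components):
    """For each superstep, return the set of components that may be powered off.

    supersteps: list of supersteps; each superstep maps a thread name to the
    set of components that thread may use before the next barrier.
    """
    plan = []
    for step in supersteps:
        used = set()
        for thread_usage in step.values():
            used |= thread_usage          # union over all concurrent threads
        plan.append(set(components) - used)
    return plan

def leakage_saved(plan, step_cycles, leak_per_cycle):
    """Estimate leakage energy avoided, ignoring on/off transition overhead."""
    return sum(len(off) * cycles * leak_per_cycle
               for off, cycles in zip(plan, step_cycles))

# Hypothetical machine with three gateable components and two SMT threads.
components = {"alu", "fpu", "mul"}
supersteps = [
    {"t0": {"alu"}, "t1": {"alu", "mul"}},
    {"t0": {"fpu"}, "t1": {"alu"}},
]
plan = gating_plan(supersteps, components)
saved = leakage_saved(plan, step_cycles=[100, 50], leak_per_cycle=1)
```

Analyzing thread t0 alone would mark the multiplier as idle in the first superstep, although t1 uses it there. Taking the union over threads is the simplest way to avoid that mistake; a real framework must also weigh the energy cost of switching a component on and off.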
Our lab is currently an academic member of the HSA Foundation. The Heterogeneous System Architecture (HSA) is designed to efficiently support a wide assortment of data-parallel and task-parallel programming models. We aim to develop high-performance compilers supporting OpenCL for heterogeneous architectures. We focus on compiler optimization methods, including OpenCL divergence, low-power instruction analysis, and register allocation. In 2015 we released, through the HSA Foundation, an OpenCL runtime for HSA based on enhancements to PoCL.
I am currently the group PI for the MOEA project (2012-2015) for OpenCL compilers and applications. The project has matching funds from Mediatek, III, ITRI, and the MOEA site. Early research work includes OpenCL vector compiler optimizations, with OpenCL SIMD optimizations presented at CPC 2012. We present a novel framework for SIMD optimizations of OpenCL programs. Our approach is functional: we model computations mathematically as data access functions. SIMD optimization can then be done by optimizing and rewriting these data access functions. Our proposed schemes are implemented on the Open64 compiler infrastructure, which we enhanced to support OpenCL on X86 CPUs and ATI GPUs. In addition, we developed the first prototype of an OpenCL compiler in the Open64 environment. The work was presented at the Open64 Developers Forum (2010 and 2011). An enabling flow for the OpenCL compiler on the multi-core PAC DSP was published in ICPP EMS 2012. There, we illustrate the compiler flow for multi-core DSP (emulated in SID with 8-16 DSP processors, and in a real environment with PAC DUO). In addition, a divergence analysis scheme for OpenCL on GPUs was developed to cover both scalar and pointer cases. The work, entitled Pointer-Based Divergence Analysis in the SSA Form, was published in CPC 2013. When PEs take different paths on a conditional branch, divergence occurs and all PEs must then run serially. Divergence seriously degrades performance. Therefore, research has focused on divergence analysis to enable compiler optimizations on GPUs. To the best of our knowledge, no existing work has considered divergence analysis for pointer-based programs. In our work, we present a novel scheme that reports divergence information for pointer-based programs.
Our approach is based on two extended static single assignment (SSA) forms, memory SSA and gated SSA. In our proposed scheme, a divergence relation graph (DRG) with gated controls is first built. The DRG contains all of the possible points-to relationships of the pointers and the initialized divergent states. Next, the DRG is transformed into a simplified graph with our reduction scheme. Finally, the divergent state of pointers can be determined by combining the divergent states of the graph. The proposed scheme is implemented in the Open64 compiler and can be applied to OpenCL programs. The results show that our scheme is effective in analyzing the divergence information of pointer-based OpenCL programs. In addition, we demonstrate that our OpenCL compiler works efficiently with real-world applications. We collaborated with Prof. Shang-Hong Lai (NTHU) on vehicle detection methods with OpenCL programming on multi-core systems. The work was published in ESTIMedia 2013 (a part of ESWeek 2013).
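To make the idea of divergence propagation concrete, the following Python is a minimal fixed-point sketch, not the DRG construction or reduction scheme of the paper. It assumes a program already in an SSA-like form where a gated phi lists its predicate among its operands, and it treats a load through a pointer as divergent when either the pointer itself or any variable it may point to is divergent. All names and the input encoding are hypothetical, and stores are not modeled.

```python
def divergent_values(defs, loads, points_to, seeds):
    """Return the set of SSA names whose value may differ between PEs.

    defs:      name -> operand names (for a gated phi, include the predicate)
    loads:     name -> pointer name, for values defined as `name = *pointer`
    points_to: pointer name -> set of names the pointer may reference
    seeds:     names that are divergent by definition (e.g. a work-item id)
    """
    divergent = set(seeds)
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for name, operands in defs.items():
            if name not in divergent and any(o in divergent for o in operands):
                divergent.add(name)
                changed = True
        for name, ptr in loads.items():
            targets = points_to.get(ptr, set())
            if name not in divergent and (
                ptr in divergent or any(t in divergent for t in targets)
            ):
                divergent.add(name)
                changed = True
    return divergent

# Hypothetical kernel fragment:
#   c = (tid < 4); p = gamma(c, &a, &b); x = *p; q = &a; y = *q; s = y + k
defs = {"c": ["tid"], "p": ["c", "pa", "pb"], "q": ["pa"], "s": ["y", "k"]}
loads = {"x": "p", "y": "q"}
points_to = {"p": {"a", "b"}, "q": {"a"}}
result = divergent_values(defs, loads, points_to, seeds={"tid"})
```

In this fragment the pointer p is divergent because its gating predicate depends on the work-item id, so the load x is divergent as well, while q, y, and s stay uniform.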
Embedded multicore DSP systems are playing an increasingly important role in consumer electronics design. Such systems aim to optimize both performance and power on mobile devices, so embedded application developers must design their applications for power as well as performance. However, there is currently no power-metric support in popular application design platforms such as QEMU and SID, where application developers build their applications. This hinders application developers from tuning optimizations for power. In this paper, we propose a power-aware simulation framework for embedded multicore DSP subsystems on the SID framework. To the best of our knowledge, this is the first attempt to build a power-aware simulator based on the SID simulation framework. The power estimation flow includes two phases: IP-level power modeling and system-level power profiling. In IP-level power modeling, PowerMixerIP is employed to build the power model for the PAC DSP and major IPs. In system-level power profiling, we provide a power profiling hierarchy that meets the needs of embedded software developers. The granularity of power profiling can be configured to cover the whole simulation or any specific time slot in the simulation, such as a dedicated function loop. In our experiments, DSP programs with SIMD intrinsics from the DSPStone benchmark are examined with our proposed power-aware simulator. In addition, a face detection application is deployed as a running example on multi-core DSP systems to show how our power simulator can assist developers during optimization by presenting views of the power dissipation of applications.
more information: Power Aware SID-based Simulator for Embedded Multicore DSP Subsystems (CODES+ISSS 2010)
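The configurable profiling granularity described above can be pictured with a small Python sketch: given a per-component activity trace and a per-state power model, energy is summed only over the part of the trace that falls inside a chosen window. The trace format, component names, and power numbers are hypothetical and are not PowerMixerIP output or the SID interface.

```python
def energy_in_window(trace, power_model, window):
    """Sum per-component energy for the part of each trace record inside the window.

    trace:       list of (component, state, start, end) with times in cycles
    power_model: (component, state) -> power per cycle, in arbitrary units
    window:      (start, end) in cycles; end is exclusive
    """
    w_start, w_end = window
    energy = {}
    for component, state, start, end in trace:
        overlap = min(end, w_end) - max(start, w_start)
        if overlap > 0:
            energy[component] = (energy.get(component, 0)
                                 + overlap * power_model[(component, state)])
    return energy

# Hypothetical IP-level model and simulation trace.
power_model = {("dsp0", "active"): 5, ("dsp0", "idle"): 1, ("dma", "active"): 2}
trace = [
    ("dsp0", "active", 0, 100),
    ("dsp0", "idle", 100, 150),
    ("dma", "active", 80, 120),
]
whole_run = energy_in_window(trace, power_model, (0, 150))
loop_only = energy_in_window(trace, power_model, (90, 110))
```

The first call profiles the whole simulation, while the second restricts the profile to cycles 90 to 110, as one might do for a single function loop.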
Embedded processors developed within the past few years have employed novel hardware designs to reduce the ever-growing complexity, power dissipation, and die area. While a distributed register file architecture with irregular access constraints requires fewer read/write ports than a traditional unified register file structure, conventional compilation techniques cannot produce optimal performance from such new register file organizations. This paper presents a novel scheme for register allocation that includes global and local components on a VLIW DSP processor with distributed register files whose port access is highly restricted. In the scheme, an optimization phase performed prior to conventional global/local register allocation, named global/local register file assignment (RFA), is used to minimize various register file communication costs. For a register file structure where each cluster contains heterogeneous register files, the enhancement required to a conventional register allocation scheme with cluster assignment is for it to cope with both inter- and intra-cluster communications. Due to the potentially heavy influence of global RFA on local RFA, a heuristic algorithm is proposed for global RFA to make suitable decisions on communication for local RFA. Experiments were performed with a compiler, and compilation based on the proposed approach delivers significant performance improvements, comparable to the solution using only the PALF scheme that we have developed.
more information: LC-GRFA: Global Register File Assignment with Local Consciousness for VLIW DSP Processors with Non-uniform Register Files (Concurrency and Computation: Practice and Experience)
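As a toy illustration of what a register file communication cost is, the Python sketch below counts the def-use edges that cross register files and places values with a plain greedy rule. This greedy rule is a stand-in chosen for brevity; it is not the LC-GRFA heuristic or the PALF scheme, and the value names, register file names, and capacities are hypothetical. The sketch assumes the total capacity is large enough to hold every value.

```python
def communication_cost(assignment, edges, copy_cost=1):
    """Count def-use edges whose endpoints sit in different register files."""
    return copy_cost * sum(1 for u, v in edges if assignment[u] != assignment[v])

def greedy_assign(values, edges, capacity):
    """Place each value in the register file with the fewest cross-file edges so far.

    capacity: register file name -> number of values it can hold
    """
    neighbors = {v: set() for v in values}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    assignment, free = {}, dict(capacity)
    for value in values:
        def crossing(rf):
            # Edges to already-placed neighbors that would need a copy.
            return sum(1 for n in neighbors[value]
                       if n in assignment and assignment[n] != rf)
        candidates = [rf for rf in free if free[rf] > 0]
        best = min(candidates, key=crossing)   # ties go to the first file listed
        assignment[value] = best
        free[best] -= 1
    return assignment

# Hypothetical dependence chain a -> b -> c -> d over two small register files.
values = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
assignment = greedy_assign(values, edges, {"rf0": 2, "rf1": 2})
cost = communication_cost(assignment, edges)
```

With two slots per register file, the chain has to be split once, so a single copy between register files remains.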
Most modern processors are equipped with SIMD capability to exploit data parallelism in multimedia applications. Embedded VLIW DSP processors also commonly provide short vector instructions, known as subword instructions, to accelerate data processing. Besides subword operations, their functional units can also be used to process multiple data streams in parallel. However, due to power and area concerns, many embedded VLIW DSP processors adopt distributed register files to reduce read/write ports and wire connections by privatizing register files for clusters and even for functional units. The distributed design presents great challenges for compilers in distributing SIMD workloads to functional units in clusters. In this paper, we address the issue of supporting SIMD parallelism on VLIW DSP processors with subword instructions and distributed register files. Currently, industrial practice has adopted intrinsic support in C programs and compilers for conventional DSP processors, enabling developers to use hardware resources to compete with the performance of hand-written assembly code. However, how to provide such a solution for VLIW DSP processors with distributed register files is still an open issue. In this work, we support subword and cluster intrinsics to allow programmers to express SIMD computation in C programs. Subword intrinsics can be used to manipulate multiple subword data in parallel, while cluster intrinsics enable programmers to distribute computation to clusters. To support cluster intrinsics, we identified obstacles to parallelizing SIMD programs on clusters even with cluster specifications in programs. Essential compiler techniques are devised to enable such a flow. For evaluation, the SIMD intrinsics and compiler support are incorporated in an Open64 compiler, and DSPstone and H.264 kernels are rewritten with SIMD intrinsics. Experimental results show that combining subword and cluster intrinsics brings significant performance improvement: 3 times faster for DSPstone and 4 times faster for H.264 kernels compared with the best results of the original programs.
more information: SIMD Intrinsic Supports for VLIW DSP Processors with Distributed Register Files (LPC 2010)
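The two kinds of intrinsics described above can be mimicked in a few lines of Python. The sketch emulates a subword add on two 16-bit lanes packed in a 32-bit word, and then spreads the packed words round-robin over a number of clusters. The function names and the round-robin policy are assumptions for illustration only; they are not the intrinsics defined for the PAC DSP or the distribution strategy of the compiler.

```python
MASK16 = 0xFFFF

def pack2x16(lo, hi):
    """Pack two 16-bit values into one 32-bit word (lo in bits 0-15)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack2x16(word):
    """Return the two signed 16-bit lanes of a 32-bit word as (lo, hi)."""
    def signed(x):
        return x - 0x10000 if x & 0x8000 else x
    return signed(word & MASK16), signed((word >> 16) & MASK16)

def add2x16(a, b):
    """Lane-wise 16-bit add with wraparound; no carry crosses the lane boundary."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

def cluster_add(xs, ys, clusters=2):
    """Split packed words round-robin over clusters and add each share."""
    shares = [[] for _ in range(clusters)]
    for i, (x, y) in enumerate(zip(xs, ys)):
        shares[i % clusters].append(add2x16(x, y))
    return shares

# Two packed words per operand: four 16-bit additions in two subword operations.
xs = [pack2x16(1, -2), pack2x16(0x7FFF, 3)]
ys = [pack2x16(10, -20), pack2x16(1, 4)]
shares = cluster_add(xs, ys)
results = [unpack2x16(w) for share in shares for w in share]
```

The second word shows the wraparound behavior: adding 1 to 0x7FFF in the low lane overflows to -32768 without disturbing the high lane.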