
Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. Traditional compilers employ energy management techniques that analyze component usage in control-flow graphs, with a focus on single-thread programs. In that setting, leakage power can be controlled by inserting power-on and power-off instructions based on component usage information generated by dataflow equations. However, these methods cannot be directly extended to a multithread environment because of concurrent execution issues. We present a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that, relative to a system without a power-gating mechanism, a system with PPG support and our power optimization method reduces the total energy consumption of BSP programs by an average of 10.09% when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.
more information: Compiler Optimization for Reducing Leakage Power in Multithread BSP Programs (TODAES 2014)
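
As a rough illustration of the single-thread baseline described above (component usage computed by dataflow equations over a control-flow graph, followed by insertion of power-off instructions), the following Python sketch may help. It is not the MTPGA algorithm and does not model PPG or concurrency; the function names, the CFG encoding, and the component names are all assumptions made for this example.

```python
def component_liveness(cfg, uses):
    """Backward dataflow: a component is 'live' where some later path uses it.

    cfg:  dict mapping each block name to a list of successor block names.
    uses: dict mapping a block name to the set of components it uses.
    (Both encodings are hypothetical, chosen for this sketch.)
    """
    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}
    changed = True
    while changed:  # iterate the equations to a fixed point
        changed = False
        for b in cfg:
            out = set()
            for s in cfg[b]:
                out |= live_in[s]            # live_out[b] = union of live_in[succ]
            new_in = uses.get(b, set()) | out  # live_in[b] = use[b] | live_out[b]
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


def power_off_edges(cfg, uses):
    """Edges on which a component stops being needed, so it can be gated off."""
    live_in, live_out = component_liveness(cfg, uses)
    result = {}
    for b in cfg:
        for s in cfg[b]:
            dead = live_out[b] - live_in[s]
            if dead:
                result[(b, s)] = sorted(dead)
    return result


# A loop that uses the FPU, followed by an exit block that only needs the ALU.
cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
uses = {"entry": {"alu"}, "loop": {"alu", "fpu"}, "exit": {"alu"}}
print(power_off_edges(cfg, uses))
```

In this example the FPU can be turned off on the edge from `loop` to `exit`. With several threads sharing a component on an SMT machine, one thread's "no longer used" point says nothing about the others, which is the gap the abstract says a multithread analysis has to close.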

I am currently the group PI for the MOEA project (2012-2015) for OpenCL compilers and applications. The project is supported by matching funds from Mediatek, III, ITRI, and MOEA. Early research work includes OpenCL vector compiler optimizations, with OpenCL SIMD optimizations presented at CPC 2012. We present a novel framework for SIMD optimizations of OpenCL programs. Our approach is functional: we attempt to model computations mathematically as data access functions. SIMD optimization can then be done by optimizing and rewriting these data access functions. Our proposed schemes are implemented on the Open64 compiler infrastructure, which we enhanced to support OpenCL on x86 CPUs and ATI GPUs. In addition, we developed the first prototype of an OpenCL compiler in the Open64 environment. This work was presented at the Open64 Developers Forum (2010 and 2011). An enabling flow for the OpenCL compiler on the multi-core PAC DSP was published in ICPP EMS 2012. There we illustrate the compiler flow for multi-core DSPs (emulated in SID with 8-16 DSP processors, and in a real environment with PAC DUO). In addition, a divergence analysis scheme for OpenCL on GPUs was developed to cover both scalar and pointer cases. The work, entitled Pointer-Based Divergence Analysis in the SSA Form, was published in CPC 2013. When PEs take different paths on a conditional branch, divergence occurs and the PEs must then run serially. Divergence seriously degrades performance, so research has focused on divergence analysis to enable compiler optimizations on GPUs. To the best of our knowledge, no existing research has yet considered divergence analysis for pointer-based programs. In our work, we present a novel scheme that reports divergence information for pointer-based programs.
Our approach is based on two extended static single assignment (SSA) forms, memory SSA and gated SSA. In our proposed scheme, a divergence relation graph (DRG) with gated controls is first built. The DRG contains all of the possible points-to relationships of the pointers, together with initialized divergent states. Next, the DRG is transformed into a simplified graph with our reduction scheme. Finally, the divergent state of pointers can be determined by combining the divergent states of the graph. The proposed scheme is implemented in the Open64 compiler and can be applied to OpenCL programs. The results show that our scheme is effective in analyzing the divergence information of pointer-based OpenCL programs. In addition, we demonstrate that our OpenCL compiler works efficiently with real-world applications. We collaborate with Prof. Shang-Hong Lai (NTHU) on vehicle detection methods with OpenCL programming on multi-core systems. The work was published in ESTIMedia 2013 (a part of ESWeek 2013).
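
To make the idea of divergence analysis concrete, here is a minimal Python sketch that propagates a "divergent" mark over a dependence graph until nothing changes. It is a simplification and not the DRG construction or reduction scheme of the paper: it assumes that gated SSA has already made each merge's predicate an explicit operand and that memory SSA plus points-to information has already listed every location a load may read. The function name and the dictionary encoding are hypothetical.

```python
def divergent_values(deps, seeds):
    """Return every value that may differ between PEs.

    deps:  dict mapping a value name to the names it depends on. For a gated
           merge this includes the gating predicate; for a load through a
           pointer it includes the pointer and every location it may point to.
    seeds: names that are divergent by definition (for example a thread id).
    """
    divergent = set(seeds)
    changed = True
    while changed:  # fixed-point iteration over the dependence graph
        changed = False
        for name, operands in deps.items():
            if name not in divergent and any(o in divergent for o in operands):
                divergent.add(name)
                changed = True
    return divergent


deps = {
    "i": ["tid"],             # i = tid * 4
    "n": ["c"],               # n = c + 1, with c uniform across PEs
    "p": ["cond", "a", "b"],  # p = gamma(cond, &a, &b): gated merge of pointers
    "cond": ["i"],            # cond = i < 8
    "v": ["p", "a", "b"],     # v = *p, where p may point to a or b
    "w": ["n"],               # w = n * 2
}
print(sorted(divergent_values(deps, {"tid"})))
```

Here `p` is divergent only because its gating predicate `cond` is, and `v` is divergent because it is loaded through `p`, even though the pointees `a` and `b` are themselves uniform. That is the kind of pointer case a purely scalar analysis would miss.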

Embedded multicore DSP systems are playing an increasingly important role in consumer electronics design. Such systems aim to optimize both performance and power on mobile devices, so embedded application developers must design their applications for not only performance but also power. However, there is currently no power metrics support in popular application design platforms such as QEMU and SID, where application developers develop their applications. This hinders developers from tuning optimizations for power. In this paper, we propose a power-aware simulation framework for embedded multicore DSP subsystems based on the SID framework. To the best of our knowledge, this is the first attempt to build a power-aware simulator on the SID simulation framework. The power estimation flow includes two phases: IP-level power modeling and system-level power profiling. In IP-level power modeling, PowerMixerIP is employed to build the power model for the PAC DSP and major IPs. In system-level power profiling, we provide a power profiling hierarchy that meets the needs of embedded software developers. The granularity of power profiling can be configured to cover the whole simulation or any specific time slot in the simulation, such as a dedicated function loop. In our experiments, DSP programs with SIMD intrinsics from the DSPstone benchmark are examined with our proposed power-aware simulator. In addition, a face detection application is deployed as a running example on multi-core DSP systems to show how our power simulator can give developers views of the power dissipation of their applications during the optimization process.
more information: Power Aware SID-based Simulator for Embedded Multicore DSP Subsystems (CODES+ISSS 2010)
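
The configurable profiling granularity mentioned above (whole simulation versus a specific time slot) can be pictured with a small Python sketch. It assumes a trace of per-component energy samples has already been produced by some power model; the trace format, the units, and the function name are assumptions for illustration and do not reflect the SID or PowerMixerIP interfaces.

```python
from collections import defaultdict


def profile_energy(samples, start=None, end=None):
    """Sum energy per component over the cycle window [start, end).

    samples: iterable of (cycle, component, energy_nj) tuples (hypothetical format).
    start/end: window bounds in cycles; None means unbounded on that side.
    """
    totals = defaultdict(float)
    for cycle, component, energy_nj in samples:
        if (start is None or cycle >= start) and (end is None or cycle < end):
            totals[component] += energy_nj
    return dict(totals)


samples = [
    (0, "dsp0", 2.0),
    (10, "dsp1", 1.5),
    (20, "dsp0", 3.0),
    (30, "dma", 0.5),
]
whole_run = profile_energy(samples)        # the whole simulation
hot_loop = profile_energy(samples, 10, 30)  # one time slot, e.g. a function loop
print(whole_run, hot_loop)
```

Restricting the window to the cycles of a single loop is one simple way to attribute energy to the code a developer is currently tuning.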

Embedded processors developed within the past few years have employed novel hardware designs to reduce ever-growing complexity, power dissipation, and die area. While a distributed register file architecture with irregular access constraints requires fewer read/write ports than a traditional unified register file structure, conventional compilation techniques cannot produce optimal performance from such new register file organizations. This paper presents a novel scheme for register allocation that includes global and local components on a VLIW DSP processor with distributed register files whose port access is highly restricted. In the scheme, an optimization phase performed prior to conventional global/local register allocation, named global/local register file assignment (RFA), is used to minimize various register file communication costs. For a register file structure where each cluster contains heterogeneous register files, the enhancement required to a conventional register allocation scheme with cluster assignment is for it to cope with both inter- and intra-cluster communications. Because global RFA can heavily influence local RFA, a heuristic algorithm is proposed for global RFA to make suitable decisions on communication for local RFA. Experiments performed with a compiler based on the proposed approach show significant performance improvements compared with the solution using only the PALF scheme that we developed previously.
more information: LC-GRFA: Global Register File Assignment with Local Consciousness for VLIW DSP Processors with Non-uniform Register Files (Concurrency and Computation: Practice and Experience)

Most modern processors are equipped with SIMD capability to exploit data parallelism in multimedia applications. Embedded VLIW DSP processors also commonly provide short vector instructions, known as subword instructions, to accelerate data processing. Besides subword operations, their functional units can also be utilized to process multiple data streams in parallel. However, due to power and area concerns, many embedded VLIW DSP processors adopt distributed register files to reduce read/write ports and wire connections by privatizing register files for clusters and even for functional units. The distributed design presents great challenges for compilers in distributing SIMD workloads to functional units in clusters. In this paper, we address the issue of supporting SIMD parallelism on VLIW DSP processors with subword instructions and distributed register files. Currently, industrial practice has adopted intrinsic support in C programs and compilers for conventional DSP processors, enabling developers to utilize hardware resources to compete with the performance of hand-written assembly code. However, how to provide such a solution for VLIW DSP processors with distributed register files is still an open issue. In this work, we support subword and cluster intrinsics to allow programmers to express SIMD computation in C programs. Subword intrinsics can be used to manipulate multiple subword data in parallel, while cluster intrinsics enable programmers to distribute computation to clusters. To support cluster intrinsics, we identified obstacles to parallelizing SIMD programs on clusters even with cluster specification in programs. Essential compiler techniques are devised to enable such a flow. For evaluation, the SIMD intrinsics and compiler support are incorporated in an Open64 compiler, and DSPstone and H.264 kernels are rewritten with SIMD intrinsics. Experimental results show that combining subword and cluster intrinsics brings significant performance improvement: 3 times faster for DSPstone and 4 times faster for H.264 kernels compared with the best results of the original programs.
more information: SIMD Intrinsic Supports for VLIW DSP Processors with Distributed Register Files (LPC 2010)
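
For readers unfamiliar with subword operations, the Python sketch below emulates what a single subword add does: two independent 16-bit lanes packed into one 32-bit word are added without a carry crossing the lane boundary. The helper names are hypothetical and are not the intrinsics defined in the paper; on real hardware the lane-wise add would be one instruction exposed to C through an intrinsic.

```python
MASK16 = 0xFFFF


def pack2x16(lo, hi):
    """Pack two 16-bit values into one 32-bit word (lo in bits 0-15)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2x16(word):
    """Split a 32-bit word back into its (lo, hi) 16-bit lanes."""
    return word & MASK16, (word >> 16) & MASK16


def add2x16(a, b):
    """Lane-wise wrapping add: each 16-bit lane is added independently."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


# Two additions performed by one "instruction".
print(unpack2x16(add2x16(pack2x16(1, 2), pack2x16(3, 4))))

# Overflow in the low lane wraps around and does not leak into the high lane.
print(unpack2x16(add2x16(pack2x16(0xFFFF, 1), pack2x16(1, 1))))
```

A plain 32-bit add on the second pair would carry out of the low lane and corrupt the high lane, which is why subword arithmetic needs dedicated instructions rather than ordinary integer adds.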

Dual-core processors (and, to an extent, multicore processors) have been adopted in recent years to provide platforms that satisfy the performance requirements of popular multimedia applications. This architecture comprises groups of processing units connected by various interprocess communication mechanisms such as shared memory, memory mapping interrupts, mailboxes, and channel-based protocols. The associated challenges include how to provide programming models and environments for developing streaming applications for such platforms. In this paper, we present middleware called streaming RPC for supporting a streaming-function remoting mechanism on asymmetric dual-core architectures. This middleware has been implemented both on an experimental platform known as the PAC dual-core platform and in TI OMAP dual-core environments. We also present an analytic model of streaming equations to optimize the internal handshaking for our proposed streaming RPC. The usage and efficiency of the proposed methodology are demonstrated in a JPEG decoder, an MP3 decoder, and a QCIF H.264 decoder. The experimental results show that our approach improves the performance of the JPEG, MP3, and H.264 decoders by 24%, 38%, and 32% on PAC, respectively. The communication load of internal handshaking is also reduced compared to the naive use of RPC over embedded dual-core systems. The experiments further show that the performance improvement can be achieved on OMAP dual-core platforms as well.
more information: Enabling Streaming Remoting on Embedded Dual-core Processors (ICPP 2008)

In this paper we present novel methodologies for enhancing the streaming capabilities of Java RMI. Our streaming support for Java RMI includes the pushing mechanism, which allows servers to push data in a streaming fashion to the client site, and the aggregation mechanism, which allows the client site to make a single remote invocation to gather data from multiple servers that keep replicas of data streams and to aggregate the partial data into a complete data stream. In addition, our system allows the client site to forward local data to other clients. Our framework is implemented by extending the Java RMI stub to allow custom designs for streaming buffers and controls, and by providing a continuous buffer for raw data in the transport layer socket. This enhanced framework allows standard Java RMI services to enjoy streaming capabilities. In addition, we propose aggregation algorithms as scheduling methods in such an environment. Preliminary experiments using our framework demonstrate its promising performance in the provision of streaming services in Java RMI layers.
more information: Streaming Support for Java RMI in Distributed Environment (PPPJ 2006)
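
The aggregation mechanism described above can be pictured with the following language-neutral sketch, written in Python rather than Java for brevity. It assumes each replica server can return any chunk of the stream by index, and it uses a simple round-robin assignment of chunks to replicas. That schedule is an assumption for illustration only, since the abstract does not spell out the paper's aggregation algorithms, and none of the names below belong to the Java RMI API.

```python
def make_replica(data, chunk_size, log):
    """Build a stand-in for a replica server holding a full copy of the stream.

    log records which chunk indices this replica was asked for.
    """
    def fetch(index):
        log.append(index)
        return data[index * chunk_size:(index + 1) * chunk_size]
    return fetch


def aggregate(replicas, num_chunks):
    """Gather partial data from several replicas into one complete stream."""
    parts = []
    for index in range(num_chunks):
        fetch = replicas[index % len(replicas)]  # round-robin (assumed schedule)
        parts.append(fetch(index))
    return b"".join(parts)


stream = b"abcdefgh"
log_a, log_b = [], []
replicas = [make_replica(stream, 2, log_a), make_replica(stream, 2, log_b)]
result = aggregate(replicas, num_chunks=4)
print(result, log_a, log_b)
```

From the caller's point of view there is one invocation and one reassembled stream, while the load is split across the replicas. A real scheduler would presumably also weigh server bandwidth and latency, which this sketch ignores.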