The 13th Asia and South Pacific Design Automation Conference Technical Program

Session 4B Memory and Processor Optimization
Time: 10:15 - 12:20 Wednesday, January 23, 2008
Location: Room 310BC
Chairs: Jeonghun Cho (Kyungpook Nat'l Univ., Republic of Korea), Hiroyuki Tomiyama (Nagoya University, Japan)

4B-1 (Time: 10:15 - 10:40)

Title	Synthesis and Design of Parameter Extractors for Low-Power Pre-computation-Based Content-Addressable Memory Using Gate-Block Selection Algorithm
Author	*Jui-Yuan Hsieh, Shanq-Jang Ruan (National Taiwan University of Science and Technology, Taiwan)
Page	pp. 316 - 321
Keyword	CAM, low-power, pre-computation, gate-block selection algorithm, synthesis
Abstract	Content addressable memory (CAM) is frequently used in applications that require high-speed searches, such as lookup tables, databases, associative computing, and networking, because its parallel comparison reduces search time. Although parallel comparison results in fast search time, it also significantly increases power consumption. In this paper, we propose a gate-block selection algorithm, which can synthesize a proper parameter extractor of the pre-computation-based CAM (PB-CAM) to improve efficiency for specific applications such as embedded systems. Our experimental results show that our approach effectively reduces the number of comparison operations for specific data types (by 19.24% to 27.42%) compared with the 1's count approach. We used Synopsys Nanosim to estimate the power consumption in a TSMC 0.35 µm CMOS process. Compared to the 1's count PB-CAM, our proposed PB-CAM achieves a power reduction of 17.72% to 21.09% for specific data types.
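As a rough illustration of the 1's count baseline mentioned in the abstract above, the sketch below models a pre-computation-based CAM in software: each stored word carries a small parameter (here, its number of 1 bits), and a full-word comparison is performed only for entries whose parameter matches that of the search key. The class name `PBCAM`, the function `ones_count`, and the comparison counter are hypothetical names chosen for this example. It does not reproduce the paper's gate-block selection algorithm, which is not described in enough detail here.

```python
def ones_count(word: int) -> int:
    """Parameter extractor for the 1's count approach: number of 1 bits."""
    return bin(word).count("1")


class PBCAM:
    """Software model of a pre-computation-based CAM (illustrative only)."""

    def __init__(self, extractor=ones_count):
        # The extractor is pluggable; a different parameter extractor
        # could be swapped in here (assumption made for this sketch).
        self.extractor = extractor
        self.entries = []  # list of (parameter, word) pairs

    def write(self, word: int) -> None:
        # The parameter is pre-computed once, when the word is stored.
        self.entries.append((self.extractor(word), word))

    def search(self, key: int):
        """Return (indices of matching entries, number of full comparisons)."""
        key_param = self.extractor(key)
        matches = []
        full_compares = 0
        for index, (param, word) in enumerate(self.entries):
            if param != key_param:
                continue  # filtered out by the cheap parameter comparison
            full_compares += 1
            if word == key:
                matches.append(index)
        return matches, full_compares


cam = PBCAM()
for stored in (0b0000, 0b0001, 0b0011, 0b0101, 0b1111):
    cam.write(stored)

matches, full_compares = cam.search(0b0101)
print(matches, full_compares)
```

In this toy data set only two of the five stored words share the key's 1's count, so only two full-word comparisons are needed. A parameter extractor that spreads the stored data more evenly across parameter values would filter out more entries, which is the kind of saving the abstract reports.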

4B-2 (Time: 10:40 - 11:05)

Title	Block Cache for Embedded Systems
Author	*Dominic Hillenbrand, Jörg Henkel (University of Karlsruhe (TH), Germany)
Page	pp. 322 - 327
Keyword	cache, on chip memory, embedded systems, system on chip, memory bandwidth
Abstract	We present a new method, which we call block caches, that automatically uses on-chip memory for blocks of instructions that are dynamically scheduled at runtime, in order to increase performance and reduce power consumption. Block caches can already outperform instruction caches of the same size. We provide initial data and insights into the automated use of block caches and their respective online and offline phases.

4B-3 (Time: 11:05 - 11:30)

Title	A Compiler-in-the-Loop Framework to Explore Horizontally Partitioned Cache Architectures
Author	*Aviral Shrivastava (Arizona State University, United States), Ilya Issenin, Nikil Dutt (University of California, Irvine, United States)
Page	pp. 328 - 333
Keyword	embedded, compiler, processor, cache, energy
Abstract	Horizontally Partitioned Caches (HPCs) are a promising architectural feature to reduce the energy consumption of the memory subsystem. However, the energy reduction obtained using HPC architectures is very sensitive to the HPC parameters. Therefore it is very important to explore the HPC design space and carefully choose the HPC parameters that result in minimum energy consumption for the application. However, since the compiler has a significant impact on the energy consumption of the memory subsystem in HPC architectures, it is extremely important to include the compiler while deciding the HPC design parameters. While there have been no previous approaches to HPC design exploration, existing cache design space exploration methodologies do not include the compiler effects during DSE. In this paper, we present a Compiler-in-the-Loop (CIL) Design Space Exploration (DSE) methodology to explore and decide the HPC design parameters. Our experimental results on an HP iPAQ h4300-like memory subsystem running benchmarks from the MiBench suite demonstrate that CIL DSE can discover HPC configurations with up to 80% lower energy consumption than the HPC configuration in the iPAQ. In contrast, traditional simulation-only exploration can discover HPC design parameters that result in only 57% memory subsystem energy reduction. Finally, our hybrid CIL DSE heuristic saves 67% of the exploration time compared to exhaustive exploration, while providing the maximum possible energy savings on our set of benchmarks.

4B-4 (Time: 11:30 - 11:55)

Title	Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions
Author	*Ajay K. Verma, Philip Brisk, Paolo Ienne (EPFL, Switzerland)
Page	pp. 334 - 339
Keyword	Instruction Set Extension, Integer Linear Programming
Abstract	Nowadays many customised embedded processors offer the possibility of speeding up an application by implementing it using Application-Specific Functional Units (AFUs). However, the AFUs must satisfy certain constraints in terms of read and write ports between the AFU and the processor register file. Due to these restrictions, the size and complexity of AFUs remain small. Recently, however, some work has been done on relaxing the register file port constraints by serialising register file access (i.e., by allowing multi-cycle reads and writes). This makes the problem of selecting the best AFU significantly more complex. Most previous approaches use a two-stage process to solve this problem, i.e., first selecting AFUs under some higher I/O constraints and then serialising them under the actual register file port constraints. Not only are these methods complex, but they also lead to suboptimal solutions. In this paper we formulate the AFU selection problem as an Integer Linear Programming problem and solve it optimally. We show experimentally that our methodology produces significantly better results compared to state-of-the-art techniques.

4B-5 (Time: 11:55 - 12:20)

Title	Load Scheduling: Reducing Pressure on Distributed Register Files for Free
Author	*Mei Wen, Nan Wu, Maolin Guan, Chunyuan Zhang (National University of Defense Technology, China)
Page	pp. 340 - 345
Keyword	VLIW, distributed register files
Abstract	In this paper we describe load scheduling, a novel method that balances load among register files by using residual resources. Load scheduling can reduce register pressure for clustered VLIW processors with distributed register files without increasing the VLIW schedule length. We have implemented load scheduling in a compiler for the Imagine and FT64 stream processors. The results show that the proposed technique effectively reduces the number of variables spilled to memory, and can even eliminate spilling. The algorithm presented in this paper is extremely efficient for embedded processors with limited register resources because it improves register utilization rather than increasing the number of registers required.