Title | (Keynote Address) Next-Generation Design and EDA Challenges: Small Physics, Big Systems, and Tall Tool-chains |
Author | Rob A. Rutenbar (Carnegie Mellon Univ., United States) |
Keyword | |
Abstract | There is much discussion of two challenges in the design of tomorrow's
electronics: the difficult "small physics" of nanoscale transistors,
and the silicon/software complexity of "big systems". But those of us
who want to build beautiful algorithms have an additional hurdle: "tall
tool-chains". If it takes 50 tool-steps to build an
industrial-strength design flow, and each tool is based on 1-2 "big
algorithms", does this mean that each new algorithm idea is worth, at
best, 1-2% of the success of a design? This seems to me a bad way of
accounting for the tremendous value that EDA brings to the world of
design. How can we have a big impact in this important technology area?
In this talk, I will offer several pieces of advice for how not to get
buried by the tall-tool-chain problem. I will discuss how to identify
design problems that can have large impact, how to embrace the strange
physics of tomorrow's silicon technologies in the service of building
beautiful algorithms, and how to get fresh (and unique) insights on
problems by spending time working with a real design team. I will use
design examples ranging from lithography, to computational finance, to
silicon-based speech recognition, to illustrate the point that this is
an exciting time to be working on tomorrow's tool and design challenges. |
Title | Model Based Layout Pattern Dependent Metal Filling Algorithm for Improved Chip Surface Uniformity in the Copper Process |
Author | *Subarna Sinha, Jianfeng Luo, Charles Chiang (Synopsys, United States) |
Page | pp. 1 - 6 |
Keyword | Fill, CMP |
Abstract | Thickness range, i.e. the difference between the highest point and the
lowest point of the chip surface, is a key indicator of chip yield.
This paper presents a novel metal filling algorithm that seeks to
minimize the thickness range of the chip surface during the copper damascene process. The proposed solution considers the physical mechanisms in the damascene process that affect thickness range, namely ECP (the process used to deposit Cu in the trenches) and CMP (the process used to polish Cu after ECP). Key predictors of the final thickness range (the thickness range after ECP and CMP) that can be computed efficiently are identified and
used to drive the metal filling process.
To the best of our knowledge, this is the first metal filling algorithm that uses an ECP model, among other things, to guide metal filling. Experimental results are very promising and indicate that the proposed method can significantly reduce the thickness range after metal filling. This is in sharp contrast with density-driven approaches, which often increase the thickness range after metal filling and can thereby adversely impact yield. In addition, the proposed method inserts a significantly smaller amount of fill than density-driven approaches. This is desirable because it limits the impact of metal filling on timing. |
Title | Fast and Accurate OPC for Standard-Cell Layouts |
Author | *David M. Pawlowski, Liang Deng, Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States) |
Page | pp. 7 - 12 |
Keyword | OPC, cell-wise, boundary-based, RET |
Abstract | Model-based optical proximity correction (OPC) has become necessary at the 90nm technology node and beyond. Cellwise OPC is an attractive technique for reducing both the mask data size and the prohibitive runtime of full-chip OPC. As feature dimensions have shrunk, the radius of influence of edge features has extended further into neighboring cells. At the 65nm node and below, it is therefore no longer sufficient to perform cellwise OPC independently of neighboring cells, especially for the metal layers. The methodology described in this work accounts for features in neighboring cells and allows a cellwise approach to be applied to cells with a gate length of 45nm. We project that it can also be applied to future technology nodes. OPC-ready cells are generated before placement using boundary-based technology. Each cell has only a small number of OPC-ready versions, owing to an intelligent characterization of standard-cell layout features. The total number of cells with boundaries in the OPC-ready library increases only linearly with the number of cells in the original library. Results are very promising: the average edge placement error (EPE) for all metal1 features in 100 layouts is 0.731nm, which is less than 1% of the metal1 width (80nm). This gives lithographic accuracy similar to that of layout-specific full-chip model-based OPC while avoiding its inherent drawbacks. Even for small circuits, we obtained up to 100X runtime reduction and 35X mask data size reduction. |
Title | Coupling-aware Dummy Metal Insertion for Lithography |
Author | *Liang Deng (University of Illinois at Urbana-Champaign, United States), Kaiyuan Chao (Intel Co., United States), Hua Xiang (IBM T.J. Watson Research Center, United States), Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States) |
Page | pp. 13 - 18 |
Keyword | dummy metal, lithography, coupling capacitance, RET |
Abstract | As integrated circuit manufacturing technology advances into the 65nm and 45nm nodes, extensive resolution enhancement techniques (RETs) are needed to manufacture a chip design correctly. The widely used RET called off-axis illumination (OAI) introduces forbidden pitches, which lead to very complex design rules. It has been observed that imposing uniformity on layout designs can substantially improve printability under OAI. For metal layers, uniformity can be achieved simply by inserting dummy metal wire segments at all free spaces. Simulation results indeed show significant improvement in printability with such a dummy metal insertion approach. To minimize mask cost, it is advantageous to use dummy metal segments of the same size as regular metal wires, because of their simple geometry. However, these dummy wires are printable, so they increase coupling capacitance and can affect yield. The alternative is to replace a printable dummy wire segment with a set of parallel sub-resolution thin wires, which will not be printed. These invisible dummy metal segments do not increase coupling capacitance, but they increase lithography cost, which includes mask cost and RET/process expense. This paper presents a dummy metal insertion strategy that optimally trades off lithography cost against coupling capacitance. In particular, we present an optimal algorithm that minimizes lithography cost subject to any given coupling capacitance bound. Moreover, because of the locality of coupling capacitance, this dummy metal insertion achieves a highly uniform density, which automatically ameliorates chemical mechanical polishing (CMP) problems. |
Title | Fast Buffer Insertion for Yield Optimization under Process Variations |
Author | Ruiming Chen, *Hai Zhou (Northwestern University, United States) |
Page | pp. 19 - 24 |
Keyword | buffer insertion, yield optimization |
Abstract | With emerging process variations in fabrication, traditional corner-based timing optimization techniques become prohibitive. Buffer insertion is a very useful technique for timing optimization. In this paper, we propose a buffer insertion algorithm that considers process variations. The solutions from deterministic buffering, which sets all random variables to their nominal values, are used to guide the statistical buffering algorithm.
Our algorithm keeps the solution lists small and always achieves higher yield than deterministic buffering.
The experimental results show that existing approaches cannot handle large cases efficiently or effectively. Our algorithm handles large cases very efficiently and improves the yield by more than 12% on average. |
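The abstract builds on deterministic buffering with all random variables at nominal values. A widely used deterministic method is van Ginneken's dynamic program, but the abstract does not say which variant the paper uses, so treat the following as an illustrative sketch only. It assumes a single two-pin net cut into equal wire segments, Elmore delay, one buffer type, and made-up unit parameters (all hypothetical). It shows how lists of (load capacitance, required time) candidates are propagated toward the driver and pruned, which is the kind of solution list the abstract says is kept small:

```python
from typing import List, Tuple

# A candidate is (downstream load capacitance, required arrival time, buffer count).
Candidate = Tuple[float, float, int]


def prune(cands: List[Candidate]) -> List[Candidate]:
    """Keep only non-dominated candidates: as load grows, required time must grow."""
    cands = sorted(cands, key=lambda c: (c[0], -c[1]))
    kept: List[Candidate] = []
    for c in cands:
        if not kept or c[1] > kept[-1][1]:
            kept.append(c)
    return kept


def van_ginneken(n_segments: int, r_seg: float, c_seg: float,
                 c_sink: float, rat_sink: float, r_drv: float,
                 r_buf: float, c_buf: float, d_buf: float) -> Tuple[float, int]:
    """Deterministic (nominal-value) buffering of a two-pin wire split into
    n_segments equal segments. A buffer may be placed at each segment's upstream end.
    All parameters are hypothetical illustration values, not taken from the paper.
    Returns (best required time at the source, number of buffers used)."""
    cands: List[Candidate] = [(c_sink, rat_sink, 0)]
    for _ in range(n_segments):
        # Walk one segment toward the driver: Elmore delay of a pi-model wire.
        cands = [(c + c_seg, q - r_seg * (c_seg / 2 + c), n) for c, q, n in cands]
        # Option: insert a buffer here, driving the best downstream candidate.
        best = max(cands, key=lambda x: x[1] - r_buf * x[0])
        buffered = (c_buf, best[1] - d_buf - r_buf * best[0], best[2] + 1)
        cands = prune(cands + [buffered])
    # Finally, charge each candidate's load through the driver resistance.
    return max(((q - r_drv * c, n) for c, q, n in cands), key=lambda x: x[0])


print(van_ginneken(10, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 2.0))
```

With these unit values, the unbuffered wire has a required time of -71 at the source. Allowing buffers with a delay of 2 improves on that.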
Title | A Global Minimum Clock Distribution Network Augmentation Algorithm for Guaranteed Clock Skew Yield |
Author | *Bao Liu, Andrew Kahng, Xu Xu (University of California, San Diego, United States), Jiang Hu, Ganesh Venkataraman (Texas A&M University, United States) |
Page | pp. 25 - 31 |
Keyword | clock, distribution, robust, design |
Abstract | Nanometer VLSI systems demand robust clock distribution network design because of increased process and operating-condition variability. In this paper, we propose minimum clock distribution network augmentation for guaranteed skew yield. We present theoretical analysis of a link inserted in a clock network: such a link reduces local skew and skew variation, but in general it may not guarantee a reduction in global skew and skew variation. We propose a global minimum clock network augmentation algorithm. It inserts links simultaneously between all nearest sink pairs, applies rule-based link removal, and performs link consolidation by Steiner minimum tree construction to reduce wirelength with guaranteed clock skew yield. Our experimental results show that the proposed algorithm achieves dominant clock network augmentation solutions compared with the previous best clock network link insertion methods, with identical wirelength. On average, it gives 16% clock skew yield improvement, 9% maximum skew reduction, and 25% reduction in the standard deviation of clock skew variation. |
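Clock skew yield is the probability that skew stays within a bound under variation, and it can be estimated by Monte Carlo sampling. The following is a minimal sketch under a loudly simplifying assumption: each sink delay gets an independent Gaussian variation. In real clock networks, sink delays are correlated through shared tree paths, and the links the paper inserts couple them further, so this only illustrates the metric, not the paper's analysis. The nominal delays are hypothetical.

```python
import random
from typing import Sequence


def skew_yield(nominal_delays: Sequence[float], sigma: float, skew_bound: float,
               trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(max sink delay - min sink delay <= skew_bound).

    Assumption (not from the paper): each sink delay is its nominal value plus an
    independent Gaussian variation with standard deviation `sigma`."""
    rng = random.Random(seed)
    passing = 0
    for _ in range(trials):
        delays = [d + rng.gauss(0.0, sigma) for d in nominal_delays]
        if max(delays) - min(delays) <= skew_bound:
            passing += 1
    return passing / trials


# Less delay variation, which link insertion aims for, means higher skew yield.
sinks = [100.0] * 8  # hypothetical nominal sink delays, in ps
print(skew_yield(sinks, sigma=5.0, skew_bound=10.0))
print(skew_yield(sinks, sigma=2.0, skew_bound=10.0))
```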
Title | Control-Flow Aware Communication and Conflict Analysis of Parallel Processes |
Author | Axel Siebenborn, *Alexander Viehl, Oliver Bringmann (FZI Forschungszentrum Informatik, Germany), Wolfgang Rosenstiel (Universität Tübingen, Germany) |
Page | pp. 32 - 37 |
Keyword | Architectural Exploration, Performance Analysis, Environment Modeling, Bus Allocation, SystemC |
Abstract | In this paper, we present an approach for control-flow aware communication and conflict analysis of systems of parallel communicating processes. This approach makes it possible to determine the global timing behavior of such a system and to detect communication that might produce conflicts on shared communication resources.
Furthermore, we show how temporal environment models can be incorporated to analyze their influence on the system behavior. Based on the detected conflicts, we propose an automated allocation and binding approach for shared resources that resolves potential access conflicts. All analysis steps can be performed starting from a TLM SystemC model of the entire system, without any need for user interaction.
Finally, a SystemC model of a Viterbi decoder is used as a case study to demonstrate the capability of our approach. |
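Once timing analysis has bounded when each communication can occupy a shared resource, conflict detection reduces to checking for overlapping time windows per resource. This is a minimal sketch. It assumes each transfer is already annotated with a hypothetical half-open [earliest start, latest end) window. The paper derives such timing from control-flow-aware analysis of the SystemC model, and that analysis is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical record: (transfer name, shared resource, earliest start, latest end).
# The end is exclusive.
Transfer = Tuple[str, str, int, int]


def find_conflicts(transfers: List[Transfer]) -> List[Tuple[str, str, str]]:
    """Report pairs of transfers whose time windows overlap on the same resource."""
    by_resource: Dict[str, List[Transfer]] = defaultdict(list)
    for t in transfers:
        by_resource[t[1]].append(t)
    conflicts = []
    for resource, items in by_resource.items():
        for (n1, _, s1, e1), (n2, _, s2, e2) in combinations(items, 2):
            if s1 < e2 and s2 < e1:  # half-open intervals intersect
                conflicts.append((resource, n1, n2))
    return conflicts


transfers = [
    ("a", "bus0", 0, 10),
    ("b", "bus0", 5, 12),
    ("c", "bus0", 12, 15),
    ("d", "bus1", 0, 20),
]
print(find_conflicts(transfers))  # [('bus0', 'a', 'b')]
```

A conflicting pair such as `a` and `b` is what an allocation and binding step would then separate, for example by mapping the two transfers to different buses.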
Title | Software Performance Estimation in MPSoC Design |
Author | Marcio Oyamada, *Flavio Wagner (UFRGS, Brazil), Marius Bonaciu (TIMA Lab., France), Wander Cesario (MnD, France), Ahmed Jerraya (TIMA Lab., France) |
Page | pp. 38 - 43 |
Keyword | Performance Estimation, MPSoC |
Abstract | Estimation tools are a key component of system-level methodologies, enabling fast design space exploration. Estimating software performance is essential in current software-dominated embedded systems. This work proposes an integrated methodology for system design and performance analysis. An analytic approach based on neural networks is used for high-level software performance estimation. At the functional level, this analytic tool enables fast evaluation of the performance obtainable with selected processors, which is essential for defining a “golden” architecture. From this architectural definition, a tool that refines hardware and software interfaces produces a bus-functional model. A virtual prototype is then generated from the bus-functional model. It provides a global, cycle-accurate simulation model and offers several features for design validation and detailed performance analysis. Our work thus combines an analytic approach at the functional level with a simulation-based approach at the bus-functional level, which provides an adequate trade-off between estimation time and precision. A multiprocessor platform implementing an MPEG4 encoder is used as a case study. Compared with the virtual platform simulation, the analytic estimation shows errors of at most 17%. It also takes only 17 seconds, against 10 minutes for the cycle-accurate simulation model. |
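As a quick check of the time/precision trade-off, the reported run times imply that the analytic estimate is roughly 35 times faster than cycle-accurate simulation, at the cost of errors of up to 17%:

$$\frac{10\ \text{min}}{17\ \text{s}} = \frac{600\ \text{s}}{17\ \text{s}} \approx 35$$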
Title | Effective OpenMP Implementation and Translation for Multiprocessor System-On-Chip without using OS |
Author | *Woo-Chul Jeun, Soonhoi Ha (Seoul National University, Republic of Korea) |
Page | pp. 44 - 49 |
Keyword | OpenMP, MPSoC, parallel programming, shared memory, synchronization |
Abstract | Using OpenMP as the parallel programming model on a Multiprocessor System-on-Chip (MPSoC) is attractive for two reasons: parallel programs are easy to write in OpenMP, and there is no standard method for parallel programming on an MPSoC. In this paper, we propose an effective implementation and translation of the major OpenMP directives for an MPSoC with physically shared memories, hardware semaphores, and no operating system. |
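OpenMP parallel regions end with an implicit barrier. Without an OS, such a barrier has to be built from the platform's own synchronization primitives. The sketch below is hedged and minimal: Python's `threading.Semaphore` stands in for the hardware semaphores, and this is a classic textbook barrier, not the paper's translation scheme. `SemaphoreBarrier` and `run_parallel_region` are hypothetical names.

```python
import threading
from typing import List


class SemaphoreBarrier:
    """One-shot barrier for n threads built only from counting semaphores.

    `mutex` protects the arrival counter. `turnstile` stays closed until the
    last thread arrives, which then opens it once for every participant."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.count = 0
        self.mutex = threading.Semaphore(1)
        self.turnstile = threading.Semaphore(0)

    def wait(self) -> None:
        self.mutex.acquire()
        self.count += 1
        if self.count == self.n:
            for _ in range(self.n):
                self.turnstile.release()
        self.mutex.release()
        self.turnstile.acquire()


def run_parallel_region(num_threads: int) -> List[str]:
    """Mimic an OpenMP parallel region: work, then the implicit end-of-region barrier."""
    log: List[str] = []
    log_lock = threading.Lock()
    barrier = SemaphoreBarrier(num_threads)

    def worker(tid: int) -> None:
        with log_lock:
            log.append(f"before-{tid}")
        barrier.wait()
        with log_lock:
            log.append(f"after-{tid}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


print(run_parallel_region(4))
```

Every `before-*` entry precedes every `after-*` entry, whatever the thread interleaving.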
Title | Creating Explicit Communication in SoC Models Using Interactive Re-Coding |
Author | *Pramod Chandraiah, Junyu Peng, Rainer Doemer (University of California, Irvine, United States) |
Page | pp. 50 - 55 |
Keyword | System Level Design, SoC Specification, Refinement, Modeling, Design Methodology |
Abstract | Communication exploration has become a critical step in SoC design. Researchers in the CAD community have proposed fast and efficient techniques for comprehensive design space exploration to expedite this critical design step. These advances have significantly reduced design time, but the overall design time of the system is still a bottleneck. All these techniques assume the availability of an initial SoC input model with explicit communication, and the quality of that model significantly affects the effectiveness of the communication exploration techniques. Today, these initial models must be written manually by engineers, which is tedious, error-prone, and time-consuming. In fact, our studies on industrial-size examples have shown that about 50% of the communication exploration time is spent on coding and re-coding the initial specification model.
In this paper, we propose an efficient interactive approach to creating explicit communication by automating some of the common coding tasks in specification models for communication exploration. Our results show significant savings in designer time. |
Title | A New Boundary Element Method for Multiple-Frequency Parameter Extraction of Lossy Substrates |
Author | Xiren Wang, *Wenjian Yu, Zeyi Wang (Tsinghua University, China) |
Page | pp. 62 - 67 |
Keyword | substrate extraction, frequency-dependent parameter, boundary element method, multiple frequency |
Abstract | Coupling through realistic lossy substrates can be modeled with frequency-dependent coupling parameters. Fast extraction at multiple frequencies can be accomplished in two sequential steps. The first step extracts the coupling resistance using a direct boundary element method (DBEM). The second step converts this resistance into the parameter at a given frequency, exactly and rapidly. The first step is time-consuming but runs only once. The second step is repeated at each frequency but is much cheaper. The more frequencies that must be computed, the greater the advantage of this method. Numerical experiments show that the method is highly accurate and can be hundreds of times faster than an advanced Green's function based method. Substrates with arbitrary doping profiles can also be handled easily, which is partly verified by experiment. |
Title | Hierarchical Optimization Methodology for Wideband Low Noise Amplifiers |
Author | Arthur Nieuwoudt, Tamer Ragheb, *Yehia Massoud (Rice University, United States) |
Page | pp. 68 - 73 |
Keyword | Low Noise Amplifiers, Wideband, Optimization, Synthesis |
Abstract | In this paper, we present a systematic synthesis methodology for fully integrated wideband low noise amplifiers that simultaneously optimizes impedance matching, noise figure, and other performance parameters. Leveraging an accurate analytical model, we hierarchically couple global optimization techniques with local convex optimization methods to efficiently locate optimal wideband LNA circuits. The results indicate that the methodology yields significant improvement in key LNA design constraints over existing methodologies while achieving up to one order of magnitude speedup in computational performance. |
Title | PLLSim - An Ultra Fast Bang-bang Phase Locked Loop Simulation Tool |
Author | *Michael James Chan, Adam Postula (University of Queensland, Australia), Yong Ding (NanoSilicon Pty Ltd, Australia) |
Page | pp. 74 - 79 |
Keyword | PLL, Bang-bang PLL, Behavioral Simulation, Jitter |
Abstract | This paper presents a simulation tool targeted specifically at bang-bang phase locked loop systems. The aim of this simulator is to predict important PLL transient characteristics, such as capture range, locking time, and jitter, quickly and accurately. We present a behavioral model for bang-bang PLLs and show how applying this model in a simulator can reduce simulation time by four to five orders of magnitude. With this performance, Monte Carlo simulation techniques become not only feasible but convenient. The simulator also models the major non-idealities typical of phase locked loop systems. The accuracy of the simulator is confirmed through detailed analysis and comparison with MATLAB Simulink based models. |
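The abstract does not spell out the behavioral model. As a hedged illustration of the general idea, the following minimal sketch updates the loop once per reference cycle instead of simulating every waveform transition. It assumes a divide-by-1 loop, an ideal binary phase detector, and a VCO frequency set directly by a proportional path `kp` and an integral path `ki`. All parameter values are hypothetical, and the non-idealities the paper models are omitted.

```python
import random
from typing import List


def simulate_bbpll(f_ref: float, f_init: float, kp: float, ki: float,
                   steps: int, jitter_sigma: float = 0.0, seed: int = 0) -> List[float]:
    """Reference-cycle-by-cycle behavioral model of a second-order bang-bang PLL.

    Assumptions (illustrative only): divide-by-1 feedback, ideal binary phase
    detector, VCO frequency = f_init + integral path + proportional kick.
    Returns the VCO frequency after each reference cycle."""
    rng = random.Random(seed)
    phase_err = 0.0   # reference phase minus VCO phase, in cycles
    integral = 0.0    # accumulated integral-path frequency correction
    freq = f_init
    history: List[float] = []
    for _ in range(steps):
        # Over one reference period the VCO advances freq / f_ref cycles.
        # Optional Gaussian input jitter is added to the phase error.
        phase_err += (f_ref - freq) / f_ref + rng.gauss(0.0, jitter_sigma)
        # The bang-bang detector reports only the sign of the phase error.
        sign = 1.0 if phase_err > 0 else -1.0
        integral += ki * sign
        freq = f_init + integral + kp * sign
        history.append(freq)
    return history


trace = simulate_bbpll(f_ref=100.0, f_init=90.0, kp=1.0, ki=0.1, steps=2000)
tail = trace[-200:]
print(sum(tail) / len(tail))  # settles near f_ref, dithering by about +/- kp
```

Because each step is a handful of arithmetic operations, thousands of reference cycles run almost instantly, which is what makes Monte Carlo sweeps over parameters or jitter cheap.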
Title | Ultralow-Power Reconfigurable Computing with Complementary Nano-Electromechanical Carbon Nanotube Switches |
Author | Swarup Bhunia, *Massood Tabib Azar, Daniel Saab (Case Western Reserve University, United States) |
Page | pp. 86 - 91 |
Keyword | Reconfigurable, Low-power, Carbon nanotube |
Abstract | In recent years, several alternative devices have been proposed to deal with the inherent scalability limitations of conventional CMOS devices at nanometer-scale geometries. However, the fabrication and integration costs of these devices have been prohibitive, and/or the devices do not allow a smooth transition from the conventional design paradigm. To address some of these limitations, we have developed a new family of devices called “Complementary Nano Electro-Mechanical Switches” (CNEMS), which use carbon nanotubes as active switching/latching elements. The basic structure of these devices consists of three co-planar carbon nanotubes. They are arranged so that the central nanotube can touch the two side nanotubes when a voltage pulse is applied between them. Owing to the unique properties of carbon nanotubes, these devices have very low leakage current and low operating voltages. They also have built-in energy storage that reduces computation power, resulting in very low overall power dissipation. CNEMS have stable on/off states and a latching mechanism for non-volatile memory-mode operation. In addition, the devices can be readily integrated on the same substrate as CMOS transistors at high integration densities, which allows easy manufacturability and hybridization with conventional CMOS devices. In this paper, we present the properties of these devices and, based on our analysis, propose a reconfigurable computation framework that uses them. For the first time, we demonstrate that these devices are promising for dynamically reconfigurable instant-on system development, with about 25X lower power dissipation. |
Title | 22-29GHz Ultra-Wideband CMOS Pulse Generator for Collision Avoidance Short Range Vehicular Radar Sensors |
Author | *Ahmet Oncu, B.B.M. Wasanthamala Badalawa, Tong Wang, Minoru Fujishima (The University of Tokyo, Japan) |
Page | pp. 94 - 95 |
Keyword | UWB CMOS pulse generator, 22-29GHz pseudo-millimeter-wave ultra-wideband (UWB), short-range automotive radar |
Abstract | The 22-29GHz pseudo-millimeter-wave ultra-wideband (UWB) band is attractive for short-range automotive radar systems aimed at road safety and intelligent transportation. CMOS is suitable for short-range radar because processing units can be implemented on the same chip as the UWB front-end building blocks. However, it is difficult to operate CMOS pulse generators at such high frequencies. To realize the pseudo-millimeter-wave band using CMOS, we propose a new pulse generator consisting of a series of delay cells and edge combiners with waveform shaping. Measurements of a chip in 90nm CMOS technology show that 1Gbps/bit pulses are successfully generated with a power consumption of 1.4mW at a supply voltage of 0.9V. This result will be a key technology for a one-chip short-range radar system. |
Title | A 2.8-V Multibit Complex Bandpass Delta-Sigma AD Modulator in 0.18µm CMOS |
Author | *Hao San, Yoshitaka Jingu, Hiroki Wada, Hiroyuki Hagiwara, Akira Hayakawa, Haruo Kobayashi (Gunma University, Japan), Masao Hotta (Musashi Institute of Technology, Japan) |
Page | pp. 96 - 97 |
Keyword | Complex Bandpass Delta-Sigma AD Modulator, Complex Filter, Multi-bit Modulator, DWA Algorithm |
Abstract | A second-order multibit switched-capacitor (SC) complex bandpass Delta-Sigma AD modulator has been designed, fabricated, and tested for low-IF receivers in wireless communication systems. We employ two new algorithms to improve the signal-to-noise-and-distortion ratio (SNDR) of the modulator. (i) A complex bandpass filter with I, Q dynamic matching reduces the influence of mismatch between the I and Q paths. As a by-product, the complex modulator can be divided into two separate parts with no signal lines crossing between the upper and lower paths, which greatly simplifies the layout design of the modulator. (ii) A new complex bandpass Data-Weighted Averaging (DWA) algorithm suppresses the nonlinearity effects of multibit DACs in complex form to achieve high accuracy. Implemented in a 0.18-µm CMOS process with a 2.8V supply, the modulator achieves a measured peak SNDR of 64.5dB at 20MS/s with a signal bandwidth of 78kHz. It dissipates 28.4mW and occupies an area of 1.82mm². |
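For reference, here is the conventional real-valued (lowpass) data-weighted averaging that the complex bandpass DWA above builds on. It selects unit DAC elements cyclically with a rotating pointer. The following is a minimal sketch of standard DWA only. The paper's complex bandpass variant is not reproduced, and the 4-element DAC and input codes are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def dwa_select(codes: Sequence[int], n_elements: int) -> List[List[int]]:
    """Conventional (lowpass, real-valued) data-weighted averaging.

    For each input code k, turn on k unit DAC elements starting at a rotating
    pointer, then advance the pointer by k. Cycling through the elements this way
    uses them all equally often over time, which first-order noise-shapes the
    static element-mismatch error. This is NOT the paper's complex bandpass DWA."""
    pointer = 0
    selections: List[List[int]] = []
    for k in codes:
        if not 0 <= k <= n_elements:
            raise ValueError(f"code {k} outside 0..{n_elements}")
        selections.append([(pointer + i) % n_elements for i in range(k)])
        pointer = (pointer + k) % n_elements
    return selections


sel = dwa_select([3, 2, 4, 3], n_elements=4)  # hypothetical 4-element DAC codes
print(sel)  # [[0, 1, 2], [3, 0], [1, 2, 3, 0], [1, 2, 3]]
print(Counter(e for s in sel for e in s))  # each element used 3 times
```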
Title | A Wideband CMOS LC-VCO Using Variable Inductor |
Author | *Kazuma Ohashi, Yusaku Ito, Yoshiaki Yoshihara, Kenichi Okada, Kazuya Masu (Tokyo Institute of Technology, Japan) |
Page | pp. 98 - 99 |
Keyword | Wideband, LC-VCO, Variable Inductor, MEMS, 0.18um CMOS |
Abstract | This paper proposes a novel wide-range tunable CMOS voltage-controlled oscillator (VCO). The VCO uses an on-chip variable inductor and switched capacitors as variable elements. It was fabricated in a standard 0.18 um CMOS process with five metal layers. The oscillation frequency can be tuned from 1.28 GHz to 2.75 GHz, a tuning range of 72%. |
Title | Design of Active Substrate Noise Canceller using Power Supply di/dt Detector |
Author | *Taisuke Kazama, Toru Nakura (The University of Tokyo, Japan), Makoto Ikeda, Kunihiro Asada (VLSI Design and Education Center, The University of Tokyo, Japan) |
Page | pp. 100 - 101 |
Keyword | substrate noise, di/dt, on-chip noise canceller |
Abstract | With the growing demand for mixed-signal designs that integrate A/D converters, D/A converters, and PLLs with large-scale digital circuits, substrate noise has become a serious concern. However, remedies based on guard rings and decoupling capacitors are not effective enough against high-frequency noise because of their parasitic components. To suppress the impact of substrate noise, an on-chip active noise cancelling technique using a di/dt detector has been proposed. This paper introduces an example design of a feedforward active substrate noise cancelling technique that uses multiple power supply di/dt detectors. It demonstrates the noise cancelling results through measurements of a 0.35 µm CMOS test chip. |
Title | A 20 Gbps Scalable Load Balanced Birkhoff-von Neumann Symmetric TDM Switch IC with SERDES Interfaces |
Author | *Yu-Hao Hsu, Min-Sheng Kao, Hou-Cheng Tzeng, Ching-Te Chiu, Jen-Ming Wu (Inst. of Communications Engineering, NTHU, Taiwan), Shuo-Hung Hsu (Inst. of Electronics Engineering, NTHU, Taiwan) |
Page | pp. 102 - 103 |
Keyword | TDM switch IC , SERDES, 8B10B CODEC, CML, half-rate |
Abstract | For the first time, we implemented a reconfigurable load-balanced TDM switch IC with SERDES interface circuits for high-speed networking applications. An NxN TDM switch can be constructed recursively from the TDM switch IC to achieve switching capacities of hundreds of gigabits per second or higher. The TDM switch IC contains a digital 8x8 TDM switch core with 8B10B CODECs and analog SERDES I/O interfaces. In the I/O interfaces, eight 2.56/3.2Gbps dual-mode 16/20:1 SERDES with CML buffers were developed. We used 16/20:1 serializers and deserializers instead of 8/10:1 ones to halve the required operating frequency in the switch core. New half-rate architectures and all-static CMOS gates were used in the 16/20:1 serializer and deserializer for low power consumption. A wide-band CML I/O buffer with our patented PMOS active load scheme was developed. All implementations were based on 0.18 µm CMOS technology. Our implementation showed a 20 Gbps switching capacity for the 8×8 TDM switch IC. |
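Load-balanced Birkhoff-von Neumann switches are commonly built from TDM stages that cycle through a fixed set of permutations, so no per-packet scheduling is needed. The following is a minimal sketch of one periodic connection pattern. It assumes the symmetric pattern "input i connects to output (t - i) mod N in slot t", under which input i talks to output j exactly when input j talks to output i. The abstract does not state the exact schedule the IC uses, so this pattern is an assumption for illustration.

```python
from typing import List


def symmetric_tdm_connection(n_ports: int, slot: int) -> List[int]:
    """Output port connected to each input port during `slot`, under the assumed
    symmetric schedule input i -> output (slot - i) mod n_ports."""
    return [(slot - i) % n_ports for i in range(n_ports)]


def is_permutation(pattern: List[int]) -> bool:
    """True if every output port is used exactly once in this slot."""
    return sorted(pattern) == list(range(len(pattern)))


# Over N consecutive slots, every input is connected to every output exactly once.
N = 8
schedule = [symmetric_tdm_connection(N, t) for t in range(N)]
print(schedule[3])  # [3, 2, 1, 0, 7, 6, 5, 4]
```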
Title | Reconfigurable CMOS Low Noise Amplifier Using Variable Bias Circuit for Self Compensation |
Author | *Satoshi Fukuda, Daisuke Kawazoe, Kenichi Okada, Kazuya Masu (Tokyo Institute of Technology, Japan) |
Page | pp. 104 - 105 |
Keyword | self-compensation, reconfigurable, LNA, variable bias circuit |
Abstract | This paper proposes a reconfigurable low noise amplifier (LNA) that realizes self-compensation of performance.
Power consumption and intermodulation are compensated through the bias voltage of the input transistor.
By tuning the bias voltage according to the input signal, the proposed LNA achieves more than 33 dBm improvement in delta-IM3 and an 87% power reduction at 1.9 GHz, compared with an LNA with a fixed bias voltage. |
Title | Improving Execution Speed of FPGA using Dynamically Reconfigurable Technique |
Author | Roel Pantonial, Md. Ashfaquzzaman Khan (Graduate School of Engineering, Tohoku University, Japan), *Naoto Miyamoto (New Industry Creation Hatchery Center, Tohoku University, Japan), Koji Kotani, Shigetoshi Sugawa (Graduate School of Engineering, Tohoku University, Japan), Tadahiro Ohmi (New Industry Creation Hatchery Center, Tohoku University, Japan) |
Page | pp. 108 - 109 |
Keyword | dynamic, reconfigurable, FPGA, temporal, interconnect |
Abstract | This paper reports the architecture and performance of Flexible Processor III (FP3), a novel multi-context dynamically reconfigurable FPGA (DRFPGA) designed and fabricated in 0.35um 2P3M CMOS technology. FP3 employs a newly developed shift-register-type temporal communication module to reduce the critical path delay. Our experimental results show, for the first time, that there exist cases where the fastest speed is achieved when multiple contexts are in use. |
Title | Single-Issue 1500MIPS Embedded DSP with Ultra Compact Codes |
Author | *Li-Chun Lin, Shih-Hao Ou (National Chiao Tung University, Taiwan), Tay-Jyi Lin (Industrial Technology Research Institute, Taiwan), Siang-Sen Deng, Chih-Wei Liu (National Chiao Tung University, Taiwan) |
Page | pp. 110 - 111 |
Keyword | DSP |
Abstract | The performance of single-issue RISC cores can be improved significantly by multi-issue architectures (i.e., superscalar or VLIW) that activate parallel functional units concurrently. However, these architectures suffer from high complexity or huge code sizes. In this paper, we borrow some ideas from old vector machines and propose a novel DSP architecture with very compact code. In our simulations, the DSP performs comparably to a 5-issue VLIW core with identical computing resources, while its code size is reduced by a factor of 8. The DSP core has been implemented in TSMC 0.13um CMOS technology. It operates at 305MHz and occupies a silicon area of 1.45×1.4 mm², including 12KB of on-chip memory. |
Title | A Highly Integrated 8 mW H.264/AVC Main Profile Real-time CIF Video Decoder on a 16 MHz SoC Platform |
Author | Huan-Kai Peng, Chun-Hsin Lee, Jian-Wen Chen, Tzu-Jen Lo, Yung-Hung Chang, Sheng-Tsung Hsu, Yuan-Chun Lin, Ping Chao, *Wei-Cheng Hung, Kai-Yuan Jan (National Tsing Hua University, Taiwan) |
Page | pp. 112 - 113 |
Keyword | H.264, AVC, CABAD, SoC |
Abstract | We present a hardwired decoder prototype for H.264/AVC main profile video. Our design takes a compressed H.264/AVC bit-stream as input and produces video frames ready for display as output. We wrap the decoder core with an AMBA-AHB bus interface and integrate it into a multimedia SoC platform. We propose several architectural innovations at both the IP and system levels to achieve very high performance at a very low operating frequency. Running at 16 MHz on an FPGA, the whole demo system can decode CIF (352x288) video in real time at 30 frames per second. Moreover, we take system cost into consideration, so only a single external SDRAM is needed and memory traffic is minimized. |
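To put the 16 MHz figure in perspective, here is a back-of-the-envelope cycle budget. It assumes standard 16×16 H.264 macroblocks and ignores pipelining and memory stalls. Under those assumptions, the decoder has on average only about 1,350 clock cycles per macroblock.

```python
def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of mb_size x mb_size macroblocks in a frame (dimensions assumed divisible)."""
    return (width // mb_size) * (height // mb_size)


def cycles_per_macroblock(width: int, height: int, fps: float, clock_hz: float) -> float:
    """Average clock cycles available per macroblock for real-time decoding."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


print(macroblocks_per_frame(352, 288))                   # 396
print(round(cycles_per_macroblock(352, 288, 30, 16e6)))  # 1347
```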
Title | Implementation of a Standby-Power-Free CAM Based on Complementary Ferroelectric-Capacitor Logic |
Author | *Shoun Matsunaga, Takahiro Hanyu (Tohoku University, Japan), Hiromitsu Kimura, Takashi Nakamura, Hidemi Takasu (ROHM, Japan) |
Page | pp. 116 - 117 |
Keyword | complementary ferroelectric-capacitor logic, content-addressable memory, standby-power-free |
Abstract | A complementary ferroelectric-capacitor (CFC) logic-circuit style is proposed for a compact and standby-power-free content-addressable memory (CAM).
Using the CFC logic circuit in a CAM cell makes it possible to merge both logic and non-volatile storage elements into serially connected ferroelectric capacitors, so the CAM becomes compact. The standby power of the CAM is completely eliminated because the supply voltage can be cut off while the stored data are retained in the CAM.
A test chip is fabricated in a 0.35-µm ferroelectric CMOS process, and its basic behavior has also been measured. |
Title | A Multi-Drop Transmission-Line Interconnect in Si LSI |
Author | *Junki Seita, Hiroyuki Ito, Kenichi Okada, Takashi Sato, Kazuya Masu (Tokyo Institute of Technology, Japan) |
Page | pp. 118 - 119 |
Keyword | transmission line, branch |
Abstract | This paper proposes a branching method for on-chip transmission-line interconnects that can reduce the delay and power of global interconnects. A 6-mm-long transmission-line interconnect with a branch is fabricated in a standard 0.18um Si CMOS process, and measurements demonstrate 4Gbps signal transmission. |
Title | A 0.35um CMOS 1,632-gate-count Zero-Overhead Dynamic Optically Reconfigurable Gate Array VLSI |
Author | *Minoru Watanabe, Fuminori Kobayashi (Kyushu Institute of Technology, Japan) |
Page | pp. 124 - 125 |
Keyword | FPGAs, PLDs, optical reconfiguration |
Abstract | A Zero-Overhead Dynamic Optically Reconfigurable Gate Array VLSI (ZO-DORGA-VLSI) has been developed.
It is based on the concept of using the junction capacitance of photodiodes
and the load capacitance of the gates that make up the gate array as configuration memory, which removes the need for a static memory function to store a context.
In this paper, we present the performance of a 1,632-gate ZO-DORGA-VLSI, fabricated in a 0.35 µm CMOS process on a 4.9 mm square chip.
In addition, we present the design of a ZO-DORGA-VLSI with over 10,000 gates. |
Title | Low-Power High-Speed 180-nm CMOS Clock Drivers |
Author | *Tadayoshi Enomoto, Suguru Nagayama, Nobuaki Kobayashi (Chuo University, Japan) |
Page | pp. 126 - 127 |
Keyword | power dissipation, delay time, dynamic current, short-circuit current, CMOS |
Abstract | The power dissipation (PT) and delay time (tdT) of a CMOS clock driver were minimized. Eight test circuits, each of which has two two-stage clock drivers and a register array, were fabricated in a 0.18-µm CMOS technology. The first and second stages of the driver consisted of a single inverter and m inverters, respectively, and the register array stage was constructed with N delay flip-flops (D-FFs). Each inverter in the second stage drove N/m D-FFs, where N was fixed at 40 and m was varied from 1 to 40. The minimum PT and tdT were 251 µW and 0.640 ns, respectively, and both were obtained at m = 8. These values were 48.6% and 29.4% of the maximum PT and tdT, respectively. The measured results agreed well with SPICE simulation results. |
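As a quick arithmetic check of the figures above, the sketch below lists the per-inverter fanout N/m and back-computes the maximum PT and tdT implied by the reported percentages. One assumption is ours and is not stated in the abstract. We take the eight test circuits to use the eight values of m that divide N = 40 evenly, so that N/m is an integer.

```python
# Assumption (ours, not stated in the abstract): the eight test circuits use
# the eight values of m that divide N evenly, so each second-stage inverter
# drives an integer number N/m of D-FFs.
N = 40

def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [m for m in range(1, n + 1) if n % m == 0]

ms = divisors(N)
fanout = {m: N // m for m in ms}  # D-FFs driven by one second-stage inverter

# Reported minima and their ratios to the maxima (from the abstract).
pt_min_uw, pt_ratio = 251.0, 0.486
td_min_ns, td_ratio = 0.640, 0.294

# Implied maxima (approximate, since the reported percentages are rounded).
pt_max_uw = pt_min_uw / pt_ratio
td_max_ns = td_min_ns / td_ratio

if __name__ == "__main__":
    print("m values:", ms)
    print("D-FFs per second-stage inverter:", fanout)
    print(f"implied max PT  ~ {pt_max_uw:.0f} uW")
    print(f"implied max tdT ~ {td_max_ns:.2f} ns")
```

At the reported optimum m = 8, each second-stage inverter drives five D-FFs.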
Title | Fast Analytic Placement using Minimum Cost Flow |
Author | *Ameya R Agnihotri, Patrick H Madden (SUNY Binghamton, United States) |
Page | pp. 128 - 134 |
Keyword | Placement, Physical Design, Analytic placement |
Abstract | Many current integrated circuit designs, such as those released
for the ISPD2005 placement contest, are extremely large
and can contain a great deal of white space.
These new placement problems
are challenging; analytic placers perform well, but can
suffer from high run times.
In this paper, we present a new placement tool called Vaastu.
Our approach combines continuous and discrete optimization techniques.
We utilize network flows, which incorporate
the more realistic half-perimeter wire length objective, to
facilitate module spreading in conjunction with
a log-sum-exponential function based analytic approach.
Our approach obtains wire length results that are competitive with the
best known results, but with much lower run times. |
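A log-sum-exponential (LSE) function gives a smooth, differentiable stand-in for half-perimeter wirelength (HPWL), and that smoothness is what lets a gradient-based analytic placer optimize it. The sketch below implements the standard one-axis LSE approximation. It is a generic illustration rather than Vaastu's code, and the name `gamma` for the smoothing parameter is our own.

```python
import math

def hpwl_1d(xs):
    """Half-perimeter wirelength of one net along one axis: max - min."""
    return max(xs) - min(xs)

def lse_wl_1d(xs, gamma):
    """Log-sum-exp smooth approximation of max(xs) - min(xs).

    gamma * log(sum(exp(x / gamma))) over-approximates max(xs), and
    -gamma * log(sum(exp(-x / gamma))) under-approximates min(xs), so the
    difference over-estimates HPWL and converges to it as gamma -> 0.
    """
    hi, lo = max(xs), min(xs)
    # Shift by the max/min before exponentiating to avoid overflow.
    smooth_max = hi + gamma * math.log(sum(math.exp((x - hi) / gamma) for x in xs))
    smooth_min = lo - gamma * math.log(sum(math.exp((lo - x) / gamma) for x in xs))
    return smooth_max - smooth_min
```

For a net, the smooth wirelength is `lse_wl_1d(xs, gamma) + lse_wl_1d(ys, gamma)`. A smaller `gamma` tracks HPWL more closely but yields steeper gradients.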
Title | FastPlace 3.0: A Fast Multilevel Quadratic Placement Algorithm with Placement Congestion Control |
Author | *Natarajan Viswanathan, Min Pan, Chris Chu (Iowa State University, United States) |
Page | pp. 135 - 140 |
Keyword | Quadratic Placement, Iterative Local Refinement, Multilevel Placement |
Abstract | In this paper, we present FastPlace 3.0 - an efficient and scalable multilevel quadratic placement algorithm for large-scale mixed-size designs. The main contributions of our work are:
(1) A multilevel global placement framework, obtained by incorporating a two-level clustering scheme within the flat analytical placer FastPlace.
(2) An efficient and improved Iterative Local Refinement technique that can handle placement blockages and placement congestion constraints.
(3) A congestion aware standard-cell legalization technique in the presence of blockages.
On the ISPD-2005 placement benchmarks, our algorithm is 5.12X, 11.52X and 16.92X faster than mPL6, Capo10.2 and APlace2.0, respectively. In terms of wirelength, our results are on average 2% higher than those of mPL6, and 9% and 3% better than those of Capo10.2 and APlace2.0, respectively. We also achieve competitive results compared with a number of academic placers on the congestion-constrained ISPD-2006 placement benchmarks. |
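FastPlace builds on quadratic placement. The wirelength objective is a sum of weighted squared two-pin distances, and the optimum satisfies a sparse linear system: each movable cell sits at the weighted average of its neighbors. The sketch below shows only that core step in one dimension, solved with Gauss-Seidel iteration on a toy netlist. It does not model clustering, Iterative Local Refinement or legalization, and the function and cell names are hypothetical.

```python
def quadratic_place_1d(movable, fixed, edges, iters=500, tol=1e-9):
    """Minimize sum of w * (x_i - x_j)**2 over two-pin connections (1-D).

    movable: list of movable cell names
    fixed:   dict name -> fixed coordinate (pads / I/Os)
    edges:   list of (a, b, weight)
    Each movable cell must have at least one connection.
    """
    x = {c: 0.0 for c in movable}
    x.update(fixed)
    nbrs = {c: [] for c in movable}
    for a, b, w in edges:
        if a in nbrs:
            nbrs[a].append((b, w))
        if b in nbrs:
            nbrs[b].append((a, w))
    for _ in range(iters):
        delta = 0.0
        for c in movable:
            # Optimality condition: x_c = sum(w * x_n) / sum(w).
            total_w = sum(w for _, w in nbrs[c])
            new = sum(w * x[n] for n, w in nbrs[c]) / total_w
            delta = max(delta, abs(new - x[c]))
            x[c] = new
        if delta < tol:
            break
    return {c: x[c] for c in movable}
```

For a chain pad(0) to a, a to b, and b to pad(3), all with unit weights, the cells settle at 1 and 2. In a real placer, these evenly spread but still overlapping solutions are what the spreading and refinement steps then correct.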
Title | Hippocrates: First-Do-No-Harm Detailed Placement |
Author | Haoxing Ren (IBM, United States), *David Pan (University of Texas at Austin, United States), Charles J Alpert, Gi-Joon Nam, Paul Villarrubia (IBM, United States) |
Page | pp. 141 - 146 |
Keyword | placement, timing, detailed placement |
Abstract | Physical synthesis optimizations and engineering change orders
typically change the locations of cells, resize cells or add more
cells to the design after global placement. Unfortunately, those
changes usually lead to wirelength increases; thus another pass of
optimizations to further improve wirelength, timing and routing
congestion characteristics is required. Simple wirelength-driven
detailed placement techniques could be useful in this scenario.
While such techniques can help to reduce wirelength, those without
careful consideration of timing constraints might degrade the timing
characteristics (worst negative slack, total negative slack, etc.)
and/or introduce more electrical violations (exceeding maximum
output load constraints and maximum input slew constraints). In this
paper, we propose a new detailed placement paradigm that uses a set
of pin-based timing and electrical constraints in detailed placement
to prevent it from degrading timing or violating electrical
constraints while reducing wirelength; hence the name Hippocrates:
FIRST-DO-NO-HARM optimizations. Our experimental results are
promising. By honoring these constraints, our detailed placement
technique not only reduces total wirelength (TWL), but also
significantly improves timing, achieving 37% better total negative
slack (TNS). |
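One way to read "pin-based timing and electrical constraints" is as a per-pin acceptance test wrapped around an ordinary wirelength-driven move. The guard below is a hypothetical sketch of that idea. The rules are our assumptions and are not the constraints defined in the paper: a pin may lose slack but never drop below the smaller of its original slack and zero, and load and slew must stay within their limits or at least not worsen an existing violation. All names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PinState:
    """Hypothetical per-pin view used by the guard below."""
    slack: float     # timing slack in ns (negative means a violation)
    load: float      # driven load, checked against max_load
    slew: float      # input slew, checked against max_slew
    max_load: float
    max_slew: float

def _limit_ok(before: float, after: float, limit: float) -> bool:
    # Within the limit, or an existing violation that does not get worse.
    return after <= max(limit, before)

def pin_ok(before: PinState, after: PinState) -> bool:
    # Assumed timing rule: slack may shrink, but not below min(original, 0).
    if after.slack < min(before.slack, 0.0):
        return False
    return (_limit_ok(before.load, after.load, after.max_load)
            and _limit_ok(before.slew, after.slew, after.max_slew))

def accept_move(wl_before, wl_after, pins_before, pins_after) -> bool:
    """Accept a detailed-placement move only if it shortens wirelength and
    every affected pin passes the 'do no harm' check."""
    if wl_after >= wl_before:
        return False
    return all(pin_ok(pins_before[p], pins_after[p]) for p in pins_before)
```

Unlike a plain wirelength-driven placer, this guard rejects a move that shortens wires if it pushes any pin from positive to negative slack.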
Title | ECO-system: Embracing the Change in Placement |
Author | *Jarrod Roy, Igor Markov (University of Michigan, United States) |
Page | pp. 147 - 152 |
Keyword | Placement, ECO, Physical Synthesis |
Abstract | In a realistic design flow, circuit and system optimizations must interact with physical aspects of the design. For example, improvements in timing and power may require replacing large modules with variants that have different power/delay trade-offs, shapes and connectivity. New logic may be added late in the design flow, subject to interconnect optimization. To support such flexibility in design flows, we develop a robust system for performing Engineering Change Orders (ECOs). In contrast with existing stand-alone tools, which offer poor interfaces to the design flow and cannot handle the full range of modern VLSI layouts, our ECO-system reliably handles fixed objects and movable macros in instances with widely varying amounts of whitespace. It detects the geometric regions and sections of the netlist that require modification and applies an adequate amount of change in each case. Given a reasonable initial placement, it applies minimal changes, but it is capable of re-placing large regions to handle pathological cases. ECO-system can be used at stages ranging from high-level synthesis to physical synthesis and detailed placement. |
Title | Bisection Based Placement for the X Architecture |
Author | *Satoshi Ono (SUNY Binghamton CSD, United States), Sameer Tilak (Supercomputer Center, United States), Patrick H. Madden (SUNY Binghamton CSD, United States) |
Page | pp. 153 - 158 |
Keyword | placement, x architecture |
Abstract | Rising interconnect delay and power consumption have motivated the
investigation of alternative integrated circuit routing
architectures. In particular, the X Architecture, which features
preferred routing in diagonal directions, has gained a measure of
industry support, and has even been validated at 65nm.
While there has been extensive study of Manhattan design
methods, there are markedly fewer published results for non-Manhattan
design. To help fill this gap,
we study a patented placement method for the X
Architecture; to our knowledge, there have been no prior published
results for the method. Surprisingly, we find that the patented method
in fact performs worse than traditional Manhattan methods -- for both
Manhattan and X routing metrics. We also present a theoretical
formulation that explains why solution quality is degraded.
Many groups in industry are evaluating the merits of non-Manhattan
routing architectures. By providing concrete experimental results, we
hope to improve the accuracy of these evaluations. |
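The Manhattan and X routing metrics compared above differ in how a two-pin connection is measured. The sketch below uses standard geometry and is not the paper's evaluation code. It computes the Manhattan length and the octilinear length made possible by the X Architecture's 45-degree wires. The shorter displacement is covered diagonally, at a length of sqrt(2) times that displacement, and the remainder is routed straight.

```python
import math

def manhattan_length(dx, dy):
    """Shortest 2-pin route using only horizontal and vertical wires."""
    return abs(dx) + abs(dy)

def x_arch_length(dx, dy):
    """Shortest 2-pin route when 45-degree diagonal wires are also allowed.

    A diagonal of length sqrt(2) * short plus a straight run of
    (long - short) gives long + (sqrt(2) - 1) * short.
    """
    a, b = abs(dx), abs(dy)
    long_side, short_side = max(a, b), min(a, b)
    return long_side + (math.sqrt(2) - 1) * short_side
```

For a displacement of (3, 4), the Manhattan length is 7 and the X length is about 5.24. The largest relative saving, roughly 29%, occurs for purely diagonal connections.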
Title | Slack-based Bus Arbitration Scheme for Soft Real-time Constrained Embedded Systems |
Author | *Minje Jun, Kwanhu Bang (Yonsei University, Republic of Korea), Hyuk-Jun Lee (Cisco Systems Incorporated, United States), Naehyuck Chang (Seoul National University, Republic of Korea), Eui-Young Chung (Yonsei University, Republic of Korea) |
Page | pp. 159 - 164 |
Keyword | latency, arbiter, QoS, bus, slack |
Abstract | We present a bus arbitration scheme for soft real-time constrained embedded systems. Some masters in such systems must complete their work within given timing constraints so that the system-level timing constraints are satisfied. The computation time of each master is predictable, but its data transfer time is not easy to predict, since the communication architecture is usually shared by several masters. Previous works addressed this issue by minimizing the latencies of a few latency-critical masters, but as a side effect these methods can increase the latencies of other masters, which may then violate their timing constraints. Unlike previous works, our method uses the concept of “slack” to keep each latency as close as possible to its given constraint, thereby reducing this side effect. The proposed arbitration scheme consists of a bandwidth-conscious arbiter and a scheduler. The arbiter can be any existing bandwidth-conscious arbiter, and the scheduler implements the latency awareness proposed in this paper. The scheduler takes part in arbitration only when it observes a request whose slack is insufficient for the given timing constraint. The experimental results show that our method outperforms the conventional round-robin arbiter by more than 100% in the best case in terms of the longest violated cycles. |
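The abstract leaves the scheduler's exact rule to the paper. The sketch below illustrates the general idea under our own assumptions. Each request carries an absolute deadline and an estimated transfer time, and slack is the number of cycles to spare if the request were granted now. A simple round-robin arbiter stands in for "any existing bandwidth-conscious arbiter", and the scheduler overrides it only when some request's slack falls to a threshold. All class names and the threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    master: int           # index in [0, num_masters)
    deadline: int         # absolute cycle by which the transfer must finish
    transfer_cycles: int  # estimated bus cycles the transfer needs

def slack(req: Request, now: int) -> int:
    """Cycles to spare if the request were granted right now."""
    return req.deadline - now - req.transfer_cycles

class SlackAwareArbiter:
    """Round-robin by default; a slack check overrides it only for requests
    that are about to miss their constraint (one request per master)."""

    def __init__(self, num_masters: int, slack_threshold: int = 0):
        self.num_masters = num_masters
        self.slack_threshold = slack_threshold
        self.last = num_masters - 1  # so master 0 comes first in RR order

    def grant(self, pending, now):
        if not pending:
            return None
        chosen = None
        urgent = [r for r in pending if slack(r, now) <= self.slack_threshold]
        if urgent:
            # Scheduler intervenes: serve the tightest request first.
            chosen = min(urgent, key=lambda r: slack(r, now))
        else:
            # Default arbiter: next requesting master after the last grant.
            by_master = {r.master: r for r in pending}
            for step in range(1, self.num_masters + 1):
                m = (self.last + step) % self.num_masters
                if m in by_master:
                    chosen = by_master[m]
                    break
        if chosen is None:
            raise ValueError("request from a master outside [0, num_masters)")
        self.last = chosen.master
        return chosen
```

Because the override fires only near a deadline, masters with ample slack keep the fairness of the underlying arbiter.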
Title | A Precise Bandwidth Control Arbitration Algorithm for Hard Real-Time SoC Buses |
Author | *Bu-Ching Lin, Geeng-Wei Lee, Juinn-Dar Huang, Jing-Yang Jou (National Chiao Tung University, Taiwan) |
Page | pp. 165 - 170 |
Keyword | bandwidth allocation, real-time systems, system buses, system-on-chip, arbitration algorithm |
Abstract | On an SoC bus, contention occurs when different IP cores request bus access at the same time, so an arbiter is required to resolve contention on a shared bus system. Depending on the application, IPs may have real-time and/or bandwidth requirements, and it is very difficult to design an arbitration algorithm that meets both requirements simultaneously. In this paper, we propose an innovative arbitration algorithm, RB_lottery, that meets both requirements: it provides not only a hard real-time guarantee but also precise bandwidth controllability. The experimental results show that RB_lottery outperforms several well-known existing arbitration algorithms. |
Title | Communication Architecture Synthesis of Cascaded Bus Matrix |
Author | *Junhee Yoo, Dongwook Lee (Seoul National University, Republic of Korea), Sungjoo Yoo (Samsung Electronics, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea) |
Page | pp. 171 - 177 |
Keyword | bus matrix, AXI, communication architecture synthesis |
Abstract | For high frequency on-chip communication architecture design, we propose cascaded bus matrix-based solutions. Due to the huge design space in cascaded bus matrix design, it is crucial to perform an efficient design space exploration. In our work, we present a simulated annealing-based design space exploration. For an efficient representation of bus topology, we propose an encoding method called traffic group encoding and apply it to AMBA3 AXI-based bus system design. |
Title | Application Specific Network-on-Chip Design with Guaranteed Quality Approximation Algorithms |
Author | Krishnan Srinivasan, *Karam S. Chatha, Goran Konjevod (Arizona State University, United States) |
Page | pp. 184 - 190 |
Keyword | Network-on-Chip, Irregular topology, Approximation algorithms |
Abstract | Network-on-Chip (NoC) architectures with optimized topologies have
been shown to be superior to regular architectures (such as mesh)
for application specific multi-processor System-on-Chip (MPSoC)
devices. The application specific NoC design problem takes as
input the system-level floorplan of the computation architecture,
characterized library of NoC components, and the communication
performance requirements. The objective is to generate an
optimized NoC topology, and routes for the communication traces on
the architecture such that the performance requirements are
satisfied and power consumption is minimized. The paper discusses
a two-stage automated approach consisting of i) core to router
mapping, and ii) topology and route generation for design of
custom NoC architectures. In particular it presents an optimal
technique for core to router mapping (stage i), and a factor 2
approximation algorithm for custom topology generation (stage ii).
The superior quality of the techniques is established by
experimentation with benchmark applications, and comparisons with
an optimal integer linear programming (ILP) based technique. |
Title | Thermal-driven Symmetry Constraint for Analog Layout with CBL Representation |
Author | *Jiayi Liu, Sheqin Dong, Yunchun Ma, Di Long, Xianlong Hong (EDA lab, DCST, Tsinghua University, China) |
Page | pp. 191 - 196 |
Keyword | analog layout, symmetry, thermal-driven, CBL representation |
Abstract | Thermal constraints are very important for analog devices in SOI technology. Hot spots can cause errors or even failures in analog devices, and temperature gradients can cause mismatch between symmetrical devices. To handle these problems, this paper introduces an accurate thermal model into the placement process. Based on geometric symmetry, which is achieved with the CBL representation for the first time, the thermal model helps to find a thermally optimal placement. Experimental results show that this method is promising. |
Title | A Graph Reduction Approach to Symbolic Circuit Analysis |
Author | *Guoyong Shi, Weiwei Chen (Shanghai Jiao Tong University, China), C.-J. Richard Shi (University of Washington, United States) |
Page | pp. 197 - 202 |
Keyword | graph, symbolic, BDD, simulator |
Abstract | A new graph reduction approach to symbolic circuit analysis is developed in this paper.
A Binary Decision Diagram (BDD) mechanism is formulated, together with a specially
designed graph reduction process and a recursive sign determination algorithm.
This combination of techniques is used to develop a core analysis engine of a symbolic analog
circuit simulator that has the potential for analyzing large analog circuits
in the frequency domain. Partial experimental results are reported. |
Title | Robust Analog Circuit Sizing Using Ellipsoid Method and Affine Arithmetic |
Author | Xuexin Liu, *Wai-Shing Luk, Yu Song, Xuan Zeng (ASIC & System State-Key Lab, Fudan University, China) |
Page | pp. 203 - 208 |
Keyword | ellipsoid method, affine arithmetic, geometric programming, robust design |
Abstract | Analog circuit sizing under process/parameter
variations is formulated as a mini-max geometric programming
problem. To tackle such a problem, we present a new method
that combines the ellipsoid method and affine arithmetic. Affine
arithmetic is used not only to keep track of variations and
correlations, but also to determine the sub-gradient at each
iteration of the ellipsoid method. An example of designing a
CMOS op-amp is given to demonstrate the effectiveness of our
method. Finally, the numerical results are verified by SPICE simulation. |
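Affine arithmetic can track correlated variations, and the sketch below shows why by implementing a minimal affine form with only linear operations. It is a textbook-style illustration, not the paper's implementation. It omits the multiplication rule, which needs an extra noise term, and the sub-gradient computation used inside the ellipsoid method.

```python
from itertools import count

_noise_ids = count()  # fresh noise-symbol identifiers

class Affine:
    """Affine form x0 + sum_i xi * eps_i, with each eps_i in [-1, 1].

    Shared noise symbols record correlation between quantities, which plain
    interval arithmetic loses. Only linear operations are implemented.
    """

    def __init__(self, center, terms=None):
        self.center = float(center)
        self.terms = dict(terms or {})

    @classmethod
    def from_range(cls, lo, hi):
        """New independent variable spanning [lo, hi] (fresh noise symbol)."""
        return cls((lo + hi) / 2, {next(_noise_ids): (hi - lo) / 2})

    def __add__(self, other):
        if not isinstance(other, Affine):  # adding a plain constant
            return Affine(self.center + other, self.terms)
        terms = dict(self.terms)
        for k, v in other.terms.items():
            terms[k] = terms.get(k, 0.0) + v  # same symbol => terms combine
        return Affine(self.center + other.center, terms)

    def __neg__(self):
        return Affine(-self.center, {k: -v for k, v in self.terms.items()})

    def __sub__(self, other):
        return self + (-other)

    def __mul__(self, scalar):
        if isinstance(scalar, Affine):
            raise TypeError("affine * affine is not implemented in this sketch")
        return Affine(self.center * scalar,
                      {k: v * scalar for k, v in self.terms.items()})

    __rmul__ = __mul__

    def range(self):
        """Enclosing interval: center +/- sum of |coefficients|."""
        radius = sum(abs(v) for v in self.terms.values())
        return (self.center - radius, self.center + radius)
```

With `vth = Affine.from_range(0.4, 0.6)`, `(vth - vth).range()` is exactly `(0.0, 0.0)`, whereas interval arithmetic would report [-0.2, 0.2]. That correlation tracking is why the method suits mini-max sizing under variations.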
Title | WCOMP: Waveform Comparison Tool for Mixed-signal Validation Regression in Memory Design |
Author | *Peng Zhang, Wai-Shing Luk, Yu Song, Jiarong Tong, Pushan Tang, Xuan Zeng (Fudan University, China) |
Page | pp. 209 - 214 |
Keyword | Mixed-signal validation, Waveform comparison, Validation automation |
Abstract | The increasing effort required for full-chip validation drives up design cost and time-to-market. A waveform comparison tool named WCOMP is presented to automate mixed-signal validation regression in memory design. Unlike digital waveform comparison tools, WCOMP compares mixed-signal waveforms for a functional match instead of a graphical match, which meets the requirements of full-chip validation regression. Simulations with different regression runs, process parameters, voltages and temperatures can be functionally compared. The methods have proved effective in Intel Flash memory design. |
Title | Structured Placement with Topological Regularity Evaluation |
Author | *Shigetoshi Nakatake (University of Kitakyushu, Japan) |
Page | pp. 215 - 220 |
Keyword | placement, floorplan, sequence-pair, regular structure, analog layout |
Abstract | This paper introduces a new concept of floorplanning,
called structured placement.
Regularity is the key criterion that allows placement
to progress beyond constraint-driven approaches.
We propose a linear-time extraction of
topological regularity, such as arrays and rows, from a sequence-pair.
In addition, we provide a new simulated annealing (SA) framework,
called dual SA, which optimizes regularity as an objective function,
balancing the size of regular structures against area efficiency. |
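Rows and arrays are read off a sequence-pair through the standard pairwise relations. Block a is left of b when a precedes b in both sequences, and a is below b when a follows b in the first sequence but precedes it in the second. The sketch below decodes those relations and tests whether a group of blocks forms a simple row or column. It is a quadratic-time illustration with hypothetical helper names, not the paper's linear-time extraction.

```python
from itertools import combinations

def relation(gp, gn, a, b):
    """Position of block a relative to block b in the sequence-pair (gp, gn)."""
    if a == b:
        raise ValueError("a and b must be distinct blocks")
    pa, pb = gp.index(a), gp.index(b)
    na, nb = gn.index(a), gn.index(b)
    if pa < pb and na < nb:
        return "left"
    if pa > pb and na > nb:
        return "right"
    if pa > pb and na < nb:
        return "below"
    return "above"

def is_row(gp, gn, group):
    """True if every pair in the group is horizontally related."""
    return all(relation(gp, gn, a, b) in ("left", "right")
               for a, b in combinations(group, 2))

def is_column(gp, gn, group):
    """True if every pair in the group is vertically related."""
    return all(relation(gp, gn, a, b) in ("above", "below")
               for a, b in combinations(group, 2))
```

For gp = (a, b, c, d) and gn = (d, a, b, c), blocks a, b and c form a row, and d lies below a.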
Title | (Panel Discussion) Design for Manufacturability |
Author | Organizer: Keh-Jeng Chang, Moderator: Keh-Jeng Chang (National Tsing-Hua Univ., Taiwan), Panelists: Kelvin Doong (TSMC, Taiwan), Nishath Verghese (Clear Shape, United States), Ke-Cheng Chu (Global Unichip, Taiwan), Ting-Chi Wang (National Tsing-Hua Univ., Taiwan), Andrew Kahng (Univ. of California, San Diego and Blaze DFM, United States) |
Title | A Novel Performance-Driven Topology Design Algorithm |
Author | *Min Pan, Chris Chu (Iowa State University, United States), Priyadarsan Patra (Intel Corporation, United States) |
Page | pp. 244 - 249 |
Keyword | Interconnect, Performance, Topology |
Abstract | This paper presents a very efficient algorithm for performance-driven topology design for interconnects. Given a net, it first generates an A-tree topology using table lookup and net breaking. A performance-driven post-processing heuristic, which is not restricted to A-tree topologies, then improves the obtained topology by considering the sink positions, required times and load capacitances to achieve better timing. Experimental results show that our new technique produces better topologies in terms of timing and is hundreds of times faster than the traditional approach. |
Title | FastRoute 2.0: A High-quality and Efficient Global Router |
Author | *Min Pan, Chris Chu (Iowa State University, United States) |
Page | pp. 250 - 255 |
Keyword | Global routing, Steiner trees, congestion |
Abstract | Because of the increasing dominance of interconnect issues in advanced IC technology, it is desirable to incorporate global routing into early design stages to get accurate interconnect information. Hence, high-quality and fast global routers are in great demand. In this work, we propose a high-quality and efficient global router, FastRoute 2.0. It can achieve more than an order of magnitude less overflow and very fast runtime compared to three state-of-the-art academic global routers. The promising results make it possible to integrate global routing into early design stages. This could dramatically improve the design solution quality. |
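Overflow is the usual measure of global-routing quality in comparisons like these. The abstract does not define it, so we assume the common definition. Each edge of the global-routing grid has a track capacity, and overflow counts the routing demand in excess of that capacity. The sketch computes the total and the maximum overflow from hypothetical per-edge demand and capacity maps.

```python
def total_overflow(demand, capacity):
    """Sum over routing-grid edges of demand in excess of capacity."""
    return sum(max(0, demand[e] - capacity[e]) for e in demand)

def max_overflow(demand, capacity):
    """Largest excess demand on any single edge (0 if none)."""
    return max((max(0, demand[e] - capacity[e]) for e in demand), default=0)
```

A routing solution with zero total overflow fits within the available tracks. Reducing overflow by an order of magnitude therefore means far fewer congested edges that must be fixed later.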
Title | DpRouter: A Fast and Accurate Dynamic-Pattern-Based Global Routing Algorithm |
Author | *Zhen Cao, Tong Jing (Tsinghua University, China), Jinjun Xiong, Yu Hu, Lei He (University of California, Los Angeles, United States), Xianlong Hong (Tsinghua University, China) |
Page | pp. 256 - 261 |
Keyword | Routing, Routability, Physical Design, Congestion |
Abstract | This paper presents a fast and accurate global routing algorithm, DpRouter, based on two efficient techniques: (1) dynamic pattern routing (Dpr) and (2) segment movement. These two techniques enable DpRouter to explore a large solution space and achieve better routability with low time complexity. Compared with state-of-the-art routers, our experimental results consistently show better routing quality in terms of both congestion and wire length, together with a more than 30x runtime speedup. We envision that this algorithm can be further leveraged in other routing applications, such as FPGA routing. |
Title | A Theoretical Study on Wire Length Estimation Algorithms for Placement with Opaque Blocks |
Author | *Tan Yan, Shuting Li, Yasuhiro Takashima, Hiroshi Murata (The University of Kitakyushu, Japan) |
Page | pp. 268 - 273 |
Keyword | Wire length estimation, Block placement, Routing obstacle, Shortest path |
Abstract | How to estimate the shortest routing length when
certain blocks are considered routing obstacles is becoming an
essential problem for block placement, because HPWL is no longer
valid in this case. Although this problem is well studied in computational
geometry [6], the results are neither well known
in the CAD community nor presented in a way that CAD researchers
can easily use. With the help of some
recent notions in block placement, this paper interprets the research
results in [1,8], which to our knowledge give the best algorithm for this problem,
in a way that is more concise and friendlier to CAD
researchers. We also tailor the algorithm to VLSI CAD
applications. As a result, we present a method that estimates the
shortest obstacle-avoiding routing length in O(M^2+N) time for
a placement with M blocks and N 2-pin nets. |
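A small grid makes the gap between HPWL and the true obstacle-avoiding length easy to see. The sketch below computes the exact shortest rectilinear path around blocked cells by breadth-first search. It is a brute-force reference for checking an estimator on toy cases, not the O(M^2+N) method described in the abstract, and the grid model is our own simplification.

```python
from collections import deque

def hpwl(p, q):
    """Half-perimeter (Manhattan) length of a 2-pin net, ignoring obstacles."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def obstacle_avoiding_length(width, height, blocked, src, dst):
    """Shortest rectilinear path length on a unit grid avoiding blocked cells.

    Breadth-first search from src; returns None if dst is unreachable.
    """
    if src in blocked or dst in blocked:
        return None
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x, y = queue.popleft()
        if (x, y) == dst:
            return dist[(x, y)]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < width and 0 <= ny < height
                    and (nx, ny) not in blocked and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return None
```

On a 7x5 grid with an opaque block covering cells (3, 0) through (3, 3), a net from (0, 0) to (6, 0) has an HPWL of 6 but a shortest legal route of 14.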
Title | LEAF: A System Level Leakage-Aware Floorplanner for SoCs |
Author | *Aseem Gupta, Nikil Dutt, Fadi Kurdahi (University of California, Irvine, United States), Kamal Khouri, Magdy Abadir (Freescale Semiconductor Inc., United States) |
Page | pp. 274 - 279 |
Keyword | Leakage Power, Floorplanner, Temperature, System Level |
Abstract | Process scaling and higher leakage power have resulted in increased power densities and elevated die temperatures. Due to the interdependence of temperature and leakage power, we observe that the floorplan has an impact on both the temperatures and the leakage of the IP-blocks in a system on chip (SoC). Hence, in this paper we propose a novel system level Leakage Aware Floorplanner (LEAF) which optimizes floorplans for temperature-aware leakage power along with the traditional metrics of area and wire length. Our floorplanner takes a SoC netlist and the dynamic power profile of functional blocks to determine a placement while optimizing for temperature dependent leakage power, area, and wire length. To demonstrate the effectiveness of LEAF, we implemented our methodology on ten industrial SoC designs from Freescale Semiconductor Inc. and evaluated the trade-off between leakage power and area. We observed up to 190% difference in leakage power between leakage-unaware and leakage aware floorplanning. |
Title | Protocol Transducer Synthesis using Divide and Conquer Approach |
Author | *Shota Watanabe, Kenshu Seto, Yuji Ishikawa, Satoshi Komatsu, Masahiro Fujita (University of Tokyo, Japan) |
Page | pp. 280 - 285 |
Keyword | protocol, transducer, interface, NoC, wrapper |
Abstract | In IP-based design, designers try to reuse existing IPs as much as possible.
Since currently available IPs use various communication protocols, protocol conversion is one of the most important topics in IP-based design.
We propose a method for automatic protocol transducer synthesis that is applicable to complex protocols.
The main idea of our method is to synthesize protocol transducers with a divide-and-conquer approach.
We demonstrate our method by synthesizing transducers that translate among real, complex protocols with advanced features such as non-blocking and out-of-order transactions. |
Title | A Processor Generation Method from Instruction Behavior Description Based on Specification of Pipeline Stages and Functional Units |
Author | *Takeshi Shiro, Masaaki Abe, Keishi Sakanushi, Yoshinori Takeuchi, Masaharu Imai (Graduate School of Information Science and Technology, Osaka University, Japan) |
Page | pp. 286 - 291 |
Keyword | ASIP (Application Specific Instruction-set Processor), Design Space Exploration, Architectural Description Language (ADL), Behavior Description, Micro Operation Description |
Abstract | This paper proposes a method of generating a pipelined processor from a behavior description.
In the proposed method, a micro-operation description is generated by complementing the behavior description
with a specification of the pipeline stages and functional units.
From the micro-operation description, a synthesizable HDL description of the processor can be generated.
Compared with the conventional method, the proposed method drastically reduces both
the code size of the architectural description language (ADL) description and the design time,
without degrading design quality. |
Title | Power and Memory Bandwidth Reduction of an H.264/AVC HDTV Decoder LSI with Elastic Pipeline Architecture |
Author | *Kentaro Kawakami, Mitsuhiko Kuroda, Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan) |
Page | pp. 292 - 297 |
Keyword | H.264, Decoder, Low power, Elastic pipeline, Dynamic Voltage Scaling |
Abstract | We propose an elastic pipeline that can apply dynamic voltage scaling (DVS) to hardwired logic circuits. The proposed pipeline can also reduce the required local bus bandwidth. To demonstrate its feasibility, a hardwired H.264/AVC HDTV decoder is designed as a real-time application. Compared with the conventional clock-gating scheme in a 90-nm process technology, the proposed architecture reduces power to 56%, or reduces local bus bandwidth to 37.2%. |
Title | Architectural Optimizations for Text to Speech Synthesis in Embedded Systems |
Author | *Soumyajit Dey, Monu Kedia, Anupam Basu (Indian Institute of Technology Kharagpur, India) |
Page | pp. 298 - 303 |
Keyword | Text to Speech Synthesis (TTS), Natural Language Processing (NLP), Instruction Set Simulation (ISS), Throughput, Co-simulation |
Abstract | The increasing processing power of embedded devices has made it possible for certain applications that previously could be executed only in desktop environments to migrate to handheld platforms. An important feature of modern computing systems is their support for applications that interact with the user by synthesizing natural speech output. Such applications deliver state-of-the-art performance in desktop environments. However, the real-time performance of such applications on handheld platforms with online incoming text streams has not been explored to date. In this work, the performance of a Text to Speech Synthesis application is evaluated on embedded processor architectures, and modifications to the underlying hardware platform are proposed to improve the real-time performance of the application. |
Title | Deeper Bound in BMC by Combining Constant Propagation and Abstraction |
Author | Roy Armoni (-, Israel), Limor Fix (Intel, United States), Ranan Fraer (Intel, Israel), *Tamir Heyman (Carnegie Mellon University, United States), Moshe Vardi (Rice University, United States), Yakir Vizel, Yael Zbar (Intel, Israel) |
Page | pp. 304 - 309 |
Keyword | proof-based, abstraction , BMC, constant propagation |
Abstract | The most successful technologies for automatic verification of large industrial circuits are bounded model checking, abstraction, and iterative refinement. Previous work has demonstrated the ability to verify circuits with thousands of state elements, achieving bounds of at most a couple of hundred. In this paper we present several novel techniques for abstraction-based bounded model checking. Specifically, we introduce a constant-propagation technique to simplify the formulas submitted to the CNF SAT solver, we present a new proof-based iterative abstraction technique for bounded model checking, and we show how the two techniques can be combined.
The experimental results demonstrate our ability to handle circuits with several thousand state elements, reaching bounds of nearly 1,000. |
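Constant propagation shrinks each unrolled time frame before it reaches the SAT solver. Once some signals are known to be constant (for example, latches still at their initial values in early frames), the gate outputs they force can be fixed and dropped from the CNF. The sketch below shows the basic forward pass over a topologically ordered gate list. The gate encoding is hypothetical, and the sketch does not model the paper's combination with proof-based abstraction.

```python
def propagate_constants(gates, constants):
    """Forward constant propagation over a combinational gate list.

    gates:     list of (output, op, inputs) in topological order,
               with op in {"AND", "OR", "NOT", "XOR"}
    constants: dict signal -> 0/1 for signals known to be constant
    Returns a dict of every signal whose value is now determined.
    """
    known = dict(constants)
    for out, op, ins in gates:
        vals = [known.get(i) for i in ins]  # None means "unknown"
        val = None
        if op == "AND":
            if 0 in vals:
                val = 0            # one controlling 0 forces the output
            elif all(v == 1 for v in vals):
                val = 1
        elif op == "OR":
            if 1 in vals:
                val = 1            # one controlling 1 forces the output
            elif all(v == 0 for v in vals):
                val = 0
        elif op == "NOT":
            if vals[0] is not None:
                val = 1 - vals[0]
        elif op == "XOR":
            if all(v is not None for v in vals):
                val = sum(vals) % 2  # XOR needs every input known
        else:
            raise ValueError(f"unsupported gate type {op!r}")
        if val is not None:
            known[out] = val
    return known
```

Signals that end up in the returned dictionary need no CNF clauses in that frame, which is how deeper bounds become affordable.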
Title | Efficient BMC for Multi-Clock Systems with Clocked Specifications |
Author | *Malay K Ganai, Aarti Gupta (NEC LABS America, United States) |
Page | pp. 310 - 315 |
Keyword | Clocked PSL LTL, Customized SAT-based BMC, Multi-clock System, Dynamic simplification, Formal Verification |
Abstract | Current industry trends in system design (multiple clocks, clocks with arbitrary frequency ratios, multi-phased clocks, gated clocks and level-sensitive latches, combined with clocked specifications) pose additional challenges to verification efforts. We propose an integrated solution that improves SAT-based Bounded Model Checking (BMC) by orders of magnitude for the verification of synchronous multi-clock systems with clocked LTL properties. Our main contributions are: a) efficient clock modeling schemes that handle clock-related challenges uniformly, b) generation of automatic schedules and clock constraints to avoid unnecessary unrolling and loop checks in BMC, c) dynamic simplification of BMC problem instances with clock constraints, and d) customized BMC translations, with incremental formulations and learning, that directly handle PSL-style clocked specifications. We demonstrate the effectiveness of our approach on some OpenCores multi-clock system benchmarks. |
Title | Symbolic Model Checking of Analog/Mixed-Signal Circuits |
Author | *David Walter, Scott Little, Nicholas Seegmiller, Chris Myers (University of Utah, United States), Tomohiro Yoneda (National Institute of Informatics, Japan) |
Page | pp. 316 - 323 |
Keyword | verification, analog circuits, BDDs, Petri nets, model checking |
Abstract | This paper presents a Boolean-based symbolic model checking algorithm for the verification of analog/mixed-signal (AMS) circuits. The systems are modeled in VHDL-AMS, a hardware description language for AMS circuits. The VHDL-AMS description is compiled into labeled hybrid Petri nets (LHPNs), in which analog values are modeled as continuous variables that can change at rates within a bounded range, and digital values are modeled using Boolean signals. System properties are specified as temporal logic formulas in timed CTL (TCTL). The verification proceeds over the structure of the formula and maps separation predicates to Boolean variables. The state space is thus represented as a Boolean function using a binary decision diagram (BDD), and the verification algorithm relies on the efficient use of BDD operations. |
Title | Efficient Automata-Based Assertion-Checker Synthesis of SEREs for Hardware Emulation |
Author | *Marc Boule, Zeljko Zilic (McGill University, Canada) |
Page | pp. 324 - 329 |
Keyword | assertion, verification, automaton, checker, psl |
Abstract | In this paper, we present a method for generating checker circuits from Sequential Extended Regular Expressions (SEREs). Such sequences form the core of increasingly used Assertion-Based Verification (ABV) languages. A checker generator capable of transforming assertions into efficient circuits allows ABV to be adopted in hardware emulation. Toward that goal, we introduce algorithms for sequence fusion and length-matching intersection, two SERE operators that are not typically found in regular expressions. We also develop an algorithm for generating failure-detection automata, a concept critical to extending regular expressions for ABV, and we present our efficient symbol encoding. Experiments with complex sequences show that our tool outperforms the best known checker generator. |
Title | (Invited Paper) Energy-efficient Real-time Task Scheduling in Multiprocessor DVS Systems |
Author | *Jian-Jia Chen, Chuan-Yue Yang, Tei-Wei Kuo, Chi-Sheng Shih (National Taiwan Univ., Taiwan) |
Page | pp. 342 - 349 |
Keyword | Energy-Efficient Scheduling, Real-Time Systems , DVS, Multiprocessor Systems |
Abstract | Dynamic voltage scaling (DVS) circuits have been widely adopted in
many computing systems to provide a tradeoff between performance and
power consumption. Effective use of energy can not only extend the
operation duration of hand-held devices but also cut down the power
bills of server systems. Moreover, as many chip makers are
releasing multi-core chips and multiprocessor systems-on-chip
(SoCs), multiprocessor platforms for different applications are becoming
even more popular. Multiprocessor platforms can improve
system performance and accommodate the growing demand for computing
power and the variety of application functionality. This paper
summarizes our work on several important issues in energy-efficient
scheduling of real-time tasks in multiprocessor DVS systems.
Distinct from most previous work based on heuristics, we aim to
provide approximate solutions with worst-case guarantees. The
proposed algorithms are evaluated by a series of experiments to
provide insights into system design. |
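As a rough illustration of the energy model such work typically starts from, the sketch below assumes an idealized DVS processor whose dynamic power is P(s) = s^3 (continuous speeds, no static power) and frame-based tasks that share one deadline. It assigns tasks with a simple largest-task-first heuristic and runs each processor at the constant speed that just meets the deadline. The power model, the heuristic and all function names are illustrative assumptions, not the authors' algorithms.

```python
import heapq

ALPHA = 3.0  # assumed dynamic power model P(s) = s**ALPHA (idealized, no static power)

def processor_energy(load, deadline, alpha=ALPHA):
    """Energy to execute `load` cycles by `deadline` at the constant speed load / deadline."""
    if load == 0:
        return 0.0
    speed = load / deadline
    return speed ** alpha * deadline  # power * time = load**alpha / deadline**(alpha - 1)

def ltf_partition(cycles, m):
    """Largest-task-first: give each task, biggest first, to the currently least-loaded processor."""
    loads = [0.0] * m
    heap = [(0.0, j) for j in range(m)]  # (current load, processor index)
    for c in sorted(cycles, reverse=True):
        load, j = heapq.heappop(heap)
        loads[j] = load + c
        heapq.heappush(heap, (loads[j], j))
    return loads

def total_energy(cycles, m, deadline):
    """Sum of per-processor energies for the LTF partition."""
    return sum(processor_energy(load, deadline) for load in ltf_partition(cycles, m))

print(total_energy([4, 3, 3, 2], m=2, deadline=1.0))  # 432.0
print(total_energy([4, 3, 3, 2], m=1, deadline=1.0))  # 1728.0
```

Here the two processors each end up with 6 cycles. Because s^3 is convex, spreading the same work evenly over more processors at lower speeds lowers total energy under this model, which is the basic reason multiprocessor DVS scheduling pays off.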
PDF file |
Title | Passive Interconnect Macromodeling Via Balanced Truncation of Linear Systems in Descriptor Form |
Author | Boyuan Yan, *Sheldon X.-D. Tan, Pu Liu (University of California, Riverside, United States), Bruce McGaughy (Cadence Design Systems Inc., United States) |
Page | pp. 355 - 360 |
Keyword | model order reduction, descriptor form, TBR, passivity |
Abstract | In this paper, we present PriTBR, a novel passive model order reduction (MOR)
method for large RLC interconnect circuits based on projection-based truncated balanced realization.
Unlike existing passive truncated balanced realization (TBR) methods, in which
numerically expensive Lur'e or algebraic Riccati equations (AREs) are solved, the new method performs balanced truncation on linear systems in descriptor form by solving generalized Lyapunov equations.
Passivity preservation is achieved by congruence transformation
instead of simple truncation. For the first time, passive model order reduction is achieved by combining the Lyapunov-equation-based TBR method with congruence transformation. Compared with existing passive TBR, the new technique has the same accuracy, is numerically reliable, and is less expensive. In addition to preserving passivity, it can be easily extended to preserve structural information inherent to RLC circuits, such as block structure, reciprocity and sparsity. PriTBR can be applied as a second MOR stage combined with Krylov-subspace methods to generate a nearly optimal reduced model from a large-scale interconnect circuit while preserving passivity, structure, and reciprocity at the same time. Experimental results demonstrate the effectiveness of the proposed method and show that PriTBR and its structure-preserving version, SP-PriTBR, are superior to existing passive TBR and Krylov-subspace-based moment-matching methods. |
PDF file |
Title | Automated Extraction of Accurate Delay/Timing Macromodels of Digital Gates and Latches using Trajectory Piecewise Methods |
Author | Sandeep Dabas, Ning Dong, *Jaijeet Roychowdhury (University of Minnesota, Twin Cities, United States) |
Page | pp. 361 - 366 |
Keyword | Model-order-reduction, Simulation |
Abstract | We present a fundamentally new approach, ADME, for extracting highly accurate delay models of a wide variety of digital gates. The technique is based on trajectory-piecewise automated nonlinear macromodelling methods adapted from the mixed-signal/RF domain. Advantages over prior current-source models include rapid automated extraction from SPICE-level netlists, transparent retargetability to different design styles and technologies, and the ability to correctly and holistically account for complex input waveform shapes, nonlinear and linear loading, multiple input switching, effects of internal state, multiple I/Os, supply droop and substrate interference. We validate ADME on a variety of digital gates, including multi-input NAND, NOR and XOR gates, a full adder, a multilevel cascade of gates and a sequential latch. Our results confirm excellent model accuracy at the detailed waveform level and testify to the promise of ADME for sustainable gate delay modelling at nanoscale technologies. |
PDF file |
Title | Practical Implementation of Stochastic Parameterized Model Order Reduction via Hermite Polynomial Chaos |
Author | Yi Zou, Yici Cai, Qiang Zhou, Xianlong Hong (Tsinghua University, China), Sheldon X.-D. Tan (University of California, Riverside, United States), *Le Kang (Tsinghua University, China) |
Page | pp. 367 - 372 |
Keyword | stochastic interconnect analysis , Model order reduction |
Abstract | This paper describes the stochastic model order
reduction algorithm via stochastic Hermite polynomials from a
practical implementation perspective. Compared with existing work
on stochastic interconnect analysis and parameterized model order
reduction, we generalize the input variation representation using
polynomial chaos (PC) to allow for accurate modeling of non-Gaussian
input variations. We also explore the implicit system representation
using sub-matrices and improve the efficiency of solving the
linear equations by exploiting the block matrix structure of the augmented
system. Experiments show that our algorithm agrees very well with Monte Carlo
methods while remaining efficient. Moreover, the PC
representation of non-Gaussian variables is more accurate than the
Taylor representation used in previous work. |
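The identity behind a Hermite polynomial chaos representation of a non-Gaussian input can be checked numerically. The sketch below uses the standard generating-function result exp(σξ) = e^(σ²/2) Σ σⁿ/n! Heₙ(ξ) for a standard normal ξ, i.e. the PC expansion of a lognormal variable, a typical non-Gaussian parameter. It illustrates the representation only; it is not the paper's reduction algorithm, and the function names are hypothetical.

```python
import math

def hermite_he(n, x):
    """Probabilists' Hermite polynomial: He_0 = 1, He_1 = x, He_{k+1} = x He_k - k He_{k-1}."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

def lognormal_pc_coeffs(sigma, order):
    """PC coefficients a_n of exp(sigma * xi), xi ~ N(0, 1), truncated at `order`."""
    scale = math.exp(sigma ** 2 / 2)
    return [scale * sigma ** n / math.factorial(n) for n in range(order + 1)]

def pc_eval(coeffs, xi):
    """Evaluate the truncated expansion sum_n a_n He_n(xi)."""
    return sum(a * hermite_he(n, xi) for n, a in enumerate(coeffs))

def pc_mean_variance(coeffs):
    """Mean and variance from orthogonality: E[He_m He_n] = n! if m == n, else 0."""
    mean = coeffs[0]
    var = sum(a * a * math.factorial(n) for n, a in enumerate(coeffs) if n > 0)
    return mean, var

coeffs = lognormal_pc_coeffs(0.3, order=12)
print(pc_eval(coeffs, 0.7), math.exp(0.3 * 0.7))  # nearly identical values
```

The orthogonality relation makes the first two moments fall out of the coefficients directly: the mean is a₀ = e^(σ²/2) and the variance Σ aₙ² n! reproduces the lognormal variance e^(σ²)(e^(σ²) − 1), something a first-order Taylor (Gaussian) representation cannot capture.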
PDF file |
Title | Frequency Selective Model Order Reduction via Spectral Zero Projection |
Author | Mehboob Alam, *Arthur Nieuwoudt, Yehia Massoud (Rice University, United States) |
Page | pp. 379 - 383 |
Keyword | Interconnect, Model Order Reduction, Passivity |
Abstract | As process technology continues to scale into the nanoscale regime, interconnect plays an ever-increasing role in determining VLSI system performance. As the complexity of these systems increases, reduced order modeling becomes critical. In this paper, we develop a new method for the model order reduction of interconnect using frequency-restrictive selection of interpolation points based on the spectral zeros of the RLC interconnect model’s transfer function. The methodology uses the imaginary part of the spectral zeros for frequency-selective projection and provides stable as well as passive reduced order models for interconnect in VLSI systems. For large-order interconnect models with realistic RLC parameters, the results indicate that our method provides more accurate approximations than techniques based on balanced truncation and moment matching, and shows excellent agreement with the original system’s transfer function. |
PDF file |
Title | Abstract, Multifaceted Modeling of Embedded Processors for System Level Design |
Author | *Gunar Schirner, Andreas Gerstlauer, Rainer Doemer (University of California, Irvine, United States) |
Page | pp. 384 - 389 |
Keyword | abstract processor modeling, abstract computation, embedded software, system level design |
Abstract | Embedded software is playing an increasing role in today's SoC designs.
It allows flexible adaptation to evolving standards and to
customer-specific demands. As software increasingly emerges as a
design bottleneck, early, fast, and accurate simulation of
software becomes crucial. Therefore, efficient modeling of
programmable processors at high levels of abstraction is required.
In this article, we focus on abstraction of computation
and describe our abstract modeling of embedded processors.
We combine the computation modeling with task scheduling
support and accurate interrupt handling into a versatile, multi-faceted processor model with varying levels of features.
By incorporating the abstract processor model into a communication model, we achieve fast co-simulation of a complete custom target architecture
for system level design exploration.
We demonstrate the effectiveness of our approach using
an industrial-strength telecommunication example executing on a Motorola DSP architecture. Our results indicate the tremendous value of abstract processor modeling. Different feature levels achieve a simulation speedup of up to 6600 times with an error of less than 8% compared with ISS-based simulation. Our full-featured model exhibits a 3% error in simulated timing with an 1800 times speedup. |
PDF file |
Title | Flexible and Executable Hardware/Software Interface Modeling for Multiprocessor SoC Design Using SystemC |
Author | *Patrice Gerin, Hao Shen, Alexandre Chureau, Aimen Bouchhima, Ahmed Amine Jerraya (TIMA Laboratory, France) |
Page | pp. 390 - 395 |
Keyword | HW/SW Interface, Service-based model, MPSoC, Transaction Accurate |
Abstract | At a high abstraction level, Multi-Processor System-on-Chip (MPSoC) designs are specified as an assembly of IPs, which can be hardware or software. The refinement of communication between these different IPs, known as hardware/software interfaces, is widely seen as the design bottleneck due to their complexity. In order to perform early design validation and architecture exploration, flexible executable models of these interfaces are needed at different abstraction levels.
In this paper, we define a unified methodology to implement executable models of the hardware/software interface based on SystemC. The proposed formalism, based on the concept of services, gives this approach the flexibility needed for architecture exploration and the ability to be used in automatic generation tools. A case study of hardware/software interface modeling at the Transaction Accurate level is presented. Experimental results show that this method allows higher simulation speed with early performance estimation. |
PDF file |
Title | A Retargetable Software Timing Analyzer Using Architecture Description Language |
Author | *Xianfeng Li (Peking University, China), Abhik Roychoudhury, Tulika Mitra (National University of Singapore, Singapore), Prabhat Mishra (University of Florida, United States), Xu Cheng (Peking University, China) |
Page | pp. 396 - 401 |
Keyword | Worst Case Execution Time, Retargetability, Architecture Description Language |
Abstract | Worst Case Execution Time (WCET) is an essential input
for performance and schedulability analysis of real-time systems.
Static WCET analysis requires program path analysis and
microarchitecture modeling. Despite almost two decades of research,
WCET analysis has not enjoyed wide acceptance in industry. This is in
part due to the difficulty in microarchitecture modeling of modern
processors. Given the large number of embedded processors available
in the market, retargetability of the WCET analysis framework is a
serious issue. In this paper, we address it using Architecture
Description Language (ADL). Starting with the ADL of a target
processor, the proposed framework automatically generates graph-based
execution models to capture timing effects of instructions in the
pipeline. This pipeline model, coupled with parameterized models of
cache and branch prediction, leads to a WCET framework that is safe,
accurate and retargetable. |
PDF file |
Title | Automating Logic Rectification by Approximate SPFDs |
Author | *Yu-Shen Yang (University of Toronto, Canada), Subarna Sinha (Synopsys, United States), Andreas Veneris (University of Toronto, Canada), Robert Brayton (University of California, United States) |
Page | pp. 402 - 407 |
Keyword | debugging, verification, EDA VLSI, SAT, correction |
Abstract | In the digital VLSI design cycle, a netlist is often modified
to correct design errors, perform small specification changes, or implement incremental rewiring-based optimization operations. Most existing automated logic rectification tools use a small
set of predefined logic transformations when they perform such
modifications. This paper first shows that a small set of
predefined transformations may not allow rectification to exploit
the full potential of the design. It then proposes an automated
simulation-based methodology to "approximate" Sets of Pairs of Functions to be Distinguished (SPFDs) that avoids the memory/time explosion problem. This representation is used by a SAT-based algorithm that devises appropriate logic transformations to fix a design. The SAT method is then complemented by a greedy method that improves run-time performance. An extensive suite of experiments documents the added potential of the proposed rectification methodology. |
PDF file |
Title | BddCut: Towards Scalable Symbolic Cut Enumeration |
Author | *Andrew Chaang Ling, Jianwen Zhu (University of Toronto, Canada), Stephen Dean Brown (Altera Toronto Technology Centre, Canada) |
Page | pp. 408 - 413 |
Keyword | Cut Enumeration, Binary Decision Diagram, Elimination, Synthesis, Covering Problem |
Abstract | While the covering algorithm has recently been perfected by iterative approaches such as DAOmap and IMap, its application has been limited to technology mapping. The main factor preventing the covering problem's migration to other logic transformations, such as elimination and resynthesis region identification found in SIS and FBDD, is the exponential number of alternative cuts that have to be evaluated. Traditional methods of cut generation do not scale beyond a cut size of 6. In this paper, we propose a symbolic method that can enumerate all cuts without any pruning, up to a cut size of 10. We show that it can outperform traditional methods by an order of magnitude and, as a result, scales to 100K-gate benchmarks. As a practical driver, we apply the covering problem to elimination, where it not only produces competitive area but also reduces the total runtime of FBDD by more than 6x on average. FBDD is a BDD-based logic synthesis tool with a reported order-of-magnitude runtime advantage over SIS and commercial tools with negligible impact on area. |
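For contrast with the symbolic method, the "traditional" explicit enumeration can be sketched in a few lines: every cut of a node is either the node itself or a union of one cut from each fanin, kept only if it has at most k leaves. The sketch below is a generic textbook version with hypothetical names (it does not remove dominated cuts); it is not BddCut. The number of unions it forms per node is the product of its fanins' cut counts, which is the blowup that limits explicit enumeration to small cut sizes.

```python
from itertools import product

def enumerate_cuts(fanins, topo_order, k):
    """Bottom-up explicit enumeration of k-feasible cuts.

    fanins:     dict node -> tuple of fanin nodes (empty tuple for primary inputs)
    topo_order: nodes listed so that fanins come before their fanouts
    Returns dict node -> set of cuts, each cut a frozenset of at most k leaves.
    """
    cuts = {}
    for n in topo_order:
        node_cuts = {frozenset([n])}  # the trivial cut
        if fanins[n]:
            # Combine one cut from every fanin; the union is a cut of n.
            for combo in product(*(cuts[f] for f in fanins[n])):
                leaves = frozenset().union(*combo)
                if len(leaves) <= k:
                    node_cuts.add(leaves)
        cuts[n] = node_cuts
    return cuts

# Tiny network: x = f(a, b), y = g(x, c)
fanins = {"a": (), "b": (), "c": (), "x": ("a", "b"), "y": ("x", "c")}
order = ["a", "b", "c", "x", "y"]
print(sorted(sorted(c) for c in enumerate_cuts(fanins, order, k=3)["y"]))
# [['a', 'b', 'c'], ['c', 'x'], ['y']]
```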
PDF file |
Title | Node Mergers in the Presence of Don't Cares |
Author | *Stephen Plaza, Kai-hui Chang, Igor Markov, Valeria Bertacco (University of Michigan, United States) |
Page | pp. 414 - 419 |
Keyword | ODCs, sat sweep, global don't cares, node mergers |
Abstract | SAT sweeping is the process of merging two or more functionally equivalent nodes in a circuit by selecting one of them to represent all the other equivalent nodes. This provides significant advantages in synthesis by reducing circuit size and provides additional flexibility in technology mapping, which could be crucial in post-synthesis optimizations. Furthermore, it is also critical in verification because it can reduce the complexity of the netlist to be analyzed in equivalence checking. Most algorithms available so far for this goal do not exploit observability don't cares (ODCs) for node merging since nodes equivalent up to ODCs do not form an equivalence relation. Although a few recently proposed solutions can exploit ODCs by overcoming this limitation, they constrain their analysis to just a few levels of surrounding logic to avoid prohibitive runtime.
We develop an ODC-based node merging algorithm that performs efficient global ODC analysis (considering the entire netlist) through simulation and SAT. Our contributions which enable global ODC-based optimizations are: (1) a fast ODC-aware simulator and (2) an incremental verification strategy that limits computational complexity. In addition, our technique operates on arbitrarily mapped netlists, allowing for powerful post-synthesis optimizations. We show that global ODC analysis discovers on average 25% more (and up to 60%) node-merging opportunities than current state-of-the-art solutions based on local ODC analysis. |
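The usual first step of SAT sweeping, finding candidate merges cheaply before invoking SAT, can be sketched with bit-parallel random simulation: each node gets a signature that packs many random input patterns into one integer, and only nodes with identical signatures are sent to the SAT solver. The sketch below covers plain functional equivalence only; it does not model the ODC-aware simulation the paper contributes, and the gate encoding and names are hypothetical.

```python
import random
from collections import defaultdict

def simulate(inputs, gates, nbits=64, seed=1):
    """Bit-parallel random simulation: each signature packs `nbits` random input patterns.

    inputs: list of primary-input names
    gates:  list of (name, op, fanins) in topological order, op in AND/OR/XOR/NOT
    """
    rng = random.Random(seed)
    mask = (1 << nbits) - 1
    sig = {name: rng.getrandbits(nbits) for name in inputs}
    for name, op, fanin_names in gates:
        vals = [sig[f] for f in fanin_names]
        if op == "AND":
            sig[name] = vals[0] & vals[1]
        elif op == "OR":
            sig[name] = vals[0] | vals[1]
        elif op == "XOR":
            sig[name] = vals[0] ^ vals[1]
        elif op == "NOT":
            sig[name] = ~vals[0] & mask  # keep the complement within nbits
        else:
            raise ValueError(f"unsupported gate type: {op}")
    return sig

def candidate_classes(sig, nodes):
    """Group nodes whose signatures agree; each group still needs a SAT proof of equivalence."""
    buckets = defaultdict(list)
    for n in nodes:
        buckets[sig[n]].append(n)
    return [group for group in buckets.values() if len(group) > 1]

# g3 = NOT(OR(NOT a, NOT b)) equals AND(a, b) by De Morgan, so it should pair with g1.
inputs = ["a", "b"]
gates = [
    ("g1", "AND", ["a", "b"]),
    ("na", "NOT", ["a"]),
    ("nb", "NOT", ["b"]),
    ("g2", "OR", ["na", "nb"]),
    ("g3", "NOT", ["g2"]),
]
sig = simulate(inputs, gates)
print(candidate_classes(sig, ["g1", "g2", "g3"]))  # [['g1', 'g3']]
```

Equal signatures are only evidence, not proof; nodes that are equivalent up to don't cares would be missed entirely by this plain comparison, which is the gap ODC-based merging targets.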
PDF file |
Title | Synthesis of Reversible Sequential Elements |
Author | Min-Lung Chuang, *Chun-Yao Wang (National Tsing Hua University, Taiwan) |
Page | pp. 420 - 425 |
Keyword | reversible logic synthesis, reversible sequential elements |
Abstract | To construct a reversible sequential circuit, reversible sequential elements are required. This work presents novel designs of reversible sequential elements such as the D latch, JK latch, and T latch. Based on these reversible latches, we also construct the corresponding flip-flops. We then discuss physical implementations of our designs based on classical MOS electronics. Compared with previous work, the implementation cost of our new designs, including the number of gates and the number of garbage outputs, is considerably reduced. |
PDF file |
Title | Recognition of Fanout-free Functions |
Author | Tsung-Lin Lee, *Chun-Yao Wang (National Tsing Hua University, Taiwan) |
Page | pp. 426 - 431 |
Keyword | fanout-free, read-once, logic synthesis, factoring |
Abstract | Factoring is a logic minimization technique that represents
a Boolean function as an equivalent expression with a
minimum number of literals. When the circuit is realized, a function
represented in a more compact form occupies less
area. Some Boolean functions even have equivalent
forms in which each variable appears exactly once; these
are known as fanout-free functions. John P. Hayes
devised an algorithm to determine whether a function is
fanout-free and to construct the circuit if a fanout-free
realization exists. In this paper, we propose a property
and an efficient technique to accelerate this algorithm.
With our improvements, the execution time of this
algorithm is more competitive with that of the state-of-the-art
method. |
PDF file |
Title | (Invited Paper) Design Tool Solutions for Mixed-signal/RF Circuit Design in CMOS Nanometer Technologies |
Author | *Georges Gielen (Katholieke Universiteit Leuven, Belgium) |
Page | pp. 432 - 437 |
Keyword | analog, mixed-signal, CAD tools |
Abstract | The scaling of CMOS technology into the nanometer era enables the fabrication of highly integrated systems, which increasingly contain analog and/or RF parts. However, scaling into the nanometer era also brings problems of leakage power, increasing variability and degradation, reducing supply voltages and worsening signal integrity conditions, all this in combination with tightening time-to-market constraints. Design methodologies and tools need to be developed to address these problems. This invited paper describes progress in modeling techniques for design and verification of complex integrated systems, in circuit and yield optimization tools for analog/RF circuits, as well as in signal integrity analysis methods such as EMC/EMI analysis. |
PDF file |
Title | (Invited Paper) Challenges to Accuracy for the Design of Deep-submicron RF-CMOS Circuits |
Author | *Sadayuki Yoshitomi (Toshiba Corporation, Japan) |
Page | pp. 438 - 441 |
Keyword | RF-CMOS, Electro-Magnetic simulation, EKV3.0 , Compact Model, NQS effect |
Abstract | Increasing complexity, functionality and operating frequency make RF-CMOS
circuit design a tough subject. Efficient use of recent electromagnetic
simulation, which enables the inclusion of many high-frequency effects, and
the use of "more" accurate compact models are the keys to overcoming this
problem. The challenges in these two areas will be illustrated with real
implementation examples. |
PDF file |
Title | (Invited Paper) Advanced Tools for Simulation and Design of Oscillators/PLLs |
Author | Xiaolue Lai, *Jaijeet Roychowdhury (Univ. of Minnesota, United States) |
Page | pp. 442 - 449 |
Keyword | macromodeling |
Abstract | The lack of fast yet accurate oscillator and PLL simulation methods
has constituted a serious bottleneck in mixed-signal, RF and digital
design flows. Methods are described that, given differential
equations for any oscillator (i.e., equivalent to, e.g., a SPICE-level
circuit), will extract a simple nonlinear phase macromodel. It will be shown
how such nonlinear phase macromodels are capable of capturing a
variety of important effects, including jitter and phase noise,
injection locking, PLL lock and capture phenomena, cycle slipping,
etc., while being faster by several orders of magnitude than
SPICE-level simulations. It will also be shown how this nonlinear phase
macromodel, when applied to large systems of networked biochemical
and nanoelectronic oscillators, correctly predicts spontaneous pattern
formation and edge detection. |
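Injection locking and cycle slipping, two of the effects listed above, already appear in the much simpler classical Adler equation for the phase difference φ between an oscillator and an injected signal, dφ/dt = Δω − K sin φ. The sketch below integrates it with forward Euler. It is a textbook model used here only for intuition, not the nonlinear phase macromodel described in the talk; the step size, horizon and function names are arbitrary choices.

```python
import math

def adler_final_phase(delta_omega, k_lock, phi0=0.0, dt=1e-3, t_end=50.0):
    """Forward-Euler integration of Adler's equation dphi/dt = delta_omega - k_lock * sin(phi).

    phi is the oscillator phase relative to the injected signal (radians),
    delta_omega the frequency detuning and k_lock the locking strength.
    """
    phi = phi0
    for _ in range(int(round(t_end / dt))):
        phi += dt * (delta_omega - k_lock * math.sin(phi))
    return phi

def predicts_lock(delta_omega, k_lock):
    """Adler's lock condition: a stationary phase exists iff |delta_omega| <= k_lock."""
    return abs(delta_omega) <= k_lock

# Inside the lock range the phase settles at asin(delta_omega / k_lock) ...
print(adler_final_phase(0.5, 1.0), math.asin(0.5))
# ... outside it the phase keeps slipping cycles and grows without bound.
print(adler_final_phase(2.0, 1.0))
```

With Δω = 2 and K = 1 the phase never settles: it advances by one full cycle roughly every 2π/√3 time units, which is the cycle-slipping behavior a phase macromodel must reproduce.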
PDF file |
Title | A New Methodology for Interconnect Parasitics Extraction Considering Photo-Lithography Effects |
Author | Ying Zhou (Texas A&M University, United States), Zhuo Li (Pextra Corp., United States), Yuxin Tian, *Weiping Shi (Texas A&M University, United States), Frank Liu (IBM Austin Research Laboratory, United States) |
Page | pp. 450 - 455 |
Keyword | lithography simulation, Parasitic Extraction, DFM |
Abstract | Even with the wide adoption of resolution enhancement techniques
in sub-wavelength lithography, the geometry of the fabricated interconnect is still quite different from the drawn one. Existing Layout Parasitic Extraction (LPE) tools assume perfect geometry, thus introducing significant error in the extracted parasitic models, which in turn causes significant error in timing verification and signal integrity analysis.
Our simulation shows that the RC parasitics
extracted from perfect GDS-II geometry can differ by as much as 20% from those extracted from the post-litho/etching simulation geometry.
This paper presents a new LPE methodology and related fast algorithms
for interconnect parasitic extraction under photolithographic effects. Our methodology is compatible with the existing design flow. Experimental results show that the proposed methods are accurate and efficient. |
PDF file |
Title | Simple and Accurate Models for Capacitance Increment due to Metal Fill Insertion |
Author | *Youngmin Kim (University of Michigan of Ann Arbor, United States), Dusan Petranovic (Mentor Graphics, United States), Dennis Sylvester (University of Michigan of Ann Arbor, United States) |
Page | pp. 456 - 461 |
Keyword | metal fills, dummy, capacitance, interconnect, modeling |
Abstract | Inserting metal fill to improve inter-level dielectric thickness planarity is an essential part of the modern design process. However, the inserted fill shapes impact the performance of signal interconnect by increasing capacitance. In this paper, we analyze and model the impact of metal dummy fill on signal capacitance with respect to various parameters, including the fill's electrical characteristics, the signal dimensions, and the dummy shape and dimensions. Fill has a different impact on interconnect depending on whether the signal of interest is in the same layer as the fill or not. In particular, intra-layer dummy fill has its greatest impact on coupling capacitance, while inter-layer dummy fill has more impact on the ground capacitance component. Based on an analysis of fill impact on capacitance, we propose simple capacitance increment models (Cc for intra-layer dummy and Cg for inter-layer dummy). To consider the realistic case with both signals and metal fill in adjacent layers, we apply a weighting function approach in the ground capacitance model. We verify these models using simple test patterns and benchmark circuits and find that they match field solver results well (1.3% average error) while running much faster than commercial extraction tools (the runtime overhead is reduced by ~75% for all benchmark circuits). |
PDF file |
Title | Parameter Reduction for Variability Analysis by Slice Inverse Regression (SIR) Method |
Author | Alexandar Mitev, Michael Marefact, Dongsheng Ma, *Janet Wang (University of Arizona at Tucson, United States) |
Page | pp. 468 - 473 |
Keyword | performance oriented, parameter reduction |
Abstract | With semiconductor fabrication technologies scaled below 100 nm, the design-manufacturing
interface becomes more and more complicated. The resultant
process variability causes a number of issues in new-generation IC design.
One of the biggest challenges is the enormous number of process-variation-related parameters.
These parameters represent numerous local and global variations
and pose a heavy burden in today's chip verification and design.
This paper proposes a new way of reducing the statistical variations
(which include both process parameters and design variables) according to
their impacts on the overall circuit performance. The new approach creates an
effective reduction subspace (ERS) and provides a transformation matrix by using
the mean and variance of the response surface. With the generated transformation matrix,
the proposed method maps the original statistical variations to a smaller set of variables
on which variability analysis is then performed. Thus, the computational cost due to the
number of variations is greatly reduced. Experimental results show that the new method
achieves 20% to 50% parameter reduction with less than 5% error on average. |
PDF file |
Title | Stochastic Sparse-grid Collocation Algorithm (SSCA) for Periodic Steady-State Analysis of Nonlinear System with Process Variations |
Author | *Jun Tao, Xuan Zeng (Fudan University, China), Wei Cai (University of North Carolina at Charlotte, United States), Yangfeng Su (Fudan University, China), Dian Zhou (University of Texas at Dallas, United States), Charles Chiang (Synopsys Inc., United States) |
Page | pp. 474 - 479 |
Keyword | process variation, steady-state analysis, Stochastic Collocation Algorithm, Sparse Grid Technique |
Abstract | In this paper, a Stochastic Collocation Algorithm
combined with the Sparse Grid technique (SSCA) is proposed to
handle periodic steady-state analysis of nonlinear systems
with process variations. Compared to existing approaches,
SSCA has several considerable merits. First, compared
with moment-matching parameterized model order reduction
(PMOR), which treats process variables and the frequency
parameter alike through Taylor approximation,
SSCA employs Homogeneous Chaos to capture the impact of
process variations with an exponential convergence rate and adopts
Fourier series or wavelet bases to model the steady-state behavior
in the time domain. Second, unlike the Stochastic Galerkin
Algorithm (SGA), which is efficient for stochastic linear system
analysis, SSCA has much lower complexity than SGA
in the nonlinear case. Third, unlike the Efficient Collocation
Method, a heuristic approach that may result in the "rank
deficiency problem" and the "Runge phenomenon", SSCA uses the Sparse Grid technique
to select the needed collocation points
in order to reduce complexity while guaranteeing approximation
accuracy. Furthermore, although SSCA is proposed for
stochastic nonlinear steady-state analysis, it can be applied
to other kinds of nonlinear system simulation with process
variations, such as transient analysis. |
PDF file |
Title | Retiming for Synchronous Data Flow Graphs |
Author | Nikolaos Liveris, Chuan Lin, Jia Wang, *Hai Zhou (Northwestern University, United States), Prithviraj Banerjee (University of Illinois, Chicago, United States) |
Page | pp. 480 - 485 |
Keyword | SDF, retiming, high-level synthesis |
Abstract | In this paper we present a new algorithm for retiming Synchronous Dataflow (SDF) graphs. The retiming aims at minimizing the cycle length of an SDF. The algorithm is provably optimal and its execution time is improved compared to previous approaches. |
PDF file |
Title | Signal-to-Memory Mapping Analysis for Multimedia Signal Processing |
Author | Ilie I. Luican, Hongwei Zhu, *Florin Balasa (University of Illinois at Chicago, United States) |
Page | pp. 486 - 491 |
Keyword | memory management, signal-to-memory mapping, intra-array mapping |
Abstract | The storage requirements in data-dominant signal processing systems,
whose behavior is described by array-based, loop-organized algorithmic
specifications, have an important impact on the overall energy
consumption, data access latency, and chip area. Finding the optimal
storage of the usually large arrays from these behavioral specifications
is an important step during memory allocation.
This paper proposes more efficient algorithms for two intra-array
mapping-to-memory models (of De Greef and Troncon), resulting in
an implementation several times faster than the original ones. |
PDF file |
Title | MODLEX: A Multi Objective Data Layout EXploration Framework for Embedded Systems-on-Chip |
Author | *Rajesh Kumar T. S. (Texas Instruments India, India), Ravikumar C. P. (Texas Instruments, India), Govindarajan R. (Indian Institute of Science, India) |
Page | pp. 492 - 497 |
Keyword | Memory Architecture, Data Layout, Power-performance Trade-off, Genetic Algorithm |
Abstract | The memory subsystem is a major contributor to the performance,
power, and area of complex SoCs used in feature-rich multimedia
products. Hence, the memory architecture of the embedded DSP is complex
and usually custom designed, with multiple banks of
single-ported or dual-ported
on-chip scratch pad memory and multiple banks of off-chip memory.
Building software for such large, complex memories, when many of
the software components are individually optimized software IPs,
is a major challenge. In order to
obtain good performance and a reduction in memory stalls,
the data buffers of the application need to be placed
carefully in different types of memory. In this paper we present
a unified framework (MODLEX) that combines different data layout optimizations
to address complex DSP memory architectures.
Our method models the data layout problem as a multi-objective Genetic
Algorithm (GA) with performance and power as the objectives
and presents a set of solution points that is attractive from
a platform design viewpoint. While most of the work in the
literature assumes that performance and power are non-conflicting
objectives, our work demonstrates that
a significant trade-off (up to 70%) is possible between power
and performance. |
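The "set of solution points" a multi-objective search presents is its non-dominated (Pareto) set. The sketch below shows only that filtering step for hypothetical (execution cycles, energy) evaluations of candidate data layouts, with both objectives minimized; the genetic operators and MODLEX's actual cost models are not reproduced, and all numbers are made up for illustration.

```python
def dominates(q, p):
    """q dominates p if it is no worse in both objectives and strictly better in one (minimization)."""
    return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])

def pareto_front(points):
    """Non-dominated (cycles, energy) points, sorted by cycles."""
    return sorted({p for p in points if not any(dominates(q, p) for q in points)})

# Hypothetical (execution cycles, energy) values for five candidate data layouts.
layouts = [(100, 9.0), (120, 6.0), (110, 9.5), (150, 5.0), (150, 5.5)]
print(pareto_front(layouts))  # [(100, 9.0), (120, 6.0), (150, 5.0)]
```

When performance and power do conflict, the front contains several points rather than one, and the spread between its extremes is the kind of trade-off a platform designer can choose from.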
PDF file |
Title | A Run-Time Memory Protection Methodology |
Author | *Udaya Seshua (Philips Semiconductors, India), Nagaraju Bussa (Philips Research, India), Bart Vermeulen (Philips Research, Netherlands) |
Page | pp. 498 - 503 |
Keyword | memory protection, software debug, Hardware/Software co-design |
Abstract | In this paper we present a novel methodology, which aids in debugging memory corruption errors during application development. This methodology is based on the analysis of the memory access behavior of a set of benchmark applications. The analysis result is used to strike an optimal balance between hardware and software instrumentation to make our approach low-cost both from a performance penalty and hardware area point-of-view. Experimental results show that our innovative approach typically requires less than 2% of CPU silicon area for less than 1% run-time performance overhead, making it applicable in time-constrained embedded systems. |
PDF file |
Title | Short-Circuit Compiler Transformation: Optimizing Conditional Blocks |
Author | *Mohammad Ali Ghodrat, Tony Givargis, Alex Nicolau (University of California, Irvine, United States) |
Page | pp. 504 - 510 |
Keyword | Short circuit evaluation, lazy evaluation, compiler transformation, domain space partitioning |
Abstract | We present the short-circuit code transformation technique, intended for embedded compilers. The transformation technique optimizes conditional blocks in high-level programs. Specifically, the transformation takes advantage of the fact that the Boolean value of the conditional expression, which determines the true/false paths, can be statically analyzed to determine cases in which one or the other of the true/false paths is guaranteed to execute.
In such cases, code is generated to bypass the evaluation of the conditional expression. When the bypass code is faster to evaluate than the conditional expression, a net performance gain is obtained. Our experiments with the Mediabench applications show that the short-circuit transformation yields an average improvement in execution time of 35.1% on SPARC and 36.3% on ARM. We also measured an average reduction in power consumption of 36.4% on ARM. |
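A hand-written illustration of the idea: for the hypothetical condition x*x + y*y < 100, simple range checks on the inputs already decide the outcome over most of the input space, so the multiplications only need to run in the remaining region. The sketch below is not output of the authors' compiler pass; it only shows why such a rewrite preserves semantics (whether it is faster depends on the target, e.g. comparisons being cheaper than multiplies on an embedded core).

```python
def cond_original(x, y):
    # Conditional expression as written in the (hypothetical) source program.
    return x * x + y * y < 100

def cond_short_circuit(x, y):
    # Bypass 1: |x| <= 7 and |y| <= 7 imply x*x + y*y <= 98 < 100, so the true path is certain.
    if -7 <= x <= 7 and -7 <= y <= 7:
        return True
    # Bypass 2: |x| >= 10 or |y| >= 10 imply x*x + y*y >= 100, so the false path is certain.
    if x >= 10 or x <= -10 or y >= 10 or y <= -10:
        return False
    # Remaining region: fall back to evaluating the original expression.
    return x * x + y * y < 100

# The rewrite must agree with the original everywhere.
assert all(cond_original(x, y) == cond_short_circuit(x, y)
           for x in range(-30, 31) for y in range(-30, 31))
```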
PDF file |
Title | Optimization of Arithmetic Datapaths with Finite Word-Length Operands |
Author | *Sivaram Gopalakrishnan, Priyank Kalla (University of Utah, United States), Florian Enescu (Georgia State University, United States) |
Page | pp. 511 - 516 |
Keyword | Finite Integer Rings, Modulo Arithmetic |
Abstract | This paper presents an approach to area optimization of arithmetic
datapaths that perform polynomial computations over bit-vectors with
finite widths. Examples of such designs abound in DSP for audio, video
and multimedia computations where the input and output bit-vector
sizes are dictated by the desired precision. A bit-vector of size m
represents integer values reduced modulo 2^m. Therefore,
finite word-length bit-vector arithmetic can be modeled as algebra
over finite integer rings, where the bit-vector size dictates
the ring cardinality. This paper demonstrates how the number-theoretic
properties of finite integer rings can be exploited for optimization
of bit-vector arithmetic. Along with an analytical model to estimate
the implementation cost at RTL, two algorithms are presented to
optimize bit-vector arithmetic. Experimental results, conducted within
practical CAD settings, demonstrate significant area savings due to
our approach. |
PDF file |
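As an illustration of the kind of number-theoretic property involved (a textbook fact, not one of the paper's two algorithms): over Z_(2^m), some nonzero polynomials evaluate to zero for every input. Since x(x+1) is always even, 4x(x+1) is a multiple of 8, so for 3-bit vectors the hypothetical datapath 5x^2 + 4x + 1 can be replaced by the cheaper x^2 + 1. The sketch below checks this exhaustively; the polynomials and width are assumptions chosen for the example.

```python
M = 3              # bit-vector width (illustrative)
MOD = 1 << M       # values are reduced modulo 2**M


def f(x: int) -> int:
    # Original datapath polynomial: 5x^2 + 4x + 1 (mod 2^M)
    return (5 * x * x + 4 * x + 1) % MOD


def g(x: int) -> int:
    # Cheaper candidate: x^2 + 1 (mod 2^M)
    return (x * x + 1) % MOD


def equivalent_mod(p, q, width: int) -> bool:
    # Exhaustive check over all 2**width input values.
    return all(p(x) == q(x) for x in range(1 << width))


def vanishing_term_is_zero(width: int) -> bool:
    # f - g = 4x(x + 1); it vanishes for every x only if 2**width divides it.
    return all((4 * x * (x + 1)) % (1 << width) == 0 for x in range(1 << width))


print(equivalent_mod(f, g, M))  # True
```

The property depends on the width: for 4-bit vectors, 4x(x+1) equals 8 at x = 1, so the same simplification would be wrong.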
Title | Exploiting Power-Area Tradeoffs in Behavioural Synthesis through Clock and Operations Throughput Selection |
Author | *Marco A. Ochoa-Montiel, Bashir M. Al-Hashimi (University of Southampton, Great Britain), Peter Kollig (Philips Semiconductors, Great Britain) |
Page | pp. 517 - 522 |
Keyword | High Level Synthesis, low power |
Abstract | This paper describes a new dynamic-power-aware High Level Synthesis (HLS) datapath approach that considers the close interrelation between clock choice and operations throughput selection whilst attempting to minimize area, power, or a combination thereof. It is shown that the proposed approach, with its compound cost function and its novel clock and operations throughput selection algorithm, obtains solutions with lower power and area than previous relevant work [11]. Moreover, different power-area tradeoffs can be explored through the appropriate choice of clock period and operations throughput using our approach. |
PDF file |
Title | High-Level Power Estimation and Low-Power Design Space Exploration for FPGAs |
Author | *Deming Chen (University of Illinois at Urbana-Champaign, United States), Jason Cong, Yiping Fan, Zhiru Zhang (University of California, Los Angeles, United States) |
Page | pp. 529 - 534 |
Keyword | high-level synthesis, low-power, FPGA, power estimation |
Abstract | In this paper, we present a simultaneous resource allocation and binding algorithm for FPGA power minimization. To fully validate our methodology and results, our work targets a real FPGA architecture, the Altera Stratix FPGA [2], which includes generic logic elements, DSP cores, and memories. We design a high-level power estimator for this architecture and evaluate its estimation accuracy against a commercial gate-level power estimator, the Quartus II PowerPlay Analyzer [1]. During the synthesis stage, we pay special attention to interconnects and multiplexers. We concentrate on resource allocation and binding because they are the key steps that determine the interconnections. We use a novel approach to explore the design space during synthesis: it forms, propagates, and prunes synthesis solution points, where each solution point represents one actual implementation of the datapath. Eventually, we generate a design solution curve, which provides ideal solution points with low power and high performance. Experimental results show that our high-level power estimator deviates by 8.7% from the PowerPlay Analyzer. Meanwhile, we achieve a significant power reduction (32%) together with better circuit speed (16%) compared to a traditional resource allocation and binding algorithm. |
PDF file |
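The "prune" step on solution points can be pictured as keeping only Pareto-optimal points. The sketch below is a generic dominance filter, assuming each point is reduced to a hypothetical (power, delay) pair where both are minimized; the paper's actual pruning criteria are not given in the abstract.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (power, delay), both to be minimized


def prune_dominated(points: List[Point]) -> List[Point]:
    """Keep only Pareto-optimal (power, delay) solution points."""
    front: List[Point] = []
    best_delay = float("inf")
    # Sweep in order of increasing power (ties broken by delay): a point
    # survives only if it is strictly faster than every cheaper point seen.
    for power, delay in sorted(set(points)):
        if delay < best_delay:
            front.append((power, delay))
            best_delay = delay
    return front
```

For example, `prune_dominated([(10, 5), (8, 7), (12, 4), (9, 7), (11, 6), (8, 9)])` keeps `[(8, 7), (10, 5), (12, 4)]`; the surviving points form the kind of power/performance trade-off curve the abstract describes.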
Title | Numerical Function Generators Using Edge-Valued Binary Decision Diagrams |
Author | *Shinobu Nagayama (Hiroshima City University, Japan), Tsutomu Sasao (Kyushu Institute of Technology, Japan), Jon Butler (Naval Postgraduate School, United States) |
Page | pp. 535 - 540 |
Keyword | edge-valued BDD, non-uniform segmentation, piecewise polynomial approximation, numerical function generator, FPGA |
Abstract | In this paper, we introduce the edge-valued binary
decision diagram (EVBDD) to reduce the memory and delay in
numerical function generators (NFGs). An NFG realizes a function,
such as a trigonometric, logarithmic, square root, or reciprocal
function, in hardware. NFGs are important in, for example,
digital signal processing applications, where high speed and accuracy are
necessary. We use the EVBDD to produce a fast and compact segment
index encoder (SIE) that is a key component in our NFG. We
compare our approach with NFG designs based on multi-terminal
BDDs (MTBDDs), and show that the EVBDD produces SIEs that
have, on average, only 7% of the memory and 40% of the delay of
those designed using MTBDDs. As a result, our NFGs based on
EVBDDs have, on average, only 38% of the memory and 59% of
the delay of NFGs based on MTBDDs. |
PDF file |
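The NFG structure (a segment index encoder followed by a per-segment polynomial) can be modeled in a few lines of software. This sketch assumes a hypothetical target f(x) = sqrt(x) on [1, 4], hand-picked non-uniform breakpoints, and degree-1 polynomials; Python's `bisect` stands in for the SIE, which the paper realizes with an EVBDD in hardware.

```python
import bisect
import math

# Hypothetical non-uniform segmentation of [1, 4] for f(x) = sqrt(x):
# segments are narrower near x = 1, where the curvature of sqrt is larger.
BREAKPOINTS = [1.0, 1.25, 1.5, 2.0, 2.5, 3.0, 4.0]


def build_coefficients(f, bps):
    """Degree-1 coefficients (slope, intercept) interpolating f at segment ends."""
    coeffs = []
    for a, b in zip(bps, bps[1:]):
        slope = (f(b) - f(a)) / (b - a)
        coeffs.append((slope, f(a) - slope * a))
    return coeffs


COEFFS = build_coefficients(math.sqrt, BREAKPOINTS)


def segment_index(x: float) -> int:
    # Software stand-in for the segment index encoder (SIE).
    i = bisect.bisect_right(BREAKPOINTS, x) - 1
    return min(max(i, 0), len(COEFFS) - 1)


def nfg(x: float) -> float:
    # Look up the segment, then evaluate that segment's polynomial.
    slope, intercept = COEFFS[segment_index(x)]
    return slope * x + intercept
```

With these breakpoints the interpolation error stays below about 0.007 on [1, 4]; a finer segmentation near x = 1 would trade more table entries (memory) for accuracy, which is the cost the SIE encoding tries to keep small.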
Title | An Efficient Computation of Statistically Critical Sequential Paths Under Retiming |
Author | Mongkol Ekpanyapong (Intel Corporation, United States), Xin Zhao, *Sung Kyu Lim (Georgia Institute of Technology, United States) |
Page | pp. 547 - 552 |
Keyword | statistical timing analysis, retiming |
Abstract | In this paper we present the Statistical Retiming-based Timing Analysis (SRTA) algorithm. The goal is to compute the timing slack distribution for the nodes in the timing graph and identify the statistically critical paths under retiming, which are the paths with a high probability of becoming timing-critical after retiming. SRTA enables designers to perform circuit optimization on these paths to reduce the probability of their becoming timing bottlenecks if the circuit is retimed as a post-process. We compare static timing analysis (STA), statistical timing analysis (SSTA), retiming-based timing analysis (RTA), and our statistical retiming-based timing analysis (SRTA). Our results show that placement optimization based on SRTA achieves the best performance results. |
PDF file |
Title | Fast Electrical Correction Using Resizing and Buffering |
Author | Shrirang Karandikar, *Charles J Alpert (IBM Austin Research Laboratory, United States), Mehmet Yildiz, Paul Villarrubia, Steve Quay, Tuhin Mahmud (IBM EDA, United States) |
Page | pp. 553 - 558 |
Keyword | Electrical correction, Slew/cap violations, buffering, gate sizing |
Abstract | Current design methodologies are geared towards meeting different design criteria, such as delay, area or power. However, to correctly identify the critical parts of a circuit for optimization, the circuit has to be electrically clean, i.e., slews on each pin have to be within certain limits, a gate cannot drive more than a certain amount of capacitance, etc. Thus far, this requirement has largely been ignored in the literature. Instead, existing delay-optimization methods are used to fix electrical violations. This leads to solutions that are unnecessarily expensive and that still leave some violations unfixed. There is therefore a need for an area-efficient strategy that targets the electrical state of a circuit and fixes all violations quickly. This paper explicitly defines "electrical violations" and presents a flexible approach (called EVE, the Electrical Violation Eliminator) for fixing them. Experimental results validate our approach. |
PDF file |
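A sketch of the detection side only (not EVE's fixing algorithm), assuming hypothetical per-net data and a crude lumped-RC slew model: the 10%-90% transition of a single-pole RC step response is RC·ln 9 ≈ 2.2·RC, and kΩ × fF conveniently gives picoseconds. Real flows use library slew tables instead.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

LN9 = math.log(9.0)  # 10%-90% transition of a single-pole RC step response


@dataclass
class Net:
    name: str
    drive_res_kohm: float       # driver output resistance (kOhm), hypothetical
    max_cap_ff: float           # driver capacitance limit (fF), hypothetical
    wire_cap_ff: float          # wire capacitance (fF)
    pin_caps_ff: List[float] = field(default_factory=list)  # sink pin caps (fF)


def find_violations(nets: List[Net], max_slew_ps: float) -> List[Tuple[str, str]]:
    violations = []
    for net in nets:
        load = net.wire_cap_ff + sum(net.pin_caps_ff)
        if load > net.max_cap_ff:
            violations.append((net.name, "cap"))
        # kOhm * fF = ps; lumped-RC slew estimate (an assumption, see lead-in).
        slew = LN9 * net.drive_res_kohm * load
        if slew > max_slew_ps:
            violations.append((net.name, "slew"))
    return violations
```

For instance, a 2 kΩ driver with 25 fF total load passes a 30 fF cap limit but has an estimated slew of about 110 ps, so it is flagged against a 100 ps limit; a buffering or resizing step would then target exactly that net.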
Title | SmartSmooth: A Linear Time Convexity Preserving Smoothing Algorithm for Numerically Convex Data with Application to VLSI Design |
Author | Sanghamitra Roy (University of Wisconsin-Madison, United States), *Charlie Chung-Ping Chen (National Taiwan University, Taiwan) |
Page | pp. 559 - 564 |
Keyword | Convex optimization, Smoothing, Numerically convex, Gate sizing |
Abstract | Convex optimization problems are very popular in the VLSI
design community due to their guaranteed convergence to a globally
optimal point. When optimizing tabular data, significant fitting
effort is required to fit the data into convex form. Fitting the
tables into analytically convex forms such as posynomials suffers from
excessive fitting errors, as the fitting problem may be non-convex.
In recent literature, optimal numerically convex tables have been
proposed. Since these tables are numerical, it is extremely
important to make the table data smooth and yet preserve its
convexity. Smoothness ensures that the convex optimizer behaves
predictably and converges quickly to the globally optimal point. The
existing smoothing techniques either cannot preserve convexity or
require very high execution time. In this paper, we propose a linear-time
algorithm that smooths given numerically convex data while
preserving convexity. Our proposed algorithm, SmartSmooth, smooths
the data in linear time without introducing any additional error on
the numerically convex data. This algorithm can be a significant
contribution to the optimization of non-analytical data.
We present our SmartSmooth results on industrial cell libraries.
SmartSmooth, when applied to convex tables produced by ConvexFit,
shows a 30X reduction in fitting error over a posynomial fitting
algorithm and a 3X reduction in fitting error over the ConvexSmooth
algorithm. |
PDF file |
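What "numerically convex" means for a one-dimensional table can be made concrete with a linear-time check: the slopes between consecutive samples must never decrease. This is only the convexity test, under the assumption of a 1-D table with strictly increasing x values; it is not the SmartSmooth smoothing algorithm.

```python
from typing import Sequence


def is_numerically_convex(xs: Sequence[float], ys: Sequence[float], tol: float = 1e-12) -> bool:
    """True if the piecewise-linear table through (xs, ys) is convex.

    Convexity of sampled data means the slopes between consecutive
    samples never decrease; the check is a single linear pass.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    prev_slope = float("-inf")
    for i in range(len(xs) - 1):
        dx = xs[i + 1] - xs[i]
        if dx <= 0:
            raise ValueError("xs must be strictly increasing")
        slope = (ys[i + 1] - ys[i]) / dx
        if slope < prev_slope - tol:
            return False
        prev_slope = slope
    return True
```

A smoothing pass that preserves convexity must keep this property true after every update, which is the constraint that makes smoothing convex tables harder than ordinary filtering.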
Title | Modeling the Overshooting Effect for CMOS Inverter in Nanometer Technologies |
Author | *Zhangcai Huang, Hong Yu (The Graduate School of Information, Production and Systems, Waseda University, Japan), Atsushi Kurokawa (Sanyo Semiconductor Company, Japan), Yasuaki Inoue (The Graduate School of Information, Production and Systems, Waseda University, Japan) |
Page | pp. 565 - 570 |
Keyword | CMOS inverter, overshooting time, nanometer technologies, timing analysis |
Abstract | With the scaling of CMOS technology, the overshooting
time due to the input-to-output coupling capacitance
has a much more significant effect on inverter delay. Moreover, the
overshooting time is also an important parameter in short-circuit
power estimation. Therefore, in this paper an effective
analytical model is proposed to estimate the overshooting time
for the CMOS inverter in nanometer technologies. Furthermore,
the influence of process variation on the overshooting time
is illustrated based on the proposed model. The proposed model
shows close agreement with SPICE
simulation results. |
PDF file |
Title | Flow-Through-Queue based Power Management for Gigabit Ethernet Controller |
Author | Hwisung Jung (University of Southern California, United States), Andy Hwang (Broadcom Corp., United States), *Massoud Pedram (University of Southern California, United States) |
Page | pp. 571 - 576 |
Keyword | Low-power, Gigabit Ethernet controller, energy-efficient, SMDP |
Abstract | This paper presents a novel architectural mechanism and a power management structure for the design of an energy-efficient Gigabit Ethernet controller. Key characteristics of such a controller are the low latency and high bandwidth required to meet the pressing demands of extremely high frame and control data rates, which in turn make power dissipation difficult to manage. We propose a flow-through-queue (FTQ) based power management method, which allows some of the tasks involved in processing the frame data to be offloaded. This in turn enables the use of multiple clock rates and multiple voltages for different cores inside the Ethernet controller. A modeling approach based on a semi-Markov decision process (SMDP) and queuing models is employed, which allows one to apply mathematical programming formulations for energy optimization under performance constraints. The proposed Gigabit Ethernet controller is designed in a 130nm CMOS technology that offers both high and low threshold voltages. Experimental results show that the proposed power optimization method can achieve system-wide energy savings under tighter performance constraints. |
PDF file |
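The interplay between queuing models and clock/voltage choice can be illustrated with a much simpler model than the paper's SMDP formulation: a single M/M/1 queue whose service rate scales with clock frequency, plus a dynamic-power proxy f·V². The operating points, cycles per frame, and latency bound below are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical (clock frequency in Hz, supply voltage in V) operating points.
OPERATING_POINTS: List[Tuple[float, float]] = [(125e6, 1.0), (250e6, 1.1), (500e6, 1.2)]


def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    # Mean time in system for an M/M/1 queue; unbounded if the queue is unstable.
    if service_rate <= arrival_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


def pick_operating_point(arrival_rate: float, cycles_per_frame: float,
                         max_latency_s: float) -> Optional[Tuple[float, float]]:
    """Lowest-power point whose M/M/1 mean latency meets the bound."""
    # Dynamic power is taken as proportional to f * V^2 (a simplification).
    for freq, vdd in sorted(OPERATING_POINTS, key=lambda p: p[0] * p[1] ** 2):
        service_rate = freq / cycles_per_frame  # frames per second
        if mm1_response_time(arrival_rate, service_rate) <= max_latency_s:
            return freq, vdd
    return None
```

With 100k frames/s and 1000 cycles per frame, a 50 µs mean-latency bound is met at 125 MHz, while a 10 µs bound forces 250 MHz; tightening the performance constraint therefore moves the controller to a more power-hungry point, which is the tradeoff the paper optimizes with a richer model.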
Title | Approximation Algorithm for Process Mapping on Network Processor Architectures |
Author | *Chris Ostler, Karam S. Chatha, Goran Konjevod (Arizona State University, United States) |
Page | pp. 577 - 582 |
Keyword | network processors, throughput maximization, approximation algorithm |
Abstract | The high performance requirements of networking applications have led to the advent of programmable network processor (NP) architectures that incorporate symmetric multi-processing and block multi-threading. This paper presents an automated system-level design technique for process mapping on such architectures with the objective of maximizing the worst-case throughput of the application. Because this mapping must be done in the presence of resource (processor and code size) constraints, the problem is NP-complete. We present a polynomial-time approximation algorithm with a proven guarantee to generate solutions whose throughput is at least 1/2 that of optimal solutions. The proposed algorithm was used to map realistic applications onto the Intel IXP2400 NP architecture, and it produced solutions within 78% of optimal. |
PDF file |
Title | Implementation of a Real Time Programmable Encoder for Low Density Parity Check Code on a Reconfigurable Instruction Cell Architecture (RICA) |
Author | *Zahid Khan, Tughrul Arslan (The University of Edinburgh, Great Britain) |
Page | pp. 583 - 588 |
Keyword | LDPC, FEC, WiMax, Reconfigurable Computing |
Abstract | This paper presents a real-time programmable irregular Low Density Parity Check (LDPC) encoder as specified in the IEEE P802.16E/D7 standard. The encoder is programmable for frame sizes from 576 to 2304 bits and for five different code rates. The H matrix is efficiently generated and stored for a particular frame size and code rate. The encoder is implemented on the Reconfigurable Instruction Cell Architecture, which has recently emerged as an ultra-low-power, high-performance, ANSI-C programmable embedded core. Various general and architecture-specific optimization techniques are applied to enhance the throughput. With this architecture, a throughput of 10 to 19 Mbps is achieved. |
PDF file |
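The relationship between the H matrix and encoding can be seen in a toy systematic code: if H = [A | I], the parity bits p = A·s (mod 2) make every codeword satisfy H·c = 0 (mod 2). The 3x4 matrix below is hypothetical and far smaller than the IEEE 802.16e matrices, whose parity part has a dual-diagonal structure rather than an identity, so this is a sketch of the principle, not of the paper's encoder.

```python
from typing import List

# Hypothetical 3x4 parity part A; H = [A | I] is a toy systematic
# parity-check matrix, not one of the IEEE 802.16e base matrices.
A = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]
H = [row + [1 if j == i else 0 for j in range(len(A))] for i, row in enumerate(A)]


def encode(message: List[int]) -> List[int]:
    # Parity bits p = A * s (mod 2), so H * [s | p] = A*s + p = 0 (mod 2).
    parity = [sum(a * s for a, s in zip(row, message)) % 2 for row in A]
    return message + parity


def syndrome(codeword: List[int]) -> List[int]:
    # All-zero syndrome <=> the codeword satisfies every parity check.
    return [sum(h * c for h, c in zip(row, codeword)) % 2 for row in H]
```

Because H is stored once per frame size and code rate, switching configurations only changes the matrix the encoder walks over, which is what makes a programmable encoder practical.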
Title | VLSI Design of Multi Standard Turbo Decoder for 3G and Beyond |
Author | *Imran Ahmed, Tughrul Arslan (University of Edinburgh, Great Britain) |
Page | pp. 589 - 594 |
Keyword | reconfigurable, domain specific, turbo decoder, viterbi, vlsi |
Abstract | Turbo codes have greater error-correcting capability than any other known code. Due to their excellent performance, turbo codes have been employed in several transmission systems such as CDMA2000, WCDMA (UMTS), ADSL, IEEE 802.16 metropolitan networks, etc. The computational kernel of the decoding algorithm is very similar across these systems, and we have exploited this commonality for a turbo decoder VLSI design suitable for deployment using platform-based system-on-chip methodologies. The turbo and Viterbi components of the unified array are also individually reconfigurable for different standards. This supports the 4G concept that a user can be simultaneously connected to several access technologies (for example, Wi-Fi, 3G, and GSM) and can move seamlessly between them. A new normalization scheme for turbo decoding is presented to suit reconfigurable mappings. We also show a dynamic reconfiguration methodology for a context switch between turbo and Viterbi decoders that does not waste any clock cycles. The reconfigurable turbo decoder fabric is implemented by reusing components of the Viterbi decoder in a 180 nm UMC process technology. |
PDF file |
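Metric normalization is easiest to see in a plain Viterbi decoder. The sketch below assumes the common rate-1/2, constraint-length-3 code with generators (7, 5) octal and hard-decision Hamming branch metrics; after every trellis step the smallest path metric is subtracted so values stay bounded, as a fixed-width hardware datapath requires. It illustrates the general idea only, not the paper's turbo-decoding normalization scheme.

```python
from typing import List

G0, G1 = 0b111, 0b101  # rate-1/2, constraint length 3 generators (7, 5 octal)


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def conv_encode(bits: List[int]) -> List[int]:
    state, out = 0, []
    for u in bits + [0, 0]:          # two flush bits return the encoder to state 0
        reg = (u << 2) | state       # newest bit in the MSB of the 3-bit register
        out += [parity(reg & G0), parity(reg & G1)]
        state = reg >> 1
    return out


def viterbi_decode(symbols: List[int], n_bits: int) -> List[int]:
    INF = float("inf")
    metrics = [0.0, INF, INF, INF]   # decoding starts in state 0
    paths: List[List[int]] = [[], [], [], []]
    for t in range(0, len(symbols), 2):
        r0, r1 = symbols[t], symbols[t + 1]
        new_metrics = [INF] * 4
        new_paths: List[List[int]] = [[], [], [], []]
        for state in range(4):
            if metrics[state] == INF:
                continue
            for u in (0, 1):
                reg = (u << 2) | state
                # Hamming branch metric between expected and received symbols
                dist = (parity(reg & G0) != r0) + (parity(reg & G1) != r1)
                nxt, m = reg >> 1, metrics[state] + dist
                if m < new_metrics[nxt]:             # add-compare-select
                    new_metrics[nxt] = m
                    new_paths[nxt] = paths[state] + [u]
        # Normalization: subtract the smallest metric so values stay bounded.
        low = min(new_metrics)
        metrics = [m - low for m in new_metrics]
        paths = new_paths
    return paths[0][:n_bits]         # the terminated trellis ends in state 0
```

Subtracting a common constant never changes which survivor wins a compare-select, so normalization costs nothing in decoding quality while capping the register width needed for path metrics.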
Title | A High-Throughput Low-Power AES Cipher for Network Applications |
Author | Shin-Yi Lin, *Chih-Tsun Huang (National Tsing Hua University, Taiwan) |
Page | pp. 595 - 600 |
Keyword | AES, Advanced Encryption Standard, Security, Block Cipher, VLSI Design |
Abstract | We propose a full-featured, high-throughput, low-power AES cipher
which is suitable for widespread network applications. Four
modes of operation are implemented: ECB, CBC, CTR and
CCM. Our cipher uses a cost-efficient two-stage
pipeline to support the CCM mode with a single datapath. With the
design-for-test circuitry, the maximum throughput is 4.27 Gbps
using a 0.13 µm CMOS technology with a 333 MHz clock rate.
The hardware cost is 86.2K gates with a power of 40.9 mW. |
PDF file |
Title | Improving XOR-Dominated Circuits by Exploiting Dependencies between Operands |
Author | *Ajay K. Verma, Paolo Ienne (Ecole Polytechnique Federale de Lausanne, Switzerland) |
Page | pp. 601 - 608 |
Keyword | Logic Synthesis, Selective Expansion, XOR-Dominated, Parallel Multiplier |
Abstract | Logic synthesis has made impressive progress in recent times, pervading digital design and almost universally replacing manual techniques. A remarkable exception is computer arithmetic, an example being
multiple additions performed in carry-save form: column compressors
are usually built by exploiting circuit regularity and are hardly optimised further, due to the large number of XOR operations. We show a general technique to optimise XOR-dominated circuits by exploiting the dependencies among the XOR operands, and demonstrate its effectiveness on multiplier-like circuits. We show that it significantly optimises the best parallel multipliers by exploiting complex dependencies between the addends that escape known manual optimisations. |
PDF file |
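For readers unfamiliar with carry-save form, the sketch below shows the conventional baseline that the paper starts from (not its XOR-dependency optimisation): a 3:2 compressor turns three operands into a sum word (XOR per column) and a carry word (majority per column, shifted left), and repeated compression of a multiplier's partial products leaves a single carry-propagate addition at the end. Operands are assumed to be non-negative integers.

```python
from typing import Iterable, Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 compressor on whole words: a + b + c == sum_bits + carry_bits."""
    sum_bits = a ^ b ^ c                                  # XOR per column
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1       # majority, shifted left
    return sum_bits, carry_bits


def carry_save_sum(values: Iterable[int]) -> int:
    vals = list(values)
    # Each 3:2 step replaces three operands by two, with no carry propagation.
    while len(vals) > 2:
        a, b, c = vals.pop(), vals.pop(), vals.pop()
        vals.extend(csa(a, b, c))
    return sum(vals)  # final carry-propagate addition of the last two operands


def multiply(a: int, b: int, width: int) -> int:
    # Partial products of a parallel multiplier, summed in carry-save form.
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    return carry_save_sum(partials)
```

Nearly all of the logic in `csa` is XOR, which is why column compressors are "XOR-dominated"; dependencies among the partial-product bits (all derived from the same `a` and `b`) are what the paper exploits beyond this regular structure.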
Title | Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space |
Author | Jianhua Liu, Yi Zhu, Haikun Zhu (University of California, San Diego, United States), John Lillis (University of Illinois at Chicago, United States), *Chung-Kuan Cheng (University of California, San Diego, United States) |
Page | pp. 609 - 615 |
Keyword | low power, physical synthesis, prefix addition |
Abstract | The parallel prefix adder is the most flexible and widely used
binary adder for ASIC designs. Many high-level synthesis
techniques have been developed to find optimal prefix structures
for specific applications. However, the gap between these
techniques and back-end designs is increasingly large. In this
paper, we propose an integer linear programming method to
build minimal-power prefix adders within given timing and area
constraints. It accounts for both gate and wire capacitances in the
timing and power models, considers static and dynamic power
consumption, and can handle gate sizing and buffer insertion to
improve the performance further. The proposed method also
adapts to non-uniform arrival and required times at each
bit position. Therefore, our method produces the optimum prefix
adder for realistic constraints. |
PDF file |
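To make "prefix structure" concrete, here is one classic point in the design space, the Kogge-Stone adder, modeled bit by bit. It is a standard textbook structure used for illustration, not an adder produced by the paper's ILP method; inputs are assumed to be unsigned integers of the given width with no carry-in.

```python
from typing import Tuple


def kogge_stone_add(a: int, b: int, width: int) -> Tuple[int, int]:
    """Add two unsigned width-bit integers with a Kogge-Stone prefix network."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # bit propagate
    G, P = g[:], p[:]
    d = 1
    # ceil(log2(width)) levels of the prefix operator (G, P) o (G', P')
    while d < width:
        G, P = (
            [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(width)],
            [P[i] & P[i - d] if i >= d else P[i] for i in range(width)],
        )
        d *= 2
    # After the prefix levels, G[i] is the carry out of bit position i.
    s = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        s |= (p[i] ^ carry_in) << i
    return s, G[width - 1]
```

Kogge-Stone minimizes logic depth at the cost of many prefix nodes and long wires; other structures (Sklansky, Brent-Kung, and hybrids) trade depth for fewer nodes or lower fanout, which is the space an area/timing/power optimizer searches.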
Title | An Interconnect-Centric Approach to Cyclic Shifter Design Using Fanout Splitting and Cell Order Optimization |
Author | Haikun Zhu, Yi Zhu, *Chung-Kuan Cheng (University of California, San Diego, United States), David M. Harris (Harvey Mudd College, United States) |
Page | pp. 616 - 621 |
Keyword | cyclic shifter, interconnect, fanout splitting, permutation, integer linear programming |
Abstract | We propose two orthogonal approaches to logarithmic cyclic shifter design. The first method, called fanout splitting, replaces the multiplexers in a conventional design with demultiplexers that have two fanouts driving the shifting and non-shifting paths separately. The use of demultiplexers has a two-fold effect: it cuts the accumulated wire load on the critical path from $O(N\log_2(N))$ to $O(N)$, and it reduces the switching probabilities on the inter-stage long wires from 1/4 to 3/16. We then perform cell order optimization to further improve the delay, and formulate it as an integer linear programming problem. For the 64-bit case, the two approaches together reduce the total delay by 67.1% and the dynamic power consumption by 17.6%. |
PDF file |
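The conventional multiplexer-based logarithmic cyclic shifter that the paper improves on can be modeled behaviorally: log2(N) stages, where stage k rotates by 2^k if bit k of the shift amount is set. This sketch assumes a power-of-two width and left rotation with bit 0 as the least significant bit; it models function only, not the wire loads or fanout splitting discussed in the abstract.

```python
from typing import List


def rotate_left_log(bits: List[int], amount: int) -> List[int]:
    """Logarithmic cyclic shifter: log2(N) stages of 2:1 multiplexers.

    bits[0] is the least significant bit; stage k rotates by 2**k when
    bit k of the shift amount is set, otherwise it passes data straight
    through. Shift-amount bits beyond log2(N) are ignored (amount mod N).
    """
    n = len(bits)
    if n == 0 or n & (n - 1):
        raise ValueError("width must be a power of two")
    out = list(bits)
    k = 0
    while (1 << k) < n:
        if (amount >> k) & 1:
            step = 1 << k
            # Each output bit selects between its own input and the one 2**k away.
            out = [out[(i - step) % n] for i in range(n)]
        k += 1
    return out
```

In hardware every input bit of a stage feeds two multiplexers (its own position and the position 2^k away); the long inter-stage wires created by those 2^k jumps are the load that fanout splitting and cell ordering attack.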
Title | Optimization of Robust Asynchronous Circuits by Local Input Completeness Relaxation |
Author | *Cheoljoo Jeong, Steven M. Nowick (Columbia University, United States) |
Page | pp. 622 - 627 |
Keyword | asynchronous circuits, input completeness, dual-rail encoding, relaxation |
Abstract | As process, temperature and voltage variations become significant
in deep submicron design, timing closure becomes a critical
challenge using synchronous CAD flows. One attractive alternative
is to use robust asynchronous circuits which gracefully accommodate
timing discrepancies. However, there is currently little CAD
support for such robust methodologies. In this paper, optimization
algorithms for a class of highly-robust asynchronous circuits are
presented. Though the considered circuit style is robust to timing
variation, it suffers from high area overhead inherent in the style.
The proposed algorithm optimizes area and delay of these circuits
by relaxing their overly restrictive style. The algorithm was
implemented and evaluated on MCNC circuits, achieving significant
improvement while still preserving the same robustness property
of the circuit. On average, 49.2% of the gates of the circuits
could be implemented in a relaxed manner and, as a result, a 34.9%
area improvement was achieved, and a 16.1% delay improvement was
achieved using a simple heuristic for targeting the critical path in
the circuit. This is the first proposed approach that systematically
optimizes circuits based on the notion of local relaxation while still preserving the circuit's overall timing robustness. |
PDF file |
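Input completeness can be illustrated with a behavioral model of a dual-rail AND gate, assuming the usual encoding (1, 0) = true, (0, 1) = false, (0, 0) = NULL spacer. The input-complete gate produces valid data only after every input is valid; the relaxed gate outputs false as soon as one input is false, so it no longer signals that the other input has arrived. The gate models and names are hypothetical, and the return-to-NULL phase and gate hysteresis are not modeled.

```python
from typing import Tuple

DualRail = Tuple[int, int]      # (true_rail, false_rail)
NULL: DualRail = (0, 0)         # spacer: no data has arrived yet
ZERO: DualRail = (0, 1)
ONE: DualRail = (1, 0)


def and_input_complete(a: DualRail, b: DualRail) -> DualRail:
    # Waits for every input: output stays NULL until both operands are valid.
    if a == NULL or b == NULL:
        return NULL
    return ONE if (a == ONE and b == ONE) else ZERO


def and_relaxed(a: DualRail, b: DualRail) -> DualRail:
    # Eager evaluation: a single ZERO input already determines the output,
    # so this gate no longer acknowledges the arrival of the other input.
    if a == ZERO or b == ZERO:
        return ZERO
    if a == NULL or b == NULL:
        return NULL
    return ONE
```

Both gates compute the same function once all inputs are valid; the relaxed one is smaller and faster but loses the completion guarantee locally, which is why the paper only relaxes gates where some other gate still observes every signal.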
Title | Safe Delay Optimization for Physical Synthesis |
Author | *Kai-hui Chang, Igor L. Markov, Valeria Bertacco (University of Michigan at Ann Arbor, United States) |
Page | pp. 628 - 633 |
Keyword | physical synthesis, delay optimization, safe |
Abstract | Physical synthesis is a relatively young field in Electronic Design Automation. Many published optimizations for physical synthesis end up hurting the final result, often by neglecting important physical aspects of the layout, such as long wires or routing congestion. In this work we propose SafeResynth, a safe resynthesis technique, which provides immediately-measurable delay improvement without altering the design's functionality. It can enhance circuit timing without detrimental effects on route length and congestion. When applied to IWLS'05 benchmarks, SafeResynth improves circuit delay by 11% on average after routing, while increasing route length and via count by less than 0.2%. Our resynthesis can also be used in an unsafe mode, akin to more traditional physical synthesis algorithms popular in commercial tools. Applied together, our safe and unsafe transformations achieve 24% average delay improvement for seven large benchmarks from the OpenCores suite. The relative contribution of safe and unsafe techniques varies depending on the amount of whitespace in the layout. |
PDF file |
Title | (Invited Paper) Development of Low-power and Real-time VC-1/H.264/MPEG-4 Video Processing Hardware |
Author | *Masaru Hase, Kazushi Akie, Masaki Nobori, Keisuke Matsumoto (Renesas Technology, Japan) |
Page | pp. 637 - 643 |
Keyword | Codec, VC-1, H.264, MPEG-4, IP |
Abstract | This paper covers a multi-functional hardware intellectual property (IP) for the encoding and decoding of digital moving pictures with low power consumption. The IP is mainly intended for mobile products such as cellular phones, digital still cameras (DSCs), and digital video cameras (DVCs). It includes VC-1 functionality for Internet content plus AVC (H.264) functionality for digital television broadcasting and MPEG-4 functionality for TV telephony, and is capable of processing D1-sized moving pictures (720 pixels by 480 lines) in real time at an operating frequency of 54 MHz. In addition, original algorithms employed in the IP reduce power consumption by up to 22%. |
PDF file |
Title | (Invited Paper) Development of Low Power ISDB-T One-Segment Decoder by Mobile Multi-Media Engine SoC (S1G) |
Author | *Koichi Mori, Masakazu Suzuki, Yasuo Ohara, Satoru Matsuo, Atsushi Asano (Toshiba, Japan) |
Page | pp. 644 - 648 |
Keyword | ISDB-T, Low Power, Multi-Media, Processor, Mobile Communication |
Abstract | TOSHIBA has developed a mobile multimedia engine SoC, called S1G, which realizes low-power ISDB-T one-segment decoding at 42 mW and was developed in a short period of eight months. Since MPEG-2 TS demultiplexing, AAC decoding and H.264 decoding must be processed simultaneously in ISDB-T one-segment decoding, two TOSHIBA MeP (Media embedded Processor) processors, one DSP and hardware blocks are used effectively with pipelined operation in this LSI. Although it is generally considered that dedicated hardware accelerators should be used to realize low-power operation for ISDB-T one-segment decoding, TOSHIBA succeeded in developing a low-power ISDB-T one-segment decoder that makes maximum use of software resources. |
PDF file |
Title | (Invited Paper) Low Power Techniques for Mobile Application SoCs based on Integrated Platform "UniPhier" |
Author | *Masaitsu Nakajima, Takao Yamamoto, Masayuki Yamasaki, Tetsu Hosoki, Masaya Sumita (Matsushita Electric Industrial, Japan) |
Page | pp. 649 - 653 |
Keyword | Low Power |
Abstract | In this presentation, low-power techniques for mobile application SoCs based on the integrated platform "UniPhier" are introduced. Hierarchical power reduction approaches are prepared for each level of the SoC: the SoC architecture level, the UniPhier processor level, the IPP processor level, and the circuit level. When developing a UniPhier-based SoC for a mobile application, we can pick a suitable combination of low-power techniques to meet the target and make a trade-off between power and cost. |
PDF file |
Title | Simultaneous Control of Subthreshold and Gate Leakage Current in Nanometer-Scale CMOS Circuits |
Author | Youngsoo Shin, Sewan Heo, *Hyung-Ock Kim (KAIST, Republic of Korea), Jung Yun Choi (Samsung Electronics, Republic of Korea) |
Page | pp. 654 - 659 |
Keyword | low power, leakage, power gating, design methodology, gate leakage |
Abstract | Power gating has been widely used to reduce subthreshold leakage.
However, its efficiency degrades rapidly with technology scaling
due to the gate leakage of circuits specific to power gating,
such as storage elements and output interface circuits with a data-retention capability. A new scheme called supply switching with ground collapse is proposed to control both gate and subthreshold leakage in nanometer-scale CMOS circuits. Compared to power gating, the leakage is cut by a factor of 6.3 in 65nm and 8.6 in 45nm technology. Various issues in implementing the proposed scheme using standard-cell elements are addressed, from RTL to layout. The proposed design flow is demonstrated on a commercial design in 90nm
technology, where a 32-fold leakage saving is observed with 3% and 6% increases in area and wirelength, respectively. |
PDF file |
Title | Logic and Layout Aware Voltage Island Generation for Low Power Design |
Author | *Liangpeng Guo, Yici Cai, Qiang Zhou, Xianlong Hong (Tsinghua Univ., China) |
Page | pp. 666 - 671 |
Keyword | Voltage Island, Low Power, Placement, Level Converter |
Abstract | Multiple supply voltage (MSV) is one of the most effective schemes to achieve low power, but most existing works operate at the logic level. A few recent works operate at the physical level, but none of them considers level converters, which have an important effect in dual-Vdd design. In this work we propose a logic- and layout-aware approach for voltage assignment and voltage island generation in the placement process that minimizes the number of level converters and implements voltage islands with minimal overhead. Experimental results show that our approach uses far fewer level converters than the approach in [1] (59.50% fewer on average) while achieving the same power savings. The approach is able to produce feasible placements with a small impact
on traditional placement goals. |
PDF file |
Title | A Fast Probability-Based Algorithm for Leakage Current Reduction Considering Controller Cost |
Author | Tsung-Yi Wu, Jr-Luen Tzeng, *Kuang-Yao Chen (National Changhua University of Education, Taiwan) |
Page | pp. 672 - 677 |
Keyword | Minimum Leakage Vector, Input Vector Control, Leakage Current Reduction, Sleep Mode, Low Power Design |
Abstract | In this paper, we propose a probability-based algorithm that can rapidly find a minimum leakage vector (MLV). Unlike most traditional techniques, which ignore the leakage current overhead of the newly added MLV controller, our technique takes it into account. Ignoring this overhead during solution exploration has a side effect: a non-optimal solution may be misrecognized as optimal. Experimental results show that our algorithm can reduce the leakage current by up to 48% and can find the optimal solutions on 85% of MCNC benchmarks. |
PDF file |
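Why controller overhead matters can be shown on a toy example. The sketch uses exhaustive search (feasible only for tiny circuits, and not the paper's probability-based algorithm) over a hypothetical three-input NAND2 netlist, with made-up per-state NAND2 leakage values and made-up controller leakage for forcing an input to 0 or 1.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical NAND2 leakage (arbitrary units) indexed by the input state.
NAND2_LEAK: Dict[Tuple[int, int], float] = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 5.0}
# Hypothetical controller leakage for forcing one primary input to 0 or to 1.
CTRL_LEAK: Dict[int, float] = {0: 3.0, 1: 0.5}
# Toy netlist: (output, input_a, input_b) NAND2 gates in topological order.
GATES: List[Tuple[str, str, str]] = [("g1", "a", "b"), ("g2", "b", "c"), ("out", "g1", "g2")]
INPUTS = ["a", "b", "c"]


def circuit_leakage(vector: Tuple[int, ...]) -> float:
    values = dict(zip(INPUTS, vector))
    total = 0.0
    for out, x, y in GATES:
        total += NAND2_LEAK[(values[x], values[y])]
        values[out] = 1 - (values[x] & values[y])   # NAND
    return total


def best_vector(include_controller: bool) -> Tuple[int, ...]:
    def cost(v: Tuple[int, ...]) -> float:
        extra = sum(CTRL_LEAK[bit] for bit in v) if include_controller else 0.0
        return circuit_leakage(v) + extra
    # Exhaustive search over all input vectors (tiny circuits only).
    return min(product((0, 1), repeat=len(INPUTS)), key=cost)
```

Ignoring the controller, (0, 0, 0) looks best (circuit leakage 7.0), but once forcing each input is charged, its total is 16.0 while (1, 1, 1) totals 12.5; this is the kind of misrecognized optimum the abstract warns about.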
Title | Approaching Speed-of-light Distortionless Communication for On-chip Interconnect |
Author | Haikun Zhu, Rui Shi (University of California, San Diego, United States), Hongyu Chen (Synopsys Inc., United States), *Chung-Kuan Cheng (University of California, San Diego, United States) |
Page | pp. 684 - 689 |
Keyword | global interconnect, transmission line, distortionless, speed-of-light, serial link |
Abstract | We extend the Surfliner on-chip distortionless transmission line scheme and provide more details on implementation issues. Surfliner seeks to approach distortionless transmission by intentionally adding shunt resistors between the signal line and the ground. In theory, if the shunt conductance is distributed along the line so that G = RC/L, there will be no distortion at the receiver end and the signal propagates at the speed of light. We show the feasibility and advantages of this shunt resistor scheme through a real design case of a single-ended microstrip line in 0.10$\mu$m technology. The simulation results indicate that we can achieve near-perfect signaling of 10 Gbps data over a 10 mm serial link without pre-emphasis, equalization, or other special techniques. Guidelines for determining the optimal value and spacing of the shunt resistors are also provided. |
PDF file |
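The condition G = RC/L is the classical Heaviside distortionless condition. The standard telegrapher's-equation derivation (textbook material, not taken from the paper), with R, L, G, C the per-unit-length series resistance, series inductance, shunt conductance and shunt capacitance, runs as follows:

```latex
\gamma = \sqrt{(R + j\omega L)(G + j\omega C)}
       = \sqrt{LC}\,\sqrt{\Big(\frac{R}{L} + j\omega\Big)\Big(\frac{G}{C} + j\omega\Big)},
\qquad
\frac{G}{C} = \frac{R}{L} \iff G = \frac{RC}{L}
\;\Longrightarrow\;
\gamma = \sqrt{LC}\Big(\frac{R}{L} + j\omega\Big)
       = \underbrace{R\sqrt{\frac{C}{L}}}_{\alpha}
       + j\,\underbrace{\omega\sqrt{LC}}_{\beta},
\qquad
Z_0 = \sqrt{\frac{R + j\omega L}{G + j\omega C}} = \sqrt{\frac{L}{C}}
```

Under this condition the attenuation α no longer depends on frequency and the phase velocity ω/β = 1/√(LC) is the same for every frequency component, so the received pulse is scaled by e^(-αℓ) over a line of length ℓ but not reshaped; the characteristic impedance is also purely real, which simplifies termination.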
Title | Transition Skew Coding: A Power and Area Efficient Encoding Technique for Global On-Chip Interconnects |
Author | *Charbel Akl, Magdy Bayoumi (University of Louisiana at Lafayette, United States) |
Page | pp. 696 - 701 |
Keyword | encoding, repeaters |
Abstract | Global signaling is becoming more and more challenging as technology scales down toward the deep submicron. We propose a new bus encoding technique, transition skew coding, that targets many global interconnect challenges, such as crosstalk, peak energy and current, switching and leakage power, repeater area, wiring area, signal integrity and noise. Simulations are performed on different bus lengths using a 90 nm library. Repeater sizing and spacing are optimized, and the proposed encoded bus is compared against a standard bus and a bus with shields inserted between every two wires. The encoding and decoding latencies are also analyzed. Simulations show that transition skew coding is efficient in terms of energy and area with low encoding and decoding latency overhead. |
PDF file |
Title | Predicting the Performance and Reliability of Carbon Nanotube Bundles for On-Chip Interconnect |
Author | *Arthur Nieuwoudt, Mosin Mondal, Yehia Massoud (Rice University, United States) |
Page | pp. 708 - 713 |
Keyword | carbon nanotube, modeling, alternative interconnect technologies |
Abstract | Single-walled carbon nanotube (SWCNT) bundles have the potential to provide an attractive solution for the resistivity and electromigration problems faced by traditional copper interconnect. In this paper, we evaluate the performance and reliability of nanotube bundles for future VLSI applications. We develop a scalable equivalent circuit model that captures the statistical distribution of metallic nanotubes while accurately incorporating recent experimental and theoretical results on inductance, contact resistance, and ohmic resistance. Leveraging the circuit model, we examine the performance and reliability of nanotube bundles including inductive effects. The results indicate that SWCNT interconnect bundles can provide significant improvement in delay over copper interconnect depending on the bundle geometry and process technology. |
PDF file |
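As a rough illustration of the resistive part of a bundle model like this, the Python sketch below rests on textbook assumptions that are ours rather than the paper's: each metallic SWCNT has a quantum resistance of h/(4e^2), about 6.45 kOhm, plus a scattering term proportional to length over mean free path (1 um assumed), each tube is metallic with probability 1/3, and the metallic tubes conduct in parallel. Inductance, capacitance, and the paper's contact-resistance data are not modeled; `swcnt_resistance` and `bundle_resistance` are hypothetical names.

```python
import random
from typing import Optional

# h/(4e^2) in ohms: one metallic SWCNT with two spin-degenerate conduction channels (assumed).
H_OVER_4E2 = 6453.2


def swcnt_resistance(length_um: float, mfp_um: float = 1.0, r_contact: float = 0.0) -> float:
    """Resistance of one metallic SWCNT: contact + quantum + length-dependent scattering term."""
    return r_contact + H_OVER_4E2 * (1.0 + length_um / mfp_um)


def bundle_resistance(n_tubes: int, length_um: float, p_metallic: float = 1 / 3,
                      rng: Optional[random.Random] = None,
                      mfp_um: float = 1.0, r_contact: float = 0.0) -> float:
    """Draw how many tubes are metallic (one Bernoulli trial per tube), then put them in parallel."""
    rng = rng or random.Random()
    n_metallic = sum(rng.random() < p_metallic for _ in range(n_tubes))
    if n_metallic == 0:
        return float("inf")  # no conducting tube in this sample
    return swcnt_resistance(length_um, mfp_um, r_contact) / n_metallic
```

Calling `bundle_resistance` repeatedly with different random seeds gives a crude picture of how the metallic fraction spreads the bundle resistance.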
Title | Shelf Packing to the Design and Optimization of A Power-Aware Multi-Frequency Wrapper Architecture for Modular IP Cores |
Author | *Danella Zhao, Unni Chandran (University of Louisiana at Lafayette, United States), Hideo Fujiwara (Nara Institute of Science and Technology, Japan) |
Page | pp. 714 - 719 |
Keyword | Modular SoC Test, Multi-frequency wrapper design, power aware architecture, resource constrained test scheduling |
Abstract | This paper proposes a novel power-aware multi-frequency wrapper architecture to achieve at-speed testability. The trade-offs among power dissipation, scan time, and bandwidth are handled by gating off certain virtual cores at a time while testing the remaining ones in parallel. A shelf-packing-based optimization algorithm is proposed to design and optimize the wrapper architecture while minimizing the test time under power and bandwidth constraints. |
PDF file |
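The shelf-packing idea can be illustrated with the classic next-fit decreasing-height (NFDH) heuristic. In this hypothetical Python sketch, each core test is a rectangle whose width is the TAM bandwidth it occupies and whose height is its test time; tests on the same shelf run concurrently, and a new shelf is opened when either the TAM width or an assumed per-shelf power cap would be exceeded. It is a generic heuristic, not the authors' wrapper optimization, and `CoreTest`, `shelf_pack`, and `total_test_time` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoreTest:
    name: str
    width: int    # TAM wires (bandwidth) the test occupies
    time: float   # test time at that width
    power: float  # test power while the test runs


def shelf_pack(tests: List[CoreTest], tam_width: int, p_max: float) -> List[List[CoreTest]]:
    """NFDH: sort by decreasing test time and fill shelves left to right.
    A new shelf opens when the bandwidth or the power cap would be exceeded."""
    shelves: List[List[CoreTest]] = []
    used_w, used_p = 0, 0.0
    for t in sorted(tests, key=lambda c: c.time, reverse=True):
        if t.width > tam_width or t.power > p_max:
            raise ValueError(f"{t.name} cannot fit on any shelf")
        if not shelves or used_w + t.width > tam_width or used_p + t.power > p_max:
            shelves.append([])
            used_w, used_p = 0, 0.0
        shelves[-1].append(t)
        used_w += t.width
        used_p += t.power
    return shelves


def total_test_time(shelves: List[List[CoreTest]]) -> float:
    # Shelves run one after another; each lasts as long as its longest test.
    return sum(max(t.time for t in shelf) for shelf in shelves)
```

Sorting by decreasing test time means the first test placed on each shelf sets that shelf's duration, so shorter tests fill the remaining bandwidth without lengthening the shelf.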
Title | Core-Based Testing of Multiprocessor System-on-Chips Utilizing Hierarchical Functional Buses |
Author | *Fawnizu Azmadi Hussin, Tomokazu Yoneda (Nara Institute of Science and Technology, Japan), Alex Orailoglu (University of California, San Diego, United States), Hideo Fujiwara (Nara Institute of Science and Technology, Japan) |
Page | pp. 720 - 725 |
Keyword | System-on-Chip, Power-constrained, Multiprocessor, Packet-based, Test Scheduling |
Abstract | An integrated test scheduling methodology for multiprocessor System-on-Chips (SOC) utilizing the functional buses for test data delivery is described. The proposed methodology handles both flat bus single processor SOC and hierarchical bus multiprocessor SOC. It is based on a resource graph manipulation and a packet-based packet set scheduling methodology. The resource graph is decomposed into a set of test configuration graphs, which are then used to determine the optimum test configurations and test delivery schedule under a given power constraint. In order to validate the effectiveness of the proposed methodology, a number of experiments are run on several modified benchmark circuits. The results clearly underscore the advantages of the proposed methodology. |
PDF file |
Title | An Architecture for Combined Test Data Compression and Abort-on-Fail Test |
Author | *Erik Larsson, Jon Persson (Linköpings Universitet, Sweden) |
Page | pp. 726 - 731 |
Keyword | abort-on-fail, compression, ATE |
Abstract | Low throughput in IC (Integrated Circuit) testing is mainly due to increasing test data volume, which leads to high ATE (Automatic Test Equipment) memory requirements and long test application times. In contrast to previous approaches that address either test data compression or abort-on-fail testing, we propose an architecture that combines test data compression with abort-on-fail testing. The architecture improves throughput through multi-site testing, since the ATE memory requirement is constant and independent of the degree of multi-site testing. For flexibility in modifying the test data at any time, we use a test program for decompression; only test-independent evaluation logic is added to the IC. Compared with MISR (Multiple-Input Signature Register) based schemes, our scheme has three major advantages: (1) it allows abort-on-fail testing at clock-cycle granularity, (2) it does not impact diagnostic capabilities, and (3) it needs no special handling of unknowns (X). |
PDF file |
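The difference from MISR-based compaction can be sketched as a per-cycle comparison: instead of checking one signature at the end of the test, each response word is compared as it arrives, so testing can stop at the first failing cycle, and unknowns are simply masked. The Python function below is a hypothetical software model of such evaluation, with `X` marking don't-care bits in the expected response; it is not the paper's evaluation logic.

```python
from typing import Optional, Sequence


def first_failing_cycle(expected: Sequence[str], observed: Sequence[str]) -> Optional[int]:
    """Compare one response word per clock cycle; 'X' in an expected word is a
    don't-care bit. Return the first failing cycle (the abort point), or None."""
    for cycle, (exp, obs) in enumerate(zip(expected, observed)):
        if len(exp) != len(obs):
            raise ValueError(f"width mismatch at cycle {cycle}")
        if any(e != "X" and e != o for e, o in zip(exp, obs)):
            return cycle  # abort-on-fail: no need to apply the remaining cycles
    return None
```

Because each bit is checked against the expected value directly, a failing cycle also points at the failing bit positions, which a compacted signature would hide.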
Title | RunBasedReordering: A Novel Approach for Test Data Compression and Scan Power |
Author | *Hao Fang, Chenguang Tong, Xu Cheng (Micro Processor Research and Development Center of Peking University, China) |
Page | pp. 732 - 737 |
Keyword | test data compression, scan power, scan frame, reorder, run |
Abstract | As large test data volume becomes one of the major problems in testing System-on-a-Chip (SoC) designs, several compression coding schemes have been proposed. Extended frequency-directed run-length (EFDR) coding is one of the best of these schemes. In this paper, we present a novel algorithm named RunBasedReordering (RBR), which is based on EFDR codes. The algorithm applies three techniques: scan chain reordering, scan polarity adjustment, and test pattern reordering. Experimental results show that the test data compression ratio is significantly improved and scan power consumption is dramatically reduced. Moreover, our algorithm can be easily integrated into the existing industrial flow with little area penalty. |
PDF file |
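For reference, here is a minimal Python sketch of plain frequency-directed run-length (FDR) coding, the code that EFDR builds on. The test data is split into runs of 0s, each terminated by a 1, and a run of length L in group k (2^k - 2 <= L <= 2^(k+1) - 3) is encoded as a prefix of k-1 ones and a zero, followed by a k-bit tail giving L's offset within the group. EFDR additionally encodes runs of 1s, and RBR's reordering steps are not shown; `fdr_codeword` and `fdr_encode` are hypothetical names.

```python
def fdr_codeword(run_length: int) -> str:
    """FDR codeword for a run of `run_length` 0s terminated by a 1."""
    if run_length < 0:
        raise ValueError("run length must be non-negative")
    k = 1
    while run_length > 2 ** (k + 1) - 3:  # find the group k that contains this run length
        k += 1
    prefix = "1" * (k - 1) + "0"
    tail = format(run_length - (2 ** k - 2), f"0{k}b")  # offset within group k, k bits wide
    return prefix + tail


def fdr_encode(bits: str) -> str:
    """Encode a fully specified bit string as a sequence of FDR codewords."""
    out, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:  # a 1 terminates the current run of 0s
            out.append(fdr_codeword(run))
            run = 0
    if run:
        raise ValueError("trailing 0s must be terminated by a 1 (pad the data)")
    return "".join(out)
```

A run of length L costs 2k bits, roughly 2*log2(L), so long runs compress well; this is why techniques that lengthen runs, such as the reordering steps named above, can plausibly raise the compression ratio.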
Title | Systematic Scan Reconfiguration |
Author | *Ahmad Al-Yamani (KFUPM, Saudi Arabia), Narendra Devta-Prasanna (University of Iowa, United States), Arun Gunda (LSI Logic, United States) |
Page | pp. 738 - 743 |
Keyword | DFT, Scan Test, Test Compression, Test Cost Reduction |
Abstract | We present a new test data compression technique that achieves 10x to 40x compression ratios without requiring any information from the ATPG tool about the unspecified bits. The technique is applied to both single stuck-at and transition fault test sets. It allows aggressive parallelization of scan chains, leading to a similar reduction in test time, and it reduces tester pin requirements by similar ratios. The technique is implemented with a hardware overhead of a few gates per scan chain. |
PDF file |
Title | (Invited Paper) Configurable Multi-Processor Platforms for Next Generation Embedded Systems |
Author | *David Goodwin, Chris Rowen, Grant Martin (Tensilica, United States) |
Page | pp. 744 - 746 |
Keyword | processor, configurable, mpsoc, embedded |
Abstract | Next-generation embedded systems in application domains such as multimedia, wired and wireless communications, and multipurpose portable devices are increasingly turning to multiprocessor platforms as a vehicle for their realization. However, fully fixed platforms built from fixed components lack the flexibility to be optimized for the application, and so cannot offer the best solution in any of these areas. Configurability at multiple levels offers a much better chance to optimize the resulting multiprocessor platform. Existing and emerging technologies for configurable and extensible processors, and for creating configurable multiprocessor subsystem platforms, give design teams significant capability to both differentiate and optimize their products. |
PDF file |
Title | (Panel Discussion) Multi-Processor Platforms for Next Generation Embedded Systems |
Author | Organizer: Nikil Dutt, Moderator: Nikil Dutt (Univ. of California, Irvine, United States), Panelists: David Goodwin (Tensilica, United States), Kazuyuki Hirata (ARM, Japan), Peter Hofstee (IBM, United States), Rudy Lauwereins (IMEC, Belgium), Maurizio Paganini (STMicroelectronics, France) |
Keyword | |
Abstract | |
Title | Fast Decoupling Capacitor Budgeting for Power/Ground Network Using Random Walk Approach |
Author | *Le Kang, Yici Cai, Yi Zou, Jin Shi, Xianlong Hong (Tsinghua University, China), Sheldon X.-D. Tan (University of California, Riverside, United States) |
Page | pp. 751 - 756 |
Keyword | Power/Ground, Optimization, Random Walk, Leakage |
Abstract | This paper proposes a fast and practical decoupling capacitor (decap) budgeting algorithm to optimize the design of the power/ground (P/G) network. The new method adopts a modified random walk process to partition the circuit. Then, by exploiting the isolation property of decaps, it avoids solving the large nonlinear programming problem of the traditional decap optimization process. The method also integrates a leakage current optimization algorithm based on a refined leakage model. Experimental results demonstrate that the proposed method achieves approximately a 10X speedup over the sensitivity-based heuristic method, with only about 6% deviation in decap area from the optimal budget obtained by the programming method. |
PDF file |
Title | Timing-Aware Decoupling Capacitance Allocation in Power Distribution Networks |
Author | *Sanjay Pant, David Blaauw (University of Michigan, United States) |
Page | pp. 757 - 762 |
Keyword | decap, Ldidt, power grid, timing |
Abstract | Power supply noise increases circuit delay, which may lead to performance failure of the design. Adding decoupling capacitance (decap) is effective in reducing power supply noise, making the supply network more robust in the presence of large switching currents. Traditionally, decaps have been allocated to minimize the worst-case voltage drop in the power grid. In this paper, we propose a timing-aware decap allocation approach that uses global timing slacks to drive the decap optimization. Non-critical gates with larger timing slacks can tolerate a relatively higher supply voltage drop than gates on the critical paths. Decap allocation is formulated as a nonlinear optimization problem using Lagrangian relaxation, and a modified adjoint method is used to efficiently obtain the sensitivities of the objective function to decap sizes. A fast path-based heuristic is also implemented and compared with the global optimization formulation. The approaches have been implemented and tested on ISCAS85 benchmark circuits and grids of different sizes. Compared to uniformly allocated decaps, the proposed approach uses 35.5% less total decap to meet the same delay target. For the same total decap budget, the proposed approach improves circuit delay by 10.1% on average. |
PDF file |
Title | A Current-based Method for Short Circuit Power Calculation under Noisy Input Waveforms |
Author | Hanif Fatemi, Shahin Nazarian, *Massoud Pedram (University of Southern California, United States) |
Page | pp. 774 - 779 |
Keyword | Current-based Method, Short Circuit Power, Noisy input, Crosstalk |
Abstract | This paper presents an accurate model for calculating the short circuit energy dissipation of logic cells. The short circuit current is highly dependent on the input and output voltage values; therefore, the actual shapes of the voltage waveforms at the input and output of the cell must be considered to calculate the short circuit energy precisely. Previous approaches, such as approximating crosstalk-induced noisy waveforms with saturated ramps, can produce short circuit energy estimation errors of up to orders of magnitude for a minimum-sized inverter. To resolve this shortcoming, a novel current-based logic cell model is used, which constructs the output voltage waveform for a given noisy input waveform. The input and output voltage waveforms are then used to calculate the short circuit current and, hence, the short circuit energy dissipation. A characterization process is run for each logic cell in the standard cell library to model the relevant electrical parameters, e.g., the parasitic capacitances and nonlinear current sources. Additionally, our model can calculate the short circuit energy dissipation caused by glitches in VLSI circuits, which in some cases can be a key contributor to the total circuit energy dissipation. Experimental results show an average error of about 1% and a maximum error of about 3% compared to SPICE for different types of logic cells under noisy input waveforms, including glitches, with a runtime speedup of up to 16000X. |
PDF file |
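Once the current-based model has produced the short circuit current waveform, the final energy step is a numerical integral, E_sc = Vdd * integral of i_sc(t) dt. The Python sketch below applies the trapezoidal rule to sampled values; it assumes the samples already isolate the supply-to-ground short circuit current, which is the hard part the paper's cell model addresses and is not reproduced here. `short_circuit_energy` is a hypothetical name.

```python
from typing import Sequence


def short_circuit_energy(t: Sequence[float], i_sc: Sequence[float], vdd: float) -> float:
    """E_sc = Vdd * integral of i_sc(t) dt, using the trapezoidal rule on samples.
    t must be increasing; i_sc is the short circuit current at each sample time."""
    if len(t) != len(i_sc) or len(t) < 2:
        raise ValueError("need at least two matching samples")
    charge = sum(0.5 * (i_sc[k] + i_sc[k + 1]) * (t[k + 1] - t[k]) for k in range(len(t) - 1))
    return vdd * charge
```

With a glitchy input, the current waveform simply has several pulses; the same integral accumulates all of them, which is how glitch-induced short circuit energy enters the total.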
Title | Thermal-Aware 3D IC Placement Via Transformation |
Author | Jason Cong, *Guojie Luo, Jie Wei, Yan Zhang (Department of Computer Science, University of California, Los Angeles, United States) |
Page | pp. 780 - 785 |
Keyword | 3D-IC, placement, thermal-aware |
Abstract | 3D IC technologies can help improve circuit performance and lower power consumption by reducing wirelength, and can enable heterogeneous system-on-chip designs. In this paper, we propose a novel thermal-aware 3D cell placement approach, named T3Place, that transforms a 2D placement with good wirelength into a 3D placement, optimizing wirelength, through-the-silicon (TS) via count, and temperature. Moreover, we propose a novel relaxed conflict-net (RCN) graph-based layer assignment method to further refine the 3D placements. |
PDF file |
Title | Noise-Direct: A Technique for Power Supply Noise Aware Floorplanning Using Microarchitecture Profiling |
Author | Fayez Mohamood, Michael Healy, Sung Kyu Lim, *Hsien-Hsin S. Lee (Georgia Tech, United States) |
Page | pp. 786 - 791 |
Keyword | Inductive Noise, Floorplanning, Microarchitecture, Power integrity |
Abstract | This paper proposes Noise-Direct, a design methodology for
power integrity aware floorplanning, using microarchitectural
feedback to guide module placement to tackle high-frequency inductive
noise. With the increasing use of clock gating to save power,
the large inductive noise that it induces has worsened reliability.
In this work, we propose an average-case design method by
considering the dynamic microarchitectural switching behavior to
guarantee power integrity and alleviate the requirement of on-die
decoupling capacitances. |
PDF file |
Title | On Increasing Signal Integrity with Minimal Decap Insertion in Area-Array SoC Floorplan Design |
Author | *Chao-Hung Lu (National Central University, Taiwan), Hung-Ming Chen (National Chiao Tung University, Taiwan), Chien-Nan Jimmy Liu (National Central University, Taiwan) |
Page | pp. 792 - 797 |
Keyword | Signal Integrity, Floorplan Design, Decap Insertion |
Abstract | With technology scaling further into the deep
submicron era, more components can be placed onto one chip
(System-on-Chip, SoC). However, the same scaling brings design
difficulties, among which signal integrity is one of the most
important. Although flip-chip and area-array
architectures have been proposed to improve integrity, careful
planning is still needed in SoC designs.
The power supply noise problem is getting worse
due to severe IR-drop and simultaneous switching noise, and
decoupling capacitance (decap) insertion is commonly applied
to alleviate the noise. Existing approaches to this issue
suffer from either over-design or late decap insertion
in the design flow. In this paper, we propose a methodology to
insert decaps more efficiently and effectively during
supply-noise-driven floorplanning of area-array designs.
Experimental results show that, compared with the approaches
in [Koh] and [Yan], our method inserts enough decap to meet
the supply noise constraint, while the other approaches use more area. |
PDF file |
Title | Voltage Island Generation under Performance Requirement for SoC Designs |
Author | *Wai-Kei Mak, Jr-Wei Chen (National Tsing Hua University, Taiwan) |
Page | pp. 798 - 803 |
Keyword | SoC, low power, floorplanning, voltage island |
Abstract | Using multiple supply voltages on a SoC design is an efficient
way to achieve low power. However, it may lead to a complex
power network and a huge number of level shifters if we just
set the cores to operate at their respective lowest voltage levels.
We present two formulations for the voltage level assignment
problem. The first is exact but takes longer to compute a
solution. The second can be solved much faster with virtually
no loss of optimality. In addition, we propose a modification to
the traditional floorplanning framework. Unlike previous works, our approach can optimize the total power consumption, the level
shifter overhead, and the power network complexity without
compromising the wirelength and the chip area. In the experiments,
we obtained 17-53% power savings with voltage island generation. |
PDF file |
Title | Fast Flip-Chip Pin-Out Designation Respin by Pin-Block Design and Floorplanning for Package-Board Codesign |
Author | *Ren-Jie Lee, Ming-Fang Lai, Hung-Ming Chen (National Chiao Tung University, Taiwan) |
Page | pp. 804 - 809 |
Keyword | Pin-Out Designation, Pin-Block Floorplanning, Package-Board Codesign |
Abstract | Deep submicron effects complicate the design of
chips, as well as package design and the communication
between package and board. As a result, iterative
interface design has become a time-consuming process. This paper
proposes a novel and efficient approach to designating the pin-out
of a flip-chip BGA package when designing chipsets. The
proposed approach not only automates the assignment of
more than 200 I/O pins on the package, but also precisely estimates a
package size that accommodates all pins with almost no
void pin positions, as good as that of a manual design.
Furthermore, practical experience and techniques in designing
such interfaces are taken into account, including signal
integrity, power delivery, and routability. This efficient pin-out
designation and package size estimation by pin-block design
and floorplanning provides much faster turnaround time, and thus
a large improvement in meeting design schedules. Results
on two real cases show that our methodology is effective in
achieving almost the same package dimensions as a manual
design that takes weeks, while simultaneously considering
critical issues in package-board codesign. To the best of our
knowledge, this is the first attempt to solve the flip-chip pin-out
placement problem in package-board codesign. |
PDF file |
Title | A Technique to Reduce Peak Current and Average Power Dissipation in Scan Designs by Limited Capture |
Author | *Seongmoon Wang, Wenlong Wei (NEC Labs., America, United States) |
Page | pp. 810 - 816 |
Keyword | low power testing, scan based testing, power dissipation during test application, low switching activity |
Abstract | In this paper, a technique that can efficiently reduce peak and average switching activity during test application is proposed. Peak transitions are reduced by about 40%, and the average number of transitions is reduced by about 56-75%.
This reduction in peak and average switching is achieved without any decrease in fault coverage.
The proposed method does not require any specific clock tree
construction, special scan cells, or scan chain routing.
Test cubes generated by any combinational ATPG can be processed by the proposed method to reduce peak and average switching activity without any capture violation.
Hardware overhead for the proposed method is negligible.
Further, the hardware for the proposed method can be implemented without detailed knowledge of the design. |
PDF file |
Title | Warning: Launch off Shift Tests for Delay Faults May Contribute to Test Escapes |
Author | Zhuo Zhang, *Sudhakar Reddy (University of Iowa, United States), Irith Pomeranz (Purdue University, United States) |
Page | pp. 817 - 822 |
Keyword | transition delay fault, launch off shift, test escape, functionally detectable faults |
Abstract | A concern often expressed in the literature is potential overtesting, or yield loss, caused by the fact that launch off shift operates the circuit under test in a non-functional manner. In this paper, we present data, for the first time, that point to another potential problem with launch off shift tests: test escapes. We also present data showing that if launch off shift tests with multiple fault activation cycles are used, essentially all functionally detectable faults can be detected. |
PDF file |
Title | A Wafer-Level Defect Screening Technique to Reduce Test and Packaging Costs for "Big-D/Small-A" Mixed-Signal SoCs |
Author | Sudarshan Bahukudumbi, Sule Ozev, *Krishnendu Chakrabarty (Duke University, United States), Vikram Iyengar (IBM Corporation, United States) |
Page | pp. 823 - 828 |
Keyword | SoC test, cost model, wafer-level defect screening |
Abstract | Product cost is a key driver in the consumer electronics market, which is characterized by low profit margins and the use of a variety of "big-D/small-A" mixed-signal system-on-chip (SoC) designs. Packaging cost has recently emerged as a major contributor to the product cost for such SoCs. Wafer-level testing can be used to screen defective dies, thereby reducing packaging cost. We propose a new correlation-based signature analysis technique that is especially suitable for mixed-signal test at the wafer level using low-cost digital testers. The proposed method overcomes the limitations of measurement inaccuracies at the wafer level. A generic cost model is developed to evaluate the effectiveness of wafer-level testing of analog and digital cores in a mixed-signal SoC, and to study its impact on test escapes, yield loss and packaging costs. Experimental results are presented for a typical mixed-signal "big-D/small-A" SoC, which contains a large section of flattened digital logic and several large mixed-signal cores. |
PDF file |
Title | Fault Dictionary Size Reduction for Million-Gate Large Circuits |
Author | *Yu-Ru Hong, Juinn-Dar Huang (National Chiao Tung University, Taiwan) |
Page | pp. 829 - 834 |
Keyword | fault diagnosis, fault dictionary, fault dictionary size reduction, pass-fail fault dictionary |
Abstract | In general, fault dictionaries are impractical because of their extremely large size. Several previous works have proposed methods for fault dictionary size reduction. However, they may not be able to handle today's million-gate circuits because of their high time and space complexity. In this paper, we propose an algorithm that significantly reduces the size of the fault dictionary while still preserving high diagnostic resolution. The proposed algorithm has extremely low time and space complexity because it avoids constructing the huge distinguishability table, which would inevitably increase the required computation. Experimental results demonstrate that the proposed algorithm is fully capable of handling industrial million-gate circuits with reasonable runtime and memory. |
PDF file |
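One simple way to shrink a pass-fail dictionary without building a pairwise distinguishability table is partition refinement, shown in the hypothetical Python sketch below; it is an illustration, not the algorithm proposed in the paper. All faults start in one class of indistinguishable faults, and a test is kept only if its pass/fail outcome splits at least one current class. A discarded test cannot split any later, finer class either, so the kept tests give the same resolution as the full dictionary.

```python
from typing import Dict, List, Set


def select_tests(detects: Dict[str, Set[int]], n_tests: int) -> List[int]:
    """Greedy pass-fail dictionary reduction by partition refinement.
    detects[f] is the set of test indices that fail when fault f is present.
    Memory grows with the number of faults, not with the number of fault pairs."""
    classes: List[List[str]] = [sorted(detects)]
    kept: List[int] = []
    for t in range(n_tests):
        refined, split = [], False
        for cls in classes:
            fail = [f for f in cls if t in detects[f]]
            pas = [f for f in cls if t not in detects[f]]
            if fail and pas:  # test t distinguishes faults inside this class
                split = True
                refined.extend([fail, pas])
            else:
                refined.append(cls)
        if split:
            kept.append(t)
            classes = refined
    return kept
```

In the first test case below, faults `f1` and `f4` share the same signature, so no test can separate them and the dictionary keeps only the two tests that matter.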
Title | Cyclic-CPRS : A Diagnosis Technique for BISTed Circuits for Nano-meter Technologies |
Author | *Chun-Yi Lee, Hung-Mao Lin, Fang-Min Wang, James Chien-Mo Li (Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan) |
Page | pp. 835 - 840 |
Keyword | Fault Diagnosis, BIST, CPRS, Scan Chain, Unknowns |
Abstract | A Cyclic-CPRS (Column Parity Row Selection) technique is presented to diagnose built-in self tested (BISTed) circuits, even in the presence of many unknowns and transient errors. The novel cyclic scan chains retain the transient errors and unknowns in the CUT until they are fully diagnosed. Instead of masking the unknowns, Cyclic-CPRS directly diagnoses the unknowns as if they were errors. Direct diagnosis of unknowns not only eliminates the masking circuitry but also enhances the diagnosis resolution. Experimental results show that Cyclic-CPRS is very successful even in the presence of 10% errors and unknowns. The proposed technique is especially suitable for nano-meter technologies, in which transient errors and systematic defects are becoming serious problems. |
PDF file |
Title | (Invited Paper) Preferable Improvements and Changes to FB-DiMM High-Speed Channel for 9.6Gbps Operation |
Author | *Atsushi Hiraishi, Toshio Sugano (Elpida Memory, Japan), Hideki Kusamitsu (Yamaichi Electronics, Japan) |
Page | pp. 841 - 845 |
Keyword | FB-DiMM, High-speed channel |
Abstract | In this paper, we identify the parts of the FB-DiMM high-speed channel where signal degradation occurs, and we present possible countermeasures. To verify these and to establish a precise modeling and simulation method, we compared measurements and simulations up to 9.6Gbps operation on a test board and obtained good correlation between them. Based on the calculated loss budget of the estimated system, we recommend preferable changes to the main board and the DiMM socket. |
PDF file |
Title | (Invited Paper) Xbox360™ Front Side Bus - A 21.6 Gb/s End to End Interface Design |
Author | *David Siljenberg, Steve Baumgartner, Tim Buchholtz, Mark Maxson, Trevor Timpane (IBM, United States), Jeff Johnson (Cadence Design Systems, United States) |
Page | pp. 846 - 853 |
Keyword | source synchronous, front side bus, serial link, chip to chip interconnect |
Abstract | With a bandwidth of 21.6 GB/s, the Front Side Bus (FSB) of the Microsoft Xbox360™ is one of the fastest commercially available Front Side Bus interfaces in the consumer market. This paper explains the end-to-end system approach used in designing the bus, which reached volume production ramp 18 months after design start. The 90 nm SOI-CMOS CPU and 90 nm bulk CMOS GPU designs are described, as are the chip carrier, the circuit board, and the signal integrity analyses. The design approach used to achieve high volume, low cost, and short development time is explained. |
PDF file |
Title | (Invited Paper) Design Consideration of 6.25 Gbps Signaling for High-Performance Server |
Author | *Jian Hong Jiang, Weixin Gai, Akira Hattori, Yasuo Hidaka, Takeshi Horie, Yoichi Koyanagi, Hideki Osone (Fujitsu Laboratories of America, United States) |
Page | pp. 854 - 857 |
Keyword | multi-gigabit/s transceiver, multi-tap pre-emphasis, linear equalizer |
Abstract | As network data rates increase rapidly, high-speed signaling circuits for server communication pose many design challenges because of the varied system requirements of different interconnect media. This paper discusses the main problems of, and solutions for, high-speed circuits for server interconnect. It then presents a high-speed circuit implementation for such interconnect in 90nm CMOS technology that achieves a data rate of 6.25 Gbps in a backplane environment. |
PDF file |
Title | (Invited Paper) System Co-Design and Co-Analysis Approach to Implementing the XDR™ Memory System of the Cell Broadband Engine™ Processor Realizing 3.2 Gbps Data Rate per Memory Lane in Low Cost, High Volume Production |
Author | *Wai-Yeung Yip, Scott Best, Wendemagegnehu Beyene, Ralf Schmitt (Rambus, United States) |
Page | pp. 858 - 865 |
Keyword | XDR, memory, Cell, interface, Rambus |
Abstract | This paper describes the design and analysis of the 3.2 Gbps XDR™ memory system of the Cell Broadband Engine™ (Cell BE) processor developed by Sony Corporation, Sony Computer Entertainment, Toshiba and IBM. A System Co-Design and Co-Analysis Approach was applied where different components of the system are designed and analyzed simultaneously to allow trade-offs to be made to optimize system electrical characteristics at low overall system cost. The XDR memory interface circuit implemented in the Cell BE processor, the power delivery system design and analysis, and the interface statistical signal integrity analysis will be described to illustrate this design and analysis approach. |
PDF file |
Title | Flow Time Minimization under Energy Constraints |
Author | *Jian-Jia Chen (National Taiwan University, Taiwan), Kazuo Iwama (Kyoto University, Japan), Tei-Wei Kuo, Hsueh-I Lu (National Taiwan University, Taiwan) |
Page | pp. 866 - 871 |
Keyword | Energy-aware systems, Scheduling, Flow time minimization, Dynamic voltage scaling |
Abstract | Power-aware and energy-efficient designs play important roles in modern hardware and software design, especially for embedded systems. This paper targets a scheduling problem on a processor capable of dynamic voltage scaling (DVS), which can reduce power consumption by slowing down the processor. The objective is to minimize the average flow time of a set of jobs under a given energy constraint, where the flow time of a job is the interval between its arrival and its completion. We consider two types of processors: those with a continuous spectrum of available speeds and those with only a finite number of discrete speeds. Two algorithms are given: (1) an algorithm that derives optimal solutions for processors with a continuous spectrum of available speeds, and (2) a greedy algorithm that derives optimal solutions for processors with a finite number of discrete speeds. The proposed algorithms are extended to handle jobs with different weights, minimizing the average weighted flow time. The proposed algorithms are also evaluated against schedules that execute jobs at a common effective speed. |
PDF file |
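The common-effective-speed baseline mentioned above is easy to sketch. Assuming the usual convex power model P(s) = s^alpha (alpha = 3 here, our assumption), a job with w units of work takes w/s time and consumes w * s^(alpha - 1) energy, so the single speed that exactly spends the budget E is s = (E / total work)^(1 / (alpha - 1)). The hypothetical Python function below computes that speed and the resulting average flow time under first-come-first-served order; it is the baseline, not the paper's optimal algorithms.

```python
from typing import List, Tuple


def common_speed_schedule(jobs: List[Tuple[float, float]], energy: float,
                          alpha: float = 3.0) -> Tuple[float, float]:
    """Run all jobs FIFO at one common speed that exactly spends the energy budget.
    jobs: (arrival_time, work). Returns (speed, average flow time)."""
    total_work = sum(w for _, w in jobs)
    s = (energy / total_work) ** (1.0 / (alpha - 1.0))
    t, flows = 0.0, []
    for arrival, work in sorted(jobs):
        t = max(t, arrival) + work / s  # wait for the job to arrive, then execute it
        flows.append(t - arrival)       # flow time = completion time - arrival time
    return s, sum(flows) / len(flows)
```

For two jobs arriving at times 0 and 1 with work 2 and 1 and a budget of 12, the common speed is 2, the jobs finish at 1 and 1.5, and the average flow time is 0.75; an optimal schedule can do better by running jobs that many others wait behind faster.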
Title | Integrating Power Management into Distributed Real-time Systems at Very Low Implementation Cost |
Author | Bita Gorjiara, Nader Bagherzadeh, *Pai Chou (University of California, Irvine, United States) |
Page | pp. 872 - 877 |
Keyword | Dynamic Power Management, real-time systems, distributed systems |
Abstract | The development cost of low-power embedded systems can be reduced by reusing legacy designs and modifying them appropriately to meet power constraints. Existing techniques that implement distributed power managers in multiprocessor systems are very costly in terms of hardware/software modifications. In this paper, we propose a new centralized power management technique that reduces the power consumption of distributed systems at very low implementation cost. Our power manager uses a model of the system/application to compute the schedule of turn-on/off commands. We applied our power management technique to a distributed software-defined radio system and achieved 60% to 87% energy savings. |
PDF file |
Title | A Software Technique to Improve Yield of Processor Chips in Presence of Ultra-Leaky SRAM Cells Caused by Process Variation |
Author | *Maziar Goudarzi, Tohru Ishihara, Hiroto Yasuura (Kyushu University, Japan) |
Page | pp. 878 - 883 |
Keyword | Process variation, Leakage power, Software-based Technique, Yield, Embedded Systems |
Abstract | Exceptionally leaky transistors are increasingly frequent in nanoscale technologies due to lower threshold voltages and their increased variation. Such leaky transistors may even change position with changes in operating voltage and temperature; hence, circuit-level redundancy is not sufficient to tolerate such threats to yield. We show that in SRAM cells this leakage depends on the cell value, and we propose the first software-based runtime technique that suppresses such abnormal leakage by storing safe values in the
corresponding cache lines before entering standby mode. Analysis shows that the performance penalty is, in the worst case, linearly dependent on the number of cured cache lines, while the energy saving increases linearly with the time spent in standby mode. Analysis and experimental results on commercial processors confirm that the technique is viable if the standby duration exceeds a small fraction of a second. |
PDF file |
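The linear cost and linear saving described above lead to a back-of-the-envelope break-even calculation. The hypothetical Python sketch below models the curing overhead as an energy cost proportional to the number of cured cache lines, and the benefit as avoided leakage power multiplied by standby time; all parameter names and values are illustrative assumptions, not data from the paper.

```python
def break_even_standby_time(n_lines: int, energy_per_line_j: float, leak_saving_w: float) -> float:
    """Standby duration (s) above which curing pays off.
    Cost grows linearly with the number of cured lines (writing safe values);
    saving grows linearly with standby time (avoided abnormal leakage power)."""
    if leak_saving_w <= 0:
        raise ValueError("leakage saving must be positive")
    return n_lines * energy_per_line_j / leak_saving_w


def net_saving_j(standby_s: float, n_lines: int, energy_per_line_j: float,
                 leak_saving_w: float) -> float:
    """Energy saved (J) by curing before a standby period of `standby_s` seconds."""
    return leak_saving_w * standby_s - n_lines * energy_per_line_j
```

With 100 cured lines at an assumed 1 nJ each and 1 uW of avoided leakage, the break-even standby time is 0.1 s; longer standby periods yield a net saving.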
Title | Design Methodology for 2.4GHz Dual-Core Microprocessor |
Author | Noriyuki Ito, Hiroaki Komatsu, Akira Kanuma, Akihiro Yoshitake, Yoshiyasu Tanamura, Hiroyuki Sugiyama, Ryoichi Yamashita, *Ken-ichi Nabeya, Hironobu Yoshino, Hitoshi Yamanaka, Masahiro Yanagida, Yoshitomo Ozeki, Kinya Ishizaka, Takeshi Kono, Yutaka Isoda (Fujitsu Limited, Japan) |
Page | pp. 896 - 901 |
Keyword | microprocessor, dual-core, clock, custom macro |
Abstract | This paper presents a design methodology that was applied to the design of a 2.4GHz dual-core SPARC64 microprocessor with 90nm CMOS technology. It focuses on the newly adopted techniques, such as efficient data management in dual-core design, fast delay calculation of the noise-immune clock distribution circuit, enhanced signal integrity analysis of a large-scale custom macro design, and enhanced diagnosis capability using a logic BIST circuit. |
PDF file |
Title | An Embedded Low Power/Cost 16-Bit Data/Instruction Microprocessor Compatible with ARM7 Software Tools |
Author | *Fu-Ching Yang, Ing-Jer Huang (National Sun Yat-Sen University, Taiwan) |
Page | pp. 902 - 907 |
Keyword | microprocessor, compatible, low-power, low-cost, narrow width memory |
Abstract | A 16-bit THUMB instruction set microprocessor is proposed for low-cost, low-power, short-precision computing. Compared with ARM7, it requires 40% of the gate count and 51% of the power consumption while reaching 160% of the clock frequency; moreover, with narrow-width memory at the same clock frequency, its performance is 67% better. The processor is also compatible with ARM7 software tools. |
PDF file |
Title | A Novel Reconfigurable Low Power Distributed Arithmetic Architecture for Multimedia Applications |
Author | *Zhenyu Liu, Tughrul Arslan, Ahmet T. Erdogan (The University of Edinburgh, Great Britain) |
Page | pp. 908 - 913 |
Keyword | Distributed Arithmetic, reconfigurable, DCT |
Abstract | The use of reconfigurable cores in system-on-chip (SoC) designs is an increasingly common trend. Such cores are used for their flexibility, powerful functionality, and low power consumption. Distributed Arithmetic (DA) is an efficient algorithm widely used in many multimedia applications. This paper presents a novel reconfigurable adder-based DA architecture for computing the inner product, which is the key computation in many digital signal processing applications. A 1D DCT is mapped onto the architecture. Compared with several existing ASIC designs, the new architecture achieves good performance in area, speed, and power. |
PDF file |
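For readers unfamiliar with DA, the sketch below shows the classic lookup-table formulation of a distributed-arithmetic inner product: one input bit from every operand forms a LUT address, and the partial sums are shifted and accumulated bit-serially, with the two's-complement sign bit weighted negatively. This is the textbook LUT-based scheme, not the adder-based reconfigurable architecture proposed in the paper; the function names are illustrative.

```python
def build_da_lut(coeffs):
    """LUT entry for address `addr` is the sum of coeffs[k] whose bit k is set in addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_inner_product(coeffs, xs, bits):
    """Compute sum(coeffs[k] * xs[k]) for `bits`-bit two's-complement inputs xs."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if any(x < lo or x > hi for x in xs):
        raise ValueError("input does not fit in the given two's-complement width")
    lut = build_da_lut(coeffs)
    acc = 0
    for j in range(bits):
        # Gather bit j of every input into one LUT address (bit-serial DA).
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> j) & 1) << k
        partial = lut[addr] * (1 << j)
        # The MSB of a two's-complement number has negative weight.
        acc += -partial if j == bits - 1 else partial
    return acc


coeffs, xs = [3, -2, 5], [7, -4, 1]
print(da_inner_product(coeffs, xs, bits=8))  # 34
```

No multiplier appears anywhere: each step is a table lookup, a shift, and an add, which is why DA maps well onto hardware for fixed-coefficient transforms such as the DCT, where each output coefficient is one such inner product.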
Title | Exploration of Low Power Adders for a SIMD Data Path |
Author | *Giacomo Paci (IMEC and DEIS, University of Bologna, Italy), Paul Marchal (IMEC, Belgium), Luca Benini (DEIS, University of Bologna, Italy) |
Page | pp. 914 - 919 |
Keyword | adders, SIMD, power, area |
Abstract | Hardware for Ambient Intelligence needs to achieve extremely high computational efficiency (up to 40 GOPS/W). An important way to reach this is to exploit parallelism, more specifically the data-level parallelism enabled by SIMD. Whereas a large body of research exists on the benefits of SIMD, its architectural design, and compilation for it, the design of energy-optimal functional units for SIMD has received limited attention. Existing SIMD functional units appear to be designed to be area-optimal but not energy-optimal. By exploiting the differences in critical path length across operation types (e.g., 4x8/2x16/1x32), SIMD adders can be built that save up to 40% of energy. In this paper, we present these adders and the issues involved in building them, and we quantify their benefits for different usage scenarios and operating frequencies. |
PDF file |
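The 4x8/2x16/1x32 modes mentioned above refer to one 32-bit adder that can act as several independent narrower adders. The following functional sketch shows only what each mode computes (carries are cut at lane boundaries); it says nothing about the energy-optimized circuits in the paper, and the function name is illustrative.

```python
def simd_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words as 32 // lane_bits independent lanes (carries never cross lanes)."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("supported modes are 4x8, 2x16 and 1x32")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Drop the lane's carry-out instead of propagating it to the next lane.
        result |= (lane_sum & lane_mask) << shift
    return result


for mode in (8, 16, 32):
    print(mode, hex(simd_add(0x000000FF, 0x00000001, mode)))
# 8 0x0
# 16 0x100
# 32 0x100
```

Because the longest possible carry chain equals the lane width, the 4x8 mode has a much shorter critical path than the 1x32 mode; this difference is what the abstract proposes to exploit for energy savings.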
Title | Micro-architecture Pipelining Optimization with Throughput-Aware Floorplanning |
Author | *Yuchun Ma, Zhuoyuan Li (Tsinghua University, China), Jason Cong (University of California, Los Angeles, United States), Xianlong Hong (Tsinghua University, China), Glenn Reinman (University of California, Los Angeles, United States), Sheqin Dong, Qiang Zhou (Tsinghua University, China) |
Page | pp. 920 - 925 |
Keyword | micro-architecture, pipelining, throughput-aware, floorplanning |
Abstract | For modern processor designs in nanometer technologies, both block and interconnect pipelining are needed to achieve multi-gigahertz clock frequencies, but previous approaches consider block pipelining and interconnect pipelining separately. For example, all recent works on wire pipelining assume pre-pipelined components and consider only inserting pipeline stages on point-to-point wire or bus connections. To the best of our knowledge, this paper is the first to consider block pipelining and interconnect pipelining simultaneously. We optimize multiple critical paths or loops in the micro-architecture and optimally insert pipeline stages in the blocks and wires of these loops to meet the clock frequency requirement. We propose two approaches to this problem. The first is based on mixed integer linear programming (MILP), which is theoretically guaranteed to produce the optimal solution; the second is an efficient graph-based algorithm that produces near-optimal solutions. Experimental results show that simultaneous block and interconnect pipelining improves overall processor performance by more than 20% compared with wire pipelining alone. Moreover, the graph-based approach gives solutions very close to the MILP results (on average, only 2% more than the MILP results) in a much shorter runtime. |
PDF file |
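To see why pipelining a critical loop is a trade-off, consider how many cycles each block or wire segment of the loop occupies once the clock period is fixed. The sketch below uses a deliberately simple, hypothetical model (each segment ends in a register and its delay can be split evenly anywhere); it is not the MILP or the graph-based algorithm from the paper, only an illustration of the quantities they optimize.

```python
import math


def pipeline_loop(segment_delays_ns, clock_period_ns):
    """Return (registers inserted per segment, loop latency in cycles).

    Hypothetical model: every block or wire segment in the loop ends in a
    register, and a segment's delay can be split evenly at any point.
    """
    if clock_period_ns <= 0:
        raise ValueError("clock period must be positive")
    # Each segment needs enough cycles that no piece exceeds one clock period.
    cycles = [max(1, math.ceil(d / clock_period_ns)) for d in segment_delays_ns]
    extra_registers = [c - 1 for c in cycles]
    return extra_registers, sum(cycles)


# Illustrative loop: one block (0.9 ns), one long wire (1.5 ns), one block (2.1 ns).
print(pipeline_loop([0.9, 1.5, 2.1], clock_period_ns=1.0))  # ([0, 1, 2], 6)
```

Halving the clock period in this model raises the loop latency from 6 to 10 cycles. Since a result traversing a loop cannot be reused until the loop completes, this extra latency can erode throughput, which is why the paper decides where stages go in blocks and wires jointly rather than per wire.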
Title | Multithreaded SAT Solving |
Author | *Matthew Lewis, Tobias Schubert, Bernd Becker (Albert-Ludwigs-University of Freiburg, Germany) |
Page | pp. 926 - 931 |
Keyword | SAT, Solver, Threads, Multithreaded, Verification |
Abstract | This paper describes the multithreaded MiraXT SAT solver, which was designed to take advantage of current and future shared-memory multiprocessor systems. The paper highlights design and implementation details that allow the multiple threads to run and cooperate efficiently. Results show that in single-threaded mode, MiraXT compares well with other state-of-the-art solvers on industrial problems. In multithreaded mode, it delivers cutting-edge performance, achieving speedups on both SAT and UNSAT instances. |
PDF file |
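One generic way to let several threads share a SAT search is divide-and-conquer: fix a few variables to split the formula into independent sub-problems ("cubes") and hand each to a worker. The sketch below does this with a plain DPLL solver and a thread pool. It is an assumption-labeled illustration: the abstract does not say how MiraXT partitions work, and this sketch omits the clause sharing and conflict-driven learning a real multithreaded solver relies on. In CPython, the global interpreter lock also prevents real speedup for this pure-Python code, so it shows only the structure of the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def simplify(clauses, lit):
    """Assign `lit` true: drop satisfied clauses and remove the falsified literal -lit.
    Returns None if some clause becomes empty (a conflict)."""
    out = []
    for clause in clauses:
        if lit in clause:
            continue
        reduced = [l for l in clause if l != -lit]
        if not reduced:
            return None
        out.append(reduced)
    return out


def dpll(clauses, assignment):
    """Plain recursive DPLL with unit propagation; returns a list of true literals or None."""
    if any(not clause for clause in clauses):
        return None
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        clauses = simplify(clauses, unit)
        if clauses is None:
            return None
        assignment = assignment + [unit]
    if not clauses:
        return assignment
    lit = clauses[0][0]  # naive branching heuristic
    for choice in (lit, -lit):
        reduced = simplify(clauses, choice)
        if reduced is not None:
            result = dpll(reduced, assignment + [choice])
            if result is not None:
                return result
    return None


def parallel_solve(clauses, split_vars, workers=4):
    """Split the search space into 2**len(split_vars) cubes and solve them on a thread pool."""
    cubes = [[]]
    for v in split_vars:
        cubes = [cube + [s] for cube in cubes for s in (v, -v)]

    def solve_cube(cube):
        current = clauses
        for lit in cube:
            current = simplify(current, lit)
            if current is None:
                return None
        return dpll(current, list(cube))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in cube order; the first model found is returned.
        for model in pool.map(solve_cube, cubes):
            if model is not None:
                return model
    return None  # every cube is UNSAT, so the whole formula is UNSAT


cnf = [[1, 2], [-1, 3], [-2, -3], [-3, 4]]
print(parallel_solve(cnf, split_vars=[1, 2]))  # [1, -2, 3, 4]
```

An UNSAT answer requires every cube to be refuted, whereas a SAT answer needs only one, which hints at why obtaining speedups on UNSAT instances, as the abstract reports for MiraXT, is notable.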
Title | Trace Compaction using SAT-based Reachability Analysis |
Author | *Sean Safarpour, Andreas Veneris, Hratch Mangassarian (University of Toronto, Canada) |
Page | pp. 932 - 937 |
Keyword | trace compaction, reachability, SAT, debugging, trace reduction |
Abstract | In today's designs, when functional verification fails, engineers perform debugging using the provided error traces. Reducing the length of error traces can help the debugging task by decreasing the number of variables and clock cycles that must be considered. We propose a novel trace length compaction approach based on SAT-based reachability analysis. We develop procedures and algorithms that use pre-image computation to efficiently traverse the state space and reduce trace lengths. We further introduce a data structure for storing the visited states, which is critical to the performance of the proposed approach. Experiments demonstrate the effectiveness of the reachability approach: approximately 75% of the traces are reduced by one or two orders of magnitude. |
PDF file |
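The core idea of compaction by backward reachability can be shown on a tiny explicit-state model: repeatedly take the pre-image of the states already known to reach the error, keep a visited set so no state is expanded twice, stop when the initial state appears, then walk forward to recover the shortest input sequence. This sketch enumerates states explicitly instead of computing pre-images with SAT, and the toy design and function names are hypothetical.

```python
def preimage(targets, states, inputs, step):
    """States that reach some state in `targets` in one step under some input."""
    return {s for s in states for i in inputs if step(s, i) in targets}


def shortest_trace(init, error, states, inputs, step):
    """Backward breadth-first reachability from `error`; returns the shortest
    input sequence driving `init` to `error`, or None if unreachable."""
    layers, visited = [{error}], {error}
    while init not in visited:
        frontier = preimage(layers[-1], states, inputs, step) - visited
        if not frontier:
            return None  # fixpoint reached without hitting init
        layers.append(frontier)
        visited |= frontier
    # init lies in the last layer; walk forward, one layer closer each step.
    trace, current = [], init
    for layer in reversed(layers[:-1]):
        inp = next(i for i in inputs if step(current, i) in layer)
        trace.append(inp)
        current = step(current, inp)
    return trace


def accumulator_step(state, inp):
    """Toy design: a mod-16 accumulator whose input adds 1 or 3 each cycle."""
    return (state + inp) % 16


# A failing 9-cycle trace of all +1 inputs drives 0 to 9; a 3-cycle trace suffices.
print(shortest_trace(0, 9, range(16), [1, 3], accumulator_step))  # [3, 3, 3]
```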
Title | Fixing Design Errors with Counterexamples and Resynthesis |
Author | *Kai-hui Chang, Igor L. Markov, Valeria Bertacco (University of Michigan at Ann Arbor, United States) |
Page | pp. 944 - 949 |
Keyword | Error correction, Resynthesis, Functional verification |
Abstract | In this work we propose a new error-correction
framework, called CoRe, which uses counterexamples, or bug
traces, generated in verification to automatically correct errors
in digital designs. CoRe is powered by two innovative resynthesis
techniques, Goal-Directed Search (GDS) and Entropy-Guided
Search (EGS), which modify the functionality of internal circuit
nodes to match the desired specification. We evaluate our solution
on designs and errors arising during combinational equivalence checking,
as well as simulation-based verification of digital systems.
Compared with previously proposed techniques, CoRe is
more powerful in that: (1) it can fix a broader range of error
types because it does not rely on specific error models; (2) it derives
the correct functionality from simulation vectors, hence not
requiring golden netlists; and (3) it can be applied to a range of
verification flows, including formal and simulation-based. |
PDF file |
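The abstract notes that CoRe derives the correct functionality from simulation vectors instead of a golden netlist. The sketch below illustrates that general idea in its simplest form: for a suspect internal node, simulate each vector with the node forced to 0 and to 1, and record which value makes the output agree with the specification. The result is a partial truth table (care set) that a resynthesis step could implement. This is a simplified, hypothetical stand-in, not the GDS or EGS algorithms, and the tiny netlist format is invented for illustration.

```python
from itertools import product


def simulate(netlist, order, vector, override=None):
    """Evaluate gates in topological `order`; `override` forces chosen nodes to fixed values."""
    values = dict(vector)
    for node in order:
        if override and node in override:
            values[node] = override[node]
        else:
            fn, fanins = netlist[node]
            values[node] = fn(*(values[f] for f in fanins))
    return values


def required_values(netlist, order, node, output, vectors, spec):
    """For each vector, find which value of `node` makes `output` match `spec`.

    Returns {vector_key: required_value} for vectors where exactly one value works
    (don't-cares are omitted), or None if no value of `node` can fix some vector.
    """
    care = {}
    for vec in vectors:
        good = [v for v in (0, 1)
                if simulate(netlist, order, vec, {node: v})[output] == spec(vec)]
        if not good:
            return None
        if len(good) == 1:
            care[tuple(sorted(vec.items()))] = good[0]
    return care


def spec(v):
    """Intended behavior of the toy design."""
    return (v["a"] & v["b"]) | v["c"]


# Buggy toy design: n1 should be a AND b but was implemented as a OR b.
netlist = {"n1": (lambda a, b: a | b, ["a", "b"]),
           "out": (lambda n1, c: n1 | c, ["n1", "c"])}
order = ["n1", "out"]
vectors = [dict(zip("abc", bits)) for bits in product((0, 1), repeat=3)]
print(required_values(netlist, order, "n1", "out", vectors, spec))
```

Here the care set pins n1 to a AND b whenever c = 0 and leaves it unconstrained when c = 1, so the repaired node need not be a full copy of any golden netlist, only consistent with the observed vectors.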