DAC 2001 Abstracts



Session 1: Panel: The Electronics Industry Supply Chain: Who Will Do What? [p. 1]

Chair and Organizer: Rita Glover
Panel Members: Marc Halpern, Rich Becks, Richard Kubin, Henry Jurgens, Rick Cassidy, Ted Vucurevich
The makeup and relationships within the design supply chain are changing rapidly. One-stop shopping, whereby a system house procures most of its design and fabrication services from a single source, is now being supplanted by a host of outsourcing suppliers. These players are developing online integration platforms for outsourced design engineering, business transactions, and data management. This evolution is being felt by the entire design automation industry, whether it be focused on the electronic design of chips, boards, or systems, or even the mechanical design of their packaging and housing. What are the implications of this trend on cost, demand for services, and time to market? How can all parties ride this wave of the future to achieve the best outcome in terms of collaborative design capabilities and quality of results?


Session 2: Nanometer Futures

Chair: Andrew B. Kahng
Organizers: Andrew B. Kahng, Kurt Keutzer
2.1 Future Performance Challenges in Nanometer Design [p. 3]
Dennis Sylvester, Himanshu Kaul

We highlight several fundamental challenges to designing high performance integrated circuits in nanometer-scale technologies (i.e., drawn feature sizes < 100 nm). Dynamic power scaling trends lead to major packaging problems. To alleviate these concerns, thermal monitoring and feedback mechanisms can limit worst-case dissipation and reduce costs. Furthermore, a flexible multi-Vdd + multi-Vth + re-sizing approach is advocated to leverage the inherent properties of ultrasmall MOSFETs and limit both dynamic and static power. Alternative global signaling strategies such as differential and low-swing drivers are recommended in order to curb the power requirements of cross-chip communication. Finally, potential power delivery challenges are addressed with respect to ITRS packaging predictions.

2.2 IC Design in High-Cost Nanometer-Technologies Era [p. 9]
Wojciech Maly

Nanometer IC technologies are on the horizon. They promise a lot. They will cost a lot as well. Therefore, we need to ask today: How may the billions of dollars, which we will need to spend on nanometer fab lines, affect the IC design domain? This paper attempts to address the above question by analyzing the design-manufacturing interface. A partial answer is derived from a simple transistor cost model proposed in the body of the paper.
Keywords Design of ICs, cost modeling of IC design and manufacturing, IC technology trends, nanometer-technologies, and IC layout regularity.


Session 3: System-Level Configurability: Bus, Interface, and Processor Design

Chair: Pieter van der Wolf
Organizers: Kees Vissers, Kurt Keutzer
3.1 LOTTERYBUS: A New High-Performance Communication Architecture for System-on-Chip Designs [p. 15]
Kanishka Lahiri, Anand Raghunathan, Ganesh Lakshminarayana

This paper presents LOTTERYBUS, a novel high-performance communication architecture for system-on-chip (SoC) designs. The LOTTERYBUS architecture was designed to address the following limitations of current communication architectures: (i) lack of control over the allocation of communication bandwidth to different system components or data flows (e.g., in static priority based shared buses), leading to starvation of lower priority components in some situations, and (ii) significant latencies resulting from variations in the time-profile of the communication requests (e.g., in time division multiplexed access (TDMA) based architectures), sometimes leading to larger latencies for high-priority communications. We present two variations of LOTTERYBUS: the first is a low-overhead architecture with statically configured parameters, while the second variant is a more sophisticated architecture in which the values of the architectural parameters are allowed to vary dynamically. Our experiments investigate the performance of the LOTTERYBUS architecture across a wide range of communication traffic characteristics. In addition, we analyze its performance in a 4x4 ATM switch sub-system design. The results demonstrate that the LOTTERYBUS architecture is (i) capable of providing the designer with fine-grained control over the bandwidth allocated to each SoC component or data flow, and (ii) well suited to provide high-priority communication traffic with low latencies (we observed up to 85.4% reduction in communication latencies over conventional on-chip communication architectures).
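As an illustration of the lottery mechanism the abstract refers to, here is a minimal sketch of ticket-based arbitration; the two-master setup, ticket counts, and Python formulation are assumptions for illustration only, not the authors' implementation.

```python
import random

def lottery_arbitrate(requests, tickets):
    """Grant the bus to one requesting master with probability proportional
    to its statically assigned ticket count (a sketch of lottery arbitration)."""
    contenders = [(m, tickets[m]) for m in requests if tickets[m] > 0]
    if not contenders:
        return None                      # no master is requesting the bus
    draw = random.randrange(sum(t for _, t in contenders))
    for master, t in contenders:
        if draw < t:
            return master
        draw -= t

# Hypothetical example: master A holds 3 tickets, B holds 1, so A should win
# roughly 75% of the cycles in which both request the bus.
grants = [lottery_arbitrate({"A", "B"}, {"A": 3, "B": 1}) for _ in range(10000)]
print(grants.count("A") / len(grants))
```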

3.2 Robust Interfaces for Mixed-Timing Systems with Application to Latency-Insensitive Protocols [p. 21]
Tiberiu Chelcea, Steven M. Nowick

This paper presents several low-latency mixed-timing FIFO designs that interface systems on a chip working at different speeds. The connected systems can be either synchronous or asynchronous. The designs are then adapted to work between systems with very long interconnection delays, by migrating a single-clock solution by Carloni et al. (for "latency-insensitive" protocols) to mixed timing domains. The new designs can be made arbitrarily robust with regard to metastability and interface operating speeds. Initial simulations for both latency and throughput are promising.

3.3 Latency-Driven Design of Multi-Purpose Systems-On-Chip [p. 27]
Seapahn Meguerdichian, Milenko Drinic, Darko Kirovski

Deep submicron technology has two major ramifications on the design process: (i) critical paths are being dominated by global interconnect rather than gate delays and (ii) ultra high levels of integration mandate designs that encompass numerous intrasynchronous blocks with decreased functional granularity and increased communication demands. These factors emphasize the importance of the on-chip bus network as the crucial high performance enabler for future systems-on-chip. By using independent functional blocks with programmable connectivity, designers are able to build systems-on-chip capable of supporting different applications with exceptional levels of resource sharing. To address challenges in this design paradigm, we have developed a methodology that enables efficient bus network design with approximate timing verification and floorplanning of multi-purpose systems-on-chip in early design stages. The design platform iterates system synthesis and floorplanning to build min-area floorplans that satisfy statistical time constraints of applications. We demonstrate the effectiveness of our bus network design approach using examples from a multimedia benchmark suite.

3.4 Estimation of Speed, Area, and Power of Parameterizable, Soft IP [p. 31]
Jagesh Sanghavi, Albert Wang

We present a new approach to estimate speed, area, and power of a parameterizable, soft IP. By running the ASIC implementation flow only on selected configurations, we predict the performance for any arbitrary configuration. We exploit performance function decomposability to address the combinatorial explosion challenge. The estimator has been used successfully to configure Xtensa processor cores for numerous embedded SOC designs.
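A minimal sketch of how a decomposable performance model avoids enumerating every configuration: a baseline configuration is characterized once and per-parameter corrections are characterized on a few selected variations. The additive model, the function names, and the numbers below are assumptions for illustration, not the authors' estimator.

```python
def estimate_metric(config, base_config, base_value, deltas):
    """Estimate e.g. area for an arbitrary configuration from a baseline value
    plus per-parameter corrections measured on selected configurations."""
    value = base_value
    for param, setting in config.items():
        value += deltas[param][setting] - deltas[param][base_config[param]]
    return value

# Hypothetical characterization: area deltas (in gates) per option, obtained by
# running the ASIC flow only on a handful of single-parameter variations.
base_cfg = {"mac16": "off", "regfile": 32}
deltas = {"mac16": {"off": 0, "on": 9000}, "regfile": {32: 0, 64: 4000}}
print(estimate_metric({"mac16": "on", "regfile": 64}, base_cfg, 25000, deltas))
```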


Session 4: Making Verification More Efficient

Chair: Masahiro Fujita
Organizers: Limor Fix, Timothy Kam
4.1 Formal Property Verification by Abstraction Refinement with Formal, Simulation and Hybrid Engines [p. 35]
Dong Wang, Pei-Hsin Ho, Jiang Long, James Kukula, Yunshan Zhu, Tony Ma, Robert Damiano

We present RFN, a formal property verification tool based on abstraction refinement. Abstraction refinement is a strategy for property verification. It iteratively refines an abstract model to better approximate the behavior of the original design, in the hope that the abstract model alone will provide enough evidence to prove or disprove the property. However, previous work on abstraction refinement was only demonstrated on designs with up to 500 registers. We developed RFN to verify real-world designs that may contain thousands of registers. RFN differs from the previous work in several ways. First, instead of relying on a single engine, RFN employs multiple formal verification engines, including a BDD-ATPG hybrid engine and a conventional BDD-based fixpoint engine, for finding error traces or proving properties on the abstract model. Second, RFN uses a novel two-phase process involving 3-valued simulation and sequential ATPG to determine how to refine the abstract model. Third, RFN avoids a weakness of other abstraction-refinement algorithms (finding error traces on the original design) by utilizing the error trace of the abstract model to guide sequential ATPG to an error trace on the original design. We implemented and applied a prototype of RFN to verify various properties of real-world RTL designs containing approximately 5,000 registers, which represents an order of magnitude improvement over previous results. On these designs, we successfully proved a few properties and discovered a design violation.
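For context, the overall abstraction-refinement loop that tools of this kind iterate can be sketched as follows; the callback names are placeholders, and RFN's actual engines and refinement steps are more elaborate than this outline.

```python
def verify_by_abstraction_refinement(design, prop, abstract, model_check,
                                     concretize, refine, max_iters=100):
    """Generic abstraction-refinement loop (a sketch, not RFN itself).

    abstract(design)         -> initial abstract model
    model_check(model, prop) -> (True, None) if prop holds, else (False, abs_trace)
    concretize(design, tr)   -> concrete error trace, or None if tr is spurious
    refine(model, tr)        -> refined model that rules out the spurious trace
    """
    model = abstract(design)
    for _ in range(max_iters):
        holds, abs_trace = model_check(model, prop)
        if holds:
            return "property proved on the abstract model"
        real_trace = concretize(design, abs_trace)   # e.g. trace-guided sequential ATPG
        if real_trace is not None:
            return ("property fails", real_trace)
        model = refine(model, abs_trace)
    return "undecided within the iteration budget"
```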

4.2 Scalable Hybrid Verification of Complex Microprocessors [p. 41]
Maher Mneimneh, Fadi Aloul, Chris Weaver, Saugata Chatterjee, Karem Sakallah, Todd Austin

We introduce a new verification methodology for modern microprocessors that uses a simple checker processor to validate the execution of a companion high-performance processor. The checker can be viewed as an at-speed emulator that is formally verified to be compliant to an ISA specification. This verification approach enables the practical deployment of formal methods without impacting overall performance.

4.3 Symbolic RTL Simulation [p. 47]
Alfred Kölbl, James Kukula, Robert Damiano

Symbolic simulation is a promising formal verification technique combining the flexibility of conventional simulation with powerful symbolic methods. Unfortunately, existing symbolic simulators are restricted to gate level simulation or handle just a synthesizable subset of an HDL. Simulation of systems composed of design, testbench and correctness checkers, however, requires the complete set of HDL constructs. We present an approach that enables symbolic simulation of the complete set of RT-level Verilog constructs with full delay support. Additionally, we propose a flexible scheme for introducing symbolic variables and demonstrate how error traces can be simulated with this new scheme. Finally, we present some experimental results on an 8051 micro-controller design which prove the effectiveness of our approach.


Session 5: SoC and High-Level DFT

Chair: Yervant Zorian
Organizers: Anand Raghunathan, Irith Pomeranz
5.1 A Unified DFT Architecture for Use with IEEE 1149.1 and VSIA/IEEE P1500 Compliant Test Access Controllers [p. 53]
Bulent I. Dervisoglu

This paper discusses some of the critical issues that may prevent IEEE P1500 from becoming an acceptable standard and offers some suggestions for their solution. In particular, the inadequacy of the proposed P1500 and the VSIA solutions in handling hierarchical implementations is addressed. Support for hierarchical implementations is seen as an essential feature in a test access methodology that is intended for use in System on a Chip (SoC) designs. The author is actively pursuing some of these solutions through the working groups.

5.2 Instruction-Level DFT for Testing Processor and IP Cores in System-on-a-Chip [p. 59]
Wei-Cheng Lai, Kwang-Ting (Tim) Cheng

Self-testing manufacturing defects in a system-on-a-chip (SOC) by running test programs using a programmable core has several potential benefits, including at-speed testing, low DfT overhead due to the elimination of dedicated test circuitry, and better power and thermal management during testing. However, such a self-test strategy might require a lengthy test program and might not achieve a high enough fault coverage. We propose a DfT methodology to improve the fault coverage and reduce the test program length by adding test instructions to an on-chip programmable core such as a microprocessor core. This paper discusses a method of identifying effective test instructions that yield the highest benefits with low area/performance overhead. The experimental results show that with the added test instructions, complete fault coverage for testable path delay faults can be achieved with a greater than 20% reduction in the program size and the program runtime, as compared to the case without instruction-level DfT.

5.3 Test Strategies for BIST at the Algorithmic and Register-Transfer Levels [p. 65]
Kelly A. Ockunzzi, Chris Papachristou

The proposed BIST-based DFT method targets testability problems caused by three constructs. The first construct is reconvergent fanout in a circuit behavior, which causes correlation. The second construct, control statements, also causes correlation, and relational operations degrade observability. The third construct is random-pattern-resistant RTL modules, which cannot be tested effectively with random patterns. Test strategies are presented that overcome the testability problems by modifying the circuit behavior. An analysis and insertion scheme that systematically identifies the problems and applies the strategies is described. Experimental results from seven examples show that this scheme improves fault coverage while minimizing the impact on area and critical delay.
Categories and Subject Descriptors: B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance.
General Terms: Design, Theory
Keywords: design-for-test, built-in self-test, test synthesis


Session 6: Panel: The Next HDL: If C++ is the Answer, What was the Question? [p. 71]

Chair: Rajesh K. Gupta
Organizers: Shishpal Rawat, Ingrid Verbauwhede
Panel Members: Gerard Berry, Ramesh Chandra, Daniel Gajski, Kris Konigsfeld, Patrick Schaumont, Ingrid Verbauwhede
The focus of this panel is on issues surrounding the use of C++ in modeling, integration of silicon IP, and system-on-chip designs. In the last two years there have been several announcements promoting C++-based solutions and of multiple consortia (SystemC, Cynapps, Accellera, SpecC) that represent increasing commercial interest from tool vendors as well as, perhaps, an expression of genuine needs from the design houses. There are, however, serious questions about what value proposition a C++-based design methodology brings to the IC or system designer. What has changed in the modeling technology (and/or available tools) that gives a new capability? Is synthesis the right target, or validation? Tester modeling or testbench generation? This panel brings together advocates and opponents from the user community to highlight the achievements and the challenges that remain in the use of C++ for microelectronic circuits and systems.


Session 7: Design for Subwavelength Manufacturability: Impact on EDA

Chair: Robert C. Clark
Organizers: Andrew B. Kahng, Lars Liebmann
7.1 Reticle Enhancement Technology: Implications and Challenges for Physical Design [p. 73]
W. Grobman, M. Thompson, R. Wang, C. Yuan, R. Tian, E. Demircan

In this paper, we review phase shift lithography, rule vs. model based methods for OPC and model-based tiling, and discuss their implications for layout and verification. We will discuss novel approaches, using polarizing films on reticles, which change the game for phase-shift coloring, and could lead to a new direction in c:PSM constraints on physical design. We emphasize the need to do tiling that is model-driven and uses optimization techniques to achieve planarity for better manufacturing tolerance in the subwavelength dimensions era. Electromagnetic solver results will be presented which estimate the effect of tiling on circuit timing.
Keywords RET, Reticle Enhancement Technology, subwavelength lithography, mask data preparation, OPC, PSM, optical proximity correction, tiling.

7.2 Enabling Alternating Phase Shifted Mask Designs for a Full Logic Gate Level: Design Rules and Design Rule Checking [p. 79]
Lars Liebmann, Jennifer Lund, Fook-Luen Heng, Ioana Graur

The International Technology Roadmap for Semiconductors lists F2 (lambda = 157 nm) optical lithography and extreme ultraviolet next generation lithography as the two most feasible lithography solutions for the 70 nm technology node. It is likely that both of these solutions will be late, forcing ArF (lambda = 193 nm) lithography to operate at unprecedented resolution levels. Theoretically, alternating phase shifted masks ("altPSM") can achieve the resolution required to manufacture 70 nm logic products with ArF lithography equipment, but technical and logistical challenges associated with the broad implementation of altPSM require novel and invasive EDA solutions which have caused the industry to shy away from altPSM in the past. One of the biggest such challenges is the creation of robust design rule checking (DRC) tools which can predict whether a given layout has a valid, manufacturable altPSM solution. This paper takes a detailed look at the technical and practical issues associated with altPSM design rules and DRC.

7.3 Layout Design Methodologies for Sub-Wavelength Manufacturing [p. 85]
Michael L. Rieger, Jeffrey P. Mayhew, Sridhar Panchapakesan

In this paper, we describe new types of layout design constraints needed to effectively leverage advanced optical wafer lithography techniques. Most of these constraints are dictated by the physics of advanced lithography processes, while other constraints are imposed by new photomask techniques. Among the methods discussed are 1) phase shift mask (PSM) lithography in which phase information is placed on the photomask in combination with conventional clear and dark information; 2) optical proximity correction (OPC) where predictable distortions in feature geometry are corrected by putting an inverse distortion on the mask; 3) off-axis illumination optics that improve resolution of some configurations at the expense of others; and 4) use of non-resolving assist features that improve neighboring structures.
Keywords optical proximity correction, OPC, phase shift mask, PSM, lithography.

7.4 Adoption of OPC and the Impact on Design and Layout [p. 89]
F. M. Schellenberg, Olivier Toublan, Luigi Capodieci, Bob Socha

With the adoption of various combinations of resolution enhancement techniques (RET) for IC lithography, different process constraints are placed on the IC layout. The final layout used for mask production is dramatically different from the original designer's intent. To ensure that EDA tools developed for applying RET techniques can have optimal performance, layout methodology must change to create a true "target" layer that represents the actual design intent. Verification of the final layout is then expanded from LVS and DRC to also include lithography process simulation, which compares results to this desired "target" and governs the application of RET.
General Terms Algorithms, Design, Reliability, Standardization, Verification.
Keywords RET, OPC, PSM, SRAF, scattering bars, lithography, phase-shifting, OAI, off-axis illumination, Quasar, quadrupole.

7.5 A Practical Application of Full-Feature Alternating Phase-Shifting Technology for a Phase-Aware Standard-Cell Design Flow [p. 93]
Michel Côté, Philippe Hurat, Vinod Malhotra, Michael Sanie

As the semiconductor industry enters the subwavelength era where silicon features are much smaller than the wavelength of the light used to create them, a number of "subwavelength" technologies such as Optical Proximity Correction (OPC) and Phase-Shifting Masks (PSM) have been introduced to produce integrated circuits (ICs) with acceptable yields. An effective approach to subwavelength IC production includes a combination of these techniques, including OPC and PSM. Nevertheless, as we approach silicon features of 0.10 µm and below, Alternating PSM (AltPSM) becomes a critical part of the technology portfolio needed to achieve IC requirements. An effective EDA methodology that generates AltPSM ICs must guarantee correct generation of AltPSM layouts, maintain or improve today's design productivity, and leverage existing tools and flows. The implementation of such a methodology becomes more complex as phase shifting is applied to all critical features, including those outside of transistor gates. In this paper, we present a methodology targeted for standard-cell or structured-custom design styles. We also present examples of designs that start from standard cells created in a manner in which all issues regarding generation of AltPSM are effectively considered, and are used in a typical cell-based (synthesis, automatic place and route) flow to produce design layouts that are ready for cost-effective silicon manufacturing.


Session 8: New Ideas in Logic Synthesis

Chair: Yusuke Matsunaga
Organizers: Timothy Kam, Masahiro Fujita
8.1 Layout-Driven Hot-Carrier Degradation Minimization Using Logic Restructuring Techniques [p. 97]
Chih-Wei (Jim) Chang, Kai Wang, Malgorzata Marek-Sadowska

The rapid advances in semiconductor manufacturing technology have created tough reliability problems. Failure mechanisms such as hot-carrier effect, dielectric breakdown, electrostatic discharge and electromigration have posed tremendous threats to the long-term reliability of VLSI circuits. As a result, designers not only need analysis tools to locate the problem, but also design-for-reliability tools to correct it. However, these problems often surface when the physical layout is done and relatively few logic changes can be made. In this paper, we target the performance optimization issues in the context of hot-carrier induced degradation. A layout-driven approach combining rewiring, discrete gate resizing, and pin reordering is proposed. Experimental results show that rewiring-based incremental logic restructuring is a very powerful technique in post-layout design for reliability.

8.2 An Algorithm for Bi-Decomposition of Logic Functions [p. 103]
Alan Mishchenko, Bernd Steinbach, Marek Perkowski

We propose a new BDD-based method for decomposition of multi-output incompletely specified logic functions into netlists of two-input logic gates. The algorithm uses the internal don't-cares during the decomposition to produce compact well-balanced netlists with short delay. The resulting netlists are provably nonredundant and facilitate test pattern generation. Experimental results over MCNC benchmarks show that our approach outperforms SIS and other BDD-based decomposition methods in terms of area and delay of the resulting circuits with comparable CPU time.

8.3 Factoring and Recognition of Read-Once Functions using Cographs and Normality [p. 109]
Martin C. Golumbic, Aviad Mintz, Udi Rotics

An approach for factoring general Boolean functions, based on graph partitioning algorithms, was described in [5]. In this paper, we present a very fast algorithm for recognizing and factoring read-once functions, which is needed as a dedicated factoring subroutine to handle the lower levels of that factoring process. The algorithm is based on algorithms for cograph recognition and on checking normality. Our method has been implemented in the SIS environment, and an empirical evaluation is given.
Categories and Subject Descriptors J.6 [Computer Application]: Computer Aided Engineering; I.1.2 [Computing Methodologies]: Symbolic and Algebraic Manipulation
General Terms Factoring Algorithms
Keywords Read-Once Functions, Factoring, Normality, Cograph

8.4 Logic Minimization using Exclusive OR Gates [p. 115]
Valentina Ciriani

Recently introduced pseudoproducts and Sum of Pseudoproducts (SPP) forms have made it possible to represent Boolean functions with much shorter expressions than standard Sum of Products (SP) forms [5]. A pseudoproduct is a product (AND) of Exclusive OR (EXOR) factors, and an SPP form is a sum (OR) of pseudoproducts. The synthesis of SPP minimal forms requires greater effort than SP minimization. In this paper we present a new data structure for this problem, leading to an efficient minimization method for SPP forms implemented with an exact algorithm and a heuristic. Experimental results on a classical set of benchmarks show that the new algorithms are fast, and can be applied to "complex" functions with a reasonable running time.
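A tiny example of the compression a pseudoproduct buys: one AND of two EXOR factors covers the same minterms as four SP product terms. The brute-force check below only illustrates the definition, not the paper's algorithm.

```python
from itertools import product

def spp(a, b, c, d):
    return (a ^ b) & (c ^ d)            # a single pseudoproduct of two EXOR factors

def sp(a, b, c, d):
    # the same function as a standard Sum of Products needs four product terms
    return ((a & ~b & c & ~d) | (a & ~b & ~c & d) |
            (~a & b & c & ~d) | (~a & b & ~c & d)) & 1

assert all(spp(*v) == sp(*v) for v in product((0, 1), repeat=4))
print("one pseudoproduct replaces four SP product terms")
```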


Session 9: Analog Design and Modeling

Chair: Alper Demir
Organizers: Joel Phillips, Noel Menezes
9.1 Design of Half-Rate Clock and Data Recovery Circuits for Optical Communication Systems [p. 121]
Jafar Savoj, Behzad Razavi

This paper describes the design of two half-rate clock and data recovery circuits for optical receivers. Targeting the data rate of 10-Gb/s, the first implementation incorporates a ring oscillator and a linear phase detector whereas the second implementation uses a multiphase LC oscillator and a bang-bang phase/frequency detector. Fabricated in 0.18-um CMOS technology, the power consumption of each of the circuits is less than 100 mW. The rms jitter of the output clock for the two prototypes is 1 ps and 0.8 ps, respectively, while the latter achieves a capture range of more than 14%.

9.2 A Novel Method for Stochastic Nonlinearity Analysis of a CMOS Pipeline ADC [p. 127]
David Goren, Eliyahu Shamsaev, Israel A. Wagner

An analytic approach is presented for estimating the nonlinearity of an analog to digital converter (ADC) as a function of the variations in the circuit devices. The approach is demonstrated for the case of a pipeline ADC with digital error correction. Under some mild assumptions on the expected variations, the error probability is expressed as a simple explicit function of the standard deviations in the components' parameters: gain errors, comparator offset errors and resistor errors. The analytical expression is verified for Integral Non Linearity (INL), and its limits are studied using Monte-Carlo simulations of a 10 bit pipeline ADC structure.

9.3 Behavioral Partitioning in the Synthesis of Mixed Analog-Digital Systems [p. 133]
Sree Ganesan, Ranga Vemuri

Synthesis of mixed-signal designs from behavioral specifications must address analog-digital partitioning. In this paper, we investigate the issues in mixed-signal behavioral partitioning and design space exploration for signal-processing systems. We begin with the system behavior specified in an intermediate format called the Mixed Signal Flow Graph (MSFG), based on the time-amplitude characterization of signals. We present techniques for analog-digital behavioral partitioning of the MSFG, and performance estimation of the technology-mapped analog and digital circuits. The partitioned solution must satisfy constraints imposed by the target field-programmable mixed-signal architecture on available configurable resources, available data converters, their resolution and speed, and I/O pins. The quality of the solution is evaluated based on two metrics, namely feasibility and performance. The former is a measure of the validity of the solution with respect to the architectural constraints. The latter measures the performance of the system based on bandwidth/speed and noise.

9.4 Efficient DDD-based Symbolic Analysis of Large Linear Analog Circuits [p. 139]
Wim Verhaegen, Georges Gielen

A new technique for generating approximate symbolic expressions for network functions in linear(ized) analog circuits is presented. It is based on the compact determinant decision diagram (DDD) representation of the circuit. An implementation of a term generation algorithm is given and its performance is compared to a matroid-based algorithm. Experimental results indicate that our approach is the fastest reported algorithm so far for this application.


Session 10: Scan-Based Testing

Chair: T. M. Mak
Organizers: Anand Raghunathan, Tim Cheng
10.1 Random Limited-Scan to Improve Random Pattern Testing of Scan Circuits [p. 145]
Irith Pomeranz

We propose a method of random pattern generation for at-speed testing of circuits with scan. The proposed method uses limited scan operations to achieve complete fault coverage. Under a limited scan operation, the circuit state is shifted by a number of positions which may be smaller than the number of state variables. Limited scan operations are inserted randomly to ensure that the complete test set can be generated by a random pattern generator with simple control logic.

10.2 Test Volume and Application Time Reduction Through Scan Chain Concealment [p. 151]
Ismet Bayraktaroglu, Alex Orailoglu

A test pattern compression scheme is proposed in order to reduce test data volume and application time. The number of scan chains that can be supported by an ATE is significantly increased by utilizing an on-chip decompressor. The functionality of the ATE is kept intact by moving the decompression task to the circuit under test. While the number of virtual scan chains visible to the ATE is kept small, the number of internal scan chains driven by the decompressed pattern sequence can be significantly increased.

10.3 An Approach to Test Compaction for Scan Circuits that Enhances At-Speed Testing [p. 156]
Irith Pomeranz, Sudhakar M. Reddy

We propose a new approach to the generation of compact test sets for scan circuits. Compaction refers here to a reduction in the test application time. The proposed procedure generates an initial test set that is likely to have a low test application time. It then applies an existing static compaction procedure to this initial test set to further compact it. As a by-product, the proposed procedure also results in long primary input sequences, which are applied at-speed. This contributes to the detection of delay defects. We demonstrate through experimental results the advantages of this approach over earlier ones as a method for generating test sets with minimal test application time and long primary input sequences.

10.4 Generating Efficient Tests for Continuous Scan [p. 162]
Sying-Jyan Wang, Sheng-Nan Chiou

Conventional scan-based designs spend a lot of testing time in shifting test patterns and output responses, which greatly increases the testing cost. In this paper, we propose a modified approach for scan-based design in which a test is conducted in every clock cycle. This approach may significantly reduce the test application time when appropriate test vectors are applied. We develop algorithms to generate efficient test input for the test environment, and experimental results show that we can achieve high fault coverage with only about 10%-30% of the clock cycles required in conventional scan-based design.
Keywords Scan, DFT, Test Generation, Compression.

10.5 Combining Low-Power Scan Testing and Test Data Compression for System-on-a-Chip [p. 166]
Anshuman Chandra, Krishnendu Chakrabarty

We present a novel technique to reduce both test data volume and scan power dissipation using test data compression for system-on-a-chip testing. Power dissipation during test mode using ATPG-compacted test patterns is much higher than during functional mode. We show that Golomb coding of precomputed test sets leads to significant savings in peak and average power, without requiring either a slower scan clock or blocking logic in the scan cells. We also improve upon prior work on Golomb coding by showing that a separate cyclical scan register is not necessary for pattern decompression. Experimental results for the larger ISCAS 89 benchmarks show that reduced test data volume and low-power scan testing can indeed be achieved in all cases.
Keywords Embedded core testing, Golomb codes, precomputed test sets, scan testing, switching activity, test set encoding.
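To make the coding idea concrete, here is a minimal Golomb encoder for runs of 0s terminated by a 1; the group size m, the treatment of a trailing run, and the fully specified input are simplifying assumptions, not the authors' exact scheme.

```python
def golomb_encode_runs(bits, m=4):
    """Encode each run of 0s (terminated by a 1) as a unary quotient, a '0'
    separator, and a log2(m)-bit remainder; m must be a power of two here."""
    assert m & (m - 1) == 0
    k = m.bit_length() - 1
    out, run = [], 0
    for b in bits:
        if b == 0:
            run += 1
        else:                            # a '1' ends the current run of 0s
            out.append("1" * (run // m) + "0" + format(run % m, f"0{k}b"))
            run = 0
    return "".join(out)                  # a trailing run of 0s is ignored in this sketch

raw = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
code = golomb_encode_runs(raw, m=4)
print(len(raw), "bits ->", len(code), "bits:", code)   # 19 bits -> 12 bits
```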


Session 11: Panel: Your Core - My Problem? Integration and Verification of IP [p. 170]

Chair: Gabe Moretti, Tim Hopes, Ramesh Narayanaswamy
Organizers: Nanette Collins, Dave Kelf, Gabe Moretti, Tim Hopes, Ramesh Narayanaswamy
Panel Members: Tom Anderson, Janick Bergeron, Ashish Dixit, Peter Flake, Tim Hopes, Ramesh Narayanaswamy
As the popularity of reusing existing designs - or Intellectual Property (IP) - continues to grow, design challenges escalate. The most time-consuming and critical part of IP design and reuse is verifying that it will work as it was designed to and as the user intends. Designers are pushing the limits of IP for new, distinctive and innovative applications. With this innovation come problems that need creative solutions. Product verification, for example, will become more and more important in ensuring the correctness of the design. Over the years, various solutions have come on the market, all seemingly useful, but none reducing the time or manpower it takes to verify the design. With designs becoming increasingly more complex with each new project and verification consuming up to 70 percent of a design cycle, something must be done to alleviate the bottleneck. Most of today's verification techniques rely on old simulation and emulation technologies, combined with add-on products designed to target specific functional items, facilitated by the increased importance of the functionality they provide. These environments have led to an overall degradation in productivity, with a decrease in tool speed and a sharp rise in learning curve and installation issues. In addition, interaction between add-on products created in isolation leads to further complications, usually discovered as products are incorporated in design flows. An improved verification flow is required to provide high-level productivity improvement over the entire design. The larger and more complex the design, the higher the probability of errors slipping through the verification process, making System-On-Chip (SOC) devices the most vulnerable. With the integration of entire systems within single chips, the need to test hardware and software before the circuitry is produced, within as natural an environment as possible, is critical to ensuring design success. The most important aspect in the selection and verification of IP is the collaboration of the vendor and the foundry. Designers need to be able to evaluate the core before a final selection is made. The evaluation should not just use the testbench provided by the vendor, but should provide an indication of the behavior in the intended use environment. In parallel, the designer must look at fabrication options by obtaining the core's fabrication profile. How many foundries have certified the core? How many times has the core been used in previous designs from each foundry? And, more important, is the core certified by the foundry chosen for the ASIC under development? With microprocessor or microcontroller IP, the designer might have projected the use of an off-the-shelf RTOS. In this case, it is imperative to make sure that either the IP or RTOS has been used successfully already, or that the software vendor and the hardware vendor are committed to ensure proper integration in a timely manner. Most of this work is tedious and costly because it must take place before final selection and business negotiations can take place. Once the IP has been chosen, user and vendor must work as a team, since bugs can be found when a new set of tests and a new use methodology become available. Those who assume that IP commerce is similar to standard parts commerce are mistaken and are apt to encounter serious obstacles to IP integration.
An SOC must be verified at various levels of abstraction: functional, to check the correctness of the system design; RTL, to ensure that what is going to be synthesized is correct; and gate, for timing accuracy and possibly to check signal integrity and power consumption. Many verification improvements have been proposed which target specific parts of the design flow. The panel will look at various alternatives for IP verification today, including universal simulation, hardware acceleration, formal verification and semi-formal verification. Functional verification, used to verify that the implementation meets the specification, requires a significant amount of time because of the large number of test vectors required to meaningfully test all of the functionality of a complex system, including the interfaces among the various logical blocks that make up the system. Traditional HDL simulation falls short of the required capabilities when Verilog is used because the language lacks appropriate behavioral constructs. Although VHDL provides the required constructs, its popularity is more limited, so C-based dialects or new languages are required. Being able to develop a testbench is only the first of many obstacles. The amount of time required to execute a full suite of functional simulations is significant, even if processor speeds continue to increase. Simulation farms are becoming a popular tool to decrease functional verification time. Moving from behavioral to RTL code is the step in the methodology that has the least amount of tool support. Verifying the RTL implementation of a system is crucial. Given the size of today's designs, traditional simulation methods are running out of steam. Formal verification and hardware acceleration methods are being adopted and improved, respectively. EDA tools implementing formal methods have seen a resurgence in the last year, due to improved user interfaces as well as algorithms implemented to prove the equivalence between an RTL and a gate-level representation of a system. Emulation is also becoming a more popular method to verify gate-level implementations, although the initial cost of the equipment is still quite high. Panelists, experienced designers and representatives of EDA tool providers, as well as IP providers, will explore ways to beat the verification bottleneck and to identify the methodology best suited for IP design. They will attempt to answer the question, "What methodology works best for IP Design?"


Session 12: Configurable Computing: Reconfiguring the Industry

Chair: Diederik Verkest
Organizers: Diederik Verkest, Kurt Keutzer
12.1 A Quick Safari Through the Reconfiguration Jungle [p. 172]
Patrick Schaumont, Ingrid Verbauwhede, Kurt Keutzer, Majid Sarrafzadeh

Cost effective systems use specialization to optimize factors such as power consumption, processing throughput, flexibility or combinations thereof. Reconfigurable systems obtain this specialization at run-time. System reconfiguration has a vertical, a horizontal and a time dimension. We organize this design space as the reconfiguration hierarchy, and discuss the design methods that deal with it. Finally, we survey existing commercial platforms that support reconfiguration and situate them in the reconfiguration jungle.

12.2 Re-Configurable Computing in Wireless [p. 178]
Bill Salefski, Levent Caglar

Wireless communications requires a new approach to implementing the algorithms for new standards. The computational demands of these standards are outstripping the ability of traditional signal processors, and standards are changing too quickly for traditional hardware implementation. In this paper we outline how reconfigurable processing can meet the needs of wireless base station design while providing the programmability to allow not just field upgrades as standards evolve, but also adaptation to much more dynamic factors in the environment such as traffic mix and the weather. We outline how a designer works with the Chameleon reconfigurable processor from algorithm design to prototyping on a development module.

12.3 Hardware/Software Instruction Set Configurability for System-on-Chip Processors [p. 184]
Albert Wang, Earl Killian, Dror Maydan, Chris Rowen

New application-focused system-on-chip platforms motivate new application-specific processors. Configurable and extensible processor architectures offer the efficiency of tuned logic solutions with the flexibility of standard high-level programming methodology. Automated extension of processor function units and the associated software environment (compilers, debuggers, simulators and real-time operating systems) satisfies these needs. At the same time, designing at the level of software and instruction set architecture significantly shortens the design cycle and reduces verification effort and risk. This paper describes the key dimensions of extensibility within the processor architecture, the instruction set extension description language and the means of automatically extending the software environment from that description. It also describes two groups of benchmarks, EEMBC's Consumer and Telecommunications suites, that show 20 to 40 times acceleration of a broad set of algorithms through application-specific instruction set extension, relative to high performance RISC processors.


Session 13: Interconnect Design Optimization

Chair: Martin D. F. Wong
Organizers: Jason Cong, Louis Scheffer, Patrick Groeneveld
13.1 A Practical Methodology for Early Buffer and Wire Resource Allocation [p. 189]
Charles J. Alpert, Jiang Hu, Sachin S. Sapatnekar, Paul G. Villarrubia

The dominating contribution of interconnect to system performance has made it critical to plan for buffer and wiring resources in the layout. Both buffers and wires must be considered, since wire routes determine buffer requirements and buffer locations constrain wire routes. In contrast to recent buffer block planning approaches, our design methodology distributes buffer sites throughout the layout. A tile graph is used to abstract the buffer planning problem while also addressing wire planning. We present a four-stage heuristic called RABID for resource allocation and experimentally verify its effectiveness.

13.2 Creating and Exploiting Flexibility in Steiner Trees [p. 194]
Elaheh Bozorgzadeh, Ryan Kastner, Majid Sarrafzadeh

This paper presents the concept of flexibility, a geometric property associated with Steiner trees. Flexibility is related to the routability of the Steiner tree. We present an optimal algorithm which takes a Steiner tree and outputs a more flexible Steiner tree. Our experiments show that a net with a flexible Steiner tree increases its routability. Experiments with a global router show that congestion is improved by approximately 20%.

13.3 Simultaneous Shield Insertion and Net Ordering under Explicit RLC Noise Constraint [p. 199]
Kevin M. Lepak, Irwan Luwandi, Lei He

For multiple coupled RLC nets, we formulate the min-area simultaneous shield insertion and net ordering (SINO/NB-v) problem to satisfy the given noise bound. We develop an efficient and conservative model to compute the peak noise, and apply the noise model to a simulated-annealing (SA) based algorithm for the SINO/NB-v problem. Extensive and accurate experiments show that the SA-based algorithm is efficient, and always achieves solutions satisfying the given noise bound. It uses up to 71% and 30% fewer shields when compared to a greedy based shield insertion algorithm and a separated shield insertion and net ordering algorithm, respectively. To the best of our knowledge, it is the first work that presents an in-depth study on the min-area SINO problem under an explicit noise constraint.

13.4 On Optimum Switch Box Designs for 2-D FPGAs [p. 203]
Hongbing Fan, Jiping Liu, Yu-Liang Wu, Chak-Chung Cheung

An FPGA switch box is said to be universal (hyper-universal) if it can detail-route all possible surrounding 2-pin (multi-pin) net topologies satisfying the global routing density constraints. A switch box is optimum if it is hyper-universal and the number of switches inside it is minimum. It has been shown that if the net topology is restricted to 2-pin nets, then a 2-D (4-way) switch box can be built to be universal with only 6W switches, where W is the global routing channel density. As the routing resource is relatively expensive in FPGA chips, the study of optimum switch box designs is clearly a topic with theoretical and commercial value for reducing silicon cost. A previous work constructed a formal mathematical model of this optimum design problem for switch boxes with arbitrary dimensions, and gave a scheme to produce hyper-universal designs with fewer than 6.7W switches for 4-way FPGA switch boxes. In this paper, we further investigate this most common 4-way switch box case, and give new theoretical results followed by extensive experimental justification. The results seem to be quite attractive. We show that such an optimum switch box can be built with a very low number of additional switches beyond 6W for today's practical range of low W's (e.g., just 6W plus 1 or 2 additional switches for W's up to 7). Even for arbitrarily large W's, the bound can be shown to be under 6.34W. To make experimental comparisons, we run today's published best FPGA router, VPR, on large benchmarks for the popular Disjoint structure and our proposed designs. The results are quite encouraging.


Session 14: Power Estimation Techniques

Chair: Massoud Pedram
Organizers: Donatella Sciuto, Kazutoshi Wakabayashi
14.1 Dependency Preserving Probabilistic Modeling of Switching Activity using Bayesian Networks [p. 209]
Sanjukta Bhanja, N. Ranganathan

We propose a new switching probability model for combinational circuits using a Logic-Induced Directed Acyclic Graph (LIDAG) and prove that such a graph corresponds to a Bayesian network guaranteed to map all the dependencies inherent in the circuit. Switching activity can then be estimated by capturing complex dependencies (spatio-temporal and conditional) among signals efficiently by local message-passing based on the Bayesian network. Switching activity estimation of ISCAS and MCNC circuits with random input streams yields high accuracy (average mean error = 0.002) and low computational time (average time = 3.93 seconds).

14.2 A Static Estimation Technique of Power Sensitivity in Logic Circuits [p. 215]
Taewhan Kim, Ki-Seok Chung, C. L. Liu

In this paper, we study a new problem of statically estimating the power sensitivity of a given logic circuit with respect to the primary inputs. The power sensitivity defines the characteristics of power dissipation due to changes in state of primary inputs. Consequently, estimating the power sensitivity among the inputs is essential not only to measure the power consumption of the circuit efficiently but also to provide potential opportunities of redesigning the circuit for low power. In this context, we propose a fast and reliable static estimation technique for power sensitivity based on a new concept called power equations, which are then collectively transformed into a table called power table. Experimental data on MCNC benchmark examples show that the proposed technique is useful and effective in estimating power consumption. In summary, the relative error for the estimation of maximum power consumption is 9.4% with a huge speed-up in simulation.

14.3 JouleTrack - A Web Based Tool for Software Energy Profiling [p. 220]
Amit Sinha, Anantha P. Chandrakasan

A software energy estimation methodology is presented that avoids explicit characterization of instruction energy consumption and predicts energy consumption to within 3% accuracy for a set of benchmark programs evaluated on the StrongARM SA-1100 and Hitachi SH-4 microprocessors. The tool, JouleTrack, is available as an online resource and has various estimation levels. It also isolates the switching and leakage components of the energy consumption.
Keywords Software energy, leakage estimation, instruction energy
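As background, a first-order split of software energy into switching and leakage components might look like the sketch below; the model, parameter names, and numbers are illustrative assumptions and not JouleTrack's calibrated models.

```python
def software_energy(cycles, f_hz, vdd, c_eff, i_leak):
    """Return (switching_energy, leakage_energy) in joules for a program run,
    using one average switched capacitance per cycle rather than per-instruction
    characterization."""
    t_run = cycles / f_hz
    e_switch = c_eff * vdd ** 2 * cycles     # dynamic: C*Vdd^2 summed over cycles
    e_leak = i_leak * vdd * t_run            # static: leakage power over the runtime
    return e_switch, e_leak

# Hypothetical numbers: 50M cycles at 200 MHz, 1.5 V, 0.7 nF switched per cycle,
# 10 mA average leakage current.
print(software_energy(50e6, 200e6, 1.5, 0.7e-9, 10e-3))
```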


Session 15: Functional Validation Based on Boolean Reasoning (BDD, SAT)

Chair: Limor Fix
Organizers: Masahiro Fujita, Timothy Kam
15.1 Effective Use of Boolean Satisfiability Procedures in the Formal Verification of Superscalar and VLIW Microprocessors [p. 226]
Miroslav N. Velev, Randal E. Bryant

We compare SAT-checkers and decision diagrams on the evaluation of Boolean formulas produced in the formal verification of both correct and buggy versions of superscalar and VLIW microprocessors. We identify one SAT-checker that significantly outperforms the rest. We evaluate ways to enhance its performance by variations in the generation of the Boolean correctness formulas. We reassess optimizations previously used to speed up the formal verification and probe future challenges.

15.2 Circuit-based Boolean Reasoning [p. 232]
Andreas Kuehlmann, Malay K. Ganai, Viresh Paruthi

Many tasks in CAD, such as equivalence checking, property checking, logic synthesis, and false paths analysis require efficient Boolean reasoning for problems derived from circuit structures. Traditionally, canonical representations, e.g., BDDs, or SAT-based search methods are used to solve a particular class of problems. In this paper we present a combination of techniques for Boolean reasoning based on BDDs, structural transformations, and a SAT procedure natively working on a shared graph representation of the problem. The described intertwined integration of the three techniques results in a robust summation of their orthogonal strengths. Our experiments demonstrate the effectiveness of the approach.
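One of the structural techniques alluded to, building the problem as a shared AND-inverter graph with structural hashing so identical subcircuits are created only once, can be sketched as follows; this shows the generic technique, not the authors' data structure.

```python
class AIG:
    """Minimal AND-inverter graph with structural hashing.  A literal is a node
    id shifted left by one; bit 0 marks an inverted edge (literal 0 = constant 0)."""

    def __init__(self):
        self.next_id = 1
        self.strash = {}                     # (lit_a, lit_b) -> existing AND literal

    def new_var(self):
        self.next_id += 1
        return self.next_id << 1             # positive-phase literal for a new input

    def AND(self, a, b):
        if a > b:
            a, b = b, a                      # canonical operand order
        if a == 0 or a == (b ^ 1):           # x AND 0 = 0,  x AND NOT x = 0
            return 0
        if a == 1 or a == b:                 # x AND 1 = x,  x AND x = x
            return b
        key = (a, b)
        if key not in self.strash:           # reuse structurally identical nodes
            self.next_id += 1
            self.strash[key] = self.next_id << 1
        return self.strash[key]

g = AIG()
x, y = g.new_var(), g.new_var()
assert g.AND(x, y) == g.AND(y, x)            # the same node is returned, not duplicated
```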

15.3 Checking Equivalence for Partial Implementations [p. 238]
Christoph Scholl, Bernd Becker

We consider the problem of checking whether a partial implementation can (still) be extended to a complete design which is equivalent to a given full specification. Several algorithms trading off accuracy and computational resources are presented: Starting with a simple 0,1,X-based simulation, which allows approximate solutions, but is not able to find all errors in the partial implementation, we consider more and more exact methods finally covering all errors detectable in the partial implementation. The exact algorithm reports no error if and only if the current partial implementation conforms to the specification, i.e. it can be extended to a full implementation which is equivalent to the specification. We give a series of experimental results demonstrating the effectiveness and feasibility of the methods presented.
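The 0,1,X-based simulation mentioned above evaluates the design with unknown (X) values driven at the outputs of the unfinished blocks; controlling values still decide gate outputs, so some mismatches against the specification can be detected early. A minimal sketch of the three-valued gate evaluation (illustrative only):

```python
X = "X"                                   # unknown value from the unimplemented parts

def t_not(a):
    return X if a == X else 1 - a

def t_and(a, b):
    if a == 0 or b == 0:                  # a controlling 0 decides the output under X
        return 0
    return 1 if (a, b) == (1, 1) else X

def t_or(a, b):
    if a == 1 or b == 1:                  # a controlling 1 decides the output under X
        return 1
    return 0 if (a, b) == (0, 0) else X

print(t_and(0, X), t_or(X, 1), t_not(X))  # -> 0 1 X
```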


Session 16: Verification: Life Beyond Algorithms

Chair and Organizer: Carl Pixley
16.1 Validating the Intel® Pentium® 4 Microprocessor [p. 244]
Bob Bentley

Developing a new leading edge IA-32 microprocessor is an immensely complicated undertaking. In the case of the Pentium® 4 processor, the microarchitecture is significantly more complex than any previous IA-32 microprocessor and the implementation borrows almost nothing from any previous implementation. This paper describes how we went about the task of finding bugs in the Pentium® 4 processor design prior to initial silicon, and what we found along the way.
General Terms Management, Verification.

16.2 Nuts and Bolts of Core and SoC Verification [p. 249]
Ken Albin

Digital design at Motorola is performed at design centers throughout the world, on projects with different design objectives, executed on different time scales, by different sized teams with different skill sets. This paper attempts to categorize these diverse efforts and identify common threads: what works, what the challenges are, and where we need to go.
Categories and Subject Descriptors B.5.2 [Register-Transfer-Level Implementation]: Design Aids - simulation, verification. B.6.3 [Logic Design]: Design Aids - simulation, verification. B.7.2 [Integrated Circuits]: Design Aids - simulation, verification.
General Terms Management, Documentation, Design, Economics, Verification.
Keywords Verification, monitors, biased-random simulation, testbenches.

16.3 Teaching Future Verification Engineers: The Forgotten Side of Logic Design [p. 253]
Fusun Ozguner, Duane Marhefka, Joanne DeGroat, Bruce Wile, Jennifer Stofer, Lyle Hanrahan

This paper describes a senior/graduate level course in hardware logic verification being offered by The Ohio State University in cooperation with IBM. The need for the course is established through the growing importance of logic verification to users of custom logic designs. We discuss the short-term and long-term goals for the course, and describe the course content and format. The course relies heavily on lab projects to illustrate the main concepts. Three projects and a final project review are described.


Session 17: Dissecting an Embedded System: Lessons from Bluetooth

Chair: Jan Rabaey
Organizers: Jan Rabaey, Anantha Chandrakasan
17.1 SoC Integration of Reusable Baseband Bluetooth IP [p. 256]
Torbjörn Grahm

This presentation gives a list of design criteria that an ASIC design house needs to consider in the process of deciding whether to take the complex Bluetooth specification and implement everything from scratch, or to integrate reusable Intellectual Property into its SoC. The presentation also includes experience from a typical embedded development project in which reusable Bluetooth baseband Intellectual Property for both HW and SW is used, with the Bluetooth technology from Ericsson as an example. This paper is a compressed summary.

17.2 One-chip Bluetooth ASIC Challenges [p. 262]
Paul T. M. van Zeijl

This paper describes the constraints, trade-offs and challenges of designing a one-chip solution (RF/radio plus baseband integrated on the same Si die) compared to designing a separate RF/radio ASIC.


Session 18: Algorithmic and Compiler Transformations for High-Level Synthesis

Chair: Hiroto Yasuura
Organizers: Kazutoshi Wakabayashi, Rajesh K. Gupta
18.1 Transformations for the Synthesis and Optimization of Asynchronous Distributed Control [p. 263]
Michael Theobald, Steven M. Nowick

Asynchronous design has been the focus of renewed interest. However, a key bottleneck is the lack of high-quality CAD tools for the synthesis of large-scale systems which also allow design-space exploration. This paper proposes a new synthesis method to address this issue, based on transformations. The method starts with a scheduled and resource-bounded Control-Data Flow Graph (CDFG). Global transformations are first applied to the entire CDFG, unoptimized controllers are then extracted, and, finally, local transforms are applied to the individual controllers. The result is a highly-optimized set of interacting distributed controllers. The new transforms include aggressive timing and area-oriented optimizations, several of which have not been previously supported by existing asynchronous CAD tools. As a case study, the method is applied to the well-known differential equation solver synthesis benchmark. Results comparable to a highly-optimized manual design by Yun et al. [26] can be obtained by applying the new automated transformations. Such an implementation cannot be obtained using existing asynchronous CAD tools.

18.2 Speculation Techniques for High Level Synthesis of Control Intensive Designs [p. 269]
Sumit Gupta, Nick Savoiu, Sunwoo Kim, Nikil Dutt, Rajesh K. Gupta, Alex Nicolau

The quality of synthesis results for most high level synthesis approaches is strongly affected by the choice of control flow (through conditions and loops) in the input description. In this paper, we explore the effectiveness of various types of code motions, such as moving operations across conditionals, out of conditionals (speculation) and into conditionals (reverse speculation), and how they can be effectively directed by heuristics so as to lead to improved synthesis results in terms of fewer execution cycles and fewer states in the finite state machine controller. We also study the effects of the code motions on the area and latency of the final synthesized netlist. Based on speculative code motions, we present a novel way to perform early condition execution that leads to significant improvements in highly control-intensive designs. Overall, reductions of up to 38% in execution cycles are obtained with all the code motions enabled.

18.3 Parallelizing DSP Nested Loops on Reconfigurable Architectures using Data Context Switching [p. 273]
Kiran Bondalapati

Reconfigurable architectures promise significant performance and flexibility advantages over conventional architectures. Automatic mapping techniques that exploit the features of the hardware are needed to leverage the power of these architectures. In this paper, we develop techniques for parallelizing nested loop computations from digital signal processing (DSP) applications onto high performance pipelined configurations. We propose a novel data context switching technique that exploits the embedded distributed memory available in reconfigurable architectures to parallelize such loops. Our technique is demonstrated on two diverse state-of-the-art reconfigurable architectures, namely, Virtex and the Chameleon Systems Reconfigurable Communications Processor. Our techniques show significant performance improvements on both architectures and also perform better than state-of-the-art DSP and microprocessor architectures.
Keywords Configurable Computing, Mapping Techniques, Loops

18.4 Using Symbolic Algebra in Algorithmic Level DSP Synthesis [p. 277]
Armita Peymandoust, Giovanni De Micheli

Current multimedia applications require the design of datapath-intensive circuits. Unfortunately, current design tools and methods support design abstraction at a level lower than designers expect; in particular, most arithmetic-level optimizations are not supported and are left to the designers' ingenuity. In this paper, we show how symbolic algebra can be used to construct an arithmetic-level decomposition algorithm. We also introduce our tool, SymSyn, that performs arithmetic library mapping and optimization of data-flow descriptions into data paths using arithmetic components.


Session 19: Gate Delay Calculation

Chair: Satya Pullela
Organizers: Jaijeet Roychowdhury, Joel Phillips
19.1 Computing Logic-Stage Delays Using Circuit Simulation and Symbolic Elmore Analysis [p. 283]
Clayton B. McDonald, Randal E. Bryant

The computation of logic-stage delays is a fundamental sub-problem for many EDA tasks. Although accurate delays can be obtained via circuit simulation, we must estimate the input assignments that will maximize the delay. With conventional methods, it is not feasible to estimate the delay for all input assignments on large sub-networks, so previous approaches have relied on heuristics. We present a symbolic algorithm that enables efficient computation of the Elmore delay under all input assignments and delay refinement using circuit simulation. We analyze the Elmore estimate with three metrics using data extracted from symbolic timing simulations of industrial circuits.
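For readers unfamiliar with the Elmore metric itself, the sketch below computes it numerically on a small RC tree. The tree encoding (parent pointers with parents numbered before children, per-branch resistance, per-node capacitance) is an assumption for illustration; the paper's symbolic, all-input-assignment computation is not reproduced here.

```python
# A minimal numeric Elmore-delay sketch for an RC tree (illustration only).

def elmore_delay(parent, R, C, sink):
    """Elmore delay from the root to `sink` of an RC tree.

    parent[i] -- parent of node i (None for the root), with parent index < i
    R[i]      -- resistance of the branch from parent[i] down to node i
    C[i]      -- capacitance attached at node i
    """
    # Total capacitance in the subtree rooted at each node.
    cdown = list(C)
    for i in reversed(range(len(parent))):
        if parent[i] is not None:
            cdown[parent[i]] += cdown[i]
    # Elmore delay: sum of R * downstream-C over the branches on the root->sink path.
    delay, n = 0.0, sink
    while parent[n] is not None:
        delay += R[n] * cdown[n]
        n = parent[n]
    return delay

# Three-node chain 0 -> 1 -> 2: 1 kOhm per branch, 10 fF per node.
# Expected: R1*(C1+C2) + R2*C2 = 1e3*20fF + 1e3*10fF = 30 ps.
print(elmore_delay([None, 0, 1], [0.0, 1e3, 1e3], [10e-15, 10e-15, 10e-15], 2))
```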

19.2 A New Gate Delay Model for Simultaneous Switching and Its Applications [p. 289]
Liang-Chi Chen, Sandeep K. Gupta, Melvin A. Breuer

We present a new model to capture the delay phenomena associated with simultaneous to-controlling transitions. The proposed delay model accurately captures the effect of the targeted delay phenomena over a wide range of transition times and skews. It also captures the effects of more variables than table lookup methods can handle. The model helps improve the accuracy of static timing analysis, incremental timing refinement, and timing-based ATPG.

19.3 Static Timing Analysis Including Power Supply Noise Effect on Propagation Delay in VLSI Circuits [p. 295]
Geng Bai, Sudhakar Bobba, Ibrahim N. Hajj

This paper presents techniques to include the effect of supply voltage noise on the circuit propagation delay of a digital VLSI circuit. The proposed methods rely on an input-independent approach to calculate the logic gate's worst-case power supply noise. A quasistatic timing analysis is then applied to derive a tight upper-bound on the delay for a selected path with power supply noise effects. This upper-bound can be further reduced by considering the logic constraints and dependencies in the circuit. Experimental results for ISCAS-85 benchmark circuits are presented using the techniques described in the paper. HSPICE simulation results are also used to validate our work.


Session 20: Memory, Bus and Current Testing

Chair: Magdy S. Abadir
Organizers: Irith Pomeranz, Tim Cheng
20.1 Simulation-Based Test Algorithm Generation and Port Scheduling for Multi-Port Memories [p. 301]
Chi-Feng Wu, Chih-Tsun Huang, Kuo-Liang Cheng, Chih-Wea Wang, Cheng-Wen Wu

The paper presents a simulation-based test algorithm generation and test scheduling methodology for multi-port memories. The purpose is to minimize the testing time while keeping the test algorithm in a simple and regular format for easy test generation, fault diagnosis, and built-in self-test (BIST) circuit implementation. Conventional functional fault models are used to generate tests covering most defects. In addition, multi-port specific defects are covered using structural fault models. Port-scheduling is introduced to take advantage of the inherent parallelism among different ports. Experimental results for commonly used multi-port memories, including dual-port, four-port, and n-read-1-write memories, have been obtained, showing that efficient test algorithms can be generated and scheduled to meet different test bandwidth constraints. Moreover, memories with more ports benefit more with respect to testing time.
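For background only, the sketch below shows a classic single-port March C- test, the style of simple, regular test algorithm that the simulation-based generator above builds on. The multi-port scheduling of the paper is not reproduced; a Python list stands in for a word-addressed memory under test.

```python
# Classic March C- on a simple memory model (illustration only).

def march_c_minus(mem):
    n = len(mem)
    def w(addr, v): mem[addr] = v
    def r(addr, expect):
        if mem[addr] != expect:
            raise AssertionError(f"fault at address {addr}")
    up, down = range(n), range(n - 1, -1, -1)
    for a in up:   w(a, 0)                      # {up}(w0)
    for a in up:   r(a, 0); w(a, 1)             # {up}(r0, w1)
    for a in up:   r(a, 1); w(a, 0)             # {up}(r1, w0)
    for a in down: r(a, 0); w(a, 1)             # {down}(r0, w1)
    for a in down: r(a, 1); w(a, 0)             # {down}(r1, w0)
    for a in down: r(a, 0)                      # {down}(r0)

march_c_minus([0] * 16)   # passes silently on a fault-free memory model
```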

20.2 Improving Bus Test Via IDDT and Boundary Scan [p. 307]
Shih-yu Yang, Christos A. Papachristou, Massood Tabib-Azar

This paper presents a systematic test methodology targeting bus-line interconnect defects using IDDT testing and Boundary Scan. Traditional testing is unable to detect all possible defects, especially timing-related faults. Open and short defects on interconnects between embedded modules can be detected by IDDT testing. Boundary Scan can provide accessibility to internal buses. A statistical analysis is presented discussing the uncertain factors due to process variations and power fluctuation. The effectiveness of the proposed technique on shorts, opens, and other non-stuck-at fault defects is also illustrated.
Keywords. IDDT, Current Test, Boundary Scan, Interconnect.

20.3 Fault Characterizations and Design-for-Testability Technique for Detecting IDDQ Faults in CMOS/BiCMOS Circuits [p. 313]
Kaamran Raahemifar, Majid Ahmadi

This paper provides the results of a simulation-based fault characterization study of CMOS/BiCMOS logic families. We show that most shorts cause IDDQ faults, while open defects result in delay or stuck-open faults. We propose a design-for-testability technique for detecting short and bridging faults in CMOS/BiCMOS logic circuits. The impact of this circuit modification on the behavior of the circuit in normal mode is investigated.

20.4 Testing for Interconnect Crosstalk Defects Using On-Chip Embedded Processor Cores [p. 317]
Li Chen, Xiaoliang Bai, Sujit Dey

Crosstalk effects degrade the integrity of signals traveling on long interconnects and must be addressed during manufacturing testing. External testing for crosstalk is expensive due to the need for high-speed testers. Built-in self-test, while eliminating the need for a high-speed tester, may lead to excessive test overhead as well as overly aggressive testing. To address this problem, we propose a new software-based self-test methodology for systems-on-chip (SoCs) based on embedded processors. It enables an on-chip embedded processor core to test for crosstalk in system-level interconnects by executing a self-test program in the normal operational mode of the SoC. We have demonstrated the feasibility of this method by applying it to test the interconnects of a processor-memory system. The defect coverage was evaluated using a system-level crosstalk defect simulation method.


Session 21: (When) Will FPGAs Kill ASICs? [p. 321]

Chair and Organizer: Rob A. Rutenbar
Panel Members: Max Baron, Thomas Daniel, Rajeev Jayaraman, Zvi Or-Bach, Jonathan Rose, Carl Sechen
There was a time - in the dim historical past - when foundries actually made ASICs with only 5000 to 50,000 logic gates. But FPGAs and CPLDs conquered those markets and pushed ASIC silicon toward opportunities with more logic, volume, and speed. Today's largest FPGAs approach the few-million-gate size of a typical ASIC design, and continue to sprout embedded cores, such as CPUs, memories, and interfaces. And given the risks of nonworking nanometer silicon, FPGA costs and time-to-market are looking awfully attractive. So, will FPGAs kill ASICs? ASIC technologists certainly think not. ASICs are themselves sprouting patches of programmable FPGA fabric, and pushing new realms of size and especially speed. New tools claim to have tamed the convergence problems of older ASIC flows. Is the future to be found in a market full of FPGAs with ASIC-like cores? ASICs with FPGA cores? Other exotic hybrids? Our panelists will share their disagreements on these prognostications.


Session 22: Inductance 101 and Beyond

Chair: Phillip Restle
Organizers: Joel Phillips, Noel Menezes
22.1 Inductance 101: Modeling and Extraction [p. 323]
Michael W. Beattie, Lawrence T. Pileggi

Modeling magnetic interactions for on-chip interconnect has become an issue of great interest for integrated circuit design in recent years. This tutorial paper describes the basic concepts of magnetic interaction, loop and partial inductance, along with some of the high frequency effects such as skin and proximity effect.
Index Terms: Inductance, Magnetic Interaction, Interconnect Modeling.

22.2 Inductance 101: Analysis and Design Issues [p. 329]
Kaushik Gala, David Blaauw, Junfeng Wang, Vladimir Zolotov, Min Zhao

With operating frequencies approaching the gigahertz range, inductance is becoming an increasingly important consideration in the design and analysis of on-chip interconnect. In this paper, we give a tutorial overview of the analysis and design issues related to on-chip inductance effects. We explain the complexity of the current flow in VLSI circuits. We discuss the applicability of the PEEC approach in a detailed circuit model of the signal and power grid interconnect, switching devices, power pads and the package. Further, we explain techniques that can be used to speed up simulation of the large PEEC model. We then discuss a simplified model that uses the so-called loop inductance approach, and compare it with the detailed model. We present experimental results, obtained from simulations of industrial circuits, for both the PEEC and loop models. We also cover design techniques that can help tackle the on-chip inductance issues.

22.3 Modeling Magnetic Coupling for On-Chip Interconnect [p. 335]
Michael W. Beattie, Lawrence T. Pileggi

As advances in IC technologies and operating frequencies make the modeling of on-chip magnetic interactions a necessity, it is apparent that extension of traditional inductance extraction approaches to full-chip scale problems is impractical. There are primarily two obstacles to performing inductance extraction with the same efficacy as full-chip capacitance extraction: 1) neglecting far-away coupling terms can generate an unstable inductance matrix approximation; and 2) the penetrating nature of inductance makes localized extraction via windowing extremely difficult. In this paper we propose and contrast three new options for stable and accurate window-based extraction of large-scale magnetic coupling. We analyze the required window sizes to consider the possibilities for pattern-matching style solutions, and propose three schemes for determining coupling values and window sizing for extraction via on-the-fly field solution.
Index Terms: Inductance, Susceptance, Magnetic Interaction, Interconnect Modeling.

22.4 Min/max On-Chip Inductance Models and Delay Metrics [p. 341]
Yi-Chang Lu, Mustafa Celik, Tak Young, Lawrence T. Pileggi

This paper proposes an analytical inductance extraction model for characterizing min/max values of typical on-chip global interconnect structures, and a corresponding delay metric that can be used to provide RLC delay prediction from physical geometries. The model extraction and analysis is efficient enough to be used within optimization and physical design exploration loops. The analytical min/max inductance approximations also provide insight into the effects caused by inductances.
Keywords Inductance, rlc tree, delay, on-chip interconnect modeling, physical design.


Session 23: Memory Optimization Techniques for DSP Processors

Chair: Daniel Connors
Organizers: Dirk Grunwald, Marco Di Natale
23.1 Utilizing Memory Bandwidth in DSP Embedded Processors [p. 347]
Catherine H. Gebotys

This paper presents a network flow approach to solving the register binding and allocation problem for multiword memory access DSP processors. In recently announced DSP processors, such as Star*core, sixteen-bit instructions that simultaneously access four words from memory are supported. A polynomial-time network flow methodology is used to allocate multiword accesses while minimizing code size. Results show that improvements of up to 87% in terms of memory bandwidth (and up to 30% reduction in energy dissipation) are obtained compared to compiler-generated DSP code. This research is important for industry since this value-added technique can increase memory bandwidth and minimize code size without increasing cost.

23.2 Address Code Generation for Digital Signal Processors [p. 353]
Sathishkumar Udayanarayanan, Chaitali Chakrabarti

In this paper we propose a procedure to generate code with the minimum number of addressing instructions. We analyze different methods of generating addressing code for scalar variables and quantify the improvements due to optimizations such as offset assignment, modify register optimization, and address register assignment. We propose an offset assignment heuristic that uses k address registers, an optimal dynamic programming algorithm for modify register optimization, and an optimal formulation and a heuristic algorithm for the address register assignment problem.
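To make the offset assignment idea concrete, the sketch below implements a simple greedy heuristic for the single-address-register case, in the classic Liao style rather than the paper's k-register algorithm: variables accessed back-to-back should receive adjacent offsets so auto-increment/decrement addressing avoids explicit address-arithmetic instructions.

```python
# A hedged sketch of simple (single address register) offset assignment.
from collections import Counter

def offset_assignment_cost(access_seq):
    # Weight each variable pair by how often they are accessed back-to-back.
    pairs = Counter()
    for u, v in zip(access_seq, access_seq[1:]):
        if u != v:
            pairs[frozenset((u, v))] += 1
    # Greedily keep heavy edges while the kept edges still form simple paths
    # (degree <= 2, no cycles); each path becomes a run of adjacent offsets.
    chain_of = {v: {v} for v in set(access_seq)}
    degree, kept = Counter(), []
    for pair, _w in pairs.most_common():
        u, v = tuple(pair)
        if degree[u] < 2 and degree[v] < 2 and chain_of[u] is not chain_of[v]:
            kept.append(pair)
            degree[u] += 1
            degree[v] += 1
            merged = chain_of[u] | chain_of[v]
            for x in merged:
                chain_of[x] = merged
    # Cost: adjacent accesses not covered by a kept edge, each of which
    # would need an explicit address-register update instruction.
    return sum(w for p, w in pairs.items() if p not in kept)

print(offset_assignment_cost(list("abcadbcad")))   # -> 2 extra instructions
```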

23.3 Reducing Memory Requirements of Nested Loops for Embedded Systems [p. 359]
J. Ramanujam, Jinpyo Hong, Mahmut Kandemir, A. Narayan

Most embedded systems have a limited amount of memory. In contrast, the memory requirements of the code (in particular, loops) running on embedded systems are significant. This paper addresses the problem of estimating the amount of memory needed for transfers of data in embedded systems. The problem of estimating the region associated with a statement, or the set of elements referenced by a statement during the execution of the entire set of nested loops, is analyzed. A quantitative analysis of the number of elements referenced is presented; exact expressions for uniformly generated references and close upper and lower bounds for non-uniformly generated references are derived. In addition to presenting an algorithm that computes the total memory required, we discuss the effect of transformations on the lifetimes of array variables, i.e., the time between the first and last accesses to a given array location. A detailed analysis of the effect of unimodular transformations on data locality, including the calculation of the maximum window size, is discussed. The term maximum window size is introduced and quantitative expressions are derived to compute the window size. The smaller the value of the maximum window size, the higher the amount of data locality in the loop.

23.4 Detection of Partially Simultaneously Alive Signals in Storage Requirement Estimation for Data Intensive Applications [p. 365]
Per Gunnar Kjeldsberg, Francky Catthoor, Einar J. Aas

In this paper, we propose a novel storage requirement estimation methodology for use in the early system design phases when the data transfer ordering is only partially fixed. At that stage, none of the existing estimation tools are adequate, as they either assume a fully specified execution order or ignore it completely. Using representative application demonstrators, we show how our technique can effectively guide the designer to achieve a transformed specification with low storage requirement.


Session 24: Technology Dependent Logic Synthesis

Chair: Peichen Pan
Organizers: Jason Cong, Malgorzata Marek-Sadowska
24.1 A New Structural Pattern Matching Algorithm for Technology Mapping [p. 371]
Min Zhao, Sachin S. Sapatnekar

In this paper, a new structural matching algorithm for technology mapping is proposed. The algorithm is based on a key observation that the matches for a node in a subject Boolean network are related to the matches for its children. The structural relationships between the library cells are modeled using a lookup table. The proposed method is fast, has low memory usage, and is easy to implement. Experimental results show speedups of 20x over Matsunaga's fast mapping approach, and orders of magnitude over SIS, with the same or slightly better results, and much lower memory utilization.

24.2 Technology Mapping for SOI Domino Logic Incorporating Solutions for the Parasitic Bipolar Effect [p. 377]
Srirang K. Karandikar, Sachin S. Sapatnekar

We present a technology mapping algorithm for implementing a random logic gate network in domino logic. The target technology of implementation is Silicon on Insulator (SOI). SOI devices exhibit an effect known as Parasitic Bipolar Effect (PBE), which can lead to incorrect logic values in the circuit. Our algorithm solves the technology mapping problem by permitting several transformations during the mapping process in order to avoid the PBE, such as transistor reordering, altering the way transistors are organized into gates, and adding pmos discharge transistors. We minimize the total cost of implementation, which includes the discharge transistors required for correct functioning. Our algorithm generates solutions that reduce the number of discharge transistors needed by 44.23%, and reduces the size of the final solution by 11.66% on average.

24.3 Latency and Latch Count Minimization in Wave Steered Circuits [p. 383]
Amit Singh, Arindam Mukherjee, Malgorzata Marek-Sadowska

Wave Steering is a new design methodology that realizes high throughput circuits by embedding layout friendly synthesized structures in silicon. Wave Steered circuits inherently utilize latches in order to guarantee the correct signal arrival times at the inputs of these synthesized structures and maintain the high throughput of operation. In this paper, we show a method of reordering signals to achieve minimum circuit latency for Wave Steered circuits and propose an Integer Linear Programming (ILP) formulation for scheduling and retiming these circuits to minimize the number of latches for minimum latency. Experimental results show that in 0.25µm CMOS technology, as much as 33.2% reduction in latch count, at minimum latency, can be achieved over unoptimized Wave Steered circuits operating at 500 MHz.

24.4 Performance-Driven Multi-Level Clustering with Application to Hierarchical FPGA Mapping [p. 389]
Jason Cong, Michail Romesis

In this paper, we study the problem of performance-driven multi-level circuit clustering with application to hierarchical FPGA designs. We first show that the performance-driven multi-level clustering problem is NP-hard (in contrast to the fact that single-level performance-driven clustering can be solved optimally in polynomial time). Then, we present an efficient heuristic for two-level clustering for delay minimization. It can also provide area-delay trade-offs by controlling the amount of node duplication. The algorithm is applied to Altera's latest APEX FPGA architecture, which has a two-level hierarchy. Experimental results with combinational circuits show that with our performance-driven two-level clustering solution we can improve the circuit performance produced by the Quartus Design System from Altera by an average of 15% for APEX devices, measured in terms of delay after final layout. To our knowledge this is the first in-depth study of the performance-driven multi-level circuit clustering problem.


Session 25: Collaborative and Distributed Design Frameworks

Chair: Satish Venkatesan
Organizers: Noel Menezes, Vivek Tiwari
25.1 Application of Constraint-Based Heuristics in Collaborative Design [p. 395]
Juan A. Carballo, Stephen W. Director

Significant acceleration of today's complex collaborative design processes can be achieved if team members are able to apply search heuristics that consider the simultaneous effect of all design constraints. We present the Active approach to Design Process Management (ADPM), whereby designers receive constraint-based feedback that enables them to apply these search heuristics effectively. To evaluate ADPM, we developed a design process evaluation environment called TeamSim. Evaluation results suggest that ADPM can reduce costly design iterations at the expense of extra, less costly, verification tool executions.

25.2 A Universal Client for Distributed Networked Design and Computing [p. 401]
Franc Brglez, Hemang Lavana

We introduce a universal client (OmniFlow) whose GUI can be readily configured by the user to invoke any number of applications, concurrently or sequentially, anywhere on the network. The design and implementation of the client are based on the principles of taskflow-oriented programming, whereby we merge concepts from structured programming, hardware description, and mark-up languages. A mark-up language such as XML supports a well-defined schema that captures the decomposition of a program into a hierarchy of tasks, each representing an instance of a blackbox or a whitebox software component. The HDL-like input/output port definitions capture data-task-data dependencies. A highly interactive hierarchical GUI, rendered from the hierarchical taskflow descriptions in extended XML, supports structured programming language constructs to control sequences of task synchronization, execution, repetition, and abort. Experimental evaluations of the prototype, with up to 9150 tasks and a longest path of 1600 tasks, demonstrate the scalability of the environment and the overall effectiveness of the proposed architecture for a number of networked design and computing projects.

25.3 Hypermedia-Aided Design [p. 407]
Darko Kirovski, Milenko Drinic, Miodrag Potkonjak

Recently, the Internet revolutionized many activities from entertainment to marketing and business. Two key underlying Internet technologies, efficient data delivery and hypertext, demonstrated exceptional potential as new application enablers. In this paper, we present a novel Hypermedia-Aided Design (HAD) collaboration framework that facilitates new communication and data presentation paradigms to improve the effectiveness of typical EDA applications. The framework leverages semantic multicast as a communication backbone and quantized hypermedia presentations as an efficient data organization, retrieval, and presentation model. Semantic multicast is a global communication tool that relies on an internetwork of proxies to provide content discovery and semantics-based, profile-driven data dissemination services. We introduce the notion of a quant, an atomic interactive multimedia information primitive with embedded hyperlinks. We demonstrate how interest-specific quant retrieval and concatenation can enable more focused collaboration. The HAD framework consists of a set of applications for student (designer)-centered CAD education (consulting), collaborative design and debugging, I-commerce, and technical support. In all applications, quant-based presentations allow the theoretical and practical components to be tailored to the user's interests and performance. The conceptualized and implemented applications act in synergy with existing software, hardware, and system design tools.

25.4 A Framework for Object Oriented Hardware Specification, Verification, and Synthesis [p. 413]
T. Kuhn, T. Oppold, M. Winterholer, W. Rosenstiel, Mark Edwards, Yaron Kashai

We describe two things. First, we present a uniform framework for object oriented specification and verification of hardware. For this purpose the object oriented language 'e' is introduced along with a powerful run-time environment that enables the designer to perform the verification task. Second, we present an object oriented synthesis that enhances 'e' and its dedicated run-time environment into a framework for specification, verification, and synthesis. The usability of our approach is demonstrated by real-world examples.
Keywords: Object oriented hardware modeling, verification, high-level synthesis.


Session 26: Panel: When Will the Analog Design Flow Catch Up with Digital Methodology? [p. 419]

Chair: Georges Gielen
Organizers: Mike Sottak, Mike Murray, Linda Kaye
Panel Members: Mar Hershenson, Ken Kundert, Philippe Magarshack, Akira Matsuzawa, Ronald A. Rohrer, Ping Yang
Despite the fact that more and more electronic design is comprised of analog and mixed signal content, the design flows and methodologies in this area are lagging behind the pace of innovation in digital design. For sure, analog designers are in shorter supply, but this only makes the need for improvements and efficiency that much greater. Only of late have we seen production-worthy attempts at functions such as analog synthesis and optimization reach the market. What is needed in today's analog design flow? What are the key technologies that are missing? How does the existing 'food chain' need to work together to drive greater efficiencies? A distinguished panel of suppliers and users assembles to discuss, analyze and predict.


Session 27: Closing the Gap Between ASIC and Custom: Design Examples

Chair: Bryan Ackland
Organizer: Kurt Keutzer
27.1 Achieving 550 MHz in an ASIC Methodology [p. 420]
D. G. Chinnery, B. Nikolic, Kurt Keutzer

Typically, good automated ASIC designs may be two to five times slower than handcrafted custom designs. At last year's DAC this was examined and causes of the speed gap between custom circuits and ASICs were identified. In particular, faster custom speeds are achieved by a combination of factors: good architecture with well-balanced pipelines; compact logic design; timing overhead minimization; careful floorplanning, partitioning and placement; dynamic logic; post-layout transistor and wire sizing; and speed binning of chips. Closing the speed gap requires improving these same factors in ASICs, as far as possible. In this paper we examine a practical example of how these factors may be improved in ASICs. In particular we show how techniques commonly found in custom design were applied to design a high speed 550 MHz disk drive read channel in an ASIC design flow.
General Terms Performance, design.
Keywords ASIC, clock, frequency, speed, throughput, comparison, custom.

27.2 A Semi-Custom Design Flow in High-Performance Microprocessor Design [p. 426]
Gregory A. Northrop, Pong-Fei Lu

In this paper we present techniques shown to significantly enhance the custom circuit design process typical of high performance microprocessors. This methodology combines flexible custom circuit design with automated tuning and physical design tools to provide new opportunities for design optimization throughout the development cycle.
Keywords Standard cell, circuit tuning, custom design, methodology.

27.3 Reducing the Frequency Gap Between ASIC and Custom Designs: A Custom Perspective [p. 432]
Stephen E. Rich, Matthew J. Parker, Jim Schwartz

This paper proposes that the ability to control the difference between the simulated and actual frequencies of a design is a key strategy to achieving high frequency in both ASIC and custom designs. We will examine this principle and the methodologies that can be deployed to manage this gap.


Session 28: Energy and Flexibility Driven Scheduling

Chair: Marco Di Natale
Organizers: Donatella Sciuto, Luciano Lavagno
28.1 Low-Energy Intra-Task Voltage Scheduling Using Static Timing Analysis [p. 438]
Dongkun Shin, Jihong Kim, Seongsoo Lee

We propose an intra-task voltage scheduling algorithm for low-energy hard real-time applications. Based on a static timing analysis technique, the proposed algorithm controls the supply voltage within an individual task boundary. By fully exploiting all the slack times, a program scheduled by the proposed algorithm always completes its execution near the deadline, thus achieving a high energy reduction ratio. In order to validate the effectiveness of the proposed algorithm, we built a software tool that automatically converts a DVS-unaware program into an equivalent low-energy program. Experimental results show that the low-energy version of an MPEG-4 encoder/decoder (converted by the software tool) consumes less than 7-25% of the energy of the original program running on a fixed-voltage system with a power-down mode.
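The core decision made at each compiler-inserted voltage-scaling point can be illustrated by the sketch below: pick the lowest speed that still finishes the remaining worst-case work by the deadline. The frequency levels and timing numbers here are made up; in the paper, the remaining worst-case cycles come from static timing analysis of the task.

```python
# A hedged sketch of the intra-task DVS speed-selection step (illustration only).

def select_speed(remaining_wcec, time_left_s, freq_levels_hz):
    """Return the lowest available frequency that still meets the deadline.

    remaining_wcec  -- worst-case execution cycles still to run
    time_left_s     -- seconds until the task's deadline
    freq_levels_hz  -- supported clock frequencies (each with its own Vdd)
    """
    required = remaining_wcec / time_left_s          # minimum feasible clock
    feasible = [f for f in sorted(freq_levels_hz) if f >= required]
    return feasible[0] if feasible else max(freq_levels_hz)

# After a branch skips a heavy loop, far fewer cycles remain, so the
# scheduler can drop to a lower frequency (and a lower supply voltage).
levels = [100e6, 200e6, 400e6]
print(select_speed(remaining_wcec=3.0e6, time_left_s=0.02, freq_levels_hz=levels))  # 200 MHz
print(select_speed(remaining_wcec=0.9e6, time_left_s=0.02, freq_levels_hz=levels))  # 100 MHz
```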

28.2 Battery-Aware Static Scheduling for Distributed Real-Time Embedded Systems [p. 444]
Jiong Luo, Niraj K. Jha

This paper addresses battery-aware static scheduling in battery-powered distributed real-time embedded systems. As suggested by previous work, reducing the discharge current level and shaping its distribution are essential for extending the battery lifespan. We propose two battery-aware static scheduling schemes. The first one optimizes the discharge power profile in order to maximize the utilization of the battery capacity. The second one targets distributed systems composed of voltage-scalable processing elements (PEs). It performs variable-voltage scheduling via efficient slack time re-allocation, which helps reduce the average discharge power consumption as well as flatten the discharge power profile. Both schemes guarantee the hard real-time constraints and precedence relationships in the real-time distributed embedded system specification. Based on previous work, we develop a battery lifespan evaluation metric which is aware of the shape of the discharge power profile. Our experimental results show that the battery lifespan can be increased by up to 29% by optimizing the discharge power profile alone. Our variable-voltage scheme increases the battery lifespan by up to 76% over the non-voltage-scalable scheme and by up to 56% over the variable-voltage scheme without slack-time reallocation.

28.3 An Approach to Incremental Design of Distributed Embedded Systems [p. 450]
Paul Pop, Petru Eles, Traian Pop, Zebo Peng

In this paper we present an approach to incremental design of distributed embedded systems for hard real-time applications. We start from an already existing system running a set of applications, and the design problem is to implement new functionality so that the already running applications are not disturbed and there is a good chance that, later, new functionality can easily be added to the resulting system. The mapping and scheduling problems are considered in the context of a realistic communication model based on a TDMA protocol.


Session 29: Representation and Optimization for Digital Arithmetic Circuits

Chair: Miodrag Potkonjak
Organizers: Kazutoshi Wakabayashi, Rajesh K. Gupta
29.1 Signal Representation Guided Synthesis Using Carry-Save Adders For Synchronous Data-path Circuits [p. 456]
Zhan Yu, Meng-Lin Yu, Alan N. Willson, Jr

Arithmetic transformations using carry-save adders have been exploited recently in design automation, but existing transformation approaches only optimize combinational functions. Most applications need synchronous circuits, and it is known that techniques that move the positions of the registers, such as retiming, can significantly reduce the cycle time of a synchronous circuit. However, retiming disregards arithmetic transformations and its power is limited by the circuit topology. This work is the first to exploit carry-save arithmetic transformations together with the moving of the register positions. To enable such transformations, we first propose the use of a new multiple-vector signal representation. Next, we use the multiple-vector signal representation as a common guide for all of our simultaneous carry-save arithmetic transformations with the moving of the register positions. Specifically, we propose operation-forward and operation-backward carry-save transformations, which are transformations across register boundaries. We also propose operation-duplicate and operation-merge transformations to exploit the resource sharing and timing trade-offs in the implementation of a multiple-fanout network. Finally, we propose an efficient and effective heuristic that selectively applies a sequence of transformations to optimize the timing and the area of a synchronous circuit. Experimental results show that the proposed techniques significantly out-perform previous approaches.
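For background, the sketch below shows the carry-save adder (3:2 compressor) primitive that such transformations rearrange: three operands are reduced to a sum word and a carry word without propagating carries, deferring the single slow carry-propagate addition to the very end. The register-boundary transformations of the paper are not reproduced.

```python
# The carry-save adder primitive, bitwise on Python integers (illustration only).

def carry_save_add(a, b, c):
    """Reduce three integers to (sum_word, carry_word)."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (a & c)) << 1
    return sum_word, carry_word

def add4(a, b, c, d):
    # Two CSA levels, then one final carry-propagate addition.
    s1, c1 = carry_save_add(a, b, c)
    s2, c2 = carry_save_add(s1, c1, d)
    return s2 + c2

assert add4(12, 7, 33, 5) == 12 + 7 + 33 + 5
```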

29.2 Improved Merging of Datapath Operators using Information Content and Required Precision Analysis [p. 462]
Anmol Mathur, Sanjeev Saluja

We introduce the notions of required precision and information content of datapath signals and use them to define functionally safe transformations on data flow graphs. These transformations reduce the widths of datapath operators and enhance their mergeability. Using efficient algorithms to compute the required precision and information content of signals, we define a new algorithm for partitioning a data flow graph consisting of datapath operators into mergeable clusters. Experimental results indicate that use of our clustering algorithm for operator-merging-based synthesis of datapath-intensive designs can lead to significant improvement in the delay and area of the implementation.

29.3 Digital Filter Synthesis Based on Minimal Signed Digit Representation [p. 468]
In-Cheol Park, Hyeong-Ju Kang

As the complexity of digital filters is dominated by the number of multiplications, many works have focused on minimizing the complexity of the multiplier blocks that compute the constant coefficient multiplications required in filters. The complexity of multiplier blocks can be significantly reduced by using an efficient number system. Although the canonical signed digit (CSD) representation is commonly used, as it guarantees the minimal number of additions for a constant multiplication, we propose in this paper a digital filter synthesis algorithm that is based on the minimal signed digit (MSD) representation. The MSD representation is attractive because it provides a number of forms that have the minimal number of non-zero digits for a constant. This redundancy can lead to efficient filters if a proper MSD representation is selected for each constant. In experimental results, the proposed algorithm produced filters superior to those generated from the CSD representation.
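To make the digit system concrete, the sketch below computes the CSD recoding of a constant: digits are drawn from {-1, 0, +1} and no two nonzero digits are adjacent, which minimizes the adders needed for a constant multiplication. The MSD representation discussed above keeps all forms with the same minimal number of nonzero digits; enumerating them is not shown here.

```python
# CSD recoding of an integer constant (illustration only).

def to_csd(n, bits=16):
    """Return the CSD digits of n, least-significant digit first."""
    digits = []
    while n != 0 and len(digits) < bits:
        if n & 1:                      # odd: emit +1 or -1
            d = 2 - (n & 3)            # +1 if n % 4 == 1, -1 if n % 4 == 3
            digits.append(d)
            n -= d
        else:
            digits.append(0)
        n >>= 1
    return digits

# 47 = 64 - 16 - 1: three nonzero CSD digits instead of the five 1-bits of binary 101111.
csd = to_csd(47)
assert sum(d << i for i, d in enumerate(csd)) == 47
print(csd)
```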


Session 30: Techniques for IP Protection

Chair: Ian R. Mackintosh
Organizers: Kenji Yoshida, Majid Sarrafzadeh
30.1 Publicly Detectable Techniques for the Protection of Virtual Components [p. 474]
Gang Qu

Highlighted by the newly released intellectual property (IP) protection white paper from the VSI Alliance, the protection of virtual components (VCs) has received a large amount of attention recently. The digital signature is one of the most promising solutions among the known protection mechanisms. However, the trade-off between hard-to-attack and easy-to-detect and the lack of efficient detection schemes are the major obstacles for digital signatures to thrive. In this paper, we propose a new watermarking method which (i) allows the watermark to be publicly detected without forensic experts, (ii) gives little advantage to attackers for forgery, and (iii) does not lose the strength of protection provided by other watermarking techniques. The basic idea is to make part of the watermark public. We explain the concept of this public-private watermark and discuss the generation and embedding of such marks. We use popular VLSI CAD problems, namely technology mapping, partitioning, graph coloring, FPGA design, and Boolean satisfiability, to demonstrate its easy detectability, high credibility, low design overhead, and robustness. Finally, this technique is compatible with all the known watermarking and fingerprinting techniques.

30.2 Watermarking of SAT using Combinatorial Isolation Lemmas [p. 480]
Rupak Majumdar, Jennifer L. Wong

Watermarking of hardware and software designs is an effective mechanism for intellectual property protection (IPP). Two important criteria for watermarking schemes are credibility and fairness. In this paper, we present a unique-solution-based watermarking technique which provides, in a sense, the ultimate answer to both credibility and fairness requirements. Leveraging a combinatorial theorem of Valiant and Vazirani, we demonstrate how ultimate credibility and complete fairness can almost always be achieved with high probability during the watermarking of the solution of the satisfiability (SAT) problem. The effectiveness of the technique is demonstrated on both specially created examples where the number of solutions is known, as well as on common CAD and operations research SAT instances.

30.3 Watermarking Graph Partitioning Solutions [p. 486]
Greg Wolfe, Jennifer L. Wong, Miodrag Potkonjak

Trends in the semiconductor industry towards extensive design and code reuse motivate a need for adequate Intellectual Property Protection (IPP) schemes. We offer a new general IPP scheme called constraint-based watermarking and analyze it in the context of the graph partitioning problem. Graph partitioning is a critical optimization problem that has many applications, particularly in the semiconductor design process. Our IPP technique watermarks solutions to graph partitioning problems so that they carry an author's signature. Our technique is transparent to the actual CAD tool which does the partitioning. Our technique produces solutions that have very low quality degradation levels, yet carry signatures that are convincingly unambiguous, extremely unlikely to be present by coincidence, and difficult to detect or remove without completely resolving the partitioning problem.

30.4 Hardware Metering [p. 490]
Farinaz Koushanfar, Gang Qu

We introduce the first hardware metering scheme that enables reliable, low-overhead proofs for the number of manufactured parts. The key idea is to make each design slightly different. Therefore, if two identical hardware designs, or a design that is not reported by the foundry, are detected, the design house has proof of misconduct. We start by establishing the connection between the requirements for hardware metering and the synthesis process. Furthermore, we present a mathematical analysis of the statistical accuracy of the proposed hardware metering scheme. The effectiveness of the metering scheme is demonstrated on a number of designs.


Session 31: Visualization and Animation for VLSI Design

Chair: Chandu Visweswariah
Organizers: Chandu Visweswariah, Majid Sarrafzadeh
31.1 Technical Visualizations in VLSI Design [p. 494]
Phillip J. Restle

Visualization techniques were applied to several different types of VLSI design and simulation data. A number of different visualizations have been tried, with varying results. Examples include 3D visualization of voltage and currents from fullwave interconnect analysis, on-chip clock distribution networks, chip/package power supply noise analysis, wire congestion, chip layout imaging, and static circuit tuning. The goals, successes, and failures of these examples will be discussed, along with some unexpected benefits from our ability to easily see patterns in complex visualizations.
General Terms Design, Human Factors.
Keywords Visualization, VLSI Design, Interconnect, Inductance, Clock Distribution, Circuit Optimization.

31.2 Using Texture Mapping with Mipmapping to Render a VLSI Layout [p. 500]
Jeff Solomon, Mark Horowitz

This paper presents a method of using texture mapping with mipmapping to render a VLSI layout. Texture mapping is used to save already rasterized areas of the layout from frame to frame, and to take advantage of any hardware accelerated capabilities of the host platform. Mipmapping is used to select which textures to display so that the amount of information sent to the display is bounded, and the image rendered on the display is filtered correctly. Additionally, two caching schemes are employed. The first, used to bound memory consumption, is a general purpose cache that holds textures spatially close to the user's current viewpoint. The second, used to speed up the rendering process, is a cache of heavily used sub-designs that are precomputed so rasterization on the fly is not necessary. An experimental implementation shows that real-time navigation can be achieved on arbitrarily large designs. Results also show how this technique ensures that image quality does not degrade as the number of polygons drawn increases, avoiding the aliasing artifacts common in other layout systems.
Categories and Subject Descriptors J.6 [Computer-Aided Engineering]: Computer-Aided Design; I.3.3 [Computer Graphics]: Picture/Image Generation: display algorithms
Keywords texture mapping, mipmapping, VLSI layout editor
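The level-of-detail choice underlying this approach can be sketched as follows: each pre-filtered mip level halves the resolution of the rasterized layout, and the renderer picks the level whose texel size best matches one screen pixel, so zooming out never pushes more texels than pixels to the display. The function names and the 1:1 texel-to-pixel target below are assumptions for illustration, not the paper's implementation.

```python
# A hypothetical mipmap level selector for layout rendering (illustration only).
import math

def mip_level(layout_units_per_pixel, base_units_per_texel, num_levels):
    """Choose a mip level; level k covers base * 2**k layout units per texel."""
    ratio = layout_units_per_pixel / base_units_per_texel
    level = math.floor(math.log2(ratio)) if ratio > 1 else 0
    return min(max(level, 0), num_levels - 1)

# Zoomed far out (many layout units per screen pixel) -> coarse level;
# zoomed in -> level 0, the full-resolution rasterization.
print(mip_level(layout_units_per_pixel=64.0, base_units_per_texel=1.0, num_levels=8))  # 6
print(mip_level(layout_units_per_pixel=0.5,  base_units_per_texel=1.0, num_levels=8))  # 0
```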

31.3 Web-based Algorithm Animation [p. 506]
Marc Najork

This paper describes JCAT, a web-based algorithm animation system. JCAT combines the expressive power of web pages for publishing passive multimedia content with a full-fledged interactive algorithm animation system that includes a rich set of libraries for creating 2D and 3D animations. The paper describes in detail how an algorithm animation is authored, and it presents a sample of existing animations.


Session 32: Application-Specific Customization for Systems-on-a-Chip

Chair: Kees Vissers
Organizers: Kurt Keutzer, Dirk Grunwald
32.1 Speeding Up Control-Dominated Applications through Microarchitectural Customizations in Embedded Processors [p. 512]
Peter Petrov, Alex Orailoglu

We present a methodology for microarchitectural customization of embedded processors by exploiting application information, thus attaining the twin benefits of processor standardization and application specific customization. Such powerful techniques enable increased application fragments to be placed on the processor, with no sacrifice in system requirements, thus reducing the custom hardware and the concomitant area requirements in SOCs. We illustrate these ideas through the branch resolution problem, known to impose severe performance degradation on control-dominated embedded applications. A low-cost late customizable hardware that uses application information to fold out a set of frequently executed branches is described. Experimental results show that for a representative set of control dominated applications a reduction in the range of 7%-22% in processor cycles can be achieved, thus extending the scope of low-cost embedded processors in complex co-designs for control intensive systems.

32.2 Automatic Generation of Application-Specific Architectures for Heterogeneous Multiprocessor System-on-Chip [p. 518]
Damien Lyonnard, Sungjoo Yoo, Amer Baghdadi, Ahmed A. Jerraya

We present a design flow for the generation of application-specific multiprocessor architectures. In the flow, architectural parameters are first extracted from a high-level system specification. The parameters are used to instantiate architectural components, such as processors, memory modules and communication networks. The flow includes the automatic generation of a communication coprocessor that adapts the processor to the communication network in an application-specific way. Experiments with two system examples show the effectiveness of the presented design flow.

32.3 Dynamic Voltage Scaling and Power Management for Portable Systems [p. 524]
Tajana Simunic, Luca Benini, Andrea Acquaviva, Peter Glynn, Giovanni De Micheli

Portable systems require long battery lifetime while still delivering high performance. Dynamic voltage scaling (DVS) algorithms reduce energy consumption by changing processor speed and voltage at run-time depending on the needs of the applications running. Dynamic power management (DPM) policies trade off the performance for the power consumption by selectively placing components into low-power states. In this work we extend the DPM model presented in [2, 3] with a DVS algorithm, thus enabling larger power savings. We test our approach on MPEG video and MP3 audio algorithms running on the SmartBadge portable device [1]. Our results show savings of a factor of three in energy consumption for combined DVS and DPM approaches.


Session 33: Satisfiability Solvers and Techniques

Chair: Karem A. Sakallah
Organizers: Andrew B. Kahng, Kurt Keutzer, Limor Fix
33.1 Chaff: Engineering an Efficient SAT Solver [p. 530]
Matthew W. Moskewicz, Conor F. Madigan, Ying Zhao, Lintao Zhang, Sharad Malik

Boolean Satisfiability is probably the most studied of combinatorial optimization/search problems. Significant effort has been devoted to trying to provide practical solutions to this problem for problem instances encountered in a range of applications in Electronic Design Automation (EDA), as well as in Artificial Intelligence (AI). This study has culminated in the development of several SAT packages, both proprietary and in the public domain (e.g. GRASP, SATO) which find significant use in both research and industry. Most existing complete solvers are variants of the Davis-Putnam (DP) search algorithm. In this paper we describe the development of a new complete solver, Chaff, which achieves significant performance gains through careful engineering of all aspects of the search, especially a particularly efficient implementation of Boolean constraint propagation (BCP) and a novel low overhead decision strategy. Chaff has been able to obtain one to two orders of magnitude performance improvement on difficult SAT benchmarks in comparison with other solvers (DP or otherwise), including GRASP and SATO.
Categories and Subject Descriptors J.6 [Computer-Aided Engineering]: Computer-Aided Design.
General Terms Algorithms, Verification.
Keywords Boolean satisfiability, design verification.
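The sketch below illustrates the two-watched-literal style of BCP that the Chaff abstract highlights: each clause watches two of its literals, and only the clauses watching the literal that has just become false are visited on an assignment. It is a toy, not Chaff's implementation: there is no conflict-driven learning, no decision heuristic, and the watch lists are rebuilt on every call.

```python
# A stripped-down two-watched-literal BCP sketch (illustration only).

def lit_value(lit, assign):
    v = assign.get(abs(lit))
    return None if v is None else (v if lit > 0 else not v)

def propagate(clauses, assign, queue):
    """Unit-propagate the literals in `queue`; return False on a conflict."""
    watched = {i: [0, min(1, len(cl) - 1)] for i, cl in enumerate(clauses)}
    watch_list = {}
    for i, (a, b) in watched.items():
        watch_list.setdefault(clauses[i][a], set()).add(i)
        watch_list.setdefault(clauses[i][b], set()).add(i)
    while queue:
        lit = queue.pop()                      # lit just became true, so -lit is false
        for i in list(watch_list.get(-lit, ())):
            cl, (w0, w1) = clauses[i], watched[i]
            if cl[w0] == -lit:                 # keep w0 pointing at the *other* watch
                w0, w1 = w1, w0
            for j, l in enumerate(cl):         # try to move the false watch elsewhere
                if j not in (w0, w1) and lit_value(l, assign) is not False:
                    watch_list[-lit].discard(i)
                    watch_list.setdefault(l, set()).add(i)
                    watched[i] = [w0, j]
                    break
            else:                              # clause is unit, satisfied, or conflicting
                other = cl[w0]
                val = lit_value(other, assign)
                if val is False:
                    return False               # conflict
                if val is None:                # implication: other must be true
                    assign[abs(other)] = other > 0
                    queue.append(other)
    return True

assign = {1: True}
print(propagate([[1, 2], [-1, 3], [-3, -2]], assign, [1]), assign)
# -> True {1: True, 3: True, 2: False}
```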

33.2 Dynamic Detection and Removal of Inactive Clauses in SAT with Application in Image Computation [p. 536]
Aarti Gupta, Anubhav Gupta, Zijiang Yang, Pranav Ashar

In this paper, we present a new technique for the efficient dynamic detection and removal of inactive clauses, i.e. clauses that do not affect the solutions of interest of a Boolean Satisfiability (SAT) problem. The algorithm is based on the extraction of gate connectivity information during generation of the Boolean formula from the circuit, and its use in the inner loop of a branch-and-bound SAT algorithm. The motivation for this optimization is to exploit the circuit structure information, which can be used to find unobservable gates at circuit outputs under dynamic conditions. It has the potential to speed up all applications of SAT in which the SAT formula is derived from a logic circuit. In particular, we find that it has considerable impact on an image computation algorithm based on SAT. We present practical results for benchmark circuits which show that the use of this optimization consistently improves the performance for reachability analysis, in some cases enabling the prototype tool to reach more states than otherwise possible.

33.3 SATIRE: A New Incremental Satisfiability Engine [p. 542]
Jesse Whittemore, Joonyoung Kim, Karem Sakallah

We introduce SATIRE, a new satisfiability solver that is particularly suited to verification and optimization problems in electronic design automation. SATIRE builds on the most recent advances in satisfiability research, and includes two new features to achieve even higher performance: a facility for incrementally solving sets of related problems, and the ability to handle non-CNF constraints. We provide experimental evidence showing the effectiveness of these additions to classical satisfiability solvers.

33.4 A Framework for Low Complexity Static Learning [p. 546]
Emil Gizdarski, Hideo Fujiwara

In this paper, we present a new data structure for a complete implication graph and two techniques for low complexity static learning. We show that using static indirect implications and super gate extraction, some hard-to-detect static and dynamic indirect implications are easily derived during static and dynamic learning as well as branch and bound search. Experimental results demonstrate the effectiveness of the proposed data structure and learning techniques.


Session 34: Power and Interconnect Analysis

Chair: L. Miguel Silveira
Organizers: Joel Phillips, Mustafa Celik
34.1 Fast Power/Ground Network Optimization Based on Equivalent Circuit Modeling [p. 550]
X.-D. Sheldon Tan, C.-J. Richard Shi

This paper presents an efficient algorithm for optimizing the area of power or ground networks in integrated circuits subject to reliability constraints. Instead of solving the original power/ground networks extracted from circuit layouts as previous methods did, the new method first builds equivalent models for the many series resistors in the original networks; a sequence-of-linear-programs method is then used to solve the simplified networks. The solutions of the original networks are then back-solved from the optimized, simplified networks. The new algorithm simply exploits the regularities in the power/ground networks. Experimental results show that the complexities of the simplified networks are typically significantly smaller than those of the original circuits, which renders the new algorithm extremely fast. For instance, power/ground networks with more than one million branches can be sized in a few minutes on modern SUN workstations.

34.2 An Interconnect Energy Model Considering Coupling Effects [p. 555]
Taku Uchino, Jason Cong

This paper first presents an analytical interconnect energy model with consideration of event coupling, which is not considered by the conventional 1/2 CV^2 model. Our energy calculation algorithm has the same time complexity as the 1/2 CV^2 model, and is several orders of magnitude faster than HSPICE with less than 5% error. In comparison, the error of the 1/2 CV^2 model can be as high as 100%.
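A first-order illustration (not the paper's model) of why coupling breaks the conventional estimate: the charge drawn through a coupling capacitor depends on the relative transition of the two nets, so the effective switched capacitance ranges from the ground capacitance alone (neighbor switches the same way) to the ground capacitance plus twice the coupling capacitance (neighbor switches the opposite way).

```python
# A rough, illustrative coupling-aware switching energy estimate.

def switching_energy(c_ground, c_couple, vdd, own_step, neighbor_step):
    """Approximate energy (J) drawn for one transition of the victim net.

    own_step, neighbor_step -- voltage change on the victim / neighbor net
                               (e.g. +vdd for a rising edge, 0 for a quiet net).
    """
    dv_couple = own_step - neighbor_step          # swing across the coupling cap
    q = c_ground * own_step + c_couple * dv_couple  # charge supplied to the victim
    return 0.5 * q * vdd                          # first-order energy estimate

Cg, Cc, V = 10e-15, 10e-15, 1.8
print(switching_energy(Cg, Cc, V, +V, 0))    # quiet neighbor: (Cg + Cc) effective
print(switching_energy(Cg, Cc, V, +V, +V))   # same-direction switching: only Cg
print(switching_energy(Cg, Cc, V, +V, -V))   # opposite switching: Cg + 2*Cc
```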

34.3 Efficient Large-Scale Power Grid Analysis Based on Preconditioned Krylov-Subspace Iterative Methods [p. 559]
Tsung-Hao Chen, Charlie Chung-Ping Chen

In this paper, we propose preconditioned Krylov-subspace iterative methods to perform efficient DC and transient simulations for large-scale linear circuits with an emphasis on power delivery circuits. We also prove that a circuit with inductors can be simplified from MNA to NA format, and the matrix becomes a symmetric positive definite (s.p.d.) matrix. This property makes it suitable for the conjugate gradient method with incomplete Cholesky decomposition as the preconditioner, which is faster than other direct and iterative methods. Extensive experimental results on large-scale industrial power grid circuits show that our method is over 200 times faster for DC analysis and around 10 times faster for transient simulation compared to SPICE3. Furthermore, our algorithm reduces memory usage by over 75% compared to SPICE3 while the accuracy is not compromised.
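A generic preconditioned conjugate-gradient (PCG) iteration of the kind used above is sketched below for an s.p.d. nodal system G*v = i. For simplicity a Jacobi (diagonal) preconditioner stands in for the incomplete Cholesky factorization used in the paper, and numpy is assumed available.

```python
# Preconditioned conjugate gradient on a tiny conductance matrix (illustration only).
import numpy as np

def pcg(G, b, tol=1e-10, max_iter=1000):
    x = np.zeros(len(b))
    m_inv = 1.0 / np.diag(G)           # Jacobi preconditioner (stand-in)
    r = b - G @ x
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Gp = G @ p
        alpha = rz / (p @ Gp)
        x += alpha * p
        r -= alpha * Gp
        if np.linalg.norm(r) < tol:
            break
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Tiny 3-node resistive-grid conductance matrix with current injections.
G = np.array([[ 3.0, -1.0, -1.0],
              [-1.0,  2.0, -1.0],
              [-1.0, -1.0,  3.0]])
b = np.array([1.0, 0.0, -0.5])
v = pcg(G, b)
print(np.allclose(G @ v, b))           # True
```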

34.4 Using Conduction Modes Basis Functions for Efficient Electromagnetic Analysis of On-Chip and Off-Chip Interconnect [p. 563]
Luca Daniel, Alberto Sangiovanni-Vincentelli, Jacob White

In this paper, we present an efficient method to model the interior of the conductors in a quasi-static or full-wave integral equation solver. We show how interconnect cross-sectional current distributions can be modeled using a small number of conduction modes as basis functions for the discretization of the Mixed Potential Integral Equation (MPIE). Two examples are presented to demonstrate the computational attractiveness of our method. In particular, we show how our new approach can successfully and efficiently capture skin effects, proximity effects and transmission line resonances.

34.5 Analysis of Non-Uniform Temperature-Dependent Interconnect Performance in High Performance ICs [p. 567]
Amir H. Ajami, Kaustav Banerjee, Massoud Pedram, Lukas P.P.P. van Ginneken

Non-uniform temperature profiles along global interconnect lines in high-performance ICs can significantly impact the performance of these lines. This paper presents a detailed analysis and modeling of the interconnect performance degradation due to non-uniform temperature profiles that exist along their lengths, which in turn arise due to the thermal gradients in the underlying substrate. A non-uniform temperature-dependent distributed RC interconnect delay model is proposed for the first time. The model has been applied to a wide variety of interconnect layouts and temperature distributions to quantify the impact on signal integrity issues including clock skew fluctuations.


Session 35: Domain Specific Design Methodologies

Chair: Yosinori Watanabe
Organizers: Anand Raghunathan, Shin-ichi Minato
35.1 VHDL-Based Design and Design Methodology for Reusable High Performance Direct Digital Frequency Synthesizers [p. 573]
Ireneusz Janiszewski, Bernhard Hoppe, Hermann Meuth

Design methodologies for high performance Direct Digital Frequency Synthesizers (DDFS) are described. Traditional look-up tables (LUT) for sine and cosine are merged with CORDIC interpolation into a hybrid architecture. This implements DDFS systems with high resolution without being specific to a particular target technology. Amplitude constants were obtained from the mathematical trigonometric functions of the IEEE math_real package. These constants were then written via simulation of a VHDL model into a fully synthesizable package. Systematic and detailed studies varying the synthesizer's inherent parameters lead to a design optimum for the LUT/CORDIC ratio, which minimizes power and silicon area for a given clock frequency.
Categories and Subject Descriptors M1.5: Logic and high-level synthesis and optimization
General Terms Algorithms, Design, Languages, Theory.
Keywords HDL-based Design, Direct Frequency Synthesis, Design Optimization and Reuse, CORDIC Algorithm.
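The CORDIC rotation mode used for the interpolation stage can be sketched as follows. The real design works in fixed point with a chosen LUT/CORDIC split; the floating-point arithmetic and the iteration count here are simplifications for illustration.

```python
# Floating-point CORDIC rotation mode for sine/cosine (illustration only).
import math

def cordic_sin_cos(angle, iterations=24):
    """Return (sin(angle), cos(angle)) for |angle| < pi/2 via CORDIC."""
    # Pre-computed micro-rotation angles atan(2^-i) and the aggregate gain.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for a in angles:
        gain *= math.cos(a)
    x, y, z = gain, 0.0, angle          # start scaled so the final gain is 1
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0     # rotate toward the residual angle
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * angles[i])
    return y, x                          # y -> sin, x -> cos

s, c = cordic_sin_cos(0.6)
print(abs(s - math.sin(0.6)) < 1e-6, abs(c - math.cos(0.6)) < 1e-6)
```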

35.2 Concurrent Error Detection of Fault-Based Side-Channel Cryptanalysis of 128-Bit Symmetric Block Ciphers [p. 579]
Ramesh Karri, Kaijie Wu, Piyush Mishra, Yongkook Kim

Fault-based side channel cryptanalysis is very effective against symmetric and asymmetric encryption algorithms. Although straightforward hardware and time redundancy based concurrent error detection (CED) architectures can be used to thwart such attacks, they entail significant overhead (either area or performance). In this paper we investigate systematic approaches to low-cost, low-latency CED for symmetric encryption algorithms based on the inverse relationship that exists between encryption and decryption at algorithm level, round level and operation level and develop CED architectures that explore the trade-off between area overhead, performance penalty and error detection latency. The proposed techniques have been validated on FPGA implementations of AES finalist 128-bit symmetric encryption algorithms.

35.3 MetaCores: Design and Optimization Techniques [p. 585]
Seapahn Meguerdichian, Farinaz Koushanfar, Advait Mogre, Dusan Petranovic, Miodrag Potkonjak

Currently, hardware intellectual property (IP) is delivered at three levels of abstraction: hard, firm, and soft. In order to further enhance the performance, efficiency, and flexibility of IP design, we have developed a new approach for designing hardware and software IP called MetaCores. The new design approach starts at the algorithm level and leverages the algorithm's intrinsic optimization degrees of freedom. The approach has four main components: (i) problem formulation and identification of optimization degrees of freedom, (ii) objective functions and constraints, (iii) a cost evaluation engine, and (iv) multiresolution design space search. From the algorithmic viewpoint, the main contribution is the introduction of multiresolution search in the algorithm optimization and synthesis process. We have applied the approach to the development of Viterbi and IIR MetaCores. Experimental results demonstrate the effectiveness of the new approach.


Session 36: Panel: Is Nanometer Design Under Control? [p. 591]

Chair: Andrew B. Kahng
Organizer: Bing Sheu and Andrew B. Kahng
Panel Members: Nancy Nettleton, John Cohn, Shekhar Borkar, Louis Scheffer, Ed Cheng, Sang Wang

As fabrication technology moves to 100 nm and below, profound nanometer effects become critical in developing silicon chips with hundreds of millions of transistors. Both EDA suppliers and system houses have been re-tooling, and new methodologies have been emerging. Will these efforts meet the challenges of nanometer silicon such as performance closure, power, reliability, manufacturability, and cost? Which aspects of nanometer design are, or are not, under control? This session will consist of a debate between two teams of distinguished representatives from EDA suppliers and system design houses. Which side has the right answers and roadmap? You and a panel of judges will decide!


Session 37: Analysis and Implementation for Embedded Systems

Chair: Grant Martin
Organizers: Dirk Grunwald, Kurt Keutzer, Luciano Lavagno
37.1 A Hardware/Software Co-design Flow and IP Library Based on Simulink TM [p. 593]
L. M. Reyneri, F. Cucinotta, A. Serra, L. Lavagno

This paper describes a design flow for data-dominated embedded systems. We use The Mathworks' Simulink TM environment for functional specification and algorithmic analysis. We developed a library of Simulink blocks, each parameterized by design choices such as implementation (software, analog or digital hardware, ...) and numerical accuracy (resolution, S/N ratio). Each block is equipped with empirical models for cost (code size, chip area) and performance (timing, energy), based on surface fitting from actual measurements. We also developed an analysis toolbox that quickly evaluates algorithm and parameter choices performed by the designer, and presents the results for fast feedback. The chosen block netlist is then ready for implementation, by using a customization of The Mathworks' Real Time Workshop TM to generate a VHDL netlist for FPGA implementation, as well as embedded software for DSP implementation.

37.2 System-Level Power/Performance Analysis for Embedded Systems Design [p. 599]
Amit Nandi, Radu Marculescu

This paper presents a formal technique for system level power/performance analysis that can help the designer to select the right platform starting from a set of target applications. By platform we mean a family of heterogeneous architectures that satisfy a set of architectural constraints imposed to allow re-use of hardware and software components. More precisely, we introduce the Stochastic Automata Networks (SANs) as an effective formalism for average-case analysis that can be used early in the design cycle to identify the best power/performance figure among several application-architecture combinations. This information not only helps avoid lengthy profiling simulations, but also enables efficient mappings of the applications onto the chosen platform. We illustrate the features of our technique through the design of an MPEG-2 video decoder application.
Keywords: platform-based design, system-level analysis, stochastic automata networks, multimedia systems.

37.3 High-level Software Energy Macro-modeling [p. 605]
T. K. Tan, A. Raghunathan, G. Lakshminarayana, N. K. Jha

This paper presents an efficient and accurate high-level software energy estimation methodology using the concept of characterization-based macro-modeling. In characterization-based macro-modeling, a function or sub-routine is characterized using an accurate lower-level energy model of the target processor, to construct a macro-model that relates the energy consumed in the function under consideration to various parameters that can be easily observed or calculated from a high-level programming language description. The constructed macro-models eliminate the need for significantly slower instruction-level interpretation or hardware simulation that is required in conventional approaches to software energy estimation. We present two different approaches to macro-modeling for embedded software that offer distinct efficiency-accuracy characteristics: (i) complexity-based macro-modeling, where the variables that determine the algorithmic complexity of the function under consideration are used as macro-modeling parameters, and (ii) profiling-based macro-modeling, where internal profiling statistics for the functions are used as parameters in the energy macro-models. We have experimentally validated our software energy macro-modeling techniques on a wide range of embedded software routines and two different target processor architectures. Our experiments demonstrate that high-level macro-models constructed using the proposed techniques are able to estimate the energy consumption to within 95% accuracy on average, while delivering speedups of one to five orders of magnitude over current instruction-level and architectural energy estimation techniques.
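
To illustrate the basic idea (a minimal Python sketch, not the authors' tool: the parameter set, the linear model form, and the training numbers are assumptions), a macro-model can be fitted once from lower-level characterization data and then reused for fast high-level estimation:

    import numpy as np

    # Characterization data: one row per training run of the routine.
    # The columns are parameters observable at the source level (here,
    # assumed to be input size and loop iteration count); y holds energies
    # obtained from a slow instruction-level model, in nanojoules.
    X = np.array([[64, 128], [128, 256], [256, 512], [512, 1024]], dtype=float)
    y = np.array([410.0, 805.0, 1620.0, 3190.0])

    # Fit a linear macro-model  E ~= c0 + c1*size + c2*iters  by least squares.
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    coeff, *_ = np.linalg.lstsq(A, y, rcond=None)

    def estimate_energy(size, iters):
        # Fast high-level estimate: no instruction-level simulation needed.
        return coeff @ np.array([1.0, size, iters])

    print(estimate_energy(300, 600))  # estimate for an unseen input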


Session 38: Industrial Case Studies in Verification

Chair: Carl Pixley
Organizers: Anand Raghunathan, Carl Pixley
38.1 Model Checking of S3C2400X Industrial Embedded SOC Product [p. 611]
Hoon Choi, Byeongwhee Yun, Yuntae Lee, Hyunglae Roh

This paper describes our experience and the methodology used in the model checking of the S3C2400X, an industrial embedded SOC product. We employed model checking to verify the RTL implementation. We describe how to model multiple clocks, gated clocks, unsynchronized clocks, and synchronization logic in model checking. Detailed case studies of real designs show the application of the proposed modeling techniques, the environment modeling, and the properties we checked. The verification results validate the proposed techniques by finding real bugs.

38.2 Semi-Formal Test Generation with Genevieve [p. 617]
Julia Dushina, Mike Benjamin, Daniel Geist

This paper describes the first application of the Genevieve test generation methodology. The Genevieve approach uses semi-formal techniques derived from model checking to generate test suites for specific behaviours of the design under test. An "interesting" behaviour is asserted to be unreachable; if a path from an initial state to the state of interest does exist, a counter-example is generated, and the resulting sequence of states specifies a test for the desired behaviour. To highlight real problems that could appear during test generation, we chose the Store Data Unit (SDU) of the ST100, a new high-performance digital signal processor (DSP) developed by STMicroelectronics. This unit was specifically selected because of the following key issues:
1. big data structures that cannot be directly modelled without state explosion,
2. complex control logic that would require an excessive number of tests to exercise exhaustively,
3. a design where it is difficult to determine how to drive the complete system to ensure a given behaviour in the unit under test.
The Genevieve methodology allowed us to define a coverage model specifically devoted to covering corner cases of the design. Hence the generated test suite achieved very efficient coverage of corner cases, and checked not only functional correctness but also whether the implementation matched design intent. As a result the Genevieve tests discovered some subtle performance bugs which would otherwise be very difficult to find.

38.3 A Transaction-Based Unified Simulation/Emulation Architecture for Functional Verification [p. 623]
Murali Kudlugi, Soha Hassoun, Charles Selvidge, Duaine Pryor

A transaction-based layered architecture providing for 100% portability of a C-based testbench between simulation and emulation is proposed. Transaction-based communication results in performance which is commensurate with emulation without a hardware target. Testbench portability eliminates duplicated effort when combining system level simulation and emulation. An implementation based on the IKOS VStation emulator validates these architectural claims on real designs.


Session 39: Integrated High-Level Synthesis Based Solutions

Chair: Kazutoshi Wakabayashi
Organizers: Kazutoshi Wakabayashi, Rajesh K. Gupta
39.1 Integrated High-Level Synthesis and Power-Net Routing for Digital Design under Switching Noise Constraints [p. 629]
Alex Doboli, Ranga Vemuri

This paper presents a CAD methodology and a tool for high-level synthesis (HLS) of digital hardware for mixed analog-digital chips. In contrast to HLS for digital applications, HLS for mixed-signal systems is mainly challenged by constraints, such as digital switching noise (DSN), that are due to the analog circuits. This paper discusses an integrated approach to HLS and power net routing for effectively reducing DSN. The motivation for this research is that HLS has a high impact on DSN reduction, yet DSN evaluation is very difficult at a high level. The integrated approach also employs an original method for fast evaluation of DSN and an algorithm for power net routing and sizing. Experiments showed that our combined binding and scheduling method produces better results than traditional HLS techniques. Finally, DSN evaluation using the proposed algorithm can be significantly faster than SPICE simulation.

39.2 Integrating Scheduling and Physical Design into a Coherent Compilation Cycle for Reconfigurable Computing Architectures [p. 635]
Kia Bazargan, Seda Ogrenci, Majid Sarrafzadeh

Advances in FPGA technology, both in terms of device capacity and architecture, have resulted in the introduction of reconfigurable computing machines, where the hardware adapts itself to the running application to gain speedup. To keep up with the ever-growing performance expectations of such systems, designers need new methodologies and tools for developing reconfigurable computing systems (RCS). This paper addresses the need for a fast compilation and physical design phase to be used in the application development/debugging/testing cycle for RCS. We present a high-level synthesis approach that is integrated with placement, making the compilation cycle much faster. On average, our tool generates the VHDL code (and the corresponding placement information) from the data flow graph of a program in less than a minute. By compromising 30% in the clock frequency of the circuit, we can achieve about 10 times speedup in the Xilinx placement phase, and 2.5 times overall speedup in the Xilinx place-and-route phase, a reasonable trade-off when developing RCS applications.

39.3 Statistical Design Space Exploration for Application-Specific Unit Synthesis [p. 641]
Davide Bruni, Alessandro Bogliolo, Luca Benini

The capability of performing semi-automated design space exploration is the main advantage of high-level synthesis with respect to RTL design. However, design space exploration performed during high-level synthesis is limited in scope, since it provides promising solutions that represent good starting points for subsequent optimizations, but it provides no insight about the overall structure of the design space. In this work we propose unsupervised Monte Carlo design exploration and statistical characterization to capture the key features of the design space. Our analysis provides insight on how various solutions are distributed over the entire design space. In addition, we apply extreme value theory [11] to extrapolate achievable bounds from the sampling points.
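
To give a flavor of the statistical machinery (a toy Python sketch only: the synthetic cost function, the block sizes, and the choice of a Gumbel fit on block minima are assumptions, not the paper's procedure), unsupervised sampling plus an extreme-value fit can extrapolate how good a solution might get beyond the best sample actually seen:

    import numpy as np
    from scipy.stats import gumbel_l

    rng = np.random.default_rng(0)

    def random_solution_cost():
        # Stand-in for one synthesis run with random design choices:
        # returns a synthetic area-delay cost value.
        x = rng.uniform(0.0, 1.0, size=8)
        return 10.0 + 5.0 * np.sin(x).sum() + rng.normal(0.0, 0.3)

    # Unsupervised Monte Carlo sampling of the design space.
    samples = np.array([random_solution_cost() for _ in range(5000)])

    # Extreme value theory on block minima: take the best cost in each block
    # and fit a Gumbel distribution for minima to extrapolate an achievable bound.
    block_minima = samples.reshape(50, 100).min(axis=1)
    loc, scale = gumbel_l.fit(block_minima)
    print("best sampled cost:", samples.min())
    print("extrapolated near-best cost (0.1% quantile):", gumbel_l.ppf(0.001, loc, scale))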


Session 40: Timing Verification and Simulation

Chair: Ashok Vittal
Organizer: Narendra Shenoy
40.1 Static Scheduling of Multiple Asynchronous Domains For Functional Verification [p. 647]
Murali Kudlugi, Charles Selvidge, Russell Tessier

While ASIC devices of a decade ago primarily contained synchronous circuitry triggered with a single clock, many contemporary architectures require multiple clocks that operate asynchronously to each other. This multi-clock domain behavior presents significant functional verification challenges for large parallel verification systems such as distributed parallel simulators and logic emulators. In particular, multiple asynchronous design clocks make it difficult to verify that design hold times are met during logic evaluation and causality along reconvergent fanout paths is preserved during signal communication. In this paper, we describe scheduling and synchronization techniques to maintain modeling fidelity for designs with multiple asynchronous clock domains that are mapped to parallel verification systems. It is shown that when our approach is applied to an FPGA-based logic emulator, evaluation fidelity is maintained and increased design evaluation performance can be achieved for large benchmark designs with multiple asynchronous clock domains.

40.2 Functional Correlation Analysis in Crosstalk Induced Critical Paths Identification [p. 653]
Tong Xiao, Malgorzata Marek-Sadowska

In deep submicron digital circuits, capacitive couplings make the delay of a switching signal highly dependent on its neighbors' switching times and switching directions. A long path may have a large number of coupling neighbors with interdependencies that are difficult to determine. Ignoring the mutual relationship among the signals may result in a very pessimistic estimation of circuit delay. In this paper, we apply efficient functional correlation analysis techniques to identify critical paths caused by crosstalk delay effects. We also discuss applications to static timing optimization. Experiments demonstrate the efficacy of the proposed technique.

40.3 An Advanced Timing Characterization Method Using Mode Dependency [p. 657]
Hakan Yalcin, Robert Palermo, Mohammad Mortazavi, Cyrus Bamji, Karem Sakallah, John Hayes

To address the problem of accurate timing characterization, this paper proposes a method that fully exploits mode dependency. It is based on the premise that circuit delays are determined largely by a set of control inputs for which the number of useful combinations, i.e., modes, is small for most practical circuits. We take the mode-dependent characterization approach further and enhance it so that the delays of the I/O paths between the control inputs and outputs are calculated more accurately. We prove that, with a careful choice of propagation conditions, our method can generate timing models with very tight path delays that are guaranteed to give correct results. Experimental results using real-life circuits show that circuit delays can vary significantly among different modes for both control and data input delays, and that capturing this variation can have a significant impact on the overall system timing.

40.4 Fast Statistical Timing Analysis By Probabilistic Event Propagation [p. 661]
Jing-Jia Liou, Kwang-Ting Cheng, Sandip Kundu, Angela Krstic

We propose a new statistical timing analysis algorithm, which produces arrival-time random variables for all internal signals and primary outputs for cell-based designs with all cell delays modeled as random variables. Our algorithm propagates probabilistic timing events through the circuit and obtains final probabilistic events (distributions) at all nodes. The new algorithm is deterministic and flexible in controlling run time and accuracy. However, the algorithm has exponential time complexity for circuits with reconvergent fanouts. In order to solve this problem, we further propose a fast approximate algorithm. Experiments show that this approximate algorithm speeds up the statistical timing analysis by at least an order of magnitude and produces results with small errors when compared with Monte Carlo methods.
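
The core propagation step can be pictured as follows (a minimal Python sketch, assuming discretized delay distributions and independence between the two inputs; the PMF values are invented, and the approximation needed for reconvergent fanout is not shown):

    import numpy as np

    def add_delay(arrival_pmf, delay_pmf):
        # Output arrival = input arrival + gate delay: for independent
        # discretized distributions this is a convolution of PMFs.
        return np.convolve(arrival_pmf, delay_pmf)

    def max_of(pmf_a, pmf_b):
        # Latest-arriving input dominates: P(max <= t) = P(a <= t) * P(b <= t),
        # assuming independence (reconvergent fanout violates this, which is
        # exactly where the approximate algorithm comes in).
        n = max(len(pmf_a), len(pmf_b))
        a = np.pad(pmf_a, (0, n - len(pmf_a)))
        b = np.pad(pmf_b, (0, n - len(pmf_b)))
        cdf = np.cumsum(a) * np.cumsum(b)
        return np.diff(np.concatenate(([0.0], cdf)))

    # Two inputs with uncertain arrival times and a gate with a random delay
    # (the time unit is one bin; every PMF sums to 1).
    in1 = np.array([0.2, 0.6, 0.2])         # arrives around bin 1
    in2 = np.array([0.0, 0.1, 0.7, 0.2])    # arrives around bin 2
    gate = np.array([0.5, 0.5])             # delay of 0 or 1 bin
    out = add_delay(max_of(in1, in2), gate)
    print(out, out.sum())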


Session 41: On-Chip Communication Architectures

Chair: Anand Raghunathan
Organizers: Anand Raghunathan, Sharad Malik
41.1 Addressing the System-on-a-Chip Interconnect Woes Through Communication-Based Design [p. 667]
M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, A. Sangiovanni-Vincentelli

Communication-based design represents a formal approach to system-on-a-chip design that considers communication between components as important as the computations they perform. Our "network-on-chip" approach partitions the communication into layers to maximize reuse and provide a programmer with an abstraction of the underlying communication framework. This layered approach is cast in the structure advocated by the OSI Reference Model and is demonstrated with a reconfigurable DSP example. The Metropolis methodology of deriving layers through a sequence of adaptation steps between incompatible behaviors is illustrated through the Intercom design example. In another approach, MESCAL provides a designer with tools for a correct-by-construction protocol stack.
General Terms: Design
Keywords: Network-on-chip; platform-based design; communication-based design; protocol stack.

41.2 MicroNetwork-Based Integration for SOCs [p. 673]
Drew Wingard

In this paper, we describe the concept of using an on-chip network as the fundamental communication architecture for a complex SOC design. We describe some of the salient features of a MicroNetwork that we have implemented, and introduce an automated development environment that makes MicroNetwork-based design productive. Finally, we show how one would use our MicroNetwork to integrate an SOC, and provide a general description of several completed MicroNetwork-based designs.

41.3 On-Chip Communication Architecture for OC-768 Network Processors [p. 678]
Faraydon Karim, Anh Nguyen, Sujit Dey, Ramesh Rao

The need for network processors capable of forwarding IP packets at OC-192 and higher data rates has been well established. At the same time, there is a growing need for complex tasks, like packet classification and differentiated services, to be performed by network processors. At the OC-768 data rate, a network processor has 9 nanoseconds to process a minimum-size IP packet. Such ultra high-speed processing, involving complex memory-intensive tasks, can only be achieved by multi-CPU distributed memory systems using very high performance on-chip communication architectures. In this paper, we propose a novel communication network architecture for 8-CPU distributed memory systems that has the potential to deliver the throughput required in next-generation routers. We then show that our communication architecture can easily scale to accommodate a much greater number of network nodes. Our network architecture yields higher performance than the traditional bus and crossbar yet has low implementation cost. It is quite flexible and can be implemented in either packet- or circuit-switched mode. We compare and contrast our proposed architecture with buses and crossbars using metrics such as throughput and physical layout cost.

41.4 Route Packets, Not Wires: On-Chip Interconnection Networks [p. 684]
William J. Dally, Brian Towles

Replacing ad-hoc global wiring with on-chip interconnection networks structures the top-level wires on a chip and facilitates modular design. With this approach, system modules (processors, memories, peripherals, etc.) communicate by sending packets to one another over the network. The structured network wiring gives well-controlled electrical parameters that eliminate timing iterations and enable the use of high-performance circuits to reduce latency and increase bandwidth. The area overhead required to implement an on-chip network is modest; we estimate 6.6%. This paper introduces the concept of on-chip networks, sketches a simple network, and discusses some challenges in the architecture and design of these networks.


Session 42: Compiler and Architecture Interactions

Chair: Stephen Neuendorffer
Organizers: Donatella Sciuto, Marco Di Natale
42.1 Dynamic Management of Scratch-Pad Memory Space [p. 690]
M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayif, A. Parikh

Optimizations aimed at improving the efficiency of on-chip memories are extremely important. We propose a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations. Experimental results obtained using a generic cost model indicate significant reductions in data transfer activity between SPM and off-chip memory.

42.2 Clustered VLIW Architectures with Predicated Switching [p. 696]
Margarida F. Jacome, Gustavo de Veciana, Satish Pillai

In order to meet the high throughput requirements of applications exhibiting high ILP, VLIW ASIPs may increasingly include large numbers of functional units (FUs). Unfortunately, 'switching' data through register files shared by large numbers of FUs quickly becomes a dominant cost/performance factor, suggesting that clustering smaller numbers of FUs around local register files may be beneficial even if data transfers are required among clusters. With such machines in mind, we propose a compiler transformation, predicated switching, which enables aggressive speculation while leveraging the penalties associated with inter-cluster communication to achieve gains in performance. Based on representative benchmarks, we demonstrate that this novel technique is particularly suitable for application-specific clustered machines aimed at supporting high ILP, as compared to state-of-the-art approaches.

42.3 High-Quality Operation Binding for Clustered VLIW Datapaths [p. 702]
Viktor S. Lapinskii, Margarida F. Jacome, Gustavo A. de Veciana

Clustering is an effective method to increase the available parallelism in VLIW datapaths without incurring the severe penalties associated with a large number of register file ports. Efficient utilization of a clustered datapath requires careful binding of operations to clusters. The paper proposes a binding algorithm that effectively explores trade-offs between in-cluster operation serialization and delays associated with data transfers between clusters. Extensive experimental evidence is provided showing that the algorithm generates high-quality solutions for basic blocks, with up to 29% improvement over a state-of-the-art advanced binding algorithm.

42.4 Fast Bit-True Simulation [p. 708]
Holger Keding, Martin Coors, Olaf Lüthje, Heinrich Meyr

This paper presents a design environment which enables fast simulation of fixed-point signal processing algorithms. In contrast to existing approaches which use C/C++ libraries for the emulation of generic fixed-point data types, this novel approach additionally permits a code transformation to integral data types for fast simulation of the bit-true behavior. A speedup by a factor of 20 to 400 can be achieved compared to library based simulation.
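
The enabling transformation can be illustrated as follows (a minimal Python stand-in for the kind of integer code such a transformation would produce; the Q1.15 format, the saturation behavior, and the tiny FIR kernel are assumptions, not the tool's actual output):

    FRAC_BITS = 15                                  # assumed Q1.15 format
    WORD_MIN, WORD_MAX = -(1 << 15), (1 << 15) - 1

    def to_fixed(x):
        return max(WORD_MIN, min(WORD_MAX, int(round(x * (1 << FRAC_BITS)))))

    def saturate(acc):
        return max(WORD_MIN, min(WORD_MAX, acc))

    def fir_fixed(samples, coeffs):
        # Bit-true FIR filter using only integer operations: this is the kind
        # of code a transformation to integral data types yields, instead of
        # calls into a generic fixed-point emulation library.
        out = []
        for n in range(len(samples)):
            acc = 0
            for k, c in enumerate(coeffs):
                if n - k >= 0:
                    # integer multiply, then shift back down to Q1.15
                    acc += (samples[n - k] * c) >> FRAC_BITS
            out.append(saturate(acc))
        return out

    x = [to_fixed(v) for v in (0.5, -0.25, 0.75)]
    h = [to_fixed(v) for v in (0.5, 0.5)]
    print(fir_fixed(x, h))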


Session 43: Timing with Crosstalk

Chair: Sachin S. Sapatnekar
Organizers: Jaijeet Roychowdhury, Mustafa Celik
43.1 Timing Analysis with Crosstalk as Fixpoints on Complete Lattice [p. 714]
Hai Zhou, Narendra Shenoy, William Nicholls

Increasing delay variation due to crosstalk has a dramatic impact on deep sub-micron technologies. It is now necessary to include crosstalk in timing analysis. But timing analysis with crosstalk is a chicken-and-egg problem, since the crosstalk effect in turn depends on the timing behavior of the circuit. In this paper, we establish a theoretical foundation for timing analysis with crosstalk. We show that solutions to the problem are fixpoints on a complete lattice. Based on that, we prove in general the convergence of any iterative approach. We also show that, starting from different initial solutions, an iterative approach will reach different fixpoints. The current prevailing practice, which starts from the worst-case solution, will always reach the greatest fixpoint (which is the loosest solution). In order to reach the least fixpoint, we need to start from the best-case solution. Based on chaotic iteration and the heterogeneous structure of coupled circuits, we also design techniques to speed up iterations.
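
The effect of the starting point can be seen even in a two-net toy example (a minimal Python sketch; the switching windows, the overlap test, and the fixed push-out delay are invented purely to show the iteration structure, not the paper's circuit model):

    # A switching window is (earliest, latest) switching time of a net.
    BASE = {"a": (0.0, 3.0), "b": (3.5, 5.0)}   # windows without any coupling
    PUSH_OUT = 2.0                              # extra delay when windows overlap

    def overlaps(w1, w2):
        return w1[0] <= w2[1] and w2[0] <= w1[1]

    def step(windows):
        new = {}
        for net, (lo, hi) in BASE.items():
            other = "b" if net == "a" else "a"
            if overlaps(windows[net], windows[other]):
                hi += PUSH_OUT          # coupling only enlarges the window here
            new[net] = (lo, hi)
        return new

    def iterate(start):
        windows = dict(start)
        while (nxt := step(windows)) != windows:
            windows = nxt
        return windows

    # Worst-case start (huge windows, all couplings assumed active) converges to
    # the greatest fixpoint; best-case start (no coupling) to the least fixpoint.
    print(iterate({n: (0.0, 100.0) for n in BASE}))  # {'a': (0.0, 5.0), 'b': (3.5, 7.0)}
    print(iterate(BASE))                             # {'a': (0.0, 3.0), 'b': (3.5, 5.0)}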

43.2 Driver Modeling and Alignment for Worst-Case Delay Noise [p. 720]
Supamas Sirichotiyakul, David Blaauw, Chanhee Oh, Rafi Levy, Vladimir Zolotov, Jingyan Zuo

In this paper, we present a new approach to model the impact of cross-coupling noise on interconnect delay. We introduce a new linear driver model that accurately models the noise pulse induced on a switching signal net due to cross coupling capacitance. The proposed model effectively captures the non-linear behavior of the victim driver gate during the transition and has an average error below 8% whereas the traditional approach using a Thevenin model incurs an average error of 48%. We also discuss the worst case alignment of the aggressor net transitions with respect to the victim net transition, emphasizing the need to maximize not merely the delay of the interconnect alone but the combined delay of the interconnect and receiver gate. We show that the worst case alignment of an aggressor net transition is a function of the receiver gate output loading, victim transition edge rate, and the noise pulse width and height and hence propose a pre-characterization approach to efficiently predict the worst-case alignment. The proposed methods were implemented in an industrial noise analysis tool called ClariNet. Results on industrial designs are presented to demonstrate the effectiveness of our approach.

43.3 False Coupling Interactions in Static Timing Analysis [p. 726]
Ravishankar Arunachalam, Ronald D. Blanton, Lawrence T. Pileggi

Neighboring line switching can contribute to a large portion of the delay of a line for today's deep submicron designs. In order to avoid excessive conservatism in static timing analysis, it is important to determine if aggressor lines can potentially switch simultaneously with the victim. In this paper, we present a comprehensive ATPG-based approach that uses functional information to identify valid interactions between coupled lines. Our algorithm accounts for glitches on aggressors that can be caused by static and dynamic hazards in the circuit. We present results on several benchmark circuits that show the value of considering functional information to reduce the conservatism associated with worst-case coupled line switching assumptions during static timing analysis.

43.4 Coupling Delay Optimization by Temporal Decorrelation using Dual Threshold Voltage Technique [p. 732]
Ki-Wook Kim, Seong-Ook Jung, Prashant Saxena, C. L. Liu, Sung-Mo Kang

The coupling effect due to line-to-line capacitance is a serious concern in the timing analysis of circuits in ultra deep submicron CMOS technology. The coupling delay often depends strongly on the temporal correlation of signal switching in the relevant wires. Temporal decorrelation, achieved by shifting timing windows, can alleviate the performance degradation induced by tight coupling. This paper presents an algorithm for minimizing circuit delay through timing window modulation in dual-Vt technology. Experimental results on the ISCAS85 benchmark circuits indicate that the critical delay can be reduced significantly when low Vt is applied properly.
Keywords: Coupling, Performance, Leakage power


Session 44: Low Power Design: Systems to Interconnect

Chair: Vivek Tiwari
Organizers: Ingrid Verbauwhede, Tadahiro Kuroda
44.1 Input Space Adaptive Design: A High-level Methodology for Energy and Performance Optimization [p. 738]
W. Wang, A. Raghunathan, G. Lakshminarayana, N. K. Jha

This paper presents a high-level design methodology, called input space adaptive design, and new design automation algorithms for optimizing energy consumption and performance. An input space adaptive design exploits the well-known fact that the quality of hardware circuits and software programs can be significantly optimized by employing algorithms and implementation architectures that adapt to the input statistics. We propose a methodology for such designs which includes identifying parts of the behavior to be optimized, selecting appropriate input sub-spaces, transforming the behavior, and verifying the equivalence of the original and optimized designs. Experimental results indicate that such designs can reduce energy consumption by up to 70.6% (average of 55.4%), and simultaneously improve performance by up to 85.1% (average of 58.1%), leading to a reduction in the energy-delay product by up to 95.6% (average of 80.7%), compared to well-optimized designs that do not employ such techniques.

44.2 A2BC: Adaptive Address Bus Coding for Low Power Deep Sub-Micron Designs [p. 744]
Jörg Henkel, Haris Lekatsas

Due to longer and wider buses and deep sub-micron effects, where coupling capacitances between bus lines are of the same order of magnitude as base capacitances, the power consumption of interconnects starts to have a significant impact on a system's total power consumption. We present novel address bus encoding schemes that take coupling effects into consideration. The basis is a physical bus model that quantifies coupling capacitances. As a result, we report power/energy savings on the address buses of up to 56% compared to the best known ordinary power/energy-efficient encoding schemes. We thereby exceed the only previous approach that also takes coupling effects into consideration. Moreover, our encoding schemes do not assume any a priori knowledge that is particular to a specific application.
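
As background on how coupling-aware bus encoding works in general (a minimal Python sketch of a generic coupling-aware bus-invert scheme; these are not the A2BC codes themselves, and the coupling weight LAMBDA and the 8-bit width are assumptions):

    WIDTH = 8
    LAMBDA = 2.0   # assumed ratio of coupling to base capacitance

    def bits(word):
        return [(word >> i) & 1 for i in range(WIDTH)]

    def transition_cost(prev, nxt):
        p, n = bits(prev), bits(nxt)
        self_cost = sum(pi != ni for pi, ni in zip(p, n))           # base-capacitance toggles
        coup_cost = sum(abs((n[i] - p[i]) - (n[i + 1] - p[i + 1]))  # opposite toggles on adjacent
                        for i in range(WIDTH - 1))                  # lines are the most expensive
        return self_cost + LAMBDA * coup_cost

    def encode(prev_sent, word):
        # Send the word or its complement (plus an invert flag for the decoder),
        # whichever causes the lower coupling-weighted transition cost.
        inv = word ^ ((1 << WIDTH) - 1)
        if transition_cost(prev_sent, word) <= transition_cost(prev_sent, inv):
            return word, 0
        return inv, 1

    prev = 0x0F
    for addr in (0x10, 0x11, 0xEE):
        sent, flag = encode(prev, addr)
        print(hex(addr), "->", hex(sent), "invert:", flag)
        prev = sent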

44.3 Coupling-Driven Bus Design for Low-Power Application-Specific Systems [p. 750]
Youngsoo Shin, Takayasu Sakurai

In modern embedded systems, including communication and multimedia applications, a large fraction of power is consumed during memory access and data transfer. Thus, buses should be designed and optimized to consume reasonable power while delivering sufficient performance. In this paper, we address a bus ordering problem for low-power application-specific systems. A heuristic algorithm is proposed to determine the order of bus lines in a way that reduces the effective lateral component of capacitance, thereby reducing the power consumed by buses. Experimental results for various examples indicate that average power savings of 30% to 46.7%, depending on the capacitance components, can be obtained without any circuit overhead.

44.4 Modeling and Minimization of Interconnect Energy Dissipation in Nanometer Technologies [p. 754]
Clark N. Taylor, Sujit Dey, Yi Zhao

As the technology sizes of semiconductor devices continue to decrease, the effects of nanometer technologies on interconnects, such as crosstalk glitches and timing variations, become more significant. In this paper, we study the effect of nanometer technologies on energy dissipation in interconnects. We propose a new power estimation technique which considers DSM effects, resulting in significantly more accurate energy dissipation estimates than transition-count based methods for on-chip interconnects. We also introduce an energy minimization technique which attempts to minimize large voltage swings across the cross-coupling capacitances between interconnects. Even though the number of transitions may increase, our method yields a decrease in power consumption of up to 50%.

44.5 A True Single-Phase 8-bit Adiabatic Multiplier [p. 758]
Suhwan Kim, Conrad H. Ziesler, Marios C. Papaefthymiou

This paper presents the design and evaluation of an 8-bit adiabatic multiplier. Both the multiplier core and its built-in self-test logic have been designed using a true single-phase adiabatic logic family. Energy is supplied to the adiabatic circuitry via a sinusoidal power-clock waveform that is generated on-chip. In HSPICE simulations with post-layout extracted parasitics, our design functions correctly at clock frequencies exceeding 200 MHz. The total dissipation of the multiplier core and self-test circuitry approaches 130 pJ per operation at 200 MHz. Our 11,854-transistor chip has been fabricated in a 0.5 um standard CMOS process with an active area of 0.470 mm2. Correct chip operation has been validated for operating frequencies up to 130 MHz, the limit of our experimental setup. Measured dissipation correlates well with HSPICE simulations.
Categories and Subject Descriptors: B.7.1 [Hardware]: Integrated Circuits - Types and Design Styles; B.8.1 [Hardware]: Performance and Reliability - Reliability, Testing, and Fault-Tolerance; B.8.m [Hardware]: Performance and Reliability - Miscellaneous
General Terms: Design, Measurement, Verification
Keywords: Adiabatic logic, Clock generator, CMOS, Dynamic logic, Low power, Low energy, SCAL, SCAL-D, Single phase, Multiplier, VLSI


Session 45: Floorplanning Representations and Placement Algorithms

Chair: Ralph H. J. M. Otten
Organizers: Louis Scheffer, Patrick Groeneveld
45.1 TCG: A Transitive Closure Graph-Based Representation for Non-Slicing Floorplans [p. 764]
Jai-Ming Lin, Yao-Wen Chang

In this paper, we propose a transitive closure graph-based representation for general floorplans, called TCG, and show its superior properties. TCG combines the advantages of popular representations such as sequence pair, BSG, and B*-tree. Like sequence pair and BSG, but unlike O-tree, B*-tree, and CBL, TCG is P-admissible. Like B*-tree, but unlike sequence pair, BSG, O-tree, and CBL, TCG does not need to construct additional constraint graphs for the cost evaluation during packing, implying faster runtime. Further, TCG supports incremental update during operations and keeps the information of boundary modules as well as the shapes and the relative positions of modules in the representation. More importantly, the geometric relation among modules is transparent not only to the TCG representation but also to its operations, facilitating the convergence to a desired solution. All these properties make TCG an effective and flexible representation for handling the general floorplan/placement design problems with various constraints. Experimental results show the promise of TCG.
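
To make the constraint-graph intuition concrete, packing from a pair of horizontal and vertical constraint graphs reduces to longest-path computations (a generic Python sketch of that packing step only; the three modules, their sizes, and the edges are invented, and the TCG-specific operations on the graphs are not shown):

    modules = {"A": (4, 3), "B": (2, 5), "C": (3, 2)}   # name: (width, height)
    horiz = {"A": ["B"], "B": ["C"], "C": []}           # A left of B, B left of C
    vert  = {"A": [], "B": [], "C": []}                 # no vertical constraints here

    def longest_path_coords(edges, size_index):
        coord = {m: 0 for m in modules}
        # modules are listed in an order that is topological for both graphs
        for m in modules:
            for succ in edges[m]:
                coord[succ] = max(coord[succ], coord[m] + modules[m][size_index])
        return coord

    x = longest_path_coords(horiz, 0)   # x-coordinates from widths along the horizontal graph
    y = longest_path_coords(vert, 1)    # y-coordinates from heights along the vertical graph
    for m in modules:
        print(m, "placed at", (x[m], y[m]))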

45.2 Floorplanning with Abutment Constraints and L-Shaped/T-Shaped Blocks based on Corner Block List [p. 770]
Yuchun Ma, Xianlong Hong, Sheqin Dong, Yici Cai, Chung-Kuan Cheng, Jun Gu

The abutment constraint is one of the constraints commonly imposed in practice to facilitate the transmission of data between blocks. Based on the Corner Block List (CBL) representation, a new algorithm to handle abutment constraints is developed in this paper. We obtain the abutment information by scanning the intermediate solutions represented by CBL in linear time during the simulated annealing process, and we repair the CBL whenever the constraints are violated. Based on this algorithm, a new method to handle L-shaped/T-shaped blocks is proposed. The shape flexibility of the soft blocks and the rotation and reflection of L-shaped/T-shaped blocks are exploited to obtain a tight packing. Experimental results on benchmark data demonstrate the effectiveness of the proposed method.

45.3 Improved Cut Sequences for Partitioning Based Placement [p. 776]
Mehmet Can Yildiz, Patrick H. Madden

Recursive partitioning based placement has a long history, but there has been little consensus on how cut sequences should be chosen. In this paper, we present a dynamic programming approach to cut sequence generation. If certain assumptions hold, these sequences are optimal. After study of these optimal sequences, we observe that an extremely simple method can be used to construct sequences that are near optimal. Using this method, our bisection based placement tool Feng Shui outperforms the previously presented Capo tool by 11% on a large benchmark. By integrating our cut sequence method into Capo, we are able to improve performance by 5%, bringing the results of Feng Shui and Capo closer together.
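
The shape of such a dynamic program over cut sequences can be sketched as follows (a toy Python example; the cost model, in which each cut costs the length of the cut line and regions are halved until unit rows and columns remain, is invented for illustration and is not the paper's objective function):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def best(w, h):
        # Returns (cost, cut sequence) for recursively bisecting a w x h region.
        if w <= 1 and h <= 1:
            return 0, ""
        options = []
        if w > 1:   # vertical cut: cut line of length h, halves of width w // 2
            c, seq = best(w // 2, h)
            options.append((h + 2 * c, "V" + seq))
        if h > 1:   # horizontal cut: cut line of length w, halves of height h // 2
            c, seq = best(w, h // 2)
            options.append((w + 2 * c, "H" + seq))
        return min(options)

    cost, sequence = best(16, 4)
    print("cost:", cost, "cut sequence:", sequence)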

45.4 Timing Driven Placement using Physical Net Constraints [p. 780]
Bill Halpin, C. Y. Roger Chen, Naresh Sehgal

This paper presents a new timing-driven placement algorithm that explicitly meets physical net length constraints. It is the first recursive bi-section placement (RBP) algorithm that meets precise half-perimeter bounding box constraints on critical nets. At each level of the recursive bi-section, we use linear programming to ensure that all net constraints are met. Our method can easily be incorporated into existing RBP methods. We use the net-constraint-based placer to improve timing results by setting and meeting constraints on timing-critical nets. We report significantly better timing results on each of the MCNC benchmarks and achieve an average optimization exploitation of 69% versus the previously reported 53%.

45.5 From Architecture to Layout: Partitioned Memory Synthesis for Embedded Systems-on-Chip [p. 784]
L. Benini, L. Macchiarulo, A. Macii, E. Macii, M. Poncino

We propose an integrated front-end/back-end flow for the automatic generation of a multi-bank memory architecture for embedded systems. The flow is based on an algorithm for the automatic partitioning of on-chip SRAM. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The partitioning algorithm is integrated with the physical design phase into a complete flow that allows the back-annotation of layout information to drive the partitioning process. Results, collected on a set of embedded applications for the ARM processor, have shown average energy savings around 34%.


Session 46: Panel: What Drives EDA Innovation? [p. 790]

Chair: Steven Schulz
Organizer: Georgia Marszalek
Panel Members: Greg Hinckley, Greg Spirakis, Karen Vahtra, John Darringer, J. George Janac, Handel H. Jones
If EDA technology innovation drives return on investment, what drives innovation? Is it tied to the semiconductor retooling cycle? To productivity requirements? To Moore's Law? Or is there something more fundamental that we are missing? What driving forces could result in a $30 billion EDA industry, and what role will innovation play? Will the industry provide the needed breakthroughs for their customers, or seek instead the lowest common denominator user? Will designers "roll their own" tools or seek common solutions? How does EDA innovation track with the semiconductor business cycle? Does a slowdown accelerate or depress creation of new tools and new designs? Panelists will discuss the forces at work in the EDA industry of tomorrow.


Session 47: Signal Integrity: Avoidance and Test Techniques

Chair: Anirudh Devgan
Organizers: Chandu Visweswariah, Kenji Yoshida
47.1 Built-In Self-Test for Signal Integrity [p. 792]
Mehrdad Nourani, Amir Attarha

Unacceptable loss of signal integrity may harm the functionality of SoCs permanently or intermittently. We propose a systematic approach to model and test signal integrity in deep-submicron high-speed interconnects. Various signal integrity problems occurring on such interconnects (e.g. crosstalk, overshoot, noise, skew, etc.) are considered in a unified model. We also present a test methodology that uses a noise detection circuitry to detect low integrity signals and an inexpensive test architecture to measure and read the statistics for final observation and analysis.

47.2 Analysis of On-Chip Inductance Effects using a Novel Performance Optimization Methodology for Distributed RLC Interconnects [p. 798]
Kaustav Banerjee, Amit Mehrotra

This work presents a new and computationally efficient performance optimization technique for distributed RLC interconnects based on a rigorous delay computation scheme. The new optimization technique has been employed to analyze the impact of line inductance on circuit behaviour and to illustrate the implications of technology scaling on wire inductance. It is shown that the reduction in driver capacitance and output resistance with scaling makes deep submicron (DSM) designs increasingly susceptible to inductance effects. Also, the impact of inductance variations on performance has been quantified. Additionally, the impact of wire inductance on catastrophic logic failures and IC reliability issues has been analyzed.

47.3 Modeling and Analysis of Differential Signaling for Minimizing Inductive Cross-Talk [p. 804]
Yehia Massoud, Jamil Kawa, Don MacMillen, Jacob White

Many physical synthesis tools interdigitate signal and power lines to reduce cross-talk, and thus, improve signal integrity and timing predictability. Such approaches are extremely effective at reducing cross-talk at circuit speeds where inductive effects are inconsequential. In this paper, we use a detailed distributed RLC model to show that inductive cross-talk effects are substantial in long busses associated with 0.18 micron technology. Simulation experiments are then used to demonstrate that cross-talk in such high speed technologies is much better controlled by re-deploying inter-digitated power lines to perform differential signaling.


Session 48: Novel Approaches to Microprocessor Design and Verification

Chair: Derek Beatty
Organizers: Carl Pixley, Shin-ichi Minato
48.1 Automated Pipeline Design [p. 810]
Daniel Kroening, Wolfgang J. Paul

The interlock and forwarding logic is considered the tricky part of a fully-featured pipelined microprocessor, and debugging these parts in particular delays the hardware design process considerably. It is therefore desirable to automate the design of both the interlock and the forwarding logic. The hardware design engineer begins with a sequential implementation without any interlock or forwarding logic. A tool then adds the forwarding and interlock logic required for pipelining. This paper describes the algorithm for such a tool, and its correctness is formally verified. We use a standard DLX RISC processor as an example.
Keywords: Pipeline, Forwarding, Interlock

48.2 A New Verification Methodology for Complex Pipeline Behavior [p. 816]
Kazuyoshi Kohno, Nobu Matsumoto

A new test program generation tool, mVpGen, has been developed for verifying the pipeline design of microprocessors. The only inputs mVpGen requires are pipeline-behavior specifications: it first automatically generates test cases from the pipeline-behavior specifications and then automatically generates test programs corresponding to those test cases. Test programs for verifying complex pipeline behavior, such as hazard-and-branch or hazard-and-exception interactions, are generated. mVpGen has been integrated into a verification system for verifying the RTL descriptions of a real microprocessor design, and complex bugs that remained hidden in the RTL descriptions have been detected.

48.3 Pre-silicon Verification of the Alpha 21364 Microprocessor Error Handling System [p. 822]
Richard Lee, Benjamin Tsien

This paper presents the strategy used to verify the error logic in the Alpha 21364 microprocessor. Traditional pre-silicon strategies of focused testing or unit-level random testing yield limited results in finding complex bugs in the error handling logic of a microprocessor. This paper introduces a technique to simulate error conditions and their recovery in a global environment using random test stimulus closely approximating traffic found in a real system. A significant number of bugs were found using this technique. A majority of these bugs could not be uncovered using a simple random environment, or were counterintuitive to focused test design.


Session 49: Scheduling Techniques for Power Management

Chair: Donatella Sciuto
Organizers: Diederik Verkest, Luciano Lavagno
49.1 Energy Efficient Fixed-Priority Scheduling for Real-Time Systems on Variable Voltage Processors [p. 828]
Gang Quan, Xiaobo (Sharon) Hu

Energy consumption has become an increasingly important consideration in designing many real-time embedded systems. Variable voltage processors, if used properly, can dramatically reduce such system energy consumption. In this paper, we present a technique to determine voltage settings for a variable voltage processor that utilizes a fixed priority assignment to schedule jobs. Our approach also produces the minimum constant voltage needed to feasibly schedule the entire job set. Our algorithms lead to significant energy saving compared with previously presented approaches.
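
The energy leverage behind such voltage settings can be seen with a back-of-the-envelope calculation (a Python sketch with assumed device parameters and a single job; this is not the paper's fixed-priority scheduling algorithm, only the underlying voltage/energy/delay trade-off):

    # Dynamic energy per cycle ~ C * V^2; achievable frequency ~ k * (V - Vt)^alpha / V.
    VT, ALPHA, K = 0.5, 1.3, 1.0e9      # assumed threshold voltage, delay exponent,
                                        # and frequency constant

    def f_max(v):
        return K * (v - VT) ** ALPHA / v       # achievable clock frequency (Hz)

    def rel_energy(v, cycles):
        return cycles * v ** 2                 # relative dynamic energy (C dropped)

    cycles, deadline = 4.0e6, 10.0e-3          # job size and deadline (10 ms)

    # Lowest constant voltage (on a grid) that still meets the deadline.
    candidates = [0.6 + 0.01 * i for i in range(181)]            # 0.6 V .. 2.4 V
    v_min = min(v for v in candidates if cycles / f_max(v) <= deadline)
    print("minimum constant voltage:", round(v_min, 2), "V")
    print("energy relative to running at 2.4 V:",
          round(rel_energy(v_min, cycles) / rel_energy(2.4, cycles), 2))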

49.2 Dynamic Power Management in a Mobile Multimedia System with Guaranteed Quality-of-Service [p. 834]
Qinru Qiu, Qing Wu, Massoud Pedram

In this paper we address the problem of dynamic power management in a distributed multimedia system with a required quality of service (QoS). Using a generalized stochastic Petri net model where the non-exponential inter-arrival time distribution of the incoming requests is captured by the "stage method", we provide a detailed model of the power-managed multimedia system under general QoS constraints. Based on this mathematical model, the power-optimal policy is obtained by solving a linear programming problem. We compare the new problem formulation and solution technique to previous dynamic power management techniques that can only optimize power under delay constraints. We then demonstrate that these other techniques yield policies with higher power dissipation by overconstraining the delay target in an attempt to indirectly satisfy the QoS constraints. In contrast, our new method correctly formulates the power management problem under QoS constraints and obtains the optimal solution.

49.3 Power-Aware Scheduling under Timing Constraints for Mission-Critical Embedded Systems [p. 840]
Jinfeng Liu, Pai H. Chou, Nader Bagherzadeh, Fadi Kurdahi

Power-aware systems are those that must make the best use of available power. They subsume traditional low-power systems in that they must not only minimize power when the budget is low, but also deliver high performance when required. This paper presents a new scheduling technique for supporting the design and evaluation of a class of power-aware systems in mission-critical applications. It satisfies stringent min/max timing and max power constraints. It also makes the best effort to satisfy the min power constraint in an attempt to fully utilize free power or to control power jitter. Experimental results show that our scheduler can improve performance and reduce energy cost simultaneously compared to hand-crafted designs for previous missions. This tool forms the basis of the IMPACCT system-level framework that will enable designers to explore many power-performance trade-offs with confidence.


Session 50: Novel Devices and Yield Optimization

Chair: Chandu Visweswariah
Organizers: Chandu Visweswariah, Ralph H. J. M. Otten
50.1 Exploring SOI Device Structures and Interconnect Architectures for 3-Dimensional Integration [p. 846]
Rongtian Zhang, Kaushik Roy, Cheng-Kok Koh, David B. Janes

3-Dimensional (3-D) integration offers numerous advantages over conventional structures. Double-gate (DG) transistors can be fabricated for better device characteristics, and multiple device layers can be vertically stacked for better interconnect performance. In this paper, we explore suitable device structures and interconnect architectures for multi-device-layer integrated circuits and study how 3-D SOI circuits can better meet the performance and power dissipation requirements projected by the ITRS for future technology generations. Results demonstrate that DGSOI circuits can achieve as much as 20% performance gain and 80% power-delay product reduction compared to SGSOI (single-gate SOI) circuits. More importantly, for interconnect-dominated circuits, multi-device-layer integration offers significant performance improvement. Compared to 2-D integration, most 3-D circuits can be clocked at much higher frequencies (double or even triple). Multi-device-layer circuits, with suitable SOI device structures, can be a viable solution for future low-power high-performance applications.

50.2 Two-Dimensional Position Detection System with MEMS Accelerometer for MOUSE Applications [p. 852]
Seungbae Lee, Gi-Joon Nam, Junseok Chae, Hanseup Kim, Alan J. Drake

A hybrid two-dimensional position sensing system is designed for mouse applications. The system measures the acceleration of hand movements, which is converted into two-dimensional location coordinates. The system consists of four major components: 1) MEMS accelerometers, 2) CMOS analog read-out circuitry, 3) an acceleration magnitude extraction module, and 4) a 16-bit RISC microprocessor. Mechanical and analog circuit simulation shows that the designed padless mouse system can detect accelerations as small as 5.3 mg and operate at up to 18 MHz.

50.3 Mismatch Analysis and Direct Yield Optimization by Spec-Wise Linearization and Feasibility-Guided Search [p. 858]
Frank Schenkel, Michael Pronath, Stephan Zizala, Robert Schwencker, Helmut Graeb, Kurt Antreich

We present a new method for mismatch analysis and automatic yield optimization of analog integrated circuits with respect to global, local and operational tolerances. Effectiveness and efficiency of yield estimation and optimization are guaranteed by consideration of feasibility regions and by performance linearization at worst-case points. The proposed methods were successfully applied to two example circuits for an industrial fabrication process.