GLSVLSI 1998 ABSTRACTS
-
Low Power Memory Architectures for Video Applications [p. 2]
-
Bhanu Kapoor
We provide data and insight into how the choice of cache parameters affects memory
power consumption of video algorithms. We make use of memory traces generated as a
result of running typical MPEG-2 motion estimation algorithms to simulate a large
number of cache configurations. The cache simulation data is then combined with
on-chip and off-chip memory power models to compute memory power consumption. In
the area of analysis of video algorithms, this paper focuses on the following issues:
We provide a detailed study of how varying cache size, block size, and associativity
affects memory power consumption. The configurations of particular interest are the ones
that optimize power under certain constraints. We also study the role of process
technology in these experiments. In particular, we look at how moving to a more
advanced process technology for the on-chip cache affects optimal points of operation
with respect to memory power consumption.
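A minimal sketch of the trace-driven flow described above, assuming an LRU set-associative cache and a toy energy model. The function names, the per-access energies `E_HIT_NJ` and `E_MISS_NJ`, and the synthetic address trace are hypothetical placeholders; the paper's on-chip and off-chip power models and its MPEG-2 traces are not reproduced here.

```python
from collections import OrderedDict

# Hypothetical per-access energies in nJ; real numbers would come from
# on-chip cache and off-chip memory power models for a given process.
E_HIT_NJ = 0.5    # every access probes the on-chip cache
E_MISS_NJ = 10.0  # extra cost of fetching a block from off-chip memory


def simulate(trace, cache_bytes, block_bytes, assoc):
    """Return (hits, misses) of an LRU set-associative cache on an address trace."""
    num_sets = cache_bytes // (block_bytes * assoc)
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        block = addr // block_bytes
        index, tag = block % num_sets, block // num_sets
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(lines) == assoc:
                lines.popitem(last=False)   # evict the least recently used line
            lines[tag] = True
    return hits, misses


def memory_energy(hits, misses):
    """Toy memory energy: one cache probe per access plus an off-chip fetch per miss."""
    return (hits + misses) * E_HIT_NJ + misses * E_MISS_NJ


# Sweep cache size, block size and associativity over a synthetic trace
# (a stand-in for motion-estimation memory traces).
trace = [(i * 4) % 4096 for i in range(20000)]
for size in (1024, 4096):
    for block in (16, 32):
        for assoc in (1, 2):
            h, m = simulate(trace, size, block, assoc)
            print(f"{size:5d} B, {block:2d} B blocks, {assoc}-way: "
                  f"{memory_energy(h, m):9.1f} nJ")
```

With realistic energy numbers in place of the placeholders, the lowest-energy configuration within a size constraint would correspond to one of the constrained optima the abstract refers to.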
-
Reducing Power Consumption of Dedicated Processors through Instruction
Set Encoding [p. 8]
-
Luca Benini, Giovanni De Micheli, Alberto Macii, Enrico Macii, Massimo Poncino
With the increased clock frequency of modern, high-performance processors (over 500 MHz,
in some cases), limiting the power dissipation has become the most stringent design target.
It is thus mandatory for processor engineers to resort to a large variety of optimization
techniques to reduce the power requirements in the hot zones of the chip. In this paper,
we focus on the power dissipated by the instruction fetch and decode logic, a portion of
the processor architecture where a lot of capacitance switching normally takes place. We
propose a methodology for determining an encoding of the instruction set that guarantees
the minimization of the number of bit transitions occurring inside the registers of the
pipeline stages involved in instruction fetching and decoding. The assignment of the
binary patterns to the op-codes is driven by the statistics concerning instruction
adjacency collected through instruction-level simulation of typical software applications;
therefore, the technique is best exploited when applied to encode the instruction set of
core processors and microcontrollers, since components of these types are commonly used to
execute fixed portions of machine code within embedded systems. We illustrate the effectiveness
of the methodology through the experimental data we have obtained on an existing microprocessor.
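The cost that such an encoding minimizes can be sketched as follows, assuming the adjacency statistics are given as counts of consecutive opcode pairs. The opcode names and counts are hypothetical, and the exhaustive search is only a stand-in that works for tiny instruction sets; it is not the authors' assignment algorithm.

```python
from itertools import permutations


def transitions(a, b):
    """Number of bit positions that toggle when code a is followed by code b."""
    return bin(a ^ b).count("1")


def encoding_cost(adjacency, code):
    """Total bit transitions implied by opcode adjacency counts.

    adjacency[(x, y)] is how often opcode y directly follows opcode x in a
    profiled instruction stream; code[x] is the binary pattern assigned to x.
    """
    return sum(n * transitions(code[x], code[y]) for (x, y), n in adjacency.items())


def best_encoding(opcodes, adjacency, width):
    """Try every assignment of distinct width-bit patterns (tiny sets only)."""
    best, best_cost = None, float("inf")
    for patterns in permutations(range(2 ** width), len(opcodes)):
        code = dict(zip(opcodes, patterns))
        cost = encoding_cost(adjacency, code)
        if cost < best_cost:
            best, best_cost = code, cost
    return best, best_cost


# Hypothetical adjacency statistics for a four-opcode core.
ops = ["LD", "ADD", "ST", "BR"]
adj = {("LD", "ADD"): 50, ("ADD", "ST"): 40, ("ST", "LD"): 30,
       ("ADD", "BR"): 5, ("BR", "LD"): 5}
code, cost = best_encoding(ops, adj, width=2)
print(code, cost)
```

Frequently adjacent opcodes end up one bit apart, which is the effect the methodology exploits in the fetch and decode registers.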
-
A Low-Power High-Performance Embedded SRAM Macrocell [p. 13]
-
A.M. Fahim, M. Khellah, and M.I. Elmasry
A new approach to modeling the decoding hierarchy in a hierarchical word line (HWL)
SRAM architecture using integer-linear programming (ILP) is introduced. Using this approach,
the HWL architecture is shown to be inadequate for very large SRAM sizes. Alternatively,
a new low-power high-speed SRAM architecture is described. This architecture is shown to have
fairly constant speed and power dissipation for sizes ranging from 32 kb to 4 Mb. Low power
is achieved by a voltage boosting technique not requiring a two-step voltage [7], and by a
new method of tristating memory cells during a write operation. The SRAM was implemented in
a 0.35μm CMOS technology operated at 150MHz while dissipating only 10mW.
-
Low-Power Design of Finite Field Multipliers for Wireless Applications [p. 19]
-
A.G. Wassal, M.A. Hassan, and M.I. Elmasry
Unlike most research involving finite field multipliers, this work targets a low-power
multiplier through the application of various power reduction techniques to different
types of multipliers and comparing their power consumption among other factors, rather
than comparing complexity measures such as gate count or area. Gate count is used as a
starting point to choose potential architectures, namely, polynomial and normal basis
architectures. Power reduction techniques employed are mainly concerned with Architecture-
and Logic-Level low-power techniques. They include supply voltage reduction, power cost
estimations, using low-power logic families and pipelining.
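A minimal sketch of polynomial-basis multiplication in GF(2^m), the operation such multipliers implement: shift-and-add over GF(2) with reduction by the irreducible field polynomial. The helper `gf2m_mul` is hypothetical, and GF(2^8) with x^8 + x^4 + x^3 + x + 1 (the AES field) is an illustrative choice; the abstract does not state the field size or polynomial used.

```python
def gf2m_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m) using a polynomial basis.

    Elements are bit vectors of polynomial coefficients (bit i is the
    coefficient of x^i); poly is the irreducible field polynomial,
    including its x^m term.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # addition in GF(2) is XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & (1 << m):
            a ^= poly            # reduce modulo the field polynomial
    return result


AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1
print(hex(gf2m_mul(0x57, 0x83, AES_POLY, 8)))
```

In a normal basis, squaring reduces to a cyclic shift of the coefficient vector, so the two architectures organize the same field product quite differently.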
-
Guidelines for Use of Registers and Multiplexers in Low Power
Low Voltage DSP Systems [p. 26]
-
Dusan Suvakovic, C. Andre T. Salama
Registers and datapath multiplexers exist in most DSP datapaths. Although not performing
computations, they are necessary for the dataflow control and they consume energy. This
paper describes the nature of register and multiplexer energy consumption in modern low
power CMOS processes, shows its strong dependence on architectural and layout design and
provides practical design guidelines for micropower implementation.
-
A Bootstrapped NMOS Charge Recovery Logic [p. 30]
-
Seung-Moon Yoo and Sung-Mo (Steve) Kang
This paper describes a new Bootstrapped NMOS Charge Recovery Logic (BNCRL) which realizes
low energy computation. Power comparison with a state-of-the-art adiabatic charge recovery
circuit is shown for an inverter chain and an 8-bit adder. The new logic circuits exhibit
full rail-to-rail logic swing, less dependency of energy consumption on output load capacitance
variations, and significant energy saving. Benchmark circuits were designed for comparison
using 0.6μm CMOS technology.
-
Power Reducing Techniques for Clocked CMOS PLAs [p. 34]
-
R.F. Hobson
Power saving techniques for CMOS Programmable Logic Arrays (PLAs) are discussed.
Two new techniques are introduced: an AND-plane pulse generator and Wired-OR CMOS.
Power reduction in excess of 75% over Pseudo-NMOS techniques and 50% over some clocked
PLA techniques is possible.
Keywords: Pulse generator, Wired-OR CMOS, Single-phase clock, Self-timed logic.
-
Dynamic and Short-Circuit Power of CMOS Gates Driving Lossless
Transmission Lines [p. 39]
-
Yehea I. Ismail, Eby G. Friedman, and Jose L. Neves
The dynamic and short-circuit power consumption of a CMOS gate driving an LC transmission
line as a limiting case of an RLC transmission line is investigated in this paper. Closed
form solutions for the output voltage and short-circuit power of a CMOS gate driving an LC
transmission line are presented. These solutions agree with AS/X circuit simulations within
11% error for a wide range of transistor widths and line impedances. The ratio of the short-circuit
to dynamic power is shown to be less than 7% for CMOS gates driving LC transmission
lines where the line is matched or underdriven. The total power consumption is expected to
decrease as inductance effects become more significant compared to an RC-dominated
interconnect.
-
A New Full Adder Cell for Low-Power Applications [p. 45]
-
Ahmed M. Shams, and Magdy A. Bayoumi
A new low-power CMOS 1-bit full adder cell is presented. It is based on a recent design
of XOR and XNOR gates [6] and on pass transistors, and it has 17 transistors. This cell has
been compared to two widely used efficient adder cells: the transmission function full
adder cell (16 transistors) [2] and the low-power adder cell (14 transistors) [3]. The
new cell has no short-circuit power and lower dynamic power than the other adder cells,
because it has fewer and smaller circuit capacitances. It consumes 10% to 15%
less power than the other two cells. A comparative analysis (using Magic and Hspice)
for 8-bit ripple carry and carry select adders shows that the adders based on the new
cell can save up to 25% of power consumption.
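A bit-level functional model of an XOR/XNOR-based full adder and the 8-bit ripple-carry adder built from it, as a sketch only: the sum comes from the XOR stage and the carry is selected by the XOR output, a common structure in such cells. The helper names are hypothetical, and the code models logic, not the 17-transistor circuit or its power.

```python
def full_adder(a, b, cin):
    """Logic of a full adder organized around an XOR/XNOR stage.

    h = a XOR b drives the sum; the carry is a 2:1 selection:
    if a != b the incoming carry propagates, otherwise cout equals a (= b).
    """
    h = a ^ b
    s = h ^ cin
    cout = cin if h else a
    return s, cout


def ripple_carry_add(x, y, width=8, cin=0):
    """Chain `width` full adders LSB first, as in a ripple-carry adder."""
    total, carry = 0, cin
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_carry_add(200, 100))  # (44, 1): 300 = 256 + 44
```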
-
Beta-Driven Threshold Elements [p. 52]
-
Victor I. Varshavsky
Circuits built from threshold elements have attracted considerable interest in recent
years. One possible approach to their implementation uses output-wired CMOS inverters
[3,4,5]. The model of such an element is a CMOS pair with a variable β of fully open
p- and n-transistors. This model is specified by the ratio form of the threshold function,
and it has been proved that any threshold function can be rewritten in ratio form. This
gives an evident way of implementing threshold functions as β-driven threshold elements
(βDTE). This implementation differs from the one based on output-wired CMOS inverters
in the following ways:
- a βDTE requires one transistor per weight unit rather than two;
- the implementability of a βDTE depends only on the threshold value, not on the sum of
the input weights.
The analysis of βDTE implementability, examples of circuits, and the results of their
SPICE simulation are given.
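For reference, a threshold element computes f = 1 exactly when the weighted sum of its inputs reaches the threshold. The sketch below evaluates that definition directly; `threshold_gate` is a hypothetical helper, the weights and thresholds are illustrative, and neither the ratio form nor the β-driven circuit is modeled.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs):
    """Return 1 if sum(w_i * x_i) >= threshold, else 0."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


# Example: 3-input majority, weights (1, 1, 1) and threshold 2.
for bits in product((0, 1), repeat=3):
    print(bits, threshold_gate((1, 1, 1), 2, bits))
```

Changing the weights changes the function: weights (2, 1, 1) with threshold 2 give x1 OR (x2 AND x3).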
-
A VLSI High-Performance Encoder with Priority Lookahead [p. 59]
-
José G. Delgado-Frias and Jabulani Nyathi
In this paper we introduce a VLSI priority encoder that uses a novel priority lookahead
scheme to reduce the worst-case delay of the circuit while maintaining
a very low transistor count. The encoder's topmost input request has the highest priority,
and priority descends linearly from there. Two design approaches for the priority encoder are
presented: one without a priority lookahead scheme and one with it. For an N-bit
encoder, the circuit with the priority lookahead scheme requires only 1.094 times the number
of transistors of the circuit without it. For a 32-bit encoder, for example,
the circuit with the priority lookahead scheme is 2.59 times faster than the
circuit without it. The worst-case delay is 4.4 ns for this
lookahead encoder in a 1-μm scalable CMOS technology. The proposed lookahead scheme can
be extended to larger encoders.
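A behavioral sketch of the idea, assuming input 0 is the topmost (highest-priority) request and that lookahead is done with a per-group "any request" signal over groups of four. Both function names and the group size are hypothetical; the paper's circuit-level lookahead scheme is not reproduced, only the functional behavior it must match.

```python
def priority_encode_linear(requests):
    """Index of the highest-priority active request (index 0 is topmost), or None."""
    for i, r in enumerate(requests):
        if r:
            return i
    return None


def priority_encode_lookahead(requests, group=4):
    """Same result, using a per-group 'any request' signal as lookahead.

    A group whose OR is 0 is skipped as a whole, so the search does not
    have to ripple through every bit position of an idle group.
    """
    for base in range(0, len(requests), group):
        chunk = requests[base:base + group]
        if any(chunk):                       # group lookahead signal
            return base + priority_encode_linear(chunk)
    return None


reqs = [0] * 32
reqs[13] = reqs[27] = 1
print(priority_encode_lookahead(reqs))  # 13
```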
-
Noise Margins of Threshold Logic Gates Containing Resonant
Tunneling Diodes [p. 65]
-
M. Bhattacharya and P. Mazumder
Threshold gates consisting of RTDs in conjunction with HBTs or CHFETs or MOS transistors
can form extremely compact, ultrafast, digital logic alternatives. The resonant tunneling
phenomenon causes these circuits to exhibit super-high-speed switching capabilities.
Additionally, by virtue of being threshold logic gates, they are guaranteed to be more
compact than traditional digital logic circuits while achieving the same functionality.
However, reliable logic design with these gates will need a thorough understanding of
their noise performance and power dissipation among other things. In this paper, we
present an analytical study of the noise performance of these threshold gates supplemented
by computer simulation results, with the objective of obtaining reliable circuit design guidelines.
-
600 MHz Digitally Controlled BiCMOS Oscillator (DCO) for VLSI Signal
Processing & Communication Applications [p. 71]
-
Azman M. Yusof, Lim Chu Aun & S.M. Rezaul Hasan
A 16-bit digitally controlled BiCMOS ring oscillator (DCO) is described. This
BiCMOS DCO design provides improved frequency stability under thermal fluctuations
compared to a CMOS DCO design presented in [1]. Simulations of a 5-stage DCO using
1 μm BiCMOS process parameters achieved a controllable frequency range of 90 to
640 MHz with a linear/quasi-linear range of around 300 MHz. A monotonic frequency gain
(frequency versus control-word transfer function) with fine tuning steps of several
kHz was verified. This augurs well for accurate frequency lock in a BiCMOS all-digital
PLL (ADPLL) for digital VLSI communication systems. The worst-case
jitter due to digital control transitions at pathological control-word boundaries
was observed to be less than 50 ps for the BiCMOS DCO, which is lower than that for
the CMOS DCO.
-
Stability of a Continuous-Time State Variable Filter with Op-amp and
OTA-C Integrators [p. 77]
-
Tim Bakken, John Choma, Jr.
The stability of a continuous-time state variable filter
is analyzed using the Routh-Hurwitz criterion. This
criterion assesses stability by indicating the number of
poles that lie in the right-half plane. The filter is
examined separately with integrators implemented with
an op-amp and an OTA-C. Both amplifier types are
characterized by a dominant-pole frequency response,
and the stability of each implementation is compared.
HSPICE simulations confirm the theoretical analyses,
which indicate that the gain-bandwidth product of the op-amps
and the bandwidth of the OTAs must be much larger
than the desired frequency of operation to ensure
stability. Since the analyses assume a dominant-pole
response, all higher-order poles of the actual amplifier
must also be much greater than the unity-gain frequency to minimize excess phase.
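A sketch of the Routh-Hurwitz test itself, assuming the characteristic polynomial has already been derived with real coefficients. `routh_rhp_roots` is a hypothetical helper that handles only the regular case (no zero in the first column). The filter's actual characteristic polynomials with op-amp or OTA-C integrators are not given in the abstract, so generic polynomials are used.

```python
def routh_rhp_roots(coeffs):
    """Count right-half-plane roots via sign changes in the Routh array's first column.

    coeffs: real polynomial coefficients, highest power first,
    e.g. [1, 6, 11, 6] for s^3 + 6s^2 + 11s + 6.
    """
    n = len(coeffs)
    width = (n + 1) // 2
    even = [float(c) for c in coeffs[0::2]]
    odd = [float(c) for c in coeffs[1::2]]
    row1 = even + [0.0] * (width - len(even))
    row2 = odd + [0.0] * (width - len(odd))
    first_col = [row1[0], row2[0]]
    for _ in range(n - 2):                   # one new row per remaining power of s
        if row2[0] == 0:
            raise ValueError("zero in first column: special case not handled")
        nxt = [(row2[0] * row1[j + 1] - row1[0] * row2[j + 1]) / row2[0]
               for j in range(width - 1)] + [0.0]
        row1, row2 = row2, nxt
        first_col.append(row2[0])
    if 0 in first_col:
        raise ValueError("zero in first column: special case not handled")
    signs = [c > 0 for c in first_col]
    return sum(a != b for a, b in zip(signs, signs[1:]))


print(routh_rhp_roots([1, 6, 11, 6]))   # (s+1)(s+2)(s+3): stable
print(routh_rhp_roots([1, 2, 3, 10]))   # unstable cubic
```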
-
Multiple-Valued Logic Voltage-Mode Storage Circuits Based on True-Single-Phase
Clocked Logic [p. 83]
-
I. Thoidis, D. Soudris, I. Karafyllidis, A. Thanailakis, and T. Stouraitis
A number of novel voltage-mode multiple-valued logic circuits are introduced.
Adopting the main features of the true single-phase clocked logic, efficient
quaternary logic dynamic and pseudo-static latches, dynamic and static master-slave
storage units, and uni-signal controlled pass gates are proposed. These circuits
use two kinds of MOS transistors, i.e., enhancement and depletion mode, each of
which has two threshold voltages. The proposed circuits exhibit regular, modular,
and iterative structure, which means that the MVL circuits are VLSI implementable and
can be easily redesigned for any radix of an arithmetic system. Since only a single
clock signal is used, the derived circuits have low power dissipation. Comparisons with existing
circuits show substantial improvements in terms of speed, power consumption, and
transistor count.
-
CMOS Tapered Buffer Design for Small Width Clock/Data Signal Propagation [p. 89]
-
J. Navarro S. Jr. and Wilhelmus A. M. Van Noije
A new optimization criterion, the propagation of the minimum-width pulse through the
buffer, is studied for the design of tapered buffers driving capacitive loads. Unlike
the classic minimum-delay criterion, this one produces buffers that support maximum-speed
signal propagation. Simulation results for 0.8 μm and 0.35 μm CMOS processes
are analyzed. Semi-empirical relations are proposed to relate the minimum pulse width
to the inverter gain ratio, the number of inverters, and the capacitive load.
Additionally, a brief study of the delay skew of tapered buffers due to mismatch,
as a function of the gain ratio, shows that no severe degradation appears
with small gain ratios. Finally, this work points out that buffers with small gain
ratios should reach higher speeds, nearly 30% above those of buffers with a gain
ratio larger than 3.
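The link between gain ratio, number of inverters, and load can be sketched with the classic fixed-taper rule: each stage is `gain` times larger than the previous one, so N stages cover the load when C_in · gain^N ≈ C_L, i.e. N = ln(C_L / C_in) / ln(gain). This is the textbook sizing relation behind the minimum-delay criterion, not the paper's semi-empirical minimum-pulse-width relations; `taper` and the capacitance values are hypothetical.

```python
import math


def taper(c_in, c_load, gain):
    """Input capacitances of a fixed-ratio tapered buffer.

    N = ceil(ln(c_load / c_in) / ln(gain)) inverters, each `gain`
    times larger than the one before it.
    """
    n = math.ceil(math.log(c_load / c_in) / math.log(gain))
    return [c_in * gain ** i for i in range(n)]


# Example: 10 fF first stage driving a 10 pF load.
for gain in (2.0, 3.0, math.e, 4.0):
    stages = taper(10e-15, 10e-12, gain)
    print(f"gain {gain:.2f}: {len(stages)} inverters")
```

Smaller gain ratios need more inverters for the same load, which is the trade-off the abstract's small-gain-ratio buffers accept in exchange for supporting narrower pulses.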
-
Design of Clock Distribution Networks in Presence of Process Variations [p. 95]
-
M. Nekili, Y. Savaria, and G. Bois
Tolerance to process-induced skew remains one of the major concerns in the design of
large-area and high-speed clock distribution networks. Indeed, despite the availability
of some efficient exact-zero-skew algorithms that can be applied during circuit design,
clock skew remains an important performance-limiting factor after chip manufacturing,
and is of increasing concern for sub-micron technologies. This tutorial reviews the
importance of the problem and its sources, as well as typical examples of existing solutions.
Solutions range from design-rule strategies to built-in self-compensation methods.
-
Design of an 8:1 MUX at 1.7Gbit/s in 0.8µm CMOS Technology [p. 103]
-
J. Navarro, Jr. and W.A.M. Van Noije
The design of an 8:1 multiplexer circuit for SDH/SONET data transmission systems
is presented. To achieve maximum transmission rates, new circuits (high-speed
input/output converters between ECL and CMOS levels, and modified true single-phase clocked
(TSPC) cells) as well as new techniques for clock buffer optimization were applied. The
multiplexer, implemented in a 0.8 μm CMOS process (0.7 μm effective length), achieved a
1.7 Gbit/s rate and 42.6 μW/MHz power consumption at 5 V. These results were compared to
a previous implementation (in the same process) and to other recently published works,
showing superior performance.
-
Issues in the Design of Domino Logic Circuits [p. 108]
-
Pranjal Srivastava, Andrew Pua, and Larry Welch
Domino logic circuits have become extremely
popular in the design of today's high performance
processors because they offer fast switching speeds and
reduced areas. However, the use of domino logic
introduces many design risks because it is very sensitive to
noise, circuit and layout topologies. This paper identifies
issues that might cause domino logic circuits to fail, and
discusses some possible solutions to alleviate these
problems.
-
A Novel 1.5-V CMOS Mixer [p. 113]
-
G. Giustolisi, G. Palmisano, G. Palumbo, and C. Strano
A new and simple CMOS mixer powered from 1.5 V is presented. It works with a
200 MHz clock and has a -7-dB IP3. Moreover, it processes signals of
up to 150 mV at the 1-dB compression point. Its particular topology makes it
well suited to integration in fully digital ICs.
-
Analysis of Adaptive CMOS Down Conversion Mixers [p. 118]
-
C.K. Sandalci and S. Kiaei
Analysis of CMOS direct conversion architecture with adaptive DC offset compensation
is presented. Due to process mismatches and local oscillator (LO) crosstalk, DC offsets
up to 30 mV are observed at the mixer output. For a practical direct-conversion or
zero-IF down-conversion system, the incoming RF signal can be as low as
-100 dBm, or a few microvolts, at this stage, and any LO coupling will cause a DC offset
orders of magnitude larger than the received signal. The DC offset needs to be effectively
reduced to prevent the consecutive gain stages from entering saturation and destroying the
RF signal. To achieve this, an adaptive DC shifting circuit is presented. Adding a tunable
DC offset on the LO signal can effectively counteract the output DC offset by exploiting
the quadratic LO dependence of the process-mismatch-induced offsets. In addition,
DSP approaches for adaptively generating the control signals for the DC shifting circuitry
are investigated.
Keywords: Zero-IF, Direct Conversion, Mixer, DC offset
-
Artificial Neural Network Electronic Nose for Volatile Organic Compounds [p. 122]
-
Hoda S. Abdel-Aty-Zohdy
Advanced microsystems that integrate sensors, interface circuits, and pattern recognition,
either monolithically or in a hybrid module, are needed for civilian, military, and
space applications. These include automotive and medical applications, environmental
engineering, and manufacturing automation. ASICs with Artificial Neural Networks (ANN)
are considered in this paper, with the objective of recognizing airborne volatile organic
compounds, especially alcohols, ethers, esters, halocarbons, NH3, NO2, and other warfare-agent
simulants. The ASIC inputs are connected to the outputs of array-distributed sensors,
which measure three features for identifying each of four chemicals. A specialized
Reinforcement Neural Network (RNN) learning approach is chosen for the chemical classification
problem. A hardware implementation of the RNN is presented for a 2 μm CMOS process (MOSIS chip).
The design implementation and its evaluation are also presented.
-
A VLSI Self-Compacting Buffer for DAMQ Communication Switches [p. 128]
-
José G. Delgado-Frias and Richard Diaz
This paper describes a novel VLSI CMOS implementation of a self-compacting buffer
(SCB) for the dynamically allocated multi-queue (DAMQ) switch architecture. The SCB
is a scheme that dynamically allocates data regions within the input buffer for each
output channel. The proposed implementation provides a high-performance solution for
the buffered communication switches required in interconnection networks. This
performance comes not only from the DAMQ approach but also from the pipelined implementation
and novel circuitry. The major components of the SCB are described in detail in this
paper. Owing to its pipelined architecture, the system can perform a read, a write, or a
simultaneous read/write operation per cycle.
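A behavioral sketch of the self-compacting idea, assuming one shared array whose per-channel regions are kept contiguous and in channel order, with entries shifting up to close the gap left by a read. The class `SelfCompactingBuffer` and its methods are hypothetical; they model functionality only, not the pipelined circuit or its simultaneous read/write capability.

```python
class SelfCompactingBuffer:
    """Behavioral model of a DAMQ input buffer with self-compaction.

    All output channels share `capacity` slots. Packets for channel k
    occupy one contiguous region; regions are kept in channel order, and
    reading a packet shifts the entries behind it up so no holes remain.
    """

    def __init__(self, capacity, channels):
        self.capacity = capacity
        self.slots = []                    # list of (channel, packet)
        self.count = [0] * channels        # occupancy per output channel

    def write(self, channel, packet):
        if len(self.slots) >= self.capacity:
            return False                   # buffer full: sender must wait
        end = sum(self.count[: channel + 1])  # tail of this channel's region
        self.slots.insert(end, (channel, packet))
        self.count[channel] += 1
        return True

    def read(self, channel):
        if self.count[channel] == 0:
            return None
        start = sum(self.count[:channel])  # head of this channel's FIFO
        _, packet = self.slots.pop(start)  # later entries move up: compaction
        self.count[channel] -= 1
        return packet


buf = SelfCompactingBuffer(capacity=8, channels=4)
buf.write(2, "a")
buf.write(0, "b")
buf.write(2, "c")
print(buf.read(2), buf.read(2), buf.read(0))  # a c b
```

Space is allocated to whichever channels currently hold traffic, which is the dynamic allocation the DAMQ scheme relies on; each channel still reads its packets in FIFO order.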
-
A Dictionary Machine Emulation on a VLSI Computing Tree System [p. 134]
-
A.E. Harvin III and J.G. Delgado-Frias
In this paper, we propose a dictionary machine emulation using a novel VLSI tree structure
that operates on the dictionary using a blocking technique. We show that dictionary machine
operations can be performed through the implementation of a number of processing and
communication tasks overlapped on a simple structure. By manipulating the key-records
bit serially, and storing them in an external memory rather than within the layers of
the structure, we show that the size of the dictionary is limited only by the capacity
of the external memory. This structure, which consists of multiple units, can be
implemented in VLSI on a single chip. The key advantage of our structure is that
it provides a means of implementing a high-speed, low-cost dictionary machine
with virtually unlimited capacity, thus eliminating the need for multiple chips should
the dictionary expand. We have shown that an exhaustive search on a 2048 key-record dictionary
can be performed in 29.78 μs.
-
Modeling and Analysis of the Difference-Bit Cache [p. 140]
-
Ashutosh Kulkarni, Navin Chander, Soumya Pillai and Lizy John
Advances in VLSI technology and processor architectures have resulted in a tremendous
increase in processor speeds and memory capacities. However, memory latencies have failed
to improve as rapidly, making memory systems the performance bottlenecks in most high
performance processor architectures. Caching is a time-tested mechanism to solve this
speed disparity. Among the different cache mapping strategies, direct mapping is the
only configuration where the critical path is merely the time required to access a RAM.
Although direct mapped caches are preferable considering hit-access times, they have
poor hit ratios compared to associative caches. The difference-bit cache proposed by
Juan, Lang, and Navarro [1] is functionally equivalent to a two-way set-associative
cache but tries to achieve an access time smaller than that of a conventional two-way
set-associative cache and close to that of a direct-mapped cache. We modeled and analyzed
the difference-bit cache to prove the hypothesis of its small access time. We have also
tried to prove that the access time advantage of the difference-bit cache improves over
the conventional two-way set-associative cache with an increase in the cache size.
Finally, we have tried to analyze the trade-offs involved in applying these techniques
to a higher associativity cache.
Keywords: Cache memory, critical path, hit access time, cache mapping strategies
-
Modeling of Shift Register-Based ATM Switch [p. 146]
-
Sandeep Agarwal and Fayez El-Guibaly
In this paper, we present a model of a shift-register-based ATM switch to determine
the cell loss probability, throughput, and delay. The results are compared with other
switch architectures based on input queueing, input smoothing, output queueing, and
completely shared buffering. It is observed that although our switch is an input-buffered
switch, its performance is better than that of other switches based on traditional queueing approaches.
-
An Architecture of Full-Search Block Matching for Minimum Memory
Bandwidth Requirement [p. 152]
-
Jen-Chien Tuan, Chein-Wei Jen
In this paper an architecture of full-search block matching motion estimation
suitable for high quality video is proposed. Minimum memory bandwidth is an important
requirement in motion estimation architecture especially when dealing with high quality
video such as large-frame-size video. Without careful consideration, memory bandwidth rises to
unrealistically high values that no cost-efficient solution can afford. This architecture is
designed to overcome the frame memory bandwidth bottleneck by exploiting the maximum data reuse
property. This is done by setting up local memory for storing frame data. The size of the local
memory is also optimized to a near-minimum value, so only a little overhead is introduced. Due to
the reduction of memory bandwidth, the cost of frame memory modules, the I/O pin count, and
the power consumption can be reduced while 100% hardware efficiency is still achieved.
Simple and regular interconnections, together with an efficient distributed local memory
organization, ensure high-speed operation.
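For reference, the computation that full-search block matching performs is sketched below: for every displacement in a ±p search range, compute the sum of absolute differences (SAD) between the current N×N block and the displaced reference block, and keep the minimum. The helpers `sad` and `full_search` and the ramp test frames are hypothetical, and the local-memory data reuse of the proposed architecture is not modeled.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a current block and a displaced reference block."""
    return sum(abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
               for i in range(n) for j in range(n))


def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in [-p, p] x [-p, p]; return (best_sad, dx, dy).

    Candidates that fall outside the reference frame are skipped.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if not (0 <= by + dy and by + dy + n <= h
                    and 0 <= bx + dx and bx + dx + n <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# The current frame is the reference shifted by (dx, dy) = (2, 1).
ref = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [[16 * (y + 1) + (x + 2) for x in range(16)] for y in range(16)]
print(full_search(cur, ref, bx=4, by=4, n=4, p=3))  # (0, 2, 1)
```

Each of the (2p + 1)^2 candidates reads an N×N reference block, and neighboring candidates overlap heavily; keeping the search window in local memory, as the abstract describes, avoids refetching those pixels from frame memory.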
-
MPEG-2 Video Decoder for DVD [p. 157]
-
Nien-Tsu Wang, Chen-Wei Shih, Duan Juat Wong-Ho, Nam Ling
A video decoder with an efficient controller scheme and a sub-picture decoder for DVD
applications is presented in this paper. Most of the reported architectures for MPEG-2
video decoding use a 64-bit bus and a complex bus arbitration scheme. Our design
uses synchronous DRAMs instead of standard EDO DRAMs and involves a novel controller
scheme that allocates the bus to DRAM accesses efficiently. This efficient allocation
allows us to reduce the bus width from 64 bits to 32 bits without significantly increasing
embedded buffer sizes, while still meeting the requirements for MPEG-2 MP@ML decoding. The
bus arbitration algorithm is also simple, allowing a less complex controller design.
Our main strategy is to impose a certain order on the DRAM accesses of the various processes
instead of allowing any process to request bus access arbitrarily. We also take
advantage of the restricted GOP (group of pictures) sequence in the DVD format to allow
a longer decoding time for B frames. The sub-picture pixel data are run-length-compressed
bitmaps that are overlaid on top of the reconstructed MPEG video. The architecture for
sub-picture decoding is simple and easy to implement.
-
A Self Timed Asynchronous Router for an Heterogeneous Parallel Machine [p. 161]
-
Eric SENN, Bertrand ZAVIDOVIQUE
This paper describes the implementation of a self-timed asynchronous router in a
parallel machine.
The heterogeneous architecture of the machine is outlined; then the need for asynchronous
operation and the benefits of asynchronous network control are explained. The
specification and VLSI design of the router are presented together with its measured performance.
-
Non-Refreshing Analog Neural Storage Tailored for On-Chip Learning [p. 168]
-
Bassem A. Alhalabi, Qutaibah Malluhi, and Rafic Ayoubi
In this research, we devised a new simple technique for statically holding analog
weights, which does not require periodic refreshing. It further contains a mechanism
to locally update the weights from the analog back-propagation signals for fast on-chip
learning. In this circuit, the weight is stored as a 5-bit digital number, which controls
the gates of five pass transistors, allowing five binary-weighted (1, 2, 4, 8, 16) voltage
references to be summed by a voltage adder. The output of the voltage adder is the analog
weight. The 5-bit register is designed as an up/down counter so that every pulse on the
up/down input will increase/decrease the weight by one level out of 32 possible levels.
The learning circuit takes the analog graded error signal and generates two pulse streams
for up/down counting depending on the sign of the error signal. The duration of the
pulse stream is proportional to the magnitude of the error signal. This complete modular
synaptic body (storage and learning technique) is appropriate for large, scalable analog
VLSI neural networks because it handles recall and learning operations at the same speed
with full parallelism.
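A behavioral sketch of the storage and update scheme described above: a 5-bit up/down counter gates references of 1, 2, 4, 8, and 16 units into an adder, and learning issues a number of up or down pulses that grows with the error magnitude. The class name, the volts-per-level `v_unit`, the `pulses_per_unit` gain, the direction convention (positive error counts up), and saturation at 0 and 31 are all assumptions not stated in the abstract.

```python
class DigitalAnalogWeight:
    """Behavioral model of a 5-bit up/down counter driving binary-weighted references."""

    BITS = 5
    MAX_CODE = (1 << BITS) - 1   # 31, i.e. 32 levels

    def __init__(self, v_unit=0.01, code=16):
        self.v_unit = v_unit     # hypothetical volts per level
        self.code = code

    def analog_weight(self):
        # Bit k enables the pass transistor of a 2**k-unit reference.
        return sum((1 << k) * self.v_unit
                   for k in range(self.BITS) if (self.code >> k) & 1)

    def learn(self, error, pulses_per_unit=10):
        """Pulse count follows |error|; direction follows its sign (assumed: + counts up)."""
        pulses = int(abs(error) * pulses_per_unit)   # hypothetical gain
        step = 1 if error > 0 else -1
        for _ in range(pulses):
            # Saturate at the ends of the 5-bit range (assumption).
            self.code = min(self.MAX_CODE, max(0, self.code + step))


w = DigitalAnalogWeight(v_unit=1.0, code=16)
w.learn(0.5)                      # five up pulses
print(w.code, w.analog_weight())  # 21 21.0
```

Because the weight is held as a digital code, it does not decay and needs no refresh, which is the property the abstract emphasizes.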
-
Residue to Binary Number Converters for (2^n - 1, 2^n, 2^n + 1) [p. 174]
-
Yuke Wang, Xiaoyu Song, Mostapha Aboulhamid
This paper proposes three new residue-to-binary converters using 2n-bit or n-bit
adders for the three-moduli residue number system of the form (2^n - 1, 2^n, 2^n + 1). The
2n-bit adder based converter is faster and requires about half of the hardware required
by previous methods. For n-bit adder based implementations, one new converter is twice
as fast as the previous method using similar amount of hardware; while another new
converter achieves improvement in both speed and area.
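As a reference for what these converters compute, the sketch below performs residue-to-binary conversion for the moduli set (2^n - 1, 2^n, 2^n + 1) with the Chinese Remainder Theorem in plain integer arithmetic. The helper names are hypothetical, and the adder-based structure of the proposed converters is not reproduced. The three moduli are pairwise coprime (2^n - 1 and 2^n + 1 are odd and differ by 2), so the dynamic range is 2^n (2^(2n) - 1).

```python
def moduli(n):
    return (2 ** n - 1, 2 ** n, 2 ** n + 1)


def to_residues(x, n):
    return tuple(x % m for m in moduli(n))


def from_residues(residues, n):
    """CRT reconstruction of x from its residues modulo (2^n - 1, 2^n, 2^n + 1)."""
    ms = moduli(n)
    big_m = ms[0] * ms[1] * ms[2]          # dynamic range 2^n * (2^(2n) - 1)
    x = 0
    for r, m in zip(residues, ms):
        mi = big_m // m
        x += r * mi * pow(mi, -1, m)       # modular inverse (Python 3.8+)
    return x % big_m


n = 4
print(moduli(n), to_residues(1000, n), from_residues(to_residues(1000, n), n))
```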
-
The Design of Residue Number System Arithmetic Units for a VLSI
Adaptive Equalizer [p. 179]
-
Inseop Lee and W. Kenneth Jenkins
This paper presents the design details of an experimental ASIC for an all-digital
adaptive equalizer. In this design, the LMS algorithm is chosen because of its simplicity.
The adaptive equalizer design, which is based on an RNS architecture, consists of an RNS
multiplier, an RNS adder, an RNS filter, a binary-to-residue converter, a residue-to-binary
converter, and an update algorithm. The design is verified by a high-level hardware
simulation tool. The designs of all these units are discussed in this paper.
-
An Efficient Residue to Weighted Converter for a New Residue
Number System [p. 185]
-
Alexander Skavantzos
The Residue Number System (RNS) is an integer
system appropriate for implementing fast digital signal processors since it can
support parallel, carry-free, high-speed arithmetic. In this paper, a new RNS system
and an efficient implementation of its residue-to-weighted converter are presented.
The new RNS is a balanced 5-moduli system appropriate for large dynamic ranges. The
new residue-to-binary converter is very fast and hardware-efficient, and is based
on a 1's complement multioperand adder whose operands are only 80% of the size
of the system's dynamic range.
-
The Chinese Abacus Method: Can We Use It for Digital Arithmetic? [p. 192]
-
Franco Maloberti, Chen Gang
This paper discusses how to apply the approach used in the Chinese Abacus to
implement digital arithmetic. Firstly, we examine the representations and the basic
techniques used in the Chinese Abacus; then, we propose a MOS realization of the basic
functions required; finally, we discuss a novel 12-bit full adder based on the Chinese
Abacus method. Simulations of 0.5 μm CMOS realizations showed that a parallel solution
can run at 200 MHz, while a pipelined realization can achieve a 1 GHz clock frequency.
The complexity of the circuit is quite limited; thus, the Chinese Abacus
approach is a competitive technique with respect to conventional methodologies.
-
Merged Arithmetic for Computing Wavelet Transforms [p. 196]
-
Gwangwoo Choe and Earl E. Swartzlander, Jr.
A variation of merged arithmetic is applied to the implementation of the wavelet
transform. This approach offers a simple design trade-off between computational
accuracy and complexity. Our analysis shows that the trade-off is a function of the
input data resolution, the number of filter taps, the arithmetic precision, and the
level of the wavelet transform. The design parameter can also be fixed for a given
number of taps and used to determine the minimum word size for the wavelet coefficients
of the transform. The key element of this approach is to introduce a "truncation" within
the merged arithmetic reduction process, which provides equivalent throughput with
substantially less complexity. An experiment has been conducted to verify the analysis;
it suggests that 24-bit merged arithmetic is required for the EZW algorithm to handle
up to a level-6 wavelet transform.
-
Digital Arithmetic Using Analog Arrays [p. 202]
-
S. Sadeghi-Emamchaie, G.A. Jullien, V. Dimitrov, and W. C. Miller
This paper describes techniques for using locally connected analog Cellular
Neural Networks (CNNs) to implement digital arithmetic arrays; the arithmetic is implemented
using a recently disclosed Double-Base Number System (DBNS).
The CNN arrays are targeted at low-power,
low-noise DSP applications where a lower slew rate
during transitions is a potential advantage. Specifically,
we demonstrate that a CNN array, using a simple nonlinear
feedback template, with hysteresis, can perform arbitrary
length arithmetic with good performance in terms of
stability and robustness. The principles presented in this
paper can also be used to implement arithmetic in other
number systems such as the binary number system.
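To illustrate the number system itself, the sketch below writes an integer as a sum of terms 2^a · 3^b using a greedy decomposition: repeatedly subtract the largest such term not exceeding the remainder. This gives a short, though not necessarily minimal, double-base representation. The helper names are hypothetical, and the CNN array that performs the arithmetic is not modeled.

```python
def largest_2_3_term(x):
    """Largest 2**a * 3**b not exceeding x (x >= 1), returned as (value, a, b)."""
    best = (1, 0, 0)
    p3, b = 1, 0
    while p3 <= x:
        p, a = p3, 0
        while p * 2 <= x:              # highest power of 2 that still fits
            p, a = p * 2, a + 1
        if p > best[0]:
            best = (p, a, b)
        p3, b = p3 * 3, b + 1
    return best


def dbns_greedy(x):
    """Greedy double-base decomposition: list of (a, b) with x = sum 2**a * 3**b."""
    terms = []
    while x > 0:
        value, a, b = largest_2_3_term(x)
        terms.append((a, b))
        x -= value
    return terms


print(dbns_greedy(127))  # 127 = 2^2*3^3 + 2^1*3^2 + 2^0*3^0 = 108 + 18 + 1
```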
-
A Combined Interval and Floating Point Multiplier [p. 208]
-
James E. Stine and Michael J. Schulte
Interval arithmetic provides an efficient method for monitoring and controlling errors
in numerical calculations. However, existing software packages for interval arithmetic
are often too slow for numerically intensive computations. This paper presents the design
of a multiplier that performs either interval or floating point multiplication. This
multiplier requires only slightly more area and delay than a conventional floating point
multiplier, and is one to two orders of magnitude faster than software implementations of
interval multiplication.
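Interval multiplication takes the minimum and maximum of the four endpoint products. The sketch below is a software model of that rule, not the hardware multiplier. It approximates outward rounding by widening each bound by one unit in the last place with `math.nextafter` (Python 3.9+), an assumption standing in for true directed rounding.

```python
import math


def interval_mul(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Multiply intervals x = [a, b] and y = [c, d], assuming a <= b and c <= d."""
    a, b = x
    c, d = y
    products = (a * c, a * d, b * c, b * d)
    lo, hi = min(products), max(products)
    # Widen outward by one ulp so the exact product interval stays enclosed
    # despite round-to-nearest (a conservative stand-in for directed rounding).
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)


lo, hi = interval_mul((-2.0, 3.0), (4.0, 5.0))
# endpoint products are -8, -10, 12, 15, so the result is slightly wider than [-10, 15]
```

Computing four products plus a min/max per operation is part of why software interval arithmetic is slow, which is the overhead a combined hardware multiplier targets.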
-
Test Compaction for Synchronous Sequential Circuits by Test
Sequence Recycling [p. 216]
-
Irith Pomeranz and Sudhakar M. Reddy
We introduce a new concept for test sequence compaction referred to as recycling.
Recycling is based on the observation that easy-to-detect faults tend to be detected
several times by a deterministic test sequence, whereas hard-to-detect faults are
detected once towards the end of the test sequence. Thus, the suffix of a test
sequence detects a large number of faults, including hard-to-detect faults. The
recycling operation keeps a suffix S1 of a test sequence T1 and discards the rest
of the sequence. The suffix S1 is then used as a prefix of a new test sequence T2.
In this process, S1 is expected to detect the more difficult-to-detect faults as
well as many of the easy-to-detect faults, resulting in a new sequence T2 that is
shorter than T1. Recycling is enhanced by a scheme where several faults are targeted
simultaneously to generate the shortest possible test sequence that detects all of them.
-
Random Self-Test Method Applications on PowerPC™ Microprocessor Caches [p. 222]
-
Rajesh Raina, Robert Molyneaux
This paper describes a novel method for generating test stimuli for digital systems.
By taking advantage of certain properties of the Design Under Validation, the method
can be used to generate test stimuli that are random as well as self-testing. We
discuss the requirements and limitations of this method on practical designs. The
use of this method for High-Level Design Validation of caches in PowerPC™
microprocessors is also described. The paper concludes by identifying areas where
further work is needed.
Topic Areas: High-Level Design Validation, Silicon
Validation, Pseudo-Random Testing, Microprocessor
Testing.
-
A Unified Approach for a Time-Domain Built-In Self-Test Technique
and Fault Detection [p. 230]
-
B. Provost, A.M. Brosa, and E. Sánchez-Sinencio
Being able to fully test a circuit is an important issue for quality manufacturing.
Unlike fault analysis for digital circuits, analog fault analysis has been comparatively
slow to evolve. The purpose of this paper is to study the feasibility of time-domain
response analysis as a test method for analog circuits. The approach was first to study
the fault coverage obtained by testing the main parameters of the new NGCC amplifier,
which shows the feasibility of built-in self-test in the time domain. A circuit macromodel
to implement a time-domain built-in self-test circuit was then proposed.
-
VHDL Testability Analysis Based on Fault Clustering and Implicit
Fault Injection [p. 237]
-
F.S. Bietti, F. Ferrandi, F. Fummi, and D. Sciuto
Testability analysis of VHDL sequential models is the main topic of this paper. We
investigate the possibility of obtaining information about the testability of a sequential
VHDL description before its actual synthesis. The analysis is based on an implicit
fault model that injects faults into a BDD-based description extracted from the VHDL
representation. Such an injection is related to the original VHDL representation, thus
allowing the identification of potential testability problems before RTL and logic
synthesis. Fault injection is performed efficiently by exploiting the concept of fault
clustering, that is, the possibility of grouping faults and analyzing them concurrently.
The proposed methodology is applied to benchmarks for efficiency evaluation and to a real
VHDL description.
-
IDD Waveforms Analysis for Testing of Domino and Low
Voltage Static CMOS Circuits [p. 243]
-
Hendrawan Soeleman, Dinesh Somasekhar, and Kaushik Roy
This paper describes a test method which relies on the actual observation of supply
current (IDD) waveforms. The method can be used to supplement the standard IDDQ test
method, and it can easily be applied to dynamic and low-VDD, low-VT CMOS circuits. The
method allows us to detect faults which may not be detected by IDDQ test methods, and
is sensitive enough to detect potential faults that do not manifest themselves as
functional errors. A simple built-in current sensor, which proves adequate for
verifying the feasibility of IDD waveform analysis, is proposed to safely observe
the current waveforms without significantly changing them.
-
A Design-for-Testability Technique for Detecting Delay Faults in
Logic Circuits [p. 249]
-
K. Raahemifar and M. Ahmadi
This paper provides a simulation-based study of delay fault testing in logic
circuits. It is shown that delay testing is necessary in order to achieve a high
defect coverage. By detecting delayed time response in a transistor circuit, three
types of faults are detected:
1) faults which cause delayed transitions at the output node due to some open
defects, 2) faults which cause an intermediate voltage level at the output node,
and 3) most stuck-at faults which halt the circuit at '1' or '0'. An on-line checker
is presented which enables the concurrent detection of delay faults. Since one
checker is used for each output signal, the area overhead is minimal. This technique
does not degrade the speed of the circuit under test (CUT). We show that the test
circuit is independent of the size of the CUT. Simulation results show that this
technique can be adjusted to fit any design style.
-
Development of a CMOS Cell Library for RF Wireless and
Telecommunications Applications [p. 258]
-
Robert H. Caverly
There is increasing interest in the use of CMOS circuits for highly integrated high
frequency wireless telecommunications systems. This paper presents the results of
on-going work into the development of a cell library that includes many of the
circuit elements required for the high frequency sub-system of a communications
integrated circuit. The cells were fabricated using standard MOSIS processes and
measurement results are presented. The full design files, testing results and
circuit tutorials describing the cells and how they interface with baseband circuits
are available from the author.
-
Design Issues of LC Tuned Oscillators for Integrated Transceivers [p. 264]
-
C. Samori, A.L. Lacaita, A. Zanchi, and P. Vita
VCOs for wireless receivers must meet tight phase-noise requirements, and their
complete integration in silicon VLSI technologies is still an open issue due to the
low quality factor of the inductors. In this paper we address some of the constraints
met in the design of low-noise oscillator stages: the tank topology and its quality
factor, the dynamics of the transconductor stage and its loading effects, and the
limitation resulting from AM-to-PM conversion.
-
Novel Simple Models of CML Propagation Delay [p. 270]
-
M. Alioto and G. Palumbo
Accurate and simple models of CML propagation delay are given, using a new approach.
The propagation delay is represented with a few terms, providing better insight
into the relationship between delay and its electrical parameters, which in turn are
related to process parameters. The most accurate model has typical and worst-case
errors as low as 2% and 5%, respectively.
-
Next Generation Narrowband RF Front-Ends in Silicon IC Technology [p. 275]
-
John R. Long
It is anticipated that the next generation of wireless systems will deliver voice
and data services at carrier frequencies extending up to 6 GHz. The front-end
circuits for these radios must be aggressively designed in order to deal with
issues such as analog and digital compatibility, higher linearity imposed by
broadband signal processing at IF, low supply voltage to minimize size, weight,
and power consumption, and operation in multiple frequency bands. The
challenges and opportunities facing the designer of these radio frequency (RF)
front-end ICs in silicon are addressed in this paper from both the technological
and circuit perspectives.
-
Low Voltage Low Power CMOS AGC Circuit for Wireless Communication [p. 281]
-
Hassan O. Elwan, Mohammed Ismail
This paper describes a new technique for realizing a CMOS digitally controlled,
dB-linear variable gain amplifier (VGA) circuit. The circuit is developed taking
into account system level issues for a direct conversion receiver. Besides being
effective and simple to use from a system point of view, the developed VGA offers
precise gain control, high linearity and low power consumption. The circuit can
operate in a current domain or a voltage domain mode with single ended or fully
differential signal handling capability. The proposed VGA circuit is implemented
using a novel class AB operational transconductance amplifier and current division
networks. Simulation results are included.
-
A Continuous-Time Switched-Current ΣΔ Modulator with Reduced Loop Delay [p. 286]
-
Louis Luh, John Choma, Jr., Jeffrey Draper
A novel architecture for a second-order continuous-time switched-current ΣΔ modulator is presented. The loop delay is reduced by predicting the states of
the second integrator and feeding the predicted states to the comparator. The
predicted states are generated by summing three scaled current mode signals. A
Gain-Manager is used to accurately control the integrator gain to generate the
predicted states and stabilize the system. A newly designed high-speed current-
mode comparator is capable of summing the three scaled current inputs and comparing
them. With a 50 MHz sampling rate, it has achieved a 60 dB dynamic range (10-bit) at
1 MHz. The modulator has been fabricated in a 2 µm CMOS process with an active area of
0.37 mm². The power dissipation is 16.6 mW from a single 5 V power supply.
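As background for the loop described above, the sketch below simulates a generic, idealized discrete-time second-order ΣΔ modulator with a 1-bit quantizer (two cascaded integrators with feedback). It is a behavioral model only and an assumption of this illustration: it does not model the continuous-time switched-current circuit, the loop delay, or the paper's state-prediction scheme, and the function name is hypothetical.

```python
def sigma_delta_2nd_order(xs: list[float]) -> list[int]:
    """Idealized discrete-time second-order sigma-delta modulator.

    Realizes Y(z) = z^-1 X(z) + (1 - z^-1)^2 E(z), where E is the 1-bit
    quantization error. Inputs should stay well inside (-1, 1).
    """
    v1 = v2 = 0.0
    out = []
    for x in xs:
        y = 1 if v2 >= 0 else -1    # 1-bit quantizer (the comparator)
        v1 += x - y                 # first integrator
        v2 += v1 - y                # second integrator, fed by the updated v1
        out.append(y)
    return out


bits = sigma_delta_2nd_order([0.3] * 20000)
mean = sum(bits) / len(bits)        # the bitstream average tracks the DC input
```

For reference, the ideal N-bit quantizer relation SNR ≈ 6.02N + 1.76 dB gives about 62 dB for N = 10, which is why a 60 dB dynamic range is quoted as roughly 10-bit.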
-
An Exact Input Encoding Algorithm for BDDs Representing FSMs [p. 294]
-
Wilsin Gosti, Tiziano Villa, Alexander Saldanha, Alberto L. Sangiovanni-Vincentelli
We address the problem of encoding the state variables of a finite state machine such
that the BDD representing its characteristic function has the minimum number of nodes.
We present an exact formulation of the problem. Our formulation characterizes the two
BDD reduction rules by deriving conditions under which these reduction rules can be
applied. We then provide an algorithm that finds these conditions and solves the
problem by formulating it as a 2-CNF formula and extracting all its prime implicants.
In addition, we implemented a simulated annealing algorithm for this problem
and provide a thorough experimental study of the impact of encoding on a BDD representing
an FSM with different orderings.
-
Maximum Current Estimation in Programmable Logic Arrays [p. 301]
-
S. Bobba and I.N. Hajj
A programmable logic array (PLA) is a circuit realization of the two-level
sum-of-products representation of a multi-output Boolean function. The current drawn by a
PLA is input-dependent, which makes the problem of estimating the maximum current
intractable. Integrated circuit reliability and signal integrity are related to the
maximum current drawn by the circuit. Hence, an estimate of the maximum current is
required for the design of a reliable VLSI circuit. In this paper, we present an
input pattern-independent algorithm to obtain the estimate of maximum and minimum
currents drawn by a PLA over all possible input vectors. Experimental results on
several benchmark circuits and comparisons with exhaustive simulations are also
included in this paper.
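The exhaustive simulation that serves as the reference in such comparisons can be sketched for a tiny PLA. The current model here is a hypothetical unit-current abstraction assumed purely for illustration: each product term that evaluates true and each output that evaluates true is charged one unit. The cube notation ('1', '0', '-') and the example PLA are also assumptions.

```python
from itertools import product

# Hypothetical PLA: each product term is a cube over the inputs
# ('1' = true literal, '0' = complemented literal, '-' = don't care),
# and each output ORs the product terms whose indices it lists.
TERMS = ["1-0", "01-", "-11"]
OUTPUTS = [{0, 1}, {1, 2}]


def term_true(cube: str, vec: tuple[int, ...]) -> bool:
    return all(c == "-" or int(c) == v for c, v in zip(cube, vec))


def unit_current(vec: tuple[int, ...]) -> int:
    """Hypothetical model: one unit per true product term and per true output."""
    active = {i for i, cube in enumerate(TERMS) if term_true(cube, vec)}
    true_outputs = sum(1 for out in OUTPUTS if out & active)
    return len(active) + true_outputs


def exhaustive_min_max(n_inputs: int) -> tuple[int, int]:
    """Evaluate all 2**n input vectors and return (min, max) current."""
    currents = [unit_current(v) for v in product((0, 1), repeat=n_inputs)]
    return min(currents), max(currents)


print(exhaustive_min_max(3))  # (0, 4)
```

The 2^n cost of this enumeration is exactly what makes an input-pattern-independent estimate attractive for PLAs with many inputs.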
-
Mutually Disjoint Signals and Probability Calculation in Digital Circuits [p. 307]
-
Vishwani D. Agrawal, Sharad Seth
Signal probability calculation in circuits where signals are not independent is
generally expensive. We show that some correlated signals may be mutually disjoint.
In such cases, the probability calculation can be as simple as it is for independent
signals. For example, two signals that cannot be simultaneously true are defined as
OR-disjoint. If these signals feed an OR gate, the probability of the output being
true is simply the sum of the probabilities of inputs being true. We give an
implication-based algorithm for identifying disjoint signals. Examples of large
adders illustrate how the identification of disjoint signals simplifies the
probability calculation.
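The OR-disjoint rule can be checked on a tiny example. Below, a = x AND y and b = x AND NOT y can never both be true, so P(a OR b) = P(a) + P(b). The sketch compares that sum with brute-force enumeration over independent primary inputs; the signal names and input probabilities are assumptions for illustration.

```python
import math
from itertools import product

P_INPUT = {"x": 0.7, "y": 0.4}   # assumed independent input probabilities


def a(env):
    return env["x"] and env["y"]


def b(env):
    return env["x"] and not env["y"]


def prob(signal) -> float:
    """Exact probability that signal is true, by enumerating all inputs."""
    names = list(P_INPUT)
    total = 0.0
    for values in product((False, True), repeat=len(names)):
        env = dict(zip(names, values))
        weight = 1.0
        for name, v in env.items():
            weight *= P_INPUT[name] if v else 1 - P_INPUT[name]
        if signal(env):
            total += weight
    return total


p_or = prob(lambda env: a(env) or b(env))
p_sum = prob(a) + prob(b)        # 0.28 + 0.42, equal to P(x) = 0.7
```

Treating a and b as independent would instead give 1 - (1 - 0.28)(1 - 0.42) ≈ 0.58, which shows why identifying disjointness matters.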
-
Identifying High-Level Components in Combinational Circuits [p. 313]
-
Travis Doom, Jennifer White, Anthony Wojcik, Greg Chisholm
The problem of finding meaningful subcircuits in a logic layout appears in many contexts
in computer-aided design. Existing techniques rely upon finding exact matchings of
subcircuit structure within the layout. These syntactic techniques fail to identify
functionally equivalent subcircuits which are differently implemented, optimized, or
otherwise obfuscated. We present a mechanism for identifying functionally equivalent
subcircuits which is capable of overcoming many of these limitations. Such semantic
matching is particularly useful in the field of design recovery.
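The difference between syntactic and semantic matching can be illustrated with two subcircuits of different gate structure that compute the same function, with their inputs wired in a different order. The sketch compares truth tables under every input permutation. It is an assumed brute-force illustration for very small subcircuits, not the paper's matching mechanism, and all names are hypothetical.

```python
from itertools import permutations, product
from typing import Callable, Optional

BoolFn = Callable[..., bool]


def truth_table(f: BoolFn, n: int) -> tuple[bool, ...]:
    """Outputs of f for all 2**n input combinations, in a fixed order."""
    return tuple(bool(f(*v)) for v in product((False, True), repeat=n))


def match_up_to_permutation(f: BoolFn, g: BoolFn, n: int) -> Optional[tuple[int, ...]]:
    """Find p such that f(x) == g(x[p[0]], ..., x[p[n-1]]) for all x, else None."""
    target = truth_table(f, n)
    for p in permutations(range(n)):
        permuted = truth_table(lambda *x: g(*(x[i] for i in p)), n)
        if permuted == target:
            return p
    return None


def nand(a: bool, b: bool) -> bool:
    return not (a and b)


def reference(x, y, z):
    """Reference subcircuit: x AND (y OR z)."""
    return x and (y or z)


def candidate(z, x, y):
    """NAND-NAND implementation of the same function, inputs in another order."""
    return nand(nand(x, y), nand(x, z))


print(match_up_to_permutation(reference, candidate, 3))  # (1, 0, 2)
```

A purely structural matcher would not pair these two subcircuits, since one is an AND-OR and the other a NAND-NAND network.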
-
Local Optimality Theory in VLSI Channel Routing: Composite Cyclic
Vertical Constraints [p. 319]
-
Anthony D. Johnson
The Local Optimality paradigm is applicable to all combinatorial optimization problems.
Its direct field of application is constructive solution algorithms; its main
advantage is the low computational cost of generating multiple high-quality initial
solutions for iterative improvement algorithms. The application of the paradigm to VLSI
channel routing has necessitated the creation of new knowledge, represented by the
theory of locally optimal breaking (LOB) of directed circuits (DCs) in the vertical
constraint graph. Existing theory has supported deterministic polynomial-time algorithms
for LOB of two classes of directed circuits: vertex-disjoint DCs and couples of
connected DCs. The new LOB theory supports algorithms for more complex classes: any
number of DCs sharing a single vertex, and uniform lattices of DCs. Significantly, the
new theory relies on the theory for couples of connected DCs for breaking more complex
structures of connected DCs.
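For readers unfamiliar with the vertical constraint graph (VCG): if net a has a pin on the top edge and net b a pin on the bottom edge of the same channel column, a must be placed on a track above b, giving an edge a → b. A directed circuit in this graph cannot be routed without being broken (for instance with a dogleg). The sketch builds the VCG from assumed top/bottom pin lists (0 meaning no pin) and reports one directed circuit using depth-first search. It only detects circuits and does not implement locally optimal breaking; the function names are hypothetical.

```python
from typing import Optional


def build_vcg(top: list[int], bottom: list[int]) -> dict[int, set[int]]:
    """Edge a -> b when net a (top pin) and net b (bottom pin) share a column."""
    graph: dict[int, set[int]] = {}
    for a, b in zip(top, bottom):
        for net in (a, b):
            if net:
                graph.setdefault(net, set())
        if a and b and a != b:
            graph[a].add(b)
    return graph


def find_directed_circuit(graph: dict[int, set[int]]) -> Optional[list[int]]:
    """Return one directed circuit as a list of nets, or None if the VCG is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    stack: list[int] = []

    def dfs(v: int) -> Optional[list[int]]:
        color[v] = GRAY
        stack.append(v)
        for w in sorted(graph[v]):
            if color[w] == GRAY:                  # back edge closes a circuit
                return stack[stack.index(w):]
            if color[w] == WHITE:
                found = dfs(w)
                if found is not None:
                    return found
        stack.pop()
        color[v] = BLACK
        return None

    for v in sorted(graph):
        if color[v] == WHITE:
            found = dfs(v)
            if found is not None:
                return found
    return None


print(find_directed_circuit(build_vcg([1, 2, 3], [2, 3, 1])))  # [1, 2, 3]
```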
-
Linear Transformations and Exact Minimization of BDDs [p. 325]
-
Wolfgang Günther, Rolf Drechsler
We present an exact algorithm to find an optimal linear transformation for the variables
of a Boolean function to minimize its corresponding ordered Binary Decision Diagram (BDD).
To prune the huge search space, techniques known from algorithms for finding the optimal
variable ordering are used. This BDD minimization finds direct application in FPGA design.
We give experimental results for a large variety of circuits to show the efficiency of our
approach.
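A two-variable example shows why linear transformations can help. The sketch builds a reduced ordered BDD by Shannon expansion, applying the two reduction rules (skip a node whose two children are equal; share identical nodes through a unique table), and counts internal nodes. Replacing x0 by x0 XOR x1 turns f = x0 XOR x1 into the single-variable function x0. This truth-table-based construction is an assumed illustration, exponential in the number of variables, and is not the paper's exact search.

```python
from typing import Callable


def robdd_size(f: Callable[..., int], n: int) -> int:
    """Internal node count of the ROBDD of f under the order x0 < x1 < ... < x(n-1)."""
    unique: dict[tuple[int, int, int], int] = {}   # (var, low, high) -> node id

    def build(var: int, fixed: tuple[int, ...]) -> int:
        if var == n:
            return f(*fixed)                 # terminal ids are 0 and 1
        low = build(var + 1, fixed + (0,))
        high = build(var + 1, fixed + (1,))
        if low == high:                      # rule 1: remove redundant test
            return low
        key = (var, low, high)
        if key not in unique:                # rule 2: share isomorphic nodes
            unique[key] = len(unique) + 2    # ids start after the terminals
        return unique[key]

    build(0, ())
    return len(unique)


def f(x0, x1):
    return x0 ^ x1


def g(x0, x1):
    # f after the linear transformation x0 <- x0 XOR x1
    return f(x0 ^ x1, x1)


print(robdd_size(f, 2), robdd_size(g, 2))  # 3 1
```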
-
Timed Supersetting and the Synthesis of Large Telescopic Units [p. 331]
-
L. Benini, G. De Micheli, A. Lioy, E. Macii, G. Odasso, and M. Poncino
In high-performance systems, variable-latency units are often employed to improve
the average throughput when the worst-case delay exceeds the cycle time. Although
such units have traditionally been hand-designed, recent results have shown that
variable-latency units can be automatically generated. Unfortunately, the
existing synthesis procedure has limited applicability due to its computational complexity.
In this work, we define and study an optimization problem, timed supersetting, whose
solution is at the core of the procedure for automatic generation of variable-latency
units. We contribute a new algorithm for solving timed supersetting in the most difficult
case, that is, when the timing behavior of the circuits is expressed through an accurate
delay model. The proposed solution overcomes the complexity limitation of previous
approaches, and its robustness is experimentally demonstrated by obtaining high-throughput,
variable-latency implementations for all the largest circuits in the ISCAS'85 and ISCAS'89
benchmark suites.
-
Tabu Search Based Circuit Optimization [p. 338]
-
Sadiq M. Sait, Habib Youssef, and Munir M. Zahra
In this paper we address the problem of optimizing mixed CMOS/BiCMOS circuits.
The problem is formulated as a constrained combinatorial optimization problem and
solved using a tabu search algorithm. Only gates on the critical sensitizable paths
are considered for optimization. Such a strategy leads to sizable circuit speed
improvement with minimum increase in the overall circuit capacitance. Compared to
earlier approaches, the presented technique produces circuits with remarkable increase
in speed (greater than 20%) for very small increase in overall circuit capacitance
(less than 3%).
Keywords: Tabu Search, Circuit Optimization, Search Algorithms, CMOS/BiCMOS,
Mixed Technologies, Critical Path, False Path.
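The general shape of a tabu search over a binary design choice can be sketched as follows, where, hypothetically, bit i = 1 means gate i is implemented in BiCMOS and 0 means CMOS. The cost function, per-gate numbers, move set (flip one gate), tenure, and iteration count are all assumptions. The paper restricts moves to gates on critical sensitizable paths and uses a circuit-level cost, neither of which is modeled here.

```python
from typing import Callable


def tabu_search(cost: Callable[[list[int]], float], n: int,
                iters: int = 100, tenure: int = 3) -> tuple[list[int], float]:
    """Minimize cost over bit vectors of length n using single-bit-flip moves."""
    current = [0] * n
    best, best_cost = current[:], cost(current)
    tabu_until = [0] * n      # flipping bit i is tabu while iteration <= tabu_until[i]
    for it in range(1, iters + 1):
        candidates = []
        for i in range(n):
            neighbor = current[:]
            neighbor[i] ^= 1
            c = cost(neighbor)
            # Aspiration criterion: allow a tabu move if it beats the best so far.
            if tabu_until[i] < it or c < best_cost:
                candidates.append((c, i, neighbor))
        if not candidates:
            continue
        c, i, current = min(candidates, key=lambda t: t[0])   # best admissible move
        tabu_until[i] = it + tenure
        if c < best_cost:
            best, best_cost = current[:], c
    return best, best_cost


# Hypothetical per-gate data: (CMOS delay, BiCMOS delay, added capacitance).
GATES = [(5.0, 3.0, 1.0), (2.0, 1.8, 1.0), (6.0, 3.5, 0.5), (1.0, 0.9, 2.0)]


def toy_cost(x: list[int]) -> float:
    delay = sum(bi if xi else cm for (cm, bi, _), xi in zip(GATES, x))
    cap = sum(extra for (_, _, extra), xi in zip(GATES, x) if xi)
    return delay + cap


best, best_cost = tabu_search(toy_cost, len(GATES))
```

The tabu list lets the search accept non-improving moves without immediately undoing them, which helps it escape local minima in larger, non-separable cost landscapes.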
-
On the Characterization of Multi-Point Nets in Electronic Designs [p. 344]
-
Dirk Stroobandt, Fadi I. Kurdahi
Important layout properties of electronic designs include interconnection length
values, clock speed, area requirements, and power dissipation. A reliable estimation
of those properties is essential for improving placement and routing techniques for
digital circuits.
Previous work on estimating design properties failed to take multi-point nets into
account. All nets were assumed to be 2-point nets (especially for estimating the number
of nets). In this paper we aim at characterizing multi-point nets in electronic designs.
We develop a model for the behaviour of multi-point nets during the partitioning
process. The resulting distribution of nets over their net degree is validated
through comparison with benchmark data.
-
HOOVER: Hardware Object-Oriented Verification [p. 351]
-
Mostafa M. Aref and Khaled M. Elleithy
In this paper a new formal hardware verification approach based on object oriented
techniques is presented. The HOOVER system (Hardware Object Oriented VERification)
is described. A cell library of different hardware components has been implemented as
classes. Components in the cell library are described at the transistor level, gate
level, and logical level, and functional level. The verification of a CMOS inverter
and 1-bit CMOS adder using HOOVER is given in the paper.
-
MDG-Based Verification by Retiming and Combinational Transformations [p. 356]
-
O.A. Mohamed, E. Cerny, and X. Song
Multiway Decision Graphs (MDGs) have recently been proposed as an efficient verification
tool for RTL designs, based on an efficient representation mechanism. In MDG, a data
value is represented by a single variable of abstract sort, and a data operation is
represented by an uninterpreted function symbol. In this work we investigate the
non-termination problem of MDG-based verification. We present a novel approach to dealing
with the problem based on retiming and circuit transformations that preserve the behavior
of the circuit. We demonstrate the effectiveness of our method on the example of the
Island Tunnel Controller (ITC).
Keywords: Formal Verification, Multiway Decision Graphs, Retiming, Circuit
Transformations, Non-termination.
-
Practical Considerations in Formal Equivalence Checking of PowerPC™
Microprocessors [p. 362]
-
A. Chandra, L.-C. Wang, and M. Abadir
Formal verification is increasingly becoming part of the VLSI design methodology.
Formally verifying a design guarantees 100% coverage and eliminates the need for simulation.
Theoretically, 100% coverage is very appealing, and formal verification looks like the
panacea for the coverage problem. However, there are many practical considerations
in deploying formal verification in real design environments. If these considerations are
not evaluated, they can lead to ineffective and even erroneous formal verification
methodologies. In this paper we show how to make formal verification a successful part of a
design methodology by paying attention to practical considerations and knowing the
limitations of formal verification. We show the errors that can result from overgeneralized
assumptions and how they can be avoided. We do this in the context of the design of PowerPC
microprocessors. We limit ourselves to a formal verification technique commonly used in our
design methodology: Boolean equivalence checking.
-
Practical Approaches to the Automatic Verification of an ATM Switch
Fabric Using VIS [p. 368]
-
J. Lu and S. Tahar
In this paper we present several practical methods for formally verifying an Asynchronous
Transfer Mode (ATM) network switching fabric using the Verification Interacting with Synthesis
(VIS) tool. We produced Verilog RTL behavioral and netlist structural descriptions of the
switch fabric at different levels of hierarchy and established several abstracted models
of the fabric. Using various techniques presented in the paper, we provided a number of
relevant liveness and safety properties expressible in CTL, and accomplished their
verification in reasonable CPU time. Moreover, we performed equivalence checking between
the structural and behavioral descriptions of each submodule of the implementation hierarchy.
-
Performance Optimization of Self-Timed Circuits [p. 374]
-
M.A. Franklin and P. Prabhu
In this paper, we present methods for improving the performance of self-timed computation
blocks. The Hybrid Completion method permits the design of a spectrum of completion circuits
ranging from those based on pure bounded delays to those based on full complementary circuit
development. This is achieved by using a subset of the outputs of the computation block to
generate the overall completion signal. Thus, the extra circuitry for the completion signals
of the other outputs is eliminated. The computation block's delay might also be reduced since
fewer signals are required to generate the overall completion signal. The approach seeks to
incorporate the area efficiency of the bounded delay approach and the operand-based delay
sensitivity of the full complementary approach.
-
Stochastic Evolution Algorithm for Technology Mapping [p. 380]
-
A.S. Al-Mulhem, A. Amin, and H. Youssef
A new technology mapper (SELF-Map) for Look-Up Table (LUT) based Field Programmable
Gate Arrays (FPGAs) is described. SELF-Map is based on the Stochastic Evolution (SE)
algorithm. The state space model of the problem is defined, and a suitable cost function
that allows optimization for area, delay, or area-delay combinations is proposed.
Experimental results show that SELF-Map has better overall performance than
other algorithms reported in the literature.
-
RCRS: A Framework for Loop Scheduling with Limited Number of Registers [p. 386]
-
Kaisheng Wang, Ted Zhihong Yu, Edwin H.-M. Sha
Many real time applications such as multimedia and DSP systems require high throughput,
so it is necessary to have special purpose designs for them. Loop pipelining is an
effective approach to reduce the total execution time of loops. While most previous
research concentrates on the scheduling of computation, the experiments show that data
access may give significant overhead if the register resource is limited. This paper
studies the register constraint problem and presents Register Constrained Rotation
Scheduling (RCRS), including an algorithm for analyzing the number of registers required
for loops and two classes of algorithms based on different assumptions. The first class
is for loop scheduling with a given number of registers. If the register constraint is
too stringent, the second class of algorithms is applied, inserting the necessary LOAD/STORE
operations into the loop schedule. Through a series of experiments, the RCRS algorithms
are shown to achieve near-optimal schedule lengths while satisfying register constraints.
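A basic ingredient of register-constrained loop scheduling is counting how many registers a software-pipelined loop needs. With initiation interval II, a value live during cycles [def, last_use) of one iteration overlaps with the copies of that value from other iterations, so its lifetime can be folded modulo II; the maximum folded count (often called MaxLive) is a lower bound on the registers required. The sketch assumes this standard modulo-lifetime model, which may differ from the paper's own analysis algorithm; `max_live` is a hypothetical helper.

```python
def max_live(lifetimes: dict[str, tuple[int, int]], ii: int) -> int:
    """Registers needed (MaxLive) for a modulo schedule with initiation interval ii.

    lifetimes maps each value to (def_cycle, last_use_cycle); the value
    occupies a register in cycles def_cycle <= t < last_use_cycle.
    """
    occupancy = [0] * ii
    for start, end in lifetimes.values():
        for t in range(start, end):
            occupancy[t % ii] += 1      # fold overlapping iterations onto one slot
    return max(occupancy)


print(max_live({"a": (0, 3), "b": (1, 2), "c": (2, 5)}, ii=2))  # 4
```

If MaxLive exceeds the available registers, the schedule must change or values must be spilled, which corresponds to inserting LOAD/STORE operations as in the second class of RCRS algorithms.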
-
A Quantitative Study of the Benefits of Area-I/O in FPGAs [p. 392]
-
Herwig Van Marck, Jo Depreitere, Dirk Stroobandt and Jan Van Campenhout
Designs targeted for FPGAs are becoming increasingly larger and more complex. The
need for I/O often surpasses the number of I/O pads that can be provided at the perimeter
of the FPGA chip. As a result, these designs have to be implemented in larger FPGAs, the
size of which is fixed by the number of I/O pads and not by the logic needed, reducing
the performance of the implementation. Providing FPGA chips with I/O pads that are spread
out across the whole chip area drastically reduces this problem. In this paper we present
a quantitative analysis of the impact of area-I/O in FPGAs.
-
Top-Down Design Using Cycle Based Simulation: An MPEG A/V
Decoder Example [p. 400]
-
Dale E. Hocevar, Ching-Yu Hung, Dan Pickens and Sundararajan Sriram
This paper presents a discussion of a top-down VLSI design approach which involves
system level performance modeling, block level cycle based simulation, RTL/VHDL
simulation and gate level emulation. An MPEG-2 Audio/Video decoder design example
illustrates the use of this top-down approach. Most of the discussion concentrates
on the concept of block level cycle based (BLCB) simulation. HW/SW co-design also
played an important role in this work and our approach towards such co-design is
discussed as well.
-
Low-Power Driven Scheduling and Binding [p. 406]
-
Jim Crenshaw and Majid Sarrafzadeh
We investigate the problem of exploiting signal correlation between operations to find
a schedule and binding which minimizes switching. We propose several heuristics to solve
the problem. Experimentally, we give an algorithm for scheduling communications on a bus,
which reduces bus switching by up to 60%, without increasing the number of cycles required
for the schedule. Low-power scheduling efforts in the literature have focused on
decreasing the number of cycles in the schedule so that the voltage required to run
the resulting circuit can be lowered. However, the number of voltages supplied to a
chip is likely to be limited, so among the processes to be implemented, typically only
a few will determine the minimum voltages, and the rest will have slack in their schedules.
Therefore it is interesting to examine the impact of scheduling that does not reduce
the number of time steps but instead decreases switching. In this paper, we show that
power-aware scheduling can lead to significant decreases in switching, often without an
increase in the number of time steps required. The technique is general and can be used
to schedule operations on any kind of resource.
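The bus-scheduling idea can be illustrated by counting bit transitions: the switching caused by a sequence of bus transfers is the sum of Hamming distances between consecutive words. The sketch brute-forces the transfer order with the fewest transitions, assuming the transfers have no precedence constraints and that the bus starts at zero; the paper's heuristics handle real scheduling constraints, which are not modeled, and the helper names are hypothetical.

```python
from itertools import permutations


def bus_switching(words: list[int], initial: int = 0) -> int:
    """Total bit transitions when words are driven onto the bus in order."""
    transitions, prev = 0, initial
    for w in words:
        transitions += bin(prev ^ w).count("1")   # Hamming distance to previous word
        prev = w
    return transitions


def best_order(words: list[int], initial: int = 0) -> list[int]:
    """Exhaustively pick the order with the fewest transitions (small inputs only)."""
    return list(min(permutations(words), key=lambda p: bus_switching(list(p), initial)))


transfers = [0b1111, 0b0000, 0b1110, 0b0001]
print(bus_switching(transfers), bus_switching(best_order(transfers)))  # 15 5
```

Reordering changes only which word follows which, not how many transfers occur, so the switching reduction comes without adding cycles in this unconstrained toy case.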
-
Effective Capacitance Macro-Modelling for Architectural-Level
Power Estimation [p. 414]
-
Muhammad M. Khellah and M. I. Elmasry
This paper presents a simple, yet efficient method to characterize the effective
capacitance in data-path macros for architectural-level power estimation. Given a
library of hard-macros, a capacitance model based on linear regression is derived
for each macro. A transistor-level tool is employed for capacitance extraction. The
capacitance models can be used during architectural-level power estimation. Unlike
previous approaches, our characterization methodology assumes no specific word-level
statistics of the input data, requires little knowledge about the structure of the
modules, allows the user to trade-off accuracy and characterization time, and
propagates effective capacitance directly from transistor-level (real) implementations.
Simulation experiments on a set of data-path components with various sizes are performed.
Compared to a previously published approach [1], our scheme significantly improves the
accuracy of RTL power estimation and produces results within 15% of a transistor-level
tool on average.
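Characterizing a macro by linear regression can be sketched as an ordinary least-squares fit of effective capacitance against an input-activity feature. The single feature used here (number of toggled input bits), the characterization data, and the helper names are hypothetical placeholders standing in for values extracted with a transistor-level tool; a real characterization would likely use richer features.

```python
def toggles(prev: int, cur: int) -> int:
    """Number of input bits that switch between two consecutive input vectors."""
    return bin(prev ^ cur).count("1")


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for ys ~ c0 + c1 * xs; returns (c0, c1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return my - c1 * mx, c1


# Hypothetical characterization data: toggled input bits vs. extracted Ceff (pF).
activity = [0, 2, 4, 6, 8]
ceff_pf = [0.5, 1.3, 2.1, 2.9, 3.7]
c0, c1 = fit_line(activity, ceff_pf)

# Use the fitted model for a new input transition with 4 toggled bits.
predicted = c0 + c1 * toggles(0b1010, 0b0101)
```

Once fitted, evaluating the model is a few arithmetic operations per input transition, which is what makes this kind of macro model cheap enough for architectural-level estimation.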
-
A Methodology for High Level Power Estimation and Exploration [p. 420]
-
V. Krishna and N. Ranganathan
Effective power reduction can be achieved at higher levels of design abstraction. A number
of such techniques have been proposed for power optimization in the literature. These
techniques use RT level templates which characterize the area, delay and power of the design.
The templates are based on some knowledge of the logic block such as the number of nodes,
levels and their interconnections. Methods which model the power consumption of a logic block
whose internal details are not known are desirable to explore trade-offs early on in the design
cycle. Recently, lower bounds for switching activity at the gate level based on decision theory
have been proposed by the authors. This has been extended to derive the average switching
activity of a module based solely on its functionality. The experimental results on ISCAS '85
benchmark circuits indicate that the approach gives reasonably accurate estimates at low
computational cost. In this paper, we use the RT level estimates for power exploration at the
behavioral level for various high level synthesis benchmarks. The experimental results show
that appropriate design decisions can be taken at the high level to reduce the cost of redesigning
which would be incurred if committed to a particular circuit structure.
Keywords: High Level Designs, Power Estimation, Low Power Designs, Switching Activity
-
How to Transform an Architectural Synthesis Tool for Low Power VLSI Designs [p. 426]
-
S. Gailhard, N. Julien, J.-Ph. Diguet, and E. Martin
High Level Synthesis (HLS) for low-power VLSI design is a complex optimization problem
due to the interdependence of area, time, and power. As few low-power design tools are
available, a new approach providing a modular low-power synthesis method is proposed.
Although currently based on the generic architectural synthesis tool Gaut, the approach
can also use other "commercial" tools. The Gaut_w HLS tool consists of the following
low-power modules: high-level power dissipation estimation, assignment, module selection
(operators and supply voltage), optimization criteria, and an operator library. As an
illustration, power saving factors on DWT algorithms are presented.
-
Sharing Electronic Design Data Via Semantic Spaces [p. 432]
-
K.C. Davis, S. Venkatesan, and L.M.L. Delcambre
Electronic Design Automation (EDA) tools, such as layout generators and simulators,
have generally focused on algorithms and techniques for hardware design. Data management
aspects have not been emphasized, but the volume of data, heterogeneity of data formats,
and the evolution/proliferation of tools have made data modeling and data interchange
increasingly important research issues. The data sharing problem stems from the fact
that related EDA tools are often used in various sequences to manipulate and annotate
a single design. In a typical design environment, tools use file-based data storage,
with limited data modeling capabilities, and primitive or non-existent query facilities.
In order to support current tools, we wish to preserve the semantics of existing hardware
description languages in rigorous data models; we propose to capture each existing language
in a semantic space model. We view data interchange as a query against one semantic space
that produces objects, i.e., query answers, in another semantic space. We define an
integrating meta-model, the meta-space, and also define general query operators for
transforming objects between semantic spaces. These query operators define both the
intension and extension of a query result; the transformed data is described in the type
system of the meta-space, thus providing explicit semantics for the shared data. Our
modeling approach supports advanced and evolving applications, such as hardware/software
codesign, through the ability to retrieve data resident in individual semantic spaces,
as well as to share data in semantic spaces from different EDA sources.
-
VHDL-Based EDA Tool Implementation with Java [p. 440]
-
R. Miller
As part of ARPA's RASSP Technology Program, we developed a Hardware/Software CoSynthesis
Algorithm that was initially prototyped in C++ using PCCTS, Tcl, Tcl/Tk, and Tcl-DP.
When this environment became unwieldy due to changing hardware development platforms
and software packages, a second prototype was built in Java.
This paper describes architectural features of the prototype and how they were addressed
in Java.
-
Standard Data Representations for VLSI Algorithm Development [p. 446]
-
D. Hertweck, M. Nica, S. Park, and C. Purdy
Because so many important problems arising in VLSI design are NP-hard, VLSI algorithms
must employ randomization techniques or heuristics. Thus the process of analyzing a new
algorithm or of comparing two algorithms is at present an experimental one. Consequently,
progress in VLSI algorithm development must be based on references to standard benchmarks.
Yet an examination of the literature on specific problems, such as graph partitioning, shows
that such standardization is not yet a reality. Here we describe a system, Circuitbase, which
we are developing to address the standardization problem. Circuitbase will combine the
extensive graph manipulation routines of Knuth's Stanford GraphBase package with actual
circuit examples from the Benchmark Archives at CBL, standard routines for generating
random examples of circuits, and standard methods for algorithm analysis. We describe
Circuitbase versions of example behavioral, structural, and physical views of a VLSI
circuit and discuss how Circuitbase can support modern VLSI design environments.
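Why seeded random instances make experimental comparisons repeatable can be shown with a small sketch for graph partitioning. It is plain Python, not the Stanford GraphBase API or Circuitbase itself; the generator, the cut-size metric, and the two partitioning methods (random bisection and a greedy pairwise-swap local search) are hypothetical stand-ins chosen for illustration.

```python
import random

def random_circuit_graph(num_nodes, num_edges, seed):
    """Reproducible random undirected graph (hypothetical stand-in for a
    standard random-circuit generator). Same seed -> same edge list."""
    max_edges = num_nodes * (num_nodes - 1) // 2
    if num_edges > max_edges:
        raise ValueError("too many edges for a simple graph")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.sample(range(num_nodes), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)

def cut_size(edges, side):
    """Number of edges whose endpoints lie in different halves."""
    return sum(1 for u, v in edges if side[u] != side[v])

def random_bisection(num_nodes, seed):
    """Balanced random assignment of nodes to sides 0 and 1."""
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    side = [0] * num_nodes
    for n in nodes[num_nodes // 2:]:
        side[n] = 1
    return side

def greedy_swap(edges, side):
    """Swap one node from each side whenever that lowers the cut size,
    until no single swap helps. Swapping pairs keeps the halves balanced."""
    side = list(side)
    best = cut_size(edges, side)
    improved = True
    while improved:
        improved = False
        zeros = [i for i, s in enumerate(side) if s == 0]
        ones = [i for i, s in enumerate(side) if s == 1]
        for a in zeros:
            for b in ones:
                side[a], side[b] = 1, 0
                c = cut_size(edges, side)
                if c < best:
                    best, improved = c, True
                    break
                side[a], side[b] = 0, 1  # undo the unhelpful swap
            if improved:
                break
    return side, best

edges = random_circuit_graph(20, 40, seed=1998)
start = random_bisection(20, seed=7)
start_cut = cut_size(edges, start)
final, final_cut = greedy_swap(edges, start)
print(f"random bisection cut: {start_cut}, after greedy swaps: {final_cut}")
```

Publishing the generator and its seeds alongside reported cut sizes lets another group regenerate exactly the same instances, which is the kind of standardization the abstract argues is missing.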
-
A Storage Structure for Graph-Oriented Databases Using an Array of
Element Types [p. 452]
-
T. Hochin and T. Tsuji
This paper proposes a storage structure for graph-oriented databases called the flattened
separable directory method. In this method, a data representing graph, which is the basic unit of
representation, is represented primarily by an array of edge or node types. As every
node or edge can be accessed without navigation, the values of nodes and/or edges can be
evaluated quickly. Experimental evaluations confirm this characteristic and show that
insertion performance is high and that less storage overhead is needed for graphs
consisting of many node and edge types.
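The core idea of storing a graph as an array of element types, so that any node or edge is reached by direct indexing rather than by traversal, can be sketched in memory as follows. This is an interpretation, not the paper's on-disk layout: the class names, the (type index, instance index) reference scheme, and the select helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ElementType:
    """One slot of the type array: all instances of a single node or edge type."""
    name: str
    is_edge: bool
    values: list = field(default_factory=list)     # value of each instance
    endpoints: list = field(default_factory=list)  # (src_ref, dst_ref) per edge

class FlattenedGraph:
    """Hypothetical sketch: a graph stored as an array of element types.
    An element is addressed by (type index, instance index), so its value
    is read directly without navigating through other nodes or edges."""

    def __init__(self):
        self.types = []       # the array of element types
        self.type_index = {}  # type name -> position in self.types

    def _type(self, name, is_edge):
        if name not in self.type_index:
            self.type_index[name] = len(self.types)
            self.types.append(ElementType(name, is_edge))
        t = self.types[self.type_index[name]]
        if t.is_edge != is_edge:
            raise TypeError(f"{name} is already {'an edge' if t.is_edge else 'a node'} type")
        return self.type_index[name], t

    def add_node(self, type_name, value):
        ti, t = self._type(type_name, False)
        t.values.append(value)
        return (ti, len(t.values) - 1)

    def add_edge(self, type_name, src, dst, value=None):
        ti, t = self._type(type_name, True)
        t.values.append(value)
        t.endpoints.append((src, dst))
        return (ti, len(t.values) - 1)

    def value(self, ref):
        """Direct access: two index operations, no traversal."""
        ti, i = ref
        return self.types[ti].values[i]

    def select(self, type_name, predicate):
        """Evaluate a condition over all instances of one type by scanning its array."""
        t = self.types[self.type_index[type_name]]
        return [v for v in t.values if predicate(v)]

g = FlattenedGraph()
a = g.add_node("Gate", {"name": "g1", "delay": 2})
b = g.add_node("Gate", {"name": "g2", "delay": 5})
n = g.add_node("Net", {"name": "n1"})
e = g.add_edge("Drives", a, n)
slow = g.select("Gate", lambda v: v["delay"] > 3)
print(g.value(b), slow)
```

Grouping instances by type means a condition on one type touches only that type's array, and adding a new type only appends one slot, which is consistent with the abstract's claims about fast evaluation and low overhead for graphs with many node and edge types.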
|