While process technologists are obsessed with following Moore's curve down to nanoscale dimensions, design technologists are confronted with gigascale complexity. At the same time, post-PC and post-dotcom products require zero-cost, zero-energy, yet software-programmable novel system architectures to be sold in huge volumes and designed in exponentially decreasing time. How do we cope with these novel silicon architectures? What research challenges does this create? How do we create the necessary tools and skills, and how do we organize research and education in a world driven by shareholder value? Can you spare half an hour to reflect on these challenges to the design community?
The technical complexities of advanced SoC design are compounded by changes in the economic structure of the worldwide semiconductor industry. A look at some of the organizational and personal responsibilities that will be required to meet the challenges of SoC design in the future.
It is well recognised that the process of new product development, introduction and marketing is fraught with difficulty. Indeed, the probability of achieving planned timescales, costs and budget is so low that some degree of failure is inevitable. So whilst the primary role of the manager is to identify and minimise all major risks, and to make sure the remaining ones are adequately resourced, a secondary role is to make sure that whatever failure does occur does not damage his/her reputation! The Virtual Component appears in the context of risk minimisation. CPU or UART, the motive is the same: get the right product from concept to customer as quickly as possible. The make-or-buy decision is a risk/cost trade-off, and as the cost of failure is normally so high, risk emerges as the dominant factor: is it riskier to design, or to buy in?
The 2000 SIA roadmap shows over 50% of the area in an SOC being occupied by embedded memory. The selection of the memory IP and supplier is critical to the success of the design and the ramp to volume: the memory IP can determine yield, reliability, cost, speed and/or power. Mr. Ratford will help you navigate the evaluation process by discussing key requirements and possible solutions when evaluating memory for your next SOC design.
Embedded software Intellectual Property (IP) is becoming vital for today's complex Systems-on-Chip. We first define the notion of hardware-dependent software, and then review the multidimensional criteria for choosing ESW IP, including retargetability and portability, flexibility, optimisation, validation and certification.
The semiconductor industry gave the most tremendous challenge to the electronic design community and EDA industry by making available a silicon capacity that far exceeds what today's designers can utilise in a reasonable amount of time. Reasonable timeframes for System-on-a-Chip developments in the multimedia and communication markets are less than eighteen months, if not nine! I would like to give credit to Gary Smith, Chief Analyst at Dataquest, for raising a very pertinent media alert in his article, "The Revolution Isn't Coming -- It's Already Here", in Virtual Chip Design, May 1997. It clearly stated that in order to close the design gap between available gates on silicon and design methodology, the solution was system-level integration (SLI) using what were called at that time system-level macros (SLM). The electronic design community and EDA companies picked up the gauntlet and started what would become known as Virtual Component creation through the industry organisation called the Virtual Socket Interface Alliance (VSIA). This was followed by Mentor Graphics and Synopsys signing a Design Reuse Partnership, which led to the publication of the "Reuse Methodology Manual for SoC Designs". The last stage was to create an industry-accepted Virtual Component Quality Spreadsheet by merging the two efforts.
We present the formal verification of the floating-point multiplier in the Intel IA-32 Pentium® 4 microprocessor. The verification is based on a combination of theorem-proving and BDD-based model-checking tasks performed in a unified hardware verification environment. The tasks are tightly integrated to accomplish complete verification of the multiplier hardware coupled with the rounder logic. The approach does not rely on specialized representations such as Binary Moment Diagrams or their variants.
Rewriting rules and Positive Equality [4] are combined automatically in order to formally verify out-of-order processors that have a Reorder Buffer and can issue/retire multiple instructions per clock cycle. Only register-register instructions are implemented; they can be executed out of order, as soon as their data operands can be either read from the Register File or forwarded as results of instructions ahead in program order in the Reorder Buffer. The verification is based on the Burch and Dill correctness criterion [6]. Rewriting rules are used to prove the correct execution of instructions that are initially in the Reorder Buffer and to remove them from the correctness formula. Positive Equality is then employed to prove the correct execution of newly fetched instructions. The rewriting rules resulted in up to five orders of magnitude speedup compared to using Positive Equality alone. That made it possible to formally verify processors with up to 1,500 instructions in the Reorder Buffer, and issue/retire widths of up to 128 instructions per clock cycle.
As embedded systems continue to face increasingly high performance requirements, deeply pipelined processor architectures are being employed to meet desired system performance. System architects critically need modeling techniques that allow exploration, evaluation, customization and validation of different processor pipeline configurations, tuned for a specific application domain. We propose a novel Finite State Machine (FSM) based modeling of pipelined processors and define a set of properties that can be used to verify the correctness of in-order execution in the presence of fragmented pipelines and multicycle functional units. Our approach leverages the system architect's knowledge about the behavior of the pipelined processor, through Architecture Description Language (ADL) constructs, and thus allows a powerful top-down approach to pipeline verification. We applied this methodology to the DLX processor to demonstrate the usefulness of our approach.
The verification of an n-stage pulse-driven IPCMOS pipeline, for any n > 0, is presented. The complexity of the system is 32n transistors, and delay information is provided at the transistor level. The correctness of the circuit depends strongly on the timed behavior of its components and the environment. To verify the system, three techniques have been combined: (1) relative-timing-based verification from absolute timing information [13], (2) assume-guarantee reasoning to verify untimed abstractions of timed components, and (3) mathematical induction to verify pipelines of any length. Even though the circuit can interact with pulse-driven environments, the internal behavior between stages follows a handshake protocol that enables the use of untimed abstractions. The verification not only reports a positive answer about the correctness of the system, but also gives a set of sufficient relative-timing constraints that determine delay slacks under which correctness can be maintained.
In this paper, the placement problem on FPGAs is addressed using Thermodynamic Combinatorial Optimization (TCO). TCO is a new combinatorial optimization method based on both thermodynamics and information theory. In TCO two kinds of processes are considered: microstate and macrostate transformations. Applying Shannon's definition of entropy to reversible microstate transformations, a probability of acceptance based on Fermi-Dirac statistics is derived. On the other hand, applying thermodynamic laws to reversible macrostate transformations, an efficient annealing schedule is provided. TCO has been compared with Simulated Annealing (SA) on a set of benchmark circuits for the FPGA placement problem. TCO has achieved large time reductions with respect to SA, while providing interesting adaptive properties.
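To make the contrast with Simulated Annealing concrete, the two acceptance rules can be sketched as follows. This is an illustrative reading of the abstract, assuming the Fermi-Dirac acceptance takes the form p = 1/(1 + exp(ΔE/T)); the function names and single-temperature interface are ours, not the paper's.

```python
import math
import random

def boltzmann_accept(delta_e, temperature, rng=random.random):
    """Classic Simulated Annealing acceptance (Metropolis criterion):
    always accept improving moves, accept worsening ones with exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng() < math.exp(-delta_e / temperature)

def fermi_dirac_accept(delta_e, temperature, rng=random.random):
    """Acceptance probability shaped like the Fermi-Dirac distribution,
    p = 1 / (1 + exp(dE / T)).  Unlike Metropolis, a zero-cost move is
    accepted only with probability 0.5, and improving moves with p < 1."""
    return rng() < 1.0 / (1.0 + math.exp(delta_e / temperature))
```

At high temperature both rules accept almost everything; as T falls, the Fermi-Dirac curve becomes a step function around ΔE = 0, which is one way an entropy-based schedule can exert finer control over the accepted-move distribution.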
After discussing the difference between floorplanning and packing in VLSI placement design, this paper adapts a floorplanner based on the Q-sequence into a packing algorithm. For this purpose, some empty-room insertion is required to guarantee that the optimum packing is not missed. To increase packing performance, a new move that perturbs the floorplan is introduced in terms of the Parenthesis-Tree Pair. A Simulated Annealing based packing search algorithm was implemented. Experimental results showed the effect of empty-room insertion.
In this paper, we deal with arbitrary convex and concave rectilinear module packing using the Transitive Closure Graph (TCG) representation. The geometric meanings of modules are transparent to TCG and its induced operations, which makes TCG an ideal representation for floorplanning/placement with arbitrary rectilinear modules. We first partition a rectilinear module into a set of submodules and then derive necessary and sufficient conditions of feasible TCG for the submodules. Unlike most previous works that process each submodule individually and thus need post-processing to fix deformed rectilinear modules, our algorithm treats a set of submodules as a whole, and thus not only can guarantee the feasibility of each perturbed solution but also can eliminate the need for post-processing of deformed modules, implying better solution quality and running time. Experimental results show that our TCG-based algorithm is capable of handling very complex instances; further, it is very efficient and results in better area utilization than previous work.
A unified approach to fault simulation for floating-gate defects (FGDs) is introduced. Instead of a direct fault simulation, the proposed approach calculates indirectly, from the simulator output, the sets of undetectable values of the trapped charge on the floating-gate transistor. It covers all potential gate charges of an FGD at one or more transistors and allows the application of conventional circuit simulators for simulating DC, AC and transient tests. Based on this fault simulation, a test design methodology is presented that can determine all test sets that detect all FGDs for all possible values of gate charge.
The problem of fault grading for multiple path delay faults is studied, and a method for obtaining the exact coverage is presented. The faults covered are represented and manipulated as sets by zero-suppressed binary decision diagrams (ZBDDs), which are shown to be able to store a very large number of path delay faults. For the extreme cases where memory becomes a problem, a method to estimate the coverage of the test set is also presented. The problem of fault grading is solved with a polynomial number of BDD operations. Experimental results on the ISCAS'85 benchmarks include test sets from ATPG tools and specifically designed tests, in order to investigate the limitations and properties of the proposed method.
It has always been assumed that fault models in memories are sufficiently precise for specifying the faulty behavior. This means that, given a fault model, it should be possible to construct a test that ensures detecting the modeled fault. This paper shows that some faults, called partial faults, are particularly difficult to detect. For these faults, more operations are required to complete their fault effect and to ensure detection. The paper also presents fault analysis results, based on defect injection and simulation, where partial faults have been observed. The impact of partial faults on testing is discussed and a test to detect these partial faults is given.
Key words: partial faults, DRAMs, fault models, defect simulation, memory testing, completing operations.
Deterministic observation and random excitation of fault sites during the ATPG process dramatically reduce the overall defective part level. However, multiple observations of each fault site lead to increased test set size and require more tester memory. In this paper, we propose a new ATPG algorithm to find a near-minimal test pattern set that detects faults multiple times and achieves an excellent defective part level. This greedy approach uses 3-value fault simulation to estimate the potential value of each vector candidate at each stage of ATPG. The results show that in most cases a close-to-minimal vector set can be generated only by using dynamic compaction techniques. Finally, a systematic method to trade off defective part level against test size is also presented.
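The estimate-and-pick core of such a greedy multi-detection ATPG can be sketched as below. This is a simplified illustration assuming per-vector fault-detection sets are already available (in the paper they come from 3-value fault simulation); all names and the quota interface are ours.

```python
def greedy_multi_detect(vectors, detects, n_required):
    """Greedily pick a near-minimal subset of `vectors` so that every fault
    is detected at least `n_required` times.  `detects[v]` is the set of
    faults that vector v detects (here assumed precomputed)."""
    remaining = {}                       # fault -> detections still needed
    for v in vectors:
        for f in detects[v]:
            remaining[f] = n_required
    chosen = []
    unused = list(vectors)
    while unused and any(c > 0 for c in remaining.values()):
        # Value of a candidate = number of faults it moves toward quota.
        gain = lambda v: sum(1 for f in detects[v] if remaining[f] > 0)
        best = max(unused, key=gain)
        if gain(best) == 0:
            break                        # no candidate can make progress
        unused.remove(best)
        chosen.append(best)
        for f in detects[best]:
            if remaining[f] > 0:
                remaining[f] -= 1
    return chosen
```

The trade-off the abstract mentions then falls out naturally: raising `n_required` increases detection multiplicity (lower defective part level) at the cost of a larger chosen set.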
As technology scales toward deep submicron, on-chip interconnects are becoming more and more sensitive to noise sources such as power supply noise, crosstalk, radiation-induced effects, etc. Transient delay and logic faults are likely to reduce the reliability of data transfers across datapath bus lines. This paper investigates how to deal with these errors in an energy-efficient way. We could opt for error correction, which exhibits a larger decoding overhead, or for retransmission of the incorrectly received data word. Provided the timing penalty associated with the latter technique can be tolerated, we show that retransmission strategies are more effective than correction ones from an energy viewpoint, owing both to their larger detection capability and to their lower decoding complexity. The analysis was performed by implementing several variants of a Hamming code in the VHDL model of a processor based on the SPARC V8 architecture, and exploiting the characteristics of AMBA bus slave response cycles to carry out retransmissions in a way fully compliant with this standard on-chip bus specification.
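As a toy illustration of why detection is cheaper than correction, consider a Hamming(7,4) sketch: a detect-and-retransmit scheme only needs to check that the syndrome is nonzero, whereas correction additionally needs syndrome-to-bit-position decoding logic. The bit layout and helper names below are our assumptions, not the paper's VHDL implementation.

```python
# Which data bits feed each parity bit p1, p2, p3 (one Hamming(7,4) layout).
G_PARITY = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]

def encode(data):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, p3, d0, d1, d2, d3]."""
    parity = [sum(data[i] for i in taps) % 2 for taps in G_PARITY]
    return parity + list(data)

def syndrome(word):
    """Recompute each parity and XOR it with the received parity bit."""
    return [(word[k] + sum(word[3 + i] for i in G_PARITY[k])) % 2
            for k in range(3)]

def detect(word):
    """Detection only: any nonzero syndrome bit triggers a retransmission.
    (A correcting decoder would also have to map the syndrome to the
    flipped bit position -- the extra decoding overhead mentioned above.)"""
    return any(syndrome(word))
```

Any single-bit flip yields a nonzero syndrome, so a bus slave can simply request the word again, e.g. via a retry-style response cycle.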
Systems on a chip (SOCs) are rapidly evolving into larger networks on a chip (NOCs). This work presents a new methodology for managing power consumption in NOCs. The power management problem is formulated using closed-loop control concepts, with the estimator tracking changes in the system parameters and recalculating the new power management policy accordingly. Dynamic voltage scaling and local power management are formulated in a node-centric manner, where each core has its own local power manager that determines unit power states. The local power manager's interaction with the other system cores regarding power and QoS needs enables network-centric power management. The new methodology is tested on a system consisting of four satellite units, each with a local power manager capable of both node- and network-centric power management. The results show large savings in power with good QoS.
We present strategies for "online" dynamic power management (DPM) based on the notion of the competitive ratio, which allows us to compare the effectiveness of algorithms against an optimal strategy. This paper makes two contributions: it provides a theoretical basis for the analysis of DPM strategies for systems with multiple power-down states, and it provides a competitive algorithm based on probabilistically generated inputs that improves the competitive ratio over deterministic strategies. Experimental results show that our probability-based DPM strategy improves the efficiency of power management over the deterministic DPM strategy by 25%, bringing the strategy to within 23% of the optimal offline DPM.
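The competitive-ratio framing can be made concrete for the simplest case of a single power-down state: a strategy's energy on each idle period is compared against the optimal offline decision, and the classic break-even threshold (power-down energy divided by idle power) achieves a ratio of 2. The sketch below is textbook background in our own notation, not the paper's multi-state or probabilistic algorithm.

```python
def energy(idle_t, threshold, p_on, e_down):
    """Energy of an online strategy that powers down after waiting
    `threshold` time units.  `p_on` is idle power; `e_down` is the
    energy cost of powering down and back up (sleep power taken as 0)."""
    if idle_t < threshold:
        return p_on * idle_t             # never powered down
    return p_on * threshold + e_down     # waited, then slept

def offline_opt(idle_t, p_on, e_down):
    """Optimal offline strategy: knows idle_t, sleeps iff it pays off."""
    return min(p_on * idle_t, e_down)

def competitive_ratio(threshold, p_on, e_down, idle_periods):
    """Worst-case ratio of online energy to offline-optimal energy."""
    return max(energy(t, threshold, p_on, e_down) / offline_opt(t, p_on, e_down)
               for t in idle_periods)
```

With the break-even threshold `e_down / p_on`, no idle period can cost the online strategy more than twice the optimum; randomizing the threshold (as in the probabilistic strategies above) is what breaks below this deterministic bound.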
This paper describes the AccuPower toolset, a set of simulation tools for accurately estimating the power dissipation within a superscalar microprocessor. AccuPower uses a true hardware-level, cycle-level microarchitectural simulator and energy dissipation coefficients gleaned from SPICE measurements of actual CMOS layouts of critical datapath components. Transition counts can be obtained at the level of bits within data and instruction streams, at the level of registers, or at the level of larger building blocks (such as caches, the issue queue, the reorder buffer and function units). This allows for an accurate estimation of switching activity at any desired level of resolution. The toolsuite implements several variants of superscalar datapath designs in use today and permits the exploration of design choices at the microarchitecture level as well as the circuit level, including the use of voltage and frequency scaling. In particular, the AccuPower toolsuite includes detailed implementations of currently used and proposed techniques for energy/power conservation, including techniques for data encoding and compression, alternative circuit approaches, dynamic resource allocation and datapath reconfiguration. The microarchitectural simulation components of AccuPower can be used for accurate evaluation of datapath designs in a manner well beyond the scope of the widely used Simplescalar tools.
Intellectual property, or IP, takes on many different meanings depending upon the context within which it is utilized. Our IP discussion focuses on the rapidly evolving world of technology IP and, more specifically, semiconductor IP. Our core belief is that in order to be successful, semiconductor IP must be more than an idea or innovation. It must be implemented seamlessly, with little resistance from the customer, and have compelling value-add to the customer upon implementation and thereafter. The heart of the customer's purchase decision is where we believe semiconductor IP models need to be most focused. Is there a right model in every case? No. In fact, we would argue that the right model is the one that makes your customer's adoption the easiest. In some respects, we would say most IP purchase decisions fit the classic make-or-buy scenario. Customers are only willing to embrace third-party IP to save costs. Sure, we could get off track and discuss technology leads or other forms of "killer IP", but cost is at the root of almost every IP decision and, more precisely, a make-or-buy analysis.
We introduce the notion of problem symmetry in search-based SAT algorithms. We develop a theory of essential points to formally characterize the potential search-space pruning that can be realized by exploiting problem symmetry. We unify several search-pruning techniques used in modern SAT solvers under a single framework, by showing them to be special cases of the general theory of essential points. We also propose a new pruning rule exploiting problem symmetry. Preliminary experimental results validate the efficacy of this rule in providing additional search-space pruning beyond that realized by techniques implemented in leading-edge SAT solvers.
We describe a SAT solver, BerkMin, that inherits such features of GRASP, SATO, and Chaff as clause recording, fast BCP, restarts, and conflict-clause "aging". At the same time, BerkMin introduces a new decision-making procedure and a new method of clause database management. We experimentally compare BerkMin with Chaff, the leader among SAT solvers used in the EDA domain. Experiments show that our solver is more robust than Chaff. BerkMin solved all the instances we used in experiments, including very large CNFs from a microprocessor verification benchmark suite. On the other hand, Chaff was not able to complete some instances even with a timeout limit of 16 hours.
The core computation in BDD-based symbolic synthesis and verification is forming the image and pre-image of sets of states under the transition relation characterizing the sequential behavior of the design. Computing an image or a pre-image consists of ordering the latch transition relations, clustering them and eventually re-ordering the clusters. Existing algorithms are mainly limited by memory resources. To make them as efficient as possible, we introduce a set of heuristics with the main target of minimizing the memory used during image computation. They include a dynamic heuristic to order the latch relations, a dynamic framework to cluster them, and the application of conjunctive partitioning during image computation. We provide and integrate a set of algorithms, and we report references and comparisons with recent work. Experimental results are given to demonstrate the efficiency and robustness of the approach.
We propose a novel approach to bus energy minimization that targets crosstalk effects. Unlike previous approaches, we try to reduce energy through capacitance optimization, by adopting non-uniform spacing between wires. This allows a reduction of power while at the same time taking signal integrity into account; therefore, performance is not degraded. Results show that the method saves up to 30% of total bus energy at no cost in performance or design complexity (no encoding-decoding circuitry is needed), and limited cost in area.
We present a Dynamic VTH Scaling (DVTS) scheme to save leakage power during the active mode of the circuit. The power-saving strategy of DVTS is similar to that of the Dynamic VDD Scaling (DVS) scheme, which adaptively changes the supply voltage depending on the current workload of the system. Instead of adjusting the supply voltage, DVTS controls the threshold voltage by means of body-bias control, in order to reduce the leakage power. The power-saving potential of DVTS and its impact on dynamic and leakage power when applied to future technologies are discussed. Pros and cons of the DVTS system are dealt with in detail. Finally, feedback-loop hardware for DVTS, which tracks the optimal VTH for a given clock frequency, is proposed. Simulation results show that 92% energy savings can be achieved with DVTS for 70nm circuits.
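A minimal sketch of what one control step of such a feedback loop might do, assuming a boolean "critical path meets the clock" observation from a delay monitor; the interface, bounds and step size are hypothetical, not the paper's hardware:

```python
def dvts_step(vth, meets_clock, vth_min, vth_max, step=0.01):
    """One iteration of a (hypothetical) DVTS feedback loop: raise VTH
    (via body bias) to cut leakage while the critical path still meets
    the clock frequency, and back off as soon as it no longer does."""
    if meets_clock:
        return min(vth + step, vth_max)   # slack available: trade it for leakage
    return max(vth - step, vth_min)       # too slow: lower VTH to speed up
```

Iterating this step makes VTH hover around the highest value that still satisfies the given clock frequency, which is the "optimal VTH" tracking behavior described above.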
Dynamic voltage scaling (DVS) is a known effective mechanism for reducing CPU energy consumption without significant performance degradation. While a lot of work has been done on inter-task scheduling algorithms to implement DVS under operating system control, new research challenges exist in intra-task DVS techniques under software and compiler control. In this paper we introduce a novel intra-task DVS technique under compiler control using program checkpoints. Checkpoints are generated at compile time and indicate places in the code where the processor speed and voltage should be re-calculated. Checkpoints also carry user-defined time constraints. Our technique handles multiple intra-task performance deadlines and modulates power consumption according to a run-time power budget. We experimented with two heuristics for adjusting the clock frequency and voltage. For the particular benchmark studied, one heuristic yielded 63% more energy savings than the other. With the best of the heuristics we designed, our technique resulted in 82% energy savings over the execution of the program without employing DVS.
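One simple checkpoint heuristic, recomputing the frequency from the remaining worst-case cycles and the time left until the checkpoint's deadline, can be sketched as below. This is our own illustrative version (the abstract does not specify the two heuristics); voltage would be scaled along with frequency by the platform.

```python
def checkpoint_speed(cycles_left, deadline, now, f_min, f_max):
    """At a compiler-inserted checkpoint, pick the lowest clock frequency
    that still finishes the remaining worst-case `cycles_left` by
    `deadline`, clamped to the processor's [f_min, f_max] range."""
    slack = deadline - now
    if slack <= 0:
        return f_max                     # deadline missed or imminent: run flat out
    needed = cycles_left / slack         # cycles per time unit required
    return min(max(needed, f_min), f_max)
```

Because the recalculation happens at checkpoints rather than task boundaries, a program that finishes a phase early immediately slows down for the next phase, which is where intra-task DVS recovers energy that inter-task scheduling cannot.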
This paper presents a new formulation and an efficient solution of the power and ground mesh sizing problem. We use the key observations that (1) the drops in power and ground node potentials are due not only to currents drawn by the computing blocks, but also to those drawn by the clock buffers, and (2) changes of circuit component delays are linearly proportional to the power/ground IR-drops. This leads to a linear quantification of the timing relations between the clocking and computing components in terms of the power/ground IR-drops. Our method removes all IR-drop related timing violations that occur in about 2% of paths when grids are sized using the existing methods that satisfy the maximum IR-drop constraints. In addition, we achieve supply mesh area improvements of the order of 30% while simultaneously reducing the power dissipated in the circuits by about 6.6% compared to traditional grid sizing methods.
Production test costs for today's RF circuits are rapidly escalating. Two factors are responsible for this cost escalation: (a) the high cost of RF ATEs and (b) long test times required by elaborate performance tests. In this paper, we propose a framework for low-cost signature test of RF circuits using modulation of a baseband test signal and subsequent demodulation of the DUT response. The demodulated response of the DUT is used as a "signature" from which all the performance specifications are predicted. The applied test signal is optimized in such a way that the error between the measured DUT performances and the predicted DUT performances is minimized. The proposed low-cost solution can be easily built into a load board that can be interfaced to an inexpensive tester.
In this paper, we present an innovative methodology to estimate and improve the quality of analog and mixed-signal circuit testing. We first detect and reduce the redundancy in the electrical test measurements (e-tests), then we identify the e-test acceptability regions by considering performance specifications as well as process parameter distributions. Finally, we provide an effective metric for the accurate assessment of the parametric test coverage of embedded analog IP. Experimental results confirm the validity of the proposed methodology and its broad applicability to analog, mixed-signal and RF applications for different process technologies.
For the generation of defect-oriented tests, a system is developed that includes the synthesis of self-test structures. With the objective of generating a highly efficient analogue test, the fault simulation methods are greatly enhanced by: (1) a new testability measure, and (2) the possibility to distinguish between not-to-detect and hard-to-detect faults with respect to the tolerances of the respective measurement system. By presenting a new design flow and using fault simulation in a very early design stage, a tool suite is developed that makes it possible to control the defect-robust layout and to eliminate those faults that limit the efficiency of a measurement system. This allows for economic self-test applications. It is demonstrated that the system finds the most efficient and least expensive test for a given fault set. With the presented results, it is possible to include the defect-oriented approach, from fault simulation to the automatic generation of layout rules and test synthesis, in an industrial design flow.
Some types of faults in analogue and mixed-signal circuits are very difficult to detect using either voltage- or current-based test methods. However, it is possible to detect these faults if we add to the conventional dynamic power supply current (IDDT) test methods an analysis of the changes in the slope of this dynamic power supply current. In this work, we present a Built-In Current Sensor (BICS) which is able to process the highest-frequency components in the dynamic power supply current of the circuit under test (CUT). The BICS adds to the resistive sensor an inductance, made from a gyrator and a capacitor, to carry out the current-to-voltage conversion. Moreover, the proposed test method improves the fault coverage in both continuous-time and switched-current circuits.
This paper gives an overview of a virtual electronic component (IP, Intellectual Property) exchange infrastructure whose main components are an XML "well-structured IP e-catalog Builder™" and an "XML IP Profiler™". While the first module is an e-publishing and exchange management module, the second has the role of extracting the IP files from the design directories and triggering their transfer to the user site, possibly via an IP distribution server under catalog control. Direct design-file extraction from commercial configuration systems such as CVS and ClearCase is supported; the architecture also supports, if required, a network of IP distribution servers, preventing a performance bottleneck when exchanging IPs. The two modules have been implemented in Java Servlets and as a Java client/server application, respectively.
This paper offers an Internet-based environment for enhancing problem-specific design flows with test pattern generation and fault simulation capabilities. Automatic Test Pattern Generation (ATPG) and fault simulation tools at structural and hierarchical levels, available at geographically different places and running under a virtual environment using the MOSCITO system, are presented. These tools can be used separately, or in multiple applications, for test pattern generation for digital circuits. In order to link the different tools together, and with commercial design systems, a set of translators was developed. The functionality of the integrated design and test system was verified on several benchmark circuits.
We present the concept of a distributed, web-based electronic design framework. The salient feature of our system is the extension of the client-server architecture to two tiers, with the web server serving client requests whilst acting as a client to the tool servers. In the sample application of the framework, developed in Java, any of the servers can be based on Linux, MS Windows or a Sun SPARC server. The web server that has been used to demonstrate the framework, providing on-line access to VAMS (a VHDL-AMS parser) and Avant! HSPICE, is currently available for Linux but has been developed with a truly platform-independent implementation in mind.
The structure of Internet applications and scenarios is changing rapidly today. This offers new potential for established technologies and methods to expand their area of application. New technologies encourage new methodologies for design processes and business-to-business applications. Such advancements should be extended into the domain of the electronic design automation (EDA) industry. In this paper we present an approach to using web services in the field of embedded system design.
Drastic device shrinking, power supply reduction, and the increasing operating speeds that accompany the technological evolution to very deep submicron significantly reduce noise margins and affect the reliability of very deep submicron ICs. Timing faults escaping timing closure analysis and/or manufacturing testing, as well as soft errors, are creating reliability issues in the field. In this context, single event upsets (SEUs) are becoming one of the major signal integrity problems, and atmospheric neutrons have become a major source of SEUs in modern VDSM technologies. An SEU is the consequence of a single event transient (SET) created on a sensitive node by a particle striking an integrated circuit: when an SET occurs on a memory-cell node and flips the state of the cell, it becomes a single event upset. An additional problem is that in today's technologies, soft errors concern not only memories (which has been the case so far) but also logic. An SET occurring on a node of a logic network is transformed into an SEU when a latch captures it.
Today's market conditions are driving increasingly short time-to-market requirements for semiconductor devices. Effective techniques for achieving quick and accurate debug and fault diagnosis of increasingly complex SOC devices are therefore becoming indispensable. This presentation covers new embedded-test-based IP and related software tools that provide the desired level of debug and diagnosis.
Due to the VDSM evolution and an electronic systems market starving for performance, the semiconductor industry is used to hitting big technology walls. Challenge after challenge, brand-new domains of competence pop up, followed by fast and accurate tools. Synthesis, routers, verification, DFT, embedded systems, SoC are well established as standard competencies for achieving high-quality, high-performance and high-yield chip production. In recent roadmaps (ITRS, Medea, D&T), signal integrity has been pointed out as a major challenge. More and more causes can affect signal integrity as geometries shrink. One of the growing effects is the so-called "transient errors", which are due to temporary conditions of use and environment. Cross-coupling, ground bounce and external terrestrial radiation create more and more unpredictable transient and soft errors, which affect system reliability in unacceptable ways. In addition, reliability in devices like memories becomes a critical issue: the MTBF (mean time before failure) is decreasing, and the global system FIT (Failure In Time) rate is approaching the critical borderline for the end user. Hence, for memories, and for logic blocks as well, in high-end process technologies, self-correcting intelligence embedded in the SoC is needed to enable electronic systems to react against unpredictable and insidious errors.
A heuristic is proposed for state reduction in incompletely specified finite state machines (ISFSMs). The algorithm is based on checking sequence generation and identification of sets of compatible states. We have obtained results as good as the best exact method in the literature but with significantly better run-times. In addition to finding a reduced FSM, our algorithm also generates an I/O sequence that can be used as test vectors to verify the FSM's implementation.
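The compatibility analysis underlying such heuristics can be illustrated with a small sketch: a pair of states is incompatible if their specified outputs ever disagree, or if some input drives them to an already-incompatible pair. The data layout used here (`fsm[state][input] = (next_state, output)`, with `None` for unspecified entries) is an assumption for illustration, not the paper's representation.

```python
from itertools import combinations

def compatible_pairs(fsm, inputs):
    """Return the set of compatible state pairs of an ISFSM.

    fsm[state][inp] = (next_state, output); next_state or output
    may be None when unspecified.
    """
    states = list(fsm)
    # Rule out pairs with conflicting specified outputs.
    incompatible = set()
    for s, t in combinations(states, 2):
        for inp in inputs:
            _, os = fsm[s][inp]
            _, ot = fsm[t][inp]
            if os is not None and ot is not None and os != ot:
                incompatible.add(frozenset((s, t)))
                break
    # Iterate: a pair is incompatible if some input drives it to an
    # already-incompatible pair of specified next states.
    changed = True
    while changed:
        changed = False
        for s, t in combinations(states, 2):
            pair = frozenset((s, t))
            if pair in incompatible:
                continue
            for inp in inputs:
                ns, _ = fsm[s][inp]
                nt, _ = fsm[t][inp]
                if ns is not None and nt is not None and ns != nt \
                        and frozenset((ns, nt)) in incompatible:
                    incompatible.add(pair)
                    changed = True
                    break
    return {frozenset(p) for p in combinations(states, 2)
            if frozenset(p) not in incompatible}
```

A state-reduction heuristic would then merge states along these compatible pairs, subject to closure of the chosen compatibility classes.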
Phased logic has been proposed as a technique for realizing self-timed circuitry that is delay-insensitive and requires no global clock signals. Early evaluation techniques have been applied to asynchronous circuits in the past in order to achieve throughput increases. A general method for computing early evaluation functions is presented for this design style. Experimental results are given that show the increase in throughput of various benchmark circuits. The results show that as much as a 30% speedup can be achieved in some cases.
We introduce a new dual threshold voltage technique for domino logic. Since domino logic is much more sensitive to noise than static logic, noise margins have to be taken into account when applying dual threshold voltages to domino logic. To guarantee signal integrity in domino logic, we carefully consider the effect of transistor sizing and threshold voltage selection. For an optimal design, tradeoffs need to be made among noise margin, power, and performance. Based on the characteristics of each logic gate, we propose noise- and power-constrained domino logic synthesis for high performance. ISCAS85 benchmark results show that performance can be improved by up to 18.62% with a 2% active power increase, while maintaining the noise margin.
This paper presents a novel method to automatically generate symbolic expressions for both linear and nonlinear circuit characteristics using a template-based fitting of numerical, simulated data. The aim of the method is to generate convex, interpretable expressions. The posynomiality of the generated expressions enables the use of efficient geometric programming techniques when using these expressions for circuit sizing and optimization. Attention is paid to estimating the relative 'goodness-of-fit' of the generated expressions. Experimental results illustrate the capabilities of the approach.
In this paper we introduce an approach for parameter controlled symbolic analysis of nonlinear analog circuits. Based on a state-of-the-art algorithm, it enables the removal of specific circuit parameters from a symbolic circuit description, given as a set of nonlinear differential algebraic equations (DAEs). During the removal, singularities are considered, which includes structural changes of the set of DAEs. The feasibility of our approach is shown by several circuit examples.
A new technique is presented for generating symbolic expressions for the harmonic transfer functions of linear periodically time-varying (LPTV) systems, like mixers and PLLs. The algorithm, which we call Symbolic HTM, is based on the organisation of the harmonic transfer functions into a harmonic transfer matrix. This representation allows LPTV systems to be manipulated in a way that is similar to linear time-invariant (LTI) systems, making it possible to generate symbolic expressions which relate the overall harmonic transfer functions to the characteristics of the building blocks. These expressions can be used as design equations or as parametrized models for use in simulations. The algorithm is illustrated for a downconversion mixer.
This paper presents a new, compact, canonical graph-based representation, called Taylor Expansion Diagrams (TEDs). It is based on a general non-binary decomposition principle using Taylor series expansion. It can be exploited to facilitate the verification of high-level (RTL) design descriptions. We present the theory behind TEDs, comment upon its canonicity property and demonstrate that the representation has linear space complexity. Its application to equivalence checking of high-level design descriptions is discussed.
This paper introduces an extension to the RMS scheduling technique that we call "Hot Swapping". Hot Swapping enables a system to choose between various selected implementations of one task on the fly and thus to optimize the system's cost (e.g. power savings). On-the-fly swapping between those implementations requires extra time to save and/or transform the state of a given task implementation. Even if the two steady-state schedules before and after the swap are feasible, the transient schedule with the additional swapping computation time may exceed the system's capacity. Our technique is an extension to Rate Monotonic Scheduling (RMS). While maintaining and meeting performance requirements, our technique shows an average reduction of 31% in power consumption compared to systems using a pure static scheduling approach (RMS) that cannot make use of task swapping. We have evaluated our algorithm through simulation of five real-world task sets and, in addition, on a large number of generated task sets.
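The transient-feasibility concern above can be sketched with the classical Liu-and-Layland utilization bound: a swap is safe only if the schedule is feasible before, after, and during the transient in which the swap overhead is incurred. The bound is only a sufficient test, and the one-period overhead model below is a simplifying assumption, not the paper's exact analysis.

```python
def rms_feasible(tasks):
    """Sufficient RMS schedulability test (Liu & Layland bound):
    sum(C_i/T_i) <= n * (2**(1/n) - 1), tasks = [(C, T), ...]."""
    n = len(tasks)
    u = sum(c / t for c, t in tasks)
    return u <= n * (2 ** (1 / n) - 1)

def swap_feasible(tasks, idx, new_c, overhead):
    """Check both steady states and a pessimistic transient in which
    the swapped task carries the state-transfer overhead for one period."""
    after = [(new_c, tasks[i][1]) if i == idx else t
             for i, t in enumerate(tasks)]
    transient = [(new_c + overhead, tasks[i][1]) if i == idx else t
                 for i, t in enumerate(tasks)]
    return all(map(rms_feasible, (tasks, after, transient)))
```

For example, a task set that is feasible before and after a swap can still be rejected when the one-time swapping overhead pushes the transient utilization past the bound.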
Complex systems-on-chip present one of the most challenging design problems of today. To meet this challenge, new design languages capable of modeling such heterogeneous, dynamic systems are needed. For the implementation of such a language, the use of an object-oriented C++ class library has proven to be a promising approach, since new classes dealing with design- and platform-specific problems can be added in a conceptual and seamlessly reusable way. This paper shows the development of such an extension, aimed at providing a platform-independent high-level structured storage object through hiding of the low-level implementation details. It results in a completely virtualised, user-extendible component, suitable for use in heterogeneous systems.
Current System-on-Chip (SoC) designs incorporate an increasing number of mixed-signal components. Design reuse techniques have proved successful for digital design, but these rules are difficult to transfer to mixed-signal design. A top-down methodology is missing, and the low level of abstraction used in designs makes system integration and verification a very difficult, tedious and complex task. This paper presents a contribution to mixed-signal design reuse in which a design methodology is proposed based on modular and parametric behavioural components. They support a design process where non-ideal effects can be incorporated in an incremental way, allowing easy architectural selection and accurate simulations. A working example is used throughout the paper to highlight and validate the applicability of the methodology.
The temperature dependence of the IC(VBE) relationship can be characterised by two parameters: EG and XTI. The classical method to extract these parameters consists of "best fitting" measured VBE(T) values, using a least-squares algorithm at constant collector current. This method requires an accurate measurement of the VBE voltage and an accurate value of the operating temperature. In this paper, we propose a configurable test structure dedicated to the extraction of the temperature dependence of the IC(VBE) characteristic for BJTs designed in bipolar or BiCMOS processes. This allows a direct measurement of die temperature and consequently an accurate measurement of VBE(T). First, the classical extraction method is explained. Then, the implementation techniques of the new method are discussed and the improvement of the design is presented.
The oscillator problem consists of determining good initial values for the node voltages and the frequency of oscillation, and of avoiding the DC solution. Standard approaches for limit cycle calculations of autonomous circuits exhibit poor convergence behavior in practice. By introducing an additional periodic probe voltage source to the oscillator circuit, the system of autonomous differential algebraic equations (DAEs) can be reformulated as a system of non-autonomous DAEs with the constraint that the current through the source has to be zero for the limit cycle. Using a two-stage approach leads to a greater range of convergence than the standard approach, but the success of the algorithm is heavily dependent on the initial amplitude of the probe source and the frequency of oscillation. This paper presents a fast and reliable optimization-based initialization procedure which overcomes the initialization problem of the two-stage algorithm.
This paper introduces several new component clustering techniques for the optimization of asynchronous systems. In particular, novel "Burst-Mode aware" restrictions are imposed to limit the cluster sizes and to ensure synthesizability. A new control specification language, CH, is also introduced which facilitates the manipulation and optimization of handshake control components. The new method has been fully integrated into a comprehensive asynchronous synthesis package, Balsa. Experimental results on several substantial design examples, including a 32-bit microprocessor core, indicate significant performance improvements for the optimized circuits.
The paper presents a new method for checking Unique and Complete State Coding, the crucial conditions in the synthesis of asynchronous control circuits from Signal Transition Graphs (STGs). The method detects state coding conflicts in an STG using its partial order semantics (unfolding prefix) and an integer programming technique. This leads to huge memory savings compared to methods based on reachability graphs, and also to significant speedups in many cases. In addition, the method produces execution paths leading to an encoding conflict. Finally, the approach is extended to checking the normalcy property of STGs, which is a necessary condition for their implementability using gates whose characteristic functions are monotonic.
This paper addresses verifying the timing of circuits containing level-sensitive latches in the presence of crosstalk. We show that three consecutive periodic occurrences of the aggressor's input switching window must be compared with the victim's input switching window. We propose a new phase shift operator to allow aligning the aggressor's three relevant switching windows with the victim's input signals. We solve the problem iteratively in polynomial time, and show an upper bound on the number of iterations equal to the number of capacitors in the circuit. Our experiments demonstrate that eliminating false coupling results in finding a smaller clock period at which a circuit will run.
RF front-end architectures of today's wireless applications need to meet tough requirements on nonlinear distortion to minimize unwanted effects such as crosstalk. An analysis of the nonlinear behavior of analog communication circuits or architectures is not straightforward. This paper presents a modified Volterra series approach to the simulation of nonlinear systems described at the architectural level. The total computed response is decomposed into its nonlinear contributions, and the main nonlinearities can be identified. This yields better insight into the system's nonlinear behavior and allows simplifications. The simplified system can then be simulated more efficiently. The implementation is based only on vector calculation to minimize computation time, and has been applied to a complete 5 GHz WLAN receiver front-end.
The systematic design of a high-speed, high-accuracy Nyquist A/D converter is proposed. The presented design methodology covers the complete flow and is supported by software tools. A generic behavioral model is used to explore the A/D converter's specifications during high-level design and exploration. The inputs are the specifications of the A/D converter and the technology process. The result is a generated layout and the corresponding extracted behavioral model. The approach has been applied to a real-life test case, where a Nyquist-rate 8-bit 200MS/s 4-2 interpolating A/D converter was developed for a WLAN application.
A bio-inspired model for an analog parallel array processor (APAP), based on studies of the vertebrate retina, permits the realization of complex spatio-temporal dynamics in VLSI. This model mimics the way in which images are processed in the visual pathway, which makes it a feasible alternative for the implementation of early vision tasks in standard technologies. A prototype chip has been designed in CMOS. The design challenges, trade-offs and building blocks of such a high-complexity system (transistors, most of them operating in analog mode) are presented in this paper.
As the complexity of VLSI circuits increases due to the exponential rise in transistor count per chip, testing cost is becoming an important factor in the overall integrated circuit (IC) manufacturing cost. This paper addresses the issue of decreasing test cost by lowering the number of test data bits and the number of clock cycles required to test a chip. We propose a new incremental algorithm for generating tests for Illinois Scan Architecture (ILS) based designs and provide an analysis of test data and test time reduction. This algorithm is very efficient in generating tests for a number of ILS designs in order to find the optimal configuration.
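A core constraint in ILS broadcast mode is that all scan subchains receive the same serial data, so a test cube can be applied in broadcast mode only if its care bits agree at every scan depth across subchains. A minimal sketch of that check follows; the flat `'0'/'1'/'x'` cube encoding with concatenated equal-length subchains is an illustrative assumption.

```python
def broadcastable(cube, n_chains):
    """Check whether a test cube (a '0'/'1'/'x' string with the
    subchains concatenated) can be applied in ILS broadcast mode:
    care bits at the same scan depth must agree across subchains."""
    seg = len(cube) // n_chains
    merged = ['x'] * seg          # per-depth merged care bits
    for c in range(n_chains):
        for i in range(seg):
            b = cube[c * seg + i]
            if b == 'x':
                continue
            if merged[i] == 'x':
                merged[i] = b
            elif merged[i] != b:
                return False      # conflicting care bits at depth i
    return True
```

Cubes that fail this check must fall back to the slower serial scan mode, which is where the test time trade-off in ILS configurations comes from.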
A gate level, automated fault diagnosis scheme is proposed for scan-based BIST designs. The proposed scheme utilizes both fault capturing scan chain information and failing test vector information and enables location identification of single stuck-at faults to a neighborhood of a few gates through set operations on small pass/fail dictionaries. The proposed scheme is applicable to multiple stuck-at faults and bridging faults as well. The practical applicability of the suggested ideas is confirmed through numerous experimental runs on all three fault models.
We present a new scan-BIST approach for determining failing vectors for fault diagnosis. This approach is based on the application of overlapping intervals of test vectors to the circuit under test. Two MISRs are used in an interleaved fashion to generate intermediate signatures, thereby obviating the need for multiple test sessions. The knowledge of failing and non-failing intervals is used to obtain a set S of candidate failing vectors that includes all the actual (true) failing vectors. We present analytical results to determine an appropriate interval length and the degree of overlap, an upper bound on the size of S, and a lower bound on the number of true failing vectors; the latter depends only on the knowledge of failing and non-failing intervals. Finally, we describe two pruning procedures that allow us to reduce the size of S, while retaining most true failing vectors in S. We present experimental results for the ISCAS 89 benchmark circuits to demonstrate the effectiveness of the proposed scan-BIST diagnosis approach.
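The interval bookkeeping described above can be sketched as a set computation: every vector covered by some failing interval is a suspect, and every vector covered by a non-failing interval is exonerated, so overlap shrinks the candidate set S. This sketch assumes simple fixed-stride intervals and is not the paper's exact analytical treatment.

```python
def candidate_failing(n_vectors, length, step, fail_flags):
    """fail_flags[i] is True if interval i, covering vectors
    [i*step, i*step+length), produced a faulty signature.
    Returns the sorted candidate set of failing vectors."""
    suspects, cleared = set(), set()
    for i, failed in enumerate(fail_flags):
        vecs = range(i * step, min(i * step + length, n_vectors))
        (suspects if failed else cleared).update(vecs)
    return sorted(suspects - cleared)
```

With overlap (step < length), a single true failing vector makes every interval containing it fail, while the neighbouring passing intervals clear the rest of those intervals' vectors.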
In this paper we propose a new compression algorithm geared to reduce the time needed to test scan-based designs. Our scheme compresses the test vector set by encoding the bits that need to be flipped in the current test data slice in order to obtain the subsequent test data slice. Exploitation of the overlap in the encoded data by effective traversal search algorithms results in drastic overall compression. The technique we propose can be utilized not only as a stand-alone technique but also on test data that has already been compressed, extracting even further compression. The performance of the algorithm is mathematically analyzed and its merits experimentally confirmed on the larger examples of the ISCAS'89 benchmark circuits.
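The flip-bit idea can be sketched as a simple difference encoder between consecutive test data slices; the actual scheme additionally exploits overlap in the encoded data with traversal search, which this sketch omits.

```python
def encode(slices):
    """Difference-encode equal-width binary test slices: store the
    first slice raw and, for each subsequent slice, only the bit
    positions that flip relative to its predecessor."""
    out = [("raw", slices[0])]
    for prev, cur in zip(slices, slices[1:]):
        flips = [i for i, (a, b) in enumerate(zip(prev, cur)) if a != b]
        out.append(("flip", flips))
    return out

def decode(encoded):
    """Reconstruct the slice sequence by replaying the flips."""
    slices = [encoded[0][1]]
    for _, flips in encoded[1:]:
        cur = list(slices[-1])
        for i in flips:
            cur[i] = '1' if cur[i] == '0' else '0'
        slices.append(''.join(cur))
    return slices
```

When consecutive slices are similar, as is typical for structurally related scan vectors, the flip lists are short and the representation is much smaller than the raw slices.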
Third-generation wireless communication systems comprise advanced signal processing algorithms that increase the computational requirements more than ten-fold over 2G systems. Numerous existing and emerging standards require flexible implementations ("software radio"). Thus, efficient implementations of performance-critical parts such as Turbo decoding on programmable architectures are of great interest. Besides high-performance DSPs, application-customized RISC cores offer the required performance while still maintaining the desired flexibility. This paper presents for the first time Turbo decoder implementations on customized RISC cores and compares the results with implementations on state-of-the-art VLIW DSPs. The results of our studies show that the Log-MAP performance is about 50% higher than on an ST120, a current VLIW architecture.
For many embedded applications, program code size is a critical design factor. One promising approach for reducing code size is to employ a "dual instruction set", where processor architectures support a normal (usually 32 bit) Instruction Set, and a narrow, space-efficient (usually 16 bit) Instruction Set with a limited set of op-codes and access to a limited set of registers. This feature, however, requires compilers that can reduce code size by compiling for both Instruction Sets. Existing compiler techniques operate at the function-level granularity and are unable to make the trade-off between the increased register pressure (resulting in more spills) and decreased code size. We present a profitability based compiler heuristic that operates at the instruction-level granularity and is able to effectively take advantage of both Instruction Sets. We also demonstrate improved code size reduction, for the MIPS 32/16 bit ISA, using our technique. Our approach more than doubles the code size reduction achieved by existing compilers.
The number of embedded systems is increasing, and a remarkable percentage are designed as mobile applications. For the latter, energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a large amount of energy. The use of additional, less power-hungry memories such as caches or scratchpads is thus common. Caches incorporate hardware control logic for moving data in and out automatically, but this logic requires chip area and energy. A scratchpad memory is much more energy efficient, but its contents must be controlled by software. In this paper, an algorithm integrated into a compiler is presented which analyses the application and selects the program and data parts to be placed in the scratchpad. Comparisons against a cache solution show remarkable advantages of between 12% and 43% in energy consumption for designs of the same memory size.
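Choosing which program and data objects to place in the scratchpad can be viewed as a 0/1 knapsack problem: each candidate has a size and an estimated energy saving, and the scratchpad capacity is the budget. The knapsack formulation below is a common simplification of this allocation step, not necessarily the paper's exact algorithm.

```python
def allocate(objects, capacity):
    """0/1 knapsack over (size, energy_saving) candidates.
    Returns (best_saving, chosen indices); DP over integer sizes."""
    best = [(0, [])] * (capacity + 1)
    for idx, (size, gain) in enumerate(objects):
        # Walk capacities downwards so each object is used at most once.
        for cap in range(capacity, size - 1, -1):
            cand_gain = best[cap - size][0] + gain
            if cand_gain > best[cap][0]:
                best[cap] = (cand_gain, best[cap - size][1] + [idx])
    return best[capacity]
```

In practice the per-object savings would come from profiling access counts and per-access energy of scratchpad versus main memory; here they are just given numbers.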
This paper is meant to be a short introduction to a new paradigm for systems on chip (SoC) design. We refer the interested reader to an extended overview of this problem [1] and to some recent results in this area in industry [21, 10] and academia [4, 5]. The premises are that a component-based design methodology will prevail in the future, to support component re-use in a plug-and-play fashion. At the same time, SoCs will have to provide a functionally-correct, reliable operation of the interacting components. The physical interconnections on chip will be a limiting factor for performance and energy consumption. The international technology roadmap for semiconductors (ITRS) [23] projects that we will be designing multi-billion transistor chips by the end of this decade, with feature sizes around 50nm and clock frequencies around 10GHz. Delays on wires will dominate: global wires spanning a significant fraction of the chip size will carry signals whose propagation delay will exceed the clock period. Whereas relatively large delays can be managed with wire pipelining techniques, timing uncertainty will be more problematic for designers. Moreover, synchronization of chips with a single clock source and negligible skew will be extremely hard or impossible. The most likely synchronization paradigm for future chips is globally-asynchronous locally synchronous (GALS), with many different clocks. Global wires will span multiple clock domains, and synchronization failures in communicating between different clock domains will be rare but unavoidable events [7].
We consider the implication of deep sub-micron VLSI technology on the design of communication frameworks for parallel DSP systems-on-chip. We assert that distributed data transfer and control mechanisms are necessary to manage many independent processing subsystems and software tasks. An example of a parallel DSP architecture is given and used to demonstrate these mechanisms at work. We show the similarity of these mechanisms to those used in large-scale computing networks.
We advocate a network on silicon (NOS) as a hardware architecture to implement communication between IP cores in future technologies, and as a software model in the form of a protocol stack to structure the programming of NOSs. We claim guaranteed services are essential. In the ÆTHEREAL NOS they pervade the NOS as a requirement for hardware design, and as a foundation for software programming.
Efficient exploitation of temporal locality in the memory accesses to array signals can have a very large impact on the power consumption of embedded data-dominated applications. The effective use of an optimized custom memory hierarchy, or of a customized software-controlled mapping onto a predefined hierarchy, is crucial for this. Only recently have effective systematic techniques for this specific design step begun to appear, and they are still limited in their exploration scope. In this paper we introduce an extended, formalized methodology based on an analytical model of the data reuse of a signal. The cost parameters derived from this model define the search space to explore and allow us to exploit the maximum data reuse possible. The result is an automated design technique to find power-efficient memory hierarchies and generate the corresponding optimized code.
This paper presents a novel Energy-Aware Compilation (EAC) framework that can estimate and optimize energy consumption of a given code taking as input the architectural and technological parameters, energy models, and energy/performance constraints. The framework has been validated using a cycle-accurate architectural-level energy simulator and found to be within 6% error margin while providing significant estimation speedup. The estimation speed of EAC is the key to the number of optimization alternatives that can be explored within a reasonable compilation time.
In embedded processors, instruction fetch and decode can consume more than 40% of processor power. An instruction filter cache can be placed between the CPU core and the instruction cache to service the instruction stream; power savings in instruction fetch result from accesses to a small cache. In this paper, we introduce a decode filter cache to provide a decoded instruction stream. On a hit in the decode filter cache, fetching from the instruction cache and the subsequent decoding are eliminated, which results in power savings in both instruction fetch and instruction decode. We propose to classify instructions as cacheable or uncacheable depending on their decoded width. A sectored cache design is then used so that cacheable and uncacheable instructions can coexist in a decode filter cache sector. Finally, a prediction mechanism is presented to reduce the decode filter cache miss penalty. Experimental results show an average 34% processor power reduction with less than 1% performance degradation.
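The benefit of a filter cache comes from servicing most fetches out of a tiny structure, which a toy direct-mapped model makes concrete. The parameters and encoding below (word addresses, 4 words per line, a handful of lines) are illustrative assumptions, not the paper's configuration.

```python
def filter_cache_hits(trace, n_lines, words_per_line=4):
    """Count hits of a direct-mapped filter cache over an
    instruction address trace (word addresses)."""
    tags = [None] * n_lines
    hits = 0
    for addr in trace:
        line = addr // words_per_line   # line address
        idx = line % n_lines            # direct-mapped index
        if tags[idx] == line:
            hits += 1
        else:
            tags[idx] = line            # fill on miss
    return hits
```

A tight loop whose working set fits in the filter cache hits on nearly every fetch, which is exactly the case where fetch and decode power savings accumulate.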
In this paper, we suggest hardware-assisted data compression as a tool for reducing the energy consumption of core-based embedded systems. We propose a novel and efficient architecture for on-the-fly data compression and decompression whose field of operation is the cache-to-memory path. Uncompressed cache lines are compressed before they are written back to main memory, and decompressed when cache refills take place. We explore two classes of compression methods, profile-driven and differential, since they are characterized by compact HW implementations, and we compare their performance to that provided by some state-of-the-art compression methods (e.g., we have considered a few variants of the Lempel-Ziv encoder). We present experimental results on memory traffic and energy consumption in the cache-to-memory path of a core-based system running standard benchmark programs. The achieved average energy savings range from 4.2% to 35.2%, depending on the selected compression algorithm.
Noise estimation and avoidance are becoming critical, "must have" capabilities in today's high performance IC design. An accurate yet efficient crosstalk noise model, which contains as many driver/interconnect parameters as possible, is necessary for any sensitivity-based noise avoidance approach. In this paper, we present a complete analytical crosstalk noise model which incorporates all physical properties, including victim and aggressor drivers, distributed RC characteristics of interconnects, and coupling locations in both victim and aggressor lines. We present closed-form analytical expressions for peak noise and noise width as well as sensitivities to all model parameters. We then use these model parameter sensitivities to analyze and evaluate various noise avoidance techniques such as driver sizing, wire sizing, wire spacing and layer assignment. Both our model and noise avoidance evaluations are verified using realistic circuits in 0.13µm technology. We also present the effectiveness of the discussed noise avoidance techniques on a high performance microprocessor core.
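The qualitative behaviour such a model must capture can be checked against a brute-force lumped simulation: an aggressor step couples charge through the coupling capacitance into the victim net, and the peak victim noise grows with that capacitance. The single lumped node per net and the forward-Euler integration below are deliberate simplifications of the paper's distributed-RC treatment.

```python
def victim_peak_noise(ra, rv, ca, cv, cc, vdd, dt=1e-12, t_end=2e-9):
    """Forward-Euler simulation of a lumped aggressor/victim pair:
    the aggressor driver (resistance ra) sees a vdd step, the victim
    driver (rv) holds 0; ca/cv are ground caps, cc the coupling cap.
    Returns the peak victim noise voltage."""
    va = vv = 0.0
    peak = 0.0
    # Constant 2x2 capacitance-matrix determinant for the node pair.
    det = (ca + cc) * (cv + cc) - cc * cc
    for _ in range(int(t_end / dt)):
        ia = (vdd - va) / ra           # current into aggressor node
        iv = -vv / rv                  # current into victim node
        dva = ((cv + cc) * ia + cc * iv) / det
        dvv = (cc * ia + (ca + cc) * iv) / det
        va += dva * dt
        vv += dvv * dt
        peak = max(peak, vv)
    return peak
```

Sweeping cc in such a model reproduces the expected sensitivity direction: wider spacing (smaller cc) reduces peak noise, while a stronger victim driver (smaller rv) bleeds the coupled charge away faster.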
Electromigration is caused by high current density stress in metallization patterns and is a major source of breakdown in electronic devices. It is therefore an important reliability issue to verify current densities within all stressed metallization patterns. In this paper we propose a new methodology for hierarchical verification of current densities in arbitrarily shaped analog circuit layouts, including a quasi-3D model to verify irregularities such as vias. Our approach incorporates thermal simulation data to account for the temperature dependency of electromigration. The described methodology, which can be integrated into any IC design flow as a design rule check (DRC), has been successfully tested and verified in commercial design flows.
The antenna problem is a phenomenon of plasma-induced gate oxide degradation. It directly affects the manufacturability of VLSI circuits, especially in deep-submicron technologies using high-density plasma. Diode insertion is a very effective way to solve this problem. Ideally, diodes are inserted directly under the wires that violate antenna rules. But in today's high-density VLSI layouts, there is simply not enough room for "under-the-wire" diode insertion for all wires. Thus it is necessary to insert many diodes at legal "off-wire" locations and extend the antenna-rule-violating wires to connect to their respective diodes. Previously, only simple heuristic algorithms were available for this diode insertion and routing problem. In this paper, we show that the diode insertion and routing problem for an arbitrary given number of routing layers can be optimally solved in polynomial time. Our algorithm is guaranteed to find a feasible diode insertion and routing solution whenever one exists. Moreover, we can guarantee finding a feasible solution that minimizes a cost function of the form α·L + β·N, where L is the total length of the extension wires and N is the total number of vias on the extension wires. Experimental results show that our algorithm is very efficient.
This paper proposes a comprehensive model for test planning in a core-based environment. The main contribution of this work is the use of several types of TAMs and the consideration of different optimization factors (area, pins and test time) during the global TAM and test schedule definition. This expansion of concerns makes possible an efficient yet fine-grained search in the huge design space of a reuse-based environment. Experimental results clearly show the variety of trade-offs that can be explored using the proposed model, and its effectiveness on optimizing the system test design.
System-on-chip (SOC) design methodology is becoming the trend in the IC industry. Integrating reusable cores from multiple sources is essential in SOC design, and different design-for-testability methodologies are usually required for testing different cores. Another issue is test integration. The purpose of this paper is to present a hierarchical test scheme for SOC with heterogeneous core test and test access methods. A hierarchical test manager (HTM) is proposed to generate the control signals for these cores, taking into account the IEEE P1500 Standard proposal. A standard memory BIST interface is also presented, linking the HTM and the memory BIST circuit. It can control the BIST circuit with the serial or parallel test access mechanism. The hierarchical test control scheme has low area and pin overhead, and high flexibility. An industrial case using this scheme has been designed, showing an area overhead of only about 0.63%.
Core test wrappers and test access mechanisms (TAMs) are important components of a system-on-chip (SOC) test architecture. Wrapper/TAM co-optimization is necessary to minimize the SOC testing time. Most prior research in wrapper/TAM design has addressed wrapper design and TAM optimization as separate problems, thereby leading to results that are sub-optimal. We present a fast heuristic technique for wrapper/TAM co-optimization, and demonstrate its scalability for several industrial SOCs. This extends recent work on exact methods for wrapper/TAM co-optimization based on integer linear programming and exhaustive enumeration. We show that the SOC testing times obtained using the new heuristic algorithm are comparable to the testing times obtained using exact methods. Moreover, more than two orders of magnitude reduction can be obtained in the CPU time compared to exact methods. Furthermore, we are now able to design efficient test access architectures with a larger number of TAMs.
In this paper, we analyze the use of UML as a starting point to go from design issues to end-of-production testing of complex embedded systems. The first point is the analysis of the big gap between system signals and UML messages; the paper then focuses on the additional information necessary to fill that gap; different test types are considered, focusing on application software testing; finally, actuation and observation are both analyzed inside the test environment, with particular attention to the black-box requirement for behavioral testing. The emphasis of the work is on the resulting test engine definition, verified on a complex case study of a top-of-the-line automotive application; this application is a modern car console, grouping many controls of car-related devices, such as phone, navigation, radio and CD. The testing of the GSM capabilities of this device is studied in particular.
Complex embedded systems consist of hardware and software components from different domains, such as control and signal processing, many of them supplied by different IP vendors. The embedded system designer faces the challenge to integrate, optimize and verify the resulting heterogeneous systems. While formal verification is available for some subproblems, the analysis of the whole system is currently limited to simulation or emulation. In this paper, we tackle the analysis of global resource sharing, scheduling, and buffer sizing in heterogeneous embedded systems. For many practically used preemptive and non-preemptive hardware and software scheduling algorithms of processors and busses, semi-formal analysis techniques are known. However, they cannot be used in system level analysis due to incompatibilities of their underlying event models. This paper presents a technique to couple the analysis of local scheduling strategies via an event interface model. We derive transformation rules between the most important event models and provide proofs where necessary. We use expressive examples to illustrate their application.
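A concrete instance of such an event interface is the bound pair commonly used for periodic-with-jitter streams in compositional analysis: the number of events in any window of length Δ is bracketed by ⌈(Δ+J)/P⌉ and ⌊(Δ−J)/P⌋. These standard formulas are shown here as an illustration of an event model interface, not as the paper's full set of transformation rules and proofs.

```python
import math

def eta_plus(delta, period, jitter):
    """Upper bound on the number of events of a periodic-with-jitter
    stream observed in any time window of length delta."""
    return math.ceil((delta + jitter) / period)

def eta_minus(delta, period, jitter):
    """Corresponding lower bound (never negative)."""
    return max(0, math.floor((delta - jitter) / period))
```

A scheduler analysis that expects a sporadic input can then be fed the upper bound, which is how incompatible local event models are coupled through a common interface.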
In this paper, we present an efficient two-step iterative synthesis approach for distributed embedded systems containing dynamic voltage scalable processing elements (DVS-PEs), based on genetic algorithms. The approach partitions, schedules, and voltage scales multi-rate specifications given as task graphs with multiple deadlines. A distinguishing feature of the proposed synthesis is the utilisation of a generalised DVS method. In contrast to previous techniques, which 'simply' exploit available slack time, this generalised technique additionally considers the PE power profile during a refined voltage selection to further increase the energy savings. Extensive experiments are conducted to demonstrate the efficiency of the proposed approach. We report up to 43.2% higher energy reductions compared to previous DVS scheduling approaches based on constructive techniques and total energy savings of up to 82.9% for mapping and scheduling optimised DVS systems.
By using a macro/micro state model, we show how assumptions on the resolution of logical and physical timing of computation in computer systems have resulted in design methodologies such as component-based decomposition, where they are completely coupled, and function/architecture separation, where they are completely independent. We discuss why these are inappropriate for emerging programmable, concurrent system design. By contrast, schedulers layered on hardware in concurrent systems already couple logical correctness with physical performance when they make effective resource sharing decisions. This paper lays a foundation for understanding how layered logical and physical sequencing will impact the design process, and provides insight into the problems that must be solved in such a design environment. Our layered approach is that of a virtual machine. We discuss our MESH research project in this context.
The minimization of cost, power consumption and time-to-market of DSP applications requires the development of methodologies for the automatic implementation of floating-point algorithms in fixed-point architectures. In this paper, a new methodology for evaluating the quality of an implementation through the automatic determination of the Signal-to-Quantization-Noise Ratio (SQNR) is presented. The theoretical concepts and the different phases of the methodology are explained. Then, the ability of our approach to compute the SQNR efficiently, and its beneficial contribution to the process of data word-length minimization, are shown through examples.
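As a toy illustration of the metric involved (not the paper's analytical method, which aims to avoid exhaustive simulation), the SQNR of a fixed-point rounding can be estimated by direct simulation. The function names, the test signal, and the fractional word lengths below are all illustrative assumptions:

```python
import math

def quantize(x, frac_bits):
    """Round x to a fixed-point grid with `frac_bits` fractional bits."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

def sqnr_db(signal, frac_bits):
    """Signal-to-Quantization-Noise Ratio (dB) measured by simulation."""
    sig_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum((x - quantize(x, frac_bits)) ** 2 for x in signal) / len(signal)
    return 10 * math.log10(sig_power / noise_power)

# Example: one period of a sine wave quantized to 8 fractional bits;
# each extra bit buys roughly 6 dB of SQNR.
samples = [math.sin(2 * math.pi * n / 256) for n in range(256)]
print(round(sqnr_db(samples, 8), 1))
```

Analytical approaches such as the one in the paper estimate this quantity from the signal-flow graph instead of running the simulation, which is what makes them usable inside a word-length optimization loop.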
In this paper, we investigate the benefits of a flexible, application-specific instruction set by adding a run-time Reconfigurable Functional Unit (RFU) to a VLIW processor. Preliminary results on the motion estimation stage in an MPEG4 video encoder are presented. With the RFU modeled at the functional level and under realistic assumptions on execution latency, technology scaling and reconfiguration penalty, we explore different RFU instructions at fine-grain (instruction-level) and coarse-grain (loop-level) granularity to speed up the application execution. The memory bandwidth bottleneck, typical of streaming applications, is alleviated through the combined adoption of custom prefetch pattern instructions and an amount of local memory. Performance evaluations indicate that up to 8x improvement with loop-level optimizations is achieved under various architectural assumptions.
A new technique is presented in this paper to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve application execution time by minimizing external memory transfers. Some amount of on-chip data storage is assumed to be available in the reconfigurable architecture. The Complete Data Scheduler therefore tries to exploit this storage optimally, saving data and result transfers between on-chip and external memories. To this end, specific algorithms for data placement and replacement have been designed. We also show that a suitable data schedule can decrease the number of transfers required to implement the dynamic reconfiguration of the system.
Microprocessors are becoming more and more inefficient for a growing range of applications. Their underlying principle, the Von Neumann paradigm [3], based on the sequential execution of algorithms, will no longer be able to cope with the highly compute-intensive applications of the multimedia world. Current approaches to these limitations are the following. The first, and most natural, way to increase computing power is obviously to decrease the cycle time thanks to new silicon technology: the clock frequencies of the newest CPUs are now approaching 2 GHz. The second approach is co-design: the general-purpose CPU delegates the most time-demanding computations to a dedicated core. The best-known example is the PC graphics card, which manages all the 2D and 3D display operations that even high-end CPUs cannot handle efficiently. Neither method is satisfactory: the first quickly runs into the limits of achievable clock frequency and power consumption, while the second requires the design of a new core for each intended algorithm. New machine paradigms based on parallel execution must therefore be considered. Thanks to their high level of flexibility, structurally programmable architectures are potentially interesting candidates to overcome the limitations of classical CPUs. Based on a parallel execution model, we present in this paper a new dynamically reconfigurable architecture dedicated to the acceleration of data-oriented applications. Principles, realizations and comparative results are presented for some classical applications, targeted at different architectures.
In this paper, we introduce the concept of (self-)reconfigurable finite state machines as a formal model to describe state machines implemented in hardware that may be reconfigured during operation. With the advent of reconfigurable logic devices such as FPGAs, this model may become important to characterize and implement (self-)reconfigurable hardware. An FSM is called (self-)reconfigurable if reconfiguration of either the output function or the transition function is initiated by the FSM itself and not by external reconfiguration events. We propose an efficient hardware realisation and give algorithmic solutions and bounds for the reconfiguration overhead of migrating a given FSM specification into a new target FSM.
The relative tolerances for interconnect and device parameter variations have not scaled with feature sizes, which has brought about significant performance variability. As we scale toward 10nm technologies, this problem will only worsen. New circuit families and design methodologies will emerge to facilitate the construction of reliable systems from unreliable nanometer-scale components. Such methodologies require new models of performance which accurately capture the manufacturing realities. Recently, one step toward this goal was made via a new variational reduced-order interconnect model that efficiently captures large-scale fluctuations in global parameter values. Using variational calculus, the linear interconnect systems are represented by analytical models that include the global variational parameters explicitly. In this work we present a framework which extends the previous work to a linear-centric simulation methodology with accurate nonlinear device models and their fluctuations. The framework is applied to generate path delay distributions under nonlinear and linear parameter fluctuations.
The key performance of many analog circuits is directly related to accurate capacitor ratios. It is well known that capacitor ratio precision is greatly enhanced by paralleling identical-size unit capacitors in a common-centroid geometry. In this paper, a general algorithm for fitting arbitrary capacitor ratios in a common-centroid unit-capacitor array is presented. The algorithm gives special care to both non-integer and identical ratios in order to minimize mismatch. A method for capacitance mismatch estimation based upon an oxide gradient model is also introduced. It enables the comparison of different unit-capacitor array assignments. Layout issues are discussed with emphasis on a generic routing model. Both the algorithm and the mismatch estimation method are implemented in an automatic capacitor array generation tool.
In this paper, a method for nominal design of analog integrated circuits is presented that includes process variations and operating ranges by worst-case parameter sets. These sets are calculated adaptively during the sizing process based on sensitivity analyses. The method leads to robust designs with high parametric yield, while being much more efficient than design centering methods.
Designing complex analog systems requires different abstraction levels to reduce the overall complexity. The required level of abstraction depends on the accuracy and the purpose of the model. High-frequency amplifier models can vary from simple transfer functions for efficient bit-error-rate analysis up to detailed transistor-level descriptions for accurate load-pull prediction. This paper introduces a nonlinear black-box model for high-frequency amplifiers. It extends the linear S-parameter representation to enable both efficient system-level simulations and load-pull prediction. Both are demonstrated on the measurements of a high-frequency amplifier excited using WLAN-OFDM modulation.
Software self-testing for embedded processor cores based on their instruction set is a topic of increasing interest, since it provides an excellent test resource partitioning technique for sharing the testing task of complex Systems-on-Chip (SoC) between slow, inexpensive testers and embedded code stored in memory cores of the SoC. We introduce an efficient methodology for processor core self-testing which requires knowledge of the instruction set and a Register Transfer (RT) level description. Compared with functional testing methodologies proposed in the past, our methodology is more efficient in terms of fault coverage, test code size and test application time. Compared with recent software-based structural testing methodologies for processor cores, our methodology is superior in terms of test development effort and has significantly smaller code size and memory requirements, while virtually the same fault coverage is achieved with an order of magnitude smaller test application time.
We present a new test resource partitioning (TRP) technique for reduced pin-count testing of system-on-a-chip (SOC). The proposed technique is based on test data compression and on-chip decompression. It makes effective use of frequency-directed run-length codes, internal scan chains, and boundary scan chains. The compression/ decompression scheme decreases test data volume and the amount of data that has to be transported from the tester to the SOC. We show via analysis as well as through experiments that the proposed TRP scheme reduces testing time and allows the use of a slower tester with fewer I/O channels. Finally, we show that an uncompacted test set applied to an embedded core after on-chip decompression is likely to increase defect coverage.
This paper proposes a new test data compression/decompression method for systems-on-a-chip. The method is based on analyzing the factors that influence the test parameters: compression ratio, area overhead and test application time. To improve the compression ratio, the new method is based on Variable-length Input Huffman Coding (VIHC), which fully exploits the type and length of the patterns, as well as a novel mapping and reordering algorithm proposed in a pre-processing step. The new VIHC algorithm is combined with a novel parallel on-chip decoder that simultaneously leads to low test application time and low area overhead. It is shown that, unlike three previous approaches [2, 3, 10] which reduce some test parameters at the expense of the others, the proposed method is capable of improving all three parameters simultaneously. For example, the proposed method leads to similar or better compression ratio when compared to frequency-directed run-length coding [2], however with lower area overhead and test application time. Similarly, there is comparable or lower area overhead and test application time with respect to Golomb coding [3], with improvements in compression ratio. Finally, there is similar or improved test application time when compared to selective coding [10], with reductions in compression ratio and significantly lower area overhead. An experimental comparison on benchmark circuits validates the proposed method.
In this work, the problem of open faults affecting the interconnections of self-checking (SC) circuits composed of a data-path and control is analyzed. It is shown that, in case opens affect control signals, problems may arise even if both control and data-path signals are concurrently checked: wrong codewords may be generated at the outputs of multiplexers and registers. To address this problem, new registers and multiplexers are proposed which allow the design of data-paths that are TSC with respect to opens (and resistive opens). These components are also TSC with respect to stuck-at, transistor and gross delay faults, and present good testability with respect to resistive bridging faults.
To enable fast and accurate evaluation of HW/SW implementation choices for on-chip communication, we present a method to automatically generate timed OS simulation models. The method generates the OS simulation models with the simulation environment acting as a virtual processor. Since the generated OS simulation models use the final OS code, the presented method mitigates the OS code equivalence problem. The generated model also simulates different types of processor exceptions. This approach provides a simulation speedup of two orders of magnitude compared to using instruction set simulators for SW simulation.
Due to the increasing operating frequencies and the manner in which the corresponding integrated circuits and systems must be designed, the extraction, modeling and simulation of the magnetic couplings for final design verification can be a daunting task. In general, when modeling inductance and the associated return paths, one must consider the on-chip conductors as well as the system packaging. This can result in an RLC circuit size that is impractical for traditional simulators. In this paper we demonstrate a localized, window-based extraction and simulation methodology that employs the recently proposed susceptance (the inverse of inductance matrix) concept. We provide a qualitative explanation for the efficacy of this approach, and demonstrate how it facilitates pre-manufacturing simulations that would otherwise be intractable. A critical aspect of this simulation efficiency is owed to a susceptance-based circuit formulation that we prove to be symmetric positive definite. This property, along with the sparsity of the susceptance matrix, enables the use of some advanced sparse matrix solvers. We demonstrate this extraction and simulation methodology on some industrial examples.
In this paper we propose a new harmonic balance simulation methodology based on a linear-centric modeling approach. A linear circuit representation of the nonlinear devices and associated parasitics is used along with corresponding time and frequency domain inputs to solve for the nonlinear steady-state response via successive chord (SC) iterations. For our circuit examples this approach is shown to be up to 60x more run-time efficient than traditional Newton-Raphson (N-R) based iterative methods, while providing the same level of accuracy. This SC-based approach converges as reliably as the N-R approaches, including for circuit problems which cause alternative relaxation-based harmonic balance approaches to fail[1][2]. The efficacy of this linear-centric methodology further improves with increasing model complexity, the inclusion of interconnect parasitics and other analyses that are otherwise difficult with traditional nonlinear models.
This paper presents a simulator operating on a logical representation of an asynchronous circuit that gives energy estimates within 10% of electrical (hspice) simulation. Our simulator is the first such tool in the literature specifically targeted to efficient energy estimation of QDI asynchronous circuits. As an application, we show how the simulator has been used to accurately estimate the energy consumption in different parts of an asynchronous MIPS R3000 microprocessor. This is the first energy breakdown of an asynchronous microprocessor in the literature.
In this era of Deep Sub-Micron (DSM) technologies, the impact of interconnects is becoming increasingly important as it relates to integrated circuit (IC) functionality and performance. In the traditional top-down IC design flow, interconnect effects are first taken into account during logic synthesis by way of wireload models. However, for technologies of 0.25µm and below, the wiring capacitance dominates the gate capacitance, and delay estimation based on fanout and design legacy statistics can be highly inaccurate. In addition, logic block size is no longer dictated solely by total cell area, and is often limited by wiring area resources. For these reasons, wiring congestion is an extremely important design factor, and should be taken into consideration at the earliest possible stages of the design flow. In this paper we propose a novel methodology to incorporate congestion minimization within logic synthesis, and present results for industrial circuits that validate our approach.
We present a novel algorithm that applies physical layout information during common subexpression extraction to improve wiring congestion and delay, resulting in improved design closure. As feature sizes decrease and chip sizes increase, the traditional separation of physical design and logic synthesis proves to be increasingly detrimental. Interconnect delay and wiring congestion, among the most critical objective functions to meet design closure, are not considered during logic synthesis. On the other hand, physical design is too deep in the design process to be able to significantly restructure the already technology mapped netlist. While this problem has been addressed previously, the existing solutions only apply simple synthesis transforms during physical design. Hence they are generally unable to reverse decisions made during logic restructuring which have a major negative impact on the circuit structure. In our novel approach, we propose a layout driven algorithm for the concurrent extraction of common subexpressions, one of the most important steps that affect the overall circuit structure, and consequently congestion and wire length during logic synthesis. In addition, we consider dependency relations between cube divisors to improve the extraction process. As a result, our layout driven decomposition algorithm combines logic synthesis and physical layout information to effectively decrease wire length and improve congestion for improved design closure.
In this paper, we show that under the constant delay model the placement problem is equivalent to minimizing a weighted sum of wire lengths. The weights can be efficiently computed once in advance and still accurately reflect the circuit area throughout the placement process. The existence of an efficient and accurate cost function allows us to directly optimize circuit area. This leads to better results compared to heuristic edge weight estimates or optimization for secondary criteria such as wire length. We leverage this property to improve a recursive partitioning based tool flow. We achieve area savings of 27% for some circuits and 15% on average. The use of the constant delay model additionally enables timing closure without iterations.
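The equivalence of placement to weighted wirelength minimization can be illustrated on the simplest case: a single movable cell connected to fixed pins in one dimension, where the weighted-sum-of-distances cost is minimized at the weighted median. This is an illustrative sketch of the cost function's structure, not the paper's recursive partitioning flow; the pin data are assumptions:

```python
def weighted_median(pins):
    """Position minimizing sum(w_i * |x - x_i|) over (position, weight) pins."""
    pins = sorted(pins)
    total = sum(w for _, w in pins)
    acc = 0.0
    for x, w in pins:
        acc += w
        if acc >= total / 2:
            return x

def cost(x, pins):
    """Weighted 1-D wirelength seen by a cell placed at x."""
    return sum(w * abs(x - xi) for xi, w in pins)

# The heavily weighted net at x=4 pulls the optimum onto it.
pins = [(0.0, 1.0), (4.0, 3.0), (10.0, 1.0)]
best = weighted_median(pins)
print(best, cost(best, pins))
```

The paper's point is that, under the constant delay model, the weights can be computed once in advance and remain accurate proxies for circuit area throughout placement, so this kind of direct cost minimization optimizes area rather than a secondary criterion.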
The dynamic PLA style has become popular in designing high-performance microprocessors because of its high speed and predictable routing delay. However, like all other dynamic circuits, dynamic PLAs suffer from the crosstalk noise problem. In this paper, we propose two techniques to alleviate crosstalk noise for dynamic PLAs. The first technique makes use of the fact that, depending on the ordering of product lines, some crosstalk does not cause errors in the outputs. A proper ordering can greatly reduce the number of lines affected by crosstalk noise. For those product lines which can still be affected by crosstalk, we attempt to reduce the parallel length by re-ordering the input and output lines. We have performed experiments on a large set of MCNC benchmark circuits. The results show that after re-ordering, 86.7% of product lines become crosstalk-immune and need not be considered for crosstalk prevention.
We present a unified framework that considers flip-flop and repeater insertion and the placement of flip-flop/repeater blocks during RT or higher level design. We introduce the concept of independent feasible regions in which flip-flops and repeaters can be inserted in an interconnect to satisfy both delay and cycle time constraints. Experimental results show that, with flip-flop insertion, we greatly increase the ability of interconnects to meet timing constraints. Our results also show that it is necessary to perform interconnect optimization at early design steps as the optimization will have even greater impact on the chip layout as feature size continually scales down.
In this paper, we study and implement a routability-driven floorplanner with buffer block planning. It evaluates the routability of a floorplan by computing the probability that a net will pass through each particular location of the floorplan, taking into account buffer locations and routing blockages. Experimental results show that our congestion model can better optimize the congestion and delay (by successful buffer insertions) of a circuit, with only a slight penalty in area.
In this paper, we address the problem of simultaneous routing and buffer insertion. Recently in [12, 22], the authors considered simultaneous maze routing and buffer insertion under the Elmore delay model. Their algorithms can take into account both routing obstacles and restrictions on buffer locations. It is well known that Elmore delay is only a first-order approximation of signal delay and hence could be very inaccurate. Moreover, we cannot impose constraints on the transition times of the output signal waveform at the sink or at the buffers on the route. In this paper we extend the algorithm in [12] so that accurate delay models (e.g., transmission line model, delay look-up table from SPICE, etc.) can be used. We show that the problem of finding a minimum-delay buffered routing path can be formulated as a shortest path problem in a specially constructed weighted graph. By including only the vertices with qualifying transition times in the graph, we guarantee that all transition time constraints are satisfied. Our algorithm can be easily extended to handle buffer sizing and wire sizing. It can be applied iteratively to improve any given routing tree solution. Experimental results show that our algorithm performs well.
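The shortest-path formulation can be sketched with a toy state-space Dijkstra in which the state carries the wire run since the last buffer, standing in (very crudely) for the transition-time qualification of vertices; pruning states whose run exceeds a bound mimics discarding vertices with non-qualifying transition times. The quadratic per-edge delay, the buffer cost, and the example grid are illustrative assumptions, not the paper's accurate look-up-table models:

```python
import heapq

def buffered_shortest_path(adj, source, sink, max_run, wire_delay, buf_delay):
    """Min-delay source->sink path where a buffer may be inserted at any node.

    adj: node -> list of (neighbor, wire_length).  States whose unbuffered
    wire run exceeds `max_run` are pruned (proxy for a slew constraint).
    Per-edge delay grows with run * length, a crude Elmore-like proxy."""
    heap = [(0.0, source, 0.0)]       # (delay, node, run since last buffer)
    best = {}
    while heap:
        d, u, run = heapq.heappop(heap)
        if u == sink:
            return d
        if best.get((u, run), float("inf")) <= d and (u, run) in best:
            continue
        best[(u, run)] = d
        for v, length in adj.get(u, []):
            nrun = run + length
            if nrun <= max_run:       # continue without a buffer
                heapq.heappush(heap, (d + wire_delay * nrun * length, v, nrun))
            if length <= max_run:     # or insert a buffer at u first
                heapq.heappush(heap, (d + buf_delay + wire_delay * length * length,
                                      v, float(length)))
    return None                       # no slew-feasible buffered path exists

grid = {"s": [("a", 3)], "a": [("t", 3)]}
print(buffered_shortest_path(grid, "s", "t", max_run=4,
                             wire_delay=1.0, buf_delay=2.0))  # buffer at 'a' forced
```

In the paper's construction, each graph vertex additionally encodes a discretized transition time, so Dijkstra over the expanded graph yields a minimum-delay buffered route that satisfies all transition-time constraints by construction.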
Transistor tapering is a widely used technique applied to optimize the geometries of CMOS transistors in high-performance circuit design with a view to minimizing the delay of a FET network. Currently, in a long series-connected FET chain, the dimensions of the transistors are decreased from bottom transistor to the top transistor in a manner where the width of transistors is tapered linearly or exponentially. However, it has not been mathematically proved whether either of these tapering schemes yields optimal results in terms of minimization of switching delays of the network. In this paper, we rigorously analyze MOS circuits consisting of long FET chains under the widely used Elmore delay model and derive the optimality of transistor tapering by employing variational calculus. Specifically, we demonstrate that neither linear nor exponential tapering alone minimizes the discharge time of the FET chain. Instead, a composition of exponential and constant tapering actually optimizes the delay of the network. We have also corroborated our analytical results by performing extensive simulation of FET networks and showing that both analytical and simulation results are always consistent.
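The comparison can be explored numerically under a much-simplified Elmore RC model of the discharge path, where each series device contributes resistance inversely proportional to its width and drain capacitance proportional to it. This sketch is not the paper's variational derivation; the unit parameters, load, and taper endpoints are assumptions:

```python
def elmore_delay(widths, r_unit=1.0, c_unit=1.0, c_load=5.0):
    """Elmore discharge delay of a series FET chain.

    widths[0] is the bottom (ground-side) device; device i has on-resistance
    r_unit / w_i and contributes drain capacitance c_unit * w_i at its output
    node, with c_load added at the top node."""
    n = len(widths)
    delay = 0.0
    for i in range(n):
        r_path = sum(r_unit / w for w in widths[:i + 1])  # resistance to ground
        cap = c_unit * widths[i] + (c_load if i == n - 1 else 0.0)
        delay += r_path * cap
    return delay

n = 8
linear = [2.0 - i * (1.0 / (n - 1)) for i in range(n)]    # width 2 -> 1, linear
expo = [2.0 * 0.5 ** (i / (n - 1)) for i in range(n)]     # width 2 -> 1, exponential
print(elmore_delay(linear), elmore_delay(expo))
```

Sweeping taper shapes in such a model shows the trade-off the paper formalizes: widening bottom devices lowers the shared path resistance but adds parasitic capacitance, so neither pure linear nor pure exponential tapering is optimal on its own.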
An incremental simulation-based approach to fault diagnosis and logic debugging is presented. During each iteration of the algorithm, a single suspicious location is identified and fault-modeled such that the functionality of the new design becomes "closer" to its specification. The method is based on a simple and, at first glance, counter-intuitive theoretical result, along with a number of heuristics which help avoid the exponential complexity inherent to these problems. Experiments on multiple design errors and multiple stuck-at faults confirm its effectiveness and accuracy, which scale well with an increasing number of errors.
Test sets for path delay faults in circuits with large numbers of paths are typically generated for path delay faults associated with the longest circuit paths. We show that such test sets may not detect faults associated with the next-to-longest paths. This may lead to undetected failures since shorter paths may fail without any of the longest paths failing. In addition, paths that appear to be shorter may actually be longer than the longest paths if the procedure used for estimating path length is inaccurate. We propose a test enrichment procedure that increases significantly the number of faults associated with the next-to-longest paths that are detected by a (compact) test set. This is achieved by allowing the underlying test generation procedure the flexibility of detecting or not detecting the faults associated with the next-to-longest paths. Faults associated with next-to-longest paths are detected without increasing the number of tests beyond that required to detect the faults associated with the longest paths. The proposed procedure thus improves the quality of the test set without increasing its size.
This paper develops an improved approach for hierarchical functional test generation for complex chips. In order to deal with the increasing complexity of functional test generation, hierarchical approaches have been suggested wherein functional constraints are extracted for each module under test (MUT) within a design. These constraints describe a simplified ATPG view for the MUT and thereby speed up the test generation process. This paper develops an improved approach which applies this technique at deeper levels of hierarchy, so that effective tests can be developed for large designs with complex submodules. A tool called FACTOR (FunctionAl ConsTraint extractOR), which implements this methodology is described in this work. Results on the ARM design prove the effectiveness of FACTOR-ising large designs for test generation and testability analysis.
This article describes the Balboa component integration environment, which is composed of three parts: a script language interpreter, compiled C++ components, and a set of Split-Level Interfaces to link the interpreted domain to the compiled domain. The environment applies the notion of split-level programming to relieve system engineers of software engineering concerns and to let them focus on system architecture. The script language is a Component Integration Language because it implements a component model with introspection and loose typing capabilities. Component wrappers use split-level interfaces that implement the composition rules, dynamic type determination and type inference algorithms. The split-level interfaces are generated automatically by an interface description language compiler. The contribution of this work is twofold: an active code generation technique, and a three-layer environment that keeps the C++ components intact for reuse. We present an overview of the environment, demonstrate our approach by building three simulation models for an adaptive memory controller, and comment on code generation ratios.
This paper addresses the problem of test vector generation starting from a high-level description of the system under test, specified in SystemC. The verification method considered is based upon the simulation of input sequences. The system model adopted is the classical Finite State Machine model. Then, according to different strategies, a set of sequences can be obtained, where a sequence is an ordered set of transitions. For each of these sequences, a set of constraints is extracted. Test sequences can be obtained by generating and solving the constraints using a constraint solver (GProlog). A solution of the constraint solver yields the values of the input signals for which a sequence of transitions in the FSM is executed. If the constraints cannot be solved, the corresponding sequence cannot be executed by any test. The presented algorithm is not based on a specific fault model, but aims at reaching the highest possible path coverage.
We present a design method (HASoC) for the lifecycle modelling of embedded systems that are targeted primarily, but not necessarily, at SoC implementations. The object-oriented development technique is based on our experiences of using an existing modelling technique (MOOSE) and supports a lifecycle that explicitly separates the behaviour of a system from its hardware and software implementation technologies. The design process, which uses a UML-RT-based notation, begins with the incremental development and validation of an executable model of a system. This model is then partitioned into hardware and software to create a committed model, which is mapped onto a system platform. The methodology emphasises the reuse of preexisting hardware and software platforms to ease the development process. An example application is presented in order to illustrate the main concepts in HASoC.
This paper discusses aBlox, a specification notation for high-level synthesis of mixed-signal systems. aBlox addresses three important aspects of mixed-signal system specification: (1) description of functionality, (2) performance issues, and (3) expression of analog-digital interactions. The semantics of aBlox embeds concepts and rules of a functional computational model, and uses a declarative style to denote performance elements. The paper shows some mixed-signal specifications that we developed in aBlox. Finally, we describe a high-level analog synthesis experiment that used aBlox specifications as inputs.
This very short paper describes the objectives, content, and usage of a real-time UML profile that has been standardized by the Object Management Group. This profile defines a common framework for describing the quantitative aspects of software systems. In addition, it provides specific facilities for analysing real-time systems for schedulability or performance.
The specification, design and implementation of embedded systems demand new approaches which go beyond traditional hardware-based notations such as HDLs. The growing dominance of software in embedded systems design requires a careful look at the latest methods for software specification and analysis. The development of the Unified Modeling Language (UML), together with a number of extension proposals in the real-time domain, holds promise for the development of new design flows which move beyond static and traditional partitions of hardware and software. However, UML as currently defined lacks several key capabilities. In this paper, we survey the requirements for system-level design of embedded systems, and give an overview of the extensions to UML that will be dealt with in more detail in the related papers. In particular, we discuss how the notions of platform-based design intersect with a UML-based development approach.
The fast-growing complexity of today's real-time embedded systems necessitates new design methods and tools to face the problems of design, analysis, integration and validation of complex systems. We present a system-level design method for embedded real-time systems combining the informal strengths of UML with the formal strengths of SDL. We demonstrate our flow with the design example of a telecommunications application from the wireless or access domain, showing the applicability of the flow to both control- and data-dominated types of systems. Finally, we show how the application results and other end-user needs and requirements influenced the current UML 2.0 proposal with support for real-time and embedded systems.
To fully exploit the benefit of variable-voltage processors, voltage schedules must be designed in the context of workload requirements. In this paper, we present an approach to finding the least-energy voltage schedule for executing real-time jobs on such a processor according to a fixed-priority, preemptive policy. The significance of our approach is that the theoretical limit in terms of energy saving for such systems is established, which can thus serve as the standard against which to evaluate the performance of various heuristic approaches. Two algorithms for deriving the optimal voltage schedule are provided. The first explores fundamental properties of voltage schedules, while the second builds on the first to further reduce the computational cost. Experimental results are shown to compare the results of this paper with previous ones.
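As a hedged illustration of the convexity principle underlying least-energy voltage schedules (this is not the paper's algorithm; the quadratic per-cycle energy model and all numbers are assumptions), the sketch below shows that, under a convex energy model, the lowest constant speed that meets the deadline never uses more energy than a schedule that varies the speed:

```python
# Illustrative sketch: with a convex per-cycle energy model
# e(f) = k * f**2, running a workload of C cycles at the single
# lowest frequency meeting deadline D uses no more energy than a
# schedule that varies the frequency.

def energy(cycles, freq, k=1.0):
    """Energy = cycles * k * f^2 (assumed convex energy model)."""
    return cycles * k * freq ** 2

def constant_speed(cycles, deadline):
    """Lowest constant frequency that finishes by the deadline."""
    return cycles / deadline

C, D = 100.0, 10.0
f_star = constant_speed(C, D)     # 10 cycles per time unit
e_const = energy(C, f_star)

# A two-speed schedule that also finishes exactly at the deadline:
# 50 cycles at f=20 (takes 2.5), 50 cycles at f=50/7.5 (takes 7.5).
e_split = energy(50, 20.0) + energy(50, 50.0 / 7.5)

assert e_const <= e_split
```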
Dynamic voltage scaling (DVS), which adjusts the clock speed and supply voltage dynamically, is an effective technique for reducing the energy consumption of embedded real-time systems. The energy efficiency of a DVS algorithm largely depends on the performance of the slack estimation method used in it. In this paper, we propose a novel DVS algorithm for periodic hard real-time tasks based on an improved slack estimation algorithm. Unlike the existing techniques, the proposed method takes full advantage of the periodic characteristics of the real-time tasks under priority-driven scheduling such as EDF. Experimental results show that the proposed algorithm reduces the energy consumption by 20-40% over the existing DVS algorithm. The results also show that our algorithm based on the improved slack estimation method gives comparable energy savings to the DVS algorithm based on the theoretically optimal (but impractical) slack estimation method.
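A minimal sketch of the general slack-reclamation idea behind such DVS algorithms (not the paper's improved estimator; the function name, cycle counts and frequencies are illustrative assumptions):

```python
def dispatch_frequency(wcec, time_available, f_max):
    """Lowest frequency that completes wcec worst-case execution
    cycles within the available time, capped at f_max."""
    return min(f_max, wcec / time_available)

# A job budgeted 4 time units at f_max = 100 MHz for wcec = 400
# cycles must run at full speed. If the previous job finished 2
# units early, that slack is reclaimed and the speed can drop:
f_no_slack = dispatch_frequency(400, 4.0, 100.0)    # 100.0 MHz
f_with_slack = dispatch_frequency(400, 6.0, 100.0)  # ~66.7 MHz
```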
We present an extension of synchronous programming languages that can be used to declare program locations irrelevant for verification. An efficient algorithm is proposed to generate from the output of the usual compilation an abstract real-time model by ignoring the irrelevant states, while retaining the quantitative information. Our technique directly generates a single real-time transition system, thus overcoming the known problem of composing several real-time models. A major application of this approach is the verification of real-time properties by symbolic model checking.
This paper presents an on-chip, interconnect-aware methodology for high-speed analog and mixed signal (AMS) design which enables early incorporation of on-chip transmission line (T-line) components into AMS design flow. The proposed solution is based on a set of parameterized T-line structures, which include single and two coupled microstrip lines with optional side shielding, accompanied by compact true transient models. The models account for frequency dependent skin and proximity effects, while maintaining passivity requirements due to their pure RLC nature. The signal bandwidth supported by the models covers a range from DC to 100 GHz. The models are currently verified in terms of S-parameter data against hardware (up to 40 GHz) and against EM solver (up to 100 GHz). This methodology has already been used for several designs implemented in SiGe (Silicon-Germanium) BiCMOS technology.
In this paper we present efficient closed-form formulas to estimate capacitive coupling-induced crosstalk noise for distributed RC coupling trees. The efficiency of our approach stems from the fact that only the five basic operations are used in the expressions: addition (x + y), subtraction (x - y), multiplication (x x y), division (x/y) and square root (square root (x)). The formulas do not require exponent computation or numerical iterations. We have developed closed-form expressions for the peak crosstalk noise amplitude, the peak noise occurring time and the width of the noise waveform. Our approximations are conservative and yet achieve acceptable accuracy. The formulas are simple enough to be used in the inner loops of performance optimization algorithms or as cost functions to guide routers. They capture the influence of coupling direction (near-end and far-end coupling) and coupling location (near-driver and near-receiver).
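The paper's distributed-RC formulas are considerably more elaborate; as a hedged illustration of closed-form noise estimation using only the basic arithmetic operations, the sketch below computes the classic lumped charge-sharing bound (an assumption-laden simplification, not the paper's model):

```python
def peak_noise_lumped(vdd, cc, cg):
    """Charge-sharing bound for a fast aggressor step on a lumped
    victim node: Vpeak = Vdd * Cc / (Cc + Cg). Uses only basic
    arithmetic, in the spirit of the paper's five-operation
    restriction, but for a single lumped node rather than a
    distributed RC coupling tree."""
    return vdd * cc / (cc + cg)

# 1.2 V supply, 30 fF coupling cap, 90 fF victim ground cap:
v = peak_noise_lumped(1.2, 30e-15, 90e-15)   # 0.3 V
```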
This paper presents an efficient approach to compute the dominant poles for the reduced-order admittance (Y-parameter) matrix of lossy interconnects. Using the global approximation technique, efficient frameworks are constructed to transform the frequency-domain Telegrapher's equations into compact linear algebraic equations. The dominant poles and residues can be extracted by directly solving the linear equations. Closed-form formulas are derived to compute the low-order dominant poles. Due to the high accuracy of the global approximation, the extracted poles can accurately represent the exact admittance matrices over a wide frequency range. By using the recursive convolution technique, the pole-residue models can be represented by companion models, which have linear computational complexity with respect to simulation time. The presented modeling approaches are shown to preserve passivity. Numerical experiments on transient simulation show that the presented modeling approaches lead to higher efficiency while maintaining comparable accuracy.
Accurate gate-level static timing analysis in the presence of RC loads has become an important problem for modern deep-submicron designs. Non-capacitive loads are usually analyzed using the concept of an effective capacitance, Ceff. Most published algorithms for Ceff, however, require special cell characterization or supplemental information that is not part of standard timing libraries. In this paper we present a novel Ceff algorithm that is strictly compatible with existing timing libraries. It is also fast, easily implemented, and quite accurate: within 3% of transistor-level simulation in our tests. The method is based on approximating a gate by a current source, estimating the delay difference when the gate drives the actual RC load and a reference capacitor, and then converting the delay discrepancy into a Ceff value. Central to carrying out this program is the innovative concept of a delay correction transfer function.
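A hedged sketch of the delay-matching idea at the core of such Ceff algorithms (the bisection structure and the toy linear delay model are illustrative assumptions, not the paper's method):

```python
def ceff_by_delay_matching(delay_into_load, delay_into_cap,
                           c_lo, c_hi, tol=1e-18):
    """Find the effective capacitance Ceff such that the driver's
    delay into a single capacitor matches its delay into the actual
    RC load, by bisection on the capacitance value. The two delay
    functions are black boxes (a simulator or timing model)."""
    target = delay_into_load()
    while c_hi - c_lo > tol:
        mid = 0.5 * (c_lo + c_hi)
        if delay_into_cap(mid) < target:
            c_lo = mid
        else:
            c_hi = mid
    return 0.5 * (c_lo + c_hi)

# Toy example with an assumed linear delay model d(C) = Rd * C,
# so the exact answer is known to be 50 fF:
RD = 1000.0
ceff = ceff_by_delay_matching(lambda: RD * 50e-15,
                              lambda c: RD * c,
                              0.0, 200e-15)
```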
We propose a self-checking scheme for the on-line testing of power supply noise exceeding a tolerance bound, to be chosen according to the system's constraints. Upon the occurrence of such noise, our scheme produces an output error message, which can be exploited for diagnosis purposes or to recover from the detected noise (thus guaranteeing the system's correct operation). To the best of our knowledge, no on-line testing scheme for power supply noise has been proposed to date. Our scheme negligibly impacts system performance, features self-checking ability with respect to a wide set of possible internal faults, and continues to detect power supply noise on-line despite the possible presence of noise affecting ground as well.
The need for integrated mechanisms providing on-line error detection or fault tolerance is becoming a major concern due to the increasing sensitivity of the circuits to their environment. This paper reports on a tool automating the implementation of such mechanisms by modifying high-level VHDL descriptions. The modifications are compatible with industrial design flows based on commercial synthesis and simulation tools. The results demonstrate the feasibility and the efficiency of the approach.
Although algorithm-level re-computing techniques can trade off the detection capability of Concurrent Error Detection (CED) against time overhead, they incur 100% time overhead when the strongest CED capability is achieved. Using the idle cycles in the data path for the re-computation can reduce this time overhead. However, dependencies between operations prevent the re-computation from fully utilizing the idle cycles. Deliberately breaking some of these data dependencies can further reduce the time overhead associated with algorithm-level re-computing.
Fault-tolerant circuits are currently required in several major application sectors, and a new generation of CAD tools is required to automate the insertion and validation of fault-tolerant mechanisms. This paper outlines the characteristics of a new fault-injection platform and its evaluation in a real industrial environment. It also details techniques devised and implemented within the platform to speed-up fault-injection campaigns. Experimental results are provided, showing the effects of the different techniques, and demonstrating that they are able to reduce the total time required by fault-injection campaigns by at least one order of magnitude.
With the term flexibility, we introduce a new design dimension of an embedded system that quantitatively characterizes its feasibility in implementing not just one, but possibly several alternative behaviors. This is important when designing systems that may adapt their behavior during operation, e.g., due to new environmental conditions, or when dimensioning a platform-based system that must implement a set of different behaviors. A hierarchical graph model is introduced that allows the flexibility and cost of a system to be modelled formally. Based on this model, an efficient exploration algorithm to find the optimal flexibility/cost trade-off curve of a system is proposed, using the design of a family of set-top boxes as an example.
We present an area and delay estimator in the context of a compiler that takes high-level signal and image processing applications described in MATLAB and performs automatic design space exploration to synthesize hardware for a Field Programmable Gate Array (FPGA) that meets the user's area and frequency specifications. The area estimator predicts the maximum number of Configurable Logic Blocks (CLBs) consumed by the hardware synthesized for the Xilinx XC4010 from the input MATLAB algorithm. The delay estimator computes the delay of the logic elements on the critical path and the delay in the interconnects. The total number of CLBs predicted is within 16% of the actual CLB consumption, and the estimated frequency is within 13% of the actual frequency after synthesis with the Synplify logic synthesis tools and placement and routing with the XACT tools from Xilinx. Since the proposed estimators are fast and sufficiently accurate, they can be used in a high-level synthesis framework such as ours to perform rapid design space exploration.
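As a hedged illustration of how an operator-level area estimator of this kind can work (the per-operator CLB costs below are invented placeholders, not the paper's XC4010 characterization data):

```python
# Placeholder cost table: CLBs per (operator, bit-width). Real
# estimators would derive these from device characterization.
CLB_COST = {('add', 8): 4, ('add', 16): 8,
            ('mul', 8): 36, ('mul', 16): 140}

def estimate_clbs(ops):
    """Estimate total CLB usage by summing per-operator costs over
    the operations extracted from the input algorithm."""
    return sum(CLB_COST[(name, width)] for name, width in ops)

# Two 16-bit adders and one 8-bit multiplier:
total = estimate_clbs([('add', 16), ('add', 16), ('mul', 8)])  # 52
```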
In this paper, we present an efficient methodology to validate high-performance algorithms and prototype them using reconfigurable hardware. We follow a strict top-down Hardware/Software Codesign paradigm using stepwise refinement techniques. Starting from a performance evaluation on the data-flow level using the OCAPI system, we partition the simulated high-level data-flow description into hardware and software modules. The hardware parts, described in Handel-C, are compiled and mapped to Xilinx Virtex 2000E FPGAs, and the software is executed on a PC processor that hosts the Virtex boards. Hardware/software interfacing and communication between processor and FPGA is established via the PCI bus by shared-memory DMA transfers. This paper presents the methodology and illustrates the method with the example of a channel coder.
Simple and powerful modeling of concurrency and reactivity, along with their efficient implementation in the simulation kernel, is crucial to the overall usefulness of system-level models in C++-based modeling frameworks. In most of these frameworks, however, concurrency is naturally aligned with hardware units: the language constructs support this view, and system designers express concurrency by providing threads for some modules/units of the model. Our experimental analysis shows that this concurrency model leads to inefficient simulation performance, whereas aligning concurrency along the dataflow gives much better simulation performance at the price of changing the conceptual model of the hardware structure. We therefore propose an algorithmic transformation of designs written in these C++-based environments with concurrency aligned along units/modules. This transformation, provided as a compiler front-end, re-assigns the concurrency along the dataflow, as opposed to threading along concurrent hardware/software modules, while keeping the functionality of the model unchanged. Such a front-end transformation relieves hardware system designers from software engineering concerns such as threading architecture and simulation performance, allowing them to design in the most natural manner, while simulation performance is enhanced by up to almost two times in our experiments.
Design automation for analog/mixed-signal (A/MS) circuits and systems still lags behind what has been achieved in the digital area. As System-on-Chip (SoC) designs include analog components in most cases, these analog parts become even more of a bottleneck in the overall design process. The paper is dedicated to the latest R&D activities within the MEDEA+ project ANASTASIA+. The main focus is the development of seamless top-down design methods for integrated analog and mixed-signal systems and the achievement of a high level of automation and reuse in the A/MS design process. These efforts are motivated by the urgent need to close the current gap in the industrial design flow between system specification and design on the one hand and block-level circuit design on the other. The paper focuses on three subtopics, starting with the top-down design flow with applications from circuit sizing, design centering, and automated behavioral modeling. The next part focuses on modeling and simulation of specific functionalities in sigma-delta design, while the last section is dedicated to a mixed-signal System-on-Chip design environment.
In programmable embedded systems, the memory subsystem represents a major cost, performance and power bottleneck. To optimize the system for such different goals, the designer would like to perform Design Space Exploration, evaluating different memory modules from a memory IP library, and selecting the most promising designs. However, while the memory modules are important, the rate at which the memory system can produce the data for the CPU is significantly impacted by the connectivity architecture between the memory subsystem and the CPU. Thus, it is critical to consider the connectivity architecture early in the design flow, in conjunction with the memory architecture. We present a connectivity architecture exploration approach, evaluating a wide range of cost, performance, and energy connectivity architectures. When coupled with our memory modules exploration approach, we can significantly improve the system behavior. We present experiments on a set of large real-life benchmarks, showing significant performance improvements for varied cost and power characteristics, allowing the designer to tailor the performance, cost and power of the programmable embedded system.
Multimedia applications are characterized by a large number of data accesses and complex array index manipulations. The built-in address decoder in the RAM memory model commonly used by most memory synthesis tools, unnecessarily restricts the freedom of address generator synthesis. Therefore a memory model in which the address decoder is decoupled from the memory cell array is proposed. In order to demonstrate the benefits and limitations of this alternative memory model, synthesis results for a Shift Register based Address Generator that does not require address decoding are compared to those for a counter-based address generator that requires address decoding. Results show that delay can be nearly halved at the expense of increased area.
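A minimal sketch contrasting the two address-generation styles compared above (an illustrative software model assuming one-hot word-line selection; the hardware area and delay differences the abstract reports are of course not visible at this level):

```python
def counter_wordlines(n_words, step):
    """Binary counter + address decoder: the binary count must be
    decoded into a one-hot word-line vector."""
    count = step % n_words
    return [1 if i == count else 0 for i in range(n_words)]

def shift_register_wordlines(n_words, step):
    """One-hot shift register: a single 1 is rotated through the
    word lines directly, so no address decoding is needed."""
    wl = [0] * n_words
    wl[step % n_words] = 1
    return wl

# Both generators produce the same sequential access pattern; they
# differ only in the hardware needed to produce it.
for t in range(8):
    assert counter_wordlines(4, t) == shift_register_wordlines(4, t)
```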
This paper presents a heuristic method to solve the combined resource selection and binding problems for the high-level synthesis of multiple-precision specifications. Traditionally, the number of functional (and storage) units in a datapath is determined by the maximum number of operations scheduled in the same cycle, with their respective widths depending on the number of bits of the widest operations. When these wider operations are not scheduled in such a 'busy' cycle, this approach can waste considerable area. To overcome this problem, we propose selecting the set of resources taking into account the only truly relevant aspect: the maximum number of bits calculated and stored simultaneously in a cycle. The implementation obtained is a multiple-precision datapath, where the number and widths of the resources are independent of the specification operations and data objects.
This paper presents a new approach for model-order reduction of linear time-varying systems based on expanding the time-varying system in the right half-plane of the s-domain. The proposed algorithm is developed by introducing Krylov subspace-based reduction to time-varying transfer functions. The proposed algorithm does not require the solution of large systems of equations to construct a basis for the time-varying moments. Instead, it computes such a basis through time-domain integration of the corresponding linear time-varying differential algebraic equations. Numerical experiments show that expanding in the right half-plane compresses the transient phase of the response of these equations by several orders of magnitude.
As system integration evolves and tighter design constraints must be met, it becomes necessary to account for the non-ideal behavior of all the elements in a system. For high-speed digital, and microwave systems, it is increasingly important to model previously neglected frequency domain effects. In this paper, results from Nevanlinna-Pick interpolation theory are used to develop a bounded real matrix rational approximation algorithm. A method is presented that allows for the generation of guaranteed passive rational function models of passive systems by approximating their scattering parameter matrices. Since the order of the models may in some cases be high, an incremental fitting strategy is also proposed that allows for the generation of smaller models while still meeting the required passivity and accuracy requirements. Results of the application of the proposed method to several real-world examples are also shown.
We present a new passive model reduction algorithm based on the Laguerre expansion of the time response of interconnect networks. We derive expressions for the Laguerre coefficient matrices that minimize a weighted square of the approximation error, and show how these matrices can be computed efficiently using Krylov subspace methods. We discuss the connections between our method and other methods such as PRIMA [4]. Numerical simulations show that our method can better approximate the original model compared to PRIMA.
This paper presents an innovative algorithm for the automatic generation of March Tests. The proposed approach is able to generate an optimal March Test for an unconstrained set of memory faults in very low computation time.
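As a hedged illustration of how a march test exercises a memory (MATS+ is a standard march test used here only as an example; it is not claimed to be one of the paper's generated tests):

```python
def mats_plus(mem_read, mem_write, n):
    """Apply MATS+ {up(w0); up(r0,w1); down(r1,w0)} to an n-cell
    memory; return True iff every read matches its expected value."""
    for a in range(n):            # ascending: write 0
        mem_write(a, 0)
    for a in range(n):            # ascending: read 0, write 1
        if mem_read(a) != 0:
            return False
        mem_write(a, 1)
    for a in reversed(range(n)):  # descending: read 1, write 0
        if mem_read(a) != 1:
            return False
        mem_write(a, 0)
    return True

# A fault-free memory passes; a cell stuck at 0 is detected:
good = {}
assert mats_plus(good.get, good.__setitem__, 8)

faulty = {}
def w(a, v):
    faulty[a] = 0 if a == 3 else v   # injected stuck-at-0 at cell 3
assert not mats_plus(faulty.get, w, 8)
```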
Most industrial memories have an external word width of more than one bit. However, most published memory test algorithms assume 1-bit memories; they will not detect coupling faults between the cells of a word. This paper improves upon the state of the art in testing word-oriented memories by presenting a new method for detecting state coupling faults between cells of the same word, based on the use of m-out-of-n codes. The result is a reduction in test time, which varies between 20% and 30%.
Keywords: state coupling faults, word-oriented memories, data backgrounds, m-out-of-n codes.
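The m-out-of-n data backgrounds underlying such a method can be enumerated straightforwardly; a minimal sketch (illustrative only, not the paper's test algorithm):

```python
from itertools import combinations

def m_out_of_n_codes(m, n):
    """All n-bit words with exactly m ones. Any two distinct code
    words differ in at least two bit positions, which makes them
    useful as data backgrounds for exposing state coupling faults
    between cells of the same word."""
    words = []
    for ones in combinations(range(n), m):
        words.append(sum(1 << i for i in ones))
    return words

# The six 2-out-of-4 backgrounds: 0011, 0101, 0110, 1001, 1010, 1100
codes = m_out_of_n_codes(2, 4)
```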
This paper presents a new fault-independent method for maximizing local conflicting value assignments for the purpose of untestable fault identification. The technique first computes a large number of logic implications across multiple time-frames and stores them in an implication graph. Then, by maximizing conflicting scenarios in the circuit, the algorithm identifies a large number of untestable faults that depend on such impossibilities. The proposed approach identifies impossible combinations locally around each Boolean gate in the circuit, and its complexity is thus linear in the number of nodes, resulting in short execution times. Experimental results for both combinational and sequential benchmark circuits show that many more untestable faults can be identified efficiently with this approach.
Models meant for logic verification and simulation are often used for ATPG. For custom digital circuits, these models contain many tristate devices, which leads to lower fault coverage. Unlike other research in the literature, the modeling algorithms presented in this paper analyze each channel-connected component in the context of its environment, thereby capturing the relationship among its input signals. This reduces the number of tristates and increases the modeling efficiency, as measured by fault coverage. Experimental results demonstrate the superiority of this approach.
We have developed a technique to compute a Quasi Static Schedule of a concurrent specification for the software partition of an embedded system. Previous work did not take into account correlations among run-time values of variables, and therefore tried to find a schedule for all possible outcomes of conditional expressions. This is advantageous on one hand, because by abstracting data values one can find schedules in many cases for an originally undecidable problem. On the other hand it may lead to exploring false paths, i.e., paths that can never happen at run-time due to constraints on how the variables are updated. This affects the applicability of the approach, because it leads to an explosion in the running time and the memory requirements of the compile-time scheduler itself. Even worse, it also leads to an increase in the final code size of the generated software. In this paper, we propose a semi-automatic algorithm to solve the problem of false paths: the designer identifies and tags critical expressions, and synchronization channels are automatically added to the specification to drive the search of a schedule. As a proof of concept, the proposed technique has been applied to a subsystem of an MPEG-2 decoder, and allowed us to find a schedule that previous techniques could not identify.
This paper explores the role of data analysis methods to support system-level designers in characterising the performance of embedded applications. In particular, we address the performance modelling of software applications running on an embedded microprocessor. We propose a data analysis method, which, on the basis of a parameterisation of the software functionality and the hardware architecture, is able to predict the number of execution cycles on an embedded processor. Experiments with standard computational code (sorting, mathematical computation) and with MPEG variable length decoding are presented to support this claim.
This paper focuses on I-cache behaviour enhancement through the application of high-level code transformations. Specifically, a flow for the iterative application of I-cache performance-optimizing transformations is proposed. The application of transformations is driven by a set of analytical equations, which take parameters related to the code and the I-cache structure and predict the number of I-cache misses. Experimental results from a real-life demonstration application show that order-of-magnitude reductions in the number of I-cache misses can be achieved by the proposed methodology.
Intra-iteration data reuse occurs when multiple array references exhibit data reuse in a single loop iteration. An optimizing compiler can exploit this reuse by clustering (in the loop body) array references with data reuse as much as possible. This reduces the number of intervening references between references to the same array and improves overall execution time and energy consumption. In this paper, we present a strategy where inter-statement and intra-statement optimizations are used in concert to optimize intra-iteration data reuse. The objective is to cluster (within the loop body) the array references with spatial or temporal reuse. Using four array-intensive applications from the image processing domain, we show that our approach improves the cache behavior of programs by 13.8% on average.
CAD has always been poorly understood by the CEOs of companies because it obeys rules (if any) very different from those of the process. A rich variety of CAD and TCAD solutions were developed in Europe in the early days of the CAD industry. These solutions introduced real innovations in the field, but because they were mostly internal to the companies, they never reached the engineering level that would have enabled their introduction to the market. A review of the history of CAD activity in Europe will be presented in this Plenary Session, together with some prospects on how it could evolve in the coming years and emerge from its lackluster industrial visibility.
Future networked appliances should be able to download new services or upgrades from the network and execute them locally. This flexibility is typically achieved by processors that can download new software over the network, using JAVA technology. This paper demonstrates that FPGAs are a realistic implementation platform for thin server or client applications. FPGAs can offer the same end-user experience as software based systems, combined with more computational power and lower cost.
Many approaches recently proposed for high-speed asynchronous pipelines are applicable only to linear datapaths. However, real systems typically have non-linearities in their datapaths, i.e. stages may have multiple inputs ("joins") or multiple outputs ("forks"). This paper presents several new pipeline templates that extend existing high-speed approaches for linear dynamic logic pipelines, by providing efficient control structures that can accommodate forks and joins. In addition, constructs for conditional computation are also introduced. Timing analysis and SPICE simulations show that the performance overhead of these extensions is fairly low (5% to 20%).
This paper presents a new, fast, templatized family of fine-grain asynchronous pipeline stages based on the single-track protocol. No explicit control wires are required outside of the datapath, and the data is 1-of-N encoded. With a forward latency of 2 transitions and a cycle time of 6 for most configurations, the new family can run at 1.6 GHz in the MOSIS TSMC 0.25 µm process. This is significantly faster than all known quasi-delay-insensitive templates, with fewer timing assumptions than the recently proposed ultra-high-speed GasP bundled-data circuits.
Optimizing power consumption at the high level is a critical step towards power-efficient digital system design. This paper addresses the power management problem by scheduling a given control-dominated data flow graph. We discuss delay and power issues in scheduling, and propose an improvement algorithm for the insertion of so-called soft edges, which enable power optimization under timing constraints. Power savings obtained by our approach on the tested circuits range between 15% and 30% of the initial power dissipation.
The design of application(-domain)-specific instruction-set processors (ASIPs) optimized for code size has traditionally been accompanied by the necessity to program in assembly, at least for the performance-critical parts of the application. Highly encoded instruction sets simply lack the orthogonal structure present in, e.g., VLIW processors that allows efficient compilation. This lack of efficient compilation tools has also severely hampered the design space exploration of code-size-efficient instruction sets and, correspondingly, their tuning to the application domain. In [13] a practical method is demonstrated to model a broad class of highly encoded instruction sets in terms of virtual resources easily interpreted by classic resource-constrained schedulers (such as the popular list-scheduling algorithm), thereby allowing efficient compilation with well-understood compilation tools. In this paper we demonstrate the suitability of this model to also enable instruction-set design and design-space exploration with a simple, well-understood and proven method long used in the High-Level Synthesis (HLS) of ASICs. A small case study proves the practical applicability of the method.
This paper presents a novel substrate coupling simulation tool that is well suited to floorplanning of large mixed signal IC designs. The IC layout may consist of several subcircuits, hence a hierarchical design flow, which is usually used for IC circuit design and layout, is supported. Coupling data modelling the substrate inside subcircuits are precalculated and subsequently used during floorplanning leading to shorter simulation time. In addition, the impedance model of the power grid is considered as well making it possible to provide estimation results of substrate coupling quickly after only one simulation step. The approach is verified by experimental results in 0.13µm CMOS and 0.25µm BiCMOS technologies.
S-parameter-based circuit simulators are widely used for the design of microwave circuits. The accuracy of these simulators is limited by the fact that they do not take into account the electromagnetic coupling between the components and transmission lines that compose a circuit. In this article we present a technique that enables us to take this coupling into account with only a modest increase in calculation time.
In this paper, we study the simultaneous switching noise problem by using an application-specific modeling method. A simple yet accurate MOSFET model is proposed in order to derive closed-form formulas for simultaneous switching noise voltage waveforms. We first derive a simple formula assuming that the inductances are the only parasitics, and through HSPICE simulation we show that the new formula is more accurate than previous results based on the same assumption. We then study the effect of the parasitic capacitances of ground bonding wires and pads. We show that the maximum simultaneous switching noise should be calculated using four different formulas, depending on the value of the parasitic capacitances and the slope of the input signal. The proposed formulas, modeling both parasitic inductances and capacitances, are within 3% of HSPICE simulation results.
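As a hedged point of reference, the textbook first-order estimate V = n * L * di/dt that more detailed simultaneous switching noise formulas refine (the numbers below are illustrative assumptions):

```python
def ssn_peak(n, l_eff, di_dt):
    """First-order simultaneous switching noise estimate
    V = n * L * di/dt for n drivers switching together through a
    shared effective inductance. This is the classic textbook bound;
    it ignores the parasitic capacitances and input-slope effects
    that refined formulas additionally model."""
    return n * l_eff * di_dt

# 8 drivers, 2 nH effective inductance, 50 mA/ns per driver:
v = ssn_peak(8, 2e-9, 50e-3 / 1e-9)   # 0.8 V
```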
This paper addresses the development of accurate and efficient behavioral models of digital integrated circuit input and output ports for EMC and signal integrity simulations. A practical modeling process is proposed and applied to some example devices. The modeling process is simple and efficient, and it yields models performing at a very high accuracy level.
The market demand for portable multimedia applications has exploded in recent years. Unfortunately, for such applications current compilers and software optimization methods often require designers to do part of the optimization manually. Specifically, the high-level arithmetic optimizations and the use of complex instructions are left to the designers' ingenuity. In this paper, we present a tool flow, SymSoft, that automates the optimization of power-intensive algorithmic constructs using symbolic algebra techniques combined with energy profiling. SymSoft is used to optimize and tune the algorithmic-level description of an MPEG Layer III (MP3) audio decoder for the SmartBadge [2] portable embedded system. We show that our tool lowers the number of instructions and memory accesses and thus lowers the system power consumption. The optimized MP3 audio decoder software meets real-time constraints on the SmartBadge system with low energy consumption. Furthermore, performance improves by a factor of 7.27 and energy consumption decreases by a factor of 4.45 over the original executable specification.
As bus lengths on multi-hundred-million-transistor SOCs (Systems-On-a-Chip) grow, and as inter-wire capacitances of sub-0.10µm technologies increase, the resulting high switching capacitances of buses (and interconnects in general) have a non-negligible impact on the power consumption of a whole SOC. In this paper, we address this problem by introducing our bus encoding technique 'ADES', which minimizes the power consumption of data buses through a dictionary-based encoding technique. We show that our technique saves between 18% and 40% of bus energy compared to the non-encoded cases, using a large set of (freely accessible) real-world applications. Furthermore, we compare our technique to the best-known data-bus encoding techniques to date, and it exceeds all of them in energy savings for the same set of applications. The additional hardware cost of our bus encoder/decoder is very small.
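The idea of dictionary-based bus encoding can be sketched as follows; this is an illustrative toy, not the ADES scheme itself (a real encoder also drives a hit/miss control line, omitted here), and the replacement policy is an assumption:

```python
def hamming(a, b):
    """Number of bit positions that toggle between two bus words."""
    return bin(a ^ b).count("1")

def dict_encode_transitions(values, dict_bits=3):
    """Toy dictionary-based bus encoder: on a hit, only a short
    dictionary index is driven on the bus (few bit flips); on a miss,
    the raw word is driven and inserted into a small FIFO dictionary.
    Returns the total number of bus-line transitions."""
    table, prev, transitions = [], 0, 0
    for v in values:
        if v in table:
            word = table.index(v)            # hit: short index
        else:
            word = v                         # miss: raw word
            table.append(v)
            if len(table) > (1 << dict_bits):
                table.pop(0)                 # FIFO replacement (assumed)
        transitions += hamming(prev, word)
        prev = word
    return transitions
```

On a repetitive value stream the encoded transition count drops below that of the raw stream, which is the effect the paper's energy savings rest on.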
In this paper, we present a methodology for power minimization by data cache tag compression. The set of tags accessed by the major application loops is analyzed statically at compile time, and an efficient and optimal compression scheme is proposed. Only a very limited number of tag bits are stored in the tag array for cache conflict identification, thus achieving a significant reduction in the number of active bitlines, sense amps, and comparator cells. The underlying hardware support for dynamically compressing the tags consists of a highly cost- and power-efficient programmable encoder, which lies outside the cache access path and thus does not affect the processor cycle time. A detailed VLSI implementation has been performed, and a number of experimental results on a set of embedded applications and numerical kernels are reported. Energy dissipation decreases of up to 95% can be observed for the tag arrays, while significant energy reductions in the range of 10%-50% are observed when amortized across the overall cache subsystem.
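The static analysis step can be pictured as building a short-code table for the tags a hot loop actually touches. A minimal sketch under that assumption (the function and its structure are illustrative, not the paper's optimal scheme or its VLSI encoder):

```python
def build_tag_codes(tag_trace, code_bits):
    """Assign short codes to the distinct tags seen in a hot loop,
    so the tag array stores and compares only code_bits bits instead
    of full tags. Illustrative sketch only."""
    distinct = sorted(set(tag_trace))
    if len(distinct) > (1 << code_bits):
        raise ValueError("too many live tags for this code width")
    return {tag: code for code, tag in enumerate(distinct)}

# hypothetical tag trace from one loop nest
codes = build_tag_codes([0x3FA0, 0x3FA1, 0x3FA0], 2)
```

With 2-bit codes in place of full tags, the active bitlines and comparator cells per access shrink accordingly, which is where the reported tag-array energy reduction comes from.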
The hierarchical structure of real-life data-dominated applications limits the exploration space for high-level optimisations. This limitation is often overcome by function inlining. However, inlining increases the basic-block code size, which causes a significant growth in instruction cache misses and thus a performance slow-down. This effect has been confirmed in experiments with our applications. We have developed a novel methodology for selective function inlining, steered by a cost/gain balance, to trade off power and performance. Although this results in a speed-up, the increase in instruction cache misses is still present, i.e. the memory power consumption is higher. This implies the possibility of Pareto-optimal trade-offs between memory power and performance. Our methodology is demonstrated on an MPEG-4 video decoder.
We present the first approach to model checking for nonlinear analog systems. Based on digital CTL model checking ideas, results in hybrid model checking and special needs in analog verification, a new model checking tool has been implemented. Published model checking tools for hybrid systems require discrete or partly linear system descriptions. Our focus is on nonlinear analog behavior, therefore a new approach is necessary. There are mainly two aspects to be considered. Firstly, a discrete model retaining the essential nonlinear analog behavior has to be developed. Secondly, model checking for analog systems requires extensions of the language to define analog system properties in a reasonable way.
This paper presents performance results for a new SAT solver designed specifically for EDA applications. The new solver significantly outperforms the most efficient SAT solvers - Chaff [2], SATO [3], and GRASP [1] - on a large set of benchmarks. Performance improvements for standard benchmark groups vary from 1.5x to 60x. They were achieved through a new decision-making strategy and more efficient Boolean constraint propagation (BCP).
We introduce a new approach to Boolean satisfiability (SAT) that combines backtrack search techniques and zero-suppressed binary decision diagrams (ZBDDs). This approach implicitly represents SAT instances using ZBDDs, and performs search using an efficient implementation of unit propagation on the ZBDD structure. The adaptation of backtrack search algorithms to such an implicit representation allows for a potential exponential increase in the size of problems that can be handled.
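For readers unfamiliar with the operation being made implicit, here is unit propagation on an explicit clause list (the standard formulation; the paper's contribution is performing the same operation on a ZBDD representation instead):

```python
def unit_propagate(clauses, assignment):
    """Standard unit propagation on an explicit CNF clause list.
    Literals are nonzero ints; -x denotes the negation of variable x.
    Returns the extended assignment, or None on a conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                      # clause already satisfied
            unassigned = [l for l in clause if abs(l) not in assignment]
            if not unassigned:
                return None                   # all literals false: conflict
            if len(unassigned) == 1:          # unit clause: force the literal
                l = unassigned[0]
                assignment[abs(l)] = l > 0
                changed = True
    return assignment
```

In a backtrack search this routine runs after every decision; the ZBDD-based implementation in the paper aims to do the equivalent work on an implicitly represented (potentially exponentially larger) clause set.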
Power consumption is becoming one of the most critical parameters in VLSI design. In this paper we describe a novel state-assignment algorithm targeting low-power CMOS realizations of controllers. The main features of the new approach can be summarized as follows: 1) a flexible column-encoding strategy which allows handling the area and register-activity cost functions separately, and 2) a preliminary analysis of the FSM to control the relative weight of each cost function. Experimental results show that, on average, there is a 25% reduction in power consumption compared to a standard tool, without any area penalty.
The algorithms for static reordering of Reduced Ordered Binary Decision Diagrams (ROBDDs) rely on dependable properties for the grouping of variables. Two such properties have been studied so far: keeping symmetric variables adjacent [1] and minimizing the ROBDD's width [2]. However, counterexamples have been found for both cases [1], [3]. In this paper, we introduce a new condition for the grouping of variables, suggesting to keep adjacent the variables from all bound sets of the function which are explicitly given by its composition tree. A bound set is a proper subset Y of the variables X of a function f : {0,1}^|X| -> {0,1} resulting in a decomposition of the type f(X) = g(h(Y), Z), where Z = X - Y. The composition tree T(f) of f is a structure reflecting all its non-overlapping bound sets [4]-[6]. A bound-set-preserving ordering π(X) of the variables of a ROBDD for f(X) is a vector describing the variables of X in order from top to bottom of the ROBDD, in which the variables of any node of T(f) are adjacent in π(X). For example, if a function f(x1, x2, x3) has a single non-trivial bound set {x1, x2}, then the orderings (x1, x2, x3), (x3, x1, x2), (x3, x2, x1) are bound-set-preserving ones, while the orderings (x1, x3, x2) and (x2, x3, x1) are not. A composition tree T(f) is unique for f (up to isotopy), and therefore any Boolean function has a well-defined set of bound-set-preserving orderings. We prove that the intersection of the set of bound-set-preserving orderings and the set of best orderings is non-empty for any Boolean function.
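The adjacency condition on bound sets is easy to check mechanically; a minimal sketch, exercised on the abstract's own example:

```python
def is_bound_set_preserving(ordering, bound_sets):
    """True iff every bound set occupies consecutive positions in the
    top-to-bottom ROBDD variable ordering, i.e. the span of its
    positions equals its size minus one."""
    pos = {v: i for i, v in enumerate(ordering)}
    for bs in bound_sets:
        positions = sorted(pos[v] for v in bs)
        if positions[-1] - positions[0] != len(bs) - 1:
            return False
    return True
```

For f(x1, x2, x3) with the single bound set {x1, x2}, (x3, x1, x2) passes this check while (x1, x3, x2) does not, matching the example above.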
Today's high capacity Field-Programmable Gate Arrays (FPGAs) and the upcoming trend to System-On-Programmable-Chip (SOPC) require novel implementation strategies. These have to overcome long implementation times of traditional synthesis approaches. In this poster, a unique approach for technology mapping of both datapath modules and controller descriptions into Look-Up Table (LUT)-based FPGAs is presented. The proposed method starts at Register-Transfer-Level (RTL) and follows the Library of Parameterized Modules (LPM) standard. The mapping environment includes an implicit state minimization algorithm for FSMs.
We study the problem of concurrent and selective logic extraction in a Boolean circuit. We first model the problem using graph theory, prove it to be NP-hard, and subsequently formulate it as a Maximum-Weight Independent Set (MWIS) problem in a graph. We then use efficient heuristics for solving the MWIS problem. Concurrent logic extraction not only allows us to achieve larger literal savings and smaller area, due to a more global view of the extraction space, but also provides us with a framework for reducing circuit delay.
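One common MWIS heuristic is greedy selection by weight; this is a generic sketch (the paper does not specify which heuristics it uses, so treat this only as an example of the problem being solved):

```python
def greedy_mwis(weights, edges):
    """Greedy Maximum-Weight Independent Set heuristic: repeatedly take
    the heaviest remaining node and block all of its neighbours.
    weights: {node: weight}; edges: iterable of (u, v) pairs."""
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    chosen, blocked = [], set()
    for v in sorted(weights, key=weights.get, reverse=True):
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= adj[v]
    return chosen
```

In the extraction setting, nodes would represent candidate subexpressions weighted by literal savings, with edges between mutually incompatible candidates.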
An effective technology mapping for PAL-based devices is presented in this paper. The aim of the method is to cover a multiple-output function with a minimal number of PAL-based logic blocks. The product terms included in a logic block can be shared by several functions. Experimental results are compared with the classical technology mapping method.
Redundancy removal in combinational circuits has been the subject of many papers over the last decades. Most of these papers work with the relatively small circuits available as benchmarks in the logic-synthesis community. In Magma's BlastFusion and BlastChip software, very large blocks of logic (millions of gates) are handled flat (BlastFusion and BlastChip are registered trademarks of Magma Design Automation). We implemented redundancy removal in a way that allows it to run efficiently (fast, with low memory usage) and robustly (no run-time or memory explosion on any netlist) on industrial designs of up to several million gates. We achieve this without resorting to partitioning. Unlike most published approaches, we do not try to identify all redundancies in a circuit, as an exact solution to this NP-hard problem is infeasible for the large circuits we face. Instead, we try to identify as many as possible in a reasonable run time. We use a carefully engineered combination of Fault Collapsing, Random Test Generation (RTG) and the good old D-algorithm. As the goal is finding redundancies, and not sets of test vectors, these algorithms need changes and adaptations for optimal efficiency and robustness. Fault Collapsing can be more aggressive than for test generation. RTG was implemented with a novel dynamic control of the bit-parallelism employed. The D-algorithm's effort control was not implemented with a traditional backtrack limit, but on a more fine-grained level, to increase robustness. For details, please refer to [1]. Results on 11 industrial netlists are shown in Table 1. All tests were run on a Sun Ultra-80 workstation. A comparison is shown to a state-of-the-art SAT-based approach. Our approach is clearly faster while identifying more redundancies.
A new method, algorithms and a tool for the visualisation of a finite complete prefix (FCP) of a Petri net (PN) or a signal transition graph are presented. A transformation is defined that converts such a prefix into a two-level model. At the top level, it has a finite state machine (FSM) describing modes of operation and the transitions between them. At the low level, there are marked graphs, which can be drawn as waveforms, embedded into the top-level nodes. The models of both levels are abstractions traditionally used by electronics engineers. The resultant model is complete-trace equivalent to the original prefix. Moreover, the branching structure of the latter is preserved as much as possible.
This poster presents the design of complex arbitration modules, such as those required in SoC communication systems. Clock-less, delay-insensitive arbiters are studied with a view to making the design of future GALS or GALA SoCs easier and more practical. This work focuses on high-level modeling and delay-insensitive implementations of low-power and reliable fixed- and dynamic-priority arbiters.
The paper examines the power wasted when accessing an instruction cache that stores only static sequences of instructions. Although the trace cache was first introduced to capture the dynamic characteristics of instructions in execution, a conventional trace cache (CTC) increases the power consumption of the fetch unit. A Sequential Trace Cache (STC) is investigated for its power efficiency in this paper.
Cache memories are known to consume a large percentage of on-chip energy in current microprocessors. For example, [1] reports that the on-chip cache in the DEC Alpha 21264 consumes approximately 25% of the on-chip energy. Both the sizes and the complexities of state-of-the-art caches play a major role in their energy consumption. Direct-mapped caches are, in general, more energy efficient (from a per-access energy consumption viewpoint) as they are simpler than set-associative caches and require no complex line-replacement mechanisms (i.e., there is no decision concerning which line has to be evicted when a new line is to be loaded).
While there exists a large body of compiler-based techniques to manipulate the access pattern of a given code to improve its cache utilization, there are not many compiler techniques that try to improve the cache energy consumption of a given code. Rather, in many cases, a reliance is placed upon the observation that optimizing cache locality also optimizes cache energy. This is true to some extent, as optimizing the locality (performance) of memory accesses reduces the activity between cache and off-chip memory and, consequently, decreases the number of writes into the cache. Recent work (e.g., [2]) also shows that the classical performance-oriented compiler optimizations (e.g., loop-level transformations) can be very effective in reducing overall memory system energy.
Reconfiguration is a very helpful feature that can improve the design life cycle of an embedded system and its quality. Reconfiguration means that both software and hardware parts may be updated in the field. The update of system hardware implies the use of FPGAs in a shipped system. Normally, the update is server-controlled, which means that the active role is taken by an external instance. We present a new automatic reconfiguration approach that stores all system configuration data in XML format. The system itself searches for the related components via a component broker and sets itself up during start-up. A case study shows that, especially when dealing with permanently connected devices, we achieve promising results at a reasonable cost.
In this paper we propose a novel approach for solving the Boolean satisfiability problem by combining software and reconfigurable hardware. The suggested technique avoids instance-specific hardware compilation and, as a result, achieves a higher performance than pure software approaches. Moreover, it permits problems that exceed the resources of the available reconfigurable hardware to be solved.
An essential characteristic of embedded systems is real-time behaviour, but the commonly used specification techniques generally do not consider temporal aspects such as the fulfilment of high-level timing requirements or dynamic reactions to timing violations. We present a new formal time model that fills this gap: timing requirements specify the timing behaviour of real-time systems. Different models allow the specification of clock properties and the relations between clocks. With this time model, timing requirements as well as the desired properties of the involved clocks can be specified within a formal description technique.
MILP-based models are useful for finding optimal schedules and for proving their optimality. Because of the problem complexity, model improvements have to be investigated. We analyze the constraints necessary for precluding resource conflicts, present novel formulations, and evaluate them. The efficiency of the solution process can be improved significantly by selecting the proper formulation.
The major characteristic of a counting unit is its performance. The basic properties that a fast counter must have are: i) a high counting rate, preferably independent of the counter size; ii) a binary output, read on the fly; iii) a sampling rate equal to the counting rate; and iv) a regular implementation suitable for VLSI. For safety-critical applications, the synchronous operation of a fault-secure binary counter makes reading the counter's value difficult and reduces the counting rate proportionally to the counter's size. In this paper an implementation of a fault-secure binary counter using the Johnson-Mobius encoding scheme is presented.
The property called mutual exclusiveness, responsible for the degree of conditional reuse achievable after a high-level synthesis (HLS) process, is intrinsic to the system's behavior. However, it is sometimes only partially reflected in the actual description written by a designer. Our algorithm performs a transformation of the input description that exploits the maximum conditional reuse of the behavior, independently of the description style, allowing HLS tools to obtain circuits with less area.
This paper proposes the use of templatized asynchronous control circuits with single-rail datapaths to create low-power bundled-data non-linear pipelines. First, we adapt an existing templatized control style for 1-of-N rail pipelines, the Pre-Charged Full Buffer (PCFB) [1], to bundled-data pipelines. Then, we present a novel true 4-phase template (T4PFB) that has lower control overhead. Simulation results indicate 12%-44% higher throughput for a pipeline stage equivalent to 8 to 40 gates.
Floorplanning is an important step in IC design. Traditionally, floorplan representations have been divided into slicing and non-slicing structures. We present a heuristic that translates any arbitrary structure into a slicing one that is topologically equivalent to the initial structure after a 1-D compaction.
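Slicing structures, the target of this translation, are commonly represented as postfix (Polish) expressions; a minimal sketch of evaluating the bounding box of such an expression (this is the standard slicing-tree evaluation, not the paper's translation heuristic itself):

```python
def slicing_bbox(postfix, sizes):
    """Evaluate the bounding box (width, height) of a slicing floorplan
    given as a postfix Polish expression: operands are module names,
    'H' stacks two sub-floorplans vertically, 'V' places them side by
    side. sizes maps module name -> (width, height)."""
    stack = []
    for tok in postfix:
        if tok == "H":                       # horizontal cut: stack
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((max(w1, w2), h1 + h2))
        elif tok == "V":                     # vertical cut: side by side
            (w1, h1), (w2, h2) = stack.pop(), stack.pop()
            stack.append((w1 + w2, max(h1, h2)))
        else:
            stack.append(sizes[tok])
    return stack[0]

# hypothetical three-module floorplan: (a beside b), with c below
bbox = slicing_bbox(["a", "b", "V", "c", "H"], {"a": (2, 3), "b": (1, 3), "c": (3, 2)})
```

Any structure translated into slicing form by the heuristic can be evaluated and optimised with simple machinery like this, which is one motivation for preferring slicing representations.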
In this poster, we present a new formulation based on the concept of block partition, such that the shapes of modules can be automatically determined based on the goal of optimization. Experimental results on MCNC benchmarks indicate that zero-dead-space solutions can be obtained for most test cases under our formulation.
In this paper, we study the problem of changing the shapes and dimensions of the flexible modules to fill up the unused area of a preliminary floorplan, while keeping the relative positions between the modules unchanged. The selection of modules and empty spaces is made interactively by the user. We formulate the problem as a mathematical program and use the Lagrangian relaxation technique [1, 2] to solve it. The formulation is such that the dimensions of all the rectangular and non-rectangular modules can be computed efficiently by closed-form equations.
In this paper, we introduce a class of irredundant low-power encoding techniques for memory address buses. For a data address bus, the proposed encoding techniques make use of two working zones in the memory address space, whereas for a multiplexed data and instruction address bus, up to four working zones can be supported. The zones are dynamically updated to increase the saving in switching activity. Our techniques decrease the switching activity of data address and multiplexed address buses by an average of 55% and 77%, respectively, compared with the 25% and 64% achieved by previous methods.
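The working-zone idea can be sketched as follows; this toy is only illustrative (the paper's actual codes, zone count and update policy differ, and a real encoder also signals hit/miss out of band):

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def wz_transitions(addresses, window=8):
    """Toy two-zone address encoder: when an address falls within
    `window` of a zone register, drive a one-hot offset code on the bus
    (at most 2 bit flips between consecutive one-hot codes); otherwise
    drive the raw address and retarget a zone register. Returns the
    total number of bus-line transitions."""
    zones, prev, transitions = [None, None], 0, 0
    for a in addresses:
        hit = next((z for z in zones
                    if z is not None and 0 <= a - z < window), None)
        if hit is not None:
            word = 1 << (a - hit)            # one-hot offset within zone
        else:
            word = a                          # miss: raw address
            zones = [a, zones[0]]             # simple LRU-style update
        transitions += hamming(prev, word)
        prev = word
    return transitions
```

The savings reported in the abstract come from the fact that real address streams mostly walk within a small number of such zones, so most cycles transmit only a low-activity offset code.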
Because of the increasing importance of cross-coupled capacitances in deep-submicron technologies [1], it is of great interest to extend existing high-level power estimation techniques by considering the spatial correlation between adjacent lines. This work addresses the modeling and estimation of power dissipation in on-chip buses based on the statistical properties of data sequences. Using the derived models, a power estimation technique is proposed and evaluated for various coding schemes. For different DSP applications, our results show less than 5% discrepancy with precise bit-level estimations.
The analysis of linear analog amplifiers at the beginning of the design process shows in some cases an unwanted resonance in the amplitude response or an unwanted overshoot in the time domain. It is important for the designer to know design methods for compensating this effect. An approach to symbolic analysis that supports representing an amplifier circuit as a signal-flow graph with feedback is introduced. The method is based on nodal analysis and the mathematical handling of symbolic expressions. Using the proposed approach, the feedback, the open-loop gain and the loop gain can be analyzed and calculated. Through pole-zero analysis of the symbolic loop gain, parameters of the amplifier can be determined for the compensation of the amplitude response.
Parametric fault diagnosis techniques play an important part in the field of analog fault diagnosis. Starting from a series of measurements carried out on a previously selected set of test points, and given the circuit topology and the nominal values of the components, these techniques aim at determining the effective values of the circuit parameters by solving a set of equations that are nonlinear with respect to the component values. Here the role of symbolic techniques in the automation of parametric fault diagnosis of analog circuits is investigated. Since the actual component values are the unknown quantities, a symbolic approach is particularly suitable for the automation of parametric fault diagnosis techniques, as shown, for example, in [1]. Obviously all this is applicable to linear analog circuits or to nonlinear circuits suitably linearized. On the other hand, the present trend is to move as much as possible toward design techniques that lead to linear analog circuits, so this is not so serious a restriction [2].
Noise is an important consideration in the design of integrated circuits. Increased immunity to noise, however, typically comes at the expense of increased delay, so it is very important to achieve adequate noise immunity with a minimum penalty in performance. "Global" noise-immunity schemes can be used when the noise is approximately the same on all nodes in the circuit; but when a few nodes are noisier than others, much better results can be obtained with selective noise-immunity schemes. The Selective Pull-up (SP) technique for dynamic circuits is a method for improving the noise immunity of inputs selectively, so that the least penalty in delay is paid for inputs that intrinsically have higher noise immunity.
Accurately predicting the impact of substrate parasitics in Radio Frequency design with simulations is one of the major concerns in ensuring first-silicon success in a System-on-Chip approach. The practical design experience of a 2 GHz RF front-end circuit (designed in a 0.35 µm SiGe BiCMOS technology), presented here, illustrates how measurement results can be accurately predicted using a substrate-parasitic extractor.
A PLL power model that accurately estimates the power consumption during both lock and acquisition states is presented. The model is within 5% of circuit level simulation (SPICE) values. No significant power overhead (+/- 5% of the power consumed at the final frequency) is incurred during the acquisition process.
In this poster we present statistical-timing driven partitioning for performance optimization. We show that by using the concept of node criticality we can enhance the Fiduccia-Mattheyses (FM) partitioning algorithm to achieve, on average, around 20% improvements in terms of timing, among partitions with the same cut size. By incorporating mechanisms for timing optimization at the partitioning level, we facilitate wire-planning at high levels of the design process.
To reduce the long circuit-level simulation time of ΣΔ modulators, a variety of techniques and tools exist that use high-level models for discrete-time (DT) ΣΔ modulators. There is, however, no rigorous methodology implemented in a tool for the continuous-time (CT) counterpart. Therefore, we have developed a methodology for the high-level simulation of CT ΣΔ modulators and implemented this method in a user-friendly tool. Key features are the simulation speed, accuracy and extensibility. Non-idealities such as finite gain, finite GBW, output impedance and also the important effect of jitter are modelled. Finally, experiments were carried out using the tool, exploring important design trade-offs.
We present a method for automated design of CMOS switched-capacitor filters (SCFs) from user-defined top-level specifications down to component sizes and physical layout. In other words, we present a complete top-down design flow for SCFs. The method is based on careful analysis and modeling of the SCF using analog circuit design and system engineering expertise, formulating design constraints in a special convex form, and numerical optimization (geometric programming).
Full 3D lumped partial-inductance models usually contain a tremendous number of forward coupling terms. To reduce the complexity of simulation and analysis, a simplified model that excludes the forward coupling terms is often adopted in practice [3][4]. This paper addresses the question of whether ignoring forward couplings is always an acceptable choice or whether full 3D models are necessary in certain cases. We show that the significance of the forward coupling inductance depends on various aspects of the design.
This paper presents a slight modification of a recently proposed series-expansion method [1, 2], developed for the electrical modeling of lossy coupled multilayer interconnection lines, that does not involve iterations and yields solutions of sufficient accuracy for most practical interconnections as used in common VLSI chips. We use a Fourier series restricted to cosine functions. The solution for the layered medium is found by matching the potential expressions in the different homogeneous layers with the help of boundary conditions. In the plane of the conductors, the boundary conditions are satisfied only at a finite, discrete set of points (a point-matching procedure).
This paper describes a systematic algorithm for obtaining passive time domain reduced order transmission line macromodels. The proposed algorithm makes use of a new order reduction technique that removes the redundant poles obtained using conventional order reduction methods. The reduced macromodel is passive by construction.
This paper describes an innovative method for EMC-compliant design. The technique helps to optimize the emission level as early as the design phase, and provides noise-related solutions that can be evaluated and integrated into the silicon. The method makes it possible to model the activity of thousand-gate circuits with only two current generators, which represent the supply current consumption in the VDD and VSS rails. This allows EMC evaluation and optimization (conducted noise) for a packaged integrated circuit within its electrical environment.
When a design is manufactured for the first time, it may suffer from timing-related errors that result from inaccuracies in the timing analysis tool used during the design process. Such errors will appear as delay faults in all (or many) of the manufactured chips. In addition, variations that occur during the manufacturing process may cause delay defects that vary across chips. It is necessary to diagnose and correct failures of the first type (in the presence of failures of the second type) before the chip can be manufactured again. This may have to be repeated until the design errors are eliminated.
This paper presents an iDDT test method for embedded CMOS SRAMs. A total of 192 faults were inserted and simulated using parameters from a 0.35 µm process. The SRAM model includes realistic effects such as wire-bonding inductance and resistance parameters as well as bypass capacitance. A sensor is introduced and incorporated into the SRAM cell array to detect abnormal iDDT switching. Figure 1 shows a 1-Mbit SRAM organized into 64 blocks of 128 x 128 cells, with an iDDT sensor monitoring each cell block. The SRAM model includes the following parameters:
- On-chip wire-bonding inductance of 2 nH
- On-chip wire-bond resistance of 0.01 Ohm
- On-chip bypass capacitance of 1 pF
- Bitline capacitance of 3 pF
- Power-line capacitance of 40 pF
The results of the fault simulations comparing the voltage, IDDQ and iDDT test methods are given in Table 1.
We present a novel integrated method for fault detection and localization using wavelet transform of transient current (IDD) waveform. The time-frequency resolution property of wavelet helps us detect as well as localize faults in digital CMOS circuits. Experiments performed on an 8-bit ALU show promising results for both detection and localization.
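As a minimal stand-in for the wavelet analysis described above, one level of the Haar discrete wavelet transform already separates a current waveform into coarse (approximation) and transient (detail) content; the paper's actual signatures and wavelet choice are not reproduced here:

```python
def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: pairwise
    averages give the approximation coefficients, pairwise half-
    differences give the detail coefficients. The detail band localises
    sharp IDD transients in time. Signal length must be even."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail
```

Because each detail coefficient is tied to a specific time window, an abnormally large coefficient both flags a fault (detection) and indicates when in the test sequence it was excited (localization), which is the time-frequency property the abstract relies on.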
This paper presents a test and diagnosis scheme for feedback-type linear analog circuits with minimal added circuitry. For testing, the scheme transforms the circuit-under-test (CUT) into an oscillation circuit by (1) increasing the loop gain of the circuit, and/or (2) reconfiguring the circuit through selectively powering off operational amplifiers of the circuit. This eliminates the need for added global paths, as required in the conventional oscillation test scheme. For diagnosis, the scheme transforms the circuit into a Schmitt-trigger type of circuit with positive feedback. The output of the circuit under an applied triangular input gives signatures which are used to identify faults. Benchmark circuits have been tested with this scheme, and the results show that it is very effective for testing and diagnosing feedback-type linear analog circuits.
This paper introduces the use of the oscillation test technique for MEMS testing. This well-known test technique is here adapted to MEMS. Its efficiency is evaluated based on a case study: A CMOS electromechanical magnetometer.
Logic BIST is about to become a more mainstream test method for IC testing. In some flows, when a failure is encountered, the IC is diagnosed to determine the cause of the failure. Diagnosing failures in Logic BIST is significantly different from doing so in a stored-pattern test methodology. The first step is to determine the failing pattern or interval among the many patterns that were applied. Today this involves a binary search of the tests that were applied with Logic BIST. In this paper we improve on this binary-search strategy to reduce the time taken to isolate the failing patterns by orders of magnitude.
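The baseline binary-search strategy that the paper improves on can be sketched as follows (the oracle `fails(k)` is a hypothetical stand-in for rerunning BIST patterns 0..k on the tester and comparing signatures):

```python
def first_failing_pattern(fails, n_patterns):
    """Baseline binary search for the first failing BIST pattern.
    fails(k) reruns patterns 0..k and returns True iff the compacted
    signature mismatches, so it is monotone in k: once the first
    failing pattern is included, every longer run also fails."""
    lo, hi = 0, n_patterns - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid          # first failure is at mid or earlier
        else:
            lo = mid + 1      # patterns 0..mid all pass
    return lo

# hypothetical device whose first failing pattern is number 37
idx = first_failing_pattern(lambda k: k >= 37, 100)
```

Each probe requires a full tester rerun, which is exactly why reducing the number of search steps, as the paper does, translates directly into diagnosis-time savings.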
Weighted pseudorandom test generation (WPRTG) uses test sequences
characterized by non-uniform distributions of test vectors
in order to increase the detection probability of random resistant
faults. Such non-uniform distributions are characterized by the values
of signal probability of the CUT inputs (weights). Since different
faults may require different distributions, a (small) number of
distributions is typically used [1]. The weights of such distributions
are identified by analyzing the CUT. The corresponding pseudorandom
sequences are typically obtained by inserting a combinational
network between the TPG and the CUT.
Several methodologies have been proposed to calculate the weights. Some approaches make use of deterministic test sequences [2]. Another class of heuristics instead uses numerical optimization strategies to determine the set(s) of weights [1]. More recently, genetic algorithms have been shown to provide a good solution to weight selection [3]. All such methods evaluate only the first-order coefficients of the distribution(s) and may suffer from a few problems. In particular, the detection of some random-resistant faults may strongly depend on signal correlations. Even if the effects of signal correlations can be reduced, some problems remain. Consider, for instance, a fault that can be detected only by a test vector and its complement. Any WPRTG method based on signal-probability evaluation would provide, when targeting such a fault, the same coefficients as a uniform distribution.
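The complement problem is easy to verify numerically: for any vector, the two-vector set consisting of it and its bitwise complement has a per-input signal probability of exactly 0.5, indistinguishable from a uniform distribution at first order. The vector below is arbitrary:

```python
# A fault detected only by v and its complement gives no first-order bias.
v = [1, 0, 1, 1, 0]
v_compl = [1 - b for b in v]
test_set = [v, v_compl]

# Per-input signal probability over the two-vector test set.
signal_prob = [sum(col) / len(test_set) for col in zip(*test_set)]
print(signal_prob)  # → [0.5, 0.5, 0.5, 0.5, 0.5]
```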
Design methodologies for large designs produce circuits that consist of interconnections of functional blocks. If the blocks are large, as in core-based designs, they may be isolated for testing purposes (e.g., by test wrappers) so that different blocks can be tested independently. However, even if a test wrapper exists, it is advantageous to test functional paths that traverse two or more blocks by using test vectors that propagate fault effects through several blocks. This contributes to the testing of defects that cannot be detected if each block is tested separately. One issue that arises when several blocks are tested by the same test is fault isolation. If a test that propagates fault effects through blocks C1 and C2 produces a faulty response on the outputs of C2, the goal of fault isolation is to identify which of C1 and C2 is faulty. Fault isolation is perfect if every faulty response on the outputs of the circuit can be uniquely attributed to a single block; this happens when every pair of faults belonging to different blocks is distinguishable. If faults of different blocks remain indistinguishable, fault isolation fails whenever the circuit-under-test produces one of their common responses. It may appear that tests spanning several non-isolated blocks cannot isolate faults. In this work, we study this issue and demonstrate that perfect or close-to-perfect fault isolation is possible with tests that propagate fault effects through several blocks.
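The perfect-isolation condition stated above has a direct computational form: no two faults from different blocks may produce the same output response. A minimal sketch, with hypothetical fault responses rather than real simulation data:

```python
# Responses are hypothetical bit-strings keyed by (block, fault).
responses = {
    ("C1", "f1"): "0110",
    ("C1", "f2"): "1011",
    ("C2", "g1"): "0101",
    ("C2", "g2"): "1110",
}

def perfect_isolation(responses):
    """True iff every response can be attributed to a single block.
    Two faults of the SAME block sharing a response is harmless."""
    seen = {}
    for (block, fault), resp in responses.items():
        if resp in seen and seen[resp] != block:
            return False  # same response points at two different blocks
        seen[resp] = block
    return True

print(perfect_isolation(responses))  # → True
```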
This paper considers the test-scheduling problem for SoCs. The proposed approach is based on a "sessionless" test scheme: it minimizes the system test time while respecting a power-dissipation limit and test-resource-sharing constraints. Experimental results show that our approach outperforms other related test-scheduling solutions.
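The abstract does not give the algorithm, but the sessionless idea can be illustrated with a simple greedy earliest-fit heuristic (a sketch, not the paper's method): a test may start at any instant, not only at session boundaries, as long as the summed power stays under the limit. Test lengths and power figures are illustrative:

```python
def schedule(tests, power_limit):
    """Greedy earliest-fit placement; tests is a list of (name, length, power).
    A sketch only -- the paper's own scheduler is not reproduced here."""
    placed = []                     # (start, end, power) of scheduled tests
    starts = {}
    for name, length, power in sorted(tests, key=lambda t: -t[2]):
        for t in sorted({0.0} | {end for _, end, _ in placed}):
            # The power profile only changes where a test starts, so it is
            # enough to check t and every start inside [t, t + length).
            points = [t] + [s for s, _, _ in placed if t < s < t + length]
            if all(power + sum(p for s, e, p in placed if s <= x < e)
                   <= power_limit for x in points):
                placed.append((t, t + length, power))
                starts[name] = t
                break
    return starts

tests = [("ram", 4, 3), ("cpu", 3, 2), ("uart", 2, 2)]
print(schedule(tests, power_limit=4))  # ram at 0; cpu and uart overlap from 4
```

With a fixed-session scheme, cpu and uart could not share a window that ram does not also occupy; dropping session boundaries is what shortens the schedule.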
Reusability of tests is crucial for reducing total design time. This raises the problems of test-knowledge transfer, physical test application and test scheduling. We present a formulation of the embedded core-based system-on-chip (SOC) test scheduling problem (ECTSP) as a network transportation problem. The problem is NP-hard, and we present an O(mn(m+2n)) 2-approximation algorithm using results on the single-source unsplittable flow problem. We describe the single-source unsplittable flow problem (UFP) as given in [1]: let G = (V,E) be a capacitated directed graph with edge capacities c : E -> R+, a source s and k commodities with terminals ti and demands di in R+, 1 <= i <= k. A vertex may contain a number of terminals. For each i, we would like to route di units of commodity i along a single path from s to the corresponding terminal so that the total flow through an edge e is at most its capacity c(e).
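The UFP feasibility condition quoted above is easy to state in code: each commodity's whole demand follows one path, and a routing is feasible when the summed flow on every edge respects capacity. The tiny graph and demands below are illustrative only:

```python
# Edge capacities of a small single-source network (illustrative values).
capacity = {("s", "a"): 5, ("s", "b"): 3,
            ("a", "t1"): 4, ("a", "t2"): 2, ("b", "t2"): 3}

def feasible(routing, capacity):
    """routing: list of (path, demand); each demand uses a single path.
    Returns True iff total flow on every used edge is within capacity."""
    load = {}
    for path, demand in routing:
        for u, v in zip(path, path[1:]):
            load[(u, v)] = load.get((u, v), 0) + demand
    return all(load[e] <= capacity[e] for e in load)

routing = [(["s", "a", "t1"], 4), (["s", "b", "t2"], 3)]
print(feasible(routing, capacity))  # → True
```

The hardness of UFP (and hence of ECTSP) comes from choosing the paths, not from checking them; the checker above is only the constraint the 2-approximation must satisfy.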
This poster presents the first truly non-intrusive structural concurrent test approach, aimed at testing partially and dynamically reconfigurable SRAM-based FPGAs without disturbing their operation. This is accomplished by using a new methodology to replicate active Configurable Logic Blocks (CLBs), i.e., CLBs that are part of an implemented function currently in use by the system, releasing each such CLB to be tested in a way that is completely transparent to the system.
This work is part of research on the development of checking methods for approximate calculations performed on the mantissa in floating-point circuits. The problem of residue checking for truncated non-restoring division is solved. The proposed check enables an efficient implementation of truncated division, reducing the hardware and time of an iterative array divider by almost a factor of two.
The overall goal of this work is to define an instruction-level power macro-modeling and characterization methodology for VLIW embedded processor cores. The approach presented in this paper is a major extension of the work previously proposed in [1-3], targeting an instruction-level energy model to evaluate the energy consumption associated with program execution on a pipelined VLIW core. Our first goal is to reduce the complexity of the processor's energy model without reducing the accuracy of the results. The second goal is to show how the energy model can be further simplified by introducing a methodology to automatically cluster the whole instruction set with respect to average energy cost, in order to converge to a highly effective design of experiments for the actual characterization task. The paper also describes the application of the proposed model to a real industrial VLIW core (the Lx architecture developed by HP Labs and STMicroelectronics) to validate the effectiveness and accuracy of the proposed methodology.
For the application of new technologies with ever shorter lifecycles, the availability of the most recent knowledge is mandatory. The intervals within which acquired knowledge bases have to be updated therefore become shorter and shorter. It is well known that software development tools and systems are becoming more and more sophisticated, and the learning expenditure for personnel is growing accordingly. This tendency affects major parts of the electrical and electronics industry, where demand for a qualified workforce already manifests itself in the 'designer crisis'. The combined effects of the increased functionality of new tool generations, the changing application areas of relevant methods due to technological progress, and improved information-exchange facilities lead to increased requirements for further professional training. The microelectronics industry and related business sectors are extremely innovative and knowledge-based. Students, engineers, scientists and others need to develop, transfer and share knowledge. These knowledge processes, and the knowledge flow from researchers and universities to industry and vice versa, need to be strengthened to ensure a leading-edge position for European companies and institutes in this market.
A collaborative design system depends heavily on the chosen collaboration methodology, as well as on its technological infrastructure. This paper presents three data repository technologies and discusses their pros and cons in supporting a collaborative design system.
This paper presents FlexBench, a complete framework for SoC verification at the module and SoC level, both with and without embedded processors. The focus is on increasing the productivity of the verification engineer by providing a framework to reuse verification IP, including parts of the testbench and the test stimulus.
A method for selecting processor core and algorithm combinations for system-on-chip designs is presented. The method uses a mappability concept that complements the performance and cost metrics used in codesign. The mappability estimation is based on an analysis of the correlations between algorithm and core characteristics. The method is demonstrated with an analysis tool, and the experimental results with DSP cores and algorithms match expectations.
The use of behavioural modelling for operational amplifiers has been well known for many years and previous work has included modelling of specific fault conditions using a macro-model. In this paper, the models are implemented in a more abstract form using an Analogue Hardware Description Language (AHDL), VHDL-AMS, taking advantage of the ability to control the behaviour of the model using high-level fault condition states. The implementation method allows a range of fault conditions to be integrated without switching to a completely new model. The various transistor faults are categorised, and used to characterise the behaviour of the HDL models. Simulations compare the accuracy and speed of the transistor and behavioural level models under a set of representative fault conditions.
Cycle-based simulation at RT and gate level realized by a Levelized Compiled Code (LCC) technique is a well-established method for functional verification in processor design. We present a parallel LCC simulation system developed to run on loosely coupled processor systems, allowing significant simulation acceleration. It comprises three parallel simulators and a complex model-partitioning environment. A key idea of our approach is to evaluate circuit-model partitions with respect to the expected parallel simulation run-time and to integrate corresponding cost functions into the partitioning algorithms. Experimental results are given for IBM processor models of different sizes.
The combined effects of increased device complexity and reduced design cycle times create a testing problem: an increasingly large portion of the design time is devoted to testing and verification. Today's EDA tools, moving toward higher levels of abstraction, promise greater designer productivity, resulting in increased design complexity and size.
To reduce the testing and verification time, different high-level approaches have been proposed in the literature [2]. Most of these approaches are based on the definition of an error or fault model applicable at a higher level of abstraction of the description of the system to be implemented.
In this paper we concentrate on the evaluation of error models used in test generation and in functional verification. Evaluation of error models is also an important aspect when fault-injection methodologies are used to evaluate the dependability of complex systems.
The ideas proposed in this work address this evaluation and analysis problem starting from the following requirements:
- the error simulation task should be based only on the original hardware description language primitives;
- the flow from the given specification to the fault simulation should be as automatic as possible.
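The first requirement, that errors be expressed only with the original language primitives, can be illustrated with a toy mutation example (a sketch in Python, not the paper's HDL flow): a behavioural description is perturbed by swapping one primitive operator, and the faulty behaviour is compared against the reference on the same stimuli.

```python
import operator

def design(a, b, op=operator.and_):
    """Toy behavioural description; the reference uses AND.
    `^ 1` inverts the least-significant bit of the result."""
    return op(a, b) ^ 1

# Error models: each mutant replaces the AND primitive with another primitive.
MUTANTS = [operator.or_, operator.xor]

def detected_mutants(stimuli):
    """Count mutants whose output differs on at least one stimulus."""
    return sum(
        1 for m in MUTANTS
        if any(design(a, b) != design(a, b, m) for a, b in stimuli)
    )

exhaustive = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(detected_mutants(exhaustive))  # → 2 (both mutants detected)
```

The ratio of detected mutants to injected ones is the kind of figure such an evaluation flow produces for a test set or a verification suite.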
Characteristic of modern embedded systems such as mobile phones and multimedia terminals is that their design requires several different description techniques: the radio-frequency part of a mobile phone is designed using analog techniques, the signal-processing part can be described using synchronous data-flow, while the protocol stack uses an extended finite-state-machine description model. This heterogeneity poses a challenge to embedded-system design methodologies and has resulted in a search for a System Level Design Language (SLDL) for describing both software and hardware.
We believe that to obtain a good SLDL one first needs to understand what the combination of models of computation means. To this end we are developing a kernel language in which it is possible to use different models of computation. The main contributions of this work are: (1) a common set of concepts that form the basis of the kernel language, (2) a formally defined operational semantics, which also makes it possible to verify designs using e.g. model checking, (3) the explicit use of atomicity, and (4) the introduction of the notion of execution policy.
There appears to be an increasing trend toward the use of the C/C++ language as a basis for next-generation modeling tools and a platform methodology encompassing design reuse. However, even with this convergence, industry suffers from the lack of a single tool, or complete tool-flow methodology, that can implement a top-down design methodology from C to silicon. In this paper we propose such a top-down methodology, focusing on methods to make the design flow smooth, efficient, and easy. The proposed methodology is a pure top-down methodology. We developed it using SpecC [1], VCC [2], and SystemC [3]; we chose SpecC, VCC and SystemC because they are all C-related and each has strong support in at least one area of design. Our proposal is based on our experience of modeling a JPEG encoder with SpecC, SystemC and VCC, and on one internal project implementing architecture exploration for MPEG encoding and decoding using VCC.
The need for high performance in ASIC embedded processors, coupled with aggressive energy and area goals, is pushing researchers and designers toward processor specialisation for a given application domain. In this paper, specialisation is addressed through the introduction of Ad-hoc Functional Units: special arithmetic/logic units added to a traditional architecture to perform domain-specific complex operations.
Shooting, finite-difference or harmonic-balance techniques in conjunction with Newton's method are widely employed for the numerical calculation of limit cycles of oscillators. The resulting set of nonlinear equations is normally solved by a damped Newton's method. In some cases, however, divergence occurs when the initial estimate of the solution is not close enough to the exact one. A two-dimensional homotopy method that overcomes this problem is presented in this paper. The resulting linear set of equations employing Newton's method is under-determined and is solved in a least-squares sense, for which a rigorous mathematical basis can be derived.
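For an under-determined system the least-squares resolution picks the minimum-norm solution of J dx = -f, i.e. dx = J^T (J J^T)^(-1) (-f). A minimal numerical sketch for the one-equation, two-unknown case (the residual and Jacobian row are illustrative, not from the paper):

```python
def min_norm_step(J_row, f):
    """Minimum-norm dx satisfying J_row . dx = -f for a single equation.
    For one row, J J^T reduces to the scalar sum of squared entries."""
    jj = sum(j * j for j in J_row)
    scale = -f / jj
    return [j * scale for j in J_row]

J_row, f = [3.0, 4.0], 5.0
dx = min_norm_step(J_row, f)
print([round(d, 6) for d in dx])        # → [-0.6, -0.8]

# The under-determined equation is satisfied exactly by the step.
residual = sum(j * d for j, d in zip(J_row, dx)) + f
print(abs(residual) < 1e-12)            # → True
```

Among all dx solving the equation, this is the shortest one, which is what keeps the damped Newton iteration well-behaved along the homotopy path.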