Decode Stage

Embedded Processor Architecture

Peter Barry, Patrick Crowley, in Modern Embedded Computing, 2012

Decode Stage to Issue Stage

The decode stages can decode up to two instructions per cycle to keep the two-issue pipeline filled; however, in some cases the decoder is limited to decoding only one instruction per cycle. Cases where the decoder is limited to one instruction per cycle include x87 floating-point instructions and branch instructions. The Intel Atom processor is dual-issue superscalar, but it is not perfectly symmetrical: not every possible pairing of operations can execute in the pipeline at the same time. The instruction queue holds instructions until they are ready to execute in the memory execution cluster, the integer execution cluster, or the FP/SIMD execution cluster.
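A minimal sketch can make the decode-bandwidth constraint concrete. The pairing rules below are hypothetical stand-ins (the Atom's actual pairing constraints are more detailed); the point is only that restricted instruction classes serialize decode to one instruction per cycle.

```python
# Toy model of a dual-issue decoder with pairing restrictions
# (illustrative rules only, not the Atom's actual constraint set).

SINGLE_ISSUE = {"x87", "branch"}  # classes decoded one per cycle

def decode_cycles(instructions):
    """Count cycles needed to decode a stream, two at a time when
    neither instruction of the pair is restricted to single issue."""
    cycles = 0
    i = 0
    while i < len(instructions):
        first = instructions[i]
        if first in SINGLE_ISSUE:
            # A restricted instruction occupies the whole decode cycle.
            i += 1
        elif i + 1 < len(instructions) and instructions[i + 1] not in SINGLE_ISSUE:
            # Pair with the next instruction only if it is unrestricted.
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles

print(decode_cycles(["int", "int", "int", "int"]))    # 2 cycles
print(decode_cycles(["int", "x87", "int", "branch"])) # 4 cycles
```

With no restricted instructions the stream decodes two per cycle; each x87 or branch instruction forces a single-issue decode cycle.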

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123914903000059

DSP Architectures

Robert Oshana, in DSP Software Development Techniques for Embedded and Real-Time Systems, 2006

Branching control flow

Branch conditions are detected in the decode stage of the pipeline. For a branch instruction, the target address is not known until the execute stage of the instruction. The subsequent instruction(s) have already been fetched, which can cause additional problems.

This condition arises not just for branches but also for subroutine calls and returns. An occurrence of a branch instruction in the pipeline is shown in Figure 5.22.

Figure 5.22. Occurrence of a branch in a DSP pipeline

One solution to branch effects is to "flush," or throw away, every subsequent instruction currently in the pipeline. The pipeline is effectively stalled until the target address is known, at which point the processor starts fetching from the branch target.

This condition results in a "bubble" during which the processor does nothing, effectively making the branch a multicycle instruction whose cost equals the depth of branch resolution in the pipeline. Therefore, the "deeper" the pipeline (the more stages there are), the longer it takes to flush the pipeline and the longer the processor is stalled.
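The cost of these bubbles is easy to estimate with back-of-the-envelope arithmetic. The branch frequency and taken fraction below are made-up illustrative numbers, not measurements from the text:

```python
# Back-of-the-envelope cost of branch flushes (illustrative numbers).
# flush_penalty = number of pipeline stages between fetch and the
# point where the branch is resolved, i.e., instructions flushed.

def effective_cpi(base_cpi, branch_freq, taken_frac, flush_penalty):
    """Average cycles per instruction when every taken branch
    flushes `flush_penalty` instructions from the pipeline."""
    return base_cpi + branch_freq * taken_frac * flush_penalty

# 20% branches, 60% taken, branch resolved 3 stages after fetch:
print(effective_cpi(1.0, 0.20, 0.6, 3))   # 1.36
# A deeper pipeline resolving the branch 7 stages after fetch:
print(effective_cpi(1.0, 0.20, 0.6, 7))   # 1.84
```

Doubling the resolution depth more than doubles the branch overhead, which is why deep pipelines lean on delay slots or prediction.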

Another solution is called a delayed branch. This is essentially like the flush solution but has a programming model in which the instructions after the branch are always executed. The number of such delay-slot instructions equals the number of cycles that would otherwise have to be flushed. The developer fills the slots with instructions that do useful work if possible; otherwise the developer inserts NOPs. An example of a delayed branch with three delay slots is the following:

BRNCH Addr     ;Branch to new address
INSTR 1        ;Always executed
INSTR 2        ;Always executed
INSTR 3        ;Always executed
INSTR 4        ;Executed when branch not taken

In a "conditional branch with annul" solution, processor interrupts are disabled and the succeeding instructions are then fetched. If the branch condition is not met, execution proceeds as normal. If the condition is met, the processor annuls the instructions following the branch until the target instruction is fetched.

Some processors implement a branch prediction solution. In this approach, a predictor is used to guess whether the branch will be taken or not. A cache of branch target locations is kept by the processor: the tag is the location of the branch, and the data is the location of the branch target. Control bits can record the history of the branch. When an instruction is fetched, if it hits in this cache, it is a branch and a prediction is made. If the prediction is taken, the branch target is fetched next; otherwise, the next sequential instruction is fetched. When the branch is resolved, if the processor predicted correctly, execution proceeds without any stalls. If it mispredicted, the processor flushes the instructions past the branch instruction and fetches the correct instruction.
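The tagged cache of targets plus history bits described above can be sketched in a few lines. This is a minimal model (fully associative, no aliasing or replacement) using a common 2-bit saturating counter as the history bits; the specific counter scheme is an assumption, not something the text prescribes:

```python
# Minimal sketch of a branch target cache with 2-bit saturating
# history counters (simplified: fully associative, no replacement).

class BranchPredictor:
    def __init__(self):
        self.btb = {}  # branch PC -> [target PC, 2-bit counter]

    def predict(self, pc):
        """Return the predicted next PC; fall through to the next
        sequential instruction for misses or not-taken predictions."""
        entry = self.btb.get(pc)
        if entry is not None and entry[1] >= 2:  # counters 2,3 = taken
            return entry[0]
        return pc + 4                            # next sequential

    def resolve(self, pc, taken, target):
        """Update the history bits once the branch outcome is known."""
        entry = self.btb.setdefault(pc, [target, 1])  # weakly not-taken
        entry[0] = target
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)

bp = BranchPredictor()
print(hex(bp.predict(0x100)))    # 0x104: unknown branch, fall through
bp.resolve(0x100, True, 0x200)   # counter moves 1 -> 2
print(hex(bp.predict(0x100)))    # 0x200: now predicted taken
```

A mispredict would be detected at resolve time by comparing the predicted PC against the actual outcome, triggering the flush described above.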

Branch prediction reduces the number of bubbles, depending on the accuracy of the prediction. In this approach the processor must not change the machine state until the branch is resolved. Because of the significant unpredictability this introduces, branch prediction is not used in DSP architectures.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780750677592500077

Microarchitecture

David Money Harris, Sarah L. Harris, in Digital Design and Computer Architecture (Second Edition), 2013

7.5.2 Pipelined Control

The pipelined processor takes the same control signals as the single-cycle processor and therefore uses the same control unit. The control unit examines the opcode and funct fields of the instruction in the Decode stage to produce the control signals, as was described in Section 7.3.2. These control signals must be pipelined along with the data so that they remain synchronized with the instruction.

The entire pipelined processor with control is shown in Figure 7.47. RegWrite must be pipelined into the Writeback stage before it feeds back to the register file, just as WriteReg was pipelined in Figure 7.46.

Figure 7.47. Pipelined processor with control

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123944245000070

Microarchitecture

Sarah L. Harris, David Money Harris, in Digital Design and Computer Architecture, 2016

7.5.2 Pipelined Control

The pipelined processor takes the same control signals as the single-cycle processor and therefore uses the same control unit. The control unit examines the Op and Funct fields of the instruction in the Decode stage to produce the control signals, as was described in Section 7.3.2. These control signals must be pipelined along with the data so that they remain synchronized with the instruction. The control unit also examines the Rd field to handle writes to R15 (PC).

The entire pipelined processor with control is shown in Figure 7.47. RegWrite must be pipelined into the Writeback stage before it feeds back to the register file, just as WA3 was pipelined in Figure 7.45.

Figure 7.47. Pipelined processor with control

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128000564000078

Microarchitecture

Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022

7.5.2 Pipelined Control

The pipelined processor uses the same control signals as the single-cycle processor and, therefore, has the same control unit. The control unit examines the op, funct3, and funct7₅ fields of the instruction in the Decode stage to produce the control signals, as was described in Section 7.3.3 for the single-cycle processor. These control signals must be pipelined along with the data so that they remain synchronized with the instruction.

The entire pipelined processor with control is shown in Figure 7.51. RegWrite must be pipelined into the Writeback stage before it feeds back to the register file, just as Rd was pipelined in Figure 7.50. In addition to R-type ALU instructions, lw, sw, and beq, this pipelined processor also supports jal and I-type ALU instructions.

Figure 7.51. Pipelined processor with control
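The key idea above, that control signals travel through pipeline registers in lockstep with their instruction, can be sketched as a shift through stage registers. The stage names follow the text, but the decode table and program are hypothetical stand-ins, not the book's full truth table:

```python
# Sketch of control words advancing one pipeline stage per cycle so
# that RegWrite is consumed only when its instruction reaches
# Writeback (stage list per the text; decode table is illustrative).

STAGES = ["Decode", "Execute", "Memory", "Writeback"]

def decode_control(op):
    # Hypothetical per-opcode control words, produced once in Decode.
    return {"lw":  {"RegWrite": 1, "MemWrite": 0},
            "sw":  {"RegWrite": 0, "MemWrite": 1},
            "add": {"RegWrite": 1, "MemWrite": 0}}[op]

def run(program):
    """Shift control words downstream each cycle; record which
    instructions assert RegWrite at Writeback."""
    pipeline = [None] * len(STAGES)
    writes = []
    for cycle in range(len(program) + len(STAGES) - 1):
        retiring = pipeline[-1]
        if retiring and retiring["ctrl"]["RegWrite"]:
            writes.append(retiring["op"])
        pipeline[1:] = pipeline[:-1]            # control moves with data
        op = program[cycle] if cycle < len(program) else None
        pipeline[0] = {"op": op, "ctrl": decode_control(op)} if op else None
    if pipeline[-1] and pipeline[-1]["ctrl"]["RegWrite"]:
        writes.append(pipeline[-1]["op"])       # drain the last stage
    return writes

print(run(["add", "sw", "lw"]))   # ['add', 'lw']: sw never writes a register
```

Because the control word rides along in each stage register, RegWrite arrives at the register file in the same cycle as the instruction's result, exactly the synchronization the text describes.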

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128200643000076

Fault Detection via Redundant Execution

In Architecture Design for Soft Errors, 2008

6.9.1 A Simultaneous Multithreaded Processor

SMT is a technique that allows fine-grained resource sharing among multiple threads in a dynamically scheduled superscalar processor [18]. An SMT processor extends a standard superscalar pipeline to execute instructions from multiple threads, possibly in the same cycle. To facilitate the discussion in this section, a specific SMT implementation is used (Figure 6.10). Mukherjee et al. describe an alternate implementation of SMT in a commercial microprocessor design that was eventually canceled [6]. In the SMT implementation in Figure 6.10, the fetch stage feeds instructions from multiple threads (one thread per cycle) to a fetch/decode queue. The decode stage picks instructions from this queue, decodes them, locates their source operands, and places them into the register update unit (RUU). The RUU serves as a combination of global reservation station pool, rename register file, and reorder buffer. Loads and stores are broken into an address generation portion and a memory reference portion. The address generation portion is placed in the RUU, while the memory reference portion is placed into a similar structure, the load/store queue (LSQ) (not shown in Figure 6.10).

Figure 6.10. Sharing of the RUU between two threads in an SMT processor.

Reprinted with permission from Reinhardt and Mukherjee [10]. Copyright © 2000 IEEE.

Figure 6.10 shows instructions from two threads sharing the RUU. Instructions are issued from the RUU to the execution units and written back to the RUU without regard to thread identity. The processor provides precise exceptions by committing results for each thread from the RUU to the register files in program order. Tullsen et al. [17] showed that optimizing the fetch policy (the policy that determines the thread from which instructions are fetched in each cycle) can improve the performance of an SMT processor. The best-performing policy Tullsen et al. examined was named ICount. The ICount policy counts the number of instructions from active threads that are currently in the instruction buffers and fetches instructions from the thread that has the fewest instructions. The assumption is that the thread with the fewest instructions moves instructions through the processor quickly and hence makes the most efficient use of the pipeline.
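The ICount selection itself is a one-line decision. The sketch below assumes made-up buffer occupancies for illustration and breaks ties toward the lower-numbered thread (a detail the text does not specify):

```python
# Sketch of the ICount fetch policy: each cycle, fetch from the
# thread with the fewest instructions in the pre-execute buffers.

def icount_pick(buffer_counts):
    """Return the id of the thread with the fewest buffered
    instructions; ties go to the lower-numbered thread (assumed)."""
    return min(buffer_counts, key=lambda t: (buffer_counts[t], t))

# Thread 1 has the fewest instructions in flight, so it fetches next.
occupancy = {0: 12, 1: 5, 2: 9}
print(icount_pick(occupancy))   # 1
```

A thread stalled behind long-latency operations accumulates buffered instructions and is automatically deprioritized, which is the mechanism behind ICount's efficiency.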

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123695291500082

An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques

Ivan Ratković, ... Veljko Milutinović, in Advances in Computers, 2015

Clock Gating

Pipeline blocks are clock gated either when they are known to be idle or when they are presumed to be doing useless work. The first approach (deterministic clock gating) is more conservative and does not hurt performance, while the second is "riskier" and can degrade performance in exchange for, of course, significant power savings.

Deterministic Clock Gating

The idea of applying Deterministic Clock Gating (DCG) to the pipeline is to clock gate the structures that are known to be idle, decreasing EDP without hurting performance. Li et al. [30] give a detailed description of DCG in a superscalar pipeline. They consider a high-performance implementation using dynamic domino logic for speed, which means that besides the latches, the pipeline stages themselves must be clock gated.

The idea is to determine ahead of time whether a latch or pipeline stage is not going to be used. Figure 8 depicts a pipeline in which the clock-gate-able parts are shown shaded. The Fetch and Decode stages and their latches, for example, are never clock gated, since instructions are needed nearly every cycle, while there is ample time to clock gate functional units.

Figure 8. Deterministic Clock Gating. Pipeline latches and pipeline stages that can be clock gated are shown shaded.

Source: Adapted from Ref. [56].

DCG was evaluated with Wattch [30]. By applying DCG to all the latches and stages described above, the authors report average power savings of 21% and 19% for the SPEC2000 integer and floating-point benchmarks, respectively. They found DCG more promising than pipeline balancing, another clock gating technique.

Although this work targets a scalar architecture, it is also applicable to other kinds of architectures. An example of an efficient DCG application on functional units for energy-efficient vector architectures can be found in Ref. [57].

Improving Energy Efficiency of Speculative Execution

Although they are necessary in order to keep functional units busy and to achieve high Instructions Per Cycle (IPC), branch predictors and speculative execution are fairly power hungry. Besides the direct power consumption overhead of supporting branch prediction and speculative execution (e.g., prediction structures, support for checkpointing, increased run-time state), there is also the issue of incorrect execution.

Manne et al. [31] attempt to solve this energy inefficiency of speculation by proposing an approach named pipeline gating. The idea is to gate and stall the whole pipeline when the processor heads down very uncertain (execution) paths. Since pipeline gating refrains from executing when confidence in the branch prediction is low, it can hardly hurt performance. There are two cases when it does: when execution would eventually have turned out to be correct but is stalled, and when incorrect execution had a positive effect on overall performance (e.g., because of prefetching). On the other hand, it can effectively avoid a considerable amount of incorrect execution and save the corresponding power.

The confidence of a branch prediction in Ref. [31] is measured in two ways: the fraction of mispredicted branches that can be detected as low confidence, and the fraction of low-confidence branch predictions that turn out to be wrong. The authors find that if more than one low-confidence branch enters the pipeline, the chances of going down the wrong path increase significantly. They suggest several confidence estimators, whose details can be found in Ref. [31]. In their evaluation, the authors show that with certain estimators applied to gshare and McFarling predictors and a gating threshold of 2 (the number of in-flight low-confidence branches), a significant part of incorrect execution can be eliminated without perceptible impact on performance. Of course, the earlier the pipeline is gated, the more incorrect work is saved; however, this also implies a larger penalty for stalling correct execution.
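The gating decision described above amounts to tracking in-flight low-confidence branches against a threshold. The event stream below is invented for illustration; the threshold of 2 follows the study summarized above:

```python
# Sketch of pipeline gating: fetch stalls whenever the number of
# unresolved low-confidence branches in flight exceeds a threshold.

GATING_THRESHOLD = 2  # threshold used in the study summarized above

def gated_cycles(branch_events, threshold=GATING_THRESHOLD):
    """branch_events: +1 when a low-confidence branch enters the
    pipeline, -1 when one resolves, 0 otherwise. Returns how many
    cycles fetch would be gated."""
    in_flight = 0
    gated = 0
    for ev in branch_events:
        in_flight += ev
        if in_flight > threshold:
            gated += 1
    return gated

# Three low-confidence branches pile up before any resolve:
print(gated_cycles([+1, 0, +1, +1, 0, -1, -1, 0, -1]))   # 2
```

Selective throttling, discussed next, replaces this all-or-nothing stall with progressively gentler throttling deeper in the pipeline.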

Aragón et al. [32] did similar work but with a slightly different approach. Instead of having a single mechanism to stall execution as Manne et al. did, Aragón et al. examine a range of throttling mechanisms: fetch throttling, decode throttling, and selection-logic throttling. As throttling is performed deeper in the pipeline, its impact on execution diminishes. Thus, fetch throttling at the start of the pipeline is the most aggressive in disrupting execution, starving the whole pipeline of instructions, while decode or selection-logic throttling deeper in the pipeline is progressively less aggressive. This is exploited in relation to branch confidence: the lower the confidence of a branch prediction, the more aggressively the pipeline is throttled. The overall technique is called "selective throttling."

Pipeline gating, being an all-or-nothing mechanism, is much more sensitive to the quality of the confidence estimator, because the impact on performance is severe when the confidence estimation is wrong. Selective throttling, on the other hand, is able to better balance confidence estimation against performance impact and power savings, yielding a better EDP for representative SPEC 2000 and SPEC 95 benchmarks.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0065245815000303

Architectural Vulnerability Analysis

In Architecture Design for Soft Errors, 2008

3.9 Computing AVF with Little's Law

As was seen in the previous section, a structure's AVF can be expressed as the ratio of the average number of ACE bits resident in the structure in a cycle to the total number of bits in that structure. Little's law [7] is a basic queuing theory equation that enables one to compute the average number of ACE bits resident in a structure. Little's law can be stated as N = B × L, where N is the average number of bits in a box, B is the average bandwidth per cycle into the box, and L is the average latency of an individual object through the box (Figure 3.5a), assuming none of the objects flowing into the box is lost or removed. Little's law can also be applied to a subset of the bits. Hence, by applying it to ACE bits (Figure 3.5b), one gets the average number of ACE bits in a box as the product of the average bandwidth of ACE bits into the box (B_ACE) and the average residence cycles of an ACE bit in the box (L_ACE). Thus, one can express the AVF of a structure as

Figure 3.5. Illustration of Little's law to compute AVF. (a) Flow of ACE and un-ACE instructions through a hardware structure, such as an instruction queue. (b) Flow of only ACE instructions through the structure.

AVF_structure = (Average number of ACE bits in the structure in a cycle) / (Total number of bits in the structure) = (B_ACE × L_ACE) / (Total number of bits in the structure)

This is a powerful equation that not only allows one to rapidly do back-of-the-envelope calculations of AVF but also provides insight into the parameters AVF depends on.

EXAMPLE

To quickly compute the approximate AVF of a 32-entry instruction queue, let us categorize instructions into ACE and un-ACE and ignore the ACE bits in un-ACE instructions. Assume that the instructions per cycle (IPC) of ACE instructions is 2 and the average delay of an instruction in the instruction queue is 5 cycles. What is the approximate AVF of the instruction queue?

SOLUTION B_ACE = 2 instructions per cycle, L_ACE = 5 cycles. Then, AVF = 2 × 5/32 = 10/32 ≈ 31%.

EXAMPLE

Compute the AVF of a branch commit table in a microprocessor. At the decode stage, every decoded branch and its associated information are entered into the branch commit table. When the branch commits and is deemed to have been mispredicted, the data in the commit table is accessed to recover the state of the pipeline and restart the pipeline from the correct-path instruction after the branch. Assume an entire entry in the branch commit table is either ACE or un-ACE, the average IPC of the machine is 2, the decode-to-commit delay (including queueing delay) is 30 cycles, one out of 5 instructions is a branch, the branch misprediction rate is 3%, and the branch commit table has 64 entries.

SOLUTION At any instant, there are four types of entries in the branch commit table: ACE mispredicted branch entries that will be used for recovery, un-ACE entries for branches that are predicted correctly, wrong-path un-ACE entries, and idle un-ACE entries. There is one branch per 5 committed instructions. The branch misprediction rate is 3%, so 3 out of 100 branches are mispredicted; in other words, 3 out of 500 instructions are mispredicted branches. The mispredicted branch IPC is then 2 × (3/500) = 0.012. Since the decode-to-commit delay (including queueing delay) is 30 cycles, the average number of mispredicted branch instructions in the commit table is 0.012 × 30 = 0.36. The total number of entries in the commit table is 64. Hence, the AVF = 0.36/64 ≈ 0.56%.
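Both worked examples reduce to the same one-line computation, which makes them easy to check:

```python
# Little's-law AVF estimate:
# AVF = (ACE bandwidth per cycle * ACE residence latency) / size.

def avf(b_ace, l_ace, total_size):
    """Back-of-the-envelope AVF of a structure (size in entries or
    bits, matching the units of the ACE bandwidth)."""
    return b_ace * l_ace / total_size

# Instruction queue: ACE IPC of 2, 5-cycle residence, 32 entries.
print(f"{avf(2, 5, 32):.1%}")              # 31.2%

# Branch commit table: mispredicted-branch IPC 2 * 3/500 = 0.012,
# 30-cycle decode-to-commit delay, 64 entries.
print(f"{avf(2 * 3 / 500, 30, 64):.2%}")   # 0.56%
```

The helper also makes the sensitivities discussed below explicit: AVF scales linearly with ACE bandwidth and with residence latency, and inversely with structure size.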

Although Little's law is useful for computing the AVF of hardware structures, it must be applied carefully. Little's law cannot be applied if the ACE objects flowing through a structure change. For example, Little's law cannot be directly applied to an adder, which takes two operands as inputs and produces one output. In this case, Little's law can be applied separately to the input and output datapath latches.

3.9.1 Implications of Little's Law for AVF Computation

Using Little's law to compute the AVF gives four important insights into the computation of AVF and the factors AVF depends on. First, AVF is a function of the architecturally sensitive area of exposure to radiation. This is expressed through Little's law by multiplying the number of incoming ACE bits into a structure by the delay experienced in the structure. "Sensitive" area in this context refers to the fraction of area that on average is occupied by ACE objects.

Second, IPC alone may not determine the AVF of microprocessor pipeline structures. Consider the instruction queue in a processor pipeline, and define ACE IPC and ACE latency as the IPC of ACE instructions and the latency of ACE instructions through the instruction queue, respectively. The instruction queue usually has the same IPC as the retire unit in a processor pipeline. A benchmark with high IPC can have high ACE IPC but low ACE latency, because instructions may be flowing quickly through the pipeline. Similarly, a benchmark with low IPC can have low ACE IPC, but its instructions may be stalled behind cache misses, making ACE latency high. Consequently, both these benchmarks can have very similar AVFs for the instruction queue in the pipeline.

Third, it is not unusual to assume that a structure's AVF decreases if objects move faster through the structure, thereby reducing the exposure time to radiation. However, the AVF may not actually decrease if there is a corresponding increase in the bandwidth of ACE objects flowing into the structure.

Fourth, one can relate the AVFs of different structures using Little's law. If objects flow from a structure A to a structure B, then in the steady state the average bandwidth of ACE objects through both A and B will usually be the same. To compute the AVF, however, one needs the average delay through A and B and the size of each structure, which may differ. Nevertheless, in the degenerate case where a sequence of single-bit storage cells with unit delay is connected (e.g., a sequence of flow-through latches in a datapath), the AVF of each of these storage cells is the same.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123695291500057

An Engineering Environment for Hardware/Software Co-Simulation

David Becker, ... Stephen M. Tell, in Readings in Hardware/Software Co-Design, 2002

4 Interface implementation

The NIU host monitor link provides simple examples of the two parts of the co-simulation environment specific to each link: the hardware interface functions and the simulation module. The NIU processor firmware is a more complex example. The next two subsections describe these two co-simulation links and examine the considerations in designing both the communication model and the interface at each endpoint.

Later, we describe the implementation of the IPC simulator extensions, which are largely independent of the interface issues handled in the other two components. Our original implementation of this environment required an IPC extension customized for each co-simulation link. The current environment employs a general-purpose simulator extension that is reused for all co-simulation links.

4.1 NIU host monitor code

The NIU and the other boards in Pixel-Planes 5 send messages to each other over a high-speed ring. The ring packets are of unlimited length, sent a word at a time with a data-valid signal and completed by an end-of-message signal. The IPC message format used by the host monitor co-simulation link reflects how the real ring packets function. Messages to the simulator are "put X in the write register and assert data-valid" and "assert the end-of-message signal". Return messages are "X was written to the read register" and "end of message".

The format for a host monitor message is two words long, where the first word is the message type identifier and the second is a data word. A simulated ring packet is sent by successively sending messages of type WRITEDATA for each data word in the ring packet. The hardware interface function that writes these data words to a register is replaced with a function that sends WRITEDATA messages to the simulation, followed by a WRITEEND message. When the host expects a ring packet from the NIU, it waits for READDATA messages followed by a READ-END message from the simulator, which together form the simulated ring packet.
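The framing just described is simple enough to sketch. The numeric type identifiers below are hypothetical (the paper does not give the encoding); only the two-word [type, data] layout and the WRITEDATA/WRITEEND sequencing follow the text:

```python
# Sketch of the host-side framing: a simulated ring packet becomes a
# sequence of WRITEDATA messages followed by a WRITEEND marker.
# Type identifier values are assumptions for illustration.

WRITEDATA, WRITEEND = 1, 2

def packet_to_messages(words):
    """Convert a ring packet (list of data words) into the two-word
    IPC messages the co-simulation link sends to the simulator."""
    msgs = [[WRITEDATA, w] for w in words]
    msgs.append([WRITEEND, 0])   # end-of-message; data word unused
    return msgs

print(packet_to_messages([0xAA, 0xBB]))
# [[1, 170], [1, 187], [2, 0]]
```

The simulator side inverts this mapping, clocking one data word onto the connector pins per WRITEDATA message and asserting end-of-message on WRITEEND.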

IPC messages are sent and received by a Verilog simulation module that represents the ring board connector in the hardware description. This module checks each clock cycle for an incoming IPC message and, when one arrives, asserts the appropriate signals on the simulated NIU ring connector pins. If the module sees outgoing data on the connector, it sends a message containing that data back to the host monitor program. The Verilog module is responsible for the correct signal timing on its input and output wires.

Figure 3. Ring connector module managing IPC messages

The host monitor software component of the NIU system was written using this co-simulation interface to exchange messages with the simulated hardware. When the real hardware arrived, the functions to read and write words of a ring packet were rewritten to use the real hardware registers rather than send IPC messages. The rest of the code remained the same and ran correctly, since it had already been tested.

4.2 NIU processor firmware

The processor firmware controls the NIU data pipelines through several control, status, and data registers mapped into the processor address space. In addition, several data pipeline events interrupt the processor. In co-simulation, the firmware sends messages to the simulator of the form "poke X into address A", "peek at what is in address A", or "trap handler done". The simulator sends messages of the form "address A holds X" or "interrupt X has occurred".

The firmware program uses the hardware interface functions poke() and peek() for all operations on memory-mapped I/O registers. These two functions are written to send IPC messages during co-simulation and are replaced with simple macros when using real hardware. The processor firmware has a function called trap() for trap handling, where the traps of interest are the hardware interrupts. In the real system, trap() is called from assembly language via the trap vector table. In co-simulation, an interface function checks for incoming messages and calls trap() asynchronously.

The simulator module of this link is a behavioral model of the SPARC processor's internal pipeline, which generates accurate control signals for testing the memory and I/O devices on our board. Our model normally fetches instructions from memory but treats all instructions as no-operations (NOPs) instead of decoding them. The decode stage of the pipeline checks for IPC messages from the firmware. If a POKE or PEEK message is received, a store or load cycle is executed instead of a NOP. When a load cycle completes, a PEEKREPLY message with the result of the load is sent back to the firmware program. When external hardware asserts the interrupt lines of the processor model, an INTERRUPT message is sent to the program, indicating which interrupt occurred. When the firmware returns a TRAPDONE message, a return-from-trap cycle is simulated.

An INTERRUPT message needs to interrupt the firmware asynchronously to allow trap handling to be realistically simulated. We used the Unix signal mechanism to duplicate this asynchronous behavior. The co-simulation support library linked with the firmware program requests that Unix send a signal when a message arrives from the simulator. This signal interrupts the firmware program and moves execution to a signal handling function that is part of the support code. This handler reads an INTERRUPT message and calls the firmware's trap handling function. If a PEEK is in progress, the signal handler waits for the PEEKREPLY before calling trap(), because PEEK and PEEKREPLY constitute a single indivisible operation. When the trap handler completes, a TRAPDONE message is sent to the simulator and Unix moves firmware execution back to where it was interrupted.

The most complex part of the firmware co-simulation was writing the processor pipeline model. The pipeline model, however, would also have to be written for other simulation strategies, and only a small part of it is concerned with sending and receiving IPC messages. The bootstrap code and the assembly language parts of trap handling were written after the board arrived. The firmware was linked with this assembly and ran on the new hardware just as it did in the simulation, only a bit faster.

4.3 Communication through the Verilog simulator

The Verilog hardware description language does not incorporate any interprocess communication facilities. It does have a Programming Language Interface (PLI) that allows user-written C or C++ code to be called from within the simulation [S]. We used this PLI facility to add extensions to the simulator so that remote software programs could make a TCP connection to the simulator program and communicate with modules within the simulation. The first strategy we used involved specialized extensions for each co-simulation link. From that experience, we devised a general mechanism that we feel is much easier to use and describe.

The Verilog PLI facility associates a user-written subroutine with a Verilog task name. Invoking a user-written task in a Verilog program causes the simulation to call the C++ function associated with that task name. These functions are called under several circumstances. One possible configuration is to have the C++ function called whenever an argument to the Verilog task changes during simulation. Our first communication mechanism is based on this form of user function.

Two tasks were added to Verilog in the original solution: $sparc() and $ring(). In the processor module, the pipeline executes NOP instructions unless the $sparc() task signals through one of its arguments that the firmware requests execution of an LD, ST, or RETT instruction. The Verilog module and $sparc() communicate through a small set of arguments, shown in Figure 4.

Figure 4. Parameters passed to $sparc() by the cy7c611() module.

Since the clock signal is an argument, the C++ code associated with $sparc() is called every clock cycle. During each cycle, the C++ code can read and write the signal lines connected to it. On the initial cycle, it connects to the firmware process. When the reset signal is negated, a RESET message is sent to the firmware process. On subsequent cycles, $sparc() checks for messages from the firmware process. The C++ code sets the $sparc() parameters to the instruction, address, and data as needed for that cycle. When the cycle completes, $sparc() is signalled with the results so that the C++ code can send a message back to the firmware process. With the exception of the clock, these arguments do not correspond to any actual electrical signals but are only a communication mechanism between the simulation module and the C++ code.

The ring port simulation module functions in a similar fashion, associating $ring() with its C++ code. When $ring() is initially called, it creates a socket to which the monitoring program can connect. Thereafter, on every clock cycle $ring() checks to see whether the control program has connected. Once connected, $ring() relays messages between the simulated hardware port and the monitor program. When a WRITE message is received, it begins clocking the message into the port one word per cycle. When the hardware sends data to the port, $ring() stores it until the end-of-message signal is asserted. At the end of the message, a READ message is sent over the socket to the monitoring process.

A custom addition to the Verilog simulator for each link had several problems. One was writing a third piece of custom code for each link; another was the difficulty of modifying the simulator program. Further, the original C++ code performed some link-specific operations on the messages which were easily moved into Verilog or deemed unnecessary. Verilog is better suited for checking signals each clock cycle and sending acknowledgment signals. C++ is suited for making the networking system calls, so our new implementation extends Verilog with tasks only for generalized interprocess communication (IPC). The form of task/user-function association used with these tasks arranges for the C++ function to be called every time its task is invoked during simulation.

The IPC facility added to Verilog allows modules to create TCP connections to remote Unix programs during the simulation. The $makeserver() task creates a Unix socket at the specified TCP port on the machine running the simulation. The software component of the co-simulation can connect to the simulation by using the IP address of the computer running Verilog and the TCP port number being served.
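On the C++ side, $makeserver() plausibly boils down to the standard POSIX listen-socket sequence. The sketch below is an assumption about that setup (the function name `make_server` is invented, and the real task would also register the descriptor with the simulator for later polling).

```cpp
// Hedged sketch of the socket setup behind $makeserver(): open a listening
// TCP socket on the given port (0 asks the OS for an ephemeral port).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns the listening fd, or -1 on failure; *bound_port receives the
// port actually bound, which remote programs connect to.
int make_server(unsigned short port, unsigned short* bound_port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0 ||
        listen(fd, 1) < 0) {            // one remote program at a time
        close(fd);
        return -1;
    }
    socklen_t len = sizeof addr;        // recover the ephemeral port number
    getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len);
    if (bound_port) *bound_port = ntohs(addr.sin_port);
    return fd;
}
```

The backlog of 1 mirrors the text's constraint that only one remote program connects to a server at a time.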

The $send() and $recv() tasks communicate with programs that connect to the TCP ports set up by $makeserver(). The send task takes a variable number of arguments, all of which are put into a packet and sent to the remote program. If the server is not connected to a remote program, the $send() call is ignored. Each call of the $recv() task checks for an arriving message. If a message has arrived, the arguments to $recv() are filled with the message data. When no remote program is connected or no incoming data is waiting, $recv() sets its first data parameter, conventionally used for a message type identifier, to zero and ignores the other parameters.
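The send/receive semantics can be modeled without any networking. In this sketch the TCP connection is replaced by an in-memory queue; the packet format and function names are assumptions, but the behavior (drop on disconnect, type slot zeroed when nothing is waiting) follows the description above.

```cpp
// Model of the $send()/$recv() conventions over an in-memory channel.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using Packet = std::vector<uint32_t>;

// $send(): bundle the arguments into one packet for the remote program;
// if no remote program is connected, the call is simply ignored.
void ipc_send(std::deque<Packet>& channel, bool connected, const Packet& args) {
    if (!connected) return;
    channel.push_back(args);
}

// $recv(): fill `args` from the next packet if one arrived; otherwise set
// the first parameter (the message-type slot, by convention) to zero and
// leave the other parameters untouched.
void ipc_recv(std::deque<Packet>& channel, bool connected, Packet& args) {
    if (!connected || channel.empty()) {
        if (!args.empty()) args[0] = 0;   // type 0 means "no message"
        return;
    }
    Packet msg = channel.front();
    channel.pop_front();
    for (size_t i = 0; i < args.size() && i < msg.size(); ++i)
        args[i] = msg[i];
}
```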

Some software components need to know the state of the simulation when they connect. The $connect() task returns true if the associated server is connected to a remote program. Start-up messages can be sent when the connection is established. The $disconnect() task terminates the current connection of the server and can be preceded by shutdown messages. Only one remote program can connect to a server at a time. After a server disconnects, a new connection can be polled for with the connect call. This mechanism is used by our processor co-simulation to start the firmware process when the processor reset signal is negated.

The software developed for the co-simulation environment includes Verilog simulator extensions, Verilog simulation modules and modified versions of the hardware interface functions. The Verilog modules for the processor and ring port were moderately complex, requiring about 700 lines of commented code for behavioral simulation. Only a small portion of this code is concerned with interprocess communication; the rest would have to be written for other simulation strategies as well. The IPC tasks added to the interpreter were written in about 200 lines of commented C++ code, and the simulation interface functions for the software components required 300 lines of commented C++ code.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9781558607026500508

Design automation for application-specific on-chip interconnects: A survey

Alessandro Cilardo, Edoardo Fusella, in Integration, 2016

2.2 Crossbar-based architectures

The crossbar topology (also called bus matrix) is a multi-layered communication architecture with multiple buses connecting multiple inputs to multiple outputs in a matrix scheme. Fig. 3 shows the internal architecture of the crossbar in Fig. 2(b). The input stage is equipped with buffers in order to handle interrupted bursts as well as to register and store incoming transfers if receiving slaves cannot accept them immediately. The decode stage generates select signals for the appropriate slaves. Unlike traditional shared bus architectures, arbitration is not centralized but rather distributed, with every slave having its own arbiter. Buses can operate concurrently as long as they do not refer to the same resources (we will call this type of parallelism local parallelism). A crossbar connecting every input with every output is called a full crossbar. However, based on the actual connectivity required by the application, we can specify a connectivity matrix indicating only a subset of actually connected input/output pairs. The resulting architecture is called a partial crossbar. Since on-chip interconnects are not constrained by a limited off-chip pin count, commercial solutions such as AMBA™ AXI™ by ARM® [16] introduced separate address/data channels in order to increase performance. In such architectures, different topologies may be specified for the two channels: a Multiple-Address Multiple-Data (MAMD) architecture consists of separate data and address crossbars, while a Shared-Address Multiple-Data (SAMD) architecture combines crossbars for the data channels with shared buses for the write and read address channels. In fact, in most systems, the address channel bandwidth requirement is significantly less than that of the data channel. Such systems can achieve a good balance between system performance and interconnect complexity by using a shared address bus with multiple data buses enabling parallel data transfer.
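The connectivity matrix and the local-parallelism rule can be illustrated with a minimal sketch. All names here (`Conn`, `Transfer`, `parallel_ok`) are assumptions for illustration, not notation from the survey.

```cpp
// Partial-crossbar sketch: a boolean connectivity matrix marks which
// master/slave pairs are actually wired, and two transfers may proceed
// concurrently only if they share no master and no slave (local parallelism).
#include <array>

constexpr int M = 3, S = 3;                      // masters x slaves
using Conn = std::array<std::array<bool, S>, M>; // connectivity matrix

struct Transfer { int master, slave; };

// A transfer is possible only over a wired master/slave pair.
bool reachable(const Conn& c, Transfer t) { return c[t.master][t.slave]; }

// Concurrent buses must not refer to the same resources.
bool parallel_ok(const Conn& c, Transfer a, Transfer b) {
    return reachable(c, a) && reachable(c, b) &&
           a.master != b.master && a.slave != b.slave;
}
```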

Fig. 3. The internal architecture of the partial crossbar in Fig. 2(b).

The main advantage of a crossbar is that any parallel application can be mapped to a physical interconnect exhibiting the necessary parallelism. In addition, a crossbar is inherently a single-hop latency interconnect. These benefits come at non-negligible area and power costs [18,19]: the area cost grows quadratically with the number of ports [39] and, consequently, the power consumption has a similar trend [30]. Consequently, single crossbars do not scale well with the number of IP cores. To mitigate this drawback, it is possible to group slaves on shared buses as long as performance constraints are met [24,25]. In addition, transactions that involve slaves accessed exclusively by a single master do not necessarily have to go through the crossbar: a shared bus can be placed on an input port of the crossbar in order to group these slaves and the corresponding master. Since these shared buses connect at most one master, they do not require additional arbitration components. Such a structure has fewer channels, which reduces the crossbar area in terms of wires and arbiter components and simplifies the design of decoders, reducing in turn the resulting power consumption. A further improvement can be achieved by also grouping masters [23,28]. Obviously, the resulting structure should closely match the traffic characteristics and performance requirements of the application. By relying on this simplification, the crossbar size can be further reduced. Both approaches need efficient algorithms able to cluster master and slave cores and reduce congestion. They are motivated by the observation that the communication patterns of different applications can be effectively handled by different logical topologies, as in most cases applications require only a small portion of all-to-all communication [30,64,65]. Fig. 2(c) displays a crossbar-based architecture enhanced as described above in order to reduce the interconnect size. Note that slave S2 is placed on the same bus as master M2, since M2 is the only master involved in the communication with S2. Using two shared buses to group slaves and masters allows reducing the size of the crossbar from 4×4 to 3×2.
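A back-of-the-envelope model makes the grouping payoff concrete. The cost model below is an assumption for illustration only: it counts one crosspoint per master/slave port pair, consistent with the quadratic growth cited above [39].

```cpp
// Toy area model for the grouping argument: crossbar area in crosspoint
// units is masters * slaves, so shrinking a 4x4 matrix to 3x2 by grouping
// masters and slaves on shared buses cuts the matrix area substantially.
int crossbar_area(int masters, int slaves) {
    return masters * slaves;   // one unit per crosspoint (assumed model)
}
```

Under this model the 4×4 crossbar of Fig. 2(b) costs 16 units, while the grouped 3×2 crossbar of Fig. 2(c) costs 6, a reduction of more than half before even counting the simplified decoders and arbiters.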

When designing the topology, it is critical to consider the effect of physical parameters such as the wire delays. To achieve timing closure of the design, the inherent wiring complexity of alternative topologies should be considered during the synthesis phase. Taking the wiring complexity into account in the early stages of the design cycle will lead to better and more scalable communication architectures. As the system size grows, the shared bus size cannot increase indefinitely and, hence, the central bus matrix may become prohibitively large. As a consequence, the logic depth of the crossbar increases and so do the wires. Since the delay of a wire grows quadratically with its length [74], the added delay will inevitably lower the clock frequency of the bus matrix. Partitioning the wires into segments with repeaters in between [75] causes the total wire delay to grow linearly with the total wire length. Unfortunately, inserting repeaters increases the cost of the communication architecture in terms of area and power consumption [76]. In order to design high-frequency, power-efficient crossbar-based architectures under stringent area constraints, the cascaded crossbar paradigm was introduced [26,21,33,22,29,32,27,31]. As exemplified by Fig. 2(d), a cascaded crossbar topology consists of multiple small crossbars connected to each other without bridges in a cascaded scheme such that each master can access each connected slave through a multi-hop path. A bus pipeline stage, called a register slice, can be inserted either between the masters and the crossbar or between the crossbar and the slaves, as well as between crossbar pairs, pipelining the interconnect where the critical path is too long so that the timing can meet the given requirements. Of course, this reduced logic delay comes at the cost of additional latency cycles.
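The repeater argument can be checked numerically. The delay constants below are arbitrary placeholders, not values from the cited works; the sketch only demonstrates the quadratic-versus-linear scaling.

```cpp
// Wire delay models: an unrepeated wire's delay grows quadratically with
// length [74]; splitting it into n repeated segments leaves only a term
// linear in length (k*len^2/n plus a fixed per-repeater cost).
double unrepeated_delay(double len, double k) {
    return k * len * len;              // quadratic in wire length
}

double repeated_delay(double len, double k, int n, double rep) {
    double seg = len / n;              // n equal segments with repeaters
    return n * (k * seg * seg) + n * rep;   // = k*len^2/n + n*rep
}
```

For a wire of length 10 with k = 1, the unrepeated delay is 100 units, while five segments with a repeater cost of 0.5 each give 22.5 units, at the price of the extra repeater area and power noted above.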

In addition, during the design space exploration of crossbar-based interconnects, it is necessary to determine the connectivity matrix of each crossbar. Using fully connected crossbars limits the area efficiency and the achievable performance due to many unused paths. On the other hand, eliminating the unnecessary connections after the synthesis step only marginally improves the resulting architecture, and does not enlarge the design space [22]. Improved solutions can be reached by considering the partial connection of crossbars simultaneously when determining the topology [32].

Read full article

URL:

https://www.sciencedirect.com/science/article/pii/S0167926015000966