Reducing energy cost of intra-chip communications
May 21, 2012 | Fabien Clermidy, Ivan Miro-Panades, Yvain Thonnart and Pascal Vivet | 222904611
Fabien Clermidy, Ivan Miro-Panades, Yvain Thonnart and Pascal Vivet of CEA-Leti focus on how to reduce the energy costs associated with intra-chip communications.
From multi-cores to many-cores
To achieve high-performance systems, the race towards higher clock frequencies has given way to a race for the number of cores. This is true for desktops, but also for laptops, tablets and mobile phones, which are evolving even faster. Figure 1 shows a typical evolution of current SoCs: a multi-core host processor sustains the performance required by web applications, while a sea of Processing Engines (PE) handles highly parallel, compute-intensive applications. Each application then uses a configurable subset of these PE, while the corresponding software stack runs on the host processors.

Figure 1: System-on-chip architectures evolution (Source: ITRS 2011)
Figure 2 shows the International Technology Roadmap for Semiconductors (ITRS) projection of the number of cores in embedded systems. The trend is clear: the number of processing engines will explode in the coming years. Each PE, typically 250 kgates and 64 kbits of memory, will be more or less flexible, and the full architecture will most probably be heterogeneous: some PE will be dedicated to specific parts of an application while others will be general purpose, providing a good level of flexibility.
Two important constraints have to be considered: time-to-market (TTM) and power consumption. Both are linked to the communications between PE. SoC design is clearly moving from the reuse of Intellectual Property (IP) blocks to platform reuse, in order to minimize software development effort for TTM reasons. Communications are the key point to master for platform reuse: they must provide the right flexibility, throughput and latency across heterogeneous types of cores while limiting their power consumption. The energy spent in communications can account for up to 30 percent of the total in current SoCs and is growing with the number of cores. To cope with these issues, new and more efficient communication paradigms are needed.

Figure 2: Number of cores evolution in embedded systems (Source: ITRS 2011)

The advent of network-on-chip
Until the early 2000s, buses were the dominant communication infrastructure. They offered good flexibility and were widely adopted. However, they also had drawbacks, especially in terms of scalability and power consumption: buses crossed the whole chip to connect IP blocks, and scalability was obtained by increasing the number of wires, resulting in high wire capacitance. This reduced performance and increased power consumption. Segmented buses were introduced later, but their irregular structures limited the appeal of the bus without really solving these issues.
In the late 1990s, the network-on-chip (NoC) concept was introduced. The keywords defining NoCs are regularity, flexibility, throughput scalability and reduced power consumption. NoCs build on the multiprocessor interconnect background but differ in their implementation, with different latency, area and power consumption requirements. As regular structures, they bring the flexibility and scalability needed for the platform concept. In terms of power consumption, they are more efficient than buses thanks to shorter wires, and typically halve communication power consumption. However, these advantages come at a cost in latency, since going from one PE to another means crossing several switches or routers.
Limitations of classical NoC-based architectures
Even though NoC-based architectures solve many issues linked to many-core architectures, their power consumption remains high and tends to grow with the number of cores. Without innovation in this field, communication alone could account for more than 50 percent of the full SoC power consumption. This is due to many factors, the first one being clock distribution. NoCs are distributed all over the SoC, and the clock tree of a fully synchronous NoC typically represents 30 percent of its power consumption. The cause is the additional buffers required to obtain a balanced clock at the high NoC clock frequency that the communication throughput demands.
However, clock distribution is not the only problem. A more fundamental issue is the difficulty of predicting communication events, which often occur in data bursts and whose dynamics depend on the behavior of the different PE. As a result, defining power modes for the interconnect is a hard task.
Globally-asynchronous, locally synchronous (GALS) paradigm
GALS architectures are a solution for dealing with multiple clock domains. Consequently, they solve the clock tree distribution issue in NoC-based architectures and have been widely used. The main difficulty with GALS architectures is the resynchronization phase, which can imply large area and latency overheads.
The so-called mesochronous scheme is the most classical one. It considers clocks with the same frequency but different phases. Synchronization between clock domains is then simplified thanks to these identical frequencies. One solution is to invert the clocks between two neighboring blocks (Figure 3). The phase drift is then limited to half the clock period, but this greatly relaxes clock tree synthesis and thus reduces the corresponding power consumption. Another solution is to use a learning phase in which signal conflicts are detected, and then to avoid the conflicting cases in a second phase. This scheme requires minimal synchronization hardware and, thanks to the learning phase, reduces latency compared to the clock inversion scheme. This second scheme can also be extended to ratiochronous clocks, i.e. clocks related by an integer ratio. It thus allows the connection of PE with different frequencies, all related to a root clock. However, the precision of frequency selection is limited when the root clock frequency and the target frequency are in the same range.
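The frequency precision limit of ratiochronous clocks can be illustrated with a small sketch. It assumes, for illustration only, that each PE clock is the root clock divided by an integer; the function names, the 1000 MHz root clock and the maximum divider of 8 are hypothetical and not taken from the designs discussed here.

```python
def ratiochronous_choices(root_mhz, max_divider=8):
    """Frequencies reachable by dividing a root clock by an integer ratio."""
    return [root_mhz / n for n in range(1, max_divider + 1)]


def closest_choice(root_mhz, target_mhz, max_divider=8):
    """Return (divider, frequency) of the reachable frequency nearest the target."""
    best = min(range(1, max_divider + 1),
               key=lambda n: abs(root_mhz / n - target_mhz))
    return best, root_mhz / best


if __name__ == "__main__":
    root = 1000.0  # hypothetical root clock in MHz
    for target in (700.0, 130.0):
        divider, freq = closest_choice(root, target)
        error = abs(freq - target) / target
        print(f"target {target} MHz: divide by {divider}, "
              f"get {freq:.1f} MHz ({error:.0%} off)")
```

With these assumed numbers, a 700 MHz target close to the root clock can only be served by 1000 MHz or 500 MHz, whereas a 130 MHz target has a much closer neighbor at 125 MHz. The available steps are coarse near the root frequency and become finer as the divider grows.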



Figure 3: A mesochronous implementation using inverted clocks scheme (DSPIN)
The asynchronous scheme is the most advanced paradigm. In this case, clock frequencies and phases are unrelated. It therefore requires a complex and costly synchronization scheme between two clock domains because of metastability. Two solutions have been studied: asynchronous FIFOs and pausable clocks (Figure 4).
The first solution is costly both in hardware, because successive data have to be stored temporarily to sustain one data transfer per cycle, and in latency, because at least two cycles are lost when crossing a domain boundary. The pausable clock scheme, for its part, requires a local clock generator so that the core clock can be controlled when conflicts are detected. Moreover, the clock is paused more or less often depending on the traffic between the core and the outside, so core performance depends on the amount of communication. As a result, this technique has not been implemented in industrial circuits.
Asynchronous GALS allows advanced power management of the cores, as it can be combined with Dynamic Voltage and Frequency Scaling (DVFS). However, clocks are still distributed across the whole chip, and communication remains difficult to predict, which limits the impact of power management on the NoC itself. In this respect, mesochronous schemes are intrinsically limited, but asynchronous ones can be pushed further by completely removing the clock inside the NoC.


Figure 4: FIFO resynchronization (left) versus pausable clocks (right) for asynchronous GALS
Towards asynchronous communications?
Asynchronous logic covers all logic styles that do not rely on clock-based synchronization. This leaves a wide range of possibilities, extensively studied by research groups, from ad-hoc synchronization to structured handshake protocols. One technique studied by several research teams is to use Quasi-Delay Insensitive (QDI) asynchronous logic to design the NoC (Figure 5). Such a solution has three main advantages: no clock distribution, event-based communication and natural adaptation to process and environmental variations. The first two points are linked to power consumption: the clock tree can be confined to each IP, while the NoC is active only when communication occurs, resulting in minimal dynamic power consumption. Figure 6 shows a comparison between mesochronous and asynchronous implementations of the same NoC. For an equivalent 250 MHz frequency, the asynchronous NoC consumes 3 times less power than its mesochronous counterpart.
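To make the event-based behavior more concrete, the sketch below lists the wire states of one 4-phase handshake. It assumes a dual-rail data encoding (two rails per bit plus one acknowledge wire), a common choice for QDI links; the function names and the tuple layout are illustrative assumptions, not the implementation of the NoC described here.

```python
def four_phase_transfer(bit):
    """Return the (rail0, rail1, ack) states seen on the wires for one bit.

    Assumed dual-rail coding: (1, 0) means logical 0, (0, 1) means logical 1,
    and (0, 0) is the empty "spacer" that separates two data items.
    """
    data = (0, 1) if bit else (1, 0)
    return [
        (*data, 0),  # phase 1: sender raises exactly one rail (data valid)
        (*data, 1),  # phase 2: receiver captures the bit and raises ack
        (0, 0, 1),   # phase 3: sender returns both rails to the spacer
        (0, 0, 0),   # phase 4: receiver lowers ack, the link is idle again
    ]


def decode(states):
    """Recover the bit from the first state in which exactly one rail is high."""
    for rail0, rail1, _ack in states:
        if rail0 != rail1:
            return rail1
    raise ValueError("no valid data phase found")


if __name__ == "__main__":
    for bit in (0, 1):
        print(bit, four_phase_transfer(bit), decode(four_phase_transfer(bit)))
```

Every transfer starts and ends with all wires low, so a link with nothing to send produces no transitions at all. This is the property behind the low dynamic power of an idle asynchronous NoC.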


Figure 5: Quasi-Delay Insensitive (QDI) asynchronous style using a 4-phase handshake


Figure 6: Mesochronous versus asynchronous NoC comparison on a 130 nm low power technology
Chasing false ideas about asynchronous design
Asynchronous logic is known in the design community for three issues that have delayed its adoption in industry: a high area overhead; the need for specialized logic cells for performance reasons, the “Muller gates” or “C-elements” necessary for arbitrating signals (Figure 7); and finally, the need for specialized tools for synthesis and back-end.



Figure 7: 2-input Muller gate schematic (left), symbol and truth table (middle) and WCHB half-buffer based on Muller gates (right)
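The behavior of a 2-input Muller gate can be summarized in a few lines: the output follows the inputs when they agree and holds its previous value when they differ. The class below is a minimal behavioral sketch; the class and method names are illustrative and it does not model gate delays.

```python
class CElement:
    """Behavioral model of a 2-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Apply new input values and return the resulting output."""
        if a == b:
            # Both inputs agree: the output takes their common value.
            self.out = a
        # Inputs disagree: the previous output is held (state-holding gate).
        return self.out


if __name__ == "__main__":
    gate = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(f"a={a} b={b} -> out={gate.update(a, b)}")
```

Because the output changes only once both inputs have changed, the gate waits for two events before producing its own, which is why it serves as the basic building block of handshake circuits.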
The area drawback is real when comparing mesochronous and asynchronous implementations (Figure 6). However, the difference between the two design styles is only 30 percent, even though asynchronous logic needs 3 signals to code a single data item where mesochronous logic needs only one. This is because the clock signal, absent from the asynchronous design, can be counted as a second signal in the mesochronous case. To compare like with like, i.e. two NoCs with equivalent features, the asynchronous design must be compared with asynchronous GALS, as both allow advanced power management schemes such as DVFS. Figure 8 shows such a comparison. A 6-slot data FIFO is needed for the synchronous NoC version while only 4 slots are necessary for the asynchronous one, thanks to the inherent adaptation of asynchronous logic. As a result, the overhead is reduced to less than 5 percent. Moreover, this overhead comes with a reduced latency (12 cycles instead of 17), which can be a key element for reducing pressure on the memories.

Figure 8: Asynchronous versus GALS synchronous implementation comparison

The two other drawbacks have proved less important in recent years. Only 30 new cells are required for designing efficient asynchronous logic, compared to the hundreds of cells required for synchronous design, so the design effort is relatively small. The tools issue has been partially solved for the QDI logic style: the use of classical synchronous tools for QDI back-end has recently been demonstrated with high performance. Synthesis is still done by hand but, for dedicated blocks such as NoC routers, this step represents only a portion of the design time.
Advanced NoC power management: sleep mode using asynchronous logic
One of the main advantages of asynchronous design is its ability to react to events. This feature can be used to further reduce power consumption by implementing a voltage scaling scheme on the routers, controlled by the arrival and departure of packets. Figure 9 shows the idle times of a telecommunication application depending on the data granularity. In this example, the block level is the most interesting functional level for applying a power management scheme. Control bits are added to the NoC packets to indicate the block level; the router switches to high voltage when the beginning of a block is detected at its inputs and goes back to low voltage when the last block-level packet is detected. This simple mechanism is easy to implement in asynchronous logic, whereas it is impossible in synchronous versions because of the clock distribution. Finally, Figure 10 summarizes the main advantages of asynchronous design versus synchronous GALS, and the power reduction obtained thanks to the inherent advantages of asynchronous logic as well as advanced power management techniques. A gain of up to 10 times can be observed.
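The packet-driven voltage switching described above can be sketched as a small state machine. This is only a software illustration under stated assumptions: the control bit names `block_start` and `block_end`, the two voltage labels and the counter used to handle overlapping blocks from several inputs are all hypothetical, and the actual router does this in asynchronous hardware.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    payload: int
    block_start: bool = False  # hypothetical control bit: first packet of a block
    block_end: bool = False    # hypothetical control bit: last packet of a block


class RouterPowerController:
    """Tracks the supply level of one router from packet control bits."""

    def __init__(self):
        self.open_blocks = 0   # blocks currently in flight (assumed bookkeeping)
        self.voltage = "low"

    def on_packet(self, packet):
        """Update the supply level for an arriving packet and return the
        level in effect once that packet has been handled."""
        if packet.block_start:
            self.open_blocks += 1
            self.voltage = "high"
        if packet.block_end:
            self.open_blocks = max(0, self.open_blocks - 1)
            if self.open_blocks == 0:
                self.voltage = "low"
        return self.voltage


if __name__ == "__main__":
    controller = RouterPowerController()
    block = [Packet(1, block_start=True), Packet(2), Packet(3, block_end=True)]
    print([controller.on_packet(p) for p in block])
```

The controller does nothing between packets, which mirrors the point made in the text: the decision is triggered by packet events rather than by a clock that would have to keep running while the router is idle.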

Figure 9: Idle times depending on data granularity in telecommunication application

Figure 10: Asynchronous baseline and low-power versions versus synchronous GALS NoC
Conclusion: communication scheme is a key element for power management
As shown in this article, communication quality is a key factor for low-power many-core development. Firstly, it determines the key factors of success: throughput, latency and clock distribution. Secondly, it enables advanced power schemes such as fine-grain DVFS through the different GALS strategies. Finally, its own power consumption can account for a large portion of the total power and must be mastered.
For all these reasons, the choice of a suitable GALS technology has to be considered carefully. The merits of mesochronous, asynchronous GALS and fully asynchronous schemes have been compared. We have shown that an asynchronous NoC can now be considered as a solution to communication power consumption issues in complex many-core architectures.
About the authors:
Fabien Clermidy, Ivan Miro-Panades, Yvain Thonnart and Pascal Vivet are researchers at CEA-Leti, the French institute focused on micro- and nanotechnologies and their applications. CEA-Leti is part of CEA, the French Atomic Energy and Alternative Energies Commission.