Evolution of the multi-core processor architecture Intel Core: Conroe, Kentsfield...

Author:
Date: 27/06/2006

Undoubtedly, one of the most interesting IT intrigues of this season is the forthcoming announcement of a new generation of the multicore Intel Core processor architecture. Thanks to Intel's open PR policy in general and its open contacts with the press in particular, we already know a great deal about these chips, well before the official announcement of the various processor models. It is certainly enough for us to present today a review of the architectural changes and improvements implemented in the new generation of processors built on the Intel Core architecture.

It's no longer a secret that the new dual-core processors code-named Merom, Conroe, and Woodcrest, aimed at the mobile, desktop, and server markets respectively, will share a unified architectural framework under the consolidated name Intel Core (formerly named Architecture 101), with some additions that meet the specific requirements of each market sector. Nevertheless, in presenting the new-generation Intel Core architecture, we will focus mainly on the chips for desktop PCs: the Conroe.

Let me say straight away that this story deals solely with the architectural features of Intel's new processors. Therefore, there is no point in expecting any rumors, leaks, or hints regarding the model naming of Conroe chips, the timing of their announcement and arrival in retail, expected prices, etc. The most the author has allowed himself in this story is a few assumptions about the probable performance boost in specific applications.

All the other information, accompanied by comparative tests of the new chips, will be presented to our readers in due time. This is exactly the moment when it is "better to be safe than sorry": we would rather present only authentic information than spread gossip prematurely. I hope that our readers, after "digesting" the architectural features of Intel's new generation of processors, will be able not only to scrutinize the benchmark "marks" at their leisure, but also to get a better idea of the causes and consequences that lead to a specific result. Let's start.

Basic formulas defining the efficiency of modern processor architecture

As is known, a few years ago Intel gave up the idea of "boosting megahertz" and turned towards developing efficient processor microarchitectures with economical power consumption. In this approach, the processor's maximum performance depends more on the number of instructions executed per cycle than on the clock speed alone. In other words, the processor's clock speed is merely one of the factors in this simple formula:

[Performance] = [Clock speed] x [Number of instructions per cycle]

Therefore, in practice you don't have to raise the clock speed: there are many other effective methods to raise performance substantially. One group of such methods is the currently popular multicore processing, although, as practice shows, parallelizing computations across a number of cores is not an easy task and cannot be "brute-forced".
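The performance formula above is easy to try out with a few lines of arithmetic. The sketch below is purely illustrative: the clock speeds and instructions-per-cycle figures are made-up numbers chosen for the example, not measurements of any real processor.

```python
def performance(clock_hz: float, ipc: float) -> float:
    """[Performance] = [Clock speed] x [Number of instructions per cycle]."""
    return clock_hz * ipc


# Two hypothetical designs (illustrative numbers only, not real chips).
high_clock = performance(clock_hz=3.6e9, ipc=1.0)  # fast clock, little work per cycle
high_ipc = performance(clock_hz=2.4e9, ipc=2.0)    # slower clock, more work per cycle

print(f"high-clock design: {high_clock / 1e9:.1f} billion instructions/s")
print(f"high-IPC design:   {high_ipc / 1e9:.1f} billion instructions/s")
```

With these assumed figures, the design with the lower clock speed still comes out ahead, because it completes twice as many instructions per cycle.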

Another rather effective way to improve performance is to reduce the number of instructions required to run a specific task or, in other words, to optimize the instruction stream. The most obvious example is the SIMD (single instruction, multiple data) instructions that Intel has used since 1996 in the form of 64-bit integer SIMD instructions, starting with the Pentium chips supporting MMX, as well as the later 128-bit single-precision floating-point SIMD instructions, which first appeared in the SSE extensions of the Pentium III and were later complemented by the SSE2 and SSE3 instruction sets.
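To get a feel for how SIMD reduces the instruction count, here is a minimal counting sketch. It only counts the arithmetic instructions needed to process an array and ignores loads, stores, and loop overhead; the function name and the array size are assumptions made for the example.

```python
import math


def instructions_needed(n_elements: int, element_bits: int, register_bits: int) -> int:
    """Count arithmetic instructions when each one processes a full register."""
    lanes = register_bits // element_bits  # elements handled by one instruction
    return math.ceil(n_elements / lanes)


n = 1024  # hypothetical array length

# 16-bit integers: scalar (one element per instruction) vs. 64-bit MMX (4 lanes).
print("scalar 16-bit:", instructions_needed(n, 16, 16))
print("MMX 64-bit:   ", instructions_needed(n, 16, 64))

# 32-bit single-precision floats in a 128-bit SSE register (4 lanes).
print("SSE 128-bit:  ", instructions_needed(n, 32, 128))
```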

Another striking example of instruction stream optimization is the so-called micro-op fusion technology, in which a number of the CPU's internal micro-ops can be merged into a single micro-op, substantially reducing the total number of micro-ops required to run a specific task.

At the same time, the industry's current focus on producing economical processors requires a different kind of calculation. Hence the concept of optimum performance, which takes into account the amount of energy the CPU spends to run a specific task. It turns out that power consumption can be estimated as the product of the dynamic capacitance (the ratio of the electrostatic charge on a conductor to the potential difference between the conductors that maintain the charge), the square of the supply voltage, and the clock speed:

[Power consumption] = [Dynamic capacitance] x [Voltage] x [Voltage] x [Clock speed]

By correlating this power consumption equation with the previous performance formula, processor developers can better estimate the optimum balance between the efficiency of instructions executed per cycle and dynamic capacitance, on the one hand, and the appropriate supply voltage for the core and buffer circuits in combination with the chip's clock speed, on the other. This makes it possible to achieve optimum performance and efficient power consumption.
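The two formulas can be combined into a rough performance-per-watt estimate. The following sketch assumes hypothetical operating points: the capacitance, voltage, clock, and instructions-per-cycle values are invented for illustration and do not describe any actual Intel chip.

```python
def performance(clock_hz: float, ipc: float) -> float:
    """[Performance] = [Clock speed] x [Number of instructions per cycle]."""
    return clock_hz * ipc


def dynamic_power(capacitance_f: float, voltage_v: float, clock_hz: float) -> float:
    """[Power] = [Dynamic capacitance] x [Voltage] x [Voltage] x [Clock speed]."""
    return capacitance_f * voltage_v * voltage_v * clock_hz


def perf_per_watt(point: dict) -> float:
    """Instructions per second delivered for every watt consumed."""
    watts = dynamic_power(point["capacitance_f"], point["voltage_v"], point["clock_hz"])
    return performance(point["clock_hz"], point["ipc"]) / watts


# Hypothetical operating points (illustrative values only).
baseline = {"capacitance_f": 20e-9, "voltage_v": 1.3, "clock_hz": 3.0e9, "ipc": 1.0}
efficient = {"capacitance_f": 20e-9, "voltage_v": 1.1, "clock_hz": 2.4e9, "ipc": 2.0}

for name, point in (("baseline", baseline), ("efficient", efficient)):
    watts = dynamic_power(point["capacitance_f"], point["voltage_v"], point["clock_hz"])
    print(f"{name}: {watts:.1f} W, {perf_per_watt(point) / 1e6:.1f} million instructions/s per watt")
```

Note that the clock speed cancels out of the ratio: in this simple model, performance per watt depends only on the instructions per cycle, the dynamic capacitance, and the square of the voltage, which is why lowering the voltage pays off so well.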

I apologize for the lengthy introduction and the explanation of textbook truths, but this prelude will help you understand the goals and methods behind the development of the new-generation Intel Core microarchitecture, which offers improved performance and, most importantly, improved performance per watt.

Main features of the Intel Core architecture

The most precise, authentic, and detailed information on the inner structure of Intel's new-generation desktop processors, which are expected to emerge in the near future, was made public during the spring Intel Developer Forums, and during the Moscow IDF Spring 2006 in particular. It was there that Intel first clearly stated its plans to start deliveries of processors based on the Intel Core architecture and the 65-nm process technology as early as the third quarter of 2006. It was also then that it became known for certain that the new architecture would be the framework for processors in all market sectors: desktop PCs (Conroe), mobile PCs (Merom), and servers (Woodcrest).

The new chips built on the Intel Core architecture promise a substantial performance boost - from 40% for Conroe up to 80% for Woodcrest, with the power consumption reduced by 35-40%.

There are several reasons why materials explaining the essence of these innovations have appeared on our site only now. First, Intel has finally finished rebranding its processor lines, and we can now state with confidence that the new chips will carry the trademarks Intel Core 2 Extreme (Conroe XE) and Intel Core 2 Duo (Conroe, Merom). Second, the time that has passed since the spring IDF has allowed us to comprehend the architectural changes and sort out the operational specifics in order to present these novelties to our readers as accurately as possible. Third, Computex 2006, held in the first ten days of June, where working prototypes of systems built on Conroe chips were presented, has put everything in its place: the new-generation architecture has existed for quite a long time not only on paper but also in the form of samples ready for retail sales. So it is quite possible that the choice of the forthcoming announcement date for Conroe chips is driven more by marketing considerations than by production aspects.

The new processor architecture inherits the philosophy of efficient power consumption first implemented in the Intel Pentium M processors for mobile PCs, code-named Banias. The functional capabilities of the new-generation processors have been improved not only through new technologies but also through developments successfully used in chips of the Intel NetBurst architecture. Nevertheless, the key role is played by the innovations first implemented in Intel's new-generation architecture:

The Intel Wide Dynamic Execution technology is designed to provide a greater number of instructions executed per cycle, thus improving the efficiency of running applications and reducing power consumption. Each core of a processor that supports this technology can now execute up to four instructions simultaneously using a 14-stage pipeline.
The Intel Intelligent Power Capability technology enables specific components of the chip only when they are needed, which substantially reduces the power consumption of the system as a whole.
The Intel Advanced Smart Cache technology uses a unified L2 cache shared by all the cores; this shared use cuts power consumption and raises performance. At the same time, one of the processor cores can use the whole volume of the cache whenever needed, when the other core is dynamically disabled.
The Intel Smart Memory Access technology increases system performance by reducing memory response time and thus optimizing the use of the memory subsystem's bandwidth.
The Intel Advanced Digital Media Boost technology allows all the 128-bit SSE, SSE2, and SSE3 instructions, which are widely used in multimedia and graphics applications, to be processed in one cycle, which increases their speed of execution.

These are the major changes introduced in the new-generation Intel Core microarchitecture. It is now time to dwell on each of them.

Intel Wide Dynamic Execution

The Intel Wide Dynamic Execution technology builds on a set of techniques (data flow analysis, speculative execution, out-of-order execution, etc.) first implemented by Intel in the P6 architecture and used in the Pentium Pro, Pentium II, and Pentium III. In the Intel NetBurst architecture, the Advanced Dynamic Execution module served these purposes: it kept the processor's execution units loaded and offered an improved branch prediction algorithm that reduced the number of mispredicted branches.

In the Intel Core architecture, all this is consolidated into an advanced technology complex named Intel Wide Dynamic Execution, which allows a greater number of instructions to be executed per cycle, thus saving time and energy.

Each processor core can now process not three instructions at a time, as in the Intel NetBurst architecture, but four, which makes the execution core 33% wider than in the previous generations. Among the additional features of the Intel Wide Dynamic Execution technologies, of note are more precise branch prediction and deeper instruction buffering, which give the execution process additional flexibility.

In addition, Intel Wide Dynamic Execution makes efficient use of the Macro-Fusion technology (Macro-OPs Fusion), which merges certain pairs of x86 instructions (macro-ops) into a single internal micro-op. Whereas in previous generations of Intel processors each incoming instruction was decoded and executed separately, macro-fusion now allows some instruction pairs, such as a compare followed by a conditional jump, to be merged during decoding.

Executing two instructions as a single micro-op reduces the total amount of work for the CPU and increases the number of instructions processed per cycle. Moreover, the arithmetic logic units (ALUs) in Intel Core processors have been improved to better handle the fused micro-ops, which also results in an overall reduction of the chip's power consumption.

As a result, according to Intel, it becomes possible to reduce the operation load by up to 15% and cut down the number of micro-ops by up to 10% in the general case. As can be seen in the diagram below, the prefetch modules prepare a number of x86 instructions, and up to five of them can be processed by the four decoding units simultaneously. That is, with Macro-Fusion it becomes possible to process five instructions per cycle (no more than one fused pair can be generated per cycle).
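The "five instructions through four decoders" arithmetic can be illustrated with a toy decoder model. This is only a sketch under simplifying assumptions: every x86 instruction is treated as one micro-op, the fusible pairs are taken to be a cmp or test immediately followed by a conditional jump, and the instruction stream itself is invented for the example. It is not a description of Intel's actual decoder logic.

```python
FUSIBLE_FIRST = {"cmp", "test"}  # assumed first halves of a fusible pair


def is_conditional_jump(mnemonic: str) -> bool:
    """Treat every j* mnemonic except the unconditional jmp as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def decode(stream, width=4, max_fused_per_cycle=1):
    """Group instructions into decode cycles; each cycle emits up to `width` micro-ops."""
    cycles = []
    i = 0
    while i < len(stream):
        slots, fused = [], 0
        while i < len(stream) and len(slots) < width:
            cur = stream[i]
            nxt = stream[i + 1] if i + 1 < len(stream) else None
            if (fused < max_fused_per_cycle and cur in FUSIBLE_FIRST
                    and nxt is not None and is_conditional_jump(nxt)):
                slots.append(f"{cur}+{nxt}")  # two instructions, one micro-op
                fused += 1
                i += 2
            else:
                slots.append(cur)
                i += 1
        cycles.append(slots)
    return cycles


# Hypothetical instruction stream (mnemonics only).
stream = ["mov", "add", "cmp", "jne", "mov", "sub", "test", "jz", "add", "mov"]

for label, limit in (("with macro-fusion", 1), ("without macro-fusion", 0)):
    result = decode(stream, max_fused_per_cycle=limit)
    print(f"{label}: {len(result)} cycles, {sum(len(c) for c in result)} micro-ops")
```

In this toy stream, fusion lets each cycle consume five instructions while still emitting only four micro-ops, so the ten instructions decode in two cycles instead of three.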

Intel Intelligent Power Capability

Another innovation, under the consolidated name Intel Intelligent Power Capability, is a set of measures aimed at reducing the chip's power consumption and optimizing general design requirements. The technologies that coordinate the power consumption of all the processor's execution units include advanced, latency-optimized features that keep track of the load on specific logic circuits.

It is important to note that power reduction in the Intel Core architecture is achieved not by disabling unused circuitry; on the contrary, the tracking logic of Intel Intelligent Power Capability enables the required logic subsystems only once they are needed. Additionally, many internal buses and arrays of the CPU's logic units are now split up and powered through separate switches, which makes it possible to put them into an additional power-saving mode while certain types of data are processed.

The major goal of this "point-of-use" power scheme was to achieve a fast system response, e.g. when the system returns to full-capacity mode. As a result, this balanced approach to implementing Intel Intelligent Power Capability has made it possible to further reduce power consumption without detriment to system response speed and to improve the overall power optimization of the Intel Core architecture.

Intel Advanced Smart Cache

The new Intel Core architecture implements a rather efficient model in which the CPU cores share a common L2 cache. The Intel Advanced Smart Cache technology is optimized so that each core of the dual-core processor can access data with maximum efficiency.

Not all modern multicore processors can share access to the L2 cache among their cores. In practice this means that each core may have to keep its own copy of the same data in its own L2 cache. Moreover, with separate L2 caches, an idle core simply means an idle L2 cache for that core, which is an inefficient use of resources, whereas the second core may be running "breathless" for lack of additional L2 cache.

In the Intel Core architecture, both cores access the common L2 cache and can dynamically redistribute its resources (up to 100%!) depending on the current load. This Multi-Core Optimized Cache technology allows for optimum use of the cache memory subsystem. An additional advantage of the Multi-Core Optimized Cache is much faster fetching of data from the cache.
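The benefit of a shared cache when one core is idle can be shown with a small simulation. The sketch below assumes a simple LRU replacement policy and tiny, hypothetical cache sizes (4 lines per private cache versus 8 shared lines); real L2 caches are far larger and their replacement policy is not described in this article.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> None:
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[address] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def run(cache_for_core: dict, accesses) -> None:
    """Send each (core, address) access to the cache that serves that core."""
    for core, address in accesses:
        cache_for_core[core].access(address)


# Core 0 loops over a 6-line working set ten times; core 1 stays idle.
accesses = [(0, address) for _ in range(10) for address in range(6)]

split = {0: LRUCache(4), 1: LRUCache(4)}  # two private caches
shared_cache = LRUCache(8)                # one cache of the same total size
shared = {0: shared_cache, 1: shared_cache}

run(split, accesses)
run(shared, accesses)

print(f"split:  core 0 hits={split[0].hits}, misses={split[0].misses}")
print(f"shared: core 0 hits={shared_cache.hits}, misses={shared_cache.misses}")
```

In the split configuration the working set does not fit into core 0's private cache, so every access misses while core 1's cache sits unused. With the shared cache, core 0 can use the whole capacity and misses only on the first pass.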

Intel Smart Memory Access

The Intel Smart Memory Access technology boosts system performance by optimizing data exchange with the memory subsystem, thus reducing memory access latency.

The Intel Smart Memory Access technology also includes an entirely new feature called Memory Disambiguation. It increases the efficiency of out-of-order processing by allowing the cores to load data speculatively for instructions that are about to execute, before a number of preceding queued instructions have been executed.

Normally, when an out-of-order processor reorders instructions, it cannot move a Load ahead of a preceding Store, because the location of the Store's data is not yet known. Memory Disambiguation resolves this ambiguity with special algorithms that determine whether a Load can be executed before the preceding Store. If the answer is yes, the queue may be rearranged for better parallelization of instruction handling. On those rare occasions when the prediction turns out to be wrong, the technology detects the conflict, reloads the correct data, and re-executes the instruction.
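The speculate, check, and replay sequence described above can be sketched as a toy model. The function name, the dictionary used as "memory", and the addresses are all hypothetical; real hardware relies on a predictor and dedicated buffers whose details are not covered here.

```python
def load_ahead_of_store(memory: dict, store: tuple, load_address: int):
    """Execute a load before an older store, then verify the speculation.

    Returns (loaded_value, replayed), where `replayed` tells whether the
    load had to be re-executed because it conflicted with the store.
    """
    store_address, store_value = store

    # Step 1: the load runs speculatively, before the store's address is known.
    speculative_value = memory.get(load_address, 0)

    # Step 2: the older store resolves its address and writes its data.
    memory[store_address] = store_value

    # Step 3: check for a conflict; on a clash, reload the correct data.
    if store_address == load_address:
        return memory[load_address], True
    return speculative_value, False


initial = {0x10: 1, 0x20: 2}

# Different addresses: the speculation was safe, no replay is needed.
print(load_ahead_of_store(dict(initial), store=(0x10, 99), load_address=0x20))

# Same address: the conflict is detected and the load is replayed.
print(load_ahead_of_store(dict(initial), store=(0x20, 99), load_address=0x20))
```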

Along with Memory Disambiguation, the Intel Smart Memory Access technology includes improved prefetch units that can "predict" which memory contents will be needed and place them in the cache ahead of time, so that the data is ready once it is required. Of course, an increase in the number of loads served from the cache rather than from system memory reduces latency and improves performance.

The Intel Core architecture uses two prefetch units per L1 cache and two per L2 cache. These prefetchers detect several access streams at once and work jointly, which allows data to be placed in the L1 cache in time. The L2 cache prefetchers analyze accesses from the cores and provide the L2 cache with data that the cores may need in the future.
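One common way for a prefetcher to "predict" upcoming accesses is to watch for a constant stride between consecutive addresses. The sketch below assumes that approach for illustration only; the class name, the prefetch degree, and the addresses are hypothetical, and the actual algorithms of the Intel Core prefetch units are not described in this article.

```python
class StridePrefetcher:
    """Predict upcoming addresses once the same stride is seen twice in a row."""

    def __init__(self, degree: int = 2):
        self.degree = degree       # how many addresses to prefetch ahead
        self.last_address = None
        self.last_stride = None

    def observe(self, address: int) -> list:
        """Record one access and return the addresses worth prefetching."""
        prefetches = []
        if self.last_address is not None:
            stride = address - self.last_address
            if stride != 0 and stride == self.last_stride:
                prefetches = [address + stride * k for k in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_address = address
        return prefetches


prefetcher = StridePrefetcher(degree=2)
for address in (100, 164, 228, 292):  # hypothetical accesses, 64 bytes apart
    print(address, "->", prefetcher.observe(address))
```

The first two accesses only establish the stride; from the third access on, the model requests the next two addresses in the pattern before the core asks for them.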

Intel Advanced Digital Media Boost

The Intel Advanced Digital Media Boost feature improves CPU performance during the execution of SSE instructions. Both classes of operations, 128-bit integer SIMD arithmetic and 128-bit double-precision floating-point SIMD, are meant to reduce the total number of instructions required to execute specific program tasks. They can speed up many applications involved in video and photo processing, speech recognition, encryption, and financial and scientific calculations.

In many processors of the previous generations, each 128-bit SSE, SSE2, or SSE3 instruction took two cycles to execute. With the Intel Advanced Digital Media Boost technology, such 128-bit instructions can be executed at a peak rate of one per cycle. Intel Advanced Digital Media Boost is especially effective for processing multimedia content such as graphics, video, audio, and other data that makes intensive use of SSE, SSE2, and SSE3.
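The two-cycle versus one-cycle difference can be expressed as simple arithmetic. The sketch below assumes that the older designs handled a 128-bit operand as two 64-bit halves, one per cycle; that datapath width and the instruction count are assumptions made for the example, and the model ignores everything except peak execution throughput.

```python
def cycles_needed(n_instructions: int, datapath_bits: int, operand_bits: int = 128) -> int:
    """Peak-rate cycle count when each instruction needs one pass per datapath width."""
    passes = -(-operand_bits // datapath_bits)  # ceiling division
    return n_instructions * passes


n = 1000  # hypothetical number of 128-bit SIMD instructions

print("64-bit datapath: ", cycles_needed(n, 64), "cycles")
print("128-bit datapath:", cycles_needed(n, 128), "cycles")
```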

Summing up. Future prospects for the Intel Core microarchitecture

This, in brief, is an overview of the major improvements implemented in the new Intel Core microarchitecture with its multicore optimization. As you can see, each of these technologies on its own can substantially improve CPU efficiency. Taken as a whole, they are a serious force for setting new performance standards in combination with economical power consumption.

Thus, the new Intel Core microarchitecture has made use of all the advantages already implemented in the first generations of mobile Intel Pentium M processors, inherited the best of the Intel NetBurst architecture, and been enriched by the developers' freshest innovative ideas.

Today, we are not talking about the specific performance of the Intel Core architecture. The time has not yet come. However, the fact that Intel will use Intel Core in all the key sectors of computing equipment (servers, desktops, and mobile systems) means the company has put a great deal at stake. Judging by various indirect evidence, we can conclude with fair confidence that the stake is indeed worth it, but… let's not talk about that today and wait for the announcement and the results of laboratory tests.

Let me remind you once again that the Intel Core architecture will be implemented in specific retail products for various market sectors as early as the second half of 2006. Processors code-named Conroe for the desktop PC market are expected to emerge earlier than the others. Evidently, the new-generation economical chips will let system integrators start developing a new generation of quiet, thin, and powerful PCs in entirely unexpected form factors.

As for the most immediate future, the next generation of chips built on the Intel Core architecture, with an even greater number of cores, is already looming on the horizon. In particular, Kentsfield, Intel's first four-core processor, based on the Intel Core architecture with its outstanding power efficiency, will target the most powerful desktop PCs. Deliveries of these processors are planned to start in the first quarter of 2007.

With that, I take my leave, hopefully not for long, because truly grand events lie ahead...