More On Tasters

This page contains supplementary material for part two of Performance Assurance for IT Systems, which contains 16 technology tasters. It is divided into:

 

·         downloads which are typically draft versions of new technology tasters (pdf format)

·         observations which are typically short, in-line pieces that are considered to be useful additions to the existing technology tasters.  It is obviously assumed that you have access to a copy of the book to understand the context of the additional material. You can download all observations (pdf)

Downloads

 

These are additional technology tasters to those that can be found in part two of Performance Assurance for IT Systems. They are draft versions which may be modified in the future in the light of (a) comments from readers (which are welcome); (b) additional information; or (c) I just do not like the way that they read.

If you have not read the book it is important that you understand what the objectives of the tasters are. This is an extract from the introduction to the tasters.

“The basic foundation that underpins effective performance and technical architecture work is a solid understanding of the major hardware and software technologies that are used in server-based systems. The simple objective of these tasters is to provide a brief introduction to a range of technologies. In my experience, there are a small number of key aspects, typically no more than 6, which heavily influence the performance and scalability capabilities of any particular technology. The tasters attempt to convey these key pointers. There is, quite deliberately, no attempt to compare products or technologies; the objective is to provide sufficient grounding that the reader can confidently perform his own reviews. Although the tasters contain useful information that can assist Performance Analysts and Technical Architects, they are no substitute for a comprehensive understanding that can only come from more in-depth study. My background is in commercial IT systems and it is therefore inevitable that the material is heavily biased in that direction.”

Readers need to register to be able to access any download that is flagged as password-protected. At the current time none of the tasters are password-protected. Note that registered users will be automatically emailed when new material is posted on the site. Information on registered users will not be passed to any third parties.

Observations

 

The specific taster in the book to which an observation relates is given in the heading unless it is a more general topic.

 

Contents:

Intel / AMD – confusion between 32-bit, “partial-64 bit”, and 64-bit

Update on Intel and AMD Server Chips and Chipsets (February 2007)

IP-Based Storage: iSCSI

IP-Based Storage: FCIP and iFCP

Infiniband-based SANs

Dual Core Processor Chips

Multi-core CPU Chips (February 2007)

IO Interconnect Technologies

2007 Server Announcements (July 2007)

2007 Server Announcements (November 2007)

Intel / AMD – Confusion between 32-bit, “partial 64-bit”, and 64-bit systems (CPU and Operating System Basics Tasters)

Intel’s approach was “relatively” straightforward: it had Xeon and Pentium 4 in the 32-bit arena; and Itanium in the 64-bit arena.  AMD came along and confused matters by announcing AMD64 chips that will run in both 32-bit and 64-bit modes.  Naturally, the majority of the AMD “marketing speak” emphasises the 64-bit capability; I will call this mode “partial 64-bit” for the moment.  Intel was forced to respond, bringing out EM64T, its own partial 64-bit offering.  This is something that they had been working on in secret, frequently denying its existence. This is hardly surprising, as it was a defensive measure which they no doubt fervently hoped would never see the light of day, given the investment that has been put into Itanium.

 

The following paragraphs attempt to alleviate some of the confusion between the various offerings.

 

Existing Plain Vanilla 32-bit. Chips employ an IA-32-compatible architecture.  While there are various IA-32 technical shortcomings in hindsight, the most obvious constraint to users is 32-bit memory addressing which limits the size of the system address space to 4GB.  In Windows Operating Systems the default memory allocation is split: 2GB for user mode and 2GB for kernel mode.  The user allocation can be increased to 3GB at the expense of a reduced kernel.  This arrangement may be suitable for some applications, e.g. Microsoft Terminal Server that can consume significant memory when supporting large numbers of users, but not others.  The fact that large amounts of physical memory can now be configured on relatively small servers, e.g. 12GB on a dual CPU system is feasible, can make the use of 32-bit a constraining factor. 

 

Memory Extenders.  Intel introduced Physical Address Extension (PAE) to its 32-bit chips in an attempt to alleviate the memory constraint.  Data (and only data) can be stored between the 4GB line and 64GB line.  However, it must be brought below the 4GB line before it can be accessed.  This approach can be thought of as a three tier memory management system: memory (0-4GB); memory-based backing store (4-64GB); and disk-based backing store.  PAE is useful for software with very large data caches, e.g. DBMSs.

 

Partial 64-bit.  The first thing to say is that AMD64 and EM64T chips can operate in standard 32-bit mode (otherwise known as “legacy” mode) running on a 32-bit OS.  Compatibility mode is next; it runs 32-bit applications on a system with a 64-bit OS and 64-bit drivers. The advantage of this mode is that each process has its own 4GB limit, as opposed to a system-wide 4GB limit.  Finally, there is 64-bit mode (actually termed long mode by AMD and IA32e by Intel).  64-bit mode makes use of 64-bit registers and ALUs.  With respect to memory addressing, it uses 40 bits, which can provide access to 1TB of memory.  Apart from a 64-bit OS and drivers, this mode requires 64-bit applications that have been compiled to make use of the new features. Note that, despite the memory enhancements and selective use of 64-bit, Partial 64-bit is still essentially an IA32-compatible architecture.
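The memory limits quoted in the last three paragraphs all follow from the number of address bits. The short sketch below simply does that arithmetic. The function names are hypothetical, and the 36-bit figure for the memory extender case is inferred from the 64GB line mentioned above rather than stated in the text.

```python
# Illustrative arithmetic only: addressable memory for a given number of address bits.
# GB and TB are used in the binary sense, as in the surrounding text.

GB = 2 ** 30
TB = 2 ** 40


def addressable_bytes(address_bits: int) -> int:
    """Return the number of bytes reachable with the given number of address bits."""
    return 2 ** address_bits


def format_size(size_bytes: int) -> str:
    """Format a byte count as whole GB or TB."""
    if size_bytes >= TB:
        return f"{size_bytes // TB}TB"
    return f"{size_bytes // GB}GB"


# Address widths discussed above (36 bits is an inference from the 64GB line).
SCHEMES = {
    "Plain vanilla 32-bit": 32,
    "32-bit with memory extender": 36,
    "Partial 64-bit (64-bit mode)": 40,
}

for name, bits in SCHEMES.items():
    print(f"{name}: {bits} bits -> {format_size(addressable_bytes(bits))}")
```

Note that these are upper bounds set by addressing; how much physical memory a given server can actually hold depends on the chipset and the operating system.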

 

64-bit (IA-64).  This is pure 64-bit, engineered from the ground upwards. Intel started work in this area 11 years ago to produce servers that would be competitive with products from the main Unix vendors.  It employs a totally new architecture, termed IA-64.

 

What does it all mean? In particular, is Partial 64-bit an alternative to IA-64? Here is my current twopenn’orth:

 

·         Despite any technical attractions, Partial 64-bit is fundamentally a commercial fight between Intel and AMD.  AMD64 has forced Intel to veer away from the grand march towards IA-64, if only temporarily, while it fights AMD on this middle ground. If Intel loses this fight then IA-64 may be in some jeopardy

 

·         Partial 64-bit provides a transition path towards full 64-bit for those customers who require it

 

·         Compatibility and 64-bit modes can undoubtedly help to relieve memory constraints for non-DB software that suffers from such problems

 

·         Encryption algorithms will benefit from 64-bit registers and ALUs

 

·         No benefits for floating point, as these registers are already 80 or 128 bits on existing 32-bit systems

 

·         Although general performance improvements are claimed over 32-bit, I have not seen any figures yet to be able to comment.  A number of techniques, mainly increases in clock speed, have allowed IA-32 architectures to keep abreast of the performance of IA-64 systems.  This parity is helped by the fact that IA-64 is still relatively immature, and some of the promised benefits, e.g. parallel processing, have yet to be realized

 

·         Multi-processor performance (4 CPUs upwards) has always been something of a problem on 32-bit architectures, despite gradual improvements.  Many of the issues relate to general multi-processor system design, chipsets and OSs rather than to basic CPU design.  The handling of large amounts of physical memory may be a particular problem on Partial 64-bit multiprocessors, depending on the design.  Certainly above 4 CPUs I feel more comfortable with IA-64 systems.

Update on Intel and AMD Server Chips and Chipsets (February 2007)

AMD was formed back in 1969 by ex-employees of Fairchild Semiconductor. It started life as a producer of logic chips and moved into memory chips in 1975. See the Wikipedia article on AMD for more on the history of the company. It has periodically been an irritant to Intel since the early 1990s when it produced a clone of the 386. However, battle was really joined with the arrival of the K8 architecture which featured AMD64 – 64-bit extensions to the x86 instruction set – see the above observations on confusion over 32-bit and 64-bit systems. Opteron, the server variant of K8, was introduced in April 2003. While comparison of the performance of a single CPU showed little difference between Intel’s Xeon and Opteron, the AMD chip was markedly superior when deployed as part of a multi-processor (SMP) server. Indeed, it was noticeable that TPC-C benchmarks for Intel-based systems with 4 or more CPUs vanished almost overnight.

 

Intel’s Achilles heel on Xeons has always been the relatively poor scalability of SMP systems with 4 or more CPUs, principally due to the bus architecture where inter-CPU and CPU-memory traffic share the same bus. The design seems satisfactory on dual processor systems where the level of traffic that is generated can be accommodated.  The problems on larger systems have led Unisys (e.g. ES7000/one) and IBM (x-series) to build their own architectures which are based on boards with 4 CPUs. The diagram below illustrates their approach which effectively consists of two dual processor Intels that are connected to a custom-built board or cell controller.  Larger systems can be built by connecting these boards together by some form of interconnect.  Satisfactory scalability is driven by the number of hops that are required to get from one board to another; this appears to be in the range of 16-24 CPUs depending on the individual design. 

 

[Diagram: Example Unisys Cell with Paxville CPUs]

 

The Opteron-based design includes an on-chip memory controller and direct connections between CPUs using AMD’s HyperTransport technology; each CPU has its own memory, as opposed to the more traditional shared memory approach, but the use of NUMA techniques allows one CPU to access the memory of another CPU directly. The implementation is limited to 4 CPUs.

 

AMD further built on this advantage when it announced dual core CPU chips in May 2005, thereby allowing servers with up to 8 cores. Benchmarks indicate that performance across the two cores is fairly satisfactory. Various hardware vendors have been attracted by AMD’s offerings, particularly HP and Sun, and IBM to a lesser degree. This success forced Intel to revise its plans in the autumn of 2005. In particular, the focus on the latest 64-bit Itanium, codenamed Montecito, was reduced in the light of continuing problems, and attention was switched to Intel’s dual core EM64T-based products that would compete directly with AMD. This subsequently led to a rash of announcements over the spring and summer of 2006. They fall into two main camps: the Xeon 5000 and 7000 series.

 

The 5000 series appears to be aimed at single and dual CPU systems at this point in time, i.e. support for up to 4 cores. The 5000, codenamed Dempsey, which was originally meant to plug the holes until Intel’s new core microarchitecture was ready, was announced in May 2006.  Somewhat bizarrely, it was followed a month later by the new core microarchitecture in the form of the 5100, previously codenamed Woodcrest – the server / workstation equivalent of Intel’s Core 2 processor in PCs and laptops. The current top end of the 5100 is the 5160 running at 3.0GHz with a 1333MHz system bus which has a single core performance (SPECint2000) that is ~60-65% faster than the previous fastest single core Xeon chip. There are few scalability benchmarks to be found thus far for the 5100; they tend to indicate that while throughput is better than AMD, it is arguably not as good as might be expected. So, it is a case of watch this space for the moment.

 

The Intel 7000 series is aimed at larger SMP systems, 4 CPUs and above. Paxville (7000) was originally announced in November 2005 although indications were that the overall performance of the chip across the two cores was somewhat modest. The improved version, Tulsa (7100) with its large L3 cache (up to 16MB) was announced at the end of August 2006. The basic speed of a single core when measured by SPECint2000 is broadly similar to the previous generation Xeons. I have seen no trustworthy benchmarks on Tulsa-based systems at the time of writing.

 

Towards the close of 2006 Intel announced the quad-core 5300, originally codenamed Clovertown, in the single and dual CPU range. The very limited benchmarks that have been published thus far suggest to me that the current top of the range 2.66 GHz 5355 with a 1333 MHz system bus will support 30-35% more throughput than the top of the range dual-core Woodcrest 5160. A quad core version of the 7000 series is touted for later in 2007. Quad cores are discussed in the multi-core CPU chip observation below.

 

Having enjoyed two years in the ascendancy, AMD now find themselves on the back foot for the moment. There is much speculation as to what form their response might take – see the Wikipedia article on AMD K8L for various pieces of speculation. On the subject of Wikipedia, there is a useful article on Intel Xeon that neatly summarises the recent rash of announcements from Intel along with details of the promised offerings in the near to medium term.

 

There are obviously other factors apart from performance that contribute to the debate between Intel and AMD offerings. Popular topics of the moment include LV (Low voltage) CPU chips to reduce power consumption and better support for server virtualisation.

 

IP-Based Storage: iSCSI (Hard Disk: Fibre Channel, SAN and NAS Taster)

Traditional disk subsystems have direct connections with the servers that access them (albeit sometimes via SAN fabric).  The entire system from host to disk target is usually SCSI-based, or in more recent times Fibre Channel-based.  While NAS subsequently provided a mechanism to connect an entire disk subsystem to a LAN, the disks themselves are still directly connected to the server(s) within the NAS itself.  iSCSI now allows the connection between the server and the disk target to be done via the LAN.  As shown in the diagram below the iSCSI protocol provides a mechanism to stretch the components of physical disk access over an IP network in a roughly similar fashion to the way that the software components are stretched in NFS or CIFS.

 

 

In iSCSI, the “client” is an iSCSI initiator, which is provided by the majority of operating system vendors.  The iSCSI commands, encapsulated in TCP/IP, pass over an IP network to the disk target where the commands are processed. The “marketing speak” for IP SANs tends to concentrate on comparisons with Fibre Channel-based SANs.  However, the waters can quickly become muddied:

 

  • The disk target may be an array with network port(s) to support IP but the hardware may well be “wall-to-wall” fibre channel. Some traditional storage vendors who had been watching IP SAN from the sidelines have joined the fray by offering two versions of a given disk array, an iSCSI variant in addition to the original Fibre Channel version. There does not appear to be a great deal of difference between the two, apart from the host ports and the support for the iSCSI protocol.

 

  • Another approach is to use IP storage switches. As indicated in the following diagram, they are modified LAN switches that act as bridges; IP through standard ethernet ports on one side and fibre channel on the other side

 

 

  • It is easier for existing NAS vendors to support IP SANs. As their host ports are already network ports, they simply have to support the iSCSI protocol.

 

Key Performance observations include:

 

  • As stated above, iSCSI uses TCP/IP.  If a server generates significant disk traffic there may be a significant CPU overhead (remember from the Network Basics taster that TCP/IP is a fairly fat protocol).  It follows that the performance of the server could be adversely affected.  Some vendors offer a special network card that offloads the TCP/IP overhead (and in some instances the iSCSI overheads as well). These cards are obviously more expensive than simple network cards, though they are still cheaper than fibre channel cards

 

  • Similar to any relatively immature protocol, iSCSI can be quite “chatty”, i.e. multiple exchanges may be required between the source and the disk target to complete a single disk access. One paper that I have read discussed the use of a software-based cache on the server which was claimed to reduce protocol traffic by a factor of 3-4

 

  • The addition of disk traffic encapsulated in TCP/IP will increase the overall network traffic (possibly significantly). If the network design and implementation do not cater for the disk traffic the overall LAN performance may be adversely affected.  Similarly, a single network connection from a server that is used for both normal network traffic and disk traffic may become a bottleneck if it is not correctly sized

 

  • A common outcome when using NAS or IP SAN is that there is less bandwidth between the server and the disk than is the case in a more traditional disk subsystem.  Any decrease in bandwidth, coupled with the TCP/IP overheads, is likely to result in increased latency, i.e. longer disk response times

 

  • iSCSI may be eminently suitable for small to low-end medium systems, or for those systems where disk performance is not crucial and hence higher latencies can be accommodated (e.g. email systems, off-peak batch tasks such as overnight backup, or possibly asynchronous remote mirroring to a Disaster Recovery site).  However, it is very questionable if it is suitable for medium to large systems or systems where any disk performance degradation is not acceptable (e.g. OLTP systems)

 

  • Think carefully about using iSCSI for existing NFS or CIFS systems.  Here you have already stretched the front-end (NFS client separated from the server) and hence increased disk latencies, and now you are looking at stretching the back-end (NFS or CIFS server to disk arrays).
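The bandwidth point in the list above can be illustrated with some simple arithmetic. The sketch below calculates only the time that a disk block spends on the wire for a given nominal link rate; it deliberately ignores TCP/IP and iSCSI overheads, encoding overheads and contention, all of which would make matters worse. The function name, the 8KB block size and the two link rates are illustrative assumptions, not measurements of any product.

```python
# Illustrative arithmetic only: wire time for one disk block at a nominal link rate.
# Block size and link rates are assumptions chosen for the example.


def wire_time_ms(block_bytes: int, link_gbit_per_s: float) -> float:
    """Return the time in milliseconds to move block_bytes over the link,
    ignoring all protocol overheads and contention."""
    if link_gbit_per_s <= 0:
        raise ValueError("link rate must be positive")
    bits = block_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds * 1000.0


BLOCK_BYTES = 8 * 1024  # assumed 8KB disk block

for label, rate in [("1 Gbit/s link", 1.0), ("2 Gbit/s link", 2.0)]:
    print(f"{label}: {wire_time_ms(BLOCK_BYTES, rate):.4f} ms per block")
```

Halving the bandwidth doubles the wire time per block. On its own this is a small part of a disk response time, but it is a fixed penalty on every access before any protocol overhead or queuing is added.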

 

The amount of marketing hype on this subject has increased markedly over the past 12 months.  Much of it revolves around the message that running more stuff under the IP that we know and love must be a good thing, e.g. our network people do not need to be retrained and it will be significantly cheaper.  I find this all a bit simplistic and slightly economical with the truth.  Although Fibre Channel may still be expensive (relatively), I tend to go by the old maxim which indicates that ultimately you get what you pay for.

 

IP-Based Storage: FCIP and iFCP (Hard Disk: Fibre Channel, SAN and NAS taster)

FCIP and iFCP are mechanisms for running fibre channel commands over IP.  Whereas iSCSI is very visible in the sense that it is involved in the whole physical disk I/O from its issue on a server through to the disk target (or IP SAN switch if the target is fibre channel), FCIP and iFCP are relatively invisible, in that they are part of the plumbing.  They are implemented in gateways, which some vendors call switches or, more typically, SAN routers, allowing fibre-channel traffic to flow between “islands” of fibre-channel equipment over an IP network.

 

FCIP is the simplest protocol for a vendor to implement, as it simply encapsulates entire FC (Fibre Channel) frames within TCP, a technique that is often referred to as tunnelling.  This effectively merges multiple SAN islands into a single SAN fabric, as shown in the diagram below. It can be said that FCIP operates at the fabric level. The easiest implementation is to work with a single tunnel, which in essence corresponds to a single TCP session.

 

 

 

iFCP operates at the device level rather than the fabric level.  It maps each FC address (server, device or whatever) to a separate IP address and each FC session to a separate TCP session.  In essence, iFCP replaces the FC transport layer with an IP network, e.g. Ethernet, but retains the information in the upper layer by mapping it to TCP. As Fibre Channel devices use FC Generic Services, IP-equivalent protocols are required for iFCP, the most obvious example being iSNS which provides storage name services.  It is important to note that FC sessions are terminated at the gateway device, shipped over the network by IP to the target gateway where corresponding FC sessions are constructed to carry traffic to the target device. It seems reasonable to describe iFCP as a more native protocol than FCIP.  If FCIP extends a single fabric then iFCP can be said to enable communication between multiple fabrics.

 

 

 

It should be noted that there is much semi-religious debate surrounding the use of FCIP and iFCP, mostly generated by the marketing people of the vendors who offer one or the other.  The debate is partly fuelled by the fact that implementations of the protocols currently differ from vendor to vendor.  Arguments may equally occur between vendors pushing iFCP, in addition to the more obvious FCIP versus iFCP battles.  Differences in the implementation of the protocols bring the possible disadvantage of locking you into a single vendor.

 

Arguably, the main advantages of FCIP and iFCP are that they provide a cheaper solution than wall-to-wall fibre channel and they bring the concept of wide area storage networks closer.

 

With respect to performance, many of the observations on iSCSI also apply here. Briefly,

 

  • TCP is a fat protocol that will inevitably result in longer latencies than vanilla-flavoured fibre channel. This may be an issue (e.g. in larger OLTP systems) or it may not (e.g. in messaging systems)

 

  • Congestion, due to inadequate network bandwidth or spikes of access activity, will exacerbate latencies, as TCP takes its usual avoiding action by reducing the rate of flows across the network. This may be unacceptable for disk performance

 

  • Congestion may be more of an issue in FCIP, particularly in those implementations where there is a single tunnel, as the multiple FC flows which have been merged into that tunnel will all be affected, whereas not all sessions may be affected in iFCP.
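The difference in congestion exposure between a single FCIP tunnel and iFCP’s session-per-session mapping can be shown with a toy model. This is a sketch of the mapping idea only, not an implementation of either protocol; the function names, session labels and flow names are all hypothetical.

```python
# Toy model: which FC flows share a TCP session, and so suffer together
# when that session backs off under congestion. All names are hypothetical.


def fcip_sessions(fc_flows):
    """Single-tunnel FCIP: every FC flow is carried in the same TCP session."""
    return {flow: "tunnel-0" for flow in fc_flows}


def ifcp_sessions(fc_flows):
    """iFCP: each FC session is mapped to its own TCP session."""
    return {flow: f"tcp-{i}" for i, flow in enumerate(fc_flows)}


def affected_flows(mapping, congested_session):
    """Return the FC flows carried by the TCP session that has hit congestion."""
    return sorted(f for f, s in mapping.items() if s == congested_session)


flows = ["host-a->array-1", "host-b->array-1", "host-c->array-2"]
fcip = fcip_sessions(flows)
ifcp = ifcp_sessions(flows)

print("FCIP, tunnel-0 congested:", affected_flows(fcip, "tunnel-0"))
print("iFCP, tcp-0 congested:   ", affected_flows(ifcp, "tcp-0"))
```

In the single-tunnel case all three flows slow down together, whereas in the iFCP case only the flow on the congested session is affected. Real implementations differ from vendor to vendor, so treat this purely as an aid to understanding.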

 

FCIP and iFCP may be well suited to those storage infrastructures that only have to support small to medium disk access rates, for off-peak activities such as overnight back-up, or for those systems where asynchronous IO is satisfactory (e.g. remote mirroring).  They are unlikely to be suitable for larger infrastructures, particularly those with high access rates.

 

Beware of vendors peddling IP storage, replete with the warning that fibre channel technology is dead.  I am not sure whether they are talking about all parts of the technology or just the fabric.  Needless to say, they are probably network vendors.  Take what they say with a very large pinch of salt.

 

Finally, having whinged previously about the quality of technical information that is put out by some Technology Associations and Forums, I have to say that I found some very useful documents on the SNIA site; viz. iFCP - A Technical Overview and a slide show entitled IP Storage by Dave Dale of Network Appliance.

 

Infiniband-based SANs (or even Infiniband-based IP)

The recent taster on cluster interconnect technologies and the various pieces on IP-based storage have resulted in a question from one reader on the possible use of Infiniband to provide the fabric for SANs, as promoted by the Infiniband Trade Association.

 

In essence, the Infiniband architecture provides a conduit for other protocols to run over the medium; or to use the current lingua franca these protocols are “tunnelled” through Infiniband.  For example, IPoIB supports the tunnelling of IP.  With respect to storage, the SCSI RDMA Protocol (SRP) runs over Infiniband.  Similar to iSCSI in the IP world, SRP requires an initiator at the client end (i.e. on the server), as well as the necessary protocol support at the device target end.

 

The promoters of Infiniband preach the message of a single cluster-wide IO fabric, obviating the need for Fibre Channel fabric.  The marketing spiel talks about the simple need for a single interface card (HCA) on each node in the cluster to support both inter-node and storage traffic.  However, the target disk subsystem(s) also need HCA(s) and the necessary support for SRP.  While Mellanox, one of the major Infiniband vendors, provide a modest Infiniband-based storage system that supports up to 1TB of SATA disk, the major storage vendors have been loath to follow suit.  A more popular approach has consisted of the use of Infiniband-to-Fibre Channel gateway boxes from companies such as TopSpin (recently acquired by Cisco) and Voltaire (part-funded by Hitachi).  One side of the gateway plugs into the Infiniband fabric while there are storage ports on the other side, viz. Fibre Channel and SCSI.  IBM is currently selling a BladeCenter, consisting of up to 14 blade servers that can form a cluster via a TopSpin-based Infiniband fabric; TopSpin gateways are supported to allow storage and IP extensions.  Currently available gateways are limited, typically supporting only two Fibre Channel ports. While this may not be an issue on small systems, it is likely that multiple gateways will be necessary on any reasonable-sized system.

 

From a performance perspective, the plus points are the inherent high throughput and low latency of Infiniband, while the possible disadvantages are the overheads of SRP or the gateway, which obviously depend on the implementation. Careful sizing is required if there are significant amounts of both inter-server and storage traffic at a given node. Apart from basic throughput requirements, the performance of the server IO infrastructure on each node may need to be assessed, e.g. the capacity of a PCI bus.

 

As mentioned, Infiniband also supports IP via IPoIB.  This can muddy the waters in the storage area, as iSCSI, the IP-based storage protocol, can be run over Infiniband using IPoIB.  Despite the availability of increased bandwidth it is still likely to suffer from the relatively poor performance of TCP.  Some improvements can be obtained by using the RDMA assist for iSCSI. Here, the control aspects of the protocol continue to be sent by TCP but the actual data uses RDMA to bypass it.  The effect is to maximise the advantages of the increased bandwidth and low latency, although the benefits are more likely to be noticeable when larger storage block sizes are used.

 

The bottom line at the time of writing is that, while all vendors are keen on the concept of a unified fabric that will support all system IO, there is no general market acceptance of Infiniband, or indeed of any other technology.  Infiniband may triumph eventually but we will have to wait and see. I will pen some words on other IO technologies shortly.

 

Dual Core Processor Chips (CPU Basics and Multiprocessors – Shared Memory Tasters)

Dual core CPU chips (two CPUs on a single die) have been gradually appearing in server-based systems, including: IBM who were early in the field with Power 4, and more recently with the Power 5; Sun’s UltraSPARC IV appeared in 2004; AMD’s announcement came early in 2005; while Intel’s initial offerings in the IA32 space were announced at the back end of 2005, their hand probably being forced by AMD.  HP is also in this arena with the PA-8800 and PA-8900. However, they are making few noises about them, as the overall plot (at least from a commercial perspective) seems to be that Itanium will replace PA-RISC chips in their server systems, as soon as is practical. With respect to Itanium, Intel’s dual core Montecito chip is currently slated for 2007 although HP already has its own version of dual core Itanium, a temporary measure until Montecito appears.

 

Dual core designs vary, as shown in the examples on the diagram below. The basic set up (a) incorporates two processor cores, each with its own L1 caches; the memory controller is shared by the cores; and the L2 cache is off-chip. An improvement (b) is similar except that the L2 cache is on-chip. A further refinement (c) also has the L2 cache on-chip but introduces a larger L3 cache which is held off-chip.

 

 

When comparing the performance of systems with single and dual core processor chips, there are two key areas: cache coherence and general cache performance; plus memory performance.  Minimally, there needs to be as much cache, main memory and system bus bandwidth per CPU on a dual core-based system as there is on the equivalent single core system. Some people claim that the amount of main memory accesses will be reduced on a dual core system, on the basis that the data required is more likely to be in the processor cache, possibly having been previously fetched by the other CPU on the chip.  This seems highly speculative: yes, it can be the case on certain workloads that exhibit high levels of non-volatile data / instruction sharing between threads; but the more likely case, certainly in the commercial world, is that applications exhibit high levels of memory volatility, particularly in the middle and database tiers, and therefore the likelihood of cache hits (finding the data already in the cache) is probably low. 

 

Let us assume, for the sake of argument, that we have two systems which are largely identical, except for the use of single or dual core processor chips. In particular, each system has the same amount of main memory, L1 and L2 cache and system bus bandwidth per CPU.  In a dual CPU system, the single core variant will need to use a standard SMP cache coherency protocol to ensure that data integrity is maintained across the two CPUs. However, the dual core variant only requires one processor chip to house the two CPUs, obviating the need for cross chip cache coherency. All things being equal (and I know that they are often not) the dual core variant should perform at least as well as the single core variant in this dual CPU system, possibly better.  In reality, the amount of cache per core, particularly L2, is reduced on many current implementations and performance will suffer. Some early implementations of design type (a) and (b) lose as much as 25% speed per core, i.e. a dual core will perform at 1.5 times the speed of the equivalent single core; better designs may lose 10-20%; while design type (c) may only lose 5% (or less) per core.
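The per-core loss figures above translate into chip-level speed as follows. The sketch simply multiplies the number of cores by the fraction of single core speed that each core retains; the function name is hypothetical and the loss values are the ones quoted in the paragraph above.

```python
# Illustrative arithmetic only: relative speed of a dual core chip versus the
# equivalent single core chip, given the fractional speed lost per core.


def dual_core_relative_speed(loss_per_core: float, cores: int = 2) -> float:
    """Return chip speed as a multiple of the equivalent single core chip."""
    if not 0.0 <= loss_per_core < 1.0:
        raise ValueError("loss_per_core must be in the range [0, 1)")
    return cores * (1.0 - loss_per_core)


# Loss figures quoted in the text for the various design types.
CASES = [
    ("early design (a)/(b), 25% loss", 0.25),
    ("better design, 20% loss", 0.20),
    ("better design, 10% loss", 0.10),
    ("design (c), 5% loss", 0.05),
]

for label, loss in CASES:
    print(f"{label}: {dual_core_relative_speed(loss):.2f}x single core")
```

So the quoted range runs from 1.5 times a single core in the worst early designs to around 1.9 times for design type (c).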

 

On a 4 CPU system, or above, the dual core system must also resort to using a cache coherency protocol and will lose the advantage that it had on the dual CPU system. Based on a limited sample, a dual core system with 4 CPUs (based on design type (b) in the diagram), running a typical commercial workload appears to under-perform the equivalent single core system by 10-20%. This appears to be an intuitive result. A system based on design type (c) in the diagram should in theory be closer to the performance of a single core system although I have not seen any numbers to validate this speculation. Naturally, there may well be other factors to take into consideration when comparing performance, viz. any price difference between the respective single and dual core systems, plus any performance improvements that may have been made in other areas of the system, e.g. the underlying server infrastructure (chipset), which may help to offset the reduction in performance.

 

In addition to the provision of dual core processor chips, some vendors are also introducing support for SMT (Simultaneous Multi-Threading), which allows multiple threads (typically two at the current time) to execute concurrently on a single CPU. This obviously muddies the waters further when attempting any comparisons, as the purpose of SMT is to increase the throughput of each CPU.  See the taster on this subject for further information.

 

Multi-Core CPU Chips (February 2007)

Multi-core roadmaps generally focus on squeezing more cores onto a single chip.  There are currently two main strands of activity: specialised systems where the objective is to get large numbers of cores onto a single chip; and general purpose CPUs that aim to maintain performance levels while gradually increasing the number of cores.

 

In the specialised system area manufacturers of embedded systems are already producing products. In the commercial computing world Sun announced their UltraSPARC T1 architecture at the end of 2005, based on the previously codenamed Niagara chip. The chip houses 8 cores and supports 4 threads per core. The cores have been simplified, in comparison with a typical modern complex CPU that attempts to get as much performance as possible by employing all the tricks of the trade, e.g. speculative memory fetches, out of order processing, et cetera. Rather, the approach in Niagara is simply to switch to another core / thread if any sort of delay is encountered.  The emphasis is on keeping as many cores / threads as busy as possible, i.e. throughput is more important, relatively speaking, than response times.  For these reasons, it is not surprising that no CPU speed benchmarks such as SPECint2006 have been published, although throughput-based benchmarks such as SPECweb2005 and SPECjAppServer2004 have been. Niagara is currently deployed in a single processor system, i.e. there is no support for traditional SMP designs. There are two basic server models: the T1000 (6 or 8 cores running at 1.0 GHz) and the T2000 (4, 6 or 8 cores running at 1.0 GHz, or 8 cores running at 1.2 GHz). It should be noted that all chips are manufactured with 8 cores but, to improve the yield, chips with any defective cores have them disabled so that they can still be used, albeit as 4 or 6 core chips.  The T2000 supports 32GB of memory, twice as much as the T1000. It also contains more reliability features than the T1000. Sun appears to be aiming the T2000 at application servers (not sure that I agree with this unless the application layer is fairly thin), network infrastructure services, and mail / messaging; and the T1000 at web servers, portal servers and streaming media servers.
Chips with 4, 6 or 8 cores running at 1.0 GHz can also be deployed as a carrier grade server (Netra T2000) or in blade form (Netra CP3060). Finally, apart from performance-related considerations, it should be noted that the T1-based machines are low voltage systems, i.e. they are meant to appeal to large server farms where power usage is an important consideration.
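
The idea of switching to another thread whenever a delay is encountered can be illustrated with a very simple utilisation model, sketched in Python below. It assumes that each thread alternates between a burst of computation and a stall (e.g. a memory fetch), that switching threads is free, and that stalls overlap perfectly. The cycle counts are hypothetical; they are not Niagara measurements.

```python
def core_utilisation(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of time a core does useful work under switch-on-stall threading.

    Each thread computes for compute_cycles, then stalls for stall_cycles.
    While one thread is stalled the core runs another ready thread, so the
    utilisation grows with the thread count until the core is saturated.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("invalid thread count or cycle counts")
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))


if __name__ == "__main__":
    # Hypothetical workload: 25 cycles of work followed by a 75 cycle stall.
    for n in (1, 2, 4):
        print(f"{n} thread(s) per core -> core busy {core_utilisation(n, 25, 75):.0%}")
```

With these made-up figures a single thread keeps the core busy for only a quarter of the time, whereas four threads can in principle keep it fully occupied. The response time of any individual thread is not improved, which is the throughput versus response time trade-off described above.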

 

Other mainstream hardware vendors appear to be trying to build on dual cores, i.e. to maintain wherever possible the performance features that have been built into single and dual cores. There are no true quad cores at this time although various vendors are trying to convince you that there are (or soon will be). IBM announced the POWER 5+ with “quad cores” in late 2005, and extended the range of servers that it offered with them in August 2006. These systems are primarily aimed at small and medium-sized businesses. In fact, these so called “quad cores” consist of two discrete dual core dies that are packaged into one module, called a QCM (Quad Core Module).

 

Meanwhile, in the Intel / AMD arena the pursuit of “The Holy Grail” that is the EM64T/AMD64 quad core chip has been “won” by Intel, at least in the single and dual CPU system arena. Intel has achieved this by shipping two dual core dies (2 Woodcrests) on a single package. Is this really a quad core? Purists will claim that it is a fudge, but that will not bother Intel’s sales and marketing people. From the limited information that is available I would guesstimate that Intel’s quad core provides circa twice the throughput of a single core and 30-35% more than a dual core – these are fairly modest figures. In the larger SMP system space Intel is working on a quad core chip, codenamed Tigerton, which will sit in a Caneland server (due for release in the second half of 2007).  It is unclear at the moment if this will be a genuine quad core or not; it is speculated that it will include a technology called "dedicated high-speed interconnect" which will provide each processor with a direct pathway to the chipset, a potentially significantly faster solution than its front-side bus technology.
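
The two guesstimates above (twice a single core, and 30-35% more than a dual core) together imply a figure for the dual core relative to the single core. The small Python sketch below simply performs that division; the function name is made up for illustration and the inputs are the rough figures quoted in the text, not benchmark results.

```python
def implied_dual_core_speed(quad_vs_single: float, quad_gain_over_dual: float) -> float:
    """Dual core throughput relative to a single core, implied by two quad core ratios.

    quad_vs_single:      quad core throughput as a multiple of a single core (e.g. 2.0)
    quad_gain_over_dual: fractional gain of the quad core over a dual core (e.g. 0.30)
    """
    return quad_vs_single / (1.0 + quad_gain_over_dual)


if __name__ == "__main__":
    for gain in (0.30, 0.35):
        dual = implied_dual_core_speed(2.0, gain)
        print(f"quad is {gain:.0%} faster than dual -> dual is {dual:.2f} x single core")
```

The implied dual core figure is roughly 1.5 times a single core.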

 

AMD is developing a genuine quad core chip, codenamed Barcelona, which is due to be shipped by the end of Q2 2007. No information is available on its likely performance thus far; I will incorporate it as and when it becomes available.

IO Interconnect Technologies (Server Infrastructure)

PCI has been the major IO interconnect technology for close on a decade. Although it started life in the PC world it quickly spread to servers. Despite its universal popularity, PCI has a number of disadvantages which are becoming more noticeable as the years slip by:

 

·         It employs a shared bus technology (as shown in diagram (a) below). This means that only one device can use the bus at a time, which in turn means that a bus arbitration protocol is required to obtain use of the bus

 

·         Electrical noise can limit the number of devices on a bus. This is handled by a hierarchical approach where multiple buses are accessed via bridges (see diagram (b) below)

 

·         Bandwidth / speed. PCI started out with 32 bit buses running at 33MHz (equivalent to 132 MB/sec). Refinements have led to: 64 bit at 33MHz (264 MB/sec); 64 bit at 66 MHz (528 MB/sec).  It is typically the slowest part of a server, which is becoming an issue with the advent of higher speed networks and cluster interconnects

 

·         It uses a straightforward load-store, flat memory-based communications model, which is a constraint, particularly when compared with a routed, packet-based approach.

 

 

 

A short term solution, particularly with respect to the bandwidth / speed issue, is PCI-X. It started with a top end of 133MHz but versions are now available at 266 MHz and 533 MHz (plus 1066 MHz is being talked about).  The arrival of 266 MHz saw the introduction of double data rate (transferring data on both the rising and falling edges of a clock cycle).  The theoretical speeds of 266 MHz and 533 MHz working at double rate are 2 GB/sec and 4 GB/sec respectively. The other problems remain and issues like electrical noise can be exacerbated by the increase in traffic.
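
The theoretical bandwidth figures quoted for PCI and PCI-X all come from the same piece of arithmetic: the width of the bus in bytes multiplied by the number of transfers per second. The Python sketch below reproduces them. It assumes a 64 bit path for PCI-X and treats 266 and 533 as the effective number of million transfers per second; the function name is hypothetical.

```python
def parallel_bus_mb_per_sec(width_bits: int, mega_transfers_per_sec: float) -> float:
    """Theoretical peak bandwidth of a parallel bus in MB/sec.

    width_bits:             width of the data path (32 or 64 for PCI)
    mega_transfers_per_sec: millions of transfers per second; for a single
                            data rate bus this equals the clock rate in MHz
    """
    return width_bits / 8 * mega_transfers_per_sec


if __name__ == "__main__":
    buses = [
        ("PCI 32 bit, 33 MHz", 32, 33),
        ("PCI 64 bit, 33 MHz", 64, 33),
        ("PCI 64 bit, 66 MHz", 64, 66),
        ("PCI-X 64 bit, 133 MHz", 64, 133),
        ("PCI-X 64 bit, 266 (assumed effective rate)", 64, 266),
        ("PCI-X 64 bit, 533 (assumed effective rate)", 64, 533),
    ]
    for name, width, rate in buses:
        print(f"{name}: {parallel_bus_mb_per_sec(width, rate):.0f} MB/sec")
```

These are peak figures only. A shared bus loses some of this to arbitration and protocol overheads, so sustained throughput will be lower.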

 

Work has been going on in the industry to replace the shared bus approach with a high performance switch-based, point-to-point technology, as shown in the diagram below.

 

 

PCI-SIG has plumped for Intel’s 3GIO technology, which is now known as PCI Express. Here a device is connected to the switch via a serial link. The basic link (called 1x) transmits at 2.5Gbits/sec in each direction (full duplex is supported). Multiple links (also called lanes) can be bonded together to form a channel to support higher bandwidth. The PCI Express specification talks about 2x, 4x, 8x, 16x and 32x links (a 32x link supports a theoretical 16GB/sec). Support for PCI is part and parcel of PCI Express. “Bridges” can translate PCI Express packets to PCI signals and vice versa. These bridges can be on a motherboard or on a card.  The switch itself can be part of a server chipset. For example, the Intel 900 series chipset incorporates the switch on the southbridge; it includes “bridge” logic which will allow both PCI and PCI Express devices to be supported.  Advanced Switch is a new feature of PCI Express which will attempt to challenge RapidIO in the embedded system market.
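
The 16GB/sec figure for a 32x link can be derived from the 2.5Gbits/sec lane rate, as the Python sketch below shows. It relies on one fact that is not stated above: first-generation PCI Express uses 8b/10b encoding, so only 8 of every 10 bits on the wire carry data. The function name is made up for illustration, and packet and protocol overheads are ignored.

```python
LANE_GBITS_PER_SEC = 2.5      # raw signalling rate per lane, per direction (from the text)
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line encoding: 8 data bits per 10 bits sent


def pcie_gb_per_sec(lanes: int, both_directions: bool = True) -> float:
    """Theoretical PCI Express bandwidth in GB/sec for a link of the given width."""
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    per_direction = lanes * LANE_GBITS_PER_SEC * ENCODING_EFFICIENCY / 8
    return per_direction * (2 if both_directions else 1)


if __name__ == "__main__":
    for lanes in (1, 2, 4, 8, 16, 32):
        one_way = pcie_gb_per_sec(lanes, both_directions=False)
        total = pcie_gb_per_sec(lanes)
        print(f"{lanes}x: {one_way:.2f} GB/sec each way, {total:.2f} GB/sec in total")
```

Note that the 16GB/sec figure counts both directions of the full duplex link; in one direction a 32x link carries a theoretical 8GB/sec.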

 

Although PCI Express is arguably in pole position to take over from PCI as the de facto industry standard, there are other serial switch-based technologies that are in the frame:

 

·         HyperTransport developed by AMD. This is already used as a CPU interconnect on AMD and MIPS processors. It employs a low latency parallel point-to-point technology with varying width of bits per link – 2, 4, 8, 16 or 32. A 32 bit wide path can support a theoretical 22.8GB/sec. It is used as an IO bus in Apple’s G5 PowerMac. PCI is supported. A feature called Direct Path is being introduced that will attempt to challenge RapidIO in the telecoms arena

 

·         RapidIO developed by Motorola. The focus of this low latency technology has been on embedded systems, particularly in the telecoms market, e.g. DSP farms.  There are two variants: Parallel RapidIO has 8 and 16 bit paths, operating at up to 6 GB/sec; while Serial RapidIO has 1 and 4 lanes, operating at up to 1.8GB/sec

 

·         Infiniband.  This technology is arguably trailing at the moment in the IO interconnect space.  See the Cluster Interconnect Technologies taster for background information.

 

Note that it is not valid to simply compare the bandwidth of a single link across the technologies. The bandwidth of the overall system, along with the associated latencies, is the real criterion.  In fact, it is probable that the performance of the switch will be the ultimate differentiator.

 

The high speed serial IO interconnect is a marketplace which it will pay to monitor, as the various parties jockey for market share. At the current time PCI Express seems to be the favourite in the general purpose server arena, with RapidIO the favourite in the embedded design market. HyperTransport and Infiniband both appear to be trying to be all things to all men.

2007 Server Announcements (July 2007)

IBM Power 6. IBM formally announced their latest chip in late May 2007. There have been rumours that they have been struggling to produce it in volume at the expected higher clock speeds (circa 5GHz). The fact that the Power 6 will only be shipped in the p570 server (up to 16 cores) initially from November 2007 may well corroborate this story. IBM’s view, no doubt, is that there is sufficient commercial mileage in the existing Power 5-based servers to warrant a modest start.

 

The main technical highlights of the Power 6 include:

 

·         chips to run at 3.5GHz, 4.2GHz and 4.7GHz

·         65nm technology

·         still dual core, albeit packaged in 4s (Quad Core Module)

·         8MB of level 2 cache per chip and 32MB level 3 cache per chip

·         two memory controllers per chip.

 

Using the published SPECint2006 figure (processor speed) a 4.7GHz core appears to be 55-60% faster than a 2.2GHz Power 5+. Based on observation of limited benchmarks so far I would speculate that overall performance / throughput is likely to be 40% better than the Power 5+.
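
It is worth noting what the SPECint2006 comparison implies about performance per clock cycle, since the clock speed has more than doubled. The Python sketch below works this out from the figures quoted above; the function name is an assumption for illustration and the result is only as good as the rounded inputs.

```python
def per_clock_ratio(speedup: float, new_ghz: float, old_ghz: float) -> float:
    """Performance per clock cycle of the new chip relative to the old one.

    speedup is the new chip's performance as a multiple of the old chip's
    (1.55 means 55% faster).
    """
    return speedup / (new_ghz / old_ghz)


if __name__ == "__main__":
    print(f"clock speed ratio: {4.7 / 2.2:.2f}")
    for speedup in (1.55, 1.60):
        ratio = per_clock_ratio(speedup, new_ghz=4.7, old_ghz=2.2)
        print(f"{speedup - 1:.0%} faster overall -> {ratio:.2f} x the work per clock cycle")
```

In other words, a clock speed that is roughly 2.1 times higher yields a core that is only 1.55-1.6 times faster on this benchmark, i.e. each clock cycle appears to deliver around three quarters of the work that it did on the Power 5+.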

 

Sun Sells Fujitsu’s Newly Announced M Series.  A short history lesson is required to provide the necessary background for this move. Around 2002-3 Sun was preparing to ship the UltraSPARC IV, their first dual core offering. It was perceived as a short term solution while UltraSPARC V, the genuine dual core solution, was being developed. Sun also had plans for a multi-core chip, subsequently dubbed Niagara (see the multi-core section on this page).  However, Sun started to experience financial problems, starting in the 2002 fiscal year and they declared a net loss of over $2bn in 2003. UltraSPARC V development was shelved and Sun entered into an agreement with Fujitsu to market their SPARC-based offerings – Fujitsu had been designing SPARC-compliant chips since 1995. In 2005 Sun started to ship the UltraSPARC T1, alias Niagara, as a single socket system with between 4 and 8 cores. It was now re-branded as a general purpose system, whereas the original idea had been to market it as a specialist server for web, streaming media and other telecoms-related functions.

 

In April 2007 Fujitsu announced its new M series, based on their latest SPARC64 VI chip, which Sun is actively selling. The current top of the range model is the M9000 with up to 64 dual core chips, each running at 2.4GHz. The latest base SPECint2006 figure for a 2.28GHz chip is 9.7, very roughly in the same ballpark as Power 5. This represents a significant increase in raw power over UltraSPARC IV. Another major plus point is more granular domaining facilities than are found on Sun’s mid to high-end server range; a single domain can consist of as little as 1 socket plus associated memory and IO connectivity. 

 

While Sun now has a more competitive server to offer (in power terms) it begs the question as to what its server strategy is. Niagara appears to form the basis of future in-house development for the moment. Niagara II (UltraSPARC T2) is due in the second half of 2007. It will have a 15% increase in clock speed, two ALUs (Arithmetic and Logic Units) rather than one, a floating point unit plus the ability to support 8 threads per core (rather than the current figure of 4). It is difficult to imagine that these enhancements will greatly improve performance. Of greater importance is the likelihood that a server will be made available with 2 sockets, i.e. up to a total of 16 cores.

 

In the medium term Sun will no doubt hope to become truly competitive in the enterprise class arena with Rock. Rock will reputedly consist of 16 core chips, 32 threads per core. However, it will have to be deployed in 4 and 8 socket servers (codenamed SuperNova) before it becomes a serious threat to the competition. Sun is talking about 2008 for initial implementations of Rock, but from my cynical perspective that sounds too soon; 2009 sounds more realistic (as a shipping date) with 2010 for the initial SuperNovas.

 

Meanwhile, Sun will presumably be forced to continue to offer the Fujitsu systems at the higher end.

 

AMD.  The background here is that AMD, having trounced Intel in the 1-8 core arena for a couple of years, have themselves been on the receiving end since Intel’s plethora of announcements in mid-2006, particularly in the one and two socket dual core area. Intel also claimed victory in the quad core area, largely by some sleight of hand, packaging two dual cores together on a single module – similar to what IBM did on Power 5+.

 

AMD’s marketing riposte has been “wait for Barcelona – our genuine quad core”. This was promised for the first half of 2007, then mid-2007 and now September / October seems to be the current estimate.

 

Despite Intel’s improvements in 2006 AMD has continued to lead them in the 8 and 16 core areas. AMD consolidated this lead early in 2007 by announcing the M2 variant of the 8000 series which provides a 10-15% increase in performance. Sun and HP both ship servers that use these chips / chipsets. Sun provides the 4600 (up to 16 cores); HP only provides an 8 core version (DL585 G2), as it offers its Integrity servers at 16 cores and above. Meanwhile, Intel is due to introduce Tigerton, a quad core Xeon, in the second half of 2007 (2008 realistically?) in an attempt to compete with AMD in this space.

 

Server Announcements (November 2007)

AMD. The story remains one of AMD’s continuing struggles to get Barcelona, their quad-core chip, to perform in line with expectation. The initial chips, running at 1.9 and 2GHz, quietly appeared in October, AMD attempting to cover their embarrassment by emphasising other features, e.g. low voltage. Chips running at 2.5GHz and 2.6GHz have been promised for January 2008 although the latest rumours indicate that this date will not be met. In fact there is some uncertainty as to whether servers with the initial chips are being shipped at this time, or whether they have been withdrawn. Notwithstanding this indecision I am giving my estimates for the likely performance of servers with these chips below.

 

Intel.  Despite my previous innate cynicism Intel has indeed announced Tigerton, the quad core Xeon. HP’s DL580 G5 (4P16C) is the first announced server with these chips. Dell will probably follow shortly. In recent years IBM and Unisys have specialised in providing more scalable Intel servers by designing and building their own chipsets and node expansion capabilities.  IBM has already announced x4, its latest design, which will support Tigerton. The 3850 M2 is a single node with 4 quad core sockets (4P16C). This will eventually be upgradeable to a 3950 M2 with a maximum of 4 nodes currently (16P64C), using its ScaleXpander feature. These servers will probably be available in Spring 2008. Unisys support for Tigerton is likely in Q4 2008.

 

In an attempt to put recent AMD and Intel servers (and IBM customisations) into perspective, here is my guesstimate of relative performance for OLTP workloads (treat with great care – do your own research if you are seriously interested in any of these servers). Be particularly aware that I have not seen any published AMD benchmarks thus far. The figures assume the use of SQL Server 2005 running on Windows 2003 Enterprise x64:

 

Intel 4p8c 7041 3GHz = 1 (base system)

AMD 4p8c 8222SE 2.8GHz = 1.5

Intel 4p8c 7140 3.4GHz = 1.6

Intel 4p16c 7350 2.9GHz (standard Intel bus, e.g. DL580G5) = 2

AMD 4p16c 8350 2GHz = 2 (TBC)

AMD 4p16c 8360SE? 2.6GHz = 2.4 (TBC)

Intel 4p16c 7350 2.9GHz (IBM X4 architecture - 3850 M2) = 2.5

Intel 8p16c 7140 3.4GHz (Unisys Enterprise7000/one) = 2.5

Intel 8p32c 7350 2.9GHz (IBM X4 architecture - 3950 M2) = 3.7

Intel 16p32c 7140 3.4GHz (Unisys Enterprise7000/one) = 3.7.

 

I will revise the above figures, particularly for AMD, when any information comes to light.
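
One way of reading the figures is to convert them into throughput per core, relative to the base system, which shows how much is lost as cores and sockets are added. The Python sketch below does this for the guesstimates listed above. The data structure and function names are made up for illustration, and the output inherits all of the caveats attached to the original figures.

```python
from typing import NamedTuple


class Server(NamedTuple):
    label: str
    sockets: int
    cores: int
    relative: float  # guesstimated OLTP throughput, base system = 1


SERVERS = [
    Server("Intel 7041 3GHz (base system)", 4, 8, 1.0),
    Server("AMD 8222SE 2.8GHz", 4, 8, 1.5),
    Server("Intel 7140 3.4GHz", 4, 8, 1.6),
    Server("Intel 7350 2.9GHz (standard Intel bus)", 4, 16, 2.0),
    Server("AMD 8350 2GHz (TBC)", 4, 16, 2.0),
    Server("AMD 8360SE? 2.6GHz (TBC)", 4, 16, 2.4),
    Server("Intel 7350 2.9GHz (IBM 3850 M2)", 4, 16, 2.5),
    Server("Intel 7140 3.4GHz (Unisys)", 8, 16, 2.5),
    Server("Intel 7350 2.9GHz (IBM 3950 M2)", 8, 32, 3.7),
    Server("Intel 7140 3.4GHz (Unisys)", 16, 32, 3.7),
]

BASE = SERVERS[0]


def per_core_index(relative: float, cores: int) -> float:
    """Throughput per core, normalised so that the base system scores 1.0."""
    return (relative / cores) / (BASE.relative / BASE.cores)


if __name__ == "__main__":
    for s in SERVERS:
        index = per_core_index(s.relative, s.cores)
        print(f"{s.sockets}p{s.cores}c {s.label}: {s.relative} overall, {index:.2f} per core")
```

For example, going from the 3850 M2 (16 cores, 2.5) to the 3950 M2 (32 cores, 3.7) doubles the number of cores but increases the estimated throughput by a factor of only 1.48.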

 

IBM Power 6. See details above. The slow rollout of the new chip continues. Thus far, apart from the p570, IBM has only announced that it will be deployed on the JS22 Blade.

 

Sun Multi-Core chips.  As expected Sun has announced the first modest set of enhancements to its multi-core offerings with Niagara II (e.g. T6300). As previously stated it includes: a 15% increase in clock speed; two ALUs (Arithmetic and Logic Units) rather than one; a floating point unit plus the ability to support 8 threads per core (rather than the current figure of 4). It is still a single socket server; presumably we have to wait longer for the expected dual socket version.