2004-05-04
(C) Guido Draheim
guidod@gmx.de

 

AMD64 / EM64T - The coming market

The following text has been taken from www.sciencedaily.com/encyclopedia/em64t . My expansions on the topic can be found near the end of this page.

EM64T (Extended Memory 64-bit Technology) is an extension to the IA-32 instruction set developed by Intel which adds 64-bit extensions to the x86 architecture. EM64T is derived from and largely compatible with AMD's AMD64 instruction set extension.

Intel's first processor to implement the EM64T technology is expected to be a processor codenamed Nocona. The Nocona will eventually become a dual-processor version of the Intel Xeon. Since the Xeon itself is directly based on Intel's desktop processor, the Pentium 4, one can expect to see this technology come to the consumer Pentium line eventually too.

The main idea of EM64T is to give Intel processors access to more than 4 gigabytes of memory. 4 GB is the standard memory size limit of 32-bit processors, such as IA-32 processors. With 64-bit memory addresses, the theoretical memory size limit would be 16 exabytes (16 giga-gigabytes), although the initial implementations do not implement the full 64-bit addressability.

The history of the EM64T project is long and convoluted, mainly due to the internal politics of the Intel Corporation. It started out with the codename Yamhill, named after a county in the state of Oregon in the USA. After several years of denying this project existed, Intel eventually admitted it existed in early 2004, and gave it the codename CT (Clackamas Technology); as it turns out Clackamas is the name of a county adjacent to Yamhill county. Then within the space of weeks of the CT announcement, Intel gave it several new names. After the spring 2004 IDF, it gave it the decidedly lacklustre name of IA-32E (IA-32 Extensions), and then a few weeks later it came up with the name EM64T. Intel's chairman, Craig Barrett, had to admit that this was one of their worst kept secrets.

This technology is not compatible with Intel's earlier 64-bit CPU technology, the Itanium processor based on IA-64 technology. It cannot run the same software written for IA-64. EM64T is an extension of the 32-bit x86 or IA-32 instruction set, while IA-64 is a complete rethink from scratch. One of the biggest complaints about IA-64 was that it was not able to run the existing IA-32 software fast enough, since it emulated the older IA-32 instructions rather than directly interpreting them. EM64T shouldn't have such a problem because IA-32 would just be a subset of its own native machine language.

Intel tried to keep the existence of EM64T secret for a long time for two reasons. The first reason was that it did not want to give its customers mixed signals about the future viability of its Itanium IA-64 processors. However, the success of AMD's Opteron and Athlon 64 processors, based on its AMD64 technology, pretty much meant that Intel had to respond to the competitive threat. Which brings us to the second reason for Intel's secrecy: Intel didn't want to admit that it had to copy from its arch-rival AMD. That's why it gave it the brand name EM64T rather than AMD64, even though they are near identical twins.

Early reviews of the Nocona implementation of the architecture are not good [1] [1], with reviewers attributing this to many 64-bit instructions having been implemented by emulation rather than directly. However, it is expected to improve.

Note: The original source of this article can be found on the main Wikipedia Web site. - This article is licensed under the GNU Free Documentation License, which means that you can copy and modify it as long as the entire work (including additions) remains under this license.


(created 2004-05-04):

(largefile) I have been interested in the 64bit extensions from AMD for quite some time. The reason: databases need to access large files, and cutting video files (recorded off the air from DVB) touches large files as well. This is foremost a topic of large files (with sizes >2GB) and not of large programs. There are no programs with a size >2GB.

The handling of large files is a twofold topic. (a) First, you need to handle large file sizes and the correspondingly large offsets of the read/write pointer of a unix file descriptor. The old "long" type was only 32bit. For a long time (going back a decade) the unix vendors were looking for ways out of this constraint. They introduced the "off_t" type, and for Unix98 branding a unix kernel must support 64bit file offsets. The problems of the changeover on 64on32 systems have been described by me earlier - in a freshmeat article and on its own website about largefile problems.
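A minimal sketch of why a 32bit signed offset stops at 2GB, using Python's struct module to stand in for the C types (the helper name fits() is made up for illustration; "<i" plays the role of a 32bit long, "<q" of a 64bit off_t):

```python
import struct

TWO_GIB = 2 ** 31  # first file offset that no longer fits a signed 32bit value


def fits(fmt, offset):
    """Return True if `offset` can be stored in the integer type given by `fmt`."""
    try:
        struct.pack(fmt, offset)
        return True
    except struct.error:
        return False


last_ok_32 = fits("<i", TWO_GIB - 1)   # signed 32bit: largest representable offset
first_bad_32 = fits("<i", TWO_GIB)     # signed 32bit: one byte further overflows
same_in_64 = fits("<q", TWO_GIB)       # signed 64bit: plenty of room left

print(last_ok_32, first_bad_32, same_in_64)
```

The same boundary is where a seek into a large file fails on a system that still uses a 32bit offset type.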

(b) The second area is a different way of accessing files, namely memory mapping of files, in posix speak using an mmap() call. This call takes a file offset and a size (e.g. offset zero for the start of the file) and gives the range a start and end address in main memory. The difference between start and end address is the size that shall be mapped. Without mmap() one would implement similar behavior with malloc() to grab a piece of main memory of the given size from the system, then lseek() to the given file position, and read() that amount of data into main memory. Either way, all data bytes now appear in main memory at an address.
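The two access styles can be put side by side in a short sketch. It uses Python's os and mmap modules rather than the C calls; the function names and the demo file are made up for illustration, and the mapped variant rounds the offset down because mapping offsets must be aligned to the allocation granularity.

```python
import mmap
import os
import tempfile


def read_copy(path, offset, size):
    """Classic way: seek to the offset and copy the bytes into a fresh buffer."""
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, size)  # small regular file: one read() is assumed to suffice
    finally:
        os.close(fd)


def read_mapped(path, offset, size):
    """mmap way: map the range and index it like ordinary memory."""
    gran = mmap.ALLOCATIONGRANULARITY
    base = offset - offset % gran      # aligned start of the mapping
    delta = offset - base              # where our range begins inside the mapping
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + size,
                       access=mmap.ACCESS_READ, offset=base) as mm:
            return mm[delta:delta + size]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")   # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write(bytes(range(256)) * 40)         # 10240 bytes of known content
    copied = read_copy(demo_path, 5000, 16)
    mapped = read_mapped(demo_path, 5000, 16)

print(copied == mapped)
```

Both functions return the same bytes; the difference is only in who moves the data and when.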

(background) The modern unix kernels however do not copy any byte from the file to main memory at the time of the mmap() call. Instead they assign unbound memory pages - this uses a feature of the MMU (memory management unit) of a CPU. Originally the MMU would map an address of the program's address space (virtual address) to the real hardware address of main memory (physical address). That virt2phys mapping allows the system to move a memory block physically without notifying the application - e.g. to compact many smaller unused chunks of main memory into a single larger block that is big enough to hold a program. An extension to this virt2phys mapping was added shortly afterwards with the swap area - some programs sleep for a long time in the background (e.g. a mail daemon) but they still hog memory. We can make use of that physical memory by cutting the bonds - the actual physical data is moved to disk (into a swap area) but the virtual memory addresses still exist. If the background process wakes up and needs some of its physical data then a `page trap` (page fault) occurs. The system will put the process on hold and hurry to get the physical data back from swap space into physical main memory. The virt2phys mapping allows the system to put it back at a different physical address than it had originally.
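A toy model may help to picture the virt2phys mapping. This is only a sketch under simplifying assumptions (a 4096 byte page, a dictionary as page table, frames handed out in sequence); the class and its method names are invented for illustration and have nothing to do with a real kernel interface.

```python
PAGE_SIZE = 4096  # assumed page size, typical for x86


class ToyMMU:
    """Toy model: virtual page number -> physical frame, or None when swapped out."""

    def __init__(self):
        self.page_table = {}
        self.next_frame = 0
        self.faults = 0

    def _alloc_frame(self):
        frame = self.next_frame
        self.next_frame += 1
        return frame

    def map_page(self, vpage):
        self.page_table[vpage] = self._alloc_frame()

    def swap_out(self, vpage):
        # The data moves to disk, but the virtual address stays valid.
        self.page_table[vpage] = None

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if self.page_table[vpage] is None:                 # the "page trap"
            self.faults += 1
            self.page_table[vpage] = self._alloc_frame()   # may be a different frame
        return self.page_table[vpage] * PAGE_SIZE + offset


mmu = ToyMMU()
mmu.map_page(5)
vaddr = 5 * PAGE_SIZE + 12
before = mmu.translate(vaddr)   # resident page, no trap
mmu.swap_out(5)
after = mmu.translate(vaddr)    # trap, page comes back in another frame

print(before, after, mmu.faults)
```

The program uses the same virtual address both times, while the physical address behind it has changed.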

(fullmmap) Back to memory mapping of files - it is possible to mmap() only those parts of a file that we are currently interested in. If we move beyond that part then we call munmap()/mmap() another time on the adjacent part. This is also known as moving an "access window". It is a tedious task and it requires the application writer to come up with some check-and-manage code that constantly checks when and where to move the access window. A lot easier is the variant to mmap() the whole file into memory from zero offset to the actual end of file. Since a modern unix kernel will postpone the loading of pages until actual access, this is not a problem even for very, very large files. But wait, there is a problem: if the machine is 32bit then the file size may lie beyond the limits of its address space, simply because file sizes are measured in 64bit.
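To show the bookkeeping difference, here is a sketch of a byte search written both ways with Python's mmap module. The function names, the window size and the demo file are assumptions for illustration; the needle is assumed to be non-empty and the window must stay a multiple of the allocation granularity.

```python
import mmap
import os
import tempfile


def find_windowed(path, needle, window=mmap.ALLOCATIONGRANULARITY):
    """Search by moving a mapped access window over the file."""
    size = os.path.getsize(path)
    overlap = len(needle) - 1  # extra bytes so a match straddling two windows is seen
    with open(path, "rb") as f:
        base = 0
        while base < size:
            length = min(window + overlap, size - base)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=base) as mm:
                hit = mm.find(needle)
                if hit != -1:
                    return base + hit
            base += window     # unmap the old window, map the adjacent one
    return -1


def find_fullmap(path, needle):
    """Search with the whole file mapped: no window management at all."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "big.bin")   # hypothetical demo file
    gran = mmap.ALLOCATIONGRANULARITY
    data = bytearray(b"." * (3 * gran + 100))
    needle_pos = 2 * gran - 3                  # straddles a window boundary
    data[needle_pos:needle_pos + 6] = b"NEEDLE"
    with open(demo_path, "wb") as f:
        f.write(data)
    windowed_hit = find_windowed(demo_path, b"NEEDLE")
    fullmap_hit = find_fullmap(demo_path, b"NEEDLE")

print(windowed_hit == fullmap_hit == needle_pos)
```

The windowed variant needs the overlap and the loop only because of the limited address range; with a full map the same search is a single call.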

(needed?) Intel claimed beforehand that such a thing is not actually needed and that it did not expect to need to implement it on its processors before 2006. And in part they are correct. Most programs do not need to access large files (at least not yet) - and those that do have already implemented an access window. Furthermore, scanning multiple gigabytes of main memory takes noticeable time - longer than most people expect. Reason: assume 10ns access delay for each 64bit bus word in main memory. How long does it take to visit all words in four gigabytes of main memory? Answer: (4*1024*1024*1024 / 8) words * 10ns = ~5.4 seconds. Visiting all bytes in a 100MB memory block takes about 0.13 seconds unless optimized with large bursts.
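The estimate depends only on the stated assumptions (10ns per access, one 64bit word per access, no bursts), so it is easy to redo with other numbers. A small sketch, with a function name made up for illustration:

```python
def scan_seconds(num_bytes, word_bytes=8, access_ns=10):
    """Time to touch every bus word once, one access per word, no bursts."""
    words = num_bytes / word_bytes
    return words * access_ns * 1e-9


four_gib = scan_seconds(4 * 1024 ** 3)
hundred_mb = scan_seconds(100 * 1024 ** 2)

print(f"4GB:   {four_gib:.2f} s")
print(f"100MB: {hundred_mb:.3f} s")
```

Changing access_ns or word_bytes shows how strongly the result depends on the memory model that is assumed.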

Furthermore we see that we can actually install more memory, beyond 4GB (the addressable limit of 32bit). The reason lies again in the virt2phys mapping. We can let each program keep working in 32bit mode for its virtual addresses. At the same time we extend the MMU to hold larger values for the physical addresses. This has in fact been done a few years ago - modern x86-compatible desktop processors feature "PAE" = physical address extension, which uses 36bit for the physical address, providing for a maximum of 64GB physical memory. In linux, check /proc/cpuinfo for "pae"; the "pge" and "pse36" flags are related to this feature as well. Still, the application can only see a maximum of 4GB, limiting it to an access window even for files below 64GB.
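A short sketch of the numbers and of the flag check. The sample text is a made-up excerpt in the layout of /proc/cpuinfo, and the helper names are invented for illustration.

```python
GIB = 1024 ** 3


def addressable_gib(bits):
    """Size of an address space with `bits` address bits, in gigabytes."""
    return 2 ** bits // GIB


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of a cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical excerpt, not taken from a real machine.
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme de pse tsc msr pae mce cx8 pge pse36\n"

virtual_limit = addressable_gib(32)    # what a 32bit program can see
physical_limit = addressable_gib(36)   # what PAE can address physically
has_pae = "pae" in cpu_flags(SAMPLE)

print(virtual_limit, physical_limit, has_pae)
```

On a linux machine the same helper can be fed with the content of /proc/cpuinfo, e.g. cpu_flags(open("/proc/cpuinfo").read()).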

(prepare!) The reason we still want the AMD64 extensions lies not so much in electrical reasons, nor in the need to actually touch that much memory, nor in the wish to install more physical memory. The real reason lies in the software industry. Today most people are used to a wealth of computing power - which has also made it easier for software developers to write software. Being efficient with system resources is not of paramount importance. Just give it a megabyte more plus some cpu cycles. It is more important to get product results quickly in order to arrive at the market first. But with large files that changes.

As we saw above, scanning large files still takes noticeable time, even more so when they reside on hard disk. In this area it makes a big difference if a computed answer can be seen after five seconds or ten seconds. It really feels different to the user. So we need to be efficient in accessing such large files - and we would like to dump all the check and management overhead for an access window with mmap(). Just use a fullmmap of a file and let the MMU do the rest. If we happen to cross into a page area that had not been touched before then it raises a `page trap`, the page is loaded as quickly as possible (burst mode!), and execution comes back. No checks in between for each and every single address; that check is done by the MMU, in hardware, and without involving a single CPU cycle (the MMU works in parallel with the CPU's computing power).
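As an illustration of an algorithm that simply walks over a fully mapped file, here is a sketch of a binary search over sorted records. The record layout (sorted unsigned 64bit keys), the function name and the demo file are assumptions for the example; only the pages that the search actually touches need to be loaded.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<Q")  # hypothetical layout: sorted unsigned 64bit keys


def contains(mm, key):
    """Binary search over the mapped records, addressing them like an array."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        (value,) = RECORD.unpack_from(mm, mid * RECORD.size)
        if value < key:
            lo = mid + 1
        elif value > key:
            hi = mid
        else:
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keys.bin")   # hypothetical demo file
    with open(demo_path, "wb") as f:
        f.write(b"".join(RECORD.pack(k) for k in range(0, 200000, 2)))
    with open(demo_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            found_even = contains(mm, 4242)
            found_odd = contains(mm, 4243)

print(found_even, found_odd)
```

There is no window check anywhere in the search loop; the code only computes addresses.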

(everything changes) However, changing the algorithms of your program requires some time. For efficiency the access window management has been carved deep into the scanning algorithms. Taking it out requires quite some rewriting of code. Therefore, programmers want a 64bit addressable machine years before the resulting programs are actually needed on the market. And it is not good to stay behind, given the noticeable advantage of a 64bit-ready program: running faster could have a noticeable market impact. Therefore every new project being started must take care to be 64bit ready, simply because the average running time of a project is about two to three years. So if Intel expected a market impact around 2006 then one cannot wait today (in 2004) and should instead buy a 64bit machine today.

Now to pushing 64bit addressability as a secondary mode of a normal x86 desktop processor - it has a great advantage, since one can continue to use traditional 32bit x86 software and at the same time test new programs for 64bit addressability without moving the program to a different machine and testing it remotely. It's all just local. And last but not least, this is about the cheapest way to help make programs ready for the world of 64bit addressability.

(everybody wants it) This initial urge of programmers for the AMD64 extensions has brought about a stampede. Now every economist in the market knows that the software industry is preparing its programs for the 64bit world. Everyone knows it is tested on x86 with only a minor 64bit extension. Everybody knows the new and faster software can run in parallel with traditional 32bit x86 software. Everybody knows it is cheap to buy the 64bit extension even now. And so it is cheap to prepare your own business and your own installation of information technology for the next years.

Therefore Intel had to move - just to avoid losing market share. An interesting point is the fact that the add-on 64bit extensions of the current Intel processors are much slower than those implemented in AMD64. It turns out to be unimportant for the current market: the programmers only need to test the programs and project the possible advantages when the technology is used in all corners of an operating system. And the economy does not have many programs requiring x86 64bit addressability. Not yet. So nobody can yet feel the advantage of 64bit-optimized programs (actually running) but everybody knows about it.

(conclusion) We can expect that the software industry will march on in that direction. Over the next two years all the software will be converted and made 64bit-ready. Even desktop software originally running only on 32bit x86 computers. And they'll not advertise it with a general "64bit-ready" label (that would include the ia64 or sparc64 or power5). They'll state it in terms of the x86 variant known so widely in the consumer market - and there it is known as AMD64 or EM64T. Perhaps named... AMD64-compatible.


(24.1.2005) Till has pointed out that memory is actually a lot faster these days. The thing is that the delay for each memory line is in the nanoseconds range (15ns tCAS, tRAS) but this memory bank line is very wide. The line (loaded into a buffer with some heavy delay) can be transferred via 400 MHz double data rate on dual channel bus interfaces, which would theoretically be 12 GB/s, so a 4 GB search takes less than a second. - That's both correct and wrong. There is no data structure that matches the linear capability of a DDR RAM, and on a real server a process is not alone. In reality it still takes a few dozen times longer to search a large memory block, but my vanilla example is spoiled nonetheless. Thanks.
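The theoretical figure can be reproduced with a few lines. The 8 byte (64bit) bus width per channel is an assumption carried over from the earlier example, and the function name is made up for illustration.

```python
def peak_bandwidth(clock_hz, transfers_per_clock=2, bus_bytes=8, channels=2):
    """Theoretical peak in bytes per second: double data rate, dual channel."""
    return clock_hz * transfers_per_clock * bus_bytes * channels


bandwidth = peak_bandwidth(400e6)               # 400 MHz clock
seconds_for_4gb = 4 * 1024 ** 3 / bandwidth     # purely linear streaming

print(f"{bandwidth / 1e9:.1f} GB/s, {seconds_for_4gb:.2f} s for 4GB")
```

This is the best case of purely linear streaming; it says nothing about real data structures or about other processes sharing the bus.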