The End of Architecture
Burton Smith, Tera Computer Company
17th Annual Symposium on Computer Architecture
Seattle, Washington
May 29, 1990
(Thanks, Wendy!)
the body of a very slow loop
Briefly, undervolting is the process of manipulating a processor’s P-state tables to cause it to run at a lower voltage, while keeping frequency unchanged. This has no effect whatsoever on performance, but can extend battery life and reduce heat dissipation.
Undervolting cannot damage your CPU, but it can cause your machine to crash. You should be prepared to boot your Mac into safe mode in the event that something goes awry.
I’m going to assume a basic knowledge of undervolting, and merely describe the process and results for my MacBook. (For more detailed info, there’s a great article at Nordic Hardware.)
OK, let’s get started. Here’s what we need:
Continue reading ‘How to Undervolt a MacBook’
Burton Smith, David Patterson and a host of other parallel computing and computer architecture demigods were on hand to answer the question, “What the hell are we going to do with all these cores?”
If, like me, you couldn’t make it, at least you can view the slides, available here.
Compiler writers and assembly coders have long bemoaned the fact that x86 has no CMOVcc store. Additionally, many are shocked to learn that a CMOVcc load always reads memory. Consider the following situation:
int x = (p == NULL) ? 0 : p->v;
You’d like to generate this code, but it will crash when p is null:
cmp rax, 0
cmovne rcx, [rax+foo_offset]
I just discovered this comp.arch posting, where Andy Glew explains how this all came to be.
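One common way around the faulting load is to make the address conditional rather than the data, so the load itself is always safe. Here is a minimal sketch in C, assuming a made-up `struct foo` and a fallback variable `zero_fallback` (neither comes from the original example). Whether the compiler actually emits a cmov for the select is up to the compiler; it is free to use a branch instead.

```c
#include <stddef.h>

/* Hypothetical struct standing in for whatever p points at. */
struct foo {
    int pad;
    int v;
};

/* Assumed fallback: an always-readable location holding the default value. */
static const int zero_fallback = 0;

/* Returns p->v, or 0 when p is NULL, using one unconditional load. */
int load_v_or_zero(const struct foo *p) {
    /* Select an address, not a value. A compiler may lower this select
       to a register-to-register cmov, which never touches memory. */
    const int *addr = (p != NULL) ? &p->v : &zero_fallback;

    /* The load always reads a valid address, so it cannot fault. */
    return *addr;
}
```

The trade-off is an extra dependency: the load cannot start until the address select has resolved.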
CNet News wonders why, “Despite its aging design, the x86 is still in charge.”
The article includes this quote from Simon Crosby, CTO of XenSource:
There’s no reason whatsoever why the Intel architecture remains so complex. There’s no reason why they couldn’t ditch 60 percent of the transistors on the chip, most of which are for legacy modes.
Wow. 60%? What a huge waste! Could that really be true? Let’s take a look at a random K8 die shot from the inter-web:
Hmm. See that highly regular pattern that comprises the right half of the chip? That would be cache.
Let’s assume that Crosby meant 60% of the other part. You know, the not-cache stuff. Even if we are most charitable, he is still wrong. We could perhaps simplify the front end of the chip (marked above as “Fetch Scan Align Micro-code”). Still, much of that section is for the branch predictor and TLB, which might be good to keep around.
If I were going to invent a number, I’d have picked something much closer to 1%.
Here is a quote from Chris Hecker at GDC 2005:
Modern CPUs use out-of-order execution, which is there to make crappy code run fast. This was really good for the industry when it happened, although it annoyed many assembly language wizards in Sweden.
I first heard this when Chris was quoted by Pete Isensee (from the Xbox 360 team) in his NWCPP talk a year ago. Maybe Chris was kidding. I don’t know. What I do know is:
Processors implement dynamic scheduling because sometimes the ideal order for a given sequence of instructions can only be known at runtime. In fact, the ideal order can change each time the instructions are executed.
Imagine your binary contains the following very simple code:
mov rax, [foo]
mov rbx, [bar]
Two loads, that’s all. Let’s assume that each of the loads misses cache 10% of the time. Often, one will miss but the other will hit. If you have an in-order machine, and the first load misses, you are forced to wait: you cannot proceed to the 2nd load, and you cannot hide any of the miss latency.
No matter how much of an uber assembly coder you are, you are going to be forced to choose an order for these two loads. More likely, your compiler will make this choice for you. Either way, that choice will be wrong at least some of the time.
An OoO processor can do the right thing every time.
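To put rough numbers on the overlap, here is a toy model in C. Everything in it is an assumption for illustration: the latencies (`HIT_CYCLES`, `MISS_CYCLES`) are invented, the in-order machine is modeled as blocking completely on a miss, and the OoO machine is modeled as having both loads in flight at once. It ignores the surrounding instructions, which is where most of the real benefit comes from, so treat it as a sketch rather than a performance claim.

```c
#include <assert.h>

/* Assumed latencies in cycles; illustrative, not measured. */
enum { HIT_CYCLES = 3, MISS_CYCLES = 200 };

/* Toy in-order machine: a load that misses blocks the next load,
   so the two latencies add up. */
long in_order_cycles(int foo_misses, int bar_misses) {
    long foo = foo_misses ? MISS_CYCLES : HIT_CYCLES;
    long bar = bar_misses ? MISS_CYCLES : HIT_CYCLES;
    return foo + bar;
}

/* Toy OoO machine: both loads are in flight together,
   so only the slower one determines the total. */
long out_of_order_cycles(int foo_misses, int bar_misses) {
    long foo = foo_misses ? MISS_CYCLES : HIT_CYCLES;
    long bar = bar_misses ? MISS_CYCLES : HIT_CYCLES;
    return foo > bar ? foo : bar;
}

/* Expected cycles, scaled by 10000, when each load independently
   misses miss_pct percent of the time. Integer math avoids rounding. */
long expected_x10000(long (*model)(int, int), int miss_pct) {
    assert(miss_pct >= 0 && miss_pct <= 100);
    long total = 0;
    for (int f = 0; f <= 1; f++) {
        for (int b = 0; b <= 1; b++) {
            long pf = f ? miss_pct : 100 - miss_pct;
            long pb = b ? miss_pct : 100 - miss_pct;
            total += pf * pb * model(f, b);
        }
    }
    return total;
}
```

With the 10% miss rate from the example, `expected_x10000(in_order_cycles, 10)` comes to 454000 (45.4 cycles on average) and `expected_x10000(out_of_order_cycles, 10)` comes to 404300 (about 40.4 cycles), under these made-up latencies.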
I’ve been reading Patterson’s classic RISC manifesto tonight.
I find nearly all of his arguments compelling. I’m so credulous. I’ve got to disagree with something or I’ll lose my CISC-zealot card. Ah, here we are:
Better use of chip area.
If you have the area, why not implement the CISC? For a given chip area there are many tradeoffs for what can be realized. We feel that the area gained back by designing a RISC architecture rather than a CISC architecture can be used to make the RISC even more attractive than the CISC. For example, we feel that the entire system performance might improve more if silicon area were instead used for on-chip caches [Patterson,Sequin80], larger and faster transistors, or even pipelining. As VLSI technology improves, the RISC architecture can always stay one step ahead of the comparable CISC. When the CISC becomes realizable on a single chip, the RISC will have the silicon area to use pipelining techniques; when the CISC gets pipelining the RISC will have on chip caches, etc. The CISC also suffers by the fact that its intrinsic complexity often makes advanced techniques even harder to implement.
I think this one ended up not being a very big deal. Once we started translating the software-visible CISC ops into internal micro-ops, we got most of the advantages of RISC. The decoder is a complex machine, but we solved it once and now it’s just a fixed amount of overhead.
But what the heck do I know — I’m a software guy. I’m just parroting the hardware gurus.
I just read this old Mike Abrash article (warning, 3.5M PDF) from Byte magazine about optimizing for the 286 and 386. If you don’t know Mike Abrash’s name, you almost certainly know his work.
This is fascinating stuff.
Probably the most amazing thing is how much instruction fetch dominates 286/386 performance. Abrash frequently counts instruction bytes and includes this in his calculations for expected execution speed. After a bit of Googling, this made more sense: these were the days before instruction caches.
(Although it’s still possible, you rarely see fetch-limited code on something like a K8.)
So, if you think x86 is an ugly instruction set, consider this heritage. Variable instruction length actually made a lot of sense back then. It was a form of compression. For the same reason, Intel added complex instructions like BOUND and REP MOVSW. These compactly express a whole bunch of work.
I guess some things never change. I still do the same kind of measurements that Abrash did all those years ago. There are small differences — he calls a timer routine, where I can simply execute rdtsc — but the method is the same. I find this remarkable, considering how different the machines are.
(via Osterman’s blog)
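For the curious, the rdtsc style of measurement mentioned above looks roughly like this in C. This is a sketch, not my actual harness: `sum_below` is a made-up workload, `min_cycles` is a hypothetical helper name, and `__rdtsc()` is the compiler intrinsic from `<x86intrin.h>` (GCC and Clang, x86 only). Note that on current processors the timestamp counter ticks at a constant rate, so the result is in TSC ticks rather than core clock cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(); GCC and Clang on x86 only */

/* Hypothetical workload to be timed: sum the integers below n. */
uint64_t sum_below(uint64_t n) {
    volatile uint64_t acc = 0;  /* volatile keeps the loop from being optimized away */
    for (uint64_t i = 0; i < n; i++) {
        acc += i;
    }
    return acc;
}

/* Run fn(arg) `runs` times and return the smallest tick count observed.
   Taking the minimum filters out interrupts and cold-cache runs. */
uint64_t min_cycles(uint64_t (*fn)(uint64_t), uint64_t arg, int runs) {
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t start = __rdtsc();
        (void)fn(arg);
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}
```

A call such as `min_cycles(sum_below, 1000, 5)` gives a best-case tick count for the workload. Careful measurements usually add a serializing instruction around the reads, which this sketch leaves out.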
This web server uses a Tyan S2462 motherboard, and thus requires an oddball ATX GES power supply. Near as I can tell, ATX GES was some kind of stopgap standard used primarily (solely?) by older Athlon MP motherboards.
Today I installed a new Seasonic M12. The new power supply conforms to the EPS12V spec, but this is easily converted to ATX GES by way of a $20 adapter. The Seasonic bears an 80 PLUS sticker, indicating that it will exhibit 80% or greater efficiency. I’m hoping this will lower my electric bill.
Perhaps I’m tempting fate here, but I’m about ready to declare victory.
I have finally gotten to the root of a problem with one of my Linux boxes (this very web server, actually). The machine had become increasingly unstable over the last 6 months, and unfortunately this was coincident with the addition of some new hardware and umpteen software changes, which effectively obscured the real culprit: leaky capacitors.
It appears that I am not the only person to have problems with leaky capacitors on this exact motherboard. The good news is that I’m now up and running on new hardware, and things are looking good.
This will be the third motherboard I’ve had in this system. The first one caught on freakin’ fire.
Some guys have all the luck.