Exactly! But this wasn't just obvious in retrospect; it was what Intel itself was telling the market (& OEMs) at the time!
The only way I can rationalize it is that Intel just "missed" that servers hooked up to networks running integer-heavy, branchy workloads were going to become a big deal. OK, few predicted the explosive growth of the WWW, but look around at the growth of workgroup computing in the early 1990s and this should have been obvious?
I'm not sure that's a fair description of server workloads. I'm also not sure it's fair to say Itanium was bad at integer-heavy, branchy workloads (at least not compared to Netburst).
The issue is more that server workloads are very memory bound, and it turns out that large OoO windows do an exceptional job of hiding memory latency. I'm sure the teams actually building OoO processors knew this, but maybe it wasn't obvious outside them.
Besides, Itanium was also designed to hide memory latency with its very flexible memory prefetch systems.
The main difference between the two approaches is static scheduling vs dynamic scheduling.
Itanium was the ultimate expression of the static scheduling approach. It required that mythical "smart enough compiler" to statically insert the correct prefetch instructions at the optimal places. The compiler had to strike a balance between wasting resources on unneeded prefetches and being unable to issue needed prefetches at all because they were hidden behind branches.
The OoO x86 cores had extra runtime scheduling overhead, but they could dynamically issue loads exactly when they were needed. An OoO core can see loads behind multiple speculative branches (dozens of speculative branches on modern cores). And a lot of people miss the fact that an OoO core can actually absorb the branch-mispredict penalty (multiple times) for branches blocked behind a slow memory instruction that's going all the way to main memory. Sometimes the mispredict cycles are entirely hidden.
In the 90s, static scheduling vs dynamic scheduling was very much an open question. It was not obvious just how badly static scheduling would fall flat on its face (at least for high-end CPUs).