How memory bandwidth is killing AMD's 32-core Threadripper performance

AMD's 32-core Threadripper 2990WX is the fastest consumer CPU ever sold. And let's be clear: We're in full agreement with anyone who says that. But we would also be the first to say it has its limitations.

The most glaring is the lack of consumer applications that can truly exploit the cores available. The other limitation is apparent in the diagram below, which shows how AMD built this 32-core monster. Rather than a single chip with every single CPU core on it, AMD connects four dies using its high-speed Infinity Fabric.

Why memory bandwidth affects the 32-core Threadripper

If you look closer at the diagram, you can see that two of the dies don't have their own memory controllers or PCIe access. Instead, they have to talk to an adjacent CPU die.

It is, essentially, like having a two-apartment unit where the second one must access the hallway outside by going through the first apartment.

[Image: Threadripper 2990WX die topology (updated). Credit: IDG]

AMD says the four-die Threadripper has 25GBps of bandwidth shared among all of the chips.

Perhaps more important is the overall bandwidth available. AMD had initially said the bandwidth available between the four CPU dies was 25GBps bi-directional. The company later amended its documentation to state that 25GBps is the total bandwidth. Compare that with the 16-core Threadripper 2950X, with its 50GBps of bandwidth and two links between the two dies (also updated information from AMD).

[Image: Threadripper 2950X die topology (updated). Credit: AMD]

A two-die 16-core Threadripper 2950X has 50GBps and two links between two dies, vs. the 25GBps among four dies that AMD originally claimed (and then amended).

Many believe this is Threadripper 2990WX's main weakness: Lack of memory bandwidth per core is impacting it in memory-intensive tasks such as compression and encoding. Even worse for Threadripper 2990WX is that bandwidth has to be shared on a CPU with 14 more cores than Intel's Core i9-7980XE.

Below, you can see the result of Sandra 2018 Titanium's memory bandwidth test and the available bandwidth per core. As you can see, the bandwidth per core plummets from almost 5GBps at 8 cores and 16 cores to just 2GBps when you utilize all 32 cores.

[Image: Threadripper 2990WX per-core memory bandwidth in Sandra 2018. Credit: IDG]

Sisoft Sandra 2018 Titanium's per-core memory bandwidth results say the Threadripper has only 2GBps per core available.
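One way to read those per-core figures is to multiply them back out into an implied aggregate. The short Python sketch below does that with the rounded values quoted above (about 5GBps at 8 and 16 cores, about 2GBps at 32 cores). The function name is ours, and the inputs are approximate readings of the chart rather than exact Sandra output.

```python
def implied_aggregate_gbps(per_core_gbps: float, active_cores: int) -> float:
    """Aggregate bandwidth implied by a per-core figure (plain multiplication)."""
    return per_core_gbps * active_cores


# Rounded per-core readings taken from the article text (GBps), keyed by active cores.
# These are approximations ("almost 5", "just 2"), not exact benchmark output.
per_core_readings = {8: 5.0, 16: 5.0, 32: 2.0}

aggregates = {
    cores: implied_aggregate_gbps(bandwidth, cores)
    for cores, bandwidth in per_core_readings.items()
}

for cores, total in aggregates.items():
    per_core = per_core_readings[cores]
    print(f"{cores:>2} cores: ~{per_core:.0f} GBps per core -> ~{total:.0f} GBps aggregate")
```

On these rounded numbers, the implied aggregate grows from 8 to 16 cores but does not grow again at 32 cores, which is why each core's share shrinks.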

Synthetic memory bandwidth tests are one thing. To dig further into performance in memory-intensive tests, we fired up the newest version of the free and popular 7-Zip application. Written by Igor Pavlov, this open-source compression and decompression utility is popular and generally awesome. For example, when I run tests on a laptop and decompress Cinebench R15.08 and its thousands of small files with Windows 10's built-in utility, it takes several minutes to finish. I can actually connect to the Internet, download 7-Zip, and decompress the contents of Cinebench R15.08 with it in less time than it takes the built-in Windows utility to do its thing.

The GUI version runs two tests, for compression and decompression. The overall score looks like a simple average of the two results.

What 7-Zip tests

You can read more about the test on the 7-cpu.com web site, but we've highlighted some of the key information about the tests here. Regarding the Compression test, the website discusses the factors that influence the test results, saying it "strongly depends from memory (RAM) latency, Data Cache size/speed and TLB. Out-of-Order execution feature of CPU is also important for that test." The site goes on: "The compression test has big number of random accesses to RAM and Data Cache. So big part of execution time the CPU waits the data from Data Cache or from RAM."

About the Decompression test, the website says it "strongly depends on CPU integer operations. The most important things for that test are: branch misprediction penalty (the length of pipeline) and the latencies of 32-bit instructions ('multiply', 'shift', 'add' and other). The decompression test has very high number of unpredictable branches."

How we retested Threadripper vs. Core i9

For our retest, we decided to lock both the Threadripper 2990WX and the Core i9-7980XE at 3GHz to remove any variables from each CPU's boost schemes. This was done to make the comparison more dependent on the test rather than the clock speed differences between the two. We also set both to DDR4/3,200 clocks, and both were run in quad-channel mode except where noted. To be up-front: The Threadripper system had a slight edge in CAS latency at CL14 and 1T, while the Core i9 was running at CL15 and 2T. As in our original review, both were running Founders Edition GTX 1080 cards using the same drivers and the same version of Windows 10 Enterprise Edition.

Because much of the concern over Threadripper is its per-core memory bandwidth performance, we decided to run from 1 thread to the maximum number of threads on each CPU. We also decided to see whether performance of the Threadripper would change if you turned off dies, so we ran it with a single die (8 cores/16 threads), two dies (16 cores/32 threads), and all four (32 cores/64 threads).
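We ran our sweep by hand in the 7-Zip GUI, but a similar sweep could be scripted against the command-line build. The Python sketch below only builds the command lines and does not run them. It assumes a `7z` executable on the path and relies on 7-Zip's benchmark command (`b`) with the `-mmt` (thread count) and `-md` (dictionary size as a power of two) switches; treating `-md25` as the 32MB dictionary is our assumption, so check it against the documentation for your version.

```python
def build_benchmark_commands(
    max_threads: int, dict_log2: int = 25, exe: str = "7z"
) -> list[list[str]]:
    """Return one 7-Zip benchmark command line per thread count, 1..max_threads.

    dict_log2=25 is assumed to mean a 2**25-byte (32MB) dictionary.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    return [
        [exe, "b", f"-mmt{threads}", f"-md{dict_log2}"]
        for threads in range(1, max_threads + 1)
    ]


# 64 threads matches the Threadripper 2990WX with all four dies enabled.
commands = build_benchmark_commands(max_threads=64)

for command in commands[:3]:
    print(" ".join(command))
```

Each list can be handed to `subprocess.run` as-is. Parsing the benchmark output is left out here because its layout differs between 7-Zip versions.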

In the integer-focused decompression component of 7-Zip, the performance was quite nice. Although we don't see perfect scaling, there's little difference in 7-Zip decompression performance as you switch off dies.

All of the tests were also completed using the GUI version of 7-Zip 18.05 with the default dictionary size of 32MB (although we did decide to recompile our own version, too).

[Image: Threadripper 2990WX 7-Zip 18.05 LZMA decompression performance per die. Credit: IDG]

There's no apparent change in the decompression performance by moving between one, two, or four dies on the 32-core Threadripper.

You're probably more interested in the Core i9 vs. Threadripper 2990WX, so we ran that, of course. For the most part, it's not bad for either part. Interestingly, Threadripper 2990WX seems to have that slight fall-off in decompression performance as you cross the threshold of 8 cores. Core i9 has a decent performance advantage up to about 16 cores, but after that it runs out of steam and ends up losing to the 32-core Threadripper 2990WX CPU.

[Image: Threadripper 2990WX vs. Core i9 7-Zip decompression performance. Credit: IDG]

The 7-Zip LZMA decompression is more sensitive to integer, branch prediction, and instruction latency. Although Core i9 has some advantage, it's clear that more cores are better in the end.

This shouldn't surprise too many, though. The CPU performance when you don't run out of memory bandwidth is a known quantity of the Threadripper 2990WX. You only have to look at our multi-threaded rendering tests to see how it's simply a monster.

The question is, what happens under memory bandwidth or memory latency tests? Here are the results of the Threadripper 2990WX in 7-Zip's compression test. It's not pretty, but the good news is switching dies off didn't seem to matter. As you can see, the CPU appears to hit a ceiling at 26 threads, and then it just gets worse from there.

[Image: Threadripper 2990WX 7-Zip 18.05 LZMA compression performance per die. Credit: IDG]

We ran the Threadripper 2990WX in single-die, dual-die and quad-die configuration to see if memory bandwidth issues would ease. 
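For readers who want to look for the same kind of ceiling in their own thread sweeps, here is a small Python sketch that finds the thread count with the best rating and checks whether the full-thread result falls below it. The ratings in it are made-up placeholders chosen only to exercise the code; they are not our 7-Zip results, and the function names are ours.

```python
def find_scaling_peak(ratings: dict[int, float]) -> tuple[int, float]:
    """Return (thread_count, rating) for the best rating in a sweep."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    best_threads = max(ratings, key=ratings.get)
    return best_threads, ratings[best_threads]


def scaling_efficiency(ratings: dict[int, float], threads: int) -> float:
    """Rating at `threads` relative to perfect scaling from the 1-thread rating."""
    return ratings[threads] / (threads * ratings[1])


# PLACEHOLDER data (thread count -> rating). Not measured results.
placeholder_ratings = {1: 100.0, 2: 190.0, 4: 350.0, 8: 600.0, 16: 580.0}

peak_threads, peak_rating = find_scaling_peak(placeholder_ratings)

# True when the highest thread count scores below the peak, i.e. adding
# threads past the peak made things worse.
regressed = placeholder_ratings[max(placeholder_ratings)] < peak_rating

print(f"peak at {peak_threads} threads, regressed afterwards: {regressed}")
print(f"efficiency at peak: {scaling_efficiency(placeholder_ratings, peak_threads):.0%}")
```

Swapping the placeholder dictionary for real per-thread ratings is all that is needed to apply it to an actual sweep.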

Perhaps worse is when you compare it to the Core i9-7980XE. Again—remember both of the CPUs were at a fixed clock speed of 3GHz and DDR4/3200.

[Image: Threadripper 2990WX vs. Core i9 7-Zip compression performance. Credit: IDG]

7-Zip's compression test is said to be sensitive to memory latency, cache, and out-of-order efficiency. Obviously, it doesn't do great on the 32-core Threadripper.

That's just not a good look for the 32-core Threadripper 2990WX and does seem to confirm that memory latency and bandwidth chores suffer greatly.

But can memory bandwidth also hurt Core i9? To find out, we switched the Core i9 system from quad-channel mode into single-channel mode. Unfortunately, for our test, we did have to lower total memory to 16GB rather than 32GB due to lack of density on modules. The good news is the 7-Zip test with the default dictionary fits fine, and we don't believe overall memory capacity was the issue. We can say that overall memory bandwidth as measured in Sandra 2018 was cut from 77GBps in quad-channel memory mode to 18.5GBps in single-channel mode on the Intel part. Per-core memory bandwidth went from 4.8GBps in quad-channel to 1GBps in single-channel mode.

[Image: Threadripper 2990WX vs. Core i9 (single-channel) 7-Zip compression performance. Credit: IDG]

Does cutting memory bandwidth on the Core i9-7980XE also kill its 7-Zip compression performance? Yup.
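For context on the Sandra numbers above, it can help to compare them with the theoretical peak of the memory configuration. The Python sketch below works that out for DDR4/3,200, using the standard figure of 8 bytes per transfer on a 64-bit channel. It assumes Sandra reports decimal gigabytes, which we have not verified, so treat the percentages as rough.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer


def peak_bandwidth_gbps(mega_transfers_per_s: int, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s (decimal GB) for a DDR configuration."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


# (channel count, Sandra 2018 measurement in GBps) for the Core i9, from the article.
measured = {"quad-channel": (4, 77.0), "single-channel": (1, 18.5)}

for mode, (channels, measured_gbps) in measured.items():
    peak = peak_bandwidth_gbps(3200, channels)
    share = measured_gbps / peak
    print(f"{mode}: measured {measured_gbps} GBps of {peak:.1f} GBps peak ({share:.0%})")
```

Under that assumption, both measurements land at roughly three-quarters of the theoretical peak, so the single-channel run looks like a genuine four-fold cut in available bandwidth rather than a measurement quirk.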

As you can see, the performance of Core i9-7980XE also suffers when its memory bandwidth is drastically cut. It doesn't suffer as much as the Threadripper 2990WX, but this doesn't appear to be the fault of some pro-Intel code at work.

Linux tests show how Windows 10 affects results

I'd normally say, okay, memory bandwidth and latency are the real issues, but there is that Linux thing. That is, in tests run by Michael Larabel at Linux-focused site Phoronix, the Threadripper 2990WX actually performs on a par with the Core i9-7980XE rather than heavily trailing it. Phoronix runs a slightly older version of 7-Zip, but it's clear that moving to Linux helps Threadripper 2990WX. A lot. Phoronix even tested it using Windows 10 Server.

[Image: Phoronix 7-Zip results. Credit: Phoronix]

Maybe it's not the Threadripper after all?

Phoronix's Linux test shows issues not just with 7-Zip, but also with several other tests where Windows 10 underperformed the Linux version. So it's clear Windows has an issue right now. But if you're in the crowd that dismisses memory bandwidth as a weakness altogether, I'm not so sure.

One set of Linux vs. Windows tests that would back up memory bandwidth and latency as issues comes from Steve Walton over at Techspot.com. Walton tested Windows and Linux performance using the latest 7-Zip version and found Core i9 still ahead despite having fewer cores. Greatly improved for Threadripper? Yes. But still clearly slower in a multi-threaded test that does scale to all available cores.

[Image: Techspot results. Credit: Techspot]

Techspot's Linux vs. Windows test still puts Threadripper behind the Core i9.

The compiler is another factor

In searching for more answers on Threadripper's 7-Zip performance, we wondered whether the compiler was at fault. If an outdated compiler was used to build the 7-Zip executable, it could certainly hurt the Threadripper's performance. To find out, we downloaded the source code for 7-Zip, the latest version of Microsoft's Visual Studio 2017, and compiled it into an executable.

We ended up with basically the same result, and it looks like the latest version of 7-Zip is actually on the latest available Visual C++ compiler. This doesn't completely dismiss compilers, as different compilers do matter. If, for example, the applications on Linux were compiled with the GCC or Intel compiler, it might explain the performance differences.

[Image: 7-Zip benchmark, recompiled build on right. Credit: IDG]

We recompiled the source code for 7-Zip 18.05 using the latest version of Visual Studio 2017 and found that, well, that's probably what 7-Zip was recently compiled with.

HandBrake test brings up more questions

While Windows 10 clearly, clearly has issues with the design of Threadripper, it would be wrong to say memory bandwidth and latency aren't in play.

To see just how much memory bandwidth helps or hurts both CPUs, we took VeraCrypt and ran it with the larger 1GB workload. As we saw with 7-Zip, the Core i9's VeraCrypt performance drops off a cliff in single-channel mode and is actually worse than the Threadripper's (albeit with the Threadripper on quad-channel memory), as you can see from the blue bars below.

The Threadripper 2990WX does suffer greatly with the 1GB workload. But if the issue is how Windows handles the memory configuration on the Threadripper, it should get better after shutting off two dies, right? It does, but as you can see in the green bars below, performance increases only slightly when limiting it to just 16 cores on two dies. The result is again confusing, because if Windows 10 is at fault for the poor performance of the shared memory controller design, why is the performance of the Threadripper 2990WX not as fast as the Core i9's? Remember: both CPUs are locked at 3GHz.

[Image: Threadripper 2990WX VeraCrypt 1GB results at 3GHz. Credit: IDG]

Cutting memory bandwidth just kills performance of the Core i9 (blue) but oddly the Threadripper's performance doesn't bump up when two of the dies are switched off.

Our last test used HandBrake 1.1.1 to encode a 4K video file using the 1080p Chromecast preset. Note: This HandBrake result is different from others we've run, so it can't be compared to previous results.

Video encoding is often associated with increased memory bandwidth. While it does matter, we can see it's not a big deal even when you go from 77GBps to 18GBps on the Core i9 on this particular preset.

The result of cutting the Threadripper's die use from four to two also isn't a big deal. It's actually slightly faster with two dies turned off, but almost within the margin for error in HandBrake encodes.

This leads us to believe that the only reason a 32-core Threadripper is slightly slower than an 18-core Core i9 in this particular HandBrake run is likely due to the vagaries of HandBrake itself, and how well it runs on each processor. We should also note that the app itself is multi-threaded, but doesn't scale with core counts.

[Image: Threadripper 2990WX vs. Core i9 HandBrake 4K Chromecast results. Credit: IDG]

Gutting memory bandwidth on the Core i9 didn't produce as drastic a change in performance as you'd expect, which tells you that video encoding isn't as dependent on memory bandwidth as you might think.

There's no easy answer

If you were hoping for an easy answer to your lingering Threadripper performance questions—take a number. Based on our tests, the answer is, it's complicated.

While we didn't do Linux testing, we've seen enough results run by others now to say that Windows 10 is handcuffing performance in certain applications (although the compiler used for those particular tests might share some blame, too).

We also believe that Threadripper 2990WX can be handcuffed by memory bandwidth and latency in some workloads. It just makes sense when you're talking about sharing quad-channel memory among 32 cores, versus sharing quad-channel memory among 18 cores.
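To put a rough number on that, the Python sketch below compares each core's share when the same memory budget is split evenly across 32 cores versus 18. It assumes both platforms expose the same total quad-channel bandwidth and that every core gets an equal slice, which real hardware does not guarantee, so read it as a back-of-the-envelope figure only.

```python
from fractions import Fraction


def per_core_share(cores: int) -> Fraction:
    """Fraction of a fixed memory-bandwidth budget each core gets if split evenly."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return Fraction(1, cores)


threadripper_share = per_core_share(32)  # Threadripper 2990WX
core_i9_share = per_core_share(18)       # Core i9-7980XE

# How a Threadripper core's slice compares with a Core i9 core's slice.
ratio = threadripper_share / core_i9_share

print(f"Threadripper core share relative to Core i9: {ratio} ({float(ratio):.2%})")
```

Under those assumptions, each Threadripper core has a little over half the memory bandwidth budget of a Core i9 core before any die-to-die hops are taken into account.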

In the end, we think you should still choose your high-performance CPU based on the task it'll do. Our results from our original review still basically apply. If you do thread-heavy tasks such as 3D rendering or modelling or tend to multi-task, having 32 cores and 64 threads in a Threadripper 2990WX ($1,749 on Amazon) will be unlike anything you've ever had before.

If, however, you tend to stick to workloads that aren't as heavily threaded, such as most video encoding chores, and you need higher clock speeds for lightly threaded applications that are also very dependent on memory bandwidth, the Core i9-7980XE ($2,000 on Amazon) might be the better choice for you.

[Image: Threadripper 2990WX Cinebench thread scaling percentage. Credit: IDG]

If your applications tend to use fewer threads and prefer higher clock speeds, you live on the left side of this chart, and Core i9 makes more sense. If, however, you need more cores, you live on the right side of this chart, and Threadripper is the better choice.

IDG News Service