Message boards :
Number crunching :
Skylake and ram scaling
I tested. I tested some more. Had a mug of tea, then went back for more testing. Then I made some charts and here are the results!
CPU: i7-6700k at 4.2 GHz, ring at 4.1 GHz, HT off
Mobo: MSI Gaming Pro, bios 1.7
GPU: 9500 GT (just to make sure no ram bandwidth is stolen by integrated graphics)
RAM: for the results presented I use two types
G.Skill F4-3333C16-4GRRD Ripjaws 4, 4x4GB kit
G.Skill F4-3200C16-8GVK Ripjaws V, 2x8GB kit
Testing was performed using the Prime95 28.7 built-in benchmark in Windows 7 64-bit. Each setting was run once, after the PC had been given time to settle down following a reboot. All test configurations had the ram in dual channel mode. Timing values listed are ordered CAS-RCD-RP-RAS, as commonly shown in most software.
Most testing was with all 4 modules of the Ripjaws 4 kit fitted, for reasons discussed later. This ram is known from previous experience not to boot in this mobo at 3333 with 4 modules fitted, so I tested at common ram speeds from 2133 to 3200. To avoid complicating matters with timings, these were fixed at 16-18-18-38 for the scaling tests, which may disadvantage the slower speeds since the values would typically be lower in practice. Latency is considered separately later.
As the clock increases we see no significant difference in performance. This is not ram limited.
There is a slight increase in performance here as ram clocks go up, but not much.
Now we are starting to see something happen.
And here we see a clear relation with speed and performance.
Here we alter the display a bit so we can compare ram timings. Three speeds are covered. Actually, two of these are not exciting. At 3200 the results for 15-16-16-36 and 16-18-18-38 are practically identical. At 2800, 14-16-16-36 and 16-18-18-38 gave a 1% average advantage to C14, but this is so small it is hard to say whether it is just measurement variation. It gets a little more interesting at 2133, where three timing sets were tested: 14-14-14-35, 15-15-15-35, and 16-18-18-38. The last one is on average 4% slower than the other two, which were identical. This may be an area for future research, although it seems ram speed is more important to performance; timings might get you a little more as a secondary optimisation.
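To put the timing numbers in perspective, the CAS figure on its own is misleading: what matters is the absolute latency in nanoseconds, which also depends on the transfer rate. A minimal sketch (the speed/CAS combinations are the ones tested above; this is the standard conversion, not something measured here):

```python
# Convert a CAS latency (in memory clock cycles) and a DDR transfer rate
# (in MT/s) into an absolute first-word latency in nanoseconds.
# DDR transfers twice per clock, so the actual memory clock is rate/2 MHz.
def cas_latency_ns(cas_cycles, rate_mts):
    clock_mhz = rate_mts / 2               # real memory clock in MHz
    return cas_cycles / clock_mhz * 1000   # cycles / (cycles per us) -> ns

# The combinations discussed above:
for rate, cas in [(3200, 16), (3200, 15), (2800, 14), (2133, 14), (2133, 16)]:
    print(f"DDR4-{rate} C{cas}: {cas_latency_ns(cas, rate):.1f} ns")
```

This shows why DDR4-3200 C16 (10 ns) still has lower absolute latency than DDR4-2133 C14 (about 13 ns), which fits the observation that speed matters more than timings.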
Putting the ram to one side, how does CPU speed affect performance? These 4 lines show the combinations of CPU at 3.5 and 4.2 GHz, with 1 or 4 workers active.
With one worker, the scaling is near perfect: the faster CPU is 19% faster, compared to 20% for ideal clock scaling.
With 4 workers, it would seem the ram is the limit: we only see a 4% increase for the 20% clock increase. This may present opportunities for power saving, as the higher clock doesn't help here. It would be interesting to see how scaling applies over a wider range of CPU speeds.
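The two cases can be expressed as a fraction of the ideal clock speed-up actually realised. A quick sketch using the measured ratios quoted above:

```python
# How much of an ideal clock speed-up was actually realised?
ideal = 4.2 / 3.5                  # ideal clock scaling: 1.20x
observed_1_worker = 1.19           # measured speed-up with 1 worker
observed_4_workers = 1.04          # measured speed-up with 4 workers

def scaling_efficiency(observed, ideal):
    # Fraction of the ideal speed-up that showed up in the results.
    return (observed - 1) / (ideal - 1)

print(f"1 worker:  {scaling_efficiency(observed_1_worker, ideal):.0%} of ideal")
print(f"4 workers: {scaling_efficiency(observed_4_workers, ideal):.0%} of ideal")
```

So with 4 workers only about a fifth of the clock increase translates into throughput, which is what you'd expect if ram bandwidth is the bottleneck.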
And finally, this is the cause of some unexpected behaviour I saw. I had two comparable systems, but I saw a massive performance difference between them which I struggled to explain. I tried various things and even wrongly blamed the mobo for being rubbish, but it would seem module rank has a major influence. This isn't commonly discussed or even specified. I found Thaiphoon Burner, free software that can read this. The Ripjaws 4 modules are single rank, and the Ripjaws V module is dual rank (caution: other parts in the series may vary!). General consensus seems to be that having more ranks can slightly increase bandwidth, at the cost of slightly higher latency.
This chart is going to take some explaining. The chart again shows the 4 worker throughput. The grey line is the Ripjaws V kit, and the light blue line is the Ripjaws 4 kit with 4 modules fitted, both at 3200. So each memory channel has a total of two ranks, and performance is so identical you can't see the light blue line under the grey line! So far so good? Let's take two of the Ripjaws 4 modules out, leaving it running in dual channel mode. Logically, this shouldn't make a difference: it is still 2 channels, running at the same clock and timings. Nope. We see a 19% drop in performance (orange line). This is massive! How massive? The yellow and blue lines are 4 modules running at 2666 and 2400 respectively, and they sit neatly either side of the orange line. That is quite a performance drop!
The tentative conclusion from this is that, it seems it is worth having the higher rank modules, or running more modules to do so, otherwise you will reduce your potential significantly. Unfortunately it doesn't seem that easy to find out what rank a module is before buying it.
Ideally more testing could be done to make sure it is the rank, and not something else. I'd need for example 8GB modules with single rank to make sure the module capacity isn't in some way influencing it. Or alternatively, 4GB modules with dual rank.
I have quite a lot of data from this testing, so if there are different ways the data could be cut, I could have a go at showing it.
I should add some extra words to discuss how this might relate to PrimeGrid tasks. I'm assuming that Prime95 performance is indicative of LLR, since they both use the same library, though there may be other differences.
The Prime95 benchmark starts at 1024k FFT, going up to 8192k.
A couple of weeks ago, Michael posted the FFT sizes for the projects so I will use that to compare. FFT sizes may vary for various reasons and generally increase with time as numbers get bigger. Where a project covers a range of sizes, I'll only list the largest.
So the benchmark FFT sizes are on the higher end. The lower end does overlap with the bigger projects at PrimeGrid. I'm currently running TRP to try and check this in practical terms, and will move to do PSP and SoB after Tour de Primes is over.
Note the clock scaling tests don't go small enough to cover the little units of interest here. If you run the smaller units, say SGS, PPSE, PPS, maybe Mega, you could see near perfect scaling with clock, as they don't have much ram demand. Only above that might you see ram limiting the scaling.
For indication only, the following are the genefer FFT sizes for various n:
To my understanding, genefer does NOT use the same library as LLR/Prime95 so the applicability of the results has not been established.
(Little joke: put this music on while dialing in OCs. I've heard it helps you go higher.)
You missed a quick summary:
0- Dual channel? I've heard it helps.....
1- Get high speed, high latency RAM. Speed is the most important thing, and higher latency sticks have a higher chance of having 2 ranks per module. And most likely those will be cheaper as well.
2- Run the more intensive CPU tasks and you won't need to OC your processor. Actually, do it because DDR4 compatible CPUs are the ones with the most compute power available, so you should help the subprojects that need it the most. Aka not SGS / PPSE.
3- Get Thaiphoon Burner, check your module rank, and post it in this thread for future reference. Mine is the Corsair Vengeance 3000MHz C15 (CMK16GX4M2B3000C15), and I'm super happy to announce that this is a 2 rank kit. Yay! Funny thing is, the model is reportedly the black one, but my sticks are red (even though there is no red variant on Corsair's website.... and I actually wanted black RAM).
Oh, and one more thing. Can you test running at 2.7 GHz (the speed of an i5 6400 under 4 core loads)? I wonder how viable a 6400 cruncher would be.....
I'm tested out for one day so I'm not going to revisit the CPU tests for now. I notice I forgot to say, the CPU test was done at 3000 ram since that is the known stable speed for my system. I'm not 100% confident on faster even if it seems fine for benchmarking.
This is very interesting stuff, many thanks for putting it all together.
I haven't tried pushing my memory yet, perhaps it's time I gave it a try...
I just tried to find a way to re-express the previous data to give better insight into ram requirements. I think I found it.
I took the 3200 speed ram result as the reference, as it was the fastest I ran at, and thus should be least limited. Strictly speaking, it would be nice to have even more bandwidth but that's not happening unless I get a quad channel ram system.
Anyway, I took the 1, 2, 3, 4 worker results for the tested ram speeds (2133 to 3200) and divided them by the reference results to give a scaling indication. That was then divided again by the number of workers. I then divided the resulting value into the calculated ram bandwidth at each speed. For indication, 3200 ram in dual channel mode should offer 50GB/s. I tested with 4 single rank modules, so this is the higher performing state with 2 ranks per channel. CPU was the i7-6700k at 4.2 GHz and cache at 4.1 GHz.
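For reference, the ~50GB/s figure comes straight from the bus width: each DDR4 channel is 64 bits (8 bytes) wide and transfers once per MT. A quick sketch covering the speeds tested:

```python
# Theoretical peak bandwidth of a DDR4 configuration.
# Each 64-bit channel moves 8 bytes per transfer; "speed" ratings are in MT/s.
def ddr4_bandwidth_gb_s(rate_mts, channels=2):
    return rate_mts * 8 * channels / 1000   # MT/s * bytes * channels -> GB/s

for rate in (2133, 2400, 2666, 2800, 3000, 3200):
    print(f"DDR4-{rate} dual channel: {ddr4_bandwidth_gb_s(rate):.1f} GB/s")
```

Strictly it is 51.2 GB/s for dual channel 3200, and real achievable bandwidth will be somewhat lower than the theoretical peak.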
As there was a lot of data, I tried to simplify it by only showing 1, 2, 4, 8M FFT sizes. They follow a similar trend, with minor variations throughout. I will leave that for another day, but the overall trend is clear enough. More bandwidth = faster up to a point of diminishing returns.
I should add this only applies to the Skylake at 4.2 GHz. Presumably a slower clocked CPU will have more ram bandwidth relative to the CPU speed, and shift the charts up a bit. I need to test this and will have to work in some lower clocked CPU results later.
I need to do more checking to see how well this fits in with past real data on scaling.
I had some data at 3.5 GHz as well as the 4.2 earlier. I had to simplify at this point, and since there wasn't that much variation between the FFT sizes I just used 1024k as the first number that falls out. When I normalised the horizontal axis to include CPU clock, the two sets overlaid on top of each other nicely. I probably should do more to be extra sure, but already I need to handle more dimensions in Excel than I know how to! So there's a lot of manual work and it is getting messy...
I also tried splitting out the number of workers to look at the discontinuities. I think there are two things going on there. As mentioned in the original testing, when I changed ram speed I kept the same numerical timings. This puts slower ram at a disadvantage, since the timings should tighten as speed drops. Secondly, it looks like where you have fast ram + more cores vs. slower ram + fewer cores at the same bandwidth ratio, fewer cores takes a slight edge. So a combination of these effects is contributing.
I got a rough indication from that of how to scale ram and CPU clocks. As it turned out, if the nominal ram speed (e.g. 3200) matches the CPU clock (for a quad core), then each worker runs at about 90% of the speed of the relatively unlimited 1-worker case. For 95%, knock the CPU clock down another 12% or so. Note this is more efficient, but you will still get less work done due to the lower clock, assuming ram is fixed. To put it another way, if you keep increasing core clock you will still do more work, but the efficiency drop reduces the gain you see. At some point they may practically cancel out and you are purely ram limited.
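The trade-off in that rule of thumb is easy to check with a couple of multiplications (the 90%/95%/12% figures are taken from the text above, not a precise model):

```python
# Trade-off sketch using the rule-of-thumb numbers above:
# at a CPU clock matching the ram rating, per-worker efficiency is ~90%;
# dropping the clock a further 12% raises efficiency to ~95%.
clock_a, eff_a = 1.00, 0.90          # baseline clock, ~90% efficient
clock_b, eff_b = 1.00 - 0.12, 0.95   # 12% lower clock, ~95% efficient

throughput_a = clock_a * eff_a       # relative total work done
throughput_b = clock_b * eff_b

print(f"baseline clock: {throughput_a:.3f}")   # 0.900
print(f"-12% clock:     {throughput_b:.3f}")   # 0.836
```

So the higher clock still gets more total work done (0.900 vs 0.836 relative throughput); the lower clock only wins on work done per cycle, which matters for power rather than output.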
Again this assumes you're running dual channel ram, with modules that are either 2 rank or have 2x single rank per channel. Otherwise downscale your performance expectations.
Coolermaster Hyper 212 Evo
Corsair TX750 PSU
MSI Gaming Pro + bios 1.7
2x4GB Corsair Vengeance DDR4 3000.
Yesterday I decided to do some power consumption tests. This is on the fussier of the MSI boards I have, on which I can overclock EITHER the CPU or the RAM, but not both at the same time otherwise it becomes highly unreliable for booting. I also know that depending on the task, you might want to balance it either way, but I never tested it.
I tested under 3 conditions:
CPU at 4.2 GHz. Ram at 2133. R9 280X.
CPU at 3.5 GHz. Ram at 3000. R9 280X.
CPU at 3.5 GHz. Ram at 3000. Intel GPU
The chart is for total system power. In short, for a small task the higher clock uses more power, but when ram limited it is about the same as the lower clock with faster ram. Taking out the R9 280X provides a good drop in power consumption, but what about the performance?
I should add that at 4.2 GHz, the CPU was set to 1.25V and this doesn't drop when idle, although it does still downclock to 800 MHz. At 3.5 GHz, I left it on bios auto, which picked 1.21V and did reduce voltage when idle, but that didn't seem to affect power much.
Excuse the typo, not sure what an i6 is!
Let's get the easy part out of the way, there is no performance difference between using the external GPU and integrated. So unless you need the GPU for actual use, you can ditch it.
As for the CPU vs. ram, as you'd expect the CPU dominates for small tasks, and the ram impacts bigger tasks. It is interesting seeing the transition between the two states. This started off as a simplified chart picking only FFT sizes in factors of 2 from each other. But we could normalise it and include more data points.
This accounts for FFT size, so if nothing impacted performance you'd expect to see a horizontal line. The L3 cache does impact performance. To the left is the CPU clock limited zone, and unsurprisingly 4.2 is faster than 3.5. To the right we're ram limited, and it is relatively flat; again, faster ram = more performance. The transition between them is interesting. With around 1.5MB of L3 per core, you'd expect to see performance drop off at that point. This is more visible on the blue line, which reaches that limit at 192k. With the slower CPU and faster ram, we still see that drop but it isn't as sharp.
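The 192k figure lines up with the cache if you assume one double-precision value (8 bytes) per FFT point; the real working set is a bit larger once scratch buffers are included, so treat this as a rough sketch:

```python
# Approximate working-set size of an FFT, assuming 8 bytes (one double)
# per point. Real Prime95 working sets are somewhat larger than this.
def fft_working_set_mb(fft_k):
    return fft_k * 1024 * 8 / (1024 * 1024)   # k points -> MiB

# The i7-6700k has 8MB of shared L3, so with 4 workers that is at most
# 2MB per worker, and less in practice.
for fft_k in (128, 192, 256, 1024):
    print(f"{fft_k}k FFT: {fft_working_set_mb(fft_k):.2f} MB")
```

A 192k FFT works out to exactly 1.5 MB per worker, right where the drop-off appears, while the 1024k-8192k benchmark sizes (8-64 MB) blow well past the L3 and are firmly in ram-limited territory.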
If I'm willing to say I'm not optimising for small work units, then running faster ram at a less aggressive CPU speed will gain me performance while not changing power use. Presumably there isn't a net reduction in power, as the faster ram uses a bit more and also allows the CPU to work harder even at the lower clock. Taking out the R9 280X, which I rarely run, will save quite a bit of power too. If I were to overclock both at the same time, I could arguably get the best of both worlds, but I'm not sure what the impact on power would be.