PrimeGrid

1) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131196)
Posted 1 day ago by rjs5
The basic operating state for gwnum is for a single core on a single task. Running multiple tasks, and/or multi-threaded tasks takes a little more consideration.

I think George has said in the past that the multi-threaded code isn't ideal: it breaks the work up into smaller pieces before re-assembling them. The practical result is that smaller tasks don't scale well when running multiple threads.

With normalised data from the Prime95 benchmark it is possible to see how different scenarios behave. In general, if the total footprint of all running tasks is less than the L3 size, you get good performance. This could be simplified as: FFT size * 8 bytes * number of tasks running < L3 cache size. If you exceed the L3 cache size, then RAM bandwidth enters into the equation. A RAM bandwidth shortage is the biggest limitation for bigger tasks. Dual channel is wholly inadequate for more than 4 fast cores (most Intel Core CPUs, Zen 2 Ryzen). Running a single task multi-threaded helps in this scenario.
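As a rough sanity check, the rule of thumb above (FFT size * 8 bytes * tasks < L3) can be written out in a few lines of Python. The cache size and task counts below are illustrative assumptions, not measurements:

```python
def fits_in_l3(fft_size_k, n_tasks, l3_mib):
    """Rule of thumb: the total working set of all tasks should fit in L3.

    fft_size_k: FFT length in K elements (e.g. 2880 for the largest SoB FFT)
    n_tasks:    number of LLR tasks running concurrently
    l3_mib:     L3 cache size in MiB
    """
    footprint_bytes = fft_size_k * 1024 * 8 * n_tasks
    return footprint_bytes <= l3_mib * 1024 * 1024

# Illustrative: four 720K-FFT tasks on a 32 MiB L3 (Zen 2-class part)
print(fits_in_l3(720, 4, 32))   # 720K * 8 B * 4 = 22.5 MiB -> True, fits
print(fits_in_l3(2880, 4, 32))  # 2880K * 8 B * 4 = 90 MiB  -> False
```

Once `fits_in_l3` returns False, the rule of thumb says RAM bandwidth, not cache, becomes the limiting factor.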

HT is a complicated matter. I've only occasionally seen hints that, in some limited scenarios, it can give an uplift in performance compared to not having/using it. In general it doesn't seem to give any significant boost, but it still increases power consumption. There are also scenarios where it can be used to lessen losses, e.g. the Windows scheduler has been poor in the past (I don't know whether that has changed since). Running multiple single-threaded tasks, one per core, without affinity could lower throughput by ~10%. Setting affinity resolves that; so does turning off HT, or running more threads than cores (at higher power usage).
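The affinity workaround mentioned above can be sketched with Python's `os.sched_setaffinity` (Linux-only; pinning to core 0 is an arbitrary choice for illustration):

```python
import os

# Pin the current process (pid 0 = self) to a single core so the OS
# scheduler stops migrating it between cores. Linux-only API.
os.sched_setaffinity(0, {0})

# The process is now restricted to core 0.
print(os.sched_getaffinity(0))  # -> {0}
```

In a real multi-task setup you would pin each LLR task to its own distinct core, which is the same effect BOINC or `taskset` affinity settings achieve.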


I think we are all 100% in agreement.

I understand the task that George tackled and solved. I understand completely what he was saying about limitations and why he was saying it. I also understand that running an empirical test is the easiest way to get a "reasonable best" performance.

I am just making one additional point: I think prefetching is aggravating the "ram bandwidth" problem, not helping, for multi-threaded tasks or multiple WUs. I think the "ram bandwidth" problem is triggered, and aggravated, earlier than "L3 cache full" by gwnum prefetching data into L1 that is already in the L2 and L3 caches and not in DRAM. I am seeing a huge spike in FILL BUFFER NOT AVAILABLE events and very few last-level cache (LLC) evictions.

If the data is already in the cache hierarchy, the L1 prefetch instruction still consumes a line fill buffer, which is the read path from DRAM. All T0 and T1 prefetches would cause this problem. These instructions insert dead spots in the data pipe from memory and starve the caches every time they prefetch already-cached data. Other threads or WUs that really need data from DRAM will stall while the unneeded prefetches complete.

I have been unable to use Linux performance tools to look at the run-time performance of sllr64 because of the way it is built. All the Linux tools print garbage for the sllr64 code and give up on run-time disassembly.

I could do more analysis if I could build sllr64 (especially on Linux), but I haven't had much success.




2) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131162)
Posted 2 days ago by rjs5
Hi,

There seems to be a bit of disagreement about how to calculate the approximate L3 footprint of the various PG WUs. The difference between multiplying by 8 or by 24 is pretty large! Is there a reason why 8x seems to work in practical terms when 24x appears to be the theoretical answer?


There is a lot of empirical data showing that running the CPU at 50% load (no hyper-threading) yields the "best" or "near best" performance. I am not sure how long this has been the assumption, but I question how often the bottleneck is actually the cache size.

On the i9-9980XE I am running, it looks more like the Memory Fill Buffer limit is the problem. Fill-buffer-unavailable event counts spike at 50% CPU load while the number of cache-line evictions stays flat. That implies there is room in the cache (no evictions), but the memory read/write traffic is too high.

The information available about when a software prefetch consumes a Fill Buffer and/or generates a CPU stall is confusing to me, and I think it changes between CPU generations. I think Intel was even talking about changing the CPU behavior on Skylake so that a software prefetch is converted into a NOP if there are no Fill Buffers available.

It looks more like gwnum is over-tuned for a single WU, which causes multiple WUs to choke the bus, stalling the other WUs.

I wish it were easier to build the Linux64 gwnum.a from source; then I could do some testing/analysis.



3) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131019)
Posted 8 days ago by rjs5
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?


Those are the FFT sizes (in K, as in 2K == 2048) that LLR chooses for its calculation. That's the number of elements in each array.

Each element is a complex (i.e., real + imaginary) number consisting of 2 64-bit double precision floating point numbers. That's 8 bytes for each number. I believe you need space to store 3 copies of that so you can do a multiply operation such as C = A * B, so the total memory usage should be 24 times the FFT size.

So, for example, for the largest SoB FFT of 2880K, the memory usage for the FFT calculations is 24 * 2880 * 1024 or 70778880. That's 67.5 MB.

For the smallest PPSE FFT of 120K, that's 24 * 120 * 1024 or 2.8 MB.

There's other memory used as well, but it's insignificant. What's important is that the FFT storage fits in cache, since cache is a LOT faster than main memory. If the entire FFT fits in cache the task can run a lot faster.
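The arithmetic in the quoted reply is easy to reproduce; a minimal Python sketch of the 24x footprint rule described above:

```python
def fft_footprint_bytes(fft_size_k):
    """Total FFT working set, assuming 3 copies of 8-byte values (24x rule)."""
    return 24 * fft_size_k * 1024

# Largest SoB FFT (2880K) and smallest PPSE FFT (120K), as in the reply
print(fft_footprint_bytes(2880))          # 70778880 bytes
print(fft_footprint_bytes(2880) / 2**20)  # 67.5 MiB
print(fft_footprint_bytes(120) / 2**20)   # 2.8125 MiB (~2.8 MB)
```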


So the software prefetching is only bringing the 3 copies for the "C = A * B" into the caches?
I thought that the code might also be prefetching the next 3 copies in to prepare for the next iteration.

thanks





4) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131017)
Posted 8 days ago by rjs5
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?


I think there was a post in the past that had FFT sizes for each project, but I can't find it again. Anyone know where it was? It may also be out of date, so is there some way I can find current values short of looking at random units?

Why? Ryzen 3000 (Zen 2) was launched yesterday, and comes with at least 32MB of L3 cache. I expect to get my sample on Tuesday and bench it. Based on what I currently understand of it, it has potential to be the best choice for LLR use in terms of a balance of power consumption, compute performance, and pricing. FFT size will allow a more educated guess into the optimal running configuration, which I can then test with benchmarks.


Current values:

+-------+----------------------------------------+------+------+
| appid | user_friendly_name                     | min  | max  |
+-------+----------------------------------------+------+------+
|     2 | Sophie Germain (LLR)                   |  128 |  128 |
|     3 | Woodall (LLR)                          | 1440 | 1920 |
|     4 | Cullen (LLR)                           | 1536 | 1920 |
|     7 | 321 (LLR)                              |  800 |  800 |
|     8 | Prime Sierpinski Problem (LLR)         | 1920 | 2048 |
|    10 | PPS (LLR)                              |  192 |  192 |
|    13 | Seventeen or Bust                      | 2560 | 2880 |
|    15 | The Riesel Problem (LLR)               |  720 | 1008 |
|    18 | PPSE (LLR)                             |  120 |  120 |
|    19 | Sierpinski/Riesel Base 5 Problem (LLR) |  560 |  720 |
|    20 | Extended Sierpinski Problem            | 1280 | 1280 |
|    21 | PPS-Mega (LLR)                         |  200 |  256 |
|    30 | Generalized Cullen/Woodall (LLR)       | 1440 | 1792 |
+-------+----------------------------------------+------+------+


The SoB line includes post-DC n=31M candidates.
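As a sketch only: combining the table above with the 24x-footprint estimate quoted elsewhere in this thread shows which projects' largest FFTs would fit in a 32 MiB L3 (the Zen 2 figure mentioned above). The dictionary simply transcribes the table's max column:

```python
# Max FFT size (in K) per project, transcribed from the table above
max_fft_k = {
    "Sophie Germain (LLR)": 128,
    "Woodall (LLR)": 1920,
    "Cullen (LLR)": 1920,
    "321 (LLR)": 800,
    "Prime Sierpinski Problem (LLR)": 2048,
    "PPS (LLR)": 192,
    "Seventeen or Bust": 2880,
    "The Riesel Problem (LLR)": 1008,
    "PPSE (LLR)": 120,
    "Sierpinski/Riesel Base 5 Problem (LLR)": 720,
    "Extended Sierpinski Problem": 1280,
    "PPS-Mega (LLR)": 256,
    "Generalized Cullen/Woodall (LLR)": 1792,
}

L3_MIB = 32  # Zen 2 desktop parts come with at least 32 MiB of L3

for name, k in sorted(max_fft_k.items(), key=lambda kv: kv[1]):
    mib = 24 * k * 1024 / 2**20  # 24x footprint rule from the quoted reply
    verdict = "fits" if mib <= L3_MIB else "exceeds"
    print(f"{name:40s} {mib:6.1f} MiB  {verdict} {L3_MIB} MiB L3")
```

Whether 24x or 8x is the right multiplier is exactly the open question in this thread, so treat the output as an upper-bound estimate.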

5) Message boards : Sieving : Reservation Limits (Message 128374)
Posted 113 days ago by rjs5
How long does it usually take to get ADMIN APPROVAL?
6) Message boards : Sieving : RTX 2080 Sieving Performance (Message 128368)
Posted 113 days ago by rjs5
Quite nice, thank you for sharing. Is that with or without the w1 parameter?

You might be able to get a couple dozen more P/day out of it on Windows if you're running two at the same time. At least that's how it is for my non-Ti 2080 on Windows 10, with one task there are some load fluctuations even at B13, two solve that and improve throughput a bit.


Those are all without the W1 parameter set.

7) Message boards : Sieving : RTX 2080 Sieving Performance (Message 128360)
Posted 114 days ago by rjs5
Fedora 29 gfnsvocl_linux_x86_64 on an Nvidia 2080 Ti Founders Edition
Intel Core i9-9980XE running at 3.8 GHz

default 10688.0/s (185.3P/day)
b8 20978.0/s (363.7P/day)
b9 41317.8/s (716.4P/day)
b10 53742.1/s (931.8P/day)
b11 56832.2/s (985.4P/day)
b12 58204.9/s (1009.2P/day)
b13 59421.0/s (1030.3P/day)


I downloaded the default chunk of gfn22 and ran at each block size. No problems using the machine.

gfnsvocl_w64_2G.exe on an Nvidia 2080 Ti Founders Edition Windows 64 Pro
Intel Core i9 7920X @ 2.90GHz base frequency
Skylake-X 14nm Technology
Only BIOS change was to limit CPU temp to 70 degrees
Standard turbo boost running at 3.8 GHz.

default 10048.0/s (174.2P/day)
b8 20010.7/s (347.0P/day)
b9 38741.3/s (671.7P/day)
b10 50858.7/s (881.8P/day)
b11 55296.0/s (958.8P/day)
b12 57433.2/s (995.8P/day)
b13 58397.9/s (1012.6P/day)


I can confirm what Honza said here - http://www.primegrid.com/forum_thread.php?id=8250&nowrap=true#121338

This is about RTX 2080 but anyway for those wondering how it is doing.

gfnsvocl_w64_2G.exe with default settings on GFN22 - about 177P/day.
B8 - 350P/day
B9 - 495P/day
B10 - 680P/day
B11 - 720P/day
B12 - 730P/day
B13 - 765P/day, not nice to work with.




The 2080 is seriously fast at sieving!

My AMD 8350 CPU is a bottleneck on GFN21 sieving. I had to run two tasks, but it peaks out at around 760 P/day.

Each of the two tasks is using ~8% of my total CPU. With one task it was maxed at 12.5% and only sieving at 670 P/day.


In comparison, a 1070 was sieving at 156 P/day.

Both using B13.

That makes the RTX 2080 GPU 4.87 times faster than the GTX 1070 at manual sieving.


According to GPU-Z the GPU usage is at 99-100%, temps are around 76 C with fans going at 93% using a manually set fan curve. It's using a max of 260 watts but fluctuates around 250-258 watts. I have the power limit maxed out on my card through the Afterburner software. Core clock is at 1935 MHz and memory speed is at 1775 MHz.


The RTX 2080's performance per watt is about 3x that of the GTX 1070.



To double check the sieving speeds I also ran a GFN22 sieve. GFN22 sieving came back at 768 P/day occasionally going up to 772 P/day. That is 32 P/hour. This was using 7.5% of the total CPU for just 1 task.

Impressive.

For reference the card I am using is a GIGABYTE GeForce RTX 2080 GAMING OC 8G Video Card, GV-N2080GAMING OC-8GC.


8) Message boards : Sieving : RTX 2080 Sieving Performance (Message 128359)
Posted 114 days ago by rjs5

9) Message boards : Number crunching : Out of curiousity (Message 128338)
Posted 115 days ago by rjs5
Right now I'm running AP27 and checking the times on tasks I double check. I'm doing pretty well vs the 2080, but the 2080 Ti absolutely demolishes me (457.53 seconds for the RTX, 772.69 for me :( ).



You can buy the 2080 Ti Founders Edition directly from Nvidia for $1199. They have been consistently available.

https://www.nvidia.com/en-us/shop/geforce/?page=1&limit=9&locale=en-us


10) Message boards : Sieving : RTX 2080 Sieving Performance (Message 128337)
Posted 115 days ago by rjs5
Do we have any proud owners of a 2080 Ti here willing to post some P/day numbers?


I could do it, but would need help on how to do it.


Copyright © 2005 - 2019 Rytis Slatkevičius (contact) and PrimeGrid community.
Generated 19 Jul 2019 | 15:03:32 UTC