PrimeGrid
drummers-lowrise
1) Message boards : Number crunching : Better multi-threading (Message 132530)
Posted 41 days ago by rjs5
I am running 5 copies of llrCUL with 4 threads each on my 18-core/36-thread Intel 9980XE. I have 6 tasks set in the "Job Control and Multi-threading" settings and would like to ensure that 1 of the WU is allocated to the RTX 2080 Ti GPU.

PrimeGrid keeps allocating a 6th llrCUL WU leaving the GPU idle for the hours that it takes to finish the llrCUL.

Max # of simultaneous PrimeGrid tasks 6
Multi-threading: Max # of threads for each task 4

What knobs would I set on my 36 thread machine to consistently use 21 threads?
5 WU x -t 4 of llrCUL
1 WU x GFN
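
As a quick sanity check on the numbers above, here is a minimal Python sketch of the thread budget. The variable names are mine, and it assumes the GFN task needs one CPU thread to feed the GPU, which is how the question counts it.

```python
# Thread budget for the setup described above (names are illustrative).
logical_cpus = 36          # i9-9980XE: 18 cores / 36 threads
llr_tasks = 5              # llrCUL work units
threads_per_llr_task = 4   # "-t 4"
gfn_cpu_threads = 1        # assumption: one CPU thread feeds the GPU task

threads_wanted = llr_tasks * threads_per_llr_task + gfn_cpu_threads
cpu_share = threads_wanted / logical_cpus

print(threads_wanted)       # 21
print(f"{cpu_share:.1%}")   # 58.3%
```

So the target is 21 of 36 logical CPUs, a little under 60%. This only restates the goal; it does not say which PrimeGrid or BOINC setting enforces the 5 + 1 split.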

2) Message boards : Number crunching : Intel Vtune: Download a free copy of Intel Vtune backed by community forum support (Message 132500)
Posted 44 days ago by rjs5
I just noticed that Intel has made its Vtune Amplifier profiler available free with forum support. There is a Linux and Windows version. I have used it before to perform some system-wide profiling and disassembly.

It is pretty easy to see where the bottleneck locations are in a running program. It shows that gwnum is too aggressive with its software prefetching and limits computation with redundant memory read operations when running multiple WU.

I don't know what it does when confronted with an AMD CPU.

https://software.intel.com/en-us/vtune/choose-download
3) Message boards : Number crunching : Multithreading? (Message 131803)
Posted 68 days ago by rjs5
Could anyone using multithreading give me some suggestions on how to apply it to this computer?

GenuineIntel
Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz [Family 6 Model 85 Stepping 4]
(36 processors) [3] NVIDIA GeForce RTX 2080 Ti (4095MB) driver: 43040

Thank you


I am running the same machine, but with just one 2080 Ti Founders Edition, Fedora and the best cooler I could buy.

I set the BIOS CPU temperature limit to 75 degrees.
Nvidia vents the hot air out the TOP of the 2080 Ti FE instead of the REAR like previous designs. Be aware of where the 2080 Ti boards dispose of their heat. The 2080 Ti runs at 80 degrees centigrade on most of the PG GPU WU.

I set BOINC to run five llr321 WU with 4-threads. The "top" command indicates each WU is using about 370% CPU or somewhat less than 4 since there is some overhead in synchronizing the threads. The instructions per cycle have dropped from 1.8 to 1.3, so I am pretty close to a "good" number. I did this by setting the CPU PREFERENCES to 60%.

Once somewhere around 40% of the CPUs are in use, the SW prefetch instructions exhaust the MEMORY FILL BUFFERS and any additional WU will simply stall waiting for a fill buffer. I have never bothered turning off HT and am not sure how much difference that will make across all the CPU, system and OS variations.

The clip from Intel Vtune Amplifier instrumentation shows where in the code LLR is stalling. You can see that it is spending its life on the prefetcht0z instructions, waiting for a fill buffer. My problem with the SW prefetch instructions is ... they are easy to add to code, but everyone then assumes that they still make sense and leaves them in.


0x19b2325  0  vaddpd %zmm18, %zmm16, %zmm16    1.0   15,000,000
0x19b232b  0  prefetcht0z (%r12)               1.5   45,000,000
0x19b2330  0  vsubpd %zmm19, %zmm17, %zmm18   27.5  735,000,000
0x19b2336  0  vaddpd %zmm19, %zmm17, %zmm17
0x19b233c  0  prefetcht0z 0x40(%r12)
0x19b2342  0  vsubpd %zmm23, %zmm21, %zmm19   29.5  675,000,000
0x19b2348  0  vaddpd %zmm23, %zmm21, %zmm21    1.0   45,000,000
0x19b234e  0  prefetcht0z (%r12,%r8,1)         1.0            -
0x19b2353  0  vsubpd %zmm20, %zmm22, %zmm23   23.5  300,000,000
0x19b2359  0  vaddpd %zmm20, %zmm22, %zmm22    1.0   30,000,000
0x19b235f  0  prefetcht0z 0x40(%r12,%r8,1)     1.0   15,000,000
0x19b2365  0  vaddpd %zmm13, %zmm24, %zmm20   23.5  375,000,000
0x19b236b  0  vsubpd %zmm13, %zmm24, %zmm24    0.5   45,000,000
0x19b2371  0  prefetcht0z (%r12,%r8,2)         0.5   15,000,000
0x19b2376  0  vaddpd %zmm14, %zmm15, %zmm13   24.5  225,000,000
0x19b237c  0  vsubpd %zmm14, %zmm15, %zmm15    1.0   15,000,000
0x19b2382  0  prefetcht0z 0x40(%r12,%r8,2)     0.5            -
0x19b2388  0  vsubpd %zmm7, %zmm6, %zmm14     20.0  495,000,000
0x19b238e  0  vaddpd %zmm7, %zmm6, %zmm6       1.0            -
0x19b2394  0  prefetcht0z (%r12,%r10,1)        0.5   30,000,000
0x19b2399  0  vsubpd %zmm4, %zmm18, %zmm7     21.0  225,000,000
0x19b239f  0  vaddpd %zmm18, %zmm4, %zmm4
0x19b23a5  0  prefetcht0z 0x40(%r12,%r10,1)    0.5   60,000,000
0x19b23ab  0  vsubpd %zmm19, %zmm23, %zmm18    26.  210,000,000
0x19b23b1  0  vaddpd %zmm23, %zmm19, %zmm19    0.5   45,000,000
0x19b23b7  0  prefetcht0z (%r14)               0.0   45,000,000
0x19b23bb  0  vaddpd %zmm5, %zmm25, %zmm23    21.5  210,000,000
0x19b23c1  0  vsubpd %zmm5, %zmm25, %zmm25     0.5   60,000,000








4) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131196)
Posted 88 days ago by rjs5
The basic operating state for gwnum is for a single core on a single task. Running multiple tasks, and/or multi-threaded tasks takes a little more consideration.

I think George had said in the past the multi-thread code isn't the best, and breaks up the work into smaller bits before re-assembling them. So the practical result of that is, smaller tasks don't scale well running multiple threads.

With normalised data from the Prime95 benchmark it is possible to see how different scenarios behave. In general, if the total footprint of all running tasks is less than the L3 size, you get good performance. This could be simplified as FFT size * 8 * number of tasks running < L3 cache size. If you exceed the L3 cache size, then RAM bandwidth enters into the equation. RAM bandwidth shortage is the biggest limitation for bigger tasks. Dual channel is wholly inadequate for more than 4 fast cores (most Intel Core CPUs, Zen 2 Ryzen). Running a single task multi-threaded helps in this scenario.
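
The rule of thumb in the paragraph above can be written as a small Python sketch. The function names are hypothetical. It assumes FFT sizes are given in K (1 K = 1024 elements) and uses the 8 bytes per element from the rule. The 32 MiB L3 and the 800K FFT in the example are just illustrative inputs.

```python
def fits_in_l3(fft_size_k, tasks, l3_bytes, bytes_per_element=8):
    """Rule of thumb: FFT size * 8 * number of tasks < L3 cache size."""
    footprint = fft_size_k * 1024 * bytes_per_element * tasks
    return footprint < l3_bytes


def max_tasks_in_l3(fft_size_k, l3_bytes, bytes_per_element=8):
    """Largest task count whose combined footprint stays strictly below L3."""
    per_task = fft_size_k * 1024 * bytes_per_element
    return (l3_bytes - 1) // per_task


l3 = 32 * 2**20                      # illustrative 32 MiB L3 cache
print(fits_in_l3(800, 5, l3))        # True
print(fits_in_l3(800, 6, l3))        # False
print(max_tasks_in_l3(800, l3))      # 5
```

This only encodes the cache-size estimate. It says nothing about RAM bandwidth or fill buffer pressure, which can become the limit earlier.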

HT is a complicated matter. I've only occasionally seen hints that, in some limited scenarios, it can give an uplift in performance compared to not having/using it. In general it doesn't seem to give any significant boost but still increases power consumption. There are also scenarios where it can be used to lessen losses, e.g. the Windows scheduler has sucked in the past; I don't know if it has changed since then. Running multiple single-thread tasks, one per core, without affinity could lower throughput by ~10%. Affinity resolves that, turning off HT resolves that, or running more threads than cores resolves that (at higher power usage).


I think we are all 100% in agreement.

I understand the task that George tackled and solved. I understand completely what he was saying about limitations and why he was saying it. I also understand that running an empirical test is the easiest way to get a "reasonable best" performance.

I am just making one additional point that I think prefetching is aggravating the "ram bandwidth" problem ... not helping for threads or multiple WU. I think the "ram bandwidth" problem is triggered and aggravated earlier than "L3 cache full" by gwnum prefetching of data into L1 that is already in the L2 and L3 caches and not in DRAM. I am getting a huge spike in FILL BUFFER NOT AVAILABLE events and very few last level cache (LLC) evictions.

If the data is already in the cache hierarchy, the L1 prefetch instruction consumes a line fill buffer which is the read path from DRAM. All T0 and T1 prefetches would cause this problem. These instructions will insert dead spots in the data pipe from memory and starve the caches every time they prefetch cached data. Other threads or WU really needing data from DRAM will be stalled while the unneeded prefetches are completed.

I have been unable to use Linux performance tools to look at the run-time performance of sllr64 because of the way it is built. All the Linux tools print garbage for the sllr64 code and give up on run-time disassembly.

I could do more analysis if I could build sllr64 (especially on Linux), but I haven't had much success.




5) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131162)
Posted 89 days ago by rjs5
Hi,

There seems to be a bit of disagreement in regards to calculating the approximate L3 footprint of the various PG WUs. The difference between multiplying by 8 or 24 is pretty large! Is there a reason why 8x seems to work in practical terms when 24x appears to be the theoretical answer?


There is a lot of empirical data showing that running the CPU at 50% (no hyper-threading) yields the "best" or "near best" performance. I am not sure how long this has been the assumption, and I am not sure how frequently the bottleneck is actually the cache size.

On the i9-9980XE I am running, it appears more like the Memory Fill Buffer limit is the problem. Fill Buffer unavailable event counts spike at 50% CPU load and the number of cache line evictions stays flat. That implies that there is room in the cache (no evictions), but memory Read/Write traffic is too high.

The information available about when a software prefetch consumes a Fill Buffer and/or generates a CPU stall is confusing to me, and I think it changes between CPUs. I think Intel was even talking about changing the CPU behavior on Skylake so that the software prefetch was converted into a NOP if there were no Fill Buffers available.

It appears more like gwnum is over-tuned for one WU which causes multiple WU to choke the bus ... stalling other WU.

I wish it was easier to build the Linux64 gwnum.a from source, then I could do some testing/analysis.



6) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131019)
Posted 96 days ago by rjs5
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?


Those are the FFT sizes (in K, as in 2K == 2048) that LLR chooses for its calculation. That's the number of elements in each array.

Each element is a complex (i.e., real + imaginary) number consisting of 2 64-bit double precision floating point numbers. That's 8 bytes for each number. I believe you need space to store 3 copies of that so you can do a multiply operation such as C = A * B, so the total memory usage should be 24 times the FFT size.

So, for example, for the largest SoB FFT of 2880K, the memory usage for the FFT calculations is 24 * 2880 * 1024 or 70778880. That's 67.5 MB.

For the smallest PPSE FFT of 120K, that's 24 * 120 * 1024 or 2.8 MB.

There's other memory used as well, but it's insignificant. What's important is that the FFT storage fits in cache, since cache is a LOT faster than main memory. If the entire FFT fits in cache the task can run a lot faster.
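
The arithmetic in the explanation above can be reproduced with a short Python sketch. The function name and defaults are mine. It simply applies the 3 copies * 8 bytes * FFT size (in K) estimate from the text and reports the result in bytes and in MB (taking 1 MB as 2**20 bytes, which matches the figures quoted).

```python
def fft_memory_bytes(fft_size_k, copies=3, bytes_per_number=8):
    """Estimated FFT working set: copies * bytes per number * elements."""
    return copies * bytes_per_number * fft_size_k * 1024


sob = fft_memory_bytes(2880)    # largest SoB FFT
ppse = fft_memory_bytes(120)    # smallest PPSE FFT

print(sob, sob / 2**20)               # 70778880 67.5
print(ppse, round(ppse / 2**20, 1))   # 2949120 2.8
```

Setting copies=1 gives the 8x figure used in the other rule of thumb, so the same helper covers both sides of the 8x versus 24x question.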


So the software prefetching is only bringing the 3 copies for the "C = A * B" into the caches?
I thought that the code might also be prefetching the next 3 copies in to prepare for the next iteration.

thanks





7) Message boards : Number crunching : Max FFT size for each LLR project? (Message 131017)
Posted 96 days ago by rjs5
What are the units of the FFT sizes in the table?
Do those sizes represent all the data needed for the FFT plus the additional software prefetched data?


I think there was a post in the past that had FFT sizes for each project, but I can't find it again. Anyone know where it was? It is possible it would be out of date also, so is there some way I can find current values short of looking at random units?

Why? Ryzen 3000 (Zen 2) was launched yesterday, and comes with at least 32MB of L3 cache. I expect to get my sample on Tuesday and bench it. Based on what I currently understand of it, it has potential to be the best choice for LLR use in terms of a balance of power consumption, compute performance, and pricing. FFT size will allow a more educated guess into the optimal running configuration, which I can then test with benchmarks.


Current values:

+-------+----------------------------------------+------+------+
| appid | user_friendly_name                     | min  | max  |
+-------+----------------------------------------+------+------+
|     2 | Sophie Germain (LLR)                   |  128 |  128 |
|     3 | Woodall (LLR)                          | 1440 | 1920 |
|     4 | Cullen (LLR)                           | 1536 | 1920 |
|     7 | 321 (LLR)                              |  800 |  800 |
|     8 | Prime Sierpinski Problem (LLR)         | 1920 | 2048 |
|    10 | PPS (LLR)                              |  192 |  192 |
|    13 | Seventeen or Bust                      | 2560 | 2880 |
|    15 | The Riesel Problem (LLR)               |  720 | 1008 |
|    18 | PPSE (LLR)                             |  120 |  120 |
|    19 | Sierpinski/Riesel Base 5 Problem (LLR) |  560 |  720 |
|    20 | Extended Sierpinski Problem            | 1280 | 1280 |
|    21 | PPS-Mega (LLR)                         |  200 |  256 |
|    30 | Generalized Cullen/Woodall (LLR)       | 1440 | 1792 |
+-------+----------------------------------------+------+------+


The SoB line includes post-DC n=31M candidates.

8) Message boards : Sieving : Reservation Limits (Message 128374)
Posted 200 days ago by rjs5
How long does it usually take to get ADMIN APPROVAL?
9) Message boards : Sieving : RTX 2080 Sieving Performance (Message 128368)
Posted 201 days ago by rjs5
Quite nice, thank you for sharing. Is that with or without the w1 parameter?

You might be able to get a couple dozen more P/day out of it on Windows if you're running two at the same time. At least that's how it is for my non-Ti 2080 on Windows 10, with one task there are some load fluctuations even at B13, two solve that and improve throughput a bit.


Those are all without the W1 parameter set.

10) Message boards : Sieving : RTX 2080 Sieving Performance (Message 128360)
Posted 201 days ago by rjs5
Fedora 29 gfnsvocl_linux_x86_64 on an Nvidia 2080 Ti Founders Edition
Intel Core i9 9980XE running at 3.8 GHz

default 10688.0/s (185.3P/day)
b8 20978.0/s (363.7P/day)
b9 41317.8/s (716.4P/day)
b10 53742.1/s (931.8P/day)
b11 56832.2/s (985.4P/day)
b12 58204.9/s (1009.2P/day)
b13 59421.0/s (1030.3P/day)


I downloaded the default chunk of gfn22 and ran at each block size. No problems using the machine.

gfnsvocl_w64_2G.exe on an Nvidia 2080 Ti Founders Edition Windows 64 Pro
Intel Core i9 7920X @ 2.90GHz base frequency
Skylake-X 14nm Technology
Only BIOS change was to limit CPU temp to 70 degrees
Standard turbo boost running at 3.8 GHz.

default 10048.0/s (174.2P/day)
b8 20010.7/s (347.0P/day)
b9 38741.3/s (671.7P/day)
b10 50858.7/s (881.8P/day)
b11 55296.0/s (958.8P/day)
b12 57433.2/s (995.8P/day)
b13 58397.9/s (1012.6P/day)
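
To make the two result lists above easier to compare, here is a small Python sketch that computes, for each block size, the scaling over the default setting on the Fedora run and the gap between the Fedora and Windows runs. The rates are copied from the lists above and the dictionary and variable names are mine. The two runs used different host CPUs (9980XE and 7920X), so the gap is not a clean OS comparison.

```python
# Sieving rates in candidates per second, copied from the two runs above.
linux = {"default": 10688.0, "b8": 20978.0, "b9": 41317.8, "b10": 53742.1,
         "b11": 56832.2, "b12": 58204.9, "b13": 59421.0}
windows = {"default": 10048.0, "b8": 20010.7, "b9": 38741.3, "b10": 50858.7,
           "b11": 55296.0, "b12": 57433.2, "b13": 58397.9}

# Throughput of each block size relative to the default on the Fedora run.
scaling = {k: rate / linux["default"] for k, rate in linux.items()}

# Percentage by which the Fedora run is ahead of the Windows run.
gap_pct = {k: (linux[k] / windows[k] - 1.0) * 100.0 for k in linux}

for k in linux:
    print(f"{k:8s} {scaling[k]:5.2f}x over default, "
          f"Fedora ahead by {gap_pct[k]:4.1f}%")
```

On these numbers the gain from larger block sizes flattens out after b10, and the Fedora run is ahead at every setting.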


I can confirm what Honza said here - http://www.primegrid.com/forum_thread.php?id=8250&nowrap=true#121338

This is about RTX 2080 but anyway for those wondering how it is doing.

gfnsvocl_w64_2G.exe with default settings on GFN22 - about 177P/day.
B8 - 350P/day
B9 - 495P/day
B10 - 680P/day
B11 - 720P/day
B12 - 730P/day
B13 - 765P/day, not nice to work with.




The 2080 is seriously fast at sieving!

My AMD 8350 CPU is a bottleneck on GFN21 sieving. I had to run two tasks, but it peaks out at around 760 P/day.

Each of the two tasks is using ~8% of my total CPU. With one task it was maxed at 12.5% and only sieving at 670 P/day.


In comparison, a 1070 was sieving at 156 P/day.

Both using B13.

That makes the RTX 2080 GPU 4.87 times faster than the GTX 1070 at manual sieving.


According to GPU-Z the GPU usage is at 99-100%, temps are around 76C with fans going at 93% using a manually set fan curve. It's using a max of 260 watts but fluctuates around 250-258 watts. I have the power limit maxed out on my card through Afterburner software. Core clock is at 1935MHz and memory speed is at 1775MHz.


The RTX 2080's performance per watt is about 3x that of the GTX 1070.



To double check the sieving speeds I also ran a GFN22 sieve. GFN22 sieving came back at 768 P/day occasionally going up to 772 P/day. That is 32 P/hour. This was using 7.5% of the total CPU for just 1 task.

Impressive.

For reference the card I am using is a GIGABYTE GeForce RTX 2080 GAMING OC 8G Video Card, GV-N2080GAMING OC-8GC.




Copyright © 2005 - 2019 Rytis Slatkevičius (contact) and PrimeGrid community. Server load 2.81, 2.76, 2.85
Generated 14 Oct 2019 | 18:40:34 UTC