PrimeGrid
1) Message boards : Generalized Fermat Prime Search : High CPU usage again: Genefer 21 3.19 GPU (OCLcudaGFN) (Message 120988)
Posted 316 days ago by Warp Zero
I have confirmed the LD_PRELOAD hack still fixes it. This seems to affect all my machines (but I set up all of my machines very similarly; they're all running close to the same versions of everything).


Can you let me know specifically what libraries you linked via LD_PRELOAD to resolve this?

Cheers

- Iain

That tiny libsleep.c one that overrides sched_yield.
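It's roughly the following (a from-memory sketch, so the exact delay is my guess and the real file may differ slightly):

/* libsleep.c -- minimal sketch of the shim described above.
 * Idea: replace sched_yield() with a short nanosleep() so the driver's
 * busy-wait loop gives the CPU back instead of spinning at 100%.
 *
 * Build:  gcc -shared -fPIC -o libsleep.so libsleep.c
 * Use:    LD_PRELOAD=./libsleep.so ./geneferocl ...
 */
#include <time.h>

int sched_yield(void)
{
    /* Sleep ~1 ms instead of yielding: long enough to drop CPU usage,
     * short enough that the GPU isn't left waiting noticeably. */
    struct timespec ts = { 0, 1000000L };  /* 1,000,000 ns = 1 ms */
    nanosleep(&ts, NULL);
    return 0;
}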
2) Message boards : Generalized Fermat Prime Search : High CPU usage again: Genefer 21 3.19 GPU (OCLcudaGFN) (Message 120929)
Posted 319 days ago by Warp Zero
I have confirmed the LD_PRELOAD hack still fixes it. This seems to affect all my machines (but I set up all of my machines very similarly; they're all running close to the same versions of everything).
3) Message boards : Generalized Fermat Prime Search : High CPU usage again: Genefer 21 3.19 GPU (OCLcudaGFN) (Message 120918)
Posted 320 days ago by Warp Zero
The GFN GPU tasks on my Linux machines still use 100% CPU (100% of a single thread), even though this is supposed to be fixed according to Mike Goetz.

The systems involved are (driver versions saved here for reference):
crystal:
http://www.primegrid.com/show_host_detail.php?hostid=936874
Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz [Family 6 Model 158 Stepping 10] (12 processors)
NVIDIA GeForce GTX 1060 (4095MB) driver: 396.54, INTEL Intel(R) UHD Graphics Coffee Lake Halo GT2 (4096MB)
Debian GNU/Linux testing (buster) [4.18.0-1-amd64|libc 2.27 (Debian GLIBC 2.27-6)]
Output from this machine:
geneferocl 3.3.3-2 (Linux/OpenCL/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 1060', vendor 'NVIDIA Corporation', version 'OpenCL 1.2 CUDA' and driver '396.54'.
10 computeUnits @ 1733MHz, memSize=6078MB, cacheSize=160kB, cacheLineSize=128B, localMemSize=48kB, maxWorkGroupSize=1024.
Supported transform implementations: ocl ocl2 ocl3 ocl4 ocl5
Command line: ../../projects/www.primegrid.com/primegrid_genefer_3_3_3_3.19_x86_64-pc-linux-gnu__OCLcudaGFN15 -boinc -q 101950630^32768+1 --device 0
Normal priority change failed (needs superuser privileges.
Checking available transform implementations...
OCL transform is past its b limit.
OCL3 transform is past its b limit.
OCL4 transform is past its b limit.
OCL5 transform is past its b limit.
Using OCL2 transform
Starting initialization...
Initialization complete (0.054 seconds).
Testing 101950630^32768+1...
Estimated time for 101950630^32768+1 is 0:03:26
101950630^32768+1 is complete. (262419 digits) (err = 0.0000) (time = 0:03:30) 15:20:24
15:20:24 (4685): called boinc_finish

(The GPU on this system is locked into a power-saving mode, so it runs at only a fraction of the speed you would expect from a normal 1060, which means the CPU thread should have to communicate with it even less often. Despite the fact that it reports 10 compute units at 1733MHz, it's actually running at 607MHz at best. The ETA for GFN-21 was about 7 days...)

buttercup:
GenuineIntel
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz [Family 6 Model 58 Stepping 9] (8 processors)
NVIDIA GeForce GTX 970 (4041MB) driver: 390.67
Debian GNU/Linux 9 (stretch) 4.17.0-0.bpo.1-amd64
http://www.primegrid.com/show_host_detail.php?hostid=910164

Output from this machine:
geneferocl 3.3.3-2 (Linux/OpenCL/64-bit)
Copyright 2001-2018, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2014, Michael Goetz, Ronald Schneider
Copyright 2011-2018, Iain Bethune
Genefer is free source code, under the MIT license.
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 970', vendor 'NVIDIA Corporation', version 'OpenCL 1.2 CUDA' and driver '390.67'.
13 computeUnits @ 1177MHz, memSize=4041MB, cacheSize=208kB, cacheLineSize=128B, localMemSize=48kB, maxWorkGroupSize=1024.
Supported transform implementations: ocl ocl2 ocl3 ocl4 ocl5
Command line: ../../projects/www.primegrid.com/primegrid_genefer_3_3_3_3.19_x86_64-pc-linux-gnu__OCLcudaGFN -boinc -q 266884^2097152+1 --device 0
Normal priority change failed (needs superuser privileges.
Checking available transform implementations...
A benchmark is needed to determine best transform, testing available transform implementations...
Testing OCL transform...
Testing OCL2 transform...
Testing OCL3 transform...
Testing OCL4 transform...
Testing OCL5 transform...
Benchmarks completed (20.951 seconds).
Using OCL4 transform
Starting initialization...
Initialization complete (12.814 seconds).
Testing 266884^2097152+1...
Estimated time for 266884^2097152+1 is 21:30:00


I tested both GFN-15 and GFN-21. This is all just via BOINC.

Since this interferes with my CPU tasks, I have switched to AP27 for now.

I have confirmed BOINC is using the latest GPU version: primegrid_genefer_3_3_3_3.19_x86_64-pc-linux-gnu__OCLcudaGFN

Both systems use the nVidia GPU for X11.

I will try the LD_PRELOAD hack later or this weekend if I have time.
4) Message boards : News : Another PPS-Mega Prime! (Message 111445)
Posted 644 days ago by Warp Zero
Hi! I found this prime! I was wondering if there would be any way to get the decimal representation of it. I know it's over 1M digits; is there any software that can produce it?


You can get the decimal representation directly on the PrimeGrid website.

If you go to your primes page, you'll see a link next to the prime's length. Click where it says decimal.

But to answer your exact question, if you ever want to get the full decimal representation of an arbitrary number, you can use PFGW with the -od command line arguments. For example:

pfgw64 -od -q"943*2^3442990+1"


Congratulations!

Thanks!
5) Message boards : News : Another PPS-Mega Prime! (Message 111432)
Posted 644 days ago by Warp Zero
Hi! I found this prime! I was wondering if there would be any way to get the decimal representation of it. I know it's over 1M digits; is there any software that can produce it?
6) Message boards : Number crunching : Linux PPS Tuning (Message 110991)
Posted 665 days ago by Warp Zero
EDIT: Ah, wait a second, you're talking about GPUs there. LLR is always going to be faster when you aren't sharing cores with a thread feeding a GPU alongside the LLR.

I haven't done the GPU testing yet because it's very tedious and takes a very long time. But I will keep this in mind for when I maybe eventually do it.
7) Message boards : Number crunching : Linux PPS Tuning (Message 110978)
Posted 666 days ago by Warp Zero
Forgot to reply earlier, interesting results. I'm aiming to repeat some scenarios on a 6700k Windows system. I'm particularly interested in the apparent gain of ~4% throughput running 4x2t compared to 4x1t with or without HT on. Based on what I've seen before, I wouldn't expect that.

I would expect it to perform worse if the affinities weren't set so carefully or if it caused the Turbo boost not to kick in as high. You have to be sure that for a single task, its two threads are on different cores.
8) Message boards : Number crunching : Linux PPS Tuning (Message 110974)
Posted 666 days ago by Warp Zero
If the same thing is happening to the LLR runtime as to the PPS-Sieve, then this is a nail in the coffin for the argument for using HT with multithreading.

I don't understand what you mean... all my fastest results are using HT with multithreading.
9) Message boards : Number crunching : Linux PPS Tuning (Message 110962)
Posted 666 days ago by Warp Zero
Have you tried other combinations? Like 4x single thread tasks? MEGA (256k FFT), should take up to 2MB per task. i7 would have enough L3 cache to allow 4 to run in parallel without being hindered by ram. i5s however would not.


Are you sure? This should only be the case if in-place FFTs are used. If the FFT isn't done in-place, it'll take at least double (4MiB).


If HT is left on, I found (on Windows at least) you need to set affinity in some way to ensure two different tasks don't end up on the same core at the same time, as it leads to ~10% performance drop. I usually set affinity for BOINC as a whole to use one thread per core. Even if you set multiple threads per task, the above applies. Note I don't use GPU so that isn't factored into this.

In all my testing, I've yet to find HT to provide any throughput benefit over one thread per core for FMA3 tasks, except if there is some inefficiency where HT masks it. The 10% penalty above can be eliminated by running extra threads like it sounds like you are doing. For small tasks like these, the scaling with increased thread count isn't good. Run one task on one thread. Then run another task using two threads, it'll probably NOT be twice as fast, and it gets worse as you increase the thread count. By running two tasks of 4 threads, it kinda mitigates that inefficiency.

I still think to get a picture on how optimal the current timings are, try running a single task on single thread and compare throughput against that. Alternatively, 4 tasks with one thread each, and either the affinity as described earlier, or turn off HT.


I took some measurements with various configurations to have a look at what you were saying. Here's my results: https://docs.google.com/spreadsheets/d/1UmE1WFnRYjpF-RXdsVNGrvHbnOQOCjQW8BYRz97E_2U/edit?usp=sharing

As you can see, without the GPU running, 4 tasks with 2 threads each spread over the cores was the best, followed by 2 tasks with 4 threads each, i.e. fully utilizing hyperthreading. However, if your CPU isn't overclocked like mine is, and hyperthreading causes the CPU clock to decrease because of the heat, I would definitely expect using all hardware threads to be slower. Even so, I think the results clearly show you should leave hyperthreading ON and set the BOINC client to use 50% of the CPUs if you don't want to utilize hyperthreading, rather than turning hyperthreading off.

I still think 2 tasks with 4 threads each is the way to go if you're using a GPU also, but I haven't done the testing to confirm it yet. Gathering the data I did was very very tedious and took all weekend.

Also note running HT like this will use more power. I have done a test in the past, running 4 tasks either with HT off, or affinity set, as a baseline. I compared this to running 8 tasks. Those 8 tasks took twice as long. I did twice as many tasks in double the time, so effective throughput was unchanged. However I was monitoring power consumption at the time, and the power taken was increased with 8 tasks compared to 4. Power efficiency had dropped.
I don't disagree that using hyperthreading will use more power, but well, my room is really cold.
10) Message boards : Number crunching : Linux PPS Tuning (Message 110904)
Posted 670 days ago by Warp Zero
Hi! I just thought I'd share some things I came up with while doing PPS LLR (CPU) and Sieve (GPU) on a somewhat high-end system.

Hardware: i7-7700K gingerly overclocked to 4.7GHz, 64GiB of not-the-best RAM, nVidia 1070 (factory overclocked; I underclocked it back down closer to nVidia specifications).

Software: Debian 9 "Stretch", nVidia 375.82. This is a desktop computer so I also had to factor in my browser being open and other apps taking up CPU and GPU resources.

Parallel LLR

I found two setups for LLR (especially PPS-Mega) to be the best performing:
1. Two instances with 4 threads each - about 21min/test
2. One instance with 7 threads - about 11min/test

Yes, one instance with 8 threads was actually slower. Why? Because the Sieve thread, browser, desktop environment, etc. also take up small amounts of CPU. The Sieve thread usually consumes about 16-20% of one hardware thread on this system. In addition, the nVidia driver itself consumes another 1%. If a single thread of an LLR process gets held up because the processor is doing something else, all of the other threads must wait for it as well.

Affinity

In the first setup, two instances with 4 threads each, I found it beneficial to set the CPU affinity of all 8 individual compute threads. Since the CPU is organized into four cores each with two threads, LLR runs faster when the four threads are assigned to different cores. Or in other words, thread 1 of process 1 is pinned to core 1, thread 2 is pinned to core 2, thread 3 is pinned to core 3 and thread 4 is pinned to core 4. The same applies to the second process, it also has four threads and each is assigned to a different core.

This helps because one of the goals of hyperthreading is to allow a second process to run if the first process is stalled because it's waiting on RAM. In this scenario, each core has a thread from each process running, so if one LLR process is stalling due to RAM, the other process can take over on all cores.

In addition the LLR code in multithreaded mode seems to have some locking overhead: I often notice the first thread of a process at 95% while the others are at 90% CPU utilization. In this setup, when the other three threads are waiting on the first thread, the second LLR process can take over.

Using fewer threads per process doesn't seem to help except on PPSE, probably because of PPS-Mega's large FFT size, which needs to fit in L3 cache (for this CPU, 8MiB of it) in order to prevent the process from having to wait on the much slower RAM. With four processes of two threads each, there are too many FFTs for the L3 cache to hold all at once.
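For the curious, the pinning amounts to something like the following rough C sketch of what my Python script does. Two assumptions here: the kernel numbers logical CPUs so that k and k+4 are the two hardware threads of physical core k (check lscpu, your machine may differ), and the threads are taken in whatever order /proc lists them.

/* pin_llr.c -- rough sketch: give thread N of an LLR process the whole
 * physical core N%4 (both of that core's hardware threads), so the
 * scheduler can still juggle the two LLR processes on the same core.
 *
 * Build:  gcc -O2 -o pin_llr pin_llr.c
 * Use:    ./pin_llr <llr-pid>
 */
#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/task", argv[1]);
    DIR *d = opendir(path);
    if (!d) { perror("opendir"); return 1; }

    int core = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;          /* skip . and .. */
        pid_t tid = (pid_t)atoi(e->d_name);

        /* Allow this thread to run on either hardware thread of core N%4. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core % 4, &set);
        CPU_SET(core % 4 + 4, &set);
        if (sched_setaffinity(tid, sizeof set, &set) != 0)
            perror("sched_setaffinity");
        core++;
    }
    closedir(d);
    return 0;
}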

Priority

With all cores being hit as hard as they can by the above setup, GPU utilization can drop while the GPU waits for the PPS-Sieve process to get scheduled on a hardware thread, run, and send it more commands. Since BOINC is designed to be minimally intrusive to a computer's operation, CPU processes run at the lowest priority (nice 19) and GPU processes run (on the CPU) at a slightly higher priority (nice 10) by default.

If the CPU process priority were raised, it would make the computer very annoying to use: browsing and other tasks would be laggy, if usable at all. However, since the PPS-Sieve process only uses at most 20% of one hardware thread, that leaves plenty left over for browsing and other tasks even if it's set to a very high priority.

Optimally, we want the PPS-Sieve process to run immediately as soon as the GPU is done with whatever the PPS-Sieve process told it to do last, so that PPS-Sieve can give it more work and minimize the time the GPU is idle.

Fortunately, the Linux kernel has a facility to help with just this scenario: realtime priorities. Processes with realtime priority are given the CPU as soon as work is ready for them, preempting all other processes. So, by assigning the PPS-Sieve process realtime priority, I avoided reducing PPS-Sieve performance while still utilizing the CPU to the max with PPS-Mega. This increases the power draw, heat, and utilization reported by the GPU and shortens PPS-Sieve times. In fact, the nVidia driver already runs with realtime priority to improve graphics performance.
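Setting this up is a one-liner with chrt, but here's a rough C sketch of the same thing for completeness. The priority value 1 is an arbitrary choice of mine, and you need root or CAP_SYS_NICE to do it.

/* rt_sieve.c -- rough sketch: give the PPS-Sieve process realtime priority.
 *
 * Build:  gcc -O2 -o rt_sieve rt_sieve.c
 * Use:    sudo ./rt_sieve <pps-sieve-pid>
 */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    struct sched_param sp = { .sched_priority = 1 };  /* lowest RT priority */

    /* SCHED_FIFO: the given task preempts all normal (SCHED_OTHER) tasks as
     * soon as it becomes runnable, so the GPU gets fed with minimal delay. */
    if (sched_setscheduler((pid_t)atoi(argv[1]), SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    return 0;
}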

Running multiple PPS-Sieve processes on my single GPU also increased power, heat, and utilization, but seemed to have a negative impact on PPS-Sieve times and made my browser too laggy.

Other things I tried

I tried setting the PPS-Sieve process's CPU affinity; this didn't seem to help. I tried changing the nVidia driver's CPU affinity; that didn't seem to help either. Setting a thread's affinity to a specific hardware thread instead of a whole core didn't seem to help. Setting affinity when running only one LLR process with 7 threads didn't help either.

Overclocking the heck out of my RAM definitely helped.

Summary

The two main things I found that helped improve performance were running two LLR processes, with a thread on each core, and giving the Sieve process realtime priority. I hope this helps someone!

Here's the script I use to set process affinities and priorities: https://github.com/orezpraw/scripts/blob/master/primegrid.py (sorry it's not user-friendly.)

