1) Message boards : Sophie Germain Prime Search : Running SGS and 321 sieve nearly halves SGS CPU time (Message 140175)
Posted 84 days ago by Profile composite
From these results - and a short, couple-of-hours test I ran on an i7 some time ago - I would say that at least the current SGS tasks do benefit from hyperthreading.

Since it seems common knowledge here that SGS suffers from HT, are there statistics available?

Your eyes do not deceive you. If you look far enough back in old message threads there's a debate around the theme that *some* LLR tasks benefit from HT=on for *some* computers. So the official mantra is to try different settings and see what works best for your particular CPU. The "common knowledge" you also see is a sample bias, occurring because the *some* people who saw no benefit from HT have a greater propensity to post about it.
2) Message boards : Number crunching : Year of the Rat Challenge (Message 139639)
Posted 112 days ago by Profile composite
zombie67 wrote:
I think it comes down to how the OS manages CPU threads, and do they keep them assigned to the chiplets. If an MT task can keep the threads in the same chiplet, it is much faster than having to communicate over the bus to the other chiplets. Similar to when you have a multi-CPU net up, but now we are talking about the chiplets with the AMD CPUs. And I think windows is just bad at it.

zombie67 wrote:
Observations: Wow, Win does even worse. It really just doesn't know how to work the affinity thing. At least not yet. Also, I should have been running this with linux on 4 threads for the challenge, not 8. :)

For this dataset Zombie's conclusion is supported by the measurements.

From the perspective of average throughput, highest to lowest for SR5:
Linux, 8 simultaneous tasks, 4 threads per task: 103.3 tasks/day
Linux, 4 simultaneous tasks, 8 threads per task: 80.9 tasks/day
Windows, 8 simultaneous tasks, 4 threads per task: 62.8 tasks/day
Windows, 4 simultaneous tasks, 8 threads per task: 56.7 tasks/day

From the perspective of "firsts per day" for SR5:
Warning: the sample size is small, but the relative ranking is fine
Linux, 8 simultaneous tasks, 4 threads per task: 77 firsts/day
Linux, 4 simultaneous tasks, 8 threads per task: 60 firsts/day
Windows, 8 simultaneous tasks, 4 threads per task: 34 firsts/day
Windows, 4 simultaneous tasks, 8 threads per task: 15 firsts/day

Since I was way off in my guess that Windows and Linux would perform about the same at 8 threads per task, I would be looking at what services are running on Windows at the same time as BOINC. If you can't control that, then the Linux setup is your crunching powerhouse.
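The relative speedups in the measurements above reduce to simple ratios. A quick sketch in Python (the tasks/day figures are the ones quoted above; everything else is just arithmetic):

```python
# Average SR5 throughput (tasks/day), keyed by (OS, simultaneous tasks, threads per task).
throughput = {
    ("Linux", 8, 4): 103.3,
    ("Linux", 4, 8): 80.9,
    ("Windows", 8, 4): 62.8,
    ("Windows", 4, 8): 56.7,
}

def speedup(a, b):
    """How many times faster configuration a is than configuration b."""
    return throughput[a] / throughput[b]

# Linux vs Windows at the best setting for each OS (8 tasks x 4 threads):
print(f"Linux advantage at 8x4: {speedup(('Linux', 8, 4), ('Windows', 8, 4)):.2f}x")
# Cost of going to 8-thread tasks on Linux:
print(f"Linux 8x4 vs 4x8: {speedup(('Linux', 8, 4), ('Linux', 4, 8)):.2f}x")
```

At the best settings Linux comes out roughly 1.6x ahead of Windows, which is consistent with the conclusion drawn above.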
3) Message boards : Number crunching : Year of the Rat Challenge (Message 139082)
Posted 134 days ago by Profile composite
Consumer Zen 2 isn't NUMA, as that refers to how the memory is organised, not the CPU internals as such. The fragmented L3 and limited internal bandwidth is likely a contributing factor in demanding use cases, and this is where a monolithic design can make things easier.

I see your point and agree with you. I was mistakenly thinking of the cache as part of the memory architecture but NUMA strictly considers the unreplicated parts of the memory (i.e. the RAM), whereas cache contains copies of data in RAM.

NUMA considerations aside, fetching data from another CCX's cache is slower than accessing data from the local cache. On top of that we don't know if hitting a CCX's cache with 8 threads saturates it, or if there is a penalty to local cores' access time when remote cores access a cache.

So I reiterate, I would like to see zombie try an experiment, comparing run time between Windows and Linux using 4 threads per task rather than 8.
4) Message boards : Number crunching : Year of the Rat Challenge (Message 139016)
Posted 139 days ago by Profile composite
P.S. In this case, linux is only about 40% faster. I have seen even more on other apps combined with different MT settings. But this was the example I have to share due to the current challenge.

AMD Zen 2 uses a NUMA processor architecture, which is more common in server-class CPUs. There are 4 cores in a CCX, connected by a crossbar switch to the local portion of L3 cache. Each core in a CCX has equal access time to the L3 cache in that CCX. A chiplet contains 2 CCXs, and there are 4 chiplets in that CPU, connected by Infinity Fabric (a bus is not as fast as a crossbar switch). Linux knows about NUMA architectures and allocates memory closer to the cores that use it. I speculate that Windows (desktop version) does this relatively poorly for this architecture. If you reduce your thread count per task from 8 to 4, does performance become nearly identical between Windows and Linux? If so, then you should use Linux for tasks that run with more than 4 threads. However, I would rather stick to 4 threads to maximize the utility of that architecture, especially with HT off.
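On Linux you can experiment with keeping a task inside one CCX by pinning its threads explicitly. A minimal sketch using only the standard library; the 4-cores-per-CCX layout and contiguous core numbering are assumptions taken from the discussion above (verify yours with lscpu), and `os.sched_setaffinity` is Linux-only:

```python
import os

def ccx_cores(ccx: int, cores_per_ccx: int = 4) -> set:
    """Logical core IDs belonging to one CCX, assuming cores are
    numbered contiguously per CCX (check lscpu on your system)."""
    start = ccx * cores_per_ccx
    return set(range(start, start + cores_per_ccx))

def pin_to_ccx(pid: int, ccx: int, cores_per_ccx: int = 4) -> None:
    """Restrict a process (and all its threads) to a single CCX.
    Linux-only; pid 0 means the calling process."""
    os.sched_setaffinity(pid, ccx_cores(ccx, cores_per_ccx))

# Example: CCX 1 on a 4-cores-per-CCX part covers logical cores 4-7.
print(sorted(ccx_cores(1)))  # [4, 5, 6, 7]
```

With HT on, the logical-core numbering changes (siblings may be interleaved or offset by half the core count), so the contiguity assumption has to be rechecked per machine.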
5) Message boards : Number crunching : Tour de Primes 2020 (Message 137635)
Posted 175 days ago by Profile composite
The fundamental problem of small samples...
Sub-project  Host  Tasks  Firsts  First pct.  Send/receive duration     Elapsed time              CPU time
                                              (avg/min/max)             (avg/min/max)             (avg/min/max)
321 (Sieve)  XXXX  1      1       100.00      20,493 / 20,493 / 20,493  20,472 / 20,472 / 20,472  20,450 / 20,450 / 20,450

WOOT! WOOT! 100%!

(I know, 321 Sieve work doesn't qualify for TdP.)
6) Message boards : Number crunching : nvdia-cuda-mps-control (Message 137595)
Posted 176 days ago by Profile composite
Yes, I remember that fix but I didn't remember that Genefer is OpenCL. My bad.

According to our favorite *pedia "CUDA provides both a low level API (CUDA Driver API, non single-source) and a higher level API (CUDA Runtime API, single-source)".

It's not apparent whether OpenCL communicates directly with the GPU card, or via the CUDA driver. If the latter, then Genefer should run blissfully unaware that nvidia-cuda-mps-control is sharing the card among multiple GPU applications. Alas, I cannot test this; my GPU card has only compute capability 3.0.

I use the Nvidia proprietary driver 390.77 downloaded from the Nvidia site.
The Nvidia package provided by my distro that follows the same version numbering scheme is the "NVIDIA CUDA Driver Library", whereas the "NVIDIA CUDA Runtime Library" has a version number like 8.0.44-4.

This lends credence to the possibility that OpenCL talks to the GPU through the CUDA driver.
7) Message boards : General discussion : Run Time VS CPU Time? (Message 137591)
Posted 176 days ago by Profile composite
Run Time is the amount of time the workunit has been running.

CPU Time is the amount of time the CPU has actively been working on the workunit.

As these workunits are using the idle cycles of the CPU, throughout the day there are other processes that will take the CPU time from the workunit, so you'll see Run Time being longer than the CPU Time.

But when using multi-threading, more than one CPU core can work on a task simultaneously. In this case the CPU time is more likely to exceed the Run Time because the CPU Time shown is actually the total time of all the threads working on the task, and the simultaneous computation of multiple threads reduces the Run Time.

Since energy consumption is directly related to CPU time, multi-threading causes more energy to be used to get the same result sooner.
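The accounting described above is easy to put into a formula: reported CPU time is roughly the run time multiplied by the thread count and the per-thread utilization. A small illustrative sketch (the utilization figures are made-up examples, not measurements):

```python
def reported_cpu_time(run_time: float, threads: int, utilization: float = 1.0) -> float:
    """CPU time as BOINC reports it: the sum over all threads of the
    time each thread actually spent executing on a CPU during the run."""
    return run_time * threads * utilization

# Single-threaded task sharing the CPU with other processes:
# CPU time comes out below run time.
print(reported_cpu_time(run_time=1000, threads=1, utilization=0.8))  # 800.0

# 4-thread MT task at 90% per-thread utilization:
# CPU time exceeds run time even though the task finishes sooner.
print(reported_cpu_time(run_time=300, threads=4, utilization=0.9))   # 1080.0
```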
8) Message boards : Number crunching : nvdia-cuda-mps-control (Message 137574)
Posted 176 days ago by Profile composite
Has anyone tried using the nvidia-cuda-mps-control daemon in Linux with BOINC projects?

From the man page:
MPS is a runtime service designed to let multiple MPI processes using CUDA to run concurrently in a way that's transparent to the MPI program. A CUDA program runs in MPS mode if the MPS control daemon is running on the system. When CUDA is first initialized in a program, the CUDA driver attempts to connect to the MPS control daemon. If the connection attempt fails, the program continues to run as it normally would without MPS. ... Currently, CUDA MPS is available on 64-bit Linux only, requires a device that supports Unified Virtual Address (UVA) and has compute capability SM 3.5 or higher. Applications requiring pre-CUDA 4.0 APIs are not supported under CUDA MPS. Certain capabilities are only available starting with compute capability SM 7.0.

GTX 10XX and RTX 20XX and some less powerful GPUs have the required compute capability.

I wonder if running this daemon fixes the 100% CPU core utilization problem with Genefer subprojects.
What CUDA API version is used by Genefer?
9) Message boards : Number crunching : Tour de Primes 2020 (Message 137571)
Posted 176 days ago by Profile composite
Interesting concept - is first percentage a zero-sum game?
I moved a slow computer from PPS-MEGA to PPSE.
So now not only is this computer completing more tasks per day, it is winning some of these workunit races instead of always losing.
In consequence, the competition among other participants has stiffened in both the PPSE and the PPS-MEGA subprojects.
In the latter case it is because there are fewer easy pickin's.
10) Message boards : General discussion : New Recent Activity table (Message 137570)
Posted 176 days ago by Profile composite
Uncertainty about the Tasks, First, and Pct columns...

I have a problem figuring out where the pending results are tallied. Are they included in the tasks?
Based on what I saw in my table, yes - pending tasks count as firsts until invalidated.
Otherwise, given that most pending tasks will result in a first (barring errors in validation), at what point would they be included in the Firsts column? For the fast tasks, I would think the same day, but the longer tasks would likely flip to firsts at a later date. And if they flip at a later date, are they simply forgotten because their WU was from a previous date?
Based on the comments in the TdP 2020 thread about the privacy of this table, this is a live database query of results recorded in the previous 24 hours, so it counts pending tasks as firsts from the moment they are recorded as complete. If a task takes more than 24 hours to validate it won't matter, since by then the pending task has dropped out of the computation just like the firsts do.
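Under that reading, the table logic amounts to: take every result recorded complete in the last 24 hours, and count pendings as firsts until a validation says otherwise. A sketch of that query in Python; the record fields here are assumptions for illustration, not the actual database schema:

```python
from datetime import datetime, timedelta

def recent_activity(results, now):
    """Tally Tasks / Firsts / Pct over a rolling 24-hour window,
    counting pending results as firsts until they are invalidated."""
    window = [r for r in results if now - r["completed"] <= timedelta(hours=24)]
    tasks = len(window)
    firsts = sum(1 for r in window if r["status"] in ("first", "pending"))
    pct = 100.0 * firsts / tasks if tasks else 0.0
    return tasks, firsts, pct

now = datetime(2020, 8, 3, 18, 31)
results = [
    {"completed": now - timedelta(hours=2),  "status": "first"},
    {"completed": now - timedelta(hours=5),  "status": "pending"},  # counted as a first
    {"completed": now - timedelta(hours=8),  "status": "invalid"},
    {"completed": now - timedelta(hours=30), "status": "first"},    # aged out of the window
]
print(recent_activity(results, now))  # (3, 2, 66.66666666666667)
```

A pending task that later validates or invalidates outside the window never changes a past day's row, which matches the "forgotten" behavior described above.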

Copyright © 2005 - 2020 Rytis Slatkevičius (contact) and PrimeGrid community.
Generated 3 Aug 2020 | 18:31:34 UTC