Message boards :
Project Staging Area :
WFS/WSS task size thoughts
After the last few challenges, and after looking at all the data from my various GPUs as well as the recent optimization thread and other threads about modern GPU resource utilization, I did some thinking and testing on the workunit size of the Wieferich and Wall-Sun-Sun subprojects. My main conclusion is that the workunits are simply too small. Based on testing on my machines, I have the following proposal:
What if we increased the unit size by 10x? Instead of a 1e11/1e10 range, how about 10e11 for WFS and 10e10 for WSS? (10x credit, too, of course.)
This is what I discovered from my tests of 1e10/1e11 vs. 10e10/10e11; the results were rather consistent across the several ranges I tried. (Methodology: I ran a big 10x range at or near the current leading edge as one big unit, and the same range as 10 individual 1x units, just one unit at a time on each GPU, tested on Maxwell and Fermi.)
1. The "init time" was within a few 1/100ths of a second for short and long ranges.
2. P/sec actually increased by up to 13% (and GPU load was consistently higher as measured through GPU-Z).
3. Doing a 10x range in one unit took ~9-9.3 times (Maxwell) and ~8.9-9.1 times (Fermi) as long as a single 1x range - i.e., less time than doing 10 individual 1x units, and that's not counting any interstitial time between workunits that the prpclient creates.
4. Memory usage didn't change.
5. On little Fermi (GT430, 96 CU), 100% GPU usage is easily reached, so the time gains were generally limited to doing only a single init vs. 10, but more importantly, it never took more than 10x the time to do 10x the work.
In real world timing, on my old and hopefully soon to be upgraded GTX580, a 10e11 WFS unit took ~510 seconds or 8m30s, which is still rather fast for something that is slower than a GTX1050. Certainly lesser GPUs will take much longer (My unoptimized 980ti is over 20x faster than my GT430 at WSS, but people will often run what they have, big or small, and that's OK), but comparing to some BOINC projects, it's not a bad time at all. Heck, make the task size 100x and it's about the same time requirement as an AP27 task (theoretically, I didn't test it), though I think 10x is a good place to start the discussion, especially considering midrange and lower GPU users. I know I'm probably ignoring the needs of CPU users, but like PPS Sieve, the GPU over CPU advantage is too great already.
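To put a rough number on the per-task overhead: assuming each task costs a fixed init time plus time proportional to the range (my simplification, not anything the app reports directly), the two 980ti WFS timings quoted below give:

```python
# Fixed-overhead model (an assumption): time = t0 + primes / rate.
# Clock times and primes-tested counts from the two 980ti WFS runs quoted below.
t_1x, primes_1x = 26.40, 2_443_719_505       # 1e11-wide range
t_10x, primes_10x = 236.72, 24_440_820_358   # 10x-wide range

# Two equations, two unknowns: steady sieve rate and per-task fixed cost.
rate = (primes_10x - primes_1x) / (t_10x - t_1x)  # p/sec once running
t0 = t_1x - primes_1x / rate                      # fixed cost per task
saved = 9 * t0  # ten 1x tasks pay t0 ten times; one 10x task pays it once

print(f"steady rate ~{rate/1e6:.1f}M p/sec, overhead ~{t0:.1f} s/task")
print(f"~{saved:.0f} s saved per 10x unit on this GPU")
```

About 3 seconds of fixed cost per task on this card, so roughly half a minute saved per 10x unit, before counting the higher sustained p/sec.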
I like completing 70k tasks in a weeklong challenge (more if I didn't also love gaming); who wouldn't? But if I could do 8-12% more in the same time frame and have one fewer digit in the task count instead, wouldn't that be more worthwhile project-wise?
Sample raw comparison outputs of single ranges vs. 10x range on my 980ti. 104 Mp/s! On PRPnet, I can't even crack 100 consistently with 2+ tasks running at the same time:
WFS
>wwwwcl64.exe -v -p 591537500000000000 -P 591537600000000000 -T Wieferich
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 1.2 CUDA 8.0.0
Device 0 is a NVIDIA Corporation GeForce GTX 980 Ti
workGroupSize = 8650752 = 12288 * 32 * 22 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 16 threads
Allocated memory (prior to sieving): 1584 MB in CPU, 1584 MB in GPU
Sieve started: (cmdline) 591537500000000000 <= p < 591537600000000000
Sieve complete: 591537500000000001 <= p < 591537600000000000 2443719505 primes tested
Clock time: 26.40 seconds at at 92552827 p/sec.
Processor time: 241.24 sec. (25.43 init + 215.81 sieve).
Seconds spent in CPU and GPU: 49.85 (cpu), 194.01 (gpu)
Percent of time spent in CPU vs. GPU: 20.44 (cpu), 79.56 (gpu)
CPU/GPU utilization: 9.14 (cores), 1.00 (devices)
Percent of GPU time waiting for GPU: 39.67
>wwwwcl64.exe -v -p 587713200000000000 -P 587714200000000000 -T Wieferich
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 1.2 CUDA 8.0.0
Device 0 is a NVIDIA Corporation GeForce GTX 980 Ti
workGroupSize = 8650752 = 12288 * 32 * 22 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 16 threads
Allocated memory (prior to sieving): 1584 MB in CPU, 1584 MB in GPU
Sieve started: (cmdline) 587713200000000000 <= p < 587714200000000000
p=587714181312583151, 104.4M p/sec, 9.33 CPU cores, 98.1% done. ETA 31 Dec 18:31
Sieve complete: 587713200000000001 <= p < 587714200000000000 24440820358 primes tested
Clock time: 236.72 seconds at at 103248005 p/sec.
Processor time: 2206.95 sec. (23.49 init + 2183.45 sieve).
Seconds spent in CPU and GPU: 156.59 (cpu), 2107.95 (gpu)
Percent of time spent in CPU vs. GPU: 6.91 (cpu), 93.09 (gpu)
CPU/GPU utilization: 9.32 (cores), 1.00 (devices)
Percent of GPU time waiting for GPU: 49.41
WSS
>wwwwcl64.exe -v -p 235389880000000000 -P 235389890000000000 -T WallSunSun
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
setting 3072
Platform 0 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 1.2 CUDA 8.0.0
Device 0 is a NVIDIA Corporation GeForce GTX 980 Ti
workGroupSize = 2162688 = 3072 * 32 * 22 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 4 threads
Allocated memory (prior to sieving): 115 MB in CPU, 115 MB in GPU
Sieve started: (cmdline) 235389880000000000 <= p < 235389890000000000
Sieve complete: 235389880000000001 <= p < 235389890000000000 249992251 primes tested
Clock time: 15.94 seconds at at 15683416 p/sec.
Processor time: 18.35 sec. (3.14 init + 15.21 sieve).
Seconds spent in CPU and GPU: 0.71 (cpu), 51.90 (gpu)
Percent of time spent in CPU vs. GPU: 1.35 (cpu), 98.65 (gpu)
CPU/GPU utilization: 1.15 (cores), 1.00 (devices)
Percent of GPU time waiting for GPU: 56.40
>wwwwcl64.exe -v -p 235389880000000000 -P 235389980000000000 -T WallSunSun
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
setting 3072
Platform 0 is a NVIDIA Corporation NVIDIA CUDA, version OpenCL 1.2 CUDA 8.0.0
Device 0 is a NVIDIA Corporation GeForce GTX 980 Ti
workGroupSize = 2162688 = 3072 * 32 * 22 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 4 threads
Allocated memory (prior to sieving): 115 MB in CPU, 115 MB in GPU
Sieve started: (cmdline) 235389880000000000 <= p < 235389980000000000
p=235389975458280079, 16.93M p/sec, 1.04 CPU cores, 95.5% done. ETA 31 Dec 18:47
Sieve complete: 235389880000000001 <= p < 235389980000000000 2499971252 primes tested
Clock time: 148.22 seconds at at 16866230 p/sec.
Processor time: 157.58 sec. (3.12 init + 154.46 sieve).
Seconds spent in CPU and GPU: 4.95 (cpu), 522.39 (gpu)
Percent of time spent in CPU vs. GPU: 0.94 (cpu), 99.06 (gpu)
CPU/GPU utilization: 1.06 (cores), 1.00 (devices)
Percent of GPU time waiting for GPU: 63.11
____________
Eating more cheese on Thursdays.
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13804 ID: 53948 Credit: 345,369,032 RAC: 1,967
Yes, you're ignoring the CPU users.
____________
My lucky number is 75898^524288+1
Yes, you're ignoring the CPU users.
Very well, let me "unignore" the CPU users.
I ran a 1x WFS task on my 3.8 GHz Sandy Bridge 3930k; it went rather quickly too: 41 minutes. I'll multiply the time by about 2 to pretend I ran a full CPU+HT load, so (rough calculation) a 10x CPU unit would take...13.5 or so hours on a full CPU, maybe more if I'm being overly optimistic about scaling. Not a bad time, and multiple 10x tasks could be completed in a 1-week challenge. I'll do a real 12-thread test tomorrow and see how close my calculation was. The program is sieve based, so do older CPUs (e.g. Core 2, Phenom) hold up as robustly here as they do on the other sieve projects?
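For anyone checking my arithmetic, the estimate is just this (both scaling factors are my guesses, to be replaced by the real 12-thread test):

```python
# Rough CPU estimate. Both factors are guesses, not measurements:
single_task_min = 41   # measured: one 1x WFS task, 3930k @ 3.8 GHz, light load
full_load_factor = 2   # assumed penalty with all threads + HT busy
size_factor = 10       # proposed 10x unit

est_min = single_task_min * full_load_factor * size_factor
print(f"~{est_min} min = ~{est_min/60:.1f} h per 10x unit")  # ~820 min = ~13.7 h
```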
I imagine WSS would be shorter seeing as the tasks complete in half the time on GPUs vs. WFS. That, of course, would also be completely ignoring the content of this thread, which suggests that CPU users are already out in the cold (I can't run it, either, so I have no data).
Though it is silly to use the PPS Sieve average runtimes to compare CPU/GPU times in a sieve (it happens to be 52x; I doubt it's representative of anything, although old hardware is pretty good at sieving), the underlying concept stands: PPS Sieve and AP27 task lengths are designed (as you've said repeatedly in threads) to be run primarily by GPUs that cut through the tasks like a fork through soup, not CPUs. Why is it unreasonable to extend the same concept to PRPNet projects, where the userbase is far smaller and the tasks can currently be completed in under a minute on a $100 GPU? Tasks have time floors due to startup and initialization overheads, among other limitations. Increasing task length in this case would help mitigate that waste in WFS/WSS; no configuration trickery or advanced knowledge needed (aside from wwww.ini, but that's not the issue here). "Flip the switch", as it were, and throughput could be up 10% immediately. Of course, more data are needed; I'd love to see some midrange GPU owners chime in with tests, and some older CPUs report runtime info as well.
____________
Eating more cheese on Thursdays.
Dave
Joined: 13 Feb 12 Posts: 3063 ID: 130544 Credit: 2,127,204,724 RAC: 1,444,597
Never knew it could work on CPU personally.
I like being able to blip the throttle & do a few filler tasks as required. Just yesterday I needed to do exactly 16 tasks to help get my PSA total to a clean point. The concept of longer tasks still has merit, of course - how short will they become on, say, the upcoming 1080 Ti?!
GTX580 a) still rocks & b) ~58 secs for me with wwww.ini & no BOINC.
What if we increased the unit size by 10x? instead of a 1e11/1e10 range, how about 10e11 for WFS and 10e10 for WSS? (10x credit, too, of course)
I think it is a good idea. /JeppeSN
Roger (Volunteer developer, Volunteer tester)
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,621,444 RAC: 0
I will have to give the suggested ranges a try tomorrow.
For the last 2 days I have been successfully testing WFS/WSS on an AMD R9 280X, Catalyst 14.12, with the following results from PRPNet:
Wieferich, wwwwcl v2.1.9
6 threads, 1024 blocks, 5x WU, 89% GPU, 98% CPU: 320 WUs in 5:13:09, an average of one work unit every 58.7 seconds.
6 threads, 2048 blocks, 5x WU, 93% GPU, 99% CPU: 630 WUs in 9:46:01, an average of one work unit every 55.8 seconds.
WallSunSun, wwwwcl v2.2.5
blocks=4096, threads=2, 94-99% GPU, 0-33% CPU: 300 WUs in 3:16:19, an average of one work unit every 39.2 seconds.
2 directories:
blocks=4096, threads=1: 85 WUs in 1:41:14, an average of one work unit every 71.5 seconds.
blocks=4096, threads=1: 85 WUs in 1:42:30, an average of one work unit every 72.4 seconds.
So, in total, one work unit every 36.0 seconds. Therefore two directories is superior.
There is overhead from init time and from talking to the servers every 1-20 WUs. Wasted GPU time can be avoided by running multiple instances per GPU.
2 instances is better than 1, as shown above. I am not sure how far that scales up, though.
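The 36.0-second figure for two directories is just the harmonic combination of the two streams' per-WU times, since the two instances complete work independently and their rates add:

```python
# Two independent prpclient directories feeding one GPU: rates add.
t1, t2 = 71.5, 72.4           # measured seconds per WU for each instance
combined_rate = 1/t1 + 1/t2   # WUs per second overall

print(f"one WU every {1/combined_rate:.1f} s")  # one WU every 36.0 s
```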
Michael Goetz (Volunteer moderator, Project administrator)
Joined: 21 Jan 10 Posts: 13804 ID: 53948 Credit: 345,369,032 RAC: 1,967
What if we increased the unit size by 10x? instead of a 1e11/1e10 range, how about 10e11 for WFS and 10e10 for WSS? (10x credit, too, of course)
We'll think about it. If you really want us to...
Bear in mind that us thinking about this might be a bad thing.
WWWW doesn't produce a usable residue, so double check comparisons are impossible. Not double checking GPU tasks is a HORRIBLE idea. I'm sure you're aware of how we feel about the need to double check results.
We know many of the wwww results are faulty because of the occasional false near-finds. We just have no way of detecting them, or determining how frequently they occur.
We've been gradually moving all of the projects off of PRPNet and onto BOINC mostly to get them into an environment where we can easily double check everything. It also brings vastly more participation, of course, but double checking is the primary reason.
WSS and Wieferich will not be moving to BOINC.
If someone came to us with the wwww app today, we wouldn't run it due to the lack of double checking.
Basically, we don't trust any of the results. To me, that makes it worthless.
If we start thinking about wwww, we're as likely to shut it off as we are to make changes.
EDIT: I wrote this post several hours ago, and was discussing it with the other admins before posting it. No decisions have been made, but my personal feelings that it's pointless to be running this project without double checking are shared by a lot of others. The genie is officially out of the bottle. Pandora's Box is wide open. We shall see what comes of this. (I'm not saying we're shutting it down, just that the project as it exists today doesn't make a lot of sense. It's way too early to be worrying about what comes next.)
____________
My lucky number is 75898^524288+1
Ran a full CPU (w/HT) run of WFS tasks, and the average time was about 68 minutes, a 1.66x time increase over a single task - much better than I guessed yesterday. I would expect it to be a little higher on CPUs where the ratio of threads to memory channels is larger. I'm finding that calculation time on CPU scales roughly linearly with task size, as there is minimal initialization per task, so I will still use the "worst case scenario" of 10x length for 10x size.
So, CPU-wise, a 10x task on every core would take just 680 minutes, or 11.3 hours, which is a really good time, all things considered. The credit sucks (just 2000 points per core per day, perhaps one of the lowest rates in all of PG outside of the x87 GFN tasks), but by the same token it's low for GPUs, too.
Edit: Just saw your post, Michael, and I definitely agree with you that the lack of a double check (or even the ability to verify) is a problem. I think there are many of us who would love to see an updated and fully usable wwww app, but I would understand if you just shut it off one day.
____________
Eating more cheese on Thursdays.
We'll think about it. If you really want us to...
Bear in mind that us thinking about this might be a bad thing.
WWWW doesn't produce a usable residue, so double check comparisons are impossible. Not double checking GPU tasks is a HORRIBLE idea. I'm sure you're aware of how we feel about the need to double check results.
We know many of the wwww results are faulty because of the occasional false near-finds. We just have no way of detecting them, or determining how frequently they occur.
We've been gradually moving all of the projects off of PRPNet and onto BOINC mostly to get them into an environment where we can easily double check everything. It also brings vastly more participation, of course, but double checking is the primary reason.
WSS and Wieferich will not be moving to BOINC.
If someone came to us with the wwww app today, we wouldn't run it due to the lack of double checking.
Basically, we don't trust any of the results. To me, that makes it worthless.
If we start thinking about wwww, we're as likely to shut it off as we are to make changes.
EDIT: I wrote this post several hours ago, and was discussing it with the other admins before posting it. No decisions have been made, but my personal feelings that it's pointless to be running this project without double checking are shared by a lot of others. The genie is officially out of the bottle. Pandora's Box is wide open. We shall see what comes of this. (I'm not saying we're shutting it down, just that the project as it exists today doesn't make a lot of sense. It's way too early to be worrying about what comes next.)
I have also thought about this major shortcoming. I hope one day someone will come up with a way to have a "result" for a WWWW range. It could be the XORed total of the last 64 bits of the A values (when the residue modulo p^2 is written as ±1 + A*p or 0 + A*p, that is the "A" I am talking about) of all the primes in that range, or something. Then it would need to be implemented in each WWWW processor flavor. And at that point PrimeGrid ought to restart the entire search from zero again, with double checking.
It would be cool.
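A toy illustration of the folding idea (everything here is made up except the XOR-of-low-64-bits scheme itself; the real A values would come from the wwww computation):

```python
# Illustrative sketch only: fold the low 64 bits of each prime's A value
# with XOR. "a_values" stands in for whatever the real app computes per prime.
MASK = (1 << 64) - 1

def range_residue(a_values):
    res = 0
    for a in a_values:
        res ^= a & MASK  # two's-complement low 64 bits, so signs are captured
    return res

# Any two correct runs over the same range must agree on this value,
# regardless of the order the primes are processed in:
print(hex(range_residue([-49, 1234, 5678])))
```

One nice property of XOR folding is order independence, so a GPU run that processes primes in a different order than a CPU run would still match.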
/JeppeSN
Roger (Volunteer developer, Volunteer tester)
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,621,444 RAC: 0
Running Grebuloner's suggested single range vs. 10x range on my AMD 280X gave me 9.68x with WFS and 9.36x with WSS.
Stopwatch time of WFS v2.1.9 below is the sum of "init" and "sieve" in brackets after "Elapsed time".
Stopwatch time of WSS v2.2.5 below is simply "Clock time".
WFS
>wwwwcl64.exe -v -p 591537500000000000 -P 591537600000000000 -T Wieferich
wwwwcl v2.1.9, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is an Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing, version OpenCL 2.0 AMD-APP (1642.5)
Device 0 is an Advanced Micro Devices, Inc. Tahiti
workGroupSize = 4194304 = 2048 * 64 * 32 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 6 threads
Allocated memory (prior to sieving): 288 MB in CPU, 288 MB in GPU
Sieve started: 591537500000000000 <= p < 591537600000000000
Sieve complete: 591537500000000001 <= p < 591537600000000000 2443719505 primes tested
Elapsed time: 182.81 sec. (3.69 init + 52.49 sieve) at 43411320 p/sec.
Processor time: 306.79 sec. (16.97 init + 289.82 sieve).
Seconds spent in CPU and GPU: 126.52 (cpu), 65.70 (gpu)
Percent of time spent in CPU vs. GPU: 0.66 (cpu), 0.34 (gpu)
CPU/GPU utilization: 0.17 (cores), 0.09 (devices)
>wwwwcl64.exe -v -p 587713200000000000 -P 587714200000000000 -T Wieferich
wwwwcl v2.1.9, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is an Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing, version OpenCL 2.0 AMD-APP (1642.5)
Device 0 is an Advanced Micro Devices, Inc. Tahiti
workGroupSize = 4194304 = 2048 * 64 * 32 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 6 threads
Allocated memory (prior to sieving): 288 MB in CPU, 288 MB in GPU
Sieve started: 587713200000000000 <= p < 587714200000000000
p=587713273849653499, 47.67M p/sec, 5.76 CPU cores, 7.4% done. ETA 02 Jan 07:5
p=587713436880537173, 47.19M p/sec, 5.76 CPU cores, 23.7% done. ETA 02 Jan 07:
p=587713555036130687, 46.85M p/sec, 5.76 CPU cores, 35.5% done. ETA 02 Jan 07:
p=587713670440993127, 46.79M p/sec, 5.78 CPU cores, 47.0% done. ETA 02 Jan 07:
p=587713776778918211, 46.89M p/sec, 5.77 CPU cores, 57.7% done. ETA 02 Jan 07:
p=587713897588352671, 46.91M p/sec, 5.77 CPU cores, 69.8% done. ETA 02 Jan 07:
p=587714017044026473, 46.86M p/sec, 5.77 CPU cores, 81.7% done. ETA 02 Jan 07:
p=587714116490963333, 45.67M p/sec, 5.60 CPU cores, 91.6% done. ETA 02 Jan 07:48
Sieve complete: 587713200000000001 <= p < 587714200000000000 24440820358 primes tested
Elapsed time: 1806.12 sec. (3.65 init + 540.34 sieve) at 44919133 p/sec.
Processor time: 2980.34 sec. (16.85 init + 2963.49 sieve).
Seconds spent in CPU and GPU: 1262.01 (cpu), 652.69 (gpu)
Percent of time spent in CPU vs. GPU: 0.66 (cpu), 0.34 (gpu)
CPU/GPU utilization: 0.17 (cores), 0.09 (devices)
WSS
>wwwwcl64.exe -v -p 235389880000000000 -P 235389890000000000 -T WallSunSun
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is an Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing, version OpenCL 2.0 AMD-APP (1642.5)
Device 0 is an Advanced Micro Devices, Inc. Tahiti
workGroupSize = 8388608 = 4096 * 64 * 32 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 2 threads
Allocated memory (prior to sieving): 224 MB in CPU, 224 MB in GPU
Sieve started: (cmdline) 235389880000000000 <= p < 235389890000000000
Sieve complete: 235389880000000001 <= p < 235389890000000000 249992251 primes tested
Clock time: 36.67 seconds at at 6816775 p/sec.
Processor time: 21.37 sec. (3.68 init + 17.69 sieve).
Seconds spent in CPU and GPU: 0.87 (cpu), 57.98 (gpu)
Percent of time spent in CPU vs. GPU: 1.48 (cpu), 98.52 (gpu)
CPU/GPU utilization: 0.58 (cores), 1.00 (devices)
Percent of GPU time waiting for GPU: 29.04
>wwwwcl64.exe -v -p 235389880000000000 -P 235389980000000000 -T WallSunSun
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is an Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing, version OpenCL 2.0 AMD-APP (1642.5)
Device 0 is an Advanced Micro Devices, Inc. Tahiti
workGroupSize = 8388608 = 4096 * 64 * 32 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 2 threads
Allocated memory (prior to sieving): 224 MB in CPU, 224 MB in GPU
Sieve started: (cmdline) 235389880000000000 <= p < 235389980000000000
p=235389888890408723, 7.283M p/sec, 0.53 CPU cores, 8.9% done. ETA 02 Jan 07:3
p=235389897950144033, 7.413M p/sec, 0.52 CPU cores, 18.0% done. ETA 02 Jan 07:
p=235389957013068167, 7.411M p/sec, 0.51 CPU cores, 77.0% done. ETA 02 Jan 07:
p=235389966069204251, 7.311M p/sec, 0.52 CPU cores, 86.1% done. ETA 02 Jan 07:
p=235389974961156503, 7.245M p/sec, 0.52 CPU cores, 95.0% done. ETA 02 Jan 07:25
Sieve complete: 235389880000000001 <= p < 235389980000000000 2499971252 primes tested
Clock time: 343.41 seconds at at 7279915 p/sec.
Processor time: 179.31 sec. (3.68 init + 175.63 sieve).
Seconds spent in CPU and GPU: 5.81 (cpu), 587.34 (gpu)
Percent of time spent in CPU vs. GPU: 0.98 (cpu), 99.02 (gpu)
CPU/GPU utilization: 0.52 (cores), 1.00 (devices)
Percent of GPU time waiting for GPU: 35.52
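The 9.68x/9.36x ratios quoted at the top of this post can be recomputed directly from the outputs above:

```python
# WFS v2.1.9: stopwatch = init + sieve (from the "Elapsed time" lines above).
wfs_1x = 3.69 + 52.49
wfs_10x = 3.65 + 540.34
# WSS v2.2.5: stopwatch = "Clock time".
wss_1x, wss_10x = 36.67, 343.41

print(f"WFS: {wfs_10x / wfs_1x:.2f}x, WSS: {wss_10x / wss_1x:.2f}x")
# WFS: 9.68x, WSS: 9.36x
```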
|
Roger (Volunteer developer, Volunteer tester)
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,621,444 RAC: 0
I figured out why the checksum is zero, for WSS at least.
It's not saving the A value in the kernel when it's not a Special Result.
In wallsunsun_kernel.h, after:
"result[gid] = 0;\n" \
just add the line:
"quot[gid] = c21;\n" \
The A value will then always become available in ii_QuotientList[].
The Checksum is currently just a simple addition, not an XOR:
il_CheckSum += ii_QuotientList[ii];
To print out the Checksum, just add this line to WallSunSun.cpp ChildTestRange() after the for loop:
ip_WWWW->ReportSpecial("Final Checksum: %016llx", il_CheckSum);
It's not getting as far as WWWW.cpp LogStats() for some reason, but that should be easy to debug.
Example output:
>wwwwcl64.exe -v -p 1217727803528000 -P 1217727803529000 -T WallSunSun
wwwwcl v2.2.5, a GPU program to search for Wieferich and WallSunSun primes
Platform 0 is an Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing, version OpenCL 2.0 AMD-APP (1642.5)
Device 0 is an Advanced Micro Devices, Inc. Tahiti
workGroupSize = 8388608 = 4096 * 64 * 32 (blocks * workGroupSizeMultiple * deviceComputeUnits)
Running with 2 threads
Allocated memory (prior to sieving): 224 MB in CPU, 224 MB in GPU
Sieve started: (cmdline) 1217727803528000 <= p < 1217727803529000
1217727803528521 is a special instance (+0 -49 p)
Final Checksum: ffffffffffffa5f7
Sieve complete: 1217727803528001 <= p < 1217727803529000 31 primes tested
Clock time: 1.90 seconds at at 16 p/sec.
Processor time: 0.80 sec. (0.78 init + 0.02 sieve).
Seconds spent in CPU and GPU: 0.56 (cpu), 0.96 (gpu)
Percent of time spent in CPU vs. GPU: 36.97 (cpu), 63.03 (gpu)
CPU/GPU utilization: 0.42 (cores), 0.50 (devices)
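To make the output above concrete, the existing checksum behaves like a 64-bit wrapping sum of the signed A values over the range, which is why a range whose A values sum to a negative number prints as a large hex value (this is a model of my reading of the code, not the wwww source itself):

```python
# Model of the additive checksum: 64-bit wrapping sum of signed A values.
MASK = (1 << 64) - 1

def checksum(a_values):
    total = 0
    for a in a_values:
        total = (total + a) & MASK  # wrap like a uint64_t accumulator
    return total

# A lone A = -49, like the special instance above, would print as:
print(f"{checksum([-49]):016x}")  # ffffffffffffffcf
```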
Roger (Volunteer developer, Volunteer tester)
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,621,444 RAC: 0
OK, I got the WWWW.cpp LogStats() working too. The Checksum is reported by the threads to the main app in a call to WriteCheckPoint(), and that is not currently called unless the sieve is interrupted. To fix it, just add a call to WriteCheckpoint() before printing out the "Sieve complete" message in App.cpp Finish(). That, plus the additional line in wallsunsun_kernel.h, is all you need to get the Checksum to report in WSS.
Similarly, quot[gid] is not being set in wieferich_kernel.h unless it is a Special Result.