Source code of Genefer for OpenCL is available.
Author |
Message |
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
The source code of a variant of Genefer for OpenCL is available on assembla:
[...]\branches\yves\2013\OclGenefer and [...]\branches\yves\2013\Common
I tested it with the NVidia SDK on Fermi and Kepler GPUs, and with the Intel SDK for OpenCL running on the CPU, because Intel HD Graphics has no FP64. I have no ATI card, but the code doesn't depend on any external library, so it should run on ATI cards.
It is about as fast as GeneferCUDA on my computers, but the algorithm is different, so the speed ratio depends on the hardware and the exponent.
It is not finished: the error is not computed and many improvements are still possible.
The source code is "standard" C++, so you can compile it with Visual Studio 2012 (Express) or gcc, using the NVidia, ATI or Intel SDK for OpenCL.
Yves
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Here are some benchmarks on NVidia: this version doesn't like the Fermi architecture :-(
GTX 680 (GK 104)
Genefer CUDA
538452^1048576+1 Time: 2.5 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 4.76 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 9.27 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 20.5 ms/mul. Err: 0.1797 45879398 digits
Genefer OpenCL
538452^1048576+1 Time: 3.11 ms/mul. 6009544 digits
440400^2097152+1 Time: 6.7 ms/mul. 11836006 digits
360204^4194304+1 Time: 12.9 ms/mul. 23305854 digits
294612^8388608+1 Time: 26.8 ms/mul. 45879398 digits
------------------------------------------
GTX 660 (GK 106)
Genefer CUDA
538452^1048576+1 Time: 3.99 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 7.87 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 15.9 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 35.1 ms/mul. Err: 0.1797 45879398 digits
Genefer OpenCL
538452^1048576+1 Time: 4.82 ms/mul. 6009544 digits
440400^2097152+1 Time: 10.6 ms/mul. 11836006 digits
360204^4194304+1 Time: 20.7 ms/mul. 23305854 digits
294612^8388608+1 Time: 43 ms/mul. 45879398 digits
------------------------------------------
GTX 580 (GF 110)
Genefer CUDA
538452^1048576+1 Time: 1.72 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 3.28 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 6.8 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 14.1 ms/mul. Err: 0.1797 45879398 digits
Genefer OpenCL
538452^1048576+1 Time: 2.83 ms/mul. 6009544 digits
440400^2097152+1 Time: 6.17 ms/mul. 11836006 digits
360204^4194304+1 Time: 15.5 ms/mul. 23305854 digits
294612^8388608+1 Time: 38.9 ms/mul. 45879398 digits
| |
|
|
Very necessary tests on Tahiti. | |
|
|
Tried compiling it but ended up with the same errors as with the other GeneferCUDA version; maybe a problem with OpenCL.lib. | |
|
|
App crashed with "cannot create program".
Added current status to my dropbox. If someone is interested tell me. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
You can compile it with cl.h and OpenCL.lib from any OpenCL SDK (Intel, Nvidia or ATI).
The executable will load the OpenCL.dll installed by the graphics driver(s).
On my laptop, with a HD 4000 and a GeForce GT 740M, any OpenCL binary can run on Intel GPU and on NVidia GPU. I don't have to generate one with Intel SDK and another one with NVidia SDK.
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
App crashed with "cannot create program".
Added current status to my dropbox. If someone is interested tell me.
It doesn't find "Genefer.cl". | |
|
|
App crashed with "cannot create program".
Added current status to my dropbox. If someone is interested tell me.
It doesn't find "Genefer.cl".
Oh, thx, my fault, forgot to add it.
Test results on Asus HD7950:
Platform 'AMD Accelerated Parallel Processing': 1 GPU device(s) found.
Platform 'AMD Accelerated Parallel Processing': 1 CPU device(s) found.
"C:\Users\user\AppData\Local\Temp\OCLBA77.tmp.cl", line 12: warning: OpenCL
extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^
Running on platform 'AMD Accelerated Parallel Processing' and device 'Tahiti'.
2199064^8192+1 Time: 99.6 us/mul. 51956 digits
1798620^16384+1 Time: 95.8 us/mul. 102481 digits
1471094^32768+1 Time: 98 us/mul. 202102 digits
1203210^65536+1 Time: 122 us/mul. 398482 digits
984108^131072+1 Time: 182 us/mul. 785521 digits
804904^262144+1 Time: 385 us/mul. 1548156 digits
658332^524288+1 Time: 916 us/mul. 3050541 digits
538452^1048576+1 Time: 2.01 ms/mul. 6009544 digits
440400^2097152+1 Time: 4.53 ms/mul. 11836006 digits
360204^4194304+1 Time: 12 ms/mul. 23305854 digits
294612^8388608+1 Time: 31.1 ms/mul. 45879398 digits
30^32+1 is a probable prime. (0.8 sec., err = 0.00e+000)
20000066^32+1 is a probable prime. (0.8 sec., err = 0.00e+000)
102^64+1 is a probable prime. (0.8 sec., err = 0.00e+000)
15000250^64+1 is a probable prime. (0.9 sec., err = 0.00e+000)
120^128+1 is a probable prime. (0.8 sec., err = 0.00e+000)
10000038^128+1 is a probable prime. (0.9 sec., err = 0.00e+000)
278^256+1 is a probable prime. (0.9 sec., err = 0.00e+000)
5684328^256+1 is a probable prime. (1.1 sec., err = 0.00e+000)
46^512+1 is a probable prime. (1.0 sec., err = 0.00e+000)
4619000^512+1 is a probable prime. (1.5 sec., err = 0.00e+000)
824^1024+1 is a probable prime. (1.4 sec., err = 0.00e+000)
3752220^1024+1 is a probable prime. (2.1 sec., err = 0.00e+000)
150^2048+1 is a probable prime. (1.8 sec., err = 0.00e+000)
3066672^2048+1 is a probable prime. (3.7 sec., err = 0.00e+000)
1534^4096+1 is a probable prime. (4.0 sec., err = 0.00e+000)
2485064^4096+1 is a probable prime. (7.3 sec., err = 0.00e+000)
30406^8192+1 is a probable prime. (12.2 sec., err = 0.00e+000)
2030234^8192+1 is a probable prime. (17.0 sec., err = 0.00e+000)
67234^16384+1 is a probable prime. (25.7 sec., err = 0.00e+000)
1651902^16384+1 is a probable prime. (32.7 sec., err = 0.00e+000)
70906^32768+1 is a probable prime. (50.5 sec., err = 0.00e+000)
1277444^32768+1 is a probable prime. (66.1 sec., err = 0.00e+000)
Aborted the rest; it was taking too long, but it's working.
What are command lines for the app? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
What are command lines for the app?
There are no command line options for this program. What you see is all it can do.
Running a predefined list of known primes through the algorithm is all that program does. It's not intended to be a complete application. Like Shoichiro's OpenCL code, it's the code for an algorithm that's intended to be plugged into our existing Genefer framework. It's not intended to be useful on its own.
After 18 months of wishing we had an OpenCL implementation, we now have two!
Which to use? Yves' is more portable as it does not require an external FFT library. Yves wrote his own transforms in portable C++, so it will run on any platform. Shoichiro used AMD's FFT library, which Iain says isn't available on Mac. It would be interesting to see whether Yves' transform is faster than the AMD libraries.
I had relatively little difficulty building Yves' application with VS2012 Express as an x64 app using the CUDA 3.2 toolkit. Run times on my GTX 460 are slightly more than twice as long as with the CUDA version.
____________
My lucky number is 75898^524288+1 | |
|
|
What are command lines for the app?
There are no command line options for this program. What you see is all it can do.
Running a predefined list of known primes through the algorithm is all that program does. It's not intended to be a complete application. Like Shoichiro's OpenCL code, it's the code for an algorithm that's intended to be plugged into our existing Genefer framework. It's not intended to be useful on its own.
After 18 months of wishing we had an OpenCL implementation, we now have two!
Which to use? Yves' is more portable as it does not require an external FFT library. Yves wrote his own transforms in portable C++, so it will run on any platform. Shoichiro used AMD's FFT library, which Iain says isn't available on Mac. It would be interesting to see whether Yves' transform is faster than the AMD libraries.
I had relatively little difficulty building Yves' application with VS2012 Express as an x64 app using the CUDA 3.2 toolkit. Run times on my GTX 460 are slightly more than twice as long as with the CUDA version.
You are right, Yves' version looks much better since it only uses 2 files. But you can split CUDA and OpenCL between Nvidia and ATI cards. I hope OpenCL on these high-end ATI cards can be much faster than CUDA. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
What are command lines for the app?
Sorry but it is just a first release for tests.
I will update it and add the command line option "-q".
Test results on Asus HD7950: [...]
HD7950 is about as fast as a GTX 680. It seems in accordance with other benchmarks.
I had relatively little difficulty building Yves' application with VS2012 Express as an x64 app.
Note that a win32 app is as fast as an x64 one.
I have been playing with OpenCL for only two months, so there are still many variations to test and a large speed improvement is expected!
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
What are command lines for the app?
I just added the option -q "b^N+1". | |
|
rogue Volunteer developer
 Send message
Joined: 8 Sep 07 Posts: 1255 ID: 12001 Credit: 18,565,548 RAC: 0
 
|
Hello Yves, it has been a while. You might recall me from helping get the original genefer running on PowerMac. I wrote the checkpointing code that the various flavors of genefer use today.
I'm excited to see you back in action on this. I've been waiting for someone to port this to OpenCL so that I can run it on my Mac Pro as it has an ATI card.
Now if you could only port a command line version of Proth to OpenCL... :-)
To get this to build on Mac, I had to make this change:
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif
in OclGenefer.cpp.
Also, to use clang (Apple's newer incarnation of gcc) I had to make this change:
virtual ~ITransform() {};
although llvm (Apple's older incarnation of gcc) allowed the original code:
virtual ~ITransform() = 0 {};
and the updated version.
Unfortunately, I am running into this problem with clang:
OclGenefer.cpp:176:29: error: no matching conversion for functional-style cast from 'std::ifstream' (aka 'basic_ifstream<char>') to 'std::istreambuf_iterator<char>'
const std::string source((std::istreambuf_iterator<char>(std::ifstream("Genefer.cl", std::ifstream::in))), std::istreambuf_iterator<char>());
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/4.2.1/bits/streambuf_iterator.h:48:11: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'std::ifstream' (aka 'basic_ifstream<char>') to 'const std::istreambuf_iterator<char, std::char_traits<char> >' for 1st argument
class istreambuf_iterator
^
/usr/include/c++/4.2.1/bits/streambuf_iterator.h:98:7: note: candidate constructor not viable: no known conversion from 'std::ifstream' (aka 'basic_ifstream<char>') to 'istream_type &' (aka 'basic_istream<char, std::char_traits<char> > &') for 1st argument
istreambuf_iterator(istream_type& __s) throw()
^
/usr/include/c++/4.2.1/bits/streambuf_iterator.h:102:7: note: candidate constructor not viable: no known conversion from 'std::ifstream' (aka 'basic_ifstream<char>') to 'streambuf_type *' (aka 'basic_streambuf<char, std::char_traits<char> > *') for 1st argument
istreambuf_iterator(streambuf_type* __s) throw()
^
/usr/include/c++/4.2.1/bits/streambuf_iterator.h:94:7: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
istreambuf_iterator() throw()
llvm doesn't give as useful a message:
OclGenefer.cpp: In constructor ‘Program::Program(bool)’:
OclGenefer.cpp:176: error: invalid conversion from ‘void*’ to ‘std::basic_streambuf<char, std::char_traits<char> >*’
OclGenefer.cpp:176: error: initializing argument 1 of ‘std::istreambuf_iterator<_CharT, _Traits>::istreambuf_iterator(std::basic_streambuf<_CharT, _Traits>*) [with _CharT = char, _Traits = std::char_traits<char>]’
I'm not good enough with C++ streams to fix this. | |
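The errors above come from passing a temporary `std::ifstream` to `std::istreambuf_iterator`, whose constructor takes a non-const reference to a stream; a temporary cannot bind to that in standard C++. A minimal sketch of one way around it, assuming a named stream is acceptable (the helper name `readTextFile` is hypothetical and is not taken from the Genefer sources):

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not from the Genefer sources): read a whole text
// file into a std::string.
std::string readTextFile(const std::string& path)
{
    // A named stream is an lvalue, so it can bind to the
    // istreambuf_iterator(istream_type&) constructor.
    std::ifstream file(path.c_str(), std::ios::in | std::ios::binary);
    if (!file) return std::string();   // empty string when the file cannot be opened

    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

With something like this, the kernel source could be obtained with `readTextFile("Genefer.cl")`, and an empty result would also cover the case where the file is not next to the executable.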
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Test result with HD7970Ghz using Rebirther's Dropbox genefer.exe:
Was using 18% of CPU (X6 1100T) and up to 99% of GPU. The high use of GPU is too aggressive and program does not finish long runs.
Not shown in screen shot: 572186^131072+1 took 3424.5 seconds and 2418^262144+1 took 8000+ seconds. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Hi Mark,
Also, to use clang (Apple's newer incarnation of gcc) I had to make this change:
virtual ~ITransform() {};
although llvm (Apple's older incarnation of gcc) allowed the original code:
virtual ~ITransform() = 0 {};
and the updated version.
Is this a compiler bug with pure virtual destructors!?
I removed polymorphism; it is useless in this version... but that will not work with the full version of Genefer, so maybe you could report the bug.
I also removed the STL calls (and with them the C++ streams).
You can download the latest version.
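For reference, standard C++ does not allow a body on the same declaration as the pure specifier, so `= 0 {}` is only accepted by compilers that offer it as an extension (Visual C++ does, as far as I know). A minimal sketch of the portable form, assuming a reduced interface; the `size()` member and `DummyTransform` are hypothetical and only there to make the example complete:

```cpp
// Hypothetical reduction of the interface discussed above; the real
// ITransform in Genefer has different members.
struct ITransform
{
    virtual ~ITransform() = 0;      // pure: the class stays abstract
    virtual int size() const = 0;   // illustrative member, an assumption
};

// The destructor body is supplied separately; derived destructors call it.
inline ITransform::~ITransform() {}

struct DummyTransform : ITransform
{
    int size() const { return 4; }
};
```

This keeps the polymorphic interface that the full Genefer application needs while staying within the standard.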
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Test result with HD7970Ghz using Rebirther's Dropbox genefer.exe:
Thanks for the results.
I don't understand the second warning with FP_CONTRACT.
This pragma is defined with OpenCL 1.0, 1.1 and 1.2.
Rebirther doesn't have this warning on his computer.
Something else that I don't understand:
If 1203210^65536+1 runs at 164 us/mul. and 984108^131072+1 at 236 us/mul., and
48594^65536+1 takes 172 sec., then we expect
62722^131072+1 ~ 172 * 236/164 * 2 ~ 500 sec.
1500 sec is too slow.
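The estimate above can be written as a small function. This is only a sketch of the arithmetic in the post; the function name is hypothetical, and the factor 2 is taken as given (presumably it stands for the roughly doubled number of squarings when N doubles and the bases are of similar size):

```cpp
// Estimate the run time of a second test from a measured first one.
// secondsA:   measured run time of the first test, in seconds
// perMulA/B:  benchmark time per multiplication at the two sizes (same unit)
// stepsRatio: ratio of the number of multiplications (2 in the post above)
double expectedSeconds(double secondsA, double perMulA, double perMulB, double stepsRatio)
{
    return secondsA * (perMulB / perMulA) * stepsRatio;
}
```

`expectedSeconds(172, 164, 236, 2)` gives about 495 seconds, which is the "~ 500 sec" quoted above.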
Was using 18% of CPU (X6 1100T) and up to 99% of GPU.
This should be 0% of CPU and up to 99% of GPU. Why is the CPU running?
| |
|
|
The latest app with result, Yves you need a version number...
OclGenefer 2013, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': 1 GPU device(s) found.
Platform 'AMD Accelerated Parallel Processing': 1 CPU device(s) found.
Running on platform 'AMD Accelerated Parallel Processing' and device 'Tahiti'.
2199064^8192+1 Time: 98.9 us/mul. 51956 digits
1798620^16384+1 Time: 96.8 us/mul. 102481 digits
1471094^32768+1 Time: 99.2 us/mul. 202102 digits
1203210^65536+1 Time: 123 us/mul. 398482 digits
984108^131072+1 Time: 177 us/mul. 785521 digits
804904^262144+1 Time: 386 us/mul. 1548156 digits
658332^524288+1 Time: 910 us/mul. 3050541 digits
538452^1048576+1 Time: 1.98 ms/mul. 6009544 digits
440400^2097152+1 Time: 4.53 ms/mul. 11836006 digits
360204^4194304+1 Time: 12.3 ms/mul. 23305854 digits
294612^8388608+1 Time: 31.3 ms/mul. 45879398 digits
30^32+1 is a probable prime. (0.8 sec., err = 0.00e+000)
20000066^32+1 is a probable prime. (0.9 sec., err = 0.00e+000)
102^64+1 is a probable prime. (0.8 sec., err = 0.00e+000)
15000250^64+1 is a probable prime. (0.9 sec., err = 0.00e+000)
120^128+1 is a probable prime. (0.8 sec., err = 0.00e+000)
10000038^128+1 is a probable prime. (1.0 sec., err = 0.00e+000)
278^256+1 is a probable prime. (0.9 sec., err = 0.00e+000)
5684328^256+1 is a probable prime. (1.1 sec., err = 0.00e+000)
46^512+1 is a probable prime. (1.0 sec., err = 0.00e+000)
4619000^512+1 is a probable prime. (1.4 sec., err = 0.00e+000)
824^1024+1 is a probable prime. (1.4 sec., err = 0.00e+000)
3752220^1024+1 is a probable prime. (2.1 sec., err = 0.00e+000)
150^2048+1 is a probable prime. (1.8 sec., err = 0.00e+000)... | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
The latest app with result, Yves you need a version number...
He doesn't need a version number because that's not an app. It's just some code that will be included in the real app at some point. It's not released code.
____________
My lucky number is 75898^524288+1 | |
|
|
HD7750 with Ubuntu
$ ./Genefer
OclGenefer 2013, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': 1 GPU device(s) found.
Platform 'AMD Accelerated Parallel Processing': 1 CPU device(s) found.
"/tmp/OCLYaqLaz.cl", line 12: warning: OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^
Running on platform 'AMD Accelerated Parallel Processing' and device 'Capeverde'.
2199064^8192+1 Time: 122 us/mul. 51956 digits
1798620^16384+1 Time: 183 us/mul. 102481 digits
1471094^32768+1 Time: 122 us/mul. 202102 digits
1203210^65536+1 Time: 244 us/mul. 398482 digits
984108^131072+1 Time: 488 us/mul. 785521 digits
804904^262144+1 Time: 977 us/mul. 1548156 digits
658332^524288+1 Time: 3.91 ms/mul. 3050541 digits
538452^1048576+1 Time: 7.81 ms/mul. 6009544 digits
440400^2097152+1 Time: 15.6 ms/mul. 11836006 digits
360204^4194304+1 Time: 31.2 ms/mul. 23305854 digits
294612^8388608+1 Time: 125 ms/mul. 45879398 digits
30^32+1 is a probable prime. (1.0 sec., err = 0.00e+00)
20000066^32+1 is a probable prime. (1.0 sec., err = 0.00e+00)
102^64+1 is a probable prime. (1.0 sec., err = 0.00e+00)
15000250^64+1 is a probable prime. (1.0 sec., err = 0.00e+00)
120^128+1 is a probable prime. (2.0 sec., err = 0.00e+00)
10000038^128+1 is a probable prime. (1.0 sec., err = 0.00e+00)
278^256+1 is a probable prime. (2.0 sec., err = 0.00e+00)
5684328^256+1 is a probable prime. (1.0 sec., err = 0.00e+00)
46^512+1 is a probable prime. (2.0 sec., err = 0.00e+00)
4619000^512+1 is a probable prime. (2.0 sec., err = 0.00e+00)
824^1024+1 is a probable prime. (2.0 sec., err = 0.00e+00)
3752220^1024+1 is a probable prime. (4.0 sec., err = 0.00e+00)
150^2048+1 is a probable prime. (2.0 sec., err = 0.00e+00)
3066672^2048+1 is a probable prime. (6.0 sec., err = 0.00e+00)
1534^4096+1 is a probable prime. (8.0 sec., err = 0.00e+00)
2485064^4096+1 is a probable prime. (13.0 sec., err = 0.00e+00)
30406^8192+1 is a probable prime. (20.0 sec., err = 0.00e+00)
2030234^8192+1 is a probable prime. (27.0 sec., err = 0.00e+00)
67234^16384+1 is a probable prime. (42.0 sec., err = 0.00e+00)
1651902^16384+1 is a probable prime. (54.0 sec., err = 0.00e+00)
70906^32768+1 is a probable prime. (107.0 sec., err = 0.00e+00)
1277444^32768+1 is a probable prime. (133.0 sec., err = 0.00e+00)
48594^65536+1 is a probable prime. (327.0 sec., err = 0.00e+00)
857678^65536+1 is a probable prime. (407.0 sec., err = 0.00e+00)
62722^131072+1 is a probable prime. (1342.0 sec., err = 0.00e+00)
572186^131072+1 is a probable prime. (1608.0 sec., err = 0.00e+00)
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Test result with HD7970Ghz using Rebirther's Dropbox genefer.exe [...]
There is clearly a problem with this card if we compare to the HD7750 results :
HD7750
1203210^65536+1 Time: 244 us/mul.
984108^131072+1 Time: 488 us/mul.
48594^65536+1 327 sec... OK
62722^131072+1 1342 sec... OK
HD7970Ghz
1203210^65536+1 Time: 164 us/mul.
984108^131072+1 Time: 236 us/mul
48594^65536+1 172 sec... OK
62722^131072+1 1509 sec... NOK
Is the graphics driver up to date? | |
|
rogue Volunteer developer
 Send message
Joined: 8 Sep 07 Posts: 1255 ID: 12001 Credit: 18,565,548 RAC: 0
 
|
Ugh! The Radeon HD 5870 only supports cl_amd_fp64, not cl_khr_fp64. I can't run the code on my MacPro. When it tries to use the CPU, I get this error:
inline Complex _MulWeightOut(const Complex z, const Complex aw, const double nInv, const double g)
^
Running on platform 'Apple' and device 'Intel(R) Xeon(R) CPU W3530 @ 2.80GHz'.
Error detected on GPU device.
I don't expect it to be fast on the CPU though, but I wouldn't expect this error either. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Running Catalyst 12.8. That might explain the extra pragma warning and the test result. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
HD7970Ghz test result with new version:
Same 18% CPU and 99% GPU usage. Catalyst 12.8. I'll update to 13.4 and run again. Not shown is 572186^131072+1, which took 5985.4 sec. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Ugh! The Radeon HD 5870 only supports cl_amd_fp64, not cl_khr_fp64. I can't run the code on my MacPro.
It should run with cl_amd_fp64 (I think: I can't find the cl_amd_fp64 instruction set).
But only 5 operations are called: +, -, *, fma, rint.
There was an error in the logic of the defines: cl_amd_fp64 was not set with OpenCL 1.2. I fixed this. You can download the latest release.
I removed "_MulWeightOut"; it was unreachable code.
| |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Tested HD7970Ghz with Catalyst 13.4. Behavior is different. Now uses around 6% of CPU but still 99% of GPU.
No #pragma warning, faster times, but maybe a memory leak when testing 676754^262144+1?
In this screen shot you can see memory use maxed out at 6GB, is not using any GPU, so has gone zombie:
I tested -q "676754^262144+1", -q "75898^524288+1" and -q "475856^524288+1" and all had memory usage that quickly went crazy after only seconds, as shown by Task Manager.
Note this test was straight after updating to Catalyst 13.4.
I reran the test after a reboot and got some slightly better times for high N and slightly worse times for low N. | |
|
rogue Volunteer developer
 Send message
Joined: 8 Sep 07 Posts: 1255 ID: 12001 Credit: 18,565,548 RAC: 0
 
|
Ugh! The Radeon HD 5870 only supports cl_amd_fp64, not cl_khr_fp64. I can't run the code on my MacPro.
It should run with cl_amd_fp64 (I think: I can't find the cl_amd_fp64 instruction set).
But only 5 operations are called: +, -, *, fma, rint.
There was an error in the logic of the defines: cl_amd_fp64 was not set with OpenCL 1.2. I fixed this. You can download the latest release.
I removed "_MulWeightOut"; it was unreachable code.
That only solves part of the problem. This code:
clGetDeviceInfo(devices[d], CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, sizeof(cl_uint), &dWidth, NULL);
returns a dWidth of 0 on the HD 5870, thus the device cannot be chosen. | |
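As far as I know, the OpenCL specification requires CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE to be 0 when cl_khr_fp64 is not supported, so a device that only exposes cl_amd_fp64 would be rejected by a test on dWidth. One possible alternative is to look at the extension list instead. The sketch below assumes the space-separated string that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS has already been fetched; both function names are hypothetical and this is not the check used in OclGenefer:

```cpp
#include <sstream>
#include <string>

// True if the space-separated extension list (the format returned for
// CL_DEVICE_EXTENSIONS) contains exactly the token `name`.
bool hasExtension(const std::string& extensions, const std::string& name)
{
    std::istringstream tokens(extensions);
    std::string token;
    while (tokens >> token)
        if (token == name) return true;
    return false;
}

// Hypothetical selection rule: accept either double-precision extension.
bool supportsFp64(const std::string& extensions)
{
    return hasExtension(extensions, "cl_khr_fp64")
        || hasExtension(extensions, "cl_amd_fp64");
}
```

Matching whole tokens rather than substrings avoids accepting an unrelated extension whose name merely starts with the same characters.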
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Tested HD7970Ghz with Catalyst 13.4. Behavior is different. Now uses around 6% of CPU but still 99% of GPU.
No #pragma warning, faster times, but maybe a memory leak when testing 676754^262144+1?
In this screen shot you can see memory use maxed out at 6GB, is not using any GPU, so has gone zombie:
I tested -q "676754^262144+1", -q "75898^524288+1" and -q "475856^524288+1" and all had memory usage that quickly went crazy after only seconds, as shown by Task Manager.
On my computer, OclGenefer -q "475856^524288+1"
=> CPU Memory = 29.7 MB (the program allocates memory when it starts but not during the computation),
=> CPU load = 12.5 % (a bug of NVidia driver: https://forums.geforce.com/default/topic/543115/opencl-driver-support-for-fah/)
=> GPU Load = 99 %.
I don't understand... can someone else check this with an ATI card? | |
|
|
Tested HD7970Ghz with Catalyst 13.4. Behavior is different. Now uses around 6% of CPU but still 99% of GPU.
No #pragma warning, faster times, but maybe a memory leak when testing 676754^262144+1?
In this screen shot you can see memory use maxed out at 6GB, is not using any GPU, so has gone zombie:
I tested -q "676754^262144+1", -q "75898^524288+1" and -q "475856^524288+1" and all had memory usage that quickly went crazy after only seconds, as shown by Task Manager.
On my computer, OclGenefer -q "475856^524288+1"
=> CPU Memory = 29.7 MB (the program allocates memory when it starts but not during the computation),
=> CPU load = 12.5 % (a bug of NVidia driver: https://forums.geforce.com/default/topic/543115/opencl-driver-support-for-fah/)
=> GPU Load = 99 %.
I don't understand... can someone else check this with an ATI card?
Tested and confirmed. The program took all my 16GB memory and cpu was at full usage (memory leak). | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Tested HD7970Ghz [...] had memory usage that quickly went crazy after only seconds.
Tested and confirmed. The program took all my 16GB memory and cpu was at full usage (memory leak).
I've got it! The same problem occurred with Intel's driver.
I enqueued all OpenCL commands with a single clFinish (which is a synchronization point) at the end. The Intel and ATI drivers build a huge stack holding all the commands!
You can download the latest version (OclGenefer 2013-08-03). It enqueues 1024 commands, then the CPU waits for the GPU, and so on.
This solves the problem with Intel's driver. | |
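The effect of batching can be illustrated without any OpenCL dependency. The sketch below is a simulation only: `MockQueue` is a hypothetical stand-in for a command queue that merely counts commands enqueued but not yet waited for, and `finish()` plays the role of clFinish. It is not the code from OclGenefer, and real drivers may buffer commands differently.

```cpp
#include <algorithm>
#include <cstddef>

// Stand-in for an OpenCL command queue (an assumption for illustration):
// it only counts commands that were enqueued but not yet waited for.
struct MockQueue
{
    std::size_t pending;
    std::size_t maxPending;
    MockQueue() : pending(0), maxPending(0) {}

    void enqueue() { ++pending; maxPending = std::max(maxPending, pending); }
    void finish()  { pending = 0; }   // plays the role of clFinish
};

// Enqueue `total` commands, synchronizing after every `batch` of them
// (batch must be greater than 0). Returns the largest backlog reached.
std::size_t runBatched(std::size_t total, std::size_t batch)
{
    MockQueue q;
    for (std::size_t i = 0; i < total; ++i)
    {
        q.enqueue();
        if ((i + 1) % batch == 0) q.finish();
    }
    q.finish();   // final synchronization point
    return q.maxPending;
}
```

In this model, a single synchronization at the end lets the backlog grow with the total number of commands, while a batch of 1024 caps it at 1024, which is consistent with the memory usage going back to normal.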
|
|
Tested HD7970Ghz [...] had memory usage that quickly went crazy after only seconds.
Tested and confirmed. The program took all my 16GB memory and cpu was at full usage (memory leak).
I've got it! The same problem occurred with Intel's driver.
I enqueued all OpenCL commands with a single clFinish (which is a synchronization point) at the end. The Intel and ATI drivers build a huge stack holding all the commands!
You can download the latest version (OclGenefer 2013-08-03). It enqueues 1024 commands, then the CPU waits for the GPU, and so on.
This solves the problem with Intel's driver.
Looks good on ATI now. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Bug is fixed, 676754^262144+1 test uses 39.0MB memory, only 2% of CPU and 97% of GPU. This is a much better combination, 99% GPU makes the computer less responsive.
^524288 tests use 1.0% of CPU, 43.9MB and 98% GPU. Usage depends on exponent being tested. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Full test run on HD7970Ghz with Catalyst 13.4:
Is all sweetness and light. What's the next step for this algorithm?
Great work Yves! | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Full test run on HD7970Ghz with Catalyst 13.4.
[...] What's the next step for this algorithm?
Thanks, it works fine for a first release.
The next steps are:
1- To improve performance. On one hand, we have Folding@Home FP64 benchmark:
GTX Titan 17.9
HD7990 17.1
HD7970GHz 10.1
GTX 780 8.7
GTX 580 8.3
GTX 680 5.5
and on the other hand GeneferOpenCL and GeneferCUDA comparison:
GTX580
GNF-OCL 2^20 2.8 ms/mul
GNF-OCL 2^22 15.5 ms/mul
GNF-CUDA 2^20 1.7 ms/mul
GNF-CUDA 2^22 6.8 ms/mul
GTX680
GNF-OCL 2^20 3.1 ms/mul
GNF-OCL 2^22 12.9 ms/mul
GNF-CUDA 2^20 2.5 ms/mul
GNF-CUDA 2^22 9.3 ms/mul
If GeneferOpenCL is as fast as GeneferCUDA then the expected timing on a HD7970GHz is
GNF-OCL 2^20 2 ms/mul -> 1.5 ms/mul
GNF-OCL 2^22 12.4 ms/mul -> 5.6 ms/mul
It is a reasonable target because many optimizations have not been tested.
2- To plug the transform into the Genefer application. This can be done quickly because the interface of OclTransform is based on AVXTransform.
I'm finishing the tests of some "simple" optimizations and then I will create a full GeneferOpenCL application. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
My benchmark was incorrect.
That's good news, because OclGenefer runs faster than it was displayed!
During the benchmark (which is similar to the Genefer code), 24 loops are computed (to initialize memory and caches) and then the true benchmark starts. But I forgot to add a synchronization point at the end of the 24 loops, so the computation of these transforms biased the estimates (especially for large n).
The error is corrected in OclGenefer 2013-08-10.
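The bias can be reproduced with a small virtual-time model. Everything in the sketch below is an assumption made for illustration: the device executes transforms one after another at a fixed cost, enqueueing is free on the host, and `finish()` advances the host clock until the device is idle. The number of timed loops is a free parameter here, since the post only gives the 24 warm-up loops.

```cpp
// Virtual-time model (an assumption, not the real benchmark): every enqueued
// transform costs `cost` units on the device, enqueueing is free on the host,
// and finish() advances the host clock until the device is idle.
struct VirtualDevice
{
    double hostClock;
    double deviceBusyUntil;
    VirtualDevice() : hostClock(0.0), deviceBusyUntil(0.0) {}

    void enqueue(double cost)
    {
        const double start = deviceBusyUntil > hostClock ? deviceBusyUntil : hostClock;
        deviceBusyUntil = start + cost;
    }
    void finish() { if (deviceBusyUntil > hostClock) hostClock = deviceBusyUntil; }
};

// Time per multiplication as seen by the host timer.
double measuredPerMul(int warmup, int timed, double cost, bool syncAfterWarmup)
{
    VirtualDevice dev;
    for (int i = 0; i < warmup; ++i) dev.enqueue(cost);
    if (syncAfterWarmup) dev.finish();   // the synchronization point that was missing

    const double t0 = dev.hostClock;     // timer starts here
    for (int i = 0; i < timed; ++i) dev.enqueue(cost);
    dev.finish();
    return (dev.hostClock - t0) / timed;
}
```

Without the synchronization point, the warm-up work is still running when the timer starts, so it is charged to the timed loops and the displayed time per multiplication is too high.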
I'm on holiday and tested the program on a GeForce GT 740M (GK107, 2 SMX, 384 shaders). OclGenefer is faster than GeneferCUDA on it.
So today, GeneferCUDA is (still :-) ) faster on Fermi, but OclGenefer is faster on Kepler with few SMX (it should be faster on GK107, i.e. GT 640 and GTX 650). I cannot extrapolate to GK104 (GTX 680) or GK110 (GTX 780 & Titan)... a benchmark on the Titan would be very interesting. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
New assembla OclGenefer.cpp revision 381 code run on HD7970Ghz:
Benchmarks for large N are definitely faster. New device initialisation information.
I guess the full probable prime tests are unaffected by the latest code change?
GFN-OCL 2^20 1.84 ms/mul
2^20 Goal=1.5 ms/mul => another 18% drop required
GFN-OCL 2^22 7.67 ms/mul
2^22 Goal=5.6 ms/mul => another 27% drop required | |
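The "drop required" figures are the relative reduction from the current time to the goal. A minimal sketch of that arithmetic (the function name is hypothetical):

```cpp
// Relative reduction, in percent, needed to go from `current` to `goal`
// (both in the same unit, e.g. ms/mul).
double requiredDropPercent(double current, double goal)
{
    return 100.0 * (current - goal) / current;
}
```

With the values above, `requiredDropPercent(1.84, 1.5)` is about 18.5 and `requiredDropPercent(7.67, 5.6)` is about 27.0.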
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
New device initialisation information.
Thanks, this information is very useful for optimization. Programming guides don't give all parameters.
On a Fermi, it is
Global mem size = 2048 MB, cache size = 32 kB (ReadWrite), cache line size = 128 Bytes.
Local mem size = 48 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 1024.
I guess the full probable prime tests are unaffected by the latest code change?
Yes, OpenCL code didn't change in this version.
GFN-OCL 2^20 1.84 ms/mul
2^20 Goal=1.5 ms/mul => another 18% drop required
GFN-OCL 2^22 7.67 ms/mul
2^22 Goal=5.6 ms/mul => another 27% drop required
And then it will run faster on a HD 7970 GHz than GeneferCUDA on a GeForce GTX Titan :o) | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
A new version is available on assembla (2013-08-12).
It is much faster than the previous one on my laptop; I hope that it is also fast on high-end and mid-range graphics cards.
Max error is computed.
Any benchmark is welcome.
If tests are ok, the next version will be a beta release of geneferOpenCL. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Latest assembla Genefer.cl and OclGenefer.cpp revision 382 code run on HD7970Ghz:
Of course I will run full test after the challenge.
GFN-OCL 2^20 1.46 ms/mul
2^20 Goal=1.5 ms/mul => Goal achieved!
GFN-OCL 2^22 5.61 ms/mul
2^22 Goal=5.6 ms/mul => Goal achieved!
Congrats!
And then it will run faster on a HD 7970 GHz than GeneferCUDA on a GeForce GTX Titan :o)
| |
|
|
HD7950
OclGenefer 2013-08-12, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': 1 GPU device(s) found.
Platform 'AMD Accelerated Parallel Processing': 1 CPU device(s) found.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Clock frequency = 900 MHz, compute units = 28.
Global mem size = 2048 MB, cache size = 16 kB (ReadWrite), cache line size = 64 Bytes.
Local mem size = 32 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 256.
2199064^8192+1 Time: 83.5 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 87 us/mul. Err: 0.2227 102481 digits
1471094^32768+1 Time: 86.4 us/mul. Err: 0.2383 202102 digits
1203210^65536+1 Time: 123 us/mul. Err: 0.2305 398482 digits
984108^131072+1 Time: 166 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 353 us/mul. Err: 0.2266 1548156 digits
658332^524288+1 Time: 770 us/mul. Err: 0.2227 3050541 digits
538452^1048576+1 Time: 1.49 ms/mul. Err: 0.2109 6009544 digits
440400^2097152+1 Time: 2.9 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 5.61 ms/mul. Err: 0.2109 23305854 digits
294612^8388608+1 Time: 11.1 ms/mul. Err: 0.2109 45879398 digits
102^64+1 is a probable prime. (0.8 sec., err = 1.46e-011)
15000250^64+1 is a probable prime. (0.8 sec., err = 0.375)
120^128+1 is a probable prime. (0.8 sec., err = 5.09e-011)
10000038^128+1 is a probable prime. (1.0 sec., err = 0.344)
278^256+1 is a probable prime. (0.9 sec., err = 3.49e-010)
5684328^256+1 is a probable prime. (1.2 sec., err = 0.164)
46^512+1 is a probable prime. (1.0 sec., err = 1.73e-011)
4619000^512+1 is a probable prime. (1.7 sec., err = 0.174)
824^1024+1 is a probable prime. (1.5 sec., err = 8.5e-009)
3752220^1024+1 is a probable prime. (2.5 sec., err = 0.188)
150^2048+1 is a probable prime. (2.0 sec., err = 4.37e-010)
Testing 3066672^2048+1...
It's faster than the old test but the error rate is much higher. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
The error was not computed in the previous release.
HD 7950 is about as fast as the HD 7970 GHz... is it an overclocked HD7950 ?
The HD 7950 is faster than a GeForce GTX Titan with full DP enabled!
The relative GPU value (genefer mark / price) is 4 / 1.
Great, I'm working on the OpenCL version of the real genefer app. | |
|
|
The error was not computed in the previous release.
HD 7950 is about as fast as the HD 7970 GHz... is it an overclocked HD7950 ?
The HD 7950 is faster than a GeForce GTX Titan with full DP enabled!
The relative GPU value (genefer mark / price) is 4 / 1.
Great, I'm working on the OpenCL version of the real genefer app.
Yes, factory OC 900/1250. | |
|
rogue Volunteer developer
 Send message
Joined: 8 Sep 07 Posts: 1255 ID: 12001 Credit: 18,565,548 RAC: 0
 
|
Yves, should I be able to use my 5870 or not? If not, I'm okay with that (albeit disappointed). | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Yves, should I be able to use my 5870 or not? If not, I'm okay with that (albeit disappointed).
HD 5870 and 5850 have double precision FP, so they should be able to run it.
Try it without the test on CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE.
If it works, I will replace it by CL_DEVICE_EXTENSIONS and check if it contains cl_khr_fp64 or cl_amd_fp64. | |
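A minimal sketch of the extension check described above, assuming the extension list has already been read into a C string (in a real program it would come from clGetDeviceInfo with CL_DEVICE_EXTENSIONS). The helper name has_fp64_extension is made up for illustration and is not taken from the Genefer source.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if the space-separated OpenCL extension list names a
// double-precision extension. `extensions` stands for the string returned by
// clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ...).
// Hypothetical helper, not part of the Genefer source.
bool has_fp64_extension(const char* extensions) {
    assert(extensions != nullptr);
    std::istringstream tokens{std::string(extensions)};
    std::string name;
    while (tokens >> name) {
        // Compare whole tokens so that a longer name with the same prefix
        // does not match by accident.
        if (name == "cl_khr_fp64" || name == "cl_amd_fp64") return true;
    }
    return false;
}
```

Comparing whole tokens rather than searching for a substring is a design choice here: it avoids a false positive on any extension whose name merely starts with cl_khr_fp64.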
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
On my stock GTX 460:
GeneferCUDA:
genefercuda 3.1.0-1 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 164 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 162 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 214 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 338 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 599 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 1.06 ms/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.99 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 3.98 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 8.3 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 16.7 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 34.7 ms/mul. Err: 0.1797 45879398 digits
The first OpenCL version:
OclGenefer 2013, Copyright (C) 2001-2013, Yves Gallot.
Platform 'NVIDIA CUDA': 1 GPU device(s) found.
Platform 'AMD Accelerated Parallel Processing': 1 CPU device(s) found.
Running on platform 'NVIDIA CUDA' and device 'GeForce GTX 460'.
2199064^8192+1 Time: 116 us/mul. 51956 digits
1798620^16384+1 Time: 193 us/mul. 102481 digits
1471094^32768+1 Time: 327 us/mul. 202102 digits
1203210^65536+1 Time: 439 us/mul. 398482 digits
984108^131072+1 Time: 726 us/mul. 785521 digits
804904^262144+1 Time: 1.61 ms/mul. 1548156 digits
658332^524288+1 Time: 3.04 ms/mul. 3050541 digits
538452^1048576+1 Time: 6.39 ms/mul. 6009544 digits
440400^2097152+1 Time: 14.4 ms/mul. 11836006 digits
360204^4194304+1 Time: 35.2 ms/mul. 23305854 digits
294612^8388608+1 Time: 88.4 ms/mul. 45879398 digits
The most recent OpenCL version:
OclGenefer 2013-08-12, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'NVIDIA CUDA': 1 GPU device(s) found.
Platform 'AMD Accelerated Parallel Processing': 1 CPU device(s) found.
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Clock frequency = 1350 MHz, compute units = 7.
Global mem size = 1024 MB, cache size = 112 kB (ReadWrite), cache line size = 128 Bytes.
Local mem size = 48 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 1024.
2199064^8192+1 Time: 85.3 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 127 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 168 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 306 us/mul. Err: 0.2188 398482 digits
984108^131072+1 Time: 557 us/mul. Err: 0.2422 785521 digits
804904^262144+1 Time: 1.13 ms/mul. Err: 0.2178 1548156 digits
658332^524288+1 Time: 2.19 ms/mul. Err: 0.2256 3050541 digits
538452^1048576+1 Time: 4.56 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 9.48 ms/mul. Err: 0.2305 11836006 digits
360204^4194304+1 Time: 19.9 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 42.2 ms/mul. Err: 0.1973 45879398 digits
____________
My lucky number is 75898^524288+1 | |
|
rogue Volunteer developer
 Send message
Joined: 8 Sep 07 Posts: 1255 ID: 12001 Credit: 18,565,548 RAC: 0
 
|
I think that timings are incorrect. If they are correct, then something appears to be impacting the overall throughput. I commented out the check for CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE.
Running on platform 'Apple', device 'ATI Radeon HD 5870', version 'OpenCL 1.1 ' and driver '1.0'.
Clock frequency = 850 MHz, compute units = 20.
Global mem size = 1024 MB, cache size = 0 kB (None), cache line size = 0 Bytes.
Local mem size = 32 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 1024.
2199064^8192+1 Time: 169 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 173 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 171 us/mul. Err: 0.2734 202102 digits
1203210^65536+1 Time: 177 us/mul. Err: 0.2344 398482 digits
984108^131072+1 Time: 178 us/mul. Err: 0.2584 785521 digits
804904^262144+1 Time: 249 us/mul. Err: 0.2486 1548156 digits
658332^524288+1 Time: 252 us/mul. Err: 0.2378 3050541 digits
538452^1048576+1 Time: 266 us/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 304 us/mul. Err: 0.2275 11836006 digits
360204^4194304+1 Time: 352 us/mul. Err: 0.2168 23305854 digits
294612^8388608+1 Time: 337 us/mul. Err: 0.2344 45879398 digits
102^64+1 is a probable prime. (0.1 sec., err = 1.82e-11)
15000250^64+1 is a probable prime. (0.3 sec., err = 0.406)
120^128+1 is a probable prime. (0.2 sec., err = 5.09e-11)
10000038^128+1 is a probable prime. (0.5 sec., err = 0.406)
278^256+1 is a probable prime. (0.3 sec., err = 3.53e-10)
5684328^256+1 is a probable prime. (0.9 sec., err = 0.172)
46^512+1 is a probable prime. (0.5 sec., err = 1.64e-11)
4619000^512+1 is a probable prime. (1.8 sec., err = 0.195)
824^1024+1 is a probable prime. (1.6 sec., err = 9.08e-09)
3752220^1024+1 is a probable prime. (3.7 sec., err = 0.195)
150^2048+1 is a probable prime. (2.4 sec., err = 5.24e-10)
3066672^2048+1 is a probable prime. (7.1 sec., err = 0.219)
1534^4096+1 is a probable prime. (7.3 sec., err = 8.2e-08)
2485064^4096+1 is a probable prime. (14.5 sec., err = 0.215)
30406^8192+1 is a probable prime. (20.7 sec., err = 5.25e-05)
2030234^8192+1 is a probable prime. (28.9 sec., err = 0.234)
67234^16384+1 is a probable prime. (44.9 sec., err = 0.00037)
1651902^16384+1 is a probable prime. (57.9 sec., err = 0.22)
70906^32768+1 is a probable prime. (92.1 sec., err = 0.000671)
1277444^32768+1 is a probable prime. (113.7 sec., err = 0.227)
48594^65536+1 is a probable prime. (174.8 sec., err = 0.000458)
857678^65536+1 is a probable prime. (222.8 sec., err = 0.148)
62722^131072+1 is a probable prime. (365.5 sec., err = 0.00122)
572186^131072+1 is a probable prime. (429.6 sec., err = 0.105)
24518^262144+1 is a probable prime. (904.9 sec., err = 0.000275) | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
I think that timings are incorrect. If they are correct, then something appears to be impacting the overall throughput. I commented out the check for CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE. [...]
The HD 5870 is very fast!
24518^262144+1: 904.9 sec => 904.9 / (262144 * log2(24518)) = 237 us/mul.
genefercuda on a GTX 460: b^262144+1 Time: 1.06 ms/mul.
This program on a HD 7970 GHz: b^262144+1 Time: 336 us/mul.
If the timing is correct for N = 8388608, your computer could enter the TOP500 list of supercomputers :o)
... or there is a problem with the accuracy of the clock() function or with OpenCL synchronization.
I will replace the CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE test with a check of CL_DEVICE_EXTENSIONS to support cl_amd_fp64. | |
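The conversion used above (total time divided by N * log2(b)) can be sketched in C++ as follows. It assumes, as the formula in the post does, that a probable-prime test of b^N+1 needs about N * log2(b) multiplications; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Approximate number of multiplications in a probable-prime test of b^N+1,
// taken as N * log2(b) as in the post above.
double multiplication_count(double b, double n) {
    assert(b > 1.0 && n > 0.0);
    return n * std::log2(b);
}

// Average time per multiplication, in microseconds, from a full test time.
double us_per_mul(double total_seconds, double b, double n) {
    return 1e6 * total_seconds / multiplication_count(b, n);
}

// Inverse estimate: total run time in hours from a measured ms/mul figure.
double estimated_hours(double ms_per_mul, double b, double n) {
    return ms_per_mul * 1e-3 * multiplication_count(b, n) / 3600.0;
}
```

For example, us_per_mul(904.9, 24518, 262144) gives roughly 237, matching the figure quoted above. Going the other way, estimated_hours(1.49, 216560, 1048576) gives roughly 7.7 hours for a 2^20 test at 1.49 ms/mul.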
|
|
clock() is Elapsed Time on Windows.
clock() is Cpu Time on Unix.
time() is Elapsed Time on Unix, but its resolution is one second. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
clock() is Elapsed Time on Windows.
clock() is Cpu Time on Unix.
OK. Apple OS is Unix and because computation is running on GPU, CPU time ~ 0. Thanks!
It will be correct with the real genefer that uses gettimeofday() on Unix. | |
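To illustrate the difference, here is a small sketch that times the same wait with std::clock() and with a wall clock. It uses std::chrono::steady_clock rather than gettimeofday(), purely to stay within portable standard C++, and a sleep stands in for a host thread that is blocked while the GPU works. All names are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_seconds;   // elapsed real time
    double clock_seconds;  // whatever std::clock() reports on this platform
};

// Stand-in for a host thread that is blocked while the GPU does the work.
void wait_like_gpu_host(int milliseconds) {
    assert(milliseconds >= 0);
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
}

// Time the same wait with both clocks.
Timing time_wait(int milliseconds) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    wait_like_gpu_host(milliseconds);
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    return Timing{std::chrono::duration<double>(w1 - w0).count(),
                  static_cast<double>(c1 - c0) / CLOCKS_PER_SEC};
}
```

On a Unix-like system clock_seconds should stay close to zero for such a wait while wall_seconds reflects the real delay, which is the effect described above. On Windows both values would be expected to be similar.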
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
A new version is available on assembla (2013-08-14).
It supports "cl_khr_fp64" and "cl_amd_fp64".
A bug was corrected: a possible final round-off error. It occurred with composite numbers but not with primes!
This may be the last release of this program: the next one is a beta of "GeneferOCL" (the real genefer app). | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
I expect I'll be releasing GeneferOCL as a beta/app_info app later today. I'm playing with it right now.
It appears to be happy running on ANYTHING. On my computer, specifying "--device 0" runs on my GTX 460 and "--device 1" runs on all 4 cores of my Core2Quad.
Although I don't have this hardware on this computer, it should also run on an ATI or Intel GPU just as easily.
So caution is warranted when running this on computers with multiple OpenCL capable devices.
This raises the possibility of officially supporting not just AMD/ATI GPUs but also Intel GPUs, and possibly CPU-based multicore apps. (Well, that last part might be wishful thinking. It looks like GeneferOCL running on all 4 cores of my computer is about 10 times SLOWER than GenefX64 running on a single core.)
OpenCL on CPU:
Command line: geneferocl -q 212346^1048576+1 -d 1
Running on platform 'AMD Accelerated Parallel Processing', device 'Intel(R) Core(TM)2 Quad CPU @ 2.40GHz', version 'OpenCL 1.1 AMD-APP (831.4)' and driver '2.0'.
Testing 212346^1048576+1...
Starting initialization...
Initialization complete (6.343 seconds).
Estimated total run time for 212346^1048576+1 is 2940:06:11
GenefX64 on CPU:
Command line: genefx64 -q 212346^1048576+1
Priority change succeeded.
Testing 212346^1048576+1...
Starting initialization...
Initialization complete (327.333 seconds).
Estimated total run time for 212346^1048576+1 is 246:35:21
OpenCL on GTX 460:
Command line: geneferocl -q 212346^1048576+1 -d 0
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Testing 212346^1048576+1...
Starting initialization...
Initialization complete (6.359 seconds).
Estimated total run time for 212346^1048576+1 is 23:28:41
CUDA on GTX 460:
Command line: genefercuda -q 212346^1048576+1 -shift 7
Testing 212346^1048576+1...
SHIFT override specified; using SHIFT=7 (instead of default value of 7)
Starting initialization...
maxErr during b^N initialization = 0.0000 (21.825 seconds).
Estimated total run time for 212346^1048576+1 is 20:26:50
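The estimates above can be compared directly by converting the H:MM:SS strings to seconds. A minimal sketch, with helper names made up for illustration:

```cpp
#include <cassert>
#include <cstdio>

// Convert an "H:MM:SS" estimate, as printed in the logs above, to seconds.
long to_seconds(const char* hms) {
    long h = 0, m = 0, s = 0;
    const int fields = std::sscanf(hms, "%ld:%ld:%ld", &h, &m, &s);
    assert(fields == 3);
    (void)fields;  // avoids an unused-variable warning when NDEBUG is set
    return h * 3600 + m * 60 + s;
}

// How many times longer the `slow` estimate is than the `fast` one.
double slowdown(const char* slow, const char* fast) {
    return static_cast<double>(to_seconds(slow)) / to_seconds(fast);
}
```

With the figures above, slowdown("2940:06:11", "246:35:21") is about 11.9 (OpenCL on the CPU versus GenefX64), and slowdown("23:28:41", "20:26:50") is about 1.15 (OpenCL versus CUDA on the GTX 460).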
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
[...], it should also run on an ATI or Intel GPU just as easily.
Intel HD Graphics doesn't support double FP.
Intel CPU does but that's incredibly slow!
On my laptop:
geneferocl -q 212346^32768+1 -d 1
Running on platform 'Intel(R) OpenCL', device 'Intel(R) Core(TM) i3-3217U CPU @ 1.80GHz'
Estimated total run time for 212346^32768+1 is 4:44:56
geneferavx -q "212346^32768+1"
Estimated total run time for 212346^32768+1 is 0:04:04
And according to Intel's OpenCL compiler: "all kernel functions were successfully vectorized".
Does Intel's SDK emulate double FP ? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Intel HD Graphics doesn't support double FP.
We can forget about Intel GPUs. Oh well.
Intel CPU does but that's incredibly slow!
On my laptop:
geneferocl -q 212346^32768+1 -d 1
Running on platform 'Intel(R) OpenCL', device 'Intel(R) Core(TM) i3-3217U CPU @ 1.80GHz'
Estimated total run time for 212346^32768+1 is 4:44:56
geneferavx -q "212346^32768+1"
Estimated total run time for 212346^32768+1 is 0:04:04
And according to Intel's OpenCL compiler: "all kernel functions were successfully vectorized".
Does Intel's SDK emulate double FP ?
That could explain it. So, it will run... although "crawl" would be a better word than "run".
____________
My lucky number is 75898^524288+1 | |
|
|
Hi,
testing with my 7950 the estimated runtime (after 3% work done) is 27050 seconds. GPU load is 99%, CPU is about 1%. I am using this app_info.xml:
<app_info>
  <app>
    <name>genefer</name>
    <user_friendly_name>Genefer OCL</user_friendly_name>
  </app>
  <file_info>
    <name>geneferocl-windows.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>genefer</app_name>
    <version_num>206</version_num>
    <api_version>7.0.64</api_version>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <file_ref>
      <file_name>geneferocl-windows.exe</file_name>
      <main_program/>
    </file_ref>
    <platform>windows_intelx86</platform>
    <coproc>
      <type>ATI</type>
      <count>1.000000</count>
    </coproc>
  </app_version>
</app_info>
____________
DeleteNull | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Latest assembla OclGenefer.cpp revision 386 code with full run on HD7970Ghz:
Are we expecting the same "B" limits as GeneferCUDA? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Are we expecting the same "B" limits as GeneferCUDA?
Running the full-blown geneferocl app using the -l option, the b limits mostly look to be similar. Let's see if I can cut and paste something useful...
Here's GeneferCUDA:
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2650000, Err = 0.2813
The upper bound m = 16384, b = 2280000, Err = 0.2969
The upper bound m = 32768, b = 1840000, Err = 0.2969
The upper bound m = 65536, b = 1525000, Err = 0.2969
The upper bound m = 131072, b = 1270000, Err = 0.2969
The upper bound m = 262144, b = 995000, Err = 0.2813
The upper bound m = 524288, b = 815000, Err = 0.2813
The upper bound m = 1048576, b = 695000, Err = 0.3047
The upper bound m = 2097152, b = 580000, Err = 0.2969
The upper bound m = 4194304, b = 475000, Err = 0.3125
The upper bound m = 8388608, b = 400000, Err = 0.3125
GeneferOCL:
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2670000, Err = 0.2910
The upper bound m = 16384, b = 2210000, Err = 0.2969
The upper bound m = 32768, b = 1780000, Err = 0.2969
The upper bound m = 65536, b = 1505000, Err = 0.2969
The upper bound m = 131072, b = 1240000, Err = 0.2969
The upper bound m = 262144, b = 1015000, Err = 0.3047
The upper bound m = 524288, b = 825000, Err = 0.3057
The upper bound m = 1048576, b = 680000, Err = 0.3047
The upper bound m = 2097152, b = 555000, Err = 0.2969
The upper bound m = 4194304, b = 455000, Err = 0.2813
The upper bound m = 8388608, b = 385000, Err = 0.3125
The higher limits are green and the lower limits are red. As you can see, the limits are similar but not identical.
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
From the database:
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Starting initialization...
Initialization complete (2.464 seconds).
Testing 216560^1048576+1...
Estimated total run time for 216560^1048576+1 is 7:39:03
216560^1048576+1 is complete. (5594760 digits) (err = 0.0469) (time = 9:48:38) 06:53:04
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Starting initialization...
Initialization complete (3.205 seconds).
Testing 216878^1048576+1...
Estimated total run time for 216878^1048576+1 is 14:56:13
216878^1048576+1 is complete. (5595428 digits) (err = 0.0488) (time = 7:51:30) 14:44:38
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Starting initialization...
Initialization complete (3.078 seconds).
Testing 216862^1048576+1...
Estimated total run time for 216862^1048576+1 is 7:52:07
216862^1048576+1 is complete. (5595394 digits) (err = 0.0469) (time = 8:20:41) 16:17:09
A few interesting things. The first and most obvious is that it works.
The second is that it doesn't tell you the exact device model. That's a shame.
The third is that the estimates vary by more than I'm accustomed to seeing on my system. These three results are from the same computer, but it's not mine.
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Running the full blown geneferocl app using the -l option, [...]As you can see, the limits are similar but not identical.
Yes, because the algorithms are similar but not identical.
There are many ways to split a large FFT into smaller ones, so the round-off error of CUFFT is not identical to that of the OpenCL transform.
More surprisingly, the error of the OpenCL implementation seems to depend on the hardware...
"cl_khr_fp64" requires that arithmetic is IEEE 754-2008 compliant, and I set the flag FP_CONTRACT=OFF, which forbids the implementation from contracting expressions.
But that's not enough... fortunately, the round-off errors are very similar.
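To show why contraction matters for round-off, here is a small host-side C++ sketch (not OpenCL kernel code, and not taken from Genefer). It compares a multiply followed by an add, which rounds twice, with a fused multiply-add, which rounds once. In OpenCL C the corresponding switch is `#pragma OPENCL FP_CONTRACT OFF`; the function names below are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// a*b rounded to double first, then added to c (two roundings).
// `volatile` keeps the compiler from fusing the two operations itself.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a*b + c with a single rounding at the end.
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Worked example: a = 1 + 2^-27, so a*a = 1 + 2^-26 + 2^-54 exactly.
// Rounding the product to double drops the 2^-54 term.
void contraction_example() {
    const double a = 1.0 + std::ldexp(1.0, -27);
    const double c = -(1.0 + std::ldexp(1.0, -26));
    assert(mul_then_add(a, a, c) == 0.0);
    assert(fused_mul_add(a, a, c) == std::ldexp(1.0, -54));
}
```

The two results differ in the last bits even though both are valid IEEE 754 operations, so an implementation that is free to contract expressions can produce slightly different round-off errors from one device to another.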
| |
|
|
A few interesting things. The first and most obvious is that it works.
The second is that it doesn't tell you the exact device model. That's a shame.
The third is that the estimates vary by more than I'm accustomed to seeing on my system. These three results are from the same computer, but it's not mine.
My computer has two devices:
device 0 = HD7970
device 1 = HD7950
I started with device 1 (device 0 was running distrigen), the estimated time was about 7.5 hours.
Till 75% of the WU the estimated time and the run time were consistent.
At 90%, device 0 ran dry (no new work) and geneferocl ran very slowly. So instead of 7.5 hours... the run time increased to 9.8 hours.
The second WU started on device 1, and after one hour the progress was about 2% (estimated time = 50 hours!)
After I started a third WU on device 0, geneferocl ran faster: the second WU finished after 7.8 hours, the third WU after 8.33 hours.
At this moment:
device 1 (7950): 54.3% at 4 hours, estimated time = 7.37 hours
device 0 (7970): 19.55% at 2.5 hours, estimated time = 11.76 hours
It seems to me that the app does not use only the declared device...
____________
DeleteNull | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
My computer has two devices:
device 0 = HD7970
device 1 = HD7950
If you start genefer using interactive mode and enter a command, it prints the device list that it found.
Could you copy this list? I don't know if there is one OpenCL 'platform' or two.
I downloaded DistrRTgen source code and compared OpenCL initialisation.
I found a difference and modified genefer code (now, initialisations are similar). Mike, could you update GeneferOCL?
I noticed that DistrRTgen uses boinc_get_opencl_ids() to get platform and device ids. Mike, do you know if we should use this function?
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
I downloaded DistrRTgen source code and compared OpenCL initialisation.
I found a difference and modified genefer code (now, initialisations are similar). Mike, could you update GeneferOCL?
Certainly.
I noticed that DistrRTgen uses boinc_get_opencl_ids() to get platform and device ids. Mike, do you know if we should use this function?
I don't know anything about it, unfortunately. I do recall some chatter about how much fun it is to get the right OpenCL devices, but I don't remember if that was BOINC related or PRPNet related.
My guess would be that the function is there to solve a problem related to selecting the right device, so chances are it's a good idea to use it.
EDIT: I'm going to bump the variant (-#) version number, so people can tell one version from another. We'll probably use a lot of them, but that's ok.
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
The latest OpenCL beta build is available for download. Its version number is 3.1.2-1, and can be downloaded via this thread.
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
On my system (Core2Quad and GTX 460) GeneferCUDA uses almost no CPU at all. GeneferOCL seems to use an entire CPU core as well as the entire GPU.
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
On my system (Core2Quad and GTX 460) GeneferCUDA uses almost no CPU at all. GeneferOCL seems to use an entire CPU core as well as the entire GPU.
Yes, it is a bug in the NVidia driver: https://forums.geforce.com/default/topic/543115/opencl-driver-support-for-fah/
NVidia also removed OpenCL documentation from its SDK and still uses OpenCL 1.1 (OpenCL 1.2 specifications were defined in November 2011!).
It is clear today that NVidia doesn't actively support OpenCL. | |
|
|
Hi Yves,
the output with both versions of geneferocl-windows.exe is:
Device List:
0: GPU device 'Tahiti' on 'AMD Accelerated Parallel Processing'.
1: GPU device 'Tahiti' on 'AMD Accelerated Parallel Processing'.
2: CPU device ' Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz' on 'AMD Accelerated Parallel Processing'.
____________
DeleteNull | |
|
|
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 580', version 'OpenCL 1.1 CUDA' and driver '326.58'.
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
2199064^8192+1 Time: 68.5 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 72.4 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 76.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 114 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 198 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 411 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 793 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.59 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 3.29 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 6.83 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 14.6 ms/mul. Err: 0.1895 45879398 digits
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
2199064^8192+1 Time: 87.5 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 98.1 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 114 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 168 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 312 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 519 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 945 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 1.65 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 3.17 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 6.58 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 14.1 ms/mul. Err: 0.1797 45879398 digits | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
Are we expecting the same "B" limits as GeneferCUDA?
Running the full blown geneferocl app using the -l option, the b limits mostly look to be similar. Let's see if I can cut and paste something useful...
GeneferOCL:
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2670000, Err = 0.2910
The upper bound m = 16384, b = 2210000, Err = 0.2969
The upper bound m = 32768, b = 1780000, Err = 0.2969
The upper bound m = 65536, b = 1505000, Err = 0.2969
The upper bound m = 131072, b = 1240000, Err = 0.2969
The upper bound m = 262144, b = 1015000, Err = 0.3047
The upper bound m = 524288, b = 825000, Err = 0.3057
The upper bound m = 1048576, b = 680000, Err = 0.3047
The upper bound m = 2097152, b = 555000, Err = 0.2969
The upper bound m = 4194304, b = 455000, Err = 0.2813
The upper bound m = 8388608, b = 385000, Err = 0.3125
The higher limits are green and the lower limits are red. As you can see, the limits are similar but not identical.
I got exactly the same "B" limits testing GeneferOCL -l on the HD7970Ghz.
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Some of DeleteNull's GeneferOCL tasks have now been validated (one with a CPU, the other with a Tesla GPU), so we now officially have our first credit awarded to GeneferOCL and an ATI/AMD GPU.
This is a milestone many people have been waiting for.
Many thanks and congratulations to Yves for the great achievement! (This is, of course, just the latest advance for which Yves deserves our thanks.)
____________
My lucky number is 75898^524288+1 | |
|
|
Results using an HD 7970 on an i7, Windows 7 64-bit:
C:\Users\user\Downloads>geneferocl-windows.exe -b
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1272.2)' and driver '1272.2 (VM)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 70.8 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 73.9 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 74.5 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 87.9 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 122 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 322 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 820 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.56 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 2.81 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 5.63 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 11.3 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 865.
C:\Users\user\Downloads>geneferocl-windows.exe -b3
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b3
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1272.2)' and driver '1272.2 (VM)'.
14^32768+1 37557 digits 0 days 0.0 hours (0.10 ms/mul, 124758 iterations) 5772 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.08 ms/mul, 531226 iterations) 19240 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.08 ms/mul, 636255 iterations) 23088 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.08 ms/mul, 729201 iterations) 26455 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (0.09 ms/mul, 249517 iterations) 10101 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (0.09 ms/mul, 1062453 iterations) 44252 GFLOPS
710000^65536+1 383469 digits 0 days 0.0 hours (0.09 ms/mul, 1273852 iterations) 52910 GFLOPS
2500000^65536+1 419296 digits 0 days 0.0 hours (0.09 ms/mul, 1392868 iterations) 58201 GFLOPS
14^131072+1 150226 digits 0 days 0.0 hours (0.15 ms/mul, 499036 iterations) 36075 GFLOPS
75898^131072+1 639662 digits 0 days 0.0 hours (0.12 ms/mul, 2124908 iterations) 124579 GFLOPS
700000^131072+1 766129 digits 0 days 0.0 hours (0.12 ms/mul, 2545023 iterations) 150553 GFLOPS
1000000^131072+1 786432 digits 0 days 0.0 hours (0.12 ms/mul, 2612469 iterations) 152958 GFLOPS
14^262144+1 300451 digits 0 days 0.0 hours (0.32 ms/mul, 998074 iterations) 152477 GFLOPS
75898^262144+1 1279324 digits 0 days 0.3 hours (0.33 ms/mul, 4249818 iterations) 670033 GFLOPS
468750^262144+1 1486604 digits 0 days 0.4 hours (0.32 ms/mul, 4938388 iterations) 750360 GFLOPS
815000^262144+1 1549575 digits 0 days 0.4 hours (0.33 ms/mul, 5147574 iterations) 809523 GFLOPS
14^524288+1 600902 digits 0 days 0.4 hours (0.82 ms/mul, 1996149 iterations) 790764 GFLOPS
75898^524288+1 2558647 digits 0 days 1.9 hours (0.84 ms/mul, 8499637 iterations) 3450213 GFLOPS
468750^524288+1 2973207 digits 0 days 2.1 hours (0.80 ms/mul, 9876777 iterations) 3800381 GFLOPS
710000^524288+1 3067745 digits 0 days 2.3 hours (0.84 ms/mul, 10190825 iterations) 4112550 GFLOPS
C:\Users\user\Downloads>more genefer.bench
Generalized Fermat Number Bench
2199064^8192+1 Time: 70.8 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 73.9 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 74.5 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 87.9 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 122 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 322 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 820 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.56 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 2.81 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 5.63 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 11.3 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 865.
____________
| |
|
|
Running on platform 'AMD Accelerated Parallel Processing', device 'AMD Athlon(tm) 64 X2 Dual Core Processor 4200+', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (sse2)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 8.73 ms/mul. Err: 0.2344 51956 digits
ATI 3550 or something like that.....
____________
wbr, Me. Dead J. Dona
| |
|
|
The second is that it doesn't report the exact device model, which is a shame.
Why does the program report only 'Tahiti' for both the HD7950 and the HD7970?
I ran a test with a little Java program (it works with the OpenCL DLLs):
querying CL_DEVICE_NAME gives back
GeForce GTS 450 for my little NVIDIA
Tahiti for my HD7950
Tahiti for my HD7970
So, if the name is Tahiti, we also have to query CL_DEVICE_MAX_COMPUTE_UNITS:
Tahiti, 32 => HD7970
Tahiti, 28 => HD7950
Tahiti, 24 => HD7870 Boost Edition
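The lookup above can be sketched in a few lines of Python. This is only an illustration: the function name and the table are hypothetical, the table holds nothing beyond the three pairs listed in this post, and the two arguments stand in for the values an OpenCL query of CL_DEVICE_NAME and CL_DEVICE_MAX_COMPUTE_UNITS would return.

```python
# Hypothetical table, filled only with the three (compute units -> model)
# pairs reported in the post above.
TAHITI_MODELS = {32: "HD7970", 28: "HD7950", 24: "HD7870 Boost Edition"}


def marketing_name(device_name: str, compute_units: int) -> str:
    """Return a more specific model name when CL_DEVICE_NAME is ambiguous."""
    if device_name == "Tahiti":
        # Fall back to a descriptive string for compute-unit counts not listed.
        return TAHITI_MODELS.get(
            compute_units, f"Tahiti ({compute_units} CUs, unknown model)"
        )
    # Other devices (e.g. the GeForce GTS 450) already report a usable name.
    return device_name


print(marketing_name("Tahiti", 28))
print(marketing_name("GeForce GTS 450", 4))
```

Any other chip that shares one name across several cards would need its own table; nothing here is taken from the Genefer source.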
____________
DeleteNull | |
|
|
For reference, it seems that the OpenCL version is faster than CUDA on the GTX Titan.
The first runs are with Double Precision ENABLED.
CUDA:
genefercuda-windows.exe -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 135 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 157 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 173 us/mul. Err: 0.2500 202102 digits
1203210^65536+1 Time: 208 us/mul. Err: 0.2352 398482 digits
984108^131072+1 Time: 331 us/mul. Err: 0.5000 785521 digits
804904^262144+1 Time: 500 us/mul. Err: 0.2227 1548156 digits
658332^524288+1 Time: 889 us/mul. Err: 0.2500 3050541 digits
538452^1048576+1 Time: 1.75 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 3.23 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 6.09 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 13.3 ms/mul. Err: 0.1797 45879398 digits
OpenCL:
geneferocl-windows.exe -b
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '320.49'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 82.5 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 80.3 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 85.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 97.7 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 134 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 279 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 465 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 859 us/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 1.63 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 3.33 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 6.88 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 1487.
----------
Double Precision DISABLED.
CUDA:
genefercuda-windows.exe -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 146 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 153 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 177 us/mul. Err: 0.2500 202102 digits
1203210^65536+1 Time: 229 us/mul. Err: 0.2352 398482 digits
984108^131072+1 Time: 425 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 646 us/mul. Err: 0.2227 1548156 digits
658332^524288+1 Time: 1.08 ms/mul. Err: 0.2500 3050541 digits
538452^1048576+1 Time: 2.02 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 3.66 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 6.86 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 14.7 ms/mul. Err: 0.1797 45879398 digits
OpenCL:
geneferocl-windows.exe -b
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '320.49'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 88.7 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 82.5 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 98.5 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 125 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 180 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 346 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 609 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.15 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 2.28 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 4.64 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 9.69 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 1072.
----------
I'm switching my main computer to a Haswell system, so I can't run these again on the same platform. I can post new results after the mobo/CPU change (ASUS P8Z68-V PRO GEN3 & i7 2600K -> ASUS Z87 Maximus VI Gene & i5-4670k).
I used the precompiled executables from http://www.primegrid.com/forum_thread.php?id=4889#63012, in case they have been updated since (they should be the same as the ones in the PrimeGrid BOINC beta). | |
|
Michael Goetz Volunteer moderator Project administrator Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
Running on platform 'AMD Accelerated Parallel Processing', device 'AMD Athlon(tm) 64 X2 Dual Core Processor 4200+', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (sse2)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 8.73 ms/mul. Err: 0.2344 51956 digits
ATI 3550 or something like that.....
It's running on the CPU.
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
If anyone builds a Linux or Mac version of GeneferOCL before we do: the source has been updated to 3.1.2-2, which has a few minor tweaks to the utility functions. The only significant change is that the benchmarks should now produce the correct results under Linux and Mac.
____________
My lucky number is 75898^524288+1 | |
|
|
I'm testing on an ATI Radeon HD 7770 stock speed with one CPU core reserved -> http://www.primegrid.com/show_host_detail.php?hostid=407824
Estimated elapsed time 27½ hours. CPU run time 82% of elapsed time
____________
| |
|
|
Testing on a HD7950 with 1 CPU core reserved, estimated time for a WU ~28,000 sec. | |
|
|
For reference, it seems that OpenCL version is faster than CUDA on GTX Titan.
CUDA:
538452^1048576+1 Time: 1.75 ms/mul. Err: 0.2031 6009544 digits
OpenCL:
538452^1048576+1 Time: 859 us/mul. Err: 0.2266 6009544 digits
Wow, OpenCL is almost twice as fast on your Titan! That is totally unexpected. | |
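For anyone who wants to compare the bench lines without doing the unit conversion by hand, here is a small Python sketch. The helper name and the regular expression are my own assumptions about the line format, based only on the two lines quoted above; the bench output mixes us/mul and ms/mul, so everything is converted to milliseconds first.

```python
import re

# Conversion factors to milliseconds for the two units seen in the bench output.
UNIT_TO_MS = {"us": 1e-3, "ms": 1.0}


def ms_per_mul(line: str) -> float:
    """Extract the 'Time: <value> <unit>/mul' field and return it in ms."""
    match = re.search(r"Time:\s*([\d.]+)\s*(us|ms)/mul", line)
    if match is None:
        raise ValueError(f"no timing found in: {line!r}")
    return float(match.group(1)) * UNIT_TO_MS[match.group(2)]


cuda_line = "538452^1048576+1 Time: 1.75 ms/mul. Err: 0.2031 6009544 digits"
ocl_line = "538452^1048576+1 Time: 859 us/mul. Err: 0.2266 6009544 digits"

speedup = ms_per_mul(cuda_line) / ms_per_mul(ocl_line)
print(f"OpenCL is {speedup:.2f}x faster than CUDA at this size")
```

For these two lines the ratio comes out at about 2.04, so slightly more than twice as fast.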
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
Wow, OpenCL is almost twice as fast on your Titan! That is totally unexpected.
No, that's not unexpected :o)
First, my graphics card is a Kepler, so I optimized the current version for this architecture.
Second, the maximum number of registers per thread and per multiprocessor is (I think) a major bottleneck. With 255 registers per thread, the new Kepler GK110 removes this limitation.
It would be interesting to compare the real GFLOPS of genefer with the peak GFLOPS of GPUs. | |
|
|
ATI 7970
Completed and validated
Run time 26,231.64
CPU time 491.36
Points 29,125.69 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
I tuned geneferocl. It's a bit faster on my computer for large exponents.
I committed the new parameters on assembla, in the branch [...]\branches\yves\2013\OclGenefer.
I think that it is faster on a 'Tahiti', but only a real experiment can answer the question. Could someone please compile it and run the bench on a HD79x0?
Mike, could you also test it on your GTX 460? I don't know why, but your card doesn't like my algorithm :o) If it is the card that found 75898^524288+1 with GeneferCUDA, I understand it!
If the new tuning is faster on Tahiti and Fermi, I will update the real genefer.
Thanks, Yves | |
|
Michael Goetz Volunteer moderator Project administrator Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
The GeneferOCL we've been running is a 64 bit version. Until today, I had not tried running a 32 bit version. I would expect the 32 bit and 64 bit builds to run at the same speed, and in fact their benchmark speeds are the same. This behavior is consistent with GeneferCUDA.
However, the B limits are NOT the same.
64 bit GeneferOCL:
Command line: geneferocl-windows-x64.exe -l
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2670000, Err = 0.2910
The upper bound m = 16384, b = 2210000, Err = 0.2969
The upper bound m = 32768, b = 1780000, Err = 0.2969
The upper bound m = 65536, b = 1505000, Err = 0.2969
The upper bound m = 131072, b = 1240000, Err = 0.2969
The upper bound m = 262144, b = 1015000, Err = 0.3047
The upper bound m = 524288, b = 825000, Err = 0.3057
The upper bound m = 1048576, b = 680000, Err = 0.3047
The upper bound m = 2097152, b = 555000, Err = 0.2969
The upper bound m = 4194304, b = 455000, Err = 0.2813
The upper bound m = 8388608, b = 385000, Err = 0.3125
32 bit GeneferOCL:
Command line: geneferocl-windows-x86.exe -l
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2720000, Err = 0.2969
The upper bound m = 16384, b = 2210000, Err = 0.2969
The upper bound m = 32768, b = 1830000, Err = 0.3008
The upper bound m = 65536, b = 1490000, Err = 0.2969
The upper bound m = 131072, b = 1235000, Err = 0.3008
The upper bound m = 262144, b = 1015000, Err = 0.2891
The upper bound m = 524288, b = 840000, Err = 0.3047
The upper bound m = 1048576, b = 690000, Err = 0.3008
The upper bound m = 2097152, b = 565000, Err = 0.3066
The upper bound m = 4194304, b = 470000, Err = 0.3125
The upper bound m = 8388608, b = 385000, Err = 0.3125
While I wasn't surprised that there were small differences between GeneferCUDA and GeneferOCL, this does surprise me a bit because the math done on the GPU shouldn't be affected by the integer size on the CPU, especially since that doesn't affect the double precision floating point on either the CPU or the GPU.
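To make the two tables easier to compare, here is a short Python sketch that lines them up. The dictionary names are hypothetical; the numbers are copied from the two -l outputs above and nothing else is assumed.

```python
# b limits per transform size m, copied from the two -l runs above.
limits_x64 = {
    8192: 2670000, 16384: 2210000, 32768: 1780000, 65536: 1505000,
    131072: 1240000, 262144: 1015000, 524288: 825000, 1048576: 680000,
    2097152: 555000, 4194304: 455000, 8388608: 385000,
}
limits_x86 = {
    8192: 2720000, 16384: 2210000, 32768: 1830000, 65536: 1490000,
    131072: 1235000, 262144: 1015000, 524288: 840000, 1048576: 690000,
    2097152: 565000, 4194304: 470000, 8388608: 385000,
}

# Positive difference: the 32 bit build reported the higher limit.
diff = {m: limits_x86[m] - limits_x64[m] for m in limits_x64}
for m, d in diff.items():
    print(f"m = {m:>8}: x86 - x64 = {d:+}")

higher_on_x86 = sum(1 for d in diff.values() if d > 0)
higher_on_x64 = sum(1 for d in diff.values() if d < 0)
print(f"x86 higher: {higher_on_x86}, x64 higher: {higher_on_x64}")
```

With these numbers the 32 bit build is higher at six sizes, the 64 bit build at two, and three are identical.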
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
While I wasn't surprised that there were small differences between GeneferCUDA and GeneferOCL, this does surprise me a bit because the math done on the GPU shouldn't be affected by the integer size on the CPU, especially since that doesn't affect the double precision floating point on either the CPU or the GPU.
The cos/sin tables are computed during initialisation by the CPU.
I think that it is more accurate than the GPU sin/cos functions.
Win32 binaries still use FP80 for internal computation. x64 uses SSE2 and therefore FP64. If you compile the win32 app with "SSE2 instruction set", the results may be identical...? | |
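As a rough illustration of a cos/sin table built on the CPU, here is a minimal Python sketch. It is a generic table of roots of unity, not Genefer's code: the function name, the table size and the ordering are assumptions, and the actual layout in Genefer is not described in this thread. Python floats are FP64, so this corresponds to the x64/SSE2 case; an x87 build could carry FP80 intermediates before rounding, which is where last-bit differences in the table might come from.

```python
import math


def twiddle_table(n: int) -> list[tuple[float, float]]:
    """Return (cos, sin) of 2*pi*k/n for k = 0..n-1, evaluated in FP64."""
    return [
        (math.cos(2.0 * math.pi * k / n), math.sin(2.0 * math.pi * k / n))
        for k in range(n)
    ]


table = twiddle_table(8)

# Each entry should lie on the unit circle up to rounding error.
worst = max(abs(c * c + s * s - 1.0) for c, s in table)
print(f"largest deviation from the unit circle: {worst:.3e}")
```

The table is computed once at start-up and then only read, so its accuracy is fixed by whatever arithmetic the host build uses.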
|
Michael Goetz Volunteer moderator Project administrator Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
While I wasn't surprised that there were small differences between GeneferCUDA and GeneferOCL, this does surprise me a bit because the math done on the GPU shouldn't be affected by the integer size on the CPU, especially since that doesn't affect the double precision floating point on either the CPU or the GPU.
The cos/sin tables are computed during initialisation by the CPU.
I think that it is more accurate than the GPU sin/cos functions.
Win32 binaries still use FP80 for internal computation. x64 uses SSE2 and therefore FP64. If you compile the win32 app with "SSE2 instruction set", the results may be identical...?
Not at all. With SSE2 enabled on the 32 bit build, none of the B limits changed, except for N=19 through 22, where all went up by 5000. Those 4 B limits were already higher than the 64 bit versions, so with SSE2 the gap was slightly larger.
There's no obvious pattern to which version (32 or 64 bit) is better here, so I suspect that the differences might reflect inaccuracies in the B limit testing process rather than actual differences in precision.
EDIT: For future builds, I'm going to stay with the 32 bit (SSE2) version because it will run on either 32 or 64 bit platforms. The -l limit test seems to indicate that the limits are higher at the N values we care about, so it's also a better choice from that perspective.
____________
My lucky number is 75898^524288+1 | |
|
|
I'll probably run your OpenCL version instead of CUDA on this card, if these results also hold up on full runs :)
I got my Haswell put together; I intend to use it mostly for BOINC when I'm not otherwise using it. This is the same card with a different mobo, CPU and memory. The platform change made little difference.
GF WR bugged out with the Titan and the 690s in the same computer, so I had to build a new one just for the Titan :S
These were run at stock speeds on both CPU and GPU.
-----
Double precision ENABLED.
CUDA:
genefercuda-windows.exe -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 131 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 141 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 159 us/mul. Err: 0.2500 202102 digits
1203210^65536+1 Time: 206 us/mul. Err: 0.2352 398482 digits
984108^131072+1 Time: 328 us/mul. Err: 0.5000 785521 digits
804904^262144+1 Time: 477 us/mul. Err: 0.2227 1548156 digits
658332^524288+1 Time: 793 us/mul. Err: 0.2500 3050541 digits
538452^1048576+1 Time: 1.61 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 2.96 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 5.75 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 12.8 ms/mul. Err: 0.1797 45879398 digits
OpenCL, with the two different versions 3.1.2-1 and 3.1.2-2:
geneferocl-windows.exe -b
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '320.49'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 75.2 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 76.5 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 81.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 95.5 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 131 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 275 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 459 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 855 us/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 1.61 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 3.33 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 6.84 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 1494.
geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '320.49'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 75.3 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 76.4 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 81.1 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 95.5 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 131 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 275 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 459 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 852 us/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 1.6 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 3.31 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 6.94 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 1494.
-----
And here are the -b3 runs. The difference in estimated time becomes large at the higher exponents.
CUDA:
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
14^262144+1 300451 digits 0 days 0.1 hours (0.48 ms/mul, 998074 iterations) 230880 GFLOPS
75898^262144+1 1279324 digits 0 days 0.5 hours (0.48 ms/mul, 4249818 iterations) 972582 GFLOPS
468750^262144+1 1486604 digits 0 days 0.6 hours (0.48 ms/mul, 4938388 iterations) 1132755 GFLOPS
815000^262144+1 1549575 digits 0 days 0.6 hours (0.47 ms/mul, 5147574 iterations) 1176045 GFLOPS
14^524288+1 600902 digits 0 days 0.4 hours (0.80 ms/mul, 1996149 iterations) 762866 GFLOPS
75898^524288+1 2558647 digits 0 days 1.8 hours (0.80 ms/mul, 8499637 iterations) 3253965 GFLOPS
468750^524288+1 2973207 digits 0 days 2.1 hours (0.80 ms/mul, 9876777 iterations) 3785951 GFLOPS
710000^524288+1 3067745 digits 0 days 2.2 hours (0.80 ms/mul, 10190825 iterations) 3901391 GFLOPS
14^1048576+1 1201803 digits 0 days 1.8 hours (1.63 ms/mul, 3992299 iterations) 3127943 GFLOPS
75898^1048576+1 5117293 digits 0 days 7.5 hours (1.60 ms/mul, 16999276 iterations) 13082238 GFLOPS
468750^1048576+1 5946413 digits 0 days 8.8 hours (1.62 ms/mul, 19753555 iterations) 15344381 GFLOPS
700000^1048576+1 6129030 digits 0 days 9.0 hours (1.61 ms/mul, 20360194 iterations) 15757079 GFLOPS
14^2097152+1 2403605 digits 0 days 6.5 hours (2.94 ms/mul, 7984600 iterations) 11298690 GFLOPS
75898^2097152+1 10234585 digits 1 days 3.7 hours (2.94 ms/mul, 33998553 iterations) 48078355 GFLOPS
380742^2097152+1 11703432 digits 1 days 7.7 hours (2.94 ms/mul, 38877955 iterations) 54960022 GFLOPS
570000^2097152+1 12070945 digits 1 days 8.7 hours (2.94 ms/mul, 40098808 iterations) 56705090 GFLOPS
14^4194304+1 4807210 digits 1 days 1.2 hours (5.68 ms/mul, 15969202 iterations) 43659408 GFLOPS
1248^4194304+1 12986466 digits 2 days 19.7 hours (5.66 ms/mul, 43140102 iterations) 117384683 GFLOPS
10000^4194304+1 16777217 digits 3 days 15.8 hours (5.67 ms/mul, 55732704 iterations) 152105187 GFLOPS
50000^4194304+1 19708909 digits 4 days 6.8 hours (5.66 ms/mul, 65471576 iterations) 178148932 GFLOPS
150000^4194304+1 21710101 digits 4 days 17.3 hours (5.66 ms/mul, 72119391 iterations) 196272531 GFLOPS
309258^4194304+1 23028076 digits 5 days 0.2 hours (5.66 ms/mul, 76497608 iterations) 208261456 GFLOPS
480000^4194304+1 23828853 digits 5 days 4.8 hours (5.68 ms/mul, 79157734 iterations) 216188817 GFLOPS
14^8388608+1 9614419 digits 4 days 13.2 hours (12.31 ms/mul, 31938406 iterations) 189172009 GFLOPS
36^8388608+1 13055212 digits 6 days 4.4 hours (12.32 ms/mul, 43368473 iterations) 257018502 GFLOPS
100^8388608+1 16777217 digits 7 days 22.9 hours (12.33 ms/mul, 55732704 iterations) 330588895 GFLOPS
OpenCL:
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '320.49'.
14^32768+1 37557 digits 0 days 0.0 hours (0.09 ms/mul, 124758 iterations) 4810 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.08 ms/mul, 531226 iterations) 21164 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.09 ms/mul, 636255 iterations) 25974 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.09 ms/mul, 729201 iterations) 29341 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (0.10 ms/mul, 249517 iterations) 11544 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (0.10 ms/mul, 1062453 iterations) 48100 GFLOPS
710000^65536+1 383469 digits 0 days 0.0 hours (0.10 ms/mul, 1273852 iterations) 58682 GFLOPS
2500000^65536+1 419296 digits 0 days 0.0 hours (0.10 ms/mul, 1392868 iterations) 63973 GFLOPS
14^131072+1 150226 digits 0 days 0.0 hours (0.13 ms/mul, 499036 iterations) 31746 GFLOPS
75898^131072+1 639662 digits 0 days 0.0 hours (0.13 ms/mul, 2124908 iterations) 133718 GFLOPS
700000^131072+1 766129 digits 0 days 0.0 hours (0.13 ms/mul, 2545023 iterations) 160173 GFLOPS
1000000^131072+1 786432 digits 0 days 0.0 hours (0.13 ms/mul, 2612469 iterations) 164502 GFLOPS
14^262144+1 300451 digits 0 days 0.0 hours (0.28 ms/mul, 998074 iterations) 134199 GFLOPS
75898^262144+1 1279324 digits 0 days 0.3 hours (0.28 ms/mul, 4249818 iterations) 561808 GFLOPS
468750^262144+1 1486604 digits 0 days 0.3 hours (0.28 ms/mul, 4938388 iterations) 655122 GFLOPS
815000^262144+1 1549575 digits 0 days 0.3 hours (0.28 ms/mul, 5147574 iterations) 680615 GFLOPS
14^524288+1 600902 digits 0 days 0.2 hours (0.47 ms/mul, 1996149 iterations) 448292 GFLOPS
75898^524288+1 2558647 digits 0 days 1.0 hours (0.46 ms/mul, 8499637 iterations) 1880229 GFLOPS
468750^524288+1 2973207 digits 0 days 1.2 hours (0.46 ms/mul, 9876777 iterations) 2185183 GFLOPS
710000^524288+1 3067745 digits 0 days 1.3 hours (0.47 ms/mul, 10190825 iterations) 2289079 GFLOPS
14^1048576+1 1201803 digits 0 days 0.9 hours (0.87 ms/mul, 3992299 iterations) 1672437 GFLOPS
75898^1048576+1 5117293 digits 0 days 4.0 hours (0.86 ms/mul, 16999276 iterations) 7064447 GFLOPS
468750^1048576+1 5946413 digits 0 days 4.6 hours (0.86 ms/mul, 19753555 iterations) 8133229 GFLOPS
700000^1048576+1 6129030 digits 0 days 4.8 hours (0.86 ms/mul, 20360194 iterations) 8392488 GFLOPS
14^2097152+1 2403605 digits 0 days 3.6 hours (1.64 ms/mul, 7984600 iterations) 6282822 GFLOPS
75898^2097152+1 10234585 digits 0 days 15.2 hours (1.61 ms/mul, 33998553 iterations) 26393913 GFLOPS
380742^2097152+1 11703432 digits 0 days 17.4 hours (1.61 ms/mul, 38877955 iterations) 30144751 GFLOPS
570000^2097152+1 12070945 digits 0 days 17.9 hours (1.61 ms/mul, 40098808 iterations) 31091359 GFLOPS
14^4194304+1 4807210 digits 0 days 14.9 hours (3.37 ms/mul, 15969202 iterations) 25885496 GFLOPS
1248^4194304+1 12986466 digits 1 days 15.9 hours (3.33 ms/mul, 43140102 iterations) 69098536 GFLOPS
10000^4194304+1 16777217 digits 2 days 3.4 hours (3.32 ms/mul, 55732704 iterations) 89080719 GFLOPS
50000^4194304+1 19708909 digits 2 days 12.4 hours (3.33 ms/mul, 65471576 iterations) 104741598 GFLOPS
150000^4194304+1 21710101 digits 2 days 18.7 hours (3.33 ms/mul, 72119391 iterations) 115515517 GFLOPS
309258^4194304+1 23028076 digits 2 days 22.5 hours (3.32 ms/mul, 76497608 iterations) 122234125 GFLOPS
480000^4194304+1 23828853 digits 3 days 0.9 hours (3.32 ms/mul, 79157734 iterations) 126370244 GFLOPS
14^8388608+1 9614419 digits 2 days 14.4 hours (7.04 ms/mul, 31938406 iterations) 108104750 GFLOPS
36^8388608+1 13055212 digits 3 days 12.6 hours (7.02 ms/mul, 43368473 iterations) 146522220 GFLOPS
100^8388608+1 16777217 digits 4 days 12.2 hours (6.99 ms/mul, 55732704 iterations) 187490914 GFLOPS
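As a sanity check on the estimates, the time column appears to be simply ms/mul multiplied by the iteration count. The Python sketch below assumes that relation (it is inferred from the numbers above, not taken from the Genefer source) and checks it on the last line of each table, 100^8388608+1. The function names are hypothetical.

```python
def estimated_hours(ms_per_mul: float, iterations: int) -> float:
    """Estimated run time in hours, assuming time = ms/mul * iterations."""
    return ms_per_mul * iterations / 1000.0 / 3600.0


def to_hours(days: int, hours: float) -> float:
    """Convert the 'D days H hours' format of the log into hours."""
    return days * 24 + hours


# Values for 100^8388608+1 copied from the CUDA and OpenCL tables above.
iterations = 55732704
cuda = estimated_hours(12.33, iterations)
ocl = estimated_hours(6.99, iterations)

print(f"CUDA:   {cuda:.1f} h computed, {to_hours(7, 22.9):.1f} h in the log")
print(f"OpenCL: {ocl:.1f} h computed, {to_hours(4, 12.2):.1f} h in the log")
print(f"CUDA / OpenCL = {cuda / ocl:.2f}")
```

Both computed values agree with the log to the printed precision, and the ratio at this size is about 1.76 in favour of OpenCL.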
| |
|
Michael Goetz Volunteer moderator Project administrator Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
I tuned geneferocl. It's a bit faster on my computer for large exponents.
I committed the new parameters on assembla, in the branch [...]\branches\yves\2013\OclGenefer.
I think that it is faster on a 'Tahiti', but only a real experiment can answer the question. Could someone please compile it and run the bench on a HD79x0?
Mike, could you also test it on your GTX 460? I don't know why, but your card doesn't like my algorithm :o) If it is the card that found 75898^524288+1 with GeneferCUDA, I understand it!
If the new tuning is faster on Tahiti and Fermi, I will update the real genefer.
Thanks, Yves
Here's the new code on my 460:
OclGenefer 2013-08-16, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'NVIDIA CUDA': GPU device 'GeForce GTX 460' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'Intel(R) Core(TM)2 Quad CPU @ 2.40GHz' found.
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Clock frequency = 1350 MHz, compute units = 7.
Global mem size = 1024 MB, cache size = 112 kB (ReadWrite), cache line size = 128 Bytes.
Local mem size = 48 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 1024.
2199064^8192+1 Time: 83 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 125 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 166 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 304 us/mul. Err: 0.2188 398482 digits
984108^131072+1 Time: 520 us/mul. Err: 0.2422 785521 digits
804904^262144+1 Time: 1.05 ms/mul. Err: 0.2178 1548156 digits
658332^524288+1 Time: 2.02 ms/mul. Err: 0.2256 3050541 digits
538452^1048576+1 Time: 4.24 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 8.83 ms/mul. Err: 0.2305 11836006 digits
360204^4194304+1 Time: 18.6 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 39.6 ms/mul. Err: 0.1973 45879398 digits
EDIT: This is slightly faster than the previous versions.
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
If anyone has BOTH an Nvidia and an AMD GPU in the same system, I'd like to know which one GeneferOCL chooses.
____________
My lucky number is 75898^524288+1 | |
|
|
Here are some benchmarks for some Fermis.
These are on AMD PII 1100T systems @ 3.8 GHz, except the 570, which is on a 2600K @ 4.3 GHz.
All GPUs are at stock speed.
GeForce GTX 470
C:\>genefercuda-windows.exe -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 111 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 119 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 177 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 271 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 527 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 742 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.35 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.5 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 4.77 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 10 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 20.9 ms/mul. Err: 0.1797 45879398 digits
C:\>geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 470', version 'OpenCL 1.1 CUDA' and driver '320.18'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 90.3 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 98.9 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 134 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 195 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 322 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 654 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.19 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 2.46 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 5 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 10.5 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 22.2 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 483.
GeForce GTX 560ti
C:\>genefercuda-windows.exe -d 1 -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -d 1 -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 97.7 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 111 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 150 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 239 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 444 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 752 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.48 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.97 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 6.09 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 12.3 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 25.6 ms/mul. Err: 0.1797 45879398 digits
C:\>geneferocl-windows.exe -d 1 -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -d 1 -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 560 Ti', version 'OpenCL 1.1 CUDA' and driver '320.18'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 68.4 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 73.2 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 106 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 190 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 366 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 791 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.58 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 3.28 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 6.95 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 14.5 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 30.3 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 353.
GeForce GTX 570
C:\>genefercuda-windows.exe -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 91.6 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 98 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 145 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 226 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 410 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 596 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.09 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.03 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 3.91 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 8.13 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 16.7 ms/mul. Err: 0.1797 45879398 digits
C:\>geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 570', version 'OpenCL 1.1
CUDA' and driver '310.70'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 81.8 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 88.8 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 118 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 170 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 273 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 527 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 957 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.91 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 3.91 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 8.2 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 17.3 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 618.
GeForce GTX 580
C:\>genefercuda-windows.exe -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 98.4 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 94.8 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 114 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 159 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 278 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 489 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 867 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 1.61 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 3.09 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 6.42 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 13.3 ms/mul. Err: 0.1797 45879398 digits
C:\>geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 580', version 'OpenCL 1.1
CUDA' and driver '314.22'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 67.3 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 71.8 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 74.5 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 112 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 186 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 396 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 752 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.52 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 3.13 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 6.64 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 14.2 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 767.
If any specific testing is needed, let me know.
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Here are some benchmarks for some Fermis.
These are on AMD PII 1100T systems @ 3.8 GHz, except the 570, which is on a 2600K @ 4.3 GHz.
All GPUs at stock speed
Interesting. On my machine, I saw identical speeds with both 32 and 64 bits. On yours, with a variety of Fermi GPUs, the 32-bit builds are all slightly faster.
One difference: Core2 on my system, Phenom II on yours.
Looks like yet another reason to use a 32 bit build.
____________
My lucky number is 75898^524288+1 | |
|
|
I do have a second 560ti, the exact same Zotac model as in the first bench, but in my 2600K (Sandy Bridge) system. The benchmarks there are slightly slower. Here's the bench from the 2600K, with the PII 1100T below it for comparison.
GTX 560ti on 2600K
C:\>genefercuda-windows.exe -d 1 -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -d 1 -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 97.7 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 112 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 151 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 243 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 449 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 771 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.5 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 3.01 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 6.25 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 12.5 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 26.1 ms/mul. Err: 0.1797 45879398 digits
C:\>geneferocl-windows.exe -d 1 -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -d 1 -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 560 Ti', version 'OpenCL
1.1 CUDA' and driver '310.70'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 69.6 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 74.2 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 107 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 195 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 374 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 806 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.59 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 3.32 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 6.95 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 14.6 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 30.6 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 350.
GTX 560ti on 1100T
C:\>genefercuda-windows.exe -d 1 -b
genefercuda 3.1.2-2 (Windows 32-bit CUDA)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercuda-windows.exe -d 1 -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 97.7 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 111 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 150 us/mul. Err: 0.2031 202102 digits
1203210^65536+1 Time: 239 us/mul. Err: 0.2031 398482 digits
984108^131072+1 Time: 444 us/mul. Err: 0.2070 785521 digits
804904^262144+1 Time: 752 us/mul. Err: 0.2031 1548156 digits
658332^524288+1 Time: 1.48 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.97 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 6.09 ms/mul. Err: 0.1953 11836006 digits
360204^4194304+1 Time: 12.3 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 25.6 ms/mul. Err: 0.1797 45879398 digits
C:\>geneferocl-windows.exe -d 1 -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -d 1 -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 560 Ti', version 'OpenCL
1.1 CUDA' and driver '320.18'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 68.4 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 73.2 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 106 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 190 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 366 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 791 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.58 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 3.28 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 6.95 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 14.5 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 30.3 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 353.
I wonder if it's the CPU, driver version, GPU chip quality or something else. I'll try a different driver version a little later.
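To put a number on "slightly slower", the two bench listings can be compared line by line. The following is only a minimal sketch: the function names `parse_bench` and `slowdown` are made up for illustration, and the regular expression assumes the "-b" output format exactly as pasted above ("Time: <value> us/mul." or "ms/mul.").

```python
import re

# Matches bench lines such as:
#   538452^1048576+1 Time: 3.28 ms/mul. Err: 0.2266 6009544 digits
BENCH_LINE = re.compile(
    r"(?P<b>\d+)\^(?P<n>\d+)\+1\s+Time:\s+(?P<t>[\d.]+)\s+(?P<unit>us|ms)/mul\."
)


def parse_bench(text):
    """Return {N: microseconds per multiplication} for every bench line found."""
    times = {}
    for m in BENCH_LINE.finditer(text):
        scale = 1000.0 if m["unit"] == "ms" else 1.0  # normalise ms to us
        times[int(m["n"])] = float(m["t"]) * scale
    return times


def slowdown(reference, other):
    """Ratio other/reference per N, for sizes present in both runs (>1 = slower)."""
    return {n: other[n] / reference[n] for n in sorted(reference.keys() & other.keys())}


# Two geneferocl lines from each run, copied from the listings above.
ON_1100T = """
538452^1048576+1 Time: 3.28 ms/mul. Err: 0.2266 6009544 digits
294612^8388608+1 Time: 30.3 ms/mul. Err: 0.1895 45879398 digits
"""
ON_2600K = """
538452^1048576+1 Time: 3.32 ms/mul. Err: 0.2266 6009544 digits
294612^8388608+1 Time: 30.6 ms/mul. Err: 0.1895 45879398 digits
"""

for n, ratio in slowdown(parse_bench(ON_1100T), parse_bench(ON_2600K)).items():
    print(f"N={n}: 2600K run takes {ratio:.3f}x the time of the 1100T run")
```

For these two sizes the 2600K run comes out roughly 1% slower, which is small enough that driver version or run-to-run noise could plausibly account for it.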
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
| |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
I tuned geneferocl. It's a bit faster on my computer for large exponents.
Full test run with newly tuned geneferocl, revision 394. HD7970GHz GPU, CPU is 3.5 GHz AMD Phenom II X6 1100T:
It seems the higher exponents are slower and some lower exponents are faster. Can the GeneferOCL program optimise its parameters at run time during initialisation? | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
The latest OpenCL beta build is available for download. Its version number is 3.1.2-1, and can be downloaded via this thread.
Benching geneferocl 3.1.2-1 Windows 64-bit
HD7970GHz on X6 1100T
C:\>geneferocl-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 80 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 77.1 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 78.1 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 83.7 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 129 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 335 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 791 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.52 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 2.8 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 5.36 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 10.8 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 892.
C:\>geneferocl-windows.exe -b3
geneferocl 3.1.2-1 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
14^32768+1 37557 digits 0 days 0.0 hours (0.09 ms/mul, 124758 iterations) 5291 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.08 ms/mul, 531226 iterations) 19721 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.08 ms/mul, 636255 iterations) 23569 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.08 ms/mul, 729201 iterations) 26936 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (0.08 ms/mul, 249517 iterations) 9139 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (0.09 ms/mul, 1062453 iterations) 47619 GFLOPS
710000^65536+1 383469 digits 0 days 0.0 hours (0.08 ms/mul, 1273852 iterations) 47619 GFLOPS
2500000^65536+1 419296 digits 0 days 0.0 hours (0.11 ms/mul, 1392868 iterations) 72631 GFLOPS
14^131072+1 150226 digits 0 days 0.0 hours (0.14 ms/mul, 499036 iterations) 33189 GFLOPS
75898^131072+1 639662 digits 0 days 0.0 hours (0.14 ms/mul, 2124908 iterations) 143819 GFLOPS
700000^131072+1 766129 digits 0 days 0.0 hours (0.12 ms/mul, 2545023 iterations) 151515 GFLOPS
1000000^131072+1 786432 digits 0 days 0.0 hours (0.13 ms/mul, 2612469 iterations) 156806 GFLOPS
14^262144+1 300451 digits 0 days 0.0 hours (0.33 ms/mul, 998074 iterations) 157287 GFLOPS
75898^262144+1 1279324 digits 0 days 0.4 hours (0.34 ms/mul, 4249818 iterations) 702741 GFLOPS
468750^262144+1 1486604 digits 0 days 0.4 hours (0.31 ms/mul, 4938388 iterations) 740740 GFLOPS
815000^262144+1 1549575 digits 0 days 0.4 hours (0.31 ms/mul, 5147574 iterations) 772486 GFLOPS
14^524288+1 600902 digits 0 days 0.4 hours (0.89 ms/mul, 1996149 iterations) 853294 GFLOPS
75898^524288+1 2558647 digits 0 days 1.9 hours (0.81 ms/mul, 8499637 iterations) 3315533 GFLOPS
468750^524288+1 2973207 digits 0 days 2.3 hours (0.87 ms/mul, 9876777 iterations) 4151992 GFLOPS
710000^524288+1 3067745 digits 0 days 2.2 hours (0.81 ms/mul, 10190825 iterations) 3974984 GFLOPS
Fast, but not as fast as a TITAN. | |
|
|
My first WU with Genefer OpenCL on my 7950 is complete and validated:
http://www.primegrid.com/workunit.php?wuid=343200843
It took 27,721 sec. | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
I am trying to get geneferOCL working with BOINC. I've never used app_info.xml, never had a need for it.
I created an app_info.xml file, the same as DeleteNull's, except that I have <platform>windows_x86_64</platform>
Then I placed the file in C:\ProgramData\BOINC\projects\www.primegrid.com, along with the geneferocl-windows.exe
Then I set Primegrid preferences for short GFN, but there is no GFN option for ATI.
When I start BOINC it downloads some GFN for CPU. If I enable ATI in the preferences it starts PPS Sieve on ATI.
Have I missed a step? What should I be setting the Primegrid preferences to? | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
I am trying to get geneferOCL working with BOINC. I've never used app_info.xml, never had a need for it.
I created an app_info.xml file, the same as DeleteNull's, except that I have <platform>windows_x86_64</platform>
Then I placed the file in C:\ProgramData\BOINC\projects\www.primegrid.com, along with the geneferocl-windows.exe
Then I set Primegrid preferences for short GFN, but there is no GFN option for ATI.
When I start BOINC it downloads some GFN for CPU. If I enable ATI in the preferences it starts PPS Sieve on ATI.
Have I missed a step? What should I be setting the Primegrid preferences to?
You had it right the first time. You need to tell the server you're running a CPU Genefer. There's no ATI option (yet) on the server.
It will send you "CPU tasks", but a task is just a task. App_info overrides the "how it's run" part, and tells your computer to use GeneferOCL instead of what the server is telling it to do.
Tasks sent to a computer consist of two parts: The "what needs to be crunched" part, and the "how does it get crunched" part. App_info redefines the "how does it get crunched" portion, so all the server is really sending is "what needs to be crunched." It doesn't actually matter (much) that it's running on an ATI instead of a CPU.
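For readers who have never written an app_info.xml, here is a minimal sketch of the "how does it get crunched" part, generated with Python's standard library. It is not DeleteNull's file (that one is not reproduced in this thread). The element names are those of BOINC's anonymous platform mechanism; the app name "example_gfn_app" and the version number 100 are placeholders and must be replaced by whatever the server actually uses. A real GPU setup may need further elements that are not shown here.

```python
import xml.etree.ElementTree as ET


def build_app_info(app_name, exe_name, version_num, platform):
    """Build a bare-bones BOINC anonymous-platform app_info tree."""
    root = ET.Element("app_info")

    # Declares which server-side application this file overrides.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name

    # Declares the local executable that replaces the server's binary.
    file_info = ET.SubElement(root, "file_info")
    ET.SubElement(file_info, "name").text = exe_name
    ET.SubElement(file_info, "executable")

    # Ties the application to the executable for one platform.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "version_num").text = str(version_num)
    ET.SubElement(version, "platform").text = platform
    file_ref = ET.SubElement(version, "file_ref")
    ET.SubElement(file_ref, "file_name").text = exe_name
    ET.SubElement(file_ref, "main_program")
    return root


# "example_gfn_app" and 100 are hypothetical values, not PrimeGrid's real ones.
tree = build_app_info("example_gfn_app", "geneferocl-windows.exe", 100, "windows_x86_64")
if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
    ET.indent(tree)
print(ET.tostring(tree, encoding="unicode"))
```

The server still decides what gets crunched; the file above only tells the client which local binary to run for that application.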
____________
My lucky number is 75898524288+1 | |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
HD7970GHz The higher exponents are slower and some lower exponents are faster. Can the GeneferOCL program optimise its parameters at run time during initialisation?
Yes, it should, because with the new parameters GeneferOCL is faster on NVidia and slower on ATI (for N = 1048576 and 4194304).
Then, the next step is a self-tuning transform...
| |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
You had it right the first time. You need to tell the server you're running a CPU Genefer. There's no ATI option (yet) on the server.
It will send you "CPU tasks", but a task is just a task. App_info overrides the "how it's run" part, and tells your computer to use GeneferOCL instead of what the server is telling it to do.
Tasks sent to a computer consist of two parts: The "what needs to be crunched" part, and the "how does it get crunched" part. App_info redefines the "how does it get crunched" portion, so all the server is really sending is "what needs to be crunched." It doesn't actually matter (much) that it's running on an ATI instead of a CPU.
OK. BOINC is now reading my app_info file. Trick was to Remove Primegrid project, close BOINC, create www.primegrid.com directory and place app_info.xml file in it, restart BOINC, and add Primegrid project.
In the Primegrid preferences I have set to use CPU only and get only Generalized Fermat Prime Search (short) tasks.
When I update the Primegrid project it's not downloading any tasks. In the BOINC Event log I get this:
17/08/2013 9:27:01 PM | PrimeGrid | update requested by user
17/08/2013 9:27:03 PM | PrimeGrid | Sending scheduler request: Requested by user.
17/08/2013 9:27:03 PM | PrimeGrid | Requesting new tasks for CPU
17/08/2013 9:27:09 PM | PrimeGrid | Scheduler request completed: got 0 new tasks
17/08/2013 9:27:09 PM | PrimeGrid | No tasks sent
17/08/2013 9:27:09 PM | PrimeGrid | No tasks are available for Genefer
17/08/2013 9:27:09 PM | PrimeGrid | Tasks for NVIDIA GPU are available, but your preferences are set to not accept them
17/08/2013 9:27:09 PM | PrimeGrid | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
17/08/2013 9:27:09 PM | PrimeGrid | Project has no tasks available
It also whines "Your app_info.xml file doesn't have a usable version of..." for the 13 other sub-projects. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Yves,
I put self-tuning ability into the CUDA version earlier this year. It's particularly helpful when running on hardware very different from what we have available to test.
The CUDA tuning (essentially a full "-b2 N" test) is only run once at the start of the test, and the result saved in a file for use during restarts. Tuning (aka "shift") can also be set from the command line, and can be specified from the PrimeGrid preferences.
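The tune-once-then-reuse scheme described above can be sketched as follows. This is not the GeneferCUDA code. The function name `choose_shift`, the JSON cache file and the `time_one_mul` callback are assumptions made for the illustration; only the idea (time every candidate once, save the winner for restarts, let a command-line value take precedence) comes from the description.

```python
import json
import os


def choose_shift(candidates, time_one_mul, cache_path, override=None):
    """Pick the tuning parameter ("shift") to use for this test.

    candidates   -- iterable of shift values to try (hypothetical values)
    time_one_mul -- callable returning the measured time per multiplication
                    for a given shift (stands in for the real benchmark)
    cache_path   -- file where the winning shift is kept across restarts
    override     -- value forced by the user, e.g. from the command line
    """
    # A user-supplied value always wins and is not written to the cache.
    if override is not None:
        return override

    # On a restart, reuse the earlier result instead of benchmarking again.
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)["shift"]

    # First start: time every candidate once and keep the fastest.
    best = min(candidates, key=time_one_mul)
    with open(cache_path, "w") as f:
        json.dump({"shift": best}, f)
    return best
```

Passing a deliberately slower value as `override` is how a user would "de-tune" in this sketch, trading some throughput for less screen lag.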
If you want, we can do that with OCL too. I didn't try OCL with N=22 tests, but there may be some value in users "de-tuning" OCL a bit to improve screen lag.
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
OK. BOINC is now reading my app_info file. Trick was to Remove Primegrid project, close BOINC, create www.primegrid.com directory and place app_info.xml file in it, restart BOINC, and add Primegrid project.
It's not necessary to remove PrimeGrid. Just put the app_info file in there and restart the BOINC client. It's best to reboot to make sure, since the BOINC client runs in the background -- it's NOT the program you're interacting with on the screen.
In the Primegrid preferences I have set to use CPU only and get only Generalized Fermat Prime Search (short) tasks.
When I update the Primegrid project it's not downloading any tasks. In the BOINC Event log I get this:
I'm not sure what's going on, because I'm looking at your preferences and they're not set that way.
On the "---" venue, you have SGS and PPS Sieve (ATI & CUDA) selected (and CPU and ATI enabled). On the "work" venue you have SR5 and PPS Sieve enabled (and CPU/ATI/CUDA processing enabled).
On neither venue is GFN enabled. I'm assuming you've turned it off?
____________
My lucky number is 75898^524288+1 | |
|
|
It seems that a GFN task takes up to 20+ hours on a HD5870 (based on 5% done in 1 hour)
Running the -b3 test on an AMD HD5870 in combination with Windows x64 and an Intel Core i7 860
PS C:\Users\E> .\geneferocl-windows.exe -b3
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: C:\Users\E\geneferocl-windows.exe -b3
Running on platform 'AMD Accelerated Parallel Processing', device 'Cypress', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
14^32768+1 37557 digits 0 days 0.0 hours (0.17 ms/mul, 124758 iterations) 10101 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.14 ms/mul, 531226 iterations) 37037 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.14 ms/mul, 636255 iterations) 43290 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.14 ms/mul, 729201 iterations) 50505 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (0.24 ms/mul, 249517 iterations) 28379 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (0.24 ms/mul, 1062453 iterations) 120250 GFLOPS
710000^65536+1 383469 digits 0 days 0.0 hours (0.26 ms/mul, 1273852 iterations) 157287 GFLOPS
2500000^65536+1 419296 digits 0 days 0.0 hours (0.23 ms/mul, 1392868 iterations) 155363 GFLOPS
14^131072+1 150226 digits 0 days 0.0 hours (0.53 ms/mul, 499036 iterations) 126022 GFLOPS
75898^131072+1 639662 digits 0 days 0.2 hours (0.50 ms/mul, 2124908 iterations) 513708 GFLOPS
700000^131072+1 766129 digits 0 days 0.3 hours (0.51 ms/mul, 2545023 iterations) 626743 GFLOPS
1000000^131072+1 786432 digits 0 days 0.3 hours (0.50 ms/mul, 2612469 iterations) 626743 GFLOPS
14^262144+1 300451 digits 0 days 0.2 hours (0.94 ms/mul, 998074 iterations) 453583 GFLOPS
75898^262144+1 1279324 digits 0 days 1.0 hours (0.93 ms/mul, 4249818 iterations) 1894659 GFLOPS
468750^262144+1 1486604 digits 0 days 1.2 hours (0.92 ms/mul, 4938388 iterations) 2192398 GFLOPS
815000^262144+1 1549575 digits 0 days 1.3 hours (0.92 ms/mul, 5147574 iterations) 2285231 GFLOPS
14^524288+1 600902 digits 0 days 1.0 hours (1.89 ms/mul, 1996149 iterations) 1816256 GFLOPS
75898^524288+1 2558647 digits 0 days 4.5 hours (1.95 ms/mul, 8499637 iterations) 7951411 GFLOPS
468750^524288+1 2973207 digits 0 days 5.0 hours (1.85 ms/mul, 9876777 iterations) 8788832 GFLOPS
710000^524288+1 3067745 digits 0 days 5.2 hours (1.86 ms/mul, 10190825 iterations) 9122165 GFLOPS
14^1048576+1 1201803 digits 0 days 4.6 hours (4.21 ms/mul, 3992299 iterations) 8093787 GFLOPS
75898^1048576+1 5117293 digits 0 days 19.4 hours (4.11 ms/mul, 16999276 iterations) 33606027 GFLOPS
468750^1048576+1 5946413 digits 0 days 22.5 hours (4.11 ms/mul, 19753555 iterations) 39041327 GFLOPS
700000^1048576+1 6129030 digits 0 days 23.2 hours (4.11 ms/mul, 20360194 iterations) 40289041 GFLOPS
14^2097152+1 2403605 digits 0 days 19.6 hours (8.85 ms/mul, 7984600 iterations) 34004295 GFLOPS
75898^2097152+1 10234585 digits 3 days 10.0 hours (8.69 ms/mul, 33998553 iterations) 142110007 GFLOPS
380742^2097152+1 11703432 digits 3 days 21.9 hours (8.70 ms/mul, 38877955 iterations) 162711237 GFLOPS
570000^2097152+1 12070945 digits 4 days 0.7 hours (8.69 ms/mul, 40098808 iterations) 167550578 GFLOPS
14^4194304+1 4807210 digits 3 days 14.8 hours (19.58 ms/mul, 15969202 iterations) 150366853 GFLOPS
1248^4194304+1 12986466 digits 9 days 14.8 hours (19.26 ms/mul, 43140102 iterations) 399673001 GFLOPS
10000^4194304+1 16777217 digits 12 days 10.8 hours (19.31 ms/mul, 55732704 iterations) 517570911 GFLOPS
50000^4194304+1 19708909 digits 14 days 13.0 hours (19.20 ms/mul, 65471576 iterations) 604485206 GFLOPS
150000^4194304+1 21710101 digits 16 days 1.7 hours (19.26 ms/mul, 72119391 iterations) 668048875 GFLOPS
309258^4194304+1 23028076 digits 17 days 1.1 hours (19.25 ms/mul, 76497608 iterations) 708420648 GFLOPS
480000^4194304+1 23828853 digits 17 days 13.5 hours (19.17 ms/mul, 79157734 iterations) 730009371 GFLOPS
14^8388608+1 9614419 digits 16 days 10.5 hours (44.47 ms/mul, 31938406 iterations) 683195084 GFLOPS
36^8388608+1 13055212 digits 22 days 4.9 hours (44.24 ms/mul, 43368473 iterations) 922773007 GFLOPS
100^8388608+1 16777217 digits 28 days 7.7 hours (43.91 ms/mul, 55732704 iterations) 1177114263 GFLOPS
____________
Member of the Dutch Power Cows
My Stats | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Running the -b3 test on an AMD HD5870 in combination with Windows x64 and an Intel Core i7 860
The -b test is a better, more concise, and much faster test and is better for looking at relative performance. The -b3 test was put in there mostly as fluff, and to give you an estimate of run times for complete computations.
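The run-time estimates in the -b3 listings can be reproduced by hand: the estimate is simply the iteration count multiplied by the time per multiplication. Here is a small sketch with made-up helper names. The digit count is exact for b^N+1, whereas the iteration count is only approximated as N*log2(b); the figures printed by -b3 in this thread appear to be within about two iterations of that value, but the exact formula Genefer uses is not shown here.

```python
import math


def digits(b, n):
    """Number of decimal digits of b^n + 1."""
    return math.floor(n * math.log10(b)) + 1


def approx_iterations(b, n):
    """Rough iteration count: about one squaring per bit of b^n (an approximation)."""
    return round(n * math.log2(b))


def hours(iterations, ms_per_mul):
    """Estimated run time in hours for the given iteration count."""
    return iterations * ms_per_mul / 1000.0 / 3600.0


# HD5870 line from the listing above:
#   75898^1048576+1 5117293 digits 0 days 19.4 hours (4.11 ms/mul, 16999276 iterations)
print(digits(75898, 1048576))
print(approx_iterations(75898, 1048576))
print(round(hours(16999276, 4.11), 1))
```

Since ms/mul is itself rounded in the listing, an estimate recomputed this way can differ from the printed one in the last digit.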
____________
My lucky number is 75898^524288+1 | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
The latest version of GeneferOCL (3.1.2-3) incorporating Yves' most recent changes can now be downloaded via this thread.
It's faster. You want it.
____________
My lucky number is 75898^524288+1 | |
|
|
Not much effect on Titan it seems.
New drivers fixed my Genefer crashes so leaving the Titan to crunch over night with BOINC & geneferocl, will see the results tomorrow.
All test runs I've made have been ok (no crashes or bailouts because of errors).
-----
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '326.41'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 74.3 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 76.2 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 80 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 95.2 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 129 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 274 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 486 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 973 us/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 1.7 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 3.41 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 6.81 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 117.
-----
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b3
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '326.41'.
14^32768+1 37557 digits 0 days 0.0 hours (0.10 ms/mul, 124758 iterations) 294 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.10 ms/mul, 531226 iterations) 1253 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.08 ms/mul, 636255 iterations) 1501 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.08 ms/mul, 729201 iterations) 1720 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (0.10 ms/mul, 249517 iterations) 1243 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (0.10 ms/mul, 1062453 iterations) 5292 GFLOPS
710000^65536+1 383469 digits 0 days 0.0 hours (0.10 ms/mul, 1273852 iterations) 6345 GFLOPS
2500000^65536+1 419296 digits 0 days 0.0 hours (0.10 ms/mul, 1392868 iterations) 6938 GFLOPS
14^131072+1 150226 digits 0 days 0.0 hours (0.14 ms/mul, 499036 iterations) 5233 GFLOPS
75898^131072+1 639662 digits 0 days 0.0 hours (0.14 ms/mul, 2124908 iterations) 22281 GFLOPS
700000^131072+1 766129 digits 0 days 0.0 hours (0.14 ms/mul, 2545023 iterations) 26687 GFLOPS
1000000^131072+1 786432 digits 0 days 0.1 hours (0.14 ms/mul, 2612469 iterations) 27394 GFLOPS
14^262144+1 300451 digits 0 days 0.0 hours (0.27 ms/mul, 998074 iterations) 21978 GFLOPS
75898^262144+1 1279324 digits 0 days 0.3 hours (0.27 ms/mul, 4249818 iterations) 93581 GFLOPS
468750^262144+1 1486604 digits 0 days 0.3 hours (0.26 ms/mul, 4938388 iterations) 108744 GFLOPS
815000^262144+1 1549575 digits 0 days 0.3 hours (0.28 ms/mul, 5147574 iterations) 113350 GFLOPS
14^524288+1 600902 digits 0 days 0.2 hours (0.49 ms/mul, 1996149 iterations) 92097 GFLOPS
75898^524288+1 2558647 digits 0 days 1.1 hours (0.48 ms/mul, 8499637 iterations) 392151 GFLOPS
468750^524288+1 2973207 digits 0 days 1.3 hours (0.50 ms/mul, 9876777 iterations) 455688 GFLOPS
710000^524288+1 3067745 digits 0 days 1.3 hours (0.48 ms/mul, 10190825 iterations) 470178 GFLOPS
14^1048576+1 1201803 digits 0 days 0.9 hours (0.90 ms/mul, 3992299 iterations) 385133 GFLOPS
75898^1048576+1 5117293 digits 0 days 4.1 hours (0.88 ms/mul, 16999276 iterations) 1639903 GFLOPS
468750^1048576+1 5946413 digits 0 days 4.8 hours (0.89 ms/mul, 19753555 iterations) 1905606 GFLOPS
700000^1048576+1 6129030 digits 0 days 5.0 hours (0.89 ms/mul, 20360194 iterations) 1964127 GFLOPS
14^2097152+1 2403605 digits 0 days 3.7 hours (1.67 ms/mul, 7984600 iterations) 1607512 GFLOPS
75898^2097152+1 10234585 digits 0 days 15.6 hours (1.65 ms/mul, 33998553 iterations) 6844813 GFLOPS
380742^2097152+1 11703432 digits 0 days 17.8 hours (1.65 ms/mul, 38877955 iterations) 7827166 GFLOPS
570000^2097152+1 12070945 digits 0 days 18.2 hours (1.64 ms/mul, 40098808 iterations) 8072956 GFLOPS
14^4194304+1 4807210 digits 0 days 15.1 hours (3.42 ms/mul, 15969202 iterations) 6697969 GFLOPS
1248^4194304+1 12986466 digits 1 days 16.3 hours (3.37 ms/mul, 43140102 iterations) 18094270 GFLOPS
10000^4194304+1 16777217 digits 2 days 4.4 hours (3.38 ms/mul, 55732704 iterations) 23375990 GFLOPS
50000^4194304+1 19708909 digits 2 days 13.2 hours (3.37 ms/mul, 65471576 iterations) 27460769 GFLOPS
150000^4194304+1 21710101 digits 2 days 19.8 hours (3.39 ms/mul, 72119391 iterations) 30249065 GFLOPS
309258^4194304+1 23028076 digits 2 days 23.9 hours (3.38 ms/mul, 76497608 iterations) 32085422 GFLOPS
480000^4194304+1 23828853 digits 3 days 2.1 hours (3.37 ms/mul, 79157734 iterations) 33201160 GFLOPS
14^8388608+1 9614419 digits 2 days 15.3 hours (7.14 ms/mul, 31938406 iterations) 27863552 GFLOPS
36^8388608+1 13055212 digits 3 days 13.6 hours (7.11 ms/mul, 43368473 iterations) 37835316 GFLOPS
100^8388608+1 16777217 digits 4 days 13.8 hours (7.10 ms/mul, 55732704 iterations) 48622060 GFLOPS
| |
|
|
Husu*, you forgot to abort the task on previous host 400849 and now it is lost - it is very bad. | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2380 ID: 1178 Credit: 17,888,653,435 RAC: 9,307,576
|
Slower overall than geneferCUDA on GT-440 OEM card (64-bit Vista in i7-920):
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GT 440', version 'OpenCL 1.1 CUDA' and driver '314.07'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 120 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 169 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 311 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 583 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 1.05 ms/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 2.24 ms/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 4.47 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 9.36 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 20 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 42.5 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 92.7 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 10.
Command line: genefercuda.exe -b
Generalized Fermat Number Bench
2009574^8192+1 Time: 192 us/mul. Err: 0.1719 51636 digits
1632282^16384+1 Time: 194 us/mul. Err: 0.1563 101791 digits
1325824^32768+1 Time: 328 us/mul. Err: 0.1563 200622 digits
1076904^65536+1 Time: 583 us/mul. Err: 0.1602 395325 digits
874718^131072+1 Time: 1.13 ms/mul. Err: 0.1563 778813 digits
710492^262144+1 Time: 2.04 ms/mul. Err: 0.1641 1533952 digits
577098^524288+1 Time: 4.29 ms/mul. Err: 0.1875 3020555 digits
468750^1048576+1 Time: 8.71 ms/mul. Err: 0.1563 5946413 digits
380742^2097152+1 Time: 18.4 ms/mul. Err: 0.1484 11703432 digits
309258^4194304+1 Time: 36.7 ms/mul. Err: 0.1719 23028076 digits
100^8388608+1 Time: 76.4 ms/mul. Err: 0.0000 16777217 digits
EDIT: Should note that all 8 threads (HT is on) were loaded and running a combination of TRP sieve and PPS LLR | |
|
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2380 ID: 1178 Credit: 17,888,653,435 RAC: 9,307,576
|
...But faster on GTX 650 Ti (AMD 1100T, 64-bit Win7 with all cores under the same load as my i7-920):
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 650 Ti', version 'OpenCL 1.1 CUDA' and driver '311.06'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 130 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 147 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 201 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 318 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 652 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 1.29 ms/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 2.51 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 5 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 10.2 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 20.9 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 43.1 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 20.
Command line: genefercuda-windows.exe -b
Generalized Fermat Number Bench
2199064^8192+1 Time: 189 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 236 us/mul. Err: 0.2188 102481 digits
1471094^32768+1 Time: 266 us/mul. Err: 0.2500 202102 digits
1203210^65536+1 Time: 416 us/mul. Err: 0.2352 398482 digits
984108^131072+1 Time: 791 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 1.37 ms/mul. Err: 0.2227 1548156 digits
658332^524288+1 Time: 2.55 ms/mul. Err: 0.2500 3050541 digits
538452^1048576+1 Time: 5.46 ms/mul. Err: 0.2031 6009544 digits
440400^2097152+1 Time: 10.7 ms/mul. Err: 0.2051 11836006 digits
360204^4194304+1 Time: 21.5 ms/mul. Err: 0.2167 23305854 digits
294612^8388608+1 Time: 47.5 ms/mul. Err: 0.1797 45879398 digits
| |
|
|
If anyone has BOTH an Nvidia and an AMD GPU in the same system, I'd like to know which one GeneferOCL chooses.
In a week's time (hopefully), I'll be able to answer you. I am building a machine with an Nvidia card for Genefer and an AMD card for other projects, and parts are still on their way to me.
____________
| |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
A new version of oCLgenefer with self-tuning is available on assembla [...]\branches\yves\2013\OclGenefer.
In OpenCL, profiling of commands can be enabled on the command queue. I used it; it's fast and easy to use.
I printed the two parameters, just out of curiosity.
According to the NVidia guide, they should be a multiple of 32 (the warp size), and according to the ATI guide, a multiple of 64 (the wavefront size)... but that's just the theory.
Benches on Tahiti and Fermi are welcome.
Thanks, Yves
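The self-tuning idea described above can be sketched roughly as follows. This is only an illustration, not the OclGenefer code: `time_kernel` is a hypothetical callback standing in for "run the transform once with these local work sizes and return the elapsed time", and the two sizes are assumed to be independent powers of two up to the device's maximum workgroup size.

```python
from itertools import product

def candidate_sizes(max_workgroup_size):
    """Powers of two up to the device's maximum workgroup size."""
    sizes = []
    s = 1
    while s <= max_workgroup_size:
        sizes.append(s)
        s *= 2
    return sizes

def self_tune(time_kernel, max_workgroup_size, repeats=3):
    """Return the (size0, size1) pair with the smallest measured time.

    time_kernel(size0, size1) is a hypothetical callback that runs the
    transform once and returns the elapsed time in seconds (for example,
    taken from command-queue profiling).
    """
    sizes = candidate_sizes(max_workgroup_size)
    best_pair, best_time = None, float("inf")
    for pair in product(sizes, repeat=2):
        # Keep the minimum of several runs to reduce timing noise.
        elapsed = min(time_kernel(*pair) for _ in range(repeats))
        if elapsed < best_time:
            best_pair, best_time = pair, elapsed
    return best_pair, best_time

# Toy timing model for illustration only: fastest at (32, 32).
def fake_time_kernel(size0, size1):
    return 1e-3 + 1e-5 * (abs(size0 - 32) + abs(size1 - 32))

best, t = self_tune(fake_time_kernel, 1024)
print(best)
```

Because the choice is driven by measurement rather than by the vendor guides, the selected sizes need not be multiples of 32 or 64.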
| |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
|
A new version of oCLgenefer with self-tuning is available on assembla [...]\branches\yves\2013\OclGenefer.
In OpenCL, profiling of commands can be enabled on the command queue. I used it; it's fast and easy to use.
I printed the two parameters, just out of curiosity.
According to the NVidia guide, they should be a multiple of 32 (the warp size), and according to the ATI guide, a multiple of 64 (the wavefront size)... but that's just the theory.
Benches on Tahiti and Fermi are welcome.
Thanks, Yves
I ran the test several times because I was getting inconsistent results at low N.
GTX 460:
OclGenefer 2013-08-17, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'NVIDIA CUDA': GPU device 'GeForce GTX 460' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'Intel(R) Core(TM)2 Quad CPU @ 2.40GHz' found.
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Clock frequency = 1350 MHz, compute units = 7.
Global mem size = 1024 MB, cache size = 112 kB (ReadWrite), cache line size = 128 Bytes.
Local mem size = 48 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 1024.
localWorkSize0 = 4, localWorkSize1 = 4.
2199064^8192+1 Time: 91.3 us/mul. Err: 0.2188 51956 digits
localWorkSize0 = 16, localWorkSize1 = 16.
1798620^16384+1 Time: 134 us/mul. Err: 0.2266 102481 digits
localWorkSize0 = 16, localWorkSize1 = 16.
1471094^32768+1 Time: 176 us/mul. Err: 0.2344 202102 digits
localWorkSize0 = 32, localWorkSize1 = 32.
1203210^65536+1 Time: 277 us/mul. Err: 0.2188 398482 digits
localWorkSize0 = 64, localWorkSize1 = 64.
984108^131072+1 Time: 507 us/mul. Err: 0.2422 785521 digits
localWorkSize0 = 32, localWorkSize1 = 32.
804904^262144+1 Time: 1.08 ms/mul. Err: 0.2178 1548156 digits
localWorkSize0 = 32, localWorkSize1 = 32.
658332^524288+1 Time: 2.05 ms/mul. Err: 0.2256 3050541 digits
localWorkSize0 = 32, localWorkSize1 = 32.
538452^1048576+1 Time: 4.28 ms/mul. Err: 0.2031 6009544 digits
localWorkSize0 = 32, localWorkSize1 = 32.
440400^2097152+1 Time: 8.9 ms/mul. Err: 0.2305 11836006 digits
localWorkSize0 = 32, localWorkSize1 = 32.
360204^4194304+1 Time: 18.7 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 32, localWorkSize1 = 32.
294612^8388608+1 Time: 39.7 ms/mul. Err: 0.1973 45879398 digits
localWorkSize0 = 4, localWorkSize1 = 4.
2199064^8192+1 Time: 91.1 us/mul. Err: 0.2188 51956 digits
localWorkSize0 = 16, localWorkSize1 = 16.
1798620^16384+1 Time: 136 us/mul. Err: 0.2266 102481 digits
localWorkSize0 = 16, localWorkSize1 = 16.
1471094^32768+1 Time: 178 us/mul. Err: 0.2344 202102 digits
localWorkSize0 = 32, localWorkSize1 = 2.
1203210^65536+1 Time: 292 us/mul. Err: 0.2188 398482 digits
localWorkSize0 = 64, localWorkSize1 = 64.
984108^131072+1 Time: 510 us/mul. Err: 0.2422 785521 digits
localWorkSize0 = 32, localWorkSize1 = 32.
804904^262144+1 Time: 1.08 ms/mul. Err: 0.2178 1548156 digits
localWorkSize0 = 32, localWorkSize1 = 32.
658332^524288+1 Time: 2.06 ms/mul. Err: 0.2256 3050541 digits
localWorkSize0 = 32, localWorkSize1 = 32.
538452^1048576+1 Time: 4.29 ms/mul. Err: 0.2031 6009544 digits
localWorkSize0 = 32, localWorkSize1 = 32.
440400^2097152+1 Time: 8.89 ms/mul. Err: 0.2305 11836006 digits
localWorkSize0 = 32, localWorkSize1 = 32.
360204^4194304+1 Time: 18.6 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 32, localWorkSize1 = 32.
294612^8388608+1 Time: 39.7 ms/mul. Err: 0.1973 45879398 digits
localWorkSize0 = 4, localWorkSize1 = 1.
2199064^8192+1 Time: 92.3 us/mul. Err: 0.2188 51956 digits
localWorkSize0 = 16, localWorkSize1 = 4.
1798620^16384+1 Time: 135 us/mul. Err: 0.2266 102481 digits
localWorkSize0 = 16, localWorkSize1 = 32.
1471094^32768+1 Time: 175 us/mul. Err: 0.2344 202102 digits
localWorkSize0 = 32, localWorkSize1 = 32.
1203210^65536+1 Time: 276 us/mul. Err: 0.2188 398482 digits
localWorkSize0 = 64, localWorkSize1 = 64.
984108^131072+1 Time: 504 us/mul. Err: 0.2422 785521 digits
localWorkSize0 = 32, localWorkSize1 = 32.
804904^262144+1 Time: 1.07 ms/mul. Err: 0.2178 1548156 digits
localWorkSize0 = 32, localWorkSize1 = 32.
658332^524288+1 Time: 2.03 ms/mul. Err: 0.2256 3050541 digits
localWorkSize0 = 32, localWorkSize1 = 32.
538452^1048576+1 Time: 4.25 ms/mul. Err: 0.2031 6009544 digits
localWorkSize0 = 32, localWorkSize1 = 32.
440400^2097152+1 Time: 8.84 ms/mul. Err: 0.2305 11836006 digits
localWorkSize0 = 32, localWorkSize1 = 32.
360204^4194304+1 Time: 18.6 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 32, localWorkSize1 = 32.
294612^8388608+1 Time: 39.7 ms/mul. Err: 0.1973 45879398 digits
localWorkSize0 = 4, localWorkSize1 = 4.
2199064^8192+1 Time: 90 us/mul. Err: 0.2188 51956 digits
localWorkSize0 = 16, localWorkSize1 = 2.
1798620^16384+1 Time: 136 us/mul. Err: 0.2266 102481 digits
localWorkSize0 = 16, localWorkSize1 = 2.
1471094^32768+1 Time: 180 us/mul. Err: 0.2344 202102 digits
localWorkSize0 = 32, localWorkSize1 = 64.
1203210^65536+1 Time: 276 us/mul. Err: 0.2188 398482 digits
localWorkSize0 = 64, localWorkSize1 = 64.
984108^131072+1 Time: 505 us/mul. Err: 0.2422 785521 digits
localWorkSize0 = 32, localWorkSize1 = 32.
804904^262144+1 Time: 1.07 ms/mul. Err: 0.2178 1548156 digits
localWorkSize0 = 32, localWorkSize1 = 32.
658332^524288+1 Time: 2.03 ms/mul. Err: 0.2256 3050541 digits
localWorkSize0 = 32, localWorkSize1 = 32.
538452^1048576+1 Time: 4.25 ms/mul. Err: 0.2031 6009544 digits
localWorkSize0 = 32, localWorkSize1 = 32.
440400^2097152+1 Time: 8.85 ms/mul. Err: 0.2305 11836006 digits
localWorkSize0 = 32, localWorkSize1 = 32.
360204^4194304+1 Time: 18.6 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 32, localWorkSize1 = 32.
294612^8388608+1 Time: 39.7 ms/mul. Err: 0.1973 45879398 digits
____________
My lucky number is 75898^524288+1 | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
A new version of oCLgenefer with self-tuning is available on assembla [...]\branches\yves\2013\OclGenefer.
Full test run with self-tuning geneferocl, revision 400. HD7970GHz GPU, CPU is 3.5 GHz AMD Phenom II X6 1100T:
OclGenefer 2013-08-17, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': GPU device 'Tahiti' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'AMD Phenom(tm) II X6 1100T Processor' found.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Clock frequency = 1050 MHz, compute units = 32.
Global mem size = 2048 MB, cache size = 16 kB (ReadWrite), cache line size = 64 Bytes.
Local mem size = 32 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 256.
localWorkSize0 = 64, localWorkSize1 = 128.
2199064^8192+1 Time: 583 us/mul. Err: 0.2188 51956 digits
localWorkSize0 = 1, localWorkSize1 = 2.
1798620^16384+1 Time: 303 us/mul. Err: 0.2227 102481 digits
localWorkSize0 = 1, localWorkSize1 = 2.
1471094^32768+1 Time: 526 us/mul. Err: 0.2383 202102 digits
localWorkSize0 = 2, localWorkSize1 = 4.
1203210^65536+1 Time: 621 us/mul. Err: 0.2305 398482 digits
localWorkSize0 = 8, localWorkSize1 = 8.
984108^131072+1 Time: 625 us/mul. Err: 0.2188 785521 digits
localWorkSize0 = 16, localWorkSize1 = 32.
804904^262144+1 Time: 944 us/mul. Err: 0.2266 1548156 digits
localWorkSize0 = 4, localWorkSize1 = 4.
658332^524288+1 Time: 711 us/mul. Err: 0.2109 3050541 digits
localWorkSize0 = 8, localWorkSize1 = 4.
538452^1048576+1 Time: 1.15 ms/mul. Err: 0.2134 6009544 digits
localWorkSize0 = 4, localWorkSize1 = 2.
440400^2097152+1 Time: 2.24 ms/mul. Err: 0.2266 11836006 digits
localWorkSize0 = 8, localWorkSize1 = 2.
360204^4194304+1 Time: 4.64 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 8, localWorkSize1 = 2.
294612^8388608+1 Time: 9 ms/mul. Err: 0.2109 45879398 digits
localWorkSize0 = 1, localWorkSize1 = 1.
102^64+1 is a probable prime. (0.9 sec., err = 1.46e-011)
localWorkSize0 = 1, localWorkSize1 = 1.
15000250^64+1 is a probable prime. (1.6 sec., err = 0.375)
localWorkSize0 = 2, localWorkSize1 = 2.
120^128+1 is a probable prime. (1.2 sec., err = 5.09e-011)
localWorkSize0 = 2, localWorkSize1 = 2.
10000038^128+1 is a probable prime. (2.6 sec., err = 0.344)
localWorkSize0 = 1, localWorkSize1 = 2.
278^256+1 is a probable prime. (2.0 sec., err = 3.49e-010)
localWorkSize0 = 2, localWorkSize1 = 2.
5684328^256+1 is a probable prime. (4.5 sec., err = 0.164)
localWorkSize0 = 2, localWorkSize1 = 2.
46^512+1 is a probable prime. (2.9 sec., err = 1.73e-011)
localWorkSize0 = 1, localWorkSize1 = 2.
4619000^512+1 is a probable prime. (6.1 sec., err = 0.174)
localWorkSize0 = 2, localWorkSize1 = 4.
824^1024+1 is a probable prime. (6.0 sec., err = 8.5e-009)
localWorkSize0 = 2, localWorkSize1 = 2.
3752220^1024+1 is a probable prime. (7.3 sec., err = 0.188)
localWorkSize0 = 4, localWorkSize1 = 2.
150^2048+1 is a probable prime. (7.5 sec., err = 4.37e-010)
localWorkSize0 = 2, localWorkSize1 = 2.
3066672^2048+1 is a probable prime. (9.8 sec., err = 0.217)
localWorkSize0 = 2, localWorkSize1 = 2.
1534^4096+1 is a probable prime. (10.2 sec., err = 7.08e-008)
localWorkSize0 = 2, localWorkSize1 = 64.
2485064^4096+1 is a probable prime. (13.7 sec., err = 0.211)
localWorkSize0 = 128, localWorkSize1 = 128.
30406^8192+1 is a probable prime. (18.8 sec., err = 4.58e-005)
localWorkSize0 = 2, localWorkSize1 = 1.
2030234^8192+1 is a probable prime. (20.6 sec., err = 0.209)
localWorkSize0 = 64, localWorkSize1 = 128.
67234^16384+1 is a probable prime. (32.6 sec., err = 0.000351)
localWorkSize0 = 64, localWorkSize1 = 128.
1651902^16384+1 is a probable prime. (40.3 sec., err = 0.219)
localWorkSize0 = 64, localWorkSize1 = 64.
70906^32768+1 is a probable prime. (53.5 sec., err = 0.000648)
localWorkSize0 = 128, localWorkSize1 = 128.
1277444^32768+1 is a probable prime. (69.6 sec., err = 0.203)
localWorkSize0 = 2, localWorkSize1 = 4.
48594^65536+1 is a probable prime. (128.6 sec., err = 0.000458)
localWorkSize0 = 2, localWorkSize1 = 4.
857678^65536+1 is a probable prime. (155.1 sec., err = 0.137)
localWorkSize0 = 8, localWorkSize1 = 8.
62722^131072+1 is a probable prime. (293.9 sec., err = 0.00116)
localWorkSize0 = 8, localWorkSize1 = 128.
572186^131072+1 is a probable prime. (352.2 sec., err = 0.0957)
localWorkSize0 = 16, localWorkSize1 = 32.
24518^262144+1 is a probable prime. (1310.7 sec., err = 0.000259)
localWorkSize0 = 8, localWorkSize1 = 32.
676754^262144+1 is a probable prime. (1636.6 sec., err = 0.215)
localWorkSize0 = 4, localWorkSize1 = 4.
75898^524288+1 is a probable prime. (4649.8 sec., err = 0.00391)
localWorkSize0 = 4, localWorkSize1 = 4.
475856^524288+1 is a probable prime. (5369.9 sec., err = 0.16)
I compared the results to revision 386 from 14/08/2013 (previous fastest for me).
N values of 524288 and higher are definitely faster.
N values of 131072 and lower are definitely slower.
The auto-tuned transform shows a lot of promise but needs tweaking. | |
|
|
Here are more Fermi benches. I've listed both the new 32-bit version and the old 64-bit one. It looks like the 32-bit version is slightly faster on the GTX 560 Ti (GF114 chip) but slightly slower on the 'Big Fermis' (GF100, GF110 chips).
EDIT: Forgot to set the 470 to stock before testing. Shaders were at 1415 for the test instead of the stock 1215.
GTX 470 on 1100T
C:\>geneferocl-windows_2.exe -b
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows_2.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 470', version 'OpenCL 1.1 CUDA' and driver '320.18'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 86.1 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 92.8 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 126 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 177 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 271 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 605 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.11 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.25 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 4.61 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 9.69 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 20.6 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 43.
C:\>geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 470', version 'OpenCL 1.1 CUDA' and driver '320.18'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 86.7 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 96.4 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 127 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 182 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 293 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 591 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.06 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 2.13 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 4.34 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 9.14 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 19.5 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 554.
GTX 580 on 1100T
C:\>geneferocl-windows_2.exe -b
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows_2.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 580', version 'OpenCL 1.1 CUDA' and driver '314.22'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 72.4 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 77.2 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 80.3 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 122 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 199 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 438 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 844 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 1.73 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 3.54 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 7.52 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 16 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 56.
C:\>geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 580', version 'OpenCL 1.1 CUDA' and driver '314.22'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 73.5 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 78.6 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 81.3 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 123 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 204 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 433 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 822 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.68 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 3.43 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 7.28 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 15.6 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 699.
GTX 570 on 2600K
C:\>geneferocl-windows_2.exe -b
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows_2.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 570', version 'OpenCL 1.1 CUDA' and driver '326.41'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 81.8 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 88.2 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 117 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 168 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 249 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 537 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.01 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 2.03 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 4.14 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 8.75 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 18.4 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 48.
C:\>geneferocl-windows.exe -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 570', version 'OpenCL 1.1 CUDA' and driver '326.41'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 82.4 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 88.8 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 118 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 168 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 273 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 527 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 967 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.93 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 3.91 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 8.2 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 17.5 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 615.
GTX 560 Ti on 2600K
C:\>geneferocl-windows_2.exe -d 1 -b
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows_2.exe -d 1 -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 560 Ti', version 'OpenCL 1.1 CUDA' and driver '326.41'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 69 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 73.2 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 107 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 194 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 352 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 757 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.5 ms/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 3.13 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 6.6 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 13.8 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 29.1 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 30.
C:\>geneferocl-windows.exe -d 1 -b
geneferocl 3.1.2-2 (Windows 64-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -d 1 -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 560 Ti', version 'OpenCL 1.1 CUDA' and driver '326.41'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 69.6 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 74.2 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 107 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 195 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 376 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 811 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 1.59 ms/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 3.34 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 7.03 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 14.6 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 30.6 ms/mul. Err: 0.1895 45879398 digits
Genefer Mark = 349.
More tweaking needs to be done.
____________
Largest Primes to Date:
As Double Checker: SR5 109208*5^1816285+1 Dgts-1,269,534
As Initial Finder: SR5 243944*5^1258576-1 Dgts-879,713
| |
|
|
Windows 32-bit OpenGL
Windows 64-bit OpenGL
Maybe OpenCL? | |
|
|
Husu*, you forgot to abort the task on previous host 400849 and now it is lost - it is very bad.
It has been aborted now.
------
First regular task done on Titan earlier http://www.primegrid.com/result.php?resultid=474003987, second one coming shortly.
Run time 16,600.85
CPU time 16,596.48
The OpenCL CPU requirement is a bad Nvidia implementation, though :(
It seems the new drivers have fixed the crashes; now I can do full runs. | |
|
|
Under CUDA there is a limit on the size of numbers the GPU can handle. Does this limit still apply in the OpenCL version or can we go higher?
____________
Member team AUSTRALIA
My lucky number is 9291*2^1085585+1 | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
Seems they can go higher or lower by just adding auto-tuning:
geneferocl 3.1.2-3 (Windows 32-bit OpenGL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -l
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2720000, Err = 0.2969
The upper bound m = 16384, b = 2210000, Err = 0.2969
The upper bound m = 32768, b = 1830000, Err = 0.3008
The upper bound m = 65536, b = 1490000, Err = 0.2969
The upper bound m = 131072, b = 1235000, Err = 0.3008
The upper bound m = 262144, b = 1015000, Err = 0.2891
The upper bound m = 524288, b = 840000, Err = 0.3047
The upper bound m = 1048576, b = 690000, Err = 0.3008
The upper bound m = 2097152, b = 565000, Err = 0.3066
The upper bound m = 4194304, b = 470000, Err = 0.3125
The upper bound m = 8388608, b = 385000, Err = 0.3125
geneferocl 3.1.2-1 and geneferocl 3.1.2-2:
Generalized Fermat Number b Limits
The upper bound m = 8192, b = 2670000, Err = 0.2910
The upper bound m = 16384, b = 2210000, Err = 0.2969
The upper bound m = 32768, b = 1780000, Err = 0.2969
The upper bound m = 65536, b = 1505000, Err = 0.2969
The upper bound m = 131072, b = 1240000, Err = 0.2969
The upper bound m = 262144, b = 1015000, Err = 0.3047
The upper bound m = 524288, b = 825000, Err = 0.3057
The upper bound m = 1048576, b = 680000, Err = 0.3047
The upper bound m = 2097152, b = 555000, Err = 0.2969
The upper bound m = 4194304, b = 455000, Err = 0.2813
The upper bound m = 8388608, b = 385000, Err = 0.3125
These values can be compared against CUDA "B" limits given in this thread:
http://www.primegrid.com/forum_thread.php?id=4152 | |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
Under CUDA there is a limit on the size of numbers the GPU can handle. Does this limit still apply in the OpenCL version or can we go higher?
A limit? The limit is GPU memory size... and a reasonable amount of time!
We can extend the tests:
360204^4194304+1 Err: 0.1953 23305854 digits
294612^8388608+1 Err: 0.1973 45879398 digits
240964^16777216+1 Err: 0.1914 90294174 digits
197084^33554432+1 Err: 0.1907 177659020 digits
If you test 100000^33554432+1 on a Titan, the computation time is about log2(100000) * 2^25 * 30 ms ~ 200 days. | |
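The estimate above can be reproduced with a few lines of Python. This is a rough sketch that assumes, as the formula in the post does, about N * log2(b) multiplications for b^N+1; the function name `estimate_days` is made up for the example, and the 30 ms/mul figure is the one quoted for a Titan.

```python
import math

def estimate_days(b, n, ms_per_mul):
    """Rough run time in days for testing b^n+1.

    Assumes about n * log2(b) multiplications, each taking ms_per_mul.
    """
    iterations = n * math.log2(b)
    seconds = iterations * ms_per_mul / 1000.0
    return seconds / 86400.0

days = estimate_days(100000, 2**25, 30.0)
print(f"about {days:.0f} days")
```

The result lands a little under 200 days, consistent with the figure in the post.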
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
A new version of oCLgenefer with self-tuning is available on assembla [...]
Full test run with self-tuning geneferocl, revision 400. HD7970GHz
N values of 524288 and higher are definitely faster.
N values of 131072 and lower are definitely slower.
The auto-tuned transform shows a lot of promise but needs tweaking.
Thanks, very good news!
538452^1048576+1 Time: 1.52 => 1.15 ms/mul.
360204^4194304+1 Time: 5.61 => 4.64 ms/mul.
That's the most important for the Generalized Fermat Prime Search.
As expected, the "AMD Accelerated Parallel Processing, OpenCL Programming Guide, August 2013" says: The fundamental unit of work on AMD GPUs is called a wavefront. Each wavefront consists of 64 work-items; thus, the optimal local work size is an integer multiple of 64 (specifically 64, 128, 192, or 256) work-items per workgroup.
The bench shows that the optimal settings here are local work sizes of 8 and 2!
For small N, the problem is probably measurement accuracy: a call takes only a few microseconds. But the timer resolution of OpenCL devices can be queried, so the measurement error is known. | |
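To illustrate the measurement-accuracy point, here is a small sketch of how timer resolution limits the precision of a short timing. It assumes a simple model that is not taken from the Genefer source: the only error is timer quantization, with the start and end timestamps each off by up to one resolution step. The numbers are hypothetical; on a real device the resolution in nanoseconds can be queried with `clGetDeviceInfo` and `CL_DEVICE_PROFILING_TIMER_RESOLUTION`.

```python
def relative_timing_error(resolution_ns, duration_ns, calls=1):
    """Worst-case relative error when timing `calls` back-to-back calls.

    Assumed model: the start and end timestamps can each be off by up
    to one timer resolution step; no other noise is considered.
    """
    total = duration_ns * calls
    return 2.0 * resolution_ns / total

# Hypothetical numbers: 1 microsecond resolution, 5 microsecond call.
single = relative_timing_error(1000, 5000)
batched = relative_timing_error(1000, 5000, calls=100)
print(f"{single:.1%} vs {batched:.2%}")
```

Under this model, timing a batch of calls instead of a single one shrinks the relative error in proportion to the batch size, which is one way to make tuning at small N more stable.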
|
|
Michael Goetz wrote: GeneferOCL 3.1.2-3 is now available for download. You DO want this new version if you're running GeneferOCL because it's faster!
It's got a new and improved faster transform.
I gave the latest GeneferOCL (3.1.2-3) a chance, but it is not faster on my machine.
The estimate after 7% done is 29+ hours, with higher CPU usage (96.6%) than v3.1.2-2.
A result with that version was returned in 27 hours with CPU usage of 82.1%.
____________
| |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414
|
Michael Goetz wrote: GeneferOCL 3.1.2-3 is now available for download. You DO want this new version if you're running GeneferOCL because it's faster!
I gave the latest GeneferOCL (3.1.2-3) a chance, but it is not faster on my machine.
The estimate after 7% done is 29+ hours, with higher CPU usage (96.6%) than v3.1.2-2.
A result with that version was returned in 27 hours with CPU usage of 82.1%.
Yes, it's slower on ATI cards, sorry for that.
The 3.1.2-4 will solve this problem with a "self-tuning transform".
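As an illustration only, a self-tuning step could look like the sketch below: try every power-of-two pair of local work sizes that fits in one work-group, time each, and keep the fastest. This is not Genefer's actual code. `candidate_sizes`, `self_tune` and the `time_kernel` callback are hypothetical names, and the timing function in the usage lines is a stand-in, not a real measurement.

```python
from itertools import product
from typing import Callable, Iterator, Tuple


def candidate_sizes(max_workgroup_size: int) -> Iterator[Tuple[int, int]]:
    """Yield power-of-two (lws0, lws1) pairs whose product fits in one work-group."""
    powers = [1 << k for k in range(max_workgroup_size.bit_length())]
    for lws0, lws1 in product(powers, powers):
        if lws0 * lws1 <= max_workgroup_size:
            yield lws0, lws1


def self_tune(time_kernel: Callable[[int, int], float],
              max_workgroup_size: int,
              repeats: int = 3) -> Tuple[int, int]:
    """Return the (lws0, lws1) pair with the smallest measured time.

    time_kernel(lws0, lws1) is assumed to run the transform once with the given
    local work sizes and return the elapsed time; the best of `repeats` runs is kept.
    """
    best_time, best_sizes = None, None
    for sizes in candidate_sizes(max_workgroup_size):
        elapsed = min(time_kernel(*sizes) for _ in range(repeats))
        if best_time is None or elapsed < best_time:
            best_time, best_sizes = elapsed, sizes
    return best_sizes


# Stand-in cost function (NOT a measurement): cheapest at lws0 = 8, lws1 = 2.
fake_timer = lambda lws0, lws1: abs(lws0 - 8) + abs(lws1 - 2)
print(self_tune(fake_timer, max_workgroup_size=256))  # (8, 2)
```

An exhaustive search like this is only practical because the number of candidates is small; it also lets the measured result override a vendor guideline when the two disagree.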
| |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
I tried a bench of the new 3.1.2-3 geneferocl with HD7970Ghz:
538452^1048576+1 Time: 1.78 ms/mul. Err: 0.2188 6009544 digits
360204^4194304+1 Time: 6.8 ms/mul. Err: 0.1953 23305854 digits
Genefer Mark = 60.
Definitely slower. Then I tried revision 400 of the assembla algorithm with "self-tuning transform" (again):
538452^1048576+1 Time: 1.1 ms/mul. Err: 0.2134 6009544 digits
360204^4194304+1 Time: 4.45 ms/mul. Err: 0.1953 23305854 digits
New best runs! I can see you're working on 3.1.2-4 of geneferocl, let's see what that brings. | |
|
|
Yes, it's slower on ATI cards, sorry for that.
The 3.1.2-4 will solve this problem with a "self-tuning transform".
No worry, Yves. This thread is called "....available for testing" ;) | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Thanks, very good news!
538452^1048576+1 Time: 1.52 => 1.15 ms/mul.
360204^4194304+1 Time: 5.61 => 4.64 ms/mul.
That's the most important for the Generalized Fermat Prime Search.
I'd include N=2097152 in the "most important" list. At most, we're only a few years away from exhausting N=1048576, and it could be a lot shorter than that. GeneferOCL likely will greatly increase the number of computers crunching GFN, and if we do a GFN challenge next year there's a good chance we'll be crunching 2097152 in 2014.
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
New best runs! I can see you're working on 3.1.2-4 of geneferocl, let's see what that brings.
I've just committed OclGenefer 2013-08-18 (assembla rev 402).
Roger, if it's OK on your HD7970Ghz, I will update real genefer with it.
Just the result of the bench is necessary.
It also prints "Profiling timer resolution": 1 microsecond on NVidia.
Thanks! | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
I'd include N=2097152 in the "most important" list. At most, we're only a few years away from exhausting N=1048576, and it could be a lot shorter than that.
I would be interested in some statistics, but I can't find them in the Genefer subproject status:
What is the range for N=1048576 [2-600000] ?
How many candidates pass through the sieve and should be tested?
38545 were tested.
Today, how many of them are tested (and double checked) per day? | |
|
|
One of my tasks just got validated, so roughly:
Titan (16,600sec) on OpenCL is about 2x faster than 580 on CUDA (31,596sec).
Although OpenCL also uses as much CPU time as it does GPU time currently, the workunit:
http://www.primegrid.com/workunit.php?wuid=350124153
Also for comparison a 670 takes 45,393sec on OpenCL (ran two wu's): http://www.primegrid.com/workunit.php?wuid=350124165
I'll leave the Titan to continue with OpenCL Genefer, as it's way faster than CUDA per workunit. | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
New best runs! I can see you're working on 3.1.2-4 of geneferocl, let's see what that brings.
I've just committed OclGenefer 2013-08-18 (assembla rev 402).
Roger, if it's OK on your HD7970Ghz, I will update real genefer with it.
Just the result of the bench is necessary.
It also prints "Profiling timer resolution": 1 microsecond on NVidia.
Thanks!
402 benchies:
OclGenefer 2013-08-18, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': GPU device 'Tahiti' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'AMD Phenom(tm) II X6 1100T Processor' found.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Clock frequency = 1050 MHz, compute units = 32.
Global mem size = 2048 MB, cache size = 16 kB (ReadWrite), cache line size = 64 Bytes.
Local mem size = 32 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 256, Profiling timer resolution = 0.0 usec.
2199064^8192+1 Time: 76.5 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 73.7 us/mul. Err: 0.2227 102481 digits
1471094^32768+1 Time: 73.2 us/mul. Err: 0.2383 202102 digits
1203210^65536+1 Time: 87.6 us/mul. Err: 0.2305 398482 digits
984108^131072+1 Time: 130 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 305 us/mul. Err: 0.2266 1548156 digits
localWorkSize0 = 4, localWorkSize1 = 4.
658332^524288+1 Time: 670 us/mul. Err: 0.2109 3050541 digits
localWorkSize0 = 4, localWorkSize1 = 2.
538452^1048576+1 Time: 1.1 ms/mul. Err: 0.2134 6009544 digits
localWorkSize0 = 4, localWorkSize1 = 2.
440400^2097152+1 Time: 2.34 ms/mul. Err: 0.2266 11836006 digits
localWorkSize0 = 8, localWorkSize1 = 2.
360204^4194304+1 Time: 4.45 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 8, localWorkSize1 = 2.
294612^8388608+1 Time: 9.34 ms/mul. Err: 0.2109 45879398 digits
Awesome!
Profiling timer resolution a bit dodgy on AMD. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
New best runs! I can see you're working on 3.1.2-4 of geneferocl, let's see what that brings.
I've just committed OclGenefer 2013-08-18 (assembla rev 402).
assembla rev 403.
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
I'd include N=2097152 in the "most important" list. At most, we're only a few years away from exhausting N=1048576, and it could be a lot shorter than that.
I would be interested in some statistics, but I can't find them in the Genefer subproject status:
What is the range for N=1048576 [2-600000] ?
The range is 6 to a little bit below the "-l" B limit. Exactly where we'll stop will depend on a number of factors which may change between now and then.
2 and 4 can't be tested with Genefer.
How many candidates pass through the sieve and should be tested?
We've sieved N=20, 21, and 22 to an incredible depth due to 2 unexpected occurrences. If I remember correctly, our original long-term plan was to sieve to about 61P, which was the limit of the sieving software. It was expected to take many years to get there.
Then we had a new GPU-based sieving program. That was followed up by another person using a lot of AWS GPU-servers (each with dual Tesla GPUs) to do a prodigious amount of sieving. All three N ranges have now been sieved beyond 33E (33000P), which is 500 times higher than our original long term goal. N=22 has been sieved to beyond 100E!!! (The sieve goes from b=2 to b=100M.)
Although it's still beneficial to continue sieving, the overall ratio of candidates removed to candidates remaining isn't going to change much.
Jim can probably give you better numbers than I can. For n=20 and b=6 through 199986, there are approximately 36175 candidates that were not removed by the sieve and were tested with Genefer. That is, of course, 36% remaining, and 64% removed by the sieve (counting just even b's).
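The percentages above can be checked with a few lines of arithmetic. This sketch only uses the figures quoted in the post; the variable names are illustrative.

```python
# Figures quoted in the post for n=20.
b_min, b_max = 6, 199986
remaining = 36175  # approximate count of candidates that survived the sieve

# Only even b's are counted.
even_candidates = (b_max - b_min) // 2 + 1
fraction_remaining = remaining / even_candidates

print(even_candidates)  # 99991
print(f"{fraction_remaining:.1%} remaining, {1 - fraction_remaining:.1%} removed")
# 36.2% remaining, 63.8% removed
```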
38545 were tested.
Today, how many of them are tested (and double checked) per day?
Over the last 21 days, we've been averaging 75 short GFN's per day. Extrapolated back to the beginning of the year, that's about 18000 candidates. The difference between 18000 and 38545 is mostly due to the GFN challenge. :)
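As a quick sanity check of the extrapolation, the sketch below works backwards from the quoted numbers. The year 2013 is an assumption taken from the build dates elsewhere in this thread.

```python
from datetime import date, timedelta

rate_per_day = 75      # short GFN's per day over the last 21 days
extrapolated = 18000   # "about 18000 candidates" since the beginning of the year
tested_total = 38545   # candidates actually tested

# Number of days the 18000 figure corresponds to at 75 per day.
days_covered = extrapolated // rate_per_day
implied_date = date(2013, 1, 1) + timedelta(days=days_covered)

# Candidates not explained by the steady rate (mostly the GFN challenge).
surplus = tested_total - extrapolated

print(days_covered)   # 240
print(implied_date)   # 2013-08-29
print(surplus)        # 20545
```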
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Profiling timer resolution = 0.0 usec.
Profiling timer resolution a bit dodgy on AMD.
Please, wait for the 404 (I hope it's the latest).
Maybe Profiling timer resolution < 0.1 usec. | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Profiling timer resolution = 0.0 usec.
Profiling timer resolution a bit dodgy on AMD.
Please, wait for the 404 (I hope it's the latest).
Maybe Profiling timer resolution < 0.1 usec.
I committed the 404.
It seems to work but I would like to understand why!
Roger, a final, please. | |
|
Tyler Project administrator Volunteer tester Send message
Joined: 4 Dec 12 Posts: 1078 ID: 183129 Credit: 1,376,122,338 RAC: 8,547
                         
|
Here is geneferocl on my NVIDIA GT 430. Running genefercuda gives me an error. ("The application was unable to start correctly (0xc0000013). Click OK to close the application.")
Running on platform 'NVIDIA CUDA', device 'GeForce GT 430', version 'OpenCL 1.1 CUDA' and driver '326.19'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 101 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 177 us/mul. Err: 0.5000 102481 digits
1471094^32768+1 Time: 335 us/mul. Err: 0.5000 202102 digits
1203210^65536+1 Time: 691 us/mul. Err: 0.5000 398482 digits
984108^131072+1 Time: 1.28 ms/mul. Err: 0.5000 785521 digits
804904^262144+1 Time: 2.78 ms/mul. Err: 0.5000 1548156 digits
658332^524288+1 Time: 5.72 ms/mul. Err: 0.5000 3050541 digits
538452^1048576+1 Time: 12 ms/mul. Err: 0.5000 6009544 digits
440400^2097152+1 Time: 25.4 ms/mul. Err: 0.5000 11836006 digits
360204^4194304+1 Time: 53.9 ms/mul. Err: 0.5000 23305854 digits
294612^8388608+1 Time: 119 ms/mul. Err: 0.5000 45879398 digits
Genefer Mark = 8.
____________
275*2^3585539+1 is prime!!! (1079358 digits)
Proud member of Aggie the Pew
| |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
I've moved several posts from the New genefer (3.x.x) apps now available for testing thread to this thread so that all testing information can be consolidated in a single place.
____________
My lucky number is 75898^524288+1 | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
I committed the 404.
It seems to work but I would like to understand why!
Roger, a final, please.
404:
OclGenefer 2013-08-18, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': GPU device 'Tahiti' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'AMD Phenom(tm) II X6 1100T Processor' found.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Clock frequency = 1050 MHz, compute units = 32.
Global mem size = 2048 MB, cache size = 16 kB (ReadWrite), cache line size = 64 Bytes.
Local mem size = 32 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 256, Profiling timer resolution = 0.001 usec.
2199064^8192+1 Time: 74.2 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 73.3 us/mul. Err: 0.2227 102481 digits
1471094^32768+1 Time: 74.3 us/mul. Err: 0.2383 202102 digits
1203210^65536+1 Time: 79.8 us/mul. Err: 0.2305 398482 digits
984108^131072+1 Time: 122 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 335 us/mul. Err: 0.2266 1548156 digits
localWorkSize0 = 4, localWorkSize1 = 4.
658332^524288+1 Time: 670 us/mul. Err: 0.2109 3050541 digits
localWorkSize0 = 4, localWorkSize1 = 2.
538452^1048576+1 Time: 1.1 ms/mul. Err: 0.2134 6009544 digits
localWorkSize0 = 4, localWorkSize1 = 2.
440400^2097152+1 Time: 2.31 ms/mul. Err: 0.2266 11836006 digits
localWorkSize0 = 8, localWorkSize1 = 2.
360204^4194304+1 Time: 4.64 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 8, localWorkSize1 = 2.
294612^8388608+1 Time: 9.28 ms/mul. Err: 0.2109 45879398 digits
AMD Profiling timer resolution better than expected, but believable? | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
404 on GTX460:
OclGenefer 2013-08-18, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'NVIDIA CUDA': GPU device 'GeForce GTX 460' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'Intel(R) Core(TM)2 Quad CPU @ 2.40GHz' found.
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 460', version 'OpenCL 1.1 CUDA' and driver '320.57'.
Clock frequency = 1350 MHz, compute units = 7.
Global mem size = 1024 MB, cache size = 112 kB (ReadWrite), cache line size = 128 Bytes.
Local mem size = 48 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 1024, Profiling timer resolution = 1.000 usec.
2199064^8192+1 Time: 84 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 127 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 171 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 309 us/mul. Err: 0.2188 398482 digits
984108^131072+1 Time: 563 us/mul. Err: 0.2422 785521 digits
804904^262144+1 Time: 1.14 ms/mul. Err: 0.2178 1548156 digits
localWorkSize0 = 32, localWorkSize1 = 32.
658332^524288+1 Time: 2.06 ms/mul. Err: 0.2256 3050541 digits
localWorkSize0 = 32, localWorkSize1 = 32.
538452^1048576+1 Time: 4.29 ms/mul. Err: 0.2031 6009544 digits
localWorkSize0 = 32, localWorkSize1 = 32.
440400^2097152+1 Time: 8.89 ms/mul. Err: 0.2305 11836006 digits
localWorkSize0 = 32, localWorkSize1 = 32.
360204^4194304+1 Time: 18.6 ms/mul. Err: 0.1953 23305854 digits
localWorkSize0 = 32, localWorkSize1 = 32.
294612^8388608+1 Time: 39.7 ms/mul. Err: 0.1973 45879398 digits
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Clock frequency = 1050 MHz,
[...] Profiling timer resolution = 0.001 usec.
AMD Profiling timer resolution better than expected, but believable?
Possible if counter = program counter (GPU clock).
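A short calculation shows why a resolution of 0.001 usec is plausible if the profiling counter ticks at the GPU clock. The clock value comes from the bench output quoted above. That an earlier revision printed the value with one decimal is an assumption, though it would be consistent with the "0.0 usec" shown before.

```python
clock_mhz = 1050.0           # "Clock frequency = 1050 MHz" from the bench output

# One clock period in microseconds (1 / MHz = us).
tick_us = 1.0 / clock_mhz

print(f"{tick_us * 1000:.3f} ns")   # 0.952 ns
print(f"{tick_us:.3f} usec")        # 0.001 usec
print(f"{tick_us:.1f} usec")        # 0.0 usec
```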
It's time for the 3.1.2-4. | |
|
|
Is it possible to reduce the CPU usage? A value of 0.01 or less would be nice. | |
|
|
Here are the runtimes for the different versions of geneferocl.
device 0: HD7970 clock 1000/1425 (gpu/memory)
device 1: HD7970 clock 1050/1500 (gpu/memory)
geneferocl 3.1.2-1 (Windows 64-bit OpenGL) --device 1: 26666/947
geneferocl 3.1.2-1 (Windows 64-bit OpenGL) --device 0: 27710/935
geneferocl 3.1.2-2 (Windows 64-bit OpenGL) --device 1: 25643/960
geneferocl 3.1.2-2 (Windows 64-bit OpenGL) --device 0: 27638/945
geneferocl 3.1.2-3 (Windows 32-bit OpenGL) --device 1: 29554/966
geneferocl 3.1.2-3 (Windows 32-bit OpenGL) --device 0: 32407/971
So geneferocl 3.1.2-2 has the best performance on the AMD/ATI 7970.
____________
DeleteNull | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
Clock frequency = 1050 MHz,
[...] Profiling timer resolution = 0.001 usec.
AMD Profiling timer resolution better than expected, but believable?
Possible if counter = program counter (GPU clock).
It's time for the 3.1.2-4.
3.1.2-4 is available for download from the beta thread http://www.primegrid.com/forum_thread.php?id=4889.
Results on my GTX 460 are unchanged at the higher Ns we're interested in.
____________
My lucky number is 75898^524288+1 | |
|
|
Tested the new version on Titan. I'll replace my app_info executable to get a view of a full run.
-----
geneferocl 3.1.2-4 (Windows 32-bit OpenCL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'NVIDIA CUDA', device 'GeForce GTX TITAN', version 'OpenCL 1.1 CUDA' and driver '326.41'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 78.1 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 78.7 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 83 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 97.7 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 137 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 293 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 469 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 898 us/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 1.64 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 3.28 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 7.19 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 120.
------
EDIT:
Had to abort the current WU because of the earlier version name difference; this probably won't affect anything after this change :D
Checkpoint saved by genefer Windows 32-bit OpenGL, expected Windows 32-bit OpenCL | |
|
|
I discovered a very interesting result with geneferocl on Nvidia. If a task is running on the processor alongside it (TRP sieve), geneferocl-windows.exe does not consume CPU time, but GPU load drops to ~85%. | |
|
|
GeneferOCL 3.1.2-4 is now available for testing with app_info, and can be downloaded via the first post in this thread.
invalid
<core_client_version>7.0.64</core_client_version>
<
|
Tested the new version on Titan, [...] geneferocl 3.1.2-4 (Windows 32-bit OpenCL)
It was faster with some previous versions. But since that's also true for some exponents for which the code is similar, is the reason "Windows 64-bit -> 32-bit" or "driver '320.49' => '326.41'"? | |
|
|
So far my 7950 has completed 6 'short' Genefer OpenCL tasks without any errors. All were validated by wingmen using CUDA Genefer versions.
GPU: ASUS HD7950 DC2T (Factory OC: 900MHz)
OS: Windows 7 64bit
AMD Catalyst: 13.4
Genefer OpenCL version 3.1.2-2
CPU load: <1%
GPU load: 97-99%
Runtimes: 26,516 - 28,080 sec
For the record:
http://www.primegrid.com/workunit.php?wuid=339797861
http://www.primegrid.com/workunit.php?wuid=342700453
http://www.primegrid.com/workunit.php?wuid=343200843
http://www.primegrid.com/workunit.php?wuid=345650390
http://www.primegrid.com/workunit.php?wuid=345651275
http://www.primegrid.com/workunit.php?wuid=345653871
Some benchmarks:
3.1.2-2
Command line: geneferocl-windows.exe -b
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 83.3 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 113 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 122 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 127 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 155 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 364 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 768 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 1.48 ms/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 2.81 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 5.55 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 11 ms/mul. Err: 0.1895 45879398 digits
3.1.2-4
Command line: geneferocl-windows3.1.2-4.exe -b
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 77.8 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 78.1 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 80.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 103 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 148 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 354 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 635 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 1.17 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 2.37 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 4.84 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 9.78 ms/mul. Err: 0.2070 45879398 digits
Edit:
So 3.1.2-4 is definitely faster | |
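To quantify "definitely faster" for the larger N's, the sketch below computes the speedup from the two benchmark tables in this post. The times are copied from those tables; the helper name `speedup` is illustrative.

```python
# (N, ms/mul with 3.1.2-2, ms/mul with 3.1.2-4), copied from the two tables above.
bench = [
    (1048576, 1.48, 1.17),
    (2097152, 2.81, 2.37),
    (4194304, 5.55, 4.84),
    (8388608, 11.0, 9.78),
]


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new version is (values above 1 mean faster)."""
    return old_ms / new_ms


for n, old_ms, new_ms in bench:
    print(f"N={n}: {speedup(old_ms, new_ms):.2f}x")
```

On these figures the gain is largest at N=1048576 and shrinks as N grows.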
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
invalid
[...]
http://www.primegrid.com/result.php?resultid=475051861
Did you try using interactive mode and 1. bench ? | |
|
|
Tested the new version on Titan, [...] geneferocl 3.1.2-4 (Windows 32-bit OpenCL)
It was faster with some previous versions. But since that's also true for some exponents for which the code is similar, is the reason "Windows 64-bit -> 32-bit" or "driver '320.49' => '326.41'"?
I made re-runs with the earlier versions I have available and current settings, Driver version '326.41'.
3.1.2-2 (Windows 64-bit OpenGL)
Generalized Fermat Number Bench
2199064^8192+1 Time: 79.3 us/mul. Err: 0.2344 51956 digits
1798620^16384+1 Time: 78.7 us/mul. Err: 0.2266 102481 digits
1471094^32768+1 Time: 84.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 97.7 us/mul. Err: 0.2656 398482 digits
984108^131072+1 Time: 137 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 283 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 488 us/mul. Err: 0.2188 3050541 digits
538452^1048576+1 Time: 898 us/mul. Err: 0.2266 6009544 digits
440400^2097152+1 Time: 1.64 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 3.44 ms/mul. Err: 0.2031 23305854 digits
294612^8388608+1 Time: 6.88 ms/mul. Err: 0.1895 45879398 digits
3.1.2-3 (Windows 32-bit OpenGL)
Generalized Fermat Number Bench
2199064^8192+1 Time: 79.3 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 78.7 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 83 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 97.7 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 137 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 273 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 488 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 898 us/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 1.64 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 3.44 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 7.19 ms/mul. Err: 0.2070 45879398 digits
3.1.2-4 (Windows 32-bit OpenCL)
Generalized Fermat Number Bench
2199064^8192+1 Time: 79.3 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 78.7 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 84.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 97.7 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 137 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 283 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 488 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 898 us/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 1.64 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 3.44 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 7.19 ms/mul. Err: 0.2070 45879398 digits
-----
I did more test runs and there's a slight variation in the numbers per run, this may be because of the "boosting" effects of 1) CPU 2) GPU, so in general for Titan I'd just read the averages instead of to the letter.
Titan "boosts" itself depending on temperature and load, can't make it run on fixed speed, can't disable the feature either. The double precision "slows the boost down" a bit so it won't boost that much over the default GPU Clock.
Example:
Default GPU Clock on my Titan is 837MHz from GPU-Z application information, on idle it's 324MHz.
On Double Precision -b run it's 849.2MHz (48C temperature), hotter it's 836.1MHz on double precision (79C).
On Single Precision -b run it's 1006MHz (no matter of the temperature), other GPU load below 78C it's 992.9MHz - 1006MHz, 79C it's 940.6MHz, etc, etc.
So really depends on the load and temperatures.
NOTE: This is without any overclocking or meddling with the GPU, this is how it works as-is out of the box.
Anyways, the 32-bit version (latest) is more stable in terms of what the output will be, 64-bit 3.1.2-2 version has larger variance
For example I get this on 3.1.2-4 occasionally, usually it's the one I posted before:
Generalized Fermat Number Bench
2199064^8192+1 Time: 79.3 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 78.7 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 84.2 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 97.7 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 132 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 283 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 488 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 859 us/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 1.72 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 3.44 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 6.88 ms/mul. Err: 0.2070 45879398 digits | |
|
|
invalid
[...]
http://www.primegrid.com/result.php?resultid=475051861
Did you try using interactive mode and 1. bench ?
Command line: geneferocl-windows.exe -b
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1214.3)' and driver '1214.3 (VM)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 70.2 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 68.4 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 69.6 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 78.1 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 116 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 296 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 551 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 1.05 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 2.19 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 4.56 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 9.06 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 93.
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
invalid
[...]
http://www.primegrid.com/result.php?resultid=475051861
Did you try using interactive mode and 1. bench ?
Command line: geneferocl-windows.exe -b
[...]
It seems to work. Maybe it tried to resume from a previous version? | |
|
|
It seems to work. Maybe it tried to resume from a previous version?
No
http://www.primegrid.com/results.php?hostid=396466 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
It seems to work. Maybe it tried to resume from a previous version?
No
http://www.primegrid.com/results.php?hostid=396466
I think that I found the bug and fixed it.
It may occur if the Windows app is a 32-bit application and the GPU device address space size is 64 bits.
Address space size of NVidia GPUs is 32 bits, that's why it works on NVidia cards.
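The thread does not describe the actual bug, so the following is only a generic illustration of how a host/device width mismatch can corrupt data: the host writes two 32-bit values, and a reader that expects 64-bit values interprets the same bytes as a single number. The values 5 and 7 are arbitrary.

```python
import struct

# Host side (32-bit build): two consecutive 32-bit unsigned integers, little-endian.
host_bytes = struct.pack("<II", 5, 7)

# Reader that assumes 64-bit integers: the same 8 bytes become ONE value.
(device_value,) = struct.unpack("<Q", host_bytes)

print(len(host_bytes))   # 8
print(device_value)      # 30064771077, i.e. 5 + 7 * 2**32
```

When both sides agree on 32 bits no such reinterpretation happens, which would fit the observation that the problem does not show up on NVidia cards.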
I committed the fix, so I hope that 3.1.2-5 will solve your problem.
Yves | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
I committed the fix, so I hope that 3.1.2-5 will solve your problem.
Yves
3.1.2-5 is available for download from the beta download thread.
____________
My lucky number is 75898^524288+1 | |
|
|
Unfortunately, it is also invalid:
http://www.primegrid.com/results.php?userid=43788
http://www.primegrid.com/result.php?resultid=475080186
http://www.primegrid.com/result.php?resultid=475053577
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
Unfortunately, it is also invalid:
http://www.primegrid.com/results.php?userid=43788
http://www.primegrid.com/result.php?resultid=475080186
http://www.primegrid.com/result.php?resultid=475053577
:o(
Anyone else with an ATI 7970?
Please, could you run the test using interactive mode and 2. test?
| |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
invalid
538452^1048576+1 Time: 1.05 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 2.19 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 4.56 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 9.06 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 93.
But your card runs faster than an HD 7970 GHz!
Is it overclocked? | |
|
|
I have tested it with my 7950, .....and it wasted all WU's.
So, back to 3.1.2-2
Unfortunately, it is also invalid:
http://www.primegrid.com/results.php?userid=43788
http://www.primegrid.com/result.php?resultid=475080186
http://www.primegrid.com/result.php?resultid=475053577
:o(
Anyone else with an ATI 7970?
Please, could you run the test using interactive mode and 2. test?
____________
DeleteNull | |
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 13952 ID: 53948 Credit: 391,849,676 RAC: 147,542
                               
|
If you want, you can try a 64-bit version of 3.1.2-5: click here
EDIT: 3.1.2-6 should fix the 32-bit problem, so the 64 bit app is no longer necessary.
____________
My lucky number is 75898^524288+1 | |
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 801 ID: 164101 Credit: 305,670,756 RAC: 5,414

|
I have tested it with my 7950, .....and it wasted all WU's.
I don't understand :o(
Please, could someone run the test of the win32 3.1.2-5, using interactive mode, on a HD 79x0?
Roger, could you compile a win32 version of OclGenefer 2013-08-18, revision 406 and run the full test?
Thanks, Yves | |
|
|
I am trying to get this to run on my 6870, but when I tell it to run a benchmark I get this:
C:\Users\Plomos\Downloads>geneferocl-windows.exe -b
geneferocl 3.1.2-5 (Windows 32-bit OpenCL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
No OpenCL device found.
What am I missing here? | |
|
Roger Volunteer developer Volunteer tester
 Send message
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
                    
|
I have tested it with my 7950, .....and it wasted all WU's.
I don't understand :o(
Please, could someone run the test of the win32 3.1.2-5, using interactive mode, on a HD 79x0?
Roger, could you compile a win32 version of OclGenefer 2013-08-18, revision 406 and run the full test?
Thanks, Yves
406 (x64) with HD7970Ghz:
OclGenefer 2013-08-18, Copyright (C) 2001-2013, Yves Gallot.
Options: -q "b^N+1" Test expression.
Platform 'AMD Accelerated Parallel Processing': GPU device 'Tahiti' found.
Platform 'AMD Accelerated Parallel Processing': CPU device 'AMD Phenom(tm) II X6 1100T Processor' found.
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Clock frequency = 1050 MHz, compute units = 32, 32-bit.
Global mem size = 2048 MB, cache size = 16 kB (ReadWrite), cache line size = 64 Bytes.
Local mem size = 32 kB (dedicated), Constant mem size = 64 kB.
Max workgroup size = 256, Profiling timer resolution = 0.001 usec.
2199064^8192+1 Time: 76.3 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 73.4 us/mul. Err: 0.2227 102481 digits
1471094^32768+1 Time: 75.4 us/mul. Err: 0.2383 202102 digits
1203210^65536+1 Time: 81.1 us/mul. Err: 0.2305 398482 digits
984108^131072+1 Time: 129 us/mul. Err: 0.2188 785521 digits
804904^262144+1 Time: 396 us/mul. Err: 0.2266 1548156 digits
658332^524288+1 Time: 639 us/mul. Err: 0.2109 3050541 digits
538452^1048576+1 Time: 1.16 ms/mul. Err: 0.2134 6009544 digits
440400^2097152+1 Time: 2.24 ms/mul. Err: 0.2266 11836006 digits
360204^4194304+1 Time: 4.69 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 8.72 ms/mul. Err: 0.2109 45879398 digits
When I try to compile the Win32 version I get linker errors:
51 unresolved externals
I probably need to link the 32-bit libraries rather than the 64-bit ones.
Job for Rebirther. I am time poor for the next few days.
geneferocl 3.1.2-5 (Windows 32-bit OpenCL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
Generalized Fermat Number Bench
2199064^8192+1 Time: 75.1 us/mul. Err: 0.2188 51956 digits
1798620^16384+1 Time: 73 us/mul. Err: 0.2344 102481 digits
1471094^32768+1 Time: 73.4 us/mul. Err: 0.2344 202102 digits
1203210^65536+1 Time: 84.2 us/mul. Err: 0.2813 398482 digits
984108^131072+1 Time: 132 us/mul. Err: 0.2295 785521 digits
804904^262144+1 Time: 368 us/mul. Err: 0.2188 1548156 digits
658332^524288+1 Time: 701 us/mul. Err: 0.2266 3050541 digits
538452^1048576+1 Time: 1.07 ms/mul. Err: 0.2188 6009544 digits
440400^2097152+1 Time: 2.24 ms/mul. Err: 0.2188 11836006 digits
360204^4194304+1 Time: 4.66 ms/mul. Err: 0.1953 23305854 digits
294612^8388608+1 Time: 9.25 ms/mul. Err: 0.2070 45879398 digits
Genefer Mark = 91.
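The digit counts in the bench lines above can be checked by hand: for b >= 2, the number b^N+1 has floor(N*log10(b)) + 1 decimal digits. Here is a minimal sketch of that check. The function name `gfn_digits` is made up for illustration and is not part of Genefer.

```python
import math


def gfn_digits(b: int, n: int) -> int:
    """Number of decimal digits of b^n + 1 for b >= 2 (hypothetical helper).

    Adding 1 to b^n never changes the digit count when b >= 2,
    so it is enough to count the digits of b^n with a logarithm.
    """
    return math.floor(n * math.log10(b)) + 1


# Three of the candidates from the bench output above.
for b, n in [(2199064, 8192), (538452, 1048576), (294612, 8388608)]:
    print(f"{b}^{n}+1 has {gfn_digits(b, n)} digits")
```

This uses double-precision logarithms, so a value that falls extremely close to a whole number of digits could come out off by one.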
geneferocl 3.1.2-5 (Windows 32-bit OpenCL)
Copyright 2001-2013, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2013, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: geneferocl-windows.exe -b3
Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1124.2)' and driver '1124.2 (VM)'.
14^32768+1 37557 digits 0 days 0.0 hours (0.09 ms/mul, 124758 iterations) 294 GFLOPS
75898^32768+1 159916 digits 0 days 0.0 hours (0.08 ms/mul, 531226 iterations) 1253 GFLOPS
700000^32768+1 191533 digits 0 days 0.0 hours (0.08 ms/mul, 636255 iterations) 1501 GFLOPS
5000000^32768+1 219512 digits 0 days 0.0 hours (0.08 ms/mul, 729201 iterations) 1720 GFLOPS
14^65536+1 75113 digits 0 days 0.0 hours (0.09 ms/mul, 249517 iterations) 1243 GFLOPS
75898^65536+1 319831 digits 0 days 0.0 hours (0.08 ms/mul, 1062453 iterations) 5292 GFLOPS
710000^65536+1 383469 digits 0 days 0.0 hours (0.09 ms/mul, 1273852 iterations) 6345 GFLOPS
2500000^65536+1 419296 digits 0 days 0.0 hours (0.10 ms/mul, 1392868 iterations) 6938 GFLOPS
14^131072+1 150226 digits 0 days 0.0 hours (0.14 ms/mul, 499036 iterations) 5233 GFLOPS
75898^131072+1 639662 digits 0 days 0.0 hours (0.14 ms/mul, 2124908 iterations) 22281 GFLOPS
700000^131072+1 766129 digits 0 days 0.0 hours (0.13 ms/mul, 2545023 iterations) 26687 GFLOPS
1000000^131072+1 786432 digits 0 days 0.0 hours (0.13 ms/mul, 2612469 iterations) 27394 GFLOPS
14^262144+1 300451 digits 0 days 0.0 hours (0.33 ms/mul, 998074 iterations) 21978 GFLOPS
75898^262144+1 1279324 digits 0 days 0.4 hours (0.36 ms/mul, 4249818 iterations) 93581 GFLOPS
468750^262144+1 1486604 digits 0 days 0.4 hours (0.33 ms/mul, 4938388 iterations) 108744 GFLOPS
815000^262144+1 1549575 digits 0 days 0.5 hours (0.37 ms/mul, 5147574 iterations) 113350 GFLOPS
14^524288+1 600902 digits 0 days 0.3 hours (0.70 ms/mul, 1996149 iterations) 92097 GFLOPS
75898^524288+1 2558647 digits 0 days 1.7 hours (0.73 ms/mul, 8499637 iterations) 392151 GFLOPS
468750^524288+1 2973207 digits 0 days 1.9 hours (0.72 ms/mul, 9876777 iterations) 455688 GFLOPS
710000^524288+1 3067745 digits 0 days 2.0 hours (0.73 ms/mul, 10190825 iterations) 470178 GFLOPS
14^1048576+1 1201803 digits 0 days 1.2 hours (1.13 ms/mul, 3992299 iterations) 385133 GFLOPS
75898^1048576+1 5117293 digits 0 days 5.2 hours (1.10 ms/mul, 16999276 iterations) 1639903 GFLOPS
468750^1048576+1 5946413 digits 0 days 6.1 hours (1.13 ms/mul, 19753555 iterations) 1905606 GFLOPS
700000^1048576+1 6129030 digits 0 days 6.2 hours (1.10 ms/mul, 20360194 iterations) 1964127 GFLOPS
14^2097152+1 2403605 digits 0 days 5.1 hours (2.31 ms/mul, 7984600 iterations) 1607512 GFLOPS
75898^2097152+1 10234585 digits 0 days 21.2 hours (2.25 ms/mul, 33998553 iterations) 6844813 GFLOPS
380742^2097152+1 11703432 digits 1 days 0.4 hours (2.26 ms/mul, 38877955 iterations) 7827166 GFLOPS
570000^2097152+1 12070945 digits 1 days 1.1 hours (2.26 ms/mul, 40098808 iterations) 8072956 GFLOPS
14^4194304+1 4807210 digits 0 days 20.7 hours (4.67 ms/mul, 15969202 iterations) 6697969 GFLOPS
1248^4194304+1 12986466 digits 2 days 6.7 hours (4.57 ms/mul, 43140102 iterations) 18094270 GFLOPS
10000^4194304+1 16777217 digits 2 days 22.9 hours (4.58 ms/mul, 55732704 iterations) 23375990 GFLOPS
50000^4194304+1 19708909 digits 3 days 10.6 hours (4.55 ms/mul, 65471576 iterations) 27460769 GFLOPS
150000^4194304+1 21710101 digits 3 days 19.3 hours (4.56 ms/mul, 72119391 iterations) 30249065 GFLOPS
309258^4194304+1 23028076 digits 4 days 1.8 hours (4.61 ms/mul, 76497608 iterations) 32085422 GFLOPS
480000^4194304+1 23828853 digits 4 days 4.7 hours (4.58 ms/mul, 79157734 iterations) 33201160 GFLOPS
14^8388608+1 9614419 digits 3 days 10.3 hours (9.28 ms/mul, 31938406 iterations) 27863552 GFLOPS
36^8388608+1 13055212 digits 4 days 15.6 hours (9.27 ms/mul, 43368473 iterations) 37835316 GFLOPS
100^8388608+1 16777217 digits 5 days 22.7 hours (9.22 ms/mul, 55732704 iterations) 48622060 GFLOPS
No errors. | |
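The "days / hours" column in the -b3 output looks like the time per multiplication times the iteration count. Below is a small sketch of that arithmetic, assuming that is how the estimate is formed. The helper name `estimate_runtime` is hypothetical and not taken from the Genefer source.

```python
from typing import Tuple


def estimate_runtime(ms_per_mul: float, iterations: int) -> Tuple[int, float]:
    """Split ms_per_mul * iterations into whole days and leftover hours.

    Hypothetical helper: it assumes the total test time is just the
    per-multiplication time multiplied by the number of iterations.
    """
    total_hours = ms_per_mul * iterations / 1000.0 / 3600.0
    days = int(total_hours // 24)
    hours = total_hours - 24 * days
    return days, hours


# Row "14^8388608+1 ... (9.28 ms/mul, 31938406 iterations)" from the log above.
days, hours = estimate_runtime(9.28, 31938406)
print(f"{days} days {hours:.1f} hours")  # prints "3 days 10.3 hours"
```

Since the ms/mul figure in the log is already rounded to two decimals, other rows may differ from the printed estimate in the last digit.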
|
|
|