Genefer "B" limits
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 435,627,755 RAC: 870,396
B limits for the upcoming 3.2.9 versions of Genefer:
Note that except for ocl4(low), ocl5, ocl3, and ocl4(high), the B limits are "fuzzy": near the limit, some tests will succeed while others will fail, and it's impossible to predict which. The B limits also vary somewhat across hardware and operating systems. With ocl4(low), ocl5, ocl3, and ocl4(high), the B limit is exact: every task below the limit will succeed, and the program won't run tasks above the limit.
32-bit Sempron CPU:
 n      2^n         x87    default
15    32768  99,460,000  2,180,000
16    65536  79,010,000  1,690,000
17   131072  64,150,000  1,420,000
18   262144  53,680,000  1,150,000
19   524288  43,620,000    940,000
20  1048576  36,030,000    760,000
21  2097152  29,050,000    640,000
22  4194304  23,870,000    530,000

32-bit Haswell CPU:
 n         x87    default       sse2       sse4        avx       fma3
15  99,710,000  2,180,000  2,420,000  2,420,000  2,520,000  2,480,000
16  81,840,000  1,690,000  1,930,000  1,930,000  1,970,000  1,950,000
17  65,450,000  1,410,000  1,570,000  1,570,000  1,660,000  1,590,000
18  54,080,000  1,120,000  1,300,000  1,300,000  1,340,000  1,350,000
19  43,530,000    950,000  1,050,000  1,050,000  1,110,000  1,060,000
20  36,300,000    780,000    890,000    890,000    890,000    890,000
21  29,740,000    630,000    720,000    720,000    760,000    720,000
22  24,100,000    510,000    600,000    600,000    630,000    590,000

64-bit Haswell CPU:
 n         x87    default       sse2       sse4        avx       fma3
15  99,710,000  2,180,000  2,420,000  2,420,000  2,520,000  2,390,000
16  81,840,000  1,690,000  1,930,000  1,930,000  1,970,000  1,990,000
17  65,450,000  1,410,000  1,550,000  1,550,000  1,660,000  1,620,000
18  54,080,000  1,120,000  1,300,000  1,300,000  1,330,000  1,350,000
19  43,530,000    950,000  1,070,000  1,070,000  1,100,000  1,100,000
20  36,300,000    780,000    900,000    900,000    900,000    900,000
21  29,740,000    630,000    710,000    710,000    720,000    730,000
22  24,100,000    510,000    600,000    600,000    620,000    610,000

OpenCL transforms:
 n        ocl  ocl4(low)        ocl5        ocl3   ocl4(high)
15  2,260,000  8,090,445  16,152,995  16,777,216  400,000,000
16  1,840,000  5,720,809  11,421,893  11,863,283  400,000,000
17  1,450,000  4,045,223   8,076,498   8,388,608  400,000,000
18  1,220,000  2,860,404   5,710,946   5,931,642  400,000,000
19  1,020,000  2,022,611   4,038,249   4,194,304  400,000,000
20    810,000  1,430,202   2,855,473   2,965,821  400,000,000
21    660,000  1,011,306   2,019,124   2,097,152  400,000,000
22    540,000    715,101   1,427,737   1,482,910  400,000,000
"32-bit Sempron CPU" is genefer_windows32.exe running on a single core, 32-bit Sempron that lacks SSE2. "32-bit Haswell CPU" is genefer_windows32.exe running on a Haswell Core i5. "64-bit Haswell CPU" is genefer_windows64.exe running on a Haswell Core i5. All GPU tests (CUDA, OCL, OCL2, OCL3) are done on an Nvidia GTX 580.
Included below is an older B limit table, for comparison:
This is not an exact science, and in some respects this is comparing apples to oranges because the versions I have available to me right now aren't using exactly the same tests, but here's what the various programs are reporting as their upper limit for B:
Red = PRPNet
Blue = BOINC
Green = PRPNet and BOINC
Version 2.3.0:
N Genefer Genefer80 GeneferX64 GeneferCUDA
64 12,020,000
128 8,401,000
256 6,613,000 253,240,000 5,985,000
512 5,200,000 207,170,000 5,130,000
1,024 4,302,000 171,850,000 4,170,000
2,048 3,401,000 140,350,000 3,460,000
4,096 2,619,000 113,350,000 2,820,000
8,192 2,129,000 93,520,000 2,165,000 2,650,000
16,384 1,811,000 77,750,000 1,840,000 2,280,000
32,768 1,416,000 64,510,000** 1,510,000 1,840,000 PRPNet
65,536 1,099,000 52,330,000** 1,240,000 1,525,000 PRPNet
131,072 962,000 42,950,000 1,025,000 1,270,000
262,144 752,000 35,490,000** 865,000 995,000* PRPNet
524,288 620,000 29,120,000** 735,000 815,000* PRPNet
1,048,576 512,000 24,450,000 600,000** 695,000* BOINC
2,097,152 19,700,000 495,000 565,000
4,194,304 475,000* BOINC
8,388,608 400,000
Version 3.1.2:
N Genefer Genefer80 GeneferX64 GeneferSSE3 GeneferAVX GeneferCUDA GeneferOCL
256 8,770,000 259,340,000 6,005,000 7,600,000 7,600,000
512 7,635,000 210,170,000 5,170,000 6,595,000 6,595,000
1,024 6,135,000 174,750,000 4,235,000 5,250,000 5,250,000
2,048 4,965,000 140,700,000 3,470,000 4,355,000 4,355,000
4,096 4,045,000 116,150,000 2,905,000 3,485,000 3,485,000
8,192 3,330,000 95,920,000 2,180,000 2,885,000 2,885,000 2,650,000 2,720,000
16,384 2,695,000 78,950,000 1,860,000 2,340,000 2,340,000 2,280,000 2,210,000
32,768 2,195,000 64,710,000** 1,540,000 1,955,000 1,955,000 1,840,000 1,830,000 PRPNet
65,536 1,785,000 53,080,000** 1,240,000 1,600,000 1,600,000 1,525,000 1,490,000 PRPNet
131,072 1,440,000 43,150,000 1,060,000 1,305,000 1,305,000 1,270,000 1,235,000
262,144 1,175,000 35,840,000** 870,000 1,065,000 1,065,000 995,000* 1,015,000* PRPNet
524,288 955,000 29,120,000** 735,000 890,000 890,000 815,000* 840,000* PRPNet
1,048,576 775,000 24,500,000 615,000 720,000** 720,000** 695,000* 690,000* BOINC
2,097,152 625,000 20,250,000 495,000 595,000 595,000 580,000 565,000
4,194,304 505,000 16,290,000 435,000 495,000 495,000 475,000* 470,000* BOINC
8,388,608 400,000 390,000
* = Preferred program (GPU)
** = Preferred program (CPU)
____________
My lucky number is 75898^524288+1
Out of curiosity, is there any reason we didn't start searching at N=131072?
____________
PrimeGrid Challenge Overall standings --- Last update: From Pi to Paddy (2016)
Michael Goetz Volunteer moderator Project administrator
Out of curiosity, is there any reason we didn't start searching at N=131072?
It's reserved. I don't know the details.
____________
Is there any chance of a 64 bit version of Genefer80, to test large b values?
15079730^32768+1
15547296^32768+1
I found these two primes, but they took a while, and I was using a 64 bit machine. Technically, any 32 bit program running on a 64 bit machine is running through an emulator of a 32 bit machine, which slows it down. Plus, I'm hoping the usage of 64 bit processing would have other benefits.
Michael Goetz Volunteer moderator Project administrator
Is there any chance of a 64 bit version of Genefer80, to test large b values?
tl;dr answer: No chance at all. It's impossible, by definition.
Confusing answer: Why would you want to downgrade to 64 bits when Genefer80 is an 80 bit program?
Full answer:
MOST computers have historically had single (32 bit) and double (64 bit) precision floating point hardware. Except for the Intel x86 architecture.
Back in the good ol' days, when Intel's floating point hardware came on a separate 8087 coprocessor chip, they did something a bit interesting with the 8087: in order to maintain precision in a long series of calculations, the 8087 was built not with 64 bit internal registers, but with 80 bit internal registers. If you do your entire calculation just using the registers, you can take advantage of the extended 80 bit precision and not have to worry about rounding errors messing up the final answer. You can also store the 80 bit numbers in memory, if needed, but it's significantly slower because the memory architecture isn't optimized for transferring 10 bytes at a time. (8, yes; 16, sometimes; but 10, no.)
Every Intel CPU from the 486 onward has included the floating point hardware in the main CPU, but the onboard floating point hardware is backwards compatible with the 8087, 80287, and 80387 floating point processors and retains those internal 80 bit registers. So we can still use 80 bit arithmetic, and that's exactly what Genefer80 does.
A lot of years have passed since then and there have been many advances in the x86 architecture. Perhaps the most important, for our purposes, is the family of SIMD instructions: Single Instruction, Multiple Data. In essence, these allow the CPU to do multiple calculations in parallel. Over the years, these new instructions (SSE, SSE2, a few other SSE versions, and most recently AVX) have greatly increased calculation speed.
But ALL of those SIMD instructions, at best, work on 64 bit floating point numbers. None of them can do 80 bit floating point. For that, you need the original 8087 floating point instructions, and if you use those 8087 instructions you lose the speed benefits of the SIMD instructions.
You can use either the extended 80 bit precision, or the modern and fast SIMD instructions, but not both.
So, technically we could build a 64 bit version of Genefer80, but it wouldn't be using 64 bit integer arithmetic (which is where the actual gain from the 32 to 64 bit change comes from), and it wouldn't be using SIMD (SSE/AVX) instructions, so I'd be extremely surprised if there was any change in speed at all. There might be a very small increase, or a very small decrease, but the difference would most likely be negligible.
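To make the 64 bit versus 80 bit distinction concrete: a double carries a 53 bit significand, while the x87 extended format carries a 64 bit significand. The sketch below is only an illustration and is not taken from Genefer. The helper name `round_to_significand` is hypothetical, and it models rounding with plain Python integers rather than touching the FPU.

```python
def round_to_significand(x: int, bits: int) -> int:
    """Round a non-negative integer to `bits` significant bits.

    Uses round-half-to-even, the default IEEE 754 rounding mode.
    """
    if x.bit_length() <= bits:
        return x                      # already fits, nothing is lost
    shift = x.bit_length() - bits     # number of low bits to discard
    q, r = divmod(x, 1 << shift)
    half = 1 << (shift - 1)
    if r > half or (r == half and q & 1):
        q += 1                        # round up (ties go to the even value)
    return q << shift

DOUBLE_BITS = 53   # IEEE 754 double: 52 stored bits plus 1 implicit bit
X87_BITS = 64      # x87 extended precision: 64 explicit significand bits

x = 2**60 + 1
print(round_to_significand(x, DOUBLE_BITS) == x)  # the low bit is lost
print(round_to_significand(x, X87_BITS) == x)     # the value survives intact
```

The extra 11 significand bits are why the x87 columns in the tables sit so far above the SSE and AVX columns: each intermediate value can be larger before rounding starts to corrupt it.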
EDIT: I missed the "32-bit emulator" part. As Ron said, that statement is completely wrong, and, in practice, many 32 bit apps are actually faster than their 64 bit counterparts. x86 architecture chips are actually 16 bit chips which have 32 bit extensions (starting with the 80386 chip) and 64 bit extensions starting with AMD's Athlon 64 chips. They'll happily run in native 16 bit or 32 bit mode if you want them to, although good luck finding an operating system that will run in 16 bit mode. There's still a lot of 32 bit code around, perhaps even MOST software, although 16 bit code is largely a thing of the past.
____________
rroonnaalldd Volunteer developer Volunteer tester
Joined: 3 Jul 09 Posts: 1213 ID: 42893 Credit: 34,634,263 RAC: 0
Is there any chance of a 64 bit version of Genefer80, to test large b values?
Genefer80 uses the FPU, where each register is 80 bits long.
Switching back and forth to the FPU takes more time in 64-bit than in 32-bit mode, so you would see increased computation times...
This behaviour was and is also viewable with GeneferCUDA and the coming llrCUDA.
Technically, any 32 bit program running on a 64 bit machine is running through an emulator of a 32 bit machine, which slows it down.
No, Intel and AMD have native support for both. Only Intel's Titanic/Itanium has a 32-bit emulation layer.
Plus, I'm hoping the usage of 64 bit processing would have other benefits.
No. Take a look at the situation on LLR with AVX. There is no difference in runtimes between a 32-bit app and its 64-bit counterpart.
____________
Best wishes. Knowledge is power. by jjwhalen
That is all interesting. I had no idea that Intel made an 80-bit register system! Weird.
What if two 64 bit registers were used instead, to make a 128 bit program? Couldn't the newer 64 bit instructions be utilized rather effectively, while also expanding b value limits?
In terms of emulation, I understand that 32 bit instructions are still native for new processors, but I've read that Microsoft Windows technically runs an emulator that slows things down just a little bit (maybe 2-3% slower). That's why the program shows up as "Genefer80.exe *32" in Windows Task Manager. It's a 32-bit build.
Originally, I wasn't suggesting the use of only 64 bits. Rather, I was looking for a 64-bit build of a program that happens to use the 80 bit registers. 64 bit builds can still use old registers, yet they run slightly faster in a 64-bit OS. Well, that's supposedly true for MS windows. I don't know about Linux.
rroonnaalldd Volunteer developer Volunteer tester
If you used two registers to make one 128-bit register, you would see the same loss of performance as Bulldozer with its "256-bit" AVX registers...
With BD it makes no difference whether you use AVX or SSE. SSE is 128 bits wide, and AVX uses two 128-bit registers to create one 256-bit AVX register.
The only thing that would make a difference on BD is the use of FMA (fused multiply-add), which Intel will support with AVX2 next year with Haswell.
SSE was designed by Intel to be 128 bits wide. Only AMD used a 64-bit width in the past, in its first SSE design.
GenefX64 needs CPU support for SSE2, but it does not really need to be a 64-bit app. That was a developer decision to make CPU feature detection easier: all CPUs with 64-bit support also have SSE2 on board.
Take a look at SPEC. The published values for SPECint and SPECfp are mostly higher in 32-bit than in the 64-bit counterpart. 64-bit apps are not always faster than their 32-bit counterparts.
____________
Michael Goetz Volunteer moderator Project administrator
In terms of emulation, I understand that 32 bit instructions are still native for new processors, but I've read that Microsoft Windows technically runs an emulator that slows things down just a little bit (maybe 2-3% slower). That's why the program shows up as "Genefer80.exe *32" in Windows Task Manager. It's a 32-bit build
There's definitely no loss of performance under Windows. Just the opposite. Unless the program does a lot of large integer math (and thus can make use of the 64 bit integer math instructions), 32-bit programs usually run slightly faster than their 64 bit counterparts, in my experience.
The reason 64 bit apps are slower is most likely due to a few esoteric details about CPU timing, such as saturating memory bandwidth and/or increased cache misses, both due to more data being transferred by 64 bit ops, some of which is bound to simply be 32 extra zeroes. For highly tuned assembly programs (LLR, Genefer80, GenefX64, and probably others), changing the number of cache misses could have a noticeable effect on performance.
For LLR and similar programs, 64 bit vs 32 bit isn't that significant because they don't use a lot of 64 bit integer arithmetic, which is where you would see the real advantage. Sieves are different, and generally can make use of those 64 bit integer instructions, so there's a big difference there. But for programs that rely on floating point, there's no advantage, and you're moving more bytes around for no reason, which makes the program slower rather than faster.
____________
The 64 bit version of LLR is faster on my computer than the 32 bit version of LLR. How come?
Michael Goetz Volunteer moderator Project administrator
The 64 bit version of LLR is faster on my computer than the 32 bit version of LLR. How come?
It's probably about 5% faster, at least it is on my computer.
I honestly don't know the answer. Clearly, there's something in the code that takes advantage of some 64 bit features with good results.
The 64 bit version of GeneferCUDA is slower than the 32 bit version by a small amount.
Just making a 64 bit build of a program doesn't automatically make it faster. It needs to be taking advantage of 64 bit features on the CPU that are faster than the 32 bit equivalent. The big advantage is when you're doing 64 bit integer math. That's important for sieves, but generally not so important for programs like LLR. Frankly, I was surprised that this 64 bit build of LLR was faster.
If nobody else gives you a better answer to this question, you could go over to the Mersenne forums and ask there.
____________
The 64 bit version of LLR is faster on my computer than the 32 bit version of LLR. How come?
64 bit programs have access to eight more SSE (and integer) registers than 32 bit programs do.
Michael Goetz Volunteer moderator Project administrator
The 64 bit version of LLR is faster on my computer than the 32 bit version of LLR. How come?
You might want to jump over to the "LLR development version 3.8.7 is available!" thread for more information.
For one thing, I misspoke about my performance gain -- it was more like 10 or 12% with the 64 bit build, not 5%. But some people were seeing performance drops with the 64 bit version. It seems to depend on the CPU.
____________
http://msdn.microsoft.com/en-us/library/aa384249(v=vs.85).aspx
I think this is the 32 bit emulator MS Windows runs in order to run a 32 bit program such as Genefer80. The WOW64 emulator. I read somewhere else that the emulating process, while minimal, does slow down the application in comparison to running the 32 bit application in a 32 bit OS environment.
Michael Goetz Volunteer moderator Project administrator
That's just an OS compatibility layer. It's not an emulator in the sense that the word is usually used. It's analogous to Wine (which takes great pains to declare itself "not an emulator"). All that, however, is just semantics.
OS calls may be very slightly slower, but the program itself isn't affected.
The bottom line is still that you can't make Genefer80 go faster by compiling it into a 64 bit app.
____________
Well that's a bummer.
So what's the limiting factor when it comes to GeneferCUDA? Why, at N=22, can it only test up to b=475,000? Is it possible that future GPUs will be able to test higher b values? That is, will the limiting factor (whatever it is) change in the future?
Michael Goetz Volunteer moderator Project administrator
Well that's a bummer.
So what's the limiting factor when it comes to GeneferCUDA? Why, at N=22, can it only test up to b=475,000? Is it possible that future GPUs will be able to test higher b values? That is, will the limiting factor (whatever it is) change in the future?
The B limits are determined, experimentally, as the point where the mathematical calculations performed upon the number being tested start producing rounding errors that may be destructive to the calculations.
Above that point the calculations produce incorrect results.
What is needed to go beyond that point is either an as yet unknown modification to the software, or higher precision hardware.
It's extremely unlikely that GPUs will start being built with 80 bit or 128 bit floating point hardware. Even the 64 bit double precision hardware is a relatively new addition to GPUs. 32 bit single precision hardware is generally all that's needed for graphics, so there's very little market pressure for high precision math on GPUs. Higher precision floating point hardware is the easy way to go higher in our calculations, but I don't expect that to happen anytime in the foreseeable future.
It's far more likely that searching at n=23 will become feasible than the chance of significantly raising the B limits. The limit on searching at n=23 right now is simply speed and GPU memory, both of which are very likely to go up significantly before we run out of numbers to crunch at n=22.
475K may not seem like a lot since we've been running through the lower n ranges so quickly, but there's a huge amount of processing to be done at n=22. If we had 1000 GTX 580 GPUs at our disposal, running 24/7 and never making an error, it would still take us about 4 years to do all the crunching. I don't think we have anywhere near that amount of crunching power, but that, at least, is something that will improve with time.
I'm not worried about running out of WUs at n=22. I'm more worried about living long enough to see us run out of WUs at n=22. :)
____________
Michael Goetz Volunteer moderator Project administrator
So would I expect any difference in either speed, type or kind of task available or something else when it comes to running Genefer by means of CUDA using Windows Ultimate 64 bits?
GeneferCUDA should run exactly the same speed under 32-bit or 64-bit versions of Windows.
GenefX64, the CPU app, can only run under the 64 bit version of Windows.
____________
[...] If we had 1000 GTX 580 GPUs at our disposal, running 24/7 and never making an error, it would still take us about 4 years to do all the crunching. I don't think we have anywhere near that amount of crunching power, but that, at least, is something that will improve with time.
I'm not worried about running out of WUs at n=22. I'm more worried about living long enough to see us run out of WUs at n=22. :)
It's easy to achieve. Raise credit level to a certain point and power will appear immediately. ;)
Michael Goetz Volunteer moderator Project administrator
[...] If we had 1000 GTX 580 GPUs at our disposal, running 24/7 and never making an error, it would still take us about 4 years to do all the crunching. I don't think we have anywhere near that amount of crunching power, but that, at least, is something that will improve with time.
I'm not worried about running out of WUs at n=22. I'm more worried about living long enough to see us run out of WUs at n=22. :)
It's easy to achieve. Raise credit level to a certain point and power will appear immediately. ;)
This has been discussed in some detail in various places. Here's one reply I made back on 1-Feb.
There's also another discussion, somewhere, where I made the point that if the admins raise credit, the very next thing to happen will be that lots of credit hunters flock to PrimeGrid. Some number of milliseconds later, other admins will raise their credit even more. :)
____________
Yeah, I read that thread. Some may like it, some may not, but more credit (within an appropriate range, of course) is one kind of solution.
|
|
It's easy to achieve. Raise credit level to a certain point and power will appear immediately. ;)
if they stop c/w sieve they can have a couple more GPUs straight away (hint, hint).
Michael Goetz Volunteer moderator Project administrator
I updated the chart to reflect the change in status of n=19 and n=20.
At this time, if you have a CUDA CC1.3 or better GPU and want to run GeneferCUDA, you can run n=20 (short) or n=22 (long WR) on BOINC, or GFN262144 or GFN524288 on PRPNet.
If you have a 64 bit CPU and want to run GenefX64, the only range available is n=20 (short) on BOINC.
32 bit CPUs can run Genefer80 on PRPNet with GFN32768, GFN65536, GFN262144, or GFN524288. 64 bit CPUs can also run Genefer80, but Genefer80 is about 3 times slower than GenefX64.
Note that as of right now, PRPNet work hasn't been loaded into GFN524288, but I expect that to happen soon.
____________
Honza Volunteer moderator Volunteer tester Project scientist
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,148,713,421 RAC: 2,303,821
You can run 32 bit CPUs running Genefer80 on PRPNet with GFN32768, GFN65536, GFN262144, or GFN524288. 64 bit CPUs can also run Genefer80, but Genefer80 is about 3 times slower than GenefX64.
I've conducted a small test with GFN32768 and Genefer80, LLR64, PFGW64/AVX on i5-3570. All 4 apps were run at once.
>genefer80.exe -q "4108672^32768+1"
genefer80 2.3.0-0 (Windows x86 80-bit x87)
...
4108672^32768+1 is a probable prime. (216718 digits) (err = 0.0022) (time = 0:17:46)
[Honza] That's 1065 sec.
>CLLR64.exe -q "4108672^32768+1"
4108672^32768+1 is prime! Time : 1158.581 sec.
>pfgw64.exe -q"4108672^32768+1"
PFGW Version 3.6.3.64BIT.20120316.Win_Dev [GWNUM 27.5]
4108672^32768+1 is 3-PRP! (1164.8072s+0.0041s)
>pfgw64.exe -q"4108672^32768+1"
PFGW Version 3.6.7.64BIT.20121129.Win_Dev [GWNUM 27.8]
4108672^32768+1 is 3-PRP! (1166.5173s+0.0045s)
Running only a single instance
>pfgw64.exe -q"4108672^32768+1"
PFGW Version 3.6.7.64BIT.20121129.Win_Dev [GWNUM 27.8]
4108672^32768+1 is 3-PRP! (875.9579s+0.0053s)
>CLLR64.exe -q "4108672^32768+1"
4108672^32768+1 is prime! Time : 870.493 sec.
>genefer80.exe -q "4108672^32768+1"
...
4108672^32768+1 is a probable prime. (216718 digits) (err = 0.0022) (time = 0:15:41)
[Honza] That's 941 sec
When running a single test, PFGW64 and LLR are fastest, not Genefer80
____________
My stats
Michael Goetz Volunteer moderator Project administrator
When running a single test, PFGW64 and LLR are fastest, not Genefer80
I don't have an AVX capable CPU, so my tests don't incorporate AVX. In my tests, Genefer80 came out faster than PFGW64 or LLR.
AVX, of course, provides a large boost in performance, so it's not surprising that Genefer80 is slower.
The difference in speed between the single and multiple instances should be due mostly to cache misses. That effect is going to be different with every CPU, so the results you found are going to be valid only for your computer. Everyone should probably run their own tests to see which is faster on their hardware.
____________
N=19 genefercuda will be reached very soon on prpnet. Enjoy while you can...
____________
676754^262144+1 is prime
Michael Goetz Volunteer moderator Project administrator
The discussion regarding not getting work for GFN262144 has been moved to the "GFN262144 Not getting work" thread.
____________
Michael Goetz Volunteer moderator Project administrator
The first post has been edited to show the limits of the new 3.1.2 apps.
____________
pschoefer Volunteer developer Volunteer tester
Joined: 20 Sep 05 Posts: 686 ID: 845 Credit: 2,910,184,413 RAC: 199,509
While GeneferCUDA reports the same values as in the first post on my GTX 460 and GTX 570, it's slightly different on GTX 660Ti:
The upper bound m = 8192, b = 2555000, Err = 0.2969
The upper bound m = 16384, b = 2220000, Err = 0.2969
The upper bound m = 32768, b = 1775000, Err = 0.2910
The upper bound m = 65536, b = 1415000, Err = 0.2969
The upper bound m = 131072, b = 1225000, Err = 0.3047
The upper bound m = 262144, b = 985000, Err = 0.2969
The upper bound m = 524288, b = 810000, Err = 0.3047
The upper bound m = 1048576, b = 680000, Err = 0.2969
The upper bound m = 2097152, b = 545000, Err = 0.2891
The upper bound m = 4194304, b = 465000, Err = 0.3125
The upper bound m = 8388608, b = 395000, Err = 0.3125
____________
Michael Goetz Volunteer moderator Project administrator
While GeneferCUDA reports the same values as in the first post on my GTX 460 and GTX 570, it's slightly different on GTX 660Ti:
The upper bound m = 8192, b = 2555000, Err = 0.2969
The upper bound m = 16384, b = 2220000, Err = 0.2969
The upper bound m = 32768, b = 1775000, Err = 0.2910
The upper bound m = 65536, b = 1415000, Err = 0.2969
The upper bound m = 131072, b = 1225000, Err = 0.3047
The upper bound m = 262144, b = 985000, Err = 0.2969
The upper bound m = 524288, b = 810000, Err = 0.3047
The upper bound m = 1048576, b = 680000, Err = 0.2969
The upper bound m = 2097152, b = 545000, Err = 0.2891
The upper bound m = 4194304, b = 465000, Err = 0.3125
The upper bound m = 8388608, b = 395000, Err = 0.3125
That's interesting. It's also unexpected. Thanks for the information.
____________
Michael Goetz Volunteer moderator Project administrator
I've added GeneferOCL's limits to the first post.
____________
I've added GeneferOCL's limits to the first post.
Is any of the GFN being run on PRPNet still within cuda or opencl b limits?
____________
Scott Brown Volunteer moderator Project administrator Volunteer tester Project scientist
Joined: 17 Oct 05 Posts: 2394 ID: 1178 Credit: 18,681,105,424 RAC: 6,902,248
I've added GeneferOCL's limits to the first post.
Is any of the GFN being run on PRPNet still within cuda or opencl b limits?
Yes, the 524288 port for both CUDA and OpenCL.
Michael Goetz Volunteer moderator Project administrator
I've added GeneferOCL's limits to the first post.
Is any of the GFN being run on PRPNet still within cuda or opencl b limits?
N=19 (524288) definitely is because we moved it back to PRPNet because it was close to the Genefx64 limit, which is substantially lower than the CUDA limit.
I'm not sure about N=18; it might also be below the CUDA limit but above the Genefx64 limit.
The first post in this thread lists the "official" limits, and the leading edge of the PRPNet ports is visible on the status pages. The official limit, however, is somewhat fuzzy in that when you're near the limit some tests will fail and others will succeed.
In fact, Iain and Yves have discovered that within a single test, some iterations will exceed the error threshold and some won't. We're considering code that will actively switch back and forth between different transforms to get the fastest execution when you're near the limit.
Imagine, if you will, you're near the B limit and instead of GeneferCUDA giving up when the threshold is exceeded -- which might only happen in 5 iterations out of several million iterations -- it switches internally to the 80 bit Genefer80 transform for a few iterations, then switches back to CUDA for the rest of the calculation. Is it possible? Yes. The tricky part is knowing when to switch back and forth.
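The switching idea can be sketched as a loop that tries a fast step first and redoes only the iterations that trip the error check. Everything here is hypothetical and for illustration only: `fast_square_mod`, `precise_square_mod`, `run_with_fallback`, the 0.45 threshold, and the modulus are invented names and values, and the "error" reported by the fast step is a stand-in rather than a measured round-off figure. It is not how Genefer does it.

```python
def fast_square_mod(x, m):
    """Toy 'fast' step: square using double-precision floats.

    The float product is exact only while x*x fits in 53 bits, so the
    step reports err = 1.0 (a stand-in for a measured round-off error)
    whenever x >= 2**26, and err = 0.0 otherwise.
    """
    value = int(float(x) * float(x)) % m
    err = 0.0 if x < 2**26 else 1.0
    return value, err

def precise_square_mod(x, m):
    """Toy 'precise' step: exact integer arithmetic, assumed to be slower."""
    return (x * x) % m

def run_with_fallback(x0, iterations, fast_step, precise_step, threshold=0.45):
    """Iterate x -> step(x), redoing an iteration precisely if it looks unsafe."""
    x = x0
    fallbacks = 0
    for _ in range(iterations):
        candidate, err = fast_step(x)
        if err > threshold:
            # Discard the suspect result and redo this one iteration.
            candidate = precise_step(x)
            fallbacks += 1
        x = candidate
    return x, fallbacks

m = 2**27 - 39  # arbitrary modulus chosen for the demo
result, fallbacks = run_with_fallback(
    12345, 50,
    lambda x: fast_square_mod(x, m),
    lambda x: precise_square_mod(x, m),
)
print(result, fallbacks)
```

This version checks every iteration and never commits to the slow path for longer than one step, which sidesteps the "when to switch back" question at the cost of always attempting the fast step first.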
____________
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
I've added GeneferOCL's limits to the first post.
Is any of the GFN being run on PRPNet still within cuda or opencl b limits?
N=19 (524288) definitely is because we moved it back to PRPNet because it was close to the Genefx64 limit, which is substantially lower than the CUDA limit.
I am currently running on the 524288 port with OpenCL. Of 11 WUs yesterday, 9 succeeded and 2 failed by exceeding the error threshold.
At ~b=788,000 the port is still well within the theoretical b-limit of 840,000 for OpenCL.
|
|
I am definitely able to run N=524288 units using genefercuda on prpnet. I've done 10+ in the last day or two without error. Something must be broken though, as I haven't found any primes yet :-)
I cannot run N=262144 units using either genefercuda or geneferocl. I hit "maxerr exceeded" with each. Not unexpected.
This is with a gtx570 on linux.
--Gary | |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1150 ID: 55391 Credit: 1,099,825,858 RAC: 813,250
|
A little over 2 years ago this board saw the following exchange:
So what's the limiting factor when it comes to GeneferCUDA? Why, at N=22, can it only test up to b=475,000? Is it possible that future GPUs will be able to test higher b values? That is, will the limiting factor (whatever it is) change in the future?
The B limits are determined, experimentally, as the point where the mathematical calculations performed upon the number being tested start producing rounding errors that may be destructive to the calculations.
Above that point the calculations produce incorrect results.
What is needed to go beyond that point is either an as yet unknown modification to the software, or higher precision hardware.
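To make the rounding-error point concrete, here is a rough sketch that squares a number written in base b using a floating-point FFT and reports how far the results drift from whole numbers. It is an illustration only: Genefer's real transforms are different and heavily optimized, and the names `fft` and `max_roundoff`, the digit count, and the seed are assumptions made for this example. Once the drift approaches 0.5, rounding can pick the wrong integer and the test result is no longer trustworthy.

```python
import cmath
import random

def fft(a, invert=False):
    """Plain recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def max_roundoff(b, n, seed=1):
    """Square a random n-digit base-b number; return the worst distance to an integer."""
    rng = random.Random(seed)
    digits = [rng.randrange(b) for _ in range(n)]
    padded = digits + [0] * n                 # zero-pad so the product does not wrap
    spectrum = fft(padded)
    product = fft([z * z for z in spectrum], invert=True)
    coeffs = [z.real / (2 * n) for z in product]  # unnormalized inverse, so divide
    return max(abs(c - round(c)) for c in coeffs)

small = max_roundoff(10, 256)
large = max_roundoff(10 ** 6, 256)
```

With the same transform length, the larger base gives a visibly larger error, which is the effect the experimentally determined B limits are guarding against.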
Since then, we've seen some success with Genefer switching to different implementations to get around the occasional rounding error.
Christopher Siegert alluded to a different software possibility in the earlier message 51141:
What if two 64 bit registers were used instead, to make a 128 bit program? Couldn't the newer 64 bit instructions be utilized rather effectively, while also expanding b value limits?
The discussion digressed to hardware and never really took note of Chris' suggestion.
The technique Chris referred to is called double-double arithmetic. It daisy-chains (in software) 2 double-precision values to perform pseudo-quad-precision calculations. See https://en.wikipedia.org/wiki/Quadruple_precision_floating-point_format#Double-double_arithmetic
With AVX instructions doing the required 4 double-precision cross-multiplications in parallel, a double-double multiplication should take about as long as a single serial multiplication at hardware-supported precision. The principal slowdown would come from carrying double the data in the CPU cache for the same number of algorithm steps, but it might still be faster than x87 extended precision.
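For readers who have not seen double-double arithmetic, here is a minimal scalar sketch in Python built on the classic error-free transformations (Knuth's two-sum and Dekker's split-based two-product). It is not the genefer128 code and makes no attempt at AVX; the function names are just labels for this example. This common variant of `dd_mul` uses three of the four cross products and drops the lo*lo term as negligible.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    return s, (a - (s - bb)) + (b - bb)

def split(a):
    """Dekker split of a double into two halves that multiply without rounding."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and a * b = p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dd_mul(x, y):
    """Multiply two double-doubles given as (hi, lo) pairs."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]  # cross terms; lo*lo is dropped
    hi = p + e
    lo = e - (hi - p)               # renormalize so hi carries the leading bits
    return hi, lo

a = 1.0 + 2.0 ** -30
hi, lo = dd_mul((a, 0.0), (a, 0.0))
exact = Fraction(a) ** 2
```

Here the plain double product `a * a` loses the lowest bit of the square, while `hi + lo` taken together still represents it exactly. `two_sum` is not needed by `dd_mul` itself; it is the building block for double-double addition.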
You could probably also apply this technique (with somewhat more difficulty) to GPU calculations.
Or you can wait an indeterminate number of decades until the mainstream microprocessor makers introduce real quad-precision hardware, already half a century after IBM started to put it into their mainframes. | |
|
composite Volunteer tester
Joined: 16 Feb 10 Posts: 1150 ID: 55391 Credit: 1,099,825,858 RAC: 813,250
|
The technique Chris referred to is called double-double arithmetic. It daisy-chains (in software) 2 double-precision values to perform pseudo-quad-precision calculations.
Oh, I see it's been done already and it's 5 times slower without AVX, as expected.
https://www.assembla.com/code/genefer/subversion/nodes/692/trunk/src/genefer128
At least it's not as bad as using the GNU quadmath library, which is 50 times slower (I tried it with genefer 1.3), so it seems that libquadmath implements true quad-precision rather than double-double.
| |
|
|
What is the largest b value that PrimeGrid plans to test at the N=20 level? | |
|
axn Volunteer developer
Joined: 29 Dec 07 Posts: 285 ID: 16874 Credit: 28,027,106 RAC: 0
|
What is the largest b value that PrimeGrid plans to test at the N=20 level?
From the "Genefer B limits" thread:
N Genefer Genefer80 GeneferX64 GeneferSSE3 GeneferAVX GeneferCUDA GeneferOCL
1,048,576 775,000 24,500,000 615,000 720,000 720,000 695,000 690,000
So, around 700000? | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
GeneferCUDA app for n=20 has now been disabled in BOINC.
n=20 leading edge now past 695,000.
GeneferOCL and CPU apps still running.
What is the largest b value that PrimeGrid plans to test at the N=20 level?
From the "Genefer B limits" thread:
N Genefer Genefer80 GeneferX64 GeneferSSE3 GeneferAVX GeneferCUDA GeneferOCL
1,048,576 775,000 24,500,000 615,000 720,000 720,000 695,000 690,000
So, around 700000?
Those b-limits are for version 3.1.2, which we don't run anymore.
Limits tests for HD7970 GPU with more modern versions of GeneferOCL:
geneferocl 3.2.2 (Windows/OpenCL/32-bit)
>primegrid_genefer_3_2_2_0_3.01_windows_intelx86__atiGFN.exe -l
The upper bound m = 1048576, b = 750000, Err = 0.3086
geneferocl 3.2.5-dev (Windows/OpenCL/32-bit)
The upper bound m = 1048576, b = 750000, Err = 0.2969
So a bit of juice still left in the tank.
Each hardware architecture will potentially have a different b-limit. Use the -l command line option to see where your theoretical b-limit is. | |
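If you want to collect these numbers from several machines, the "upper bound" lines shown above are easy to pick apart. The sketch below assumes the output keeps exactly the format quoted in this post; `parse_upper_bound` is a hypothetical helper, not part of Genefer.

```python
import re

# Matches lines such as: "The upper bound m = 1048576, b = 750000, Err = 0.3086"
UPPER_BOUND = re.compile(r"The upper bound m = (\d+), b = (\d+), Err = ([0-9.]+)")

def parse_upper_bound(line):
    """Return (m, b, err) from an 'upper bound' line, or None if it does not match."""
    match = UPPER_BOUND.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))

parsed = parse_upper_bound("The upper bound m = 1048576, b = 750000, Err = 0.3086")
```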
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 435,627,755 RAC: 870,396
|
What is the largest b value that PrimeGrid plans to test at the N=20 level?
720K.
____________
My lucky number is 75898^524288+1 | |
|
|
What is the largest b value that PrimeGrid plans to test at the N=20 level?
720K.
The first n=20 generalized Fermat prime might never be discovered if it lies beyond b=720'000, which is probable given what we know now.
Is there no plan to go beyond 720'000 with "Genefer80", i.e. the CPU 80-bit arithmetic? I realize it would take very long to get a hit in that "mode".
/JeppeSN
| |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
What is the largest b value that PrimeGrid plans to test at the N=20 level?
720K.
For n=20 we're running genefer version 3.2.6 on BOINC.
I ran some CPU limits tests:
i7-3540M CPU
avx-intel: 775000
sse4: 775000
sse2: 775000
default: 775000
x87: 24500000
AMD X6 1100 CPU
sse2: 775000
default: 775000
x87: 23900000
In tests on my PCs, B-limits increased from 750000 in version 3.2.5 to 775000 in version 3.2.6. Some tests will hit maxErr well before the B-limit though. | |
|
Roger Volunteer developer Volunteer tester
Joined: 27 Nov 11 Posts: 1138 ID: 120786 Credit: 268,668,824 RAC: 0
|
For n=19 we're running genefer version 3.2.5 on PRPNet version prpclient-5.3.2.
I ran some CPU limits tests:
i7-3540M CPU
avx-intel: 955000
sse4: 955000
sse2: 955000
default: 955000
x87: 29120000
AMD X6 1100 CPU
sse2: 955000
default: 955000
x87: 30720000
B-limits have stayed at 955,000 across versions 3.2.1, 3.2.5 and 3.2.7. The leading edge is at b=920,118, so there is plenty of life left yet in this CPU PRPNet sub-project!
Disclaimer 1: some tests will hit maxErr well before the B-limit.
Disclaimer 2: these are my opinions, not necessarily those of PrimeGrid. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 435,627,755 RAC: 870,396
|
B limits for the upcoming 3.2.9 versions of Genefer:
Note that except for OCL3, B limits are "fuzzy" -- near the limit, some tests will succeed while others will fail, and it's impossible to predict which will succeed. With OCL3, the B limit is exact -- all tasks below the limit will succeed, and it won't run tasks above the limit.
32-bit Sempron CPU
 n      2^n         x87     default
15    32768  99,460,000   2,180,000
16    65536  79,010,000   1,690,000
17   131072  64,150,000   1,420,000
18   262144  53,680,000   1,150,000
19   524288  43,620,000     940,000
20  1048576  36,030,000     760,000
21  2097152  29,050,000     640,000
22  4194304  23,870,000     530,000

32-bit Haswell CPU
 n      2^n         x87     default        sse2        sse4         avx        fma3
15    32768  99,710,000   2,180,000   2,420,000   2,420,000   2,520,000   2,480,000
16    65536  81,840,000   1,690,000   1,930,000   1,930,000   1,970,000   1,950,000
17   131072  65,450,000   1,410,000   1,570,000   1,570,000   1,660,000   1,590,000
18   262144  54,080,000   1,120,000   1,300,000   1,300,000   1,340,000   1,350,000
19   524288  43,530,000     950,000   1,050,000   1,050,000   1,110,000   1,060,000
20  1048576  36,300,000     780,000     890,000     890,000     890,000     890,000
21  2097152  29,740,000     630,000     720,000     720,000     760,000     720,000
22  4194304  24,100,000     510,000     600,000     600,000     630,000     590,000

64-bit Haswell CPU
 n      2^n         x87     default        sse2        sse4         avx        fma3
15    32768  99,710,000   2,180,000   2,420,000   2,420,000   2,520,000   2,390,000
16    65536  81,840,000   1,690,000   1,930,000   1,930,000   1,970,000   1,990,000
17   131072  65,450,000   1,410,000   1,550,000   1,550,000   1,660,000   1,620,000
18   262144  54,080,000   1,120,000   1,300,000   1,300,000   1,330,000   1,350,000
19   524288  43,530,000     950,000   1,070,000   1,070,000   1,100,000   1,100,000
20  1048576  36,300,000     780,000     900,000     900,000     900,000     900,000
21  2097152  29,740,000     630,000     710,000     710,000     720,000     730,000
22  4194304  24,100,000     510,000     600,000     600,000     620,000     610,000

GPU
 n      2^n        cuda         ocl        ocl2        ocl3
15    32768   2,120,000   2,260,000  95,520,000  16,777,216
16    65536   1,710,000   1,840,000  81,670,000  11,863,283
17   131072   1,350,000   1,450,000  60,430,000   8,388,608
18   262144   1,090,000   1,220,000  50,790,000   5,931,642
19   524288     950,000   1,020,000  41,350,000   4,194,304
20  1048576     760,000     810,000  35,020,000   2,965,821
21  2097152     620,000     660,000  28,310,000   2,097,152
22  4194304     510,000     540,000  22,470,000   1,482,910
"32-bit Sempron CPU" is genefer_windows32.exe running on a single core, 32-bit Sempron that lacks SSE2. "32-bit Haswell CPU" is genefer_windows32.exe running on a Haswell Core i5. "64-bit Haswell CPU" is genefer_windows64.exe running on a Haswell Core i5. All GPU tests (CUDA, OCL, OCL2, OCL3) are done on an Nvidia GTX 580.
Reported limits may vary on different hardware, except for OCL3.
EDIT: Limits also may vary depending on the operating system. (Except for OCL3.)
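One pattern worth noting in the table above: the ocl3 column matches 2^(31.5 - n/2) rounded to the nearest integer, which amounts to keeping b^2 * 2^n at roughly 2^63. The sketch below only reproduces the posted numbers. The formula is inferred from the table, not taken from Genefer's source, and `ocl3_limit` is a hypothetical helper name.

```python
def ocl3_limit(n):
    """Largest b in the ocl3 column for a given n (formula inferred from the table)."""
    return round(2 ** (31.5 - n / 2))

# ocl3 values as posted in the table
posted = {15: 16_777_216, 16: 11_863_283, 17: 8_388_608, 18: 5_931_642,
          19: 4_194_304, 20: 2_965_821, 21: 2_097_152, 22: 1_482_910}

matches = all(ocl3_limit(n) == b for n, b in posted.items())
```

A closed-form bound like this would be consistent with the OCL3 limit being exact rather than fuzzy, though that reading is an assumption on top of the numbers.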
____________
My lucky number is 75898^524288+1 | |
|
|
Is n=19 near B limit for FMA3 on Haswell x64??
Testing 1029472^524288+1... took >19 hours vs TheDawgz average of <8 hours
http://www.primegrid.com/result.php?resultid=688847987
Or are TheDawgz confused??
____________
There's someone in our head but it's not us. | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 435,627,755 RAC: 870,396
|
Is n=19 near B limit for FMA3 on Haswell x64??
Testing 1029472^524288+1... took >19 hours vs TheDawgz average of <8 hours
http://www.primegrid.com/result.php?resultid=688847987
Or are TheDawgz confused??
It's an approximate limit, and we're kind of close, so it's possible.
____________
My lucky number is 75898^524288+1 | |
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2645 ID: 29980 Credit: 568,565,361 RAC: 147
|
http://www.primegrid.com/forum_thread.php?id=6511&nowrap=true#92161
See also the chart above I did for n=13 where some units started getting longer before the nominal limit. | |
|
Rafael Volunteer tester
Joined: 22 Oct 14 Posts: 912 ID: 370496 Credit: 552,485,027 RAC: 466,454
|
Michael, can you please update the table in OP to include OCL4 (both high and low) as well as OCL5? Possibly exclude OCL2 as well, given that it's pretty much useless next to OCL4-High.
Would be pretty convenient... | |
|
Michael Goetz Volunteer moderator Project administrator
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 435,627,755 RAC: 870,396
|
Michael, can you please update the table in OP to include OCL4 (both high and low) as well as OCL5? Possibly exclude OCL2 as well, given that it's pretty much useless next to OCL4-High.
Would be pretty convenient...
I've adjusted the chart to reflect the current transforms, and removed OCL2 and CUDA from the table since they are no longer used.
____________
My lucky number is 75898^524288+1 | |
|
GDB
Joined: 15 Nov 11 Posts: 298 ID: 119185 Credit: 4,070,448,791 RAC: 1,962,721
|
The OCL names in your table aren't the current names being used? | |
|
GDB
Joined: 15 Nov 11 Posts: 298 ID: 119185 Credit: 4,070,448,791 RAC: 1,962,721
|
The OCL names in your table aren't the current names being used?
When I run a GFN, I see OCL, OCL2, OCL3, OCL4, and OCL5 transforms being benchmarked.
| |
|
Rafael Volunteer tester
Joined: 22 Oct 14 Posts: 912 ID: 370496 Credit: 552,485,027 RAC: 466,454
|
The OCL names in your table aren't the current names being used?
When I run a GFN, I see OCL, OCL2, OCL3, OCL4, and OCL5 transforms being benchmarked.
Yeah, they're different. Originally, the OCL4 transform had a "low" and a "high" variant, two different transforms built into a single one. OCL2 was something else entirely. At some point the original OCL2 became outdated: OCL4-High was faster and had a much better limit, so the OCL4 transform was split and renamed. OCL4-High became OCL2 and OCL4-Low became OCL4. | |
|
|
Can you tell fmai, avxi, sse4i and sse2i limits?
____________
| |
|
Yves Gallot Volunteer developer Project scientist
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 1,728
|
Can you tell fmai, avxi, sse4i and sse2i limits?
The limits are set by round-off errors and are fuzzy. GFN-16 is close to the limit (180M or more). For GFN-17 the limit is larger than 130M and can probably be extended to 150M+. For GFN-18 it is larger than 95M, and for GFN-22/DYFL larger than 35M. | |
|
|
Can you tell fmai, avxi, sse4i and sse2i limits?
The limits are set by round-off errors and are fuzzy. GFN-16 is close to the limit (180M or more).
That's unfortunate and will make it harder for CPUs to be competitive on GFN16 in TdP next year :(
____________
| |
|