PrimeGrid
11) Message boards : Sieving : ppsieve/tpsieve CUDA testing (Message 24176)
Posted 4863 days ago by Redstar3894 (Project donor)
Win7 64
i7-920 w/6GB RAM
GTX 260 Core 216 (Factory OC)
Driver 197.45
BOINC running CPU tasks only


Here are two tests I ran using Ken's latest build...with the standard cudart.dll, as agreed on earlier....

D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 20000
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 20000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 0 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 0.71 sec. (0.05 init + 0.67 sieve) at 45328606 p/sec.
Processor time: 0.31 sec. (0.06 init + 0.25 sieve) at 120778519 p/sec.
Average processor utilization: 1.39 (init), 0.38 (sieve)

And the second one...
D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 4.31 sec. (0.05 init + 4.27 sieve) at 737011 p/sec.
Processor time: 0.37 sec. (0.08 init + 0.30 sieve) at 10613046 p/sec.
Average processor utilization: 1.73 (init), 0.07 (sieve)
12) Message boards : Sieving : ppsieve/tpsieve CUDA testing (Message 24114)
Posted 4865 days ago by Redstar3894 (Project donor)
OK, I had thought it was running over 1MP/s; it was just 1KP/s. I think something may be wrong with my sleep timing. I'll look into it and get back to you.

I did notice (through watching GPU usage on EVGA Precision) that the GPU usage never stayed constant...it would spike for a second or two to around 75% and then fall to zero for about 10-20 seconds....

Hope that helps!

And BTW, thanks Ken for building a Windows version! It seems like it has some more ground to cover to catch up with the Linux builds, but great job nonetheless!
13) Message boards : Sieving : ppsieve/tpsieve CUDA testing (Message 24112)
Posted 4865 days ago by Redstar3894 (Project donor)
Here's my result from the shorter test Scott posted above:

D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070003e6 -k 1201 -K 9999 -N 2000000 -z normal
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070003000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
42070000070587 | 9475*2^197534+1
42070000198537 | 3373*2^1046686+1
42070000300049 | 9139*2^461846+1
42070000345343 | 1715*2^635711+1
42070000464001 | 4179*2^1577462+1
42070000949861 | 4707*2^571847+1
42070001011573 | 7113*2^215532+1
42070001040127 | 6471*2^37907+1
42070002482267 | 9951*2^1920408+1
42070002690167 | 2553*2^1888870+1
42070002698543 | 4239*2^368773+1
42070002875941 | 4081*2^1494668+1
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070003000000
Found 12 factors
count=95668,sum=0x37dacb7121ccffe4
Elapsed time: 11.19 sec. (0.05 init + 11.14 sieve) at 282411 p/sec.
Processor time: 4.18 sec. (0.06 init + 4.12 sieve) at 763818 p/sec.
Average processor utilization: 1.30 (init), 0.37 (sieve)
14) Message boards : Sieving : ppsieve/tpsieve CUDA testing (Message 24111)
Posted 4865 days ago by Redstar3894 (Project donor)
i7-920 @ 2.8 GHz
6GB RAM
Win7-64
GTX 260 Core 216 (Factory OC)
BOINC suspended for all tests


D:\Patrick\ppsieve-cuda-vc\Release>ppsieve-cuda.exe -p42070e9 -P42070030e6 -k 1201 -K 9999 -N 2000000 -z normal -q
ppsieve version cuda-0.1.1 (testing)
nstart=76, nstep=32, gpu_nstep=32
ppsieve initialized: 1201 <= k <= 9999, 76 <= n <= 2000000
Sieve started: 42070000000000 <= p < 42070030000000
Thread 0 starting
Detected GPU 0: GeForce GTX 260
Detected compute capability: 1.3
Detected 27 multiprocessors.
p=42070030146561, 17.48K p/sec, 0.03 CPU cores, 100.5% done. ETA 03 Jun 21:40
Thread 0 completed
Waiting for threads to exit
Sieve complete: 42070000000000 <= p < 42070030000000
Found 97 factors
count=955289,sum=0x2dbc17167afb6a8d
Elapsed time: 842.70 sec. (0.03 init + 842.66 sieve) at 35775 p/sec.
Processor time: 37.74 sec. (0.03 init + 37.71 sieve) at 799528 p/sec.
Average processor utilization: 0.97 (init), 0.04 (sieve)


I swiped the cudart.dll from Collatz...

I am at home now, but tomorrow I can test it on 32-bit systems with various CUDA cards (9600 GSO, 9600 GS, 8600 GT, 8400 GS, 8300 GS). Might try my laptop's 8400M GS tonight...is there a minimum memory requirement?

I am also curious as to which version of the CUDA SDK this was compiled with...newer versions will run considerably faster on newer cards as well as include increased capabilities (double precision, anyone?)

I'll try getting the cudart.dll from a project like GPUGrid or Milkyway, which both use at least CUDA 2.2 (due to double precision support) and see what, if any, difference that makes...

EDIT: Whoa, put my foot in my mouth a bit there...Collatz would use 2.2...my bad... :p
And also, when I switched the cudart.dll with the one from GPUGrid, it made NO difference whatsoever...
15) Message boards : Number crunching : Alternative Platforms (Message 24098)
Posted 4865 days ago by Redstar3894 (Project donor)
Some of the projects have workunits that take a long time on PPC. Workunits on GCW shouldn't take quite as long now that the base 13 tests are done. They will probably vary from five to ten hours, possibly longer, on your computer. I would recommend against SGS because the large k will make phrot inefficient. 27121, PPSE, and GFN65536 are also good choices, even though 27121 has longer workunits.

As for GFN32768, forget about it. The bases are too large for genefer and will trigger round-off errors. Since genefer80 can still handle the larger bases, I would avoid it until the bases get too large for genefer80. The residues from genefer and phrot/llr/pfgw are not compatible, so unless phrot were to find a PRP, the test would be wasted (unless the server is not configured to do double-checks).


Hmmmm....I haven't tried GCW in a while....maybe I'll add that to the rotation and see how it runs...

But yeah, I've more or less written off SGS and GFN32768 due to the above reasons...probably going to add 27121 to that list as well...

I've been looking over the programmer's notes for the PowerPC 970FX, and based on what I've learned and my (limited) knowledge of programming, it doesn't seem that there will be any easy way to get similar extended precision (i.e. Genefer80) running on the PPC, since (as you know) the 80-bit floating-point format is unique to the x87 instruction set and therefore to the x86 architecture...though since the PPC 970 does have 2 FPUs as well as the AltiVec units, something could possibly be done with those....or not...again, forgive my interpretations if I'm way off; some of this is way over my head...

-ftree-loop-distribution is not a compiler option on MacPPC. I still haven't had time to work on this issue. The reason the other one built with -O3 is that you also had -O2, which overrode the -O3.

Interesting, so the MacPPC version does not include the Graphite loop optimizations....
Also, I was under the impression that since I specified:
-Wl
all options following it would be passed to the linker...so that's why I placed -O2 there....but again, I'm probably wrong.... Thanks for being patient here....

I did a quick test and see that -funsafe-math-optimizations, which is set by -ffast-math is the problem on MacPPC. Using -ffast-math with -fno-unsafe-math-optimizations works, albeit not much faster than just -O3 without -ffast-math. I might be able to re-arrange parts of the code so that I can use -ffast-math, but I don't know yet. Here is gcc's take on -ffast-math:

This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

I knew I forgot something...I'll see if -ffast-math with -fno-unsafe-math-optimizations makes any difference, and I'll play around with the Graphite loop optimizations, because it seemed like those helped phrot (not to mention allowing genefer to compile with -O3)...I'll try to read up some more on all this and see if I can take a look at moving things around in the code as well
16) Message boards : Number crunching : Alternative Platforms (Message 24094)
Posted 4865 days ago by Redstar3894 (Project donor)
Okay, sorry it's been a while, though I doubt anyone was holding their breath... ;) I got a new Nexus One and I've been enjoying all the Android splendor so I've been otherwise occupied for the last week or so with that to distract me from PG...


Here's an update on how PRPNet has been going on my G5 (single processor 1.8 GHz w/1.2G RAM running Gentoo with kernel 2.6.32) :


I've been able to get phrot tuned pretty well for PPSE10 and 11k; 10k work units take ~94 sec and 11k's take ~105 sec

PPSE n>450K work units, on the other hand, take around 2000 sec on average...

ESP works, but it takes several hours (I haven't kept track of exactly how long, but I believe it's around 6-8)...so I have that set to a low % in prpclient.ini...

Same for SGS, though I ended up commenting that line out because it was taking about 10-12 hours/WU to complete

27121 and GCW13 are also working, but I never completed a WU from either of those projects; I aborted both tasks after about 24 hours (tried 27121 first, then GCW) because they were less than half complete

But then, after I got my Nexus One and was thus otherwise occupied, I left the new 121 project crunching, and it finished the first WU after about 36 hours....so I would assume that 27121 and GCW would take about that same amount of time (+ or - 2 hours)...still, the credit for 121 seems fairly high for one work unit...I would be interested in finding out roughly how long it's taking others to finish a WU on x86....

So I'm not sure if the extremely long run-times on the larger numbers are due to my hardware limitations (probably) or just due to the way I optimized phrot...I'll have to play around with optimizations a little more and see what GCC can come up with...my problem is that I'm not willing to wait long enough (damn ADD rears its ugly head) for the longer test units to finish running, so most of my tests were on the (relatively) smaller PPSE10 and 11k WU's....but now I have the new phone to distract me while it is running these tests, so I should be good ;)


Genefer seems to be working very well on GFN65536, but on GFN32768 it detects a roundoff error almost immediately and finishes with phrot (similar to what happens on x86 windows, but the x86 windows version then has the option to usually finish with the 80-bit version of genefer)


Mark: in regards to your above post I was able to successfully compile genefer with "-O3", but it only seemed to work correctly (producing correct residues) with the following CFLAG added on:
-ftree-loop-distribution
Again, I'm not quite sure why this is the case, but it is working correctly when compiled with that CFLAG in addition to the others:
-O3 -mcpu=970 -mtune=970 -maltivec -pipe -Wl,-lm -O2
It does provide a slight but noticeable performance benefit (at least in the built-in benchmarks) when running Genefer


I should have some free time either tomorrow night or a couple days next week, during which I should be able to start working on test packages of PRPNet for LinuxPPC (32- and 64-bit), which, once completed, could hopefully be included on the PRPNet download page! :)
17) Message boards : Number crunching : Alternative Platforms (Message 23932)
Posted 4878 days ago by Redstar3894 (Project donor)
Did you run genefer with the switch that verifies residues? I suggest that you read this thread: http://www.primegrid.com/forum_thread.php?id=1800 regarding genefer compiler options. -O3 causes problems. I haven't looked into it yet. -ffast-math should work, but it doesn't. I really need to spend some time on it. I should be able to do that Memorial Day weekend.


I discovered this problem later, (though I caught it before I was sent any GFN workunits)...It's a shame, because -O3 and -ffast-math make it SO much faster....

I should have some time to kill this weekend, so maybe I'll try building genefer while selectively enabling the subflags of -O3 and -ffast-math. From what I read on that thread, it seems like one or two of the flags are the culprit...probably "-fassociative-math" and "-funsafe-math-optimizations" need to be disabled (via -fno-associative-math and -fno-unsafe-math-optimizations)...at least on the PS3 (and yet to be determined on the G5)

From the GCC Manual:
-fassociative-math
Allow re-association of operands in series of floating-point operations. This violates the ISO C and C++ language standard by possibly changing computation result. NOTE: re-ordering may change the sign of zero as well as ignore NaNs and inhibit or create underflow or overflow (and thus cannot be used on a code which relies on rounding behavior like (x + 2**52) - 2**52). May also reorder floating-point comparisons and thus may not be used when ordered comparisons are required. This option requires that both -fno-signed-zeros and -fno-trapping-math be in effect. Moreover, it doesn't make much sense with -frounding-math.

-funsafe-math-optimizations
Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.
This option is not turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications. Enables -fno-signed-zeros, -fno-trapping-math, -fassociative-math and -freciprocal-math.


I would try taking a look through the code itself, but as I said before, my knowledge of C is rudimentary at best, so maybe if I'm feeling adventurous later...I'll post the results here if I find some problems/answers... ;)

In contrast, as you said in the other thread http://www.primegrid.com/forum_thread.php?id=1800&nowrap=true#22784, phrot seems to be running VERY well when compiled with -O3 (though not -ffast-math)....I'm getting ~94s per PPSE10k unit, ~100s per PPSE11k unit, and ~2100s per PPSE unit...

I also was able to gain ~2s per PPSE10/11k WU by compiling phrot with -floop-block and -ftree-loop-distribution (I'm using GCC 4.4.3), so I'm not sure if that is just chance or whether the Graphite/PPL loop optimizations actually benefit the code....


I didn't launch this effort with any major expectations to speak of, so I'm very satisfied with the results and if I were able to get genefer fully optimized, that would just be the icing on the cake... :)


Rogue, seriously, thank you so much for all your help and patience with me here, I really appreciate it!
18) Message boards : Number crunching : Alternative Platforms (Message 23918)
Posted 4879 days ago by Redstar3894 (Project donor)

There is a BIG difference between -funroll-loops and UNROLLED_MR. -funroll-loops is a compiler optimization that can take relatively simple loops (a few lines of C) and unroll them. UNROLLED_MR is a coded optimization that takes some fairly complex loops and unrolls them.

I went in and added another #ifdef to phrot.c, specifically ppc64:
#if defined(__ppc__) || defined(__ppc64__)
#define UNROLLED_MR
#endif


I think the results, compared to my earlier run, speak for themselves:
AzureDragon phrot # ./phrot.g5 -d -q"7843*2^134274+1"
Phil Carmody's Phrot (0.72)
Input 7843*2^134274+1 : Actually testing 128499712*1048576^6713+1 (witness=3 6715/14336 limbs)
7843*2^134274+1 [-223680,289213,462807,360724] is composite LLR64=b72949bc5ffc727a. (e=0.03516 (0.0587399~3.90193e-16@0.000) t=91.61s)


Whereas in my earlier test, when UNROLLED_MR was probably NOT defined, I got these results:
7843*2^134274+1 is composite LLR64=b72949bc5ffc727a. (e=0.03516 (0.0587399~3.90193e-16@0.000) t=108.05s )
[2010-05-19 20:17:05 GMT] PPSE10k: 7843*2^134274+1 is not prime. Residue b72949bc5ffc727a


It really is amazing how that makes such a difference....I'll check to see if I can tweak it a little more and will update as time permits
19) Message boards : Number crunching : Alternative Platforms (Message 23916)
Posted 4879 days ago by Redstar3894 (Project donor)
There is a BIG difference between -funroll-loops and UNROLLED_MR. -funroll-loops is a compiler optimization that can take relatively simple loops (a few lines of C) and unroll them. UNROLLED_MR is a coded optimization that takes some fairly complex loops and unrolls them.


I had a feeling it was something along those lines....I'll have to take a look to see if that was explicitly defined when I built phrot....
(Again, I apologize for my relative ignorance when it comes to these things...)



In other news, I have succeeded in building genefer! :)

I took your suggestion and added in a new #elif with a CPU_TARGET of linux/ppc/ppc64 and that did the trick :)

I can provide a diff with my modifications if you like, or just post them here....

Also, when building genefer, I had to use the '-pipe' CFLAG; without it I kept getting error messages about temporary files....not sure if that was a general error or something I was doing incorrectly, but it works now. I've checked it against several sources and it appears to be running correctly :)

If you have any recommendations about optimizing genefer, I would greatly appreciate them. What I ended up using is
'-O3 -pipe -mcpu=970 -mtune=970 -maltivec -Wl,-lm -O2'
(where the last two are passed to the linker) In hindsight, I probably didn't need to use '-lm' since it's not linking external libraries, but it seemed to work okay....

And PRPNet has been humming along rather smoothly for the last few hours, it's not the quickest in the world (at least compared to the i7 that I'm used to), but it gets the job done... :)


So I just wanted to especially thank rogue for his support and patience in helping me figure this out! I definitely couldn't have done it without your help, Thank You! :)

Next I guess I'll try re-working some sieve applications to see if they can be built on PPC/Linux

Sieve applications do exist for PPC. Look here, http://sites.google.com/site/geoffreywalterreynolds/programs/. These can all be built on PPC. RISC is faster for some things that require fused multiply-add instructions in the FPU, but the extra registers don't help as much as you would expect. This is due to the number of cycles needed for some instructions and the expense of converting between FPU and INT.

But that will probably have to wait until I have some more free time... ;)
20) Message boards : Number crunching : Alternative Platforms (Message 23905)
Posted 4880 days ago by Redstar3894 (Project donor)
UNROLLED_MR is a setting that will unroll the main loops to gain performance on CPUs with many registers. PPC is one such CPU. This is in phrot.c

#if defined(__ppc__)
#define UNROLLED_MR
#endif

so if __ppc__ is defined on your box, then it will take advantage of it.

I'll have to play around with this...because even with the modifications I made to the makefile, it should have still defined UNROLLED_MR....but wouldn't '-funroll-loops', invoked via '-O3' in gcc, do essentially the same thing?
(Again, forgive me if I'm way off here)

Because that would be great (obviously) if there were some additional optimizations I could use on phrot....I would like to stretch this G5 as far and fast as I possibly can....

The FPU to INT conversion is a problem with both FFTs and sieves, so those programs try to do as much math work as possible without doing a conversion. IIRC the conversion between the two is much more expensive on PPC than x86.

Oh well, I guess that's just the price I pay for keeping my G5 around... ;)

LLR 3.8.1 might be on par with phrot since it is using special modular reduction which George added specifically for LLR and PFGW. I haven't compared them since LLR 3.8 came out. Even if phrot is now slower than LLR for other bases, I think that it is fairly impressive since it has almost no asm in it.

With what little knowledge I have of this area, I agree: that phrot still compares very favorably to LLR in many situations/bases without much asm to speak of is very impressive....IMHO everyone involved in producing and maintaining phrot definitely deserves a round of applause for that! :)


Copyright © 2005 - 2023 Rytis Slatkevičius and PrimeGrid community.
Generated 29 Sep 2023 | 13:27:26 UTC