robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2212 ID: 126266 Credit: 7,523,618,709 RAC: 3,434,968
                               
|
https://wccftech.com/rumor-nvidia-launching-surprise-rtx-2080-ti-with-4352-cuda-cores-11gb-gddr6-vram/
Want.
____________
My lucky number 10590941048576+1 |
|
|
|
Me too, but the entire pricing structure is higher than that of the 10 series. $600 for the 2070? Yuck.
Once the Pascals are finally all gone, maybe they'll drop the prices down to something reasonable and I can finally upgrade to the 2080 Ti ($1,200 right now).
____________
Eating more cheese on Thursdays. |
|
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2212 ID: 126266 Credit: 7,523,618,709 RAC: 3,434,968
                               
|
Me too, but the entire pricing structure is higher than that of the 10 series. $600 for the 2070? Yuck.
Once the Pascals are finally all gone, maybe they'll drop the prices down to something reasonable and I can finally upgrade to the 2080 Ti ($1,200 right now).
Pricey, agreed. $1,200 is a bit too much. But very powerful ;)
____________
My lucky number 10590941048576+1 |
|
|
|
Looks like the non-Founders Edition pricing is a bit better: $499/$699/$999 for the 2070/2080/2080 Ti. But we shall certainly see what crops up.
____________
Eating more cheese on Thursdays. |
|
|
|
RTX-ops and TIPS vs. 'real' teraflops. I'm kinda lost. I believe I read 15ish Tflops.
Nvidia throwing a smokescreen? |
|
|
Rafael Volunteer tester
 Send message
Joined: 22 Oct 14 Posts: 911 ID: 370496 Credit: 551,549,092 RAC: 448,706
                         
|
For our GPU devs out there, could the Tensor cores be used for PrimeGrid? Obviously code would have to be written to take advantage of them, but outside of that, is there a way to leverage them for either of our current projects? |
|
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2212 ID: 126266 Credit: 7,523,618,709 RAC: 3,434,968
                               
|
RTX-ops and TIPS vs. 'real' teraflops. I'm kinda lost. I believe I read 15ish Tflops.
Nvidia throwing a smokescreen?
13.1 TFLOPS (Rumored)
~16 TFLOPS (Expected)
Don't know what the others mean.
____________
My lucky number 10590941048576+1 |
|
|
Rafael Volunteer tester
 Send message
Joined: 22 Oct 14 Posts: 911 ID: 370496 Credit: 551,549,092 RAC: 448,706
                         
|
RTX-ops and TIPS vs. 'real' teraflops. I'm kinda lost. I believe I read 15ish Tflops.
Nvidia throwing a smokescreen?
13.1 TFLOPS (Rumored)
~16 TFLOPS (Expected)
Don't know what the others mean.
Probably the rumored official spec is 13.1 TFLOPS, but taking the card's auto-OC / boost clocks into account, one would expect ~16 TFLOPS instead. |
|
|
|
The RTX TIPS stuff is just a "new math" marketing performance term since they're leveraging new chip hardware elements and changing the dynamics of GPU gaming work, not unlike 8800GTX and CUDA (as they so nicely pointed out in their presentation). The INT4/8 stuff precludes the use of "FLOPS" since they don't do float, thus the old IPS term was brought out.
Once real reviews come out next month, we'll get a better idea of what the true non-RT performance really is, especially compared to Pascal. The Anandtech review will most certainly include many compute comparisons.
____________
Eating more cheese on Thursdays. |
|
|
|
Separation of the shader engine, RT cores, and Tensor cores, each with different measurements. FLOPS, TIPS, TOPS... aaargh.
Hoping for a technical deep dive from reviewers once the product line releases. I'm not really knowledgeable as of now.
|
|
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2645 ID: 29980 Credit: 568,565,361 RAC: 198
                              
|
It'll take someone smarter than me to figure out if any of it is useful beyond the traditional stuff. A lot of the prime number related stuff uses mid to high precision number formats. To get the astronomical rates used in deep learning and similar, they're very low precision so you might as well use a random number generator.
Also the cards are out the day after I go away on a 2 week+ work trip. Earliest I'd think about buying will be on my return, by which time I'd expect more in depth reviews to be available. |
|
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 2,326

|
2080 Ti / 1080 Ti
# cores: 4352 / 3584 => + 21%
freq (boost): ~1600 / ~1600 MHz
Mem bandwidth: 616 / 484 GB/s => + 27%
Genefer's basic type is Int32 (TIPS).
Is it possible to use the concurrent FP32 units and tap a part of the ~16 TFLOPS? That will not be easy!
FP16 and Int8 are definitely too small.
Then we can expect about +25% compared to the 10 series. |
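As a quick sanity check on those ratios, here is a tiny Python sketch (numbers taken straight from the lines above; the "+25%" is just a rough middle ground between the core-count and bandwidth gains):

```python
# Rough ratio check for the 2080 Ti vs 1080 Ti spec comparison above.
cores_new, cores_old = 4352, 3584        # CUDA cores
bw_new, bw_old = 616, 484                # memory bandwidth, GB/s

print(f"cores:     +{cores_new / cores_old - 1:.0%}")   # ~ +21%
print(f"bandwidth: +{bw_new / bw_old - 1:.0%}")         # ~ +27%
```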
|
|
Monkeydee Volunteer tester
 Send message
Joined: 8 Dec 13 Posts: 540 ID: 284516 Credit: 1,531,877,060 RAC: 820,374
                            
|
Pre-orders on at Newegg.
$1100CAD for the RTX 2080
$1600CAD for the RTX 2080ti
Those are a firm pass from me. The 2060 could be a contender though depending on TDP. The RTX 2070 with a TDP of 175W is a little high for my taste.
____________
My Primes
Badge Score: 4*2 + 6*2 + 7*3 + 8*10 + 11*3 + 12*1 = 166
|
|
|
|
7 more days till they come out for everyone's testing. Anxiously awaiting reviews and performance findings.
Has any cruncher on here pre-ordered? |
|
|
|
I'll be getting a 2080 Ti once the next stock comes in: an Asus or eVGA three-fan model if they aren't price-gouged, and a stock FE card if they are.
Looking back, I wish I had preordered during the launch event when I had one sitting in the NV store cart. Those prices aren't going away, but I also wasn't confident about how much OT I was going to be able to score at work to pay for the over-my-budget amount. Turns out, I've made enough extra to cover a whole card and I can add to my Cascade Lake upgrade fund over the next few months (we're having a slight employee shortage). Curse that hindsight!
Also, architectural information was released today. Looks like for every FP32 unit there's also an INT32 unit, which was also the case with Volta, but now is being mainstreamed. I know PG primarily uses FP-heavy computations, but are there any INT operations that could be run on the side with a CUDA 10-aware recompile? And speaking of FP: FP64 is still at 1/32 so that shiny new 2080ti is only 2x a GTX580 and half anything from AMD's Tahiti generation. According to Nvidia, apparently FP64 is "legacy" now; I thought more computational precision was the future?
____________
Eating more cheese on Thursdays. |
|
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2645 ID: 29980 Credit: 568,565,361 RAC: 198
                              
|
We need someone better at maths and/or programming to comment on the potential.
With my limited knowledge, some parts of genefer can make good use of FP32 performance. FP64 has been in decline on consumer GPUs for a long time and I wouldn't be too surprised to see it go. That's not to say there isn't a need for it, and I'm sure there are "professional" solutions that offer it, at a price. At least on that side we still have Intel leading the way, most recently with AVX-512 offering a lot of FP64 potential, but not necessarily the RAM bandwidth to feed it.
As for the integer part, that's more useful for sieve isn't it? I don't know how AP works but maybe it could help there also.
I don't know if it would be possible to run a (sieve or AP)+genefer in parallel, or if it would take a dedicated client to implement both to extract performance from this. |
|
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 2,326

|
Also, architectural information was released today. Looks like for every FP32 unit there's also an INT32 unit, which was also the case with Volta, but now is being mainstreamed. I know PG primarily uses FP-heavy computations, but are there any INT operations that could be run on the side with a CUDA 10-aware recompile? And speaking of FP: FP64 is still at 1/32 so that shiny new 2080ti is only 2x a GTX580 and half anything from AMD's Tahiti generation. According to Nvidia, apparently FP64 is "legacy" now; I thought more computational precision was the future?
genefer doesn't use FP32 instructions.
The ocl transform uses FP64 and may run faster with an INT32 unit for address computation.
ocl2, ocl3, ocl4 and ocl5 are number-theoretic transforms (NTT) and use INT32 instructions.
A good point is the number of streaming multiprocessors:
RTX 2080 Ti: 68 vs GTX 1080 Ti: 28.
genefer uses local memory to share some data and there is one local memory per SM, so more SMs may improve parallelism.
|
|
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2645 ID: 29980 Credit: 568,565,361 RAC: 198
                              
|
genefer doesn't use FP32 instructions.
Well, I got that very wrong then :) Is there anything useful FP32 could be used for? I assume not, as otherwise it would already have been considered. |
|
|
|
genefer doesn't use FP32 instructions.
Well, I got that very wrong then :) Is there anything useful FP32 could be used for? I assume not, as otherwise it would already have been considered.
Me, too. I always assumed that OCL2+ were methods of faking FP64+ level precision on 32-bit hardware.
@Yves
I don't think it really matters whether INT32 will help OCL... at a 1/32 FP64/FP32 ratio, I can't see it making up that much of a performance gap. I do like your thought that having a greater number of smaller SMs is better than fewer larger ones because of the extra memory.
So, since genefer is INT-based, and we've been running it on FP hardware, would running it on INT32-specific cores make an improvement? Or are the FP cores on existing hardware able to run in an integer mode? (Sorry if these are silly questions; my programming experience is limited to coding HS math classroom demonstration programs from my days as a teacher, which ran great on a 486.)
____________
Eating more cheese on Thursdays. |
|
|
dthonon Volunteer tester
 Send message
Joined: 6 Dec 17 Posts: 435 ID: 957147 Credit: 1,750,052,982 RAC: 291,078
                                 
|
And the other nice side-effect is that it will push down the price on 10xx series and put quite a few second-hand cards on the market ;-) |
|
|
|
And the other nice side-effect is that it will push down the price on 10xx series and put quite a few second-hand cards on the market ;-)
Using the 1060 as an example, a dump is already in progress on eBay, with sellers listing several at a time (e.g. "five available"). The drop in the value of mining coins has probably helped in this regard as well.
I have an orphan Skylake system which is dying for a GPU, and I plan to score an MSI Gaming GTX 1060 OC 3 GB, which sells for anywhere between 140 and 160 euro (within Europe, therefore no import duties or customs). I have also seen the 1080 already under 400 euro. |
|
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2212 ID: 126266 Credit: 7,523,618,709 RAC: 3,434,968
                               
|
https://nvidianews.nvidia.com/news/new-nvidia-data-center-inference-platform-to-fuel-next-wave-of-ai-powered-services?ncid=so-lin-wnnc-58518&linkId=100000003470940
Check the specs on this bad boy! But it probably costs... a lot! :)
" it offers 65 teraflops of peak performance for FP16"
____________
My lucky number 10590941048576+1 |
|
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2645 ID: 29980 Credit: 568,565,361 RAC: 198
                              
|
To get those flops it really cuts down on the precision. If FP32 isn't used, don't expect FP16 to be much use either. |
|
|
Michael Goetz Volunteer moderator Project administrator
 Send message
Joined: 21 Jan 10 Posts: 14011 ID: 53948 Credit: 433,714,327 RAC: 818,672
                               
|
To get those flops it really cuts down on the precision. If FP32 isn't used, don't expect FP16 to be much use either.
To be clear: OCL uses fp64.
Everything else uses int32. (This is why most Nvidia cards skip OCL -- their fp64 performance is so poor that it's faster to use the nominally slower OCL4 transform with int32 instead.)
fp32 and fp16 are not used at all.
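To illustrate the trade-off being described, a minimal sketch (this is only the idea, not genefer's actual selection code; the 1/4 threshold is made up for illustration):

```python
# Sketch: why a GPU with crippled FP64 is better off with the INT32 NTT (OCL4)
# even though that transform is nominally slower on paper.
def pick_transform(fp64_to_fp32_ratio):
    """fp64_to_fp32_ratio: ~1/2 on Titan V/GP100, ~1/32 on most GeForce cards."""
    if fp64_to_fp32_ratio >= 1 / 4:      # illustrative threshold only
        return "OCL (FP64 transform)"
    return "OCL4 (INT32 NTT)"

print(pick_transform(1 / 32))   # typical GeForce -> OCL4 (INT32 NTT)
print(pick_transform(1 / 2))    # Titan V        -> OCL (FP64 transform)
```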
____________
My lucky number is 75898524288+1 |
|
|
|
Also, architectural information was released today. Looks like for every FP32 unit there's also an INT32 unit, which was also the case with Volta, but now is being mainstreamed. I know PG primarily uses FP-heavy computations, but are there any INT operations that could be run on the side with a CUDA 10-aware recompile? And speaking of FP: FP64 is still at 1/32 so that shiny new 2080ti is only 2x a GTX580 and half anything from AMD's Tahiti generation. According to Nvidia, apparently FP64 is "legacy" now; I thought more computational precision was the future?
genefer doesn't use FP32 instructions.
The ocl transform uses FP64 and may run faster with an INT32 unit for address computation.
ocl2, ocl3, ocl4 and ocl5 are number-theoretic transforms (NTT) and use INT32 instructions.
A good point is the number of streaming multiprocessors:
RTX 2080 Ti: 68 vs GTX 1080 Ti: 28.
genefer uses local memory to share some data and there is one local memory per SM, so more SMs may improve parallelism.
Yves, what you create will be the most efficient app on the BOINC platform, as we've already seen. Currently no other BOINC application uses these cards' full power.
I already pre-ordered an RTX 2080 (2944 CUDA cores, 46 SMs) that will be water-cooled for optimal clocking.
I'd say 280-300 W for the RTX 2080 and 300-350 W on the RTX 2080 Ti. Pascal and Maxwell draw around 300 W on INT32 OCL4, and Volta is a 300 W-rated GPU. |
|
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 2,326

|
Me, too. I always assumed that OCL2+ were methods of faking FP64+ level precision on 32-bit hardware.
The first version of ocl2 used fixed-point numbers but that was slower than NTT.
From a 'philosophical' point of view, it's more satisfying, since NTTs are error-free and there is no reason to use FP numbers for algebraic number theory except that computation with floats is faster.
Computations based on statistics are not pleasant for number theory.
So, since genefer is INT-based, and we've been running it on FP hardware, would running it on INT32-specific cores make an improvement? Or are the FP cores on existing hardware able to run in an integer mode?
So far each core was like a single execution unit ('like' because the true architecture is certainly more complex: see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput).
On Pascal (6.1), 128 cores are able to execute 128 FP32 inst. or 4 FP64 inst. or 128 INT32 add or 64 INT32 shift or about 40 INT32 mul, etc.
We can hope that 7.5 (Turing) will be like 7.0 (Volta): 64 cores/SM, 64 INT32 add/cycle or 64 INT32 mul/cycle!
The Turing scheduler is able to execute one INT32 and one FP32 inst per cycle. But with the current version of genefer the FP32 unit will sleep!
genefer will use a small part of the RTX GPUs: the INT32 units, data cache and shared memory of SM and the GDDR6 memory controllers. FP32 and FP64 units, RT cores, tensor cores, texture units are unused today.
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
Will it be possible to improve this? Maybe but it will not be easy! |
|
|
mackerel Volunteer tester
 Send message
Joined: 2 Oct 08 Posts: 2645 ID: 29980 Credit: 568,565,361 RAC: 198
                              
|
To get those flops it really cuts down on the precision. If FP32 isn't used, don't expect FP16 to be much use either.
To be clear: OCL uses fp64.
Everything else uses int32. (This is why most Nvidia cards skip OCL -- their fp64 performance is so poor that it's faster to use the nominally slower OCL4 transform with int32 instead.)
fp32 and fp16 are not used at all.
From my limited knowledge, I understood FP64 was used for a lot of things. I don't understand why it stops being useful as b increases in genefer and we switch to those other transforms. I assume we run out of precision to store the values needed, and any extra work to get around that becomes less efficient than other methods. I missed that the switch was to int32; I had incorrectly picked up somewhere that it was fp32.
I don't expect it to happen, but what if FP128 became a thing? Would that help? The trend with machine learning unfortunately is in the other direction. They're doing ever more operations on ever smaller data structures. |
|
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 2,326

|
From my limited knowledge, I understood FP64 was used for a lot of things. I don't understand why it stops being useful as b increases in genefer and we switch to those other transforms. I assume we run out of precision to store the values needed, and any extra work to get around that becomes less efficient than other methods. I missed that the switch was to int32; I had incorrectly picked up somewhere that it was fp32.
You're right but we don't use a single int32.
If we multiply two numbers of n 'digits' in base b (using standard grade-school multiplication), the max size of each digit of the result (before carry) is n*(b-1)^2.
FP64 precision is 53 bits, so we should have n*(b-1)^2 < 2^53. Because of statistics, we can use a larger bound and still control the round-off error.
NTT is based on the Chinese remainder theorem. We can perform some calculations modulo m_1, m_2, ..., m_k and combine them together for finding the result modulo m = m_1*m_2*...*m_k. If m > n*(b-1)^2, the result is correct.
ocl4 uses two 31-bit primes m_1 and m_2, ocl5: two 32-bit primes, ocl3: one 64-bit prime and ocl2 three 31-bit primes.
I don't expect it to happen, but what if FP128 became a thing? Would that help?
Yes, very much! The quadruple precision is 113 bits. The x87 version of the transform uses the old fp80 unit and precision is 64 bits. It was efficient on processors without AVX. |
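To put numbers on that bound, here is a small Python sketch (my own illustration, not genefer code): the convolution length n and the two NTT moduli below are placeholders, not the values genefer actually uses.

```python
from math import isqrt, prod

def max_base(n, limit):
    """Largest base b such that n * (b - 1)**2 < limit."""
    return isqrt((limit - 1) // n) + 1

n = 2**23                                   # hypothetical convolution length
fp64_limit = 2**53                          # strict FP64 bound (the statistical bound is looser)
ntt_limit = prod([998244353, 2013265921])   # two illustrative ~30-31 bit NTT primes

print(max_base(n, fp64_limit))   # ~32768: max base for a plain FP64 transform
print(max_base(n, ntt_limit))    # ~490000: max base modulo m_1*m_2 via the CRT
```

The point is visible immediately: combining two ~31-bit moduli via the Chinese remainder theorem gives a much larger working modulus than a single FP64 mantissa, so much larger bases b fit for the same transform length.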
|
|
|
Yves, that was very enlightening, thank you!
____________
Eating more cheese on Thursdays. |
|
|
|
Reviews are in and now we know why Nvidia was acting so weird. Not a single reviewer is recommending the 2080 or 2080 ti at this time. |
|
|
|
Reviews are in and now we know why Nvidia was acting so weird. Not a single reviewer is recommending the 2080 or 2080 ti at this time.
It's that darn price, and all the fancy new features aren't ready yet, nor will they be available to anyone not running the latest Win 10 version. Considering Nvidia's NDA over this launch is pretty extreme, I wonder what many of those reviewers really wanted to say. I'm eagerly awaiting the HardOCP review, since they refuse to bow to Nvidia's shenanigans and I enjoy their honesty.
And because it is the highest performance for PG and gaming, yes, I'm still buying one, and then seeing if I can sweep up some super-cheap 1080 Tis, too.
____________
Eating more cheese on Thursdays. |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
MSI RTX 2080 Sea Hawk X
(GPU 1935 Mhz, temp ~50C, fan speed ~1400 RPM)
PPS Sieve: 151 - 153 s
PPS Sieve x2: 218 s for both, 109 s each
AP27: 750 s (12m30s), twice as fast compared to a GTX 1070.
GFN15: 72-75 s
GFN16: 162-163 s
GFN17Low: 355 s; 6m
GFN17MEGA: 400 s; 6m40s (GPU load ~64%)
GFN18: 750 s (GPU load ~68%)
GFN19: 2214 s; 37m (GPU load ~79%), twice as fast compared to a GTX 1070.
GFN20: 7210 s; 120m (GPU load ~89%), OCL4
GFN21: 6h 55m (GPU load ~94%), OCL
____________
My stats |
|
|
Azmodes Volunteer tester
 Send message
Joined: 30 Dec 16 Posts: 184 ID: 479275 Credit: 2,197,541,354 RAC: 340
                       
|
PPS Sieve: 151 - 153 secs.
I'm assuming that's for one task at a time? Could you perhaps post the duration for two simultaneously? I've found that it noticeably increases throughput on every single card I've tried it with. Depending on the card, the improvement varies from 5% up to 50+%.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives + |
|
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2212 ID: 126266 Credit: 7,523,618,709 RAC: 3,434,968
                               
|
MSI RTX 2080 Sea Hawk X
(GPU 1935 Mhz, temp ~50C, fan speed ~1400 RPM)
PPS Sieve: 151 - 153 secs.
AP27: 750 sec (12m30s)
GFN15: 72-75 s
GFN16: 162-163 s
GFN17Low: 355 s; 6m
GFN17MEGA: 400 secs.; 6m40s
GFN18: 750 s (GPU load ~68%)
GFN19: 2214 s ; 37m (GPU load ~79%)
Wow, that IS quick. 👍
____________
My lucky number 10590941048576+1 |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
I'm assuming that's for one task at a time? Could you perhaps post the duration for two simultaneously? I've found that it noticeably increases throughput on every single card I've tried it with. Depending on the card, improvement varies from 5% to up to 50+%.
Yes, that was for one task.
Two simultaneously: 218 s for both, that's 109 each.
____________
My stats |
|
|
Azmodes Volunteer tester
 Send message
Joined: 30 Dec 16 Posts: 184 ID: 479275 Credit: 2,197,541,354 RAC: 340
                       
|
Thanks. Damn, that's over 5 million credit per day.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives + |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
Thanks. Damn, that's over 5 million credit per day.
No, half that.
3600*24 / 109 = ~800 WU per day.
Each is 3371 credit, so ~2.6M credit per day.
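For anyone redoing this kind of estimate, the same arithmetic in Python (the 109 s and 3371-credit figures are from the posts above):

```python
# Back-of-envelope PPS Sieve throughput with two tasks running at once.
seconds_per_task = 109     # effective per-task time when doubling up
credit_per_task = 3371

tasks_per_day = 24 * 3600 / seconds_per_task
print(round(tasks_per_day))                       # ~793 tasks per day
print(round(tasks_per_day * credit_per_task))     # ~2.67 million credit per day
```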
____________
My stats |
|
|
Azmodes Volunteer tester
 Send message
Joined: 30 Dec 16 Posts: 184 ID: 479275 Credit: 2,197,541,354 RAC: 340
                       
|
Of course. My bad, multiplied by 2 once too often. :P
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives + |
|
|
Dad Send message
Joined: 28 Feb 18 Posts: 284 ID: 984171 Credit: 182,080,291 RAC: 0
                 
|
GFN-17MEGA is only 1 minute faster than my 1070ti
____________
Tonight's lucky numbers are
555*2^3563328+1 (PPS-MEGA)
and
58523466^131072+1 (GFN-17 MEGA) |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
Yes, it's not much more.
Or put another way:
1070 Ti should do around 185 GFN17Mega tasks per day
2080 should do around 227 GFN17Mega tasks per day.
Thus, it will do ~22% more work.
Note that the card has relatively low GPU load, even on my i7 8700K.
I'm considering doing some GFN tests with 2 WUs running, like PPS Sieve...
____________
My stats |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
Preliminary tests are complete.
MSI RTX 2080 Sea Hawk X
(GPU 1935 Mhz, temp ~50C, fan speed ~1400 RPM)
PPS Sieve: 151 - 153 s
PPS Sieve x2: 218 s for both, 109 s each
AP27: 750 s (12m30s), twice as fast compared to a GTX 1070.
GFN15: 72-75 s
GFN16: 162-163 s
GFN17Low: 355 s; 6m
GFN17MEGA: 400 s; 6m40s
GFN18: 750 s (GPU load ~68%)
GFN19: 2214 s; 37m (GPU load ~79%), twice as fast compared to a GTX 1070.
GFN20: 7210 s; 120m (GPU load ~89%), OCL4
GFN21: 6h 55m (GPU load ~94%), OCL
For comparison, GFN21 times on other cards:
GTX 1060: 20 h
GTX 1070: 14 h
GTX 1080: 12 h
GTX 1080 Ti: 8 h 30 m
TITAN V: 4 h
____________
My stats |
|
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 2,326

|
Preliminary tests are complete.
Thanks for these results.
Cuda 10 (and its documentation) is available then we know more about Turing.
Turing is a Volta with a FP64/FP32 ratio of 1/32 (vs 1/2).
First a comparison (GFN21):
RTX 2080 (OC): 2944 cores @ 1860MHz, Mem 256-bit @ 14 Gbps 7 h
GTX 1080: 2560 cores @ 1733MHz, Mem 256-bit @ 10 Gbps 13 h
GTX 1080 (OC): 2560 cores @ 1822MHz, Mem 256-bit @ 10 Gbps 12 h
13/7 = 1.85 and (2944*1860)/(2560*1733) = 1.23; 1.85/1.23 = 1.5 => +50% per core
12/7 = 1.71 and (2944*1860)/(2560*1822) = 1.174; 1.71/1.174 = 1.46 => +46% per core
With Turing, the speed of each core (at the same frequency) has increased by about 50 percent!
Some reasons can be given:
- the number of compute units (SMs) has doubled and the amount of static shared memory per multiprocessor is the same, so the total shared memory has doubled.
- the 32-bit integer multiply used to be a multi-instruction sequence (about 3 simple instructions) and is now a single instruction.
- faster memory, larger L2 and L1 caches.
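The same per-core comparison as a quick Python check (runtimes and specs straight from the lines above):

```python
# Normalize the GFN21 speedup by the cores*clock ratio to get the per-core gain.
def per_core_gain(t_old_h, t_new_h, cores_old, mhz_old, cores_new, mhz_new):
    speedup = t_old_h / t_new_h                            # overall speedup
    raw = (cores_new * mhz_new) / (cores_old * mhz_old)    # naive cores*clock ratio
    return speedup / raw

print(per_core_gain(13, 7, 2560, 1733, 2944, 1860))   # ~1.50 vs GTX 1080
print(per_core_gain(12, 7, 2560, 1822, 2944, 1860))   # ~1.46 vs GTX 1080 (OC)
```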
|
|
|
LookAS Volunteer tester Send message
Joined: 19 Apr 08 Posts: 38 ID: 21649 Credit: 354,890,618 RAC: 0
                      
|
RTX 2080 Ti quick Genefer benchmark.
The binary should be version 3.3.4; it's the last test on the page:
https://pctuning.tyden.cz/hardware/graficke-karty/53971-nvidia-rtx-2080-ti-vykon-v-novych-hrach-a-aplikacich?start=7
|
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
For those curious about the RTX 2080 Ti, there is a quick benchmark there.
I should post a similar one for the RTX 2080. The times I posted were on live WUs, whose b values are close to the benchmark candidates (live vs benchmark):
GFN21: 268524 vs 270000
GFN20: 1028668 vs 1100000
GFN19: 2365858 vs 2500000
GFN18: 5792500 vs 6000000
____________
My stats |
|
|
Yves Gallot Volunteer developer Project scientist Send message
Joined: 19 Aug 12 Posts: 820 ID: 164101 Credit: 305,989,513 RAC: 2,326

|
GFN20 1028668 vs 1100000
For a fixed exponent, the runtime scales roughly with log(b), so the runtime ratio is log(1028668)/log(1100000) ~ 0.995.
The error is negligible.
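As a one-line check (the scaling with log(b) is only approximate, since the transform size is the same for both candidates):

```python
from math import log

b_live, b_bench = 1028668, 1100000
print(log(b_live) / log(b_bench))   # ~0.995: expected GFN20 runtime ratio
```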
|
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
The runtime ratio is log(1028668)/log(1100000) ~ 0.995.
The error is negligible.
Yes, it's a close one, and GPU/RAM clocks make a bigger difference.
I still wanted to check whether the same transform is used.
I also ran a live test to see real GPU usage, temps, etc., which are not obvious from the benchmark.
____________
My stats |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
Genefer 3.3.4 on RTX 2080
Running on platform 'NVIDIA CUDA', device 'GeForce RTX 2080', vendor 'NVIDIA Corporation', version 'OpenCL 1.2 CUDA' and driver '411.63'.
46 computeUnits @ 1860MHz, memSize=8192MB, cacheSize=736kB, cacheLineSize=128B, localMemSize=48kB, maxWorkGroupSize=1024.
High priority change succeeded.
Generalized Fermat Prime Search benchmarks
100000000^32768+1 262145 digits OCL2 Estimated time: 0:01:02
50000000^65536+1 504560 digits OCL2 Estimated time: 0:02:15
15000000^131072+1 940585 digits OCL2 Estimated time: 0:05:13
50000000^131072+1 1009120 digits OCL2 Estimated time: 0:05:33
6000000^262144+1 1776852 digits OCL2 Estimated time: 0:18:00
2500000^524288+1 3354364 digits OCL5 Estimated time: 0:33:30
1100000^1048576+1 6334860 digits OCL4 Estimated time: 1:53:00
270000^2097152+1 11390396 digits OCL4 Estimated time: 6:54:00
130000^4194304+1 21449434 digits OCL4 Estimated time: 25:40:00
Those benchmarks are with the CPU cores idle.
I mentioned GFN17MEGA taking around 400 s, which lowered to 385 s while running CUL LLR on 5 of 6 CPU cores.
Today I switched to GCW Sieve on 5 cores and GFN17MEGA times went down to 365 s.
Still a bit longer than the benchmark times, but getting there.
Anyway, for the lower GFNs, let's say up to GFN17Mega, there is not much gain for the RTX 2080 Ti.
____________
My stats |
|
|
|
http://www.primegrid.com/result.php?resultid=936092604 |
|
|
robish Volunteer moderator Volunteer tester
 Send message
Joined: 7 Jan 12 Posts: 2212 ID: 126266 Credit: 7,523,618,709 RAC: 3,434,968
                               
|
http://www.primegrid.com/result.php?resultid=936092604
Wow, that IS awesomely quick! Would love to see results from GFN20+.
Jealous ;)
____________
My lucky number 10590941048576+1 |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
This is half the time of the RTX 2080. Hmm, how come?
EDIT: OCL5 vs OCL3
Yeah, please run the Genefer benchmark...
____________
My stats |
|
|
|
GFN20:4686sec
http://www.primegrid.com/result.php?resultid=935996124
GFN21:17038sec
http://www.primegrid.com/result.php?resultid=936609427
GFN22:64936sec
http://www.primegrid.com/result.php?resultid=936694893 |
|
|
Honza Volunteer moderator Volunteer tester Project scientist Send message
Joined: 15 Aug 05 Posts: 1957 ID: 352 Credit: 6,144,688,936 RAC: 2,290,622
                                      
|
GFN22:64936sec
http://www.primegrid.com/result.php?resultid=936694893
Cool, the other result is from a Tesla V100-SXM2-16GB at ~12 hours.
RTX 2080 Ti: ~18 hours.
I was wondering how long GFN22 tasks took when we began, back when it was the WR candidate search.
For example, in 2012, when GFN22 became the World Record search, a task was half the credit it is now, so about half the computational demand.
Times were about 10x what we can achieve today with an RTX 2080 or Tesla.
____________
My stats |
|
|