mersenneforum.org  

Old 2013-10-04, 19:19   #45
Bdot
 
Nov 2010
Germany

597₁₀ Posts

Quote:
Originally Posted by kracker
Sorry, triple post
some self test.
http://www.filedropper.com/output

Tough stuff.

All errors are caused by the 32- or 24-bit kernels. The 15-bit kernels have no failures in this run. However, the 32-bit kernels are much faster.

This (probably) rules out different rounding, as the 15-bit kernels rely much more on proper rounding - they require up to 22 bits per float to be accurate, whereas the 32-bit kernels only require 20.

The only other "big" difference is that the 15-bit kernels do not use mul_hi or any other 32-bit multiplication.

What next?
  • I can prepare a version that only allows 15-bit-kernels. Although slow, this one could survive a full selftest.
  • I can build a version to debug the 32-bit actions.
  • I can build the PerformanceInfo version that allows us to have detailed measurement of the speed of the different kernels - to be able to compare and see if it's worth the effort.
Which way to go?
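
For readers unfamiliar with it: mul_hi(a, b) is the OpenCL C built-in that returns the upper half of the full-width product. The plain-C sketch below is only an illustration and is not taken from mfakto; the helper names mul_hi_u32 and mul_15bit are made up here. It shows what mul_hi computes for 32-bit operands, and why operands limited to 15 bits never need it: their product always fits in 30 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: what OpenCL's mul_hi(a, b) returns for 32-bit
   unsigned operands, i.e. the upper 32 bits of the 64-bit product. */
static uint32_t mul_hi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/* Hypothetical helper: multiply two 15-bit limbs. The product is below
   2^30, so an ordinary 32-bit multiply holds it and the high half is 0. */
static uint32_t mul_15bit(uint32_t a, uint32_t b)
{
    assert(a < (1u << 15) && b < (1u << 15));
    return a * b;
}
```

For example, mul_hi_u32(0xFFFFFFFF, 0xFFFFFFFF) is 0xFFFFFFFE, while mul_15bit(0x7FFF, 0x7FFF) is 0x3FFF0001 with nothing left over for a high word.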
Old 2013-10-04, 19:27   #46
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

37·59 Posts

Quote:
Originally Posted by Bdot
What next?
  • I can prepare a version that only allows 15-bit-kernels. Although slow, this one could survive a full selftest.
  • I can build a version to debug the 32-bit actions.
  • I can build the PerformanceInfo version that allows us to have detailed measurement of the speed of the different kernels - to be able to compare and see if it's worth the effort.
Which way to go?

I don't know, maybe the last one would be best? I don't know. (Did I say that twice?)
Old 2013-10-04, 19:45   #47
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

17·487 Posts

Quote:
Originally Posted by Bdot
What next?
  • I can prepare a version that only allows 15-bit-kernels. Although slow, this one could survive a full selftest.
  • I can build a version to debug the 32-bit actions.
  • I can build the PerformanceInfo version that allows us to have detailed measurement of the speed of the different kernels - to be able to compare and see if it's worth the effort.
Which way to go?

Let's debug the 32-bit (and 24-bit?) kernels. The knowledge gained may be useful.
Old 2013-10-04, 21:37   #48
Bdot
 
Nov 2010
Germany

1001010101₂ Posts

OK, I'll prepare that.

Could you please run the quick selftest with VectorSize=2? I think I have an issue with VectorSize=1: on AMD I also see more or less "random" failures there ...
Old 2013-10-05, 01:10   #49
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

100010000111₂ Posts

The kernels don't compile with any VectorSize other than 1...

Code:
mfakto 0.14pre1-Win (64bit build)


Runtime options
  Inifile                   mfakto.ini
  Verbosity                 1
  SieveOnGPU                no
  SievePrimesMin            5000
  SievePrimesMax            200000
  SievePrimes               25000
  SievePrimesAdjust         1
  NumStreams                3
  GridSize                  4
  SieveCPUMask              0
  GPUSieveSize              64Mi bits
  WorkFile                  worktodo.txt
  ResultsFile               results.txt
  Checkpoints               enabled
  CheckpointDelay           300s
  Stages                    enabled
  StopAfterFactor           class
  PrintMode                 compact
  V5UserID                  none
  ComputerID                none
  TimeStampInResults        yes
  VectorSize                2
  GPUType                   AUTO
  SmallExp                  no
WARNING: Cannot read UseBinfile from inifile, set to 0 by default
  UseBinfile                0
Compiletime options
  SIEVE_SIZE_LIMIT          36kiB
  SIEVE_SIZE                289731bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled
Select device - Get device info - 
INFO: Device does not support out-of-order operations. Fallback to in-order queues.
Compiling kernels.
 
    BUILD OUTPUT
In file included from :90:
.\montgomery.cl:82:6: error: can't convert between vector values of different size ('ulong2' and 'long  __attribute__((ext_vector_type(2)))')
  r2 += (r1!=0)? 1UL : 0UL;
  ~~ ^  ~~~~~~~~~~~~~~~~~~

error: front end compiler failed build.
     END OF BUILD OUTPUT
Error -11 (Build program failure ): clBuildProgram
init_CL(3, 11) failed
Old 2013-10-05, 13:13   #50
Bdot
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by kracker
The kernels don't compile with any VectorSize other than 1...

Please try the GPUSieve with VectorSize=2 ... montgomery is not used there.

(I'm sorry for that painful process - maybe I need to go shopping. What CPU is that? Any current Intel i5/7?)
Old 2013-10-05, 14:29   #51
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

37·59 Posts

Quote:
Originally Posted by Bdot
Please try the GPUSieve with VectorSize=2 ... montgomery is not used there.

(I'm sorry for that painful process - maybe I need to go shopping. What CPU is that? Any current Intel i5/7?)
Will do... and don't worry.
Any Haswell (i3, i5, i7) has OpenCL 1.2, I believe.
Old 2013-10-05, 16:32   #52
Bdot
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by kracker
Will do... and don't worry.
Any Haswell (i3, i5, i7) has OpenCL 1.2, I believe.

And ... could you please try to change the offending line 82 in montgomery.cl from

r2 += (r1!=0)? 1UL : 0UL;

to

r2 += (ulong_v)((r1!=0)? 1UL : 0UL);

(with CPU sieve and VectorSize=1)
Old 2013-10-05, 23:12   #53
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

37·59 Posts

This is with CPU sieving and VectorSize=1: it works "normally" only with VectorSize=1.

edit: more later, sorry for the quick message, busy today :(

edit2: with the new montgomery.cl and VectorSize=2:
Code:
.\montgomery.cl:82:9: error: invalid conversion between ext-vector type 'ulong2' and 'long  __attribute__((ext_vector_type(2)))'
  r2 += (ulong_v)((r1!=0)? 1UL : 0UL);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Last fiddled with by kracker on 2013-10-05 at 23:59
Old 2013-10-07, 20:57   #54
Bdot
 
Nov 2010
Germany

255₁₆ Posts

Quote:
Originally Posted by kracker
This is with CPU sieving and VectorSize=1: it works "normally" only with VectorSize=1.

edit: more later, sorry for the quick message, busy today :(

edit2: with the new montgomery.cl and VectorSize=2:
Code:
.\montgomery.cl:82:9: error: invalid conversion between ext-vector type 'ulong2' and 'long  __attribute__((ext_vector_type(2)))'
  r2 += (ulong_v)((r1!=0)? 1UL : 0UL);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I must admit that I don't understand what this compiler complains about. Neither implicit nor explicit casting seems to work; it looks like a compiler bug to me.

Speaking of which, AMD compiles fine but has difficulties in evaluating simple expressions:

Code:
    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2.6: q.d4=%x, carry=%x, nn.d3=%x\n",
        q.d4, carry, nn.d3);

    q.d4 = q.d4 - nn.d3 + carry;

    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2.7: q.d4=%x, carry=%x, nn.d3=%x\n",
        q.d4, carry, nn.d3);
prints the following:
Code:
div2.6: q.d4=51ed, carry=0, nn.d3=51ed
div2.7: q.d4=394d8646, carry=0, nn.d3=51ed
So, according to Catalyst 13.9,

0x51ed - 0x51ed + 0 = 0x394d8646.

This is why all the selftests fail for me at VectorSize=1. Only at higher VectorSizes does the GPU seem to feel the need to save the calculated result to the intended place ... (I'm not sure which part of the memory is overwritten at VectorSize=1; on every run I get a different result for q.d4).
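
For reference, the traced step is easy to replay on the host. This is a minimal C sketch, assuming the fields are 32-bit unsigned values (they are printed with %x above, but their declared types are not shown in this thread); sub_step and check_traced_values are hypothetical names, not part of mfakto.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side reference for the traced statement
   q.d4 = q.d4 - nn.d3 + carry;  (32-bit unsigned fields are an assumption) */
static uint32_t sub_step(uint32_t q_d4, uint32_t nn_d3, uint32_t carry)
{
    /* the result wraps modulo 2^32, like unsigned arithmetic in OpenCL C */
    return q_d4 - nn_d3 + carry;
}

/* Replays the values from the div2.6 trace line; the result should be 0. */
static void check_traced_values(void)
{
    assert(sub_step(0x51edu, 0x51edu, 0u) == 0u);
}
```

Any value other than 0 after this step, such as the 0x394d8646 in the div2.7 line, cannot come from the arithmetic itself.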
Old 2013-10-08, 01:06   #55
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

17×487 Posts
Default

Quote:
Originally Posted by Bdot
I must admit that I don't understand what this compiler complains about.
Is it because one vector is unsigned and the other is signed?
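
That guess fits the OpenCL C rules as far as I can tell: a comparison such as (r1 != 0) on a ulong2 yields a signed long2 whose lanes are -1 (all bits set) for true and 0 for false, and explicit casts between vector types are not allowed (convert_ulong2() or as_ulong2() are the intended ways). Whether the compiler's handling of the ternary itself is correct is not settled here. The plain-C sketch below only models the lanes, assuming VectorSize=2; all function names are hypothetical. It shows two formulations that avoid the ternary and give the same result.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 2 /* stands in for VectorSize=2, i.e. ulong2 / long2 */

/* Models the OpenCL C vector comparison (r1 != 0): each result lane is a
   SIGNED 64-bit value, -1 (all bits set) if true and 0 if false. */
static void vec_ne_zero(const uint64_t r1[LANES], int64_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = (r1[i] != 0) ? -1 : 0;
}

/* Models r2 += select((ulong2)(0), (ulong2)(1), r1 != 0): a lane gets +1
   where the most significant bit of the mask lane is set. */
static void add_flag_select(uint64_t r2[LANES], const uint64_t r1[LANES])
{
    int64_t mask[LANES];
    vec_ne_zero(r1, mask);
    for (int i = 0; i < LANES; i++)
        r2[i] += (mask[i] < 0) ? 1u : 0u;
}

/* Models r2 -= as_ulong2(r1 != 0): reinterpreting -1 as unsigned and
   subtracting it adds 1 modulo 2^64. */
static void add_flag_sub(uint64_t r2[LANES], const uint64_t r1[LANES])
{
    int64_t mask[LANES];
    vec_ne_zero(r1, mask);
    for (int i = 0; i < LANES; i++)
        r2[i] -= (uint64_t)mask[i];
}

/* Both formulations should agree lane by lane. */
static void check_equivalence(void)
{
    const uint64_t r1[LANES] = {0, 7};
    uint64_t a[LANES] = {10, 10}, b[LANES] = {10, 10};
    add_flag_select(a, r1);
    add_flag_sub(b, r1);
    assert(a[0] == b[0] && a[1] == b[1]);
}
```

With r1 = {0, 7} and r2 = {10, 10}, both variants leave r2 = {10, 11}. Whether either form compiles on the Intel OpenCL compiler discussed here has not been tested in this thread.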
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
