mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

2019-03-23, 08:29   #56
diep

Quote:
Originally Posted by M344587487 View Post
It has. There could be a resurgence in mining if crypto picks up again but right now it's enthusiasts and speculators only.
As for crypto, that is highly unlikely. Basically every 'crypto' coin is inherently a pyramid game as soon as it can be obtained by mining.

The only reason I can think of that the cryptocoins haven't been made illegal is that quite often a large bank is behind one of the introduced coins (ING bank behind Ethereum, for example), so they have a direct cash interest in swindling people by means of that coin.

Most banks of course see all the criminal money being transferred, and criminals don't mind paying high fees and transaction percentages in order to launder their money into legal money.

Several banks have recently received small fines (on the order of a few hundred million dollars - ING bank, for example) for - from my point of view - possibly criminal transactions.

It remains very tempting for big financial institutions to stay involved in criminal money, and with crypto they seem to get away with it so far. Seen from a distance it is all drug trade and moving criminal money around.

By now the average person on the street is slowly starting to understand this, so it is very unlikely that a new mineable crypto will emerge and become a long-lived success story.

2019-03-23, 08:32   #57
diep

Quote:
Originally Posted by M344587487 View Post
Fingers crossed. If it has the full 1:2 ratio does that mean we can potentially saturate the memory at lower core clocks, or even do TF with the extra headroom with higher clocks? I wonder if it's possible to assign some CU's to gpuowl and others to mfakto, is SR-IOV needed for that or equivalent? I have doubts SR-IOV would make it to the consumer version.
It's been some years since I wrote OpenCL code - back then it was impossible to run different kernels at the same time on different SIMDs of an AMD GPU.

AMD claimed it was 'theoretically technically possible' to achieve this, but they had not implemented that feature yet.

Back then the GPGPU helpdesk at AMD was - to put it politely - not very professional.

Maybe someone has a more recent update to this info.
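
For reference, a minimal host-side sketch of what "two kernels at the same time" looks like in OpenCL is below. The kernel names, work sizes and the helper function are made up for illustration; whether the two queues actually overlap on different CUs is entirely up to the driver and hardware, which was exactly the open question back then.
Code:
/* Sketch: launch two different kernels on one device, each on its own
 * in-order command queue. Names and sizes are placeholders; concurrent
 * execution on separate CUs is up to the OpenCL runtime. */
#include <CL/cl.h>

void launch_two_kernels(cl_context ctx, cl_device_id dev,
                        cl_kernel sieve_kernel, cl_kernel lucas_kernel)
{
    cl_int err;
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, &err);

    size_t g1 = 1 << 20, l1 = 64;    /* global/local sizes: placeholders */
    size_t g2 = 1 << 18, l2 = 256;

    clEnqueueNDRangeKernel(q1, sieve_kernel, 1, NULL, &g1, &l1, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, lucas_kernel, 1, NULL, &g2, &l2, 0, NULL, NULL);

    clFinish(q1);
    clFinish(q2);
    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}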

2019-03-23, 08:47   #58
preda

Quote:
Originally Posted by diep View Post
Maybe someone has a more recent update to this info.
AMD is developing ROCm, an open-source OpenCL implementation on GitHub, and the developers are very responsive to bug reports. The quality in general is good, and is improving rapidly.

2019-03-23, 09:06   #59
diep

Quote:
Originally Posted by preda View Post
AMD is developing ROCm, an open-source OpenCL implementation on GitHub, and the developers are very responsive to bug reports. The quality in general is good, and is improving rapidly.
That's all well and good, but I want to run multiple kernels at the same time and use the L1 cache to store arrays and data, like I can on Nvidia.

Nowadays both manufacturers are up to what, 80 SIMDs per card or so?

Doing everything through GDDR5/HBM2 is simply not a scalable option anymore. I want 1 byte of bandwidth for every 3 flops (double precision) the card delivers, and that isn't asking too much.

Part of that bandwidth demand can be offloaded to the L1 data caches, provided they are large enough for the number of threads (warps) you run on each SIMD.

I can't speak for AMD, but on Nvidia I need to run at least 8 warps (threads of 32 CUDA cores) at the same time on each SIMD, otherwise it gets too slow - which means the available L1 data cache has to be divided among at least 8 warps. That leaves very little L1 data cache, which is exactly what my sieving code needs in order to sieve k * 2^n +- 1 (with n as the variable) on the GPU.

$3000 for a Titan V is simply a high price - and they can ask that much because AMD lacks features.
I bought a second-hand Titan Z because it has about 1 byte of bandwidth for every 2.5 flops it delivers.
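
(To put a number on that "bytes per flop" criterion: using the commonly quoted Radeon VII figures of roughly 1024 GB/s of HBM2 bandwidth and roughly 3.4 TFLOPS FP64 - my assumption, not numbers from this post - the card lands at about 1 byte per 3.3 double-precision flops, i.e. right around the ratio asked for:)
Code:
/* Rough bytes-per-flop arithmetic. The Radeon VII figures used here
 * (~1024 GB/s HBM2, ~3.4 TFLOPS FP64) are commonly quoted specs,
 * inserted purely for illustration. */
#include <stdio.h>

int main(void)
{
    double bandwidth_GBps = 1024.0;   /* HBM2 bandwidth           */
    double fp64_GFLOPS    = 3400.0;   /* double-precision peak    */

    double bytes_per_flop = bandwidth_GBps / fp64_GFLOPS;
    printf("bytes per DP flop: %.2f\n", bytes_per_flop);        /* ~0.30 */
    printf("DP flops per byte: %.1f\n", 1.0 / bytes_per_flop);  /* ~3.3  */
    return 0;
}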

Last fiddled with by diep on 2019-03-23 at 09:12

2019-03-23, 10:26   #60
preda

Quote:
Originally Posted by diep View Post
That's all well and good, but I want to run multiple kernels at the same time and use the L1 cache to store arrays and data, like I can on Nvidia.
...
Part of that bandwidth demand can be offloaded to the L1 data caches, provided they are large enough for the number of threads (warps) you run on each SIMD.
Maybe you're looking for what is called "local" memory in OpenCL, LDS (Local Data Share) in AMD GCN. It is basically an "explicitly managed" L1 memory (vs. the "cache" being "implicitly managed"), fast and relatively large.
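
(For illustration, a minimal OpenCL kernel that uses this explicitly managed memory might look like the sketch below; the staging pattern and array size are made up, only the __local declaration and the barrier are the point. On GCN the __local array lives in the LDS.)
Code:
/* Minimal sketch of explicitly managed LDS use in OpenCL C. Each
 * work-group stages data into __local memory (the LDS on GCN),
 * synchronizes, then reads from the fast tile. Assumes a work-group
 * size of at most 256; the computation itself is a placeholder. */
__kernel void lds_example(__global const float *in, __global float *out)
{
    __local float tile[256];                /* lives in the LDS          */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];                    /* global memory -> LDS      */
    barrier(CLK_LOCAL_MEM_FENCE);           /* whole group sees the tile */

    /* placeholder work: combine with a neighbour's value from the LDS */
    size_t n = (lid + 1) % get_local_size(0);
    out[gid] = tile[lid] + tile[n];
}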

Last fiddled with by preda on 2019-03-23 at 10:28

2019-03-23, 10:46   #61
diep

Quote:
Originally Posted by preda View Post
Maybe you're looking for what is called "local" memory in OpenCL, LDS (Local Data Share) in AMD GCN. It is basically an "explicitly managed" L1 memory (vs. the "cache" being "implicitly managed"), fast and relatively large.
When I toyed with it on the by now old AMD card I have here (fast and expensive at the time), it wasn't very impressive. I understood that newer AMD GPUs had less of it, or that it was harder to access - maybe I misunderstood.

What size does it have on this Radeon VII?

And how many GPU clocks does it take on the Radeon VII to store something in the LDS and retrieve it again (for all 64 OpenCL cores at the same time, of course)?

(Of course I'm talking about throughput clocks, not actual latencies.)

edit:
And I assume the LDS is not hosted in the register file but is actually separate storage - is this correct?
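
(The size can also be queried programmatically; a small sketch using the standard clGetDeviceInfo call is below. On this card the clinfo dump further down the thread reports "Local memory size: 65536" and "Local memory type: Scratchpad".)
Code:
/* Sketch: query the device's local-memory (LDS) size and type through
 * the standard OpenCL API. Error handling omitted for brevity. */
#include <CL/cl.h>
#include <stdio.h>

void print_local_mem_info(cl_device_id dev)
{
    cl_ulong local_size = 0;
    cl_device_local_mem_type local_type;

    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_size), &local_size, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE,
                    sizeof(local_type), &local_type, NULL);

    printf("local mem size: %llu bytes\n", (unsigned long long)local_size);
    printf("local mem type: %s\n",
           local_type == CL_LOCAL ? "dedicated (scratchpad/LDS)" : "global");
}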

Last fiddled with by diep on 2019-03-23 at 10:49

2019-03-23, 10:57   #62
preda

Quote:
Originally Posted by diep View Post
And I assume the LDS is not hosted in the register file but is actually separate storage - is this correct?
Yes, LDS is a very-low-latency memory, close to the processor (L1-type); it is not "registers" in GCN parlance.

2019-03-23, 12:48   #63
M344587487

4608K stock perf_level tests:
Code:
perf_level wall_power rocm-smi_power temp sclk mclk ms_per_it_4608K joules_per_it
8          311        236            102  1802 1001 0.94            0.22184
7          298        226            98   1774 1001 0.94            0.21244
6          285        213            95   1750 1001 0.95            0.20235
5          264        195            95   1684 1001 0.96            0.1872
4          227        163            95   1547 1001 0.99            0.16137
3          190        129            95   1373 1001 1.07            0.13803
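
The joules_per_it column appears to be rocm-smi_power (W) × ms_per_it / 1000, i.e. energy per iteration based on the card's own power reading rather than wall power; a quick check against the perf_level 8 row:
Code:
/* Reconstructing the joules_per_it column: it matches
 * rocm-smi_power (W) * ms_per_it / 1000. Values from the perf_level 8 row. */
#include <stdio.h>

int main(void)
{
    double power_w   = 236.0;   /* rocm-smi_power at perf_level 8 */
    double ms_per_it = 0.94;    /* ms per iteration at 4608K FFT  */

    printf("%.5f J/it\n", power_w * ms_per_it / 1000.0);   /* 0.22184 */
    return 0;
}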
4608K stock --setfan tests:
Code:
set_fan wall_power rocm-smi_power temp sclk mclk ms_per_it_4608K
255     307        233            90   1802 1001 0.94
200     305        233            93   1802 1001 0.94
175     310        235            99   1802 1001 0.94
150     314        237            105  1802 1001 0.94
pp_od_clk_voltage dump:
Code:
OD_SCLK:
0:        808Mhz
1:       1801Mhz
OD_MCLK:
1:       1000Mhz
OD_VDDC_CURVE:
0:        808Mhz        690mV
1:       1304Mhz        797mV
2:       1801Mhz       1081mV
OD_RANGE:
SCLK:     808Mhz       2200Mhz
MCLK:     351Mhz       1200Mhz
VDDC_CURVE_SCLK[0]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[0]:     738mV        1218mV
VDDC_CURVE_SCLK[1]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[1]:     738mV        1218mV
VDDC_CURVE_SCLK[2]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[2]:     738mV        1218mV
rocminfo dump:
Code:
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
===CPU info snip===                   
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26287                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1802                               
  BDFID:                   2816                               
  Compute Unit:            60                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  184550400                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***
clinfo dump:
Code:
Number of platforms:                 1
  Platform Profile:                 FULL_PROFILE
  Platform Version:                 OpenCL 2.1 AMD-APP (2833.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:                 Advanced Micro Devices, Inc.
  Platform Extensions:                 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 


  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:                 1
  Device Type:                     CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                     Vega 20
  Device Topology:                 PCI[ B#11, D#0, F#0 ]
  Max compute units:                 60
  Max work items dimensions:             3
    Max work items[0]:                 1024
    Max work items[1]:                 1024
    Max work items[2]:                 1024
  Max work group size:                 256
  Preferred vector width char:             4
  Preferred vector width short:             2
  Preferred vector width int:             1
  Preferred vector width long:             1
  Preferred vector width float:             1
  Preferred vector width double:         1
  Native vector width char:             4
  Native vector width short:             2
  Native vector width int:             1
  Native vector width long:             1
  Native vector width float:             1
  Native vector width double:             1
  Max clock frequency:                 1802Mhz
  Address bits:                     64
  Max memory allocation:             14588628172
  Image support:                 Yes
  Max number of images read arguments:         128
  Max number of images write arguments:         8
  Max image 2D width:                 16384
  Max image 2D height:                 16384
  Max image 3D width:                 2048
  Max image 3D height:                 2048
  Max image 3D depth:                 2048
  Max samplers within kernel:             26287
  Max size of kernel argument:             1024
  Alignment (bits) of base address:         1024
  Minimum alignment (bytes) for any datatype:     128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                     Yes
    Round to nearest even:             Yes
    Round to zero:                 Yes
    Round to +ve and infinity:             Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                     Read/Write
  Cache line size:                 64
  Cache size:                     16384
  Global memory size:                 17163091968
  Constant buffer size:                 14588628172
  Max number of constant args:             8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                 16
  Max pipe active reservations:             16
  Max pipe packet size:                 1703726284
  Max global variable size:             14588628172
  Max global variable preferred total size:     17163091968
  Max read/write image args:             64
  Max on device events:                 1024
  Queue on device max size:             8388608
  Max on device queues:                 1
  Queue on device preferred size:         262144
  SVM capabilities:                 
    Coarse grain buffer:             Yes
    Fine grain buffer:                 Yes
    Fine grain system:                 No
    Atomics:                     No
  Preferred platform atomic alignment:         0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:         0
  Kernel Preferred work group size multiple:     64
  Error correction support:             0
  Unified memory for Host and Device:         0
  Profiling timer resolution:             1
  Device endianess:                 Little
  Available:                     Yes
  Compiler available:                 Yes
  Execution capabilities:                 
    Execute OpenCL kernels:             Yes
    Execute native function:             No
  Queue on Host properties:                 
    Out-of-Order:                 No
    Profiling :                     Yes
  Queue on Device properties:                 
    Out-of-Order:                 Yes
    Profiling :                     Yes
  Platform ID:                     0x7fd046e90f70
  Name:                         gfx906
  Vendor:                     Advanced Micro Devices, Inc.
  Device OpenCL C version:             OpenCL C 2.0 
  Driver version:                 2833.0 (HSA1.1,LC)
  Profile:                     FULL_PROFILE
  Version:                     OpenCL 1.2 
  Extensions:                     cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
Quote:
Originally Posted by preda View Post
You know that you have to commit the change, I think by writing a line with a single "c" at the end? A sequence of "s ...." lines, and a single "c" at the end.
The problem is that the lines wouldn't write to pp_od_clk_voltage at all, so there was nothing to commit to the card - it failed with something along the lines of a "bash echo error, invalid input". The performance level was set to manual, and writing both as sudo and as root was tried. I'm probably doing something silly; I'll give it another go later.
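
(A sketch of the write-then-commit sequence described above - one or more "s ..." lines followed by a single "c" - is below, written in C just to make the separate write and commit steps explicit; the sysfs path and the values are placeholders. One possible reason the three-argument "s 4 1547 1050" form is rejected, as far as I understand the kernel docs: on Vega20 the sclk lines take only an index and a clock, and voltage-curve points are set with separate "vc <point> <MHz> <mV>" lines - but that is an assumption, not something confirmed in this thread.)
Code:
/* Sketch of the write-then-commit sequence for pp_od_clk_voltage.
 * Path, line format and values are placeholders; the exact syntax
 * depends on GPU generation and driver. Must run as root. */
#include <stdio.h>

static int write_line(const char *path, const char *line)
{
    FILE *f = fopen(path, "w");          /* each write is its own open */
    if (!f) { perror(path); return -1; }
    int ok = fputs(line, f) >= 0;
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    const char *od = "/sys/class/drm/card0/device/pp_od_clk_voltage";

    write_line(od, "s 1 1750\n");        /* placeholder: new max sclk     */
    write_line(od, "vc 2 1750 1000\n");  /* placeholder: voltage-curve pt */
    write_line(od, "c\n");               /* commit the staged values      */
    return 0;
}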

Quote:
Originally Posted by diep View Post
It's been some years since I wrote OpenCL code - back then it was impossible to run different kernels at the same time on different SIMDs of an AMD GPU.
...
SR-IOV is what servers use to allow multiple VMs to share a single PCIe device simultaneously. It's a pro feature that I don't think is enabled on this card. I don't know whether the card's resources can be divided up among multiple programs in a non-VM fashion.


If there are any other tests or info you want dumped, let me know. I'm going to do some gpuowl memory OD tests, then try undervolting again, and if undervolting works, undervolt + memory OD. Might as well test mfakto too - what would be a good test for that?

2019-03-23, 12:56   #64
diep

Quote:
If there are any other tests or info you want dumped, let me know. I'm going to do some gpuowl memory OD tests, then try undervolting again, and if undervolting works, undervolt + memory OD. Might as well test mfakto too - what would be a good test for that?
Much appreciated - yes, I would like to know the size of the LDS and how many clocks it takes to store to it and retrieve from it, as I assume the 16KB L1 it shows is the instruction cache. Knowing the timing is crucial: if using the LDS suddenly costs more than 1 clock of throughput, it becomes worthless, because that potentially throws away a factor of 2 in GPU performance. I would also like to know whether it can store/retrieve the data for all 64 units in a SIMD simultaneously in that 1 clock, because if it cannot, that again means slow effective latencies to the LDS.
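
A rough way to answer the timing question yourself is a kernel that hammers the LDS in a loop and is timed against an otherwise identical kernel that keeps the value in a register; a minimal OpenCL sketch with a made-up access pattern is below (not from this thread, just one way to probe it).
Code:
/* Very rough LDS latency/throughput probe (sketch). Every work-item does
 * N dependent reads and writes to __local memory; comparing its runtime
 * to a register-only variant gives an estimate of the per-access cost.
 * Access pattern and sizes are placeholders. */
#define N 4096

__kernel void lds_probe(__global uint *out)
{
    __local uint lds[256];
    uint lid = (uint)get_local_id(0);
    lds[lid] = lid;
    barrier(CLK_LOCAL_MEM_FENCE);

    uint x = lid;
    for (int i = 0; i < N; i++) {
        x = lds[x & 255] + 1;     /* dependent LDS read each iteration */
        lds[lid] = x;             /* plus an LDS write                 */
    }
    out[get_global_id(0)] = x;    /* keep the result live so nothing is optimized away */
}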

2019-03-23, 13:10   #65
diep

Quote:
Originally Posted by preda View Post
Yes, LDS is a very-low-latency memory, close to the processor (L1-type); it is not "registers" in GCN parlance.
Can you read from and write to that LDS manually yourself?

About a year and a half ago, when I had to decide which GPU to buy, there were some cheap older-generation GCN GPUs from AMD on eBay delivering around 1.45 TFLOPS double precision. The reason I did not buy one of them was that, when I checked the AMD documentation, it said the GPU did have a fast L1 cache but you could not manually read/write to it - the card used it itself to cache device memory in a clever manner. So I bought the Titan Z instead.

Last fiddled with by diep on 2019-03-23 at 13:11

2019-03-23, 17:57   #66
xx005fs

Quote:
Originally Posted by M344587487 View Post
Radeon VII gpuowl benchmark at stock settings, these are out of the box numbers using the default performance levels. Using a daily build of Ubuntu Disco (upcoming 19.04) with 5.0.0 kernel. ROCm using upstream. Not headless, GPU was driving display which may have had a tiny impact on performance. Performance level set with rocm-smi --setsclk #, default fan speed and memory clocks. PRP on an 89M exponent using 5M FFT, latest gpuowl. Test bench is a Ryzen 1700 idling, 16GB of RAM and a gold rated PSU.

Code:
perf_level  wall_power rocm-smi_power temp sclk mclk ms/it joules_per_iteration
8           305        240            101  1802 1001 1.03  0.2472
7           298        229            97   1774 1001 1.04  0.23816
6           290        220            95   1750 1001 1.04  0.2288
5           265        197            95   1684 1001 1.06  0.20882
4           228        164            95   1547 1001 1.09  0.17876
3           191        131            95   1373 1001 1.18  0.15458
2           158        103            87   1135 1001 1.285 0.132355
1           131        82             74    809 1001 1.68  0.13776
0           122        75             69    701 1001 1.9   0.1425
Numbers can be improved by overclocking the memory and undervolting the core. Tried a basic memory overclock by setting memory OD to 19% in rocm-smi which upped mclk to 1192. At perf level 8 it was doing 0.95 ms/it but I only ran it for 5 minutes as it was hitting the 250W power cap so the figure is not very useful. An undervolt should improve efficiency considerably but I haven't figured out how to do that or change the perf level presets properly yet (amdgpu.ppfeaturemask=0xffffffff is in grub and pp_od_clk_voltage is present in /sys... however pushing "s 4 1547 1050" or whatever doesn't work, nor does --setslevel or --setmlevel in rocm-smi. Pro drivers required?).

Impressive - it is faster than a Titan V with its HBM overclocked as high as it will go, by around 10% even at stock. But it is not the improvement I was hoping for: the thread discussing memory use for gpuowl suggests the Radeon VII should be nowhere near memory-limited given its DP capability (0.8 TFLOPS for Vega 64 vs 3.4 TFLOPS), and so should be a great deal faster than a Vega 64. So maybe it is still memory-limited, and overclocking the HBM would help a great deal?

Last fiddled with by xx005fs on 2019-03-23 at 18:01