#56
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
The only reason I can think of that cryptocoins haven't been outlawed is that, not seldom, a big bank is behind an introduced crypto (ING bank, for example, behind Ethereum), so they have a direct cash interest in swindling people by means of that cryptocoin. Banks of course see all the criminal money that gets transferred, and criminals, in order to launder their money into legal money, don't mind paying high fees and transaction percentages. Several banks were recently fined relatively small amounts (on the order of a few hundred million dollars - ING bank, for example) over possibly criminal transactions, as seen from my viewpoint. It stays nonstop very tempting for big financials to remain involved in criminal money, and with cryptos they seem to get away with it so far. Seen from a distance it's all drug trade and moving criminal money. By now the average person on the street is slowly starting to understand this, so it's very unlikely that a specific mineable crypto will emerge in the future and become a long-lived success story.
#57
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
AMD claimed it was "theoretically technically possible" to achieve this, yet they hadn't implemented that feature. Back then the GPGPU helpdesk at AMD was - to put it politely - not very professional. Maybe someone has a more recent update on this.
#58
"Mihai Preda"
Apr 2015
5AC₁₆ Posts
#59
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
Nowadays both manufacturers are up to what, 80 SIMDs per card or so? Doing everything through GDDR5/HBM2 is just not a scalable option anymore. I want 1 byte of bandwidth for every 3 flops (double precision) a card delivers, and that isn't asking too much. Part of that bandwidth can be offloaded to the L1 data caches, provided they are large enough for the number of threads (warps) you run on each SIMD. I don't know about AMD, but on Nvidia I need to run at least 8 warps (threads of 32 CUDA cores) simultaneously on each SIMD, otherwise it gets too slow - which means the available L1 data cache has to be divided among at least 8 warps. That's not much L1 data cache, which I badly need, for example, for my GPU sieving code for k·2^n ± 1 with n as the variable. $3000 for a Titan V is simply a high price - and they can ask that much because AMD lacks features. I bought a second-hand Titan Z because it has about 1 byte of bandwidth for every 2.5 flops it delivers.

Last fiddled with by diep on 2019-03-23 at 09:12
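To put a number on that rule of thumb, a minimal C sketch - the Radeon VII figures in it (roughly 3.4 TFLOPS FP64 and 1 TB/s of HBM2) are assumptions for illustration, not verified specs:
Code:
/* Checks a card against the "1 byte of bandwidth per 3 DP flops" target. */
#include <stdio.h>

int main(void) {
    double dp_gflops = 3400.0;   /* assumed peak FP64 rate (GFLOPS) */
    double bw_gbs    = 1024.0;   /* assumed memory bandwidth (GB/s) */

    double wanted = dp_gflops / 3.0;     /* bandwidth the rule asks for */
    double ratio  = dp_gflops / bw_gbs;  /* flops per delivered byte    */

    printf("want %.0f GB/s, have %.0f GB/s (%.2f flops per byte)\n",
           wanted, bw_gbs, ratio);
    return 0;
}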
#60
"Mihai Preda"
Apr 2015
2²·3·11² Posts
Quote:
Last fiddled with by preda on 2019-03-23 at 10:28
#61
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
What size is it on this Radeon VII? And how many GPU clocks does it take on the Radeon VII to store something in LDS and retrieve it from there (for 64 OpenCL cores at the same time, of course)? (I'm speaking about throughput clocks, of course - not actual latencies.)

edit: And I assume the LDS is not hosted in the register file but is actual separate storage - is this correct?

Last fiddled with by diep on 2019-03-23 at 10:49
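For what it's worth, the sort of micro-kernel that could measure that throughput - a hypothetical OpenCL C sketch (the kernel name, buffer size and iteration count are mine; you would time many launches from the host and divide, and the barriers add their own overhead to the estimate):
Code:
/* Hypothetical LDS store/load throughput stress kernel (OpenCL C).
   Run with a work-group of 64 (one wavefront) or more; host-side
   timing over many launches gives an average clocks-per-op figure. */
#define ITERS 10000

__kernel void lds_throughput(__global float *out) {
    __local float lds[2048];                /* 8KB of the 64KB LDS */
    int lid = get_local_id(0);
    float acc = (float) lid;

    for (int i = 0; i < ITERS; ++i) {
        lds[(lid + i) & 2047] = acc;        /* LDS store */
        barrier(CLK_LOCAL_MEM_FENCE);
        acc += lds[(lid * 3 + i) & 2047];   /* LDS load  */
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    out[get_global_id(0)] = acc;            /* keep the result live */
}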
#62
"Mihai Preda"
Apr 2015
2²·3·11² Posts
#63
"Composite as Heck"
Oct 2017
2·5²·19 Posts
4608K stock perf_level tests:
Code:
perf_level  wall_power  rocm-smi_power  temp  sclk  mclk  ms_per_it_4608K  joules_per_it
8           311         236             102   1802  1001  0.94             0.22184
7           298         226             98    1774  1001  0.94             0.21244
6           285         213             95    1750  1001  0.95             0.20235
5           264         195             95    1684  1001  0.96             0.1872
4           227         163             95    1547  1001  0.99             0.16137
3           190         129             95    1373  1001  1.07             0.13803
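The joules_per_it column is just energy per iteration: rocm-smi_power (watts) × ms_per_it_4608K / 1000 - every row above checks out exactly. A trivial C illustration (mine, not from the original post):
Code:
/* energy per iteration = power (W) * time (s); row for perf_level 8 */
#include <stdio.h>

int main(void) {
    double watts = 236.0;   /* rocm-smi_power at perf_level 8  */
    double ms    = 0.94;    /* ms_per_it_4608K at perf_level 8 */
    printf("joules/it = %.5f\n", watts * ms / 1000.0);  /* 0.22184 */
    return 0;
}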
Fan-speed sweep (at perf_level 8):
Code:
set_fan  wall_power  rocm-smi_power  temp  sclk  mclk  ms_per_it_4608K
255      307         233             90    1802  1001  0.94
200      305         233             93    1802  1001  0.94
175      310         235             99    1802  1001  0.94
150      314         237             105   1802  1001  0.94
Overdrive/voltage state as reported by the driver:
Code:
OD_SCLK:
0: 808Mhz
1: 1801Mhz
OD_MCLK:
1: 1000Mhz
OD_VDDC_CURVE:
0: 808Mhz 690mV
1: 1304Mhz 797mV
2: 1801Mhz 1081mV
OD_RANGE:
SCLK: 808Mhz 2200Mhz
MCLK: 351Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
VDDC_CURVE_SCLK[1]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[1]: 738mV 1218mV
VDDC_CURVE_SCLK[2]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[2]: 738mV 1218mV
rocminfo output:
Code:
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (number of timestamp)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
===CPU info snip===
*******
Agent 2
*******
Name: gfx906
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128
Queue Min Size: 4096
Queue Max Size: 131072
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16KB
Chip ID: 26287
Cacheline Size: 64
Max Clock Frequency (MHz):1802
BDFID: 2816
Compute Unit: 60
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64
Workgroup Max Size: 1024
Workgroup Max Size Per Dimension:
Dim[0]: 67109888
Dim[1]: 184550400
Dim[2]: 0
Grid Max Size: 4294967295
Waves Per CU: 40
Max Work-item Per CU: 2560
Grid Max Size per Dimension:
Dim[0]: 4294967295
Dim[1]: 4294967295
Dim[2]: 4294967295
Max number Of fbarriers Per Workgroup:32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Dimension:
Dim[0]: 67109888
Dim[1]: 1024
Dim[2]: 16777217
Workgroup Max Size: 1024
Grid Max Dimension:
x 4294967295
y 4294967295
z 4294967295
Grid Max Size: 4294967295
FBarrier Max Size: 32
*** Done ***
clinfo output:
Code:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (2833.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 20
Device Topology: PCI[ B#11, D#0, F#0 ]
Max compute units: 60
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1802Mhz
Address bits: 64
Max memory allocation: 14588628172
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26287
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 17163091968
Constant buffer size: 14588628172
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 1703726284
Max global variable size: 14588628172
Max global variable preferred total size: 17163091968
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fd046e90f70
Name: gfx906
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 2833.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 1.2
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
Quote:
Quote:
If there are any other tests or info you want dumped, let me know. I'm going to do some gpuowl memory OD tests, then try undervolting again, and if undervolting works, undervolt + memory OD. Might as well test mfakto too - what would be a good test for that?
#64
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
#65
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
About a year and a half ago, when I had to decide which GPU to buy, there were some cheap GCN-generation GPUs from AMD on eBay delivering around 1.45T double precision. The reason I didn't buy one of them was that when I checked the AMD documentation, it said the GPU did have a fast L1 cache, but you could not manually read from or write to it - the card itself used it to cache device memory in a clever manner. So I bought the Titan Z instead.

Last fiddled with by diep on 2019-03-23 at 13:11
#66
"Eric"
Jan 2018
USA
223 Posts
Quote:
Impressive - it is faster than a Titan V with its HBM overclocked as high as it goes, by around 10% even at stock. But it is not the improvement I was hoping for: the thread discussing memory use for gpuowl suggests the RVII should be nowhere near memory limited, and with its DP capability it should be a great deal faster than a Vega 64 (0.8 TFLOPS vs 3.4 TFLOPS). So maybe it is still memory limited, and overclocking the HBM would greatly help?

Last fiddled with by xx005fs on 2019-03-23 at 18:01
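One rough way to sanity-check the memory-limited question from the timings above - a hedged C sketch where both the flop-count model (~10·N·log2 N per iteration) and the peak FP64 figure are assumptions, not gpuowl's actual numbers:
Code:
/* Rough roofline check: sustained DP rate implied by the measured
   0.94 ms/iteration at 4608K, under an assumed flop-count model.
   Landing well under peak would be consistent with memory-bound code. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double N       = 4608.0 * 1024.0;  /* FFT size                        */
    double ms_it   = 0.94;             /* measured ms/iteration (stock)   */
    double peak_dp = 3400.0;           /* assumed Radeon VII FP64, GFLOPS */

    double flops_it  = 10.0 * N * log2(N);       /* assumed cost model */
    double sustained = flops_it / (ms_it * 1e6); /* achieved GFLOPS    */

    printf("~%.0f GFLOPS sustained of ~%.0f peak (%.0f%%)\n",
           sustained, peak_dp, 100.0 * sustained / peak_dp);
    return 0;
}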