mersenneforum.org  

Old 2017-10-28, 13:45   #89
retina

I actually find it easier to not bother with all the myriad of ID values and registers. Instead I just set an invalid instruction trap and start executing something from the set I want to use. If it traps then I drop back one level and try again. This covers both bases where either the OS or the CPU doesn't support the instructions.

Old 2017-10-28, 21:30   #90
ewmayer

Quote:
Originally Posted by retina
I actually find it easier to not bother with all the myriad of ID values and registers. Instead I just set an invalid instruction trap and start executing something from the set I want to use. If it traps then I drop back one level and try again. This covers both bases where either the OS or the CPU doesn't support the instructions.
A similar thought had occurred to me when I couldn't find any CPUID-style mention in the ARM instruction manual, but time to get concrete about this. Is that something easily doable in C - e.g. using the functionality of signal.h to catch SIGILL - or does it need C++'s try/catch exception handling?

If you could post some sample code with your implementation, that would be great.

Old 2017-10-28, 22:51   #91
retina

Quote:
Originally Posted by ewmayer
A similar thought had occurred to me when I couldn't find any CPUID-style mention in the ARM instruction manual, but time to get concrete about this. Is that something easily doable in C - e.g. using the functionality of signal.h to catch SIGILL - or does it need C++'s try/catch exception handling?

If you could post some sample code with your implementation, that would be great.
I don't use C/C++, or any HLL. It's all assembly. And I don't use Linux, or any commercial OS. So I'm not sure how much any code I have would help you. But the try/catch thing (or its equivalent) would appear to be the easiest thing to use here. Pick a representative instruction and put it in the try block. Build up a pool of true/false flags.

Old 2017-10-31, 16:28   #92
jasonp

I don't think the exception-handling semantics in C++ are capable of dealing with unix style signals, which by their nature involve interrupt handling and a context switch to the OS. I suppose you could ignore the illegal instruction signal and then look to see if the instruction you are testing has had the intended effect on sample data, i.e. do a SIMD vector add and check that vector(1)+vector(1) == vector(2).

More mundane: what about parsing /proc/cpuinfo in linux? On x86 this parses the cpuid bits for you. Is there an ARM equivalent for that?

Old 2017-11-01, 06:19   #93
retina

Quote:
Originally Posted by jasonp
I don't think the exception-handling semantics in C++ are capable of dealing with unix style signals, which by their nature involve interrupt handling and a context switch to the OS. I suppose you could ignore the illegal instruction signal and then look to see if the instruction you are testing has had the intended effect on sample data, i.e. do a SIMD vector add and check that vector(1)+vector(1) == vector(2).
I'm not sure how you could just ignore the illegal instruction trap. You would have to rewrite EIP/RIP properly to skip the faulting instruction, else you keep returning to the same instruction.

For my code I just lay it out like this (pseudo code):
Code:
//set all flags to zero
supports_FPU = 0
supports_CMOV = 0
//...
supports_AVX512F = 0

//test each instruction set
try {
	finit		//FPU instruction
	supports_FPU = 1
}
catch {
	//nothing to do here
}

try {
	cmove eax,eax	//CMOV instruction
	supports_CMOV = 1
}
catch {
	//nothing to do here
}

//...

try {
	vpabsq zmm1 {k1}, zmm2	//AVX512 Foundation instruction
	supports_AVX512F = 1
}
catch {
	//nothing to do here
}

Old 2017-11-01, 15:07   #94
jasonp

Ignoring SIGILL presupposes that the processor triggers a synchronous exception that starts an exception handler, and when the user-defined portion of that handler does nothing then execution restarts at the instruction after the faulting one, whose address was saved by hardware and is accessible somewhere. At least that's how the embedded CPUs I'm familiar with would work.

I see there are some projects on GitHub that use Unix internals to trap things like segfaults and convert them into C++ exceptions; pretty slick.

Old 2017-11-01, 16:18   #95
retina

Ideally the try block will set the trap address to the start of the following catch block. So when it traps, EIP/RIP is updated to point to the catch block and the flag never gets set to 1. The EIP/RIP rewrite has to be handled in the exception code. Code like I have above won't work if execution always returns to the following instruction, because then the flag is always set regardless of whether or not the instruction was valid.

Old 2017-11-02, 07:29   #96
ldesnogu

Here is the output of cat /proc/cpuinfo on a 64-bit ARM (AArch64) machine:

Code:
$ cat /proc/cpuinfo | grep Features | sort -u
Features    : fp asimd evtstrm
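
The same check can be sketched in C. This is only an illustration, assuming the "Features : ..." line layout shown in the output above; features_line_has and cpuinfo_has_feature are hypothetical helper names, not part of any library. Matching is done on whole tokens, so asking for "simd" does not match "asimd".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `name` appears as a whole whitespace-separated token after
   the ':' in a "Features : ..." line, else 0. */
static int features_line_has(const char *line, const char *name)
{
  const char *p;
  size_t n;

  assert(line != NULL && name != NULL);
  p = strchr(line, ':');
  n = strlen(name);
  if (p == NULL || n == 0) {
    return 0;
  }
  p++;  /* skip the ':' */
  while (*p != '\0') {
    const char *start;

    while (*p == ' ' || *p == '\t' || *p == '\n') {
      p++;  /* skip separators */
    }
    start = p;
    while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n') {
      p++;  /* walk to the end of this token */
    }
    if ((size_t)(p - start) == n && strncmp(start, name, n) == 0) {
      return 1;
    }
  }
  return 0;
}

/* Scan a cpuinfo-style file: 1 = found, 0 = not found, -1 = cannot open.
   Assumes each line fits in the 4096-byte buffer. */
static int cpuinfo_has_feature(const char *path, const char *name)
{
  char buf[4096];
  int found = 0;
  FILE *f = fopen(path, "r");

  if (f == NULL) {
    return -1;
  }
  while (!found && fgets(buf, sizeof(buf), f) != NULL) {
    if (strncmp(buf, "Features", 8) == 0) {
      found = features_line_has(buf, name);
    }
  }
  fclose(f);
  return found;
}
```

A call such as cpuinfo_has_feature("/proc/cpuinfo", "asimd") would then report whether any core lists asimd. On x86 Linux the corresponding line is named "flags" rather than "Features", so the prefix test would need adjusting there.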
And here is code to check undefined instructions in C:

Code:
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <ucontext.h>

static volatile sig_atomic_t got_sigill; /* written by the handler, read in main */

static void illegal_handler(int signum, siginfo_t *info, void *context)
{
  ucontext_t *uc = (ucontext_t *)context;

  printf("Got SIGILL\n");
  got_sigill = 1;
  uc->uc_mcontext.pc += 4; /* AArch64 instructions are always 4 bytes long */
}

static int setup_illegal(void)
{
  struct sigaction act;

  memset(&act, 0, sizeof(act));
  act.sa_sigaction = illegal_handler; /* three-argument handler goes with SA_SIGINFO */
  act.sa_flags = SA_SIGINFO;
  errno = 0;
  if (sigaction(SIGILL, &act, NULL) < 0) {
    perror("sigaction");
    return 1;
  }

  return 0;
}

int main(void)
{
  int err;

  /* first setup the signal handler */
  err = setup_illegal();
  if (err) {
    return 1;
  }

  /* now check instruction */
  got_sigill = 0;
  asm(".inst 0x0");
  if (got_sigill) {
    printf("Instruction is undefined.\n");
  } else {
    printf("Instruction is not undefined.\n");
  }

  return 0;
}

Old 2017-11-03, 08:51   #97
ldesnogu

And here is how to use HW capabilities:

Code:
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

static int has_asimd(void)
{
  unsigned long hwcaps = getauxval(AT_HWCAP);

  if (hwcaps & HWCAP_ASIMD) {
    return 1;
  }

  return 0;
}

int main(void)
{
  int asimd;

  asimd = has_asimd();
  if (asimd) {
    printf("AdvSIMD is supported.\n");
  } else {
    printf("AdvSIMD is NOT supported.\n");
  }

  return 0;
}

Old 2017-11-05, 02:48   #98
ewmayer

All odd-radix-DFT macros in place, thus we have full non-power-of-2 FFT capability. Timings look good, in the sense of nicely monotonic behavior, no significant anomalies related to the DFTs at the larger odd-prime lengths 11 and 13. I have been noticing a timing effect where I get as much as a 5-10% speedup for a given 100-iteration timing test (lasting 10-20 sec for FFT lengths in the 2-4Mdoubles range, using all 4 cores of my Odroid C2) when the CPU has been 'idle' (just OS background stuff running) for at least a few minutes. That indicates to me that thermal-based throttling is indeed occurring. As a near-term workaround will run any timings I post here with the top half of the plastic protective case removed; longer term will see about finding a small cooling fan and suitable power pinouts on the C2 board to run it from.

Laurent, many thanks for the sample code related to SIGILL - I will be looking at that over the next few days, in conjunction with some basic small-DFT macros I use for low-level SIMD correctness and timing tests - those are currently only used in a special #define-flag-enabled build mode, but I think I will enable them in all SIMD builds and wrap them in SIGILL checking.

Old 2017-11-05, 23:54   #99
ewmayer

Quote:
Originally Posted by ewmayer
All odd-radix-DFT macros in place, thus we have full non-power-of-2 FFT capability. Timings look good, in the sense of nicely monotonic behavior, no significant anomalies related to the DFTs at the larger odd-prime lengths 11 and 13.
BTW, there is no fundamental reason why FFTs using these prime radices at the front end should perform worse than their smoother counterparts. Here's why: First off, the per-input-normalized opcounts for e.g. DFT-11 and 13 are unavoidably worse than those of their neighbors 10, 12, 14. On FMA hardware, a simple implementation of a prime-length-N DFT which uses an FMA-based pair of quadratic matrix-multiplies to do the (N-1)/2 x (N-1)/2 subconvolutions by sine and cosine parts of the complex roots needs precisely 5*(N-1) ADD and (N-1)^2 FMA, with 2*(N-1) of the ADD replaceable by the same number of FMA in the opening and closing (N-1)-butterflies if register availability is a problem, both variants having the same total opcount, (N+4)*(N-1). For example, such a basic approach to a 13-DFT, using the ADD-to-FMA replacement to ease register pressure, needs 36 ADD and 168 FMA, a total opcount comparable to the best-of-breed van Buskirk tangent-DFT for the same radix, with FMAs used to minimize the total opcount of the latter as well. (The reason to prefer the simpler approach, if opcounts are comparable, is that it tends to have significantly better roundoff error properties than e.g. the tangent-DFT.)

Compared with the mere 96 ops needed for an optimized 12-DFT (four 3-DFTs needing 12 ops each followed by three 4-DFTs needing 16 ops each) and the 160 needed for an optimized 14-DFT (two 7-DFTs needing 66 ops each followed by seven 2-DFTs needing 4 ops each), the ~200 arithmetic ops of radix-13 are clearly worse. But since e.g. 11 and 13 are prime (and the same argument also applies to odd composites like 9 and 15), they contain no power-of-2 components as do 12 and 14, meaning we can use some index-permutation magic to twiddlelessly combine them with the ensuing larger-power-of-2 FFT pass sequence. For the 13-DFT, saving a full layer of complex twiddle MULs is equivalent to saving ~50 FMAs, so our effective opcount for the 13-DFT drops to ~150 total arithmetic ops. That is roughly the same ops-per-input as the 14-DFT and only slightly more than the very factorization-smooth 12-DFT, when one places things in the context of doing a large-length FFT in which said DFTs are accompanied by multiple power-of-2 passes that do the bulk of the overall work.
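
As a quick arithmetic check of the opcounts quoted above, here is a small C sketch. It only encodes the formulas stated in the text (5*(N-1) ADD, (N-1)^2 FMA, optional trade of 2*(N-1) ADD for FMA) and the per-sub-DFT counts given for radix 12 and 14; the function names are made up for illustration and are not Mlucas code.

```c
#include <assert.h>

/* Simple prime-length-N DFT: 5*(N-1) ADD and (N-1)^2 FMA. */
static int simple_dft_adds(int n) { assert(n > 1); return 5 * (n - 1); }
static int simple_dft_fmas(int n) { assert(n > 1); return (n - 1) * (n - 1); }

/* Low-register variant: 2*(N-1) of the ADDs are replaced by FMAs. */
static int lowreg_dft_adds(int n) { return simple_dft_adds(n) - 2 * (n - 1); }
static int lowreg_dft_fmas(int n) { return simple_dft_fmas(n) + 2 * (n - 1); }

/* Total is the same for both variants and equals (N+4)*(N-1). */
static int simple_dft_total(int n)
{
  return simple_dft_adds(n) + simple_dft_fmas(n);
}

/* Composite radices, using the per-sub-DFT opcounts quoted in the text. */
static int dft12_ops(void) { return 4 * 12 + 3 * 16; } /* four 3-DFTs, three 4-DFTs */
static int dft14_ops(void) { return 2 * 66 + 7 * 4; }  /* two 7-DFTs, seven 2-DFTs */
```

For N = 13 this gives 36 ADD and 168 FMA in the low-register variant, 204 ops in total, against 96 for radix 12 and 160 for radix 14. Taking the ~50-FMA twiddle saving at face value, 154/13 is about 11.8 ops per input, close to 160/14, which is about 11.4.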

Last fiddled with by ewmayer on 2017-11-06 at 00:00

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.