#89 |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²×1,549 Posts |
I actually find it easier not to bother with the myriad ID values and registers. Instead I just set an invalid-instruction trap and start executing something from the set I want to use. If it traps, I drop back one level and try again. This covers both cases: the one where the OS doesn't support the instructions and the one where the CPU doesn't.
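The "drop back one level and try again" idea can be sketched roughly as below. This is only an illustration, not the poster's code: `probe_fn`, `highest_usable_level` and the two stand-in probes are hypothetical names, and a real probe would execute one instruction from the set under an invalid-instruction trap rather than return a constant.

```c
#include <stddef.h>

/* Hypothetical probe type: returns nonzero if the instruction set it tests is usable. */
typedef int (*probe_fn)(void);

/*
 * probes[] is ordered from the lowest level (index 0) to the highest.
 * Start at the top and drop back one level each time a probe fails.
 * Returns the index of the highest usable level, or -1 if none works.
 */
int highest_usable_level(const probe_fn probes[], size_t count)
{
    size_t i = count;
    while (i > 0) {
        i--;
        if (probes[i] != NULL && probes[i]())
            return (int)i;
    }
    return -1;
}

/* Stand-in probes for illustration only; they do not execute any real instruction. */
int probe_always_ok(void)    { return 1; }
int probe_always_traps(void) { return 0; }
```

With probes ordered, say, baseline, then SIMD, then wider SIMD, the returned index tells the caller which code path to dispatch to.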
|
|
|
|
|
|
#90 | |
|
∂²ω=0
Sep 2002
República de California
11639₁₀ Posts |
Quote:
If you could post some sample code with your implementation, that would be great. |
|
|
|
|
|
|
#91 | |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²×1,549 Posts |
Quote:
|
|
|
|
|
|
|
#92 |
|
Tribal Bullet
Oct 2004
3541₁₀ Posts |
I don't think the exception-handling semantics in C++ are capable of dealing with Unix-style signals, which by their nature involve interrupt handling and a context switch to the OS. I suppose you could ignore the illegal-instruction signal and then check whether the instruction you are testing has had the intended effect on sample data, i.e. do a SIMD vector add and check that vector(1)+vector(1) == vector(2).
More mundane: what about parsing /proc/cpuinfo in Linux? On x86 this decodes the cpuid bits for you. Is there an ARM equivalent for that? |
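A minimal sketch of the /proc/cpuinfo approach, assuming the usual Linux layout in which x86 lists its features on a `flags` line and 64-bit ARM lists them on a `Features` line. The function names `line_has_feature` and `cpuinfo_has_feature` are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole whitespace-separated token after the ':' in `line`. */
int line_has_feature(const char *line, const char *name)
{
    const char *p = strchr(line, ':');
    size_t n = strlen(name);

    if (p == NULL || n == 0)
        return 0;
    p++;
    while (*p != '\0') {
        /* skip separators, then measure the next token */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        if ((size_t)(p - start) == n && strncmp(start, name, n) == 0)
            return 1;
    }
    return 0;
}

/* Returns 1 if found, 0 if not found, -1 if `path` (normally "/proc/cpuinfo") cannot be opened. */
int cpuinfo_has_feature(const char *path, const char *name)
{
    char line[8192];
    FILE *fp = fopen(path, "r");
    int found = 0;

    if (fp == NULL)
        return -1;
    while (!found && fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "Features", 8) == 0 || strncmp(line, "flags", 5) == 0)
            found = line_has_feature(line, name);
    }
    fclose(fp);
    return found;
}
```

Matching whole tokens matters here: a plain substring search for `simd` would wrongly match `asimd`.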
|
|
|
|
|
#93 | |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
6196₁₀ Posts |
Quote:
For my code I just lay it out like this (pseudocode):
Code:
//set all flags to zero
supports_FPU = 0
supports_CMOV = 0
//...
supports_AVX512F = 0
//test each instruction set
try {
    finit //FPU instruction
    supports_FPU = 1
}
catch {
    //nothing to do here
}
try {
    cmove eax,eax //CMOV instruction
    supports_CMOV = 1
}
catch {
    //nothing to do here
}
//...
try {
    vpabsq zmm1 {k1}, zmm2 //AVX512 Foundation instruction
    supports_AVX512F = 1
}
catch {
    //nothing to do here
}
|
|
|
|
|
|
|
#94 |
|
Tribal Bullet
Oct 2004
3,541 Posts |
Ignoring SIGILL presupposes that the processor triggers a synchronous exception that starts an exception handler, and when the user-defined portion of that handler does nothing then execution restarts at the instruction after the faulting one, whose address was saved by hardware and is accessible somewhere. At least that's how the embedded CPUs I'm familiar with would work.
I see there are some projects on GitHub that use Unix internals to trap things like segfaults and convert them into C++ exceptions; pretty slick. |
|
|
|
|
|
#95 |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²×1,549 Posts |
Ideally the try block will set the trap address to the start of the following catch block. So when it traps, EIP/RIP is updated to point to the catch block and the flag never gets set to 1. The EIP/RIP rewrite has to be handled in the exception code. Code like I have above won't work if the handler always returns to the following instruction, because then the flag is always set regardless of whether or not the instruction was valid.
|
|
|
|
|
|
#96 |
|
Jan 2008
France
1046₈ Posts |
Here is the output of cat /proc/cpuinfo on a 64-bit machine:
Code:
$ cat /proc/cpuinfo | grep Features | sort -u
Features : fp asimd evtstrm
Code:
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <ucontext.h>

/* volatile sig_atomic_t so the flag can be written safely from the handler */
static volatile sig_atomic_t got_sigill;

static void illegal_handler(int signum, siginfo_t *info, void *context)
{
    ucontext_t *uc = (ucontext_t *)context;
    (void)signum;
    (void)info;
    /* printf is not async-signal-safe; tolerated here only because this is a demo */
    printf("Got SIGILL\n");
    got_sigill = 1;
    /* skip the faulting instruction: AArch64 instructions are always 4 bytes long */
    uc->uc_mcontext.pc += 4;
}

static int setup_illegal(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    /* with SA_SIGINFO the three-argument handler goes in sa_sigaction, not sa_handler */
    act.sa_sigaction = illegal_handler;
    act.sa_flags = SA_SIGINFO;
    errno = 0;
    if (sigaction(SIGILL, &act, NULL) < 0) {
        perror("sigaction");
        return 1;
    }
    return 0;
}

int main(void)
{
    int err;
    /* first set up the signal handler */
    err = setup_illegal();
    if (err) {
        return 1;
    }
    /* now check the instruction: the all-zero encoding is undefined on AArch64 */
    got_sigill = 0;
    asm volatile(".inst 0x0");
    if (got_sigill) {
        printf("Instruction is undefined.\n");
    } else {
        printf("Instruction is not undefined.\n");
    }
    return 0;
}
|
|
|
|
|
|
#97 |
|
Jan 2008
France
2·5²·11 Posts |
And here is how to use HW capabilities:
Code:
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

static int has_asimd(void)
{
    unsigned long hwcaps = getauxval(AT_HWCAP);
    if (hwcaps & HWCAP_ASIMD) {
        return 1;
    }
    return 0;
}

int main(void)
{
    int asimd;
    asimd = has_asimd();
    if (asimd) {
        printf("AdvSIMD is supported.\n");
    } else {
        printf("AdvSIMD is NOT supported.\n");
    }
    return 0;
}
|
|
|
|
|
|
#98 |
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
All odd-radix-DFT macros are in place, thus we have full non-power-of-2 FFT capability. Timings look good, in the sense of nicely monotonic behavior, with no significant anomalies related to the DFTs at the larger odd-prime lengths 11 and 13. I have been noticing a timing effect where I get as much as a 5-10% speedup for a given 100-iteration timing test (lasting 10-20 sec for FFT lengths in the 2-4Mdoubles range, using all 4 cores of my Odroid C2) when the CPU has been 'idle' (just OS background stuff running) for at least a few minutes. That indicates to me that thermal-based throttling is indeed occurring. As a near-term workaround I will run any timings I post here with the top half of the plastic protective case removed; longer term I will see about finding a small cooling fan and suitable power pinouts on the C2 board to run it from.
Laurent, many thanks for the sample code related to SIGILL - I will be looking at that over the next few days, in conjunction with some basic small-DFT macros I use for low-level SIMD correctness and timing tests - those are currently only used in a special #define-flag-enabled build mode, but I think I will enable them in all SIMD builds and wrap them in SIGILL checking. |
|
|
|
|
|
#99 | |
|
∂²ω=0
Sep 2002
República de California
103·113 Posts |
Quote:
Compared with the mere 96 needed for an optimized 12-DFT (four 3-DFTs needing 12 ops each followed by three 4-DFTs needing 16 ops each) and the 160 needed for an optimized 14-DFT (two 7-DFTs needing 66 ops each followed by seven 2-DFTs needing 4 ops each), those ~200 arithmetic ops make radix-13 clearly worse. But since 11 and 13 are prime (and the same argument also applies to odd composites like 9 and 15), they contain no power-of-2 components as 12 and 14 do, meaning we can use some index-permutation magic to combine them twiddlelessly with the ensuing larger-power-of-2 FFT pass sequence. For the 13-DFT, saving a full layer of complex twiddle MULs is equivalent to saving ~50 FMAs, so our effective opcount for the 13-DFT drops to ~150 total arithmetic ops. That is roughly the same ops-per-input as the 14-DFT and only slightly more than the very factorization-smooth 12-DFT, when one places things in the context of doing a large-length FFT with said DFTs being accompanied by multiple power-of-2 passes which do the bulk of the overall work.
Last fiddled with by ewmayer on 2017-11-06 at 00:00 |
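The opcount bookkeeping in the post above can be tallied with a few lines of C. This is only a worked restatement of the post's own figures; the helper names `composite_dft_ops` and `ops_per_input` are invented for the example, and the ~200 and ~150 values for the 13-DFT are the post's approximations, not computed here.

```c
/* Total ops for a composite DFT done as count1 sub-DFTs of ops1 each,
 * followed by count2 sub-DFTs of ops2 each. */
int composite_dft_ops(int count1, int ops1, int count2, int ops2)
{
    return count1 * ops1 + count2 * ops2;
}

/* Arithmetic ops per complex input for an n-point DFT. */
double ops_per_input(int ops, int n)
{
    return (double)ops / (double)n;
}
```

For instance, `composite_dft_ops(4, 12, 3, 16)` gives the 96 ops of the 12-DFT and `composite_dft_ops(2, 66, 7, 4)` gives the 160 ops of the 14-DFT; `ops_per_input(160, 14)` and `ops_per_input(150, 13)` then both come out near 11.5, which is the "roughly the same ops-per-input" comparison made in the post.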