mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software > Mlucas

Old 2017-06-23, 21:30   #78
ewmayer

Quote:
Originally Posted by ET_
I'm afraid I forgot the '-' after the E

The code compiles and runs happily
I am running the selftest with 1 thread (./Mlucas -s m > selftest.log) and will share the log as I have it (I'm at 1408K right now). Let me know if you still need my prefs.arm file as well.
You just dropped the patched imul_macro0.h file into the mlucas_v17 src-dir, yes?

Don't really need the log, just post the resulting mlucas.cfg file.

Quote:
The [M|m]lucas.cfg file is not written. The code detects its lack, but when it tries to write (r+) it does not succeed.
I had no such issues on my Odroid. Do you have write permission in the dir in which you are doing the self-tests (if that differs from the one you built in)? What does 'touch mlucas.cfg' in that dir give?

Quote:
To do a selftest with more threads, should I try

Code:
./Mlucas -s m -nthread [2|4]
or

Code:
./Mlucas -s m -cpu 0:3
?
On the ARM there is just one logical core per physical core, so

'./Mlucas -s m -nthread 2' is the same as './Mlucas -s m -cpu 0:1'
and
'./Mlucas -s m -nthread 4' is the same as './Mlucas -s m -cpu 0:3'
Old 2017-06-23, 22:20   #79
ET_

Quote:
Originally Posted by ewmayer
You just dropped the patched imul_macro0.h file into the mlucas_v17 src-dir, yes?
Yes!

Quote:
Originally Posted by ewmayer
Don't really need the log, just post the resulting mlucas.cfg file.
Here they are.
The mlucas.cfg is correctly written after the test completes; I had erroneously thought it was written on the fly.

Luigi

P.S. I guess you may need the 2- and 3-thread files as well... 3 looks like a good choice. If so, I will compute them tomorrow.
Attached Files
File Type: zip mlucas.cfg.1thd.zip (641 Bytes, 85 views)
File Type: zip mlucas.cfg.4thd.zip (649 Bytes, 90 views)

Last fiddled with by ET_ on 2017-06-23 at 22:26
Old 2017-06-24, 06:38   #80
ewmayer

Quote:
Originally Posted by paulunderwood
Prefix each root command with sudo and enter your user password.

My guess is you can do sudo passwd root if you want to set up a root password.

Or run sudo su to run root commands.
Thanks, that works (e.g. 'sudo shutdown -h now' ... have not tried setting up a root pwd yet). Also found that the system needs the boot-MicroSD to remain installed - I had assumed that after the initial boot-up the OS and various boot-loader files would get copied to the onboard memory and the MicroSD would no longer be needed.

Quote:
Originally Posted by ET_
The mlucas.cfg is correctly written after the test is completed, while I erroneously thought it was written on the fly.
Thanks - sure, go ahead and do the 2,3-threaded self-tests when you get the chance. Based on your 1 and 4-thread data, your Odroid must be using an older rev of the ARM core, or maybe slower memory, than mine - here is what I get @1,2,3,4-threads. Notice my 1-thread runtimes are 25-30% faster overall, and the scaling to 4 threads ('||-eff.' is short for 'parallel efficiency') is much better - my 4-thread timings are just over half of yours. When did you purchase your Odroid, and which precise model is it?
Code:
	1-thread:	2-thread:	3-thread:	4-thread:
FFTlen	ms/iter	||-eff%	ms/iter	||-eff%	ms/iter	||-eff%	ms/iter	||-eff%
1024	 223.93	100	 112.33	99.7	  95.18	78.4	 61.24	91.4
1152	 272.53	100	 137.15	99.4	 117.28	77.5	 74.66	91.3
1280	 317.98	100	 160.73	98.9	 138.97	76.3	 88.37	90.0
1408	 351.05	100	 177.97	98.6	 152.51	76.7	 95.73	91.7
1536	 387.47	100	 197.59	98.0	 169.60	76.2	108.45	89.3
1664	 411.91	100	 209.30	98.4	 179.49	76.5	112.95	91.2
1792	 419.99	100	 213.92	98.2	 182.06	76.9	116.43	90.2
1920	 485.82	100	 246.78	98.4	 212.18	76.3	134.59	90.2
2048	 476.06	100	 241.00	98.8	 205.41	77.3	131.84	90.3
2304	 570.37	100	 290.40	98.2	 250.33	75.9	158.11	90.2
2560	 654.57	100	 345.22	94.8	 300.91	72.5	196.02	83.5
2816	 725.62	100	 383.84	94.5	 334.08	72.4	217.17	83.5
3072	 793.89	100	 418.03	95.0	 357.01	74.1	237.74	83.5
3328	 849.32	100	 448.77	94.6	 390.24	72.5	255.72	83.0
3584	 859.99	100	 456.01	94.3	 393.88	72.8	262.40	81.9
3840	 990.53	100	 525.26	94.3	 457.94	72.1	298.67	82.9
4096	 974.11	100	 512.90	95.0	 445.75	72.8	297.59	81.8
4608	1213.42	100	 615.35	98.6	 537.29	75.3	353.29	85.9
5120	1460.96	100	 775.31	94.2	 669.04	72.8	447.42	81.6
5632	1617.00	100	 857.64	94.3	 742.06	72.6	495.67	81.6
6144	1764.45	100	 937.44	94.1	 780.16	75.4	546.03	80.8
6656	1897.60	100	1009.80	94.0	 870.94	72.6	586.85	80.8
7168	1945.88	100	1035.60	93.9	 887.20	73.1	609.23	79.8
7680	2231.54	100	1179.50	94.6	1021.03	72.9	691.53	80.7
Notes:
- 3-thread scaling is by far the worst, unsurprisingly, because Mlucas is optimized for power-of-2 thread counts.
- The || scaling for 1,2,4-threads is quite impressive, especially given that we typically expect one-single-thread-job-per-core mode to run at no better than 80-90% efficiency due to overall system memory contention among the jobs (i.e. 1-worker/4-thread ~= 4-worker/1-thread in total-throughput terms).
- The only significant timing anomaly is for FFT lengths of the form 15*2^n, which are slower than the power-of-2 lengths just above them. The medium self-test (-s m) currently does not include 8192K, but by extension those timings would likely be slightly faster than the 7680K ones which form the bottom row of the above table.
Old 2017-06-24, 09:06   #81
ET_

Quote:
Originally Posted by ewmayer
Based on your 1 and 4-thread data, your Odroid must be using an older rev of the ARM core, or maybe slower memory, than mine - here is what I get @1,2,3,4-threads. Notice my 1-thread runtimes are 25-30% faster overall, and the scaling to 4 threads ('||-eff.' is short for 'parallel efficiency') is much better - my 4-thread timings are just over half of yours. When did you purchase your Odroid, and which precise model is it?
I bought them (5 boards) from PicoCluster (www.picocluster.com) last March, and was running the test from node0, with a tail -f, 3 ssh sessions and the whole GUI running on it. Once I figure out how to access each board via ssh from my computer (the boards are preconfigured with 10.0.x.x IP addresses while I am on a 192.168.x.x network), I suppose the timings should drop. I will run the next 2/3-thread tests on a different node just to see how it works.
Old 2017-06-24, 21:36   #82
ewmayer

Quote:
Originally Posted by ET_
I bought them (5 boards) from PicoCluster (www.picocluster.com) last March, and was running the test from node0, with a tail -f, 3 ssh sessions and the whole GUI running on it. Once I figure out how to access each board via ssh from my computer (the boards are preconfigured with 10.0.x.x IP addresses while I am on a 192.168.x.x network), I suppose the timings should drop. I will run the next 2/3-thread tests on a different node just to see how it works.
Also - I ran my tests with the board open to the room air, and a fan blowing from across the room - no idea if these guys have a preinstalled temperature-monitoring system, but I figured since they are fanless, better safe than sorry.

On to the vector-asm coding effort!
Old 2017-06-24, 21:50   #83
ET_

Quote:
Originally Posted by ewmayer
Also - I ran my tests with the board open to the room air, and a fan blowing from across the room - no idea if these guys have a preinstalled temperature-monitoring system, but I figured since they are fanless, better safe than sorry.
In fact, I suppose that 5 boards and a switch enclosed in a plexiglass cube may become hot and tend to throttle... Oh well, the time to design a cooling system will come.
Old 2017-07-09, 07:17   #84
ewmayer

With some helpful advice from fellow forumite and ARM employee Tom Womack (a.k.a. fivemack), I got my first nontrivial asm macros put together and timing-tested over the last few days.

Copied in the code-box below is the ARMv8 version - at least the first go - of a complex 4-DFT with 3 complex twiddles. I tested this side-by-side with the SSE2 version of the same macro, which has no FMAs, obviously. My initial timings were surprising, and had the ARM code running faster - not only in cycles but also in wall-clock time - on 1 CPU of my 1.5GHz Odroid-C2 than the SSE2 version of the same macro running on 1 CPU of my 2.0GHz Core2Duo macbook. But based on the relative theoretical instruction throughputs of the 2 respective CPUs that's simply wildly implausible. One more odd thing from the same side-by-side testing ... on the Core2/SSE2, once the GCC opt-level hit -O1 the macro timings bottomed out, not surprising since the compiler treats the ASM loop body as a black box and can only optimize the loop logic. However, on the ARM going from -O1 to -O3 gave slightly better than a 2-fold speedup.

Further digging into possible timing-loop overheads quickly revealed the cause of the above oddities - the loop was running the macro in in-place mode, reading 16 doubles (8 complex-double, thus 4 vector-complex-double, hence "4-DFT") from a block of quasirandom (i.e. 'repeatably random') inited local memory, then writing back to the same memory. This requires one to re-init the macro inputs on each loop pass via memcpy. For one reason or another, GCC (v4.2 on my Core2 and 5.something on my Odroid) is doing a really bad job of this init step on the Core2 even at -O3, whereas in going from -O1 to -O3 on the Odroid the compiler is slashing the cost of the init. The obvious answer was to switch to running the macro in out-of-place mode, which allows the inputs to be inited just once and the loop body now consists just of the 4-DFT macro. With that, here are the opcounts and cycle counts for the 2 respective 128-bit SIMD implementations:

x86_64 SSE2: 41 MEM (19 load[1 via mem-op in addpd], 14 store, 8 reg-copy), 22 ADDPD, 16 MULPD: 46 cycles. (Note I can get this down to 36 cycles, but only by using > 8 vector registers, which is inconsistent with being able to do two such 4-DFTs side-by-side, one in vector registers 0-7, the other in 8-15.)

ARM v8 Neon: 11 MEM (7 load-pair, 4 store-pair, i.e. 22 total vector loads/stores via 11 instructions), 16 FADD, 12 FMUL/FMA: 93 cycles. (Here I use 12 vector registers to save some spill/fills and arithmetic, because I have 32 such registers to work with.)

Thus almost exactly double the cycle count on the ARM vs the Core2. Here is the ARM inline-asm macro - note that q- and v- are different name prefixes for the same set of vector registers, the former treating a given register as an integer one, the latter as a floating-point register. The LDP and STP (load-pair and store-pair) instructions can be used to operate on either kind of underlying data but formally require the integer form of the register name. Thus we e.g. 'ldp q4,q5' to load 32 bytes of contiguous data from a memory location, then use the same resulting register data under the names v4 and v5 to do vector floating-point arithmetic on them. The '.2d' v-register suffixes mean 'treat register as a pair of floating doubles':
Code:
__asm__ volatile (\
	"ldr	x0,%[__add0]		\n\t"\
	"ldr	w1,%[__p1]			\n\t"\
	"ldr	w2,%[__p2]			\n\t"\
	"ldr	w3,%[__p3]			\n\t"\
	"ldr	x4,%[__cc0]			\n\t"\
	"ldr	x5,%[__r0]			\n\t"\
	"add	x1, x0,x1,lsl #3	\n\t"\
	"add	x2, x0,x2,lsl #3	\n\t"\
	"add	x3, x0,x3,lsl #3	\n\t"\
	/* SSE2_RADIX_04_DIF_3TWIDDLE(r0,c0): */\
	/* Do	the p0,p2 combo: */\
	"ldp	q4,q5,[x2]			\n\t"\
	"ldp	q8,q9,[x4]			\n\t"/* cc0 */\
	"ldp	q0,q1,[x0]			\n\t"\
	"fmul	v6.2d,v4.2d,v8.2d	\n\t"/* twiddle-mul: */\
	"fmul	v7.2d,v5.2d,v8.2d	\n\t"\
	"fmls	v6.2d,v5.2d,v9.2d	\n\t"\
	"fmla	v7.2d,v4.2d,v9.2d	\n\t"\
	"fsub	v2.2d ,v0.2d,v6.2d	\n\t"/* 2 x 2 complex butterfly: */\
	"fsub	v3.2d ,v1.2d,v7.2d	\n\t"\
	"fadd	v10.2d,v0.2d,v6.2d	\n\t"\
	"fadd	v11.2d,v1.2d,v7.2d	\n\t"\
	/* Do	the p1,3 combo: */\
	"ldp	q8,q9,[x4,#0x40]	\n\t"/* cc0+4 */\
	"ldp	q6,q7,[x3]			\n\t"\
	"fmul	v0.2d,v6.2d,v8.2d	\n\t"/* twiddle-mul: */\
	"fmul	v1.2d,v7.2d,v8.2d	\n\t"\
	"fmls	v0.2d,v7.2d,v9.2d	\n\t"\
	"fmla	v1.2d,v6.2d,v9.2d	\n\t"\
	"ldp	q8,q9,[x4,#0x20]	\n\t"/* cc0+2 */\
	"ldp	q6,q7,[x1]			\n\t"\
	"fmul	v4.2d,v6.2d,v8.2d	\n\t"/* twiddle-mul: */\
	"fmul	v5.2d,v7.2d,v8.2d	\n\t"\
	"fmls	v4.2d,v7.2d,v9.2d	\n\t"\
	"fmla	v5.2d,v6.2d,v9.2d	\n\t"\
	"fadd	v6.2d,v4.2d,v0.2d	\n\t"/* 2 x 2 complex butterfly: */\
	"fadd	v7.2d,v5.2d,v1.2d	\n\t"\
	"fsub	v4.2d,v4.2d,v0.2d	\n\t"\
	"fsub	v5.2d,v5.2d,v1.2d	\n\t"\
	/* Finish radix-4 butterfly and store results: */\
	"fsub	v8.2d,v10.2d,v6.2d	\n\t"\
	"fsub	v9.2d,v11.2d,v7.2d	\n\t"\
	"fsub	v1.2d,v3.2d,v4.2d	\n\t"\
	"fsub	v0.2d,v2.2d,v5.2d	\n\t"\
	"fadd	v6.2d,v6.2d,v10.2d	\n\t"\
	"fadd	v7.2d,v7.2d,v11.2d	\n\t"\
	"fadd	v4.2d,v4.2d,v3.2d	\n\t"\
	"fadd	v5.2d,v5.2d,v2.2d	\n\t"\
	"stp	q6,q7,[x5      ]	\n\t"/* out 0 */\
	"stp	q0,q4,[x5,#0x20]	\n\t"/* out 1 */\
	"stp	q8,q9,[x5,#0x40]	\n\t"/* out 2 */\
	"stp	q5,q1,[x5,#0x60]	\n\t"/* out 3 */\
	:					/* outputs: none */\
	: [__add0] "m" (r0)	/* All inputs from memory addresses here */\
	 ,[__p1] "m" (p1)\
	 ,[__p2] "m" (p2)\
	 ,[__p3] "m" (p3)\
	 ,[__two] "m" (two)\
	 ,[__cc0] "m" (cc0)\
	 ,[__r0] "m" (r0)\
	: "cc","memory","x0","x1","x2","x3","x4","x5","v0","v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11"	/* Clobbered registers */\
);
In using this basic kind of small-DFT macro to build up a larger one (say a radix-16 DFT), I typically implement 2 columns of such code operating side-by-side on independent data, which helps hide latency. Here are the respective 1-column and 2-column cycle counts:

x86_64 SSE2: 1-col = 46 cycles, 2-col = 77 cycles.

ARM Neon: 1-col = 93 cycles, 2-col = 165 cycles.

Thus a decent per-cycle throughput gain on both, but comparatively more for the SSE2 code.

To the ARM experts hereabouts, do those timings seem reasonable?
Old 2017-10-27, 22:15   #85
ewmayer

I am only a few weeks away from releasing a beta version of Mlucas with ARMv8 SIMD-assembly support - many thanks to fellow forumite and ARM engineer Tom Womack (a.k.a. fivemack) for much useful assistance in my early steep-part-of-the-learning-curve coding efforts.

But let me damp down expectations right off the bat: The performance of the SIMD code is less than I'd hoped in my wide-eyed initial guesstimates - it looks like all 4 cores of my Odroid C2 are roughly equivalent to 1 core of my vintage-2009 Core2 Duo macbook running an SSE2 build of Mlucas, and equivalent to perhaps 1/8th of both cores of my ham-sandwich-sized Intel Broadwell NUC running an AVX2 build of the code. I'm not sure how that stacks up on a per-watt basis, probably decently enough, but overall we're talking on the order of half a year or more to do a single exponent at the current GIMPS wavefront, and that number needs to come down in order to spur any appreciable user adoption. So I'm hoping readers/future-users of the ARMv8 code can tell me some good news about better performance for higher-end ARMv8 systems than my humble A53, and e.g. low-cost multi-socket ARMv8 systems which contain multiple copies of such 4-core CPUs. Heck, for this kind of work we'd really like just a simple board with multiple CPUs and wouldn't even need any memory-subsystem support for interprocessor communication, but I'm guessing that's a no-go from a marketing perspective for general-purpose compute hardware.

Some interesting performance trends are already visible in the current powers-of-2-only binary. (Adding non-power-of-2 support is pretty quick at this point since it shares ~90% of the power-of-2 FFT-code infrastructure, merely requiring implementation of odd-radix DFT macros for radices 3,5,7,9,11,13,15, only the two composite ones of which require separate DIF and DIT versions of said macros.) A key performance-related parameter is the leading radix, used for the initial forward-FFT (fFFT) pass and final inverse-FFT (iFFT) pass - call said radix R. Say I'm doing a length-N FFT. Once I do that initial radix-R pass, which accesses stride-N/R-separated sets of data, the subsequent passes of the fFFT, the dyadic-mul step and the iFFT passes all the way up to the final radix-R iFFT one all operate on R disjoint chunks of N/R data each, which naturally are assigned to separate threads in a multithreaded run. The size of these disjoint chunks is thus key to getting good cache performance - we want each chunk to fit into L2, typically with some room to spare. Thus at any given FFT length we typically see a "sweet spot" leading radix R - make R smaller and the resulting larger data chunks start to spill out of L2; make R too large and the overhead of handling many small chunks begins to dominate the runtime.

A typical example on the ARM is provided by various radix combos at 4096K FFT, where the sweet spot is at R = 256, which yields an N/R-chunksize of ~32MB/256 = 128 kB. Let's compare 100 iterations using leading radices 128 and 256 here, for 1 and 4-threads:

Code:
R	1-thr	4-thr
128	70 sec	24 sec
256	64 sec	23 sec

i.e. R = 256 gives a 10% speedup over R = 128 in single-thread mode, but the advantage drops to just 4% when running 4-threaded. (The eagle-eyed follower of this thread may compare these timings to those I gave for my initial scalar-double C-code build in post 80 and note that e.g. the above 64/23 sec for 1/4-thread represents only a ~1.5x speedup over the non-SIMD build - like I said, underwhelming.) Two possibilities immediately come to mind to explain the 4-thread behavior: memory bandwidth (i.e. the RAM can't keep all 4 cores fed) and thermal throttling (the C2 has no cooling fan, just a small heatsink). Is there a way to see if the latter is occurring, and if so, to what degree? I know there are temp-monitoring packages for various linux distros, but those require the OS to play nice with the underlying CPU and motherboard hardware.

In post 50 of this thread VictordeHolland mentions a power draw of 200mW per core for the A53 implementation of ARMv8 (the one in the Odroid C2), but the little passive heatsink on my C2 gets sufficiently hot under load that I'm skeptical of the total die power draw being under 1W ... might the L2 cache's power draw be excluded from the 200mW figure? The cheap little plastic housing I bought as an accessory for my C2 is poorly ventilated and surely doesn't help, but is necessary to keep on at the moment since I'm sneakernetting code updates via thumb drive and need to protect the board during all that plugging-and-unplugging. The whole thermal-throttling notion may prove bogus, but I won't know until I know, right?

Lastly for now, can any of our resident ARM coders tell me whether ARM has some analog of x86's CPUID functionality? It would be nice to be able to support the same kind of "on program startup, check the CPU's SIMD support against that targeted by the build; if the build-target instructions are not supported by the CPU, quit with an error; if the CPU supports SIMD but the build does not target it, print an info-message to that effect" functionality I use for x86 SIMD-capable CPUs/builds.
Old 2017-10-28, 00:09   #86
retina

Quote:
Originally Posted by ewmayer
Lastly for now, can any of our resident ARM coders tell me whether ARM has some analog of x86's CPUID functionality? It would be nice to be able to support the same kind of "on program startup, check the CPU's SIMD support against that targeted by the build; if the build-target instructions are not supported by the CPU, quit with an error; if the CPU supports SIMD but the build does not target it, print an info-message to that effect" functionality I use for x86 SIMD-capable CPUs/builds.
Yes, of course. They are implementation dependent, but the ID registers are usually named ID_AA64* & ID_ISAR* and are accessed with the MRS instruction. You'll have to read your specific CPU manual to see how they are defined, but the generic ARMv8 manual lists the names and bit positions for the generic case.

Last fiddled with by retina on 2017-10-28 at 00:12
Old 2017-10-28, 01:18   #87
GP2
 

There's a persistent rumor that Apple, which already designs its own ARM-based chips for mobile devices, is considering moving the Mac to ARM too.

They have already successfully navigated two architecture switches in past decades, from Motorola 68xxx to PowerPC to Intel, so it's surely feasible.

Intel architecture has stagnated for some time now, and seems to be running up against its inherent limitations. And consumer demand for faster chips on the desktop is modest at best. Meanwhile, all the mobile devices are doing face recognition and augmented reality and whatnot while coping with limited battery life, so there's a never-ending, powerful, industry-wide incentive to keep making ARM run faster and use less power. So I think ARM will at some point become very relevant to our interests, and it's good to get ahead of the curve.

I personally wouldn't care if each individual exponent took half a year, as long as I could run a whole bunch of them in parallel at the lowest possible buck for the bang.
Old 2017-10-28, 08:23   #88
ldesnogu
 

Quote:
Originally Posted by retina
Yes, of course. They are implementation dependent, but the ID registers are usually named ID_AA64* & ID_ISAR* and are accessed with the MRS instruction. You'll have to read your specific CPU manual to see how they are defined, but the generic ARMv8 manual lists the names and bit positions for the generic case.
Alas, most of these registers can't be read from user space.

There are two possibilities:
  1. parse the output of /proc/cpuinfo
  2. play with getauxval and HWCAP (<sys/auxv.h> and <asm/hwcap.h>)
I never tried any of these so this might not be exactly what Ernst needs.

Last fiddled with by ldesnogu on 2017-10-28 at 08:23