mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Software > Mlucas

Old 2022-01-11, 23:49   #12
leonardyan96
 
"Cassessory"
May 2017
Northern China

2·3·7 Posts

Quote:
Originally Posted by ewmayer
Have you ever been able to run using '-cpu 0:7' without the invalid-argument warnings, or is 6 cores the maximum that has (sometimes) worked? It sounds like the OS is doing some kind of dynamic core binding/unbinding. Note that the pthread affinity-setting is formally a *hint* to the OS; whether it gets respected or not depends on the OS and the particular device in question. Apple M1 is another such 4-big|4-little 8-core hybrid, but there cores 0-7 are always available. I suggest you try just using -cpu 0:3 and run self-tests (preferably with the phone mostly idle at the time and in some decent airflow), in hopes the OS will remap the process to mostly use the performance cores.
-cpu 0:7 will also cause this warning.
Old 2022-01-11, 23:54   #13
leonardyan96
 
"Cassessory"
May 2017
Northern China

2·3·7 Posts

On the phone I mentioned before, which uses a MediaTek SoC, all 8 cores are always available. I have never seen such a warning in ordinary pm1 runs.
Old 2022-01-14, 03:53   #14
leonardyan96
 
"Cassessory"
May 2017
Northern China

2·3·7 Posts

From the sched_setaffinity(2) man page:
Quote:
EINVAL

The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the process according to any restrictions that may be imposed by the "cpuset" mechanism described in cpuset(7).
My guess: when I specify '-cpu 0:7', Mlucas creates 8 threads, then uses sched_setaffinity to pin each thread to a single core. When a thread is assigned an unavailable core, out comes the "Invalid argument" error.
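
The guess above can be checked outside Mlucas. A minimal Python sketch, assuming a Linux system where `os.sched_setaffinity` exists; `probe_cores` is a hypothetical helper (not part of Mlucas), and it pins the calling process to one core at a time rather than pinning individual pthreads as Mlucas presumably does:

```python
import errno
import os


def probe_cores(candidates, set_affinity=None):
    """Try to pin the caller to each core in turn.

    Returns (usable, rejected). A core counts as rejected when the kernel
    answers EINVAL, the error quoted from sched_setaffinity(2).
    `set_affinity` is injectable so the logic can be exercised without
    touching the real scheduler; the default is os.sched_setaffinity.
    """
    if set_affinity is None:
        set_affinity = os.sched_setaffinity  # Linux-only
    usable, rejected = [], []
    for core in candidates:
        try:
            set_affinity(0, {core})  # pid 0 means "the caller"
            usable.append(core)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise  # a different problem, e.g. EPERM
            rejected.append(core)
    return usable, rejected


# On a real device one would call probe_cores(range(8)) and afterwards
# restore the original mask saved from os.sched_getaffinity(0).
```

If the rejected list changes from one call to the next, that would point to cores being taken away and given back dynamically rather than to a fixed restriction.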

Since all 8 cores are present in /proc/cpuinfo with the usual numbering 0-7, I guess it's the cpuset mechanism, as configured under /dev/cpuset, that stops me from using some cores. However, I don't have permission to ls /dev; that may require rooting. I wonder if Harmony OS imposes more restrictions than normal Android.

Last fiddled with by leonardyan96 on 2022-01-14 at 03:56
Old 2022-01-14, 09:24   #15
leonardyan96
 
"Cassessory"
May 2017
Northern China

2×3×7 Posts

/dev/cpuset, the PID of Mlucas, and the CPU restriction info in /proc/<mlucas pid>/status can all be obtained via ADB. I'm investigating further, though it doesn't look very hopeful.
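
For reference, the CPU restriction info in /proc/<pid>/status includes a `Cpus_allowed:` line holding a hexadecimal bitmask (bit n set means CPU n is allowed). A small Python sketch for decoding it; `cpus_allowed` is a hypothetical helper and the sample text is made up, not taken from the phone. Whether core_ctl isolation is reflected in this mask at all is not established in this thread.

```python
def cpus_allowed(status_text):
    """Return the CPU numbers encoded in the 'Cpus_allowed:' line of a
    /proc/<pid>/status dump, in ascending order."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed:"):
            # Wide masks are printed as comma-separated 32-bit hex words.
            hex_mask = line.split(":", 1)[1].strip().replace(",", "")
            mask = int(hex_mask, 16)
            return [n for n in range(mask.bit_length()) if (mask >> n) & 1]
    raise ValueError("no Cpus_allowed line found")


# Made-up sample; real text would come from reading the status file via ADB.
sample = "Name:\tMlucas\nCpus_allowed:\tcf\nCpus_allowed_list:\t0-3,6-7\n"
print(cpus_allowed(sample))
```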

I might be learning something about Linux process scheduling...

Last fiddled with by leonardyan96 on 2022-01-14 at 09:26
Old 2022-01-23, 12:22   #16
leonardyan96
 
"Cassessory"
May 2017
Northern China

2×3×7 Posts
EUREKA!

The core control (core_ctl) mechanism, developed by Qualcomm, is the most likely reason I can't use all cores.
I checked /sys/devices/system/cpu/core_ctl_isolated via ADB; it appears to list the numbers of the cores currently unavailable for use. The first time, it read "4,7". When I checked again a moment later, it had changed to "4,6". Later it was empty, and I could specify "-cpu 0:7" without any setaffinity errors. Now it reads "5,6": "-cpu 0:4,7" works well, while "-cpu 0:7" causes errors again.
If the phone can be rooted, I might be able to solve the problem by changing the core_ctl parameters.
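
Turning the contents of core_ctl_isolated into a matching '-cpu' value can be sketched in Python as below. Both function names are hypothetical. The "lo:hi" and comma syntax follows the "-cpu 0:4,7" example above; support for "a-b" ranges in the isolated list is an assumption, since only comma-separated single numbers were observed. Because the isolated set changes from moment to moment, the result can be stale by the time Mlucas starts.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as '4,7' or '0-3,6' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty content means no cores are isolated
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def mlucas_cpu_arg(n_cores, isolated):
    """Format the non-isolated cores as a '-cpu' value, e.g. '0:4,7'."""
    free = [c for c in range(n_cores) if c not in isolated]
    runs, start = [], None
    for i, core in enumerate(free):
        if start is None:
            start = core
        # A run ends at the last free core or where the numbering jumps.
        if i + 1 == len(free) or free[i + 1] != core + 1:
            runs.append(str(start) if start == core else f"{start}:{core}")
            start = None
    return ",".join(runs)


print(mlucas_cpu_arg(8, parse_cpu_list("5,6")))
```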

Last fiddled with by leonardyan96 on 2022-01-23 at 12:23
Old 2022-01-23, 18:41   #17
ewmayer
2ω=0
 
Sep 2002
República de California

11²×97 Posts

Thanks for digging into this - sounds like the OS is dynamically adding and removing cores from the set available for user processes.

When you start a job with '-cpu 0:7' and it starts without the sched_setaffinity errors, if you monitor the program using 'top' for a few minutes, what % CPU usage does it show, and does it vary significantly? What about when the thus-started job does emit the sched_setaffinity error messages - does the resulting CPU usage look different in 'top'?

Again, in either case it's ultimately up to the OS to manage thread affinity - if you end up with similar runtime CPU utilization in either case (with and w/o error messages), perhaps I should just milden the messaging to a warning to the effect of "leaving thread affinity up to OS".
Old 2022-01-24, 12:25   #18
leonardyan96
 
"Cassessory"
May 2017
Northern China

52₈ Posts

You've got it right. core_ctl is a kernel module that dynamically isolates and unisolates cores according to CPU load. Isolated cores can't be used by processes, so they are never woken up, which saves energy. My phone's Snapdragon 835 has 8 cores in 2 clusters: one with 4 small cores, the other with 4 big cores. The small cluster is always fully unisolated, while the big cluster has 2 cores isolated nearly all the time, unless I launch some resource-demanding app, which itself hurts Mlucas' performance. Today, at least, I never managed to run Mlucas with "-cpu 0:7" without setaffinity errors.
core_ctl has an interface called core_ctl_set_boost which unisolates all cores, but it can only be called by other kernel modules, so I can't unisolate all cores manually on a non-rooted device.
I ran two tests today, both with 2 cores isolated. Because of background services from other apps the timings may vary a bit between tests, but not by much.
Code:
./Mlucas -fft 5632 -iters 1000 -radset 0 -cpu 0:7
This run showed some setaffinity errors. Mlucas used ~600% CPU, and the test took about 1 min 40 sec.
Code:
./Mlucas -fft 5632 -iters 1000 -radset 0 -cpu 0:5
This time I avoided the 2 isolated cores and there was no error. However, radix 0/2 is not exactly divisible by NTHREADS=6, so the carry step uses only 4 threads. Mlucas used ~450% CPU, and the test took about 2 min 30 sec.
Apparently, even with some cores isolated, specifying all cores still yields better performance.
My worry is that a thread assigned to an isolated core may have to share a core with another thread, hurting performance. Since this phone is in daily use I don't want to root it for now, so I can't test the case of all 8 cores running with none isolated.

Last fiddled with by leonardyan96 on 2022-01-24 at 12:28
Old 2022-01-24, 16:25   #19
chris2be8
 
Sep 2009

4454₈ Posts

Are there any user controls for core_ctl? Eg can it be told not to isolate any cores if the phone is plugged into a charger? Energy saving isn't needed then.
Old 2022-01-24, 22:34   #20
ewmayer
2ω=0
 
Sep 2002
República de California

11737₁₀ Posts

I did a preliminary online search for 'core_ctl', found very little material. May be wrong, but perhaps this power-management mechanism is intended to be accessible by the OS only. Your timings indicate that even with the OS dynamically isolating a few cores, aside from the init-phase "unable to set thread affinity" error messages, the process is running fine. That makes sense because any such power-management would need to be able to bounce threads between cores without stalling their execution. Note that Mlucas-used threadpool schema is designed for asynchronous execution, i.e. even if some threads briefly stall or run more slowly than others, that simply means that the slower threads end up doing fewer work units during the given iteration; the only constraint being that all threads must complete their final work unit by way of resynchronization, before execution can continue to the next of the 2 main parallel-executed phases of each iteration.

For example, your runs at 5.5M FFT and radix set 0 have leading radix R = 352. For technical FFT-implementation reasons not important here, that results in R/2 = 176 independent work units. For '-cpu 0:7' those get assigned to a pool of 8 threads, each thread completes its current WU then grabs the next-available one, as long as there are WUs left which need to get done. In the idealized case of all threads running in precise lockstep, using 6 threads, you'd have each doing floor(176/6) = 29 WUs, leaving 2 WUs, so those would need 2 threads, leaving 4 idle threads for the final pass, for a performance hit of perhaps 3% over the ideal case. In your case, even with 8 threads, the OS dynamically isolating and freeing-up cores results in more or less the same kind of thing - as long as there's a decent amount of work to divvy up amongst the cores & threads, it's OK.
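
The lockstep arithmetic in this paragraph can be reproduced with a few lines of Python. This is only a sketch of the idealized model described here (every busy thread finishes one work unit per pass), not of Mlucas' actual thread pool; `lockstep_schedule` is a hypothetical name.

```python
def lockstep_schedule(work_units, threads):
    """Idealized lockstep model: each pass, every busy thread does one WU.

    Returns (full_passes, leftover_wus, total_passes, overhead), where
    overhead is the extra time relative to a perfectly even split.
    """
    full, leftover = divmod(work_units, threads)
    passes = full + (1 if leftover else 0)
    overhead = passes * threads / work_units - 1
    return full, leftover, passes, overhead


for n in (6, 8):
    full, leftover, passes, overhead = lockstep_schedule(176, n)
    print(f"{n} threads: {full} full passes, {leftover} WUs left over, "
          f"{passes} passes in total, overhead {overhead:.1%}")
```

For 176 work units on 6 threads the model gives 29 full passes plus a final pass with 2 busy and 4 idle threads, an overhead of roughly 2.3%, in line with the "perhaps 3%" estimate. With 8 threads, 176 divides evenly and the modelled overhead is zero.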

Again for technical reasons the parallel carry-step needs a power-of-2 thread count, so when you run with e.g. '-cpu 0:5', the carry step uses just 4 threads - that's why it's generally preferable to stick to power-of-2 thread counts, and "fill the CPU up" with those. Say on a 6-core CPU you'd have three 2-thread instances or one 4-thread and one 2-thread. Your hardware is a little special in that regard, but it seems best to optimistically assume all 8 cores will be available much of the time.
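
A small Python sketch of the thread-count rule just described. It assumes the carry step simply falls back to the largest power of 2 not exceeding the thread count, which matches the 6-to-4 example here but is otherwise an inference; both function names are hypothetical.

```python
def carry_threads(nthreads):
    """Largest power of 2 that does not exceed nthreads (nthreads >= 1)."""
    return 1 << (nthreads.bit_length() - 1)


def split_into_instances(ncores):
    """Greedy split of ncores into power-of-2 instance sizes, largest first."""
    sizes = []
    while ncores > 0:
        size = carry_threads(ncores)
        sizes.append(size)
        ncores -= size
    return sizes


print(carry_threads(6))         # threads the carry step would use
print(split_into_instances(6))  # one way to fill a 6-core CPU
```

The greedy split yields the "one 4-thread and one 2-thread" layout for 6 cores; three 2-thread instances is the other option mentioned, and which of the two gives better throughput has to be measured.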

Just by way of 1 more throughput reference, what timing for the same 1000-iter self-test @5.5M FFT do you get using '-cpu 0:3'? As long as 8-threaded is running faster than any lower threadcount, just go with that, and I'll milden the scary-sounding affinity-error messages to warnings in the next release.
Old 2022-01-25, 00:18   #21
leonardyan96
 
"Cassessory"
May 2017
Northern China

42₁₀ Posts

Quote:
Originally Posted by chris2be8
Are there any user controls for core_ctl? Eg can it be told not to isolate any cores if the phone is plugged into a charger? Energy saving isn't needed then.
At least the stock ROM of my phone doesn't have this. Smartphone manufacturers in China tend to put some "gaming mode" or "performance mode" in their products, but I'm not sure if they change the behaviour of core_ctl, since mine is from a little-known foreign brand.

Also, whenever I checked the status of core_ctl via ADB the phone was already charging, so core isolation is evidently not disabled while charging.

Last fiddled with by leonardyan96 on 2022-01-25 at 00:31
Old 2022-01-25, 02:00   #22
leonardyan96
 
"Cassessory"
May 2017
Northern China

2×3×7 Posts

With the other parameters unchanged, the same 1000-iteration self-test on the 4 small cores uses ~370% CPU and takes about 3 min 42 sec; on the 4 big cores (with setaffinity errors) it uses ~390% CPU and takes about 1 min 31 sec. Strangely, using all 8 cores is no faster than using only the 4 big cores.

I also tried my previous phone, which has a low-end octa-core MediaTek SoC and is not affected by core isolation. With all 8 cores in use, the test uses ~750% CPU and takes about 3 minutes. Even with some cores isolated, the Snapdragon 835 is still far more powerful.
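
To put the timings reported in this thread side by side, here is a short Python sketch. The figures are the approximate ones quoted in the posts; it assumes every run used the same 1000 iterations, and the labels and helper names are only illustrative.

```python
ITERS = 1000  # assumed identical for every run

# (label, approx. wall-clock seconds, approx. %CPU), as reported in the thread
RUNS = [
    ("Snapdragon 835, -cpu 0:7, 2 cores isolated", 100, 600),
    ("Snapdragon 835, -cpu 0:5", 150, 450),
    ("Snapdragon 835, 4 small cores", 222, 370),
    ("Snapdragon 835, 4 big cores (with errors)", 91, 390),
    ("MediaTek, all 8 cores", 180, 750),
]


def ms_per_iter(seconds, iters=ITERS):
    """Average wall-clock milliseconds per iteration."""
    return seconds * 1000 / iters


def core_seconds(seconds, cpu_percent):
    """CPU time consumed, in core-seconds (100% = one fully busy core)."""
    return seconds * cpu_percent / 100


# Fastest run first.
for label, secs, pct in sorted(RUNS, key=lambda run: run[1]):
    print(f"{label:45s} {ms_per_iter(secs):5.0f} ms/iter "
          f"{core_seconds(secs, pct):6.0f} core-s")
```

Sorting by wall-clock time puts the 4-big-core run first, slightly ahead of the 8-core run, which is the oddity noted above.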

The conclusion is clear: using all cores (with multiple instances when necessary) should always be the best choice. If anyone has a rooted device, unisolating all cores might still help with the performance, but for now I'm not able to verify this.

Last fiddled with by leonardyan96 on 2022-01-25 at 02:03

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
