mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   How to set up for running gpuOwl under Ubuntu (and other Linux) with OpenCL (https://www.mersenneforum.org/showthread.php?t=25601)

ewmayer 2020-06-10 23:33

How to set up for running gpuOwl under Ubuntu (and other Linux) with OpenCL
 
1 Attachment(s)
[b]Moderator Note:[/b] Post #1 of this thread is intended to provide, step by step, everything needed by a user wanting to do what the thread title states, starting with the procedure for creating a Ubuntu boot-image USB. Later comments are tasked with noting specific (and hopefully small) differences needed for e.g. other Linux distros and specific GPU models, and may be folded into the OP as warranted. Post #1 will be continually maintained and updated so as to stay current.

Thanks to Xyzzy for handholding me through the boot-image procedure, M344587487 for the original version of the gpuowl-setup recipe and the Radeon VII settings-tweak shell script, and all the various forumites (preda, paulunderwood, Prime95, etc.) who helped the OP with this stuff when he purchased his first GPU, a Radeon VII, early in 2020. I have only tried the recipe out on one other GPU model, a Radeon 540 in an Intel NUC, where I successfully built gpuowl but was unable to run it due to OpenCL not recognizing that GPU model. So feedback regarding whether it works - or how to make it work - with other GPUs and Linux distros is needed, and welcome.

[b]Creating a Ubuntu boot-image USB:[/b] If you already have such a boot-image USB, you can skip to the next section. Note that in the following, all mount/umount/fdisk commands except those of the purely informational kind must be done as root or using the 'sudo' prefix command.

Technical note: Both cp and dd do a faithful byte-copy of a file, thus e.g. md5/sha1 will agree between original and copy. But dd copies to address-offset 0 on the target filesystem, because that is where bootloaders expect a boot image to start. And dd copies a file as a single contiguous block, whereas cp copies to wherever it finds a good spot, and uses filesystem magic to link noncontiguous fragments into what looks like a single file to the outside world.
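One practical upshot: since dd writes the image byte-for-byte from offset 0, you can verify a finished burn (i.e. after step 3 below) by comparing the head of the device against the ISO. A minimal sketch, assuming GNU coreutils, with the ISO path and 'sdX' as placeholders:
[i]
ISO=/path/to/ubuntu-19.10-desktop-amd64.iso
sudo cmp -n $(stat -c%s "$ISO") "$ISO" /dev/sdX && echo "burn verified"
[/i]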

0. Go to the list of [url=http://releases.ubuntu.com/]currently-supported Ubuntu releases[/url] and download the .iso file of the one you want. In my most-recent case I grabbed the 19.10 "64-bit PC (AMD64) desktop image" .iso file, and my notes will use that as an example;

1. Insert a USB stick into an existing Linux or macOS system. Many Linux distros will auto-mount USB storage media, but for boot-disk creation we must make sure it is *not* mounted. To see the mount point, use the Linux [i]lsblk[/i] command. E.g. on my 2015-vintage Intel NUC the USB was auto-mounted as /dev/sdb1, with mount point /media/ewmayer, 'ls' of which showed a files-containing directory; 'umount /dev/sdb /media/ewmayer' left 'ls -l /media/ewmayer' showing just . and .., with the directory entry gone. You need to be careful to specify both the block device (/dev/sd*) and the specific mount point of the USB, since it is common to have multiple filesystems sharing the same block device. I'll replace my 'sdb' with a generic 'sdX' and let users properly fill in for the 'X'.
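In generic form (device letter and mount point will differ on your system):
[code]lsblk                  # find the USB stick; note its device name, e.g. sdX, and any mount points
sudo umount /dev/sdX1  # unmount any auto-mounted partition(s) on it[/code]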

2. Clear the USB stick - note this is slow, and linear-time in the size of the storage medium, so it pays to use the smallest USB needed to store the ISO file. The trailing bs= option overrides the default blocksize-to-write, 512 bytes, with a much larger 1 MB, which should speed things significantly:
[i]
sudo dd if=/dev/zero of=/dev/sdX bs=1M
[/i]
The completion message looks scary but is simply an expected 'hit end of fs' message (note: if your system hangs for, say, more than a minute after printing the "No space left on device" message, you may need to ctrl-c it). Your numbers will be different, but in my case I saw this:
[i]
failed to open 'dev/sdb': No space left on device
31116289+0 records in
31116289+0 records out
15931539456 bytes (16 GB) copied, 3842.03 s, 4.1 MB/s [using newer 16GB USB, needed just 1566 s, 10.2 MB/s]
[/i]
3. Use dd to copy the .iso file. As dd is a low-level utility, no re-mount of the stick filesystem is needed or wanted; my example again assumes the USB is the block device /dev/sdX, with the user supplying the 'X':
[i]
sudo dd if=[Full path to ISO file, no wildcarding permitted] of=/dev/sdX bs=1M [url=https://www.reddit.com/r/linux/comments/krpjp3/dd_bs_and_why_you_should_use_convfsync/]oflag=sync[/url]
[/i]
On completion, 'sudo fdisk -l /dev/sdX' shows /dev/sdX1 as bootable (the * under 'Boot') and of type 'Empty'. In my case it also showed a nonbootable partition at /dev/sdb2, which we can ignore:
[code]
Device    Boot   Start     End Sectors Size Id Type
/dev/sdb1 *          0 4812191 4812192 2.3G  0 Empty
/dev/sdb2      4073124 4081059    7936 3.9M ef EFI (FAT-12/16/32)
[/code]
Oddly, in the above the start of sdb2 lies inside the sdb1 range, but that appears to be ignorable. I've used the same boot-USB to install Ubuntu on multiple devices, without any problems.

In my resulting files-view window the previous contents of the USB had vanished and been replaced by 'Ubuntu 19.10 amd64', which showed 10 dirs - [boot,casper,dists,EFI,install,isolinux,pics,pool,preseed,ubuntu] - and 2 files, md5sum.txt [34.8 kB] and README.diskdefines [225 bytes].

4. After copying the .iso, the USB may or may not (this is OS-dependent) end up mounted on /dev/sdX1. To be sure, unmount the filesystem with 'sudo umount /dev/sdX1'. (If it was not left so mounted, you'll simply get a "umount: /dev/sdX1: not mounted" error message.) Remove the stick from the system used to burn the .iso, and, after doing any needed file backups of the target system, insert the stick into that system, reboot, and at the appropriate prompt press <f1> to enter the Boot Options menu.

5. Fiddle the boot order in the target-system BIOS to put the USB at #1 (note this may not be needed, so feel free to first try starting from here), then shut down, insert the bootable USB, power up, and use the up/down-arrow keys to scroll through the resulting boot-options menu, which includes items like "try without installing" and "install alongside existing OS installation". I chose "install now". Next it detected an existing Debian install and asked if I wanted to keep it; this was on a mere 30GB SSD, too cramped for 2 installs, so I chose Ubuntu-only. 5 mins later: done, restarting ... "Please remove the installation medium, then press ENTER:".

6. If you fiddled the boot order in the BIOS in the preceding step, the next time you reboot, use the BIOS to move the hard drive back to #1 boot option.

[b]Installing and setting up for gpuowl running:[/b]

o 'sudo passwd root' to set root pwd [I make same as user-pwd on my I-am-sole-user systems]
o sudo apt update
o sudo apt install -y build-essential clinfo git libgmp-dev libncurses5 libnuma-dev python ssh openssh-server
[build-essential is a meta-package that installs gcc/g++/make and a few other packages commonly used in a standard libc toolchain; optional nice-to-haves include the 'multitail' and 'screen' packages]
o Edit /etc/default/grub to add amdgpu.ppfeaturemask=0xffffffff to GRUB_CMDLINE_LINUX_DEFAULT
o sudo update-grub
o wget -qO - [url]https://repo.radeon.com/rocm/rocm.gpg.key[/url] | sudo apt-key add -
o echo 'deb [arch=amd64] [url]http://repo.radeon.com/rocm/apt/debian/[/url] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
o sudo apt update && sudo apt install rocm-dev
o Add yourself to the video group. There are 2 options for doing this:
1. The [URL="https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html"]AMD rocm installation guide[/URL] suggests using the command 'sudo usermod -a -G video $LOGNAME'
2. Should [1] fail for some reason, add yourself manually:
[i]echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules[/i]
o reboot
o git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make
['clone' only on initial setup - subsequent updates can use 'git pull' from within the existing gpuowl-dir:
'cd ~/gpuowl && git pull [url]https://github.com/preda/gpuowl[/url] && make']
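
Once the above completes, it's worth sanity-checking that OpenCL actually sees the GPU before queueing up work (the same commands are used for this later in the thread):
[code]clinfo | grep -i 'number of devices'   # expect a nonzero count
/opt/rocm/bin/rocm-smi                 # should list your card(s), temps and clocks[/code]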

[b]Queueing up work and reporting results:[/b]

o Read through the README.md file for basic background on running the code and various command-line options. To queue up GIMPS work, from within the gpuowl executable directory, run './tools/primenet.py -u [primenet uid] -p [primenet pwd] -w [your preferred worktype] --tasks [number of assignments to fetch] &'. This starts an automated python work-management script which periodically grabs new work and reports any results generated since its last check.

On my R7 I generally choose '-w PRP --tasks 10', since --tasks does not differentiate based on task type: e.g. if my current worktodo has, say, 5 p-1 jobs queued up, the work-fetch will only grab 5 new PRP assignments. I do weekly results-checkins/new-work-fetches, and even running 2 jobs per card as suggested below for the R7, each PRP assignment completes in under 40 hours, thus I want at least 5 PRP assignments queued up at all times. Note that for PRP and LL-test assignments needing some prior p-1 factoring, the program will automatically split the original PRP or LL assignment into 2, inserting a p-1 one ("PFactor=...") before the PRP/LL one. Thus an original worktodo.txt file consisting of 10 new PRP assignments might end up with as many as 20 assignments, consisting of 10 such PFactor/PRP pairs.
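Schematically, the split looks like this (an illustration only - the field layout is abbreviated here, and real worktodo lines carry an assignment id plus further fields):
[code]Before:  PRP=<aid>,...,<exponent>,...
After:   PFactor=<aid>,...,<exponent>,...
         PRP=<aid>,...,<exponent>,...[/code]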

o Once the worktodo.txt file has been created and populated with 1 or more assignments, start the program: 'sudo ./gpuowl' should be all that is needed for most users. That will run 1 instance in the terminal in "live progress display" mode; to run in silent background mode or to manage more than one instance from a terminal session, prepend 'nohup' (this diverts all the ensuing screen output to the nohup.out file) and append ' &' to the program name. To target a specific device on a multi-GPU system, use the '-d [device id]' flag, with numeric device id taken from the output of the /opt/rocm/bin/rocm-smi command.

[b]Use -maxAlloc to avoid out-of-memory with multi-jobs per card:[/b]

If you run multiple gpuowl instances per card as suggested in general for both performance and should-one-job-crash reasons, you need to take care to add '-maxAlloc [(0.9)*(Card memory in MB)/(#instances)]' to your program-invocation command line. That limits the program instances to using at most 90% of the card's HBM in total; without it, if your multiple jobs happen to find themselves in the memory-hungry stage 2 of p-1 factoring at the same time, they will combine to allocate more memory than is on the card (OpenCL provides no reliable "how much HBM remains available" query), causing them to swap out and slow to a crawl.
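As a worked example: a 16 GB Radeon VII running 2 instances gives 0.9 * 16384 MB / 2 ≈ 7373 MB per instance; the R7 recipe below rounds this to '-maxAlloc 7500', whose combined 15000 MB still sits safely under the card's 16384 MB.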

The default amount gpuowl uses per job (around 90% of what is available on the card in question) is well into the "diminishing returns" part of the stage 2 memory-vs-speed tradeoff for typical modern cards with multiple gigabytes of HBM, so limiting the mem-alloc thusly should not incur a noticeable performance penalty, especially compared to the nearly-infinite penalty of the above-described out-of-memory state.

Another good reason to run 2 instances per card - even on cards where this does not give a total-throughput boost - is fault insurance. For example, shortly after midnight last night one of the 2 jobs I had running on the R7 in my Haswell system coredumped with this obscure internal-fault message:
[i]
double free or corruption (!prev)
Aborted (core dumped)
[/i]
No problem - Run #2 continued merrily on its way, the only hit to total-throughput was the single-digit-percentage one resulting from switching from 2-job to 1-job mode on this card. As soon as I saw what had happened on checking the status of my runs this morning, I restarted the aborted job with no problems. Had I been running just 1 job, a whole night's computing would have been lost.

[b]Radeon VII specific:[/b]

o On R7, to maximize throughput you want to run 2 instances per card - in my experience this gives a roughly 6-8% total-throughput boost. I find the easiest way to do this is to create 2 subdirs under the gpuowl-dir, say run0 and run1 for card 0, cd into each and use '../tools/primenet.py [above options] &' to queue up work, and '../gpuowl [-flags] -maxAlloc 7500 &' to start a run.

If managing work remotely I precede each of the executable invocations with 'nohup', and use 'multitail -N 2 ~/gpuowl/run*/*log' to view the latest progress of my various runs.
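Putting the pieces together, a minimal sketch of the full 2-instance setup for card 0 (worktype, task count and maxAlloc are examples; adjust to your own options):
[code]cd ~/gpuowl && mkdir -p run0 run1
for d in run0 run1; do
  cd $d
  nohup ../tools/primenet.py -u [primenet uid] -p [primenet pwd] -w PRP --tasks 10 &
  nohup ../gpuowl -d 0 -maxAlloc 7500 &
  cd ..
done
multitail -N 2 ~/gpuowl/run*/*log[/code]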

o To maximize throughput per watt and keep card temperatures reasonable, you'll want to manually adjust each card's SCLK and MCLK settings - on my single-card system I get best FLOPS/Watt while avoiding huge-slowdown-levels of downclocking via the following bash script, which must be executed as root:
[code]#!/bin/bash
# EWM: This is a basic single-GPU setup script ... customize to suit:

if [ "$EUID" -ne 0 ]; then echo "Radeon VII init script needs to be executed as root" && exit; fi

#Allow manual control
echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
# V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Overclock mclk to 1200
echo "m 1 1200" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Push a dummy sclk change for the undervolt to stick
echo "s 1 1801" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Push everything to the card
echo "c" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Put card into desired performance level
/opt/rocm/bin/rocm-smi --setsclk 3 --setfan 120[/code]
Setting SCLK = 3 rather than 4 saves ~50W with a modest ~6% timing hit; going to SCLK = 2 saves another 50W but incurs a further ~15% timing hit. If you find overclocking MCLK to 1200 is unstable (gives 'EE' error-line outputs and possibly causes the run to halt), try a lower 1150 - I've found that to be the maximum safe setting based on what works on all 4 of my R7s. You'll want to use rocm-smi to monitor the temp of your various cards and adjust the settings as needed.
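For the monitoring, something as simple as this works (interval to taste):
[i]
watch -n 10 /opt/rocm/bin/rocm-smi
[/i]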

o On my 3-R7 system, I use the elaborated setup script copied in the attachment to this post. Note the inline comments re. sclk and fan settings, and the actual job-start commands at end of the file, which put 2 jobs on each card. After a system reboot, I only need to do a single 'sudo bash *sh' to be up and running.

o Mihai Preda comments on running on multiple R7s:
"I find running gpuowl with -uid <16-hex-char id> much more useful than running with -d <position> .
This way the identity of the card is preserved even when swapping it around the PCIe slots.
And the script tools/device.py can be used to convert the UID to -d "position" for rocm-smi ."
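A minimal sketch of that pattern (the hex id here is a placeholder - gpuowl prints each card's unique id near the top of its log):
[i]
./gpuowl -uid 0123456789abcdef -maxAlloc 7500 &
[/i]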

[b]Troubleshooting:[/b]

o If, after successfully building gpuowl, 'clinfo' does not find the GPU, [url=https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#ubuntu]try[/url] [i]sudo apt install rocm-dkms[/i], and assuming that package-install succeeds, reboot.

o If you installed an OpenCL-supporting GPU on a system which previously had an nVidia one, you may need to remove the nVidia drivers first; one hedged approach is sketched below. Such a previous-card install may also have left one or more /sys/class/drm/card* entries, which if they exist mean that the card-settings-init script above needs to have its 'card0' entries fiddled to replace 0 with the most-recently-added (= largest) index in the list of /sys/class/drm/card* entries.
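The exact removal steps depend on how the nVidia drivers were installed; for a stock packaged install on Ubuntu, something along these lines is the usual approach (a hedged sketch, untested here):
[i]
sudo apt purge 'nvidia-*' && sudo apt autoremove && sudo reboot
[/i]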

o If you get a files-owned-by-root error on an attempted work-fetch using primenet.py, do 'sudo chown -R [uid]:[gid] *' and manually append the downloaded (but not written to the worktodo.txt file) assignments to worktodo.txt.

[b]Advanced Usage:[/b]

o To clean-kill all gpuowl instances running on a system (say, prior to a system shutdown): Mihai explains that gpuowl expects SIGINT for clean shutdown. Simply doing e.g. "shutdown -h now" sends a SIGTERM followed - after an unspecified (and unspecifiable) delay - by a SIGKILL, and can lead to a corrupted gpuowl savefile. The proper sequence is to precede the system shutdown with:
[i]
sudo kill -INT $(pgrep gpuowl)
[/i]
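In script form, a safe shutdown might look like this (the sleep duration is a guess at how long savefile writes take; adjust to taste):
[code]#!/bin/bash
# Cleanly stop all gpuowl instances, give them time to write savefiles, then power off:
sudo kill -INT $(pgrep gpuowl)
sleep 30
sudo shutdown -h now[/code]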
o If using 'screen' and working remotely, once gpuowl is up and running, detach screen (ctrl-a --> d) prior to logout.

o For subsequent ROCm-updates, use the following sequence: [i]sudo apt autoremove rocm-dev[/i]
[George adds: "I don't think these 2 are required, but I don't see how they'd hurt:
wget -qO - [url]https://repo.radeon.com/rocm/rocm.gpg.key[/url] | sudo apt-key add -
echo 'deb [arch=amd64] [url]http://repo.radeon.com/rocm/apt/debian/[/url] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
]
[i]sudo apt update
sudo apt install libncurses5 clinfo rocm-dev
[reboot][/i]

Xyzzy 2020-06-11 00:21

[QUOTE=ewmayer;547643][I]sudo dd if=/dev/zero of=/dev/sdX[/I][/QUOTE]We only do this when we get a new USB stick. We usually use [C]badblocks -s -t random -v -w /dev/sdX[/C] because it verifies that each memory location is valid. Most of the time you can get away with not checking the stick but we feel it is cheap insurance.[QUOTE=ewmayer;547643][I]sudo dd if=[Full path to ISO file, no wildcarding permitted] of=/dev/sdX[/I][/QUOTE]You can greatly speed up this process by adding [C]bs=1M[/C] to the end. Or 8M. Or something like that. By default it does the work in 512 byte chunks so there is a lot of overhead.

Edit1: Here is a link for people who want to create the boot USB stick on a Windows computer.

[URL]https://ubuntu.com/tutorials/tutorial-create-a-usb-stick-on-windows#1-overview[/URL]

Edit2: [C]lsblk[/C] is a much easier way to see attached block devices. It also shows you if and where the device is mounted.[CODE]$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 953.9G  0 disk
└─nvme0n1p1 259:1    0 953.9G  0 part /[/CODE]<<< INSERT USB STICK >>>[CODE]$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    1  28.9G  0 disk
└─sda1        8:1    1  28.9G  0 part /run/media/m/stuff
nvme0n1     259:0    0 953.9G  0 disk
└─nvme0n1p1 259:1    0 953.9G  0 part /

$ umount /dev/sda1

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    1  28.9G  0 disk
└─sda1        8:1    1  28.9G  0 part
nvme0n1     259:0    0 953.9G  0 disk
└─nvme0n1p1 259:1    0 953.9G  0 part /[/CODE]

Prime95 2020-08-11 03:45

[URL="https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html"]AMD rocm installation guide[/URL] suggests this command

sudo usermod -a -G video $LOGNAME

to add yourself to the video group

DrobinsonPE 2020-10-20 03:50

I have a computer, Ryzen 3200G, that I got the windows version of GPUOWL v6.11-364 running on. See here: [url]https://www.mersenneforum.org/showpost.php?p=557321&postcount=2471[/url]. I removed the windows hard drive and reinstalled the linux hard drive, upgraded to Linux Mint 20 and followed the instructions above for

[B]Installing and setting up for gpuowl running:[/B]

Everything went well until the last line
git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make

it spit out a lot of text and ended with

[CODE]g++ -o gpuowl Pm1Plan.o util.o B1Accumulator.o Memlock.o log.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o sha3.o md5.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm-3.3.0/opencl/lib/x86_64 -L/opt/rocm-3.1.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L.
/usr/bin/ld: cannot find -lOpenCL
collect2: error: ld returned 1 exit status
make: *** [Makefile:19: gpuowl] Error 1
drobinson@3200G:~/gpuowl$ [/CODE]

From what I can tell, the make did not happen. Can someone please give me some advice on what to do next? This is the same thing that happened the first time I tried about 6 months ago and the reason why I installed Windows on the computer in an attempt to see if it was a hardware incompatibility issue.

I prefer Linux over Windows so I am trying to get both GPUOWL and mfakto to work on Linux. I have a feeling that it is something I am doing wrong because every time I try to learn a new program I always find all of the PEBCAK issues before I get the program working.

phillipsjk 2020-10-20 05:56

[QUOTE=DrobinsonPE;560380]
git clone [URL]https://github.com/preda/gpuowl[/URL] && cd gpuowl && make

it spit out a lot of text and ended with

[CODE]g++ -o gpuowl Pm1Plan.o util.o B1Accumulator.o Memlock.o log.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o sha3.o md5.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm-3.3.0/opencl/lib/x86_64 -L/opt/rocm-3.1.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L.
/usr/bin/ld: cannot find -lOpenCL
collect2: error: ld returned 1 exit status
make: *** [Makefile:19: gpuowl] Error 1
drobinson@3200G:~/gpuowl$ [/CODE]From what I can tell, the make did not happen. Can someone please give me some advice on what to do next? [/QUOTE]


If you are going to be compiling instead of installing binaries, you often need to install development packages, which essentially just include the needed headers. Often named libraryname-dev.


So if you have the opencl package installed, you may need the opencl-dev package as well. OK I checked, [URL="http://packages.linuxmint.com/search.php?release=any&section=any&keyword=opencl"]http://packages.linuxmint.com/search.php?release=any&section=any&keyword=opencl[/URL] does not have opencl packages for AMD cards.


What the instructions tell you to do is install the "rocm-dev" packages from a foreign repository. You should check that the version of Mint you are using is compatible with the version of Ubuntu expected by the repo.


[code]
# echo 'deb [arch=amd64] [URL]http://repo.radeon.com/rocm/apt/debian/[/URL] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
# sudo apt update && sudo apt install rocm-dev[/code]I believe the first line simply creates a file in /etc/apt/sources.list.d/rocm.list
with the contents 'deb [arch=amd64] [URL]http://repo.radeon.com/rocm/apt/debian/[/URL] xenial main'


You may want to examine your sources.list.d directory to see if the format is consistent. The 'xenial' codename may be incorrect, causing the entry to be ignored ('main' should then possibly pull it in; this may not work if you are on 'testing').


The second line tells 'apt' to update the package lists, then install rocm-dev. Did that complete properly?


Grr: just checked the apt(8) manpage on my devuan installation; looks like it got GNU-fied :P


[quote=apt] Much like apt itself, its manpage is intended as an end user interface
and as such only mentions the most used commands and options partly to
not duplicate information in multiple places and partly to avoid
overwhelming readers with a cornucopia of options and details.
[/quote]


Apparently checking the installation status of a package is not a common command :P


My pet peeve with the GNU project is removing all of the man pages to encourage the use of the info command instead. The info command is complicated enough that I have to look up how to use it every time ("info info"), then I forget what I was looking up in the first place! (BSD has good man pages, but possibly less bleeding-edge hardware support.)


Edit: if the linking step is failing, maybe you need to install 'rocm' as well:


[code]
# sudo apt-get install rocm
[/code]

M344587487 2020-10-20 07:03

[QUOTE=DrobinsonPE;560380]...
From what I can tell, the make did not happen. Can someone please give me some advice on what to do next?
...[/QUOTE]

The latest ROCm changed the OpenCL library path, until the gpuowl repo accounts for the change you'll need to edit the LIBPATH line in the Makefile to include -L/opt/rocm/opencl/lib

kruoli 2020-10-20 07:53

[QUOTE=phillipsjk;560386]The info command is complicated enough that I have to look up how to use it every time "info info", then I forget what I was looking up in the first place![/QUOTE]

Argh! "info info" is broken on my distribution. Bad sign. Apparently, that's known since 18.04 but still not fixed for me?

Anyways, I'm drifting off.

DrobinsonPE 2020-10-21 02:01

[QUOTE=M344587487;560389]The latest ROCm changed the OpenCL library path, until the gpuowl repo accounts for the change you'll need to edit the LIBPATH line in the Makefile to include -L/opt/rocm/opencl/lib[/QUOTE]

Thank you! Editing the makefile worked. I now have a compiled gpuowl v7.0-54-g8aadeed-dirty and I can get an output from gpuowl -h. Unfortunately, I now need to figure out why it is not finding my igpu.

[CODE]drobinson@3200G:~/gpuowl$ /home/drobinson/gpuowl/gpuowl
2020-10-20 18:38:21 gpuowl v7.0-54-g8aadeed-dirty
2020-10-20 18:38:21 Note: not found 'config.txt'
2020-10-20 18:38:21 device 0, unique id ''
2020-10-20 18:38:21 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-10-20 18:38:21 Bye
[/CODE]

clinfo also does not seem to find the igpu.

[CODE]drobinson@3200G:~$ clinfo
Number of platforms                      1
  Platform Name                          AMD Accelerated Parallel Processing
  Platform Vendor                        Advanced Micro Devices, Inc.
  Platform Version                       OpenCL 2.0 AMD-APP (3186.0)
  Platform Profile                       FULL_PROFILE
  Platform Extensions                    cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix    AMD

  Platform Name                          AMD Accelerated Parallel Processing
Number of devices                        0
[/CODE]

I am making progress but still have a few puzzles to figure out. Also, I was looking for GPUOWL v6.11 because it looks like v7.0 is still experimental so I need to figure out how to clone an earlier version.

The edit to the makefile needs to be added to the first post so that the next person to follow the instructions knows what to do.

M344587487 2020-10-21 08:32

[QUOTE=DrobinsonPE;560472]...I am making progress but still have a few puzzles to figure out. Also, I was looking for GPUOWL v6.11 because it looks like v7.0 is still experimental so I need to figure out how to clone an earlier version.[/QUOTE]
"git switch v6" to switch to the v6 branch, "git switch master" to go back to the currently v7 branch. Until the Makefile edit is in the repo you'll need to do "git checkout -f" to drop the manual edit when pulling updates to master then re-edit the Makefile.

[QUOTE=DrobinsonPE;560472]...
The edit to the makefile needs to be added to the first post so that the next person to follow the instructions knows what to do.[/QUOTE]I'll see if I can dust off a github account and make a pull request so it's no longer an issue.


APU support is lacking with ROCm, but you can get it to work with the OpenCL part we need. To do that, starting from the upstream install you should have from following the first post, try:
[code]sudo apt autoremove rocm-dev
sudo apt install rocm-dkms
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
sudo reboot
[/code]This converts your ROCm install from using upstream to using AMD's latest release; the difference is that upstream relies on drivers that have made it into the kernel, and I'm assuming Mint 20 is using kernel 5.4, which is probably too early for decent APU support as it's the most recent development. The usermod lines may not be necessary, but they also won't hurt.


As a future note: if AMD ever releases a new GPU worth a damn for compute, we're going to need to use AMD's drivers for a while for ease. It's the bridge between a new card needing recent developments and ROCm only playing nice with older LTS kernels. All that means, roughly speaking, is installing rocm-dkms instead of rocm-dev, with potentially a few twiddly bits like the usermod lines. The OP's install guide is based on the following link with upstream substituted for AMD's drivers; for an AMD driver install just follow this standard guide: [URL]https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#ubuntu[/URL]

preda 2020-10-21 10:18

[QUOTE=DrobinsonPE;560472]
The edit to the makefile needs to be added to the first post so that the next person to follow the instructions knows what to do.[/QUOTE]

No big deal, that libpath was already added to Makefile. (in master only)

DrobinsonPE 2020-11-07 18:48

After months of trying, I finally got GPUOWL working on Linux. The instructions above did not work despite multiple tries on multiple systems; this is possibly because I am learning as I go and kept making mistakes in the installation. I did learn a lot about how to install GPUOWL, including how to use git and which programs needed to be installed on the computer in order to install ROCm and make GPUOWL from the instructions above.

Here is what worked:

Install ROCm following the instructions here: [url]https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html[/url]
This included using the following commands:
o sudo apt update
o sudo apt dist-upgrade
o sudo apt install libnuma-dev
o sudo reboot
o wget -q -O - [url]https://repo.radeon.com/rocm/rocm.gpg.key[/url] | sudo apt-key add -
o echo 'deb [arch=amd64] [url]https://repo.radeon.com/rocm/apt/debian/[/url] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
o sudo apt update
o sudo apt install rocm-dkms && sudo reboot
o sudo usermod -a -G video $LOGNAME
o sudo usermod -a -G render $LOGNAME
o sudo reboot

Check installation worked with the following:
o /opt/rocm/bin/rocminfo
o /opt/rocm/opencl/bin/clinfo

Install GPUOWL following the instructions here: [url]https://github.com/preda/gpuowl[/url]
This included using the following commands:
o sudo apt install git
o sudo apt install libgmp-dev
o sudo apt install gcc
o git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make

Make a worktodo.txt (in linux do not forget to add the .txt to the end of the file name) and add an assignment.

I still have not figured out what to put in a config.txt yet. - future homework.
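(For what it's worth, my understanding - hedged, check the gpuowl README - is that config.txt simply holds default command-line flags, applied at startup as if typed on the command line, e.g. a single line such as '-maxAlloc 7500'.)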

Computer: ASRock Deskmini A300W, AMD A8-9600, 16GB DDR-4, SSD
Programs: Linux Mint 20, ROCm v3.9, GPUOWL v7.2-16-g1a50f11

This is what it has displayed so far. It has not displayed any progress yet, so I am not sure it is actually working. I will post in the "gpuOwL: an OpenCL program for Mersenne primality testing" thread if it starts showing progress.

[CODE]drobinson@A8-9600:~/gpuowl$ /home/drobinson/gpuowl/gpuowl
2020-11-07 09:19:41 GpuOwl VERSION v7.2-16-g1a50f11
2020-11-07 09:19:41 GpuOwl VERSION v7.2-16-g1a50f11
2020-11-07 09:19:41 Note: not found 'config.txt'
2020-11-07 09:19:41 device 0, unique id ''
2020-11-07 09:19:41 gfx801-0 100406741 FFT: 5.50M 1K:11:256 (17.41 bpw)
2020-11-07 09:19:41 gfx801-0 100406741 OpenCL args "-DEXP=100406741u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0.5051841309934193 -DIWEIGHT_STEP_MINUS_1=-0.33562945595234156 -DIWEIGHTS={0,-0.33562945595234156,-0.11722356040363666,-0.41350933655290917,-0.22070575769356826,-0.48225986026566819,-0.31205740337878252,-0.0859024056184058,-0.39270048390804446,-0.19305618018821558,-0.46389029541574911,-0.2876490077922636,-0.053469967508113697,-0.37115332735591766,-0.16442558794578249,-0.4448689732712372,} -cl-std=CL2.0 -cl-finite-math-only "
[/CODE]

M344587487 2020-11-07 22:00

You may have more luck with the 3200G if you still have it, the A8-9600 is pre-Ryzen but more importantly pre-Vega so I'd be surprised if you can get iGPU compute working. Even the Vega iGPU's are not particularly well supported so going before them is a longshot.

DrobinsonPE 2020-11-07 23:05

I finally realized that GPUOWL was not compiling and that was why it was not running. Apparently OpenCL does not quite work.

I went back and ran /opt/rocm/opencl/bin/clinfo
It finds the integrated graphic but stops running in the SVM capabilities.
[CODE]  Number of devices:                    1
  Device Type:                          CL_DEVICE_TYPE_GPU
  Vendor ID:                            1002h
  Board name:                           Wani [Radeon R5/R6/R7 Graphics]
  Device Topology:                      PCI[ B#0, D#1, F#0 ]
  ..............
  SVM capabilities:
    Coarse grain buffer:                Yes
    Fine grain buffer:                  Yes
    Fine grain system:                  No
    Atomics:                            No
  Preferred platform atomic alignment:  0
  Preferred global atomic alignment:    0
  Preferred local atomic alignment:     0
^C[/CODE]
I still have the 3200G but I moved it to a different motherboard and it is currently running Windows 10. I am testing it with Prime 95, mfakto, and GPUOWL on Windows 10 right now but will switch the SSD to Linux soon and follow the same installation process that almost worked on the A8-9600.

MOD -- please feel free to delete my previous post because GPUOWL still did not work. That is unless you want to continue memorializing my failures.

Xyzzy 2020-11-08 00:01

[QUOTE=DrobinsonPE;562571]That is unless you want to continue memorializing my failures.[/QUOTE]Your post might help someone in the future, even if all it shows is what you tried to do. Sometimes any hit for information via a search engine is appreciated!

:tu:

DrobinsonPE 2020-11-08 05:02

[QUOTE=M344587487;562565]You may have more luck with the 3200G if you still have it, the A8-9600 is pre-Ryzen but more importantly pre-Vega so I'd be surprised if you can get iGPU compute working. Even the Vega iGPU's are not particularly well supported so going before them is a longshot.[/QUOTE]

I thought I would give it a try because ROCm said it might work. The A8-9600 is a “Bristol Ridge” APU.

[url]https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support[/url]

The integrated GPUs in AMD APUs are not officially supported targets for ROCm. As described below, "Carrizo", "Bristol Ridge", and "Raven Ridge" APUs are enabled in our upstream drivers and the ROCm OpenCL runtime. However, they are not enabled in the HIP runtime, and may not work due to motherboard or OEM hardware limitations. As such, they are not yet officially supported targets for ROCm.

paulunderwood 2020-11-13 09:50

Ubuntu vs Debian
 
With the latest ROCm, Debian could not cope with the need for python3.8. After many hours of fiddling I ended up using Ubuntu instead. With the help of Ernst's guide -- see OP (needs to add user to render group) -- I got a clean setup running the latest gpuOwl.

I lost 2 days GPU computing to this upgrade.

PhilF 2020-11-13 14:01

[QUOTE=paulunderwood;563085]With the latest ROCm, Debian could not cope with the need for python3.8. After many hours of fiddling I ended up using Ubuntu instead. With the help of Ernst's guide -- see OP (needs to add user to render group) -- I got a clean setup running the latest gpuOwl.

I lost 2 days GPU computing to this upgrade.[/QUOTE]

I've been fighting that too. It appears that AMD is no longer supporting Debian -- on purpose.

Anyway, which version of Ubuntu did you use?

paulunderwood 2020-11-13 17:22

I opted for 20.04 LTS.

M344587487 2020-11-13 18:55

I've had trouble with 20.10 on the latest ROCm, the iGPU crashes under compute load and amusingly hangs the entire system which I haven't seen in years.

paulunderwood 2020-12-08 06:11

Debian/Ubuntu/Centos/Fedora
 
Debian has a problem with ROCm's latest drivers -- they require an uninstallable version of python.

Ubuntu was a disaster for me. The OS slowed down after a few days, then a ROCm upgrade screwed the system.

Centos -- libraries too old

Fedora 33 -- bang on OS.

ZFR 2021-02-12 20:53

Thanks for the thread.

[QUOTE=ewmayer;547643]

[B]Use -maxAlloc to avoid out-of-memory with multi-jobs per card:[/B]

If you run multiple gpuowl instances per card as suggested in general for both performance and should-one-job-crash reasons, you need to take care to add '-maxAlloc [(0.9)*(Card memory in MB)/(#instances)]' to your program-invocation command line.[/QUOTE]

Just a question. Where exactly is the multiple instance suggestion taken from? I couldn't find anything about it in the readme. Are 2 instances enough, or should I run more?

Each instance needs its own folder, right? And -pool can be used so they share worktodo?

M344587487 2021-02-12 21:33

Consumer gear is limited to two instances per card, at least AMD cards are, trying to run a third will compile the kernel but never run it. Depending on the card model there may be a small throughput benefit to running two instances of gpuowl, test your card to find out. AFAIK you're correct about how to do it.

ewmayer 2021-02-12 21:55

I run 2 instances per card on each of my Radeon VIIs for 2 reasons:

1. Gives a total-throughput boost in the 7-10% range;

2. If one job hangs or crashes - infrequent, but it does happen - one minimizes the total throughput hit.

Even if one has a GPU model where 2-instances is slightly slower in total-throughput terms - say no more than 5% - [2] makes it worth doing, IMO.

On the R7 I found negative benefit from > 2 instances.

ZFR 2021-02-13 07:35

[QUOTE=ewmayer;571467]...[/QUOTE]

OK, gotcha. Guess I'll test it out and if the throughput is same or slightly lower, I'll go with 2.

ZFR 2021-02-19 15:02

[QUOTE=ewmayer;571467]
Even if one has a GPU model where 2-instances is slightly slower in total-throughput terms - say no more than 5% - [2] makes it worth doing, IMO.

[/QUOTE]

On a GeForce GTX 970 I'm getting worse results: 12.3 ms/iter when running one instance, 28.6 ms/iter average when running 2 (i.e. an effective 14.3 ms/iter combined). That's roughly 14-16% worse throughput.

ZFR 2021-02-20 09:16

Actually, never mind. That was when I tested by putting the assigned exponent into worktodo, which does a P-1 on it simultaneously?

When I simply test with the -prp <exp> arguments, the throughput on running two instances is only 2% worse. I'll just use 2 instances for the in-case-it-hangs-or-crashes reason.

jas 2021-05-25 20:53

Success report
 
Hi! I did a clean install of Ubuntu 20.04.2 LTS on a Gigabyte Aorus X570 I Pro Wifi with a AMD Ryzen 5950x CPU and a Radeon VII card. The instructions worked fine, except for diff below, and it is now churning its first PRP's (both gpuowl and mprime).

[QUOTE=ewmayer;547643]
o sudo apt install -y gcc gdb python libgmp-dev git ssh openssh-server clinfo libncurses5 libnuma-dev
[/QUOTE]

Please add 'make' here -- it is not installed by default, nor pulled in by anything else on the list.

[QUOTE=ewmayer;547643]
o Add yourself to the video group. There are 2 options for doing this:
1. The [URL="https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html"]AMD rocm installation guide[/URL] suggests using the command 'sudo usermod -a -G video $LOGNAME'
[/QUOTE]

Option 1 worked fine, after logout/login (or reboot).

However, after following all those instructions above, and building gpuowl fine, 'clinfo' would not find the GPU. I read some instructions here:

[URL]https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#ubuntu[/URL]

And indeed after doing the following, everything worked:

[QUOTE]
sudo apt install rocm-dkms
reboot
[/QUOTE]


I'm using -maxAlloc 16GB since I'm only using one runner; is this pushing anything, or is this fine?



Thanks,
Simon

ewmayer 2021-05-25 21:46

@Simon - done, thanks - I put the clinfo-issue fix in the Troubleshooting section of the OP.

16GB should be OK with just 1 instance running, but suggest watching progress through at least one p-1 stage 2 ('P2' in the progress-line just right of the exponent): If that takes more than a few hours, maybe ratchet things back to 14-15GB. (I always run 2 instances per R7, but perhaps other 1-instance runners can comment on whether maxAlloc of 16GB works for their p-1 stage 2s).

M344587487 2021-05-26 09:25

[QUOTE=jas;579086]...
Please add 'make' here -- as it is not installed by default or anything else.
[/QUOTE]
Might be best to replace gcc with build-essential instead, it's a meta package that installs gcc/g++/make and probably a few other parts commonly used in a standard libc toolchain.


BTW make not installed by default is criminal, should be punishable by a week of having to exclusively use Hannah Montana Linux :P

ewmayer 2021-05-26 21:11

[QUOTE=M344587487;579113]Might be best to replace gcc with build-essential instead, it's a meta package that installs gcc/g++/make and probably a few other parts commonly used in a standard libc toolchain.[/QUOTE]

Good suggestion - done.

LaurV 2021-05-27 03:57

[offtopic]

Or, not really off topic, more like a praise to your tutorial, but from a different angle...

I have a couple of Radeon VII cards that I wanted to use for mining, but they got a ridiculously low hash rate under win7, i.e. half of what I read on the dedicated forums they should achieve. Struggled with windoze drivers for a while, went as far back as installing drivers from 2015, and wasted a week on it (in the evenings, after working time: come home, eat, take a shower, 4-5 hours in front of the computer every evening!, even one or two sleepless nights), but to no result. With this or that (old) driver I got more or less hash rate, but no cigar. As I didn't like the "dedicated for mining" operating systems (like HiveOS or so, you never know where your profit goes...) I decided to run Ubuntu from a stick.

I disconnected all my disks (afraid of bugs in the mining software, I won't install that on a computer where I do the banking and internet shopping stuff), ran the 'buntu from a stick, got some courage after dealing with colab :razz: and putting my nose into other people's scripts (Teal, Dan, Chris, DanaJ, etc, thank you all!), then I put Ubuntu on a 64G external SSD (thing costs $8 on Lazada and it is bloody fast!), then wasted a few more days (!) - no joke - dealing with installing AMD drivers and trying to convince OpenCL to run on my mini-buntu.

That is f'king laborious stuff - did you see the tutorials on the web? (rhetorical question, no need to answer). Grrr... I learned more Linux with this task than I learned in the last 35 years. I mean, we had a 2- or 3-semester Linux course in the last years of uni, but that was in the very incipient phase of it; we (collective we) were thinking more about how to convince the female colleagues to go on dates with us than about writing bash files. That was 35 years ago, and I never used it seriously ever since, except for small stuff now and then. In conclusion, I suck at Linux.

However.

I was almost ready to give up on making OpenCL run on it; I followed different tutorials, etc., and even got to the stage where running clinfo from one directory identified all the cards properly, but running the same clinfo from another folder showed no graphics card installed. Which pissed me off terribly.

But then I said, ok, if I am at this stage already, let's see if I can get gpuOwl running, to see at least if the Linux version is faster than the Windows version. Hint: it is not. They are the same. More or less. But to get to that stage, I had to follow the tutorial from this thread. Which is mostly similar to other tutorials I already followed, from other forums or youtube. After some more adventures owed to my stupidity, it got me running OpenCL and gpuOwl eventually.

After seeing that the Linux version is not faster at doing PRP and P-1, I said, ok, back to Windows, and I was ready to unplug the SSD and the stick, and put back the windoze HDDs. But, you know the drill, if we are here, let's see if the miner runs, before demolishing the house....

[B]IT DID![/B] :davar55:

I don't know what went wrong, and then, what went right, but we are mining ETH with ~90 Mhashes per second per card [U]after we followed your tutorial on installing OpenCL[/U]. :w00t:. No "speed patch" applied (there is a collection of them on the web, but they don't "smell good"; we still need the cards in the future!, so we don't want the magic smoke out yet, therefore no undervolting and no overclocking for now! - albeit people say such unorthodox tricks can bring you about 20-25% more hash rate).

Edit: the million-dollar question is whether such unorthodox tricks can bring you 25% more GHzdays when crunching primes too... We may study that later... :huh_don_t_we_have_a_smiley_turning_pages_?:

[/offtopic]

kriesel 2022-02-02 03:21

[QUOTE=ewmayer;571467]I run 2 instances per card on each of my Radeon VIIs for 2 reasons:

1. Gives a total-throughput boost in the 7-10% range;

2. If one job hangs or crashes - infrequent, but it does happen - one minimizes the total throughput hit.

Even if one has a GPU model where 2-instances is slightly slower in total-throughput terms - say no more than 5% - [2] makes it worth doing, IMO.

On the R7 I found negative benefit from > 2 instances.[/QUOTE]I think the benefit will depend on the GPU model and the work. IIRC for disparate work between the two instances, performance gain may be less, or there may be a loss. (Running very different fft lengths may result in less throughput than a single instance of either length.)

Re the sometimes-two-instances performance penalty: in that case, why not use a shell script so the two instances ALTERNATE when one crashes or runs out of queued work. I suggest a short delay between runs and perhaps a maximum loop count; an A-B loop without either will otherwise inflate the logs with lots of garbage when both instances are out of work, or a driver or a lib or a symlink has gone bonkers, or the GPU has got into a must-crash-app state. (Cue the anything-Windows-can-do-Linux-can-do-better chorus...;)
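A hedged sketch of such a wrapper (directories, delay and loop cap are placeholders):
[code]#!/bin/bash
# Alternate two gpuowl instances; the delay and loop cap keep the logs from
# filling with garbage when both are out of work or the GPU is wedged:
MAX_LOOPS=50
for ((i = 0; i < MAX_LOOPS; i++)); do
  (cd ~/gpuowl/run0 && ../gpuowl)   # runs until exit, crash, or out-of-work
  sleep 60
  (cd ~/gpuowl/run1 && ../gpuowl)
  sleep 60
done[/code]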

masser 2022-02-20 15:05

Should the original post be updated from

[QUOTE]o wget -qO - [url]http://repo.radeon.com/rocm/[/url][B]apt/debian/[/B]rocm.gpg.key | sudo apt-key add -[/QUOTE]

to


[QUOTE]o wget -qO - http[B]s[/B]://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
[/QUOTE]
This was the first stumbling block as I try to get an AMD device running on Ubuntu.

ewmayer 2022-02-25 20:31

[QUOTE=masser;600382]Should the original post be updated from

[QUOTE]o wget -qO - [url]http://repo.radeon.com/rocm/[/url][B]apt/debian/[/B]rocm.gpg.key | sudo apt-key add -[/QUOTE]

to

[QUOTE]o wget -qO - http[B]s[/B]://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
[/QUOTE]
This was the first stumbling block as I try to get an AMD device running on Ubuntu.[/QUOTE]

AMD must've changed the repo structure sometime in the past year - I have edited the OP to update the link to your version, thanks.

kriesel 2022-04-11 23:59

Re [URL]https://mersenneforum.org/showpost.php?p=547643&postcount=1[/URL]
reading it repeatedly in preparation for attempting installation of a GPU and gpuowl on an existing CentOS installation that's already running Mlucas, I wonder about the following:
At what point in the sequence should the GPU be physically installed? (I'm guessing before ROCm installation. Is that right?)
Other than apt vs yum, what differences are there between Ubuntu installation process and CentOS 8 Stream for example?
Some of the steps are quite descriptive, while others are a bit mysterious as to purpose for a Linux noob. Could some more description be added of what the intent is for the individual command lines as stated? (Especially some of those beginning with 'o'.)
How does one verify the GPU is functional as far as being seen by openCL?
Seems like at least running ./gpuowl -h would be in order before queuing up work etc.
Which CentOS versions have been found to work with ROCM & RadeonVII, or not work?
[URL]https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html[/URL] indicates 7.9 or 8.3; note 8.3 is essentially no longer installable because of repository issues.

A summary of some gleanings from the thread:
to pick gpuowl branches,
"git switch v6" to switch to the v6 branch, "git switch master" to go back to the currently v7 branch [URL]https://mersenneforum.org/showpost.php?p=560503&postcount=9[/URL]

Distros that have worked:
Ubuntu 19.10, 20.04
Mint 20?
Fedora 33?

Distros that have led to problems:
Mint (maybe just a path problem)
Debian (lacking ROCm support; requires an uninstallable version of python)
CentOS 8.x that are not stream (repository issues preventing required updates)
Ubuntu 20.10 with iGPU & latest ROCm as of 2020-11-13

GPU models successful:
GTX970
Radeon VII

GPU models unsuccessful:
AMD APU on A8-9600
Radeon 540 (OpenCL not recognizing that GPU model)

My first intended target system is an i5-7600T CPU, which includes an Intel HD630, running CentOS 8 stream OS. Intel Linux driver support listed is for Ubuntu, RHEL 8.3 or 8.4, and some Suse versions; no CentOS listed, much less the stream version.
clinfo returns no platforms. The HD630 is not a requirement, but it would have been useful to learn upon.
The must-have is Radeon VII support.

kriesel 2022-04-12 00:49

Given the following on a box build, same motherboard and cpu as the preceding post:
(same hardware as for attachment 4 of [URL]https://www.mersenneforum.org/showpost.php?p=590115&postcount=17[/URL])
BIOS set for onboard video display regardless
CentOS 7.9 OS new install on blank HD
Mlucas installed and working
Increased from 32GiB system ram to 64
Shut down the system sprawled across a table, install in a case, restart & confirm it's functional
Physically installed RX 5700 XT GPU for compute (probably mostly gpuowl)
amdgpu 21.40.1 installed per AMD's directions (for eventual A:B:C test, Win:amdgpu:rocm on same GPU & system)
gpuowl working directory created, worktodo.txt, config.txt
gpuowl-v6.11-380 zip file for linux from a forum post extracted there
A test run of gpuowl complains about libOpenCL.so.1 absent & terminates. Won't even run gpuowl.exe -h.
After some fumbling around including numerous web searches and dead ends, yum install opencl-headers
Then it's on to issues with libstdc libgxx etc.
I think I'll need to install git, select the right git branch for gpuowl, git clone, recompile etc.
Also clinfo showed no platforms. No idea for fixes for that.
Did not get opencl to work on CentOS 7.9 before moving on.

System has subsequently been taken to CentOS 8 Stream by partition removal and fresh install, so will need to redo all gpuowl and opencl and gpu driver related.

I've also been stumped by how to defeat the nouveau drivers and install real NVIDIA-provided drivers for NVIDIA GPUs on Ubuntu (attempted repeatedly on Ubuntu and a Core 2 Duo system long ago).

chris2be8 2022-04-12 16:00

[QUOTE=kriesel;603786]
I've also been stumped by how to defeat nouveaux drivers and install real NVIDIA-provided drivers for NVIDIA GPUs on Ubuntu (attempted repeatedly on Ubuntu and a Core 2 Duo system long ago).[/QUOTE]

Two issues I've run into are:
1: Make sure you have updated initrd.img by running update-initramfs (this should happen automatically when you install the Nvidia drivers). The system looks there for the blacklist of drivers not to install. Without doing that you can update the blacklist, reboot and wonder why it made no difference (as I did for some time).

2: The Nvidia drivers don't work with the preempt kernel. I hit this on OpenSUSE and had to reboot with the default kernel to make them work. I don't know whether this affects Ubuntu, but check what kernel you are using.

Running [c]lspci -v -s 07:00[/c] should tell you what driver is in use (replace 07:00 with the address of your GPU). If it says nouveau is in use you need to change something.

M344587487 2022-04-12 21:19

[quote]Could some more description be added of what the intent is for individual command lines as stated? (Especially some of those beginning o)[/quote]o Edit /etc/default/grub to add amdgpu.ppfeaturemask=0xffffffff to GRUB_CMDLINE_LINUX_DEFAULT
Exposes file-based clocking/voltage controls for AMD cards as used in the script (things like /sys/class/drm/card0/device/pp_od_clk_voltage are virtual files that don't exist without the kernel being booted with this option enabled). grub is a bootloader and what you're doing here is adding a boot option to grub's config that will load the kernel with this option on every boot.

o sudo update-grub
Generates the actual config file used by grub from the file you just edited. If you bork grub you bork the system; not editing the config directly guards against formatting errors at least.

o wget -qO - [URL]https://repo.radeon.com/rocm/rocm.gpg.key[/URL] | sudo apt-key add -
Add AMD key to your keychain, allowing you to add the repo below as a trusted source

o echo 'deb [arch=amd64] [URL]http://repo.radeon.com/rocm/apt/debian/[/URL] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
Add AMD's deb repo to the list of repositories searched by your machine


o sudo apt update && sudo apt install rocm-dev
Update repos and install rocm-dev, which will be found in the repo just added


o Add yourself to the video group. There are 2 options for doing this:
On Ubuntu you need to be in the video group to access the GPU. Not sure if/how this translates to other OS's

[quote]How does one verify the GPU is functional as far as being seen by openCL?[/quote]Start a PRP to be sure, just running the help might seem to work but in practice there could be an issue.

[quote]Which CentOS versions have been found to work with ROCM & RadeonVII, or not work?[/quote]Personally (many may disagree) every version of CentOS is unsuitable, or at least hard mode: roughly speaking they're intentionally old toolchains which you're trying to use modern features with by using gpuowl. CentOS 8 Stream may be your best bet if you're sticking with CentOS as relative to non-stream it's a much newer toolchain, albeit still years behind. At the very least you'll probably need to install AMD's drivers as the supplied kernel may be too old to have working mainline drivers. You may have trouble compiling because of the old toolchain. You may have trouble running someone else's binary because they probably linked against a newer libc. You may even run into a situation with RadeonVII/RDNA2/newer-nvidia-cards where the firmware for the card isn't present by default.

[quote]A test run of gpuowl complains about libOpenCL.so.1 absent & terminates. Won't even run gpuowl.exe -h[/quote]OpenCL isn't installed so the help fails as even then it tries to populate a list with GPUs. The guide uses ROCm to install opencl as it contains an implementation (possibly AMD-specific?). If you can't get ROCm to work then you'll need to find other means of getting a working OpenCL runtime for your hardware. I am not familiar with CentOS (other than using the ancient version to compile P95) so this is very basic advice: running "yum search opencl" will search the default repo's for OpenCL and you might get lucky. For example I could do this with apt on Ubuntu to find the package intel-opencl-icd, which is the package on Ubuntu for the OpenCL compute runtime for intel iGPU's. gpuowl doesn't need to be compiled for each OpenCL runtime, you just need a working gpuowl binary for your environment and a suitable OpenCL runtime for the hardware you want to run gpuowl on.

[quote]Then it's on to issues with libstdc libgxx etc.[/quote]That rings a bell. There was another thread where someone else was torturing themselves trying to get gpuowl running on CentOS. They couldn't compile their own gpuowl because the toolchain was too old to support gpuowl's relatively modern C++ dialect (C++17?), but coming from the other direction they couldn't run a binary I compiled because I was linking against a much newer libc (as an aside this is one reason P95 is compiled on an ancient CentOS). Libc compatibility hell is just the sort of thing that everyone especially noobs should avoid IMO. If a modern-enough g++ isn't in the repo's you'll likely need to compile it yourself, even then it's unclear to me if you'll have trouble linking to an older libc.

kriesel 2022-04-13 03:08

[QUOTE=M344587487;603853]Personally (many may disagree) every version of CentOS is unsuitable, or at least hard mode: roughly speaking they're intentionally old toolchains which you're trying to use modern features with by using gpuowl. CentOS 8 Stream may be your best bet if you're sticking with CentOS as relative to non-stream it's a much newer toolchain, albeit still years behind. At the very least you'll probably need to install AMD's drivers as the supplied kernel may be too old to have working mainline drivers. You may have trouble compiling because of the old toolchain. You may have trouble running someone else's binary because they probably linked against a newer libc. You may even run into a situation with RadeonVII/RDNA2/newer-nvidia-cards where the firmware for the card isn't present by default.

OpenCL isn't installed so the help fails as even then it tries to populate a list with GPUs. The guide uses ROCm to install opencl as it contains an implementation (possibly AMD-specific?). If you can't get ROCm to work then you'll need to find other means of getting a working OpenCL runtime for your hardware. I am not familiar with CentOS (other than using the ancient version to compile P95) so this is very basic advice: running "yum search opencl" will search the default repo's for OpenCL and you might get lucky. For example I could do this with apt on Ubuntu to find the package intel-opencl-icd, which is the package on Ubuntu for the OpenCL compute runtime for intel iGPU's. gpuowl doesn't need to be compiled for each OpenCL runtime, you just need a working gpuowl binary for your environment and a suitable OpenCL runtime for the hardware you want to run gpuowl on.

That rings a bell. There was another thread where someone else was torturing themselves trying to get gpuowl running on CentOS. They couldn't compile their own gpuowl because the toolchain was too old to support gpuowl's relatively modern C++ dialect (C++17?), but coming from the other direction they couldn't run a binary I compiled because I was linking against a much newer libc (as an aside this is one reason P95 is compiled on an ancient CentOS). Libc compatibility hell is just the sort of thing that everyone especially noobs should avoid IMO. If a modern-enough g++ isn't in the repo's you'll likely need to compile it yourself, even then it's unclear to me if you'll have trouble linking to an older libc.[/QUOTE]
Thanks for the extensive response and the time spent to offer help.

Here's where I'm coming from on this inquiry. Linux is strongly advocated by some GIMPS participants for various reasons. I have decades of heavy Windows experience, and very little Linux over the same period (Slackware and early RedHat, lately mostly Ubuntu, and, in prep for some systems, a little CentOS).
Ubuntu seems a popular distro, and is also available atop WSL. I have both a native Ubuntu install on a garden-variety system and Windows with Ubuntu atop WSL on the same system, for head-to-head comparisons of the three environments for things like Mlucas and prime95. I depend heavily on Windows remote desktop or VNC for remote management of numerous systems, generally performed from a single laptop.

Gpuowl was developed on Linux and ROCm. Some say gpuowl is much more efficient on AMD GPUs with the ROCm driver, which is only supported on Linux. Those saying so include Mihai Preda, and as he is the author of gpuowl, I trust his opinion; George Woltman too.

Mlucas no longer compiles in the msys2/mingw environment on Windows, so recent and future versions of Mlucas require some Linux environment. (Since V18, due to use of Linux signals if not more, if my notes are accurate.)

Two drawbacks, close to deal-breakers for some installs here, of Ubuntu, are:
1) VNC remote desktop to Ubuntu 18 or later does not, by design, allow connecting to existing user sessions, which is my intended use case; so VNC on Ubuntu, though the install eventually succeeded, is useless to me;

2) Ubuntu was attempted and abandoned by Ernst Mayer on his Xeon Phi 7250 (after running into the "purple screen of death"). He settled on CentOS 8.2 for that system. CentOS is supported on that hardware; Ubuntu is not. I have a Xeon Phi 7250 and a 7210. I have Windows 10 on them, but they behave strangely in prime95 timings over time, switching abruptly and unpredictably to a much slower iteration timing and staying slow until restarted; and otherwise, such as after adding DIMM RAM, Windows does not make effective NUMA control available in that scenario, but per Ernst, CentOS does. Also, WSL with Ubuntu atop it is a VM capable of accessing at most 64 virtual CPUs, due to Windows' processor-group construct; since Xeon Phis are 4-way hyperthreaded (256 or 272 logical cores), that may mean only 16 real cores out of the 64 or 68 present. Efficiency can be very low, so operation is quite slow. Sometimes Windows seems to get into a mode where it spends much more CPU time on some sort of overhead than on the user's computing tasks, perhaps shuffling work among the very many logical cores.
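For reference, a minimal sketch of the NUMA control Linux exposes on such hardware (node numbers are illustrative; on a flat-memory-mode Xeon Phi the MCDRAM typically shows up as a separate, CPU-less node, and the Mlucas invocation here is only an example):
[code]# Show the NUMA topology the kernel sees
numactl --hardware

# Illustrative: run a job with its memory bound to node 1 (e.g. MCDRAM on a flat-mode Phi)
numactl --membind=1 ./Mlucas -s m[/code]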

So I tried to match the version Ernst has been running, because it was proven to support that hardware well.
But the 7210 has a GPU installed now, and the 7250 might in the future.

My Xeon Phi 7250 and 7210 are very cantankerous at startup, such that it can take many attempts to get through the BIOS tests and on to OS startup before one boot succeeds. That process can take an hour or more with frequent tending; minimum BIOS time is ~4 minutes. So rather than attempt to learn about Linux distros on it from the start, I used some practice machines.

I ended up at CentOS 8 Stream after trying V7.9, 8.2, 8.3, and 8.4, IIRC. I currently have one system running CentOS 8 Stream only.
Another old dual-Xeon system is hard-drive-swap switchable between Windows10 and CentOS 8 stream.
These were practice for attempting dual-boot install of CentOS on the Xeon Phi 7250.

Both these practice machines are candidates for GPU installations in Linux. One is now GPU-equipped and running Windows. The other I'd like to move a Radeon VII into, and leave it on CentOS 8 Stream with Mlucas already running there. If I could identify some Linux distro compatible with the hardware, with AMD GPUs, and with remote access to already-ongoing sessions, it would open up another type of hardware here that is Linux-only. Some of my other hardware would also be a candidate for Linux if I could get the remote admin worked out.

Today I attempted a TigerVNC installation on the box currently running CentOS 8 Stream. It appears to install successfully following [URL="https://www.tecmint.com/install-tightvnc-access-remote-desktop-in-linux/"]online directions[/URL], but did not create the /etc/tigervnc directory or put any configuration files there.
After consulting a few more CentOS-oriented install guides, I resorted to manually creating the folder and config files, but it still fails to start successfully.

[CODE][root@raven etc]# ls -l . | grep tig
drwxr-xr-x. 2 root root 29 Apr 12 16:08 tigervnc
[root@raven etc]# ls -l tig*
-rw-r--r--. 1 root root 133 Apr 12 16:08 vncserver.users
$ systemctl start vncserver@:1 --now
$ systemctl enable vncserver@:1 --now
$ systemctl status vncserver@:1
● vncserver@:1.service - Remote desktop service (VNC)
Loaded: loaded (/usr/lib/systemd/system/vncserver@.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2022-04-12 21:55:00 CDT; 57s ago
Process: 12253 ExecStart=/usr/bin/vncserver_wrapper <USER> %i (code=exited, status=2)
Process: 12245 ExecStartPre=/bin/sh -c /usr/bin/vncserver -kill %i > /dev/null 2>&1 || : (code=exited, status=0/SUCCESS)
Main PID: 12253 (code=exited, status=2)

Apr 12 21:55:00 raven systemd[1]: Starting Remote desktop service (VNC)...
Apr 12 21:55:00 raven systemd[1]: Started Remote desktop service (VNC).
Apr 12 21:55:00 raven vncserver_wrapper[12253]: runuser: user <USER> does not exist
Apr 12 21:55:00 raven systemd[1]: vncserver@:1.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 12 21:55:00 raven vncserver_wrapper[12253]: FATAL: 'runuser -l <USER>' failed!
Apr 12 21:55:00 raven systemd[1]: Unit vncserver@:1.service entered failed state.
Apr 12 21:55:00 raven systemd[1]: vncserver@:1.service failed.[/CODE]This seems an order of magnitude slower than a Windows VNC installation, and it's still not working on the VNC server end. Any Linux TigerVNC server guidance please, anyone qualified?

chalsall 2022-04-13 03:33

[QUOTE=kriesel;603873]Any Linux TigerVNC server guidance please, anyone qualified?[/QUOTE]

Ken... To put it explicitly on the table, you and I have had our differences over the years, on many different subject domains.

But if you actually want to learn how to control your Linux machines, I'm more than willing to help.

Sincerely.

M344587487 2022-04-16 11:38

To run a Radeon VII or other AMD hardware you'll need the AMD OpenCL driver.
CentOS 8 Stream is not a supported target; I'm assuming this workaround is necessary, but regardless it's sufficient: [URL]https://github.com/RadeonOpenCompute/ROCm/issues/1632#issuecomment-1041392453[/URL]
[code]dnf install -y https://repo.radeon.com/amdgpu-install/21.50/rhel/8.5/amdgpu-install-21.50.50000-1.el8.noarch.rpm[/code]Fix the repo file, since CentOS 8 Stream is not officially supported:
[code]sed -e's/$amdgpudistro/8.5/g' -i /etc/yum.repos.d/amdgpu*.repo[/code]Install OpenCL runtime:
[code]amdgpu-install --usecase=opencl[/code]At this point you could try the gpuowl binary you downloaded, but you mentioned libc errors earlier which we haven't fixed, so it probably won't work (the problem is likely that you're running an older libc than whoever compiled gpuowl). Better to compile your own; luckily the CentOS 8 Stream default toolchain uses GCC 8 (the earliest version that supports C++17; CentOS 8 Stream is just new enough not to be a complete PITA), so there shouldn't be any painful workarounds. The following applies regardless of AMD/Intel/Nvidia hardware, as long as the appropriate OpenCL driver is installed.
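Before building, it may be worth a quick sanity check that the runtime now sees a device (clinfo is an assumption here; it's a common diagnostic utility, though on CentOS it may come from EPEL rather than the base repos):
[code]dnf install -y clinfo
clinfo | grep -iE "platform name|device name"[/code]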

Install gcc:
[code]dnf group install "Development Tools"[/code]Install gmp:
[code]dnf install gmp-devel[/code]Clone gpuowl repo:
[code]git clone https://github.com/preda/gpuowl[/code]Switch to v6:
[code]cd gpuowl && git switch v6[/code]Edit the LIBPATH line in the Makefile, changing -L/opt/rocm/opencl/lib/x86_64 to -L/opt/rocm/opencl/lib ; newer ROCm releases changed the layout, and v6 hasn't added the new location to its search path.
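That edit can be scripted if you prefer; a one-liner sketch of exactly the change described above:
[code]sed -i 's#-L/opt/rocm/opencl/lib/x86_64#-L/opt/rocm/opencl/lib#' Makefile[/code]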
Build:
[code]make[/code]Note that a lot of the install commands need to be executed either with sudo from a user with sudo rights, or as root, depending on how you've set up accounts. Tested in a VM: everything appears to work up to the point of gpuowl -h, but I have no GPU attached to confirm further.

[QUOTE=kriesel;603873]Any Linux TigerVNC server guidance please, anyone qualified?[/QUOTE]It looks like you've copied a script and were meant to replace <USER> with your username; I have no VNC experience, but that should at least get you to the next problem (a sketch of the likely fix follows the list below). Is VNC strictly necessary, or are you just going with what you know? As you're just using CLI programs on the Linux machines, ssh seems a better choice. There's not much to a basic ssh workflow, really:
[LIST][*]ssh to remote in or run one-off commands[*]Run screen or tmux after remoting in, so that any programs you start in the session don't depend on you staying connected. To get back to the session you would, for example, remote in again and run 'tmux attach'[*]scp to copy files from the CLI[*]sshfs to mount a remote directory on the local system, so you can copy/paste/edit work/result files as if they were local. This could be the main way you admin the device once a session has started, assuming you're mostly just feeding gpuowl more work[/LIST]
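A minimal sketch of that workflow, with the hostname raven and username kriesel as stand-in placeholders:
[code]# Start (or reattach to) a persistent session on the remote box
ssh kriesel@raven
tmux new -As gimps      # -A attaches if the session already exists

# Later, from the laptop: pull results and push more work
scp kriesel@raven:gpuowl/results.txt .
scp worktodo.txt kriesel@raven:gpuowl/

# Or mount the remote work directory locally via sshfs
mkdir -p ~/raven-gpuowl && sshfs kriesel@raven:gpuowl ~/raven-gpuowl[/code]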
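And as for the VNC failure quoted above, a hedged sketch of the likely fix, assuming the EL-style unit shown in your status output and the username kriesel (paths and names are assumptions, adjust to taste):
[code]# The status output shows the <USER> placeholder was never replaced; copy the unit
# out of /usr/lib and point it at a real account:
cp /usr/lib/systemd/system/vncserver@.service /etc/systemd/system/vncserver@.service
sed -i 's/<USER>/kriesel/g' /etc/systemd/system/vncserver@.service
systemctl daemon-reload

# The user also needs a VNC password set before the server will start
su - kriesel -c vncpasswd

systemctl restart vncserver@:1 && systemctl status vncserver@:1[/code]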

