Intro
I run a three-node cluster, and recently the PSU failed in one of the nodes, leaving only two of the three nodes working.
Sure enough, the cluster dropped and syncing stopped, and some VMs also stopped because the cluster was no longer quorate.
Temporary solution
I solved this temporarily by running the command below, which sets the expected number of votes to two.
root@cyndane5:~# pvecm expected 2
I then waited a few minutes and ran the command below to check the cluster status.
root@cyndane5:~# pvecm status
Cluster information
-------------------
Name: skynet
Config Version: 14
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Mar 22 10:10:55 2024
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000003
Ring ID: 1.8439
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.0.5
0x00000003 1 192.168.0.4 (local)
root@cyndane5:~#
In the Proxmox web-GUI the quorum status was now green.
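The same check can be scripted instead of read by eye. The sketch below uses a hypothetical helper, `is_quorate` (my own name, not part of Proxmox), that reads `pvecm status` output on standard input and succeeds only when the `Quorate:` line says `Yes`. It assumes the output looks like the listing above.

```shell
# is_quorate: hypothetical helper, not part of Proxmox.
# Reads `pvecm status` output on stdin; exit status 0 means "Quorate: Yes".
is_quorate() {
    awk -F': *' '
        /^Quorate:/ { quorate = ($2 ~ /^Yes/) }
        END         { exit (quorate ? 0 : 1) }
    '
}

# Usage (on a cluster node):
#   pvecm status | is_quorate && echo "cluster is quorate"
```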
Picking up the pieces - restoring to normal
Once the third node is fixed and back online, I'll run the command below to set the expected votes back to three, and expect everything to return to normal.
root@cyndane5:~# pvecm expected 3
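For reference, corosync's votequorum needs a strict majority of the expected votes, which matches the `Quorum: 2` line in the status output for two expected votes. A minimal sketch of that arithmetic, using a hypothetical `quorum_for` helper that is only there for illustration:

```shell
# quorum_for: hypothetical helper showing the majority rule,
# quorum = floor(expected_votes / 2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 2   # prints 2
quorum_for 3   # prints 2
```

With two expected votes, both remaining nodes have to stay up for the cluster to remain quorate.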
Future enhancements
A way to avoid this loss of quorum would be to use a QDevice. Once the failed node has been fixed, I'll look into that.
Sources
https://forum.proxmox.com/threads/another-cluster-not-ready-no-quorum-500-case.56104/
https://pve.proxmox.com/pve-docs/pvecm.1.html
https://forum.proxmox.com/threads/2-node-ha-with-external-qdevice.135429/
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
- Written by: Sorin Srbu
Option 1
Download an ISO file and its accompanying checksum file, then run the following commands in a terminal.
$ cd ~/Downloads
$ wget https://releases.ubuntu.com/jammy/ubuntu-22.04.3-live-server-amd64.iso
$ wget https://releases.ubuntu.com/jammy/SHA256SUMS
$ mv SHA256SUMS ubuntu-22.04.3-live-server-amd64.iso.sha256
$ ll
-rw-rw-r-- 1 sorin sorin 2133391360 Oct 26 09:58 ubuntu-22.04.3-live-server-amd64.iso
-rw-rw-r-- 1 sorin sorin 103 Oct 26 10:31 ubuntu-22.04.3-live-server-amd64.iso.sha256
## Delete the line with the Desktop checksum. Save and exit. Otherwise you get an error message saying the Desktop ISO is missing.
$ nano ubuntu-22.04.3-live-server-amd64.iso.sha256
$ sha256sum -c ubuntu-22.04.3-live-server-amd64.iso.sha256
ubuntu-22.04.3-live-server-amd64.iso: OK
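As an alternative to deleting the unwanted lines by hand, GNU `sha256sum` has an `--ignore-missing` option that skips entries whose files are not present. A sketch, assuming GNU coreutils 8.25 or later; `verify_in_dir` is a hypothetical helper name:

```shell
# verify_in_dir DIR SUMSFILE: hypothetical helper.
# Checks every file in DIR that has an entry in SUMSFILE and skips
# entries for files that were never downloaded.
verify_in_dir() {
    ( cd "$1" && sha256sum --ignore-missing -c "$2" )
}

# Usage, with the unedited checksum file:
#   verify_in_dir ~/Downloads SHA256SUMS
```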
Option 2
Download and edit the files the same way as in Option 1.
This option is a bit more manual, but can give a little more information.
First, a checksum is generated for the tar.gz file and saved to the file filesum.
Then a diff is run on the checksum generated from the tar.gz file and the one in the sha512 file. It only reports whether the checksums differ, and confirms when they match.
Finally, the file filesum is deleted.
$ sha512sum rkhunter-1.4.6.tar.gz > filesum && diff -qs filesum rkhunter-1.4.6.tar.gz.sha512 && rm ./filesum
Files filesum and rkhunter-1.4.6.tar.gz.sha512 are identical
This option can sometimes report that the checksums differ even though they do not.
If the checksums match on manual inspection or via Option 1, check whether the sha512 file has one or two spaces between the checksum hash and the name of the file to verify.
There should be two spaces.
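The false alarm comes from `diff` comparing whole lines, including the spacing and the file name, rather than only the hash. A minimal sketch that compares only the hash fields, so the spacing no longer matters; `hash_matches` is a hypothetical helper name, and it assumes the checksum file holds the wanted hash as the first field of its first line:

```shell
# hash_matches FILE SUMFILE: hypothetical helper.
# Compares the SHA-512 hash of FILE with the first field of the first
# line of SUMFILE, ignoring spacing and the file name.
hash_matches() {
    actual=$(sha512sum "$1" | awk '{ print $1 }')
    expected=$(awk '{ print $1; exit }' "$2")
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage:
#   hash_matches rkhunter-1.4.6.tar.gz rkhunter-1.4.6.tar.gz.sha512 && echo OK
```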
Sources
https://www.a2hosting.com/kb/developer-corner/linux/working-with-file-checksums/
https://www.tecmint.com/generate-verify-check-files-md5-checksum-linux/
https://itsfoss.com/checksum-tools-guide-linux/
- Written by: Sorin Srbu
I seem to never be able to remember the proper permissions for the .ssh-folder, so here's a note about it.
May it be forever remembered!
File/folder | Numeric notation (octal) | Symbolic notation |
.ssh | 700 | drwx------ |
public key, id_rsa.pub | 644 | -rw-r--r-- |
private key, id_rsa | 600 | -rw------- |
authorized_keys | 644 | -rw-r--r-- |
config | 600 | -rw------- |
Example
$ chmod 600 .ssh/id_rsa
$ ll id_rsa
-rw------- 1 sorin sorin 2602 Oct 15 2021 id_rsa
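To apply the whole table in one go, something like the sketch below could be used. `fix_ssh_perms` is a hypothetical helper name; the modes and file names are taken straight from the table, and files that do not exist are skipped.

```shell
# fix_ssh_perms DIR: hypothetical helper.
# Sets the directory to 700 and each known file to the mode from the table.
fix_ssh_perms() {
    dir=$1
    chmod 700 "$dir" || return 1
    # "mode:name" pairs, taken from the table above.
    for entry in 644:id_rsa.pub 600:id_rsa 644:authorized_keys 600:config; do
        mode=${entry%%:*}
        name=${entry#*:}
        if [ -e "$dir/$name" ]; then
            chmod "$mode" "$dir/$name" || return 1
        fi
    done
}

# Usage:
#   fix_ssh_perms "$HOME/.ssh"
```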
Gotchas
Make sure the user, not root, owns the .ssh folder and its files!
$ chown -Rv sorin:sorin /home/sorin/.ssh
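To see whether anything needs fixing before changing ownership, `find` can list the files a given user does not own. A small sketch; `not_owned_by` is a hypothetical helper name:

```shell
# not_owned_by USER DIR: hypothetical helper.
# Prints every file under DIR (DIR included) that USER does not own;
# no output means the ownership is fine.
not_owned_by() {
    find "$2" ! -user "$1" -print
}

# Usage:
#   not_owned_by sorin /home/sorin/.ssh
```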
Misc
Also see the articles "Ssh to remote computer asks for password despite certificates in place" and "Passwordless ssh".
Sources
https://superuser.com/questions/215504/permissions-on-private-key-in-ssh-folder
https://www.tecmint.com/set-ssh-directory-permissions-in-linux/
https://help.ubuntu.com/community/SSH/OpenSSH/Keys
https://wintelguy.com/permissions-calc.pl
- Written by: Sorin Srbu
Assumptions
Once a failed hard disk has been confirmed, follow the guide below to replace it.
This assumes a Dell R710, a PERC 6 RAID card and a new or used 2 TB SAS hard drive.
The server uses a RAID 6 array.
Notes
- Reboot the server.
- Press Ctrl-R when the PERC blurb shows on-screen.
- Check the disks under the Physical disks tree-branch on the first screen.
- Confirm a drive has a FAILED-flag showing.
- Press Ctrl-N to go to the next page.
- Find the failed drive and choose to see the options for it.
- Choose Take off-line and confirm the warning.
- The disk should no longer be listed on the first screen, under the Physical disks tree-branch.
- Put in a new disk.
- The PERC should automatically start rebuilding the array after a short period of time.
- Exit the PERC-utility and press Ctrl-Alt-Delete to reboot.
- Monitor the boot to see that no errors are thrown.
Next steps
Next steps can be performed from the office.
- Connect with ssh to the server.
- Open three terminals and run these commands, one in each terminal.
$ watch -n 10 megacli -LDInfo -L0 -a0
$ watch -n 10 megaraidsas-status
$ watch -n 10 megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL
What the commands do
2.1 General information about the virtual drive.
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :raid6
RAID Level : Primary-6, Secondary-0, RAID Level Qualifier-3
Size : 7.275 TB
Sector Size : 512
Parity Size : 3.637 TB
State : Partially Degraded
Strip Size : 64 KB
Number Of Drives : 6
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Is VD Cached: No
Exit Code: 0x00
2.2 The status of each individual disk and what it is currently doing.
-- Arrays informations --
-- ID | Type | Size | Status
a0d0 | RAID 6 | 7450GiB | DEGRADED
-- Disks informations
-- ID | Model | Status | Warnings
a0e32s0 | TOSHIBA MG04SCA200E 1863GiB | online
a0e32s1 | SEAGATE ST2000NM0001 1863GiB | online
a0e32s2 | IBM-ESXS ST2000NM0023 1863GiB | online
a0e32s3 | SEAGATE ST2000NM0001 1863GiB | online
a0e32s4 | HITACHI HUS72402CLAR2000 1863GiB | rebuild | errs: media:0 other:8
a0e32s5 | IBM-ESXS ST2000NM0023 1863GiB | online
There is at least one disk/array in a NOT OPTIMAL state.
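When only the problem disks are of interest, the listing can be filtered. The sketch below assumes the `megaraidsas-status` output has the pipe-separated layout shown above; `non_online_disks` is a hypothetical helper name.

```shell
# non_online_disks: hypothetical helper.
# Reads megaraidsas-status output on stdin and prints "ID STATUS" for
# every disk row whose status column is not "online".
non_online_disks() {
    awk -F'|' '
        /^a[0-9]+e[0-9]+s[0-9]+/ {
            id = $1; status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status != "online") print id, status
        }
    '
}

# Usage:
#   megaraidsas-status | non_online_disks
```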
2.3 The rebuilding progress.
Rebuild Progress on Device at Enclosure 32, Slot 4 Completed 42% in 203 Minutes.
Exit Code: 0x00
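The progress line can also be turned into a rough estimate of the time left. The sketch below assumes the rebuild continues at the same average rate, which is only an approximation, and that the line has the format shown above; `rebuild_eta` is a hypothetical helper name.

```shell
# rebuild_eta: hypothetical helper.
# Reads the megacli rebuild progress line on stdin and prints
# "PERCENT_DONE MINUTES_LEFT", assuming a constant rebuild rate.
rebuild_eta() {
    awk '
        /Completed [0-9]+% in [0-9]+ Minutes/ {
            for (i = 1; i < NF; i++) {
                if ($i == "Completed") { pct = $(i + 1); sub(/%/, "", pct); pct = pct + 0 }
                if ($i == "in")        { mins = $(i + 1) + 0 }
            }
            if (pct > 0) print pct, int(mins * (100 - pct) / pct)
        }
    '
}

# Usage:
#   megacli -PDRbld -ShowProg -PhysDrv [32:4] -aALL | rebuild_eta
```

For the line above, 42% in 203 minutes, this works out to roughly 280 minutes remaining.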
Sources
The internet
Using perccli with Dell PE R710 and Perc 6/i
- Written by: Sorin Srbu