db1205 is the secondary media backups metadata db server, usually just a standby for db1204. Unless it is the active server because the primary is unavailable, the only thing to check is that replication restarts correctly after maintenance.
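As an illustration only, a minimal sketch of that replication check, assuming a plain MariaDB replica reachable with pymysql (hostname, credentials and thresholds here are placeholders; the real fleet uses its own tooling):

```python
#!/usr/bin/env python3
"""Hypothetical post-maintenance check: did replication restart on the standby?"""
import pymysql

# Placeholder host/credentials, not the production values.
conn = pymysql.connect(host='db1205.example.org', user='check', password='***',
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    if status is None:
        raise SystemExit('not configured as a replica?')
    running = (status['Slave_IO_Running'] == 'Yes'
               and status['Slave_SQL_Running'] == 'Yes')
    print('replication running:', running,
          '| lag (s):', status['Seconds_Behind_Master'])
conn.close()
```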
backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoid backup errors.
backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night, so just monitoring that the host comes back and that new backups run normally would be enough.
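A rough, hypothetical sketch of that kind of morning check, assuming bconsole access on the director and deliberately naive parsing of "list jobs" output (real monitoring goes through the normal alerting, not this):

```python
#!/usr/bin/env python3
"""Hypothetical check: list recent Bacula jobs and flag the ones that did not end OK."""
import subprocess

# Query the director; 'list jobs' prints a table whose last column is the JobStatus code.
out = subprocess.run(['bconsole'], input='list jobs\nquit\n',
                     capture_output=True, text=True, check=True).stdout
rows = [line for line in out.splitlines() if line.startswith('|') and 'JobId' not in line]
ok = [r for r in rows if '| T ' in r]                   # T = terminated normally
bad = [r for r in rows if '| E ' in r or '| f ' in r]   # E/f = errors / fatal error
print(f'{len(ok)} jobs OK, {len(bad)} with errors')
for r in bad:
    print(r)
```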
backup1010 is in intermittent use to support mediabackups disk space, but mostly idle at the moment, so unless its situation changes by July and it finally gets pooled for bacula, it will require no action.
@Marostegui, in order to resolve this ticket, now that read activity is (I assume) lower, do you think I could get a host from es4 and es5 on both DCs depooled for a day, with exclusive usage, in order to take a final, archivable, full backup of those sections? It doesn't have to happen at the same time on the 4 hosts:
@ABran-WMF Thanks for handling it. To confirm, the issue happened at 2024-06-11 13:53:41 (Tuesday), right (or before)? I ask because I may recover the host from backups just to be 100% sure there is no leftover corruption.
This is ready for dc-ops.
This is ready for dc ops.
I will migrate the backups to 10.6 without removing the 10.4 backup sources yet.
@Volans not Amir, but re: your first question, my understanding is that this was a compromise to make sure there was something good enough and simple in the short term, rather than overengineering from the start. That doesn't mean that what you suggest is discarded, just that it is something that could be improved later on. For example, I am personally interested in having a queryable service/API for backup checks later, but this is better than nothing ATM, with relatively small effort. Later on, a database could import the file and generate it, for example. So I am a fan of iterating slowly as long as it is an improvement 0:-D.
Followup to T361087.
I did a disk stress test for an hour or so and saw no media errors, SMART errors, or RAID controller weirdness.
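For reference, a hypothetical sketch of the kind of post-stress check meant here, assuming smartctl is available and the disks show up as /dev/sd* (the actual verification also included the RAID controller logs):

```python
#!/usr/bin/env python3
"""Hypothetical post-stress scan: query SMART health and the error log of each disk."""
import glob
import subprocess

for dev in sorted(glob.glob('/dev/sd?')):
    # -H: overall health assessment; -l error: device error log.
    out = subprocess.run(['smartctl', '-H', '-l', 'error', dev],
                         capture_output=True, text=True).stdout
    healthy = 'PASSED' in out or 'SMART Health Status: OK' in out
    # Very naive: ATA disks print "No Errors Logged" when the error log is empty.
    errors_logged = 'No Errors Logged' not in out
    print(f'{dev}: health={"ok" if healthy else "CHECK"} errors_logged={errors_logged}')
```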
A disk was rebuilt on the 17th of May:
In T364296#9789045, @Marostegui wrote:
Sorry about it :( how can I help?
Thanks, the upgrade is no issue, but the data will have a lot of backup errors due to the hosts not being depooled before maintenance; it will need some work.
All backups will now be generated from 10.6 servers, with the exception of s1. I am leaving a couple of hosts on 10.4 before upgrading or decommissioning them.
@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:
In T363995#9775321, @jcrespo wrote:
[2024-05-06 14:33:33,903] INFO:backup '9/96/Gnome-edit-delete.svg' downloaded
[2024-05-06 14:33:33,904] INFO:backup sha256 sum of Gnome-edit-delete.svg is a45ec2020e0997a031bdd62a0dc30a518c82d9c1a100d6e8420bb2a6f938c48f
[2024-05-06 14:33:33,912] WARNING:backup A file with the same sha265 as "commonswiki Gnome-edit-delete.svg 3a36632c6569fcdf45aca81a71b53ae4faf80083 2024-05-06 14:16:14" was already uploaded, skipping.
So it was a "coincidence" that it was backed up, and not the other way around. I will check the backup logs to see why it was failing.
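The WARNING above is the deduplication path: the sha256 of the downloaded file is compared against what is already stored, and matching content is skipped. A minimal, hypothetical sketch of that idea (the real mediabackups code and storage layout differ):

```python
#!/usr/bin/env python3
"""Hypothetical sha256-based dedup check, mirroring the WARNING in the log above."""
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def backup_file(path: str, already_stored: set) -> bool:
    """Return True if the file was stored, False if skipped as a duplicate."""
    digest = sha256_of(path)
    if digest in already_stored:
        print(f'WARNING: a file with the same sha256 ({digest}) was already uploaded, skipping')
        return False
    already_stored.add(digest)
    # ...upload to the storage backend would happen here...
    return True
```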
It was failing back in 2021:
In T363995#9763970, @MatthewVernon wrote:
Here are the 2 file versions (with the hash, it can be checked that they are the same file):
In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.
In T361087#9744977, @cmooney wrote:
In T361087#9744384, @jcrespo wrote:
Booting failed (PXE):
PXELINUX 6.03 lwIP 20150819
Copyright (C) 1994-2014 H. Peter Anvin et al
Debian 12 (bookworm) amd64 (Wikimedia edition)
boot:
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...
Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.
@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?
It booted into bullseye.
Will reimage soon.
Looking good now:
Hi, backups of the matomo database failed with:
Update: on both eqiad and codfw, we are generating dumps and snapshots on 10.6 for x1, s2, s6, s5 and s3.
Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), a restart is pending. Can it be restarted, or should it be kept with the old config for a while and the alert acked?
Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck or failing and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1
@Marostegui Update: dumps and snapshots for x1, s2, s6, s5 and s3 are currently being generated with MariaDB 10.6 on codfw. Doing s5 and s3 on eqiad next. You may see a lot of 10.4 servers, but they are idle and only kept just in case; they are not active and will eventually be upgraded or discarded.
[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
Thanks a lot to everybody!
CC @ABran-WMF in case I missed something.