db1205 is the secondary media backups metadata db server, usually just a standby for db1204. Unless it is the active server because the primary is unavailable, the only thing to check is that replication restarts correctly after maintenance.
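As an illustration only, a minimal sketch of that replication check, assuming a plain MariaDB replica reachable with pymysql (hostname, credentials and thresholds here are placeholders; the real fleet uses its own tooling):

```python
#!/usr/bin/env python3
"""Hypothetical post-maintenance check: did replication restart on the standby?"""
import pymysql

# Placeholder host/credentials, not the production values.
conn = pymysql.connect(host='db1205.example.org', user='check', password='***',
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    if status is None:
        raise SystemExit('not configured as a replica?')
    running = (status['Slave_IO_Running'] == 'Yes'
               and status['Slave_SQL_Running'] == 'Yes')
    print('replication running:', running,
          '| lag (s):', status['Seconds_Behind_Master'])
conn.close()
```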
backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoid backup errors.
backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night, so just monitoring that the host comes back and that new backups run normally would be enough.
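A rough, hypothetical sketch of that kind of morning check, assuming bconsole access on the director and deliberately naive parsing of "list jobs" output (real monitoring goes through the normal alerting, not this):

```python
#!/usr/bin/env python3
"""Hypothetical check: list recent Bacula jobs and flag the ones that did not end OK."""
import subprocess

# Query the director; 'list jobs' prints a table whose last column is the JobStatus code.
out = subprocess.run(['bconsole'], input='list jobs\nquit\n',
                     capture_output=True, text=True, check=True).stdout
rows = [line for line in out.splitlines() if line.startswith('|') and 'JobId' not in line]
ok = [r for r in rows if '| T ' in r]                   # T = terminated normally
bad = [r for r in rows if '| E ' in r or '| f ' in r]   # E/f = errors / fatal error
print(f'{len(ok)} jobs OK, {len(bad)} with errors')
for r in bad:
    print(r)
```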
backup1010 is in intermittent use to support mediabackups disk space, but mostly idle at the moment, so unless its situation changes by July and it finally gets pooled for bacula, it will require no action.
@Marostegui, in order to resolve this ticket, now that read activity is (I assume) lower, do you think I could get a host from es4 and es5 on both DCs depooled for a day, with exclusive usage, in order to take a final, archivable, full backup of those sections? It doesn't have to happen at the same time on the 4 hosts:
@ABran-WMF Thanks for handling it. To confirm, the issue happened at 2024-06-11 13:53:41 (Tuesday), right (or before)? I ask because I may recover the host from backups just to be 100% sure there is no leftover corruption.
This is ready for dc-ops.
This is ready for dc ops.
I will migrate the backups to 10.6 without removing the 10.4 backup sources yet.
@Volans not Amir, but re: your first question, my understanding is that this was a compromise to make sure there was something good enough and simple in the short term, rather than overengineering from the start. That doesn't mean that what you suggest is discarded, just that it is something that could be improved later on. For example, I am personally interested in having a queryable service/API for backup checks later, but this is better than nothing ATM, with relatively small effort. Later on, a database could import the file and generate it, for example. So I am a fan of iterating slowly as long as it is an improvement 0:-D.
Followup to T361087.
I did a disk stress test for an hour or so and saw no media errors, SMART errors, or RAID controller weirdness.
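For reference, a hypothetical sketch of the kind of post-stress check meant here, assuming smartctl is available and the disks show up as /dev/sd* (the actual verification also included the RAID controller logs):

```python
#!/usr/bin/env python3
"""Hypothetical post-stress scan: query SMART health and the error log of each disk."""
import glob
import subprocess

for dev in sorted(glob.glob('/dev/sd?')):
    # -H: overall health assessment; -l error: device error log.
    out = subprocess.run(['smartctl', '-H', '-l', 'error', dev],
                         capture_output=True, text=True).stdout
    healthy = 'PASSED' in out or 'SMART Health Status: OK' in out
    # Very naive: ATA disks print "No Errors Logged" when the error log is empty.
    errors_logged = 'No Errors Logged' not in out
    print(f'{dev}: health={"ok" if healthy else "CHECK"} errors_logged={errors_logged}')
```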
A disk was rebuilt on the 17th of May:
In T364296#9789045, @Marostegui wrote:
Sorry about it :( how can I help?
Thanks, the upgrade is no issue, but the data will have a lot of backup errors due to the hosts not being depooled before maintenance; it will need some work.
All backups will now be generated from 10.6 servers, with the exception of s1. I am leaving a couple of hosts on 10.4 before upgrading or decommissioning them.
@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:
In T363995#9775321, @jcrespo wrote:
[2024-05-06 14:33:33,903] INFO:backup '9/96/Gnome-edit-delete.svg' downloaded
[2024-05-06 14:33:33,904] INFO:backup sha256 sum of Gnome-edit-delete.svg is a45ec2020e0997a031bdd62a0dc30a518c82d9c1a100d6e8420bb2a6f938c48f
[2024-05-06 14:33:33,912] WARNING:backup A file with the same sha265 as "commonswiki Gnome-edit-delete.svg 3a36632c6569fcdf45aca81a71b53ae4faf80083 2024-05-06 14:16:14" was already uploaded, skipping.
So it was a "coincidence" that it was backed up, and not the other way around. I will check the backup logs to see why it was failing.
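The WARNING above is the deduplication path: the sha256 of the downloaded file is compared against what is already stored, and matching content is skipped. A minimal, hypothetical sketch of that idea (the real mediabackups code and storage layout differ):

```python
#!/usr/bin/env python3
"""Hypothetical sha256-based dedup check, mirroring the WARNING in the log above."""
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def backup_file(path: str, already_stored: set) -> bool:
    """Return True if the file was stored, False if skipped as a duplicate."""
    digest = sha256_of(path)
    if digest in already_stored:
        print(f'WARNING: a file with the same sha256 ({digest}) was already uploaded, skipping')
        return False
    already_stored.add(digest)
    # ...upload to the storage backend would happen here...
    return True
```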
It was failing back in 2021:
In T363995#9763970, @MatthewVernon wrote:
Here are the 2 file versions (with the hash, it can be checked that they are the same file):
In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.
In T361087#9744977, @cmooney wrote:
In T361087#9744384, @jcrespo wrote:
Booting failed (PXE):
PXELINUX 6.03 lwIP 20150819
Copyright (C) 1994-2014 H. Peter Anvin et al
Debian 12 (bookworm) amd64 (Wikimedia edition)
boot:
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...
Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.
@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?
It booted into bullseye.
Will reimage soon.
Looking good now:
Hi, backups of the matomo database failed with:
Update: on both eqiad and codfw, we are generating dumps and snapshots on 10.6 for x1, s2, s6, s5 and s3.
Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), a restart is pending. Can it be restarted, or should it be kept with the old config for a while and the alert acked?
Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck or failing and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1
@Marostegui Update: dumps and snapshots for x1, s2, s6, s5 and s3 are currently being generated with MariaDB 10.6 on codfw. Doing s5 and s3 on eqiad next. You may see a lot of 10.4 servers, but they are idle and only kept just in case; they are not active and will eventually be upgraded or discarded.
[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
Thanks a lot to everybody!
CC @ABran-WMF in case I missed something.