Closed Bug 804357 Opened 12 years ago Closed 11 years ago

symbols pushes should retry and/or be able to select among multiple destinations

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: kmoir)

Details

(Whiteboard: [SPOF])

Attachments

(3 files, 1 obsolete file)

Currently, if the symbol server is down, symbol uploads will fail, full stop.

It would be great to:
 * retry this a few times on failure (can retry.py do this?)
 * select among a configured set of destination hosts, either sequentially or randomly, so that if one goes down the others will pick up its load

For example, right now we're using symbols1.dmz.phx1.mozilla.com which is a single host in phx1.  With some client-side resiliency and retries, we could add symbols2, so that a failure of either host would not cause build failures.
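
A minimal sketch of the client-side behaviour requested above, assuming a hypothetical `upload` callable that takes a host name and raises on failure. This is not retry.py or the existing build-system code; the function name, parameters, and defaults are all illustrative.

```python
import random
import time


def upload_with_failover(hosts, upload, attempts_per_host=3, delay=1.0, shuffle=False):
    """Try each host in turn, retrying a few times before moving to the next.

    `upload` is a hypothetical callable that takes a host name and raises an
    exception on failure. Returns the host that accepted the upload.
    """
    candidates = list(hosts)
    if shuffle:
        # Random order spreads load across hosts; the default is sequential.
        random.shuffle(candidates)
    last_error = None
    for host in candidates:
        for _ in range(attempts_per_host):
            try:
                upload(host)
                return host
            except Exception as exc:  # any failure counts as "host unavailable"
                last_error = exc
                time.sleep(delay)
    raise RuntimeError("all symbol hosts failed") from last_error
```

With `shuffle=False` the hosts are tried in the configured order, so the second host only sees traffic when the first is failing; `shuffle=True` corresponds to the "randomly" option.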
I think this is a dupe of bug 799654, which went live today.
The first part is, and that's great!  What do you think about adding the second part?  The other option is to switch to using the Zeus VIP, so Zeus can do the failover for you.
The symbol server host is set as an environment variable that 'make uploadsymbols' uses, so it would be a bit tricky to add more than one host in there without changing the build system. I think client-side retry + load balancer is cleanest.
Priority: -- → P3
So for the moment that means using symbolpush.mozilla.org.

symbolpush.mozilla.org is an alias for symbols1.pub.phx1.mozilla.com.
symbols1.pub.phx1.mozilla.com has address 63.245.216.250

Don't do that yet, though - I'd first like to move that to Zeus (it's NAT now), if possible.  I'll file a bug blocking this one.
Depends on: 804995
Bug 804995 has progressed to the point where symbolpush.mozilla.org is the appropriate name to use for uploading symbols.  At the moment, all access methods work and are equivalent:

 symbols1.pub.phx1.mozilla.com / 63.245.216.250
 symbols1.dmz.phx1.mozilla.com / 10.8.74.48
 symbolpush.mozilla.org / symbolpush.zlb.phx.mozilla.net / 63.245.217.193

So switching which one is used should not cause any problems or require any coordination.

The last is the one I would like to standardize on.  Once that's complete, we can put multiple hosts behind the zlb with failover, and eliminate an SPOF.

The new VIP is accessible from at least some of the build network:

[cltbld@dev-stage01 ~]$ nc -vz 63.245.217.193 22
Connection to 63.245.217.193 22 port [tcp/ssh] succeeded!

but please verify that from each affected VLAN before enabling.

So the action here is:
 * verify all slaves that push symbols can reach symbolpush.mozilla.org
 * replace 'symbols1.dmz.phx1.mozilla.com' in the buildbot configs with 'symbolpush.mozilla.org'.
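
The first step can be scripted so it is easy to repeat on a slave in each VLAN. The following is a rough Python equivalent of the `nc -vz` check above, using only the standard library; the function name and the timeout default are assumptions made for illustration.

```python
import socket


def can_reach(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_reach("symbolpush.mozilla.org")` run on a slave would report whether the ssh port is reachable from that machine. It only proves TCP connectivity, not that the login or host key check will succeed.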
No longer depends on: 804995
Any chance this can get fixed up?  It should be nice and simple, and will eliminate an SPOF.
Assignee: nobody → kmoir
Blocks: 804995
I verified that all the slave types can ssh to the host.  The old host name is also referenced in the mozharness scripts.
Attached patch mozharness patch
Attached patch puppet patch (obsolete) — Splinter Review
I think a patch for puppet is needed too, but I'm not sure how to deploy it in a non-breaking fashion.
Attachment #721485 - Attachment description: patch → puppet patch
Comment on attachment 721485 [details] [diff] [review]
puppet patch

dustin@cerf ~ $ host symbolpush.mozilla.org
symbolpush.mozilla.org is an alias for symbolpush.zlb.phx.mozilla.net.
symbolpush.zlb.phx.mozilla.net has address 63.245.217.193

so you should be able to add the new one with the correct external IP.  Then there's no conflict (aside from listing the same key twice, which -- I think -- isn't a problem)
Attached file puppet patch
Attachment #721485 - Attachment is obsolete: true
Attachment #721338 - Flags: review?(catlee)
Attachment #721344 - Flags: review?(catlee)
Comment on attachment 721679 [details]
puppet patch

Not sure how we can test this in staging since there is a different symbols server in staging.
Attachment #721679 - Flags: review?(catlee)
Attachment #721679 - Flags: review?(catlee) → review+
Attachment #721344 - Flags: review?(catlee) → review+
Attachment #721338 - Flags: review?(catlee) → review+
You can log on to production machines of the various OSes that need to push symbols and make sure they can log in to the new host OK.

Does Windows need a known_hosts update too?
Yes, Windows needs a known_hosts update too. Good catch.  I forgot about the machines that aren't managed by Puppet.
Depends on: 849942
Attachment #721338 - Flags: checked-in+
Attachment #721344 - Flags: checked-in+
Comment on attachment 721679 [details]
puppet patch

I also updated the known_hosts on all the production Windows build machines (mw32-ix-slave and w64-ix-slave).  I don't think the try build machines need this change, so I left them alone.
Attachment #721679 - Flags: checked-in+
Comment on attachment 721344 [details] [diff] [review]
mozharness patch

This is merged to production.
This is in production.
I verified that this is working on recent builds in tbpl.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
We forgot to update /N/production/darwin11-x86_64/build/Users/cltbld/.ssh/known_hosts for the bld-lion-r5 slaves. I've added this on scl3-production-puppet. Kim, can you make sure the other puppet masters and staging copies of this file are updated as well?

Probably a good idea to update any other known_hosts files on /N too.
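
A rough sketch of how the remaining files could be checked for a missing entry, assuming plain (unhashed) known_hosts lines. The function name is hypothetical, and hashed entries, wildcard patterns, and `[host]:port` forms are deliberately not handled.

```python
def known_hosts_has(text, name):
    """Return True if any unhashed known_hosts line in `text` lists `name`."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # A line may begin with a marker such as @cert-authority or @revoked.
        if fields[0].startswith("@"):
            fields = fields[1:]
        if not fields:
            continue
        # The first field is a comma-separated list of host names and addresses.
        if name in fields[0].split(","):
            return True
    return False
```

Reading each known_hosts file and calling `known_hosts_has(contents, "symbolpush.mozilla.org")` would flag the copies that still need the new entry.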
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I updated them.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Should really have waited until bug 849942 was resolved before landing the buildbot-configs patch. Some of the Windows machines are having issues uploading symbols.
Sorry nthomas, I didn't think it would take long to fix the image in bug 849942.
My mistake actually, I missed comment #15 here.
Depends on: 855682
Product: mozilla.org → Release Engineering
Component: General Automation → General