Bug 804357
Opened 12 years ago
Closed 12 years ago
symbols pushes should retry and/or be able to select among multiple destinations
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: kmoir)
References
Details
(Whiteboard: [SPOF])
Attachments
(3 files, 1 obsolete file)
924 bytes, patch | catlee: review+ | kmoir: checked-in+
3.68 KB, patch | catlee: review+ | kmoir: checked-in+
4.28 KB, text/plain | catlee: review+ | kmoir: checked-in+
Currently, if the symbol server is down, symbol uploads will fail, full stop.
It would be great to:
* retry this a few times on failure (can retry.py do this?)
* select among a configured set of destination hosts, either sequentially or randomly, so that if one goes down the others will pick up its load
For example, right now we're using symbols1.dmz.phx1.mozilla.com which is a single host in phx1. With some client-side resiliency and retries, we could add symbols2, so that a failure of either host would not cause build failures.
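The two requests above (retry on failure, then fall over to another destination) can be sketched roughly as follows. This is only an illustration under stated assumptions: `push_with_failover` and the `upload` callable are hypothetical names, not part of retry.py or `make uploadsymbols`, and the second host in the usage comment is a placeholder for the "symbols2" idea.

```python
import random
import time


def push_with_failover(hosts, upload, attempts_per_host=3, delay=5.0, shuffle=False):
    """Call upload(host) on each host in turn, retrying before moving on.

    `upload` is a hypothetical callable that raises OSError on failure.
    Returns the host that accepted the upload; raises RuntimeError if none did.
    """
    candidates = list(hosts)
    if shuffle:
        # Random order spreads load instead of always hitting the first host.
        random.shuffle(candidates)
    errors = []
    for host in candidates:
        for attempt in range(1, attempts_per_host + 1):
            try:
                upload(host)
                return host
            except OSError as exc:  # connection and I/O errors are retryable here
                errors.append((host, attempt, exc))
                if attempt < attempts_per_host:
                    time.sleep(delay)
    raise RuntimeError("all symbol hosts failed: %r" % (errors,))


# Hypothetical usage (do_upload is a placeholder, the second host does not exist):
# push_with_failover(
#     ["symbols1.dmz.phx1.mozilla.com", "symbols2.example.invalid"],
#     upload=do_upload,
# )
```

Sequential order keeps the primary host preferred; `shuffle=True` corresponds to the "randomly" option in the description.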
Comment 1•12 years ago
I think this is a dupe of bug 799654, which went live today.
Reporter
Comment 2•12 years ago
The first part is, and that's great! What do you think about adding the second part? The other option is to switch to using the Zeus VIP, so Zeus can do the failover for you.
Comment 3•12 years ago
The symbol server host is set as an environment variable that 'make uploadsymbols' uses, so it would be a bit tricky to add more than one host in there without changing the build system. I think client-side retry + load balancer is cleanest.
Priority: -- → P3
Reporter
Comment 4•12 years ago
So for the moment that means using symbolpush.mozilla.org.
symbolpush.mozilla.org is an alias for symbols1.pub.phx1.mozilla.com.
symbols1.pub.phx1.mozilla.com has address 63.245.216.250
Don't do that yet, though - I'd first like to move that to Zeus (it's NAT now), if possible. I'll file a bug blocking this one.
Reporter
Comment 5•12 years ago
Bug 804995 is to the point where symbolpush.mozilla.org is the appropriate name/IP to be using to upload symbols. At the moment, all access methods work and are equivalent:
symbols1.pub.phx1.mozilla.com / 63.245.216.250
symbols1.dmz.phx1.mozilla.com / 10.8.74.48
symbolpush.mozilla.org / symbolpush.zlb.phx.mozilla.net / 63.245.217.193
So switching which one is used should not cause any problems or require any coordination.
The last is the one I would like to standardize on. Once that's complete, we can put multiple hosts behind the zlb with failover, and eliminate an SPOF.
The new VIP is accessible from at least some of the build network:
[cltbld@dev-stage01 ~]$ nc -vz 63.245.217.193 22
Connection to 63.245.217.193 22 port [tcp/ssh] succeeded!
but please verify that from each affected VLAN before enabling.
So the action here is:
* verify all slaves that push symbols can reach symbolpush.mozilla.org
* replace 'symbols1.dmz.phx1.mozilla.com' in the buildbot configs with 'symbolpush.mozilla.org'.
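The reachability check in the first action item could be scripted along these lines, as a rough Python equivalent of the `nc -vz` test above. The function name `can_reach` is hypothetical. It only confirms that the TCP port opens and says nothing about whether ssh authentication or the host key check would succeed.

```python
import socket


def can_reach(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example, to be run from a slave on each affected VLAN:
# can_reach("symbolpush.mozilla.org")
```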
No longer depends on: 804995
Reporter
Comment 6•12 years ago
Any chance this can get fixed up? It should be nice and simple, and will eliminate an SPOF.
Assignee
Updated•12 years ago
Assignee: nobody → kmoir
Assignee
Comment 7•12 years ago
I verified that all the slave types could ssh to the host. The old host name is also referenced in mozharness scripts.
Assignee
Comment 8•12 years ago
Assignee
Comment 9•12 years ago
I think a patch for puppet is needed too. I'm not sure how to deploy this patch in a non-breaking fashion.
Assignee
Updated•12 years ago
Attachment #721485 -
Attachment description: patch → puppet patch
Reporter
Comment 10•12 years ago
Comment on attachment 721485 [details] [diff] [review]
puppet patch
dustin@cerf ~ $ host symbolpush.mozilla.org
symbolpush.mozilla.org is an alias for symbolpush.zlb.phx.mozilla.net.
symbolpush.zlb.phx.mozilla.net has address 63.245.217.193
So you should be able to add the new name with the correct external IP. Then there's no conflict (aside from listing the same key twice, which, I think, isn't a problem).
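To check whether a given known_hosts file already carries the new name and address, something like the sketch below could help. It assumes the standard unhashed line format (`host1,host2 keytype key`) and deliberately ignores wildcard patterns and `[host]:port` entries. The function names are hypothetical.

```python
def hosts_in_known_hosts(text):
    """Collect plain (unhashed) host names and addresses from known_hosts text."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0].startswith("@"):
            # Drop a leading @cert-authority or @revoked marker.
            fields = fields[1:]
        if len(fields) < 3 or fields[0].startswith("|"):
            # Skip malformed lines and hashed entries, which cannot be read back.
            continue
        names.update(fields[0].split(","))
    return names


def missing_hosts(text, wanted):
    """Return the entries of `wanted` that have no known_hosts line, in order."""
    present = hosts_in_known_hosts(text)
    return [host for host in wanted if host not in present]
```

Running `missing_hosts` with `["symbolpush.mozilla.org", "63.245.217.193"]` against each slave's file would list what still needs adding. Because the old and new names point at the same key, adding a second line leaves the existing entry untouched.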
Assignee
Comment 11•12 years ago
Attachment #721485 -
Attachment is obsolete: true
Assignee
Updated•12 years ago
Attachment #721338 -
Flags: review?(catlee)
Assignee
Updated•12 years ago
Attachment #721344 -
Flags: review?(catlee)
Assignee
Comment 12•12 years ago
Comment on attachment 721679 [details]
puppet patch
Not sure how we can test this in staging since there is a different symbols server in staging.
Attachment #721679 -
Flags: review?(catlee)
Updated•12 years ago
Attachment #721679 -
Flags: review?(catlee) → review+
Updated•12 years ago
Attachment #721344 -
Flags: review?(catlee) → review+
Updated•12 years ago
Attachment #721338 -
Flags: review?(catlee) → review+
Comment 13•12 years ago
You can log onto production machines of the various OSes that need to push symbols here and make sure they can log in OK.
Does Windows need a known_hosts update too?
Assignee
Comment 14•12 years ago
Yes, Windows needs a known_hosts update too. Good catch. I forgot about the machines that aren't managed by Puppet.
Assignee
Updated•12 years ago
Attachment #721338 -
Flags: checked-in+
Assignee
Updated•12 years ago
Attachment #721344 -
Flags: checked-in+
Assignee
Comment 15•12 years ago
Comment on attachment 721679 [details]
puppet patch
I also updated the known_hosts file on all the production Windows build machines (mw32-ix-slave and w64-ix-slave). I don't think the try build machines need this change, so I left them alone.
Attachment #721679 -
Flags: checked-in+
Comment 16•12 years ago
Comment on attachment 721344 [details] [diff] [review]
mozharness patch
This is merged to production.
Comment 17•12 years ago
This is in production.
Assignee
Comment 18•12 years ago
I verified that this is working on recent builds in tbpl.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 19•12 years ago
We forgot to update /N/production/darwin11-x86_64/build/Users/cltbld/.ssh/known_hosts for the bld-lion-r5 slaves. I've added this on scl3-production-puppet. Kim, can you make sure the other puppet masters and staging copies of this file are updated as well?
Probably a good idea to update any other known_hosts files on /N too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 20•12 years ago
I updated them.
Assignee
Updated•12 years ago
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 21•12 years ago
We should really have waited until bug 849942 was resolved before landing the buildbot-configs patch. Some of the Windows machines are having issues uploading symbols.
Assignee
Comment 22•12 years ago
Sorry nthomas, I didn't think it would take long to fix the image in bug 849942.
Comment 23•12 years ago
My mistake actually, I missed comment #15 here.
Updated•12 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Component: General Automation → General