Incorrect Domain returned during OAuth2 auth_code request
Hello Community,
I've opened a support ticket (01926683) on this topic, but I'm seeing if someone here has bumped into this as well and has a solution. When I begin the process of requesting an OAuth2 access token, I make the request to the auth server:

GET https://auth.brightspace.com/oauth2/auth?response_type=code&redirect_uri=https://example.edu/oauthreturn&client_id=xxxxxxxxxxxxxx&scope=…..
That request provides a 302 redirect to:

https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.lms.brightspace.com/d2l/auth/api/token?x_a=……

Since 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.lms.brightspace.com' is not the same as 'example.edu', I get a not-authorized error, because it's a different domain that I do not have a current session in.
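For anyone wanting to check whether their org is affected, here's a minimal sketch (Python with the requests library; the client_id, redirect_uri, scope, and expected host are placeholders) that makes the same authorization request without following redirects and prints which host the auth service points at:

```python
# Minimal repro sketch; client_id, redirect_uri, scope, and EXPECTED_HOST are placeholders.
import urllib.parse
import requests

AUTH_URL = "https://auth.brightspace.com/oauth2/auth"
EXPECTED_HOST = "example.edu"  # placeholder: the domain your browser session actually lives on

params = {
    "response_type": "code",
    "redirect_uri": "https://example.edu/oauthreturn",
    "client_id": "xxxxxxxxxxxxxx",
    "scope": "core:*:*",  # placeholder scope
}

# Don't follow the redirect; we only want to inspect the Location header.
resp = requests.get(AUTH_URL, params=params, allow_redirects=False)
location = resp.headers.get("Location", "")
host = urllib.parse.urlparse(location).netloc

print("302 Location host:", host)
print("Matches expected domain:", host == EXPECTED_HOST or host.endswith("." + EXPECTED_HOST))
```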
Is anyone else seeing this behavior (auth service redirecting you to the wrong domain)?
Best Answer
-
Hello,
A recent change to the OAuth 2.0 backend, intended to improve reliability, unfortunately introduced a bug.
Your site has multiple URLs: lms.school.edu, the one you tend to use in your browser, and <tenantId>.lms.brightspace.com. The second one is a newer addition that lets our internal services contact your LMS more reliably; for example, if a client misconfigures DNS for lms.school.edu, it shouldn't impact our backend.
Unfortunately, some wires got crossed and we would sometimes use the "tenantId URL" for OAuth 2 redirects. At this time I can't quantify how often this happened or which customers it happened to.
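To make the distinction concrete, here's a rough sketch of what the fix amounts to (purely illustrative; OrgUrls and the field names are made up, not our actual code): user-facing OAuth 2 redirects should always be built from the org's own URL, never the tenantId URL.

```python
# Illustrative sketch only; OrgUrls and its fields are hypothetical names.
from dataclasses import dataclass

@dataclass
class OrgUrls:
    vanity_host: str   # e.g. "lms.school.edu" - the host users have sessions on
    tenant_host: str   # e.g. "<tenantId>.lms.brightspace.com" - internal routing only

def oauth_redirect_base(org: OrgUrls) -> str:
    # User-facing OAuth 2 redirects must target the host the user's browser
    # session lives on; the tenant host is reserved for internal service calls.
    return f"https://{org.vanity_host}"

org = OrgUrls(vanity_host="lms.school.edu",
              tenant_host="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.lms.brightspace.com")
print(oauth_redirect_base(org) + "/d2l/auth/api/token?x_a=…")
```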
At this point we believe that the issue is resolved. Thanks for reporting it, and let me know if you still see the behaviour going forward.
Answers
-
Hi!
We currently have a P2 ticket open on this issue. Support is not yet able to reproduce it consistently, so I have sent them a network capture to further analyze the problem.
Big problem for us indeed! It seems to be some kind of new OAuth flow introduced without announcement or test options. Our ticket number is 01926310.
Edit: it must have started somewhere between Dec 12th 7 PM and Dec 13th 7 AM (UTC).
-
Thanks for the explanation @Jacob.P.433 and great that it has been resolved. Functionality has indeed been restored for us, so that's a good thing.
Will this incident be evaluated internally? It has led to an outage of two full working days. These weeks before Christmas are a bit quieter, but if it had been a busy period, this would have been dramatic.
-
Will this incident be evaluated internally? It has led to an outage of two full working days. These weeks before Christmas are a bit quieter, but if it had been a busy period, this would have been dramatic.
Yeah, we take these incidents seriously.
You're right about the approximate timeline. Here are some more details:
* Dec 12th @ 3:53 PM EST: The defect started rolling out to production
* Dec 12th @ 4:12 PM EST: Rollout is complete in all regions
* Dec 13th @ 3:51 PM EST: First internal report comes in, but with incomplete information
* Dec 13th @ 4:56 PM EST: This post reported the issue
* Dec 14th @ 9:37 AM EST: I'm aware of this post and able to reproduce from extra info collected internally
* Dec 14th @ 10:32 AM EST: Fix is developed
* Dec 14th @ 10:55 AM EST: Fix starts to rollout
* Dec 14th @ 10:59 AM EST: Rollout complete
* Dec 14th @ 11:33 AM EST: My first post here

There are two problems as I see it:
1. The time between rollout of the defect and when we noticed (from internal reports) was about 24 hours.
2. Even with the internal report, this didn't get prioritized without more information, and because it was ~5 PM, everything else got delayed another ~19 hours (people logged off, went to sleep, etc.)
I view (1) as the main problem; if we could have detected this earlier, that would have had the biggest impact, both directly on (1) (eliminating this delay) and indirectly on (2) (with automated monitoring we typically have enough information to go on).
So the kind of solution to problems like this that we would typically consider is automated monitoring. The reason nothing caught this is the tricky/unusual nature of the situation. Probably the best we could do is monitor the volume of successful OAuth 2.0 logins (not failures; there was no "failure" in this case, just confused users looking at a login page on a weird domain). If there is a sudden dip, that is ideally something we'd get alerted for. The trickiness with these alarms is tuning them for a good signal-to-noise ratio: successful OAuth 2 logins can rise and drop for innocuous reasons throughout the day, and we want alarms only for real problems. Note also that this was only a partial outage; the experience varied by org, with some orgs seeing more impact. I don't have data yet on how widespread it was, though.
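As a very rough sketch of what that kind of dip alarm could look like (illustrative only; the bucketing, thresholds, and metric source are made up, not our actual monitoring stack):

```python
# Illustrative dip-detection sketch; counts, window sizes, and thresholds are made up.
from statistics import mean

def should_alert(counts_per_5min: list[int],
                 baseline_window: int = 24,   # ~2 hours of 5-minute buckets
                 drop_ratio: float = 0.5,     # alert if volume falls below 50% of baseline
                 min_baseline: int = 20) -> bool:
    """Alert when the latest bucket of successful OAuth 2 logins drops well
    below the recent baseline. min_baseline suppresses noisy alarms during
    naturally quiet periods (nights, weekends)."""
    if len(counts_per_5min) <= baseline_window:
        return False
    baseline = mean(counts_per_5min[-(baseline_window + 1):-1])
    latest = counts_per_5min[-1]
    return baseline >= min_baseline and latest < baseline * drop_ratio

# Example: steady traffic, then a sudden dip in the latest bucket.
history = [120, 118, 125, 122, 119, 121, 117, 124, 120, 123,
           119, 122, 121, 118, 125, 120, 119, 123, 122, 121,
           120, 118, 124, 122, 40]
print(should_alert(history))  # True - the last bucket fell far below the baseline
```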
Other approaches involve looking at how long it took for customer reports (via channels such as the community or otherwise) to get to the right team. Those are important too, but apply more generically; at the same time, our goal is to not have to rely on user reports for incidents like this.
This is an example of how we look at incidents like this.
-
Jacob, thanks for the fix and the information. We are seeing much better behavior now, and will continue to monitor.
-
Jacob, thank you for sharing this! I really appreciate the thoughtful and candid nature of your response; it means a lot.
-Justin
-
Hi @Jacob.P.433,
Thanks for the extended information/timeline, and I'm glad to see that you will evaluate all relevant points. In my ticket comment from Dec 13th 6:38 AM (EST) I posted pretty much a summary of what is in this post, so it's unfortunate that this info did not reach the right teams/persons a day earlier.
I guess that change was unrelated to that night's hotfix that was released at the same time? That combination probably made it more difficult to analyze.
Anyhow, we're happy it has been fixed :)
-
In my ticket comment from Dec 13th 6:38 AM (EST) I posted pretty much a summary of what is in this post, so it's unfortunate that this info did not reach the right teams/persons a day earlier.
Yeah, triaging issues "through the front door" and reducing the latency of getting it to the right team is unfortunately sometimes tricky.
I guess that change was unrelated to that night's hotfix that was released at the same time? That combination probably made it more difficult to analyze.
Yeah, those two things would have been unrelated. The defect was in a core piece of infrastructure that exists outside of any one customer and is related to how our machines talk to each other internally (and has the "side job" of implementing the user-facing OAuth 2 functionality). Infrastructure like that is updated multiple times a day, outside the regular software/feature deployment schedule, usually without anyone noticing 😉