On May 21st, the Joind.in hosted service began experiencing connectivity issues between parts of its application hosting tier. These surfaced as a mix of timeouts and TLS handshake failures when our web2 frontend requested data from our API. As a result, many requests returned a 522 error from Cloudflare, an application error from the website, or a Slim error message while we were troubleshooting.
We’ve been in contact with Platform.sh (our host) since shortly after the issue cropped up. Shortly thereafter, they confirmed that the region currently hosting our application was having infrastructure performance issues. This morning they let us know that these issues have been resolved.
As part of this discussion, Platform.sh staff noticed that our account is currently provisioned in their older, less advanced US1 region. They’ve recommended that we migrate to the newer US2 region as soon as we can manage to do so. Our plan, now that the fires are out, is to perform that migration next week.
The unfortunate part of this site reliability issue was that it hit smack in the middle of at least one big event in the US. We don’t want this to happen again, and we are working to make sure upcoming events have a smooth experience with our platform. The region move is part of this effort. We’ve also implemented additional performance tweaks at the database level to improve API responsiveness, and we’ve increased logging, monitoring, and alerting across the board so that we’re (nearly) the first ones to notice another issue like this.
If you’d like to ask any questions or chat with the Joindin Leadership Team, feel free to join us in our Discord chat group: https://discordapp.com/invite/fWa9fu9. Thanks to the Platform.sh team, including our primary contact Larry Garfield, for advising us on measures to avoid a repeat of this issue.