One of our main priorities as a company right now is increasing the stability and performance of our website. It’s not as reliable as it needs to be, and it’s important for us to change that.
One of our main problems currently is with one of our hosting providers and their hardware. We’ve kicked off a project to move all of our servers to AWS, which will give us greater reliability and the ability to recover more quickly when things do fail. Want more details?
Read the detailed interview below with our Director of Dev Ops, Sylvain Niles.
What needs to be fixed to make the website more reliable?
Years ago, aspects of our site were built for 30,000 members. Now we have 7 million. The solution that worked for the Couchsurfing website then is different from what is needed now.
Because of Couchsurfing’s nonprofit origins, the company had to go with a super cheap data center provider and built the website around that provider. Now, years later, we regularly have issues with their servers and networking hardware, both of which bring our servers down. The original Couchsurfing site has a lot of things hardcoded into it that relate to our existing data center, making it difficult to simply move to a newer cloud provider. So not only do we need to change providers, but we need to rewrite a big part of the website before we can move.
Our longer downtimes (around 6 hours) are almost always related to server or network hardware failure. Shorter downtimes (20 minutes to 2 hours) are usually due to network issues with our provider. Server failures can be reduced by moving to newer servers with the existing provider, and we’ve been doing this proactively. Network downtime, like the one this weekend, is out of our control until we’re able to move all services to a reliable cloud provider.
What are you and the team working on?
For now, we’re increasing website stability in baby steps. The team is replacing older servers with newer ones so that we at least reduce server failures. This doesn’t help us with networking hardware though, which can only be fixed by moving to a new provider. We can’t move to a new provider until we rewrite the code.
For now, our goal in moving to newer hardware is to have less frequent downtime related to server failure.
How do you plan on increasing website stability?
We’re adding more detailed monitoring to help us detect failures – and the conditions that lead to failures – sooner. We’re writing scripts that check metrics on the servers that indicate there’s going to be a problem. We’re also changing portions of the new site to rely less on the old site.
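To illustrate the kind of check described above, here is a minimal Python sketch. It is not Couchsurfing's actual script: the metric names, the threshold values, and the `find_warnings` helper are all assumptions chosen for the example. The idea is to flag a server while a metric is still only trending toward trouble, before it causes an outage.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name
    warn_at: float   # value at or above which we raise a warning


# Assumed thresholds, for illustration only.
THRESHOLDS = [
    Threshold("disk_used_pct", 85.0),
    Threshold("load_per_core", 1.5),
    Threshold("mem_used_pct", 90.0),
]


def find_warnings(sample: Dict[str, float],
                  thresholds: List[Threshold] = THRESHOLDS) -> List[str]:
    """Return one warning string per metric that crossed its threshold."""
    warnings = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped, not treated as healthy or failing.
        if value is not None and value >= t.warn_at:
            warnings.append(f"{t.metric}={value} (warn at {t.warn_at})")
    return warnings


if __name__ == "__main__":
    # A made-up sample, as a monitoring agent might report it.
    sample = {"disk_used_pct": 91.0, "load_per_core": 0.4, "mem_used_pct": 72.0}
    for warning in find_warnings(sample):
        print("WARNING:", warning)
```

In practice a script like this would run on a schedule and feed an alerting system rather than print to the console.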
Recently we added a feature that automatically updates our Twitter account when we turn on the downtime page, both when the site goes down and when all systems have been restored. This should help keep members informed as early as possible.
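A feature like this can be sketched as a small piece of state that posts only when the downtime page actually changes. The Python below is a hypothetical illustration, not the real implementation: the `DowntimeAnnouncer` class and the message wording are assumptions, and the `post` callable stands in for whatever client sends the status update, so no real Twitter API call is shown.

```python
from typing import Callable


class DowntimeAnnouncer:
    """Posts a status update whenever the downtime page is switched on or off."""

    def __init__(self, post: Callable[[str], None]) -> None:
        self._post = post          # injected sender, e.g. a social media client wrapper
        self._page_enabled = False

    def set_downtime_page(self, enabled: bool) -> None:
        # Only announce real transitions, so repeated calls do not spam followers.
        if enabled == self._page_enabled:
            return
        self._page_enabled = enabled
        if enabled:
            self._post("The site is currently down. We're working on it.")
        else:
            self._post("All systems have been restored.")


if __name__ == "__main__":
    announcer = DowntimeAnnouncer(post=print)  # print stands in for the real sender
    announcer.set_downtime_page(True)
    announcer.set_downtime_page(False)
```

Tying the announcement to the page toggle means the update goes out without anyone having to remember to send it during an incident.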
Unlike at a new company, every feature we implement, and every piece we move from the old site code to newer code, needs to support 7 million users on the first day. That’s time-consuming. For example, with something like messages, we’ll need to include all old messages and couchrequests, translate them to the new system and allow both systems to work at the same time during the migration. And that’s just messaging! When you think of all the other Couchsurfing features, it becomes a huge task, which hopefully explains why it feels like these changes are taking a long time to implement.
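One common way to keep old and new data working side by side is to read from the new system first and translate old records on demand. The Python sketch below shows that pattern under stated assumptions: the field names (`msg_id`, `text`, `is_request`), the `translate_old_message` function, and the use of plain dictionaries as stores are all hypothetical, and the source does not say which migration strategy the team actually uses.

```python
from typing import Dict, Optional


def translate_old_message(old: Dict) -> Dict:
    """Convert a record from the (assumed) old format to the (assumed) new one."""
    return {
        "id": old["msg_id"],
        "body": old["text"],
        "kind": "couchrequest" if old.get("is_request") else "message",
    }


class MessageReader:
    """Serves messages from the new store, falling back to the old one."""

    def __init__(self, old_store: Dict[int, Dict], new_store: Dict[int, Dict]) -> None:
        self.old_store = old_store
        self.new_store = new_store

    def get(self, msg_id: int) -> Optional[Dict]:
        # Prefer the new system whenever the message already lives there.
        if msg_id in self.new_store:
            return self.new_store[msg_id]
        old = self.old_store.get(msg_id)
        if old is None:
            return None
        # Translate on first read and keep the result, so the old store
        # is consulted at most once per message.
        migrated = translate_old_message(old)
        self.new_store[msg_id] = migrated
        return migrated


if __name__ == "__main__":
    old = {1: {"msg_id": 1, "text": "Can I stay two nights?", "is_request": True}}
    new: Dict[int, Dict] = {}
    reader = MessageReader(old, new)
    print(reader.get(1))
```

At a scale of millions of members, a real migration would also need bulk backfills and handling for writes that arrive mid-migration, which this sketch leaves out.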
When are you hoping to have more reliable servers?
As we finish these individual projects, the site will get more reliable and you’ll see some big improvements within a month and a half. We’re hoping to be 100% migrated from the old to the new provider within three months.