Coping with Massive Growth
Massive growth is the dream of almost every startup, but it can be a nightmare if you’re not anticipating it. Recently, I had a CTO call me and tell me that they were about to go from 2 million users to over 200 million users with a recently signed contract. They rang me as they were worried about the potential cost. I asked a couple of questions, and we both quickly realised that cost was the least of their worries.
So, what did they need to worry about? The most important thing in my mind is that the service needed to stay up and remain fast.
How do we start?
The first step in increasing service reliability and maintaining performance is to measure everything:
Artwork: Allie Brosh
Once you have measurements that tell you how your system is working, you can work out what is running slowly and identify when something goes down. Don’t forget to measure error rates too, e.g. 4xx and 5xx HTTP status codes – preferably as a percentage of all requests, as absolute numbers soon become meaningless once traffic grows.
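To make the "percentage, not absolute numbers" point concrete, here is a minimal sketch in Python. The function name `error_rates` and the sample data are made up for illustration; in practice the status codes would come from your load balancer or web server logs.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the 4xx and 5xx error rates as a percentage of all requests."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    # Group codes by their first digit: 404 -> 4, 503 -> 5, and so on.
    classes = Counter(code // 100 for code in status_codes)
    return {
        "4xx": 100.0 * classes[4] / total,
        "5xx": 100.0 * classes[5] / total,
    }


# Hypothetical sample: 100 requests, 3 client errors, 2 server errors.
sample = [200] * 95 + [404] * 3 + [500] * 2
rates = error_rates(sample)
print(rates)
```

Two server errors out of 100 requests and 2,000 out of 100,000 are the same 2%, which is why the rate stays comparable as you grow while the raw count does not.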
If you have a globally distributed userbase, then you may want to set up checks from all around the world too. I used to live in New Zealand, and you could soon tell which apps were making lots of sequential calls to a faraway server, compared with those that made fewer calls or made them in parallel. You might be able to improve on many things, but you’ll never improve the speed of light!
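A rough back-of-the-envelope calculation shows why this matters. The sketch below assumes a 250 ms round trip and a limit of 6 parallel connections; both numbers are illustrative assumptions, not measurements, and the model ignores server processing time and bandwidth.

```python
def sequential_latency_ms(round_trip_ms, calls):
    """Each call waits for the previous one, so round trips add up."""
    return round_trip_ms * calls


def parallel_latency_ms(round_trip_ms, calls, max_parallel):
    """Calls go out in batches of up to max_parallel at a time."""
    batches = -(-calls // max_parallel)  # ceiling division
    return round_trip_ms * batches


# Assumed figures: 10 calls over a 250 ms round trip.
print(sequential_latency_ms(250, 10))   # 2500
print(parallel_latency_ms(250, 10, 6))  # 500
```

Under these assumptions the same ten calls take 2.5 seconds in sequence and half a second in parallel, and cutting the number of calls helps in both cases.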
What to measure
Measurement, or you may prefer to call it monitoring or observability, is an ongoing process – you can always measure more things. However, your time isn’t unlimited, so focus first on the services that affect your users the most. For example, if a web service is performing slowly, add further measurement inside that service, e.g. around the call to a database or around other related components. This way you find out what is slowing things down before you try to improve it.
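As a sketch of what "adding further measurement" inside a service can look like, the Python below times named sections of a request handler. The names `timed`, `timings_ms` and `handle_request` are hypothetical, and the `time.sleep` calls stand in for a real database query and rendering step; a real system would send these timings to a monitoring service rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory store of timings per section, in milliseconds.
timings_ms = defaultdict(list)


@contextmanager
def timed(name):
    """Record how long the wrapped block takes under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[name].append((time.perf_counter() - start) * 1000.0)


def handle_request():
    with timed("handle_request"):
        with timed("db_query"):
            time.sleep(0.01)   # stand-in for a database call
        with timed("render"):
            time.sleep(0.001)  # stand-in for building the response


handle_request()
for name, samples in timings_ms.items():
    print(f"{name}: {samples[0]:.1f} ms")
```

Comparing the inner timings with the outer one shows where the time goes, which tells you which component is worth optimising first.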
It’s not just real-time performance that matters. I’d encourage you to measure things inside your app or website too and send these back for further analysis – this is commonly called telemetry. Don’t just measure performance via telemetry, but also things like the way the user interacts with your service: you’ll be surprised what you find here. Of course, make sure that you do this in a privacy-compliant manner.
You may have noticed something here – I didn’t focus on disk and CPU metrics. Of course, you also want to measure these, but the primary focus should be on user experience.
You’ll also want to set up some form of log collection service. There are many commercial services for that these days, and it’s very handy to have all your logs in one place, rather than hunting around many servers.
When things go wrong
You’ll want to hook some of these measurements up to some form of alerting system and build an incident response plan, so that when something goes wrong at 2am, someone can sort it out quickly.
When you inevitably do have an outage, be honest with your users. Your new users will be much more tolerant if you tell them that all their enthusiasm means that your service is getting stretched. Many companies now publish reports online on what went wrong and how they’ll fix it when they have problems. This helps customers trust that things are being improved.
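One possible shape for an alert rule is sketched below in Python. The function name `should_alert`, the 5% threshold and the "three samples in a row" rule are all assumptions for illustration; most monitoring products let you configure something similar without writing code.

```python
def should_alert(error_rates_pct, threshold_pct=5.0, consecutive=3):
    """Return True if the most recent samples all exceed the threshold.

    Requiring several consecutive breaches avoids waking someone up
    for a single short blip.
    """
    if len(error_rates_pct) < consecutive:
        return False
    recent = error_rates_pct[-consecutive:]
    return all(rate > threshold_pct for rate in recent)


# Hypothetical per-minute error rates, as percentages.
print(should_alert([1.0, 9.0, 1.0, 2.0]))  # one spike only
print(should_alert([1.0, 6.0, 7.0, 8.0]))  # sustained problem
```

The first series has a single spike and stays quiet; the second stays above the threshold for three samples and would page whoever is on call.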
There are also a few things I’d recommend on the architecture to make it ready for scale, too:
- As far as possible, make things stateless – keep the state in a database or some form of blob storage.
- Deploy things using Infrastructure as Code.
- Build a robust CI/CD pipeline and process so that you can release and deploy code frequently.
With these steps you’ll be able to connect servers behind load balancers that can then auto-scale as load comes on rather than having to build bigger, more fragile servers – this is what people mean by scale out, not scale up.
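Here is a small Python sketch of what "stateless" means in practice. `SharedStore` is a made-up stand-in for a database or blob storage, and `Server` is a hypothetical web server instance; the point is only that neither server keeps per-user data in its own memory.

```python
class SharedStore:
    """Stand-in for a database or blob storage shared by every server."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class Server:
    """A server that holds a reference to shared storage, but no user state."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, user_id, item):
        # Read the state, change it, and write it back to the shared store.
        cart = self.store.get(user_id, [])
        self.store.put(user_id, cart + [item])
        return self.store.get(user_id)


store = SharedStore()
server_a = Server("a", store)
server_b = Server("b", store)

# The load balancer may send each request to a different server.
server_a.add_to_cart("user-1", "book")
cart = server_b.add_to_cart("user-1", "pen")
print(cart)
```

Because the cart lives in the shared store, either server can handle either request, so servers can be added or removed as load changes. A real database would also need a transaction or atomic update around the read and write, which this sketch leaves out.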
When you’ve got time, you may want to start shifting your code to “serverless”. This will make the overhead of building and running software quite a bit less, but there may be a reasonable amount of work to get to this point.
Do check the tools that your cloud provider gives you for recommendations – these will often have very useful and important advice for you, e.g. stopping you accidentally exposing private data to the internet. For Azure I’d recommend starting with Azure Advisor and Azure Security Center.
I could dive much deeper into security, as I think it is a crucial aspect too, but that would be a whole series of blog posts. As luck would have it, my colleague Tanya Janca has written a series all about this – check out Pushing left, like a boss.
Coming back to cost
So, what about the original question around cost? From my experience it’s basically a game of Whac-A-Mole. You’ll have things that pop up all over the place, and it’s a matter of focusing on whatever is costliest and working out how to reduce that. A couple of practical things that I’d suggest are:
- Tag your resources – this way you can see where you’re spending your money. A good starting point for Azure is here.
- Focus on the slowest-running services that are used a lot. If a service takes 200 milliseconds and you can drop this to 100 milliseconds, then you’ll roughly halve its compute cost (assuming that CPU is the limitation), and if it’s an end-user facing service, you’ll make your users happier too – win-win!
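To pick which service to look at first, you can rank them by total compute time rather than by duration alone. The Python sketch below uses invented service names and traffic figures, and assumes cost is proportional to CPU time, as in the example above.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    calls_per_day: int
    avg_ms: float

    def cpu_seconds_per_day(self):
        """Total compute time per day: how often it runs times how long it takes."""
        return self.calls_per_day * self.avg_ms / 1000.0


# Hypothetical services and traffic figures.
services = [
    Service("search", 5_000_000, 200.0),
    Service("login", 1_000_000, 50.0),
    Service("report", 10_000, 3000.0),
]

# Rank by total compute time, highest first.
ranked = sorted(services, key=Service.cpu_seconds_per_day, reverse=True)
for service in ranked:
    print(service.name, service.cpu_seconds_per_day())

# Effect of taking "search" from 200 ms down to 100 ms.
before = services[0].cpu_seconds_per_day()
after = Service("search", 5_000_000, 100.0).cpu_seconds_per_day()
print(after / before)
```

In this made-up data the "report" service is by far the slowest per call, but "search" uses the most compute overall because it runs so often, so that is where halving the duration saves the most.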
Good luck with your scaling – it’s a great problem to have!