Let’s discuss performance and stability goals for the rest of 2023.
The issues we experience from time to time with OCL Online are mostly related to the pace at which we are rolling out new features. During deployments we only run rudimentary, manual checks on staging and production to confirm that things work. There are no automated tests to validate how changes affect performance in real environments such as staging and production.
We discover that things are slow or unstable only when usage of the system increases, and often we hear about it from our users rather than from internal monitoring.
1. First and foremost, we need to add better internal monitoring for performance hiccups to our setup.
1.1. Alerts for slow-running requests (> 2s).
1.2. Alerts for long-running background tasks (> 20 min), automatically stopping tasks that exceed the limit unless they have been approved.
1.3. Alerts for a growing number of background tasks in queues, or tasks that sit in queues too long. Automatically reject new tasks when queues are full to prevent out-of-memory issues.
1.4. Fix alerts for increased memory usage (> 85%).
1.5. Add monitoring for slow ES queries. Currently we have no insight into how queries perform, so we cannot tune them.
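As a concrete starting point for the slow-request alerts, a WSGI middleware can time every request and log anything over the threshold. This is only a sketch: the class and logger names below are made up, and a real setup would forward the warning to our alerting channel rather than a plain log.

```python
import logging
import time

logger = logging.getLogger("perf")

SLOW_REQUEST_THRESHOLD = 2.0  # seconds, matching the 2s alert threshold above


class SlowRequestMiddleware:
    """Hypothetical WSGI middleware that logs requests slower than a threshold."""

    def __init__(self, app, threshold=SLOW_REQUEST_THRESHOLD):
        self.app = app
        self.threshold = threshold

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > self.threshold:
                # In production this warning would feed the alerting pipeline.
                logger.warning(
                    "Slow request: %s %s took %.2fs",
                    environ.get("REQUEST_METHOD"),
                    environ.get("PATH_INFO"),
                    elapsed,
                )
```

For the ES side, Elasticsearch ships with a built-in per-index search slow log (e.g. the index.search.slowlog.threshold.query.warn setting), which would give us the missing insight into query timings without custom code.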
2. Monitor performance when rolling out new code.
2.1. Identify the 50 most popular API requests, i.e. the ones that would affect the most users if they ran slower.
2.2. Implement performance tests that hit the most popular API endpoints and can be run against any environment. Set the acceptance threshold at no slower than the last run plus 10%, to allow for fluctuations.
2.3. Migrate deployments from Bamboo to AWS CodeDeploy to get support for rolling updates: deploy one replica at a time, proceeding only if the performance tests pass and rolling back otherwise.
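The "+10% over the last run" acceptance threshold can be expressed as a small script that records a baseline per endpoint and compares each new run against it. This is a sketch under assumed file and function names; a real version would live in the deployment pipeline and hit the endpoints from 2.1.

```python
import json
import time
import urllib.request
from pathlib import Path

BASELINE_FILE = Path("perf_baseline.json")  # hypothetical baseline location
TOLERANCE = 1.10  # allow +10% over the last recorded run for fluctuations


def time_endpoint(url, attempts=3):
    """Return the best (lowest) response time over a few attempts."""
    timings = []
    for _ in range(attempts):
        start = time.monotonic()
        urllib.request.urlopen(url).read()
        timings.append(time.monotonic() - start)
    return min(timings)


def check_endpoints(urls, timer=time_endpoint):
    """Time each URL; report ones slower than baseline * TOLERANCE."""
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    failures = []
    for url in urls:
        elapsed = timer(url)
        previous = baseline.get(url)
        if previous is not None and elapsed > previous * TOLERANCE:
            failures.append((url, previous, elapsed))
        baseline[url] = elapsed  # new measurement becomes the next baseline
    BASELINE_FILE.write_text(json.dumps(baseline))
    return failures
```

A rolling deployment would run this after each replica is swapped and trigger a rollback if the failure list is non-empty.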
3. Increase the stability of services.
3.1. Add swap space for background task workers, as they tend to use up all available RAM under heavy load.
3.2. Set hard limits on the memory used by individual services. Currently a single service consuming all available memory can affect the others.
3.3. Introduce Redis and ES replicas for improved stability and performance.
3.4. Introduce fair-use limits on background tasks, i.e. stop tasks that run too long (e.g. > 20 min) unless they are approved, and reject new tasks when queues build up too much instead of waiting for services to fail.
3.5. Upgrade Terraform and AWS resources to make rolling out host OS upgrades easier.
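Several of the items above boil down to refusing work early instead of failing late. A minimal sketch of the queue-side fair-use limit, with a made-up class name and cap, would be a bounded queue that rejects submissions once full rather than growing until the worker runs out of memory:

```python
from collections import deque


class BoundedTaskQueue:
    """Illustrative bounded queue: reject new tasks up front when full."""

    def __init__(self, max_length=1000):  # cap is an example, not a tuned value
        self.max_length = max_length
        self._queue = deque()
        self.rejected = 0  # counter that monitoring/alerting could watch

    def submit(self, task):
        """Accept `task` if there is capacity; otherwise reject it immediately."""
        if len(self._queue) >= self.max_length:
            self.rejected += 1
            return False
        self._queue.append(task)
        return True

    def next_task(self):
        """Hand the oldest task to a worker, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

The same "fail fast" shape applies to the per-service memory limits: a hard cap that kills or rejects one service's excess load is preferable to letting it starve its neighbours.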
4. Investigate possible UI changes so that we respond to users faster. Here are a few ideas worth considering.
4.1. Adjust global search to limit its scope to what users find most interesting. It may not make sense to run 5 queries for every search if users hardly ever switch to the Mappings, Sources, Users, or Orgs tabs.
4.2. Identify long-running tasks and make them asynchronous, with the UI notifying users when results are ready and letting them track progress while doing other things.
4.3. Set up a CDN for static content to improve response times in other parts of the world.
4.4. Deploy replicas of some services in another AWS region to improve response times in, for example, Africa (this significantly increases AWS costs, by 20-50%).
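For the asynchronous-tasks idea, the UI needs something to poll for progress. Below is a minimal in-process sketch of the submit/poll pattern; a real implementation would use our existing background task infrastructure, and all names here are made up:

```python
import threading
import uuid

# Hypothetical in-memory job registry: job_id -> state/progress/result.
_jobs = {}


def submit(work, total_steps):
    """Start `work(step)` in the background; return a job id the UI can poll."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"state": "PENDING", "progress": 0}

    def run():
        _jobs[job_id]["state"] = "RUNNING"
        for step in range(total_steps):
            work(step)
            _jobs[job_id]["progress"] = int(100 * (step + 1) / total_steps)
        _jobs[job_id]["state"] = "DONE"

    threading.Thread(target=run, daemon=True).start()
    return job_id


def status(job_id):
    """What the UI polls (or receives via notifications) to render progress."""
    return _jobs[job_id]
```

The UI can then show a progress bar from `status(job_id)["progress"]` and fire a notification once the state flips to DONE, while the user keeps working elsewhere.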
5. Improve the performance and resource usage of bulk imports and exports.
5.1. Do not use RAM to store uploads and exports; store them on disk and use streams.
5.2. Make bulk tasks resumable.
5.3. Use a CDN to serve exports.
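For 5.1, the key change is copying the upload in fixed-size chunks so that memory use stays constant regardless of file size. A minimal sketch (the chunk size and function name are arbitrary):

```python
CHUNK_SIZE = 64 * 1024  # 64 KiB per read; example value, not a tuned one


def save_upload(stream, dest_path):
    """Stream an upload to disk without buffering the whole file in RAM."""
    bytes_written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:  # empty read signals end of stream
                break
            out.write(chunk)
            bytes_written += len(chunk)
    return bytes_written
```

The same chunked approach works in reverse for exports: read the file from disk in chunks and stream them to the client (or to the CDN origin), rather than loading the export into memory first.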
Let’s discuss this on today’s architecture call. I can elaborate on any of these points if needed.
We’ll continue the discussion here.