Let’s discuss performance and stability goals for the rest of 2023.
The issues we experience from time to time with OCL Online are mostly related to the pace at which we are rolling out new features. During deployments we only run rudimentary, manual checks on staging and production to confirm that things work. There are no automated tests to validate how changes affect performance in real environments such as staging and production.
We discover that things are slow or unstable only when usage of the system increases, and often we hear about it from our users rather than from internal monitoring.
1. First and foremost, we need to add better internal monitoring for performance hiccups to our setup.
1.1. Alerts for slow-running requests (> 2s).
1.2. Alerts for long-running background tasks (> 20 min), automatically stopping tasks that exceed the limit unless they have been approved.
1.3. Alerts for a growing number of background tasks in queues, or tasks that sit in queues too long. Automatically reject new tasks when queues are full to prevent out-of-memory issues.
1.4. Fix alerts for increased memory usage (> 85%).
1.5. Add monitoring for slow ES queries. Currently we have no insight into how queries perform, so we cannot tune them.
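As a concrete starting point for the slow-request alerts, a WSGI middleware can time every request and log anything over the threshold. This is only a sketch: the class and logger names below are made up, and a real setup would forward the warning to our alerting channel rather than a plain log.

```python
import logging
import time

logger = logging.getLogger("perf")

SLOW_REQUEST_THRESHOLD = 2.0  # seconds, matching the 2s alert threshold above


class SlowRequestMiddleware:
    """Hypothetical WSGI middleware that logs requests slower than a threshold."""

    def __init__(self, app, threshold=SLOW_REQUEST_THRESHOLD):
        self.app = app
        self.threshold = threshold

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > self.threshold:
                # In production this warning would feed the alerting pipeline.
                logger.warning(
                    "Slow request: %s %s took %.2fs",
                    environ.get("REQUEST_METHOD"),
                    environ.get("PATH_INFO"),
                    elapsed,
                )
```

For the ES side, Elasticsearch ships with a built-in per-index search slow log (e.g. the index.search.slowlog.threshold.query.warn setting), which would give us the missing insight into query timings without custom code.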
2. Monitor performance when rolling out new code.
2.1. Identify the 50 most popular API requests, i.e. the ones that would affect the most users if they ran slower.
2.2. Implement performance tests that hit the most popular API endpoints and can be run against any environment. Set the acceptance threshold at no slower than the last run plus 10%, to allow for fluctuations.
2.3. Migrate deployments from Bamboo to AWS CodeDeploy to get support for rolling updates: deploy one replica at a time, proceeding only if the performance tests pass and rolling back otherwise.
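The "+10% over the last run" acceptance threshold can be expressed as a small script that records a baseline per endpoint and compares each new run against it. This is a sketch under assumed file and function names; a real version would live in the deployment pipeline and hit the endpoints from 2.1.

```python
import json
import time
import urllib.request
from pathlib import Path

BASELINE_FILE = Path("perf_baseline.json")  # hypothetical baseline location
TOLERANCE = 1.10  # allow +10% over the last recorded run for fluctuations


def time_endpoint(url, attempts=3):
    """Return the best (lowest) response time over a few attempts."""
    timings = []
    for _ in range(attempts):
        start = time.monotonic()
        urllib.request.urlopen(url).read()
        timings.append(time.monotonic() - start)
    return min(timings)


def check_endpoints(urls, timer=time_endpoint):
    """Time each URL; report ones slower than baseline * TOLERANCE."""
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    failures = []
    for url in urls:
        elapsed = timer(url)
        previous = baseline.get(url)
        if previous is not None and elapsed > previous * TOLERANCE:
            failures.append((url, previous, elapsed))
        baseline[url] = elapsed  # new measurement becomes the next baseline
    BASELINE_FILE.write_text(json.dumps(baseline))
    return failures
```

A rolling deployment would run this after each replica is swapped and trigger a rollback if the failure list is non-empty.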
3. Increase the stability of services.
3.1. Add swap space for background task workers, as they tend to use up all available RAM under heavy load.
3.2. Set hard limits on the memory used by individual services. Currently a single service consuming all available memory can affect the others.
3.3. Introduce Redis and ES replicas for improved stability and performance.
3.4. Introduce fair-use limits on background tasks, i.e. stop tasks that run too long (e.g. > 20 min) unless they are approved, and reject new tasks when queues build up too much instead of waiting for services to fail.
3.5. Upgrade Terraform and AWS resources to make rolling out host OS upgrades easier.
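Several of the items above boil down to refusing work early instead of failing late. A minimal sketch of the queue-side fair-use limit, with a made-up class name and cap, would be a bounded queue that rejects submissions once full rather than growing until the worker runs out of memory:

```python
from collections import deque


class BoundedTaskQueue:
    """Illustrative bounded queue: reject new tasks up front when full."""

    def __init__(self, max_length=1000):  # cap is an example, not a tuned value
        self.max_length = max_length
        self._queue = deque()
        self.rejected = 0  # counter that monitoring/alerting could watch

    def submit(self, task):
        """Accept `task` if there is capacity; otherwise reject it immediately."""
        if len(self._queue) >= self.max_length:
            self.rejected += 1
            return False
        self._queue.append(task)
        return True

    def next_task(self):
        """Hand the oldest task to a worker, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

The same "fail fast" shape applies to the per-service memory limits: a hard cap that kills or rejects one service's excess load is preferable to letting it starve its neighbours.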
4. Investigate possible UI changes so that we respond to users faster. Here are a few ideas worth considering.
4.1. Adjust global search to limit its scope to what users find most interesting. It may not make sense to run 5 queries for every search if users hardly ever switch to the Mappings, Sources, Users, or Orgs tabs.
4.2. Identify long-running tasks and make them asynchronous, with the UI notifying users when results are ready and letting them track progress while doing other things.
4.3. Set up a CDN for static content to improve response times in other parts of the world.
4.4. Deploy replicas of some services in another AWS region to improve response times in, for example, Africa (this significantly increases AWS costs, by 20-50%).
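For the asynchronous-tasks idea, the UI needs something to poll for progress. Below is a minimal in-process sketch of the submit/poll pattern; a real implementation would use our existing background task infrastructure, and all names here are made up:

```python
import threading
import uuid

# Hypothetical in-memory job registry: job_id -> state/progress/result.
_jobs = {}


def submit(work, total_steps):
    """Start `work(step)` in the background; return a job id the UI can poll."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"state": "PENDING", "progress": 0}

    def run():
        _jobs[job_id]["state"] = "RUNNING"
        for step in range(total_steps):
            work(step)
            _jobs[job_id]["progress"] = int(100 * (step + 1) / total_steps)
        _jobs[job_id]["state"] = "DONE"

    threading.Thread(target=run, daemon=True).start()
    return job_id


def status(job_id):
    """What the UI polls (or receives via notifications) to render progress."""
    return _jobs[job_id]
```

The UI can then show a progress bar from `status(job_id)["progress"]` and fire a notification once the state flips to DONE, while the user keeps working elsewhere.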
5. Improve the performance and resource usage of bulk imports and exports.
5.1. Do not use RAM to store uploads and exports; store them on disk and use streams.
5.2. Make bulk tasks resumable.
5.3. Use a CDN to serve exports.
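For 5.1, the key change is copying the upload in fixed-size chunks so that memory use stays constant regardless of file size. A minimal sketch (the chunk size and function name are arbitrary):

```python
CHUNK_SIZE = 64 * 1024  # 64 KiB per read; example value, not a tuned one


def save_upload(stream, dest_path):
    """Stream an upload to disk without buffering the whole file in RAM."""
    bytes_written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:  # empty read signals end of stream
                break
            out.write(chunk)
            bytes_written += len(chunk)
    return bytes_written
```

The same chunked approach works in reverse for exports: read the file from disk in chunks and stream them to the client (or to the CDN origin), rather than loading the export into memory first.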
Let’s discuss this on today’s architecture call. I can elaborate on any of these points if needed.
We’ll continue the discussion here.