Good Job concurrency lock contention

Probably a mediocre job in retrospect

This week I was running a long backfill task in GoodJob that spawned a large number of workers operating on a large number of database records. This kind of thing can get a bit stressful and takes some planning to avoid straining resources, not to mention too much babysitting. I’m more familiar with Sidekiq, but I think the two are pretty comparable.

The details of those workers might be worth going into later, since they involved LLMs and Amazon Bedrock calls, but the takeaway I want to document is about over-engineering for performance.

I’d optimistically added a perform_limit with an associated key to stop too many workers running at once, hogging the queue and/or stressing out Bedrock and triggering rate limits. At the same time, I kept running into resource problems: the database was overloaded, and the added latency was enough to hurt production and worry DevOps. After going around in circles for a while, reducing concurrent jobs and so on, I realised the concurrency locks themselves were causing the problem. We had a good number of workers available, and while 5 were happily performing the task, all the rest were spamming queries to ask whether the lock was available. Having workers hit Postgres over and over to check on locks smells a bit like poor architecture on GoodJob’s part, but I’ll leave that question open for another day.

I’d incorrectly diagnosed the number of concurrent workers as the problem, when in fact the limit itself was causing the database strain. This wasn’t something I could see myself; it took a bit of coordination with DevOps to identify which processes were the bottleneck and what they were trying to do. The number of workers available was a sufficient throttling mechanism, and adding a concurrency limit and key simply caused lock contention and a flurry of activity while the workers tried to negotiate who was using the key. As soon as I removed those restrictions, the workers chomped through the jobs easily.
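For context, the limit and key are the perform_limit and key options of GoodJob’s good_job_control_concurrency_with. To get a feel for why the limit cost more than it saved, here is a toy model in plain Ruby. It is only a sketch of the situation as I understood it, not GoodJob’s actual locking or retry logic: the simulate method, the Result struct, the idea of a "tick", and the assumption that every free worker makes exactly one lock check per tick are all made up for illustration.

```ruby
# Toy model only: counts hypothetical "is the lock free?" checks per tick.
# It does NOT reproduce GoodJob's real locking, retry, or backoff behaviour.
Result = Struct.new(:ticks, :lock_checks, keyword_init: true)

def simulate(workers:, jobs:, limit: nil, ticks_per_job: 3)
  remaining = jobs   # jobs that have not started yet
  running = []       # ticks left for each in-flight job
  ticks = 0
  lock_checks = 0

  while remaining > 0 || running.any?
    ticks += 1
    free_workers = workers - running.size

    # Every free worker tries to claim a job this tick.
    free_workers.times do
      break if remaining.zero?

      if limit
        lock_checks += 1               # assumed: one query against the shared key
        next if running.size >= limit  # limit reached, worker tries again next tick
      end

      running << ticks_per_job
      remaining -= 1
    end

    # Advance in-flight jobs by one tick and drop the finished ones.
    running = running.map { |left| left - 1 }.reject(&:zero?)
  end

  Result.new(ticks: ticks, lock_checks: lock_checks)
end

# 20 workers capped at 5 by a limit, versus simply running 5 workers.
limited = simulate(workers: 20, jobs: 100, limit: 5)
pooled  = simulate(workers: 5, jobs: 100)

puts "limit + key: #{limited.lock_checks} lock checks over #{limited.ticks} ticks"
puts "pool only:   #{pooled.lock_checks} lock checks over #{pooled.ticks} ticks"
```

In this model both setups finish the backlog in the same number of ticks, because at most 5 jobs ever run at once either way. The only difference is that the limited setup pays for a lock check every time an idle worker comes back to ask, which is roughly the flurry of activity DevOps spotted.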

Completely unrelated TIL