If you’ve worked with Cosmos at any level of scale, you’ve probably come across this issue. A good starting point is this article on the Microsoft site.
What is a 429?
An error occured: CosmosContainerService: An error occurred when patching item in cosmos MyDatabase, Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: TooManyRequests (429)
In fact, if you’re using Cosmos to its capacity, you are expected to get a small number of 429 errors (have a look at the linked MS document). By default, these are retried, and typically result in success. The time to worry is when you get enough that the retry fails and, therefore, so does the call.
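To make the retry behaviour concrete, here is a minimal sketch of the general pattern: retry a throttled call a limited number of times, waiting for the server's suggested delay where one is given. This is not the SDK's actual implementation; `ThrottledError`, `call_with_retry` and their parameters are hypothetical names invented for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response; not a real SDK type."""

    def __init__(self, retry_after_seconds=None):
        super().__init__("TooManyRequests (429)")
        self.retry_after_seconds = retry_after_seconds


def call_with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on ThrottledError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts:
                # Retries exhausted: this is the case where the call itself fails.
                raise
            # Prefer the server's hint; otherwise back off exponentially with jitter.
            delay = err.retry_after_seconds
            if delay is None:
                delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay)
            sleep(delay)
```

A handful of throttled attempts that eventually succeed are invisible to the caller; it is only when every attempt is throttled that the exception surfaces, which is the situation the error message above describes.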
The crux of the error is that you’re using more Request Units or RUs than you are allowed. You can increase the number, but there’s a cost implication.
RUs
An RU, or Request Unit, is basically a simplified version of the spec of your Cosmos DB. This article explains it very well. In short, it means that the resources for your system and the throughput are lumped into a single, chargeable metric.
At the time of writing, 1 RU was just under $0.10 / month - so 20,000 RUs would be around $2,000 per month. There are a couple of things to note here that matter: firstly, as a rule of thumb, a 1KB update is equal to 1 RU (we’ll come back to this later as it’s not always true; strictly, Microsoft’s 1 RU figure is for a point read of a 1KB item, and writes cost more); and secondly, you can only decide how many RUs to use for a provisioned instance.
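The cost arithmetic above can be sketched as follows. The price constant is the approximate figure quoted in this post, not a current or region-specific price, and `monthly_cost` is a hypothetical helper.

```python
# Approximate price per provisioned RU per month, taken from the figure above.
# This is an assumption for illustration; check the pricing page for real numbers.
PRICE_PER_RU_MONTH = 0.10


def monthly_cost(provisioned_rus, price_per_ru=PRICE_PER_RU_MONTH):
    """Estimate the monthly bill for a given provisioned RU figure."""
    return provisioned_rus * price_per_ru


print(f"20,000 RUs: about ${monthly_cost(20_000):,.0f} per month")
```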
Investigation
The first port of call is to have a look at the Insights tab for the Cosmos DB. The Overview gives you an idea as to any issues:
Here you can see the total requests versus throttled requests. Normalised RU consumption at 100% isn’t necessarily a bad thing; in fact, it means you’re fully utilising the RUs that you’re paying for. You can then drill into the metrics, and see (for example) the total request units:
This graph shows the number of request units. Be careful of the granularity here: you pay for RUs per second, but the graph is per minute, so divide by 60 to get the per-second figure.
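As a quick sketch of that conversion (the function names and the sample numbers are made up for illustration):

```python
def average_rus_per_second(rus_per_minute):
    """Convert a per-minute total from the graph into an average per-second rate."""
    return rus_per_minute / 60


def utilisation(rus_per_minute, provisioned_rus_per_second):
    """Fraction of the provisioned throughput used, averaged over the minute."""
    return average_rus_per_second(rus_per_minute) / provisioned_rus_per_second


# Hypothetical reading: 540,000 RUs in one minute against 10,000 RU/s provisioned.
print(average_rus_per_second(540_000))
print(utilisation(540_000, 10_000))
```

Bear in mind that this is an average: a minute that averages well under the limit can still contain individual seconds that exceeded it and were throttled.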
How to Fix
There are essentially two ways to correct this: increase the RU limit (scale), or decrease your calls to it. Firstly, if you’re not using a provisioned instance and you’re coming across this error, then you probably need to move to one. If you are, then you can increase the provisioning, but this can get very expensive very quickly.
Some strategies that you can use are:
- Reduce the number of reads - for example, cache data outside of Cosmos, or use strategies to avoid reading. As an example, instead of reading and updating, you could just update / add a new record.
- Reduce the number of updates - for example, batch updates into bigger chunks, or delay / reduce the frequency of your updates. This has the added advantage that you get a “discount” for larger updates; very roughly speaking, 1K = 1RU, however, 100K = 10RUs, so a larger update can significantly reduce the count.
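The first strategy, caching outside of Cosmos, could look something like the sketch below. It is a minimal in-memory read-through cache with a time-to-live, not production code; `ReadThroughCache` and the `loader` callback (which stands in for whatever actually reads from Cosmos) are hypothetical names.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; only call the loader on a miss or expiry."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader      # hypothetical: the function that reads from Cosmos
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no RUs spent
        value = self._loader(key)  # cache miss: one real read
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example after writing a new version of the item."""
        self._entries.pop(key, None)
```

The trade-off is staleness: a reader may see data up to `ttl_seconds` old, so this only suits data that can tolerate that.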
Scaling
If you decide to scale, then you can do so via the Scale blade:
Keep an eye on the Cost Per Month figure - that will give you a maximum.
Autoscale means that it can scale down when demand is low, but it cannot scale up past the allotted figure - it will also only scale within a range (between 10% of the maximum you set and the maximum itself).
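To illustrate the range, here is a small sketch that assumes the autoscale floor is 10% of the configured maximum, as the Azure documentation describes. The function names are hypothetical, and this models only the throughput range, not how autoscale is billed.

```python
def autoscale_range(max_rus):
    """Return (floor, ceiling), assuming the floor is 10% of the maximum."""
    return max_rus // 10, max_rus


def available_rus(demand_rus, max_rus):
    """Throughput autoscale would settle on for a given demand, clamped to the range."""
    floor, ceiling = autoscale_range(max_rus)
    return min(max(demand_rus, floor), ceiling)


print(autoscale_range(10_000))
print(available_rus(25_000, 10_000))  # demand above the maximum is still capped
```

Demand above the ceiling is where the 429s come back: autoscale will not go past the maximum you set.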