Automatic client retries in gRPC

Duration: 14 mins

Networks fail, and you should expect that. Learn how to defend against it.

Instructor

Chris Shepherd

Transcript

Hey folks, so in this video we're going to look at automatic retries. This is a really great way to keep your client available even when the server isn't, or is having issues. For this example we're going to create another RPC, and this one will be a flaky RPC: every so often it will return an error, so we can simulate error scenarios and then show that the retries are working as expected.

We'll create an RPC called Flakey that will take a FlakeyRequest and return a FlakeyResponse. This won't take any request parameters; it'll have an empty request and an empty response. So let's go ahead and generate our code, and then we can implement that on the server as well.

In this RPC implementation, what we can do is generate a random number between 0 and 2. If it is 0 then we'll return a successful response, and if it's not then we'll return an error response. That way roughly 2 in every 3 requests will fail, and that's how we can test our automatic retries. We'll use the rand package to generate the random number and then check whether it's zero. If it's not, we'll print a line saying "error response returned" and return an Internal status.
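
As a rough sketch, the server-side handler might look something like this. The import path, the flakeypb package alias and the server type are assumptions, not taken from the generated code shown in the video:

```go
import (
	"context"
	"log"
	"math/rand"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	flakeypb "example.com/flakey/proto" // hypothetical import path for the generated code
)

// Flakey fails roughly two times out of three so we can exercise the
// client's retry policy.
func (s *server) Flakey(ctx context.Context, _ *flakeypb.FlakeyRequest) (*flakeypb.FlakeyResponse, error) {
	// rand.Intn(3) returns 0, 1 or 2 with equal probability.
	if rand.Intn(3) != 0 {
		log.Println("error response returned")
		return nil, status.Error(codes.Internal, "flakey error")
	}
	return &flakeypb.FlakeyResponse{}, nil
}
```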

Now let's add a client for the automatic retries. For our service config, I'll share a configuration I created earlier and we can walk through it. The method config looks the same as before - it has the name of the service. One thing to notice: before we were passing in the name of the RPC as well but we don't always have to do that. If we don't provide that and we just provide the service name, this config will apply to all the RPCs on that service.

Then we can add our retry policy as well. This specifies a maximum of four attempts, which is the original call plus up to three retries. The initial backoff before the first retry will be 100 milliseconds, and the maximum backoff, the longest we'll ever wait between retries, is one second. Then there's the backoff multiplier: because this retries with exponential backoff, the current backoff is multiplied by this value after each attempt, so the time it waits between retries increases each time.

Finally, we can provide a list of retryableStatusCodes. The client will only retry if one of these status codes is returned, so in this case, if an invalid argument came back, the client wouldn't retry. This is just a good way to specify which errors are retryable and which ones are not on our service.
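
For reference, the whole service config looks roughly like this. The service name flakey.FlakeyService and the exact set of retryable codes are assumptions; the key point is that the name entry only has a service and no method, so the retry policy applies to every RPC on that service:

```json
{
  "methodConfig": [{
    "name": [{ "service": "flakey.FlakeyService" }],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["INTERNAL", "UNAVAILABLE"]
    }
  }]
}
```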

One thing we can do is add this to the structs we created in the previous video, just to make it easier to create these objects. We have our method config, and inside that we have our name and our timeout like we had before, and now what we want to add is a retry policy. The first field is maxAttempts, and we're including omitempty on these JSON tags so that if a field isn't specified when we create an instance of the config, it won't get marshalled and added to the JSON.

If we left that off we might cause some unexpected issues, for example a timeout rendered as an empty string, which we probably don't want. We'll have initialBackoff as a string, maxBackoff as a string, backoffMultiplier as a number, and then finally our retryableStatusCodes as a slice of strings.
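
As a sketch, the structs might end up looking something like this. The Name and Timeout fields are assumed to carry over from the previous video, and the JSON tags use the camelCase names the gRPC service config expects:

```go
type ServiceConfig struct {
	MethodConfig []MethodConfig `json:"methodConfig,omitempty"`
}

type MethodConfig struct {
	Name        []Name       `json:"name,omitempty"`
	Timeout     string       `json:"timeout,omitempty"`
	RetryPolicy *RetryPolicy `json:"retryPolicy,omitempty"`
}

type Name struct {
	Service string `json:"service,omitempty"`
	Method  string `json:"method,omitempty"` // leave empty to apply the config to every RPC on the service
}

type RetryPolicy struct {
	MaxAttempts          int      `json:"maxAttempts,omitempty"`
	InitialBackoff       string   `json:"initialBackoff,omitempty"`
	MaxBackoff           string   `json:"maxBackoff,omitempty"`
	BackoffMultiplier    float64  `json:"backoffMultiplier,omitempty"`
	RetryableStatusCodes []string `json:"retryableStatusCodes,omitempty"`
}
```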

For our implementation, we'll set max attempts to 4, initial backoff to 0.1 seconds, max backoff to 1 second, backoff multiplier to 2, and then we'll set the retryable status codes. Then we'll marshal our service config with json.Marshal, create our gRPC connection and client, and pass in our service config, converted to a string since json.Marshal gives us a slice of bytes.
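
A minimal sketch of that client setup, assuming the structs above plus the server address, service name and flakeypb generated package from earlier (all assumptions); the service config JSON is passed in through grpc.WithDefaultServiceConfig:

```go
// Assumes imports: encoding/json, log, google.golang.org/grpc,
// google.golang.org/grpc/credentials/insecure, and the generated flakeypb package.
cfg := ServiceConfig{
	MethodConfig: []MethodConfig{{
		// Only the service is named, so the retry policy applies to all of its RPCs.
		Name: []Name{{Service: "flakey.FlakeyService"}},
		RetryPolicy: &RetryPolicy{
			MaxAttempts:          4,
			InitialBackoff:       "0.1s",
			MaxBackoff:           "1s",
			BackoffMultiplier:    2,
			RetryableStatusCodes: []string{"INTERNAL", "UNAVAILABLE"},
		},
	}},
}

b, err := json.Marshal(cfg)
if err != nil {
	log.Fatal(err)
}

conn, err := grpc.Dial("localhost:50051",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	// json.Marshal returns a []byte, so convert it to a string here.
	grpc.WithDefaultServiceConfig(string(b)),
)
if err != nil {
	log.Fatal(err)
}
defer conn.Close()

client := flakeypb.NewFlakeyServiceClient(conn)
```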

Now let's create our client, and just like before we can call our Flakey RPC, which just takes an empty request and returns an empty response. Because this RPC should work one in every three times in theory, and we have up to four attempts configured, we should hopefully see a successful response, but we'll probably see some errors as well.
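
The call itself is nothing special: the retries happen inside the client, so our code only ever sees the final outcome. A sketch, reusing the assumed flakeypb names from above:

```go
resp, err := client.Flakey(context.Background(), &flakeypb.FlakeyRequest{})
if err != nil {
	// All attempts failed; the last status (Internal here) is what comes back.
	log.Fatalf("Flakey failed after retries: %v", err)
}
log.Printf("Flakey succeeded: %v", resp)
```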

Running our tests, we can see that the server returned four error responses, and because we have a maximum of four attempts we eventually received the internal error back. Running it again produced the same result. On our third run, we received two error responses and then eventually got a successful response. Just to prove that this is working, we could also configure the max attempts to be 10, which worked on the second run.

To demonstrate configuring different backoffs, let's try this with a one second initial backoff and a 10 second max backoff. Now you can see that there's roughly a second in between each retry attempt. This shows how we can configure automatic retries to make our clients much more resilient to failures and handle error scenarios better.

Obviously some error codes are more retryable than others. I wouldn't expect to see something like invalid argument in here, because if it's a client error, if the client's request is invalid, then there's not much point in retrying. But in cases where the server might be flaky, returning internal errors or being unavailable at certain times, this is really useful and can help ensure the client remains stable.

Feel free to play around with different values in the retry policy and see how this would work in your own projects. I'll see you in the next video.