A lesson learned on gRPC client-side load balancing

Background 

This post walks through the steps I took to debug load-balancing issues for a client/server written in Go, using gRPC, and running in Kubernetes. It starts by identifying the problem, attempting to fix it, realizing why that didn't actually fix it, and finally understanding how gRPC load balancing was interacting with the Kubernetes service. Rather than a tidy issue/solution format, I walk you through my process. So if you're into that, great! If not, skip to the last section for the main points.

Anyways, I was given the task of writing a load tester for one of our microservices, which takes in a request for a certain type of product recommendation and returns the products. We wanted to hit the service at ~60k requests/second. To do this, I built a load tester in Go that uses gRPC to talk to the service. The load tester consists of a client, run locally from my laptop, that port-forwards to the attacker pods (the load tester servers) in a Kubernetes cluster and fires them off. After the attackers have finished sending requests to the recommendations service, the client collects the responses from the attackers and spits out statistics for the test.

Definitions 

Load balancing distributes tasks to a set of workers. In this case, we want to distribute incoming requests evenly over the set of service pods that we are hitting. See gRPC's documentation on load balancing.

Client-side load balancing is when the caller determines how to distribute the tasks, rather than a layer in front of the server taking all incoming requests/tasks and distributing them.

gRPC is a modern, open-source, high-performance Remote Procedure Call (RPC) framework developed by Google. It's known for being ~fast~. (See grpc-go.)

The problem

The issue I ran into while developing the load tester was that when I increased the number of threads each attacker used, and the rate of requests per second, the results contained a lot of concurrency/resource-exhausted errors. At first I tried scaling out the pods, and then I tried scaling up the resources on each pod, from requesting 1 CPU/1GB of memory to 14 CPU/50GB per pod (each pod on its own node). Each time I tried to ramp up the tests, I would get a significant number of errors like

"response type": "rpc error: code = ResourceExhausted desc = Too many concurrent requests. Please retry later."

:( 

Identifying the source of the problem 

We knew that the pods' scale/size wasn't the main issue here, since tuning them gave no change in the results. Since the stats were coming from the attackers, I knew that each attacker was sending requests and receiving responses. The next place to look was the service pods we were attacking. The Prometheus metrics from those pods showed something interesting: we were only hitting a few of the service pods, and hitting those hard, while some service pods were never hit at all. This put an unbalanced load on the service pods. Istio/Envoy does not sit in front of the service we were testing, so there was no server-side load balancing.

Testing with a low send rate, the results from the service were 

port requests
4000 62
4001 62
4002 0
4003 0
4004 29
4005 0
4006 0
4007 31
4008 0
4009 0
4010 31

The port number isn't interesting here; it just represents one of the service pods we are testing. The takeaway from these results is that the requests the attackers send to the service are not being distributed, causing an uneven load.

The Code

My original code had a few problems that caused the undistributed sends. In an attempt to add distribution, we added load-balancing configs and more connections.

Connections and client-side load balancing configs

At first, we thought that creating the connection only once was the issue, so we moved the connection into the thread loop that fires off requests to the pods we were testing. Additionally, we set up the client-side load balancer with the configs. We also set up the DNS resolver and kuberesolver in an attempt to resolve the addresses of each of the pods in the service, given the service name. Since kuberesolver wasn't changing the results, we kept only the DNS resolver. In practice this just means prefixing the address we dial with "dns:///" (so if the address is pod-service-name, we dial dns:///pod-service-name).

So we changed three things, basically simultaneously:

* created a new gRPC connection for each thread

* added DNS resolver  (dns:///{address_to_dial})

* added the round robin gRPC config 

Setting up a client-side load balancer 

There are a few blog posts and sets of documents about setting up a client-side gRPC load balancer. One lesson I learned is that the gRPC library has changed quite a bit across versions, especially around the load-balancing configs. I read a blog that suggested using grpc.WithBalancerName() or a JSON format that was outdated and no longer reflected the protobuf message in the most up-to-date version. Pick-first is the default, which was obviously not working well for my use case; instead, I wanted to set the load balancer to round-robin. So the following was added to the Dial options.

grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`)

*note: you could generate the proto code in your repo so this isn't as ugly, or you can just stick with the JSON
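For context, here's roughly what the Dial call looked like with both the DNS resolver and the round-robin config in place. This is a minimal sketch; the target name and port are placeholders, not our actual service address.

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// dns:/// tells gRPC to use its DNS resolver for the target, and the
	// default service config selects the round_robin balancing policy.
	conn, err := grpc.Dial(
		"dns:///pod-service-name:4000", // placeholder service name/port
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}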



*note: I always have to be careful with contexts and cancellations once a connection is created inside the thread loop. Make sure you don't cancel the connection before it should end.
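To make that concrete, here's a rough sketch of the per-thread version, with placeholder names; the point is that each goroutine owns its connection's lifetime, and contexts are cancelled per call, never anywhere near the connection.

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// worker sketches one attacker thread: it creates and closes its own
// connection, so nothing outside the goroutine cancels it mid-test.
func worker(target string, requests int) error {
	conn, err := grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close() // closed only once this goroutine finishes sending

	for i := 0; i < requests; i++ {
		// fresh per-call context; we cancel the call, not the connection
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		// client := pb.NewRecommenderClient(conn)      // hypothetical stub
		// _, err := client.Recommend(ctx, &pb.Request{})
		_ = ctx
		cancel()
	}
	return nil
}

func main() {
	_ = worker("dns:///pod-service-name:4000", 10) // placeholder target
}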

 Results 

Now we test! At a low send rate with 9 threads we get 


port requests
4000 12
4001 0
4002 10
4002 9
4003 8
4004 0
4005 10
4006 9
4007 0
4008 8
4009 12
4010 11

Notice we sent to only 9 pods while testing with 9 threads, each creating its own gRPC connection. The load balancer with the added configs and per-thread connections worked for tests at high volume and high thread counts, because enough new connections were made to effectively balance the sends across the test pods. For example, with 2 workers running 20 threads each (40 connections/threads total), the results here would look much more evenly distributed among the 11 instances we tested.

However, this test at a low load clearly shows that we weren't getting the load-balancing distribution we expected. Even though this code distributed requests well enough for our load tests, I wanted to keep digging to understand why the round-robin load balancing didn't seem to be working. So, I started removing one component at a time and added some debugging tools.

Debugging

I started to rethink the code we had when I read this in the gRPC documentation about load-balancing policies:

"Load-balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we still want them to be load-balanced across all servers."

This contradicts the assumption behind the code we added above: that we weren't picking a new address because we created the connection once, outside the method that fires off the requests, and passed the connection into that method.
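In other words, per the docs, a single shared ClientConn should be enough: the balancer picks an address per RPC, not per connection. A minimal sketch of that shape, with placeholder names (this assumes the resolver actually returns more than one address, which is exactly the assumption that turns out to be broken below):

package main

import (
	"log"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// One ClientConn shared by every goroutine. With round_robin configured
	// and a resolver that returns all pod IPs, gRPC balances each RPC.
	conn, err := grpc.Dial(
		"dns:///pod-service-name:4000", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	var wg sync.WaitGroup
	for i := 0; i < 9; i++ { // 9 threads, as in the test above
		wg.Add(1)
		go func() {
			defer wg.Done()
			// every goroutine reuses conn to build its stub and fire requests
		}()
	}
	wg.Wait()
}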

To figure out what was going on, I removed each configuration one by one and tested. 

Removing configs

There were a couple of observations I made when I started to dig into this issue:
(1) using `dig` in the command line, only one address showed for the service
(2) the DNS name was configured correctly 


The steps I took to debug are as follows:
(1) moved the connection creation back outside the loop and kept all other configs the same
(2) removed dns:/// from the address (testing with the connection made both inside and outside the loop)
(3) removed the round-robin config (testing with the connection both inside and outside the loop)

Removing the DNS and round-robin configs had no effect on the results. Moving the connection outside of the thread loop did change the results, as long as there was more than one thread.

Adding gRPC logging

I also added gRPC logging. It confirmed that the load balancer was configured correctly, but interestingly, it only ever showed one address. When the round-robin config was removed, the logs showed we used pick_first, the default load-balancing policy.

In the Dockerfile, I set two environment variables:
ENV GRPC_GO_LOG_SEVERITY_LEVEL info
ENV GRPC_GO_LOG_VERBOSITY_LEVEL 99
*note: if you are using a Go service, the generic gRPC environment variables won't work. The docs say so in the first sentence, but I missed it at first. Make sure to use the Go-specific gRPC variables.

The logs then looked like 
2021/11/11 01:37:01 INFO: [core] ccResolverWrapper: sending update to cc: {[{10.72.102.84:81  <nil> <nil> 0 <nil>}] <nil> <nil>}
2021/11/11 01:37:01 INFO: [core] ClientConn switching balancer to "round_robin"
2021/11/11 01:37:01 INFO: [core] Channel switches to new LB policy "round_robin"
2021/11/11 01:37:01 INFO: [core] Subchannel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] Subchannel picks a new address "10.72.102.84:81" to connect
2021/11/11 01:37:01 INFO: [core] Channel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] ccResolverWrapper: sending update to cc: {[{10.72.102.84:81  <nil> <nil> 0 <nil>}] <nil> <nil>}
2021/11/11 01:37:01 INFO: [core] ClientConn switching balancer to "round_robin"
2021/11/11 01:37:01 INFO: [core] Channel switches to new LB policy "round_robin"
2021/11/11 01:37:01 INFO: [core] Subchannel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] Subchannel picks a new address "10.72.102.84:81" to connect
2021/11/11 01:37:01 INFO: [core] Channel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] ccResolverWrapper: sending update to cc: {[{10.72.102.84:81  <nil> <nil> 0 <nil>}] <nil> <nil>}
2021/11/11 01:37:01 INFO: [core] ClientConn switching balancer to "round_robin"
2021/11/11 01:37:01 INFO: [core] Channel switches to new LB policy "round_robin"
2021/11/11 01:37:01 INFO: [core] Subchannel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] Subchannel picks a new address "10.72.102.84:81" to connect


So, what is this address? If I use kubectl to describe the service, we see
IP:                       10.72.102.84
IPs:                      10.72.102.84
which tells us that the service IP is the only address the client is able to pick from. We aren't resolving the individual pod IPs. Although we have round-robin configured, we aren't utilizing the functionality that it provides: picking a new address from a list of addresses.

I found a helpful blog on gRPC load balancing that mentioned:
"We saw how the dns name-system didn't work on Kubernetes because the Kubernetes's default internal load balancing works at the L4 level. gRPC is a L7 protocol."
All of this pointed to the Kubernetes service not being able to resolve the individual pod IPs. 
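A quick way to sanity-check what the resolver sees, without leaving Go, is a plain DNS lookup against the service name (placeholder name below): against a regular ClusterIP service this returns the single service IP, while against a headless service it returns every ready pod IP.

package main

import (
	"fmt"
	"net"
)

func main() {
	// For a non-headless service this prints one ClusterIP;
	// for a headless service it prints each pod IP.
	addrs, err := net.LookupHost("pod-service-name") // placeholder name
	if err != nil {
		panic(err)
	}
	for _, addr := range addrs {
		fmt.Println(addr)
	}
}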

*note: I tried to set up channelz but couldn't get the UI component or any observability working. If anyone has tips, or opinions on why it's a good debugging tool, please lmk.

Finally! solving the puzzle 

Turns out that the Kubernetes service isn't headless. Tadaaa! LOL. Yeah, that was it all along. We couldn't resolve the pod IPs from the service the whole time, and we kind of got around it by putting the connection in the thread loops: each new connection landed on a more-or-less random pod, so with enough threads on each attacker pod we were pretty well distributed. But the round-robin and DNS configs did nothing for us.

Here is a good description of what it means for a Kubernetes service to be headless. To enable the load balancing we want, we would either have to make the service headless or set up a way to gather all of the pod IPs and round-robin across them (aka a discovery service), checking periodically to make sure those IPs are still available. xDS does this. See also: xDS Features in gRPC.
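For a sense of what the xDS route looks like on the client, here's a rough sketch; it assumes you already have an xDS control plane and a bootstrap file referenced by the GRPC_XDS_BOOTSTRAP environment variable, and the target name is a placeholder.

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // blank import registers the xds:/// scheme
)

func main() {
	// The xds resolver gets its endpoints (and their health) from the
	// control plane instead of DNS, so no headless service is needed.
	conn, err := grpc.Dial("xds:///recommendations", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}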


We could also use Istio on the server side to achieve load balancing (Istio also uses xDS under the hood). We tested with Istio, and it evenly distributed the requests among all the workers, regardless of the thread count. Which load balancer to use really depends on the service's limitations, and on whether server-side or client-side load balancing is appropriate. Istio has its own configs, applied at the edge of the service, that could impact the service itself.

Summary 

* I made some assumptions about the server-side handling of load balancing, then realized the load-balancing piece was completely missing; with gRPC, you can set it up from the client side.

* make sure the code/configs you copy reflect the version of the library you are using (round-robin configs, lol whoops)

* gRPC debug logging is really useful for tracking gRPC connection behavior

* gRPC load balances per call, not per connection, so you don't need to create multiple connections to achieve client-side load balancing

* a non-headless Kubernetes service won't resolve to the individual pod IPs just because you add "dns:///" to the beginning of the address. You need a discovery service like xDS, or you can change the service configs so that the Kubernetes service is headless.

* basically, I went through this whole process just to find out it was a service config, but at least I learned something about gRPC along the way


I knew when I was getting really frustrated with this that it would be a good opportunity to write a blog. So here we are again. <3 

If you are looking for some good reads about gRPC, check out these posts by Evan Jones: gRPC is easy to misconfigure and Load balancing gRPC services.
