A lesson learned on gRPC client-side load balancing
Background
This post walks through the steps I took to debug load-balancing issues for a client/server written in Go, using gRPC and running in Kubernetes. It starts by identifying the problem, attempting to fix it, realizing why that didn't actually fix it, and finally understanding how gRPC load balancing interacts with the Kubernetes service. Rather than a neat issue/solution write-up, I walk you through my process as it happened. So if you're into that, great! If not, just skip to the last part for the main points.
Anyways, I was given the task of writing a load tester for one of our microservices, which takes in a request for a certain type of product recommendation and returns the products. We wanted to hit the service at ~60k requests/second. To do this, I built a load tester in Go that uses gRPC to talk to the service. The load tester consists of a client that runs locally on my laptop, port-forwards to the attacker pods (the load tester servers) in a Kubernetes cluster, and fires them off. After the attackers have finished sending requests to the recommendations service, the client collects the responses from the attackers and spits out statistics for the test.
Definitions
Load balancing distributes tasks to a set of workers. In this case, we want to distribute incoming requests evenly over the set of service pods that we are hitting. See gRPC's documentation on load balancing.
Client-side load balancing is when the caller decides how to distribute its requests, instead of a layer sitting in front of the servers that receives all incoming requests/tasks and distributes them.
gRPC is a modern open source high performance Remote Procedure Call (RPC) framework developed by Google. It's known for being ~fast~. (See grpc-go)
The problem
The issue I ran into while developing the load tester was that when I increased the number of threads each attacker used, or the rate of requests per second, the results contained a lot of concurrency/resource-exhausted errors. At first I tried scaling out the number of pods, and then I tried scaling up the resources on the pods from 1 CPU / 1 GB of memory to 14 CPU / 50 GB per pod (each pod on its own node). Each time I ramped up the tests, I would get a significant number of errors like
"response type": "rpc error: code = ResourceExhausted desc = Too many concurrent requests. Please retry later."
:(
Identifying the source of the problem
We knew that the pods' scale/size wasn't the main issue, since tuning them produced no change in the results. Since the stats were coming from the attackers, I knew that each attacker was sending requests and receiving responses. The next place to look was the service pods we were attacking. The Prometheus metrics from those pods showed something interesting: we were only hitting a few of the service pods, and we were hitting those hard, while some service pods were never hit at all. This put an unbalanced load on the service pods. Istio/Envoy does not sit in front of the service we were testing, so there was no server-side load balancing to fall back on.
Testing with a low send rate, the results from the service were
port | requests |
---|---|
4000 | 62 |
4001 | 62 |
4002 | 0 |
4002 | 0 |
4003 | 0 |
4004 | 29 |
4005 | 0 |
4006 | 0 |
4007 | 31 |
4008 | 0 |
4009 | 0 |
4010 | 31 |
The port number isn't interesting here; it just identifies one of the service pods under test. The takeaway is that the requests the attackers send to the service are not being distributed, causing an uneven load.
The Code
My original code had a few problems that kept the sends from being distributed. In an attempt to add distribution, we added load-balancing configs and more connections.
Connections and client-side load balancing configs
At first, we thought that creating the connection only once was the issue, so we moved connection creation into the thread loop that fires off requests to the pods under test. We also set up the client-side load balancer configs, and we tried both the DNS resolver and kuberesolver to resolve the addresses of the individual pods behind the service name. Since kuberesolver wasn't changing the results, we kept only the DNS resolver. In practice that just means prefixing the address we dial with "dns:///" (so if the address is pod-service-name, we dial dns:///pod-service-name).
So we changed three things basically simultaneously:
* created new gRPC connection for each thread
* added DNS resolver (dns:///{address_to_dial})
* added the round robin gRPC config
Setting up a client-side load balancer
There are a few blog posts and sets of documentation about setting up a client-side gRPC load balancer. One lesson I learned is that the gRPC library has changed quite a bit across versions, especially around the load-balancing configs. I read a blog that suggested using grpc.WithBalancerName() or a JSON format that was outdated and no longer matched the service-config protobuf message in the current version. pick_first is the default policy, which was obviously not working well for my use case; I wanted round_robin instead. So the following option was added to the Dial call.
grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`)
*note: you could generate the service-config proto code in your repo so this isn't as ugly, or you can just stick with the JSON string
*note: I always have to be careful with contexts and cancellations once a connection is added to the thread loop. Make sure you don't cancel the connection before it should end.
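Putting the pieces together, the per-thread dial ended up looking roughly like the sketch below. This is a minimal reconstruction under my own assumptions, not the actual load tester code: the target recs-service:81, the plaintext credentials, and the commented-out Recommend call are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// worker is run once per attacker thread: each thread dials its own
// connection through the DNS resolver with round_robin enabled, then
// fires requests in a loop with a short per-call context.
func worker(ctx context.Context, target string, requests int) {
	conn, err := grpc.Dial(
		"dns:///"+target, // DNS resolver
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder creds
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`),
	)
	if err != nil {
		log.Fatalf("dial %s: %v", target, err)
	}
	defer conn.Close() // close the connection only when the worker is done

	for i := 0; i < requests && ctx.Err() == nil; i++ {
		// Each RPC gets its own short-lived context; cancelling it does
		// not tear down the connection.
		callCtx, cancel := context.WithTimeout(ctx, time.Second)
		// _, err := client.Recommend(callCtx, req) // placeholder generated RPC
		_ = callCtx
		cancel()
	}
}

func main() {
	worker(context.Background(), "recs-service:81", 10)
}
```

Note that the round_robin config only matters if the resolver actually hands back more than one address, which is exactly where things go wrong later in this post.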
Results
Now we test! At a low send rate with 9 threads we get
port | requests |
---|---|
4000 | 12 |
4001 | 0 |
4002 | 10 |
4002 | 9 |
4003 | 8 |
4004 | 0 |
4005 | 10 |
4006 | 9 |
4007 | 0 |
4008 | 8 |
4009 | 12 |
4010 | 11 |
Better, but suspicious: exactly nine pods received traffic (one per thread, as it turns out) and the rest received nothing.
Debugging
Load-balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we still want them to be load-balanced across all servers.
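To make that concrete, here's a hedged sketch of what per-call balancing implies: a single shared connection used by every goroutine is enough, as long as the resolver reports more than one backend address. The target name and the commented-out RPC are placeholders, not our real client.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// One connection shared by all goroutines; with round_robin, each
	// individual call is balanced across whatever addresses the resolver
	// reported, so a connection per thread isn't required.
	conn, err := grpc.Dial(
		"dns:///recs-service:81", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	var wg sync.WaitGroup
	for i := 0; i < 9; i++ { // 9 sender goroutines, like the test above
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				callCtx, cancel := context.WithTimeout(context.Background(), time.Second)
				// _, err := client.Recommend(callCtx, req) // placeholder RPC
				_ = callCtx
				cancel()
			}
		}()
	}
	wg.Wait()
}
```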
Removing configs
Adding gRPC logging
To see what the client was actually doing, we turned on gRPC's debug logging with two ENV lines in the Dockerfile:
ENV GRPC_GO_LOG_SEVERITY_LEVEL info
ENV GRPC_GO_LOG_VERBOSITY_LEVEL 99
With that in place, the attacker logs showed the resolver and balancer activity:
2021/11/11 01:37:01 INFO: [core] ccResolverWrapper: sending update to cc: {[{10.72.102.84:81 <nil> <nil> 0 <nil>}] <nil> <nil>}
2021/11/11 01:37:01 INFO: [core] ClientConn switching balancer to "round_robin"
2021/11/11 01:37:01 INFO: [core] Channel switches to new LB policy "round_robin"
2021/11/11 01:37:01 INFO: [core] Subchannel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] Subchannel picks a new address "10.72.102.84:81" to connect
2021/11/11 01:37:01 INFO: [core] Channel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] ccResolverWrapper: sending update to cc: {[{10.72.102.84:81 <nil> <nil> 0 <nil>}] <nil> <nil>}
2021/11/11 01:37:01 INFO: [core] ClientConn switching balancer to "round_robin"
2021/11/11 01:37:01 INFO: [core] Channel switches to new LB policy "round_robin"
2021/11/11 01:37:01 INFO: [core] Subchannel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] Subchannel picks a new address "10.72.102.84:81" to connect
2021/11/11 01:37:01 INFO: [core] Channel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] ccResolverWrapper: sending update to cc: {[{10.72.102.84:81 <nil> <nil> 0 <nil>}] <nil> <nil>}
2021/11/11 01:37:01 INFO: [core] ClientConn switching balancer to "round_robin"
2021/11/11 01:37:01 INFO: [core] Channel switches to new LB policy "round_robin"
2021/11/11 01:37:01 INFO: [core] Subchannel Connectivity change to CONNECTING
2021/11/11 01:37:01 INFO: [core] Subchannel picks a new address "10.72.102.84:81" to connect
Every resolver update contains exactly one address, 10.72.102.84:81, and checking the Service shows that this is the Service's own IP rather than any individual pod:
IP: 10.72.102.84
IPs: 10.72.102.84
So the DNS resolver wasn't giving us per-pod addresses at all: a regular ClusterIP Service exposes a single virtual IP, and Kubernetes's default internal load balancing happens behind that IP at the L4 (connection) level. gRPC is an L7 protocol that multiplexes many calls over a few long-lived connections, so connection-level balancing leaves the traffic pinned to whichever pods those connections happen to land on.
Finally! Solving the puzzle
Here is a good description of what it means for a Kubernetes service to be headless. To get the load balancing we want, we would either have to make the service headless (so DNS returns the individual pod IPs instead of a single ClusterIP) or set up a way to gather all of the pod IPs ourselves and round-robin across them (i.e., a discovery service), re-checking periodically to make sure those IPs are still available. xDS does exactly this. See also: xDS Features in gRPC
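If you go the "gather the IPs yourself" route, gRPC's manual resolver is one way to hand a list of addresses to the round_robin policy. The sketch below is an assumption-laden illustration, not what we shipped: the pod IPs are made up, and in practice you would pull them from the Kubernetes Endpoints/EndpointSlice API and refresh them with UpdateState as pods come and go.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func main() {
	podAddrs := []resolver.Address{ // placeholder pod IPs
		{Addr: "10.72.1.10:81"},
		{Addr: "10.72.1.11:81"},
		{Addr: "10.72.1.12:81"},
	}

	// A manual resolver feeds a fixed address list to the channel;
	// round_robin then balances each call across those addresses.
	r := manual.NewBuilderWithScheme("recs") // any unused scheme name
	r.InitialState(resolver.State{Addresses: podAddrs})

	conn, err := grpc.Dial(
		r.Scheme()+":///ignored", // host part is unused with a manual resolver
		grpc.WithResolvers(r),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [ { "round_robin": {} } ]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```

If the Service is made headless instead, you can skip all of this: the plain dns:/// target from earlier resolves to the individual pod IPs, and the round_robin policy takes it from there.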
Summary
* gRPC's debug logging is really useful for tracking connection, resolver, and balancer behavior
If you are looking for some good reads about gRPC, check out these posts by Evan Jones: gRPC is easy to misconfigure and Load balancing gRPC services.