Need to be careful when you using gRPC keepalive

Allen Ng
4 min readMay 10, 2021

--

Keepalive is very important concept almost in all tcp connection like database etc, the same concept does support in gRPC as well.

Why we need keepalive in gRPC?

When a connection is idle for a while, if there were some other network component involved like load balancer or proxy, the connection may get disconnected.

In those kinds of cases, we will need to have a mechanism to reconnect to the server-side.

Configurations

In the grpc keepalive proposal, there are a lot of configurations that we can set, this time we will use golang as an example.

From the golang official pkg there were client side and server side parameters. Most of them are quite straightforward and easy to understand, but some configuration is still a bit ambiguous for the server side setting.

MaxConnectionAge vs MaxConnectionAgeGrace

First thing that I am not so sure about are these two configurations, from the description MaxConnectionAgeGrace is the time allowed for the rpc call before the connection was forcibly closed.

So I got one question, during `MaxConnectionAgeGrace` period, what happen if there are new requests?

Using the following setting to test with

  • MaxConnectionAge — 10s
  • MaxConnectionAgeGrace — 30s

Client Side

Server Side

running the server & client with `GODEBUG=http2debug=2` so it will print out internal logs.

Result

Server logs

2021/05/07 18:27:56 Received: world
2021/05/07 18:28:06 http2: Framer 0xc0001a0000: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
...
2021/05/07 18:28:16 http2: Framer 0xc00025a000: read DATA flags=END_STREAM stream=1 len=12 data="\x00\x00\x00\x00\a\n\x05world"
2021/05/07 18:28:16 Received: world
...
2021/05/07 18:28:26 returning world
...
2021/05/07 18:28:36 http2: Framer 0xc000346000: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
...
2021/05/07 18:28:45 http2: Framer 0xc0001a01c0: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
...
2021/05/07 18:28:46 returning world

Client logs

2021/05/07 18:27:56 http2: Framer 0xc0002a4000: wrote DATA flags=END_STREAM stream=1 len=12 data="\x00\x00\x00\x00\a\n\x05world"
...
2021/05/07 18:28:06 http2: Framer 0xc0002a4000: read GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
...
2021/05/07 18:28:16 http2: Framer 0xc0001f2000: wrote DATA flags=END_STREAM stream=1 len=12 data="\x00\x00\x00\x00\a\n\x05world"
...
2021/05/07 18:28:25 http2: Framer 0xc0002a41c0: read GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
...
2021/05/07 18:28:26 Greeting: Hello world
2021/05/07 18:28:36 http2: Framer 0xc00023c000: read GOAWAY len=8 ...
2021/05/07 18:28:46 Greeting: Hello world

We can see every 10 seconds the server side send a GOAWAY to client connection and client reconnect after that, but since the previous request hasn’t finished the connection itself is still alive.

The call will take 30s and with MaxAge (10s) + Grace (30s), the call should have enough time to process.

As we can see in the log, the client side sends a request every 20s, and we can see both of them were done without any issues.

MaxAge with Client Side Keepalive

Another interesting problem that many people may face is grpc connection was force closed by transport closing before MaxAge , usually it was caused by Client Side Keepalive

In client side keepalive configuration, there was a Time params that you can pass in, according to the description it is the client side ping interval, if you only set this params for example 5s, most likely your connection will close very regularly and randomly.

This is because there was a hidden restriction for ping interval, in the proposal it said within MinTime , the server can only receive at max 3 pings.

Let’s try it out!

Server side

Client Side

Result

Client logs

2021/05/10 12:37:55 http2: Framer 0xc0000d4000: wrote SETTINGS len=0
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: read SETTINGS len=6, settings: MAX_FRAME_SIZE=16384
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: wrote SETTINGS flags=ACK len=0
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: read SETTINGS flags=ACK len=0
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: wrote HEADERS flags=END_HEADERS stream=1 len=74
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: wrote DATA flags=END_STREAM stream=1 len=12 data="\x00\x00\x00\x00\a\n\x05world"
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: read WINDOW_UPDATE len=4 (conn) incr=12
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: read PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2021/05/10 12:37:55 http2: Framer 0xc0000d4000: wrote PING flags=ACK len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2021/05/10 12:38:05 http2: Framer 0xc0000d4000: wrote PING len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:05 http2: Framer 0xc0000d4000: read PING flags=ACK len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:15 http2: Framer 0xc0000d4000: wrote PING len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:15 http2: Framer 0xc0000d4000: read PING flags=ACK len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:25 http2: Framer 0xc0000d4000: wrote PING len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:25 http2: Framer 0xc0000d4000: read PING flags=ACK len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:35 http2: Framer 0xc0000d4000: wrote PING len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:35 http2: Framer 0xc0000d4000: read PING flags=ACK len=8 ping="\x00\x00\x00\x00\x00\x00\x00\x00"
2021/05/10 12:38:35 http2: Framer 0xc0000d4000: read GOAWAY len=22 LastStreamID=1 ErrCode=ENHANCE_YOUR_CALM Debug="too_many_pings"
2021/05/10 12:38:35 could not greet: rpc error: code = Unavailable desc = transport is closing

We can see the client pinging the server every 10s while our setting is 3s, the reason is written in the config description:

// After a duration of this time if the client doesn't see any activity it
// pings the server to see if the transport is still alive.
// If set below 10s, a minimum value of 10s will be used instead.

As we can see at the 4th time the client pinging the server it throw a `ENHANCE_YOUR_CALM` error and send a GOAWAY to the client, it make all the rpcs force closed and this was ignoring the Grace Period.

Conclusion

In conclude, we know that the previous connection will still open but not accepting new request, after MaxAge it will refresh to a new connection and plotting connection to the new connection instead.

Using keepalive in client-side sometime can be dangerous if you misconfigured, we need to pay extra attention and check the documentation carefully.

--

--