A lesson learned on Go's Context (using pgx)

A little bit about a bug I encountered in my code due to handling context timeouts 

I thought I was going to be cute and put a nice little timeout on my context with a nice little defer cancel() in the function that had the one job of returning Postgres rows using the pgx library. I was a bit naive about how willy nilly I could be using context. The issue was that (1) I found I was timing out and my end response for the endpoint I was hitting was empty but without an error, and (2) I was canceling the context that was needed to read the rows and (3) I wasn't catching the timeout error. The first problem exposed that I wasn't catching the context error and that my queries were taking longer than I expected. The second issue exposed an error in how I was using context inside my function and resulted in me returning a max of 25 rows at a time. Although the bug in my code is specific to how I was using the pgx library, I think it is applicable to uses of context in general. 


A little "context" (lol) on the issue

Let's say I have a function like this one 


This simplified example code shows an Endpoint method that is hit when the API receives an /endpoint request. The Endpoint func runs a postgres query via the RunQuery method and then reads the rows by scanning them into the Response struct. Finally, the method returns the rows. 

The issue: I was returning empty responses without an error, but when I ran the query myself in the terminal, I got a response. 


Attempt #1: increase the timeout and catch the timeout error

I found that the query was taking longer than expected, much longer than the timeout I had set in RunQuery. While the query can be optimized, it still doesn't explain why no error would return in the response saying that the timeout occurred. Increasing the timeout is only a bandaid for the issue. 

 So what if I add a ping after Query runs? If I call it using the same context that has timed out, it should return an error. 



And it did kinda work. What I mean is, upping the timeout to 2 minutes means the query no longer timed out every time it ran, and if it did, we returned an error like

{
  "error": "2 UNKNOWN: context deadline exceeded"
}

but... why do we need to ping AFTER running the query for the purpose of checking if the context was cancelled? seems a little odd. We should just be able to catch the context error more directly. At this point, I asked another engineer if they had run into this in Go, to which they replied "you can catch the error on context by using ctx.Err()." This is true, and a totally valid response given the information I gave them. So I did the following bad thing. 


This is a dangerous way to use context.Err(), from the Golang docs on context

    // Err indicates why this context was cancelled, after the Done channel
    // is closed.
    Err() error

Just calling context.Err() did work, but I'm not making sure the channel is done or any of that. I will admit that I didn't google the use case, I just tried it, it worked, and said cool. Once I googled it, I realized that nobody was using it the way I was. Although I did get the context error, this isn't reliable and the library should be handling it. Instead of trying to do something hacky, I should figure out how to appropriately catch the error from the pgx library. One thing I have learned is that if you can't find an example using a library the way you are using it, you may want to take an extra second to think about if you are doing the right thing. Maybe you are, but .... maybe you aren't. 

At this point though, I was thinking "ok great, done??" The answer is no, I still haven't actually solved the issue, and another weird thing happened. 

When I didn't timeout, I only returned at most 25 rows at a time. ever. At this point, I almost lost my cool, but instead, I said to myself I will figure this out even if it kills me :) hehe -- figuratively, I mean.. this is probably not a healthy/recommended way to approach debugging lol. 


Attempt #2: actually solve the issue and return more than a max of 25 rows at a time

You may be smarter than I was and noticed that I am calling cancel() within the RunQuery function. This means that I am canceling the context. I made two bad assumptions 

(1) the timeout error would return from Query
(2) rows don't care about the context 

both of these were wrong. 

Just read the source code 
Source code is usually hard to read and confusing because it's new to the reader, but it holds the answers. 

The pgx Query method uses context to Acquire() a connection and Acquire returns *Conn, which is used to call getRows. Then, a connResource struct is made, which holds the rows and the connection.

So, if you exit this RunQuery method, cancel the context, return the rows, then go to read the rows with rows.Next(), guess what? You aren't going to read all (or any) of your rows. If the context timed out during the query run, you won't read any rows. If the query completed, and cancel() was called at the end of RunQuery, you will only read the rows in the buffer. That is why I consistently read 25 rows for results that returned more. 


TL;DR by reading the source code, the reasoning for the weirdness I was seeing was apparent. The context is used to read the rows, and running cancel() in the RunQuery function means that if a query did not timeout, it still wouldn't be able to read all of the rows due to the cancellation of the context in the rows struct. 

Catching the context error within RunQuery was still only a bandaid on the real problem.


Using rows.Err()

A big miss on my part was not using rows.Err().
In the Endpoint method, you will see that I did check the error from Scan(), but there is another error from rows that needed to be caught also. To test, I ran

This isolated test does a couple of things:

(1) makes sure the query times out so that we will see a timeout error (timeout set to 1 second for a query that takes longer)

(2) tries to catch the timeout error in three different places. (1) In calling rows.Err() before scanning the rows, (2) calling rows.Err() after scanning the rows, and (3) using the bad use of ctx.Err() 

The result was that the context error was returned at (2), the rows.Err() after calling rows.Next()

This means that we must attempt to read the row before we will know that the context timeout has occurred. 

Solution

Three things needed to be done to fix this code:

(1) move the context timeout to the outside of the RunQuery function so that cancel() isn't called before the rows are read
(2) adjust the timeout (in this case, increase it until queries are optimized)
(3) call rows.Err() after rows.Next() to catch the timeout error if it occurs

The improved code looks something like 


Looking at the solution now seems so obvious, of course. Also, keep in mind this is a simplified version of the code. The method takes a few steps before it gets to RunQuery. So, originally, keeping that context timeout seemed unnecessary. The RunQuery method no longer needs to create or cancel a timeout, and even if the rows are empty, calling rows.Next() then rows.Err() returns the error that I expected. 

In this process, I learned that context and timeouts aren't something you can just get away with assuming it will work like you want it to. Every library makes choices and tradeoffs around how to handle these things, and it is important to understand how it works, or ... your API may not work as you expect. 


ty

At some point, everyone needs help figuring things out. It feels wrong to say "I figured this out, memememememe." After a while, I posted the issue in a work Slack channel so shout out to Al Delucca, Avi Lederman, and Luke Ravitch for chiming in/sharing knowledge about this stuff and helping me get to a solution much quicker than I would have on my own. 



Find the pgx library here

Comments

  1. Hi,
    I'm having some similar trouble with context and this blog post if one of the few places I've seen the context discussed and explained in a real-world setting - so thanks!

    In the final solution section line 3, `cctx, cancel := context.WithTimeout(ctx, 120*time.Second)`
    The parent `ctx` passed into WithTimeout is undefined - is this supposed to be `request.Context()` ?

    Then in RunQuery, ctx is the argument name, but cctx is passed into Query (assume this was just a typo).

    Thanks!

    ReplyDelete
  2. Thanks a lot! I definitely made the same assumptions, and I for sure got the same issues :)

    ReplyDelete

Post a Comment

Popular Posts