A lesson learned on Go's Context (using pgx)
A little bit about a bug I encountered in my code due to handling context timeouts
I thought I was going to be cute and put a nice little timeout on my context with a nice little defer cancel() in the function that had the one job of returning Postgres rows using the pgx library. I was a bit naive about how willy nilly I could be using context. The issue was that (1) I found I was timing out and my end response for the endpoint I was hitting was empty but without an error, and (2) I was canceling the context that was needed to read the rows and (3) I wasn't catching the timeout error. The first problem exposed that I wasn't catching the context error and that my queries were taking longer than I expected. The second issue exposed an error in how I was using context inside my function and resulted in me returning a max of 25 rows at a time. Although the bug in my code is specific to how I was using the pgx library, I think it is applicable to uses of context in general.
A little "context" (lol) on the issue
Let's say I have a function like this one
The issue: I was returning empty responses without an error, but when I ran the query myself in the terminal, I got a response.
So what if I add a ping after Query runs? If I call it using the same context that has timed out, it should return an error.
When I didn't timeout, I only returned at most 25 rows at a time. ever. At this point, I almost lost my cool, but instead, I said to myself I will figure this out even if it kills me :) hehe -- figuratively, I mean.. this is probably not a healthy/recommended way to approach debugging lol.
(2) rows don't care about the context
both of these were wrong.
So, if you exit this RunQuery method, cancel the context, return the rows, then go to read the rows with rows.Next(), guess what? You aren't going to read all (or any) of your rows. If the context timed out during the query run, you won't read any rows. If the query completed, and cancel() was called at the end of RunQuery, you will only read the rows in the buffer. That is why I consistently read 25 rows for results that returned more.
TL;DR by reading the source code, the reasoning for the weirdness I was seeing was apparent. The context is used to read the rows, and running cancel() in the RunQuery function means that if a query did not timeout, it still wouldn't be able to read all of the rows due to the cancellation of the context in the rows struct.
Catching the context error within RunQuery was still only a bandaid on the real problem.
A big miss on my part was not using rows.Err().
In the Endpoint method, you will see that I did check the error from Scan(), but there is another error from rows that needed to be caught also. To test, I ran
This isolated test does a couple of things:
(1) makes sure the query times out so that we will see a timeout error (timeout set to 1 second for a query that takes longer)
(2) tries to catch the timeout error in three different places. (1) In calling rows.Err() before scanning the rows, (2) calling rows.Err() after scanning the rows, and (3) using the bad use of ctx.Err()
The result was that the context error was returned at (2), the rows.Err() after calling rows.Next()
This means that we must attempt to read the row before we will know that the context timeout has occurred.
Three things needed to be done to fix this code:
(1) move the context timeout to the outside of the RunQuery function so that cancel() isn't called before the rows are read
(2) adjust the timeout (in this case, increase it until queries are optimized)
(3) call rows.Err() after rows.Next() to catch the timeout error if it occurs
The improved code looks something like
Looking at the solution now seems so obvious, of course. Also, keep in mind this is a simplified version of the code. The method takes a few steps before it gets to RunQuery. So, originally, keeping that context timeout seemed unnecessary. The RunQuery method no longer needs to create or cancel a timeout, and even if the rows are empty, calling rows.Next() then rows.Err() returns the error that I expected.
In this process, I learned that context and timeouts aren't something you can just get away with assuming it will work like you want it to. Every library makes choices and tradeoffs around how to handle these things, and it is important to understand how it works, or ... your API may not work as you expect.
At some point, everyone needs help figuring things out. It feels wrong to say "I figured this out, memememememe." After a while, I posted the issue in a work Slack channel so shout out to Al Delucca, Avi Lederman, and Luke Ravitch for chiming in/sharing knowledge about this stuff and helping me get to a solution much quicker than I would have on my own.