At Arpeely, we operate on a massive scale, connecting to numerous Real-Time Bidding (RTB) exchanges and handling over a million impressions each second. This volume of data requires innovative solutions for system stability and cost management. We recently encountered a subtle yet impactful bug that hindered our main service's performance.
Background
Every time someone opens a website or an app containing an ad, an auction is held among various advertisers. The highest bidders get to deliver an ad to the user.
Our key service, RTB-service, is in charge of bidding on the auction requests sent by various RTB ad exchanges. It processes about 7K requests per second (RPS) per pod, so quick decision-making is a crucial business requirement. The service is written in Go, and efficiency is critical.
However, we started noticing sporadic performance issues, including latency spikes, request drops, and an increased pod count. These issues gradually worsened, prompting a thorough investigation.
Initial Exploration
Our first step was to gather more information and deepen our understanding of the service's behavior. Using Grafana for monitoring, we wanted to understand what good, bad, and average latencies look like for the service. To improve my intuition about the service, I looked for patterns: How does its behavior change throughout the day, from low to peak times? How does degradation manifest in our metrics? Does the issue occur at regular intervals or at specific times? Is it linked to identifiable actions like deployments or configuration changes?
At this stage, I had no theories, and let the data guide me. As Sherlock Holmes once said, "It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
Anomaly in the Swarm
I wasn’t able to find any clear correlation between the degradation and anything else that could have helped explain its source. I did, however, find something of interest - the problem occurred at the pod level, not at the service-wide level. In other words, a pod may start to behave badly while other pods behave normally. When a single pod falls behind, the effect is hard to notice at the system level, but when enough pods perform poorly, it looks like the whole system degrades. This insight was important for the solution - if the problem occurs at the pod level, the rest of the investigation - and the solution - should be focused on the pod level.
I wondered: why do two pods running the same code, under the same conditions, behave differently? This suggested a nondeterministic bug, which was reassuring - the problem wasn't in our architecture or technology stack but likely a specific bug, and once fixed, all pods should behave similarly.
What Makes a Pod Tick?
Next, I used pprof, a powerful profiling tool for Go programs that offers developers a window into the performance characteristics of their applications. pprof provides insights into CPU utilization and memory allocation, among other metrics. It works by collecting profiling data from a running program, which can then be analyzed to identify bottlenecks and inefficiencies, and it can generate various reports (like flame graphs and call graphs) that visually break down how CPU time or memory is allocated across the functions of the program.
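For context, here is a minimal sketch of how a Go service can expose pprof over HTTP so profiles can be pulled from a running pod. The post doesn't show RTB-service's actual setup, so the wiring below is an assumption, not our production code:

```go
// A minimal sketch of exposing pprof in a Go service. Importing
// net/http/pprof registers the /debug/pprof/* handlers on the default mux.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve the profiling endpoints on a side port, separate from bid traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the actual service would start here ...
	select {} // block forever in this sketch
}
```

A heap profile can then be fetched and explored with, for example, go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap.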
CPU profiling didn't reveal much, but heap profiling showed something intriguing: in some pods, GBs of memory were allocated by a function called config.loadConfig. All pods hold configuration objects in memory, and that makes perfect sense - but why are some significantly larger than the expected size of several MBs? It looks like a memory leak, which could explain the issue!
Now that we have more facts, we can start forming a theory: we have some nondeterministic memory leak related to our configuration, which causes everything to slow down - each memory allocation takes more time, and the GC needs to work harder and more frequently, since there is less available memory. This can perfectly explain the issues we saw.
Now we need to prove it by fixing the bug. As I mentioned earlier, pprof is a powerful tool, but it's not perfect. It led us to the problematic allocation source but couldn't identify what was referencing the memory and why it wasn't being freed. Without a clear path forward from pprof, I needed to get my hands dirty.
The idea is simple:
1. Reproduce the problem locally and deterministically
2. Comment out parts of the code, until the problem is solved
With the new knowledge that the problem is related to configuration objects, I simulated very frequent configuration changes and managed to reproduce the issue deterministically in my local environment. After every request I sent, the process memory increased a little. Further code investigation led to a single short line of code.
Commenting out that line stopped the problem, pinpointing the issue. Hooray!
Can You Find the Go Bug?
Let's examine the problematic code (Imp is a bid request, received from the exchange):
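The original listing isn't reproduced here; the sketch below is a reconstruction based on the description in the next section. Identifiers like Imp, impsChannel, imps, iPtr, BufferSize and TickerTimeoutMS come from the post, while the exact structure (and the constants) are an approximation:

```go
// Reconstructed sketch of the batching pattern - an approximation, not the
// service's real code. Can you spot the leak?
package main

import "time"

// Config is a (potentially large) configuration snapshot.
type Config struct {
	// ... configuration fields ...
}

// Imp is a bid request received from the exchange; each impression holds a
// reference to the config that was active when it arrived.
type Imp struct {
	Config *Config
	// ... other bid request fields ...
}

const (
	BufferSize      = 5000 // illustrative values, not the real ones
	TickerTimeoutMS = 100
)

var impsChannel = make(chan *Imp, BufferSize)

// processIteration performs one iteration of the endless batching loop: it
// moves impressions from impsChannel into the imps buffer, and flushes the
// buffer when it is full or when the ticker fires.
func processIteration(imps []*Imp, iPtr *int, ticker *time.Ticker) []*Imp {
	select {
	case imp := <-impsChannel:
		if *iPtr < len(imps) {
			imps[*iPtr] = imp // reuse an existing slot
		} else {
			imps = append(imps, imp) // grow the buffer under high traffic
		}
		*iPtr++
		if *iPtr >= BufferSize {
			flushImps(imps[:*iPtr])
			*iPtr = 0 // the buffer is "cleared" only by resetting the index
		}
	case <-ticker.C:
		flushImps(imps[:*iPtr])
		*iPtr = 0 // same here: old elements stay referenced by the buffer
	}
	return imps
}

func flushImps(batch []*Imp) {
	// ... hand the batch off for asynchronous processing ...
}

func main() {
	imps := make([]*Imp, 0, BufferSize)
	iPtr := 0
	ticker := time.NewTicker(TickerTimeoutMS * time.Millisecond)
	defer ticker.Stop()
	for { // the endless loop described below
		imps = processIteration(imps, &iPtr, ticker)
	}
}
```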
Did You Find the Issue?
When examining the code, we observe that impressions (bid requests) are directed to a channel (impsChannel) for asynchronous processing. The function processIteration plays a critical role here, operating in an endless loop with two primary responsibilities:
Transferring Impressions to a Buffer: It moves impressions from impsChannel to a temporary storage buffer, a slice named imps.
Flushing the Buffer: It clears out the imps buffer either when it reaches a predefined limit (BufferSize) or after a specified duration (TickerTimeoutMS) elapses, ensuring impressions are processed in timely batches.
The critical oversight lies in the management of the imps buffer. While processIteration signals the processing of impressions by adjusting iPtr (a pointer to track the number of impressions currently in the buffer), the actual impressions remain within the buffer. This subtle bug leads to unintended consequences, especially noticeable as the traffic patterns fluctuate:
During High Traffic: Imagine a scenario where the service receives a high volume of impressions, say 2,000, within each TickerTimeoutMS interval. The buffer (imps) efficiently grows to accommodate these, reaching a length of 2,000 impressions.
During Reduced Traffic: Later, as the influx of impressions slows down, perhaps only 500 impressions are received in the same time frame. Consequently, the buffer's length adjusts to 500 to reflect the current load.
However, this is where the bug manifests. Impressions indexed from 500 to 1999 in the buffer, populated during the high traffic period, are never overwritten or cleared. This oversight results in these impressions—and crucially, the configuration objects they reference—remaining in memory indefinitely.
At first glance, holding onto past impressions might not seem problematic, especially if they reference the same configuration object. The issue escalates, however, when each impression is linked to a distinct configuration object. Given that configurations can change over time, this scenario is not just plausible but expected. Thus, what initially appears as a minor inefficiency balloons into a significant memory leak, as each impression retains a separate instance of a potentially large configuration object in memory, which cannot be freed by the garbage collector.
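To see the underlying Go gotcha in isolation, here is a small, self-contained demonstration, independent of the service code and with made-up sizes: resetting a length or an index does not release the pointers still stored in the backing array, so the garbage collector keeps everything they reference alive.

```go
package main

import (
	"fmt"
	"runtime"
)

// config stands in for a potentially large configuration object (~64 KB here).
type config struct{ payload [1 << 16]byte }

// liveMB forces a GC and reports the heap currently in use, in MB.
func liveMB() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc >> 20
}

func main() {
	buf := make([]*config, 0, 2048)

	// "High traffic": 2,000 impressions, each holding its own config object.
	for i := 0; i < 2000; i++ {
		buf = append(buf, &config{})
	}
	fmt.Println("after high traffic:", liveMB(), "MB live") // roughly 125 MB

	// "Flush": reset the logical length but keep the same backing array -
	// the moral equivalent of resetting iPtr without clearing the slots.
	buf = buf[:0]
	fmt.Println("after reset:", liveMB(), "MB live") // still roughly 125 MB

	// "Low traffic": only the first 500 slots are overwritten; the other
	// 1,500 stale configs remain reachable through the backing array.
	for i := 0; i < 500; i++ {
		buf = append(buf, &config{})
	}
	fmt.Println("after low traffic:", liveMB(), "MB live") // still roughly 125 MB, not ~31 MB
}
```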
The Fix
With this understanding, the fix is very simple, using Go's clear function:
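A sketch of the fix, in terms of the reconstruction above (flushAndReset is a hypothetical helper, not the service's real code): after handing the batch off, clear zeroes every element of the buffer, so the stale references - and the config objects they point to - become eligible for garbage collection before the index is reset.

```go
// Release the buffer's references before resetting the index, so the GC can
// reclaim the impressions and the configs they point to.
func flushAndReset(imps []*Imp, iPtr *int) {
	flushImps(imps[:*iPtr])
	clear(imps) // Go 1.21+ builtin: sets all len(imps) elements to nil
	*iPtr = 0
}
```

On Go versions older than 1.21, looping over the buffer and setting each element to nil achieves the same effect.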
After deploying the fix, we indeed saw a significant improvement in performance: latencies became much more stable, and the pod count decreased.
Let's Summarize
Recipe for success:
Understand where the problem lies (network / pod-level / etc)
Gather data
Deep dive with pprof
Fix
Monitor on system-wide level & prove with pprof
Get pats on the back for saving cloud cost