Optimizing Gitea Act Runner Connection Load: Reducing from 1,300 req/s to 170 req/s



Gitea Act Runner is the execution component of Gitea Actions, responsible for fetching CI/CD tasks from the Gitea Server and reporting execution results. As more teams self-host Gitea, the HTTP request volume between Runners and the Server has become a bottleneck on the Server side. This article documents how we analyzed and resolved this problem, reducing the request volume from approximately 1,300 req/s to approximately 170 req/s for 200 Runners — an 87% reduction.

Current Architecture: Everything is HTTP Polling

All communication between Act Runner and Gitea Server is based on ConnectRPC (HTTP unary request-response) — no streaming, no WebSocket. Every communication is a full HTTP roundtrip:

Runner → POST /api/actions/runner.v1.RunnerService/FetchTask  → Server (polling)
Runner → POST /api/actions/runner.v1.RunnerService/UpdateLog  → Server (log reporting)
Runner → POST /api/actions/runner.v1.RunnerService/UpdateTask → Server (state reporting)

The original design had two fixed-frequency timers:

  1. Poller: Calls FetchTask every 2 seconds, regardless of whether there are jobs
  2. Reporter: Calls UpdateLog + UpdateTask every 1 second, regardless of whether there is new data

Quantifying the Problem

Let’s calculate for a typical medium-to-large deployment: 200 Runners, averaging 3 tasks each.

Polling layer:

200 runners × 1 req / 2s = 100 req/s

Reporter layer (2 requests per second for each running task):

200 runners × 3 tasks × 2 req/s = 1,200 req/s

Total: approximately 1,300 req/s. Most of these are wasted requests — no new jobs, no new logs, no state changes.

Solution 1: Polling Backoff with Jitter

The Polling Problem

200 idle Runners sending FetchTask every 2 seconds generate 100 req/s of wasted traffic. Worse, if the Server goes down briefly and recovers, all Runners will flood in simultaneously (thundering herd).

Design: Two Independent Counters (Per-Worker)

// Each worker goroutine holds its own state, avoiding shared counters
// With Capacity > 1, this prevents different workers' empty counts
// from accumulating into false backoff
type workerState struct {
    consecutiveEmpty  int64 // Server responds normally, but no task
    consecutiveErrors int64 // Network errors, timeouts
}

Why not use a single counter? Because “no jobs available” and “Server is down” are two different scenarios requiring different recovery strategies:

| Scenario | empty | errors | Behavior |
|---|---|---|---|
| Server normal, no jobs | +1 | reset | Gradual backoff |
| Server unresponsive | no change | +1 | Aggressive backoff |
| Server recovers, still no jobs | +1 | reset | errors reset but empty maintains backoff |
| Task acquired | reset | reset | Immediately return to minimum interval |

Key scenario: Server goes down for 5 minutes then recovers. With a single counter, the first successful response after recovery would reset the counter to zero, and all Runners would simultaneously return to 2-second intervals, causing a thundering herd. With two counters, errors reset but empty continues, providing a smooth backoff transition.

Why per-worker instead of a shared atomic? Act Runner’s architecture consists of “N independent workers, each polling and running on its own” (Capacity controls N). With a shared counter, 3 workers each experiencing their first empty would push the counter to 3, being misjudged as “3 consecutive empties” and triggering a long backoff — when each worker has only been empty once. By splitting counters into workerState, each goroutine reads and writes its own int64, eliminating the need for atomics and avoiding this false backoff issue. An added benefit: when the Server recovers and one worker gets a task, only that worker’s backoff resets while others maintain their current interval, providing stronger thundering herd protection.

The Math Behind Exponential Backoff

func (p *Poller) calculateInterval(s *workerState) time.Duration {
    base := p.cfg.Runner.FetchInterval           // default 2s
    maxInterval := p.cfg.Runner.FetchIntervalMax // default 60s

    n := max(s.consecutiveEmpty, s.consecutiveErrors)
    if n <= 1 {
        return base
    }

    shift := min(n-1, 5)  // max shift=5, prevents int64 overflow
    interval := base * time.Duration(int64(1)<<shift)
    return min(interval, maxInterval)
}

Backoff curve:

n=1  → 2s  (first empty response, no backoff)
n=2  → 4s
n=3  → 8s
n=4  → 16s
n=5  → 32s
n=6+ → 60s (cap)

min(n-1, 5) caps the bit shift: if n kept accumulating, a shift of 63 or more would overflow int64 into a negative number. A cap of 5 is all we need anyway, since base × 32 = 64s already exceeds the 60s ceiling.

Jitter: Spreading Synchronized Requests

func addJitter(d time.Duration) time.Duration {
    jitterRange := int64(d) * 2 / 5  // 40% total range
    jitter := rand.Int64N(jitterRange) - jitterRange/2  // [-20%, +20%]
    return d + time.Duration(jitter)
}

If 200 Runners start simultaneously, their backoff counters will be identical, producing the exact same interval. Adding ±20% random jitter spreads requests across the [interval×0.8, interval×1.2] range.

Why 20% instead of 50%? Too much jitter makes behavior unpredictable (user sets 5s but actual could be 2.5s or 7.5s). 20% provides sufficient spreading without deviating too far from expectations. This is also the jitter strategy recommended by AWS.

Why We Abandoned rate.Limiter

The original design used golang.org/x/time/rate.Limiter:

limiter := rate.NewLimiter(rate.Every(2*time.Second), 1)
limiter.Wait(ctx)  // fixed rate

The problem is that rate.Limiter models a steady token-bucket rate: although SetLimit can adjust it, the abstraction has no jitter support and fits a per-iteration, state-dependent interval poorly. Switching to time.NewTimer with the interval recalculated each iteration supports dynamic backoff and jitter naturally.

Fetch First, Sleep After

An easy-to-miss detail: the old rate.NewLimiter(..., 1) with burst=1 allowed the first Wait() to return immediately, so the Runner could attempt to fetch the next job right after startup or completing a task.

When we initially switched to timer-based, we made a mistake — putting the sleep before the fetch:

// Wrong: has to wait a full FetchInterval after startup before fetching
func (p *Poller) pollOnce(s *workerState) {
    for {
        timer := time.NewTimer(interval)  // sleep first
        <-timer.C
        task, ok := p.fetchTask(ctx, s)   // then fetch
        ...
    }
}

The correct approach is fetch first, sleep after — attempt to fetch first, only sleep if nothing is available:

func (p *Poller) pollOnce(s *workerState) {
    for {
        task, ok := p.fetchTask(p.pollingCtx, s)  // fetch first
        if !ok {
            timer := time.NewTimer(interval)      // sleep only if no task
            select {
            case <-timer.C:
            case <-p.pollingCtx.Done():
                timer.Stop()
                return
            }
            continue
        }
        // Got a task, execute immediately
        p.runTaskWithRecover(p.jobsCtx, task)
        return
    }
}

This preserves the burst=1 semantics: fetch immediately on startup, and immediately try the next one after a task completes, without wasting any wait time.

Polling Results

| Scenario | Before | After | Reduction |
|---|---|---|---|
| 200 idle Runners | 100 req/s | ~3.4 req/s (backed off to 60s) | 97% |

Solution 2: Event-Driven Reporter

The Reporter Problem

The original RunDaemon executed two HTTP requests every second:

func (r *Reporter) RunDaemon() {
    _ = r.ReportLog(false)    // HTTP call
    _ = r.ReportState(false)  // HTTP call
    time.AfterFunc(time.Second, r.RunDaemon)
}

Even when there are no new log lines and the state hasn’t changed, requests are still sent. 600 running tasks (200 runners × 3 tasks) produce 1,200 req/s.

But CI task log output is intermittent: heavy output during npm install, occasional lines while downloading Docker images, and complete silence between steps. A fixed 1-second interval wastes requests during silent periods and can't deliver any faster during bursts.

Design: Triple-Trigger Mechanism

Replace the recursive timer with a goroutine + select event loop:

[Diagram: Event-Driven Reporter]

The three trigger conditions each solve a different problem:

| Trigger | Default | What it solves | What happens without it |
|---|---|---|---|
| Batch size | 100 rows | Fast delivery during high output | npm install outputs 500 lines, waits 5 seconds |
| logTicker | 5s | Steady-state baseline | Channel notifications may be coalesced, needs periodic scan |
| maxLatencyTimer | 3s | Single log line doesn't wait long | One "Starting…" line followed by silence, waits 5 seconds |

State reporting is separated to a 5-second interval, with a stateNotify channel for immediate flushing on step transitions.

Why Separate Log and State?

Log and State have completely different change frequencies:

| Data | Change frequency | Frontend use | Latency tolerance |
|---|---|---|---|
| Log rows | Tens of lines/sec (burst) | Real-time CI output | 3-5 seconds |
| Task state | Once per step transition | Step status icons | <1s (via stateNotify) |

Sharing the same timer would force the interval to the strictest requirement, needlessly increasing the state request frequency.

Channel Design: Why Buffered(1) + Non-Blocking?

logNotify: make(chan struct{}, 1)

// In Fire():
select {
case r.logNotify <- struct{}{}:
default:  // channel already has a notification, don't block
}

Fire() is a logrus hook — every CI log line passes through it, making it a hot path. With an unbuffered channel, Fire() would block until the daemon goroutine reads, directly slowing down CI execution.

The buffer=1 semantics mean “there’s something new” (boolean signal), not “how many.” The daemon checks len(r.logRows) after receiving the notification to learn the actual count. buffer>1 provides no additional benefit.

stateNotify: Why Flush Immediately on Step Transitions?

// Step start detected
if step.StartedAt == nil {
    step.StartedAt = timestamppb.New(timestamp)
    urgentState = true  // → triggers stateNotify
}

// Step end detected
if stepResult, ok := r.parseResult(v); ok {
    step.Result = stepResult
    urgentState = true  // → triggers stateNotify
}

When users watch a CI build in the Gitea UI, what they care about most is when a step changes from “waiting” to “running” (spinning animation) and from “running” to “success/failure” (checkmark/cross). These are UX-critical moments. If they have to wait 5 seconds for the stateTicker, users will feel like it’s “stuck.”

When the daemon receives a stateNotify, it flushes both log and state simultaneously, ensuring <1 second latency.

Skip Optimization: Don’t Send If Nothing Changed

Beyond the trigger mechanism, two layers of skip logic were added:

ReportLog — return immediately on empty buffer:

if !noMore && len(rows) == 0 {
    return nil  // no HTTP request sent
}

ReportState — dirty flag:

// In Fire(), on any state change:
r.stateChanged = true

// In ReportState():
if !reportResult && !changed && len(outputs) == 0 {
    return nil  // even proto.Clone is saved
}

Why use a dirty flag instead of a serialization comparison (proto.Marshal followed by bytes.Equal)? Because proto.Marshal would serialize the entire TaskState on every daemon tick, even though nothing changes most of the time. A dirty flag is a zero-cost bool check.

Reporter Results

| Scenario | Before | After | Reduction |
|---|---|---|---|
| Log requests (420 active tasks) | 420 req/s | 84 req/s | 80% |
| State requests | 126 req/s | 25 req/s | 80% |
| Total | ~550 req/s | ~109 req/s | 80% |

Solution 3: HTTP Client Tuning

The HTTP Client Problem

The original code used http.DefaultClient in non-insecure mode:

func getHTTPClient(endpoint string, insecure bool) *http.Client {
    if strings.HasPrefix(endpoint, "https://") && insecure {
        return &http.Client{Transport: &http.Transport{
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        }}
    }
    return http.DefaultClient  // MaxIdleConnsPerHost = 2
}

http.DefaultClient’s MaxIdleConnsPerHost defaults to 2. All Runner requests target the same Server, and when concurrent goroutines (polling + multiple task reporters) exceed 2, excess idle connections are closed, requiring new TCP + TLS handshakes for subsequent requests.

Additionally, getHTTPClient was called twice (PingService + RunnerService), creating two separate connection pools.

The Fix

func getHTTPClient(endpoint string, insecure bool) *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        10,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
    }
    if strings.HasPrefix(endpoint, "https://") && insecure {
        transport.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
    }
    return &http.Client{Transport: transport}
}

// Shared client
httpClient := getHTTPClient(endpoint, insecure)
PingServiceClient:   pingv1connect.NewPingServiceClient(httpClient, ...)
RunnerServiceClient: runnerv1connect.NewRunnerServiceClient(httpClient, ...)

Why 10? Maximum concurrency ≈ 1 (polling) + capacity × 2 (log + state per task). With the default capacity=1, only 3 are needed; setting 10 covers capacity=4 without waste.

Difference from Circuit Breakers

Some might ask: isn’t this just a circuit breaker? Not exactly.

| Dimension | Circuit Breaker | Our Adaptive Backoff |
|---|---|---|
| Stops requests? | Yes (fully blocks in OPEN state) | No, just slows down (60s max) |
| State model | Three states: Closed → Open → Half-Open | Stateless, continuous interval calculation |
| Recovery method | Probes one request after cooldown | Resets immediately when task acquired |
| Design purpose | Fail-fast | Reduce wasted load |

Circuit breakers are suited for scenarios where “the downstream is completely unavailable and continuing to retry only adds burden.” Our scenario is “the downstream is fine, there’s just no work to do” — Backoff is more appropriate. If we need to protect against Server overload in the future, we can layer a circuit breaker on top when consecutiveErrors exceeds a threshold.

Frontend UX Impact

None of these optimizations should sacrifice user experience. Here’s the latency comparison:

| Scenario | Before | After | Why it's acceptable |
|---|---|---|---|
| Continuous output (npm install) | ~1s | ~5s | CI logs don't need sub-second updates |
| Single line then silence | ~1s | ≤3s | maxLatencyTimer as baseline |
| Large burst (100+ lines) | ~1s | <1s | Batch size triggers immediate flush, faster than before |
| Step start/end | ~1s | <1s | stateNotify immediate flush |
| Job completion | ~1s | ~1s | Close() retry mechanism unchanged |

New Configuration Options

All settings have safe defaults — existing configuration files require no changes:

runner:
  # Polling backoff
  fetch_interval_max: 60s # Maximum backoff interval when idle

  # Log reporting
  log_report_interval: 5s # Periodic flush interval
  log_report_max_latency: 3s # Maximum wait time for a single log line (must be less than log_report_interval)
  log_report_batch_size: 100 # Number of rows that triggers an immediate flush

  # State reporting
  state_report_interval: 5s # Periodic flush interval (step transitions are still immediate)

Summary

| Optimization | Approach | Reduction |
|---|---|---|
| Polling | Exponential backoff + jitter | 97% (idle runners) |
| Log reporting | Event-driven + batching + skip empty | 80% |
| State reporting | Separate interval + dirty flag + skip unchanged | 80% |
| HTTP connections | Connection pool tuning + shared client | Reduced TCP/TLS re-establishment |
| Overall | 200 runners × 3 tasks | 1,300 → 170 req/s (87%) |

The common principle behind all these optimizations: don't do unnecessary work. No new logs? Don't send UpdateLog. State unchanged? Don't send UpdateTask. No jobs? Gradually reduce FetchTask frequency. Together, these changes significantly reduce Server load without sacrificing frontend responsiveness.