The 45-Second Mystery
Last month, our Crashlytics lit up:
```
UnknownHostException: Unable to resolve host "api.example.com"
Occurrences: 2,847
Users affected: 1,203
Context: ¯\_(ツ)_/¯
```
Backend team checked their dashboards: "API response time is 47ms p95. Not our problem."
They were right. The API was fast. But users were staring at spinners for 45 seconds before seeing "Something went wrong."
Where did those 45 seconds go?
The Visibility Gap
Here's what most Android apps measure - and what they miss:
| Metric | Do You Track It? |
|---|---|
| Total request time | Yes |
| HTTP status code | Yes |
| DNS resolution time | No |
| TCP handshake time | No |
| TLS negotiation time | No |
| Time waiting for first byte | No |
| Which phase failed | No |
You're measuring the destination, but you're blind to the journey.
That UnknownHostException? It could mean:
- DNS server unreachable (2-30 second timeout)
- Domain doesn't exist (instant failure)
- Network switched mid-request (random timing)
- DNS poisoning in certain regions (varies)
Without phase-level visibility, you're debugging with a blindfold.
Where Time Actually Goes
We instrumented 50,000 requests across different network conditions. Here's what we found:
Good Network (WiFi, 4G LTE)
| Phase | P50 | P95 | P99 |
|---|---|---|---|
| DNS Lookup | 5ms | 45ms | 120ms |
| TCP Connect | 23ms | 89ms | 156ms |
| TLS Handshake | 67ms | 142ms | 203ms |
| Time to First Byte | 52ms | 187ms | 412ms |
| Total | 147ms | 463ms | 891ms |
Degraded Network (3G, Poor Signal)
| Phase | P50 | P95 | P99 |
|---|---|---|---|
| DNS Lookup | 234ms | 8,200ms | 29,000ms |
| TCP Connect | 456ms | 2,100ms | 5,600ms |
| TLS Handshake | 312ms | 890ms | 1,400ms |
| Time to First Byte | 178ms | 1,200ms | 3,400ms |
| Total | 1,180ms | 12,390ms | 39,400ms |
The culprit in our 45-second mystery? DNS timeout on degraded networks.
But we only discovered this after adding proper instrumentation.
The Logging Interceptor Trap
Some version of this lives in nearly every Android codebase:
```kotlin
class LoggingInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start
        Timber.d("Request took ${duration}ms") // <-- This number lies
        return response
    }
}
```
Why it lies:
| Scenario | What Happened | What You See |
|---|---|---|
| 3 retries due to connection reset | 3 separate failures, then success | "Request took 12,000ms" |
| Cache hit | Instant response from disk | "Request took 2ms" (good!) |
| Redirect chain (3 hops) | 3 network round trips | Single timing |
| DNS timeout + success | 30s DNS, 200ms request | "Request took 30,200ms" |
You're seeing the outcome, not the story.
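Two of those distortions can at least be flagged from inside an interceptor - OkHttp's Response exposes cacheResponse and priorResponse - but retries, DNS, and connect time remain invisible. A minimal sketch (the class name is made up):

```kotlin
import okhttp3.Interceptor
import okhttp3.Response
import timber.log.Timber

class SlightlyMoreHonestInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start

        // Pure cache hit: the body came from disk, no network round trip happened
        val fromCache = response.cacheResponse != null && response.networkResponse == null
        // A prior response means this timing spans at least one redirect or auth retry
        val hadRedirects = response.priorResponse != null

        Timber.d("Request took ${duration}ms (cache=$fromCache, redirected=$hadRedirects)")
        return response
    }
}
```

It's a band-aid: the total is still one opaque number, which is why the rest of this post moves to EventListener.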
The OkHttp Timeout Trap
Here's a "reasonable" timeout configuration:
```kotlin
val client = OkHttpClient.Builder()
    .connectTimeout(10, TimeUnit.SECONDS)
    .readTimeout(30, TimeUnit.SECONDS)
    .writeTimeout(30, TimeUnit.SECONDS)
    .build()
```
Pop quiz: What's the maximum time a user could wait?
If you said 70 seconds, you're wrong. It's potentially infinite.
The Timeout Truth Table
| Timeout | What It Actually Controls | Resets? |
|---|---|---|
| connectTimeout | DNS + TCP + TLS combined | No |
| readTimeout | Max time between bytes read | Yes, per read |
| writeTimeout | Max time between bytes written | Yes, per write |
| callTimeout | Entire operation end-to-end | No |
A server trickling 1 byte every 25 seconds will never trigger your 30-second readTimeout. Each byte resets the clock.
callTimeout is the only timeout that represents actual user experience.
```kotlin
val client = OkHttpClient.Builder()
    .connectTimeout(15, TimeUnit.SECONDS)
    .readTimeout(30, TimeUnit.SECONDS)
    .writeTimeout(30, TimeUnit.SECONDS)
    .callTimeout(45, TimeUnit.SECONDS) // <-- The one that matters
    .build()
```
The Solution: EventListener
OkHttp has an API hiding in plain sight that most developers never touch. EventListener gives you callbacks for every phase of the request lifecycle.
| Phase | Events |
|---|---|
| Start | callStart |
| DNS | dnsStart → dnsEnd |
| TCP | connectStart → connectEnd |
| TLS | secureConnectStart → secureConnectEnd |
| Request | connectionAcquired → requestHeadersStart → requestHeadersEnd → requestBodyStart → requestBodyEnd |
| Response | responseHeadersStart → responseHeadersEnd → responseBodyStart → responseBodyEnd |
| Cleanup | connectionReleased → callEnd |
Production Implementation
```kotlin
class NetworkMetricsListener(
    private val onMetrics: (NetworkMetrics) -> Unit
) : EventListener() {

    // Stateful: install one instance per call via OkHttpClient.Builder().eventListenerFactory(...)
    private var callStart = 0L
    private var dnsStart = 0L
    private var dnsEnd = 0L
    private var connectStart = 0L
    private var connectEnd = 0L
    private var secureConnectStart = 0L
    private var secureConnectEnd = 0L
    private var requestStart = 0L
    private var responseStart = 0L
    private var connectionReused = false

    override fun callStart(call: Call) {
        callStart = System.nanoTime()
    }

    override fun dnsStart(call: Call, domainName: String) {
        dnsStart = System.nanoTime()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        dnsEnd = System.nanoTime() // Connection reuse skips DNS entirely, so this may never fire
    }

    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        connectStart = System.nanoTime()
    }

    override fun secureConnectStart(call: Call) {
        secureConnectStart = System.nanoTime()
    }

    override fun secureConnectEnd(call: Call, handshake: Handshake?) {
        secureConnectEnd = System.nanoTime()
    }

    override fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {
        connectEnd = System.nanoTime()
    }

    override fun connectionAcquired(call: Call, connection: Connection) {
        connectionReused = (connectStart == 0L) // No connect phase = reused
    }

    override fun requestHeadersStart(call: Call) {
        requestStart = System.nanoTime()
    }

    override fun responseHeadersStart(call: Call) {
        responseStart = System.nanoTime()
    }

    override fun callEnd(call: Call) {
        emitMetrics(success = true)
    }

    override fun callFailed(call: Call, ioe: IOException) {
        emitMetrics(success = false, error = ioe)
    }

    private fun emitMetrics(success: Boolean, error: IOException? = null) {
        val now = System.nanoTime()
        onMetrics(NetworkMetrics(
            dnsMs = phaseMs(dnsStart, dnsEnd, now),
            // TCP only: ends when TLS starts (or at connectEnd for cleartext)
            connectMs = phaseMs(connectStart, if (secureConnectStart > 0L) secureConnectStart else connectEnd, now),
            tlsMs = phaseMs(secureConnectStart, secureConnectEnd, now),
            ttfbMs = phaseMs(requestStart, responseStart, now),
            totalMs = phaseMs(callStart, now, now),
            connectionReused = connectionReused,
            success = success,
            errorType = error?.javaClass?.simpleName
        ))
    }

    // Skipped phases report 0; a phase that started but never finished
    // (e.g. a DNS timeout) is charged all the time up to the failure
    private fun phaseMs(start: Long, end: Long, fallbackEnd: Long): Long =
        if (start == 0L) 0L
        else TimeUnit.NANOSECONDS.toMillis((if (end > 0L) end else fallbackEnd) - start)
}
```
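The post doesn't show the NetworkMetrics container or the client wiring, so here's a minimal sketch; the field names mirror the listener above, and the Timber call stands in for whatever analytics pipeline you use. The one non-negotiable detail is eventListenerFactory: the listener is stateful, so every call needs a fresh instance.

```kotlin
import okhttp3.Call
import okhttp3.EventListener
import okhttp3.OkHttpClient
import timber.log.Timber
import java.util.concurrent.TimeUnit

data class NetworkMetrics(
    val dnsMs: Long,
    val connectMs: Long,
    val tlsMs: Long,
    val ttfbMs: Long,
    val totalMs: Long,
    val connectionReused: Boolean,
    val success: Boolean,
    val errorType: String?
)

val client = OkHttpClient.Builder()
    .callTimeout(45, TimeUnit.SECONDS)
    // One stateful listener per call - never share a single instance across calls
    .eventListenerFactory(object : EventListener.Factory {
        override fun create(call: Call): EventListener =
            NetworkMetricsListener { metrics ->
                Timber.d("phase timings: $metrics") // swap for your analytics client
            }
    })
    .build()
```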
What You Get
Before EventListener:
| Field | Value |
|---|---|
| Duration | 32,450ms |
| Error | UnknownHostException |
| Context | ??? |
After EventListener:
| Phase | Duration | Status |
|---|---|---|
| DNS Lookup | 30,120ms | TIMEOUT |
| TCP Connect | -- | -- |
| TLS Handshake | -- | -- |
| TTFB | -- | -- |
| Total | 30,120ms | |
| Error | UnknownHostException | |
| Failed Phase | DNS | |
Now you know: The DNS resolver on this user's network is broken. Not your API. Not your code. Their ISP.
Level Up: Distributed Tracing with OpenTelemetry
EventListener tells you what happened on the client. But what about the full journey?
Android App → CDN → API Gateway → Service → Database
Where is the slowness?
With OpenTelemetry, you can trace a request from button tap to database query:
```kotlin
class TracingEventListener(
    private val tracer: Tracer
) : EventListener() {

    private var rootSpan: Span? = null
    private var dnsSpan: Span? = null

    override fun callStart(call: Call) {
        rootSpan = tracer.spanBuilder("HTTP ${call.request().method}")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("http.url", call.request().url.toString())
            .startSpan()
    }

    override fun dnsStart(call: Call, domainName: String) {
        dnsSpan = tracer.spanBuilder("DNS Lookup")
            .setParent(Context.current().with(rootSpan!!))
            .startSpan()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        dnsSpan?.end()
    }

    // ... create child spans for the remaining phases (TCP, TLS, TTFB) the same way

    override fun callEnd(call: Call) {
        rootSpan?.setStatus(StatusCode.OK)
        rootSpan?.end()
    }

    override fun callFailed(call: Call, ioe: IOException) {
        rootSpan?.setStatus(StatusCode.ERROR, ioe.message ?: "call failed")
        rootSpan?.end()
    }
}
```
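Wiring this up needs a Tracer. Here's a minimal sketch using the OpenTelemetry SDK with an OTLP exporter; the endpoint (a local collector reached through the emulator's host alias) and the instrumentation scope name are assumptions, not part of the original post:

```kotlin
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor
import okhttp3.Call
import okhttp3.EventListener
import okhttp3.OkHttpClient

fun buildTracedClient(): OkHttpClient {
    // Export spans to an OTLP collector (assumed to be running locally)
    val exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://10.0.2.2:4317")
        .build()

    val tracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build()

    val tracer = OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .build()
        .getTracer("okhttp-client") // instrumentation scope name is arbitrary

    return OkHttpClient.Builder()
        // Fresh listener per call, since it holds per-call span state
        .eventListenerFactory(object : EventListener.Factory {
            override fun create(call: Call): EventListener = TracingEventListener(tracer)
        })
        .build()
}
```

To stitch these client spans onto your backend's spans you also need to propagate trace context (for example, injecting the W3C traceparent header with an interceptor); the sketch above only exports client-side spans.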
The Trace Waterfall
Now you can answer: "Is it DNS, the network, or the backend?"
Observability Stack Options
| Solution | Cost | Setup | Best For |
|---|---|---|---|
| Honeycomb | Paid (20M events free) | 5 min | Best query experience |
| Grafana Cloud | Free 50GB/mo | 10 min | Already using Grafana |
| Jaeger | Free (self-host) | 1-2 hrs | Full control |
| Datadog | Paid | 15 min | Enterprise, existing DD |
Results
After implementing EventListener + OpenTelemetry in our production app:
| Metric | Before | After | Change |
|---|---|---|---|
| MTTR for network issues | 4.2 hours | 23 minutes | -91% |
| "Network error" bug reports | 847/week | 312/week | -63% |
| P95 false timeout errors | 2.3% | 0.4% | -83% |
The biggest win? We stopped blaming the backend for DNS problems.
Real Device Testing: The Proof
Theory is nice. Data is better. I built a test app and ran it on real devices to see what actually happens.
Test App: github.com/aldefy/okhttp-network-metrics
EventListener: See What Interceptors Can't
| What Interceptor Sees | What EventListener Reveals |
|---|---|
| Request → Response | DNS: 5081ms |
| Total: 7362ms | TCP: 1313ms |
| | TLS: 964ms |
| | TTFB: 7359ms |
Baseline Performance
| Device | Network | Cold DNS | TCP | TLS | TTFB | Total |
|---|---|---|---|---|---|---|
| Pixel 9 Pro Fold | Jio 4G | 229ms | 972ms | 665ms | 1989ms | 1991ms |
| Moto Razr 40 Ultra | Airtel | 5081ms | 1313ms | 964ms | 7359ms | 7362ms |
That 5-second DNS on the Motorola? That's not a typo. Airtel's DNS is slow on the first lookup.
The Doze Mode Discovery
Same error message. Completely different root causes. This is why EventListener matters.
Running the same request right after each device came out of Doze made the split obvious:
- Pixel: Fails immediately post-doze, then recovers after 5 seconds
- Moto: Works immediately, then loses network entirely
Without EventListener, both would show the same error. With it, you can see the Pixel fails at TCP (SocketTimeoutException), while Moto loses the network interface completely.
TL;DR
- Your logging interceptor is lying. It shows outcomes, not phases.
- callTimeout is the only timeout that matters for user experience.
- EventListener exists. Use it. You'll finally understand your network failures.
- Add distributed tracing if you need end-to-end visibility across services.
- DNS is usually the culprit for those mysterious 30+ second timeouts.
The UnknownHostException you've been catching with a generic error message? It deserves better. Your users certainly do.
What This Means For You
EventListener transforms network debugging from guesswork into science. Instead of:
"Users are reporting slow network. Maybe its the backend?"
You get:
"Airtel users on Motorola devices have 5s DNS resolution. Jio users on Pixel fail TCP immediately after doze but recover in 5s. Backend is fine - its carrier DNS and OS power management."
Thats the difference between filing a ticket with your backend team and actually fixing the problem.
Your Action Items
- Add EventListener to your OkHttp client - 50 lines of code, infinite debugging value
- Log phase timings to your analytics - segment by carrier, device, network type (see the sketch after this list)
- Test doze recovery on YOUR devices - the behavior varies wildly
- Set callTimeout - it's the only timeout that reflects user experience
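For item 2, here's a sketch of how you might pull the carrier / device / network-type dimensions to attach to each metrics event. The function name and dimension keys are made up; it assumes you hold a Context and the ACCESS_NETWORK_STATE permission:

```kotlin
import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities
import android.os.Build
import android.telephony.TelephonyManager

// Hypothetical helper: attach the returned map to every NetworkMetrics event you log
fun networkDimensions(context: Context): Map<String, String> {
    val cm = context.getSystemService(ConnectivityManager::class.java)
    val caps = cm?.getNetworkCapabilities(cm.activeNetwork)
    val transport = when {
        caps == null -> "offline"
        caps.hasTransport(NetworkCapabilities.TRANSPORT_WIFI) -> "wifi"
        caps.hasTransport(NetworkCapabilities.TRANSPORT_CELLULAR) -> "cellular"
        else -> "other"
    }
    val carrier = context.getSystemService(TelephonyManager::class.java)
        ?.networkOperatorName.orEmpty().ifEmpty { "unknown" }
    return mapOf(
        "transport" to transport,
        "carrier" to carrier,
        "device" to Build.MODEL
    )
}
```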
Try It Yourself
I open-sourced the test app: github.com/aldefy/okhttp-network-metrics
Run it on your devices. See what YOUR carrier's DNS looks like. Find out how YOUR devices recover from doze.
Download the test app - run Baseline + Post-Doze, and DM me your results on Twitter. I'll add your device to this post.
Because the network will always be unreliable - but now you can see exactly where and why.
The next time someone says "it's a network issue," you'll know which part of the network, on which carrier, on which device, under which conditions. That's not debugging - that's engineering.
Tags: Android, OkHttp, Kotlin, Network, Observability, OpenTelemetry