Understanding Network Failures on Android
The Black Box Problem
When a network request fails on Android, what do you actually know?
try {
val response = apiService.getUser()
} catch (e: IOException) {
Timber.e(e, "Network error") // ...and that's it
}
You know it failed. You don't know where or why.
Was it DNS? TCP? TLS? The server? Your user's ISP? A flaky wifi connection?
Most Android apps treat the network as a black box. Request goes in, response (or error) comes out. Everything in between is invisible.
What Actually Happens During a Network Request
Before we can debug network issues, we need to understand what we're debugging. Here's the journey of a single HTTPS request:
graph TD
A["1. DNS Lookup<br/>api.example.com → IP"] --> B["2. TCP Connect<br/>SYN → SYN-ACK → ACK"]
B --> C["3. TLS Handshake<br/>Certificate + Keys"]
C --> D["4. Request Send<br/>HTTP headers + body"]
D --> E["5. Wait - TTFB<br/>Server processing"]
E --> F["6. Response Receive<br/>Download body"]
Each phase can fail independently. Each has different failure modes and different timings.
The problem: Your try/catch block sees them all as one blob.
Why This Matters: A Real Debugging Story
Here's a common scenario:
Your crash reporting tool shows:
UnknownHostException: Unable to resolve host "api.example.com"
What does this tell you? Almost nothing useful.
UnknownHostException could mean:
- DNS server is unreachable (network issue)
- Domain doesn't exist (config bug)
- DNS response was too slow (timeout)
- User switched networks mid-request
- ISP is having DNS issues
- Corporate firewall blocking DNS
Without knowing when in the request lifecycle this happened and how long it took, you're guessing.
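To make the guessing concrete, here is a minimal sketch of what you can infer from the exception class alone. The function name guessPhase and the phase labels are hypothetical; this is a heuristic for illustration, not something OkHttp provides.

```kotlin
import java.io.IOException
import java.net.ConnectException
import java.net.SocketTimeoutException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

// Hypothetical helper: a best-effort guess based only on the exception class.
fun guessPhase(e: IOException): String = when (e) {
    // Says "DNS", but not which of the causes listed above.
    is UnknownHostException -> "dns"
    is ConnectException -> "tcp"
    is SSLException -> "tls"
    // A timeout can come from connecting, reading, or writing.
    is SocketTimeoutException -> "unknown"
    else -> "unknown"
}
```

Even in the best case this gives a coarse label with no timing attached, and a plain timeout maps to nothing at all.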
The Logging Interceptor Trap
This pattern exists in most Android codebases:
class LoggingInterceptor : Interceptor {
override fun intercept(chain: Interceptor.Chain): Response {
val start = System.currentTimeMillis()
val response = chain.proceed(chain.request())
val duration = System.currentTimeMillis() - start
Timber.d("${chain.request().url} took ${duration}ms")
return response
}
}
This tells you total time. It doesn't tell you:
- How much was DNS vs TCP vs TLS vs server?
- Did the connection get reused? (huge difference)
- Were there retries hidden inside?
- Which phase failed on errors?
You're measuring the outcome, not the journey.
OkHttp's Hidden API: EventListener
OkHttp provides EventListener, an abstract class whose callbacks fire at each phase of the request lifecycle. Many developers don't know it exists.
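// Abridged: the real class has more callbacks, such as connectFailed, connectionAcquired, and requestBodyStart
abstract class EventListener {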
open fun callStart(call: Call) {}
open fun dnsStart(call: Call, domainName: String) {}
open fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {}
open fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {}
open fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {}
open fun secureConnectStart(call: Call) {}
open fun secureConnectEnd(call: Call, handshake: Handshake?) {}
open fun requestHeadersStart(call: Call) {}
open fun requestHeadersEnd(call: Call, request: Request) {}
open fun responseHeadersStart(call: Call) {}
open fun responseHeadersEnd(call: Call, response: Response) {}
open fun callEnd(call: Call) {}
open fun callFailed(call: Call, ioe: IOException) {}
}
With this, you can measure each phase independently.
Building a Metrics Listener
Here's a practical implementation:
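import okhttp3.Call
import okhttp3.EventListener
import okhttp3.Handshake
import okhttp3.Protocol
import java.io.IOException
import java.net.InetAddress
import java.net.InetSocketAddress
import java.net.Proxy
import java.util.concurrent.TimeUnit

class NetworkMetricsListener : EventListener() {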
private var callStartNanos = 0L
private var dnsStartNanos = 0L
private var connectStartNanos = 0L
private var secureConnectStartNanos = 0L
private var requestStartNanos = 0L
private var responseStartNanos = 0L
override fun callStart(call: Call) {
callStartNanos = System.nanoTime()
}
override fun dnsStart(call: Call, domainName: String) {
dnsStartNanos = System.nanoTime()
}
override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
val dnsMs = (System.nanoTime() - dnsStartNanos).nanosToMillis()
// Log or record: DNS took $dnsMs ms
}
override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
connectStartNanos = System.nanoTime()
}
override fun secureConnectStart(call: Call) {
secureConnectStartNanos = System.nanoTime()
}
override fun secureConnectEnd(call: Call, handshake: Handshake?) {
val tlsMs = (System.nanoTime() - secureConnectStartNanos).nanosToMillis()
// Log or record: TLS handshake took $tlsMs ms
}
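override fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {
// On HTTPS, connectEnd fires after secureConnectEnd, so this span covers TCP + TLS
val connectMs = (System.nanoTime() - connectStartNanos).nanosToMillis()
// Log or record: connect (TCP + TLS) took $connectMs ms; subtract the TLS time to get TCP alone
}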
override fun requestHeadersStart(call: Call) {
requestStartNanos = System.nanoTime()
}
override fun responseHeadersStart(call: Call) {
responseStartNanos = System.nanoTime()
}
override fun callEnd(call: Call) {
val totalMs = (System.nanoTime() - callStartNanos).nanosToMillis()
// Now you have the full breakdown
}
override fun callFailed(call: Call, ioe: IOException) {
// The latest phase that started tells you where the call was when it failed
val failedAt = when {
responseStartNanos > 0 -> "response"
requestStartNanos > 0 -> "request"
secureConnectStartNanos > 0 -> "tls"
connectStartNanos > 0 -> "tcp"
dnsStartNanos > 0 -> "dns"
else -> "pre-dns"
}
// Log: Request failed during $failedAt phase
}
private fun Long.nanosToMillis() = TimeUnit.NANOSECONDS.toMillis(this)
}
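One limitation of keeping a single timestamp per phase is that retries and nested phases (TLS happens inside connect) are easy to misattribute. As a possible refinement, here is a minimal sketch of a small recorder that tracks which phases are still open. PhaseRecorder and its method names are hypothetical and not part of OkHttp; the injectable clock exists only to make the class testable.

```kotlin
// Hypothetical helper, not part of OkHttp: tracks open phases and accumulated durations.
class PhaseRecorder(private val nanoClock: () -> Long = { System.nanoTime() }) {
    // Phases that have started but not yet ended, in start order.
    private val openStarts = LinkedHashMap<String, Long>()
    // Total milliseconds per phase, summed across retries.
    private val totalsMs = LinkedHashMap<String, Long>()

    fun start(phase: String) {
        openStarts.remove(phase) // re-insert so the phase becomes the most recent
        openStarts[phase] = nanoClock()
    }

    fun end(phase: String) {
        val startedAt = openStarts.remove(phase) ?: return
        val elapsedMs = (nanoClock() - startedAt) / 1_000_000
        totalsMs[phase] = (totalsMs[phase] ?: 0L) + elapsedMs
    }

    // The most recently started phase that never ended, or null if none is open.
    fun inFlight(): String? = openStarts.keys.lastOrNull()

    fun snapshot(): Map<String, Long> = totalsMs.toMap()
}
```

A listener could call start("dns") from dnsStart and end("dns") from dnsEnd, and so on for the other phases, then read inFlight() inside callFailed to see which phase never completed.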
Wiring It Up
val client = OkHttpClient.Builder()
.eventListenerFactory { NetworkMetricsListener() }
.build()
Note: Use eventListenerFactory (not eventListener) so each call gets its own listener instance.
What You Can Learn
Once you have phase-level metrics, you can answer questions like:
DNS Issues
- Is DNS slow for certain carriers/regions?
- Are users hitting DNS timeouts?
- Would DNS caching help?
Connection Issues
- Are TCP handshakes slow? (distance to server)
- Are TLS handshakes slow? (certificate chain too long?)
- Are connections being reused? (if dnsStart never fires, the connection was reused)
Server Issues
- Is TTFB (time to first byte) high? That's mostly server processing time, plus one network round trip.
- Is response download slow? Could be payload size or bandwidth.
Connection Reuse: The Silent Optimization
One important insight from EventListener: connection reuse.
If dnsStart and connectStart never fire for a request, OkHttp reused an existing connection. This skips DNS + TCP + TLS entirely.
On a reused connection:
Request: DNS(0) + TCP(0) + TLS(0) + TTFB(50ms) = 50ms
On a fresh connection:
Request: DNS(30ms) + TCP(45ms) + TLS(120ms) + TTFB(50ms) = 245ms
If your metrics show lots of fresh connections, you might be:
- Creating new OkHttpClient instances (don't do this)
- Hitting connection pool limits
- Having connections closed due to timeouts
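To check reuse directly rather than infer it, one option is a listener that watches connectionAcquired, combined with a single shared client. This is a sketch under assumptions: ReuseTrackingListener, the Http object, and longPollClient are hypothetical names, and the 2-minute read timeout is an arbitrary example value. It relies on OkHttp's connectionAcquired callback and on newBuilder(), which produces a client that shares the original's connection pool.

```kotlin
import okhttp3.Call
import okhttp3.Connection
import okhttp3.EventListener
import okhttp3.OkHttpClient
import java.net.InetSocketAddress
import java.net.Proxy
import java.util.concurrent.TimeUnit

// Hypothetical listener: reports whether each acquired connection came from the pool.
class ReuseTrackingListener(
    private val onAcquired: (reused: Boolean) -> Unit
) : EventListener() {
    private var connectedSinceLastAcquire = false

    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        connectedSinceLastAcquire = true
    }

    override fun connectionAcquired(call: Call, connection: Connection) {
        // No connectStart before this acquire means the connection was pooled.
        onAcquired(!connectedSinceLastAcquire)
        connectedSinceLastAcquire = false
    }
}

// One client for the whole app; variations derive from it instead of a fresh Builder().
object Http {
    val client: OkHttpClient = OkHttpClient.Builder()
        .eventListenerFactory {
            ReuseTrackingListener { reused -> println("connection reused: $reused") }
        }
        .build()

    // Shares the connection pool and dispatcher with `client`.
    val longPollClient: OkHttpClient = client.newBuilder()
        .readTimeout(2, TimeUnit.MINUTES)
        .build()
}
```

If a feature needs different timeouts, deriving its client with newBuilder() keeps it on the shared pool, whereas a separate OkHttpClient.Builder() starts with an empty pool of its own.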
Timeout Configuration
While we're here, let's clarify OkHttp's timeouts:
val client = OkHttpClient.Builder()
.connectTimeout(15, TimeUnit.SECONDS) // Connecting the TCP socket (does not cover the DNS lookup)
.readTimeout(30, TimeUnit.SECONDS) // Time between bytes (resets per chunk)
.writeTimeout(30, TimeUnit.SECONDS) // Time between bytes (resets per chunk)
.callTimeout(60, TimeUnit.SECONDS) // Total end-to-end timeout
.build()
Important: readTimeout and writeTimeout reset on each chunk of data. Unless callTimeout is set, a slow server trickling bytes can keep a request alive indefinitely.
callTimeout is the only one that represents total user-facing time.
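The trickle case is easier to see with numbers. The sketch below is a deliberately simplified model of the behavior described above, not OkHttp's implementation: timeoutThatFires is a hypothetical function that takes the gaps between received chunks and reports which timeout, if any, would end the read. A callTimeoutMs of 0 means no call timeout, matching OkHttp's default.

```kotlin
// Simplified model, not OkHttp's implementation.
// gapsMs: time between consecutive chunks of the response, in milliseconds.
fun timeoutThatFires(gapsMs: List<Long>, readTimeoutMs: Long, callTimeoutMs: Long): String? {
    var elapsedMs = 0L
    for (gap in gapsMs) {
        // A single read waits until data arrives or readTimeout gives up.
        val waitMs = minOf(gap, readTimeoutMs)
        if (callTimeoutMs > 0 && elapsedMs + waitMs >= callTimeoutMs) return "callTimeout"
        if (gap >= readTimeoutMs) return "readTimeout"
        elapsedMs += gap // data arrived, so the read timer starts over
    }
    return null // the response completed without any timeout firing
}
```

With ten chunks arriving 29 seconds apart and a 30-second readTimeout, the model reports no timeout even though the request took 290 seconds. Adding a 60-second callTimeout ends the same request on the third chunk.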
Next Steps
EventListener gives you visibility into what's happening. What you do with that data is up to you:
- Log to your analytics - track phase timings by carrier, region, app version
- Set up alerts - if DNS P95 spikes, you'll know
- Add to crash reports - attach phase info to network exceptions
- Distributed tracing - connect client spans to backend spans with OpenTelemetry
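For the alerting idea, a P95 per carrier is a small computation once the timings are collected. This is a minimal sketch: DnsSample, percentile, and dnsP95ByCarrier are hypothetical names, and nearest-rank is just one of several common percentile definitions.

```kotlin
// Hypothetical sample shape; the field names are assumptions for this sketch.
data class DnsSample(val carrier: String, val dnsMs: Long)

// Nearest-rank percentile using integer math; p must be in 1..100.
fun percentile(values: List<Long>, p: Int): Long {
    require(values.isNotEmpty()) { "no samples" }
    require(p in 1..100) { "p must be in 1..100" }
    val sorted = values.sorted()
    val rank = (p * sorted.size + 99) / 100 // ceil(p / 100 * n)
    return sorted[rank - 1]
}

// Groups DNS timings by carrier and reduces each group to its P95.
fun dnsP95ByCarrier(samples: List<DnsSample>): Map<String, Long> =
    samples.groupBy({ it.carrier }, { it.dnsMs })
        .mapValues { (_, timings) -> percentile(timings, 95) }
```

In practice this aggregation usually lives in the analytics backend rather than on the device; the sketch only shows the shape of the calculation.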
The network will always be unreliable. But at least now you can see where it's unreliable.
Further reading: OkHttp Events documentation
Real Device Testing: The Proof
Theory is nice. Here is what we actually measured using the EventListener on real devices:
Devices Tested
- Moto Razr 40 Ultra — Airtel 4G (no VoLTE)
- Pixel 9 Pro Fold — Jio 4G LTE
- Android Emulator — WiFi simulation
Baseline Performance
| Metric | Razr 40 Ultra | Pixel 9 Pro Fold |
|---|---|---|
| Cold DNS | 5070ms | 141ms |
| Cached DNS | 3-7ms | 10-12ms |
| TCP Connect | ~1300ms | ~1000ms |
| TLS Handshake | ~960ms | ~650ms |
| Total (warm) | ~2150ms | ~1900ms |
The Razr took 5 seconds just for DNS on the first request. Without EventListener, you would just see "slow network" — no idea WHERE the 5 seconds went.
Post-Doze Recovery
After forcing Android Doze mode and waking the device:
| Timing | Razr 40 Ultra | Pixel 9 Pro Fold |
|---|---|---|
| Immediate (0s) | ✅ Works (2082ms) | ❌ TCP timeout (15s) |
| After 5s | ❌ DNS fail | ❌ DNS fail |
| After 30s | ❌ DNS fail | ❌ DNS fail |
Two flagship foldables, same network type, completely different behavior:
- Razr: Slow cold DNS (5s!) but recovers from doze immediately
- Pixel: Fast baseline but aggressive doze kills even the first request
What Crash Reports Would Show vs What EventListener Revealed
| Device | Crash Report Says | EventListener Revealed |
|---|---|---|
| Razr 40 | "Timeout" | Cold DNS = 5070ms (the real culprit) |
| Pixel 9 | "Timeout" | DNS OK (97ms), TCP hung for 15s |
| Both | "UnknownHostException" | DNS resolver dies after doze |
This is the point: IOException tells you nothing. EventListener tells you exactly which phase failed and why.
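One way to get that phase information into a crash report is to wrap the original exception. The class below is a hypothetical sketch, not an OkHttp type; how the phase and elapsed time travel from the listener to the place where the exception is caught is left out.

```kotlin
import java.io.IOException

// Hypothetical wrapper: keeps the original cause and adds the phase and elapsed time.
class PhasedIOException(
    val phase: String,
    val elapsedMs: Long,
    cause: IOException
) : IOException("Failed during $phase after ${elapsedMs}ms: ${cause.message}", cause)
```

Using the Razr numbers from the table, wrapping an UnknownHostException for api.example.com with phase "dns" and 5070 ms yields the message "Failed during dns after 5070ms: api.example.com", which is searchable in a crash reporting tool.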
Run It Yourself
I built a test app to collect this data. Want to contribute your device's results?
Devices Tested:
├── Moto Razr 40 Ultra (Airtel 4G)
├── Pixel 9 Pro Fold (Jio 4G VoLTE)
├── Android Emulator (WiFi simulation)
└── [Your device here]
└── Run the app, share results!
Download the test app — run Baseline + Post-Doze, and DM me your results on Twitter. I'll add your device to this post.
The Network Journey (What EventListener Sees)
graph LR
A[App calls API] --> B[DNS Lookup]
B --> C[TCP Connect]
C --> D[TLS Handshake]
D --> E[Send Request]
E --> F[Wait TTFB]
F --> G[Receive Response]
style B fill:#ff6b6b,color:#fff
style C fill:#ffa502,color:#fff
style D fill:#2ed573,color:#fff
Red = DNS (where Razr spent 5 seconds). Orange = TCP (where Pixel timed out). Green = TLS.
Device Comparison: Same API, Different Story
graph TD
subgraph Razr["Moto Razr 40 Ultra"]
R1[Cold DNS: 5070ms] --> R2[TCP: 1300ms]
R2 --> R3[TLS: 960ms]
R3 --> R4[Total: ~7s first call]
end
subgraph Pixel["Pixel 9 Pro Fold"]
P1[Cold DNS: 141ms] --> P2[TCP: 1000ms]
P2 --> P3[TLS: 650ms]
P3 --> P4[Total: ~2s first call]
end
Post-Doze Failure Points
flowchart TD
A[Device exits Doze] --> B{Immediate request}
B -->|Razr| C[✅ Success]
B -->|Pixel| D[❌ TCP Timeout 15s]
A --> E{After 5 seconds}
E -->|Both devices| F[❌ DNS Fails]
A --> G{After 30 seconds}
G -->|Both devices| H[❌ DNS Still Fails]
F --> I[OS DNS resolver goes back to sleep]
H --> I
Key insight: The DNS resolver itself enters a sleep state after doze. Not a server issue — an OS issue.
What This Means For You
EventListener transforms network debugging from guesswork into science. Instead of:
"Users are reporting slow network. Maybe it's the backend?"
You get:
"Airtel users on Motorola devices have 5s cold DNS resolution. Jio users on Pixel hit a 15s TCP timeout immediately after doze, and both devices fail DNS 5s and 30s later. Backend is fine. It's carrier DNS and OS power management."
That's the difference between filing a ticket with your backend team and actually fixing the problem.
Your Action Items
- Add EventListener to your OkHttp client — 50 lines of code, infinite debugging value
- Log phase timings to your analytics — segment by carrier, device, network type
- Test doze recovery on YOUR devices — the behavior varies wildly
- Set callTimeout — it's the only timeout that reflects user experience
Try It Yourself
I open-sourced the test app: github.com/aldefy/okhttp-network-metrics
Run it on your devices. See what YOUR carrier's DNS looks like. Find out how YOUR devices recover from doze.
Because the network will always be unreliable — but now you can see exactly where and why.
The next time someone says "it's a network issue" you'll know which part of the network, on which carrier, on which device, under which conditions. That's not debugging — that's engineering.