Free API Performance Tool
Enter response times for multiple API endpoints to compare latency side by side, grade performance against industry benchmarks, and identify which endpoints need optimization. Works for REST, GraphQL, gRPC, WebSocket, and Chat/LLM APIs.
API latency is the total time elapsed from when a client sends a request to when it receives the complete response. It's often used interchangeably with "API response time" — though strictly, latency refers to network propagation delay and response time includes server processing on top of that. In practice, developers measure the full round-trip time and call it latency.
API latency is not a single number — it's the sum of several sequential phases. Understanding each phase is essential for knowing where to optimize when latency is too high.
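To see where your own requests spend that time, here's a minimal sketch using the browser's Resource Timing API to split one request into phases (the endpoint URL is a placeholder):

```typescript
// Sketch: break one request's latency into phases via Resource Timing.
// The endpoint URL is a placeholder.
async function phaseBreakdown(url: string): Promise<void> {
  await fetch(url);
  const [entry] = performance
    .getEntriesByName(url)
    .slice(-1) as PerformanceResourceTiming[];
  if (!entry) return; // cross-origin timings need a Timing-Allow-Origin header

  console.table({
    dns: entry.domainLookupEnd - entry.domainLookupStart,
    tcp: entry.connectEnd - entry.connectStart,        // includes TLS handshake
    tls: entry.secureConnectionStart > 0
      ? entry.connectEnd - entry.secureConnectionStart
      : 0,
    ttfb: entry.responseStart - entry.requestStart,    // server processing
    download: entry.responseEnd - entry.responseStart, // payload transfer
    total: entry.responseEnd - entry.startTime,
  });
}

phaseBreakdown("https://api.example.com/health");
```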
Latency thresholds vary significantly by API type. A 200ms gRPC response is slow (excellent gRPC latency is under 20ms), but 200ms for a complex database-backed REST endpoint is perfectly acceptable. Use these benchmarks to contextualize your numbers.
| API Type | Excellent | Good | Acceptable | Slow | Critical |
|---|---|---|---|---|---|
| REST (simple read) | < 50ms | 50–150ms | 150–500ms | 500ms–1s | > 1s |
| REST (DB-backed) | < 100ms | 100–300ms | 300ms–1s | 1s–3s | > 3s |
| GraphQL | < 100ms | 100–300ms | 300ms–1s | 1s–3s | > 3s |
| gRPC | < 20ms | 20–50ms | 50–200ms | 200–500ms | > 500ms |
| WebSocket (per message) | < 10ms | 10–50ms | 50–150ms | 150–500ms | > 500ms |
| Chat / LLM API (TTFT) | < 300ms | 300–800ms | 800ms–2s | 2s–5s | > 5s |
| Authentication API | < 50ms | 50–150ms | 150–400ms | 400ms–1s | > 1s |
The tool above applies these thresholds automatically based on the API type you select for each endpoint — so grading is always contextually accurate.
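For illustration, that grading amounts to a per-type threshold lookup. A minimal sketch of the idea, with boundaries mirroring the table above (the type keys and function are illustrative, not the tool's actual code):

```typescript
// Sketch: grade a latency measurement against type-specific thresholds.
// Boundaries mirror the benchmark table above; names are illustrative.
type Grade = "Excellent" | "Good" | "Acceptable" | "Slow" | "Critical";

// Upper bounds in ms for Excellent/Good/Acceptable/Slow; above the last is Critical.
const THRESHOLDS: Record<string, [number, number, number, number]> = {
  "rest-simple": [50, 150, 500, 1000],
  "rest-db":     [100, 300, 1000, 3000],
  graphql:       [100, 300, 1000, 3000],
  grpc:          [20, 50, 200, 500],
  websocket:     [10, 50, 150, 500],
  "llm-ttft":    [300, 800, 2000, 5000],
  auth:          [50, 150, 400, 1000],
};

function grade(apiType: string, latencyMs: number): Grade {
  const bounds = THRESHOLDS[apiType];
  if (!bounds) throw new Error(`unknown API type: ${apiType}`);
  const labels: Grade[] = ["Excellent", "Good", "Acceptable", "Slow"];
  const idx = bounds.findIndex((max) => latencyMs < max);
  return idx === -1 ? "Critical" : labels[idx];
}

console.log(grade("grpc", 200));    // "Slow"
console.log(grade("rest-db", 200)); // "Good"
```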
Large Language Model (LLM) APIs have fundamentally different latency profiles than traditional REST APIs. The key metric is Time to First Token (TTFT) — the time before the first character of the response starts streaming. For real-time chat applications, TTFT determines perceived responsiveness far more than total generation time.
A TTFT under 500ms feels fast to users even if total generation takes 8–10 seconds, because streaming creates the perception of immediate response. Total response time depends heavily on output length — longer responses take proportionally more time regardless of model.
| Model / Provider | Typical TTFT | TTFT Range | Streaming | Notes |
|---|---|---|---|---|
| GPT-4o (OpenAI) | 300–600ms | 150ms–2s | ✓ Yes | Varies significantly with load |
| GPT-4 Turbo | 500–1200ms | 300ms–3s | ✓ Yes | Larger model, higher latency |
| Claude 3.5 Sonnet | 250–500ms | 150ms–1.5s | ✓ Yes | Generally fast TTFT |
| Claude 3 Opus | 400–900ms | 200ms–2s | ✓ Yes | Highest quality, higher latency |
| Gemini 1.5 Pro | 400–800ms | 200ms–2s | ✓ Yes | Strong on long context |
| Gemini 1.5 Flash | 200–400ms | 100ms–1s | ✓ Yes | Optimized for speed |
| Mistral Large | 300–700ms | 150ms–1.5s | ✓ Yes | European hosting option |
| Llama 3 (self-hosted) | 50–500ms | Varies widely | ✓ Yes | Depends entirely on hardware |
Use the comparison tool above to benchmark your actual measured TTFT values against these industry averages. Enter each provider as a separate endpoint with the Chat/LLM type selected.
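If you need to capture those TTFT values yourself, here's a minimal sketch against a streaming chat endpoint, runnable in Node 18+. The URL, headers, and body shape are placeholders for your provider's actual API, and the first streamed chunk may be protocol framing rather than a real token, so treat it as an approximation:

```typescript
// Sketch: measure Time to First Token against a streaming chat endpoint.
// URL, headers, and request body are placeholders for your provider's API.
async function measureTTFT(url: string, body: unknown): Promise<number> {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.API_KEY}`,
    },
    body: JSON.stringify(body),
  });
  // The first chunk of the streamed body marks the first token's arrival.
  const reader = res.body!.getReader();
  await reader.read();
  const ttft = performance.now() - start;
  void reader.cancel(); // stop streaming; we only need the first chunk
  return ttft;
}

measureTTFT("https://api.example.com/v1/chat/completions", {
  model: "example-model",
  stream: true,
  messages: [{ role: "user", content: "Hello" }],
}).then((ms) => console.log(`TTFT: ${ms.toFixed(0)}ms`));
```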
Compare latency across different routes — identify which endpoints are slow and need database query optimization or caching.
Enter the same endpoint twice, once with its old latency and once with its new latency, to measure the exact impact of adding Redis caching, optimizing a query, or switching to a CDN.
Compare the same data operation across different API protocols with type-appropriate benchmarks for each — not a one-size-fits-all threshold.
Enter TTFT values from your actual API calls to OpenAI, Anthropic, Gemini, or Mistral and see which performs best for your use case.
The right optimization depends entirely on which phase of the request is slow. Throwing a CDN at a slow database query doesn't help — you need to match the solution to the bottleneck.
Slow TTFB (server processing) — the most common bottleneck
Add a caching layer for frequently requested data — Redis or Memcached for in-memory caching, CDN edge caching for public GET endpoints. Run EXPLAIN ANALYZE on slow database queries to find missing indexes. Replace N+1 query patterns (separate query per item in a list) with batch queries or JOINs. Use database connection pooling (PgBouncer, HikariCP) to avoid connection setup overhead on every request.
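As a concrete illustration of the caching advice, a minimal cache-aside sketch using ioredis (the query, key scheme, and TTL are hypothetical):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a local Redis instance

// Sketch: cache-aside for a hot read path. The query function and
// key scheme are hypothetical; adapt to your data layer.
async function getProduct(id: string): Promise<unknown> {
  const key = `product:${id}`;
  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached); // cache hit: no DB round-trip

  const product = await fetchProductFromDb(id);   // cache miss: hit the database
  await redis.set(key, JSON.stringify(product), "EX", 60); // 60s TTL
  return product;
}

async function fetchProductFromDb(id: string): Promise<unknown> {
  // Placeholder for the real query (ideally batched, not N+1).
  return { id, name: "example" };
}
```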
High DNS lookup time
Switch to a faster DNS provider (Cloudflare 1.1.1.1 or Google 8.8.8.8). Enable DNS caching in your clients and CDN. For microservices, use service mesh DNS with in-cluster resolution (CoreDNS in Kubernetes) instead of external DNS for internal service calls.
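On the client side, in-process DNS caching in Node.js looks roughly like this, assuming the cacheable-lookup package's agent-install API:

```typescript
import https from "node:https";
import CacheableLookup from "cacheable-lookup";

// Sketch: cache DNS lookups in-process so repeated calls to the same
// host skip resolution entirely (assumes the cacheable-lookup package).
const cacheable = new CacheableLookup({
  maxTtl: 60, // cap cached entries at 60 seconds
});
cacheable.install(https.globalAgent); // all https requests now use the cache
```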
High geographic latency (network RTT)
Deploy to multiple regions and route users to the nearest one. Use a CDN (Cloudflare, Fastly, AWS CloudFront) for static content and cacheable API responses. For dynamic APIs, edge functions (Cloudflare Workers, Vercel Edge) move compute physically closer to users — eliminating the round-trip to a central data center.
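As one concrete shape this takes, a minimal Cloudflare Worker that serves cacheable GET responses from the edge (a sketch; the TTL and cacheability rules are assumptions):

```typescript
// Sketch: a Cloudflare Worker that serves cacheable GET responses from the
// nearest edge node. TTL and cacheability rules here are assumptions.
// Types come from @cloudflare/workers-types.
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    if (request.method !== "GET") return fetch(request); // pass writes to origin

    const cache = caches.default;
    const hit = await cache.match(request);
    if (hit) return hit; // edge hit: no round-trip to the origin

    const origin = await fetch(request); // forward to origin
    const response = new Response(origin.body, origin);
    response.headers.set("Cache-Control", "public, max-age=60"); // 60s edge TTL
    ctx.waitUntil(cache.put(request, response.clone())); // store without blocking
    return response;
  },
};
```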
Large response payload size
Enable gzip or brotli compression on all API responses (typically 70–80% size reduction for JSON). Implement pagination — never return unbounded lists. Use sparse fieldsets in GraphQL (request only needed fields) or implement field filtering in REST. Consider Protocol Buffers (gRPC) for binary serialization instead of JSON when latency is critical.
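A minimal sketch of the first two fixes in an Express app, assuming the compression middleware package (the route, page-size cap, and data source are illustrative):

```typescript
import express from "express";
import compression from "compression";

const app = express();
app.use(compression()); // compresses responses, negotiated via Accept-Encoding

// Sketch: a paginated list endpoint so responses stay bounded.
// Route, page-size cap, and data source are illustrative.
app.get("/items", async (req, res) => {
  const limit = Math.min(Number(req.query.limit) || 20, 100); // hard cap at 100
  const offset = Number(req.query.offset) || 0;
  const items = await listItems(limit, offset); // placeholder query
  res.json({ items, limit, offset });
});

async function listItems(limit: number, offset: number): Promise<unknown[]> {
  return []; // stand-in for a LIMIT/OFFSET database query
}

app.listen(3000);
```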
Cold start latency in serverless functions
Serverless cold starts (AWS Lambda, Google Cloud Functions) add 100ms–3s of latency on the first request after idle. Mitigation options: provisioned concurrency (keeps function warm, costs more), scheduled keep-alive pings, moving to container-based deployment for latency-critical endpoints, or using edge runtimes (Deno Deploy, Cloudflare Workers) which have near-zero cold starts.
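The keep-alive pattern looks roughly like this in a Lambda handler; the warmer payload shape is an assumption and must match whatever your scheduled rule actually sends:

```typescript
// Sketch: short-circuit scheduled keep-alive pings in an AWS Lambda
// handler. The { warmer: true } payload shape is an assumption; it
// must match the event your scheduled rule sends.
export const handler = async (event: Record<string, unknown>) => {
  if (event.warmer === true) {
    return { statusCode: 200, body: "warm" }; // stay warm, do no real work
  }

  // ...real request handling goes here...
  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};
```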
Common questions from developers measuring and optimizing API performance — the kind that come up on Reddit, Stack Overflow, and in API-focused communities.
How do I measure API latency accurately from client code?
Use performance.now() in browsers for sub-millisecond accuracy — Date.now() has lower resolution. In Node.js, use process.hrtime.bigint() or the global performance.now(). Measure multiple samples and look at p95/p99 rather than single measurements, as network jitter creates significant variance. For production monitoring, use OpenTelemetry or your APM (Datadog, New Relic, Sentry) to capture distributed traces that break down latency by service component.
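A minimal sketch of sampling an endpoint and reporting those percentiles (the URL and sample count are placeholders):

```typescript
// Sketch: sample an endpoint N times and report p50/p95/p99.
// The URL and sample count are placeholders.
function percentile(sorted: number[], p: number): number {
  // Nearest-rank method on a pre-sorted array.
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

async function benchmark(url: string, samples = 50): Promise<void> {
  const times: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    await fetch(url);
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  console.log({
    p50: percentile(times, 50).toFixed(1),
    p95: percentile(times, 95).toFixed(1),
    p99: percentile(times, 99).toFixed(1),
  });
}

benchmark("https://api.example.com/health");
```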
Why does my API latency vary so much between requests?
Latency variance (jitter) has several sources: database query plan changes (cold query caches), garbage collection pauses in JVM or Node.js, connection pool exhaustion forcing new connections, CPU throttling on shared cloud infrastructure, and CDN cache misses vs hits. High p99 relative to p50 (median) usually means one of these intermittent factors. Measure p50, p95, and p99 separately to diagnose — a stable p50 with high p99 points to specific slow outlier requests.
Is 200ms API latency acceptable for a production app?
It depends entirely on the API type and what the user is doing. 200ms for a gRPC internal microservice call is slow. 200ms for a complex database-backed REST endpoint is good. 200ms for a Chat API TTFT is excellent. The key rule: user-triggered actions that visibly block the UI should complete under 300ms total (including client rendering). Background data fetches can tolerate 1000ms+. Use the latency grading in this tool, which applies type-appropriate thresholds rather than a single cutoff.
What is the difference between API latency and throughput?
Latency is the time for a single request to complete (milliseconds per request). Throughput is the number of requests a system handles per unit of time (requests per second). They're related but independent — an API can have low latency but low throughput (fast per-request but doesn't scale), or high throughput but high latency (scales to many users but each waits longer). For user-facing APIs, optimize latency first. For batch processing APIs, optimize throughput. Most production systems need both: low latency at high concurrency.
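The relationship between the two is captured by Little's Law: concurrent in-flight requests = throughput × latency. A quick worked example with illustrative numbers:

```typescript
// Little's Law: concurrency = throughput (req/s) * latency (s).
// Example numbers are illustrative.
const latencySeconds = 0.05;   // 50ms per request
const targetThroughput = 1000; // 1000 req/s
const requiredConcurrency = targetThroughput * latencySeconds;
console.log(requiredConcurrency); // 50 concurrent in-flight requests
```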
How does Cloudflare or a CDN reduce API latency?
A CDN reduces latency in two ways. For cacheable GET requests, it stores the response at edge nodes globally and serves it from the nearest one — eliminating the round-trip to your origin server. For dynamic (non-cacheable) requests, the CDN still reduces latency by terminating TLS closer to the user (eliminating TLS handshake RTT) and routing the request through an optimized private network to your origin instead of the public internet.