Build a DNS Resolver in Python From Scratch

Recursive resolution, TTL caching, and the chain of responsibility pattern that powers every internet request you make

Apr 10, 2026

Every command you run hits DNS. Every curl, every kubectl, every terraform apply. Your browser made a DNS query to load this page. DNS is the most heavily used distributed system on the planet, and most engineers have never seen what happens inside a resolver.

Today we build one from scratch. A DNS resolver with TTL caching and the Chain of Responsibility pattern.

The 3-Question Framework

What does the system DO? It translates domain names into IP addresses by walking the DNS hierarchy. This is the Chain of Responsibility pattern. A request passes through a chain of handlers (cache, hosts file, recursive resolver) until one of them can answer it.

What operations must be FAST? Most DNS lookups should return instantly from cache. A dict with TTL gives us O(1) cache hits. Cache misses trigger network calls, which are slower but infrequent for popular domains.

What is the core LOGIC? Recursive resolution. Start at the root DNS servers. They point you to the TLD servers (.com, .io). The TLD servers point you to the authoritative servers. The authoritative server gives you the final IP address. Cache every answer with its TTL so you only walk the tree once.

Pattern:    Chain of Responsibility (cache -> resolver -> root -> TLD -> auth)
Structure:  dict with TTL (DNS cache)
Algorithm:  Recursive resolution with caching at every level

How DNS Resolution Actually Works

When you type crackthat.io in your browser, here is what happens:

This is a chain. Each handler either answers the question or passes it to the next handler in the chain. That is the Chain of Responsibility pattern.

The key insight: after the first lookup, Step 1 returns the answer instantly for the next 300 seconds. Without caching, every HTTP request would add 50 to 100ms walking the DNS tree.

Step 1: The DNS Cache

The foundation. A simple dict with TTL expiration. Same lazy expiration pattern from Issue 3.

import time

class DNSCache:

    def __init__(self):
        self._cache = {}

    def get(self, domain, record_type="A"):
        key = (domain, record_type)
        entry = self._cache.get(key)
        if entry is None:
            return None
        if time.time() > entry["expiry"]:
            del self._cache[key]
            return None
        return entry["value"]

    def put(self, domain, value, ttl, record_type="A"):
        key = (domain, record_type)
        self._cache[key] = {
            "value": value,
            "expiry": time.time() + ttl,
        }

    def size(self):
        return len(self._cache)

Interview phrasing: “The DNS cache is a hash map keyed by (domain, record_type) with lazy TTL expiration. Same pattern as Redis. O(1) lookups. Expired entries are cleaned up on access.”

Step 2: The Resolver

The resolver sends a query to a DNS server and returns the IP address. We use Python’s socket library to make the actual network call.

import socket

class DNSResolver:

    def __init__(self):
        self.cache = DNSCache()
        self._queries = 0

    def resolve(self, domain):
        cached = self.cache.get(domain)
        if cached:
            return cached

        ip = self._do_lookup(domain)
        if ip:
            self.cache.put(domain, ip, 300)
        return ip

    def _do_lookup(self, domain):
        self._queries += 1
        try:
            result = socket.getaddrinfo(domain, None, socket.AF_INET)
            if result:
                return result[0][4][0]
        except socket.gaierror:
            return None
        return None

    def stats(self):
        return {
            "cache_size": self.cache.size(),
            "queries_made": self._queries,
        }

On the first call, the resolver makes a real DNS query using the operating system’s resolver. The result is cached for 300 seconds. Every subsequent call for the same domain returns instantly from cache.

In production DNS servers like CoreDNS or Unbound, the resolver builds raw UDP packets and walks the tree from root servers to authoritative servers. Our implementation delegates to the OS for simplicity, but the caching and chain of responsibility architecture is identical.

How production recursive resolution works:

Ask root server for crackthat.io
Root says “Try the .io TLD servers”
Ask .io TLD server for crackthat.io
TLD says “The authoritative server is ns1.vercel-dns.com”
Ask authoritative server for crackthat.io
Authoritative says “76.76.21.21” with TTL=300
Cache the answer and return the IP

Step 3: The Chain of Responsibility

The Chain of Responsibility pattern makes the resolver extensible. Each handler tries to answer the query. If it cannot, it passes to the next handler.

from abc import ABC, abstractmethod

class DNSHandler(ABC):

    def __init__(self):
        self._next = None

    def set_next(self, handler):
        self._next = handler
        return handler

    def pass_to_next(self, domain):
        if self._next:
            return self._next.handle(domain)
        return None

    @abstractmethod
    def handle(self, domain):
        pass


class CacheHandler(DNSHandler):

    def __init__(self, cache):
        super().__init__()
        self._cache = cache

    def handle(self, domain):
        result = self._cache.get(domain)
        if result:
            return result
        return self.pass_to_next(domain)

class HostsFileHandler(DNSHandler):
    """Check local host mappings before going to network."""

    def __init__(self, hosts_path=None):
        super().__init__()
        import os
        if hosts_path is None:
            hosts_path = os.path.join(os.sep, "etc", "hosts")
        self._hosts = self._load_hosts(hosts_path)

    def _load_hosts(self, path):
        hosts = {}
        try:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line and not line.startswith("#"):
                        parts = line.split()
                        if len(parts) >= 2:
                            ip = parts[0]
                            for hostname in parts[1:]:
                                hosts[hostname] = ip
        except FileNotFoundError:
            pass
        return hosts

    def handle(self, domain):
        result = self._hosts.get(domain)
        if result:
            return result
        return self.pass_to_next(domain)


class RecursiveHandler(DNSHandler):

    def __init__(self, resolver):
        super().__init__()
        self._resolver = resolver

    def handle(self, domain):
        return self._resolver.resolve(domain)

Now we wire the chain together:

def build_resolver_chain():
    cache = DNSCache()
    resolver = DNSResolver()
    resolver.cache = cache

    cache_handler = CacheHandler(cache)
    hosts_handler = HostsFileHandler()
    recursive_handler = RecursiveHandler(resolver)

    cache_handler.set_next(hosts_handler).set_next(recursive_handler)
    return cache_handler


chain = build_resolver_chain()
ip = chain.handle("crackthat.io")
print(f"crackthat.io resolved to {ip}")

Why the chain matters:

Most queries hit the cache handler and return in microseconds. Only cache misses reach the hosts file handler. Only misses from both reach the recursive handler which makes network calls. This is the same layered lookup your operating system uses:

Application DNS cache (Chrome, Firefox)
OS resolver cache (the local hosts file, systemd-resolved)
Local recursive resolver (CoreDNS, dnsmasq)
Upstream recursive resolver (8.8.8.8, 1.1.1.1)
Root servers to TLD to Authoritative

Comparison: DNS Resolution Strategies

The DevOps Connection

Scaling to Millions: DNS at Production Scale

Our resolver handles one machine. Production DNS serves billions of queries per day across the globe. Here is how it scales.

Anycast Routing

Every root DNS server has one IP address, but that IP is announced from dozens of locations worldwide. When you query 198.41.0.4, your packets go to the nearest physical server. This is called anycast. Cloudflare uses the same technique for their 1.1.1.1 resolver. One IP, 300+ locations. Your query never crosses the ocean.

GeoDNS

A single domain can return different IP addresses based on where the client is. A user in Dubai gets routed to a server in the Middle East. A user in London gets routed to a server in Europe. Route53, Cloudflare, and Akamai all support this. It is how CDNs deliver content from the closest edge node.

Caching Hierarchy

Production DNS uses multiple layers of caching to absorb traffic before it reaches authoritative servers.

Browser cache (Chrome, Firefox)
    -> OS cache (systemd-resolved)
        -> Local resolver (CoreDNS in your K8s cluster)
            -> ISP resolver (your provider's DNS)
                -> Authoritative server (only on full miss)

Each layer absorbs 80 to 90% of queries. By the time a request reaches the authoritative server, it represents thousands of identical queries that were served from cache.

Negative Caching

Production resolvers also cache “domain not found” responses (NXDOMAIN). Without this, a typo like “gogle.com” would hit the authoritative server on every single request. Negative caching with a short TTL (30 to 60 seconds) prevents this.

DNS Failover and Health Checks

Route53 and Cloudflare support health checked DNS records. If a server fails its health check, the DNS record is automatically updated to point to a healthy server. The TTL on these records is kept low (30 to 60 seconds) so clients pick up the change quickly.

Normal:     api.example.com -> 10.0.0.1 (healthy)
Failover:   api.example.com -> 10.0.0.2 (backup)
Recovery:   api.example.com -> 10.0.0.1 (restored)

This is the foundation of active/passive failover. Your Terraform configs probably already use this pattern.

Rate Limiting DNS Servers

At scale, DNS servers must protect themselves from abuse. Techniques include response rate limiting (RRL) to cap the number of identical responses per second, and query rate limiting per source IP. Without these, a DNS amplification attack can take down your resolver.

Interview upgrade: When the interviewer asks about DNS, mention anycast for global distribution, caching hierarchy for traffic absorption, and health checked records for failover. This shows you think at production scale, not just single machine implementations.

Interview Walkthrough Script

When the interviewer asks “Design a DNS resolver”:

“I would use the Chain of Responsibility pattern. A DNS query passes through a chain of handlers: first the local cache for O(1) lookups, then the hosts file, then a recursive resolver. The recursive resolver walks the DNS tree starting at root servers, following referrals through TLD servers to the authoritative server. Every response is cached with its TTL. Most production traffic hits the cache and returns in microseconds. For reliability I would add retry logic, multiple root servers, and negative caching for NXDOMAIN responses.”

Challenge: Try It Yourself

CNAME resolution. Add support for CNAME records. When you get a CNAME instead of an A record, resolve the CNAME target recursively.
Negative caching. Cache NXDOMAIN (domain not found) responses too. This prevents repeated lookups for nonexistent domains.
Benchmark cache hit rates. Resolve 1000 random domains, then resolve them again. Measure the hit rate and time difference.

Next week: Build a Task Queue in Python (Priority Queue)

Previous: Build a Load Balancer | Build a Rate Limiter | The GIL Explained

Crack That Weekly

Discussion about this post

Ready for more?