
Asynchronous Non-blocking I/O under the hood: poll, select, epoll/kqueue


Why asynchronous non-blocking I/O??

Check out The C10K problem for the details.

The blocking I/O model with 1 thread/1 request has two main issues:

  1. Memory Overhead: Each thread requires a significant amount of memory:

    ‱ Stack memory: typically around 1MB/thread (many runtimes default to 1MB; the raw pthread default on Linux is 8MB)
    • Kernel memory
  2. CPU Overhead due to context switching:

    • Each switch between threads takes about 1-100 microseconds
    • Requires saving/loading register states
    • Invalidates CPU cache, TLB entries
    • Kernel scheduler overhead

With 10,000 concurrent connections at roughly 1MB of stack each, the overhead can reach about 10GB of RAM, plus thousands of context switches per second.

File Descriptor (FD): “In Linux, everything is a file”

Before we continue, a quick introduction to FDs in Linux:

A File Descriptor is an integer used to refer to resources in the Linux kernel.

When a process creates a new file or opens a socket connection:

  1. The kernel creates an entry in the Global File Table.
  2. It creates an entry in the process’s FD table pointing to the Global File Table entry.
  3. It returns the FD number to the process (the sketch below shows this in action).
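
For illustration, a minimal sketch (hypothetical path, no error handling) showing that each open()/socket() call simply hands back the next free small integer:

#include <stdio.h>
#include <fcntl.h>
#include <sys/socket.h>

int main(void) {
    // 0, 1, 2 are already taken by stdin, stdout, and stderr
    int file_fd = open("/tmp/example.txt", O_CREAT | O_RDWR, 0644);  // typically 3
    int sock_fd = socket(AF_INET, SOCK_STREAM, 0);                   // typically 4

    printf("file fd = %d, socket fd = %d\n", file_fd, sock_fd);
    return 0;
}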

The following parts of this article are based on the book The Linux Programming Interface - Chapter 63: Alternative I/O models.

How Did Linux Blocking I/O System Calls Evolve?

Back in the day, I/O in Linux could only be a blocking call, like this:

// Read file from an FD
ssize_t read(int fd, void *buf, size_t count);

// Write file to an FD
ssize_t write(int fd, const void *buf, size_t count);

What’s a blocking call? It means that when you request a read or write (on a socket, pipe, disk file, 
), the calling process (thread) is put to sleep inside the read/write call until the I/O operation completes. If you want to do something else in the meantime (e.g., handle another request), you have to fork another process or spawn another thread, which leads straight to the C10K problem.
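
For example (a sketch; conn_fd is assumed to be an already-connected socket):

#include <unistd.h>

void handle_one_request(int conn_fd) {
    char buf[1024];

    // The thread sleeps inside this call until the peer sends data
    // (or closes the connection, or an error occurs)
    ssize_t n = read(conn_fd, buf, sizeof(buf));

    if (n > 0) {
        // Only now can the thread do anything else,
        // e.g. parse the request and write() a response
    }
}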

Servers need to watch a lot of file descriptors

On a server, each time you accept a new client connection with the accept system call, the kernel creates an FD that represents that connection to the process.

For most web servers, it’s common to handle thousands of connections to clients simultaneously. At this point, the server needs to know when clients send new data on these connections so that the server can process and respond to the client. This could be done with a for loop like the one below:

# The server has sent queries on 1000 database connections and is waiting for the responses to come back
for x in database_query_connections:
    if has_new_response_data(x):
        process_response(x)

The problem with this approach is that it can waste a lot of CPU time. Instead of burning all that CPU asking “is there a new update?”, we can simply tell the Linux kernel: “hey, here are 1,000 FDs, let me know when one of them has new data!”.

Why do we need select & poll?

Before select/poll, to handle multiple connections simultaneously, the server had 2 options:

  1. Blocking I/O with multiple threads:

    • One thread per connection
    • Consumes a lot of resources (memory, CPU context switching)
    • Doesn’t scale well with a large number of connections
  2. Non-blocking I/O with busy waiting:

    while (1) {
        for (int i = 0; i < n_conns; i++) {
            // Try read, immediately return if no data
            int n = read(conns[i], buf, sizeof(buf));
            if (n > 0) {
                // Process data
            }
        }
    }
    • Wastes CPU cycles
    • Inefficient when the number of connections is large

Select and poll were created to solve these problems:

  ‱ They allow monitoring multiple FDs simultaneously
  ‱ They block until at least one FD is ready
  ‱ They don’t waste CPU cycles on busy waiting

Start with poll & select

These 2 syscalls are available on all Unix versions, while epoll and kqueue depend on the operating system (Linux kernel uses epoll, BSD-based systems like macOS use kqueue, and they are quite similar).

Explanation of poll and select

select and poll are two basic system calls for monitoring multiple file descriptors. Both allow a process to check if an I/O operation can be performed on a set of FDs without blocking.

Select

select is an older syscall, introduced in BSD 4.2:

  • Uses a bitmask to track FDs (fd_set)
  • Limits the number of FDs that can be monitored (usually 1024)
  ‱ Must re-initialize the fd_set before each call (select modifies it in place)
  • Can monitor 3 types of events: read, write, and exceptions

Poll

poll was introduced later to overcome some of the limitations of select:

  • Uses an array of struct pollfd instead of a bitmask
  • No hard limit on the number of FDs
  ‱ No need to rebuild the array between calls (the kernel only fills in revents)
  • Provides more event types (POLLRDHUP, POLLPRI, etc.)

Key differences between poll and select

  1. Data structure:

    • select: Uses 3 separate fd_sets for read, write, and exceptions
    ‱ poll: Uses an array of struct pollfd with event flags (see the snippet after this list)
  2. FD Limit:

    • select: Limited by FD_SETSIZE (usually 1024)
    • poll: No hard limit on the number of FDs
  3. Performance:

    ‱ select: The fd_sets are modified in place, so they must be rebuilt before every call
    ‱ poll: The array is reused across calls; the kernel only overwrites revents
  4. Timeout handling:

    • select: Uses struct timeval (microsecond precision)
    • poll: Uses milliseconds
  5. Portability:

    • select: Available on most systems
    • poll: Not available on some older systems
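
To make difference #1 concrete, here is a small sketch of the two data structures side by side (sock_fd is an assumed, already-open socket):

#include <poll.h>
#include <sys/select.h>

void register_interest(int sock_fd) {
    // select: fd_set bitmasks (one per event type), rebuilt before every call
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock_fd, &readfds);

    // poll: an array of struct pollfd, set up once and reused across calls
    struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };
}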

How to use poll and select

Below are the signatures of the two syscalls (shown in their pselect/ppoll variants, which take a struct timespec timeout and an optional signal mask):

// Select syscall (pselect variant)
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);

// Poll syscall (ppoll variant)
int ppoll(struct pollfd *fds, nfds_t nfds,
          const struct timespec *tmo_p, const sigset_t *sigmask);

Important parameters:

  ‱ timeout:
    ‱ -1 (poll) or NULL (select): block indefinitely until an event occurs
    ‱ 0: non-blocking, return immediately
    ‱ > 0: block for at most the specified duration

Example of using select:

fd_set readfds;
struct timeval tv;
int retval;

FD_ZERO(&readfds);
FD_SET(sock_fd, &readfds);
tv.tv_sec = 5;  // 5 seconds timeout
tv.tv_usec = 0;

retval = select(sock_fd + 1, &readfds, NULL, NULL, &tv);
if (retval == -1)
    perror("select()");
else if (retval)
    printf("Data is available now.\n");
else
    printf("No data within five seconds.\n");

Example of using poll:

struct pollfd fds[1];
int timeout_msecs = 5000;    // 5 seconds timeout
int retval;

fds[0].fd = sock_fd;
fds[0].events = POLLIN;

retval = poll(fds, 1, timeout_msecs);
if (retval == -1)
    perror("poll()");
else if (retval)
    printf("Data is available now.\n");
else
    printf("No data within five seconds.\n");

Comparing poll/select with traditional read

To clearly see the advantages of poll/select, let’s compare the ways to handle multiple connections:

  1. Traditional Read (blocking):
// Must block for each connection
for (int i = 0; i < n_conns; i++) {
    char buf[1024];
    int n = read(conns[i], buf, sizeof(buf));  // Block until data is available
    if (n > 0) {
        process_data(buf, n);
    }
}
// Problem: If the first connection has no data,
// it will block the entire process, unable to handle other connections
  2. Non-blocking Read with busy-waiting:
// Set all sockets to non-blocking mode beforehand
// fcntl(conns[i], F_SETFL, O_NONBLOCK);

// Must continuously check each connection
while (1) {
    for (int i = 0; i < n_conns; i++) {
        char buf[1024];
        int n = read(conns[i], buf, sizeof(buf));
        if (n > 0) {
            process_data(buf, n);
        } else if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                // No data available, continue with the next socket
                continue;
            } else {
                // Handle error
                handle_error();
            }
        }
    }
    // Problem: Wastes CPU due to continuously checking all sockets
    // even though most have no data
}
  3. Using select:
fd_set readfds;
int max_fd = -1;

while (1) {
    // Set up the FDs to be monitored
    FD_ZERO(&readfds);
    for (int i = 0; i < n_conns; i++) {
        FD_SET(conns[i], &readfds);
        if (conns[i] > max_fd) {
            max_fd = conns[i];
        }
    }
    
    // Block until any FD is ready
    int activity = select(max_fd + 1, &readfds, NULL, NULL, NULL);
    
    if (activity < 0) {
        perror("select error");
        break;
    }
    
    // Check which FD is ready
    for (int i = 0; i < n_conns; i++) {
        if (FD_ISSET(conns[i], &readfds)) {
            char buf[1024];
            int n = read(conns[i], buf, sizeof(buf));  // Will not block
            if (n > 0) {
                process_data(buf, n);
            } else if (n == 0) {
                // Connection closed
                close(conns[i]);
                // Handle connection closure
            } else {
                // Handle error
                handle_error();
            }
        }
    }
}
  4. Using poll:
struct pollfd fds[MAX_CONNS];

// Set up array of pollfd structures
for (int i = 0; i < n_conns; i++) {
    fds[i].fd = conns[i];
    fds[i].events = POLLIN;  // Interested in readable events
}

while (1) {
    // Block until any FD is ready
    int activity = poll(fds, n_conns, -1);  // -1 = block indefinitely
    
    if (activity < 0) {
        perror("poll error");
        break;
    }
    
    // Check which FD is ready
    for (int i = 0; i < n_conns; i++) {
        if (fds[i].revents & POLLIN) {
            char buf[1024];
            int n = read(fds[i].fd, buf, sizeof(buf));
            if (n > 0) {
                process_data(buf, n);
            } else if (n == 0) {
                // Connection closed
                close(fds[i].fd);
                fds[i].fd = -1;  // Mark fd as closed
            } else {
                // Handle error
                handle_error();
            }
        }
    }
}

Advantages of poll/select:

  1. Doesn’t block unnecessarily - only blocks when no FD is ready
  2. Doesn’t waste CPU on busy waiting - only consumes CPU when processing actual data
  3. Can handle multiple connections with one thread
  4. Flexible timeout control (can set a timeout for poll/select)
  5. Portable across Unix systems

Disadvantages:

  1. Must copy FD sets between user space and kernel space each time it is called
  2. Must scan the entire FD set each time it is called, complexity O(n)
  3. Limit on the number of FDs (FD_SETSIZE in select, usually 1024)
  4. The return value is only a count of ready FDs; it doesn’t tell you which ones are ready

What’s epoll?

To address the above disadvantages, epoll was created, especially for handling a large number of concurrent connections.

epoll is a group of syscalls (epoll_create, epoll_ctl, epoll_wait) that lets you hand the Linux kernel a list of FDs to monitor and be notified about. Unlike poll/select, epoll can operate in two modes: level-triggered (the default) and edge-triggered.

Here are the steps to use epoll:

  1. Call epoll_create to tell the kernel you’re about to use epoll; it returns an epoll FD.
  2. Call epoll_ctl to tell the kernel which FDs you want to receive updates for when they change; you can register many different types of FDs (pipes, FIFOs, sockets, POSIX message queues, inotify instances, devices, 
).
  3. Call epoll_wait to wait for updates on the FDs in the list you are monitoring (see the sketch after this list).
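
A minimal sketch of the three calls working together (assuming listen_fd is an already-listening socket; error handling omitted):

#include <sys/epoll.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    // 1. Create an epoll instance; the kernel returns an epoll FD
    int epfd = epoll_create1(0);

    // 2. Register the FD we care about and the events we want (readable)
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    // 3. Wait: the kernel wakes us up with only the FDs that have events
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            // events[i].data.fd is ready: accept() or read() it here
        }
    }
}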

Improvements of epoll:

  1. Better performance

    ‱ epoll: Uses a callback mechanism instead of scanning - the kernel only returns the FDs with events, complexity O(1) instead of O(n)
    ‱ epoll: Doesn’t have to copy the FD list between calls
  2. No hard limit on the number of FDs

    ‱ Can monitor tens of thousands of connections without encountering quantity limits
  3. Stores state in the kernel

    ‱ Uses 3 different system calls to create, manage, and wait for events (epoll_create, epoll_ctl, epoll_wait)
    ‱ No need to reset the FD list after each call
  4. Supports both notification modes

    ‱ Level-triggered and Edge-triggered (what are they?)

Level-triggered vs Edge-triggered

These are 2 different modes for notifying I/O events:

  1. Level-triggered (LT):

    ‱ Continuously notifies when the FD is in a ready state
    ‱ Example: When a socket has data, it will continuously notify until all data is read
    ‱ Supported by poll/select/epoll
    ‱ Safer because it doesn’t miss events
  2. Edge-triggered (ET):

    ‱ Only notifies once when the FD transitions from not-ready to ready
    ‱ Example: Only notifies once when new data arrives on the socket
    ‱ Only supported by epoll
    ‱ Better performance but requires careful handling of code to avoid missing events (data)

Why do we need 2 different modes?

These 2 modes are designed to meet different use cases in I/O processing:

  1. Level-triggered came first and is suitable for:

    • Simple applications that need high reliability
    • Processing data in small chunks, without needing to read everything at once
    • Developers only need to read data when needed, without worrying about reading the entire buffer
    • Legacy frameworks that have been designed with this model
  2. Edge-triggered was created later to solve:

    • Performance issues when the number of connections is very large
    • Reducing the number of unnecessary system calls
    • Allowing easier implementation of zero-copy I/O (transferring data directly from the disk/network buffer to the socket buffer without copying through user space)
    • Suitable for modern async I/O applications

When should you use which mode?

Use Level-triggered when:

  • The application needs to be simple and easy to maintain
  • Reliability is more important than performance
  • Processing data in small parts
  • Don’t want to implement complex logic to read the entire buffer
  • Using legacy frameworks/libraries

Use Edge-triggered when:

  • Need to optimize performance for a large number of connections
  • Can implement logic to read the entire buffer accurately
  • Want to minimize the number of system calls
  • When you want to implement zero-copy I/O

Example illustrating the difference:

LT mode with 100 bytes in the socket buffer:
- Notify "readable"
- Read 50 bytes
- Still notify "readable" (because there are still 50 bytes)
- Read the remaining 50 bytes
- No longer notify

ET mode with 100 bytes in the socket buffer:
- Notify "readable" (first time data arrives)
- Read 50 bytes
- No longer notify
- Must read all data in one go
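
The level-triggered half of this can be demonstrated with a small self-contained program (a sketch that uses a pipe instead of a socket; error handling omitted):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void) {
    int pipefd[2];
    pipe(pipefd);                              // pipefd[0] = read end, pipefd[1] = write end

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    char data[100];
    memset(data, 'x', sizeof(data));
    write(pipefd[1], data, sizeof(data));      // 100 bytes now sit in the pipe buffer

    struct epoll_event out;
    char buf[50];

    epoll_wait(epfd, &out, 1, 0);              // reports readable (100 bytes pending)
    read(pipefd[0], buf, sizeof(buf));         // consume 50 bytes

    // Level-triggered: still reported readable, because 50 bytes remain
    int n = epoll_wait(epfd, &out, 1, 0);
    printf("after reading 50 bytes: %d event(s)\n", n);    // prints 1

    read(pipefd[0], buf, sizeof(buf));         // consume the remaining 50 bytes
    n = epoll_wait(epfd, &out, 1, 0);
    printf("after reading all bytes: %d event(s)\n", n);   // prints 0

    return 0;
}

If the read end were registered with EPOLLIN | EPOLLET instead, the second epoll_wait would report 0 events even though 50 bytes are still unread, which is exactly why edge-triggered code must drain the buffer in one go.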

Detailed explanation of Level-triggered notification

Level-triggered (LT) mode operates based on the state of the file descriptor:

  • When the buffer still has data to read (regardless of how much), the FD will be considered “readable”
  • Notification will continue until ALL data in the buffer has been read
  • This is like a “level” - if the level > 0 (there is still data), it will continue to notify

In the example with 100 bytes above:

  • Initially there are 100 bytes in the buffer → notify “readable”
  • After reading 50 bytes:
    • The buffer still has 50 bytes unread
    • Therefore the FD is still in the “readable” state
    • Level-triggered will continue to notify because of this state
  • Only when all 100 bytes have been read, and the buffer is empty, will the notification stop

This is one of the strengths of Level-triggered mode:

  • Safer because it doesn’t miss data
  • No need to read all data in one go
  • Suitable for applications that process data in small chunks

Compared to Edge-triggered mode:

  • ET only notifies when there is a state change (edge)
  • Only notifies once when new data arrives
  • Forced to read all data in one go, otherwise data may be missed

This is why many frameworks such as Java NIO (pre NIO.2) choose Level-triggered as the default - it is safer and easier to use, although it may not be as efficient as Edge-triggered in some cases.

Edge-triggered epoll

To use epoll in edge-triggered mode, we need to:

  1. Set the EPOLLET flag when registering the FD with epoll
  2. Set the FD to non-blocking mode
  3. Read all data in the buffer each time a notification is received

Real-world examples

Web Server handles requests with epoll:

// Initialize epoll
int epfd = epoll_create1(0);
struct epoll_event ev, events[MAX_EVENTS];

// Level-triggered mode (default)
ev.events = EPOLLIN;
ev.data.fd = server_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);

while (1) {
    int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nfds; i++) {
        if (events[i].data.fd == server_fd) {
            // Accept new connection
            int client_fd = accept(server_fd, NULL, NULL);
            // Add to epoll
            ev.events = EPOLLIN;
            ev.data.fd = client_fd;
            epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
        } else {
            // Level-triggered: the kernel keeps reporting this FD as readable
            // while unread data remains, so reading once per wakeup is enough
            char buf[1024];
            int n = read(events[i].data.fd, buf, sizeof(buf));
            if (n > 0) {
                process_data(buf, n);
            } else if (n == 0) {
                close(events[i].data.fd);  // Connection closed by the peer
            }
        }
    }
}

// Edge-triggered mode
ev.events = EPOLLIN | EPOLLET;  // Add EPOLLET flag
ev.data.fd = server_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);

while (1) {
    int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nfds; i++) {
        if (events[i].data.fd == server_fd) {
            // Accept new connection
            int client_fd = accept(server_fd, NULL, NULL);
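            // Note: with EPOLLET on the listening socket, a production server would also
            // set server_fd to non-blocking and accept() in a loop until EAGAIN, because
            // the "new connection" event is only reported once per batch of arrivals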
            // Set non-blocking mode
            fcntl(client_fd, F_SETFL, fcntl(client_fd, F_GETFL, 0) | O_NONBLOCK);
            // Add to epoll
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client_fd;
            epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
        } else {
            // Edge-triggered: Only notify once when new data arrives
            char buf[1024];
            int n;
            while (1) {
                n = read(events[i].data.fd, buf, sizeof(buf));
                if (n == 0) {
                    // Connection closed by the peer
                    close(events[i].data.fd);
                    break;
                }
                if (n < 0) {
                    if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        // Buffer drained: all available data has been read
                        break;
                    }
                    // Other error: handle/close the connection
                    break;
                }
                process_data(buf, n);
            }
        }
    }
}

Zero Copy and Epoll

Zero copy is an I/O optimization technique in which data is transferred directly between kernel buffers (for example, from the page cache straight to the socket buffer) without being copied through user space. When combined with epoll (especially in edge-triggered mode), zero copy can significantly reduce the overhead of I/O operations.

Some popular runtimes/frameworks that use zero copy with epoll:

  1. Netty Framework (Java):

    • Uses FileRegion and native transport (epoll) to implement zero copy
    • Often used in web servers to transfer large files
  2. Nginx:

    ‱ Uses the sendfile() syscall combined with epoll (see the sketch after this list)
    • Very effective when serving static files
    • Reduces CPU usage and significantly increases throughput
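
As a rough illustration (a sketch, not Nginx’s actual code; file_fd and client_fd are assumed to be an open file and a connected socket), sendfile() asks the kernel to move bytes from the page cache to the socket without any user-space buffer:

#include <sys/sendfile.h>
#include <sys/stat.h>

// Send an entire open file to a connected socket; the data never
// passes through a user-space buffer (partial sends and errors
// are not handled in this sketch)
void send_static_file(int client_fd, int file_fd) {
    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    sendfile(client_fd, file_fd, &offset, st.st_size);
}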

Who uses epoll?

Most modern frameworks and runtimes use epoll (on Linux) or kqueue (on BSD/MacOS) to implement event loops and async I/O:

1. Node.js (libuv)

const server = require('http').createServer();

server.on('connection', (socket) => {
    console.log('New connection');
    socket.on('data', (data) => {
        console.log('Received:', data.toString());
    });
});

server.listen(8080);

Under the hood, libuv uses epoll to monitor socket events.

2. Go Runtime

package main

import "net"

func handleConnection(conn net.Conn) {
    defer conn.Close()
    buf := make([]byte, 1024)
    for {
        // A blocking Read parks the goroutine until the netpoller reports data
        if _, err := conn.Read(buf); err != nil {
            return
        }
    }
}

func main() {
    ln, err := net.Listen("tcp", ":8080")
    if err != nil {
        panic(err)
    }

    for {
        conn, err := ln.Accept()
        if err != nil {
            continue
        }
        go handleConnection(conn)  // Creates a new goroutine
    }
}

The Go runtime uses its netpoller (based on epoll on Linux) to implement non-blocking network I/O. When a goroutine performs network I/O:

  1. The runtime registers the FD with epoll
  2. The goroutine is parked (suspended)
  3. When an event occurs, the scheduler will wake up the goroutine

3. Nginx Event Loop

events {
    use epoll;  # Explicitly use epoll
    worker_connections 1024;
}

Nginx uses epoll to handle thousands of concurrent connections on each worker process.

4. Modern Java NIO

// NIO (Level-triggered) Example
ServerSocketChannel serverSocket = ServerSocketChannel.open();
serverSocket.bind(new InetSocketAddress(8080));
serverSocket.configureBlocking(false);

Selector selector = Selector.open();
serverSocket.register(selector, SelectionKey.OP_ACCEPT);

while (true) {
    selector.select();
    Set<SelectionKey> selectedKeys = selector.selectedKeys();
    Iterator<SelectionKey> iter = selectedKeys.iterator();
    
    while (iter.hasNext()) {
        SelectionKey key = iter.next();
        iter.remove();
        
        if (key.isAcceptable()) {
            // Accept new connection
            SocketChannel client = serverSocket.accept();
            client.configureBlocking(false);
            client.register(selector, SelectionKey.OP_READ);
        }
        
        if (key.isReadable()) {
            // Level-triggered: Will be notified as long as there's data
            SocketChannel client = (SocketChannel) key.channel();
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            int bytesRead = client.read(buffer);
            if (bytesRead > 0) {
                buffer.flip();
                // Process data...
            }
        }
    }
}

// NIO.2 (Edge-triggered) Example with AsynchronousSocketChannel
public class AsyncServer {
    private final AsynchronousServerSocketChannel server;
    
    public AsyncServer() throws IOException {
        server = AsynchronousServerSocketChannel.open()
                .bind(new InetSocketAddress(8080));
    }
    
    public void start() {
        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override
            public void completed(AsynchronousSocketChannel client, Void attachment) {
                // Accept next connection
                server.accept(null, this);
                
                // Edge-triggered: Only notified once when data arrives
                ByteBuffer buffer = ByteBuffer.allocate(1024);
                client.read(buffer, buffer, new CompletionHandler<Integer, ByteBuffer>() {
                    @Override
                    public void completed(Integer result, ByteBuffer buffer) {
                        if (result > 0) {
                            buffer.flip();
                            // Must read ALL available data since we only get notified once
                            while (buffer.hasRemaining()) {
                                buffer.get();  // Process data... (consume the buffer)
                            }
                            // Setup next read
                            buffer.clear();
                            client.read(buffer, buffer, this);
                        }
                    }
                    
                    @Override
                    public void failed(Throwable exc, ByteBuffer attachment) {
                        try {
                            client.close();
                        } catch (IOException e) {
                            // Handle error
                        }
                    }
                });
            }
            
            @Override
            public void failed(Throwable exc, Void attachment) {
                // Handle error
            }
        });
    }
}

Java NIO (pre NIO.2) uses level-triggered notification through the Selector API. This means that the Selector will continuously notify until all data has been read from the buffer.

NIO.2 (from Java 7) introduces AsynchronousSocketChannel with an edge-triggered model through CompletionHandler callbacks. When new data arrives, the callback is only called once, so the code must ensure that all available data is read and the callback is reset for the next read (the code will be more complex and error-prone).

The Java NIO Selector on Linux has been backed by epoll since JDK 1.5.

Netty (Netty Native Transport)

Hmm, in reality most async projects in the Java ecosystem don’t use the Java NIO API directly; they use the Netty framework instead.

Netty underpins most popular async frameworks, such as Play Framework, Spring WebFlux, and Akka. But instead of using the Java NIO APIs (stdlib), Netty’s native transport talks to epoll/kqueue directly through JNI (Java Native Interface, i.e., calling OS syscalls directly). The reasons are:

  1. Better Performance:

    • Reduces the number of context switches between user space and kernel space
    • Avoids overhead from the Java NIO selector implementation (meaning the old Java NIO version)
    • Zero-copy transfer is implemented more efficiently
    • Reduces GC pressure due to fewer object allocations
  2. Better Control:

    • Direct access to epoll/kqueue features that Java NIO doesn’t expose
    • Can optimize for specific platforms (Linux/BSD)
    • Handles edge cases and error handling better

5. Python Async Frameworks

import asyncio

async def handle_connection(reader, writer):
    data = await reader.read(100)
    writer.write(data)
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(
        handle_connection, '127.0.0.1', 8080)
    await server.serve_forever()

asyncio.run(main())

Python asyncio uses epoll through the selectors module.

6. Redis

Redis is an in-memory data structure store used as a database, cache, and message broker. Redis uses an event-driven single-threaded model with multiplexed I/O thanks to epoll to achieve high performance.

// Excerpt from Redis source code (ae_epoll.c)
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata;
    int retval, numevents = 0;

    retval = epoll_wait(state->epfd, state->events, eventLoop->setsize,
            tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
    
    if (retval > 0) {
        int j;
        numevents = retval;
        for (j = 0; j < numevents; j++) {
            int mask = 0;
            struct epoll_event *e = state->events+j;

            if (e->events & EPOLLIN) mask |= AE_READABLE;
            if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
            if (e->events & EPOLLERR) mask |= AE_WRITABLE|AE_READABLE;
            if (e->events & EPOLLHUP) mask |= AE_WRITABLE|AE_READABLE;
            
            eventLoop->fired[j].fd = e->data.fd;
            eventLoop->fired[j].mask = mask;
        }
    }
    return numevents;
}

Redis handles tens of thousands of concurrent connections on a single thread thanks to epoll and the event loop model. This helps Redis achieve extremely low latency and high throughput without complicating the design with multi-threading.

Performance Benefits

Using epoll provides important benefits:

  1. Scalability

    ‱ Can handle thousands of connections with low overhead
    ‱ Memory usage does not increase linearly with the number of connections
  2. Performance

    ‱ Reduces CPU usage due to fewer context switches
    ‱ Lower latency because it doesn’t have to scan the entire FD list
  3. Resource Efficiency

    ‱ No need to create a thread for each connection
    ‱ Lower kernel space memory usage
