
Asynchronous Non-blocking I/O under the hood: poll, select, epoll/kqueue
Why asynchronous non-blocking I/O?
Check out The C10K problem for the details.
The blocking I/O model with 1 thread/1 request has two main issues:
- Memory overhead: each thread requires a significant amount of memory:
  - Stack memory: typically 1-8 MB per thread on Linux (the default thread stack is 8 MB of virtual address space)
  - Kernel memory (e.g., the per-thread kernel stack and task structures)
- CPU overhead due to context switching:
  - Each switch between threads takes about 1-100 microseconds
  - Requires saving/loading register state
  - Invalidates CPU cache and TLB entries
  - Kernel scheduler overhead
With 10,000 concurrent connections, the overhead can reach 10GB of RAM and thousands of context switches per second.
File Descriptor (FD): "In Linux, everything is a file"
Before we continue, a quick introduction to FDs in Linux:
A File Descriptor is an integer used to refer to resources in the Linux kernel.
When a process creates a new file or opens a socket connection:
- The kernel creates an entry in the Global File Table.
- It creates an entry in the process's FD table pointing to the Global File Table entry.
- It returns the FD number to the process.
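As a minimal illustration (the file name data.txt is just a placeholder), both open() and socket() return plain integers, typically the lowest free slot in the process's FD table; 0, 1, and 2 are already taken by stdin, stdout, and stderr:
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    // Error handling omitted for brevity
    int file_fd = open("data.txt", O_RDONLY);      // e.g. returns 3
    int sock_fd = socket(AF_INET, SOCK_STREAM, 0); // e.g. returns 4
    printf("file fd = %d, socket fd = %d\n", file_fd, sock_fd);
    close(file_fd);
    close(sock_fd);
    return 0;
}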
The following parts of this article are based on the book The Linux Programming Interface - Chapter 63: Alternative I/O models.

How Did Linux Blocking I/O System Calls Evolve?
Back in the day, I/O in Linux could only be a blocking call, like this:
// Read file from an FD
ssize_t read(int fd, void *buf, size_t count);
// Write file to an FD
ssize_t write(int fd, const void *buf, size_t count);
What's a blocking call? It means that when you request a read/write (on a socket, pipe, disk, ...), the calling process (thread) sleeps after the call (read/write) until the I/O operation completes. If you want to do something else in the meantime (e.g., handle another request), you have to fork another process (or spawn another thread), which leads to the C10K problem.
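For illustration, a minimal sketch (assuming sock_fd is an already-connected socket FD):
char buf[1024];
// The calling thread sleeps in the kernel here until data arrives,
// the peer closes the connection (read returns 0), or an error occurs (-1)
ssize_t n = read(sock_fd, buf, sizeof(buf));
if (n > 0) {
    // Only now can we do something with the n bytes received
}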
Servers need to watch a lot of file descriptors
On a server, each time you accept a connection from a client with the accept system call, an FD is created to represent that connection to the process.
For most web servers, it's common to handle thousands of client connections simultaneously. The server needs to know when clients send new data on these connections so that it can process and respond. This could be done with a loop like the one below:
# The server has made 1000 queries to the database and is waiting
# for responses to come back on these connections
for x in database_query_connections:
    if has_new_response_data(x):
        process_response(x)
The problem with this approach is that it can waste a lot of CPU time. Instead of wasting all that CPU time asking "is there a new update?", we can simply tell the Linux kernel: "hey, I have 100 FDs, let me know when one of them has updated data!"
Why do we need select & poll?
Before select/poll, to handle multiple connections simultaneously, a server had 2 options:
- Blocking I/O with multiple threads:
  - One thread per connection
  - Consumes a lot of resources (memory, CPU context switching)
  - Doesn't scale well with a large number of connections
- Non-blocking I/O with busy waiting:
while (1) {
    for (int i = 0; i < n_conns; i++) {
        char buf[1024];
        // Try to read; returns immediately if there is no data
        int n = read(conns[i], buf, sizeof(buf));
        if (n > 0) {
            // Process data
        }
    }
}
  - Wastes CPU cycles
  - Inefficient when the number of connections is large
Select and poll were created to solve these problems:
- Allow monitoring multiple FDs simultaneously
- Block until data is ready on at least one of them
- Don't waste CPU cycles on busy waiting
Start with poll & select
These 2 syscalls are available on all Unix versions, while epoll and kqueue depend on the operating system (the Linux kernel uses epoll; BSD-based systems like macOS use kqueue, and the two are quite similar).
Explanation of poll and select
select and poll are two basic system calls for monitoring multiple file descriptors. Both allow a process to check whether an I/O operation can be performed on a set of FDs without blocking.
Select
select is the older syscall, introduced in BSD 4.2:
- Uses a bitmask to track FDs (fd_set)
- Limits the number of FDs that can be monitored (usually 1024)
- The fd_sets must be re-initialized before each call, because select modifies them in place
- Can monitor 3 types of events: read, write, and exceptions
Poll
poll was introduced later to overcome some of the limitations of select:
- Uses an array of struct pollfd instead of a bitmask
- No hard limit on the number of FDs
- No need to rebuild the array between calls; the kernel reports results in the revents field
- Provides more event types (POLLRDHUP, POLLPRI, etc.)
Key differences between poll and select
- Data structure:
  - select: Uses 3 separate fd_sets for read, write, and exceptions
  - poll: Uses an array of struct pollfd with event flags
- FD limit:
  - select: Limited by FD_SETSIZE (usually 1024)
  - poll: No hard limit on the number of FDs
- Performance:
  - select: Must copy and rebuild the fd_sets for every call
  - poll: The array is reused; the kernel reports results in revents
- Timeout handling:
  - select: Uses struct timeval (microsecond precision)
  - poll: Uses milliseconds
- Portability:
  - select: Available on most systems
  - poll: Not available on some older systems
How to use poll and select
Below are the signatures of the two syscalls:
// pselect: the select syscall (this variant also takes a signal mask)
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
fd_set *exceptfds, const struct timespec *timeout,
const sigset_t *sigmask);
// ppoll: the poll syscall (this variant also takes a signal mask)
int ppoll(struct pollfd *fds, nfds_t nfds,
const struct timespec *tmo_p, const sigset_t *sigmask);
Important parameters:
- timeout (for poll, an integer in milliseconds; for select, a struct timeval/timespec pointer, where NULL means block indefinitely):
  - -1: block indefinitely until an event occurs
  - 0: non-blocking, return immediately
  - > 0: block for at most the specified duration
Example of using select:
fd_set readfds;
struct timeval tv;
int retval;
FD_ZERO(&readfds);
FD_SET(sock_fd, &readfds);
tv.tv_sec = 5; // 5 seconds timeout
tv.tv_usec = 0;
retval = select(sock_fd + 1, &readfds, NULL, NULL, &tv);
if (retval == -1)
perror("select()");
else if (retval)
printf("Data is available now.\n");
else
printf("No data within five seconds.\n");
Example of using poll:
struct pollfd fds[1];
int timeout_msecs = 5000; // 5 seconds timeout
int retval;
fds[0].fd = sock_fd;
fds[0].events = POLLIN;
retval = poll(fds, 1, timeout_msecs);
if (retval == -1)
perror("poll()");
else if (retval)
printf("Data is available now.\n");
else
printf("No data within five seconds.\n");
Comparing poll/select with traditional read
To clearly see the advantages of poll/select, let's compare the ways to handle multiple connections:
- Traditional Read (blocking):
// Must block for each connection
for (int i = 0; i < n_conns; i++) {
char buf[1024];
int n = read(conns[i], buf, sizeof(buf)); // Block until data is available
if (n > 0) {
process_data(buf, n);
}
}
// Problem: If the first connection has no data,
// it will block the entire process, unable to handle other connections
- Non-blocking Read with busy-waiting:
// Set all sockets to non-blocking mode beforehand
// fcntl(conns[i], F_SETFL, O_NONBLOCK);
// Must continuously check each connection
while (1) {
for (int i = 0; i < n_conns; i++) {
char buf[1024];
int n = read(conns[i], buf, sizeof(buf));
if (n > 0) {
process_data(buf, n);
} else if (n < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK) {
// No data available, continue with the next socket
continue;
} else {
// Handle error
handle_error();
}
}
}
// Problem: Wastes CPU due to continuously checking all sockets
// even though most have no data
}
- Using select:
fd_set readfds;
int max_fd = -1;
while (1) {
// Set up the FDs to be monitored
FD_ZERO(&readfds);
for (int i = 0; i < n_conns; i++) {
FD_SET(conns[i], &readfds);
if (conns[i] > max_fd) {
max_fd = conns[i];
}
}
// Block until any FD is ready
int activity = select(max_fd + 1, &readfds, NULL, NULL, NULL);
if (activity < 0) {
perror("select error");
break;
}
// Check which FD is ready
for (int i = 0; i < n_conns; i++) {
if (FD_ISSET(conns[i], &readfds)) {
char buf[1024];
int n = read(conns[i], buf, sizeof(buf)); // Will not block
if (n > 0) {
process_data(buf, n);
} else if (n == 0) {
// Connection closed
close(conns[i]);
// Handle connection closure
} else {
// Handle error
handle_error();
}
}
}
}
- Using poll:
struct pollfd fds[MAX_CONNS];
// Set up array of pollfd structures
for (int i = 0; i < n_conns; i++) {
fds[i].fd = conns[i];
fds[i].events = POLLIN; // Interested in readable events
}
while (1) {
// Block until any FD is ready
int activity = poll(fds, n_conns, -1); // -1 = block indefinitely
if (activity < 0) {
perror("poll error");
break;
}
// Check which FD is ready
for (int i = 0; i < n_conns; i++) {
if (fds[i].revents & POLLIN) {
char buf[1024];
int n = read(fds[i].fd, buf, sizeof(buf));
if (n > 0) {
process_data(buf, n);
} else if (n == 0) {
// Connection closed
close(fds[i].fd);
fds[i].fd = -1; // Mark fd as closed
} else {
// Handle error
handle_error();
}
}
}
}
Advantages of poll/select:
- Doesn't block unnecessarily - only blocks when no FD is ready
- Doesn't waste CPU on busy waiting - only consumes CPU when processing actual data
- Can handle multiple connections with one thread
- Flexible timeout control (can set a timeout for poll/select)
- Portable across Unix systems
Disadvantages:
- Must copy FD sets between user space and kernel space each time it is called
- Must scan the entire FD set each time it is called, complexity O(n)
- Limit on the number of FDs (FD_SETSIZE in select, usually 1024)
- The return value only says how many FDs are ready, not which ones, so the caller still has to scan the set to find them
What's epoll?
To address the above disadvantages, epoll was created, especially for handling a large number of concurrent connections.
epoll is a group of syscalls (epoll_create, epoll_ctl, epoll_wait) that give the Linux kernel a list of FDs to monitor and report updates on. Unlike poll/select, epoll can operate in two modes: level-triggered (the default) and edge-triggered.
Here are the steps to use epoll (a minimal sketch follows the list):
- Call epoll_create to tell the kernel that you're about to use epoll; it returns an id for the epoll instance (which is itself an FD).
- Call epoll_ctl to tell the kernel which FDs you want to receive updates about when they change; you can register many different types of FDs (pipes, FIFOs, sockets, POSIX message queues, inotify instances, devices, ...).
- Call epoll_wait to wait for updates on the FDs in the list you are monitoring.
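Putting the three steps together, a minimal sketch (assuming sock_fd is an already-open socket; error handling omitted):
#include <sys/epoll.h>

struct epoll_event ev, events[64];

int epfd = epoll_create1(0);       // step 1: create the epoll instance (an FD itself)

ev.events = EPOLLIN;               // step 2: ask for "readable" events on sock_fd
ev.data.fd = sock_fd;              //         (level-triggered by default)
epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

int nready = epoll_wait(epfd, events, 64, -1);  // step 3: block until something is ready
for (int i = 0; i < nready; i++) {
    // events[i].data.fd is ready for reading; only ready FDs are returned
}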
Improvements of epoll:
- Better performance
  - Uses a callback mechanism instead of scanning: the kernel only returns the FDs with events, complexity O(1) instead of O(n)
  - Doesn't have to copy the FD list between calls
- No hard limit on the number of FDs
  - Can monitor tens of thousands of connections without hitting a size limit
- Stores state in the kernel
  - Uses 3 different system calls to create, manage, and wait for events (epoll_create, epoll_ctl, epoll_wait)
  - No need to reset the FD list after each call
- Supports both notification modes
  - Level-triggered and edge-triggered (what are they?)
Level-triggered vs Edge-triggered
These are 2 different modes for notifying I/O events:
- Level-triggered (LT):
  - Continuously notifies while the FD is in a ready state
  - Example: when a socket has data, it will keep being reported until all the data is read
  - Supported by poll/select/epoll
  - Safer because it doesn't miss events
- Edge-triggered (ET):
  - Only notifies once, when the FD transitions from not-ready to ready
  - Example: only notifies once when new data arrives on the socket
  - Only supported by epoll
  - Better performance, but requires careful code to avoid missing events (data)
Why are there 2 different modes?
These 2 modes are designed to meet different use cases in I/O processing:
- Level-triggered came first and is suitable for:
  - Simple applications that need high reliability
  - Processing data in small chunks, without needing to read everything at once
  - Developers who only need to read data when needed, without worrying about draining the entire buffer
  - Legacy frameworks that have been designed with this model
- Edge-triggered was created later to solve:
  - Performance issues when the number of connections is very large
  - Reducing the number of unnecessary system calls
  - Allowing easier implementation of zero-copy I/O (transferring data directly from the disk/network buffer to the socket buffer without copying through user space)
  - Suitability for modern async I/O applications
When should you use which mode?
Use Level-triggered when:
- The application needs to be simple and easy to maintain
- Reliability is more important than performance
- Processing data in small parts
- Don't want to implement complex logic to read the entire buffer
- Using legacy frameworks/libraries
Use Edge-triggered when:
- Need to optimize performance for a large number of connections
- Can implement logic to read the entire buffer accurately
- Want to minimize the number of system calls
- When you want to implement zero-copy I/O
Example illustrating the difference:
LT mode with 100 bytes in the socket buffer:
- Notify "readable"
- Read 50 bytes
- Still notify "readable" (because there are still 50 bytes)
- Read the remaining 50 bytes
- No longer notify
ET mode with 100 bytes in the socket buffer:
- Notify "readable" (first time data arrives)
- Read 50 bytes
- No longer notify
- Must read all data in one go
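In practice, the ET case is handled with a drain loop like the minimal sketch below (assuming fd is a non-blocking socket): keep reading until the kernel reports EAGAIN.
#include <errno.h>
#include <unistd.h>

// Drain the socket completely on each ET notification (fd must be non-blocking)
char buf[1024];
for (;;) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        // process the n bytes just read
        continue;
    }
    if (n == 0) {
        // peer closed the connection
        break;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // buffer fully drained; wait for the next edge
        break;
    }
    // some other error
    break;
}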
Detailed explanation of Level-triggered notification
Level-triggered (LT) mode operates based on the state of the file descriptor:
- When the buffer still has data to read (regardless of how much), the FD will be considered "readable"
- Notification will continue until ALL data in the buffer has been read
- This is like a "level" - if the level > 0 (there is still data), it will continue to notify
In the example with 100 bytes above:
- Initially there are 100 bytes in the buffer → notify "readable"
- After reading 50 bytes:
  - The buffer still has 50 bytes unread
  - Therefore the FD is still in the "readable" state
  - Level-triggered will continue to notify because of this state
- Only when all 100 bytes have been read and the buffer is empty does the notification stop
This is one of the strengths of Level-triggered mode:
- Safer because it doesn't miss data
- No need to read all data in one go
- Suitable for applications that process data in small chunks
Compared to Edge-triggered mode:
- ET only notifies when there is a state change (edge)
- Only notifies once when new data arrives
- Forced to read all data in one go, otherwise data may be missed
This is why many frameworks such as Java NIO (pre NIO.2) choose Level-triggered as the default - it is safer and easier to use, although it may not be as efficient as Edge-triggered in some cases.
Edge-triggered epoll
To use epoll in edge-triggered mode, we need to:
- Set the EPOLLET flag when registering the FD with epoll
- Set the FD to non-blocking mode
- Read all data in the buffer each time a notification is received
Real-world examples
Web Server handles requests with epoll:
// Initialize epoll
int epfd = epoll_create1(0);
struct epoll_event ev, events[MAX_EVENTS];
// Level-triggered mode (default)
ev.events = EPOLLIN;
ev.data.fd = server_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);
while (1) {
int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < nfds; i++) {
if (events[i].data.fd == server_fd) {
// Accept new connection
int client_fd = accept(server_fd, NULL, NULL);
// Add to epoll
ev.events = EPOLLIN;
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
} else {
// Level-triggered: read once per notification; if data remains in the
// socket buffer, the next epoll_wait will report this FD as ready again
char buf[1024];
int n = read(events[i].data.fd, buf, sizeof(buf));
if (n > 0) {
process_data(buf, n);
} else if (n == 0) {
close(events[i].data.fd); // peer closed the connection
}
}
}
}
// Edge-triggered mode
ev.events = EPOLLIN | EPOLLET; // Add EPOLLET flag
ev.data.fd = server_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);
while (1) {
int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < nfds; i++) {
if (events[i].data.fd == server_fd) {
// Accept new connection
int client_fd = accept(server_fd, NULL, NULL);
// Set non-blocking mode
fcntl(client_fd, F_SETFL, fcntl(client_fd, F_GETFL, 0) | O_NONBLOCK);
// Add to epoll
ev.events = EPOLLIN | EPOLLET;
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
} else {
// Edge-triggered: Only notify once when new data arrives
char buf[1024];
int n;
while (1) {
n = read(events[i].data.fd, buf, sizeof(buf));
if (n == 0) {
// Peer closed the connection
close(events[i].data.fd);
break;
}
if (n < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK) {
// All data has been drained from the socket buffer
break;
}
// Error handling
break;
}
process_data(buf, n);
}
}
}
}
Zero Copy and Epoll
Zero copy is an I/O optimization technique in which data is transferred directly from the disk (page cache) or network buffer to the socket buffer without being copied through user-space application buffers. When combined with epoll (especially in edge-triggered mode), zero copy can significantly reduce the overhead of I/O operations.
Some popular runtimes/frameworks that use zero copy with epoll:
- Netty Framework (Java):
  - Uses FileRegion and native transport (epoll) to implement zero copy
  - Often used in web servers to transfer large files
- Nginx:
  - Uses the sendfile() syscall combined with epoll (a minimal sketch follows this list)
  - Very effective when serving static files
  - Reduces CPU usage and significantly increases throughput
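For reference, a minimal sketch of the sendfile() pattern (client_fd, file_fd, and file_size are hypothetical: a connected client socket, an open static file, and its size):
#include <sys/sendfile.h>

// The kernel copies the file contents from the page cache straight into the
// socket buffer; the data never passes through a user-space buffer.
off_t offset = 0;
ssize_t sent = sendfile(client_fd, file_fd, &offset, file_size);
if (sent < 0) {
    // Handle error (e.g. retry later on EAGAIN if the socket is non-blocking)
}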
Who uses epoll?
Most modern frameworks and runtimes use epoll (on Linux) or kqueue (on BSD/MacOS) to implement event loops and async I/O:
1. Node.js (libuv)
const server = require('http').createServer();
server.on('connection', (socket) => {
console.log('New connection');
socket.on('data', (data) => {
console.log('Received:', data.toString());
});
});
server.listen(8080);
Under the hood, libuv uses epoll to monitor socket events.
2. Go Runtime
func main() {
ln, err := net.Listen("tcp", ":8080")
if err != nil {
panic(err)
}
for {
conn, err := ln.Accept()
if err != nil {
continue
}
go handleConnection(conn) // Creates a new goroutine
}
}
Go runtime uses netpoller (based on epoll) to implement non-blocking network I/O. When a goroutine performs network I/O:
- The runtime registers the FD with epoll
- The goroutine is parked (suspended)
- When an event occurs, the scheduler will wake up the goroutine
3. Nginx Event Loop
events {
use epoll; # Explicitly use epoll
worker_connections 1024;
}
Nginx uses epoll to handle thousands of concurrent connections on each worker process.
4. Modern Java NIO
// NIO (Level-triggered) Example
ServerSocketChannel serverSocket = ServerSocketChannel.open();
serverSocket.bind(new InetSocketAddress(8080));
serverSocket.configureBlocking(false);
Selector selector = Selector.open();
serverSocket.register(selector, SelectionKey.OP_ACCEPT);
while (true) {
selector.select();
Set<SelectionKey> selectedKeys = selector.selectedKeys();
Iterator<SelectionKey> iter = selectedKeys.iterator();
while (iter.hasNext()) {
SelectionKey key = iter.next();
iter.remove();
if (key.isAcceptable()) {
// Accept new connection
SocketChannel client = serverSocket.accept();
client.configureBlocking(false);
client.register(selector, SelectionKey.OP_READ);
}
if (key.isReadable()) {
// Level-triggered: Will be notified as long as there's data
SocketChannel client = (SocketChannel) key.channel();
ByteBuffer buffer = ByteBuffer.allocate(1024);
int bytesRead = client.read(buffer);
if (bytesRead > 0) {
buffer.flip();
// Process data...
}
}
}
}
// NIO.2 (Edge-triggered) Example with AsynchronousSocketChannel
public class AsyncServer {
private final AsynchronousServerSocketChannel server;
public AsyncServer() throws IOException {
server = AsynchronousServerSocketChannel.open()
.bind(new InetSocketAddress(8080));
}
public void start() {
server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
@Override
public void completed(AsynchronousSocketChannel client, Void attachment) {
// Accept next connection
server.accept(null, this);
// Edge-triggered: Only notified once when data arrives
ByteBuffer buffer = ByteBuffer.allocate(1024);
client.read(buffer, buffer, new CompletionHandler<Integer, ByteBuffer>() {
@Override
public void completed(Integer result, ByteBuffer buffer) {
if (result > 0) {
buffer.flip();
// Must read ALL available data since we only get notified once
while (buffer.hasRemaining()) {
// Process data...
}
// Setup next read
buffer.clear();
client.read(buffer, buffer, this);
}
}
@Override
public void failed(Throwable exc, ByteBuffer attachment) {
try {
client.close();
} catch (IOException e) {
// Handle error
}
}
});
}
@Override
public void failed(Throwable exc, Void attachment) {
// Handle error
}
});
}
}
Java NIO (pre NIO.2) uses level-triggered notification through the Selector API. This means that the Selector will continuously notify until all data has been read from the buffer.
NIO.2 (from Java 7) introduces AsynchronousSocketChannel with an edge-triggered model through CompletionHandler callbacks. When new data arrives, the callback is only called once, so the code must ensure that all available data is read and the callback is reset for the next read (the code will be more complex and error-prone).
Java NIO Selector on Linux uses epoll from JDK 1.5+
Netty (Netty Native Transport)
Hmm, in reality, most async projects in the Java ecosystem don't use the Java NIO API directly but use the Netty framework.
Netty is used by most popular async frameworks such as Play Framework, Spring WebFlux, and Akka, but instead of using the Java NIO APIs (stdlib), Netty chooses to implement epoll/kqueue directly through JNI (Java Native Interface, calling OS syscalls directly). The reasons are:
- Better performance:
  - Reduces the number of context switches between user space and kernel space
  - Avoids overhead from the Java NIO selector implementation (meaning the old Java NIO version)
  - Zero-copy transfer is implemented more efficiently
  - Reduces GC pressure due to fewer object allocations
- Better control:
  - Direct access to epoll/kqueue features that Java NIO doesn't expose
  - Can optimize for specific platforms (Linux/BSD)
  - Handles edge cases and errors better
5. Python Async Frameworks
import asyncio
async def handle_connection(reader, writer):
data = await reader.read(100)
writer.write(data)
await writer.drain()
writer.close()
async def main():
server = await asyncio.start_server(
handle_connection, '127.0.0.1', 8080)
await server.serve_forever()
asyncio.run(main())
Python asyncio uses epoll through the selector module.
6. Redis
Redis is an in-memory data structure store used as a database, cache, and message broker. Redis uses an event-driven single-threaded model with multiplexed I/O thanks to epoll to achieve high performance.
// Excerpt from Redis source code (ae_epoll.c)
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
aeApiState *state = eventLoop->apidata;
int retval, numevents = 0;
retval = epoll_wait(state->epfd, state->events, eventLoop->setsize,
tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
if (retval > 0) {
int j;
numevents = retval;
for (j = 0; j < numevents; j++) {
int mask = 0;
struct epoll_event *e = state->events+j;
if (e->events & EPOLLIN) mask |= AE_READABLE;
if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
if (e->events & EPOLLERR) mask |= AE_WRITABLE|AE_READABLE;
if (e->events & EPOLLHUP) mask |= AE_WRITABLE|AE_READABLE;
eventLoop->fired[j].fd = e->data.fd;
eventLoop->fired[j].mask = mask;
}
}
return numevents;
}
Redis handles tens of thousands of concurrent connections on a single thread thanks to epoll and the event loop model. This helps Redis achieve extremely low latency and high throughput without complicating the design with multi-threading.
Performance Benefits
Using epoll provides important benefits:
- Scalability
  - Can handle thousands of connections with low overhead
  - Memory usage does not increase linearly with the number of connections
- Performance
  - Reduces CPU usage due to fewer context switches
  - Lower latency because it doesn't have to scan the entire FD list
- Resource efficiency
  - No need to create a thread for each connection
  - Lower kernel-space memory usage
References:
- http://www.kegel.com/c10k.html
- https://man7.org/tlpi
- https://github.com/shichao-an/notes/blob/master/docs/unp/ch6.md
- https://daniel.haxx.se/docs/poll-vs-select.html
- https://jvns.ca/blog/2017/06/03/async-io-on-linux--select--poll--and-epoll/
- https://eklitzke.org/blocking-io-nonblocking-io-and-epoll