Implementing FIFO queues without locks
Since the debut of S3-FIFO, many people have become interested in implementing the new eviction algorithm. However, several discussions have raised concerns about its scalability. In this post, I will discuss how to implement FIFO and S3-FIFO without using any lock.
A quick intro to S3-FIFO
S3-FIFO is a new cache eviction algorithm that is designed to be simple, scalable, and yet effective. It uses three FIFO queues:
- a small FIFO queue S that quickly removes new and unpopular objects (quick demotion)
- a main FIFO queue M that keeps popular objects in the cache with reinsertion (lazy promotion), and
- a ghost FIFO queue G that stores the id of objects recently evicted from the small queue.
The small queue uses 10% of the cache space, and the main queue uses the rest. The ghost queue stores the same number of entries as the main queue. Each object in the cache uses one bit (or two bits) of metadata to track hotness; the counter is incremented by 1 (up to its maximum) when the object is accessed.
Upon a cache miss, if the id of the requested object is not tracked in the ghost queue G, it is inserted into the small queue S; however, if the requested object is tracked in G, then the object is inserted into the main queue M.
When S performs an eviction, if the object has not been reused since insertion, it is evicted, and its id (or hash) is tracked in the ghost queue G. Otherwise, the object is promoted to the main queue M with the access bit reset to zero.
When M performs an eviction, an object is directly evicted if it has not been reused since the insertion. Otherwise, the object is reinserted into the main queue M. (This is the lazy promotion part, also called FIFO-Reinsertion, Second Chance, or CLOCK).
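The miss and eviction paths above can be sketched in code. The following is my own minimal, single-threaded illustration (all names are illustrative; the ghost structure here is a toy fixed-size table that assumes nonzero ids, and capacity accounting is omitted):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct obj {
    uint64_t id;
    int freq;            /* small access counter: the "one or two bits" */
    struct obj *next;
} obj_t;

typedef struct { obj_t *head, *tail; } fifo_t;

void fifo_push(fifo_t *q, obj_t *o) {
    o->next = NULL;
    if (q->tail) q->tail->next = o; else q->head = o;
    q->tail = o;
}

obj_t *fifo_pop(fifo_t *q) {
    obj_t *o = q->head;
    if (o) { q->head = o->next; if (!q->head) q->tail = NULL; }
    return o;
}

/* Toy ghost set: a direct-mapped table of ids (assumes ids are nonzero). */
#define GHOST_SLOTS 128
typedef struct { uint64_t ids[GHOST_SLOTS]; } ghost_t;

int  ghost_contains(ghost_t *g, uint64_t id) { return g->ids[id % GHOST_SLOTS] == id; }
void ghost_insert(ghost_t *g, uint64_t id)   { g->ids[id % GHOST_SLOTS] = id; }

typedef struct { fifo_t S, M; ghost_t G; } cache_t;

/* On a miss: insert into S, unless the ghost queue remembers the id. */
void on_miss(cache_t *c, obj_t *o) {
    if (ghost_contains(&c->G, o->id)) fifo_push(&c->M, o);
    else fifo_push(&c->S, o);
}

void evict_from_small(cache_t *c) {
    obj_t *o = fifo_pop(&c->S);
    if (!o) return;
    if (o->freq > 0) { o->freq = 0; fifo_push(&c->M, o); } /* promote */
    else { ghost_insert(&c->G, o->id); free(o); }          /* evict + ghost */
}

void evict_from_main(cache_t *c) {
    obj_t *o = fifo_pop(&c->M);
    if (!o) return;
    if (o->freq > 0) { o->freq -= 1; fifo_push(&c->M, o); } /* reinsertion */
    else free(o);
}
```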
More details of S3-FIFO can be found on the project website.
We can see that all operations on the FIFO queues happen either at the head or at the tail, which means we can implement the FIFO queues without using any lock.
Implementing a FIFO queue without locks
The most straightforward approach to implement a lock-free FIFO queue is to use a ring buffer with a pointer to the tail; each insertion atomically increments the tail pointer. The drawback is that the size of the ring buffer must be fixed, while in many cases the number of objects in the cache is not known ahead of time.
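As a sketch, such a ring buffer might look like the following (my own illustrative names). Note that this simplified version is not safe for fully concurrent producers and consumers: a production MPMC ring needs per-slot sequence numbers, as in Vyukov's bounded queue, so that a pop never reads a slot before its push has written it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024   /* capacity must be fixed up front */

typedef struct {
    void *slots[RING_SIZE];
    atomic_size_t head;  /* next slot to pop (eviction side) */
    atomic_size_t tail;  /* next slot to push (insertion side) */
} ring_fifo_t;

/* Claim a slot by atomically bumping the tail index. */
bool ring_push(ring_fifo_t *r, void *obj) {
    size_t tail = atomic_load(&r->tail);
    for (;;) {
        if (tail - atomic_load(&r->head) >= RING_SIZE)
            return false;                       /* full */
        if (atomic_compare_exchange_weak(&r->tail, &tail, tail + 1)) {
            /* NB: a truly concurrent version needs a per-slot sequence
             * number so poppers cannot observe this slot early. */
            r->slots[tail % RING_SIZE] = obj;
            return true;
        }
    }
}

void *ring_pop(ring_fifo_t *r) {
    size_t head = atomic_load(&r->head);
    for (;;) {
        if (head == atomic_load(&r->tail))
            return NULL;                        /* empty */
        void *obj = r->slots[head % RING_SIZE];
        if (atomic_compare_exchange_weak(&r->head, &head, head + 1))
            return obj;
    }
}
```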
A more common approach is to use a linked list. However, implementing a lock-free FIFO queue using a linked list is not trivial. Below I will demonstrate how to implement a lock-free FIFO queue using a linked list.
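One possible sketch of such a queue follows the classic Michael and Scott design: a dummy node, an atomic head for dequeue and an atomic tail for enqueue, with CAS loops that "help" a lagging tail forward. This is my own illustrative version; safe memory reclamation (hazard pointers or epochs) is deliberately omitted.

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct node {
    void *value;
    _Atomic(struct node *) next;
} node_t;

typedef struct {
    _Atomic(node_t *) head;  /* dequeue (eviction) end, points at a dummy */
    _Atomic(node_t *) tail;  /* enqueue (insertion) end */
} queue_t;

void queue_init(queue_t *q) {
    node_t *dummy = calloc(1, sizeof(node_t));
    atomic_store(&q->head, dummy);
    atomic_store(&q->tail, dummy);
}

void enqueue(queue_t *q, void *value) {
    node_t *n = calloc(1, sizeof(node_t));
    n->value = value;
    for (;;) {
        node_t *tail = atomic_load(&q->tail);
        node_t *next = atomic_load(&tail->next);
        if (next == NULL) {
            /* link the new node after the current tail */
            if (atomic_compare_exchange_weak(&tail->next, &next, n)) {
                /* swing the tail; if this fails, another thread helped */
                atomic_compare_exchange_weak(&q->tail, &tail, n);
                return;
            }
        } else {
            /* tail is lagging behind; help advance it */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
        }
    }
}

void *dequeue(queue_t *q) {
    for (;;) {
        node_t *head = atomic_load(&q->head);
        node_t *tail = atomic_load(&q->tail);
        node_t *next = atomic_load(&head->next);
        if (next == NULL)
            return NULL;  /* empty */
        if (head == tail) {
            /* tail is lagging behind; help advance it */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        void *value = next->value;
        if (atomic_compare_exchange_weak(&q->head, &head, next)) {
            /* the old head node is leaked here; a real cache needs safe
             * memory reclamation (hazard pointers, epochs) to free it */
            return value;
        }
    }
}
```

Note that both CAS loops help complete the other half of a stalled enqueue, which is what makes the queue lock-free rather than merely non-blocking for the fast path.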
Supporting the remove operation
The code shown above does not support the remove operation. There are a few ways to implement remove.
Use a background thread
This is my favorite approach. It uses a dedicated background thread to remove objects so that there is no conflict between threads. Upon a call to remove, we delete the hash table entry and write the pointer to the object into a buffer. The background thread periodically checks the buffer and removes the objects. Since all other threads operate either at the head (for insertion) or the tail (for eviction), only the background thread updates objects in the middle of the queue, so it is thread-safe.
However, there is one rare race condition: we need to make sure that removing the head does not conflict with insertion. Therefore, we check whether the object to remove is the head; if so, we atomically update the head to the next object (with a CAS) before removing it.
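The hand-off buffer between remover threads and the background thread can be sketched as follows (my own illustrative names and fixed size; the index reset at the end of the drain assumes the drain runs while no `defer_remove()` is in flight, which a real cache would have to guarantee or handle):

```c
#include <stdatomic.h>
#include <stddef.h>

#define RM_BUF_SIZE 64

typedef struct {
    _Atomic(void *) slots[RM_BUF_SIZE];
    atomic_size_t widx;           /* next free slot */
} rm_buf_t;

/* Any thread: after deleting the hash-table entry, hand the object
 * pointer to the background thread. */
int defer_remove(rm_buf_t *b, void *obj) {
    size_t i = atomic_fetch_add(&b->widx, 1);
    if (i >= RM_BUF_SIZE)
        return 0;                 /* full; a real cache would block or grow */
    atomic_store(&b->slots[i], obj);
    return 1;
}

/* Background thread only: collect pending objects, which it can then
 * unlink from the middle of the queue (after checking the head race). */
size_t drain_removals(rm_buf_t *b, void *out[], size_t cap) {
    size_t n = atomic_load(&b->widx), k = 0;
    if (n > RM_BUF_SIZE) n = RM_BUF_SIZE;
    for (size_t i = 0; i < n && k < cap; i++) {
        void *obj = atomic_exchange(&b->slots[i], NULL);
        if (obj) out[k++] = obj;
    }
    atomic_store(&b->widx, 0);    /* simplification: assumes no concurrent
                                   * defer_remove() during the reset */
    return k;
}
```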
Use a reader-writer lock
The second approach is to use a reader-writer lock. All threads that read, write, and evict acquire the reader lock, while deletion acquires the writer lock. The reader-writer lock allows reads, writes, and evictions to proceed concurrently, but deletions are exclusive. Since most workloads have a small number of deletes, this approach is often sufficient. The only drawback is that the overhead of a reader-writer lock is higher than that of the lock-free approach.
Use a removal flag
The last solution is to use a flag to mark the object as removed and free the space used by the key and value. This is the easiest solution when the workload does not have many deletes, but it wastes some space on metadata.
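A sketch of this tombstone approach (field names are illustrative; a real implementation must also coordinate with concurrent readers, e.g. via reference counting, before freeing the key and value):

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct entry {
    _Atomic int deleted;   /* tombstone: set once logically removed */
    void *key;
    void *value;
    struct entry *next;    /* the node stays linked in the FIFO queue */
} entry_t;

/* Mark the entry removed and free the bulky parts immediately; the
 * node itself is reclaimed later, when eviction passes over it and
 * sees the flag. Assumes no reader still holds key/value. */
void entry_remove(entry_t *e) {
    if (atomic_exchange(&e->deleted, 1) == 0) {  /* first remover wins */
        free(e->key);   e->key = NULL;
        free(e->value); e->value = NULL;
    }
}
```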
Implementing S3-FIFO without locks
Now we have all the ingredients to implement S3-FIFO without locks. The code shown below is a simplified version of lock-free S3-FIFO.
I will update this when I have time.
Even though the FIFO queues are implemented without locking, the CAS operations still incur cache coherence traffic and delays under high contention. In my experience, this can become a scalability bottleneck when scaling to more than 16 threads. See the discussions here, here, and here.
Note that most simple benchmarks compare a mutex and atomics on a tiny operation; in a real system, however, the critical section is much larger, which makes the mutex more expensive than the atomics.
One optimization to reduce the atomic-operation overhead is batching: each eviction removes multiple objects, and inserted objects are added to the head in a batch. This can significantly reduce the amount of cache coherence traffic.
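The insertion side of batching is simple to illustrate: a thread links its new nodes into a private chain, then splices the whole chain onto the head with a single CAS instead of one CAS per object. A minimal sketch (my own names; shown for a head-insert list):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct bnode {
    struct bnode *next;
    void *object;
} bnode_t;

/* Splice a privately built chain first -> ... -> last onto the head.
 * One CAS publishes the whole batch, so contention on the head pointer
 * drops by roughly the batch size. */
void push_batch(_Atomic(bnode_t *) *head, bnode_t *first, bnode_t *last) {
    bnode_t *old = atomic_load(head);
    do {
        last->next = old;   /* link the batch tail to the current head */
    } while (!atomic_compare_exchange_weak(head, &old, first));
}
```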
In this post, I demonstrated how to implement FIFO and S3-FIFO without using any lock. The key idea is to atomically update the head and tail pointers. Lock-free implementations are useful not only for S3-FIFO but also for other applications; for example, NetBSD has an implementation of a lock-free FIFO queue.
EDIT: There is a great paper on implementing a lock-free FIFO queue using a linked list. See here.