Implementing FIFO queues without locks

Since the debut of S3-FIFO, many people have become very interested in implementing the new eviction algorithm. However, there have been several discussions where concerns about scalability are raised. In this post, I will discuss how to implement FIFO and S3-FIFO without using any lock.

A quick intro to S3-FIFO

S3-FIFO is a new cache eviction algorithm that is designed to be simple, scalable, and yet effective. It uses three FIFO queues

The small queue uses 10% of cache space, and the main queue uses the rest. The ghost queue stores the same number of entries as the main queue [1]. Each object in the cache uses one bit (or two bits [2]) of metadata to track hotness — the bit increments by 1 when the object is accessed.

Upon a cache miss, if the id of the requested object is not tracked in the ghost queue G, it is inserted into the small queue S; however, if the requested object is tracked in G, then the object is inserted into the main queue M.

When S performs an eviction if the object has not been reused since insertion, it is evicted, and its id (or hash) is tracked in the ghost queue G. Otherwise, the object is promoted to the main queue M with the access bit reset to zero.

When M performs an eviction, an object is directly evicted if it has not been reused since the insertion. Otherwise, the object is reinserted into the main queue M. (This is the lazy promotion part, also called FIFO-Reinsertion, Second Chance, or CLOCK).

More details of S3-FIFO can be found on the project website.

We can see that all operations on the FIFO queues happen either at the head or at the tail, which means we can implement the FIFO queues without using any lock.

Implement a FIFO queue without locks

The most straightforward approach to implement a lock-free FIFO queue is to use a ring buffer with a pointer to the tail. Each insertion atomically increments the tail pointer. However, this approach has a drawback — the size of the ring buffer must be fixed. However, the number of objects in the cache is not known ahead of time in many cases.

A more common approach is to use a linked list. However, implementing lock-free FIFO using a linked list is not trivial. Below I will demonstrate how to implement a lock-free FIFO queue using a linked list.

typedef struct object {
   K key;
   V value;
   struct object *prev;
   struct object *next;
   bool accessed;
} object_t;


object_t *head = NULL;
object_t *tail = NULL;


/* insert at the head */
void atomic_insert(object_t *new_obj, object_t *head) {
    object_t *old_head = head;
    new_obj->next = old_head;
    new_obj->prev = NULL;


   // atomically update the head if it is still the same
    while (__atomic_compare_exchange_n(
        &head, &old_head, new_obj, false,
        __ATOMIC_RELAXED, __ATOMIC_RELAXED) == false) {
            old_head = head;
            new_obj->next = old_head;
    }

    if (old_head != NULL) {
        old_head->prev = new_obj;
    } else {
        tail = new_obj;
    }
}


/* evict from the tail */
object_t* atomic_evict(object_t *_head, object_t *tail) {
    object_t *old_tail = tail;


    // atomically update the tail if it is still the same
    while (__atomic_compare_exchange_n(
        &tail, &old_tail, tail->prev, false,
        __ATOMIC_RELAXED, __ATOMIC_RELAXED) == false) {
            old_tail = tail;
   }

    // assume we will not evict till empty
    assert(tail != NULL);
    tail->next = NULL;

    return old_tail;
}

Support remove operation

The code shown above does not support remove operation. There are a few ways to implement remove.

Use a background thread

This is my favoriate approach. This approach uses a dedicated background thread to remove objects so that there is no conflict between threads. Upon a call to remove, we remove the hash table entry, and write the pointer to the object into a buffer. The background thread periodically checks the buffer and removes the objects. Since all other threads operate either at the head (for insertion) or the tail (for eviction), only the background thread update objects in the middle of the queue, so it is thread-safe.

However, there is one rare race condition. We need to make sure removing head does not conflict with the insertion. Therefore, we need to check if the object to remove is the head, if so, we atomically update the head to the next object (CAS) before removing it.

Use a reader-writer lock

The second approach is to use a reader-writer lock. All threads that read, write and evict acquire the reader lock, and deletion acquires the writer lock. The reader-writer lock allows concurrent reads and writes, but not concurrent deletes. Since most workloads have a small number of deletes, this approach is often sufficient. The only drawback is that the overhead of reader-writer lock is higher than the lock-free approach.

Use tombstone

The last solution is to use a flag to mark the object as removed and free the space used by key and value. This is the easiest solution when the workload does not have a lot of deletes, but it wastes some space on metadata.

Implementing S3-FIFO without locks

Now we have all the ingredients to implement S3-FIFO without locks. The code shown below is a simplified version of lock-free S3-FIFO.

typedef struct s3fifo {
   object_t *small_head;
   object_t *small_tail;
   object_t *main_head;
   object_t *main_tail;
   object_t *ghost_head;
   object_t *ghost_tail;
} s3fifo_t;




void s3fifo_atomic_insert(s3fifo_t *s3fifo, object_t *obj) {
   // sorry for the weird syntax
   if (obj not in ghost queue) {
       atomic_insert(obj, s3fifo->small_head);
   } else {
       atomic_insert(obj, s3fifo->main_head);
   }
}


void s3fifo_atomic_evict(s3fifo_t *s3fifo) {
   while (true) {
       if (s3fifo->small->used_size > s3fifo->small->size) {
           object_t *obj = atomic_evict(s3fifo->small_head, s3fifo->small_tail);
           // if the object is not accessed, evict it
           // accessed object is promoted to main queue
           if (!obj->accessed) {
               atomic_insert(obj, s3fifo->ghost_head);
           }
       } else {
           object_t *obj = atomic_evict(s3fifo->main_head, s3fifo->main_tail);
       }


       // insert the object back to the main queue if it is accessed
       if (obj->accessed) {
           obj->accessed = false;
           atomic_insert(obj, s3fifo->main_head);
       } else {
           break;
       }
   }
}

Performance results

I will update this when I have time.

More optimizations

Even though FIFO queues are implemented without locking, the CAS operations often incur cache coherence traffic and delay under high contention. In my experience it can become a scalability bottleneck when we want to scale to more than 16 threads. See the discussions here, here, and here.

Note that most of the simple benchmarks compare mutex and atomics using simple operation, however, the critical section in real system is much larger causing mutex to be more expensive than atomics.

One optimization to improve atomics is batching where each eviction evicts multiple objects and inserted objects are added to the head in a batch. This can reduce the amount of cache coherence traffic significantly.

Conclusion

In this post, I demonstrated how to implement FIFO and S3-FIFO without using any lock. The key idea is to atomically update the head and tail pointers. Lock-free implementation is not only useful for S3-FIFO, but also useful for other applications, for example, netBSD has an implementation of lock-free FIFO queue.

EDIT: There is a great paper on implementing lock-free FIFO queue using a linked list. See here.