
Spinlocks and Read-Write Locks

Most parallel programming will in some way involve the use of locking at the lowest levels. Locks are primitives that provide mutual exclusion, allowing data structures to remain in consistent states. Without locking, multiple threads of execution may simultaneously modify a data structure. Without a carefully thought out (and usually complex) lock-free algorithm, the result is usually a crash or hang as unintended program states are entered. Since the creation of a lock-free algorithm is extremely difficult, most programs use locks.

If updating a data structure is slow, the lock of choice is a mutex of some kind. These will transfer control to the operating system when they block. This allows another thread to run, and perhaps make progress whilst the first thread sleeps. This transfer of control consists of a pair of context switches, which are quite a slow operation. Thus, if the lock-hold time is expected to be short, then this may not be the fastest method.

Spinlocks

Instead of context switches, a spinlock will "spin", and repeatedly check to see if the lock is unlocked. Spinning is very fast, so the latency between an unlock-lock pair is small. However, spinning doesn't accomplish any work, so may not be as efficient as a sleeping mutex if the time spent becomes significant.

Before we describe the implementation of spin locks, we first need a set of atomic primitives. Fortunately, gcc provides some of these as built-in functions:


#define atomic_xadd(P, V) __sync_fetch_and_add((P), (V))
#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))
#define atomic_inc(P) __sync_add_and_fetch((P), 1)
#define atomic_dec(P) __sync_add_and_fetch((P), -1) 
#define atomic_add(P, V) __sync_add_and_fetch((P), (V))
#define atomic_set_bit(P, V) __sync_or_and_fetch((P), 1<<(V))
#define atomic_clear_bit(P, V) __sync_and_and_fetch((P), ~(1<<(V)))
Unfortunately, we will require a few others that are not provided as built-ins, and so must be implemented in assembly:

/* Compile read-write barrier */
#define barrier() asm volatile("": : :"memory")

/* Pause instruction to prevent excess processor bus usage */ 
#define cpu_relax() asm volatile("pause\n": : :"memory")

/* Atomic exchange (of various sizes) */
static inline void *xchg_64(void *ptr, void *x)
{
	__asm__ __volatile__("xchgq %0,%1"
				:"=r" ((unsigned long long) x)
				:"m" (*(volatile long long *)ptr), "0" ((unsigned long long) x)
				:"memory");

	return x;
}

static inline unsigned xchg_32(void *ptr, unsigned x)
{
	__asm__ __volatile__("xchgl %0,%1"
				:"=r" ((unsigned) x)
				:"m" (*(volatile unsigned *)ptr), "0" (x)
				:"memory");

	return x;
}

static inline unsigned short xchg_16(void *ptr, unsigned short x)
{
	__asm__ __volatile__("xchgw %0,%1"
				:"=r" ((unsigned short) x)
				:"m" (*(volatile unsigned short *)ptr), "0" (x)
				:"memory");

	return x;
}

/* Test and set a bit */
static inline char atomic_bitsetandtest(void *ptr, int x)
{
	char out;
	__asm__ __volatile__("lock; bts %2,%1\n"
						"sbb %0,%0\n"
				:"=r" (out), "=m" (*(volatile long long *)ptr)
				:"Ir" (x)
				:"memory");

	return out;
}

A spinlock can be implemented in an obvious way, using the atomic exchange primitive.


#define EBUSY 1
typedef unsigned spinlock;

static void spin_lock(spinlock *lock)
{
	while (1)
	{
		if (!xchg_32(lock, EBUSY)) return;
	
		while (*lock) cpu_relax();
	}
}

static void spin_unlock(spinlock *lock)
{
	barrier();
	*lock = 0;
}

static int spin_trylock(spinlock *lock)
{
	return xchg_32(lock, EBUSY);
}

So how fast is the above code? A simple benchmark to test the overhead of a lock is to have a given number of threads attempting to lock and unlock it, doing a fixed amount of work each time. If the total number of lock-unlock pairs is kept constant as the number of threads is increased, it is possible to measure the effect of contention on performance. A good spinlock implementation will be as fast as possible for any given number of threads attempting to use that lock simultaneously.
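The article does not show the benchmark harness itself, so the following is only a rough sketch of the kind of loop described, using pthreads and the spinlock above. The iteration count and the amount of work done while holding the lock are illustrative values, not the ones used to produce the timings below.

#include <pthread.h>
#include <stdlib.h>

#define TOTAL_PAIRS	(1 << 24)	/* total lock-unlock pairs, split over all threads */

static spinlock test_lock = 0;
static int nthreads = 4;

static void *bench_thread(void *arg)
{
	int i, j;
	int iters = TOTAL_PAIRS / nthreads;

	(void) arg;

	for (i = 0; i < iters; i++)
	{
		spin_lock(&test_lock);

		/* A small, fixed amount of work done while holding the lock */
		for (j = 0; j < 16; j++) barrier();

		spin_unlock(&test_lock);
	}

	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[64];
	int i;

	if (argc > 1) nthreads = atoi(argv[1]);
	if ((nthreads < 1) || (nthreads > 64)) nthreads = 4;

	for (i = 0; i < nthreads; i++)
	{
		pthread_create(&tid[i], NULL, bench_thread, NULL);
	}

	for (i = 0; i < nthreads; i++)
	{
		pthread_join(tid[i], NULL);
	}

	return 0;
}

Timing the whole run (e.g. with the time command) for different thread counts, while keeping TOTAL_PAIRS constant, gives numbers comparable to the tables in this article.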

The results for the above spinlock implementation are:

Threads     1     2     3     4     5
Time (s)    5.5   5.6   5.7   5.7   5.7

These results are pretty good, but can be improved. The problem is that if there are multiple threads contending, then they all attempt to take the lock at the same time once it is released. This results in a huge amount of processor bus traffic, which is a huge performance killer. Thus, if we somehow order the lock-takers so that they know who is next in line for the resource we can vastly reduce the amount of bus traffic.

One spinlock algorithm that does this is called the MCS lock. This uses a list to maintain the order of acquirers.


typedef struct mcs_lock_t mcs_lock_t;
struct mcs_lock_t
{
	mcs_lock_t *next;
	int spin;
};
typedef struct mcs_lock_t *mcs_lock;

static void lock_mcs(mcs_lock *m, mcs_lock_t *me)
{
	mcs_lock_t *tail;
	
	me->next = NULL;
	me->spin = 0;

	tail = xchg_64(m, me);
	
	/* No one there? */
	if (!tail) return;

	/* Someone there, need to link in */
	tail->next = me;

	/* Make sure we do the above setting of next. */
	barrier();
	
	/* Spin on my spin variable */
	while (!me->spin) cpu_relax();
	
	return;
}

static void unlock_mcs(mcs_lock *m, mcs_lock_t *me)
{
	/* No successor yet? */
	if (!me->next)
	{
		/* Try to atomically unlock */
		if (cmpxchg(m, me, NULL) == me) return;
	
		/* Wait for successor to appear */
		while (!me->next) cpu_relax();
	}

	/* Unlock next one */
	me->next->spin = 1;	
}

static int trylock_mcs(mcs_lock *m, mcs_lock_t *me)
{
	mcs_lock_t *tail;
	
	me->next = NULL;
	me->spin = 0;
	
	/* Try to lock */
	tail = cmpxchg(m, NULL, me);
	
	/* No one was there - can quickly return */
	if (!tail) return 0;
	
	return EBUSY;
}

This has quite different timings:

Threads     1     2     3     4     5
Time (s)    3.6   4.4   4.5   4.8   >1min

The MCS lock takes a hugely longer time when the number of threads is greater than the number of processors (four in this case). This is because if the next thread in the queue isn't active when the lock is unlocked, then everyone must wait until the operating system scheduler decides to run it. Every "fair" lock algorithm has this problem. Thus, the simple unfair spinlock still can be quite useful when you don't know that the number of threads is bounded by the number of cpus.

A bigger problem with the MCS lock is its API. It requires a second structure to be passed in addition to the address of the lock. The algorithm uses this second structure to store the information which describes the queue of threads waiting for the lock. Unfortunately, most code written using spinlocks doesn't have this extra information, so the fact that the MCS algorithm isn't a drop-in replacement to a standard spin lock is a problem.
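For illustration, a sketch of what calling code has to look like (the counter_lock and counter names are invented for this example); every lock-unlock pair must thread the same queue node through both calls:

static mcs_lock counter_lock = NULL;
static int counter;

static void increment_counter(void)
{
	mcs_lock_t node;	/* per-acquisition queue node, typically on the stack */

	lock_mcs(&counter_lock, &node);

	counter++;

	unlock_mcs(&counter_lock, &node);	/* must be the same node given to lock_mcs() */
}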

An IBM working group found a way to improve the MCS algorithm to remove the need to pass the extra structure as a parameter. Instead, on-stack information is used. The result is the K42 lock algorithm:


typedef struct k42lock k42lock;
struct k42lock
{
	k42lock *next;
	k42lock *tail;
};

static void k42_lock(k42lock *l)
{
	k42lock me;
	k42lock *pred, *succ;
	me.next = NULL;
	
	barrier();
	
	pred = xchg_64(&l->tail, &me);
	if (pred)
	{
		me.tail = (void *) 1;
		
		barrier();
		pred->next = &me;
		barrier();
		
		while (me.tail) cpu_relax();
	}
	
	succ = me.next;

	if (!succ)
	{
		barrier();
		l->next = NULL;
		
		if (cmpxchg(&l->tail, &me, &l->next) != &me)
		{
			while (!me.next) cpu_relax();
			
			l->next = me.next;
		}
	}
	else
	{
		l->next = succ;
	}
}


static void k42_unlock(k42lock *l)
{
	k42lock *succ = l->next;
	
	barrier();
	
	if (!succ)
	{
		if (cmpxchg(&l->tail, &l->next, NULL) == (void *) &l->next) return;
		
		while (!l->next) cpu_relax();
		succ = l->next;
	}
	
	succ->tail = NULL;
}

static int k42_trylock(k42lock *l)
{
	if (!cmpxchg(&l->tail, NULL, &l->next)) return 0;
	
	return EBUSY;
}

The timings of the K42 algorithm are as good as, if not better than, those of the MCS lock:

Threads     1     2     3     4     5
Time (s)    3.7   4.8   4.5   4.9   >1min

Unfortunately, the K42 algorithm has another problem: it appears that it may be patented by IBM. Thus it cannot be used either (without, perhaps, paying royalties to IBM).

One way around this is to use a different type of list. The K42 and MCS locks use lists ordered so that finding the next thread to run is easy, and adding to the end is hard. What about flipping the direction of the pointers so that finding the end is easy, and finding who's next is hard? The result is the following algorithm:


typedef struct listlock_t listlock_t;
struct listlock_t
{
	listlock_t *next;
	int spin;
};
typedef struct listlock_t *listlock;

#define LLOCK_FLAG	(void *)1

static void listlock_lock(listlock *l)
{
	listlock_t me;
	listlock_t *tail;

	/* Fast path - no users  */
	if (!cmpxchg(l, NULL, LLOCK_FLAG)) return;
	
	me.next = LLOCK_FLAG;
	me.spin = 0;
	
	/* Convert into a wait list */
	tail = xchg_64(l, &me);
	
	if (tail)
	{
		/* Add myself to the list of waiters */
		if (tail == LLOCK_FLAG) tail = NULL;
		me.next = tail;
			
		/* Wait for being able to go */
		while (!me.spin) cpu_relax();
		
		return;
	}
	
	/* Try to convert to an exclusive lock */
	if (cmpxchg(l, &me, LLOCK_FLAG) == &me) return;
		
	/* Failed - there is now a wait list */
	tail = *l;
		
	/* Scan to find who is after me */
	while (1)
	{
		/* Wait for them to enter their next link */
		while (tail->next == LLOCK_FLAG) cpu_relax();
			
		if (tail->next == &me)
		{
			/* Fix their next pointer */
			tail->next = NULL;
			
			return;
		}
			
		tail = tail->next;
	}
}

static void listlock_unlock(listlock *l)
{
	listlock_t *tail;
	listlock_t *tp;
	
	while (1)
	{
		tail = *l;
		
		barrier();
	
		/* Fast path */
		if (tail == LLOCK_FLAG)
		{
			if (cmpxchg(l, LLOCK_FLAG, NULL) == LLOCK_FLAG) return;
		
			continue;
		}
				
		tp = NULL;
		
		/* Wait for partially added waiter */
		while (tail->next == LLOCK_FLAG) cpu_relax();
		
		/* There is a wait list */
		if (tail->next) break;
		
		/* Try to convert to a single-waiter lock */
		if (cmpxchg(l, tail, LLOCK_FLAG) == tail)
		{
			/* Unlock */
			tail->spin = 1;
				
			return;
		}
			
		cpu_relax();
	}
		
	/* A long list */
	tp = tail;
	tail = tail->next;
	
	/* Scan wait list */
	while (1)
	{
		/* Wait for partially added waiter */
		while (tail->next == LLOCK_FLAG) cpu_relax();
			
		if (!tail->next) break;
			
		tp = tail;
		tail = tail->next;
	}
	
	tp->next = NULL;
		
	barrier();
	
	/* Unlock */
	tail->spin = 1;
}

static int listlock_trylock(listlock *l)
{
	/* Simple part of a spin-lock */
	if (!cmpxchg(l, NULL, LLOCK_FLAG)) return 0;
	
	/* Failure! */
	return EBUSY;
}

This unfortunately is extremely complex, and doesn't perform well either:

Threads     1     2     3     4     5
Time (s)    3.6   5.1   5.8   6.3   >1min

It is still faster than the standard spinlock when contention is low, but once more than two threads are attempting to lock at the same time it is worse, and gets slower from there on.

Another possible trick is to use a spinlock within a spinlock. The first lock can be very lightweight since we know it will only be held for a short time. It can then control the locking for the wait list describing the acquirers of the real spinlock. If done right, the number of waiters on the sub-lock can be kept low, thus minimizing bus traffic. The result is:


typedef struct bitlistlock_t bitlistlock_t;
struct bitlistlock_t
{
	bitlistlock_t *next;
	int spin;
};

typedef bitlistlock_t *bitlistlock;

#define BLL_USED	((bitlistlock_t *) -2LL)

static void bitlistlock_lock(bitlistlock *l)
{
	bitlistlock_t me;
	bitlistlock_t *tail;
	
	/* Grab control of list */
	while (atomic_bitsetandtest(l, 0)) cpu_relax();
	
	/* Remove locked bit */
	tail = (bitlistlock_t *) ((uintptr_t) *l & ~1LL);
	
	/* Fast path, no waiters */
	if (!tail)
	{
		/* Set to be a flag value */
		*l = BLL_USED;
		return;
	}
	
	if (tail == BLL_USED) tail = NULL;
	me.next = tail;
	me.spin = 0;
	
	barrier();
	
	/* Unlock, and add myself to the wait list */
	*l = &me;
	
	/* Wait for the go-ahead */
	while (!me.spin) cpu_relax();
}

static void bitlistlock_unlock(bitlistlock *l)
{
	bitlistlock_t *tail;
	bitlistlock_t *tp;
	
	/* Fast path - no wait list */
	if (cmpxchg(l, BLL_USED, NULL) == BLL_USED) return;
	
	/* Grab control of list */
	while (atomic_bitsetandtest(l, 0)) cpu_relax();
	
	tp = *l;
	
	barrier();
	
	/* Get end of list */
	tail = (bitlistlock_t *) ((uintptr_t) tp & ~1LL);
	
	/* Actually no users? */
	if (tail == BLL_USED)
	{
		barrier();
		*l = NULL;
		return;
	}
	
	/* Only one entry on wait list? */
	if (!tail->next)
	{
		barrier();
		
		/* Unlock bitlock */
		*l = BLL_USED;
		
		barrier();
		
		/* Unlock lock */
		tail->spin = 1;
		
		return;
	}
	
	barrier();

	/* Unlock bitlock */
	*l = tail;
	
	barrier();
		
	/* Scan wait list for start */
	do
	{
		tp = tail;
		tail = tail->next;
	}
	while (tail->next);
	
	tp->next = NULL;
	
	barrier();
	
	/* Unlock */
	tail->spin = 1;
}

static int bitlistlock_trylock(bitlistlock *l)
{
	if (!*l && (cmpxchg(l, NULL, BLL_USED) == NULL)) return 0;
	
	return EBUSY;
}

Unfortunately, this is even worse than the previous listlock algorithm. This is only good for the uncontended case.

Threads     1     2     3     4     5
Time (s)    3.6   5.3   6.3   6.8   >1min

Another possibility is to modify some other type of locking algorithm to be a spinlock. The read-write locks from ReactOS are designed to scale extremely well. If the "read" part of them is removed, then the mutual exclusion between the writers will act just like a spinlock. Doing this yields:


/* Bit-lock for editing the wait block */
#define SLOCK_LOCK			 	1
#define SLOCK_LOCK_BIT			0

/* Has an active user */
#define SLOCK_USED				2

#define SLOCK_BITS				3

typedef struct slock slock;
struct slock
{
	uintptr_t p;
};

typedef struct slock_wb slock_wb;
struct slock_wb
{
	/*
	 * last points to the last wait block in the chain.
	 * The value is only valid when read from the first wait block.
	 */
	slock_wb *last;

	/* next points to the next wait block in the chain. */
	slock_wb *next;
		
	/* Wake up? */
	int wake;
};

/* Wait for control of wait block */
static slock_wb *slockwb(slock *s)
{
	uintptr_t p;

	/* Spin on the wait block bit lock */
	while (atomic_bitsetandtest(&s->p, SLOCK_LOCK_BIT))
	{
		cpu_relax();
	}

	p = s->p;

	if (p <= SLOCK_BITS)
	{
		/* Oops, looks like the wait block was removed. */
		atomic_dec(&s->p);
		return NULL;
	}

	return (slock_wb *)(p - SLOCK_LOCK);
}

static void slock_lock(slock *s)
{
	slock_wb swblock;
	
	/* Fastpath - no other readers or writers */
	if (!s->p && (cmpxchg(&s->p, 0, SLOCK_USED) == 0)) return;
	
	/* Initialize wait block */
	swblock.next = NULL;
	swblock.last = &swblock;
	swblock.wake = 0;

	while (1)
	{
		uintptr_t p = s->p;
			
		cpu_relax();
		
		/* Fastpath - no other readers or writers */
		if (!p)
		{
			if (cmpxchg(&s->p, 0, SLOCK_USED) == 0) return;
			continue;
		}
		
		if (p > SLOCK_BITS)
		{
			slock_wb *first_wb, *last;

			first_wb = slockwb(s);
			if (!first_wb) continue;
					
			last = first_wb->last;
			last->next = &swblock;
			first_wb->last = &swblock;
			
			/* Unlock */
			barrier();
			s->p &= ~SLOCK_LOCK;

			break;
		}

		/* Try to add the first wait block */
		if (cmpxchg(&s->p, p, (uintptr_t)&swblock) == p) break;
	}
	
	/* Wait to acquire exclusive lock */
	while (!swblock.wake) cpu_relax();
}


static void slock_unlock(slock *s)
{
	slock_wb *next;
	slock_wb *wb;
	uintptr_t np;
	
	while (1)
	{
		uintptr_t p = s->p;
		
		/* This is the fast path, we can simply clear the SLOCK_USED bit. */
		if (p == SLOCK_USED)
		{
			if (cmpxchg(&s->p, SLOCK_USED, 0) == SLOCK_USED) return;
			continue;
		}
	
		/* There's a wait block, we need to wake the next pending user */
		wb = slockwb(s);
		if (wb) break;
		
		cpu_relax();
	}
			
	next = wb->next;
	if (next)
	{
		/*
		 * There's more blocks chained, we need to update the pointers
		 * in the next wait block and update the wait block pointer.
		 */
		np = (uintptr_t) next;
	
		next->last = wb->last;
	}
	else
	{
		/* Convert the lock to a simple lock. */
		np = SLOCK_USED;
	}

	barrier();
	/* Also unlocks lock bit */
	s->p = np;
	barrier();

	/* Notify the next waiter */
	wb->wake = 1;

	/* We released the lock */
}

static int slock_trylock(slock *s)
{
	/* No other readers or writers? */
	if (!s->p && (cmpxchg(&s->p, 0, SLOCK_USED) == 0)) return 0;
	
	return EBUSY;
}

Again, this algorithm disappoints. The results are similar to the bitlistlock algorithm. This isn't surprising, as the wait-block that controls the waiter list is synchronized by a bit lock.

Threads     1     2     3     4     5
Time (s)    3.7   5.1   5.8   6.5   >1min

Time to think laterally. One of the problems with the above algorithms is synchronization of the wait list. The core issue is that we need some way to recognize the head and tail of that list. The head of the list is needed to add a new waiter. The tail is needed to decide who is to go next. The MCS lock used the extra structure information so that the list tail could be quickly found. The K42 Lock used the patented method of storing the tail in a second list pointer within the lock itself.

There is another trick we can do though. If the extra information is allocated on the stack, then it may be possible to recognize that a pointer is pointing within our own stack frame. If so, then we can use that information within the algorithm to decide where the wait list ends. The result is the stack-lock algorithm:


typedef struct stlock_t stlock_t;
struct stlock_t
{
	stlock_t *next;
};

typedef struct stlock_t *stlock;

static __attribute__((noinline)) void stlock_lock(stlock *l)
{
	stlock_t *me = NULL;
	
	barrier();
	me = xchg_64(l, &me);
	
	/* Wait until we get the lock */
	while (me) cpu_relax();
}

#define MAX_STACK_SIZE	(1<<12)

static __attribute__((noinline)) int on_stack(void *p)
{
	int x;
	
	uintptr_t u = (uintptr_t) &x;
	
	return ((u - (uintptr_t)p + MAX_STACK_SIZE) < MAX_STACK_SIZE * 2);
}

static __attribute__((noinline)) void stlock_unlock(stlock *l)
{
	stlock_t *tail = *l;
	barrier();
		
	/* Fast case */
	if (on_stack(tail))
	{
		/* Try to remove the wait list */
		if (cmpxchg(l, tail, NULL) == tail) return;
		
		tail = *l;
	}
	
	/* Scan wait list */
	while (1)
	{
		/* Wait for partially added waiter */
		while (!tail->next) cpu_relax();
		
		if (on_stack(tail->next)) break;
		
		tail = tail->next;
	}
		
	barrier();
	
	/* Unlock */
	tail->next = NULL;
}

static int stlock_trylock(stlock *l)
{
	stlock_t me;
	
	if (!cmpxchg(l, NULL, &me)) return 0;
	
	return EBUSY;
}

This algorithm is quite a bit simpler if you know that a thread's stack is aligned a certain way. (Then the stack-check turns into an XOR and a mask operation.) Unfortunately, it is still quite slow.

Threads     1     2     3     4     5
Time (s)    3.6   5.3   5.7   6.2   >1min
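To illustrate the stack-alignment shortcut mentioned above: if every thread stack is known to occupy its own power-of-two aligned region, then checking whether a pointer lies on the current stack reduces to an XOR and a mask. This is only a sketch; the STACK_ALIGN value is an assumed per-thread stack size and alignment, not something from the article.

#define STACK_ALIGN	(1 << 20)	/* assumed power-of-two stack size and alignment */

static inline int on_my_stack(void *p)
{
	int x;

	/* Two addresses share the same aligned region exactly when their
	   high-order bits match, i.e. the XOR leaves no bits above the mask. */
	return !(((uintptr_t) &x ^ (uintptr_t) p) & ~((uintptr_t) STACK_ALIGN - 1));
}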

The lock operation above looks to be fairly efficient; it is the unlock routine that is slow and complex. Perhaps if we save a little more information within the lock itself, then the unlock operation can be made faster. Since quite a bit of time seems to be spent finding the node previous to ourselves (which is the one to wake up), it might be better to do that while we are spinning waiting for our turn to take the lock. If we save this previous pointer within the lock, we will not need to calculate it within the unlock routine.


typedef struct plock_t plock_t;
struct plock_t
{
	plock_t *next;
};

typedef struct plock plock;
struct plock
{
	plock_t *next;
	plock_t *prev;
	plock_t *last;
};

static void plock_lock(plock *l)
{
	plock_t *me = NULL;
	plock_t *prev;
	
	barrier();
	me = xchg_64(l, &me);
	
	prev = NULL;
	
	/* Wait until we get the lock */
	while (me)
	{
		/* Scan wait list for my previous */
		if (l->next != (plock_t *) &me)
		{
			plock_t *t = l->next;
			
			while (me)
			{
				if (t->next == (plock_t *) &me)
				{
					prev = t;
				
					while (me) cpu_relax();
					
					goto done;
				}
				
				if (t->next) t = t->next;
				cpu_relax();
			}
		}
		cpu_relax();
	}
	
done:	
	l->prev = prev;
	l->last = (plock_t *) &me;
}

static void plock_unlock(plock *l)
{
	plock_t *tail;
	
	/* Do I know my previous? */
	if (l->prev)
	{
		/* Unlock */
		l->prev->next = NULL;
		return;
	}
	
	tail = l->next;
	barrier();
	
	/* Fast case */
	if (tail == l->last)
	{
		/* Try to remove the wait list */
		if (cmpxchg(&l->next, tail, NULL) == tail) return;
		
		tail = l->next;
	}
	
	/* Scan wait list */
	while (1)
	{
		/* Wait for partially added waiter */
		while (!tail->next) cpu_relax();
		
		if (tail->next == l->last) break;
		
		tail = tail->next;
	}
		
	barrier();
	
	/* Unlock */
	tail->next = NULL;
}

static int plock_trylock(plock *l)
{
	plock_t me;
	
	if (!cmpxchg(&l->next, NULL, &me))
	{
		l->last = &me;
		return 0;
	}
	
	return EBUSY;
}

This starts regaining some of the speed we have lost, but still isn't quite as good as the K42 algorithm. (It is, however, always faster than the original naive spinlock provided that the number of threads is less than the number of processors.)

Threads     1     2     3     4     5
Time (s)    3.7   5.1   5.3   5.4   >1min

A careful reading of the plock algorithm shows that it can be improved even more. We don't actually need to know the pointer value of the next waiter. Some other unique value will do instead. Instead of saving a pointer, we can use a counter that we increment. If a waiter knows which counter value corresponds to its turn, then it just needs to wait until that value appears. The result is called the ticket lock algorithm:


typedef union ticketlock ticketlock;

union ticketlock
{
	unsigned u;
	struct
	{
		unsigned short ticket;
		unsigned short users;
	} s;
};

static void ticket_lock(ticketlock *t)
{
	unsigned short me = atomic_xadd(&t->s.users, 1);
	
	while (t->s.ticket != me) cpu_relax();
}

static void ticket_unlock(ticketlock *t)
{
	barrier();
	t->s.ticket++;
}

static int ticket_trylock(ticketlock *t)
{
	unsigned short me = t->s.users;
	unsigned short menew = me + 1;
	unsigned cmp = ((unsigned) me << 16) + me;
	unsigned cmpnew = ((unsigned) menew << 16) + me;

	if (cmpxchg(&t->u, cmp, cmpnew) == cmp) return 0;
	
	return EBUSY;
}

static int ticket_lockable(ticketlock *t)
{
	ticketlock u = *t;
	barrier();
	return (u.s.ticket == u.s.users);
}

The above algorithm is extremely fast, and beats all the other fair locks described.

Threads     1     2     3     4     5
Time (s)    3.6   4.4   4.5   4.8   >1min

In fact, this is the spinlock algorithm used in the Linux kernel, although for extra speed, the kernel version is written in assembly language rather than the semi-portable C shown above. Also note that the above code depends on the endianness of the computer architecture. It is designed for little-endian machines. Big endian processors will require a swap of the two fields within the structure in the union.
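As a sketch (assuming a GCC recent enough to define the __BYTE_ORDER__ macros), an endian-aware version of the union could simply swap the two fields, leaving the lock, unlock and trylock routines unchanged:

union ticketlock
{
	unsigned u;
	struct
	{
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
		unsigned short users;	/* occupies the high-order half of u */
		unsigned short ticket;
#else
		unsigned short ticket;	/* occupies the low-order half of u */
		unsigned short users;
#endif
	} s;
};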

The ticket lock shows that an oft-repeated claim is not quite true. Many of the above fair-lock algorithms are meant to scale well because the waiters spin on different memory locations. This is meant to reduce bus traffic and thus increase performance. However, it appears that this effect is small. The more important thing is to make sure that the waiters are ordered by who gets to take the lock next. This is what the ticket lock does admirably. The fact that multiple waiters are spinning on the same ticket lock location does not seem to be a performance drain.

Read Write Locks

Quite often, some users of a data structure will make no modifications to it. They just require read access to its fields to do their work. If multiple threads require read access to the same data, there is no reason why they should not be able to execute simultaneously. Spinlocks don't differentiate between read and read/write access. Thus spinlocks do not exploit this potential parallelism. To do so, read-write locks are required.

The simplest read-write lock uses a spinlock to control write access, and a counter field for the readers.


typedef struct dumbrwlock dumbrwlock;
struct dumbrwlock
{
	spinlock lock;
	unsigned readers;
};

static void dumb_wrlock(dumbrwlock *l)
{
	/* Get write lock */
	spin_lock(&l->lock);
	
	/* Wait for readers to finish */
	while (l->readers) cpu_relax();
}

static void dumb_wrunlock(dumbrwlock *l)
{
	spin_unlock(&l->lock);
}

static int dumb_wrtrylock(dumbrwlock *l)
{
	/* Want no readers */
	if (l->readers) return EBUSY;
	
	/* Try to get write lock */
	if (spin_trylock(&l->lock)) return EBUSY;
	
	if (l->readers)
	{
		/* Oops, a reader started */
		spin_unlock(&l->lock);
		return EBUSY;
	}
	
	/* Success! */
	return 0;
}

static void dumb_rdlock(dumbrwlock *l)
{
	while (1)
	{
		/* Speculatively take read lock */
		atomic_inc(&l->readers);
		
		/* Success? */
		if (!l->lock) return;
		
		/* Failure - undo, and wait until we can try again */
		atomic_dec(&l->readers);
		while (l->lock) cpu_relax();
	}
}

static void dumb_rdunlock(dumbrwlock *l)
{
	atomic_dec(&l->readers);
}

static int dumb_rdtrylock(dumbrwlock *l)
{
	/* Speculatively take read lock */
	atomic_inc(&l->readers);
		
	/* Success? */
	if (!l->lock) return 0;
	
	/* Failure - undo */
	atomic_dec(&l->readers);
	
	return EBUSY;
}

static int dumb_rdupgradelock(dumbrwlock *l)
{
	/* Try to convert into a write lock */
	if (spin_trylock(&l->lock)) return EBUSY;
	
	/* I'm no longer a reader */
	atomic_dec(&l->readers);
	
	/* Wait for all other readers to finish */
	while (l->readers) cpu_relax();
	
	return 0;
}

To benchmark the above code, we need a little more information than in the spinlock case. The fraction of readers is important: the more readers, the more parallelism we should get, and the faster the code should run. It is also important to have a random distribution of readers and writers, just like in real-world situations, so a parallel random number generator is used. By selecting a random byte, and choosing 1, 25, 128, or 250 out of 256 possibilities to be a writer, we can explore the mostly-reader case through to the case where most users of the lock are writers. Finally, it is important to find out the effects of contention. In general, read-write locks tend to be used where contention is high, so we will mostly look at the case where the number of threads is equal to the number of processors.
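A sketch of how each iteration might choose between reading and writing follows; the rand_byte() helper here is a stand-in for whatever per-thread random number generator is actually used, and WRITE_THRESHOLD is one of 1, 25, 128 or 250.

/* Minimal per-thread xorshift generator; in real use each thread
   should be given a different seed. */
static __thread unsigned rng_state = 0x12345678;

static unsigned char rand_byte(void)
{
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 17;
	rng_state ^= rng_state << 5;
	return rng_state & 0xff;
}

#define WRITE_THRESHOLD	25	/* writers per 256 iterations */

static void rw_iteration(dumbrwlock *l)
{
	if (rand_byte() < WRITE_THRESHOLD)
	{
		dumb_wrlock(l);
		/* ... modify the shared data ... */
		dumb_wrunlock(l);
	}
	else
	{
		dumb_rdlock(l);
		/* ... read the shared data ... */
		dumb_rdunlock(l);
	}
}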

The dumb lock above performs fairly poorly when there is no contention. If one thread is used we get:

Writers per 256   1     25    128   250
Time (s)          3.7   3.8   4.6   5.4

As expected, we asymptote to the relatively slow timings of the standard spinlock algorithm as the write fraction increases. If there is contention, however, the dumb lock actually performs quite well. Using four threads:

Writers per 256   1     25    128   250
Time (s)          1.1   1.9   4.4   5.7

The obvious thing to do to try to gain speed would be to replace the slow spinlock with a ticketlock algorithm. If this is done, we have:


typedef struct dumbtrwlock dumbtrwlock;
struct dumbtrwlock
{
	ticketlock lock;
	unsigned readers;
};

static void dumbt_wrlock(dumbtrwlock *l)
{
	/* Get lock */
	ticket_lock(&l->lock);
	
	/* Wait for readers to finish */
	while (l->readers) cpu_relax();
}

static void dumbt_wrunlock(dumbtrwlock *l)
{
	ticket_unlock(&l->lock);
}

static int dumbt_wrtrylock(dumbtrwlock *l)
{
	/* Want no readers */
	if (l->readers) return EBUSY;
	
	/* Try to get write lock */
	if (ticket_trylock(&l->lock)) return EBUSY;
	
	if (l->readers)
	{
		/* Oops, a reader started */
		ticket_unlock(&l->lock);
		return EBUSY;
	}
	
	/* Success! */
	return 0;
}

static void dumbt_rdlock(dumbtrwlock *l)
{
	while (1)
	{
		/* Success? */
		if (ticket_lockable(&l->lock))
		{
			/* Speculatively take read lock */
			atomic_inc(&l->readers);
		
			/* Success? */
			if (ticket_lockable(&l->lock)) return;
		
			/* Failure - undo, and wait until we can try again */
			atomic_dec(&l->readers);
		}
		
		while (!ticket_lockable(&l->lock)) cpu_relax();
	}
}

static void dumbt_rdunlock(dumbtrwlock *l)
{
	atomic_dec(&l->readers);
}

static int dumbt_rdtrylock(dumbtrwlock *l)
{
	/* Speculatively take read lock */
	atomic_inc(&l->readers);
		
	/* Success? */
	if (ticket_lockable(&l->lock)) return 0;
	
	/* Failure - undo */
	atomic_dec(&l->readers);
	
	return EBUSY;
}

static int dumbt_rdupgradelock(dumbtrwlock *l)
{
	/* Try to convert into a write lock */
	if (ticket_trylock(&l->lock)) return EBUSY;
	
	/* I'm no longer a reader */
	atomic_dec(&l->readers);
	
	/* Wait for all other readers to finish */
	while (l->readers) cpu_relax();
	
	return 0;
}

This performs much better in the uncontended case, taking 3.7 seconds for all write fractions. Surprisingly, though, it doesn't win outright in the contended case:

Writers per 256   1     25    128   250
Time (s)          2.0   2.5   3.7   4.5

This is slower for low write fractions, and faster for large write fractions. Since most of the time we use a read-write lock when the write fraction is low, this is really bad for this algorithm, which can be twice as slow as its competitor.

To try to reduce contention, and to gain speed, let's explore the rather complex algorithm used in ReactOS to emulate Microsoft Windows' slim read-write (SRW) locks. This uses a wait list, with a bit-lock to control access to the wait list data structure. It is designed so that waiters will spin on separate memory locations for extra scalability.


/* Have a wait block */
#define SRWLOCK_WAIT					1

/* Users are readers */
#define SRWLOCK_SHARED					2

/* Bit-lock for editing the wait block */
#define SRWLOCK_LOCK					4
#define SRWLOCK_LOCK_BIT				2

/* Mask for the above bits */
#define SRWLOCK_MASK					7

/* Number of current users * 8 */
#define SRWLOCK_USERS					8

/* Boolean values used in the wait blocks below */
#ifndef TRUE
#define TRUE	1
#define FALSE	0
#endif

typedef struct srwlock srwlock;
struct srwlock
{
	uintptr_t p;
};

typedef struct srw_sw srw_sw;
struct srw_sw
{
	uintptr_t spin;
	srw_sw *next;
};

typedef struct srw_wb srw_wb;
struct srw_wb
{
	/* s_count is the number of shared acquirers * SRWLOCK_USERS. */
	uintptr_t s_count;

	/* Last points to the last wait block in the chain. The value
	   is only valid when read from the first wait block. */
	srw_wb *last;

	/* Next points to the next wait block in the chain. */
	srw_wb *next;

	/* The wake chain is only valid for shared wait blocks */
	srw_sw *wake;
	srw_sw *last_shared;

	int ex;
};

/* Wait for control of wait block */
static srw_wb *lock_wb(srwlock *l)
{
	uintptr_t p;
	
	/* Spin on the wait block bit lock */
	while (atomic_bitsetandtest(&l->p, SRWLOCK_LOCK_BIT)) cpu_relax();

	p = l->p;
	barrier();

	if (!(p & SRWLOCK_WAIT))
	{
		/* Oops, looks like the wait block was removed. */
		atomic_clear_bit(&l->p, SRWLOCK_LOCK_BIT);
		return NULL;
	}

	return (srw_wb *)(p & ~SRWLOCK_MASK);
}

static void srwlock_init(srwlock *l)
{
	l->p = 0;
}

static void srwlock_rdlock(srwlock *l)
{
	srw_wb swblock;
	srw_sw sw;
	uintptr_t p;
	srw_wb *wb, *shared;

	while (1)
	{
		barrier();
		p = l->p;
	
		cpu_relax();
	
		if (!p)
		{
			/* This is a fast path, we can simply try to set the shared count to 1 */
			if (!cmpxchg(&l->p, 0, SRWLOCK_USERS | SRWLOCK_SHARED)) return;
			
			continue;
		}
		
		/* Don't interfere with locking */
		if (p & SRWLOCK_LOCK) continue;

		if (p & SRWLOCK_SHARED)
		{
			if (!(p & SRWLOCK_WAIT))
			{
				/* This is a fast path, just increment the number of current shared locks */
				if (cmpxchg(&l->p, p, p + SRWLOCK_USERS) == p) return;
			}
			else
			{
				/* There's other waiters already, lock the wait blocks and increment the shared count */
				wb = lock_wb(l);
				if (wb) break;
			}
		
			continue;
		}
		
		/* Initialize wait block */
		swblock.ex = FALSE;
		swblock.next = NULL;
		swblock.last = &swblock;
		swblock.wake = &sw;
		
		sw.next = NULL;
		sw.spin = 0;
	
		if (!(p & SRWLOCK_WAIT))
		{
			/*
			 * We need to setup the first wait block.
			 * Currently an exclusive lock is held, change the lock to contended mode.
			 */
			swblock.s_count = SRWLOCK_USERS;
			swblock.last_shared = &sw;

			if (cmpxchg(&l->p, p, (uintptr_t)&swblock | SRWLOCK_WAIT) == p)
			{
				while (!sw.spin) cpu_relax();
				return;
			}
		
			continue;
		}
		
		/* Handle the contended but not shared case */

		/*
		 * There's other waiters already, lock the wait blocks and increment the shared count.
		 * If the last block in the chain is an exclusive lock, add another block.
		 */
		swblock.s_count = 0;

		wb = lock_wb(l);
		if (!wb) continue;
		
		shared = wb->last;
		if (shared->ex)
		{
			shared->next = &swblock;
			wb->last = &swblock;

			shared = &swblock;
		}
		else
		{
			shared->last_shared->next = &sw;
		}
			
		shared->s_count += SRWLOCK_USERS;
		shared->last_shared = &sw;

		/* Unlock */
		barrier();
		l->p &= ~SRWLOCK_LOCK;
			
		/* Wait to be woken */
		while (!sw.spin) cpu_relax();
		
		return;
	}
	
	/* The contended and shared case */
	sw.next = NULL;
	sw.spin = 0;
	
	if (wb->ex)
	{
		/*
		 * We need to setup a new wait block.
		 * Although we're currently in a shared lock and we're acquiring
		 * a shared lock, there are exclusive locks queued in between.
		 * We need to wait until those are released.
		 */
		shared = wb->last;

		if (shared->ex)
		{
			swblock.ex = FALSE;
			swblock.s_count = SRWLOCK_USERS;
			swblock.next = NULL;
			swblock.last = &swblock;
			swblock.wake = &sw;
			swblock.last_shared = &sw;

			shared->next = &swblock;
			wb->last = &swblock;
		}
		else
		{
			shared->last_shared->next = &sw;
			shared->s_count += SRWLOCK_USERS;
			shared->last_shared = &sw;
		}
	}
	else
	{
		wb->last_shared->next = &sw;
		wb->s_count += SRWLOCK_USERS;
		wb->last_shared = &sw;
	}

	/* Unlock */
	barrier();
	l->p &= ~SRWLOCK_LOCK;
	
	/* Wait to be woken */
	while (!sw.spin) cpu_relax();
}


static void srwlock_rdunlock(srwlock *l)
{
	uintptr_t p, np;
	srw_wb *wb;
	srw_wb *next;

	while (1)
	{
		barrier();
		p = l->p;
	
		cpu_relax();
		
		if (p & SRWLOCK_WAIT)
		{
			/*
			 * There's a wait block, we need to wake a pending exclusive acquirer,
			 * if this is the last shared release.
			 */
			wb = lock_wb(l);
			if (wb) break;

			continue;
		}
		
		/* Don't interfere with locking */
		if (p & SRWLOCK_LOCK) continue;
		
		/*
		 * This is a fast path, we can simply decrement the shared
		 * count and store the pointer
		 */
		np = p - SRWLOCK_USERS;
		
		/* If we are the last reader, then the lock is unused */
		if (np == SRWLOCK_SHARED) np = 0;
	
		/* Try to release the lock */
		if (cmpxchg(&l->p, p, np) == p) return;
	}

	wb->s_count -= SRWLOCK_USERS;

	if (wb->s_count)
	{
		/* Unlock */
		barrier();
		l->p &= ~SRWLOCK_LOCK;
		return;
	}
	
	next = wb->next;
	if (next)
	{
		/*
		 * There's more blocks chained, we need to update the pointers
		 * in the next wait block and update the wait block pointer.
		 */
		np = (uintptr_t)next | SRWLOCK_WAIT;

		next->last = wb->last;
	}
	else
	{
		/* Convert the lock to a simple exclusive lock. */
		np = SRWLOCK_USERS;
	}

	barrier();
	/* This also unlocks wb lock bit */
	l->p = np;
	barrier();
	wb->wake = (void *) 1;
	barrier();

	/* We released the lock */
}

static int srwlock_rdtrylock(srwlock *s)
{
	uintptr_t p = s->p;
	
	barrier();
	
	/* This is a fast path, we can simply try to set the shared count to 1 */
	if (!p && (cmpxchg(&s->p, 0, SRWLOCK_USERS | SRWLOCK_SHARED) == 0)) return 0;
	
	if ((p & (SRWLOCK_SHARED | SRWLOCK_WAIT)) == SRWLOCK_SHARED)
	{
		/* This is a fast path, just increment the number of current shared locks */
		if (cmpxchg(&s->p, p, p + SRWLOCK_USERS) == p) return 0;
	}
			
	return EBUSY;
}


static void srwlock_wrlock(srwlock *l)
{
	srw_wb swblock;
	uintptr_t p, np;

	/* Fastpath - no other readers or writers */
	if (!l->p && (!cmpxchg(&l->p, 0, SRWLOCK_USERS))) return;

	/* Initialize wait block */
	swblock.ex = TRUE;
	swblock.next = NULL;
	swblock.last = &swblock;
	swblock.wake = NULL;

	while (1)
	{
		barrier();
		p = l->p;
		cpu_relax();
	
		if (p & SRWLOCK_WAIT)
		{
			srw_wb *wb = lock_wb(l);
			if (!wb) continue;
		
			/* Complete Initialization of block */
			swblock.s_count = 0;
		
			wb->last->next = &swblock;
			wb->last = &swblock;
			
			/* Unlock */
			barrier();
			l->p &= ~SRWLOCK_LOCK;
			
			/* Has our wait block became the first one in the chain? */
			while (!swblock.wake) cpu_relax();

			return;
		}
		
		/* Fastpath - no other readers or writers */
		if (!p)
		{
			if (!cmpxchg(&l->p, 0, SRWLOCK_USERS)) return;
			continue;
		}
		
		/* Don't interfere with locking */
		if (p & SRWLOCK_LOCK) continue;
		
		/* There are no wait blocks so far, we need to add ourselves as the first wait block. */
		if (p & SRWLOCK_SHARED)
		{
			swblock.s_count = p & ~SRWLOCK_MASK;
			np = (uintptr_t)&swblock | SRWLOCK_SHARED | SRWLOCK_WAIT;
		}
		else
		{
			swblock.s_count = 0;
			np = (uintptr_t)&swblock | SRWLOCK_WAIT;
		}
		
		/* Try to make change */
		if (cmpxchg(&l->p, p, np) == p) break;
	}
	
	/* Has our wait block became the first one in the chain? */
	while (!swblock.wake) cpu_relax();
}


static void srwlock_wrunlock(srwlock *l)
{
	uintptr_t p, np;
	srw_wb *wb;
	srw_wb *next;
	srw_sw *wake, *wake_next;

	while (1)
	{
		barrier();
		p = l->p;
		cpu_relax();
		
		if (p == SRWLOCK_USERS)
		{
			/*
			 * This is the fast path, we can simply clear the SRWLOCK_USERS bit.
			 * All other bits should be 0 now because this is a simple exclusive lock,
			 * and no one else is waiting.
			 */

			if (cmpxchg(&l->p, SRWLOCK_USERS, 0) == SRWLOCK_USERS) return;
		
			continue;
		}
	
		/* There's a wait block, we need to wake the next pending acquirer */
		wb = lock_wb(l);
		if (wb) break;
	}

	next = wb->next;
	if (next)
	{
		/*
		 * There's more blocks chained, we need to update the pointers
		 * in the next wait block and update the wait block pointer.
		 */
		np = (uintptr_t)next | SRWLOCK_WAIT;
		if (!wb->ex)
		{
			/* Save the shared count */
			next->s_count = wb->s_count;

			np |= SRWLOCK_SHARED;
		}

		next->last = wb->last;
	}
	else
	{
		/* Convert the lock to a simple lock. */
		if (wb->ex)
		{
			np = SRWLOCK_USERS;
		}
		else
		{
			np = wb->s_count | SRWLOCK_SHARED;
		}
	}
	
	barrier();
	/* Also unlocks lock bit */
	l->p = np;
	barrier();

	if (wb->ex)
	{
		barrier();
		/* Notify the next waiter */
		wb->wake = (void *) 1;
		barrier();
		return;
	}

	/* We now need to wake all others required. */
	for (wake = wb->wake; wake; wake = wake_next)
	{
		barrier();
		wake_next = wake->next;
		barrier();
		wake->spin = 1;
		barrier();
	}
}

static int srwlock_wrtrylock(srwlock *s)
{
	/* No other readers or writers? */
	if (!s->p && (cmpxchg(&s->p, 0, SRWLOCK_USERS) == 0)) return 0;
	
	return EBUSY;
}

The above code is not exactly the code in ReactOS. It has been simplified and cleaned up somewhat. One of the controlling bit flags has been removed, and replaced with altered control flow. So how does it perform? In the uncontended case, it is just like the dumb ticket-based read-write lock, taking 3.7 seconds for all cases. For the contended case with four threads:

Writers per 256   1     25    128   250
Time (s)          2.2   3.2   5.7   6.4

This is quite bad, slower than the dumb lock in all contended cases. The extra complexity simply isn't worth any performance gain.

Another possibility is to combine the reader count with some bits describing the state of the writers. A similar technique is used by the Linux kernel to describe its (reader-preferring) read-write locks. Making the lock starvation-proof for writers instead yields something like the following:


#define RW_WAIT_BIT		0
#define RW_WRITE_BIT	1
#define RW_READ_BIT		2

#define RW_WAIT		1
#define RW_WRITE	2
#define RW_READ		4

typedef unsigned rwlock;

static void wrlock(rwlock *l)
{
	while (1)
	{
		unsigned state = *l;
	
		/* No readers or writers? */
		if (state < RW_WRITE)
		{
			/* Turn off RW_WAIT, and turn on RW_WRITE */
			if (cmpxchg(l, state, RW_WRITE) == state) return;
			
			/* Someone else got there... time to wait */
			state = *l;
		}
		
		/* Turn on writer wait bit */
		if (!(state & RW_WAIT)) atomic_set_bit(l, RW_WAIT_BIT);
	
		/* Wait until can try to take the lock */
		while (*l > RW_WAIT) cpu_relax();
	}
}

static void wrunlock(rwlock *l)
{
	atomic_add(l, -RW_WRITE);
}

static int wrtrylock(rwlock *l)
{
	unsigned state = *l;
	
	if ((state < RW_WRITE) && (cmpxchg(l, state, state + RW_WRITE) == state)) return 0;
	
	return EBUSY;
}

static void rdlock(rwlock *l)
{
	while (1)
	{
		/* A writer exists? */
		while (*l & (RW_WAIT | RW_WRITE)) cpu_relax();
		
		/* Try to get read lock */
		if (!(atomic_xadd(l, RW_READ) & (RW_WAIT | RW_WRITE))) return;
			
		/* Undo */
		atomic_add(l, -RW_READ);
	}
}

static void rdunlock(rwlock *l)
{
	atomic_add(l, -RW_READ);
}

static int rdtrylock(rwlock *l)
{
	/* Try to get read lock */
	unsigned state = atomic_xadd(l, RW_READ);
			
	if (!(state & (RW_WAIT | RW_WRITE))) return 0;
			
	/* Undo */
	atomic_add(l, -RW_READ);
		
	return EBUSY;
}

/* Get a read lock, even if a writer is waiting */
static int rdforcelock(rwlock *l)
{
	/* Try to get read lock */
	unsigned state = atomic_xadd(l, RW_READ);
	
	/* We succeed even if a writer is waiting */
	if (!(state & RW_WRITE)) return 0;
			
	/* Undo */
	atomic_add(l, -RW_READ);
		
	return EBUSY;
}

/* Try to upgrade from a read to a write lock atomically */
static int rdtryupgradelock(rwlock *l)
{
	/* Someone else is trying (and will succeed) to upgrade to a write lock? */
	if (atomic_bitsetandtest(l, RW_WRITE_BIT)) return EBUSY;
	
	/* Don't count myself any more */
	atomic_add(l, -RW_READ);
	
	/* Wait until there are no more readers */
	while (*l > (RW_WAIT | RW_WRITE)) cpu_relax();
	
	return 0;
}

This lock, unfortunately, has similar performance to the dumb lock that uses a ticket lock as its spinlock.

Writers per 256   1     25    128   250
Time (s)          2.0   3.4   3.9   4.6

The version in the Linux kernel is written in assembler, so may be a fair bit faster. It uses the fact that the atomic add instruction can set the zero flag. This means that the slower add-and-test method isn't needed, and a two-instruction fast path is used instead.
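The general idea (a sketch only, not the kernel's actual code) is that the flags left behind by a locked add can be tested directly, avoiding a separate read-back of the counter:

/* Atomically add v to *p and report whether the result became zero.
   The setz picks up the zero flag set by the locked add. */
static inline int atomic_add_test_zero(unsigned *p, int v)
{
	unsigned char z;

	__asm__ __volatile__("lock; addl %2,%1\n"
				"setz %0"
				:"=q" (z), "+m" (*p)
				:"ir" (v)
				:"memory", "cc");

	return z;
}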

Sticking to semi-portable C code, we can still do a little better. There exists a form of the ticket lock that is designed for read-write locks. An example written in assembly was posted to the Linux kernel mailing list in 2002 by David Howells from RedHat. This was a highly optimized version of a read-write ticket lock developed at IBM in the early 90's by Joseph Seigh. Note that a similar (but not identical) algorithm was published by John Mellor-Crummey and Michael Scott in their landmark paper "Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors". Converting the algorithm from assembly language to C yields:


typedef union rwticket rwticket;

union rwticket
{
	unsigned u;
	unsigned short us;
	__extension__ struct
	{
		unsigned char write;
		unsigned char read;
		unsigned char users;
	} s;
};

static void rwticket_wrlock(rwticket *l)
{
	unsigned me = atomic_xadd(&l->u, (1<<16));
	unsigned char val = me >> 16;
	
	while (val != l->s.write) cpu_relax();
}

static void rwticket_wrunlock(rwticket *l)
{
	rwticket t = *l;
	
	barrier();

	t.s.write++;
	t.s.read++;
	
	*(unsigned short *) l = t.us;
}

static int rwticket_wrtrylock(rwticket *l)
{
	unsigned me = l->s.users;
	unsigned char menew = me + 1;
	unsigned read = l->s.read << 8;
	unsigned cmp = (me << 16) + read + me;
	unsigned cmpnew = (menew << 16) + read + me;

	if (cmpxchg(&l->u, cmp, cmpnew) == cmp) return 0;
	
	return EBUSY;
}

static void rwticket_rdlock(rwticket *l)
{
	unsigned me = atomic_xadd(&l->u, (1<<16));
	unsigned char val = me >> 16;
	
	while (val != l->s.read) cpu_relax();
	l->s.read++;
}

static void rwticket_rdunlock(rwticket *l)
{
	atomic_inc(&l->s.write);
}

static int rwticket_rdtrylock(rwticket *l)
{
	unsigned me = l->s.users;
	unsigned write = l->s.write;
	unsigned char menew = me + 1;
	unsigned cmp = (me << 16) + (me << 8) + write;
	unsigned cmpnew = ((unsigned) menew << 16) + (menew << 8) + write;

	if (cmpxchg(&l->u, cmp, cmpnew) == cmp) return 0;
	
	return EBUSY;
}

This read-write lock performs extremely well. It is as fast as the dumb spinlock rwlock for a low writer fraction, and nearly as fast as the dumb ticketlock rwlock for a large number of writers. It also doesn't suffer any slowdown when there is no contention, taking 3.7 seconds for all cases. With contention:

Writers per 256   1     25    128   250
Time (s)          1.1   1.8   3.9   4.7

This algorithm is five times faster than using a simple spin lock for the reader-dominated case. Its only drawback is that it is difficult to upgrade read locks into write locks atomically. (It can be done, but then rwticket_wrunlock() needs to use an atomic instruction, and the resulting code becomes quite a bit slower.) This drawback is the reason why this algorithm is not used within the Linux kernel. Some parts depend on the fact that if you have a read lock, then acquiring a new read lock recursively will always succeed. However, if that requirement were to be removed, this algorithm probably would be a good replacement.
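As a rough illustration of the atomic unlock just mentioned (an assumption for this sketch, not code from the article), the write and read counts can be bumped together with a compare-and-swap loop over the whole word, which is noticeably more expensive than the plain 16-bit store above:

static void rwticket_wrunlock_atomic(rwticket *l)
{
	while (1)
	{
		rwticket t, n;

		t.u = l->u;
		barrier();

		n = t;
		n.s.write++;
		n.s.read++;

		/* Retry if a concurrent ticket grab changed the word under us */
		if (cmpxchg(&l->u, t.u, n.u) == t.u) return;

		cpu_relax();
	}
}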

One final thing to note is that the read-write ticket lock is not optimal. The problem is the situation where readers and writers alternate in the wait queue: writer (executing), reader 1, writer, reader 2. The two reader threads could be shuffled so that they execute in parallel, i.e. the second reader should not have to wait until the second writer finishes before executing. Fortunately, this situation is encountered rarely when the thread count is low. For four processors and threads, this happens one time in 16 if readers and writers are equally likely, and less often otherwise. Unfortunately, as the number of threads increases this will lead to an asymptotic factor-of-two slowdown compared with the optimal ordering.

The obvious thing to do to fix this is to add a test to see if readers should be reordered in wait-order. However, since the effect is so rare with four concurrent threads, it is extremely hard (if not impossible) to add the check with a low enough overhead that the result is a performance win. Thus it seems that the problem of exactly which algorithm is best will need to be revisited when larger multicore machines become common.

Comments

sfuerst said...
The article has been updated to improve the reference to the creation of the read-write ticket lock algorithm. Also some extra discussion of the non-optimality of the algorithm is added at the end.
Cory said...
In the function "static int ticket_trylock(ticketlock *t)", should the following line of code be changed from:

unsigned cmpnew = ((unsigned) me << 16) + menew;

to be

unsigned cmpnew = ((unsigned) menew << 16) + me;

I think it is to update the "users" part with the increased 1, right?
sfuerst said...
Oops, you are right. I didn't check the trylock routines nearly as well as the lock+unlock ones. :-/ I've updated the article with the fix.
Cory said...
In fact this document is quite good for me to understand these locks. I thus tried the ticket based reader/writer lock, but it seems it has some problems also. The following is the test code that I use:

rwticket ticket;

int main (void)
{
rwticket_init(&ticket);
rwticket_rdtrylock(&ticket);//should ok
rwticket_wrtrylock(&ticket);//should fail
rwticket_rdunlock(&ticket);
rwticket_wrtrylock(&ticket);//should ok
rwticket_rdtrylock(&ticket);//should fail
rwticket_wrtrylock(&ticket);//should fail
rwticket_wrunlock(&ticket);
rwticket_wrtrylock(&ticket);//should ok
rwticket_wrlock(&ticket);//should wait
}

and the following is the output:

coryxie@coryxie-t60:~/test-code$ ./rwlock
rwticket_rdtrylock OK, users 1, readers 1, writers 0
rwticket_wrtrylock BUSY, users 1, readers 1, writers 0
rwticket_rdunlock, users 1, readers 1, writers 1
rwticket_wrtrylock OK, users 1, readers 1, writers 2 <===should OK, actual OK
rwticket_rdtrylock OK, users 2, readers 2, writers 2 <===should fail, actual OK
rwticket_wrtrylock OK, users 2, readers 2, writers 3 <===should fail, actual OK
rwticket_wrunlock, users 2, readers 3, writers 4
rwticket_wrtrylock BUSY, users 2, readers 3, writers 4
rwticket_wrlock enter, users 2, readers 3, writers 4
rwticket_wrlock wait, users 3, readers 3, writers 4

I think once any rwticket_rdtrylock() or rwticket_wrtrylock() succeed to gain the lock, any other tries for the writer lock (wr, lock, try lock) should either fail or wait.

And once a writer lock is taken by someone, then rwticket_rdtrylock() should fail, and rwticket_rdlock() should wait.

However the results shows that once the rwticket_wrtrylock get the writer lock, the rwticket_rdtrylock() and rwticket_wrtrylock() are still OK to get the lock, which seems incorrect.

sfuerst said...
Yep, the code for rwticket_wrtrylock had a bug. "me" and "menew" were swapped in the calculation of cmpnew. The code has been fixed in the article.

Thanks for spotting this! I really should have tested the trylock routines more... It wouldn't surprise me if there were latent bugs in a few of the others as well.
said...
Regarding the ticket lock's use of atomic_xadd: so 2-byte alignment is sufficient for 16-bit interlocked operations? I found no note of this in the documentation for either __sync_fetch_and_add() or Windows' _InterlockedExchangeAdd16().
Borislav Trifonov said...
One way to do a ticket_timedlock lock would be to loop on the ticket_trylock, but I wonder if it's safe to loop as in ticket_lock and unlock when it times out, to avoid the cmpxchg of ticket_trylock
static void ticket_timedlock(ticketlock *t, unsigned long delay)
{
        if (ticket_trylock(t)) return 0;
        unsigned long tEnd = current_time() + delay;
        unsigned short me = atomic_xadd(&t->s.users, 1);        
        while (t->s.ticket != me)
        {
                cpu_relax();
                if (current_time() > tEnd)
                {
                        ticket_unlock(t);
                        return ETIMEDOUT;
                }
        }
        return 0;
}
sfuerst said...
It is safe to do locked operations with any alignment. I know this perhaps used to be documented as unsafe by Microsoft... but Intel disagrees. You just have to make sure that the underlying instruction exists in the x86 instruction set for the compiler to emit.

Unfortunately, I don't think you can implement ticket_timedlock() like that. The unlock code assumes that the unlocking thread actually has the lock. The above will cause all sorts of problems due to the ticket<->thread relationships being moved out of step.
Borislav Trifonov said...
Ah, I see that now. Would an atomic decrement on _users in place of the unlock work? I only did limited testing but I didn't encounter any problems.
sfuerst said...
Imagine you are at a local shop that uses a ticket system.

The number of the customer that is currently being served is 25. You pull a ticket, number 40.

You wait for a few minutes, but the checkout is slow, and the server is up to number 30. You want to leave. However, in that time a few other people have arrived, and the ticket dispenser is up to number 45.

What can you do? If you leave, when the checkout reaches number 40 no one will have that number. In the real world, the server will wait for a few seconds to see if anyone with number 40 is there. If not, he/she will skip to number 41. The computer is different. It will wait forever, causing a deadlock.

Another option is to try to decrement the value of the ticket dispenser. i.e. change it from 45 to 44. Now what happens? Unfortunately, the same deadlock. You haven't fixed the problem that no one has number 40 except you, and you have left. (Plus there is a further problem when a new customer arrives and takes 44. Two customers will have the same number - causing chaos at the checkout when 44 is reached.)

What you need to do is find some way to give your ticket to some other new customer who hasn't yet grabbed a ticket. This isn't as easy as it sounds. What happens if no new customer arrives? What happens if multiple people want to leave their ticket behind?

This complexity is why the current algorithm of only grabbing a ticket in the trylock routine if you know for certain that you will immediately succeed in getting the lock is what I implemented. Perhaps there is another way... but it isn't nearly as easy to show to be correct.
Samy Al Bahra said...
said, regarding alignment: it is safe to use any alignment for atomic operations except for cmpxchg8b and cmpxchg16b. These generally require 8 and 16 byte alignment respectively. However, if your target is not aligned then it would likely require a split transaction. On the IA32 implementations I am familiar with this usually means you will revert to bus locking and will lose any of the benefits of cache locking. Bus locking is extremely expensive and can starve your software stack of bus access. If possible, align your atomic targets.
Arto N said...
Thanks for a very good article. Nice to see that even these days there are people trying to avoid poor performing code if you "easily" can do better.

Concerning the examples I have two questions which based my understanding are possible race conditions: In ticket_unlock method you are using 't->s.ticket++' and in rwticket_rdlock you are using 'l->s.read++'. I have understood that Intel guarantees that 8/16/32/64 bit loads and stores are atomic without 'lock' prefix (assuming the vars are properly aligned or fitting into cache pipeline) but increments (like decrement) are not.

So, is there some even deeper trick in your code or should these inc operations be done using atomic_inc to really get them atomic?
sfuerst said...
Think about ownership... at those given points only one thread can modify those variables. So doing a simple non-atomic read, modify, write sequence is perfectly okay. It also is much much faster than doing an atomic increment. This is one of the advantages of a ticket lock: less bus contention.

The only "deep" trick is that the Intel memory model requires the store in the non-atomic increment to eventually be visible to other threads. This may not be the case on other architectures, and some sort of cache-synchronization instruction may be required.
Arto N said...
Hi,

Of course. Especially in the case of SCHED_FIFO threads. But not necessarily for SCHED_OTHER or SCHED_RR. In general, spinning locks do not behave that well if timeslicing is allowed, so you have perhaps assumed that there shall not be any preemptive context switches...

mohammed shahid said...
Hi sfuerst,

Very great utility for those analysing the performance of mutual exclusion algorithms in an actual system.

With regards to both MCS and IBM's MCS called K42, there are variants which use only fetch-and-store rather than compare-and-swap (eg. Craig, MLH locks).

Since they use a simpler instruction, is it likely that these variants would be better in time-terms ?

regards,
-shahid.
sfuerst said...
Arto: No such assumptions are needed. Try drawing the possible state transitions in a diagram. As long as only a single thread is modifying a given memory location, that thread doesn't need to use atomic instructions. (Again assuming Intel memory orderings where no store-store barrier is needed.)

mohammed: I didn't know about those types of locks when creating this article a year or so ago. Perhaps the other variants are faster... However, if they fall under the patent, the average person still can't use them. :-(

Note that the list of lock types here is obviously incomplete. The major one missing is that using the bit test-and-set instructions.
Samy Al Bahra said...
Mohammed, MCS and CLH are both great in that they minimize cache coherence traffic. However, they are both arbitrating (fair) spinlocks. These tend to perform terribly under contention in a highly preemptive environment (without much additional complexity and slower fast paths). I've uploaded some data I generated some years ago at

http://repnop.org/t/unfair.png
http://repnop.org/t/fair.png

These are the results of a benchmark run on a 4-socket quad-core AMD Opteron 8350 in user-space under Linux. With preemption disabled, the ball game changes. Under preemption, as you can see it is not very fair to compare to fair spinlocks under contention.

You can view the fast path latency for these various spinlocks at http://concurrencykit.org/doc/appendixZ.html (these are approximate measurements).
utehute said...
Thanks for a very informative article. It was fun to read. However, I am unable to recreate your results. I tried comparing the basic spinlock algorithm to the fast ticketlock algorithm. I found that the basic algorithm is a touch faster than the ticketlock algorithm, although the ticketlock is very much a fair algorithm. I am having a hard time explaining this. It is possible that my CPU bus easily handles all traffic generated from the 8 threads I am running. My testing method is this.

for i=0 to 8:
  pthread_create(..., work)

work() {
start = gettimeofday
acquire_lock()
traverse vector
release_lock
end = gettimeofday
}

I then compare the results and find the basic spinlock to be just a touch faster. Here are the results (per thread) that I get.
ticketlock:
0::lock: 7.273 seconds.
1::lock: 7.27306 seconds.
2::lock: 7.27305 seconds.
3::lock: 7.2731 seconds.
4::lock: 7.2731 seconds.
5::lock: 7.27309 seconds.
6::lock: 7.27308 seconds.
7::lock: 7.27308 seconds.

spinlock:
0::lock: 0.876959 seconds.
1::lock: 4.84771 seconds.
2::lock: 6.99854 seconds.
3::lock: 6.85127 seconds.
4::lock: 4.29035 seconds.
5::lock: 1.75049 seconds.
6::lock: 6.42732 seconds.
7::lock: 2.62516 seconds.

To me this makes sense. In the basic algorithm, the threads finish more quickly. So there is less work for the scheduler, which would reduce processing time.

Can you tell me if I might have made a mistake here?
sfuerst said...
Your results are what you might expect under extreme unfairness. How "unfair" the locked xchg operation is depends on the underlying hardware implementation, and how long the lock is held.

The test results here use a small loop to wait whilst the lock is held to make sure other threads have had time to touch the lock's cache line. Remember, under real-world conditions the lock is taken to do work. Not modelling that correctly changes the timings...
Peppe said...
In your ticket-lock example users and tickets are both shorts, which means they can only reach a maximum value of 65535. The lock can only be locked 65535 times, which amounts to 18 hours and 12 minutes if locking and unlocking once a second. Many applications receiving data that should be analysed receive data several times a second and should run for months without rebooting; even using an int instead of a short might be insufficient.
sfuerst said...
They are unsigned shorts. That means that their behaviour on overflow is well defined. So the lock can be taken way more than 2^16 times without issue.

The limitation is that no more than 65535 threads can attempt to take the lock simultaneously. If you use more threads than that, then yes, you can replace the underlying types with a pair of 32-bit unsigned integers inside a 64-bit one.
Peppe said...
Every time you call ticket_lock you increase users and every time you call ticket_unlock, you increase ticket. You never decrease any of them? So if you are using the same lock a lot of times, I would argue that you would run out of numbers.
moses said...
@Peppe, it will overflow.
unsigned short x = 65535;
++x;
// x is now 0
