Synthesizing Fine-Grained Synchronization Protocols for Implicit Monitors

KOSTAS FERLES*, The University of Texas at Austin, USA
BENJAMIN SEPANSKI*, The University of Texas at Austin, USA
RAHUL KRISHNAN, The University of Texas at Austin, USA
JAMES BORNHOLT, The University of Texas at Austin, USA
ISIL DILLIG, The University of Texas at Austin, USA

A monitor is a widely-used concurrent programming abstraction that encapsulates all shared state between threads. Monitors can be classified as being either implicit or explicit depending on the primitives they provide. Implicit monitors are much easier to program but typically not as efficient. To address this gap, there has been recent research on automatically synthesizing explicit-signal monitors from an implicit specification [Ferles et al. 2018], but prior work does not exploit all parallelization opportunities due to the use of a single lock for the entire monitor. This paper presents a new technique for synthesizing fine-grained explicit-synchronization protocols from implicit monitors. Our method is based on two key innovations: First, we present a new static analysis for inferring safe interleavings that allow violating mutual exclusion of monitor operations without changing its semantics. Second, we use the results of this static analysis to generate a MaxSAT instance whose models correspond to correct-by-construction synchronization protocols. We have implemented our approach in a tool called CORTADO and evaluate it on monitors that contain parallelization opportunities. Our evaluation shows that CORTADO can synthesize synchronization policies that are competitive with, or even better than, expert-written ones on these benchmarks.

1 INTRODUCTION

Concurrent programming is difficult because it requires developers to consider interactions between multiple threads of execution and mediate access to shared resources and data. Programming languages can offer higher-level abstractions to reduce this complexity by making concurrent programming more declarative. One such abstraction is the monitor [Hansen 1973; Hoare 1974], which is an object that encapsulates shared state and allows threads access to it only through a set of operations, between which the monitor enforces mutual exclusion.

Ideally, developers would implement monitors using implicit synchronization, wherein the only synchronization primitive is a waituntil(P) operation that blocks threads until condition P is satisfied. The compiler or runtime can then automatically generate the necessary explicit synchronization operations (locks, condition variables, etc.) to implement the monitor in a way that respects the semantics of the implicit monitor. However, automatically deriving an efficient explicit monitor from its implicit specification is a challenging problem, and there have been several recent research efforts, including both run-time techniques like AutoSynch [Hung and Garg 2013] and compile-time tools like Expresso [Ferles et al. 2018], to support implicit-synchronization monitors.

While these state-of-the-art approaches make it possible to program using implicit monitors, they still achieve sub-optimal performance because they adhere closely to the monitor’s mutual exclusion

*Both authors contributed equally to the paper.

Authors’ addresses: Kostas Ferles, Computer Science Department, The University of Texas at Austin, USA, kferles@cs.utexas.edu; Benjamin Sepanski, Computer Science Department, The University of Texas at Austin, USA, ben_sephanski@utexas.edu; Rahul Krishnan, Computer Science Department, The University of Texas at Austin, USA, rahulk@cs.utexas.edu; James Bornholt, Computer Science Department, The University of Texas at Austin, USA, bornholt@cs.utexas.edu; Isil Dillig, Computer Science Department, The University of Texas at Austin, USA, isil@cs.utexas.edu.

2018. 2475-1421/2018/1-ART1 $15.00
https://doi.org/
requirement. They generally use a single lock for the entire monitor and allow access by at most one thread at a time across all monitor operations. In practice, however, many monitors can admit additional concurrency while still preserving the appearance of mutual exclusion. For example, consider a FIFO queue monitor that provides take and put operations. These two operations can safely run concurrently unless the queue is empty or full, as they will not access the same slot in the queue. Today, realizing this fine-grained concurrency requires expert developers to fall back to hand-written explicit synchronization. These implementations are subtle and error-prone, and there is no easy way for developers to determine when they have extracted the maximum possible concurrency from such an implementation.

This paper presents a new technique to automatically synthesize fine-grained explicit-synchronization monitors. Our technique takes as input an implicit monitor that specifies the desired operations and automatically generates an implementation that allows as much concurrency as possible between those operations while still preserving the appearance of mutual exclusion. The key idea is to decompose each monitor operation into a set of fragments and allocate a set of locks to each fragment to enforce the mutual exclusion requirement while allowing as many fragments as possible to run concurrently. The resulting implementation selectively acquires and releases locks at fragment boundaries within each operation and signals condition variables as needed.

At a high level, our approach operates in three phases to generate a high-performance explicit synchronization monitor from its implicit version:

- **Signal placement:** First, we use an off-the-shelf technique [Ferles et al. 2018] to infer a signaling regime which determines where to insert signaling operations on condition variables. While the output of this tool is sufficient to synthesize a single-lock implementation, it does not admit any additional concurrency wherein different threads can perform monitor operations simultaneously.

- **Static analysis:** Second, we perform static analysis to infer sufficient conditions for correctness. That is, the output of the static analysis is a set of conditions such that if the synthesized monitor obeys them, it is guaranteed to be correct-by-construction. A key challenge for this static analysis is to determine which fragments can safely execute concurrently without creating a potential violation of the monitor semantics. The analysis simulates interleaving each fragment between the fragments of other operations and determines which possible interleavings are safe.

- **Synchronization protocol synthesis via MaxSAT:** Finally, we reduce the synthesis problem to a maximum satisfiability (MaxSAT) instance from whose solution an explicit synchronization protocol can be extracted. The hard constraints in the MaxSAT problem enforce the correctness requirements extracted by the static analysis, while the soft constraints encode two competing objective functions: minimizing the total number of locks used, while maximizing the number of pairs of fragments that can run concurrently.

We have implemented our proposed approach in a tool called CORTADO that operates on Java monitors and evaluated it on a collection of monitor implementations that are (1) drawn from popular open-source projects and (2) contain parallelization opportunities that can be achieved via fine-grained locking. Given only the implicit monitor as input, CORTADO synthesizes explicit-synchronization monitors that perform as well as, or better than, hand-written explicit implementations by expert developers. Compared to state-of-the-art automated tools for synthesizing explicit monitors [Ferles et al. 2018], CORTADO-synthesized monitors extract more concurrency and therefore perform much better (up to 39.1×) on heavily contended workloads.

In summary, this paper makes four main contributions:

- A new technique for automatically synthesizing fine-grained monitor implementations that admit the maximum possible concurrency.
- A novel static static analysis for inferring safe interleaving opportunities between threads.
A MaxSAT encoding to automate reasoning about both the correctness and performance of the synthesized explicit-synchronization monitor.
• An implementation of our technique, Cortado, that outperforms both state-of-the-art automated tools and expert-written code on benchmarks that can be parallelized via fine-grained locking.

2 OVERVIEW
In this section, we give an overview of our approach through a motivating example. Given the implicit-synchronization monitor shown in Figure 1a, our goal is to automatically synthesize an efficient and semantically equivalent explicit-synchronization monitor like the one presented in Figure 1b. In what follows, we walk through this example and describe how our technique is able to automatically generate the code in Figure 1b.

2.1 Implicit-synchronization monitor
Our technique takes as input an implicit-synchronization monitor that specifies which operations should execute atomically and when certain operations are allowed to proceed but does not fix a specific synchronization protocol for realizing that behavior. For example, Figure 1a shows an implicit monitor that implements a limited capacity blocking queue via a bounded circular array buffer. This monitor defines two operations, put and take, that execute atomically (i.e., the body of each method must appear to execute as one indivisible unit). The put operation adds an object if the queue is not full, and the take removes an object if the queue is not empty. If one of these method calls cannot proceed (i.e., queue is full or empty), the monitor blocks the calling thread’s execution using...
a `waituntil` statement until the operation can be executed. For example, the `waituntil` statement
at line 13 in `take` blocks execution until there is at least one object in the queue.

As Figure 1a illustrates, implicit-synchronization monitors make concurrent programming
simpler because they are declarative: they merely state which operations are atomic and when
operations can proceed, but they do not specify a particular synchronization protocol for realizing
that desired behavior. However, most programming languages do not offer implicit synchronization
facilities; so, concurrent programs must instead be implemented in terms of explicit synchronization
constructs such as locks and condition variables, as we discuss next.

### 2.2 Explicit-synchronization monitor

Figure 1b shows an explicit-synchronization implementation of the bounded queue from Figure 1a
that is written by an expert. This implementation uses two distinct locks, `putLock` and `takeLock`,
to protect the `put` and `take` methods respectively. The explicit-synchronization monitor also uses
an atomic integer for the `count` field, transforming reads into `get()` calls (e.g., line 12) and writes
into the appropriate atomic method (e.g., `count.getAndIncrement()` on line 16). The expert-written
monitor performs explicit signaling via condition variables `notFull` and `notEmpty` that are associated
with `putLock` and `takeLock` respectively. When a thread cannot execute one of these operations, it
calls `await` on the appropriate condition variable to block its execution (lines 13 and 26). A thread
blocked in `put` can only be unblocked by a corresponding `take` that frees up space in the queue. To
do so, the `take` must acquire `putLock` and perform a signal operation on condition variable `notFull`
(lines 33–35). The logic for `take` is symmetric (lines 19–21).

Although the expert-written version has more locks than a single global-lock implementation, its
performance will often be better: Introducing two locks allows `put` and `take` to execute concurrently,
although multiple concurrent `puts` are still serialized, as are multiple `takes`. A single global lock
would admit no concurrency in this case and would still incur the same synchronization overhead
of acquiring and releasing a lock on every method call. The expert implementation mitigates the
overhead of having two locks by acquiring locks selectively: `take` only acquires the `putLock` if
it is possible for there to be a `put` operation currently blocked waiting for space in the queue, which
happens only if the queue was full when `take` ran (the `put/takeLock` case is symmetric). This
example demonstrates the intricacy of synthesizing fine-grained locking protocols: instead of only
minimizing the total number of locks, we must also try to maximize the available concurrency.

### 2.3 Our Approach

Our tool CORTADO automatically synthesizes the efficient explicit-synchronization monitor in
Figure 1b given the implicit version from Figure 1a. It does so in three phases: First, it infers when
and how signaling operations should take place. Second, it performs static analysis to infer sufficient
conditions for the synthesized monitor to be correct. Third, it encodes the synchronization protocol
synthesis problem as a MaxSAT instance and uses a model of the MaxSAT problem to generate an
explicit-synchronization monitor. Since prior work can already handle the first phase, we only focus
on the latter two phases in the following discussion.

**Granularity.** The granularity of our synthesized locking protocol is at the level of code fragments,
where each fragment is a single-entry region of code within a single method. For example, the
fragments chosen for the blocking queue example are indicated by comments in Figure 1a. Fragments
are the indivisible unit of concurrency in our approach: we aim to maximize the number of fragments
that can run concurrently, but we do not modify the code within a fragment to introduce extra
concurrency (e.g., by removing data races). Hence, the explicit monitor synthesized by our approach
acquires and releases locks only at fragment boundaries.
Static analysis. To ensure correctness of the synthesized monitor, our technique needs to enforce the following three key requirements:

1. **Data-race freedom**: Fragments that involve a data race must not be able to run concurrently.
2. **Deadlock freedom**: Locks must be acquired and released in an order that prevents deadlocks.
3. **Atomicity**: Each monitor operation should appear to take place as one indivisible unit. That is, even though the implementation can allow thread interleavings inside monitor operations, the resulting state should be equivalent to one where each method executes truly atomically.

Here, the second requirement (i.e., deadlock freedom) does not necessitate any static analysis, as we can prevent deadlocks by imposing a static total order \( \preceq \) on locks [Birrell 1989] and ensuring that locks are acquired and released in a manner that is consistent with \( \preceq \). However, in order to ensure data-race freedom and atomicity, we need to perform static analysis of the source code to identify (1) code fragments that have a data race, and (2) interleaving opportunities between code fragments. Since detection of data races is a well-studied problem, the novelty of our static analysis lies in identifying safe interleaving opportunities. Hence, the key question addressed by our analysis is the following: Given a code fragment \( f \) executed by thread \( t \), and two consecutive code fragments \( f_i, f_j \) executed by a different thread \( t' \), is it safe to interleave the execution of \( f \) in between \( f_i \) and \( f_j \) while ensuring that monitor operations appear to take place atomically?

To answer this question, our method performs a novel static analysis to identify a set of such safe interleavings. For instance, going back to the running example, our analysis determines that it is safe to interleave the execution of fragment 4 in Figure 1a in between fragments 5 and 6 by checking a number of commutativity relations between code fragments. In this instance, since our analysis proves that fragment 4 left-commutes [Lipton 1975] with fragment 5 and right-commutes [Lipton 1975] with 6 and all of its successors, we identify this as a safe interleaving opportunity. On the other hand, our analysis concludes that interleaving fragment 4 in between 1 and 2 is not safe because fragment 4 does not left-commute with fragment 1 — intuitively, this is because fragment 4 can falsify predicate \( \text{count} < \text{queue} \cdot \text{length} \) that appears in the \text{waituntil} statement of fragment 1.

**MaxSAT overview.** Once we identify possible data races and safe interleavings via static analysis, we use this information to generate a MaxSAT instance whose solution corresponds to a fine-grained synchronization protocol. Specifically, our MaxSAT encoding uses a variable \( h_{f_i}^{l_j} \) to indicate that code fragment \( f_i \) must hold lock \( l_j \) and generates both hard constraints (for correctness) and soft constraints (for efficiency) over these variables. Thus, if the MaxSAT solver returns a model in which variable \( h_{f_i}^{l_j} \) is assigned to true, this means that the synthesized code must acquire lock \( l_j \) prior to executing fragment \( f_i \). Similarly, our MaxSAT encoding introduces a variable \( a_{fld} \) indicating that field \( fld \) should be implemented using an atomic type.

The hard constraints in our MaxSAT encoding correspond to the three correctness requirement mentioned earlier, namely (1) data race prevention, (2) deadlock freedom, and (3) atomicity. On the other hand, soft constraints encode our optimization objective. In what follows, we give a brief overview of the different types of constraints in our encoding, focusing only on constraints that involve lock acquisition variables \( h_{f_i}^{l_j} \). However, it is worth noting that our technique also generates constraints on atomic variables \( a_{fld} \) and can automatically convert fields to atomic types whenever doing so is safe and more efficient than introducing a lock.

**Data-race freedom.** Given a pair of code fragments \( (f_i, f_j) \) that have a potential data race according to the static analysis, our MaxSAT encoding introduces hard constraints of the form \( \bigvee_k (h_{f_i}^{k} \land h_{f_j}^{k}) \) stating that \( f_i \) and \( f_j \) must share at least one common lock. For example, in Figure 1a, our analysis determines that fragments 4 and 8 cannot run in parallel since they both write to the
same memory location count. Thus, the MaxSAT instance contains boolean constraints to make sure that two different threads cannot execute count-- and count++ at the same time.

**Deadlock freedom.** Our approach precludes deadlocks by imposing a total order \( \leq \) on locks. In particular, it enforces that a thread \( t \) can only acquire lock \( l \) if \( t \) does not already hold any lock \( l' \) where \( l' < l \). For example, in Figure 1a, suppose the locking protocol determines that fragments 1 and 2 must hold all locks in sets \( L_1 \) and \( L_2 \) respectively. Between executing the two fragments, the code will need to acquire all locks in \( L_2 \setminus L_1 \). Hence, we add constraints \( i < j \) for every pair of locks \( l_j \in L_2 \setminus L_1 \) and \( l_i \in L_1 \cap L_2 \) so that those locks can be acquired while respecting the order \( \leq \).

**Atomicity.** Our MaxSAT encoding also includes constraints to ensure that monitor operations appear to execute atomically. Suppose that our static analysis determines that a thread cannot safely execute code fragment \( f \) in between some other thread’s execution of code fragments \( f_1 \) and \( f_2 \). To prevent such an unsafe interleaving, we add hard constraints to ensure that fragments \( f, f_1, \) and \( f_2 \) all share at least one common lock. For example, since our analysis determines that fragment 4 (count++) cannot be interleaved with any other pair of fragments in the same method put (running concurrently on a different thread), our MaxSAT encoding includes a hard constraint asserting that fragment 4 must share a lock with all other fragments in the put method.

**Soft constraints.** Because the efficiency of the synthesized code depends on both the allowed parallelization opportunities as well as the number of locks, our optimization objective tries to minimize the number of locks and maximize the number of fragments that can run in parallel. To encode the latter objective, our MaxSAT encoding includes soft constraints asserting that any two parallelizable fragments must not share a lock. On the other hand, to encode the former objective, we add a soft constraint stating that no fragment in \( m \) is holding lock \( l \).

**Monitor generation.** A solution of the generated MaxSAT instance determines (a) which fragments should hold which locks, (b) which fields should be implemented using atomic types, and (c) which locks should be associated with which condition variables. Thus, together with the output of the signal placement technique [Ferles et al. 2018], a model of the MaxSAT problem can be automatically translated into the target monitor implementation. For our running example, CORTADO synthesizes precisely the implementation in Figure 1b given the implicit monitor of Figure 1a.

### 3 PRELIMINARIES

In this section, we describe our source and target languages and define what it means for an explicit synchronization monitor to correctly implement an implicit one.

#### 3.1 Background on Monitors

In this work, we assume that all shared resources between threads are handled by a monitor class \( M \) which consists of fields \( F \) and set of operations (methods) \( O \). The fields \( F \) constitute the only shared state between threads, which can only access shared state by performing one of the monitor operations \( o \in O \). These operations can be performed by an arbitrary, yet fixed, number of threads, and locations reachable through arguments are assumed to be thread-local. We represent each thread by a unique identifier from set \( T \subseteq \mathbb{N} \), and we model memory locations using access paths (AP) [Landi and Ryder 1992] of the form \( \pi = \nu.(f)\ast \), consisting of a base variable \( \nu \) optionally followed by a finite sequence of field accesses. We also assume that a special this variable stores the memory location of the monitor object.

**Definition 3.1. (Monitor State).** A monitor state \( \sigma : T \times AP \rightarrow \mathbb{N} \) is a mapping from pairs \( (t, \pi) \) (where \( t \) is a thread identifier and \( \pi \) an access path) to a value.
Monitor $M$ ::= monitor $M$ \{ (fld $|$ sync $|$ $m$) * \}

Field fld ::= $\tau$ $f$ ::= $e$

Sync sync ::= Lock $l$ ::= new Lock() | CondVar cv ::= $l$.newCondVar() | Atomic[$\tau$] $a$ ::= $e$

Method $m$ ::= $m(\overline{v})(ccr^*)$

Method $m$ ::= $m(\overline{v})$ \{ ccr * \}

CCR ccr ::= waituntil($p$);$s$

CCR ccr ::= (ls*)

Stmt $s$ ::= skip $|$ $v$ ::= $e$ $|$ $v.f$ ::= $e$

Stmt $s$ ::= skip $|$ $v$ ::= $e$ $|$ $v.f$ ::= $e$

| $v.m(\overline{e})$ | [if (e)]? goto 1

| $ls_1; ls_2$

| $a_{pre} ::= a$.update($\lambda \chi.e$)

LStmt ls ::= 1:? $s$

LStmt ls ::= 1:? $s$

(a) Implicit-synchronization monitor language.

(b) Explicit-synchronization monitor language.

Fig. 2. Source & target languages. We use $e$ and $p$ for expressions and predicates respectively.

3.2 Source Language

Our source language, presented in Figure 2a, corresponds to implicit synchronization monitors without explicit locking or signaling. The body of each monitor operation consists of a sequence of so-called Conditional Critical Regions (CCRs) [Hoare 1971], which in turn consist of a waituntil statement followed by one or more regular non-blocking statements. We refer to the predicate of the waituntil statement of a CCR as its guard and to the rest of the statements as its body. A thread executes the body of the CCR atomically if its guard evaluates to true; otherwise it suspends its execution and exits the monitor until the predicate becomes true. More formally, the semantics of our source language are defined via the notion of an implicit monitor history:

Definition 3.2. (Implicit monitor history). Given a set of threads interacting with each other through monitor $M_s = (F, O)$, an implicit monitor history $h_i$ is a sequence $(ccr_1, t_1) \ldots (ccr_n, t_n)$ where each $ccr_i$ is a CCR of $M_s$ and $t_i$ is a thread identifier.

Given history $h_i$, we define an argument mapping $v_i$ to be a list whose $i$’th element maps formal parameters of Method(ccr$_i$) to their actual value for each event $(ccr_i, t_i)$ in $h_i$.

Definition 3.3. (Implicit monitor semantics). Given a monitor $M_s$, initial state $\sigma$, and monitor history $h_i$ with argument mapping $v_i$, the operational semantics of $M$ is defined using a judgment $M_s \vdash (h_i, v_i, \sigma) \Downarrow \sigma'$ indicating that the new monitor state is $\sigma'$ after executing $h_i$ on state $\sigma$.

Because our source language is very similar to the one used in Ferles et al. [2018], we omit a formal definition of the operational semantics. Following that work, we also consider an implicit history to be valid only if it respects the program order of the input monitor.

3.3 Target Language

Figure 2b presents the language of explicit-synchronization monitors. The overall structure of this target language is similar to the source language but with a few important differences. First, an explicit monitor contains locks, conditional variables, and atomic fields, collectively referred to as synchronization variables. Second, CCRs in the target language do not contain waituntil statements; instead, the logic of a waituntil statement is implemented by calling methods on the appropriate condition variable. We assume that synchronization variables support all the standard
3.4 Relating Implicit and Explicit Histories

In order to formalize the correctness of our approach, we need to relate an implicit history $h_i$ of a source monitor $M_i$ with an explicit history $h_e$ of its corresponding target version $M_t$. Because every history of an implicit monitor $M_i$ induces a corresponding history of its explicit version $M_t$, we define an operation called that Expand that "translates" an implicit history to an explicit one. That is, given an implicit history $h_i$ with argument mapping $\nu_i$ and state $\sigma$, Expand$_{M_t}(h_i, \nu_i, \sigma)$ returns a pair $(h_e, \nu_e)$, where $h_e$ is a history of $M_t$ containing all statements executed by $h_i$ under initial state $\sigma$ and $\nu_e$ is the argument mapping for $h_e$.

Example 3.6. Consider the implicit monitor of Figure 3a and its explicit counterpart in Figure 3b. For histories $h_i$ and $h_e$ from Figure 3c we have Expand$_{M_t}(h_i, \nu_i, \sigma) = (h_e, \nu_e)$ for some $\nu_i, \nu_e$.

Using this Expand operation, we can classify explicit histories as being sequential or interleaved:

Definition 3.7. (Sequential history) Let $M_t$ be an explicit monitor implementation of $M_i$. We say that an explicit history $h_e$ of monitor $M_t$ with argument mapping $\nu_e$ is sequential iff there exist a history $h_i$ of $M_i$, argument mapping $\nu_i$, and initial state $\sigma$ such that Expand$_{M_t}(h_i, \nu_i, \sigma) = (h_e, \nu_e)$.

In other words, a sequential history corresponds to an execution in which statements of the explicit monitor are not interleaved between threads.

Example 3.8. Going back to Figure 3c, history $h_e$ is sequential but $h'_e$ is not.

Next, we introduce the notion of well-formed histories, which, intuitively, respect the program order of the original implicit monitor:
Definition 3.9. (Well-formed history) Let $\Pi(h, t)$ be the projection of $h$ onto thread $t$ (i.e., it filters out all elements of $h$ not involving thread $t$). We say that a history $h_e$ of $M_t$ is well-formed iff, for every thread $t$, there exists sequential histories $h_e^1, \ldots, h_e^n$ such that $\Pi(h_e, t) = h_e^1 \cdots h_e^n$.

Intuitively, well-formed histories respect program dependence in the original monitor for every thread. By definition, every sequential history is also well-formed. In the remainder of this paper, we implicitly mean well-formed explicit history whenever we refer to an explicit history.

Example 3.10. Histories $h_e, h'_e$ from Figure 3c are both well-formed. However, the following history is not well-formed because it does not respect program order: $(12. \text{unlock}(), t)(12. \text{lock}(), t)$

Definition 3.11. (Interleaved history) We say that a history $h_e$ of $M_t$ is interleaved iff it is (1) well-formed and (2) not sequential.

Example 3.12. History $h'_e$ from Figure 3c is interleaved.

Next, we define what it means for an explicit history to simulate an implicit history.

Definition 3.13. (Simulation relation). Let $M_t$ be an explicit version of implicit monitor $M$. We say that an explicit history $h_e$ of $M_t$ with argument mapping $v_e$ simulates $(h_i, v_i)$ of $M_t$ on input $\sigma$, denoted $(h_e, v_e) \sim (h_i, v_i)$, if there exist sequential history $h'_e$ and $v'_e$ such that:

1. $\forall t. \Pi(h_e, t) = \Pi(h'_e, t)$ and
2. $\text{Expand}_{M_t}(h_i, v_i, \sigma) = (h'_e, v'_e)$.

In other words, $h_e$ simulates a history of the original monitor if it is a (well-formed) permutation of some sequential history $h'_e$ of the explicit monitor $M_t$.

Example 3.14. Going back to Figure 3c, we have $(h'_e, v') \sim (h_i, v)$ for some $v, v'$.

3.5 Correctness of Explicit-Synchronization Monitors

Using the concepts introduced in the previous section, we now formalize what it means for an explicit monitor to correctly implement an implicit one.

Definition 3.15. (State equivalence) Let $\sigma$ be a program state of an implicit monitor $M_s$ and $\sigma'$ that of an explicit monitor $M_t$. We say that $\sigma$ and $\sigma'$ are equivalent modulo $M_s$, denoted $\sigma \equiv_{M_s} \sigma'$, iff for all $(t, \pi)$ in the domain of $\sigma$, we have $\sigma(t, \pi) = \sigma'(t, \pi)$.

Intuitively, this notion of equivalence between two monitor states ignores any additional synchronization fields and local variables introduced by translating $M$ to an explicit-synchronization monitor. Finally, we can define the correctness of an explicit monitor as follows:

Definition 3.16. (Correctness) We say that an explicit monitor $M_t$ correctly implements an implicit monitor $M_s$, denoted as $M_s \models M_t$, iff for all input states $\sigma_s, \sigma_t$ s.t. $\sigma_s \equiv_{M_s} \sigma_t$, we have:

1. $\forall h_i, v_i. M_s \vdash (h_i, v_i, \sigma_s) \Downarrow \sigma'_s \implies \left( M_t \vdash \text{Expand}_{M_t}(h_i, v_i, \sigma_s, \sigma_t) \Downarrow \sigma'_t \land \sigma'_s \equiv_{M_s} \sigma'_t \right)$

2. $\forall h_e, v_e. M_t \vdash (h_e, v_e, \sigma_t) \Downarrow \sigma'_t \implies (\exists h_i, v_i. (h_e, v_e) \sim (h_i, v_i) \land M_s \vdash (h_i, v_i, \sigma_t) \Downarrow \sigma'_s \land \sigma'_s \equiv_{M_s} \sigma'_t)$

The first correctness condition simply states that $M_t$ does not eliminate any feasible behaviors of $M_s$. The second condition, on the other hand, states that every feasible history of $M_t$ simulates some implicit history that results in the same state. Intuitively, this means that all statement interleavings allowed by $M_t$ provide the illusion that all operations of $M_s$ are executed atomically.

4 MAIN ALGORITHM

In this section, we present our main synthesis algorithm. Specifically, Section 4.1 introduces some preliminary definitions and proves an NP-completeness result to justify the reduction to MaxSAT. Then, Section 4.2 presents the high-level algorithm, Section 4.3 presents the static analysis for inferring safe interleavings, and Sections 4.4 presents the details of the MaxSAT encoding.
4.1 Fragment Dependency Graphs and NP-Completeness

Our main synthesis algorithm is parametrized over a partitioning of the input monitor into code fragments, where each code fragment defines a unit of computation that we need to assign locks to. In this section, we clarify our assumptions about these code fragments and prove the NP-completeness of the problem for a given choice of partition.

First, to define what we mean by a valid partition, we represent each method of the monitor as a standard control-flow graph (CFG), where each atomic statement belongs to its own basic block. Given a control-flow graph \( G \) and node \( n \), we write \( \text{Preds}(G, n) \) to indicate the predecessor nodes of \( n \) in \( G \) and \( \text{Succs}(G, n) \) to indicate its successors. Then, a valid partition of a method into code fragments is defined as follows:

\[
\text{Definition 4.1. (Partition)} \quad \text{Let } G = (V, E) \text{ be the CFG representation of a method. Then, a partition of this method is a set of CFGs } \{G_1, \ldots, G_n\} \text{ with } G_i = (V_i, E_i) \text{ such that:}
\]

1. \( V = \bigcup_{i=1}^{n} V_i \) and \( E_i = E \cap (V_i \times V_i) \)
2. For every \( G_i \), there is at most one node \( n \in V_i \) such that \( \text{Preds}(G, n) \nsubseteq V_i \)
3. Every \( \text{waituntil}(p) \) statement must belong to its own \( G_i \) — i.e., if a node \( n \in V \) is a \( \text{waituntil} \) statement, then there exists a \( G_i = (\{n\}, \emptyset) \)

Intuitively, a partition is a set of sub-CFGs such that (1) these sub-CFGs cover all nodes of the original CFG, (2) each sub-CFG has a unique entry node, and (3) \( \text{waituntil} \) statements belong to their own sub-CFG. We refer to the code snippet represented by each sub-CFG as a code fragment and define a notion of fragment dependency graph (FDG) as follows:

\[
\text{Definition 4.2. (FDG)} \quad \text{Given a method } m \text{ with CFG } G = (V, E) \text{ and a partition of } G \text{ into } \{G_1, \ldots, G_n\}, \text{ a fragment dependency graph (FDG) is a directed acyclic graph } (V', E') \text{ such that}
\]

1. every \( f_i \in V' \) is the code fragment associated with \( G_i \); (2) there is an edge \( (f_i, f_j) \in E' \) iff there is an edge in \( G \) from any exit node of \( G_i \) to the entry node of \( G_j \).

\[
\text{Example 4.3. Figure 4 presents the FDG for method take for the partition in Figure 1a,}
\]

\[
\text{Observe that we require the FDG to be acyclic, so some partitions do not give rise to valid FDGs. In the rest of this paper, we assume that partitions obey this restriction so that all cycles are contained within individual nodes of the FDG. We also lift this notion of FDG from individual methods to entire monitors in the obvious way (i.e., union of all FDGs). As we will see in the next section, our synthesis algorithm operates over FDG representations of monitors.}
\]

Next, we state the following NP-completeness result to justify our MaxSAT encoding:

\[
\text{Theorem 4.4. (NP-Completeness)} \quad \text{Let } G = (V, E) \text{ be an FDG of monitor } M, \text{ and let } \Pi \subseteq V \times V \text{ be a set of fragment pairs that can run in parallel. Then, deciding whether there exists a synchronization protocol with at most } k \text{ locks and that allows all pairs in } \Pi \text{ to run in parallel is NP-Complete.}
\]

\[
\text{Proof. By reduction from the edge clique cover problem [Michael and Quint 2006]. The proof can be found in Appendix B.}
\]

4.2 Synthesis Algorithm

In this section, we describe our core synthesis procedure, which is summarized in Figure 5. At a high level, the \textsc{SynthesizeMonitor} algorithm consists of the following steps. First, it uses the technique of Ferles et al. [2018] to infer signaling operations (line 4). This yields a partially concretized monitor \( M' \) with signaling operations but no locking. Next, it constructs an FDG representation
1: procedure SYNTHESIZE_MONITOR(M)
3:   output: a semantically equivalent explicit-synchronization monitor.
4:   M’ ← PlaceSignals(M) → Use technique of Ferles et al. to infer signaling operations
5:   G ← ConstructFDG(M’)
6:   Nu ← ComputeMaxLocks(G)
7:   S ← StaticAnalyze(G)
8:   opt ← −1
9:   for i ∈ [1, Nu] do
10:      (H, S) ← MaxSATEncoding(M, G, S, i)
11:      (p, v, timeout) ← Solve(H, S)
12:      if timeout ∨ (v ≤ opt) then break
13:         (best, opt) ← (p, v)
14: return Instrument(best, G, M’)

Fig. 5. Main Synthesis Algorithm.

of the resulting monitor M’ as defined in Section 4.1 (line 5). Third, it infers an upper bound Nu on the maximum number of locks that the synthesized code should use (line 6). Then, it statically analyzes the FDG to infer requirements that the synthesized code needs to obey (line 7) and uses the results of the previous steps to generate the MaxSAT encoding (line 11). Finally, it instruments M’ (line 14) using the synchronization protocol inferred using MaxSAT. Since the most involved aspects of this algorithm are the MaxSAT encoding and inference of safe interleavings, we defer a detailed discussion of these to the next two subsections and focus on the rest.

Iterative exploration of lock count. As mentioned above, our synthesis algorithm conceptually reduces the protocol synthesis problem to MaxSAT and uses an off-the-shelf solver to maximize our optimization objective. To achieve this goal, one option is to generate the MaxSAT encoding based on the maximum possible locks (obtained via the call to ComputeMaxLocks) and then let the solver figure out the optimal number of locks to use. However, in practice, such an approach does not scale because the size of the encoding increases with respect to the maximum number of locks allowed. That is, for many realistic problems, the MaxSAT solver fails to terminate within a reasonable time limit if we generate the encoding based on the maximum possible locks. Thus, instead of directly generating a very large MaxSAT formula up front, our SYNTHESIZE_MONITOR procedure enters a loop (lines 9–13) wherein it gradually increases the maximum number of locks allowed (and hence the size of the MaxSAT encoding). If we get to a point where the MaxSAT solver starts timing out (indicated by boolean variable called timeout) or we fail to increase the objective value despite using a larger upper bound on locks (see line 12), then the procedure terminates with the best synchronization policy found so far. While this strategy does not guarantee global optimality, it is much more practical than the alternative.
Signaling operations. Our synthesis algorithm uses an auxiliary procedure called PlaceSignals [Ferles et al. 2018] which yields a monitor \( M' \) that belongs to an intermediate language that is identical to our source language (Figure 2a) except that it contains explicit signaling operations. Specifically, this intermediate language contains two additional signaling directives: (1) \( \text{signal}(p,c) \) which notifies a single thread that is blocked on predicate \( p \) if condition \( c \) holds, and (2) \( \text{broadcast}(p,c) \) which notifies all threads blocked on \( p \) if \( c \) holds. Figure 6 shows the result of calling PlaceSignals on the \texttt{put} procedure from Figure 1a.

FDG construction. Recall that an FDG is a generalized version of a control-flow graph where nodes are code fragments rather than basic blocks, and each code fragment is a unit of computation that our algorithm should assign locks to. Since there can be many ways to partition a given method into code fragments, the ConstructFDG procedure invoked at line 5 of Figure 5 implements a particular heuristic for partitioning a method into code fragments. In particular, the more the number of code fragments, the more the parallelization opportunities; thus, our ConstructFDG procedure tries to maximize the number of code fragments while maintaining the invariant that the FDG is acyclic and that each code fragment must have a unique entry point (see Section 5).

Computing upper bound on locks. Because the MaxSAT encoding assumes a fixed number of locks, our synthesis algorithm calls the ComputeMaxLocks procedure at line 6 to compute an upper bound on the number of locks needed. Given an FDG \( \mathcal{G} = (V, E) \), the key idea behind this procedure is to construct a so-called conflict graph \( \mathcal{G}_C = (V, E_C) \) where \( (f, f') \) is in \( E_C \) iff fragments \( f \) and \( f' \) have a data race. Since it can be shown that the optimal solution to our problem is an edge clique cover [Michael and Quint 2006] of this conflict graph (see Appendix), we can use known theorems (e.g., Mantel’s theorem, Alon [1986] etc.) to obtain an upper bound on the number of locks needed without having to solve an NP-complete problem.1

Static analysis. Recall from Section 2 that the constraints in our MaxSAT encoding utilize information obtained via static analysis. Thus, line 7 of Figure 5 statically analyzes the input monitor to obtain the following three pieces of information:

- **Atomic fields** \( \mathcal{F} \): One of the goals of the analysis is to infer a set of fields that could potentially be implemented using Atomic types. Thus, our static analysis checks whether (a) a field of type \( \text{Atomic} \) has a corresponding AtomicT version, and (b) whether all updates to this field can be implemented using one of the methods provided by AtomicT.

- **Data races** \( \mathcal{R} \): The second goal of our static analysis is to identify pairs of fragments that would have a data race if they do not use a shared lock. Thus, given a pair of fragments \( (f, f') \), our static analysis checks whether \( f \) writes to a memory location \( l \) that is accessed in \( f' \).

- **Interleaving opportunities** \( I \): Finally, a third key goal of the analysis is to identify safe interleaving opportunities between fragments. Since this aspect of the analysis is quite involved, we discuss it in the next subsection.

MaxSAT encoding. As mentioned in Section 2, our MaxSAT encoding uses two types of boolean variables, namely (1) \( h_{ij}^f \) indicating that fragment \( f \) must hold lock \( l_j \) and (2) \( a_f \) indicating that field \( f \) should be converted to atomic. Hence, a model of the MaxSAT problem can be easily converted to a so-called locking protocol \( (L, A, P) \), where \( L \) is an assignment from fragments to a set of locks, \( A \) is a set of fields that should be implemented using atomic types, and \( P \) is a mapping from

1In our implementation, we use multiple upper bounds using known theorems and return the best one.

```java
void put(Object o) {
    waituntil(count < queue.length);
    boolean wasEmpty = count == 0;
    queue[last] = o;
    count++;
    last = (last + 1) % queue.length;
    broadcast(count == 0, wasEmpty);
}
```

Fig. 6. Method put with explicit signals.
waituntil guards to locks. In particular, we have \( l_j \in \mathcal{L}(f_i) \) if and only if \( h_j^{1/} \) is assigned to true in the model returned by the MaxSAT solver, and we have \( \text{fld} \in \mathcal{A} \) if \( a_{\text{fld}} \) is assigned to true. Due to the constraints in our MaxSAT encoding, it is similarly easy to derive \( \mathcal{P} \): because our encoding ensures that every occurrence of a waituntil statement is protected by the same set of locks \( S \), we associate one of the locks \( l \) in \( S \) with the condition variable introduced for predicate \( p \).

**Instrumentation.** The last step of our algorithm is to synthesize the explicit-synchronization monitor via the Instrument procedure invoked at line 14. Given a synchronization protocol \((\mathcal{L}, \mathcal{A}, \mathcal{P})\), the Instrument procedure performs the following steps:

1. First, it introduces all the synchronization fields (locks, condition variables, and atomic fields) that appear in the protocol.
2. It converts every update to an atomic field to the corresponding atomic update statement.
3. Finally, it introduces all the necessary locking and signaling operations to implement the synthesized synchronization protocol.

We refer the interested reader to Appendix D for more details on the instrumentation.

**Theorem 4.5. (Correctness)** Given an implicit-synchronization monitor \( M \) in the language of Figure 2a, if \( \text{SYNTHESIZEMONITOR}(M) \) returns \( M' \), then we have \( M \sim M' \).

**Proof.** Can be found in Appendix E. \( \square \)

### 4.3 Analysis to Identify Safe Interleavings

We now describe how to infer safe interleaving opportunities between threads while ensuring that monitor operations appear to take place atomically. Given a fragment dependency graph \( \mathcal{G} = (V, E) \) for a monitor \( M \), an interleaving opportunity (or interleaving for short) is a pair \((v, e)\) where \( v \in V \) is a code fragment of \( M \) and \( e = (v_1, v_2) \in E \) is an edge of the FDG. Intuitively, such an interleaving is safe if some thread can execute \( v \) in between some other thread’s execution of \( v_1 \) and \( v_2 \) without violating atomicity. The goal of our static analysis is to identify a set \( I \) of such safe interleavings. In what follows, we formalize safe interleavings and describe an analysis for identifying them.

**Formalizing safe interleavings.** To formalize the notion of safe interleaving, we need to keep track of which fragments of the monitor were executed in what order. For this purpose, given an FDG \( \mathcal{G} = (V, E) \) of \( M \), we define a fragmented monitor \( M_\mathcal{G} \) to be the same as \( M \) except that every fragment in \( \mathcal{G} \) is placed in its own method. Observe that histories of \( M_\mathcal{G} \) encode all possible interleavings of fragments in \( \mathcal{G} \). In this sense, histories of \( M_\mathcal{G} \) are similar to explicit monitor histories but are slightly higher level in that they allow interleavings between fragments rather than atomic statements. Thus, we adapt the same notions of sequential, well-formed, and interleaved histories from Section 3.3 to fragmented monitors, as illustrated by the following examples.

**Example 4.6.** Given monitor \( M \) from Figure 1a, its fragmented version \( M_\mathcal{G} \) splits \( \text{put} \) and \( \text{take} \) into four different methods, each named \( \text{put}_i \) and \( \text{take}_i \). Given history \( h = (\text{take}, t) \) and initial state \( \sigma \) with a non-empty queue, we have

\[
\text{Expand}_{M_\mathcal{G}}(h, v, \sigma) = (((\text{take}_1, t)(\text{take}_2, t)(\text{take}_3, t)(\text{take}_4, t), v'))
\]

where \( \text{take}_1, \ldots, \text{take}_4 \) denote fragments 5-8 in Figure 1a and \( v, v' \) are empty argument mappings.

---

2Specifically, when choosing which lock \( l \) in set \( S \) to designate as the representative, we choose the smallest lock in \( S \) according to the total order. Because all locks held by a thread must be released before it blocks on a condition variable and must be acquired after it gets notified (with method \( \text{wait} \) releasing and acquiring \( l \) internally), choosing the smallest lock prevents deadlocks.

---

Example 4.7. In the example above, Expand$_V$$_G$ $(h, v, \sigma)$ is both sequential and well-formed. However, $(take_1, t), (take_2, t)$ is not well-formed because it does not involve all four methods, and $(take_1, t), (take_3, t), (take_2, t), (take_4, t)$ is also not well-formed because it executes take$_3$ before take_2. Finally, the following history is an interleaved (and, by definition, well-formed) history where threads $t$ and $t'$ execute method take and put respectively:

$$h_G = (put_1, t)(put_2, t)(put_3, t)(take_1, t')(put_4, t)(take_2, t')(take_3, t')(take_4, t')$$ (1)

Furthermore, for this history we have $(h_G, v_G) \sim ((put, t)(take, t'), v)$ for some $v_G$ and $v$. That is, $h_G$ simulates a history of $M$ where thread $t$ executes method put and $t'$ executes take.

**Definition 4.8. (Interleaving)** Given an FDG $G = (V, E)$ for monitor $M$, an interleaving is a pair $(v, e)$ where $v \in V$ and $e \in E$. Furthermore, given a history $h$ of fragmented monitor $M_G$, we write $X(h_G)$ to denote the set of all interleavings that occur in $h$.

Example 4.9. For the history $h_G$ from Eq. 1, we have:

$$X(h_G) = \{(take_1, (put_3, put_4)), (put_4, (take_1, take_2))\}$$

This is the case because this history executes take$_1$ in between consecutive fragments put$_3$ and put$_4$ of some other thread. Similarly, we have $(put_4, (take_1, take_2)) \in X(h_G)$ because it executes put$_4$ in between take$_1$ and take$_2$.

**Definition 4.10. (Safe interleavings).** Let $G$ be an FDG of monitor $M$. We say that a set of interleavings $S$ is safe, if for every input state $\sigma$ and every interleaved history $h_G$ of $M_G$ we have:

$$\text{If } X(h_G) \subseteq S \text{ and } M_G \vdash (h_G, v_G, \sigma) \Downarrow \sigma' \text{ then } \exists h, v. (h_G, v_G) \sim (h, v) \text{ and } M \vdash (h, v, \sigma) \Downarrow \sigma'$$

In other words, a set of interleavings $S$ is safe if for every interleaved history of $h_G$ whose interleavings are a subset of $S$ we can prove that $h_G$ leads to the same final state as some history $h$ of $M$ where $h$ simulates $h_G$. This definition essentially lifts the second correctness criterion of Definition 3.16 to a fragmented monitor.

**Inferring Safe Interleavings.** We now turn our attention to the problem of inferring safe interleavings. Given a monitor $M$ and its FDG $G = (V, E)$, our goal is to find a set $I \subseteq V \times E$ such that all interleavings in $I$ are safe. However, a key challenge is that the space of all safe interleavings is exponential (i.e., the power set of $V \times E$), so, even if we had a procedure for checking whether some set $I$ is safe, enumerating all candidates would be computationally intractable.

To overcome this challenge, we introduce the notion of strong safety that allows us to build $I$ iteratively. In particular, note that if $S_1$ and $S_2$ are both safe interleaving sets according to Definition 4.10, it may not be the case that $S_1 \cup S_2$ is also a safe interleaving. However, to build $I$ incrementally, we need a notion of safe interleaving that is closed under union. For this purpose, we introduce a notion of strong safety for a single interleaving $(v, e)$. Since strongly safe interleavings enjoy the property of being closed under union, this notion lends itself to a computationally feasible technique for computing safe interleaving sets. In the remainder of this section, we define strong safety and present our static analysis for computing safe interleaving sets. Towards this goal, we first introduce the notions of left and right commutativity for our context:

**Definition 4.11. (Left/Right Commutativity).** Given fragments $v$ and $v'$, we say that $v$ left commutes with $v'$, denoted $\text{LeftCommute}(v, v')$, iff, whenever $M_G \vdash ((v', t')(v, t), v, \sigma) \Downarrow \sigma'$ holds, so does $M_G \vdash ((v, t)(v', t'), v, \sigma) \Downarrow \sigma'$. Conversely, $v$ right commutes with $v'$, denoted $\text{RightCommute}(v, v')$, iff $M_G \vdash ((v, t)(v', t'), v, \sigma) \Downarrow \sigma'$ implies $M_G \vdash ((v', t')(v, t), v, \sigma) \Downarrow \sigma'$.
In other words, a fragment \( v \) left commutes with \( v' \) if, whenever \( v \) executes just after \( v' \), the resulting state is the same as if \( v \) had executed just before \( v' \). For example, \( f_6 \) (i.e. \( \text{count}++ \)) in Figure 1a left-commutes with \( f_3 \) since increasing \( \text{count} \) right after \( \text{waituntil} \) \((\text{count} > 0) \) is equivalent to increasing \( \text{count} \) just before \( \text{waituntil} \) \((\text{count} > 0) \). That is, assuming that \( \text{waituntil} \) \((\text{count} > 0) \) was not blocked before executing \( \text{count}++ \), then it will still not be blocked after executing \( \text{count}++ \).

However, \( f_3 \) does not left-commute with \( f_1 \) when \( \text{count} \) equals \( \text{queue}.\text{length} - 1 \), incrementing \( \text{count} \) just after \( \text{waituntil} \) \((\text{count} < \text{queue}.\text{length}) \) is not equivalent to incrementing \( \text{queue}.\text{length} \) before the \( \text{waituntil} \) statement. That is, if \( \text{waituntil} \) \((\text{count} < \text{queue}.\text{length}) \) did not block before executing \( \text{count}++ \), we cannot guarantee that it also does not block after executing \( \text{count}++ \).

Next, we use this notion of left and right commutativity to define strong safety:

**Definition 4.12. (Strong safety).** Let \( \mathcal{G} = (V, E) \) be an FDG for monitor \( M \), and let \( E^* \) denote the reflexive transitive closure of \( E \). We say that an interleaving \((v, e), \) where \( e = (v_s, v_t) \), is strongly safe if the following conditions are satisfied:

1. \( \forall v^-. (v^-, v_s) \in E^* \implies \text{LeftCommute}(v, v^-) \)
2. \( \forall v^+. (v_t, v^+) \in E^* \implies \text{RightCommute}(v, v^+) \)

That is, an interleaving \((v, e)\) is said to be strongly safe if we can prove that fragment \( v \) left commutes with *every* possible predecessor of \( v_s \) and that it right commutes with *every* possible successor of \( v_t \). To see why these conditions imply safety, recall that a set of interleavings \( S \) is safe if, for any history \( h_G \) whose interleavings are a subset of \( S \), we can find some (sequential) history of the original monitor that simulates \( h_G \). Assuming \( S \) contains only strongly safe interleavings, we can create such a sequential history by “removing” interleavings one at a time from \( h_G \). For instance, let \( \chi = (v, (v_s, v_t)) \in S \) be an interleaving that occurs in \( h_G \). Since \( \chi \) is strongly safe, we can always obtain an equivalent history \( h'_G \) that has strictly less interleavings than \( h_G \) by commutating \( v \) past either every successor of \( v_t \) or every predecessor of \( v_s \) that appears in \( h_G \).

**Example 4.13.** For the monitor from Figure 1a, we can show that every interleaving \((v, e)\) where \( v \) belongs to method \( \text{take} \) and edge \( e \) belongs to method \( \text{put} \) (and vice versa) is strongly safe. However, none of the interleavings where \( v \) and \( e \) belong to the same method are strongly safe. Finally, because both of the interleavings of the history \( h_G \) from Eq. 1 are strongly safe, we can derive a sequential history that simulates history \( h_G \) by swapping \((\text{take}_1, t') \) with \((\text{put}_4, t) \).

We now state a key theorem that underlies the correctness of our approach:

**Theorem 4.14.** Let \( \mathcal{G} \) be an FDG and let \( \chi_1, \ldots, \chi_n \) be strongly safe interleavings. Then, \( \{\chi_1, \ldots, \chi_n\} \) satisfies Definition 4.10 (i.e., is a safe interleaving set for \( \mathcal{G} \)).

**Proof.** Can be found in Appendix E. \(\square\)

**Static analysis algorithm.** Finally, we conclude this section by presenting our static analysis algorithm (shown in Figure 7) for computing a set \( I \) of safe interleavings. At a high level, this algorithm identifies which \((v, e)\) pairs are strongly safe and then returns their union, which by Theorem 4.14, corresponds to a safe interleaving set. To check whether an interleaving \((v, e)\) for \( e = (v_s, v_t) \) is strongly safe, we must check if \( v \) left commutes with each predecessor of \( v_s \) and right commutes with each successor of \( v_t \). As shown in the \text{LeftCommute} procedure, we reduce the verification of left commutativity to the problem of verifying a Hoare triple. In particular, given fragments \( v, v' \), we generate a code snippet \( S_L; S_R \) where \( (1) \) \( S_L \) is an alpha-renamed version of \( v' ; v \) with \( \text{waituntil} \)'s replaced by \( \text{assume} \) statements, and \( (2) \) \( S_R \) is an alpha-renamed version of \( v ; v' \) with \( \text{waituntil} \)'s replaced by \( \text{assert} \) statements. Note that we turn \( \text{waituntil} \)'s in \( S_L \) into \( \text{assumes} \) because the definition of left commutativity assumes that \( v' ; v \) has terminated. On the
The second type of variable used in our encoding is of the form \( a \). Values of variables are the same at the end, assuming that they are the same in the beginning.

Waituntil statement evaluates to true under the assumption that they also evaluate to true in the results of the static analysis, and (c) an upper bound on the maximum number of locks, and (d) the space available to hold locks. In this section, we describe our MaxSAT encoding which is formalized as inference rules in Figure 8 for generating these constraints in more detail.

Recall that the encoding procedure takes as input (a) an FDG representation of the monitor, (b) an FDG representation of the monitor, (c) an upper bound on the maximum number of locks, and (d) the space available to hold locks. In this section, we describe our MaxSAT encoding which is formalized as inference rules in Figure 8 for generating these constraints in more detail.

Variables. Our MaxSAT encoding uses two types of variables. First, we introduce variables of the form \( h_{ij} \) indicating that fragment \( v_i \) needs to hold lock \( l_j \). Thus, given an FDG with \( n \) vertices and an upper bound \( N \) on the number of locks, our encoding contains \( n \times N \) such variables.

The second type of variable used in our encoding is of the form \( f_{ld} \) indicating that \( fld \) should be implemented using an atomic type.

Mutex encoding. Given a set of fragments \( F \) and an upper bound \( N \) on the number of locks, we often need to enforce that all fragments in \( F \) share at least one of the \( N \) possible locks. We write

```
1: procedure FINDSAFEINTERLEAVINGS(\( G \))
2:   input: An FDG representation \( G = (V, E) \) of monitor \( M \)
3:   output: A set \( I \) of all safe interleavings
4:   \( I \leftarrow \emptyset \)
5:   for \( v \in V, e = (v_s, v_t) \in E \) do
6:     \( V_s^* \leftarrow \{ v' \mid (v', v_s) \in E^* \} \) \quad \rightarrow \text{All predecessor vertices that reach } v_s.
7:     \( V_t^* \leftarrow \{ v' \mid (v_t, v') \in E^* \} \) \quad \rightarrow \text{All successor vertices of } v_t.
8:     if \( (\forall v_s^* \in V_s^*. \text{LEFTCOMMUTE}(v, v_s^*)) \land (\forall v_t^* \in V_t^*. \text{LEFTCOMMUTE}(v_t^*, v)) \) then
9:       \( I \leftarrow I \cup \{(v, e)\} \)
10:  return \( I \)
11: function LEFTCOMMUTE(\( v, v' \))
12:   input: Two fragments \( v, v' \)
13:   output: true if \( v \) left commutes with \( v' \)
14:   \( X \leftarrow \{ x \mid x \text{ is a variable in } v \text{ or } v' \} \)
15:   \( X_L \leftarrow \{ x_l \text{ fresh name } | x \in X \} \quad X_R \leftarrow \{ x_r \text{ fresh name } | x \in X \} \)
16:   \( S_L \leftarrow (v'; v)[\text{assume/waituntil}, X_L/X] \quad S_R \leftarrow (v; v')[\text{assert/waituntil}, X_R/X] \)
17: return Verify(\( \{X_L = X_R\}; S_L; S_R \{X_L = X_R\} \))
```

Fig. 7. Algorithm to find all safe interleavings.
These hard constraints.

Next, we describe the hard constraints generated by our MaxSAT encoding.

Fig. 8. Inference rules for MaxSATEncoding(M, G, S, N) procedure. G = (V, E) is an FDG of monitor M, S = (F, R, I) are the results of the static analysis, and N is an upper bound on the number of locks. Predicate IsFrag(v) is true if v ∈ V, IsEdge(e) if e ∈ E, and SafeInterleaving(v, e) if (v, e) ∈ I. Relations Methods(M) and Preds(M) return all methods of monitor M and all predicates that appear as an argument of a waituntil statement in M respectively.

Mutex(F, N) to denote this requirement. In particular, as shown at the bottom of Figure 8, this is defined as \( \text{Mutex}(F, N) = \bigvee_{i=1}^{N} \bigwedge_{f \in F} h_{f}^{i} \).

**Hard constraints.** Next, we describe the hard constraints generated by our MaxSAT encoding. These hard constraints H correspond to correctness requirements on the synthesized protocol and include (1) data race freedom, (2) correct signaling and deadlock freedom and (3) atomicity. Specifically, the first two rules in Figure 8 deal with data race freedom, the next rule deals with atomicity, and the last two rules deal with deadlock freedom and correct signaling.

**Race-1.** The first rule, labeled Race-1, deals with data race freedom of two fragments that have a data race on a single monitor field f. The premises of this rule stipulate that \( v_1, v_2 \) are fragments that race only on field f which can be converted to atomic (i.e., this.f ∈ F). In this case, we prevent data races by either (1) enforcing that \( v_1, v_2 \) share a lock (the Mutex constraint) or (2) ensuring that field f is converted to an atomic field.

**Race-2.** The next Race-2 rule prevents data races between fragments where the data race cannot be resolved by making one of the fields atomic. In particular, given two fragments \( v_1, v_2 \) that have a data race, this rule simply enforces that they share a common lock via the Mutex function.
**I-LEAVE.** The next rule generates constraints to ensure that monitor operations appear to take place atomically. In particular, if the static analysis cannot prove \((v, e)\) to be a strongly safe interleaving (recall Definition 4.12), then we need to ensure that a thread cannot execute \(v\) when some other thread is executing \(e\). To do so, we ensure that \(v, v_s, v_t\) all share a common lock by generating a Mutex hard constraint for these three fragments.

**L-ORDER.** The next rule, labeled L-Order, ensures that the resulting synchronization protocol is deadlock-free. Specifically, for every edge \(e = (v_s, v_t)\) in the input FDG, this rule generates a hard constraint, via LockOrder\((e, N)\) (defined at the bottom of Figure 8), that ensures that every lock acquisition respects the total order on locks. In particular, for every pair of locks \(l, u\) such that \(l < u\), LockOrder\((e, N)\) prevents the synchronization protocol from violating the global order on locks. Recall that a protocol violates this global order if it acquires the “smaller” lock \(l\) between \(v_s\) and \(v_t\) while both \(v_s\) and \(v_t\) hold lock \(u\). Thus, the hard constraint generated by LockOrder\((e, N)\) prevents this from happening.

**Example 4.15.** Assuming \(N = 2\), this rule generates \(¬(h^l_{v_s} \land h^u_{v_t} \land ¬h^l_{v_t} \land h^u_{v_s})\) for edge \((v_s, v_t)\).

**WAIT.** The last hard constraint rule, called WAIT, is used for associating a single lock with each condition variable. In particular, since all fragments of the form waituntil\((p)\) must hold the same set of locks, this rule generates two hard constraints for every waituntil predicate \(p\) of the input monitor: (1) a mutex constraint for all waituntil\((p)\) fragments and (2) a constraint that enforces that all waituntil\((p)\) fragments must share all common locks.

**Soft Constraints.** As discussed earlier, our goal is to generate a synchronization protocol that is not only correct-by-construction but one that also results in efficient code. Hence, as a proxy metric for efficiency, we want to (1) minimize the number of locks and atomic fields that are introduced, and (2) maximize the number of fragments that can run in parallel. The remaining three rules in Figure 8 introduce soft constraints to encode this optimization objective.

**Min-Lock.** The rule labeled Min-Lock is used for minimizing the number of locks. However, instead of simply minimizing the total number of locks used by the protocol, the soft constraints generated by this rule minimize the number of locks used per method. Even though this is not equivalent to minimizing the number of locks used by the entire protocol, we have found this approach to synthesize protocols with a more even distribution of locks among the monitor methods. In practice, such protocols are more desirable because they avoid scenarios where a subset of the methods incur a higher synchronization cost than others. Specifically, this rule generates a soft constraint for every lock \(l \in \{l_1...l_N\}\) and every method \(m\) of \(M\) and asserts that none of the fragments in \(m\) hold lock \(l\).

**Min-Atom.** The Min-Atom rule generates soft constraints to minimize the number of fields that are made atomic by asserting that \(a_{fld}\) is assigned to false.

**Max-Par.** The last rule called Max-Par generates soft constraints to maximize parallelism. Specifically, for every pair of fragments \((v, v')\) that do not have data races, we add a soft constraint stating that \(v\) and \(v'\) do not share any locks.

We conclude this Section with a theorem that states the correctness of our MaxSAT encoding.

**Theorem 4.16.** Let \(m\) be a model of the generated MaxSAT instance and \((\mathcal{L}, \mathcal{A}, \mathcal{P})\) be the synchronization protocol constructed as follows:

\[
\mathcal{L} = \left\{ v \mapsto \left\{ l \mid m[h^l_v] \right\} \right\} \quad \mathcal{A} = \left\{ f_{1d} \mid m[a_{fl}] \right\} \quad \mathcal{P} = \left\{ p \mapsto I_i \mid IsWait(v, p), i = \min(\left\{ j \mid m[h^l_j] \right\}) \right\}
\]

where, \(\text{IsWait}(v, p)\) is true if \(v\) is a wait until statement on \(p\). Then, \((L, A, P)\) is a correct synchronization protocol.

**Proof.** Can be found in Appendix E. \(\square\)

5 IMPLEMENTATION

We have implemented our approach in a tool called **Cortado** that emits explicit-synchronization monitors in Java. **Cortado** is based on the Soot program analysis infrastructure [Vallée-Rai et al. 1999] and the Z3 SMT solver [De Moura and Bjørner 2008]. In particular, we use Soot to perform various kinds of static analyses needed by our method (e.g., pointer analysis) and to translate the input monitor to an explicit-synchronization monitor in Java. Furthermore, we leverage Z3 for solving MaxSAT instances and discharging the validity queries that arise when checking commutativity between fragments. In the remainder of this section, we discuss several design choices and optimizations that were not discussed previously.

**Weights of soft constraints.** As expected, the quality of the synthesized protocol depends on the model returned by the MaxSAT solver. In practice, we have observed certain types of soft constraints to be more important than others for efficiency. Thus, our implementation assigns different weights for different classes of soft constraints. For instance, because it is always preferable to use an atomic field instead of a lock, **Cortado** assigns a higher weight to soft constraints generated by rule **Min-Atom** from Figure 8 than the ones generated by rule **Min-Lock**.

**Constructing FDGs.** As mentioned in Section 4, **Cortado** uses a heuristic to partition the input CFG into fragments. The goal of this heuristic is to maximize parallelization opportunities while ensuring that the partition results in a valid FDG according to Definition 4.2. Our heuristic places every loop in its own fragment (to make sure that the FDG is well-formed) and, for code outside loops, **Cortado** creates a new fragment whenever it detects an update to monitor state (i.e., \(\text{this.fld} = *\)). In practice, we found this heuristic to achieve a good balance between the number of parallelization opportunities and the size of the resulting FDG.

**Static analysis optimization.** Our approach uses an off-the-shelf pointer analysis to detect which pairs of FDG fragments do not have a data race (and, so, can run in parallel). However, such an approach, based on pointer analysis alone, often leads to imprecision. For example, Soot’s pointer analysis cannot prove that fragments 2 and 6 in Figure 1a do not contain any races, as it does not reason about individual array elements. Therefore, in order to increase the precision of the static analysis, **Cortado** implements an SMT-based static analysis on top of Soot’s built-in pointer analysis and generates appropriate verification conditions (similar to the ones generated by Gurfinkel et al. [2015]) to prove that memory accesses of two fragments are disjoint.

6 EVALUATION

We evaluated **Cortado**’s ability to generate fine-grained locking protocols by performing a set of experiments that are designed to answer the following research questions:

**RQ1** How does the code generated by **Cortado** compare against explicit-synchronization monitors written by experts?

**RQ2** How does the technique implemented in **Cortado** compare against other compile-time state-of-the-art approaches targeting implicit-synchronization monitors?

3 An ablation study that demonstrates the need for adjusting the weights of soft constraints can be found in Appendix A.3.

4 An ablation study that justifies the design of this heuristic can be found in Appendix A.2.
RQ3 How does the static analysis for inferring safe interleavings impact the quality of the code generated by CORTADO?

RQ4 How long does CORTADO take to synthesize code and how complex are the resulting protocols?

To answer these research questions, we conducted experiments on ten explicit-synchronization monitors from popular open source repositories. Aside from CORTADO, we consider two additional baselines, described below, that aid us in answering our second and third research questions.

**Benchmarks.** The benchmarks used in our evaluation are collected from popular open source GitHub repositories. We wrote a crawler (based on GitHub’s REST API [GitHub 2022]) to automatically identify candidate explicit-synchronization monitors implemented in Java by searching for keywords like lock, unlock, await, etc. We then manually inspected class files returned by the crawler in decreasing order of GitHub popularity (stars and forks) and identified self-contained monitor-style classes that encapsulate shared state accessed by multiple threads. We included such a monitor in our benchmarks only if it satisfies the following conditions: (1) the class has a well-defined API for client threads and (2) it contains parallelization opportunities that can be realized via fine-grained locking.\(^5\) We manually isolated the shared state and monitor methods of the class file to obtain a standalone explicit-synchronization monitor and then manually translated it to an equivalent implicit monitor. To convert a benchmark to an implicit monitor, we simply removed all synchronization code (i.e., locking and signaling operations) and introduced appropriate waituntil statements. In total, we collected 10 monitors from popular repositories such as Spring Framework (a Java-based framework for creating enterprise applications), Java JDK, Apache Spark (an analytics engine for large-scale data processing), etc.\(^6\)

**Baselines.** As mentioned above, our evaluation uses two additional baselines in order to answer RQ2 and RQ3. To compare against other compile-time techniques (RQ2), we evaluate Expresso [Ferles et al. 2018], a tool that addresses the same problem as this paper but generates an explicit signal monitor using a single global lock shared by all monitor methods. To evaluate the importance of our static analysis (RQ3), we created an ablated version of CORTADO, called ABLATED, which uses a very coarse analysis to infer safe interleavings. This ablated version considers \((v, (v_s, v_t))\) a safe interleaving only if \(v\) does not have any data races with any predecessor (resp. successor) of \(v_s\) (resp. \(v_t\)). This is a sufficient condition for strong safety, but it only requires checking data races rather than discharging a set of Hoare triples.

**Evaluating performance.** Following prior work [Ferles et al. 2018; Hung and Garg 2013], we evaluate the performance of each monitor implementation by performing saturation tests [Cherem et al. 2008] wherein threads perform monitor operations without doing any additional work. We collect our performance measurements using the Java Microbenchmark Harness (JMH) [Shipilev et al. 2021]. All measurements are conducted on a 112-way (56-core × 2 SMT) Intel Xeon CPU W-3275 2.50GHz with 256 GB of memory using JDK 1.8.0_272. In this section, we present results for each benchmark for up to 128 threads, chosen as an arbitrary stopping point past the total number of hyper-threads. Results for up to 256 threads can be found in Appendix A.

\(^5\)If a monitor does not contain parallelization opportunities, our technique generates code equivalent to that synthesized by Ferles et al. [2018]. Since the goal of our evaluation is to evaluate CORTADO’s ability to generate fine-grained locking protocols, we did not include benchmarks from prior work [Ferles et al. 2018; Hung and Garg 2013] that do not contain such parallelization opportunities.

\(^6\)All benchmarks are publicly available here: https://github.com/utopia-group/cortado
6.1 Performance Results

Figure 9 plots the average time taken per monitor method invocation (i.e., milliseconds/operation) against the number of threads. In what follows, we analyze the plots in more detail and present several conclusions drawn from these results. Because our benchmarks only contain monitors where fine-grained locking is beneficial, we emphasize that our conclusions only apply to such monitors.

Comparison against hand-written implementations (RQ1). For every benchmark, the explicit synchronization monitor generated by CORTADO performs better than the expert-written implementation as the number of threads increases. In particular, CORTADO-synthesized code performs on average 3.7× and up to 39.1× times faster than the original code.

Comparison against Expresso (RQ2). CORTADO-generated explicit monitors perform better than EXPRESSO explicit monitors generated from the same implicit specification on all benchmarks. CORTADO-synthesized code outperforms EXPRESSO by 4.0× on average (and up to 48.7×).

Comparison against Ablated (RQ3). Finally, we analyze how CORTADO compares to its simplified version, Ablated, which does not use the results of the safe interleavings analysis from Section 4.3. In five cases, the code generated by Ablated is equivalent to the code generated by EXPRESSO and therefore worse than CORTADO. In two other cases (PausableThreadPoolExecutor and ProgressTracker), Ablated generates code different from both EXPRESSO and CORTADO. For PausableThreadPoolExecutor, the code generated by Ablated is slower than that of EXPRESSO because many of the operations it parallelizes are very cheap, so the overhead of extra locks outweighs their benefit. On the other hand, our static analysis detects several safe interleavings which enable CORTADO to synthesize a protocol with cheaper synchronization operations. Finally, for the remaining three cases, the code generated by Ablated matches the one generated by CORTADO. This ablation study demonstrates that the safe interleaving analysis from Section 4.3 helps extract additional concurrency on five of our benchmarks.

6.2 Synthesis Time & Protocol Complexity

To evaluate the cost and complexity of synthesizing code with CORTADO (RQ4), Table 1 summarizes its running time and presents some statistics about the synthesized protocols. For each benchmark, we report the running time for the various phases of the tool: pointer analysis with Soot, signal placement with EXPRESSO, and synthesis with CORTADO. We also report the number of locks and atomic fields in the synthesized protocol.

Table 1 shows that CORTADO terminates in under one minute for all but two benchmarks. For these two outliers, the synthesis time is dominated by EXPRESSO’s monitor invariant inference, which is necessary for signal placement. Overall, CORTADO is able to extract better performance than EXPRESSO alone with only a small additional compile-time cost.

The last three columns in Table 1 provide statistics about the synthesized explicit monitors. Most monitors benefit from CORTADO’s ability to introduce atomic fields, which reduces the overhead of operations on monitor state that would otherwise require a lock. The lines of code (LOC) results show that CORTADO synthesizes explicit monitors that are on average 1.7× larger than their implicit specifications.

7In order to handle outliers such as in JobWrapper, for all reported aggregate speedups (max, mean, etc.) we throw out data points with a z-score greater than two.
Fig. 9. Performance Results For All Tools. The y-axis is in log scale and time measurements are in milliseconds. The shadowed regions surrounding each line present 99.9% confidence interval of each measurement.
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>LOC</th>
<th>Soot</th>
<th>Expresso</th>
<th>Cortado</th>
<th>Total</th>
<th>#Lock/#Op</th>
<th>#Atomic/#Op</th>
<th>LOC</th>
</tr>
</thead>
<tbody>
<tr>
<td>ArrayBlockingQueue</td>
<td>287</td>
<td>16.7</td>
<td>1897.7</td>
<td>372.7</td>
<td>2302.5</td>
<td>2 / 18</td>
<td>1 / 25</td>
<td>514</td>
</tr>
<tr>
<td>ConcurrencyThrottleSupport</td>
<td>33</td>
<td>17.0</td>
<td>1.4</td>
<td>0.2</td>
<td>18.7</td>
<td>1 / 1</td>
<td>1 / 4</td>
<td>68</td>
</tr>
<tr>
<td>CountableThreadPool</td>
<td>54</td>
<td>20.4</td>
<td>0.8</td>
<td>0.2</td>
<td>21.5</td>
<td>1 / 1</td>
<td>1 / 5</td>
<td>85</td>
</tr>
<tr>
<td>JobWrapper</td>
<td>33</td>
<td>17.0</td>
<td>0.6</td>
<td>0.1</td>
<td>17.7</td>
<td>1 / 1</td>
<td>1 / 3</td>
<td>63</td>
</tr>
<tr>
<td>PausableThreadPoolExecutor</td>
<td>79</td>
<td>20.7</td>
<td>0.8</td>
<td>0.9</td>
<td>22.5</td>
<td>3 / 4</td>
<td>2 / 9</td>
<td>122</td>
</tr>
<tr>
<td>ProgressTracker</td>
<td>65</td>
<td>0.7</td>
<td>6.8</td>
<td>0.2</td>
<td>8.0</td>
<td>2 / 7</td>
<td>1 / 4</td>
<td>119</td>
</tr>
<tr>
<td>RealmThreadPoolExecutor</td>
<td>34</td>
<td>18.4</td>
<td>0.3</td>
<td>0.1</td>
<td>18.7</td>
<td>1 / 1</td>
<td>1 / 3</td>
<td>61</td>
</tr>
<tr>
<td>RoundTripWorker</td>
<td>62</td>
<td>16.7</td>
<td>3.1</td>
<td>0.4</td>
<td>20.4</td>
<td>2 / 4</td>
<td>1 / 4</td>
<td>103</td>
</tr>
<tr>
<td>SinkQueue</td>
<td>75</td>
<td>15.7</td>
<td>981.4</td>
<td>34.6</td>
<td>1031.8</td>
<td>2 / 4</td>
<td>1 / 8</td>
<td>131</td>
</tr>
<tr>
<td>WSDataListener</td>
<td>158</td>
<td>18.3</td>
<td>5.2</td>
<td>1.0</td>
<td>25.1</td>
<td>4 / 11</td>
<td>0 / 0</td>
<td>222</td>
</tr>
</tbody>
</table>

Table 1. Synthesis time for each phase of Cortado and summaries of the synthesized protocols. LOC is lines of code. Soot indicates pointer analysis time, Expresso is the time for monitor invariant generation and signal placement, and Cortado shows the additional time on top of Soot and Expresso.

7 RELATED WORK

Monitor abstractions. The notion of monitors as an organizing abstraction for concurrent programming originates with Hoare [1974] and Hansen [1973]. Monitors offer the same synchronization facilities as semaphores—the ability to coordinate multiple threads and enforce mutual exclusion—but encapsulate the state protected by those facilities and automate mutual exclusion when entering and exiting the monitor’s operations. Lampson and Redell [1980] further extended the monitor abstraction in the Mesa programming language to handle spurious wake ups.

These early monitor abstractions are explicit-signal monitors in the taxonomy of Buhr et al. [1995] because they require the programmer to explicitly insert condition variables and signalling operations to coordinate threads within the monitor. This requirement places both a safety and a liveness burden on the programmer: they must place signals correctly to preserve invariants about the monitor’s state, but must also insert enough signals to avoid deadlock. An alternative is to use an implicit-signal (or automatic-signal) monitor, in which signals are inserted automatically by the compiler, language runtime, or operating system. Hoare [1971] proposed the notion of conditional critical regions (CCRs), which allow for monitor operations to block until a guard predicate over the monitor state is satisfied by some other thread. A CCR implementation would automatically block and signal threads in a fashion consistent with this guard semantics.

Implicit-signal monitors simplify concurrent programming, but come at a steep performance cost—Buhr et al. [1995] estimate that implicit-signal monitors are 10–50× slower than explicit ones. More recent work has tried to lower the cost of implicit-signal monitors. AutoSynch [Hung and Garg 2013] uses a combination of compile-time instrumentation and run-time evaluation to efficiently compute which threads should be woken when monitor state changes. This approach lowers the cost of implicit monitors to be close to, or sometimes better than, explicit ones. Expresso [Ferles et al. 2018] takes a different approach, using compile-time static analysis to synthesize an explicit-signal monitor equivalent to an implicit-signal version given as input. In this way, Expresso is able to erase the dynamic cost of implicit-signal monitors, and in most cases is comparable to hand-written explicit monitors. However, Expresso uses a single lock for the entire monitor and does not allow concurrent execution of threads within the monitor even when safe. Our work expands on this direction by using a richer static analysis to infer additional concurrency opportunities and uses MaxSAT to synthesize a safe and efficient locking protocol. Hence, our key contribution is to synthesize an explicit monitor that appears to match the semantics of the implicit one, but runs
monitor operations concurrently when possible and efficient. As we show in the evaluation, our proposed approach can often make the synthesized monitor faster than a hand-written equivalent.

**Automatic synchronization.** An appealing approach to lower the difficulty of concurrent programming is to deploy program analysis and synthesis techniques for automation. The common abstraction for much of this work is for the programmer to annotate atomic sections that should be executed atomically. Emmi et al. [2007] present a technique for lock allocation to an annotated program. They reduce the problem to integer linear programming and deploy the resulting tool on large-scale C and Java programs. Other approaches [Halpert et al. 2007; Hicks et al. 2006; McCloskey et al. 2006], on the other hand, take a purely static analysis route and attempt to maximize parallelism based solely on the results of the analysis. Cherem et al. [2008] present an alternative technique that uses runtime support to enable finer-grained concurrency. Compared to these efforts, Cortado applies to the more limited domain of monitors, but in exchange for this limitation is able to reason about conditional signalling and can allow atomic sections to run concurrently so long as the illusion of atomicity is maintained.

Other approaches start from a sequential program and automatically generate an equivalent concurrent program. The closest work to ours in this space is that of Golan-Gueta et al. [2011] which generates concurrent data structures given their sequential implementation. Compared to our method, their approach is applicable only to data structures that satisfy certain shape properties and all synthesized programs adhere to the same locking protocol, whereas Cortado generates a synchronization protocol specialized to the input monitor.

**Concurrency verification.** Cortado reasons about concurrent program executions by building on work in concurrent program analysis and verification. Our notion of left- and right-commutativity (Definition 4.11) comes from Lipton’s work on reduction as a concurrency proof technique [Lipton 1975]. Reduction translates interleaved program executions to simpler, equivalent sequential executions by exploiting the commutativity properties of individual program steps. We use the same idea but in reverse: starting with a sequential history (Definition 3.7), we use a static analysis of commutativity to determine how to safely introduce interleavings into that history, and use that information to determine how to assign locks to program fragments.

### 8 CONCLUSION

We presented a technique for synthesizing fine-grained synchronization protocols for implicitly synchronized monitors. Our approach first employs a novel static analysis to identify safe interleavings opportunities between code fragments and uses the results of this analysis to generate a MaxSAT encoding whose solution can be used to synthesize an efficient and correct-by-construction explicit-synchronization monitor. We have implemented our method in a tool called Cortado and evaluated its effectiveness eight monitors collected from popular open source applications. The results of our experimental evaluation demonstrate that Cortado is able to generate non-trivial synchronization protocols that are 3.7× times faster than the original implementation on average (and up to 39.1× times for some outliers).

### ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant No. mnnnnnn and Grant No. mmmmmmm. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
Synthesizing Fine-Grained Synchronization Protocols for Implicit Monitors

REFERENCES


A ADDITIONAL EXPERIMENTAL EVALUATIONS

This section presents additional experimental data for the evaluation of Section 6 as well as ablations related to several design decisions in the implementation of CORTADO. In particular, Section A.1 presents additional data points for the evaluation presented in Section 6, Section A.2 presents an ablation study that justifies the design of our heuristic for constructing an FDG, and Section A.3 presents an ablation study that demonstrates the need for adjusting the weights of soft constraints in our MaxSAT encoding.

A.1 Additional Data Points per Benchmark

In this Section we present additional data points for all the experiments presented in Section 6. As mentioned in Section 6, we chose 128 threads as a stopping point past the number of total hyper-threads in the machine used in our evaluation. In this Section, we provide data points up to 256 threads.

Figure 10 presents the results for all benchmarks in our evaluation up to 256 threads. As demonstrated by Figure 10, the general trend for all benchmarks is the same as the data presented in Section 6. Note that for benchmarks where the code generated by CORTADO exhibits similar run-time performance with the other three implementation we consider in our evaluation (e.g., RoundTripWorker), context-switches seem to dominate the running time as the number of threads increases.

A.2 Ablation Study for FDG Construction Heuristic

This section presents an ablation study for the design of our FDG construction heuristic described in Section 5. To justify the decisions behind the design of our heuristic, we implemented two additional versions of CORTADO that differ in the way they construct the FDG. In particular, we implemented a version that creates finer-grained FDGs than CORTADO and one that creates coarser-grained ones.

The finer-grained version, named Stmt-Ablation, puts every non-composite statement outside of a loop in its own fragment. Statements inside a loop are grouped together in the same fragment since FDGs are acyclic. The coarser version of our tool, named CCR-Ablation, simply puts each CCR in its own fragment.

Figure 11 presents the results of this ablation study. As demonstrated by the results, there are several cases where CORTADO performs better than at least one of its two modified versions. The cases where all three versions perform similarly are benchmarks where the synthesized synchronization protocol mainly exploits data-level parallelism among different CCRs in the monitor, which our tool can exploit given any FDG. Overall, this study demonstrates the need for a customized heuristic for constructing an FDG suitable for maximizing parallelism of implicit synchronization monitors.

A.3 Ablation Study for MaxSAT Soft Constraint Weights

Finally, this Section presents an ablation of the soft constraints weights in our MaxSAT encoding. As mentioned in Section 5, CORTADO assigns different weights to different classes of soft constraints because some of them are more important for synthesizing the optimal synchronization protocol. To demonstrate this, we have created a modified version of our tool, named Weight-Ablation, that assigns the same weight to all soft constraints.
Fig. 10. Performance results for all tools up to 256 threads. The y-axis is in log scale and time measurements are in milliseconds. The shadowed regions surrounding each line present 99.9% confidence interval of each measurement.
Fig. 11. Performance results for the FDG ablation study. The y-axis is in log scale and time measurements are in milliseconds. The shadowed regions surrounding each line present 99.9% confidence interval of each measurement.
The results of this ablation are presented in Figure 13. As this figure demonstrates, there are several cases where the ablated version of the tool performs significantly worse than CORTADO. To give a concrete example of why this is the case, consider the monitor of Figure 12 that contains two methods both of which modify field \( x \). Because the body of method \( \text{bar} \) can be interleaved between the \text{waituntil} statement and the increment statement of method \( \text{foo} \), the optimal synchronization protocol would convert field \( x \) to an atomic integer and introduce a lock that would only be held in method \( \text{foo} \). However, an equivalent protocol would be to simply protect both \( \text{foo} \) and \( \text{bar} \) with the same global lock. So, if all constraints have an equal weight, CORTADO could generate both of these protocols, since they would have the same optimum objective value. As mentioned in Section 5, CORTADO’s MaxSAT encoding prefers assignments where a race between two fragments (like the one on field \( x \)) are resolved via an atomic field rather than a lock. This forces CORTADO to generate the optimal solution for the monitor of Figure 13. The adjusted weights for other classes of soft constraints try to steer CORTADO to better performing synchronization protocols in a similar way.

```java
class M {
    int x = 0;
    void foo() {
        waituntil(x < 10);
        x++;
    }
    void bar() {
        x--;
    }
}
```

Fig. 12. A simple implicit monitor.
Fig. 13. Performance results for the soft constraints weights ablation. The y-axis is in log scale and time measurements are in milliseconds. The shadowed regions surrounding each line present 99.9% confidence interval of each measurement.
B PROOF OF NP-COMPLETENESS
To aid the reader, we restate Theorem 4.4.

THEOREM B.1. (NP-Completeness) Let $G = (V, E)$ be the FDG representation of a monitor $M$ and let $\Pi \subseteq V \times V$ be a set of fragment pairs that can safely run in parallel. Then, deciding whether there exists a synchronization protocol with at most $k$ locks and atomic fields that allows all pairs in $\Pi$ to run in parallel is an NP-Complete problem.

PROOF OF THEOREM 4.4. We prove the theorem by reduction to the edge clique cover problem [Michael and Quint 2006]. Let $G = (V, E)$ be an undirected graph. For each $v \in V$, let $E(v)$ be the set of edges incident to $v$.

We define monitor $M$ as follows: for each edge $e \in E$ define a new field $f_e$ of the monitor, initially set to zero. For each vertex $v \in V$, define a CCR $inc_v()$ which increments each $f_e$ for $e \in E(V)$, has a guard of $\top$, and returns nothing.

Let $G_M$ be the control-flow graph of monitor $M$. Note that there is one $\text{waituntil}(\top)$ statement for each $inc_v()$ and $|E(v)| = \text{degree}(v)$ increment statements in $inc_v()$, so there are $|V| + 2|E|$ total nodes in $G_M$. Define $\{G_{1,M}, \ldots, G_{|V|+2|E|M}\}$ to be the partition of $G_M$ into singletons. Then, let $G_M = (V_M, E_M)$ to be the fragment dependency graph obtained from this partition, and let $\Pi \subseteq V_M \times V_M$ be the set of fragment pairs that can safely run in parallel.

We write $frag_{u,e}$ for the fragment which increments $f_e$ in method $inc_v()$, and $\text{waituntil}_o$ for the $\text{waituntil}(O)$ statement at the beginning of $inc_v()$. Observe that

$$\Pi = \{(f_1, f_2) \mid \exists v \in V \text{ such that } f_1 = \text{waituntil}_v, \text{ or } f_2 = \text{waituntil}_v\} \cup \{(frag_{u_1,e_1}, frag_{u_2,e_2}) \mid e_1 \neq e_2\}.$$  

Note that any synchronization protocol which implements $f_e$ for some $e = (u, v) \in E$ as an atomic variable is equivalent to one which wraps a unique lock around each $frag_{u,e}$ and $frag_{v,e}$. Therefore, we only need to consider synchronization protocols which use only locks.

Suppose we are given a synchronization protocol which allows all pairs in $\Pi$ to run in parallel and uses exactly $k$ locks, $\{l_1, \ldots, l_k\}$, for some $k \in \mathbb{N}$. Define the vertices holding each lock to be

$$C_i = \{v \in V \mid \exists e \in E(v) \text{ such that } frag_{u,e} \text{ holds lock } l_i\}.$$  

We claim that $\{C_1, \ldots, C_k\}$ is a clique edge cover of $G$. First, observe that every edge $e = (u, v) \in E$ corresponds to two fragments in $G_M$: $frag_{u,e}$ and $frag_{v,e}$. Since these fragments must not run in parallel (due to a data race), they must share some lock. Let $l_i$ be that lock. By the definition of $C_i$ in Equation 3, $u \in C_i$ and $v \in C_i$. Therefore, every edge appears in $C_i$ for some $1 \leq i \leq k$. Second, suppose that $u \neq v \in C_i$ are both contained in $C_i$ for some $i$. By Equation 3, there must be some $e_u \in E(u)$ and $e_v \in E(v)$ such that both $frag_{u,e_u}$ and $frag_{u,e_v}$ hold lock $i$. Since the two fragments share a lock, we know $(frag_{u,e_u}, frag_{u,e_v}) \notin \Pi$. Therefore, $e_u = e_v$ (by Equation 2). Hence, there is an edge $e_u = e_v = (u, v) \in E$. Consequently, any two distinct vertices in $C_i$ are incident, so $C_i$ is a clique.

A symmetric argument shows how to construct a synchronization protocol using exactly $k$ locks from any edge clique-cover of $G$ which has $k$ cliques.

We have shown that, given an arbitrary graph $G$, in polynomial time we may compute a monitor $M$ such that $M$ has a synchronization protocol using at most $k$ locks and atomic variables if and only if $G$ has an edge clique-cover with at most $k$ cliques. □
This section presents the semantics of our target language presented in Figure 2b. As mentioned in Section 3, given an explicit monitor \( M_t \), initial state \( \sigma \), and monitor history \( h_e \) with argument mapping \( v_e \), the operational semantics of \( M_t \) is defined using a judgment \( M_t \vdash (h_e, v_e, \sigma) \Downarrow \sigma' \) indicating that the new state is \( \sigma' \) after executing \( h_e \) on initial state \( \sigma \). The semantics of such a monitor are implemented using the inference rules of Figure 14 that use judgements of the form

(1) \[
\frac{e = (s, t) \text{ CHECKNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e', T')}
\]

(2) \[
\frac{\text{LockHeld}(T, t, \sigma[l]) \quad \text{BlockThreadOnLock}(T, t, \sigma[l]) \quad \text{NoDeadLocks}(T')}{\sigma, e, v, T \Rightarrow (\sigma', e, e', T')}
\]

(3) \[
\frac{\text{LockHeld}(T, t, \sigma[l]) \quad \text{BlockThreadOnLock}(T, t, \sigma[l])}{\neg \text{LockHeld}(T, t, \sigma[l]) \quad \text{AcqLock}(T, t, \sigma[l])}
\]

(4) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(5) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(6) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(7) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(8) \[
\frac{\sigma, e :: h, v :: v', T \Rightarrow (\sigma', h, v', T')}{\sigma, e :: h, v :: v', T \Rightarrow (\sigma', h, v', T')}
\]

Fig. 14. Semantics for our target language 2b. Here, his a history, \( v \) a list of arguments for every element in \( h \), and \( T \) a tuple of sets and mappings that keep track of all pending signaling and locking operations. Methods and predicates that appear in small caps are defined the text.

C TARGET LANGUAGE OPERATIONAL SEMANTICS

This section presents the semantics of our target language presented in Figure 2b. As mentioned in Section 3, given an explicit monitor \( M_t \), initial state \( \sigma \), and monitor history \( h_e \) with argument mapping \( v_e \), the operational semantics of \( M_t \) is defined using a judgment \( M_t \vdash (h_e, v_e, \sigma) \Downarrow \sigma' \) indicating that the new state is \( \sigma' \) after executing \( h_e \) on initial state \( \sigma \). The semantics of such a monitor are implemented using the inference rules of Figure 14 that use judgements of the form

(1) \[
\frac{e = (s, t) \text{ CHECKNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e', T')}
\]

(2) \[
\frac{\text{LockHeld}(T, t, \sigma[l]) \quad \text{BlockThreadOnLock}(T, t, \sigma[l]) \quad \text{NoDeadLocks}(T')}{\sigma, e, v, T \Rightarrow (\sigma', e, e', T')}
\]

(3) \[
\frac{\text{LockHeld}(T, t, \sigma[l]) \quad \text{BlockThreadOnLock}(T, t, \sigma[l])}{\neg \text{LockHeld}(T, t, \sigma[l]) \quad \text{AcqLock}(T, t, \sigma[l])}
\]

(4) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(5) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(6) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(7) \[
\frac{\text{CheckNotif}(T, t)}{\sigma, e, v, T \Rightarrow (\sigma', e, e, T')}
\]

(8) \[
\frac{\sigma, e :: h, v :: v', T \Rightarrow (\sigma', h, v', T')}{\sigma, e :: h, v :: v', T \Rightarrow (\sigma', h, v', T')}
\]

\[ B_S \subseteq T \times CVar: \text{a set of thread-condition variable pairs, } (t, c) \in B_S \text{ means that thread } t \text{ is blocked on condition variable } c. \]

\[ B_C \subseteq T \times CVar: \text{a set of thread-condition variable pairs, } (t, c) \in B_C \text{ means that thread } t \text{ has been notified on condition } c. \]

\[ B_L \subseteq T \times L: \text{a set of thread-lock pairs, } (t, l) \in B_L \text{ means that thread } t \text{ holds lock } l. \]

We say that $\mathcal{M}$ is well-studied, we assume the existence of an oracle $\mathcal{M}$ that gives a statement $s$ and its argument $\nu$, it returns the resulting monitor state $\sigma'$. Furthermore, because thread $t$ could be blocked before executing statement $s$, this rule uses procedure $\text{UpdateState}$ (defined in Figure 15) to update the sets inside $\mathcal{T}$ accordingly. Specifically, if thread $t$ was blocked in some condition variable $c$, then procedure $\text{UpdateState}$ removes pair $(t,c)$ from both $\mathcal{B}_S$ and $\mathcal{N}_S$ (recall that if $t$ was blocked then it is guaranteed to be notified). Similarly, if thread $t$ was blocked on some lock $l$, then procedure

\begin{figure}[h]
\centering
\begin{align*}
\text{procedure } \text{UpdateState}(\mathcal{T}, t) & \\
\text{input: } \mathcal{T} = (\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L) & \\
\text{input: } t, \text{thread executing a non-synchronization statement.} & \\
\text{output: } \text{updated mappings.} & \\
\text{if } (t,c) \in \mathcal{B}_S \text{ then} & \\
\mathcal{B}_S' & \leftarrow \mathcal{B}_S \setminus \{(t,c)\} \ \mathcal{N}_S' \leftarrow \mathcal{N}_S \setminus \{(t,c)\} & \\
\text{if } (t,l) \in \mathcal{B}_L \text{ then} & \\
\mathcal{H}_L' & \leftarrow \mathcal{H}_L \cup \{(t,l)\} \ \mathcal{N}_L' \leftarrow \mathcal{N}_L \setminus \{(t,l)\} & \\
\mathcal{B}_L' & \leftarrow \mathcal{B}_L \setminus \{(t,l)\} & \\
\text{return } (\mathcal{B}_S', \mathcal{N}_S', \mathcal{H}_L', \mathcal{B}_L', \mathcal{N}_L') & \\
\end{align*}
\caption{Procedure $\text{UpdateState}$}
\end{figure}
UpdateState add the pair \((t, l)\) to \(\mathcal{H}_L\) (i.e., now \(t\) holds lock \(l\)) and removes it from \(\mathcal{N}_L\) and \(\mathcal{B}_L\) (same as in the condition variable case).

**Rule (2).** This rule applies to all statements where a thread \(t\) attempts to acquire lock \(l\) that is currently held by another thread. In order to determine whether a lock is held by another thread, this rule makes use of predicate LockHeld defined as follows:

\[
\text{LockHeld}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), t, l) = l \in \mathcal{H}_L[t']. \ t \neq t'
\]

Then, the rule marks thread \(t\) as blocked on lock \(l\) by using the following procedure that updates map \(\mathcal{B}_L\) by adding pair \((t, l)\):

\[
\text{BlockThreadOnLock}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), t, l) = (\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L', \mathcal{B}_L', \mathcal{N}_L)
\]

where \(\mathcal{B}_L' = \mathcal{B}_L \setminus \{(t, l)\}\)

Finally, the rule requires that the new attempt to acquire lock \(l\) does not introduce any deadlocks by invoking oracle NoDeadLocks. This oracle detects any cycles in the lock acquisition by examining maps \(\mathcal{B}_L\) and \(\mathcal{H}_L\).

**Rule (3).** Conversely, the third rule applies to all cases where thread thread \(t\) attempts to acquire a lock not currently held by some other thread. In this case, the rule simply adds pair \((t, l)\) in map \(\mathcal{H}_L\) as follows:

\[
\text{AcqLock}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), t, l) = (\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L', \mathcal{B}_L, \mathcal{N}_L)
\]

where \(\mathcal{H}_L' = \mathcal{H}_L \cup \{(t, l)\}\)

**Rule (4).** This rule is triggered when a thread \(t\) releases lock \(l\). The rule performs the following two updates to maps \(\mathcal{H}_L\) and \(\mathcal{N}_L\):

\[
\text{RelLock}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), t, l) = (\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L', \mathcal{B}_L, \mathcal{N}_L')
\]

where \(\mathcal{H}_L' = \mathcal{H}_L \setminus \{(t, l)\}\), \(\mathcal{N}_L' = \mathcal{N}_L \cup \{(t', l)\}\) s.t. \((t', l) \in \mathcal{B}_L\)

Specifically, it removes pair \((t, l)\) from \(\mathcal{H}_L\) and notifies some thread \(t'\) currently blocked on lock \(l\).

**Rule (5).** This rule applies when a thread \(t\) calls method \texttt{await} on a condition variable \(c\). The rule simply adds pair \((t, c)\) in set \(\mathcal{B}_S\):

\[
\text{BlockOnCVar}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), t, c) = (\mathcal{B}_S', \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L)
\]

where \(\mathcal{B}_S' = \mathcal{B}_S \setminus \{(t, c)\}\)

**Rules (6) and (7).** These two rules are used when a thread signals or broadcasts a condition variable \(c\). They simply update set \(\mathcal{N}_S\) as follows:

\[
\text{SigCVar}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), c) = (\mathcal{B}_S, \mathcal{N}_S', \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L)
\]

where \(\mathcal{N}_S' = \mathcal{N}_S \cup \{(t, c)\}\) s.t. \((t, c) \in \mathcal{B}_S\)

\[
\text{BcastCVar}((\mathcal{B}_S, \mathcal{N}_S, \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L), c) = (\mathcal{B}_S, \mathcal{N}_S', \mathcal{H}_L, \mathcal{B}_L, \mathcal{N}_L)
\]

where \(\mathcal{N}_S' = \mathcal{N}_S \cup \{(t, c) | (t, c) \in \mathcal{B}_S\}\)

Specifically, rule 6 adds a single thread \(t\) currently blocked on condition variable \(c\) in \(\mathcal{N}_S\), whereas rule 7 adds all such threads in \(\mathcal{N}_S\).

**Rule (8).** Finally, rule 8 recursively applies the procedure to the whole input history \(h\).
D MONITOR INSTRUMENTATION.

In this section, we describe procedure Instrument which given an implicit-synchronization monitor $M$, it corresponding FGD $\mathcal{G} = (V, E)$, and a synchronization protocol $S = (\mathcal{L}, \mathcal{A}, \mathcal{P})$, it instrument protocol $S$ into $M$ yielding an explicit-synchronization monitor $M'$ equivalent to $M$. This is achieved by first introducing all the necessary synchronization fields (locks, condition variables, and atomic fields) in the input class and then instrumenting locking and signaling operations in all methods as follows:

- **Lock acquisition and release:** The synthesized code must ensure that all the locks in $\mathcal{L}(f)$ are held when executing fragment $f$. Thus, for every edge $(f, f')$ in the FDG, we instrument the code to acquire locks $\mathcal{L}(f') \setminus \mathcal{L}(f)$ and release locks $\mathcal{L}(f) \setminus \mathcal{L}(f')$. Furthermore, as mentioned in Section 2, we acquire and release these locks according to a static total order to prevent deadlocks.

- **Blocking on predicates:** Our instrumentation must also convert every $\text{waituntil}$ statement to a sequence of operations on locks and condition variables. Specifically, we instrument a $\text{waituntil}(p)$ statement as follows:

  ```java
  while(!p) { ln.unlock(); ...; l2.unlock(); c.await(); l2.lock(); ...; ln.lock(); }
  ```

  where $c$ is the condition variable associated with $p$; $l_1, \ldots, l_n$ are the locks associated with this fragment, and $l_1$ is the lock associated with condition variable $c$.

- **Signaling operations:** Finally, we instrument signaling operations introduced by PlaceSignals to acquire and release the appropriate locks. In particular, given a statement $\text{signal}(p, c)$ (similarly for $\text{broadcast}(p, c)$), our instrumentation generates the following code:

  ```java
  if (c) { lp.lock(); cp.signal(); lp.release(); }
  ```

  where $cp$ is the condition variable for predicate $p$ and $lp$ is the corresponding lock for $cp$.

Procedure Instrument is presented in Figure 16 in the form of inference rules that use the following two judgements:

- $\nu \vdash \Delta \rightsquigarrow \Delta'$, where $\nu$ is a subset of the arguments of procedure Instrument (we overload operator $\rightsquigarrow$ depending on the arguments) and $\Delta$ is one of the following: the input monitor, a field, a method, a CCR or a statement.

- $\mathcal{L}, \mathcal{G} \vdash \nu \leftrightarrow \nu'$, where $\mathcal{L}$ is the lock map of the input synchronization protocol $S$, FDG is the input $\mathcal{G}$, and $\nu$ is a fragment in $\mathcal{G}$.

The meaning of each judgement is that whenever procedure Instrument is applied to an element that appears on the left-hand side of an arrow ($\rightsquigarrow, \leftrightarrow$) it generates the element on the right-hand side.

**Overall Structure.** The core logic of this procedure is to recursively iterate every element of the input monitor and use the inferred synchronization protocol in order to convert each element to an equivalent element of the target language. At a top level, the procedure begins by transforming every field and method of the input monitor. For every method, the procedure recursively visits every CCR using operator $\rightsquigarrow$. Then, for every CCR, it collects all its fragments and uses operator $\leftrightarrow$ to instrument all the lock operations dictated by the input protocol. In what follows, we give a brief description of every rule presented in Figure 16.

**Mtr.** This is the top-level rule called by procedure Instrument and performs the following tasks: 1. it introduces all the synchronization fields (locks and condition variables) needed by the synchronization protocol and initializes them accordingly and 2. it recursively calls itself for every field, and method of $M$.

**Fld-1 & Fld-2.** These two rules are used to translate fields of $M$, with the first one being applicable to fields that must be converted to atomic fields and the second one to fields that should remain the
same. Only the first rule alters the original field by converting to an atomic field with the same
name as the original.

**METHOD & CCR.** These two rules simply recursively apply operator $\Rightarrow$ to their constituent elements.

**CCR-Statement.** This rule is the one that splits each top-level CCR-Statement to a set of fragments that belong in the input FGG $\mathcal{G}$ and then recursively transforms each of the fragments. Note, because of the properties of FDG (Definitions 4.1 and 4.2), there is only one way to decompose each CCR to its constituents fragments.

**Frag-Stmt.** This rule applies to all statements $s$ that are a fragment in the input $\mathcal{G}$. It first uses operator $\leftarrow$ to instrument all necessary lock operations and then uses a special oracle $\rightarrow_{A}$ that converts all operations that involve a field converted to atomic to the equivalent update statement in the target language.\(^8\)

**Wait.** Rule labeled $\text{Wait}$ is a special case of the above rule because, by definition, every $\text{waituntil}$ statement defines its own fragment in an FDG. Similar to the rule above, this rule also uses operator $\leftarrow$ to instrument the appropriate lock operation in the fragment but it additionally translates the $\text{waituntil}$ statement into an equivalent statement in the target language that uses condition variables. As mentioned in Section 4, each $\text{waituntil}$ statement is translated into a while loop that waits on the appropriate condition variable and properly releases and acquires all locks before and after the call to method $\text{await}$. Additionally, it acquires all locks needed by its successor statement $v$ in the FDG and releases all locks held by it but not needed by $v$ (similar to the logic described below).

**Sig.** This rule applies to all fragments that are a signalling directive of the monitor’s intermediate representation.\(^9\) In a similar manner as the rule for $\text{waituntil}$ statements, this rule first uses operator $\leftarrow$ to instrument all lock operations needed to implement the synchronization protocol. Then, it consults the predicate map $\mathcal{P}$ of the synchronization protocol to acquire the appropriate lock and perform the signaling operation on the associated condition variable.

**Instrumenting Fragments With Lock Operations.** Finally, we describe operator $\leftarrow$ which given a fragment $v$, the lock map $\mathcal{L}$ of the input synchronization protocol $\mathcal{S}$, and the FDG $\mathcal{G}$, it instruments all the necessary lock operations. The logic of this operator is split between two groups of rules, described in more detail below:

- **Rules for entry & exit fragments** (i.e., fragments without predecessors and successors respectively), which are handled by rules $\text{ENTRY-FRAG}$ and $\text{EXIT-FRAG}$ respectively. These rules simply lookup the entry or exit fragment in $\mathcal{L}$ and acquire or release the locks returned by $\mathcal{L}$ accordingly.

- **Rules for fragments with successors.** Fragments that contain some successor in the graph are handled by rules $\text{BRANCH-FRAG}$, $\text{REG-FRAG-1}$ and $\text{REG-FRAG-2}$. The logic for each of these rules is similar, i.e., for any successor fragment $v_s$ of fragment $v$, the instrumentation releases all locks required by $v$ but not by $v_s$ ($\mathcal{L}[v] \setminus \mathcal{L}[v_s]$) and acquires all locks required by $v_s$ but not $v$ ($\mathcal{L}[v_s] \setminus \mathcal{L}[v]$). All these operations are operation in accordance to the global lock order to prevent deadlocks. Last, it is worth mentioning that the main difference of these three rules is how they instrument the edge between $v$ and $v_s$. That is, if $v$ ends with a goto statement

---

\(^8\)Due to its simplicity, we omit a formal description of oracle $\rightarrow_{A}$.

\(^9\)For simplicity, we assume that every signalling operation defines its own fragment.

(conditional or not), then the instrumentation redirects the control-flow appropriately so all lock operations occur along edge \((v, v_s)\).

Finally, we conclude with the following theorem that states the correctness of our instrumentation phase.

**Theorem D.1.** Let \(S = (\mathcal{L}, \mathcal{A}, \mathcal{P})\) be a synchronization protocol inferred over FGD \(G = (V, E)\) of input monitor \(M\) and \(M'\) be the result of procedure Instrument for \(M\). Then, the following three conditions hold:

1. For every fragment \(v \in V\), \(l_i \in \mathcal{L}[v]\) iff fragment \(v\) holds lock \(l_i\) in \(M'\).
2. If \(i < j\), then \(l_i\) is never acquired whenever \(l_j\) is held.
3. Field \(f \in \mathcal{A}\) iff all its occurrences in \(M\) have been replaced with an atomic operation in \(M'\).

**Proof.** Proof can be found in Appendix E. \(\square\)

### E  CORRECTNESS-RELATED PROOFS

This section contains all the proofs related to the correctness of our approach. Section E.1 presents the proof of Theorem 4.5, Section E.2 presents the proof of Theorem 4.14, Section E.3 presents the proof of Theorem 4.16, and Section E.4 presents the proof of Theorem D.1.

#### E.1 Proof of Theorem 4.5

Theorem 4.5 states the following:

(\textbf{Correctness}). We say that an explicit monitor \(M_f\) correctly implements an implicit monitor \(M_s\), denoted as \(M_s \approx M_f\), iff for all input states \(\sigma_s, \sigma_f\) s.t. \(\sigma_s \equiv M_s, \sigma_f\), we have:

\[
\begin{align*}
(1) \quad & \forall h_i, v_i. \ M_s \vdash (h_i, v_i, \sigma_s) \downarrow \sigma'_s \implies (M_f \vdash (\text{Expand}_{M_f}(h_i, v_i, \sigma_s), \sigma_f) \downarrow \sigma'_f \land \sigma'_f \equiv M_s, \sigma'_s) \\
(2) \quad & \forall h_e, v_e. \ M_f \vdash (h_e, v_e, \sigma_f) \downarrow \sigma'_f \implies (\exists h_i, v_i. (h_e, v_e) \sim (h_i, v_i) \land M_s \vdash (h_i, v_i, \sigma_s) \downarrow \sigma'_s \land \sigma'_s \equiv M_s, \sigma'_f)
\end{align*}
\]

**Proof.** For all proofs in this section, we assume the correctness of procedure \textsc{PlaceSignals} (proved in previous work [Ferles et al. 2018]).

The proof of condition (1) above follows directly from Theorems 4.16, D.1, and the correctness of procedure \textsc{PlaceSignals}. The proof of condition (2) follows directly from Theorem 4.14, Theorem 4.16, and correctness of \textsc{PlaceSignals}. \(\square\)

#### E.2 Proof of Theorem 4.14

In this section, we present the proof of Theorem 4.14 which we reiterate here for convenience.

Before presenting the actual proof, we first introduce some auxiliary notation, relations, and lemmas. First, given a history \(h\), we define a predicate \(h[[v_1, t_1], \ldots, (v_k, t_k)]\) that evaluates to true iff each event \((v_i, t_i)\) is \(i\)-th element in \(h\) and \(i1 < \ldots < ik\). That is, this predicate encodes that these event occur in this particular order within \(h\). Second, we define \(\text{Next}(h, i, t)\) as follows:

\[
\min((j \mid j > i, \ h[i] \notin h[i - 1]))
\]

In other words, \(\text{Next}(h, i, t)\) returns the first element in \(h\) after index \(i\) whose thread identifier is \(t\). Additionally, we use \(h[i]\) to denote the \(i\)-th element in \(h\) and \(h[i : j]\), \(i < j\) to denote the “sub-history” of \(h\) between its \(i\)-th element (inclusive) and \(j\)-th element (exclusive). Finally, we extend the definition of a history projection to filter out elements that do not involve a thread, e.g., \(\Pi(h, \sim t)\) filters out all events of \(h\) that involve thread \(t\).

Next, using the notation above we define some relations that identify interleavings inside a history of a fragmented monitor \(M_G\).
Fig. 16. Procedure Instrument\((S, G)\), where \(S\) is a synchronization protocol for FDG \(G\).
Definition E.1. (History Interleaving). Given history $h_{fdg} = (V, E)$ and interleaving $\chi = (v, e = (v_s, v_t))$. We define the occurrences of $\chi$ as follows:

$$\text{Interleavings}(\chi, h_G) = \{(j, (i, k)) \mid h_G[|v_s|, t_i], (v, t'), (v_t, t_k]\}, \ t \neq t', \ k = \text{Next}(h_G, i, t)$$

Also, we write $X(h_G)$ to be the set of all interleavings that occur in $h_G$, i.e.

$$X(h_G) = \{\chi \mid X = \text{Interleavings}(\chi, h_G), |X| > 0\}$$

Finally, we write $X_s(h_G)$ to denote the number of interleavings inside $h$. Formally:

$$X_s(h_G) = \sum_{\chi \in V \times E} |\text{Interleavings}(\chi, h_G)|$$

Next, given a fragment $v$ we assume the existence of two predicates, namely, $\text{EntryFrag}$ and $\text{ExitFrag}$, that hold only if $v$ is the entry fragment or the exit fragment of its CCR respectively. Based on these relations, we define the next relation that partitions a fragment history into CCR sub-histories.

**CCR History Partition.** Let $h_G$ be a history of fragments in FGD $G$. We define the CCR partition of $h_G$ that returns a list of potentially overlapping sub-histories of $h_G$ as follows:

$$\text{CCRPart}(h_G) = \left[ h_G[i : j] \mid h_G[i] = (v_m, t, \_), \ \text{EntryFrag}(v_m), \ \text{j = min}\{ k \mid k > i, h_G[k + 1] = (v_o, t, \_), \text{ExitFrag}(v_o)\} \right]$$

Let $P = \text{CCRPart}(h_G)$, then we use $P[i]$ to refer to the $i$-th sub-history in $P$. Note we assume that partitions returned by $\text{CCRPart}$ are ordered according to the index of the first element in the sub-history. That is, if $\text{ccr}_1$ began its execution before $\text{ccr}_2$ in $h_G$, then the partition of $\text{ccr}_1$ appears before the partition of $\text{ccr}_2$ in $P$. Furthermore, given a CCR partition $P[i]$, we write $\text{Thread}(P[i])$ to represent the thread of the first element in sub-history $P[i]$. **Removing Interleavings from Histories.** Before we prove our main theorem, we define some transformations on interleaved histories that helps us remove interleavings.

First, given a CCR sub-history that is interleaved, we define its sequential history as follows:

Definition E.2. Sequential CCR Sub-history Let $h_G$ be an interleaved history, $P = \text{CCRPart}(h_G)$, and $h'_G = P[i]$ be an interleaved CCR partition. We define that the sequential history of $h'_G$, denoted as $\text{Seq}_{\text{CCR}}(h'_G)$ to be the following history: $\Pi(h'_G, t)\Pi(h'_G, t')$, where $t = \text{Thread}(P[i])$. Given an argument mapping $\nu$ for history $h'_G$, we write $\text{Seq}(\nu)$ to denote the corresponding argument mapping for $\text{Seq}_{\text{CCR}}(h'_G)$.

We now prove the following useful lemmas about sequential CCR sub-histories.

**Lemma E.3.** Let $h_G$ be an interleaved sub-history and $h'_G = \text{Seq}_{\text{CCR}}(h_G)$, then the following two things hold:

1. $X(h'_G) \subseteq X(h_G)$
2. $X_s(h'_G) < X_s(h_G)$

**Proof.** Both of the properties logically follow from the construction of $h'_G$. That is, a sequential CCR sub-history of the form $(v_1, t) \ldots (v_i, t), (v_{i+1}, t'), \ldots, (v_j, t'')$, where all elements before the $i$-th position are from thread $t$ and all elements past that are from some thread $t'$ s.t. $t \neq t'$. On the other hand, the original history $h_G$ is of the form:

$$(v_1, t) \ldots (v_k, t)(v_{k+1}, t') \ldots (v_j, t)$$

Where $h_G[1 : k + 1] = h'_G[1 : k + 1]$ (i.e., $h_G$ and $h'_G$ have a common prefix). Therefore, since by its construction $h'_G$ does not move the relative order of element is $h_G$ that do not involve thread $t$,
if an interleaving $\chi \in X(h_G')$ then we also have $\chi \in X(h_G)$. Conversely, any thread interleaving that involved thread $t$ in $h_G$ does not appear in $h_G$ (by construction). Since by its definition the interleaved history $h_G$ contains at least one interleaving that involves an edge executed by $t$, we can conclude that $X(h_G') < X(h_G)$. □

**Lemma E.4.** Let $h_G$ be an interleaved sub-history and $h_G' = \text{Seq}_{\text{CCR}}(h_G)$, then if $X(h_G)$ is a set of strongly safe interleavings we have that: $\forall \sigma, v. M_G \vdash (h_G, v, \sigma) \parallel \sigma' \Rightarrow M_G \vdash (h_G', \text{Seq}(v), \sigma) \parallel \sigma'$

**Proof.** We prove this by induction on the number of distinct interleavings of history $h_G (X(h_G))$.

**Base Case:** $X(h_G) = 1$. If there is a single interleaving in $h_G$, this implies that $h_G$ is of the form:

$$(v_1, t) \ldots (v_i, t') \ldots (v_j, t)$$

where $t = \text{Thread}(h_G)$ and $(v_i, t')$ is the only element in $h_G$ not executed by $t$. Because, the interleaving of $h_G$ is strongly safe, we have that fragment $v_i$ executed by $t'$, right commutes with any possible successor of the edge it interleaves. Also, by definition, $h_G'$ is $(v_1, t) \ldots (v_j, t)(v_i, t')$.

Combining this two facts with Lemma E.3, we can prove the theorem for our base case:

$$\forall \sigma, v. M_G \vdash (h_G, v, \sigma) \parallel \sigma' \Rightarrow M_G \vdash (h_G', \text{Seq}(v), \sigma) \parallel \sigma'$$

**Inductive Step.** In our inductive step, we assume that our lemma holds for $X(h_G) = n$ and we are going to prove it for $n + 1$. The logic is similar to the base case, specifically, we get the right-most interleaved fragment in $h_G$ and right-commute to the end of the history while obtaining a semantically equivalent history $h_G''$. After that, we can apply our inductive hypothesis on $h_G''$, which again proves our goal. □

Finally, we prove our main theorem, which we re-iterate below for convenience.

**Theorem 4.14.** Let $G$ be an FDG and let $\chi_1, \ldots, \chi_n$ be strongly safe interleavings. Then, $S = \{\chi_1, \ldots, \chi_n\}$ is a safe interleaving set for $G$.

**Proof.** By definition of safe set of interleavings, we have to prove the following for every interleaved history $h_G$ of monitor $M_G$

If $X(h_G) \subseteq S$ and $M_G \vdash (h_G, v_G, \sigma) \parallel \sigma'$ then $\exists h, v. (h_G, v_G) \leadsto (h, v)$ and $M \vdash (h, v, \sigma) \parallel \sigma'$

In order to prove that, we have to prove that for every interleaved history $h_G$ that only allows interleavings in $S$ and argument mapping $v_G$ we can find a history of the original monitor with corresponding argument mapping s.t., $((h_G, v_G) \leadsto (h, v))$. Which in turn means that we have to find a sequential history of $M_G h_G'$ s.t.

$$1. \forall t. \pi(h_G, t) = \pi(h_G', t) \quad \text{and} \quad 2. \text{Expand}_{M_G}(h, v, \sigma) = (h_G', v_G)$$

To prove the goal above, we start with an arbitrary interleaved history $h_G$ s.t. $X(h_G) \subseteq S$ and convert it to a sequential history $h_G'$ with the above properties. We perform this proof, by first creating the CCR partition of $h_G$, $P = \text{CCRPart}(h_G)$, and then induct on the number of partitions in $P$ that are interleaved.

**Base Case: One interleaved CCR in $P$.** Let $h_G^{\text{CCR}} = [P[i]$ be the interleaved history in $h_G$. Now, let $h_G' = h_G[\text{Seq}_{\text{CCR}}(h_G^{\text{CCR}})/h_G^{\text{CCR}}]$. Because $h_G^{\text{CCR}}$ is the only interleaved sub-history in $P$ and because of Lemma E.3, we have that $h_G'$ is a sequential history s.t. $\forall t, v_G. \pi(h_G, t) = \pi(h_G', t)$. Furthermore, because of Lemma E.4 we have $\forall \sigma, v. M_G \vdash (h_G, v_G, \sigma) \parallel \sigma' \Rightarrow M_G \vdash (h_G', \text{Seq}(v_G), \sigma) \parallel \sigma'$. Finally, because we have $M_G \vdash (h_G, v_G, \sigma) \parallel \sigma'$ for some $\sigma$, this implies that $\text{Expand}_{M_G}(h, v_G, \sigma) = (h_G', \text{Seq}(v_G))$, which in turns implies $(h_G', v_G) \leadsto (h, \text{Seq}(v_G))$. 

**Inductive Step.** Next, we assume that our theorem holds for up to $n$ interleaved CCRs in $P$, and will prove it for $n + 1$. Similarly as above, we find the smallest $i$ s.t. $P[i] = h^\text{CCR}_G$ is an interleaved history. Again, we construct $h'_G = h_G[\text{Seq}(h^\text{CCR}_G)/h^\text{CCR}_G]$. Because of lemma E.3, we have that the number of interleaved histories in CCRPart($h'_G$) has strictly fewer number of interleaved sub-histories than $P$. Therefore, by our inductive hypothesis, we have that $(h'_G, v'_G) \sim (h, \text{Seq}(v_G))$ for some history of $h$ of $M$. This, combined with lemma E.4, proves that $(h_G, v_G) \sim (h, \text{Seq}(v_G))$.

**E.3 Proof of Theorem 4.16**

We now prove theorem 4.16 which states the correctness of our MaxSAT encoding.

**Theorem 4.16.** Let $m$ be a model of the generated MaxSAT instance and $(L, A, P)$ be the synchronization protocol constructed as follows:

$$L = \{ v \mapsto \{ i \mid m[h_i] \} \} \quad A = \{ f1d \mid m[a_{f1d}] \} \quad P = \{ p \mapsto l_i \mid \text{IsWait}(v, p), i = \min(\{ j \mid m[h_j] \}) \}$$

where, $\text{IsWait}(v, p)$ is true if $v$ is a waituntil statement on $p$. Then, $(L, A, P)$ is a correct synchronization protocol.

**Proof.** As mentioned earlier, a synchronization protocol must meet the following correctness criteria:

1. If two fragments $v_1, v_2$ have a race (i.e., $R(v_1, v_2) \neq \emptyset$), then the protocol must prevent this race with a lock or an atomic field.
2. If a fragment interleaving $\chi = (v, e)$ is not safe, then the synchronization protocol must not allow fragment $v$ to execute in between edge $e$.
3. The protocol must be deadlock-free.

We show that, by construction, a model $m$ returned by a MaxSAT solver always satisfies the above conditions.

1. Model $m$ prevents any races between two fragments because it must satisfy all hard constraints generated by rules RACE-1 and RACE-2 from Figure 8. Therefore, $m$ will force two racy fragments to either share a lock or, when possible, convert all operations involving the racy field to equivalent atomic ones.
2. Similarly, because model $m$ must satisfy the hard constraints generated by rule I-LEAVE, any interleaving that was deemed unsafe by our static analysis is guaranteed to be infeasible in the resulting synchronization monitor.
3. Finally, because of rules WAIT and L-ORDER, the resulting synchronization monitor is guaranteed to be deadlock-free. Specifically, the hard constraints generated by rule L-ORDER enforce the invariant that all lock acquisitions respect the global lock order. Whereas, the hard constraints of rule WAIT, enforce the same invariant for the translation of a waituntil statement into an equivalent statement in the target language (see Figure 16).

**E.4 Proof of Theorem D.1**

Finally, we prove the correctness of our monitor instrumentation procedure (Fig. 16).

**Theorem E.5.** Let $S = (L, A, P)$ be a synchronization protocol inferred over FGD $G = (V, E)$ of input monitor $M$ and $M'$ be the result of procedure Instrument for $M$. Then, the following three conditions hold:

1. For every fragment $v \in V$, $l_i \in L[v]$ iff fragment $v$ holds lock $l_i$ in $M'$.
(2) If \( i < j \), then \( l_i \) is never acquired whenever \( l_j \) is held.

(3) Field \( f \in \mathcal{A} \) iff all its occurrences in \( M \) have been replaced with an atomic operation in \( M' \).

Proof. All three conditions can be proved by providing certain guarantees for a subset of the rules of Figure 16. Note that operator \( \Rightarrow \) (Figure 16) is guaranteed to visit every code fragment \( v \in V \) in the FDG, since it recursively visits every element of the input monitor until it discovers all fragments of the given FDG. Next, we prove all three conditions.

Condition (1): For this condition, we need to prove that both fragments will only hold the locks required by the synthesized protocol \( S \). The logic of this proof depends on the number and type of predecessors of fragment \( v \). We now present a case analysis:

Zero predecessors. This is the case of an entry fragment of a method in \( G \). Due to the structure of our input language and the definition of an FDG, this fragment must be a fragment that contains a single \( \text{waituntil} \) statement. The instrumentation of such a fragment is handled by rules Wait and Entry-Frag. Note, that rule Wait first calls Entry-Frag which acquires all locks needed by the fragmented defined by the \( \text{waituntil} \) fragment.

At least one predecessor. These types of fragments are handled by rules Wait, Branch-Frag, Reg-Frag-1, and Reg-Frag-2. All these rules maintain the following invariant for the fragment \( v \) that triggers them: before transferring control to any of \( v \)’s successor, they release all locks needed by \( v \) but not needed by the successor (locksets of the form \( R_i \)) and acquire all locks needed by the successor but not held by \( v \) (locksets of the form \( A_i \)). This invariant combined with the fact that these are the only ways to transfer control flow in our input language, ensure that before executing a fragment in the output monitor all necessary locks (and only those) will be acquired.

Condition (2): This directly follows from:

(1) That procedure Instrument uses auxiliary relation Acq to instrument lock acquisitions, which as shown in Figure 16 does so in increasing order of lock indices.

(2) The guarantee provided by Theorem 4.16 that the synthesized protocol acquires locks in increasing order along every control-flow edge.

Condition (3): This condition is ensured by rule Frag-Stmt of Figure 16 that ensures oracle \( \rightarrow \mathcal{A} \) is called on every fragment of \( G \). □