QUIC Concurrency Architecture
=============================

Introduction
------------

Most QUIC implementations in C are offered as a simple state machine without any
included I/O solution. Applications must do significant integration work to
provide the necessary infrastructure for a QUIC implementation to integrate
with. Moreover, blocking I/O at an application level may not be supported.

OpenSSL QUIC seeks to offer a QUIC solution which can serve multiple use cases:

- Firstly, it seeks to offer the simple state machine model and a fully
  customisable network path (via a BIO) for those who want it;

- Secondly, it seeks to offer a turnkey solution with an in-the-box I/O
  and polling solution which can support blocking API calls in a Berkeley
  sockets-like way.

These usage modes are somewhat diametrically opposed. One involves libssl
consuming no resources but those it is given, with an application responsible
for synchronisation and a potentially custom network I/O path. This usage model
is not “smart”. Network traffic is connected to the state machine and state is
input and output from the state machine as needed by an application on a purely
non-blocking basis. Determining *when* to do anything is largely the
application's responsibility.

The other usage mode involves libssl managing more things internally to provide
an easier-to-use solution. For example, it may involve spinning up background
threads to ensure connections are serviced regularly (as in our existing
client-side thread assisted mode).

In order to provide for these different use cases, the concept of concurrency
models is introduced. A concurrency model defines how “cleverly” the QUIC engine
will operate and how many background resources (e.g. threads, other OS
resources) will be established to support operation.

Concurrency Models
------------------

- **Unsynchronised Concurrency Model (UCM):** In the Unsynchronised Concurrency
  Model, calls to SSL objects are not synchronised. There is no locking on any
  API Personality Layer (APL) call; omitting this locking is purely an
  optimisation. The application is either single-threaded or is otherwise
  responsible for doing synchronisation itself.

  Blocking API calls are not supported under this model. This model is intended
  primarily for single-threaded use as a simple state machine by advanced
  applications, many of which are likely to disable autoticking.

- **Contentive Concurrency Model (CCM):** In the
  Contentive Concurrency Model, calls to SSL objects are wrapped in locks and
  multi-threaded usage of a QUIC connection (for example, parallel writes to
  different QUIC stream SSL objects belonging to the same QUIC connection) is
  synchronised by a mutex.

  This is contentive in the sense that if a large number of threads are trying
  to write to different streams on the same connection, a large amount of lock
  contention will occur. As such, this concurrency model will neither scale nor
  provide good performance, at least within the context of concurrent use
  of a single connection.

  Under this model, APL calls by the application result in lock-wrapped
  mutations of QUIC core objects (`QUIC_CHANNEL`, `QUIC_STREAM`, etc.) on the
  same thread.

  This model may be used either in a variant which does not support blocking
  (NB-CCM) or which does support blocking (B-CCM). The blocking variant must
  spin up additional OS resources to correctly support blocking semantics.

- **Thread Assisted Contentive Concurrency Model (TA-CCM):** This is currently
  implemented by our thread assisted mode for client-side QUIC usage. It does
  not realise the full state separation or performance of the Worker Concurrency
  Model (WCM) below. Instead, it simply spawns a background thread which ensures
  QUIC timer events are handled as needed. It makes use of the Contentive
  Concurrency Model for performing that handling, in that it obtains a lock when
  ticking a QUIC connection just as any call by an application would.

  This mode is likely to be deprecated in favour of the full Worker Concurrency
  Model (WCM), by which it will naturally be subsumed.

- **Worker Concurrency Model (WCM):** In the Worker Concurrency Model,
  a background worker thread is spawned to manage connection processing. All
  interaction with an SSL object goes through this thread in some way.
  Interactions with SSL objects are essentially translated into commands and
  handled by the worker thread. To optimise performance and minimise lock
  contention, there is an emphasis on message passing over locking.
  Internal dataflow for application data can be managed in a zero-copy way to
  minimise the costs of this message passing.

  Under this model, QUIC core objects (`QUIC_CHANNEL`, `QUIC_STREAM`, etc.) will
  live solely on the worker thread and access to these objects by an application
  thread will be entirely forbidden.

  Blocking API calls are supported under this model.

These concurrency models are summarised as follows:

| Model   | Sophistication | Concurrency           | Blocking Supported | OS Resources              | Timer Events    | RX Steering | Core State Affinity  |
|---------|----------------|-----------------------|--------------------|---------------------------|-----------------|-------------|----------------------|
| UCM     | Lowest         | ST only               | No                 | None                      | App Responsible | None        | App Thread           |
| CCM     |                | MT (Contentive)       | Optional           | Mutex, (Notifier)         | App Responsible | TBD         | App Threads          |
| TA-CCM† |                | MT (Contentive)       | Optional           | Mutex, Thread, (Notifier) | Managed         | TBD         | App & Assist Threads |
| WCM     | Highest        | MT (High Performance) | Yes                | Mutex, Thread, Notifier   | Managed         | Futureproof | Worker Thread        |

† To eventually be deprecated in favour of WCM.

Legend:

- **Blocking Supported:** Whether blocking calls to e.g. `SSL_read` can be
  supported. If this is listed as “optional”, extra resources are required to
  support this under the listed model and these resources could be omitted if an
  application indicates it does not need this functionality at initialisation
  time.

- **OS Resources:** “Mutex” refers to mutex and condition variable resources.
  “Notifier” refers to a kind of OS resource needed to allow one thread to wake
  another thread which is currently blocking in an OS socket polling call such
  as poll(2) (e.g. an eventfd or socketpair). Resources listed in parentheses in
  the table above are required only if blocking support is desired.

- **Timer Events:** Is an application responsible for ensuring QUIC timeout
  events are handled in a timely manner?

- **RX Steering:** The matter of RX steering will be discussed in detail in a
  future document. Broadly speaking, RX steering concerns whether incoming
  traffic for multiple different QUIC connections on the same local port (e.g.
  for a server) can be vectored *by the OS* to different threads or whether the
  demuxing of incoming traffic for different connections has to be done manually
  on an in-process basis.

  The WCM model most readily supports RX steering and is futureproof in this
  regard. The feasibility of having the UCM and CCM models support RX steering
  is left for future analysis.

- **Core State Affinity:** Which threads are allowed to touch the QUIC core
  objects (`QUIC_CHANNEL`, `QUIC_STREAM`, etc.).
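
To make the “Notifier” resource from the legend more concrete, the following is
a minimal sketch of one way such a wakeup primitive could be built on POSIX
pipe(2) and poll(2). The `NOTIFIER` type and the `notifier_*` functions are
hypothetical names invented for this illustration; they are not part of
OpenSSL, and a real implementation might prefer an eventfd or socketpair.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical notifier: one thread polls rfd, another writes to wfd. */
typedef struct {
    int rfd; /* polled by the blocking thread alongside its socket */
    int wfd; /* written to by the thread which wants to wake it */
} NOTIFIER;

static int notifier_init(NOTIFIER *n)
{
    int fds[2];

    assert(n != NULL);
    if (pipe(fds) != 0)
        return 0;

    n->rfd = fds[0];
    n->wfd = fds[1];
    return 1;
}

/* Called by the waking thread: makes rfd readable. */
static int notifier_signal(NOTIFIER *n)
{
    const char b = 1;

    return write(n->wfd, &b, 1) == 1;
}

/*
 * Called by the blocking thread. Waits for traffic on sock_fd (pass -1 to
 * wait on the notifier only) or for a wakeup. Returns 1 if woken by the
 * notifier, 0 if not (timeout or socket activity only), -1 on error.
 */
static int notifier_wait(NOTIFIER *n, int sock_fd, int timeout_ms)
{
    struct pollfd pfd[2];
    char b;

    pfd[0].fd = n->rfd;
    pfd[0].events = POLLIN;
    pfd[0].revents = 0;
    pfd[1].fd = sock_fd; /* poll(2) ignores negative descriptors */
    pfd[1].events = POLLIN;
    pfd[1].revents = 0;

    if (poll(pfd, 2, timeout_ms) < 0)
        return -1;

    if ((pfd[0].revents & POLLIN) == 0)
        return 0;

    /* Consume the wake byte so that the next wait blocks again. */
    return read(n->rfd, &b, 1) == 1 ? 1 : -1;
}

static void notifier_cleanup(NOTIFIER *n)
{
    close(n->rfd);
    close(n->wfd);
}
```

In this sketch a thread blocked in `notifier_wait` returns as soon as another
thread calls `notifier_signal`, which is the behaviour blocking API calls need
when state changes on a different thread.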

Architecture
------------

To recap, the API Personality Layer (APL) refers to the code in `quic_impl.c`
which implements the libssl API personality (`SSL_write`, etc.). The APL is
cleanly separated from the QUIC core implementation (`QUIC_CHANNEL`, etc.).

Since UCM is basically a slight optimisation of CCM in which unnecessary locking
is elided, discussion from here on will focus on CCM and WCM except where
there are specific differences between CCM and UCM.

Supporting both CCM and WCM creates significant architectural challenges. Under
CCM, QUIC core objects have their state mutated under lock by arbitrary
application threads and these mutations happen during APL calls. By contrast, a
performant WCM architecture requires that APL calls be recorded and serviced in
an asynchronous fashion involving message passing to a worker thread. This
threatens to require highly divergent dispatch architectures for the two
concurrency models.

As such, the concept of a **Concurrency Management Layer (CML)** is introduced.
The CML lives between the APL and the QUIC core code. It is responsible for
dispatching in-thread mutations of QUIC core objects when operating under CCM,
and for dispatching messages to a worker thread under WCM.

![Concurrency Models Diagram](images/quic-concurrency-models.svg)

There are two different CMLs:

- **Direct CML (DCML)**, in which core objects are worked on in the same thread
  which made an APL call, under lock;

- **Worker CML (WCML)**, in which core objects are managed by a worker thread
  with communication via message passing. This CML is split into a front end
  (WCML-FE) and back end (WCML-BE).

The legacy thread assisted mode uses a bespoke method which is similar to the
approach used by the DCML.
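
The difference between the two CMLs can be sketched as two implementations of
one dispatch interface. This is an illustrative, single-threaded model only:
`CML`, `CORE`, `OP` and all function names below are hypothetical, the direct
variant omits the mutex it would hold in practice, and the worker back end is
invoked by hand rather than running on a separate thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for QUIC core state. */
typedef struct {
    unsigned long bytes_queued;
} CORE;

/* Hypothetical operation: one requested mutation of the core state. */
typedef struct {
    unsigned long add_bytes;
} OP;

typedef struct cml_st CML;

struct cml_st {
    int (*submit)(CML *cml, const OP *op); /* chosen at creation time */
    CORE core;
    OP queue[8];        /* used by the worker variant only */
    size_t queue_len;
};

static void core_apply(CORE *core, const OP *op)
{
    core->bytes_queued += op->add_bytes;
}

/* Direct: mutate the core on the calling thread (under lock in real code). */
static int direct_submit(CML *cml, const OP *op)
{
    core_apply(&cml->core, op);
    return 1;
}

/* Worker front end: record the operation as a message; do not touch core. */
static int worker_submit(CML *cml, const OP *op)
{
    if (cml->queue_len == sizeof(cml->queue) / sizeof(cml->queue[0]))
        return 0; /* queue full */

    cml->queue[cml->queue_len++] = *op;
    return 1;
}

/* Worker back end: in a real design this would run on the worker thread. */
static void worker_drain(CML *cml)
{
    size_t i;

    for (i = 0; i < cml->queue_len; ++i)
        core_apply(&cml->core, &cml->queue[i]);

    cml->queue_len = 0;
}

static void cml_init(CML *cml, int use_worker)
{
    assert(cml != NULL);
    cml->submit = use_worker ? worker_submit : direct_submit;
    cml->core.bytes_queued = 0;
    cml->queue_len = 0;
}
```

The caller uses `cml->submit` identically in both cases; only the point at
which the core state changes differs.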

CML Design
----------

The CML is designed to have as small an API surface area as possible to enable
unified handling of as many kinds of (APL) API operations as possible. The idea
is that complex APL calls are translated into simple operations on the CML.

At its core, the CML exposes some number of *pipes*. The number of pipes which
can be accessed via the CML varies as connections and streams are created and
destroyed. A pipe is a *unidirectional* transport for byte streams. Zero-copy
optimisations are expected to be implemented in future but are deferred.

The CML (`QUIC_CML`) allows the caller to refer to a pipe by providing an opaque
pipe handle (`QUIC_CML_PIPE`). If the pipe is a sending pipe, the caller can use
`ossl_cml_write` to try to add bytes to it. Conversely, if it is a receiving
pipe, the caller can use `ossl_cml_read` to try to read bytes from it.

The method `ossl_cml_block_until` allows the caller to block until at least one
of the provided pipe handles is ready. Ready means that at least one byte can be
written (for a sending pipe) or at least one byte can be read (for a receiving
pipe).
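
As a rough mental model of a single pipe, the sketch below implements a
fixed-capacity, unidirectional byte pipe with partial writes and reads. It is
not the CML implementation: `TOY_PIPE`, its capacity and the `toy_pipe_*`
functions are assumptions made for illustration, and no locking or blocking is
shown.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PIPE_CAP 8 /* deliberately tiny so partial writes are visible */

typedef struct {
    unsigned char buf[TOY_PIPE_CAP];
    size_t head; /* index of the next byte to read */
    size_t len;  /* number of bytes currently buffered */
} TOY_PIPE;

static void toy_pipe_init(TOY_PIPE *p)
{
    assert(p != NULL);
    p->head = 0;
    p->len = 0;
}

/* A sending pipe is "ready" when this is non-zero. */
static size_t toy_pipe_write_available(const TOY_PIPE *p)
{
    return TOY_PIPE_CAP - p->len;
}

/* A receiving pipe is "ready" when this is non-zero. */
static size_t toy_pipe_read_available(const TOY_PIPE *p)
{
    return p->len;
}

/* Copies in as many bytes as fit; returns the number accepted (may be 0). */
static size_t toy_pipe_write(TOY_PIPE *p, const void *data, size_t n)
{
    const unsigned char *src = data;
    size_t i, room = toy_pipe_write_available(p);

    if (n > room)
        n = room;

    for (i = 0; i < n; ++i)
        p->buf[(p->head + p->len + i) % TOY_PIPE_CAP] = src[i];

    p->len += n;
    return n;
}

/* Copies out up to n buffered bytes; returns the number read (may be 0). */
static size_t toy_pipe_read(TOY_PIPE *p, void *data, size_t n)
{
    unsigned char *dst = data;
    size_t i;

    if (n > p->len)
        n = p->len;

    for (i = 0; i < n; ++i)
        dst[i] = p->buf[(p->head + i) % TOY_PIPE_CAP];

    p->head = (p->head + n) % TOY_PIPE_CAP;
    p->len -= n;
    return n;
}
```

A caller that receives a short count from `toy_pipe_write` would either return
to the application (non-blocking) or wait until the pipe is ready again
(blocking), which is the role `ossl_cml_block_until` plays in the design.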

Note that there is only expected to be one `QUIC_CML` instance per QUIC event
processing domain (i.e., per `QUIC_DOMAIN` / `QUIC_ENGINE` instance). The CML
fully abstracts the QUIC core objects such as `QUIC_ENGINE` or `QUIC_CHANNEL` so
that the APL never sees them.

The caller retrieves a pipe handle using `ossl_cml_get_pipe`. This function
retrieves a pipe based on two values:

  - a CML pipe class;
  - a CML *selector*.

The CML selector is a tagged union structure which specifies what pipe is to be
retrieved. Abstractly, examples of selectors include:

```text
    Domain      ()
    Listener    (listener_id: uint)
    Conn        (conn_id:     uint)
    Stream      (conn_id:     uint, stream_id: u64)
```

In other words, the CML selector selects the “object” to retrieve a pipe from.
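
One possible C rendering of the abstract selectors above is a tagged union, as
sketched here. The type, field and function names are hypothetical and chosen
for this example; the document does not define the layout of
`QUIC_CML_SELECTOR`.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    SEL_DOMAIN,
    SEL_LISTENER,
    SEL_CONN,
    SEL_STREAM
} SEL_KIND;

typedef struct {
    SEL_KIND kind; /* tag: says which member of u is meaningful */
    union {
        struct { unsigned int listener_id; } listener;
        struct { unsigned int conn_id; } conn;
        struct { unsigned int conn_id; uint64_t stream_id; } stream;
    } u;
} SELECTOR;

static SELECTOR sel_domain(void)
{
    SELECTOR s = { SEL_DOMAIN };

    return s;
}

static SELECTOR sel_listener(unsigned int listener_id)
{
    SELECTOR s = { SEL_LISTENER };

    s.u.listener.listener_id = listener_id;
    return s;
}

static SELECTOR sel_conn(unsigned int conn_id)
{
    SELECTOR s = { SEL_CONN };

    s.u.conn.conn_id = conn_id;
    return s;
}

static SELECTOR sel_stream(unsigned int conn_id, uint64_t stream_id)
{
    SELECTOR s = { SEL_STREAM };

    s.u.stream.conn_id = conn_id;
    s.u.stream.stream_id = stream_id;
    return s;
}

/* Two selectors are equal if they name the same logical object. */
static int sel_equal(const SELECTOR *a, const SELECTOR *b)
{
    if (a->kind != b->kind)
        return 0;

    switch (a->kind) {
    case SEL_DOMAIN:
        return 1;
    case SEL_LISTENER:
        return a->u.listener.listener_id == b->u.listener.listener_id;
    case SEL_CONN:
        return a->u.conn.conn_id == b->u.conn.conn_id;
    case SEL_STREAM:
        return a->u.stream.conn_id == b->u.stream.conn_id
            && a->u.stream.stream_id == b->u.stream.stream_id;
    }

    assert(!"unknown selector kind");
    return 0;
}
```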

The CML pipe class is one of the following values:

- Request
- Notification
- App Send
- App Recv

The pipe classes available for a given selector vary. For example, the “App
Send” and “App Recv” pipes only exist on a stream, so it is invalid to request
such a pipe in conjunction with a different type of selector.

The “Request” and “App Send” classes expose send-only pipes, and the
“Notification” and “App Recv” classes expose receive-only pipes.
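
The rules relating pipe classes to selectors and directions could be captured
in two small predicates, sketched below. The enum and function names are
hypothetical. The sketch also assumes that the Request and Notification classes
are available for every kind of selector, which the text implies but does not
state outright.

```c
#include <assert.h>

typedef enum {
    CLASS_REQUEST,
    CLASS_NOTIFICATION,
    CLASS_APP_SEND,
    CLASS_APP_RECV
} PIPE_CLASS;

typedef enum {
    KIND_DOMAIN,
    KIND_LISTENER,
    KIND_CONN,
    KIND_STREAM
} SELECTOR_KIND;

/* Returns 1 for send-only classes and 0 for receive-only classes. */
static int class_is_send(PIPE_CLASS c)
{
    return c == CLASS_REQUEST || c == CLASS_APP_SEND;
}

/* Returns 1 if a pipe of class c may be requested for selector kind k. */
static int class_valid_for(PIPE_CLASS c, SELECTOR_KIND k)
{
    switch (c) {
    case CLASS_APP_SEND:
    case CLASS_APP_RECV:
        /* Application data pipes only exist on a stream. */
        return k == KIND_STREAM;
    case CLASS_REQUEST:
    case CLASS_NOTIFICATION:
        /* Assumption: control pipes exist for every selector kind. */
        return 1;
    }

    assert(!"unknown pipe class");
    return 0;
}
```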

For any given CML selector, the Request pipe is used to send serialized commands
for asynchronous processing in relation to the entity selected by that selector.
Conversely, the Notification pipe returns asynchronous notifications. These
could be in relation to a previous command (e.g. indicating whether that command
succeeded), or unprompted notifications about other events.

The underlying pattern here is that there is a bidirectional channel for control
messages and a bidirectional channel for application data, each composed of
two unidirectional pipes.

Pipe handles are stable for as long as the pipe they reference exists, so an APL
object can cache a pipe handle if desired.

All CML methods are thread safe. The CML implementation handles any necessary
locking internally.

The `ossl_cml_write_available` and `ossl_cml_read_available` calls determine the
number of bytes which can currently be written to a send-only pipe, or read from
a receive-only pipe, respectively.

**Race conditions.** Because these calls are separate from `ossl_cml_write` and
`ossl_cml_read`, the values they return may become out of date before the caller
has a chance to call `ossl_cml_write` or `ossl_cml_read`. However, such changes
are guaranteed to be monotonically in favour of the caller: the value returned
by `ossl_cml_write_available` will only ever increase asynchronously (and only
decrease as a result of an `ossl_cml_write` call). Likewise, the value returned
by `ossl_cml_read_available` will only ever increase asynchronously (and only
decrease as a result of an `ossl_cml_read` call). Assuming that only one thread
makes calls to CML functions at a given time *for a given pipe*, this therefore
poses no issue for callers.

Concurrent use of `ossl_cml_write` or `ossl_cml_read` for a given pipe is not
intended (and would not make sense in any case). The caller is responsible for
synchronising such calls.
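
The monotonicity argument can be illustrated with a small counter-only model of
a sending pipe. Everything here is hypothetical (`SEND_PIPE_MODEL` and the
`model_*` functions are invented for this sketch), and the asynchronous side is
simulated by an ordinary function call rather than by another thread.

```c
#include <assert.h>
#include <stddef.h>

/* Counter-only model of a sending pipe: no data is actually stored. */
typedef struct {
    size_t cap;  /* total capacity in bytes */
    size_t used; /* bytes written but not yet consumed */
} SEND_PIPE_MODEL;

static size_t model_write_available(const SEND_PIPE_MODEL *p)
{
    return p->cap - p->used;
}

/* Caller side: accepts up to n bytes and returns the number accepted. */
static size_t model_write(SEND_PIPE_MODEL *p, size_t n)
{
    size_t room = model_write_available(p);

    if (n > room)
        n = room;

    p->used += n;
    return n;
}

/*
 * Asynchronous side (for example, the QUIC core transmitting buffered data):
 * frees up to n bytes. It can only ever make more space available.
 */
static void model_async_drain(SEND_PIPE_MODEL *p, size_t n)
{
    assert(p != NULL);
    p->used -= n > p->used ? p->used : n;
}
```

If a single writer snapshots `model_write_available`, any number of
`model_async_drain` calls may happen before it writes, yet a write of the
snapshotted size is still accepted in full. A stale value is therefore merely
conservative, never unsafe.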

**Examples of pipe usage.** The application data pipes are used to serialize the
actual application data sent or received on a QUIC stream. The request and
notification pipes are used for control activity, and their usage is more
varied. There is therefore a “control/data” separation here. The request and
notification pipes transport tagged unions. Abstractly, requests and
notifications might include:

- Request: Reset Stream (error code: u64)
- Notification: Connection Terminated by Peer
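
As an illustration of how such a tagged union might be serialized onto a
request pipe, the sketch below encodes and decodes the “Reset Stream” request.
The wire layout (a one-byte tag followed by a big-endian 64-bit error code),
the tag value and the function names are all assumptions made for this example;
the document does not specify an encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { REQ_RESET_STREAM = 1 }; /* hypothetical tag value */

#define REQ_RESET_STREAM_LEN 9 /* 1 tag byte + 8 error code bytes */

/*
 * Writes the encoded request to out. Returns the number of bytes written, or
 * 0 if out_len is too small.
 */
static size_t encode_reset_stream(unsigned char *out, size_t out_len,
                                  uint64_t error_code)
{
    int i;

    if (out_len < REQ_RESET_STREAM_LEN)
        return 0;

    out[0] = REQ_RESET_STREAM;
    for (i = 0; i < 8; ++i)
        out[1 + i] = (unsigned char)(error_code >> (56 - 8 * i));

    return REQ_RESET_STREAM_LEN;
}

/*
 * Parses a Reset Stream request from in. Returns 1 and sets *error_code on
 * success, or 0 if the input is truncated or carries a different tag.
 */
static int decode_reset_stream(const unsigned char *in, size_t in_len,
                               uint64_t *error_code)
{
    uint64_t v = 0;
    int i;

    assert(error_code != NULL);
    if (in_len < REQ_RESET_STREAM_LEN || in[0] != REQ_RESET_STREAM)
        return 0;

    for (i = 0; i < 8; ++i)
        v = (v << 8) | in[1 + i];

    *error_code = v;
    return 1;
}
```

The front end would write the encoded bytes to the Request pipe for the stream
selector, and the back end would decode them and act on the stream.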

**Example implementation of `SSL_write`.** An `SSL_write`-like API might be
implemented in the APL like this:

```c
int do_write(QUIC_CML *cml,
             QUIC_CML_PIPE notification_pipe,
             QUIC_CML_PIPE app_send_pipe,
             const void *buf, size_t buf_len)
{
    /* Byte pointer, since arithmetic on a void pointer is not valid C. */
    const unsigned char *p = buf;
    size_t bytes_written = 0;
    QUIC_CML_PIPE wait_pipes[2];

    wait_pipes[0] = notification_pipe;
    wait_pipes[1] = app_send_pipe;

    for (;;) {
        /* e.g. connection termination */
        process_any_notifications(notification_pipe);

        /* state checks, etc. */
        if (...->conn_terminated)
            return 0;

        if (buf_len == 0)
            return 1;

        if (!ossl_cml_write(cml, app_send_pipe, p, buf_len, &bytes_written))
            return 0;

        if (bytes_written == 0) {
            if (!should_block())
                break;

            /* Wait until the send pipe has room or a notification arrives. */
            ossl_cml_block_until(cml, wait_pipes, 2, ossl_time_infinite());
            continue; /* try again */
        }

        p       += bytes_written;
        buf_len -= bytes_written;
    }

    return 1;
}
```

**CML API.** The proposed CML interface is summarised by the following
declarations:

```c
/*
 * Creates a new CML using the Direct CML (DCML) implementation. need_locking
 * may be 0 to elide mutex usage if the application is guaranteed to synchronise
 * access or is purely single-threaded.
 */
QUIC_CML *ossl_cml_new_direct(int need_locking);

/* Creates a new CML using the Worker CML (WCML) implementation. */
QUIC_CML *ossl_cml_new_worker(size_t num_worker_threads);

/*
 * Starts the CML operating. Idempotent after it returns successfully. For the
 * WCML this might e.g. start background threads; for the DCML it is likely to
 * be a no-op (but must still be called).
 */
int ossl_cml_start(QUIC_CML *cml);

/*
 * Begins the CML shutdown process. Returns 1 once shutdown is complete; may
 * need to be called multiple times until shutdown is done.
 */
int ossl_cml_shutdown(QUIC_CML *cml);

/*
 * Immediate free of the CML. This is always safe but may cause handling
 * of a connection to be aborted abruptly as it is an immediate teardown
 * of all state.
 */
void ossl_cml_free(QUIC_CML *cml);

/* Pipe classes (QUIC_CML_CLASS values) accepted by ossl_cml_get_pipe. */
enum {
    QUIC_CML_CLASS_REQUEST,         /* control; send */
    QUIC_CML_CLASS_NOTIFICATION,    /* control; recv */
    QUIC_CML_CLASS_APP_SEND,        /* data; send */
    QUIC_CML_CLASS_APP_RECV         /* data; recv */
};

/*
 * Retrieves a pipe for a logical CML object described by selector. The pipe
 * handle, which is stable over the life of the logical CML object, is written
 * to *pipe_handle. class_ is a QUIC_CML_CLASS value.
 */
int ossl_cml_get_pipe(QUIC_CML                  *cml,
                      int                       class_,
                      const QUIC_CML_SELECTOR   *selector,
                      QUIC_CML_PIPE             *pipe_handle);

/*
 * Returns the number of bytes a sending pipe can currently accept. The returned
 * value may increase over time asynchronously but will only decrease in
 * response to an ossl_cml_write call.
 */
size_t ossl_cml_write_available(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle);

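/*
 * Appends bytes into a sending pipe by copying them. The buffer can be freed
 * as soon as this call returns. The number of bytes actually accepted, which
 * may be less than buf_len (or zero), is written to *bytes_written.
 */
int ossl_cml_write(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle,
                   const void *buf, size_t buf_len,
                   size_t *bytes_written);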

/*
 * Returns the number of bytes a receiving pipe currently has waiting to be
 * read. The returned value may increase over time asynchronously but will only
 * decrease in response to an ossl_cml_read call.
 */
size_t ossl_cml_read_available(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle);

/*
 * Reads bytes from a receiving pipe by copying them.
 */
int ossl_cml_read(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle,
                  void *buf, size_t buf_len);

/*
 * Blocks until at least one of the pipes in the array specified by
 * pipe_handles is ready, or until the deadline given is reached.
 *
 * A pipe is ready if:
 *
 *   - it is a sending pipe and one or more bytes can now be written;
 *   - it is a receiving pipe and one or more bytes can now be read.
 */
int ossl_cml_block_until(QUIC_CML *cml,
                         const QUIC_CML_PIPE *pipe_handles,
                         size_t num_pipe_handles,
                         OSSL_TIME deadline);
```