PatchworkOS  19e446b
A non-POSIX operating system.
Kernel-side I/O Ring Interface

Programmable submission/completion interface.

Detailed Description

Programmable submission/completion interface.

Todo:
The I/O ring system is still a work in progress and subject to change. It is mostly unimplemented, so this page should be read primarily as a design document for now.
Todo:
Rewrite the Kernel-side I/O Ring Interface documentation to match the new system.

The I/O ring provides the core of all interfaces in PatchworkOS. User space submits Submission Queue Entries (SQEs) to it and receives Completion Queue Entries (CQEs) from it, all within shared memory. This allows for highly efficient asynchronous I/O, especially since PatchworkOS is designed to be natively asynchronous.

Each SQE specifies a verb (the operation to perform) and a set of up to SQE_MAX_ARG arguments, while each CQE returns the result of a previously submitted SQE.

Synchronous operations are implemented on top of this API in userspace.

See also
User-side I/O Ring Interface for the userspace interface to the asynchronous ring.
Wikipedia for information about io_uring, the inspiration for this system.
Manpages for more information about io_uring.

Synchronization

The I/O ring structure is designed to be safe under the assumption that there is a single producer (one user-space thread) and a single consumer (the kernel).

If an I/O ring needs multiple producers (needs to be accessed by multiple threads) it is the responsibility of the caller to ensure proper synchronization.

Note
The reason for this limitation is optimization for the common case, as the synchronization logic for multiple producers would add significant overhead. Additionally, it is rather straightforward for user space to protect the ring with a mutex should it need to.
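The single-producer, single-consumer assumption can be illustrated with a minimal sketch using C11 atomics. Everything here is hypothetical (ex_ring_t, ex_ring_push, ex_ring_pop and the entry type are illustration names, not the PatchworkOS definitions). It only shows why one writer per index lets the ring work without locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring size; a power of two lets indices wrap with a mask. */
#define EX_RING_SIZE 8u
static_assert((EX_RING_SIZE & (EX_RING_SIZE - 1u)) == 0, "size must be a power of two");

typedef struct
{
    uint64_t entries[EX_RING_SIZE];
    _Atomic uint32_t head; /* Written only by the consumer. */
    _Atomic uint32_t tail; /* Written only by the producer. */
} ex_ring_t;

/* Producer side: returns false if the ring is full. */
static bool ex_ring_push(ex_ring_t* ring, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&ring->head, memory_order_acquire);
    if (tail - head == EX_RING_SIZE)
    {
        return false;
    }
    ring->entries[tail & (EX_RING_SIZE - 1u)] = value;
    /* Release: the entry must be visible before the new tail is. */
    atomic_store_explicit(&ring->tail, tail + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ex_ring_pop(ex_ring_t* ring, uint64_t* out)
{
    uint32_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);
    if (head == tail)
    {
        return false;
    }
    *out = ring->entries[head & (EX_RING_SIZE - 1u)];
    /* Release: the slot may be reused only after it has been read. */
    atomic_store_explicit(&ring->head, head + 1u, memory_order_release);
    return true;
}
```

With two producers, both could read the same tail value and overwrite each other's entry. This is why a ring shared between threads needs an external lock around the push side.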

Regarding the I/O ring structure itself, it can only be torn down when nothing is using it and there are no pending operations.

Registers

Operations performed on an I/O ring can load arguments from, and save their results to, seven 64-bit general-purpose registers. All registers are stored in the shared control area of the I/O ring structure (ioring_ctrl_t), so they can be inspected and modified by user space.

When a SQE is processed, the kernel checks six register specifiers in the SQE flags: one for each of the five arguments and one for the result. Each specifier is stored as three bits, with the value SQE_REG_NONE indicating that no register is used and any other value selecting the n-th register. The bit offset of a specifier determines its meaning: bits 0-2 specify the register to load into the first argument, bits 3-5 the register to load into the second argument, and so on up to bits 15-17, which specify the register to save the result into.
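The bit layout described above can be sketched as follows. All names prefixed with ex_ or EX_ are hypothetical and are not taken from sqe_flags_t. In particular, the numeric value of SQE_REG_NONE is not stated here, so this sketch assumes it is 0 and that specifier n selects register index n - 1.

```c
#include <assert.h>
#include <stdint.h>

#define EX_SPEC_BITS 3u                          /* Each specifier is three bits wide. */
#define EX_SPEC_MASK ((1u << EX_SPEC_BITS) - 1u) /* 0b111 */
#define EX_ARG_COUNT 5u                          /* Five argument specifiers (bits 0-14). */
#define EX_SPEC_RESULT 5u                        /* Result specifier (bits 15-17). */
#define EX_SPEC_COUNT 6u
#define EX_REG_COUNT 7u
#define EX_REG_NONE 0u /* ASSUMPTION: the real SQE_REG_NONE value may differ. */

/* Return flags with specifier slot `index` set to `spec`. */
static uint32_t ex_spec_set(uint32_t flags, unsigned index, uint32_t spec)
{
    assert(index < EX_SPEC_COUNT && spec <= EX_SPEC_MASK);
    unsigned shift = index * EX_SPEC_BITS;
    flags &= ~(EX_SPEC_MASK << shift);
    return flags | (spec << shift);
}

/* Extract specifier slot `index` from flags. */
static uint32_t ex_spec_get(uint32_t flags, unsigned index)
{
    assert(index < EX_SPEC_COUNT);
    return (flags >> (index * EX_SPEC_BITS)) & EX_SPEC_MASK;
}

/* Before a verb runs: overwrite every argument that names a register. */
static void ex_load_args(uint32_t flags, const uint64_t regs[EX_REG_COUNT], uint64_t args[EX_ARG_COUNT])
{
    for (unsigned i = 0; i < EX_ARG_COUNT; i++)
    {
        uint32_t spec = ex_spec_get(flags, i);
        if (spec != EX_REG_NONE)
        {
            args[i] = regs[spec - 1u]; /* ASSUMPTION: specifier n is register n - 1. */
        }
    }
}

/* After a verb runs: store the result if a result register is named. */
static void ex_save_result(uint32_t flags, uint64_t regs[EX_REG_COUNT], uint64_t result)
{
    uint32_t spec = ex_spec_get(flags, EX_SPEC_RESULT);
    if (spec != EX_REG_NONE)
    {
        regs[spec - 1u] = result;
    }
}
```

Under these assumptions, a linked sequence could save a result into a register in one SQE and load it into an argument of a later SQE, which is how the two mechanisms would combine.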

This system, when combined with SQE_LINK, allows multiple operations to be performed at once. For example, it would be possible to open a file, read from it, seek to a new position, write to it, and finally close the file, all with a single enter() call.

See also
sqe_flags_t for more information about register specifiers and their formatting.

Arguments

Arguments within a SQE are stored in five 64-bit values, arg0 through arg4. For convenience, each argument value is stored as a union with various types.

To avoid naming conflicts and to avoid having to define new arguments for each verb, we define a convention to be used for the arguments.

It may not always be possible for a verb to follow these conventions, but they should be followed whenever reasonable.

Note
The kernel's internal I/O Request Packet structure uses a similar system but with the kernel equivalents of the arguments, for example, a file_t* instead of a fd_t.

Results

The result of a SQE is stored in its corresponding CQE using a single 64-bit value. For convenience, the result is stored as a union of various types. Note that this does not actually change the stored value, just how it is interpreted.

If a SQE fails, the error code is stored separately from the result, and the result itself may be undefined. Some verbs may allow partial failures, in which case the result may still be valid even if an error code is present.
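As a rough illustration of a union-typed result with a separately stored error, consider the sketch below. The types ex_result_t and ex_cqe_t, their members, and the use of an int error field where 0 means success are all assumptions made for this example. The real CQE layout may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result union: one 64-bit value, several interpretations. */
typedef union
{
    uint64_t raw;
    int64_t sint;
    void* ptr;
} ex_result_t;
static_assert(sizeof(ex_result_t) == sizeof(uint64_t), "result must stay a single 64-bit value");

/* Hypothetical CQE: the error lives next to the result, not inside it. */
typedef struct
{
    ex_result_t result;
    int error; /* ASSUMPTION: 0 means success. */
} ex_cqe_t;

/* Fetch a byte count from a CQE; the result is only trusted if no error is set. */
static bool ex_cqe_count(const ex_cqe_t* cqe, uint64_t* count)
{
    if (cqe->error != 0)
    {
        return false; /* The result may be undefined, so it is not read. */
    }
    *count = cqe->result.raw;
    return true;
}
```

A verb that allowed partial failures would need a different consumer, one that reads the result even when the error field is set.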

Todo:
Decide if partial failures are a good idea or not.

Errors

The majority of errors are returned in the CQEs, but certain errors (such as ENOMEM) may be reported directly from the enter() call.

Error values that may be returned in a CQE include:

Verbs

Included below is a list of all currently implemented verbs.

The arguments of each verb are specified in order as arg0, arg1, arg2, arg3, arg4.

VERB_NOP

A no-operation verb that does nothing but is useful for implementing sleeping.

Parameters
arg0  Unused
arg1  Unused
arg2  Unused
arg3  Unused
arg4  Unused
Returns
None

VERB_READ

Reads data from a file descriptor.

Parameters
fd  The file descriptor to read from.
buffer  The buffer to read the data into.
count  The number of bytes to read.
offset  The offset to read from, or IO_CUR to use the current position.
arg4  Unused
Returns
The number of bytes read.

VERB_WRITE

Writes data to a file descriptor.

Parameters
fd  The file descriptor to write to.
buffer  The buffer to write the data from.
count  The number of bytes to write.
offset  The offset to write to, or IO_CUR to use the current position.
arg4  Unused
Returns
The number of bytes written.

VERB_POLL

Polls a file descriptor for events.

Parameters
fd  The file descriptor to poll.
events  The events to wait for.
arg2  Unused
arg3  Unused
arg4  Unused
Returns
The events that occurred.

Data Structures

struct  ioring_ctx_t
 The kernel-side ring context structure.
 

Enumerations

enum  ioring_ctx_flags_t { IORING_CTX_NONE = 0 , IORING_CTX_BUSY = 1 << 0 , IORING_CTX_MAPPED = 1 << 1 }
 Ring context flags.
 

Functions

void ioring_ctx_init (ioring_ctx_t *ctx)
 Initialize an I/O context.
 
void ioring_ctx_deinit (ioring_ctx_t *ctx)
 Deinitialize an I/O context.
 
uint64_t ioring_ctx_notify (ioring_ctx_t *ctx, size_t amount, size_t wait)
 Notify the context of new SQEs.
 

Enumeration Type Documentation

◆ ioring_ctx_flags_t

Ring context flags.

Enumerator
IORING_CTX_NONE 

No flags set.

IORING_CTX_BUSY 

Context is currently in use. This flag is used for fast locking.

IORING_CTX_MAPPED 

Context is currently mapped into userspace.

Definition at line 170 of file ring.h.
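One way a busy flag can serve as a fast lock is an atomic test-and-set on the flag bit. The sketch below is only an assumption about the general technique, not the actual ring.c implementation. The flag values are copied from the enumeration above, while ex_ctx_t, ex_ctx_try_acquire and ex_ctx_release are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Flag values as listed in the enumeration above. */
typedef enum
{
    IORING_CTX_NONE = 0,
    IORING_CTX_BUSY = 1 << 0,
    IORING_CTX_MAPPED = 1 << 1,
} ioring_ctx_flags_t;

/* Hypothetical stand-in for the part of the context that holds the flags. */
typedef struct
{
    _Atomic unsigned flags;
} ex_ctx_t;

/* Try to mark the context busy; succeeds only if it was not busy before. */
static bool ex_ctx_try_acquire(ex_ctx_t* ctx)
{
    unsigned old = atomic_fetch_or_explicit(&ctx->flags, (unsigned)IORING_CTX_BUSY, memory_order_acquire);
    return (old & IORING_CTX_BUSY) == 0;
}

/* Clear the busy bit, leaving all other flags untouched. */
static void ex_ctx_release(ex_ctx_t* ctx)
{
    unsigned old = atomic_fetch_and_explicit(&ctx->flags, ~(unsigned)IORING_CTX_BUSY, memory_order_release);
    assert(old & IORING_CTX_BUSY); /* Releasing a context that was not busy is a bug. */
    (void)old;
}
```

A caller that fails to acquire the context can return a busy error instead of blocking, which keeps the common uncontended path down to a single atomic operation.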

Function Documentation

◆ ioring_ctx_init()

void ioring_ctx_init ( ioring_ctx_t *ctx )

Initialize an I/O context.

Parameters
ctx  Pointer to the context to initialize.

Definition at line 146 of file ring.c.

◆ ioring_ctx_deinit()

void ioring_ctx_deinit ( ioring_ctx_t *ctx )

Deinitialize an I/O context.

Parameters
ctx  Pointer to the context to deinitialize.

Definition at line 162 of file ring.c.

◆ ioring_ctx_notify()

uint64_t ioring_ctx_notify ( ioring_ctx_t *ctx,
size_t  amount,
size_t  wait 
)

Notify the context of new SQEs.

Parameters
ctx  Pointer to the context.
amount  The number of SQEs to process.
wait  The minimum number of CQEs to wait for.
Returns
On success, the number of SQEs processed. On failure, ERR is returned and errno is set.

Definition at line 346 of file ring.c.
