Ordinarily, synchronization issues for tracing engines are kept fairly
straightforward by using UTRACE_STOP. You ask a
thread to stop, and once it makes the
report_quiesce callback it cannot do anything else
that would result in another callback until you allow it to with a
utrace_control call. This simple arrangement
avoids complex and error-prone code in each one of a tracing engine's
event callbacks to keep them serialized with the engine's other
operations done on that thread from another thread of control.
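The serialization that UTRACE_STOP provides can be illustrated with a small userspace model. This is a sketch only: struct mock_thread and the mock_* functions are hypothetical stand-ins, not the utrace API, and real scheduling is not modeled. It shows that once a stop has been requested and the thread has made its report_quiesce callback, no further callbacks arrive until the engine resumes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of UTRACE_STOP; not the utrace API. */
struct mock_thread {
	bool stop_requested; /* an engine asked for UTRACE_STOP */
	bool quiescent;      /* report_quiesce made; thread is parked */
	int callbacks;       /* callbacks delivered to the engine so far */
};

/* Engine side: request a stop (cf. UTRACE_STOP). */
static void mock_request_stop(struct mock_thread *t)
{
	t->stop_requested = true;
}

/* Thread side: the thread reaches an event point.
 * Returns true if the engine received a callback. */
static bool mock_thread_event(struct mock_thread *t)
{
	if (t->quiescent)
		return false;         /* parked: no callbacks until resumed */
	t->callbacks++;
	if (t->stop_requested)
		t->quiescent = true;  /* that callback was report_quiesce */
	return true;
}

/* Engine side: let the thread run again (cf. utrace_control). */
static void mock_resume(struct mock_thread *t)
{
	t->stop_requested = false;
	t->quiescent = false;
}
```

Because the engine's own code runs only while the thread is parked, it never races with its callbacks in this model.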
However, giving tracing engines complete power to keep a traced thread
stuck in place runs afoul of a more important kind of simplicity that
the kernel overall guarantees: nothing can prevent or delay
SIGKILL from making a thread die and release its
resources. To preserve this important property of
SIGKILL, it is treated as a special case: it can break
UTRACE_STOP as nothing else normally can. This
includes both explicit SIGKILL signals and the
implicit SIGKILL sent to each other thread in the
same thread group by a thread doing an exec, or processing a fatal
signal, or making an exit_group system call. A
tracing engine can prevent a thread from beginning the exit or exec or
dying by signal (other than SIGKILL) if it is
attached to that thread, but once the operation begins, no tracing
engine can prevent or delay the other threads in the same thread group
from dying.
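A similarly hypothetical model shows how SIGKILL, unlike other signals, overrides a stop, and how a thread beginning an exec or exit_group (modeled by mock_group_exit, an invented name) kills every other thread in its group regardless of any engine. Only SIGKILL and SIGTERM from signal.h are real; the structure and functions are assumptions for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a thread that a tracing engine may hold stopped. */
struct mock_thread {
	bool stopped;  /* held in place by UTRACE_STOP */
	bool dying;    /* has begun to die; can no longer be held */
	int pending;   /* a non-SIGKILL signal left pending while stopped */
};

static void mock_send_signal(struct mock_thread *t, int sig)
{
	if (sig == SIGKILL) {
		t->stopped = false;  /* SIGKILL breaks UTRACE_STOP */
		t->dying = true;
	} else if (t->stopped) {
		t->pending = sig;    /* waits until the engine resumes it */
	}
	/* delivery to a running thread is not modeled */
}

/* exec, exit_group, or fatal-signal processing by one thread sends an
 * implicit SIGKILL to every other thread in the group. */
static void mock_group_exit(struct mock_thread *group, size_t n,
			    size_t initiator)
{
	for (size_t i = 0; i < n; i++)
		if (i != initiator)
			mock_send_signal(&group[i], SIGKILL);
}
```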
The report_reap callback is always the final event
in the life cycle of a traced thread. Tracing engines can use this as
the trigger to clean up their own data structures. The
report_death callback is always the penultimate
event a tracing engine might see; it's seen unless the thread was
already in the midst of dying when the engine attached. Many tracing
engines will have no interest in when a parent reaps a dead process,
and nothing they want to do with a zombie thread once it dies; for
them, the report_death callback is the natural
place to clean up data structures and detach. To facilitate writing
such engines robustly, despite the asynchrony of
SIGKILL and without error-prone manual
implementation of synchronization schemes, the
utrace infrastructure provides some special
guarantees about the report_death and
report_reap callbacks. It still takes some care
to be sure your tracing engine is robust to tear-down races, but these
rules make it reasonably straightforward and concise to handle a lot of
corner cases correctly.
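The ordering of the last two events can be sketched as follows, assuming a simplified lifecycle with invented state and function names. An engine attached while the thread is alive sees report_death and then report_reap; one that attaches after dying has begun sees only report_reap, after which it is implicitly detached.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lifecycle states at the moment the engine attaches. */
enum mock_state { T_ALIVE, T_DYING, T_DEAD, T_REAPED };

struct mock_engine {
	bool attached;
	bool sees_death;  /* attached before the thread began dying */
	char log[16];     /* 'D' = report_death, 'R' = report_reap */
};

static void mock_attach(struct mock_engine *e, enum mock_state s)
{
	e->attached = (s != T_REAPED);
	e->sees_death = (s == T_ALIVE);
	e->log[0] = '\0';
}

static void mock_thread_dies(struct mock_engine *e)
{
	if (e->attached && e->sees_death)
		strcat(e->log, "D");
}

static void mock_thread_reaped(struct mock_engine *e)
{
	if (e->attached) {
		strcat(e->log, "R");
		e->attached = false;  /* implicit detach after report_reap */
	}
}
```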
The first sort of guarantee concerns the core data structures
themselves. struct utrace_engine is
a reference-counted data structure. While you hold a reference, an
engine pointer will always stay valid so that you can safely pass it to
any utrace call. Each call to
utrace_attach_task or
utrace_attach_pid returns an engine pointer with a
reference belonging to the caller. You own that reference until you
drop it using utrace_engine_put. There is an
implicit reference on the engine while it is attached. So if you drop
your only reference, and then use
utrace_attach_task without
UTRACE_ATTACH_CREATE to look up that same engine,
you will get the same pointer with a new reference to replace the one
you dropped, just like calling utrace_engine_get.
When an engine has been detached, either explicitly with
UTRACE_DETACH or implicitly after
report_reap, then any references you hold are all
that keep the old engine pointer alive.
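The reference-counting rules above can be modeled in userspace as a sketch. mock_attach_create, mock_attach_lookup, mock_detach, and mock_engine_put are hypothetical analogues of utrace_attach_task (with and without UTRACE_ATTACH_CREATE), detaching, and utrace_engine_put; the real calls take a task, flags, ops, and data that are omitted here. Being attached holds one implicit reference, so the engine is freed only when that reference and every caller reference are gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of struct utrace_engine reference counting. */
struct mock_engine {
	int refs;       /* caller references + 1 implicit ref while attached */
	bool attached;
};

static int mock_frees;  /* counts engines actually freed */

static void mock_engine_put(struct mock_engine *e)
{
	if (--e->refs == 0) {
		free(e);
		mock_frees++;
	}
}

/* Cf. attaching with UTRACE_ATTACH_CREATE: implicit ref + caller ref. */
static struct mock_engine *mock_attach_create(void)
{
	struct mock_engine *e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->refs = 2;
	e->attached = true;
	return e;
}

/* Cf. a lookup without UTRACE_ATTACH_CREATE: same pointer, new ref.
 * (The real call finds the engine on the task; here it is passed in.) */
static struct mock_engine *mock_attach_lookup(struct mock_engine *e)
{
	if (!e->attached)
		return NULL;
	e->refs++;
	return e;
}

/* Detaching (explicitly or after report_reap) drops the implicit ref. */
static void mock_detach(struct mock_engine *e)
{
	if (e->attached) {
		e->attached = false;
		mock_engine_put(e);
	}
}
```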
There is nothing a kernel module can do to keep a struct
task_struct alive outside of
rcu_read_lock. When the task dies and is reaped
by its parent (or itself), that structure can be freed so that any
dangling pointers you have stored become invalid.
utrace will not prevent this, but it can
help you detect it safely. By definition, a task that has been reaped
has had all its engines detached. All
utrace calls can be safely called on a
detached engine if the caller holds a reference on that engine pointer,
even if the task pointer passed in the call is invalid. All calls
return -ESRCH for a detached engine, which tells
you that the task pointer you passed could be invalid now. Since
utrace_control and
utrace_set_events do not block, you can call those
inside an rcu_read_lock section and be sure, if
they do not return -ESRCH, that the task pointer is
still valid until rcu_read_unlock. The
infrastructure never holds task references of its own. Though neither
rcu_read_lock nor any other lock is held while
making a callback, it's always guaranteed that the struct
task_struct and the struct
utrace_engine passed as arguments remain valid
until the callback function returns.
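A stale task pointer is harmless in these calls because a detached engine is recognized before the task is ever dereferenced. The following hypothetical sketch makes that ordering explicit; mock_set_events and the mock structures are invented, and RCU itself is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_task { int pid; };
struct mock_engine { bool attached; unsigned long events; int last_pid; };

/* Hypothetical model: check the engine first, touch the task second. */
static int mock_set_events(struct mock_task *task, struct mock_engine *e,
			   unsigned long mask)
{
	if (!e->attached)
		return -ESRCH;        /* task may be freed: never touch it */
	/* In the kernel the caller would hold rcu_read_lock here. */
	e->last_pid = task->pid;  /* attached implies not yet reaped */
	e->events = mask;
	return 0;
}
```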
The common means available to kernel modules for safely holding task
pointers is struct pid, which kernel modules can release with
put_pid. When using
that, the calls utrace_attach_pid,
utrace_control_pid,
utrace_set_events_pid, and
utrace_barrier_pid are available.
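Holding a struct pid rather than a task pointer can be modeled like this. The mock_* names loosely mirror the kernel's get_pid, put_pid, and pid_task but are hypothetical userspace stand-ins; in the kernel the task lookup would happen under rcu_read_lock. The point is that the pid object outlives the task, and a lookup after reaping yields NULL rather than a dangling pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of struct pid; not the kernel's definition. */
struct mock_task { int nr; };
struct mock_pid {
	int count;               /* references on the pid object itself */
	struct mock_task *task;  /* cleared when the task is reaped */
};

static struct mock_pid *mock_alloc_pid(struct mock_task *t)
{
	struct mock_pid *p = malloc(sizeof(*p));
	if (p) {
		p->count = 1;    /* reference held on behalf of the task */
		p->task = t;
	}
	return p;
}

static struct mock_pid *mock_get_pid(struct mock_pid *p)
{
	p->count++;
	return p;
}

static void mock_put_pid(struct mock_pid *p)
{
	if (--p->count == 0)
		free(p);
}

/* Reaping unlinks the task; the pid survives while referenced. */
static void mock_reap(struct mock_pid *p) { p->task = NULL; }

/* Task lookup (in the kernel: inside rcu_read_lock). */
static struct mock_task *mock_pid_task(struct mock_pid *p) { return p->task; }
```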
The second guarantee is the serialization of
DEATH and REAP event
callbacks for a given thread. The actual reaping by the parent
(release_task call) can occur simultaneously
while the thread is still doing the final steps of dying, including
the report_death callback. If a tracing engine
has requested both DEATH and
REAP event reports, it's guaranteed that the
report_reap callback will not be made until
after the report_death callback has returned.
If the report_death callback itself detaches
from the thread, then the report_reap callback
will never be made. Thus it is safe for a
report_death callback to clean up data
structures and detach.
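A single-threaded sketch of this serialization follows, under the assumption that a reap arriving during report_death is simply deferred until that callback returns. All names are hypothetical; the real infrastructure coordinates concurrent threads rather than setting a pending flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of DEATH/REAP serialization for one engine. */
struct mock_engine {
	bool attached;
	bool in_death;      /* report_death currently running */
	bool reap_pending;  /* parent reaped while report_death ran */
	int deaths, reaps;
};

static void mock_report_reap(struct mock_engine *e)
{
	if (e->attached) {
		e->reaps++;
		e->attached = false;  /* implicit detach after report_reap */
	}
}

static void mock_begin_death(struct mock_engine *e)
{
	e->in_death = true;
	e->deaths++;
}

/* report_death returns; detach models UTRACE_DETACH from inside it. */
static void mock_end_death(struct mock_engine *e, bool detach)
{
	if (detach)
		e->attached = false;
	e->in_death = false;
	if (e->reap_pending) {
		e->reap_pending = false;
		mock_report_reap(e);  /* only now may report_reap run */
	}
}

/* The parent calls release_task. */
static void mock_release_task(struct mock_engine *e)
{
	if (e->in_death)
		e->reap_pending = true;
	else
		mock_report_reap(e);
}
```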
The final sort of guarantee is that a tracing engine will know for sure
whether or not the report_death and/or
report_reap callbacks will be made for a certain
thread. These tear-down races are disambiguated by the error return
values of utrace_set_events and
utrace_control. Normally
utrace_control called with
UTRACE_DETACH returns zero, and this means that no
more callbacks will be made. If the thread is in the midst of dying,
it returns -EALREADY to indicate that the
report_death callback may already be in progress;
when you get this error, you know that any cleanup your
report_death callback does is about to happen or
has just happened. Note that if the report_death
callback does not detach, the engine remains attached until the thread
gets reaped. If the thread is in the midst of being reaped,
utrace_control returns -ESRCH
to indicate that the report_reap callback may
already be in progress; this means the engine is implicitly detached
when the callback completes. This makes it possible for a tracing
engine that has decided asynchronously to detach from a thread to
safely clean up its data structures, knowing that no
report_death or report_reap
callback will try to do the same. utrace_control with UTRACE_DETACH
returns -ESRCH when the struct
utrace_engine has already been detached, but is
still a valid pointer because of its reference count. A tracing engine
can use this to safely synchronize its own independent multiple threads
of control with each other and with its event callbacks that detach.
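The decision an asynchronously detaching engine makes from these return values might be written as follows. The phase enum, mock_detach, and mock_who_cleans_up are hypothetical; the mapping assumes an engine whose report_death and report_reap callbacks do their own cleanup.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical tear-down phases of the target, as seen by a detach. */
enum mock_phase { P_LIVE, P_DYING, P_REAPING, P_DETACHED };

enum mock_cleaner {
	CLEAN_NOW,        /* no callback will touch our data again */
	DEATH_CB_CLEANS,  /* report_death is about to, or just did */
	ALREADY_HANDLED   /* report_reap is cleaning up, or an earlier
			     detach already did */
};

/* Cf. utrace_control(..., UTRACE_DETACH). */
static int mock_detach(enum mock_phase *phase)
{
	switch (*phase) {
	case P_LIVE:
		*phase = P_DETACHED;
		return 0;
	case P_DYING:
		return -EALREADY;  /* engine stays attached until reaped */
	case P_REAPING:
	case P_DETACHED:
	default:
		return -ESRCH;
	}
}

static enum mock_cleaner mock_who_cleans_up(int ret)
{
	if (ret == 0)
		return CLEAN_NOW;
	if (ret == -EALREADY)
		return DEATH_CB_CLEANS;
	return ALREADY_HANDLED;
}
```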
In the same vein, utrace_set_events normally
returns zero; if the target thread was stopped before the call, then
after a successful call no callbacks will be made for events not
requested in the new flags. It fails with -EALREADY if
you try to clear UTRACE_EVENT(DEATH) when the
report_death callback may already have begun, if
you try to clear UTRACE_EVENT(REAP) when the
report_reap callback may already have begun, or if
you try to newly set UTRACE_EVENT(DEATH) or
UTRACE_EVENT(QUIESCE) when the target is already
dead or dying. Like utrace_control, it returns
-ESRCH when the thread has already been detached
(including forcible detach on reaping). This lets the tracing engine
know for sure which event callbacks it will or won't see after
utrace_set_events has returned. By checking for
errors, it can know whether to clean up its data structures immediately
or to let its callbacks do the work.
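The -EALREADY and -ESRCH rules for utrace_set_events can be summarized in a hypothetical check like the one below. The M_EV_* bit values and struct mock_target are invented for the sketch and do not match the real UTRACE_EVENT() encoding; the model also ignores -EINPROGRESS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit values; the real UTRACE_EVENT() bits differ. */
#define M_EV_QUIESCE (1UL << 0)
#define M_EV_DEATH   (1UL << 1)
#define M_EV_REAP    (1UL << 2)

struct mock_target {
	bool detached;     /* engine detached (incl. forcibly on reap) */
	bool dying;        /* target is dead or dying */
	bool death_begun;  /* report_death may have begun */
	bool reap_begun;   /* report_reap may have begun */
};

static int mock_set_events(const struct mock_target *t,
			   unsigned long *mask, unsigned long new_mask)
{
	unsigned long cleared = *mask & ~new_mask;
	unsigned long added = new_mask & ~*mask;

	if (t->detached)
		return -ESRCH;
	if ((cleared & M_EV_DEATH) && t->death_begun)
		return -EALREADY;
	if ((cleared & M_EV_REAP) && t->reap_begun)
		return -EALREADY;
	if ((added & (M_EV_DEATH | M_EV_QUIESCE)) && t->dying)
		return -EALREADY;
	*mask = new_mask;  /* success: only the new events are reported */
	return 0;
}
```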
When a thread is safely stopped, calling
utrace_control with UTRACE_DETACH
or calling utrace_set_events to disable some events
ensures synchronously that your engine won't get any more of the callbacks
that have been disabled (none at all when detaching). But these can also
be used while the thread is not stopped, when it might be simultaneously
making a callback to your engine. For this situation, these calls return
-EINPROGRESS when it's possible a callback is in
progress. If you are not prepared to have your old callbacks still run,
then you can synchronize to be sure all the old callbacks are finished,
using utrace_barrier. This is necessary if the
kernel module containing your callback code is going to be unloaded.
After using UTRACE_DETACH once, further calls to
utrace_control with the same engine pointer will
return -ESRCH. In contrast, after getting
-EINPROGRESS from
utrace_set_events, you can call
utrace_set_events again later and if it returns zero
then know the old callbacks have finished.
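The retry idiom for -EINPROGRESS can be sketched with a model in which the new mask is recorded immediately and the return value only reports whether an old callback might still be running. struct mock_engine and mock_set_events are hypothetical, not the utrace API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct mock_engine {
	bool target_stopped;  /* thread is safely stopped */
	int in_callback;      /* >0 while the thread runs one of our callbacks */
	unsigned long events;
};

/* Hypothetical model: the mask takes effect for future events at once,
 * but an old callback may still be executing on the target. */
static int mock_set_events(struct mock_engine *e, unsigned long mask)
{
	e->events = mask;
	if (!e->target_stopped && e->in_callback > 0)
		return -EINPROGRESS;
	return 0;  /* no old callback can still be running */
}
```

Calling it again and getting zero plays the role described above: the old callbacks are known to have finished.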
Unlike all other calls, utrace_barrier (and
utrace_barrier_pid) will accept any engine pointer you
hold a reference on, even if UTRACE_DETACH has already
been used. After any utrace_control or
utrace_set_events call (these do not block), you can
call utrace_barrier to block until callbacks have
finished. This returns -ESRCH only if the engine is
completely detached (finished all callbacks). Otherwise it waits
until the thread is definitely not in the midst of a callback to this
engine and then returns zero, but can return
-ERESTARTSYS if its wait is interrupted.
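Finally, the return values of utrace_barrier can be modeled as below. mock_barrier and the step callbacks are hypothetical; the step function stands in for sleeping and being woken, and MOCK_ERESTARTSYS uses the value 512 that the kernel assigns to its internal ERESTARTSYS code, which userspace errno.h does not define.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal code (512 in include/linux/errno.h), defined by hand. */
#define MOCK_ERESTARTSYS 512

struct mock_engine {
	bool fully_detached;  /* detached and all callbacks finished */
	int in_callback;      /* >0 while the target runs our callback */
};

/* One wait step; returns true if a signal interrupted the wait. */
typedef bool (*mock_wait_step)(struct mock_engine *e);

static int mock_barrier(struct mock_engine *e, mock_wait_step step)
{
	if (e->fully_detached)
		return -ESRCH;
	while (e->in_callback > 0) {
		if (step(e))
			return -MOCK_ERESTARTSYS;  /* wait interrupted */
	}
	return 0;  /* definitely not inside one of our callbacks */
}

static bool callback_finishes(struct mock_engine *e)
{
	e->in_callback--;
	return false;
}

static bool signal_arrives(struct mock_engine *e)
{
	(void)e;
	return true;
}
```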