Discussion:
[PATCH] Also use l_tls_dtor_count to decide on object unload
Siddhesh Poyarekar
2015-07-09 18:07:24 UTC
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag with a direct check of
l_tls_dtor_count to determine whether it is safe to unload the object.
This change has the added advantage of eliminating the lock-taking when
calling the destructors, which could otherwise result in a deadlock. The
patch also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just silently return an error.
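The counting scheme the patch adopts can be sketched in miniature (the names below are illustrative stand-ins, not the actual glibc internals): registration atomically bumps a per-link-map counter, each destructor run decrements it, and the close path refuses to unload while the counter is non-zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for struct link_map's l_tls_dtor_count.  */
struct fake_map
{
  atomic_size_t tls_dtor_count;
};

/* Like __cxa_thread_atexit_impl: count one pending destructor.  */
static void
register_dtor (struct fake_map *m)
{
  atomic_fetch_add (&m->tls_dtor_count, 1);
}

/* Like __call_tls_dtors: a destructor has run, drop the count.  */
static void
run_dtor (struct fake_map *m)
{
  atomic_fetch_sub (&m->tls_dtor_count, 1);
}

/* Like _dl_close_worker: unload only when no destructors are pending.  */
static bool
may_unload (struct fake_map *m)
{
  return atomic_load (&m->tls_dtor_count) == 0;
}
```

With the counter carrying the unload decision, DF_1_NODELETE is left alone, so a flag set by -z nodelete or RTLD_NODELETE is no longer clobbered when the destructors run.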

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose.
---
elf/dl-close.c | 9 ++++++++-
stdlib/cxa_thread_atexit_impl.c | 25 +++++++------------------
stdlib/tst-tls-atexit.c | 9 ++++++++-
3 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..30e30e2 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -173,10 +177,13 @@ _dl_close_worker (struct link_map *map, bool force)
/* Already handled. */
continue;

+ size_t tls_dtor_count = atomic_load_relaxed (&l->l_tls_dtor_count);
+
/* Check whether this object is still used. */
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ && tls_dtor_count == 0
&& !used[done_index])
continue;

diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 9120162..fac2cc9 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -50,27 +50,25 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
tls_dtor_list = new;

/* See if we already encountered the DSO. */
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;

+ /* _dl_find_dso_for_object assumes that we have the dl_load_lock. */
+ __rtld_lock_lock_recursive (GL(dl_load_lock));
struct link_map *l = _dl_find_dso_for_object (caller);
+ __rtld_lock_unlock_recursive (GL(dl_load_lock));

/* If the address is not recognized the call comes from the main
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
+
/* A destructor could result in a thread_local construction and the former
could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
+ atomic_increment (&lm_cache->l_tls_dtor_count);

new->map = lm_cache;
- new->map->l_tls_dtor_count++;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));

return 0;
}
@@ -83,19 +81,10 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
-
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ atomic_decrement (&cur->map->l_tls_dtor_count);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index 68247d1..ba70790 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -82,7 +82,14 @@ do_test (void)
if (thr_ret != NULL)
return 1;

- /* Now this should unload the DSO. */
+ /* Now this sequence should unload the DSO. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (!handle)
+ {
+ printf ("main thread: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
dlclose (handle);

/* Run through our maps and ensure that the DSO is unloaded. */
--
2.1.0
Roland McGrath
2015-07-09 20:43:01 UTC
Post by Siddhesh Poyarekar
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
That sounds like a user-visible bug, which should have both a BZ# and a
concrete test case for a -z nodelete object or some such case that would
demonstrate the bug.
Post by Siddhesh Poyarekar
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -82,7 +82,14 @@ do_test (void)
if (thr_ret != NULL)
return 1;
- /* Now this should unload the DSO. */
+ /* Now this sequence should unload the DSO. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (!handle)
No implicit Boolean coercion.
Siddhesh Poyarekar
2015-07-10 18:16:35 UTC
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag with a direct check of
l_tls_dtor_count to determine whether it is safe to unload the object.
This change has the added advantage of eliminating the lock-taking when
calling the destructors, which could otherwise result in a deadlock. The
patch also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just silently return an error.

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
---
elf/dl-close.c | 9 ++-
stdlib/Makefile | 5 +-
stdlib/cxa_thread_atexit_impl.c | 25 +++------
stdlib/tst-tls-atexit-nodelete.c | 118 +++++++++++++++++++++++++++++++++++++++
stdlib/tst-tls-atexit.c | 9 ++-
5 files changed, 145 insertions(+), 21 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..30e30e2 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -173,10 +177,13 @@ _dl_close_worker (struct link_map *map, bool force)
/* Already handled. */
continue;

+ size_t tls_dtor_count = atomic_load_relaxed (&l->l_tls_dtor_count);
+
/* Check whether this object is still used. */
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ && tls_dtor_count == 0
&& !used[done_index])
continue;

diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv

modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so

+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 9120162..fac2cc9 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -50,27 +50,25 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
tls_dtor_list = new;

/* See if we already encountered the DSO. */
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;

+ /* _dl_find_dso_for_object assumes that we have the dl_load_lock. */
+ __rtld_lock_lock_recursive (GL(dl_load_lock));
struct link_map *l = _dl_find_dso_for_object (caller);
+ __rtld_lock_unlock_recursive (GL(dl_load_lock));

/* If the address is not recognized the call comes from the main
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
+
/* A destructor could result in a thread_local construction and the former
could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
+ atomic_increment (&lm_cache->l_tls_dtor_count);

new->map = lm_cache;
- new->map->l_tls_dtor_count++;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));

return 0;
}
@@ -83,19 +81,10 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
-
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ atomic_decrement (&cur->map->l_tls_dtor_count);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..93daefc
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,118 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <dlfcn.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+static void *
+use_handle (void *h)
+{
+ void (*foo) (void) = (void (*) (void)) dlsym(h, "do_foo");
+
+ if (!foo)
+ {
+ printf ("Unable to find symbol: %s\n", dlerror ());
+ exit (1);
+ }
+
+ foo ();
+
+ return NULL;
+}
+
+static int
+do_test (void)
+{
+ pthread_t t;
+ int ret;
+ void *thr_ret;
+
+ /* Load our module. We access (and hence construct the object) in the
+ thread. */
+ void *handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (handle == NULL)
+ {
+ printf ("Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ if ((ret = pthread_create (&t, NULL, use_handle, handle)) != 0)
+ {
+ printf ("pthread_create failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if ((ret = pthread_join (t, &thr_ret)) != 0)
+ {
+ printf ("pthread_create failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if (thr_ret != NULL)
+ return 1;
+
+ /* This sequence should not unload the DSO. */
+ void *h2 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so",
+ RTLD_LAZY | RTLD_NODELETE);
+ if (h2 == NULL)
+ {
+ printf ("main thread: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ use_handle (h2);
+
+ /* Close the handle we created in the thread. */
+ dlclose (handle);
+ dlclose (h2);
+
+ /* Run through our maps and ensure that the DSO is unloaded. */
+ FILE *f = fopen ("/proc/self/maps", "r");
+
+ if (f == NULL)
+ {
+ perror ("Failed to open /proc/self/maps");
+ fprintf (stderr, "Skipping verification of DSO unload\n");
+ return 0;
+ }
+
+ char *line = NULL;
+ size_t s = 0;
+ while (getline (&line, &s, f) > 0)
+ {
+ if (strstr (line, "tst-tls-atexit-lib.so"))
+ {
+ printf ("DSO not unloaded yet:\n%s", line);
+ return 0;
+ }
+ }
+ free (line);
+
+ /* The module was unloaded even when we specified RTLD_NODELETE. FAIL. */
+ printf ("tst-tls-atexit-lib.so was unloaded.\n");
+ return 1;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index 5ee04a8..374470e 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -79,7 +79,14 @@ do_test (void)
if (thr_ret != NULL)
return 1;

- /* Now this should unload the DSO. */
+ /* Now this sequence should unload the DSO. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (handle == NULL)
+ {
+ printf ("main thread: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
dlclose (handle);

/* Run through our maps and ensure that the DSO is unloaded. */
--
2.4.3
Ondřej Bílka
2015-07-10 19:33:52 UTC
Post by Siddhesh Poyarekar
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag with a direct check of
l_tls_dtor_count to determine whether it is safe to unload the object.
This change has the added advantage of eliminating the lock-taking when
calling the destructors, which could otherwise result in a deadlock. The
patch also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just silently return an error.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
Does not work. In thread A we register a destructor just before the
atomic add. Meanwhile we call dlclose in thread B, which succeeds as the
count is zero. After thread A resumes, it causes a SIGSEGV when the
destructor gets called.
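The interleaving described above can be replayed deterministically in a single thread as a simulation (illustrative names, not the glibc code): thread A has resolved the link map but not yet incremented the count when thread B's dlclose observes zero and unloads the object.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fake_map
{
  atomic_size_t tls_dtor_count;
  bool freed;               /* Stands in for the DSO being unmapped.  */
};

/* Thread B: dlclose unloads when no destructors are pending.  */
static void
fake_dlclose (struct fake_map *m)
{
  if (atomic_load (&m->tls_dtor_count) == 0)
    m->freed = true;        /* Object unloaded.  */
}

/* Thread A resumes after the window: it bumps the count on a map that
   may already have been freed.  Returns true if the bug occurred,
   i.e. we touched an unloaded object.  */
static bool
register_dtor_after_window (struct fake_map *m)
{
  atomic_fetch_add (&m->tls_dtor_count, 1);
  return m->freed;
}
```

The v4 patch later in the thread closes this window by holding dl_load_lock across both the link-map lookup and the counter increment, so a concurrent dlclose cannot observe the count before it is bumped.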
Siddhesh Poyarekar
2015-07-15 10:51:43 UTC
Post by Ondřej Bílka
Does not work. In thread A we register a destructor just before the
atomic add. Meanwhile we call dlclose in thread B, which succeeds as the
count is zero. After thread A resumes, it causes a SIGSEGV when the
destructor gets called.
Right, thanks for pointing that out. Let me look at this again.

Siddhesh
Roland McGrath
2015-07-10 20:19:14 UTC
Post by Siddhesh Poyarekar
+ size_t tls_dtor_count = atomic_load_relaxed (&l->l_tls_dtor_count);
+
/* Check whether this object is still used. */
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ && tls_dtor_count == 0
&& !used[done_index])
continue;
IMHO the separate variable here is actually less readable than
just putting atomic_load_relaxed (...) inside the if.
Post by Siddhesh Poyarekar
+static void *
+use_handle (void *h)
+{
+ void (*foo) (void) = (void (*) (void)) dlsym(h, "do_foo");
Space before paren after dlsym.
Post by Siddhesh Poyarekar
+ if (!foo)
No implicit Boolean coercion.
Post by Siddhesh Poyarekar
+ /* Load our module. We access (and hence construct the object) in the
+ thread. */
I'm not sure I understand this comment. Access what? Which object? What
does construct mean here? The best reading I can guess is, "We call into
the module in the thread created below, so it accesses its TLS and
registers its destructor in that thread." Is that what it means?
Post by Siddhesh Poyarekar
+ if (thr_ret != NULL)
+ return 1;
What's the point of this test? It should be an assert, if anything.
There is no aspect of what's being tested here that affects the return
value of the thread function, is there?
Post by Siddhesh Poyarekar
+ /* This sequence should not unload the DSO. */
+ void *h2 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so",
+ RTLD_LAZY | RTLD_NODELETE);
Perhaps use a macro for the DSO name string, so it's more obvious that it's
exactly the same thing in both calls.
Post by Siddhesh Poyarekar
+ /* Close the handle we created in the thread. */
+ dlclose (handle);
+ dlclose (h2);
I don't understand this comment. I guess "create the handle" means "call
dlopen", and both calls are in this function, on the main thread.
Post by Siddhesh Poyarekar
+ /* Run through our maps and ensure that the DSO is unloaded. */
+ FILE *f = fopen ("/proc/self/maps", "r");
This is unnecessarily Linux-specific. I don't think you need to do
external inspection to test this. You can just do something like save a
pointer from dlsym, then try to use it; if the module was unloaded, that
will crash.

It's also repeating code from tst-tls-atexit.c (which I don't recall
previously noticing was doing this /proc/self/maps funny business that I
don't think is really what we want). Perhaps these two tests can share
that part of the code?


Thanks,
Roland
Siddhesh Poyarekar
2015-07-15 09:30:05 UTC
v3: Goes on top of the patch to remove the tst-tls-atexit Linuxism and
the fix for setting RTLD_NODELETE on an already opened DSO.

When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag with a direct check of
l_tls_dtor_count to determine whether it is safe to unload the object.
This change has the added advantage of eliminating the lock-taking when
calling the destructors, which could otherwise result in a deadlock. The
patch also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just silently return an error.

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
---
elf/dl-close.c | 7 ++-
stdlib/Makefile | 5 +-
stdlib/cxa_thread_atexit_impl.c | 25 +++-------
stdlib/tst-tls-atexit-nodelete.c | 104 +++++++++++++++++++++++++++++++++++++++
stdlib/tst-tls-atexit.c | 14 ++++--
5 files changed, 132 insertions(+), 23 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..def726f 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,7 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ && atomic_load_relaxed (&l->l_tls_dtor_count) == 0
&& !used[done_index])
continue;

diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv

modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so

+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 9120162..fac2cc9 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -50,27 +50,25 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
tls_dtor_list = new;

/* See if we already encountered the DSO. */
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;

+ /* _dl_find_dso_for_object assumes that we have the dl_load_lock. */
+ __rtld_lock_lock_recursive (GL(dl_load_lock));
struct link_map *l = _dl_find_dso_for_object (caller);
+ __rtld_lock_unlock_recursive (GL(dl_load_lock));

/* If the address is not recognized the call comes from the main
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
+
/* A destructor could result in a thread_local construction and the former
could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
+ atomic_increment (&lm_cache->l_tls_dtor_count);

new->map = lm_cache;
- new->map->l_tls_dtor_count++;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));

return 0;
}
@@ -83,19 +81,10 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
-
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ atomic_decrement (&cur->map->l_tls_dtor_count);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..cb2b39d
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,104 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <dlfcn.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+static void *
+use_handle (void *h)
+{
+ void (*reg_dtor) (void) = (void (*) (void)) dlsym (h, "reg_dtor");
+
+ if (reg_dtor == NULL)
+ {
+ printf ("Unable to find symbol reg_dtor: %s\n", dlerror ());
+ return (void *) (uintptr_t) 1;
+ }
+
+ reg_dtor ();
+
+ return NULL;
+}
+
+static int
+do_test (void)
+{
+ pthread_t t;
+ int ret;
+ void *thr_ret;
+
+ void *h1 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (h1 == NULL)
+ {
+ printf ("h1: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ void *h2 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so",
+ RTLD_LAZY | RTLD_NODELETE);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* FOO is our variable to test if the DSO is unloaded. */
+ int *foo = dlsym (h2, "foo_var");
+
+ if (foo == NULL)
+ {
+ printf ("Unable to find symbol do_foo: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* Print FOO to be sure that it is working OK. */
+ printf ("%d\n", *foo);
+
+ if ((ret = pthread_create (&t, NULL, use_handle, h1)) != 0)
+ {
+ printf ("pthread_create failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if ((ret = pthread_join (t, &thr_ret)) != 0)
+ {
+ printf ("pthread_join failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if (thr_ret != NULL)
+ return 1;
+
+ /* Closing the handles should not unload the DSO. */
+ dlclose (h1);
+ dlclose (h2);
+
+ /* This should not crash. */
+ printf ("%d\n", *foo);
+
+ return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index 0a9e920..ae4c328 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -99,9 +99,17 @@ do_test (void)
if (thr_ret != NULL)
return 1;

- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
+ /* Now this sequence should unload the DSO. Note that we don't call do_foo,
+ because we don't want to register the thread destructor. We just want to
+ make sure that in the absence of pending destructors, the library is
+ unloaded correctly. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (handle == NULL)
+ {
+ printf ("main thread: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
dlclose (handle);

/* Handle SIGSEGV. We don't use the EXPECTED_SIGNAL macro because we want to
--
2.4.3
Siddhesh Poyarekar
2015-07-15 11:10:25 UTC
v4: Take the big lock for a longer duration during destructor
registration.

When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag with a direct check of
l_tls_dtor_count to determine whether it is safe to unload the object.
This change has the added advantage of eliminating the lock-taking when
calling the destructors, which could otherwise result in a deadlock. The
patch also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just silently return an error.

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
---
elf/dl-close.c | 7 ++-
stdlib/Makefile | 5 +-
stdlib/cxa_thread_atexit_impl.c | 27 +++-------
stdlib/tst-tls-atexit-nodelete.c | 104 +++++++++++++++++++++++++++++++++++++++
stdlib/tst-tls-atexit.c | 14 ++++--
5 files changed, 133 insertions(+), 24 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..def726f 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,7 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ && atomic_load_relaxed (&l->l_tls_dtor_count) == 0
&& !used[done_index])
continue;

diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv

modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so

+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 9120162..c44f9d0 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -49,9 +49,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;

- /* See if we already encountered the DSO. */
+ /* We have to take the big lock to prevent a racing dlclose from pulling our
+ DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));

+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +64,12 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;

+ atomic_increment (&lm_cache->l_tls_dtor_count);
__rtld_lock_unlock_recursive (GL(dl_load_lock));

+ new->map = lm_cache;
+
return 0;
}

@@ -83,19 +81,10 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
-
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ atomic_decrement (&cur->map->l_tls_dtor_count);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..cb2b39d
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,104 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <dlfcn.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+static void *
+use_handle (void *h)
+{
+ void (*reg_dtor) (void) = (void (*) (void)) dlsym (h, "reg_dtor");
+
+ if (reg_dtor == NULL)
+ {
+ printf ("Unable to find symbol reg_dtor: %s\n", dlerror ());
+ return (void *) (uintptr_t) 1;
+ }
+
+ reg_dtor ();
+
+ return NULL;
+}
+
+static int
+do_test (void)
+{
+ pthread_t t;
+ int ret;
+ void *thr_ret;
+
+ void *h1 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (h1 == NULL)
+ {
+ printf ("h1: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ void *h2 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so",
+ RTLD_LAZY | RTLD_NODELETE);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* FOO is our variable to test if the DSO is unloaded. */
+ int *foo = dlsym (h2, "foo_var");
+
+ if (foo == NULL)
+ {
+ printf ("Unable to find symbol foo_var: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* Print FOO to be sure that it is working OK. */
+ printf ("%d\n", *foo);
+
+ if ((ret = pthread_create (&t, NULL, use_handle, h1)) != 0)
+ {
+ printf ("pthread_create failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if ((ret = pthread_join (t, &thr_ret)) != 0)
+ {
+ printf ("pthread_join failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if (thr_ret != NULL)
+ return 1;
+
+ /* Closing the handles should not unload the DSO. */
+ dlclose (h1);
+ dlclose (h2);
+
+ /* This should not crash. */
+ printf ("%d\n", *foo);
+
+ return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index 0a9e920..ae4c328 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -99,9 +99,17 @@ do_test (void)
if (thr_ret != NULL)
return 1;

- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
+ /* Now this sequence should unload the DSO. Note that we don't call do_foo,
+ because we don't want to register the thread destructor. We just want to
+ make sure that in the absence of pending destructors, the library is
+ unloaded correctly. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (handle == NULL)
+ {
+ printf ("main thread: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
dlclose (handle);

/* Handle SIGSEGV. We don't use the EXPECTED_SIGNAL macro because we want to
--
2.4.3
Ondřej Bílka
2015-07-16 20:15:04 UTC
Post by Siddhesh Poyarekar
v4: Take the big lock for a longer duration during destructor
registration.
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
Now this looks OK to me.
Carlos O'Donell
2015-07-17 13:32:45 UTC
Post by Siddhesh Poyarekar
v4: Take the big lock for a longer duration during destructor
registration.
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
Agreed.

Great work using l_tls_dtor_count to do reference counting atomically!

This looks like a permanent and long-lasting solution to the problem.

It's exactly the kind of thing I like to see being used in the loader...
Post by Siddhesh Poyarekar
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
... However, you need more P&C comments. So a v5 will be required.

To start with:

diff --git a/include/link.h b/include/link.h
index 7ccd9c3..b39fa73 100644
--- a/include/link.h
+++ b/include/link.h
@@ -303,6 +303,7 @@ struct link_map
size_t l_tls_modid;

/* Number of thread_local objects constructed by this DSO. */
+ /* Concurrency: Protected by rtld_load_lock. */
size_t l_tls_dtor_count;

/* Information used to change permission after the relocations are
---

Is what I have on my P&C docs local branch. Please add a comment to
include/link.h to describe there at definition of struct link_map how
l_tls_dtor_count may be accessed. The above comment was something I was
going to push, but I'll have to update it to say "Accessed atomically."
now since the rtld_load_lock does not protect that member any longer.
Post by Siddhesh Poyarekar
[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
---
elf/dl-close.c | 7 ++-
stdlib/Makefile | 5 +-
stdlib/cxa_thread_atexit_impl.c | 27 +++-------
stdlib/tst-tls-atexit-nodelete.c | 104 +++++++++++++++++++++++++++++++++++++++
stdlib/tst-tls-atexit.c | 14 ++++--
5 files changed, 133 insertions(+), 24 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c
diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..def726f 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;
- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,7 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ && atomic_load_relaxed (&l->l_tls_dtor_count) == 0
Is _relaxed valid here?

Needs a comment specifying why _relaxed is valid.
Post by Siddhesh Poyarekar
&& !used[done_index])
continue;
diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv
modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so
+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 9120162..c44f9d0 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -49,9 +49,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;
- /* See if we already encountered the DSO. */
+ /* We have to take the big lock to prevent a racing dlclose from pulling our
+ DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));
+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +64,12 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;
+ atomic_increment (&lm_cache->l_tls_dtor_count);
Needs a comment specifying what it synchronizes with.

Should be a C11 atomic.

Should or could be _release, _acquire, or _relaxed.
Post by Siddhesh Poyarekar
__rtld_lock_unlock_recursive (GL(dl_load_lock));
+ new->map = lm_cache;
+
return 0;
}
@@ -83,19 +81,10 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;
+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
-
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ atomic_decrement (&cur->map->l_tls_dtor_count);
Needs a comment specifying what this synchronizes with.

We should be trying hard to use the ISO C11 atomics.

This could be atomic_fetch_add_* ?
Post by Siddhesh Poyarekar
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..cb2b39d
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,104 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <dlfcn.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+static void *
+use_handle (void *h)
+{
+ void (*reg_dtor) (void) = (void (*) (void)) dlsym (h, "reg_dtor");
+
+ if (reg_dtor == NULL)
+ {
+ printf ("Unable to find symbol reg_dtor: %s\n", dlerror ());
+ return (void *) (uintptr_t) 1;
+ }
+
+ reg_dtor ();
+
+ return NULL;
+}
+
+static int
+do_test (void)
+{
+ pthread_t t;
+ int ret;
+ void *thr_ret;
+
+ void *h1 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (h1 == NULL)
+ {
+ printf ("h1: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ void *h2 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so",
+ RTLD_LAZY | RTLD_NODELETE);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* FOO is our variable to test if the DSO is unloaded. */
+ int *foo = dlsym (h2, "foo_var");
+
+ if (foo == NULL)
+ {
+ printf ("Unable to find symbol foo_var: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* Print FOO to be sure that it is working OK. */
+ printf ("%d\n", *foo);
+
+ if ((ret = pthread_create (&t, NULL, use_handle, h1)) != 0)
+ {
+ printf ("pthread_create failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if ((ret = pthread_join (t, &thr_ret)) != 0)
+ {
+ printf ("pthread_join failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if (thr_ret != NULL)
+ return 1;
+
+ /* Closing the handles should not unload the DSO. */
+ dlclose (h1);
+ dlclose (h2);
+
+ /* This should not crash. */
+ printf ("%d\n", *foo);
+
+ return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index 0a9e920..ae4c328 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -99,9 +99,17 @@ do_test (void)
if (thr_ret != NULL)
return 1;
- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
+ /* Now this sequence should unload the DSO. Note that we don't call do_foo,
+ because we don't want to register the thread destructor. We just want to
+ make sure that in the absence of pending destructors, the library is
+ unloaded correctly. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (handle == NULL)
+ {
+ printf ("main thread: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
dlclose (handle);
/* Handle SIGSEGV. We don't use the EXPECTED_SIGNAL macro because we want to
Cheers,
Carlos.
Siddhesh Poyarekar
2015-07-20 10:51:04 UTC
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.

I have also added a detailed note on concurrency which aims to
justify the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
---
elf/dl-close.c | 9 +++-
include/link.h | 4 +-
stdlib/Makefile | 5 +-
stdlib/cxa_thread_atexit_impl.c | 76 +++++++++++++++++++++-------
stdlib/tst-tls-atexit-nodelete.c | 104 +++++++++++++++++++++++++++++++++++++++
stdlib/tst-tls-atexit.c | 14 ++++--
6 files changed, 188 insertions(+), 24 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..9105277 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,9 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
+ acquire is sufficient and correct. */
+ && atomic_load_acquire (&l->l_tls_dtor_count) == 0
&& !used[done_index])
continue;

diff --git a/include/link.h b/include/link.h
index 423a695..7e7eb79 100644
--- a/include/link.h
+++ b/include/link.h
@@ -302,7 +302,9 @@ struct link_map
/* Index of the module in the dtv array. */
size_t l_tls_modid;

- /* Number of thread_local objects constructed by this DSO. */
+ /* Number of thread_local objects constructed by this DSO. This is
+ atomically accessed and modified and is not always protected by the load
+ lock. See also: CONCURRENCY NOTES in cxa_thread_atexit_impl.c. */
size_t l_tls_dtor_count;

/* Information used to change permission after the relocations are
diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv

modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so

+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 7a7d3d6..dc51ab5 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -16,6 +16,50 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

+/* CONCURRENCY NOTES:
+
+ The functions __cxa_thread_atexit_impl, _dl_close_worker and
+ __call_tls_dtors are the three main routines that may run concurrently and
+ access shared data. The shared data in all possible combinations of all
+ three functions are the link map list, a link map for a DSO and the link map
+ member l_tls_dtor_count.
+
+ __cxa_thread_atexit_impl takes the load_lock before modifying any shared
+ state and hence can concurrently execute multiple of its instances in
+ different threads safely.
+
+ _dl_close_worker takes the load_lock before modifying any shared state as
+ well and hence can concurrently execute multiple of its own instances as
+ well as those of __cxa_thread_atexit_impl safely.
+
+ __call_tls_dtors does not take the load lock, but reads only the link map
+ of the current DSO and its member l_tls_dtor_count. It has to ensure that
+ l_tls_dtor_count is decremented atomically.
+
+ Correspondingly, _dl_close_worker loads l_tls_dtor_count and if it is zero,
+ unloads the DSO, thus dereferencing the current link map. Hence, for
+ concurrent execution of _dl_close_worker and __call_tls_dtors, it is
+ necessary that:
+
+ 1. The DSO unload in _dl_close_worker happens after l_tls_dtor_count, i.e.
+ the l_tls_dtor_count load does not get reordered to after the DSO is
+ unloaded
+ 2. The link map dereference in __call_tls_dtors happens before the
+ l_tls_dtor_count dereference.
+
+ to ensure that we eliminate the inconsistent state where the DSO is unloaded
+ before it is dereferenced in __call_tls_dtors, which could happen if any of
+ the above two pairs of sequences were reordered.
+
+ To ensure this, the l_tls_dtor_count decrement in __call_tls_dtors should
+ have release semantics and the load in _dl_close_worker should have acquire
+ semantics.
+
+ Concurrent executions of __call_tls_dtors should only ensure that the value
+ is modified atomically; no reordering constraints need to be considered.
+ Likewise for the increment of l_tls_dtor_count in
+ __cxa_thread_atexit_impl. */
+
#include <stdlib.h>
#include <ldsodefs.h>

@@ -49,9 +93,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;

- /* See if we already encountered the DSO. */
+ /* We have to take the big lock to prevent a racing dlclose from pulling our
+ DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));

+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +108,14 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;

+ /* Relaxed since the only shared state here is l_tls_dtor_count and hence
+ there are no reordering issues. */
+ atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
__rtld_lock_unlock_recursive (GL(dl_load_lock));

+ new->map = lm_cache;
+
return 0;
}

@@ -83,19 +127,15 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);

- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ /* Ensure that the MAP dereference is not reordered below the
+ l_tls_dtor_count decrement. This ensures that it synchronizes with
+ the load in _dl_close_worker and keeps this dereference before the DSO
+ unload. See CONCURRENCY NOTES for more detail. */
+ atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..cb2b39d
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,104 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <dlfcn.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+static void *
+use_handle (void *h)
+{
+ void (*reg_dtor) (void) = (void (*) (void)) dlsym (h, "reg_dtor");
+
+ if (reg_dtor == NULL)
+ {
+ printf ("Unable to find symbol reg_dtor: %s\n", dlerror ());
+ return (void *) (uintptr_t) 1;
+ }
+
+ reg_dtor ();
+
+ return NULL;
+}
+
+static int
+do_test (void)
+{
+ pthread_t t;
+ int ret;
+ void *thr_ret;
+
+ void *h1 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (h1 == NULL)
+ {
+ printf ("h1: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ void *h2 = dlopen ("$ORIGIN/tst-tls-atexit-lib.so",
+ RTLD_LAZY | RTLD_NODELETE);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* FOO is our variable to test if the DSO is unloaded. */
+ int *foo = dlsym (h2, "foo_var");
+
+ if (foo == NULL)
+ {
+ printf ("Unable to find symbol foo_var: %s\n", dlerror ());
+ return 1;
+ }
+
+ /* Print FOO to be sure that it is working OK. */
+ printf ("%d\n", *foo);
+
+ if ((ret = pthread_create (&t, NULL, use_handle, h1)) != 0)
+ {
+ printf ("pthread_create failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if ((ret = pthread_join (t, &thr_ret)) != 0)
+ {
+ printf ("pthread_join failed: %s\n", strerror (ret));
+ return 1;
+ }
+
+ if (thr_ret != NULL)
+ return 1;
+
+ /* Closing the handles should not unload the DSO. */
+ dlclose (h1);
+ dlclose (h2);
+
+ /* This should not crash. */
+ printf ("%d\n", *foo);
+
+ return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index 0a9e920..ccea8a7 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -99,9 +99,17 @@ do_test (void)
if (thr_ret != NULL)
return 1;

- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
+ /* Now this sequence should unload the DSO. Note that we don't call do_foo,
+ because we don't want to register the thread destructor. We just want to
+ make sure that in the absence of pending destructors, the library is
+ unloaded correctly. */
+ handle = dlopen ("$ORIGIN/tst-tls-atexit-lib.so", RTLD_LAZY);
+ if (handle == NULL)
+ {
+ printf ("2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
+
dlclose (handle);

/* Handle SIGSEGV. We don't use the EXPECTED_SIGNAL macro because we want to
--
2.4.3
Siddhesh Poyarekar
2015-07-20 17:31:12 UTC
v6: Juggled around the test case code a bit so that it uses tst-tls-atexit.c
with a couple of macros.

When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.

I have also added a detailed note on concurrency which aims to
justify the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* include/link.h (struct link_map): Add concurrency note for
L_TLS_DTOR_COUNT.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose. Add conditionals
to allow tst-tls-atexit-nodelete test case to use it.
---
elf/dl-close.c | 9 ++++-
include/link.h | 4 ++-
stdlib/Makefile | 5 ++-
stdlib/cxa_thread_atexit_impl.c | 76 ++++++++++++++++++++++++++++++----------
stdlib/tst-tls-atexit-nodelete.c | 24 +++++++++++++
stdlib/tst-tls-atexit.c | 60 +++++++++++++++++++++++--------
6 files changed, 143 insertions(+), 35 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..9105277 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,9 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
+ acquire is sufficient and correct. */
+ && atomic_load_acquire (&l->l_tls_dtor_count) == 0
&& !used[done_index])
continue;

diff --git a/include/link.h b/include/link.h
index 423a695..7e7eb79 100644
--- a/include/link.h
+++ b/include/link.h
@@ -302,7 +302,9 @@ struct link_map
/* Index of the module in the dtv array. */
size_t l_tls_modid;

- /* Number of thread_local objects constructed by this DSO. */
+ /* Number of thread_local objects constructed by this DSO. This is
+ atomically accessed and modified and is not always protected by the load
+ lock. See also: CONCURRENCY NOTES in cxa_thread_atexit_impl.c. */
size_t l_tls_dtor_count;

/* Information used to change permission after the relocations are
diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv

modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so

+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 7a7d3d6..dc51ab5 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -16,6 +16,50 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

+/* CONCURRENCY NOTES:
+
+ The functions __cxa_thread_atexit_impl, _dl_close_worker and
+ __call_tls_dtors are the three main routines that may run concurrently and
+ access shared data. The shared data in all possible combinations of all
+ three functions are the link map list, a link map for a DSO and the link map
+ member l_tls_dtor_count.
+
+ __cxa_thread_atexit_impl takes the load_lock before modifying any shared
+ state and hence can concurrently execute multiple of its instances in
+ different threads safely.
+
+ _dl_close_worker takes the load_lock before modifying any shared state as
+ well and hence can concurrently execute multiple of its own instances as
+ well as those of __cxa_thread_atexit_impl safely.
+
+ __call_tls_dtors does not take the load lock, but reads only the link map
+ of the current DSO and its member l_tls_dtor_count. It has to ensure that
+ l_tls_dtor_count is decremented atomically
+
+ Correspondingly, _dl_close_worker loads l_tls_dtor_count and if it is zero,
+ unloads the DSO, thus dereferencing the current link map. Hence, for
+ concurrent execution of _dl_close_worker and __call_tls_dtors, it is
+ necessary that:
+
+ 1. The DSO unload in _dl_close_worker happens after l_tls_dtor_count, i.e.
+ the l_tls_dtor_count load does not get reordered to after the DSO is
+ unloaded
+ 2. The link map dereference in __call_tls_dtors happens before the
+ l_tls_dtor_count dereference.
+
+ to ensure that we eliminate the inconsistent state where the DSO is unloaded
+ before it is dereferenced in __call_tls_dtors, which could happen if any of
+ the above two pairs of sequences were reordered.
+
+ To ensure this, the l_tls_dtor_count decrement in __call_tls_dtors should
+ have release semantics and the load in _dl_close_worker should have acquire
+ semantics.
+
+ Concurrent executions of __call_tls_dtors should only ensure that the value
+ is modified atomically; no reordering constraints need to be considered.
+ Likewise for the increment of l_tls_dtor_count in
+ __cxa_thread_atexit_impl. */
+
#include <stdlib.h>
#include <ldsodefs.h>

@@ -49,9 +93,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;

- /* See if we already encountered the DSO. */
+ /* We have to take the big lock to prevent a racing dlclose from pulling our
+ DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));

+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +108,14 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;

+ /* Relaxed since the only shared state here is l_tls_dtor_count and hence
+ there are no reordering issues. */
+ atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
__rtld_lock_unlock_recursive (GL(dl_load_lock));

+ new->map = lm_cache;
+
return 0;
}

@@ -83,19 +127,15 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);

- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ /* Ensure that the MAP dereference is not reordered below the
+ l_tls_dtor_count decrement. This ensures that it synchronizes with
+ the load in _dl_close_worker and keeps this dereference before the DSO
+ unload. See CONCURRENCY NOTES for more detail. */
+ atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..ff290fa
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,24 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#define NO_DELETE 1
+#define H2_RTLD_FLAGS (RTLD_LAZY | RTLD_NODELETE)
+#define LOADED_IS_GOOD true
+#include "tst-tls-atexit.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index cea655d..e9839d8 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -16,12 +16,20 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

-/* This test dynamically loads a DSO and spawns a thread that subsequently
- calls into the DSO to register a destructor for an object in the DSO and
- then calls dlclose on the handle for the DSO. When the thread exits, the
- DSO should not be unloaded or else the destructor called during thread exit
- will crash. Further in the main thread, the DSO is opened and closed again,
- at which point the DSO should be unloaded. */
+/* For the default case, i.e. NO_DELETE not defined, the test dynamically loads
+ a DSO and spawns a thread that subsequently calls into the DSO to register a
+ destructor for an object in the DSO and then calls dlclose on the handle for
+ the DSO. When the thread exits, the DSO should not be unloaded or else the
+ destructor called during thread exit will crash. Further in the main
+ thread, the DSO is opened and closed again, at which point the DSO should be
+ unloaded.
+
+ When NO_DELETE is defined, the DSO is loaded twice, once with just RTLD_LAZY
+ flag and the second time with the RTLD_NODELETE flag set. The thread is
+ spawned, destructor registered and then thread exits without closing the
+ DSO. In the main thread, the first handle is then closed, followed by the
+ second handle. In the end, the DSO should remain loaded due to the
+ RTLD_NODELETE flag being set in the second dlopen call. */

#include <dlfcn.h>
#include <pthread.h>
@@ -31,6 +39,14 @@
#include <errno.h>
#include <link.h>

+#ifndef NO_DELETE
+# define LOADED_IS_GOOD false
+#endif
+
+#ifndef H2_RTLD_FLAGS
+# define H2_RTLD_FLAGS (RTLD_LAZY)
+#endif
+
#define DSO_NAME "$ORIGIN/tst-tls-atexit-lib.so"

/* Walk through the map in the _r_debug structure to see if our lib is still
@@ -43,7 +59,10 @@ is_loaded (void)
for (; lm; lm = lm->l_next)
if (lm->l_type == lt_loaded && lm->l_name
&& strcmp (basename (DSO_NAME), basename (lm->l_name)) == 0)
- return true;
+ {
+ printf ("%s is still loaded\n", lm->l_name);
+ return true;
+ }
return false;
}

@@ -63,7 +82,9 @@ reg_dtor_and_close (void *h)

reg_dtor ();

+#ifndef NO_DELETE
dlclose (h);
+#endif

return NULL;
}
@@ -104,19 +125,30 @@ do_test (void)
return 1;
}

+#ifndef NO_DELETE
if (spawn_thread (h1) != 0)
return 1;
+#endif

- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
- dlclose (h1);
+ void *h2 = dlopen (DSO_NAME, H2_RTLD_FLAGS);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }

- /* Check link maps to ensure that the DSO has unloaded. */
- if (is_loaded ())
+#ifdef NO_DELETE
+ if (spawn_thread (h1) != 0)
return 1;

- return 0;
+ dlclose (h1);
+#endif
+ dlclose (h2);
+
+ /* Check link maps to ensure that the DSO has unloaded. In the normal case,
+ the DSO should be unloaded if there are no uses. However, if one of the
+ dlopen calls were with RTLD_NODELETE, the DSO should remain loaded. */
+ return is_loaded () == LOADED_IS_GOOD ? 0 : 1;
}

#define TEST_FUNCTION do_test ()
--
2.4.3
Torvald Riegel
2015-07-21 20:29:27 UTC
Post by Siddhesh Poyarekar
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 7a7d3d6..dc51ab5 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -16,6 +16,50 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
+
+ The functions __cxa_thread_atexit_impl, _dl_close_worker and
+ __call_tls_dtors are the three main routines that may run concurrently and
+ access shared data. The shared data in all possible combinations of all
+ three functions are the link map list, a link map for a DSO and the link map
+ member l_tls_dtor_count.
OK. Perhaps just say a bit more specifically that this documents
concurrency for the atexit implementation (or whatever is the proper
context)?
Post by Siddhesh Poyarekar
+ __cxa_thread_atexit_impl takes the load_lock before modifying any shared
+ state and hence can concurrently execute multiple of its instances in
+ different threads safely.
I have a slight preference for "acquires" over "takes", but I know that
"take a lock" is used often.

If you're not specifically talking about only modifications, I think it's
better to use "accessing" instead of "modifying". The ++ has a load,
too, and it tells the reader that both loads and stores are meant, not
just modifications.

Also, maybe say "multiple of its instances can safely execute
concurrently" to make it passive (ie, unspecified who/what executes
them).
Post by Siddhesh Poyarekar
+ _dl_close_worker takes the load_lock before modifying any shared state as
+ well and hence can concurrently execute multiple of its own instances as
+ well as those of __cxa_thread_atexit_impl safely.
+
+ __call_tls_dtors does not take the load lock, but reads only the link map
+ of the current DSO
Perhaps briefly say why this is okay?
Post by Siddhesh Poyarekar
and its member l_tls_dtor_count. It has to ensure that
+ l_tls_dtor_count is decremented atomically
Missing period.

If it decrements l_tls_dtor_count, it doesn't only read this and the
link map.

Perhaps it's easier to understand if you start by saying that not all
accesses to l_tls_dtor_count are protected by a lock and that we thus
need to synchronize using atomics. Then perhaps briefly list the
respective accesses / functions.
Post by Siddhesh Poyarekar
+ Correspondingly, _dl_close_worker loads l_tls_dtor_count and if it is zero,
+ unloads the DSO, thus dereferencing the current link map.
Good. Perhaps say that this is what we want to do. This is the
abstract goal that we have / what we want to implement.
Post by Siddhesh Poyarekar
Hence, for
+ concurrent execution of _dl_close_worker and __call_tls_dtors, it is
+
+ 1. The DSO unload in _dl_close_worker happens after l_tls_dtor_count
You mention just the variable, but not the effect, state transition, or
piece of program logic. (In contrast, "DSO unload" is an effect.)
Post by Siddhesh Poyarekar
, i.e.
+ the l_tls_dtor_count load does not get reordered to after the DSO is
+ unloaded
I know what you are trying to say, but I think it's better if we don't
start talking about "reordering" because that's at somewhat the wrong
level. Reordering is what a compiler or a CPU might do, and we don't
want to reason about that. Instead, we want to start from the
high-level goal (eg, DSO unload doesn't happen if the DSO is still
used), and reason from there what we need in terms of happens-before
relations, and then implementation things like MOs.
Post by Siddhesh Poyarekar
+ 2. The link map dereference in __call_tls_dtors happens before the
+ l_tls_dtor_count dereference.
+
+ to ensure that we eliminate the inconsistent state where the DSO is unloaded
+ before it is dereferenced in __call_tls_dtors,
That's the piece of information to start with, I believe.
Then you can say that l_tls_dtor_count acts as something like a
reference count.
Then you can follow from that that DSO use happens before decrement the
count happens before we unload. We unload only if the count is zero, so
we end up with a typical release-store / acquire load pattern.
Post by Siddhesh Poyarekar
which could happen if any of
+ the above two pairs of sequences were reordered.
+
+ To ensure this, the l_tls_dtor_count decrement in __call_tls_dtors should
+ have release semantics and the load in _dl_close_worker should have acquire
+ semantics.
I'm not sure which level of detail we want to use here. We could say
that we need the release/acquire pair so that the reads-from ensures a
happens-before. We could also assume that this pattern is well-known to
glibc developers.
(IOW, that's a question for the community to decide; I don't know enough
about what others know or aren't too familiar with.)
Post by Siddhesh Poyarekar
+ Concurrent executions of __call_tls_dtors should only ensure that the value
+ is modified atomically; no reordering constraints need to be considered.
If you are talking about the decrement operation, then perhaps say that.
Saying that the order in which they decrement/increment doesn't matter
for us is useful information though.
Post by Siddhesh Poyarekar
+ Likewise for the increment of l_tls_dtor_count in
+ __cxa_thread_atexit_impl. */
+
#include <stdlib.h>
#include <ldsodefs.h>
@@ -49,9 +93,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;
- /* See if we already encountered the DSO. */
+ /* We have to take the big lock to prevent a racing dlclose from pulling our
+ DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));
+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +108,14 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;
+ /* Relaxed since the only shared state here is l_tls_dtor_count and hence
"Relaxed MO", just to be a bit more clear.
Post by Siddhesh Poyarekar
+ there are no reordering issues. */
I believe relaxed MO is fine, but I don't really understand the reason
as you have stated it. I think the question to ask here is who observes
the effect (ie, the increment; IOW, which loads observe this store)?
And for those observers, are there any that need a relation between this
effect and what they do? Or if not, then say that.
Post by Siddhesh Poyarekar
+ atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
__rtld_lock_unlock_recursive (GL(dl_load_lock));
+ new->map = lm_cache;
+
return 0;
}
@@ -83,19 +127,15 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;
+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ /* Ensure that the MAP dereference is not reordered below the
+ l_tls_dtor_count decrement. This ensures that it synchronizes with
+ the load in _dl_close_worker and keeps this dereference before the DSO
+ unload. See CONCURRENCY NOTES for more detail. */
+ atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
Good, but I'd not talk about reordering. Instead I'd say that we want
the MAP dereference to happen before we give it up and thus before
potential DSO unload. That states the high-level goal. Then cover the
low-level implementation using the second sentence above that you
already have.


Overall, with perhaps the exception of the reasoning about the relaxed
MO on the increment, and based on the conversation that I had with
Siddhesh, I think the synchronization in this patch is sound. Thus, I
think that we can also do more word-smithing on how we explain it after
the release (so that this doesn't hold up the release then).

I do think the word-smithing is useful, because good concurrency
documentation ensures that we all are more likely to not misunderstand
each other and understand what's going on in the code. But I guess that
we'll need some more examples and iterations before we have figured out
all the details of how to document this so that it's clear for everyone.
Siddhesh Poyarekar
2015-07-23 05:49:24 UTC
Post by Torvald Riegel
Overall, with perhaps the exception of the reasoning about the relaxed
MO on the increment, and based on the conversation that I had with
Siddhesh, I think the synchronization in this patch is sound. Thus, I
think that we can also do more word-smithing on how we explain it after
the release (so that this doesn't hold up the release then).
Thanks, I've pushed this now, with more changes to the documentation
as per your suggestion. Revised version is below.

Siddhesh

commit 90b37cac8b5a3e1548c29d91e3e0bff1014d2e5c
Author: Siddhesh Poyarekar <***@redhat.com>
Date: Thu Jul 23 11:16:18 2015 +0530

Also use l_tls_dtor_count to decide on object unload (BZ #18657)

When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.

This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.

I have also added a detailed note on concurrency which also aims to
justify why I chose the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.

Change verified on x86_64; the test suite does not show any
regressions due to the patch.

ChangeLog:

[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* include/link.h (struct link_map): Add concurrency note for
L_TLS_DTOR_COUNT.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose. Add conditionals
to allow tst-tls-atexit-nodelete test case to use it.

diff --git a/ChangeLog b/ChangeLog
index 652d968..25abba6 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,21 @@
+2015-07-23 Siddhesh Poyarekar <***@redhat.com>
+
+ [BZ #18657]
+ * elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
+ are pending TLS destructor calls.
+ * include/link.h (struct link_map): Add concurrency note for
+ L_TLS_DTOR_COUNT.
+ * stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
+ Don't touch the link map flag. Atomically increment
+ l_tls_dtor_count.
+ (__call_tls_dtors): Atomically decrement l_tls_dtor_count.
+ Avoid taking the load lock and don't touch the link map flag.
+ * stdlib/tst-tls-atexit-nodelete.c: New test case.
+ * stdlib/Makefile (tests): Use it.
+ * stdlib/tst-tls-atexit.c (do_test): dlopen
+ tst-tls-atexit-lib.so again before dlclose. Add conditionals
+ to allow tst-tls-atexit-nodelete test case to use it.
+
2015-07-22 Mike Frysinger <***@gentoo.org>

* sysdeps/unix/sysv/linux/ia64/bits/msq.h: Change sys/types.h include
diff --git a/NEWS b/NEWS
index 007d57d..fbe8484 100644
--- a/NEWS
+++ b/NEWS
@@ -27,7 +27,7 @@ Version 2.22
18522, 18527, 18528, 18529, 18530, 18532, 18533, 18534, 18536, 18539,
18540, 18542, 18544, 18545, 18546, 18547, 18549, 18553, 18557, 18558,
18569, 18583, 18585, 18586, 18592, 18593, 18594, 18602, 18612, 18613,
- 18619, 18633, 18641, 18643, 18648, 18676, 18694, 18696.
+ 18619, 18633, 18641, 18643, 18648, 18657, 18676, 18694, 18696.

* Cache information can be queried via sysconf() function on s390 e.g. with
_SC_LEVEL1_ICACHE_SIZE as argument.
diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..9105277 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;

- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,9 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
+ acquire is sufficient and correct. */
+ && atomic_load_acquire (&l->l_tls_dtor_count) == 0
&& !used[done_index])
continue;

diff --git a/include/link.h b/include/link.h
index 423a695..7e7eb79 100644
--- a/include/link.h
+++ b/include/link.h
@@ -302,7 +302,9 @@ struct link_map
/* Index of the module in the dtv array. */
size_t l_tls_modid;

- /* Number of thread_local objects constructed by this DSO. */
+ /* Number of thread_local objects constructed by this DSO. This is
+ atomically accessed and modified and is not always protected by the load
+ lock. See also: CONCURRENCY NOTES in cxa_thread_atexit_impl.c. */
size_t l_tls_dtor_count;

/* Information used to change permission after the relocations are
diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv

modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so

+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 70c97da..8e26380 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -16,6 +16,61 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

+/* CONCURRENCY NOTES:
+
+ This documents concurrency for the non-POD TLS destructor registration,
+ calling and destruction. The functions __cxa_thread_atexit_impl,
+ _dl_close_worker and __call_tls_dtors are the three main routines that may
+ run concurrently and access shared data. The shared data in all possible
+ combinations of all three functions are the link map list, a link map for a
+ DSO and the link map member l_tls_dtor_count.
+
+ __cxa_thread_atexit_impl acquires the load_lock before accessing any shared
+ state and hence multiple of its instances can safely execute concurrently.
+
+ _dl_close_worker acquires the load_lock before accessing any shared state as
+ well and hence can concurrently execute multiple of its own instances as
+ well as those of __cxa_thread_atexit_impl safely. Not all accesses to
+ l_tls_dtor_count are protected by the load lock, so we need to synchronize
+ using atomics.
+
+ __call_tls_dtors accesses the l_tls_dtor_count without taking the lock; it
+ decrements the value by one. It does not need the big lock because it does
+ not access any other shared state except for the current DSO link map and
+ its member l_tls_dtor_count.
+
+ Correspondingly, _dl_close_worker loads l_tls_dtor_count and if it is zero,
+ unloads the DSO, thus deallocating the current link map. This is the goal
+ of maintaining l_tls_dtor_count - to unload the DSO and free resources if
+ there are no pending destructors to be called.
+
+ We want to eliminate the inconsistent state where the DSO is unloaded in
+ _dl_close_worker before it is used in __call_tls_dtors. This could happen
+ if __call_tls_dtors uses the link map after it sets l_tls_dtor_count to 0,
+ since _dl_close_worker will conclude from the 0 l_tls_dtor_count value that
+ it is safe to unload the DSO. Hence, to ensure that this does not happen,
+ the following conditions must be met:
+
+ 1. In _dl_close_worker, the l_tls_dtor_count load happens before the DSO
+ is unloaded and its link map is freed.
+ 2. The link map dereference in __call_tls_dtors happens before the
+ l_tls_dtor_count dereference.
+
+ To ensure this, the l_tls_dtor_count decrement in __call_tls_dtors should
+ have release semantics and the load in _dl_close_worker should have acquire
+ semantics.
+
+ Concurrent executions of __call_tls_dtors should only ensure that the value
+ is accessed atomically; no reordering constraints need to be considered.
+ Likewise for the increment of l_tls_dtor_count in __cxa_thread_atexit_impl.
+
+ There is still a possible concurrent execution of _dl_close_worker and
+ __call_tls_dtors in which _dl_close_worker reads the value of l_tls_dtor_count
+ as 1, __call_tls_dtors decrements the value of l_tls_dtor_count but
+ _dl_close_worker does not unload the DSO, having read the old value. This
+ is not very different from a case where __call_tls_dtors is called after
+ _dl_close_worker on the DSO and hence is an accepted execution. */
+
#include <stdlib.h>
#include <ldsodefs.h>

@@ -49,9 +104,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;

- /* See if we already encountered the DSO. */
+ /* We have to acquire the big lock to prevent a racing dlclose from pulling
+ our DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));

+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +119,17 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;

+ /* This increment may only be concurrently observed by the decrement in
+ __call_tls_dtors, since the other l_tls_dtor_count access in
+ _dl_close_worker is protected by the load lock. The execution in
+ __call_tls_dtors does not really depend on this value beyond the fact that
+ it should be atomic, so Relaxed MO should be sufficient. */
+ atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
__rtld_lock_unlock_recursive (GL(dl_load_lock));

+ new->map = lm_cache;
+
return 0;
}

@@ -83,19 +141,15 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;

+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);

- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ /* Ensure that the MAP dereference happens before
+ l_tls_dtor_count decrement. That way, we protect this access from a
+ potential DSO unload in _dl_close_worker, which happens when
+ l_tls_dtor_count is 0. See CONCURRENCY NOTES for more detail. */
+ atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..ff290fa
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,24 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#define NO_DELETE 1
+#define H2_RTLD_FLAGS (RTLD_LAZY | RTLD_NODELETE)
+#define LOADED_IS_GOOD true
+#include "tst-tls-atexit.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index cea655d..e9839d8 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -16,12 +16,20 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */

-/* This test dynamically loads a DSO and spawns a thread that subsequently
- calls into the DSO to register a destructor for an object in the DSO and
- then calls dlclose on the handle for the DSO. When the thread exits, the
- DSO should not be unloaded or else the destructor called during thread exit
- will crash. Further in the main thread, the DSO is opened and closed again,
- at which point the DSO should be unloaded. */
+/* For the default case, i.e. NO_DELETE not defined, the test dynamically loads
+ a DSO and spawns a thread that subsequently calls into the DSO to register a
+ destructor for an object in the DSO and then calls dlclose on the handle for
+ the DSO. When the thread exits, the DSO should not be unloaded or else the
+ destructor called during thread exit will crash. Further in the main
+ thread, the DSO is opened and closed again, at which point the DSO should be
+ unloaded.
+
+ When NO_DELETE is defined, the DSO is loaded twice, once with just RTLD_LAZY
+ flag and the second time with the RTLD_NODELETE flag set. The thread is
+ spawned, destructor registered and then thread exits without closing the
+ DSO. In the main thread, the first handle is then closed, followed by the
+ second handle. In the end, the DSO should remain loaded due to the
+ RTLD_NODELETE flag being set in the second dlopen call. */

#include <dlfcn.h>
#include <pthread.h>
@@ -31,6 +39,14 @@
#include <errno.h>
#include <link.h>

+#ifndef NO_DELETE
+# define LOADED_IS_GOOD false
+#endif
+
+#ifndef H2_RTLD_FLAGS
+# define H2_RTLD_FLAGS (RTLD_LAZY)
+#endif
+
#define DSO_NAME "$ORIGIN/tst-tls-atexit-lib.so"

/* Walk through the map in the _r_debug structure to see if our lib is still
@@ -43,7 +59,10 @@ is_loaded (void)
for (; lm; lm = lm->l_next)
if (lm->l_type == lt_loaded && lm->l_name
&& strcmp (basename (DSO_NAME), basename (lm->l_name)) == 0)
- return true;
+ {
+ printf ("%s is still loaded\n", lm->l_name);
+ return true;
+ }
return false;
}

@@ -63,7 +82,9 @@ reg_dtor_and_close (void *h)

reg_dtor ();

+#ifndef NO_DELETE
dlclose (h);
+#endif

return NULL;
}
@@ -104,19 +125,30 @@ do_test (void)
return 1;
}

+#ifndef NO_DELETE
if (spawn_thread (h1) != 0)
return 1;
+#endif

- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
- dlclose (h1);
+ void *h2 = dlopen (DSO_NAME, H2_RTLD_FLAGS);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }

- /* Check link maps to ensure that the DSO has unloaded. */
- if (is_loaded ())
+#ifdef NO_DELETE
+ if (spawn_thread (h1) != 0)
return 1;

- return 0;
+ dlclose (h1);
+#endif
+ dlclose (h2);
+
+ /* Check link maps to ensure that the DSO has unloaded. In the normal case,
+ the DSO should be unloaded if there are no uses. However, if one of the
+ dlopen calls were with RTLD_NODELETE, the DSO should remain loaded. */
+ return is_loaded () == LOADED_IS_GOOD ? 0 : 1;
}

#define TEST_FUNCTION do_test ()
Carlos O'Donell
2015-07-23 20:01:43 UTC
Post by Siddhesh Poyarekar
Post by Torvald Riegel
Overall, with perhaps the exception of the reasoning about the relaxed
MO on the increment, and based on the conversation that I had with
Siddhesh, I think the synchronization in this patch is sound. Thus, I
think that we can also do more word-smithing on how we explain it after
the release (so that this doesn't hold up the release then).
Thanks, I've pushed this now, with more changes to the documentation
as per your suggestion. Revised version is below.
This version fixes almost all of my suggestions from v6 (though this is now v7).

However, it remains that we use "load lock" and "load_lock" to talk about
"dl_load_lock".

I would prefer that all references be made to the real name of the lock
e.g. "dl_load_lock", in the even that some day we split the lock in two
that we don't have to go through and clarify which of the two load locks
we're talking about.

Cheers,
Carlos.
Siddhesh Poyarekar
2015-07-24 00:41:22 UTC
Post by Carlos O'Donell
However, it remains that we use "load lock" and "load_lock" to talk about
"dl_load_lock".
I would prefer that all references be made to the real name of the lock
e.g. "dl_load_lock", in the event that some day we split the lock in two,
so that we don't have to go through and clarify which of the two load locks
we're talking about.
Done, I've pushed this now:

Siddhesh

commit a81a00ff94a43af85f7aefceb6d31f3c0f11151d
Author: Siddhesh Poyarekar <***@redhat.com>
Date: Fri Jul 24 06:09:47 2015 +0530

Mention dl_load_lock by name in the comments

Mention dl_load_lock by name instead of just 'load lock' in the
comments. This makes it unambiguous which lock we're talking about.

diff --git a/ChangeLog b/ChangeLog
index f1b7bd7..d43a564 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2015-07-24 Siddhesh Poyarekar <***@redhat.com>
+
+ * stdlib/cxa_thread_atexit_impl.c: Use the lock name dl_load_lock
+ instead of just saying load lock in the comments.
+
2015-07-23 Roland McGrath <***@hack.frob.com>

* sysdeps/unix/Subdirs: Moved ...
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 8e26380..2d5d56a 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -25,14 +25,15 @@
combinations of all three functions are the link map list, a link map for a
DSO and the link map member l_tls_dtor_count.

- __cxa_thread_atexit_impl acquires the load_lock before accessing any shared
- state and hence multiple of its instances can safely execute concurrently.
+ __cxa_thread_atexit_impl acquires the dl_load_lock before accessing any
+ shared state and hence multiple of its instances can safely execute
+ concurrently.

- _dl_close_worker acquires the load_lock before accessing any shared state as
- well and hence can concurrently execute multiple of its own instances as
+ _dl_close_worker acquires the dl_load_lock before accessing any shared state
+ as well and hence can concurrently execute multiple of its own instances as
well as those of __cxa_thread_atexit_impl safely. Not all accesses to
- l_tls_dtor_count are protected by the load lock, so we need to synchronize
- using atomics.
+ l_tls_dtor_count are protected by the dl_load_lock, so we need to
+ synchronize using atomics.

__call_tls_dtors accesses the l_tls_dtor_count without taking the lock; it
decrements the value by one. It does not need the big lock because it does
@@ -51,8 +52,8 @@
it is safe to unload the DSO. Hence, to ensure that this does not happen,
the following conditions must be met:

- 1. In _dl_close_worker, the l_tls_dtor_count load happens before the DSO
- is unload and its link map is freed
+ 1. In _dl_close_worker, the l_tls_dtor_count load happens before the DSO is
+ unloaded and its link map is freed
2. The link map dereference in __call_tls_dtors happens before the
l_tls_dtor_count dereference.

@@ -122,7 +123,7 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)

/* This increment may only be concurrently observed either by the decrement
in __call_tls_dtors since the other l_tls_dtor_count access in
- _dl_close_worker is protected by the load lock. The execution in
+ _dl_close_worker is protected by the dl_load_lock. The execution in
__call_tls_dtors does not really depend on this value beyond the fact that
it should be atomic, so Relaxed MO should be sufficient. */
atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
Carlos O'Donell
2015-07-24 02:41:03 UTC
Post by Carlos O'Donell
However, it remains that we use "load lock" and "load_lock" to talk about
"dl_load_lock".
I would prefer that all references be made to the real name of the lock
e.g. "dl_load_lock", in the event that some day we split the lock in two,
so that we don't have to go through and clarify which of the two load locks
we're talking about.
Perfect! Thanks again!

c.
Roland McGrath
2015-07-22 00:18:48 UTC
The test refactoring looks OKish to me.
Could have done it with fewer macros, but it's fine.
Carlos O'Donell
2015-07-23 19:54:44 UTC
Post by Siddhesh Poyarekar
v6: Juggled around the test case code a bit so that it uses tst-tls-atexit.c
with a couple of macros.
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
I have also added a detailed note on concurrency which also aims to
justify why I chose the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
This version looks awesome. Thanks for the awesome work!

OK for 2.22 with suggestions (see below).

I had lots of fun coming up with orderings between the release and acquire
pairs and the mutex locks which I treated as release+acquire barriers
for the purpose of the analysis.

My own rough review notes for reference:

__cxa_thread_atexit_impl
atomic_fetch_add_relaxed l_tls_dtor_count +1
- dl_load_lock held.
- Excludes any unloads so forbids __dl_close_worker
- Does not forbid a thread being killed.
- Deferred cancel happens after lock release
so not concurrent with anything else.
And thread eventually calls __call_tls_dtors
with release barrier.
- dl_load_lock is release+acquire barrier, which allows relaxed
ordering atomic_fetch_add to l_tls_dtor_count, because after
unlock you will either see consistent MAP or you won't see
MAP at all since load/stores can't leak past the unlock.

__call_tls_dtors
atomic_fetch_add_release curr->map->l_tls_dtor_count -1
- write release
- keeps map deref before the decrement which means you never
access map with l_tls_dtor_count set to 0.

__dl_close_worker
Took dl_load_lock
atomic_load_acquire l_tls_dtor_count
- load acquire
- either sees l_tls_dtor_count as 0 or positive
- if zero, then we know map in __call_tls_dtors was
already derefed so we know it's safe to destroy map
- if positive then we know map has not been dereffed yet
so we can't delete it. We don't know when atomic_fetch_add_release
will happen if ever.
if l_tls_dtor_count == 0
freed map.
Released dl_load_lock
Post by Siddhesh Poyarekar
[BZ #18657]
* elf/dl-close.c (_dl_close_worker): Don't unload DSO if there
are pending TLS destructor calls.
* include/link.h (struct link_map): Add concurrency note for
L_TLS_DTOR_COUNT.
* stdlib/cxa_thread_atexit_impl.c (__cxa_thread_atexit_impl):
Don't touch the link map flag. Atomically increment
l_tls_dtor_count.
(__call_tls_dtors): Atomically decrement l_tls_dtor_count.
Avoid taking the load lock and don't touch the link map flag.
* stdlib/tst-tls-atexit-nodelete.c: New test case.
* stdlib/Makefile (tests): Use it.
* stdlib/tst-tls-atexit.c (do_test): dlopen
tst-tls-atexit-lib.so again before dlclose. Add conditionals
to allow tst-tls-atexit-nodelete test case to use it.
---
elf/dl-close.c | 9 ++++-
include/link.h | 4 ++-
stdlib/Makefile | 5 ++-
stdlib/cxa_thread_atexit_impl.c | 76 ++++++++++++++++++++++++++++++----------
stdlib/tst-tls-atexit-nodelete.c | 24 +++++++++++++
stdlib/tst-tls-atexit.c | 60 +++++++++++++++++++++++--------
6 files changed, 143 insertions(+), 35 deletions(-)
create mode 100644 stdlib/tst-tls-atexit-nodelete.c
diff --git a/elf/dl-close.c b/elf/dl-close.c
index 2104674..9105277 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -153,7 +153,11 @@ _dl_close_worker (struct link_map *map, bool force)
maps[idx] = l;
++idx;
- /* Clear DF_1_NODELETE to force object deletion. */
+ /* Clear DF_1_NODELETE to force object deletion. We don't need to touch
+ l_tls_dtor_count because forced object deletion only happens when an
+ error occurs during object load. Destructor registration for TLS
+ non-POD objects should not have happened till then for this
+ object. */
OK.
Post by Siddhesh Poyarekar
if (force)
l->l_flags_1 &= ~DF_1_NODELETE;
}
@@ -177,6 +181,9 @@ _dl_close_worker (struct link_map *map, bool force)
if (l->l_type == lt_loaded
&& l->l_direct_opencount == 0
&& (l->l_flags_1 & DF_1_NODELETE) == 0
+ /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
+ acquire is sufficient and correct. */
+ && atomic_load_acquire (&l->l_tls_dtor_count) == 0
OK.
Post by Siddhesh Poyarekar
&& !used[done_index])
continue;
diff --git a/include/link.h b/include/link.h
index 423a695..7e7eb79 100644
--- a/include/link.h
+++ b/include/link.h
@@ -302,7 +302,9 @@ struct link_map
/* Index of the module in the dtv array. */
size_t l_tls_modid;
- /* Number of thread_local objects constructed by this DSO. */
+ /* Number of thread_local objects constructed by this DSO. This is
+ atomically accessed and modified and is not always protected by the load
+ lock. See also: CONCURRENCY NOTES in cxa_thread_atexit_impl.c. */
Suggest: s/load lock/_dl_load_lock/g so we don't get confused in the future
if we split the big lock.
Post by Siddhesh Poyarekar
size_t l_tls_dtor_count;
/* Information used to change permission after the relocations are
diff --git a/stdlib/Makefile b/stdlib/Makefile
index 7fc5a80..402466a 100644
--- a/stdlib/Makefile
+++ b/stdlib/Makefile
@@ -74,7 +74,7 @@ tests := tst-strtol tst-strtod testmb testrand testsort testdiv \
tst-makecontext3 bug-getcontext bug-fmtmsg1 \
tst-secure-getenv tst-strtod-overflow tst-strtod-round \
tst-tininess tst-strtod-underflow tst-tls-atexit \
- tst-setcontext3
+ tst-setcontext3 tst-tls-atexit-nodelete
tests-static := tst-secure-getenv
modules-names = tst-tls-atexit-lib
@@ -159,6 +159,9 @@ tst-tls-atexit-lib.so-no-z-defs = yes
$(objpfx)tst-tls-atexit: $(shared-thread-library) $(libdl)
$(objpfx)tst-tls-atexit.out: $(objpfx)tst-tls-atexit-lib.so
+$(objpfx)tst-tls-atexit-nodelete: $(shared-thread-library) $(libdl)
+$(objpfx)tst-tls-atexit-nodelete.out: $(objpfx)tst-tls-atexit-lib.so
+
$(objpfx)tst-setcontext3.out: tst-setcontext3.sh $(objpfx)tst-setcontext3
$(SHELL) $< $(common-objpfx) '$(test-program-prefix-before-env)' \
'$(run-program-env)' '$(test-program-prefix-after-env)' \
diff --git a/stdlib/cxa_thread_atexit_impl.c b/stdlib/cxa_thread_atexit_impl.c
index 7a7d3d6..dc51ab5 100644
--- a/stdlib/cxa_thread_atexit_impl.c
+++ b/stdlib/cxa_thread_atexit_impl.c
@@ -16,6 +16,50 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
+
Great writeup.
Post by Siddhesh Poyarekar
+ The functions __cxa_thread_atexit_impl, _dl_close_worker and
+ __call_tls_dtors are the three main routines that may run concurrently and
+ access shared data. The shared data in all possible combinations of all
+ three functions are the link map list, a link map for a DSO and the link map
+ member l_tls_dtor_count.
+
+ __cxa_thread_atexit_impl takes the load_lock before modifying any shared
+ state and hence can concurrently execute multiple of its instances in
+ different threads safely.
+
+ _dl_close_worker takes the load_lock before modifying any shared state as
+ well and hence can concurrently execute multiple of its own instances as
+ well as those of __cxa_thread_atexit_impl safely.
+
+ __call_tls_dtors does not take the load lock, but reads only the link map
+ of the current DSO and its member l_tls_dtor_count. It has to ensure that
+ l_tls_dtor_count is decremented atomically
+
+ Correspondingly, _dl_close_worker loads l_tls_dtor_count and if it is zero,
+ unloads the DSO, thus dereferencing the current link map. Hence, for
+ concurrent execution of _dl_close_worker and __call_tls_dtors, it is
+
+ 1. The DSO unload in _dl_close_worker happens after l_tls_dtor_count, i.e.
Suggest:

... happens after l_tls_dtor_count load, i.e.
Post by Siddhesh Poyarekar
+ the l_tls_dtor_count load does not get reordered to after the DSO is
+ unloaded
+ 2. The link map dereference in __call_tls_dtors happens before the
+ l_tls_dtor_count dereference.
Agreed.
Post by Siddhesh Poyarekar
+
+ to ensure that we eliminate the inconsistent state where the DSO is unloaded
+ before it is dereferenced in __call_tls_dtors, which could happen if any of
+ the above two pairs of sequences were reordered.
+
+ To ensure this, the l_tls_dtor_count decrement in __call_tls_dtors should
+ have release semantics and the load in _dl_close_worker should have acquire
+ semantics.
+
+ Concurrent executions of __call_tls_dtors should only ensure that the value
+ is modified atomically; no reordering constraints need to be considered.
+ Likewise for the increment of l_tls_dtor_count in
+ __cxa_thread_atexit_impl. */
OK.
Post by Siddhesh Poyarekar
+
#include <stdlib.h>
#include <ldsodefs.h>
@@ -49,9 +93,11 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
new->next = tls_dtor_list;
tls_dtor_list = new;
- /* See if we already encountered the DSO. */
+ /* We have to take the big lock to prevent a racing dlclose from pulling our
+ DSO from underneath us while we're setting up our destructor. */
__rtld_lock_lock_recursive (GL(dl_load_lock));
OK.
Post by Siddhesh Poyarekar
+ /* See if we already encountered the DSO. */
if (__glibc_unlikely (dso_symbol_cache != dso_symbol))
{
ElfW(Addr) caller = (ElfW(Addr)) dso_symbol;
@@ -62,16 +108,14 @@ __cxa_thread_atexit_impl (dtor_func func, void *obj, void *dso_symbol)
program (we hope). */
lm_cache = l ? l : GL(dl_ns)[LM_ID_BASE]._ns_loaded;
}
- /* A destructor could result in a thread_local construction and the former
- could have cleared the flag. */
- if (lm_cache->l_type == lt_loaded && lm_cache->l_tls_dtor_count == 0)
- lm_cache->l_flags_1 |= DF_1_NODELETE;
-
- new->map = lm_cache;
- new->map->l_tls_dtor_count++;
+ /* Relaxed since the only shared state here is l_tls_dtor_count and hence
+ there are no reordering issues. */
+ atomic_fetch_add_relaxed (&lm_cache->l_tls_dtor_count, 1);
OK.

I assume that this is OK because the _dl_load_lock mutex acts as a
release/acquire barrier which ensures that another thread would see
a consistent MAP if it itself took the _dl_load_lock lock. That is to say that
a racing _dl_close_worker would either see no new MAP or a new MAP with
a non-zero l_tls_dtor_count value (assuming the DSO had a thread_local
destructor).
Post by Siddhesh Poyarekar
__rtld_lock_unlock_recursive (GL(dl_load_lock));
+ new->map = lm_cache;
+
OK.
Post by Siddhesh Poyarekar
return 0;
}
@@ -83,19 +127,15 @@ __call_tls_dtors (void)
while (tls_dtor_list)
{
struct dtor_list *cur = tls_dtor_list;
- tls_dtor_list = tls_dtor_list->next;
+ tls_dtor_list = tls_dtor_list->next;
cur->func (cur->obj);
- __rtld_lock_lock_recursive (GL(dl_load_lock));
-
- /* Allow DSO unload if count drops to zero. */
- cur->map->l_tls_dtor_count--;
- if (cur->map->l_tls_dtor_count == 0 && cur->map->l_type == lt_loaded)
- cur->map->l_flags_1 &= ~DF_1_NODELETE;
-
- __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
+ /* Ensure that the MAP dereference is not reordered below the
+ l_tls_dtor_count decrement. This ensures that it synchronizes with
+ the load in _dl_close_worker and keeps this dereference before the DSO
+ unload. See CONCURRENCY NOTES for more detail. */
+ atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
OK.
Post by Siddhesh Poyarekar
free (cur);
}
}
diff --git a/stdlib/tst-tls-atexit-nodelete.c b/stdlib/tst-tls-atexit-nodelete.c
new file mode 100644
index 0000000..ff290fa
--- /dev/null
+++ b/stdlib/tst-tls-atexit-nodelete.c
@@ -0,0 +1,24 @@
+/* Verify that a RTLD_NODELETE DSO is not unloaded even if its TLS objects are
+ destroyed.
+
+ Copyright (C) 2015 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#define NO_DELETE 1
+#define H2_RTLD_FLAGS (RTLD_LAZY | RTLD_NODELETE)
+#define LOADED_IS_GOOD true
+#include "tst-tls-atexit.c"
diff --git a/stdlib/tst-tls-atexit.c b/stdlib/tst-tls-atexit.c
index cea655d..e9839d8 100644
--- a/stdlib/tst-tls-atexit.c
+++ b/stdlib/tst-tls-atexit.c
@@ -16,12 +16,20 @@
License along with the GNU C Library; if not, see
<http://www.gnu.org/licenses/>. */
-/* This test dynamically loads a DSO and spawns a thread that subsequently
- calls into the DSO to register a destructor for an object in the DSO and
- then calls dlclose on the handle for the DSO. When the thread exits, the
- DSO should not be unloaded or else the destructor called during thread exit
- will crash. Further in the main thread, the DSO is opened and closed again,
- at which point the DSO should be unloaded. */
+/* For the default case, i.e. NO_DELETE not defined, the test dynamically loads
+ a DSO and spawns a thread that subsequently calls into the DSO to register a
+ destructor for an object in the DSO and then calls dlclose on the handle for
+ the DSO. When the thread exits, the DSO should not be unloaded or else the
+ destructor called during thread exit will crash. Further in the main
+ thread, the DSO is opened and closed again, at which point the DSO should be
+ unloaded.
+
+ When NO_DELETE is defined, the DSO is loaded twice, once with just RTLD_LAZY
+ flag and the second time with the RTLD_NODELETE flag set. The thread is
+ spawned, destructor registered and then thread exits without closing the
+ DSO. In the main thread, the first handle is then closed, followed by the
+ second handle. In the end, the DSO should remain loaded due to the
+ RTLD_NODELETE flag being set in the second dlopen call. */
#include <dlfcn.h>
#include <pthread.h>
@@ -31,6 +39,14 @@
#include <errno.h>
#include <link.h>
+#ifndef NO_DELETE
+# define LOADED_IS_GOOD false
+#endif
+
+#ifndef H2_RTLD_FLAGS
+# define H2_RTLD_FLAGS (RTLD_LAZY)
+#endif
+
#define DSO_NAME "$ORIGIN/tst-tls-atexit-lib.so"
/* Walk through the map in the _r_debug structure to see if our lib is still
@@ -43,7 +59,10 @@ is_loaded (void)
for (; lm; lm = lm->l_next)
if (lm->l_type == lt_loaded && lm->l_name
&& strcmp (basename (DSO_NAME), basename (lm->l_name)) == 0)
- return true;
+ {
+ printf ("%s is still loaded\n", lm->l_name);
+ return true;
+ }
return false;
}
@@ -63,7 +82,9 @@ reg_dtor_and_close (void *h)
reg_dtor ();
+#ifndef NO_DELETE
dlclose (h);
+#endif
return NULL;
}
@@ -104,19 +125,30 @@ do_test (void)
return 1;
}
+#ifndef NO_DELETE
if (spawn_thread (h1) != 0)
return 1;
+#endif
- /* Now this should unload the DSO. FIXME: This is a bug, calling dlclose
- like this is actually wrong, but it works because cxa_thread_atexit_impl
- has a bug which results in dlclose allowing this to work. */
- dlclose (h1);
+ void *h2 = dlopen (DSO_NAME, H2_RTLD_FLAGS);
+ if (h2 == NULL)
+ {
+ printf ("h2: Unable to load DSO: %s\n", dlerror ());
+ return 1;
+ }
- /* Check link maps to ensure that the DSO has unloaded. */
- if (is_loaded ())
+#ifdef NO_DELETE
+ if (spawn_thread (h1) != 0)
return 1;
- return 0;
+ dlclose (h1);
+#endif
+ dlclose (h2);
+
+ /* Check link maps to ensure that the DSO has unloaded. In the normal case,
+ the DSO should be unloaded if there are no uses. However, if one of the
+ dlopen calls were with RTLD_NODELETE, the DSO should remain loaded. */
+ return is_loaded () == LOADED_IS_GOOD ? 0 : 1;
}
#define TEST_FUNCTION do_test ()
OK.

Cheers,
Carlos.
Torvald Riegel
2015-07-24 08:52:44 UTC
Post by Carlos O'Donell
Post by Siddhesh Poyarekar
v6: Juggled around the test case code a bit so that it uses tst-tls-atexit.c
with a couple of macros.
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
I have also added a detailed note on concurrency which also aims to
justify why I chose the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
This version looks awesome. Thanks for the awesome work!
OK for 2.22 with suggestions (see below).
I had lots of fun coming up with orderings between the release and acquire
pairs and the mutex locks which I treated as release+acquire barriers
for the purpose of the analysis.
Don't simply treat lock/unlock as fences (if that's what you meant by
"barriers"). Technically, they behave differently. Whereas a fence can
work in combination with other atomics accesses on other memory
locations, unlock calls only synchronize with later lock calls on the
same mutex (there is a total order of all lock/unlock calls per mutex).
Thus, you can model lock/unlock as acquire/release operations on the
mutex (ie, how one would implement them), but they are not fences.
Post by Carlos O'Donell
__cxa_thread_atexit_impl
atomic_fetch_add_relaxed l_tls_dtor_count +1
- dl_load_lock held.
- Excludes any unloads so forbids __dl_close_worker
- Does not forbid a thread being killed.
OK.
Post by Carlos O'Donell
- Deferred cancel happens after lock release
so not concurrent with anything else.
And thread eventually calls __call_tls_dtors
with release barrier.
- dl_load_lock is release+acquire barrier, which allows relaxed
ordering atomic_fetch_add to l_tls_dtor_count, because after
unlock you will either see consistent MAP or you won't see
MAP at all since load/stores can't leak past the unlock.
This isn't correct. lock/unlock will synchronize with other lock/unlock
and thus may affect ordering for l_tls_dtor_count accesses, but they
aren't fences.

My understanding of how this works based on the discussions with
Siddhesh is that due to the critical sections using dl_load_lock,
__cxa_thread_atexit_impl and __dl_close_worker are mutually exclusive
and totally ordered. Calling __cxa_thread_atexit_impl after
__dl_close_worker isn't legal, so __dl_close_worker will observe any
increment because of synchronization via dl_load_lock (not because of
the unlock/lock being fences).

__call_tls_dtors uses an atomic read-modify-write, so via the total
modification order for l_tls_dtor_count, we get a total order for all
increments/decrements. __dl_close_worker now also needs to "observe"
the map deref before the decrements, and there's no lock being used
there, so we need release MO on the decrements and acquire MO on the
__dl_close_worker load of l_tls_dtor_count.
Post by Carlos O'Donell
__call_tls_dtors
atomic_fetch_add_release curr->map->l_tls_dtor_count -1
- write release
- keeps map deref before the decrement which means you never
access map with l_tls_dtor_count set to 0.
__dl_close_worker
Took dl_load_lock
atomic_load_acquire l_tls_dtor_count
- load acquire
- either sees l_tls_dtor_count as 0 or positive
- if zero, then we know map in __call_tls_dtors was
already derefed so we know it's safe to destroy map
- if positive then we know map has not been dereffed yet
Minor nit: the map *may* not have been dereffed yet / may still be in
use. Doesn't make a difference for your reasoning here, but I'm
mentioning it anyway because IMO it's important to distinguish "may not
have happened" from "hasn't happened". If you say "hasn't happened",
you're mentally treating the map deref and the decrement as one atomic
action (ie, if you observe one you will also observe the other).
In my experience, it's useful to be aware of what one can say and can't
say and to be really precise about that, because it keeps you alert and
makes it less likely that one makes conclusions that aren't correct.

It's simple and often tempting to squash several different states in the
interleaving into bigger groups, but can often be misleading.
Therefore, practicing to look for and think about the exact bounds (eg,
lower bounds regarding what one knows) is useful.
Carlos O'Donell
2015-07-24 19:07:36 UTC
Post by Torvald Riegel
Post by Carlos O'Donell
Post by Siddhesh Poyarekar
v6: Juggled around the test case code a bit so that it uses tst-tls-atexit.c
with a couple of macros.
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
I have also added a detailed note on concurrency which also aims to
justify why I chose the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
This version looks awesome. Thanks for the awesome work!
OK for 2.22 with suggestions (see below).
I had lots of fun coming up with orderings between the release and acquire
pairs and the mutex locks which I treated as release+acquire barriers
for the purpose of the analysis.
Don't simply treat lock/unlock as fences (if that's what you meant by
"barriers"). Technically, they behave differently. Whereas a fence can
work in combination with other atomics accesses on other memory
locations, unlock calls only synchronize with later lock calls on the
same mutex (there is a total order of all lock/unlock calls per mutex).
Thus, you can model lock/unlock as acquire/release operations on the
mutex (ie, how one would implement them), but they are not fences.
But aren't acquire/release operations simply made up of groups of fences
themselves? Like LoadLoad+LoadStore == Acquire? Such things are still
abstract and not machine specific and still express the same semantics?
Post by Torvald Riegel
Post by Carlos O'Donell
__cxa_thread_atexit_impl
atomic_fetch_add_relaxed l_tls_dtor_count +1
- dl_load_lock held.
- Excludes any unloads so forbids __dl_close_worker
- Does not forbid a thread being killed.
OK.
Post by Carlos O'Donell
- Deferred cancel happens after lock release
so not concurrent with anything else.
And thread eventually calls __call_tls_dtors
with release barrier.
- dl_load_lock is release+acquire barrier, which allows relaxed
ordering atomic_fetch_add to l_tls_dtor_count, because after
unlock you will either see consistent MAP or you won't see
MAP at all since load/stores can't leak past the unlock.
This isn't correct. lock/unlock will synchronize with other lock/unlock
and thus may affect ordering for l_tls_dtor_count accesses, but they
aren't fences.
In practice though aren't the lock/unlock operations fences?

How else does a locking thread get to see the effects of what has
happened before the last unlock?
Post by Torvald Riegel
My understanding of how this works based on the discussions with
Siddhesh is that due to the critical sections using dl_load_lock,
__cxa_thread_atexit_impl and __dl_close_worker are mutually exclusive
and totally ordered. Calling __cxa_thread_atexit_impl after
__dl_close_worker isn't legal, so __dl_close_worker will observe any
increment because of synchronization via dl_load_lock (not because of
the unlock/lock being fences).
Your description in the paragraph above is correct, but you could break
down the lock into its behaviour at the fence level and reason about
it there also, which is what I did. If that's not a good thing to do
then I'm happy to learn.

Are you suggesting simply that I consider locks at a different level
of abstraction from both acquire/release atomic operations, and from
fences?
Post by Torvald Riegel
__call_tls_dtors uses an atomic read-modify-write, so via the total
modification order for l_tls_dtor_count, we get a total order for all
increments/decrements. __dl_close_worker now also needs to "observe"
the map deref before the decrements, and there's no lock being used
there, so we need release MO on the decrements and acquire MO on the
__dl_close_worker load of l_tls_dtor_count
Agreed.
Post by Torvald Riegel
Post by Carlos O'Donell
__call_tls_dtors
atomic_fetch_add_release curr->map->l_tls_dtor_count -1
- write release
- keeps map deref before the decrement which means you never
access map with l_tls_dtor_count set to 0.
__dl_close_worker
Took dl_load_lock
atomic_load_acquire l_tls_dtor_count
- load acquire
- either sees l_tls_dtor_count as 0 or positive
- if zero, then we know map in __call_tls_dtors was
already derefed so we know it's safe to destroy map
- if positive then we know map has not been dereffed yet
Minor nit: the map *may* not have been dereffed yet / may still be in
use. Doesn't make a difference for your reasoning here, but I'm
mentioning it anyway because IMO it's important to distinguish "may not
have happened" from "hasn't happened". If you say "hasn't happened",
you're mentally treating the map deref and the decrement as one atomic
action (ie, if you observe one you will also observe the other).
In my experience, it's useful to be aware of what one can say and can't
say and to be really precise about that, because it keeps you alert and
makes it less likely that one makes conclusions that aren't correct.
You are correct, I should have said "map *may not have been dereffed...".
If l_tls_dtor_count is positive we know nothing about what is going on
in other threads, other than we know __cxa_thread_atexit_impl ran some
number of times in the past.
Post by Torvald Riegel
It's simple and often tempting to squash several different states in the
interleaving into bigger groups, but can often be misleading.
Therefore, practicing to look for and think about the exact bounds (eg,
lower bounds regarding what one knows) is useful.
Thanks for the review. I posted my rough notes specifically to facilitate
review by anyone else wanting to know what I was thinking during the review.
I appreciate you taking the bait! :-)

Cheers,
Carlos.
Torvald Riegel
2015-08-03 15:36:29 UTC
Permalink
Post by Carlos O'Donell
Post by Torvald Riegel
Post by Carlos O'Donell
Post by Siddhesh Poyarekar
v6: Juggled around the test case code a bit so that it uses tst-tls-atexit.c
with a couple of macros.
When a TLS destructor is registered, we set the DF_1_NODELETE flag to
signal that the object should not be destroyed. We then clear the
DF_1_NODELETE flag when all destructors are called, which is wrong -
the flag could have been set by other means too.
This patch replaces this use of the flag by using l_tls_dtor_count
directly to determine whether it is safe to unload the object. This
change has the added advantage of eliminating the lock taking when
calling the destructors, which could result in a deadlock. The patch
also fixes the test case tst-tls-atexit - it was making an invalid
dlclose call, which would just return an error silently.
I have also added a detailed note on concurrency which also aims to
justify why I chose the semantics I chose for accesses to
l_tls_dtor_count. Thanks to Torvald for his help in getting me
started on this and (literally) teaching me how to approach the
problem.
Change verified on x86_64; the test suite does not show any
regressions due to the patch.
This version looks awesome. Thanks for the awesome work!
OK for 2.22 with suggestions (see below).
I had lots of fun coming up with orderings between the release and acquire
pairs and the mutex locks which I treated as release+acquire barriers
for the purpose of the analysis.
Don't simply treat lock/unlock as fences (if that's what you meant by
"barriers"). Technically, they behave differently. Whereas a fence can
work in combination with other atomics accesses on other memory
locations, unlock calls only synchronize with later lock calls on the
same mutex (there is a total order of all lock/unlock calls per mutex).
Thus, you can model lock/unlock as acquire/release operations on the
mutex (ie, how one would implement them), but they are not fences.
But aren't acquire/release operations simply made up of groups of fences
themselves? Like LoadLoad+LoadStore == Acquire? Such things are still
abstract and not machine specific and still express the same semantics?
The specific implementation (ie, mostly HW in this case, but the compiler
is still involved) will likely use fence(-like) instructions on most(?)
architectures that need them. But there's nothing that prohibits a
system from distinguishing between fences that work with all future/past
memory accesses and acquire/release operations that just work in
conjunction with the specific memory access they are attached to.

Thus, to stay on the portable side, don't assume that fences and
acquire/release/... operations are necessarily the same.
Post by Carlos O'Donell
Post by Torvald Riegel
Post by Carlos O'Donell
__cxa_thread_atexit_impl
atomic_fetch_add_relaxed l_tls_dtor_count +1
- dl_load_lock held.
- Excludes any unloads so forbids __dl_close_worker
- Does not forbid a thread being killed.
OK.
Post by Carlos O'Donell
- Deferred cancel happens after lock release
so not concurrent with anything else.
And thread eventually calls __call_tls_dtors
with release barrier.
- dl_load_lock is release+acquire barrier, which allows relaxed
ordering atomic_fetch_add to l_tls_dtor_count, because after
unlock you will either see consistent MAP or you won't see
MAP at all since load/stores can't leak past the unlock.
This isn't correct. lock/unlock will synchronize with other lock/unlock
and thus may affect ordering for l_tls_dtor_count accesses, but they
aren't fences.
In practice though aren't the lock/unlock operations fences?
They will most likely be atomic memory accesses to some part of the lock
data structure that use acquire/release memory orders. But not quite
fences as they are defined by the C11 memory model.
Post by Carlos O'Donell
How else does a locking thread get to see the effects of what has
happened before the last unlock?
Maybe we have a terminology misunderstanding here. In the C11 model,
fences are what it defines fences to be (see above). "Memory orders"
are what you use to require specific ordering constraints to take
effect. On the level of the arch-specific code, both may be implemented
using HW fence instructions.
Post by Carlos O'Donell
Post by Torvald Riegel
My understanding of how this works based on the discussions with
Siddhesh is that due to the critical sections using dl_load_lock,
__cxa_thread_atexit_impl and __dl_close_worker are mutually exclusive
and totally ordered. Calling __cxa_thread_atexit_impl after
__dl_close_worker isn't legal, so __dl_close_worker will observe any
increment because of synchronization via dl_load_lock (not because of
the unlock/lock being fences).
Your description in the paragraph above is correct, but you could break
down the lock into its behaviour at the fence level and reason about
it there also, which is what I did. If that's not a good thing to do
then I'm happy to learn.
I don't think reasoning about the implementation-level equivalent of
locks is really helpful. I'd say it's easier to just be aware that
critical sections using the same lock instance are mutually exclusive
and totally ordered wrt themselves, and that this order is reflected in
happens-before.

However, if you think it's easier to break this down to the level of
acquire/release pairs that synchronize with each other, then that's fine
too. I wouldn't do it, personally, but it if works for you that's fine.
Just don't treat locks as equal to what C11 fences are :)
Post by Carlos O'Donell
Are you suggesting simply that I consider locks at a different level
of abstraction from both acquire/release atomic operations, and from
fences?
That would be my suggestion, yes (see above). But, YMMV.
