glibc: loading of shared objects with holes wastes address space

Discussion:

Mathias Krause

2011-10-14 13:00:05 UTC

Hi Roland,

I stumbled over some ancient changes you made back in 1996 that turned into
a "problem" in 2006 but went undetected until now. But let me tell the whole
story.

The required alignment of each of the two PT_LOAD entries of libc.so has been
2 MiB on x86-64 for quite some time (since binutils version 2.18). Since the
first segment, the code segment, is only around 1.3 MiB, this creates a hole in
the address space between the two segments when the library gets loaded.
_dl_map_object_from_fd() tries to handle this case, but it does so in an
unfortunate way that can waste a large amount of address space.
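
As a rough illustration of the arithmetic, here is a minimal sketch that computes the page-granular hole between two segments. The function names and the example addresses are made up for this illustration; they are not read from a real libc.so and this is not glibc code.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two). */
uint64_t round_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
uint64_t round_down(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}

/* Number of bytes of whole pages lying strictly between the end of the
   first segment and the start of the second one (hypothetical helper). */
uint64_t hole_size(uint64_t first_vaddr, uint64_t first_memsz,
                   uint64_t second_vaddr, uint64_t page_size)
{
    uint64_t hole_start = round_up(first_vaddr + first_memsz, page_size);
    uint64_t hole_end = round_down(second_vaddr, page_size);
    return hole_end > hole_start ? hole_end - hole_start : 0;
}
```

With an assumed code segment of 0x150000 bytes (about 1.3 MiB) at address 0 and an assumed data segment at 0x350000, hole_size(0, 0x150000, 0x350000, 4096) yields 0x200000, i.e. a 2 MiB hole.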

There are two problems with the current behaviour:

1/ _dl_map_object_from_fd() calls mmap() for the first segment with a size
argument fitted for the whole address range the shared object will live in.
This might (and in the case of libc.so it actually does) create a mapping
that is larger than the file it is backed with because the virtual address
range to cover all segments might be much larger than the actual file size.

2/ To cope with the possible hole created in 1/, _dl_map_object_from_fd()
tries to at least make the underlying address space behave as intended by
setting the protection of the memory hole to PROT_NONE. This leaves the
mapping intact, so the hole still occupies virtual address space.
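
The difference between the two ways of handling the hole can be sketched as follows. This is only an illustration under stated assumptions: it uses a small anonymous mapping as a stand-in for the file mapping the loader creates, the function names are hypothetical, and it relies on the documented Linux behaviour that mprotect() fails with ENOMEM on pages that are not mapped.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Probe whether [addr, addr + len) is still backed by a mapping.
   Returns 1 if mapped, 0 if not mapped, -1 on an unexpected error.
   Note: the probe itself sets the range to PROT_NONE. */
int range_is_mapped(void *addr, size_t len)
{
    if (mprotect(addr, len, PROT_NONE) == 0)
        return 1;
    return errno == ENOMEM ? 0 : -1;
}

/* Reserve three pages in one mmap() call, treat the middle page as the
   "hole", and handle it either with munmap() or with
   mprotect(PROT_NONE). Returns the result of range_is_mapped(). */
int hole_still_mapped(int use_munmap)
{
    long page_l = sysconf(_SC_PAGESIZE);
    assert(page_l > 0);
    size_t page = (size_t)page_l;
    size_t total = 3 * page;

    char *base = mmap(NULL, total, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    char *hole = base + page;
    int rc = use_munmap ? munmap(hole, page)
                        : mprotect(hole, page, PROT_NONE);
    int mapped = (rc == 0) ? range_is_mapped(hole, page) : -1;

    munmap(base, total);
    return mapped;
}
```

On Linux I would expect hole_still_mapped(0) to report the hole as still mapped (the PROT_NONE case) and hole_still_mapped(1) to report it as gone (the munmap case), in a single-threaded process where nothing else can reuse the freed range in between.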

This behaviour allows mapping the data part of libc.so as executable code
with just a single call to mprotect -- even on systems that enforce W^X on
address space mappings this will succeed. It also wastes address space, which
might not be much of a concern for legitimate ELF objects but will be for
hand-crafted ELF object files with small segments but huge gaps between
those segments.

Taking into account that the first mmap() call uses the size of the whole
range needed by the shared object to "reserve" address space for all
following segments, we might leave it this way to not create any races with
other threads doing calls to mmap() while _dl_map_object_from_fd() sets up
the mappings for the other segments. But the mprotect() call is wrong and
should be replaced by a call to munmap(). Actually it was munmap() until
you changed it to mprotect() back in 1996. So the question is: Why mprotect,
why not munmap, which seems to fit perfectly and would resolve the above
issues?

Regards,
Mathias

Roland McGrath

2011-10-14 16:42:37 UTC

Mapping past the end of a file is not a problem.
It has perfectly well-defined semantics.

The behavior of occupying holes with PROT_NONE regions is what's
intended. It was done this way because the previous behavior led to
undesirable results. When holes of a page or more were left, then
unrelated later mappings would go there. This created situations where
memory-access bugs could have extremely strange results. For example,
the gap would often be filled by an allocation done for malloc. Then a
buggy program that wrote off the end of the allocation would clobber
some library's data segment, which is much harder to figure out in
debugging than if it just clobbered some other malloc region.

So we're not going to go back to how it was.

A change that I think would be reasonable is to extend the PROT_NONE
blackout regions only as far as the segment size rounded to p_align.
With normal objects, there won't be any gap beyond that. With your
objects that use an unusual layout, the PROT_NONE region will still
cover the space that the phdrs say the object expects to be covered, but
not the whole region up until the next segment. A cleanly-written patch
to implement that behavior would be fine with me.
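
One possible reading of that rule, as a minimal sketch: the blackout region after a segment ends at the segment's end address rounded up to p_align, but never beyond the start of the next segment. The function names are hypothetical, the sketch assumes the segment start is itself aligned to p_align (so rounding the end address matches rounding the size), and it is not a patch against glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two). */
uint64_t round_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* End address of the PROT_NONE blackout region that follows a segment:
   the segment end rounded up to p_align, capped at the next segment. */
uint64_t blackout_end(uint64_t seg_vaddr, uint64_t seg_memsz,
                      uint64_t p_align, uint64_t next_vaddr)
{
    uint64_t limit = round_up(seg_vaddr + seg_memsz, p_align);
    return limit < next_vaddr ? limit : next_vaddr;
}
```

For a made-up object with a one-page segment, p_align of 0x1000, and the next segment at 0x40000000, blackout_end(0, 0x1000, 0x1000, 0x40000000) returns 0x1000, so the huge gap is not covered at all.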

Thanks,
Roland

Ulrich Drepper

2011-10-15 14:01:42 UTC

Post by Roland McGrath
A change that I think would be reasonable is to extend the PROT_NONE
blackout regions only as far as the segment size rounded to p_align.

The gap is deliberately PROT_NONE so that the program occupies a
consecutive address range. No change needed or welcome.

Roland McGrath

2011-10-15 16:57:46 UTC

Post by Ulrich Drepper
The gap is deliberately PROT_NONE so that the program occupies a
consecutive address range.

As I explained, that is well-understood. The change I described would not
affect that situation for any object linked with the normal layouts,
because the only possible gap is one less than described by the p_align of
the segment before the gap. So what's the harm in handling nonstandard
layouts differently?

Thanks,
Roland

Mathias Krause

2011-10-18 10:26:17 UTC

Post by Roland McGrath
Mapping past the end of a file is not a problem.
It has perfectly well-defined semantics.

It does, but creating a mapping that covers the whole file and even
more beyond EOF with the flags of the first segment is not nice. It
makes parts of the shared object executable that should not be, if
only for a short time. Still, it is not something one would expect
to happen.

Post by Roland McGrath
The behavior of occupying holes with PROT_NONE regions is what's
intended. It was done this way because the previous behavior led to
undesirable results. When holes of a page or more were left, then
unrelated later mappings would go there. This created situations where
memory-access bugs could have extremely strange results. For example,
the gap would often be filled by an allocation done for malloc. Then a
buggy program that wrote off the end of the allocation would clobber
some library's data segment, which is much harder to figure out in
debugging than if it just clobbered some other malloc region.

I see. So the real problem is the holes themselves. If there were no
gap between two adjacent segments, there would be no need to fill
it. And honestly, those holes are currently not needed at all. Neither
glibc nor the kernel seems to honor the alignment requirements when
searching for a suitable address. So the question is: Why are the
sections in the linker script aligned to MAXPAGESIZE instead of
PAGESIZE, i.e. aligned to 2 MiB instead of 4 KiB? But it looks like
this is more of a question for the binutils folks (CCed).

Post by Roland McGrath
So we're not going to go back to how it was.
A change that I think would be reasonable is to extend the PROT_NONE
blackout regions only as far as the segment size rounded to p_align.
With normal objects, there won't be any gap beyond that. With your
objects that use an unusual layout, the PROT_NONE region will still
cover the space that the phdrs say the object expects to be covered, but
not the whole region up until the next segment. A cleanly-written patch
to implement that behavior would be fine with me.

That wouldn't prevent anonymous mappings from being placed right below a
writable mapping of the shared object. Then your sketched bug scenario
with out-of-bounds access could happen again.

Regards,
Mathias

Ian Lance Taylor

2011-10-18 13:44:54 UTC

Post by Mathias Krause
I see. So the real problem is the holes themselves. If there were no
gap between two adjacent segments, there would be no need to fill
it. And honestly, those holes are currently not needed at all. Neither
glibc nor the kernel seems to honor the alignment requirements when
searching for a suitable address. So the question is: Why are the
sections in the linker script aligned to MAXPAGESIZE instead of
PAGESIZE, i.e. aligned to 2 MiB instead of 4 KiB? But it looks like
this is more of a question for the binutils folks (CCed).

Because then the executable will still run on some hypothetical future
kernel that uses larger page sizes.

Ian

Mathias Krause

2011-10-18 14:15:04 UTC

Post by Ian Lance Taylor

Because then the executable will still run on some hypothetical future
kernel that uses larger page sizes.

Are there _any_ plans from Intel/AMD (or rumors, even) to drop support
for 4 KiB pages in a future x86-64 based architecture? At least, I'm
not aware of such a thing. So this cannot really be an argument for
choosing 2 MiB for ELF segment alignment.

Mathias

Ian Lance Taylor

2011-10-18 15:55:50 UTC

Post by Mathias Krause

Post by Ian Lance Taylor

Because then the executable will still run on some hypothetical future
kernel that uses larger page sizes.

It's not the processor that matters here, it's the kernel. The question
is whether the kernel will want to some day require executables to use a
larger page size.

Ian

Roland McGrath

2011-10-18 16:59:33 UTC

Post by Ian Lance Taylor
It's not the processor that matters here, it's the kernel. The question
is whether the kernel will want to some day require executables to use a
larger page size.

It's not necessarily even a question of "require". The kernel might decide
that the mapping for a particular executable or DSO (at a particular time,
even) is worthwhile to align to a 2M boundary so it can use the hardware
feature of 2MB-aligned page table entries for that mapping. If it does so,
and the second segment is aligned 2M away from the first, then it can use
2M page table entries for both segments.

Thanks,
Roland

Roland McGrath

2011-10-18 17:04:05 UTC

Post by Mathias Krause
It does, but creating a mapping that covers the whole file and even
more beyond EOF with the flags of the first segment is not nice. It
makes parts of the shared object executable that should not be, if
only for a short time. Still, it is not something one would expect
to happen.

I suppose that is a valid point. Nevertheless, for ET_DYN objects, a
single initial mapping that is of the whole size is necessary to reserve
the address space required. That mapping needs to be from the file because
some kernels like to choose memory regions to use differently for file
mappings than for anonymous ones.

In the case where there is a hole that needs to be PROT_NONE, it could
achieve the same end result with the same number of system calls in a
different way. That is, do the initial mapping with PROT_NONE and then use
mprotect to set the first segment to its final protections (i.e. usually
PROT_READ|PROT_EXEC). That would not have the window of extra
executability that you are concerned about.
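
A minimal sketch of that ordering, assuming an anonymous mapping as a stand-in for the file mapping the loader would actually use. The function names are made up for the example, and this is not the glibc code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve total_len bytes with PROT_NONE, then give only the first
   first_len bytes their final protections. Returns the base address,
   or NULL on failure. No part of the range is ever more accessible
   than its final state. */
void *reserve_then_enable(size_t total_len, size_t first_len, int first_prot)
{
    assert(first_len <= total_len);

    void *base = mmap(NULL, total_len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    if (mprotect(base, first_len, first_prot) != 0) {
        munmap(base, total_len);
        return NULL;
    }
    return base;
}

/* Small self-check: reserve four pages, enable reading on the first
   one only, and read from it. Returns 1 on success, 0 on failure. */
int reserve_demo(void)
{
    long page_l = sysconf(_SC_PAGESIZE);
    if (page_l <= 0)
        return 0;
    size_t page = (size_t)page_l;

    char *base = reserve_then_enable(4 * page, page, PROT_READ);
    if (base == NULL)
        return 0;

    int ok = (base[0] == 0);   /* fresh anonymous pages read as zero */
    munmap(base, 4 * page);
    return ok;
}
```

In the loader the first segment would typically get PROT_READ|PROT_EXEC; the self-check uses PROT_READ only to keep the example independent of any execute restrictions on the test system.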

Thanks,
Roland

Mathias Krause

2011-10-19 14:17:31 UTC

Post by Ian Lance Taylor
It's not the processor that matters here, it's the kernel. The question
is whether the kernel will want to some day require executables to use a
larger page size.

Sure, it can do so. But it can only do so if the size of the mapping
itself is a multiple of 2MB. Otherwise it would map more bytes than
requested, which would clearly be a violation of the semantics of mmap(2).

Post by Roland McGrath
If it does so,
and the second segment is aligned 2M away from the first, then it can use
2M page table entries for both segments.

The second mapping would naturally be aligned to 2MB if the preceding
segment is a multiple of 2MB, even when p_align is only 4kB.
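
To make the size argument concrete, here is a simplified sketch that counts how many whole, naturally aligned 2 MiB units fit inside a mapping. It only models the address arithmetic; whether a kernel would actually use large pages for a file mapping depends on much more than this, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two). */
uint64_t round_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
uint64_t round_down(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}

/* Number of whole, naturally aligned units of size `huge` that lie
   entirely inside [start, start + len). */
uint64_t huge_pages_in_range(uint64_t start, uint64_t len, uint64_t huge)
{
    uint64_t first = round_up(start, huge);
    uint64_t last = round_down(start + len, huge);
    return last > first ? (last - first) / huge : 0;
}
```

With these assumptions, a 0x150000-byte (about 1.3 MiB) segment starting at a 2 MiB boundary contains no complete 2 MiB unit, while a 4 MiB segment at the same start contains two and ends on a 2 MiB boundary, so whatever follows it is aligned regardless of p_align.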

Regards,
Mathias

Mathias Krause

2011-10-19 14:20:57 UTC

Post by Mathias Krause
Are there _any_ plans from Intel/AMD (or rumors, even) to drop support
for 4 KiB pages in a future x86-64 based architecture? At least, I'm
not aware of such a thing. So this cannot really be an argument for
choosing 2 MiB for ELF segment alignment.

Post by Ian Lance Taylor
It's not the processor that matters here, it's the kernel. The question
is whether the kernel will want to some day require executables to use a
larger page size.

As long as the kernel supports 32-bit Intel CPUs, this will not
happen. So, why bother?

Regards,
Mathias

Mathias Krause

2011-10-19 14:27:42 UTC

I understood this requirement. Making the initial mapping cover the
whole address range is not a problem per se, but mapping it as
PROT_EXEC is -- a minor one, though.

Post by Roland McGrath
In the case where there is a hole that needs to be PROT_NONE, it could
achieve the same end result with the same number of system calls in a
different way. That is, do the initial mapping with PROT_NONE and then use
mprotect to set the first segment to its final protections (i.e. usually
PROT_READ|PROT_EXEC). That would not have the window of extra
executability that you are concerned about.

That would be a solution for this problem, indeed. Mind implementing
it yourself?

Regards,
Mathias