RFC: The CPU run-time library for C
H.J. Lu
2018-12-03 17:46:03 UTC
Here is the updated proposal for the CPU run-time library for C.

Any comments?

H.J.
---
Memory and string functions in the current glibc are highly optimized for
the current processors on the market. But it takes years for a glibc
release to reach end users' machines:

1. In 2018, people are still using glibc 2.17, which was released in
December 2012, on Intel Skylake servers, even though the currently released
glibc 2.28 has new memory and string functions optimized for Skylake
servers.
2. The same thing will happen five years from now.

One way to address this is to make glibc modular by putting memory and string
functions into a separate library, which can be updated separately,
independent of other parts of glibc. However, memory and string functions
are integral parts of glibc. Making them a separate module may not be
easy to achieve.

I am proposing a less aggressive approach by adding a --enable-cpu-rt
configure option to enable the CPU run-time library for C, libcpu-rt-c:

1. It contains a subset of glibc. There are no new implementations of
any libc functions. All functions in libcpu-rt-c come from glibc itself
and are tested with the same test framework.
2. Start with memory and string functions. Any other additions must be
carefully screened.
3. It should support glibc tunables.
4. It should be binary compatible with all existing glibc binaries so
that "LD_PRELOAD=libcpu-rt-c.so" can be used to override functions in
libc.so.

End users can obtain libcpu-rt-c in one of the following ways:

1. Install libcpu-rt-c binary from their OS vendors if available.
2. Build libcpu-rt-c from source.
3. Download libcpu-rt-c binary from a central location.
Szabolcs Nagy
2018-12-03 18:20:43 UTC
Post by H.J. Lu
4. It should be binary compatible with all existing glibc binaries so
that "LD_PRELOAD=libcpu-rt-c.so" can be used to override functions in
libc.so.
is that possible in the presence of multiple
symbol versions for the same symbol?

(e.g. if i want to do this for the math library,
now that we have separate svid vs non-svid compat
symbols it's not clear to me if you could get the
right symbol version with preloading)
H.J. Lu
2018-12-03 18:32:47 UTC
Post by Szabolcs Nagy
Post by H.J. Lu
4. It should be binary compatible with all existing glibc binaries so
that "LD_PRELOAD=libcpu-rt-c.so" can be used to override functions in
libc.so.
is that possible in the presence of multiple
symbol versions for the same symbol?
(e.g. if i want to do this for the math library,
now that we have separate svid vs non-svid compat
symbols it's not clear to me if you could get the
right symbol version with preloading)
In this case, the CPU run-time library needs to provide
a compat symbol so that reference to the compat symbol
works correctly with LD_PRELOAD. I believe it is doable.
--
H.J.
H.J. Lu
2018-12-04 16:50:07 UTC
Post by H.J. Lu
Post by Szabolcs Nagy
Post by H.J. Lu
4. It should be binary compatible with all existing glibc binaries so
that "LD_PRELOAD=libcpu-rt-c.so" can be used to override functions in
libc.so.
is that possible in the presence of multiple
symbol versions for the same symbol?
(e.g. if i want to do this for the math library,
now that we have separate svid vs non-svid compat
symbols it's not clear to me if you could get the
right symbol version with preloading)
In this case, the CPU run-time library needs to provide
a compat symbol so that reference to the compat symbol
works correctly with LD_PRELOAD. I believe it is doable.
Here is a test:

[***@gnu-cfl-1 preload-2]$ make
gcc -fno-builtin -g -c -o foo.o foo.c
gcc -fno-builtin -g -fPIC -c -o bar-old.o bar-old.c
gcc -shared -o libbar-old.so bar-old.o -Wl,-soname,libbar.so,--version-script=libbar-old.map
gcc -o foo-old foo.o libbar-old.so -Wl,-R,.
gcc -fno-builtin -g -fPIC -c -o bar-new.o bar-new.c
gcc -shared -o libbar-new.so bar-new.o -Wl,-soname,libbar.so,--version-script=libbar.map
gcc -o foo-new foo.o libbar-new.so -Wl,-R,.
gcc -fno-builtin -g -fPIC -c -o preload.o preload.c
gcc -shared -o preload.so preload.o -Wl,--version-script=libbar.map
ln -sf libbar-old.so libbar.so
./foo-old
libbar.so: bar
LD_PRELOAD=./preload.so ./foo-old
preload.so: old_bar
ln -sf libbar-new.so libbar.so
./foo-new
libbar.so: new_bar
LD_PRELOAD=./preload.so ./foo-new
preload.so: new_bar
./foo-old
libbar.so: old_bar
LD_PRELOAD=./preload.so ./foo-old
preload.so: old_bar
[***@gnu-cfl-1 preload-2]$

--
H.J.
Siddhesh Poyarekar
2018-12-04 18:12:24 UTC
Post by H.J. Lu
1. Install libcpu-rt-c binary from their OS vendors if available.
I'm curious to know what OS vendors think of this. AFAICT, it's not too
different from shipping an alternate glibc and in some ways, the latter
might just be easier than munging scripts to build a separate library.

Also, if the same ABI guarantees are expected of this new library, then
again would OS vendors prefer to ship a whole new library or would they
be better off just backporting these new routines?

Basically, this doesn't make sense if OS vendors aren't going to ship
it. Building in this complexity just to make a downloadable binary in
some arbitrary place sounds like an ugly hack that will come to bite us
later.

Siddhesh
Carlos O'Donell
2018-12-04 20:34:03 UTC
Post by Siddhesh Poyarekar
Post by H.J. Lu
1. Install libcpu-rt-c binary from their OS vendors if available.
I'm curious to know what OS vendors think of this. AFAICT, it's not
too different from shipping an alternate glibc and in some ways, the
latter might just be easier than munging scripts to build a separate
library.
Also, if the same ABI guarantees are expected of this new library,
then again would OS vendors prefer to ship a whole new library or
would they be better off just backporting these new routines?
Basically, this doesn't make sense if OS vendors aren't going to ship
it. Building in this complexity just to make a downloadable binary
in some arbitrary place sounds like an ugly hack that will come to
bite us later.
H.J. posted an early RFC in June:
https://www.sourceware.org/ml/libc-alpha/2018-06/msg00259.html

My summary of consensus in June was:

- Suggest implementing in a distinct project: Adhemerval, Florian, Carlos.

- Request simpler design: Florian, Siddhesh.

(1) Why not an external preloadable library?

This RFC appears unchanged from the original proposal and the outstanding
comments do not appear to have been discussed in any further detail,
particularly the cost/benefit ratio to the project of accepting such patches
versus a simpler mechanism. Likewise, it is not explained why "most" user
needs cannot be met by something like ARM's cortex-strings, which doesn't
need deep integration with glibc-specific features.

(2) Current libcpu-rt-c proposal does not meet OS vendor needs.

The present libcpu-rt-c proposal as-is is not usable by OS vendors;
replacing the core string routines is equivalent to a library rebase
and requires revalidation efforts by the distribution and by QE. This
makes it *almost* as difficult to rebase and update libcpu-rt-c as it is
to rebase and update glibc (not to mention it requires using DTS in RHEL
to get a new-enough compiler/binutils). The other consequence is that a
newer compiler/binutils may need a newer gdb to even be able to debug
the code in question, and the problem is compounded. No distro that
I'm aware of has ever delivered something like this.

OS vendors already have a process to backport IFUNC and other
improvements to stable branches, and we do this in RHEL for Intel,
IBM, and ARM (just look at our public glibc.spec %changelog) e.g.
- Improve libm performance AArch64 (#1302086)
- Improve memcpy performance for POWER9 DD2.1 (#1498925)
- Add Intel AVX-512 optimized routines (#1298526).
- Improve performance on Intel Purley (#1335286).
- Add support for new IBM z14 (s390x) instructions (#1375235)

If you need key routines backported, please work with your
distribution contact to have key support backported. RHEL
point releases happen frequently.

Therefore this proposal only adds work to upstream glibc, and
doesn't provide customers with a supported libcpu-rt-c. At most
it gives customers a way to improve performance by using
libraries provided by a 3rd party. That 3rd party could equally
deploy a custom glibc and tell the customer to use that.

(3) Solution is too costly in terms of maintenance.

The solution lacks the simplicity of plans like --enable-math-private.

In this patch set from Florian:
https://sourceware.org/ml/libc-alpha/2018-09/msg00368.html

We see a proposal that is much simpler for the math routines.
In particular, it builds libm.so such that it is distinct from
glibc and can be preloaded. This is easier for libm functions
because they are so distinct from libc, but it's just an example
of the kind of well isolated solutions which are desirable
from upstream.

My opinion is that unless the solution becomes drastically
simpler, it has too high a cost in terms of maintenance
for the problem it solves.

---

In summary:

(1) Could solve "most" of the problem with an external
pre-loadable library, without all the bells-and-whistles
glibc has (tunables, etc) e.g. ARM's cortex-strings.

(2) Difficult to support from an OS vendor point of view.
Easier to just ship a new glibc.

(3) Costly in terms of maintenance for the value it provides.
Cost is ongoing maintenance and support of lots of
conditionals to enable 3rd parties providing parts of
new glibc's functionality to users.
--
Cheers,
Carlos.
Siddhesh Poyarekar
2018-12-05 03:53:40 UTC
Post by Carlos O'Donell
https://www.sourceware.org/ml/libc-alpha/2018-06/msg00259.html
- Suggest implementing in a distinct project: Adhemerval, Florian, Carlos.
- Request simpler design: Florian, Siddhesh.
Well my opinion was really more about glibc's build system and not this
library; I couldn't see a viable way to have it ship even then and I
think a lot of y'all had already made that point.
Post by Carlos O'Donell
(1) Why not an external preloadable library?
This RFC appears unchanged from the original proposal and the outstanding
comments do not appear to have been discussed in any further detail,
particularly the cost/benefit ratio to the project of accepting such patches
versus a simpler mechanism. Likewise, it is not explained why "most" user
needs cannot be met by something like ARM's cortex-strings, which doesn't
need deep integration with glibc-specific features.
Yeah, that is a much more flexible approach. Maybe in the medium/long
term we could consider the idea of making this new project into a
submodule of glibc to reduce or even avoid duplication of code.

My only concern here is fragmentation; architecture maintainers will
need to make sure that they're syncing routines regularly. It happens
for arm/aarch64 currently because we're still in a state where glibc
dictates the development to a great extent. Once this library gets
traction, that incentive may get lost.
Post by Carlos O'Donell
(2) Current libcpu-rt-c proposal does not meet OS vendor needs.
The present libcpu-rt-c proposal as-is is not usable by OS vendors;
replacing the core string routines is equivalent to a library rebase
and requires revalidation efforts by the distribution and by QE. This
makes it *almost* as difficult to rebase and update libcpu-rt-c as it is
to rebase and update glibc (not to mention it requires using DTS in RHEL
to get a new-enough compiler/binutils). The other consequence is that a
newer compiler/binutils may need a newer gdb to even be able to debug
the code in question, and the problem is compounded. No distro that
I'm aware of has ever delivered something like this.
OS vendors already have a process to backport IFUNC and other
improvements to stable branches, and we do this in RHEL for Intel,
IBM, and ARM (just look at our public glibc.spec %changelog) e.g.
- Improve libm performance AArch64 (#1302086)
- Improve memcpy performance for POWER9 DD2.1 (#1498925)
- Add Intel AVX-512 optimized routines (#1298526).
- Improve performance on Intel Purley (#1335286).
- Add support for new IBM z14 (s390x) instructions (#1375235)
If you need key routines backported, please work with your
distribution contact to have key support backported. RHEL
point releases happen frequently.
Agreed, this is pretty much what I said at the Plumbers last month with
my ex-Red Hatter Fedora on.
Post by Carlos O'Donell
Therefore this proposal only adds work to upstream glibc, and
doesn't provide customers with a supported libcpu-rt-c. At most
it gives customers a way to improve performance by using
libraries provided by a 3rd party. That 3rd party could equally
deploy a custom glibc and tell the customer to use that.
Right.
Post by Carlos O'Donell
(3) Solution is too costly in terms of maintenance.
The solution lacks the simplicity of plans like --enable-math-private.
https://sourceware.org/ml/libc-alpha/2018-09/msg00368.html
We see a proposal that is much simpler for the math routines.
In particular, it builds libm.so such that it is distinct from
glibc and can be preloaded. This is easier for libm functions
because they are so distinct from libc, but it's just an example
of the kind of well isolated solutions which are desirable
from upstream.
This is fine for math, but maybe not for strings because they might need
some initialization state to work correctly (e.g. tunables) and also
because they may get used very early. It's solvable, but not as easily
as math.

Siddhesh

PS: <bait>We should some day talk about going the opposite way and
merging libpthread.so into libc.so</bait>
Patrick McGehearty
2018-12-07 20:16:42 UTC
Disclaimer: While I work for Oracle, I am not authorized to comment on
Oracle product plans.

My work focus has been primarily on performance issues on various systems
and HW platforms over the years. I am sympathetic to the desire for
performance improvements to get into actual end users' hands as quickly as
is consistent with security, reliability, etc. I don't believe the add-on
glibc_(mem/string) approach will achieve this goal for most vendors and
most vendor-dependent users.

Ideally, any supported release of a product will require testing,
documentation, and a QA phase. Most vendors support multiple releases at
any given time. Something this tightly tied to glibc would require QA work
for each glibc_(mem/string) with each glibc_(base). If a vendor has only 3
of each at any given time, that would still mean 9 units of QA work instead
of 3 units of QA work. The potential market benefit would be small compared
to the additional overhead of just the QA work. Once you add in the
increased cost of applying fixes to yet more source trees [six instead of
three in the above example], it hardly seems an attractive path for SW
maintenance.
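
The counting in the example above can be written down as a small C sketch.
It only illustrates the arithmetic, under the assumption that every add-on
release has to be validated against every base release; the struct and
function names are made up for the illustration.

```c
#include <assert.h>

/* Hypothetical bookkeeping for the example above; not a real tool. */
struct maintenance_cost {
    unsigned qa_units;      /* combinations that need a QA pass */
    unsigned source_trees;  /* trees that need each fix applied */
};

/* One glibc per supported release: QA and fixes both scale with the
   number of releases. */
struct maintenance_cost single_library_cost(unsigned releases)
{
    struct maintenance_cost c = { releases, releases };
    return c;
}

/* Split library: every glibc_(mem/string) release is paired with every
   glibc_(base) release for QA, and both sets of trees receive fixes. */
struct maintenance_cost split_library_cost(unsigned base, unsigned addon)
{
    assert(base > 0 && addon > 0);
    struct maintenance_cost c = { base * addon, base + addon };
    return c;
}
```

With three releases of each, split_library_cost(3, 3) gives 9 QA units and
6 source trees, against 3 and 3 for single_library_cost(3).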

Today, a vendor can select the upstream performance related patches they
perceive as useful to their customers and apply them to their next update
of their newest glibc release. Older releases are likely to be left
unchanged as customers on older releases implicitly prefer stability. If
they wanted the latest stuff they'd switch to the newest vendor release.

To get improvements to customers faster, we need vendors to have pressure
from customers to make those improvements available. That means
customers need to be aware that improvements are happening.
Even simple synthetic open source benchmarks with a reasonable range of
input values can be useful in this regard. Then one can say:
"On the glibc strcpy benchmark, for platform y, the new strcpy code runs
x% faster."
Simple, quantitative, easy to grasp the improvement, and easy to validate
by anyone with access to the src, the test, and platform y.
Then a vendor could pick up a set of improvements and tell customers
that "our newest version of glibc runs x% to y% faster on a range of
commonly used functions (see appendix for details) than glibc version zzz."
Customers who care would gravitate to vendors who release improvements
more quickly, giving vendors a reason to port the upstream improvements
more quickly.
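
As a rough illustration of the kind of simple synthetic benchmark described
above, here is a minimal C sketch. It is not an existing glibc benchmark;
the function names are made up, "x% faster" is taken here to mean the
reduction in time per call, and clock() (CPU time, coarse resolution) is
used only because it is portable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Average cost of one strcpy() call, in nanoseconds, for a string of
   `len` characters. Returns -1.0 on bad arguments or allocation failure.
   clock() is coarse, so `iters` must be large for a meaningful number. */
double bench_strcpy_ns(size_t len, long iters)
{
    if (iters <= 0)
        return -1.0;

    char *src = malloc(len + 1);
    char *dst = malloc(len + 1);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 'x', len);
    src[len] = '\0';

    /* Calling through a volatile pointer keeps the compiler from inlining
       or dropping the call, so the library routine is what gets timed. */
    char *(*volatile copy)(char *, const char *) = strcpy;

    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        copy(dst, src);
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}

/* Percentage by which the time per call dropped from old_ns to new_ns. */
double speedup_percent(double old_ns, double new_ns)
{
    assert(old_ns > 0.0);
    return (old_ns - new_ns) / old_ns * 100.0;
}
```

For example, if the old routine averaged 200 ns per call and the new one
150 ns, speedup_percent(200.0, 150.0) reports 25. A usable benchmark would
also sweep a range of lengths and alignments and repeat the runs.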

- patrick