Discussion:
[PATCH] DT_GNU_HASH: ~ 50% dynamic linking improvement
Jakub Jelinek
2006-06-28 17:09:00 UTC
Permalink
Hi!

The following patches introduce an optional ELF hash section
replacement, optimized for speed and data cache accesses, which in our
tests improves dynamic linking by about 50%. Prelinking of course
eliminates these costs altogether, but if prelinked apps use dlopen
they still benefit for that portion. This is incidentally where some
apps incur high costs today.

The initial design was done by Ulrich Drepper and was discussed with
Michael Meeks a few months ago as well. But nothing came of those
discussions, and the approach proposed here is quite different.

The binutils patch adds a new option to ld, --hash-style, which allows
selection of which hash sections to emit. ld can emit either the old
style SHT_HASH section, the new style SHT_GNU_HASH, or both (and
perhaps in the future there could be a mode that emits both, but with
SHT_HASH sized for slow compatibility only, i.e. a minimal number of
buckets).

The .gnu.hash section always uses sh_entsize 4 (so it doesn't repeat
the historic mistakes made with .hash on Alpha/s390x). The first word
is nbuckets, as in the .hash section, followed by nbuckets offsets
into the chains area of the new section. An offset of 0xffffffff
means there are no defined symbol names with hash % nbuckets equal to
that bucket's position. Otherwise, offset N means the corresponding
chain starts at byte offset (1 + nbuckets + N) * 4 into the .gnu.hash
section. The chains area no longer mirrors the symbol table in size.

Each chain starts with a symbol index word, followed by a chain length
word and then, length times, the hash value of the corresponding
symbol name. If DT_GNU_HASH is present, .dynsym must be sorted so
that all symbols with the same name hash value % nbuckets are grouped
together.
So, if some chain contains

symindx0 3 hash0 hash1 hash2

then dynamic symbol symindx0 has name hash hash0, symindx0+1 has hash1 and
symindx0+2 has hash2 and (hash0%nbuckets)==(hash1%nbuckets)==(hash2%nbuckets).
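
To picture the layout just described (a sketch of this first iteration
of the format, for illustration only; every entry is a 32-bit word):

/* .gnu.hash layout as described above:

     word 0                   nbuckets
     words 1 .. nbuckets      buckets[]: 0xffffffff for an empty
                              bucket, else an offset N into the
                              chains area
     word (1 + nbuckets + N)  start of one chain:
                                symindx, len, hash[0] .. hash[len-1]

   symindx is the .dynsym index of the first symbol in the chain,
   and hash[i] belongs to .dynsym entry symindx + i.  */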

The hash function used is

static uint_fast32_t
dl_new_hash (const char *s)
{
  uint_fast32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h & 0xffffffff;
}

(Dan Bernstein's string hash function posted eons ago on comp.lang.c.)

For an unsuccessful lookup (the usual case), the dynamic linker has to
read buckets[hash % nbuckets] (one cache line) and then, if it is not
0xffffffff, go through the chain, comparing each hashN with hash. As
the hashN values are 32-bit and the hash function spreads the strings
well, it is very likely that if some hashN is equal, we have found the
symbol (and just need to verify it against .dynsym/.dynstr); if the
hash differs from all hashN values in the chain, ld.so can go on to
another shared library. Usually the whole chain fits in one cache
line, or at most two.
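
For illustration, here is a minimal sketch of that chain walk against
this first-iteration format (assumptions: "section" points at the
mapped .gnu.hash words in native byte order, and the function name is
mine, not from the patch):

#include <stdint.h>

/* Sketch only: return the .dynsym index of a probable match for
   HASH, or -1.  The caller must still verify the name via
   .dynsym/.dynstr before trusting the result.  */
static long
gnu_hash_lookup (const uint32_t *section, uint32_t hash)
{
  uint32_t nbuckets = section[0];
  uint32_t n = section[1 + hash % nbuckets];

  if (n == 0xffffffff)           /* empty bucket: symbol not here */
    return -1;

  const uint32_t *chain = section + 1 + nbuckets + n;
  uint32_t symindx = chain[0];   /* .dynsym index of first chain entry */
  uint32_t len = chain[1];       /* number of hash values that follow */

  for (uint32_t i = 0; i < len; ++i)
    if (chain[2 + i] == hash)
      return (long) (symindx + i);
  return -1;                     /* no hashN matched: try the next DSO */
}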

We tested a bunch of different hash functions on the set of ~530000
unique dynamic symbols in one Linux installation, and Dan Bernstein's
hash had the fewest collisions (only 29), with very short maximum
common prefixes among the colliding symbols (just 3 chars). It is
also very cheap to compute (h * 33 is (h << 5) + h), and both ld -O1
and non-optimizing ld hash sizing gave good results with that hash
function. The standard SysV ELF hash function gave bad results: on
the same set of symbols it had 1726 collisions, and some of the
colliding symbols had very long common name prefixes. One of the
reasons could be that the SysV ELF hash function is effectively
28-bit, while Bernstein's is 32-bit.

The attached stats file shows the biggest benchmark we used (a proglet
that attempts to do all the dynamic linking OpenOffice.org 2.0.3
swriter.bin does). Most of the libraries (all but about 8 of the 146
libraries used by the testcase) were relinked with
-Wl,--hash-style=both, and glibc, on top of the DT_GNU_HASH support
patch, also had a hack where, if LD_X=1 was in the environment, it
would pretend DT_GNU_HASH is not present, to allow easier
benchmarking. The LD_X support will not be in the final glibc patch.

Ok for trunk?

Jakub
Paul Eggert
2006-06-28 18:45:44 UTC
Permalink
Post by Jakub Jelinek
(Dan Bernstein's string hash function posted eons ago on comp.lang.c.)
Bernstein now prefers XOR, i.e.,
"h = h * 33 ^ c;" instead of
"h = h * 33 + c;". Did you try that as well?
Several sources indicate it's a bit better, e.g.,
<http://eternallyconfuzzled.com/tuts/hashing.html#djb2>.
Post by Jakub Jelinek
We have tested a bunch of different hash functions
Which hash functions did you try? One-at-a-time? FNV? You'll
probably get hassled by hash trivialists (like me :-) no matter which
function you choose, but you can forestall that to some extent by
mentioning which functions you tested.

(Thanks for doing all this, by the way.)
Roland McGrath
2006-06-28 19:10:51 UTC
Permalink
Post by Paul Eggert
Bernstein now prefers XOR, i.e.,
"h = h * 33 ^ c;" instead of
"h = h * 33 + c;". Did you try that as well?
Several sources indicate it's a bit better, e.g.,
<http://eternallyconfuzzled.com/tuts/hashing.html#djb2>.
We have not tried that function.
Post by Paul Eggert
Post by Jakub Jelinek
We have tested a bunch of different hash functions
Which hash functions did you try? One-at-a-time? FNV? You'll
probably get hassled by hash trivialists (like me :-) no matter which
function you choose, but you can forestall that to some extent by
mentioning which functions you tested.
We tried the original ELF hash, the htab_hash_string function from
libiberty, and the full_name_hash function from Linux's <linux/dcache.h>.
We compared them using a real-world representative sample of symbol name
strings (collected from the dynamic symbols in a GNU/Linux installation).
We looked at the number of hash values shared by two symbols, the number
shared by three symbols (none were shared by more than three in our sample
set using any function we tried), and the length of common prefixes shared
by colliding symbols.


Thanks,
Roland
Jakub Jelinek
2006-06-28 20:24:58 UTC
Permalink
Post by Paul Eggert
Post by Jakub Jelinek
(Dan Bernstein's string hash function posted eons ago on comp.lang.c.)
Bernstein now prefers XOR, i.e.,
"h = h * 33 ^ c;" instead of
"h = h * 33 + c;". Did you try that as well?
Several sources indicate it's a bit better, e.g.,
<http://eternallyconfuzzled.com/tuts/hashing.html#djb2>.
Post by Jakub Jelinek
We have tested a bunch of different hash functions
Which hash functions did you try? One-at-a-time? FNV? You'll
probably get hassled by hash trivialists (like me :-) no matter which
function you choose, but you can forestall that to some extent by
mentioning which functions you tested.
Ok, added a few further functions from the URL you referenced (previously
just had sysv, libiberty, dcache, djb2 and sdbm).
The number of collisions in the 537403 symbols is:
name        2-sym #   3-sym #   >3-sym #
sysv           1749         5         -
libiberty        42         -         -
dcache           37         -         -
djb2             29         -         -
sdbm             39         -         -
djb3             31         -         -
rot             337        39        61
sax              34         -         -
fnv              40         -         -
oat              30         -         -
Speed and number of collisions on the set of all symbols with
full 32-bit hash values (i.e. not modulo nbuckets) are the most important
factors in this case, as handling such collisions is expensive.
Also, djb2 (i.e. Bernstein with +) gave a max common prefix length of
3, while djb3 gives 4, as do fnv, oat and sax.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Classic SysV ELF hash (effectively 28-bit).  */
unsigned int
elf_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int g, h = 0;
  int ch;

  while ((ch = *name++) != '\0')
    {
      h = (h << 4) + ch;
      if ((g = (h & 0xf0000000)) != 0)
        {
          h ^= g >> 24;
          h ^= g;
        }
    }
  return h;
}

/* htab_hash_string from libiberty.  */
unsigned int
libiberty_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 0;
  int ch;

  while ((ch = *name++) != '\0')
    h = h * 67 + ch - 113;

  return h;
}

/* full_name_hash from Linux's <linux/dcache.h>.  */
unsigned int
dcache_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 0;
  int ch;

  while ((ch = *name++) != '\0')
    h = (h + (ch << 4) + (ch >> 4)) * 11;

  return h;
}

/* Bernstein's hash with addition.  */
unsigned int
djb2_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 5381;
  int ch;

  while ((ch = *name++) != '\0')
    h = (h << 5) + h + ch;

  return h;
}

/* sdbm hash.  */
unsigned int
sdbm_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 0;
  int ch;

  while ((ch = *name++) != '\0')
    h = ch + (h << 6) + (h << 16) - h;

  return h;
}

/* Bernstein's hash with XOR.  */
unsigned int
djb3_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 5381;
  int ch;

  while ((ch = *name++) != '\0')
    h = ((h << 5) + h) ^ ch;

  return h;
}

/* Rotating hash.  */
unsigned int
rot_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 0;
  int ch;

  while ((ch = *name++) != '\0')
    h = ch ^ (h << 4) ^ (h >> 28);

  return h;
}

/* Shift-add-XOR hash (parentheses added to make the precedence of
   the expression as originally posted explicit).  */
unsigned int
sax_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 0;
  int ch;

  while ((ch = *name++) != '\0')
    h ^= ch ^ ((h << 5) + (h >> 2));

  return h;
}

/* 32-bit Fowler/Noll/Vo hash.  */
unsigned int
fnv_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 2166136261;
  int ch;

  while ((ch = *name++) != '\0')
    h = (h * 16777619) ^ ch;

  return h;
}

/* Bob Jenkins' one-at-a-time hash (seeded with the FNV offset basis,
   as posted).  */
unsigned int
oat_hash (const char *namearg)
{
  const unsigned char *name = (const unsigned char *) namearg;
  unsigned int h = 2166136261;
  int ch;

  while ((ch = *name++) != '\0')
    {
      h += ch;
      h += (h << 10);
      h ^= (h >> 6);
    }

  h += (h << 3);
  h ^= (h >> 11);
  h += (h << 15);

  return h;
}

/* Read symbol names from stdin, one per line, and print
   "<hash> <name>" using the function selected by argv[1].  */
int
main (int argc, char **argv)
{
  char *line = NULL;
  size_t len = 0;
  ssize_t n;
  unsigned int (*hash) (const char *);

  if (argc < 2)
    return 1;
  if (strcmp (argv[1], "sysv") == 0)
    hash = elf_hash;
  else if (strcmp (argv[1], "libiberty") == 0)
    hash = libiberty_hash;
  else if (strcmp (argv[1], "dcache") == 0)
    hash = dcache_hash;
  else if (strcmp (argv[1], "djb2") == 0)
    hash = djb2_hash;
  else if (strcmp (argv[1], "sdbm") == 0)
    hash = sdbm_hash;
  else if (strcmp (argv[1], "djb3") == 0)
    hash = djb3_hash;
  else if (strcmp (argv[1], "rot") == 0)
    hash = rot_hash;
  else if (strcmp (argv[1], "sax") == 0)
    hash = sax_hash;
  else if (strcmp (argv[1], "fnv") == 0)
    hash = fnv_hash;
  else if (strcmp (argv[1], "oat") == 0)
    hash = oat_hash;
  else
    return 1;
  while ((n = getline (&line, &len, stdin)) > 0)
    {
      if (line[n - 1] == '\n')
        line[n - 1] = '\0';
      printf ("%08x %s\n", hash (line), line);
    }
  return 0;
}
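
(For reference, one plausible way to count colliding hash values from
this output - the program name and symbol list file are placeholders,
and the post-processing actually used isn't shown here - would be
e.g. ./hashtest djb2 < symlist | cut -d' ' -f1 | sort | uniq -d | wc -l.)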

Jakub
Paul Eggert
2006-06-28 21:40:02 UTC
Permalink
Post by Jakub Jelinek
Ok, added a few further functions from the URL you referenced (previously
just had sysv, libiberty, dcache, djb2 and sdbm).
Thanks for doing that analysis.

One other (perhaps dumb) question: why limit yourself to a 32-bit hash
on machines with 64-bit addresses? Since the application area
(linking executables) is already architecture dependent, is there a
downside to going with 64-bit hashes on such machines?

(Hope you don't mind the trivia questions from the peanut gallery....)
Roland McGrath
2006-06-28 21:46:00 UTC
Permalink
Post by Paul Eggert
One other (perhaps dumb) question: why limit yourself to a 32-bit hash
on machines with 64-bit addresses? Since the application area
(linking executables) is already architecture dependent, is there a
downside to going with 64-bit hashes on such machines?
It halves the number of table entries that can fit in the machine's cache.
To overcome the memory and cache impact of doubling the table size, there
would have to be very, very common collisions of the 32-bit hash that don't
collide with a proposed 64-bit function. Since we can select a 32-bit hash
where collisions are acceptably rare in practice, it seems unlikely that
this tradeoff would ever balance.
Ulrich Drepper
2006-06-28 21:49:15 UTC
Permalink
Post by Paul Eggert
One other (perhaps dumb) question: why limit yourself to a 32-bit hash
on machines with 64-bit addresses?
Waste of space, not only on disk but especially in memory. The goal
is to have the entire chain in a cache line. We already have good
results with 32 bits; wasting that space is simply not worth it.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Michael Meeks
2006-06-29 18:27:06 UTC
Permalink
Hi Jakub,
Post by Jakub Jelinek
The following patches introduce an optional ELF hash section
Dudie; you rock ! - thanks for taking this on board; this is really
nice :-) and here was I thinking I was going to have to write a lengthy
paper to try to convince Ulrich :-) You made my month (really).
Post by Jakub Jelinek
The initial design was done by Ulrich Drepper and was discussed with
Michael Meeks a few months ago as well. But nothing came off of these
discussions and the proposed approach is quite different.
This looks rather like the -zdynsort + -zhash-vals prototype; but I'm too
pleased to quibble :-)

Anyhow - I've been thinking about various other optimisations, many of
which we can easily & neatly fit into this nice framework; I've numbered
them to make them easier to chew over:

1. Extensibility
+ I want to see a multiply per lookup.
+ ie. since the L2 cache misses are what hammer us -
if [sic] we want to extend the .chain -Bdirect
support in future - being able to put that in
the same record is really important.
+ a 'record size' field gives us that with ~no
bloat as of now.
+ [ and I really think some form of direct linking
is the future - but we can discuss that later
clearly ]
+ ie. having an extensibility story that is no better
than 'just add a new section [ somewhere miles
away in memory/not-in-cache ]' is not so fun.

2. chain walking:
+ the 'not having a .chain <-> .dynsym offset
equivalence' is an interesting innovation over my
approach; having said that, I'm not sure why
you did that. Most chains are rather short - and
adding a length word to chains that are almost
always length 1/2/3 seems strange; may I suggest a
better approach ?
+ Since we can sort (almost all) of .dynsym -
there is no problem ensuring that ~all symbols
that we don't want in .hash occur at the end
of the .dynsym table [ cf. the next UNDEF point ].
+ Furthermore - by inspection it can be seen that
1/2^32 ~=== 1/2^30 [1]
+ using this theorem, we can easily steal
1/2 bits from the top of the stored hash
and use them as chain terminators.
+ ie. lets have:

.hash[symhash % nbuckets] -> .chain[5]

.chain[5] -> [ (not-end) | hashval[5] & 0x3fffffff ]
.chain[6] -> [ (not-end) | hashval[6] & 0x3fffffff ]
.chain[7] -> [ (is-end) | hashval[7] & 0x3fffffff ]

+ that saves us several bytes per chain,
and cf. above, no measurable loss in
performance. [ of course chain[5]
<-> .dynsym[5] here as before; a sketch
of walking such a chain follows after
this list ]

+ in fact, for extensibility a per-shlib hash
comparison 'mask' field would make for a rather
better behaviour - so we can snarf bits from
here in future if we come up with brighter ideas.

3. What we hash:
+ it is incredible to me that UNDEF symbols are
included in the symbol hash; I see no legitimate
use for that [ eg. 'free' is an expensive hit in the
hash for virtually every app ;-]
+ with 2 bits - a 'chain terminator' and a
'defined terminator' - if we don't bin the undefs
from the hash completely, we can at least sort
them to the end of the chain and often avoid
comparing them - [ currently they would be
filtered out only in check_match after a .dynsym
check for "if ((sym->st_value == 0 /* No value. */...
|| (type_class & (sym->st_shndx == SHN_UNDEF)))"
+ This may seem an insignificant win: most
'normal' libs I've seen only import 1/3rd
of their symbols
+ OTOH - C++ 'plugins' [ which we can't
load RTLD_LOCAL because of vague linkage symbols ]
tend to have map files / visibility markup
to reduce their exposed symbol count
substantially, eg.
+ as an example 'writer' lives in 'libsw'; using
objdump -T libsw680li.so | grep UND | wc -l
(and the converse for defined symbols):
Undefined: 6465
Defined: 3580
+ ie. for every 'defined' symbol that someone
might legitimately want to look up we have
~2 other symbols that they just -cannot- want
to compare.
+ ie. in some cases by dropping UNDEF symbols, we can
have our new / larger .chain section and still save
space [ on disk & in cache ].
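
For illustration, a minimal walk of the proposed 'stolen bit' chain
might look as follows (a sketch only, under the assumptions above;
the names are mine, a single is-end bit is stolen for simplicity
where the diagram above steals two, and chain[i] corresponds 1:1 to
.dynsym[i]):

#include <stdint.h>

#define IS_END_BIT 0x80000000u  /* hypothetical stolen top bit */

/* Sketch: scan a terminator-bit chain starting at .dynsym index
   START; return a probable match index or -1.  The caller still
   verifies the name via .dynsym/.dynstr.  */
static long
stolen_bit_lookup (const uint32_t *chain, uint32_t start, uint32_t hash)
{
  for (uint32_t i = start; ; ++i)
    {
      uint32_t entry = chain[i];

      if (((entry ^ hash) & ~IS_END_BIT) == 0)
        return (long) i;        /* probable hit: compare names */
      if (entry & IS_END_BIT)
        return -1;              /* is-end bit set: chain over */
    }
}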

So - some comments on your patch. I notice you don't sort .dynstr as I
did ;-) that clearly simplifies life; of course - I have devious reasons
for wanting that. Wrt. -Bdirect: if we sort our relocations right and use
-Bdirect, we can turn linking into a linear walk of the library
list, .hash, .chain, .dynsym, and .dynstr [ if we play our cards
right ;-]; but that's for another day [ I'd just love
to make linking completely disappear from the L2 cache miss profile
(when using C) - I think it can be done ;-].

Reading the patch - I really like it :-) my calculation logic to look up
the hash stuff in a separate section was *really* ugly and evil, and
would have required some re-factoring to make it at all pleasant; your
chain walking code is sweet and pleasant to the eye etc. :-)

Anyhow - I think we can build some other rather nice optimisations on
top of this; and if we can do some of the above we can save space as
well for a pure gnu-hash system. My only concern really is with
extensibility.

Thanks so much for doing this - of course, as cache sizes increase this
may appear to be less useful[2]; but the faster we can make the linker,
the more we can split up OO.o into more libs, lazy load more,
componentize more and (hopefully) save size/time on startup & at
runtime.

Regards,

Michael.

[1] - so your paragraph on the hash:

"530000 unique dynamic symbols ... Dan Bernstein's hash ... fewest
collisions (only 29) ... The standard SysV ELF hash function gave bad
results ... 1726 collisions"

is interesting - and I'm sure the hash you propose is great - but of
course, with the hash comparison the reduction in the number of L2 cache
misses is so great that the 'quality' of the hash at this level is
rather uninteresting wrt. the optimisation [ IMHO etc. ].

ie. 29/530000 ~= 1726/530000, well nearly anyway ;-) focusing on a
0.005% hit rate vs. 0.3% is interesting but not worth fighting over. More
interestingly, can we pick a few bits to steal and only degrade that by a
factor of 4 or so ? ie. 0.005% vs. 0.02% ;-)

[2] - the Core Duo here with its 2Mb cache is amazingly faster for OO.o
linking, but clearly at smaller cache sizes there is this appalling
cliff to drop off wrt. linking performance.
--
***@novell.com <><, Pseudo Engineer, itinerant idiot
Jakub Jelinek
2006-06-29 19:39:26 UTC
Permalink
Post by Michael Meeks
+ it is incredible to me that UNDEF symbols are
included in the symbol hash; I see no legitimate
use for that [ eg. 'free' is an expensive hit in the
hash for virtually every app ;-]
More comments later, but as you can see in the patch, undefined
symbols are never added to .gnu.hash chains, only defined ones
(and, as I was lazy, also some of the undefined PLT symbols [*]).

That's what the:
+ /* Ignore also local symbols and undefined symbols. */
+ if (h->forced_local
+ || h->root.type == bfd_link_hash_undefined
+ || h->root.type == bfd_link_hash_undefweak
+ || ((h->root.type == bfd_link_hash_defined
+ || h->root.type == bfd_link_hash_defweak)
+ && h->root.u.def.section->output_section == NULL))
+ return TRUE;
in elf_collect_gnu_hash_codes and elf_renumber_gnu_hash_syms
does. After sorting, .dynsym contains first the special symbols
(0, STT_SECTION, STB_LOCAL), then the SHN_UNDEF symbols, and only
after those the defined ones, sorted by their hash % nbuckets
value.

[*] SHN_UNDEF symbols with st_value != 0 need to be in .gnu.hash
(and are there), as they are used when resolving relocations that
take the address of functions. These are at that point still
bfd_link_hash_defined with non-NULL output_section, and only
during dynamic symbol finalization are they turned into st_shndx
SHN_UNDEF. But on i386/x86_64 we have an optimization where, if
the binary never takes the address of some function, st_value on its
SHN_UNDEF symbol is cleared (see e.g. elf64_x86_64_finish_dynamic_symbol
/* Mark the symbol as undefined, rather than as defined in
the .plt section. Leave the value if there were any
relocations where pointer equality matters (this is a clue
for the dynamic linker, to make function pointer
comparisons work between an application and shared
library), otherwise set it to zero. If a function is only
called from a binary, there is no need to slow down
shared libraries because of that. */
sym->st_shndx = SHN_UNDEF;
if (!h->pointer_equality_needed)
sym->st_value = 0;
). Such symbols could be left out of .gnu.hash too, assuming we
add some target hook, as the pointer_equality_needed bit is target specific.

Jakub
Michael Meeks
2006-07-03 15:15:20 UTC
Permalink
Hi there,
Post by Jakub Jelinek
More comments later, but as you can see in the patch, undefined
symbols are never added to .gnu.hash chains, only defined ones
Ah - cool :-) thanks for that.
Post by Jakub Jelinek
Post by Michael Meeks
1. Extensibility
On the other side, it is IMHO much better to keep using one
section for one purpose, as ELF has always been doing.
Yes - you're quite right - it's crack smoking to want to put -Bdirect
stuff in that section anyway, as you say: a few minutes' thought in a
less excited state convinced me of that [ sadly while off line ].

OTOH - I do think that allowing the stealing of some bits from the hash
code [ for other purposes ] in future is prudent, and need not perform
badly:

if (!((new_hash ^ l->chain[foo]) & l->new_hash_mask))
... hit.

etc. shouldn't be overly slower than a straight comparison surely ?
Post by Jakub Jelinek
Applications keep growing every year and while I currently counted
~ 500000 different symbol names, it can soon double, triple and keep
increasing.
From what I can see - your method was to grab ~all the symbols on your
system, and compare hashes for all of them ? and we only got <100
collisions ? I take Ulrich's point that perhaps many of these are in the
same module [ it'd be interesting to see ] due to highly similar names
etc. but even so this is a tiny collision rate.
Post by Jakub Jelinek
And, all linker can do for this is try to keep all the chains
short enough to fit into cachelines.
So - perhaps you can help me here; I've been working with a mental
model of several levels of cache - whereby the more you can compress the
working set, the faster access will be. Now I appreciate the cache line
argument - that after the 1st access, reads inside the same cache line
are really fast [ and (I infer from Ulrich's post) adjacent cache line
reads are free ].
Post by Jakub Jelinek
On the other hand, what is gained? You already said we have rather
short hash chains. Most likely not longer than 5 or 6. I can count the
number of DSOs with longer chains on my hands and there is not more than
one longer chain in each DSO.
Jakub recently posted this sort of thing:

[snip]
libstdc++.so.6.0.8

Histogram for `.gnu.hash' bucket list length (total of 1022 buckets):
Length Number % of total Coverage
0 52 ( 5.1%)
1 131 ( 12.8%) 4.1%
2 225 ( 22.0%) 18.4%
3 233 ( 22.8%) 40.5%
4 171 ( 16.7%) 62.2%
5 115 ( 11.3%) 80.4%
6 64 ( 6.3%) 92.5%
7 18 ( 1.8%) 96.5%
8 8 ( 0.8%) 98.5%
9 4 ( 0.4%) 99.7%
10 1 ( 0.1%) 100.0%
...
For libc-2.4.90.so:
...
Histogram for `.gnu.hash' bucket list length (total of 1011 buckets):
Length Number % of total Coverage
0 124 ( 12.3%)
1 254 ( 25.1%) 12.4%
2 296 ( 29.3%) 41.2%
3 198 ( 19.6%) 70.2%
4 98 ( 9.7%) 89.3%
5 29 ( 2.9%) 96.4%
6 10 ( 1.0%) 99.3%
7 2 ( 0.2%) 100.0%
[snip]

But sure - long chains should be unusual. It's perhaps interesting to
consider where the performance balance lies between expanding the base
.hash to get more 0xffffffff hits and shrinking it to fit more of it
into L1 cache.
Post by Jakub Jelinek
Now do the math: even assuming even distribution of the chains we have
on average 6 chain elements per cache line. If the linker sorts them
into cache lines we can fit in 14. That's enough even for the worst chain
length.
By "sorts them cache lines" I assume you mean adding padding [perhaps undef]
records / alignment to .dynsym to maximise the length of chain inside
cache lines ? that could certainly be done later; interesting.
Post by Jakub Jelinek
So, on the one hand you save one 4-byte word which in almost no case
will cause an additional cache line to be loaded. And even if it would,
the cache lines are consecutive and the processor's prefetching takes
care of it. On the other hand you more than double the likelihood of
having to compare strings. This is IMO reason enough to not change the
format.
Nah - let me do the maths another way and perhaps I'll get the result
I'm looking for ;-) Crunching the above numbers that Jakub provided, I
see an avg. no. of comparisons per chain of:

libstdc++  3.09   # sum({count*length})/sum(count)
glibc      2.03

So a finger-in-the-air avg. chain length is perhaps ~2.5. [1]

My proposed alternate design sacrifices 1 bit of the hash; let's (for
the sake of argument) assume this has the most awful performance
degradation - a factor of 10 (instead of perhaps ~2-4).

The proposed alternate design ('stolen bit') is engineered thus:

* By sorting ~all not-hashed .dynsym entries to the 'end' we end
up with contiguous runs of .dynsym entries [ as now ] in
chains, adjacent to each other. Thus - by stealing a
terminator bit we avoid needing a chain length. [ the 'offset'
in the .hash can then just be a .dynsym index ]

So - lets also assume we have 10^6 unique symbols across a few
libraries - just for argument.

It seems then that the likelihood (with a purely random access pattern)
of a cache hit [ ie. not taking a L1/L2 cache miss ;-] is ~=
sizeof(cache)/sizeof(data_set).

With a 64k L1 cache and a 4Mb working set (stolen-bit) this gives a
1.6% hit rate. [ for a large L2 of course it's higher, but let's look
at the worst case, the smallest cache, where this effect is lowest ]

If we increase the size of each field by 4 bytes per 2.5 records [ as
per the existing 'chain-len' design ] from our avg 400k chains we get: a
1.6Mb increase (factor of 1.4) to a 5.6Mb working set. This gives us a
lower cache hit rate of 1.1%.

But - of course, as you point out there is a penalty for hash
collisions - by reducing precision we lose on comparison:

32 bits - 30/530000 = 0.005% false positives
31 bits - 10x worse = 0.05%

So - taking our 1 million record hypothetical set:

              L1 hits   false positives
chain-len.    11k       56
stolen-bit    16k       560

So - if we assume a (conservative) cost of a false positive of ~10 L2
cache misses [ it's more like 3 ] then this seems reasonable.

So - of course each false positive probably costs less than the 10 L2
cache misses necessary to make these compare (right?). As the cache
grows, of course, the assumptions start to fail, but with a 2Mb L2 cache:

              L1 hits (64k)   L2 hits (2Mb)   false positives
chain-len.    11k             500k            56
stolen-bit    16k             360k            560

So now a false positive would need to generate 280 L2 misses to reach
performance parity with the (existing) chain-len solution.

Of course, it's entirely probable that my numbers, assumptions,
calculations and raison d'être are entirely incorrect :-) And of course,
wrt. data set size, much more than just searching in the .dynsym section
is going on at link time, diluting the argument [etc.] Is any of it even
slightly convincing ?

Either way - since I now have time, if you're interested I'm happy to
try to hack this up so it can be profiled both ways; presumably
cachegrind can give a nice repeatable answer. If cachegrind shows it
being more cache efficient - would you consider a re-work / simplification ?

Thanks,

Michael.

[1] - of course assuming that we almost never hit - which I believe is
the common case.
--
***@novell.com <><, Pseudo Engineer, itinerant idiot
Jakub Jelinek
2006-07-03 15:59:25 UTC
Permalink
Post by Michael Meeks
Yes - you're quite right - it's crack smoking to want to put -Bdirect
stuff in that section anyway, as you say: a few minutes thought in a
less excited state convinced me of that [ sadly while off line ].
Ok, good we are on the same page ;)
Post by Michael Meeks
OTOH - I do think that allowing the stealing of some bits from the hash
code [ for other purposes ] in future is prudent, and need not perform
if (!(new_hash ^ l->chain[foo]) & l->new_hash_mask))
... hit.
etc. shouldn't be overly slower than a straight comparison surely ?
Djamel privately mentioned to me today that indeed if we require
nbuckets > 1 for valid .gnu.hash sections, we can get the chain
termination bit for free, i.e. not causing any extra collisions.
That's because if nbuckets > 1, then (hash+1) % nbuckets != hash % nbuckets:
two hashes differing only in the lowest bit differ by exactly 1 and so
always land in different buckets, i.e. the bottom-most bit is effectively
encoded in the bucket.

So we can get rid of the chain length words, and in that case we can
get rid of the symbol index word as well.
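
To make the trick concrete, here is a minimal sketch of a chain walk
under this scheme (the names are mine, not from the patch; name
verification via .dynsym/.dynstr is left to the caller):

#include <stdint.h>

/* Sketch: walk a chain whose entries carry the hash value with the
   lowest bit repurposed as the end-of-chain marker.  Safe when
   nbuckets > 1, because hash and hash^1 always land in different
   buckets, so masking the bit off cannot add intra-chain collisions.  */
static long
walk_terminated_chain (const uint32_t *chain, uint32_t symindx,
                       uint32_t hash)
{
  for (uint32_t i = 0; ; ++i)
    {
      uint32_t entry = chain[i];

      if (((entry ^ hash) & ~1u) == 0)
        return (long) (symindx + i);  /* probable match: verify name */
      if (entry & 1)
        return -1;                    /* lowest bit set: end of chain */
    }
}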

We have 2 options, either we create a 1:1 correspondence between the chain
position and .dynsym index (and perhaps sort the UNDEF symbols high, so that
we can trim the end of the .gnu.hash section), or we can store some addend
into the second word of .gnu.hash (before the buckets array).
ELF requires that the STB_LOCAL symbols are at the beginning, not the end,
of the .dynsym section. The UNDEF symbols can of course be anywhere (after
the STB_LOCAL symbols). On most of the arches there are only a few
STB_LOCAL symbols (< ~20), but on e.g. ia64 there are really many
(e.g. ia64 glibc has ~ 420 such symbols, all caused by relocations).
So, perhaps using the addend would be best.

It will limit us a little bit in the reordering to minimize chains crossing
cacheline boundaries, in that we can no longer have padding/gaps in there
(well, we could store some UNDEF symbols there), but we can still do some
reordering.

I will rework the patch now.

Jakub
Michael Meeks
2006-07-03 18:21:22 UTC
Permalink
Post by Jakub Jelinek
Ok, good we are on the same page ;)
:-)
Post by Jakub Jelinek
Djamel privately mentioned to me today that indeed if we require
nbuckets > 1 for valid .gnu.hash sections, we can get the chain
termination bit for free, i.e. not causing any extra collisions.
That's because if nbuckets > 1, then (hash+1) % nbuckets != hash % nbuckets,
so the bottom-most bit is effectively encoded in the bucket.
Ah - that's a neat trick: nice [ not that it isn't worth the single bit
anyway of course, as per convoluted calculations ;-]
Post by Jakub Jelinek
We have 2 options, either we create a 1:1 correspondence between the chain
position and .dynsym index (and perhaps sort the UNDEF symbols high, so that
we can trim the end of the .gnu.hash section), or we can store some addend
into the second word of .gnu.hash (before the buckets array).
The addend sounds nice.

Of course - it depends on how hard to calculate the hash function is:
we could of course use a rather stronger / slower hash: and simply store
the value for all symbols [ regardless of def/undef etc. ] in the .chain
- and instead of calculating, look them up. [ Of course to calculate the
string hash we have to take a .dynstr cache miss in the 1st place so -
perhaps it's worthwhile ? ].

Of course, if we do that, then it does in fact make sense to have the
-Bdirect indexes next to the undef symbols - so, OTOH, perhaps that
sort of thing does belong in another section.
Post by Jakub Jelinek
ELF requires that the STB_LOCAL symbols are at the beginning, not the end,
of the .dynsym section.
Yep - I fell over that :-)
Post by Jakub Jelinek
The UNDEF symbols can of course be anywhere (after
the STB_LOCAL symbols). On most of the arches there are only a few
STB_LOCAL symbols (< ~20), but on e.g. ia64 there are really many
(e.g. ia64 glibc has ~ 420 such symbols, all caused by relocations).
So, perhaps using the addend would be best.
That's interesting; I have an architecturally blinkered view.
Post by Jakub Jelinek
I will rework the patch now.
Thanks,

Michael.
--
***@novell.com <><, Pseudo Engineer, itinerant idiot
Jakub Jelinek
2006-07-03 21:04:46 UTC
Permalink
Post by Jakub Jelinek
I will rework the patch now.
Here is the updated patch; Ulrich has the corresponding glibc bits, and it
passed all the testing we have done on it so far.

There is still room for improvement in the chain reordering to minimize
the number of chains crossing cacheline boundaries, but that optimization
doesn't change anything in the section format, so I think it can be left
for later.

Ok for trunk?

2006-07-03 Jakub Jelinek <***@redhat.com>

include/
* bfdlink.h (struct bfd_link_info): Add emit_hash and
emit_gnu_hash bitfields.
include/elf/
* common.h (SHT_GNU_HASH, DT_GNU_HASH): Define.
ld/
* scripttempl/elf.sc: Add .gnu.hash section.
* emultempl/elf32.em (OPTION_HASH_STYLE): Define.
(gld${EMULATION_NAME}_add_options): Register --hash-style option.
(gld${EMULATION_NAME}_handle_option): Handle it.
(gld${EMULATION_NAME}_list_options): Document it.
* ldmain.c (main): Initialize emit_hash and emit_gnu_hash.
* ld.texinfo: Document --hash-style option.
bfd/
* elf.c (_bfd_elf_print_private_bfd_data): Handle DT_GNU_HASH.
(bfd_section_from_shdr, elf_fake_sections, assign_section_numbers):
Handle SHT_GNU_HASH.
(special_sections_g): Include .gnu.hash section.
(bfd_elf_gnu_hash): New function.
* elf-bfd.h (bfd_elf_gnu_hash): New prototype.
* elflink.c (_bfd_elf_link_create_dynamic_sections): Create .hash
only if info->emit_hash, create .gnu.hash section if
info->emit_gnu_hash.
(struct collect_gnu_hash_codes): New type.
(elf_collect_gnu_hash_codes, elf_renumber_gnu_hash_syms): New
functions.
(compute_bucket_count): Don't compute HASHCODES array, instead add
that and NSYMS as arguments. Use bed->s->sizeof_hash_entry
instead of bed->s->arch_size / 8. Fix .hash size estimation.
When not optimizing, use the number of hashed symbols rather than
dynsymcount.
(bfd_elf_size_dynamic_sections): Only add DT_HASH if info->emit_hash,
and add DT_GNU_HASH if info->emit_gnu_hash.
(bfd_elf_size_dynsym_hash_dynstr): Size .hash only if info->emit_hash,
adjust compute_bucket_count caller. Create and populate .gnu.hash
section if info->emit_gnu_hash.
(elf_link_output_extsym): Only populate .hash section if
finfo->hash_sec != NULL.
(bfd_elf_final_link): Adjust assertion. Handle DT_GNU_HASH.
binutils/
* readelf.c (get_dynamic_type): Handle DT_GNU_HASH.
(get_section_type_name): Handle SHT_GNU_HASH.
(dynamic_info_DT_GNU_HASH): New variable.
(process_dynamic_section): Handle DT_GNU_HASH.
(process_symbol_table): Print also DT_GNU_HASH histogram.

--- ld/scripttempl/elf.sc.jj 2006-01-01 01:02:16.000000000 +0100
+++ ld/scripttempl/elf.sc 2006-06-22 11:11:53.000000000 +0200
@@ -260,6 +260,7 @@ SECTIONS
${INITIAL_READONLY_SECTIONS}
${TEXT_DYNAMIC+${DYNAMIC}}
.hash ${RELOCATING-0} : { *(.hash) }
+ .gnu.hash ${RELOCATING-0} : { *(.gnu.hash) }
.dynsym ${RELOCATING-0} : { *(.dynsym) }
.dynstr ${RELOCATING-0} : { *(.dynstr) }
.gnu.version ${RELOCATING-0} : { *(.gnu.version) }
--- ld/ldmain.c.jj 2006-06-01 15:50:33.000000000 +0200
+++ ld/ldmain.c 2006-06-22 11:21:11.000000000 +0200
@@ -304,6 +304,8 @@ main (int argc, char **argv)
link_info.create_object_symbols_section = NULL;
link_info.gc_sym_list = NULL;
link_info.base_file = NULL;
+ link_info.emit_hash = TRUE;
+ link_info.emit_gnu_hash = FALSE;
/* SVR4 linkers seem to set DT_INIT and DT_FINI based on magic _init
and _fini symbols. We are compatible. */
link_info.init_function = "_init";
--- ld/ld.texinfo.jj 2006-06-15 14:31:06.000000000 +0200
+++ ld/ld.texinfo 2006-06-22 14:03:21.000000000 +0200
@@ -1883,6 +1883,14 @@ time it takes the linker to perform its
increasing the linker's memory requirements. Similarly reducing this
value can reduce the memory requirements at the expense of speed.

+@kindex --hash-style=@var{style}
+@item --hash-style=@var{style}
+Set the type of linker's hash table(s). @var{style} can be either
+@code{sysv} for classic ELF @code{.hash} section, @code{gnu} for
+new style GNU @code{.gnu.hash} section or @code{both} for both
+the classic ELF @code{.hash} and new style GNU @code{.gnu.hash}
+hash tables. The default is @code{sysv}.
+
@kindex --reduce-memory-overheads
@item --reduce-memory-overheads
This option reduces memory requirements at ld runtime, at the expense of
--- ld/emultempl/elf32.em.jj 2006-06-20 18:34:24.000000000 +0200
+++ ld/emultempl/elf32.em 2006-06-22 14:39:25.000000000 +0200
@@ -1719,6 +1719,7 @@ cat >>e${EMULATION_NAME}.c <<EOF
#define OPTION_GROUP (OPTION_ENABLE_NEW_DTAGS + 1)
#define OPTION_EH_FRAME_HDR (OPTION_GROUP + 1)
#define OPTION_EXCLUDE_LIBS (OPTION_EH_FRAME_HDR + 1)
+#define OPTION_HASH_STYLE (OPTION_EXCLUDE_LIBS + 1)

static void
gld${EMULATION_NAME}_add_options
@@ -1735,6 +1736,7 @@ cat >>e${EMULATION_NAME}.c <<EOF
{"enable-new-dtags", no_argument, NULL, OPTION_ENABLE_NEW_DTAGS},
{"eh-frame-hdr", no_argument, NULL, OPTION_EH_FRAME_HDR},
{"exclude-libs", required_argument, NULL, OPTION_EXCLUDE_LIBS},
+ {"hash-style", required_argument, NULL, OPTION_HASH_STYLE},
{"Bgroup", no_argument, NULL, OPTION_GROUP},
EOF
fi
@@ -1791,6 +1793,22 @@ cat >>e${EMULATION_NAME}.c <<EOF
add_excluded_libs (optarg);
break;

+ case OPTION_HASH_STYLE:
+ link_info.emit_hash = FALSE;
+ link_info.emit_gnu_hash = FALSE;
+ if (strcmp (optarg, "sysv") == 0)
+ link_info.emit_hash = TRUE;
+ else if (strcmp (optarg, "gnu") == 0)
+ link_info.emit_gnu_hash = TRUE;
+ else if (strcmp (optarg, "both") == 0)
+ {
+ link_info.emit_hash = TRUE;
+ link_info.emit_gnu_hash = TRUE;
+ }
+ else
+ einfo (_("%P%F: invalid hash style \`%s'\n"), optarg);
+ break;
+
case 'z':
if (strcmp (optarg, "initfirst") == 0)
link_info.flags_1 |= (bfd_vma) DF_1_INITFIRST;
@@ -1894,6 +1912,7 @@ cat >>e${EMULATION_NAME}.c <<EOF
fprintf (file, _(" --disable-new-dtags\tDisable new dynamic tags\n"));
fprintf (file, _(" --enable-new-dtags\tEnable new dynamic tags\n"));
fprintf (file, _(" --eh-frame-hdr\tCreate .eh_frame_hdr section\n"));
+ fprintf (file, _(" --hash-style=STYLE\tSet hash style to sysv, gnu or both\n"));
fprintf (file, _(" -z combreloc\t\tMerge dynamic relocs into one section and sort\n"));
fprintf (file, _(" -z defs\t\tReport unresolved symbols in object files.\n"));
fprintf (file, _(" -z execstack\t\tMark executable as requiring executable stack\n"));
--- bfd/elf-bfd.h.jj 2006-06-20 18:34:24.000000000 +0200
+++ bfd/elf-bfd.h 2006-06-26 16:17:53.000000000 +0200
@@ -1481,6 +1481,8 @@ extern bfd_vma _bfd_elf_section_offset

extern unsigned long bfd_elf_hash
(const char *);
+extern unsigned long bfd_elf_gnu_hash
+ (const char *);

extern bfd_reloc_status_type bfd_elf_generic_reloc
(bfd *, arelent *, asymbol *, void *, asection *, bfd *, char **);
--- bfd/elf.c.jj 2006-06-20 18:34:24.000000000 +0200
+++ bfd/elf.c 2006-06-26 16:17:28.000000000 +0200
@@ -206,6 +206,21 @@ bfd_elf_hash (const char *namearg)
return h & 0xffffffff;
}

+/* DT_GNU_HASH hash function. Do not change this function; you will
+ cause invalid hash tables to be generated. */
+
+unsigned long
+bfd_elf_gnu_hash (const char *namearg)
+{
+ const unsigned char *name = (const unsigned char *) namearg;
+ unsigned long h = 5381;
+ unsigned char ch;
+
+ while ((ch = *name++) != '\0')
+ h = (h << 5) + h + ch;
+ return h & 0xffffffff;
+}
+
bfd_boolean
bfd_elf_mkobject (bfd *abfd)
{
@@ -1239,6 +1254,7 @@ _bfd_elf_print_private_bfd_data (bfd *ab
case DT_AUXILIARY: name = "AUXILIARY"; stringp = TRUE; break;
case DT_USED: name = "USED"; break;
case DT_FILTER: name = "FILTER"; stringp = TRUE; break;
+ case DT_GNU_HASH: name = "GNU_HASH"; break;
}

fprintf (f, " %-11s ", name);
@@ -1823,6 +1839,7 @@ bfd_section_from_shdr (bfd *abfd, unsign
case SHT_FINI_ARRAY: /* .fini_array section. */
case SHT_PREINIT_ARRAY: /* .preinit_array section. */
case SHT_GNU_LIBLIST: /* .gnu.liblist section. */
+ case SHT_GNU_HASH: /* .gnu.hash section. */
return _bfd_elf_make_section_from_shdr (abfd, hdr, name, shindex);

case SHT_DYNAMIC: /* Dynamic linking information. */
@@ -2295,6 +2312,7 @@ static const struct bfd_elf_special_sect
{ ".gnu.version_r", 14, 0, SHT_GNU_verneed, 0 },
{ ".gnu.liblist", 12, 0, SHT_GNU_LIBLIST, SHF_ALLOC },
{ ".gnu.conflict", 13, 0, SHT_RELA, SHF_ALLOC },
+ { ".gnu.hash", 9, 0, SHT_GNU_HASH, SHF_ALLOC },
{ NULL, 0, 0, 0, 0 }
};

@@ -2811,6 +2829,10 @@ elf_fake_sections (bfd *abfd, asection *
case SHT_GROUP:
this_hdr->sh_entsize = 4;
break;
+
+ case SHT_GNU_HASH:
+ this_hdr->sh_entsize = 4;
+ break;
}

if ((asect->flags & SEC_ALLOC) != 0)
@@ -3256,6 +3278,7 @@ assign_section_numbers (bfd *abfd, struc
break;

case SHT_HASH:
+ case SHT_GNU_HASH:
case SHT_GNU_versym:
/* sh_link is the section header index of the symbol table
this hash table or version table is for. */
--- bfd/elflink.c.jj 2006-06-20 18:34:53.000000000 +0200
+++ bfd/elflink.c 2006-07-03 18:26:47.000000000 +0200
@@ -240,12 +240,24 @@ _bfd_elf_link_create_dynamic_sections (b
if (!_bfd_elf_define_linkage_sym (abfd, info, s, "_DYNAMIC"))
return FALSE;

- s = bfd_make_section_with_flags (abfd, ".hash",
- flags | SEC_READONLY);
- if (s == NULL
- || ! bfd_set_section_alignment (abfd, s, bed->s->log_file_align))
- return FALSE;
- elf_section_data (s)->this_hdr.sh_entsize = bed->s->sizeof_hash_entry;
+ if (info->emit_hash)
+ {
+ s = bfd_make_section_with_flags (abfd, ".hash", flags | SEC_READONLY);
+ if (s == NULL
+ || ! bfd_set_section_alignment (abfd, s, bed->s->log_file_align))
+ return FALSE;
+ elf_section_data (s)->this_hdr.sh_entsize = bed->s->sizeof_hash_entry;
+ }
+
+ if (info->emit_gnu_hash)
+ {
+ s = bfd_make_section_with_flags (abfd, ".gnu.hash",
+ flags | SEC_READONLY);
+ if (s == NULL
+ || ! bfd_set_section_alignment (abfd, s, bed->s->log_file_align))
+ return FALSE;
+ elf_section_data (s)->this_hdr.sh_entsize = 4;
+ }

/* Let the backend create the rest of the sections. This lets the
backend set the right flags. The backend will normally create
@@ -4811,6 +4823,118 @@ elf_collect_hash_codes (struct elf_link_
return TRUE;
}

+struct collect_gnu_hash_codes
+{
+ bfd *output_bfd;
+ unsigned long int nsyms;
+ unsigned long int *hashcodes;
+ unsigned long int *hashval;
+ unsigned long int *indx;
+ unsigned long int *counts;
+ bfd_byte *contents;
+ long int min_dynindx;
+ unsigned long int bucketcount;
+ unsigned long int symindx;
+ long int local_indx;
+};
+
+/* This function will be called through elf_link_hash_traverse to store
+ the hash values of all exported symbols in an array. */
+
+static bfd_boolean
+elf_collect_gnu_hash_codes (struct elf_link_hash_entry *h, void *data)
+{
+ struct collect_gnu_hash_codes *s = data;
+ const char *name;
+ char *p;
+ unsigned long ha;
+ char *alc = NULL;
+
+ if (h->root.type == bfd_link_hash_warning)
+ h = (struct elf_link_hash_entry *) h->root.u.i.link;
+
+ /* Ignore indirect symbols. These are added by the versioning code. */
+ if (h->dynindx == -1)
+ return TRUE;
+
+ /* Ignore also local symbols and undefined symbols. */
+ if (h->forced_local
+ || h->root.type == bfd_link_hash_undefined
+ || h->root.type == bfd_link_hash_undefweak
+ || ((h->root.type == bfd_link_hash_defined
+ || h->root.type == bfd_link_hash_defweak)
+ && h->root.u.def.section->output_section == NULL))
+ return TRUE;
+
+ name = h->root.root.string;
+ p = strchr (name, ELF_VER_CHR);
+ if (p != NULL)
+ {
+ alc = bfd_malloc (p - name + 1);
+ memcpy (alc, name, p - name);
+ alc[p - name] = '\0';
+ name = alc;
+ }
+
+ /* Compute the hash value. */
+ ha = bfd_elf_gnu_hash (name);
+
+ /* Store the found hash value in the array for compute_bucket_count,
+ and also for .dynsym reordering purposes. */
+ s->hashcodes[s->nsyms] = ha;
+ s->hashval[h->dynindx] = ha;
+ ++s->nsyms;
+ if (s->min_dynindx < 0 || s->min_dynindx > h->dynindx)
+ s->min_dynindx = h->dynindx;
+
+ if (alc != NULL)
+ free (alc);
+
+ return TRUE;
+}
+
+/* This function will be called through elf_link_hash_traverse to do
+ final dynamic symbol renumbering. */
+
+static bfd_boolean
+elf_renumber_gnu_hash_syms (struct elf_link_hash_entry *h, void *data)
+{
+ struct collect_gnu_hash_codes *s = data;
+ unsigned long int bucket;
+ unsigned long int val;
+
+ if (h->root.type == bfd_link_hash_warning)
+ h = (struct elf_link_hash_entry *) h->root.u.i.link;
+
+ /* Ignore indirect symbols. */
+ if (h->dynindx == -1)
+ return TRUE;
+
+ /* Ignore also local symbols and undefined symbols. */
+ if (h->forced_local
+ || h->root.type == bfd_link_hash_undefined
+ || h->root.type == bfd_link_hash_undefweak
+ || ((h->root.type == bfd_link_hash_defined
+ || h->root.type == bfd_link_hash_defweak)
+ && h->root.u.def.section->output_section == NULL))
+ {
+ if (h->dynindx >= s->min_dynindx)
+ h->dynindx = s->local_indx++;
+ return TRUE;
+ }
+
+ bucket = s->hashval[h->dynindx] % s->bucketcount;
+ val = s->hashval[h->dynindx] & ~(unsigned long int) 1;
+ if (s->counts[bucket] == 1)
+ /* Last element terminates the chain. */
+ val |= 1;
+ bfd_put_32 (s->output_bfd, val,
+ s->contents + (s->indx[bucket] - s->symindx) * 4);
+ --s->counts[bucket];
+ h->dynindx = s->indx[bucket]++;
+ return TRUE;
+}
+
/* Array used to determine the number of hash table buckets to use
based on the number of symbols there are. If there are fewer than
3 symbols we use 1 bucket, fewer than 17 symbols we use 3 buckets,
@@ -4832,42 +4956,26 @@ static const size_t elf_buckets[] =
Therefore the result is always a good payoff between few collisions
(= short chain lengths) and table size. */
static size_t
-compute_bucket_count (struct bfd_link_info *info)
+compute_bucket_count (struct bfd_link_info *info, unsigned long int *hashcodes,
+ unsigned long int nsyms)
{
size_t dynsymcount = elf_hash_table (info)->dynsymcount;
size_t best_size = 0;
- unsigned long int *hashcodes;
- unsigned long int *hashcodesp;
unsigned long int i;
bfd_size_type amt;

- /* Compute the hash values for all exported symbols. At the same
- time store the values in an array so that we could use them for
- optimizations. */
- amt = dynsymcount;
- amt *= sizeof (unsigned long int);
- hashcodes = bfd_malloc (amt);
- if (hashcodes == NULL)
- return 0;
- hashcodesp = hashcodes;
-
- /* Put all hash values in HASHCODES. */
- elf_link_hash_traverse (elf_hash_table (info),
- elf_collect_hash_codes, &hashcodesp);
-
/* We have a problem here. The following code to optimize the table
size requires an integer type with more the 32 bits. If
BFD_HOST_U_64_BIT is set we know about such a type. */
#ifdef BFD_HOST_U_64_BIT
if (info->optimize)
{
- unsigned long int nsyms = hashcodesp - hashcodes;
size_t minsize;
size_t maxsize;
BFD_HOST_U_64_BIT best_chlen = ~((BFD_HOST_U_64_BIT) 0);
- unsigned long int *counts ;
bfd *dynobj = elf_hash_table (info)->dynobj;
const struct elf_backend_data *bed = get_elf_backend_data (dynobj);
+ unsigned long int *counts;

/* Possible optimization parameters: if we have NSYMS symbols we say
that the hashing table must at least have NSYMS/4 and at most
@@ -4883,10 +4991,7 @@ compute_bucket_count (struct bfd_link_in
amt *= sizeof (unsigned long int);
counts = bfd_malloc (amt);
if (counts == NULL)
- {
- free (hashcodes);
- return 0;
- }
+ return 0;

/* Compute the "optimal" size for the hash table. The criteria is a
minimal chain length. The minor criteria is (of course) the size
@@ -4913,9 +5018,9 @@ compute_bucket_count (struct bfd_link_in
# define BFD_TARGET_PAGESIZE (4096)
# endif

- /* We in any case need 2 + NSYMS entries for the size values and
- the chains. */
- max = (2 + nsyms) * (bed->s->arch_size / 8);
+ /* We in any case need 2 + DYNSYMCOUNT entries for the size values
+ and the chains. */
+ max = (2 + dynsymcount) * bed->s->sizeof_hash_entry;

# if 1
/* Variant 1: optimize for short chains. We add the squares
@@ -4925,7 +5030,7 @@ compute_bucket_count (struct bfd_link_in
max += counts[j] * counts[j];

/* This adds penalties for the overall size of the table. */
- fact = i / (BFD_TARGET_PAGESIZE / (bed->s->arch_size / 8)) + 1;
+ fact = i / (BFD_TARGET_PAGESIZE / bed->s->sizeof_hash_entry) + 1;
max *= fact * fact;
# else
/* Variant 2: Optimize a lot more for small table. Here we
@@ -4936,7 +5041,7 @@ compute_bucket_count (struct bfd_link_in

/* The overall size of the table is considered, but not as
strong as in variant 1, where it is squared. */
- fact = i / (BFD_TARGET_PAGESIZE / (bed->s->arch_size / 8)) + 1;
+ fact = i / (BFD_TARGET_PAGESIZE / bed->s->sizeof_hash_entry) + 1;
max *= fact;
# endif

@@ -4959,14 +5064,11 @@ compute_bucket_count (struct bfd_link_in
for (i = 0; elf_buckets[i] != 0; i++)
{
best_size = elf_buckets[i];
- if (dynsymcount < elf_buckets[i + 1])
+ if (nsyms < elf_buckets[i + 1])
break;
}
}

- /* Free the arrays we needed. */
- free (hashcodes);
-
return best_size;
}

@@ -5324,7 +5426,10 @@ bfd_elf_size_dynamic_sections (bfd *outp
bfd_size_type strsize;

strsize = _bfd_elf_strtab_size (elf_hash_table (info)->dynstr);
- if (!_bfd_elf_add_dynamic_entry (info, DT_HASH, 0)
+ if ((info->emit_hash
+ && !_bfd_elf_add_dynamic_entry (info, DT_HASH, 0))
+ || (info->emit_gnu_hash
+ && !_bfd_elf_add_dynamic_entry (info, DT_GNU_HASH, 0))
|| !_bfd_elf_add_dynamic_entry (info, DT_STRTAB, 0)
|| !_bfd_elf_add_dynamic_entry (info, DT_SYMTAB, 0)
|| !_bfd_elf_add_dynamic_entry (info, DT_STRSZ, strsize)
@@ -5726,8 +5831,6 @@ bfd_elf_size_dynsym_hash_dynstr (bfd *ou
asection *s;
bfd_size_type dynsymcount;
unsigned long section_sym_count;
- size_t bucketcount = 0;
- size_t hash_entry_size;
unsigned int dtagcount;

dynobj = elf_hash_table (info)->dynobj;
@@ -5778,23 +5881,150 @@ bfd_elf_size_dynsym_hash_dynstr (bfd *ou
memset (s->contents, 0, section_sym_count * bed->s->sizeof_sym);
}

+ elf_hash_table (info)->bucketcount = 0;
+
/* Compute the size of the hashing table. As a side effect this
computes the hash values for all the names we export. */
- bucketcount = compute_bucket_count (info);
+ if (info->emit_hash)
+ {
+ unsigned long int *hashcodes;
+ unsigned long int *hashcodesp;
+ bfd_size_type amt;
+ unsigned long int nsyms;
+ size_t bucketcount;
+ size_t hash_entry_size;
+
+ /* Compute the hash values for all exported symbols. At the same
+ time store the values in an array so that we could use them for
+ optimizations. */
+ amt = dynsymcount * sizeof (unsigned long int);
+ hashcodes = bfd_malloc (amt);
+ if (hashcodes == NULL)
+ return FALSE;
+ hashcodesp = hashcodes;

- s = bfd_get_section_by_name (dynobj, ".hash");
- BFD_ASSERT (s != NULL);
- hash_entry_size = elf_section_data (s)->this_hdr.sh_entsize;
- s->size = ((2 + bucketcount + dynsymcount) * hash_entry_size);
- s->contents = bfd_zalloc (output_bfd, s->size);
- if (s->contents == NULL)
- return FALSE;
+ /* Put all hash values in HASHCODES. */
+ elf_link_hash_traverse (elf_hash_table (info),
+ elf_collect_hash_codes, &hashcodesp);
+
+ nsyms = hashcodesp - hashcodes;
+ bucketcount
+ = compute_bucket_count (info, hashcodes, nsyms);
+ free (hashcodes);
+
+ if (bucketcount == 0)
+ return FALSE;
+
+ elf_hash_table (info)->bucketcount = bucketcount;
+
+ s = bfd_get_section_by_name (dynobj, ".hash");
+ BFD_ASSERT (s != NULL);
+ hash_entry_size = elf_section_data (s)->this_hdr.sh_entsize;
+ s->size = ((2 + bucketcount + dynsymcount) * hash_entry_size);
+ s->contents = bfd_zalloc (output_bfd, s->size);
+ if (s->contents == NULL)
+ return FALSE;

- bfd_put (8 * hash_entry_size, output_bfd, bucketcount, s->contents);
- bfd_put (8 * hash_entry_size, output_bfd, dynsymcount,
- s->contents + hash_entry_size);
+ bfd_put (8 * hash_entry_size, output_bfd, bucketcount, s->contents);
+ bfd_put (8 * hash_entry_size, output_bfd, dynsymcount,
+ s->contents + hash_entry_size);
+ }
+
+ if (info->emit_gnu_hash)
+ {
+ size_t i, cnt;
+ unsigned char *contents;
+ struct collect_gnu_hash_codes cinfo;
+ bfd_size_type amt;
+ size_t bucketcount;
+
+ memset (&cinfo, 0, sizeof (cinfo));

- elf_hash_table (info)->bucketcount = bucketcount;
+ /* Compute the hash values for all exported symbols. At the same
+ time store the values in an array so that we could use them for
+ optimizations. */
+ amt = dynsymcount * 2 * sizeof (unsigned long int);
+ cinfo.hashcodes = bfd_malloc (amt);
+ if (cinfo.hashcodes == NULL)
+ return FALSE;
+
+ cinfo.hashval = cinfo.hashcodes + dynsymcount;
+ cinfo.min_dynindx = -1;
+ cinfo.output_bfd = output_bfd;
+
+ /* Put all hash values in HASHCODES. */
+ elf_link_hash_traverse (elf_hash_table (info),
+ elf_collect_gnu_hash_codes, &cinfo);
+
+ bucketcount
+ = compute_bucket_count (info, cinfo.hashcodes, cinfo.nsyms);
+
+ if (bucketcount == 0)
+ {
+ free (cinfo.hashcodes);
+ return FALSE;
+ }
+
+ amt = bucketcount * sizeof (unsigned long int) * 2;
+ cinfo.counts = bfd_malloc (amt);
+ if (cinfo.counts == NULL)
+ {
+ free (cinfo.hashcodes);
+ return FALSE;
+ }
+
+ /* Determine how often each hash bucket is used. */
+ memset (cinfo.counts, 0, bucketcount * sizeof (cinfo.counts[0]));
+ for (i = 0; i < cinfo.nsyms; ++i)
+ ++cinfo.counts[cinfo.hashcodes[i] % bucketcount];
+
+ s = bfd_get_section_by_name (dynobj, ".gnu.hash");
+ BFD_ASSERT (s != NULL);
+ cinfo.indx = cinfo.counts + bucketcount;
+ cinfo.symindx = dynsymcount - cinfo.nsyms;
+ for (i = 0, cnt = cinfo.symindx; i < bucketcount; ++i)
+ if (cinfo.counts[i] != 0)
+ {
+ cinfo.indx[i] = cnt;
+ cnt += cinfo.counts[i];
+ }
+ BFD_ASSERT (cnt == dynsymcount);
+ cinfo.bucketcount = bucketcount;
+ cinfo.local_indx = cinfo.min_dynindx;
+
+ s->size = (2 + bucketcount + cinfo.nsyms) * 4;
+ contents = bfd_zalloc (output_bfd, s->size);
+ if (contents == NULL)
+ {
+ free (cinfo.counts);
+ free (cinfo.hashcodes);
+ return FALSE;
+ }
+
+ s->contents = contents;
+ bfd_put_32 (output_bfd, bucketcount, contents);
+ bfd_put_32 (output_bfd, cinfo.symindx, contents + 4);
+ contents += 8;
+
+ for (i = 0; i < bucketcount; ++i)
+ {
+ if (cinfo.counts[i] == 0)
+ bfd_put_32 (output_bfd, ~0, contents);
+ else
+ bfd_put_32 (output_bfd, cinfo.indx[i] - cinfo.symindx,
+ contents);
+ contents += 4;
+ }
+
+ cinfo.contents = contents;
+
+ /* Renumber dynamic symbols, populate .gnu.hash section. */
+ elf_link_hash_traverse (elf_hash_table (info),
+ elf_renumber_gnu_hash_syms, &cinfo);
+
+ free (cinfo.counts);
+ free (cinfo.hashcodes);
+ }

s = bfd_get_section_by_name (dynobj, ".dynstr");
BFD_ASSERT (s != NULL);
@@ -6663,9 +6893,6 @@ elf_link_output_extsym (struct elf_link_
{
size_t bucketcount;
size_t bucket;
- size_t hash_entry_size;
- bfd_byte *bucketpos;
- bfd_vma chain;
bfd_byte *esym;

sym.st_name = h->dynstr_index;
@@ -6679,15 +6906,23 @@ elf_link_output_extsym (struct elf_link_

bucketcount = elf_hash_table (finfo->info)->bucketcount;
bucket = h->u.elf_hash_value % bucketcount;
- hash_entry_size
- = elf_section_data (finfo->hash_sec)->this_hdr.sh_entsize;
- bucketpos = ((bfd_byte *) finfo->hash_sec->contents
- + (bucket + 2) * hash_entry_size);
- chain = bfd_get (8 * hash_entry_size, finfo->output_bfd, bucketpos);
- bfd_put (8 * hash_entry_size, finfo->output_bfd, h->dynindx, bucketpos);
- bfd_put (8 * hash_entry_size, finfo->output_bfd, chain,
- ((bfd_byte *) finfo->hash_sec->contents
- + (bucketcount + 2 + h->dynindx) * hash_entry_size));
+
+ if (finfo->hash_sec != NULL)
+ {
+ size_t hash_entry_size;
+ bfd_byte *bucketpos;
+ bfd_vma chain;
+
+ hash_entry_size
+ = elf_section_data (finfo->hash_sec)->this_hdr.sh_entsize;
+ bucketpos = ((bfd_byte *) finfo->hash_sec->contents
+ + (bucket + 2) * hash_entry_size);
+ chain = bfd_get (8 * hash_entry_size, finfo->output_bfd, bucketpos);
+ bfd_put (8 * hash_entry_size, finfo->output_bfd, h->dynindx, bucketpos);
+ bfd_put (8 * hash_entry_size, finfo->output_bfd, chain,
+ ((bfd_byte *) finfo->hash_sec->contents
+ + (bucketcount + 2 + h->dynindx) * hash_entry_size));
+ }

if (finfo->symver_sec != NULL && finfo->symver_sec->contents != NULL)
{
@@ -7861,7 +8096,7 @@ bfd_elf_final_link (bfd *abfd, struct bf
{
finfo.dynsym_sec = bfd_get_section_by_name (dynobj, ".dynsym");
finfo.hash_sec = bfd_get_section_by_name (dynobj, ".hash");
- BFD_ASSERT (finfo.dynsym_sec != NULL && finfo.hash_sec != NULL);
+ BFD_ASSERT (finfo.dynsym_sec != NULL);
finfo.symver_sec = bfd_get_section_by_name (dynobj, ".gnu.version");
/* Note that it is OK if symver_sec is NULL. */
}
@@ -8621,6 +8856,9 @@ bfd_elf_final_link (bfd *abfd, struct bf
case DT_HASH:
name = ".hash";
goto get_vma;
+ case DT_GNU_HASH:
+ name = ".gnu.hash";
+ goto get_vma;
case DT_STRTAB:
name = ".dynstr";
goto get_vma;
--- include/elf/common.h.jj 2006-02-17 15:36:26.000000000 +0100
+++ include/elf/common.h 2006-06-22 10:43:21.000000000 +0200
@@ -338,6 +338,7 @@
#define SHT_LOOS 0x60000000 /* First of OS specific semantics */
#define SHT_HIOS 0x6fffffff /* Last of OS specific semantics */

+#define SHT_GNU_HASH 0x6ffffff6 /* GNU style symbol hash table */
#define SHT_GNU_LIBLIST 0x6ffffff7 /* List of prelink dependencies */

/* The next three section types are defined by Solaris, and are named
@@ -577,6 +578,7 @@
#define DT_VALRNGHI 0x6ffffdff

#define DT_ADDRRNGLO 0x6ffffe00
+#define DT_GNU_HASH 0x6ffffef5
#define DT_TLSDESC_PLT 0x6ffffef6
#define DT_TLSDESC_GOT 0x6ffffef7
#define DT_GNU_CONFLICT 0x6ffffef8
--- include/bfdlink.h.jj 2006-04-07 17:17:29.000000000 +0200
+++ include/bfdlink.h 2006-06-22 11:11:20.000000000 +0200
@@ -324,6 +324,12 @@ struct bfd_link_info
/* TRUE if unreferenced sections should be removed. */
unsigned int gc_sections: 1;

+ /* TRUE if .hash section should be created. */
+ unsigned int emit_hash: 1;
+
+ /* TRUE if .gnu.hash section should be created. */
+ unsigned int emit_gnu_hash: 1;
+
/* What to do with unresolved symbols in an object file.
When producing executables the default is GENERATE_ERROR.
When producing shared libraries the default is IGNORE. The
--- binutils/readelf.c.jj 2006-05-30 16:13:54.000000000 +0200
+++ binutils/readelf.c 2006-07-03 19:30:54.000000000 +0200
@@ -135,6 +135,7 @@ static unsigned long dynamic_syminfo_off
static unsigned int dynamic_syminfo_nent;
static char program_interpreter[64];
static bfd_vma dynamic_info[DT_JMPREL + 1];
+static bfd_vma dynamic_info_DT_GNU_HASH;
static bfd_vma version_info[16];
static Elf_Internal_Ehdr elf_header;
static Elf_Internal_Shdr *section_headers;
@@ -1501,6 +1502,7 @@ get_dynamic_type (unsigned long type)
case DT_GNU_CONFLICTSZ: return "GNU_CONFLICTSZ";
case DT_GNU_LIBLIST: return "GNU_LIBLIST";
case DT_GNU_LIBLISTSZ: return "GNU_LIBLISTSZ";
+ case DT_GNU_HASH: return "GNU_HASH";

default:
if ((type >= DT_LOPROC) && (type <= DT_HIPROC))
@@ -2571,6 +2573,7 @@ get_section_type_name (unsigned int sh_t
case SHT_INIT_ARRAY: return "INIT_ARRAY";
case SHT_FINI_ARRAY: return "FINI_ARRAY";
case SHT_PREINIT_ARRAY: return "PREINIT_ARRAY";
+ case SHT_GNU_HASH: return "GNU_HASH";
case SHT_GROUP: return "GROUP";
case SHT_SYMTAB_SHNDX: return "SYMTAB SECTION INDICIES";
case SHT_GNU_verdef: return "VERDEF";
@@ -6228,6 +6231,15 @@ process_dynamic_section (FILE *file)
}
break;

+ case DT_GNU_HASH:
+ dynamic_info_DT_GNU_HASH = entry->d_un.d_val;
+ if (do_dynamic)
+ {
+ print_vma (entry->d_un.d_val, PREFIX_HEX);
+ putchar ('\n');
+ }
+ break;
+
default:
if ((entry->d_tag >= DT_VERSYM) && (entry->d_tag <= DT_VERNEEDNUM))
version_info[DT_VERSIONTAGIDX (entry->d_tag)] =
@@ -6903,6 +6915,9 @@ process_symbol_table (FILE *file)
bfd_vma nchains = 0;
bfd_vma *buckets = NULL;
bfd_vma *chains = NULL;
+ bfd_vma ngnubuckets = 0;
+ bfd_vma *gnubuckets = NULL;
+ bfd_vma *gnuchains = NULL;

if (! do_syms && !do_histogram)
return 1;
@@ -7282,6 +7297,145 @@ process_symbol_table (FILE *file)
free (chains);
}

+ if (do_histogram && dynamic_info_DT_GNU_HASH)
+ {
+ unsigned char nb[8];
+ bfd_vma i, maxchain = 0xffffffff;
+ unsigned long *lengths;
+ unsigned long *counts;
+ unsigned long hn;
+ unsigned long maxlength = 0;
+ unsigned long nzero_counts = 0;
+ unsigned long nsyms = 0;
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file, dynamic_info_DT_GNU_HASH,
+ sizeof nb)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ if (fread (nb, 8, 1, file) != 1)
+ {
+ error (_("Failed to read in number of buckets\n"));
+ return 0;
+ }
+
+ ngnubuckets = byte_get (nb, 4);
+
+ gnubuckets = get_dynamic_data (file, ngnubuckets, 4);
+
+ if (gnubuckets == NULL)
+ return 0;
+
+ for (i = 0; i < ngnubuckets; i++)
+ if (gnubuckets[i] != 0xffffffff
+ && (maxchain == 0xffffffff || gnubuckets[i] > maxchain))
+ maxchain = gnubuckets[i];
+
+ if (maxchain == 0xffffffff)
+ return 0;
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file,
+ dynamic_info_DT_GNU_HASH
+ + 4 * (2 + ngnubuckets + maxchain),
+ sizeof nb)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ do
+ {
+ if (fread (nb, 4, 1, file) != 1)
+ {
+ error (_("Failed to determine last chain length\n"));
+ return 0;
+ }
+
+ if (maxchain + 1 == 0)
+ return 0;
+
+ ++maxchain;
+ }
+ while ((byte_get (nb, 4) & 1) == 0);
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file,
+ dynamic_info_DT_GNU_HASH
+ + 4 * (2 + ngnubuckets), sizeof nb)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ gnuchains = get_dynamic_data (file, maxchain, 4);
+
+ if (gnuchains == NULL)
+ return 0;
+
+ lengths = calloc (ngnubuckets, sizeof (*lengths));
+ if (lengths == NULL)
+ {
+ error (_("Out of memory"));
+ return 0;
+ }
+
+ printf (_("\nHistogram for `.gnu.hash' bucket list length (total of %lu buckets):\n"),
+ (unsigned long) ngnubuckets);
+ printf (_(" Length Number %% of total Coverage\n"));
+
+ for (hn = 0; hn < ngnubuckets; ++hn)
+ if (gnubuckets[hn] != 0xffffffff)
+ {
+ bfd_vma off, length = 1;
+
+ for (off = gnubuckets[hn]; (gnuchains[off] & 1) == 0; ++off)
+ ++length;
+ lengths[hn] = length;
+ if (length > maxlength)
+ maxlength = length;
+ nsyms += length;
+ }
+
+ counts = calloc (maxlength + 1, sizeof (*counts));
+ if (counts == NULL)
+ {
+ error (_("Out of memory"));
+ return 0;
+ }
+
+ for (hn = 0; hn < ngnubuckets; ++hn)
+ ++counts[lengths[hn]];
+
+ if (ngnubuckets > 0)
+ {
+ unsigned long j;
+ printf (" 0 %-10lu (%5.1f%%)\n",
+ counts[0], (counts[0] * 100.0) / ngnubuckets);
+ for (j = 1; j <= maxlength; ++j)
+ {
+ nzero_counts += counts[j] * j;
+ printf ("%7lu %-10lu (%5.1f%%) %5.1f%%\n",
+ j, counts[j], (counts[j] * 100.0) / ngnubuckets,
+ (nzero_counts * 100.0) / nsyms);
+ }
+ }
+
+ free (counts);
+ free (lengths);
+ free (gnubuckets);
+ free (gnuchains);
+ }
+
return 1;
}



Jakub
Ulrich Drepper
2006-06-30 19:36:18 UTC
Permalink
Post by Michael Meeks
and here was I thinking I was going to have to write a lengthy
paper to try to convince Ulrich :-) You made my month (really).
What are you talking about? I never said that I'm opposed to changing
the hash table. I'm not going to agree to direct bindings, that's all,
and you constantly insisted on bringing that nonsense up.
Post by Michael Meeks
Anyhow - I've been thinking about various other optimisations, many of
which we can easily & neatly fit into this nice framework; I've numbered them:
1. Extensibility
+ I want to see a multiply per lookup.
+ ie. since the L2 cache misses are what hammer us -
if [sic] we want to extend the .chain -Bdirect
support in future - being able to put that in
the same record is really important.
+ a 'record size' field gives us that with ~no
bloat as of now.
You absolutely cannot add direct bindings as an optional feature.
Period. A binary using direct bindings can and often will have
different effective bindings than a binary which uses normal lookup.
It's completely impossible. A program can either be compiled and even
_designed_ for direct binding or not.

Adding support in any form for such an incompatible feature to the hash
table which is purely an optional extension is unacceptable.
Post by Michael Meeks
+ Since we can sort (almost all) of .dynsym -
there is no problem ensuring that ~all symbols
that we don't want in .hash occur at the end
of the .dynsym table [ cf. the next UNDEF point ].
This can only possibly work if you have spare bits...
Post by Michael Meeks
+ Furthermore - by inspection it can be seen that
1/2^32 ~=== 1/2^30 [1]
+ using this theorem, we can easily steal
1/2 bits from the top of the stored hash
and use them as chain terminators.
... and this is nonsense. You cannot compute the probabilities by
looking at the complete set of symbols. You have to determine which
kind of symbols have the same hash value after the change and then make
sure the probability of having those is the same as in the complete
set. This (most) likely is not the case.

Instead what will happen is that the symbols which are used together in
a program have a higher likelihood to collide (common prefixes etc). If
even the reduction from 32 to 31 bits produces on the full set a factor
of >2 more duplicates (Jakub tested it), this number is likely higher in
the real world.

On the other hand, what is gained? You already said we have rather
short hash chains. Most likely not longer than 5 or 6. I can count the
number of DSOs with longer chains on my hands and there is not more than
one longer chain in each DSO. Now do the math: even assuming an even
distribution of the chains, we have on average 6 chain elements per cache
line. If the linker sorts them into cache lines, we can fit in 14 (16
words per 64-byte line minus the two header words). That's enough even
for the worst chain length.

So, on the one hand you save one 4-byte word which in almost no case
will cause an additional cache line to be loaded. And even if it would,
the cache lines are consecutive and the processor's prefetching takes
care of it. On the other hand you more than double the likelihood of
having to compare strings. This is IMO reason enough to not change the
format.
Post by Michael Meeks
+ it is incredible to me that UNDEF symbols are
You didn't read what Jakub wrote. We don't hash UNDEF records in the
new format.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Jakub Jelinek
2006-07-03 09:25:34 UTC
Permalink
Post by Michael Meeks
1. Extensibility
+ I want to see a multiply per lookup.
+ ie. since the L2 cache misses are what hammer us -
if [sic] we want to extend the .chain -Bdirect
support in future - being able to put that in
the same record is really important.
+ a 'record size' field gives us that with ~no
bloat as of now.
+ [ and I really think some form of direct linking
is the future - but we can discuss that later
clearly ]
+ ie. having an extensibility story that is better
than 'just add a new section [ somewhere miles
away in memory/not-in-cache ] is not so fun.
On the other hand, it is IMHO much better to keep using one section
for one purpose, as ELF has always been doing. So, use a hash section
for mapping a symbol name to a set of indexes of dynamic symbol definition
candidates, not less, not more. That's what DT_HASH has been doing and
DT_GNU_HASH is doing too. Direct linking is mainly useful for OSes where
there is one entity controlling the whole thing; in the Open Source world
it can IMHO work only in very limited cases, for programs which have the
vast majority of dependent libraries coming from the same source, with the
same schedule, with bugfixes leading to shipping a new version of the whole
thing. If the libraries are instead updated independently, without one
body controlling them all, symbols keep being added and removed (for C the
latter would be an ABI break, but for C++ it often happens just when you
change the internal implementation of some function/method and it no
longer references some function/method which suddenly does not need to be
compiled in), and direct linking will just lead to horrible ODR and
pointer equality violations that programs will sooner or later break on.
So for direct linking, which certainly won't be used for all programs but
perhaps just a very limited set, you really should use a separate section.
Post by Michael Meeks
+ Furthermore - by inspection it can be seen that
1/2^32 ~=== 1/2^30 [1]
+ using this theorem, we can easily steal
1/2 bits from the top of the stored hash
and use them as chain terminators.
.hash[symhash % nbuckets] -> .chain[5]
.chain[5] -> [ (not-end) | hashval[5] & 0x3fff ]
.chain[6] -> [ (not-end) | hashval[6] & 0x3fff ]
.chain[7] -> [ (is-end) | hashval[7] & 0x3fff ]
+ that saves us several bytes per chain,
and cf. above, no measurable loss in
performance. [ of course chain[5]
<-> .dynsym[5] here as before ]
+ in fact, for extensibility a per-shlib hash
comparison 'mask' field would make for a rather
better behaviour - so we can snarf bits from
here in future if we come up with brighter ideas.
Applications keep growing every year and while I currently counted
~ 500000 different symbol names, it can soon double, triple and keep
increasing. So, while we have ATM just a few collisions, there can be
more and more in the future. And handling those is expensive.
And all the linker can do for this is try to keep all the chains short
enough to fit into cachelines.

Jakub
Mike Hearn
2006-07-13 20:59:01 UTC
Permalink
Hi Jakub,
Post by Jakub Jelinek
Direct linking is mainly useful for OSes where
there is one entity controlling the whole thing
I don't understand why you say this. Programs on Windows and MacOS
routinely use libraries and frameworks outside of those provided by
Microsoft/Apple and nothing breaks. Direct linking gives developers the
semantics they expect from shared libraries - it is not intuitive that you
must use symbol versioning, visibility switches etc when building a
library or application to avoid subtle conflicts in future.

This is especially the case as symbol versioning simply moves the problem
around and does not solve the underlying cause of confusion (that symbols
are looked up in global scope).
Post by Jakub Jelinek
If the libraries on the other side are updated independently,
without one body controlling them all, symbols keep being added and removed
(for C the latter would be an ABI break, but for C++ the latter often
happens just when you change the internal implementation of some
function/method and it no longer references some function/method which suddenly
does not need to be compiled in) and direct linking just will lead to
horrible ODR and pointer equality violations that programs will sooner or
later break on.
Michael Meeks' -Bdirect patches use regular scope for vague symbols and
seem to take account of this; they also seem to work for OpenOffice, which
must be the largest C++ program we have on Linux. Why is this approach not
enough?
Post by Jakub Jelinek
So for direct linking which certainly won't be used for
all programs, but perhaps just a very limited set, you really should use
a separate section.
Lack of direct linking has caused much confusion amongst regular
programmers on the street, especially those writing cross platform
software. It is not intuitive to people that (say) a
C++ program that uses libstdc++.so.5 can crash and burn on systems where
Chinese input methods are used due to mis-linked symbols. Windows uses it,
Solaris uses it, I'm pretty sure MacOS X uses it too, yet usage of C++ is
very high on these platforms. So how come it won't work for us?

thanks -mike

djamel anonymous
2006-06-30 23:55:11 UTC
Permalink
Hello, first excuse my poor English. I am writing this message despite the
fact that I don't know much about dynamic linking. I have understood that
this thread is about static hashing of strings.
My suggestion is to use sparse hashing, which would further improve the
speed of unsuccessful searches. One possibility is to add a second hash
table of size nbuckets*8 bits, which will occupy nbuckets bytes, where
nbuckets is the size of the first hash table.
This second table, which will have a quarter of the size of the first one,
will be a simple array of bits: a set bit means that the bucket corresponds
to a symbol, a zero bit means that the bucket doesn't correspond to a
symbol. If nbuckets is equal to the number of symbols, an unsuccessful
search will access the first hash table with probability at most 1/8. To
apply this addition, the formula for computing the bucket number for the
first hash table will be (hash_value%(nbuckets*8))/8, and the bucket in the
second hash table will be computed with the formula hash_value%(nbuckets*8).
By using this additional hash table, which will have a marginal size
compared to the first hash table, the cost of an unsuccessful search will
probably decrease further. If the search has a high probability of being
successful (for example when looking for a symbol that we know is present),
the second table could simply be ignored and the search could begin
directly with the first hash table.
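
A minimal sketch of this probe order (illustrative names only, not from
any posted patch; assumes the 0xffffffff empty-bucket convention of the
.gnu.hash section):

#include <stdint.h>

static inline int
might_be_defined (const uint32_t *bit_table, const uint32_t *buckets,
                  uint32_t nbuckets, uint32_t hash)
{
  uint32_t vb = hash % (nbuckets * 8);     /* bucket in the bit table */
  if ((bit_table[vb >> 5] & (1u << (vb & 31))) == 0)
    return 0;                              /* filtered out immediately */
  /* Otherwise fall back to the existing table; its bucket is vb / 8,
     per the formulas above (0xffffffff marks an empty bucket).  */
  return buckets[vb / 8] != 0xffffffff;
}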
best regards

djamel anonymous
2006-07-03 06:38:09 UTC
Permalink
Hello. After the last post on the mailing list I found that my suggestion
was too complicated and suboptimal. I think that I have found a simple and
efficient method to implement the additional sparse hash table. The
proposed hash table is of size approximately m*8 bits, where m is the
nearest power of two to nsymbols.
For example, for nsymbols=3071 m will be 2048, and for nsymbols=3072 m
will be 4096. The index into the hash table will simply be hash%(m*8),
which will be computed quickly because m*8 is a power of two.
A bit of index x in the hash table will be zero if and only if no symbol
has hash%(m*8)==x.
The size of the hash table will be between 5.3333*nsymbols and
10.6666*nsymbols bits, which is a very small space overhead.
To look for a symbol with hash value x, we have to compute i=x%(m*8), then
check whether bit i is set or not. If it is cleared then the symbol is not
present, otherwise the search continues in the hash table that is
currently implemented. I think this suggestion, which in the literature is
called a Bloom filter with parameter k=1, has several advantages:
1- it has a very small space overhead: between 5.3333*nsymbols bits and
10.6666*nsymbols bits.
2- it is very easy to implement.
3- it will incur very few cache misses: if the search is unsuccessful,
there will be only one memory access in approximately 7 cases out of 8.
4- the index into the hash table is very cheap to compute, as it is done
with a binary AND, compared to a real modulo, which is very expensive; on
an Athlon or Pentium it takes between 20 and 40 cycles.

Although the index into the hash table is simple to compute, it may
nevertheless be quite effective, because the low bits of the chosen hash
function are affected by every character in the symbol:
static uint_fast32_t
dl_new_hash (const char *s)
{
uint_fast32_t h = 5381;
for (unsigned char c = *s; c != '\0'; c = *++s)
h = h * 33 + c;
return h & 0xffffffff;
}
In conclusion, I think that this method will improve the speed of
unsuccessful searches, because it computes a power-of-two modulo instead
of an expensive normal modulo, and because it will do one memory access
most of the time. The space overhead will be around nsymbols bytes
(between nsymbols*0.6666 and nsymbols*1.3333).
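
A quick sketch of the index computation this relies on (illustrative
names; since m*8 is a power of two, the modulo reduces to a mask):

#include <stdint.h>

/* nsymbols == 3072, so m == 4096 and m*8 == 32768 (a power of two). */
static int
bit_is_set (const uint32_t *bit_table, uint32_t hash)
{
  uint32_t m8 = 4096 * 8;
  uint32_t i = hash & (m8 - 1);          /* == hash % (m*8), just an AND */
  return (bit_table[i >> 5] >> (i & 31)) & 1;
}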
best regards.

djamel anonymous
2006-07-03 11:33:21 UTC
Permalink
Hello, I am writing again with a small suggestion that may reduce the
space occupied by the hash table and reduce the probability of extra
cache fills.
My suggestion is to align bucket chains to 8 byte boundaries and to
suppress the length field when the chain is of length 1. This is simply
done by changing the meaning of the values in the buckets array to:
offset=(bucket[hash%nbuckets])&0xfffffffe
which corresponds to the offset ((2 + nbuckets + N)&(-2)) * 4, aligned on
an 8 byte boundary. The meaning of the chain then depends on the lowest
bit of bucket[hash%nbuckets]:
if ((bucket[hash%nbuckets]&0x1)==1) then the chain is of length 1 and
contains:
symindx hash
if ((bucket[hash%nbuckets]&0x1)==0) then the chain is of variable length
larger than 1 and contains:
symindx0 n hash0 hash1 .... hashn
so the chains' space usage will be as follows:
1- a chain of length 1 that previously occupied 12 bytes will occupy 8
bytes.
2- a chain of length 2 that previously occupied 16 bytes will occupy 16
bytes.
3- a chain of length 3 that previously occupied 20 bytes will occupy 24
bytes.
4- a chain of length 4 that previously occupied 24 bytes will occupy 24
bytes.
5- a chain of length 5 that previously occupied 28 bytes will occupy 32
bytes.
.
.
If there are many more chains of length 1 than of length 3 or 5, there
will be a significant reduction in space usage.

For a chain of length 1 there will never be more than one cache fill,
while there was previously a 12.5% probability of a second cache fill
(assuming a cache line of 64 bytes); the chain causes a second cache fill
if it begins at position 56 or 60 inside the cache line, out of 16
possible positions, hence the probability of a second cache fill is 2/16.
For a chain of length 2 the probability of a second cache fill is
similarly reduced from 18.75% to 12.5%.

In conclusion, if nbuckets is of the same order as nsymbols, the space
usage will be reduced by something near 4 bytes per symbol and the cache
behavior will be better.
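
A sketch of the lookup this format implies (illustrative names and
assumptions: chain_words points at the chain area in 4-byte words, n
counts the hash words of a variable-length chain, and empty buckets are
not handled):

#include <stdint.h>

static int
hash_in_chain (const uint32_t *buckets, const uint32_t *chain_words,
               uint32_t nbuckets, uint32_t hash)
{
  uint32_t b = buckets[hash % nbuckets];
  const uint32_t *c = chain_words + (b & ~1u); /* 8-byte aligned start */
  if (b & 1)                  /* tagged: length-1 chain = symindx, hash */
    return c[1] == hash;
  else                        /* untagged: symindx0, n, hash0 .. */
    {
      uint32_t n = c[1], i;
      for (i = 0; i < n; i++)
        if (c[2 + i] == hash)
          return 1;
      return 0;
    }
}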
I hope my posting is useful.
best regards.

Jakub Jelinek
2006-07-03 12:00:57 UTC
Permalink
Post by djamel anonymous
if ((bucket[hash%nbuckets]&0x1)==1) then the chain is of length 1 and
symindx hash
if ((bucket[hash%nbuckets]&0x1)==0) then the chain is of variable length
symindx0 n hash0 hash1 .... hashn
1-a chain of length 1 that occupied previously 12 bytes will occupy 8 bytes.
2-a chain of length 2 that occupied previously 16 bytes will occupy 16
bytes.
3-a chain of length 3 that occupied previously 20 bytes will occupy 24
bytes.
4-a chain of length 4 that occupied previously 24 bytes will occupy 24
bytes.
5-a chain of length 5 that occupied previously 28 bytes will occupy 32
bytes.
This will
1) complicate the lookup function
2) optimize for an uncommon case

The linker can size nbuckets as it wishes of course, but trying
hard to have a huge nbuckets in order to get length-1 chains is a bad idea
with DT_GNU_HASH. The best chain length is IMHO something like 2-6 entries;
certainly it should fit into a cacheline, but a chain of length 1 is just
a waste of space in the first part of the .gnu.hash section.
The linker can easily align the whole .gnu.hash section on say 64 bytes
and reorder the chains so that it minimizes the number of chains crossing
a 64B boundary (assuming we say 64B is the common cache line length),
and it can do so without making the section very complicated.

Say for libstdc++.so.6 we have currently:
libstdc++.so.6.0.8

Histogram for bucket list length (total of 1013 buckets):
Length Number % of total Coverage
0 36 ( 3.6%)
1 134 ( 13.2%) 4.1%
2 196 ( 19.3%) 16.1%
3 231 ( 22.8%) 37.3%
4 194 ( 19.2%) 61.0%
5 121 ( 11.9%) 79.6%
6 60 ( 5.9%) 90.6%
7 27 ( 2.7%) 96.4%
8 7 ( 0.7%) 98.1%
9 7 ( 0.7%) 100.0%

Histogram for `.gnu.hash' bucket list length (total of 1022 buckets):
Length Number % of total Coverage
0 52 ( 5.1%)
1 131 ( 12.8%) 4.1%
2 225 ( 22.0%) 18.4%
3 233 ( 22.8%) 40.5%
4 171 ( 16.7%) 62.2%
5 115 ( 11.3%) 80.4%
6 64 ( 6.3%) 92.5%
7 18 ( 1.8%) 96.5%
8 8 ( 0.8%) 98.5%
9 4 ( 0.4%) 99.7%
10 1 ( 0.1%) 100.0%

For libc-2.4.90.so:

Histogram for bucket list length (total of 1023 buckets):
Length Number % of total Coverage
0 117 ( 11.4%)
1 311 ( 30.4%) 15.1%
2 257 ( 25.1%) 40.1%
3 189 ( 18.5%) 67.6%
4 97 ( 9.5%) 86.5%
5 38 ( 3.7%) 95.7%
6 10 ( 1.0%) 98.6%
7 4 ( 0.4%) 100.0%

Histogram for `.gnu.hash' bucket list length (total of 1011 buckets):
Length Number % of total Coverage
0 124 ( 12.3%)
1 254 ( 25.1%) 12.4%
2 296 ( 29.3%) 41.2%
3 198 ( 19.6%) 70.2%
4 98 ( 9.7%) 89.3%
5 29 ( 2.9%) 96.4%
6 10 ( 1.0%) 99.3%
7 2 ( 0.2%) 100.0%

Jakub
Ulrich Drepper
2006-07-03 14:46:19 UTC
Permalink
Post by djamel anonymous
Hello, i am writing again for a small suggestion that may reduce the
space occupied by the hash
table and reduce cache fills probability.
I see absolutely no benefit at all in any of this. The second hash
table with the required second memory load would kill all the
performance advantage. Its sole purpose would be to not compare the
chain list elements at all. But:

- it has to be pessimistic. I.e., for the first table there are also
collisions and when they happen, the search must be made

- we limit the hash chain to a single cache line whenever possible and
this is as I showed before almost always the case. The operations
all work on L1 cache and even 8 or 10 memory loads are cheaper
than loading another cache line. And that number of loads is just a
worst case behavior, usually there are fewer.

- each element in your second hash table must point to the beginning of
a chain. So here the huge overhead in the hash table sizing is
costing a lot. Furthermore, the division is nowadays no problem at
all anymore. It's fast. If you really care, modify your linker to
size the hash table with a power of two and change the dynamic linker
to optimize that case. It's easy enough to do. The linker already
has a function to size the table more intelligently if requested.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
djamel anonymous
2006-07-03 17:30:03 UTC
Permalink
First, thank you for replying to my post.
Post by Ulrich Drepper
I see absolutely no benefit at all in any of this. The second hash
table with the required second memory load would kill all the
performance advantage. It's sole purpose would be to not compare the
- it has to be pessimistic. I.e., for the first table there are also
collisions and when they happen, the search must be made
In fact this second hash table plays the role of a filter before the
original table; a search will be done in the original table if and only
if the corresponding bit is set in the new hash table.
Post by Ulrich Drepper
- each element in your second hash table must point to the beginning of
a chain. So here the huge overhead in the hash table sizing is
costing a lot. Furthermore, the division is nowadays no problem at
This new additional hash table will occupy on average nsymbols*8 bits,
which is nsymbols bytes; this is because we need only 1 bit per bucket.
If this bit is set we continue the search in the original table, otherwise
the search is stopped after only one memory access and without doing an
actual division.
Post by Ulrich Drepper
costing a lot. Furthermore, the division is nowadays no problem at
all anymore. It's fast. If you really care, modify your linker to
size the hash table with a power of two and change the dynamic linker
to optimize that case. It's easy enough to do. The linker already
has a function to size the table more intelligently if requested.
The second hash table in fact has a smaller size than the original table;
this is why its size is less important than the original one. The search
in the original table will be done if and only if the search in the new
hash table is successful, and only then will an actual division be done
to compute the bucket number in the original table. But this division
will have much less impact on performance because it is done only in the
unlikely case of a successful search in the first hash table.
best regards.

Ulrich Drepper
2006-07-03 17:49:14 UTC
Permalink
Post by djamel anonymous
this new additional hash table will occupy on average nsymbols*8 bits
which is nsymbols bytes; this is because we need only 1 bit per bucket;
It doesn't matter that the second table is small. The CPU has to read
the entire cache line. And if we have a conflict (which is likely) we
have to read the second hash table. That's what's killing the
performance. L2 access is about 5 times slower than L1. And since it's
unlikely you hit L2 in the cases where it really counts (i.e., big
programs with many DSOs and many symbols) you more likely hit the main
memory. And here the factor is something like 80 (in ideal conditions).
So, the most important optimization is to reduce the number of cache
lines needed.

Prefiltering is only an advantage if using the second table is more
expensive. It isn't here, as I explained. All the entries of the hash
chain are most likely in one cache line and therefore, unless we have a
full hash value match, we have to read one cache line. With your method
you can get by with reading only one cache line only if there is no hash
collision. I.e., no collision for any of the symbols used, defined or
undefined in the specific DSO. That's not likely for any reasonable
hash table size.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
djamel anonymous
2006-07-03 19:50:19 UTC
Permalink
Hello, first thank you for your reply. Here is an explanation of my scheme
for an unsuccessful search, with the probability of going to the next step
in parentheses:
hash_table1 (around 0.125) -> hash_table2 (between 0.5 and 1) -> hash
chains (nearly 0) -> symbol strings
(here hash_table1 is the new hash table that is used to filter the queries
before accessing hash_table2, which is the hash table currently
implemented)
We will have between 1.1875 and 1.25 memory accesses per unsuccessful
query. The original scheme was
hash_table2 (between 0.5 and 1) -> hash chains (approx 0) -> symbol strings
which gives between 1.5 and 2 memory accesses per unsuccessful query.
In fact the probability of going to the hash chains is very close to 1
because the number of buckets in hash_table2 is quite low; hence the new
number of memory accesses is nearly 1.125 while the old is nearly 2.

The fact that hash_table1 needs only 1 bit per bucket permits us to have
nsymbols*8 buckets for a space usage of only nsymbols bytes. This size
permits filtering out 87.5% of the unsuccessful queries without continuing
the search.

For the relative sizes of each structure:
hash_table1 -> approx nsymbols bytes
hash_table2 -> between nsymbols bytes and nsymbols*8 bytes according to
the file elflink.c
chains -> nsymbols*4 bytes
so (hash_table1/(hash_table1+hash_table2+chains)) will be between 1/13 and
1/6; so the working set will be increased by between 8.33333% and 20%.
As the hash_table1 size is between 7.692% and 16.66% of the working set
and is accessed much more often than the other structures, it is most
likely to fit into the L2 cache.
best regards.

Ulrich Drepper
2006-07-04 02:39:39 UTC
Permalink
You still don't get the cost model at all. So this is the last mail
since there is nothing whatsoever new in any of your replies.

The additional hash table requires an additional cache line to be
loaded. So it had better be worth it. I.e., it must prevent a sufficient
number of second-level hash table reads to justify the second cache line load
in case there is a hash conflict. So, when is there an advantage: only
if the bit in the first table isn't set.

If you size the first level hash table as 8*nsymbols you have at most a
12.5% rate of hitting a set bit. But in these 12.5% you have at least
two cache lines to read. I.e., the original cost was (with on average 4
comparisons for an unsuccessful lookup)

cacheline read + 4 * 10 cycles

while with your scheme you have

87.5% * (cacheline read + 20 cycles)
+ 12.5% * (2 * cacheline read + 20 cycles + 4 * 10 cycles)

(here I assume each comparison + loop costs about 10 cycles, the most
expensive bit operations for the bitmap operation 20 cycles).

Now fill in values for the cacheline read costs. A L2 cache read costs
about 15 cycles on a modern machine. A main memory read costs > 240
cycles (the 240 cycles are only achieved when reading successive memory,
random access is more expensive).

current yours

L1 hit 55 cycles 41.875 cycles

L1 miss 280 cycles 295 cycles


For cache hits the formulas are (average 3 lookups):

cacheline read + 3 * 10 cycles + strcmp

vs

2 * cacheline read + 20 cycles + 3 * 10 cycles + strcmp

I.e.,

current yours

L1 hit 45 cycles + strcmp 60 cycles + strcmp

L1 miss 270 cycles + strcmp 510 cycles + strcmp


These numbers suggest that the additional table is at best a wash, even
if you assume 95% unsuccessful lookups. Considering that the memory
accesses are highly random the 240 cycle estimate is far too low. I've
some numbers for the paper I'll have for my RH summit talk. For now
look at the slides:

http://people.redhat.com/drepper/cpucache-slides.pdf

Page 10 shows the random access times. Only the asymptotic costs are of
interest here because the nature of lookups we optimize for means that
none of the data is in a cache. The cacheline read time is the
dominating factor and the two reads are independent since the lines are
not adjacent.

A second point to keep in mind is that the gap between memory and CPU
speed isn't closing. And third, the additional memory load increases
the cache pressure, causing negative effects elsewhere when they could
be avoided by reading only one cache line.

And fourth, the additional table costs disk and virtual memory space.
This is especially true if you try to shift the percentages by
allocating an even larger table. We easily have DSOs with 5,000 symbols
today. Multiply by 100 loaded DSOs and perhaps 2 bytes per symbol. And
those are small numbers. Some of RH's customers have DSOs which are
each hundreds of MBs in size with 10,000s of symbols. It is not
acceptable to waste that much space, neither on disk nor in memory (we
have people who use every byte of the available address space).


I consider the time for arguing to be over. If you want to show your
mechanism is better, implement it and time it. I'm positively sure the
added complexity adds almost no improvement, maybe even hurts
performance.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
djamel anonymous
2006-07-04 22:50:50 UTC
Permalink
First thank you for your reply. I am sending this mail to explain my idea
further. My suggestion was made primarily because of 2 assumptions:

1- the case to be optimized is when the expected number of unsuccessful
queries is much more important than the successful ones.
2- each unsuccessful query does two memory accesses to 2 different cache
lines: the bucket inside the hash table and the hash chain it points to;
those two cache lines are not consecutive. A successful query will do a
third memory access to compare the name of the queried symbol with the
symbol inside the symbol table.

The problem with the current hash table is that its size is too small to
prevent access to the hash chains for unsuccessful queries. It seems that
the number of buckets is nbuckets<=nsymbols/2, which leaves the table with
too few empty buckets.
Increasing the size of the current hash table is not a good solution
because it occupies 32 bits per bucket. My suggestion hence was to add a
second hash table before the current hash table that will filter out most
of the unsuccessful queries; this table uses only one bit per bucket.

For nsymbols==3072 the size of this table will be 4096*8=32768 buckets,
which occupies 4096 bytes.
The hash table is declared as:

uint32_t bit_hash_nbuckets = 32768;
uint32_t bit_hash_table[32768 >> 5];

and the code looks like:

hash = hash_function (queried_symbol_name);
bit_hash_bucket = hash & (bit_hash_nbuckets - 1);
if ((bit_hash_table[bit_hash_bucket >> 5]
     & (1 << (bit_hash_bucket & 31))) == 0)
  return unsuccessful_query;
else
  /* continue the search in the original hash table */

With the help of this hash table, the number of memory accesses (and
correspondingly L2 cache misses) in the common case of unsuccessful
queries will be reduced from 2 to 1.

If we suppose that for the currently implemented hash table
nbuckets=nsymbols/2, then this hash table will occupy nsymbols*2 bytes;
the hash chains will occupy exactly nsymbols*4 bytes.
An unsuccessful query will do a first random access into the hash table
buckets and then a random access into the hash chains, so there will be
two random accesses into a space of 6*nsymbols bytes. By adding the new
hash table before the currently implemented one, the common unsuccessful
case will do an access into a space of nsymbols bytes. There is much more
chance that a space of nsymbols bytes fits into the cache than a space of
nsymbols*6 bytes. In fact it is possible that the cache miss rate will be
much smaller for the newly added hash table than for the previous one,
because it is accessed much more frequently than the second hash table's
buckets and the hash chains.
Although I am sure this solution is faster than the one currently
implemented, I know already that it is unlikely to be implemented because
it overcomplicates the file format.
best regards.

Jakub Jelinek
2006-07-05 09:04:38 UTC
Permalink
Post by djamel anonymous
first thank you, for your reply. i am sending this mail to explain further
my idea.My suggestion
I think you can implement the same while still not increasing the number of
used cache lines in the successful case.

Just put the bitmask alongside the chain offsets in the buckets array; then
(assuming .gnu.hash is always >= 8B aligned, which we can cheaply ensure),
the bitmask word will always be in the same cache line as the offset
into the chain array. In this case, for a successful lookup, or an
unsuccessful one where the bit is not set in the bitmask, we need just one
cache line access; otherwise it uses 2 (or more) as it did before. The
shorter the chain, the higher the probability that the bitmask saves the
extra cache line read. For extremely long chains, it wouldn't help at all
(chain length >= 32, assuming all hash values in the chain have different
topmost 5 bits), but chains with length >= 7 are really rare, and even for
chain length 7 the probability the bitmask helps is ~ 78% (at most 7 of
the 32 mask bits are set, so an unrelated hash is rejected with
probability >= 25/32).

The disadvantage of course is the increase in the .gnu.hash section
size, which grows from current
(2 + nbuckets + nsyms) * 4
to
(2 + 2 * nbuckets + nsyms) * 4.

I'll try to benchmark both of these variants and see if it is really worth
it.

Here is the interdiff showing how the glibc side of the changes for
this would look (and I'm also attaching Uli's 2006-07-03 patch, which
hasn't been posted yet).

--- libc/elf/dl-lookup.c.jj 2006-07-05 08:59:01.000000000 +0200
+++ libc/elf/dl-lookup.c 2006-07-05 09:01:53.000000000 +0200
@@ -380,8 +380,8 @@ _dl_setup_hash (struct link_map *map)
+ DT_EXTRANUM + DT_VALNUM]);
map->l_nbuckets = *hash32++;
map->l_gnu_symbias = *hash32++;
- map->l_gnu_buckets = hash32;
- hash32 += map->l_nbuckets;
+ map->l_gnu_buckets = (void *) hash32;
+ hash32 += 2 * map->l_nbuckets;
map->l_gnu_hashbase = hash32;
return;
}
--- libc/elf/do-lookup.h.jj 2006-07-05 08:59:01.000000000 +0200
+++ libc/elf/do-lookup.h 2006-07-05 09:07:22.000000000 +0200
@@ -169,10 +169,15 @@ do_lookup_x (const char *undef_name, uin
if (__builtin_expect (map->l_gnu_buckets != NULL, 1))
{
/* Use the new GNU-style hash table. */
- size_t chainoff = map->l_gnu_buckets[new_hash % map->l_nbuckets];
+ Elf32_Word bitmask
+ = map->l_gnu_buckets[new_hash % map->l_nbuckets].bitmask;

- if (chainoff != 0xffffffff)
+ /* Bitmask has bit N set iff any of the hash values in
+ hasharr chain has (hash >> 27) == N. */
+ if (bitmask & (1 << (new_hash >> 27)))
{
+ size_t chainoff
+ = map->l_gnu_buckets[new_hash % map->l_nbuckets].chainoff;
const Elf32_Word *hasharr = &map->l_gnu_hashbase[chainoff];
Elf32_Word new_hash_masked = new_hash & ~1u;

--- libc/include/link.h.jj 2006-07-05 08:59:01.000000000 +0200
+++ libc/include/link.h 2006-07-05 09:00:41.000000000 +0200
@@ -141,7 +141,11 @@ struct link_map

/* Symbol hash table. */
Elf_Symndx l_nbuckets;
- const Elf32_Word *l_gnu_buckets;
+ const struct
+ {
+ Elf32_Word chainoff;
+ Elf32_Word bitmask;
+ } *l_gnu_buckets;
union
{
const Elf32_Word *l_gnu_hashbase;

Jakub
djamel anonymous
2006-07-05 10:07:15 UTC
Permalink
Hello, first I thank you for your reply. The current hash function is
defined as:

static uint_fast32_t
dl_new_hash (const char *s)
{
uint_fast32_t h = 5381;
for (unsigned char c = *s; c != '\0'; c = *++s)
h = h * 33 + c;
return h & 0xffffffff;
}

With this hash function, strings of length 1 affect only the low 8 bits,
strings of length 2 affect only the low 13 bits... strings of length 4
affect only the low 23 bits, so the full 32 bits of the hash value are
affected only for strings of length 6 and above.
For the variant you suggested, I think using the 5 least significant bits
may solve the problem, but in this case nbuckets must be odd. Another
solution would be to use the bits numbered from 5 to 9; for this solution
to be effective, nbuckets must not be a multiple of 64.
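
(As a worked instance of the length-1 case: for s = "a" the function
computes h = 5381*33 + 'a' = 177573 + 97 = 177670, and replacing 'a' by
any other character only changes the added 0..255 term, so only roughly
the low 8 bits of the result vary between 1-character names.)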

best regards.

Jakub Jelinek
2006-07-05 10:32:29 UTC
Permalink
Post by djamel anonymous
Hello, first I thank you for your reply. The current hash function is
static uint_fast32_t
dl_new_hash (const char *s)
{
uint_fast32_t h = 5381;
for (unsigned char c = *s; c != '\0'; c = *++s)
h = h * 33 + c;
return h & 0xffffffff;
}
With this hash function, strings of length 1 affect only the low 8
bits, strings of length 2 affect only the low 13 bits... strings of
length 4 affect only the low 23 bits, so the full 32 bits of the hash
value are affected only for strings of length 6 and above.
Does that matter? Short symbol names are very rare:

N number of symbols with length N
1 7
2 77
3 130
4 381
5 681
6 1153
7 1716
8 2373
9 3037
10 3552
11 4369
12 5481
13 6603
14 7183
15 7050
16 8848
17 9813
18 8906
19 10743
20 11301
21 11118
22 11143
23 11306
24 11629
25 11770
26 11533
27 12118
28 12159
29 11912
30 12025
31 12033
32 12111
33 11824
34 11481
35 12070
36 11506
37 10890
38 10953
39 10386
40 10005
...

i.e. there are just 0.23% symbols with strlen (symname) < 6.
Using low bits is a bad idea, but perhaps using some middle bits...
More than 32K buckets is extremely uncommon, so e.g. using
(hash >> 16) & 0x1f could work.

Jakub
djamel anonymous
2006-07-05 14:45:33 UTC
Permalink
Thanks for your reply, and thanks also for taking my suggested hash table
into consideration. I hope this discussion has been useful for those who
are reading this mailing list, even if the suggestion turned out to be
useless.

best regards.

Jakub Jelinek
2006-07-06 11:25:35 UTC
Permalink
Post by Jakub Jelinek
I'll try to benchmark both of these variants and see if it is really worth
it.
I've repeated the same tests as I reported on Jun 28th, this time
in 2 new chroots.
One had (almost) all the dependent libraries built with --hash-style=both
and the 2006-07-03 layout of .gnu.hash (i.e.
nbuckets|symidx|nbuckets*chainoff|hash0&~1|hash1&~1|hash2bitor1|...|
) - take2 in the results; the other had the enlarged buckets array, i.e.
nbuckets|symidx|nbuckets*<chainoff|bitmask>|hash0&~1|hash1&~1|hash2bitor1|...|
where the bitmask has bit N set if there are any hashes in the chain such
that ((hash >> 5) & 31) == N. It is not using the LS bits, because those
contain the sum of all characters in the symbol name, which isn't ideal.
Binutils guaranteed that nbuckets is at least 2 and never a multiple of 32
- this tree is take3 in the results.

take3 reduces the number of L1 misses a lot, while slightly increasing
L2 misses (supposedly because .gnu.hash is now bigger),
but overall it sounds like take3 is desirable.

I will now do a few small changes suggested by Uli late last night and
then post the final patches. I don't want to make nbuckets a power of two:
the division really isn't very expensive on contemporary CPUs, and we
don't want nbuckets to be a multiple of 32, or the bitmask trick wouldn't
work properly.
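
A sketch of the take3 bitmask test, reconstructed from the layout
described above (the struct layout and names are my assumptions, not the
final patch):

#include <stdint.h>

struct gnu_bucket { uint32_t chainoff; uint32_t bitmask; };

static int
chain_may_contain (const struct gnu_bucket *buckets, uint32_t nbuckets,
                   uint32_t hash)
{
  const struct gnu_bucket *b = &buckets[hash % nbuckets];
  /* Bit N of the bitmask is set iff some hash in this chain has
     ((hash >> 5) & 31) == N; a clear bit lets us skip the chain
     without touching its cache line.  */
  return (b->bitmask & (1u << ((hash >> 5) & 31))) != 0;
}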

cat > a.c <<\EOF
#include <dlfcn.h>
const char *libs[] = {
"libvclplug_gtk680lx.so", "libvclplug_gen680lx.so", "libnss_files.so.2", "libGL.so.1", "servicemgr.uno.so",
"shlibloader.uno.so", "simplereg.uno.so", "nestedreg.uno.so", "typemgr.uno.so", "implreg.uno.so",
"security.uno.so", "libreg.so.3", "libstore.so.3", "regtypeprov.uno.so", "configmgr2.uno.so",
"typeconverter.uno.so", "gconfbe1.uno.so", "behelper.uno.so", "sax.uno.so", "localebe1.uno.so",
"uriproc.uno.so", "libspl680lx.so", "libucb1.so", "ucpgvfs1.uno.so", "libgcc3_uno.so", "libpackage2.so",
"libfileacc.so", "libuui680lx.so", "libfilterconfig1.so", "libdtransX11680lx.so", "i18npool.uno.so",
"liblocaledata_en.so", "fsstorage.uno.so", "libxstor.so", "libdbtools680lx.so", "libcups.so.2",
"libgnutls.so.13", "libgcrypt.so.11", "libgpg-error.so.0", "libmcnttype.so", "libucpchelp1.so",
"svtmisc.uno.so" };
int
main (int argc, char **argv)
{
int i;
void *h;
int flags = RTLD_LAZY;
if (argv[1][0] == 'g')
flags |= RTLD_GLOBAL;
for (i = 0; i < sizeof (libs) / sizeof (libs[0]); ++i)
h = dlopen (libs[i], flags);
return 0;
}
EOF
gcc -g -O2 -o a a.c -Wl,-rpath,/usr/lib64/openoffice.org2.0/program/ \
-L/usr/lib64/openoffice.org2.0/program/ -lsoffice -lsw680lx -lsvx680lx -lstdc++ -lm -shared-libgcc
for V in local global; do for M in '' 'export LD_X=1' 'export LD_BIND_NOW=1' 'export LD_X=1 LD_BIND_NOW=1'; \
do ( for i in 1 2 3 4; do eval $M; time ./a $V; done 2>&1 > /dev/null | \
awk 'BEGIN { printf "'"$V $M"'\t" } /^real/ { printf "%s ", $2 } END { printf "\n" }' ); done; done

take2
local 0m0.247s 0m0.244s 0m0.243s 0m0.245s
local export LD_X=1 0m0.541s 0m0.541s 0m0.540s 0m0.542s
local export LD_BIND_NOW=1 0m0.455s 0m0.451s 0m0.449s 0m0.452s
local export LD_X=1 LD_BIND_NOW=1 0m1.102s 0m1.100s 0m1.099s 0m1.101s
global 0m0.284s 0m0.281s 0m0.279s 0m0.279s
global export LD_X=1 0m0.623s 0m0.621s 0m0.622s 0m0.620s
global export LD_BIND_NOW=1 0m0.518s 0m0.514s 0m0.515s 0m0.517s
global export LD_X=1 LD_BIND_NOW=1 0m1.250s 0m1.246s 0m1.247s 0m1.246s

take3
local 0m0.212s 0m0.211s 0m0.211s 0m0.212s
local export LD_X=1 0m0.556s 0m0.579s 0m0.583s 0m0.577s
local export LD_BIND_NOW=1 0m0.391s 0m0.385s 0m0.383s 0m0.384s
local export LD_X=1 LD_BIND_NOW=1 0m1.124s 0m1.157s 0m1.178s 0m1.169s
global 0m0.241s 0m0.240s 0m0.240s 0m0.240s
global export LD_X=1 0m0.642s 0m0.665s 0m0.662s 0m0.661s
global export LD_BIND_NOW=1 0m0.438s 0m0.438s 0m0.434s 0m0.436s
global export LD_X=1 LD_BIND_NOW=1 0m1.280s 0m1.326s 0m1.318s 0m1.321s

for V in local global; do for M in '' 'export LD_X=1' 'export LD_BIND_NOW=1' 'export LD_X=1 LD_BIND_NOW=1'; \
do ( echo "$V $M"; eval $M; valgrind --tool=cachegrind ./a $V 2>&1 > /dev/null | sed -n '/== I refs/,$p' ); \
done; done

take2
local
==5984== I refs: 206,039,807
==5984== I1 misses: 11,336
==5984== L2i misses: 10,079
==5984== I1 miss rate: 0.00%
==5984== L2i miss rate: 0.00%
==5984==
==5984== D refs: 75,347,721 (58,986,205 rd + 16,361,516 wr)
==5984== D1 misses: 4,444,606 ( 4,290,272 rd + 154,334 wr)
==5984== L2d misses: 506,231 ( 413,982 rd + 92,249 wr)
==5984== D1 miss rate: 5.8% ( 7.2% + 0.9% )
==5984== L2d miss rate: 0.6% ( 0.7% + 0.5% )
==5984==
==5984== L2 refs: 4,455,942 ( 4,301,608 rd + 154,334 wr)
==5984== L2 misses: 516,310 ( 424,061 rd + 92,249 wr)
==5984== L2 miss rate: 0.1% ( 0.1% + 0.5% )
local export LD_X=1
==5988== I refs: 306,750,046
==5988== I1 misses: 11,334
==5988== L2i misses: 10,489
==5988== I1 miss rate: 0.00%
==5988== L2i miss rate: 0.00%
==5988==
==5988== D refs: 127,819,504 (98,014,006 rd + 29,805,498 wr)
==5988== D1 misses: 9,632,662 ( 9,469,418 rd + 163,244 wr)
==5988== L2d misses: 3,029,744 ( 2,924,102 rd + 105,642 wr)
==5988== D1 miss rate: 7.5% ( 9.6% + 0.5% )
==5988== L2d miss rate: 2.3% ( 2.9% + 0.3% )
==5988==
==5988== L2 refs: 9,643,996 ( 9,480,752 rd + 163,244 wr)
==5988== L2 misses: 3,040,233 ( 2,934,591 rd + 105,642 wr)
==5988== L2 miss rate: 0.6% ( 0.7% + 0.3% )
local export LD_BIND_NOW=1
==5992== I refs: 399,954,112
==5992== I1 misses: 10,955
==5992== L2i misses: 9,868
==5992== I1 miss rate: 0.00%
==5992== L2i miss rate: 0.00%
==5992==
==5992== D refs: 149,744,874 (116,766,890 rd + 32,977,984 wr)
==5992== D1 misses: 9,146,137 ( 8,945,587 rd + 200,550 wr)
==5992== L2d misses: 673,564 ( 573,758 rd + 99,806 wr)
==5992== D1 miss rate: 6.1% ( 7.6% + 0.6% )
==5992== L2d miss rate: 0.4% ( 0.4% + 0.3% )
==5992==
==5992== L2 refs: 9,157,092 ( 8,956,542 rd + 200,550 wr)
==5992== L2 misses: 683,432 ( 583,626 rd + 99,806 wr)
==5992== L2 miss rate: 0.1% ( 0.1% + 0.3% )
local export LD_X=1 LD_BIND_NOW=1
==5996== I refs: 612,477,609
==5996== I1 misses: 10,953
==5996== L2i misses: 10,129
==5996== I1 miss rate: 0.00%
==5996== L2i miss rate: 0.00%
==5996==
==5996== D refs: 261,643,870 (199,575,161 rd + 62,068,709 wr)
==5996== D1 misses: 20,389,219 ( 20,198,114 rd + 191,105 wr)
==5996== L2d misses: 6,306,926 ( 6,193,450 rd + 113,476 wr)
==5996== D1 miss rate: 7.7% ( 10.1% + 0.3% )
==5996== L2d miss rate: 2.4% ( 3.1% + 0.1% )
==5996==
==5996== L2 refs: 20,400,172 ( 20,209,067 rd + 191,105 wr)
==5996== L2 misses: 6,317,055 ( 6,203,579 rd + 113,476 wr)
==5996== L2 miss rate: 0.7% ( 0.7% + 0.1% )
global
==6000== I refs: 221,176,210
==6000== I1 misses: 11,605
==6000== L2i misses: 10,263
==6000== I1 miss rate: 0.00%
==6000== L2i miss rate: 0.00%
==6000==
==6000== D refs: 82,903,948 (64,808,080 rd + 18,095,868 wr)
==6000== D1 misses: 6,361,596 ( 6,203,856 rd + 157,740 wr)
==6000== L2d misses: 534,670 ( 441,893 rd + 92,777 wr)
==6000== D1 miss rate: 7.6% ( 9.5% + 0.8% )
==6000== L2d miss rate: 0.6% ( 0.6% + 0.5% )
==6000==
==6000== L2 refs: 6,373,201 ( 6,215,461 rd + 157,740 wr)
==6000== L2 misses: 544,933 ( 452,156 rd + 92,777 wr)
==6000== L2 miss rate: 0.1% ( 0.1% + 0.5% )
global export LD_X=1
==6004== I refs: 331,769,024
==6004== I1 misses: 11,603
==6004== L2i misses: 10,705
==6004== I1 miss rate: 0.00%
==6004== L2i miss rate: 0.00%
==6004==
==6004== D refs: 140,927,783 (107,887,754 rd + 33,040,029 wr)
==6004== D1 misses: 12,061,356 ( 11,896,678 rd + 164,678 wr)
==6004== L2d misses: 3,511,000 ( 3,404,823 rd + 106,177 wr)
==6004== D1 miss rate: 8.5% ( 11.0% + 0.4% )
==6004== L2d miss rate: 2.4% ( 3.1% + 0.3% )
==6004==
==6004== L2 refs: 12,072,959 ( 11,908,281 rd + 164,678 wr)
==6004== L2 misses: 3,521,705 ( 3,415,528 rd + 106,177 wr)
==6004== L2 miss rate: 0.7% ( 0.7% + 0.3% )
global export LD_BIND_NOW=1
==6008== I refs: 427,418,120
==6008== I1 misses: 11,204
==6008== L2i misses: 10,041
==6008== I1 miss rate: 0.00%
==6008== L2i miss rate: 0.00%
==6008==
==6008== D refs: 163,419,797 (127,305,645 rd + 36,114,152 wr)
==6008== D1 misses: 12,599,471 ( 12,390,660 rd + 208,811 wr)
==6008== L2d misses: 721,852 ( 621,527 rd + 100,325 wr)
==6008== D1 miss rate: 7.7% ( 9.7% + 0.5% )
==6008== L2d miss rate: 0.4% ( 0.4% + 0.2% )
==6008==
==6008== L2 refs: 12,610,675 ( 12,401,864 rd + 208,811 wr)
==6008== L2 misses: 731,893 ( 631,568 rd + 100,325 wr)
==6008== L2 miss rate: 0.1% ( 0.1% + 0.2% )
global export LD_X=1 LD_BIND_NOW=1
==6012== I refs: 657,379,875
==6012== I1 misses: 11,204
==6012== L2i misses: 10,320
==6012== I1 miss rate: 0.00%
==6012== L2i miss rate: 0.00%
==6012==
==6012== D refs: 285,228,684 (217,319,888 rd + 67,908,796 wr)
==6012== D1 misses: 24,774,666 ( 24,579,918 rd + 194,748 wr)
==6012== L2d misses: 7,207,303 ( 7,093,279 rd + 114,024 wr)
==6012== D1 miss rate: 8.6% ( 11.3% + 0.2% )
==6012== L2d miss rate: 2.5% ( 3.2% + 0.1% )
==6012==
==6012== L2 refs: 24,785,870 ( 24,591,122 rd + 194,748 wr)
==6012== L2 misses: 7,217,623 ( 7,103,599 rd + 114,024 wr)
==6012== L2 miss rate: 0.7% ( 0.8% + 0.1% )

take3
local
==5139== I refs: 192,860,473
==5139== I1 misses: 12,387
==5139== L2i misses: 10,120
==5139== I1 miss rate: 0.00%
==5139== L2i miss rate: 0.00%
==5139==
==5139== D refs: 72,521,147 (56,099,943 rd + 16,421,204 wr)
==5139== D1 misses: 3,413,455 ( 3,256,926 rd + 156,529 wr)
==5139== L2d misses: 525,073 ( 433,257 rd + 91,816 wr)
==5139== D1 miss rate: 4.7% ( 5.8% + 0.9% )
==5139== L2d miss rate: 0.7% ( 0.7% + 0.5% )
==5139==
==5139== L2 refs: 3,425,842 ( 3,269,313 rd + 156,529 wr)
==5139== L2 misses: 535,193 ( 443,377 rd + 91,816 wr)
==5139== L2 miss rate: 0.2% ( 0.1% + 0.5% )
local export LD_X=1
==5143== I refs: 307,040,928
==5143== I1 misses: 12,381
==5143== L2i misses: 10,684
==5143== I1 miss rate: 0.00%
==5143== L2i miss rate: 0.00%
==5143==
==5143== D refs: 127,932,904 (98,064,136 rd + 29,868,768 wr)
==5143== D1 misses: 9,484,151 ( 9,320,536 rd + 163,615 wr)
==5143== L2d misses: 3,014,606 ( 2,909,020 rd + 105,586 wr)
==5143== D1 miss rate: 7.4% ( 9.5% + 0.5% )
==5143== L2d miss rate: 2.3% ( 2.9% + 0.3% )
==5143==
==5143== L2 refs: 9,496,532 ( 9,332,917 rd + 163,615 wr)
==5143== L2 misses: 3,025,290 ( 2,919,704 rd + 105,586 wr)
==5143== L2 miss rate: 0.6% ( 0.7% + 0.3% )
local export LD_BIND_NOW=1
==5147== I refs: 371,723,469
==5147== I1 misses: 11,941
==5147== L2i misses: 9,909
==5147== I1 miss rate: 0.00%
==5147== L2i miss rate: 0.00%
==5147==
==5147== D refs: 143,691,228 (110,589,376 rd + 33,101,852 wr)
==5147== D1 misses: 6,960,878 ( 6,773,347 rd + 187,531 wr)
==5147== L2d misses: 721,548 ( 622,166 rd + 99,382 wr)
==5147== D1 miss rate: 4.8% ( 6.1% + 0.5% )
==5147== L2d miss rate: 0.5% ( 0.5% + 0.3% )
==5147==
==5147== L2 refs: 6,972,819 ( 6,785,288 rd + 187,531 wr)
==5147== L2 misses: 731,457 ( 632,075 rd + 99,382 wr)
==5147== L2 miss rate: 0.1% ( 0.1% + 0.3% )
local export LD_X=1 LD_BIND_NOW=1
==5151== I refs: 613,042,464
==5151== I1 misses: 11,940
==5151== L2i misses: 10,409
==5151== I1 miss rate: 0.00%
==5151== L2i miss rate: 0.00%
==5151==
==5151== D refs: 261,868,433 (199,669,277 rd + 62,199,156 wr)
==5151== D1 misses: 20,184,839 ( 19,970,673 rd + 214,166 wr)
==5151== L2d misses: 6,271,011 ( 6,157,540 rd + 113,471 wr)
==5151== D1 miss rate: 7.7% ( 10.0% + 0.3% )
==5151== L2d miss rate: 2.3% ( 3.0% + 0.1% )
==5151==
==5151== L2 refs: 20,196,779 ( 19,982,613 rd + 214,166 wr)
==5151== L2 misses: 6,281,420 ( 6,167,949 rd + 113,471 wr)
==5151== L2 miss rate: 0.7% ( 0.7% + 0.1% )
global
==5155== I refs: 206,478,881
==5155== I1 misses: 12,681
==5155== L2i misses: 10,302
==5155== I1 miss rate: 0.00%
==5155== L2i miss rate: 0.00%
==5155==
==5155== D refs: 79,796,888 (61,641,432 rd + 18,155,456 wr)
==5155== D1 misses: 5,204,611 ( 5,046,697 rd + 157,914 wr)
==5155== L2d misses: 551,003 ( 458,783 rd + 92,220 wr)
==5155== D1 miss rate: 6.5% ( 8.1% + 0.8% )
==5155== L2d miss rate: 0.6% ( 0.7% + 0.5% )
==5155==
==5155== L2 refs: 5,217,292 ( 5,059,378 rd + 157,914 wr)
==5155== L2 misses: 561,305 ( 469,085 rd + 92,220 wr)
==5155== L2 miss rate: 0.1% ( 0.1% + 0.5% )
global export LD_X=1
==5159== I refs: 332,044,535
==5159== I1 misses: 12,675
==5159== L2i misses: 11,000
==5159== I1 miss rate: 0.00%
==5159== L2i miss rate: 0.00%
==5159==
==5159== D refs: 141,034,475 (107,932,723 rd + 33,101,752 wr)
==5159== D1 misses: 11,988,824 ( 11,824,286 rd + 164,538 wr)
==5159== L2d misses: 3,491,915 ( 3,385,840 rd + 106,075 wr)
==5159== D1 miss rate: 8.5% ( 10.9% + 0.4% )
==5159== L2d miss rate: 2.4% ( 3.1% + 0.3% )
==5159==
==5159== L2 refs: 12,001,499 ( 11,836,961 rd + 164,538 wr)
==5159== L2 misses: 3,502,915 ( 3,396,840 rd + 106,075 wr)
==5159== L2 miss rate: 0.7% ( 0.7% + 0.3% )
global export LD_BIND_NOW=1
==5165== I refs: 396,437,452
==5165== I1 misses: 12,211
==5165== L2i misses: 10,092
==5165== I1 miss rate: 0.00%
==5165== L2i miss rate: 0.00%
==5165==
==5165== D refs: 156,857,604 (120,619,707 rd + 36,237,897 wr)
==5165== D1 misses: 10,195,672 ( 10,010,304 rd + 185,368 wr)
==5165== L2d misses: 767,253 ( 667,448 rd + 99,805 wr)
==5165== D1 miss rate: 6.4% ( 8.2% + 0.5% )
==5165== L2d miss rate: 0.4% ( 0.5% + 0.2% )
==5165==
==5165== L2 refs: 10,207,883 ( 10,022,515 rd + 185,368 wr)
==5165== L2 misses: 777,345 ( 677,540 rd + 99,805 wr)
==5165== L2 miss rate: 0.1% ( 0.1% + 0.2% )
global export LD_X=1 LD_BIND_NOW=1
==5169== I refs: 657,927,738
==5169== I1 misses: 12,211
==5169== L2i misses: 10,669
==5169== I1 miss rate: 0.00%
==5169== L2i miss rate: 0.00%
==5169==
==5169== D refs: 285,446,109 (217,408,377 rd + 68,037,732 wr)
==5169== D1 misses: 24,704,584 ( 24,488,640 rd + 215,944 wr)
==5169== L2d misses: 7,165,792 ( 7,051,869 rd + 113,923 wr)
==5169== D1 miss rate: 8.6% ( 11.2% + 0.3% )
==5169== L2d miss rate: 2.5% ( 3.2% + 0.1% )
==5169==
==5169== L2 refs: 24,716,795 ( 24,500,851 rd + 215,944 wr)
==5169== L2 misses: 7,176,461 ( 7,062,538 rd + 113,923 wr)
==5169== L2 miss rate: 0.7% ( 0.8% + 0.1% )

for V in local global; do for M in '' '-E LD_X=1' '-E LD_BIND_NOW=1' '-E LD_X=1 -E LD_BIND_NOW=1'; \
do ( echo "$V $M"; ./timing $M ./a $V ); done; done

take2
local
Strip out best and worst realtime result
minimum: 0.241084000 sec real / 0.000055769 sec CPU
maximum: 0.248671000 sec real / 0.000073555 sec CPU
average: 0.242298357 sec real / 0.000064487 sec CPU
stdev : 0.001100542 sec real / 0.000003072 sec CPU
local -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.537341000 sec real / 0.000066888 sec CPU
maximum: 0.545505000 sec real / 0.000082234 sec CPU
average: 0.539162642 sec real / 0.000073861 sec CPU
stdev : 0.000877479 sec real / 0.000001513 sec CPU
local -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.447226000 sec real / 0.000048417 sec CPU
maximum: 0.454843000 sec real / 0.000074707 sec CPU
average: 0.449379357 sec real / 0.000069658 sec CPU
stdev : 0.001006922 sec real / 0.000004595 sec CPU
local -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 1.096843000 sec real / 0.000057979 sec CPU
maximum: 1.108830000 sec real / 0.000072577 sec CPU
average: 1.099736928 sec real / 0.000064127 sec CPU
stdev : 0.002059076 sec real / 0.000001466 sec CPU
global
Strip out best and worst realtime result
minimum: 0.277778000 sec real / 0.000056801 sec CPU
maximum: 0.287137000 sec real / 0.000082176 sec CPU
average: 0.279447607 sec real / 0.000072339 sec CPU
stdev : 0.001470227 sec real / 0.000004583 sec CPU
global -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.618532000 sec real / 0.000065063 sec CPU
maximum: 0.637458000 sec real / 0.000085827 sec CPU
average: 0.623114250 sec real / 0.000074015 sec CPU
stdev : 0.004356701 sec real / 0.000003608 sec CPU
global -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.513478000 sec real / 0.000062431 sec CPU
maximum: 0.524921000 sec real / 0.000076080 sec CPU
average: 0.517233357 sec real / 0.000072163 sec CPU
stdev : 0.003029441 sec real / 0.000002879 sec CPU
global -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 1.243250000 sec real / 0.000055890 sec CPU
maximum: 1.252460000 sec real / 0.000073371 sec CPU
average: 1.245845642 sec real / 0.000063768 sec CPU
stdev : 0.001419068 sec real / 0.000002191 sec CPU

take3
local
Strip out best and worst realtime result
minimum: 0.207072000 sec real / 0.000059167 sec CPU
maximum: 0.213904000 sec real / 0.000073101 sec CPU
average: 0.208218392 sec real / 0.000064074 sec CPU
stdev : 0.001025499 sec real / 0.000001562 sec CPU
local -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.544442000 sec real / 0.000065068 sec CPU
maximum: 0.575327000 sec real / 0.000083959 sec CPU
average: 0.552129321 sec real / 0.000074002 sec CPU
stdev : 0.003887680 sec real / 0.000002200 sec CPU
local -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.374172000 sec real / 0.000064850 sec CPU
maximum: 0.379919000 sec real / 0.000074617 sec CPU
average: 0.376311357 sec real / 0.000072356 sec CPU
stdev : 0.001204166 sec real / 0.000002292 sec CPU
local -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 1.101936000 sec real / 0.000059454 sec CPU
maximum: 1.177582000 sec real / 0.000070216 sec CPU
average: 1.118296642 sec real / 0.000064014 sec CPU
stdev : 0.010025261 sec real / 0.000001281 sec CPU
global
Strip out best and worst realtime result
minimum: 0.236546000 sec real / 0.000057142 sec CPU
maximum: 0.246220000 sec real / 0.000065733 sec CPU
average: 0.239646714 sec real / 0.000063370 sec CPU
stdev : 0.002348865 sec real / 0.000002093 sec CPU
global -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.630578000 sec real / 0.000063307 sec CPU
maximum: 0.662570000 sec real / 0.000079459 sec CPU
average: 0.642624142 sec real / 0.000073384 sec CPU
stdev : 0.006348274 sec real / 0.000003411 sec CPU
global -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.428218000 sec real / 0.000062262 sec CPU
maximum: 0.437902000 sec real / 0.000076227 sec CPU
average: 0.431983142 sec real / 0.000070362 sec CPU
stdev : 0.002499419 sec real / 0.000002465 sec CPU
global -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 1.255278000 sec real / 0.000059120 sec CPU
maximum: 1.337006000 sec real / 0.000072259 sec CPU
average: 1.281770035 sec real / 0.000064085 sec CPU
stdev : 0.013042264 sec real / 0.000001631 sec CPU

/usr/sbin/prelink -vmR ./a
for V in local global; do for M in '' 'export LD_X=1' 'export LD_BIND_NOW=1' 'export LD_X=1 LD_BIND_NOW=1'; \
do ( for i in 1 2 3 4; do eval $M; time ./a $V; done 2>&1 > /dev/null | \
awk 'BEGIN { printf "'"$V $M"'\t" } /^real/ { printf "%s ", $2 } END { printf "\n" }' ); done; done

take2
local 0m0.134s 0m0.132s 0m0.131s 0m0.131s
local export LD_X=1 0m0.270s 0m0.267s 0m0.268s 0m0.267s
local export LD_BIND_NOW=1 0m0.228s 0m0.226s 0m0.226s 0m0.226s
local export LD_X=1 LD_BIND_NOW=1 0m0.507s 0m0.499s 0m0.500s 0m0.499s
global 0m0.167s 0m0.165s 0m0.165s 0m0.166s
global export LD_X=1 0m0.347s 0m0.344s 0m0.345s 0m0.344s
global export LD_BIND_NOW=1 0m0.289s 0m0.288s 0m0.287s 0m0.286s
global export LD_X=1 LD_BIND_NOW=1 0m0.657s 0m0.642s 0m0.641s 0m0.640s

take3
local 0m0.112s 0m0.109s 0m0.109s 0m0.108s
local export LD_X=1 0m0.268s 0m0.273s 0m0.265s 0m0.264s
local export LD_BIND_NOW=1 0m0.183s 0m0.180s 0m0.181s 0m0.180s
local export LD_X=1 LD_BIND_NOW=1 0m0.506s 0m0.494s 0m0.504s 0m0.497s
global 0m0.140s 0m0.136s 0m0.137s 0m0.137s
global export LD_X=1 0m0.353s 0m0.346s 0m0.345s 0m0.343s
global export LD_BIND_NOW=1 0m0.235s 0m0.232s 0m0.233s 0m0.231s
global export LD_X=1 LD_BIND_NOW=1 0m0.647s 0m0.640s 0m0.645s 0m0.643s

# valgrind --tool=cachegrind stats not provided for prelinked testcase,
# as valgrind apparently uses LD_PRELOAD internally and thus prevents
# prelinking.

for V in local global; do for M in '' '-E LD_X=1' '-E LD_BIND_NOW=1' '-E LD_X=1 -E LD_BIND_NOW=1'; \
do ( echo "$V $M"; ./timing $M ./a $V ); done; done

take2
local
Strip out best and worst realtime result
minimum: 0.129763000 sec real / 0.000051852 sec CPU
maximum: 0.133951000 sec real / 0.000067645 sec CPU
average: 0.130537178 sec real / 0.000062305 sec CPU
stdev : 0.000481749 sec real / 0.000001382 sec CPU
local -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.264604000 sec real / 0.000062178 sec CPU
maximum: 0.274639000 sec real / 0.000085800 sec CPU
average: 0.267029107 sec real / 0.000071314 sec CPU
stdev : 0.001715647 sec real / 0.000003884 sec CPU
local -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.224145000 sec real / 0.000056408 sec CPU
maximum: 0.233185000 sec real / 0.000063897 sec CPU
average: 0.225996464 sec real / 0.000062388 sec CPU
stdev : 0.000731068 sec real / 0.000001113 sec CPU
local -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.497700000 sec real / 0.000064336 sec CPU
maximum: 0.506782000 sec real / 0.000082528 sec CPU
average: 0.500931285 sec real / 0.000072029 sec CPU
stdev : 0.001461852 sec real / 0.000002125 sec CPU
global
Strip out best and worst realtime result
minimum: 0.163840000 sec real / 0.000054914 sec CPU
maximum: 0.170908000 sec real / 0.000069313 sec CPU
average: 0.164924250 sec real / 0.000061514 sec CPU
stdev : 0.000564632 sec real / 0.000001893 sec CPU
global -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.342633000 sec real / 0.000062245 sec CPU
maximum: 0.351814000 sec real / 0.000073874 sec CPU
average: 0.346293321 sec real / 0.000070698 sec CPU
stdev : 0.002620913 sec real / 0.000002831 sec CPU
global -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.285191000 sec real / 0.000062101 sec CPU
maximum: 0.293454000 sec real / 0.000073980 sec CPU
average: 0.287895000 sec real / 0.000071590 sec CPU
stdev : 0.002337740 sec real / 0.000001967 sec CPU
global -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.638814000 sec real / 0.000063557 sec CPU
maximum: 0.659667000 sec real / 0.000086187 sec CPU
average: 0.643091392 sec real / 0.000073052 sec CPU
stdev : 0.003414747 sec real / 0.000002750 sec CPU

take3
local
Strip out best and worst realtime result
minimum: 0.106563000 sec real / 0.000056654 sec CPU
maximum: 0.156301000 sec real / 0.000069681 sec CPU
average: 0.107375964 sec real / 0.000063363 sec CPU
stdev : 0.000376538 sec real / 0.000001826 sec CPU
local -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.262108000 sec real / 0.000065873 sec CPU
maximum: 0.268947000 sec real / 0.000079329 sec CPU
average: 0.265046500 sec real / 0.000074189 sec CPU
stdev : 0.001349294 sec real / 0.000002202 sec CPU
local -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.178339000 sec real / 0.000058372 sec CPU
maximum: 0.180964000 sec real / 0.000070478 sec CPU
average: 0.179214142 sec real / 0.000063930 sec CPU
stdev : 0.000602966 sec real / 0.000000961 sec CPU
local -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.492687000 sec real / 0.000064509 sec CPU
maximum: 0.509753000 sec real / 0.000090265 sec CPU
average: 0.498109571 sec real / 0.000074401 sec CPU
stdev : 0.002885770 sec real / 0.000001602 sec CPU
global
Strip out best and worst realtime result
minimum: 0.135356000 sec real / 0.000055916 sec CPU
maximum: 0.157524000 sec real / 0.000067681 sec CPU
average: 0.137677178 sec real / 0.000062051 sec CPU
stdev : 0.001880825 sec real / 0.000003219 sec CPU
global -E LD_X=1
optarg="LD_X=1"
Strip out best and worst realtime result
minimum: 0.341822000 sec real / 0.000064653 sec CPU
maximum: 0.396209000 sec real / 0.000080133 sec CPU
average: 0.344525392 sec real / 0.000073766 sec CPU
stdev : 0.001928640 sec real / 0.000003070 sec CPU
global -E LD_BIND_NOW=1
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.230014000 sec real / 0.000057105 sec CPU
maximum: 0.236641000 sec real / 0.000079505 sec CPU
average: 0.231364928 sec real / 0.000065419 sec CPU
stdev : 0.001134105 sec real / 0.000003888 sec CPU
global -E LD_X=1 -E LD_BIND_NOW=1
optarg="LD_X=1"
optarg="LD_BIND_NOW=1"
Strip out best and worst realtime result
minimum: 0.636048000 sec real / 0.000064933 sec CPU
maximum: 0.673287000 sec real / 0.000077339 sec CPU
average: 0.645732678 sec real / 0.000073476 sec CPU
stdev : 0.006697649 sec real / 0.000001922 sec CPU


Jakub
Jakub Jelinek
2006-07-06 18:21:16 UTC
Permalink
Post by Jakub Jelinek
Will now do a few small changes suggested by Uli late last night and then
post the final patches. I don't want to make nbuckets a power of two;
the division really isn't very expensive on contemporary CPUs, and we
don't want nbuckets to be a multiple of 32 if the bitmask trick is to
work properly.
And here are the hopefully final patches.

Changes today:
1) swapped chainoff and bitmask in the buckets array pairs - bitmask
is accessed first by ld.so, so it makes sense to put it first
2) chainoff fields in the bucket array pairs don't contain indexes into
the chains area of the .gnu.hash section (located at
.gnu.hash'VMA + 8 + 8 * nbuckets) but rather dynamic symbol indexes.
So the chain location for CHAINOFF is:
.gnu.hash'VMA + 8 + 8 * nbuckets + (CHAINOFF - symidx) * 4
The advantage of this is that the bias can be applied once in the ld.so
initialization of a library, so the inner loop doesn't need to subtract
or add it anywhere (see the sketch after this list).
3) added hooks so that executable PLT slots on i386/x86_64 whose
functions never have their address taken by the executable (and thus
the SHN_UNDEF PLT symbols have st_value == 0) are not added to
.gnu.hash chains
4) if there are no symbols in the .gnu.hash section at all, the linker
sets nbuckets to 0. This will allow completely removing the executable
from the symbol search scope in the future.
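
As a minimal sketch of the bias in 2) (the names here are illustrative,
not from the patch), ld.so can precompute the chain base once per
library:

#include <stdint.h>

/* Computed once at library initialization; gnu_hash_vma, nbuckets and
   symidx are the values described above.  */
static const uint32_t *
gnu_hash_chain_base (uintptr_t gnu_hash_vma, uint32_t nbuckets,
                     uint32_t symidx)
{
  /* Chains start at VMA + 8 + 8 * nbuckets; folding in -symidx here
     lets the lookup loop index directly with CHAINOFF, so the inner
     loop does no bias arithmetic at all.  */
  return (const uint32_t *) (gnu_hash_vma + 8 + 8 * nbuckets) - symidx;
}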

Ok for trunk?

Jakub
djamel anonymous
2006-07-06 19:53:29 UTC
Permalink
Hello, I am writing this time about the first variant. After looking at
the benchmark results I noted that there has been a reduction in the
number of L1 cache misses; each avoided L1 cache miss is a win of 12
cycles, the difference between the L2 and L1 latencies (15 - 3). On the
other hand, replacing a division by a binary AND is a win of at least
25 cycles, so I think that avoiding the division in the common case may
improve performance.
Changing the size of the main hash table is not an ideal solution,
because it is difficult to choose a good power-of-2 size and there is
no guarantee that a power-of-2 size will give short chains. It would
also introduce great variability in nbuckets; the table may be
oversized, which wastes a lot of space. On the other hand, a power-of-2
filter in front of the main hash table has several advantages:
1- it is small: on average its size is nsymbols bytes, which has a
better chance of fitting in the L1 cache than the 3*nsymbols or
4*nsymbols bytes of the current hash table.
2- because it is small, oversizing or undersizing it will not have a
big impact on the overall space utilization.
3- the take3 benchmark shows an increase in L2 cache misses because of
the enlarged hash table; the reduced size of the filter hash table may
cause fewer L2 cache misses.
4- in the take3 hash table there is a waste of space because the symbol
index word inside the hash table is useless. According to yesterday's
charts, between 5% and 12% of the buckets in each hash table are empty;
for each empty bucket the symbol index has no role and is wasted. With
the filter hash table, every added word helps reduce the probability of
going to the second hash table.
When an access to the filter hash table finds the bit set, there will
be an access to the normal hash table. There is still a chance to avoid
a third memory access if the bucket is found empty (between 5% and
12%), so the average number of memory accesses may not increase much
compared to take3 (see the sketch below).
I think that nbuckets must not be a power of two, otherwise every
access to the normal hash table would hit a non-empty bucket and a
third memory access to the hash chains would be done.
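
A rough sketch of the proposed probe order, with made-up names and
sizes (nothing here is from the patch), could look like:

#include <stdint.h>

#define FILTER_BITS 8192   /* 8 * pow2_round (nsymbols), power of two */
#define NBUCKETS    1009   /* deliberately not a power of two */

static uint32_t filter[FILTER_BITS / 32];
static uint32_t buckets[NBUCKETS];

/* Returns 0 when the filter (one access) or an empty bucket (two
   accesses) proves the symbol absent; only otherwise must the chain
   area (a third access) be touched.  */
static int
may_be_present (uint32_t h)
{
  uint32_t bit = h & (FILTER_BITS - 1);         /* mask, no division */
  if (!(filter[bit >> 5] & (1u << (bit & 31))))
    return 0;
  if (buckets[h % NBUCKETS] == 0)
    return 0;
  return 1;
}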
The following is a function that computes the nearest power of 2 to a
given integer:
static inline unsigned pow2_round (unsigned int input)
{
  /* Assumes input >= 2; __builtin_clz (0) is undefined.  */
  unsigned log2 = 32 - __builtin_clz (input);
  if ((1 << (log2 - 2)) & input)
    return 1 << log2;
  else
    return 1 << (log2 - 1);
}
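For example, pow2_round (530000) returns 524288, since 2^19 is nearer
to 530000 than 2^20 is (the input must be at least 2, as noted in the
comment above, so that log2 - 2 stays non-negative).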
The corresponding size of the filter hash table would be
8*pow2_round(nsymbols) bits, i.e. pow2_round(nsymbols)/4 32-bit
integers.
Best regards.

Jakub Jelinek
2006-07-07 14:44:06 UTC
Permalink
Post by djamel anonymous
Hello, I am writing this time about the first variant. After looking at
the benchmark results I noted that there has been a reduction in the
number of L1 cache misses; each avoided L1 cache miss is a win of 12
cycles, the difference between the L2 and L1 latencies (15 - 3). On the
other hand, replacing a division by a binary AND is a win of at least
25 cycles, so I think that avoiding the division in the common case may
improve performance.
I don't think it is the modulo that matters, but rather the smaller
footprint of .gnu.hash in that case. I have implemented what I think
you meant, and the numbers actually convinced me.
So here is the new set of patches and the new statistics.
take1 is the 2006-06-28 state of things, take2 2006-07-03, take3
2006-07-05, and take4 is what is attached here.

Jakub
djamel anonymous
2006-07-07 15:56:59 UTC
Permalink
Hello, first I thank you for testing my suggestion. In the patch I have
noted the current line
cinfo.maskbits = bfd_log2 (cinfo.nsyms) + 1;
The size of the hash table is between cinfo.nsyms*2 and cinfo.nsyms*4
bits, which will let between 1/2 and 1/4 of the unsuccessful queries
through to the buckets. Perhaps a size of bfd_log2 (cinfo.nsyms) + 2 or
bfd_log2 (cinfo.nsyms) + 3 would give better results.
The rounding I was suggesting was to round cinfo.nsyms to the nearest
power of two and then multiply by 8; this would have given
8*cinfo.nsyms bits on average, but perhaps with that the table would
have been too big to fit inside the cache. The code to do that would
have been:
tmp = bfd_log2 (cinfo.nsyms) * 8;
if (tmp & (1 << (bfd_log2 (cinfo.nsyms) - 2)))
  cinfo.maskbits = tmp;
else
  cinfo.maskbits = tmp - 1;
There is perhaps an ideal value of maskbits that would have to be
selected by benchmarking different values; a large value of maskbits
would avoid many accesses to the buckets, but reduce the chances of the
bitmask fitting in the L1 cache.
Best regards.

Jakub Jelinek
2006-07-09 22:27:03 UTC
Permalink
Post by djamel anonymous
Hello, first I thank you for testing my suggestion. In the patch I have
noted the current line
cinfo.maskbits = bfd_log2 (cinfo.nsyms) + 1;
bfd_log2 is not 32 - __builtin_clz.
cinfo.maskbits = bfd_log2 (cinfo.nsyms) + 1;
if (cinfo.maskbits < 3)
  cinfo.maskbits = 1 << 5;
else if ((1 << (cinfo.maskbits - 2)) & cinfo.nsyms)
  cinfo.maskbits = 1 << (cinfo.maskbits + 3);
else
  cinfo.maskbits = 1 << (cinfo.maskbits + 2);

gives you approx. NSYMS bytes in the bitmask
(between NSYMS*(2/3)*8 and NSYMS*(4/3)*8 bits).
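For instance, with NSYMS = 1000 the initial value is
bfd_log2 (1000) + 1 = 10; bit 8 of 1000 is set, so maskbits becomes
1 << 13 = 8192 bits = 1024 bytes, i.e. almost exactly NSYMS bytes.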

Jakub
Nick Clifton
2006-07-10 14:17:52 UTC
Permalink
Hi Jakub,
Post by Jakub Jelinek
include/
* bfdlink.h (struct bfd_link_info): Add emit_hash and
emit_gnu_hash bitfields.
include/elf/
* common.h (SHT_GNU_HASH, DT_GNU_HASH): Define.
ld/
* scripttempl/elf.sc: Add .gnu.hash section.
* emultempl/elf32.em (OPTION_HASH_STYLE): Define.
(gld${EMULATION_NAME}_add_options): Register --hash-style option.
(gld${EMULATION_NAME}_handle_option): Handle it.
(gld${EMULATION_NAME}_list_options): Document it.
* ldmain.c (main): Initialize emit_hash and emit_gnu_hash.
* ld.texinfo: Document --hash-style option.
bfd/
* elf.c (_bfd_elf_print_private_bfd_data): Handle DT_GNU_HASH.
Handle SHT_GNU_HASH.
(special_sections_g): Include .gnu.hash section.
(bfd_elf_gnu_hash): New function.
* elf-bfd.h (bfd_elf_gnu_hash, _bfd_elf_hash_symbol): New prototypes.
(struct elf_backend_data): Add elf_hash_symbol method.
* elflink.c (_bfd_elf_link_create_dynamic_sections): Create .hash
only if info->emit_hash, create .gnu.hash section if
info->emit_gnu_hash.
(struct collect_gnu_hash_codes): New type.
(elf_collect_gnu_hash_codes, elf_renumber_gnu_hash_syms,
_bfd_elf_hash_symbol): New functions.
(compute_bucket_count): Don't compute HASHCODES array, instead add
that and NSYMS as arguments. Use bed->s->sizeof_hash_entry
instead of bed->s->arch_size / 8. Fix .hash size estimation.
When not optimizing, use the number of hashed symbols rather than
dynsymcount.
(bfd_elf_size_dynamic_sections): Only add DT_HASH if info->emit_hash,
and ADD DT_GNU_HASH if info->emit_gnu_hash.
(bfd_elf_size_dynsym_hash_dynstr): Size .hash only if info->emit_hash,
adjust compute_bucket_count caller. Create and populate .gnu.hash
section if info->emit_gnu_hash.
(elf_link_output_extsym): Only populate .hash section if
finfo->hash_sec != NULL.
(bfd_elf_final_link): Adjust assertion. Handle DT_GNU_HASH.
* elfxx-target.h (elf_backend_hash_symbol): Define if not yet defined.
(elfNN_bed): Add elf_backend_hash_symbol.
* elf64-x86-64.c (elf64_x86_64_hash_symbol): New function.
(elf_backend_hash_symbol): Define.
* elf32-i386.c (elf_i386_hash_symbol): New function.
(elf_backend_hash_symbol): Define.
binutils/
* readelf.c (get_dynamic_type): Handle DT_GNU_HASH.
(get_section_type_name): Handle SHT_GNU_HASH.
(dynamic_info_DT_GNU_HASH): New variable.
(process_dynamic_section): Handle DT_GNU_HASH.
(process_symbol_table): Print also DT_GNU_HASH histogram.
Approved - please apply.

Cheers
Nick
Jakub Jelinek
2006-07-10 21:51:35 UTC
Permalink
Post by Nick Clifton
Approved - please apply.
Thanks. I have done some small tweaks so that the .gnu.hash
section can support (almost) no bloom filter (setting bitmaskwords
in the header to 1 and all ones in the bitmask), a 1-bit bloom filter
(that's what the Friday version did, with shift 0 in the header), or a
2-bit bloom filter (which improves performance by around 3%).
Here is what I have committed; Ulrich is now committing the glibc
side to the libc CVS.
On x86-64, the version from Friday on the oowriter run had:
85.07% - filtered out by the 1 bit bloom filter
3.35% - empty chain (hits just 2 cache lines)
11.58% - chain walk (3 cache lines)
This version has:
91.30% - filtered out by 2 bit 64-bit bloom filter (the same size of the
bitmask array as before)
1.62% - empty chain
7.08% - chain walk
The same with LD_BIND_NOW=1 oowriter, before:
85.17% - bloom filter
3.35% - empty chain
11.48% - chain walk
Now:
91.44% - 2 bit 64-bit bloom filter
1.60% - empty chain
6.96% - chain walk

We have experimented with making the bitmask array smaller, but that
turned out to be a bad idea and it showed very clearly in the numbers.
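
As a minimal consumer-side sketch of the 2-bit filter for 64-bit ELF
(illustrative, not the glibc code; the function name is mine, and it
assumes maskwords is a power of two, which the linker below
guarantees):

#include <stdint.h>

/* h is the new-style hash of the symbol name; bitmask, maskwords and
   shift2 come from the .gnu.hash header.  Returns 0 when the symbol is
   definitely absent, 1 when the buckets must be consulted.  */
static int
bloom_may_contain (const uint64_t *bitmask, uint32_t maskwords,
                   uint32_t shift2, uint32_t h)
{
  uint64_t word = bitmask[(h / 64) & (maskwords - 1)];
  uint64_t mask = ((uint64_t) 1 << (h % 64))
                  | ((uint64_t) 1 << ((h >> shift2) % 64));
  return (word & mask) == mask;   /* both bits set => maybe present */
}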

2006-07-10 Jakub Jelinek <***@redhat.com>

include/
* bfdlink.h (struct bfd_link_info): Add emit_hash and
emit_gnu_hash bitfields.
include/elf/
* common.h (SHT_GNU_HASH, DT_GNU_HASH): Define.
ld/
* scripttempl/elf.sc: Add .gnu.hash section.
* emultempl/elf32.em (OPTION_HASH_STYLE): Define.
(gld${EMULATION_NAME}_add_options): Register --hash-style option.
(gld${EMULATION_NAME}_handle_option): Handle it.
(gld${EMULATION_NAME}_list_options): Document it.
* ldmain.c (main): Initialize emit_hash and emit_gnu_hash.
* ld.texinfo: Document --hash-style option.
ld/testsuite/
* ld-powerpc/tlsso32.r: Adjust.
* ld-powerpc/tlsso32.d: Adjust.
* ld-powerpc/tlsso32.g: Adjust.
* ld-powerpc/tlsso.r: Adjust.
* ld-powerpc/tlsso.g: Adjust.
* ld-powerpc/tlstocso.g: Adjust.
bfd/
* elf.c (_bfd_elf_print_private_bfd_data): Handle DT_GNU_HASH.
(bfd_section_from_shdr, elf_fake_sections, assign_section_numbers):
Handle SHT_GNU_HASH.
(special_sections_g): Include .gnu.hash section.
(bfd_elf_gnu_hash): New function.
* elf-bfd.h (bfd_elf_gnu_hash, _bfd_elf_hash_symbol): New prototypes.
(struct elf_backend_data): Add elf_hash_symbol method.
* elflink.c (_bfd_elf_link_create_dynamic_sections): Create .hash
only if info->emit_hash, create .gnu.hash section if
info->emit_gnu_hash.
(struct collect_gnu_hash_codes): New type.
(elf_collect_gnu_hash_codes, elf_renumber_gnu_hash_syms,
_bfd_elf_hash_symbol): New functions.
(compute_bucket_count): Don't compute HASHCODES array, instead add
that and NSYMS as arguments. Use bed->s->sizeof_hash_entry
instead of bed->s->arch_size / 8. Fix .hash size estimation.
When not optimizing, use the number of hashed symbols rather than
dynsymcount.
(bfd_elf_size_dynamic_sections): Only add DT_HASH if info->emit_hash,
and ADD DT_GNU_HASH if info->emit_gnu_hash.
(bfd_elf_size_dynsym_hash_dynstr): Size .hash only if info->emit_hash,
adjust compute_bucket_count caller. Create and populate .gnu.hash
section if info->emit_gnu_hash.
(elf_link_output_extsym): Only populate .hash section if
finfo->hash_sec != NULL.
(bfd_elf_final_link): Adjust assertion. Handle DT_GNU_HASH.
* elfxx-target.h (elf_backend_hash_symbol): Define if not yet defined.
(elfNN_bed): Add elf_backend_hash_symbol.
* elf64-x86-64.c (elf64_x86_64_hash_symbol): New function.
(elf_backend_hash_symbol): Define.
* elf32-i386.c (elf_i386_hash_symbol): New function.
(elf_backend_hash_symbol): Define.
binutils/
* readelf.c (get_dynamic_type): Handle DT_GNU_HASH.
(get_section_type_name): Handle SHT_GNU_HASH.
(dynamic_info_DT_GNU_HASH): New variable.
(process_dynamic_section): Handle DT_GNU_HASH.
(process_symbol_table): Print also DT_GNU_HASH histogram.
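
As a hypothetical usage sketch (file names illustrative; the option
itself is documented in the ld.texinfo hunk below):

# Emit both the classic .hash and the new .gnu.hash section:
gcc -shared -fPIC -Wl,--hash-style=both -o libfoo.so foo.c
# The patched readelf should then report a GNU_HASH dynamic tag and
# an SHT_GNU_HASH section:
readelf -d libfoo.so | grep -i hash
readelf -S libfoo.so | grep gnu.hash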

--- ld/scripttempl/elf.sc.jj 2006-01-01 01:02:16.000000000 +0100
+++ ld/scripttempl/elf.sc 2006-06-22 11:11:53.000000000 +0200
@@ -260,6 +260,7 @@ SECTIONS
${INITIAL_READONLY_SECTIONS}
${TEXT_DYNAMIC+${DYNAMIC}}
.hash ${RELOCATING-0} : { *(.hash) }
+ .gnu.hash ${RELOCATING-0} : { *(.gnu.hash) }
.dynsym ${RELOCATING-0} : { *(.dynsym) }
.dynstr ${RELOCATING-0} : { *(.dynstr) }
.gnu.version ${RELOCATING-0} : { *(.gnu.version) }
--- ld/ldmain.c.jj 2006-06-01 15:50:33.000000000 +0200
+++ ld/ldmain.c 2006-06-22 11:21:11.000000000 +0200
@@ -304,6 +304,8 @@ main (int argc, char **argv)
link_info.create_object_symbols_section = NULL;
link_info.gc_sym_list = NULL;
link_info.base_file = NULL;
+ link_info.emit_hash = TRUE;
+ link_info.emit_gnu_hash = FALSE;
/* SVR4 linkers seem to set DT_INIT and DT_FINI based on magic _init
and _fini symbols. We are compatible. */
link_info.init_function = "_init";
--- ld/ld.texinfo.jj 2006-06-15 14:31:06.000000000 +0200
+++ ld/ld.texinfo 2006-06-22 14:03:21.000000000 +0200
@@ -1883,6 +1883,14 @@ time it takes the linker to perform its
increasing the linker's memory requirements. Similarly reducing this
value can reduce the memory requirements at the expense of speed.

+@kindex --hash-style=@var{style}
+@item --hash-style=@var{style}
+Set the type of linker's hash table(s). @var{style} can be either
+@code{sysv} for classic ELF @code{.hash} section, @code{gnu} for
+new style GNU @code{.gnu.hash} section or @code{both} for both
+the classic ELF @code{.hash} and new style GNU @code{.gnu.hash}
+hash tables. The default is @code{sysv}.
+
@kindex --reduce-memory-overheads
@item --reduce-memory-overheads
This option reduces memory requirements at ld runtime, at the expense of
--- ld/emultempl/elf32.em.jj 2006-06-20 18:34:24.000000000 +0200
+++ ld/emultempl/elf32.em 2006-06-22 14:39:25.000000000 +0200
@@ -1719,6 +1719,7 @@ cat >>e${EMULATION_NAME}.c <<EOF
#define OPTION_GROUP (OPTION_ENABLE_NEW_DTAGS + 1)
#define OPTION_EH_FRAME_HDR (OPTION_GROUP + 1)
#define OPTION_EXCLUDE_LIBS (OPTION_EH_FRAME_HDR + 1)
+#define OPTION_HASH_STYLE (OPTION_EXCLUDE_LIBS + 1)

static void
gld${EMULATION_NAME}_add_options
@@ -1735,6 +1736,7 @@ cat >>e${EMULATION_NAME}.c <<EOF
{"enable-new-dtags", no_argument, NULL, OPTION_ENABLE_NEW_DTAGS},
{"eh-frame-hdr", no_argument, NULL, OPTION_EH_FRAME_HDR},
{"exclude-libs", required_argument, NULL, OPTION_EXCLUDE_LIBS},
+ {"hash-style", required_argument, NULL, OPTION_HASH_STYLE},
{"Bgroup", no_argument, NULL, OPTION_GROUP},
EOF
fi
@@ -1791,6 +1793,22 @@ cat >>e${EMULATION_NAME}.c <<EOF
add_excluded_libs (optarg);
break;

+ case OPTION_HASH_STYLE:
+ link_info.emit_hash = FALSE;
+ link_info.emit_gnu_hash = FALSE;
+ if (strcmp (optarg, "sysv") == 0)
+ link_info.emit_hash = TRUE;
+ else if (strcmp (optarg, "gnu") == 0)
+ link_info.emit_gnu_hash = TRUE;
+ else if (strcmp (optarg, "both") == 0)
+ {
+ link_info.emit_hash = TRUE;
+ link_info.emit_gnu_hash = TRUE;
+ }
+ else
+ einfo (_("%P%F: invalid hash style \`%s'\n"), optarg);
+ break;
+
case 'z':
if (strcmp (optarg, "initfirst") == 0)
link_info.flags_1 |= (bfd_vma) DF_1_INITFIRST;
@@ -1894,6 +1912,7 @@ cat >>e${EMULATION_NAME}.c <<EOF
fprintf (file, _(" --disable-new-dtags\tDisable new dynamic tags\n"));
fprintf (file, _(" --enable-new-dtags\tEnable new dynamic tags\n"));
fprintf (file, _(" --eh-frame-hdr\tCreate .eh_frame_hdr section\n"));
+ fprintf (file, _(" --hash-style=STYLE\tSet hash style to sysv, gnu or both\n"));
fprintf (file, _(" -z combreloc\t\tMerge dynamic relocs into one section and sort\n"));
fprintf (file, _(" -z defs\t\tReport unresolved symbols in object files.\n"));
fprintf (file, _(" -z execstack\t\tMark executable as requiring executable stack\n"));
--- bfd/elf-bfd.h.jj 2006-06-20 18:34:24.000000000 +0200
+++ bfd/elf-bfd.h 2006-07-06 15:48:12.000000000 +0200
@@ -1038,6 +1038,9 @@ struct elf_backend_data
bfd_boolean *, bfd_boolean *,
bfd *, asection **);

+ /* Return TRUE if symbol should be hashed in the `.gnu.hash' section. */
+ bfd_boolean (*elf_hash_symbol) (struct elf_link_hash_entry *);
+
/* Used to handle bad SHF_LINK_ORDER input. */
bfd_error_handler_type link_order_error_handler;

@@ -1481,6 +1484,8 @@ extern bfd_vma _bfd_elf_section_offset

extern unsigned long bfd_elf_hash
(const char *);
+extern unsigned long bfd_elf_gnu_hash
+ (const char *);

extern bfd_reloc_status_type bfd_elf_generic_reloc
(bfd *, arelent *, asymbol *, void *, asection *, bfd *, char **);
@@ -1651,6 +1656,8 @@ extern bfd_boolean _bfd_elf_merge_symbol
struct elf_link_hash_entry **, bfd_boolean *,
bfd_boolean *, bfd_boolean *, bfd_boolean *);

+extern bfd_boolean _bfd_elf_hash_symbol (struct elf_link_hash_entry *);
+
extern bfd_boolean _bfd_elf_add_default_symbol
(bfd *, struct bfd_link_info *, struct elf_link_hash_entry *,
const char *, Elf_Internal_Sym *, asection **, bfd_vma *,
--- bfd/elf64-x86-64.c.jj 2006-06-20 11:57:19.000000000 +0200
+++ bfd/elf64-x86-64.c 2006-07-06 16:54:21.000000000 +0200
@@ -3615,6 +3615,19 @@ elf64_x86_64_additional_program_headers
return count;
}

+/* Return TRUE if symbol should be hashed in the `.gnu.hash' section. */
+
+static bfd_boolean
+elf64_x86_64_hash_symbol (struct elf_link_hash_entry *h)
+{
+ if (h->plt.offset != (bfd_vma) -1
+ && !h->def_regular
+ && !h->pointer_equality_needed)
+ return FALSE;
+
+ return _bfd_elf_hash_symbol (h);
+}
+
static const struct bfd_elf_special_section
elf64_x86_64_special_sections[]=
{
@@ -3688,5 +3701,7 @@ static const struct bfd_elf_special_sect
elf64_x86_64_special_sections
#define elf_backend_additional_program_headers \
elf64_x86_64_additional_program_headers
+#define elf_backend_hash_symbol \
+ elf64_x86_64_hash_symbol

#include "elf64-target.h"
--- bfd/elf.c.jj 2006-06-20 18:34:24.000000000 +0200
+++ bfd/elf.c 2006-07-10 19:10:40.000000000 +0200
@@ -206,6 +206,21 @@ bfd_elf_hash (const char *namearg)
return h & 0xffffffff;
}

+/* DT_GNU_HASH hash function. Do not change this function; you will
+ cause invalid hash tables to be generated. */
+
+unsigned long
+bfd_elf_gnu_hash (const char *namearg)
+{
+ const unsigned char *name = (const unsigned char *) namearg;
+ unsigned long h = 5381;
+ unsigned char ch;
+
+ while ((ch = *name++) != '\0')
+ h = (h << 5) + h + ch;
+ return h & 0xffffffff;
+}
+
bfd_boolean
bfd_elf_mkobject (bfd *abfd)
{
@@ -1239,6 +1254,7 @@ _bfd_elf_print_private_bfd_data (bfd *ab
case DT_AUXILIARY: name = "AUXILIARY"; stringp = TRUE; break;
case DT_USED: name = "USED"; break;
case DT_FILTER: name = "FILTER"; stringp = TRUE; break;
+ case DT_GNU_HASH: name = "GNU_HASH"; break;
}

fprintf (f, " %-11s ", name);
@@ -1823,6 +1839,7 @@ bfd_section_from_shdr (bfd *abfd, unsign
case SHT_FINI_ARRAY: /* .fini_array section. */
case SHT_PREINIT_ARRAY: /* .preinit_array section. */
case SHT_GNU_LIBLIST: /* .gnu.liblist section. */
+ case SHT_GNU_HASH: /* .gnu.hash section. */
return _bfd_elf_make_section_from_shdr (abfd, hdr, name, shindex);

case SHT_DYNAMIC: /* Dynamic linking information. */
@@ -2295,6 +2312,7 @@ static const struct bfd_elf_special_sect
{ ".gnu.version_r", 14, 0, SHT_GNU_verneed, 0 },
{ ".gnu.liblist", 12, 0, SHT_GNU_LIBLIST, SHF_ALLOC },
{ ".gnu.conflict", 13, 0, SHT_RELA, SHF_ALLOC },
+ { ".gnu.hash", 9, 0, SHT_GNU_HASH, SHF_ALLOC },
{ NULL, 0, 0, 0, 0 }
};

@@ -2811,6 +2829,10 @@ elf_fake_sections (bfd *abfd, asection *
case SHT_GROUP:
this_hdr->sh_entsize = 4;
break;
+
+ case SHT_GNU_HASH:
+ this_hdr->sh_entsize = bed->s->arch_size == 64 ? 0 : 4;
+ break;
}

if ((asect->flags & SEC_ALLOC) != 0)
@@ -3256,6 +3278,7 @@ assign_section_numbers (bfd *abfd, struc
break;

case SHT_HASH:
+ case SHT_GNU_HASH:
case SHT_GNU_versym:
/* sh_link is the section header index of the symbol table
this hash table or version table is for. */
--- bfd/elf32-i386.c.jj 2006-06-23 15:32:34.000000000 +0200
+++ bfd/elf32-i386.c 2006-07-06 16:56:24.000000000 +0200
@@ -3835,6 +3835,18 @@ elf_i386_plt_sym_val (bfd_vma i, const a
return plt->vma + (i + 1) * PLT_ENTRY_SIZE;
}

+/* Return TRUE if symbol should be hashed in the `.gnu.hash' section. */
+
+static bfd_boolean
+elf_i386_hash_symbol (struct elf_link_hash_entry *h)
+{
+ if (h->plt.offset != (bfd_vma) -1
+ && !h->def_regular
+ && !h->pointer_equality_needed)
+ return FALSE;
+
+ return _bfd_elf_hash_symbol (h);
+}

#define TARGET_LITTLE_SYM bfd_elf32_i386_vec
#define TARGET_LITTLE_NAME "elf32-i386"
@@ -3875,6 +3887,7 @@ elf_i386_plt_sym_val (bfd_vma i, const a
#define elf_backend_size_dynamic_sections elf_i386_size_dynamic_sections
#define elf_backend_always_size_sections elf_i386_always_size_sections
#define elf_backend_plt_sym_val elf_i386_plt_sym_val
+#define elf_backend_hash_symbol elf_i386_hash_symbol

#include "elf32-target.h"

--- bfd/elflink.c.jj 2006-06-20 18:34:53.000000000 +0200
+++ bfd/elflink.c 2006-07-10 19:01:44.000000000 +0200
@@ -240,12 +240,30 @@ _bfd_elf_link_create_dynamic_sections (b
if (!_bfd_elf_define_linkage_sym (abfd, info, s, "_DYNAMIC"))
return FALSE;

- s = bfd_make_section_with_flags (abfd, ".hash",
- flags | SEC_READONLY);
- if (s == NULL
- || ! bfd_set_section_alignment (abfd, s, bed->s->log_file_align))
- return FALSE;
- elf_section_data (s)->this_hdr.sh_entsize = bed->s->sizeof_hash_entry;
+ if (info->emit_hash)
+ {
+ s = bfd_make_section_with_flags (abfd, ".hash", flags | SEC_READONLY);
+ if (s == NULL
+ || ! bfd_set_section_alignment (abfd, s, bed->s->log_file_align))
+ return FALSE;
+ elf_section_data (s)->this_hdr.sh_entsize = bed->s->sizeof_hash_entry;
+ }
+
+ if (info->emit_gnu_hash)
+ {
+ s = bfd_make_section_with_flags (abfd, ".gnu.hash",
+ flags | SEC_READONLY);
+ if (s == NULL
+ || ! bfd_set_section_alignment (abfd, s, bed->s->log_file_align))
+ return FALSE;
+ /* For 64-bit ELF, .gnu.hash is a non-uniform entity size section:
+ 4 32-bit words followed by variable count of 64-bit words, then
+ variable count of 32-bit words. */
+ if (bed->s->arch_size == 64)
+ elf_section_data (s)->this_hdr.sh_entsize = 0;
+ else
+ elf_section_data (s)->this_hdr.sh_entsize = 4;
+ }

/* Let the backend create the rest of the sections. This lets the
backend set the right flags. The backend will normally create
@@ -4811,6 +4829,131 @@ elf_collect_hash_codes (struct elf_link_
return TRUE;
}

+struct collect_gnu_hash_codes
+{
+ bfd *output_bfd;
+ const struct elf_backend_data *bed;
+ unsigned long int nsyms;
+ unsigned long int maskbits;
+ unsigned long int *hashcodes;
+ unsigned long int *hashval;
+ unsigned long int *indx;
+ unsigned long int *counts;
+ bfd_vma *bitmask;
+ bfd_byte *contents;
+ long int min_dynindx;
+ unsigned long int bucketcount;
+ unsigned long int symindx;
+ long int local_indx;
+ long int shift1, shift2;
+ unsigned long int mask;
+};
+
+/* This function will be called through elf_link_hash_traverse to store
+ the hash values of the exported symbols in an array. */
+
+static bfd_boolean
+elf_collect_gnu_hash_codes (struct elf_link_hash_entry *h, void *data)
+{
+ struct collect_gnu_hash_codes *s = data;
+ const char *name;
+ char *p;
+ unsigned long ha;
+ char *alc = NULL;
+
+ if (h->root.type == bfd_link_hash_warning)
+ h = (struct elf_link_hash_entry *) h->root.u.i.link;
+
+ /* Ignore indirect symbols. These are added by the versioning code. */
+ if (h->dynindx == -1)
+ return TRUE;
+
+ /* Ignore also local symbols and undefined symbols. */
+ if (! (*s->bed->elf_hash_symbol) (h))
+ return TRUE;
+
+ name = h->root.root.string;
+ p = strchr (name, ELF_VER_CHR);
+ if (p != NULL)
+ {
+ alc = bfd_malloc (p - name + 1);
+ memcpy (alc, name, p - name);
+ alc[p - name] = '\0';
+ name = alc;
+ }
+
+ /* Compute the hash value. */
+ ha = bfd_elf_gnu_hash (name);
+
+ /* Store the found hash value in the array for compute_bucket_count,
+ and also for .dynsym reordering purposes. */
+ s->hashcodes[s->nsyms] = ha;
+ s->hashval[h->dynindx] = ha;
+ ++s->nsyms;
+ if (s->min_dynindx < 0 || s->min_dynindx > h->dynindx)
+ s->min_dynindx = h->dynindx;
+
+ if (alc != NULL)
+ free (alc);
+
+ return TRUE;
+}
+
+/* This function will be called through elf_link_hash_traverse to do
+ final dynamic symbol renumbering. */
+
+static bfd_boolean
+elf_renumber_gnu_hash_syms (struct elf_link_hash_entry *h, void *data)
+{
+ struct collect_gnu_hash_codes *s = data;
+ unsigned long int bucket;
+ unsigned long int val;
+
+ if (h->root.type == bfd_link_hash_warning)
+ h = (struct elf_link_hash_entry *) h->root.u.i.link;
+
+ /* Ignore indirect symbols. */
+ if (h->dynindx == -1)
+ return TRUE;
+
+ /* Ignore also local symbols and undefined symbols. */
+ if (! (*s->bed->elf_hash_symbol) (h))
+ {
+ if (h->dynindx >= s->min_dynindx)
+ h->dynindx = s->local_indx++;
+ return TRUE;
+ }
+
+ bucket = s->hashval[h->dynindx] % s->bucketcount;
+ val = (s->hashval[h->dynindx] >> s->shift1)
+ & ((s->maskbits >> s->shift1) - 1);
+ s->bitmask[val] |= ((bfd_vma) 1) << (s->hashval[h->dynindx] & s->mask);
+ s->bitmask[val]
+ |= ((bfd_vma) 1) << ((s->hashval[h->dynindx] >> s->shift2) & s->mask);
+ val = s->hashval[h->dynindx] & ~(unsigned long int) 1;
+ if (s->counts[bucket] == 1)
+ /* Last element terminates the chain. */
+ val |= 1;
+ bfd_put_32 (s->output_bfd, val,
+ s->contents + (s->indx[bucket] - s->symindx) * 4);
+ --s->counts[bucket];
+ h->dynindx = s->indx[bucket]++;
+ return TRUE;
+}
+
+/* Return TRUE if symbol should be hashed in the `.gnu.hash' section. */
+
+bfd_boolean
+_bfd_elf_hash_symbol (struct elf_link_hash_entry *h)
+{
+ return !(h->forced_local
+ || h->root.type == bfd_link_hash_undefined
+ || h->root.type == bfd_link_hash_undefweak
+ || ((h->root.type == bfd_link_hash_defined
+ || h->root.type == bfd_link_hash_defweak)
+ && h->root.u.def.section->output_section == NULL));
+}
+
/* Array used to determine the number of hash table buckets to use
based on the number of symbols there are. If there are fewer than
3 symbols we use 1 bucket, fewer than 17 symbols we use 3 buckets,
@@ -4832,42 +4975,26 @@ static const size_t elf_buckets[] =
Therefore the result is always a good payoff between few collisions
(= short chain lengths) and table size. */
static size_t
-compute_bucket_count (struct bfd_link_info *info)
+compute_bucket_count (struct bfd_link_info *info, unsigned long int *hashcodes,
+ unsigned long int nsyms, int gnu_hash)
{
size_t dynsymcount = elf_hash_table (info)->dynsymcount;
size_t best_size = 0;
- unsigned long int *hashcodes;
- unsigned long int *hashcodesp;
unsigned long int i;
bfd_size_type amt;

- /* Compute the hash values for all exported symbols. At the same
- time store the values in an array so that we could use them for
- optimizations. */
- amt = dynsymcount;
- amt *= sizeof (unsigned long int);
- hashcodes = bfd_malloc (amt);
- if (hashcodes == NULL)
- return 0;
- hashcodesp = hashcodes;
-
- /* Put all hash values in HASHCODES. */
- elf_link_hash_traverse (elf_hash_table (info),
- elf_collect_hash_codes, &hashcodesp);
-
/* We have a problem here. The following code to optimize the table
size requires an integer type with more the 32 bits. If
BFD_HOST_U_64_BIT is set we know about such a type. */
#ifdef BFD_HOST_U_64_BIT
if (info->optimize)
{
- unsigned long int nsyms = hashcodesp - hashcodes;
size_t minsize;
size_t maxsize;
BFD_HOST_U_64_BIT best_chlen = ~((BFD_HOST_U_64_BIT) 0);
- unsigned long int *counts ;
bfd *dynobj = elf_hash_table (info)->dynobj;
const struct elf_backend_data *bed = get_elf_backend_data (dynobj);
+ unsigned long int *counts;

/* Possible optimization parameters: if we have NSYMS symbols we say
that the hashing table must at least have NSYMS/4 and at most
@@ -4876,6 +5003,13 @@ compute_bucket_count (struct bfd_link_in
if (minsize == 0)
minsize = 1;
best_size = maxsize = nsyms * 2;
+ if (gnu_hash)
+ {
+ if (minsize < 2)
+ minsize = 2;
+ if ((best_size & 31) == 0)
+ ++best_size;
+ }

/* Create array where we count the collisions in. We must use bfd_malloc
since the size could be large. */
@@ -4883,10 +5017,7 @@ compute_bucket_count (struct bfd_link_in
amt *= sizeof (unsigned long int);
counts = bfd_malloc (amt);
if (counts == NULL)
- {
- free (hashcodes);
- return 0;
- }
+ return 0;

/* Compute the "optimal" size for the hash table. The criteria is a
minimal chain length. The minor criteria is (of course) the size
@@ -4898,6 +5029,9 @@ compute_bucket_count (struct bfd_link_in
unsigned long int j;
unsigned long int fact;

+ if (gnu_hash && (i & 31) == 0)
+ continue;
+
memset (counts, '\0', i * sizeof (unsigned long int));

/* Determine how often each hash bucket is used. */
@@ -4913,9 +5047,9 @@ compute_bucket_count (struct bfd_link_in
# define BFD_TARGET_PAGESIZE (4096)
# endif

- /* We in any case need 2 + NSYMS entries for the size values and
- the chains. */
- max = (2 + nsyms) * (bed->s->arch_size / 8);
+ /* We in any case need 2 + DYNSYMCOUNT entries for the size values
+ and the chains. */
+ max = (2 + dynsymcount) * bed->s->sizeof_hash_entry;

# if 1
/* Variant 1: optimize for short chains. We add the squares
@@ -4925,7 +5059,7 @@ compute_bucket_count (struct bfd_link_in
max += counts[j] * counts[j];

/* This adds penalties for the overall size of the table. */
- fact = i / (BFD_TARGET_PAGESIZE / (bed->s->arch_size / 8)) + 1;
+ fact = i / (BFD_TARGET_PAGESIZE / bed->s->sizeof_hash_entry) + 1;
max *= fact * fact;
# else
/* Variant 2: Optimize a lot more for small table. Here we
@@ -4936,7 +5070,7 @@ compute_bucket_count (struct bfd_link_in

/* The overall size of the table is considered, but not as
strong as in variant 1, where it is squared. */
- fact = i / (BFD_TARGET_PAGESIZE / (bed->s->arch_size / 8)) + 1;
+ fact = i / (BFD_TARGET_PAGESIZE / bed->s->sizeof_hash_entry) + 1;
max *= fact;
# endif

@@ -4959,14 +5093,13 @@ compute_bucket_count (struct bfd_link_in
for (i = 0; elf_buckets[i] != 0; i++)
{
best_size = elf_buckets[i];
- if (dynsymcount < elf_buckets[i + 1])
+ if (nsyms < elf_buckets[i + 1])
break;
}
+ if (gnu_hash && best_size < 2)
+ best_size = 2;
}

- /* Free the arrays we needed. */
- free (hashcodes);
-
return best_size;
}

@@ -5324,7 +5457,10 @@ bfd_elf_size_dynamic_sections (bfd *outp
bfd_size_type strsize;

strsize = _bfd_elf_strtab_size (elf_hash_table (info)->dynstr);
- if (!_bfd_elf_add_dynamic_entry (info, DT_HASH, 0)
+ if ((info->emit_hash
+ && !_bfd_elf_add_dynamic_entry (info, DT_HASH, 0))
+ || (info->emit_gnu_hash
+ && !_bfd_elf_add_dynamic_entry (info, DT_GNU_HASH, 0))
|| !_bfd_elf_add_dynamic_entry (info, DT_STRTAB, 0)
|| !_bfd_elf_add_dynamic_entry (info, DT_SYMTAB, 0)
|| !_bfd_elf_add_dynamic_entry (info, DT_STRSZ, strsize)
@@ -5726,8 +5862,6 @@ bfd_elf_size_dynsym_hash_dynstr (bfd *ou
asection *s;
bfd_size_type dynsymcount;
unsigned long section_sym_count;
- size_t bucketcount = 0;
- size_t hash_entry_size;
unsigned int dtagcount;

dynobj = elf_hash_table (info)->dynobj;
@@ -5778,23 +5912,215 @@ bfd_elf_size_dynsym_hash_dynstr (bfd *ou
memset (s->contents, 0, section_sym_count * bed->s->sizeof_sym);
}

+ elf_hash_table (info)->bucketcount = 0;
+
/* Compute the size of the hashing table. As a side effect this
computes the hash values for all the names we export. */
- bucketcount = compute_bucket_count (info);
+ if (info->emit_hash)
+ {
+ unsigned long int *hashcodes;
+ unsigned long int *hashcodesp;
+ bfd_size_type amt;
+ unsigned long int nsyms;
+ size_t bucketcount;
+ size_t hash_entry_size;
+
+ /* Compute the hash values for all exported symbols. At the same
+ time store the values in an array so that we could use them for
+ optimizations. */
+ amt = dynsymcount * sizeof (unsigned long int);
+ hashcodes = bfd_malloc (amt);
+ if (hashcodes == NULL)
+ return FALSE;
+ hashcodesp = hashcodes;

- s = bfd_get_section_by_name (dynobj, ".hash");
- BFD_ASSERT (s != NULL);
- hash_entry_size = elf_section_data (s)->this_hdr.sh_entsize;
- s->size = ((2 + bucketcount + dynsymcount) * hash_entry_size);
- s->contents = bfd_zalloc (output_bfd, s->size);
- if (s->contents == NULL)
- return FALSE;
+ /* Put all hash values in HASHCODES. */
+ elf_link_hash_traverse (elf_hash_table (info),
+ elf_collect_hash_codes, &hashcodesp);

- bfd_put (8 * hash_entry_size, output_bfd, bucketcount, s->contents);
- bfd_put (8 * hash_entry_size, output_bfd, dynsymcount,
- s->contents + hash_entry_size);
+ nsyms = hashcodesp - hashcodes;
+ bucketcount
+ = compute_bucket_count (info, hashcodes, nsyms, 0);
+ free (hashcodes);

- elf_hash_table (info)->bucketcount = bucketcount;
+ if (bucketcount == 0)
+ return FALSE;
+
+ elf_hash_table (info)->bucketcount = bucketcount;
+
+ s = bfd_get_section_by_name (dynobj, ".hash");
+ BFD_ASSERT (s != NULL);
+ hash_entry_size = elf_section_data (s)->this_hdr.sh_entsize;
+ s->size = ((2 + bucketcount + dynsymcount) * hash_entry_size);
+ s->contents = bfd_zalloc (output_bfd, s->size);
+ if (s->contents == NULL)
+ return FALSE;
+
+ bfd_put (8 * hash_entry_size, output_bfd, bucketcount, s->contents);
+ bfd_put (8 * hash_entry_size, output_bfd, dynsymcount,
+ s->contents + hash_entry_size);
+ }
+
+ if (info->emit_gnu_hash)
+ {
+ size_t i, cnt;
+ unsigned char *contents;
+ struct collect_gnu_hash_codes cinfo;
+ bfd_size_type amt;
+ size_t bucketcount;
+
+ memset (&cinfo, 0, sizeof (cinfo));
+
+ /* Compute the hash values for all exported symbols. At the same
+ time store the values in an array so that we could use them for
+ optimizations. */
+ amt = dynsymcount * 2 * sizeof (unsigned long int);
+ cinfo.hashcodes = bfd_malloc (amt);
+ if (cinfo.hashcodes == NULL)
+ return FALSE;
+
+ cinfo.hashval = cinfo.hashcodes + dynsymcount;
+ cinfo.min_dynindx = -1;
+ cinfo.output_bfd = output_bfd;
+ cinfo.bed = bed;
+
+ /* Put all hash values in HASHCODES. */
+ elf_link_hash_traverse (elf_hash_table (info),
+ elf_collect_gnu_hash_codes, &cinfo);
+
+ bucketcount
+ = compute_bucket_count (info, cinfo.hashcodes, cinfo.nsyms, 1);
+
+ if (bucketcount == 0)
+ {
+ free (cinfo.hashcodes);
+ return FALSE;
+ }
+
+ s = bfd_get_section_by_name (dynobj, ".gnu.hash");
+ BFD_ASSERT (s != NULL);
+
+ if (cinfo.nsyms == 0)
+ {
+ /* Empty .gnu.hash section is special. */
+ BFD_ASSERT (cinfo.min_dynindx == -1);
+ free (cinfo.hashcodes);
+ s->size = 5 * 4 + bed->s->arch_size / 8;
+ contents = bfd_zalloc (output_bfd, s->size);
+ if (contents == NULL)
+ return FALSE;
+ s->contents = contents;
+ /* 1 empty bucket. */
+ bfd_put_32 (output_bfd, 1, contents);
+ /* SYMIDX above the special symbol 0. */
+ bfd_put_32 (output_bfd, 1, contents + 4);
+ /* Just one word for bitmask. */
+ bfd_put_32 (output_bfd, 1, contents + 8);
+ /* Only hash fn bloom filter. */
+ bfd_put_32 (output_bfd, 0, contents + 12);
+ /* No hashes are valid - empty bitmask. */
+ bfd_put (bed->s->arch_size, output_bfd, 0, contents + 16);
+ /* No hashes in the only bucket. */
+ bfd_put_32 (output_bfd, 0,
+ contents + 16 + bed->s->arch_size / 8);
+ }
+ else
+ {
+ unsigned long int maskwords, maskbitslog2;
+ BFD_ASSERT (cinfo.min_dynindx != -1);
+
+ maskbitslog2 = bfd_log2 (cinfo.nsyms) + 1;
+ if (maskbitslog2 < 3)
+ maskbitslog2 = 5;
+ else if ((1 << (maskbitslog2 - 2)) & cinfo.nsyms)
+ maskbitslog2 = maskbitslog2 + 3;
+ else
+ maskbitslog2 = maskbitslog2 + 2;
+ if (bed->s->arch_size == 64)
+ {
+ if (maskbitslog2 == 5)
+ maskbitslog2 = 6;
+ cinfo.shift1 = 6;
+ }
+ else
+ cinfo.shift1 = 5;
+ cinfo.mask = (1 << cinfo.shift1) - 1;
+ cinfo.shift2 = maskbitslog2 + cinfo.shift1;
+ cinfo.maskbits = 1 << maskbitslog2;
+ maskwords = 1 << (maskbitslog2 - cinfo.shift1);
+ amt = bucketcount * sizeof (unsigned long int) * 2;
+ amt += maskwords * sizeof (bfd_vma);
+ cinfo.bitmask = bfd_malloc (amt);
+ if (cinfo.bitmask == NULL)
+ {
+ free (cinfo.hashcodes);
+ return FALSE;
+ }
+
+ cinfo.counts = (void *) (cinfo.bitmask + maskwords);
+ cinfo.indx = cinfo.counts + bucketcount;
+ cinfo.symindx = dynsymcount - cinfo.nsyms;
+ memset (cinfo.bitmask, 0, maskwords * sizeof (bfd_vma));
+
+ /* Determine how often each hash bucket is used. */
+ memset (cinfo.counts, 0, bucketcount * sizeof (cinfo.counts[0]));
+ for (i = 0; i < cinfo.nsyms; ++i)
+ ++cinfo.counts[cinfo.hashcodes[i] % bucketcount];
+
+ for (i = 0, cnt = cinfo.symindx; i < bucketcount; ++i)
+ if (cinfo.counts[i] != 0)
+ {
+ cinfo.indx[i] = cnt;
+ cnt += cinfo.counts[i];
+ }
+ BFD_ASSERT (cnt == dynsymcount);
+ cinfo.bucketcount = bucketcount;
+ cinfo.local_indx = cinfo.min_dynindx;
+
+ s->size = (4 + bucketcount + cinfo.nsyms) * 4;
+ s->size += cinfo.maskbits / 8;
+ contents = bfd_zalloc (output_bfd, s->size);
+ if (contents == NULL)
+ {
+ free (cinfo.bitmask);
+ free (cinfo.hashcodes);
+ return FALSE;
+ }
+
+ s->contents = contents;
+ bfd_put_32 (output_bfd, bucketcount, contents);
+ bfd_put_32 (output_bfd, cinfo.symindx, contents + 4);
+ bfd_put_32 (output_bfd, maskwords, contents + 8);
+ bfd_put_32 (output_bfd, cinfo.shift2, contents + 12);
+ contents += 16 + cinfo.maskbits / 8;
+
+ for (i = 0; i < bucketcount; ++i)
+ {
+ if (cinfo.counts[i] == 0)
+ bfd_put_32 (output_bfd, 0, contents);
+ else
+ bfd_put_32 (output_bfd, cinfo.indx[i], contents);
+ contents += 4;
+ }
+
+ cinfo.contents = contents;
+
+ /* Renumber dynamic symbols, populate .gnu.hash section. */
+ elf_link_hash_traverse (elf_hash_table (info),
+ elf_renumber_gnu_hash_syms, &cinfo);
+
+ contents = s->contents + 16;
+ for (i = 0; i < maskwords; ++i)
+ {
+ bfd_put (bed->s->arch_size, output_bfd, cinfo.bitmask[i],
+ contents);
+ contents += bed->s->arch_size / 8;
+ }
+
+ free (cinfo.bitmask);
+ free (cinfo.hashcodes);
+ }
+ }

s = bfd_get_section_by_name (dynobj, ".dynstr");
BFD_ASSERT (s != NULL);
@@ -6663,9 +6989,6 @@ elf_link_output_extsym (struct elf_link_
{
size_t bucketcount;
size_t bucket;
- size_t hash_entry_size;
- bfd_byte *bucketpos;
- bfd_vma chain;
bfd_byte *esym;

sym.st_name = h->dynstr_index;
@@ -6679,15 +7002,23 @@ elf_link_output_extsym (struct elf_link_

bucketcount = elf_hash_table (finfo->info)->bucketcount;
bucket = h->u.elf_hash_value % bucketcount;
- hash_entry_size
- = elf_section_data (finfo->hash_sec)->this_hdr.sh_entsize;
- bucketpos = ((bfd_byte *) finfo->hash_sec->contents
- + (bucket + 2) * hash_entry_size);
- chain = bfd_get (8 * hash_entry_size, finfo->output_bfd, bucketpos);
- bfd_put (8 * hash_entry_size, finfo->output_bfd, h->dynindx, bucketpos);
- bfd_put (8 * hash_entry_size, finfo->output_bfd, chain,
- ((bfd_byte *) finfo->hash_sec->contents
- + (bucketcount + 2 + h->dynindx) * hash_entry_size));
+
+ if (finfo->hash_sec != NULL)
+ {
+ size_t hash_entry_size;
+ bfd_byte *bucketpos;
+ bfd_vma chain;
+
+ hash_entry_size
+ = elf_section_data (finfo->hash_sec)->this_hdr.sh_entsize;
+ bucketpos = ((bfd_byte *) finfo->hash_sec->contents
+ + (bucket + 2) * hash_entry_size);
+ chain = bfd_get (8 * hash_entry_size, finfo->output_bfd, bucketpos);
+ bfd_put (8 * hash_entry_size, finfo->output_bfd, h->dynindx, bucketpos);
+ bfd_put (8 * hash_entry_size, finfo->output_bfd, chain,
+ ((bfd_byte *) finfo->hash_sec->contents
+ + (bucketcount + 2 + h->dynindx) * hash_entry_size));
+ }

if (finfo->symver_sec != NULL && finfo->symver_sec->contents != NULL)
{
@@ -7861,7 +8192,7 @@ bfd_elf_final_link (bfd *abfd, struct bf
{
finfo.dynsym_sec = bfd_get_section_by_name (dynobj, ".dynsym");
finfo.hash_sec = bfd_get_section_by_name (dynobj, ".hash");
- BFD_ASSERT (finfo.dynsym_sec != NULL && finfo.hash_sec != NULL);
+ BFD_ASSERT (finfo.dynsym_sec != NULL);
finfo.symver_sec = bfd_get_section_by_name (dynobj, ".gnu.version");
/* Note that it is OK if symver_sec is NULL. */
}
@@ -8621,6 +8952,9 @@ bfd_elf_final_link (bfd *abfd, struct bf
case DT_HASH:
name = ".hash";
goto get_vma;
+ case DT_GNU_HASH:
+ name = ".gnu.hash";
+ goto get_vma;
case DT_STRTAB:
name = ".dynstr";
goto get_vma;
--- bfd/elfxx-target.h.jj 2006-06-20 18:34:24.000000000 +0200
+++ bfd/elfxx-target.h 2006-07-06 15:38:39.000000000 +0200
@@ -563,6 +563,10 @@
#define elf_backend_merge_symbol NULL
#endif

+#ifndef elf_backend_hash_symbol
+#define elf_backend_hash_symbol _bfd_elf_hash_symbol
+#endif
+
extern const struct elf_size_info _bfd_elfNN_size_info;

#ifndef INCLUDED_TARGET_FILE
@@ -643,6 +647,7 @@ static struct elf_backend_data elfNN_bed
elf_backend_common_section_index,
elf_backend_common_section,
elf_backend_merge_symbol,
+ elf_backend_hash_symbol,
elf_backend_link_order_error_handler,
elf_backend_relplt_name,
ELF_MACHINE_ALT1,
--- include/elf/common.h.jj 2006-02-17 15:36:26.000000000 +0100
+++ include/elf/common.h 2006-06-22 10:43:21.000000000 +0200
@@ -338,6 +338,7 @@
#define SHT_LOOS 0x60000000 /* First of OS specific semantics */
#define SHT_HIOS 0x6fffffff /* Last of OS specific semantics */

+#define SHT_GNU_HASH 0x6ffffff6 /* GNU style symbol hash table */
#define SHT_GNU_LIBLIST 0x6ffffff7 /* List of prelink dependencies */

/* The next three section types are defined by Solaris, and are named
@@ -577,6 +578,7 @@
#define DT_VALRNGHI 0x6ffffdff

#define DT_ADDRRNGLO 0x6ffffe00
+#define DT_GNU_HASH 0x6ffffef5
#define DT_TLSDESC_PLT 0x6ffffef6
#define DT_TLSDESC_GOT 0x6ffffef7
#define DT_GNU_CONFLICT 0x6ffffef8
--- include/bfdlink.h.jj 2006-04-07 17:17:29.000000000 +0200
+++ include/bfdlink.h 2006-06-22 11:11:20.000000000 +0200
@@ -324,6 +324,12 @@ struct bfd_link_info
/* TRUE if unreferenced sections should be removed. */
unsigned int gc_sections: 1;

+ /* TRUE if .hash section should be created. */
+ unsigned int emit_hash: 1;
+
+ /* TRUE if .gnu.hash section should be created. */
+ unsigned int emit_gnu_hash: 1;
+
/* What to do with unresolved symbols in an object file.
When producing executables the default is GENERATE_ERROR.
When producing shared libraries the default is IGNORE. The
--- binutils/readelf.c.jj 2006-05-30 16:13:54.000000000 +0200
+++ binutils/readelf.c 2006-07-10 18:56:00.000000000 +0200
@@ -135,6 +135,7 @@ static unsigned long dynamic_syminfo_off
static unsigned int dynamic_syminfo_nent;
static char program_interpreter[64];
static bfd_vma dynamic_info[DT_JMPREL + 1];
+static bfd_vma dynamic_info_DT_GNU_HASH;
static bfd_vma version_info[16];
static Elf_Internal_Ehdr elf_header;
static Elf_Internal_Shdr *section_headers;
@@ -1501,6 +1502,7 @@ get_dynamic_type (unsigned long type)
case DT_GNU_CONFLICTSZ: return "GNU_CONFLICTSZ";
case DT_GNU_LIBLIST: return "GNU_LIBLIST";
case DT_GNU_LIBLISTSZ: return "GNU_LIBLISTSZ";
+ case DT_GNU_HASH: return "GNU_HASH";

default:
if ((type >= DT_LOPROC) && (type <= DT_HIPROC))
@@ -2571,6 +2573,7 @@ get_section_type_name (unsigned int sh_t
case SHT_INIT_ARRAY: return "INIT_ARRAY";
case SHT_FINI_ARRAY: return "FINI_ARRAY";
case SHT_PREINIT_ARRAY: return "PREINIT_ARRAY";
+ case SHT_GNU_HASH: return "GNU_HASH";
case SHT_GROUP: return "GROUP";
case SHT_SYMTAB_SHNDX: return "SYMTAB SECTION INDICIES";
case SHT_GNU_verdef: return "VERDEF";
@@ -6228,6 +6231,15 @@ process_dynamic_section (FILE *file)
}
break;

+ case DT_GNU_HASH:
+ dynamic_info_DT_GNU_HASH = entry->d_un.d_val;
+ if (do_dynamic)
+ {
+ print_vma (entry->d_un.d_val, PREFIX_HEX);
+ putchar ('\n');
+ }
+ break;
+
default:
if ((entry->d_tag >= DT_VERSYM) && (entry->d_tag <= DT_VERNEEDNUM))
version_info[DT_VERSIONTAGIDX (entry->d_tag)] =
@@ -6903,6 +6915,9 @@ process_symbol_table (FILE *file)
bfd_vma nchains = 0;
bfd_vma *buckets = NULL;
bfd_vma *chains = NULL;
+ bfd_vma ngnubuckets = 0;
+ bfd_vma *gnubuckets = NULL;
+ bfd_vma *gnuchains = NULL;

if (! do_syms && !do_histogram)
return 1;
@@ -7282,6 +7297,166 @@ process_symbol_table (FILE *file)
free (chains);
}

+ if (do_histogram && dynamic_info_DT_GNU_HASH)
+ {
+ unsigned char nb[16];
+ bfd_vma i, maxchain = 0xffffffff, symidx, bitmaskwords;
+ unsigned long *lengths;
+ unsigned long *counts;
+ unsigned long hn;
+ unsigned long maxlength = 0;
+ unsigned long nzero_counts = 0;
+ unsigned long nsyms = 0;
+ bfd_vma buckets_vma;
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file, dynamic_info_DT_GNU_HASH,
+ sizeof nb)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ if (fread (nb, 16, 1, file) != 1)
+ {
+ error (_("Failed to read in number of buckets\n"));
+ return 0;
+ }
+
+ ngnubuckets = byte_get (nb, 4);
+ symidx = byte_get (nb + 4, 4);
+ bitmaskwords = byte_get (nb + 8, 4);
+ buckets_vma = dynamic_info_DT_GNU_HASH + 16;
+ if (is_32bit_elf)
+ buckets_vma += bitmaskwords * 4;
+ else
+ buckets_vma += bitmaskwords * 8;
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file, buckets_vma, 4)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ gnubuckets = get_dynamic_data (file, ngnubuckets, 4);
+
+ if (gnubuckets == NULL)
+ return 0;
+
+ for (i = 0; i < ngnubuckets; i++)
+ if (gnubuckets[i] != 0)
+ {
+ if (gnubuckets[i] < symidx)
+ return 0;
+
+ if (maxchain == 0xffffffff || gnubuckets[i] > maxchain)
+ maxchain = gnubuckets[i];
+ }
+
+ if (maxchain == 0xffffffff)
+ return 0;
+
+ maxchain -= symidx;
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file, buckets_vma
+ + 4 * (ngnubuckets + maxchain), 4)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ do
+ {
+ if (fread (nb, 4, 1, file) != 1)
+ {
+ error (_("Failed to determine last chain length\n"));
+ return 0;
+ }
+
+ if (maxchain + 1 == 0)
+ return 0;
+
+ ++maxchain;
+ }
+ while ((byte_get (nb, 4) & 1) == 0);
+
+ if (fseek (file,
+ (archive_file_offset
+ + offset_from_vma (file, buckets_vma + 4 * ngnubuckets, 4)),
+ SEEK_SET))
+ {
+ error (_("Unable to seek to start of dynamic information"));
+ return 0;
+ }
+
+ gnuchains = get_dynamic_data (file, maxchain, 4);
+
+ if (gnuchains == NULL)
+ return 0;
+
+ lengths = calloc (ngnubuckets, sizeof (*lengths));
+ if (lengths == NULL)
+ {
+ error (_("Out of memory"));
+ return 0;
+ }
+
+ printf (_("\nHistogram for `.gnu.hash' bucket list length (total of %lu buckets):\n"),
+ (unsigned long) ngnubuckets);
+ printf (_(" Length Number %% of total Coverage\n"));
+
+ for (hn = 0; hn < ngnubuckets; ++hn)
+ if (gnubuckets[hn] != 0)
+ {
+ bfd_vma off, length = 1;
+
+ for (off = gnubuckets[hn] - symidx;
+ (gnuchains[off] & 1) == 0; ++off)
+ ++length;
+ lengths[hn] = length;
+ if (length > maxlength)
+ maxlength = length;
+ nsyms += length;
+ }
+
+ counts = calloc (maxlength + 1, sizeof (*counts));
+ if (counts == NULL)
+ {
+ error (_("Out of memory"));
+ return 0;
+ }
+
+ for (hn = 0; hn < ngnubuckets; ++hn)
+ ++counts[lengths[hn]];
+
+ if (ngnubuckets > 0)
+ {
+ unsigned long j;
+ printf (" 0 %-10lu (%5.1f%%)\n",
+ counts[0], (counts[0] * 100.0) / ngnubuckets);
+ for (j = 1; j <= maxlength; ++j)
+ {
+ nzero_counts += counts[j] * j;
+ printf ("%7lu %-10lu (%5.1f%%) %5.1f%%\n",
+ j, counts[j], (counts[j] * 100.0) / ngnubuckets,
+ (nzero_counts * 100.0) / nsyms);
+ }
+ }
+
+ free (counts);
+ free (lengths);
+ free (gnubuckets);
+ free (gnuchains);
+ }
+
return 1;
}

--- ld/testsuite/ld-powerpc/tlsso32.r.jj 2006-06-29 14:25:12.000000000 +0200
+++ ld/testsuite/ld-powerpc/tlsso32.r 2006-07-10 23:16:16.000000000 +0200
@@ -52,9 +52,9 @@ Relocation section '\.rela\.dyn' at offs
[0-9a-f ]+R_PPC_TPREL16 +0+30 +le0 \+ 0
[0-9a-f ]+R_PPC_TPREL16_HA +0+34 +le1 \+ 0
[0-9a-f ]+R_PPC_TPREL16_LO +0+34 +le1 \+ 0
-[0-9a-f ]+R_PPC_TPREL16 +0+1041c +\.tdata \+ 10430
-[0-9a-f ]+R_PPC_TPREL16_HA +0+1041c +\.tdata \+ 10434
-[0-9a-f ]+R_PPC_TPREL16_LO +0+1041c +\.tdata \+ 10434
+[0-9a-f ]+R_PPC_TPREL16 +0+103e4 +\.tdata \+ 103f8
+[0-9a-f ]+R_PPC_TPREL16_HA +0+103e4 +\.tdata \+ 103fc
+[0-9a-f ]+R_PPC_TPREL16_LO +0+103e4 +\.tdata \+ 103fc
[0-9a-f ]+R_PPC_DTPMOD32 +0+
[0-9a-f ]+R_PPC_DTPREL32 +0+
[0-9a-f ]+R_PPC_DTPMOD32 +0+
--- ld/testsuite/ld-powerpc/tlsso32.d.jj 2005-12-30 13:33:50.000000000 +0100
+++ ld/testsuite/ld-powerpc/tlsso32.d 2006-07-10 23:19:43.000000000 +0200
@@ -42,5 +42,5 @@ Disassembly of section \.got:
.* <\.got>:
\.\.\.
.*: 4e 80 00 21 blrl
-.*: 00 01 04 38 .*
+.*: 00 01 04 00 .*
\.\.\.
--- ld/testsuite/ld-powerpc/tlsso32.g.jj 2005-12-30 13:33:50.000000000 +0100
+++ ld/testsuite/ld-powerpc/tlsso32.g 2006-07-10 23:22:06.000000000 +0200
@@ -9,5 +9,5 @@
Contents of section \.got:
.* 00000000 00000000 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*
-.* 00000000 4e800021 00010438 00000000 .*
+.* 00000000 4e800021 00010400 00000000 .*
.* 00000000 .*
--- ld/testsuite/ld-powerpc/tlsso.r.jj 2006-06-29 14:25:12.000000000 +0200
+++ ld/testsuite/ld-powerpc/tlsso.r 2006-07-10 23:32:22.000000000 +0200
@@ -49,9 +49,9 @@ Relocation section '\.rela\.dyn' at offs
[0-9a-f ]+R_PPC64_TPREL16 +0+60 le0 \+ 0
[0-9a-f ]+R_PPC64_TPREL16_HA +0+68 le1 \+ 0
[0-9a-f ]+R_PPC64_TPREL16_LO +0+68 le1 \+ 0
-[0-9a-f ]+R_PPC64_TPREL16_DS +0+10668 \.tdata \+ 28
-[0-9a-f ]+R_PPC64_TPREL16_HA +0+10668 \.tdata \+ 30
-[0-9a-f ]+R_PPC64_TPREL16_LO +0+10668 \.tdata \+ 30
+[0-9a-f ]+R_PPC64_TPREL16_DS +0+10630 \.tdata \+ 28
+[0-9a-f ]+R_PPC64_TPREL16_HA +0+10630 \.tdata \+ 30
+[0-9a-f ]+R_PPC64_TPREL16_LO +0+10630 \.tdata \+ 30
[0-9a-f ]+R_PPC64_DTPMOD64 +0+
[0-9a-f ]+R_PPC64_DTPMOD64 +0+
[0-9a-f ]+R_PPC64_DTPREL64 +0+
--- ld/testsuite/ld-powerpc/tlsso.g.jj 2005-12-30 13:33:50.000000000 +0100
+++ ld/testsuite/ld-powerpc/tlsso.g 2006-07-10 23:34:21.000000000 +0200
@@ -7,7 +7,7 @@
.*: +file format elf64-powerpc

Contents of section \.got:
-.* 00000000 000187f0 00000000 00000000 .*
+.* 00000000 000187b8 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*
--- ld/testsuite/ld-powerpc/tlstocso.g.jj 2005-12-30 13:33:50.000000000 +0100
+++ ld/testsuite/ld-powerpc/tlstocso.g 2006-07-10 23:34:48.000000000 +0200
@@ -7,7 +7,7 @@
.*: +file format elf64-powerpc

Contents of section \.got:
-.* 00000000 00018738 00000000 00000000 .*
+.* 00000000 00018700 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*
.* 00000000 00000000 00000000 00000000 .*


Jakub
Michael Meeks
2006-07-07 10:10:26 UTC
Permalink
Hi Jakub,
Post by Jakub Jelinek
1) swapped chainoff and bitmask in the buckets array pairs - bitmask
is accessed first by ld.so, so it makes sense to put it first
So - it took a little while to understand your patch :-) From what I
see, we insert a 32-bit bitmask into the buckets (in addition to the
offset), whose purpose is to reduce the number of L2 misses incurred by
lookups in the 'chain', by first checking whether a given bit is set in
that mask.

Each chain element sets a mask bit, using bits 6-10 of its hash as a
bit index (0-31) into the mask; that bit is then tested at lookup time.
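
To check I'm reading it right, here is a minimal sketch of that
fast-path test as I understand it - names are mine, not the real ld.so
ones, and I'm assuming each bucket is a (bitmask, chain offset) pair
with the bitmask stored first:

#include <stdint.h>

/* Sketch only: does the per-bucket mask say the chain might
   contain a symbol with this hash?  */
static int
mask_may_match (const uint32_t *buckets, uint32_t nbuckets,
                uint32_t new_hash)
{
  /* Bits 6-10 of the hash select one of the 32 mask bits.  */
  uint32_t new_hash_bit = 1u << ((new_hash >> 6) & 31);

  /* The mask is the first word of the pair, so it is the only
     word touched on the (common) miss path.  */
  return (buckets[2 * (new_hash % nbuckets)] & new_hash_bit) != 0;
}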

So - I guess that's an interesting new twist ;-) I'm not entirely sold
on it at first glance, since it makes the working set larger and the
access pattern less predictable - but if the numbers show it working
nicely, that's great :-) [ I found them slightly hard to read
side-by-side ].

Some misc. other code questions:

+ if (buckets[0] & new_hash_bit)

Since the average chain length is ~2.5 (say 3) entries, and we have 32
bits to play with here, then assuming bits 6-10 are as random as we
would like, the probability that this branch is taken is only ~1/10 -
should we have a __builtin_expect reflecting that?
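
I.e. something like this (just a sketch of the markup I mean, on the
assumption the miss really is the ~9-in-10 case):

/* Hypothetical annotation: tell gcc the bit is usually clear,
   so it lays the code out for the fall-through miss path.  */
if (__builtin_expect ((buckets[0] & new_hash_bit) != 0, 0))
  {
    /* possible match: go compare the full hash chain */
  }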

Similarly, is it worth moving some more of the symtab/strtab pointer
setup out of the hot path? Though perhaps the compiler will do that
anyway:

if ((*hasharr & ~1u) == (new_hash & ~1u))
{
symidx = hasharr - map->l_gnu_chain_zero;
+ /* The tables for this map. */
+ const ElfW(Sym) *symtab = (const void *) D_PTR (map, l_info[DT_SYMTAB]);
+ const char *strtab = (const void *) D_PTR (map, l_info[DT_STRTAB]);
+ const ElfW(Half) *verstab = map->l_versyms;
sym = check_match (&symtab[symidx]);
if (sym != NULL)
goto found_it;
}

[ and the same for the old-style case? AFAIR I did this in my old
dynhash patch ]

Similarly - [ though I've no idea how much these 'expect' markups
really help on various arches (as no doubt you've guessed by now ;-) ]
- this is presumably a rather uncommon case:

+ /* If the hash table is empty there is nothing to do here. */
+ if (map->l_nbuckets == 0)
+ continue;

I guess the common case in this loop is now testing buckets[0] &
new_hash_bit, missing, and looping; is there anything else we can
remove / defer / annotate in that (now much hotter) loop?

Otherwise - this looks really interesting to me :-)

Thanks,

Michael.
--
***@novell.com <><, Pseudo Engineer, itinerant idiot
Ulrich Drepper
2006-07-07 13:43:45 UTC
Permalink
[...]
There is no need to scrutinize the libc code here. First of all, this
is just the code I tested and not what I have in my sources. And
second, I'm very well aware of the generated code. If there are any
annotations or changes needed you can be sure I'll add them.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Jakub Jelinek
2006-07-07 13:47:50 UTC
Permalink
Post by Michael Meeks
+ /* If the hash table is empty there is nothing to do here. */
+ if (map->l_nbuckets == 0)
+ continue;
This is predicted as not taken by gcc AFAIK, so there is no need to add
__builtin_expect.

Jakub
djamel anonymous
2006-07-03 17:36:23 UTC
Permalink
Hello, I am writing again about a potential speed improvement. The
space used for the hash table can be reduced by almost nchains words by
using the least significant bit of the hash value stored in the chains
as a chain terminator; this is possible because of the fact that:

(hash1 % nbuckets == hash2 % nbuckets and hash1 / 2 == hash2 / 2)
  => hash1 == hash2

for every nbuckets >= 2.
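
As a concrete sketch (my code, not tested), a chain lookup would then
compare all but the low bit and stop when the low bit is set:

#include <stdint.h>

/* Sketch: walk one chain whose entries store hash & ~1, with the
   low bit marking the last entry.  Returns the symbol index, or
   -1 if the hash is not in the chain.  */
static long
chain_lookup (const uint32_t *chain, uint32_t symidx0, uint32_t hash)
{
  uint32_t i = 0;
  for (;;)
    {
      if ((chain[i] & ~1u) == (hash & ~1u))
        return (long) (symidx0 + i);   /* found */
      if (chain[i] & 1)
        return -1;                     /* terminator reached */
      ++i;
    }
}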
This space improvement could be turned into speed in various ways; I
suggest 3 different ones:

1) Increase nbuckets by 50%: assuming almost every bucket points to a
chain, increasing the number of buckets by 50% and deleting the length
word from each chain will result in the same overall space usage.

2) Use a second sparse hash table as suggested in my previous post:
this sparse hash table will have an average size of nsymbols bytes,
which is exactly the minimal size of the buckets array,
(nsymbols / 4) * 4 bytes, and hence there will be a net space gain
compared to the previous implementation.

3) Associate an additional word with each bucket; this word will avoid
accessing the chain at least 7/8 of the time. To apply this solution
the bucket index is obtained as bucket = (hash % (nbuckets * 32)) / 32,
and inside the additional word bit i is set if and only if an element
exists in the chain with (hash % (nbuckets * 32)) % 32 equal to i.
The bucket will be a struct of:

{
  uint32 offset;
  uint32 sub_buckets;
};

We will go to the chain if and only if
(sub_buckets & (1 << ((hash % (nbuckets * 32)) % 32))) != 0.
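
To make that concrete, a sketch of the test (my names, just
illustrating the proposal):

#include <stdint.h>

/* Proposal 3, sketched: each bucket carries a 32-bit mask of
   sub-buckets; only walk the chain when the sub-bucket bit is set.  */
struct gnu_hash_bucket
{
  uint32_t offset;       /* start of this bucket's chain */
  uint32_t sub_buckets;  /* one bit per sub-bucket */
};

static int
need_chain_walk (const struct gnu_hash_bucket *buckets,
                 uint32_t nbuckets, uint32_t hash)
{
  uint32_t slot = hash % (nbuckets * 32);
  return (buckets[slot / 32].sub_buckets & (1u << (slot % 32))) != 0;
}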
The first solution seems the least attractive because it will most
likely give only a modest improvement. The second solution avoids the
expensive modulo-nbuckets operation, but it seems unlikely to be
accepted because it adds a second hash table.
I hope this post will be useful.
Best regards
