Here is some more information; sorry for the delay.
The test program allocates small chunks of memory, some of which contain
pointers, and some of which do not (gmp integers). The latter are allocated
with GC_MALLOC_ATOMIC.
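For concreteness, here is roughly what the allocation pattern looks like (a
simplified sketch, not our actual code; the struct and function names here are
made up for illustration):

    #include <stddef.h>
    #include <gc.h>

    /* pointer-carrying node: allocated with GC_MALLOC so the collector scans it */
    struct node { struct node *next; void *payload; };

    struct node *alloc_node(void) {
        return (struct node *)GC_MALLOC(sizeof(struct node));
    }

    /* pointer-free data (e.g. gmp limb blocks): GC_MALLOC_ATOMIC tells the
       collector it never needs to scan the interior of these blocks */
    void *alloc_limbs(size_t nbytes) {
        return GC_MALLOC_ATOMIC(nbytes);
    }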
Here is how the program is spending its time in a test with 4 client
threads, according to gprof:
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
19.95      2.19     2.19   1259020     0.00     0.00  GC_generic_lock
14.75      3.81     1.62   2772367     0.00     0.00  evaluate_eval
11.29      5.05     1.24     54417     0.00     0.00  GC_mark_from
 7.56      5.88     0.83   8614697     0.00     0.00  GC_malloc
 7.01      6.65     0.77   1658054     0.00     0.00  GC_allochblk
 4.37      7.13     0.48   1371704     0.00     0.00  GC_generic_malloc_many
 3.55      7.52     0.39   3292481     0.00     0.00  GC_malloc_atomic
 2.50      7.80     0.28   3299072     0.00     0.00  gmp_toInteger
 2.37      8.06     0.26   2292490     0.00     0.00  GC_header_cache_miss
 2.37      8.32     0.26     58155     0.00     0.00  GC_reclaim_clear
 2.37      8.58     0.26     49848     0.00     0.00  GC_build_fl
 2.14      8.81     0.24  15034505     0.00     0.00  TS_Get_Local
 1.64      8.99     0.18   3110502     0.00     0.00  GC_allochblk_nth
 1.64      9.17     0.18     24769     0.00     0.00  getmem_atomic
 1.46      9.33     0.16        80     0.00     0.01  GC_apply_to_all_blocks
 1.18      9.46     0.13   1717544     0.00     0.00  GC_generic_malloc_inner
Here is config.log's idea of how gc was configured:
./configure
--prefix=/home/dan/src/M2/trunk-git/M2/BUILD/dan/builds.tmp/ubuntu64.profile/libraries/final
--enable-cplusplus --enable-threads=posix --enable-parallel-mark
--enable-large-config --disable-gcj-support --disable-java-finalization
--build=x86_64-unknown-linux-gnu --cache-file=/dev/null
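One thing I could do to confirm that parallel marking was actually compiled in
and enabled at run time is something like the following (just a sketch; I'm
assuming GC_get_parallel() reports what I think it does for this build):

    #include <stdio.h>
    #include <gc.h>

    int main(void) {
        GC_INIT();
        /* non-zero should mean the parallel marker is compiled in and active;
           if so, the GC_MARKERS environment variable presumably controls the
           number of marker threads */
        printf("GC_get_parallel() = %d\n", GC_get_parallel());
        return 0;
    }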
I was unaware of thread-local allocation, as described in the link you
provided! In particular, I was unaware that I had to do anything to get it
activated. Perhaps that is the next thing to try.
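Here is a sketch of what I think the client side has to do (assuming the
library really was built with thread-local allocation compiled in; the worker
function is made up for illustration):

    #define GC_THREADS            /* must come before including gc.h */
    #include <gc.h>
    #include <pthread.h>

    static void *worker(void *arg) {
        /* allocations made from a registered thread should come from that
           thread's local free lists, avoiding the global allocation lock
           on most calls */
        for (int i = 0; i < 1000000; i++)
            (void)GC_MALLOC(16);
        return arg;
    }

    int main(void) {
        pthread_t t;
        GC_INIT();
        /* with GC_THREADS defined, pthread_create is redirected to
           GC_pthread_create, which registers the new thread with the collector */
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(&t, NULL);
        return 0;
    }

If that is right, then threads created without going through the redirected
pthread_create would have to register themselves with GC_register_my_thread.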
Post by Bruce Hoult
Hi Daniel,
Unfortunately you haven't given us much to go on. We don't know how
you've built the GC, and we don't know how you are using it.
Post by Bruce Hoult
Are you spending your time in allocation, or in marking? Have you read
http://www.hboehm.info/gc/scale.html? What have you done from there?
Post by Bruce Hoult
As you can see, the information there is not very recent. I'm not aware
of any serious work carried out with regard to scalability on modern 6 or 8
or 12 core CPUs.
Post by Bruce Hoult
The thread local allocation stuff *should* take care of lock contention,
even with quite a few CPU cores.
Post by Bruce Hoult
Marking is a more difficult problem. If the structures allocated by
different cores have pointers into structures allocated by other cores (or
it's just generally one big ball of mud) then marking will inevitably cause
a lot of cross-CPU cache traffic. If the different threads' data structures
are mostly disjoint then parallel marking *could* work very well. *Could*.
I really don't know in practice as I've never tried it.
Post by Bruce Hoult
On another track ... are your objects mostly pointers or mostly data? If
you have, for example, big arrays filled with numbers, are you using
GC_malloc_atomic() so that the GC knows they don't need to be scanned?
Post by Bruce Hoult
On Wed, Jun 25, 2014 at 11:46 AM, Daniel R. Grayson <
Post by Daniel R. Grayson
In our application that uses libgc (see http://macaulay2.com/) I observe no
speedup when running tasks in parallel, if the tasks allocate memory using
libgc. Perhaps I'm doing something wrong. Are there any commonly observed
situations where no speedup occurs?
A glance at the source code shows that mutex locks lock down the world on
almost every occasion, so it's hard to see why there would ever be any
speedup when using threads.