Discussion: [Gc] parallel speedup
Daniel R. Grayson
2014-06-24 23:46:01 UTC
In our application that uses libgc (see http://macaulay2.com/) I observe no
speedup when running tasks in parallel, if the tasks allocate memory using
libgc. Perhaps I'm doing something wrong. Are there any commonly observed
situations where no speedup occurs?

A glance at the source code shows that mutex locks lock down the world on
almost every occasion, so it's hard to see why there would ever be any
speedup when using threads.
Paul Bone
2014-06-25 00:04:51 UTC
I've found that when a collector (like Boehm GC) is involved, you're
fighting Amdahl's law. This means that you often get less speedup than you
would expect.

Let's say you spend 50% of your time in the mutator (your program) and 50%
of your time in the collector. If you parallelise the mutator across four
processors and the collector is not parallelised, then your best-case
speedup isn't 4x, it's:

Sequential time = 0.5 + 0.5   = 1
Parallel time   = 0.5/4 + 0.5 = 0.625

Speedup = 1 / 0.625 = 1.6

So the program runs only 1.6 times faster than the sequential one, when
naively you might have expected it to run 4 times faster.
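
(For concreteness, here's that arithmetic as a tiny C sketch; the helper
name is just illustrative, not anyone's API:)

    #include <stdio.h>

    /* Amdahl's law: overall speedup when only a fraction p of the
       work scales across n processors and the rest stays serial. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        /* 50% mutator (parallelised on 4 cores), 50% collector (serial). */
        printf("%.2f\n", amdahl_speedup(0.5, 4)); /* prints 1.60 */
        return 0;
    }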

The marking phase of Boehm GC is parallelised, so the situation is a bit
better, but AIUI parallelising a marking phase is difficult, so it's likely
to be far from a perfect theoretical speedup.

I did some testing for my thesis including calculating how much of the
program's time is spent in the mutator and collector. See section 3.1
http://www.mercurylang.org/documentation/papers.html#pbone_phd_thesis

I found that a program that was theoretically trivial to parallelise, but
which allocated a lot of memory for intermediate data (a raytracer), got a
speedup of 1.58 with four application threads, 1.29 with four marker
threads, and 2.73 with four threads for the application and four for the
marker. When running sequentially, the same program spends 55% of its time
in the mutator and 45% of its time in the collector. It allocates memory at
26.9MB/s (without thread safety). One of the easiest ways to tune for
better performance was to increase the initial heap size.
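
(A minimal sketch of that tuning with bdwgc, assuming the usual public API;
GC_expand_hp pre-grows the heap, and the GC_INITIAL_HEAP_SIZE environment
variable has the same effect at startup. Check your version's gc.h:)

    #include <gc/gc.h>   /* or <gc.h>, depending on the install */

    int main(void)
    {
        GC_INIT();
        /* Pre-grow the heap so early allocation doesn't trigger
           frequent collections; 256 MB is just an illustration. */
        GC_expand_hp(256 * 1024 * 1024);
        /* ... application code ... */
        return 0;
    }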

I hope this is helpful.
--
Paul Bone
Bruce Hoult
2014-06-25 00:48:14 UTC
Hi Daniel,

Unfortunately you haven't given us much to go on. We don't know how you've
built the GC, and we don't know how you are using it.

Are you spending your time in allocation, or in marking?

Have you read http://www.hboehm.info/gc/scale.html ? Which of the
suggestions there have you tried?

As you can see, the information there is not very recent. I'm not aware of
any serious work carried out with regard to scalability on modern 6 or 8 or
12 core CPUs.

The thread local allocation stuff *should* take care of lock contention,
even with quite a few CPU cores.
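
(For reference, the usual setup looks something like the sketch below,
assuming a GC built with --enable-threads=posix. Defining GC_THREADS before
including gc.h makes the header wrap pthread_create, and GC_malloc then
draws from per-thread free lists, so the worker loop mostly avoids the
global lock.)

    #define GC_THREADS          /* must precede the gc.h include */
    #include <gc/gc.h>
    #include <pthread.h>

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            GC_malloc(16);      /* served from a thread-local free list */
        return NULL;
    }

    int main(void)
    {
        GC_INIT();
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }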

Marking is a more difficult problem. If the structures allocated by
different cores have pointers into structures allocated by other cores (or
it's just generally one big ball of mud) then marking will inevitably cause
a lot of cross-CPU cache traffic. If the different threads' data structures
are mostly disjoint then parallel marking *could* work very well. *Could*.
I really don't know in practice as I've never tried it.

On another track: are your objects mostly pointers or mostly data? If you
have, for example, big arrays filled with numbers, are you using
GC_malloc_atomic() so that the GC knows they don't need to be scanned?
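
(For example, a sketch of the distinction; GC_MALLOC_ATOMIC is the standard
bdwgc macro for pointer-free objects:)

    #include <gc/gc.h>

    /* Pointer-free data: the collector never scans the contents,
       so a large numeric buffer costs nothing at mark time. */
    double *make_samples(size_t n)
    {
        return (double *)GC_MALLOC_ATOMIC(n * sizeof(double));
    }

    /* By contrast, an object holding pointers must use GC_MALLOC
       so the collector scans it for references. */
    struct node { struct node *next; double value; };

    struct node *make_node(struct node *next, double v)
    {
        struct node *p = (struct node *)GC_MALLOC(sizeof *p);
        p->next = next;
        p->value = v;
        return p;
    }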



Daniel R. Grayson
2014-07-02 17:01:48 UTC
Here is some more information, sorry for the delay.

The test program allocates small chunks of memory, some of which contain
pointers, and some of which do not (gmp integers). The latter are allocated
with GC_MALLOC_ATOMIC.

Here is how the program is spending its time, in a test with 4 client
threads, according to gprof:

  %    cumulative    self                 self    total
 time    seconds    seconds      calls   s/call  s/call  name
19.95       2.19       2.19    1259020     0.00    0.00  GC_generic_lock
14.75       3.81       1.62    2772367     0.00    0.00  evaluate_eval
11.29       5.05       1.24      54417     0.00    0.00  GC_mark_from
 7.56       5.88       0.83    8614697     0.00    0.00  GC_malloc
 7.01       6.65       0.77    1658054     0.00    0.00  GC_allochblk
 4.37       7.13       0.48    1371704     0.00    0.00  GC_generic_malloc_many
 3.55       7.52       0.39    3292481     0.00    0.00  GC_malloc_atomic
 2.50       7.80       0.28    3299072     0.00    0.00  gmp_toInteger
 2.37       8.06       0.26    2292490     0.00    0.00  GC_header_cache_miss
 2.37       8.32       0.26      58155     0.00    0.00  GC_reclaim_clear
 2.37       8.58       0.26      49848     0.00    0.00  GC_build_fl
 2.14       8.81       0.24   15034505     0.00    0.00  TS_Get_Local
 1.64       8.99       0.18    3110502     0.00    0.00  GC_allochblk_nth
 1.64       9.17       0.18      24769     0.00    0.00  getmem_atomic
 1.46       9.33       0.16         80     0.00    0.01  GC_apply_to_all_blocks
 1.18       9.46       0.13    1717544     0.00    0.00  GC_generic_malloc_inner

Here is config.log's idea of how gc was configured:

./configure
--prefix=/home/dan/src/M2/trunk-git/M2/BUILD/dan/builds.tmp/ubuntu64.profile/libraries/final
--enable-cplusplus --enable-threads=posix --enable-parallel-mark
--enable-large-config --disable-gcj-support --disable-java-finalization
--build=x86_64-unknown-linux-gnu --cache-file=/dev/null

I was unaware of thread local allocation, as described in the link you
provided! In particular, I was unaware I had to do anything to get it
activated. Perhaps that is the next thing to try.
Daniel R. Grayson
2014-07-02 17:13:18 UTC
Hmm, no, the modern version of the file scale.html that you linked to,
which comes with the gc source code, says that thread local allocation is
automatically enabled. Indeed, config.status shows the flag is defined:

D["THREAD_LOCAL_ALLOC"]=" 1"

The gprof output also shows that GC_generic_malloc_many is getting used.
So I don't know what to do differently.

Daniel R. Grayson
2014-07-02 17:47:42 UTC
Oops, here's something I forgot to consider. Years ago I inserted code
initializing the GC_free_space_divisor to 12, but the libgc default is 3.
That turns out to be a big problem all by itself.
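
(For anyone following along, the knob can also be set at run time; a sketch
using GC_set_free_space_divisor, the public setter in recent bdwgc. I
believe recent versions also honour a GC_FREE_SPACE_DIVISOR environment
variable.)

    #include <gc/gc.h>

    int main(void)
    {
        GC_INIT();
        /* Smaller divisor => the heap grows more per collection =>
           fewer collections. The libgc default is 3; 12 trades a lot
           of speed for RAM. */
        GC_set_free_space_divisor(3);
        /* ... application code ... */
        return 0;
    }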

With 4 threads:

divisor setting   sequential run time   parallel run time
       1                   26                   21
       2                   33                   29
       3                   31                   24
      12                   78                   71




Bruce Hoult
2014-07-02 21:08:52 UTC
Aha! So the overall slowness was self-inflicted! 12 is far down the "I want
to use the absolute minimum amount of RAM and don't care how slow the
program goes" end of the settings curve.

I'm surprised 2 is slower than 3 though.

You're still not getting a lot of parallel speedup, though I note that the
absolute difference between sequential and parallel times is approximately
the same in all cases, so the relative speedup is much higher now.

A profile will look much different now.

What is the overall heap size?



Daniel R. Grayson
2014-07-02 23:03:35 UTC
Yes, this project is 22 years old, and memory used to be tighter!

Heap size is just 45 MB, at start and end. Here is the new gprof output:

  %    cumulative    self                 self    total
 time    seconds    seconds      calls   s/call  s/call  name
18.23      13.64      13.64    6616700     0.00    0.00  GC_generic_lock
17.55      26.77      13.13     396077     0.00    0.00  GC_mark_from
16.99      39.48      12.71    2770149     0.00    0.00  evaluate_eval
 6.50      44.34       4.86    9002190     0.00    0.00  GC_allochblk
 6.35      49.09       4.75   47480833     0.00    0.00  GC_malloc
 3.72      51.87       2.78    7576965     0.00    0.00  GC_generic_malloc_many
 3.60      54.57       2.70   29752123     0.00    0.00  gmp_toInteger
 3.21      56.97       2.40   25464868     0.00    0.00  GC_malloc_atomic
 2.68      58.97       2.01   56972677     0.00    0.00  TS_Get_Local
 2.47      60.82       1.85   23128394     0.00    0.00  GC_header_cache_miss
 1.95      62.28       1.46     343001     0.00    0.00  GC_build_fl
 1.90      63.71       1.43      23733     0.00    0.00  getmem_atomic
 1.79      65.05       1.34     382227     0.00    0.00  GC_reclaim_clear
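
(For what it's worth, the heap figure can be read from the collector
itself; a sketch using the standard bdwgc calls GC_get_heap_size() and
GC_get_free_bytes(), called once at startup and once at exit to get the
"at start and end" numbers:)

    #include <stdio.h>
    #include <gc/gc.h>

    static void report_heap(const char *when)
    {
        /* Total heap vs. the currently unused portion, in bytes. */
        printf("%s: heap %zu bytes, %zu free\n",
               when, GC_get_heap_size(), GC_get_free_bytes());
    }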