The Asynchronous-PRAM Programming Language ForkLight
Source-to-source Compiler
for all machines supporting the P4 library
This page is permanently under reconstruction. Last update Nov. 24, 1998 -
Christoph W. Kessler
Asynchronous PRAM model
The Asynchronous PRAM (Asynchronous Parallel Random Access Machine)
is a MIMD multiprocessor
with a shared memory and with a private memory for each processor.
Memory access time needs not be uniform,
but sequential memory consistency is assumed.
The Asynchronous-PRAM programming language ForkLight
ForkLight is a control-synchronous, parallel
programming language for the Asynchronous PRAM model.
Control-Synchronicity means that all processors belonging to the
same (active) group are working on the same basic block.
Like its predecessor Fork95,
ForkLight is based on ANSI C with extensions for management of shared and private
address subspaces and variables, and
recursively nested levels of control synchronicity by
hierarchical construction of processor subgroups.
The language design of ForkLight has been developed in 1997/98 by
Christoph W. Kessler
and Helmut Seidl,
partially based on their previous work on Fork95.
Documentation for ForkLight:
- Technical report describing the ForkLight language and compiler:
ForkLight: A Control-Synchronous Parallel Programming
Language.
C.W. Kessler, H. Seidl.
Technical Report No. 98-13,
University of Trier, Department for
Mathematics and Computer Science, 19 pages, Sep. 1998.
- A
short version
(10 pages) appeared in
Proceedings of HPCN'99, Amsterdam, 12.-14.4.1999, Springer LNCS.
- Summary of the ForkLight language constructs
(postscript, 1 page)
- Slides (postscript) used at several presentations of ForkLight:
title -
asynchronous PRAM model -
SPMD execution -
nested parallelism -
control-synchronous vs. asynchronous computation -
automatic group splitting -
explicit group splitting, mergesort example -
implementation -
target machine interface -
shared variables, shared memory organization -
conclusion -
future work
- Further information available upon request
by email to chrke \at ida.liu.se.
Example programs
ForkLight Compiler
The ForkLight compiler flcc, written by
C. W. Kessler,
is a source-to-source compiler that emits C code with calls to the
P4 library routines. This offers portability of the generated code
over a wide range of parallel architectures, including (multiprocessor)
SOLARIS workstations and the SB-PRAM.
The current beta version 1.0 of the compiler,
including all sources, is available by ftp
here (about 425 KB gzip'ed tar file).
For bug reports or further information please contact me
by email at chrke \at ida.liu.se.
flcc is partially based on the lcc1.9 C compiler.
Hardware platforms:
You need an installation of the P4 library package and an ANSI-C compiler
like gcc.
Installation and compiler options
(postscript, 1 page)
First Performance Results:
The parallel architectures running P4 that are currently accessible to us
are (1) multiprocessor SOLARIS workstations and
(2) the SB-PRAM. We have done some measurements for both of them.
1. multiprocessor SUN server
We have tested our compiler on a loaded 4 processor SUN server
running SOLARIS 2.5.1.
This machine has no atomic fetch-add accessible via the P4 routines;
thus atomic memory access is sequentialized.
Nevertheless this configuration is useful for testing purposes.
We obtained the following
run times (in milliseconds) on some small example programs.
- pi calculation (stochastic method, N=5000000):
p=1: 22.61 s | p=2: 14.55 s | p=3: 11.92 s | p=4: 7.86 s
|
- matrix-matrix multiplication for 200x200 integer matrices:
p=1: 41.08 s | p=2: 24.16 s | p=3: 19.22 s | p=4: 14.49 s
|
- parallel mergesort for 48000 integers
using the
host's optimized qsort() routine for sequential sorting:
p=1: 1.02 s | p=2: 0.98 s | p=4: 1.65 s
|
- parallel mergesort for 48000 integers
using a handwritten
routine for sequential sorting, thus a more fair comparison:
p=1: 4.39 s | p=2: 3.58 s | p=4: 2.33 s
|
- parallel quicksort for 120000 integers
using the
host's optimized qsort() routine for sequential sorting:
p=1: 4.45 s | p=2: 4.65 s | p=4: 4.42 ms
|
- parallel quicksort for 120000 integers
using a handwritten quicksort routine for sequential sorting
(without synchronization, of course), thus a more fair comparison:
p=1: 14.5 s | p=2: 9.83 s | p=4: 6.35 s
|
2. SB-PRAM
We have ported the compiler to the
SB-PRAM.
An important improvement has been reached by
exploiting the native fetch-add operators which do not lead to
sequentialization, in comparison to standard P4 which does not
efficiently support atomic fetch-add or atomic-add.
There is an operational 128 PE prototype of the SB-PRAM at Saarbrücken
running the SB-PRAM operating system PRAMOS.
Due to some operating system restrictions, only up to 124 PE's can be used
for the application program, the other ones (one PE per processor board)
are reserved as I/O servers by PRAMOS.
For our mergesort example program using (a)
the optimized sequential
qsort() function, as listed above and
(b) a hand-written sequential quicksort routine, we obtain
the following run time results:
mergesort on
SB-PRAM
| p=1 | p=2 | p=4 | p=8 | p=16 | p=32
| p=64 |
(a) N=1000 | 573 ms | 373 ms | 232 ms | 142 ms
| 88 ms | 57 ms | 39 ms |
(b) N=10000
| 2892 ms | 4693 ms | 3865 ms | 1571 ms
| 896 ms | 509 ms | 290 ms |
For the parallel quicksort program, using a handwritten sequential quicksort
routine, we obtain the following execution times:
quicksort on
SB-PRAM
| p=1 | p=2 | p=4 | p=8 | p=16 | p=32
| p=64 |
N=1000 | 1519 ms | 807 ms | 454 ms | 259 ms
| 172 ms | 145 ms | 130 ms |
This page by Christoph W. Kessler.