The Asynchronous-PRAM Programming Language ForkLight

Source-to-source Compiler for all machines supporting the P4 library

This page is permanently under reconstruction. Last update Nov. 24, 1998 - Christoph W. Kessler

Asynchronous PRAM model

The Asynchronous PRAM (Asynchronous Parallel Random Access Machine) is a MIMD multiprocessor with a shared memory and with a private memory for each processor. Memory access time needs not be uniform, but sequential memory consistency is assumed.

The Asynchronous-PRAM programming language ForkLight

ForkLight is a control-synchronous, parallel programming language for the Asynchronous PRAM model.
Control-Synchronicity means that all processors belonging to the same (active) group are working on the same basic block.
Like its predecessor Fork95, ForkLight is based on ANSI C with extensions for management of shared and private address subspaces and variables, and recursively nested levels of control synchronicity by hierarchical construction of processor subgroups.

The language design of ForkLight has been developed in 1997/98 by Christoph W. Kessler and Helmut Seidl, partially based on their previous work on Fork95.

Documentation for ForkLight:

Technical report describing the ForkLight language and compiler:
ForkLight: A Control-Synchronous Parallel Programming Language.
C.W. Kessler, H. Seidl.
Technical Report No. 98-13, University of Trier, Department for Mathematics and Computer Science, 19 pages, Sep. 1998.
A short version (10 pages) appeared in Proceedings of HPCN'99, Amsterdam, 12.-14.4.1999, Springer LNCS.
Summary of the ForkLight language constructs (postscript, 1 page)
Slides (postscript) used at several presentations of ForkLight:
title - asynchronous PRAM model - SPMD execution - nested parallelism - control-synchronous vs. asynchronous computation - automatic group splitting - explicit group splitting, mergesort example - implementation - target machine interface - shared variables, shared memory organization - conclusion - future work
Further information available upon request by email to chrke \at ida.liu.se.

Example programs

ForkLight Compiler

The ForkLight compiler flcc, written by C. W. Kessler, is a source-to-source compiler that emits C code with calls to the P4 library routines. This offers portability of the generated code over a wide range of parallel architectures, including (multiprocessor) SOLARIS workstations and the SB-PRAM.

The current beta version 1.0 of the compiler, including all sources, is available by ftp here (about 425 KB gzip'ed tar file).

For bug reports or further information please contact me by email at chrke \at ida.liu.se.
flcc is partially based on the lcc1.9 C compiler.

Hardware platforms:

You need an installation of the P4 library package and an ANSI-C compiler like gcc.

Installation and compiler options (postscript, 1 page)

First Performance Results:

The parallel architectures running P4 that are currently accessible to us are (1) multiprocessor SOLARIS workstations and (2) the SB-PRAM. We have done some measurements for both of them.

1. multiprocessor SUN server

We have tested our compiler on a loaded 4 processor SUN server running SOLARIS 2.5.1.
This machine has no atomic fetch-add accessible via the P4 routines; thus atomic memory access is sequentialized. Nevertheless this configuration is useful for testing purposes.

We obtained the following run times (in milliseconds) on some small example programs.

pi calculation (stochastic method, N=5000000):

p=1: 22.61 s p=2: 14.55 s p=3: 11.92 s p=4: 7.86 s
matrix-matrix multiplication for 200x200 integer matrices:

p=1: 41.08 s p=2: 24.16 s p=3: 19.22 s p=4: 14.49 s
parallel mergesort for 48000 integers
using the host's optimized qsort() routine for sequential sorting:

p=1: 1.02 s p=2: 0.98 s p=4: 1.65 s
parallel mergesort for 48000 integers
using a handwritten routine for sequential sorting, thus a more fair comparison:

p=1: 4.39 s p=2: 3.58 s p=4: 2.33 s
parallel quicksort for 120000 integers
using the host's optimized qsort() routine for sequential sorting:

p=1: 4.45 s p=2: 4.65 s p=4: 4.42 ms
parallel quicksort for 120000 integers
using a handwritten quicksort routine for sequential sorting (without synchronization, of course), thus a more fair comparison:

p=1: 14.5 s p=2: 9.83 s p=4: 6.35 s

2. SB-PRAM

We have ported the compiler to the SB-PRAM. An important improvement has been reached by exploiting the native fetch-add operators which do not lead to sequentialization, in comparison to standard P4 which does not efficiently support atomic fetch-add or atomic-add.

There is an operational 128 PE prototype of the SB-PRAM at Saarbrücken running the SB-PRAM operating system PRAMOS. Due to some operating system restrictions, only up to 124 PE's can be used for the application program, the other ones (one PE per processor board) are reserved as I/O servers by PRAMOS.

For our mergesort example program using (a) the optimized sequential qsort() function, as listed above and (b) a hand-written sequential quicksort routine, we obtain the following run time results:

mergesort
on SB-PRAM p=1 p=2 p=4 p=8 p=16 p=32 p=64
(a) N=1000 573 ms 373 ms 232 ms 142 ms 88 ms 57 ms 39 ms

(b) N=10000 2892 ms 4693 ms 3865 ms 1571 ms 896 ms 509 ms 290 ms

For the parallel quicksort program, using a handwritten sequential quicksort routine, we obtain the following execution times:

quicksort
on SB-PRAM p=1 p=2 p=4 p=8 p=16 p=32 p=64
N=1000 1519 ms 807 ms 454 ms 259 ms 172 ms 145 ms 130 ms

This page by Christoph W. Kessler.

mergesort on SB-PRAM	p=1	p=2	p=4	p=8	p=16	p=32	p=64
(a) N=1000	573 ms	373 ms	232 ms	142 ms	88 ms	57 ms	39 ms
(b) N=10000	2892 ms	4693 ms	3865 ms	1571 ms	896 ms	509 ms	290 ms

quicksort on SB-PRAM	p=1	p=2	p=4	p=8	p=16	p=32	p=64
N=1000	1519 ms	807 ms	454 ms	259 ms	172 ms	145 ms	130 ms