
David M. Dantowitz

 

Description of my M.S. Thesis research, 1989-1993

 

SPEDS: A Synchronous Parallel Event-Driven Simulator

 

Goal

To study and develop a discrete-event simulation engine with maximal efficiency for single-processor and shared-memory parallel computers. SPEDS permits a broad range of systems to be simulated without regard to whether the simulation is running on a parallel machine.

 

Summary

SPEDS, a new synchronous parallel event-driven simulator for shared-memory multiprocessors, was developed with a highly optimized message transport system that minimizes cross-processor memory traffic and the use of synchronization primitives.

 

Accomplishments

In designing and implementing a complete parallel simulation engine from scratch and running it on three different operating systems and platforms (one a parallel supercomputer), I learned a great deal. At a high level, the tasks included designing high-performance, scalable parallel data structures, analyzing how data is shared by processors running in parallel, and debugging the SPEDS kernel, a complex multi-threaded parallel application.

 

Design Guidelines

My advisor had used SIMON II, a single-processor simulator written in Modula-2, so the SPEDS API was created with SIMON II compatibility in mind. Written in "vanilla" C, the code is very portable (an MPI compatibility mode should be very easy to construct). SPEDS runs on Macintosh computers, Sun workstations, and a BBN TC2000 parallel supercomputer (the platforms available during development).

 

Requirements

To run SPEDS on a particular platform you need a C compiler, thread support, and, for multiprocessor machines, the ability to allocate dynamic memory three ways:

 

a) locally (non-shared),

b) shared, location not important,

c) for optimal performance, shared memory located on processor i,

    where i is specified at the time memory is requested
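
As a rough illustration, the three allocation modes could sit behind a single entry point. This is only a sketch: the names mem_kind and sim_alloc are hypothetical and are not taken from the SPEDS source, and the fallback shown here simply uses malloc() for every mode, which is what a port to a single-processor machine might do. A port to a machine with per-processor memory would replace the body of case (c) with the platform's own placement call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names: none of these identifiers come from SPEDS itself. */
typedef enum {
    MEM_LOCAL,      /* (a) private to the calling processor            */
    MEM_SHARED,     /* (b) shared, placement left to the system        */
    MEM_SHARED_ON   /* (c) shared, placed on a caller-chosen processor */
} mem_kind;

/*
 * Portable fallback: every mode collapses to malloc(). The `processor`
 * argument only matters for MEM_SHARED_ON and is ignored otherwise.
 */
void *sim_alloc(mem_kind kind, int processor, size_t bytes)
{
    switch (kind) {
    case MEM_LOCAL:        /* (a) */
    case MEM_SHARED:       /* (b) */
        return malloc(bytes);
    case MEM_SHARED_ON:    /* (c) a real port would place the block near `processor` */
        assert(processor >= 0);
        return malloc(bytes);
    }
    return NULL;           /* unknown kind */
}
```

Keeping the processor number in the interface even when it is ignored means calling code does not change when the kernel moves to a machine that can honor it.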

 

Structure

To use the simulation engine, a user designs a set of objects, each with a set of ports it uses to send and receive messages (each object runs in its own thread). During setup, objects are created, and the user directs which ports connect from object to object. At run-time, objects read incoming messages, create messages, and send them out a port for SPEDS to deliver to all attached ports.
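
The connect-then-send pattern described above can be sketched in a few lines of C. This is not the SPEDS API: the type port and the functions port_connect, port_send, and port_receive are hypothetical, messages are reduced to plain integers, and the per-object threads are left out so that only the fan-out to attached ports is shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 4   /* illustrative limits, not SPEDS values */
#define MAX_INBOX 8

typedef struct port port;

struct port {
    port *links[MAX_LINKS];   /* ports attached to this one        */
    int   n_links;
    int   inbox[MAX_INBOX];   /* messages waiting to be read       */
    int   n_inbox;
};

/* Setup phase: attach `to` so that messages sent out `from` reach it. */
int port_connect(port *from, port *to)
{
    if (from->n_links == MAX_LINKS)
        return -1;
    from->links[from->n_links++] = to;
    return 0;
}

/* Run time: copy one message to every attached port.
 * Returns the number of ports that accepted it. */
int port_send(port *from, int msg)
{
    int delivered = 0;
    for (int i = 0; i < from->n_links; i++) {
        port *dst = from->links[i];
        if (dst->n_inbox < MAX_INBOX) {
            dst->inbox[dst->n_inbox++] = msg;
            delivered++;
        }
    }
    return delivered;
}

/* Run time: remove the oldest waiting message.
 * Returns 1 and stores it in *msg, or 0 if the inbox is empty. */
int port_receive(port *p, int *msg)
{
    if (p->n_inbox == 0)
        return 0;
    *msg = p->inbox[0];
    for (int i = 1; i < p->n_inbox; i++)
        p->inbox[i - 1] = p->inbox[i];
    p->n_inbox--;
    return 1;
}
```

With one output port connected to two input ports, a single port_send call places a copy of the message in both inboxes, which is the "deliver to all attached ports" behavior.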

 

Performance Optimization

I examined two sources of overhead as targets for performance optimization:

 

a) Cross-processor memory traffic

b) Synchronization primitives

 

Cross-processor memory traffic can't be avoided when objects are running on different processors, but the overhead can be reduced by optimizing the code that creates, schedules, and moves messages between objects[1]. By developing a specialized transport mechanism, I was able to streamline message handling to a very high degree.

 

Synchronization primitives are a key source of overhead in synchronous parallel processing:

 

i)  for each clock tick in the simulation, each processor must wait for the others in the simulation group

ii) synchronization primitives are used to control access to shared memory resources used by the simulation engine

 

With lock-step, synchronous simulation, the overhead due to (i) is unavoidable unless conservative or optimistic simulation methods are used for clock updates. At least one all-processor synchronization primitive is required for each step of the simulation to keep the processors in lock step.
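
A minimal sketch of that per-step synchronization, assuming POSIX threads. The barrier is hand-built from a mutex and a condition variable, and every name here (tick_barrier, worker, run_lockstep, the constants) is hypothetical rather than taken from the SPEDS kernel. Each worker does its work for the current tick and then waits at the barrier, so no worker can start tick t+1 until all of them have finished tick t.

```c
#include <assert.h>
#include <pthread.h>

#define N_WORKERS 4     /* illustrative values only */
#define N_TICKS   100

/* Reusable barrier: the round counter lets the same barrier serve every tick. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int      waiting;   /* threads that have arrived in this round */
    int      parties;   /* threads that must arrive                */
    unsigned round;     /* incremented each time the barrier opens */
} tick_barrier;

void barrier_init(tick_barrier *b, int parties)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->waiting = 0;
    b->parties = parties;
    b->round   = 0;
}

void barrier_wait(tick_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned my_round = b->round;
    if (++b->waiting == b->parties) {
        /* Last arrival: open the barrier for everyone. */
        b->waiting = 0;
        b->round++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        /* Loop guards against spurious wakeups. */
        while (b->round == my_round)
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

tick_barrier tick_sync;
long work_done[N_WORKERS];   /* each worker writes only its own slot */

void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int tick = 0; tick < N_TICKS; tick++) {
        work_done[id]++;            /* stand-in for processing this tick's events */
        barrier_wait(&tick_sync);   /* the one all-processor sync per step */
    }
    return NULL;
}

/* Runs all workers to completion; returns how many times the barrier opened. */
int run_lockstep(void)
{
    pthread_t threads[N_WORKERS];
    int ids[N_WORKERS];

    barrier_init(&tick_sync, N_WORKERS);
    for (int i = 0; i < N_WORKERS; i++) {
        ids[i] = i;
        work_done[i] = 0;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < N_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return (int)tick_sync.round;
}
```

The barrier opens exactly once per tick, so run_lockstep() returns N_TICKS. On some systems the program must be built with the compiler's -pthread option.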

 

Synchronization is also required in access methods (ii) to keep data structures consistent when shared asynchronously by multiple processors. I reduced the need for this type of synchronization by creating specialized data access methods.
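
One common way to reduce locking in a shared structure is to give each writer its own region, so that writers never touch the same memory. The sketch below applies that idea to a mailbox with one lane per sending processor. It is an assumption chosen for illustration, not a description of the access methods SPEDS actually uses, and all names (mailbox, mailbox_post, mailbox_drain) are hypothetical. Posting needs no lock because a sender writes only to its own lane; draining is assumed to happen after the tick synchronization, when no sender is writing.

```c
#include <assert.h>

#define N_PROCS  4    /* illustrative values only */
#define LANE_CAP 16

typedef struct {
    int msgs[N_PROCS][LANE_CAP];   /* one lane per sending processor */
    int count[N_PROCS];            /* messages currently in each lane */
} mailbox;

/* Called by processor `sender` during a tick. It touches only its own
 * lane, so concurrent senders need no lock. Returns -1 if the lane is full. */
int mailbox_post(mailbox *mb, int sender, int msg)
{
    assert(sender >= 0 && sender < N_PROCS);
    if (mb->count[sender] == LANE_CAP)
        return -1;
    mb->msgs[sender][mb->count[sender]++] = msg;
    return 0;
}

/* Called by the mailbox owner between ticks, when no sender is writing.
 * Copies messages to `out` in lane order and empties every lane.
 * Messages beyond `out_cap` are dropped in this sketch.
 * Returns the number of messages copied. */
int mailbox_drain(mailbox *mb, int *out, int out_cap)
{
    int n = 0;
    for (int s = 0; s < N_PROCS; s++) {
        for (int i = 0; i < mb->count[s] && n < out_cap; i++)
            out[n++] = mb->msgs[s][i];
        mb->count[s] = 0;
    }
    return n;
}
```

The trade-off is memory: every mailbox reserves space for every possible sender, in exchange for removing a lock from the message path.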

 

Instrumentation

To study performance and related statistics, the SPEDS kernel is heavily instrumented to measure delays due to synchronization, message delivery, thread use, and more (instrumentation is enabled optionally at compile time). The kernel also has three levels of debugging settings, which aid in porting the kernel and tracing and debugging simulations.
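
Compile-time instrumentation of this kind is usually done with preprocessor macros that expand to nothing when the feature is switched off. The following is a sketch under that assumption: the switch SIM_INSTRUMENT, the type sim_timer, and the macros SIM_TIMER_BEGIN and SIM_TIMER_END are hypothetical names, not the ones used in the SPEDS kernel, and clock() stands in for whatever timer a given platform provides.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical switch: build with -DSIM_INSTRUMENT=0 to compile timing out. */
#ifndef SIM_INSTRUMENT
#define SIM_INSTRUMENT 1
#endif

typedef struct {
    clock_t total;     /* accumulated processor time */
    long    samples;   /* number of timed intervals  */
} sim_timer;

#if SIM_INSTRUMENT
/* `t` must be a plain variable name; the start time lives in t_start. */
#define SIM_TIMER_BEGIN(t)  clock_t t##_start = clock()
#define SIM_TIMER_END(t)    do { (t).total += clock() - t##_start; \
                                 (t).samples++; } while (0)
#else
#define SIM_TIMER_BEGIN(t)  ((void)0)
#define SIM_TIMER_END(t)    ((void)0)
#endif

sim_timer sync_wait;   /* time spent waiting for the other processors */

void wait_for_tick(void)
{
    SIM_TIMER_BEGIN(sync_wait);
    /* ... the real synchronization wait would go here ... */
    SIM_TIMER_END(sync_wait);
}
```

With instrumentation enabled, each call to wait_for_tick() adds one sample to sync_wait; with it disabled, both macros expand to empty statements and the timed code carries no measurement cost.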

 

Status

Results from parallel runs of existing simulations were excellent and demonstrated significant gains from running on multiple CPUs.

 

 



[1] Partitioning a simulation so that tightly coupled objects are located on the same processor also minimizes cross-processor traffic, but this optimization was not addressed in the original research.