The AppleCrate Parallel Work Simulator

Michael J. Mahon - July 26, 2004
Revised - November 13, 2004
Revised - December 20, 2008


When all the lower-level protocols and the message server had passed their tests, I was ready to move on to an application, and I chose a skeletal parallel work simulator. The simulator is a good prototype for many classes of "embarrassingly parallel" computing problems, where each subtask is independent of the others. I will present the example in some detail, to clarify how NadaNet can be used to control parallel execution of a program on an AppleCrate. SIM.MASTER and SIM.SLAVE can be found on the ProDOS boot disk.

The Master Program

The simulator has been revised to use the NadaNet 3.0 ampersand commands. This interface provides a simple and convenient way to use NadaNet to control slave machines.

The Applesoft BASIC program, SIM.MASTER, is presented here, with liberal commentary:

 100  REM Simulated Parallel Workload
 110  REM      MJM - 07/26/04
 115  REM     Revised 11/01/08
 120 :

This initialization presumes that NadaNet, with the AmperNada extensions, has already been installed below the OS, HIMEM has been adjusted, and an initializing call has been made. The following GOSUB merely verifies that NadaNet is installed and sets up some pointer variables for application use.

 230  GOSUB 60000: REM Initialize AmperNada

The BASIC subroutine at 61000 "takes a census" of the machines serving on the net. It classifies them according to their operating system (or lack thereof) and specific server capabilities (for example, machines running the Message Server are recorded as type 3).

All census results are recorded in a 32-byte table at ITBL. Machine ID's status is represented by a single digit at ITBL+ID as described in A Network Extension for Applesoft BASIC.

 240  GOSUB 61000: REM Take census
 250 MSRV =  PEEK (ITBL + 2) = 3: REM Save pre-boot Msg Server status
 260 BUF = 4 * 4096: REM Buffer ($4000)
 270 :

The next lines load the boot code in preparation for booting any network-bootable machines awaiting boot, then call &SERVE#(10) to enter the master's service loop with a count such that it will stay in the service loop for a fraction of a second (since, if nothing is requested, each RCVPKT call times out in about 20ms, and this specifies 10 iterations).

The purpose of this &SERVE call is to process the GETID requests of any just-booted AppleCrate slave machines. When we return from SERVE, all slave machines awaiting boot will have been booted and will be running their service loops.

 280  REM Boot unbooted slaves
 290  PRINT  CHR$ (4)"bload /ap/merlin/work/nadanet/nada.crate,A"BUF
 300 PSRT =  PEEK (BUF + 2) * 256: REM Prog start
 310 PLNG =  PEEK (48840) + 256 *  PEEK (48841): REM Prog length
 330  & SERVE#(10): REM For BOOT
 340  PRINT "Boot of slaves completed."
 350 :

When booting is completed, we retake a census to determine how many slaves are present. Since machine #2 will be used as the message server, there must be at least two slaves, so that there will be at least one slave machine to do work.

 360  GOSUB 61000: REM Re-take census
 370  IF MSRV GOTO 1000: REM Already running
 380 :
 390  REM Start Message Server
 400  IF  NOT  PEEK (ITBL + 2) THEN  PRINT "No Msg Server (2)": STOP 
 410  PRINT  CHR$ (4)"bload /ap/merlin/work/nadanet/nada.mserve,A"BUF
 420 PSRT =  PEEK (BUF + 2) * 256: REM Prog start
 430 PLNG =  PEEK (48840) + 256 *  PEEK (48841): REM Prog length
 450  &  B RUN (2,PSRT,PLNG,BUF): REM  BRUN Message Server
 490  PRINT "Message Server now running on 2"
 500 MSRV = 1: POKE ITBL + 2,3: REM Mark Msg Server
 510 :

Now we &RUN the "worker" Applesoft program on each slave machine.

To begin, we BLOAD the program into a safe area of the master's memory and save its length, then iterate over each of the slave machines, &RUNning it on each one. Note that while this simulator runs the same program in each slave machine, it would be easy to load each slave machine with an arbitrary program for more complex applications.

 1000  REM Load Applesoft simulation program
 1010 P$ = "sim.slave"
 1020 BUF = 3 * 256: REM $300 utility buffer
 1030 PBUF = 2 * 4096: REM Program buffer $2000
 1040  PRINT  CHR$ (4)"bload "P$",a"PBUF",TBAS"
 1050 PLNG =  PEEK (48840) + 256 *  PEEK (48841): REM Prog Length
 1060 PSTR = (8 * 256 + 1): REM Prog start = $0801
 1080 :
 1090  FOR D = 3 TO MX
 1100  IF  PEEK (ITBL + D) <  > 2 GOTO 1410: REM Only AppleCrate machines
 1120 & RUN (D,PSTR,PLNG,PBUF): REM  RUN the program on machine D
 1400  PRINT "'"P$"' running on "D
 1410  NEXT D
 1420 :

This is the actual parallel work simulator code.

It defines P, the maximum number of jobs that will be submitted for simultaneous processing and N, the total number of jobs. By varying P, it is possible to limit the number of slaves simultaneously active, and also to control whether the master will pre-schedule work, or whether a slave, upon completing a job, will have to wait for the master to process a result before offering another job. It also re-seeds the RND generator so that re-runs are repeatable.

 2000  REM Initialize parallel simulation
 2010 :
 2020 P = 16: REM Max simultaneous "jobs"
 2030 N = 100: REM Total number of "jobs"
 2040 X =  RND ( - 1): REM Seed RND for repeatability
 2050 :
 2060  REM Define message parameters
 2070 JB = 16: REM Job Queue ($10)
 2080 RS = 34: REM Result Queue ($22)
 2090 ML = 8: REM Length = 8
 2100 BUF = 768: REM Buffer ($300)
 2110 :

This code initializes the "job number", the "result number", and the number of jobs currently scheduled. The WL and WH parameters determine the distribution of the random workload.

 2120  REM Job parameters
 2130 JN = 0:RN = 0:SCH = 0
 2140 WL = 10
 2150 WH = 40

These constants define the format of the "job" request message. (The PDL(3) value is reported just as confirmation. It is used as the AppleCrate machine's temporary ID during the boot process. It does not change significantly.)

 2160 JW = BUF: REM Job work units
 2170 JZ = BUF + 2: REM Job number
 2180 RW = BUF + 3: REM Result work done
 2190 RP = BUF + 4: REM Result PDL(3) value
 2200 RI = BUF + 5: REM Result ID
 2210 :

This beeps the speaker just prior to starting for easy manual timing.

 2220  PRINT CHR$ (7): REM Signal start
 2230 :

This is the main loop. As long as there is more work to do and we haven't reached our scheduled work quota, it schedules another job. As long as there are results not received, it tries to receive a result.

 2240  REM Build and maintain job queue
 2250  IF JN < N AND SCH < P THEN  GOSUB 2400: REM Sched another job
 2260  IF RN < N THEN  GOSUB 2500: REM Get result
 2270  IF RN < N GOTO 2250
 2280 :

When all jobs are completed, beep again and offer the choice to re-run the same pseudorandom workload for further examination. Otherwise, send a message of length 20 for each slave machine to cause them to end their programs and return to their service loops.

 2290  PRINT CHR$ (7)"All jobs completed."
 2320  PRINT "Run again? ";: GET K$: PRINT : IF K$ = "y" GOTO 2000
 2330  FOR D = 3 TO MX
 2340  IF  PEEK (ITBL + D) <  > 2 GOTO 2360: REM Only AppleCrate machines
 2350  & PUTMSG(2,JB,20,BUF): REM Send termination message
 2360  NEXT D
 2370  END 
 2380 :

This subroutine computes a random amount of "work" and stuffs it into the job message, along with the ordinal number of the job. It then sets the "job" message class and calls PUTMSG to send it to the message server.

 2400  REM Schedule new job
 2410 JN = JN + 1
 2420 W =  INT ((WH - WL) *  RND (1) + WL): REM Number of work units
 2430  POKE JW,W
 2440  POKE JZ,JN: REM Job number
 2460  & PUTMSG(2,JB,ML,BUF)
 2470 SCH = SCH + 1
 2480  RETURN 
 2490 :

This subroutine tries to receive a result message from the message server using &GETMSG#. If there is a message, it updates counters and prints a narrative report on the completed job. Note that the "no exception" form of &GETMSG is used to permit program logic to depend on whether the queue is empty or not.

 2500  REM Receive job result
 2520  &  GET MSG#(2,RS,L,BUF)
 2540  IF  PEEK (1) THEN  RETURN : REM No result message
 2550 SCH = SCH - 1:RN = RN + 1
 2560  PRINT RN": Job " PEEK (JZ)", " PEEK (RW)" units, on " PEEK (RI)" (PDL3=" PEEK (RP)")"
 2570  RETURN 
 2580 :

This is the standard "epilog" for NadaNet applications, added by EXECing the file NADADEFS. The subroutine at 60000 sets up a few named pointers. The subroutine at 61000 performs a "census" of serving machines on the net and saves the results in a table at ITBL. Note that because the census loop is probing machines that may not be serving (or even exist), it temporarily sets the &TIMEOUT value down to 3 * 60ms. so that unresponsive machines do not stall the &PEEK for very long. The default timeout is restored when the census is completed.

 60000  REM NadaNet Definitions for Applesoft
 60010  REM        MJM - 12/20/08
 60020 :
 60030 NP = 3 * 256 + 12 * 16 + 15: REM NADANET page ($3CF)
 60040  IF  PEEK (NP - 2) <  > 76 THEN  PRINT "NadaNet not loaded.": STOP 
 60050  IF  PEEK (NP) <  > 145 THEN  PRINT "Not ProDOS version of NadaNet.": STOP 
 60060 :
 60070 MX = 20: REM Max machine ID
 60080 BSLF = NP - 3: REM Boot ID in $3CC
 60090  & IDTBL(ITBL): REM ID table
 60100  RETURN 
 60110 :
 61000  REM Take census of serving machines
 61010  & TIMEOUT(2)
 61020 QC =  INT ( PEEK (33) / 13): REM Number of columns
 61030 QL =  INT ((MX + QC - 1) / QC): REM Number of lines
 61040 SELF =  PEEK (BSLF)
 61050  FOR I = 1 TO QL
 61060  FOR D = I TO MX STEP QL
 61070  IF D = SELF THEN J =  PEEK (975): GOTO 61120
 61080 A$ = "        "
 61090  &  PEEK #(D,975,1,512): REM Machine type
 61100  IF  PEEK (1) THEN K = 0: GOTO 61160
 61110 J =  PEEK (512)
 61120  IF J = 184 THEN A$ = "CRATE   ":K = 2
 61130  IF J = 008 THEN A$ = "MSERVER ":K = 3
 61140  IF J = 145 THEN A$ = "PRODOS  ":K = 4
 61150  IF J = 141 THEN A$ = "DOS     ":K = 5
 61160  POKE ITBL + D,K: REM Save type in IDTBL
 61170  IF D = SELF THEN A$ = "==SELF=="
 61180  IF D < 10 THEN  PRINT " ";
 61190  PRINT D":"A$;: PRINT "  ";
 61200  NEXT D
 61210  PRINT 
 61220  NEXT I
 61230  & TIMEOUT(): REM Reset retrys
 61240  RETURN 

The Slave Program

This is the Applesoft program SIM.SLAVE that runs in each slave machine to process the jobs scheduled by the parallel work simulator. In this simple example, all slaves process the same queue of jobs, and all run the same program. In a real application it is not necessary for all slaves to run the same program, though this is a common case.

 100  REM  Simulated Parallel Slave program
 110  REM         MJM - 07/09/04
 120  REM        Revised 11/17/04

We define the speaker address so that we can PEEK it within the "work" loop. AppleCrate boards have no connected I/O except two LEDs. One is the factory "speaker LED" which lights when the speaker is toggled and there is no speaker connected-this will serve as our "busy" indicator for the slave machines. The other is the "send LED" which is connected to the PDL(3) timer, and is triggered each time that a board sends a packet.

The effect of these two LEDs is to nicely display the state of each slave: green send LED pulsing when the slave is polling for work, red speaker LED when the slave is doing a job. During a run, it is fascinating to watch the entire operation in the AppleCrate LEDs-watching the work queue grow and slave boards getting work, then seeing the progress of the simulation as jobs are processed, and finally watching the slave machines finish their final jobs and return to polling the message server for additional work.

 140 SP = 49200: REM Speaker
 150 ID =  PEEK (3 * 256 + 12 * 16 + 12): REM Our ID ($3CC)
 160 :

Next, we define the messages used to communicate work to the slaves and results back to the master. Although the work and result messages here have the same format, in the more general case the messages would have different formats.

 1000  REM  Parallel work simulation program
 1010 :
 1020  REM  Definitions
 1030 :
 1040  REM  Message parameters
 1050 ML = 8: REM Length = 8
 1060 BUF = 768: REM Message buffer ($300)
 1070 JB = 16: REM Job Queue ($10)
 1080 RS = 34: REM Result Queue ($22)
 1090 :
 1100  REM  Job parameters
 1110 JL = BUF: REM Work lower limit
 1120 JU = BUF + 1: REM Work upper limit
 1130 RW = BUF + 3: REM Result work done
 1140 RP = BUF + 4: REM PDL(3) value
 1150 RI = BUF + 5: REM Our ID
 1160 :

This is the main loop. It continually attempts to get a job from the job queue and, if successful, it "processes" it (in this case, just iterations of a delay loop) and sends the result to the result queue.

In this example, each slave endlessly polls for more work. If the master machine chooses to re-start the simulation, then the Job queue refills and the slaves start to get more work. If the master sends a message with a length of 20 bytes, the slave program terminates, returning the machine to the service loop.

 2000  REM  Main loop
 2010 :
 2020  FOR J = 1 TO 20: NEXT : REM Re-try delay
 2030  &  GET MSG#(2,JB,L,BUF): REM Get first job in queue
 2040  IF  PEEK (C) GOTO 2020: REM If no job, keep trying
 2050  IF L = 20 THEN PRINT "End SIM.SLAVE": END : REM End signal
 2060 WW =  PEEK (JL)
 2070  FOR I = 1 TO WW: REM Simulate work
 2080  FOR J = 1 TO 10:X =  PEEK (SP): NEXT J
 2090  NEXT I
 2100 :

This subroutine is used to place a result on the result message queue. It adds some information about the job into the message, particularly which slave did the processing.

 2110  REM  Deliver result
 2120  POKE RW,WW: REM Work done
 2130  POKE RP, PDL (3): REM PDL 3
 2140  POKE RI,ID: REM Our machine ID
 2150  & PUTMSG(2,RS,ML,BUF): REM Deliver result
 2160  GOTO 2030


The parallel work simulator provides a simple way to experiment with parallel speedup while varying the degree of parallelism. I ran a series of tests on the 8-processor AppleCrate I with different levels of parallelism ("P" in the simulator) and plotted the time required to complete all jobs (in seconds), and the performance as a percent of maximum speedup. Note that 7x corresponds to 100% of the maximum possible speedup, since one of the eight AppleCrate machines is used as the Message Server. For comparison, a linear speedup line is also plotted.

As you can see, for the workload being simulated, where each job takes about a second on average, the scheduling infrastructure (the master, the message server, and the network) is far from saturation, and the speedup is nearly linear. The minor asymptotic improvements with more than 7 jobs scheduled is a result of keeping the job queue non-empty so that no slave ever has to wait for work.