The AppleCrate Parallel Work Simulator
Michael J. Mahon - July 26, 2004
Revised - November 13, 2004
Revised - December 20, 2008
Introduction
When all the lower-level protocols and the message server had passed their tests, I was ready to move on to an application, and I chose a skeletal parallel work simulator. The simulator is a good prototype for many classes of "embarrassingly parallel" computing problems, where each subtask is independent of the others. I will present the example in some detail, to clarify how NadaNet can be used to control parallel execution of a program on an AppleCrate. SIM.MASTER and SIM.SLAVE can be found on the ProDOS boot disk.
The Master Program
The simulator has been revised to use the NadaNet 3.0 ampersand commands. This interface provides a simple and convenient way to use NadaNet to control slave machines.
The Applesoft BASIC program, SIM.MASTER, is presented here, with liberal commentary:
100 REM Simulated Parallel Workload 110 REM MJM - 07/26/04 115 REM Revised 11/01/08 120 :
This initialization presumes that NadaNet, with the AmperNada extensions, has already been installed below the OS, HIMEM has been adjusted, and an initializing call has been made. The following GOSUB merely verifies that NadaNet is installed and sets up some pointer variables for application use.
230 GOSUB 60000: REM Initialize AmperNada
The BASIC subroutine at 61000 "takes a census" of the machines serving on the net. It classifies them according to their operating system (or lack thereof) and specific server capabilities (for example, machines running the Message Server are recorded as type 3).
All census results are recorded in a 32-byte table at ITBL. Machine ID's status is represented by a single digit at ITBL+ID as described in A Network Extension for Applesoft BASIC.
240 GOSUB 61000: REM Take census 250 MSRV = PEEK (ITBL + 2) = 3: REM Save pre-boot Msg Server status 260 BUF = 4 * 4096: REM Buffer ($4000) 270 :
The next lines load the boot code in preparation for booting any network-bootable machines awaiting boot, then call &SERVE#(10) to enter the master's service loop with a count such that it will stay in the service loop for a fraction of a second (since, if nothing is requested, each RCVPKT call times out in about 20ms, and this specifies 10 iterations).
The purpose of this &SERVE call is to process the GETID requests of any just-booted AppleCrate slave machines. When we return from SERVE, all slave machines awaiting boot will have been booted and will be running their service loops.
280 REM Boot unbooted slaves 290 PRINT CHR$ (4)"bload /ap/merlin/work/nadanet/nada.crate,A"BUF 300 PSRT = PEEK (BUF + 2) * 256: REM Prog start 310 PLNG = PEEK (48840) + 256 * PEEK (48841): REM Prog length 320 & BOOT(PSRT,PLNG,BUF) 330 & SERVE#(10): REM For BOOT 340 PRINT "Boot of slaves completed." 350 :
When booting is completed, we retake a census to determine how many slaves are present. Since machine #2 will be used as the message server, there must be at least two slaves, so that there will be at least one slave machine to do work.
360 GOSUB 61000: REM Re-take census 370 IF MSRV GOTO 1000: REM Already running 380 : 390 REM Start Message Server 400 IF NOT PEEK (ITBL + 2) THEN PRINT "No Msg Server (2)": STOP 410 PRINT CHR$ (4)"bload /ap/merlin/work/nadanet/nada.mserve,A"BUF 420 PSRT = PEEK (BUF + 2) * 256: REM Prog start 430 PLNG = PEEK (48840) + 256 * PEEK (48841): REM Prog length 450 & B RUN (2,PSRT,PLNG,BUF): REM BRUN Message Server 490 PRINT "Message Server now running on 2" 500 MSRV = 1: POKE ITBL + 2,3: REM Mark Msg Server 510 :
Now we &RUN the "worker" Applesoft program on each slave machine.
To begin, we BLOAD the program into a safe area of the master's memory and save its length, then iterate over each of the slave machines, &RUNning it on each one. Note that while this simulator runs the same program in each slave machine, it would be easy to load each slave machine with an arbitrary program for more complex applications.
1000 REM Load Applesoft simulation program 1010 P$ = "sim.slave" 1020 BUF = 3 * 256: REM $300 utility buffer 1030 PBUF = 2 * 4096: REM Program buffer $2000 1040 PRINT CHR$ (4)"bload "P$",a"PBUF",TBAS" 1050 PLNG = PEEK (48840) + 256 * PEEK (48841): REM Prog Length 1060 PSTR = (8 * 256 + 1): REM Prog start = $0801 1080 : 1090 FOR D = 3 TO MX 1100 IF PEEK (ITBL + D) < > 2 GOTO 1410: REM Only AppleCrate machines 1120 & RUN (D,PSTR,PLNG,PBUF): REM RUN the program on machine D 1400 PRINT "'"P$"' running on "D 1410 NEXT D 1420 :
This is the actual parallel work simulator code.
It defines P, the maximum number of jobs that will be submitted for simultaneous processing and N, the total number of jobs. By varying P, it is possible to limit the number of slaves simultaneously active, and also to control whether the master will pre-schedule work, or whether a slave, upon completing a job, will have to wait for the master to process a result before offering another job. It also re-seeds the RND generator so that re-runs are repeatable.
2000 REM Initialize parallel simulation 2010 : 2020 P = 16: REM Max simultaneous "jobs" 2030 N = 100: REM Total number of "jobs" 2040 X = RND ( - 1): REM Seed RND for repeatability 2050 : 2060 REM Define message parameters 2070 JB = 16: REM Job Queue ($10) 2080 RS = 34: REM Result Queue ($22) 2090 ML = 8: REM Length = 8 2100 BUF = 768: REM Buffer ($300) 2110 :
This code initializes the "job number", the "result number", and the number of jobs currently scheduled. The WL and WH parameters determine the distribution of the random workload.
2120 REM Job parameters 2130 JN = 0:RN = 0:SCH = 0 2140 WL = 10 2150 WH = 40
These constants define the format of the "job" request message. (The PDL(3) value is reported just as confirmation. It is used as the AppleCrate machine's temporary ID during the boot process. It does not change significantly.)
2160 JW = BUF: REM Job work units 2170 JZ = BUF + 2: REM Job number 2180 RW = BUF + 3: REM Result work done 2190 RP = BUF + 4: REM Result PDL(3) value 2200 RI = BUF + 5: REM Result ID 2210 :
This beeps the speaker just prior to starting for easy manual timing.
2220 PRINT CHR$ (7): REM Signal start 2230 :
This is the main loop. As long as there is more work to do and we haven't reached our scheduled work quota, it schedules another job. As long as there are results not received, it tries to receive a result.
2240 REM Build and maintain job queue 2250 IF JN < N AND SCH < P THEN GOSUB 2400: REM Sched another job 2260 IF RN < N THEN GOSUB 2500: REM Get result 2270 IF RN < N GOTO 2250 2280 :
When all jobs are completed, beep again and offer the choice to re-run the same pseudorandom workload for further examination. Otherwise, send a message of length 20 for each slave machine to cause them to end their programs and return to their service loops.
2290 PRINT CHR$ (7)"All jobs completed." 2320 PRINT "Run again? ";: GET K$: PRINT : IF K$ = "y" GOTO 2000 2330 FOR D = 3 TO MX 2340 IF PEEK (ITBL + D) < > 2 GOTO 2360: REM Only AppleCrate machines 2350 & PUTMSG(2,JB,20,BUF): REM Send termination message 2360 NEXT D 2370 END 2380 :
This subroutine computes a random amount of "work" and stuffs it into the job message, along with the ordinal number of the job. It then sets the "job" message class and calls PUTMSG to send it to the message server.
2400 REM Schedule new job 2410 JN = JN + 1 2420 W = INT ((WH - WL) * RND (1) + WL): REM Number of work units 2430 POKE JW,W 2440 POKE JZ,JN: REM Job number 2460 & PUTMSG(2,JB,ML,BUF) 2470 SCH = SCH + 1 2480 RETURN 2490 :
This subroutine tries to receive a result message from the message server using &GETMSG#. If there is a message, it updates counters and prints a narrative report on the completed job. Note that the "no exception" form of &GETMSG is used to permit program logic to depend on whether the queue is empty or not.
2500 REM Receive job result 2520 & GET MSG#(2,RS,L,BUF) 2540 IF PEEK (1) THEN RETURN : REM No result message 2550 SCH = SCH - 1:RN = RN + 1 2560 PRINT RN": Job " PEEK (JZ)", " PEEK (RW)" units, on " PEEK (RI)" (PDL3=" PEEK (RP)")" 2570 RETURN 2580 :
This is the standard "epilog" for NadaNet applications, added by EXECing the file NADADEFS. The subroutine at 60000 sets up a few named pointers. The subroutine at 61000 performs a "census" of serving machines on the net and saves the results in a table at ITBL. Note that because the census loop is probing machines that may not be serving (or even exist), it temporarily sets the &TIMEOUT value down to 3 * 60ms. so that unresponsive machines do not stall the &PEEK for very long. The default timeout is restored when the census is completed.
60000 REM NadaNet Definitions for Applesoft 60010 REM MJM - 12/20/08 60020 : 60030 NP = 3 * 256 + 12 * 16 + 15: REM NADANET page ($3CF) 60040 IF PEEK (NP - 2) < > 76 THEN PRINT "NadaNet not loaded.": STOP 60050 IF PEEK (NP) < > 145 THEN PRINT "Not ProDOS version of NadaNet.": STOP 60060 : 60070 MX = 20: REM Max machine ID 60080 BSLF = NP - 3: REM Boot ID in $3CC 60090 & IDTBL(ITBL): REM ID table 60100 RETURN 60110 : 61000 REM Take census of serving machines 61010 & TIMEOUT(2) 61020 QC = INT ( PEEK (33) / 13): REM Number of columns 61030 QL = INT ((MX + QC - 1) / QC): REM Number of lines 61040 SELF = PEEK (BSLF) 61050 FOR I = 1 TO QL 61060 FOR D = I TO MX STEP QL 61070 IF D = SELF THEN J = PEEK (975): GOTO 61120 61080 A$ = " " 61090 & PEEK #(D,975,1,512): REM Machine type 61100 IF PEEK (1) THEN K = 0: GOTO 61160 61110 J = PEEK (512) 61120 IF J = 184 THEN A$ = "CRATE ":K = 2 61130 IF J = 008 THEN A$ = "MSERVER ":K = 3 61140 IF J = 145 THEN A$ = "PRODOS ":K = 4 61150 IF J = 141 THEN A$ = "DOS ":K = 5 61160 POKE ITBL + D,K: REM Save type in IDTBL 61170 IF D = SELF THEN A$ = "==SELF==" 61180 IF D < 10 THEN PRINT " "; 61190 PRINT D":"A$;: PRINT " "; 61200 NEXT D 61210 PRINT 61220 NEXT I 61230 & TIMEOUT(): REM Reset retrys 61240 RETURN
The Slave Program
This is the Applesoft program SIM.SLAVE that runs in each slave machine to process the jobs scheduled by the parallel work simulator. In this simple example, all slaves process the same queue of jobs, and all run the same program. In a real application it is not necessary for all slaves to run the same program, though this is a common case.
100 REM Simulated Parallel Slave program 110 REM MJM - 07/09/04 120 REM Revised 11/17/04
We define the speaker address so that we can PEEK it within the "work" loop. AppleCrate boards have no connected I/O except two LEDs. One is the factory "speaker LED" which lights when the speaker is toggled and there is no speaker connected-this will serve as our "busy" indicator for the slave machines. The other is the "send LED" which is connected to the PDL(3) timer, and is triggered each time that a board sends a packet.
The effect of these two LEDs is to nicely display the state of each slave: green send LED pulsing when the slave is polling for work, red speaker LED when the slave is doing a job. During a run, it is fascinating to watch the entire operation in the AppleCrate LEDs-watching the work queue grow and slave boards getting work, then seeing the progress of the simulation as jobs are processed, and finally watching the slave machines finish their final jobs and return to polling the message server for additional work.
140 SP = 49200: REM Speaker 150 ID = PEEK (3 * 256 + 12 * 16 + 12): REM Our ID ($3CC) 160 :
Next, we define the messages used to communicate work to the slaves and results back to the master. Although the work and result messages here have the same format, in the more general case the messages would have different formats.
1000 REM Parallel work simulation program 1010 : 1020 REM Definitions 1030 : 1040 REM Message parameters 1050 ML = 8: REM Length = 8 1060 BUF = 768: REM Message buffer ($300) 1070 JB = 16: REM Job Queue ($10) 1080 RS = 34: REM Result Queue ($22) 1090 : 1100 REM Job parameters 1110 JL = BUF: REM Work lower limit 1120 JU = BUF + 1: REM Work upper limit 1130 RW = BUF + 3: REM Result work done 1140 RP = BUF + 4: REM PDL(3) value 1150 RI = BUF + 5: REM Our ID 1160 :
This is the main loop. It continually attempts to get a job from the job queue and, if successful, it "processes" it (in this case, just iterations of a delay loop) and sends the result to the result queue.
In this example, each slave endlessly polls for more work. If the master machine chooses to re-start the simulation, then the Job queue refills and the slaves start to get more work. If the master sends a message with a length of 20 bytes, the slave program terminates, returning the machine to the service loop.
2000 REM Main loop 2010 : 2020 FOR J = 1 TO 20: NEXT : REM Re-try delay 2030 & GET MSG#(2,JB,L,BUF): REM Get first job in queue 2040 IF PEEK (C) GOTO 2020: REM If no job, keep trying 2050 IF L = 20 THEN PRINT "End SIM.SLAVE": END : REM End signal 2060 WW = PEEK (JL) 2070 FOR I = 1 TO WW: REM Simulate work 2080 FOR J = 1 TO 10:X = PEEK (SP): NEXT J 2090 NEXT I 2100 :
This subroutine is used to place a result on the result message queue. It adds some information about the job into the message, particularly which slave did the processing.
2110 REM Deliver result 2120 POKE RW,WW: REM Work done 2130 POKE RP, PDL (3): REM PDL 3 2140 POKE RI,ID: REM Our machine ID 2150 & PUTMSG(2,RS,ML,BUF): REM Deliver result 2160 GOTO 2030
Performance
The parallel work simulator provides a simple way to experiment with parallel speedup while varying the degree of parallelism. I ran a series of tests on the 8-processor AppleCrate I with different levels of parallelism ("P" in the simulator) and plotted the time required to complete all jobs (in seconds), and the performance as a percent of maximum speedup. Note that 7x corresponds to 100% of the maximum possible speedup, since one of the eight AppleCrate machines is used as the Message Server. For comparison, a linear speedup line is also plotted.
As you can see, for the workload being simulated, where each job takes about a second on average, the scheduling infrastructure (the master, the message server, and the network) is far from saturation, and the speedup is nearly linear. The minor asymptotic improvements with more than 7 jobs scheduled is a result of keeping the job queue non-empty so that no slave ever has to wait for work.