NONMEM Users Network Archive

Hosted by Cognigen

RE: setup of parallel processing and supporting software - help wanted

From: Bob Leary <Bob.Leary>
Date: Wed, 9 Dec 2015 19:36:48 +0000

Correction - the times I cited should be 18 seconds for 4 processors, 72 for 1, not the reverse

From: Bob Leary
Sent: Wednesday, December 09, 2015 1:04 PM
To: Mark Sale; Faelens, Ruben (Belgium); Pavel Belo; nmusers
Subject: RE: [NMusers] setup of parallel processing and supporting software - help wanted


a) I have to disagree with you that the efficiency of an MPI implementation does not depend on the size of the
data set for a single desktop SMP machine with multiple processors - larger data sets mean higher granularity and more cpu-bound work between stoppages for communication.
This assumes the NLME MPI implementation is done efficiently - I don't know the details of the NONMEM MPI implementation, particularly those of how communications are handled.

b) your I/O timings seem horrendously large (if by msec you mean milliseconds).

I/O times of 40 milliseconds per function evaluation (assuming 1 function evaluation is a single sweep
through all Nsub subjects, evaluating and summing the likelihood contribution from each subject) seem very high. I have been
running MPI since its original release in 1994 (I was a member of the committee that designed the first release of MPI during 1992-1994) -
these communications timings would seem more appropriate for machines from that era.

I/O timings for MPI are usually modeled by a latency (startup time - typically on current SMP single desktop machines on the
order of 1 microsecond), and a bandwidth (on the order of tens of gigabytes/sec for current era SMPs, but much lower for clusters).
Based on the latency/bandwidth model, the conventional wisdom is to manage the message processing so as to
favor a few large messages as opposed to many small messages to minimize the latency contribution.
If possible, small messages should be concatenated into larger messages.
I don't know the details of the MPI implementation in NONMEM, but for FOCE NLME algorithms, it is possible to limit the number of messages to just a few per function evaluation.
If the data set size is expanded by adding more subjects, then more work (more subjects processed)
will be done between stoppages for communication at the function evaluation
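
The latency/bandwidth model described above can be written down in a few lines. This is only a sketch: the function names are made up for illustration, and the default figures (1 microsecond latency, 10 gigabytes/sec bandwidth) are simply the order-of-magnitude values quoted in this message, not measurements of any particular machine or of any NLME program.

```python
def message_time(n_bytes, latency_s=1e-6, bandwidth_bytes_per_s=10e9):
    """Estimated time for one message: startup latency + size / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def total_time(n_messages, bytes_per_message, **kwargs):
    """Time to send n_messages equally sized messages one after another."""
    return n_messages * message_time(bytes_per_message, **kwargs)


# The same 8,000-byte payload, sent as 1,000 small messages or as one large one.
many_small = total_time(1000, 8)
one_large = total_time(1, 8000)

print(f"1000 messages of 8 bytes : {many_small * 1e6:8.1f} microseconds")
print(f"1 message of 8000 bytes  : {one_large * 1e6:8.1f} microseconds")
```

With these assumed figures the latency term dominates the small-message case, which is the reason for concatenating small messages into larger ones.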

In the MPI implementation for Phoenix NLME, I find it almost impossible to find a model where I/O dominates to such an
extent that the MPI version runs slower than the single processor version on a 4-processor Intel i7 desktop. For example,
I just tested (FOCE) the classic simple closed form Emax model used in the INSERM estimation method comparison exercise from 2004
(Girard and Mentré, PAGE 2005, abstract 234) with Phoenix NLME. It would be hard to find a simpler model -
E = E0 + EMAX*DOSE/(ED50 + DOSE) + EPS, with random effects on each of the three parameters E0, EMAX, and ED50,
and three observations per subject. If I expand the data set to around 1600 from the original 100 subjects
and run on a four processor i7, the internally reported cpu time is 72 sec for four processors vs 18 sec for one processor (a speedup of 4).
Wall clock times were a few seconds longer for each run. If I make the data set smaller, down to the original size of 100,
the speedup clearly suffers a decrease but I still observe a reported cpu time speedup of 2.5x for the four processors (times are
well under 1 sec, so reliable wall clock times are not available).
(this was done on a relatively old i7 desktop, so more current machines may do better).
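
For reference, the structural part of the Emax model quoted above is small enough to write out directly. This is a minimal sketch: the function name and the parameter values in the example are hypothetical, and the random effects on E0, EMAX, and ED50 are left out because their form is not stated in this message.

```python
def emax_response(dose, e0, emax, ed50, eps=0.0):
    """E = E0 + EMAX*DOSE/(ED50 + DOSE) + EPS for a single observation."""
    return e0 + emax * dose / (ed50 + dose) + eps


# Illustrative values only (not taken from the comparison exercise).
for dose in (0.0, 500.0, 1000.0):
    effect = emax_response(dose, e0=5.0, emax=30.0, ed50=500.0)
    print(f"dose={dose:7.1f}  E={effect:6.2f}")
```

With three observations per subject, one function evaluation amounts to a handful of such calls per subject, which is why the per-subject work is so small in this example.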

c) It is not always necessary to parallelize over function evaluations (i.e. over subjects). In importance sampling EM methods (IMP in NONMEM,
QRPEM in Phoenix NLME), in principle the parallelization can be done over the sample points used in the Monte Carlo or quasi-Monte Carlo integral evaluations -
there are usually many more of these than processors available. In PHX QRPEM we actually do it this way and it works fine. Now all processors are working on the same subject at the same time, so
load balancing problems tend to go away, but communications overhead increases since now you have to pass separate messages for
each subject, whereas in FOCE-like algorithms you only have to pass messages at the end of a sweep through all the subjects.
One thing we have noticed is that QRPEM parallelized this way is much more reproducible - single processor results almost always
match multiprocessor results exactly, which is not always the case with some of the FOCE-like methods.
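
The idea of splitting the sample points for one subject across workers can be sketched as follows. This is not the Phoenix NLME or NONMEM code: the integrand is a hypothetical stand-in for a subject's likelihood, threads stand in for MPI processes, and all names are invented for illustration.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def integrand(eta):
    """Hypothetical stand-in for one subject's likelihood at random effect eta."""
    return math.exp(-0.5 * (1.0 - eta) ** 2)


def partial_sum(points):
    """Work done by one worker: sum the integrand over its share of the points."""
    return sum(integrand(p) for p in points)


def subject_integral(points, n_workers):
    """Monte Carlo average for one subject, with the points split across workers."""
    chunk = math.ceil(len(points) / n_workers)
    chunks = [points[i:i + chunk] for i in range(0, len(points), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in chunk order, so the partial sums are
        # always combined in the same order from run to run.
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials) / len(points)


rng = random.Random(0)                      # fixed seed: same sample points every run
points = [rng.gauss(0.0, 1.0) for _ in range(4000)]

serial = subject_integral(points, n_workers=1)
parallel = subject_integral(points, n_workers=4)
print(serial, parallel)
```

In this sketch every worker handles the same subject, so the load is balanced by construction, but one round of messages is needed per subject rather than one per sweep through all subjects.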

Bob Leary
Fellow, Pharsight Corporation

From: owner-nmusers
 of Mark Sale [msale
Sent: Wednesday, December 09, 2015 7:42 AM
To: Faelens, Ruben (Belgium); Pavel Belo; nmusers
Subject: Re: [NMusers] setup of parallel processing and supporting software - help wanted

Maybe a little more clarification:

Thanks to Bob for pointing out that the


option implements some code for load balancing, and there really is no downside, so it should probably always be used.

Contrary to other comments, NONMEM 7.3 (and 7.2) does parallelize the covariance step. Ruben is correct that the $TABLE step is not parallelized in 7.

WRT sometimes it works and sometimes it doesn't, we can be more specific than this. The parallelization takes place at the level of the calculation of the objective function. The data are split up and the OBJ for the subsets of the data is sent to multiple processes. When all processes are done, the results are compiled by the manager program. The total round trip time for one process then is the calculation time + I/O time. Without parallelization, there is no I/O time. For each parallel process, the I/O time is essentially fixed (in our benchmarks maybe 20-40 msec per process on a single machine). The variable of interest then is the calculation time. If the calculation time is 1 msec and the I/O time is 20 msec, and you parallelize to 2 cores, you cut the calculation time to 0.5 msec but now have 40 msec (2*20 msec) of I/O time, for a total of 40.5 msec, much slower. If the calculation time is 500 msec, and you parallelize to 2 cores, the total time is 250 msec (for calculation) + 2*20 msec (for I/O) = 290 msec. The key parameter then is the time for a single objective function evaluation (not the total run time). If the time for a single function evaluation is > 500 msec, parallelization will be helpful (on a single machine). There really isn't anything very mystical about when it helps and when it doesn't. The efficiency depends very little on the size of the data set, except that the limit of parallelization is the number of subjects (the data set must be split up by subject).
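
The arithmetic in the paragraph above can be reproduced with a short sketch. It assumes exactly the simple model described there (calculation time divided evenly across processes, a fixed I/O cost paid once per process); the function names are made up, and the break-even formula is just algebra on that model, not a NONMEM benchmark.

```python
def parallel_time_ms(calc_ms, n_procs, io_ms_per_proc=20.0):
    """Time for one objective function evaluation under the simple model."""
    if n_procs == 1:
        return calc_ms                      # no I/O time without parallelization
    return calc_ms / n_procs + n_procs * io_ms_per_proc


def break_even_calc_ms(n_procs, io_ms_per_proc=20.0):
    """Calculation time above which n_procs beats 1 process in this model.

    Solves calc/n + n*io < calc, i.e. calc > n**2 * io / (n - 1).
    """
    return n_procs ** 2 * io_ms_per_proc / (n_procs - 1)


for calc_ms in (1.0, 500.0):
    single = parallel_time_ms(calc_ms, 1)
    dual = parallel_time_ms(calc_ms, 2)
    print(f"calc={calc_ms:6.1f} msec  1 core: {single:6.1f} msec  2 cores: {dual:6.1f} msec")

print("break-even for 2 cores:", break_even_calc_ms(2), "msec")
```

The two loop iterations reproduce the 40.5 msec and 290 msec totals worked out in the text.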

Mark Sale M.D.
Vice President, Modeling and Simulation
Nuventra, Inc. ™
2525 Meridian Parkway, Suite 280
Research Triangle Park, NC 27713
Office (919)-973-0383

Received on Wed Dec 09 2015 - 14:36:48 EST

The NONMEM Users Network is maintained by ICON plc. Requests to subscribe to the network should be sent to:

Once subscribed, you may contribute to the discussion by emailing: