NONMEM Users Network Archive

Hosted by Cognigen

RE: setup of parallel processing and supporting software - help wanted

From: Bob Leary <Bob.Leary>
Date: Wed, 9 Dec 2015 21:41:47 +0000

Mark -
I did not realize you were talking about FPI rather than MPI. Your multi-millisecond latencies are right for disk I/O, but I was referring to direct memory-to-memory message passing, which is orders of magnitude faster than going through a disk. Why would anyone use FPI if MPI on an SMP were available (at least for parallelizing a single job; I am not talking about 'embarrassingly parallel' tasks such as bootstrap, where MPI, as several in this thread have correctly remarked, should normally not be used since it just introduces extra overhead)?

From: Mark Sale [msale
Sent: Wednesday, December 09, 2015 1:56 PM
To: Bob Leary; Faelens, Ruben (Belgium); Pavel Belo; nmusers
Subject: Re: [NMusers] setup of parallel processing and supporting software - help wanted


For what it is worth, the 20 msec was when we first started developing this, which has been about 10 years now. Disc seek time is currently about 8 msec and latency about 4 msec, so we thought 20 was reasonable.

That is not the I/O time for a "single sweep through the data". This is not data (i.e., FDATA) reading: each process gets its own copy of the data, and MPI does not send data (i.e., FDATA) back and forth. The I/O time is for:

1. Manager to write the required information (THETA and a whole lot of other stuff)

2. worker to read that information

3. worker to write the information back

4. Manager to read the required information.

In between steps 2 and 3 is the "calculation" part.
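
The four steps above can be illustrated with a minimal sketch of a file-based round trip. Everything here is hypothetical: the JSON file format, the file names, and the toy sum-of-squares "calculation" are assumptions for illustration and are not how NONMEM's FPI actually stores its information. As described above, each worker keeps its own copy of the data, so only the parameter value travels through the files.

```python
import json
import tempfile
from pathlib import Path


def worker(inbox: Path, outbox: Path, local_data: list[float]) -> None:
    """Steps 2 and 3: read the manager's file, compute, write the result back."""
    theta = json.loads(inbox.read_text())["theta"]            # step 2: worker reads
    # Toy "calculation" on the worker's own data copy (not a real objective function).
    partial = sum((y - theta) ** 2 for y in local_data)
    outbox.write_text(json.dumps({"partial_obj": partial}))   # step 3: worker writes


def manager_round_trip(theta: float, worker_data: list[list[float]]) -> float:
    """Steps 1 and 4 for each worker; workers run one after another for simplicity."""
    total = 0.0
    with tempfile.TemporaryDirectory() as tmp:
        for i, local_data in enumerate(worker_data):
            inbox = Path(tmp) / f"to_worker_{i}.json"
            outbox = Path(tmp) / f"from_worker_{i}.json"
            inbox.write_text(json.dumps({"theta": theta}))          # step 1: manager writes
            worker(inbox, outbox, local_data)
            total += json.loads(outbox.read_text())["partial_obj"]  # step 4: manager reads
    return total


obj = manager_round_trip(2.0, [[1.0, 2.0], [3.0, 4.0]])
print(obj)
```

Each worker costs two writes and two reads through the file system per function evaluation, which is the fixed overhead being discussed; a memory-to-memory exchange would skip the files entirely.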

This is disc read/write, so MPI should be much better at it than FPI, since it doesn't have to write all of this to disc (and I assume very rarely does; I believe that MPI is very good at doing all of this in memory).

You're right, there are other points at which nonlinear regression can be parallelized, although NONMEM only does it at the function evaluation level.

WRT any model running faster in parallel than on a single processor: at least with NONMEM, that is not my experience. Again, in my experience the threshold for meaningful gain is a function evaluation time of 500 msec, but I haven't benchmarked it recently, and it may be less now. I suspect you still won't get a 2-minute, 4000-function-evaluation run down to 30 seconds on 4 cores, but I would look forward to learning about other people's experience.


Mark Sale M.D.
Vice President, Modeling and Simulation
Nuventra, Inc.
2525 Meridian Parkway, Suite 280
Research Triangle Park, NC 27713
Office (919)-973-0383

From: Bob Leary <Bob.Leary
Sent: Wednesday, December 9, 2015 2:04 PM
To: Mark Sale; Faelens, Ruben (Belgium); Pavel Belo; nmusers
Subject: RE: [NMusers] setup of parallel processing and supporting software - help wanted


a) I have to disagree with you that the efficiency of an MPI implementation does not depend on the size of the data set for a single desktop SMP machine with multiple processors - larger data sets mean higher granularity and more CPU-bound work between stoppages for communication. This assumes the NLME MPI implementation is done efficiently - I don't know the details of the NONMEM MPI implementation, particularly how communications are handled.

b) Your I/O timings seem horrendously large (if by msec you mean milliseconds). I/O times of 40 milliseconds per function evaluation (assuming 1 function evaluation is a single sweep through all Nsub subjects, evaluating and summing the likelihood contribution from each subject) seem very high. I have been running MPI since its original release in 1994 (I was a member of the committee that designed the first release of MPI during 1992-1994) - these communications timings would seem more appropriate for machines from that era.

I/O timings for MPI are usually modeled by a latency (startup time, typically on the order of 1 microsecond on current single-desktop SMP machines) and a bandwidth (on the order of tens of gigabytes/sec for current-era SMPs, but much lower for clusters). Based on the latency/bandwidth model, the conventional wisdom is to manage the message processing so as to favor a few large messages as opposed to many small messages, to minimize the latency contribution. If possible, small messages should be concatenated into larger messages. I don't know the details of the MPI implementation in NONMEM, but for FOCE NLME algorithms, it is possible to limit the number of messages to just a few per function evaluation. If the data set size is expanded by adding more subjects, then more work (more subjects processed) will be done between stoppages for communication at the function evaluation level.
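
The latency/bandwidth model above can be written out as a short sketch. The function name and the message sizes are made up for illustration; the default latency (1 microsecond) and bandwidth (10 gigabytes/sec) are only the order-of-magnitude figures quoted above, not measurements.

```python
def transfer_time_s(n_messages: int, total_bytes: float,
                    latency_s: float = 1e-6,
                    bandwidth_bytes_per_s: float = 10e9) -> float:
    """Time to move total_bytes split across n_messages.

    Each message pays the startup latency once; the payload as a whole
    is limited by the bandwidth.
    """
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# The same 8000-byte payload sent as 1000 small messages or as 1 large message.
many_small = transfer_time_s(1000, 8000)
one_large = transfer_time_s(1, 8000)
print(f"1000 small messages: {many_small * 1e6:.1f} microseconds")
print(f"1 large message:     {one_large * 1e6:.1f} microseconds")
```

With these assumed numbers the latency term dominates the many-small-messages case, which is the reason for concatenating small messages where possible.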

In the MPI implementation for Phoenix NLME, I find it almost impossible to find a model where I/O dominates to such an extent that the MPI version runs slower than the single-processor version on a 4-processor Intel i7 desktop. For example, I just tested (FOCE) the classic simple closed-form Emax model used in the INSERM estimation method comparison exercise from 2004 (Girard and Mentré, PAGE 2005, abstract 234) with Phoenix NLME. It would be hard to find a simpler model - E = E0 + EMAX*DOSE/(ED50 + DOSE) + EPS, with random effects on each of the three parameters E0, EMAX, and ED50, and three observations per subject. If I expand the data set to around 1600 from the original 100 subjects and run on a four-processor i7, the internally reported CPU time is 18 sec for four processors vs 72 sec for one processor (a speedup of 4). Wall clock times were a few seconds longer for each run. If I make the data set smaller, down to the original size of 100, the speedup clearly decreases, but I still observe a reported CPU time speedup of 2.5x for the four processors (times are well under 1 sec, so reliable wall clock times are not available). (This was done on a relatively old i7 desktop, so more current machines may do better.)

c) It is not always necessary to parallelize over function evaluations (i.e., over subjects). In importance sampling EM methods (IMP in NONMEM, QRPEM in Phoenix NLME), in principle the parallelization can be done over the sample points used in the Monte Carlo or quasi-Monte Carlo integral evaluations - there are usually many more of these than processors available. In PHX QRPEM we actually do it this way and it works fine. Now all processors are working on the same subject at the same time, so load balancing problems tend to go away, but communications overhead increases since now you have to pass separate messages for each subject, whereas in FOCE-like algorithms you only have to pass messages at the end of a sweep through all the subjects. One thing we have noticed is that QRPEM parallelized this way is much more reproducible - single-processor results almost always match multiprocessor results exactly, which is not always the case with some of the FOCE-like methods.
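
Splitting one subject's integral over sample points can be sketched as follows. This is a toy under stated assumptions, not the QRPEM or IMP algorithm: it uses plain Monte Carlo, a made-up one-observation model (y given eta is normal with mean eta and unit variance, eta is standard normal), and Python threads as a stand-in for MPI processes, so it shows the partitioning only and gives no real speedup.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def integrand(eta: float, y: float) -> float:
    """Toy conditional likelihood of observation y given random effect eta."""
    return math.exp(-0.5 * (y - eta) ** 2) / math.sqrt(2.0 * math.pi)


def partial_sum(chunk: list[float], y: float) -> float:
    """Work done by one worker: sum the integrand over its share of sample points."""
    return sum(integrand(eta, y) for eta in chunk)


def marginal_likelihood(y: float, samples: list[float], n_workers: int) -> float:
    """Split the sample points across workers, then combine the partial sums."""
    chunks = [samples[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks, [y] * n_workers))
    return sum(partials) / len(samples)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(4000)]  # sample points for one subject
serial = marginal_likelihood(1.0, samples, 1)
parallel = marginal_likelihood(1.0, samples, 4)
print(serial, parallel)
```

Every worker handles the same subject, so the work divides evenly, but one set of partial sums has to be returned per subject. Because the partial sums are added in a different order, the two results agree only to floating-point rounding.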

Bob Leary
Fellow, Pharsight Corporation

From: owner-nmusers
 of Mark Sale [msale
Sent: Wednesday, December 09, 2015 7:42 AM
To: Faelens, Ruben (Belgium); Pavel Belo; nmusers
Subject: Re: [NMusers] setup of parallel processing and supporting software - help wanted

Maybe a little more clarification:

Thanks to Bob for pointing out that the


option implements some code for load balancing, and there really is no downside, so should probably always be used.

Contrary to other comments, NONMEM 7.3 (and 7.2) does parallelize the covariance step. Ruben is correct that the $TABLE step is not parallelized in 7.

WRT sometimes it works and sometimes it doesn't, we can be more specific than this. The parallelization takes place at the level of the calculation of the objective function. The data are split up and the OBJ calculation for the subsets of the data is sent to multiple processes. When all processes are done, the results are compiled by the manager program. The total round trip time for one process then is the calculation time + I/O time. Without parallelization, there is no I/O time. For each parallel process, the I/O time is essentially fixed (in our benchmarks maybe 20-40 msec per process on a single machine). The variable of interest then is the calculation time. If the calculation time is 1 msec and the I/O time is 20 msec, and you parallelize to 2 cores, you cut the calculation time to 0.5 msec but now have 40 msec (2*20 msec) of I/O time, for a total of 40.5 msec, much slower. If the calculation time is 500 msec and you parallelize to 2 cores, the total time is 250 msec (for calculation) + 2*20 msec (for I/O) = 290 msec. The key parameter then is the time for a single objective function evaluation (not the total run time). If the time for a single function evaluation is > 500 msec, parallelization will be helpful (on a single machine). There really isn't anything very mystical about when it helps and when it doesn't. The efficiency depends very little on the size of the data set, except that the limit of parallelization is the number of subjects (the data set must be split up by subject).
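
The arithmetic in the paragraph above can be reproduced with a small sketch. The function name is made up, and the formula (calculation time divided by the number of cores, plus a fixed I/O cost per process) is simply the model described in that paragraph with the 20 msec figure as a default, not a benchmark.

```python
def time_per_evaluation_msec(calc_msec: float, n_cores: int,
                             io_msec_per_process: float = 20.0) -> float:
    """Time for one objective function evaluation under the simple model above."""
    if n_cores == 1:
        return calc_msec  # no parallelization, so no I/O time
    return calc_msec / n_cores + n_cores * io_msec_per_process


small_model = time_per_evaluation_msec(1.0, 2)    # 0.5 + 2*20 = 40.5 msec
large_model = time_per_evaluation_msec(500.0, 2)  # 250 + 2*20 = 290 msec
print(small_model, large_model)
```

Under this model the 1 msec evaluation becomes roughly 40 times slower on 2 cores, while the 500 msec evaluation drops from 500 to 290 msec.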

Mark Sale M.D.
Vice President, Modeling and Simulation
Nuventra, Inc.
2525 Meridian Parkway, Suite 280
Research Triangle Park, NC 27713
Office (919)-973-0383



Received on Wed Dec 09 2015 - 16:41:47 EST

The NONMEM Users Network is maintained by ICON plc. Requests to subscribe to the network should be sent to:

Once subscribed, you may contribute to the discussion by emailing: