Performance, serial simulations

This page reports performance measurements of Asap on various machines. To make it easy to estimate the running time of your own scripts, all times are given in microseconds per atom per timestep. So if you want to run N timesteps with M atoms, and X is the time listed in the table for your machine and type of simulation, your running time will be approximately X*N*M*1e-6 seconds.
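As a quick illustration of the formula above, here is a minimal Python sketch. The function name is made up for this example and is not part of Asap; the numbers are the Ver300 value for the Nehalem row of the results table and the standard system size used in the tests.

```python
def estimate_runtime_seconds(us_per_atom_step, n_atoms, n_steps):
    """Estimated wall-clock time in seconds: X * N * M * 1e-6."""
    return us_per_atom_step * n_atoms * n_steps * 1e-6


# X = 2.88 microseconds/atom/timestep (Nehalem, Ver300),
# M = 100920 atoms, N = 1000 timesteps.
t = estimate_runtime_seconds(2.88, 100920, 1000)
print(f"Estimated running time: {t:.0f} s")
```

With these inputs the estimate comes out at just under five minutes. Treat it as a rough guide only, since the actual time depends on the load on the machine and on the type of dynamics.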

Performance for parallel simulations is on a separate page.

Computers

Timing has been done on the following computers:

S50
Niflheim S50 node with a 3.2 GHz Intel(R) Pentium(R) 4 CPU. Compiled using the Intel C++ compiler version 9.1.
Opt285
Niflheim Opteron node with two Dual Core AMD Opteron 285 processors. The timing is performed on one CPU with other jobs running on the three other CPUs; the worst number from a couple of runs is used. Actual performance may vary depending on what the other CPUs do. Use this to estimate the run time of your jobs.
Opt285A
Same as Opt285, but with the three other CPUs idle. This is not very realistic for estimating job run times, but gives more reliable results when optimizing the code.
Opt2218A
As Opt285A, but on the nodes with two Dual Core AMD Opteron Processor 2218.
Nehalem
Niflheim node with two quad-core Intel Xeon Nehalem X5570 CPUs running at 2.93 GHz. Timing estimated with the same job running on all eight cores.
NehalemA
As Nehalem, but only one job running and the other seven cores idle.
IBMT30
An IBM Thinkpad T30 Laptop (demokrit) with a 2 GHz Mobile Intel Pentium 4. Compiled using the Intel C++ compiler version 9.1.
Dell5150
A Dell Inspiron 5150 Laptop with a 3.06 GHz Mobile Intel Pentium 4. Compiled using the Intel C++ compiler version 9.1.
IBMT60
An IBM Thinkpad T60 Laptop (demokrit) with a 1.833 GHz Intel Core 2 Duo processor (two cores). Compiled using the Intel C++ compiler version 10.1.

Tests

The following tests were performed. Except where otherwise noted, the simulation was performed on one hundred thousand copper atoms (100920 atoms, to be exact) with periodic boundary conditions at a temperature of 300 K.

Ver300
Verlet dynamics at 300 K. The simplest possible type of simulation.
Ver1000
Verlet dynamics at 1000 K. Testing the temperature dependence of performance (mainly due to more neighbor list updates).
Langevin
Langevin dynamics (NVT ensemble). Testing a slightly more complicated dynamics.
NPT
HooverNPT dynamics (constant stress and temperature). The most complicated dynamics.
FreeBC
As Ver300, but with free boundary conditions. Checking influence of boundary conditions.
Alloy
With a Cu3Ni alloy in the L1_2 structure. Checking multi-element simulations.
Molly
Checking the Molybdenum potential.
Tiny
Checking performance of very small systems (1008 atoms).
L-J
Lennard-Jones potential, with same cutoff as for EMT, and otherwise the same simulation as Ver300.
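The numbers reported on this page are wall-clock times normalised by the number of atoms and the number of timesteps. The following is a minimal sketch of how such a number can be obtained; it is not the actual benchmark script. The helper name and the dummy workload are assumptions made for this example, and in a real measurement `run_steps` would be whatever call advances your simulation by the given number of timesteps.

```python
import time


def microseconds_per_atom_per_step(run_steps, n_atoms, n_steps):
    """Time run_steps(n_steps) and normalise by atoms and timesteps."""
    start = time.perf_counter()
    run_steps(n_steps)
    elapsed = time.perf_counter() - start  # seconds
    return elapsed * 1e6 / (n_atoms * n_steps)


# Hypothetical stand-in workload: touches every "atom" once per step.
# Replace it with the call that runs the real dynamics.
def dummy_run(n_steps, n_atoms=1008):
    positions = [0.0] * n_atoms
    for _ in range(n_steps):
        positions = [x + 1e-3 for x in positions]


result = microseconds_per_atom_per_step(dummy_run, n_atoms=1008, n_steps=50)
print(f"{result:.3f} microseconds/atom/timestep")
```

Running enough timesteps to include several neighbor list updates, and repeating the measurement a few times, gives more representative numbers.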

Results

Running time in microseconds/atom/timestep (- means no value is reported)

Machine   Asap ver.  Compiler  Ver300  Ver1000  Langevin  NPT    FreeBC  Alloy  Molly  Tiny   L-J
S50       2.16.3     ICC 9.1   7.99    9.01     10.58     11.36  7.45    12.43  30.67  7.44   -
Opt285A   2.16.3     Path 2.5  6.23    6.99     7.85      8.13   5.87    8.48   17.64  6.13   -
Opt285A   2.17.9     Path 2.5  6.27    7.06     7.93      8.44   5.92    8.63   17.53  5.18   4.04
Opt285A   3.1.10     Path 3.2  6.39    7.45     7.34      11.84  6.02    8.20   -      6.85   2.47
Opt285    2.16.3     Path 2.5  7.1     8.3      10.1      10.2   7.0     10.6   19.1   6.9    -
Opt2218A  2.17.9     Path 2.5  6.14    6.94     7.69      8.25   5.81    8.61   17.51  5.79   3.94
Nehalem   3.1.10     ICC 11.0  2.88    3.46     3.52      5.39   2.71    4.18   -      2.94   1.51
NehalemA  3.1.10     ICC 11.0  2.82    3.35     3.42      5.25   2.65    3.65   -      2.88   1.43
IBMT30    2.16.3     ICC 9.1   13.93   15.40    18.70     19.72  12.98   20.74  49.89  12.46  -
Dell5150  2.16.3     ICC 9.1   9.43    10.53    12.90     13.28  8.91    14.06  30.83  8.29   -
IBMT60    2.17.9     ICC 10.1  5.96    6.72     7.51      8.06   5.51    6.88   22.54  5.56   5.09

ICC is the Intel C++ Compiler.

Path is the PathScale C++ compiler.

GCC is the GNU Compiler Collection C++ compiler.

Comments

On the multiprocessor Niflheim nodes, performance is somewhat affected by what is running on the other CPUs. If Asap is running on the other CPUs, the performance hit is less than what the table above suggests.

The Verlet algorithm is the simplest. A significant performance hit is seen when using Langevin or NPT dynamics. There is clearly room for improvement, in particular in the case of Langevin, which is not a very complicated algorithm.
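To put numbers on the performance hit, the overhead relative to plain Verlet dynamics can be computed directly from the table. The sketch below does this for two of the rows; the values are copied from the results table, while the dictionary layout and the function name are only chosen for this example.

```python
# Times in microseconds/atom/timestep, copied from the results table.
timings = {
    "Opt285A (Asap 2.17.9)": {"Ver300": 6.27, "Langevin": 7.93, "NPT": 8.44},
    "Nehalem (Asap 3.1.10)": {"Ver300": 2.88, "Langevin": 3.52, "NPT": 5.39},
}


def overhead_percent(row, test, baseline="Ver300"):
    """Extra cost of `test` relative to `baseline`, in percent."""
    return 100.0 * (row[test] / row[baseline] - 1.0)


for machine, row in timings.items():
    for test in ("Langevin", "NPT"):
        extra = overhead_percent(row, test)
        print(f"{machine}: {test} is {extra:.1f}% slower than Ver300")
```

For these two rows, Langevin dynamics costs roughly a quarter more than Verlet, and NPT costs between about a third and almost twice as much extra, depending on the row.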

The boundary conditions have some effect. It is unclear whether free boundary conditions are faster because of the handling of the boundary conditions, or because the atoms near surfaces have fewer interactions.

Alloys are surprisingly slow; this should be investigated.

Molybdenum is much slower. This is expected, as the potential is far more complex.

Very small systems are slightly faster in most rows of the table (the Asap 3.1.10 rows are the exception), presumably because a tiny system can reside entirely in cache. Going to two million atoms gives approximately the same running time per atom as one hundred thousand atoms (not shown in the table), so the code scales as O(N) for large enough systems.

The Lennard-Jones potential clearly has room for improvement.

Asap: Performance (last edited 2010-11-01 11:03:34 by localhost)