This document is aimed at beginner- to intermediate-level scientists in the computational chemistry / physics community. While the focus is on atomistic simulations, many of the concepts can be used in related fields.
For those who are in a hurry, here are the main points:
- Document everything: Make sure everything which may ever be of interest is kept around.
- Automate when possible: It avoids human errors and scales much better than students.
- Work structured: Both data and results have to be well-organised.
- Know your tools: Learn what standard tools can offer and use their features.
- Human time is valuable: Save your own time and the time of your colleagues / senior researchers.
This document describes in detail some methods for doing this.
When investigating a single physical quantity, you may run multiple calculations which elucidate different aspects of the underlying physical system. Since many analysis tools assume that you have the same atoms in all calculations you work with, it is helpful to organise data by physical system. This also keeps related calculations together, for example a minimisation followed by an MD run, the steps of an equilibration procedure, or just equilibration and production runs. Additionally, this means you can treat the folders for a single physical system as a self-contained entity. This enables re-using input files for different runs of the same system and eliminates potential sources of error. An additional benefit is that data can be archived fairly easily: just move the whole folder containing simulation data and post-processing scripts to the archive storage, and everything you may need in the future is at one well-defined location.
For every run (which on this page means a single invocation of a simulation package), keep a separate folder with all output and log files. Input files may be shared in a single top-level folder across independent runs, since this reduces storage requirements and potential sources of errors. The main reason to keep runs in separate folders is that simulations could overwrite files of the same name. While most software packages avoid this by renaming existing files of the same name rather than overwriting them, the actual job script used for the queueing system may behave differently. Moreover, it’s a lot of manual and, hence, error-prone work to restore the previous state of the folder even if the simulation package renames the underlying files.
When archiving simulation data, large text files such as log files can be reduced drastically in size by compression. Since there may be many files in a given simulation folder, and figuring out which ones are compressible and contribute significantly to the overall storage requirements is tedious, you may be tempted to just compress the whole folder into a single archive file. However, this imposes serious limitations on file access later on: not all archive formats allow listing the contents of the archive without reading the whole file. With a slow connection or over a large geographic distance, this can slow down future work significantly. Also, storage savings may be limited for a large portion of the data in the archive. This means that even for formats that support listing contents, data has to be compressed initially and decompressed on access, which generates load on all machines involved with little or no benefit. The only exception is large numbers (more than ~10,000 per directory) of small files which are only relevant for documentation of calculations, but not for post-processing. Such numbers are problematic for backup solutions and generally slow down file transfers, since directory listings get slower and communication overhead increases.
Additionally, keep in mind that some archive formats are not streamable or seekable. Streamable means that the archive file can be written without navigating back and forth in the target file (such navigation is not supported by some file transfer methods, most notably scp, although SFTP is fine). If a format is not streamable, the archive file has to be created on the data source before transfer via network, which requires local storage and human time. If an archive format is seekable, then the extracting code can directly target single files in the archive. Otherwise, the extracting code has to read the whole archive until a specific file the user requested turns up in the output stream. Even if the files extracted on the way to get there are not saved to disk, they have to be extracted, which takes computational effort and human time.
Moreover, accidents happen. While most university or research lab network connections are reasonably fast, they can be flaky at times. If you transfer a large file, the risk of having to start the transfer over is non-zero. Tools like rsync can mitigate that if necessary, but avoidable problems should be avoided.
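As an alternative to packing a whole folder into one archive, the compressible files can be compressed one by one, so that every file stays individually accessible. The following is only a minimal sketch: the list of suffixes and the size threshold are assumptions chosen for illustration and should be adapted to the files your simulation package writes.

```python
import gzip
import shutil
from pathlib import Path

# Assumed values for illustration: which suffixes count as compressible
# text files, and the minimum size worth compressing.
TEXT_SUFFIXES = {".log", ".out", ".txt", ".xyz"}
MIN_SIZE_BYTES = 1024 * 1024


def compress_large_text_files(folder, min_size=MIN_SIZE_BYTES):
    """Gzip each large text file in place and return the new paths."""
    compressed = []
    # sorted() materialises the listing before any file is modified
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        if path.stat().st_size < min_size:
            continue
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # remove the original only after the copy finished
        compressed.append(target)
    return compressed
```

Binary trajectory files are deliberately left alone here, since they rarely shrink much.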
Here is a comparison of popular archive formats and their capabilities:
In many cases, simulation codes offer binary or plain-text output for e.g. trajectory files. When in doubt, use the former. Not only does this obviate the need for compression (binary files can rarely be compressed significantly), but smaller files also mean faster reads from disk, which – given current computational resources – is often the bottleneck. The same argument applies to network transfers of files.
For many file formats, there is another benefit of following this principle: the exact starting point of objects or time series can be calculated upon reading, because binary formats commonly employ fixed-width data types. Compare e.g. a DCD file and an XYZ file. For DCD, given the number of atoms, the byte position of the atom coordinates in the file can be directly calculated for every frame. In XYZ files, the reading code can only calculate the row number (and even this is not guaranteed, for various reasons), but this essentially comes down to finding the n-th newline character in the file, which again requires reading everything up to the frame in question.
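The offset calculation can be sketched as follows. Note that the layout used here is a simplified, hypothetical one (a fixed-size header followed by frames of 4-byte floats) and not the actual DCD format, which has additional header and record fields. The point is only that fixed-width data allows jumping straight to a frame.

```python
import struct

# Simplified, hypothetical binary layout (NOT the real DCD format):
# a fixed-size header, then one frame after another, each frame holding
# n_atoms * 3 coordinates stored as 4-byte little-endian floats.
HEADER_BYTES = 16
BYTES_PER_FLOAT = 4


def frame_offset(frame_index, n_atoms):
    """Byte position at which the given frame starts."""
    frame_bytes = n_atoms * 3 * BYTES_PER_FLOAT
    return HEADER_BYTES + frame_index * frame_bytes


def read_frame(handle, frame_index, n_atoms):
    """Jump directly to one frame and return its coordinates as a flat tuple."""
    handle.seek(frame_offset(frame_index, n_atoms))
    raw = handle.read(n_atoms * 3 * BYTES_PER_FLOAT)
    return struct.unpack(f"<{n_atoms * 3}f", raw)
```

No earlier frame is read or parsed, so the access time does not depend on the frame index.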
Tempora mutantur. Names of simulation folders are likely to change over time. You start having a look at a certain physical system A in folder A, find something interesting and continue to work on derivatives which likely will be called A-something, A-modified and so on. Suddenly, renaming A makes sense for distinguishing the calculations. However, this invalidates all internal documentation written up until then because the referenced folder does not exist any more. Also, restart files or log files of simulation packages often state the folder where the simulation started initially, which is helpful for figuring out which calculation depends on which other one. Sticking to a single simulation name is unrealistic, so it is helpful to have two components in the simulation name: a human-readable one that can change arbitrarily and an immutable machine-readable one that is never changed under any circumstances. Then scripts can identify the individual runs based on their identifier only. It is highly recommended to use randomly generated strings or numbers instead of manually counting the simulations. The reason is collaboration: what happens if your colleague also has a simulation number 42? Or do you remember the last number you generated on a different compute cluster in a different country? Personally, I use md5 hashes of random data, but they may be too long for everybody’s liking, so shorter strings are usually fine. But if they are too short, the probability of collisions (two simulations getting the same identifier) increases quickly.
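A possible way to implement such two-part names is sketched below. The function names, the `label-identifier` layout and the default length of 12 hexadecimal characters are assumptions made for this example. It uses random hexadecimal strings from Python's `secrets` module rather than md5 hashes of random data, which serves the same purpose. The last function estimates the collision probability with the usual birthday approximation.

```python
import math
import secrets


def new_run_id(n_hex=12):
    """Random identifier with n_hex hexadecimal characters (n_hex must be even)."""
    return secrets.token_hex(n_hex // 2)  # token_hex(n) returns 2*n characters


def folder_name(label, run_id):
    """Combine the changeable human-readable label with the fixed identifier."""
    return f"{label}-{run_id}"


def run_id_of(folder):
    """Recover the identifier; hex strings never contain '-', labels may."""
    return folder.rsplit("-", 1)[-1]


def collision_probability(n_runs, n_hex):
    """Approximate chance that at least two of n_runs identifiers coincide."""
    n_possible = 16 ** n_hex
    return -math.expm1(-n_runs * (n_runs - 1) / (2 * n_possible))
```

Renaming `A-<id>` to `A-modified-<id>` leaves `run_id_of` unaffected, so scripts keyed on the identifier keep working.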
Storage is cheap; recalculating something or trying really hard to remember old data is expensive. It’s always better to keep as much information as possible. Of course, this does not mean that all files have to be kept around indefinitely and unmodified (compression and archiving are good habits), but even queueing system log files can become important at some point in the future. Even calculations that failed are worth keeping: in a year’s time you may remember that the simulation failed but maybe the keyword in the input file that made it fail has slipped your mind.
Storage systems that are designed for (cost-effective) long-term archiving are called cold storage. Think of them as an external hard drive at home. You can access it, but it will take a long time to start retrieving data, because you need to get home first and attach the hard drive. Ideally, any work that is considered to be finished should be archived on cold storage, if available, and backed up. However, data on these storage facilities should not be changed any more (or at least very, very rarely) for backup and access efficiency reasons. A good rule of thumb is: if you have not touched a simulation folder for half a year and this is unlikely to change, it’s time to move it to cold storage. This also helps with documenting the work done and makes it possible to reuse or reproduce earlier results.
This is a tough one, since many facilities do not offer a large backup space. On compute clusters, efficient file systems present a particular problem, because so many CPU cores can issue write requests at the same time. Usually, clusters split their file systems into two parts: a home directory that has a backup and a work or scratch directory that is without backup. The scratch file system typically is substantially larger and offers better write performance. Data on file systems that have a backup should be changed as rarely as possible to reduce load on the backup systems. If data is changed (or moved to a new location, for that matter) only rarely, backups can go farther back in history because less backup space is required. Since you may not know when a particular backup is performed, please compress any data before moving it to a backed-up file system if you plan to keep it compressed for an extended period of time. Following the same logic, please copy data to a scratch directory first before uncompressing (or uncompress directly to the other file system).
Before moving data to backup file systems, please remove unnecessary files from the folders. For example, application crashes can lead to large core dump files, which in most cases have no further value (if they do, you likely know how to use them and how long to keep them). Other cases may involve (depending on your calculation) wavefunction restart files, swap memory files or similar data of an ephemeral nature.
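Finding such files can be automated. The sketch below only lists candidates and deliberately does not delete anything. The file name patterns are assumptions for illustration, since the names of core dumps and temporary files depend on the operating system and the simulation package.

```python
import fnmatch
from pathlib import Path

# Hypothetical patterns; adjust them to what your simulation package
# and cluster actually produce.
EPHEMERAL_PATTERNS = ["core", "core.[0-9]*", "*.swp", "*.tmp"]


def cleanup_candidates(folder, patterns=EPHEMERAL_PATTERNS):
    """Return (path, size in bytes) of matching files, largest first."""
    hits = []
    for path in Path(folder).rglob("*"):
        if path.is_file() and any(fnmatch.fnmatch(path.name, p) for p in patterns):
            hits.append((path, path.stat().st_size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Reviewing the returned list by hand before removing anything keeps the final decision with a human.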
When submitting a calculation job to a compute cluster, you typically have to specify the tasks in a form similar to a shell script. In most environments, you cannot alter the contents of the job script once submitted, because the file has been copied upon submission. When the queue times (the time a job waits between submission and execution) are long and the jobs are of a debugging nature, it may make more efficient use of the cluster to have the job script reference an external script with the commands instead of running a single job script only. This allows you to reuse existing queued jobs rather than queueing and cancelling jobs that have become irrelevant or that contain a simple typo.
Queueing systems work more efficiently if they know as much as possible about your jobs. Specify limits like memory requirements or walltime requirements reasonably tight, but not too tight: if you request too many resources (“too many” in the sense that the same job would have run successfully with fewer resources), the job may wait longer than required. Also, the queueing system stacks other jobs in order to minimise gaps. This mechanism works best if the requested walltime, which is the only information the queueing system has about the expected runtime of a job, is as accurate as possible. Of course, if the requested walltime is too short, the job will be aborted prematurely. Depending on the nature of the calculations, this may come at a serious expense. In general, overestimation of walltimes is better than underestimation.
If your cluster admin staff is happy with this, there is another approach to improve throughput by minimising the delay between consecutive runs of the same simulation: self-submitting job scripts. You can create a job script that makes sure to terminate the actual calculation just minutes before the walltime limit is reached and automatically prepares and submits the next run. This can be particularly helpful for long molecular dynamics runs. However, please make sure to include a safety switch in case there is an error in the actual calculation. Otherwise, the script will re-submit itself over and over again just to see all runs terminate immediately. Please be very careful with this concept and consult with the cluster admins whether this is acceptable if you plan to use it. An alternative way of achieving the same result with standard tools is to prepare job arrays where prepared jobs depend on each other such that subsequent runs will only be scheduled once the previous ones have been completed. Job arrays also enable much more complex rules including alternative calculations should one of the runs fail. These features are generally well documented, but are not supported by all clusters. Please refer to the documentation for your cluster for details.
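One possible form of such a safety switch is sketched below. The function name, its parameters and the ten-minute default are assumptions for this example. The actual submission command is cluster-specific and is therefore left out. The idea is to refuse re-submission when the chain has reached a fixed length, when the previous run reported an error, or when it finished suspiciously quickly.

```python
def should_resubmit(runs_done, max_runs, last_exit_code, last_runtime_s,
                    min_runtime_s=600):
    """Decide whether a self-submitting job script may queue its successor."""
    if runs_done >= max_runs:
        return False  # hard cap on the length of the chain
    if last_exit_code != 0:
        return False  # the previous run reported an error
    if last_runtime_s < min_runtime_s:
        return False  # the run ended too quickly to have done real work
    return True
```

The runtime check catches the failure mode described above, where every run terminates immediately even though the program may still exit with code 0.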
For most applications, choosing appropriate job sizes and parallelisation schemes can have a significant impact on the computational efficiency and, therefore, on the costs. If possible, try to allocate complete nodes (physical machines) rather than individual cores, because this reduces the risk of an uneven load on a single physical host, which can lead to less computational throughput. Consult benchmarks which have been published for the combination of software package and cluster that you are using. If none are available and your jobs are expensive enough, conduct your own timing experiments before spending months of wall time on a simulation. If you allocate multiple nodes, try to stick to powers of two in node or core count. Many algorithms (e.g. FFT) and communication schemes (e.g. MPI Gather) are most efficient if the number of cores is a power of two.
Do scaling tests: measure the duration of a small calculation (a few tens of minutes) for various core counts and check whether adding resources actually speeds up the calculation. For most parallel algorithms, the total required time per calculation task will decrease with additional resources at first, but this trend eventually flattens out. This means that adding more resources to the calculation only increases the computational cost, but does not accelerate the calculation. Often, adding even more resources will even slow the calculation down, since the communication between all the computers involved takes more and more time, thus outweighing the benefits of added resources.
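The evaluation of such a scaling test can be sketched as follows. Speedup and parallel efficiency are computed relative to the smallest core count measured. The efficiency threshold of 0.7 is an arbitrary assumption for this example, not a general recommendation.

```python
def scaling_table(timings):
    """timings maps core count -> wall time in seconds.

    Returns a list of (cores, speedup, efficiency), both relative to the
    smallest core count that was measured.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


def largest_efficient_core_count(timings, min_efficiency=0.7):
    """Largest measured core count that still reaches min_efficiency (<= 1)."""
    return max(cores for cores, _, eff in scaling_table(timings)
               if eff >= min_efficiency)
```

An efficiency that drops well below 1 marks the point where additional cores mostly add cost.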
If you have the choice for the output files, prefer fewer larger files over plenty of small files, because backup software and file systems operate faster on large files. The same applies to file transfers, which are more efficient for large files, since the total overhead for initiating a new transfer per file is smaller. Do not keep large intermediate files around if they can be regenerated quickly, but keep the scripts that generated them. In this case, it is advisable to place the generating scripts in the same folder as the raw data to ensure they match. Wherever possible, prefer simple or well-established file formats over complicated ones or those that are rarely used. Otherwise, it becomes harder for future scientists who try to re-use the raw data to find tools that can read it.
Often, the queueing systems allow only a certain number of jobs to be placed in the queue. If this limit is defined by the queueing administrators, feel free to use it completely if you have enough jobs to run. If problems arise, the admins will decrease the limit. If there is no limit, please be considerate. On most clusters, a few tens of queued jobs are likely to be deemed acceptable. Enqueue jobs as soon as they are ready to run, since the queueing system can stack them more efficiently the more jobs are in the queue.
Generally, it is helpful to separate preparation runs (e.g. the structure minimisation prior to a molecular dynamics calculation) from continuous time series (e.g. the actual molecular dynamics run). Since the input files are slightly different for many program packages, this makes it clear which data files belong to which process. Also, a separate job in the queue gives you the opportunity to check whether the preparation step was successful and went as expected.
When working with multiple machines, you often switch between them using SSH. Every time, you will be prompted for the password of the account, and you need to remember the user names, passwords and hostnames for each of the individual machines or clusters. SSH has two features which make your life significantly easier: key-based login and a configuration file.
Key-based login is a method where your password is replaced by two files: a public key and a private key. These two keys are tied together and can only be used together. The public key is exactly that: you can give it to anybody (similar to your full name in real life). The private key is basically a strong password (similar e.g. to the chip in your ID card). Now what happens is that you log in first to the remote machine using your regular password. Then you tell it to accept this public key in the future (it will still accept your password). Once you log out, you are able to log in using the private key instead. Since your SSH client will do that automatically, you do not have to use a password for any SSH connection afterwards. Please make sure that nobody can access your private key. Anybody who can would be able to impersonate you without your knowledge. Use one pair of public/private keys for each machine you want to connect from. This allows you to revoke access in case your notebook gets stolen or lost. How to set up key-based login is well documented on the web and depends on the operating system, so it won’t be covered here. If unsure whether the relevant regulations allow this kind of login (all of the machines I’ve seen do), please ask staff whether they are fine with ‘key-based SSH logins’.
A configuration file allows you to replace commands like
$ ssh firstname.lastname@example.org
with
$ ssh somename
Together with the key-based login, this accelerates switching between different hosts quite a bit. Again, how to do this is covered in plenty of tutorials on the web.
Over the past decades, transfer of information has been a severe bottleneck. So the more reliable method is to move code to the data it works on rather than the other way around. Of course, this only is helpful once the data set is large enough, but it helps to use the same workflow for all simulations. You can run commands remotely, e.g. via ssh, just as you would run them on your local machine:
$ ssh user@hostname 'cd /some/data/directory; ./analysis.py'
To make your life easier, set up key-based login for remote SSH hosts (see above).
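If the remote analysis is started from a local script, the same call can be wrapped as sketched below. The function names are assumptions for this example, and it presumes that an `ssh` client is installed and key-based login is configured. Unlike the command line above, the two remote commands are joined with `&&`, so the analysis does not start if changing the directory fails.

```python
import shlex
import subprocess


def remote_argv(host, directory, command):
    """Argument list for running `command` inside `directory` on `host`."""
    remote = f"cd {shlex.quote(directory)} && {command}"
    return ["ssh", host, remote]


def run_remote(host, directory, command):
    """Run the command on the remote host and return its standard output."""
    result = subprocess.run(remote_argv(host, directory, command),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

With `check=True`, a failing remote command raises an exception instead of passing unnoticed.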
Do the analysis with scripts and not manually or via the command line. The main reason for this is reproducibility. A nice side benefit is that once you have another calculation like this one, or once you get more data from extending a calculation, you can just re-run the script. Moreover, scripting allows you to prevent human errors and enables you to write checks against the data. E.g. once you do a molecular dynamics calculation, you could make the analysis script plot temperature and/or pressure over time. If anything goes wrong, you will see this immediately. If you were to do this manually, it would become tedious very quickly.
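In addition to plotting, a check against the data can be as simple as the following sketch, which compares the mean temperature after an initial equilibration part with the target value. The function name, the skipped fraction and the use of a plain list of temperatures are assumptions for this example. How the temperatures are read depends on the simulation package.

```python
import statistics


def check_temperature(temperatures, target, tolerance, skip_fraction=0.2):
    """Return the mean temperature after skipping the first part of the run.

    Raises ValueError if the mean deviates from the target by more than
    the tolerance (all values in the same unit, e.g. Kelvin).
    """
    start = int(len(temperatures) * skip_fraction)
    production = temperatures[start:]
    if not production:
        raise ValueError("no temperature samples to check")
    mean = statistics.fmean(production)
    if abs(mean - target) > tolerance:
        raise ValueError(
            f"mean temperature {mean:.1f} deviates from target {target:.1f}")
    return mean
```

Calling such a check at the start of every analysis script makes a broken run stop the analysis instead of silently producing figures.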
Keep analysis scripts in version control and make them print the revision that is currently used. This enables you (or your fellow co-worker) to figure out which analysis scripts have to be re-run after you discover a bug several months later. Keep the checked-out analysis scripts as a copy in the simulation directory such that you have the direct link between them in case the repository becomes inaccessible later on. Remember that each simulation directory should be sufficient to obtain all intermediate data from the raw simulation results.
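Printing the revision could look like the sketch below. It assumes that the scripts are kept in a git repository and that the `git` command is available. Other version control systems need different commands. If the revision cannot be determined, the function returns "unknown" rather than failing.

```python
import subprocess


def script_revision(repo_dir="."):
    """Short git revision of repo_dir, with '-dirty' for uncommitted changes."""
    try:
        revision = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout.strip()
        changes = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"  # git missing, or directory is not a repository
    return revision + ("-dirty" if changes else "")


if __name__ == "__main__":
    print(f"analysis scripts at revision {script_revision()}")
```

The "-dirty" suffix flags results produced with uncommitted changes, which cannot be traced back to a revision later.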
If you prepare slides, figures or tables, link the unique identifier of the simulation bucket to them. If you combine results from multiple buckets, build up a list of all the simulations that have been included. This allows you to go back later and answer questions like ‘In that calculation you did there: what was the cutoff for xy?’ without spending hours matching values from the graphs to something on the disk. While you can safely omit them in publications, it’s a good idea to include them in reports/theses and internal presentations. (Please shorten the unique identifier to maybe seven characters or so for readability.)
Simulations often produce a large set of different files which have to be combined in a certain way to allow meaningful analysis. Do this merging process once and make all analysis scripts read the resulting database. This is helpful once the set of raw information that needs to be combined changes or has to be updated. Also, it saves you work, since you do not have to duplicate the merging code for each quantity you want to analyse. Software packages like MATLAB, Mathematica, Jupyter Notebook and the pandas ecosystem can be helpful here. It does not matter which one you use (I suggest pandas, though), but work will be more efficient and reliable once you use any of them. Keep derived data in standard formats (e.g. pickle for pandas or plain text for maximum flexibility) that are likely to be easy to use for future researchers.
While academia does not feel like a place where everything is governed by the cost of operations, it is worth designing your work such that there is a good balance between the resources you use. Most importantly: remember human time is valuable too. When asking colleagues or senior researchers for help (which is a good idea), make sure you have all figures and relevant information at hand. Waiting a minute for an analysis script to generate a figure you want to discuss feels fast if you are doing it, but is highly inefficient if you have some colleague waiting as well. Usually, it’s a good idea to prepare a few slides with the contents you want to discuss. This is for two reasons: first, it forces you to put your thoughts in order (and helps you to identify flaws) and secondly, it makes it faster to go through during the discussion. The slides don’t have to be pretty, but they have to be legible.
When organising your work or writing code, keep in mind that often acceptable code is good, good code is unnecessary and perfect code is impossible. Don’t spend time on finding the most elegant solution to a problem unless it really, really saves time (usually, even if you think it saves time, it actually does not – sadly). Clean and documented code is better than complicated solutions. If you document, add information on what your code is doing, not how it is done. The latter can be inferred from the code, but the former is much harder to deduce if you don’t have at least a clue. Don’t write libraries if there is already one to do the job (usually, there is – ask around).
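A minimal sketch of such a merging step, using only the Python standard library and plain-text CSV files, is given below. The layout is an assumption for this example: one folder per run below a common root, each holding a CSV file of the same name with identical columns. The file names `energies.csv` and `merged.csv` are hypothetical.

```python
import csv
from pathlib import Path


def merge_runs(root, filename="energies.csv", output="merged.csv"):
    """Combine one CSV file per run folder into a single table.

    Every row gets an extra 'run' column holding the name of the folder
    it came from. All input files are assumed to share the same columns.
    """
    rows = []
    for path in sorted(Path(root).glob(f"*/{filename}")):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                row["run"] = path.parent.name
                rows.append(row)
    if not rows:
        raise ValueError(f"no {filename} files found below {root}")
    target = Path(root) / output
    with open(target, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return target
```

The resulting file is plain text and can be loaded by all analysis scripts, for instance with `pandas.read_csv`.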
If you decide to write a tool for a certain task, stick to the UNIX philosophy of ‘one task – one tool’. Combining several features into one code requires substantial maintenance and testing which should not be underestimated.
When thinking about your schedule on a daily basis, optimize for throughput, not for deadlines. This means that you should try to keep as many resources busy as possible at a given time. If you have access to a cluster to do calculations, make sure this cluster always has to do something. If somebody is waiting on feedback from you to continue research, make sure this answer gets there quickly. This procedure does not necessarily mean that you have to switch tasks often (which you should avoid): e.g. you could check the cluster as soon as you get into the office, do emails in the morning, after lunch and before leaving the office.
In the context of keeping the cluster busy, keep in mind that computational resources do not only cost internal budgeting units but actual money. Estimate the cost of calculations and treat them accordingly. As a rule of thumb: using one compute core for one day is about 0.5 EUR/USD/GBP.
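The rule of thumb translates into a one-line estimate, sketched here with the 0.5 currency units per core-day from above as the default. The actual rate of your cluster may differ, and the function name is an assumption for this example.

```python
def estimated_cost(cores, wall_hours, rate_per_core_day=0.5):
    """Rough cost of a job in currency units, from cores and wall time."""
    core_days = cores * wall_hours / 24
    return core_days * rate_per_core_day
```

For example, a job on 128 cores running for 48 hours uses 256 core-days, which this estimate puts at 128 currency units.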
Manual processes are almost never scalable. If you have the option, go for the method that can be automated. Only then is applying ideas to large problem sets feasible.
Use version control for software and reports. It is an integral part of documenting your work and helps ensure continuity and quality of development. Otherwise, you will quickly have documents with names like ‘report_final_v2_updated_submission.tex’.
Finally, when asking a question, describe what you want to accomplish first, not the detailed question you have (otherwise, you may run into what is called the XY problem). Speak about the tools you are using and why you are using them. Of course science is important and the goal we want to achieve, but tools are the vehicles that get you there. Somebody in the group may know the right tool for the job. Asking around takes three minutes, but doing stuff with the wrong tool can take days.