The truth is rarely pure and never simple

Parsing CASTEP Formatted Density Header Grid

When parsing a CASTEP formatted grid with python, you may be tempted to use numpy.loadtxt – in particular, as it allows for fast access to the numerical data afterwards. However, you need to know the number of grid points in each dimension in order to work out the proper arguments for numpy.reshape. I my office, ridiculous ideas how to get this information from the comments of the file header emerged quite quickly. Here’s the associated timing along with the disappointing result that the simplest solution actually is the fastest. (Tests on Late-2013 Mac with a 2.4 GHz i5 and SSD)

import numpy as np
import subprocess as sp
import timeit

inPath = '2x2x2_surf_0_b-velec.dat'

def method_grep():
	nLines = sp.check_output(['grep', '-A1', '-m', '2', 'grid size', inPath]).split('\n')[1].split()
	return nLines[1], nLines[3], nLines[5]

def method_unique():
	return tuple(np.unique(data[:, _]).shape[0] for _ in range(3))

def method_find():
	Nz = data.shape[0]/len(np.where(data[:, 0] == data[0, 0])[0])
	Ny = len(np.where(data[::Nz, 1]==data[0, 1])[0])
	Nx = data.shape[0]/Ny/Nz
	return Nx, Ny, Nz

def method_file():
	nextLine = False
	with open(inPath) as fh:
		while nextLine is False:
			line = next(fh)
			if 'grid size' in line:
				nextLine = True
		line = next(fh)
		return map(int, line.split()[1::2]) 

data = np.loadtxt(inPath)
print 'grep   ', timeit.timeit(";method_grep()";, setup=";from __main__ import method_grep";, number=10)/10
print 'unique ', timeit.timeit(";method_unique()";, setup=";from __main__ import method_unique,data";, number=10)/10
print 'where  ', timeit.timeit(";method_find()";, setup=";from __main__ import method_find,data";, number=10)/10
print 'file   ', timeit.timeit(";method_file()";, setup=";from __main__ import method_file";, number=10)/10

Results (time per run in seconds):

grep    0.308823895454
unique  0.0853920936584
where   0.00581040382385
file    3.71932983398e-05

Leave a comment

Your email address will not be published.