This content has been updated on 04/09/2014. Please note that names may have changed in newer releases of ipython. For more on computational physics, have a look at http://guido.vonrudorff.de.

Reasoning

Ipython parallel is a method to transport your code to your data and to work remotely on multiple machines at once. This is particularly helpful if the tasks you want to perform on the raw data are embarrassingly parallel.

In this document, I focus on applying ipython parallel to the case where you have plenty of data to analyse: much more than you want to move around.

Concepts

Ipython parallel comes with three components; understanding their different roles and the way they communicate is crucial.

  • Client This is the machine you interact with. It runs the ipython notebook webserver. You may only have one of these.
  • Engine This is where the analysis code you wrote actually gets executed. You should have several of them (although one will be fine, as well).
  • Controller This is the communication center and runs the dispatcher that hands out chunks of work. It has to be reachable via the network by all engines and by the client. You may only have one controller, but it can be on the same machine as the client.

Preparation

Although it is not a strict requirement, it is sensible to have identical versions of ipython running on all computers involved in the setup, regardless of whether they run a client, an engine or a controller. As the communication methods change with new releases of ipython, your setup is likely to fail when connecting if the versions do not match.
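
A quick way to compare the versions is to run the following command on every machine involved:

$ ipython --version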

I suggest using a fresh ipython virtualenv for your setup on all machines, as no system-wide update can break this virtualenv. Creating a virtualenv is beyond the scope of this document, but be warned that you need sqlite3 support compiled into the python binary. For ipython notebook to work, you will have to install the following packages via easy_install (the commands are sketched after the list):

  • tornado (web server)
  • pyzmq (python bindings to communication library zeromq)
  • matplotlib (optional, for plotting)
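
With the virtualenv in place (the path below is a placeholder), the installation boils down to:

$ source /path/to/virtualenv/bin/activate
$ easy_install tornado
$ easy_install pyzmq
$ easy_install matplotlib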

Please be aware that easy_install can only install dependencies that are themselves python packages. Dependencies that have to be installed using the package manager of your distribution are not covered, and in most cases there is not even a clear hint about what is missing. For matplotlib, you need libfreetype-dev and libpng-dev on top of the default debian packages including build-essential.

Connection

Imagine you have several computers with access to a shared file system. They will host the engines, while the controller and the client will share a common host. The former machines will be called nodes in the following, while I will refer to the latter machine as the server.

On the server, create a new profile with the name nbserver with

server$ ipython profile create nbserver 

Make sure that you use the ipython commands from the virtualenv, if applicable. The profile name is arbitrary (as long as you stay away from default)---just be consistent. The command will create a new subdirectory ~/.ipython/profile_nbserver, the profile directory. Now create a folder for the ipython notebook files and edit ~/.ipython/profile_nbserver/ipython_config.py, which may not exist yet. The content should read

c = get_config()
c.IPKernelApp.matplotlib = 'inline'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
c.NotebookApp.notebook_dir = u'/path/to/your/notebook/dir'

Please be warned that this setup is insecure, as the communication between nodes and the server is done in plain text only.

Now you can start the controller on the server

server$ ipcontroller --profile=nbserver

This will create two files in your profile directory: security/ipcontroller-client.json and security/ipcontroller-engine.json. The second one has to be copied to all nodes so that they know which controller to connect to; this small file also contains the authentication information. Create a profile on the nodes, as well.

nodes$ ipython profile create nbnode
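
You can transfer the engine file in any way you like; one possible sketch with scp (the hostname node01 is a placeholder for your nodes) is

server$ scp ~/.ipython/profile_nbserver/security/ipcontroller-engine.json node01:.ipython/profile_nbnode/security/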

After copying the ipcontroller-engine.json file to ~/.ipython/profile_nbnode/security/, you can start the ipython engines.

nodes$ ipcluster engines --profile=nbnode --n=2

This will start two processes on the node, which will automatically connect to the controller. If you now start the ipython notebook you may already know, you can work on these nodes.

server$ ipython notebook

Working in Parallel

First, you have to load the relevant parts of ipython and create a client object. This client object offers access to all running engines.

In [2]:
import IPython.parallel as ipp
c = ipp.Client()

You can get a list of the connected engines by checking their ids. The controller assigns consecutive numbers to the engines; numbers of stopped engines are not reused.

In [3]:
print c.ids
[21, 22, 23, 24]

Before we can start running tasks in parallel, we need to create a view. A view is a subset of the engines of a client object together with a parallelization strategy. In ipython parallel, there are two of them: DirectView and LoadBalancedView. The first one takes all function calls you schedule and distributes them evenly among the engines. The second one keeps a queue and sends the next task (a function call) to the first engine that becomes available again. The overhead of the second approach pays off once the function calls are expected to differ strongly in execution time.

In [4]:
dview = c[:]
lview = c.load_balanced_view()
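
As a side note, a view does not have to span all engines: slicing the client object gives a DirectView on a subset (the slice below is arbitrary and assumes at least two engines are connected).

subset = c[:2]         # DirectView on the first two engines only
print subset.targets   # engine ids covered by this view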

Now we can define a simple function and evaluate it on each engine. Technically speaking, an engine is a plain python instance on the remote machine (not an ipython instance).

In [5]:
def get_hostname():
    import socket
    return socket.gethostname()

print dview.apply_sync(get_hostname)
['seb11', 'seb11', 'seb08', 'seb08']

As the namespace does not get exported implicitly, you have to import socket on each engine, so loading the module is part of the function itself. If you want to pass parameters, you can do so by appending them to the apply_sync call.

In [7]:
def get_average(numbers):
    return sum(numbers)/float(len(numbers))

print dview.apply_sync(get_average, [2, 3, 5, 7, 9])
[5.2, 5.2, 5.2, 5.2]

Of course, the point of doing things in parallel is to do them on different input data. There is a convenient way to do so that is very similar to the regular map function.

In [9]:
def get_squared(number):
    return number*number

print dview.map(get_squared, [2, 3, 5, 7, 9], block=True)
[4, 9, 25, 49, 81]

The keyword parameter block tells ipython to wait until the engines have completed their work. While apply_sync inherently synchronizes the results, DirectView.map has to be told to do so. This allows for asynchronous analysis and intermediate visualisation.
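
As a small sketch of the asynchronous route (the input list is arbitrary), the same map can be dispatched through the load balanced view without blocking, and the returned object can be queried later:

async_result = lview.map_async(get_squared, range(100))
# ... do other work while the engines are busy ...
print async_result.ready()    # False while tasks are still running
print async_result.get()[:5]  # get() blocks until all results are in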

If you want to move data from the server to the nodes or vice versa, you can use DirectView.push and DirectView.pull.

In [17]:
def increment():
    global counter
    counter += 1
    return counter

i = 0
dview.push(dict(counter=i))
print dview.apply_sync(increment)
print i, dview.pull('counter', block=True)
[1, 1, 1, 1]
0 [1, 1, 1, 1]

Potential error messages

There are many ways the procedure above can fail. However, most of these cases do not come with a clear error message that helps you find the right starting point for fixing the underlying problem. Here is an incomplete list of potential messages and their solutions, as far as they worked for me.

ImportError: No module named testing

This error occurs if you call a recent version of ipcontroller with python2.6. Use a virtualenv with a newer python and activate it with

$ source /path/to/virtualenv/bin/activate

fatal error: png.h: No such file or directory

Your machine is missing the development package of libpng. Please install it using the package manager of your distribution.
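
On Debian or Ubuntu, for example, this amounts to something like (package names may differ between releases):

$ sudo apt-get install libpng-dev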