"MapReduce-MPI WWW Site"_mws - "MapReduce-MPI Documentation"_md :c

:link(mws,http://mapreduce.sandia.gov)
:link(md,Manual.html)

:line

Python interface to the MapReduce-MPI Library :h3

A Python wrapper for the MR-MPI library is included in the
distribution.  The advantage of using Python is how concise the
language is, enabling rapid development and debugging of MapReduce
programs.  The disadvantage is speed, since Python is slower than a
compiled language.  Using the MR-MPI library from Python incurs two
additional overheads, discussed in the "Technical
Details"_Technical.html section.

Before using MR-MPI from a Python script, you need to do two things.
You need to build MR-MPI as a dynamic shared library, so it can be
loaded by Python.  And you need to tell Python how to find the library
and the Python wrapper file python/mrmpi.py.  Both these steps are
discussed below.  If you wish to run MR-MPI in parallel from Python,
you also need to extend your Python with MPI.  This is also discussed
below.

The Python wrapper for MR-MPI uses the amazing and magical (to me)
"ctypes" package in Python, which auto-generates the interface code
needed between Python and a set of C interface routines for a library.
Ctypes is part of standard Python for versions 2.5 and later.  You can
check which version of Python you have installed, by simply typing
"python" at a shell prompt.
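
The same check can be made from inside Python.  The following is a
minimal sketch, not part of the MR-MPI distribution; the function name
ctypes_status is made up for this illustration.  It reports the version
of the running interpreter and whether ctypes can be imported.

```python
import sys


def ctypes_status():
    """Return (version_string, has_ctypes) for the running interpreter."""
    version = "%d.%d.%d" % sys.version_info[:3]
    try:
        import ctypes  # noqa: F401  (standard since Python 2.5)
        has_ctypes = True
    except ImportError:
        has_ctypes = False
    return version, has_ctypes


version, has_ctypes = ctypes_status()
print("Python %s, ctypes available: %s" % (version, has_ctypes))
```

If the second value printed is False, the Python wrapper for MR-MPI
will not be able to load the library.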

The following sub-sections cover the rest of the Python discussion:

"Building MR-MPI as a shared library"_#py_1
"Installing the Python wrapper into Python"_#py_2
"Extending Python with MPI to run in parallel"_#py_3
"Testing the Python/MR-MPI interface"_#py_4
"Using the MR-MPI library from Python"_#py_5 :ul

:line
:line

[Building MR-MPI as a shared library] :link(py_1)

Instructions on how to build MR-MPI as a shared library are given in
the "Start section"_Start.html. A shared library is one that is
dynamically loadable, which is what Python requires.  On Linux this is
a library file that ends in ".so", not ".a".

From the src directory, type

make -f Makefile.shlib foo :pre

where foo is the machine target name, such as linux or g++ or serial.
This should create the file libmrmpi_foo.so in the src directory, as
well as a soft link libmrmpi.so, which is what the Python wrapper will
load by default.  Note that if you are building multiple machine
versions of the shared library, the soft link is always set to the
most recently built version.

If this fails, see the "Start section"_Start.html for more details.

:line

[Installing the Python wrapper into Python] :link(py_2)

For Python to invoke MR-MPI, there are 2 files it needs to know about:

python/mrmpi.py
src/libmrmpi.so :ul

Mrmpi.py is the Python wrapper on the MR-MPI library interface.
Libmrmpi.so is the shared MR-MPI library that Python loads, as
described above.

You can ensure Python can find these files in one of two ways:

set two environment variables
run the python/install.py script :ul

If you set the paths to these files as environment variables, you only
have to do it once.  For the csh or tcsh shells, add something like
this to your ~/.cshrc file, one line for each of the two files:

setenv PYTHONPATH ${PYTHONPATH}:/home/sjplimp/mrmpi/python
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/home/sjplimp/mrmpi/src :pre
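
To confirm that the two variables point where you expect, a small
script along these lines can help.  It is only a sketch: the helper
name find_on_path is hypothetical, and it inspects the environment
variables alone, not the system default directories that the loader
and Python also search.

```python
import os


def find_on_path(var_name, filename, environ=None):
    """Return the first directory listed in the environment variable
    var_name that contains filename, or None if no directory does."""
    environ = os.environ if environ is None else environ
    for directory in environ.get(var_name, "").split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


print("mrmpi.py found in: %s" % find_on_path("PYTHONPATH", "mrmpi.py"))
print("libmrmpi.so found in: %s"
      % find_on_path("LD_LIBRARY_PATH", "libmrmpi.so"))
```

A result of None for either file means the corresponding variable does
not list a directory containing it.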

If you use the python/install.py script, you need to invoke it every
time you rebuild MR-MPI (as a shared library) or make changes to the
python/mrmpi.py file.

You can invoke install.py from the python directory as

% python install.py \[libdir\] \[pydir\] :pre

The optional libdir is where to copy the MR-MPI shared library to; the
default is /usr/local/lib.  The optional pydir is where to copy the
mrmpi.py file to; the default is the site-packages directory of the
version of Python that is running the install script.

Note that libdir must be a location that is in your default
LD_LIBRARY_PATH, like /usr/local/lib or /usr/lib.  And pydir must be a
location that Python looks in by default for imported modules, like
its site-packages dir.  If you want to copy these files to
non-standard locations, such as within your own user space, you will
need to set your PYTHONPATH and LD_LIBRARY_PATH environment variables
accordingly, as above.

If the install.py script does not allow you to copy files into system
directories, prefix the python command with "sudo".  If you do this,
make sure that the Python that root runs is the same as the Python you
run.  E.g. you may need to do something like

% sudo /usr/local/bin/python install.py \[libdir\] \[pydir\] :pre

You can also invoke install.py from the make command in the src
directory as

% make install-python :pre

In this mode you cannot append optional arguments.  Again, you may
need to prefix this with "sudo", in which case you also cannot control
which Python is invoked by root.

Note that if you want Python to be able to load different versions of
the MR-MPI shared library (see "this section"_#py_5 below), you will
need to manually copy files like libmrmpi_g++.so into the appropriate
system directory.  This is not needed if you set the LD_LIBRARY_PATH
environment variable as described above.

:line

[Extending Python with MPI to run in parallel] :link(py_3)

If you wish to run MR-MPI in parallel from Python, you need to extend
your Python with an interface to MPI.  This also allows you to make
MPI calls directly from Python in your script, if you desire.

There are several Python packages available that purport to wrap MPI
as a library and allow MPI functions to be called from Python.

These include

"pyMPI"_http://pympi.sourceforge.net/
"maroonmpi"_http://code.google.com/p/maroonmpi/
"mpi4py"_http://code.google.com/p/mpi4py/
"myMPI"_http://nbcr.sdsc.edu/forum/viewtopic.php?t=89&sid=c997fefc3933bd66204875b436940f16
"Pypar"_http://code.google.com/p/pypar :ul

All of these except pyMPI work by wrapping the MPI library and
exposing (some portion of) its interface to your Python script.  This
means Python cannot be used interactively in parallel, since they do
not address the issue of interactive input to multiple instances of
Python running on different processors.  The one exception is pyMPI,
which alters the Python interpreter to address this issue, and (I
believe) creates a new alternate executable (in place of "python"
itself) as a result.

In principle any of these Python/MPI packages should work to invoke
MR-MPI in parallel and MPI calls themselves from a Python script which
is itself running in parallel.  However, when I downloaded and looked
at a few of them, their documentation was incomplete and I had trouble
with their installation.  It's not clear if some of the packages are
still being actively developed and supported.

The one I recommend, since I have successfully used it with MR-MPI, is
Pypar.  Pypar requires the ubiquitous "Numpy
package"_http://numpy.scipy.org be installed in your Python.  After
launching python, type

import numpy :pre

to see if it is installed.  If not, here is how to install it (version
1.3.0b1 as of April 2009).  Unpack the numpy tarball and from its
top-level directory, type

python setup.py build
sudo python setup.py install :pre

The "sudo" is only needed if required to copy Numpy files into your
Python distribution's site-packages directory.

To install Pypar (version pypar-2.1.4_94 as of Aug 2012), unpack it
and from its "source" directory, type

python setup.py build
sudo python setup.py install :pre

Again, the "sudo" is only needed if required to copy Pypar files into
your Python distribution's site-packages directory.

If you have successfully installed Pypar, you should be able to run
Python and type

import pypar :pre

without error.  You should also be able to run python in parallel
on a simple test script

% mpirun -np 4 python test.py :pre

where test.py contains the lines

import pypar
print("Proc %d out of %d procs" % (pypar.rank(),pypar.size())) :pre

and see one line of output for each processor you run on.

IMPORTANT NOTE: To use Pypar and MR-MPI in parallel from Python, you
must ensure both are using the same version of MPI.  If you only have
one MPI installed on your system, this is not an issue, but it can be
if you have multiple MPIs.  Your MR-MPI build is explicit about which
MPI it is using, since you specify the details in your low-level
src/MAKE/Makefile.foo file.  Pypar uses the "mpicc" command to find
information about the MPI it uses to build against.  And it tries to
load "libmpi.so" from the LD_LIBRARY_PATH.  This may or may not find
the MPI library that MR-MPI is using.  If you have problems running
both Pypar and MR-MPI together, this is an issue you may need to
address, e.g. by moving other MPI installations so that Pypar finds
the right one.

:line

[Testing the Python/MR-MPI interface] :link(py_4)

To test if MR-MPI is callable from Python in serial, launch Python
interactively and type:

>>> from mrmpi import mrmpi
>>> mr = mrmpi() :pre

If you get no errors, you're ready to use MR-MPI from Python.  If the
2nd command fails, the most common error to see is

OSError: Could not load MR-MPI dynamic library :pre

which means Python was unable to load the MR-MPI shared library.  This
typically occurs if the system can't find the MR-MPI shared library,
or if something about the library is incompatible with your Python.
The error message should give you an indication of what went wrong.

You can also test the load directly in Python as follows, without
first importing from the mrmpi.py file:

>>> from ctypes import CDLL
>>> CDLL("libmrmpi.so") :pre

If an error occurs, carefully go through the steps in "Start"_Start.html
and above about building a shared library and about ensuring Python
can find the two files it needs.
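
When debugging a failed load from a script rather than interactively,
it can be convenient to catch the error and print its text.  The
following sketch uses only the standard ctypes call shown above; the
helper name try_load is hypothetical.

```python
from ctypes import CDLL


def try_load(libname):
    """Try to load a shared library.

    Returns (handle, "") on success or (None, error_text) on failure.
    """
    try:
        return CDLL(libname), ""
    except OSError as err:
        return None, str(err)


lib, error = try_load("libmrmpi.so")
if lib is None:
    print("Could not load libmrmpi.so: %s" % error)
else:
    print("Loaded libmrmpi.so")
```

The error text usually names the file the loader could not find or
open, which indicates whether the problem is the search path or the
library itself.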

[Test MR-MPI and Python in parallel:] :h5

To run MR-MPI in parallel, assuming you have installed the
"Pypar"_http://datamining.anu.edu.au/~ole/pypar package as discussed
above, create a test.py file containing these lines:

import pypar
from mrmpi import mrmpi
mr = mrmpi()
print("Proc %d out of %d procs has %s" % (pypar.rank(),pypar.size(),mr))
pypar.finalize() :pre

You can then run it in parallel as:

% mpirun -np 4 python test.py :pre

Note that if you leave out the 3 lines from test.py that specify Pypar
commands you will instantiate and run MR-MPI independently on each of
the P processors specified in the mpirun command.  In this case you
should get 4 sets of output, each showing that a MR-MPI was
initialized on a single processor, instead of one set of output
showing MR-MPI was initialized on 4 processors.  If the 1-processor
outputs occur, it means that Pypar is not working correctly.

Also note that once you import the pypar module, Pypar initializes MPI
for you, and you can use MPI calls directly in your Python script, as
described in the Pypar documentation.  The last line of your Python
script should be pypar.finalize(), to ensure MPI is shut down
correctly.

[Running Python scripts:] :h5

Note that any Python script (not just for MR-MPI) can be invoked in
one of several ways:

% python foo.script
% python -i foo.script
% foo.script :pre

The last command requires that the first line of the script be
something like one of these two lines:

#!/usr/local/bin/python 
#!/usr/local/bin/python -i :pre

where the path points to where you have Python installed, and that you
have made the script file executable:

% chmod +x foo.script :pre

Without the "-i" flag, Python will exit when the script finishes.
With the "-i" flag, you will be left in the Python interpreter when
the script finishes, so you can type subsequent commands.  As
mentioned above, you can only run Python interactively when running
Python on a single processor, not in parallel.

:line
:line

[Using the MR-MPI library from Python] :link(py_5)

The Python interface to MR-MPI consists of a Python "mrmpi" module,
the source code for which is in python/mrmpi.py, which creates a
"mrmpi" object, with a set of methods that can be invoked on that
object.  The sample Python code below assumes you have first imported
the "mrmpi" module in your Python script, as follows:

from mrmpi import mrmpi :pre

Some of the methods defined by the mrmpi module, listed below, take
callback functions as arguments, e.g. "map()"_map.html and
"reduce()"_reduce.html.  These are Python functions you define
elsewhere in your script.  When you register "keys" and "values" with
the library, they can be simple quantities like strings or ints or
floats.  Or they can be Python data structures like lists or tuples.

These are the class methods defined by the mrmpi module.  Their
functionality and arguments are described in the "C++ interface
section"_Interface_c++.html.

mr = mrmpi()                # create a MR-MPI object using the default libmrmpi.so library
mr = mrmpi(mpi_comm)        # ditto, but with a specified MPI communicator
mr = mrmpi(0.0)             # ditto, and the library will finalize MPI
mr = mrmpi(None,"g++")      # create a MR-MPI object using the libmrmpi_g++.so library
mr = mrmpi(mpi_comm,"g++")  # ditto, but with a specified MPI communicator
mr = mrmpi(0.0,"g++")       # ditto, and the library will finalize MPI :pre

mr2 = mr.copy()             # copy mr to create mr2 :pre

mr.destroy()                # destroy an mrmpi object, freeing its memory
                            # this will also occur if Python garbage collects :pre

mr.add(mr2)
mr.aggregate()
mr.aggregate(myhash)        # if specified, myhash is a hash function
			    #   called back from the library as myhash(key)
			    # myhash() should return an integer (a proc ID)
mr.broadcast(root)
mr.clone()
mr.close()
mr.collapse(key)
mr.collate()
mr.collate(myhash)          # if specified, myhash is the same function
			    #   as for aggregate() :pre

mr.compress(mycompress)     # mycompress is a function called back from the
			    #   library as mycompress(key,mvalue,mr,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key, mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your mycompress function should typically
			    #   make calls like mr.add(key,value)
mr.compress(mycompress,ptr) # if specified, ptr is any Python datum
		            #    and is passed back to your mycompress()
			    # if not specified, ptr = None :pre

mr.convert()
mr.gather(nprocs) :pre

mr.map(nmap,mymap)          # mymap is a function called back from the
			    #   library as mymap(itask,mr,ptr)
			    #   where mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your mymap function should typically
			    #   make calls like mr.add(key,value)
mr.map(nmap,mymap,ptr)      # if specified, ptr is any Python datum
			    #    and is passed back to your mymap()
			    # if not specified, ptr = None
mr.map(nmap,mymap,ptr,addflag) # if addflag is specified as a non-zero int,
			       #   new key/value pairs will be added to the
			       #   existing key/value pairs
mr.map_file(files,self,recurse,readfile,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,filename,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_file_char(nmap,files,recurse,readfile,sepchar,delta,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,str,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_file_str(nmap,files,recurse,readfile,sepstr,delta,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,str,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_mr(mr2,mymap)         # pass key/values in mr2 to mymap
                             # mymap is a function called back from the
			     #   library as mymap(itask,key,value,mr,ptr)
			     # as above, ptr and addflag are optional args :pre

mr.open()
mr.open(addflag)
mr.print_screen(proc,nstride,kflag,vflag)
mr.print_file(file,fflag,proc,nstride,kflag,vflag) :pre
                            
mr.reduce(myreduce)         # myreduce is a function called back from the
			    #   library as myreduce(key,mvalue,mr,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key, mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your myreduce function should typically
			    #   make calls like mr.add(key,value)
mr.reduce(myreduce,ptr)     # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None :pre

mr.scan_kv(myscan)          # myscan is a function called back from the
			    #   library as myscan(key,value,ptr)
			    #   for each key/value pair
			    #   and you (optionally) provide ptr (see below)
mr.scan_kv(myscan,ptr)      # if specified, ptr is any Python datum
			    #    and is passed back to your myscan()
			    # if not specified, ptr = None
mr.scan_kmv(myscan)         # myscan is a function called back from the
			    #   library as myscan(key,mvalue,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key,
			    #   and you (optionally) provide ptr (see below)
mr.scan_kmv(myscan,ptr)     # if specified, ptr is any Python datum
			    #    and is passed back to your myscan()
			    # if not specified, ptr = None :pre

mr.scrunch(nprocs,key)
mr.sort_keys(mycompare)
mr.sort_values(mycompare)
mr.sort_multivalues(mycompare) # mycompare is a function called back from the
			       #   library as mycompare(a,b) where
			       #   a and b are two keys or two values
			       # your mycompare() should compare them
			       #   and return a -1, 0, or 1 
			       #   if a < b, or a == b, or a > b
mr.sort_keys_flag(flag)
mr.sort_values_flag(flag)
mr.sort_multivalues_flag(flag) :pre
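
A comparison callback of the kind described above could be written as
in this sketch.  The use of functools.cmp_to_key is not part of MR-MPI;
it is included only so the function can be tried on an ordinary Python
list without the library.

```python
from functools import cmp_to_key


def mycompare(a, b):
    """Return -1, 0, or 1 if a < b, a == b, or a > b."""
    return (a > b) - (a < b)


# Try the callback locally on a plain list.
ordered = sorted(["pear", "apple", "fig"], key=cmp_to_key(mycompare))
print(ordered)

# With the library, the same function would be passed as, e.g.:
#   mr.sort_keys(mycompare)
```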

mr.kv_stats(level)
mr.kmv_stats(level) :pre

mr.mapstyle(value)             # set mapstyle to value
mr.all2all(value)              # set all2all to value
mr.verbosity(value)            # set verbosity to value
mr.timer(value)                # set timer to value
mr.memsize(value)              # set memsize to value
mr.minpage(value)              # set minpage to value
mr.maxpage(value)              # set maxpage to value :pre

mr.add(key,value)                 # add single key and value
mr.add_multi_static(keys,values)  # add list of keys and values
				  # all keys are assumed to be same length
				  # all values are assumed to be same length
mr.add_multi_dynamic(keys,values) # add list of keys and values
				  # each key may be different length
				  # each value may be different length :pre
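
As an illustration of the callback signatures listed above, here is a
sketch of map and reduce callbacks for a word count.  So that it runs
without the library, it uses a hypothetical stand-in class, RecordingMR,
which only records add() calls; in a real script the mrmpi object plays
that role, and the library does the grouping that is done by hand here.

```python
class RecordingMR(object):
    """Hypothetical stand-in for an mrmpi object; records add() calls."""

    def __init__(self):
        self.pairs = []

    def add(self, key, value):
        self.pairs.append((key, value))


def mymap(itask, mr, ptr):
    """Map callback: emit (word, 1) for each word in text number itask.

    ptr is the optional datum given to mr.map(); here a list of strings.
    """
    for word in ptr[itask].split():
        mr.add(word, 1)


def myreduce(key, mvalue, mr, ptr):
    """Reduce callback: mvalue is the list of values for key."""
    mr.add(key, sum(mvalue))


# Exercise the callbacks directly, without the library.
texts = ["a b a", "b c"]
mapped = RecordingMR()
for itask in range(len(texts)):
    mymap(itask, mapped, texts)

# Group values by key by hand (the library's collate() does this step).
grouped = {}
for key, value in mapped.pairs:
    grouped.setdefault(key, []).append(value)

reduced = RecordingMR()
for key in sorted(grouped):
    myreduce(key, grouped[key], reduced, None)
print(reduced.pairs)

# With the real library the corresponding calls would be roughly:
#   mr = mrmpi()
#   mr.map(len(texts), mymap, texts)
#   mr.collate()
#   mr.reduce(myreduce)
```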

:line

Note that you can create multiple MR-MPI objects in your Python
script, and coordinate the data stored in each and moved between them,
just as you can from a C or C++ program.

The class methods above correspond one-to-one with the C++ methods
described "here"_Interface_c++.html, except that for C++ methods with
multiple interfaces (e.g. "map()"_map.html), there are multiple Python
methods with slightly different names, similar to the "C
interface"_Interface_c.html.

There is no set function for the {keyalign} and {valuealign}
"settings"_settings.html.  These are hard-wired to 1 for
the Python interface, since no other values make sense, due to
the pickling/unpickling that is performed on key and value data.
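
The effect of pickling can be seen with the standard pickle module.
This sketch does not reproduce the wrapper's own code; it only shows
that Python keys of different types round-trip through pickle and
become byte strings of varying length, for which a byte alignment
other than 1 has no natural meaning.

```python
import pickle

keys = ["word", 42, ("tuple", 3.5), ["a", "list"]]
sizes = []
for key in keys:
    data = pickle.dumps(key)          # Python object -> byte string
    assert pickle.loads(data) == key  # byte string -> equal object
    sizes.append(len(data))
    print("%r -> %d bytes" % (key, len(data)))
```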

See the Python scripts in the examples directory for
"examples"_Examples.html of how these calls are made from a Python
program.  They are conceptually identical to the C++ and C programs in
the same directory.
