Welcome to our round table! Each participant writes one blog post about his or her experiences with distributing scientific software. You are invited to post. More information here.

2011-09-20

Chris Kees

Introducing myself

I am a research hydraulic engineer at the US Army Engineer Research & Development Center (ERDC) and one of the developers of Proteus, a Python toolkit for computational methods and simulation.  I've been working in the area of numerical methods for partial differential equations and high performance computing since 1995.

In the interest of contributing to the roundtable discussion in a timely manner, I'm going to basically post what I put on the sage-support list.  I have learned a lot since that post, and I see a lot of good ideas in what others have shared about their knowledge of Linux packaging systems like Nix and Gentoo.

One area where I have a slightly different opinion is that I think we should focus on just the needs of the Python environment on HPC systems. That includes the difficulties of working with many other packages and system libraries, but I am looking for an evolutionary step from what we currently have working for our Python software. If the resulting Python distribution solves the more general problem, then so be it.


The way I see forward (from sage-support)


Here's what I think we need:


1) A standard, which specifies a python version, and a list of python packages and their dependent packages. This allows for-profit vendors to build to our standard.

2) A build system that allows extensive configuration of the entire system but with enough granularity that the format of a package is standardized and relatively straightforward. On the other hand, the whole system must be designed such that it can be built repeatedly from scratch without any interactive steps.

3) A testing system that is simple enough that the community can easily contribute tests to ensure that the community Python is reliable for their needs.

4) A framework for making this environment extensible without requiring forking it and creating yet more distributions.

Here's a straw man:

1) Standard:

Python 2.7.2  PLUS:
  • numpy *
  • scipy
  • matplotlib *
  • vtk (python wrappers + C++ libs) *
  • elementtree *
  • ctypes *
  • readline (i.e. a functional readline extension module) *
  • swig
  • mpi4py *
  • petsc4py *
  • pympi
  • nose *
  • pytables *
  • basemap
  • cython *
  • sympy *
  • pycuda
  • pyopencl
  • IPython *
  • wxpython
  • PyQt  *
  • pygtk
  • PyTrilinos
  • virtualenv *
  • Pandas
  • numexpr *
  • pygrib
Note:
*Our group has these in the python stack we build for our PDE solver framework (http://proteus.usace.army.mil), which we build on a range of machines at 4 major supercomputing centers. 
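
One way to make a package list like this checkable is a small script that reports which entries cannot be found in a given interpreter. The following is only a sketch: the mapping from list names to import names covers a subset of the list and is my assumption, and it is written for a current Python 3 (it uses importlib.util.find_spec) rather than the Python 2.7.2 named above.

```python
import importlib.util

# Hypothetical mapping from names in the list above to importable module
# names. Only a subset is shown; extend it to cover the full standard.
STANDARD_MODULES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "elementtree": "xml.etree.ElementTree",
    "ctypes": "ctypes",
    "mpi4py": "mpi4py",
    "petsc4py": "petsc4py",
    "pytables": "tables",
    "cython": "Cython",
    "sympy": "sympy",
    "IPython": "IPython",
}


def missing_packages(modules=STANDARD_MODULES):
    """Return the sorted list names whose module cannot be located."""
    missing = []
    for name, module in sorted(modules.items()):
        try:
            found = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent or a spec is malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_packages():
        print("missing:", name)
```

Locating a module is a weaker check than importing it, since a compiled extension can be present and still fail to load against the wrong system libraries, so a real compliance check would probably also import each package.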

The main issue I see with 1) is that this is somewhat different from the sage package list. We would need many optional sage packages but wouldn't need some of the standard sage packages.

2) Build System: 

a. Use cmake* for the top level configuration, storing the part relevant for each package in a subdirectory for each package (call it package_name_Config e.g. numpyConfig, petsc4pyConfig, ...)

b. store each package as an spkg** that meets sage community standards except that spkg-install will rely on information from package_name_Config (maybe it would be OK to edit files in package_name_Config located INSIDE package_name_version.spkg during the interactive configuration step?)  

c. each package will still get built with its native build system***

Notes:

*Our group simply uses make instead of cmake, with a top level Makefile containing 'editConfig' and 'newConfig' targets that allow you to edit and copy existing configurations
**Our group only produces a top level spkg, but I think we could easily generate a finer grained set of spkg's for packages that don't already have one
***Our group does this (i.e. we don't rewrite upstream build systems).  I think spkg's also use the native build system in most cases, right?

The main issue with 2) (the build system) is that building on HPC systems requires extensive configuration of individual packages: numpy needs to get built with the right vendor BLAS/LAPACK and potentially the correct, non-gcc, optimizing compilers (maybe even a mixture of gcc and some vendor Fortran). Likewise petsc4py might need to use PETSc libraries installed as part of the HPC baseline configuration rather than building the source included with this distribution. My impression is that sage very reasonably opted to focus on the web notebook and a GNU-based Linux environment, so the spkg system alone doesn't fully meet the needs of the HPC community. We need the ability to specify different compilers for different packages and to do a range of things from building every dependency to building only python wrappers for many dependencies.
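
To make the per-package configuration idea concrete, here is a minimal sketch of what one package's configuration record might hold. The class, field, and function names are hypothetical and not part of sage or of our build. CC and FC are only the conventional compiler variables, and whether a given package honors them depends on its native build system. The compiler names and the PETSC_DIR path are example values.

```python
from dataclasses import dataclass, field


@dataclass
class PackageConfig:
    """Hypothetical per-package settings (e.g. the content of numpyConfig)."""
    name: str
    cc: str = "gcc"                # C compiler for this package only
    fc: str = "gfortran"           # Fortran compiler for this package only
    use_system_libs: bool = False  # True: build wrappers against installed libs
    env: dict = field(default_factory=dict)  # extra package-specific variables


def build_environment(cfg, base_env):
    """Return a copy of base_env with this package's overrides applied."""
    env = dict(base_env)
    env.update({"CC": cfg.cc, "FC": cfg.fc})
    env.update(cfg.env)
    return env


def build_steps(cfg):
    """Build everything, or only the Python wrappers, as configured."""
    steps = [] if cfg.use_system_libs else ["build bundled libraries"]
    steps.append("build python wrappers")
    return steps


# Example values only: vendor compilers for numpy, pre-installed PETSc
# for petsc4py (the path is made up).
numpy_cfg = PackageConfig("numpy", cc="icc", fc="ifort")
petsc4py_cfg = PackageConfig(
    "petsc4py", use_system_libs=True, env={"PETSC_DIR": "/opt/petsc"}
)
```

Keeping the record per package is what allows a mixture such as a vendor compiler for numpy and gcc elsewhere, and the use_system_libs switch covers the range from building every dependency to building only wrappers.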

3) buildbot + nose and a package_nameTest directory for community supplied tests of each package in addition to the packages' own tests. This way users only have to add test_NAME.py files to 
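
As an illustration of how small a community-supplied test could be, here is a sketch of one test_NAME.py file. The file name and the test cases are made up. It exercises elementtree through the standard library import path so that it runs without any third-party package. nose collects plain functions whose names start with test_, so no test class is required.

```python
# test_elementtree_roundtrip.py  (hypothetical file name)
import xml.etree.ElementTree as ET


def test_parse_attributes():
    # Parse a tiny document and check attribute access and child lookup.
    root = ET.fromstring('<mesh dim="2"><node id="0"/><node id="1"/></mesh>')
    assert root.get("dim") == "2"
    assert [n.get("id") for n in root.findall("node")] == ["0", "1"]


def test_roundtrip_preserves_structure():
    # Serialize a tree and parse it back; the structure should survive.
    root = ET.Element("mesh", dim="3")
    ET.SubElement(root, "node", id="7")
    copy = ET.fromstring(ET.tostring(root))
    assert copy.tag == "mesh"
    assert copy.get("dim") == "3"
    assert copy.find("node").get("id") == "7"
```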

4) virtualenv + pip should allow users to extend the python installation into their private environment where they can update and add new packages as necessary.  An issue here is that it wouldn't allow a per-user sage environment so I'm not sure whether users could also install spkg's or even use their modified python environment from sage.
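
The virtualenv + pip step could look roughly like the sketch below, which only assembles the two command lines and does not run them. The function name and the example paths are hypothetical. It assumes a POSIX layout (env_dir/bin/pip) and that virtualenv is installed in the shared Python. The --system-site-packages option is what lets the private environment see the centrally built packages.

```python
import os


def extension_commands(base_python, env_dir, packages):
    """Return the commands that create a private env and add packages to it."""
    # Create an environment layered on top of the shared installation.
    create = [base_python, "-m", "virtualenv", "--system-site-packages", env_dir]
    # Install or upgrade packages inside the private environment only.
    pip = os.path.join(env_dir, "bin", "pip")
    install = [pip, "install", "--upgrade"] + list(packages)
    return [create, install]


if __name__ == "__main__":
    # Example paths and package name are placeholders.
    for cmd in extension_commands("/opt/hpcpython/bin/python", "myenv", ["sympy"]):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.check_call to actually perform the step.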

2 comments:

  1. Questions:

    You propose making a standard that vendors can build to, but also describe a build system. I would think that the build system is something the vendors would take care of? (E.g., Enthought could build you a special version of EPD) Are you proposing that there's an open source build system in addition to, and in competition with, the commercial offerings? If so, what value is really added by the vendors, and is there really any incentive for vendors to spend effort cloning the open source effort?

    Secondly, you do not mention non-Python libraries. An MPI implementation, Trilinos, PETSc: Are they assumed to already be present on the system, or part of the "standard"?

  2. I don't think there will be any incentive for vendors to provide binary installs or build systems for HPC systems anytime soon. I was thinking more of the Windows/Mac side. It would be a selling point for me that some Windows Python distribution complied with our standard and could therefore be used to deploy our HPC software on centrally managed Windows machines.

    No, the non-Python libraries are not assumed to be already installed. Unfortunately they can't always be assumed to be not installed either (which is sort of the approach that sage takes: install everything). I think on HPC systems there is reason to design the system so that it can either take advantage of pre-installed (and presumably optimized) non-Python libraries or build them from scratch if necessary.
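
A rough sketch of that either/or decision, assuming the only question is whether a shared library can be located: the function name and the strategy labels are made up, and ctypes.util.find_library only searches the default system locations, so libraries provided through site-specific module systems would need a more careful lookup.

```python
import ctypes.util


def choose_strategy(libraries, finder=ctypes.util.find_library):
    """Map each library name to 'use-system' or 'build-from-source'."""
    plan = {}
    for lib in libraries:
        # finder returns a library name or path when found, otherwise None.
        plan[lib] = "use-system" if finder(lib) else "build-from-source"
    return plan


if __name__ == "__main__":
    for lib, strategy in choose_strategy(["petsc", "mpi", "hdf5"]).items():
        print(lib, "->", strategy)
```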
