Fixing scientific software distribution

2011-10-02

Matt T

My name is Matt Turk, and I'm a computational astrophysicist working on structure formation in the early universe. I am the original author of the yt code and a developer on the Enzo project. I'm at Columbia University, on an NSF postdoctoral fellowship to drive forward my studies of the First Stars while developing infrastructure for these simulations, targeting simulations from laptop- to peta-scale. yt is a python project, while Enzo is largely C/C++/Fortran. yt is designed to target the output of multiple different simulation codes, and has a growing user- and developer-base.

My primary interests with respect to software are ensuring that communities of users can deploy appropriate analysis packages easily and on multiple systems. While the majority of the users of yt utilize XSEDE resources, we have a large number that also use laptops and local computing clusters.

yt started out as very, very difficult to install. The software stack was quite large and it was not automated. For the most part, we have addressed this in two ways. The first is that the dependency stack has been whittled away substantially; we are extremely conservative in adding new dependencies to yt, and the core dependencies for most simulation input types is simply numpy, hdf5, and Python itself. The second approach is to provide a hand-written installer script, which handles installation of the following dependencies into an isolated directory structure:

zlib
bzlib
libpng
freetype (optional)
sqlite (optional)
Python
numpy
matplotlib (optional)
ipython (optional)
hdf5
h5py (optional)
Cython (optional)
Forthon (optional)
mercurial
yt

This seems like a large stack, but the trickiest libraries are usually matplotlib and numpy. We have also reached out to XSEDE and modules are now available on several HPC installations. The install script takes care of the rest. We're currently in the process of attempting to make yt available as a component of both ParaView's superbuild and VisIt's build_visit script, both of which also handle dependency stacks. I'm extremely concerned with ensuring that yt's installation works everywhere, especially those systems where root / sudo is not available.

Easily the hardest problem, and the one that I hope we can solve in some way, is that of static builds. The problem of building a stack library (for use, for instance, on Compute Node Linux on some Cray systems) is difficult; starting with the GPAW instructions we at one time attempted to maintain static builds of yt, but the inclusion of C++ components (and the lack of C++ ABI interoperability) became too much of a burden and we no longer do so. Now we are faced with the issue of needing one because file systems typically cannot keep up with importing a python stack from every MPI task (which becomes burdensome even at as few as 256 processes, and essentially impossible above a couple thousand). While egg imports and zipped file systems alleviate this problem for pure-python libraries, this will not work for shared libraries. Neither I nor my fellow developers have found a simple and easy way to generate static builds that are easily updated, but this is a primary concern for me.

I don't have a particular takeaway or suggestion for a call to action; we have lately simply come to terms with the time it takes to load shared libraries, and we'll probably have another go at a unified static builder at some point in the future. But for now, out install script works reasonably well, and we will probably continue using it while still reaching out to system administrators for assistance building on individual supercomputer installations.

Qsnake

Here is my answer to all the questions from the Introduction post.

My name is Ondřej Čertík and I am doing my PhD in Chemical Physics at University of Nevada, Reno. For my work, I need C/C++ libraries (like PETSc and Trilinos) as well as Fortran libraries (BLAS, LAPACK, ARPACK and some FEM packages). My own code used to be in C++, but last year I switched to Fortran (so I need Fortran as the first class citizen). I then wrap it using Cython and call it from Python.

I took Sage, and rewrote the build system in Python, and created Qsnake (the core is BSD licensed, other packages have their own license). The packages are hosted at github, and it uses a json file with package dependencies. Quite a few people have tried it already and here is our plan with the package management. I need a lot of packages, I would call it engineering packages, mainly around the SciPy stack, and a few more numeric packages.

I don't think there is any fundamental problem with my approach, it works great for me and does exactly what I need. Obviously more improvements would be cool, see the packages plan, and I work on it as my time allows.

My key insight is that it is a lot of work to get things working and testing on various platforms (Linux, Mac so far). Also it is advantageous to keep packages compatible with Sage, because then people can reuse them and instead of creating yet another fork, I view Qsnake as a complement to Sage.

The way forward that I can see is to simply continue working on Qsnake (in my case), or if I can see significant progress on some competing product, I might join it. I am working on Qsnake. I don't have much money to spend on it, but thanks to being compatible with Sage, there is possibility of common workshop with Sage (I got the offer from William Stein to organize such a thing, but I am currently too busy). I use Qsnake almost daily for my own work, and I improve it as time permits. I don't have much time in general though, but I do my best.

Finally, I think that the most important insight that I have learned is that it is important to get and discuss good design and so on, but it is even more important to simply start working on something and make it fix a problem for somebody (me in my case). Qsnake does exactly that, and as I said, I am open to adjust it's goal if anyone wants to join, as well as join any competing project, if somebody else think he can do (and does) a better job (which shouldn't be that difficult given my own time constrains). In the meantime, I simply continue with Qsnake.

What now?

This blog was hardly a success (though thanks to those who did post!). Quite a few people said they wanted to post and quite a few more read the blog (800 page views). So at least it's obvious there's a lot of interest in this subject.

What now? Ideas? (Beyond just giving up...)

If you really did intend to put aside time to make a post, please do it now. Don't worry too much about quality at this point... in particular those who sent me emails in private, feel free to just dump them to the blog.
We could announce a migration to a mailing list (python-hpc? Or a new Google group?) and promise to keep each other posted on our individual efforts there. Though I'm not sure whether starting a mailing list is really easier than a blog, many mailing lists die out too.
I did throw out the idea of a Skype meeting. If anyone would prefer that please say so. Probably more time-consuming overall than getting the blog to work, but perhaps more comfortable and easier to get people to commit...

2011-09-22

Experiences with Python under git Control

I find Dag's idea very straight and brave. I'm facing pretty much the same problem with Python only (i.e. no C libs, no Fortran libs, whatever). I do use mpl, though. Maybe you remember me doing some builds of numpy for the OS X users. That time I resolved the reproducibility and branchability problem using .. git. Yes, Python path under git control. I've put the whole Python installation directory, which is usually /Library/Frameworks/Python.framework/Version/2.x/ under a singleton git control (i.e. making this dir a repo). This works pretty flawless after some tinkering around. There are some pitfalls, which can be avoided, and some problems, which can be solved. The pitfalls will not be covered here. One of the "major" problems is the .pth file issue, when merging in branches of a different software which is installed via .pth, git sometimes cannot merge the .pth file properly and this needs to be fixed manually.

There is one more major problem where I would need feedback or have a lack of knowledge, that is the speed issue with .pyc being older than the corresponding .py file. The .pyc files in general are one of the pitfalls of the method. They are hard to repoduce to look the same as when installed. And you cannot git-ignore them because .pyc's lingering around can, for instance, disturb nose. For this problem, which can be ignored at all if load speed does not matter, I didn't find a real solution so far; I simply ignored it and kept the .pyc's in the repo.

I'm going to use this approach for my OS X Lion Python installation with a shared system Python and User-local Python (probably via virtualenv), because I want the System Python to be free of user's software, let it be software interesting to others or not. I've worked out the strategy for handling mpkg (i.e. point-and-click installers), by fetching from the system Python and deleting the branch in the system Python afterwards. Anyway, this is planned, not yet tested. Tested is only the strategy for handling and maintaining the system Python, and this I was used to use on a regular basis.

Of course, it would be possible to wrap git in a Python-specific application, but that would hinder portability to the C, Cython, Fortran etc. folks amongst us.

I must add that I don't have experience with most of the packaging software that is around except the Python distutils & Bento (where I clearly prefer the latter).

To my experience there will not be the solution. World is not that simple as we scientists would like it to be. It's like a zoo out there [from a well-known movie]. Nothing will ever help this. Life would be boring if everything would be standardised and there would be only 1 standard. In my opinion, a discussion serves as .. discussion on its own, i.e. propelling the mind by developing new ideas. Standards might be a side-effect of a good democratic evolution. So this round table might have perfect conditions when it brings together the different parties, which were separated in mind before.

2011-09-20

Chris Kees

Introducing myself

I am a research hydraulic engineer at the US Army Engineer Research & Development Center (ERDC) and one of the developers of Proteus, a Python toolkit for computational methods and simulation. I've been working in the area of numerical methods for partial differential equations and high performance computing since 1995.

In the interest of contributing to the roundtable discussion in a timely manner, I'm going to basically post what I put on the sage-support list. I have learned a lot since that post, and I see a lot of good ideas in what others have shared about their knowledge of linux packaging systems like Nix and gentoo.

One area where I have a slightly different opinion is that I think we should focus on just the needs of the Python environment on HPC systems. That includes the difficulties of working with many other packages and system libraries, but I am looking for an evolutionary step in what we currently have working for our Python software. If the resulting python distribution solves the more general problem then so be it.

The way I see forward (from sage-support)

Here's what I think we need:

1) A standard, which specifies a python version, and a list of python packages and their dependent packages. This allows for-profit vendors to build to our standard.

2) A build system that allows extensive configuration of the entire system but with enough granularity that the format of a package is standardized and relatively straightforward. On the other hand, the whole system must be designed such that it can be built repeatedly from scratch without any interactive steps.

3) A testing system that is simple enough that the community can easily contribute tests to ensure that the community python is reliable for their needs

4) A framework for making this environment extensible without requiring forking it and creating yet more distributions

Here's a straw man:

1) Standard:

Python 2.7.2 PLUS:

numpy *

scipy

matplotlib *

vtk (python wrappers + C++ libs) *

elementtree *

ctypes *

readline (i.e. a functional readline extension module) *

swig

mpi4py *

petsc4py *

pympi

nose *

pytables *

basemap

cython *

sympy *

pycuda

pyopencl

IPython *

wxpython

PyQt *

pygtk

PyTrilinos

virtualenv *

Pandas
numexpr *
pygrib

Note:

*Our group has these in the python stack we build for our PDE solver framework (http://proteus.usace.army.mil), which we build on a range of machines at 4 major supercomputing centers.

The main issue I see with 1) is that this is somewhat different from the sage package list. We would need many optional sage packages but wouldn't need some of the standard sage packages.

2) Build System:

a. Use cmake* for the top level configuration, storing the part relevant for each package in a subdirectory for each package (call it package_name_Config e.g. numpyConfig, petsc4pyConfig, ...)

b. store each package as an spkg** that meets sage community standards except that spkg-install will rely on information from package_name_Config (maybe it would be OK to edit files in package_name_Config located INSIDE package_name_version.spkg during the interactive configuration step?)

c. each package will still get built with it's native built system***

Notes:

*Our group simply uses make instead of cmake, with a top level Makefile containing 'editConfig' and 'newConfig' targets that allows you to edit and copy existing configurations

**Our group only produces a top level spkg, but I think we could easily generate a finer grained set of spkg's for ones that don't already exist

***Our group does this (i.e. we don't rewrite upstream build systems). I think spkg's also use the native build system in most cases, right?

The main issue with 2. (the build system) is that building on HPC systems requires extensive configuration of individual packages: numpy needs to get built with the right vendor blas/lapack and potentially the correct, non-gcc, optimizing compilers (maybe even a mixture of gcc and some vendor fortran). Likewise petsc4py might need to use PETSc libraries installed as part of the HPC baseline configuration rather than building the source included with this distribution. My impression is that sage very reasonably opted to focus on the web notebook and a gnu-based linux environment so the spkg system alone doesn't fully meet the needs of the HPC community. We need the ability to specify different compilers for different packages and to do a range of things from building every dependency to building only python wrappers for many dependencies.

3) buildbot + nose and a package_nameTest directory for community supplied tests of each package in addition to the packages' own tests. This way users only have to add test_NAME.py files to

4) virtualenv + pip should allow users to extend the python installation into a their private environment where they can update and add new packages as necessary. An issue here is that it wouldn't allow a per-user sage environment so I'm not sure whether users could also install spkg's or even use their modified python environment from sage.

Dag Sverre Seljebotn

Introducing myself

I'm a Ph. D. student doing statistical analysis of the cosmic microwave background (Institute of
Theoretical Astrophysics, Oslo, Norway). This is a very Fortran-oriented place, with only a couple of Python developers.

I'm one of the developers of Cython. I also worked for some months helping Enthought to port SciPy to .NET.

In a couple of months the institute will likely be buying a small cluster (~700 cores). The upside is we don't have to use a certain ever-failing cluster (which will remain unnamed) nearly so much. The downside is we need to build all those libraries yet again.

My Ph. D. project is to rewrite an existing code so that it can scale up to higher resolutions than today (including work on statistical methods and preconditioners). My responsibility will be the computationally heavy part of the program. The current version is 10k lines of Fortran code, the rewritten version will likely be a mix of Python, Cython, C and Fortran. MPI code with many dependencies: Libraries for dense and sparse linear algebra, Fourier transforms, and
spherical harmonic transforms.

What I have tried

During my M. Sc. I relied on a Sage install:

It got so heavily patched with manually installed packages that I never dared upgrade
matplotlib was botched and needed configuration + rebuild (no GUI support)
NumPy was built with ATLAS, which produced erronous results on my machine, and I made the NumPy SPKG work with Intel MKL
I needed to work both on my own computer and the cluster, and keep my heavily patched setups somewhat consistent. I started writing SPKGs to do this, but it was more pain than gain
I still ended up with a significant body of C and Fortran code in $HOME/local, and Python code in $PYTHONPATH.

In the end I started afresh with EPD, simply because the set of packages that I wanted was larger and the set of packages I didn't want smaller. My current setup is a long $PYTHONPATH of the code I modify, and EPD + manually easy_install'ed packages. There are probably better ways of using EPD, but I don't want to invest in something which only solves one half of my problems.

In search of a better solution I've tried Gentoo Prefix and Nix. Both of these were a bit difficult to get started with (Nix much better though), and also assumed that you wanted to build everything, including gcc and libc. Building its own libc makes it incompatible with any shared libraries on the "host" machine, so it's an all-or-nothing approach, and I didn't dare make the commitment.

None of the popular solutions solve my problems. They work great for the target communities --Mathematicians in the case of Sage, scientists not using clusters or GPLed code in the case of EPD -- but nobody has a system for "PhD students who uses a software stack that cluster admins have barely heard of, C/Fortran libraries that the Python community has never heard of, and need to live on bleeding edge on some components (Cython, NumPy) but cares less about the bleeding edge of other components (matplotlib, ipython)".

Build-wise: Building Cython+Fortran used to be a pain with distutils. I switched to SCons, which was slightly better, but had its own problems. Finally, the current waf works nicely (thanks to the work of David Cournapeau and Kurt Smith), so I switched to that. Bento sounds nice but I didn't use it yet since I just use PYTHONPATH for my own code and didn't need to distribute code to others than co-workers yet.

My insights

The problem we face is not unique to Python (though perhaps made worse by people actually starting to reuse each others code...). A solution should target scientific software in general.
Many Python libraries wraps C/Fortran code which must also be properly installed. Bundling C/Fortran libraries in a Python package (as some do) is a pain, because you can no longer freely upgrade or play with compilation of the C/Fortran part.
Non-root use is absolutely mandatory. I certainly won't get root access on the clusters, and the sysadmins can't be bothered to keep the software stack I want to use up to date.
I think all popular solutions fall short of allowing me the flexibility that Git offers me with respect to branching and rollbacks. I want to develop different applications on top of different software stacks, to switch to a stack I used a year ago for comparison (reproducible research and all that), and to more easily hop between stacks compiled with different versions of BLAS.
I like to use my laptop, not just the cluster. Relying on a shared filesystem or hard-coded paths is not good.
I want it to be trivial to use the software distribution system to distribute my own code to my co-workers. I don't want to invent a system on top for moving around and keeping in sync the code I write myself.

The way I see forward

Before daring to spend another second working on any solutions, I want to see the problems faced and solutions tried by others.

Pointing towards a solution, I've become an admirer of Nix (http://nixos.org). It solves the problem of jumping between different branches and doing atomic rollbacks in seconds. On the other hand there's a couple of significant challenges with Nix. I won't go into details here, but instead here: https://github.com/dagss/scidist/blob/master/ideas.rst.

On the one hand, I'm a Ph. D. student with 3 paid years ahead of me. On the other hand, I need to do research (and I'm a father of two and can't work around the clock). I'd wish I didn't have to spend any more time on this, but now that a new cluster is coming up and I need to edit Makefiles yet another time, I'm sufficiently frustrated that I might anyway.

Right now my vision is sufficiently many people with sufficient skills coming together for a week-long workshop to build something Nix-like (either based on Nix or just stealing ideas).

Introduction

Every day, countless researcher hours are spent getting software to run. A significant number of scientific Python distributions are available, but none solve everybody's problems.

There's no lack of mailing list threads out there on the subject. This time it was on the mpi4py mailing list, where the feeling was that nobody has really catered to the "HPC" or "large cluster" segment. Rather than plunging ahead and develop yet another scientific Python distribution or set of standards for another special case, let's take a breath and make sure we understand the full problem first. Why aren't the current solutions working, and what do we really want?

Goals:

Avoid redoing the mistakes of the past
Save time by pooling our efforts
Take a step towards more reproducible research
Make scientific Python more attractive

Round 1: Getting to know each other + surveying problems and solutions

Everybody can participate (HPC or desktop, Python or Fortran shouldn't matter at this point). Send me an email to d.s.seljebotn@astro.uio.no and I'll give you posting rights. Then write a post where you:

Introduce yourself and the problems you work with
What have you tried? What are you currently using? Why does this cause problems?
What are the key insights you can draw from your experience?
What way do you see forward? Are you already working on something? How much time and other resources (money for workshops etc.) do you have to work towards a solution?

We hope many will participate so that we all get a broad collection of the problems faced and the solutions tried.

Blog Archive