Welcome to our round table! Each participant writes one blog post about his or her experiences with distributing scientific software. You are invited to post. More information here.


Dag Sverre Seljebotn

Introducing myself

I'm a Ph.D. student doing statistical analysis of the cosmic microwave background (Institute of Theoretical Astrophysics, Oslo, Norway). This is a very Fortran-oriented place, with only a couple of Python developers.

I'm one of the developers of Cython. I also worked for some months helping Enthought to port SciPy to .NET.

In a couple of months the institute will likely be buying a small cluster (~700 cores).  The upside is we don't have to use a certain ever-failing cluster (which will remain unnamed) nearly so much. The downside is we need to build all those libraries yet again.

My Ph.D. project is to rewrite an existing code so that it can scale to higher resolutions than today (including work on statistical methods and preconditioners). My responsibility will be the computationally heavy part of the program. The current version is 10k lines of Fortran; the rewritten version will likely be a mix of Python, Cython, C and Fortran. It will be MPI code with many dependencies: libraries for dense and sparse linear algebra, Fourier transforms, and spherical harmonic transforms.

What I have tried

During my M.Sc. I relied on a Sage install:

  • It got so heavily patched with manually installed packages that I never dared upgrade it
  • matplotlib was botched and needed configuration + rebuild (no GUI support)
  • NumPy was built with ATLAS, which produced erroneous results on my machine, so I made the NumPy SPKG work with Intel MKL instead
  • I needed to work both on my own computer and the cluster, and keep my heavily patched setups somewhat consistent. I started writing SPKGs to do this, but it was more pain than gain
  • I still ended up with a significant body of C and Fortran code in $HOME/local, and Python code in $PYTHONPATH.

In the end I started afresh with EPD, simply because it shipped more of the packages I wanted and fewer of the ones I didn't. My current setup is EPD plus manually easy_install'ed packages, with a long $PYTHONPATH pointing at the code I modify. There are probably better ways of using EPD, but I don't want to invest in something that only solves half of my problems.
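To make that concrete, here is a minimal sketch of the kind of environment setup I mean; the checkout locations under $HOME/code and the $HOME/local prefix are illustrative, not my actual layout:

```shell
# Illustrative sketch of an EPD + PYTHONPATH setup (paths are made up).

# Bleeding-edge checkouts of the components I hack on come first,
# so they shadow the versions bundled with EPD.
export PYTHONPATH="$HOME/code/cython:$HOME/code/numpy:$PYTHONPATH"

# Manually built C/Fortran libraries live under a private prefix.
export PATH="$HOME/local/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/local/lib:$LD_LIBRARY_PATH"
```

It works, but it is exactly the kind of fragile, hand-maintained state I would like a real distribution system to replace.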

In search of a better solution I've tried Gentoo Prefix and Nix. Both were a bit difficult to get started with (Nix much less so), and both assume you want to build everything yourself, including gcc and libc. Building its own libc makes the result incompatible with any shared libraries on the "host" machine, so it's an all-or-nothing approach, and I didn't dare make the commitment.

None of the popular solutions solve my problems. They work great for their target communities -- mathematicians in the case of Sage, scientists not using clusters or GPLed code in the case of EPD -- but nobody has a system for "Ph.D. students who use a software stack that cluster admins have barely heard of, plus C/Fortran libraries that the Python community has never heard of, and who need to live on the bleeding edge of some components (Cython, NumPy) but care less about the bleeding edge of others (matplotlib, IPython)".

Build-wise: building Cython+Fortran used to be a pain with distutils. I switched to SCons, which was slightly better but had its own problems. Finally, the current waf works nicely (thanks to the work of David Cournapeau and Kurt Smith), so I switched to that. Bento sounds nice, but I haven't used it yet, since I just use PYTHONPATH for my own code and haven't needed to distribute it to anyone beyond co-workers.
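As an illustration, a mixed Cython/Fortran extension can be described to waf in a wscript roughly like the one below. This is a sketch, not my actual build: the file names (mymodule.pyx, kernels.f90) are invented, and the exact tool and feature names may differ between waf versions.

```python
# Hypothetical wscript for building a Cython extension that links
# against Fortran code, using waf's bundled tools.

def options(opt):
    opt.load('compiler_c compiler_fc python cython')

def configure(conf):
    conf.load('compiler_c compiler_fc python cython')
    conf.check_python_headers()

def build(bld):
    # Compile the Fortran kernels into a static library...
    bld(features='fc fcstlib', source='kernels.f90', target='kernels')
    # ...then build the Cython module as a Python extension linked to it.
    bld(features='c cshlib pyext',
        source='mymodule.pyx', target='mymodule', use='kernels')
```

The point is that waf treats the Fortran and Cython parts as first-class build steps with proper dependency tracking, which is what distutils never gave me.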

My insights

  • The problem we face is not unique to Python (though it is perhaps made worse by people actually starting to reuse each other's code...). A solution should target scientific software in general.
  • Many Python libraries wrap C/Fortran code which must also be properly installed. Bundling C/Fortran libraries inside a Python package (as some do) is a pain, because you can no longer freely upgrade or experiment with the compilation of the C/Fortran part.
  • Non-root use is absolutely mandatory. I certainly won't get root access on the clusters, and the sysadmins can't be bothered to keep the software stack I want to use up to date.
  • I think all popular solutions fall short of allowing me the flexibility that Git offers me with respect to branching and rollbacks. I want to develop different applications on top of different software stacks, to switch to a stack I used a year ago for comparison (reproducible research and all that), and to more easily hop between stacks compiled with different versions of BLAS.
  • I like to use my laptop, not just the cluster. Relying on a shared filesystem or hard-coded paths is not good.
  • I want it to be trivial to use the software distribution system to distribute my own code to my co-workers. I don't want to invent a system on top for moving around and keeping in sync the code I write myself.
The way forward

Before daring to spend another second working on any solutions, I want to see the problems faced and solutions tried by others.

Pointing towards a solution, I've become an admirer of Nix (http://nixos.org). It solves the problem of jumping between different branches and doing atomic rollbacks in seconds. On the other hand, there are a couple of significant challenges with Nix; I won't go into details here, but rather at https://github.com/dagss/scidist/blob/master/ideas.rst.

On the one hand, I'm a Ph.D. student with 3 paid years ahead of me. On the other hand, I need to do research (and I'm a father of two and can't work around the clock). I wish I didn't have to spend any more time on this, but now that a new cluster is coming and I need to edit Makefiles yet another time, I'm sufficiently frustrated that I might anyway.

Right now my vision is sufficiently many people with sufficient skills coming together for a week-long workshop to build something Nix-like (either based on Nix or just stealing ideas).
