There has been a recent trend in biology (and, as is relevant to me, in ecology and evolutionary biology) to learn and promote programming in very “high-level” languages and frameworks such as Python and R. The idea is that the programmer defers much of the low-level bookkeeping to the language itself, so the scientist (in this case) can get on with solving the problem.
This follows a general trend in programming that started with FORTRAN back in the 1950s, when engineers and scientists were thought to be more productive if they could just enter formulas directly instead of dealing with the programming specifics, hence the name “The IBM Mathematical Formula Translating System”. Nevertheless, I am not sure many people in the biological sciences would consider FORTRAN simple and obvious unless they already program.
What I wonder about here is the true merit of programming in a very high-level language. First, let’s agree that C, C++, and Java, although high-level compared to assembly, are lower level than Python, Perl, and R. So, why choose Python, Perl, or R (or Matlab, or many of the other frameworks) over C, C++, and Java? Some reasons people give are: 1) easier to learn, 2) existing toolsets, 3) speed doesn’t matter, and 4) other languages are too hard.
First, let me say that I use both Python and R. I use R for some figure creation and statistics (though I am moving away from it for some of these tasks to gain more flexibility). I use Python almost daily for text manipulation, dataset manipulation, simple tree manipulation, and for mocking up future programs. However, I do not favour it for programs meant for release.
Generally, the lower-level languages offer an enormous speed advantage over the higher-level languages (see here). Also, despite the propaganda, they are easy to read and have an enormous following outside the biological sciences. Furthermore, they are generally truly cross-platform, and their tools are easy to find for all operating systems.
Sure, it may take a little longer (a few weeks) to start writing real programs with ease in C, C++, or Java, but speed matters. I know there is a feeling that processors will get faster so it won’t matter, but analyses will also get harder. They already are. For example, one dataset I have in mind (a real one, and a simple one) takes 36 hours to complete in the biogeographic program lagrange. An alpha C++ version of lagrange completes the same dataset in 1 minute 30 seconds. Speedups of this kind leave room to add far more complexity to the program and procedure. Sure, if we wait for a faster processor, the 36 hours could shrink to 12 hours, or 6 hours, but how can you argue with a speedup of over 1000x?
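To make that arithmetic concrete, here is a minimal sketch in Python using only the two timings quoted above. The hardware factors of 3x and 6x are simply what the 12-hour and 6-hour figures imply, and the function name `speedup` is just an illustrative label.

```python
# Timings quoted in the post, converted to seconds.
original_seconds = 36 * 60 * 60   # 36 hours for the existing lagrange run
cpp_seconds = 1 * 60 + 30         # 1 minute 30 seconds for the alpha C++ version


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


# Speedup from the rewrite alone.
rewrite_factor = speedup(original_seconds, cpp_seconds)
print(f"rewrite speedup: {rewrite_factor:.0f}x")

# Hypothetical faster processors: 3x and 6x give the 12 h and 6 h figures.
for hardware_factor in (3, 6):
    waited_seconds = original_seconds / hardware_factor
    remaining_gap = speedup(waited_seconds, cpp_seconds)
    print(
        f"{hardware_factor}x faster processor: "
        f"{waited_seconds / 3600:.0f} h, still {remaining_gap:.0f}x slower "
        f"than the C++ version on the old hardware"
    )
```

Even under the optimistic 6x hardware assumption, the gap that remains is in the hundreds, which is the point of the comparison.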
Another issue is dataset size. There is the simple fact that R, Python, and Perl are less efficient with memory. As with speed, this may not be a problem for current datasets. However, datasets are growing, and controlling memory can be the difference between completing an analysis and not completing one.
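As a rough illustration of the memory point, the sketch below compares a plain Python list of integers with the standard library's `array` type, which packs values into a contiguous buffer much as a C array would. It assumes CPython; the exact byte counts vary by interpreter version and platform, and the dataset size `n` is an arbitrary choice for the example.

```python
import sys
from array import array

n = 100_000  # arbitrary example size, not taken from any real dataset

# A plain list stores a pointer per element, and each element is a full
# Python int object with its own header.
values = [i + 1000 for i in range(n)]
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array("q") stores the same numbers as packed signed 64-bit integers.
packed = array("q", values)
packed_bytes = sys.getsizeof(packed)

print(f"list of ints : {list_bytes / 1e6:.2f} MB")
print(f"packed array : {packed_bytes / 1e6:.2f} MB")
print(f"ratio        : {list_bytes / packed_bytes:.1f}x")
```

The packed form is roughly what a C, C++, or Java program would use by default for the same data, so the ratio printed here gives a feel for the per-element overhead a high-level language can add.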
Because many of us are focusing on the computational aspects of evolutionary biology, we should expect to build the best and most efficient tools. Just a thought, but those will probably be written in C, C++, and Java.