science education in North Carolina

Just a short comment.

This weekend I was at the UNC planetarium to check out some of the videos they offer. I had just recently been to the Hayden Planetarium at the AMNH and was hungry for some more astronomy. I ended up checking out one of the 15-minute lectures (OK, I thought it was a video because it was called NASA digital theater). The lecture was on black holes. Pretty interesting. It was basically a PowerPoint lecture given by an Environmental Science major. Obviously very basic and for the general public.

Now, the most interesting part was a question asked after the talk by a well-dressed, middle-aged man there with his four children. He asked, “So all stars don’t turn into black holes, right? I mean, there are shooting stars.” Wow. OK, I guess it is OK that he doesn’t know what a shooting star is; pretty disappointing, but to be expected. Then the undergrad responded, “No, those are comets.” (Shooting stars are meteors, not comets.) Pretty surprising that this is the state of science education. I am not an astronomer, but it makes me worry about the biology being taught (or the biology that doesn’t stick).

Posted in science

Pairwise alignment alternatives to BLAST — back to basics

I have been programming my large dataset assembler, called PHLAWD (another in a series of self-deprecating program names). One step in the algorithm is to compare sequences (pairwise sequence comparisons). Many dataset assemblers employ this step, using maybe an N x N BLAST procedure or something similar. I make some simplifying assumptions and only do an n x N comparison, where n is a subset of “known” sequences of a particular gene region.

ANYWAY…

In order to incorporate BLAST in my program, I first tried using bl2seq, which compares two sequences using the BLAST algorithm. This required writing two files for each comparison, and there were at least 1,000,000 comparisons. Most of the time was spent writing the files.

So I tried the NCBI C++ toolkit. Let me just save you some time. Don’t even look into it. The amount of time to familiarize yourself with the datatypes and usage in the “toolkit” is just crazy. It is clear to me why this toolkit is so rarely used. A shame really.

Finally, I decided to go back to the good ol’ Smith and Waterman local alignment algorithm [link][Gotoh's improvement]. To refresh, this was developed in ’81 by Temple Smith and Michael Waterman and was derived from the global alignment algorithm of Needleman and Wunsch from ’70 [link]. The Smith-Waterman algorithm is one of the most sensitive pairwise algorithms, if not the most sensitive, and it is exact, not heuristic like BLAST. Unfortunately, it is slow and requires a bit of memory.

BUT…

There have been some amazing strides forward in speeding up this algorithm, and for pairwise comparisons, when implemented correctly, it can be as fast as BLAST. Some of these show off the capabilities of the new PlayStation 3 as well [link][link][link][GPU's].

And so I have used the speed-up presented here in PHLAWD, abandoning BLAST for the time being. Now I can run 4,000,000 comparisons in less than 5 minutes. Nice.
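
For anyone who wants to see what the basic algorithm looks like, here is a minimal Python sketch of plain Smith-Waterman. To be clear, this is not the PHLAWD code and not the sped-up version mentioned above. It uses a simple linear gap penalty rather than Gotoh's affine gaps, and the function name and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions I picked for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of strings a and b (linear gap penalty).

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative
    assumptions, not the ones used by any particular program.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The table is len(a) x len(b), which is where the "slow and requires a bit of memory" part comes from. The fast implementations keep the same recurrence but reorganize how the cells are computed.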

Posted in programming

megaphylogenies

So we just published a paper in BMC Evolutionary Biology describing how we make some big trees. It is titled “Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches”. This contains the tree mentioned in the New York Times article (see below) and describes the general method for making the trees in the Science paper published late last year. We are trying to coin a phrase (megaphylogeny), which has been thrown around for a few months now. Might not stick, but worth a try.

Posted in Uncategorized

mention in the New York Times

I got a mention in the New York Times today for a large tree that is in press at BMC Evolutionary Biology (to be published today).

Crunching the Data for the Tree of Life

This is work done when I first started at NESCent with my coauthors Michael Donoghue and Jeremy Beaulieu at Yale.

We thought they were going to use our figure, but they aren’t, so I will post it here:
This is an rbcL tree of seed plants with 13,533 species.
Posted in large phylogenies, publication, science

evolutionary bioinformatics books

Recently there has been some interest in books with bioinformatics algorithms that would be of interest to evolutionary biology and phylogenetics. So I figured I would list some:



Numerical Recipes 3rd Edition: The Art of Scientific Computing — classic in a new edition that explicitly includes phylogenetics algorithms


There are plenty of others (some that are titled Evolutionary Bioinformatics, etc.); these are just some of the ones I find helpful.
Posted in programming

not usually about politics, but come on

So this blog isn’t supposed to be about politics, but I am so disappointed in Obama’s choice for his invocation that I figured I would rant. Obama chose Rick Warren, the author of “The Purpose Driven Life”. This is the evangelical minister that lied about John McCain being in the “cone of silence” when he interviewed Obama (he also interviewed McCain).

Let us list off some of the reasons this is a major disappointment. First and foremost, the guy is a major opponent of evolution. What?!?! He has also compared abortion to the Holocaust. Lovely. He is also outspoken against same-sex marriage (though most politicians in D.C. are against same-sex marriage). Now, he has also encouraged work against AIDS, poverty, and genocide in Darfur, but Obama couldn’t find a minister that spoke more to his campaign and political philosophy? Maybe a lesser-known minister from an urban area (like Chicago?) who is helping his/her local community. Wouldn’t that be a little more his style? Maybe that means his political philosophy is changing?
But he chose so well for his energy secretary (Chu, the Nobel laureate).
NYTIMES article
Posted in politics

programming languages for science (biology)

A ramble follows…

As previous posts have alluded to, I am working on building large trees and the software to do that. I have found out a few things along the way, such as the fact that Linux is still better for some things (some would say, quite a lot of things) than Mac (especially easy installation of complicated libraries and easy manipulation of configuration files to tune the performance of software). More importantly, however, I have found limitations in many of the scripting languages, such as Python. I am trying to use all the memory available on my machine (8 GB) and therefore have 64-bit Ubuntu on one hard drive of my Mac Pro (I use rEFIt to dual-boot). It turns out many libraries are broken for Python in 64-bit (such as psycopg2 for PostgreSQL). That isn’t great. I managed a workaround by using MySQL for now (the Python MySQL library is fine).

But this brings up an interesting issue. Are scripting languages failing to keep up with the growing computational needs of biology? For instance, Python only just recently added the ability to run over multiple cores (and this is not yet standard).
Does this mean that we are back to C, C++, and Java? Seems like it.
Posted in programming, science

open source and mac

So I just finished setting up my new MacBook Pro. The last things I installed were some open source apps and libraries. I noticed a few unfortunate things in the process of installing these, namely the poor support (still!) for Mac among these products. This includes specifically NumPy, SciPy, and PostgreSQL. Sometimes the problem is a lack of clear instructions for Mac, and other times it is a failure to compile from source without many modifications (usually poorly documented). Furthermore, there are occasionally problems with conflicting information. Case in point: there is a website offering easy installation of many of these libraries for Python (macinscience). However, if this is installed, you must run programs depending on these libraries in ipython. Well, I don’t like ipython, so that is no fun. It may be easy, but I want flexibility (this is science, not email or word processing). Depending on your usage, these are either minor issues or complete barriers. For example, in order to install and use Biopython, NumPy must be installed.

One of the more surprising issues is PostgreSQL. You can compile from source, but the binary installation is not supported by PostgreSQL; instead, it is made by EnterpriseDB. Now, why hasn’t the community been strong enough to get PostgreSQL to support a Mac OS X binary? It is simply confusing to me.
Hopefully, the long-favored Linux solution will be ported to Mac. In Ubuntu, for example, the installation of the above software and thousands of others is as simple as typing (basically) install *name of software*. It installs the dependencies, and all that software is included in the updates. Wonderful. It also handles having multiple versions of important software (like java, gcc, fortran, python, etc.) correctly. I believe there is a Google Summer of Code project on this, but I can’t seem to find it.
Just some complaints after a few frustrating hours.
Posted in programming

On large trees

Some colleagues and I (especially Jeremy Beaulieu and Michael Donoghue) have been interested in making large phylogenies. We have been trying to develop (semi) automated methods and implementations that can assemble datasets quickly. The assembly of the phylogeny and the matrix itself appears to no longer be the major computational problem. Although we are sure to hit limits (as my colleagues and I have with matrices of tens of thousands), the more pressing concerns are the post-treebuilding analyses. These include comparative analyses, dating analyses, and visualization. For example, to complete these tasks for our recent Science paper (link), individual command line programs had to be developed for each stage (more than 20 scripts and programs, in fact).

Brian O’Meara has been thinking about this problem in relation to comparative analyses (link). As Brian discusses, the major problem for comparative analyses (aside from possible memory issues) is the extremely small numbers that computers are not able to handle naturally. He discusses his solution (writing some clever code to get around small numbers).

Other programmers have solved these problems with shortcuts or tricks. Here are a few libraries that solve the problem:
mpmath – Python library for arbitrary-precision (including very small number) math
GMP – C/C++ library (though it seems to have problems with Apple’s GCC)
for Java there are BigInteger and BigDecimal

Those are some libraries for solving these problems, but there are also clever algorithms. If I have some time I will put some up here, and hopefully more will be presented to provide a source for programming and programs in the future.
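
As one example of the kind of trick I mean, a common way around tiny numbers is to work with logarithms of probabilities instead of the probabilities themselves. Below is a minimal sketch using only the Python standard library. It is not Brian's solution and not what any particular program does; the function names log_product and log_sum_exp are hypothetical helpers I made up for illustration.

```python
import math


def log_product(log_values):
    """Log of a product of numbers, given their logs: just add the logs."""
    return sum(log_values)


def log_sum_exp(log_values):
    """Log of a sum of numbers, given their logs, without underflow.

    Factors out the largest term so that every exponent is <= 0 and at
    least one term is exactly exp(0) = 1.
    """
    m = max(log_values)
    if m == float("-inf"):  # every term is zero
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))


if __name__ == "__main__":
    p = 1e-200
    # Multiplying directly underflows to zero in double precision.
    print(p * p)
    # The same product in log space is an ordinary, representable number.
    print(log_product([math.log(p), math.log(p)]))
    # Adding two numbers that are each far too small to store directly.
    print(log_sum_exp([-1000.0, -1000.0]))
```

The price is that you carry logs around everywhere and only convert back at the very end (if ever). When that isn't enough, the arbitrary-precision libraries listed above are the heavier option.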

Posted in large phylogenies, programming

banjo playing

I play the banjo and this is my lame plug. I have some youtube videos and am advertising them.
This first song is Bonaparte Crossing the Rhine followed by Bonaparte’s Retreat.

This second one is Whiskey Before Breakfast.

OK, there will be science here next time.

Posted in Uncategorized
  • me

    The blog of Stephen A. Smith, an evolutionary biologist at the National Evolutionary Synthesis Center
