开发者

Clojure or Scala for bioinformatics/biostatistics/medical research [closed]

Closed. This question is opinion-based. It is not currently accepting answers.

Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.

Closed 7 years ago.

Improve this question

I am not a professional programmer (my area is medical research), but I am quite capable in C/C++, and various scripting languages. A while back I got intrigued by Lisp, but I never got the time to seriously learn it. After a brief exposure to R I decided to invest more time in a functio开发者_运维百科nal programming language.

I would like the practicality of a JVM language and thus narrowed to Clojure and Scala. From what I understand, both can use already existing Java libraries and given at performance-critical code can be delegated to Java, have the potential to perform relatively equally well.

How do these languages compare in the application space I need them for? Are There any real-life projects in bioinformatics using either?

Already existing code would be a serious plus, as would be good documentation and a fairly gentle learning curve. Also, how does the concurrency model of the two compare with each other?

Any significant advantages/disadvantages any one has?


I can personally vouch for Clojure as a great tool for this kind of work. (I believe Scala would be great too, I just have less experience with it).

My personal research is in the field of predictive modelling / machine learning and is very computationally intensive - so I think it has many parallels with bioinformatics or biostatistics.

My personal approach / setup includes:

  • Incanter used primarily as a data visualisation tool. Great for producing quick visualisations which are usually just 1-liners at the REPL. There are also lots of statistical and numerical processing tools which I believe use the Colt library under the hood. I'm not an expert in R but I understand that Incanter is roughly "R translated to Clojure/Lisp".

  • Exploiting quite a few Java libraries as needed. Some of these are my own, for example algorithms that I have written in Java in order to get the best possible fine-tuned performance out of the JVM. But you could equally easily use any of the other great Java libraries available, as calling Java from Clojure is very simple (.methodName object param1 param2)

  • Quite a lot of higher order functions to automate my workflow. For example I have a higher order function that will run an optimisation algorithm of any kind in a loop for a specified amount of time and then produce an Incanter graph of the improvement on each iteration. Not rocket science, but really easy to code up in a few lines of Clojure.

  • Never really having to worry about performance. You can make Clojure go pretty fast if you want to (e.g. with type hints, primitive arithmetic support etc.) but normally it's irrelevant as you're going to spend 99%+ of your cycles in well-optimised library code anyway. Hence a bit of overhead in the "glue" code is negligible - I feel I gain much more in terms of personal productivity by having a dynamic, high-level, functional language to work in.

  • Major use of Clojure's concurrency features - this has to be one of Clojure's strongest features. I tend to use the STM to code concurrent processes with transactions that can't interfere with each other, then kick off long-running calculations in a future so that I can get on with other tasks and wait for notification of the result.

  • A slowly growing collection of macros to "extend the language" when needed. I actually use macros less than I thought I would (higher order functions are often a better choice). But when you need them they are invaluable - this is where you really appreciate the value of a homoiconic language. Since they effectively allow you to add new syntax to the language itself, they are very powerful when used correctly to build the DSL that you need.

In short - I don't think you can go wrong with Clojure as a researcher.

The one thing I probably wouldn't use it for (yet) is actually writing a new numerical library - this would probably be better done in Scala or pure Java as you would probably want to adopt a more imperative / OOP style.


I am not sure about bioinformatics and biostatistics per se, but I do scientific data analysis frequently and I appreciate that Scala allows me to write as-fast-as-Java code with relative ease. I believe that it is often possible in Clojure now, but I haven't seen the benchmarks to back that up. For the time being, I think the prudent thing to assume is that they do not perform equally well. See, for example, the Computer Languages Benchmark Game, where Scala is faster than Clojure in every single test. (Ignore the horrible "pidigits" result for Clojure--Scala (and Java) are calling the GMP library written in C, which Clojure could do but because of a technical detail requiring a different wrapping for the library, isn't presently allowed in the game). Looking at multicore comparisons doesn't improve Clojure's showing, and note that the Clojure code is no shorter for these sorts of lowish-level algorithmic tasks.

Clojure is ahead for the time being with parallel collections, though the upcoming 2.9 release of Scala should make up much of the difference. Neither has a gentle learning curve when coming from C++; Scala is maybe a little easier given that the syntax outwardly looks a little more familiar. I believe there are good materials for learning each of them.


Edit: P.S. You can call R from Java (and therefore from either Clojure or Scala) using rJava (specifically the JRI interface). Edit to edit: and, these days, rScala.

Edit #2: Scala was faster than Clojure in everything at the time of writing; as of this edit, Clojure's a little ahead in one (at the cost of a huge amount of code)--but anyway, the overall point stands. (And the Scala implementation on that one test could be sped up.)


If you like R, give Incanter a try! It's R for Clojure.

Scala's is geared toward being syntactically easy for people coming from Java, which was intended to be syntactically easy for people coming from C though with two levels of indirection like this the advantage may be lost.

Clojure is getting a lot of traction in the Big Data space and maps very well onto Hadoop jobs for Huge Data. I think this would be a big advantage in the bioinformatics world.

Really, these things are largely personal taste so try both and see that makes you happy :)

If you are looking to get a feel for Clojure without a lot of "intellectual overhead" may I suggest using leiningen to get a test project started quickly?


To build on Rex's answer I would like to add some Scala libraries/products that may be of interest to you:

  • ADAM
  • Spark (sparkseq, 2)
  • Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
  • SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
  • ScalaLab: MATLAB-like scientific computing in Scala
  • ScalaNLP: Collection of libraries for natural language processing (NLP), machine learning, and statistics.
  • Factorie: Toolkit for deployable probabilistic modeling
  • Gridgain: Compute cluster for Scala and Java
  • BioScala: Bioinformatics for the Scala programming language


I don't know Scala, so I can't offer a comparison, but I am actively using Clojure in bioinformatics projects.

The Java integration is excellent, and I have had no problem making use of the BioJava libraries.

Where Clojure's concurrency model shines is in the immutable default data types and functional programming with the seq abstraction.

In my bioinformatic work I very often find myself with a lot input data (say gene sequences) which need to be subjected to the same analysis. Once I have my analysis function I can map it over a sequence of inputs (with the results lazily generated). I have gotten full utilization of a large 48-core server simply by changing that map to a pmap.

Large scale parallelization with a single character change is hard to beat!

Of course pmap isn't a magic bullet and only helps when the analysis function computationally dominates, but the fact that map and pmap can just be plugged in and out shows the elegance and simplicity enabled by Clojure's design.


I am only passingly familiar with Scala, so the best I can do is evangelize a bit for Clojure. It's a great language, but take all this advice with a grain of salt as it's coming from an enthusiast.

If you are looking for concurrency, Clojure is fantastic both for ease of programming and for performance. The immutable data structures mean that it's trivial to work with a coherent snapshot of the world without any manual and error-prone locking; the STM makes it fairly simple to change data in a thread-sensitive way without breaking anyone else's snapshots.

My understanding is that Scala has a lot of the nice functional tools that Clojure does, but Clojure will always win syntactically by virtue of being a Lisp. If you're looking to do some specialized bioinformatics stuff, Clojure is able to hide the bits of Lisp that you don't want, and raise your own constructs to the same level as the built-in language constructs. I can't find the reference right now, but there's some well-known quote about Lisp that goes like:

Lisp is not the perfect language for any program. But it is the perfect language for building the perfect language for every program.

That's horribly paraphrased, but in my experience it has been true. It looks like you'll want a fairly specialized set of tools, and no language will make those feel as natural as a Lisp.


You have to ask yourself how important functional programming is for you. You know C++ so you probably know OO. I would say it's easier to do FP in Clojure (because you can't really drop back to OO-style) in Scala you will probebly end up dropping FP and do more OO style.

I can't really say anything about your application space.

Since you mentioned R, there is an R-like Clojure library for statistics called Incanter. I don't know about other existing projects in your application space.

There is a lot of information about both languages, so that should not be a problem. The learning curve is kind of steep with both languages. Clojure is a much smaller language and since you already know some lisp it should not be to hard to learn the important stuff. Scala has a type system that will be hard to pick up especially since your main experience is with C/C++.

Both languages have great concurrency models and you will probably be happy with both.


I have some experience in Scala and only little knowledge in Clojure, but I programmed Lisp many years ago.

Lisp is a beautiful language, but it never made it to the world, because it was too limited. I believe you need a statically-typed language to develop robust systems. The type system in Scala is not difficult to master to benefit from it. If you want to do very advanced things with it to make your libraries idiot-proof, you can, but then you will need to study the type system a little more.

Scala favours immutable types, but you can use mutables without any problem, which you sometimes do need. Concurrency in Scala is very well implemented and frameworks like akka extend and enhance these possibilities.

Scala stands a better chance to become a mainstream language since it's a fuller language. I'm afraid that Clojure is too much like Lisp (but reimplemented on the JVM). I liked Lisp a lot, but it had too many disadvantages for real-life programs. With Scala I think we have the best of both worlds (OO and functional) in a clean marriage. On top of that, Scala seems to really catch on in the market.


We have been working on some experimental code in the Rudolf/BioClojure project on GitHub. Also, look at Jan Aert's BioClojure project which is more structured.

Additionally, there is a BioCaml project in the works...

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜