开发者

Data structures for bioinformatics [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. 开发者_运维百科

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 5 years ago.

Improve this question

What are some data structures that should be known by somebody involved in bioinformatics? I guess that anyone is supposed to know about lists, hashes, balanced trees, etc., but I expect that there are domain specific data structures. Is there any book devoted to this subject?


The most fundamental data structure used in bioinformatics is string. There are also a whole range of different data structures representing strings. And algorithms like string matching are based on the efficient representation/data structures.

A comprehensive work on this is Dan Gusfield's Algorithms on Strings, Trees and Sequences


A lot of introductory books on bioinformatics will cover some of the basic structures you'd use. I'm not sure what the standard textbook is, but I'm sure you can find that. It might be useful to look at some of the language-specific books:

  • Bioinformatics Programming With Python
  • Beginning Perl for Bioinformatics

I chose those two as examples because they're published by O'Reilly, which, in my experience, publishes good quality books.

I just so happen to have the Python book on my hard drive, and a great deal of it talks about processing strings for bioinformatics using Python. It doesn't seem like bioinformatics uses any fancy special data structures, just existing ones.


Spatial hashing datastructures (kd-tree) for example are used often for nearest neighbor queries of arbitrary feature vectors as well as 3d protein structure analysis.

Best book for your $$ is Understanding Bioinformatics by Zvelebil because it covers everything from sequence analysis to structure comparison.


In addition to basic familiarity with the structures you mentioned, suffix trees (and suffix arrays), de Bruijn graphs, and interval graphs are used extensively. The Handbook of Computational Molecular Biology is very well written. I've never read the whole thing, but I've used it as a reference.


I also highly recommend this book, http://www.comp.nus.edu.sg/~ksung/algo_in_bioinfo/

And more recently, python is much more frequently used in bioinformatics than perl. So I really suggest you start with python, it is widely used in my projects.


Many projects in bioinformatics involve combining information from different, semi-structured sources. RDF and ontologies are essential for much of this. See, for example, the bio2RDF project. http://bio2rdf.org/. A good understanding of identifiers is valuable.

Much bioinformatics is exploratory and rapid lightweight tools are often used. See workflow tools such as Taverna where the primary resource is often a set of web services - so HTTP/REST are common.


Whatever your mathematical or computational expertise is, you are likely to find an application in computational biology. If not, make this another question of stackoverflow and you'll be helped :o)

As mentioned in the other answers, somewhat timeless are string comparisons and pattern discovery in 1-dimensional data since sequences are so easy to get. With a renewed interest in medical informatics though you also have two/three-dimensional image analysis that you run e.g. against genomic data. With molecular biochemistry you also have pattern searches on 3D surfaces and molecular simulations. To study drug effects you will work with gene networks and compare those across tissues. Typical challenges for big data and information integration apply. And then, you need statistical descriptions of the likelihood of a pattern or the clinical association of any features identified to be found by chance.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜