开发者

Coding Blosum62 in the source code

I am trying to implement protein pairwise sequence alignment using "Global Alignment" algorithm by 'Needleman -Wunsch'.

I am not clear about how to include 'Blosum62 Matrix' in my source code to do the scoring or to fill the two-dimensional matrix?

I have googled and found that most people suggested to use flat file which contains the standard 'Blosum62 Matrix'. Does it mean that I need to read from this flat file and fill my coded "Blosum62 Martrix' ?

Also, the other approach could be is to use some mathematical formula and include it in your programming logic to construct 'Blosum62 Matri开发者_运维百科x'. But not very sure about this option.

Any ideas or insights are appreciated.

Thanks.


It would help to know what language you're working in so we can help you with the correct terms, but what I did was use a map of maps (or dictionaries if you're using Python).

Here's an example of my code in Groovy, but it's fairly portable to other languages:

def blosum62 = [
Cys:[Cys:9, Ser:-1, Thr:-1, Pro:-3, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-4, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:-2, Tyr:-2, Trp:-2],
Ser:[Cys:-1,Ser:4,  Thr:1,  Pro:-1, Ala:1,  Gly:0,  Asn:1,  Asp:0,  Glu:0,  Gln:0,  His:-1, Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Thr:[Cys:-1,Ser:1,  Thr:4,  Pro:1,  Ala:-1, Gly:1,  Asn:0,  Asp:1,  Glu:0,  Gln:0,  His:0,  Arg:-1, Lys:0,  Met:-1, Ile:-2, Leu:-2, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Pro:[Cys:-3,Ser:-1, Thr:1,  Pro:7,  Ala:-1, Gly:-2, Asn:-1, Asp:-1, Glu:-1, Gln:-1, His:-2, Arg:-2, Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-4, Tyr:-3, Trp:-4],
Ala:[Cys:0, Ser:1,  Thr:-1, Pro:-1, Ala:4,  Gly:0,  Asn:-1, Asp:-2, Glu:-1, Gln:-1, His:-2, Arg:-1, Lys:-1, Met:-1, Ile:-1, Leu:-1, Val:-2, Phe:-2, Tyr:-2, Trp:-3],
Gly:[Cys:-3,Ser:0,  Thr:1,  Pro:-2, Ala:0,  Gly:6,  Asn:-2, Asp:-1, Glu:-2, Gln:-2, His:-2, Arg:-2, Lys:-2, Met:-3, Ile:-4, Leu:-4, Val:0,  Phe:-3, Tyr:-3, Trp:-2],
Asn:[Cys:-3,Ser:1,  Thr:0,  Pro:-2, Ala:-2, Gly:0,  Asn:6,  Asp:1,  Glu:0,  Gln:0,  His:-1, Arg:0,  Lys:0,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-4],
Asp:[Cys:-3,Ser:0,  Thr:1,  Pro:-1, Ala:-2, Gly:-1, Asn:1,  Asp:6,  Glu:2,  Gln:0,  His:-1, Arg:-2, Lys:-1, Met:-3, Ile:-3, Leu:-4, Val:-3, Phe:-3, Tyr:-3, Trp:-4],
Glu:[Cys:-4,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:2,  Glu:5,  Gln:2,  His:0,  Arg:0,  Lys:1,  Met:-2, Ile:-3, Leu:-3, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Gln:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:0,  Glu:2,  Gln:5,  His:0,  Arg:1,  Lys:1,  Met:0,  Ile:-3, Leu:-2, Val:-2, Phe:-3, Tyr:-1, Trp:-2],
His:[Cys:-3,Ser:-1, Thr:0,  Pro:-2, Ala:-2, Gly:-2, Asn:1,  Asp:1,  Glu:0,  Gln:0,  His:8,  Arg:0,  Lys:-1, Met:-2, Ile:-3, Leu:-3, Val:-2, Phe:-1, Tyr:2,  Trp:-2],
Arg:[Cys:-3,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-2, Asn:0,  Asp:-2, Glu:0,  Gln:1,  His:0,  Arg:5,  Lys:2,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Lys:[Cys:-3,Ser:0,  Thr:0,  Pro:-1, Ala:-1, Gly:-2, Asn:0,  Asp:-1, Glu:1,  Gln:1,  His:-1, Arg:2,  Lys:5,  Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:-3, Tyr:-2, Trp:-3],
Met:[Cys:-1,Ser:-1, Thr:-1, Pro:-2, Ala:-1, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:0,  His:-2, Arg:-1, Lys:-1, Met:5,  Ile:1,  Leu:2,  Val:-2, Phe:0,  Tyr:-1, Trp:-1],
Ile:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-3, Arg:-3, Lys:-3, Met:1,  Ile:4,  Leu:2,  Val:1,  Phe:0,  Tyr:-1, Trp:-3],
Leu:[Cys:-1,Ser:-2, Thr:-2, Pro:-3, Ala:-1, Gly:-4, Asn:-3, Asp:-4, Glu:-3, Gln:-2, His:-3, Arg:-2, Lys:-2, Met:2,  Ile:2,  Leu:4,  Val:3,  Phe:0,  Tyr:-1, Trp:-2],
Val:[Cys:-1,Ser:-2, Thr:-2, Pro:-2, Ala:0,  Gly:-3, Asn:-3, Asp:-3, Glu:-2, Gln:-2, His:-3, Arg:-3, Lys:-2, Met:1,  Ile:3,  Leu:1,  Val:4,  Phe:-1, Tyr:-1, Trp:-3],
Phe:[Cys:-2,Ser:-2, Thr:-2, Pro:-4, Ala:-2, Gly:-3, Asn:-3, Asp:-3, Glu:-3, Gln:-3, His:-1, Arg:-3, Lys:-3, Met:0,  Ile:0,  Leu:0,  Val:-1, Phe:6,  Tyr:3,  Trp:1],
Tyr:[Cys:-2,Ser:-2, Thr:-2, Pro:-3, Ala:-2, Gly:-3, Asn:-2, Asp:-3, Glu:-2, Gln:-1, His:2,  Arg:-2, Lys:-2, Met:-1, Ile:-1, Leu:-1, Val:-1, Phe:3,  Tyr:7,  Trp:2],
Trp:[Cys:-2,Ser:-3, Thr:-3, Pro:-4, Ala:-3, Gly:-2, Asn:-4, Asp:-4, Glu:-3, Gln:-2, His:-2, Arg:-3, Lys:-3, Met:-1, Ile:-3, Leu:-2, Val:-3, Phe:1,  Tyr:2,  Trp:11]
]

Using this you can just call

def score = blosum62[Arg][Cys]
println("Substituting Arg by Cys gives " + score)


You can always download the matrix from NCBI web site:

ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62

Other matrices are also available from the parent directory.

I never saw implementation of Needleman-Wunsch with matrix calculation. It's much easier just to include the matrix in the code or as a separate file.

You can find some details how BLOSUM matrices are calculated for example here: http://en.wikipedia.org/wiki/BLOSUM.


You can't infer a blosum matrix from another as you can do for PAM ones: all the blosum are calculated from a different set of data and are not correlated within theirselves.

For example, a PAM250 Matrix is just a PAM1 matrix multiplied 250 times by itself; but this is not true for BLOSUMs, and you can't infer BLOSUM80 from BLOSUM64, for example.


Yes, you can implement a blosum matrix as hardwired piece of code, you might gain some speed with this. But definitely you loose flexibility. I would recommend writing a reader for NCBI format, e.g returning SubstitutionMatrix data type. Then you can pass around such a matrix as an object.

SubstitutionMatrix object may hold a 2D matrix and "something" responsible for decoding amino acid names, e.g. a hashing array. Depending on the language you choose, you may also use enums to represent amino acid types. In such a case you can use them directly to address the 2D array.

Hopefully this is clear, I can write more details if you like/need.


Here is example of parsing the blosum62 file from this link ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 in Java.

Create class Parsing:

private String[] rowData;
private final int MARTRIX_SIZE = 24;
private String[][] matrix = new String[MARTRIX_SIZE][MARTRIX_SIZE];
private HashMap<String, Integer> index = new HashMap<>();
private String filePath = "blosum62.txt";

public void blosumMartix() {
    File fileObj = new File(filePath);
    Scanner input;
    boolean readLine = true;
    int rowCounter = 0;

    try {
        input = new Scanner(fileObj);
        while (input.hasNext()) {
            String lineReader = input.nextLine().replaceAll("  ", " ");
            if (!lineReader.substring(0, 1).equals("#")) {
                if (readLine) {
                    readLine = false;
                    continue;
                }
                rowData = lineReader.split(" ");
                for (int i = 1; i < rowData.length; i++) {
                    matrix[rowCounter][i - 1] = rowData[i];
                }
                index.put(rowData[0], rowCounter);
                rowCounter++;
            }
        }

    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    }
}

Now you want to calculate the the cost of example A and * which returns -4 you should write method for this:

public int getDistance(String strS1, String strS2) {
    try {
        return getDistance(matrixIndex.get(strS1), matrixIndex.get(strS2));
    } catch (Exception ex) {
        System.out.println("Key out of range, check your string input");
        System.exit(0);
    }
    return 0;
}

private int getDistance(int charS1, int charS2) {
    if (charS1 < 0 || charS1 > matrix[0].length) {
        System.out.println("Gap out of range");
        System.exit(1);
    }
    if (charS2 < 0 || charS2 > matrix[0].length) {
        System.out.println("Gap out of range");
        System.exit(2);
    }

    return Integer.parseInt(matrix[charS1][charS2]);
}

Finally in main method

Parsing parsing = new Parsing();
parsing.blosumMartix();
String result = parsing.getDistance("A", "*");
System.out.print(result);

This will print -4.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜