How to parse a string (by a "new" markup) with R?

2022-12-22 21:20 问答作者：

I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.

For example, let's say we have the following two variables:

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

Say that I want to parse "Seq" According to "Str", by using the legend here

Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |        Stem 1            Stem 2                 Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                Stem 0

Assume that we always have 4 stems (0 to 3), but that the length of letters before and after each of them can very.

The output should be something like the following list structure:

list(
    "Stem 0 opening" = "GCCTCGA",
    "before Stem 1" = "TA",
    "Stem 1" = list(opening = "GCTC",
                inside = "AGTTGGGA",
                closing = "GAGC"
            ),
    "between Stem 1 and 2" = "G",
    "Stem 2" = list(opening = "TACGA",
                inside = "CTGAAGA",
                closing = "TCGTA"
            ),
    "between Stem 2 and 3" = "AGGtC",
    "Stem 3" = list(opening = "ACCAG",
                inside = "TTCGATC",
                closing = "CTGGT"
            ),
    "After Stem 3" = "",
    "Stem 0 closing" = "TCGGGGC"
)

I don't have any experience with programming a parser, and would like advices as to what strategy to use when programming something like this (and any recommended R commands to use).

What I was thinking of is to first get rid of the "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into: 1. before stem 2. opening stem 3. inside stem 4. closing stem 5. after stem

Where the "after stem" will then be recursively entered into the same function ("seperate.stem")

The thing is that I am not sure how to try and do this coding without using a loop.

Any advices will be most welcomed.

Update: someone sent me a bunch of question, here they are.

Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence?

A: Yes

Q: Does the parsing always start with a partial stem 0 as your example shows? A: No. Sometimes it will start with a few "."

Q: Is there a way of making sure you have the right sequences when you start? A: I am not sure I understand what you mean.

Q: Is there a chance of error in the middle of the string that you have to restart from? A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...

Q: How long are these strings that you want to parse? A: Each string has between 60 to 150 characters (and I have tens of thousands of them...)

Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters? A: each sequence is self contained.

Q: Is there always at least one '.' between stems?

A: No.

Q: A full set of rules as to how the parsing should be done would be useful. A: I agree. But since I don't have ev开发者_JS百科en a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help.

Q: Do you have the BNF syntax for parsing? A: No. Your e-mail is the first time I came across it (http://en.wikipedia.org/wiki/Backus–Naur_Form).

You can simplify the task by using run length encoding.

First, convert Str to be a vector of individual characters, then call rle.

split_Str <- strsplit(Str, "")[[1]]
rle_Str <- rle(split_Str)

Run Length Encoding
  lengths: int [1:14] 7 2 4 8 4 1 5 7 5 5 ...
  values : chr [1:14] ">" "." ">" "." "<" "." ">" "." "<" "." ">" "." "<" "."

Now you just need to parse rle_Str$values, which is perhaps simpler. For instance, an inner stem will always look like ">" "." "<".

I think the main thing that you need to think about is the structure of the data. Does a "." always have to come between ">" and "<", or is it optional? Can you have a "." at the start? Do you need to be able to generalise to stems within stems within stems, or even more complex structures?

Once you have this solved, contructing your list output should be straightforward.

Also, don't worry about using loops, they are in the language because they are useful. Get the thing working first, then worry about speed optimisations (if you really have to) afterwards.

继续阅读：loops parsing r string

How to parse a string (by a "new" markup) with R?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？