split command question

2023-03-27 06:53 Q&A author:

I have a problem using the split command. The input string is as follows:

080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36

I want to grab '2027' from this string. My command is: line.split(':',4)[1].split()[0] However, it doesn't work. The output is '1'.

Then I switched to line.split(':',4), and the output is still '1', so I see that the first split step is already the problem.

However, when I try line.split(':',1), I get the expected result:

1:8:1649:2027   83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36

I'm confused by this split command! (I asked a similar question before, and the split command worked that time.) Thanks.

It appears that what you want is

line.split(':',4)[4].split()[0]

The numeric parameter to split indicates the maximum number of splits that will occur. So you have:

>>> line='080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 ...'
>>> line.split(':',4)
['080821_HWI-EAS301_0002_30ALBAAXX', '1', '8', '1649', '2027 ...']

If you pull element [1] out of this return value, you get '1'. I don't see why you are surprised by this.

Since you are allowing up to 4 splits, and the item you want will be the last one, the subscript you want is [4]:

>>> line.split(':',4)[4]
'2027 ...'

Then you can split that on space and get element [0] from it to produce your result.

You get the same result if you don't pass a split limit value at all:

>>> line.split(':')[4].split()[0]
'2027'
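
To make the effect of the limit concrete, here is a small sketch comparing a few limits. The shortened sample string and the variable names are only for illustration:

```python
# Shortened sample line; the trailing fields are abbreviated for illustration.
line = '080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10'

# Limit 1: one split, so two pieces; everything after the first ':' stays together.
one_split = line.split(':', 1)

# Limit 4: four splits, so five pieces; index 4 is the last piece.
four_splits = line.split(':', 4)

# No limit: split at every ':'. This sample has exactly four, so the result is the same.
all_splits = line.split(':')

print(one_split[1])               # 1:8:1649:2027  83  chr10
print(four_splits[4].split()[0])  # 2027
print(four_splits == all_splits)  # True
```

On the full line, which has more ':' characters in its trailing fields, the two lists differ in length, but element [4] still starts with 2027 in both.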

Try this:

#!/usr/bin/python

line = '080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'

print(line.split(':')[4].split()[0])

I'm not sure why you're trying to access the token containing 2027 like this:

line.split(':',4)

rather than this:

line.split(':')[4]

I think that you might be confused about how split works. The last parameter to the Python split function is the maximum number of splits to perform.

The second argument to split is the maximum number of splits to perform, so you probably don't want to use it in this case. To access the 5th element after performing the split, do this:

line.split(":")[4]

Anyway, what you probably want is to first split by whitespace (you can do this by using no arguments), and then split by colons. This can be done on one line like this:

line.split()[0].split(":")[4]

You can use this instead:

s.split()[0].split(':')[4]

Split on the white space first. Then split the first element in the resultant list based on the separator (here: ':').

line.split()[0].split(':')[4]
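
The whitespace-first approach could be wrapped up roughly like this. It is only a sketch: read_name_field is a hypothetical helper name, and the sample line is abbreviated (the sequence and quality columns are left out):

```python
def read_name_field(line, index, sep=':'):
    """Return one sep-delimited field of the first whitespace-delimited token.

    read_name_field is a hypothetical helper name, not part of any library.
    """
    first_token = line.split()[0]  # no arguments: split on any run of whitespace
    return first_token.split(sep)[index]


# Abbreviated sample line; the sequence and quality columns are omitted.
line = ('080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   '
        '42038185    255 36M =   42037995    -225    NM:i:0  MD:Z:36')

print(read_name_field(line, 4))   # 2027
print(read_name_field(line, -1))  # 2027
```

Because the first token is isolated before splitting on ':', the colons in the later fields (NM:i:0, MD:Z:36) never come into play, so index -1 works as well as index 4.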

Do you have to use split?

I ask this because I've found regex to be a much better tool to use when I just need to grab a specific substring. It's not the easiest thing to learn and does appear very unapproachable at first, but you have to pay the price of learning it only once and it is an investment worth making. :)

The Python homepage has a good introduction to it.

P.S. 2027 will be matched by the following regex .*?:([0-9]+)\s+
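
A minimal sketch of how that pattern might be applied with the re module. The sample line is shortened, and the choice of re.match (anchored at the start of the string) is an assumption, since the answer only gives the pattern:

```python
import re

# Shortened sample line; the trailing fields are abbreviated for illustration.
line = '080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10'

# Raw string so the backslash in \s reaches the regex engine unchanged.
pattern = re.compile(r'.*?:([0-9]+)\s+')

# The lazy .*? grows until a ':' is followed by digits and then whitespace.
match = pattern.match(line)
if match is not None:
    print(match.group(1))  # 2027
```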

I presume that you will do numerous extractions of information from strings in the future. If so, my advice is to learn to use regular expressions; you will need them sooner or later.

Or you'll have to learn to use a specialized library for processing strings in the field of genomics.

A simple solution to your present problem with the re module:

line = '''080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10
42038185    255 36M =   42037995    -225
GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?
DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'''

import re

print(re.search(r':(\d+) ', line).group(1))

If there can be blanks before the fourth ':', the regex pattern should instead be:

line = '''080821_HWI-EAS301_0002_30AL BAAXX:1:8     :1649:2027  83  chr10
42038185    255 36M =   42037995    -225
GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?
DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'''

import re

print(re.search(r'(:[^:]+){3}:(\d+)', line).group(2))

(:[^:]+) matches a ':' followed by as many characters other than ':' as possible (at least one)

{3} says that this match must be repeated 3 times

then the fourth ':' must be encountered, followed by the wanted number, matched by \d+ ; there is no longer any need to require a blank after the number, because \d+ stops matching as soon as a non-digit character is encountered

Parentheses define groups. Here the desired number is captured by the second group.
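
As a variation on the pattern above (not the answerer's code), a non-capturing group (?:...) lets the number be the only captured group. This sketch also guards against the case where nothing matches; the sample line is shortened for illustration:

```python
import re

# Shortened sample line with blanks before the third and fourth ':'.
line = '080821_HWI-EAS301_0002_30AL BAAXX:1:8     :1649:2027  83  chr10'

# (?::[^:]+){3} skips three ':field' chunks without capturing them;
# the following ':' and (\d+) capture the digits after the fourth ':'.
pattern = re.compile(r'(?::[^:]+){3}:(\d+)')

match = pattern.search(line)
number = match.group(1) if match else None  # None when the line has no such field
print(number)  # 2027
```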
