Excel-like text import in Python: automatically parsing fixed width columns

2023-03-12 18:03 问答作者：

In Excel, if you import whitespace delineated text in which the columns do not line up perfectly and data may be missing, like

  pH             pKa/Em  n(slope) 1000*chi2      vdw0  
CYS-I0014_        >14.0                          0.00  
LYS+I0013_       11.827     0.781     0.440      0.18

you are given the option to treat it as fixed width columns, and Excel can figure out the column widths automatically, usually with pretty good results. Is there a library in Python which can break up poorly formated fixed-width text in a similarly automatic fashion?

EDIT: This is what the fixed width text import looks like in Excel. In step one, you just check the 'fixed-width' radio button, and then here in step two Excel has already added in column breaks automatically. The only time it fails to do so properly is when there is not at least one whitespace character overlapping in each colu开发者_运维知识库mn break in each line.

Excel-like text import in Python: automatically parsing fixed width columns

First off, Excel (2003, at home) isn't quite so smart. If your column 1000*chi2 contains spaces, e.g. 1000 * chi2, excel will guess wrong.

Trivial case: if your data was originally separated by tabs (not spaces), and multiple tabs were used to indicate empty columns, then, at least in TCL, it's easy to split each line by tab content, and I guess trivial too in Python.

But I'm guessing your problem is that they've used space characters only. The biggest clue I see to solving this one was to paste your text into notepad and choose a fixed size font. Everything lines up neatly, and you can use the number of characters in each line as a measure of "length".

So, IF you can rely on this feature of your input, then you can use a "sieve" approach to identifying where the column breaks are automatically. As you munch through the lines in a first pass, note the "positions" along the line that are occupied by non white-space, eliminating a position from your list if it's EVER occupied by non white space. As you go, you'll quickly arrive at a set of positions that are NEVER occupied by non white space. These, then, are your column dividers. In your example, your "sieve" would end up with positions 10-16, 23-24,32, 42-47 never occupied by non whitespace (assuming I can count). Thus the complement of that set is your set of column positions in which your data must lie. So, foreach line, each block of nonwhitespace will fit into exacly one of your columns from the set of positions (i.e the complement set) identified above. I've never coded in Python, so attached is a TCL script that will identify,using the sieve approach, where the column breaks are in the text, and emit a new text file with exactly those space characters replaced by a single tab - ie. 10-16 replaced by one tab, 23-24 by another, etc. The resulting file is tab-separated, i.e. the trivial case. I confess I've only tried it on YOUR little case data, copied into a text file called ex.txt; output goes to ex_.txt. I suspect it also might have problems if headings contain spaces.

Hope this helps!

set fh [open ex.txt]
set contents [read $fh];#ok for small-to-medium files.
close $fh

#first pass
set occupied {}
set lines [split $contents \n];#split contents at line breaks.
foreach line $lines {
  set chrs [split $line {}];#split each line into chars.
  set pos 0
  foreach chr $chrs {
    if {$chr ne " "} {
      lappend occupied $pos
    }
    incr pos
  }
}

#drop out with long list of occupied "positions": sort to create
#our sieve.
set datacols [lsort -unique -integer $occupied]
puts "occupied: $datacols"

#identify column boundaries.
set colset {}
set start [lindex $datacols 0];#first occupied pos might be > 0??
foreach index $datacols {
  if {$start < $index} {
    set end $index;incr end -1
    lappend colset [list $start $end]
    puts "col break starts at $start, ends at $end";#some instro!
    set start $index
  }
  incr start
}

#Now convert input file to trivial case output file, replacing
#sieved space chars with tab characters.
set tesloc [lreverse $colset];#reverse the column list!
set fh [open ex_.txt w]
foreach line $lines {
  foreach ele $tesloc {
    set line [string replace $line [lindex $ele 0] [lindex $ele 1] "\t" ]
  }
  puts "newline is $line"
  puts $fh $line
}
close $fh

继续阅读：excel fixed-width parsing python text

Excel-like text import in Python: automatically parsing fixed width columns

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？