Python - Best way to compare two strings, record stats comparing serial position of particular item?
I'm dealing with two files, both of which have lines that look like the following:
This is || an example || line .
In one of the files, the above line would appear, whereas the corresponding line in the other file would be identical BUT might have the '||' items in a different position:
This || is an || example || line .
I just need to collect stats for how often a "||" fell in the "right" place in the second file (we're assuming the first file is always "right"), how often a "||" fell in a place where the first file didn't have a "||", and how the number of overall "||" markers differed for that particular line.
I know I could do this alone, but wondered if you brilliant folks knew some incredibly easy way of doing this? The basic stuff (such as reading the files in) is all stuff I'm familiar with--I'm really just looking for advice on how to do the actual comparisons of lines and collect the stats!
Best, Georgina
Is this what you are looking for?
This code assumes that every line is formatted in the same way as in your examples:
fileOne = open('theCorrectFile', 'r')
fileTwo = open('theSecondFile', 'r')
for correctLine in fileOne:
    otherLine = fileTwo.readline()
    # Split each line once into the text segments between the "||" markers.
    correctParts = correctLine.split("||")
    otherParts = otherLine.split("||")
    # Reset the counters once per line, not once per segment.
    count = 0
    wrongPlacement = 0
    for i in range(len(correctParts)):
        if len(otherParts) >= i + 1 and correctParts[i] == otherParts[i]:
            count += 1
        else:
            wrongPlacement += 1
    print('there are %d out of %d "||" in the correct places and %d in the wrong places' % (count, len(correctParts), wrongPlacement))
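If comparing the text between markers turns out to be too coarse, one alternative is to record each marker's position as the number of words that precede it, so that an extra "||" early in the line does not shift the positions of later ones. The following is only a sketch of that idea, written for Python 3; the names marker_positions and compare_markers are made up for illustration, and it assumes the markers are separated from words by whitespace and never appear twice in a row.

```python
def marker_positions(line, marker="||"):
    """Return the set of word counts that precede each marker in ``line``."""
    positions = set()
    words_seen = 0
    for token in line.split():
        if token == marker:
            positions.add(words_seen)
        else:
            words_seen += 1
    return positions


def compare_markers(correct_line, other_line, marker="||"):
    """Return (right, wrong, count_diff) for one pair of lines."""
    expected = marker_positions(correct_line, marker)
    actual = marker_positions(other_line, marker)
    right = len(actual & expected)      # markers where the first file has one
    wrong = len(actual - expected)      # markers where the first file has none
    count_diff = len(actual) - len(expected)
    return right, wrong, count_diff


# The two example lines from the question.
print(compare_markers("This is || an example || line .",
                      "This || is an || example || line ."))  # (1, 2, 1)
```

For the question's example this reports one marker in the right place (after "example"), two in the wrong place, and one more marker overall than in the first file.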
I'm not sure how easy this is, since it does make use of some more advanced concepts like generators, but it's at least robust and well-documented. The actual code is at the bottom and is fairly concise.
The basic idea is that the function iter_delim_sets returns an iterator over (aka a sequence of) tuples containing the line number, the set of indices in the "expected" string where the delimiter was found, and a similar set for the "actual" string. There's one such tuple generated for each pair of (expected, actual) lines. Those tuples are succinctly formalized into a collections.namedtuple type called DelimLocations.
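To make the "set of indices" concrete, here is a small standalone sketch of how the character offsets of a delimiter can be collected with re.finditer. The helper name delimiter_offsets is just for illustration and is not part of the code further down.

```python
import re


def delimiter_offsets(delimiter, string):
    """Return the set of character offsets where ``delimiter`` starts."""
    # re.escape is needed because "|" is a special character in regexes.
    pattern = re.compile(re.escape(delimiter))
    return set(match.start() for match in pattern.finditer(string))


offsets = delimiter_offsets("||", "one||fish||two||fish")
print(sorted(offsets))  # [3, 9, 14]
```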
Then the function analyze just returns higher-level information based on such a data set, stored in a DelimAnalysis namedtuple. This is done using basic set algebra.
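As a quick illustration of the set algebra involved, using the offsets of the first pair of test lines ("one||fish||two||fish" as expected, "red||fish||blue||fish" as actual):

```python
expected = {3, 9, 14}  # offsets of "||" in "one||fish||two||fish"
actual = {3, 9, 15}    # offsets of "||" in "red||fish||blue||fish"

correct = len(expected & actual)          # offsets present in both sets
incorrect = len(actual - expected)        # offsets only present in actual
count_diff = len(actual) - len(expected)  # how many more markers actual has

print(correct, incorrect, count_diff)  # 2 1 0
```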
"""Compare two sequences of strings.

Test data:

    >>> from pprint import pprint
    >>> delimiter = '||'
    >>> expected = (
    ...     delimiter.join(("one", "fish", "two", "fish")),
    ...     delimiter.join(("red", "fish", "blue", "fish")),
    ...     delimiter.join(("I do not like them", "Sam I am")),
    ...     delimiter.join(("I do not like green eggs and ham.",)))
    >>> actual = (
    ...     delimiter.join(("red", "fish", "blue", "fish")),
    ...     delimiter.join(("one", "fish", "two", "fish")),
    ...     delimiter.join(("I do not like spam", "Sam I am")),
    ...     delimiter.join(("I do not like", "green eggs and ham.")))

The results:

    >>> pprint([analyze(v) for v in iter_delim_sets(delimiter, expected, actual)])
    [DelimAnalysis(index=0, correct=2, incorrect=1, count_diff=0),
     DelimAnalysis(index=1, correct=2, incorrect=1, count_diff=0),
     DelimAnalysis(index=2, correct=1, incorrect=0, count_diff=0),
     DelimAnalysis(index=3, correct=0, incorrect=1, count_diff=1)]

What they mean:

    >>> pprint(delim_analysis_doc)
    (('index',
      ('The number of the lines from expected and actual',
       'used to perform this analysis.')),
     ('correct',
      ('The number of delimiter placements in ``actual``',
       'which were correctly placed.')),
     ('incorrect', ('The number of incorrect delimiters in ``actual``.',)),
     ('count_diff',
      ('The difference between the number of delimiters',
       'in ``expected`` and ``actual`` for this line.')))

And a trace of the processing stages:

    >>> def dump_it(it):
    ...     '''Wraps an iterator in code that dumps its values to stdout.'''
    ...     for v in it:
    ...         print v
    ...         yield v

    >>> for v in iter_delim_sets(delimiter,
    ...                          dump_it(expected), dump_it(actual)):
    ...     print v
    ...     print analyze(v)
    ...     print '======'
    one||fish||two||fish
    red||fish||blue||fish
    DelimLocations(index=0, expected=set([9, 3, 14]), actual=set([9, 3, 15]))
    DelimAnalysis(index=0, correct=2, incorrect=1, count_diff=0)
    ======
    red||fish||blue||fish
    one||fish||two||fish
    DelimLocations(index=1, expected=set([9, 3, 15]), actual=set([9, 3, 14]))
    DelimAnalysis(index=1, correct=2, incorrect=1, count_diff=0)
    ======
    I do not like them||Sam I am
    I do not like spam||Sam I am
    DelimLocations(index=2, expected=set([18]), actual=set([18]))
    DelimAnalysis(index=2, correct=1, incorrect=0, count_diff=0)
    ======
    I do not like green eggs and ham.
    I do not like||green eggs and ham.
    DelimLocations(index=3, expected=set([]), actual=set([13]))
    DelimAnalysis(index=3, correct=0, incorrect=1, count_diff=1)
    ======

"""
from collections import namedtuple

# Data types

## Here ``expected`` and ``actual`` are sets
DelimLocations = namedtuple('DelimLocations', 'index expected actual')

DelimAnalysis = namedtuple('DelimAnalysis',
                           'index correct incorrect count_diff')

## Explanation of the elements of DelimAnalysis.
## There's no real convenient way to add a docstring to a variable.
delim_analysis_doc = (
    ('index', ("The number of the lines from expected and actual",
               "used to perform this analysis.")),
    ('correct', ("The number of delimiter placements in ``actual``",
                 "which were correctly placed.")),
    ('incorrect', ("The number of incorrect delimiters in ``actual``.",)),
    ('count_diff', ("The difference between the number of delimiters",
                    "in ``expected`` and ``actual`` for this line.")))

# Actual functionality
def iter_delim_sets(delimiter, expected, actual):
    """Yields a DelimLocations tuple for each pair of strings.

    ``expected`` and ``actual`` are sequences of strings.
    """
    from re import escape, compile as compile_
    from itertools import count, izip

    index = count()
    re = compile_(escape(delimiter))

    def delimiter_locations(string):
        """Set of the locations of matches of ``re`` in ``string``."""
        return set(match.start() for match in re.finditer(string))

    string_pairs = izip(expected, actual)
    return (DelimLocations(index=index.next(),
                           expected=delimiter_locations(e),
                           actual=delimiter_locations(a))
            for e, a in string_pairs)

def analyze(locations):
    """Returns an analysis of a DelimLocations tuple.

    ``locations.expected`` and ``locations.actual`` are sets.
    """
    return DelimAnalysis(
        index=locations.index,
        correct=len(locations.expected & locations.actual),
        incorrect=len(locations.actual - locations.expected),
        count_diff=(len(locations.actual) - len(locations.expected)))
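
Since the question is about two files rather than two tuples of strings, the per-line numbers probably need to be summed up at some point. The following is a rough standalone sketch of that step for Python 3, not part of the code above. The function name file_totals and its parameters are invented for illustration, the two files are assumed to have the same number of lines, and count_diff is summed with its sign, so surpluses and shortfalls on different lines cancel out.

```python
import re


def file_totals(expected_path, actual_path, delimiter="||"):
    """Sum per-line delimiter statistics over two files read in parallel."""
    pattern = re.compile(re.escape(delimiter))
    totals = {"correct": 0, "incorrect": 0, "count_diff": 0}
    with open(expected_path) as expected_file, open(actual_path) as actual_file:
        # zip stops at the end of the shorter file.
        for expected_line, actual_line in zip(expected_file, actual_file):
            expected = set(m.start() for m in pattern.finditer(expected_line))
            actual = set(m.start() for m in pattern.finditer(actual_line))
            totals["correct"] += len(expected & actual)
            totals["incorrect"] += len(actual - expected)
            totals["count_diff"] += len(actual) - len(expected)
    return totals
```

With the file names used in the other answer, this would be called as file_totals('theCorrectFile', 'theSecondFile').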