Need to remove line breaks from a text file with certain conditions
I have a text file of about 20,000 lines. A meaningful block of data for me consists of name, address, city, state, zip, and phone. My file has each of these on its own line, so it looks like this:
StoreName1
, Address
, City
,State
,Zip
, Phone
StoreName2
, Address
, City
,State
,Zip
, Phone
I need to create a CSV file, so I need the above information for each store on a single line:
StoreName1, Address, City,State,Zip, Phone
StoreName2, Address, City,State,Zip, Phone
So essentially, I am trying to remove \r\n only at the appropriate points. How do I do this with Python's re module? Examples would be very helpful; I am a newbie at this.
Thanks.
s/[\r\n]+,/,/g
Globally substitute 'linebreak(s),' with ','
Edit:
If you want to reduce it further with a single linebreak between records:
s/[\r\n]+(,|[\r\n])/$1/g
Globally substitute 'linebreak(s) followed by (comma or linebreak)' with capture group 1.
Edit:
And, if it really gets out of whack, this might cure it:
s/[\r\n]+\s*(,|[\r\n])\s*/$1/g
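The substitutions above are written in Perl/sed style. Since the question asks about Python's re module, here is a minimal sketch of the first two expressed with re.sub. The function names and the sample string are made up for illustration; only the patterns come from the answer above.

```python
import re

def join_comma_lines(text):
    """Remove any run of line breaks that is immediately followed by a comma."""
    return re.sub(r'[\r\n]+,', ',', text)

def join_and_collapse(text):
    """Same as above, and also reduce blank lines between records
    to a single line break (capture group 1 is kept)."""
    return re.sub(r'[\r\n]+(,|[\r\n])', r'\1', text)

# Hypothetical sample with a blank line between two records.
sample = 'StoreName1\r\n, Address\r\n,Zip\r\n\r\nStoreName2\r\n, Address\r\n,Zip'
print(join_and_collapse(sample))
```

Both functions work on a string, so the whole file has to be read into memory first; for a 20,000-line file that is usually not a problem.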
This iterator/generator version doesn't require reading the entire file into memory at once:
    from itertools import groupby
    with open("inputfile.txt") as f:
        # Consecutive lines are grouped by whether they are blank or not.
        groups = groupby(f, key=str.isspace)
        for row in ("".join(map(str.strip, x[1])) for x in groups if not x[0]):
            ...
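The groupby version relies on blank lines separating the records. If the file looks exactly like the sample in the question, with no blank lines, one alternative is to start a new record whenever a line does not begin with a comma. This is only a sketch under that assumption; merge_records is a hypothetical helper name, and a store name that itself starts with a comma would break it.

```python
def merge_records(lines):
    """Yield one joined record per store from an iterable of lines.

    Assumes every continuation line starts with a comma and every
    store name does not. Blank lines are ignored.
    """
    record = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines, if any
        if not line.startswith(',') and record:
            # A new store name begins: emit the record collected so far.
            yield ''.join(record)
            record = []
        record.append(line)
    if record:
        yield ''.join(record)  # emit the last record

# Hypothetical sample; a file object opened for reading works the same way.
sample = ['StoreName1\n', ', Address\n', ',Zip\n',
          'StoreName2\n', ', Address\n', ',Zip\n']
for row in merge_records(sample):
    print(row)
```

Because it consumes the input line by line, it also never holds more than one record in memory.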
Assuming the data is "normal" - see my comment - I'd approach the problem this way:
    with open('data.txt') as fhi, open('newdata.txt', 'w') as fho:
        # Iterate over the input file; each iteration starts at a store name.
        for store in fhi:
            # Read in the rest of the pertinent data (the next five lines).
            fields = [next(fhi).rstrip() for _ in range(5)]
            # Generate a list of all fields for this store.
            row = [store.rstrip()] + fields
            # Output to the new data file.
            fho.write('%s\n' % ''.join(row))
            # Consume the blank line after the record, if there is one
            # (the last record may not be followed by a blank line).
            next(fhi, None)
First mind-numbing solution:
    import re
    ch = ('StoreName1\r\n'
          ', Address\r\n'
          ', City\r\n'
          ',State\r\n'
          ',Zip\r\n'
          ', Phone\r\n'
          '\r\n'
          'StoreName2\r\n'
          ', Address\r\n'
          ', City\r\n'
          ',State\r\n'
          ',Zip\r\n'
          ', Phone')
    # A record starts at the beginning of the text or after a blank line.
    regx = re.compile(r'(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                      r'(.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,[^\r\n]+)')
    # In Python 3, open in text mode with newline='' so '\r\n' is written unchanged.
    with open('csvoutput.txt', 'w', newline='') as f:
        f.writelines(''.join(mat.groups()) + '\r\n' for mat in regx.finditer(ch))
ch mimics the content of a file on a Windows platform (newlines == \r\n)
Second mind-numbing solution:
    # Same pattern without capture groups; the line breaks are removed afterwards.
    regx = re.compile(r'(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                      r'.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,[^\r\n]+')
    with open('csvoutput.txt', 'w', newline='') as f:
        f.writelines(mat.group().replace('\r\n', '') + '\r\n' for mat in regx.finditer(ch))
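Third mind-numbing solution, if you want to create a CSV file with a delimiter other than a comma:
    import csv
    # The commas are left outside the groups so that csv.writer adds the delimiter itself.
    regx = re.compile(r'(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                      r'(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,([^\r\n]+)')
    # In Python 3 the csv module expects a text-mode file opened with newline=''.
    with open('csvtry3.txt', 'w', newline='') as f:
        csvw = csv.writer(f, delimiter='#')
        for mat in regx.finditer(ch):
            csvw.writerow(mat.groups())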
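The same idea can be applied to a file read line by line instead of the string ch. The following is a rough sketch, not part of the original answer: it assumes blank lines separate the records, and the helper names records and write_csv are invented for illustration. Unlike the regex above, it also strips the whitespace around each field.

```python
import csv
import io
from itertools import groupby

def records(lines):
    """Yield a list of cleaned fields for each blank-line-separated block."""
    for has_text, block in groupby(lines, key=lambda s: bool(s.strip())):
        if has_text:
            # Drop the line break, the leading comma, and surrounding spaces.
            yield [line.strip().lstrip(',').strip() for line in block]

def write_csv(lines, out, delimiter='#'):
    """Write one delimited row per record to the file-like object out."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator='\n')
    writer.writerows(records(lines))

# Hypothetical in-memory input and output; real files opened with
# open(..., newline='') can be passed in the same way.
sample = io.StringIO('StoreName1\n, Address\n,Zip\n\nStoreName2\n, Address\n,Zip\n')
out = io.StringIO()
write_csv(sample, out)
print(out.getvalue())
```

Letting csv.writer build each row also means that a field containing the delimiter gets quoted automatically.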
EDIT 1
You are right, tchrist, the following solution is far simpler:
    # Remove every '\r\n' that is not preceded by another '\r\n'.
    regx = re.compile(r'(?<!\r\n)\r\n')
    with open('csvtry.txt', 'w', newline='') as f:
        f.write(regx.sub('', ch))
EDIT 2
A regex isn't required:
    # Split on blank lines, then remove the line breaks inside each record.
    with open('csvtry.txt', 'w', newline='') as f:
        f.writelines(x.replace('\r\n', '') + '\r\n' for x in ch.split('\r\n\r\n'))
EDIT 3
Working on a file instead of ch:
An "à la gnibbler" solution, for cases where the file is too big to be read into memory all at once:
    from itertools import groupby
    with open('csvinput.txt', 'r') as f, open('csvoutput.txt', 'w') as g:
        # k is True for groups of non-blank lines, False for blank separators.
        groups = groupby(f, key=lambda v: not str.isspace(v))
        g.writelines(''.join(x).replace('\n', '') + '\n' for k, x in groups if k)
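I have another solution with regex (note that it reads the whole file with f.read()):
    import re
    # A record is a run of non-empty lines ending before a blank line or the end of the text.
    regx = re.compile(r'^((?:.+?\n)+?)(?=\n|\Z)', re.MULTILINE)
    with open('input.txt', 'r') as f, open('csvoutput.txt', 'w') as g:
        g.writelines(mat.group().replace('\n', '') + '\n' for mat in regx.finditer(f.read()))
I find it similar to the gnibbler-like solution.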
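    f = open(infilepath, 'r')
    s = f.read()
    f.close()
    # Mark record boundaries (blank lines) with a literal backslash-n placeholder.
    s = s.replace('\n\n', '\\n')
    # Remove the remaining line breaks inside each record.
    s = s.replace('\n', '')
    # Turn the placeholders back into real line breaks.
    s = s.replace("\\n", "\n")
    # Reopen the same file for writing ('w', not 'r') and overwrite it.
    f = open(infilepath, 'w')
    f.write(s)
    f.close()
That should do it. It will replace your input file with the new format.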