Reading a large file in perl, record by record, with a dynamic record separator
I have a script that reads a large file line by line. The record separator ($/
) that I would like to use is (\n
). The only problem is that the data on each line contains CRLF characters (\r\n
), which the program should not be considered the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n"
, then it splits the third line into two lines. Ideally, I could just set $/
to a regex that matches \n
and not \r\n
, but I don开发者_开发技巧't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?
For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
$/ = "\n";
...
while(<$input>) {
while( substr($_,-2) eq "\r\n" ) {
$_ .= <$input>;
}
...
}
This is the same logic used to support line continuation in a number of different programming contexts.
You are right that you can't set $/
to a regular expression.
dos2unix would put a UNIX newline character in for the "\r\n" and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.
Try using dos2unix on the file first, and then read in as normal.
精彩评论