Pythonic way to extract values from this text file
I have an output file from a legacy piece of software, shown below. I want to extract values from it so that, for example, I can set a variable called direct_solar_irradiance to 648.957 and target ground pressure to 1013.00.
So far, I have been extracting individual lines and processing them like below (repeated many times for the different values I want to extract):
values = lines[97].split()
self.irradiance_direct, self.irradiance_diffuse, self.irradiance_env = values
However, I have now found that extra lines are added to the middle of the output when certain parameters are selected. This means, of course, that line 97 will no longer hold the values I need.
Is there a good Pythonic way to extract these values, given that there may be extra lines added into the output under certain circumstances? I guess I need to search for known pieces of text in the file, and then extract the numbers referred to by them, but the only ways I can think of doing that are very clunky.
So:
Is there a nice Pythonic way to search for these strings and extract the values that I want?
If not, is there some other way to sensibly do this? (for example, some kind of cool text-file parsing library that I know nothing about).
******************************* 6sV version 1.0B ******************************
* *
* geometrical conditions identity *
* ------------------------------- *
* user defined conditions *
* *
* month: 14 day : 1 *
* solar zenith angle: 10.00 deg solar azimuthal angle: 20.00 deg *
* view zenith angle: 30.00 deg view azimuthal angle: 40.00 deg *
* scattering angle: 159.14 deg azimuthal angle difference: 20.00 deg *
* *
* atmospheric model description *
* ----------------------------- *
* atmospheric model identity : *
* midlatitude summer (uh2o=2.93g/cm2,uo3=.319cm-atm) *
* aerosols type identity : *
* Maritime aerosol model *
* optical condition identity : *
* visibility : 8.49 km opt. thick. 550 nm : 0.5000 *
* *
* spectral condition *
* ------------------ *
* monochromatic calculation at wl 0.400 micron *
* *
* Surface polarization parameters *
* ---------------------------------- *
* *
* *
* Surface Polarization Q,U,Rop,Chi 0.00000 0.00000 0.00000 0.00 *
* *
* *
* target type *
* ----------- *
* homogeneous ground *
* monochromatic reflectance 1.000 *
* *
* target elevation description *
* ---------------------------- *
* ground pressure [mb] 1013.00 *
* ground altitude [km] 0.000 *
* *
* plane simulation description *
* ---------------------------- *
* plane pressure [mb] 1013.00 *
* plane altitude absolute [km] 0.000 *
* atmosphere under plane description: *
* ozone content 0.000 *
* h2o content 0.000 *
* aerosol opt. thick. 550nm 0.000 *
* *
* atmospheric correction activated *
* -------------------------------- *
* BRDF coupling correction *
* input apparent reflectance : 0.500 *
* *
*******************************************************************************
*******************************************************************************
* *
* integrated values of : *
* -------------------- *
* *
* apparent reflectance 1.1287696 appar. rad.(w/m2/sr/mic) 588.646 *
* total gaseous transmittance 1.000 *
* *
*******************************************************************************
* *
* coupling aerosol -wv : *
* -------------------- *
* wv above aerosol : 1.129 wv mixed with aerosol : 1.129 *
* wv under aerosol : 1.129 *
*******************************************************************************
* *
* integrated values of : *
* -------------------- *
* *
* app. polarized refl. 0.0000 app. pol. rad. (w/m2/sr/mic) 0.000 *
* direction of the plane of polarization 0.00 *
* total polarization ratio 0.000 *
* *
*******************************************************************************
* *
* int. normalized values of : *
* --------------------------- *
* % of irradiance at ground level *
* % of direct irr. % of diffuse irr. % of enviro. irr *
* 0.351 0.354 0.295 *
* reflectance at satellite level *
* atm. intrin. ref. background ref. pixel reflectance *
* 0.000 0.000 1.129 *
* *
* int. absolute values of *
* ----------------------- *
* irr. at ground level (w/m2/mic) *
* direct solar irr. atm. diffuse irr. environment irr *
* 648.957 655.412 544.918 *
* rad at satel. level (w/m2/sr/mic) *
* atm. intrin. rad. background rad. pixel radiance *
* 0.000 0.000 588.646 *
* *
* *
* sol. spect (in w/m2/mic) *
* 1663.594 *
* *
*******************************************************************************
*******************************************************************************
* *
* integrated values of : *
* -------------------- *
* *
* downward upward total *
* global gas. trans. : 1.00000 1.00000 1.00000 *
* water " " : 1.00000 1.00000 1.00000 *
* ozone " " : 1.00000 1.00000 1.00000 *
* co2 " " : 1.00000 1.00000 1.00000 *
* oxyg " " : 1.00000 1.00000 1.00000 *
* no2 " " : 1.00000 1.00000 1.00000 *
* ch4 " " : 1.00000 1.00000 1.00000 *
* co " " : 1.00000 1.00000 1.00000 *
* *
* *
* rayl. sca. trans. : 0.84422 1.00000 0.84422 *
* aeros. sca. " : 0.94572 1.00000 0.94572 *
* total sca. " : 0.79616 1.00000 0.79616 *
* *
* *
* *
* rayleigh aerosols total *
* *
* spherical albedo : 0.23410 0.12354 0.29466 *
* optical depth total: 0.36193 0.55006 0.91199 *
* optical depth plane: 0.00000 0.00000 0.00000 *
* reflectance I : 0.00000 0.00000 0.00000 *
* reflectance Q : 0.00000 0.00000 0.00000 *
* reflectance U : 0.00000 0.00000 0.00000 *
* polarized reflect. : 0.00000 0.00000 0.00000 *
* degree of polar. : nan 0.00 nan *
* dir. plane polar. : -45.00 -45.00 -45.00 *
* phase function I : 1.38819 0.27621 0.71751 *
* phase function Q : -0.09117 -0.00856 -0.04134 *
* phase function U : -1.34383 0.02142 -0.52039 *
* primary deg. of pol: -0.06567 -0.03099 -0.05762 *
* sing. scat. albedo : 1.00000 0.98774 0.99261 *
* *
* *
*******************************************************************************
*******************************************************************************
*******************************************************************************
* atmospheric correction result *
* ----------------------------- *
* input apparent reflectance : 0.500 *
* measured radiance [w/m2/sr/mic] : 260.747 *
* atmospherically corrected reflectance *
* Lambertian case : 0.52995 *
* BRDF case : 0.52995 *
* coefficients xa xb xc : 0.00241 0.00000 0.29466 *
* y=xa*(measured radiance)-xb; acr=y/(1.+xc*y) *
A more complete, perhaps more robust solution will require either a parser built on a custom grammar (pyparsing) or some sort of FSM-based processor (TextFSM).
Both options look like they'll be non-trivial to use with this output. A (possibly) lighter-weight solution would be to identify each line based on known labels, then extract appropriately (as suggested by other posters).
There are several ways to implement this. I would suggest mapping 'extractor' callables to known line labels, then iterating over the lines and calling the matching extractors. Each callable would take the line and a context object/dict as arguments and add attributes to the context as required. Something along the lines of https://gist.github.com/1035938
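To make the idea concrete, here is a minimal sketch of the label-to-extractor mapping. It is not the code from the gist. The function names, the context keys, and the extra next_row argument (needed because some values sit on the row after their label) are my own assumptions, and the sample rows use single spaces, which may not match the real file's column spacing.

```python
def strip_border(line):
    """Drop the border asterisks and surrounding whitespace of one output row."""
    return line.strip().strip("*").strip()


def extract_ground_pressure(row, next_row, context):
    # The value is the last token on the label's own row.
    context["ground_pressure"] = float(row.split()[-1])


def extract_irradiance(row, next_row, context):
    # The three values sit on the row after the column headings.
    direct, diffuse, env = (float(v) for v in next_row.split())
    context["irradiance_direct"] = direct
    context["irradiance_diffuse"] = diffuse
    context["irradiance_env"] = env


# Label prefix -> extractor (hypothetical names; labels copied from the question).
EXTRACTORS = {
    "ground pressure": extract_ground_pressure,
    "direct solar irr.": extract_irradiance,
}


def parse(lines):
    """Run every matching extractor over the rows and return the filled context."""
    rows = [strip_border(line) for line in lines]
    context = {}
    for i, row in enumerate(rows):
        next_row = rows[i + 1] if i + 1 < len(rows) else ""
        for label, extractor in EXTRACTORS.items():
            if row.startswith(label):
                extractor(row, next_row, context)
    return context


sample = [
    "* ground pressure [mb] 1013.00 *",
    "* ground altitude [km] 0.000 *",
    "* direct solar irr. atm. diffuse irr. environment irr *",
    "* 648.957 655.412 544.918 *",
]
result = parse(sample)
print(result)
```

Because rows are found by label rather than by position, extra lines inserted elsewhere in the output should not matter, as long as the labels themselves stay the same.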
You could roll your own mini-language, i.e. automate the extraction. I did the following to automate parsing the output of a proprietary program:
# will match in the order written here
tokens = ["num_ref_frames", "Max QP", "Min QP", "Avg QP", "I4x4",
          "I16x16", "SkipZero", "SkipMV", "16x16", "16x8", "8x16",
          "8x8", "8x4", "4x8", "4x4"]
special = ["Quarterpel MVs"]
# this dictionary (hash table) maps each search string from the tokens list
# to a one-element list holding the index of the field to extract when
# building the matrix, e.g. 0 = 1st field, 1 = 2nd field, 2 = 3rd field, etc.
field_index = {tokens[0]: [1], tokens[1]: [1], tokens[2]: [1], tokens[3]: [1],
               tokens[4]: [2], tokens[5]: [2], tokens[6]: [2], tokens[7]: [2],
               tokens[8]: [2], tokens[9]: [2], tokens[10]: [2], tokens[11]: [2],
               tokens[12]: [2], tokens[13]: [2], tokens[14]: [2]}
Then I simply looped over the input, checking each line against the contents of tokens; if a match was found, I split the line and used the matching dictionary entry to extract the correct field. special above was there to handle, well, a special variable that required reading from multiple lines.
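As a rough illustration of that loop, here is a sketch adapted to the 6S output in the question rather than to the proprietary format above. The token list, the choice to count fields in the text that follows the token, and the single-spaced sample rows are all my assumptions, not part of the original parser.

```python
# Search strings, matched in the order written here (taken from the question's output).
tokens = ["solar zenith angle:", "view zenith angle:",
          "ground pressure [mb]", "rayl. sca. trans. :"]

# For each token, the index of the whitespace-separated field to take from
# the text that FOLLOWS the token (0 = 1st field, 2 = 3rd field).
field_index = {tokens[0]: 0, tokens[1]: 0, tokens[2]: 0, tokens[3]: 2}


def extract_fields(lines):
    """Return {token: value} for every token found in the given lines."""
    found = {}
    for line in lines:
        for token in tokens:
            if token in line:
                after = line.split(token, 1)[1].split()
                found[token] = float(after[field_index[token]])
                break  # at most one token per line in this sketch
    return found


sample = [
    "* solar zenith angle: 10.00 deg solar azimuthal angle: 20.00 deg *",
    "* view zenith angle: 30.00 deg view azimuthal angle: 40.00 deg *",
    "* ground pressure [mb] 1013.00 *",
    "* rayl. sca. trans. : 0.84422 1.00000 0.84422 *",
]
values = extract_fields(sample)
print(values)
```

Keeping the tokens and field indices in plain data structures means that supporting a new value is a one-line change rather than another if statement.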
Update: clone git://gist.github.com/1037403.git to get a copy of the code. Usage:
./parser.py all_dec.txt
Hope it helps!
Well, if you want a generic parsing library there is pyparsing, but it would probably be overkill in this case.
This appears to be a fairly line-oriented text file that is not very large, so your best bet would be to loop through the lines looking for text that identifies the things you are after.
So something like:
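with open('file.txt', 'r') as f:
    lines = f.readlines()  # a list, so lines[n + 1] can be indexed

for n, line in enumerate(lines):
    # assumes the label is spaced exactly like this in the real file
    if 'direct solar irr. atm. diffuse irr. environment irr' in line:
        # the values are on the next line; strip any border asterisks first
        values = lines[n + 1].strip('* \n').split()
        self.irradiance_direct, self.irradiance_diffuse, self.irradiance_env = values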
You could then add more if statements and so on as needed to get other data out. Though if you've got a lot of data you'd probably want to generalise the code somewhat. (Probably a dictionary with the text to match as key and a function to call when you match the key).
You may also want to use a regex to match the line, so that you can handle different amounts of whitespace better. Otherwise just one space too many or too few will throw it out.
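A whitespace-tolerant match could look something like the sketch below. The pattern and the irradiance helper are only an illustration; the two sample strings use made-up spacing to show that the amount of whitespace does not matter.

```python
import re

# One decimal number, e.g. 648.957
NUMBER = r"([-+]?\d+\.\d+)"

# \s+ between the words tolerates any amount of spacing. [\s*]+ skips the
# border asterisks and the line break between the headings and the values.
PATTERN = re.compile(
    r"direct\s+solar\s+irr\.\s+atm\.\s+diffuse\s+irr\.\s+environment\s+irr"
    r"[\s*]+" + NUMBER + r"\s+" + NUMBER + r"\s+" + NUMBER
)


def irradiance(text):
    """Return (direct, diffuse, environment) as floats, or None if not found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


tight = ("* direct solar irr. atm. diffuse irr. environment irr *\n"
         "* 648.957 655.412 544.918 *\n")
loose = ("*    direct  solar irr.   atm. diffuse irr.    environment irr     *\n"
         "*      648.957      655.412    544.918   *\n")
print(irradiance(tight))
print(irradiance(loose))
```

Note that this searches the whole file contents as one string (for example the result of f.read()) instead of going line by line, which is what lets a single pattern span the heading row and the value row.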
The best way, IMHO, would be to use a memory-mapped (mmap) file and then use regular expressions to find what you are looking for.
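import mmap, re
with open('file.txt', 'rb') as f:
    text = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file read-only
    match = re.search(pattern, text)  # pattern must be a bytes pattern, e.g. rb'...'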
The mmap module maps a file into memory so that it behaves like a byte string, which means you can perform most of the operations you would perform on a string. And a regex is the best way to search for something. Simple and efficient.
If you need to find specific lines, just handle everything as a string and run specific regular expressions to dig out your gems.
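Here is a small, self-contained sketch of the mmap-plus-regex approach. The find_ground_pressure helper and the temporary demo file are my own additions for illustration; for a file as small as this output, a plain f.read() would probably work just as well.

```python
import mmap
import os
import re
import tempfile


def find_ground_pressure(path):
    """Return the ground pressure found in the file at path, or None."""
    # A memory-mapped file holds bytes, so the pattern is a bytes pattern.
    pattern = re.compile(rb"ground\s+pressure\s+\[mb\]\s+([0-9.]+)")
    value = None
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as text:
            match = pattern.search(text)
            if match:
                # Convert while the mapping is still open.
                value = float(match.group(1).decode("ascii"))
    return value


# Demo on a throwaway file holding two rows of the output from the question.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("* ground pressure [mb] 1013.00 *\n"
              "* ground altitude [km] 0.000 *\n")
pressure = find_ground_pressure(tmp.name)
os.remove(tmp.name)
print(pressure)
```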
If you need to extract more data, I believe that with a small amount of work you can craft a nice parser for your data. I would use the following functions as a start:
def extract_screens(text):
    """
    Returns a list of screens (divided by rows of asterisks).
    Each screen is a list of strings stripped of asterisks.
    """
    ...

def process_screen(screen):
    """
    Returns a list of screen divisions as tuples: [(heading, body)...]
    heading is a string, body is a list of strings
    blank lines are filtered out.
    """
    ...
By now you should have an indexed list of pieces of text. You can loop through them and execute a simple and specific special parser method for each section.
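One possible way to fill in those two stubs is sketched below. It rests on two assumptions of mine about the layout: a screen boundary is any row that starts and ends with a run of asterisks, and a heading is any row that is directly followed by a row of dashes. The column spacing in the sample is invented.

```python
def extract_screens(text):
    """Split the output into screens: lists of rows with the borders removed."""
    screens, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("*****") and line.endswith("*****"):
            # A rule of asterisks (or the version banner) closes the screen.
            if current:
                screens.append(current)
            current = []
        else:
            current.append(line.strip("*").strip())
    if current:
        screens.append(current)
    return screens


def process_screen(screen):
    """Group the non-blank rows of one screen into (heading, body) tuples."""
    rows = [row for row in screen if row]
    divisions = []
    for i, row in enumerate(rows):
        if set(row) == {"-"}:
            continue  # the underline itself carries no data
        if i + 1 < len(rows) and set(rows[i + 1]) == {"-"}:
            divisions.append((row, []))  # underlined row: a new heading
        elif divisions:
            divisions[-1][1].append(row)
        else:
            divisions.append(("", [row]))  # rows before any heading
    return divisions


sample = """\
******************* 6sV version 1.0B *******************
*                                                      *
*   target elevation description                       *
*   ----------------------------                       *
*   ground pressure  [mb]     1013.00                  *
*   ground altitude  [km]       0.000                  *
*                                                      *
********************************************************
*   atmospheric correction result                      *
*   -----------------------------                      *
*   input apparent reflectance :   0.500               *
"""
screens = extract_screens(sample)
sections = [process_screen(screen) for screen in screens]
for heading, body in sections[0]:
    print(heading, body)
```

From here, each (heading, body) pair can be handed to a small parser written for that particular section, for example one that calls float(row.split()[-1]) on each row of "target elevation description".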
Tip: Use unit tests to keep yourself sane.