
Parsing text with Python: unstructured but similar information with different formatting

I'm trying to parse thousands of spec sheet text files containing company, material, chemical properties, etc. (Material Safety Data Sheets, to be specific) with Python. The text files contain similar information in loosely structured formatting such that it's human readable, but unstructured enough that it's not easily parsed (e.g. not XML or CSV). In short, it's just all over the place.

Originally the data is entered by different people working in different companies by hand. Another set of people transcribe the information into these text files (OCR it into a txt file).

Is there a parsing library or patterns to extract bits of information of this type? (This seems to be a "common" data entry problem.) Certainly regular expressions will be used a lot. I don't have any experience with natural language processing libraries. Would they even be appropriate for the problem?

My initial thought is to try to group the files into different categories, then create a set of parsing functions for each format. Unfortunately this may only work for a small subset of the problem, and the different cases could quickly spiral out of control.

Since this question is general, I'll provide a bunch of examples illustrating the problem.

ADDRESS INFORMATION

Each file contains company information such as name and address. The information may or may not have an identifier, may or may not be on one line, etc. In short, there seems to be every combination.

Ex.(w/ field info):

MANUFACTURER: Foo Bar Inc.  
ADDRESS: 123 Foo St.  
Bar, CA 90012

Ex. (wo/ field info):

Foo Bar Inc.  
123 Foo St.  
Bar, CA 90012

Ex. (Sometimes extra lines between information):

FOO BAR INC.

123 FOO ST.

BAR, CA 90012

Ex. (inconsistent field names):

MANUFACTURER'S NAME: FOO BAR INC.  
CREATIVE DIVISION  
ADDRESS: 123 FOO ST.  
CITY, STATE & ZIP: BAR, CALIFORNIA 90012  
PHONE NUMBER: 310-111-2222
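
To make the label problem concrete, here is a minimal sketch of one way to handle both the labelled and unlabelled variants. The alias table, the regular expressions, and the names `split_label` and `parse_city_state_zip` are all hypothetical and cover only the examples above; a real corpus would need a much longer alias list.

```python
import re

# Hypothetical alias table: label variants seen above -> canonical field name.
LABEL_ALIASES = {
    "MANUFACTURER": "manufacturer",
    "MANUFACTURER'S NAME": "manufacturer",
    "ADDRESS": "address",
    "CITY, STATE & ZIP": "city_state_zip",
    "PHONE NUMBER": "phone",
}

# A label is a short run of letters/punctuation followed by a colon.
LABEL_RE = re.compile(r"^\s*([A-Z][A-Z',& ]*?)\s*:\s*(.*)$", re.IGNORECASE)

# "Bar, CA 90012" or "BAR, CALIFORNIA 90012" (US-style ZIP assumed).
CITY_STATE_ZIP_RE = re.compile(r"^(.+?),\s*([A-Za-z ]+?)\s+(\d{5}(?:-\d{4})?)$")


def split_label(line):
    """Return (canonical_field, value); the field is None for unlabelled lines."""
    match = LABEL_RE.match(line)
    if match:
        label = match.group(1).upper()
        if label in LABEL_ALIASES:
            return LABEL_ALIASES[label], match.group(2).strip()
    return None, line.strip()


def parse_city_state_zip(text):
    """Split a 'City, State ZIP' line into parts, or return None if it does not fit."""
    match = CITY_STATE_ZIP_RE.match(text.strip())
    if not match:
        return None
    city, state, zip_code = match.groups()
    return {"city": city, "state": state, "zip": zip_code}
```

Unlabelled lines come back with a field of `None`, so they still have to be classified by content (for example by trying `parse_city_state_zip` on each one) or by position.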

SECTION INFO

The spec sheets also have similar sections, but with inconsistent ordering, headings, numeral types, and delimiters.

Ex:

========================================
SECTION 1 -- MATERIALS
========================================

Ex:

Section I. Materials
------------------------------------------

Ex:

----- Section 3       Materials

And sometimes the files had their width changed, so long lines such as the rules below wrap onto a second line.

Ex:

===================================================
1.    Materials
===================================================

Becomes:

=========================================
==========
1.    Materials
=========================================
==========
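
For the heading variants above, a rough sketch of a tolerant matcher might look like the following. The patterns and the names `collapse_rules` and `parse_header` are assumptions made for illustration and are tuned only to the examples shown here. The idea is to merge wrapped rule lines first, then accept either the word "Section" followed by an Arabic or Roman numeral, or a bare "N." prefix.

```python
import re

# A rule line is three or more of the same '=' or '-' character and nothing else.
RULE_RE = re.compile(r"^\s*([=\-])\1{2,}\s*$")

HEADER_RE = re.compile(
    r"""^[\s=\-]*                        # optional leading rule characters
        (?:
            SECTION\s+(?P<num>\d+|[IVXLC]+)   # "Section 3", "SECTION VIII"
            \s*[.\-:\s]\s*-*\s*               # delimiter: ".", "-", "--", ":" or spaces
          |
            (?P<bare>\d+)\.\s+                # bare "1.    Materials" form
        )
        (?P<title>[A-Za-z].*?)\s*$       # heading text
    """,
    re.IGNORECASE | re.VERBOSE,
)


def collapse_rules(lines):
    """Merge runs of consecutive rule lines (e.g. a wrapped '=====' line) into one."""
    result = []
    for line in lines:
        if RULE_RE.match(line) and result and RULE_RE.match(result[-1]):
            continue  # continuation of a wrapped rule; drop it
        result.append(line)
    return result


def parse_header(line):
    """Return (numeral, title) if the line looks like a section heading, else None."""
    match = HEADER_RE.match(line)
    if not match:
        return None
    number = (match.group("num") or match.group("bare")).upper()
    return number, match.group("title")
```

This deliberately requires either the word "Section" or a "N." prefix, so that data lines such as "12.34" are not mistaken for headings. It will still miss heading styles that are not listed above.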

Here is a complete example:

Hopefully this will clarify the issues with parsing the files. You'll notice the line wrapping, information split across different lines, etc. Not all files have this exact structure; some will be formatted differently, with information in different places. Here is a link to a paper hard copy.

MATERIAL SAFETY DATA SHEET

=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========

MANUFACTURER:         Some Company Inc     EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS:              Some Road
City, ST
12346

IDENTITY (AS USED ON
LABEL AND LIST):      Some Identity

PREPARATION DATE:     Some Date

=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========

OSHA
ACGIH
HAZARDOUS COMPONENTS             CAS#       PEL   TWA        TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------

Some Chemical             111-22-3   15    10         10
12.34


=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========

Boiling Point:              N/A  Specific Gravity (H20=1):   N/A
Vapor Pressure (mm Hg):     N/A  Melting Point:              N/A
Vapor Density (AIR=1)       N/A  Evaporation Rate
(Butyl Acetate=1)           N/A
Solubility in Water:        None

Appearance:  Solid, various colors, may have slight
odor.

N/A = Not applicable

=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========

FLASH POINT (METHOD USED):  None
FLAMMABLE LIMITS:  None          LEL:  N/A        UEL:  N/A
EXTINGUISHING MEDIA:  None
SPECIAL FIRE FIGHTING PROCEDURES:  None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS:  None.

=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========

STABILITY:  Stable
CONDITIONS TO AVOID:  None
INCOMPATIBILITY (MATERIALS TO AVOID):  None
HAZARDOUS POLYMERIZATION:  Will not occur

=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========

ROUTES OF ENTRY:

INHALATION:  Yes
SKIN:  Possibly
INGESTION:  Possibly
EYES:  Possibly

HEALTH HAZARDS (ACUTE AND CHRONIC):  Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.

CARCINOGENICITY:  No applicable information found.

SIGNS AND SYMPTOMS OF EXPOSURE:  Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.

MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE:  Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.

EMERGENCY AND FIRST AID PROCEDURES:  Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.

=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========

STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.

WASTE DISPOSAL METHOD:  Standard landfill methods consistent with
applicable state and federal regulations.

PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING:  Use caution not
to drop,
crush, break or chip.

OTHER PRECAUTIONS:  Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.

=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========

RESPIRATORY PROTECTION (SPECIFY TYPE):  OSHA or NIOSH approved
respirators
may be required.

VENTILATION:  Local exhaust recommended.  Special:  N/A.
Mechanical:  Useful.  Other:  N/A.

PROTECTIVE GLOVES:  May be useful.

EYE PROTECTION:  Recommended.

OTHER PROTECTIVE CLOTHING OR EQUIPMENT:  Not required.

WORK/HYGIENIC PRACTICES:  Keep clothing and area clean.  Wash to
remove


I'd write a for loop with lots of state variables, processing each line and using the state variables to keep track of what is going on. The conditionals (if statements) inside the loop would ask the same "questions" a human would ask when parsing the file by hand.

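    # normalize() and next_field_in_list() are helpers you would have to write
    previous_field = None
    for line in file:
        if ":" in line:
            # text before the first colon is the field name, the rest is the data
            label, data = line.split(":", 1)
            field_name = normalize(label)
        else:
            # no label: assume the fields come in a known order
            field_name = next_field_in_list(previous_field)
            data = line
        previous_field = field_name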

And so on. I couldn't tell from the examples whether you at least have a fixed order for the fields, and either a maximum number of fields per record or a distinct record separator. Without these, I think it would be harder to write.
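
As a runnable variant of the loop above, here is a minimal sketch that assumes no fixed field order. Instead of calling `next_field_in_list`, it treats a line without a colon as a continuation of the field that is currently open, which fits the wrapped lines in the sample sheet. The names `normalize` and `parse_fields` and the rule pattern are illustrative choices, not a definitive implementation.

```python
import re

# A line made only of '=' or '-' characters is a separator, not data.
RULE_RE = re.compile(r"[=\-]{3,}")


def normalize(label):
    """Canonicalize a label: drop parenthesised notes, collapse spaces, lower-case."""
    label = re.sub(r"\([^)]*\)", "", label)  # e.g. "(SPECIFY TYPE)"
    return re.sub(r"\s+", " ", label).strip().lower()


def parse_fields(lines):
    """Walk the lines once; `current` is the state variable naming the open field."""
    fields = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or RULE_RE.fullmatch(line):
            current = None  # blank lines and rules close the open field
            continue
        label, colon, value = line.partition(":")
        if colon:
            # Text before the first colon starts a new field.
            current = normalize(label)
            fields[current] = value.strip()
        elif current is not None:
            # No colon: assume a wrapped continuation of the open field.
            fields[current] = (fields[current] + " " + line).strip()
        # Lines with no colon and no open field are skipped in this sketch.
    return fields


sample = """\
STABILITY:  Stable
INCOMPATIBILITY (MATERIALS TO AVOID):  None

SIGNS AND SYMPTOMS OF EXPOSURE:  Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.
""".splitlines()

fields = parse_fields(sample)
```

Known limitations: a continuation line that happens to contain a colon (such as "Mechanical:  Useful." under VENTILATION) is treated as a new field, two-column lines are not split, and unlabelled blocks such as a bare address are ignored.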
