Parsing text with Python: unstructured but similar information with different formatting

2023-02-23 14:12 Q&A author:

I'm trying to parse thousands of spec sheet text files containing company, material, chemical properties, etc. (Material Safety Data Sheets, to be specific) with Python. The text files contain similar information in loosely structured formatting such that it's human readable, but unstructured enough that it's not easily parsed (e.g. not XML or CSV). In short, it's just all over the place.

The data was originally entered by hand by different people working in different companies. Another set of people then transcribed the information into these text files (by running OCR into a txt file).

Is there a parsing library or patterns to extract bits of information of this type? (This seems to be a "common" data entry problem.) Certainly regular expressions will be used a lot. I don't have any experience with natural language processing libraries. Would they even be appropriate for the problem?

My initial thought is to try to group the files into different categories, then create a set of parsing functions for each format. Unfortunately this may only work for a small subset of the problem, and the different cases could quickly spiral out of control.
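A minimal sketch of the grouping idea, assuming that a couple of regular expressions are enough to tell formats apart. The detector names and patterns below are hypothetical, chosen only to match the examples in this question; a real collection would need many more.

```python
import re

# Hypothetical detectors: category name -> pattern expected somewhere in the file.
DETECTORS = {
    "labelled": re.compile(r"^\s*(MANUFACTURER(?:'S NAME)?|ADDRESS)\s*:", re.I | re.M),
    "numbered_sections": re.compile(r"^\s*SECTION\s+[0-9IVX]+\b", re.I | re.M),
}


def categorize(text):
    """Return the sorted tuple of detector names that match the text."""
    return tuple(sorted(name for name, rx in DETECTORS.items() if rx.search(text)))


def group_by_category(texts):
    """Group file keys (e.g. file names) by the set of detectors they trigger."""
    groups = {}
    for key, text in texts.items():
        groups.setdefault(categorize(text), []).append(key)
    return groups
```

Grouping by the whole tuple of matches, rather than by the first match, makes it easy to see how many distinct combinations actually occur before writing a parser for each one.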

Since this question is general, I'll provide a bunch of examples illustrating the problem.

ADDRESS INFORMATION

Each file contains company information such as name and address. The information may or may not have an identifier, it may or may not be on one line, etc. In short, there seems to be every combination.

Ex.(w/ field info):

MANUFACTURER: Foo Bar Inc.  
ADDRESS: 123 Foo St.  
Bar, CA 90012

Ex. (wo/ field info):

Foo Bar Inc.  
123 Foo St.  
Bar, CA 90012

Ex. (Sometimes extra lines between information):

FOO BAR INC.

123 FOO ST.

BAR, CA 90012

Ex. (inconsistent field names):

MANUFACTURER'S NAME: FOO BAR INC.  
CREATIVE DIVISION  
ADDRESS: 123 FOO ST.  
CITY, STATE & ZIP: BAR, CALIFORNIA 90012  
PHONE NUMBER: 310-111-2222
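One possible way to handle all four address variants above is to strip any label prefix, drop blank lines, and then anchor on the line that looks like "City, State ZIP". The sketch below assumes US-style addresses with a five-digit ZIP and upper-case labels; `LABEL`, `CITY_STATE_ZIP` and `parse_address` are hypothetical names used for illustration.

```python
import re

# Optional upper-case label such as "ADDRESS:" or "CITY, STATE & ZIP:".
LABEL = re.compile(r"^[A-Z][A-Z',&\s]*:\s*")
# "Bar, CA 90012" or "BAR, CALIFORNIA 90012" (assumes a US ZIP code).
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<state>[A-Za-z .]+?)\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_address(block):
    """Return a dict of address parts, or None if no city/state/ZIP line is found."""
    # Strip whitespace and labels, and ignore blank lines between entries.
    lines = [LABEL.sub("", ln.strip()) for ln in block.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        m = CITY_STATE_ZIP.match(ln)
        if m and i >= 2:
            # First line is taken as the name, the line just before the
            # city line as the street; anything in between is kept as "extra".
            return {
                "name": lines[0],
                "extra": lines[1:i - 1],
                "street": lines[i - 1],
                **m.groupdict(),
            }
    return None
```

Anything after the city line (the phone number in the last example) is ignored here and would need its own pattern.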

SECTION INFO

The spec sheets also have similar sections, but with inconsistent ordering, headings, numeral types and delimiters.

Ex:

========================================
SECTION 1 -- MATERIALS
========================================

Ex:

Section I. Materials
------------------------------------------

Ex:

----- Section 3       Materials

Sometimes a file's width was changed, so long lines wrap onto the next line.

Ex:

===================================================
1.    Materials
===================================================

Becomes:

=========================================
==========
1.    Materials
=========================================
==========
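The heading variants above can be recognised with one tolerant pattern, provided the rule lines (including their wrapped remainders) are discarded first. This is only a sketch: it assumes a heading either contains the word "Section" followed by an Arabic or Roman numeral, or starts with a number and a period. `RULE`, `HEADING` and `find_headings` are hypothetical names.

```python
import re

# A line consisting only of '=' or '-' characters (a rule, or its wrapped tail).
RULE = re.compile(r"^\s*[=\-]{5,}\s*$")
# "SECTION 1 -- MATERIALS", "Section I. Materials",
# "----- Section 3       Materials", "1.    Materials"
HEADING = re.compile(
    r"^[\s\-=]*(?:SECTION\s+(?P<num>\d+|[IVXLC]+)\b|(?P<num2>\d+)\.)"
    r"[\s.\-:]*(?P<title>[A-Za-z].*?)\s*$",
    re.IGNORECASE,
)


def find_headings(text):
    """Return (number, title) pairs for every line that looks like a heading."""
    found = []
    for line in text.splitlines():
        if RULE.match(line):
            continue  # rule lines carry no content, wrapped or not
        m = HEADING.match(line)
        if m:
            found.append((m["num"] or m["num2"], m["title"]))
    return found
```

Because rule lines are skipped rather than matched by length, a rule that was wrapped into two lines needs no special handling.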

Here is a complete example:

Hopefully this will clarify the issues with parsing the file. You'll notice the line wrapping, information split across different lines, etc. Not all files have this exact structure; some are formatted differently, with information in different places. Here is a link to a paper hard copy.

MATERIAL SAFETY DATA SHEET

=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========

MANUFACTURER:         Some Company Inc     EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS:              Some Road
City, ST
12346

IDENTITY (AS USED ON
LABEL AND LIST):      Some Identity

PREPARATION DATE:     Some Date

=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========

OSHA
ACGIH
HAZARDOUS COMPONENTS             CAS#       PEL   TWA        TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------

Some Chemical             111-22-3   15    10         10
12.34


=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========

Boiling Point:              N/A  Specific Gravity (H20=1):   N/A
Vapor Pressure (mm Hg):     N/A  Melting Point:              N/A
Vapor Density (AIR=1)       N/A  Evaporation Rate
(Butyl Acetate=1)           N/A
Solubility in Water:        None

Appearance:  Solid, various colors, may have slight
odor.

N/A = Not applicable

=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========

FLASH POINT (METHOD USED):  None
FLAMMABLE LIMITS:  None          LEL:  N/A        UEL:  N/A
EXTINGUISHING MEDIA:  None
SPECIAL FIRE FIGHTING PROCEDURES:  None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS:  None.

=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========

STABILITY:  Stable
CONDITIONS TO AVOID:  None
INCOMPATIBILITY (MATERIALS TO AVOID):  None
HAZARDOUS POLYMERIZATION:  Will not occur

=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========

ROUTES OF ENTRY:

INHALATION:  Yes
SKIN:  Possibly
INGESTION:  Possibly
EYES:  Possibly

HEALTH HAZARDS (ACUTE AND CHRONIC):  Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.

CARCINOGENICITY:  No applicable information found.

SIGNS AND SYMPTOMS OF EXPOSURE:  Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.

MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE:  Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.

EMERGENCY AND FIRST AID PROCEDURES:  Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.

=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========

STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.

WASTE DISPOSAL METHOD:  Standard landfill methods consistent with
applicable state and federal regulations.

PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING:  Use caution not
to drop,
crush, break or chip.

OTHER PRECAUTIONS:  Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.

=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========

RESPIRATORY PROTECTION (SPECIFY TYPE):  OSHA or NIOSH approved
respirators
may be required.

VENTILATION:  Local exhaust recommended.  Special:  N/A.
Mechanical:  Useful.  Other:  N/A.

PROTECTIVE GLOVES:  May be useful.

EYE PROTECTION:  Recommended.

OTHER PROTECTIVE CLOTHING OR EQUIPMENT:  Not required.

WORK/HYGIENIC PRACTICES:  Keep clothing and area clean.  Wash to
remove

I'd write a for loop with lots of state variables, processing each line and using the state variables to keep track of what is going on. The conditionals (if) inside the for loop would ask the same "questions" a human would ask when parsing the file by hand.

"
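def parse(lines, field_order):
    # field_order: expected upper-case field names, in order, for unlabelled lines
    record, field_name = {}, None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if ":" in line:
            name, data = line.split(":", 1)
            field_name = name.strip().upper()  # normalize the label
        else:
            # no label: assume this is the field that follows the previous one
            nxt = field_order.index(field_name) + 1 if field_name in field_order else 0
            field_name, data = field_order[nxt % len(field_order)], line
        record[field_name] = data.strip()
    return record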
"

And so on. I could not understand from the examples if you at least have a fixed order for the fields, and either a maximum number of fields per record or a distinct record separator. Without these, I think it would be harder to write.
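When the field order is not fixed, a variant of the same loop can treat an unlabelled line as a continuation of the previous field instead of as the next field. The sketch below is one such variant, not the approach described above; it assumes labels are written in upper case and end with a colon, as in sections IV to VIII of the complete example. `FIELD` and `parse_fields` are hypothetical names.

```python
import re

# An upper-case label followed by a colon, e.g. "HEALTH HAZARDS (ACUTE AND CHRONIC):".
FIELD = re.compile(r"^(?P<name>[A-Z][A-Z0-9 /()'&,\-]*):\s*(?P<value>.*)$")


def parse_fields(lines):
    """Collect 'LABEL: value' pairs, joining wrapped lines onto the open field."""
    fields = {}
    current = None  # state variable: the field currently being filled
    for raw in lines:
        line = raw.strip()
        if not line:
            current = None  # a blank line closes the open field
            continue
        m = FIELD.match(line)
        if m:
            current = m["name"]
            fields[current] = m["value"].strip()
        elif current is not None:
            # No label: treat the line as wrapped text of the open field.
            fields[current] = (fields[current] + " " + line).strip()
    return fields
```

Mixed-case labels such as "Mechanical:" are deliberately not treated as new fields here, so a line like that is appended to the preceding upper-case field.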
