Python: normalizing a text file
I have a text file which contains several spelling variants of many words. For example:
identification ... ID .. identity...contract.... contr.... contractor...medicine...pills..tables
So I want to have a synonym text file which contains each word's synonyms, and I would like to replace all the variants with the primary word. Essentially, I want to normalize the input file.
For example, my synonym list file would look like:
identification = ID identify
contracting = contract contractor contractors contra......
word3 = word3_1 word3_2 word3_3 ..... word3_n
.
.
.
.
medicine = pills tables drugs...
I want the end output file to look like
identification ... identification .. identification...contractor.... contractor.... contractor...medicine...medicine..medicine
How do I go about programming this in Python?
Thanks a lot for your help!!!
You could read the synonym file and convert it into a dictionary, table:
import re
table = {}
with open('synonyms', 'r') as syn:
    for line in syn:
        # Each useful line looks like: primary = synonym1 synonym2 ...
        match = re.match(r'(\w+)\s+=\s+(.+)', line)
        if match:
            primary, synonyms = match.groups()
            synonyms = [synonym.lower() for synonym in synonyms.split()]
            # Map every lowercased synonym to its lowercased primary word
            for synonym in synonyms:
                table[synonym] = primary.lower()
print(table)
yields
{'word3_1': 'word3', 'word3_3': 'word3', 'word3_2': 'word3', 'contr': 'contracting', 'contract': 'contracting', 'contractor': 'contracting', 'contra': 'contracting', 'identify': 'identification', 'contractors': 'contracting', 'word3_n': 'word3', 'id': 'identification'}
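To try the parsing step without a file on disk, the same logic can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the name build_table and the sample lines are assumptions made for illustration, and the pattern is the one used above.

```python
import re

# Same pattern as above: a primary word, "=", then the synonyms.
LINE_PATTERN = re.compile(r'(\w+)\s+=\s+(.+)')

def build_table(lines):
    """Map each lowercased synonym to its lowercased primary word."""
    table = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip filler lines such as "." or blank lines
        primary, synonyms = match.groups()
        for synonym in synonyms.split():
            table[synonym.lower()] = primary.lower()
    return table

# Hypothetical sample input standing in for the synonyms file.
sample = [
    "identification = ID identify\n",
    ".\n",
    "medicine = pills tables drugs\n",
]
table = build_table(sample)
print(table)
```

Note that the primary words themselves are not keys of the table. That is fine as long as the lookup falls back to the original word when no entry is found.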
Next, you could read in the text file, and replace each word with its primary synonym from table:
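with open('textfile', 'r') as f:
    for line in f:
        # Split into runs of word and non-word characters, then rejoin.
        # end='' because line already ends with its own newline.
        print(''.join(table.get(word.lower(), word)
                      for word in re.findall(r'(\W+|\w+)', line)), end='')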
yields
identification identification identity contracting contracting contracting medicine medicine medicine
re.findall(r'(\W+|\w+)',line) was used to split each line while preserving whitespace. If whitespace is not of interest, you could also use the easier line.split(). table.get(word,word) returns table[word] if word is in table, and simply returns word if word is not in the synonym table.
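The difference between the two ways of splitting a line can be seen on a small example. The table, the sample line, and both function names below are made up for illustration.

```python
import re

# Small hand-written table, assumed for illustration only.
table = {'id': 'identification', 'pills': 'medicine'}

def normalize_keep_separators(line, table):
    # Alternating runs of non-word and word characters; joining them
    # back together reproduces the line, apart from the replaced words.
    tokens = re.findall(r'(\W+|\w+)', line)
    return ''.join(table.get(tok.lower(), tok) for tok in tokens)

def normalize_split(line, table):
    # split() discards the original spacing, so the words are rejoined
    # with single spaces; punctuation stays attached to its word.
    return ' '.join(table.get(tok.lower(), tok) for tok in line.split())

line = "ID ...  pills, Pills"
print(normalize_keep_separators(line, table))  # identification ...  medicine, medicine
print(normalize_split(line, table))            # identification ... pills, medicine
```

With split(), the token "pills," keeps its comma and therefore misses the table lookup, which is one reason to prefer the regular expression when the text contains punctuation.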
Just a thought: instead of keeping a list of all variations of a word, have a look at difflib:
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
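Applied to this question, one possible approach is to keep only the list of primary words and let get_close_matches pick the nearest one. The list of primaries, the helper name closest_primary, and the cutoff of 0.6 (the difflib default) are assumptions for this sketch, not a tested recommendation.

```python
from difflib import get_close_matches

# Hypothetical list of primary words, assumed for illustration.
primaries = ['identification', 'contracting', 'medicine']

def closest_primary(word, primaries, cutoff=0.6):
    """Return the best fuzzy match among primaries, or word itself."""
    matches = get_close_matches(word.lower(), primaries, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for word in ['contractor', 'pills']:
    print(word, '->', closest_primary(word, primaries))
```

Here 'contractor' is mapped to 'contracting', while 'pills' is left unchanged because it shares almost no characters with 'medicine'. Fuzzy matching only catches spelling variants, so true synonyms would still need an explicit table.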