Search and replace characters in a file with Python

2022-12-20 05:21

I am trying to do transliteration: I need to replace every source character in an English file with its equivalent in another language, taken from a dictionary (in Unicode format) that I have defined in my source code. I can already read the English file character by character. How do I look up each character's equivalent in the dictionary and make sure it is written to a new, transliterated output file? Thank you :).

The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings, which would make it impossible to have characters such as 'पत्र'!)

All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:

each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).

Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
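As a minimal sketch of that layout, assuming Python 3 (where every str is already a Unicode string, so the u'' prefix is unnecessary): the table below is hypothetical and deliberately tiny, and the pairing of 'x' with U+0915 is arbitrary, chosen only to show an ordinal used as a value.

```python
# Hypothetical, deliberately tiny table; a real one would cover the whole script.
# Keys must be integer code points, never one-character strings.
table = {
    0x0904: "a",       # string value: DEVANAGARI LETTER SHORT A -> Latin 'a'
    ord("x"): 0x0915,  # ordinal value: 'x' -> U+0915 (arbitrary pairing, illustration only)
    ord("!"): None,    # None value: delete every '!'
}


def transliterate(text, mapping=table):
    """Return text with every mapped character replaced or removed."""
    return text.translate(mapping)
```

Calling transliterate("\u0904x!z") exercises all three kinds of value, and the trailing 'z', which has no key in the table, passes through untouched.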

Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
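Applied to a whole file, that could look roughly like the sketch below. It assumes Python 3 and UTF-8 encoded files; the function name transliterate_file and its parameters are made up for this illustration.

```python
def transliterate_file(src_path, dst_path, table, encoding="utf-8"):
    """Read src_path in one go, transliterate it, and write the result to dst_path.

    table maps integer code points to replacement strings, ordinals, or None.
    The encoding default is an assumption; use whatever your files are stored in.
    Returns the number of characters written.
    """
    with open(src_path, encoding=encoding) as src:
        input_text = src.read()  # the whole file is held in memory
    output_text = input_text.translate(table)
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(output_text)
    return len(output_text)
```

Reading the whole file at once keeps the code short and matches the one-file-at-a-time approach described above; for inputs too large for memory, the same translate call can be applied to each line instead.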

Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.

Something like this:

for syllable in split_into_syllables(input_text):  # hypothetical helper you write yourself
    output_file.write(d.get(syllable, syllable))  # unmapped syllables pass through unchanged

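Here output_file is a file object open for writing, and d is a dictionary whose keys are your source characters (or syllables) and whose values are the output characters. split_into_syllables is a placeholder for whatever splitting logic suits your script; it is not a built-in string method. You can also read your file line by line instead of reading it all in at once.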
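One possible way to fill in the splitting step is a greedy longest-match scan over the dictionary keys, applied line by line. This is only a sketch under assumptions: the function names and the sample mapping in the usage note are invented here, and greedy matching is just one strategy, which may not suit every script.

```python
def transliterate_line(line, mapping):
    """Replace the longest matching key at each position; copy anything unmapped."""
    max_len = max(map(len, mapping), default=1)
    out = []
    i = 0
    while i < len(line):
        # Try the longest candidate first so "ka" wins over "k" followed by "a".
        for size in range(min(max_len, len(line) - i), 0, -1):
            chunk = line[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            out.append(line[i])  # no key starts here: keep the character as is
            i += 1
    return "".join(out)


def transliterate_lines(src, dst, mapping):
    """Stream src to dst one line at a time; both are open text file objects."""
    for line in src:
        dst.write(transliterate_line(line, mapping))
```

With a hypothetical mapping such as {"ka": "क", "k": "क्", "a": "अ"}, the input "kak a" is split as "ka", "k", " ", "a", and the space is copied unchanged. Because only one line is held in memory at a time, this also works for files that are too large to read in one go.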
