What is the efficient way to find some pattern in a big text?

2022-12-30 04:45

I want to extract email addresses from a large text file. What is the best way to do it?

My idea is to find each '@' in the text and then run a regex over a substring around it, for example one that starts 256 characters before that position and is 512 characters long.

P.S.: To put it plainly, I want to know the best and most efficient way to find a pattern (such as email addresses) in a huge text.

256 and 512 sound like arbitrary values.

You could indeed scan for the @ sign, but then you'd have to read forward and backward until you encounter a character that is not allowed in an email address (for example, another @ sign, whitespace, a backslash...).
Quoting Wikipedia:

The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters.

So 64 characters before the @ and 255 after it would be better bounds.

Now combine both methods and voila, you have your algorithm.
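
A rough Python sketch of that combination, in case it helps. The two character sets and the name `find_emails_windowed` are my own simplifications for illustration, not anything from the standard, so treat it as a starting point rather than a complete solution:

```python
import re

MAX_LOCAL = 64    # local-part limit quoted above
MAX_DOMAIN = 255  # domain limit quoted above

# Simplified, assumed character sets; real addresses allow more than this.
LOCAL_TAIL = re.compile(r"[A-Za-z0-9._%+-]+\Z")
DOMAIN_HEAD = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails_windowed(text):
    """Return candidate addresses found by looking around each '@'."""
    found = []
    at = text.find("@")
    while at != -1:
        # Look at no more than 64 characters before the '@' ...
        before = text[max(0, at - MAX_LOCAL):at]
        # ... and no more than 255 characters after it.
        after = text[at + 1:at + 1 + MAX_DOMAIN]
        local = LOCAL_TAIL.search(before)   # run of local chars ending at '@'
        domain = DOMAIN_HEAD.match(after)   # dotted domain starting after '@'
        if local and domain:
            found.append(local.group() + "@" + domain.group())
        at = text.find("@", at + 1)
    return found
```

The regex only ever sees a few hundred characters per @ sign, so the cost is driven mostly by how many @ signs the text contains.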

It depends on how many false positives and false negatives you are willing to accept. Email addresses tend to be made up of letters, numbers, and certain symbols. However, while it is probably extremely rare to see characters outside that set in a real email address, the standard certainly allows them. So you need to decide how many real addresses you are willing to miss, and how many strings that match your regular expression but are not actually email addresses you are willing to accept.

Here's one pattern that excludes many valid addresses and probably also matches too many strings that are not addresses:

[A-Za-z0-9!#$%&*+-=?^_~]{1,64}@[A-Za-z0-9-.]{1,255}\.[A-Z]{2,6}
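
For a large file you would probably compile the pattern once and feed it one line at a time instead of loading everything into memory. A minimal sketch in Python, using the pattern above verbatim; the `re.IGNORECASE` flag is my assumption (without it `[A-Z]{2,6}` only matches upper-case top-level domains), and `iter_emails` is a made-up name:

```python
import io
import re

# The pattern from above, verbatim. IGNORECASE is an assumption so that
# [A-Z]{2,6} also matches lower-case top-level domains.
# Note: "+-=" inside the first class is a range, so it also admits
# digits and the characters , . / : ; < in the local part.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9!#$%&*+-=?^_~]{1,64}@[A-Za-z0-9-.]{1,255}\.[A-Z]{2,6}",
    re.IGNORECASE,
)


def iter_emails(lines):
    """Yield every match, reading one line at a time."""
    for line in lines:
        for match in EMAIL_RE.finditer(line):
            yield match.group()


# Small in-memory demonstration; a real file object works the same way.
sample = io.StringIO("write to john.doe@example.com today\nnothing here\n")
print(list(iter_emails(sample)))
```

With a real file you would pass the file object itself, e.g. `with open("big.txt", encoding="utf-8", errors="replace") as f:` and then loop over `iter_emails(f)`, where `big.txt` is just a placeholder name.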

If you absolutely need the most efficient way, I don't think regular expressions should be used.

Assuming almost all instances of @ in your text are email addresses and you are working in a language with fast forward and backward string traversal, this method will probably be close to the fastest:

1. Search for @.
2. Manually compare each character after the @ to make sure it is within the allowed ASCII ranges.
3. Keep track of whether a valid domain was found before the first space or other valid terminating character.
4. Search again from the @ symbol backwards, comparing each character to make sure it falls within the valid character ranges for the local component.
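
The steps above could look roughly like this in Python. The allowed character sets and the "domain must contain a dot" check are assumptions I picked for illustration, and `scan_emails` is a hypothetical name; tighten or loosen them to suit your data:

```python
import string

# Assumed character sets for illustration; the RFCs allow more than this.
LOCAL_CHARS = frozenset(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = frozenset(string.ascii_letters + string.digits + ".-")


def scan_emails(text):
    """Find candidate addresses by walking outward from each '@'."""
    results = []
    at = text.find("@")
    while at != -1:
        # Forward pass: consume characters allowed in a domain.
        end = at + 1
        while end < len(text) and text[end] in DOMAIN_CHARS:
            end += 1
        # Drop trailing dots or hyphens, e.g. the full stop ending a sentence.
        while end > at + 1 and text[end - 1] in ".-":
            end -= 1
        domain = text[at + 1:end]
        # Backward pass: consume characters allowed in the local part.
        start = at
        while start > 0 and text[start - 1] in LOCAL_CHARS:
            start -= 1
        # Accept only a non-empty local part and a domain that contains
        # a dot but does not start with one.
        if start < at and "." in domain and not domain.startswith("."):
            results.append(text[start:end])
        at = text.find("@", at + 1)
    return results
```

Each character near an @ is looked at a constant number of times, and everything else is skipped by `str.find`, which is why this tends to be quick when @ signs are sparse.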

Locating all valid email addresses is not easy, because the RFC syntax for email addresses is quite complex. If you just want to locate normal email addresses, you can use something like:

/(?<=^|[\s<(\["'])[a-z][\w.+-]+@[\w-]+(?:\.[\w-]+)+(?=[\s>)\]"']|$)/gi

This regex assumes that:

The email address starts with a letter and contains only alphanumeric characters, periods, underscores and hyphens (and one @, of course). It also allows + in the name part.
Addresses are delimited by whitespace, square brackets, parentheses, single/double quotes or angle brackets, or by the start or end of the text.

It doesn't check whether the lengths of the name and domain parts are within their allowed ranges (or the many other constraints set by the RFC). Test it on a sample file and see how many addresses it matches.
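
If you happen to be in Python, note that its `re` module rejects variable-width lookbehinds such as `(?<=^|[...])`. One possible adaptation, which is my own rewrite and not part of the pattern above, is to use negative lookarounds on a negated class, so that "start or end of text" is covered for free. It allows whitespace on both sides, as the assumptions describe, and `re.ASCII` is added so `\w` stays ASCII-only:

```python
import re

EMAIL_RE = re.compile(
    r"""(?<![^\s<(\["'])"""      # start of text, whitespace, or an opening delimiter before
    r"""[a-z][\w.+-]+"""         # name part, starting with a letter
    r"""@[\w-]+(?:\.[\w-]+)+"""  # domain with at least one dot
    r"""(?![^\s>)\]"'])""",      # end of text, whitespace, or a closing delimiter after
    re.IGNORECASE | re.ASCII,
)

text = ('Contact <alice@example.com>, "bob.smith+x@mail.example.org" '
        'or carol@example.net today; not x@y or 1abc@example.com')
matches = EMAIL_RE.findall(text)
print(matches)
```

Like the original, this makes no attempt to check part lengths or the rest of the RFC rules.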
