
What is an efficient way to find a pattern in a large text?

I want to extract email addresses from a large text file. What is the best way to do it?

My idea is to find each '@' in the text and then run a regex over a substring around it, for example one that starts 256 characters before that position and is 512 characters long.

P.S.: Put simply, I want to know the best and most efficient way to find a pattern (such as email addresses) in a huge text.


256 and 512 sound like arbitrary values.

  • You could indeed scan for the @ sign, but then you'd have to read forward and backward until you encounter a character that is not allowed in an email address (for example, another @ sign, whitespace, a backslash...)
  • Quoting Wikipedia:

The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters.

So 64 and 255 would be better-grounded limits than 256 and 512.

Now combine both methods and voila, you have your algorithm.
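
As a rough sketch of that combination, assuming Python and a deliberately simplified set of allowed characters (the names find_emails, LOCAL_TAIL and DOMAIN_HEAD are made up for illustration, and the character classes are far narrower than what the standard permits):

```python
import re

LOCAL_MAX = 64    # maximum local-part length quoted above
DOMAIN_MAX = 255  # maximum domain length quoted above

# Simplified character sets: an assumption, not the full RFC grammar.
LOCAL_TAIL = re.compile(r"[A-Za-z0-9._%+-]+$")
DOMAIN_HEAD = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return candidate addresses found around each '@' in text."""
    found = []
    pos = text.find("@")
    while pos != -1:
        # Look at most LOCAL_MAX characters back and DOMAIN_MAX ahead.
        before = text[max(0, pos - LOCAL_MAX):pos]
        after = text[pos + 1:pos + 1 + DOMAIN_MAX]
        local = LOCAL_TAIL.search(before)   # allowed run ending right at '@'
        domain = DOMAIN_HEAD.match(after)   # dotted run starting right after '@'
        if local and domain:
            found.append(local.group() + "@" + domain.group())
        pos = text.find("@", pos + 1)
    return found
```

The regex only ever sees a window of a few hundred characters, so the cost per @ sign stays bounded regardless of how large the whole text is.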


It depends on how many false positives and false negatives you can tolerate. Email addresses tend to be made up of letters, numbers, and certain symbols. However, while it is probably extremely rare to see characters outside that set in a real email address, the standard certainly allows them. So you need to decide how many real addresses you are willing to miss and how many strings you are willing to accept that match your regular expression but are not actually email addresses.

Here's one pattern that excludes many valid cases and probably also matches too much:

[A-Za-z0-9!#$%&*+=?^_~-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,6}
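
For a large file, a pattern along these lines can be compiled once and applied line by line, so the whole file never has to sit in memory. The following is only a sketch in Python: the function names are hypothetical, the hyphen is placed last in each character class so that it is read literally, and the top-level domain accepts both cases. Both of those choices are assumptions about what the pattern is meant to do.

```python
import re
from typing import Iterable, Iterator

# Compiled once and reused for every line.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9!#$%&*+=?^_~-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,6}"
)


def emails_in_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every match of EMAIL_RE, one line at a time."""
    for line in lines:
        for match in EMAIL_RE.finditer(line):
            yield match.group()


def emails_in_file(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Stream a file instead of loading it into memory all at once."""
    with open(path, encoding=encoding, errors="replace") as handle:
        yield from emails_in_lines(handle)
```

Reading by line assumes that an address never spans a line break, which holds for most plain-text sources but is worth checking on your own data.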


If you absolutely need the most efficient way, I don't think regular expressions should be used.

Assuming almost all instances of @ in your text are email addresses and you are working in a language with fast forward and backward string traversal, this method will probably be close to the fastest:

  1. Search for @
  2. Manually check each character after the @ to make sure it is within the allowed ASCII ranges
  3. Keep track of whether a valid domain was found before the first space or other valid terminating character
  4. Search again from the @ symbol backwards, checking that each character falls within the valid ranges for the local part
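
The four steps above could look roughly like this in Python. It is a minimal sketch: scan_emails and the two character sets are hypothetical names, the sets are simplified assumptions rather than the full grammar, and a lower-level language would show the speed advantage more clearly than Python does.

```python
import string

# Simplified character sets (assumptions; the real grammar is broader).
LOCAL_CHARS = frozenset(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = frozenset(string.ascii_letters + string.digits + ".-")


def scan_emails(text):
    """Find candidate addresses by walking outward from each '@'."""
    results = []
    n = len(text)
    at = text.find("@")  # step 1
    while at != -1:
        # Step 2: walk forward while characters are allowed in a domain.
        end = at + 1
        while end < n and text[end] in DOMAIN_CHARS:
            end += 1
        # Drop trailing punctuation such as a sentence-ending period.
        while end > at + 1 and text[end - 1] in ".-":
            end -= 1
        domain = text[at + 1:end]
        # Step 3: accept only a dotted domain with no empty labels.
        valid_domain = (
            "." in domain
            and not domain.startswith(".")
            and ".." not in domain
        )
        # Step 4: walk backward over the local part.
        start = at
        while start > 0 and text[start - 1] in LOCAL_CHARS:
            start -= 1
        if valid_domain and start < at:
            results.append(text[start:end])
        at = text.find("@", at + 1)
    return results
```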


Locating all valid email addresses is not easy, because the RFC grammar for email addresses is quite complex. If you just want to locate typical email addresses, you can use something like:

/(?<=^|[\s<(\["'])[a-z][\w.+-]+@[\w-]+(?:\.[\w-]+)+(?=[\s>)\]"']|$)/gi

This regex assumes that:

  • The address starts with a letter and otherwise contains only alphanumeric characters, periods, underscores and hyphens (and one @, of course). A + is also allowed in the name part.
  • The address is delimited by whitespace, square brackets, parentheses, single or double quotes, or angle brackets, or it sits at the very start or end of the text

It doesn't check whether the lengths of the name and domain parts are within their allowed ranges (nor many other constraints set by the RFC). Test it on a sample file and see how many emails it matches.
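
To try the same idea from Python, note that Python's re module only accepts fixed-width lookbehind, so the (?<=^|[...]) form has to be rewritten. The sketch below is an adaptation rather than the exact pattern above: it uses negative lookarounds to express "start of text or an opening delimiter" and "end of text, whitespace or a closing delimiter", and the name EMAIL_RE is an illustrative choice.

```python
import re

EMAIL_RE = re.compile(
    r"""
    (?<![^\s<(\["'])        # start of text, or preceded by an opening delimiter
    [a-z][\w.+-]+           # name part: a letter, then word characters, . + -
    @
    [\w-]+(?:\.[\w-]+)+     # domain containing at least one dot
    (?![^\s>)\]"'])         # end of text, or followed by a closing delimiter
    """,
    re.IGNORECASE | re.VERBOSE,
)
```

Calling EMAIL_RE.findall(text) on a sample of your own data gives a quick count of what the pattern does and does not catch.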
