Struggling with a regular expression - new to it
I'm trying to use a regular expression to extract the value of a substring. I've read up on regular expression syntax, but I still can't get it right.
I have the following coming from an HTTP response:
Content-Disposition: attachment; filename=Subtitle.197747.zip; type=s
Content-Disposition: attachment; filename="file one.txt" type=s
Content-Disposition: attachment; filename="file one.txt"; type=s
Content-Disposition: attachment; filename=Subtitle.197747.zip type=s
I'm trying to extract the value of filename, without the double quotes if they are present. I came up with something like:
`.*filename="?(?<filename>[^;"]*)\s?.*`
But this doesn't seem to do the trick. I would appreciate some guidance.
Thanks everyone for your answers, I read them all and went with:
filename="?(?<filename>[^;"]+)[\s;"]
Though I'm not sure how to get it to compile correctly (either `\s` or `"` is giving me trouble).
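One possible cause, assuming the compile errors come from the host language's string literal rather than from the regex engine (the question doesn't say which language is in use): both `"` and `\s` need care inside a string literal. Here is a minimal Python sketch of the same pattern. Python spells the named group `(?P<filename>...)`, and the names `PATTERN`, `headers` and `names` are only illustrative.

```python
import re

# A raw string in single quotes: the backslash in \s and the double quotes
# reach the regex engine untouched. Python writes named groups as (?P<name>...).
PATTERN = re.compile(r'filename="?(?P<filename>[^;"]+)[\s;"]')

# The four sample headers from the question.
headers = [
    'Content-Disposition: attachment; filename=Subtitle.197747.zip; type=s',
    'Content-Disposition: attachment; filename="file one.txt" type=s',
    'Content-Disposition: attachment; filename="file one.txt"; type=s',
    'Content-Disposition: attachment; filename=Subtitle.197747.zip type=s',
]

names = [PATTERN.search(h).group('filename') for h in headers]
print(names)
```

In a language without raw strings, the backslash and the quote would each need that language's own escaping instead.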
Try this:
filename="?(?<filename>[^;"]+)[;"\s]*type
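A quick way to check this expression against the four sample headers, sketched in Python (so the group is written `(?P<filename>...)`; the helper name `extract_filename` is made up for the example). One thing the sketch shows: for the unquoted value with no semicolon, the greedy class also swallows the space before `type`, so the capture is trimmed afterwards.

```python
import re

PATTERN = re.compile(r'filename="?(?P<filename>[^;"]+)[;"\s]*type')

def extract_filename(header):
    """Return the captured filename, or None if the header doesn't match."""
    m = PATTERN.search(header)
    if m is None:
        return None
    # For an unquoted value with no semicolon, the capture ends with the
    # space that precedes "type", so trim surrounding whitespace.
    return m.group('filename').strip()

samples = [
    'Content-Disposition: attachment; filename=Subtitle.197747.zip; type=s',
    'Content-Disposition: attachment; filename="file one.txt" type=s',
    'Content-Disposition: attachment; filename="file one.txt"; type=s',
    'Content-Disposition: attachment; filename=Subtitle.197747.zip type=s',
]
for s in samples:
    print(repr(extract_filename(s)))
```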
The trick with regex (imo) is to not ask it to do too much all at once. Write the expression that doesn't care about the quotes, and then look for quotes in normal procedural code and strip them there if needed. You could even use a separate regex to find the leading/trailing quotes if you want (but it's hardly needed).
The reason for this is not that regex isn't up to the job. You certainly could fit this all in one expression. The reason is that (again: imo) the complexity and maintenance penalty on the regex tends to increase at a much greater rate than the functionality provided. There's a sweet spot in there where a regex is the perfect, elegant solution, but it's easy to take it too far.
The problem you have right now, though, is that the `\s` near the end of your expression also fits within the `[^;"]*` character class used to get your value, and since the asterisk is greedy, that trailing `\s?` will usually never get to match anything. Based on your sample, I'd use `;? type=s` as the trailing condition.
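A rough Python sketch of that split: a loose expression that ignores the quotes and ends on `;? type=s`, followed by ordinary code that strips them. The lazy `.+?` and the names `LOOSE` and `filename_from_header` are assumptions made for this example, not part of the suggestion above.

```python
import re

# Loose on purpose: take everything after "filename=" up to the trailing
# condition ";? type=s" and leave the quotes to procedural code.
LOOSE = re.compile(r'filename=(?P<raw>.+?);? type=s')

def filename_from_header(header):
    """Return the filename without surrounding double quotes, or None."""
    m = LOOSE.search(header)
    if m is None:
        return None
    # Strip the quotes outside the regex.
    return m.group('raw').strip('"')

print(filename_from_header(
    'Content-Disposition: attachment; filename="file one.txt" type=s'))
```

This relies on every header ending in ` type=s`, as in the samples; a header without it returns `None`.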
You are close; try:
filename="?(?<filename>[^;"]+)["\s]
Firstly, you don't need to match the whole string, so the initial and final `.*` can be removed: the simpler you can keep things the better.
Assuming the last example is wrong (see my comment on the question), you need everything between `filename=` and either a semicolon or the end of the string. The value might, if quoted, contain a semicolon (see the definitions of `value` and `token` in RFC 2045, based on a quick read), so something like:
filename=("[^"]+"|.+)\s*(;|$)
albeit the `.+` in the second alternative should be replaced by a character class of the valid characters in a `token` (a subset of ASCII).
The filename will be the value of the first capture.
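As a sketch of what that character class might look like in Python, with `$` standing for the end-of-string alternative: the class below excludes whitespace and the RFC 2045 `tspecials`, which is a simplification (control and non-ASCII characters are not excluded). The names `TOKEN`, `VALUE` and `first_capture`, and the `a;b.txt` header, are made up for the example.

```python
import re

# Simplified "token": anything except whitespace and the RFC 2045 tspecials
# ( ) < > @ , ; : \ " / [ ] ? =
# Control and non-ASCII characters are not excluded in this sketch.
TOKEN = r'[^\s()<>@,;:\\"/\[\]?=]+'

# Either a quoted string or a token, followed by a semicolon or end of string.
VALUE = re.compile(r'filename=("[^"]+"|' + TOKEN + r')\s*(;|$)')

def first_capture(header):
    """Return the first capture (quotes included when quoted), or None."""
    m = VALUE.search(header)
    return m.group(1) if m else None

print(first_capture(
    'Content-Disposition: attachment; filename="file one.txt"; type=s'))
```

A quoted value keeps its quotes in the first capture, and the headers with no semicolon before `type=s` are not matched at all, which fits the assumption that those samples are malformed.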
There are plenty of answers that will do the job, here's mine:
filename=\"?([^;"]+).*type
For testing regular expressions, I use Expresso. It's a free download and gives you a plain English representation of what your regex is actually looking for, which is really handy.