Regular Expression to extract (video) names from html tags
I have a webpage which contains the following code snippet containing links to videos:
<a href="video.php?video=sampel1.mov">
<a href="video.php?video=anothersample.mov">
<a href="video.php?video=yetanothersample.mov">
I want to use sed and a regular expression to extract the video names, e.g.:
sampel1.mov
anothersample.mov
yetanothersample.mov
so I can use wget to download them.
Thanks a lot!
Give this a try:
sed -n 's/.*video=\([^"]*\)">/\1/p' inputfile
With GNU grep:
grep -Po '(?<=video=).*?(?=">)' inputfile
Pipe the output of either of those commands through xargs:
command | xargs wget ...
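A minimal sketch of the whole pipeline, assuming the videos are served from a hypothetical base URL (http://example.com/videos/ below; the question does not say where the files actually live). The extracted names are bare filenames, so they have to be turned into full URLs before wget can fetch them. The function names extract_names and to_urls are made up for illustration.

```shell
#!/bin/sh
# Hypothetical base URL (an assumption; replace with the real location).
BASE_URL='http://example.com/videos/'

# Print one video name per line from HTML read on standard input.
extract_names() {
    sed -n 's/.*video=\([^"]*\)".*/\1/p'
}

# Prefix each name with the base URL so that wget receives full URLs.
to_urls() {
    sed "s|^|$BASE_URL|"
}

# Demo on the three links from the question; prints three URLs.
printf '%s\n' \
    '<a href="video.php?video=sampel1.mov">' \
    '<a href="video.php?video=anothersample.mov">' \
    '<a href="video.php?video=yetanothersample.mov">' |
    extract_names | to_urls
```

To download instead of printing, feed the real page through the same two functions and append `| wget -i -`, which makes wget read the URL list from standard input.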
You could do something simple like
grep -o 'video\.php?video=[^"]\+' inputfile | sed -e 's/^video\.php?video=//'
You can use sed to retrieve your movie names.
Create a file, e.g. movie_string.txt, containing all the lines that hold the movie names.
Now create a sed script file, say movie_name.sed, with the following:
# Strip double quotes and angle brackets.
s/"//g
s/<//g
s/>//g
# Delete everything up to and including the last "=".
s/.*=//
Save and quit.
Now from the terminal, you just need to issue the following command to redirect the result to another file movie.txt:
sed -f movie_name.sed movie_string.txt > movie.txt
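For a quick test, the same kind of clean-up can be run inline with -e instead of a separate script file. The sketch below is a condensed variant rather than the script file verbatim: it merges the three deletions into one bracket expression and then strips everything through the last =. The function name strip_to_name is hypothetical.

```shell
#!/bin/sh
# Condensed inline variant of the clean-up script (an illustration only).
strip_to_name() {
    # 1st expression: delete <, > and " anywhere on the line.
    # 2nd expression: delete everything up to and including the last "=".
    sed -e 's/[<>"]//g' -e 's/.*=//'
}

# Demo on two of the links from the question; prints the two file names.
printf '%s\n' \
    '<a href="video.php?video=sampel1.mov">' \
    '<a href="video.php?video=anothersample.mov">' |
    strip_to_name
```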
A word of warning: parsing HTML/XML using regular expressions is usually not a good idea. Instead, use a language like Ruby or Python that has an XML parser library that can intelligently interpret the page structure.
Here are a few questions that might help you out (many more are only a quick search away):
- retrieve links from web page using python and BeautifulSoup
- What's the easiest way to extract the links on a web page using python without BeautifulSoup?
- Parse XHTML using Ruby
Update:
In your comment, you mentioned that you already know how to do the link extraction in Python but that you don't want to use a Python script that invokes wget directly. You can still solve this with Python (which is probably the easiest solution since you already know how to do it). If your Python script prints the extracted filenames to standard output with a newline following each name, you can use either of the following shell commands to do what you want to do:
python your_script.py >filenames.txt
wget -i filenames.txt
or
python your_script.py | wget -i -
This will pass the data extracted by your script to wget without requiring your script to invoke wget via a system call.
cat yourlinks.txt | cut -f2 -d\" | cut -f2 -d=