print everything up to match in pattern

I have a data set that looks like the following:

movie (year) genre

For example:

some words (1934) action

My goal is to grab each "movie" field, then check a different file that also has a bunch of movies and delete the lines from that second file that do not contain the movie. I have been trying to use awk to do this, but have only been able to match the year field. Is there a way to create a variable for the movie field? I feel like the easiest way would be to match the year field and create a variable from everything that comes before it on each line. I have not been able to figure this out. Is there some way to do this that might be easier than my suggestion?


Assuming your dataset is in a file:

$ cat dataset
Terminator (19XX) action
The Ghostrider (2009) supernatural

$ awk -F"[()]" '{print $1}' dataset
Terminator
The Ghostrider

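# strip the space left before "(" so the names match cleanly
$ awk -F"[()]" '{sub(/ +$/, "", $1); print $1}' dataset > movie_names

# lines of secondfile that contain one of the movie names (-F: fixed strings, not regexes)
$ grep -F -f movie_names secondfile
# or the reverse: movie names that contain a line of secondfile
$ grep -F -f secondfile movie_names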

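Of course, you can do it with just awk as well. This version trims the trailing space from each name, compares with index() so that regex characters in a title are taken literally, and prints each matching line only once:

awk -F"[()]" 'FNR==NR { sub(/ +$/, "", $1); m[++d]=$1; next }
     { for(i=1;i<=d;i++) if (index($0, m[i])) { print; break } }' dataset secondfile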
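To actually delete the non-matching lines from the second file, as the question asks, the awk output can be written to a temporary file and then moved over the original. The following is only a sketch: the sample file contents and the names workdir, titles, and remaining are invented for illustration, and it assumes a title appears verbatim somewhere on the matching line of the second file.

```shell
# Hypothetical sample data; the file names and contents are invented for illustration.
workdir=$(mktemp -d)
printf '%s\n' 'Terminator (19XX) action' 'The Ghostrider (2009) supernatural' > "$workdir/dataset"
printf '%s\n' 'Terminator 8.0' 'Some Other Film 5.1' 'The Ghostrider 4.2' > "$workdir/secondfile"

# Pass 1 (dataset): store the text before " (" as a title.
# Pass 2 (secondfile): print a line only if it contains one of the stored titles.
# The original file is replaced only if awk exits successfully.
awk -F' *[(]' 'FNR==NR { titles[$1]; next }
               { for (t in titles) if (index($0, t)) { print; break } }' \
    "$workdir/dataset" "$workdir/secondfile" > "$workdir/secondfile.new" &&
    mv "$workdir/secondfile.new" "$workdir/secondfile"

remaining=$(cat "$workdir/secondfile")
printf '%s\n' "$remaining"
rm -rf "$workdir"
```

With this sample data, the "Some Other Film" line is dropped and the other two lines are kept in their original order.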


You can ask sed to remove the year field and everything that comes after it.

$ sed 's/ *([0-9][0-9]*).*//' file

This returns only the name of the movie on each line. You can then pipe the output into a while read loop.
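A minimal sketch of such a loop, assuming a throwaway sample file (the names datafile, movie, and result are made up here). The body just prints each title, but it could equally run grep against a second file.

```shell
# Hypothetical sample input standing in for the real data file.
datafile=$(mktemp)
printf '%s\n' 'some words (1934) action' 'another title (2009) drama' > "$datafile"

# sed strips " (year) genre"; the loop then sees one movie name per iteration.
# IFS= and -r keep leading spaces and backslashes in a title intact.
result=$(sed 's/ *([0-9][0-9]*).*//' "$datafile" | while IFS= read -r movie; do
    printf 'title=<%s>\n' "$movie"
done)

printf '%s\n' "$result"
rm -f "$datafile"
```

Note that in most shells a loop on the receiving end of a pipe runs in a subshell, so variables assigned inside it are not visible after the loop ends.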

If needed, you can refine the regex so that it matches only four digits (as written, it matches any number of digits between the parentheses).
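One way to express the four-digit restriction is a POSIX interval, \{4\}. The sketch below compares the two patterns on invented sample lines (the variables sample, loose, and strict are illustrative), where a title itself contains a short number in parentheses.

```shell
# Invented sample lines; the second title contains "(12)" before the real year.
sample='some words (1934) action
Title (12) part two (2001) drama'

# Any run of digits in parentheses: cuts at "(12)", truncating the second title.
loose=$(printf '%s\n' "$sample" | sed 's/ *([0-9][0-9]*).*//')

# Exactly four digits in parentheses: skips "(12)" and cuts at "(2001)".
strict=$(printf '%s\n' "$sample" | sed 's/ *([0-9]\{4\}).*//')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern yields "Title" for the second line, while the strict one yields "Title (12) part two".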
