
Grepping out invalid URIs

I have DBpedia's N-Triples files. Some of them contain non-absolute URIs, i.e. URIs that don't start with http://. This causes problems when parsing.

For example, I have some triples with URIs like <www.example.com> instead of <http://www.example.com>.

I'd like to grep them out by negating them.

I tried grep -v "^(<http)", but it did not work.

Any suggestion?

Edit

I probably explained this badly. These URIs aren't necessarily at the beginning of the line. My mistake was using the '^' operator as NOT. Also, I want to filter the offending lines out with grep -v.

These are some sample lines:

<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .

<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .


grep -P '^(?!<http).*'

(?!...) is a negative lookahead. I did not test it, so if that does not work, search the web for 'regex negative lookahead'; that should do the job.


To handle multiple URIs per line, a working regex to start with is:

grep -P '<(?!http(s)?:\/\/).*>'
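Combined with -v, as the question asks, the same idea could be sketched roughly as follows. This assumes GNU grep built with PCRE support (-P). The function name drop_relative_uris and the example.org triple are made up for illustration, and the pattern is slightly tightened ([^>]* instead of .*) so that it stays inside one pair of angle brackets. Note that it treats anything not starting with http:// or https:// as invalid, so other absolute schemes such as ftp:// would be dropped too, and a literal containing '<...>' could trigger a false match.

```shell
# Hypothetical helper: reads N-Triples on stdin and prints only the lines
# in which every <...> token starts with http:// or https://.
drop_relative_uris() {
  # The pattern matches a '<' that is NOT followed by http:// or https://,
  # then everything up to the closing '>'.
  # -v inverts the selection, so lines containing such a token are dropped.
  grep -v -P '<(?!https?://)[^>]*>'
}

# Usage: the first line has a relative URI and is removed,
# the second (illustrative) line is kept.
drop_relative_uris <<'EOF'
<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .
<http://example.org/s> <http://example.org/p> <http://example.org/o> .
EOF
```

Dropping -v gives the opposite view, listing only the offending lines, which may be handy for checking what would be removed before filtering a large dump.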


"^(<http)" would only match if "<http" is at the beginning of the line. Is that true in your case?
