How to parse a file with keyword-value pairs, {} and line breaks in Java?

2022-12-16 15:37

In a file I have some variables stored like this:

author = {Some Author},
link = {some link},
text = { bla bla bla bla bla bla bla bla bla bla bla
bla bla bla bla bla bla bla bla bla bla bla bla bla},
...

Some of the variables are on multiline.

After that I need to split every String entry into key and value, but that's not a problem. This is what I have so far:

\\S+\\s*[=][{]\\s*\\S*[},]

The solutions that work fine for me are:

(\w+)\s*=\s*\{(.*?)\}

and

\\S+\\s*[=]\\s*[{].*[},]

It's not obvious from your post, but this looks like a BibTeX file. If it is, then braces can occur within braces, which means your language is not "regular" and cannot be fully described by regular expressions such as the one you provide.

If not, then you want something like

(\w+)\s*=\s*\{(.*?)\}

but writing a parser is probably the most respectable way to solve your problem. If it is BibTeX you are parsing, an open source Java bibliography manager (such as JabRef) might give you some ideas on building something more robust.
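To see why nested braces are a problem for the lazy pattern, here is a small sketch. The class name NestedBraceDemo, the helper capturedValue and the sample title line are made up for illustration; only the pattern itself comes from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedBraceDemo {
    // The lazy pattern suggested above.
    static final Pattern ENTRY = Pattern.compile("(\\w+)\\s*=\\s*\\{(.*?)\\}");

    // Returns what the pattern captures as the value, or null if nothing matches.
    static String capturedValue(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Flat value: captured in full.
        System.out.println(capturedValue("author = {Some Author},"));            // Some Author
        // Nested braces: the lazy match stops at the first '}'.
        System.out.println(capturedValue("title = {The {B}ib{T}e{X} Book},"));   // The {B
    }
}
```

The second value is cut off at the first closing brace, which is the kind of input where a small parser that counts brace depth is the safer choice.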

I would recommend that you not use regexes for this, since it seems your format is a bit too free-form. Writing a simple parser that first reads a string up to the = as a key and then reads the insides of the braces up to the separating comma or end-of-file without caring about newlines would, to me, seem a simpler approach. And if you need it to, you can replace the newlines with spaces as you go. It also has the benefit that if your values can contain braces, suitably escaped, it is simpler to handle them with an actual parser than with regexes.
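A minimal sketch of such a hand-written parser might look like the following. The class name BraceParser and the method parse are hypothetical, and it assumes that braces inside a value are balanced rather than escaped; escaped braces would need extra handling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BraceParser {
    // Parses "key = {value}," entries; braces inside a value may nest.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        int pos = 0;
        while (true) {
            int eq = input.indexOf('=', pos);
            if (eq < 0) break;                          // no more entries
            String key = input.substring(pos, eq).trim();
            int open = input.indexOf('{', eq);
            if (open < 0) throw new IllegalArgumentException("missing '{' after " + key);
            // Walk forward, tracking nesting depth, until the matching '}'.
            int depth = 1;
            int i = open + 1;
            while (i < input.length() && depth > 0) {
                char c = input.charAt(i);
                if (c == '{') depth++;
                else if (c == '}') depth--;
                i++;
            }
            if (depth != 0) throw new IllegalArgumentException("unbalanced braces in " + key);
            // i now points just past the closing '}'; fold newlines into single spaces.
            String value = input.substring(open + 1, i - 1).replaceAll("\\s+", " ").trim();
            result.put(key, value);
            pos = i;
            // Skip whitespace and the separating comma before the next key.
            while (pos < input.length()
                    && (Character.isWhitespace(input.charAt(pos)) || input.charAt(pos) == ',')) {
                pos++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "link = {some link},\n"
                + "text = { bla bla\nbla bla},\n";
        System.out.println(parse(input));
    }
}
```

A LinkedHashMap is used only so the entries come back in file order; any Map would do.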

This format seems simple enough, and unlikely enough to be extended much, that a hand-written parser is quite suitable. But for a more complex language, or even if you just want the exercise, you could use a parser generator to build your parser, which has the benefit of a much more comprehensible language definition. I understand ANTLR is a popular one to use in Java.

You could use the String class's split method.

public String[] split(String regex)

Splits this string around matches of the given regular expression.

You could first split the input at the commas, then split the text between {} on whitespace (\s). Note that this only works cleanly if the values themselves contain no commas.
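A rough sketch of the split idea follows. It is a variation rather than a literal reading of the suggestion: it splits only at a comma that directly follows a closing brace (so commas inside a value survive), and then splits each entry at the first '='. The class name SplitDemo is made up, and it assumes values contain no nested "}," sequence.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitDemo {
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        // Split only at a comma that directly follows '}' (the lookbehind keeps the brace).
        for (String entry : input.trim().split("(?<=\\})\\s*,\\s*")) {
            String[] kv = entry.split("=", 2);      // key, then everything after the first '='
            if (kv.length < 2) continue;            // skip fragments without '='
            String value = kv[1].trim();
            // Strip the surrounding braces if both are present.
            if (value.startsWith("{") && value.endsWith("}")) {
                value = value.substring(1, value.length() - 1);
            }
            // Fold line breaks and runs of whitespace into single spaces.
            result.put(kv[0].trim(), value.replaceAll("\\s+", " ").trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("author = {Some Author},\nlink = {some link},\n"));
    }
}
```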

Have you considered Java properties files? http://en.wikipedia.org/wiki/.properties

You should use Properties; regex is not a good solution in your case.
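If changing the file format is an option, a sketch with java.util.Properties could look like this. The sample content is rewritten by hand into .properties syntax (no braces or trailing commas, and a trailing backslash to continue a value on the next line); the class name PropertiesDemo and the helper load are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesDemo {
    // Loads key/value pairs from text in .properties syntax.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // A trailing backslash continues the value on the next line;
        // leading whitespace on the continuation line is dropped.
        String text = "author = Some Author\n"
                + "link = some link\n"
                + "text = bla bla bla \\\n"
                + "       bla bla bla\n";
        Properties props = load(text);
        System.out.println(props.getProperty("author"));
        System.out.println(props.getProperty("text"));
    }
}
```

For a real file you would pass a FileReader or an InputStream to Properties.load instead of the StringReader used here.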

Using a different file format will probably save you some headaches, but you could parse it like this:

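import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DOTALL lets '.' match line breaks, so multi-line values are captured too.
Pattern p = Pattern.compile("\\s*(\\w+)\\s*=\\s*\\{(.*?)\\},?\\s*", Pattern.DOTALL);
Matcher m = p.matcher(input);
// find() resumes after the previous match, so the input need not be rewritten.
while (m.find()) {
    String key = m.group(1);
    String val = m.group(2);
    System.out.println("OK: key=" + key + ", val=" + val);
}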

Just replace the println with insertion into your Map.
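For instance, a sketch that collects the pairs into a Map could look like this. The class name RegexToMap and the whitespace folding are additions for illustration, and like the pattern itself it assumes values contain no nested braces.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexToMap {
    // DOTALL lets '.' match line breaks inside a value.
    private static final Pattern ENTRY =
            Pattern.compile("(\\w+)\\s*=\\s*\\{(.*?)\\}", Pattern.DOTALL);

    static Map<String, String> parse(String input) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            // Fold line breaks and repeated spaces into single spaces.
            entries.put(m.group(1), m.group(2).replaceAll("\\s+", " ").trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "link = {some link},\n"
                + "text = { bla bla\nbla bla},\n";
        System.out.println(parse(input));
    }
}
```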

I'm not sure exactly what you're asking and your regex isn't much help in providing additional information.

However, if braces can't nest and you don't want to handle escaped braces, then the regex is pretty straightforward.

Note: even your most recent regex, \\S+\\s*[=]\\s*[{].*[},] (you probably should have just edited your post instead of responding to yourself), is doing some things it doesn't need to, and those will certainly trip you up. The overuse of [] style character classes is probably confusing you. Your last [},] really says "one character that is either '}' or ','", which I'm pretty sure is not what you mean.

Regex seems to be everyone's favorite whipping boy but I think it's appropriate here.

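import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The literal braces must be escaped; Java rejects a bare '{' as an illegal repetition.
Pattern p = Pattern.compile( "\\s*([^={}]+)\\s*=\\s*\\{([^}]+)\\},?" );
Matcher m = p.matcher( someString );
while( m.find() ) {
    // group(1) can end in whitespace because [^={}] also matches spaces, hence trim().
    System.out.println( "name:" + m.group(1).trim() + " value:" + m.group(2) );
}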

The regex breaks down as:

Any preceding whitespace.
First capture group is a non-zero length string containing only characters that are NOT '=', '{', or '}'
Any intermediate whitespace.
'='
Any intermediate whitespace.
'{'
Second capture group is a non-zero length string containing only characters that are not the closing '}'
'}'
Optional ','

This regex should perform more efficiently than the .* versions because it is easier for it to figure out where to stop. I also think it is clearer but I speak regex conversationally. :)
