How serious is the Java7 "Solr/Lucene" bug?

2023-03-24 20:50 问答作者：

Apparently Java7 has some nasty bug regarding loop optimization: Google search.

From the reports and bug descriptions I find it hard to judge how significant this bug is (unless 开发者_开发问答you use Solr or Lucene).

What I'd like to know:

How likely is it that my (any) program is affected?
Is the bug deterministic enough that normal testing will catch it?

Note: I can't make users of my program use -XX:-UseLoopPredicate to avoid the problem.

The problem with any hotspot bugs, is that you need to reach the compilation threshold (e.g. 10000) before it can get you: so if your unit tests are "trivial", you probably won't catch it.

For example, we caught the incorrect results issue in lucene, because this particular test creates 20,000 document indexes.

In our tests we randomize different interfaces (e.g. different Directory implementations) and indexing parameters and such, and the test only fails 1% of the time, of course its then reproducable with the same random seed. We also run checkindex on every index that tests create, which do some sanity tests to ensure the index is not corrupt.

For the test we found, if you have a particular configuration: e.g. RAMDirectory + PulsingCodec + payloads stored for the field, then after it hits the compilation threshold, the enumeration loop over the postings returns incorrect calculations, in this case the number of returned documents for a term != the docFreq stored for the term.

We have a good number of stress tests, and its important to note the normal assertions in this test actually pass, its the checkindex part at the end that fails.

The big problem with this, is that lucene's incremental indexing fundamentally works by merging multiple segments into one: because of this, if these enums calculate invalid data, this invalid data is then stored into the newly merged index: aka corruption.

I'd say this bug is much sneakier than previous loop optimizer hotspot bugs we have hit (e.g. sign-flip stuff, https://issues.apache.org/jira/browse/LUCENE-2975). In that case we got wacky negative document deltas, which make it easy to catch. We also only had to manually unroll a single method to dodge it. On the other hand, the only "test" we had initially for that was a huge 10GB index of http://www.pangaea.de/, so it was painful to narrow it down to this bug.

In this case, I spent a good amount of time (e.g. every night last week) trying to manually unroll/inline various things, trying to create some workaround so we could dodge the bug and not have the possibility of corrupt indexes being created. I could dodge some cases, but there were many more cases I couldn't... and I'm sure if we can trigger this stuff in our tests there are more cases out there...

Simple way to reproduce the bug. Open eclipse (Indigo in my case), and Go to Help/Search. Enter a search string, you will notice that eclipse crashes. Have a look at the log.

# Problematic frame:
# J  org.apache.lucene.analysis.PorterStemmer.stem([CII)Z
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x0000000007b79000):  JavaThread "Worker-46" [_thread_in_Java, id=264, stack(0x000000000f380000,0x000000000f480000)]

siginfo: ExceptionCode=0xc0000005, reading address 0x00000002f62bd80e

Registers:

The problem, still exist as of Dec 2, 2012 in both Oracle JDK java -version java version "1.7.0_09" Java(TM) SE Runtime Environment (build 1.7.0_09-b05) Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode) and openjdk java version "1.7.0_09-icedtea" OpenJDK Runtime Environment (fedora-2.3.3.fc17.1-x86_64) OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

Strange that individually any of -XX:-UseLoopPredicate or -XX:LoopUnrollLimit=1 option prevent bug from happening, but when used together - JDK fails see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=849279

Well it's two years later and I believe this bug (or a variation of it) is still present in 1.7.0_25-b15 on OSX.

Through very painful trial and error I have determined that using Java 1.7 with Solr 3.6.2 and autocommit <maxTime>30000</maxTime> seems to cause index corruption. It only seems to happen w/ 1.7 and maxTime at 30000- if I switch to Java 1.6, I have no problems. If I lower maxTime to 3000, I have no problems.

The JVM does not crash, but it causes RSolr to die with the following stack trace in Ruby: https://gist.github.com/armhold/6354416. It does this reliably after saving a few hundred records.

Given the many layers involved here (Ruby, Sunspot, Rsolr, etc) I'm not sure I can boil this down into something that definitively proves a JVM bug, but it sure feels like that's what's happening here. FWIW I have also tried JDK 1.7.0_04, and it also exhibits the problem.

As I understand it, this bug is only found in the server jvm. If you run your program on the client jvm, you are in the clear. If you run your program on the server jvm it depends on the program how serious the problem can be.

继续阅读：java-7

How serious is the Java7 "Solr/Lucene" bug?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？