Weblogic EJB calls start to fail under moderate load with OptionalDataException

Our system setup consists of two Weblogic 10.3 servers: one hosts the presentation layer and the other hosts the EJBs. The system runs fine under moderate load for some time (one to several days), after which EJB method calls from the presentation server to the EJB server start to fail with the following error:

java.rmi.RemoteException: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is: java.io.OptionalDataException

Stack trace:

java.io.OptionalDataException
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
    at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
    at weblogic.jndi.internal.RootNamingNode_WLSkel.invoke(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.invoke(BasicServerRef.java:589)
    at weblogic.rmi.cluster.ClusterableServerRef.invoke(ClusterableServerRef.java:230)
    at weblogic.rmi.internal.BasicServerRef$1.run(BasicServerRef.java:477)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
    at weblogic.security.service.SecurityManager.runAs(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.handleRequest(BasicServerRef.java:473)
    at weblogic.rmi.internal.wls.WLSExecuteRequest.run(WLSExecuteRequest.java:118)

Once the first OptionalDataException is encountered, all subsequent calls fail with the same result. Some sources suggest that this might be caused by a misconfigured cluster multicast port. However, these servers do not belong to a cluster.

Rebooting the EJB server always resolves the issue temporarily, but the issue recurs after some time.
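For context on the error itself: `ObjectInputStream.readObject()` throws `OptionalDataException` when the next item in the stream is primitive data rather than an object, i.e. when the reader and the writer disagree about the stream layout. The minimal sketch below (class and method names are hypothetical, and it does not involve WebLogic at all) reproduces that mismatch by writing an `int` and then trying to read an object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OptionalDataException;

public class OptionalDataDemo {
    // Returns the exception raised when the reader expects an object
    // but the writer put primitive data at that position in the stream.
    static OptionalDataException readObjectWhereIntWasWritten()
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(42); // writer sends a primitive int (block data)
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject(); // reader expects an object: the stream is out of sync
            return null;     // not reached for this input
        } catch (OptionalDataException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        OptionalDataException e = readObjectWhereIntWasWritten();
        // length = bytes of primitive data found, eof = false (data, not end of object)
        System.out.println("length=" + e.length + ", eof=" + e.eof);
    }
}
```

This prints `length=4, eof=false`. In the traces above the same kind of mismatch surfaces inside WebLogic's RMI unmarshalling, which suggests the message being read was malformed or out of sync rather than that the remote method itself failed.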

Update: it seems that the problem is not related to an overflow in the number of socket connections after all (see my own answer below). After we disallowed network classloading, the system ran steadily for a week, after which we started receiving OptionalDataExceptions on the presentation server again (stack trace below). It is very strange that the system works fine for a week and then starts to fail.

javax.naming.CommunicationException [Root exception is java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
    java.io.OptionalDataException]
    at weblogic.jndi.internal.ExceptionTranslator.toNamingException(ExceptionTranslator.java:74)
    at weblogic.jndi.internal.WLContextImpl.translateException(WLContextImpl.java:439)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:395)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:380)
    at javax.naming.InitialContext.lookup(InitialContext.java:392)
    ...
Caused by: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
    java.io.OptionalDataException
    at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.java:234)
    at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:348)
    at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:259)
    at weblogic.jndi.internal.ServerNamingNode_1030_WLStub.lookup(Unknown Source)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:392)  
    ... 38 more
Caused by: java.io.OptionalDataException
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
    at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
    at weblogic.jndi.internal.RootNamingNode_WLSkel.invoke(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.invoke(BasicServerRef.java:589)
    at weblogic.rmi.cluster.ClusterableServerRef.invoke(ClusterableServerRef.java:230)
    at weblogic.rmi.internal.BasicServerRef$1.run(BasicServerRef.java:477)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
    at weblogic.security.service.SecurityManager.runAs(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.handleRequest(BasicServerRef.java:473)
    at weblogic.rmi.internal.wls.WLSExecuteRequest.run(WLSExecuteRequest.java:118)
    ... 2 more

We obtain the initial context in the standard way:

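import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;

Properties p = new Properties();
p.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
p.put(Context.PROVIDER_URL, serverPath); // URL of the EJB server
Context context = new InitialContext(p);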

Calls on any references obtained from this context also fail with a similar OptionalDataException. Rebooting the presentation server alone resolves the issue temporarily.


Finally, the OptionalDataExceptions are history. In short, a complex value object in our application code (used as a return value for remote method invocations) had a HashMap as an internal field. After we changed this field to a synchronized map (created with Collections.synchronizedMap), the OptionalDataExceptions stopped occurring. It seems that somewhere in the legacy code this map is accessed in a non-thread-safe way.
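A minimal sketch of that change, assuming the value object simply wraps its map; `ResultVO` and its methods are hypothetical stand-ins for the legacy class. One plausible mechanism (an assumption, not confirmed in the answer) is that another thread modified the plain HashMap while it was being serialized as an RMI return value. In the OpenJDK implementation, the wrapper returned by `Collections.synchronizedMap` serializes itself while holding its lock, so writes made through the wrapper cannot interleave with serialization. Iteration over a synchronized map is still not atomic and must be synchronized manually, as the `Collections.synchronizedMap` Javadoc requires.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the legacy value object returned from the EJB.
public class ResultVO implements Serializable {
    private static final long serialVersionUID = 1L;

    // Previously a plain HashMap. Every access through the wrapper,
    // including the wrapper's own serialization, locks the same mutex.
    private final Map<String, Object> attributes =
            Collections.synchronizedMap(new HashMap<String, Object>());

    public void put(String key, Object value) { attributes.put(key, value); }

    public Object get(String key) { return attributes.get(key); }

    // Iterating a synchronized map is not atomic: hold its lock explicitly.
    public List<String> sortedKeys() {
        synchronized (attributes) {
            List<String> keys = new ArrayList<String>(attributes.keySet());
            Collections.sort(keys);
            return keys;
        }
    }

    // Serializes and deserializes a copy, roughly what RMI does with a return value.
    static ResultVO roundTrip(ResultVO vo) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(vo);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ResultVO) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ResultVO vo = new ResultVO();
        List<Thread> writers = new ArrayList<Thread>();
        for (int t = 0; t < 4; t++) {
            final int id = t;
            Thread writer = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    vo.put(id + "-" + i, i);
                }
            });
            writers.add(writer);
            writer.start();
        }
        // Serialize while writers may still be running; each snapshot is internally consistent.
        ResultVO snapshot = roundTrip(vo);
        for (Thread writer : writers) {
            writer.join();
        }
        System.out.println("snapshot keys: " + snapshot.sortedKeys().size());
        System.out.println("final keys: " + roundTrip(vo).sortedKeys().size()); // 4000
    }
}
```

The snapshot size varies from run to run; the point is that serialization sees a consistent map rather than one caught mid-resize. This only protects code that goes through the wrapper, so any legacy code holding a reference to the raw HashMap would still need fixing.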

What is strange is that this caused no problems with WLS 8.1 but somehow caused WLS 10 to enter a state where all subsequent remote method invocations (including JNDI lookups) started to fail.


Finally we found the solution to this (Edit: we later found out that this was not the root cause of the issue but a separate, serious issue. For the final solution, please see the answer above). Once we started receiving the following exception, we got on the track of the cause:

<BEA-000403> <IOException occurred on socket: Socket[addr=/x.x.x.x,port=3266,localport=7001]
 java.net.SocketException: Connection refused.
java.net.SocketException: Connection refused
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at weblogic.socket.SocketMuxer.readReadySocketOnce(SocketMuxer.java:887)
at weblogic.socket.SocketMuxer.readReadySocket(SocketMuxer.java:859)
at weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:120)
at weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)

On the presentation server, which runs on a different host from the EJB server, we had the option

-Dweblogic.NetworkClassLoadingEnabled=true

which, as its name suggests, enables class loading from the EJB server. What we did not know is that this option can result in a huge number of network sockets being opened. Using netstat, we found that several thousand sockets were in either the CLOSE_WAIT or the FIN_WAIT_2 state. It seems that all the elements of the web UI were loaded from the EJB server in addition to the classes, even though the war file on the presentation server contained all of them. The huge number of sockets did not produce "too many open files" errors, because Weblogic lifts the open-file ulimit in its startup script. On a test server we found that a single click by the user in the web UI opened around 30 sockets between the two servers.
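The netstat check can also be scripted so the socket buildup is noticed before the server falls over. Below is a minimal sketch that tallies the TCP states of connections involving a given port (7001, the local port in the log above). The class and method names are hypothetical, the sample lines are made up for illustration, and it assumes the state is the last column, as in `netstat -an` output. It reports state names exactly as netstat prints them (Solaris shows `FIN_WAIT_2`, Linux shows `FIN_WAIT2`).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SocketStateCount {
    // Hypothetical helper: counts the last column (assumed to be the TCP state)
    // of every line in which some column ends with the given port, e.g.
    // "10.0.0.1:7001" (Linux style) or "10.0.0.1.7001" (Solaris style).
    static Map<String, Integer> countStates(List<String> netstatLines, int port) {
        String colonSuffix = ":" + port;
        String dotSuffix = "." + port;
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            boolean onPort = false;
            for (String col : cols) {
                if (col.endsWith(colonSuffix) || col.endsWith(dotSuffix)) {
                    onPort = true;
                    break;
                }
            }
            if (onPort && cols.length > 1) {
                counts.merge(cols[cols.length - 1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not output from the servers described above.
        List<String> sample = Arrays.asList(
            "tcp        0      0 10.0.0.1:7001     10.0.0.2:3266     CLOSE_WAIT",
            "tcp        0      0 10.0.0.1:7001     10.0.0.2:3267     CLOSE_WAIT",
            "tcp        0      0 10.0.0.1:7001     10.0.0.2:3268     ESTABLISHED",
            "tcp        0      0 10.0.0.1:22       10.0.0.9:51000    ESTABLISHED",
            "10.0.0.1.7001   10.0.0.2.3270   49640  0  49640  0  FIN_WAIT_2");
        System.out.println(countStates(sample, 7001));
        // prints {CLOSE_WAIT=2, ESTABLISHED=1, FIN_WAIT_2=1}
    }
}
```

In practice the input would be the output of `netstat -an` on the EJB server host (captured, for example, with `ProcessBuilder`); counts of CLOSE_WAIT and FIN_WAIT_2 that keep growing over days are the symptom described above.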

We removed this option and repackaged the war on the presentation server to contain all the needed classes, removing the need for network classloading. This reduced the number of socket connections between the two servers from thousands to one.

In summary, avoid network class loading in Weblogic if at all possible.
