Avoid DataNucleus joins?
I'm experimenting with moving a JDBC webapp to JDO DataNucleus 2.1.1.
Assume I have some classes that look something like this:
public class Position { private Integer id; private String title; }
public class Employee { private Integer id; private String name; private Position position; }
The contents of the Position SQL table really don't change very often. Using JDBC, I read the entire table into memory (with the ability to refresh periodically or at-will). Then, when I read an Employee into memory, I simply retrieve the position ID from the Employee table and use that to obtain the in-memory Position instance.
However, using DataNucleus, if I iterate over all Positions:
Extent<Position> extent =pm.getExtent(Position.class, true);
Iterator<Position> iter =extent.iterator();
while(iter.hasNext()) {
Position position =iterPosition.next();
System.out.println(position.toString());
}
And then later, with a different PersistenceManager, iterate over all Employees, obtaining their Position:
Extent<Employee> extent =pm.getExtent(Employee.class, true);
Iterator<Employee> iter =extent.iterator();
while(iter.hasNext()) {
Employee employee =iter.next();
System.out.println(employee.getPosition());
}
Then DataNucleus appears to produce SQL joining the two tables when I obtain an Employee's Position:
SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = <1>
My understanding is that DataNucleus will use a cached Position instance, when available. (Is that correct?) However, I'm concerned that the joins will degrade performance. I'm not yet far enough along to run benchmarks. Are my fears misplaced? Should I continue, and benchmark? Is there a way to have DataNucleus avoid the join?
<jdo>
<package name="com.example.staff">
<class name="Position" identity-type="application" schema="MYSCHEMA" table="Position">
<inheritance strategy="new-table"/>
<field name="id" primary-key="true">
<column name="ID" jdbc-type="integer"/>
</field>
<field name="title">
<column name="TITLE" jdbc-type="varchar"/>
</field>
</class>
</package>
</jdo>
<jdo>
<package name="com.example.staff">
<class name="Employee" identity-type="application" schema="MYSCHEMA" table="EMPLOYEE">
<inheritance strategy="new-table"/>
<field name="id" primary-key="true">
<column name="ID" jdbc-type="integer"/>
</field>
<field name="name">
<column name="NAME" jdbc-type="varchar"/>
</field>
<field name="position" table="Position">
<column name="POSITION_ID" jdbc-type="int" />
<join column="ID" />
</field>
</class>
</package>
</jdo>
I guess what I'm hoping to be able to do is tell DataNucleus to go ahead and read the POSITION_ID int as part of the default fetch group, and see if the corr开发者_StackOverflowesponding Position is already cached. If so, then set that field. If not, then do the join later, if called upon. Better yet, go ahead and stash that int ID somewhere, and use it if getPosition() is later called. That would avoid the join in all cases.
I would think that knowing the class and the primary key value would be enough to avoid the naive case, but I don't yet know enough about DataNucleus.
With the helpful feedback I've received, my .jdo is now cleaned up. However, after adding the POSITION_ID field to the default fetch group, I'm still getting a join.
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME",A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID
I understand why it is doing that, the naive method will always work. I was just hoping it was capable of more. Although DataNucleus might not read all columns from the result set, but rather return the cached Position, it is still calling upon the datastore to access a second table, with all that entails - including possible disk seeks and reads. The fact that it will throw that work away is little consolation.
What I was hoping to do was tell DataNucleus that all Positions will be cached, trust me on that. And if for some reason you find one that isn't, blame me for the cache miss. I understand that you'll have to (transparently) perform a separate select on the Position table. (Even better, pin any Positions you do have to go fetch due to a cache miss. That way there won't be a cache miss on the object again.)
That is what I'm doing now using JDBC, by way of a DAO. One of the reasons for investigating a persistence layer was to ditch these DAOs. It is difficult to imagine moving to a persistence layer that can't move beyond naive fetches resulting in expensive joins.
As soon as Employee has not only a Position, but a Department, and other fields, an Employee fetch causes a half dozen tables to be accessed, even though all of those objects are already pinned in the cache, and are addressable given their class and primary key. In fact, I can implement this myself, changing Employee.position to an Integer, creating an IntIdentity, and passing it to PersistenceManager.getObjectByID().
What I think I'm hearing is that DataNucleus is not capable of this optimization. Is that right? It's fine, just not what I expected.
By default, a join will not be done when the Employee entity is fetched from the datastore, it will only be done when Employee.position is actually read (this is called lazy loading).
Additionally, this second fetch can be avoided using the level 2 cache. First check that the level 2 cache is actually enabled (in DataNucleus 1.1 it is disabled by default, in 2.0 it is enabled by default). You should probably then "pin" the class so that the Position entities it will be cached indefinitely:
The level 2 cache can cause issues if other applications use the same database, however, so I would recommend only enabling it for classes such as Position which are rarely changed. For other classes, set the "cacheable" attribute to false (default is true).
EDITED TO ADD:
The <join> tag in your metadata is not suitable for this situation. In fact you don't need to specify the relationship explicitly at all, DataNucleus will figure it out from the types. But you are right when you say that you need POSITION_ID to be read in the default fetch group. This can all be achieved with the following change to your metadata:
<field name="position" default-fetch-group="true">
<column name="POSITION_ID" jdbc-type="int" />
</field>
EDITED TO ADD:
Just to clarify, after making the metadata change descibed above I ran the test code which you provided (backed by a MySQL database) and I saw only these two queries:
SELECT 'com.example.staff.Position' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`TITLE` FROM `POSITION` `THIS` FOR UPDATE
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`NAME`,`THIS`.`POSITION_ID` FROM `EMPLOYEE` `THIS` FOR UPDATE
If I run only the second part of the code (the Employee extent), then I see only the second query, without any access to the POSITION table at all. Why? Because DataNucleus initially provides "hollow" Position objects and the default implementation of Position.toString() inherited from Object doesn't access any internal fields. If I override the toString() method to return the position's title, and then run the second part of your sample code, then the calls to the database are:
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`NAME`,`THIS`.`POSITION_ID` FROM `EMPLOYEE` `THIS` FOR UPDATE
SELECT `A0`.`TITLE` FROM `POSITION` `A0` WHERE `A0`.`ID` = <2> FOR UPDATE
SELECT `A0`.`TITLE` FROM `POSITION` `A0` WHERE `A0`.`ID` = <1> FOR UPDATE
(and so on, one fetch per Position entity). As you can see, there are no joins being performed, and so I'm surprised to hear that your experience is different.
Regarding your description of how you hope caching should work, that is how the level 2 cache ought to work when a class is pinned. In fact, I wouldn't even bother trying to pre-load Position objects into the cache at application start-up. Just let DN cache them cumulatively.
It's true that you may have to accept some compromises if you adopt JDO...you'll have to relinquish the absolute control that you get with hand-rolled JDBC-based DAOs. But in this case at least you should be able to achieve what you want. It really is one of the archetypal use cases for the level 2 cache.
Adding on to Todd's reply, to clarify a few things.
A <join> tag on a 1-1 relation means nothing. Well it could be interpreted as saying "create a join table to store this relationship", but then DataNucleus doesn't support such a concept since best practice is to use a FK in either owner or related table. So remove the <join>
A "table" on a 1-1 relation suggest that it is stored in a secondary table, yet you don't want that either, so remove it.
You retrieve Position objects, so it issues something like
SELECT 'org.datanucleus.test.Position' AS NUCLEUS_TYPE,A0.ID,A0.TITLE FROM "POSITION" A0
- You retrieve Employee objects, so it issues something like
SELECT 'org.datanucleus.test.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME" FROM EMPLOYEE A0
Note that it doesn't retrieve the FK for the position here since that field is not in the default fetch group (lazy loaded)
- You access the position field of an Employee object, so it needs the FK retrieving (since it doesn't know which Position object relates to this Employee), so it issues
SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM EMPLOYEE A0 LEFT OUTER JOIN "POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = ?
At this point it doesn't need to retrieve the Position object since it is already present (in the cache), so that object is returned.
All of this is expected behaviour IMHO. You could put the "position" field of Employee into its default fetch group and that FK would be retrieved in step 4, hence removing one SQL call.
精彩评论