How to get distinct results in hibernate with joins and row-based paging?

After some time wondering why my batch jobs producing sometimes different results, when i compared to the SQL count or sum aggregate values. I found the answer in a Post on Stackoverflow.

The reason why this happens is explained here. It has something to do with the way in which Hibernate uses joins on collections and materialization of objects as a result of a query. That saves time but the drawback is that the result contains probably some duplicates if you use pagination on object which have collection properties that are eagerly joined.

As mentioned before, on Stackoverflow you find a solution which ships around the problem by preserving the eagerly joined (fetched) properties.

I will now explain a different solution, where you can use the normal query and pagination method without having the problem of possibly duplicates or suppressed items.

This Solution has the advance that it is:

faster than the PK id solution mentioned in the article on Stackoverflow
preserves the Ordering and don’t use the in clause on a possibly large Dataset of PK’s

Hibernate gives the possibility to define the association fetching method not only at design time but also at runtime by a query execution. So we use this aproach in conjunction with a simple relfection stuff and can also automate the process of changing the query property fetching algorithm only for collection properties.

First we create a method which resolves all collection properties from the Entity Class:

public static List<String> resolveCollectionProperties(Class<?> type) {
List<String> ret = new ArrayList<String>();
try {
BeanInfo beanInfo = Introspector.getBeanInfo(type);
for (PropertyDescriptor pd : beanInfo.getPropertyDescriptors()) {
if (Collection.class.isAssignableFrom(pd.getPropertyType()))
ret.add(pd.getName());
}
} catch (IntrospectionException e) {
e.printStackTrace();
}
return ret;
}

After doing that you can use this little helper method do advise your criteria object to change the FetchMode to SELECT on that query.

Criteria criteria = …

// … add your expression here …

// set fetchmode for every Collection Property to SELECT
for (String property : ReflectUtil.resolveCollectionProperties(YourEntity.class)) {
criteria.setFetchMode(property, org.hibernate.FetchMode.SELECT);
}
criteria.setFirstResult(firstResult);
criteria.setMaxResults(maxResults);
criteria.list();

Doing that is different from define the FetchMode of your entities at design time. So you can use the normal join association fetching on paging algorithms in you UI, because this is most of the time not the critical part and it is more important to have your results as quick as possible.

For my batch jobs using springsource spring-batch project i use this aproach to get the precise paginated list of items for my query.

Hope that you find that useful!

1 Kommentar

Hello and thanks for writing this article 🙂
I really appreciate that people are sharing their knowledge,

I was successful in applying your method in order to obtain distinct results and do pagination.

All well, until I was asked to add a filter on my collection, and that is the point where this solution does not apply.

Because, basically it does at the end a query like this:
select * from class c, pupil p where c.id=p.class_id and p.name ‚%Alex%‘

and this inevitably returns results that are not distinct, it returns a http://en.wikipedia.org/wiki/Cartesian_product

So I had to do everything from the beginning, and exclude the usage of Criteria, DetachedCriteria and so on.

I settled for HQL and did this query:
select distinct c from class c, pupil p where c.id=p.class_id and p.name ‚%Alex%‘

Which works fine and doesn’t give headaches.

But there is an another solution:
1. Use same approach as in you article
2. In order to apply the restriction use a sub-query:
Restrictions.sqlRestriction(„({alias}.id in (select id_class from pupils where name like ?))“, „%“ + pupilName + „%“, StringType.INSTANCE);

This of course will run much slower than the traditional query with the join mixed in, just because the number of results returned by the sub-query can be big.

And the other drawback is of course the usage of standard SQL, which contradicts with the logic behind HQL – one query language to be used with all databases.

Hope this helps somebody 😉

Antworten