About Me

A Sun-certified Java professional with proven experience in architecting and designing mid/large-scale Java/JEE applications. Creator of the EasyTest Framework. Extensive experience with technologies such as the Spring Framework, Hibernate, JPA, Java 4-6, REST, SOA, YUI, JUnit, and the Cloud Foundry PaaS.

Tuesday, July 16, 2013

JPA 2 | Fetch Joins and whether we should use them

Introduction

Recently I have been working with FETCH JOINs in JPA 2 to fetch data eagerly from the database, and I learned quite a lot about why we should avoid using Fetch Joins in our day-to-day work.

Today's blog post talks about my experience with Fetch Joins and what I learned (based mostly on the review comments I got when I had lots of FETCH JOINs in my query).

Problem Statement

There was a situation in our project where an Entity being fetched from the database had many Collection Valued Associations defined (OneToMany, ManyToMany, also referred to as ToMany associations). Here is a conceptual picture of what the entity looked like (getters and setters are omitted for clarity). This is an extremely simplified example of the Entity; in the real scenario we had around 11 associations.


import java.util.List;
import javax.persistence.*;

@Entity
public class ItemRecord {

    @Id
    private String itemId;

    @Column(name = "author")
    private String author;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
    private List<Cost> costs;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
    private List<Note> notes;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
    private List<Stat> stats;

}

There are a few things to notice about the above Entity:
  • It has 3 Collection Valued Associations.
  • All these associations are fetched lazily, as the default fetch strategy for Collection Valued Associations in JPA is LAZY.
In our business implementation, we have a translator that takes the value returned by our DAO layer and converts it into a Business DTO.

Our business method thus looked as follows:

@TransactionAttribute
public List<ItemInformation> searchItemRecords(SearchCriteria sc) {

    List<ItemRecord> ir = itemRecordDao.search(sc);
    List<ItemInformation> convertedData = recordConverter.convert(ir);
    return convertedData;
}


Note that the whole method is running inside a Transaction.
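
For context, here is a minimal sketch of what such a DAO search method might look like (the SearchCriteria accessor used here is an assumption made for illustration):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

public class ItemRecordDao {

    @PersistenceContext
    private EntityManager em;

    public List<ItemRecord> search(SearchCriteria sc) {
        // Fetches only the ItemRecord rows; all the ToMany associations stay lazy
        return em.createQuery(
                "SELECT item FROM ItemRecord item WHERE item.author = :author",
                ItemRecord.class)
                .setParameter("author", sc.getAuthor())
                .getResultList();
    }
}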

Whenever we fetched the data from the database, none of the related data for costs, statistics, etc. was fetched eagerly. But our ItemInformation DTO expected all of that data. Therefore, when getCosts or getStatistics was called for the first time, the persistence provider (Hibernate in our case) fired a query to the database to get the requested data. This was creating an N+1 Select query problem for us. If you are unfamiliar with N+1 Selects or need a refresher, you can look at this article on DZone.
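
To make the problem concrete, here is a rough sketch of what the converter does internally (the ItemInformation setters and the exact conversion logic are assumptions); every first-time access of a lazy association fires its own SELECT:

public List<ItemInformation> convert(List<ItemRecord> records) {
    List<ItemInformation> result = new ArrayList<ItemInformation>();
    for (ItemRecord record : records) {
        ItemInformation info = new ItemInformation();
        // Each getter below hits the database once per ItemRecord:
        info.setCosts(record.getCosts()); // SELECT ... FROM Cost WHERE itemId = ?
        info.setNotes(record.getNotes()); // SELECT ... FROM Note WHERE itemId = ?
        info.setStats(record.getStats()); // SELECT ... FROM Stat WHERE itemId = ?
        result.add(info);
    }
    return result;
}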

Solution

Most of us, including me, would choose the fastest and easiest solution to the N+1 select problem, which is to use a Fetch Join. This is also what many blogs and articles around the internet suggest.

So I took the same approach. And it was a BAD approach, at least in my case.

Let's first see how we can use a FETCH JOIN.

The query before using Fetch Join was as follows:

SELECT item FROM ItemRecord item WHERE item.author = :author

Note that the query is written in JPQL.
This query did not fetch the collection values. As a result, when the translator called getCosts for every ItemRecord, a query similar to the one below was fired:

SELECT cost FROM Cost WHERE itemId = :itemId

Thus, if we had 3 ItemRecords, the total number of SELECT queries fired at the database was:

  • 1 for fetching all the ItemRecords
  • 3 for fetching Costs for each ItemRecord
  • 3 for fetching Notes for each ItemRecord
  • 3 for fetching Statistics for each ItemRecord
i.e. (3 × 3) + 1 = 10 queries,
which, when generalised, translates to (N × M) + 1,
where:
  • N is the number of Primary Entity records found
  • M is the number of Collection Valued Associations in the Primary Entity
  • 1 is the single query that fetches all the Primary Entities
In our real scenario, we had 11 associations. Thus, for each Primary ItemRecord Entity, we were firing 11 SELECT queries, and that number multiplied with every ItemRecord found.

Making the Collection Valued Associations EAGER was not an option, as there were a lot of other queries running on the Entity that required only selected data.
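
For reference, making an association eager is just a matter of setting the fetch attribute on the mapping, but it applies to every query that loads the Entity, which is exactly why we rejected it:

@OneToMany(cascade = CascadeType.ALL, mappedBy = "item", fetch = FetchType.EAGER)
private List<Cost> costs;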

Firing (N × M) + 1 queries is not optimal. Something had to be done about it, and a lot of internet articles suggested using FETCH JOINs, as they are the easiest fix to implement and solve the (N × M + 1) query problem.

So I decided to use FETCH JOINs in my query to fetch all the data eagerly for the given scenario.
The query looked similar to:

SELECT item FROM ItemRecord item
 JOIN FETCH item.costs
 JOIN FETCH item.notes
 JOIN FETCH item.stats
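
Conceptually, a fetch join collapses everything into a single SQL query. The actual SQL is provider-generated, but its shape is roughly as follows (table and column names assumed from the mapping):

SELECT item.*, cost.*, note.*, stat.*
FROM ItemRecord item
JOIN Cost cost ON cost.itemId = item.itemId
JOIN Note note ON note.itemId = item.itemId
JOIN Stat stat ON stat.itemId = item.itemId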


Hurdle 1

I realised pretty quickly that this query would not work in my scenario: I can have an ItemRecord that does not have any statistics associated with it, and in such a case the above query would not return that ItemRecord (a plain JOIN FETCH behaves like an inner join, so a matching row must exist on both sides), and therefore I would get incomplete data.

So I next moved to a LEFT JOIN FETCH, which returns ItemRecord entities even when some of the associated relationships are empty. The query looked like this:

SELECT item FROM ItemRecord item
 LEFT JOIN FETCH item.costs
 LEFT JOIN FETCH item.notes
 LEFT JOIN FETCH item.stats
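
As an aside, fetch-joining collections duplicates each parent row in the resultset, so in practice this kind of query is usually written with DISTINCT so that the persistence provider de-duplicates the returned ItemRecord instances:

SELECT DISTINCT item FROM ItemRecord item
 LEFT JOIN FETCH item.costs
 LEFT JOIN FETCH item.notes
 LEFT JOIN FETCH item.stats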

Hurdle 2

When I ran my unit tests to test the above query, I got this exception:

javax.persistence.PersistenceException: org.hibernate.HibernateException: cannot simultaneously fetch multiple bags

What is the problem?

The problem is that I am using List as the collection type in my Entity. Hibernate maps a List that has no index column as an unordered "bag", and it cannot fetch more than one bag in a single query because, in the joined resultset, it has no way to tell which duplicated rows belong to which collection. This confusion is nicely documented in this article.

To work around this problem, I chose to use Set instead of List, mainly because it was the easiest of the three solutions provided by the above blog post and it also made sense (at least while I was implementing it).


So I changed my Entity to use Set as the collection type instead of List, and the modified entity looked like this:

import java.util.Set;
import javax.persistence.*;

@Entity
public class ItemRecord {

    @Id
    private String itemId;

    @Column(name = "author")
    private String author;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
    private Set<Cost> costs;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
    private Set<Note> notes;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
    private Set<Stat> stats;

}

I ran my test cases again, and the test case that was testing the query succeeded. Yippie. BUT another test case, one that was testing the entries in the notes section, failed. Looking at the test case, I realised that I needed the data in a specific order, the order in which it was entered, and I was using HashSet, which is not ordered. The solution was simple: use LinkedHashSet, which maintains the insertion order of the elements.

Using LinkedHashSet did the trick and my test case passed. 
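
In code the change is tiny. One caveat worth noting: when Hibernate itself populates the collection on load, declaring the field as a LinkedHashSet is not enough on its own; the usual way to pin down the ordering is @OrderBy on the association, which makes Hibernate fill an insertion-ordered set. The sketch below assumes a createdDate property on Note:

@OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
@OrderBy("createdDate") // assumed property; loads the notes in entry order
private Set<Note> notes = new LinkedHashSet<Note>();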

I was very happy but my happiness was short lived. 

Hurdle 3

I had another test case that expected 3 Cost objects for a given ItemRecord. As soon as I moved to the Set implementation, that test started failing. It turned out I had incorrect hashCode and equals implementations for my Cost Entity, which made two different entities look identical, and as a result only one entity was ever persisted, since a Set does not allow duplicate values.

So the next thing for me to do was to implement proper hashCode and equals methods for all my Entities.
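
As an illustration, here is a minimal identifier-based equals/hashCode sketch for the Cost Entity (the costId field is an assumption; real entities may need a business key instead):

@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (!(o instanceof Cost)) {
        return false;
    }
    Cost other = (Cost) o;
    // Two Cost entities are equal only when they share a non-null identifier
    return costId != null && costId.equals(other.costId);
}

@Override
public int hashCode() {
    return costId == null ? 0 : costId.hashCode();
}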



Final Hurdle

Finally, when all my test cases were passing, I put the change up for code review and sent it out to the team.
The first to freak out was my Tech Lead. :)
He was simply furious when he saw the FETCH JOINs. The reason? Fetch-joining multiple collections returns a cartesian product of ALL the joined data: with, say, 3 costs, 3 notes and 3 statistics per item, every ItemRecord comes back as 3 × 3 × 3 = 27 rows, and that multiplication only gets worse as the collections grow. With the amount of data we have in production, it would have become a nightmare to even support multiple selects on the ItemRecord. The whole problem is explained nicely in this blog post.

Thus, while trying to solve a performance problem, I had actually created a bigger one. :)


The whole solution of moving to FETCH JOINs was dropped, and it was decided to investigate further why we need the entire data set on the UI and why we can't divide fetching the data into smaller, dedicated transactions.



Summary:

The whole process of working with Fetch Joins gave me good insight into how joins work in general and what to expect when using them.

I hope you enjoyed the blog post. If you want to keep reading interesting posts, you can follow my blog.  


5 comments:

Lieven Doclo said...

Ah, the cartesian product pitfall. There are actually some interesting workarounds for this:
- You can fire multiple queries, each fetching a different to-many relationship. If this is done within the same transaction, Hibernate's first-level cache will ensure the incremental population of the collections.
- You can tell Hibernate not to use joins when fetching relationships, but rather a separate select query for each of the relationships.
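
A rough sketch of both workarounds, reusing the entities from the post (the @Fetch annotation is Hibernate-specific, and the snippets are illustrative rather than drop-in code):

// Workaround 1: one fetch join per query, all inside the same transaction;
// the first-level cache merges the collections onto the same ItemRecord instances.
List<ItemRecord> items = em.createQuery(
        "SELECT DISTINCT item FROM ItemRecord item LEFT JOIN FETCH item.costs",
        ItemRecord.class).getResultList();
em.createQuery("SELECT DISTINCT item FROM ItemRecord item LEFT JOIN FETCH item.notes",
        ItemRecord.class).getResultList();
em.createQuery("SELECT DISTINCT item FROM ItemRecord item LEFT JOIN FETCH item.stats",
        ItemRecord.class).getResultList();

// Workaround 2: keep the association lazy, but load it with one extra SELECT
// for the whole result set instead of one SELECT per ItemRecord.
@OneToMany(cascade = CascadeType.ALL, mappedBy = "item")
@Fetch(FetchMode.SUBSELECT)
private Set<Cost> costs;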

Anuj said...

We were already firing multiple queries at the DB within the same transaction, and that was the actual problem we intended to solve.

What's the difference between doing a SELECT join and fetching relationships lazily by calling the get** method on the parent entity?

MichaelB said...

Perhaps you should step back and examine your relationship schema.

A single object that has many, many relationships is usually a sign of poor design.

Perhaps your encounter at hurdle 1 is illustrative: "I can have an ItemRecord that does not have any statistics associated with it". That either means you don't need the data and there is no problem, or some data is associated with the wrong entity or relationship.

ItemRecord seems overly generic, but that might just be the example.

Another comment I'd like to make is that it isn't all or nothing: you can fetch join some fields and generate +N queries for the rest. Depending on what gets cached (I don't know the usage patterns), some extra queries against things that are cached might be OK. Besides, it's not an invasive optimization (as opposed to switching to Sets), and improvement is improvement; it might be OK for now.

Anuj said...

That's indeed correct, MichaelB: the design of the Entity and its relations is probably poor. But it's beyond my scope to change/modify the schema.

With the Statistics thing, it means that the relationship is nullable, but I need the data if it is present.

I agree that I could fetch only some data initially and then fetch the rest when needed; that also sounds useful. I will try it out.

Just out of curiosity, what do you think is a better way of designing a schema where an entity has a lot of bidirectional (or maybe unidirectional) OneToMany relationships?

Anonymous said...

I think you will not get the cartesian product if you specify how the tables should join:
left outer join table A on ...