Monday, 21 November 2011

Improving the Feature Join experience

So what was this screenshot all about?


Time for an explanation. This screenshot was the result of an experimental mg-desktop addition to make feature joins perform much faster (around 100 times faster as evidenced by the screenshot), by taking advantage of new FDO Join APIs that were recently introduced with the 3.6 release. Why does the FDO Join APIs perform so much faster? Time for a little backstory...

Why Feature Joins suck (performance wise)

Feature Joins are performed by the GWS Query Engine component of MapGuide. From a high-level overview, this component takes two input iterators (with interfaces similar to a FDO/MG Feature Reader), and depending on the capabilities of the left and right sides of the join a "merged" feature reader is returned that will loop the two iterators in the following fashion:
Now these join algorithms themselves are also sensitive to the data you're working with. For example, any sorted block algorithm is going to perform bad on datasets with high-spread of join key values (low-spread of join keys means lots of sorted duplicate joins values which can be skipped). Sort merge has issues regarding case-sensitivity of key values (though this may be a bug in the implementation). Nested Loops is obviously terrible if both sides are large (as is the case of a SDF to SDF join as per the above screenshot)

MapGuide in its current form is not aware of join factors which impact performance:
  • The size of both sides of the join (Nested loops is obviously bad, but the inability to determine size means this can't be avoided if this is the designated algorithm)
  • The spread of join key values (Sorted blocks work best on lookup-value type spreads not foreign-key type spreads)
It is all of the above factors which make having good performing feature joins a difficult proposition. ^

The FDO Join API (or: Let the data store figure it out)

So it's clear that MapGuide does a terrible and/or erratic job as a query optimizer, so assuming we want respectable performance of joined data, you would've probably avoided Feature Joins altogether and have the bits you want to join in question residing in the same centralised RDBMS, do the joins at the database level and wrap the result in a database view.

This approach is a good solution in terms of performance because figuring out the best way to join data is now a data store responsibility, and you can apply a whole assortment of indexes and other DBMS performance improvements that MapGuide is not aware of. From MapGuide's perspective all it sees is a table/view and it just queries that. But this approach presents its own problems:
  • The need for metadata hacks for views to be properly recognised as a feature class (eg. SDO_GEOM_METADATA in Oracle)
  • The type of view affecting selectability (eg. SQL Server, which requires schema overrides to work around)
With the introduction of FDO Join APIs, it allows us to do these type of joins in an ad-hoc fashion without needing to wrap our joined data into database views. Joining data with the FDO Join APIs is basically the equivalent of doing a SQL SELECT query with JOIN clauses. In other words, the FDO Join APIs lets you delegate join optimization to the data store and let it figure out the best performing way to do it.

Taking advantage of FDO Joins in MapGuide (with RFC123)

So as already explained, work on bringing support for this into MapGuide started with mg-desktop. Being essentially a self-contained version of MapGuide meant that mg-desktop was a nice sandbox for playing and experimenting with new features. So once support was built into mg-desktop, bringing it over to MapGuide proper was quite an easy affair.

So how does mg-desktop and MapGuide take advantage of FDO Joins?

We use the existing Extended Feature Class mechanism to construct an equivalent FDO Join select query, applying the proper property/class aliasing to ensure that the reader that is returned presents the same property list as a default feature join query.

When querying an Extended Feature Class, MapGuide does several checks to see if the FDO Join optimization path can be taken. These checks include:
  • Does the underlying FDO provider for the Feature Source support FDO Joins? (obvious one!)
  • Does the primary and joined feature classes of this Extended Feature Class originate from the same feature source? (this is important, because FDO Joins work within the context of the same connection, thus the classes we are joining must come from the same source)
  • A whole bunch of other minor conditions which your Extended Feature Class will most likely satisfy (detailed in the RFC)
If the Extended Feature Class being queried passes the above checks, MapGuide takes the FDO Join path, which as seen from the screenshot above, can be up to 100 times faster than the regular Feature Join approach.

Additionally, Extended Feature Classes also double up as the metadata. You don't need to do SDO_GEOM_METADATA hacks or FDO Schema Overrides for these classes to be recognised. Because MapGuide already understands these elements out of the box!

Also, in case you're wondering, the performance numbers in MapGuide are even better than mg-desktop (up to 1s faster than the numbers in the above screenshot) due to connection pooling and object caching mechanisms present in the MapGuide Server code.

Conditions Apply

Now as you can see, only certain Extended Feature Classes can take advantage of this optimization. The other limitation is FDO provider support. As of writing this post, the following providers implement the FDO Join APIs:
  • SQL Server
  • SQLite
Given the high usage of these providers (you are using SQLite for all your spatial data aren't you?), there is a good chance of this optimization path being used.

The beauty of FDO is that as soon as an FDO provider implements the FDO Join APIs (like King.Oracle wink, wink, nudge, nudge), such configured Extended Feature Classes in MapGuide can automatically take advantage of this optimization path.

So there you have it. 100x faster feature joins if your data ticks all the boxes.

^ There is a 5th type of join algorithm that is (suprisingly) not implemented by the GWS Query Engine: Hash Joins. Though probably the most memory intensive join algorithm (due to needing to store one side of the join in a hash-table), this should always perform better than the default nested loops. Anyway, it's 2011. Memory is cheap! Having this join type implemented should improve the Feature Join user story even further. Guess that's another item on the TODO list.

No comments: