As it stands, the primary problem with the current join implementation is one of performance. Without going into detail the current implementation is one that uses nested loops. This, being a O(m*n) operation will get slower as the size of the data you're working with increases.
In thinking about the different join algorithms available to tackle this problem, I came to the conclusion that implementing what is essentially a core DBMS component is just too much effort, so being the lazy programmer that I am, I've updated the join implementation to delegating the join functionality a DBMS which day-by-day continues to impress me with its amazing performance and capabilities: SQLite
So now, the updated implementation can be described as follows:
- Create a temporary SQLite database with table structure identical to our "left" and "right" sources.
- Pump the data from our "left" source into this temp SQLite database
- Pump the data from our "right" source into this temp SQLite database
- Create a view in our temp SQLite that encapsulates a join between these "left" and "right" tables
- Do some metadata hacks to make this view be recognised as a FDO feature class.
- Pump the data from this view out to our target data store.
- Old approach: 16 minutes
- SQLite approach: 21 seconds!
I truly <3 SQLite!