Thursday 26 May 2016

MapGuide Open Source 3.1 pre-flight check part 3 (a wild roadblock appeared!)

Stop the presses. We ran into a blocker!

A particularly nasty one, which affects not just MapGuide on CentOS, but on Ubuntu as well.

Something slipped into MapGuide or the WMS FDO provider that causes mgserver to segfault when one makes a GETFEATUREPROVIDERS request, during which a WFS/WMS FDO connection is created to inspect its capabilities. Previously, I thought it was a quirk of the internal copy of OpenSSL in the FDO source, so FDO was built against the system-installed copy of OpenSSL (like we did for Ubuntu). But the problem persisted on CentOS, and it turned out it was segfaulting on Ubuntu as well!

Running this under gdb didn't help us at all. It only showed that the SIGSEGV happens at a function named ?? in libWMSProvider.so on CentOS, and at a function named _init in libWMSProvider.so on Ubuntu. Not very helpful.

Even the venerable valgrind couldn't help me here. Putting mgserver under valgrind and triggering the segfault gives me this incomprehensible gibberish:

==11521== Thread 13:
==11521== Jump to the invalid address stated on the next line
==11521==    at 0x6E8A6: ???
==11521==    by 0x4275D9F: FdoConnectionManager::CreateConnection(wchar_t const*) (ConnectionManager.cpp:334)
==11521==    by 0x4BBB543: MgServerGetFeatureProviders::AddConnectionProperties(xercesc_3_1::DOMElement*, wchar_t const*) (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4BBADD3: MgServerGetFeatureProviders::CreateFeatureProvidersDocument() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4BB9E23: MgServerGetFeatureProviders::GetFeatureProviders() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4B67285: MgServerFeatureService::GetFeatureProviders() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4B303A4: MgOpGetFeatureProviders::Execute() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4B19AE6: MgFeatureServiceHandler::ProcessOperation() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x809768A: MgOperationThread::ProcessOperation(MgServerStreamData*) (OperationThread.cpp:397)
==11521==    by 0x8095C1B: MgOperationThread::ProcessMessage(ACE_Message_Block*) (OperationThread.cpp:226)
==11521==    by 0x8094536: MgOperationThread::svc() (OperationThread.cpp:90)
==11521==    by 0x623521B: ACE_Task_Base::svc_run(void*) (in /usr/local/mapguideopensource-3.1.0/lib/libACE.so)
==11521==  Address 0x6e8a6 is not stack'd, malloc'd or (recently) free'd
==11521== 
==11521== 
==11521== Process terminating with default action of signal 11 (SIGSEGV)
==11521==  Bad permissions for mapped region at address 0x6E8A6
==11521==    at 0x6E8A6: ???
==11521==    by 0x4275D9F: FdoConnectionManager::CreateConnection(wchar_t const*) (ConnectionManager.cpp:334)
==11521==    by 0x4BBB543: MgServerGetFeatureProviders::AddConnectionProperties(xercesc_3_1::DOMElement*, wchar_t const*) (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4BBADD3: MgServerGetFeatureProviders::CreateFeatureProvidersDocument() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4BB9E23: MgServerGetFeatureProviders::GetFeatureProviders() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4B67285: MgServerFeatureService::GetFeatureProviders() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4B303A4: MgOpGetFeatureProviders::Execute() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x4B19AE6: MgFeatureServiceHandler::ProcessOperation() (in /usr/local/mapguideopensource-3.1.0/server/lib/libMgServerFeatureService-3.1.0.so)
==11521==    by 0x809768A: MgOperationThread::ProcessOperation(MgServerStreamData*) (OperationThread.cpp:397)
==11521==    by 0x8095C1B: MgOperationThread::ProcessMessage(ACE_Message_Block*) (OperationThread.cpp:226)
==11521==    by 0x8094536: MgOperationThread::svc() (OperationThread.cpp:90)
==11521==    by 0x623521B: ACE_Task_Base::svc_run(void*) (in /usr/local/mapguideopensource-3.1.0/lib/libACE.so)

I admit to not being an expert-level C++ developer, so when I can't find the usual causes of a SIGSEGV (e.g. a null pointer dereference) in gdb/valgrind output, I am often left scratching my head. Google searches on these error messages weren't much help.

The good thing was that this problem only happens with FDO revisions made after 4.0 was branched and released (we're targeting FDO 4.1 here). So in the absence of expert-level C++ knowledge, it was time to be systematic and track down the FDO revision since the 4.0 branch that introduced this breakage. Time to apply that binary search again! After several wasted days of trial-and-error compilation and testing, we found the offending revision.
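
The binary search over revisions can be sketched roughly as follows. This is only an illustration of the idea, not the actual procedure or tooling that was used: `FirstBrokenRevision` and the `isBroken` callback are hypothetical names, and `isBroken` stands in for the manual cycle of building FDO at a given revision and checking whether mgserver segfaults. The sketch assumes one known-good revision, one known-bad revision, and that every revision after the culprit is broken too.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: returns the first broken revision in (good, bad].
// Assumes `good` is known to work, `bad` is known to fail, and that once
// the breakage is introduced, all later revisions stay broken.
long FirstBrokenRevision(long good, long bad,
                         const std::function<bool(long)>& isBroken)
{
    assert(good < bad); // the known-good revision must precede the known-bad one
    while (bad - good > 1)
    {
        long mid = good + (bad - good) / 2;
        if (isBroken(mid))
            bad = mid;  // culprit is at mid or earlier
        else
            good = mid; // culprit is after mid
    }
    return bad; // good and bad are now adjacent, so bad is the culprit
}
```

With N candidate revisions this takes roughly log2(N) build-and-test cycles, which is what makes the approach tractable even when each cycle is a slow full compile.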

At face value, this commit looks innocuous enough: a few new files and some modifications. Nothing in the new and modified files looked like dangerous C++ code to my padawan-tier knowledge.


But looking at the overall list of files, something caught my eye. This commit included a project file update, but only for Windows (the modified WMSProvider.vcxproj file). Normally, when new files are added, not only should the Windows vcxproj file be updated (to include the new .h and .cpp files), but so should the corresponding Makefile.am files for Linux, and that update was conspicuously absent.

This lit a light bulb in my head. I had a look at the Makefile.am file for the WMS provider and, lo and behold, it didn't list the new .cpp and .h files from this commit. Then it all came together.

The WMS provider was segfaulting because it was being built with undefined symbols: references to the new FdoWmsGetFeatureInfoFormats class, whose source files were never compiled on Linux. Updating the Makefile.am with the missing files and re-compiling the WMS provider made the segfault go away.

And that was the tale of how I cleared the final roadblock to releasing MapGuide Open Source 3.1 (smoke tests with this change all pass, and the beta 1 binaries are being built as this blog post is being published). On Windows with the MSVC toolchain, such a problem would've manifested at link time as LNK2001 unresolved external symbol errors. So it surprised me big time that gcc or ld did not error out about the undefined symbols and immediately break the build, which would've caught this problem much sooner!

It turns out that having gcc/ld error out on undefined symbols is a flag you have to opt in to (with GNU ld, for example, --no-undefined, also spelled -z defs). The more you know. Thanks to Johan for the heads up.

So you can now expect the next post to be about ...
