Fun With Deployments: Troubleshooting Foundation Services Issues in Oracle EPM 11.2.x: The Config.xml Puzzle
Anyone who has worked with EPM systems for a while will tell you that sometimes things just don’t work as expected. This is particularly true with WebLogic components. In EPM 11.2.x, they are, for lack of a better term, temperamental.
During a previous upgrade to 11.2.15, everything worked as expected, and all system components passed multiple rounds of testing and validation by a bevy of users.
In our story, a couple of weeks pass, and suddenly the second node of Foundation Services stops working. The logs showed that the component would not go into RUNNING state, and trying to force that state from the WebLogic Admin console failed as well.
Normally in a situation like this, I simply clear the /tmp, /cache, and /log folders and restart things, and this (a sort of ‘soft’ redeployment) fixes the issue. However, in this case it didn’t help things.
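For readers who would rather script that cleanup than do it by hand, here is a minimal Python sketch. The function name, the folder names, and the idea of passing in a single server directory are my assumptions for illustration; point it at wherever your managed server keeps these folders, and run it only with the services stopped.

```python
import shutil
from pathlib import Path


def clear_folders(server_dir, names=("tmp", "cache", "logs")):
    """Empty each named subfolder of server_dir, keeping the folders themselves.

    The folder names are assumptions; adjust them to match your install.
    Returns the list of folders that were actually cleared.
    """
    cleared = []
    for name in names:
        folder = Path(server_dir) / name
        if not folder.is_dir():
            continue  # skip folders that do not exist on this host
        for entry in folder.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # remove a subdirectory and its contents
            else:
                entry.unlink()  # remove a plain file or a symlink
        cleared.append(folder)
    return cleared
```

Calling clear_folders() with the path to the managed server's directory returns the folders it emptied, which makes it easy to confirm nothing was silently skipped.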
Taking the next step, I redeployed on the server using the configuration utility for EPM. Now, this step had been done after the patch application as stipulated in the ReadMe for the patch. But in this re-run, it failed, consistently, with an ‘Illegal State’ error in the logs. A search of the Oracle Support Knowledge Base (KB) yielded no results.
So, what to do next?
Well, in my case, I decided to log in to the WebLogic Admin Server console and delete the entry for FoundationServices1. Then I re-ran the configuration utility and redeployed the instance. This time I hit paydirt and the deployment was successful. Upon starting the services, though, I started getting ‘struts’ errors and complaints about the config.xml file entries.
Another quick perusal of the Oracle KB articles pointed to the WebLogic Admin server not running when the service was restarted for the first time…but it was running.
Where to next?
Well, in my case I had two Foundation servers: one (the WebLogic Admin server) that was playing nice, and one that was not. Based on this, I decided to compare the config.xml files between the servers and found they were dramatically different. The config.xml on the non-working host also contained a ‘struts’ section that the working one did not.
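Any diff tool will do for this comparison. As a rough sketch, assuming you have already copied both config.xml files somewhere you can read them side by side, Python's standard difflib module produces a unified diff. The function name and the file paths are hypothetical.

```python
import difflib
from pathlib import Path


def diff_config(good_path, bad_path):
    """Return unified-diff lines between a known-good and a suspect config.xml."""
    good = Path(good_path).read_text(encoding="utf-8").splitlines()
    bad = Path(bad_path).read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            good,
            bad,
            fromfile=str(good_path),
            tofile=str(bad_path),
            lineterm="",  # lines were split without newlines, so add none back
        )
    )


# Example (hypothetical file names):
# for line in diff_config("admin_host_config.xml", "node2_config.xml"):
#     print(line)
```

Lines starting with "+" exist only in the suspect file, so an unexpected section like the one I found shows up immediately. Two identical files return an empty list.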
This led me to copy the ‘working’ config.xml over to my non-working server host and then restart services again. This resolved the issue, and Foundation Services went into RUNNING state as expected.
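If you try the same fix, keep a copy of the file you are about to overwrite. Below is a minimal sketch, assuming the good config.xml has already been transferred to the non-working host; the function name and the backup naming scheme are mine, not anything Oracle prescribes.

```python
import shutil
import time
from pathlib import Path


def replace_config(good_copy, target):
    """Back up target next to itself, then overwrite it with good_copy.

    Returns the path of the backup so the change can be rolled back.
    """
    target = Path(target)
    stamp = time.strftime("%Y%m%d%H%M%S")
    backup = target.with_name(f"{target.name}.{stamp}.bak")
    shutil.copy2(target, backup)     # preserve the suspect file first
    shutil.copy2(good_copy, target)  # then drop the known-good file in place
    return backup
```

Keeping the old file around also leaves you something to hand to Oracle Support if you later open a service request.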
So what gives here? Well, all I can say for certain is that the config.xml should be getting updated by the WebLogic Admin server in these situations. In this case, despite multiple server OS and application service restarts, it was not.
This could simply be a software issue, but I have suspicions around modern security software. Oracle notes that things like virus scanners and anti-malware tools should be disabled during configuration activities, and I’ve seen firsthand that simple directory exclusions and bypass modes do not work as advertised. In this case, I believe these utilities were preventing the server from ‘pulling’ the good config.xml from the WebLogic Admin server.
This was an interesting case, because while not that difficult to resolve, it was time consuming. It points to the fact that EPM systems are complex, layered, and intertwined, and not always easy to troubleshoot.
As always, it is a pleasure to share these experiences with readers of this blog, and if iArch Solutions can help with your EPM or Cloud systems, please reach out and let us know. We’re always here to help.