Friday, November 11, 2016

Hyperion Planning JVM Crash Frequently Asked Questions

Purpose:

A detailed discussion regarding frequently asked questions about Hyperion Planning Crashes

    Why does the Planning JVM crash?
    Why doesn’t it crash right away?
    We have nn GB memory on our Planning Server, taking this into consideration what is the comfortable number of cells that we can increase to?
    What is the current heap size set for my Planning Server?
    How can I avoid the crash?
    How many cells can the application query without problems?  What value should the governor be set to?
    Why does one persons mistake have to impact all of the other users?
    How do I know a large query was run?  Who may have run it?  When?
    What do we search in the logs for?


Questions and Answers:
Why does the Planning JVM crash?

Most of the time the planning JVM crashes because users run a query that is too large for the system to handle.

Most frequently this is an ad hoc query from Smart View.  This is one of the most common design considerations that Planning customers and consultants must tackle.

Some of us remember when computer processors were 64k and available space was barely 1 mg, so there is always the hope that technology will eventually allow these large queries but think about trying to send 3 million cells through the JVM container, it’s like trying to drive 3000 miles on a single tank of gas!  It is unrealistic although it would be really nice if we could.

Usually when a review is done of the large query, the large size is related to the number of cells a form might have (basically the product of rows times columns) and suppress missing (a setting in the form) does NOT reduce the size of the query sent to the server. Occasionally you can prevent a crash using suppress missing blocks but that does not apply to every situation.

Note: you cannot use suppress missing blocks in ad hoc mode if there is an Essbase transparent partition prior to release 11.1.2.3.502


Why doesn’t it crash right away?

The system will try very hard not to go down as long as it possibly can, how long it takes can vary but it WILL eventually go down (if you don’t restart it first) once a huge query is launched. It may appear not to be fully down for a while though.

Full CPU or RAM will not be reached, instead, it reaches the capacity of the JVM container, which is set at the web app level using the -Xmx heap size setting for each Planning web app service that is running.


We have nn GB memory on our planning server, taking this into consideration what is the comfortable number of cells that we can increase to?

Unfortunately your server can be a million GB and not help. Instead, this is a combination of design, heap size (-Xmx) allocated to the Planning JVM, and cell count. There is a point when the cell count is just too big even if all the other components are maximized.

You really want to analyze the form itself and meet the business need in a different way.

That said you want to ensure your heap size is not too high or too low.

    Usually customers will set that to 2GB or 4GB but it never is set to the full capacity of the server. Throwing more memory at is usually not the resolution for this issue.  In fact, going beyond 6GB or 8GB in many implementations, can have an adverse effect.
    Conversely, less than 1GB would be the default setting and is too low. This should always be checked initially with every new installation.


What is the current heap size set for your Planning?

You can verify the same by the below steps:

Navigate to oracle/middleware/user_projects/epmsystem1/bin/deploymentscript: Open >  setcustomparamsPlanning.sh

Verify the below parameter USER_MEM_ARGS="-xms2048m -xx:Permsize=64m -xx:Maxpermsize=256m -xmx4096m"




How can I avoid the crash?
a) Set the Governor

To prevent the crash if you are on the 11.1.2.3.501 version of Planning or higher you can enable the system property ERROR_THRESHOLD_NUM_OF_CELLS.  250,000 is a comfortable starting point. You do not want it to be too high or it will again crash the server. You will want to shrink the size of the query instead.

The following is a KM article that explains this governor in more detail. How it calculates the number and how to set it. Doc ID 1951571.1 Error Opening Planning Web Forms : "Unable to execute this operation as it exceeded the configured threshhold"

At least with the governor in place you will prevent these crashes and users will be able to log into the system.
b) Redesign the form

Most of the time final users don't want to use a governor because they simply WANT to run large queries.

It is important to remember that Planning is used for budgeting and planning, so the initial goal would be to NOT use forms for reporting purposes or data mining. 

The most common mistake is to try and use the planning and budgeting product as a way to hunt for data. It is very likely, if they think about it,  the business users know more specifically where the slice of data that they are looking for is located (so they can narrow the form scope; i.e. move largest row or column dimension to the page) but it is so much easier to design one form that looks "everywhere" for them. This is sometimes referred to as a lazy form because it is very easy to create and there is no maintenance involved. However, this should be replaced by specific forms, with a specific business purpose, focused on a select slice of the overall data in the system for the given portion of the budget being worked on at the time. This has the added benefit of controlling security access to portions of the data otherwise hard to secure.

As an example, if you have idescendants of DimA and DimB in the rows and there are 1000 members in each dimension you are starting with 1 million rows. Even if there are only 4 columns, this form has a cell count of 4 million (this example is oversimplified for the sake of discussion)

To reduce the size of the form, the most common solution is to create a form where one of these dimensions (preferably the largest sparse dimension) becomes the anchor dimension for the form. So, the adjusted form is based on 1, or 10 or however many members in that DimA. Then the other dimension remains in the row.  1000 rows is much less than 1000 times 1000. If you use the page dimension to toggle between your available members in DimA and analyze and budget accordingly you won’t run a query that is 4 million cells. At the onset, there may be a need to design a few extra forms, but the benefit is never running a query that is too large for the JVM container to process.
c)  Disable ad-hoc for the user(s)

Another way to prevent the huge query from running would be to disable the ad hoc option from forms for any user who has run large queries. This is an option in the form design. If the user does not have access to the ad hoc form they cannot run a query that is too large for the JVM container to process.


How many cells can the application query without problems?  What value should the governor be set to?

There is no set maximum. You have to tune and experiment to find what your maximum number is because how robust a form is, is based on multiple design factors like sparsity of the data (whether the dimensions in the row and column are sparse or dense and if they are very populated or less populated), cell count, JVM tuning, suppress missing blocks, etc.  

Recently, 250,000 has not crashed the JVM unless the heap size was set uncharacteristically low.


Why does one persons mistake have to impact all of the other users?

Unless you load balance, all of the Planning processing is running on the same thread or process. There is only one process that is going down, all of the planning users will be subject to the ramifications of the restart of the  service.


How do I know a large query was run?  Who may have run it?  When?



We do not usually have access to an exact form name in the logs but we usually have clues as to what may have caused the issue. We use the Planning logs to narrow it down.


Locations for the logs:

\oracle\middleware\user_projects\domains\epmsystem\servers\Planning0\logs

\Oracle\Middleware\user_projects\epmsystem1\diagnostics\logs\services

\Oracle\Middleware\user_projects\epmsystem1\diagnostics\logs\servers

\oracle\middleware\user_projects\domains\epmsystem\servers\Planning0\logs
Keep in mind if the Planning server is load balanced there will be multiple copies of these logs to look through.
Possible Names of the logs:

    HyS9Planning-sysout.log
    Planning_WebApp.log
    Planning0.log
    HyS9Planning-syserr.log
    start-Planning0-out

If log archiving has occurred there may be slight variations of log names above that need to be reviewed depending on how recent this issue occurred. (i.e. Planning_webapp-27.log Planning0.log00004 Planning0.log00127 planning1.log)


What do we search in the logs for?

In older releases or implementations with minimal logging we might see the following in the planning sys logs:

    java.lang.Exception: Webgrid was null!

Or the Planning_webapp might only show a Smart View query

    SRC_CLASS: com.hyperion.planning.HspSmartViewProvider

Or it may have more information associated with that JVM stack ie:

    An error occurred processing incoming XML (HspSmartViewServlet) request.[[
    java.lang.reflect.InvocationTargetException
    com.hyperion.planning.db.HspFMDBImpl.getWebGrid
    com.hyperion.planning.smartview.XMLAdhocGrid2.getWebGrid
    com.hyperion.planning.HspSmartViewProvider.adHocValidateQuery
    HspSmartViewServlet.java
    HspSmartViewServlet.doPost
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    java.util.Arrays.copyOf
    com.hyperion.planning.olap.HspWebGrid.setGrid

These will have a time stamp (i.e. [2015-10-22T16:05:20.911-05:00] usually at the top of the stack trace which can be helpful if you are able to determine the time of the initial occurrence

Planning webapp, planning0 or Planning syserr and sysout logs may also show even more useful information under the First stuck below.

In this example you can tell that  the query was launched at 4:15:41 minus 1893 seconds... so it was launched at 3:44:08

####<Oct 22, 2015 4:15:41 PM CDT> <Error> <WebLogicServer> <hypfnd1> <Planning0> <[ACTIVE] ExecuteThread: '23' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1445548541272> <BEA-000337> <[STUCK] ExecuteThread: '11' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "1,893" seconds working on the request "Workmanager: default, Version: 1, Scheduled=true, Started=true, Started time: 1893060 ms

[POST /HyperionPlanning/SmartView HTTP/1.1

Accept: */*

Accept-Language: es-ES

Accept-Encoding: gzip

ORA_EPM_SVCLIENT_CLIENTIP: 192.168.3.229

ORA_EPM_SVCLIENT_EXTENSIONS: Oracle.SmartView.OBIEE.JBIPSProvider;AF52322C-D60B-41f1-A8C8-0C299FBB0AA4

Content-Encoding: gzip

User-Agent: HttpApp/1.0

Content-Length: 3246

Cache-Control: no-cache

Cookie: ORA_HP_MRUApplication=PTTEST; ORA_HP_MRUUsername=llopez;

JSESSIONID=…;

ORA_EPMWS_User=llopez;

ORA_EPMWS_Locale=es_ES; ORA_EPMWS_AccessibilityMode=false; ORA_EPMWS_ThemeSelection=BpmTadpole

ECID-Context: 1.0005H5KfyD76uHWjLxZR8A0001Ob0007lT;kXjE1ZDLIPIKj8HUkLQSdVQRGSPOiUPQiNRBJTPOmKULdLRT

Connection: Keep-Alive

X-Forwarded-For: 192.168.3.229

Proxy-Client-IP: 192.168.3.229

X-WebLogic-KeepAliveSecs: 30

X-WebLogic-Force-JVMID: -1124059291

]", which is more than the configured time (StuckThreadMaxTime) of "1,200" seconds. Stack trace:

com.hyperion.planning.sql.GenericCache.getCache(GenericCache.java:149)
com.hyperion.planning.sql.GenericCache.getObject(GenericCache.java:294)
com.hyperion.planning.sql.GenericCache.getObject(GenericCache.java:217)
com.hyperion.planning.db.HspDEDBImpl.getDimMember(HspDEDBImpl.java:1618)
com.hyperion.planning.db.HspDEDBImpl.getMemberByName(HspDEDBImpl.java:9977)
com.hyperion.planning.db.HspDEDBImpl.getMemberByName(HspDEDBImpl.java:9995)
com.hyperion.planning.db.HspDEDBImpl.getMemberByName(HspDEDBImpl.java:9991)
com.hyperion.planning.db.HspFMDBHelper.createWebGrid(HspFMDBHelper.java:470)
com.hyperion.planning.db.HspFMDBHelper.createWebGrid(HspFMDBHelper.java:26)
com.hyperion.planning.db.HspFMDBHelper.createWebGrid(HspFMDBHelper.java:1125)
com.hyperion.planning.db.HspFMDBHelper.createWebGrid(HspFMDBHelper.java:1121)
com.hyperion.planning.db.HspFMDBImpl.getWebGrid(HspFMDBImpl.java:578)
com.hyperion.planning.db.HspFMDBImpl.getWebGrid(HspFMDBImpl.java:565)
com.hyperion.planning.smartview.XMLAdhocGrid2.getWebGrid(XMLAdhocGrid2.java:823)
com.hyperion.planning.HspSmartViewProvider.adHocValidateQuery(HspSmartViewProvider.java:6859)

Reference:2071231.1

1 comment: