2017/12/21

An ORA-07445 troubleshooting case

When somebody adopts fresh Oracle releases, she or he will certainly hit some fresh bug, which no one has registered or described before.
And there is no choice except trying to dig deeply into available trace information to get some clue of how to deal with this situation. I mean a dev team is not going to wait until SR will be resolved in some indefinite future.

So, it was a huge query with more than 5 page downs, and it failed to work. MOS search gave nothing for ORA-07445 first argument as well as Google search.
But still, it is not the time for giving up. Oracle kernel error stack actually can provide a lot of interesting stuff.

Here is an excerpt from ORA-07445 incident trace file:

----- Call Stack Trace -----
calling              call     entry                
location             type     point                
-------------------- -------- -------------------- 
skdstdst()+45        call     kgdsdst()            
                                                   
                                                   
...
__sighandler()       call     sslsshandler()       
                                                   
                                                   
                                                   
kkoordi()+670        signal   __sighandler()       
                                                   
                                                   
kkocnp()+279         call     kkoordi()            
                                                   
                                                   
kkooqb()+1882        call     kkocnp()             
                                                   
                                                   
kkoqbc()+2438        call     kkooqb()             
                                                   
                                                   
apakkoqb()+182       call     kkoqbc()             
                                                   
                                                   
apaqbdDescendents()  call     apakkoqb()           
+488                                               
                                                   
apaqbd()+135         call     apaqbdDescendents()  
                                                   
                                                   
kkqcbyGetCost()+749  call     apaqbd()             
                                                   
                                                   
kkqcbydrvPreUA()+29  call     kkqcbyGetCost()      
89                                                 
                                                   
qksqbApplyToQbcLoc(  call     kkqcbydrvPreUA()     
)+635                                              
                                                   
qksqbApplyToQbcLoc(  call     qksqbApplyToQbcLoc(  
)+987                         )                    
...

A function kkoordi() is that which failed. Going through the stack, some human-readable names can be found. And that is what can help to get in the right way. Just two names gave a couple of idea of what to do next.
kkqcbyGetCost() - is certainly about Cost-Based Optimizer. And it means that the error must be reproducible by EXPLAIN PLAN statement, which was easily confirmed.
qksqbApplyToQbcLoc() - seems to apply something to a query block, that block which caused the error! So, turning on 10053 trace event and executing EXPLAIN PLAN one more time provides the second major piece of information. A tail of 10053 trace-file showed the query block (subquery) which caused the error while costing its different join orders.

So simple in this case! And it provided all necessary information for developers to rewrite just one subquery and eliminate the issue.