2016/07/22

Why you don't want to parse clob using instr/substr functions

A very strange issue we have hit recently with a report which completion time could suddenly rise two or three times and then drop to some normal half an hour.
The report is a pretty complex query with lots of PL/SQL function calls, which in its turn calls another query with another PL/SQL function and so on. There were three levels of recursive pl/sql calls: sql(top)->pl/sql->sql(1)->pl/sql->sql(2)->pl/sql->sql & pl/sql in a jumble.

As it appeared, when the report is slow, it consumes CPU only, and all V$ views (like ASH or v$sql) show as active only the query from the level "sql(2)". Also v$sesstat didn't show any interesting anomalies (except some lob and data changing activities (db block changes), which did not give any new idea at that time).

As Tanel Poder once said, if your code constantly burns CPU, you have to trace Oracle kernel (it isn't the exact quote, just my attempt to express his message).

I started with truss, just to learn that it produced almost empty output in my case. No OS calls were observed.

pstack was not available, the last resort was oradebug.

I ended up with the following command:

oradebug PDUMP interval=1 ndumps=60 CALLSTACK 1

So, what? What can I get from that highly encrypted trace files?
The first thing I observe was "plsql_run" kernel function, which appeared exactly the same number of times as pl/sql recursive calls in my report. Actually, when I tested the report and made an oradebug trace, I run the report from anonymous PL/SQL block, so the trace below will show one additional "plsql_run" call in the top (actually bottom) of the stack.
Observation of "plsql_run" gave me some confidence that I can decrypt this trace somehow. To move further I had to find well known but unavailable on MOS Note 175982.1. And, thank God, it can be googled easily. Also, it was worth to refresh my knowledge about how to use all this stuff. Here it is.

Below is edited stack trace of Oracle kernel calls. At the right of some kernel function, I placed, what I believe, corresponds to my report's sql and pl/sql calls. It was not a rocket science after all. I gathered pl/sql source codes, queries, query plans and put it all together.

Received ORADEBUG command (#64) 'DUMP CALLSTACK 1 0x0 _seq=60 ndumps=60' from process 'Unix process pid: 6815810, image: '
Executing periodic dump 60 of 60

----- Call Stack Trace -----
calling              
location             
-------------------- 
skdstdst()+40        
ksedst1()+112        
ksedst_tracecb()+48  
ksdxdmp()+544        
ksdxfdmp()+132       
ksdxcb()+3432        
sspuser()+116        
__sighandler()       oradebug dump CALLSTACK
kcbinvbrhb()+312     
kcbivbr()+596        
kcblibr()+76         manages Oracle's buffer cache operation
kdlrsp()+924         
kdl_destroy()+928    
kdlt_freetemp()+148  lob and temporary lob management
kokltfr()+448        
kokltfc()+16         lob access
koltitfr()+612       
kolttfr()+136        support for object Lob buffering PKGPROV.get_last_ord() temporary_quote.tmp_quote (CLOB field operations)
pfrcvfc_format_conv  
pevm_CNVMSC()+28     
pfrinstr_CNVMSC()+6  
pfrrun_no_tool()+96  
pfrrun()+1032        
plsql_run()+616      PKGPROV.get_last_ord
peidxr_run()+280     
peidxexe()+104       
kkxdexe()+352        
kkxmpexe()+244       
kgmexwi()+584        
kgmexec()+1388       
evapls()+736         
evaopn2()+1284       
evaopn2()+4492       
evaopn2()+4224       
evaopn2()+4224       
evaopn2()+4224       
evaopn2()+4224       
evaopn2()+3712      
qerixGetKey()+4132   Index row source |* 68 |            INDEX UNIQUE SCAN          | XPKTAB_OBJ
qerixStart()+1832    
qerjotFetch()+1096   
qerjotFetch()+392    
qerghFetch()+324     
qervwFetch()+148     
rwsfcd()+292         
qeruaFetch()+404    
opifch2()+2312       row source : union-All SQL_ID  7p6225u9gc7ka |   1 |  UNION-ALL
opiall0()+2196       
opikpr()+460         
opiodr()+908         
rpidrus()+216        
skgmstack()+160      
rpidru()+132         
rpiswu2()+520        
kprball()+1044       
kkxsarfch()+568      
kkxsfcn()+556        
pevm_icd_call_commo  
pfrinstr_ICAL()+140  
pfrrun_no_tool()+96  
pfrrun()+1032        
plsql_run()+616      table(CALCS.GET_PAR(ci.id, 113852902)) t
peicnt()+324        
kkxuexe()+468        
kkxmpsexe()+400      
kgmexwi()+584        
kgmexec()+1388       
evapls()+736         
evaopn2()+1284       
qerocImageIterStart  object collection iterator row source |  38 | COLLECTION ITERATOR PICKLER FETCH| GETCALCVAL
qerocStart()+504     
qerjotFetch()+1096   
qerjotFetch()+392    
qerjotFetch()+392    
qernsFetch()+624     
rwsfcd()+100         
qeruaFetch()+404     
subsr1()+288         
subsr3()+364         
IPRA.$evaopn3()+697  
evanvl()+956         
evaopn2()+1284       
opifcr()+5500        
qervwRowProcedure()  
qerghFetch()+1372    
qervwFetch()+148     qervw - view row source SQL_ID  bb8mfnvw4agxp |  41 |  VIEW | VM_NWVW_3
opifch2()+2312       
opiall0()+2196       
opikpr()+460         
opiodr()+908         
rpidrus()+216        
skgmstack()+160      
rpidru()+132         
rpiswu2()+520        
kprball()+1044       
kkxsarfch()+568      
kkxsfcn()+556        
pevm_icd_call_commo  
pfrinstr_ICAL()+140  
pfrrun_no_tool()+96  
pfrrun()+1032        
plsql_run()+616      CALCS.GET_PAR(P_ID => A.OBJ_ID, P_A_ID => 313006642)
peicnt()+324        
kkxuexe()+468        
kkxmpsexe()+400      
kgmexwi()+584        
kgmexec()+1388       
evapls()+736         
evaopn2()+1284       
qerocImageIterStart  SQL_ID  3kp00z3rvsjf0 |   1 |  COLLECTION ITERATOR PICKLER FETCH| GET_PAR | 1 |
qerocStart()+504     object collection iterator row source
subsr1()+208         
subsr3()+364         
IPRA.$evaopn3()+697  
evaopn3()+92         
opifcr()+5808        
qertbFetchByRowID()  tab by rowid SQL_ID  3kp00z3rvsjf0 |*  2 |  TABLE ACCESS BY INDEX ROWID | TAB_OBJ
opifch2()+2312       
opifch()+48          
opipls()+4248        
opiodr()+908         
rpidrus()+216        
skgmstack()+160    
rpidru()+132         
rpiswu2()+520        
rpidrv()+1060      recurcive sql
psddr0()+448         
psdnal()+472       binds
pevm_BFTCHC()+284    
pfrinstr_BFTCHC()+1  
pfrrun_no_tool()+96  
pfrrun()+1032        
plsql_run()+616    top level anonymous block
peicnt()+324         
kkxexe()+484         
opiexe()+17356       
kpoal8()+1872        
opiodr()+908         
ttcpip()+1028        
opitsk()+1612        
opiino()+940         
opiodr()+908         
opidrv()+1132        
sou2o()+136          
opimai_real()+560    
ssthrdmain()+276     
main()+204           
__start()+112        

So, my report was constantly doing some LOB manipulation! Very soon I found some CLOB column which was fetched inside PKGPROV.get_last_ord(). And that CLOB was parsed using instr() and substr() functions. It was pretty good example of the "SQL Semantics" usage. It works, but how much does it cost? You can read more about this in the famous "Oracle PL/SQL Programming, Sixth Edition by Steven Feuerstein with Bill Pribyl" book, paragraph "SQL semantics may yield temporary LOBs"

Taking this, a quick workaround appeared in my mind. In my case CLOB column temporary_quote.tmp_quote contained not so long pieces of text, 6K at maximum. It was possible to declare a local variable with VARCHAR2(32768), assign CLOB column content to it, and just do the same parsing but with local VARCHAR2 variable.

See below Tomas Kyte's RUNSTATS utility style comparison for CLOB manipulation part of the report. Run1 is the original approach, Run2 is using a varchar2 local variable. The difference is huge!

Statistic                                     Run1             Run2
---------------------------------- --------------- ----------------
Elapsed (santisec)...............:          124349              743
CPU (santisec)...................:           69551              404

One more thing. If you have noted that CPU time is much smaller than elapsed time, it is not due to some IO waiting time or similar. All these experiments were conducted on AIX Power 7 server. So, this article gives some explanation.