Guilty as charged, until proven innocent, right?
Well, it is not that nothing can't go wrong in the database, but I wish it would stop be treated as the black sheep of the family.
Recently, I had been in a situation where application was constantly complaining for bad response time, day after day. First thing to explore was whether application changed its behaviour, since most of the time that is the cause for performance issues. Surprisingly enough, it did not. No new SQLs, no major changes in execution count.
Second, I went to look for database changing its behaviour. But it didn't. No changes in execution plan (usually major cause for performance degradation). As much as I have looked, from the database side, it looked like database didn't change its behavior.
So what could be wrong?
Storage can go wrong for example, and that was the case here.
Looking for problems in the wrong place will not assist in solving them. But just saying "it is not the database" probably won't help. You will probably always have to prove it...
Average physical read/write time is what oracle experiencing when dealing with IO. But underneath these numbers, two more tiers are involved. Storage and File System.
Well, after checking average physical read time and average physical write time of several files in this database, it was clear (unfortunately, just to me for a while) that writes became much slower, at exactly the time when the application have started to complain.
Don't think it was easy. For many days Storage guys continues to say "Nothing is wrong with the storage" (how many times have you heard this? I wonder), but finally, when numbers became so high, it couldn't be ignored any more.
At last, spotlight went to spot on the right place (and stopped pointing to the wrong place, the database). Well, probably just until the next complaint will rise again...but I guess I can't win it all..
Here is a query that can be run in Oracle 10g and above. It might look scary, but in short, it's just go over dba_hist_filestatxs (where file statistics are held overtime), and since numbers there are incremented every snapshot (it doesn't hold the deltas), I have used LAG analytic function to subtract current value from preceding value (i.e, current snapshot from preceding snapshot), and I did it for every statistics I wanted to observe.
That's it.
select * from (
select
b.begin_interval_time begin_interval_time ,
nullif((phywrts - ( lag(phywrts,1,0) over( order by a.snap_id ) )),0) delta_phywrts,
trunc( (writetim*10 - ( lag(writetim*10,1,0) over( order by a.snap_id ) ) ) /
nullif((phywrts - ( lag(phywrts,1,0) over( order by a.snap_id ) )),0) ,3 ) avg_write_ms,
nullif((phyrds - ( lag(phyrds,1,0) over( order by a.snap_id ) )),0) delta_phyrds,
trunc( (readtim*10 - ( lag(readtim*10,1,0) over( order by a.snap_id ) ) ) /
nullif((phyrds - ( lag(phyrds,1,0) over( order by a.snap_id ) )),0) ,3 ) avg_read_ms,
nullif((singleblkrds - ( lag(singleblkrds,1,0) over( order by a.snap_id ) )),0) delta_singleblkrds,
trunc( (singleblkrdtim*10 - ( lag(singleblkrdtim*10,1,0) over( order by a.snap_id ) ) ) /
nullif((singleblkrds - ( lag(singleblkrds,1,0) over( order by a.snap_id ) )) ,0),3 ) avg_singleblkrds_ms
from dba_hist_filestatxs a ,
dba_hist_snapshot b
where filename = 'my_file.dbf'
and a.snap_id=b.snap_id
)
where to_char(begin_interval_time,'HH24') between 7 and 11 -- in case you wish to see only specific hours
and to_char(begin_interval_time,'HH24') not in ('FRIDAY','SATURDAY ') -- in case you wish to filter nonworking days
order by begin_interval_time;
Below is part of the real life output, where major changes in average write time can clearly be seen, especially since 24.5. I have taken statistics of a specific data file, each time between 09:30-12:00, in normal working days (not in a weekend). Snapshot interval is 15 minutes. (For simplification, although above query shows reads and writes statistics, below I am showing only writes, since this is the issue relevant for this post).
on 7.5, average write time was between 2-7 ms.
on 16.5, average write time was between 6-11 ms.
on 21.5, average write time was between 6-14 ms.
on 24.5, average write time was between 12-21 ms.
on 13.6, average write time was very very bad. around 40 ms write time. This is 10 times slower than a month ago!
Finally, on 6/7, average write time went back to normal. went down to around 5 ms.
Pay attention to the increasing values of the 3rd column, avg_write_ms, which shows the average physical write time of a specific datafile, for every snapshot.
BEGIN_INTERVAL_TIME | DELTA_PHYWRTS | AVG_WRITE_MS |
=================== | ============= | ============ |
07-MAY-12 09:30:05 | 1970 | 3.395 |
07-MAY-12 09:45:10 | 2007 | 4.489 |
07-MAY-12 10:00:12 | 1679 | 2.829 |
07-MAY-12 10:15:17 | 2120 | 3.037 |
07-MAY-12 10:30:19 | 2355 | 2.169 |
07-MAY-12 10:45:23 | 1609 | 2.094 |
07-MAY-12 11:00:24 | 750 | 7.533 |
07-MAY-12 11:15:26 | 737 | 2.713 |
07-MAY-12 11:30:28 | 779 | 3.465 |
07-MAY-12 11:45:29 | 625 | 2.448 |
16-MAY-12 09:30:48 | 572 | 6.73 |
16-MAY-12 09:45:49 | 645 | 5.395 |
16-MAY-12 10:00:51 | 576 | 8.072 |
16-MAY-12 10:15:52 | 720 | 10.513 |
16-MAY-12 10:30:54 | 622 | 11.495 |
16-MAY-12 10:45:55 | 745 | 7.208 |
16-MAY-12 11:01:00 | 692 | 9.205 |
16-MAY-12 11:15:01 | 920 | 10.793 |
16-MAY-12 11:30:03 | 839 | 8.116 |
16-MAY-12 11:45:05 | 755 | 8.172 |
21-MAY-12 09:30:54 | 2156 | 8.409 |
21-MAY-12 09:45:57 | 2016 | 14.627 |
21-MAY-12 10:00:59 | 1877 | 10.436 |
21-MAY-12 10:15:03 | 2224 | 6.753 |
21-MAY-12 10:30:06 | 1145 | 12.043 |
21-MAY-12 10:45:10 | 869 | 12.105 |
21-MAY-12 11:00:12 | 774 | 11.989 |
21-MAY-12 11:15:14 | 857 | 8.191 |
21-MAY-12 11:30:16 | 737 | 8.71 |
21-MAY-12 11:45:18 | 826 | 12.421 |
24-MAY-12 09:30:27 | 661 | 14.19 |
24-MAY-12 09:45:29 | 715 | 12.881 |
24-MAY-12 10:00:30 | 800 | 15.637 |
24-MAY-12 10:15:32 | 716 | 12.835 |
24-MAY-12 10:30:33 | 895 | 19.307 |
24-MAY-12 10:45:37 | 799 | 19.148 |
24-MAY-12 11:00:38 | 716 | 21.983 |
24-MAY-12 11:15:41 | 841 | 21.878 |
24-MAY-12 11:30:44 | 724 | 15.138 |
24-MAY-12 11:45:46 | 653 | 15.773 |
13-JUN-12 09:30:15 | 24608 | 24.476 |
13-JUN-12 09:45:17 | 24249 | 31.427 |
13-JUN-12 10:00:21 | 20164 | 39.09 |
13-JUN-12 10:15:23 | 19807 | 38.68 |
13-JUN-12 10:30:24 | 19572 | 38.088 |
13-JUN-12 10:45:26 | 18812 | 41.594 |
13-JUN-12 11:00:29 | 19003 | 40.993 |
13-JUN-12 11:15:33 | 21660 | 44.534 |
13-JUN-12 11:30:34 | 23242 | 38.171 |
13-JUN-12 11:45:36 | 21055 | 42.248 |
06-JUL-12 09:30:50 | 3057 | 6.319 |
06-JUL-12 09:45:51 | 3406 | 5.733 |
06-JUL-12 10:00:52 | 3880 | 8.103 |
06-JUL-12 10:15:53 | 4324 | 5.381 |
06-JUL-12 10:30:54 | 5409 | 4.906 |
06-JUL-12 10:45:55 | 2714 | 5.036 |
06-JUL-12 11:00:56 | 2723 | 4.697 |
06-JUL-12 11:15:57 | 2434 | 7.781 |
06-JUL-12 11:30:58 | 3804 | 4.91 |
06-JUL-12 11:45:59 | 2243 | 5.737 |
Ladies and gentlemen of the jury, what is your verdict?
We have found the defendant NOT GUILTY, your honor.
To sum it all up, as stated in the beginning, Database is not always to blame. This time the jury found it innocent.
:)
First thing to check is the application. If you see that applications is behaving more or less the same (no new SQLs, same execution count per SQL), check the database.
And if the database is behaving more or less the same, it is simply not there.
Something else is the source of the problem.
As always, comments are most welcome..
Merav Kedem,
Founder of
DB OPTimize
Oracle Performance Tuning & DBA Consulting
It is very simple, Applications are rarely correctly instrumented. Operating systems are barely instrumented and networks less so.
ReplyDeleteThus the well instrumented Database in between can become a useful source of information about what goes on above and bellow.
The simple reason that the DBA gets kicked so often is because the apps guys probably can't answer questions about their own application!
We need to be immensely clear, just because the Database recorded the problem that doesn't mean it caused the problem!