While working on some customer support issue on reporting in my project, I got to know few things which i would like to share with my readers.
The issue was on a partucular day Customer was not able to generate user access report for all the users of his organization. However the next day they were able to do generate them.
I analyzed production logs and could not find something fishy. So I debugged the app and tried to reproduce the same issue. I tried to generate report with about 4000 users after 2 min i canceled it and tried to generate report for 1000 users, then for 100 users than for 10 users and finally for only 1 user.
But to my surprise report was not getting generated even for 1 user.
While debugging i realzed there were full table scan due and with increated data it was taking some more time. However the catch was, when user keeps on canceling data before getting the output of one report, the Database thread/request is not canceled. On subsiquent request they all gets queued up.
While debugging on my local box as the part of investigation, I observed below facts:
1. When cancel is pressed in user access report wizard – the DB call is not terminated, only UI moves to admin page. Hence, on subsequent report generation the query gets queued up. Which makes report very slow and more you retry by cancelling, more items gets queue up in DB server resulting in endless waiting. Probably on 23 Nov and 6 Oct, there were lot of requests and retries must have happened due to which report could not be generated, however next day is zero items in queue, reports were generating.
2. There was full table scan on app event table which made the query cost very expensive, hence for 100+ users it very slow and for all users like 4665 it was impossible to get the report within tolerable time frame.
3. By introducing the index on session id – full table scan was avoided and response was quite faster for about 100
+ users data was retrieved within 100 seconds and for 4665 (1000+) – firstly report was getting generated that too within 10 min time.
4. There are million of rows getting inserted with every single day which is impacting the report generation for user activity report. Hence we should come up with solution to archive older data and put them in de normalized table, probably on weekly basis using some crone job. So that older data could be fetched faster.
5. As such there in no programmatic logs in the code line of user access report. So whenever generate report is clicked or cancelled nothing is traceable.
As a part of solution design below action items should be considered :-
1. Index session id column on app event and further optimization of table.
2. Terminating the DB process on click of cancel instead of allowing it to get queued up on Database server.
3. Some archival strategy to deal with the increasing data load on APP_EVENT table.
4. Logs should get added for better monitoring.