We had a problem with the awesome power that the Oracle database offers... or, to be more specific, the third party had a problem with that awesome power.
Setting the stage
Our application handles many, many orders per day, and they are mostly processed in waves: there are peaks and valleys in the number of orders that need to be processed. When the orders are processed, they are sent to a third-party application using a REST service. Informing the third party doesn't have to be "real time", as long as they get the information.
The way we set up our side of the application is as follows:
- When the order reaches a certain status, enough information is placed on a Queue (Advanced Queuing)
- A callback function on the Queue will construct the JSON and do the REST call (using APEX_WEB_SERVICE)
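The setup above can be sketched roughly like this. Everything here is illustrative: the queue name ORDER_EVENT_Q, the subscriber, the payload type and the callback name are assumptions, not our actual objects:

```sql
-- 1) When the order reaches the right status, enqueue a message
--    (order_event_t is a hypothetical payload type):
declare
  l_enq_opts  dbms_aq.enqueue_options_t;
  l_msg_props dbms_aq.message_properties_t;
  l_msg_id    raw(16);
begin
  dbms_aq.enqueue (queue_name         => 'ORDER_EVENT_Q'
                  ,enqueue_options    => l_enq_opts
                  ,message_properties => l_msg_props
                  ,payload            => order_event_t (42) -- hypothetical order id
                  ,msgid              => l_msg_id);
  commit;
end;
/

-- 2) Register a PL/SQL callback on the queue; the database then
--    dequeues in the background and calls the procedure for
--    every message:
begin
  dbms_aq.register
    (sys.aq$_reg_info_list
       (sys.aq$_reg_info ('ORDER_EVENT_Q:ORDER_EVENT_SUB' -- queue:subscriber
                         ,dbms_aq.namespace_aq
                         ,'plsql://mypkg.order_event_cb'  -- the callback
                         ,null))
    ,1);
end;
/
```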
The Problem
The advantage of having the Queue inside the database is that it "auto-scales". Because it runs in the background, the database can spin up as many "sessions" as it wants to perform the callback function and clear the Queue. If the system gets busy, it will scale down the processing of the Queue so that the operational system doesn't slow to a crawl. Simply put: awesome.
However, the third party could only handle a limited number of connections: four, to be precise. The consequence was that when the database spun up many sessions, a lot of them would get an error ("Concurrent request limit exceeded. Request blocked.") and the orders had to be retried.
Solutions?
As we knew that the third party could only handle four "sessions", we preferred to limit the number of sessions that the database would spin up. However, there is (as far as I could tell) no way to tell Oracle "you may use at most four sessions to clear the Queue". Reaching out on Twitter yielded the suggestion to adjust the AQ_TM_PROCESSES parameter, but that controls time-based queue processing, not the number of "sessions".
Another suggestion was to tweak JOB_QUEUE_PROCESSES, but to be honest we didn't try that. Other processes that utilize DBMS_SCHEDULER (e.g. for importing orders) might be affected, and we didn't want to interfere with those.
Thanks to Vidar Eidissen and Philipp Salvisberg for your responses on Twitter.
Our Solution
The method that we implemented was as follows:
- Get an exclusive lock
- Construct the necessary JSON and do the REST Call
- Release the lock
To facilitate this method, I created a table whose only job is to coordinate the locks:

```sql
create table lock_handles
  (handle varchar2(10)
  ,nr     number
  ,constraint lh_pk primary key (handle)
  );

insert into lock_handles values ('LOCK1', 0);
insert into lock_handles values ('LOCK2', 0);
insert into lock_handles values ('LOCK3', 0);
insert into lock_handles values ('LOCK4', 0);
commit;
```

The only purpose this table serves is locking; thinking about it a little more, I doubt that I need more than one column though.
Next, a function to get a lock. Of course this function is placed in a package (for the example, the name of the package is "MYPKG").
```sql
function get_lock
  return lock_handles.handle%type
is
  e_locked exception;
  pragma exception_init (e_locked, -54);

  function get_lock (p_handle in lock_handles.handle%type)
    return lock_handles.handle%type
  is
    l_nr lock_handles.nr%type;
  begin
    select lh.nr
      into l_nr
      from lock_handles lh
     where lh.handle = p_handle
       for update nowait;
    return p_handle;
  end get_lock;
begin
  for i in 1..4
  loop
    begin
      return get_lock ('LOCK'||to_char (i));
    exception
      when e_locked then continue;
    end;
  end loop;
  -- wait for a little bit before attempting
  -- to get the lock again by recursively
  -- calling this function
  sys.dbms_session.sleep (10);
  return mypkg.get_lock;
end get_lock;
```

When calling this function, it will first attempt to lock the row with handle "LOCK1". Instead of waiting when that row is locked (FOR UPDATE NOWAIT), an exception is raised. With the pragma EXCEPTION_INIT, the exception is named E_LOCKED and can thus be handled by name.
When the exception is raised, meaning another session has a lock on that row, the function will attempt to lock the row with handle "LOCK2", and so on.
If no lock can be obtained, because each attempt raises the exception, the function will wait for a little bit (10 seconds) and then call itself recursively to start all over again.
Once a call obtains a lock on one of the rows, the calling program can continue and do the REST call. When the REST call is completed, the lock on the table is released by ending the transaction with a commit.
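Putting the pieces together, the callback could look roughly like this. This is a hedged sketch: the procedure name, the endpoint URL and the build_order_json helper are made up, and the dequeue of the actual message is reduced to a comment:

```sql
-- Hypothetical callback; mypkg.get_lock is the function above.
procedure order_event_cb
  (context  in raw
  ,reginfo  in sys.aq$_reg_info
  ,descr    in sys.aq$_descriptor
  ,payload  in raw
  ,payloadl in number)
is
  l_handle   lock_handles.handle%type;
  l_response clob;
begin
  -- (dequeue the message identified by descr here)

  -- Block until one of the four "slots" is free
  l_handle := mypkg.get_lock;

  l_response := apex_web_service.make_rest_request
                  (p_url         => 'https://example.com/orders' -- placeholder
                  ,p_http_method => 'POST'
                  ,p_body        => build_order_json (descr) -- hypothetical helper
                  );

  -- Ending the transaction releases the row lock,
  -- freeing the slot for the next session
  commit;
end order_event_cb;
```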
Does it work?
So much for the theory, but does it work? A valid question, of course, and my first tests shocked me. I processed a mere 15 orders and, to my surprise, discovered that only the first lock handle, LOCK1, was used.
It dawned on me that the queue could easily be handled by a single session, so the callback would just serialize the processing.
Increasing the load on the Queue did indeed show that all lock handles were used. It was also nice to see that after a certain time, the usage of the lock handles decreased again. It scales to whatever Oracle deems a good method of processing the Queue.
Could it be handled differently? Yes, probably and I'm open for suggestions.
Could the code be improved? Yes, probably and I'm open for suggestions.
Doh!
Just as I was finishing up this blog post, Connor McDonald chimed in on Twitter. Well, yes Connor, yes, it certainly does... The rewritten method is as follows:
```sql
function get_lock
  return varchar2
is
  l_lock integer;
begin
  loop
    for i in 1..4 -- max number of concurrent sessions
    loop
      l_lock := dbms_lock.request
                  (id                => to_number ('223344'||to_char (i)) -- arbitrary locknr
                  ,lockmode          => dbms_lock.x_mode -- exclusive mode
                  ,timeout           => 0 -- don't wait to obtain a lock
                  ,release_on_commit => true -- release when transaction ends
                  );
      if l_lock = 1 -- Timeout
      then
        continue; -- try the next lock
      else
        return 'LOCK'||to_char (i);
      end if;
    end loop;
    -- A variation to the suggestion that
    -- Connor made in the comments.
    -- I've added an extra pause here
    -- before attempting to get another lock
    sys.dbms_session.sleep (10);
  end loop;
end get_lock;
```

No need for the extra table, and the code is a lot shorter.
Nice stuff... but I'd change timeout to (say) 0.2 seconds, otherwise that's potentially a lot of CPU burned when you least want it (i.e. when you're busy)
Or be really fancy :-) and have no timeout for the first pass through the 4, then a non-zero timeout for the second pass etc etc...
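That refinement, no timeout on the first pass and a non-zero timeout on later passes, could be sketched like this. One assumption to note: as far as I know, the timeout parameter of dbms_lock.request takes whole seconds, so the smallest non-zero wait per attempt is 1 second:

```sql
function get_lock
  return varchar2
is
  l_lock    integer;
  l_timeout integer := 0; -- first pass: don't wait at all
begin
  loop
    for i in 1..4 -- max number of concurrent sessions
    loop
      l_lock := dbms_lock.request
                  (id                => to_number ('223344'||to_char (i))
                  ,lockmode          => dbms_lock.x_mode
                  ,timeout           => l_timeout
                  ,release_on_commit => true);
      if l_lock = 0 -- 0 = success, 1 = timeout
      then
        return 'LOCK'||to_char (i);
      end if;
    end loop;
    -- later passes wait a little on each lock instead of
    -- spinning through them at full speed
    l_timeout := 1;
  end loop;
end get_lock;
```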