I work for SAP Labs and every once in a while get great problems to work on with our customers. One such customer recently had come to us asking for a root cause and solution to an intriguing problem they faced in their production SAP ECC system. I blog about it to highlight some of the very interesting issues that arise in a large SAP installation due to various reasons - end-user behavior, slow performance and deep technical issues. The issue was in the standard shipment process in SAP SD. In the customer system, the process in question started with the end-user running transaction VT01N to create a Shipment. After a few days, once the necessary paperwork was completed, the user would call transaction VT02N to perform PGI and Invoicing. This setup used to work fine till about a couple of months back when, on a completely random basis, a few shipments ended up with duplicate delivery and invoices. This caused a problem with the customers credit management system and hence became very critical indeed.
To begin analysis, we had few clues to start off with. No common characteristic or pattern was observed between the affected shipments. We had a look at all the user/customer exits and BAdIs to see if customer code was the culprit. Zilch, we didn't find any z-code which might play havoc with the standard flow. And since the whole problem was not reproducible, we did not have much possibility to debug. A few initial attempts at debugging the whole shipment->delivery->invoice process were efforts in vain.
Our first hint for root cause came from ST03N which we pored through religiously. We could see that around when the problem actually started(ie. when the customer started observing skewed Credit Management figures), there was a substantial dip in overall system dialog response. More specifically, the Database Time in the Dialog Response was almost double of what it was earlier. This was because of a new database(Oracle 11) which the customer had upgraded to, which was not well tuned.
Ideally, SAP standard will lock and update the Shipment first in its own LUW. Once the shipment has been updated, it will start another LUW for the delivery and invoice creation. Once these follow-on documents have been created, the Shipment status will be updated to reflect its closed status. Therefore, this much was clear to us from the very beginning - the issue was arising either due to some inconsistency in the creation/sequencing of LUWs(Logical Unit of Work) or because of enqueue problems.
To make some headway towards resolution, we started to monitor STAD records closely. As soon as we got news of a duplication(the customer had now started to monitor duplicate invoices very closely), we jumped into STAD to see all the reports/programs that were involved. We observed that the program which actually creates the delivery and invoice (VTFAKT) was getting executed twice! And that too by the same user. It wasn't a case of different users tampering with the same shipment and somehow screwing with the locks.
Some more investigation into the code flow gave us a better picture of how standard SAP actually creates the delivery and invoice (when configured so in the activity profile). To start with, VT02N is called manually by the user. He makes the required changes to trigger follow-on documents creation (most likely updates the shipment status) and hits 'Save'. The program then first locks the shipment and processes the shipment data. It internally CALLs TRANSACTION VTFAKT, which is responsible for delivery and invoice creation. Therefore VTFAKT is called in its own LUW. Once VTFAKT COMMITs its changes and returns control back to VT02N, it finishes the job by COMMITing (and hence passing the locks to the update task) the shipment status changes so that no further follow-on documents can be created. Logically the flow of the entire process seems sound. Surely, there is little chance of something going wrong with the two LUWs involved.
We spoke to the users who had created the duplicate invoices to understand whether they had observed any discrepancy while running VT02N. To our surprise, yes they had. The end users told us that sometimes when they ran VT02N, the transaction would seem to hang and not give them any response. To the end user, who were extremely untrained in SAP, the expected 'response' from the system was the printing of the invoice just created at one of their local printers. Becoming impatient, this user would then end the session and rerun VT02N to receive the printout for the invoice successfully this time.
This clarified the situation a lot. Now we were able to figure out what was going wrong, everything - poor database performance, end-user habits & expectations, enqueues and LUWs - came together beautifully.
Lets go through the same process again, this time talking about what happens to the enqueues and LUWs involved. The user runs VT02N, which creates a lock on the shipment document. The user then changes the shipment and hits Save, which starts LUW1. Inside of LUW1, we have a CALL TRANSACTION to VTFAKT which starts the second logical unit of work, LUW2. Note two things, LUW2 is contained within LUW1 and that the shipment lock created earlier is not passed to the dialog work process handling LUW2. Therefore, VTFAKT, which is called from VT02N runs making the assumption that the shipment is locked (since the transaction VTFAKT cannot be invoked manually, this is a fair assumption to make). Now if there is a sudden spike in database response, LUW2 will take time and the end user becomes impatient and cancels the session.
Looking closely, there are actually two distinct stages during LUW2 processing - first when the processing is with the dialog work process and second when the COMMIT happens and the update work process processes all the update function modules called in stage one. If the user cancels the while processing is still in stage one, LUW1 and LUW2 are both cancelled. However, if the session cancellation happens during stage two processing, LUW1 ends but not LUW2. This is because there is no link between the update work process of LUW2 and the dialog work process of LUW1. What this means is that the shipment is now unlocked (dequeue happens when LUW1 is cancelled) and that the shipment status is still open since this update is supposed to happen in LUW1 after LUW2 is completed successfully. So when the same user runs VT02N, he is inadvertently starting another update to the same shipment and hence ends up creating a duplicate set of delivery and invoice.
Now that the root cause had been identified, we needed a solution too. We had two options, tune the database so that there are no performance slowdowns or to make the whole shipment more robust in the standard itself. Naturally, the second option was preferred which we managed to pull off by setting the update task of LUW2 as local. This is done by the ABAP statement SET UPDATE TASK LOCAL in the program which calls all the update function modules. The result it has is that all of the update processing of LUW2 is handled by the same dialog work process that processed this logical unit of work. So when LUW2 gets cancelled (which happens when LUW1 gets cancelled), the update never takes place. This change was released to the customer as a pilot Note released to no other customer as of writing this - no other customer has come to us with a similar problem yet.
Interesting issue eh!
P.S - The Note which we created to solve the issue is in pilot release, hence have not linked it here.