Wednesday, May 30, 2012

Calculating Running Totals - SQLTeam.com

Calculating Running Totals - SQLTeam.com


Roger writes "Is there a way to retrive a field value from the previously read row in order to use it to calculate a field in the current row . . ."
For example:
day   sales   cumu_total
1     120     120
2     60      180
3     125     305
4     40      345
In order to calculte the cumulative total I need to know the previous cumulative total (cumulative total could be any other calculation). I did this in MySQL by using variables to temporarily hold values between rows but SQL server doesn't assinging variables in retrieval statements. Is it possible to do this in a Select statement?"

The answer to the questions is "yes", you can solve this problem with a single SELECT statement. Instead of just showing the solution that came to mind, though, I want to take a look at three ways to solve the problem.

Creating the Test Data

I used the code shown below to create the base table and a few thousand rows of test data. It is important to add enough data so that the efficiency of the three solutions can be accurately measured. In other words, with only a few rows of data they all seem efficient.
CREATE TABLE Sales (DayCount smallint, Sales money)
CREATE CLUSTERED INDEX ndx_DayCount ON Sales(DayCount)
go
INSERT Sales VALUES (1,120)
INSERT Sales VALUES (2,60)
INSERT Sales VALUES (3,125)
INSERT Sales VALUES (4,40)

DECLARE @DayCount smallint, @Sales money
SET @DayCount = 5
SET @Sales = 10

WHILE @DayCount < 5000
 BEGIN
  INSERT Sales VALUES (@DayCount,@Sales)
  SET @DayCount = @DayCount + 1
  SET @Sales = @Sales + 15
 END

The Three Solutions

The three different solutions I tested are shown below. The execution time with and without a clustered index on DayCount is shown at the top of each batch.

Solution 1: Temp Table/Cursor

(NoIndex = 2 secs, Index = 2 secs)
CREATE TABLE #Sales (DayCount smallint, Sales money, RunningTotal money)

DECLARE @DayCount smallint,
        @Sales money,
        @RunningTotal money

SET @RunningTotal = 0

DECLARE rt_cursor CURSOR
FOR
SELECT DayCount, Sales
FROM Sales

OPEN rt_cursor

FETCH NEXT FROM rt_cursor INTO @DayCount,@Sales

WHILE @@FETCH_STATUS = 0
 BEGIN
  SET @RunningTotal = @RunningTotal + @Sales
  INSERT #Sales VALUES (@DayCount,@Sales,@RunningTotal)
  FETCH NEXT FROM rt_cursor INTO @DayCount,@Sales
 END

CLOSE rt_cursor
DEALLOCATE rt_cursor

SELECT * FROM #Sales ORDER BY DayCount

DROP TABLE #Sales

Solution 2: The "Celko" Solution

(NoIndex = 25 secs, Index = 20 secs)
SELECT DayCount,
       Sales,
       Sales+COALESCE((SELECT SUM(Sales) 
                      FROM Sales b 
                      WHERE b.DayCount < a.DayCount),0)
                         AS RunningTotal
FROM Sales a
ORDER BY DayCount

Solution 3: The "Guru's Guide" Solution

(NoIndex = 38 secs, Index = 17 secs)
SELECT a.DayCount,
       a.Sales,
       SUM(b.Sales)
FROM Sales a
CROSS JOIN Sales b
WHERE (b.DayCount <= a.DayCount) AS RunningTotal
GROUP BY a.DayCount,a.Sales
ORDER BY a.DayCount,a.Sales
Solution 1 uses a temporary table and a cursor and executes extremely fast. I must admit that my original intent was to show this solution as the way *not* to solve the problem. In general, I tell developers to avoid cursors at all costs. In this particular case, however, the cursor approach is much quicker than the single SELECT solutions.
Solution 2 is the one that came to mind when I originally read the question. I attribute the solution to Joe Celko, because it can be found in his SQL for Smarties book and I'm sure I learned about it in one of his posts in the SQL Server newsgroups.
The tricky part about this solution is the use of a correlated subquery and the COALESCE function. A correlated subquery is one that executes for each row in the outer query. The inner query sums the Sales for rows with a DayCount value less than the current row.
The COALESCE function returns the first non-NULL value in its list of values. The first row processed returns a NULL for SUM(Sales) because there is no DayCount less than it's value. When this happens a 0 is returned and added to the Sales value.
Solution 3 is labeled the "Guru's Guide" because it can be found in Ken Henderson's enormously popular "The Guru's Guide to Transact-SQL." I would have never come up with this solution, but I guess that's what makes Ken the Guru and me a Guru wannabee.
This solution uses a CROSS JOIN and table aliases to join the Sales table with itself. This causes each row in the left table (Sales a) to be joined with each row in the right table (Sales b) where the DayCount in b in less than the DayCount in a. The SUM(b.Sales) and the GROUP BY a.DayCount, a.Sales then allow the running total for each row to be calculated. If you are having trouble figuring out how this works draw a layout of two instances of Sales side-by-side, and then draw lines from a row in the left table to the rows in the right table that meet the WHERE condition.
When I first ran the tests I did not have a clustered index on DayCount. Upon realizing DayCount is a prime candidate for a clustered index, I added it and re-ran the tests. The execution time for Solution 3 was significantly reduced after the index was created because of the number of rows joined by the CROSS JOIN.
And as an FYI, the server I used to tests these solutions is a Compaq ML370 with dual 600's and 1 GB RAM--your results may vary.


More than 3 requests, I'll translate this to Chinese.
超过3个请求,我就会把这篇文章翻译成中文。

No comments: