12 June 2009

Analytic Function: Finding Gaps

A little while ago Anton Nielsen published a blog post titled "SQL for Spanned Data".

In it he describes a challenging query against a table that stores start and end dates. The challenge, as Anton states it:

The challenge was to create a sql statement to only return contiguous spans by division.

To illustrate what is required:

Id  Division  Start_date   End_date
10  1         09-JUN-2009  10-JUN-2009
11  1         11-JUN-2009  12-JUN-2009
12  1         13-JUN-2009  14-JUN-2009  -- note: the following row is not contiguous
14  1         17-JUN-2009  18-JUN-2009
15  1         19-JUN-2009  20-JUN-2009

The final result will be:

Division  Start_date   End_date
1         09-JUN-2009  14-JUN-2009
1         17-JUN-2009  20-JUN-2009


In this post, I will use analytic functions to find contiguous spans. Anton already provided the DDL in his post, so I will not repeat it here. If you want to follow along, grab the DDL from his blog and join the fun.

The logic we will use is as follows:

  1. compare the current start date with the previous end date

  2. if the difference between these dates is one day, we have a contiguous span

  3. if the difference is anything else (greater than one day, or NULL for the first row), we start a new group of spans

  4. now that we have a list of 1s and 0s, we compute a running total; each span group gets its own number

  5. get the first and last day of each span group
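Before turning to SQL, the five steps above can be sketched procedurally. Here is a minimal Python sketch (the sample rows are the ones from the illustration at the top of this post; the function name `contiguous_spans` is my own invention). It folds the running total and the aggregation of steps 4 and 5 into a single pass:

```python
from datetime import date

# Sample rows from the illustration: (id, division, start_date, end_date),
# already ordered by start_date within the division.
rows = [
    (10, 1, date(2009, 6, 9),  date(2009, 6, 10)),
    (11, 1, date(2009, 6, 11), date(2009, 6, 12)),
    (12, 1, date(2009, 6, 13), date(2009, 6, 14)),
    (14, 1, date(2009, 6, 17), date(2009, 6, 18)),
    (15, 1, date(2009, 6, 19), date(2009, 6, 20)),
]

def contiguous_spans(rows):
    """Collapse date-contiguous rows into (division, start, end) spans."""
    spans = []
    prev_end = None
    for _id, division, start, end in rows:
        # Steps 1-3: start a new group unless this row begins exactly
        # one day after the previous row ended.
        if prev_end is None or (start - prev_end).days != 1:
            spans.append([division, start, end])   # new span group
        else:
            spans[-1][2] = end                     # extend the current span
        prev_end = end
    return [tuple(s) for s in spans]

print(contiguous_spans(rows))
```

This produces the two spans shown in the expected result above. The SQL below does the same thing, but set-based, using analytic functions.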



To compare the current start date with the previous end date, we will use the LAG function, which lets you "look back" in your result set.
Let's have a look at the effect of this function:

select id
     , start_date
     , end_date
     , lag (end_date) over (partition by division
                                order by start_date
                           )
  from spantest
 where start_date between to_date ('01-06-2009', 'dd-mm-yyyy')
                      and to_date ('20-06-2009', 'dd-mm-yyyy')
   and division = 1
 order by start_date
/

ID START_DAT END_DATE LAG(END_D
---------- --------- --------- ---------
73383 02-JUN-09 03-JUN-09
73384 04-JUN-09 05-JUN-09 03-JUN-09
73385 06-JUN-09 07-JUN-09 05-JUN-09
73386 08-JUN-09 09-JUN-09 07-JUN-09
73387 10-JUN-09 11-JUN-09 09-JUN-09
73388 12-JUN-09 13-JUN-09 11-JUN-09
73389 14-JUN-09 15-JUN-09 13-JUN-09
73391 18-JUN-09 19-JUN-09 15-JUN-09
73392 20-JUN-09 21-JUN-09 19-JUN-09

9 rows selected.

The last column is the end date of the previous record, which makes it easy to implement steps 2 and 3. For this we will use a CASE expression:


select id
     , start_date
     , end_date
     , case
         when start_date -
              lag (end_date) over (partition by division
                                       order by start_date
                                  ) = 1
         then 0
         else 1
       end span_group
  from spantest
 where start_date between to_date ('01-06-2009', 'dd-mm-yyyy')
                      and to_date ('20-06-2009', 'dd-mm-yyyy')
   and division = 1
 order by start_date
/

ID START_DAT END_DATE SPAN_GROUP
---------- --------- --------- ----------
73383 02-JUN-09 03-JUN-09 1
73384 04-JUN-09 05-JUN-09 0
73385 06-JUN-09 07-JUN-09 0
73386 08-JUN-09 09-JUN-09 0
73387 10-JUN-09 11-JUN-09 0
73388 12-JUN-09 13-JUN-09 0
73389 14-JUN-09 15-JUN-09 0
73391 18-JUN-09 19-JUN-09 1
73392 20-JUN-09 21-JUN-09 0

9 rows selected.

The last column (named SPAN_GROUP) now contains 1s and 0s; each 1 marks the start of a new span group. This sample set therefore contains two span groups.
Using the running-total technique, we can identify the span groups more clearly. Because it is not possible to nest analytic functions, we push the query we have built so far into an inline view. Then we can apply SUM () OVER () to the SPAN_GROUP values we created earlier.

select id
     , start_date
     , end_date
     , sum (span_group) over (partition by division
                                  order by start_date
                             ) span_grps
  from (
        select id
             , division
             , start_date
             , end_date
             , case
                 when start_date -
                      lag (end_date) over (partition by division
                                               order by start_date
                                          ) = 1
                 then 0
                 else 1
               end span_group
          from spantest
         where start_date between to_date ('01-06-2009', 'dd-mm-yyyy')
                              and to_date ('20-06-2009', 'dd-mm-yyyy')
           and division = 1
       )
 order by start_date
/

ID START_DAT END_DATE SPAN_GRPS
---------- --------- --------- ----------
73383 02-JUN-09 03-JUN-09 1
73384 04-JUN-09 05-JUN-09 1
73385 06-JUN-09 07-JUN-09 1
73386 08-JUN-09 09-JUN-09 1
73387 10-JUN-09 11-JUN-09 1
73388 12-JUN-09 13-JUN-09 1
73389 14-JUN-09 15-JUN-09 1
73391 18-JUN-09 19-JUN-09 2
73392 20-JUN-09 21-JUN-09 2

9 rows selected.

The last column in the result set (named SPAN_GRPS) now clearly identifies the two groups.

The final thing we need to do is retrieve the earliest start date and the latest end date of each group. A simple aggregate, grouped by division and span group, will suffice:

select division
     , min (start_date) start_date
     , max (end_date) end_date
  from (
        select id
             , division
             , start_date
             , end_date
             , sum (span_group) over (partition by division
                                          order by start_date
                                     ) span_grps
          from (
                select id
                     , division
                     , start_date
                     , end_date
                     , case
                         when start_date -
                              lag (end_date) over (partition by division
                                                       order by start_date
                                                  ) = 1
                         then 0
                         else 1
                       end span_group
                  from spantest
                 where start_date between to_date ('01-06-2009', 'dd-mm-yyyy')
                                      and to_date ('20-06-2009', 'dd-mm-yyyy')
                   and division = 1
               )
       )
 group by division, span_grps
/

DIVISION START_DAT END_DATE
---------- --------- ---------
1 02-JUN-09 15-JUN-09
1 18-JUN-09 21-JUN-09
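As an aside, if you want to experiment with this technique without an Oracle instance at hand: the same three-layer query (LAG to mark span starts, a running SUM to number the groups, then an aggregate per group) runs almost verbatim on SQLite 3.25 or later, which added window functions. Here is a sketch using Python's built-in sqlite3 module and the sample rows from the illustration; the only Oracle-to-SQLite change is using julianday() for the date arithmetic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table spantest (id int, division int, start_date text, end_date text)")
con.executemany(
    "insert into spantest values (?, ?, ?, ?)",
    [(10, 1, "2009-06-09", "2009-06-10"),
     (11, 1, "2009-06-11", "2009-06-12"),
     (12, 1, "2009-06-13", "2009-06-14"),
     (14, 1, "2009-06-17", "2009-06-18"),
     (15, 1, "2009-06-19", "2009-06-20")])

# Same structure as the final query above: mark span starts with LAG,
# number the groups with a running SUM, then aggregate per group.
sql = """
with marked as (
  select division, start_date, end_date,
         case
           when julianday(start_date)
                - julianday(lag(end_date) over (partition by division
                                                    order by start_date)) = 1
           then 0 else 1
         end span_group
    from spantest
),
numbered as (
  select division, start_date, end_date,
         sum(span_group) over (partition by division
                                   order by start_date) span_grps
    from marked
)
select division, min(start_date), max(end_date)
  from numbered
 group by division, span_grps
 order by 2
"""
for row in con.execute(sql):
    print(row)
```

Note that the NULL returned by LAG for the first row of each division falls through to the ELSE branch, just as in the Oracle version.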


Personally I think the analytic-function approach is a lot easier to understand than the CONNECT BY query. Just out of curiosity I ran both queries with autotrace enabled. The analytic-function query showed these statistics:


Statistics
---------------------------------------------------
0 recursive calls
0 db block gets
248 consistent gets
0 physical reads
0 redo size
112543 bytes sent via SQL*Net to client
3164 bytes received via SQL*Net from client
255 SQL*Net roundtrips to/from client
3 sorts (memory)
0 sorts (disk)
3800 rows processed


The CONNECT BY Query showed these statistics:

Statistics
---------------------------------------------------
3 recursive calls
233 db block gets
3882 consistent gets
224 physical reads
692 redo size
112543 bytes sent via SQL*Net to client
3164 bytes received via SQL*Net from client
255 SQL*Net roundtrips to/from client
18 sorts (memory)
0 sorts (disk)
3800 rows processed

And yes, I did run the latter query a few times to bring the number of recursive calls down (the first run showed around 161 recursive calls).

Enough said... another case that shows the power of analytic functions.
Final Note:
The results I got from my query are the same as the results of the first query in the original post.
The query Anton labeled "Don't run this"... well, I did run it, and got these results for division 1 in June:

1 02-JUN-09 15-JUN-09
1 04-JUN-09 15-JUN-09
1 06-JUN-09 15-JUN-09
1 08-JUN-09 15-JUN-09
1 10-JUN-09 15-JUN-09
1 12-JUN-09 15-JUN-09
1 14-JUN-09 15-JUN-09

And I must admit, I don't understand these results... They are completely different from both my query and Anton's first query. Maybe I don't understand the original requirements...

Comments:

  1. Alex,

    I never expected such a fantastic response to this query. Your solution is far superior to mine--just what I was looking for.

    Thanks,

    Anton

  2. Well done. If the application uses a null end_date to indicate an open-ended span, MAX(end_date) returns the highest non-null date in the span group. That turned out to be an error in my application, but it was not hard to correct with nvl(end_date, '31-DEC-4444') in the innermost view.

    Good explanation too. Thanks

  3. I really like the step by step approach to the final solution. I really like analytic functions now after reading this post. Thanks!!!

  4. Salman
    I like this query; I was stuck on a problem and it helped me a lot. Thanks!

  5. Beautiful - I was just about to write some ugly cursor code when I found this.

  6. Thank you! This was so helpful. The step by step approach you took made trying to replicate this with my variables very understandable.

  19. You saved my day, thanks for sharing you experience

  21. Very helpful. Followed your design above for a report I am building at work. Worked out very well plus I just learned something new today!
