07 July 2009

Making up Data with Partition Outer Join

Just the other day on the Oracle SQL and PL/SQL forum, someone asked on how to create non-existent rows. This post is not about generating a number of rows.
It's about handling Sparse Data, when you want to fill in some missing records in your result set.
First time I heard about this was in a blog written by Lucas Jellema.
Let's first start out with the table and some test data:

SQL> create table test_table1
2 (status varchar2(15)
3 ,manager number
4 ,sales number);

Table created.

SQL> insert into test_table1 values ('in process', 14, 100);

1 row created.

SQL> insert into test_table1 values ('in process', 15, 10);

1 row created.

SQL> insert into test_table1 values ('in process', 15, 40);

1 row created.

SQL> insert into test_table1 values ('done', 14, 200);

1 row created.

SQL> insert into test_table1 values ('done', 16, 50);

1 row created.
SQL> select *
2 from test_table1
3 /

STATUS MANAGER SALES
--------------- ---------- ----------
in process 14 100
in process 15 10
in process 15 40
done 14 200
done 16 50

As you can see in the sample data, Manager 14 has entries for the status "in process" and "done". Manager 15 only has entries for "in process". Manager 16 only has a single entry for "done".
The result that we are after is to show for each Manager a total sales value for both statuses "in process" and "done".
When we use a regular SUM and GROUP BY:

SQL> select manager
2 , status
3 , sum(sales)
4 from test_table1
5 group by manager
6 , status
7 order by manager
8 , status
9 /

MANAGER STATUS SUM(SALES)
---------- --------------- ----------
14 done 200
14 in process 100
15 in process 50
16 done 50

we only see values for records that are actually in the table... Go Figure!

Nice results, but not exactly what we are after. We want an extra record for Manager 15 (with status "done" and sales value of 0) and an extra record for Manager 16 (with status "in process" and also a value of 0).

One way to tackle this problem (or challenge if you prefer) is to use a Partition Outer Join. As far as i know this is not ANSI-SQL, but Oracle specific syntax. Tahiti.Oracle.com calls it an "extension to the ANSI syntax".
To make this query work, we need a "table" (or inline view) which has all possible statuses. Something like

SQL> select 'done' st from dual
2 union all
3 select 'in process' from dual
4 /

ST
----------
done
in process

This inline view will be outer joined to our table.
What makes a Partition Outer Join work differently from a regular Outer Join?
A regular Outer Join will show an extra single record even when a matching value is not present. In our case, this will not make a difference as the values "done" and "in process" are present in our base table.
What we want is to outer join all statuses from the inline view to our base table for each manager.
And this is exactly what the Partition Clause does. It breaks up the result set per manager. Per partition (one for Manager 14, one for Manager 15 and one for Manager 16) we want to outer join to the inline view.

Putting it all together, and here is the final result:

SQL> select manager
2 , st
3 , nvl (sum (sales) , 0)
4 from test_table1 t partition by (manager)
5 right outer
6 join (select 'done' st from dual
7 union all
8 select 'in process' from dual
9 ) sts
10 on (t.status = sts.st)
11 group by manager
12 , st
13 order by manager
14 , st
15 /

MANAGER ST NVL(SUM(SALES),0)
---------- ---------- -----------------
14 done 600
14 in process 300
15 done 0
15 in process 150
16 done 150
16 in process 0

6 rows selected.

Each Manager shows an entry for both statuses "done" and "in process", even when this value is not in the base table.

If -for whatever reason- you don't like RIGHT OUTER, just flip the tables around and call it a LEFT OUTER:

SQL> select manager
2 , st
3 , nvl (sum (sales) , 0)
4 from (select 'done' st from dual
5 union all
6 select 'in process' from dual
7 ) sts
8 left outer
9 join test_table1 t partition by (manager)
10 on (t.status = sts.st)
11 group by manager
12 , st
13 order by manager
14 , st
15 /

MANAGER ST NVL(SUM(SALES),0)
---------- ---------- -----------------
14 done 800
14 in process 400
15 done 0
15 in process 200
16 done 200
16 in process 0

6 rows selected.


Original question
Lucas Jellema on Partition Outer Join
Oracle 10g Documentation
Oracle 11g Documentation
Rob van Wijk on Interval Based Row Generation

6 comments:

  1. Hi Alex,

    Just for fun, another way of doing this partition outer join using the model clause:

    SQL> select *
    2 from test_table1
    3 group by status
    4 , manager
    5 model
    6 dimension by (status,manager)
    7 measures (sum(sales) s)
    8 rules upsert all
    9 ( s['done',any] = nvl(s[cv(),cv()],0)
    10 , s['in process',any] = nvl(s[cv(),cv()],0)
    11 )
    12 order by manager
    13 , status
    14 /

    STATUS MANAGER S
    --------------- ---------- ----------
    done 14 200
    in process 14 100
    done 15 0
    in process 15 50
    done 16 50
    in process 16 0

    6 rows selected.

    Oh, and maybe you should end your script with "drop table test_table1", because the number of rows in test_table1 keeps increasing each run ;-)

    Groet,
    Rob.

    ReplyDelete
  2. I have been looking for this information too much time ago, I think this is an interesting topic, and also
    that it is really helpful, thank you so much for posting this, it has been really interesting, just keep posting, things like this one.

    ReplyDelete
  3. Isn't this analytic sql? Calling it an outer join is a slightly confusing nomenclature, well, for me anyway.
    But thanks - an easy-to-follow illustration of its use.
    J

    ReplyDelete
    Replies
    1. Even though there is the line "partition by" in the statement, this doesn't qualify as analytic functions. This functionality can only be used with an OUTER JOIN (hence the name partition outer join).
      Analytic functions only occur in the SELECT clause of the statement and are real functions that return values.

      Delete
  4. I am not able to see the correct Sum(sales) for ?for 14 done it should be 200 not 600 ,Please correct me if i am wrong

    ReplyDelete
    Replies
    1. No, you're not wrong - you're absolutely right. It is just as Rob pointed out in the first comment, instead of resetting the results in the table I made the mistake of re-running the INSERT statements too often.

      Delete