Tuesday, 15 April 2008

LEAD and LAG Analytical Functions in Oracle

The LEAD and LAG analytical functions provide access to more than one row of a table at the same time, without the need for a self join. LEAD allows access to a row at a physical offset beyond the current row and LAG allows access to a row at a physical offset prior to the current row.

Here are a couple of examples which show LEAD and LAG in action.

Upper and Lower Bounds

The SALGRADE table, supplied as part of the Oracle demo schema, holds lower and upper salary values for a grade as follows.
SELECT *
FROM salgrade
ORDER BY grade;
GRADE       LOSAL        HISAL                  
----------- ------------ ----------
1 1 1200
2 1201 1400
3 1401 2000
4 2001 3000
5 3001 9999
5 rows selected.
Another way to maintain this information is to store only the upper bound and derive the lower bound from the the previous upper bound plus one.

First, we create our new table containing just the grade and high salary values.
CREATE TABLE new_salgrade AS
SELECT grade, hisal
FROM salgrade;
CREATE TABLE succeeded.

SELECT *
FROM new_salgrade;
GRADE       HISAL                  
----------- -----------
1 1200
2 1400
3 2000
4 3000
5 9999
5 rows selected.
Using LAG we can construct a view which will produce the data in the same format as the original table.
CREATE VIEW salgrade_v AS
SELECT
grade,
LAG(hisal, 1, 0) OVER (ORDER BY hisal) + 1 AS losal,
hisal
FROM new_salgrade;
CREATE VIEW succeeded.

SELECT * 
FROM salgrade_v
ORDER BY grade;
GRADE      LOSAL      HISAL                  
---------- ---------- -----------
1 1 1200
2 1201 1400
3 1401 2000
4 2001 3000
5 3001 9999
5 rows selected.
One advantage with this technique is that the table is easier to maintain. If a salary grade is changed, then only the HISAL on one row needs to be updated. The LOSAL on the next row is adjusted automatically.
UPDATE new_salgrade
SET hisal = 2200
WHERE grade = 3;
1 rows updated

COMMIT;
Commit complete.

SELECT * 
FROM salgrade_v;
GRADE       LOSAL      HISAL                  
----------- ---------- -------------
1 1 1200
2 1201 1400
3 1401 2200
4 2201 3000
5 3001 9999
5 rows selected.

External Log File

The next example processes metric data which has been collected and recorded in an external log file.

The contents of the file look like this.
21-sep-07 13:51:12
type: metric
low: 33
avg: 121
high: 230
21-sep-07 19:23:55
type: metric
low: 66
avg: 75
high: 120
21-sep-07 23:40:17
type: metric
low: 13
avg: 98
high: 554
22-sep-07 01:39:22
type: metric
low: 121
avg: 375
high: 520
22-sep-07 10:57:15
type: metric
low: 43
avg: 99
high: 148
Suppose we are required to produce the following report with one line for each metric sample.
recorded_at          duration      low  avg  high
-------------------- ------------- ---- ---- ----
dd-mon-yy hh:mi:ss 9 9:99:99.9 999 999 999
dd-mon-yy hh:mi:ss 9 9:99:99.9 999 999 999
dd-mon-yy hh:mi:ss 9 9:99:99.9 999 999 999
We will use an external table to access the contents of the file.
CREATE OR REPLACE DIRECTORY tmp_dir AS '/tmp';
Directory created.

CREATE TABLE event_log (
rec VARCHAR2(200))
ORGANIZATION EXTERNAL(
DEFAULT DIRECTORY tmp_dir
ACCESS PARAMETERS(
records delimited by newline
fields
missing field VALUES ARE NULL
reject rows with all null fields)
LOCATION('event.log'));
Table created.

SELECT *
FROM event_log
WHERE rownum <=10;
REC
-----------------------------
21-sep-07 13:51:12
type:
metric
low: 33
avg: 121
high: 230
21-sep-07 19:23:55
type:
metric
low: 66
avg: 75
high: 120
10 rows selected.
Now we can start to construct our SQL statement.

First, we will remove any labels from the records using REGEXP_REPLACE. Labels are identified as a single word at the start of a line followed by a colon and a space.
SELECT REGEXP_REPLACE(rec,'^[[:alpha:]]*: ', null) rec
FROM event_log
WHERE rownum <= 10;
REC
-----------------------------
21-sep-07 13:51:12
metric
33
121
230
21-sep-07 19:23:55
metric
66
75
120
10 rows selected
Next, we use the LEAD function to to merge several records from the file into a single row for each sample.
SELECT 
TO_TIMESTAMP(recorded_at, 'dd-mon-yy hh24:mi:ss') recorded_at,
TO_NUMBER(low) low,
TO_NUMBER(avg) avg,
TO_NUMBER(high) high
FROM (
SELECT
rec,
LAG(rec, 1, null) OVER (ORDER BY rownum) recorded_at,
LEAD(rec, 1, null) OVER (ORDER BY rownum) low,
LEAD(rec, 2, null) OVER (ORDER BY rownum) avg,
LEAD(rec, 3, null) OVER (ORDER BY rownum) high
FROM (
SELECT REGEXP_REPLACE(rec,'^[[:alpha:]]*: ', null) rec
FROM event_log
)
)
WHERE rec = 'metric';
recorded_at           low    avg   high
------------------ ------ ------ ------
21-SEP-07 13.51.12 33 121 230
21-SEP-07 19.23.55 66 75 120
21-SEP-07 23.40.17 13 98 554
22-SEP-07 01.39.22 121 375 520
22-SEP-07 10.57.15 43 99 148
5 rows selected.
And finally, now we have all the required columns with the correct data types, we can use the LAG function to calculate the duration between the current sample and the previous one.
SELECT
recorded_at,
recorded_at - LAG(recorded_at, 1, null)
OVER (ORDER BY recorded_at) duration,
low,
avg,
high
FROM (
SELECT
TO_TIMESTAMP(recorded_at, 'dd-mon-yy hh24:mi:ss') recorded_at,
TO_NUMBER(low) low,
TO_NUMBER(avg) avg,
TO_NUMBER(high) high
FROM (
SELECT
rec,
LAG(rec, 1, null) OVER (ORDER BY rownum) recorded_at,
LEAD(rec, 1, null) OVER (ORDER BY rownum) low,
LEAD(rec, 2, null) OVER (ORDER BY rownum) avg,
LEAD(rec, 3, null) OVER (ORDER BY rownum) high
FROM (
SELECT REGEXP_REPLACE(rec,'^[[:alpha:]]*: ', null) rec
FROM event_log
)
)
WHERE rec = 'metric'
);
recorded_at         duration        low   avg   high
------------------- ------------ ------ ----- ------
21-SEP-07 13.51.12 33 121 230
21-SEP-07 19.23.55 0 5:32:43.0 66 75 120
21-SEP-07 23.40.17 0 4:16:22.0 13 98 554
22-SEP-07 01.39.22 0 1:59:5.0 121 375 520
22-SEP-07 10.57.15 0 9:17:53.0 43 99 148
5 rows selected.
So there we have it. In this example, we read an external file, stripped out the labels and and used LEAD to pivot several rows into columns. Then we used LAG to calculate the duration between adjacent samples.

No comments: