Tuesday, 1 July 2008

JDBC Batch Updates with Oracle

The JDBC batch update feature allows multiple SQL statements to be submitted to Oracle at the same time. This can have a significant impact on performance by reducing network round trips.

With Oracle, there are two batch update techniques: standard JDBC batch updates and Oracle-specific batch updates. We will look at both of these and demonstrate the performance benefits of each.

This simple table is used for our tests.
CREATE TABLE batch_test
( id NUMBER,
name VARCHAR2(50)
);
Table created.

Without Batching

First we will execute an INSERT statement ten thousand times without batching. This will give us a baseline to compare with the batched examples later on.

Autocommit has been disabled (as it always should be) and the updates are committed explicitly after every one hundred rows.
public static void testNoBatching() throws SQLException {
    Connection con = ConnectionFactory.getNewConnection();
    PreparedStatement pstmt =
        con.prepareStatement("INSERT INTO batch_test (id, name) " +
                             "VALUES (?, rpad('x', 50, 'x'))");

    long start = System.currentTimeMillis();
    for (int i = 1; i <= 10000; i++) {
        pstmt.setInt(1, i);
        pstmt.executeUpdate();
        if (i % 100 == 0) {
            con.commit();
        }
    }
    System.out.printf("Inserting 10,000 rows took %4.2f seconds\n",
                      (System.currentTimeMillis() - start) / 1000f);

    pstmt.close();
    con.close();
}
------------------------------------------------------------------------------
Inserting 10,000 rows took 12.83 seconds

This test uses a PreparedStatement and bind variables to ensure that the SQL is executed as efficiently as possible. Even so, without batching, the inserts took nearly 13 seconds.

Standard JDBC Batch Updates

With standard JDBC batch updates, the SQL statements are batched and submitted explicitly.
public static void testStandardBatching() throws SQLException {
    Connection con = ConnectionFactory.getNewConnection();
    PreparedStatement pstmt =
        con.prepareStatement("INSERT INTO batch_test (id, name) " +
                             "VALUES (?, rpad('x', 50, 'x'))");

    if (!con.getMetaData().supportsBatchUpdates()) {
        throw new SQLException("Batch updates not supported");
    }

    long start = System.currentTimeMillis();
    for (int i = 1; i <= 10000; i++) {
        pstmt.setInt(1, i);
        pstmt.addBatch();
        if (i % 100 == 0) {
            int[] updateCounts = pstmt.executeBatch();
            con.commit();
        }
    }
    System.out.printf("Inserting 10,000 rows took %4.2f seconds\n",
                      (System.currentTimeMillis() - start) / 1000f);

    pstmt.close();
    con.close();
}
------------------------------------------------------------------------------
Inserting 10,000 rows took 0.40 seconds
The addBatch method adds an SQL statement to the current batch. At an appropriate point the batch is submitted to Oracle by calling the executeBatch method. Here we submit the batch after one hundred inserts, just before we commit. executeBatch returns an array of integers which contains the number of rows affected by each statement in the batch.
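Note that the JDBC specification allows a driver to return Statement.SUCCESS_NO_INFO instead of an exact count for each batched statement, and the Oracle driver commonly does this for standard batching. A minimal sketch of inspecting the result, reusing the pstmt variable from the example above:

int[] updateCounts = pstmt.executeBatch();
for (int count : updateCounts) {
    if (count == Statement.SUCCESS_NO_INFO) {
        // The statement succeeded but the driver did not determine
        // how many rows it affected (the Oracle driver typically
        // reports this for every statement in a standard batch)
    } else if (count >= 0) {
        // An exact row count was returned for this statement
    }
}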

This time, the entire test took less than half a second, a fraction of the time taken to perform the original test.

Oracle Batch Updates

Oracle update batching takes a different approach. The batch size is set on the OraclePreparedStatement object and SQL statements are implicitly batched by the JDBC driver until the batch size is reached. At this point the SQL is automatically submitted to Oracle.
public static void testOracleBatching() throws SQLException {
    Connection con = ConnectionFactory.getNewConnection();
    PreparedStatement pstmt =
        con.prepareStatement("INSERT INTO batch_test (id, name) " +
                             "VALUES (?, rpad('x', 50, 'x'))");

    // Oracle update batching requires the Oracle driver's
    // oracle.jdbc.OraclePreparedStatement interface
    ((oracle.jdbc.OraclePreparedStatement) pstmt).setExecuteBatch(100);

    long start = System.currentTimeMillis();
    for (int i = 1; i <= 10000; i++) {
        pstmt.setInt(1, i);
        // Returns 0 while a statement is merely added to the batch, and
        // the total number of rows affected when the batch is submitted
        int updateCount = pstmt.executeUpdate();
        if (i % 100 == 0) {
            con.commit();
        }
    }
    System.out.printf("Inserting 10,000 rows took %4.2f seconds\n",
                      (System.currentTimeMillis() - start) / 1000f);

    pstmt.close();
    con.close();
}
------------------------------------------------------------------------------
Inserting 10,000 rows took 0.39 seconds
This time we implement batching with a single call to the setExecuteBatch method on OraclePreparedStatement. The update count returned from executeUpdate will return zero if the call is adding the statement to the batch, or the total number of rows affected by all the update statements if the batch is being submitted to Oracle. Once again, with the batch size set to one hundred, the ten thousand rows are inserted in less than half a second.
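If the loop ends part-way through a batch, any statements still pending can be flushed explicitly with the sendBatch method before the final commit. A minimal sketch, again reusing the pstmt and con variables from the example above:

// Submit any statements remaining in the batch; the return value is
// the number of rows affected by the flushed statements
int pending = ((oracle.jdbc.OraclePreparedStatement) pstmt).sendBatch();
con.commit();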

Summary

These examples show that there can be a significant improvement in performance when batch updates are used rather than single statement updates. In our tests, execution times were reduced from over twelve seconds to less than half a second when batching was introduced.

Of course, not every application is suited to batch updates but, when used appropriately, the performance benefits can be considerable.

Monday, 30 June 2008

Using Byte and Char with Character Datatypes

Character datatypes such as CHAR, VARCHAR2 and CLOB can be defined using byte or character length semantics as follows.
CREATE TABLE t(x VARCHAR2(10),   
y VARCHAR2(20 BYTE),
z VARCHAR2(30 CHAR));
Table created.
If neither BYTE nor CHAR is specified then the value defined by the NLS_LENGTH_SEMANTICS parameter is used as the default. We can check the current value of this parameter as follows.
SELECT value
FROM v$nls_parameters
WHERE parameter = 'NLS_LENGTH_SEMANTICS';
VALUE                                                            
---------------
BYTE
1 rows selected
In the above example, table t was created with columns x and y with maximum lengths of 10 and 20 bytes respectively and column z with a maximum length of 30 characters.

But what does this actually mean?

Character Set

During database creation, the character set to be used by CHAR, VARCHAR2 and CLOB datatypes is specified. These are some of the values available for selection.

US7ASCII ASCII 7-bit American (SB)
WE8ISO8859P1 ISO 8859-1 West European 8-bit (SB)
EE8ISO8859P2 ISO 8859-2 East European 8-bit (SB)
JA16VMS JVMS 16-bit Japanese (MB)
KO16KSCCS KSCCS 16-bit Korean (MB)
AL16UTF16 Unicode 3.2 UTF-16 Universal character set (MB)
AL32UTF8 Unicode 3.2 UTF-8 Universal character set (MB)

SB = Single Byte
MB = Multi Byte
The character set for a database can be determined with the following SQL.
SELECT value
FROM v$nls_parameters
WHERE parameter = 'NLS_CHARACTERSET';
VALUE                                    
---------------
WE8ISO8859P1
1 rows selected

When using a single byte character set such as WE8ISO8859P1, a character is always held in a single byte, so specifying the BYTE or CHAR qualifier has no effect on storage allocation. However, with a multi-byte character set such as AL32UTF8, a single character may require up to 4 bytes of storage. In this case it may be useful to specify the length of the column in characters and let Oracle figure out how many bytes to allocate.
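The same distinction is easy to see outside the database. Here is a self-contained Java sketch showing that one character can occupy more than one byte once encoded:

public class ByteVsChar {
    public static void main(String[] args) throws Exception {
        String s = "é";  // one character
        System.out.println(s.length());                  // prints 1 (characters)
        System.out.println(s.getBytes("UTF-8").length);  // prints 2 (bytes in UTF-8)
    }
}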

Consideration must be given to the maximum number of bytes that can be allocated. When using BYTE semantics the maximum is 2000 for CHAR datatypes, and 4000 for VARCHAR2. However, when using the CHAR length qualifier the maximum size that can be specified may be reduced in order to accommodate the character set.

National Character Set

The Unicode or national character set is also specified when creating the database. For example.


AL32UTF8 UTF-8
AL16UTF16 UTF-16

This setting is used by the Unicode datatypes NCHAR, NVARCHAR2 and NCLOB and the size of the column is always specified in characters.

Use the following SQL to determine the national character set details for a database.
SELECT VALUE
FROM nls_database_parameters
WHERE parameter = 'NLS_NCHAR_CHARACTERSET';
VALUE                                    
---------------
UTF8
1 rows selected

Fast Dual in Oracle

Oracle 10g introduced FAST DUAL. Now, when the DUAL table is accessed, no logical I/O is performed, which improves performance.

We can demonstrate this by generating a million rows using the DUAL table.
set timi on
set autot trace stat
SELECT rownum FROM dual CONNECT BY LEVEL <= 1000000;
1000000 rows selected.
Elapsed: 00:00:08.07
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
0 consistent gets
0 physical reads
0 redo size
14380219 bytes sent via SQL*Net to client
733838 bytes received via SQL*Net from client
66668 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1000000 rows processed
The statistics trace shows that 1 million rows have been processed without any logical I/O being performed.

Saturday, 14 June 2008

Divide a result set into groups using NTILE

The NTILE analytic function divides a result set into a specified number of groups or buckets.

The following SQL takes a generated result set and, using NTILE, adds a column containing three distinct values with an equal number of rows for each value.
SELECT id, val, ntile(3) OVER(ORDER BY id) nt
FROM (
SELECT rownum id, 'xyz' || rownum val
FROM dual
CONNECT BY LEVEL <=12
)
ORDER BY id;
ID       VAL          NT
-------- ------------ ---------
1        xyz1         1
2        xyz2         1
3        xyz3         1
4        xyz4         1
5        xyz5         2
6        xyz6         2
7        xyz7         2
8        xyz8         2
9        xyz9         3
10       xyz10        3
11       xyz11        3
12       xyz12        3
12 rows selected

Friday, 13 June 2008

Oracle Flashback Table

The FLASHBACK TABLE command can be used to restore an earlier state of a table in the event of human or application error.

Restoring a dropped table

The first use of FLASHBACK TABLE is to retrieve a table from the recycle bin after it has been dropped.
DROP TABLE dept;
Table dropped.

SELECT *
FROM dept;
SQL Error: ORA-00942: table or view does not exist

SELECT object_name, original_name
FROM recyclebin;
OBJECT_NAME                    ORIGINAL_NAME                    
------------------------------ --------------------------------
BIN$QzJJ24I1hx7gQAAK+wFg5Q==$0 DEPT
1 rows selected

FLASHBACK TABLE dept TO BEFORE DROP;
Flashback complete.

SELECT *
FROM dept;
DEPTNO                 DNAME          LOC
---------------------- -------------- -------------
10                     ACCOUNTING     NEW YORK
20                     RESEARCH       DALLAS
30                     SALES          CHICAGO
40                     OPERATIONS     BOSTON
4 rows selected
The RENAME TO clause can be used to give the table a new name during the restore. This is useful if a new table has been created with the same name since the old one was dropped.
FLASHBACK TABLE dept TO BEFORE DROP RENAME TO dept_before_drop;
Flashback complete.
Any indexes or triggers associated with the table will still have their recycled names and will need to be renamed manually. Also, views and procedures which rely on the table will have been invalidated and will need to be recompiled.

Restoring a table to an earlier state

FLASHBACK TABLE can also be used to revert the state of a table back to a specified time or SCN.

Let's see this in action using the EMP table in the Oracle demo schema.
SELECT deptno, count(*)
FROM emp
GROUP BY deptno;
DEPTNO       COUNT(*)               
------------ --------------
10 3
20 5
30 6
3 rows selected
Suppose we wish to delete all entries for department 10 but we mistype the SQL and commit before we notice our mistake.
DELETE FROM emp
WHERE deptno >= 10;
14 rows deleted

COMMIT;
Commit complete.

SELECT deptno, count(*)
FROM emp
GROUP BY deptno;
DEPTNO       COUNT(*)               
------------ --------------
0 rows selected
All the rows have been deleted and the transaction committed. But fear not, flashback technology was introduced to deal with just this kind of problem. We can restore the table back to the state it was in, say, one minute ago. Before we can do this, however, we must ensure that row movement is enabled on the table.
ALTER TABLE emp ENABLE ROW MOVEMENT;
Table altered.

FLASHBACK TABLE emp TO TIMESTAMP systimestamp - INTERVAL '1' MINUTE;
Flashback complete.

SELECT deptno, count(*)
FROM emp
GROUP BY deptno;
DEPTNO       COUNT(*)               
------------ --------------
10 3
20 5
30 6
3 rows selected
The table has been 'rewound' to a prior point in time. All the rows that existed on the table one minute ago have been restored. It is as if the delete never happened.

Alternatively, we could restore to a particular SCN. To use this method we must know the exact SCN we wish to restore to, so we will need to record the value before the change is attempted.
SELECT current_scn
FROM v$database;
CURRENT_SCN            
----------------------
3779819703
1 rows selected

DELETE FROM emp
WHERE deptno >= 10;
14 rows deleted

COMMIT;
COMMIT succeeded.

FLASHBACK TABLE emp TO SCN 3779819703;
FLASHBACK TABLE succeeded.

SELECT deptno, count(*)
FROM emp
GROUP BY deptno;
DEPTNO       COUNT(*)               
------------ --------------
10 3
20 5
30 6
3 rows selected

Restore Point

Oracle 10g Release 2 introduced the concept of restore points. A restore point is a name which is associated with the SCN of the database at the time the restore point was created. We can reference a restore point name in the FLASHBACK TABLE command.

For example.
CREATE RESTORE POINT before_changes;
Restore point created.
If we run into any problems during our updates, we can simply rewind one or more tables back to the restore point.
DELETE FROM emp
WHERE deptno >= 10;
14 rows deleted

COMMIT;
COMMIT succeeded.

FLASHBACK TABLE emp TO RESTORE POINT before_changes;
FLASHBACK TABLE succeeded.

SELECT deptno, count(*)
FROM emp
GROUP BY deptno;
DEPTNO       COUNT(*)               
------------ --------------
10 3
20 5
30 6
3 rows selected

Restrictions

There are some rules and restrictions to be aware of when using FLASHBACK TABLE.
  • This feature is only available with Oracle Enterprise Edition.
  • You will need the appropriate privileges (such as FLASHBACK ANY TABLE) to execute these commands. See the Oracle documentation for more details.
  • Oracle may restore rows to a different location and allocate new rowids. Any references to the original rowids will no longer be valid.
  • There must be sufficient undo data available in the database to restore the data. This can be guaranteed for a predetermined period of time by setting the UNDO_RETENTION initialization parameter.
  • You cannot restore a table to an earlier state across any DDL operations that change the structure of the table (including TRUNCATE).
  • Restore points were only introduced in 10g Release 2.

Precision and scale do not specify column size

Precision and scale do not determine the physical storage allocation for a number, only the maximum size of the value that can be stored in it. A NUMBER column will occupy as much or as little space as it needs to, up to a maximum of 22 bytes. The physical size will grow and shrink depending on the size of the number to be held. In this sense, it behaves in a similar way to a VARCHAR2 datatype which also dynamically allocates enough bytes to hold the value up to the maximum specified size.

To demonstrate this we will use the VSIZE function which returns the number of bytes in the internal representation of a column.

CREATE TABLE t (
num_col NUMBER(16)
);
CREATE TABLE succeeded.

INSERT INTO t (num_col) 
VALUES (1234567890);
1 rows inserted

SELECT num_col, VSIZE(num_col) 
FROM t;
NUM_COL                VSIZE(NUM_COL)         
---------------------- ----------------------
1234567890 6
1 rows selected
If we change the value we see that the internal size, the number of bytes required to hold our number, has changed.
UPDATE t 
SET num_col = 1;
1 rows updated

SELECT num_col, VSIZE(num_col) 
FROM t;
NUM_COL                VSIZE(NUM_COL)         
---------------------- ----------------------
1 2
1 rows selected
So when we specify a precision and scale for a number we are not telling Oracle to reserve a fixed number of bytes for every number in the column. We are only providing the maximum size and the number of decimal places. Oracle will store the number in the minimum number of bytes necessary to hold the value.

Oracle Quick Tip: Setting the Default Date and Timestamp Format for a Session

The default DATE, TIMESTAMP and TIMESTAMP WITH TIME ZONE format models for a session can be altered with the following commands.
ALTER SESSION SET nls_date_format = [format model];
ALTER SESSION SET nls_timestamp_format = [format model];
ALTER SESSION SET nls_timestamp_tz_format = [format model];
For example.
ALTER SESSION SET nls_date_format='DD-Mon-RR HH24:MI:SS';
ALTER SESSION SET nls_timestamp_format='DD-Mon-RR HH24.MI.SSXFF3';
ALTER SESSION SET nls_timestamp_tz_format='DD-Mon-RR HH:MI.SSXFF6 TZR';

SELECT current_date, localtimestamp, current_timestamp FROM dual;

CURRENT_DATE        LOCALTIMESTAMP          CURRENT_TIMESTAMP
------------------- ----------------------- ----------------------------
09-Feb-09 12:57:43  09-Feb-09 12.57.42.787  09-Feb-09 12:57.42.787953 GB

1 rows selected
Alternatively, if you are using SQL Developer, select Tools -> Preferences, expand Database and select NLS Parameters.

The V$NLS_PARAMETERS view contains the current values of the NLS parameters.
SELECT *
FROM v$nls_parameters
WHERE parameter IN ('NLS_DATE_FORMAT', 'NLS_TIMESTAMP_FORMAT', 'NLS_TIMESTAMP_TZ_FORMAT');
PARAMETER                 VALUE
------------------------- -----------------------------
NLS_DATE_FORMAT           DD-Mon-RR HH24:MI:SS
NLS_TIMESTAMP_FORMAT      DD-Mon-RR HH24.MI.SSXFF3
NLS_TIMESTAMP_TZ_FORMAT   DD-Mon-RR HH:MI.SSXFF6 TZR

3 rows selected

Tuesday, 20 May 2008

Oracle Pipelined Functions

In a previous article we saw that a collection can be returned from a PL/SQL function and used in a query as if it were a table.

Rows from a collection returned by a table function can also be pipelined. This means that the rows are returned to the client as they are produced, rather than all at once when the function has run to completion. This can improve performance, especially with large result sets, because the client can start to process the results without having to wait for the function to finish.

CSV to Column

We will develop a simple pipelined function to read data from an external table which contains the following contact data.
"mrblond@abc.com","m","25"," movies, music"
"mrsred@jkl.com","f","33","cooking , reading, yoga, travel"
"mrpink@def.com","m","31"," movies, reading,sport"
"mrsbrown@ghi.com","f","22","cycling, swimming"
The function will pivot the comma separated list of interests so that each row contains a single interest, like so.
mrblond@abc.com         m     25    movies
mrblond@abc.com         m     25    music
mrsbrown@ghi.com        f     22    cycling
mrsbrown@ghi.com        f     22    swimming
mrpink@def.com          m     31    movies
mrpink@def.com          m     31    reading
mrpink@def.com          m     31    sport
First we create a directory object and an external table to access the contact details in the flat file.
CREATE OR REPLACE DIRECTORY tmp_dir AS '/tmp';
Directory created.

CREATE TABLE contact_ext(
email VARCHAR2(100),
sex CHAR(1),
age NUMBER,
interests VARCHAR2(500)
)
ORGANIZATION EXTERNAL
(DEFAULT DIRECTORY tmp_dir
ACCESS PARAMETERS(
records delimited BY newline
fields terminated BY ','
optionally enclosed BY '"')
LOCATION('contacts.txt')
);
Table created.
Next we create a contact object to represent the new table structure and a nested table of this type.
CREATE OR REPLACE TYPE contact_type AS OBJECT
(email    VARCHAR2(100),
 sex      CHAR(1),
 age      NUMBER,
 interest VARCHAR2(50));
Type created.

CREATE OR REPLACE TYPE contact_table_type AS
TABLE OF contact_type;
Type created.
The table function is then created using the nested table as the return type. The PIPELINED keyword indicates that the function will return rows iteratively.
CREATE OR REPLACE FUNCTION contact_list
  RETURN contact_table_type PIPELINED
AS
  l_interests dbms_utility.uncl_array;
  l_count     NUMBER;
BEGIN
  FOR rec IN (SELECT *
              FROM contact_ext)
  LOOP
    dbms_utility.comma_to_table(list   => rec.interests,
                                tablen => l_count,
                                tab    => l_interests);
    FOR i IN 1 .. l_count
    LOOP
      PIPE ROW (contact_type(rec.email,
                             rec.sex,
                             rec.age,
                             TRIM(l_interests(i))));
    END LOOP;
  END LOOP;
  RETURN;
END;
Function created.
We take each CSV list of values from the interests column of CONTACT_EXT and convert it into a nested table using the COMMA_TO_TABLE procedure of the DBMS_UTILITY package. We then iterate through the nested table and return a series of CONTACT_TYPEs using the PIPE ROW command.

This function can now be used as a table in a query.
SELECT * FROM TABLE (contact_list);
EMAIL                   SEX     AGE    INTEREST
----------------------- ------- ------ -------------
mrblond@abc.com         m       25     movies
mrblond@abc.com         m       25     music
mrsbrown@ghi.com        f       22     cycling
mrsbrown@ghi.com        f       22     swimming
mrpink@def.com          m       31     movies
mrpink@def.com          m       31     reading
mrpink@def.com          m       31     sport
mrsred@jkl.com          f       33     cooking
mrsred@jkl.com          f       33     reading
mrsred@jkl.com          f       33     yoga
mrsred@jkl.com          f       33     travel
11 rows selected
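The streaming behaviour is also visible from a client program. Here is a minimal JDBC sketch (ConnectionFactory is the same illustrative helper used in the batching examples above); each row is available to the loop as soon as the function pipes it, before contact_list has run to completion.

Connection con = ConnectionFactory.getNewConnection();
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(
    "SELECT email, interest FROM TABLE(contact_list)");
while (rs.next()) {
    // Rows arrive as they are piped, not when the function completes
    System.out.println(rs.getString("email") + ": " +
                       rs.getString("interest"));
}
rs.close();
stmt.close();
con.close();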

Column to CSV

Now let's do this in reverse. Suppose we are given the data in the above format and we need to return each email address on a separate line with the interests shown as a comma separated list.

We could use the SYS_CONNECT_BY_PATH function to provide an SQL only solution but for the purposes of this exercise we will develop a bespoke PL/SQL pipelined function.

First, we create a table to work with.
CREATE TABLE contact
AS SELECT * FROM TABLE (contact_list);
Table created.
This time, the function will take the contents of the table column and build up a comma separated string of values for each email address. The LEAD function helps us to identify when an email is about to change by looking at the next email address in the result set. Once again, rows are returned as they are produced using the PIPE ROW command.
CREATE OR REPLACE FUNCTION contact_csv
  RETURN contact_table_type PIPELINED
AS
  l_interests VARCHAR2(1000);
BEGIN
  FOR row IN
    (SELECT email,
            LEAD(email, 1) OVER (ORDER BY email) AS next_email,
            sex,
            age,
            interest
     FROM contact
     ORDER BY email)
  LOOP
    l_interests := l_interests || ',' || row.interest;
    IF (row.email != row.next_email OR row.next_email IS NULL) THEN
      PIPE ROW (contact_type(row.email,
                             row.sex,
                             row.age,
                             LTRIM(l_interests, ',')));
      l_interests := null;
    END IF;
  END LOOP;
  RETURN;
END;
Function created.
We can confirm the results using the following SQL.
SELECT * 
FROM TABLE(contact_csv);
EMAIL                  SEX     AGE     INTEREST
---------------------- ------- ------- ------------------------------
mrblond@abc.com        m       25      movies,music
mrpink@def.com         m       31      movies,reading,sport
mrsbrown@ghi.com       f       22      cycling,swimming
mrsred@jkl.com         f       33      cooking,reading,yoga,travel
4 rows selected

Summary

Oracle pipelined functions are table functions which return data iteratively. This allows the client to start processing the output from the function call before the entire collection has been populated.

How to pivot a column to a CSV list in Oracle

Suppose we have a column of data that we wish to display as a CSV list. One solution is to develop a pipelined table function in PL/SQL, but how can we achieve the same results using just SQL? This article shows how the Oracle SYS_CONNECT_BY_PATH function can be used to solve the problem.

Example

Using the employee table from the Oracle demo schema, we can easily display a list of employee names for each department.

SELECT deptno, ename
FROM emp
ORDER BY deptno;
DEPTNO        ENAME      
------------- ----------
10 CLARK
10 KING
10 MILLER
20 SMITH
20 ADAMS
20 FORD
20 SCOTT
20 JONES
30 ALLEN
30 BLAKE
30 MARTIN
30 JAMES
30 TURNER
30 WARD
14 rows selected
But suppose we wish to format the names as a comma separated list as follows.
DEPTNO       EMPLOYEES
------------ ------------------------------------
10 CLARK,KING,MILLER
20 SMITH,ADAMS,FORD,SCOTT,JONES
30 ALLEN,BLAKE,MARTIN,JAMES,TURNER,WARD
3 rows selected

SYS_CONNECT_BY_PATH

The technique described here uses the SYS_CONNECT_BY_PATH function to provide the CSV column. Let's remind ourselves of what SYS_CONNECT_BY_PATH actually does by displaying a hierarchy of employees.
SELECT 
empno,
mgr,
level,
SYS_CONNECT_BY_PATH(ename, '/') path
FROM emp
START WITH ename ='KING'
CONNECT BY PRIOR empno = mgr;
EMPNO   MGR        LEVEL     PATH
------- ---------- --------- --------------------------
7839               1         /KING
7566    7839       2         /KING/JONES
7788    7566       3         /KING/JONES/SCOTT
7876    7788       4         /KING/JONES/SCOTT/ADAMS
7902    7566       3         /KING/JONES/FORD
7369    7902       4         /KING/JONES/FORD/SMITH
7698    7839       2         /KING/BLAKE
7499    7698       3         /KING/BLAKE/ALLEN
7521    7698       3         /KING/BLAKE/WARD
7654    7698       3         /KING/BLAKE/MARTIN
7844    7698       3         /KING/BLAKE/TURNER
7900    7698       3         /KING/BLAKE/JAMES
7782    7839       2         /KING/CLARK
7934    7782       3         /KING/CLARK/MILLER
14 rows selected
SYS_CONNECT_BY_PATH returns the path of a column value from root to node, with column values separated by a '/' (or whichever character we specify) for each row returned by the CONNECT BY condition.

Now this function is only valid in hierarchical queries and unfortunately, in our example, there is no natural hierarchy in our data that we can use. Therefore, we must manufacture one using analytic functions.

Analytic Functions

Let's add a couple more columns to our original query.
SELECT deptno,
ename,
row_number() over(PARTITION BY deptno ORDER BY deptno) rnum,
count(*) over(PARTITION BY deptno) tot
FROM emp;
DEPTNO       ENAME      RNUM   TOT
------------ ---------- ------ -----
10           CLARK      1      3
10           KING       2      3
10           MILLER     3      3
20           SMITH      1      5
20           ADAMS      2      5
20           FORD       3      5
20           SCOTT      4      5
20           JONES      5      5
30           ALLEN      1      6
30           BLAKE      2      6
30           MARTIN     3      6
30           JAMES      4      6
30           TURNER     5      6
30           WARD       6      6
14 rows selected
The RNUM column creates an artificial hierarchy within each department which allows us to identify the parent using rnum - 1. TOT tells us the total number of employees for the current department. We use this to identify the employees who are at the bottom of the hierarchy (where rnum = tot).

Complete Solution

So, putting all this together, we get the following SQL.
SELECT 
deptno,
LTRIM(SYS_CONNECT_BY_PATH(ename, ','), ',') employees
FROM (
SELECT deptno,
ename,
row_number() over(PARTITION BY deptno ORDER BY deptno) rnum,
count(*) over(PARTITION BY deptno) tot
FROM emp
)
WHERE rnum=tot
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum -1 AND PRIOR deptno = deptno;
DEPTNO       EMPLOYEES
------------ ------------------------------------
10 CLARK,KING,MILLER
20 SMITH,ADAMS,FORD,SCOTT,JONES
30 ALLEN,BLAKE,MARTIN,JAMES,TURNER,WARD
3 rows selected
We create an arbitrary hierarchy of employees within department and select only those employees at the bottom of the hierarchy. Then we use SYS_CONNECT_BY_PATH with a comma delimiter to format the employee names as a CSV list. LTRIM removes the first comma from our list.

Clearly, this is not the most intuitive SQL statement ever written. However, we could use this statement to create a view which would simplify access to the data. Alternatively, we could write a bespoke pipelined table function to provide a programmatic solution.

Oracle Quick Tip: Querying Buffer Cache Usage

This SQL statement shows which objects in our schema are using the most blocks in the data buffer cache.
SELECT object_name, object_type type, COUNT(*) blocks
FROM v$bh
JOIN user_objects ON object_id = objd
GROUP BY object_name, object_type
ORDER BY blocks DESC;
OBJECT_NAME                   TYPE     BLOCKS
----------------------------- -------- ------
MGMT_METRICS_RAW_PK           INDEX      1960
MGMT_METRICS_1HOUR_PK         INDEX      1183
MGMT_METRICS_1DAY_PK          INDEX       349
MGMT_METRICS                  TABLE       149
MGMT_SYSTEM_PERF_LOG_IDX_01   INDEX        62
MGMT_CURRENT_METRICS_PK       INDEX        29
MGMT_SYSTEM_PERFORMANCE_LOG   TABLE        21
MGMT_TARGETS                  TABLE        14
MGMT_METRIC_COLLECTIONS       TABLE        14
MGMT_JOB_EMD_STATUS_QUEUE     TABLE        14
To display all objects, regardless of schema, use DBA_OBJECTS instead of USER_OBJECTS.

Oracle Table Functions

An Oracle table function returns a collection that can be queried as if it were a table.

For example.
SELECT * FROM TABLE(my_function);
The TABLE keyword informs Oracle that the collection produced by my_function should be treated as a table in this query.

Example

To demonstrate this, let's create a simple function which generates a list of dates for a specified range, which could be used as follows.
SELECT column_value gen_date
FROM TABLE (date_list('1-jan-08', '5-jan-08'));
gen_date                   
----------------
01-jan-08
02-jan-08
03-jan-08
04-jan-08
05-jan-08
5 rows selected
Incidentally, this article provides a simple solution for generating rows, including dates in a range, using the CONNECT BY LEVEL clause.

The DATE_LIST function

First, we create a collection to be used as the return type from the function. Here we use a nested table but a VARRAY would also be valid.
CREATE TYPE interval_tab
AS TABLE OF date;
/
Type created.
Now we can go ahead and create the date_list function.
CREATE OR REPLACE FUNCTION date_list(p_start_date IN DATE,
                                     p_end_date   IN DATE)
  RETURN interval_tab AS
  l_interval interval_tab := interval_tab();
  l_days     binary_integer := p_end_date - p_start_date + 1;
BEGIN
  IF (l_days < 0) THEN
    RETURN null;
  END IF;
  l_interval.extend(l_days);
  FOR i IN 1 .. l_days
  LOOP
    l_interval(i) := p_start_date + i - 1;
  END LOOP;
  RETURN l_interval;
END;
/
Function created.
This function populates a nested table by adding an incrementing value to the start date.

Using the table function

One use for this function would be to generate a list of dates to be used in an outer join. For example, the following query displays the number of schema objects created on each day of a specified date range.
SELECT to_char(column_value, 'dd-Mon-yy') created, nvl(cnt, 0) cnt
FROM TABLE(date_list('1-feb-08', '5-feb-08'))
LEFT OUTER JOIN
(SELECT TRUNC(created, 'dd') created, COUNT(*) cnt
FROM all_objects
GROUP BY TRUNC(created, 'dd'))
ON created = column_value
ORDER BY column_value;
CREATED        CNT                    
--------- --------
01-Feb-08 4
02-Feb-08 0
03-Feb-08 0
04-Feb-08 1
05-Feb-08 2
5 rows selected
Using the date_list function and an outer join we can display a row per day, including those days that have no entries in the ALL_OBJECTS table.

Oracle table functions can also be pipelined which allows the data to be returned iteratively as it is produced, rather than all at once when the function has run to completion.

Tuesday, 15 April 2008

LEAD and LAG Analytical Functions in Oracle

The LEAD and LAG analytical functions provide access to more than one row of a table at the same time, without the need for a self join. LEAD allows access to a row at a physical offset beyond the current row and LAG allows access to a row at a physical offset prior to the current row.

Here are a couple of examples which show LEAD and LAG in action.

Upper and Lower Bounds

The SALGRADE table, supplied as part of the Oracle demo schema, holds lower and upper salary values for a grade as follows.
SELECT *
FROM salgrade
ORDER BY grade;
GRADE       LOSAL        HISAL
----------- ------------ ----------
1           1            1200
2           1201         1400
3           1401         2000
4           2001         3000
5           3001         9999
5 rows selected.
Another way to maintain this information is to store only the upper bound and derive the lower bound from the previous upper bound plus one.

First, we create our new table containing just the grade and high salary values.
CREATE TABLE new_salgrade AS
SELECT grade, hisal
FROM salgrade;
CREATE TABLE succeeded.

SELECT *
FROM new_salgrade;
GRADE       HISAL                  
----------- -----------
1 1200
2 1400
3 2000
4 3000
5 9999
5 rows selected.
Using LAG we can construct a view which will produce the data in the same format as the original table.
CREATE VIEW salgrade_v AS
SELECT
grade,
LAG(hisal, 1, 0) OVER (ORDER BY hisal) + 1 AS losal,
hisal
FROM new_salgrade;
CREATE VIEW succeeded.

SELECT * 
FROM salgrade_v
ORDER BY grade;
GRADE      LOSAL      HISAL
---------- ---------- -----------
1          1          1200
2          1201       1400
3          1401       2000
4          2001       3000
5          3001       9999
5 rows selected.
One advantage with this technique is that the table is easier to maintain. If a salary grade is changed, then only the HISAL on one row needs to be updated. The LOSAL on the next row is adjusted automatically.
UPDATE new_salgrade
SET hisal = 2200
WHERE grade = 3;
1 rows updated

COMMIT;
Commit complete.

SELECT * 
FROM salgrade_v;
GRADE       LOSAL      HISAL
----------- ---------- -------------
1           1          1200
2           1201       1400
3           1401       2200
4           2201       3000
5           3001       9999
5 rows selected.

External Log File

The next example processes metric data which has been collected and recorded in an external log file.

The contents of the file look like this.
21-sep-07 13:51:12
type: metric
low: 33
avg: 121
high: 230
21-sep-07 19:23:55
type: metric
low: 66
avg: 75
high: 120
21-sep-07 23:40:17
type: metric
low: 13
avg: 98
high: 554
22-sep-07 01:39:22
type: metric
low: 121
avg: 375
high: 520
22-sep-07 10:57:15
type: metric
low: 43
avg: 99
high: 148
Suppose we are required to produce the following report with one line for each metric sample.
recorded_at          duration      low  avg  high
-------------------- ------------- ---- ---- ----
dd-mon-yy hh:mi:ss   9 9:99:99.9   999  999  999
dd-mon-yy hh:mi:ss   9 9:99:99.9   999  999  999
dd-mon-yy hh:mi:ss   9 9:99:99.9   999  999  999
We will use an external table to access the contents of the file.
CREATE OR REPLACE DIRECTORY tmp_dir AS '/tmp';
Directory created.

CREATE TABLE event_log (
rec VARCHAR2(200))
ORGANIZATION EXTERNAL(
DEFAULT DIRECTORY tmp_dir
ACCESS PARAMETERS(
records delimited by newline
fields
missing field VALUES ARE NULL
reject rows with all null fields)
LOCATION('event.log'));
Table created.

SELECT *
FROM event_log
WHERE rownum <=10;
REC
-----------------------------
21-sep-07 13:51:12
type: metric
low: 33
avg: 121
high: 230
21-sep-07 19:23:55
type: metric
low: 66
avg: 75
high: 120
10 rows selected.
Now we can start to construct our SQL statement.

First, we will remove any labels from the records using REGEXP_REPLACE. Labels are identified as a single word at the start of a line followed by a colon and a space.
SELECT REGEXP_REPLACE(rec,'^[[:alpha:]]*: ', null) rec
FROM event_log
WHERE rownum <= 10;
REC
-----------------------------
21-sep-07 13:51:12
metric
33
121
230
21-sep-07 19:23:55
metric
66
75
120
10 rows selected
Next, we use the LEAD function to merge several records from the file into a single row for each sample.
SELECT
  TO_TIMESTAMP(recorded_at, 'dd-mon-yy hh24:mi:ss') recorded_at,
  TO_NUMBER(low) low,
  TO_NUMBER(avg) avg,
  TO_NUMBER(high) high
FROM (
  SELECT
    rec,
    LAG(rec, 1, null) OVER (ORDER BY rownum) recorded_at,
    LEAD(rec, 1, null) OVER (ORDER BY rownum) low,
    LEAD(rec, 2, null) OVER (ORDER BY rownum) avg,
    LEAD(rec, 3, null) OVER (ORDER BY rownum) high
  FROM (
    SELECT REGEXP_REPLACE(rec, '^[[:alpha:]]*: ', null) rec
    FROM event_log
  )
)
WHERE rec = 'metric';
recorded_at           low    avg   high
------------------ ------ ------ ------
21-SEP-07 13.51.12     33    121    230
21-SEP-07 19.23.55     66     75    120
21-SEP-07 23.40.17     13     98    554
22-SEP-07 01.39.22    121    375    520
22-SEP-07 10.57.15     43     99    148
5 rows selected.
And finally, now that we have all the required columns with the correct data types, we can use the LAG function to calculate the duration between the current sample and the previous one.
SELECT
  recorded_at,
  recorded_at - LAG(recorded_at, 1, null)
                  OVER (ORDER BY recorded_at) duration,
  low,
  avg,
  high
FROM (
  SELECT
    TO_TIMESTAMP(recorded_at, 'dd-mon-yy hh24:mi:ss') recorded_at,
    TO_NUMBER(low) low,
    TO_NUMBER(avg) avg,
    TO_NUMBER(high) high
  FROM (
    SELECT
      rec,
      LAG(rec, 1, null) OVER (ORDER BY rownum) recorded_at,
      LEAD(rec, 1, null) OVER (ORDER BY rownum) low,
      LEAD(rec, 2, null) OVER (ORDER BY rownum) avg,
      LEAD(rec, 3, null) OVER (ORDER BY rownum) high
    FROM (
      SELECT REGEXP_REPLACE(rec, '^[[:alpha:]]*: ', null) rec
      FROM event_log
    )
  )
  WHERE rec = 'metric'
);
recorded_at         duration        low   avg   high
------------------- ------------ ------ ----- ------
21-SEP-07 13.51.12                   33   121    230
21-SEP-07 19.23.55  0 5:32:43.0      66    75    120
21-SEP-07 23.40.17  0 4:16:22.0      13    98    554
22-SEP-07 01.39.22  0 1:59:5.0      121   375    520
22-SEP-07 10.57.15  0 9:17:53.0      43    99    148
5 rows selected.
So there we have it. In this example, we read an external file, stripped out the labels and used LEAD to pivot several rows into columns. Then we used LAG to calculate the duration between adjacent samples.

Saturday, 5 April 2008

Using SOUNDEX with a Function Based Index in Oracle

The SOUNDEX function returns a phonetic representation of an English word. It can be used to compare words that are spelled differently, but sound alike.

For example.
SELECT ename
FROM emp
WHERE SOUNDEX(ename) = SOUNDEX('janes');
ENAME      
----------
JONES
JAMES
2 rows selected
However, a regular index on the ENAME column will not be used by Oracle if we include the SOUNDEX function.
CREATE INDEX emp_ix1 ON emp(ename);
Index created.

SELECT ename
FROM emp
WHERE SOUNDEX(ename) = SOUNDEX('janes');
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=4 Card=1 Bytes=6)
1 0 TABLE ACCESS (FULL) OF 'EMP' (TABLE) (Cost=4 Card=1 Bytes=6)
Instead, we must use a function based index.
CREATE INDEX emp_ix2 ON emp(SOUNDEX(ename));
Index created.

SELECT ename
FROM emp
WHERE SOUNDEX(ename) = SOUNDEX('janes');
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=6)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'EMP' (TABLE) (Cost=2 Card=1 Bytes=6)
2 1 INDEX (RANGE SCAN) OF 'EMP_IX2' (INDEX) (Cost=1 Card=1)
This time, the function based index is used to retrieve the data.

Incidentally, a Java equivalent is included as part of the Apache Commons project.
import org.apache.commons.codec.language.Soundex;
..
public boolean soundsLike(String firstWord, String secondWord) {
Soundex s = new Soundex();
return s.soundex(firstWord).equals(s.soundex(secondWord));
}

RLWRAP

rlwrap is a Linux command which provides line editing, persistent history and word completion for command line tools such as sqlplus and rman.

Simply type rlwrap before your command as follows.
rlwrap sqlplus scott/tiger
Now you can use the up and down arrow keys to retrieve your history, even across invocations of the command. You can also edit the line using the left and right keys, deleting and inserting text as required.

The -c option enables file completion which can be useful when running scripts. If you type @ followed by the start of a file name and press tab, rlwrap will complete the name of the file for you.

You can also specify a file containing your own list of words to be auto-completed.
rlwrap -c -f mywordlist.txt sqlplus scott/tiger
This file could contain a list of tables and views in your schema, for example.

Friday, 28 March 2008

Using ROWNUM in Oracle

ROWNUM is an Oracle pseudo column which numbers the rows in a result set.
SELECT rownum, table_name
FROM user_tables;
ROWNUM        TABLE_NAME                     
------------- -----------------
1 EMP
2 DEPT
3 BONUS
4 SALGRADE
5 DUMMY
5 rows selected
Here is a summary of how ROWNUM can be used.

Limiting Rows

ROWNUM can be used to limit the number of rows returned by a query in a similar way to LIMIT in PostgreSQL and MySQL, TOP in SQL Server and FETCH FIRST in DB2.
SELECT rownum, table_name
FROM user_tables
WHERE rownum <=3;
ROWNUM        TABLE_NAME                     
------------- -----------------
1 EMP
2 DEPT
3 BONUS
3 rows selected

ROWNUM with DML

The use of ROWNUM is not restricted to select statements. It can be used with DML statements that update the database too.
CREATE TABLE o AS
SELECT *
FROM all_objects
WHERE rownum <= 1000;
Table created

UPDATE o
SET object_id = rownum,
created = created + INTERVAL '1' MINUTE * rownum
WHERE rownum <= 100;
100 rows updated

DELETE FROM o
WHERE OWNER = 'SYS'
AND rownum = 1;
1 rows deleted
ROWNUM is particularly useful, when used in conjunction with the CONNECT BY LEVEL clause, for creating arbitrary rows in the database. See the article on generating rows in Oracle for more details.

Offsetting Rows

Rows can also be skipped at the beginning of a result set using ROWNUM.
SELECT rnum, table_name
FROM
(SELECT rownum rnum, table_name
FROM user_tables)
WHERE rnum > 2;
RNUM     TABLE_NAME                     
-------- ----------------
3 SALGRADE
4 DUMMY
5 DEPT
3 rows selected
You will notice that an inline view has been introduced to transform the ROWNUM pseudo column into a 'real' column before we do the comparison.

It is tempting to write the above SQL as follows.
SELECT table_name
FROM user_tables
WHERE rownum > 2;
TABLE_NAME                     
------------------------------
0 rows selected
However, this query will always return zero rows, regardless of the number of rows in the table.

To explain this behaviour, we need to understand how Oracle processes ROWNUM. When assigning ROWNUM to a row, Oracle starts at 1 and only increments the value when a row is selected; that is, when all conditions in the WHERE clause are met. Since our condition requires that ROWNUM is greater than 2, no rows are selected and ROWNUM is never incremented beyond 1.

The bottom line is that conditions such as the following will work as expected.

.. WHERE rownum = 1;

.. WHERE rownum <= 10;



While queries with these conditions will always return zero rows.

.. WHERE rownum = 2;

.. WHERE rownum > 10;

Top-n Query

Typically, a top-n query sorts data into the required sequence and then limits the output to a subset of rows.

For example, suppose we wish to retrieve the top three earners from our employee table.
SELECT ename, sal
FROM (
SELECT ename, sal
FROM emp
ORDER BY sal DESC)
WHERE rownum <=3;
ENAME      SAL                    
---------- ---------
KING 5000
SCOTT 3000
FORD 3000
3 rows selected
The inline view (the inner select) sorts the rows and passes the result up to the outer select. The outer select then limits the output to three rows.

It may seem more natural to use the following SQL.
SELECT ename, sal
FROM emp
WHERE rownum <=3
ORDER BY sal DESC;
ENAME      SAL                    
---------- ----------------------
ALLEN 1600
WARD 1250
SMITH 800
3 rows selected
However, this does not give us the result we want because Oracle assigns the ROWNUM values to the rows before it does the sort.

In this example, Oracle will retrieve three rows from the table, any three rows, and sort only these three rows. We really need Oracle to sort all the rows and then return the first three. The inline view ensures that this happens.

Sort Performance

Limiting rows on a sorted result set using ROWNUM can also provide an added performance benefit. Rather than physically sorting all the rows to retrieve just the top few, Oracle maintains an array which contains just the highest or the lowest values (depending on whether we specified ASC or DESC in the ORDER BY clause). The size of the array will be the number of rows we wish to return. As rows are processed, only the highest (or lowest) values are retained in the array. All other rows are discarded.
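The effect can be illustrated outside the database. Here is a rough Java analogy (an illustration only, not Oracle's actual implementation, and the salary values are made up) which keeps just the top n values in a small bounded structure while scanning, discarding everything else:

import java.util.PriorityQueue;

public class TopN {
    public static void main(String[] args) {
        int n = 3;
        int[] sals = {800, 1600, 1250, 2975, 5000, 3000, 3000, 1500};

        // Min-heap of at most n elements: the smallest of the current
        // top n sits at the head and is evicted when a larger value arrives
        PriorityQueue<Integer> top = new PriorityQueue<>();
        for (int sal : sals) {
            top.offer(sal);
            if (top.size() > n) {
                top.poll();  // discard, it can never be in the top n
            }
        }
        System.out.println(top);  // the three highest salaries, in no particular order
    }
}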

Pagination

Next, we will see how ROWNUM is used to select a range of rows from within a result set. This is useful if we are to provide pagination on a web screen, for example.

Suppose we are paging through the employee table in name order and we wish to display rows six to ten inclusive.
SELECT rnum, ename, job
FROM
(SELECT /*+ FIRST_ROWS(10) */ rownum rnum, ename, job
FROM
(SELECT ename, job
FROM emp
ORDER BY ename)
WHERE rownum <= 10
)
WHERE rnum > 5;
RNUM     ENAME      JOB
-------- ---------- ---------
6        JAMES      CLERK
7        JONES      MANAGER
8        KING       PRESIDENT
9        MARTIN     SALESMAN
10       MILLER     CLERK
5 rows selected
We use nested inline views to retrieve and sort the data and then apply the range check using ROWNUM. We have split the upper and lower bound check, which allows Oracle to use COUNT(STOPKEY) in the execution plan when checking for ROWNUM <= 10. This is a performance optimization which, along with the sorting optimization described earlier, will ensure that our query runs efficiently as the table grows.

The FIRST_ROWS(n) hint also tells Oracle to optimize the query so that the first n rows are returned as quickly as possible.

Summary

ROWNUM provides a mechanism for returning a subset or range of rows from a query. It can be misleading at first if not properly understood but, once mastered, is invaluable for limiting result set output for pagination and top-n style queries.

For more information on ROWNUM, see Tom Kyte's article on OTN.

For more information on Oracle, visit level up.

Saturday, 22 March 2008

How to Monitor Oracle Index Usage

We can use the following command to determine whether a particular index is being used by Oracle.
ALTER INDEX <index_name> MONITORING USAGE;
If the specified index is used before monitoring is disabled, then the USED column of the relevant V$OBJECT_USAGE row is set to YES.

Let's see this in action.

First, we create a table using the CONNECT BY LEVEL clause to generate a thousand rows, with an index on the ID column.
CREATE TABLE index_usage AS
SELECT rownum id, 'xyz' || rownum name
FROM dual
CONNECT BY LEVEL <= 1000;
Table created.

CREATE INDEX index_usage_ix1 ON INDEX_USAGE (id);
Index created.
Then we start monitoring the usage of the index and display the results from the V$OBJECT_USAGE view.
ALTER INDEX index_usage_ix1 MONITORING USAGE;
Index altered.

SELECT monitoring, used, start_monitoring, end_monitoring
FROM v$object_usage
WHERE index_name = 'INDEX_USAGE_IX1';
MONITORING USED START_MONITORING    END_MONITORING
---------- ---- ------------------- -------------------
YES        NO   03/22/2008 11:44:20
1 rows selected
Next, we access the table using the index, disable the monitoring and display the results again.
SELECT *
FROM index_usage
WHERE id = 123;
ID                     NAME                                        
---------------------- -------------------
123 xyz123
1 rows selected
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=10)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'INDEX_USAGE' (TABLE) (Cost=2 Card=1 Bytes=10)
2 1 INDEX (RANGE SCAN) OF 'INDEX_USAGE_IX1' (INDEX) (Cost=1 Card=1)

ALTER INDEX index_usage_ix1 NOMONITORING USAGE;
Index altered.

SELECT monitoring, used,start_monitoring, end_monitoring
FROM v$object_usage
WHERE index_name = 'INDEX_USAGE_IX1';
MONITORING USED START_MONITORING    END_MONITORING
---------- ---- ------------------- -------------------
NO         YES  03/22/2008 11:44:20 03/22/2008 11:47:35
1 rows selected
This shows us that the index was used during our monitoring period.

This feature can help us to identify unused indexes which are candidates for removal. Removing unnecessary indexes can improve performance by reducing overhead during updates.

Saturday, 15 March 2008

JDBC Fetch Size

Fetch size is a performance hint which tells the JDBC driver how many rows should be fetched from the database when more rows are needed by the client. Increasing this value can improve overall query performance, especially with large result sets.

The required fetch size can be set on a Statement or a ResultSet object using the setFetchSize method. If used with a Statement then all ResultSets returned by that Statement will have the same fetch size. The default is ten.

Setting a JDBC fetch size that is too large or too small can degrade performance. Usually, a fetch size of one half or one quarter of the total expected result size is optimal.
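For example, here is a minimal sketch of both approaches, assuming an open connection con and the one_million_rows table created in the example below:

Statement stmt = con.createStatement();
// Every result set produced by this statement will default to
// fetching 100 rows per round trip
stmt.setFetchSize(100);

ResultSet rs = stmt.executeQuery("SELECT * FROM one_million_rows");
// Override the statement-level setting for this result set only
rs.setFetchSize(500);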

Example

To see the effect of using different fetch sizes with Oracle, we first create a table containing a million rows using the CONNECT BY LEVEL clause.

CREATE TABLE one_million_rows AS
SELECT rownum id, rpad('x', 20, 'x') || rownum a, sysdate b
FROM dual
CONNECT BY LEVEL <= 1000000;
Table created.

Next, we call the following method several times with different fetch sizes and record the elapsed time taken to retrieve the data. SQL tracing is also enabled, allowing us to create a TKPROF report for each run.

public void testFetchSize(int fetchSize) throws SQLException {
    Connection con = ConnectionFactory.getNewConnection();
    Statement stmt = con.createStatement();

    stmt.execute(
        "ALTER SESSION SET tracefile_identifier = " +
        "FETCH_SIZE_" + fetchSize);
    stmt.execute(
        "ALTER SESSION SET events " +
        "'10046 trace name context forever, level 12'");

    ResultSet rs = stmt.executeQuery(
        "SELECT /* FETCH SIZE " + fetchSize + " */ * " +
        "FROM one_million_rows");
    rs.setFetchSize(fetchSize);

    long start = System.currentTimeMillis();
    while (rs.next());
    System.out.printf("Fetch size %d took %4.2f seconds\n",
                      fetchSize,
                      (System.currentTimeMillis() - start) / 1000f);

    stmt.execute(
        "ALTER SESSION SET events '10046 trace name context off'");
}

For this exercise we have used event 10046 to enable SQL tracing (rather than sql_trace) because we need to capture the wait events in addition to the execution statistics. We will see why later.

Note: Starting with Oracle 10g release 2, the preferred method of enabling SQL tracing is with the DBMS_SESSION package. The ALTER SESSION SET EVENTS command shown above can now be enabled with the following procedure call.

DBMS_SESSION.SESSION_TRACE_ENABLE(waits=>TRUE, binds=>TRUE);

Results

Here is a summary of the results.

Fetch Size   Elapsed Time (secs)
----------   -------------------
10           93.78
50           29.48
100          16.76
200          9.58
500          6.22
1000         4.86
5000         3.91
10000        3.79
50000        3.93
100000       3.98

Clearly, the fetch size has a considerable impact on the overall performance of our query. The difference between the fastest, which was under four seconds, and the slowest, which was over a minute and a half, is certainly not trivial.

We can also see that performance starts to degrade slightly with a fetch size somewhere between ten thousand and fifty thousand. From these results, it would appear that ten thousand is a reasonable value for our fetch size.

TKPROF

We can also see the difference in performance with the TKPROF reports produced during the tests. Here, we compare a fetch size of ten to a fetch size of ten thousand.
SELECT /* FETCH SIZE 10 */ * 
FROM
one_million_rows

call        count      cpu    elapsed      query       rows
-------  --------  -------  ---------  ---------  ---------
Parse           1     0.00       0.00          0          0
Execute         1     0.00       0.00          0          0
Fetch      100001     7.40       6.77     101220    1000000
-------  --------  -------  ---------  ---------  ---------
total      100003     7.40       6.77     101220    1000000

Elapsed times include waiting on following events:
Event waited on                        Times   Max. Wait  Total Waited
------------------------------------  Waited  ----------  ------------
SQL*Net message to client             100001        0.00          0.37
SQL*Net message from client           100001        0.05         65.21
************************************************************************

SELECT /* FETCH SIZE 10000 */ *
FROM
one_million_rows

call        count      cpu    elapsed      query       rows
-------  --------  -------  ---------  ---------  ---------
Parse           1     0.00       0.00          0          0
Execute         1     0.00       0.00          0          0
Fetch         101     1.98       3.03      11573    1000000
-------  --------  -------  ---------  ---------  ---------
total         103     1.98       3.03      11573    1000000

Elapsed times include waiting on following events:
Event waited on                        Times   Max. Wait  Total Waited
------------------------------------  Waited  ----------  ------------
SQL*Net message to client                101        0.00          0.00
SQL*Net message from client              101        0.11          0.77
SQL*Net more data to client            17740        0.07          1.35
************************************************************************

Note: The report has been condensed to fit on the page.

Increasing the fetch size to ten thousand has more than halved the CPU and elapsed times and reduced consistent (logical) reads by almost ninety percent. However, the most significant gain is in the reduction of time spent waiting for a response from the client, from over a minute to less than a second.

Sunday, 9 March 2008

Oracle External Tables

An external table enables you to access data in a flat file as if it were in a table in the database.

External tables can be used to read the contents of a log file or spreadsheet export, for example, or to import information into a data warehouse as part of an ETL operation.

A Simple Example

Suppose we have a file containing city populations in our /tmp directory and we wish to access this data from within our database.
Tokyo,Japan,33600000
Seoul,South Korea,23400000
Mexico City,Mexico,22400000
New York,USA,21900000
Bombay,India,21600000
Delhi,India,21500000
Sao Paulo,Brazil,20600000
Los Angeles,USA,18000000
Shanghai,China,17500000
Osaka,Japan,16700000

First, we create a DIRECTORY object which will reference the /tmp directory in our file system.

CREATE OR REPLACE DIRECTORY tmp_dir AS '/tmp';
Directory created.

Note: This command format will work on both Windows and Linux.

Now, we create an external table by specifying the name of the file we wish to query and the directory where it can be located. We will rely on the default values of comma for a field separator and new line as the record delimiter, to describe our data.

CREATE TABLE pop_ext (
city VARCHAR2(30),
country VARCHAR2(30),
population NUMBER)
ORGANIZATION EXTERNAL
(DEFAULT DIRECTORY tmp_dir
LOCATION('pop.txt'));
Table created.

The file can now be queried as if it were a table in the database.

SELECT * 
FROM pop_ext;
CITY              COUNTRY          POPULATION
----------------- ---------------- ------------------
Tokyo             Japan            33600000
Seoul             South Korea      23400000
Mexico City       Mexico           22400000
New York          USA              21900000
Bombay            India            21600000
Delhi             India            21500000
Sao Paulo         Brazil           20600000
Los Angeles       USA              18000000
Shanghai          China            17500000
Osaka             Japan            16700000
10 rows selected

External tables are read-only. An attempt to update the table will result in the following error message, complete with some useful advice on the appropriate action to take.

UPDATE pop_ext 
SET population = 0;
ORA-30657: operation not supported on external organized table
*Cause:  User attempted on operation on an external table which is
         not supported.
*Action: Don't do that!

We could, of course, create a 'real' table containing the city populations if we wish to maintain the data.

CREATE TABLE pop AS
SELECT * FROM pop_ext;
Table created.

A More Detailed Example

In this example we have a file which contains bank holidays in the UK for 2007 to 2009.
"Holiday"       2007            2008            2009
"New Year" 01/01/07 01/01/08 01/01/09
"Good Friday" 06/04/07 21/03/08 10/04/09
"Easter Monday" 09/04/07 24/03/08 13/04/09
"Early May" 07/05/07 05/05/08 04/05/09
"(First Monday in May)"
"Spring Bank" 28/05/07 26/05/08 25/05/09
"(Last Monday in May)"
"Summer" 27/08/07 25/08/08 31/08/09
"(Last Monday in August)"
"Christmas Day" 25/12/07 25/12/08 25/12/09
"Boxing Day" 26/12/07 26/12/08 28/12/09

This time, we will include a few more options in the table definition to describe the format of the data and to exclude the comment lines (the text surrounded by brackets).

CREATE OR REPLACE DIRECTORY log_dir AS '/logs';
Directory created.

CREATE TABLE bank_hol_ext(
hol_desc VARCHAR2(50),
hol_2007 DATE,
hol_2008 DATE,
hol_2009 DATE
)
ORGANIZATION EXTERNAL
(TYPE oracle_loader
DEFAULT DIRECTORY tmp_dir
ACCESS PARAMETERS(
records delimited BY newline
load when (hol_2007 != BLANKS)
badfile log_dir:'bank_holidays.bad'
logfile log_dir:'bank_holidays.log'
discardfile log_dir:'bank_holidays.dsc'
skip 1
fields terminated by '\t'
optionally enclosed by '"'
missing field values are null
(
hol_desc,
hol_2007 DATE "dd/mm/yy",
hol_2008 DATE "dd/mm/yy",
hol_2009 DATE "dd/mm/yy"
)
)
LOCATION('hols.txt'))
REJECT LIMIT UNLIMITED;
Table created.

SELECT *
FROM bank_hol_ext;

HOL_DESC                      HOL_2007       HOL_2008       HOL_2009
----------------------------- -------------- -------------- ------------
New Year's Day                01-jan-2007    01-jan-2008    01-jan-2009
Good Friday                   06-apr-2007    21-mar-2008    10-apr-2009
Easter Monday                 09-apr-2007    24-mar-2008    13-apr-2009
Early May Bank Holiday        07-may-2007    05-may-2008    04-may-2009
Spring Bank Holiday           28-may-2007    26-may-2008    25-may-2009
Summer Bank Holiday           27-aug-2007    25-aug-2008    31-aug-2009
Christmas Day                 25-dec-2007    25-dec-2008    25-dec-2009
Boxing Day                    26-dec-2007    26-dec-2008    28-dec-2009
8 rows selected

Here, we have specified that the first row (the title) should be skipped, that fields may be surrounded by quotes and that a tab is used as the field delimiter. We also tell Oracle the format of the date fields and that we don't wish to impose a reject limit for bad records.

The logfile, badfile and discardfile clauses specify the location and file names for these file types. The logfile contains messages generated by the external tables utility while it is accessing data in the data file. The badfile contains the rows that could not be loaded due to errors (for example, a date was not valid for the format model) and the discardfile contains the rows excluded by the condition in the LOAD WHEN clause.

The default behaviour is to create these files in the same directory as the data file. However, if we only have read access to this directory, we can use these clauses to create the files in another location. In our example, we have chosen to write the files to the /logs directory.

If we wish, we can create another external table to view the contents of these files. Here, we define a table to query the log file, which contains a single column with a fixed width of four thousand bytes.

CREATE TABLE bank_hol_rej_ext (
rec VARCHAR2(4000))
ORGANIZATION EXTERNAL
(DEFAULT DIRECTORY log_dir
ACCESS PARAMETERS(
records delimited by newline
fields
missing field VALUES ARE NULL
(
rec position(1:4000)
)
)
LOCATION('bank_holidays.log'));
Table created.

SELECT *
FROM bank_hol_rej_ext
WHERE rownum <=10
AND rec IS NOT NULL;
REC
---------------------------------------------------------------
LOG file opened at 03/04/08 21:23:35
Field Definitions for table BANK_HOL_EXT
Record format DELIMITED BY NEWLINE
Data in file has same endianness as the platform
Rows with all null fields are accepted
Load when (HOL_2007 != BLANKS)
Fields in Data Source:
HOL_DESC CHAR (255)
Terminated by " "
Enclosed by """ and """
10 rows selected

We can change the table definition to reference a different file using the ALTER TABLE command.

ALTER TABLE bank_hol_rej_ext LOCATION ('bank_holidays.dsc');
Table altered.
This allows us to query the discard file with the same external table.
SELECT * 
FROM bank_hol_rej_ext;
REC
----------------------------------
"(First Monday in May)"
"(Last Monday in May)"
"(Last Monday in August)"
3 rows selected.

Data Pump

Oracle 10g introduced the option to create an external table using data constructed from within the database.
CREATE TABLE emp_ext
ORGANIZATION EXTERNAL
(
TYPE ORACLE_DATAPUMP
DEFAULT DIRECTORY tmp_dir
LOCATION ('emp.dp')
)
AS
SELECT empno, ename, job, sal, comm, dname
FROM emp JOIN dept USING (deptno);
Table created.

This SQL will create a flat file in the /tmp directory containing the results of the SELECT statement. Unfortunately, the file is a proprietary binary format which can only be read by Data Pump. It would be useful to have the option to produce a pure CSV format which could easily be read by other tools and utilities. However, the file is readable across platforms, which allows us to move it between Windows and Linux and create another external table on the target machine, to read the data.

CREATE TABLE emp_ext (
empno NUMBER(4,0),
ename VARCHAR2(10),
job VARCHAR2(9),
sal NUMBER(7,2),
comm NUMBER(7,2),
dname VARCHAR2(14)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_DATAPUMP
DEFAULT DIRECTORY tmp_dir
LOCATION ('emp.dp')
);
Table created.

This can be useful for transferring arbitrary data from one database to another where a database link is not available.

Saturday, 1 March 2008

How to Generate Rows in Oracle


Here is a simple and efficient way to generate arbitrary rows in Oracle.
SELECT <columns>
FROM dual
CONNECT BY LEVEL <= n;
where n is the number of rows you wish to produce.

For example.
SELECT rownum 
FROM dual
CONNECT BY LEVEL <= 5;
    ROWNUM
----------
1
2
3
4
5
5 rows selected

Test Data

One use for this technique is to quickly generate the contents of a table. The following example uses a CTAS statement to create a table and populate it with test data.

CREATE TABLE test_data(
id PRIMARY KEY,
group_id NOT NULL,
created_at NOT NULL,
text NOT NULL) AS
SELECT rownum,
MOD(rownum, 5),
TO_DATE('1-jan-07', 'dd-mon-yy') + INTERVAL '1' MINUTE * rownum,
CAST ('xyz' || rownum AS VARCHAR2(50))
FROM dual
CONNECT BY LEVEL <= 10000;
Table Created.

SELECT * 
FROM test_data
WHERE id <= 10;
ID         GROUP_ID   CREATED_AT          TEXT
---------- ---------- ------------------- -----------
1          1          01-jan-07 00:01:00  xyz1
2          2          01-jan-07 00:02:00  xyz2
3          3          01-jan-07 00:03:00  xyz3
4          4          01-jan-07 00:04:00  xyz4
5          0          01-jan-07 00:05:00  xyz5
6          1          01-jan-07 00:06:00  xyz6
7          2          01-jan-07 00:07:00  xyz7
8          3          01-jan-07 00:08:00  xyz8
9          4          01-jan-07 00:09:00  xyz9
10         0          01-jan-07 00:10:00  xyz10
10 rows selected.

We now have ten thousand rows of test data with a sequential number for the primary key, a group item which contains five distinct values, a date field which increments by one minute for each row and a text field containing unique values.

Sparse Data

Another way we can utilise row generation is when dealing with sparse data.

Suppose we wish to use the data from the previous example to show the total number of rows by day.

SELECT TRUNC(created_at, 'dd') created_on, count(*)
FROM test_data
GROUP BY TRUNC(created_at, 'dd');
CREATED_ON   COUNT(*)
---------- ----------
01-jan-07 1439
02-jan-07 1440
03-jan-07 1440
04-jan-07 1440
05-jan-07 1440
06-jan-07 1440
07-jan-07 1361
7 rows selected.

However, let us assume that we must display a row for every day within a specified date range, even if the row count is zero. This is often the case if the data is used to produce a graph, for example.

It just so happens that our query produced a row per day between the 1st and 7th of January, but suppose the required date range is the 1st to the 10th of January and our table has a couple of days with no data. We will delete the rows for the 3rd and 5th of January to simulate this.

DELETE FROM test_data 
WHERE TRUNC(created_at, 'dd') IN (TO_DATE('3-jan-2007', 'dd-mon-yyyy'),
TO_DATE('5-jan-2007', 'dd-mon-yyyy'));
2880 rows deleted.

SELECT TRUNC(created_at, 'dd') created_on, count(*)
FROM test_data
GROUP BY TRUNC(created_at, 'dd');
CREATED_ON   COUNT(*)
---------- ---------
01-jan-07 1439
02-jan-07 1440
04-jan-07 1440
06-jan-07 1440
07-jan-07 1361
5 rows selected.

Because we are dealing with sparse data, row counts for only five days are returned by our query. However, the requirement is to produce a row for each of the ten days in our range. We will use the CONNECT BY LEVEL clause, inline views and an outer join to provide the solution.

First, we construct a query to return a row for each day in a specified range.

SELECT TO_DATE('1-jan-2007', 'dd-mon-yyyy') + (rownum - 1) created_on
FROM dual
CONNECT BY LEVEL <= TO_DATE('10-jan-2007', 'dd-mon-yyyy') -
TO_DATE('1-jan-2007', 'dd-mon-yyyy') + 1;
CREATED_ON
----------
01-jan-07
02-jan-07
03-jan-07
04-jan-07
05-jan-07
06-jan-07
07-jan-07
08-jan-07
09-jan-07
10-jan-07
10 rows selected.
This query is then joined with an aggregate query on the test data using an outer join.
SELECT created_on,
NVL(qty, 0) qty
FROM
(SELECT TO_DATE('1-jan-2007', 'dd-mon-yyyy') + (rownum - 1) created_on
FROM dual
CONNECT BY LEVEL <= TO_DATE('10-jan-2007', 'dd-mon-yyyy') -
TO_DATE('1-jan-2007', 'dd-mon-yyyy') + 1)
LEFT OUTER JOIN
(SELECT TRUNC(created_at, 'dd') created_on,
COUNT(*) qty
FROM test_data
GROUP BY TRUNC(created_at, 'dd'))
USING(created_on)
ORDER BY created_on;
CREATED_ON        QTY
---------- ----------
01-jan-07 1439
02-jan-07 1440
03-jan-07 0
04-jan-07 1440
05-jan-07 0
06-jan-07 1440
07-jan-07 1361
08-jan-07 0
09-jan-07 0
10-jan-07 0
10 rows selected.

Now we are guaranteed to return a row for every day in our range, even for the days with no data.