Information repositories are essential. They allow data to be shared within or outside an organization, bringing us closer to the reality of the paperless office.
With the toolset shown in Table 1, you can build an enterprise-class, scalable Web-enabled repository that fully incorporates various forms of media. Document files, photographs, video clips and sound files can easily be included in the repository using Java and Oracle8i's LOB (Large Objects) data types.
In this article I'm going to focus on how you build a repository to store and search documents such as Microsoft Word, and HTML and XML files stored in a LOB column of a database table. The example used here populates the repository with Microsoft Word résumés, indexes it using Oracle Intermedia and reads it using Java streams from a servlet (see Figure 1).
Benefits of Java and Oracle8i
Building repositories using Java and Oracle8i has several benefits. The documents inherently take advantage of the transaction management and ACID (atomicity, concurrency, integrity and durability) properties of the relational database, which means that changes to an internal LOB can be committed or rolled back. Moreover, associated applications can seamlessly take advantage of database features such as backup and recovery. This makes things easier for the system administrators, who no longer have to perform separate database and file system backups for relational information and documents. All data housed in the database, whether structured (relational) or unstructured (document files), can be written, searched and accessed using a single industry standard interface - SQL. These SQL statements can be executed from Java using JDBC (Java Database Connectivity).
Working with Large Objects
Oracle8i supports several types of LOB columns. One type, a BLOB (Binary Large Object), can house binary information such as audio, video, images and documents internally within the database. Each row can store up to 4 gigabytes of data. I used the BLOB data type to store the Microsoft Word résumés in my example.
The Oracle database stores a locator inline with the data. The locator is a pointer to the actual location of the data (LOB value). The LOB data can be stored in the same table or a separate one. The advantage of the locator is that the database doesn't have to scan the LOB data each time it reads multiple rows because only the LOB's locator value is read; the actual LOB data is read only when required.
When working with Java and LOBs, first execute the SELECT statement to get the LOB locator, then read or write LOBs using JDBC. (Oracle JDBC driver's type extension classes from oracle.sql package is used to read and write from an Oracle database.) The actual LOB data is materialized as a Java stream from the database, with the locator representing the data in the table. The following code reads the résumé of an employee whose employee number, 7900, is stored in a LOB column called résumé in the sam_emp table.
Statement st = cn.createStatement();
ResultSet rs = st.executeQuery
("Select resume from sam_emp where empno=7900");
rs.next();
oracle.sql.BLOB blob=((OracleResultSet)rs).getBLOB(1);
InputStream is=blob.getBinary-
Stream();
Populating the Repository
The documents can be written to LOB columns using Java, PL/SQL or a bulk utility called Oracle SQL*Loader. To insert a new row, do the following:
- Execute the SQL insert statement with an empty BLOB.
- Query the same row to get the locator object. Use this locator to write your
document to the LOB column. Note: Java streams are employed to
write the documents to the LOB column.
- Create the Java output stream using the getBinaryOutputStream() method of this object to write your document or any binary information to that column.
For example, to insert information in the sam_emp table about a new employee whose employee number is 9001, first insert all the structured information along with an empty BLOB using JDBC. Next, select the LOB column, résumé, from the same row to get the oracle.sql.BLOB object (the locator). Finally, create the Java output stream from this object:
st.execute("INSERT INTO sam_emp(empno, resume)
VALUES(9001,empty_blob())");
ResultSet rs = st.executeQuery(
"Select résumé from sam_emp where empno=9001 for update");
rs.next();
oracle.sql.BLOB blob = ((OracleResultSet)rs).getBLOB(1);
OutputStream os = blob.getBinaryOutputStream();
Optionally, you may use the java.awt.FileDialog class and java.io package to dynamically select and read a file from your PC. Then load it to a LOB column using the preceding code.
The way you search and retrieve documents is independent of how you load the documents. For example, you can store the documents using PL/SQL or SQL*Loader, then search and retrieve using Java servlets. Using PL/SQL, Listing 1 loads an employee's résumé, saved as a Microsoft Word file, to the résumé column of the sam_emp table.
Searching the Repository
The documents stored in the LOB columns can be indexed using Oracle Intermedia, which provides advance search capabilities such as fuzzy, stemming, proxy, phrases and more. It can also generate thematic searches and gist. The documents can be indexed via the "Create Index" database command.
Refer to Listing 2 to see how the index is built on the résumé column of the sam_emp table. Once the index is created, the Java applications can search the repository by simply submitting SELECT statements.
The MyServletCtx servlet in Listing 3 searches the term passed to it as a parameter in the résumé column of the sam_emp table. The servlet returns the rows matching the search criteria in HTML table format. The employee names in the HTML table are hyperlinked to another servlet, MyServlet, which reads the entire résumé from the database in its original format.
Retrieving from the Repository
Document retrieval using Java is similar to writing documents to the repository. The "Working with Large Objects" section earlier in this article describes how to read LOBs from the database. The MyServlet in Listing 4 reads a Microsoft Word résumé from the sam_emp table, sets the content type, then streams it out to the browser using an output stream.
Summary
In this article I've shown how you store, search and retrieve Word documents using LOB data types and Java.
You can also use the Oracle8i database to store, index, parse and transform XML documents. Storing XML documents in the database removes the need to administer and manage multiple repositories for relational and XML data. The Oracle8i's JVM makes it possible to run a Java XML parser in the database. Using the parser, you can parse and transform the XML files inside the database before outputting it to an application server.
|