DAML Project for Stanford Knowledge Systems, AI Laboratory

Homework 3 Lessons Learned

DAML Ontology Generation

The first ontology generated for this homework assignment is the UNSPSC Product Ontology (http://ksl.stanford.edu/projects/DAML/UNSPSC.daml ; Size: 1.8 MB), whose source content was obtained from www.unspsc.org . This is a slight modification of a previous version. Anyone wishing to reuse this ontology should first obtain an updated version from UNSPSC. The original UNSPSC ontology format was imported into the Ontolingua Knowledge Base Server (www.ontolingua.stanford.edu) and exported as DAML content.

The second ontology generated for this homework assignment is the CIA World Fact Book (http://ontolingua.stanford.edu/doc/chimaera/ontologies/cia-world-fact-book.daml ; Size: 3.4 MB), which was scraped from the web-based version of this content (http://www.odci.gov/cia/publications/factbook/) and loaded into Ontolingua. The scraping was done over 2 years ago, and the content was processed according to our needs in the DARPA HPKB program (www.darpa.mil). We suggest obtaining updated factbook content before using this ontology. Ontolingua was then used to export the DAML content.

The third ontology for this assignment (http://ksl.stanford.edu/projects/DAML/chimaera-jtp-cardinality-test1.daml ; Size: 25 KB) defines the object structure for a diagnostic interface between JTP (a Java-based theorem prover) and Chimaera (an on-line KB diagnostic tool www.ksl.stanford.edu/software/chimaera ). It is used to test the inferential power of the reasoner. The initial knowledge base just tests the cardinality section of the inferential work required for DAML.
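
As an illustration of the kind of cardinality inference such a test might exercise, the sketch below is one possibility. It is not taken from the test knowledge base or from JTP, and the property names, individuals, and function name are all hypothetical. It assumes a property restricted to at most one value: if an individual is asserted to have two differently named values for that property, a reasoner can conclude that the two names denote the same individual.

```python
def infer_equalities(assertions, max_one_properties):
    """Return pairs of values that must co-refer under a max-cardinality-1 restriction.

    assertions: iterable of (subject, property, value) triples.
    max_one_properties: set of property names restricted to at most one value.
    Both inputs are hypothetical stand-ins for content a reasoner would read from a KB.
    """
    first_seen = {}     # (subject, property) -> first value encountered
    equalities = set()
    for subject, prop, value in assertions:
        if prop not in max_one_properties:
            continue    # unrestricted properties may have many values
        first = first_seen.setdefault((subject, prop), value)
        if first != value:
            # Only one value is permitted, so the two names must denote one individual.
            equalities.add(frozenset((first, value)))
    return equalities


# Made-up facts for illustration only.
facts = [
    ("order17", "shipsTo", "StanfordKSL"),
    ("order17", "shipsTo", "KnowledgeSystemsLab"),
    ("order17", "contains", "widget"),
    ("order17", "contains", "gadget"),
]
result = infer_equalities(facts, {"shipsTo"})
```

Here only the two shipsTo values are paired, because contains carries no cardinality restriction in this example.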

DAML Importing/Exporting Issues

Although we did not have any direct problems exporting the above knowledge bases as DAML+OIL, we encountered a few general DAML+OIL language issues while developing our DAML importing and exporting tools. These issues are listed below:

The lack of a formal grammar (or even an XML DTD) for DAML-ONT or DAML+OIL makes it difficult to write a parser that correctly handles all aspects of the language that go beyond RDFS. This was most problematic for the List collection and for the parseType daml:collection.
In addition, one of our main goals with DAML+OIL is to integrate it seamlessly with our other software. This involves writing a translator that converts DAML+OIL ontologies into a format that a pre-existing knowledge representation system can read (in our case, OKBC-compliant systems), which requires a fairly exact mapping between DAML+OIL and OKBC. Although it is certainly not the job of other DAML participants to provide this mapping, we need the syntax and semantics of DAML+OIL to be precisely defined in order to develop it. Our parser writer needed more examples than the examples and walkthrough files provided in order to understand the intended meaning. The problem was most evident in the area of daml:restriction. We will be extending this file to include a detailed example of the restriction problem.
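
To make the daml:collection difficulty concrete, the following is a minimal sketch of what a parser has to produce when it meets that parseType: the enclosed members become a chain of list nodes linked by daml:first and daml:rest and terminated by daml:nil. This is not our parser's code. The function name, the blank-node labels, and the choice of the March 2001 DAML+OIL namespace are assumptions made for the illustration, and type triples for the list nodes are omitted.

```python
# Assumed namespace (March 2001 DAML+OIL); the document does not state a version.
DAML = "http://www.daml.org/2001/03/daml+oil#"


def expand_collection(members, prefix="_:list"):
    """Expand an ordered collection into daml:first / daml:rest triples.

    Returns (head, triples), where head is the node standing for the whole list.
    Blank-node labels such as "_:list0" are invented for this illustration.
    """
    nil = DAML + "nil"
    if not members:
        return nil, []          # the empty collection is just daml:nil
    triples = []
    for i, member in enumerate(members):
        node = f"{prefix}{i}"
        # The last node points at daml:nil; every other node points at its successor.
        rest = f"{prefix}{i + 1}" if i + 1 < len(members) else nil
        triples.append((node, DAML + "first", member))
        triples.append((node, DAML + "rest", rest))
    return f"{prefix}0", triples


# Hypothetical members of an enumerated class.
head, triples = expand_collection(["#Red", "#Green"])
for triple in triples:
    print(triple)
```

A translator targeting a frame system would then walk this chain again to recover the flat member list, which is one reason a precise definition of the expansion matters.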

Related KSL Work for Ontology Generation

We have also had experience with the generation of "instance oriented" content for related projects by scraping HTML pages for web services content. We thought it would be of value to the DAML community to share some of our experience with this process.

A common approach to information extraction from HTML pages is to use syntactic rules and pattern matching to retrieve structured or semi-structured information. Two software packages available for such syntactic content scraping are W4F (http://www.tropea-inc.com/technology/W4F/) and Compaq's Web Language, formerly called WebL (http://research.compaq.com/SRC/WebL). The following is a brief description of the pros and cons of each language based on an evaluation in the fall. Note that both software products are evolving rapidly, so some of our comparisons may already be out of date.

Compaq's Web Language (formerly called WebL):
Some Notable Features:

Retrieve HTML pages via HTTP with error-handling (such as timeout, parallel download)
Search/match patterns in given HTML pages

Pros:

Easy to use
Interface ready for OAA

Cons:

No clean interface with high-level applications, such as Java. We need wrapper code to do that.

W4F:
Some Notable Features:

Retrieve HTML pages via HTTP, but with less error-handling than WebL
Provides a mechanism similar to WebL's
Use extraction rules to extract information
Generate Java code directly
Wizard tools help to write extraction rules

Pros:

More powerful extraction function
Good interface with Java
Wizards help to write rules

Cons:

More complex to use than WebL
Wizard tools are still immature.

We experimented a little with both systems but chose W4F because of its easy-to-use production rules for information extraction and its good integration with Java. Both systems seemed good, however, and we have been very pleased with W4F to date.
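
For readers unfamiliar with syntactic scraping of the kind discussed above, here is a minimal sketch of the idea. It uses neither WebL nor W4F syntax: it relies only on Python's standard html.parser module, and the class name, function name, and sample page are invented for the illustration. The single "rule" it applies is that each two-cell table row holds a label and a value.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False      # guard against an unclosed <td>
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")   # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # accumulate text inside the current cell


def extract_facts(html):
    """Apply the structural rule: a two-cell row is a (label, value) pair."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return {row[0].strip().rstrip(":"): row[1].strip()
            for row in parser.rows if len(row) == 2}


# Made-up page fragment; not real factbook content.
page = """
<table>
  <tr><td>Capital:</td><td>Exampleville</td></tr>
  <tr><td>Area:</td><td>42 sq km</td></tr>
  <tr><td>Footnote without a value</td></tr>
</table>
"""
facts = extract_facts(page)
```

Rows that do not fit the rule, such as the one-cell footnote row, are simply skipped. Dedicated tools add value over a sketch like this mainly through richer rule languages and more robust retrieval.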

Last modified: Sunday, 03-Jul-2005 06:07:02 PDT