Project CLAQ

CONCEPT LEARNING

The goal of this research is to explore or develop models, knowledge bases, linguistic corpora and tools, presentation metaphors, learning algorithms and software to produce agents that provide a high level of interaction with people. The research will develop practical software tools in forms that allow their exploitation in applications (e.g. as COM components or as Web services) both within the project and by others.

Filing and retrieving documents is a task where a higher level of interaction through adaptive agents can be exploited. Web search engines have shown that people find it convenient to delegate the knowledge of where documents are stored and to interact with the engine through simple natural language queries. Nevertheless, it is still the user's responsibility to determine which documents contain the exact information needed and to extract it in order to complete the task.

We wish to go beyond the ability to retrieve documents containing pertinent information and to support the user's task directly. The TREC-8 Question Answering Track has identified this need and states that: "Automatic question answering will definitely be a significant advance in the state-of-art information retrieval technology. Systems that can do reliable question answering without domain restrictions have not been developed yet."

Ask Jeeves is an attempt in this direction. Using natural language processing technology, Ask Jeeves determines both the meaning of the words in the question (semantic processing) and the meaning in the grammar of the question (syntactic processing). Ask Jeeves's answer-processing engine provides several question template responses that contain links to the answer locations. The user still has to extract the answer from the documents.

Our approach will be to work at the conceptual level. Linguistic analysis tools and machine learning techniques will be applied to learn concepts from documents and interactions with users. Identifying concepts and relations among them in the documents will enable building knowledge bases suitable for processing and answering questions.

The research will tackle the following issues:

Concept learning. Extracting relevant concepts from training collections, exploiting thesauri and ontologies.
Identification of relations among concepts, creating or extending thesauri and ontologies.
Techniques to determine the relevance and authoritativeness of sources (link analysis, reinforcement learning and belief network techniques can be used).

The techniques will be applied to the following tasks:

Categorization. Supporting automated categorization by tools for generating category profiles and concept matching. In particular we plan to continue earlier work on search and categorization of documents, refining the techniques with the addition of semantic linguistic analysis.
Question answering. Identifying documents in a collection that contain pertinent information and directly answering (simple) questions posed by users.
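
As an illustration of the categorization task, the idea of generating category profiles and matching documents against them can be sketched as below. This is a minimal sketch under stated assumptions, not the project's method: it uses plain term counts and cosine similarity in place of the semantic concept matching the project aims for, and the function names and the tiny training collection are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profile(documents):
    """Build a category profile by summing term counts over its documents."""
    profile = Counter()
    for doc in documents:
        profile.update(tokenize(doc))
    return profile


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(text, profiles):
    """Return the name of the category whose profile best matches the text."""
    doc = Counter(tokenize(text))
    return max(profiles, key=lambda name: cosine(doc, profiles[name]))


# Hypothetical training collection, for illustration only.
training = {
    "sports": ["the team won the football match", "the player scored a goal"],
    "finance": ["the bank raised interest rates", "stock markets fell sharply"],
}
profiles = {name: build_profile(docs) for name, docs in training.items()}

label = categorize("interest rates and stock prices", profiles)
```

A concept-level version would replace the raw terms with concepts drawn from a thesaurus or ontology, so that documents using different words for the same concept would still match the same profile.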

QUESTION ANSWERING

This work is sponsored by Microsoft Research through a PhD grant.