Defining General Concepts

The component library is intended to include knowledge base nodes for general, domain-independent concepts. The user of the library will build nodes for more complex, domain-specific concepts by composing existing components. Fundamental questions, then, in building the library are:
  1. what are the general, domain-independent concepts?
  2. how do we make sure that the library has broad coverage?
  3. how do we know our primitive concepts are the "right" ones?
Choosing general concepts through introspection is one way to answer question #1, but perhaps not the best way to ensure satisfactory answers to #2 and #3. Fortunately, there are existing lexical-semantic resources that can be used for inspiration. Both dictionaries and thesauri often group words into general categories. For example, the Longman Dictionary of Contemporary English (LDOCE; Summers 1987) uses a "defining vocabulary" of about 2,000 words. All definitions in the dictionary are supposed to ground out eventually in the defining vocabulary. WordNet (Miller 1990) groups semantically similar words into "synsets", which are themselves linked hierarchically. Roget's Thesaurus (see, for example, Lloyd 1982) divides the universe into six classes. Each class is subdivided into divisions and sections, which are themselves further subdivided. The one thousand leaves in Roget's tree each contain semantically related words (not quite synonyms), one of which is chosen as the representative for the group: the headword.

The LDOCE defining vocabulary, a horizontal slice of the WordNet hierarchy, the Roget headwords: each of these could be used as a list of general concepts, or as inspiration for an original list. (See Rick Harrison's "Vital English Vocabulary" (Harrison 1997) for a similar experiment.) None of these sources is perfect: LDOCE is not semantically motivated and apparently contains circular references; WordNet has much less coverage than the others, especially for words other than nouns; Roget's classification is somewhat arbitrary and clearly reflects the culture of its author.
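The requirement that definitions "ground out" in the defining vocabulary, and the problem of circular references, can be made concrete with a small check. What follows is only a minimal sketch under stated assumptions: the function name grounds_out, the representation of a definition as a list of words, and the toy entries are all hypothetical and are not taken from LDOCE.

```python
def grounds_out(word, definitions, defining_vocab, _path=None):
    """Return True if `word` reduces entirely to words in `defining_vocab`.

    `definitions` maps a word to the list of words used in its definition
    (a hypothetical, highly simplified view of a dictionary entry).
    A word on the current expansion path signals a circular reference.
    """
    if word in defining_vocab:
        return True
    path = _path if _path is not None else set()
    if word in path or word not in definitions:
        return False  # circular reference, or no definition available
    path.add(word)
    ok = all(grounds_out(w, definitions, defining_vocab, path)
             for w in definitions[word])
    path.discard(word)
    return ok


# Toy data, invented for illustration only.
defining_vocab = {"small", "animal", "house", "young"}
definitions = {
    "cat": ["small", "house", "animal"],
    "kitten": ["young", "cat"],
    "foo": ["bar"],   # "foo" and "bar" define each other: a circular pair
    "bar": ["foo"],
}

print(grounds_out("kitten", definitions, defining_vocab))
print(grounds_out("foo", definitions, defining_vocab))
```

The sketch does no caching, so it is suited only to small experiments; a real check over a full dictionary would also need tokenisation and some treatment of word senses.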

Using an established lexical resource not only identifies general concepts, but also goes a long way toward ensuring broad coverage (question #2). It can be argued (though I will not argue it here) that coverage of English words implies coverage of general knowledge.

Question #3 is best avoided. I will take the position that there are no "right" primitive concepts. Instead, our concepts will be validated by the ease with which the user of the library can compose other, more complex concepts from them.

I have chosen to experiment with Roget's Thesaurus as a source for a list of general actions. The actions were chosen by going through Roget's headword by headword, looking specifically at verb paragraphs (see my description of the format of Roget's, including the upper-level, non-leaf nodes). For each headword, I usually chose one representative verb. These verbs are grouped in categories that correspond roughly to mid-level Roget categories. The list still needs some cleaning, and its ability to capture domain-specific actions is still uncertain.
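The selection procedure described above can be sketched as follows. This is an illustration only: the choice of verbs was made by hand, so pick_representative is merely a stand-in for that judgment, and the Headword class, its field names, and the sample entries are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Headword:
    """One leaf of Roget's tree (field names are hypothetical)."""
    name: str
    category: str  # mid-level category the leaf falls under
    verbs: List[str] = field(default_factory=list)  # its verb paragraph


def pick_representative(verbs):
    """Stand-in for the manual choice: take the first verb, if any."""
    return verbs[0] if verbs else None


def general_actions(headwords, choose=pick_representative):
    """Collect one representative verb per headword, grouped by category."""
    actions = defaultdict(list)
    for hw in headwords:
        verb = choose(hw.verbs)
        if verb is not None:  # skip headwords with no verb paragraph
            actions[hw.category].append(verb)
    return dict(actions)


# Toy entries, invented for illustration; not copied from the thesaurus.
sample = [
    Headword("Motion", "Motion in general", ["move", "go", "stir"]),
    Headword("Quiescence", "Motion in general", ["rest", "stand still"]),
    Headword("Blueness", "Colour", []),
]

print(general_actions(sample))
```

Passing a different choose function is one way to experiment with other selection rules without changing the grouping step.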

I have also put Longman's defining vocabulary online.

References

HARRISON, RICK (1997). "Vital English Vocabulary". http://www.rick.harrison.net/annex/vitaleng.html

LLOYD, SUSAN M. (1982). Roget's Thesaurus. Essex: Longman.

MILLER, GEORGE A., ed. (1990). "WordNet: An On-Line Lexical Database". International Journal of Lexicography 3(4).

SUMMERS, DELLA, ed. (1987). Longman Dictionary of Contemporary English: New Edition. Essex: Longman.


The List of General Actions

Here is a first attempt at deriving a list of general actions from the public domain Roget's Thesaurus (based on the 1911 Thomas Crowell edition). The list still needs some cleaning. For example, there are some duplicates (referring to different senses of the same verb) that should be renamed. Furthermore, I tried to err on the side of recall: at this stage it is better to include too many specific verbs than to miss some important general ones.
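One possible way to locate the duplicates mentioned above, and to give each occurrence a distinct name, is sketched below. The numbered-suffix scheme (verb-1, verb-2, ...) is an assumption made for illustration, as are the function names and the toy categories; the list itself does not prescribe any renaming convention.

```python
from collections import Counter


def find_duplicates(actions):
    """Return the verbs that occur more than once across all categories."""
    counts = Counter(v for verbs in actions.values() for v in verbs)
    return sorted(v for v, n in counts.items() if n > 1)


def rename_duplicates(actions):
    """Suffix each occurrence of a duplicated verb with a sense number."""
    dupes = set(find_duplicates(actions))
    seen = Counter()
    renamed = {}
    for category, verbs in actions.items():
        out = []
        for verb in verbs:
            if verb in dupes:
                seen[verb] += 1
                out.append(f"{verb}-{seen[verb]}")  # hypothetical scheme
            else:
                out.append(verb)
        renamed[category] = out
    return renamed


# Toy categories, invented for illustration only.
actions = {
    "category A": ["run", "walk"],
    "category B": ["run", "manage"],
}

print(find_duplicates(actions))
print(rename_duplicates(actions))
```

Numbered suffixes only separate the senses; choosing meaningful names for them would still be a manual step.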