The DART Database

Overview

The DART (Discovery and Aggregation of Relations in Text) database contains approximately 23 million distinct "world knowledge" propositions (110 million with duplicates), extracted from text by abstracting parse trees. The extraction system (DART) is based on ideas by Len Schubert. The DART system and database are described in detail in the paper below. The database was generated from the Reuters and BNC corpora. It was created by Peter Clark and Phil Harrison (Boeing Research and Technology) and is publicly available under the GNU Lesser General Public License. Please let us know how you get on! For more details, see:

Forms of proposition

There are 12 kinds of proposition, contained in 12 different text files. Each line in a file contains three items of data: the frequency (number of occurrences) of the tuple in the corpora; the tuple itself; and a verbalization of the proposition that the tuple represents. Verbalizations are automatically generated from the tuple, and represent an informal interpretation of the tuple's meaning. Examples of the 12 types are shown below:

Frequency    Tuple    Proposition Verbalization
144 (an "small" "hotel") "Hotels can be small."
121 (anpn "subject" "agreement" "to" "approval") "Agreements can be subject to approvals."
17 (nn "drug" "distributor") "There can be drug distributors."
153 (nv "bus" "carry") "Buses can carry [something/someone]."
26 (npn "sentence" "for" "offence") "Sentences can be for offences."
119 (nvn "critic" "claim" "thing") "Critics can claim things."
192 (nvpn "person" "go" "into" "room") "People can go into rooms."
11 (nvnpn "democrat" "win" "seat" "in" "election") "Democrats can win seats in elections."
1572 (qn "year" "contract") "Contracts can be measured in years."
8 (vn "find" "spider") "Spiders can be found."
14 (vpn "refer" "to" "business") "Referring can be to businesses."
103 (vnpn "educate" "person" "at" "college") "People can be educated at colleges."
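
The three-item line layout shown above can be read with a short parser. The following is a minimal Python sketch, not part of the DART distribution. The names DartEntry and parse_line are hypothetical. It assumes every line looks like the examples above (an integer frequency, a parenthesized tuple of double-quoted words, then a double-quoted verbalization) and that a missing word appears as an unquoted NIL, as described in the Additional Notes. The real files may contain variations that this pattern does not cover.

```python
import re
from typing import NamedTuple, Optional, Tuple


class DartEntry(NamedTuple):
    """One parsed line (hypothetical structure, for illustration only)."""
    frequency: int
    kind: str                          # tuple type, e.g. "an", "nvn", "vnpn"
    words: Tuple[Optional[str], ...]   # None stands for a NIL slot
    verbalization: str


# Assumed layout: <frequency> (<kind> "w1" "w2" ...) "<verbalization>"
LINE_RE = re.compile(
    r'^\s*(\d+)\s+\((\w+)((?:\s+(?:"[^"]*"|NIL))*)\s*\)\s+"(.*)"\s*$'
)
WORD_RE = re.compile(r'"([^"]*)"|(NIL)')


def parse_line(line: str) -> Optional[DartEntry]:
    """Parse one line; return None if it does not match the assumed layout."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    frequency, kind, body, verbalization = match.groups()
    # Each findall item is (quoted_word, nil_marker); exactly one is non-empty.
    words = tuple(None if nil else word for word, nil in WORD_RE.findall(body))
    # The tuple type is lower-cased so that "VN" and "vn" compare equal.
    return DartEntry(int(frequency), kind.lower(), words, verbalization)


entry = parse_line('144 (an "small" "hotel") "Hotels can be small."')
if entry is not None:
    print(entry.frequency, entry.kind, entry.words, entry.verbalization)
```

Returning None for lines that do not match, rather than raising an error, lets a caller count and inspect unexpected lines instead of stopping partway through a large file.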

Examples

A small subset of the database, containing the tuples that mention the word "computer", is available below. (These files were constructed by running a simple "grep" for "computer" on the full DART database files.)
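
A similar subset for another word can be built by filtering the database lines. The sketch below is an illustration in Python, not the procedure used to build the example files. The function names grep_lines and grep_tuple_word are hypothetical. The first function mimics a plain substring grep, so it also matches longer words such as "computers" and matches inside verbalizations. The second assumes that tuple words are enclosed in double quotes, as in the examples above, and keeps only lines where the exact quoted word appears.

```python
from typing import Iterable, Iterator


def grep_lines(lines: Iterable[str], needle: str = "computer") -> Iterator[str]:
    """Yield lines containing `needle` anywhere, like a plain substring grep."""
    for line in lines:
        if needle in line:
            yield line


def grep_tuple_word(lines: Iterable[str], word: str = "computer") -> Iterator[str]:
    """Yield lines containing the exact double-quoted word, e.g. "computer".

    Assumes tuple words are double-quoted, as in the example lines.
    """
    quoted = '"' + word + '"'
    for line in lines:
        if quoted in line:
            yield line
```

Both functions accept any iterable of strings, so an open file object can be passed directly and the ~2GB of data never needs to be held in memory at once.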

Download

The database can be downloaded below. The download is ~280MB (zipped), expanding to ~2GB (unzipped), and contains 12 plain ASCII .txt files, one for each of the categories above.

Additional Notes

Generally, words in the tuple structure are root forms of the words occurring in the parse trees. Named entities have been replaced with one of "person", "place", or "organization" using a simple named entity recognizer (NER), or with "person/place/organization" when the NER was unable to make a classification. Pronouns are replaced with "person" or "thing" depending on the pronoun's gender. Embedded propositions are replaced with "thing". Missing words (e.g., a null preposition) are denoted by NIL in the tuple structure. The verb "bear" has been renamed "born" in the data (i.e., "a baby was born" appears as the tuple (VN "born" "baby") rather than (VN "bear" "baby")). The verbalizations are machine generated solely from the tuples, and represent a somewhat informal interpretation of the tuple's meaning. Tuples in the files are listed in order from most frequent to least frequent.
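
Because the tuples are listed from most frequent to least frequent, a reader interested only in common propositions can stop at the first line whose frequency falls below a chosen threshold, without scanning the rest of the file. The following Python sketch illustrates this. The function name frequent_lines is hypothetical, and it assumes that the frequency is the first whitespace-separated field on each line, as in the examples above.

```python
from typing import Iterable, Iterator


def frequent_lines(lines: Iterable[str], min_frequency: int) -> Iterator[str]:
    """Yield lines with frequency >= min_frequency, stopping at the first miss.

    Relies on the lines being sorted from most to least frequent.
    """
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        if int(fields[0]) < min_frequency:
            break  # every later line is at most this frequent
        yield line
```

If the input were not sorted, the early break would silently drop qualifying lines, so this shortcut is only appropriate for files that keep the ordering described above.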
Contact: Peter Clark (peter.e.clark@boeing.com), Boeing Research and Technology