UTexas: Natural Language Semantics using Distributional Semantics and Probabilistic Logic

We represent natural language semantics by combining logical and distributional information in probabilistic logic. We use Markov Logic Networks (MLN) for the RTE task, and Probabilistic Soft Logic (PSL) for the STS task. The system is evaluated on the SICK dataset. Our best system achieves 73% accuracy on the RTE task, and a Pearson’s correlation of 0.71 on the STS task.


Introduction
Textual Entailment systems based on logical inference excel in correct reasoning, but are often brittle due to their inability to handle soft logical inferences. Systems based on distributional semantics excel in lexical and soft reasoning, but are unable to handle phenomena like negation and quantifiers. We present a system which takes the best of both approaches by combining distributional semantics with probabilistic logical inference.
Our system builds on our prior work (Beltagy et al., 2013; Beltagy et al., 2014a; Beltagy and Mooney, 2014; Beltagy et al., 2014b). We use Boxer (Bos, 2008), a wide-coverage semantic analysis tool, to map natural sentences to logical form. Then, distributional information is encoded in the form of inference rules. We generate lexical and phrasal rules, and experiment with symmetric and asymmetric similarity measures. Finally, we use probabilistic logic frameworks to perform inference: Markov Logic Networks (MLN) for RTE, and Probabilistic Soft Logic (PSL) for STS.

Background

Logical Semantics
Logic-based representations of meaning have a long tradition (Montague, 1970; Kamp and Reyle, 1993). They handle many complex semantic phenomena such as relational propositions, logical operators, and quantifiers; however, they cannot handle "graded" aspects of meaning in language because they are binary by nature.

Distributional Semantics
Distributional models use statistics of word co-occurrences to predict semantic similarity of words and phrases (Turney and Pantel, 2010; Mitchell and Lapata, 2010), based on the observation that semantically similar words occur in similar contexts. Words are represented as vectors in high-dimensional spaces generated from their contexts. It is also possible to compute vector representations for larger phrases compositionally from their parts (Mitchell and Lapata, 2008; Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). Distributional similarity is usually a mixture of semantic relations, but particular asymmetric similarity measures can, to a certain extent, predict hypernymy and lexical entailment distributionally (Kotlerman et al., 2010; Lenci and Benotto, 2012; Roller et al., 2014). Distributional models capture the graded nature of meaning, but do not adequately capture logical structure (Grefenstette, 2013).
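To make the idea concrete, here is a minimal sketch of distributional similarity: cosine over sparse context-count vectors. The toy vectors and context words are invented for illustration; real vectors are built from corpus co-occurrence counts as described later in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts
    mapping context words (dimensions) to co-occurrence weights."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy context-count vectors (hypothetical values, for illustration only)
dog = {"bark": 4.0, "pet": 3.0, "tail": 2.0}
cat = {"pet": 3.0, "tail": 3.0, "purr": 4.0}
car = {"drive": 5.0, "road": 4.0, "wheel": 3.0}

# Semantically similar words share contexts, so their cosine is higher
assert cosine(dog, cat) > cosine(dog, car)
```

Note that cosine is symmetric, which is exactly why the asymmetric measures cited above are needed for directional relations such as hypernymy.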

Markov Logic Network
Markov Logic Networks (MLN) (Richardson and Domingos, 2006) are a framework for probabilistic logic that employs weighted formulas in first-order logic to compactly encode complex undirected probabilistic graphical models (i.e., Markov networks). Weighting the rules is a way of softening them compared to hard logical constraints.
MLNs define a probability distribution over possible worlds, where the probability of a world increases exponentially with the total weight of the logical clauses that it satisfies. A variety of inference methods for MLNs have been developed; however, computational overhead is still an issue.

Probabilistic Soft Logic
Probabilistic Soft Logic (PSL) is another recently proposed framework for probabilistic logic (Kimmig et al., 2012). It uses logical representations to compactly define large graphical models with continuous variables, and includes methods for performing efficient probabilistic inference for the resulting models. A key distinguishing feature of PSL is that ground atoms (i.e., atoms without variables) have soft, continuous truth values on the interval [0, 1] rather than binary truth values as used in MLNs and most other probabilistic logics. Given a set of weighted inference rules, and with the help of Łukasiewicz's relaxation of the logical operators, PSL builds a graphical model defining a probability distribution over the continuous space of values of the random variables in the model (Kimmig et al., 2012). Then, PSL's MPE inference (Most Probable Explanation) finds the overall interpretation with the maximum probability given a set of evidence. This optimization problem is a second-order cone program (SOCP) (Kimmig et al., 2012) and can be solved in polynomial time.
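The Łukasiewicz relaxation mentioned above replaces Boolean connectives with piecewise-linear functions over [0, 1], which is what makes MPE inference a convex optimization problem. A minimal sketch of the binary operators:

```python
def l_and(a, b):
    # Łukasiewicz t-norm: soft conjunction of truth values in [0, 1]
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    # Łukasiewicz t-conorm: soft disjunction
    return min(1.0, a + b)

def l_not(a):
    # Negation
    return 1.0 - a

def l_implies(a, b):
    # Residual implication: fully true (1.0) whenever a <= b
    return min(1.0, 1.0 - a + b)

# An implication is fully satisfied when the consequent is at least
# as true as the antecedent
assert l_implies(0.6, 0.9) == 1.0
```

Each connective is linear on the region where it is not clipped at 0 or 1, so a rule's "distance to satisfaction" is a convex function of the atoms' truth values.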

Recognizing Textual Entailment
Recognizing Textual Entailment (RTE) is the task of determining whether one natural language text, the premise, Entails, Contradicts, or is not related (Neutral) to another, the hypothesis.

Semantic Textual Similarity
Semantic Textual Similarity (STS) is the task of judging the similarity of a pair of sentences on a scale from 1 to 5 (Agirre et al., 2012). Gold standard scores are averaged over multiple human annotations and systems are evaluated using the Pearson correlation between a system's output and gold standard scores.

Logical Representation
The first component in the system is Boxer (Bos, 2008), which maps the input sentences into logical form, in which the predicates are words in the sentence. For example, the sentence "A man is driving a car" in logical form is: ∃x, y, z. man(x) ∧ agent(y, x) ∧ drive(y) ∧ patient(y, z) ∧ car(z)

Distributional Representation
Next, distributional information is encoded in the form of weighted inference rules connecting words and phrases of the input sentences T and H. For example, for the sentences T: "A man is driving a car" and H: "A guy is driving a vehicle", we would like to generate rules like ∀x. man(x) → guy(x) | w1 and ∀x. car(x) → vehicle(x) | w2, where w1 and w2 are weights indicating the similarity of the antecedent and consequent of each rule.
Inference rules are generated as in Beltagy et al. (2013): given two input sentences T and H, for all pairs (a, b), where a and b are words or phrases of T and H respectively, generate an inference rule a → b whose weight is set from the distributional similarity of a and b. We experimented with the symmetric similarity measure cosine, and with asym, the supervised, asymmetric similarity measure of Roller et al. (2014).
The asym measure uses the vector difference (a⃗ − b⃗) as features in a logistic regression classifier for distinguishing between four word relations: hypernymy, cohyponymy, meronymy, and no relation. The model is trained on the noun-noun subset of the BLESS data set (Baroni and Lenci, 2011). The final similarity weight is given by the model's estimated probability that the word relation is either hypernymy or meronymy: asym(a, b) = P(hypernymy | a, b) + P(meronymy | a, b).

Distributional representations for words are derived by counting co-occurrences in the ukWaC, WaCkypedia, BNC and Gigaword corpora. We use the 2000 most frequent content words as basis dimensions, and count co-occurrences within a two-word context window. The vector space is weighted using Positive Pointwise Mutual Information.
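A sketch of the asym measure's structure, using scikit-learn's logistic regression in place of the paper's actual training setup; the feature vectors and labels here are random stand-ins for the BLESS training pairs, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the BLESS training data: the feature vector for a
# word pair (a, b) is the difference of their distributional vectors,
# and the label is the word relation:
# 0 = hypernymy, 1 = cohyponymy, 2 = meronymy, 3 = no relation.
rng = np.random.RandomState(0)
X_train = rng.randn(80, 10)          # hypothetical difference vectors
y_train = np.tile(np.arange(4), 20)  # hypothetical relation labels

clf = LogisticRegression(max_iter=500).fit(X_train, y_train)

def asym(vec_a, vec_b):
    """P(hypernymy) + P(meronymy) for the pair (a, b), used as the
    weight of the directional inference rule a -> b."""
    probs = clf.predict_proba((vec_a - vec_b).reshape(1, -1))[0]
    return probs[0] + probs[2]

weight = asym(rng.randn(10), rng.randn(10))
```

Because the feature is the *difference* a⃗ − b⃗, asym(a, b) and asym(b, a) generally differ, which is what lets the measure model directional entailment.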
Phrases are defined in terms of Boxer's output to be more than one unary atom sharing the same variable, like "a little kid" (little(k) ∧ kid(k)), or two unary atoms connected by a relation, like "a man is driving" (man(m) ∧ agent(d, m) ∧ drive(d)). We compute vector representations of phrases using vector addition across the component predicates. We also tried computing phrase vectors using component-wise vector multiplication (Mitchell and Lapata, 2010), but found it performed marginally worse than addition.

Probabilistic Logical Inference
The last component is probabilistic logical inference. Given the logical form of the input sentences, and the weighted inference rules, we use them to build a probabilistic logic program whose solution is the answer to the target task. A probabilistic logic program consists of the evidence set E, the set of weighted first-order logical expressions (rule base RB), and a query Q. Inference is the process of calculating Pr(Q | E, RB).

Task 1: RTE using MLNs
MLNs are the probabilistic logic framework we use for the RTE task (we do not use PSL here, as it shares fuzzy logic's problems with probabilistic reasoning). The RTE classification problem for the relation between T and H can be split into two inference tasks. The first tests whether T entails H, Pr(H | T, RB). The second tests whether the negation of the text, ¬T, entails H, Pr(H | ¬T, RB). If Pr(H | T, RB) is high while Pr(H | ¬T, RB) is low, this indicates Entails. If it is the other way around, this indicates Contradicts. If both values are close, then T does not affect the probability of H, which is indicative of Neutral. We train an SVM classifier with LibSVM's default parameters to map the two probabilities to the final decision.
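The final classification step can be sketched as below, using scikit-learn's SVC (a LibSVM wrapper) in place of the paper's actual classifier; the probability pairs and labels are invented training examples illustrating the three regimes described above.

```python
from sklearn.svm import SVC

# Hypothetical training data: each example is the pair
# (Pr(H | T, RB), Pr(H | ¬T, RB)); labels follow the intuition above.
X = [
    [0.9, 0.1], [0.8, 0.2],   # high / low   -> Entails
    [0.1, 0.9], [0.2, 0.8],   # low / high   -> Contradicts
    [0.5, 0.5], [0.4, 0.5],   # close values -> Neutral
]
y = ["entails", "entails",
     "contradicts", "contradicts",
     "neutral", "neutral"]

# LibSVM-style SVM with default parameters maps the two inference
# probabilities to the final three-way RTE decision
clf = SVC().fit(X, y)
label = clf.predict([[0.9, 0.1]])[0]
```

In practice the classifier learns the decision thresholds from the training portion of the dataset rather than relying on hand-set cutoffs.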
The MLN implementation we use is Alchemy (Kok et al., 2005). Queries in Alchemy can only be ground atoms; however, in our case the query is a complex formula (H). We extended Alchemy to calculate probabilities of complex-formula queries (Beltagy and Mooney, 2014). The probability of a formula Q given an MLN K equals the ratio between the partition function Z of the ground network of K with Q added as a hard rule and without it, Pr(Q | K) = Z(K ∪ {(Q, ∞)}) / Z(K) (Gogate and Domingos, 2011). We estimate Z of the ground networks using SampleSearch (Gogate and Dechter, 2011), an advanced importance sampling algorithm that is suitable for the ground networks generated by MLNs.
A general problem with MLN inference is its computational overhead, especially for the complex logical formulae generated by our approach. To make inference faster, we reduce the size of the ground network using an automatic type-checking technique proposed in Beltagy and Mooney (2014). For example, consider the evidence ground atom man(M), denoting that the constant M is of type man, and another predicate car(x). If there is no inference rule connecting man(x) and car(x), then M, which we know is a man, cannot be a car, so we remove the ground atom car(M) from the ground network. This technique reduces the size of the ground network dramatically and makes inference tractable.
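A simplified sketch of this pruning idea (the function name, data layout, and rule encoding are invented for illustration; the real implementation operates on Alchemy's ground network):

```python
def prune_ground_atoms(constant_types, predicates, rules):
    """Keep the ground atom pred(C) only if pred matches C's evidence
    type, or some inference rule connects pred to that type.

    constant_types: dict constant -> its evidence type, e.g. {"M": "man"}
    predicates:     list of unary predicate names
    rules:          list of (antecedent, consequent) predicate pairs
    """
    connected = {(a, b) for a, b in rules} | {(b, a) for a, b in rules}
    atoms = []
    for pred in predicates:
        for const, ctype in constant_types.items():
            if pred == ctype or (ctype, pred) in connected:
                atoms.append((pred, const))
    return atoms

# M is of type "man"; no rule links "man" and "car", so car(M) is pruned
atoms = prune_ground_atoms({"M": "man"}, ["man", "car"], [("man", "guy")])
assert ("car", "M") not in atoms
```

Pruning an atom removes every ground clause that mentions it, which is why the ground network shrinks so dramatically.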
Another problem with MLN inference is that quantifiers sometimes behave in an undesirable way, due to the Domain Closure Assumption (Richardson and Domingos, 2006) that MLNs make. For example, consider the text-hypothesis pair "There is a black bird" and "All birds are black", which in logic are T: bird(B) ∧ black(B) and H: ∀x. bird(x) ⇒ black(x). Because of the Domain Closure Assumption, MLNs conclude that T entails H, because H is true for all constants in the domain (in this example, the single constant B). We solve this problem by introducing extra constants and evidence in the domain. In the example above, we introduce evidence of a new bird, bird(D), which prevents the hypothesis from being true. The full details of the technique for dealing with domain closure are beyond the scope of this paper.

Task 2: STS using PSL
PSL is the probabilistic logic we use for the STS task, since it has been shown to be an effective approach for computing similarity between structured objects. We showed in Beltagy et al. (2014a) how to perform the STS task using PSL. PSL does not work "out of the box" for STS, because Łukasiewicz's equation for conjunction is very restrictive. We address this by replacing Łukasiewicz's conjunction with an averaging equation, then changing the optimization problem and grounding technique accordingly.
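The restriction is easy to see numerically: the n-ary Łukasiewicz conjunction max(0, Σvᵢ − (n − 1)) collapses to 0 once a sentence has several moderately-true conjuncts, while an average preserves a graded score. A minimal sketch:

```python
def lukasiewicz_and(values):
    # n-ary Łukasiewicz conjunction: max(0, sum - (n - 1))
    return max(0.0, sum(values) - (len(values) - 1))

def avg_and(values):
    # Averaging replacement used for STS: keeps a graded signal
    return sum(values) / len(values)

# Four conjuncts, each fairly true: the Łukasiewicz conjunction is
# already 0, while the average still reflects the partial match
truths = [0.75, 0.75, 0.75, 0.75]
assert lukasiewicz_and(truths) == 0.0
assert avg_and(truths) == 0.75
```

Since a sentence's logical form is a large conjunction of atoms, this collapse would make almost every STS query score 0 under the standard operator, which motivates the averaging variant.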
For each STS pair of sentences S1, S2, we run PSL twice, once with E = S1, Q = S2 and once with E = S2, Q = S1, and output the two scores. The final similarity score is produced by an Additive Regression model with WEKA's default parameters, trained to map the two PSL scores to the overall similarity score (Friedman, 1999; Hall et al., 2009).

Task 3: RTE and STS using Vector Spaces and Keyword Counts
As a baseline, we also attempt both the RTE and STS tasks using only vector representations and unigram counts. This baseline model uses a supervised regressor with features based on vector similarity and keyword counts. The same input features are used for performing RTE and STS, but an SVM classifier and an Additive Regression model are trained separately for each task. This baseline is meant to establish whether the task truly requires the sophisticated logical inference of MLNs and PSL, or if merely checking for logical keywords and textual similarity is sufficient. The first two features are simply the cosine and asym similarities between the text and hypothesis, using vector addition of the unigrams to compute a single vector for the entire sentence.
We also compute vectors for both the text and hypothesis using vector addition of the mutually exclusive unigrams (MEUs). The MEUs are defined as the unigrams of the premise and hypothesis with common unigrams removed. For example, if the premise is "A dog chased a cat" and the hypothesis is "A dog watched a mouse", the MEUs are "chased cat" and "watched mouse." We compute vector addition of the MEUs, and compute similarity using both the cosine and asym measures. These form two features for the regressor.
The last features of the model are keyword counts. We count how many times 13 different keywords appear in either the text or the hypothesis. These keywords include negation words (no, not, nobody, etc.) and quantifiers (a, the, some, etc.). The counts of the keywords form the last 13 features input to the regressor. In total, there are 17 features in this baseline system.
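The baseline's feature extraction can be sketched as below. The similarity functions and keyword list are placeholders (the real system uses the cosine and asym measures over added unigram vectors, and the full 13-keyword list); the MEU computation follows the definition above.

```python
def baseline_features(text, hyp, cos_sim, asym_sim, keywords):
    """Sketch of the baseline's features: 2 full-sentence similarities,
    2 MEU similarities, and one count per keyword."""
    t, h = text.lower().split(), hyp.lower().split()
    # Mutually exclusive unigrams: drop words shared by both sentences
    meu_t = [w for w in t if w not in h]
    meu_h = [w for w in h if w not in t]
    feats = [
        cos_sim(t, h), asym_sim(t, h),
        cos_sim(meu_t, meu_h), asym_sim(meu_t, meu_h),
    ]
    feats += [(t + h).count(k) for k in keywords]
    return feats

# Toy symmetric similarity (Jaccard) standing in for cosine and asym
toy_sim = lambda a, b: len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

feats = baseline_features("A dog chased a cat", "A dog watched a mouse",
                          toy_sim, toy_sim, ["no", "not", "a"])
assert len(feats) == 7   # 4 similarity features + 3 keyword counts here
```

With the full 13-keyword list this yields the 17 features described above, which feed the SVM classifier (RTE) and Additive Regression model (STS).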

Evaluation
The dataset used for evaluation is SICK (Sentences Involving Compositional Knowledge), a task for SemEval 2014 (Marelli et al., 2014a; Marelli et al., 2014b). The dataset consists of 10,000 pairs of sentences: 5,000 for training and 5,000 for testing. Sentences are annotated for both tasks.

Systems Compared
We compare multiple configurations of our probabilistic logic system.
• Baseline: the vector- and keyword-only baseline described in Section 3.6
• MLN/PSL + Cosine: the MLN- and PSL-based methods described in Sections 3.4 and 3.5, using cosine as the similarity measure
• MLN/PSL + Asym: the MLN- and PSL-based methods described in Sections 3.4 and 3.5, using asym as the similarity measure
• Ensemble: an ensemble method which uses all of the features in the above methods as inputs for the RTE and STS classifiers

Results and Discussion

Table 1 shows our results on the held-out test set for SemEval 2014 Task 1. On the RTE task, both the MLN + Cosine and MLN + Asym models outperform the Baseline, indicating that textual entailment requires real inference to handle negation and quantifiers. The MLN + Asym and Ensemble systems perform identically on RTE, further suggesting that the logical inference subsumes keyword detection.

The MLN + Asym system outperforms the MLN + Cosine system, emphasizing the importance of asymmetric measures for predicting lexical entailment. Intuitively, this makes perfect sense: dog entails animal, but not vice versa.
In an error analysis performed on a development set, we found our RTE system was extremely conservative: we rarely confused the Entails and Contradicts classes, indicating we correctly predict the direction of entailment, but frequently misclassify examples as Neutral. An examination of these examples showed the errors were mostly due to missing or weakly-weighted distributional rules.
On STS, our vector space baseline outperforms both PSL-based systems, but the ensemble outperforms any of its components. This is a testament to the power of distributional models in their ability to predict word and sentence similarity. Surprisingly, we see that the PSL + Asym system slightly outperforms the PSL + Cosine system. This may indicate that even in STS, some notion of asymmetry plays a role, or that annotators may have been biased by simultaneously annotating both tasks. As with RTE, the major bottleneck of our system appears to be the knowledge base, which is built solely using distributional inference rules.
Results also show that our system's performance is close to that of the baseline system. One reason could be that the sentences do not exploit the full power of logical representations. On RTE, for example, most of the contradicting pairs are two similar sentences with one of them negated, so the existence of a negation cue in one of the two sentences is a strong signal for contradiction; this is what the baseline system detects, without deeply representing the semantics of the negation.

Conclusion & Future Work
We showed how to combine logical and distributional semantics using probabilistic logic, and how to perform the RTE and STS tasks using it. The system is tested on the SICK dataset.
The distributional side can be extended in many directions. We would like to use longer phrases, more sophisticated compositionality techniques, and contextualized vectors of word meaning. We also believe inference rules could be dramatically improved by integrating paraphrase collections like PPDB (Ganitkevitch et al., 2013).
Finally, MLN inference could be made more efficient by exploiting the similarities between the two ground networks (the one with Q and the one without). PSL inference could be enhanced by using a learned, weighted average of rules, rather than the simple mean.