Abstract:
The task of finding semantic similarity of any two arbitrary sentences consists of two main steps, which are encoding sentences to produce feature vectors of equal length and measuring the similarity, respectively. The quality of an encoding technique can determine the degree of success a model can achieve in measuring the similarity. This is because a good representation is subjected to how finely established the spectrum of similarities is. The clearer the definition of similarity is, the better the representations can be constructed. This, in turn, helps distinguish between types of sentences. Generally, all existing methods for measuring similarity were designed for vectorized data in a feature space of fixed dimensions. Thus, transforming a set of various-length sentences into a set of feature vectors in the same dimension is very essential. The dataset used in this thesis provides both relatedness score and textual entailment. Textual entailment distinguishes sentence pair relations among three classes: namely, neutral, entailment and contradiction. The task indicates the types of entailments, which is interpreted as relatedness in this thesis. Additionally, powerful pretrained encoding models are usually of millions of parameters, or even billions. This is one obstacle in training one’s own embedding model due to the need of resources with heavy computing capabilities.
In this thesis, we propose a self-encoding scheme to classify among the three classes of textual entailment. The relevancy of all words in a sentence is simultaneously captured by this self-encoding structure. Unlike the other encoding methods based on sequential learning, no interference of memory loss due to the length of sentence occurs in this approach. The framework involves filtering contradiction pairs at an early stage and employing a set of y-x-y encoders, where y is the length after two sentences are concatenated and x is the optimal encoding size for samples of length y, and classifiers to output neutral and entailment probabilities. With over 90% accuracy for all classes, our method has proven that this task is possible to be carried out effectively without the need of large-scale datasets and heavy computational resources.