Abstract:
Public opinion analysis plays a vital role in domains such as marketing and politics. With the growing volume of text data available through the internet and social media, efficient text-based analysis methods have become crucial. This study explores the application of static and contextualized word embeddings to word-based opinion analysis. The research questions focus on the impact of pre-training on static word embeddings, the efficacy of static and contextualized word embeddings in delineating opposing opinions, and the behavioral differences between the two embedding types. The findings suggest that pre-training improves embedding quality on small datasets but may introduce noise on large ones. Word-based opinion analysis is better suited to large datasets, where static word embeddings trained from scratch perform best. Static word embeddings are preferred over contextualized word embeddings because they capture syntactic relationships, whereas contextualized word embeddings surface semantically related words. To apply both embedding types effectively, the study recommends using a contextualized sequence embedding model to predict the corpus, training word2vec on the predicted corpus, and analyzing the corpus through the most similar words returned by the word2vec model.