Vocabulary Semantic Similarity Calculation in Natural Language Processing

Authors

  • Huixiang Xiao, Chongqing University of Technology
  • Kaige Zheng, Kookmin University
  • Xiangyu Li, Shanghai Jiao Tong University

DOI:

https://doi.org/10.62177/jaet.v2i4.946

Keywords:

Semantic Similarity, Convolutional Neural Network, Gated Recurrent Unit, Natural Language Processing, Word Vector

Abstract

Natural language processing (NLP) is a critical research direction in artificial intelligence, and the calculation of vocabulary semantic similarity is its foundation and core task. However, existing calculation methods struggle to extract important semantic information, which compromises the accuracy of semantic similarity measures in NLP applications. To address this issue, this paper proposes a vocabulary semantic similarity calculation model based on word vectors and convolutional neural networks (CNNs). The word vector model is improved with long short-term memory (LSTM) networks; convolutional layers extract the important semantics, and a bidirectional gated recurrent unit (GRU) preserves semantic order. A Siamese neural network structure ensures that the two texts are encoded consistently. The experimental results show that the proposed model achieves the highest F1 score across the tested datasets. On the Original Chinese Natural Language Inference (OCNLI) dataset, its Pearson correlation coefficient is 0.021 and 0.018 higher than that of the LSTM network and the CNN, respectively, and its similarity calculation accuracy on the two datasets reaches 92.4% and 96.5%. These results indicate that the model's semantic similarity predictions are closer to the true values and that its predictive performance is superior.
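To make the architecture described in the abstract concrete, the sketch below shows one plausible way to assemble its components in PyTorch; it is an illustration under assumptions, not the authors' released code. The vocabulary size, embedding dimension, convolution channels, and GRU hidden size are placeholder hyperparameters, and the paper's LSTM-based refinement of the word vectors is stood in for by a plain embedding table. A shared encoder applies a convolutional layer to pick out salient local semantics, a bidirectional GRU to retain word order, and cosine similarity between the two encodings to score a pair of texts, mirroring the Siamese weight-sharing design.

# Minimal sketch (an assumption, not the authors' released code) of a Siamese
# encoder combining a convolutional layer for salient local semantics with a
# bidirectional GRU for word order, scoring text pairs by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseCNNBiGRU(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, conv_channels=128,
                 kernel_size=3, gru_hidden=128):
        super().__init__()
        # In the paper the word vectors are further refined with an LSTM;
        # here a plain embedding table stands in for that step.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Convolution over the time axis extracts salient local semantics.
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size,
                              padding=kernel_size // 2)
        # Bidirectional GRU keeps the order information of the sequence.
        self.bigru = nn.GRU(conv_channels, gru_hidden, batch_first=True,
                            bidirectional=True)

    def encode(self, token_ids):
        x = self.embedding(token_ids)              # (B, T, E)
        x = x.transpose(1, 2)                      # (B, E, T) for Conv1d
        x = F.relu(self.conv(x)).transpose(1, 2)   # (B, T, C)
        _, h = self.bigru(x)                       # h: (2, B, H)
        return torch.cat([h[0], h[1]], dim=-1)     # (B, 2H)

    def forward(self, left_ids, right_ids):
        # Siamese structure: the same encoder (shared weights) embeds both
        # inputs, so the two texts are encoded consistently.
        return F.cosine_similarity(self.encode(left_ids), self.encode(right_ids))


if __name__ == "__main__":
    model = SiameseCNNBiGRU()
    a = torch.randint(1, 20000, (4, 12))   # batch of 4 token-id sequences
    b = torch.randint(1, 20000, (4, 12))
    print(model(a, b))                     # similarity scores in [-1, 1]

In practice the similarity score would be trained against human-annotated labels (for example with a mean-squared-error or contrastive loss), which is outside the scope of this sketch.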

How to Cite

Xiao, H., Zheng, K., & Li, X. (2025). Vocabulary Semantic Similarity Calculation in Natural Language Processing. Journal of Advances in Engineering and Technology, 2(4). https://doi.org/10.62177/jaet.v2i4.946

Issue

Vol. 2 No. 4 (2025)

Section

Articles

Dates

Received: 2025-12-05
Accepted: 2025-12-15
Published: 2025-12-28