Colombia
Albañil Sánchez, Misael Andrey
Galpin, I.
2022-07-22T19:15:19Z
2022-07-22T19:15:19Z
2022
http://hdl.handle.net/20.500.12010/27756
Throughout the world, the provision of online goods and services has increased significantly over the last few years. We consider the case of Tango Discos, a small company in Colombia that sells entertainment products through an e-commerce website and receives customer messages through various channels, including a web form, email, Facebook and Twitter. The dataset comprises 29,970 messages collected from 2019 to 2021. Each message can be categorized as a sale, a request or a complaint. In this work we evaluate different supervised classification models to automate the task of classifying the messages, viz. decision trees, Naive Bayes, linear Support Vector Machines and logistic regression. As the dataset is unbalanced, the models are evaluated in combination with various data balancing approaches to obtain the best performance. In order to maximize revenue, the management is interested in prioritizing messages that may result in potential sales. As such, the best model for deployment is one that minimizes false positives in the sales category, so that these messages are processed in a timely fashion. By this criterion, the best performing model is found to be the linear Support Vector Machine using the Random Over Sampler balancing technique. This model is deployed in the cloud and exposed through a RESTful interface.
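The record itself contains no code, so the following is only a minimal sketch of two ideas named in the abstract: random oversampling of the minority classes, and precision for the sales category (the metric that falls as false positives rise). The function names, the toy messages and the label strings "sale", "request" and "complaint" are assumptions made for illustration. They are not taken from the thesis.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen items of the smaller classes until every
    class has as many items as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def precision(y_true, y_pred, positive):
    """Fraction of items predicted as `positive` that truly are `positive`.
    Fewer false positives means higher precision."""
    hits = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not hits:
        return 0.0
    return sum(1 for t in hits if t == positive) / len(hits)


# Hypothetical toy data; the real dataset has 29,970 messages.
texts = ["price of this vinyl?", "do you ship to Cali?", "where is my order",
         "item arrived broken", "I want to buy two CDs", "change my address"]
labels = ["sale", "request", "request", "complaint", "sale", "request"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)
```

In a real experiment the oversampling would be applied to the training split only, so that duplicated messages never leak into the test set. The technique name in the abstract matches the `RandomOverSampler` class in the imbalanced-learn library, although the record does not state which library was used.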
15 pages
application/pdf
eng
Universidad de Bogotá Jorge Tadeo Lozano
instname:Universidad de Bogotá Jorge Tadeo Lozano
reponame:Expeditio Repositorio Institucional UJTL
E-Commerce
Classifying incoming customer messages for an e-commerce site using supervised learning
Master's thesis
E-commerce -- Academic theses and dissertations
E-commerce -- Security measures -- Academic theses and dissertations
Data mining -- Academic theses and dissertations
info:eu-repo/semantics/openAccess
info:eu-repo/semantics/acceptedVersion
Open (Full Text)
http://expeditio.utadeo.edu.co
Magíster en Ingeniería y Analítica de Datos
Maestría en Ingeniería y Analítica de Datos
Adaji, I., Kiron, N., Vassileva, J.: Evaluating the susceptibility of e-commerce shoppers to persuasive strategies. a game-based approach. In: International Conference on Persuasive Technology. pp. 58–72. Springer (2020)
Alghoul, A., Al Ajrami, S., Al Jarousha, G., Harb, G., Abu-Naser, S.S.: Email classification using artificial neural network (2018)
BlackSip, Vtex, Nielsen, PayU, Credibanco, MercadoLibre, Rappi, emBlue, Icommkt: BlackIndex: reporte del ecommerce en Colombia. BlackSip (2019)
Busemann, S., Schmeier, S., Arens, R.G.: Message classification in the call center. arXiv preprint cs/0003060 (2000)
Confecamaras: https://confecamaras.org.co (January 13, 2022)
Duan, L., Li, A., Huang, L.: A new spam short message classification. In: 2009 First International Workshop on Education Technology and Computer Science. vol. 2, pp. 168–171. IEEE (2009)
Fang, W., Luo, H., Xu, S., Love, P.E., Lu, Z., Ye, C.: Automated text classification of near-misses from safety reports: An improved deep learning approach. Advanced Engineering Informatics 44, 101060 (2020)
Manning, C., Raghavan, P., Schütze, H.: Introduction to information retrieval. Natural Language Engineering 16(1), 100–103 (2010)
Mansoor, R., Jayasinghe, N.D., Muslam, M.M.A.: A comprehensive review on email spam classification using machine learning algorithms. In: 2021 International Conference on Information Networking (ICOIN). pp. 327–332. IEEE (2021)
Masterov, D.V., Mayer, U.F., Tadelis, S.: Canary in the e-commerce coal mine: Detecting and predicting poor experiences using buyer-to-seller messages. In: Proceedings of the Sixteenth ACM Conference on Economics and Computation. pp. 81–93 (2015)
Menini, S., Moretti, G., Corazza, M., Cabrio, E., Tonelli, S., Villata, S.: A system to monitor cyberbullying based on message classification and social network analysis. In: Proceedings of the third workshop on abusive language online. pp. 105–110 (2019)
Mohammed, R., Rawashdeh, J., Abdullah, M.: Machine learning with oversampling and undersampling techniques: overview study and experimental results. In: 2020 11th international conference on information and communication systems (ICICS). pp. 243–248. IEEE (2020)
Nkansah, E.A.: Kayayo: An e-commerce site with recommendations and text messaging (2013)
Ozel, S.A., Saraç, E., Akdemir, S., Aksu, H.: Detection of cyberbullying on social media messages in Turkish. In: 2017 International Conference on Computer Science and Engineering (UBMK). pp. 366–370. IEEE (2017)
Webster, J.J., Kit, C.: Tokenization as the initial phase in nlp. In: COLING 1992 volume 4: The 14th international conference on computational linguistics (1992)
Wirth, R., Hipp, J.: Crisp-dm: Towards a standard process model for data mining. In: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining. vol. 1, pp. 29–39. Manchester (2000)
Zois, D.S., Kapodistria, A., Yao, M., Chelmis, C.: Optimal online cyberbullying detection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2017–2021. IEEE (2018)
Around the world, the purchase of goods and services online has increased significantly in recent years. We consider the case of Tango Discos, a small company in Colombia that sells entertainment products through an e-commerce website and receives customer messages through several channels, including a web form, email, Facebook and Twitter. The dataset comprises 29,970 messages collected between 2019 and 2021. Each message can be classified as a sale, a request or a complaint. In this work we evaluate different supervised classification models to automate the task of classifying the messages, namely decision trees, Naive Bayes, linear Support Vector Machines and logistic regression. Because the dataset is unbalanced, the models are evaluated in combination with several data balancing techniques to obtain the best performance. As a business requirement, management is interested in prioritizing messages that may result in potential sales. As such, the best model for deployment is one that minimizes false positives in the sales category, so that these messages are processed in a timely manner. The best performing model is thus found to be the linear Support Vector Machine using the Random Over Sampler balancing technique. This model is deployed in the cloud and exposed through a RESTful API.
info:eu-repo/semantics/masterThesis
http://purl.org/coar/resource_type/c_2df8fbb1