Design of a proactive diagnostic methodology for incidents affecting the availability of enterprise applications

dc.description.abstract: This thesis designs a proactive diagnostic methodology for managing incidents that affect the availability of enterprise applications deployed in Oracle/Linux and SQL Server/Windows environments, both on-premise and in the cloud. The main objective is to anticipate and detect degradation conditions that are commonly resolved by restarting servers or application services, a reactive action that does not address the root cause of the problem. The technical analysis covers the identification of recurrent causes of degradation, such as: memory (leaks, heap blocking, swap saturation); processes (hung threads or excessive resource consumption); CPU (overutilization and sustained load); configuration (inadequate parameters in applications or middleware); and database (connection pool exhaustion and its impact on Oracle/SQL Server). The aim is to establish a methodology that allows administrators to anticipate incidents by evaluating the native tools available on these systems (for example, sar, journalctl, GC logs, Oracle AWR/ADDM, PerfMon in .NET) and integrating them into a practical guide. The proposed methodology will be evaluated for its usefulness across three main components: 1) a taxonomy of degradation incidents; 2) a functional comparison of native diagnostic tools; and 3) a guide in the form of checklists and decision flowcharts for acting before resorting to restarts.
dc.description.abstractenglish: This thesis proposes the design of a proactive diagnostic methodology aimed at managing incidents that affect the availability of enterprise applications deployed in Oracle/Linux and SQL Server/Windows environments, both on-premise and in the cloud. The main goal is to anticipate and detect degradation conditions that, in operational practice, are frequently mitigated through server or application service restarts, reactive actions that temporarily restore functionality but do not address the underlying root cause. The analysis focuses on the most recurrent sources of degradation in Java EE and .NET applications, including memory leaks, heap blocking, swap saturation, hung processes, sustained CPU overutilization, misconfigurations at the application server or middleware level, and exhaustion of connection pools in Oracle and SQL Server databases. Additionally, the study evaluates the diagnostic potential of native tools available in these environments (such as sar, journalctl, garbage collection logs, Oracle AWR/ADDM reports, and Windows PerfMon), highlighting their usefulness in detecting early deterioration before it escalates into critical failures. The proposed methodology is structured into three components: (1) a taxonomy of degradation incidents derived from operational evidence and prior research; (2) a functional assessment of native diagnostic tools present in the evaluated environments; and (3) a practical guide composed of checklists and decision flowcharts designed to facilitate early detection and reduce the reliance on restarts as a recovery strategy. The results aim to provide system and database administrators with a practical and academically grounded framework to transition from reactive operations toward proactive reliability practices aligned with modern availability and service continuity standards.
dc.format.extent: 35 pages
dc.format.mimetype: application/pdf
dc.language.iso: es
dc.subject: High Availability
dc.subject: Deadlock
dc.subject: Proactive Approach
dc.subject: Latency
dc.subject: Memory Leak
dc.subject: Observability
dc.subject: Connection Pool
dc.subject: SRE
dc.subject: Swap
dc.subject: Thread
dc.subject: Throughput
dc.subject: Zero Downtime
dc.subject.keyword: Zero Downtime
dc.subject.keyword: Throughput
dc.subject.keyword: Swap
dc.subject.keyword: SRE
dc.subject.keyword: Deadlock
dc.subject.lemb: Computer systems administration
dc.subject.lemb: Computer systems diagnostics
dc.subject.lemb: IT incident management
dc.title: Design of a proactive diagnostic methodology for incidents affecting the availability of enterprise applications
dc.type.coar: http://purl.org/coar/resource_type/c_46ec

Files

Original bundle

Name: Andres Acevedo_TRABAJO_FINAL.pdf
Size: 1.19 MB
Format: Adobe Portable Document Format
Description: Thesis
