Design of an Enterprise Lakehouse Architecture Integrating Relational Modeling and Vector Databases to Support Advanced Analytics and Generative AI

González Quintero, Erika Yiseth; Amado Otálora, Víctor Julián

dc.contributor.advisor	Romero Gelvez, Jorge Ivan
dc.contributor.advisor	Garcés Restrepo, Mauricio
dc.creator	González Quintero, Erika Yiseth
dc.creator	Amado Otálora, Víctor Julián
dc.date.accessioned	2026-06-03T19:42:38Z
dc.date.created	2026-06-02
dc.description.abstract	Today's organizations produce and consume information in multiple formats: relational databases, documents, semi-structured files, transactional systems, external sources, interaction logs, and institutional repositories. This diversity has expanded the possibilities for analysis, but it has also increased information fragmentation. Colombian organizations in particular often operate with accounting systems, electronic invoicing, payroll, POS, CRM, document management, tax sources, and external data that are not always integrated under a single data architecture. In this context, generative artificial intelligence introduces an additional requirement: retrieving information is not enough; it is also necessary to know its origin, version, permissions, validity, and the evidence that supports each answer. Approaches based on Retrieval-Augmented Generation (RAG), embeddings, and vector databases make it possible to connect language models with enterprise knowledge, but adopting them in isolation can create new silos unless they are tied to governance, lineage, evaluation, and consistent data models. This work proposes the design of an enterprise lakehouse architecture that integrates relational modeling, analytical storage, and vector databases to support advanced analytics and generative artificial intelligence. The proposal preserves the precision of the relational model, incorporates the flexibility of the lakehouse, and adds a semantic layer oriented toward hybrid search, document traceability, retrieval-augmented generation, and response evaluation. The research develops the proposal from the problem description, the formulation of objectives, the definition of requirements, the review of the state of the art, the construction of the theoretical framework, and the methodological design of the architecture.
As a result, it puts forward a reference architecture adaptable to organizations with heterogeneous systems, accompanied by architectural views, validation criteria, a discussion of how the objectives are met, a Scrum-style agile implementation plan, and an estimated budget for the formulation and the pilot. The central contribution is to show that enterprise generative AI should not be understood as an isolated application, but as a capability that depends on a governed, traceable, and evaluable data infrastructure. The proposed architecture brings together sources, documents, chunks, embeddings, queries, responses, citations, metrics, and security policies within a single information management cycle.
dc.description.abstractenglish	Organizations increasingly make decisions in environments where information is abundant, distributed, and difficult to reconcile. Relational databases, transactional systems, documents, semi-structured files, external registries, analytical reports, and digital interaction records coexist in the same institutional landscape, but they often evolve through separate technological paths. In the Colombian context, this fragmentation is especially visible in organizations that depend on accounting platforms, electronic invoicing, payroll systems, point-of-sale solutions, CRM tools, document repositories, tax-related services, and sector-specific information sources that are not always governed under a common data architecture. This situation becomes more critical with the adoption of generative artificial intelligence. For an enterprise assistant, retrieving a fragment of information is not enough: the organization must know where that information came from, whether it is current, who is allowed to access it, how it was transformed, and which evidence supports the generated answer. Retrieval-Augmented Generation (RAG), embeddings, vector databases, and hybrid search provide powerful mechanisms to connect language models with enterprise knowledge; however, when they are implemented as isolated components, they can reproduce the same fragmentation they are expected to solve. This research proposes the design of an enterprise lakehouse architecture that integrates relational modeling, analytical storage, document processing, vector databases, and governance mechanisms to support advanced analytics and generative artificial intelligence. The proposal does not treat the lakehouse, the relational model, or the vector layer as competing alternatives.
Instead, it articulates them as complementary capabilities: the relational model provides business structure and consistency, the lakehouse offers scalable analytical organization, and the semantic layer enables contextual retrieval, traceability, and evidence-based generation. The study develops the proposal through the formulation of the problem, the definition of objectives and requirements, the review of the state of the art, the theoretical framework, and the methodological design of the architecture. As a result, it presents a reference model for organizations with heterogeneous information systems, including architectural views, data lifecycle criteria, logical traceability between sources and responses, security considerations, validation criteria, an agile implementation plan, a WBS decomposition, a Gantt schedule, and estimated budgets for both the academic formulation and a controlled technical pilot. The main contribution of this work is to argue that enterprise generative AI should be understood as a governed data capability rather than as an isolated application layer. A reliable generative system depends not only on a language model, but also on the quality of the sources, the clarity of metadata, the validity of embeddings, the enforcement of permissions, and the possibility of auditing each answer. The proposed architecture therefore connects sources, datasets, documents, chunks, embeddings, queries, retrieved evidence, generated responses, citations, metrics, and security policies within a unified information management cycle.
dc.format.extent	54 pages
dc.format.mimetype	application/pdf
dc.identifier.uri	https://hdl.handle.net/20.500.12010/39618
dc.language.iso	es
dc.relation.references	M. Armbrust, A. Ghodsi, R. Xin and M. Zaharia, «Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics,» in Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2021, pp. 1–9.
dc.relation.references	H. Yu, A. Gan, K. Zhang, S. Tong, Q. Liu and Z. Liu, «Evaluation of Retrieval-Augmented Generation: A Survey,» arXiv preprint arXiv:2405.07437, 2024.
dc.relation.references	S. Pan et al., «A Comprehensive Survey on Vector Databases: Storage and Retrieval Techniques,» arXiv preprint arXiv:2310.11703, 2025.
dc.relation.references	E. Kandogan et al., «A Blueprint Architecture of Compound AI Systems for Enterprise,» arXiv preprint arXiv:2406.00584, 2024.
dc.relation.references	L. Jing et al., «When Large Language Models Meet Vector Databases: A Survey,» arXiv preprint arXiv:2402.01763, 2024.
dc.relation.references	Y. Chronis, H. Caminal, Y. Papakonstantinou, F. Özcan and A. Ailamaki, «Filtered Vector Search: State-of-the-Art and Research Opportunities,» Proceedings of the VLDB Endowment, vol. 18, pp. 5488–5500, 2025.
dc.relation.references	J. Song et al., «Magnus: A Holistic Approach to Data Management for Large-Scale AI Applications,» Proceedings of the VLDB Endowment, vol. 18, pp. 4964–4976, 2025.
dc.relation.references	P. Lewis et al., «Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,» in Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
dc.relation.references	S. Es, J. James, L. Espinosa-Anke and S. Schockaert, «RAGAS: Automated Evaluation of Retrieval Augmented Generation,» in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 150–158, 2024.
dc.relation.references	R. Yan, X. Zhao and S. Mazumdar, «Chatbots in Libraries: A Systematic Literature Review,» Education for Information, vol. 39, no. 4, pp. 431–449, 2023.
dc.relation.references	C. Chen et al., «SingleStore-V: An Integrated Vector Database System in SingleStore,» Proceedings of the VLDB Endowment, vol. 17, pp. 3772–3784, 2024.
dc.relation.references	T. Prajapati, «LiveVectorLake: A Real-Time Versioned Knowledge Base Architecture for Streaming Vector Updates and Temporal Retrieval,» arXiv preprint arXiv:2601.05270, 2025.
dc.relation.references	J. Tagliabue, F. Bianchi and C. Greco, «Trustworthy AI in the Agentic Lakehouse: From Concurrency to Governance,» arXiv preprint, 2025.
dc.relation.references	S. Ordóñez Salinas and A. C. Nieto Lemus, «Data Warehouse and Big Data Integration,» International Journal of Computer Science and Information Technology, 2017, doi: 10.5121/ijcsit.2017.9201.
dc.relation.references	N. D. Duque-Méndez, M. Orozco-Alzate and J. J. Vélez Upegui, «Hydro-meteorological data analysis using OLAP techniques,» DYNA, vol. 81, no. 185, pp. 160–167, 2014, doi: 10.15446/dyna.v81n185.37700.
dc.relation.references	Y. M. Pérez-Pérez, A. A. Rosado-Gómez and A. M. Puentes-Velásquez, «Application of business intelligence in the quality management of higher education institutions,» Journal of Physics: Conference Series, vol. 1126, art. 012053, 2018, doi: 10.1088/1742-6596/1126/1/012053.
dc.relation.references	D. A. Fuentes-Vargas, M. E. Mendoza-Becerra and L. C. Gómez-Flórez, «Adaptable Data Warehouse Based on the Research Factor of the NAC Institutional Accreditation Model,» Revista Facultad de Ingeniería, vol. 31, no. 62, 2022, doi: 10.19053/01211129.v31.n62.2022.15211.
dc.relation.references	O. Barrenechea, A. Mendieta, J. Armas and J. M. Madrid, «Data Governance Reference Model to streamline the supply chain process in SMEs,» in 2019 IEEE XXVI International Conference on Electronics, Electrical Engineering and Computing (INTERCON), 2019, doi: 10.1109/INTERCON.2019.8853634.
dc.relation.references	J. D. Velásquez-Henao, C. J. Franco-Cardona and L. Cadavid-Higuita, «Prompt Engineering: a methodology for optimizing interactions with AI-Language Models in the field of engineering,» DYNA, vol. 90, no. 230, 2023, doi: 10.15446/dyna.v90n230.111700.
dc.relation.references	D. Rico-Bautista, C. D. Guerrero, C. A. Collazos, G. Maestre-Góngora, J. A. Hurtado-Alegría, Y. Medina-Cárdenas and J. Swaminathan, «Smart University: a vision of technology adoption,» Revista Colombiana de Computación, 2021, doi: 10.29375/25392115.4153.
dc.relation.references	A. Noguera, A. L. Mogollón-Benavides, S. Rua, D. Sanin-Villa and J. C. Tejada, «Retrieval-Augmented Generation for Maternal Healthcare: Design and Evaluation of a Clinical Question-Answering System in Spanish,» in 2025 Mexican International Conference on Computer Science (ENC), 2025, doi: 10.1109/ENC68268.2025.11311944.
dc.relation.references	Ministerio de Tecnologías de la Información y las Comunicaciones, Marco de Interoperabilidad para Gobierno Digital, https://lenguaje.mintic.gov.co/marco, accessed May 5, 2026.
dc.relation.references	Agencia Nacional Digital, Servicios Ciudadanos Digitales, https://and.gov.co/servicios-ciudadanos-digitales, accessed May 5, 2026.
dc.relation.references	Dirección de Impuestos y Aduanas Nacionales, Sistema de Factura Electrónica, https://www.dian.gov.co/impuestos/Paginas/Sistema-de-Factura-Electronica/Sistema-de-Factura-Electronica.aspx, accessed May 5, 2026.
dc.relation.references	Dirección de Impuestos y Aduanas Nacionales, Registro Único Tributario (RUT), https://www.dian.gov.co/tramitesservicios/tramites-y-servicios/tributarios/Paginas/RUT.aspx, accessed May 5, 2026.
dc.relation.references	Registro Único Empresarial y Social, La Gran Central de Información Empresarial de Colombia - RUES, https://app-antiguoprd.rues.org.co/Home/About, accessed May 5, 2026.
dc.relation.references	Colombia Compra Eficiente, Datos JSON del Sistema de Compra Pública, https://operaciones.colombiacompra.gov.co/transparencia/datos-json, accessed May 5, 2026.
dc.relation.references	OpenAI, Retrieval and vector stores, official OpenAI API documentation, https://developers.openai.com/api/docs/guides/retrieval, accessed May 5, 2026.
dc.relation.references	OpenAI, Model optimization and fine-tuning, official OpenAI API documentation, https://developers.openai.com/api/docs/guides/model-optimization, accessed May 5, 2026.
dc.relation.references	Google Cloud, Vector Search overview, official Vertex AI documentation, https://cloud.google.com/vertex-ai/docs/vector-search/overview, accessed May 5, 2026.
dc.relation.references	Google Cloud, Vector database choices in Vertex AI RAG Engine, official Vertex AI documentation, https://cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/vector-db-choices, accessed May 5, 2026.
dc.relation.references	Google Cloud, About supervised fine-tuning for Gemini models, official Vertex AI documentation, https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning, accessed May 5, 2026.
dc.relation.references	Anthropic, Embeddings, official Anthropic documentation, https://docs.anthropic.com/en/docs/build-with-claude/embeddings, accessed May 5, 2026.
dc.relation.references	Anthropic, Search results, official Anthropic documentation, https://docs.anthropic.com/en/docs/build-with-claude/search-results, accessed May 5, 2026.
dc.relation.references	Pinecone, Pinecone documentation, official documentation, https://docs.pinecone.io/guides/get-started/overview, accessed May 5, 2026.
dc.relation.references	Weaviate, Hybrid search, official documentation, https://docs.weaviate.io/weaviate/concepts/search/hybrid-search, accessed May 5, 2026.
dc.relation.references	Qdrant, Search, official documentation, https://qdrant.tech/documentation/concepts/search/, accessed May 5, 2026.
dc.relation.references	Milvus, Hybrid Search with Milvus, official documentation, https://milvus.io/docs/hybrid_search_with_milvus.md, accessed May 5, 2026.
dc.relation.references	pgvector, Open-source vector similarity search for Postgres, official project documentation, https://github.com/pgvector/pgvector, accessed May 5, 2026.
dc.relation.references	Hugging Face, LLM Finetuning with AutoTrain Advanced, official documentation, https://huggingface.co/docs/autotrain/tasks/llm_finetuning, accessed May 5, 2026.
dc.relation.references	Hugging Face, Inference Endpoints, official documentation, https://huggingface.co/docs/huggingface_hub/en/guides/inference_endpoints, accessed May 5, 2026.
dc.relation.references	Hugging Face, Search index, official Datasets documentation, https://huggingface.co/docs/datasets/faiss_es, accessed May 5, 2026.
dc.relation.references	W. H. Inmon, Building the Data Warehouse, 4th ed. Wiley, 2005.
dc.relation.references	R. Kimball and M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed. Wiley, 2013.
dc.relation.references	J. Dixon, Pentaho, Hadoop, and Data Lakes, https://jamesdixon.wordpress.com, blog post, 2010.
dc.relation.references	IBM, Data Lake vs Data Warehouse, https://www.ibm.com, consulted for conceptual comparison, 2023.
dc.relation.references	Microsoft, Data Lake vs Data Warehouse, https://learn.microsoft.com, official documentation, 2023.
dc.relation.references	Amazon Web Services, What is a Data Lake?, https://aws.amazon.com, official documentation, 2023.
dc.subject	Lakehouse Architecture
dc.subject	Vector Databases
dc.subject	Generative Artificial Intelligence
dc.subject	Hybrid Search
dc.subject	RAG
dc.subject	Data Governance
dc.subject.keyword	Lakehouse Architecture
dc.subject.keyword	Vector Databases
dc.subject.keyword	Generative Artificial Intelligence
dc.subject.keyword	Hybrid Search
dc.subject.keyword	Retrieval-Augmented Generation
dc.subject.keyword	Data Governance
dc.subject.keyword	Enterprise Architecture
dc.subject.lemb	Information architecture
dc.subject.lemb	Artificial intelligence
dc.subject.lemb	Data management
dc.title	Design of an Enterprise Lakehouse Architecture Integrating Relational Modeling and Vector Databases to Support Advanced Analytics and Generative AI
dc.type.coar	http://purl.org/coar/resource_type/c_baaf

Files

Original bundle

Name: Diseño de una Arquitectura Lakehouse Empresarial).pdf
Size: 1.29 MB
Format: Adobe Portable Document Format

Collections

Especialización en Desarrollo de Bases de Datos