RT info:eu-repo/semantics/masterThesis T1 HDFS File Formats: Study and Performance Comparison A1 Alonso Isla, Álvaro A2 Universidad de Valladolid. Escuela Técnica Superior de Ingenieros de Telecomunicación K1 Big Data K1 Hadoop K1 HDFS K1 MapReduce AB The distributed system Hadoop has become very popular for storing and processing large amounts of data (Big Data). As it is composed of many machines, its file system, called HDFS (Hadoop Distributed File System), is also distributed. But as HDFS is not a traditional storage system, many new file formats have been developed to take advantage of its features. In this work we study these new formats to find out their characteristics and to be able to decide which ones are better suited to the needs of our data. For that goal, we have built a theoretical framework to compare them and easily recognize which formats fit our needs. We have also carried out an experimental study to find out how the formats behave in some specific situations, selecting two very different datasets and a set of simple queries, resolved with MapReduce jobs written in Java or run using the Hive tool. The final goal of this work is to be able to identify the different strengths and weaknesses of the file formats. YR 2018 FD 2018 LK http://uvadoc.uva.es/handle/10324/32896 UL http://uvadoc.uva.es/handle/10324/32896 LA eng NO Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos) DS UVaDOC RD 22-dic-2024