Please use this identifier to cite or link to this item: http://uvadoc.uva.es/handle/10324/27615
Title
Compresión de datasets RDF en HDT usando Spark
Director or Tutor
Year of the Document
2017
Degree
Máster en Investigación en Tecnologías de la Información y las Comunicaciones
Abstract
Apache Spark is a general-purpose big data processing framework based on the map-reduce paradigm that is quickly becoming very popular. Although the information provided by the Spark authors indicates a substantial performance improvement over Hadoop, there is very little evidence in the literature of specific tests that reliably prove such claims. In this Master's thesis we study the benefits of Spark and the most important factors on which they depend, taking as a reference the transformation of RDF datasets into the HDT format. The main objective of this work is to carry out an exploratory study on leveraging Spark to solve the HDT serialization problem, finding ways to remove limitations of the current implementations, such as the memory requirement, which tends to grow with the dataset size. To that end, we first set up an open environment to ensure reproducibility, and then contributed three different approaches implementing the heaviest task in the HDT serialization. The tests performed with different dataset sizes showed the benefits of the proposed solution compared to the legacy Hadoop MapReduce implementation, as well as some pointers for improving the serialization algorithm even further.
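As background for the serialization problem the abstract describes: HDT replaces the RDF terms of each triple with integer IDs drawn from a sorted term dictionary, and building that dictionary over a large dataset is one of the costly steps that a Spark job can distribute. The sketch below is only a minimal, local illustration of this encoding idea in plain Python; it is not the thesis implementation (real HDT splits the dictionary into shared, subject, predicate, and object sections, and the names and sample triples here are invented for illustration).

```python
def build_dictionary(triples):
    """Map every distinct term to an integer ID (1-based, in lexicographic order),
    mimicking the sorted term dictionary that HDT builds over a dataset."""
    terms = sorted({term for triple in triples for term in triple})
    return {term: i for i, term in enumerate(terms, start=1)}

def encode(triples, dictionary):
    """Replace each term of each triple with its integer ID."""
    return [tuple(dictionary[term] for term in triple) for triple in triples]

# Toy RDF-like triples (illustrative only).
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),
]
d = build_dictionary(triples)
ids = encode(triples, d)
```

In a distributed setting, the term extraction and global sort become shuffle-heavy operations over the whole dataset, which is why the choice of execution engine (Hadoop MapReduce vs. Spark) matters for this workload.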
Keywords
Apache Spark (Data processor)
RDF
Hadoop MapReduce
HDT (Format)
Department
Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos)
Language
eng
Rights
openAccess
Appears in Collections
- Trabajos Fin de Máster UVa [6578]
Files in this item
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International