Big data and the web: algorithms for data intensive scalable computig

Gianmarco De Francisci Morales
This thesis explores the problem of large scale Web mining by using Data Intensive Scalable Computing (DISC) systems. Web mining aims to extract useful information and models from data on the Web, the largest repository ever created. DISC systems are an emerging technology for processing huge datasets in parallel on large computer clusters. Challenges arise from both themes of research. The Web is heterogeneous: data lives in various formats that are best modeled in different...