Data Mining and Knowledge Discovery Group at the Department of Computer Science and Engineering (DISI) of the University of Bologna

Master theses

A list of available subjects for Master theses, by research topic.

Send an email to the group head, prof. Claudio Sartori, for further information.

Master theses in parallel and distributed data mining

Introduction to thesis topics

Parallel data mining

Due to the growing size and complexity of real data sets, traditional centralized data mining approaches, which are designed for a single processor, cannot provide users with timely responses in many important applications. Parallel data mining (PDM) was therefore proposed in the late nineties to exploit the computational power of parallel shared-memory machines (SMM) and distributed-memory machines (DMM). Today's massively parallel supercomputers, with tens of thousands of multi-core nodes, and many-core devices with thousands of cores, such as general-purpose graphics processing units (GPGPU), provide sufficient computational power to effectively support high-performance large-scale data mining and to face emerging scientific challenges such as genomic analysis and environmental protection.

Distributed data mining

In an increasing number of data mining applications, the data set consists of a collection of distributed data sets generated at different times, at different locations, and, possibly, by different entities. Examples are data sets generated at different branches of a company, data sets generated by different companies which must be mined cooperatively, and data sets arising in the natural and earth sciences. Transmitting a distributed data set to a central site is often undesirable, mainly because of bandwidth limitations, but also because of privacy constraints and the loss of competitive advantage. Distributed data mining (DDM) is an approach to data mining in loosely coupled distributed systems which reduces network load. Every node of the DDM system locally extracts a small summary, or a partial mining model, from its data set and sends it to a central site, which processes the summaries or models and generates a possibly approximate global model. A trade-off between model accuracy and transmission complexity must be achieved for such a solution to be effective. If the data are subject to privacy constraints, the summaries must also be immune to inferential attacks that could reconstruct the original data.
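The following minimal sketch illustrates the general summary-then-merge pattern, not any specific DDM algorithm: each site condenses its local data into a tiny summary (count, sum, sum of squares), and the central site combines the summaries into a global model (here, simply the global mean and variance) without ever receiving the raw data.

```python
# Illustrative sketch of summary-based distributed mining (hypothetical example,
# not a published algorithm): each site transmits only three numbers.
import numpy as np

def local_summary(data):
    """Computed at each distributed site; only the summary is transmitted."""
    data = np.asarray(data, dtype=float)
    return data.size, data.sum(), np.square(data).sum()

def merge_summaries(summaries):
    """Computed at the central site from the received summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance

# Three sites, each with its own local data set.
sites = [np.random.normal(0.0, 1.0, 10_000) for _ in range(3)]
print(merge_summaries([local_summary(d) for d in sites]))
```

Transmission complexity here is constant per site; richer models (clusterings, classifiers, outlier scores) require larger summaries, which is exactly the accuracy/transmission trade-off mentioned above.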

Stream data mining

Data streams are continuous, potentially unbounded flows of data generated by high-speed, high-volume data sources such as network and web activity loggers, sensor networks, and geophysical monitoring stations. Examples of organizations which create data streams are telecommunication companies, on-line retail companies, and environment monitoring agencies. Because of the volume and speed of data streams, algorithms that perform multiple accesses to every stream element, including multiple sequential scans, are infeasible. Moreover, complete stream data cannot be stored in main memory. Therefore, in stream processing it is assumed that each stream element is available for processing for a bounded amount of time after its arrival, and that it must be deleted or moved to off-line archival storage afterwards. Such constraints make stream mining a difficult task, since the accuracy of a data mining model often depends on the amount of non-local information an algorithm is designed to process. Further difficulties arise from the intrinsic variability of streams: the structure of stream data may evolve substantially over time, challenging designers of applications which must compare mining models at different time horizons or detect changes in real time.
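A classical illustration of the single-pass, bounded-memory constraint described above is reservoir sampling, sketched below: it maintains a uniform random sample of fixed size from an unbounded stream while touching each element exactly once before discarding it.

```python
# Minimal sketch of single-pass, bounded-memory stream processing:
# reservoir sampling keeps a uniform random sample of size k.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)
        else:
            # Element i replaces a random slot with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = element
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```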

Support vector models in distributed, parallel, and streaming environments

SVM learning in a stream environment

In a data stream environment, the arrival frequency and the volume of example instances make the design of any type of classifier a difficult task. The Support Vector Machine (SVM) is recognized as one of the most accurate classifier learning algorithms to date. However, SVM learners require the solution of an optimization problem, and the design of an SVM classifier for stream data is therefore a hard problem. The candidate is asked to design, implement and test such a classifier, improving over existing algorithms. A theoretical and experimental basis for the design task can be found here.
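As a point of reference only, the sketch below shows a simple baseline: an online linear SVM approximation trained by stochastic gradient descent on the hinge loss via scikit-learn's SGDClassifier.partial_fit. This merely illustrates incremental, chunk-by-chunk training; it is not the stream SVM learner the thesis asks for.

```python
# Baseline sketch (assumed setup, synthetic data): incremental linear SVM
# approximation for a stream of example chunks.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")          # hinge loss = linear SVM objective
classes = np.array([0, 1])                 # all classes must be declared up front

rng = np.random.default_rng(0)
for chunk in range(100):                   # each chunk mimics a batch of stream elements
    X = rng.normal(size=(50, 10))
    y = (X[:, 0] + 0.1 * rng.normal(size=50) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes) # the chunk can be discarded after this call

X_test = rng.normal(size=(1000, 10))
y_test = (X_test[:, 0] > 0).astype(int)
print("accuracy:", clf.score(X_test, y_test))
```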

Parallel and distributed outlier detection

Parallel distance-based outlier detection in multi-core high-performance architectures

The latest generation of multi-core, multi-node, massively parallel supercomputers is composed of a very large number of nodes, of the order of 10^4, and an even larger number of cores, of the order of 10^5. Such computing architectures offer an opportunity to solve mining problems of unprecedented size. However, to fully exploit their power, efficient and accurate distributed algorithms, with proper synchronization and low transmission complexity, must be designed. The candidate is asked to design, implement and test an efficient algorithm for parallel distance-based outlier detection on the IBM BlueGene/Q high-performance computing architecture, consisting of an evolution of the DistributedSolvingSet algorithm. The programming environment includes C/C++ compilers and Message Passing Interface (MPI) wrappers.
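The deliberately simplified sketch below conveys the flavour of parallel distance-based outlier scoring with MPI; it is not the DistributedSolvingSet algorithm and is not tuned for BlueGene/Q. Each rank holds a partition of the data, the root broadcasts a set of candidate points, every rank returns the k smallest candidate-to-local-point distances, and the root merges them to obtain each candidate's global k-NN distance, used as its outlier score.

```python
# Simplified illustration (requires numpy and mpi4py); run e.g. with:
#   mpiexec -n 4 python outlier_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
k = 5

# Each rank generates its own local partition (stand-in for reading local data).
rng = np.random.default_rng(rank)
local_data = rng.normal(size=(10_000, 3))

# Root picks candidate points and broadcasts them to all ranks.
candidates = rng.normal(size=(20, 3)) if rank == 0 else None
candidates = comm.bcast(candidates, root=0)

# Each rank keeps only the k smallest distances per candidate.
dists = np.linalg.norm(candidates[:, None, :] - local_data[None, :, :], axis=2)
local_topk = np.sort(dists, axis=1)[:, :k]

# Root merges the per-rank top-k lists and reports global k-NN distances.
gathered = comm.gather(local_topk, root=0)
if rank == 0:
    merged = np.sort(np.concatenate(gathered, axis=1), axis=1)
    scores = merged[:, k - 1]   # distance to the global k-th nearest neighbor
    print("outlier scores:", scores)
```

Note that only O(k) distances per candidate travel over the network per rank, which is the kind of transmission saving a real solving-set approach pushes much further.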

Outlier detection in a cloud computing infrastructure

In cloud computing, clusters of distributed computers provide on-demand computing resources over a network. Two types of clouds can be distinguished: clouds that provide computing instances, like Amazon's EC2, and clouds that provide computing capacity, such as Google's MapReduce. This thesis subject is concerned with data mining in the second type of cloud and, in particular, with MapReduce and its open-source implementation, Hadoop. MapReduce is a simple distributed programming paradigm consisting of a Map function, which generates a list of intermediate key-value pairs for each key-value pair in the input, and a Reduce function, which operates on all intermediate pairs sharing the same key. MapReduce is an attractive solution to the problem of mining terabyte- or petabyte-sized data sets on demand in a cloud infrastructure. However, although applying the MapReduce paradigm to simple data analysis tasks, such as histogram computation, is relatively straightforward, data mining algorithms are much harder to implement. In this thesis, the candidate is asked to design, implement and test on very large data sets a distance-based outlier detection algorithm in the MapReduce paradigm, taking a distributed solving set approach.
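The toy, single-process simulation below illustrates the Map/Reduce contract on the histogram example mentioned above: map emits a (bucket, 1) pair per input value, the shuffle phase groups pairs by key, and reduce sums the counts per bucket. A real implementation would run as a Hadoop job over a distributed file system.

```python
# Toy simulation of the MapReduce paradigm (illustrative only).
from collections import defaultdict

def map_fn(key, value):
    """Emit one intermediate (bucket, 1) pair for the input value."""
    yield int(value // 10), 1          # histogram buckets of width 10

def reduce_fn(key, values):
    """Aggregate all intermediate pairs sharing the same key."""
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):      # map phase
            groups[k].append(v)              # shuffle: group by key
    output = []
    for k, vs in sorted(groups.items()):
        output.extend(reduce_fn(k, vs))      # reduce phase
    return output

data = list(enumerate([3, 12, 15, 27, 31, 33, 8, 19, 42, 45]))
print(run_mapreduce(data, map_fn, reduce_fn))   # [(0, 2), (1, 3), (2, 1), (3, 2), (4, 2)]
```

Outlier detection does not decompose this neatly, which is precisely why the distributed solving set approach is needed.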

Data clustering in distributed and streaming environments

Stream clustering using kernel density estimates

The composition, shape, location and number of clusters may evolve quickly in a stream. The temporal dimension of stream data therefore plays a central role in stream clustering, making it different from traditional data clustering in two respects. First, light-weight data structures must record on-line the structural changes that take place in a stream, with a per-operation complexity low enough to prevent the processing of new stream elements from lagging behind. Second, as the history of a stream may contain valuable information on the mechanism that drives such changes, suitable query procedures have to be provided to enable a user to inspect the clusters that were active within any specified time frame in the past. These requirements are difficult to satisfy if cluster shapes must be captured in great detail, because larger and more complex data structures are then required. The candidate is asked to design and implement a stream clustering algorithm capable of recognizing clusters of arbitrary shape and changes to the structure of clusters, and of retrieving past configurations of that structure, possibly as an extension and improvement of previous work on the topic.
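Under strong simplifying assumptions (one-dimensional data, a fixed evaluation grid, a Gaussian kernel), the sketch below shows the basic idea of density-based stream clustering: the stream is absorbed into an exponentially decayed kernel density estimate updated once per element, and the current clusters are read off as the modes of the estimate. A real algorithm must also summarize history so that past cluster configurations can be queried.

```python
# Minimal sketch of decayed kernel density estimation over a stream (hypothetical).
import numpy as np

class DecayedKDE:
    def __init__(self, grid, bandwidth=0.3, decay=0.999):
        self.grid = grid                 # fixed evaluation grid
        self.bandwidth = bandwidth
        self.decay = decay               # forgetting factor per stream element
        self.density = np.zeros_like(grid)

    def update(self, x):
        """O(grid size) work per element; the element is then discarded."""
        self.density *= self.decay
        self.density += np.exp(-0.5 * ((self.grid - x) / self.bandwidth) ** 2)

    def modes(self):
        """Grid points that are local maxima of the current density."""
        d = self.density
        inner = (d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])
        return self.grid[1:-1][inner]

kde = DecayedKDE(grid=np.linspace(-5, 5, 401))
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(-2, 0.4, 5000), rng.normal(2, 0.4, 5000)])
for x in stream:
    kde.update(x)
print("current cluster modes:", kde.modes())
```

The exponential decay is what lets recent structure dominate older structure, so drifting clusters are tracked without storing past elements.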

Distributed data clustering in sensor networks

Wireless sensor networks (WSN) are composed of a large number of inexpensive, battery-powered nodes, each equipped with a sensor, local memory, a digital wireless communication device and a processing unit, which are deployed in a physical environment for monitoring purposes. A node of a sensor network communicates its readings to a sink node, which in turn forwards them to a base station for further processing. The lifetime of a sensor node largely depends on the total transmission time of its transceiver, which is the component with the largest power consumption. If a WSN collects a large number of readings, such as a WSN collecting data to be clustered, a technique for preserving battery charge is to reduce the number of transmitted records by partially clustering the data in the nodes: only small, summarized representations of the data, sufficient to cluster the data globally with the desired accuracy, are transmitted to the sink. The candidate must design a clustering algorithm for a wireless sensor network which minimizes data transmission, and implement and test a software simulator to validate the properties and performance of the algorithm.
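An illustrative sketch of the transmission-saving scheme described above (not a specific algorithm from the literature): every sensor node clusters its own readings locally and transmits only a handful of (centroid, count) pairs, and the sink runs a weighted clustering over the received centroids to obtain the global clusters. The example uses scikit-learn's KMeans with sample weights.

```python
# Hypothetical node-summary / sink-merge scheme for clustering in a WSN.
import numpy as np
from sklearn.cluster import KMeans

def node_summary(readings, n_local_clusters=5):
    """Executed on a sensor node: reduce the readings to a few weighted centroids."""
    km = KMeans(n_clusters=n_local_clusters, n_init=5).fit(readings)
    counts = np.bincount(km.labels_, minlength=n_local_clusters)
    return km.cluster_centers_, counts      # this is all that gets transmitted

def sink_cluster(summaries, n_global_clusters=3):
    """Executed at the sink / base station: weighted k-means over the summaries."""
    centers = np.vstack([c for c, _ in summaries])
    weights = np.concatenate([w for _, w in summaries])
    km = KMeans(n_clusters=n_global_clusters, n_init=10)
    km.fit(centers, sample_weight=weights)
    return km.cluster_centers_

rng = np.random.default_rng(0)
true_centers = np.array([[0, 0], [5, 5], [0, 5]])
nodes = [rng.normal(true_centers[rng.integers(3, size=200)], 0.5) for _ in range(10)]
print(sink_cluster([node_summary(n) for n in nodes]))
```

With 200 readings per node and 5 transmitted centroids, the transmitted volume drops by roughly a factor of 40, which is the kind of saving that extends node lifetime in practice.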

August, 2013