The Data Mining and Knowledge Discovery Group at the Department of Computer Science and Engineering (DISI) of the University of Bologna conducts research on methods and algorithms for mining knowledge from large databases and data sets, with a focus on distributed and parallel computing environments and on streaming data. The group's current interests include mining data in massive self-administered networks, such as peer-to-peer systems; in pervasive and ubiquitous computing environments, such as intelligent sensor networks and networks of portable devices; and on multi-core parallel systems.

The Support Vector Machine (SVM) is a classification model, learned from already classified examples, that predicts the class of a new observation. Two challenging problems in SVM research concern the approximate computation of an SVM from examples that are distributed among several data sources or that arrive through an unbounded stream. The equivalence between the Minimum Enclosing Ball (MEB) problem and the L2-SVM learning problem, together with fast algorithms for computing MEBs by Core Vector Machines, provides tools to solve the SVM learning problem efficiently and accurately in both the distributed and the streaming setting. Moreover, Core Vector Machines can be replaced by a new instance of the Frank-Wolfe optimization method for the MEB problem, which exhibits better scalability.
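To make the MEB computation concrete, the following minimal Python sketch approximates the minimum enclosing ball of a small point set with a Frank-Wolfe-style iteration. It uses the classic farthest-point update with step size 1/(t+1), which is only one simple instance of the idea and not necessarily the variant developed by the group. The function name `approx_meb` and the `iterations` parameter are hypothetical names chosen for illustration.

```python
import math

def approx_meb(points, iterations=1000):
    """Approximate the minimum enclosing ball of `points`.

    Illustrative sketch only: a farthest-point (Frank-Wolfe-style) update,
    assumed here for demonstration, not the group's published algorithm.
    Returns (center, radius) of a ball that encloses every point.
    """
    center = list(points[0])
    for t in range(1, iterations + 1):
        # Linear minimization step: pick the point farthest from the center.
        far = max(points, key=lambda p: math.dist(p, center))
        # Move the center toward that point with step size 1/(t+1).
        step = 1.0 / (t + 1)
        center = [c + step * (f - c) for c, f in zip(center, far)]
    # The reported radius is the largest distance from the final center,
    # so the returned ball always contains all points.
    radius = max(math.dist(p, center) for p in points)
    return center, radius

if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
    center, radius = approx_meb(pts)
    print(center, radius)
```

Each iteration touches only the current farthest point, which is why iterations of this kind are attractive when the examples are too numerous to process at once.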

Outlier detection methods find examples that are considerably dissimilar to, exceptional among, or inconsistent with the remaining data, and have practical applications in fraud and network intrusion detection, data cleaning, medical diagnosis, and marketing segmentation. When the dissimilarity of an example is defined as the cumulative distance to its neighbours, the Solving Set algorithm efficiently extracts a subset of the data that determines the distances sufficient to solve the problem. A distributed solving set can be extracted exactly and efficiently. On a Graphics Processing Unit (GPU), parallel extraction of the solving set achieves a significant speedup over the CPU Solving Set algorithm and a large speedup over GPU and CPU brute-force detection algorithms.
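The distance-based outlier definition can be illustrated with a short Python sketch. Note that this is the brute-force baseline, not the Solving Set algorithm: it scores every example by the sum of distances to its k nearest neighbours (the choice of k nearest neighbours is an assumption made here) and returns the n highest-scoring examples. The names `knn_weight` and `top_outliers` are hypothetical.

```python
import math

def knn_weight(points, i, k):
    """Sum of distances from points[i] to its k nearest neighbours."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return sum(dists[:k])

def top_outliers(points, k, n):
    """Indices of the n examples with the largest k-NN weight.

    Brute force: computes all pairwise distances, so the cost is
    quadratic in the number of examples.
    """
    scores = [(knn_weight(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    print(top_outliers(data, k=2, n=1))
```

The quadratic cost of this baseline is what a method that restricts distance computations to a small subset of the data aims to avoid.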

Data clustering is the grouping of a set of objects into classes, also termed clusters, so as to maximize homogeneity within classes and separation between classes. If homogeneity is based on a density criterion, every site owning a portion of a horizontally partitioned data set can estimate the density function of its local data and send a set of regular samples of that function to a central site. There, the global density function can be recovered by function reconstruction from sampling series. The approach is also feasible in an agent-based environment, provided that appropriate countermeasures are taken to protect sensitive data.
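A one-dimensional Python sketch of this scheme follows, under several assumptions not stated in the text: a Gaussian kernel with bandwidth `h`, an unnormalized density estimate, a grid shared by all sites, and a truncated sinc series for the reconstruction. All function names are hypothetical. Each site samples its local estimate on the grid, the central site adds the samples, and the global density is then evaluated anywhere from those samples alone.

```python
import math

def local_density_samples(data, grid, h):
    """Unnormalized Gaussian kernel density of one site's data, on the grid."""
    return [
        sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) for x in grid
    ]

def global_samples(per_site_samples):
    """Central site: the global estimate is the sum of the local ones."""
    return [sum(values) for values in zip(*per_site_samples)]

def reconstruct(x, grid, samples):
    """Evaluate the density at x from regular samples via a sinc series."""
    tau = grid[1] - grid[0]  # sampling period of the regular grid

    def sinc(u):
        return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

    return sum(s * sinc((x - g) / tau) for g, s in zip(grid, samples))

if __name__ == "__main__":
    grid = [i * 0.5 for i in range(-10, 31)]  # regular grid on [-5, 15]
    site_a = [0.0, 0.2, 0.1]                  # data held by the first site
    site_b = [10.0, 9.8]                      # data held by the second site
    sent = [local_density_samples(s, grid, 1.0) for s in (site_a, site_b)]
    merged = global_samples(sent)
    print(reconstruct(0.25, grid, merged))
```

Only the grid samples leave each site, never the raw data points, which is the property the distributed approach relies on.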

The emergence of self-administered, potentially massive, data-centric networks, such as peer-to-peer systems and sensor networks, poses many new research challenges for both multidimensional data management and data mining in such networks, including resilience to node failures and data processing that minimizes energy consumption and load. For multidimensional query answering, a novel wireless grid-based infrastructure has been designed. It avoids broadcasting, does not suffer from dead-end problems, and can support fault tolerance by guaranteeing at least two independent paths between a node and its ancestors. The infrastructure can also support a density-based data clustering method for sensor networks that directly exploits its data partitioning.

July, 2013