CINECA IRIS Institutional Research Information System

Hide and Mine in Strings: Hardness, Algorithms, and Experiments

Bernardini, Giulia;Conte, Alessio;Gourdel, Garance;Grossi, Roberto;Loukides, Grigorios;Pisanti, Nadia;Pissis, Solon;Punzi, Giulia;Stougie, Leen;Sweering, Michelle

2023-01-01

Abstract

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.

Year: 2023
DOI: https://dx.doi.org/10.1109/TKDE.2022.3158063
Appears in types: 1.1 Journal article

Files in this item:

File	Size	Format
Hide_and_Mine_in_Strings_Hardness_Algorithms_and_Experiments.pdf (open access; type: post-print document; license: all rights reserved)	2.48 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1132936
