Description: DISCONA: Distributed Sample Compression for Nearest Neighbor Algorithm

This title appears in the Scientific Report: 2023

Personal Name(s):	Rybicki, Jedrzej (Corresponding author)
	Frenklach, Tatiana / Puzis, Rami
Contributing Institute:	Jülich Supercomputing Center; JSC
Published in:	Applied intelligence, 53 (2023) 7, S. 14
Imprint:	Dordrecht [u.a.] Springer Science + Business Media B.V. 2023
DOI:	10.1007/s10489-023-04482-y
Document Type:	Journal Article
Research Program:	Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups
Link:	OpenAccess
	Publikationsportal JuSER

Please use the identifier: http://dx.doi.org/10.1007/s10489-023-04482-y in citations.
Please use the identifier: http://hdl.handle.net/2128/34242 in citations.

Sample compression using epsilon nets effectively reduces the number of labeled instances required for accurate classification with nearest neighbor algorithms. However, one-shot construction of an epsilon net can be extremely challenging in large-scale distributed data sets. We explore two approaches for distributed sample compression: one where a local epsilon net is constructed for each data partition and the local nets are then merged during an aggregation phase, and one where a single backbone of an epsilon net is constructed from one partition and then aggregates target label distributions from the other partitions. Both approaches are applied to the problem of malware detection in a complex, real-world data set of Android apps using the nearest neighbor algorithm. Examination of the compression rate, computational efficiency, and predictive power shows that a single backbone of an epsilon net attains favorable performance while achieving a compression rate of 99%.
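
The two approaches described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy net construction, the Euclidean metric, the majority-vote prediction, and all function names (build_net, aggregate, merge_local_nets, backbone_net, predict) are assumptions made for illustration only. Each partition is assumed to be a pair of (points, labels).

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def build_net(points, eps):
    """Greedy epsilon net (assumed construction): keep a point only if it
    lies farther than eps from every center kept so far."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def nearest(centers, x):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: dist(x, centers[j]))


def aggregate(centers, points, labels, counts=None):
    """Add the label of each point to the label counts of its nearest center."""
    if counts is None:
        counts = [Counter() for _ in centers]
    for p, y in zip(points, labels):
        counts[nearest(centers, p)][y] += 1
    return counts


def merge_local_nets(partitions, eps):
    """Approach 1 (sketch): one local net per partition, merged afterwards."""
    all_centers, all_counts = [], []
    for points, labels in partitions:
        local = build_net(points, eps)
        all_centers += local
        all_counts += aggregate(local, points, labels)
    # Aggregation phase: thin the union of local centers into a single net
    # and fold each local label distribution into its nearest merged center.
    merged = build_net(all_centers, eps)
    merged_counts = [Counter() for _ in merged]
    for c, cnt in zip(all_centers, all_counts):
        merged_counts[nearest(merged, c)] += cnt
    return merged, merged_counts


def backbone_net(partitions, eps):
    """Approach 2 (sketch): build the net from the first partition only;
    every partition then contributes label counts to that backbone."""
    centers = build_net(partitions[0][0], eps)
    counts = None
    for points, labels in partitions:
        counts = aggregate(centers, points, labels, counts)
    return centers, counts


def predict(centers, counts, x):
    """1-nearest-neighbor prediction on the compressed set (majority label)."""
    return counts[nearest(centers, x)].most_common(1)[0][0]


# Toy data (hypothetical): two partitions, labels 0 and 1.
partitions = [
    ([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], [0, 0, 1, 1]),
    ([(0.2, 0.1), (5.2, 4.9), (0.0, 0.2)], [0, 1, 0]),
]
centers, counts = backbone_net(partitions, eps=1.0)
label = predict(centers, counts, (0.3, 0.3))
```

In this sketch the backbone variant never ships points between partitions after the net is built; each partition only needs the centers and returns label counts, which is one plausible reading of why it is cheaper than merging local nets.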