Description: Design and Evaluation of an SVM Framework for Scientific Data Applications

This title appears in the Scientific Report: 2015

Personal Name(s):	Glock, Philipp (Corresponding author)
Contributing Institute:	Jülich Supercomputing Center; JSC
Imprint:	2015
Physical Description:	ix, 58 p.
Dissertation Note:	Maastricht University, Master's thesis, 2015
Document Type:	Master Thesis
Research Program:	Data-Intensive Science and Federated Computing
Subject (ZB):	Master's thesis
Link:	OpenAccess
	JuSER Publications Portal

Please use the identifier: http://hdl.handle.net/2128/9412 in citations.

Support vector machines (SVMs) are a popular classification method owing to their good accuracy and broad applicability in scientific applications. Their computational complexity lies between O(n^2) and O(n^3), where n is the number of training samples, so scalability to larger data sets is a problem for SVMs. With the increasing number of large data problems, this disadvantage becomes more and more significant. To overcome these scalability issues, this thesis designs and implements a parallel and scalable framework that realizes the cascade SVM approach, including specific improvements. A fundamental speed-up and increased scalability are gained by splitting the data set into several subsets that can be worked on in parallel. The framework is designed to run in modern High Performance Computing (HPC) environments that provide the necessary massively parallel resources (e.g. large clusters with good node interconnects) to solve large data problems. However, the framework also works on a simple computer for smaller problems if needed. To keep the interface usable for domain scientists who are not technically savvy, Python is used.

The standard cascade SVM approach is improved with a standardized file format and parallel I/O, both of which improve I/O performance; besides computing, I/O is often observed to be a bottleneck for large problems. To enable a greater training speed-up as well as better accuracy, further improvements such as distance filters and cross-feedback options are realized and evaluated. The resulting improved cascade SVM approach and the parallel and scalable framework design are then evaluated on a real-world remote sensing data set and compared to another parallel implementation called pi-SVM. The parallelization strategies of these two implementations differ: the cascade SVM is a data processing approach, whereas pi-SVM follows a primarily algorithm-driven approach.
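
The cascade idea described in the abstract (split the training set into subsets, train on each subset in parallel, and pass only the surviving support vectors to the next layer) can be sketched in a few lines of Python. This is a minimal illustration, not the thesis framework: the function names `train_1d_svm` and `cascade_svm` are hypothetical, the toy trainer is an exact hard-margin solver that only works for linearly separable one-dimensional data and stands in for a real SVM solver, and a thread pool stands in for the distributed workers of an HPC system. The distance filters and cross-feedback options mentioned in the abstract are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def train_1d_svm(samples):
    """Toy stand-in for an SVM solver (assumption, not the thesis code).

    samples: list of (x, label) pairs with label in {-1, +1}, assumed to be
    linearly separable in one dimension.
    Returns (support_vectors, threshold). If only one class is present,
    nothing can be trained, so all samples are passed on and threshold is None.
    """
    neg = [x for x, y in samples if y == -1]
    pos = [x for x, y in samples if y == +1]
    if not neg or not pos:
        return list(samples), None
    # For separable 1-D data the closest opposing pair defines the
    # maximum-margin boundary, which lies at the midpoint of that pair.
    a, b = min(((n, p) for n in neg for p in pos),
               key=lambda pair: abs(pair[0] - pair[1]))
    return [(a, -1), (b, +1)], (a + b) / 2.0


def cascade_svm(samples, n_parts=4, workers=4):
    """Cascade: train on subsets, keep support vectors, merge pairwise, repeat."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Interleaved split so every subset sees a mix of the data.
    subsets = [samples[i::n_parts] for i in range(n_parts)]
    # Threads only illustrate that the per-subset tasks are independent;
    # a real framework would spread them over processes or nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            results = list(pool.map(train_1d_svm, subsets))
            if len(results) == 1:
                return results[0]
            survivors = [svs for svs, _ in results]
            # Merge the surviving support vectors of neighbouring subsets.
            subsets = [
                survivors[i] + (survivors[i + 1] if i + 1 < len(survivors) else [])
                for i in range(0, len(survivors), 2)
            ]


# Hypothetical toy data: class -1 on the left, class +1 on the right.
samples = ([(x, -1) for x in (0.0, 1.0, 2.5, 3.0)]
           + [(x, +1) for x in (5.0, 6.0, 7.5, 9.0)])
support_vectors, threshold = cascade_svm(samples, n_parts=4)
print(support_vectors, threshold)
```

In this toy setting the cascade returns the same boundary as training on all eight samples at once, while each first-layer task sees only two samples. Under the quadratic cost bound, k subsets of n/k samples each cost roughly k(n/k)^2 = n^2/k in the first layer, which is where the speed-up comes from; later layers add cost that this sketch does not analyse.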