Automatically classifying source code using tree-based approaches

Phan, A.V. and Chau, P.N. and Nguyen, M.L. and Bui, L.T. (2018) Automatically classifying source code using tree-based approaches. Data and Knowledge Engineering, 114. pp. 12-25. ISSN 0169-023X

Official URL: https://www.scopus.com/inward/record.uri?eid=2-s2....

Abstract

Analyzing source code to solve software engineering problems such as fault prediction and cost and effort estimation has long attracted attention from both researchers and companies. Traditional approaches combine machine learning with software metrics obtained by computing standard measures of software projects. However, these methods face many challenges because software metrics alone are not enough to capture the complexity of programs. To overcome these limitations, this paper addresses software engineering problems by exploiting information from programs' abstract syntax trees (ASTs) instead of software metrics. We propose two models that combine a tree-based convolutional neural network (TBCNN) with k-nearest neighbors (kNN) and with support vector machines (SVMs), respectively, to exploit both the structural and the semantic information in ASTs. In addition, to deal with the high dimensionality of AST data, we present several tree pruning techniques that not only reduce the complexity of the data but also improve classifier performance in terms of computational time and accuracy. We survey many machine learning algorithms on different types of program representations, including software metrics, sequences, and tree structures. The approaches are evaluated by classifying 52,000 programs written in C into 104 target labels. The experiments show that the tree-based classifiers substantially outperform the metrics-based and sequence-based ones, and that the two proposed models, TBCNN + SVM and TBCNN + kNN, rank first and second, respectively. Pruning redundant AST branches not only substantially reduces execution time but also increases accuracy. © 2017 Elsevier B.V.

Item Type:	Article
Divisions:	Faculties > Faculty of Information Technology
Identification Number:	10.1016/j.datak.2017.07.003
Uncontrolled Keywords:	Clustering algorithms; Codes (symbols); Complex networks; Convolution; Convolutional neural networks; Cost engineering; Learning algorithms; Learning systems; Motion compensation; Nearest neighbor search; Object oriented programming; Semantics; Software engineering; Support vector machines; Syntactics; Trees (mathematics); High dimensional data; K nearest neighbor (KNN); Performance of classifier; Program representations; Support vector machine (SVMs); Syntax tree; Traditional approaches; Tree-based; C (programming language)
Additional Information:	Language of original document: English.
URI:	http://eprints.lqdtu.edu.vn/id/eprint/9588
