Construction auditing risk detection using machine learning approaches
The 22nd Science and Technology Conference, University of Transport and Communications

Cao Phuong Thao1*
1 University of Transport and Communications, No. 3 Cau Giay, Hanoi
* Corresponding author. Email: thaocp@utc.edu.vn

Abstract. The audit report plays a key role in determining the validity of the final accounts on completion of any construction project. However, the quality of a report depends heavily on the auditors themselves, whose differing skill sets and levels of bias can lead to different assessments of the accounting risk level. This paper presents a method that automatically detects auditing risk using machine learning. The criteria used to assess auditing risk serve as inputs to the machine learning algorithms, and the output is a ranking of the auditing risk as low, medium or high. The two proposed machine learning methods were tested on 80 construction projects in Vietnam, and the results show a high level of accuracy in auditing risk detection.

Keywords: auditing, audit risk detection, neural network, random forest, machine learning.

1. INTRODUCTION

The purpose of an audit is to examine and verify the truthfulness of the financial statements provided by the accountant, thereby providing the most accurate information about the financial situation of the organization. The final product of an audit is a report expressing the auditor's opinion on the truthfulness and fairness of the financial statements produced by the accountants. To do this, the auditor surveys the company or project management unit to see whether the internal control system follows the process properly. The assessment of this audit risk depends heavily on the subjective opinion of the auditor. Therefore, an automated risk assessment system based on objective criteria would allow risk to be assessed more quickly and accurately.
Recently, artificial intelligence has been applied in many fields, such as financial services, image processing, medicine, natural language processing, text mining and many others [1, 2, 3]. In [1], Bahrammirzaee (2010) reviewed three artificial intelligence methods applied to financial markets. In another study, the same author proposed a hybrid intelligent system for credit ranking using reasoning-transformational models [2]. In this method, the expert system serves as the symbolic module and the artificial neural network as the non-symbolic module. In [3], Khashman (2010) proposed a method that uses a neural network with the back-propagation learning algorithm to evaluate credit risk. Information technology has also been employed in auditing, and the systems have been categorized into five groups according to audit area: data extraction and analysis, fraud detection, internal control evaluation, electronic commerce control, and continuous monitoring [4, 5]. In [6], a neural network was proposed to classify the credit risk of a bank's consumers into good and bad groups. Recently, Cao et al. [7] presented a neural network method that detects audit risk at three levels: low, medium and high.

Although many research works apply artificial intelligence in auditing and finance, these methods focus on evaluating audit risk in companies, where statistical and historical audit data are available. In this paper, we present a method of audit risk detection for construction projects using two machine learning methods, a neural network and a random forest. Audits of construction projects differ from other audits in that construction projects usually have a short execution time and the audit is carried out when the project is finished.
The criteria used to assess audit risk provide the inputs for a multi-layer perceptron neural network and for a random forest, and the output is one of three levels of audit risk: low, medium or high. We test the method using data from 80 construction projects in Vietnam. The experimental results show the efficiency of the method. We first describe the audit process and how the neural network and random forest are applied to audit risk detection. Real data are then used to illustrate the performance of the method. Finally, we draw conclusions from the study and discuss implications for future work.

2. MAIN CONTENT

2.1. Auditing risk assessment

Risk is a problem that arises in all fields, and each field has to develop its own ways of assessing and handling it. In a construction project, auditing risk is associated with material errors in the final project settlement report. The final project settlement report is very important because its job is to assess how comprehensive and relevant the samples selected by the auditor are, how convincing the evidence collected by the auditor is, whether the project complies with the law, and at what point, if any, the project is not in continuous operation. To control this risk, the audit plan must be established appropriately so that it detects fraud, risks and potential problems and also ensures that the audit is completed on time. Moreover, the auditor must consider and assess all kinds of risk in a construction project, including potential (inherent) risk, control risk and detection risk, based on a sample chosen with adequate confidence. The audit risk assessment process is shown in Figure 1. This paper focuses on control risk, which is the possibility that errors occur that have not been prevented, detected or corrected by internal audit.

Fig. 1. Audit risk detection process

Based on the description of the internal control system, the auditor assesses whether the system is effective, what risks could occur, and how the enterprise could overcome the risk at sensitive points. To assess the internal control system, several factors need to be considered: the model, operating framework and ability of the project management unit; financial management and accountancy; work related to policy changes; findings from previous audits; errors in planning strategies; and weaknesses in management that lead to inadequate investment, slow progress, outstanding investment cost, failure to meet objectives, and environmental impact caused by the project.

An assessment of control risk checks the information on the internal control system of the project management unit, such as the organizational chart, staff levels, internal management documents and internal audit work. The auditor also needs to observe the activity of the unit and talk with managers and employees to understand the organizational characteristics, personnel policies, and qualifications of the managers and employees. To do this, the auditor designs a survey form with 46 criteria, divided into five groups: Ability and quality of the project management unit (PMU) director, Risk management process, Information/report, Control operations, and Evaluation and Monitor (E & M). The criteria are described in detail in [7]. The criteria are quantified by coding them in the range from 0 to 1. These values are the inputs of the neural network. The outputs of the neural network are the risks, measured on three levels of audit risk: low, medium and high.

2.2. Neural network

An Artificial Neural Network (ANN) is a computational model that simulates biological neurons and their functions in the brain. Typically, an ANN has layers of interconnected nodes. The nodes and their interconnections are similar to the network of neurons in the brain.
Any basic ANN has multiple layers of nodes, specific connection patterns and links between the layers, connection weights, and activation functions for the nodes that convert weighted inputs to outputs. The learning process for the network typically involves a cost function, and the objective is to optimize (typically minimize) it. The weights are updated repeatedly during learning.

We treat audit risk detection as a classification problem. In this paper, we use a three-layer multilayer perceptron (MLP) neural network with multiple outputs. Although this classifier requires a relatively long training time, it processes and classifies data quickly once trained [8]. Figure 2 presents an example of an MLP structure consisting of one input layer, one hidden layer and one output layer [9].

Fig. 2. Two-layer feed-forward neural network

Let $x_i$, $i = 1, \dots, d$, be the input values to the network. The first layer forms $M$ linear combinations of these inputs to give the activations $a_j^{(1)}$:

$$a_j^{(1)} = \sum_{i=1}^{d} w_{ji}^{(1)} x_i + b_j^{(1)}, \quad j = 1, \dots, M \tag{1}$$

where $w_{ji}^{(1)}$ are the elements of the first-layer weight matrix and $b_j^{(1)}$ are the bias parameters associated with the hidden units. Each variable $a_j^{(1)}$ is associated with one hidden unit and is transformed by the non-linear activation function of the hidden layer. The outputs of the hidden units are then given by

$$z_j = \tanh\left(a_j^{(1)}\right), \quad j = 1, \dots, M \tag{2}$$

The $z_j$ are then combined with the weights and biases of the next layer to produce the values $a_k^{(2)}$:

$$a_k^{(2)} = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + b_k^{(2)}, \quad k = 1, \dots, c \tag{3}$$

where $c$ is the number of outputs. These values are then passed through the output-layer activation function to produce the output values $y_k$, $k = 1, \dots, c$. There are several forms of activation function.
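As a rough illustration of the hidden-layer and output-layer computations described above, the following pure-Python sketch evaluates the linear combinations, the tanh hidden outputs and the output pre-activations for a single input vector. The function name `forward`, the layer sizes and the random weights are assumptions made for this example; they are not taken from the paper.

```python
import math
import random

def forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP up to the output pre-activations."""
    # a_j^(1) = sum_i w_ji^(1) x_i + b_j^(1): M linear combinations of the inputs
    a1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    # z_j = tanh(a_j^(1)): hidden-unit outputs
    z = [math.tanh(a) for a in a1]
    # a_k^(2) = sum_j w_kj^(2) z_j + b_k^(2): c output pre-activations
    a2 = [sum(w * zj for w, zj in zip(row, z)) + b for row, b in zip(W2, b2)]
    return a1, z, a2

# Hypothetical sizes and weights, chosen only for illustration.
random.seed(0)
d, M, c = 46, 5, 3
x = [random.random() for _ in range(d)]  # criteria coded in [0, 1]
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(M)]
b1 = [0.0] * M
W2 = [[random.gauss(0, 1) for _ in range(M)] for _ in range(c)]
b2 = [0.0] * c
a1, z, a2 = forward(x, W1, b1, W2, b2)
print(len(z), len(a2))  # 5 3
```

Keeping the pre-activations separate from the output activation makes it easy to swap in a different output function without touching the rest of the pass.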
For classification, we consider the logistic sigmoid activation function:

$$y_k = \frac{1}{1 + \exp\left(-a_k^{(2)}\right)} \tag{4}$$

The network needs to be trained on the data in order to make the best predictions for new input data. In this paper we consider the back-propagation algorithm [8]. Assume we have a target vector $t$ for input data $x$. The error of the network, $E$, is defined as:

$$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \left(y_k^n - t_k^n\right)^2 \tag{5}$$

where $y_k^n$ is the actual value of the $k$-th output unit for the $n$-th input pattern and $t_k^n$ is the desired value of the $k$-th output unit for the $n$-th input pattern. The derivatives of $E$ with respect to the second-layer weights are given in detail in [7]. The difference between the calculated output and the desired output is back-propagated to the previous layers, usually modified by the derivative of the activation function, and the connection weights are normally adjusted using the Delta Rule. This process continues through the previous layers until the input layer is reached.

2.3. Random forest

The random forest classifier is a set of decision trees, each built from a randomly selected subset of the training set. It aggregates the votes of the different decision trees to decide the final class of a test object. Figure 3 shows a diagram of the random forest.

Fig. 3. Diagram of the random forest

Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction. The random forest process has two phases, training and testing. During the training phase, each tree of the forest is built randomly using bagging. The bagging technique builds many bootstrap samples $L_B$, which are replicates of the initial learning set $L$ drawn with replacement, so each pair $(X_k, Y_k)$, $k = 1, 2, \dots, n$, may appear several times in a bootstrap sample $L_B$.
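The two ingredients just described, drawing bootstrap samples with replacement and aggregating tree votes, can be sketched with the Python standard library alone. The helper names `bootstrap_sample` and `majority_vote` and the tiny learning set are hypothetical; tree construction itself is not shown.

```python
import random
from collections import Counter

def bootstrap_sample(learning_set, rng):
    """Draw len(learning_set) pairs from the learning set with replacement."""
    n = len(learning_set)
    return [learning_set[rng.randrange(n)] for _ in range(n)]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical learning set L of (X_k, Y_k) pairs; the values are made up.
L = [
    ([0.2, 0.9], "low"),
    ([0.8, 0.1], "high"),
    ([0.5, 0.5], "medium"),
    ([0.7, 0.3], "high"),
]
rng = random.Random(0)
samples = [bootstrap_sample(L, rng) for _ in range(3)]  # one bootstrap sample per tree
print(all(len(s) == len(L) for s in samples))           # True
print(majority_vote(["high", "medium", "high"]))        # high
```

Because sampling is with replacement, a given pair can occur several times in one bootstrap sample while others are left out entirely.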
To build a decision tree of the random forest, m features (variables) are chosen at random from the n features of each bootstrap sample $L_B$ (m < n). The random forest algorithm then chooses the best split variable among the m selected features together with a split function, and splits the node into two child nodes [10]. Each internal node $t$ has a binary split $s_t$, which applies a variable $X_k$ to the incoming data and divides it into two subsets corresponding to the two child trees $t_L$ and $t_R$. To make the best split, the algorithm chooses the split $s_t$ that maximizes the impurity decrease [11]:

$$\Delta i(s_t, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R) \tag{6}$$

where $i(\cdot)$ is an impurity measure such as the Gini index or Shannon entropy; $N_t$, $N_{t_L}$ and $N_{t_R}$ are the numbers of samples at node $t$, at its left child and at its right child, respectively; $p_L = N_{t_L} / N_t$ and $p_R = N_{t_R} / N_t$.

3. RESULTS

The data used in this paper were collected from 80 construction projects in Vietnam. The survey form for the internal audit includes 46 criteria, grouped as described in Table 1. From the data collected, we quantify these criteria to form a matrix of scores from 0 to 1. In this data set, each row represents one project and each column represents one criterion. These criteria are fed to the input of the neural network. The neural network is designed with three layers: two hidden layers and an output layer. The number of nodes in each layer was selected by experiment. We used twelve nodes for the first hidden layer, eight nodes for the second hidden layer and three nodes for the output layer, corresponding to the three levels of audit risk: low, medium and high. The activation function for each hidden layer was the rectified linear unit (ReLU), and the sigmoid function was used for the output layer.

Table 1. List of criteria for internal control
1. Ability and quality of the project management unit (PMU) director: 28 criteria
2. Risk management process: 2 criteria
3. Information, report: 6 criteria
4. Control operations: 4 criteria
5. Evaluation and Monitor (E & M): 6 criteria

The data set was divided into two parts, 70% for training and 30% for testing, for both machine learning methods. The programs were written in Python using Keras with a GPU-enabled TensorFlow backend and were run on a computer with a Core i7 processor and 8 GB of RAM. The training result of the neural network is shown in Figure 4. The values of the loss function on the testing set and the training set are equal at the starting point. The loss on both sets tends to converge (it declines, and the rate of decline slows). However, as the number of iterations increases, the loss on the training set becomes smaller than the loss on the testing set, because the testing set contains less data than the training set. The accuracy, in contrast, hardly differs between the two sets, because the data set is not large enough to cover all cases. Figure 4 shows that the accuracy on the testing set is more than 90%, while the accuracy on the training set is about 97%. The accuracy on the testing set is about 2% higher than on the training set because the testing set contains less data than the training set. Therefore, the accuracy of this model is about 95% to 96%.

Fig. 4. Performance of training and testing process

Fig. 5. Confusion matrices of the two machine learning methods (left: neural network, right: random forest)

Prediction accuracy is evaluated on the testing set. We evaluate the accuracy of the methods using the ground-truth notion of positive and negative detection. The confusion matrices for the two methods, neural network and random forest, are shown in Figure 5. The accuracy of a method is calculated as the percentage of correctly classified samples out of the total number of samples.
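The accuracy measure described above, the share of correctly classified samples, can be read directly off a multi-class confusion matrix as the sum of the diagonal divided by the sum of all entries. The sketch below assumes rows are true classes and columns are predicted classes; the function name `accuracy` and the 3x3 matrix are made up for illustration and are not the paper's results.

```python
def accuracy(confusion):
    """Fraction of correctly classified samples in a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j,
    so the correct predictions lie on the diagonal.
    """
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Made-up 3x3 matrix (rows and columns ordered low, medium, high).
example = [
    [3, 1, 0],
    [0, 5, 1],
    [0, 1, 9],
]
print(accuracy(example))  # 0.85
```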
In terms of the entries of the confusion matrix,

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. Based on the confusion matrix of the neural network, 17 high-risk samples, 9 medium-risk samples and 4 low-risk samples were classified correctly. Similarly, for the random forest, 12 high-risk samples, 10 medium-risk samples and no low-risk samples were classified correctly. The overall accuracy is 94% for the neural network and 60% for the random forest.

4. CONCLUSION

This paper proposed two machine learning methods to detect audit risk in construction projects. Once the survey criteria are quantified, they can be used as inputs to the neural network and the random forest to train a model that can detect the audit risk of any new project. The experimental results show the efficiency of the neural network method. This method can be built into an information system to detect audit risk quickly and to reduce the workload of auditors. It can also serve as a framework for identifying risk in construction projects in a comprehensive manner.

REFERENCES

[1]. Bahrammirzaee A., A Comparative Survey of Artificial Intelligence Applications in Finance: Artificial Neural Networks, Expert System and Hybrid Intelligent Systems. Neural Computing & Applications, Vol. 19 No. 8, pp. 1165-1195, 2010.
[2]. Bahrammirzaee A., Ghatari A., Ahmadi P., and Madani K., Hybrid Credit Ranking Intelligent System Using Expert System and Artificial Neural Networks. Applied Intelligence, Vol. 34 No. 1, pp. 28-46, 2011.
[3]. Khashman A., Neural Networks for Credit Risk Evaluation: Investigation of Different Neural Models and Learning Schemes. Expert Systems with Applications, Vol. 37 No. 9, pp. 6233-6239, 2010.
[4]. Glover S. M., and Romney M. B., The Next Generation. Internal Auditor, 55 (August), pp. 47-53, 1998.
[5]. Koskivaara E., Artificial Neural Networks in Auditing: State of the Art. TUCS Technical Report No. 509, 2003.
[6]. Al-Shayea Q. K., and El-Refae G. A., Evaluating Credit Risk Using Artificial Neural Networks. Global Engineers & Technologist Review, Vol. 1 No. 1, 2011.
[7]. Cao P. T., Nguyen H. T., and Nguyen T. H., Construction Auditing Risk Detection Using Neural Network. Science, Engineering & Education, Vol. 4 No. 1, pp. 39-44, 2019.
[8]. Ripley B. D., Pattern Recognition and Neural Networks. Cambridge University Press, UK, 1996.
[9]. Nabney I., Netlab: Algorithms for Pattern Recognition. Advances in Pattern Recognition, Springer, 2004.
[10]. Hastie T., Tibshirani R., and Friedman J., The Elements of Statistical Learning. Springer, doi:10.1007/978-0-387-21606-5_7, 2001.
[11]. Criminisi A., Shotton J., and Konukoglu E., Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Foundations and Trends in Computer Graphics and Vision, Vol. 7, pp. 81-227, 2012.