Groundwater Arsenic and Cancer Risk Assessment Prediction Model via Machine Learning: A Step Towards Modernizing Academic Research

Ground water contamination with arsenic (As) is one of the foremost issues in South Asian countries, where ground water is a principal source of drinking water. In Asian countries, and especially in the rural areas of Pakistan, people consume ground water for drinking because treated water is not accessible to them. This arsenic-contaminated water is hazardous to human health. The purpose of this study is to examine the rising level of arsenic in the ground water of Khairpur, Sindh, Pakistan over the coming years, which is also gradually increasing the rate of cancer (skin cancer, blood cancer) in the population. To predict the arsenic level and the cancer risk for the next five years, we developed two models in Microsoft Azure Machine Learning using the following algorithms: Support Vector Machine (SVM), Linear Regression (LR), Bayesian Linear Regression (BLR), Boosted Decision Tree (BDT), exponential smoothing (ETS) and Autoregressive Integrated Moving Average (ARIMA). The resulting predictive model, named the Arsenic Contamination and Cancer Risk Assessment Prediction Model (ACCRAP model), helps to forecast arsenic contamination levels and the cancer rate. The results demonstrate that BLR achieves the highest prediction accuracy for the cancer rate among the four deployed machine learning algorithms.


I. INTRODUCTION
Ground water is one of the major sources of drinking water in the world, and in Pakistan in particular. Weather and environmental conditions change continually, which influences ground water and, indirectly, human health. Many elements are present in ground water, such as arsenic (As), fluoride (F) and zinc (Zn). Environmental changes over time produce an increasing trend in the levels of these elements, which ultimately leads to various diseases in the human body [1]. Arsenic is one of the toxic components reported in ground water in over seventy countries, posing health hazards to about one hundred fifty million people worldwide [2]. Asia is among the regions most severely affected by arsenic contamination. Ground water contamination with arsenic has become a vital public health issue in Sindh, Pakistan [3]. The increasing level of arsenic in ground water is attributed to arsenic compounds carried from the Himalayas by the Indus River, which settle over the years through geothermal, hydrological and biochemical factors and become part of the ground water. Pakistan ranked 80th among 122 countries with regard to water quality [4]. This study predicts the level of arsenic in the coming years for district Khairpur, Sindh, Pakistan, and how the cancer rate rises as the arsenic level changes. In rural areas of Sindh, arsenic gradually accumulates in the bodies of people through the use of contaminated water and food, and its exposure manifests as skin, blood and scalp cancer [5]. Thus, we have developed the ACCRAP model using "Microsoft Azure Machine Learning Studio" [6]. The three main contributions of this study are:
i. To the best of our knowledge, this is the first attempt to forecast arsenic levels for the next five years with the ETS and ARIMA methods.
ii. For the first time, cancer rate prediction is performed with four different machine learning algorithms.
iii. The cancer rate is calculated both manually and automatically, and the two sets of results are compared.
The paper consists of six sections. Section II explains how this study modernizes academic research. Section III presents the methodology. Section IV focuses on the implementation of the ACCRAP model in "Microsoft Azure Machine Learning Studio". Section V discusses the results. The last section, Section VI, presents the conclusion and future work directions.

II. A STEP TOWARDS MODERNIZING ACADEMIC RESEARCH
The literature shows the wide applicability of machine learning techniques [7][8][9]. This study is a step towards modernizing academic research because it presents a mapping between water research and artificial intelligence techniques. To conduct this research, papers reporting 'As' levels in different areas of Sindh were studied first [2], [10][11][12][13]. Then the health effects of drinking 'As'-contaminated water were studied [14]. Further, the applicability of machine learning techniques to 'As' prediction was analyzed [8], [9], [15]. The dataset used in the experiments of this study was collected from published papers. In the experiments, two forecasting methods and four machine learning methods are deployed, which makes this study a preliminary step towards the modernization of water research.

A. Data Collection
For this study, the dataset was collected from [2], [16][17][18]. A group of researchers analyzed more than two hundred ground water samples from the two sides of the river Indus [2]. From that study, the arsenic levels of Khairpur district are taken for the year 2013. In another study, 180 ground water samples were collected from Khairpur Mir and Tharparkar, Sindh, Pakistan [17]; its authors also determined arsenic in the scalp hair of people of different ages. From that study, the arsenic levels of Khairpur district are taken for the year 2014. Researchers also evaluated arsenic values in the ground water of different communities in Khairpur district with the help of multivariate techniques [18]. We took the arsenic values of Khairpur district for the year 2016 from this study. The authors of [16] examined total arsenic, copper, iron, nickel, lead and zinc contents in two hundred forty-three hand-pump ground water samples and ninety tube-well ground water samples from Sobhodero, Sindh, Pakistan. They applied multivariate techniques and cluster analysis to the polluting elements. The arsenic concentration in most of the hand-pump and tube-well samples was found to be higher than the WHO permissible limit. The arsenic levels of Khairpur district for the year 2017 are taken from this study.

B. Statistical Data
Datasets may contain missing values for various reasons. One approach to this issue is to discard the observations that have missing values; however, this risks losing data points that carry valuable information. Imputing the missing values is a better approach: the missing values are inferred from the existing part of the data. From the literature review, we collected data for 2013, 2014, 2016 and 2017, as described in the "Data Collection" section. For the missing data of 2015, we used the mean/median of the non-missing values in the column, using Eq. (1).
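As an illustration of this imputation step, the following minimal Python sketch fills a missing yearly value with the mean (or, optionally, the median) of the non-missing values in the same column. The arsenic values are hypothetical placeholders rather than the values used in this study, and the helper name `impute_column` is introduced only for this example.

```python
from statistics import mean, median

def impute_column(column, statistic=mean):
    """Replace None entries with a statistic of the non-missing entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistic(observed)
    return [fill if v is None else v for v in column]

# Hypothetical arsenic levels for 2013-2017; the 2015 value is missing.
years = [2013, 2014, 2015, 2016, 2017]
arsenic = [10.0, 12.0, None, 16.0, 18.0]

filled_mean = impute_column(arsenic)            # mean of the four observed values
filled_median = impute_column(arsenic, median)  # median of the four observed values
```

Passing `median` instead of the default `mean` is useful when the observed values contain outliers.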

III. METHODOLOGY
The working mechanism of this study is shown in Fig. 1.

IV. IMPLEMENTATION
We used "Microsoft Azure Machine Learning Studio" to develop our predictive model. Two different models were developed: the first, for arsenic prediction, is named the 'Arsenic Prediction Model', and the second, for cancer rate prediction, is named the 'Cancer Prediction Model'. Fig. 2 presents the 'As' prediction model. The 'As' levels of different areas of Khairpur district are maintained in a CSV dataset file. The 'Select Columns in Dataset' module is a commonly used feature of Machine Learning Studio; with it, the two core columns 'Year' and 'Arsenic' are selected from the loaded dataset in order to apply two different forecasting methods, ETS and ARIMA. Both methods are applied to the arsenic dataset to forecast future values using the forecast library, and both are scripted in R. The R script is also used to average the forecast points and to plot the outcome as a graph. The pseudocode of the R script for the ETS method is presented in our published work [19]. The libraries used in the R script are forecast, zoo and ggplot2.
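The forecast-and-average step can be sketched as follows. The sketch is written in Python for illustration only and is not the R script of [19]. It makes several assumptions: Holt's linear-trend exponential smoothing with fixed smoothing parameters stands in for the ETS model, a random walk with drift (ARIMA(0,1,0) with a constant) stands in for the ARIMA model, and the input series is hypothetical. In the actual workflow, the forecast library selects and fits the models automatically.

```python
def holt_forecast(series, horizon, alpha=0.8, beta=0.2):
    """Holt's linear-trend exponential smoothing (additive trend, no seasonality).

    alpha and beta are fixed here as an assumption; they are not estimated.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

def drift_forecast(series, horizon):
    """Random walk with drift, i.e. ARIMA(0,1,0) with a constant term."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    drift = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + h * drift for h in range(1, horizon + 1)]

def average_forecast(series, horizon):
    """Average the two sets of forecast points, one value per future year."""
    ets_points = holt_forecast(series, horizon)
    arima_points = drift_forecast(series, horizon)
    return [(a + b) / 2 for a, b in zip(ets_points, arima_points)]

# Hypothetical yearly arsenic averages for 2013-2017.
p_as = [10.0, 12.0, 14.0, 16.0, 18.0]
f_as = average_forecast(p_as, horizon=5)  # forecast points for 2018-2022
```

For a perfectly linear input such as this one, both stand-in methods continue the line, so the averaged forecast also continues it.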

A. R-Scripting
By including R scripting in the model of Fig. 2, we can accomplish a variety of customized tasks that are not otherwise available in 'Microsoft Azure Machine Learning Studio' [20]. For example, we can create custom data transformations, use our own metrics for evaluating predictions, and build models using algorithms that are not implemented as standalone modules in Studio.
The predictive model used for the cancer rate, which builds on the arsenic prediction model, is shown in Fig. 3. Three columns, Year, P-arsenic and Cancer-value, are selected from the loaded dataset with the "Select Columns in Dataset" module. Then the 'Split Data' module is used to separate the rows of the loaded dataset into training and testing sets according to specified percentages. Then the BLR, BDT, LR and SVM algorithms are deployed, and the algorithm that gives the best accuracy in the Evaluate module is identified. The 'Cancer Rate' dataset for the model of Fig. 3 contains the arsenic and cancer values of the last five years; the cancer value is calculated manually using the formulas detailed in [2].
Table I presents the arsenic forecast points, obtained by applying the 'As' predictive model, in the columns "F-year" and "F-AS". F-AS specifies the average of the forecast points obtained via the ETS and ARIMA forecasting methods. The seasonality is also generated for the F-year column through ARIMA and ETS. Both methods are scripted in R. P-year (previous year) and P-As (previous arsenic value) show the previous average arsenic levels collected as described in the Data Collection section. With the help of ARIMA and ETS, the trend, observed, seasonality and randomness components of the arsenic results are derived. Both methods are well suited to forecasting because they detect these four components automatically from previous data. The increasing trend of arsenic from 2013 to 2022 is depicted in

B. Cancer Outcome Scrutiny
We calculated the cancer rate by two methods. First, we calculated the cancer rate manually; then we projected the cancer values of the subsequent five years by means of the model presented in Fig. 3. The results of both methods are presented in Table II, in which the first column shows the manually calculated cancer rate and the second column presents the predicted cancer rate. The manual cancer rate is obtained using the well-defined and widely adopted method described in the literature [2], [21], [22], which uses Eq. (2) to calculate the cancer rate from the arsenic value of one year. We forecast five years of arsenic values via model 1, illustrated in Fig. 1. The values obtained from that model are then used as input to Eq. (2), and the cancer rate is computed manually; it relies entirely on two factors, EDI and CSF, where EDI is the quantity of daily intake through water for the human body and CSF is the Cancer Slope Factor. According to studies, the arsenic level in drinking water is increasing day by day, which raises the Cancer Slope Factor (CSF). After the manual computation, we used the same forecasted arsenic values from the model shown in Fig. 2 to predict the cancer rate for the next five years with the model shown in Fig. 3; the results are shown in the "Predicated" column of Table II. The accuracy column reports how close, in percent, the predicted results are to the manual results.
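The manual computation can be illustrated with a short Python sketch. It assumes that Eq. (2) takes the commonly used form CR = EDI × CSF, with EDI = C × IR / BW, where C is the arsenic concentration, IR the daily water intake and BW the body weight. The intake of 2 L/day, the body weight of 70 kg, the slope factor of 1.5 (mg/kg/day)^-1 and the concentration of 35 µg/L are illustrative assumptions, not values taken from this study. The agreement measure is likewise an assumed definition of the accuracy column.

```python
# Assumed oral cancer slope factor for arsenic, in (mg/kg/day)^-1.
CSF_ARSENIC = 1.5

def estimated_daily_intake(conc_ug_per_l, intake_l_per_day=2.0, body_weight_kg=70.0):
    """EDI in mg/kg/day from an arsenic concentration given in ug/L.

    The default intake and body weight are illustrative assumptions.
    """
    conc_mg_per_l = conc_ug_per_l / 1000.0  # convert ug/L to mg/L
    return conc_mg_per_l * intake_l_per_day / body_weight_kg

def cancer_risk(conc_ug_per_l, csf=CSF_ARSENIC):
    """Cancer risk under the assumed form CR = EDI * CSF."""
    return estimated_daily_intake(conc_ug_per_l) * csf

def agreement_percent(manual, predicted):
    """Assumed accuracy measure: 100 * (1 - relative absolute difference)."""
    return 100.0 * (1.0 - abs(predicted - manual) / abs(manual))

# Hypothetical forecasted arsenic concentration of 35 ug/L.
risk = cancer_risk(35.0)
```

Applying `cancer_risk` to each forecasted arsenic value gives the manually computed column, and `agreement_percent` compares each manual value with the corresponding model prediction.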

Fig. 6: Accuracy of Three Splitting Ratio by Four ML-Algorithms for Khairpur
The cancer dataset is divided into training and testing parts. The training part is used to train the cancer model, and the testing part is used to validate the trained model on new predictions. We split the dataset and deployed the cancer predictive model on three different splitting ratios, as shown in Fig. 6:
• Sr-01: 30% testing and 70% training
• Sr-02: 25% testing and 75% training
• Sr-03: 20% testing and 80% training
It is apparent from Fig. 6 that using 70 percent of the data to train the machine learning models and the remaining 30 percent to validate the predictive model gives the highest accuracy among the three splitting ratios. We used three regression models and one binary classification model to predict the cancer rate. The accuracies of these four algorithms are reported in Table III. It is apparent from Table III that the three regression algorithms are considerably more accurate than the SVM method. The BLR algorithm achieves the highest prediction accuracy for the cancer rate. It is also clear from Fig. 6 that, for all three splitting ratios, the SVM method has lower accuracy than the three regression algorithms. Fig. 7 depicts the increasing trend of the cancer rate from 2013 to 2022. From the graph, it is evident that, owing to the gradual increase in arsenic contamination from 2018 to 2022 in Khairpur district, the cancer rate also shows an increasing trend. This graph is obtained after deploying the trained model shown in Fig. 4. Given the projected arsenic input, the trained model produces an average five-year cancer value at run time.
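The comparison of splitting ratios can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Azure modules themselves: a seeded random shuffle stands in for the 'Split Data' module, ordinary least-squares linear regression stands in for one of the four algorithms, the mean absolute error on the test part is an assumed evaluation metric, and the (P-arsenic, cancer value) pairs are hypothetical.

```python
import random

def split_data(rows, test_fraction, seed=0):
    """Randomly split rows into (train, test) with a fixed seed."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Mean absolute error of a fitted line on the given points."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)

# Hypothetical (P-arsenic, cancer value) pairs lying on a straight line.
rows = [(float(a), 4e-5 * a) for a in range(10, 30)]

results = {}
for name, fraction in [("Sr-01", 0.30), ("Sr-02", 0.25), ("Sr-03", 0.20)]:
    train, test = split_data(rows, fraction)
    model = fit_line(train)
    results[name] = (len(train), len(test), mean_abs_error(model, test))
```

Each entry of `results` holds the training size, the testing size and the test error for one splitting ratio, so the ratios can be compared side by side.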

VI. CONCLUSION
The literature shows that arsenic contamination studies have received immense attention around the world over the past decade because of the hazardous effects of arsenic on human health. This is the first effort to deploy 'Microsoft Azure Machine Learning Studio' for the prediction of arsenic levels and the cancer rate. For this study, Khairpur district was chosen, and the arsenic values for the subsequent five years and their consequences for the wellbeing of individuals in terms of cancer are predicted. Further, a rise in arsenic levels is found via the 'Arsenic Prediction' and 'Cancer Rate' models. It is evident from the obtained results that the cancer rate shows a correspondingly growing trend owing to the increasing trend of arsenic from 2018 to 2022. The implementation of prediction and forecasting methods distinguishes this study from previous studies. One limitation of this work is that the web service developed here can currently predict arsenic values for Khairpur district only. The accuracy comparison of three regression algorithms and one binary classification algorithm revealed that the accuracy of BLR is the highest among the four algorithms. One future direction of this work is the application of the developed ACCRAP model to datasets from other arsenic-affected regions of Sindh, Pakistan. Another future direction could be to analyze arsenic contamination levels in water by season, because water intake fluctuates with the seasons.
ACKNOWLEDGMENT
This project idea was awarded the Microsoft "AI for Earth Grant". We are grateful to Microsoft for their support and to MUET for facilitating this research work.