An Efficient Exploration on Big Data Analysis in Adolescent Diabetic Prediction with Deep Learning Techniques

International Journal of Information Technology, Research and Applications (IJITRA) is a journal that publishes articles which contribute new theoretical results in all the areas of Computer Science, Communication Network and Information Technology. Research paper and articles on Big Data, Machine Learning, IOT, Blockchain, Network Security, Optical Integrated Circuits, and Artificial Intelligence are in prime position.


Introduction:
Diabetes is a group of chronic metabolic diseases caused by insufficient insulin production or impaired insulin action. The International Diabetes Federation estimated that 463 million individuals worldwide had diabetes in 2019, half of whom were undiagnosed owing to the disease's complicated pathophysiology [1]. Furthermore, the global prevalence of diabetes is expected to rise dramatically over the coming decade. As a result, diabetes prevention and treatment place a significant financial burden on national economies, healthcare systems, and personal medical spending, particularly in low- and middle-income nations [2]. According to etiopathology, the three primary clinical types of diabetes are type 1 diabetes (T1D), type 2 diabetes (T2D), and gestational diabetes mellitus (GDM) [3]. Latent autoimmune diabetes in adults and maturity-onset diabetes of the young are two further kinds with particular aetiologies. T1D is caused by the immune system attacking the insulin-secreting β cells of the pancreas [4]. People with T1D lack the endogenous insulin normally produced by pancreatic β cells; hence they must rely on exogenous insulin.
T2D is the most common type of diabetes, accounting for over 90% of cases. It is caused by insulin resistance or inadequate insulin synthesis. GDM can develop during pregnancy and may necessitate lifestyle changes as well as exogenous insulin delivery to avoid complications for the baby. The early identification and classification of diabetes are typically challenging due to rising heterogeneity and a lack of continuous [...] [5]. To perform like a human, AI and ML systems require causability as well as explainability, as many of the best-performing ML and AI techniques are the least transparent. The capacity of AI methods to explain themselves helps doctors place more faith in future AI methods. Causability is based on a causal process evaluated in terms of efficiency, effectiveness of causal comprehension, and transparency to the user.
The key contribution of this review article is that it considers both ML and AI methods in DM detection, diagnosis, self-management, and personalization. As far as we know, this is the first review study to include both machine learning and artificial intelligence in the detection, diagnosis, and self-management of diabetes, as well as in the personalization of DM therapy. Review papers are important because they provide a complete summary of current research in a specific field. Furthermore, earlier reviews have looked only at ML procedures, leaving out important aspects of the field such as databases, pre-processing techniques, and the feature extraction and selection methodologies utilised to detect DM, as well as AI answers to the demand for intelligent DM assistants.

Diabetes Mellitus:
Diabetes Mellitus (DM), commonly known as diabetes, is a term that refers to a group of diseases that affect how the body transforms food into energy. When food is consumed, the body converts it into glucose, a sugar that is then released into the bloodstream. Insulin is a hormone produced by the pancreas that helps glucose move from the bloodstream into the cells that need it for energy. In diabetes, the body either does not produce enough insulin or cannot use it as well as it should. When too much glucose remains in the bloodstream, the result is a condition known as high blood sugar, which has the potential to cause serious or life-threatening health problems. DM manifests itself in various ways, depending on the source [7].

Prediabetes
Prediabetes develops when blood sugar levels rise above what they should be but are not high enough for a doctor to diagnose diabetes. Prediabetes increases the risk of type II diabetes as well as heart disease. These risks can be reduced by increasing exercise and losing extra weight (typically about 5%-7% of body weight) [8].

Type-I diabetes
Type I diabetes is also described as insulin-dependent diabetes mellitus (IDDM). It is also known as juvenile-onset diabetes because it usually starts in childhood. Type I diabetes is an autoimmune disease: it arises when antibodies attack the pancreas, damaging the organ to the point where it no longer produces insulin. A person's genes can cause this type of diabetes, as can problems with the insulin-producing cells in the pancreas. Many health problems associated with type I diabetes are caused by damage to the small blood vessels in the kidneys, eyes, and nerves. People with type I diabetes are also at a higher risk of heart failure and stroke [9].

Type-II diabetes
Type II diabetes is also called non-insulin-dependent diabetes or adult-onset diabetes. It accounts for around 90% of individuals with diabetes. In type II diabetes the pancreas produces some insulin, but either the amount is not enough or the body does not use it effectively. Type II diabetes is often less severe than type I diabetes. However, it has the potential to create serious health issues, particularly in the small blood vessels of the nerves, kidneys, and eyes. Type II diabetes also increases the risk of stroke and heart failure. Obesity promotes insulin resistance, so the pancreas has to work harder to produce more insulin [10].

Other forms of diabetes
Other causes account for diabetes in 1% to 5% of people with the disease. These include pancreatic diseases, certain operations and medications, and other illnesses. In these circumstances, the doctor may wish to monitor blood sugar levels. While extensive DM research in recent decades has yielded a wealth of material on a) etiopathology, b) diagnosis, and c) illness detection and treatment, there is still much more to be discovered, unraveled, explained, and delineated. In this effort, a large, rapidly developing body of clinical data and research provides a solid foundation for effective examination and follow-up. As a result, machine learning and artificial intelligence (AI) look to be essential technologies that will greatly impact clinical decision-making. The goal is therefore to link data analysis with treatment and intelligent decision-making in drug implementation and application [11].

Related works:
Data have become an integral part of every organization, and analysing them to discover hidden knowledge has become essential for improving services. The same is true of the medical field, where predictive data mining is used for the prognosis of disease at an early stage, to pre-empt its effects and to aid physicians in developing contingency plans. The available literature reveals that the majority of work on diabetes has focused on developing methods for the prognosis or diagnosis of type II diabetes to reduce its complications. In the majority of cases the Pima Indian dataset has been used for experimentation, though the methods and tools used have varied. To detect diabetes at an initial stage, a multi-agent system called "CoLe" was designed by [12]; the system was not just a combinational agent but had multiple data miners. Its fundamental goal was to combine learners so that information is depicted from alternate viewpoints, and this was achieved with high accuracy. To predict diabetes, [13] applied associative rule mining: an equal-interval binning technique was used to discretize continuous-valued attributes, the Apriori algorithm was applied for diabetic classification, and finally association rules were generated to understand the relationships among the measured fields used in prediction. In [14], Fuzzy ID3 combined with EM (Expectation Maximization) was used for diabetic prediction. The model, called the Hybrid Classification System, was a two-phase system: in the initial phase the EM algorithm was fed with cleaned data for clustering, and in the second phase the adaptation rules essential for diabetic prognosis were obtained using the ID3 algorithm. The model's accuracy was about 91.32%. An expert system for diabetic prognosis was developed by [15]. The system used an extended classifier system (XCS), a learning agent in artificial intelligence, with a greater accuracy than others. The system is composed of a simple set of if-else rules. The achieved accuracy was 91.3%.
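As a rough illustration of the equal-interval binning step described for [13], the sketch below discretizes a continuous attribute into bins of equal width. The function name, the number of bins, and the glucose readings are assumptions made for this example; they are not taken from [13].

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical plasma glucose readings, for illustration only.
glucose = [85, 99, 110, 126, 140, 183, 197]
bins = equal_width_bins(glucose, n_bins=4)
print(bins)
```

Once every attribute is reduced to a small set of bin labels, each patient record can be treated as a set of items, which is the form of input that Apriori-style rule mining expects.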
The work in [16] used a novel artificial bee algorithm to predict diabetes. To improve prediction, the authors used a fake mutation operator. The model achieved an accuracy of 84.21%. A joint implementation of two widely used algorithms, Support Vector Machines and Naïve Bayes, was used by [17] for the prognosis of diabetes. The system had an accuracy of 97.6%. A hybrid prediction model for the prognosis of type II diabetes was proposed by [18]. The system combined two data mining methods, K-means clustering and the C4.5 decision tree; K-fold cross-validation was applied for validation, and the system achieved an accuracy of 92.38%. To predict the risk of heart attack in a diabetic patient, [19] applied a Naïve Bayes classifier to diabetic data with an accuracy of 74%. In [20], ensemble methods were combined with the J48 decision tree algorithm to design a technique that could be applied to a diabetic data set to predict a diabetic patient's chances of developing diabetic neuropathy, diabetic nephropathy, and cardiovascular disease.
Researchers have begun to recognise the ability of DL approaches to handle huge datasets. As a result, DL approaches have also been used to predict diabetes; seven studies have been published in the last six years. In [21], a DNN was employed whose architecture is made up of MLP, GRNN and RBF components. The Pima Indian dataset was used to evaluate the method. Because a DNN can filter data and develop biases, the authors deliberately did not preprocess the dataset. The dataset was divided into two parts: 192 samples for testing and the rest for training. The authors reported an accuracy of 88.41%. Another study [22] employed two DL approaches to increase diabetes prediction accuracy. The performance of CNN and CNN-LSTM was evaluated on a private electrocardiogram dataset with 142,000 samples and eight attributes in total. The dataset was separated into training and testing sets using five-fold cross-validation. Because of the DNN's self-learning, the authors did not pre-process the data or apply a feature selection strategy. The models produced accuracy rates of 90.9% and 95.1%, respectively. In study [23], logistic regression was employed as a baseline for MPNNs and CNN. The goal was to use a CGM signal dataset to identify diabetes patients. The dataset contains nine individuals, each with 10,800 days of CGM data, for a total of 97,200 simulated CGM days. The attributes employed in this investigation were not discussed.
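The five-fold cross-validation mentioned for [22] can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the helper name `k_fold_indices` is hypothetical, folds are taken as contiguous index ranges, and shuffling or stratification, which a real study would normally apply, is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 142,000 samples split into five folds, as in the setting described for [22].
folds = list(k_fold_indices(n_samples=142_000, k=5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

Each sample appears in exactly one test fold, so every record contributes to the reported accuracy once. Setting `k` equal to the number of samples turns the same helper into leave-one-out cross-validation.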
Furthermore, [24] proposed Deep Patient, an unsupervised DNN framework. The framework made use of a database of 704,857 patients' electronic health records. The authors did not explain which features of this dataset were employed, but they did say that it may be used to forecast various diseases. During the validation phase, the authors divided the data into 5,000 records for validation, 76,217 for testing, and the rest for training. Performance was measured by the Area Under Curve (AUC), which reached 0.91. To improve prediction performance, the authors suggested pre-processing the dataset: before running the DL method, they proposed utilising PCA to extract relevant attributes. Work [25] applied three distinct DL approaches to a manually gathered dataset from a regional Australian hospital. The dataset consists of 12,000 samples (patients) with a male-to-female ratio of 55.5 percent. Certain pre-processing procedures were used to clean the data and reduce the samples to 7,191 patients. The dataset was divided into two-thirds for training, one-sixth for validation, and the remaining one-sixth for testing. [...] [26] to predict the two forms of diabetes.
The Pima Indian dataset, which has 768 samples and eight attributes, was used by the researchers. According to their research, "Glucose, BMI, Age, Pregnancies, Diabetes Pedigree Function, Blood Pressure, Skin Thickness, and Insulin" are the most important attributes. To validate the study, the dataset was split 80% for training and 20% for testing. Type 1 diabetes was correctly predicted 78% of the time, and type 2 diabetes 81% of the time. Among other investigations, work [27] employed a one-dimensional modified CNN to predict diabetes based on breath signals. The researchers gathered a breath signal dataset from 11 healthy subjects, nine type 2 diabetes patients, and five type 1 diabetes patients. The dataset was not pre-processed in any way. The authors employed Leave-One-Out Cross-Validation as the validation method. Performance was assessed using the ROC curve, with a reported value of 0.96.
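Several of the studies above report performance through the ROC curve or its area (AUC). As a hedged sketch of what that number measures, the code below computes AUC directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and scores are invented for illustration and do not come from any cited study.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = diabetic) and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small dataset such as the 25-subject breath-signal study but would be replaced by a sort-based computation for large cohorts.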

Diagnosis of Adolescent Diabetes using Hybrid Architecture of Neural Network
Diabetes Mellitus (DM) is a prevalent chronic condition that can lead to serious health consequences and even death. The early detection of diabetes is critical, and significant complexity must be overcome to achieve it. Many research studies on diabetes diagnosis have been carried out, the majority of which are based on a single data set, the Pima Indian diabetes data set. Early detection and treatment are essential for illness prevention, and Machine Learning (ML) approaches can be used to make more accurate predictions and improve performance. A Hadoop/MapReduce system was employed, together with a predictive analysis method, to anticipate the forms of diabetes that are prevalent, the difficulties related to them, and the type of therapy to be delivered. According to the report, this approach provides a significant way to cure and care for patients with superior outcomes such as affordability and availability. The Hadoop Distributed File System (HDFS) is used to store a large set of data, and the MapReduce programming model is used to analyze the massive patient data sets in parallel. Using this analysis it is possible to identify patients likely to be at diabetic risk and to diagnose them at the earliest opportunity. Once the diabetes range is detected, the historical medical data of patients with an abnormal diabetic range are analysed to predict the risk of cardiac attack. The data were collected as images and classified using a hybrid Convolutional Neural Network (CNN) architecture that comprises the VGG-19 and Inception V3 algorithms.

MapReduce algorithm with Hadoop K-means clustering in the Hadoop/MapReduce environment for predictive analysis of Adolescent Diabetes:
This system architecture uses Hadoop clustering and a hybrid neural network architecture comprising the VGG-19 and Inception V3 algorithms, which makes data collection and processing more efficient. Hadoop is an open-source implementation of the MapReduce parallel processing model in which the processing details are hidden, including the distribution of data to the processing nodes, the handling of subtasks, and the collection of computational results. In this framework, developers can focus on the computational problem rather than on the details of parallelization. Hadoop uses a distributed file system and works on a master-slave framework: the Hadoop Distributed File System (HDFS) has two kinds of node, the Name Node (which acts as the master) and the Data Node (which acts as a slave), which work with identical patterns.
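To make the combination of K-means and MapReduce more concrete, the sketch below expresses one K-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). It runs in a single Python process and only imitates the programming model; it is not Hadoop code, and the function names and the (glucose, BMI) points are assumptions made for this example.

```python
from collections import defaultdict


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances)), point


def reduce_mean(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def kmeans_iteration(points, centroids):
    """One MapReduce-style K-means pass; returns the updated centroids."""
    groups = defaultdict(list)  # shuffle: group the mapped pairs by key
    for point in points:
        key, value = map_assign(point, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


# Hypothetical (glucose, BMI) pairs, for illustration only.
points = [(90.0, 22.0), (95.0, 24.0), (180.0, 35.0), (190.0, 33.0)]
centroids = [(100.0, 25.0), (170.0, 30.0)]
centroids = kmeans_iteration(points, centroids)
print(centroids)
```

In a real cluster the map calls would run on the Data Nodes that hold each block of records, and the iteration would be repeated until the centroids stop moving.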

Pattern discovery:
For diabetic treatment it is necessary to test patterns such as plasma glucose concentration, serum insulin, diastolic blood pressure, diabetes pedigree, Body Mass Index (BMI), age, and number of times pregnant.
The pattern discovery of predictive analysis must include the following:
• Association rule mining: association between diabetic type and pages viewed (e.g. laboratory results)
• Clustering: clustering of similar patterns of usage, etc.
• Classification: classification of health risk value by the level of patient health condition
• Usage of statistics

Predictive Pattern Matching System:
Whenever the warehoused dataset is sent to the Hadoop system, the MapReduce task is performed immediately. In the mapping phase, the Master Node splits the large data set into smaller tasks for numerous Worker Nodes, which carry out the actual operation of the predictive pattern matching system. The Master Node consists of the Name Node (NN) and the Job Tracker (JT), which assign the map and reduce tasks. A Worker Node, or Slave Node, receives the order from the Master Node and processes the pattern matching task for the diabetes data with the help of the Data Node (DN) and Task Tracker (TT) on the same machine. Predictive matching is the process of comparing the analyzed threshold value with the obtained value. Once the pattern matching process has been completed by all Worker Nodes as required, the output is stored on intermediate disks. This process is known as a local write. When the reduce task is initiated by the Master Node, the allocated Worker Nodes read the processed data from the intermediate disks. Based on the query received from the Client through the Master Node, the reduce task is performed on the Worker Nodes. The results obtained from the reduce phase are distributed across various servers.
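The flow described above (split, map with threshold comparison, local write, reduce) can be imitated in a few lines of plain Python. This is a sketch under stated assumptions rather than the system's implementation: everything runs in one process, the function names are hypothetical, and the thresholds and patient records are invented for illustration and are not clinical guidance.

```python
from collections import defaultdict

# Hypothetical thresholds for illustration only; not clinical guidance.
THRESHOLDS = {"glucose": 140.0, "bmi": 30.0}


def split(records, n_workers):
    """Master Node: divide the data set into one chunk per worker."""
    return [records[i::n_workers] for i in range(n_workers)]


def map_task(chunk):
    """Worker Node: compare each obtained value with its threshold."""
    intermediate = []
    for record in chunk:
        for attribute, limit in THRESHOLDS.items():
            if record[attribute] > limit:
                intermediate.append((record["patient"], attribute))
    return intermediate  # stands in for the local write


def reduce_task(intermediate_outputs):
    """Worker Node: merge intermediate pairs into per-patient flags."""
    flagged = defaultdict(list)
    for pairs in intermediate_outputs:
        for patient, attribute in pairs:
            flagged[patient].append(attribute)
    return {patient: sorted(attrs) for patient, attrs in flagged.items()}


records = [
    {"patient": "p1", "glucose": 150.0, "bmi": 31.0},
    {"patient": "p2", "glucose": 100.0, "bmi": 24.0},
    {"patient": "p3", "glucose": 120.0, "bmi": 33.5},
]
chunks = split(records, n_workers=2)
result = reduce_task([map_task(chunk) for chunk in chunks])
print(result)
```

Only records that exceed a threshold produce intermediate pairs, so the volume of data read in the reduce phase is much smaller than the input handled in the map phase.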

Processing Analyzed Reports
After the large diabetic data set has been analysed through Hadoop, the final results are distributed over various servers and replicated across several nodes depending on the geographical area. Employing proper electronic communication technology to exchange the information of individual patients among health care centers will help patients in remote locations get proper treatment at the right time and at low cost. This method uses a very complicated procedure, and accuracy has been enhanced.

Benefits of this predictive analysis system
Diabetes may be associated with severe conditions such as heart attacks, strokes, eye diseases, and kidney diseases. The risk value derived from the level of patient health condition in the above results can be used by physicians at remote locations to serve people. Detecting diseases at earlier stages allows them to be treated more easily and effectively. In developing countries such as India, it is essential to manage individual and population health and to detect health care fraud more quickly. Middle-income families can benefit from high availability of medical facilities at minimum cost. This system leads to an improved focus on the health of every individual patient. In this way we can protect the next generation from diabetes mellitus.
Although DL has advanced the state of the art in various areas of diabetes, healthcare applications must be solid, dependable, and persuasive to avoid safety concerns and to deliver significant treatment support. In this context, several restrictions and obstacles must be overcome before DL may be used in therapeutic settings. First, due to human errors and sensor abnormalities, data obtained from people with diabetes in real-world circumstances are prone to being imperfect. The process of gathering real data can be costly and time-consuming. Sharing datasets among research teams might be problematic due to data privacy policies. As a result of these issues, many studies use a limited and sometimes insufficient amount of data.
Furthermore, deep learning models are opaque. Why a method produces a given output for a specific input is crucial to physicians, especially in critical decision-making applications. The intricate structures of DNN layers are effective at learning patterns from non-linear data but impair the interpretability of the method. When looking into DL for diabetes, it is critical to consider the trade-off between performance and interpretability. Finally, new algorithmic and hardware advances are likely to improve the efficiency of training DL methods.

Applications:
DL is a hot topic in the AI era, and it is worth noting that the majority of the papers chosen are from the last two years, indicating that this is a new technique. As a result, there is much room to improve present diabetic applications. First, multimodal methods with wearables and smartphone applications are progressively collecting digital data and vital indicators. The majority of this information can be easily uploaded to centralised systems or cloud storage. As IoT and 5G networks become more popular, growing data volumes and the variety of data sources are expected to enhance efficiency in numerous healthcare applications, including diabetes care. As data volume grows, many low-quality data samples can be filtered out and deleted from training sets, and advancements in wearables can effectively reduce measurement mistakes. DL is well-suited to handle such a surge in data availability.
Numerous recent initiatives in the AI area have been undertaken to improve model transparency and to comprehend model functionality in order to interpret DL methods. In particular, SHapley Additive exPlanations (SHAP) was presented as a unified framework for explaining how input features contribute to the final output. This method of ranking input features is also helpful for selecting input features. t-SNE has been applied in various CNN applications, such as DR detection, to qualitatively assess the learned feature maps. Furthermore, a recent study confirmed the conformity of NN methods with glucose-insulin dynamics. Similar methods can be used to analyze the performance of DNNs and to improve interpretability. Rather than relying exclusively on data-driven methods, incorporating expert knowledge into the learning process might aid in better understanding the underlying causes of a disease like diabetes. There are two viable options in particular. One is to employ expert knowledge as a guide throughout the training process, and the other is to incorporate physiological characteristics as input elements of the methods. Expert knowledge is also required to create safety constraints and to evaluate the confidence of model output. Numerous papers emphasised that their findings needed to be confirmed in real-world situations. In this regard, a Google team has made significant progress. In a human-centred study, they used deep learning to screen for diabetic eye problems at 11 clinics. The findings suggest that several socio-environmental issues must be addressed before such automated systems can be widely deployed.
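The idea behind SHAP, attributing a prediction to its input features through Shapley values, can be shown on a toy model. The sketch below computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings. It does not use the `shap` library; the model `toy_risk`, the baseline, and the convention of replacing an absent feature with its baseline value are assumptions made for this illustration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values, averaged over every ordering of the features."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature "absent"
        previous = model(current)
        for i in order:
            current[i] = instance[i]  # switch feature i on
            value = model(current)
            totals[i] += value - previous  # marginal contribution of i
            previous = value
    return [t / len(orderings) for t in totals]


def toy_risk(x):
    """Hypothetical score with an interaction between features 0 and 2."""
    return 2 * x[0] + x[1] + x[0] * x[2]


instance = [1, 3, 4]
baseline = [0, 0, 0]
phi = shapley_values(toy_risk, instance, baseline)
print(phi)
```

The attributions sum to the difference between the model output for the instance and for the baseline, and the interaction term is shared equally between the two features involved. Enumerating all orderings grows factorially with the number of features, which is why practical tools rely on approximations.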

Conclusion and future scope:
The adoption of Big Data Analytics in Hadoop provides a systematic means to achieve improved outcomes, such as the availability and affordability of healthcare services for all populations. Non-communicable diseases (NCDs) such as diabetes are a major health concern in India. This study will help diabetes patients comprehend the difficulties that may emerge, by converting diverse health records into relevant analysed results. This paper provides a complete overview of the current trend in DL methods for diabetes research. We conducted a systematic search, chose a set of publications, and summarised the major findings in three areas: diabetes diagnosis, glucose management, and the diagnosis of diabetes-related problems. Many DNN and learning algorithms are used in these areas, and the results have outperformed earlier conventional machine learning approaches in terms of experimental performance. Data availability, feature processing, and model interpretability have all been noted as problems in the literature. Transferring the newest breakthroughs in DL technology to the vast multi-modal data of diabetes care has the potential to address these issues in the future. We anticipate that DL methods will be widely used in clinical settings and will significantly enhance diabetes management.