Toniya Patil, Shevangae Singh, Utkarsh Nandy, Mohammad Mendahawai, Arav Mukherjee
Computer Science Department, University of California, Davis
tbpatil@ucdavis.edu, svisingh@ucdavis.edu, unandy@ucdavis.edu
mmendahawi@ucdavis.edu, arvmukherjee@ucdavis.edu
Final Project Report - Spring 2024
ECS 171: Machine Learning - University of California, Davis
Breast cancer tumor detection and classification has long been an active area of medical research because of its importance for early diagnosis. Many existing machine learning methods have made substantial contributions to this field. However, these methods still face challenges: the datasets they are trained on often exhibit class imbalance, which can lead to the underrepresentation of less common cancer types. In this paper, we propose a breast cancer classifier that takes a detailed, user-supplied description of the tumor as input. The methodology involves preprocessing the data by removing irrelevant columns and conducting exploratory data analysis (EDA) to eliminate weakly correlated features, resulting in a cleaned dataset of 19 features. For classification, we employ a model based on the Yggdrasil Decision Forest ML algorithm and use K-fold cross-validation to guard against overfitting. We tested the proposed model on data distinguishing benign from malignant tumors. The model achieved a highest accuracy of 98%. On test data containing more than 500 samples, it produced fewer than 5% false diagnoses.
Index Terms: Machine Learning, Yggdrasil Decision Forest, Classification, Breast Cancer, Malignant, Benign
In the United States alone, around 42,000 women and 500 men die from breast cancer every year. Breast cancer has five stages, ranging from stage 0 to stage IV. When the cancer is identified at stage 0 or I, the five-year survival rate is 99%. This means that a person diagnosed with breast cancer at stage 0 or I is 99% as likely to survive for five years as a person without breast cancer. At stage II, the five-year survival rate drops to 93%. At stage III, it drops significantly to 72%. Lastly, at stage IV, the five-year survival rate falls to 22%. Given these statistics and the prevalence of the disease, early and accurate detection is crucial to saving lives.
The integration of machine learning (ML) techniques with traditional detection methods such as ultrasound, mammogram, MRI, and biopsy has the potential to enhance diagnostic accuracy and efficiency. Our dataset contains feature descriptions from MRI scans of different patients. Figure 1 shows MRI scans of benign and malignant tumors, illustrating the visual differences that are crucial for accurate classification.
Fig. 1. MRI scan of a patient with a benign tumor vs. a malignant tumor
Recent studies have made substantial contributions to the use of machine learning (ML) for breast cancer detection, highlighting both the potential and the challenges of these techniques. They remain subject to critical challenges inherent to many ML techniques. Breast cancer datasets tend to exhibit class imbalance, which traditional models struggle to handle effectively. This can lead to an underrepresentation of less common breast cancer types, especially in a classification context.
Yggdrasil Decision Forests (YDFs), which comprise various tree-based machine learning models such as Decision Trees, Random Forests, and Gradient Boosted Trees, offer a promising solution to these issues. To combat class imbalance, YDFs implement several techniques such as ensemble learning. In terms of interpretability, YDFs provide insights into feature importance and decision paths, leading to much greater transparency than traditional "black-box" models. YDFs are also highly efficient and capable of handling complex medical imaging data with faster training times than traditional ML models.
In this section, we conduct exploratory data analysis (EDA), as visualized in Figure 2, to understand the patterns, distributions, and relationships in the breast cancer dataset. This involves describing the dataset, examining numerical and categorical variables, and identifying key observations that are useful when building the model. The model aims to classify the type of breast cancer using digital image analysis.
Fig. 2. EDA PIPELINE
Each numeric variable represents a different measurement. Table I shows the mean, median, and standard deviation for each numeric variable. The categorical variable in the dataset represents the diagnosis category (benign or malignant). Table II contains its frequency distribution.
Variable | Mean | Median | Standard Deviation |
---|---|---|---|
id | 30,371,830 | 906,024 | 125,020,600 |
radius mean | 14.13 | 13.37 | 3.52 |
texture mean | 19.29 | 18.84 | 4.30 |
perimeter mean | 91.97 | 86.24 | 24.30 |
area mean | 654.89 | 551.10 | 351.91 |
smoothness mean | 0.096 | 0.096 | 0.014 |
compactness mean | 0.104 | 0.093 | 0.053 |
concavity mean | 0.089 | 0.062 | 0.080 |
symmetry mean | 0.181 | 0.179 | 0.027 |
fractal dimension mean | 0.063 | 0.062 | 0.007 |
Variable | B (Benign) | M (Malignant) |
---|---|---|
Count | 357 | 212 |
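The summary statistics in Table I and the class frequencies in Table II can be computed along the following lines. This is a minimal sketch using only the Python standard library; the rows, the `summarize` helper, and the column name are hypothetical illustrations, not values or code from our dataset or pipeline.

```python
import statistics
from collections import Counter

# Hypothetical toy rows of (diagnosis, radius_mean); NOT values from the real dataset.
rows = [
    ("B", 12.0),
    ("B", 13.0),
    ("M", 18.0),
    ("B", 11.0),
    ("M", 21.0),
]

def summarize(values):
    """Return the mean, median, and sample standard deviation of a numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }

radius = [value for _, value in rows]
radius_summary = summarize(radius)

# Frequency distribution of the categorical diagnosis column.
class_counts = Counter(label for label, _ in rows)

print(radius_summary)
print(dict(class_counts))
```

Applied to the full dataset, the same counting step would give the benign-to-malignant ratio (357 to 212), which is a quick way to quantify the class imbalance.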
Some key observations include: The dataset has more benign cases (357) than malignant cases (212). Variables like radius, texture, perimeter, and area have significantly different scales, with area values being much larger than radius and perimeter values. Skewness in a dataset refers to the asymmetry in the distribution of data values, as shown in Figure 3.
Fig. 3. Skewness of data: a left-skewed distribution has its tail to the left, while a right-skewed distribution has its tail to the right
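As an illustration of how skewness can be quantified, the sketch below computes the moment-based (Fisher-Pearson) skewness coefficient. The choice of this particular coefficient is an assumption made for illustration, and the sample values are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5.

    Positive results indicate a tail to the right, negative results a tail
    to the left. The input must not be constant (variance would be zero).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples, not taken from the dataset.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]
left_skewed = [-10, -2, -2, -1, -1]

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```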
Outliers are data points that are significantly different from the rest of the data. Figure 4 visualizes the outliers (if present) and values of skewness by plotting the features and values for each variable.
Fig. 4. Visualization of outliers, with the skewness value of each feature indicating whether it contains outliers
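One common way to flag outliers numerically is the 1.5 x IQR rule used by box plots. The sketch below assumes that rule for illustration; it is not necessarily the criterion behind Figure 4, and the `iqr_outliers` helper and the sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical feature values with one obvious outlier.
sample = [10, 11, 12, 13, 14, 15, 100]
print(iqr_outliers(sample))
```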
Correlation analysis is a statistical technique used to measure the strength of the relationship between variables. The correlation matrix in Figure 5 reveals several pairs of features with high correlation coefficients (greater than 0.95), indicating redundancy.
Fig. 5. Correlation matrix, where red marks highly correlated features and light blue marks uncorrelated features
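The redundancy check can be sketched as follows: compute the Pearson correlation for each pair of features and keep only one feature from any pair whose absolute correlation exceeds 0.95. Which member of a pair is kept (here, the one that appears first) is an assumption, and the feature values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with any kept feature."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy columns: perimeter tracks radius almost perfectly.
columns = {
    "radius_mean": [10, 12, 14, 16, 18],
    "perimeter_mean": [63, 75, 88, 101, 113],
    "texture_mean": [20, 15, 22, 17, 19],
}
kept = drop_redundant(columns)
print(kept)
```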
The performance of our random forest model was evaluated using K-fold cross-validation. We trained the model with the number of folds ranging from 2 to 10. As shown in Figure 6, the accuracy of our model increased with increasing batch size. Each trained model was appended to an initially empty list. Using the 'argmax' function, we then selected the model with the highest accuracy.
Fig. 6. Batch size vs Accuracy
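The selection procedure can be sketched as below: run K-fold cross-validation for each number of folds from 2 to 10, collect the mean accuracies in a list, and pick the best entry with an argmax. To keep the example dependency-free, a simple single-feature threshold rule stands in for the random forest; that stand-in, the helper names, and the toy data are all assumptions and not our actual model or dataset.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_threshold(xs, ys):
    """Stand-in model: midpoint between the two class means of one feature."""
    benign = [x for x, y in zip(xs, ys) if y == 0]
    malignant = [x for x, y in zip(xs, ys) if y == 1]
    if not benign or not malignant:  # guard: a class is missing from training
        return sum(xs) / len(xs)
    return (sum(benign) / len(benign) + sum(malignant) / len(malignant)) / 2

def cross_val_accuracy(xs, ys, k):
    """Mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == bool(ys[i]) for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Hypothetical toy data: one feature, label 0 = benign, 1 = malignant.
xs = [10 + 0.4 * i for i in range(10)] + [18 + 0.4 * i for i in range(10)]
ys = [0] * 10 + [1] * 10

fold_counts = list(range(2, 11))
accuracies = [cross_val_accuracy(xs, ys, k) for k in fold_counts]

# argmax: position of the highest mean accuracy in the list.
best_position = max(range(len(accuracies)), key=accuracies.__getitem__)
best_k = fold_counts[best_position]
print(best_k, accuracies[best_position])
```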
Figure 7 shows the relationship between batch size and MSE, indicating that lower batch sizes resulted in models with more noise. By increasing the batch size, we reduced the variance in the gradient estimates, effectively smoothing out the fluctuations. As a result, we were able to lower the MSE of the model.
Fig. 7. Batch size vs MSE
We produced a confusion matrix to display the results of testing our model on the sample dataset. The confusion matrix showed a false positive rate of 5.07% and a false negative rate of 0.515%. Overall, the model made 20 wrong predictions out of 569 samples, a 3.51% error rate. The mean squared error of the best model was 0.0351%, indicating an extremely small error rate.
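The overall error rate follows directly from the reported totals, as the sketch below shows. It also illustrates that when predictions are hard 0/1 labels, the mean squared error coincides with the misclassification rate. The label lists are hypothetical and the helper functions are illustrative, not our evaluation code.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Reported totals: 20 wrong predictions out of 569 samples.
reported_error = 20 / 569
print(f"{reported_error:.2%}")

# Hypothetical hard 0/1 labels (0 = benign, 1 = malignant).
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
print(error_rate(y_true, y_pred), mean_squared_error(y_true, y_pred))
```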
Previous studies on breast cancer classification have reported accuracies of 98% for SVM, 78-92% for combinations of AlexNet, VGG 16, Inception, ResNet, and NASNet, and 95-96% for Inception v3, ResNet 50, VGG 16, and VGG 19. We achieved accuracies over 95% with our Yggdrasil random forest approach. The best model yielded an accuracy of over 98.5%, and its MSE was lower than 1%. Our study presents a model for classifying breast cancer tumors as benign or malignant using a Random Forest classifier implemented through the Yggdrasil Decision Forest library, contributing to the early detection of breast cancer.
UC Davis ECS 171: Machine Learning | Final Project Report | Spring 2024 | Classification Technique to Detect Breast Cancer