UC Davis Computer Science Department

Classification Technique to Detect Breast Cancer

Toniya Patil, Shevangae Singh, Utkarsh Nandy, Mohammad Mendahawai, Arav Mukherjee

Computer Science Department, University of California, Davis

tbpatil@ucdavis.edu, svisingh@ucdavis.edu, unandy@ucdavis.edu

mmendahawi@ucdavis.edu, arvmukherjee@ucdavis.edu

Final Project Report - Spring 2024

ECS 171: Machine Learning - University of California, Davis

Demo Notice: This is an adaptation of a UC Davis research paper submitted as a project for a machine learning class. This is purely for demonstration purposes only. Content has been modified to showcase AsterAI's document analysis capabilities.

Abstract

Breast cancer tumor detection and classification is a long-standing area of medical research because of its importance for early diagnosis. Many existing machine learning methods have made substantial contributions to this field. However, these methods still face challenges: the datasets they are trained on are often class-imbalanced, which can lead to the underrepresentation of less common cancer types. In this paper, we propose a breast cancer classifier that takes detailed user input describing the tumor. The methodology involves preprocessing the data by removing irrelevant columns and conducting exploratory data analysis (EDA) to eliminate weakly correlated features, resulting in a cleaned dataset of 19 features. For classification, we employ a model based on the Yggdrasil Decision Forest ML algorithm and use K-fold cross-validation to guard against overfitting. We tested the proposed model on data distinguishing benign from malignant tumors. The best model achieved an accuracy of 98%. On test data containing more than 500 samples, the model produced fewer than 5% false diagnoses.

Index Terms: Machine Learning, Yggdrasil Decision Forest, Classification, Breast Cancer, Malignant, Benign

1. Introduction

In the United States alone, around 42,000 women and 500 men die from breast cancer every year. Breast cancer has five stages, ranging from stage 0 to stage IV. When the cancer is identified at stage 0 or I, the five-year survival rate is 99%. This means that a person diagnosed with breast cancer at stage 0 or I is 99% as likely to survive for five years as a person without breast cancer. At stage II, the five-year survival rate drops to 93%, and at stage III it drops further to 72%. At stage IV, it falls to only 22%. Given these statistics and the prevalence of the disease, early and accurate detection is crucial to saving lives.

Integrating machine learning (ML) techniques with traditional detection methods such as ultrasound, mammography, MRI, and biopsy has the potential to enhance diagnostic accuracy and efficiency. Our dataset contains feature descriptions from MRI scans of different patients. Figure 1 shows MRI scans of benign and malignant tumors, illustrating the visual differences that are crucial for accurate classification.

MRI Scan comparison showing benign vs malignant tumor characteristics

Fig. 1. MRI scan of a patient with a benign tumor vs. a malignant tumor

2. Literature Review

Recent studies have made substantial contributions to the use of machine learning (ML) for breast cancer detection, but they remain subject to critical challenges inherent to many ML techniques. Breast cancer datasets tend to exhibit class imbalance, which traditional models struggle to handle effectively. This can lead to an underrepresentation of less common breast cancer types, especially in a classification context.

Yggdrasil Decision Forests (YDFs), which comprise various tree-based machine learning models such as Decision Trees, Random Forests, and Gradient Boosted Trees, offer a promising solution to these issues. To combat class imbalance, YDFs rely on techniques such as ensemble learning. In terms of interpretability, YDFs provide insight into feature importance and decision paths, leading to much greater transparency than traditional "black-box" models. YDFs are also highly efficient and capable of handling complex medical imaging data with faster training times than traditional ML models.
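The ensemble idea behind tree-based models can be illustrated with a minimal pure-Python sketch of bagging: several one-split trees (decision stumps) are each trained on a bootstrap sample, and their predictions are combined by majority vote. This is not the YDF implementation and does not use its API. The function names, the two-feature rows, and the parameter values below are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Return the (feature, threshold, high_label) split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for high_label in (0, 1):
                errors = sum(
                    (high_label if row[feature] > threshold else 1 - high_label) != label
                    for row, label in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, high_label)
    return best[1:]

def predict_stump(stump, row):
    """Predict high_label above the threshold and the other class otherwise."""
    feature, threshold, high_label = stump
    return high_label if row[feature] > threshold else 1 - high_label

def fit_bagged_stumps(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        ensemble.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample])
        )
    return ensemble

def predict_ensemble(ensemble, row):
    """Combine the individual stump predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]

# Hypothetical (radius, concavity) rows, not taken from the dataset.
# Labels: 0 = benign, 1 = malignant.
rows = [(11.0, 0.04), (12.0, 0.05), (12.5, 0.06), (13.0, 0.07),
        (17.0, 0.12), (18.5, 0.15), (19.0, 0.16), (20.0, 0.18)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = fit_bagged_stumps(rows, labels)
print(predict_ensemble(ensemble, (10.0, 0.03)), predict_ensemble(ensemble, (21.0, 0.20)))
```

A single stump trained on an unlucky bootstrap sample can place its threshold poorly, but the vote across many stumps is much more stable, which is the property that ensemble methods rely on.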

3. Exploratory Data Analysis

In this section, we conduct exploratory data analysis (EDA), following the pipeline in Figure 2, to understand the patterns, distributions, and relationships in the breast cancer dataset. This involves describing the dataset, examining numerical and categorical variables, and identifying key observations that are useful when building the model. The model aims to classify the type of breast cancer using digital image analysis.

EDA pipeline showing data preprocessing and analysis workflow

Fig. 2. EDA pipeline

Each numeric variable represents a different measurement. Table I shows the mean, median, and standard deviation of each variable. The categorical variable in the dataset is the diagnosis, which takes the values benign (B) and malignant (M). Table II contains its frequency distribution.

TABLE I
SUMMARY FOR NUMERICAL VARIABLES

Variable	Mean	Median	Standard Deviation
id	30,371,830	906,024	125,020,600
radius mean	14.13	13.37	3.52
texture mean	19.29	18.84	4.30
perimeter mean	91.97	86.24	24.30
area mean	654.89	551.10	351.91
smoothness mean	0.096	0.096	0.014
compactness mean	0.104	0.093	0.053
concavity mean	0.089	0.062	0.080
symmetry mean	0.181	0.179	0.027
fractal dimension mean	0.063	0.062	0.007
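
The summary statistics reported in Table I can be computed per column with Python's standard `statistics` module. The sketch below is illustrative only: the `summarize` helper and the sample values are hypothetical and are not taken from the dataset.

```python
import statistics

def summarize(values):
    """Return (mean, median, sample standard deviation) for one numeric column."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )

# Hypothetical radius measurements, used only to exercise the helper.
radius_sample = [12.0, 13.0, 14.0, 15.0, 21.0]
mean, median, std = summarize(radius_sample)
print(f"mean={mean:.2f} median={median:.2f} std={std:.2f}")
```

A mean that sits above the median, as in most rows of Table I, is a first hint that a column is right-skewed.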

TABLE II
FREQUENCY DISTRIBUTION FOR CATEGORICAL VARIABLES

Variable	B (Benign)	M (Malignant)
Count	357	212
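
The class counts in Table II also give a useful reference point: a classifier that always predicts the majority class would already be right on the share of benign cases. The short sketch below recomputes that share from the reported counts; the variable names are illustrative.

```python
from collections import Counter

# Label counts as reported in Table II.
labels = ["B"] * 357 + ["M"] * 212

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, and the accuracy of always predicting the majority class.
shares = {label: count / total for label, count in counts.items()}
majority_baseline = max(shares.values())

print(f"total={total} benign={shares['B']:.1%} malignant={shares['M']:.1%}")
print(f"majority-class baseline accuracy={majority_baseline:.1%}")
```

Any reported accuracy is best read against this baseline of roughly 63% rather than against 50%.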

Some key observations follow. The dataset has more benign cases (357) than malignant cases (212). Variables such as radius, texture, perimeter, and area have very different scales, with area values far larger than radius and perimeter values. Skewness refers to the asymmetry in the distribution of data values, as shown in Figure 3.

Data skewness analysis showing left-skewed vs right-skewed distributions

Fig. 3. Skewness of data: a left-skewed distribution has its tail to the left, and a right-skewed distribution has its tail to the right
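
One common way to quantify skewness is the Fisher-Pearson moment coefficient, the third central moment divided by the 1.5 power of the second. The report does not state which estimator was used, so the sketch below is an assumption, and the sample lists are hypothetical.

```python
def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form).

    Positive for a right-skewed sample, negative for a left-skewed one,
    and zero for a symmetric one. Undefined for constant data.
    """
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # long tail to the right
left_skewed = [-10, -2, -1, -1, -1]  # long tail to the left

for name, sample in [("symmetric", symmetric),
                     ("right", right_skewed),
                     ("left", left_skewed)]:
    print(name, round(skewness(sample), 3))
```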

Outliers are data points that are significantly different from the rest of the data. Figure 4 visualizes the outliers (if present) and values of skewness by plotting the features and values for each variable.

Outlier detection visualization with skewness values for each feature

Fig. 4. Outliers and skewness values for each feature, indicating whether the feature contains outliers
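
The report does not specify how outliers were identified. A widely used convention, assumed here purely for illustration, is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs below the first quartile or above the third. The helper name and sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical feature values with one obvious outlier.
feature = [10, 11, 12, 12, 13, 13, 14, 15, 40]
print(iqr_outliers(feature))
```

Tree-based models are generally less sensitive to outliers than distance-based models, so flagged values need not be removed automatically.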

Correlation analysis is a statistical technique used to measure the strength of the relationship between variables. The correlation matrix in Figure 5 reveals several pairs of features with high correlation coefficients (greater than 0.95), indicating redundancy.

Correlation matrix showing highly correlated features in red and uncorrelated features in blue

Fig. 5. Correlation matrix, where red marks highly correlated features and light blue marks uncorrelated features
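
Redundant features can be found by computing the Pearson correlation for every pair of columns and flagging pairs above the 0.95 threshold. The sketch below assumes that the later member of each such pair is dropped. That rule, the helper names, and the toy columns are hypothetical, since the report does not say which feature of a pair was kept.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def redundant_features(columns, threshold=0.95):
    """Return the features to drop: the later member of each highly correlated pair."""
    names = list(columns)
    dropped = []
    for i, first in enumerate(names):
        if first in dropped:
            continue
        for second in names[i + 1:]:
            if second in dropped:
                continue
            if abs(pearson(columns[first], columns[second])) > threshold:
                dropped.append(second)
    return dropped

# Toy columns: perimeter is almost a fixed multiple of radius, texture is unrelated.
columns = {
    "radius_mean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter_mean": [6.3, 12.6, 18.8, 25.1, 31.4],
    "texture_mean": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(redundant_features(columns))
```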

4. Experimental Results

The performance of our random forest model was evaluated using K-fold cross-validation. We trained the model with the number of folds ranging from 2 to 10. As shown in Figure 6, the accuracy of our model increased with increasing batch size. Each trained model was appended to an initially empty list, and we used the 'argmax' function to select the model with the highest accuracy.

Chart showing the relationship between batch size and model accuracy

Fig. 6. Batch size vs Accuracy
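
The evaluation loop described above can be sketched in pure Python as follows. To keep the example self-contained, a single-feature threshold classifier stands in for the random forest, and the data are synthetic. The stand-in model, the helper names, and the synthetic distributions are all assumptions, not the authors' code.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_threshold(xs, ys):
    """Stand-in model: midpoint between the two class means of one feature."""
    benign = [x for x, y in zip(xs, ys) if y == 0]
    malignant = [x for x, y in zip(xs, ys) if y == 1]
    if not benign or not malignant:  # degenerate split: use the overall mean
        return sum(xs) / len(xs)
    return (sum(benign) / len(benign) + sum(malignant) / len(malignant)) / 2

def cross_val_accuracy(xs, ys, k):
    """Mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Synthetic single-feature data: 0 = benign, 1 = malignant.
rng = random.Random(42)
xs = [rng.gauss(12.0, 1.5) for _ in range(30)] + [rng.gauss(18.0, 2.5) for _ in range(30)]
ys = [0] * 30 + [1] * 30

fold_counts = list(range(2, 11))
accuracies = []
for k in fold_counts:
    accuracies.append(cross_val_accuracy(xs, ys, k))

# argmax: index of the configuration with the highest mean accuracy.
best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
best_k = fold_counts[best_index]
print(f"best k = {best_k}, accuracy = {accuracies[best_index]:.3f}")
```

Selecting the best configuration on the same folds used to report its accuracy tends to give an optimistic estimate, so a separate held-out test set is a useful complement.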

Figure 7 shows the relationship between batch size and MSE, indicating that smaller batch sizes resulted in noisier models. By increasing the batch size, we reduced the variance in the gradient estimates, effectively smoothing out the fluctuations. As a result, we were able to lower the MSE of the model.

Chart showing the relationship between batch size and Mean Squared Error (MSE)

Fig. 7. Batch size vs MSE

We produced a confusion matrix to summarize the results of testing our model on the sample dataset. It yielded a false positive rate of 5.07% and a false negative rate of 0.515%. Overall, the model made 20 wrong predictions out of 569 samples, a 3.51% error rate. The mean squared error of the best model was 0.0351%, indicating an extremely small error.
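
The quantities in this paragraph can be derived from the four confusion-matrix counts. The sketch below uses the textbook definitions, FPR = FP / (FP + TN) and FNR = FN / (FN + TP), with malignant as the positive class. The report does not state its definitions or its label encoding, so these choices, the helper names, and the toy labels are assumptions. If labels are encoded as 0 and 1, the MSE of hard predictions equals the fraction of misclassified samples.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def error_metrics(tp, fp, tn, fn):
    """False positive rate, false negative rate, and overall error rate."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "error_rate": (fp + fn) / (tp + fp + tn + fn),
    }

# Toy labels (0 = benign, 1 = malignant), not the report's predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = error_metrics(tp, fp, tn, fn)

# With 0/1 labels, each squared error is 0 or 1, so MSE equals the error rate.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Error rate implied by the counts reported in the text.
reported_error_rate = 20 / 569

print(metrics, mse, round(reported_error_rate, 4))
```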

5. Evaluation and Conclusion

Previous studies on breast cancer classification have reported accuracies of 98% for SVM, 78-92% for combinations of AlexNet, VGG16, Inception, ResNet, and NASNet, and 95-96% for Inception v3, ResNet-50, VGG16, and VGG19. We achieved accuracies over 95% with our Yggdrasil random forest approach. The best model yielded an accuracy of over 98.5%, and its MSE was lower than 1%. Our study presents a model for classifying breast cancer tumors as benign or malignant using a Random Forest classifier implemented with the Yggdrasil Decision Forest library, contributing to the early detection of breast cancer.

References

[1] CDC Breast Cancer, "Basic information about breast cancer," Centers for Disease Control and Prevention, Sep. 14, 2020. cdc.gov

[2] World Cancer Research Fund International, "Worldwide cancer data — World Cancer Research Fund International," WCRF International, 2022. wcrf.org

[3] ResearchGate, Figure 1 - uploaded by Md Abdullah Al Nasim, Jun. 30, 2019. www.researchgate.net

[4] American Cancer Society, "Survival rates for breast cancer," Mar. 01, 2023. www.cancer.org

[5] Rose, C. (n.d.). DDSM: Digital Database for Screening Mammography. USF Digital Mammography Home Page. www.eng.usf.edu

[6] Centers for Disease Control and Prevention, "How Is Breast Cancer Diagnosed?," CDC, 2019. cdc.gov

[7] Daraje kaba Gurmessa and Worku Jimma, "Explainable machine learning for breast cancer diagnosis from mammography and ultrasound images: a systematic review," BMJ Health & Care Informatics, vol.31, no.1, pp. e100954–e100954, Feb. 2024, doi: doi.org

[8] "Yggdrasil Decision Forests - Yggdrasil Decision Forests' documentation," ydf.readthedocs.io. ydf.readthedocs.io

[9] University of California Irvine Machine Learning Repository, "Breast Cancer Wisconsin (Diagnostic) Data Set," UCI, 2023. archive.ics.uci.edu

[10] PrepBytes, "iloc function in Python," PrepBytes Blog, 2023. prepbytes.com

[11] M. Guillame-Bert, S. Bruch, R. Stotz, J. Pfeifer, "Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library," arXiv preprint, 2022. arxiv.org

[12] S. Arooj et al., "Breast cancer detection and classification empowered with transfer learning," Frontiers in public health, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9289190/ (accessed Jun. 5, 2024). www.ncbi.nlm.nih.gov
