I am an MSc Data Science graduate from Newcastle University with a passion for Machine Learning and Deep Learning. I have worked on various projects during my bachelor's and master's degrees, applying concepts from statistics and AI. I am focused on computer vision challenges in the healthcare sector, particularly disease detection and diagnosis.
Weakly supervised semantic segmentation (WSSS) is challenging, especially when pixel-level predictions are supervised only by image-level labels. A Class Activation Map (CAM) is typically generated to provide pixel-level pseudo labels that bridge the gap between the two. CAMs from convolutional neural networks (CNNs) suffer from partial activation, highlighting only the most discriminative regions. Transformer-based approaches, on the other hand, are good at capturing global context through long-range dependency modelling, which may resolve the "partial activation" problem. In this study, we evaluate the efficacy of several transformer networks for the task and present the first transformer-based WSSS technique for histopathology images. Unlike CNN-based techniques, we train a segmentation model using high-quality pseudo segmentation masks generated by a Vision Transformer (ViT) classifier. Additionally, we adapt the Pixel-Adaptive Refinement module, which incorporates low-level image appearance information, to refine the pseudo labels and ensure accuracy and local consistency. Extensive testing on the WSSS4LUAD dataset showed that this method can successfully segment histopathological images using only image-level labels.
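To make the CAM idea concrete, here is a minimal sketch of the classic CNN-style CAM: a weighted sum of feature maps for one class, normalised and thresholded into a binary pseudo mask. It is illustrative only and is not the ViT-based pipeline from the study; the function names, the toy feature maps, and the 0.5 threshold are all assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) using one class's weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    return cam


def pseudo_mask(cam, threshold=0.5):
    """Min-max normalise the CAM to [0, 1], then threshold it (assumed cut-off)."""
    flat = [value for row in cam for value in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid dividing by zero on a constant map
    return [[1 if (value - low) / span >= threshold else 0 for value in row]
            for row in cam]


# Hypothetical 2 x 2 feature maps and classifier weights for one class.
toy_maps = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
toy_weights = [1.0, 0.5]
toy_mask = pseudo_mask(class_activation_map(toy_maps, toy_weights))
```

Because the mask keeps only the strongest responses, a CAM that fires on a small discriminative region yields an incomplete pseudo label, which is the partial activation problem described above.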
Skin cancer is defined as the uncontrolled growth of abnormal cells in the epidermis, the skin's outermost layer, caused by unrepaired DNA damage that results in mutations. It is one of the most frequent types of cancer in the world. The three most frequent types are squamous cell carcinoma, basal cell carcinoma, and melanoma. The clinical treatment of a skin lesion depends largely on how early it is detected. The prevalence of skin cancer is on the rise, particularly melanoma, which is aggressive due to its high metastatic rate. As a result, early detection is crucial so that therapy can begin before the onset of malignancy. To address this, medical imaging is employed for dermoscopic image processing and classification, which gives an opportunity to create an automated model for identifying skin lesions. We investigated an automated technique for lesion diagnosis using Convolutional Neural Networks (CNNs). We trained our model on Kaggle's "Skin Cancer MNIST: HAM10000" dataset, which contains a large collection of multi-source dermoscopic images of pigmented lesions. We applied data augmentation to the training data to improve generalization and classification performance. We employed the VGG16 and ResNet50 architectures for skin lesion image classification and achieved the best accuracy and F1 score, 0.7651 and 0.7569 respectively, with ResNet50 trained on augmented data.
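For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and an F1 score can be computed from true and predicted labels. The summary above does not say how the F1 score was averaged across lesion classes, so macro averaging is an assumption here, and the label values are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four lesions (not taken from HAM10000).
true_labels = ["mel", "nv", "nv", "bcc"]
predicted_labels = ["mel", "nv", "bcc", "bcc"]
print(accuracy(true_labels, predicted_labels))
print(macro_f1(true_labels, predicted_labels))
```

On an imbalanced dataset the two numbers can diverge, which is why reporting both is useful.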
In this project, I analyzed a large dataset of taxi trips in New York City using Apache Spark, a popular big data processing framework. The goal was to gain insights into the dataset by summarizing interzonal travel and ranking zones by traffic, passenger volume, and profitability. I also recorded the execution time of the pipeline under different conditions to analyze the effect of dataset size ('S', 'M', 'L', 'XL', 'XXL'), dataset format (Parquet and Delta), and task complexity on pipeline performance. To start, I preprocessed and cleaned the data to ensure it was ready for analysis. Then, I computed new columns such as zone names and unit profitability for each trip. Next, I summarized interzonal travel by building a graph data structure of zone-to-zone traffic and obtained aggregate information about all trips between those zones. I used this graph to rank zones by traffic, passenger volume, and profitability. Finally, I recorded the execution time of the entire pipeline, and of the pipeline without the task 'Removing outliers using the modified z-score', on the two tables, for all dataset sizes and formats, and evaluated the resulting execution times. Overall, this project was a valuable learning experience that helped me develop my skills in big data analytics and data processing. It showed how the pipeline performs under different conditions and how it can be optimized for better performance. The insights gained from this analysis could be used to inform policy decisions related to the taxi industry in New York City.
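The outlier-removal task mentioned above relies on the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below is plain Python rather than the Spark implementation used in the project; the 3.5 cut-off is the commonly cited default rather than a value stated here, and the fare values are invented.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    values = list(values)
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # Simplification for this sketch: with zero spread, flag nothing.
        return [0.0 for _ in values]
    return [0.6745 * (v - centre) / mad for v in values]


def remove_outliers(values, threshold=3.5):
    """Keep values whose absolute modified z-score is within the threshold."""
    values = list(values)
    scores = modified_z_scores(values)
    return [v for v, score in zip(values, scores) if abs(score) <= threshold]


# Hypothetical trip fares with one implausible value.
fares = [10, 12, 11, 13, 12, 250]
print(remove_outliers(fares))  # the 250 fare is dropped
```

Each pass needs two medians, which are expensive to compute on distributed data. That makes it reasonable to time the pipeline with and without this step.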
The aim of this technical project is to design, implement, and test an image informatics approach for the automatic segmentation of colon cancer cells in microscopy imagery. To assess the accuracy and quality of the cell (nuclei and cytoplasm) segmentation, we compare our nuclei and cytoplasm segmentation results with the gold-standard manual segmentations using the Jaccard score. Many image segmentation techniques are available. The most common techniques are:
This project focuses on the analysis of learner data for the Cyber Security massive open online course (MOOC) by Newcastle University. It also demonstrates the advantages of reproducible data analysis with R Markdown. The data collection comprises 62 .csv files, of which the enrolment data was used for the analysis. The main objective is to understand the target demographic and develop insights into the learners enrolled in the course. This analysis will be helpful in developing the course to have a wider reach and be more appealing to learners. We analyse user information such as location, gender, age, and education to generate these insights. This gives us a better understanding of “who” the learners are, so that the course can be developed to meet business needs and market demand. We followed the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, an open standard process model that describes common approaches to data mining and analysis and is widely used in analytics.

The dataset for the Cyber Security: Safety at Home, Online, in Life course includes 62 files containing user data such as the following:

- Archetype survey
- Enrolments
- Leaving survey response
- Question response
- Step activity
- Team members
- Video stats
- Weekly sentiment survey responses

The dataset comprises data gathered from seven runs of the course. During our analysis, we concentrated solely on the enrolment data, which contains information about each learner such as gender, age, education, and country. Following the CRISP-DM model, we performed our analysis in more than one cycle, as explained below, with target demographic understanding as our main business objective:
In this project, I conducted a cluster analysis using hierarchical and k-means clustering on the ISLR gene expression dataset, which consists of 40 tissue samples with measurements on 1,000 genes. Additionally, I performed linear regression on the diabetes data to develop a model for predicting disease progression based on one or more of the 10 baseline variables. I split the data into training and validation sets and computed the test error for a multiple linear regression model, a 6-predictor model, and a ridge regression model, using cross-validation to identify the optimal tuning parameter. I also plotted the regression coefficients for different values of the tuning parameter and compared the test errors to determine the best model.
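As an illustration of choosing a ridge tuning parameter by cross-validation, here is a deliberately small sketch with a single centred predictor and no intercept, where the ridge slope has a simple closed form. The project itself used 10 baseline variables and R; the function names, the fold assignment, and the toy data are assumptions made for this example.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for one centred predictor with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared prediction error over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        slope = ridge_slope(train_x, train_y, lam)
        total += sum((ys[i] - slope * xs[i]) ** 2 for i in held_out)
    return total / n


def best_lambda(xs, ys, candidates, k=5):
    """Return the candidate tuning parameter with the lowest CV error."""
    return min(candidates, key=lambda lam: cv_error(xs, ys, lam, k))


# Toy noise-free data (y = 3x), so no shrinkage should be preferred.
toy_x = [float(v) for v in range(-5, 5)]
toy_y = [3.0 * v for v in toy_x]
print(best_lambda(toy_x, toy_y, [0.0, 1.0, 10.0]))
```

With noisy or correlated predictors, a non-zero tuning parameter often wins because shrinkage trades a little bias for lower variance.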
In this project, I built a classifier for identifying benign or malignant breast tissue samples based on cytological characteristics. The dataset contains 699 tissue samples, each classified as benign or malignant, with nine cytological characteristics measured on a 1 to 10 scale. The project begins with exploratory data analysis, including data cleaning and numerical and graphical summaries. Various classification methods, such as logistic regression, regularized logistic regression, and discriminant analysis, were implemented and compared on predictive accuracy (or error rate) to identify the best classification model.

Skills demonstrated: Data pre-processing, exploratory data analysis, classification methodologies (logistic regression, regularized logistic regression, discriminant analysis), data visualization, statistical analysis.

Tools and technologies used: R programming language, RStudio IDE, mlbench package, ggplot2 package, caret package, glmnet package, MASS package.

Outcome: Linear Discriminant Analysis was found to be the best classification model for this dataset, with a slight drop in accuracy for the 6-predictor model. However, when using priors that better reflect the real-world prevalence of breast cancer, Quadratic Discriminant Analysis performed best in terms of classification accuracy.
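To show why the choice of priors can change which class a sample is assigned to, the sketch below evaluates the textbook one-dimensional LDA discriminant, x·μ/σ² − μ²/(2σ²) + log(π), for two classes. The same log-prior term appears in the QDA discriminant. The class means, shared variance, and priors are hypothetical values chosen for illustration, not estimates from the breast tissue data.

```python
import math


def lda_discriminant(x, mean, variance, prior):
    """One-dimensional LDA discriminant score for a single class."""
    return x * mean / variance - mean ** 2 / (2 * variance) + math.log(prior)


def classify(x, class_params, variance):
    """Assign x to the class with the highest discriminant score."""
    return max(
        class_params,
        key=lambda label: lda_discriminant(
            x, class_params[label]["mean"], variance, class_params[label]["prior"]
        ),
    )


# Hypothetical class means on a 1 to 10 scale with a shared variance of 4.
equal_priors = {
    "benign": {"mean": 2.0, "prior": 0.5},
    "malignant": {"mean": 8.0, "prior": 0.5},
}
skewed_priors = {
    "benign": {"mean": 2.0, "prior": 0.9},
    "malignant": {"mean": 8.0, "prior": 0.1},
}
print(classify(5.5, equal_priors, variance=4.0))
print(classify(5.5, skewed_priors, variance=4.0))
```

With equal priors the decision boundary sits midway between the two means. Lowering the malignant prior moves the boundary towards the malignant mean, so a borderline sample can switch class.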
Here are some of the other projects that I have worked on.