Sentiment Analysis Project

Sentiment Analysis on Amazon Customer Reviews

Bachelor's Thesis • Machine Learning • Natural Language Processing • Python

Project Overview

This Bachelor's thesis project, titled "Machine Learning: A powerful tool for enhancing customer experience through sentiment analysis", explores how businesses can use sentiment analysis to interpret large volumes of Amazon customer reviews. It investigates how automatically classifying review text as positive, negative, or neutral helps companies quickly identify trends, product strengths, and areas for improvement, and so support data-driven decisions.

Technical Stack

  • Programming Language: Python 3.7+
  • Data Processing: pandas, numpy
  • Machine Learning: scikit-learn
  • Lexicon Analysis: vaderSentiment
  • Dataset: Amazon Customer US Reviews Dataset (2015-2020)
  • Text Vectorization: TF-IDF, CountVectorizer

Model Performance Results

Logistic Regression

Two-Class: 93%
Three-Class: 86%
Approach: Maximum Entropy

VADER Analyzer

Two-Class: 90%
Three-Class: 78%
Approach: Lexicon & Rule-Based

Decision Tree

Two-Class: 88%
Three-Class: 80%
Approach: Rule-Based

Sentiment Analysis Pipeline

01

Data Collection

Amazon Customer Reviews
Dataset: 2015-2020
20 Product Categories

02

Preprocessing

Text Cleaning
Star Rating Mapping
Data Normalization
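
The preprocessing step can be illustrated with a minimal sketch. The cleaning rules and the star thresholds below (1-2 negative, 3 neutral, 4-5 positive) are assumptions chosen for illustration; the page does not state the exact mapping used in the thesis. The function names are hypothetical.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, drop HTML tags and non-letter characters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def stars_to_label(stars: int, three_class: bool = True):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    # 3-star reviews: neutral in the three-class setup, dropped (None) otherwise
    return "neutral" if three_class else None

print(clean_text("Great <br /> product!!! 10/10"))
print(stars_to_label(3), stars_to_label(3, three_class=False))
```

In a two-class setup under these assumptions, reviews whose label comes back as None would simply be filtered out before training.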

03

Feature Extraction

TF-IDF Vectorization
CountVectorizer
Text to Numerical
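
To make the text-to-numerical step concrete, here is a small pure-Python sketch of both representations: raw term counts (the idea behind CountVectorizer) and TF-IDF weights. It uses the smoothed idf formula ln((1 + n) / (1 + df)) + 1, which is scikit-learn's default, but omits the L2 normalization that TfidfVectorizer also applies by default. The tokenization (whitespace split) and function names are simplifying assumptions, not the thesis code.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Bag-of-words counts per document over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    return vocab, [[c[term] for term in vocab] for c in counts]

def tfidf_vectors(docs):
    """TF-IDF weights with smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    vocab, counts = count_vectors(docs)
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    return vocab, [[tf * w for tf, w in zip(row, idf)] for row in counts]

docs = ["good phone", "bad phone", "good good battery"]
vocab, counts = count_vectors(docs)
_, weights = tfidf_vectors(docs)
print(vocab)
print(counts)
```

Terms that appear in fewer documents (here "bad" and "battery") receive a larger idf than terms shared across documents ("good", "phone"), which is the property that lets TF-IDF downweight common words.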

04

Model Training

8 ML Algorithms
80% Training Data
Cross-Validation
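
The training setup described above (80% training data, cross-validation) might look roughly like the following scikit-learn sketch. The toy texts, the stratified split, the random seed, the choice of TF-IDF with Logistic Regression, and the two folds are all assumptions made so the example runs on ten sentences; the actual thesis configuration is not given on this page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real reviews and labels are not reproduced here.
texts = [
    "great product works perfectly", "love it excellent quality",
    "fantastic value very happy", "works great highly recommend",
    "excellent purchase love the design", "terrible quality broke quickly",
    "awful waste of money", "very disappointed stopped working",
    "poor build bad battery", "horrible product do not buy",
]
labels = ["positive"] * 5 + ["negative"] * 5

# 80% of the data for training, 20% held out for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorizer and classifier in one pipeline, so each CV fold refits both
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=2)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(cv_scores, predictions)
```

Keeping the vectorizer inside the pipeline means its vocabulary is learned only from the training portion of each fold, which avoids leaking information from held-out reviews.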

05

Evaluation

20% Test Data
Performance Metrics
Model Comparison
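
As an illustration of the evaluation step, the sketch below computes accuracy, a confusion count, and per-class recall on a held-out set. The page only says "Performance Metrics", so the specific metrics and the example labels here are assumptions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred):
    """For each true class, the share of its examples predicted correctly."""
    pairs = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {label: pairs[(label, label)] / totals[label] for label in totals}

y_true = ["positive", "positive", "neutral", "negative", "negative"]
y_pred = ["positive", "positive", "positive", "negative", "neutral"]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

Per-class recall is useful next to overall accuracy because a model can score well overall while missing most examples of a small class such as neutral.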

Key Research Findings

Best Performance

93%

Logistic Regression achieved the highest accuracy for two-class sentiment analysis (93%), ahead of VADER (90%) and Decision Tree (88%) among the models reported here.

Lexicon Approach

90%

The VADER sentiment analyzer reached 90% accuracy on binary classification, within three points of the best model reported here, showing that a lexicon and rule-based method can be competitive with trained classifiers.
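
VADER produces a normalized compound score between -1 and 1 for each text, which then has to be turned into a class label. The sketch below uses the ±0.05 thresholds suggested in the VADER documentation for the three-class case; the split at zero for the two-class case is an assumption of this sketch, and the page does not say which thresholds the thesis used. The function name is hypothetical.

```python
def label_from_compound(compound: float, three_class: bool = True) -> str:
    """Turn a compound score in [-1, 1] into a sentiment label."""
    if three_class:
        if compound >= 0.05:
            return "positive"
        if compound <= -0.05:
            return "negative"
        return "neutral"
    # Two-class variant: split at zero (an assumption of this sketch)
    return "positive" if compound >= 0 else "negative"

for score in (0.6, 0.0, -0.3):
    print(score, label_from_compound(score), label_from_compound(score, three_class=False))
```

With the vaderSentiment package installed, the compound value would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`.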

Multi-Category Training

Robust

Training on diverse product categories yields more robust models, reducing category bias and improving generalization across domains.

Neutral Class Impact

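-7 pts

Adding a neutral sentiment class lowers accuracy for every model reported here: Logistic Regression drops from 93% to 86% (7 percentage points), and the other two models drop by 8 to 12 points. This highlights how hard it is to identify mixed or ambiguous sentiment.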
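
The size of the neutral-class effect can be checked directly from the accuracies listed under Model Performance Results. The short calculation below uses only those reported figures; the dictionary layout is just a convenient way to hold them.

```python
# Accuracies (in percent) as reported in the results above.
results = {
    "Logistic Regression": {"two_class": 93, "three_class": 86},
    "VADER Analyzer": {"two_class": 90, "three_class": 78},
    "Decision Tree": {"two_class": 88, "three_class": 80},
}

# Drop in percentage points when the neutral class is added
drops = {name: r["two_class"] - r["three_class"] for name, r in results.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop} percentage points")
```

The drop is smallest for Logistic Regression and largest for VADER.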

Cross-Category Analysis

Varies

Model performance varies significantly when testing across different product categories, emphasizing the importance of domain-specific training.
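
A cross-category experiment of this kind can be sketched as a grid: train on one category, test on every category, and compare the scores. The word-vote classifier below is a deliberately tiny stand-in, not one of the thesis models, and the category names and reviews are hypothetical toy data.

```python
from collections import Counter, defaultdict

def train_word_votes(reviews):
    """Learn, per word, which label it co-occurs with most often."""
    votes = defaultdict(Counter)
    for text, label in reviews:
        for word in text.lower().split():
            votes[word][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in votes.items()}

def predict(model, text, default="positive"):
    """Majority vote over known words; fall back to a default label."""
    labels = Counter(model[w] for w in text.lower().split() if w in model)
    return labels.most_common(1)[0][0] if labels else default

def cross_category_accuracy(data):
    """Train on each category, test on every category; return an accuracy grid."""
    grid = {}
    for train_cat, train_reviews in data.items():
        model = train_word_votes(train_reviews)
        for test_cat, test_reviews in data.items():
            hits = sum(predict(model, t) == y for t, y in test_reviews)
            grid[(train_cat, test_cat)] = hits / len(test_reviews)
    return grid

data = {
    "books": [("gripping story loved it", "positive"), ("boring plot slow", "negative")],
    "electronics": [("battery died fast", "negative"), ("sharp screen loved it", "positive")],
}
grid = cross_category_accuracy(data)
for (train_cat, test_cat), acc in grid.items():
    print(f"train={train_cat:12s} test={test_cat:12s} accuracy={acc:.2f}")
```

Even in this toy setting, a model trained on one category does worse on the other because category-specific words such as "battery" or "plot" are unseen at test time.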
