Introduction
Following last week’s data science exploration, the next logical step was to dive into machine learning, specifically its classical foundations. While artificial intelligence encompasses a broad range of techniques (most of the AI systems you interact with today are built on far more complex deep neural networks), this post focuses purely on traditional machine learning methods.
Rather than deep learning, which will get its own dedicated post, I explored the fundamental algorithms that have shaped the field. These classical approaches form the backbone of many modern AI systems and are essential for understanding how machines recognize patterns, make predictions, and classify data. This post provides a high-level overview, with a more detailed breakdown available on my Machine Learning Projects page.
Key Takeaways
- Scikit-learn: Gained proficiency in using Scikit-learn for implementing classical machine learning algorithms.
- Model Evaluation Metrics: Learned to evaluate models using metrics like accuracy, precision, recall, and F1-score.
- Feature Engineering: Developed skills in creating and selecting meaningful features to improve model performance.
- Natural Language Processing: Explored techniques for processing and analyzing textual data.
- Problem Solving: Enhanced problem-solving abilities by tackling real-world machine learning challenges.
- Statistical Methods: Applied statistical techniques to analyze data and validate model assumptions.
Essential Tools & Libraries
This section gives a brief introduction to the tools and libraries I used for this work, each accompanied by a sample code snippet:
Scikit-learn
The core library for machine learning, used for regression, classification, and ensemble methods like random forests. It also provided utilities for feature scaling, model evaluation, and pipeline creation.
# Example code for Scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load a sample dataset so the snippet runs end to end
X, y = load_iris(return_X_y=True)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a random forest with 100 trees
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate accuracy on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Pandas
Essential for data manipulation, cleaning, and exploratory data analysis (EDA). It was particularly useful in handling datasets like LendingClub and Yelp reviews.
# Example code for Pandas
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
# Inspect the first five rows
print(df.head())
# Drop rows with missing values
df = df.dropna()
NumPy
Used for numerical operations, especially when working with arrays and performing mathematical computations in preprocessing steps.
# Example code for NumPy
import numpy as np
# Create array
arr = np.array([1, 2, 3, 4, 5])
# Perform operations
mean = np.mean(arr)
print(f"Mean: {mean}")
Matplotlib & Seaborn
Key visualization tools for plotting relationships in the data, checking distributions, and interpreting model results.
# Example code for Matplotlib & Seaborn
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Generate sample data so the snippet is self-contained
data = np.random.randn(1000)
# Plot the distribution with a kernel density estimate
sns.histplot(data, kde=True)
plt.show()
NLTK & String Library
For natural language processing, NLTK was used to remove stop words and preprocess text, while Python’s built-in string module helped strip punctuation.
# Example code for NLTK
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop-word list
text = "Sample text for NLP preprocessing."
stop_words = set(stopwords.words('english'))
# Strip punctuation from each word, then drop stop words
words = [word.strip(string.punctuation) for word in text.split()]
cleaned_text = ' '.join(word for word in words if word and word.lower() not in stop_words)
print(cleaned_text)
TF-IDF Vectorization
A crucial technique for converting text data into numerical features for the NLP project.
# Example code for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# A tiny corpus so the snippet runs on its own
corpus = ["the food was great", "the service was slow", "great food, great service"]
# Convert text into TF-IDF weighted numerical features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
Naïve Bayes Classifier
Applied for text classification in the NLP project due to its efficiency in handling text-based tasks.
# Example code for Naïve Bayes
from sklearn.naive_bayes import MultinomialNB
# Train on TF-IDF features (X_train/y_train come from a prior split like the one above)
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict labels for unseen text
predictions = model.predict(X_test)
print(predictions)
Pipeline (Scikit-learn)
Used to streamline preprocessing and modeling steps, making it easier to test different configurations without repetitive code.
# Example code for Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Chain scaling and modeling into a single estimator
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
# Fit once; the scaler and model are trained together
# (X_train/y_train from a split like the Scikit-learn example above)
pipeline.fit(X_train, y_train)
# Predictions automatically reuse the fitted scaler
predictions = pipeline.predict(X_test)
Capstone Projects Overview
Over the past week, I worked on capstone projects built with Scikit-learn. The paragraphs below give a quick refresher on the algorithms involved, followed by practical examples of how they were applied in my capstone projects.
Linear regression, a foundational model, predicts numerical outcomes by establishing a linear relationship between variables. Logistic regression extends this approach for binary classification by applying a sigmoid function to estimate probabilities.
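To make the contrast concrete, here is a minimal sketch of both models in Scikit-learn, using randomly generated data (the numbers are illustrative only):
# Sketch: linear vs. logistic regression on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y_numeric = 3 * X.ravel() + rng.normal(scale=0.5, size=100)  # continuous target
y_binary = (X.ravel() > 0).astype(int)                       # binary target
# Linear regression predicts a numeric outcome directly
linear = LinearRegression().fit(X, y_numeric)
print(linear.predict([[1.0]]))
# Logistic regression passes the linear output through a sigmoid
logistic = LogisticRegression(max_iter=1000).fit(X, y_binary)
print(logistic.predict_proba([[1.0]]))  # estimated class probabilities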
Supervised learning includes techniques like K-Nearest Neighbors (KNN), a non-parametric method that classifies or predicts based on the proximity of data points in the feature space. Decision Trees and Random Forests are tree-based models used for classification and regression, with decision trees employing a hierarchical structure and random forests enhancing accuracy through ensemble learning.
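A quick sketch comparing these supervised methods side by side on the built-in iris dataset (exact scores will vary with the split):
# Sketch: KNN, a decision tree, and a random forest on the same data
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for model in (KNeighborsClassifier(n_neighbors=5),
              DecisionTreeClassifier(random_state=42),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))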
Support Vector Machines (SVMs) are powerful classification models that identify an optimal hyperplane to separate data in high-dimensional spaces.
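A minimal sketch, using synthetic data from Scikit-learn's make_classification helper:
# Sketch: an SVM separating synthetic classes
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# The RBF kernel implicitly maps data into a higher-dimensional space
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X, y)
print(svm.score(X, y))  # training accuracy, for illustration only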
Unsupervised learning includes methods like K-Means Clustering, which groups data into clusters based on similarity, and Principal Component Analysis (PCA), a dimensionality reduction technique used to simplify datasets while preserving essential information.
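A brief sketch of both unsupervised techniques applied to the iris features:
# Sketch: K-Means clustering and PCA on the same dataset
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X, _ = load_iris(return_X_y=True)
# Group observations into three clusters by similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
# Compress four features into two components, keeping most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(labels[:10], pca.explained_variance_ratio_)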
These classical models can also be applied in Natural Language Processing (NLP) for tasks such as sentiment analysis and text classification.
Featured Projects
Customer Spending Analysis
Analyzed customer data to determine whether the business should prioritize its mobile app or website. Using Linear Regression, I explored spending patterns and correlations between factors like in-store consultations, app usage, and online orders to provide actionable insights.
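The sketch below illustrates the general approach; the file name and column names are hypothetical stand-ins for the actual dataset:
# Sketch of the regression setup; file and column names are hypothetical
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv('customers.csv')  # hypothetical dataset
features = ['Time on App', 'Time on Website', 'Avg. Session Length', 'Length of Membership']
X, y = df[features], df['Yearly Amount Spent']
model = LinearRegression().fit(X, y)
# Coefficients indicate how strongly each channel relates to spending
print(pd.Series(model.coef_, index=features))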
Key Findings:
- Identified a strong correlation between app usage and total spending.
- In-store consultations were a significant predictor of high-value customers.
- Website usage showed diminishing returns compared to mobile app engagement.
For detailed analysis and complete code, visit the Customer Spending Analysis page.
Ad Click Prediction
Built a model to predict whether users would click on ads based on their online activity. This classification problem used Logistic Regression, leveraging features such as user demographics, internet usage, and ad metadata to evaluate ad targeting strategies.
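A simplified sketch of the setup; the file and feature names here are hypothetical:
# Sketch of the classification setup; file and column names are hypothetical
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
df = pd.read_csv('advertising.csv')  # hypothetical dataset
features = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
X, y = df[features], df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Precision, recall, and F1 give a fuller picture than accuracy alone
print(classification_report(y_test, model.predict(X_test)))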
Key Findings:
- Achieved 85% accuracy in predicting ad clicks.
- Identified key demographic groups with higher click-through rates.
- Optimized ad targeting strategies based on user behavior patterns.
For detailed analysis and complete code, visit the Ad Click Prediction page.
Loan Repayment Prediction
Used LendingClub data to predict whether borrowers would repay their loans in full. Employing a Random Forest classification model, I incorporated features like credit scores, income levels, debt-to-income ratios, and payment histories to simulate decision-making for lending investments.
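A condensed sketch of the approach; the file name and target column are hypothetical stand-ins for the LendingClub data:
# Sketch of the random forest setup; file and column names are hypothetical
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv('loan_data.csv')  # hypothetical dataset
X = df.drop('fully_paid', axis=1)  # hypothetical target column
y = df['fully_paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
# Feature importances hint at which factors flag risky borrowers
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))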
Key Findings:
- Achieved 90% accuracy in predicting loan repayment outcomes.
- Identified high-risk borrowers based on debt-to-income ratios.
- Provided actionable insights for improving lending strategies.
For detailed analysis and complete code, visit the Loan Repayment Prediction page.
Movie Recommender System
Explored Recommender Systems through a detailed walkthrough focused on movie recommendations. This exercise introduced me to advanced techniques such as collaborative filtering and matrix factorization, though it came with challenges due to its reliance on linear algebra and structured datasets.
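To illustrate the matrix factorization idea, here is a toy sketch on a tiny user-item ratings matrix; a real system would work with far larger, sparser data:
# Toy sketch: low-rank matrix factorization on a tiny ratings matrix
import numpy as np
# Rows are users, columns are movies; 0 means "not yet rated"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
# Factor R via SVD and keep only the two strongest latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# The reconstruction fills in estimates for the unrated entries
print(np.round(R_hat, 2))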
Key Findings:
- Learned the fundamentals of collaborative filtering.
- Implemented a basic recommendation engine using matrix factorization.
- Gained insights into the challenges of sparse datasets.
For detailed analysis and complete code, visit the Movie Recommender System page.
Sentiment Analysis on Yelp Reviews
Classified Yelp reviews as 1-star or 5-star based on their text content. Using NLP methods, I built classification models on top of word-frequency features and text vectorization, and assembling the sentiment classifier with Scikit-learn pipelines kept preprocessing efficient and scalable.
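A minimal sketch of such a pipeline, with toy training data standing in for the real Yelp reviews:
# Sketch: a text-classification pipeline; the training data here is toy data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
train_reviews = ["Terrible service, never again", "Absolutely loved it!"]  # toy stand-ins
train_stars = [1, 5]
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
# Raw text goes in; vectorization and classification happen inside the pipeline
pipeline.fit(train_reviews, train_stars)
print(pipeline.predict(["The food was amazing!"]))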
Key Findings:
- Achieved 88% accuracy in classifying Yelp reviews.
- Identified key sentiment indicators in textual data.
- Streamlined preprocessing and modeling using Scikit-learn pipelines.
For detailed analysis and complete code, visit the Sentiment Analysis page.
Key Insights & Future Directions
Throughout these projects, I encountered several challenges. For instance, in the NLP project, handling text preprocessing efficiently was tricky—removing stop words helped improve accuracy, but too much cleaning risked losing valuable context. Choosing the right machine learning model was another challenge, especially in classification tasks where trade-offs between interpretability and accuracy had to be considered. Additionally, working with real-world data, particularly in the LendingClub project, required extensive data cleaning due to missing values and imbalanced classes.
However, these challenges provided valuable learning experiences, deepening my understanding of data preparation, model selection, and feature engineering.
Looking ahead, I plan to explore advanced deep learning techniques, such as neural networks and transformers, to tackle more complex problems. Additionally, I aim to enhance my skills in deploying machine learning models to production environments and working with big data technologies.
If you're interested in collaborating or learning more about my work, feel free to reach out or explore my complete portfolio!