Introduction
Following last week’s data science exploration, the next logical step was to dive into machine learning, specifically its classical foundations. While artificial intelligence encompasses a broad range of techniques (most of the AI systems you interact with today are built on far more complex deep neural networks), this post focuses purely on traditional machine learning methods.
Rather than deep learning, which will get its own dedicated post, I explored the fundamental algorithms that have shaped the field. These classical approaches form the backbone of many modern AI systems and are essential for understanding how machines recognize patterns, make predictions, and classify data. This post provides a high-level overview, with a more detailed breakdown available on my Machine Learning Projects page.
Key Takeaways
- Scikit-learn: Gained proficiency in using Scikit-learn for implementing classical machine learning algorithms.
- Model Evaluation Metrics: Learned to evaluate models using metrics like accuracy, precision, recall, and F1-score.
- Feature Engineering: Developed skills in creating and selecting meaningful features to improve model performance.
- Natural Language Processing: Explored techniques for processing and analyzing textual data.
- Problem Solving: Enhanced problem-solving abilities by tackling real-world machine learning challenges.
- Statistical Methods: Applied statistical techniques to analyze data and validate model assumptions.
Essential Tools & Libraries
This section gives a brief introduction to the tools and libraries I used for this work, each accompanied by a sample code snippet:
Scikit-learn
The core library for machine learning, used for regression, classification, and ensemble methods like random forests. It also provided utilities for feature scaling, model evaluation, and pipeline creation.
# Example code for Scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load a sample dataset so the snippet runs end to end
X, y = load_iris(return_X_y=True)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a random forest with 100 trees
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate accuracy on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Pandas
Essential for data manipulation, cleaning, and exploratory data analysis (EDA). It was particularly useful in handling datasets like LendingClub and Yelp reviews.
# Example code for Pandas
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
# Inspect the first five rows
print(df.head())
# Drop rows with missing values
df = df.dropna()
NumPy
Used for numerical operations, especially when working with arrays and performing mathematical computations in preprocessing steps.
# Example code for NumPy
import numpy as np
# Create array
arr = np.array([1, 2, 3, 4, 5])
# Perform operations
mean = np.mean(arr)
print(f"Mean: {mean}")
Matplotlib & Seaborn
Key visualization tools for plotting relationships in the data, checking distributions, and interpreting model results.
# Example code for Matplotlib & Seaborn
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Generate sample data so the snippet is self-contained
data = np.random.randn(1000)
# Plot the distribution with a kernel density estimate
sns.histplot(data, kde=True)
plt.show()
NLTK & String Library
For natural language processing, NLTK was used to remove stop words and preprocess text, while Python’s built-in string module helped strip punctuation.
# Example code for NLTK
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop-word list
text = "Sample text for NLP preprocessing."
stop_words = set(stopwords.words('english'))
# Strip punctuation from each word, then drop stop words
words = [word.strip(string.punctuation) for word in text.split()]
cleaned_text = ' '.join(word for word in words if word and word.lower() not in stop_words)
print(cleaned_text)
TF-IDF Vectorization
A crucial technique for converting text data into numerical features for the NLP project.
# Example code for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# A tiny corpus so the snippet runs on its own
corpus = ["the food was great", "the service was slow", "great food, great service"]
# Convert text into TF-IDF weighted numerical features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
Naïve Bayes Classifier
Applied for text classification in the NLP project due to its efficiency in handling text-based tasks.
# Example code for Naïve Bayes
from sklearn.naive_bayes import MultinomialNB
# Train on TF-IDF features (X_train/y_train come from a prior split like the one above)
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict labels for unseen text
predictions = model.predict(X_test)
print(predictions)
Pipeline (Scikit-learn)
Used to streamline preprocessing and modeling steps, making it easier to test different configurations without repetitive code.
# Example code for Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Chain scaling and modeling into a single estimator
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
# Fit once; the scaler and model are trained together
# (X_train/y_train from a split like the Scikit-learn example above)
pipeline.fit(X_train, y_train)
# Predictions automatically reuse the fitted scaler
predictions = pipeline.predict(X_test)
Capstone Projects Overview
Over the past week, I worked on capstone projects built with Scikit-learn. The paragraphs below give a quick refresher on the algorithms involved, followed by practical examples of how they were applied in my capstone projects.
Linear regression, a foundational model, predicts numerical outcomes by establishing a linear relationship between variables. Logistic regression extends this approach for binary classification by applying a sigmoid function to estimate probabilities.
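To make the contrast concrete, here is a minimal sketch of both models in Scikit-learn, using randomly generated data (the numbers are illustrative only):
# Sketch: linear vs. logistic regression on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y_numeric = 3 * X.ravel() + rng.normal(scale=0.5, size=100)  # continuous target
y_binary = (X.ravel() > 0).astype(int)                       # binary target
# Linear regression predicts a numeric outcome directly
linear = LinearRegression().fit(X, y_numeric)
print(linear.predict([[1.0]]))
# Logistic regression passes the linear output through a sigmoid
logistic = LogisticRegression(max_iter=1000).fit(X, y_binary)
print(logistic.predict_proba([[1.0]]))  # estimated class probabilities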
Supervised learning includes techniques like K-Nearest Neighbors (KNN), a non-parametric method that classifies or predicts based on the proximity of data points in the feature space. Decision Trees and Random Forests are tree-based models used for classification and regression, with decision trees employing a hierarchical structure and random forests enhancing accuracy through ensemble learning.
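A quick sketch comparing these supervised methods side by side on the built-in iris dataset (exact scores will vary with the split):
# Sketch: KNN, a decision tree, and a random forest on the same data
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for model in (KNeighborsClassifier(n_neighbors=5),
              DecisionTreeClassifier(random_state=42),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))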
Support Vector Machines (SVMs) are powerful classification models that identify an optimal hyperplane to separate data in high-dimensional spaces.
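A minimal sketch, using synthetic data from Scikit-learn's make_classification helper:
# Sketch: an SVM separating synthetic classes
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# The RBF kernel implicitly maps data into a higher-dimensional space
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X, y)
print(svm.score(X, y))  # training accuracy, for illustration only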
Unsupervised learning includes methods like K-Means Clustering, which groups data into clusters based on similarity, and Principal Component Analysis (PCA), a dimensionality reduction technique used to simplify datasets while preserving essential information.
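A brief sketch of both unsupervised techniques applied to the iris features:
# Sketch: K-Means clustering and PCA on the same dataset
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X, _ = load_iris(return_X_y=True)
# Group observations into three clusters by similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
# Compress four features into two components, keeping most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(labels[:10], pca.explained_variance_ratio_)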
These classical models can also be applied in Natural Language Processing (NLP) for tasks such as sentiment analysis and text classification.
Featured Projects
Customer Spending Analysis
Analyzed customer data to determine whether the business should prioritize its mobile app or website. Using Linear Regression, I explored spending patterns and correlations between factors like in-store consultations, app usage, and online orders to provide actionable insights.
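The sketch below illustrates the general approach; the file name and column names are hypothetical stand-ins for the actual dataset:
# Sketch of the regression setup; file and column names are hypothetical
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv('customers.csv')  # hypothetical dataset
features = ['Time on App', 'Time on Website', 'Avg. Session Length', 'Length of Membership']
X, y = df[features], df['Yearly Amount Spent']
model = LinearRegression().fit(X, y)
# Coefficients indicate how strongly each channel relates to spending
print(pd.Series(model.coef_, index=features))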
Key Findings:
- Identified a strong correlation between app usage and total spending.
- In-store consultations were a significant predictor of high-value customers.
- Website usage showed diminishing returns compared to mobile app engagement.
For detailed analysis and complete code, visit the Customer Spending Analysis page.
Ad Click Prediction
Built a model to predict whether users would click on ads based on their online activity. This classification problem used Logistic Regression, leveraging features such as user demographics, internet usage, and ad metadata to evaluate ad targeting strategies.
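A simplified sketch of the setup; the file and feature names here are hypothetical:
# Sketch of the classification setup; file and column names are hypothetical
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
df = pd.read_csv('advertising.csv')  # hypothetical dataset
features = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
X, y = df[features], df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Precision, recall, and F1 give a fuller picture than accuracy alone
print(classification_report(y_test, model.predict(X_test)))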
Key Findings:
- Achieved 85% accuracy in predicting ad clicks.
- Identified key demographic groups with higher click-through rates.
- Optimized ad targeting strategies based on user behavior patterns.
For detailed analysis and complete code, visit the Ad Click Prediction page.
Loan Repayment Prediction
Used LendingClub data to predict whether borrowers would repay their loans in full. Employing a Random Forest classification model, I incorporated features like credit scores, income levels, debt-to-income ratios, and payment histories to simulate decision-making for lending investments.
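A condensed sketch of the approach; the file name and target column are hypothetical stand-ins for the LendingClub data:
# Sketch of the random forest setup; file and column names are hypothetical
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv('loan_data.csv')  # hypothetical dataset
X = df.drop('fully_paid', axis=1)  # hypothetical target column
y = df['fully_paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
# Feature importances hint at which factors flag risky borrowers
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))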
Key Findings:
- Achieved 90% accuracy in predicting loan repayment outcomes.
- Identified high-risk borrowers based on debt-to-income ratios.
- Provided actionable insights for improving lending strategies.
For detailed analysis and complete code, visit the Loan Repayment Prediction page.
Movie Recommender System
Explored Recommender Systems through a detailed walkthrough focused on movie recommendations. This exercise introduced me to advanced techniques such as collaborative filtering and matrix factorization, though it came with challenges due to its reliance on linear algebra and structured datasets.
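To illustrate the matrix factorization idea, here is a toy sketch on a tiny user-item ratings matrix; a real system would work with far larger, sparser data:
# Toy sketch: low-rank matrix factorization on a tiny ratings matrix
import numpy as np
# Rows are users, columns are movies; 0 means "not yet rated"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
# Factor R via SVD and keep only the two strongest latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# The reconstruction fills in estimates for the unrated entries
print(np.round(R_hat, 2))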
Key Findings:
- Learned the fundamentals of collaborative filtering.
- Implemented a basic recommendation engine using matrix factorization.
- Gained insights into the challenges of sparse datasets.
For detailed analysis and complete code, visit the Movie Recommender System page.
Sentiment Analysis on Yelp Reviews
Classified Yelp reviews as 1-star or 5-star based on their text content. Using NLP methods, I built classification models on top of word-frequency features and text vectorization, and assembling the sentiment classifier with Scikit-learn pipelines kept preprocessing efficient and scalable.
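A minimal sketch of such a pipeline, with toy training data standing in for the real Yelp reviews:
# Sketch: a text-classification pipeline; the training data here is toy data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
train_reviews = ["Terrible service, never again", "Absolutely loved it!"]  # toy stand-ins
train_stars = [1, 5]
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
# Raw text goes in; vectorization and classification happen inside the pipeline
pipeline.fit(train_reviews, train_stars)
print(pipeline.predict(["The food was amazing!"]))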
Key Findings:
- Achieved 88% accuracy in classifying Yelp reviews.
- Identified key sentiment indicators in textual data.
- Streamlined preprocessing and modeling using Scikit-learn pipelines.
For detailed analysis and complete code, visit the Sentiment Analysis page.
Key Insights & Future Directions
Throughout these projects, I encountered several challenges. For instance, in the NLP project, handling text preprocessing efficiently was tricky—removing stop words helped improve accuracy, but too much cleaning risked losing valuable context. Choosing the right machine learning model was another challenge, especially in classification tasks where trade-offs between interpretability and accuracy had to be considered. Additionally, working with real-world data, particularly in the LendingClub project, required extensive data cleaning due to missing values and imbalanced classes.
However, these challenges provided valuable learning experiences, deepening my understanding of data preparation, model selection, and feature engineering.
Looking ahead, I plan to explore advanced deep learning techniques, such as neural networks and transformers, to tackle more complex problems. Additionally, I aim to enhance my skills in deploying machine learning models to production environments and working with big data technologies.
If you're interested in collaborating or learning more about my work, feel free to reach out or explore my complete portfolio!