An advanced web-based Natural Language Processing (NLP) mini-project that delivers personalized book recommendations by combining popularity metrics, collaborative filtering, and NLP-driven text analysis. Developed as part of the SEM-VII NLP Lab, this system demonstrates how NLP techniques and machine learning algorithms can work together to enhance content discovery in a modern web application.
Readers today face an ever-growing library of options. To help users find books aligned with their interests, this project leverages:
- Popularity-Based Recommendations: surfaces the top 50 books by aggregate ratings and votes.
- Collaborative Filtering: identifies books similar to a user's favorite based on collective user preferences.
- NLP Techniques: preprocesses and analyzes textual metadata (titles, descriptions) to enrich the recommendation pipeline and ensure robust data handling.
The entire application is built using Flask for backend routing and templating, with a Bootstrap frontend for responsive design. Models are trained and serialized in Google Colab, then deployed locally via Flask.
- Web Application Development: Build a Flask-based interface where users can browse popular titles and request personalized suggestions.
- NLP Pipeline Implementation: Apply text-processing steps (lowercasing, stop-word removal) to book metadata to ensure clean, structured inputs for modeling.
- Machine Learning Integration: Train and serialize:
  - A popularity model (top-50 extraction)
  - A collaborative filtering model (cosine similarity on user-item ratings)
- Model Serialization for Fast Retrieval: Use Pickle to store precomputed matrices (`popular.pkl`, `similarity_scores.pkl`, `pt.pkl`) for sub-second recommendations (see the loading sketch after this list).
- Responsive UI/UX: Design HTML templates (`index.html`, `recommend.html`) with Bootstrap to provide an intuitive and mobile-friendly experience.
- Scalability & Extensibility: Structure code and data storage to accommodate growing datasets, additional NLP models, and future feature extensions (e.g., user authentication, real-time feedback).
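For orientation, a minimal loading sketch is shown below, assuming the three pickle files produced in Colab sit next to the Flask app; the variable names (`popular_df`, `pt`, `similarity_scores`) are illustrative, not prescribed by the project.

```python
import pickle

# Load the precomputed artifacts once at startup (file names from the list above;
# variable names are illustrative). Request handlers then only do in-memory lookups.
popular_df = pickle.load(open("popular.pkl", "rb"))                   # top-50 popularity table
pt = pickle.load(open("pt.pkl", "rb"))                                # books x users pivot table
similarity_scores = pickle.load(open("similarity_scores.pkl", "rb"))  # cosine similarity matrix
```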
Challenge: Users are overwhelmed by millions of titles and lack effective tools to navigate this content.
Solution: Combine popularity insights with personalized, NLP-enhanced algorithms to recommend books that match both general trends and individual tastes.
Before running the project, ensure you have the following:
- Programming Language: Python 3.x
- Development Platform:
  - Model Training: Google Colab (cloud-based Jupyter notebooks)
  - App Deployment: Local machine (Windows 10/11 or Ubuntu 20.04+)
- Backend Framework: Flask
- Frontend: HTML5, CSS3, Bootstrap 3.3.7
- Libraries & Packages:
  - Data Handling & ML: pandas, numpy, scikit-learn
  - Serialization: pickle
  - (Optional for extended NLP): NLTK, spaCy
- Datasets: CSV files (`books.csv`, `users.csv`, `ratings.csv`)
- Version Control: Git & GitHub for code and report hosting
- Data Layer
  - Books Dataset: ID, title, author, description, image URL
  - Users Dataset: ID, demographic info (for future NLP persona modeling)
  - Ratings Dataset: User-book rating pairs
- Business Logic Layer
  - NLP Preprocessor:
    - Lowercasing, punctuation removal
    - (Optional) Tokenization & stop-word removal
  - Popularity Model: Aggregates number of ratings & mean rating → Top-50 list → `popular.pkl`
  - Collaborative Filtering Model: Pivot table (books × users) → Cosine similarity matrix → `similarity_scores.pkl`
- Presentation Layer
  - Routes:
    - `/` → Renders `index.html` with the Top-50 books
    - `/recommend` (GET) → Shows the search form
    - `/recommend_books` (POST) → Returns personalized suggestions
  - Templates: Bootstrap cards display book covers, titles, authors, and ratings
| Dataset | Description |
|---|---|
| `books.csv` | Contains metadata: `Book-Title`, `Book-Author`, `Image-URL-M`, `Description` |
| `users.csv` | Contains simple user IDs and demographic fields |
| `ratings.csv` | Contains `User-ID`, `Book-ID`, and `Rating` |
All datasets sourced from Kaggle's "Book Recommendation Dataset."
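As a quick sanity check, the three CSVs can be loaded with pandas. The column names below follow the table above; the raw Kaggle export may differ slightly, so treat this as an illustrative snippet.

```python
import pandas as pd

# Load the raw CSVs (paths assumed relative to the notebook / project root).
books = pd.read_csv("books.csv")
users = pd.read_csv("users.csv")
ratings = pd.read_csv("ratings.csv")

# Quick sanity checks on shape and key columns.
print(books[["Book-Title", "Book-Author"]].head())
print(users.shape, ratings.shape)
```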
- Objective: Surface books with broad appeal.
- Pipeline:
  1. Aggregate `ratings.csv` to compute total votes and average rating per book.
  2. Filter out books with fewer than 250 votes (to ensure statistical reliability).
  3. Sort by descending average rating.
  4. Serialize the Top-50 into `popular.pkl` (a pandas sketch follows this list).
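A sketch of this pipeline in pandas, assuming the column names from the dataset table (`Book-ID`, `Rating`) and that `books.csv` shares the `Book-ID` join key; it illustrates the steps above rather than reproducing the exact notebook code.

```python
import pickle
import pandas as pd

ratings = pd.read_csv("ratings.csv")
books = pd.read_csv("books.csv")

# Total votes and mean rating per book.
stats = (
    ratings.groupby("Book-ID")["Rating"]
    .agg(num_ratings="count", avg_rating="mean")
    .reset_index()
)

# Keep statistically reliable books (>= 250 votes), sort by average rating,
# take the 50 best, and attach display metadata (join key assumed to be Book-ID).
popular_df = (
    stats[stats["num_ratings"] >= 250]
    .sort_values("avg_rating", ascending=False)
    .head(50)
    .merge(books, on="Book-ID")
)

# Serialize for the Flask app.
with open("popular.pkl", "wb") as f:
    pickle.dump(popular_df, f)
```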
- Objective: Recommend titles similar to a user-selected book.
- Pipeline:
  1. Build a pivot table (`pt.pkl`): rows = book titles, columns = user IDs, values = ratings.
  2. Compute pairwise cosine similarity on this sparse matrix.
  3. Store similarity scores in `similarity_scores.pkl` for O(1) lookup.
  4. On user input, retrieve the top-4 most similar books (excluding the query itself); a sketch follows this list.
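A sketch of the collaborative-filtering pipeline using scikit-learn's cosine similarity. The join key and column names are assumed from the dataset table, and the `recommend` helper mirrors the top-4 lookup described above.

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Attach titles to ratings so the pivot table can be indexed by Book-Title
# (join key assumed to be Book-ID).
ratings = pd.read_csv("ratings.csv").merge(pd.read_csv("books.csv"), on="Book-ID")

# Pivot: rows = book titles, columns = user IDs, values = ratings (missing -> 0).
pt = ratings.pivot_table(index="Book-Title", columns="User-ID", values="Rating").fillna(0)

# Pairwise cosine similarity between book vectors, persisted for O(1) lookup at request time.
similarity_scores = cosine_similarity(pt)
pickle.dump(pt, open("pt.pkl", "wb"))
pickle.dump(similarity_scores, open("similarity_scores.pkl", "wb"))

def recommend(book_title: str, k: int = 4) -> list[str]:
    """Return the k most similar titles, excluding the query itself."""
    idx = np.where(pt.index == book_title)[0][0]
    neighbours = sorted(enumerate(similarity_scores[idx]), key=lambda x: x[1], reverse=True)[1 : k + 1]
    return [pt.index[i] for i, _ in neighbours]
```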
- Data Cleaning: Lowercasing and stripping HTML tags/entities from descriptions.
- (Optional) Tokenization & Stop-Word Removal: Prepare text for advanced NLP modules (e.g., embedding-based similarity).
- Future Scope: TF-IDF vectorization or transformer embeddings (spaCy, BERT) to enable content-based recommendations alongside collaborative methods.
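A minimal cleaning sketch using NLTK for the optional stop-word step; the function name and regexes are illustrative, not the project's fixed implementation.

```python
import re
import nltk
from nltk.corpus import stopwords

# Stop-word removal requires the NLTK corpus (downloaded once).
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_description(text: str) -> str:
    """Lowercase, strip HTML tags/entities and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)    # strip HTML tags
    text = re.sub(r"&\w+;", " ", text)      # strip HTML entities (&amp;, &quot;, ...)
    text = re.sub(r"[^a-z\s]", " ", text)   # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_description("<p>An &amp; unforgettable JOURNEY through time!</p>"))
# -> "unforgettable journey time"
```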
- Model Training (Google Colab)
  - Load raw CSVs → clean & normalize → train popularity & collaborative filtering models → serialize with Pickle.
- Flask App (Local)
  - Load `popular.pkl`, `pt.pkl`, `similarity_scores.pkl`.
  - Define routes (`/`, `/recommend`, `/recommend_books`); see the app sketch after this list.
  - Render templates with data passed from backend logic.
- Templates & Styling
  - `index.html`: Displays the Top-50 in a 4-column Bootstrap grid.
  - `recommend.html`: Search form + recommendation cards; includes client-side validation/clear functionality.
- Error Handling & UX
  - Alerts for invalid or missing user input.
  - Graceful messages if no similar books are found.
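A condensed sketch of this Flask layer, under stated assumptions: the form field name (`user_input`), the template context variables, and the fallback message are placeholders, and the rendering details depend on how `index.html` and `recommend.html` are written.

```python
import pickle
import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Artifacts produced in the Colab training notebook (paths assumed relative to app.py).
popular_df = pickle.load(open("popular.pkl", "rb"))
pt = pickle.load(open("pt.pkl", "rb"))
similarity_scores = pickle.load(open("similarity_scores.pkl", "rb"))

@app.route("/")
def index():
    # Top-50 list rendered as Bootstrap cards in index.html.
    return render_template("index.html", books=popular_df.to_dict(orient="records"))

@app.route("/recommend")
def recommend_ui():
    # Empty search form.
    return render_template("recommend.html")

@app.route("/recommend_books", methods=["POST"])
def recommend_books():
    title = request.form.get("user_input", "").strip()
    if title not in pt.index:
        # Graceful message when the query has no match in the pivot table.
        return render_template("recommend.html", error="No similar books found for that title.")
    idx = np.where(pt.index == title)[0][0]
    neighbours = sorted(enumerate(similarity_scores[idx]), key=lambda x: x[1], reverse=True)[1:5]
    suggestions = [pt.index[i] for i, _ in neighbours]
    return render_template("recommend.html", books=suggestions)

if __name__ == "__main__":
    app.run(debug=True)
```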
- Content-Based NLP Models: Incorporate TF-IDF vectors or BERT embeddings on `Description` to recommend by semantic similarity (a TF-IDF sketch follows this list).
- Hybrid Recommender: Combine collaborative filtering, popularity, and content-based scores.
- User Profiles & Feedback Loops: Enable sign-in, track user history, and refine recommendations over time.
- Multilingual Descriptions: Extend the NLP pipeline to support non-English metadata.
- Real-Time Updates: Auto-retrain/recompute similarity as new ratings arrive.
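For the content-based extension, a hypothetical TF-IDF baseline could look like the following; it is not part of the current system, and the `Description` and `Book-Title` columns are assumed from the dataset table.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF over book descriptions, then cosine similarity between description vectors.
books = pd.read_csv("books.csv")
descriptions = books["Description"].fillna("")

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
doc_vectors = tfidf.fit_transform(descriptions)
content_similarity = cosine_similarity(doc_vectors)

def similar_by_description(title: str, k: int = 4) -> list[str]:
    """Return the k titles whose descriptions are most similar to the query book's."""
    idx = books.index[books["Book-Title"] == title][0]
    top = content_similarity[idx].argsort()[::-1][1 : k + 1]  # skip the query itself
    return books.iloc[top]["Book-Title"].tolist()
```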
- NLP Lab Report: SEM 7 NLP Lab Project Report
- Experiment Brief: BE NLP Ex No 9 Mini Project