Back to Projects

Development of a Product Recommendation System Using Web Scraping 10.000+ Data, PySpark, Apache Airflow, and Integration with Llama 3.1

DATA SCIENCE

A comprehensive product recommendation system project focused on building similarity-based recommendations using data scraped from padiUMKM platform. The system integrates various technologies including Selenium, BeautifulSoup, PySpark, PostgreSQL, and Llama 3.1 for enhanced contextual recommendations.

Project screenshot 1

Project Highlights

  • Scraped over 100,000 product entries using Selenium and BeautifulSoup
  • Implemented data transformation and cleaning using PySpark
  • Developed similarity-based recommendation system with Cosine Similarity
  • Integrated Llama 3.1 for enhanced contextual recommendations
  • Built real-time recommendation service with continuous data updates

Technologies Used

Python 3.12.2SeleniumBeautifulSoupPySparkPostgreSQLApache AirflowLlama 3.1Cosine SimilarityJupyter NotebookDocker

Project Stages

Stage 1: Web Scraping

Developed automated scraping system using Selenium and BeautifulSoup to collect product data including names, prices, stock, sales, ratings, descriptions, dimensions, and weights from padiUMKM platform.

Stage 2: Data Processing

Implemented data transformation pipeline using PySpark for consistency checking, normalization, and data cleansing. Established PostgreSQL database connection for data storage.

Stage 3: Recommendation Model

Built similarity-based recommendation system using Cosine Similarity algorithm, utilizing product descriptions and features like categories, brands, and related attributes.

Stage 4: LLM Integration

Integrated Llama 3.1 model through prompt engineering techniques to enhance recommendation quality by understanding linguistic patterns in product descriptions.

Stage 5: Real-time System

Developed real-time data collection and transformation processes, implemented recommendation model as a real-time service for immediate product suggestions.

Stage 6: Testing & Optimization

Conducted comprehensive testing of the recommendation system, optimized model performance, and validated recommendation accuracy.