Task #181

open

Generic Web Scraping Engine for Vidyarti

Added by Dana Basheer about 2 months ago. Updated about 2 months ago.

Status: In Progress
Priority: High
Assignee:
Start date: 03/17/2026
Due date:
% Done: 0%
Estimated time:

Description

Develop a centralized web scraping module that can fetch data from multiple external sources and map it into different modules of Vidyarti such as:

  • Syllabus
  • Current Affairs
  • Mock Test Questions
  • Study Materials

The system should be configurable, reusable, and scalable.
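
One way to read this requirement is that a single run function serves every source, with everything source-specific kept in that source's configuration. The sketch below is a minimal illustration, not the agreed design. It assumes in-memory lists stand in for the staging and log tables, and that `fetch` and `parse` are hypothetical callables supplied by the caller (this task does not specify either). Field names follow the tables listed in this task.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Source:
    """One configured source (mirrors vid_scraping_source_master)."""
    id: int
    source_name: str
    base_url: str
    module_type: str
    parsing_rules: str  # JSON text; the rule format itself is not defined in this task
    status: bool = True


def run_source(source: Source,
               fetch: Callable[[str], str],
               parse: Callable[[str, dict], list],
               staging: list,
               logs: list) -> int:
    """Scrape one source, stage its items as 'pending', and log the outcome."""
    if not source.status:
        return 0  # inactive sources are skipped without logging
    now = datetime.now(timezone.utc).isoformat()
    try:
        rules = json.loads(source.parsing_rules)
        items = parse(fetch(source.base_url), rules)
        # Build all rows first so a bad item does not leave a partial batch staged.
        rows = [{
            "source_id": source.id,
            "module_type": source.module_type,
            "raw_title": item.get("title"),
            "raw_content": item.get("content"),
            "raw_data": item,
            "source_url": item.get("url", source.base_url),
            "status": "pending",
            "created_at": now,
        } for item in items]
        staging.extend(rows)
        logs.append({"source_id": source.id, "status": "Success",
                     "message": f"{len(rows)} item(s) staged", "run_time": now})
        return len(rows)
    except Exception as exc:  # any failure becomes a log entry, not a crash
        logs.append({"source_id": source.id, "status": "Failed",
                     "message": str(exc), "run_time": now})
        return 0
```

Because `fetch` and `parse` are passed in, the same function can be reused for any module type, and a new source only needs a new configuration row.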

Tables

vid_scraping_source_master

  • id | INT (PK) | Source ID
  • source_name | VARCHAR(150) | Website name
  • base_url | VARCHAR(255) | Website URL
  • module_type | ENUM('current_affairs','syllabus','mock_test','study_material') | Target module
  • parsing_rules | TEXT | JSON rules for scraping
  • status | BOOLEAN | Active/Inactive
  • created_at | DATETIME | Created date

vid_scraped_data_staging

  • id | INT (PK) | ID
  • source_id | INT (FK) | Reference source
  • module_type | VARCHAR(50) | Target module
  • raw_title | TEXT | Extracted title
  • raw_content | TEXT | Extracted content
  • raw_data | JSON | Full raw scraped data
  • source_url | VARCHAR(255) | Original link
  • status | ENUM('pending','approved','rejected') | Workflow status
  • created_at | DATETIME | Scraped time

vid_scraping_logs

  • id | INT (PK) | Log ID
  • source_id | INT | Source reference
  • status | VARCHAR(50) | Success/Failed
  • message | TEXT | Error or success message
  • run_time | DATETIME | Execution time
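
For quick local experiments, the three tables above can be approximated in SQLite from Python. This is a sketch under assumptions, not the production DDL: SQLite has no ENUM or native JSON column type, so ENUM columns become `CHECK` constraints and `raw_data` is stored as text. The `NOT NULL` constraints, the defaults, and the `UNIQUE` constraint on `source_url` are also assumptions (the last one is taken from the backend validations in this task).

```python
import sqlite3

SCHEMA = """
CREATE TABLE vid_scraping_source_master (
    id            INTEGER PRIMARY KEY,
    source_name   VARCHAR(150) NOT NULL,
    base_url      VARCHAR(255) NOT NULL,
    module_type   TEXT NOT NULL CHECK (module_type IN
                  ('current_affairs','syllabus','mock_test','study_material')),
    parsing_rules TEXT,                      -- JSON rules for scraping
    status        BOOLEAN NOT NULL DEFAULT 1, -- 1 = active, 0 = inactive
    created_at    DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE vid_scraped_data_staging (
    id          INTEGER PRIMARY KEY,
    source_id   INTEGER NOT NULL REFERENCES vid_scraping_source_master(id),
    module_type VARCHAR(50) NOT NULL,
    raw_title   TEXT,
    raw_content TEXT,
    raw_data    TEXT,                        -- JSON stored as text in SQLite
    source_url  VARCHAR(255) UNIQUE,
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending','approved','rejected')),
    created_at  DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE vid_scraping_logs (
    id        INTEGER PRIMARY KEY,
    source_id INTEGER,
    status    VARCHAR(50),                   -- Success/Failed
    message   TEXT,
    run_time  DATETIME DEFAULT CURRENT_TIMESTAMP
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three scraping tables on a fresh connection."""
    # SQLite only enforces foreign keys when this pragma is on for the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

With this schema, an insert that uses an unknown `module_type`, a repeated `source_url`, or a `source_id` with no matching source raises `sqlite3.IntegrityError`.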

Validations

Backend

  • source_name → required
  • base_url → valid URL
  • module_type → must be valid enum
  • source_url → unique (avoid duplicates)
  • Prevent duplicate data: reject a record if an existing record has the same source_url OR the same title
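
The backend rules above could be checked roughly as follows. This is a sketch, assuming the payload arrives as a plain dict and that existing staged records are available as dicts with `source_url` and `raw_title` keys. The function names are hypothetical, and comparing titles case-insensitively with collapsed whitespace is an assumption about what "same title" should mean.

```python
from urllib.parse import urlparse

VALID_MODULE_TYPES = {"current_affairs", "syllabus", "mock_test", "study_material"}


def validate_source(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the source is valid."""
    errors = []
    if not str(payload.get("source_name") or "").strip():
        errors.append("source_name is required")
    try:
        parsed = urlparse(str(payload.get("base_url") or ""))
        url_ok = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    except ValueError:  # urlparse rejects some malformed hosts, e.g. "http://["
        url_ok = False
    if not url_ok:
        errors.append("base_url must be a valid http(s) URL")
    if payload.get("module_type") not in VALID_MODULE_TYPES:
        errors.append("module_type must be one of: " + ", ".join(sorted(VALID_MODULE_TYPES)))
    return errors


def normalize_title(title: str) -> str:
    """Collapse whitespace and ignore case so near-identical titles compare equal."""
    return " ".join((title or "").split()).casefold()


def is_duplicate(record: dict, existing: list[dict]) -> bool:
    """True if any existing record has the same source_url OR the same title."""
    title = normalize_title(record.get("raw_title", ""))
    return any(
        record.get("source_url") == other.get("source_url")
        or (title and title == normalize_title(other.get("raw_title", "")))
        for other in existing
    )
```

Matching on title alone can reject legitimately distinct items that share a headline, so the title rule may need to be scoped per source or per module once real data is available.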

Frontend

Required fields:

  • Source Name
  • URL
  • Module Type

Additional behavior:

  • Validate that the parsing rules are well-formed JSON
  • Show a preview of a test scrape (optional)
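
The JSON check on the parsing rules might look like the sketch below, written in Python as it could run behind a validation endpoint that the form calls. The function name is hypothetical, and requiring the top level to be a JSON object is an assumption, since this task does not define the rule format.

```python
import json


def check_parsing_rules(text: str) -> tuple[bool, str]:
    """Return (ok, message) for the parsing-rules field of the source form."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno/colno let the form point the user at the offending character.
        return False, f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(rules, dict):
        return False, "Parsing rules must be a JSON object"
    return True, "ok"
```

Returning the line and column lets the form highlight the exact position of a syntax error instead of showing a generic "invalid JSON" message.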