🧠 Agentic Web Scraping with ScrapeGraphAI and LangGraph


This repository contains an intelligent web scraping solution that uses ScrapeGraphAI for LLM-powered content extraction and LangGraph for orchestrating the scraping workflow. The system can intelligently crawl websites, extract content using natural language instructions, and search for specific information.

[Demo video: screen-capture.mp4]

🚀 Features

  • LLM-Powered Extraction: Uses OpenAI models to intelligently extract content based on natural language instructions
  • Parallel Processing: Processes multiple URLs simultaneously using LangGraph's fan-out pattern
  • Flexible Prompting: Customizable prompts for different scraping scenarios
  • Local Processing Control: Scraping runs on your own machine, so no remote job keeps consuming credits after a run ends
  • Progress Tracking: Real-time progress updates during scraping
  • Error Handling: Robust error handling for browser and API issues
  • Configurable: Easy to configure for different websites and search terms
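As a rough illustration of the flexible prompting feature, the sketch below builds a natural language extraction instruction from a search keyword and a list of desired fields. The template wording and the names EXTRACTION_PROMPT and build_prompt are hypothetical and are not taken from the repository; the resulting string is the kind of instruction that would be handed to the extraction step.

```python
from string import Template
from typing import Sequence

# Hypothetical template: the wording is illustrative, not the repository's prompt.
EXTRACTION_PROMPT = Template(
    "Extract the content of this page that is relevant to: '$keyword'. "
    "Return $fields. If nothing is relevant, say so explicitly."
)


def build_prompt(keyword: str, fields: Sequence[str] = ()) -> str:
    """Fill the template with a search keyword and the fields to return."""
    keyword = keyword.strip()
    if not keyword:
        raise ValueError("keyword must not be empty")
    wanted = ", ".join(fields) if fields else "a short summary"
    return EXTRACTION_PROMPT.substitute(keyword=keyword, fields=wanted)


if __name__ == "__main__":
    print(build_prompt("How to track token usage for LLMs", ["title", "summary"]))
```

Keeping the template separate from the values makes it straightforward to swap in a different instruction for another scraping scenario without touching the rest of the workflow.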

📋 Requirements

The code requires the following dependencies:

  • Python 3.8+
  • scrapegraphai
  • langgraph
  • nest_asyncio
  • playwright
  • pydantic-settings
  • python-dotenv
  • openai (for API access)

🛠️ Installation

# Clone the repository
git clone https://github.com/extrawest/web_scraping_with_scrapegraphai_and_langgraph.git
cd web_scraping_with_scrapegraphai_and_langgraph

# Install required packages
pip install -r requirements.txt

# Install Playwright browsers
playwright install

# Create a .env file with your configuration
echo "OPENAI_API_KEY=your-api-key-here" > .env
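Before the first run it can help to confirm that the key in .env is readable and is no longer the placeholder. The following is a minimal standard-library sketch of such a check; the project itself lists python-dotenv and pydantic-settings for configuration, so this hand-written parser is only a stand-in, and the function names read_env_file and require_api_key are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path: str = ".env") -> str:
    """Return OPENAI_API_KEY from the environment or the .env file, else raise."""
    key = os.environ.get("OPENAI_API_KEY") or read_env_file(path).get("OPENAI_API_KEY", "")
    if not key or key == "your-api-key-here":
        raise RuntimeError("OPENAI_API_KEY is missing or still set to the placeholder")
    return key
```

Calling require_api_key() at startup fails fast with a clear message instead of surfacing the problem later as an API error mid-scrape.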

📝 Usage

Command Line Usage

# Run the script directly
python scrape_the_web_agentically.py

Configuration

To change the target URLs and the search keyword, edit the __main__ block of the script:

if __name__ == "__main__":
    target_urls = [
        "https://python.langchain.com"
    ]
    search_keyword = "How to track token usage for LLMs"

    if not target_urls or not search_keyword:
        print("Please set the target_urls list and search_keyword variable.")
    else:
        main(target_urls, search_keyword)
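If editing the script for every run becomes tedious, the same two values could be read from the command line instead. The sketch below is one possible approach using argparse; the --url and --keyword flags and the parse_args helper are hypothetical additions, not part of the repository, and the defaults simply mirror the values shown above.

```python
import argparse
from typing import List, Optional, Tuple

DEFAULT_URLS = ["https://python.langchain.com"]
DEFAULT_KEYWORD = "How to track token usage for LLMs"


def parse_args(argv: Optional[List[str]] = None) -> Tuple[List[str], str]:
    """Return (target_urls, search_keyword) from command-line style arguments."""
    parser = argparse.ArgumentParser(description="Agentic scraping run")
    parser.add_argument(
        "--url",
        action="append",
        dest="urls",
        default=None,
        help="Target URL; repeat the flag to pass several",
    )
    parser.add_argument("--keyword", default=DEFAULT_KEYWORD, help="Information to search for")
    args = parser.parse_args(argv)
    # Fall back to the default list when no --url flag was given.
    return (args.urls or list(DEFAULT_URLS)), args.keyword


# Assumed wiring inside the script, where main() is the existing entry point:
#     target_urls, search_keyword = parse_args()
#     main(target_urls, search_keyword)
```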

🧠 How It Works

The script uses a LangGraph workflow with ScrapeGraphAI to orchestrate the web scraping process:

  1. Initialization: Sets up the initial state with the target URLs and keyword
  2. Scrape Management: Manages the URLs to be scraped
  3. Parallel Processing: Uses LangGraph's fan-out pattern to process multiple URLs simultaneously
  4. LLM-Powered Extraction: Uses OpenAI models to intelligently extract content from web pages
  5. Content Evaluation: Determines if the extracted content contains the requested information
  6. Result Processing: Formats and presents the extracted information
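The fan-out and evaluation steps above can be approximated in plain Python. This sketch deliberately swaps LangGraph for concurrent.futures and swaps the LLM call for a canned stub, so it only illustrates the shape of the pattern: one task per URL, results merged back into a single mapping, then a keyword check. The names fake_extract, scrape_batch and find_matches, the example.com URLs and the substring-based evaluation are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fake_extract(url: str) -> str:
    """Stand-in for the LLM extraction step; returns canned text per URL."""
    pages = {
        "https://example.com/a": "Intro to chains",
        "https://example.com/b": "How to track token usage for LLMs with callbacks",
    }
    return pages.get(url, "")


def scrape_batch(
    urls: List[str], extract: Callable[[str], str], max_workers: int = 4
) -> Dict[str, str]:
    """Fan out one task per URL, then fan in by merging results into one dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(extract, urls))  # map preserves input order
    return dict(zip(urls, texts))


def find_matches(results: Dict[str, str], keyword: str) -> List[str]:
    """Naive evaluation: case-insensitive substring match on the keyword."""
    needle = keyword.lower()
    return [url for url, text in results.items() if needle in text.lower()]


if __name__ == "__main__":
    batch = scrape_batch(["https://example.com/a", "https://example.com/b"], fake_extract)
    print(find_matches(batch, "track token usage"))
```

Threads suit this step because each task spends most of its time waiting on the network or an API response rather than computing.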

🔄 LangGraph Workflow

[Figure: LangGraph workflow visualization]

The script uses LangGraph to create a structured workflow with the following nodes:

  • initialize_state: Sets up the initial state with URLs and keyword
  • scrape_manager: Manages the list of URLs to be scraped
  • scraper: Extracts content from individual URLs using ScrapeGraphAI
  • evaluate: Checks if the extracted content contains the requested information

The workflow continues until either the information is found or all URLs have been processed.
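To make the stop condition concrete, here is a small sketch that reuses the four node names as ordinary Python functions and loops until the keyword is found or no URLs remain. It is not the repository's graph: it handles one URL per round instead of fanning out, the state fields (pending, current, extracted, found_at) are assumed, and evaluation is a plain substring check where the real workflow relies on LLM-extracted content.

```python
from typing import Callable, List, Optional, TypedDict


class ScrapeState(TypedDict):
    pending: List[str]        # URLs not yet scraped
    current: Optional[str]    # URL picked for this round
    extracted: str            # text returned by the scraper for `current`
    keyword: str              # information being searched for
    found_at: Optional[str]   # URL where the keyword was found, if any


def initialize_state(urls: List[str], keyword: str) -> ScrapeState:
    return {"pending": list(urls), "current": None, "extracted": "",
            "keyword": keyword, "found_at": None}


def scrape_manager(state: ScrapeState) -> ScrapeState:
    """Pick the next URL and remove it from the pending list."""
    pending = list(state["pending"])
    current = pending.pop(0) if pending else None
    return {**state, "pending": pending, "current": current}


def make_scraper(extract: Callable[[str], str]) -> Callable[[ScrapeState], ScrapeState]:
    """Wrap an extraction function (assumed, e.g. an LLM call) as a node."""
    def scraper(state: ScrapeState) -> ScrapeState:
        url = state["current"]
        return {**state, "extracted": extract(url) if url else ""}
    return scraper


def evaluate(state: ScrapeState) -> ScrapeState:
    """Mark the current URL as a hit when the keyword appears in its text."""
    hit = state["keyword"].lower() in state["extracted"].lower()
    return {**state, "found_at": state["current"] if hit else None}


def run_workflow(urls: List[str], keyword: str, extract: Callable[[str], str]) -> ScrapeState:
    state = initialize_state(urls, keyword)
    scraper = make_scraper(extract)
    # Continue until the information is found or every URL has been processed.
    while state["found_at"] is None and state["pending"]:
        state = evaluate(scraper(scrape_manager(state)))
    return state
```

Each node returns a new state dictionary instead of mutating its input, which keeps every step easy to inspect and test in isolation.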

🤖 ScrapeGraphAI vs Firecrawl

ScrapeGraphAI offers several advantages over Firecrawl:

  1. LLM-Powered Extraction: Uses OpenAI models to intelligently extract content based on natural language instructions
  2. Local Processing Control: Scraping runs on your own machine, so no remote job keeps consuming credits after a run ends
  3. More Flexible Scraping: Natural language instructions allow for more nuanced content extraction
  4. Direct LLM-based Content Extraction: Extracts content without requiring multiple API calls
