This repository contains an intelligent web scraping solution that uses ScrapeGraphAI for LLM-powered content extraction and LangGraph for orchestrating the scraping workflow. The system can intelligently crawl websites, extract content using natural language instructions, and search for specific information.
- LLM-Powered Extraction: Uses OpenAI models to intelligently extract content based on natural language instructions
- Parallel Processing: Processes multiple URLs simultaneously using LangGraph's fan-out pattern
- Flexible Prompting: Customizable prompts for different scraping scenarios
- Local Processing Control: Scraping runs locally, so no remote service keeps consuming API credits
- Progress Tracking: Real-time progress updates during scraping
- Error Handling: Robust error handling for browser and API issues
- Configurable: Easy to configure for different websites and search terms
The code requires the following dependencies:
- Python 3.8+
- scrapegraphai
- langgraph
- nest_asyncio
- playwright
- pydantic-settings
- python-dotenv
- openai (for API access)
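The dependency list above maps directly onto a `requirements.txt`. A plausible (unpinned) version is shown below; pin versions as needed for your environment:

```
scrapegraphai
langgraph
nest_asyncio
playwright
pydantic-settings
python-dotenv
openai
```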
```bash
# Clone the repository
git clone https://github.com/extrawest/web_scraping_with_scrapegraphai_and_langgraph.git
cd web-scraping-scrapegraphai

# Install required packages
pip install -r requirements.txt

# Install Playwright browsers
playwright install

# Create a .env file with your configuration
echo "OPENAI_API_KEY=your-api-key-here" > .env

# Run the script directly
python scrape_the_web_agentically.py
```
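The script presumably loads `OPENAI_API_KEY` from `.env` via python-dotenv or pydantic-settings. As a hedged, stdlib-only sketch of the same idea (the function name `load_env` is hypothetical, not part of the project):

```python
import os

# Minimal stand-in for python-dotenv's load_dotenv(): read KEY=VALUE
# pairs from a .env-style file into os.environ. Existing environment
# variables take precedence; comments and blank lines are skipped.
def load_env(path=".env"):
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("OPENAI_API_KEY")
```

In the real project, prefer the libraries listed in the requirements (python-dotenv or pydantic-settings), which also handle quoting and export syntax.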
You can modify the target URL and search keyword by editing the script:
```python
if __name__ == "__main__":
    target_urls = [
        "https://python.langchain.com"
    ]
    search_keyword = "How to track token usage for LLMs"

    if not target_urls or not search_keyword:
        print("Please set the target_urls list and search_keyword variable.")
    else:
        main(target_urls, search_keyword)
```
The script uses a LangGraph workflow with ScrapeGraphAI to orchestrate the web scraping process:
- Initialization: Sets up the initial state with the target URL and keyword
- Scrape Management: Manages the URLs to be scraped
- Parallel Processing: Uses LangGraph's fan-out pattern to process multiple URLs simultaneously
- LLM-Powered Extraction: Uses OpenAI models to intelligently extract content from web pages
- Content Evaluation: Determines if the extracted content contains the requested information
- Result Processing: Formats and presents the extracted information
The script uses LangGraph to create a structured workflow with the following nodes:
- `initialize_state`: Sets up the initial state with URLs and keywords
- `scrape_manager`: Manages the list of URLs to be scraped
- `scraper`: Extracts content from individual URLs using ScrapeGraphAI
- `evaluate`: Checks if the extracted content contains the requested information
The workflow continues until either the information is found or all URLs have been processed.
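The node sequence above can be sketched as a dependency-free control loop. In the real project these steps are LangGraph nodes with conditional edges and the extraction step calls ScrapeGraphAI; here each node is a plain function, and `fetch` is a hypothetical stand-in for the LLM-powered extraction:

```python
# Simplified sketch of the workflow's control loop (not the actual
# LangGraph graph): process URLs until the keyword is found or the
# list is exhausted.

def initialize_state(urls, keyword):
    return {"pending": list(urls), "keyword": keyword, "found": None}

def scrape_manager(state):
    # Hand out the next URL to process, or signal completion with None.
    return state["pending"].pop(0) if state["pending"] else None

def scraper(url, keyword, fetch):
    # In the real system this invokes ScrapeGraphAI with a
    # natural-language prompt; `fetch` stands in for that step.
    return fetch(url, keyword)

def evaluate(state, content, keyword):
    # Keep the result only if it mentions the keyword.
    if content and keyword.lower() in content.lower():
        state["found"] = content
    return state["found"] is not None

def run_workflow(urls, keyword, fetch):
    state = initialize_state(urls, keyword)
    while (url := scrape_manager(state)) is not None:
        content = scraper(url, state["keyword"], fetch)
        if evaluate(state, content, keyword):
            break  # stop as soon as the information is found
    return state["found"]

# Tiny stub: only one "page" contains the keyword.
pages = {"https://a.example": "nothing here",
         "https://b.example": "How to track token usage for LLMs: ..."}
result = run_workflow(pages, "token usage", lambda u, k: pages[u])
print(result)  # the matching page's content
```

The early `break` mirrors the workflow's stopping rule: the graph terminates on the first URL whose extracted content satisfies the evaluation node.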
ScrapeGraphAI offers several advantages over Firecrawl:
- LLM-Powered Extraction: Uses OpenAI models to intelligently extract content based on natural language instructions
- Local Processing Control: Scraping runs locally, so no remote service keeps consuming API credits
- More Flexible Scraping: Natural language instructions allow for more nuanced content extraction
- Direct LLM-based Content Extraction: Extracts content without requiring multiple API calls