This repository contains an intelligent web scraping solution that uses ScrapeGraphAI for LLM-powered content extraction and LangGraph for orchestrating the scraping workflow. The system can intelligently crawl websites, extract content using natural language instructions, and search for specific information.
- LLM-Powered Extraction: Uses OpenAI models to intelligently extract content based on natural language instructions
- Parallel Processing: Processes multiple URLs simultaneously using LangGraph's fan-out pattern
- Flexible Prompting: Customizable prompts for different scraping scenarios
- Local Processing Control: Scraping runs locally, so no remote service keeps consuming API credits
- Progress Tracking: Real-time progress updates during scraping
- Error Handling: Robust error handling for browser and API issues
- Configurable: Easy to configure for different websites and search terms
The code requires the following dependencies:
- Python 3.8+
- scrapegraphai
- langgraph
- nest_asyncio
- playwright
- pydantic-settings
- python-dotenv
- openai (for API access)
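The dependency list above maps directly onto a `requirements.txt`. A plausible (unpinned) version is shown below; pin versions as needed for your environment:

```
scrapegraphai
langgraph
nest_asyncio
playwright
pydantic-settings
python-dotenv
openai
```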
```bash
# Clone the repository
git clone https://github.com/extrawest/web_scraping_with_scrapegraphai_and_langgraph.git
cd web-scraping-scrapegraphai

# Install required packages
pip install -r requirements.txt

# Install Playwright browsers
playwright install

# Create a .env file with your configuration
echo "OPENAI_API_KEY=your-api-key-here" > .env

# Run the script directly
python scrape_the_web_agentically.py
```
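The script presumably loads `OPENAI_API_KEY` from `.env` via python-dotenv or pydantic-settings. As a hedged, stdlib-only sketch of the same idea (the function name `load_env` is hypothetical, not part of the project):

```python
import os

# Minimal stand-in for python-dotenv's load_dotenv(): read KEY=VALUE
# pairs from a .env-style file into os.environ. Existing environment
# variables take precedence; comments and blank lines are skipped.
def load_env(path=".env"):
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("OPENAI_API_KEY")
```

In the real project, prefer the libraries listed in the requirements (python-dotenv or pydantic-settings), which also handle quoting and export syntax.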
You can modify the target URL and search keyword by editing the script:
```python
if __name__ == "__main__":
    target_urls = [
        "https://python.langchain.com"
    ]
    search_keyword = "How to track token usage for LLMs"

    if not target_urls or not search_keyword:
        print("Please set the target_urls list and search_keyword variable.")
    else:
        main(target_urls, search_keyword)
```
The script uses a LangGraph workflow with ScrapeGraphAI to orchestrate the web scraping process:
- Initialization: Sets up the initial state with the target URL and keyword
- Scrape Management: Manages the URLs to be scraped
- Parallel Processing: Uses LangGraph's fan-out pattern to process multiple URLs simultaneously
- LLM-Powered Extraction: Uses OpenAI models to intelligently extract content from web pages
- Content Evaluation: Determines if the extracted content contains the requested information
- Result Processing: Formats and presents the extracted information
The script uses LangGraph to create a structured workflow with the following nodes:
- `initialize_state`: Sets up the initial state with URLs and keywords
- `scrape_manager`: Manages the list of URLs to be scraped
- `scraper`: Extracts content from individual URLs using ScrapeGraphAI
- `evaluate`: Checks if the extracted content contains the requested information
The workflow continues until either the information is found or all URLs have been processed.
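The node sequence above can be sketched as a dependency-free control loop. In the real project these steps are LangGraph nodes with conditional edges and the extraction step calls ScrapeGraphAI; here each node is a plain function, and `fetch` is a hypothetical stand-in for the LLM-powered extraction:

```python
# Simplified sketch of the workflow's control loop (not the actual
# LangGraph graph): process URLs until the keyword is found or the
# list is exhausted.

def initialize_state(urls, keyword):
    return {"pending": list(urls), "keyword": keyword, "found": None}

def scrape_manager(state):
    # Hand out the next URL to process, or signal completion with None.
    return state["pending"].pop(0) if state["pending"] else None

def scraper(url, keyword, fetch):
    # In the real system this invokes ScrapeGraphAI with a
    # natural-language prompt; `fetch` stands in for that step.
    return fetch(url, keyword)

def evaluate(state, content, keyword):
    # Keep the result only if it mentions the keyword.
    if content and keyword.lower() in content.lower():
        state["found"] = content
    return state["found"] is not None

def run_workflow(urls, keyword, fetch):
    state = initialize_state(urls, keyword)
    while (url := scrape_manager(state)) is not None:
        content = scraper(url, state["keyword"], fetch)
        if evaluate(state, content, keyword):
            break  # stop as soon as the information is found
    return state["found"]

# Tiny stub: only one "page" contains the keyword.
pages = {"https://a.example": "nothing here",
         "https://b.example": "How to track token usage for LLMs: ..."}
result = run_workflow(pages, "token usage", lambda u, k: pages[u])
print(result)  # the matching page's content
```

The early `break` mirrors the workflow's stopping rule: the graph terminates on the first URL whose extracted content satisfies the evaluation node.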
ScrapeGraphAI offers several advantages over Firecrawl:
- LLM-Powered Extraction: Uses OpenAI models to intelligently extract content based on natural language instructions
- Local Processing Control: Scraping runs locally, so no remote service keeps consuming API credits
- More Flexible Scraping: Natural language instructions allow for more nuanced content extraction
- Direct LLM-based Content Extraction: Extracts content without requiring multiple API calls