🕸️ Agentic Web Scraping with Firecrawl and LangGraph


This repository contains an intelligent web scraping solution that uses Firecrawl for content extraction and LangGraph for orchestrating the scraping workflow. The system can automatically crawl websites, extract content, and search for specific keywords or information.

Demo video: `screen-capture.1.mp4`

🚀 Features

  • Automated Sitemap Extraction: Automatically discovers all pages on a website
  • Intelligent Content Extraction: Extracts markdown, HTML, or text content from web pages
  • Keyword Search: Searches for specific keywords or phrases across all pages
  • Progress Tracking: Real-time progress updates during scraping
  • Error Handling: Robust error handling for network issues and parsing errors
  • Configurable: Easy to configure for different websites and search terms
  • LangGraph Workflow: Uses LangGraph for structured, maintainable scraping workflows

📋 Requirements

The code requires the following dependencies:

  • Python 3.8+
  • firecrawl-py
  • langgraph
  • pydantic-settings
  • python-dotenv

🛠️ Installation

```bash
# Clone the repository
git clone https://github.com/extrawest/web_scraping_with_firecrawl_and_langgraph.git
cd web_scraping_with_firecrawl_and_langgraph

# Install required packages
pip install -r requirements.txt

# Create a .env file with your configuration
echo "FIRECRAWL_URL=http://localhost:3002" > .env
```

📝 Usage

Basic Usage

```python
from scrape_the_web_agentically import main

# Run the scraper with a target URL and keyword
main(url="https://example.com", keyword="specific information")
```

Command Line Usage

```bash
# Run the script directly
python scrape_the_web_agentically.py
```

Configuration

You can modify the target URL and search keyword by editing the script:

```python
if __name__ == "__main__":
    target_url = "https://python.langchain.com"
    search_keyword = "LLMs"

    if not target_url or not search_keyword:
        print("Please set the target_url and search_keyword variables.")
    else:
        main(target_url, search_keyword)
```

🧠 How It Works

The script uses a LangGraph workflow to orchestrate the web scraping process:

  1. Initialization: Sets up the initial state with the target URL and keyword
  2. Sitemap Extraction: Fetches the sitemap to discover all pages on the website
  3. Batch Processing: Processes URLs in batches for efficient scraping
  4. Content Extraction: Extracts content from each page using Firecrawl
  5. Keyword Search: Searches for the specified keyword in the extracted content
  6. Result Evaluation: Determines if the information was found or if more URLs need to be processed
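Steps 2–5 can be sketched in plain Python. This is a simplified, hypothetical stand-in, not code from the repo: `fetch_markdown` replaces the Firecrawl extraction call, and the helper names (`parse_sitemap`, `batched`, `search_pages`) are illustrative.

```python
import xml.etree.ElementTree as ET

# Sitemap entries live in this XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Extract every <loc> URL from a sitemap.xml document."""
    return [loc.text for loc in ET.fromstring(xml_text).iter(f"{SITEMAP_NS}loc")]

def batched(urls, size):
    """Yield URLs in fixed-size batches, as the scrape manager would."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def search_pages(urls, keyword, fetch_markdown, batch_size=2):
    """Scan pages batch by batch; return the first URL whose content has the keyword."""
    for batch in batched(urls, batch_size):
        for url in batch:
            content = fetch_markdown(url)  # Firecrawl extraction in the real flow
            if keyword.lower() in content.lower():
                return url
    return None

SITEMAP_XML = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

PAGES = {"https://example.com/a": "About us",
         "https://example.com/b": "A page all about LLMs"}
```

With the toy data above, `search_pages(parse_sitemap(SITEMAP_XML), "LLMs", PAGES.get)` stops at the second URL; the real evaluate step makes the same found/not-found decision per batch.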

The workflow is visualized and saved as a PNG file for easy understanding of the process.

🔄 LangGraph Workflow

Workflow diagram: `firecrawl_langgraph_visualization`

The script uses LangGraph to create a structured workflow with the following nodes:

  • initialize_state: Sets up the initial state with URL and keyword
  • get_sitemap: Fetches the sitemap for the target URL
  • scrape_manager: Manages batches of URLs for processing
  • scraper: Extracts content from individual URLs
  • evaluate: Checks if the keyword was found in the content

The workflow continues until either the keyword is found or all URLs have been processed.
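The node-and-edge structure can be mimicked with a small hand-rolled dispatcher (this is deliberately *not* the langgraph API, just a sketch of the same control flow): each node takes and returns a state dict, and the conditional edge after `evaluate` either loops back to `scrape_manager` or ends the run. Node bodies here are stubs.

```python
END = "__end__"

def initialize_state(state):
    state.update(found=None, queue=[], batch=[], content="")
    return state

def get_sitemap(state):
    # Stub: the real node fetches sitemap.xml; here the URLs are given directly.
    state["queue"] = list(state["urls"])
    return state

def scrape_manager(state):
    # Hand out the next batch of URLs (batch size 1 in this sketch).
    state["batch"] = [state["queue"].pop(0)] if state["queue"] else []
    return state

def scraper(state):
    # Stub for Firecrawl extraction: look content up in a local dict.
    state["content"] = " ".join(state["pages"][u] for u in state["batch"])
    return state

def evaluate(state):
    if state["content"] and state["keyword"].lower() in state["content"].lower():
        state["found"] = state["batch"][0]
    return state

def route_after_evaluate(state):
    # Conditional edge: stop on a hit or when no URLs remain.
    return END if state["found"] or not state["queue"] else "scrape_manager"

NODES = {f.__name__: f for f in (initialize_state, get_sitemap, scrape_manager,
                                 scraper, evaluate)}
EDGES = {"initialize_state": "get_sitemap", "get_sitemap": "scrape_manager",
         "scrape_manager": "scraper", "scraper": "evaluate",
         "evaluate": route_after_evaluate}

def run(state):
    """Walk the graph from initialize_state until a node routes to END."""
    node = "initialize_state"
    while node != END:
        state = NODES[node](state)
        nxt = EDGES[node]
        node = nxt(state) if callable(nxt) else nxt
    return state
```

In langgraph terms, the `EDGES` entries with fixed strings correspond to plain edges and `route_after_evaluate` to a conditional edge; langgraph's `StateGraph` wires the same topology declaratively.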

🔧 Customization

Firecrawl Server

By default, the script connects to a local Firecrawl server at `http://localhost:3002`. You can change this by:

  1. Setting the `FIRECRAWL_URL` environment variable
  2. Creating a `.env` file with `FIRECRAWL_URL=your-server-url`
  3. Modifying the `firecrawl_url` parameter in the `Settings` class
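The environment-variable lookup boils down to a one-liner. The sketch below is a hypothetical stand-in for the repo's `Settings` class: it assumes the setting is simply read from `FIRECRAWL_URL` with a local default (pydantic-settings and python-dotenv add typed validation and `.env` loading on top of this).

```python
import os

DEFAULT_FIRECRAWL_URL = "http://localhost:3002"

def resolve_firecrawl_url():
    """Prefer FIRECRAWL_URL from the environment, else fall back to the local default."""
    return os.environ.get("FIRECRAWL_URL", DEFAULT_FIRECRAWL_URL)
```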
