🕸️ Agentic Web Scraping with Firecrawl and LangGraph


This repository contains an intelligent web scraping solution that uses Firecrawl for content extraction and LangGraph for orchestrating the scraping workflow. The system can automatically crawl websites, extract content, and search for specific keywords or information.

Demo video: `screen-capture.1.mp4`

🚀 Features

  • Automated Sitemap Extraction: Automatically discovers all pages on a website
  • Intelligent Content Extraction: Extracts markdown, HTML, or text content from web pages
  • Keyword Search: Searches for specific keywords or phrases across all pages
  • Progress Tracking: Real-time progress updates during scraping
  • Error Handling: Robust error handling for network issues and parsing errors
  • Configurable: Easy to configure for different websites and search terms
  • LangGraph Workflow: Uses LangGraph for structured, maintainable scraping workflows

📋 Requirements

The code requires the following dependencies:

  • Python 3.8+
  • firecrawl-py
  • langgraph
  • pydantic-settings
  • python-dotenv

🛠️ Installation

```bash
# Clone the repository
git clone https://github.com/extrawest/web_scraping_with_firecrawl_and_langgraph.git
cd web_scraping_with_firecrawl_and_langgraph

# Install required packages
pip install -r requirements.txt

# Create a .env file with your configuration
echo "FIRECRAWL_URL=http://localhost:3002" > .env
```

📝 Usage

Basic Usage

```python
from scrape_the_web_agentically import main

# Run the scraper with a target URL and keyword
main(url="https://example.com", keyword="specific information")
```

Command Line Usage

```bash
# Run the script directly
python scrape_the_web_agentically.py
```

Configuration

You can modify the target URL and search keyword by editing the script:

```python
if __name__ == "__main__":
    target_url = "https://python.langchain.com"
    search_keyword = "LLMs"

    if not target_url or not search_keyword:
        print("Please set the target_url and search_keyword variables.")
    else:
        main(target_url, search_keyword)
```

🧠 How It Works

The script uses a LangGraph workflow to orchestrate the web scraping process:

  1. Initialization: Sets up the initial state with the target URL and keyword
  2. Sitemap Extraction: Fetches the sitemap to discover all pages on the website
  3. Batch Processing: Processes URLs in batches for efficient scraping
  4. Content Extraction: Extracts content from each page using Firecrawl
  5. Keyword Search: Searches for the specified keyword in the extracted content
  6. Result Evaluation: Determines if the information was found or if more URLs need to be processed
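Steps 2–5 can be sketched in plain Python. This is a simplified, hypothetical stand-in, not code from the repo: `fetch_markdown` replaces the Firecrawl extraction call, and the helper names (`parse_sitemap`, `batched`, `search_pages`) are illustrative.

```python
import xml.etree.ElementTree as ET

# Sitemap entries live in this XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Extract every <loc> URL from a sitemap.xml document."""
    return [loc.text for loc in ET.fromstring(xml_text).iter(f"{SITEMAP_NS}loc")]

def batched(urls, size):
    """Yield URLs in fixed-size batches, as the scrape manager would."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def search_pages(urls, keyword, fetch_markdown, batch_size=2):
    """Scan pages batch by batch; return the first URL whose content has the keyword."""
    for batch in batched(urls, batch_size):
        for url in batch:
            content = fetch_markdown(url)  # Firecrawl extraction in the real flow
            if keyword.lower() in content.lower():
                return url
    return None

SITEMAP_XML = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

PAGES = {"https://example.com/a": "About us",
         "https://example.com/b": "A page all about LLMs"}
```

With the toy data above, `search_pages(parse_sitemap(SITEMAP_XML), "LLMs", PAGES.get)` stops at the second URL; the real evaluate step makes the same found/not-found decision per batch.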

The workflow is visualized and saved as a PNG file for easy understanding of the process.

🔄 LangGraph Workflow

Workflow diagram: `firecrawl_langgraph_visualization`

The script uses LangGraph to create a structured workflow with the following nodes:

  • initialize_state: Sets up the initial state with URL and keyword
  • get_sitemap: Fetches the sitemap for the target URL
  • scrape_manager: Manages batches of URLs for processing
  • scraper: Extracts content from individual URLs
  • evaluate: Checks if the keyword was found in the content

The workflow continues until either the keyword is found or all URLs have been processed.
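The node-and-edge structure can be mimicked with a small hand-rolled dispatcher (this is deliberately *not* the langgraph API, just a sketch of the same control flow): each node takes and returns a state dict, and the conditional edge after `evaluate` either loops back to `scrape_manager` or ends the run. Node bodies here are stubs.

```python
END = "__end__"

def initialize_state(state):
    state.update(found=None, queue=[], batch=[], content="")
    return state

def get_sitemap(state):
    # Stub: the real node fetches sitemap.xml; here the URLs are given directly.
    state["queue"] = list(state["urls"])
    return state

def scrape_manager(state):
    # Hand out the next batch of URLs (batch size 1 in this sketch).
    state["batch"] = [state["queue"].pop(0)] if state["queue"] else []
    return state

def scraper(state):
    # Stub for Firecrawl extraction: look content up in a local dict.
    state["content"] = " ".join(state["pages"][u] for u in state["batch"])
    return state

def evaluate(state):
    if state["content"] and state["keyword"].lower() in state["content"].lower():
        state["found"] = state["batch"][0]
    return state

def route_after_evaluate(state):
    # Conditional edge: stop on a hit or when no URLs remain.
    return END if state["found"] or not state["queue"] else "scrape_manager"

NODES = {f.__name__: f for f in (initialize_state, get_sitemap, scrape_manager,
                                 scraper, evaluate)}
EDGES = {"initialize_state": "get_sitemap", "get_sitemap": "scrape_manager",
         "scrape_manager": "scraper", "scraper": "evaluate",
         "evaluate": route_after_evaluate}

def run(state):
    """Walk the graph from initialize_state until a node routes to END."""
    node = "initialize_state"
    while node != END:
        state = NODES[node](state)
        nxt = EDGES[node]
        node = nxt(state) if callable(nxt) else nxt
    return state
```

In langgraph terms, the `EDGES` entries with fixed strings correspond to plain edges and `route_after_evaluate` to a conditional edge; langgraph's `StateGraph` wires the same topology declaratively.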

🔧 Customization

Firecrawl Server

By default, the script connects to a local Firecrawl server at `http://localhost:3002`. You can change this by:

  1. Setting the `FIRECRAWL_URL` environment variable
  2. Creating a `.env` file with `FIRECRAWL_URL=your-server-url`
  3. Modifying the `firecrawl_url` parameter in the `Settings` class
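The environment-variable lookup boils down to a one-liner. The sketch below is a hypothetical stand-in for the repo's `Settings` class: it assumes the setting is simply read from `FIRECRAWL_URL` with a local default (pydantic-settings and python-dotenv add typed validation and `.env` loading on top of this).

```python
import os

DEFAULT_FIRECRAWL_URL = "http://localhost:3002"

def resolve_firecrawl_url():
    """Prefer FIRECRAWL_URL from the environment, else fall back to the local default."""
    return os.environ.get("FIRECRAWL_URL", DEFAULT_FIRECRAWL_URL)
```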
