This repository contains an intelligent web scraping solution that uses Firecrawl for content extraction and LangGraph for orchestrating the scraping workflow. The system can automatically crawl websites, extract content, and search for specific keywords or information.
- Automated Sitemap Extraction: Automatically discovers all pages on a website
- Intelligent Content Extraction: Extracts markdown, HTML, or text content from web pages
- Keyword Search: Searches for specific keywords or phrases across all pages
- Progress Tracking: Real-time progress updates during scraping
- Error Handling: Robust error handling for network issues and parsing errors
- Configurable: Easy to configure for different websites and search terms
- LangGraph Workflow: Uses LangGraph for structured, maintainable scraping workflows
The code requires the following dependencies:
- Python 3.8+
- firecrawl-py
- langgraph
- pydantic-settings
- python-dotenv
```bash
# Clone the repository
git clone https://github.com/extrawest/web_scraping_with_firecrawl_and_langgraph.git
cd web_scraping_with_firecrawl_and_langgraph

# Install required packages
pip install -r requirements.txt

# Create a .env file with your configuration
echo "FIRECRAWL_URL=http://localhost:3002" > .env
```
```python
from scrape_the_web_agentically import main

# Run the scraper with a target URL and keyword
main(url="https://example.com", keyword="specific information")
```

```bash
# Run the script directly
python scrape_the_web_agentically.py
```
You can modify the target URL and search keyword by editing the script:
```python
if __name__ == "__main__":
    target_url = "https://python.langchain.com"
    search_keyword = "LLMs"

    if not target_url or not search_keyword:
        print("Please set the target_url and search_keyword variables.")
    else:
        main(target_url, search_keyword)
```
The script uses a LangGraph workflow to orchestrate the web scraping process:
- Initialization: Sets up the initial state with the target URL and keyword
- Sitemap Extraction: Fetches the sitemap to discover all pages on the website
- Batch Processing: Processes URLs in batches for efficient scraping
- Content Extraction: Extracts content from each page using Firecrawl
- Keyword Search: Searches for the specified keyword in the extracted content
- Result Evaluation: Determines if the information was found or if more URLs need to be processed
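The sitemap and content-extraction steps above are Firecrawl calls. As a rough illustration only, here is what they might look like with the firecrawl-py client, assuming a v1-style `FirecrawlApp` pointed at the local server from the `.env` example (self-hosted servers generally accept a placeholder API key, and exact parameter names vary between firecrawl-py releases):

```python
from firecrawl import FirecrawlApp

# Point the client at the local Firecrawl server; a placeholder API key
# is assumed to be accepted by a self-hosted instance.
firecrawl = FirecrawlApp(api_key="local", api_url="http://localhost:3002")

# Sitemap extraction: discover the pages on the target site.
links = firecrawl.map_url("https://python.langchain.com")

# Content extraction: fetch a single page (markdown is the default format).
page = firecrawl.scrape_url("https://python.langchain.com")
```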
The workflow is visualized and saved as a PNG file for easy understanding of the process.
The script uses LangGraph to create a structured workflow with the following nodes:
- `initialize_state`: Sets up the initial state with URL and keyword
- `get_sitemap`: Fetches the sitemap for the target URL
- `scrape_manager`: Manages batches of URLs for processing
- `scraper`: Extracts content from individual URLs
- `evaluate`: Checks if the keyword was found in the content
The workflow continues until either the keyword is found or all URLs have been processed.
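The node names below come straight from the list above; everything else (the state fields and the placeholder node bodies) is an illustrative assumption about how such a graph could be wired, not the repository's actual code:

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class ScrapeState(TypedDict):
    # Field names are assumptions for this sketch.
    url: str                 # target website
    keyword: str             # search term
    pending_urls: List[str]  # discovered pages not yet scraped
    found: bool              # whether the keyword has been located


def initialize_state(state: ScrapeState) -> dict:
    # Seed the state (the real script gets url/keyword from main()).
    return {"pending_urls": [], "found": False}


def get_sitemap(state: ScrapeState) -> dict:
    # Placeholder for the Firecrawl sitemap lookup sketched earlier.
    return {"pending_urls": [state["url"]]}


def scrape_manager(state: ScrapeState) -> dict:
    # Placeholder for the batching logic.
    return {}


def scraper(state: ScrapeState) -> dict:
    # Placeholder for per-URL content extraction.
    return {"pending_urls": state["pending_urls"][1:]}


def evaluate(state: ScrapeState) -> dict:
    # Placeholder for the keyword check on scraped content.
    return {}


def should_continue(state: ScrapeState) -> str:
    # Stop when the keyword is found or every URL has been processed.
    if state["found"] or not state["pending_urls"]:
        return END
    return "scrape_manager"


graph = StateGraph(ScrapeState)
graph.add_node("initialize_state", initialize_state)
graph.add_node("get_sitemap", get_sitemap)
graph.add_node("scrape_manager", scrape_manager)
graph.add_node("scraper", scraper)
graph.add_node("evaluate", evaluate)

graph.set_entry_point("initialize_state")
graph.add_edge("initialize_state", "get_sitemap")
graph.add_edge("get_sitemap", "scrape_manager")
graph.add_edge("scrape_manager", "scraper")
graph.add_edge("scraper", "evaluate")
graph.add_conditional_edges("evaluate", should_continue)

workflow = graph.compile()

# Render the PNG diagram mentioned above (uses mermaid.ink by default).
with open("workflow.png", "wb") as f:
    f.write(workflow.get_graph().draw_mermaid_png())
```

With a graph wired this way, a run would start from `workflow.invoke({"url": target_url, "keyword": search_keyword, "pending_urls": [], "found": False})` and loop between the manager, scraper, and evaluate nodes until `should_continue` routes to `END`.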
By default, the script connects to a local Firecrawl server at `http://localhost:3002`. You can change this by:
- Setting the `FIRECRAWL_URL` environment variable
- Creating a `.env` file with `FIRECRAWL_URL=your-server-url`
- Modifying the `firecrawl_url` parameter in the `Settings` class
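For reference, here is a minimal sketch of what that `Settings` class could look like with pydantic-settings (listed in the requirements); the `firecrawl_url` field name and its default come from this README, while the rest is assumed:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Reads FIRECRAWL_URL from the environment or from a .env file;
    # pydantic-settings matches field names case-insensitively.
    model_config = SettingsConfigDict(env_file=".env")

    firecrawl_url: str = "http://localhost:3002"


settings = Settings()
print(settings.firecrawl_url)
```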