Bulk Scrape

The Bulk Scrape node enables you to scrape multiple URLs concurrently while managing resources efficiently.

Overview

This node allows you to:

  • Scrape multiple URLs
  • Manage concurrency
  • Handle rate limiting
  • Process results in batches
  • Export data in bulk

Configuration

ParameterTypeDescription
URLsArray[String]List of URLs to scrape
ConcurrencyNumberMaximum concurrent requests
Rate LimitObjectRate limiting settings
SelectorsObjectData extraction selectors
ExportObjectExport configuration

Example Usage

Basic Bulk Scrape

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "concurrency": 3,
  "selectors": {
    "title": "h1",
    "content": ".main-content"
  }
}

Advanced Configuration

{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "concurrency": 2,
  "rate_limit": {
    "requests_per_second": 1,
    "delay_between_requests": 1000
  },
  "selectors": {
    "title": {
      "selector": "h1",
      "type": "text"
    },
    "images": {
      "selector": "img",
      "type": "attribute",
      "attribute": "src"
    }
  },
  "export": {
    "format": "json",
    "filename": "scrape_results.json"
  }
}

Queue Management

{
  "queue": {
    "max_size": 1000,
    "retry_failed": true,
    "max_retries": 3,
    "priority": {
      "enabled": true,
      "rules": [
        {
          "pattern": "*.important.com",
          "priority": 1
        }
      ]
    }
  }
}

Error Handling

Retry Configuration

{
  "retry": {
    "max_attempts": 3,
    "codes": [429, 503],
    "delay": {
      "initial": 1000,
      "multiplier": 2,
      "max": 10000
    }
  }
}

Export Formats

JSON Export

{
  "export": {
    "format": "json",
    "pretty": true,
    "filename": "results.json"
  }
}

CSV Export

{
  "export": {
    "format": "csv",
    "headers": true,
    "delimiter": ",",
    "filename": "results.csv"
  }
}

Progress Tracking

{
  "progress": {
    "enabled": true,
    "update_interval": 1000,
    "callbacks": {
      "onProgress": "https://api.example.com/progress",
      "onComplete": "https://api.example.com/complete"
    }
  }
}

Resource Management

{
  "resources": {
    "max_memory": "1GB",
    "max_cpu": 0.8,
    "cleanup_interval": 5000
  }
}