Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.svalync.com/llms.txt

Use this file to discover all available pages before exploring further.

Bulk Scrape

The Bulk Scrape node enables you to scrape multiple URLs concurrently while managing resources efficiently.

Overview

This node allows you to:
  • Scrape multiple URLs
  • Manage concurrency
  • Handle rate limiting
  • Process results in batches
  • Export data in bulk

Configuration

ParameterTypeDescription
URLsArray[String]List of URLs to scrape
ConcurrencyNumberMaximum concurrent requests
Rate LimitObjectRate limiting settings
SelectorsObjectData extraction selectors
ExportObjectExport configuration

Example Usage

Basic Bulk Scrape

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "concurrency": 3,
  "selectors": {
    "title": "h1",
    "content": ".main-content"
  }
}

Advanced Configuration

{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "concurrency": 2,
  "rate_limit": {
    "requests_per_second": 1,
    "delay_between_requests": 1000
  },
  "selectors": {
    "title": {
      "selector": "h1",
      "type": "text"
    },
    "images": {
      "selector": "img",
      "type": "attribute",
      "attribute": "src"
    }
  },
  "export": {
    "format": "json",
    "filename": "scrape_results.json"
  }
}

Queue Management

{
  "queue": {
    "max_size": 1000,
    "retry_failed": true,
    "max_retries": 3,
    "priority": {
      "enabled": true,
      "rules": [
        {
          "pattern": "*.important.com",
          "priority": 1
        }
      ]
    }
  }
}

Error Handling

Retry Configuration

{
  "retry": {
    "max_attempts": 3,
    "codes": [429, 503],
    "delay": {
      "initial": 1000,
      "multiplier": 2,
      "max": 10000
    }
  }
}

Export Formats

JSON Export

{
  "export": {
    "format": "json",
    "pretty": true,
    "filename": "results.json"
  }
}

CSV Export

{
  "export": {
    "format": "csv",
    "headers": true,
    "delimiter": ",",
    "filename": "results.csv"
  }
}

Progress Tracking

{
  "progress": {
    "enabled": true,
    "update_interval": 1000,
    "callbacks": {
      "onProgress": "https://api.example.com/progress",
      "onComplete": "https://api.example.com/complete"
    }
  }
}

Resource Management

{
  "resources": {
    "max_memory": "1GB",
    "max_cpu": 0.8,
    "cleanup_interval": 5000
  }
}