Scrape Content from Links

The Scrape Content from Links node extracts content from a list of links and can recursively follow the links it finds, up to a configurable depth.

Overview

This node enables you to:

  • Extract content from links
  • Follow nested links
  • Set crawl depth
  • Filter content
  • Handle pagination

Configuration

Parameter          Type           Description
Start URLs         Array[String]  Initial URLs to scrape
Link Selector      String         CSS selector for finding links
Content Selectors  Object         Selectors for content extraction
Max Depth          Number         Maximum crawl depth
Follow Rules       Object         Rules for following links
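
The types in the table above can be checked before a run. The following is a minimal sketch in Python, assuming the snake_case keys used in the examples on this page (start_urls, link_selector, and so on) correspond to the listed parameters. The validate_config helper is hypothetical and not part of the node, and because the page does not say which parameters are required, missing keys are simply skipped.

```python
from numbers import Number

# Expected Python types for each key, mirroring the parameter table.
# The snake_case key names are taken from the examples on this page.
EXPECTED_TYPES = {
    "start_urls": list,
    "link_selector": str,
    "content_selectors": dict,
    "max_depth": Number,
    "follow_rules": dict,
}


def validate_config(config):
    """Return a list of human-readable type errors (empty if none found)."""
    errors = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            continue  # whether a key is required is not stated, so skip it
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Array[String]: every start URL must itself be a string.
    urls = config.get("start_urls")
    if isinstance(urls, list):
        for i, url in enumerate(urls):
            if not isinstance(url, str):
                errors.append(
                    f"start_urls[{i}]: expected str, got {type(url).__name__}"
                )
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every type error in one pass.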

Example Usage

{
  "start_urls": ["https://example.com/blog"],
  "link_selector": "a.article-link",
  "content_selectors": {
    "title": "h1.article-title",
    "content": "div.article-content",
    "date": "span.publish-date"
  }
}

Advanced Configuration

{
  "start_urls": ["https://example.com/blog"],
  "link_selector": "a.article-link",
  "content_selectors": {
    "title": {
      "selector": "h1.article-title",
      "type": "text"
    },
    "content": {
      "selector": "div.article-content",
      "type": "html"
    },
    "author": {
      "selector": ".author-info",
      "nested": {
        "name": ".author-name",
        "bio": ".author-bio"
      }
    }
  },
  "max_depth": 2,
  "follow_rules": {
    "patterns": ["*/blog/*", "*/article/*"],
    "exclude": ["*/tag/*", "*/category/*"]
  }
}

Pattern Matching

{
  "follow_rules": {
    "patterns": ["*/blog/*", "*/article/*"],
    "exclude": ["*/archive/*", "*/author/*"],
    "regex": {
      "include": ["^/blog/\\d{4}/\\d{2}/"],
      "exclude": ["\\?(page|sort)="]
    }
  }
}
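
How these rules combine is not spelled out here, so the following Python sketch shows one plausible reading rather than the node's actual behaviour. It assumes that glob patterns are matched against the full URL, that regular expressions are matched against the path plus query string, that any exclude match rejects a link, and that a link needs to match at least one include rule. The should_follow function is a hypothetical name.

```python
import re
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_follow(url, follow_rules):
    """Decide whether a link should be followed (assumed semantics)."""
    parts = urlsplit(url)
    # Regexes are tested against path plus query, e.g. "/blog/2024/05/?page=2".
    target = parts.path + ("?" + parts.query if parts.query else "")
    regex = follow_rules.get("regex", {})

    # Assumption: any exclude match (glob or regex) rejects the link.
    # fnmatchcase is used so matching is case-sensitive on every platform.
    if any(fnmatchcase(url, p) for p in follow_rules.get("exclude", [])):
        return False
    if any(re.search(p, target) for p in regex.get("exclude", [])):
        return False

    includes = follow_rules.get("patterns", [])
    regex_includes = regex.get("include", [])
    if not includes and not regex_includes:
        return True  # no include rules, so nothing restricts the link
    # Assumption: matching any single include rule is enough.
    return (any(fnmatchcase(url, p) for p in includes)
            or any(re.search(p, target) for p in regex_includes))
```

Under these assumptions, https://example.com/blog/2024/05/?page=2 would be rejected by the query-string exclude even though it matches the */blog/* pattern.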

Content Extraction

Nested Content

{
  "content_selectors": {
    "article": {
      "selector": "article",
      "nested": {
        "title": "h1",
        "metadata": {
          "selector": ".meta",
          "nested": {
            "author": ".author",
            "date": ".date",
            "tags": {
              "selector": ".tags li",
              "multiple": true
            }
          }
        }
      }
    }
  }
}
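
To make the nesting easier to reason about, the sketch below flattens a content_selectors object into one entry per leaf field. It is written in Python and rests on two assumptions that this page does not confirm: a bare string is shorthand for an object with only a selector, and a nested selector is evaluated inside its parent's match, which is expressed here with the CSS descendant combinator. The flatten_selectors helper is hypothetical.

```python
def flatten_selectors(spec, parent_path="", parent_selector=""):
    """Map each leaf field to (combined CSS selector, multiple flag)."""
    flat = {}
    for name, entry in spec.items():
        # Shorthand: a bare string is treated as {"selector": <string>}.
        if isinstance(entry, str):
            entry = {"selector": entry}
        path = f"{parent_path}.{name}" if parent_path else name
        # Assumption: a nested selector is scoped to its parent's match,
        # which corresponds to the CSS descendant combinator (a space).
        selector = f"{parent_selector} {entry['selector']}".strip()
        if "nested" in entry:
            # Container field: recurse and keep only its leaf fields.
            flat.update(flatten_selectors(entry["nested"], path, selector))
        else:
            flat[path] = (selector, bool(entry.get("multiple", False)))
    return flat
```

Applied to the example above, the tags field would come out as the path article.metadata.tags with the combined selector "article .meta .tags li" and the multiple flag set.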

Data Processing

Content Transformations

{
  "transformations": {
    "title": ["trim", "normalize"],
    "date": "to_iso_date",
    "content": ["remove_html", "clean_whitespace"]
  }
}
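
The transformation names above are not defined on this page, so the following Python sketch only illustrates how such a pipeline could be applied to an extracted record. Every implementation in the TRANSFORMS table is an assumption about what the name means (for example, normalize is taken to be Unicode NFKC normalization, and to_iso_date accepts only two input formats). The apply_transformations helper is hypothetical.

```python
import html
import re
import unicodedata
from datetime import datetime

# Assumed input formats for to_iso_date; real pages may need others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def _to_iso_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {value!r}")


# Hypothetical implementations; the node's real behaviour may differ.
TRANSFORMS = {
    "trim": str.strip,
    "normalize": lambda s: unicodedata.normalize("NFKC", s),
    "to_iso_date": _to_iso_date,
    # Crude regex tag stripping, adequate for a sketch but not for messy HTML.
    "remove_html": lambda s: html.unescape(re.sub(r"<[^>]+>", " ", s)),
    "clean_whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),
}


def apply_transformations(record, transformations):
    """Return a copy of record with each field's steps applied in order."""
    out = dict(record)
    for field, steps in transformations.items():
        if field not in out:
            continue  # nothing was extracted for this field
        if isinstance(steps, str):
            steps = [steps]  # a single name is allowed, as for "date" above
        for step in steps:
            out[field] = TRANSFORMS[step](out[field])
    return out
```

Steps run in list order, so with ["remove_html", "clean_whitespace"] the whitespace left behind by removed tags is collapsed afterwards.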

Error Handling

Common issues to account for when crawling:

  • Broken links
  • Invalid content structure
  • Rate limiting
  • Depth limits
  • Circular references
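
Three of the issues listed above (broken links, depth limits, and circular references) can be handled by a breadth-first traversal that tracks depth and remembers which URLs it has already queued. The Python sketch below shows the idea and is not the node's implementation. The get_links callback is hypothetical and stands in for fetching a page and applying the link selector, and treating the start URLs as depth 0 is an assumption.

```python
from collections import deque


def crawl(start_urls, get_links, max_depth):
    """Breadth-first crawl; returns (visited URLs in order, failed URLs)."""
    visited, failed = [], []
    start = list(dict.fromkeys(start_urls))  # drop duplicate start URLs
    seen = set(start)
    queue = deque((url, 0) for url in start)  # assumption: start URLs are depth 0
    while queue:
        url, depth = queue.popleft()
        try:
            links = get_links(url)  # hypothetical: fetch page, apply link selector
        except Exception:
            failed.append(url)  # broken link: record it and keep crawling
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in links:
            if link not in seen:  # the seen set breaks circular references
                seen.add(link)
                queue.append((link, depth + 1))
    return visited, failed
```

Because a URL is added to the seen set when it is queued rather than when it is fetched, two pages that link to each other cannot enqueue one another repeatedly.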