
  1. Advanced Schema & Nested Structures

Real sites often have nested or repeated data, like categories containing products, which themselves have a list of reviews or features. For that, we can define nested or list (and even nested_list) fields.

Sample E-Commerce HTML

We have a sample e-commerce HTML file on GitHub (example):

https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html

This snippet includes categories, products, features, reviews, and related items. Let’s see how to define a schema that fully captures that structure without an LLM.

schema = {

"name": "E-commerce Product Catalog",
"baseSelector": "div.category",
# (1) We can define optional baseFields if we want to extract attributes 
# from the category container
"baseFields": [
    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"}, 
],
"fields": [
    {
        "name": "category_name",
        "selector": "h2.category-name",
        "type": "text"
    },
    {
        "name": "products",
        "selector": "div.product",
        "type": "nested_list",    # repeated sub-objects
        "fields": [
            {
                "name": "name",
                "selector": "h3.product-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "p.product-price",
                "type": "text"
            },
            {
                "name": "details",
                "selector": "div.product-details",
                "type": "nested",  # single sub-object
                "fields": [
                    {
                        "name": "brand",
                        "selector": "span.brand",
                        "type": "text"
                    },
                    {
                        "name": "model",
                        "selector": "span.model",
                        "type": "text"
                    }
                ]
            },
            {
                "name": "features",
                "selector": "ul.product-features li",
                "type": "list",
                "fields": [
                    {"name": "feature", "type": "text"} 
                ]
            },
            {
                "name": "reviews",
                "selector": "div.review",
                "type": "nested_list",
                "fields": [
                    {
                        "name": "reviewer", 
                        "selector": "span.reviewer", 
                        "type": "text"
                    },
                    {
                        "name": "rating", 
                        "selector": "span.rating", 
                        "type": "text"
                    },
                    {
                        "name": "comment", 
                        "selector": "p.review-text", 
                        "type": "text"
                    }
                ]
            },
            {
                "name": "related_products",
                "selector": "ul.related-products li",
                "type": "list",
                "fields": [
                    {
                        "name": "name", 
                        "selector": "span.related-name", 
                        "type": "text"
                    },
                    {
                        "name": "price", 
                        "selector": "span.related-price", 
                        "type": "text"
                    }
                ]
            }
        ]
    }
]

}

Key Takeaways:

Nested vs. List: type: "nested" means a single sub-object (like details). type: "list" means multiple items that are simple dictionaries or single text fields. type: "nested_list" means repeated complex objects (like products or reviews).

Base Fields: We can extract attributes from the container element via "baseFields". For instance, "data_cat_id" might be data-cat-id="elect123".

Transforms: We can also define a transform if we want to lower/upper case, strip whitespace, or even run a custom function.

Running the Extraction

import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

# Raw HTML to parse (elided here)
raw_html = '...'

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
    # Attach the extraction strategy to the run configuration
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        # The raw:// prefix makes the crawler parse the given HTML string
        result = await crawler.arun(url=f"raw://{raw_html}", config=config)

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
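
The nesting in the schema should carry over to the parsed output: a list of categories, each holding a list of products, each product holding a details object plus lists of features and reviews. The sketch below walks such a structure in plain Python. The sample data is hypothetical (hand-written to match the schema's field names, not taken from the sample HTML), and summarize_catalog is an illustrative helper, not part of crawl4ai.

```python
# Hypothetical output shaped like the e-commerce schema above.
# The values are invented for illustration only.
sample_data = [
    {
        "data_cat_id": "elect123",
        "category_name": "Electronics",
        "products": [
            {
                "name": "Example Phone",
                "price": "$499",
                "details": {"brand": "ExampleBrand", "model": "X1"},
                "features": [{"feature": "OLED display"}, {"feature": "5G"}],
                "reviews": [
                    {"reviewer": "Ann", "rating": "5", "comment": "Great"},
                    {"reviewer": "Bob", "rating": "4", "comment": "Good"},
                ],
                "related_products": [],
            }
        ],
    }
]


def summarize_catalog(categories):
    """Flatten the category -> product nesting into one row per product."""
    rows = []
    for category in categories:
        # "nested_list" fields are assumed to come back as lists of dicts.
        for product in category.get("products", []):
            # "nested" fields are assumed to come back as a single dict.
            details = product.get("details") or {}
            rows.append({
                "category": category.get("category_name"),
                "product": product.get("name"),
                "brand": details.get("brand"),
                "feature_count": len(product.get("features", [])),
                "review_count": len(product.get("reviews", [])),
            })
    return rows


rows = summarize_catalog(sample_data)
print(rows)
```

Using .get() with defaults keeps the walk from failing when a selector matched nothing and a key is missing from the output.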
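  1. Putting It All Together: A Larger Example

Consider a blog site. We have a schema that extracts the URL from each post card (via baseFields with "attribute": "href"), plus the title, date, summary, and author: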

schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}

Then run with JsonCssExtractionStrategy(schema) to get an array of blog post objects, each with "post_url", "title", "date", "summary", and "author".
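
Because the blog schema sets defaults, cards with missing elements still produce an object, so a small post-processing pass is often useful. The following is a minimal sketch under stated assumptions: the posts list and BASE_URL are hypothetical stand-ins for real extraction output, and clean_posts is an illustrative helper, not a crawl4ai function.

```python
from urllib.parse import urljoin

# Hypothetical base URL and extraction output, for illustration only.
BASE_URL = "https://example.com/blog/"
posts = [
    {"post_url": "/posts/first", "title": "First Post", "date": "2024-01-01",
     "summary": "Hello", "author": "Ann"},
    {"post_url": "/posts/untitled", "title": "No Title", "date": "",
     "summary": "", "author": ""},
]


def clean_posts(items, base_url):
    """Drop cards that fell back to the default title and absolutize URLs."""
    cleaned = []
    for item in items:
        # "No Title" is the default declared in the schema for a missing title.
        if item.get("title") == "No Title":
            continue
        absolute = urljoin(base_url, item.get("post_url", ""))
        cleaned.append({**item, "post_url": absolute})
    return cleaned


cleaned = clean_posts(posts, BASE_URL)
print(cleaned)
```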

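  1. Tips & Best Practices

  1. Inspect the DOM in Chrome DevTools or Firefox's Inspector to find stable selectors.
  2. Start simple: verify that you can extract a single field, then add complexity such as nested objects or lists.
  3. Test your schema on partial HTML or a test page before a big crawl.
  4. Combine with JS Execution if the site loads content dynamically. You can pass js_code or wait_for in CrawlerRunConfig.
  5. Look at the logs when verbose=True: if your selectors are off or your schema is malformed, it will often show warnings.
  6. Use baseFields if you need attributes from the container element (e.g., href, data-id), especially for the "parent" item.
  7. Performance: for large pages, make sure your selectors are as narrow as possible.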
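
One way to act on the "start simple" and "test your schema" tips is to sanity-check the schema dictionary before any crawl. The sketch below is a hypothetical helper: its rules (container types need sub-fields, attribute fields need an "attribute" key) are inferred from the examples in this document, not taken from crawl4ai's own validation.

```python
# Field types that, in the examples above, always carry sub-"fields".
CONTAINER_TYPES = {"nested", "list", "nested_list"}


def check_fields(fields, path):
    """Recursively collect problems in a list of field definitions."""
    problems = []
    for index, field in enumerate(fields):
        where = f"{path}[{index}] ({field.get('name', '?')})"
        for key in ("name", "type"):
            if key not in field:
                problems.append(f"{where}: missing '{key}'")
        field_type = field.get("type")
        if field_type == "attribute" and "attribute" not in field:
            problems.append(f"{where}: type 'attribute' needs an 'attribute' key")
        if field_type in CONTAINER_TYPES:
            if not field.get("fields"):
                problems.append(f"{where}: type '{field_type}' needs sub-'fields'")
            else:
                problems.extend(check_fields(field["fields"], f"{where}.fields"))
    return problems


def check_schema(schema):
    """Check the top-level keys, then baseFields and fields."""
    problems = [
        f"schema: missing '{key}'"
        for key in ("name", "baseSelector", "fields")
        if key not in schema
    ]
    problems += check_fields(schema.get("baseFields", []), "baseFields")
    problems += check_fields(schema.get("fields", []), "fields")
    return problems


good = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [{"name": "post_url", "type": "attribute", "attribute": "href"}],
    "fields": [{"name": "title", "selector": "h2.post-title", "type": "text"}],
}
bad = {
    "name": "Broken",
    "baseSelector": "div.category",
    "fields": [{"name": "products", "selector": "div.product", "type": "nested_list"}],
}

print(check_schema(good))
print(check_schema(bad))
```

An empty list means no problems were found by these inferred rules; it does not guarantee that the selectors match anything on the page.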

Only two extraction strategies are available: JsonXPathExtractionStrategy and JsonCssExtractionStrategy. They cannot be combined in a single schema object.

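You must use the XPath strategy. Following the instructions above, help me extract all the text from the HTML. You should notice the pattern: the text sits in elements like this: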

<div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン">パソコン</div>
You only need to write the schema dictionary.

<div class="autocomplete-results-container" id="sac-autocomplete-results-container" role="grid"><div class="two-pane-results-container" role="rowgroup"><div class="left-pane-results-container" style="flex: 1 1 0%; height: auto;"><div id="sac-suggestion-row-1" role="row" aria-rowindex="1" aria-owns="sac-suggestion-row-1-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-1-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン">パソコン</div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-2" role="row" aria-rowindex="2" aria-owns="sac-suggestion-row-2-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-2-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース">パソコン<span class="s-heavy">ケース</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-3" role="row" aria-rowindex="3" aria-owns="sac-suggestion-row-3-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-3-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンスタンド">パソコン<span class="s-heavy">スタンド</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-4" role="row" aria-rowindex="4" 
aria-owns="sac-suggestion-row-4-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-4-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンデスク">パソコン<span class="s-heavy">デスク</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-5" role="row" aria-rowindex="5" aria-owns="sac-suggestion-row-5-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-5-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン ノート">パソコン<span class="s-heavy"> ノート</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-6" role="row" aria-rowindex="6" aria-owns="sac-suggestion-row-6-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-6-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン台 卓上">パソコン<span class="s-heavy">台 卓上</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-7" role="row" aria-rowindex="7" aria-owns="sac-suggestion-row-7-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-7-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース 14インチ">パソコン<span 
class="s-heavy">ケース 14インチ</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-8" role="row" aria-rowindex="8" aria-owns="sac-suggestion-row-8-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-8-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンラック">パソコン<span class="s-heavy">ラック</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-9" role="row" aria-rowindex="9" aria-owns="sac-suggestion-row-9-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-9-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン モニター">パソコン<span class="s-heavy"> モニター</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-10" role="row" aria-rowindex="10" aria-owns="sac-suggestion-row-10-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-10-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース 13.3インチ">パソコン<span class="s-heavy">ケース 13.3インチ</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background 
s-suggestion-icon-background-grey-shield"></div></div></div></div></div><div class="right-pane-results-container" style="display:none"></div></div><div class="status-message-container" role="status">Prefix パソコン, 10 individual suggestions</div></div>
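
A schema along the following lines might fit the markup above. This is a sketch, not a verified answer: the schema name and field names are illustrative, and the selectors are read off the HTML. The rough_check function is a hypothetical stand-in that uses the limited XPath subset of Python's xml.etree.ElementTree on a shortened fragment; it does not call crawl4ai. Whether the "text" type in JsonXPathExtractionStrategy joins the text inside the nested span the same way is an assumption to verify, which is why the aria-label attribute is captured as a fallback.

```python
import xml.etree.ElementTree as ET

# Candidate schema for JsonXPathExtractionStrategy. The schema name and field
# names are illustrative; the selectors come from the markup above.
schema = {
    "name": "Autocomplete Suggestions",
    "baseSelector": "//div[@class='s-suggestion-container']",
    "fields": [
        # Visible text, including the nested <span class="s-heavy"> part.
        {"name": "text", "selector": ".//div[@role='button']", "type": "text"},
        # Fallback: the same string is mirrored in aria-label.
        {
            "name": "aria_label",
            "selector": ".//div[@role='button']",
            "type": "attribute",
            "attribute": "aria-label",
        },
    ],
}

# Shortened copy of one suggestion row, used only for the stdlib check below.
fragment = (
    '<div role="rowgroup"><div class="s-suggestion-container" role="gridcell">'
    '<div class="s-suggestion s-suggestion-ellipsis-direction" role="button" '
    'aria-label="パソコンケース">パソコン<span class="s-heavy">ケース</span></div>'
    '</div></div>'
)


def rough_check(xml_text, schema):
    """Approximate the extraction with ElementTree's XPath subset."""
    root = ET.fromstring(xml_text)
    items = []
    # ElementTree needs a relative path, so "//x" becomes ".//x".
    for base in root.findall("." + schema["baseSelector"]):
        item = {}
        for field in schema["fields"]:
            node = base.find(field["selector"])
            if node is None:
                continue
            if field["type"] == "attribute":
                item[field["name"]] = node.get(field["attribute"])
            else:
                # Join all descendant text so the span content is included.
                item[field["name"]] = "".join(node.itertext()).strip()
        items.append(item)
    return items


suggestions = rough_check(fragment, schema)
print(suggestions)
```

Selecting on role="button" rather than on the s-suggestion class avoids the icon elements, whose class names also start with s-suggestion.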