crawler_Schema.md 17 KB

  1. Advanced Schema & Nested Structures

Real sites often have nested or repeated data, such as categories containing products, which themselves have a list of reviews or features. For that, we can define nested or list (and even nested_list) fields.

Sample E-Commerce HTML

We have a sample e-commerce HTML file on GitHub (example):

https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html

This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure without an LLM.

schema = {

"name": "E-commerce Product Catalog",
"baseSelector": "div.category",
# (1) We can define optional baseFields if we want to extract attributes 
# from the category container
"baseFields": [
    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"}, 
],
"fields": [
    {
        "name": "category_name",
        "selector": "h2.category-name",
        "type": "text"
    },
    {
        "name": "products",
        "selector": "div.product",
        "type": "nested_list",    # repeated sub-objects
        "fields": [
            {
                "name": "name",
                "selector": "h3.product-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "p.product-price",
                "type": "text"
            },
            {
                "name": "details",
                "selector": "div.product-details",
                "type": "nested",  # single sub-object
                "fields": [
                    {
                        "name": "brand",
                        "selector": "span.brand",
                        "type": "text"
                    },
                    {
                        "name": "model",
                        "selector": "span.model",
                        "type": "text"
                    }
                ]
            },
            {
                "name": "features",
                "selector": "ul.product-features li",
                "type": "list",
                "fields": [
                    {"name": "feature", "type": "text"} 
                ]
            },
            {
                "name": "reviews",
                "selector": "div.review",
                "type": "nested_list",
                "fields": [
                    {
                        "name": "reviewer", 
                        "selector": "span.reviewer", 
                        "type": "text"
                    },
                    {
                        "name": "rating", 
                        "selector": "span.rating", 
                        "type": "text"
                    },
                    {
                        "name": "comment", 
                        "selector": "p.review-text", 
                        "type": "text"
                    }
                ]
            },
            {
                "name": "related_products",
                "selector": "ul.related-products li",
                "type": "list",
                "fields": [
                    {
                        "name": "name", 
                        "selector": "span.related-name", 
                        "type": "text"
                    },
                    {
                        "name": "price", 
                        "selector": "span.related-price", 
                        "type": "text"
                    }
                ]
            }
        ]
    }
]

}

Key Takeaways:

Nested vs. List:
    type: "nested" means a single sub-object (like details).
    type: "list" means multiple items that are simple dictionaries or single text fields.
    type: "nested_list" means repeated complex objects (like products or reviews).
Base Fields: We can extract attributes from the container element via "baseFields". For instance, "data_cat_id" might be data-cat-id="elect123".
Transforms: We can also define a transform if we want to lower/upper case, strip whitespace, or even run a custom function.

Running the Extraction

import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}
raw_html = '...'

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
    # Attach the extraction strategy to the run configuration
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        # The raw:// prefix passes inline HTML instead of fetching a URL
        result = await crawler.arun(
            url=f"raw://{raw_html}",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
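
To make the output shapes of "nested", "list", and "nested_list" more concrete, here is a minimal standard-library sketch that reproduces them by hand on a tiny invented product snippet. It does not use crawl4ai and is not how the library is implemented; the sample markup, its values, and the `text_of` helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (not the gist referenced above)
SNIPPET = """
<div class="product">
  <div class="product-details">
    <span class="brand">Acme</span><span class="model">X1</span>
  </div>
  <ul class="product-features"><li>Light</li><li>Durable</li></ul>
  <div class="review"><span class="reviewer">Ann</span><span class="rating">5</span></div>
  <div class="review"><span class="reviewer">Bob</span><span class="rating">4</span></div>
</div>
""".strip()


def text_of(parent, path):
    """Return the stripped text of the first match of `path`, or None."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else None


product = ET.fromstring(SNIPPET)

# "nested": only the first matching element is used, giving ONE dictionary
details_node = product.find(".//div[@class='product-details']")
details = {
    "brand": text_of(details_node, "span[@class='brand']"),
    "model": text_of(details_node, "span[@class='model']"),
}

# "list": one simple item per matching element
features = [
    {"feature": li.text.strip()}
    for li in product.findall(".//ul[@class='product-features']/li")
]

# "nested_list": one multi-field dictionary per matching element
reviews = [
    {
        "reviewer": text_of(review, "span[@class='reviewer']"),
        "rating": text_of(review, "span[@class='rating']"),
    }
    for review in product.findall(".//div[@class='review']")
]

print(details)
print(features)
print(reviews)
```

The point to notice is that both list-like types produce one entry per element matched by the field's own selector, so that selector should target the repeated element rather than its container.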
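  1. Putting It All Together: A Larger Example

Consider a blog site. We have a schema that extracts the URL from each post card (via baseFields with an "attribute": "href"), plus the title, date, summary, and author: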

schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}

Then run with JsonCssExtractionStrategy(schema) to get an array of blog post objects, each containing "post_url", "title", "date", "summary", and "author".

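  1. Tips & Best Practices

     1. Inspect the DOM in Chrome DevTools or Firefox's Inspector to find stable selectors.
     2. Start simple: verify that you can extract a single field, then add complexity such as nested objects or lists.
     3. Test your schema on partial HTML or a test page before a big crawl.
     4. Combine with JS Execution if the site loads content dynamically. You can pass js_code or wait_for in CrawlerRunConfig.
     5. Look at the logs when verbose=True: if your selectors are off or your schema is malformed, it will often show warnings.
     6. Use baseFields if you need attributes from the container element (e.g., href, data-id), especially for the "parent" item.
     7. Performance: for large pages, make sure your selectors are as narrow as possible.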
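
As a rough companion to the tips about starting simple and catching malformed schemas early, the sketch below walks a schema dictionary and reports obvious structural problems before any crawl is run. The `find_schema_problems` helper and its rules are hypothetical, inferred only from the example schemas in these notes; it is not part of crawl4ai and does not replace the library's own warnings.

```python
# Field types that carry sub-fields in the example schemas above
CONTAINER_TYPES = {"nested", "list", "nested_list"}


def find_schema_problems(schema: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not schema.get("baseSelector"):
        problems.append("schema: missing 'baseSelector'")

    def check_fields(fields, path):
        for index, field in enumerate(fields):
            name = field.get("name")
            where = f"{path}.{name}" if name else f"{path}[{index}]"
            if not name:
                problems.append(f"{where}: missing 'name'")

            field_type = field.get("type")
            if not field_type:
                problems.append(f"{where}: missing 'type'")
            if field_type == "attribute" and not field.get("attribute"):
                problems.append(f"{where}: type 'attribute' needs an 'attribute' key")
            if field_type in CONTAINER_TYPES:
                if not field.get("selector"):
                    problems.append(f"{where}: type '{field_type}' needs a 'selector'")
                if not field.get("fields"):
                    problems.append(f"{where}: type '{field_type}' needs a non-empty 'fields' list")
                else:
                    check_fields(field["fields"], where)

    check_fields(schema.get("baseFields", []), "baseFields")
    check_fields(schema.get("fields", []), "fields")
    return problems


# Deliberately broken demo schema (invented for this example)
broken = {
    "name": "Demo",
    "baseSelector": "div.product",
    "fields": [
        {"name": "price", "selector": "p.price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute"},  # no "attribute" key
        {"name": "reviews", "selector": "div.review", "type": "nested_list"},  # no "fields"
    ],
}

problems = find_schema_problems(broken)
for problem in problems:
    print(problem)
```

Running a check like this on a small schema first, then growing the schema field by field, keeps selector mistakes separate from structural ones.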

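You have only two extraction strategies available: JsonXPathExtractionStrategy and JsonCssExtractionStrategy. They cannot be combined in a single schema object. Prefer the XPath strategy unless CSS is the better fit.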

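Following the instructions above, help me extract the following from this HTML:

Product Information: the image link, goto_amazon (the hyperlink of the a tag with class btn-asinseed-link), and the main text content.
Unique Words: all the text inside the article, as a list.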

<div class="js-sticky-block" data-has-sticky-header="true" data-offset-target="#logoAndNav" data-sticky-view="lg" data-start-point="#stickyBlockStartPoint" data-end-point="#stickyBlockEndPoint" data-offset-top="32" data-offset-bottom="170">

                    <div id="div-asin-product-infor">
                        <h3 class="h5 text-asinseed-black font-weight-bold mb-4">Product Information</h3>
                        <article class="mb-5">
                            <div class="d-flex mb-1">
                                <div class="avatar-self-pic mr-3">
                                    <div class="pop-url-imgs" style="background-image: url(https://m.media-amazon.com/images/I/41hY78XIaiL._AC_US600_.jpg)">
                                        </div>
                                    <img class="img-fluid rounded u-xl-avatar item-wh-6r" src="https://m.media-amazon.com/images/I/41hY78XIaiL._AC_US200_.jpg" alt="GOODCHI フレームプロテクター ブレーキケーブルプロテクター 耐摩耗性 柔らかくスパイラル パイプ保護 自転車用 保護スリーブ 10個入">
                                </div>
                                <div class="media-body">
                                    <h4 class="h6 font-weight-normal mb-0">
                                        <a href="https://www.amazon.co.jp/dp/B0CQ1SHD8V" class="small text-muted" target="_blank">GOODCHI</a><br>
                                        GOODCHI フレームプロテクター ブレーキケーブルプロテクター 耐摩耗性 柔らかくスパイラル パイプ保護 自転車用 保護スリーブ 10個入<br>
                                        <span class="small text-muted">B0CQ1SHD8V</span>
                                        <a href="https://www.amazon.co.jp/dp/B0CQ1SHD8V" target="_blank" class="small btn-asinseed-link text-muted" data-toggle="tooltip" title="" data-original-title="View This Product on Amazon"><i class="iconfont icon-to_amazon small"></i></a>
                                    </h4>
                                </div>
                            </div>
                        </article>
                    </div>

                    <div id="div-asin-variation">
                        <h3 class="h5 text-asinseed-black font-weight-bold mb-4">Variations</h3>
                            <article class="mb-5" data-animation="flash" data-animation-delay="800" data-animation-duration="1500">
                                <ul class="list-unstyled u-list" id="variation-parent-asin" data-asin="B0CQ1SHD8V">
                                    <li class="u-variation-list__link  ">
                                            <a href="https://www.asinseed.com/en/JP/B0DHCWHMM6?utm_asin=B0CQ1SHD8V"><span class="far fa-dot-circle u-list__link-icon mr-1"></span>色: ブラック+レッド
                                                &nbsp;<span class="badge badge-pill" data-asin="B0DHCWHMM6">2</span>
                                                
                                            </a>
                                        </li>
                                        <li class="u-variation-list__link active ">
                                            <a href="https://www.asinseed.com/en/JP/B0CQ1SHD8V?utm_asin=B0CQ1SHD8V"><span class="far fa-dot-circle u-list__link-icon mr-1"></span>色: ブラック
                                                &nbsp;<span class="badge badge-pill" data-asin="B0CQ1SHD8V">12</span>
                                                
                                            </a>
                                        </li>
                                        </ul>
                            </article>
                    </div>
                    <h3 class="h5 text-asinseed-black font-weight-bold mb-4">Unique Words
                        <i class="far fa-question-circle small" data-trigger="hover" role="button" tabindex="1" data-content="Unique Words is the minimal unit of keywords, keywords(user search terms) is composed of them.<br>You can put these words on your listing's title, search term, bullet points and description, let Amazon <br>think such keywords are your product profile, and give your more traffic.<br>We suggest after you review 10+ competitors, then begin to optimize your listing." data-html="true" data-toggle="popover" data-placement="top" data-container="body" data-original-title="" title=""></i>
                        <button class="btn btn-xs u-btn-asinseed keywords-copy-clipboard ml-4" title="" data-clipboard-text="カバー
ロードバイク
ワイヤーガード
プロテクター
フレームパッド
フレームプロテクター
キックガード
プロテクターカバー
車
キルト芯
ケーブルプロテクター
ハンドルカバー" type="button"><i class="ace-icon fa fa-copy bigger-110"></i> Copy to Clipboard</button>
                    </h3>
                    <article class="mb-5">
                        <!-- style="background-color: rgba(182, 214, 249, 0.1);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-5">カバー</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.1);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-4">ロードバイク</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-3">ワイヤーガード</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-3">プロテクター</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-2">フレームパッド</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-2">フレームプロテクター</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">キックガード</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">プロテクターカバー</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">車</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">キルト芯</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">ケーブルプロテクター</span>
                            <!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
                                <span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">ハンドルカバー</span>
                            </article>
                    <div id="div-introduce-video">
                        <h3 class="h5 text-asinseed-black font-weight-bold mb-4">AsinSeed Video</h3>
                        <article class="mb-5">
                            <a id="header-help-video-btn" style="outline: 0;">
                                <img src="https://www.asinseed.com/assets/images/video/introduce-video-en-20181122.png" alt="SVG Illustration" style="width: 300px;">
                            </a>
                        </article>
                    </div>

            </div>
schema = {
            "name": "Product Details",
            "baseSelector": "div.js-sticky-block",
            "fields": [
                {
                    "name": "product_info",
                    "selector": "#div-asin-product-infor",
                    "type": "nested",
                    "fields": [
                        {
                            "name": "image_url",
                            "selector": "div.avatar-self-pic img",
                            "type": "attribute",
                            "attribute": "src"
                        },
                        {
                            "name": "goto_amazon",
                            "selector": "a.btn-asinseed-link",
                            "type": "attribute",
                            "attribute": "href"
                        },
                        {
                            "name": "main_text",
                            "selector": "div.media-body h4",
                            "type": "text",
                            "transform": ["strip"]
                        }
                    ]
                },
                {
                    "name": "unique_words",
                    "selector": "h3:contains('Unique Words') + article",
                    "type": "list",
                    "fields": [
                        {
                            "name": "word",
                            "selector": "span.badge-asinseed-keywords-weight",
                            "type": "text"
                        }
                    ]
                }
            ]
        }

This did not extract correctly: {...'unique_words': [{'word': 'カバー'}]}

In theory the result should be a list of strings, because everything under the article is a span tag.
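
One plausible reading of this result, going by the "features" field in the e-commerce schema from the notes: a "list" field yields one item per element matched by its own selector. Here that selector matches a single article, and the inner "word" selector then appears to return only the first span inside it. A possible fix is to point the list selector at the span elements themselves and give the inner field no selector. The sketch below has not been run against crawl4ai; the `flatten_words` helper and `sample_item` are hypothetical, and because a "list" field seems to return dictionaries, the plain list of strings is produced by post-processing.

```python
# Possible replacement for the "unique_words" field (an untested assumption,
# modelled on the "features" field in the e-commerce schema above).
unique_words_field = {
    "name": "unique_words",
    # Match every keyword badge directly so that each span becomes one list item.
    # In the pasted HTML this class only occurs inside the Unique Words article.
    "selector": "span.badge-asinseed-keywords-weight",
    "type": "list",
    "fields": [
        {"name": "word", "type": "text"}  # no selector: read the matched span itself
    ],
}


def flatten_words(item: dict) -> list:
    """Turn [{'word': 'a'}, {'word': 'b'}] into ['a', 'b'], skipping empty entries."""
    return [
        entry["word"].strip()
        for entry in item.get("unique_words", [])
        if entry.get("word")
    ]


# Hypothetical extracted item, in the shape the adjusted field is expected to produce
sample_item = {
    "unique_words": [
        {"word": "カバー"},
        {"word": "ロードバイク"},
        {"word": " ワイヤーガード "},
    ]
}

print(flatten_words(sample_item))
```

Dropping the `h3:contains(...) + article` selector also avoids relying on a non-standard CSS pseudo-class, at the cost of depending on the badge class name staying stable.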