Sample E-Commerce HTML

We have a sample e-commerce HTML file on GitHub (example):
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html

This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure without an LLM.

schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes
    # from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # single sub-object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}

Key Takeaways:
Nested vs. List: type: "nested" means a single sub-object (like details). type: "list" means multiple items that are simple dictionaries or single text fields. type: "nested_list" means repeated complex objects (like products or reviews).
Base Fields: We can extract attributes from the container element via "baseFields". For instance, "data_cat_id" might be data-cat-id="elect123".
Transforms: We can also define a transform if we want to lower/upper case, strip whitespace, or even run a custom function.

Running the Extraction
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

async def main():
    raw_html = '...'
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
    # The strategy is passed through the run config
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=f"raw://{raw_html}",
            config=config
        )
        if not result.success:
            print("Crawl failed:", result.error_message)
            return
        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(main())
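To make the difference between "nested", "list", and "nested_list" concrete, here is a rough sketch of the shape one entry of the parsed data might take for the e-commerce schema. It is an illustration under assumptions, not captured output: the variable name example_category is hypothetical, every "<...>" value is a placeholder, and only "elect123" comes from the text above.

```python
# Hypothetical shape of one extracted category; all "<...>" values are placeholders.
example_category = {
    "data_cat_id": "elect123",            # baseFields: attribute on div.category
    "category_name": "<category name>",   # "text": a single string
    "products": [                          # "nested_list": repeated complex dicts
        {
            "name": "<product name>",
            "price": "<price>",
            "details": {                   # "nested": exactly one sub-object
                "brand": "<brand>",
                "model": "<model>",
            },
            "features": [                  # "list": one small dict per <li>
                {"feature": "<feature text>"},
            ],
            "reviews": [                   # "nested_list" inside a product
                {"reviewer": "<name>", "rating": "<rating>", "comment": "<text>"},
            ],
            "related_products": [          # "list" of simple dicts
                {"name": "<name>", "price": "<price>"},
            ],
        },
    ],
}

# Walking the structure: print each product name with its feature count.
for product in example_category["products"]:
    print(product["name"], len(product["features"]))
```

Note that under this reading even a "list" field yields dictionaries keyed by the sub-field name rather than bare strings.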
schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}

Then run with JsonCssExtractionStrategy(schema) to get an array of blog post objects, each containing "post_url", "title", "date", "summary", and "author".
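Following the instructions above, help me extract the following from this HTML:
Product Information: the image URL, goto_amazon (the href of the <a> tag with class btn-asinseed-link), and the main text content
Unique Words: all the text inside the article, as a list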
<div class="js-sticky-block" data-has-sticky-header="true" data-offset-target="#logoAndNav" data-sticky-view="lg" data-start-point="#stickyBlockStartPoint" data-end-point="#stickyBlockEndPoint" data-offset-top="32" data-offset-bottom="170">
<div id="div-asin-product-infor">
<h3 class="h5 text-asinseed-black font-weight-bold mb-4">Product Information</h3>
<article class="mb-5">
<div class="d-flex mb-1">
<div class="avatar-self-pic mr-3">
<div class="pop-url-imgs" style="background-image: url(https://m.media-amazon.com/images/I/41hY78XIaiL._AC_US600_.jpg)">
</div>
<img class="img-fluid rounded u-xl-avatar item-wh-6r" src="https://m.media-amazon.com/images/I/41hY78XIaiL._AC_US200_.jpg" alt="GOODCHI フレームプロテクター ブレーキケーブルプロテクター 耐摩耗性 柔らかくスパイラル パイプ保護 自転車用 保護スリーブ 10個入">
</div>
<div class="media-body">
<h4 class="h6 font-weight-normal mb-0">
<a href="https://www.amazon.co.jp/dp/B0CQ1SHD8V" class="small text-muted" target="_blank">GOODCHI</a><br>
GOODCHI フレームプロテクター ブレーキケーブルプロテクター 耐摩耗性 柔らかくスパイラル パイプ保護 自転車用 保護スリーブ 10個入<br>
<span class="small text-muted">B0CQ1SHD8V</span>
<a href="https://www.amazon.co.jp/dp/B0CQ1SHD8V" target="_blank" class="small btn-asinseed-link text-muted" data-toggle="tooltip" title="" data-original-title="View This Product on Amazon"><i class="iconfont icon-to_amazon small"></i></a>
</h4>
</div>
</div>
</article>
</div>
<div id="div-asin-variation">
<h3 class="h5 text-asinseed-black font-weight-bold mb-4">Variations</h3>
<article class="mb-5" data-animation="flash" data-animation-delay="800" data-animation-duration="1500">
<ul class="list-unstyled u-list" id="variation-parent-asin" data-asin="B0CQ1SHD8V">
<li class="u-variation-list__link ">
<a href="https://www.asinseed.com/en/JP/B0DHCWHMM6?utm_asin=B0CQ1SHD8V"><span class="far fa-dot-circle u-list__link-icon mr-1"></span>色: ブラック+レッド
<span class="badge badge-pill" data-asin="B0DHCWHMM6">2</span>
</a>
</li>
<li class="u-variation-list__link active ">
<a href="https://www.asinseed.com/en/JP/B0CQ1SHD8V?utm_asin=B0CQ1SHD8V"><span class="far fa-dot-circle u-list__link-icon mr-1"></span>色: ブラック
<span class="badge badge-pill" data-asin="B0CQ1SHD8V">12</span>
</a>
</li>
</ul>
</article>
</div>
<h3 class="h5 text-asinseed-black font-weight-bold mb-4">Unique Words
<i class="far fa-question-circle small" data-trigger="hover" role="button" tabindex="1" data-content="Unique Words is the minimal unit of keywords, keywords(user search terms) is composed of them.<br>You can put these words on your listing's title, search term, bullet points and description, let Amazon <br>think such keywords are your product profile, and give your more traffic.<br>We suggest after you review 10+ competitors, then begin to optimize your listing." data-html="true" data-toggle="popover" data-placement="top" data-container="body" data-original-title="" title=""></i>
<button class="btn btn-xs u-btn-asinseed keywords-copy-clipboard ml-4" title="" data-clipboard-text="カバー
ロードバイク
ワイヤーガード
プロテクター
フレームパッド
フレームプロテクター
キックガード
プロテクターカバー
車
キルト芯
ケーブルプロテクター
ハンドルカバー" type="button"><i class="ace-icon fa fa-copy bigger-110"></i> Copy to Clipboard</button>
</h3>
<article class="mb-5">
<!-- style="background-color: rgba(182, 214, 249, 0.1);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-5">カバー</span>
<!-- style="background-color: rgba(182, 214, 249, 0.1);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-4">ロードバイク</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-3">ワイヤーガード</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-3">プロテクター</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-2">フレームパッド</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-2">フレームプロテクター</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">キックガード</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">プロテクターカバー</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">車</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">キルト芯</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">ケーブルプロテクター</span>
<!-- style="background-color: rgba(182, 214, 249, 0.05);" -->
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1 high-frequence-word-level-1">ハンドルカバー</span>
</article>
<div id="div-introduce-video">
<h3 class="h5 text-asinseed-black font-weight-bold mb-4">AsinSeed Video</h3>
<article class="mb-5">
<a id="header-help-video-btn" style="outline: 0;">
<img src="https://www.asinseed.com/assets/images/video/introduce-video-en-20181122.png" alt="SVG Illustration" style="width: 300px;">
</a>
</article>
</div>
</div>
schema = {
    "name": "Product Details",
    "baseSelector": "div.js-sticky-block",
    "fields": [
        {
            "name": "product_info",
            "selector": "#div-asin-product-infor",
            "type": "nested",
            "fields": [
                {
                    "name": "image_url",
                    "selector": "div.avatar-self-pic img",
                    "type": "attribute",
                    "attribute": "src"
                },
                {
                    "name": "goto_amazon",
                    "selector": "a.btn-asinseed-link",
                    "type": "attribute",
                    "attribute": "href"
                },
                {
                    "name": "main_text",
                    "selector": "div.media-body h4",
                    "type": "text",
                    "transform": ["strip"]
                }
            ]
        },
        {
            "name": "unique_words",
            "selector": "h3:contains('Unique Words') + article",
            "type": "list",
            "fields": [
                {
                    "name": "word",
                    "selector": "span.badge-asinseed-keywords-weight",
                    "type": "text"
                }
            ]
        }
    ]
}
This did not extract correctly: {...'unique_words': [{'word': 'カバー'}]}
In theory it should be a list of strings, since everything under the article is a span tag.
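One plausible explanation, assuming a "list" field uses its own selector to find the repeated elements and then takes only the first match of each sub-field selector inside each of them: the selector above matches the single article, so exactly one item comes back, holding the first span. Under that assumption, a possible revision is to point the list selector at the spans themselves and drop the sub-field selector, as the "features" field does in the e-commerce schema. The sketch below shows such a field as plain data and uses only the standard library to cross-check which words the spans contain. SAMPLE is a trimmed copy of the article, and SpanTextCollector is an illustrative helper, not part of crawl4ai.

```python
from html.parser import HTMLParser

# Hypothetical revision of the "unique_words" field: the list selector targets
# each <span>, and the sub-field has no selector, so it reads the matched element.
unique_words_field = {
    "name": "unique_words",
    "selector": "h3:contains('Unique Words') + article span.badge-asinseed-keywords-weight",
    "type": "list",
    "fields": [{"name": "word", "type": "text"}],
}

# Trimmed copy of the <article> from the page, used only for the cross-check.
SAMPLE = """
<article class="mb-5">
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1">カバー</span>
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1">ロードバイク</span>
<span class="badge badge-pill badge-asinseed-keywords-weight mb-1">ワイヤーガード</span>
</article>
"""


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> carrying the keyword-badge class."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "badge-asinseed-keywords-weight" in classes:
            self._inside = True
            self.words.append("")  # start a new word buffer

    def handle_data(self, data):
        if self._inside:
            self.words[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


collector = SpanTextCollector()
collector.feed(SAMPLE)
words = [w.strip() for w in collector.words]
print(words)
```

With the revised field, the extraction would presumably return one {'word': ...} dictionary per span. If a flat list of strings is wanted, it can be built afterwards with a comprehension such as [item["word"] for item in unique_words].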