Sample E-Commerce HTML

We have a sample e-commerce HTML file on GitHub (example):
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html

This snippet includes categories, products, features, reviews, and related items. Let's see how to define a schema that fully captures that structure without an LLM.

schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes
    # from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # single sub-object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}

Key Takeaways:

Nested vs. List:
- type: "nested" means a single sub-object (like details).
- type: "list" means multiple items that are simple dictionaries or single text fields.
- type: "nested_list" means repeated complex objects (like products or reviews).

Base Fields: We can extract attributes from the container element via "baseFields". For instance, "data_cat_id" might be data-cat-id="elect123".

Transforms: We can also define a transform if we want to lower/upper case, strip whitespace, or even run a custom function.

Running the Extraction
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

raw_html = '...'

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
    # The strategy is attached to the run through CrawlerRunConfig
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=f"raw://{raw_html}",  # raw:// passes the HTML string directly
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
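
To make the nested / list / nested_list distinction from the Key Takeaways concrete, here is a minimal stdlib-only sketch of the output shape such a schema is meant to produce. It does not use crawl4ai at all: it walks a small, hypothetical, well-formed HTML fragment with xml.etree.ElementTree, and the sample values (Electronics, Phone, Acme, and so on) and the helper names text_of and extract_category are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that mirrors the schema's class names.
SAMPLE = """
<div class="category" data-cat-id="elect123">
  <h2 class="category-name">Electronics</h2>
  <div class="product">
    <h3 class="product-name">Phone</h3>
    <p class="product-price">$499</p>
    <div class="product-details">
      <span class="brand">Acme</span>
      <span class="model">X1</span>
    </div>
    <ul class="product-features">
      <li>5G</li>
      <li>OLED</li>
    </ul>
  </div>
</div>
"""


def text_of(node, path):
    """Return the stripped text of the first match under node, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def extract_category(root):
    """Build one category dict shaped like the schema's output."""
    category = {
        # baseFields-style attribute read from the container element
        "data_cat_id": root.get("data-cat-id"),
        "category_name": text_of(root, "./h2[@class='category-name']"),
        # "nested_list": a list of complex sub-objects
        "products": [],
    }
    for product in root.findall("./div[@class='product']"):
        category["products"].append({
            "name": text_of(product, "./h3[@class='product-name']"),
            "price": text_of(product, "./p[@class='product-price']"),
            # "nested": exactly one sub-object
            "details": {
                "brand": text_of(product, ".//span[@class='brand']"),
                "model": text_of(product, ".//span[@class='model']"),
            },
            # "list": multiple simple items
            "features": [
                {"feature": (li.text or "").strip()}
                for li in product.findall("./ul[@class='product-features']/li")
            ],
        })
    return category


category = extract_category(ET.fromstring(SAMPLE.strip()))
```

The point of the sketch is only the shape: "details" comes back as one dict, "features" as a list of small dicts, and "products" as a list of full sub-objects. The real strategy's output for the gist may differ in details such as whitespace handling.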

A flatter example is a schema for blog post cards, where baseFields captures each card's href:

schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}

Then run it with JsonCssExtractionStrategy(schema) to get an array of blog post objects, each containing "post_url", "title", "date", "summary", and "author".
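
As a rough illustration of how the "default" values and the href base field in the Blog Posts schema play out, the following stdlib-only sketch extracts two hypothetical post cards, one of which has no title. It is not crawl4ai's implementation: the sample markup, the FIELDS table, and extract_posts are assumptions made for this example, and the sketch treats empty text the same as a missing element, which the library may not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical cards; the second one lacks a title, date, and author.
SAMPLE = """
<div>
  <a class="blog-post-card" href="/posts/first">
    <h2 class="post-title">First Post</h2>
    <time class="post-date">2024-01-01</time>
    <p class="post-summary">A short summary.</p>
    <span class="post-author">Ada</span>
  </a>
  <a class="blog-post-card" href="/posts/second">
    <p class="post-summary">Card without a title.</p>
  </a>
</div>
"""

# (field name, tag, class, default), mirroring the Blog Posts schema
FIELDS = [
    ("title", "h2", "post-title", "No Title"),
    ("date", "time", "post-date", ""),
    ("summary", "p", "post-summary", ""),
    ("author", "span", "post-author", ""),
]


def extract_posts(xml_text):
    """Return one dict per card, falling back to defaults for missing fields."""
    root = ET.fromstring(xml_text.strip())
    posts = []
    for card in root.findall(".//a[@class='blog-post-card']"):
        # baseFields-style attribute taken from the card element itself
        post = {"post_url": card.get("href")}
        for name, tag, css_class, default in FIELDS:
            node = card.find(f".//{tag}[@class='{css_class}']")
            text = (node.text or "").strip() if node is not None else ""
            post[name] = text or default
        posts.append(post)
    return posts


posts = extract_posts(SAMPLE)
```

Every object ends up with the same five keys, so downstream code can index into posts without checking which fields were present on the page.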
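
You must use the XPath strategy. Following the documentation above, help me extract all the text from the HTML. You should notice the pattern: the text sits in elements like these: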
<div class="autocomplete-results-container" id="sac-autocomplete-results-container" role="grid"><div class="two-pane-results-container" role="rowgroup"><div class="left-pane-results-container" style="flex: 1 1 0%; height: auto;"><div id="sac-suggestion-row-1" role="row" aria-rowindex="1" aria-owns="sac-suggestion-row-1-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-1-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン">パソコン</div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-2" role="row" aria-rowindex="2" aria-owns="sac-suggestion-row-2-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-2-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース">パソコン<span class="s-heavy">ケース</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-3" role="row" aria-rowindex="3" aria-owns="sac-suggestion-row-3-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-3-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンスタンド">パソコン<span class="s-heavy">スタンド</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-4" role="row" aria-rowindex="4" 
aria-owns="sac-suggestion-row-4-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-4-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンデスク">パソコン<span class="s-heavy">デスク</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-5" role="row" aria-rowindex="5" aria-owns="sac-suggestion-row-5-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-5-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン ノート">パソコン<span class="s-heavy"> ノート</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-6" role="row" aria-rowindex="6" aria-owns="sac-suggestion-row-6-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-6-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン台 卓上">パソコン<span class="s-heavy">台 卓上</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-7" role="row" aria-rowindex="7" aria-owns="sac-suggestion-row-7-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-7-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース 14インチ">パソコン<span 
class="s-heavy">ケース 14インチ</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-8" role="row" aria-rowindex="8" aria-owns="sac-suggestion-row-8-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-8-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンラック">パソコン<span class="s-heavy">ラック</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-9" role="row" aria-rowindex="9" aria-owns="sac-suggestion-row-9-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-9-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン モニター">パソコン<span class="s-heavy"> モニター</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background s-suggestion-icon-background-grey-shield"></div></div></div></div><div id="sac-suggestion-row-10" role="row" aria-rowindex="10" aria-owns="sac-suggestion-row-10-cell-1"><div class="s-suggestion-container" id="sac-suggestion-row-10-cell-1" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース 13.3インチ">パソコン<span class="s-heavy">ケース 13.3インチ</span></div><div class="icon-suggestion-div search-icon-div" suggestion-icon-div="true"><i class="icon-search-suggestion s-suggestion-icon-left" suggestion-icon="true"></i><div class="s-sugg-icon-background 
s-suggestion-icon-background-grey-shield"></div></div></div></div></div><div class="right-pane-results-container" style="display:none"></div></div><div class="status-message-container" role="status">Prefix パソコン, 10 individual suggestions</div></div>
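
One possible answer, sketched under assumptions: each suggestion lives in a role="gridcell" container, and its text is in the role="button" div, sometimes split across a child span with class s-heavy. The schema below follows the same shape as the CSS schemas above but with XPath selectors, presumably for crawl4ai's JsonXPathExtractionStrategy; that mapping, the schema name, and the field names "text" and "label" are assumptions. Because how the library joins text across child elements is not confirmed here, the schema also reads aria-label, which carries the full suggestion. The check itself uses only xml.etree.ElementTree on two rows shortened from the HTML above.

```python
import xml.etree.ElementTree as ET

# Candidate schema for an XPath-based strategy (field names are illustrative).
suggestion_schema = {
    "name": "Search Suggestions",
    "baseSelector": "//div[@role='gridcell']",
    "fields": [
        {"name": "text", "selector": ".//div[@role='button']", "type": "text"},
        {"name": "label", "selector": ".//div[@role='button']",
         "type": "attribute", "attribute": "aria-label"},
    ],
}

# Two rows shortened from the HTML above (icon markup omitted).
SAMPLE = """
<div class="left-pane-results-container">
  <div id="sac-suggestion-row-1" role="row">
    <div class="s-suggestion-container" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコン">パソコン</div></div>
  </div>
  <div id="sac-suggestion-row-2" role="row">
    <div class="s-suggestion-container" role="gridcell"><div class="s-suggestion s-suggestion-ellipsis-direction" role="button" aria-label="パソコンケース">パソコン<span class="s-heavy">ケース</span></div></div>
  </div>
</div>
"""


def extract_suggestions(xml_text, schema):
    """Apply the schema's selectors with ElementTree's limited XPath support."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # ElementTree needs a relative path, so "//..." becomes ".//..."
    for cell in root.findall("." + schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            node = cell.find(field["selector"])
            if node is None:
                continue
            if field["type"] == "attribute":
                row[field["name"]] = node.get(field["attribute"])
            else:
                # itertext() joins text that is split across child <span>s
                row[field["name"]] = "".join(node.itertext()).strip()
        rows.append(row)
    return rows


suggestions = extract_suggestions(SAMPLE, suggestion_schema)
```

With the real library, the same dictionary would be passed to the XPath strategy in place of the CSS one. The selectors here stick to exact attribute matches so that they also work in ElementTree; a class-based selector using contains(@class, 's-suggestion') would need a full XPath engine such as lxml.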