This page provides the Product Corpus V.2020 for public download. The corpus is extracted from the December 2020 WDC schema.org Product Microdata and JSON-LD subsets. In comparison to our previously published Product Data Corpus which contains data from 2017, the current corpus is 4 times larger and covers up-to-date products from 2020. The current version of the WDC Product Data Corpus consists of more than 98 million product offers originating from 603 thousand websites. Grouping the offers based on the co-occurrence of their annotated product identifier values, such as GTINs and MPNs, results in more than 7.1 million groups of size two or larger.
News
- 2023-01-07: We have released a categorization for a subset of the corpus as part of a student master thesis and the WDC Products benchmark. More information and download links for both artifacts can be found here and here.
- 2021-08-23: Product Corpus version 2020 published.
Contents
- 1. Motivation
- 2. Adoption of schema.org Annotations by E-Shops
- 3. Corpus Creation
- 4. Corpus Profiling
- 5. Corpus Evaluation
- 6. Download
- 7. Feedback
- 8. References
1. Motivation
Many e-shops have started to mark up offers within HTML pages using schema.org annotations. In recent years, many of these e-shops have also started to annotate product identifiers within their pages, such as schema.org/Product/sku, gtin8, gtin13, gtin14, and mpn. These identifiers allow offers for the same product from different e-shops to be grouped into clusters and can thus be used as supervision for training matching methods. In our previous work [1] we exploited this source of supervision and published the largest publicly available corpus for entity matching, which was extracted from the WDC 2017 schema.org Microdata Product Corpus. Given the considerable increase in the adoption of product-related annotations, we improved our cleansing workflow and publish a new version of the WDC Product Corpus, which is extracted from the WDC 2020 schema.org Microdata and JSON-LD Product corpora.
The corpus consists of 98.9 million offers which we group into clusters based on the co-occurrence of their annotated product identifier values. The grouping results in 7.1 million clusters of size two or larger. 1.5 million of these clusters have a size larger than three while 670 thousand have a size larger than five.
In the following, we first provide some statistics about the adoption of schema.org annotations in the domain of e-commerce given the 2020 MD and JSON-LD schema.org Product Corpora. Next, we describe the data cleaning steps that were applied to derive the corpus from the December 2020 version of the WDC schema.org/Product MD and JSON-LD corpus. Finally, we present some statistics about the corpus and the results of manual inspection to estimate its quality.
2. Adoption of Schema.org Annotations by E-Shops - 2020 Statistics
We use the December 2020 WDC MD and JSON-LD schema.org/Product corpus as the starting point for building the corpus. The table below provides general statistics about the corpus. Figure 1 shows the number of websites (PLDs) in the corpus that use specific schema.org properties for describing product offers and compares it with the absolute number of PLDs in 2017. Figure 2 shows the number of offers in the corpus that contain identifiers and compares it with the absolute number of offers in 2017. We consider the following schema.org/Product and schema.org/Offer properties as identifier-related properties: sku, mpn, identifier, productID, gtin14, gtin13, gtin12, and gtin8. We observe considerable growth in the use of both general product-related schema.org properties and identifying product-related properties.
December 2020 schema.org/Product corpus statistics
Data Size | 284.19 GB | (compressed) |
---|---|---|
Quads | 17,904,749,043 | |
Domains | 2,476,579 | |
Related Classes (#Entities) | | |
#Entities (Product/Offer) with at least one ID property | | |
#Distinct ID values | | |
3. Corpus Creation
We apply an enhanced version of the cleansing pipeline of our previous work in order to identify and eliminate common annotation errors. Below we present the cleansing pipeline and provide details on the evaluation of the newly introduced cleansing steps, which we mark with **new**.
Filtering by identifier value length
We normalize the identifier values by removing common prefixes, such as sku or id, and any non-alphanumeric characters. Additionally, identifiers shorter than 8 or longer than 25 characters, as well as purely textual values, are removed.
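The normalization and filtering rules above could be sketched as follows. This is an illustrative sketch, not the pipeline's actual code: the prefix list contains only the two examples named in the text, the lowercasing and the rule that a prefix is stripped only when it is not followed by another letter are assumptions, and `normalize_identifier` is a hypothetical name.

```python
import re

MIN_LEN, MAX_LEN = 8, 25  # length bounds stated in the text

# Assumption: only the two example prefixes from the text are handled, and a
# prefix is stripped only when the next character is not a letter.
PREFIX_PATTERN = re.compile(r"^(?:sku|id)(?=[^a-z])")


def normalize_identifier(raw):
    """Return the cleaned identifier, or None if it should be discarded."""
    value = raw.strip().lower()  # lowercasing is an assumption
    value = PREFIX_PATTERN.sub("", value)
    # Remove every non-alphanumeric character.
    value = re.sub(r"[^0-9a-z]", "", value)
    if not (MIN_LEN <= len(value) <= MAX_LEN):
        return None  # too short or too long
    if value.isalpha():
        return None  # purely textual value
    return value
```

For example, `normalize_identifier("SKU: 0123-4567-89")` yields `"0123456789"`, while a seven-digit value or a purely textual value is dropped.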
Resulting data after identifier value filtering
Distinct identifier values | 118,348,448 |
# Entities with ID-related annotations (at least one) | 200,947,702 |
**new** Dealing with pages having more than one annotated offer
When multiple entities occur under one URL, the page can either be a listing page, a page with a main entity and product recommendations, or a product variation page. In this part of the cleansing pipeline we want to identify and preserve product variations and main entities while product recommendations and listing pages are dropped considering that their short descriptions are not informative enough for product matching.
Product Variation Detection
The schema.org vocabulary offers the property s:Product/offers for marking variations, e.g. colour variations, of a product offer. Still, not all websites follow this annotation practice. We therefore implement a heuristic for categorizing an entity as an offer variation. The heuristic is based on the idea that variations of a product have very similar identifiers and descriptive attributes. More concretely, we compute the normalized Levenshtein similarity for all identifiers and descriptive properties of the offers found on a page. If the sum of the identifier similarity and the descriptive similarity is greater than 1.5, the entities are marked as variations. When performed on the whole corpus, this step marks 10.6 million entities as product variations.
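The variation heuristic can be illustrated with a small sketch. It assumes a pairwise comparison of two offers that each carry a single `identifier` and a single `description` string; these field names, the pairwise setup, and the function names are hypothetical, since the text does not specify how several properties per offer are combined. Only the normalized Levenshtein similarity and the 1.5 threshold come from the description above.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def are_variations(offer_a, offer_b, threshold=1.5):
    """Mark two offers as variations if the summed similarity exceeds 1.5."""
    id_sim = similarity(offer_a["identifier"], offer_b["identifier"])
    desc_sim = similarity(offer_a["description"], offer_b["description"])
    return id_sim + desc_sim > threshold
```

Two offers that differ only in a colour suffix of the identifier and a colour word in the description would pass the threshold, whereas offers for unrelated products would not.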
Main Entity Detection
Apart from detecting variations, another vital task is to distinguish the main product entity from recommended products that might appear on the same URL. We apply a main entity detection heuristic which is based on the assumption that the main entity should have a much longer description than every other entity on the page. The heuristic determines the entity with the longest description and the mean description length of all other entities. If the longest description is 2.5 times longer than the mean length of the rest, the corresponding entity is identified as the main entity. This step detects 250 thousand main entities and deletes 817 thousand entities.
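A minimal sketch of this length-based heuristic is given below. It reads "2.5 times longer" as "more than 2.5 times the mean length of the other descriptions", which is an interpretation; the function name and the choice to return an index are hypothetical.

```python
def detect_main_entity(descriptions, factor=2.5):
    """Return the index of the main entity on a page, or None if none stands out.

    `descriptions` holds one description string per entity found on the page.
    """
    if len(descriptions) < 2:
        return None  # nothing to compare against
    lengths = [len(text) for text in descriptions]
    main_idx = max(range(len(lengths)), key=lengths.__getitem__)
    others = lengths[:main_idx] + lengths[main_idx + 1:]
    mean_others = sum(others) / len(others)
    # Assumed reading of the rule: longest > factor * mean of the rest.
    if lengths[main_idx] > factor * mean_others:
        return main_idx
    return None
```

On a page with one 500-character description and two short recommendation snippets, the first entity would be returned as the main entity; when all descriptions have similar lengths, no main entity is reported.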
Listing Pages Detection
A major challenge when working with the product corpus is distinguishing between listing pages and detail pages. Listing pages are not desired because they do not provide comprehensive information about a product and can overlap with the offers found on the corresponding detail pages. A common source of errors in this step is confusing a listing page with a product variation page or a main product page, which is why the previously described steps are performed first. The listing page heuristic uses the annotations produced by these steps (variation/main entity) to differentiate the cases. In addition, the heuristic considers the ItemPage annotation to safely identify a non-listing page. Finally, the heuristic classifies a page as a listing page when the page has more than two entities and is not an item or variation page. This step identifies 2.7 million pages as listing pages and filters out 51.5 million entities.
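The decision rule can be summarized in a few lines. The sketch below assumes that the earlier steps have already produced three boolean flags per page; the function and parameter names are hypothetical, and treating a detected main entity as a reason not to classify the page as a listing is an assumption drawn from the statement that the heuristic uses the variation and main entity annotations.

```python
def is_listing_page(entity_count, is_item_page, has_variations, has_main_entity):
    """Classify a page as a listing page.

    entity_count    -- number of annotated entities on the page
    is_item_page    -- page carries a schema.org ItemPage annotation
    has_variations  -- entities were marked as product variations
    has_main_entity -- a main entity was detected (assumed to exclude a listing)
    """
    if is_item_page or has_variations or has_main_entity:
        return False
    return entity_count > 2
```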
Resulting data after dealing with pages having more than one annotated offer
Distinct identifier values | 103,094,323 |
Offer Entities with ID-related annotations (at least one) | 149,131,925 |
**new** Filtering by identifier value occurrence
In the final step of the cleansing process we detect websites that use the same identifier value to annotate all their offers, likely due to an error in the script generating the pages. Additionally, we detect and remove example product offers which contain generic textual descriptions such as "I am a product". Such example product descriptions were found to exist in 19.77% of all websites in the corpus. In total, 49.7 million offers are detected as having erroneous repetitive identifier values and are removed.
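As a rough illustration of the first check, the sketch below flags websites whose offers all carry one and the same identifier value. The input shape (pairs of website and identifier), the `min_offers` parameter, and the function name are hypothetical; the text does not state how many offers a website needs before the check applies.

```python
from collections import defaultdict


def sites_with_single_identifier(offers, min_offers=2):
    """Return websites that annotate all of their offers with one identifier value.

    `offers` is an iterable of (website, identifier_value) pairs.
    `min_offers` is an assumed lower bound so that single-offer sites are kept.
    """
    values_per_site = defaultdict(set)
    offers_per_site = defaultdict(int)
    for site, identifier in offers:
        values_per_site[site].add(identifier)
        offers_per_site[site] += 1
    return {
        site
        for site, values in values_per_site.items()
        if len(values) == 1 and offers_per_site[site] >= min_offers
    }
```

Offers from the returned websites would then be removed from the corpus.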
Resulting data after frequent id-values removal
Distinct identifier values | 88,354,037 |
Offer Entities with ID-related annotations (at least one) | 99,371,133 |
Removal of category identifiers
We note that some websites use identifiers referring to product categories, such as UNSPSC numbers, as identifiers of single products. We detect those cases and remove the corresponding identifier values with the heuristic described here.
Resulting data after category identifiers removal
Distinct identifier values | 88,353,678 |
Offer Entities with ID-related annotations (at least one) | 98,900,648 |
ID-Clusters creation
In this step we group the offer entities into ID-Clusters using their identifiers. Since single offers may contain multiple alternative identifiers, we use this information to merge clusters referring to the same product. This results in 80,478,480 ID-Clusters.
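Merging clusters through shared identifiers amounts to finding connected components in a graph of offers and identifier values. The sketch below uses a union-find structure for this; it is one common way to implement such a grouping and is not claimed to be the method used for the corpus. The input shape and the function name are hypothetical.

```python
def build_id_clusters(offers):
    """Group offers that are connected through shared identifier values.

    `offers` maps an offer id to an iterable of its identifier values.
    Returns a dict mapping each offer id to an integer cluster label.
    """
    parent = {}

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for offer_id, identifiers in offers.items():
        offer_node = ("offer", offer_id)
        parent.setdefault(offer_node, offer_node)
        for value in identifiers:
            id_node = ("id", value)
            parent.setdefault(id_node, id_node)
            union(offer_node, id_node)

    labels = {}
    clusters = {}
    for offer_id in offers:
        root = find(("offer", offer_id))
        clusters[offer_id] = labels.setdefault(root, len(labels))
    return clusters
```

In this sketch, an offer annotated with both a GTIN and an MPN links every offer that carries the same GTIN to every offer that carries the same MPN, so all of them end up in one cluster.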
Resulting data after grouping
Offer Entities with ID-related annotations (at least one) | 98,900,648 |
ID-clusters | 80,478,480 |
Distinct PLDs | 603,545 |
4. Corpus Profiling
Below we provide profiling information on the curated Product Corpus and compare it to the product corpus of 2017. We present the absolute and relative number of offers in the corpus having specific schema.org properties which can be used for matching (Figure 3). In Table 1 we present the distribution of offers per URL, which reveals that 92% of the offers in the corpus come from different webpages, while 97% of the offers appear on pages together with at most 4 other offers. This suggests that the extracted offers are most likely the main product entities on the page, while recommendations and listings have been successfully removed during the cleansing process. Finally, Figures 4 and 5 show the distribution of offer entities and positive pairs per cluster size and compare them to those of the 2017 Large-Scale Product Corpus.
# Offers | # URLS | % URLS |
---|---|---|
[1] | 56,674,688 | 92.25% |
[2-5] | 2,894,272 | 4.71% |
5. Corpus Evaluation
In order to assess the quality of the resulting clusters, pairs of offers belonging to the same cluster were randomly sampled across the whole corpus and manually checked to verify whether they indeed represent the same real-world product entity. A two-fold evaluation was conducted: one for small clusters of size ≤80 offers and one for larger clusters of size >80 offers. For each evaluation group, 1000 pairs of offers were sampled across all clusters. The small clusters were found to contain 6.9% noise while the large clusters are estimated to contain 1.8% noise. Despite the high quality of large clusters, it was observed during evaluation that most offers belonging to large clusters had very similar or even identical attribute values. This may originate from redirects, i.e. the same page was crawled many times following different URL paths inside the same website, or from duplicated content across different e-shops. Therefore, the large clusters of the corpus need to be used with caution as they may not be informative enough for training good matchers.
6. Download
We offer the WDC Product Data Corpus - V.2020 for public download in JSON format. To ease the loading of the corpus, we provide it split into 359 gzipped JSON files of size <300MB. Each JSON file contains the following fields for each offer:
- node_id: generated
- url: The webpage from which the product entity originates. Together with the node_id, it can be used as a unique identifier.
- cluster_id: generated. The cluster to which a product entity has been grouped.
- schema.org identifiers: sku, productID, mpn, identifier, gtin14, gtin13, gtin12, gtin8, gtin
- schema.org product related properties: name, description, price, priceCurrency, brand, manufacturer, title
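Reading one of the split files could look like the sketch below. It assumes that each gzipped file holds one JSON object per line; if a file instead contains a single JSON array, `json.load` on the opened handle would be needed. The function names and any file path passed in are hypothetical; only the field names `node_id`, `url`, and `cluster_id` are taken from the list above.

```python
import gzip
import json


def iter_offers(path):
    """Yield one offer dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def group_by_cluster(offers):
    """Map each cluster_id to the (url, node_id) keys of its offers."""
    clusters = {}
    for offer in offers:
        key = (offer["url"], offer["node_id"])
        clusters.setdefault(offer["cluster_id"], []).append(key)
    return clusters
```

Streaming the file line by line keeps memory usage low, which matters when iterating over all 359 parts.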
Additionally, we map the offers of the WDC Product Data Corpus - V.2020 to the table corpus and offer the mappings in separate JSON files. Because different preprocessing steps were applied when creating the two corpora, i.e. the table corpus and the WDC Product Data Corpus - V.2020, only 53.2% (52.6M) of the offers could be mapped to rows in the tables. More concretely, for the creation of the WDC Product Data Corpus V.2020, entities of both types schema.org:Offer and schema.org:Product were considered, provided that they contain some product identifier-related property. The product subset of the table corpus, on the other hand, contains solely entities of type schema.org:Product. Therefore, the Offer entities of the WDC Product Data Corpus cannot be mapped to the table corpus. Further missing mappings derive from the different strategies followed for removing listing pages: for the construction of the table corpus, all product entities containing fewer than three schema.org properties were removed, while listing pages and product ads were identified based on a textual length heuristic. In contrast, the listing and ads detection method of the WDC Product Data Corpus V.2020 uses a different threshold for the textual length heuristic as well as certain schema.org properties for identifying main entities on a page, as described above in Section 3. Each JSON mapping file contains the following fields:
- table_id: Name of the corresponding table file of the table corpus
- row_id: The row id of the corresponding table of the table corpus
- url: The webpage from which the product entity of the specific row originates.
- cluster_id: generated. The cluster to which the row of that table has been grouped.
File | Size |
WDC LSPC V2020 (Sample) | 16KB |
WDC LSPC V2020 | 20GB (compressed folder with 359 files) |
Map to Table Corpus | 2.7GB (compressed folder with 359 files) |
7. Feedback
Please send questions and feedback to the Web Data Commons Google Group.
More information about Web Data Commons is found here.
8. References
- Primpeli, A., Peeters, R., & Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of the 2019 World Wide Web Conference. pp. 381-386 ACM (2019).
- Peeters, R., Bizer, C., & Glavaš, G.: Intermediate training of BERT for product matching (2020).
- Mudgal, S. et al.: Deep Learning for Entity Matching: A Design Space Exploration. In: Proceedings of the 2018 International Conference on Management of Data. pp. 19–34 ACM (2018).
- Qiu, D. et al.: Dexter: large-scale discovery and extraction of product specifications on the web. Proceedings of the VLDB Endowment. 8, 13, 2194–2205 (2015).
- Köpcke, H. et al.: Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment. 3, 1–2, 484–493 (2010).