Class-Specific Subsets of the Schema.org Data contained in the October 2024 Corpus

This page provides access to, and statistics about, class-specific subsets of the Schema.org data contained in the October 2024 version of the Web Data Commons Microdata and JSON-LD corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.

Introduction

As many users are only interested in specific types of Schema.org data (like product data, event data, job postings, or data describing local businesses), we have created class-specific subsets out of the complete and merged Microdata and JSON-LD corpora for a selection of Schema.org classes. Each subset contains all instances of a specific class in either format as well as all other data found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted. To facilitate downloading and accessing the class-specific data, we provide the Schema.org subsets in chunks. Each chunk contains the quads of specific pay-level domains (PLDs), i.e. all quads of one PLD, e.g. yummly.com, are organized within the same chunk file. Additionally, we provide lookup files containing the mappings between PLDs and their corresponding chunks as well as CSV files with PLD-specific statistics.
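To illustrate the quad layout described above, here is a minimal Python sketch that splits a quad into its four elements and groups the statements by source webpage. It is not the Web Data Commons tooling: the regular expression is a simplification of the N-Quads grammar, the names `parse_quad` and `group_by_page` are hypothetical, and the two sample lines are invented for illustration.

```python
import re
from collections import defaultdict

# Simplified pattern for one quad: subject, predicate, object, graph, final dot.
# Assumption: IRIs contain no whitespace or angle brackets; this is not a full
# N-Quads parser and may reject or misread unusual lines.
IRI = r'<[^<>\s]+>'
QUAD_RE = re.compile(
    rf'^\s*(?P<subject>{IRI}|_:\S+)\s+'
    rf'(?P<predicate>{IRI})\s+'
    rf'(?P<object>.+?)\s+'
    rf'(?P<graph>{IRI})\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, page_url), or None if the line does not match."""
    match = QUAD_RE.match(line)
    if match is None:
        return None
    # The fourth element is the URL of the page the data was extracted from;
    # strip the surrounding angle brackets.
    page_url = match['graph'][1:-1]
    return (match['subject'], match['predicate'], match['object'], page_url)


def group_by_page(lines):
    """Collect the (subject, predicate, object) triples of each source webpage."""
    pages = defaultdict(list)
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            subject, predicate, obj, page_url = quad
            pages[page_url].append((subject, predicate, obj))
    return dict(pages)


# Invented sample lines (not taken from the corpus).
sample = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Recipe> <https://example.com/cake> .',
    '_:b0 <http://schema.org/name> "Chocolate cake"@en <https://example.com/cake> .',
]
pages = group_by_page(sample)
print(len(pages['https://example.com/cake']))
```

Because all quads of one PLD are stored in the same chunk, a grouping like this can be run chunk by chunk without statements of one page being spread over several files.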

Please note that:

If you want to refer to the datasets in your scientific publications, please cite the following poster: The Web Data Commons Schema.org Data Set Series by Alexander Brinkmann, Anna Primpeli and Christian Bizer in Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.

Class-Specific Subsets of the Schema.org Data

Columns: Schema.org Subset; General Stats; Related Classes; Size (# Files); Download (Sample); PLD to File Look-up; PLD-Specific Stats
AdministrativeArea Quads: 96,086,119
URLs: 521,585
Hosts: 4,933
http://schema.org/ListItem (1,499,751)
http://schema.org/ImageObject (1,454,619)
http://schema.org/AdministrativeArea (1,301,571)
http://schema.org/Person (976,876)
http://schema.org/PostalAddress (966,279)
1.25 GB
(8)
AdministrativeArea (sample) lookup_file
pld_stats_file
Airport Quads: 53,684,719
URLs: 173,702
Hosts: 1,003
http://schema.org/Airport (3,562,733)
http://schema.org/GeoCoordinates (2,546,608)
http://schema.org/Flight (1,331,733)
http://schema.org/Airline (1,258,369)
http://schema.org/Offer (1,139,954)
490.76 MB
(5)
Airport (sample) lookup_file
pld_stats_file
Answer Quads: 1,617,417,253
URLs: 14,298,778
Hosts: 414,222
http://schema.org/Answer (60,188,640)
http://schema.org/Question (51,845,095)
http://schema.org/ListItem (32,575,842)
https://schema.org/Answer (22,038,464)
http://schema.org/ImageObject (20,709,476)
29.69 GB
(126)
Answer (sample) lookup_file
pld_stats_file
Book Quads: 249,603,999
URLs: 4,208,106
Hosts: 18,993
http://schema.org/Book (10,291,224)
http://schema.org/Country (6,776,472)
http://schema.org/Person (5,755,257)
http://schema.org/Offer (3,590,467)
http://schema.org/ListItem (3,350,195)
4.35 GB
(20)
Book (sample) lookup_file
pld_stats_file
City Quads: 235,105,383
URLs: 1,156,025
Hosts: 16,149
http://schema.org/City (5,772,799)
http://schema.org/ImageObject (4,144,523)
http://schema.org/Person (4,069,973)
http://schema.org/PostalAddress (3,790,692)
http://schema.org/OpeningHoursSpecification (2,991,182)
2.62 GB
(19)
City (sample) lookup_file
pld_stats_file
ClaimReview Quads: 3,919,715
URLs: 49,708
Hosts: 343
http://schema.org/Organization (123,301)
http://schema.org/ImageObject (95,783)
http://schema.org/ListItem (93,535)
http://schema.org/Person (66,621)
http://schema.org/ClaimReview (59,710)
69.39 MB
(1)
ClaimReview (sample) lookup_file
pld_stats_file
CollegeOrUniversity Quads: 112,777,803
URLs: 1,001,649
Hosts: 5,121
http://schema.org/ImageObject (4,911,790)
http://schema.org/CollegeOrUniversity (3,892,042)
http://schema.org/Person (3,167,877)
http://schema.org/PostalAddress (2,714,298)
http://schema.org/GeoCoordinates (1,995,606)
1.25 GB
(9)
CollegeOrUniversity (sample) lookup_file
pld_stats_file
Continent Quads: 759,731
URLs: 6,752
Hosts: 66
http://schema.org/City (57,883)
http://schema.org/AdministrativeArea (42,597)
http://schema.org/Country (10,423)
http://schema.org/Continent (7,337)
http://schema.org/GeoCoordinates (5,692)
9.5 MB
(1)
Continent (sample) lookup_file
pld_stats_file
Country Quads: 950,481,115
URLs: 7,110,847
Hosts: 35,296
http://schema.org/Country (31,979,996)
http://schema.org/ListItem (23,422,340)
http://schema.org/Organization (15,663,556)
http://schema.org/PostalAddress (11,083,956)
http://schema.org/Offer (11,007,053)
12.7 GB
(73)
Country (sample) lookup_file
pld_stats_file
CreativeWork Quads: 2,064,113,912
URLs: 45,276,024
Hosts: 1,325,636
https://schema.org/CreativeWork (80,071,257)
https://schema.org/SiteNavigationElement (55,892,203)
https://schema.org/Person (40,215,674)
https://schema.org/WPHeader (32,248,539)
https://schema.org/WPFooter (30,340,361)
84.08 GB
(160)
CreativeWork (sample) lookup_file
pld_stats_file
Dataset Quads: 58,627,800
URLs: 694,158
Hosts: 2,024
http://schema.org/DataDownload (2,584,058)
http://schema.org/Dataset (1,559,803)
http://schema.org/Organization (1,056,210)
http://schema.org/PropertyValue (744,074)
http://schema.org/Person (737,459)
844.09 MB
(5)
Dataset (sample) lookup_file
pld_stats_file
EducationalOrganization Quads: 67,328,226
URLs: 830,258
Hosts: 11,630
http://schema.org/EducationalOrganization (1,393,334)
http://schema.org/ListItem (1,202,342)
http://schema.org/ImageObject (983,689)
http://schema.org/PostalAddress (955,404)
http://schema.org/Person (627,438)
1.04 GB
(6)
EducationalOrganization (sample) lookup_file
pld_stats_file
Event Quads: 1,959,220,573
URLs: 14,077,815
Hosts: 399,470
http://schema.org/Event (62,979,080)
http://schema.org/Place (47,079,855)
http://schema.org/PostalAddress (36,842,757)
http://schema.org/Person (23,766,448)
http://schema.org/ListItem (19,234,094)
24.16 GB
(152)
Event (sample) lookup_file
pld_stats_file
FAQPage Quads: 1,416,338,018
URLs: 11,600,305
Hosts: 385,257
http://schema.org/Answer (48,925,667)
http://schema.org/Question (48,641,655)
http://schema.org/ListItem (30,144,276)
http://schema.org/ImageObject (20,603,938)
https://schema.org/Answer (17,421,753)
25.04 GB
(110)
FAQPage (sample) lookup_file
pld_stats_file
GeoCoordinates Quads: 3,183,262,704
URLs: 25,257,658
Hosts: 567,267
http://schema.org/ListItem (73,477,803)
http://schema.org/PostalAddress (53,036,985)
http://schema.org/GeoCoordinates (50,514,808)
http://schema.org/OpeningHoursSpecification (32,389,059)
http://schema.org/Offer (31,583,240)
40.99 GB
(247)
GeoCoordinates (sample) lookup_file
pld_stats_file
GovernmentOrganization Quads: 25,786,196
URLs: 389,511
Hosts: 1,940
http://schema.org/ListItem (1,425,244)
http://schema.org/GovernmentOrganization (547,322)
http://schema.org/ImageObject (478,534)
http://schema.org/PropertyValue (289,526)
http://schema.org/PostalAddress (228,173)
393.95 MB
(2)
GovernmentOrganization (sample) lookup_file
pld_stats_file
Hospital Quads: 17,744,405
URLs: 178,173
Hosts: 2,489
http://schema.org/PostalAddress (408,831)
http://schema.org/Hospital (341,547)
https://schema.org/MedicalProcedure (265,300)
http://schema.org/GeoCoordinates (230,028)
http://schema.org/ListItem (193,701)
238.78 MB
(2)
Hospital (sample) lookup_file
pld_stats_file
Hotel Quads: 244,120,606
URLs: 1,961,745
Hosts: 24,641
http://schema.org/ImageObject (12,124,449)
http://schema.org/Hotel (4,413,153)
http://schema.org/PostalAddress (4,118,968)
http://schema.org/ListItem (4,004,110)
http://schema.org/AggregateRating (2,332,147)
3.6 GB
(19)
Hotel (sample) lookup_file
pld_stats_file
JobPosting Quads: 175,208,843
URLs: 3,606,167
Hosts: 63,320
http://schema.org/PostalAddress (6,754,011)
http://schema.org/Place (6,689,046)
http://schema.org/Organization (4,452,042)
http://schema.org/JobPosting (4,068,469)
http://schema.org/ListItem (2,519,111)
7.09 GB
(14)
JobPosting (sample) lookup_file
pld_stats_file
LakeBodyOfWater Quads: 35,276
URLs: 689
Hosts: 100
http://schema.org/ImageObject (1,060)
http://schema.org/Organization (765)
http://schema.org/WebPage (687)
http://schema.org/LakeBodyOfWater (681)
http://schema.org/Person (562)
1.72 MB
(1)
LakeBodyOfWater (sample) lookup_file
pld_stats_file
LandmarksOrHistoricalBuildings Quads: 3,005,491
URLs: 33,102
Hosts: 460
http://schema.org/ImageObject (112,997)
http://schema.org/LandmarksOrHistoricalBuildings (95,368)
http://schema.org/PostalAddress (64,910)
http://schema.org/CreativeWork (50,724)
http://schema.org/OpeningHoursSpecification (49,374)
58.71 MB
(1)
LandmarksOrHistoricalBuildings (sample) lookup_file
pld_stats_file
Language Quads: 586,554,652
URLs: 4,742,134
Hosts: 11,556
http://schema.org/Person (25,797,772)
http://schema.org/Comment (19,971,596)
http://schema.org/ListItem (10,307,155)
http://schema.org/Language (9,360,775)
http://schema.org/InteractionCounter (7,608,122)
10.42 GB
(45)
Language (sample) lookup_file
pld_stats_file
Library Quads: 7,343,688
URLs: 206,299
Hosts: 938
http://schema.org/Library (220,963)
http://schema.org/Place (115,805)
http://schema.org/CreativeWork (108,818)
http://schema.org/ListItem (95,132)
http://schema.org/PostalAddress (90,187)
117.54 MB
(1)
Library (sample) lookup_file
pld_stats_file
LocalBusiness Quads: 2,246,054,619
URLs: 27,185,774
Hosts: 1,456,656
http://schema.org/ListItem (68,904,089)
http://schema.org/LocalBusiness (42,251,508)
http://schema.org/PostalAddress (39,581,828)
http://schema.org/ImageObject (16,997,007)
http://schema.org/Offer (16,958,951)
29.72 GB
(175)
LocalBusiness (sample) lookup_file
pld_stats_file
Mountain Quads: 232,960
URLs: 11,296
Hosts: 63
http://schema.org/Mountain (20,970)
http://schema.org/GeoCoordinates (13,074)
http://schema.org/propertyValue (5,749)
http://schema.org/ListItem (1,101)
http://schema.org/Place (712)
5.23 MB
(1)
Mountain (sample) lookup_file
pld_stats_file
Movie Quads: 150,239,669
URLs: 1,849,268
Hosts: 8,969
http://schema.org/Person (9,033,938)
http://schema.org/Movie (3,785,906)
http://schema.org/ListItem (2,092,367)
http://schema.org/AggregateRating (1,498,557)
http://schema.org/Place (1,232,216)
2.14 GB
(12)
Movie (sample) lookup_file
pld_stats_file
Museum Quads: 5,066,224
URLs: 81,583
Hosts: 653
http://schema.org/PostalAddress (108,572)
http://schema.org/ListItem (81,923)
http://schema.org/Museum (81,129)
http://schema.org/ImageObject (72,825)
http://schema.org/OpeningHoursSpecification (63,146)
68.46 MB
(1)
Museum (sample) lookup_file
pld_stats_file
MusicAlbum Quads: 81,155,565
URLs: 582,473
Hosts: 2,813
http://schema.org/Country (6,016,664)
http://schema.org/Offer (2,290,151)
http://schema.org/MusicRecording (2,229,386)
http://schema.org/MusicAlbum (1,964,947)
http://schema.org/MusicGroup (1,252,222)
832.36 MB
(7)
MusicAlbum (sample) lookup_file
pld_stats_file
MusicRecording Quads: 115,966,463
URLs: 879,827
Hosts: 5,315
http://schema.org/MusicRecording (6,362,571)
http://schema.org/Country (4,576,815)
http://schema.org/Offer (2,499,676)
http://schema.org/MusicAlbum (1,372,455)
https://schema.org/MusicRecording (1,358,158)
1.19 GB
(9)
MusicRecording (sample) lookup_file
pld_stats_file
Organization Quads: 40,064,384,727
URLs: 612,884,806
Hosts: 8,025,176
http://schema.org/ListItem (1,116,154,472)
http://schema.org/ImageObject (837,492,697)
http://schema.org/Organization (825,432,236)
http://schema.org/Offer (451,026,750)
http://schema.org/BreadcrumbList (390,105,797)
639.43 GB
(3103)
Organization (sample) lookup_file
pld_stats_file
Painting Quads: 10,557,884
URLs: 62,182
Hosts: 530
http://schema.org/Person (2,199,905)
http://schema.org/Offer (478,440)
http://schema.org/Painting (264,239)
http://schema.org/Product (154,817)
http://schema.org/ListItem (90,303)
88.0 MB
(1)
Painting (sample) lookup_file
pld_stats_file
Park Quads: 645,311
URLs: 8,017
Hosts: 337
http://schema.org/PostalAddress (25,330)
http://schema.org/Organization (15,538)
http://schema.org/Park (8,573)
http://schema.org/ListItem (7,464)
http://schema.org/GeoCoordinates (7,252)
9.85 MB
(1)
Park (sample) lookup_file
pld_stats_file
Person Quads: 25,756,876,296
URLs: 332,386,290
Hosts: 5,567,720
http://schema.org/ImageObject (603,867,548)
http://schema.org/Person (553,148,240)
http://schema.org/ListItem (552,470,158)
http://schema.org/Organization (273,894,286)
http://schema.org/WebPage (271,623,678)
486.43 GB
(1995)
Person (sample) lookup_file
pld_stats_file
Place Quads: 3,314,731,430
URLs: 26,959,880
Hosts: 536,281
http://schema.org/Place (84,443,010)
http://schema.org/ListItem (69,601,113)
http://schema.org/PostalAddress (68,406,696)
http://schema.org/Event (51,435,095)
http://schema.org/Person (34,851,209)
46.97 GB
(257)
Place (sample) lookup_file
pld_stats_file
Product Quads: 21,541,073,999
URLs: 279,730,608
Hosts: 3,309,246
http://schema.org/Offer (749,412,376)
http://schema.org/ListItem (500,275,491)
http://schema.org/Product (492,109,637)
http://schema.org/Organization (279,070,322)
http://schema.org/ImageObject (153,495,226)
315.19 GB
(1668)
Product (sample) lookup_file
pld_stats_file
QAPage Quads: 150,398,856
URLs: 2,328,621
Hosts: 11,113
http://schema.org/Person (8,306,375)
http://schema.org/Answer (6,535,032)
http://schema.org/ListItem (2,161,088)
http://schema.org/Question (2,116,945)
http://schema.org/QAPage (2,000,032)
3.16 GB
(12)
QAPage (sample) lookup_file
pld_stats_file
Question Quads: 1,632,265,643
URLs: 15,017,687
Hosts: 418,463
http://schema.org/Answer (59,458,768)
http://schema.org/Question (52,840,194)
http://schema.org/ListItem (32,375,177)
https://schema.org/Answer (21,594,163)
http://schema.org/ImageObject (21,100,418)
29.97 GB
(127)
Question (sample) lookup_file
pld_stats_file
RadioStation Quads: 11,700,578
URLs: 236,879
Hosts: 862
http://schema.org/ListItem (318,064)
http://schema.org/RadioStation (285,623)
http://schema.org/NewsArticle (201,603)
http://schema.org/ImageObject (161,884)
http://schema.org/WPSideBar (123,784)
197.45 MB
(1)
RadioStation (sample) lookup_file
pld_stats_file
Recipe Quads: 258,355,715
URLs: 2,746,673
Hosts: 37,305
http://schema.org/HowToStep (8,610,681)
http://schema.org/ListItem (5,355,061)
http://schema.org/ImageObject (3,430,769)
http://schema.org/Person (3,051,928)
http://schema.org/Recipe (2,922,483)
4.43 GB
(20)
Recipe (sample) lookup_file
pld_stats_file
Restaurant Quads: 158,668,167
URLs: 1,186,921
Hosts: 84,257
http://schema.org/Offer (6,208,726)
http://schema.org/MenuItem (3,963,413)
http://schema.org/Restaurant (2,969,814)
http://schema.org/Product (2,780,583)
http://schema.org/ListItem (2,372,384)
1.79 GB
(13)
Restaurant (sample) lookup_file
pld_stats_file
RiverBodyOfWater Quads: 170,020
URLs: 1,418
Hosts: 25
https://schema.org/Canal (16,992)
https://schema.org/Service (5,580)
http://schema.org/ImageObject (2,198)
http://schema.org/ListItem (2,022)
http://schema.org/TouristDestination (1,746)
2.85 MB
(1)
RiverBodyOfWater (sample) lookup_file
pld_stats_file
School Quads: 10,072,237
URLs: 187,096
Hosts: 2,099
http://schema.org/School (291,503)
http://schema.org/ListItem (194,016)
http://schema.org/PostalAddress (180,528)
http://schema.org/Organization (106,718)
http://schema.org/ImageObject (95,256)
163.6 MB
(1)
School (sample) lookup_file
pld_stats_file
SearchAction Quads: 27,878,243,924
URLs: 417,722,788
Hosts: 6,756,347
http://schema.org/ListItem (1,052,354,014)
http://schema.org/ImageObject (653,530,325)
http://schema.org/WebSite (433,191,623)
http://schema.org/SearchAction (422,554,600)
http://schema.org/BreadcrumbList (408,756,058)
349.64 GB
(2160)
SearchAction (sample) lookup_file
pld_stats_file
ShoppingCenter Quads: 15,255,183
URLs: 135,249
Hosts: 1,345
http://schema.org/Offer (363,660)
http://schema.org/ListItem (251,172)
http://schema.org/PostalAddress (249,166)
http://schema.org/Organization (238,757)
http://schema.org/ShoppingCenter (180,908)
209.82 MB
(2)
ShoppingCenter (sample) lookup_file
pld_stats_file
SkiResort Quads: 1,173,165
URLs: 28,128
Hosts: 245
http://schema.org/ListItem (42,596)
http://schema.org/SkiResort (38,305)
http://schema.org/PostalAddress (24,781)
http://schema.org/Person (21,854)
http://schema.org/Review (21,440)
24.25 MB
(1)
SkiResort (sample) lookup_file
pld_stats_file
SportsEvent Quads: 118,761,252
URLs: 801,134
Hosts: 7,213
http://schema.org/SportsTeam (6,022,913)
http://schema.org/SportsEvent (5,824,189)
http://schema.org/Place (5,054,962)
http://schema.org/PostalAddress (4,570,869)
http://schema.org/Organization (1,017,320)
1.04 GB
(10)
SportsEvent (sample) lookup_file
pld_stats_file
SportsTeam Quads: 99,708,850
URLs: 754,133
Hosts: 4,063
http://schema.org/SportsTeam (7,166,902)
http://schema.org/SportsEvent (2,995,861)
http://schema.org/Place (2,388,090)
http://schema.org/PostalAddress (2,094,768)
http://schema.org/Person (1,310,046)
953.32 MB
(8)
SportsTeam (sample) lookup_file
pld_stats_file
StadiumOrArena Quads: 14,432,465
URLs: 57,179
Hosts: 256
http://schema.org/SportsTeam (937,973)
http://schema.org/StadiumOrArena (322,770)
http://schema.org/SportsEvent (247,964)
http://schema.org/SportsMatchCompetitor (247,784)
http://schema.org/Organization (231,215)
123.21 MB
(2)
StadiumOrArena (sample) lookup_file
pld_stats_file
TVEpisode Quads: 29,570,994
URLs: 220,868
Hosts: 1,065
http://schema.org/Country (3,253,439)
http://schema.org/TVEpisode (974,891)
http://schema.org/Person (505,805)
https://schema.org/TVEpisode (300,012)
http://schema.org/TVSeries (213,857)
306.01 MB
(3)
TVEpisode (sample) lookup_file
pld_stats_file
TelevisionStation Quads: 1,927,396
URLs: 22,721
Hosts: 89
http://schema.org/ListItem (44,898)
http://schema.org/ImageObject (41,683)
http://schema.org/TelevisionStation (39,377)
http://schema.org/Person (26,370)
http://schema.org/WebPage (24,917)
29.75 MB
(1)
TelevisionStation (sample) lookup_file
pld_stats_file


If you are interested in a class or set of classes that is not listed above, please contact the Web Data Commons team via the mailing list or our Google Group.

Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Further details on how to convert the download files to other formats are available on the main page.
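As a rough illustration of what such a conversion involves, the following Python sketch writes quads to CSV with one row per quad. It is not the conversion code provided on the main page: the function name `quads_to_csv`, the column names, and the simplified line pattern are assumptions made for this example, and the sample line is invented.

```python
import csv
import io
import re

# Simplified quad pattern (an assumption; not a complete N-Quads grammar).
IRI = r'<[^<>\s]+>'
QUAD_RE = re.compile(
    rf'^\s*(?P<subject>{IRI}|_:\S+)\s+'
    rf'(?P<predicate>{IRI})\s+'
    rf'(?P<object>.+?)\s+'
    rf'(?P<graph>{IRI})\s*\.\s*$'
)

# Hypothetical column names chosen for this sketch.
HEADER = ['subject', 'predicate', 'object', 'page_url']


def quads_to_csv(lines, out):
    """Write one CSV row per parseable quad to `out`; return the number of rows written."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    written = 0
    for line in lines:
        match = QUAD_RE.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        writer.writerow([
            match['subject'],
            match['predicate'],
            match['object'],
            match['graph'][1:-1],  # page URL without angle brackets
        ])
        written += 1
    return written


# Invented sample line (not taken from the corpus).
sample = [
    '_:b1 <http://schema.org/name> "Widget" <https://example.com/widget> .',
    '',
]
buffer = io.StringIO()
count = quads_to_csv(sample, buffer)
print(count)
```

In practice `out` would be a file opened with `newline=''`; the resulting CSV can then be loaded with, for example, `pandas.read_csv`.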

Get the Code

The Jupyter notebooks used to create the Schema.org subsets from the Microdata and JSON-LD corpora can be checked out from our Git repository.

The extraction of December 2024 was done with version 1.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.