Poster: Branko Collin | Date: Sep 27, 2007 6:30am
Forum: texts | Subject: Re: API?
re: ISBN: as of February this year, the vast majority of books in the Toronto and Americana collections were published before 1920. The ISBN dates from 1966. You do the math.
re: scraping: a quick look through the FAQ does not tell me how TIA would prefer you to minimize traffic. However, the fact that they discuss wget and offer RSS feeds of the most recent items suggests that scraping is indeed the way to go. (If that means what I think it means.)
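One traffic-friendly option is to poll those RSS feeds for new items instead of crawling the whole site. A minimal Python sketch, assuming a per-collection feed URL (the exact address is a guess; check the site for the real one):

    # Poll an RSS feed of recently added items instead of crawling.
    # FEED_URL is an assumed address; verify it against the site.
    import re
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://archive.org/services/collection-rss.php?collection=texts"

    def recent_identifiers(feed_url=FEED_URL):
        """Yield item identifiers from the <link> of each RSS <item>."""
        with urllib.request.urlopen(feed_url) as resp:
            tree = ET.parse(resp)
        for item in tree.iter("item"):
            link = item.findtext("link", default="")
            # Links point at .../details/<identifier>; keep that segment.
            m = re.search(r"/details/([^/?#]+)", link)
            if m:
                yield m.group(1)

    for ident in recent_identifiers():
        print(ident)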
I am not 100% sure about this, but it would seem that all items get a unique identifier. The item can then be found at http://www.archive.org/details/identifier.
Poster: jrochkind | Date: Oct 17, 2007 5:47am
Forum: texts | Subject: Re: API?
Poster: EmilPer | Date: Oct 19, 2007 11:57pm
Forum: texts | Subject: Re: API?
There is a sort of API for searching: the search uses Lucene, so the rules for building the query string are in the open, and the results page is easy to parse.
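As an illustration, a minimal Python sketch of that approach; the search endpoint and the shape of the /details/ links in the results HTML are assumptions, not a documented API:

    # Run a Lucene-style query and scrape identifiers from the results page.
    # SEARCH_URL and the /details/ link pattern are assumptions.
    import re
    import urllib.parse
    import urllib.request

    SEARCH_URL = "https://archive.org/search.php"

    def search_identifiers(query):
        """Return identifiers linked from the first page of results."""
        url = SEARCH_URL + "?" + urllib.parse.urlencode({"query": query})
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # Each hit links to /details/<identifier>; dedupe, keep order.
        seen, out = set(), []
        for m in re.finditer(r'href="/details/([^"/?#]+)"', html):
            if m.group(1) not in seen:
                seen.add(m.group(1))
                out.append(m.group(1))
        return out

    # Lucene syntax: field:value terms joined with AND/OR.
    print(search_identifiers("collection:toronto AND mediatype:texts"))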
Once you have the string that uniquely identifies an item, it's easy to get the XML with the list of files, and after that the full text or the page images: see http://www.archive.org/about/faqs.php#140.
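A rough Python sketch of those two steps; the <identifier>_files.xml name under the download path and the DjVuTXT format label are assumptions about the file-list layout (the FAQ above is the authoritative reference), and the sample identifier comes from an example later in this thread:

    # Fetch an item's file-list XML, then download its plain-text file.
    # The _files.xml naming and the DjVuTXT label are assumptions.
    import urllib.request
    import xml.etree.ElementTree as ET

    DOWNLOAD = "https://archive.org/download"

    def list_files(identifier):
        """Yield (name, format) pairs from the item's file-list XML."""
        url = f"{DOWNLOAD}/{identifier}/{identifier}_files.xml"
        with urllib.request.urlopen(url) as resp:
            tree = ET.parse(resp)
        for f in tree.iter("file"):
            yield f.get("name"), f.findtext("format", default="")

    def fetch_fulltext(identifier):
        """Download the first plain-text (DjVuTXT) file, if any."""
        for name, fmt in list_files(identifier):
            if fmt == "DjVuTXT":
                url = f"{DOWNLOAD}/{identifier}/{name}"
                with urllib.request.urlopen(url) as resp:
                    return resp.read().decode("utf-8", errors="replace")
        return None

    text = fetch_fulltext("mindcure00larsrich")  # example item from this thread
    print(text[:500] if text else "no plain-text file found")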
Poster: AnnaN | Date: May 13, 2009 10:30am
Forum: texts | Subject: Re: API?
http://www.archive.org/help/
Poster: EmilPer | Date: May 13, 2009 10:46am
Forum: texts | Subject: Re: API?
Was there any change in the TOS, too, saying what can and what cannot be done with the books in the archive?
Poster: marcus lucero | Date: Oct 12, 2007 5:11pm
Forum: texts | Subject: Re: API?
http://www.archive.org/details/itemid
(e.g. http://www.archive.org/details/mindcure00larsrich, which will always stay the same and never be replaced by other files)
Others have "scraped" our database from outside but have never really shared their techniques.
Marcus
Poster: EmilPer | Date: Oct 20, 2007 12:51am
Forum: texts | Subject: Re: API?
This could be because it's not that difficult to scrape the public-domain books archive, and consequently there is not much code or technique to share.
It could also be because Archive.org does not say clearly what it allows and what it does not. For example, Project Gutenberg states clearly what can be done with its content, so there are many PG readers out there that read its text databases, process the book text, split it into pages, reformat it, and so on. Archive.org does not state, or not in a place that's easy to find, what can and cannot be done with the texts it hosts.
"Access to the Archive’s Collections is provided at no cost to you and is granted for scholarship and research purposes only." is very ambiguous. Would anyone spend a few hundred man-hours to write code to search, download marc/dc/meta files, get the fulltext files, index the text, cross reference it, identify correlations, generate new searches etc. and then share the code only to find out that "no, that's not allowed" ? Most likely s/he would leach as much as possible, share the code that does only the leaching, and then claim s/he is writing a better spelling checker and needs raw data.
An ambiguous "Terms of Use" plus "questions or comments regarding these terms ... at info@archive.org" means "don't bother unless you can afford to pay a lawyer full time".