another PDF Zotero fails to scrape,

Mike Marchywka marchywka at hotmail.com
Mon Feb 28 17:14:23 CET 2022


I've seen a lot of universities with thesis collections
in formats similar to this,


https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1406&context=foodsciefacpub

but Zotero has a hard time with the PDF, at least on its web form. This PDF
includes a DOI but you, or the computer, have to hunt for it. Often, these
sites seem to link to a separate collection of HTML pages that carry
machine-scrapable citation info. In any case, PDF scraping can be very
important. I still use wget but think I can use the headless Chrome debug
port to do file downloads. CiteSeer has been a problem as their BibTeX is
almost empty and they now apparently reject wget downloads.
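
A minimal sketch of the "hunt for the DOI" step, in Python rather than
TooBib's C++. It assumes the PDF has already been converted to plain text
(for example with pdftotext). The regex and the names find_dois and
crossref_url are illustrative assumptions, not TooBib's actual code.

```python
import re

# Loose pattern for modern DOIs: "10.", a 4-9 digit registrant prefix, "/",
# then a suffix. This is an assumption for illustration and will miss
# unusual DOIs.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def find_dois(text):
    """Return candidate DOIs in order of first appearance, without repeats."""
    found = []
    for match in DOI_RE.finditer(text):
        # Trim punctuation that usually belongs to the sentence, not the DOI.
        # A DOI that really ends in one of these characters would be cut short.
        doi = match.group(0).rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found

def crossref_url(doi):
    """Build the Crossref lookup URL for a DOI."""
    return "http://api.crossref.org/works/" + doi

sample = "Periodontology 2000, Vol. 42, 2006. doi: 10.1111/j.1600-0757.2006.00184.x."
dois = find_dois(sample)
```

Taking the first candidate is only a heuristic: a PDF's reference list can
contain many DOIs, so a real tool probably wants to prefer one found on the
first page.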

I've started to get TooBib to the "niceties" like UTF-8 and formatting
uniformity but there are some blunders from seemingly innocuous changes to
the central design pieces (going to an asynchronous fetch in headless
and generalizing the hierarchical doc format).

I've never looked at Zotero or its competitors much but
TooBib seems to work ok for me, certainly compared to the
Zotero web form. Hopefully I can get to the non-standard biblio
entries for people and materials. It looks like adding
document-specific query strings will not be a problem as that also lets
me write out the bbl file, taking care of last-minute
format changes. I'm not sure how to write a bst file to do it
but choosing among various fields and including multiple ones
with short text (a "DOI" link for example, like they do in, IIRC,
Nature) was just easier to code in C++ than learning bst.
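
To make the field-choosing idea concrete, here is a rough sketch in Python
(TooBib itself is C++). It picks the first non-empty field from a preference
list and appends short-text links when writing one bbl item. The function
names, the preference order, and the use of \href (which needs the hyperref
package) are all assumptions for illustration, not TooBib's real output
format.

```python
def pick_first(entry, names):
    """Return the first non-empty field among names, or an empty string."""
    for name in names:
        value = entry.get(name, "").strip()
        if value:
            return value
    return ""

def bibitem(key, entry):
    """Format one entry as a thebibliography item with short link labels."""
    author = pick_first(entry, ["author_orig", "author"])
    journal = pick_first(entry, ["abbrvjrnl", "journal"])
    year = pick_first(entry, ["year", "date"])[:4]
    text = "{}. {}. {} {}, {} ({}).".format(
        author, entry.get("title", ""), journal,
        entry.get("volume", ""), entry.get("page", ""), year)
    # Include every link field that is present, each under a short label.
    links = []
    if entry.get("doi"):
        links.append(r"\href{https://doi.org/%s}{DOI}" % entry["doi"])
    if entry.get("url"):
        links.append(r"\href{%s}{URL}" % entry["url"])
    return "\\bibitem{%s}\n%s %s" % (key, text, " ".join(links))

entry = {
    "author_orig": "Anne C. R. Tanner and Jacques Izard",
    "title": "Tannerella forsythia, a periodontal pathogen entering the genomic era",
    "abbrvjrnl": "Periodontol 2000",
    "journal": "Periodontology 2000",
    "volume": "42",
    "page": "88-113",
    "date": "2006-10",
    "doi": "10.1111/j.1600-0757.2006.00184.x",
}
item = bibitem("2006_Anne_Tanner_Jacques_Izard_Tannerella", entry)
```

Changing the style then means editing the preference lists or the format
string rather than a bst function.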


Anyway, once you find the DOI there is a lot of info available
from Crossref:


% mjmhandler: toobib handlepdf (pdftotext bin)
% date 2022-02-28:11:02:37 Mon Feb 28 11:02:37 EST 2022
% srcurl: https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1406&context=foodsciefacpub
% citeurl: http://api.crossref.org/works/10.1111/j.1600-0757.2006.00184.x
@article{2006_Anne_Tanner_Jacques_Izard_Tannerella,
X_TooBib = {year: extract from date leng== 7},
X_TooBib = {urldate: FixBeKvp s= cmd=date "+%Y-%m-%d" d=2022-02-28 dn=urldate},
X_TooBib = {author: R Tanner , Anne C and Izard , Jacques},
X_TooBib = {reference: deleted for space },
abbrvjrnl = {Periodontol 2000},
affiliation = {},
alternative-id = {10.1111/j.1600-0757.2006.00184.x},
author = {R Tanner , Anne C and Izard , Jacques},
author_orig = {Anne C. R. Tanner and Jacques Izard},
bib-source = {Crossref},
content-domain = {false},
date = {2006-10},
date-created = {2006-08-23T15:16:38Z},
date-deposited = {2021-07-072021-07-07T10:52:40Z},
date-indexed = {2022-01-05T02:05:53Z},
date-issued = {2006-10},
date-journal-issue = {2006-10},
date-license = {2015-09-012015-09-01T00:00:00Z},
date-published-print = {2006-10},
deposited = {1625655160000},
doi = {10.1111/j.1600-0757.2006.00184.x},
is-referenced-by-count = {87},
issn = {1600-0757},
issn-type = {0906-6713, print, 1600-0757, electronic},
issue = {1},
journal = {Periodontology 2000},
journal-issue = {1},
language = {en},
license = {1441065600000, tdm, 3257, http://doi.wiley.com/10.1002/tdm_license_1.1},
link = {https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1111%2Fj.1600-0757.2006.00184.x, unspecified, vor, text-mining, http://onlinelibrary.wiley.com/wol1/doi/10.1111/j.1600-0757.2006.00184.x/fullpdf, unspecified, vor, similarity-checking},
member = {311},
month = {10},
page = {88-113},
prefix = {10.1111},
publisher = {Wiley},
reference-count = {267},
references-count = {267},
score = {1},
subject = {Periodontics},
title = {Tannerella forsythia, a periodontal pathogen entering the genomic era},
type = {journal-article},
url = {http://dx.doi.org/10.1111/j.1600-0757.2006.00184.x},
urldate = {2022-02-28},
volume = {42},
year = {2006},
final_assembly ={ TooBib handler handlepdf (pdftotext bin)},
srcurl={https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1406&context=foodsciefacpub},
xsrcurl={https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1406&context=foodsciefacpub},
citeurl={http://api.crossref.org/works/10.1111/j.1600-0757.2006.00184.x}

}
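
For comparison, a bare-bones sketch of the Crossref step in Python. It
assumes the usual shape of a Crossref works response, where "message" holds
the record, title and container-title are lists, and authors come as
separate given/family parts. fetch_work, first, and to_bibtex are
hypothetical names, and the sample record below is hand-built from the entry
above rather than a captured API response.

```python
import json
import urllib.request

CROSSREF_WORKS = "http://api.crossref.org/works/"  # same endpoint as citeurl

def fetch_work(doi):
    """Fetch the Crossref record for a DOI and return its 'message' object."""
    with urllib.request.urlopen(CROSSREF_WORKS + doi) as resp:
        return json.load(resp)["message"]

def first(value):
    """Unwrap fields that Crossref delivers as a list (title, journal)."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""

def to_bibtex(key, msg):
    """Map a Crossref 'message' dict onto a small BibTeX article entry."""
    # Crossref already splits names, so "Family, Given" needs no guessing.
    authors = " and ".join(
        "{}, {}".format(a.get("family", ""), a.get("given", "")).strip(", ")
        for a in msg.get("author", []))
    date_parts = msg.get("issued", {}).get("date-parts", [[None]])[0]
    fields = {
        "author": authors,
        "title": first(msg.get("title")),
        "journal": first(msg.get("container-title")),
        "year": str(date_parts[0]) if date_parts and date_parts[0] else "",
        "volume": msg.get("volume", ""),
        "number": msg.get("issue", ""),
        "pages": msg.get("page", ""),
        "doi": msg.get("DOI", ""),
    }
    # Skip empty fields so the entry only carries what Crossref supplied.
    body = ",\n".join("  {} = {{{}}}".format(k, v) for k, v in fields.items() if v)
    return "@article{{{},\n{}\n}}".format(key, body)

# Hand-built sample in the assumed Crossref shape; no network needed.
sample = {
    "author": [{"given": "Anne C. R.", "family": "Tanner"},
               {"given": "Jacques", "family": "Izard"}],
    "title": ["Tannerella forsythia, a periodontal pathogen entering the genomic era"],
    "container-title": ["Periodontology 2000"],
    "issued": {"date-parts": [[2006, 10]]},
    "volume": "42", "issue": "1", "page": "88-113",
    "DOI": "10.1111/j.1600-0757.2006.00184.x",
}
bib = to_bibtex("2006_Tanner_Izard", sample)
```

With network access, to_bibtex(key, fetch_work(doi)) would do the same from
the live record.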


-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X

