How well known or accepted is "json-ld" as a source of information for scraping bibtex from web pages?

Mike Marchywka marchywka at hotmail.com
Thu Jun 10 01:41:31 CEST 2021


AFAICT, this is a pretty common format among non-academic publishers:

https://json-ld.org/

It seems to be used on most news sites, and I even found it on an academic reference site,
on a page that is not a listing of journal articles but a large diagram. For example,

https://string-db.org/network/9606.ENSP00000378426

entered in the Zotero web form gives this:

@misc{noauthor_vkorc1_nodate-1,
	title = {{VKORC1} protein ({Human}) - {STRING} interaction network},
	url = {https://string-db.org/network/9606.ENSP00000378426},
	urldate = {2021-06-09},
}

But when I just naively make a bibtex entry from the ld+json data, it is
quite rich and suggestive of extended bibtex usages, like a "BomTex" for
bills of materials or a biography short enough to be a bibliographic
entry for a "private communication" based on, say, a LinkedIn profile.
Here is the result (I have only renamed the keys to flatten the key
hierarchy; I am not claiming this is good bibtex, but it could easily be
re-keyed for custom apps):

@article{2021,
X_10_description = {Homo sapiens STRING functional association network of VKORC1 protein.},
X_11_keywords = {VKORC1, VKORC1 network, interactions, interaction network, functional associations},
X_12_license_1_type = {CreativeWork},
X_12_license_2_name = {Creative Commons Attribution 4.0 International},
X_12_license_3_url = {https://creativecommons.org/licenses/by/4.0/},
X_13_image_1_type = {ImageObject},
X_13_image_2_name = {STRING VKORC1 interaction network (Homo sapiens)},
X_13_image_3_url = {https://string-db.org/image_png/9606.ENSP00000378426.png},
X_13_image_4_license_1_type = {CreativeWork},
X_13_image_4_license_2_name = {Creative Commons Attribution 4.0 International},
X_13_image_4_license_3_url = {https://creativecommons.org/licenses/by/4.0/},
X_14_creator_1_type = {Organization},
X_14_creator_2_name = {String Consortium},
X_14_creator_3_url = {https://string-db.org},
X_14_creator_4_email_1_email = {mering at imls.uzh.ch},
X_14_creator_4_email_2_email = {bork at embl.de},
X_14_creator_4_email_3_email = {lars.juhl.jensen at cpr.ku.dk},
X_15_version = {11.0},
X_1_base = {http://schema.org},
X_1_bio = {http://bioschemas.org/},
X_1_context_2_context = {http://schema.org},
X_1_type = {Organization , Organization , Organization , Organization},
X_2_includedInDataset = {https://string-db.org/#string.v11},
X_2_name = {SIB Swiss Bioinformatics Institute , European Molecular Biology Laboratory , University of Zurich , University of Copenhagen},
X_3_type = {bio:Protein},
X_3_url = {https://www.sib.swiss/ , https://embl.org/ , https://www.uzh.ch/ , https://www.ku.dk/},
X_4_id = {https://string-db.org/network/9606.ENSP00000378426},
X_5_http://purl.org/dc/terms/conformsTo = {https://bioschemas.org/specifications/Protein/0.9-DRAFT},
X_6_name = {VKORC1 interaction network (Homo sapiens)},
X_7_identifier = {9606.ENSP00000378426},
X_8_url = {https://string-db.org/network/9606.ENSP00000378426},
X_9_taxonomicRange_1_identifier = {9606},
X_9_taxonomicRange_2_id = {http://identifiers.org/taxonomy:9606},
X_9_taxonomicRange_3_type = {bio:taxon},
X_9_taxonomicRange_4_name = {Homo sapiens},
X_9_taxonomicRange_5_taxonRank = {species},
X_type = {},
author = {},
day = {09},
jt = {},
month = {06},
name = {},
title = {},
year = {2021}
}
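
For anyone who wants to try the same naive conversion, here is a minimal
Python sketch. It is not the tool that produced the entry above, just one
way to do something similar under stated assumptions: the page embeds its
data in <script type="application/ld+json"> elements, and the names
JsonLdExtractor, flatten, and jsonld_to_bibtex, along with the default
citekey and entry type, are made up for illustration. The position-numbered
key scheme only approximates the one shown above.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = None  # None while we are outside a JSON-LD script element

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.blocks.append("".join(self._buf))
            self._buf = None


def flatten(node, prefix="X"):
    """Yield (field_name, value) pairs; the key hierarchy is encoded in the name."""
    if isinstance(node, dict):
        for i, (key, value) in enumerate(node.items(), start=1):
            # Drop the leading "@" of JSON-LD keywords such as "@type".
            yield from flatten(value, f"{prefix}_{i}_{key.lstrip('@')}")
    elif isinstance(node, list):
        for i, item in enumerate(node, start=1):
            yield from flatten(item, f"{prefix}_{i}")
    else:
        yield prefix, str(node)


def jsonld_to_bibtex(html, citekey="jsonld_entry", entry_type="misc"):
    """Build one BibTeX entry from all JSON-LD blocks found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    fields = []
    for block in parser.blocks:
        try:
            fields.extend(flatten(json.loads(block)))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@{entry_type}{{{citekey},\n{body}\n}}"
```

Note that this sketch does not sanitize field names (a key such as
http://purl.org/dc/terms/conformsTo passes through unchanged), does not
escape braces in values, and will emit duplicate field names if a page
carries several JSON-LD blocks.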


Maybe some people don't want cluttered bibtex or an entry type of "bio:Protein", but
most of these things seem to be fixable with lookup tables on literals or regexes.
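
A rough Python sketch of that kind of fix-up, assuming the flattened fields
are held in a dict. The table ENTRY_TYPES, the rules in FIELD_RULES, and the
function names are all hypothetical; the mappings are illustrative choices,
not an established convention.

```python
import re

# Hypothetical lookup table: JSON-LD type literal -> BibTeX entry type.
ENTRY_TYPES = {
    "bio:Protein": "misc",
    "NewsArticle": "article",
    "Person": "misc",
}

# Hypothetical regex rules: flattened key pattern -> common BibTeX field name.
FIELD_RULES = [
    (re.compile(r"^X_\d+_name$"), "title"),
    (re.compile(r"^X_\d+_url$"), "url"),
    (re.compile(r"^X_\d+_creator_\d+_name$"), "author"),
    (re.compile(r"^X_\d+_description$"), "abstract"),
]


def entry_type_for(ld_type, default="misc"):
    """Map a JSON-LD type literal to a BibTeX entry type via the lookup table."""
    return ENTRY_TYPES.get(ld_type, default)


def rekey(fields):
    """Rename fields that match a rule; sanitize the names of all others."""
    out = {}
    for name, value in fields.items():
        for pattern, target in FIELD_RULES:
            if pattern.match(name):
                # The first matching field wins; later matches are dropped.
                out.setdefault(target, value)
                break
        else:
            # No rule matched: keep the field, but replace characters such as
            # ":" and "/" that do not belong in a field name.
            out[re.sub(r"[^A-Za-z0-9_]", "_", name)] = value
    return out
```

For example, rekey({"X_6_name": "VKORC1 interaction network (Homo sapiens)"})
yields a plain title field, and entry_type_for("bio:Protein") gives "misc".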








note new address
 Mike Marchywka 306 Charles Cox Drive Canton, GA 30115
470-758-0799
404-788-1216



