www type bibtex entries - generating bibtex for webpages + prior theme.

Mike Marchywka marchywka at hotmail.com
Sun Sep 15 14:56:48 CEST 2019


On Sun, Sep 15, 2019 at 11:14:59AM +0100, Peter Flynn wrote:
> 
> On 14/09/2019 22:51, Mike Marchywka wrote:
> > 
> > In a prior thread I was describing some reasons to prefer latex-like
> > document "source" over things like html or explicit xml.
> 
> I'm not clear what "explicit" XML is (as opposed to what?)

Anything that is XML but called something different, mostly things ending in ML  :) 

> 
> > Someone offered the CELT site below as an example of an experiment
> > related to this topic.
> 
> That would be me :-)
> 
> > In the link in the sample bibtex below, there is a link to xml
> > described as the "source document",
> 
> Where did you see http://research.ucc.ie/celt/document/E590001-007 described
> as the "source document"?
>
Bad editing, I meant that if you hit that link and then look at the content,
there is another link featured on the right of the page that looks like this,  

Source document
E590001-007.xml

 
> > %2019-09-14:17:16:49
> > %autogenerated by toobib
> > @www{CELTprojectBriefeucc,
> > authors = {},
> > title = {CELT project: A Briefe description of Ireland: made in this year, 1589, By Robert Payne | University College Cork},
> > url = {http://research.ucc.ie/celt/document/E590001-007},
> > urldate = {2019-09-14:17:16:49},
> > year = {}
> > }
> > 
> > so called "source document":
> 
> That's some kind of auto-generated bib file about the web page. The CELT

Yes, because I could not find the citation info on that page (logically it should
be with the share buttons, but I thought it would be on THAT page lol).

> project does not call this a source document. For the source

If I click the thing called "source document" on the right it goes to the xml,

http://research.ucc.ie/celt/document/E590001-007#front

Source document
E590001-007.xml

and the link copies as 
http://research.ucc.ie/celt/document/E590001-007.xml

> document you can look in the web page and click on "Header" and then
> "Source" where you will find the BiBTeX:
>

Thanks, that is exactly what I needed for this site, but I'm not sure
how anyone could easily have found it. They have one "share" button
on the other page, but you have to dig for the citation. In this case there are two kinds
of source: the original literature on which the page is based, and
what they apparently also call a "source", as I mentioned above, which
is the XML that generates the web page.

 
> @incollection{E590001-007,
>   editor 	 = {Aquilla Smith},
>   title 	 = {A Brife description of Ireland: made in this yeere. 1589. By
> Robert Payne. vnto xxv. of his partners for whom he is undertaker there.
> Truely published verbatim, according to his letters, by Nich. Gorsan one of
> the said partners, for that he would his countrymen should be partakers of
> the many good Notes therein conteined. With diuers Notes taken out of others
> the Authoures letters written to his said partners, sithenes the first
> Impression, well worth the reading.},
>   booktitle 	 = {Tracts relating to Ireland, printed for the Irish
> Archaeological Society.},
>   address 	 = {Dublin},
>   publisher 	 = {University Press, Graisberry and Gill},
>   date 	 = {1841},
>   volume 	 = {1},
>   note 	 = {v–viii; 3–14 (separate pagination)}
> }
> 
> > http://research.ucc.ie/celt/document/E590001-007.xml
> > While it is quite true that this xml provides good explicit structure
> > and is "human readable" it does not quite "flow" like simple latex
> > source code.
> I'm not clear what "flow" means in this context. The XML document is an

I had to pick a word for the style. If you try to read it, you can't just sit
down and read it; you have all the "XML junk" to read around. See below, but
the latex-like syntax does not imply any specific presentation of the info;
it is just better organized visually even before typesetting into a specific
rendition.

> accurate representation of the original book from 1841. It begins like this:
> 
>     <body>
>       <div0 type="description" lang="en">
> 	<head>A Brife description of Ireland: made in this yeere.
> 	  1589. By Robert Payne [...]</head>
> 	<pb n="3"/>
> 	<div1 type="section" n="1">
> 	  <p><text type="letter">
> 	      <body>
> 		<p>Let not the reportes of those that haue spent all
> 		  their owne and what they could by any meanes get
> 		  from others in England, discourage you from
> 		  Irela<ex>n</ex>d, although they and such others by
> 		  bad dealinges haue wrought a generall discredite to
> 		  all English men, in that countrie which are to the
> 		  Irishe vnknowen.</p>
> 
> I'm not sure that there is any other meaningful way to do it: the objective
> of the project is to capture the text and *accurate* structure of the
> original, so there's a divisional container, a heading, a page-break, a
> numbered sub-container, with a quoted letter with its own internal
> structure, etc.
> 
> > That is you could read most latex source as if it was meant to be
> > understood versus html or this xml.
> Correct. XML is a file storage format. It contains information that LaTeX
> does not have by default (eg nested containers)
> 
> > The latex just provides
> 
> ...some...
> 
> > logical structure without a lot of verbosity
> 
> Correct. XML is for *storing* the metadata — in this case for posterity — it
> makes no judgment about how you or anyone else will use it.
> 
> > and allows a renderer to define layout info for the latex things.
> 
> Right. The project could have used LaTeX (it was seriously considered back
> when it was starting in 1989) but wiser heads prevailed.
> 
> You can already see in the extract above that an editor has annotated her
> corrections wherever she expanded a word to complete the spelling, with the
> <ex> element type. In print, this would be rendered [n] or perhaps an italic
> n or an underlined n — that's a formatting decision for the publisher. Using
> XML, you don't specify *how* it looks, only that it exists. Scholars need
> the non-committal format so they can do things like studying the scriptorial
> or linguistic aspects of editions, so being able to retrieve all occurrences
> of editorial interventions in their context is important to them, much more
> so than how to typeset it.
>

I understand all of that and mostly just object on the "human readability" point.
Ideally, of course, you have some "Source [digital] Document" that contains all the information
about the, well, source document (the historical thing you want to make available to
the world). XML is a flexible, well-supported thing for anything you can define
as a tree of text. Latex, or maybe even JSON for that matter, AFAICT provides
similar capabilities with varying human readability. There is no reason that
a latex-like document needs to have any formatting stuff: all those commands
can be logical rather than "what it looks like", and you can choose rendering
algorithms when displaying.
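
As a rough sketch of what I mean (my own illustration, not anything CELT or
toobib does), the Python below walks a small piece of the XML quoted above and
emits latex-like source in which every command is logical rather than visual.
The command names \para, \heading and \ex are made up for the example; a
renderer would decide later how each one looks.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element names to logical (not visual) commands.
COMMANDS = {"head": "heading", "p": "para", "ex": "ex"}

def to_logical(elem):
    """Return latex-like text for elem, keeping structure but no layout."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(to_logical(child))
        parts.append(child.tail or "")
    inner = " ".join("".join(parts).split())  # collapse runs of whitespace
    name = COMMANDS.get(elem.tag)
    # Elements without a mapping (e.g. div1) just pass their content through.
    return "\\%s{%s}" % (name, inner) if name else inner

fragment = """<div1><p>discourage you from
  Irela<ex>n</ex>d, although they and such others</p></div1>"""

print(to_logical(ET.fromstring(fragment)))
```

For that fragment it prints
\para{discourage you from Irela\ex{n}d, although they and such others}
which still records the editorial expansion but reads more like text.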

 
> > Anyway, the point in posting this time is to ask about citing web pages.
> 
> Use biblatex for formatting, not BiBTeX, because the older formats tend not

OK, I have to see what is involved. I migrated recently and am not sure I looked
at the bib details.

> to have the right fields for citing web pages. See also
> 
> https://tex.stackexchange.com/questions/3587/how-can-i-use-bibtex-to-cite-a-web-page
> https://tex.stackexchange.com/questions/411440/how-cite-a-website-with-bibtex
>
I guess this is still kind of open. In the second link it is funny that they
mention the "plain" style: my earlier latex was so old that I wrote a plainurl
bst that included a url lol.
 
> > For most articles intended to be cited, I had ways to scrape bibtex
> > off the pages containing an abstract- if the link is on the
> > clipboard the script can usually find a bibtex entry or a doi and
> > call crossref.
> Right. Scrapers are usually unreliable, even Zotero and Mendeley. Most
> journal pages have a download, often including a .bib file, but even if they
> only have RIS, you can still open that in JabRef and get the data saved in
> BiBTeX format.
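
For reference, the doi route I described is roughly the following. This is a
simplified Python sketch, not the actual toobib code: the regular expression
is only a loose heuristic for what a DOI looks like, the function names are
made up, and the URL pattern is the Crossref REST API BibTeX transform as I
understand it. Nothing is fetched here; the sketch only finds the DOI and
builds the URL to request. The sample DOI is a placeholder.

```python
import re
from urllib.parse import quote

# Loose DOI pattern; real DOIs vary, so treat this as a heuristic only.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def find_doi(text):
    """Return the first DOI-looking string in text, or None."""
    m = DOI_RE.search(text)
    # Strip punctuation that often trails a DOI in running text.
    return m.group(0).rstrip(".,;)") if m else None

def crossref_bibtex_url(doi):
    """Build the URL that should return BibTeX for doi (assumed endpoint)."""
    return ("https://api.crossref.org/works/%s/transform/application/x-bibtex"
            % quote(doi, safe="/"))

html = '<meta name="citation_doi" content="10.1000/xyz123">'
doi = find_doi(html)
print(doi)
print(crossref_bibtex_url(doi))
```

If no DOI turns up, find_doi returns None and the script has to fall back on
whatever the page itself offers.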
> 
> > However, I need to make some arguments contrasted to "popular" or
> > maybe news sites or cite commercial products that were mentioned in a
> > work. Few of these provide bibtex for their pages although plenty
> > have "share"  features.
> 
> In those cases the only answer is to copy and paste into JabRef or whatever
> you use to manage your bibliography.
> 
> > AFAICT, even the CELT site did not provide much in the way of "how to
> > cite" which is odd for their academic work and indeed confusing as
> > you want to credit their work with displaying some other classic
> > work.
> 
> Yes, it's something missing which is on the list to implement. As I said,

Well, the bibtex you found looks nice, but it also seems like a research
task just to find it. I guess if it was on the same page as the share
features (some journals have a cite button near the shares)
that would be easier, but at least it exists.

> it's a new format and not everything is in place yet. However, very few
> people would ever need to cite the CELT *web page* itself. They would cite
> the quoted edition (which is why BiBTeX is provided in every document), and
> just add the URL as their link. The CELT editions can be treated exactly as
> the paper editions would be.
> 
> > Is there some obvious way anyone here would create a bibtex
> > entry for the page above,
> 
> At the moment, only manually. But given your impetus, I can bump the
> priority level for providing this up a few notches. It's fairly complex
> because it needs some decisions taking over (eg) which version of the title
> to use, how many of the editors to cite (some documents have dozens), etc.
> 
> > and as an example of the commercial site, for example,
> > 
> > ./toobib.h608  m_bib.format()=%2019-09-14:17:45:00
> > %autogenerated by toobib
> > @www{ZincCapsHighPotencylifeextension,
> > authors = {},
> > title = {Zinc Caps High Potency, 50 mg 90 capsules | Life Extension    },
> > url = {https://www.lifeextension.com/vitamins-supplements/item01813/zinc-caps-high-potency},
> > urldate = {2019-09-14:17:45:00},
> > year = {}
> > }
> 
> I would make that something like:
> 
> @www{ZincCapsHighPotencylifeextension,
> authors = {Life Extension Foundation},
> title = {Zinc Caps High Potency, 50 mg 90 capsules},
> url = {www.lifeextension.com/vitamins-supplements/item01813/zinc-caps-high-potency},
> urldate = {2019-09-14T17:45:00},
> year = {2019},
> address = {Fort Lauderdale, FL}
> }
>

I guess it is an almost irrelevant point, but I was curious about authors: both
the intended content and where to scrape it from. Probably any reader who wanted to look
would just hit the link and not care how it was written. Ultimately the point of the
bibliography is documentation and an aid to the reader.

 
> > The bibtex above is what I could scrape from the link using some code I wrote
> > to do it automatically from the link itself, html fields like "title" and any
> > "meta" it can find.
> 
> Unless the page owner is aware of things like citation, that's probably all
> you'll ever get.
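
To make the title-and-meta approach concrete, here is a minimal Python sketch
of the idea. It is not the toobib code (that is C++): the class and function
names are hypothetical, the sample page is invented, and looking in the
"author" and "og:site_name" meta tags is only a guess at where a site might
name itself. It writes the field as "author", which AFAICT is the name
biblatex expects (the entries above say "authors"), and it keeps urldate to a
plain date.

```python
from html.parser import HTMLParser

class TitleMeta(HTMLParser):
    """Collect the <title> text and <meta name|property=... content=...> pairs."""
    def __init__(self):
        super().__init__()
        self.title, self.meta, self._in_title = "", {}, False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("content"):
            key = a.get("name") or a.get("property")
            if key:
                self.meta[key.lower()] = a["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def www_entry(key, url, html, urldate):
    """Format a @www entry from whatever the page's HTML gives away."""
    p = TitleMeta()
    p.feed(html)
    title = " ".join(p.title.split())  # drop stray padding in the title
    # Assumption: one of these two meta tags names the author or site.
    author = p.meta.get("author") or p.meta.get("og:site_name", "")
    fields = [("author", author), ("title", title),
              ("url", url), ("urldate", urldate)]
    body = ",\n".join("  %s = {%s}" % f for f in fields)
    return "@www{%s,\n%s\n}" % (key, body)

# Invented sample page, not a real site.
page = ('<html><head><title>Example Product | Example Shop   </title>'
        '<meta property="og:site_name" content="Example Shop"/></head></html>')
print(www_entry("ExampleProduct", "https://example.com/product", page, "2019-09-14"))
```

When neither meta tag is present the author field just comes out empty, which
is the situation in the two autogenerated entries above.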
> 
> > Eventually I could chase down doi's or other cues, that is why I went
> > from bash to c++, but hopefully it does not become that big a mess
> I would have stuck with bash because of the huge range of facilities
> designed for text manipulation like tidy and the LTxml2 utilities.
> 
> > I guess if this worked well it would be nice to let publishers or site
> > owners use a similar tool to provide bibtex in a "how to cite" button
> > next to all the sharing stuff.
> 
> I doubt if they would be interested, to be honest.
Yeah, I get that feeling too. I guess links or shares are most of the publicity.

> 
> > Google scholar probably did something like this to create their bibtex
> > but I was not sure if any of that is public or if other mechanisms exist
> > so I wrote my own code but it could be quite involved and I'm not even
> > sure how to use some of the fields. Is there a style guide with this in
> > it somewhere?
> 
> You can ask them :-)
> 
> P

Thanks.

-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X


