{\BibTeX-based dataset generation for training citation parsers}
{Sree Harsha Ramesh, Dung Thai, Boris\regularorprogramstring{~}{ }Veytsman\regularorprogramstring{~(presenter)}{*}, Andrew McCallum}
{A human can relatively easily read a bibliography list and parse each
entry, i.e., recognize authors' names, item title, venue, year and
volume information, pagination, URLs and DOI numbers, etc., using such
cues as punctuation and font changes. This is even more impressive
since there is no universal standard of bibliography typesetting;
virtually all publishers and journals use their own ``house styles''.
It has been a challenge to make a machine do this task, which is
important for the digitization of the scientific literature. One of
the problems is the lack of labeled data: parsed bibliography items
suitable for training the algorithms. It is cumbersome and expensive
to translate a large number of bibliographies into a machine-readable
format; thus the largest dataset published so far has only 2479
entries. In this work (partially presented at \acro{AKBC} 2019) we
describe a way to overcome this problem. We start with a bibliography
already in machine-readable format and typeset it using \BibTeX\ and
\LaTeX. The resulting labeled dataset is suitable for training
algorithms. We used Nelson Beebe's archive of 1.41 million \BibTeX\
entries and typeset it with 275 bibliography styles from the recent
\TeX\ Live collection. After deleting some problematic entry--style
combinations that did not compile (mostly due to non-standard fields),
we obtained 185 million labeled samples, improving on the state of the
art by five orders of magnitude.}

% Sree Harsha Ramesh (1), Dung Thai (1), Boris Veytsman (2),
% Andrew McCallum (1)
% (1) College of Information and Computer Sciences, UMass Amherst
% (2) Chan Zuckerberg Initiative
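
As a minimal sketch of the typesetting step (the entry, file names, and
choice of style below are illustrative placeholders, not the actual
pipeline used in the work), a single machine-readable entry can be
rendered under one \texttt{.bst} style as follows; iterating the
compilation over entries and styles produces the rendered reference
strings, while the mechanism that aligns rendered tokens back to their
source fields to obtain the labels is not shown here.

\begin{verbatim}
% sketch.tex -- render one placeholder entry under one style
\begin{filecontents}{sample.bib}
@article{placeholder,
  author  = {A. Author and B. Author},
  title   = {A Placeholder Title},
  journal = {Journal of Examples},
  year    = {2020},
  volume  = {1},
  pages   = {1--10},
}
\end{filecontents}
\documentclass{article}
\begin{document}
\nocite{*}                  % list every entry in the database
\bibliographystyle{plain}   % substitute any installed .bst style here
\bibliography{sample}       % the machine-readable source entries
\end{document}
\end{verbatim}

Compiling with \texttt{latex sketch}, \texttt{bibtex sketch}, and
\texttt{latex sketch} (twice) yields the formatted reference list for
the chosen style.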