O/T : linux: can I list which small caps are in a ttf/otf font ?

Thu Nov 2 18:11:51 CET 2023

On Wed, Oct 25, 2023 at 08:55:52AM +0100, Peter Flynn wrote:
> On 25/10/2023 01:47, Ken Moffat via tex-live wrote:
> [...]
> > No problem with linking to it, BUT - I'm not using any latex
> > packages for the fonts.
> 
> That's OK, I'm just referencing font resources for users who just want the
> font and aren't specifically looking for a package.
> 
> > Thanks. I find the docs, and indeed trying to read fontforge's tiny
> > text, hard.  Will have to reinstall it and make another attempt.
> 
> The documentation is mostly directed at font designers, not font managers.
> 
> > Cheers.  I suppose I ought to start by reading up on the OTF font
> > format.  A quick look at wikipedia suggests the pcap or c2pc tags
> > (lowercase to petite caps, or uppercase to petite caps) are part of
> > what I need.
> 
> Keep us posted.
> 
> > I think I may be some time!
> > ĸen
> :-)
> 
> Peter

A status report: I took Bob Tennent's suggestion of otfinfo -g and
I've got adequate details of all the fonts I've tested where I know
there are latin small caps, with the exceptions of Source Sans 3,
Source Serif 4, and Roboto (recent versions of each, from google
fonts).  In those cases I get 'no glyph name information'.  Since I
had already started to separately show the small cap alphabets, I
can live without knowing all the details for claimed † codepoints
of those three fonts.
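
For fonts where otfinfo -g gives no glyph names, it may still be possible
to confirm that small caps are declared at all by looking at the feature
list from 'otfinfo -f'. A minimal sketch, not part of the attached script:
the function name has_feature is hypothetical, and it assumes each line of
the listing starts with the four-letter feature tag followed by whitespace
and a description.

```shell
# has_feature TAG: read a feature listing on stdin and succeed if TAG
# appears as the first field of any line.
# Assumption: input lines look like "smcp<whitespace>Small Capitals".
has_feature() {
	awk -v tag="$1" '$1 == tag { found = 1 } END { exit !found }'
}

# Hypothetical usage with a real font:
#   otfinfo -f /full/path/to/font.otf | has_feature smcp && echo "has smcp"
```

This only says that the feature exists, not which codepoints it covers.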

† : In the normal text parts of several fonts there are errors in
mapping one or two glyphs, so it always pays to check the actual
glyphs for a language. In Small Caps, there are some fonts where
latin i and dotless i both lack a dot, making the small caps useless
in e.g. turkish.

But after grepping for the small caps with otfinfo, who would have
thought there could be so many variations ? The freefont fonts use
sc_something instead of something.sc (but they only cover unaccented
latin).  Many of the other variations are style variations or
language-specific.
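
The two naming conventions can be told apart from the glyph list alone.
A sketch under stated assumptions: the function name sc_variant is
hypothetical, and the patterns follow the attached script, which matches
'.sc' anywhere in a name for the common case and a leading 'sc.' for the
freefont case.

```shell
# sc_variant: read glyph names on stdin, one per line, and print which
# small-cap naming convention they appear to use:
#   suffix  - names such as a.sc or aacute.SngStory.sc
#   prefix  - names such as sc.a
#   none    - neither pattern seen
# The suffix form wins if both occur, as in the attached script.
sc_variant() (
	names=$(cat)
	if printf '%s\n' "$names" | grep -q '\.sc'; then
		echo suffix
	elif printf '%s\n' "$names" | grep -q '^sc\.'; then
		echo prefix
	else
		echo none
	fi
)

# Hypothetical usage:
#   otfinfo -g /full/path/to/font.otf | sc_variant
```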

But that still leaves what I have called 'work items' used when
assembling a glyph (I lack the correct terminology).  Some of these
are straightforward: _part.somename is used while assembling that
glyph (e.g. in Vollkorn). Others are obscure except to typographers:
orogate (texgyre) or yi_yicyr (ligature for two cyrillic yi which
are like latin i-with-diaeresis). That last one initially left me
really confused because at first I was stripping out underscores
(latin ligatures are things like f_f but the adobe glyphlist has
'ff'.) and wondered what a cyrillic 'yiyi' might be ;-)
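
The way round that confusion is to try the name exactly as reported
first, and only strip the underscores if that fails. A minimal sketch of
that lookup order: lookup_codepoint is a hypothetical helper, and it
assumes a glyphlist.txt-style file with one 'name;HEX' entry per line.

```shell
# lookup_codepoint NAME GLYPHLIST
# Print U+HEX for NAME. Tries the exact name first, then the name with
# underscores removed (f_f -> ff). If an entry lists several codepoints,
# only the first is printed. Returns 1 if neither form is listed.
lookup_codepoint() (
	name=$1
	list=$2
	for try in "$name" "$(printf '%s' "$name" | tr -d '_')"; do
		hit=$(awk -F';' -v n="$try" '$1 == n { print $2; exit }' "$list")
		if [ -n "$hit" ]; then
			printf 'U+%s\n' "${hit%% *}"
			return 0
		fi
	done
	return 1
)

# Hypothetical usage:
#   lookup_codepoint f_f /path/to/glyphlist.txt || echo "work item?"
```

A name that fails both lookups is then a candidate work item rather than
a codepoint.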

I also came across one instance of a separate SC font (I use my
normal C++ prog to read installed separate caps/smallcaps fonts)
which has 4 items (combining diacriticals) that are not in the main
fonts.  Made an exception for that in the other script where I add
' +SC' to the existing codepoints file, so that I do not flag it up
as "something was wrong in what I extracted".

Overall, this is good enough for me.  I'll attach the current
version of my bash script.  At first I was documenting all the items
which I knew created warnings.  But then I ran the script against a
recent version of Junicode and got a lot more.
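
The attached script writes to fixed names in /tmp, which its own comment
describes as insecure. If that ever matters, a private scratch directory
from mktemp is one option. A sketch only: make_workdir and the
'sc-items' template are hypothetical names, not something the script
uses.

```shell
# make_workdir: create a private scratch directory and print its path.
# mktemp -d creates the directory itself, with an unpredictable name.
make_workdir() {
	mktemp -d "${TMPDIR:-/tmp}/sc-items.XXXXXX"
}

# Hypothetical usage inside a script:
#   WORK=$(make_workdir) || exit 2
#   trap 'rm -rf "$WORK"' EXIT
#   otfinfo -g "$1" | grep '\.sc' >"$WORK/possible-sc-items"
```

This would also allow more than one copy of the script to run at a time.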

ĸen
-- 
This is magic for grown-ups; it has to be hard because we know there's
no such thing as a free goblin.
   -- Pratchett, Stewart & Cohen - The Science of Discworld II
-------------- next part --------------
#!/bin/bash

# Known OpenType items for *assembling* a glyph which create warnings:
#  Most of these are from texgyre.
# aogonekacute - a with ogonek and acute accent
# cybreve - the cyrillic breve is a different shape from the latin breve.
# eogonekacute - e with ogonek and acute accent
# idotaccent - apparently used for creating i with dot and accent in turkish.
# iogonekacute - i with ogonek and acute accent
# i_dot in FreeSerif - assume idotaccent although the font lacks accented SC
# jacute - j with acute accent
# oogonekacute - o with ogonek and acute accent
# orogate - an old (14th century CE) Polish o with vertical lines above and below it
#  see https://www.unicode.org/L2/L2021/21039-old-polish-o.pdf
# ubrevebelowinverted - u with inverted breve below it
# ustraitcy - cyrillic straight u, for adding diacriticals (Vollkorn)
# ustraitstrokecy - cyrillic straight u with stroke for adding diacriticals (Vollkorn)
# yi_yicy - a ligature for ukrainian yi yi (Vollkorn)
#
# Identifications are often based on random posts google found at forum.glyphsapp.com/
# which is for a macOS font editor.
#
# A mediaevalist font such as Junicode has a lot more like this.

# export so that grep, sed and sort also run in the C locale
export LC_ALL=C
VARIANT=0 # normal, items are *.sc

# Look at a font which contains small caps, find the codepoints they cover.
test -f /usr/bin/otfinfo || echo "you need to install lcdf-typetools"
test -f /usr/bin/otfinfo || exit 2

# point to list of glyphs from
# https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt
GLYPHLIST=/sources/scripts/font-analysis/glyphlist.txt

test -r "$GLYPHLIST" || echo "cannot read $GLYPHLIST"
test -r "$GLYPHLIST" || exit 2
if [ "$#" -ne 1 ]; then
	echo "pass the /full/path/to/filename.{otf,ttf} as a single argument"
	exit 2
fi

# Insecure temp files, assumes only running one of these at a time
# so clear out any temp files from previous run
>/tmp/possible-sc-items
>/tmp/all-sc-items
>/tmp/named-sc-items
>/tmp/uni-sc-items
>/tmp/uninum-sc-items
>/tmp/alpha-sc-items
>/tmp/alnum-sc-items
echo "looking for all possible small caps in $1"
otfinfo -g "$1" | grep '\.sc' >/tmp/possible-sc-items
if [ $? -ne 0 ]; then
	echo "did not find any '.sc' in $1, looking for '^sc.'"
	otfinfo -g "$1" | grep '^sc\.' >/tmp/possible-sc-items
	if [ $? -eq 0 ]; then
		VARIANT=1
	else
		echo "No Small Caps found in $1"
		exit
	fi
fi

if [ $VARIANT = "0" ]; then
	# Names can carry extra suffixes, e.g. aacute.SngStory.sc,
	# so cutting at the first '.' would lose the '.sc'
	echo "reducing to only the '.*.sc' small caps"
	#cat /tmp/possible-sc-items | grep '\.sc' | cut -d '.' -f 1  >/tmp/all-sc-items
	# keep everything up to sc but lose anything after
	# (every line in possible-sc-items already contains '.sc')
	sed 's/\(.*\.sc\).*/\1/' /tmp/possible-sc-items >/tmp/all-sc-items
else
	# simulate the common case
	# fix up odd variants ssharp  : germandbls
	# i_dot : assume idotaccent
	echo "reducing to only the '^sc.' small caps"
	cat /tmp/possible-sc-items | sed 's/^sc\.//' |
	sed -e 's/ssharp/germandbls/' -e 's/i_dot/idotaccent/' >/tmp/all-sc-items
fi

# now split into uni items and named items
echo "splitting into names and uniNNNN"
while IFS= read -r line
do
	# texgyre fonts have prefixed variations of sc combining tilde,
	# h_uni0303.sc l_uni0303.sc t_uni0303.sc : reduce to uni0303
	# Vollkorn has items like _part.cheabkhasiancy.sc
	#
	echo "$line" | grep -q 'uni'
	if [ $? -eq 0 ]; then
		# uni-sc-items can include uni0434.loclBGR.sc,
		# uni006A0301.sc (both from Vollkorn)
		echo "$line" | sed -e 's/^.*uni/uni/' -e 's/\(^uni....\).*/\1/' >>/tmp/uni-sc-items
	else
		# Strip out '_part.someglyph.sc'
		echo "$line" | grep -q '^_part'
		if [ $? -eq 0 ]; then
			# I don't know the correct terminology
			echo "Ignoring $line, is an internal item, not a codepoint"
		else
			# Now strip off everything after the first '.'
			echo "$line" | sed 's/\..*//' >>/tmp/named-sc-items
		fi
	fi
done </tmp/all-sc-items

# it is possible that either file might have non-unique items
# so use sort -u for both, even though uni-sc-items are in order
# first, convert uni items to numbers
echo "converting uniNNNN to U+NNNN format"
cat /tmp/uni-sc-items | sed 's/uni/U+/' | sort -u > /tmp/uninum-sc-items

# the unicode data is ordered, A..Z,a..z
# so aim to read the table only once - in fact, simple
# repeated greps seem fast enough
echo "sorting named items into order"
cat /tmp/named-sc-items | sort -u >/tmp/alpha-sc-items

echo "Processing named items into unicode values"
# On a huge file, looping through the glyphs and matching might be worth
# the effort, but the number of small caps is not usually very large.
>/tmp/alnum-sc-items
while IFS= read -r line
do
	#ITEM=$(echo $line | cut -d ';' -f 1 | sed 's/_//g')
	ITEM=$(echo "$line" | cut -d ';' -f 1)
	#echo ITEM is $ITEM
	# Ligatures are reported as f_i f_l etc
	# but the glyphlist has ff ffi ffl fl etc
	# it looks as if I'm losing another - Fira is reported to have 'brevecy' ?
	# I might be missing a few more, but this is probably adequate.
	grep -q "^$ITEM;" "$GLYPHLIST"
	if [ $? -ne 0 ]; then
		# ligatures may be f_f etc, glyphlist has ff
		SHORT=$(echo "$ITEM" | sed 's/_//g')
		grep -q "^$SHORT;" "$GLYPHLIST"
		if [ $? -eq 0 ]; then
			#use the short name
			ITEM=$SHORT
		else
			echo "Warning, assume $ITEM is a work item, not a codepoint"
		fi
	fi
	grep "^$ITEM;" "$GLYPHLIST" | cut -d ';' -f2 | sed 's/\(^.*\)/U+\1/' >>/tmp/alnum-sc-items
done < /tmp/alpha-sc-items

# Finish by merging using sort -u and writing to $1-sc.codepoints in $CWD
echo "final sort"
# fonts such as FreeSerif do not specify a separate weight, so remove '.*'
NAME=$(basename "$1" | cut -d '-' -f1 | cut -d '.' -f1)
#echo NAME is $NAME

cat /tmp/alnum-sc-items /tmp/uninum-sc-items | awk '{ print $1 }' |
 sort -u >"$NAME-sc.codepoints"

exit