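Converting troff to HTML
Posted on August 13, 2008
OK, to be more specific: converting troff that uses the mm macros to HTML. This is another one from my collection of "rusty scripts", written long ago for a specific task. In this case, we were converting some internal documentation from troff with the mm macros to HTML.
troff is a very complicated system, so doing this job properly would take a lot of work. However, it is easy to write something that removes 90% of the tedious conversion work. Paul Dunne wrote one for the ms macro set, which you can find here, but mm is different.
Since awk was (and maybe still is) my program of choice for "fixing data", it was called on to perform this task. Here is what I decided was worth doing.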
#!/usr/bin/awk -f
# troff -mm to HTML converter
# Phil Hughes--April 21, 1997
# Updates:
#
#
# This doesn't do everything and it is not intended to do everything.
# It does what is easy.  In particular, headers and the title are
# not mucked with.  Also, center requests only center one line (I think).
#
# The goal here was to do the stuff that is a total pain to do by hand.
# This includes section numbering, font changes (which is still rather
# dumb) and lists.  The output of this program should be considered
# a good starting point for making good HTML.
#
# Here is what is currently recognized:
#   .H - deals with heading levels
#   .P - maps to <p> (as does a blank line)
#   .BL - maps to <ul>
#   .AL - maps to <ol>
#   .LE - maps to the end of the list of the type most recently started
#   .LI - <li>
#   .ds - tossed
#   .ce - centers next line
#   \fB, \f(HB, \fI, - changes to bold, bold or italic
#   \fP - goes back to previous font
#   \(em - --
#   .PF, .PH - tossed
#   \s - tossed
# That's all folks.
#
BEGIN {
    # source of leading blanks; must be at least as long as the deepest indent
    BLANKS = "                                        "
    # beginning HTML crap
    print "<html>"
    print "<head>"
    print "<title> ============</title>"
    print "</head>"
    print "<body>"
}

{
    # always convert these things
    # yes, there is a lot to add here
    gsub(/\\\(em/, "--")            # \(em to --
    gsub(/.\\".*$/, "")             # trash comments
    gsub(/^\.PF.*$/, "")            # trash all sorts of headers & footers
    gsub(/^\.PH.*$/, "")            # trash all sorts of headers & footers
    gsub(/\\s[0-9+-][1-9]?/, "")    # trash point size changes
}

/^\.H / {       # heading
    head_level = $2
    head[head_level+1] = 0
    head[head_level+2] = 0
    head[head_level+3] = 0
    head[head_level+4] = 0
    head[head_level+5] = 0
    head[head_level]++
    $1 = ""
    $2 = ""
    gsub(/"/,"")
    printf "<h" head_level ">"
    for (x=1; x <= head_level; x++) {
        printf "%d.", head[x]
    }
    printf " "
    print $0 "</h" head_level ">"
    next
}

/^ *$/ {        # paragraph
    print "<p>"
    next
}

/^\.P *$/ {     # paragraph
    print "<p>"
    next
}

/^\.BL/ {       # bulleted list
    print "<ul>"
    list[++ll] = "</ul>"
    indent += 2
    next
}

/^\.AL/ {       # alpha list
    print "<ol>"
    list[++ll] = "</ol>"
    indent += 2
    next
}

/^\.LI/ {       # list item
    print substr(BLANKS, 1, indent) "<li>"
    next
}

/^\.LE/ {       # list end
    print list[ll--]
    indent -= 2
    next
}

/^\.ds/ {       # trash them
    next
}

/^\.ce/ {       # center next line(s)--only does one line for now
    print "<p align=\"center\">"
    next
}

{
    # print whatever we have left
    # hard stuff like font changes where we need to remember
    n = split($0, tmp, "\\")
    printf "%s", substr(BLANKS, 1, indent)
    # walk the pieces in order; "for (x in tmp)" does not guarantee order
    for (x = 1; x <= n; x++) {
        if (sub(/^fB/, "<b>", tmp[x]) == 1) {
            new_sub = "</b>"
        }
        if (sub(/^f\(HB/, "<b>", tmp[x]) == 1) {
            new_sub = "</b>"
        }
        if (sub(/^fI/, "<i>", tmp[x]) == 1) {
            new_sub = "</i>"
        }
        if (sub(/^fP/, new_sub, tmp[x]) == 1) {
            new_sub = "#####"
        }
        printf "%s", tmp[x]
    }
    print ""
}

END {
    # ending HTML crap
    print "</body>"
    print "</html>"
}
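Most of this is pretty mundane, brute-force stuff. Note that the indentation I add to the output is purely cosmetic, to make it easier to see what is going on. The only "hard part" is dealing with headings.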
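The list handling and the cosmetic indentation can be tried in isolation. The following is only a sketch that copies the list rules from the script into a shell function; the name convert_lists and the sample input are made up for illustration, and the 20-character BLANKS string is an assumption about how wide the original was.

```shell
# convert_lists is a hypothetical helper name, used only for this illustration.
convert_lists() {
    awk '
    BEGIN   { BLANKS = "                    " }    # 20 blanks (assumed width)
    /^\.BL/ { print "<ul>"; list[++ll] = "</ul>"; indent += 2; next }
    /^\.AL/ { print "<ol>"; list[++ll] = "</ol>"; indent += 2; next }
    /^\.LI/ { print substr(BLANKS, 1, indent) "<li>"; next }
    /^\.LE/ { print list[ll--]; indent -= 2; next }  # pop the matching close tag
            { print substr(BLANKS, 1, indent) $0 }   # ordinary text, indented
    '
}

# A bulleted list with a nested ordered list inside it.
printf '.BL\n.LI\nFruit\n.AL\n.LI\nApple\n.LE\n.LE\n' | convert_lists
```

The list array acts as a stack of closing tags, so each .LE emits whichever of </ul> or </ol> matches the most recently opened list, and the indent grows by two blanks per nesting level.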
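In troff with mm, a heading comes out in the form section.subsection.subsubsection ... followed by the text. For example, "3.5.1 This is a test" would be a standard-looking heading. I put this together by using the HTML heading tag for the corresponding level and counting the section numbers at each level myself. For those not familiar with awk, let's take a look at this piece of code: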
head[head_level]++
$1 = ""
$2 = ""
gsub(/"/,"")
printf "<h" head_level ">"
for (x=1; x <= head_level; x++) {
    printf "%d.", head[x]
}
printf " "
print $0 "</h" head_level ">"
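The head array keeps track of the current section number at each level. After the increment of the section number, the next two lines look a bit strange. In awk, $0 is the complete input line, and its pieces (parsed using the current field separator) are assigned to $1, $2 and so on. If you assign to them, $0 is updated, so all this does is throw away the first two fields of the input line: the troff .H tag and the level number.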
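The effect of assigning to fields is easy to check on its own. This is a minimal sketch, with a made-up helper name (drop_two_fields) and a made-up input line; the brackets are only there to make whitespace visible.

```shell
# drop_two_fields is a hypothetical helper name, used only for this illustration.
drop_two_fields() {
    awk '{
        $1 = ""             # blank the .H request
        $2 = ""             # blank the level number
        print "[" $0 "]"    # brackets make the leftover leading spaces visible
    }'
}

printf '.H 2 "Getting Started"\n' | drop_two_fields
# prints: [  "Getting Started"]
```

When a field is assigned, awk rebuilds $0 by joining the fields with the output field separator (a single space by default), which is why two leading spaces remain where the blanked fields used to be. HTML collapses that whitespace, so the script does not bother to strip it.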
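The for loop builds the section string, and the final print statement prints the original input line, minus the first two fields, with the appropriate closing heading tag (</h1>, </h2> and so on) appended to it.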
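To watch the counters at work, the heading rule can be run by itself on a few .H lines. This is a sketch rather than the script itself: the function name number_headings and the sample headings are invented, the reset is written as a loop instead of five assignments, and a sub() call that trims the leading spaces is added so the output is easier to read.

```shell
# number_headings is a hypothetical helper name, used only for this illustration.
number_headings() {
    awk '
    /^\.H / {
        level = $2 + 0
        # a new heading at this level restarts numbering at all deeper levels
        for (i = level + 1; i <= level + 5; i++) head[i] = 0
        head[level]++
        $1 = ""; $2 = ""        # drop the .H request and the level number
        gsub(/"/, "")           # strip the quotes around the heading text
        sub(/^ +/, "")          # trim leftover blanks (not in the original script)
        printf "<h%d>", level
        for (i = 1; i <= level; i++) printf "%d.", head[i]
        printf " %s</h%d>\n", $0, level
    }'
}

printf '.H 1 "Intro"\n.H 2 "Scope"\n.H 2 "Terms"\n.H 1 "Design"\n.H 2 "Parser"\n' |
    number_headings
# prints:
# <h1>1. Intro</h1>
# <h2>1.1. Scope</h2>
# <h2>1.2. Terms</h2>
# <h1>2. Design</h1>
# <h2>2.1. Parser</h2>
```

Note how the second level-1 heading resets the level-2 counter, so "Parser" becomes 2.1 rather than 2.3.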
As with my shell scripts, this was a one-off fix focused on the task at hand. Depending on what you do in your troff code, there may be other tags worth converting.