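Converting troff to HTML
Posted on August 13, 2008
OK, to be more specific: converting troff that uses the mm macros to HTML. This is another entry from my collection of "rusty scripts", written long ago for a specific task. In this case, we were converting some internal documentation from troff with the mm macros to HTML.
troff is a very complex system, so doing this job properly would take a lot of work. It is easy, however, to write something that eliminates 90% of the tedious conversion work. Paul Dunne did one for the ms macro set, which you can find here, but mm is different.
Since awk was (and maybe still is) my program of choice for "fixing data", it was called upon for this task. Here is what I decided was worth doing.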
#!/usr/bin/awk -f
# troff -mm to HTML converter
# Phil Hughes--April 21, 1997
# Updates:
#
#
# This doesn't do everything and it is not intended to do everything.
# It does what is easy. In particular, headers and the title are
# not mucked with. Also, center requests only center one line (I think).
#
# The goal here was to do the stuff that is a total pain to do by hand.
# This includes section numbering, font changes (which is still rather
# dumb) and lists. The output of this program should be considered
# a good starting point for making good HTML.
#
# Here is what is currently recognized:
# .H - deals with heading levels
# .P - maps to <p> (as does a blank line)
# .BL - maps to <ul>
# .AL - maps to <ol>
# .LE - maps to the end of the list of the type most recently started
# .LI - <li>
# .ds - tossed
# .ce - centers next line
# \fB, \f(HB, \fI - changes to bold, bold or italic
# \fP - goes back to previous font
# \(em - --
# .PF, .PH - tossed
# \s - tossed
# That's all folks.
#
BEGIN {
    BLANKS = sprintf("%40s", "")    # run of spaces used to indent list output
    # beginning HTML crap
    print "<html>"
    print "<head>"
    print "<title> ============</title>"
    print "</head>"
    print "<body>"
}
{   # always convert these things
    # yes, there is a lot to add here
    gsub(/\\\(em/, "--")            # \(em to --
    sub(/^\.\\".*$/, "")            # trash comment lines (.\" ...)
    sub(/\\".*$/, "")               # trash trailing comments (\" ...)
    sub(/^\.PF.*$/, "")             # trash all sorts of headers & footers
    sub(/^\.PH.*$/, "")             # trash all sorts of headers & footers
    gsub(/\\s[0-9+-][0-9]?/, "")    # trash point size changes (\s10 goes whole)
}
/^\.H / {   # heading
    head_level = $2
    head[head_level+1] = 0
    head[head_level+2] = 0
    head[head_level+3] = 0
    head[head_level+4] = 0
    head[head_level+5] = 0
    head[head_level]++
    $1 = ""
    $2 = ""
    gsub(/"/,"")
    printf "<h" head_level ">"
    for (x=1; x <= head_level; x++) {
        printf "%d.", head[x]
    }
    printf " "
    print $0 "</h" head_level ">"
    next
}
/^ *$/ {    # paragraph
    print "<p>"
    next
}
/^\.P *$/ { # paragraph
    print "<p>"
    next
}
/^\.BL/ {   # bulleted list
    print "<ul>"
    list[++ll] = "</ul>"
    indent += 2
    next
}
/^\.AL/ {   # alpha list
    print "<ol>"
    list[++ll] = "</ol>"
    indent += 2
    next
}
/^\.LI/ {   # list item
    print substr(BLANKS, 1, indent) "<li>"
    next
}
/^\.LE/ {   # list end
    print list[ll--]
    indent -= 2
    next
}
/^\.ds/ {   # trash them
    next
}
/^\.ce/ {   # center next line(s)--only does one line for now
    print "<p align=\"center\">"
    next
}
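{   # print whatever we have left
    # hard stuff like font changes where we need to remember
    n = split($0, tmp, /\\/)
    # tmp[1] is the text before the first backslash; each later piece
    # starts with whatever followed a backslash. Walk them by index,
    # because the order of "for (x in tmp)" is unspecified in awk.
    for (x = 2; x <= n; x++) {
        if (sub(/^fB/, "<b>", tmp[x]) == 1) {
            new_sub = "</b>"
        }
        if (sub(/^f\(HB/, "<b>", tmp[x]) == 1) {
            new_sub = "</b>"
        }
        if (sub(/^fI/, "<i>", tmp[x]) == 1) {
            new_sub = "</i>"
        }
        if (sub(/^fP/, new_sub, tmp[x]) == 1) {
            new_sub = "#####"   # so a stray \fP shows up in the output
        }
    }
    printf "%s", substr(BLANKS, 1, indent)  # indent once per line
    for (x = 1; x <= n; x++) {
        printf "%s", tmp[x]
    }
    print ""
}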
END {
    # ending HTML crap
    print "</body>"
    print "</html>"
}
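Most of this is pretty ordinary and brute force. Note that the indentation I add to the output is purely cosmetic, to make it easier to see what is going on. The only "hard part" is dealing with the headings.
In troff with mm, headings take the form section.subsection.subsubsection ... followed by text. For example, "3.5.1 This is a test" would be a standard-looking heading. I put this together by using the HTML heading tag of the corresponding level and counting the section numbers at each level by hand. For those not familiar with awk, let's look at this code: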
    head[head_level]++
    $1 = ""
    $2 = ""
    gsub(/"/,"")
    printf "<h" head_level ">"
    for (x=1; x <= head_level; x++) {
        printf "%d.", head[x]
    }
    printf " "
    print $0 "</h" head_level ">"
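The head array keeps track of the current section number at each level. After that counter is incremented, the next two lines look a bit odd. In awk, $0 is the complete input line, and its pieces (parsed using the current field separator) are assigned to $1, $2 and so on. If you assign to one of them, $0 is updated, so all this does is throw away the first two fields of the input line: the troff .H tag and the level number.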
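To see the field trick in isolation, here is a minimal sketch that feeds one heading line through awk from the shell and prints $0 before and after the assignments. The sample heading and the variable names `out` and `before` are made up for illustration; they are not part of the original script.

```shell
# Feed one .H line through awk and show $0 before and after clearing fields.
out=$(printf '%s\n' '.H 3 "This is a test"' | awk '
{
    before = $0
    $1 = ""          # assigning to a field makes awk rebuild $0 ...
    $2 = ""          # ... by joining all fields with OFS (a single space)
    gsub(/"/, "")    # drop the quotes around the heading text
    print "before: [" before "]"
    print "after:  [" $0 "]"
}')
printf '%s\n' "$out"
```

The two emptied fields are still joined in with a separator each, which is why the rebuilt line starts with two spaces. Because $0 is rebuilt with single spaces, any runs of whitespace in the heading text are collapsed as well.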
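The for loop builds the section number string, and the final print statement prints the original input line, minus the first two fields, and appends the matching closing tag (</h1>, </h2> and so on) to it.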
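The numbering is easier to follow on a few lines of input. The sketch below is a condensed variant of the heading rule, not the script's exact code: it resets the deeper counters in a loop, strips the leading spaces left by the cleared fields, and builds the number in a hypothetical `num` variable. The headings are invented sample data.

```shell
# Run a condensed heading rule over five sample .H lines.
out=$(printf '%s\n' \
    '.H 1 "Intro"' '.H 2 "Scope"' '.H 2 "Terms"' \
    '.H 1 "Design"' '.H 2 "Parser"' | awk '
/^\.H / {
    level = $2
    # entering a section resets the counters of the deeper levels
    for (i = level + 1; i <= level + 5; i++) head[i] = 0
    head[level]++
    $1 = ""; $2 = ""
    gsub(/"/, "")
    sub(/^ +/, "")   # remove the blanks left by the two cleared fields
    num = ""
    for (x = 1; x <= level; x++) num = num head[x] "."
    printf "<h%d>%s %s</h%d>\n", level, num, $0, level
}')
printf '%s\n' "$out"
```

The second level-1 heading comes out as "2. Design" and the heading after it as "2.1. Parser", because moving to a new level-1 section sets the level-2 counter back to zero before it is used again.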
As with my shell scripts, this is a one-off fix focused on the task at hand. Depending on what you do in your troff code, there may be other tags worth converting.
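As one illustration of such an extension, the sketch below handles the mm display macros .DS and .DE. Mapping them to <pre> is an assumption made for this example, not something the original script does, and the input lines are invented.

```shell
# Hypothetical extra rules: treat an mm display (.DS ... .DE) as <pre>.
out=$(printf '%s\n' 'Some text' '.DS' 'x = 1' '.DE' 'More text' | awk '
/^\.DS/ { print "<pre>";  next }   # display start
/^\.DE/ { print "</pre>"; next }   # display end
{ print }                          # stand-in for the catch-all rule
')
printf '%s\n' "$out"
```

In the full script, rules like these would need to sit above the final catch-all block, since awk tries rules in order and each one ends with next.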
