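Converting troff to HTML

by Phil Hughes
OK, to be more specific: converting troff that uses the mm macros to HTML. This is another one from my collection of "rusty scripts", written long ago for a specific task. In this case, we were converting some internal documentation from troff with mm macros to HTML.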

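troff is a very complicated system, so doing this job properly would take a lot of work. It is easy, however, to write something that eliminates 90% of the tedious conversion work. Paul Dunne wrote one for the ms macro set, but mm is different.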

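Since awk was (and perhaps still is) my tool of choice for "fix the data" programs, it was called upon for this task. Here is what I decided was worth doing.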

#!/usr/bin/awk -f
# troff -mm to HTML converter
# Phil Hughes--April 21, 1997
# Updates:
#
#
# This doesn't do everything and it is not intended to do everything.
# It does what is easy.  In particular, headers and the title are
# not mucked with.  Also, center requests only center one line (I think).
#
# The goal here was to do the stuff that is a total pain to do by hand.
# This includes section numbering, font changes (which is still rather
# dumb) and lists.  The output of this program should be considered
# a good starting point for making good HTML.
#
# Here is what is currently recognized:
#   .H - deals with heading levels
#   .P - maps to <p> (as does a blank line)
#   .BL - maps to <ul>
#   .AL - maps to <ol>
#   .LE - maps to the end of the list of the type most recently started
#   .LI - <li>
#   .ds - tossed
#   .ce - centers next line
#   \fB, \f(HB, \fI, - changes to bold, bold or italic
#   \fP - goes back to previous font
#   \(em - --
#   .PF, .PH - tossed
#   \s - tossed
# That's all folks.
#

BEGIN   {
        BLANKS = "                                                "
        # beginning HTML crap
        print "<html>"
        print "<head>"
        print "<title> ============</title>"
        print "</head>"
        print "<body>"
        }

        {       # always convert these things
                # yes, there is a lot to add here
        gsub(/\\\(em/, "--")    # \(em to --
        gsub(/\.?\\".*$/, "")   # trash comments: .\" lines and trailing \"
        gsub(/^\.PF.*$/, "")    # trash all sorts of headers & footers
        gsub(/^\.PH.*$/, "")    # trash all sorts of headers & footers
        gsub(/\\s[0-9+-][0-9]?/, "") # trash point size changes
        }

/^\.H / {       # heading
        head_level = $2

        head[head_level+1] = 0
        head[head_level+2] = 0
        head[head_level+3] = 0
        head[head_level+4] = 0
        head[head_level+5] = 0

        head[head_level]++
        $1 = ""
        $2 = ""
        gsub(/"/,"")

        printf "<h" head_level ">"
        for (x=1; x <= head_level; x++) {
                printf "%d.", head[x]
                }
        printf " "
        print $0 "</h" head_level ">"
        next
        }

/^ *$/  {       # paragraph
        print "<p>"
        next
        }

/^\.P *$/       {       # paragraph
        print "<p>"
        next
        }

/^\.BL/ {       # bulleted list
        print "<ul>"
        list[++ll] = "</ul>"
        indent += 2
        next
        }

/^\.AL/ {       # automatically numbered list
        print "<ol>"
        list[++ll] = "</ol>"
        indent += 2
        next
        }

/^\.LI/ {       # list item
        print substr(BLANKS, 1, indent) "<li>"
        next
        }


/^\.LE/ {       # list end
        print list[ll--]
        indent -= 2
        next
        }

/^\.ds/ {       # trash them
        next
        }

/^\.ce/ {       # center next line(s)--only does one line for now
        print "<p align=\"center\">"
        next
        }

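        {       # print whatever we have left

        # hard stuff like font changes where we need to remember
        n = split($0, tmp, "\\")
        # tmp[1] is the text before the first backslash, so start at 2;
        # walk by index because for (x in tmp) has no guaranteed order
        for (x = 2; x <= n; x++) {
                if (sub(/^fB/, "<b>", tmp[x]) == 1) {
                        new_sub = "</b>"
                        }
                if (sub(/^f\(HB/, "<b>", tmp[x]) == 1) {
                        new_sub = "</b>"
                        }
                if (sub(/^fI/, "<i>", tmp[x]) == 1) {
                        new_sub = "</i>"
                        }
                if (sub(/^fP/, new_sub, tmp[x]) == 1) {
                        new_sub = "#####"       # flags a further \fP with nothing to close
                        }
                }

        # indent once per line, then emit the pieces in their original order
        printf "%s", substr(BLANKS, 1, indent)
        for (x = 1; x <= n; x++) {
                printf "%s", tmp[x]
                }
        print ""
        }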

END     {
        # ending HTML crap
        print "</body>"
        print "</html>"
        }
                                                          

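Most of this is pretty mundane and brute-force. Note that the indentation I add to the output is purely cosmetic, to make it easier to see what is going on. The only "hard part" is dealing with the headings.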

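In troff with mm, headings take the form section.subsection.subsubsection ... followed by the text. For example, 3.5.1 This is a test would be a standard-looking heading. I put this together by using the HTML heading tag of the corresponding level and counting the section numbers at each level by hand. For those not familiar with awk, let's look at this piece of code: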

        head[head_level]++
        $1 = ""
        $2 = ""
        gsub(/"/,"")

        printf "<h" head_level ">"
        for (x=1; x <= head_level; x++) {
                printf "%d.", head[x]
                }
        printf " "
        print $0 "</h" head_level ">"

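The head array keeps track of the current section number at each level. After that counter is incremented, the next two lines look a little odd. In awk, $0 is the full input line, and its pieces (parsed using the current field separator) are assigned to $1, $2, and so on. If you assign to them, $0 is updated, so all this does is throw away the first two fields of the input line: the troff .H tag and the level number.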
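The effect of blanking fields is easy to check in isolation. The snippet below is a minimal sketch, not part of the original script: the sample heading text is made up, and the brackets around $0 are there only to make the result visible. Note that the separators of the emptied fields stay behind as leading spaces, which is harmless here because HTML collapses whitespace.

```shell
# Feed one mm heading line to awk and blank out the first two fields.
# The brackets in the output make the leftover leading spaces visible.
out=$(printf '%s\n' '.H 2 "Getting Started"' | awk '{
        $1 = ""          # drop the .H request
        $2 = ""          # drop the level number
        gsub(/"/, "")    # strip the quotes around the heading text
        print "[" $0 "]"
}')
echo "$out"
```

With a POSIX awk this should print the heading text inside the brackets, preceded by two spaces.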

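The for loop builds the section string, and the final print statement prints the original input line, minus the first two fields, and appends the matching closing heading tag (</h1>, </h2>, and so on) to it.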
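To see the numbering at work, here is a cut-down variant of the heading rule run over four made-up headings. It is a sketch rather than the script itself: it resets the deeper counters in a loop instead of five explicit assignments, builds the number in a variable (num, a name introduced here), and trims the leading spaces left by the blanked fields.

```shell
# Run four sample headings through a cut-down version of the heading rule.
out=$(printf '%s\n' '.H 1 "Intro"' '.H 2 "Scope"' '.H 1 "Usage"' '.H 2 "Options"' | awk '
/^\.H / {
        head_level = $2
        # starting a section resets the counters of the deeper levels
        for (x = head_level + 1; x <= head_level + 5; x++)
                head[x] = 0
        head[head_level]++
        $1 = ""; $2 = ""
        gsub(/"/, "")
        sub(/^ +/, "")          # drop the spaces left by the blanked fields
        num = ""
        for (x = 1; x <= head_level; x++)
                num = num head[x] "."
        print "<h" head_level ">" num " " $0 "</h" head_level ">"
}')
echo "$out"
```

The second level-1 heading bumps the top counter to 2 and zeroes the level-2 counter, so the last heading comes out as 2.1. rather than 2.2.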

As with my shell scripts, this is a one-off fix focused on the task at hand. Depending on what you do in your troff code, there may be other tags worth converting.
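As one possible example of such an addition, the sketch below maps mm displays (.DS and .DE) to an HTML pre block. This is an assumption about what might be worth adding, not something the original script does; the in_display flag is a name introduced here, and the final catch-all rule merely stands in for the rest of the converter. In the full script, rules like these would have to sit ahead of the blank-line rule and the catch-all block.

```shell
# Hypothetical extra rules: turn an mm display into <pre> ... </pre>.
out=$(printf '%s\n' '.DS' 'x = 1' '.DE' | awk '
/^\.DS/ { print "<pre>";  in_display = 1; next }   # display start
/^\.DE/ { print "</pre>"; in_display = 0; next }   # display end
in_display { print; next }    # copy display lines through untouched
{ print }                     # stand-in for the rest of the converter
')
echo "$out"
```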
