CS 313E: Bulko

Programming Assignment 4:
HTML Checker
Due Date: October 13, 11:59 pm

A markup language is a system for inserting annotations into a document. The most important feature of a markup language is that the tags it uses to indicate annotations should be easy to distinguish from the document content.

One of the most well-known markup languages is the one commonly used to create Web pages, called HTML, or "Hypertext Markup Language". In HTML, tags appear in "angle brackets" such as in "<html>". When you load a Web page in your browser, you don't see the tags themselves: the browser interprets the tags as instructions on how to format the text for display.

Most tags in HTML are used in pairs to indicate where an effect starts and ends. For example:

Note that "end" tags look just like the "start" tags, except for the addition of a forward slash (/) after the < symbol.

Sets of tags are often nested inside other sets of tags. For example, an ordered list is a list of numbered bullets. You specify the start of an ordered list with the tag <ol>, and the end with </ol>. Within the ordered list, you identify items to be numbered with the tags <li> (for "list item") and </li>. For example, the following specification:

   <ol>
      <li>First item</li>
      <li>Second item</li>
      <li>Third item</li>
   </ol>

would result in the following:

  1. First item
  2. Second item
  3. Third item

Notice how you start the ordered list with the <ol> tag, specify three list items with matching <li> and </li> tags, and then close the ordered list with the </ol> tag.

You may have noticed that the pattern of using matching tags strongly resembles the pattern of matching parentheses that we discussed in class: when you use parentheses, brackets, and braces, they have to match in reverse order, such as "{[()]}". A pattern such as "[(])" would be incorrect since the right bracket does not match the left parenthesis. Similarly, an HTML pattern such as <ol><li></ol></li> would be incorrect since the closing tags are in the wrong order.
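The parallel with parenthesis matching can be sketched in a few lines of Python. This is only an illustration, not the required design: the function name tags_match is hypothetical, and it assumes the tags have already been reduced to bare names, with closing tags written with a leading slash.

```python
def tags_match(tags):
    """Return True if every closing tag matches the most recent unclosed tag."""
    stack = []
    for tag in tags:
        if tag.startswith("/"):
            # A closing tag must match whatever is on top of the stack.
            if not stack or stack[-1] != tag[1:]:
                return False
            stack.pop()
        else:
            # An opening tag waits on the stack for its closing partner.
            stack.append(tag)
    # Anything left on the stack was opened but never closed.
    return not stack


print(tags_match(["ol", "li", "/li", "/ol"]))  # True
print(tags_match(["ol", "li", "/ol", "/li"]))  # False
```

The second call fails for the same reason "[(])" does: the closing tag that arrives is not the partner of the most recently opened tag.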

Your assignment is to write an "HTML Checker" program that takes as input an HTML file, and produces a report indicating whether or not the tags are correctly matched.

In addition, have your program build a list called "VALIDTAGS". As you iterate through your list of tags, check to see if the tag appears in VALIDTAGS. If it doesn't, add it to VALIDTAGS and print the message, "New tag XXX found and added to list of valid tags"
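One possible way to keep VALIDTAGS up to date is sketched below. The helper name record_tag is hypothetical, and the sketch assumes that a closing tag such as /li counts as the same tag as li; the assignment text does not say how closing tags should be treated, so adjust this to match your own reading.

```python
VALIDTAGS = []


def record_tag(tag, valid_tags):
    """Add a tag name to valid_tags the first time it is seen.

    Returns True if the tag was new, False otherwise.
    """
    # Assumption: "/li" and "li" are the same tag for this purpose.
    name = tag.lstrip("/")
    if name in valid_tags:
        return False
    valid_tags.append(name)
    print("New tag " + name + " found and added to list of valid tags")
    return True


for tag in ["ol", "li", "/li", "li", "/li", "/ol"]:
    record_tag(tag, VALIDTAGS)
```

With this sample input the message is printed once for ol and once for li; the repeated and closing tags are already known and produce no output.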

Input:

As input for your program, use this file! That is, use the file "http://www.cs.utexas.edu/~bulko/2017fall/313E.hw4.html". Download this file to your computer and rename it "htmlfile.txt".
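As a rough sketch of how the tags might be pulled out of the downloaded file, the code below uses a regular expression to collect each tag name, keeping the leading slash on closing tags and ignoring attributes. The function names extract_tags and read_tags are hypothetical, and the use of the re module is an assumption; the general requirements for the assignment may call for a different approach.

```python
import re

# Matches "<", optional spaces, then an optional "/" followed by a tag name.
# Attributes and the trailing "/" in tags like <br /> are ignored.
TAG_PATTERN = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")


def extract_tags(text):
    """Return the tag names found in text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def read_tags(path="htmlfile.txt"):
    """Read the whole file at path and return its tags."""
    with open(path, "r") as infile:
        return extract_tags(infile.read())


print(extract_tags("<ol><li>First item</li></ol><br />"))
```

Calling read_tags() with no arguments reads "htmlfile.txt" from the current directory.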

Output:

The output of your program should include the following:

Your program should end with one of the following:

      Error:  tag is XXX but top of stack is YYY

or

      Processing complete.  No mismatches found.

or

      Processing complete.  Unmatched tags remain on stack:  [XXX, YYY, ZZZ]

followed by labeled and sorted printouts of VALIDTAGS and EXCEPTIONS (see below).
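The two "Processing complete" endings could be produced along these lines. This is a sketch under assumptions: the helper names format_list, closing_message, and print_summary are hypothetical, and the labels used for the two printouts are a guess, since the assignment only asks that they be labeled and sorted. Note that printing a Python list directly would show quotes around each tag, which is why the sketch joins the names by hand.

```python
def format_list(items):
    """Format a list of tag names without quotes, e.g. [html, body, b]."""
    return "[" + ", ".join(items) + "]"


def closing_message(stack):
    """Return the final status line once every tag has been processed."""
    if stack:
        return ("Processing complete.  Unmatched tags remain on stack:  "
                + format_list(stack))
    return "Processing complete.  No mismatches found."


def print_summary(stack, valid_tags, exceptions):
    """Print the final status line, then both lists in sorted order."""
    print(closing_message(stack))
    # Assumption: these labels are illustrative, not prescribed.
    print("VALIDTAGS:  " + format_list(sorted(valid_tags)))
    print("EXCEPTIONS:  " + format_list(sorted(exceptions)))


print_summary(["html", "body"], ["ol", "li", "html", "body"], ["meta", "br"])
```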

Plot Twist:

There are some tags that do not need matching start and end tags! One example is <br />. This tag is used to indicate a line break at the current location. Another is <meta>, which is used to provide special information ("metadata") about a Web page. There is one more, which is left for you to identify in your data file.

This means that if you followed the instructions above correctly, your HTML checker will notice that there are three tags that don't have a match. Teach your program that this is okay for these three cases by maintaining a list called EXCEPTIONS, which you hard-code into your main program. These tags will appear in your list of tags just like any other tags. However, when you iterate through the list and encounter one of them, you do not need to push it on the stack, since you won't be waiting for a close tag. Instead, just print an output line such as:

      Tag br does not need to match:  stack is still [html, body, b]

and continue.
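A minimal sketch of the main loop with the EXCEPTIONS check added is shown below. The function names show and process are hypothetical, and the tags are assumed to arrive as bare names with a leading slash on closing tags. Two details are guesses rather than requirements: whether the tag in the error message keeps its slash, and what to report when a closing tag arrives while the stack is empty. The EXCEPTIONS list here holds only the two tags named above; the third is still yours to find.

```python
# Tags that never get a closing tag. The assignment names br and meta;
# the third entry is left for you to identify in the data file.
EXCEPTIONS = ["br", "meta"]


def show(stack):
    """Format a stack the way the sample output does: [html, body, b]."""
    return "[" + ", ".join(stack) + "]"


def process(tags, exceptions):
    """Return the report lines produced while checking a list of tags."""
    lines = []
    stack = []
    for tag in tags:
        if tag in exceptions:
            # No close tag will ever arrive, so do not push.
            lines.append("Tag " + tag
                         + " does not need to match:  stack is still "
                         + show(stack))
        elif not tag.startswith("/"):
            stack.append(tag)
        elif stack and stack[-1] == tag[1:]:
            stack.pop()
        else:
            # Assumption: an empty stack is reported as "nothing".
            top = stack[-1] if stack else "nothing"
            lines.append("Error:  tag is " + tag
                         + " but top of stack is " + top)
            return lines
    return lines


for line in process(["html", "body", "b", "br", "/b", "/body", "/html"],
                    EXCEPTIONS):
    print(line)
```

Checking the exception list before anything else keeps those tags off the stack entirely, so they can never cause a mismatch later.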


General requirements: