CS375 Term Project: An XML Parser Generator

Prof. Daniel P. Miranker

 

Assigned 10/14/98,

Due various dates, see the last section

Abstract:

 

Your term project is to build a compiler that takes as input an XML document type definition (DTD) and produces a program that parses a well-formed XML document and initializes an object with the data values extracted from the XML document.

 

There will be a prize of nominal but nontrivial value for the best project.

Introduction:

 

XML is a new standard for moving structured data over the Internet. Other standards include Java, CORBA and DCOM, to name just a few. XML has promise due to a very low bar for entry and a clear migration path from existing HTML applications. In particular, XML documents are text and in effect are also legal HTML documents.

 

An easy and nearly accurate explanation of XML is that, in addition to HTML tags, an XML document may distinguish certain printable strings as data by nesting those strings in an additional pair of tags. A separate file, called a document type definition (DTD), is used to specify the new tags. For the term project we assume the XML file contains a link to the DTD from which its tags are drawn. A DTD goes beyond declaring the names of the new tags: it can in fact define an object (or class) structure. Thus, an XML file, in text form, can be mapped to an object instance in binary form.
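
As a rough illustration of this text-to-object mapping, here is a minimal sketch in Python. The DTD shown in the comment and the Memo class are hypothetical and are not the contents of any of the course test files; the field names are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical DTD (NOT one of the course test files):
#   <!ELEMENT memo (to, from, body)>
#   <!ELEMENT to   (#PCDATA)>
#   <!ELEMENT from (#PCDATA)>
#   <!ELEMENT body (#PCDATA)>

@dataclass
class Memo:
    to: str
    sender: str  # "from" is a Python keyword, so the field is renamed
    body: str

# The XML text
#   <memo><to>Ann</to><from>Bob</from><body>Hi</body></memo>
# would correspond to this object instance:
memo = Memo(to="Ann", sender="Bob", body="Hi")
print(memo)
```

Each element declared in the DTD becomes a field, and each XML document that conforms to the DTD becomes one instance.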

 

Your assignment is to write a compiler that takes as input a DTD and creates as output a program that reads an XML file and produces the object representation of the data in the XML file.

 

Note that this project does not embody the intended commercial packaging and use of XML. In commercial practice it is intended that a browser, given an XML file, may fetch its DTD and in a highly interpreted fashion fill out an internal representation of the object. Browsers are being extended with an API which allows applications to query the browser and collect individual data values from the XML document. The APIs themselves are the subject of standards discussion. If you are curious, look up the Document Object Model (DOM) at the www.w3.org site.

 

 

XML resources:

 

The XML specification was included in your supplemental reading packet. It is also available at www.w3.org. Microsoft provides excellent web-based information on XML (www.microsoft.com/xml). One of their documents was also included in your supplemental reading packet. Other interesting links include

 

 

In general there is ample information on XML on the web. If you turn up something that you think is particularly useful, please send the URL to the TA and me and we will put it on the class web page for everyone's benefit. BTW, the web-based resources include public domain XML parsers. You are not to import any external code into your projects.

 

 

 

The Project Proper:

 

Organization

 

The project breaks down into four code-related components (see Figure 1):

 

  1. The input, which is given.
  2. Your compiler.
  3. The output of your compiler.
  4. A runtime library, which will contain most of the code for the resultant XML parsers.

 

Input:

 

The TA has collected 8 DTD files and organized them into 4 groups (see Table 1). They are available from the class web page. The grouping represents a hierarchy of increasingly complex XML documents. This hierarchy will be the basis for the portion of the term project grade related to the number of XML features you have implemented.

 

 

Level   XML Features                                           File Names

1       No attributes, finite nesting and no entity elements   memo1.dtd, faq1.dtd

2       Finite nesting of tags and no entity elements          memo2.dtd, faq2.dtd

3       Finite nesting of tags                                 memo3.dtd, faq3.dtd

4       All the above restrictions are removed                 play.dtd, tstmt.dtd

 

Table 1. 4 Levels of Project Accomplishment

 

For each DTD there are one or more test XML files. We will test your projects by trying additional XML files. We will not test your projects by asking you to compile additional DTDs. Thus, in the worst form of specification, the precise subset of XML required for each group is whatever it takes to succeed on that group's particular DTDs.

 

The only additional specification on the input is,

 

 

Your Compiler:

 

You are not required to use LEX and YACC or their free cousins, but you are required to use compiler tools to build your compiler. If you substitute other tools, you must inform me of the substitution you are making.

 

Note that your compiler has to produce two things. It has to produce code to parse the XML. It also has to create object definition(s) for representing the extracted data.
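
The second of these outputs, the object definitions, can be illustrated with a minimal sketch. The names class_source and ELEMENT_RE are hypothetical, and the element declaration is made up for the example. The sketch uses a regular expression only for brevity, which is not a substitute for the compiler tools the project requires, and it handles only an element whose content is a plain comma-separated sequence of child names (no #PCDATA, no occurrence indicators).

```python
import re

# Matches only declarations of the form <!ELEMENT name (child, child, ...)>.
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")

def class_source(decl: str) -> str:
    """Return Python source for a class with one attribute per child element."""
    m = ELEMENT_RE.fullmatch(decl.strip())
    if m is None:
        raise ValueError(f"unsupported declaration: {decl!r}")
    name, content = m.groups()
    fields = [f.strip() for f in content.split(",") if f.strip()]
    lines = [f"class {name.capitalize()}:", "    def __init__(self):"]
    if not fields:
        lines.append("        pass")
    for f in fields:
        lines.append(f"        self.{f} = None")
    return "\n".join(lines)

# Hypothetical declaration, not taken from the course DTDs.
source = class_source("<!ELEMENT memo (to, sender, body)>")
print(source)
```

The printed text is itself a class definition, which is the sense in which the compiler "creates" object definitions rather than objects.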

 

Within these constraints you are free to develop your term project in any language or development environment.

 

 

 

Runtime Library + output of your compiler:

 

It is recommended that you organize the runtime parse of the XML files as a very general XML parser such that your compiler may add additional grammar rules specific to the DTD. In other words you should write a base Lex and Yacc parsing program for XML, but not one that will work until your compiler produces a few additional Yacc rules.
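
The split between a general parser and DTD-specific additions can be sketched as follows. This is only an illustration under stated assumptions: it uses a hand-written recursive-descent routine driven by a table instead of Lex and Yacc, the table stands in for the "few additional Yacc rules" your compiler would emit, and the names parse_element and MEMO_RULES as well as the memo structure are hypothetical.

```python
# Generic part, identical for every DTD: a recursive-descent element parser.
def parse_element(tokens, rules, name, pos=0):
    """Parse element `name` at tokens[pos]; return (value, next position).

    `rules` maps an element name to the ordered list of its child element
    names; an empty list means the element holds text only.
    """
    if tokens[pos] != ("OPEN", name):
        raise SyntaxError(f"expected <{name}> at token {pos}")
    pos += 1
    if rules[name]:
        value = {}
        for child in rules[name]:
            value[child], pos = parse_element(tokens, rules, child, pos)
    else:
        kind, value = tokens[pos]
        if kind != "TEXT":
            raise SyntaxError(f"expected text inside <{name}>")
        pos += 1
    if tokens[pos] != ("CLOSE", name):
        raise SyntaxError(f"expected </{name}> at token {pos}")
    return value, pos + 1

# DTD-specific part: the kind of table a compiler might emit for a
# hypothetical DTD containing <!ELEMENT memo (to, body)>.
MEMO_RULES = {"memo": ["to", "body"], "to": [], "body": []}

# Tokens as a lexer might deliver them for
#   <memo><to>Ann</to><body>Hi</body></memo>
tokens = [("OPEN", "memo"), ("OPEN", "to"), ("TEXT", "Ann"), ("CLOSE", "to"),
          ("OPEN", "body"), ("TEXT", "Hi"), ("CLOSE", "body"), ("CLOSE", "memo")]

memo, end = parse_element(tokens, MEMO_RULES, "memo")
print(memo)  # {'to': 'Ann', 'body': 'Hi'}
```

The generic routine cannot parse anything until a rules table is supplied, which mirrors the recommendation that the base parser should not work until the compiler adds its DTD-specific rules.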

 

Output:

 

Once you have completed all processing, pretty print the object parsed out of the XML. (No irony intended.)
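
One possible shape for the pretty printer is sketched below. It assumes, purely for illustration, that the parsed object is held as nested dictionaries and lists; the assignment leaves the object representation to you, and the function name pretty_lines and the sample data are hypothetical.

```python
def pretty_lines(name, value, indent=0):
    """Return the lines of an indented rendering of `value` labelled `name`."""
    pad = "  " * indent
    if isinstance(value, dict):
        # A nested element: print its name, then its children one level deeper.
        lines = [f"{pad}{name}:"]
        for child_name, child_value in value.items():
            lines.extend(pretty_lines(child_name, child_value, indent + 1))
        return lines
    if isinstance(value, list):
        # A repeated element: print each occurrence under the same name.
        lines = []
        for item in value:
            lines.extend(pretty_lines(name, item, indent))
        return lines
    return [f"{pad}{name}: {value}"]

memo = {"to": "Ann", "body": "Hi", "cc": ["Cy", "Di"]}
print("\n".join(pretty_lines("memo", memo)))
```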

 

Approach (as in hints):

 

Familiarize yourself with XML and the test DTDs. For the XML files and the DTD files, develop a notion of what work should be accomplished as lexical analysis and what work should be accomplished as syntax analysis. (i.e. what gets done in Lex and what gets done in Yacc). Note there will be considerable syntactic overlap between parsing the DTD and XML files, but they need not be identical.

 

Develop a lexer for XML. This part should be identical for all DTDs. For two different DTDs, develop by hand YACC rules for parsing each DTD and constructing an object. Observe which YACC rules are identical for the two DTDs. Observe which YACC rules are very similar. The rules that are identical will form the core of a YACC program for all DTDs. The pairs of rules that are similar represent the output your compiler must generate.
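
To give a feel for the token categories such a lexer might distinguish, here is a minimal sketch using Python's re module. It is an illustration only and not a replacement for Lex or another compiler tool; the token names OPEN, CLOSE and TEXT are assumptions, and tags with attributes, comments and entities are deliberately not handled.

```python
import re

# One alternative per token kind; the group name doubles as the token name.
MASTER = re.compile(
    r"</(?P<CLOSE>[A-Za-z_][\w.-]*)>"   # closing tag, e.g. </q>
    r"|<(?P<OPEN>[A-Za-z_][\w.-]*)>"    # opening tag without attributes
    r"|(?P<TEXT>[^<]+)"                 # character data between tags
)

def lex(text):
    """Yield (kind, value) pairs; whitespace-only text is skipped."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "TEXT":
            value = value.strip()
            if not value:
                continue
        yield kind, value

tokens = list(lex("<faq> <q>Why?</q> </faq>"))
print(tokens)
```

Because nothing in MASTER mentions a particular tag name, the same lexer serves every DTD, which is the property the paragraph above asks for.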

 

Study groups and acceptable use policy:

 

The T.A. will announce additional hours as the basis of term project study groups. There will be two groups. They will meet at least twice. Attendance is voluntary.

 

Students are also welcome to organize their own study groups. All possible discussions about the term project are allowed. Ideas may be shared and replicated. But not one line of code may be shared or borrowed, neither from other students nor from web-based code resources.

 

 

 

Individual Milestones:

 

(The numbers do not necessarily indicate a temporal development sequence)

 

1) develop the LEX for

a) the DTDs

b) the XML files

 

You should be able to lexically analyze files from all four groups.

 

2) define the mapping from DTD to object (class definitions).

 

Note that in group 4 you must target a dynamic data structure (e.g. a linked list).

 

3) develop the YACC for parsing the DTDs

4) generate object definitions for the DTDs

5) define the target YACC rules for your compiler

6) generate code

7) test the resulting parsers

8) write term project report; instructions for this to come later
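
The note under milestone 2 about group 4 can be illustrated with a short sketch. Once nesting is no longer finite, the object cannot have a fixed number of fields, so the class must hold a growable collection of children. The Section class and the recursive declaration in the comment are hypothetical and are not taken from play.dtd or tstmt.dtd; a Python list stands in for the linked list mentioned in the note.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical recursive declaration: <!ELEMENT section (title, section*)>
@dataclass
class Section:
    title: str
    # Any number of nested sections, so a dynamic container is required.
    subsections: List["Section"] = field(default_factory=list)

def depth(section: Section) -> int:
    """Return the number of nesting levels at or below `section`."""
    return 1 + max((depth(s) for s in section.subsections), default=0)

doc = Section("A", [Section("A.1", [Section("A.1.a")]), Section("A.2")])
print(depth(doc))  # 3
```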

 

Due dates:

 

The final project is due November 30, 9:00AM. Projects turned in late but before Dec. 7 at 9:00AM will be penalized 1/2 a letter grade. Projects turned in later than that will be penalized no less than 1 letter grade.

 

Intermediate milestones also have due dates.

 

Nov. 2 demonstrate lexer for DTD

Nov. 2 demonstrate lexer for XML

Nov. 9 demonstrate syntax checking for group 1 DTDs.

Nov. 23 produce object definitions for group 1 DTDs.

 

Note: These intermediate due dates represent my estimation of minimum progress toward a minimal level 1 project. Students who fall behind this schedule are unlikely to complete a satisfactory term project. Students who are not way ahead of this schedule will not achieve the fourth level of success.