Aggregate Data Types

Most programming languages have a facility for grouping different kinds of variables together to form a new type of aggregate variable that can be manipulated as a whole or as parts. Often these types of variables are called "records" or "structures." This way we can define arbitrarily complex abstract data types and have them appear to the user (i.e. the programmer using the ADT) as a simple type.

`struct` and `typedef`

In C, the way this is done is through struct declarations. The syntax of such a declaration is like this:

struct tag {
	type₁	field₁;
	type₂	field₂;
	...
	type_n	field_n;
} varname₁, varname₂, ..., varname_m;

This declares a variable, not of a type like int or float that we're used to, but of type "structure of a bunch of things," where the bunch of things is declared, embedded in the type. For example, suppose we want to have a variable capable of holding a student ID, classification ('f' for freshman, 's' for sophomore, 'j' for junior, 'S' for senior, or 'g' for graduate student), and GPA. We could declare a variable for a student like this:

struct student {
	int	id;
	char	classification;
	float	gpa;
} bob;

The name of the variable is bob. The "tag" student gives a name to the structure type, so that the next time we want to declare a student variable, we need only remind the system of the tag:

struct student alice;

Now we have two struct student variables, alice and bob. We can modify the fields (parts of the struct) using the . (dot) operator:

	bob.gpa = 4.0; 		/* good student */
	bob.id = 1234567;
	bob.classification = 'f';

or, we can treat the variable as a unit. If alice is just like bob, we can do this (note: this is only guaranteed to work in ANSI C and C++):

	alice = bob;

structs can be passed to functions by value (again, only in ANSI C and C++) or as a pointer to a structure. Unlike an array, a struct passed by value is copied, so a function must be given a pointer to the struct if its modifications are to be kept in the original argument.
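A minimal sketch of the difference, assuming the struct student declaration above. The function names promote_by_value and promote_by_pointer are made up for illustration:

```c
/* the struct student declaration from above, repeated so that
 * this compiles on its own
 */
struct student {
	int	id;
	char	classification;
	float	gpa;
};

/* hypothetical helper: s is a copy, so the caller's struct is
 * left untouched
 */
void promote_by_value (struct student s) {
	s.classification = 's';
}

/* hypothetical helper: s points at the caller's struct, so the
 * change is kept
 */
void promote_by_pointer (struct student *s) {
	(*s).classification = 's';
}
```

If bob.classification is 'f', then promote_by_value (bob) leaves it as 'f', while promote_by_pointer (&bob) changes it to 's'.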

C had been evolving for some time when a new keyword was added: typedef. This keyword allows the programmer to make up new type names that work just like the basic type names in variable declarations. A typedef statement looks just like a variable declaration, except that it's preceded by the word typedef. For example:

typedef int MyInt;

Now, instead of having a variable called MyInt, we have a new type called MyInt, and we can declare things to be of that type:

MyInt	a, b, c;

Now a, b, and c are all of type int. This seems pretty useless at first, but if we combine it with struct, we have an elegant way to declare an abstract data type that appears just like a basic type to the user. For example:

typedef struct student {
	int	id;
	char	classification;
	float	gpa;
} student_t;

Now student_t is a type, just like int or char, and we don't have to go around writing struct every time we want another variable of that type. Student record declarations now look like this:

student_t	alice, bob;

And if you didn't know any better, you'd think student_t had been built into the C language. We can even declare whole arrays of record types, e.g.:

student_t	students[1000];

This declares an array of student structures. They can be modified with e.g.:

	int	i;

	for (i=0; i<1000; i++) {
		students[i].id = something;
		students[i].classification = something else;
		etc...
	}
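A compilable version of that loop might look like the sketch below. The function name init_students and the initial values (IDs counting up from 0, everyone a freshman, GPA of 0.0) are assumptions made for illustration, standing in for the "something" placeholders.

```c
typedef struct student {
	int	id;
	char	classification;
	float	gpa;
} student_t;

/* hypothetical helper: give each of the n records in students[]
 * a placeholder value
 */
void init_students (student_t students[], int n) {
	int	i;

	for (i=0; i<n; i++) {
		students[i].id = i;			/* assumed: IDs are 0, 1, 2, ... */
		students[i].classification = 'f';	/* assumed: all start as freshmen */
		students[i].gpa = 0.0;			/* assumed: no grades yet */
	}
}
```

With the array declared above, the call would be init_students (students, 1000).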

(The "tag" part of the struct, student in this case, is obviated by the stronger type name, student_t. Some compilers allow you to omit it entirely in some cases. I usually put in a tag with a similar name to the type, and then ignore it. If you use the same name for the tag and the type, some braindead compilers might balk. As we will see with linked lists, the tag is necessary when we have to refer to the struct before we have given it a type name.)

Pointers

Memory in the computer is organized into bytes, each with its own unique address, giving its location within the memory system. Some computers have a "linear" address space, so that you can think of memory as an enormous array of bytes, with the address of each byte equal to its index in the array. Other computers use more complicated addressing schemes, but the linear address space is still a useful metaphor.

Bytes in the computer can be organized into groups so that they can function as different types of variables; for example, integers (4 bytes on the Sun), doubles (8 bytes on the Sun), and other basic types, as well as aggregates (e.g. arrays and structs in C) of basic types.

Before a program begins to run, the operating system allocates a certain area of the computer's memory to hold the program and all of its variables. As long as your program refers to values stored in memory in this area, everything runs OK. When your program tries to access memory outside its area (for example, by indexing an array out of bounds), the system will complain bitterly, usually causing your program to stop running. This is to protect other programs from unauthorized (and probably accidental) access from your program.

A pointer is the programming language representation of the memory address of a variable. A pointer holds information about the location of a variable, not its value. The variable can be modified through the pointer.

A pointer must point to something valid, i.e., a memory location allocated to your program, before the value it points to is accessed. If not, the system will complain.

You can declare variables to be pointers. Pointer variables are declared to point to a certain type. Once the pointer contains a valid memory address, you can modify or read the contents of that memory (and sometimes adjacent memory) in a variety of ways.

Pointers in C

The pointer to a variable can be found using the & (ampersand, or address-of) operator. A pointer to a certain type can be declared by writing the declaration for a variable of that type, then preceding the name of the variable with the * (asterisk, star, "dereferencing") operator. The memory a pointer points to can be accessed by preceding a reference to a pointer by a *; this is called dereferencing the pointer. Here is an example of these ideas:

#include <stdio.h>	/* printf */
#include <stdlib.h>	/* exit */

int main () {
	float	a;	/* a is a float variable */
	float	*p;	/* p is a pointer to float */

	/* p is uninitialized.  let's give it the address of a */

	p = &a;

	/* now *p is an "alias," or another name for, a.  let's modify a. */

	*p = 23;

	/* let's print out the value of a */

	printf ("%f\n", a);

	/* the output should be 23.000000, even though we never
	 * explicitly changed a
	 */
	exit (0);
}

(Note that, since p above contained uninitialized "garbage" before the statement p = &a;, any reference to *p before that statement would most likely have resulted in a "Segmentation fault" on Unix, "Access violation" on some other operating systems, and, worse, silent failures in operating systems and/or computers without memory protection.)

Pointers (in)to Arrays

A pointer can be indexed with the same [] operator we use to reference arrays. For example:

	int	v[100], *p;

	p = &v[0];	/* now p is an alias for v */
	p[5] = 123;	/* v[5] is now 123 */

We can also point a pointer to somewhere in the middle of an array, and it will act like a little array starting in the middle of the bigger array:

	int	v[100], *p;

	p = &v[20];	/* p points to memory starting at v[20] */
	p[5] = 123;	/* v[25] is now 123 */

Pointer Arithmetic

A limited form of arithmetic can be done on pointers. Since arrays (and thus things that pointers can point to) are laid out consecutively in memory, adding 1 to a pointer shifts it to the next item in the array pointed to. This makes no sense if the pointer is pointing to a single variable, only in the context of an array. For example:

	int	v[100], *p, i;

	p = &v[20];

	/* let i go from 0 through 19 */

	for (i=0; i<20; i++) {
		*p = 0;	/* what p points to is assigned zero */
		p++;	/* increment p so that it points to the 
			   next integer in the array */
	}

This puts zeros in v[20..39]. Note that *p and p[0] mean the same thing. Also, *(p+1) and p[1] mean the same thing, and in general *(p+i) and p[i] are the same thing. (It turns out the brackets [] are just a convenient notation for what is really pointer arithmetic.)
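To make the equivalence concrete, here is a small sketch; the two function names are made up for illustration. Both functions add up n ints starting at p, one with brackets and one with explicit pointer arithmetic, and they return the same result for the same arguments.

```c
/* hypothetical helper: sum p[0..n-1] using array notation */
int sum_index (int *p, int n) {
	int	i, sum = 0;

	for (i=0; i<n; i++) {
		sum += p[i];
	}
	return sum;
}

/* hypothetical helper: the same sum using pointer arithmetic */
int sum_pointer (int *p, int n) {
	int	i, sum = 0;

	for (i=0; i<n; i++) {
		sum += *(p+i);	/* same as p[i] */
	}
	return sum;
}
```

With int v[100], the call sum_pointer (&v[20], 20) adds up v[20..39].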

Pointers to Structures

We can have pointers to structures, too. For example:

typedef struct _foo {
	int	a;
	double	b;
	char	c;
} foo;

int main () {
	foo	bar, 	/* 'bar' is of structure type 'foo' */
		*q;	/* 'q' is of type pointer to 'foo' */

	q = &bar;
	(*q).a = 1;
	(*q).b = 3.14159;
	(*q).c = 'h';
	...
}

So q points to the structure bar in this example. Writing (*q).a can become tedious, so C has another syntax that means the same thing and looks a little more intuitive. The -> (points-to) operator dereferences a pointer with respect to a certain field in the structure being pointed to, e.g.:

	q = &bar;
	q->a = 1;
	q->b = 3.14159;
	q->c = 'h';
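The -> operator is most useful when a structure is handed to a function as a pointer. A minimal sketch, assuming the foo type above; the function name set_foo is made up for illustration:

```c
typedef struct _foo {
	int	a;
	double	b;
	char	c;
} foo;

/* hypothetical helper: fill in the structure that q points to */
void set_foo (foo *q, int a, double b, char c) {
	q->a = a;	/* same as (*q).a = a */
	q->b = b;
	q->c = c;
}
```

Calling set_foo (&bar, 1, 3.14159, 'h') has the same effect on bar as the three assignments above.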

The `NULL` Pointer

A special pointer value in C is called NULL. It is written as the constant 0 in source code (although its internal representation isn't necessarily all zero bits), and doesn't point anywhere. It is used to signal a special condition, like an error or the end of a list of pointers. You can test for it explicitly, e.g. if (p == NULL) ..., or by looking at it as a truth value, e.g. if (!p) ....
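As a sketch of NULL used to signal "not found," consider the function below. The name find_int and its parameters are made up for illustration.

```c
#include <stddef.h>	/* defines NULL */

/* hypothetical helper: return a pointer to the first element of
 * v[0..n-1] equal to key, or NULL if there is none
 */
int *find_int (int *v, int n, int key) {
	int	i;

	for (i=0; i<n; i++) {
		if (v[i] == key) {
			return &v[i];
		}
	}
	return NULL;	/* nothing matched */
}
```

A caller would write p = find_int (v, 100, 123); and then test if (!p) before dereferencing p.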

Dynamic Allocation

The size of an array in C must be declared at compile-time, and can't be changed during run-time. However, C (and most other programming languages) provides us with a way to ask for as little or as much storage as we want (within the bounds set up by the operating system) and refer to that storage through a pointer. This is called dynamic allocation of storage.

Suppose, for example, that we want to allow the user to specify how many items should be in an array of floats (because the user might be entering them by hand or from a file later). We can do this:

	int	n;
	float	v[100];

	printf ("How many? ");
	scanf ("%i", &n);
	if (n > 100) {
		printf ("too bad.\n");
	} else ...

but this is not satisfying. We can always compile the program with extra memory, but this wastes space if the user doesn't need it and still doesn't work if the user needs even more. What we really want is to allocate exactly the space we need. We can do this with a standard C library function called malloc.

malloc accepts an integer parameter, the number of bytes wanted, and returns a pointer to that many newly allocated bytes (or NULL if the system failed to allocate enough memory). The pointer returned is of type void * in ANSI C (or char * with some older compilers), so it isn't the type of pointer we want. ANSI C will convert a void * to another pointer type on assignment, but older compilers and C++ will not, so we cast it to the type we want before we use it. We must also know how many bytes to allocate, so we have to find out the size, in bytes, of the data type we're allocating and multiply that by the number of items we want.

Here is how the above example would work with malloc:

#include <stdlib.h>	/* this is where malloc hangs out */
...
	int	n;
	float	*v;

	printf ("How many? ");
	scanf ("%i", &n);
	v = (float *) malloc (n * sizeof (float));
	if (!v) {
		printf ("unable to allocate enough memory, sorry!\n");
		exit (1);
	}
...

Let's look at that malloc call a little more closely. The (float *) part is a cast changing the type of whatever malloc returns to the type "pointer to float." The expression sizeof (float) uses the compiler's built-in sizeof operator to determine how many bytes there are in a float, so we can port the program to other systems with different sized floats, and so we don't have to go around memorizing the sizes of types. So the argument n * sizeof (float) tells malloc to allocate enough bytes to hold n floats. The pointer returned can be used just like an array.

When we are done with memory that has been malloc'ed, we can return it to the system by calling free with the pointer allocated:

	float	*v;

	v = (float *) malloc (whatever...);
	...
	free (v);	/* return memory at v to the system */

Never free memory you didn't get with malloc. Never free the same memory twice.
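Putting malloc and free together, a sketch might look like this. The function name make_ramp and the values stored are made up for illustration; the point is that the caller gets back a pointer it can index like an array and must eventually hand to free.

```c
#include <stdlib.h>	/* malloc, free */

/* hypothetical helper: allocate n floats holding 0, 1, ..., n-1.
 * returns NULL if n isn't positive or if malloc fails; the
 * caller must free the result
 */
float *make_ramp (int n) {
	float	*v;
	int	i;

	if (n <= 0) {
		return NULL;
	}
	v = (float *) malloc (n * sizeof (float));
	if (!v) {
		return NULL;
	}
	for (i=0; i<n; i++) {
		v[i] = i;	/* use v just like an array */
	}
	return v;
}
```

A caller would write v = make_ramp (n);, check v for NULL, use v[0] through v[n-1], and then call free (v) exactly once.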

Dynamic allocation allows us to define all kinds of abstract data types that can grow dynamically, bound only by the constraints of the operating system.

Example of Dynamic Allocation: Linked Lists

One abstract data type we would like to have is a list type. It should support the following operations:

create(L) - create an empty list.
insert(L, k) - insert a value k into the list.
search(L, k) - search for k in the list, returning a pointer to it if it is found, NULL otherwise.
delete(L, k) - delete the first list item containing the value k.
length(L) - return the number of items in the list.
destroy(L) - get rid of a list, freeing up any storage allocated.

This could be used for a variety of purposes. k could be a key for a record type, and then we could insert things like student records, inventory records, etc. into a list and manage it.

You can imagine how to implement this with an array, with all the limitations that implies. Another way is to use a linked list, a set of structs set up so that the first one in the list points to the second one, the second one points to the third, and so forth. The last struct in the list will contain a NULL pointer, to signal the end.

Here is a typedef/struct declaration for a single item in the list. structs like this are called list nodes, or simply nodes:

typedef struct _node {
	int		k;
	struct _node	*next;
} node;

So a node contains a pointer to a node; in particular, it contains a pointer to the next node in the list, or to NULL if there are no more nodes.

So a linked list can be seen as a pointer to the first node (or NULL, if the list is empty). Here is a typedef for our list ADT:

typedef node *list;

So a list is just a pointer to a node. We would put these typedefs in a header file as in lecture 1, and call it list.h. We'd also put these declarations for basic list operations in list.h (we're doing a list of int here, but we could just as easily have a list of anything else, including structs or even other lists).

/* interface section for list ADT */

/* create an empty list (list is passed by reference) */
void create_list (list *);

/* insert an int into a list (list is passed by reference) */
void insert_list (list *, int);

/* search a list for an int, returning a pointer to that int or NULL if
 * it isn't found 
 */
int *search_list (list, int);

/* delete the first list node containing this integer 
 * (list passed by reference)
 */
void delete_list (list *, int);

/* return the length of a list */
int length_list (list);

/* deallocate storage for a list (list passed by reference) */
void destroy_list (list *);

So far, these function prototypes look like they would for any implementation of the list operations, and we'd like the user (i.e., programmer using our ADT) to stop reading here. We, however, must continue on to the implementation section in a file called list.c. Here are some of the list operations:

create_list - just sticks a NULL pointer into the list node pointer.

/* make an empty list by assigning it the NULL pointer. */
void create_list (list *L) {
	/* note that L is a pointer to a pointer! */

	*L = NULL;
}

insert_list - allocate a new node and put it at the beginning of the list by making it the head of the list:

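void insert_list (list *L, int k) {
	node	*p;

	p = (node *) malloc (sizeof (node));
	if (!p) {
		/* out of memory; give up, as in the earlier malloc example */
		printf ("unable to allocate enough memory, sorry!\n");
		exit (1);
	}
	p->k = k;		/* store the value in the new node */
	p->next = *L;		/* new node points to the old head */
	*L = p;			/* new node becomes the head */
}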

For example, suppose we have a list like this:

          ___    ___    ___
   L ->  |_4_|->|_3_|->|_8_|->NULL

and we want to insert the value 5 into the list. First we get a new node from malloc and put a pointer to it in p, assigning 5 to the k field:

                        ___
                  p -> |_5_|
          ___    ___    ___
   L ->  |_4_|->|_3_|->|_8_|->NULL

Then we have the next field point to the rest of the list:

                    ___
              p -> |_5_|
                     |
                     v
                    ___    ___    ___
             L ->  |_4_|->|_3_|->|_8_|->NULL

Then finally tell L that the new node is the beginning of the list:

                    ___
              L -> |_5_| <- p
                     |
                     v
                    ___    ___    ___
                   |_4_|->|_3_|->|_8_|->NULL

Note that this works even when the original list is empty (i.e., *L is NULL).
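To check the pictures above against real code, here is a self-contained sketch that repeats the declarations and the two operations from the text and adds one possible length_list. That length_list is only an illustration of walking the next pointers, not necessarily the version we will arrive at in the lab.

```c
#include <stdio.h>	/* printf */
#include <stdlib.h>	/* malloc, exit */

typedef struct _node {
	int		k;
	struct _node	*next;
} node;

typedef node *list;

/* make an empty list, as in the text */
void create_list (list *L) {
	*L = NULL;
}

/* insert at the head, as in the text, plus a check on malloc */
void insert_list (list *L, int k) {
	node	*p;

	p = (node *) malloc (sizeof (node));
	if (!p) {
		printf ("unable to allocate enough memory, sorry!\n");
		exit (1);
	}
	p->k = k;
	p->next = *L;
	*L = p;
}

/* one possible length_list (an assumption, not from the text):
 * follow the next pointers until we reach NULL
 */
int length_list (list L) {
	int	n = 0;
	node	*p;

	for (p=L; p!=NULL; p=p->next) {
		n++;
	}
	return n;
}
```

Starting from an empty list and inserting 8, 3, 4, and then 5 produces the final picture: L points to the node holding 5, whose next field points to the node holding 4, and length_list (L) is 4.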

We'll figure out some of the rest of the operations in the lab, as well as possibly alternate ways of doing insert_list.