Data Structures: Lecture 1

Administrivia

We'll be covering material from Chapters 1, 4 and 5 in Drozdek and Simon in the next week and a half.

Data Structures

Data Structures is about organizing data in computer programs in ways that are efficient and easy to use.

Some kinds of basic "data structures" are already familiar to you. For example, C provides ways to model integers, real numbers, arrays, strings, etc.

More advanced structures are referred to as abstract data types, or ADTs. An ADT allows the programmer to use higher-level structures without having to think too much about the details of how they work. An ADT consists of two things:

The interface section. This combination of code and documentation describes how the ADT can be used. A programmer wishing to use the ADT reads the relevant portions of the interface section, performs the necessary incantations to include the ADT in his code, and uses it without bothering with...
The implementation section. This is documented code that implements the data type. It need not be seen by a programmer who is just using the ADT; it can (theoretically) be written once and then used many times by different programmers in different situations.

The idea of an abstract data type is motivated by the observation that many different programming tasks involve the same kinds of operations.

For example, consider the following operations in three different programs:

Checking whether a certain auto part is in inventory for a particular part number.
Showing a student's class list given the student's ID or name.
Displaying the name of your music CD on the computer screen using data (track lengths, etc.) unique to the CD.

All three have one thing in common: the need to search for a record given some unique information about the record (the key). All three could use the same code to do the searching; what's needed is an abstract data type that can contain and search for records.

We'll see two kinds of things in Data Structures:

We'll see concepts of what can be modelled in a computer program. For example, you know how arrays are a pain because you have to tell the compiler how big an array is before you know how much space you really need? Well, we'll see a more general concept of a "container" of data that eliminates that restriction. Also in this "kind of thing" are other concepts, such as a container that can be efficiently searched, or a set whose maximum element can be efficiently determined.
We'll also see the implementations of these things. For example, one way of eliminating the fixed-size restriction of arrays is to dynamically allocate the arrays with a C function called malloc(). If you need more space, just call malloc() again, do a little hocus-pocus, and your array is automatically bigger. Another way is linked lists, in which you only use as much storage as you need (but you give up some efficiency in other areas). We'll also see ways to arrange data so that they can be efficiently searched, like binary search trees, in which both search and insertion are efficient, and things called binary heaps that can find and extract the maximum element from a set in very little time.
We'll also analyze as best we can the efficiencies of these implementations. Underlying the whole thing are the algorithms that do the computation that ends up in an extra data element or a fast search.

An Example in C

Another kind of ADT that has a wide variety of uses is large positive integers. Most positive integers we encounter in daily life fit easily in a 32-bit C int. The number of people in the class, the number of toes on your feet, the number of days you have lived, the odds of your winning the lottery, even the number of miles to the Sun, are all less than the limit of about 2 billion (or 4 billion unsigned) imposed on normal C ints.

But some numbers are too big to store in an int: for instance, the number of centimeters to the Sun, the age of the Universe in years, the number of dollars in Bill Gates' investment accounts, the number of atoms in your body, or your odds of winning the lottery twelve times in a row. Some of these larger quantities will fit into a 64-bit integer, with a much roomier upper limit of about 10¹⁹. Some C compilers, like GCC, support a long long type giving 64 bits, and some machines (like the Alpha) natively support 64-bit integers.

Some applications require many more digits. For instance, do you know how many different ways there are to order n distinct things (like integers)? For n=3 things (like { 1 2 3 }) there are six ways: { 1 2 3 }, { 1 3 2 }, { 2 1 3 }, { 2 3 1 }, { 3 2 1 }, { 3 1 2 }. For n=30 things, there are 265,252,859,812,191,058,636,308,480,000,000 ways. (This is 30!, read "30 factorial." It's 1 * 2 * 3 * 4 * 5 * ... * 30.) You couldn't compute that in even a 64-bit integer; you'd need at least 108 bits.

An increasingly important application of large integers is cryptography. A popular system for encrypting and decrypting messages is the RSA cryptosystem, based on the difficulty of finding the prime factors of very large (e.g., 200-digit) numbers.

We need an ADT to do large integers in C so we can do all this permutation and cryptography stuff. One way is to store integers as sequences of digits in an array, then write functions to do the standard arithmetic operations on the arrays.

Designing the ADT

First, we need to decide what operations we want our ADT to provide. These are a good start:

Create a large integer. This would be like the int statement in C, but there are some initializations that need to be done.
Assign a large integer the value of a C integer, so we can initialize large integers.
Assign a large integer the value of another large integer.
Add two large integers.
Multiply two large integers.
Compare two large integers.
Print a large integer to standard output.

The way they are spelled out here is a little too vague. We'd like a more precise way of saying exactly what operations we provide, so for a second try we'll name the operations and say what parameters each one has:

declare (A)
A is a large integer.
assign (A, i)
A is a large integer and i is a normal C integer. Semantically, A = i.
assign (A, B)
A and B are large integers. Semantically, A = B.
add (A, B, C)
A, B, and C are large integers. Semantically, C = A + B. (Note that we made a design decision here; we could have had just add (A, B) where semantically A += B, but we chose the more general way.)
multiply (A, B, C)
A, B, and C are large integers. Semantically, C = A * B.
compare (A, B)
A and B are large integers. Semantically, compare returns an integer less than 0 if A < B, an integer greater than 0 if A > B, and 0 if A == B. (Note: we could have had three operations less_than, greater_than and equal, but we're trying to keep things simple. If we get a lot of complaints from users, we may provide macros to implement this functionality, and charge for them as an upgrade to version 1.1. :-))
print (A)
A is a large integer. Semantically, A is printed in decimal notation to standard output.

Second, we have to agree on an interface to the data structure. We want to hide as much detail as possible from the user (the programmer using our ADT definition), so we'll just show the user a header file, a file containing essential declarations. Here are the beginnings of a header file describing the interface. Note that it is well documented, so the user knows how to incorporate the ADT into his or her programs. This header file is called bigint.h:

#ifndef BIGINT_H
#define BIGINT_H
/*
 * bigint.h
 ***************************************************************************
 * This file contains declarations allowing you to use the "bigint" data
 * type, providing large integer arithmetic.
 */

#define NDIGITS	200
typedef int bigint[NDIGITS];

/* Large integers are represented with the type "bigint."
 * Declare a variable of type bigint, then initialize it with
 * create_bigint()
 */

void create_bigint (bigint);

/* assign_bigint_int() places a normal int into its bigint argument */

void assign_bigint_int (bigint, int);

/* assign_bigint_bigint() copies the second argument to the first */

void assign_bigint_bigint (bigint, bigint);

/* add_bigint() adds the first two bigint arguments, placing the result
 * in the third
 */
void add_bigint (bigint, bigint, bigint);

/* multiply_bigint() multiplies the first two bigint arguments, placing the
 * result in the third.  Note: the third argument must not be the same
 * array as either of the first two.
 */
void multiply_bigint (bigint, bigint, bigint);

/* print_bigint() prints its bigint argument to standard output
 * in decimal format
 */

void print_bigint (bigint);

/* compare_bigint() compares two bigint arguments, returning:
 * an integer < 0, if the first is less than the second
 * an integer > 0, if the first is greater than the second
 * 0, if the first is equal to the second
 */
int compare_bigint (bigint, bigint);

#endif /* BIGINT_H */

There is a corresponding file called bigint.c that contains all of the implementations of these functions. If you are the user of this file, i.e., a programmer wishing to use these functions in your own program, this header file is all you need to see. You need to know where the implementation is so you can link it with your program, but you don't need to read it or understand it; ideally, we should be able to change the implementation from arrays to linked lists or something else without affecting your user program.

As much detail as possible is hidden from the user, but we have told him that the type bigint is an array of integers and that he can change the value of NDIGITS to get more digits. If we wanted to, we could even hide that in another header file that bigint.h mysteriously #includes.

Using only this information, the user can write a program to, say, compute factorials of large numbers. Here's a program that uses this ADT to compute 30! (30 factorial). This is sample.c:

#include <stdio.h>
#include <stdlib.h>
#include "bigint.h"

void compute_factorial (int n, bigint F) {
	int	i;
	bigint	A, prod;

	/* F = 1 */

	assign_bigint_int (F, 1);

	/* take product of 2..n */

	for (i=2; i<=n; i++) {

		/* A = i */

		assign_bigint_int (A, i);

		/* prod = A * F */

		multiply_bigint (A, F, prod);

		/* F = prod */

		assign_bigint_bigint (F, prod);
	}
}

int main () {
	bigint	fact30;

	create_bigint (fact30);

	/* compute the factorial of 30 */

	compute_factorial (30, fact30);

	/* and print it out */

	print_bigint (fact30);

	printf ("\n");
	exit (0);
}

Somewhere there is a bigint.c containing the implementation of bigints. We can compile everything together with a Makefile like this:

CFLAGS	=	-g
CC	=	gcc

all:		sample

sample:		sample.o bigint.o

sample.o:	sample.c bigint.h

bigint.o:	bigint.c bigint.h

Now let's see the implementations of some of these functions. We'll get to analyze the algorithms at work. This is bigint.c:

/*
 * bigint.c
 ***************************************************************************
 * This file contains the implementation of the bigint abstract data type
 */
#include <stdio.h>
#include "bigint.h"

/* this tells us what base to work in.  We might want to change it
 * if we grow more fingers or want to get some more efficiency
 */
#define BASE 10

/* zero out the array */

void create_bigint (bigint A) {
	int	i;

	for (i=0; i<NDIGITS; i++) A[i] = 0;
}

/* put the normal int n into the big int A */

void assign_bigint_int (bigint A, int n) {
	int	i;

	/* start indexing at the 0's place */

	i = 0;

	/* while there is still something left to the number
	 * we're encoding... */

	while (n) {

		/* put the least significant digit of n into A[i] */

		A[i++] = n % BASE;

		/* get rid of the least significant digit,
		 * i.e., shift right once
		 */

		n /= BASE;
	}

	/* fill the rest of the array up with zeros */

	while (i < NDIGITS) A[i++] = 0;
}

/* A = B */
void assign_bigint_bigint (bigint A, bigint B) {
	int	i;

	for (i=0; i<NDIGITS; i++) A[i] = B[i];
}

/* C = A + B */
void add_bigint (bigint A, bigint B, bigint C) {
	int	i, carry, sum;

	/* no carry yet */

	carry = 0;

	/* go from least to most significant digit */

	for (i=0; i<NDIGITS; i++) {

		/* the i'th digit of C is the sum of the
		 * i'th digits of A and B, plus any carry
		 */
		sum = A[i] + B[i] + carry;

		/* if the sum reaches or exceeds the base, then we have a carry. */

		if (sum >= BASE) {

			carry = 1;

			/* make sum fit in a digit (same as sum %= BASE) */

			sum -= BASE;
		} else
			/* otherwise no carry */

			carry = 0;

		/* put the result in the sum */

		C[i] = sum;
	}

	/* if we get to the end and still have a carry, we don't have
	 * anywhere to put it, so panic! 
	 */
	if (carry) printf ("overflow in addition!\n");
}

/* we'll need these to help multiply */

void multiply_one_digit (bigint, bigint, int);
void shift_left (bigint, int);

/* C = A * B */
void multiply_bigint (bigint A, bigint B, bigint C) {
	int	i, j, P[NDIGITS];

	/* C will accumulate the sum of partial products.  It's initially 0. */

	assign_bigint_int (C, 0);

	/* for each digit in A... */

	for (i=0; i<NDIGITS; i++) {
		/* multiply B by digit A[i] */

		multiply_one_digit (B, P, A[i]);

		/* shift the partial product left i digits */

		shift_left (P, i);

		/* add result to the running sum */

		add_bigint (C, P, C);
	}
}

/* B = n * A */
void multiply_one_digit (bigint A, bigint B, int n) {
	int	i, carry;

	/* no extra overflow to add yet */

	carry = 0;

	/* for each digit, starting with least significant... */

	for (i=0; i<NDIGITS; i++) {

		/* multiply the digit by n, putting the result in B */

		B[i] = n * A[i];

		/* add in any overflow from the last digit */

		B[i] += carry;

		/* if this product is too big to fit in a digit... */

		if (B[i] >= BASE) {

			/* handle the overflow */

			carry = B[i] / BASE;
			B[i] %= BASE;
		} else

			/* no overflow */

			carry = 0;
	}
	if (carry) printf ("overflow in multiplication!\n");
}

/* "multiplies" a number by BASE^n */
void shift_left (bigint A, int n) {
	int	i;

	/* going from left to right, move everything over to the
	 * left n spaces
	 */
	for (i=NDIGITS-1; i>=n; i--) A[i] = A[i-n];

	/* fill the last n digits with zeros */

	while (i >= 0) A[i--] = 0;
}

/* print a bigint */
void print_bigint (bigint A) {
	int	i, seen_nonzero;

	/* this variable will be set to "true" when we see a non-zero
	 * digit so we can avoid printing leading zeros
	 */
	seen_nonzero = 0;
	for (i=NDIGITS-1; i>=0; i--) {
		if (A[i] || seen_nonzero) {
			seen_nonzero = 1;
			printf ("%d", A[i]);
		}
	}
	if (!seen_nonzero) printf ("0");
}

The value of NDIGITS is a parameter that can be changed in the header file. Our algorithms for addition and multiplication take a certain amount of time related to this parameter; we'd like a way to quantify this so we can know what performance we can expect from this ADT as implemented. Based on this knowledge, we may choose a different implementation that is faster.

To quantify the amount of time taken, we'll count the number of times the statement

		sum = A[i] + B[i] + carry;

is executed. Everything takes time, but this statement is in the middle of an important loop, does two additions and, depending on the compiler, two or more memory accesses, so knowing how many times it is executed gives us a pretty good idea of the amount of time the algorithm will consume.

Let n be the value of NDIGITS. We'll figure out how the amount of time add_bigint and multiply_bigint take changes as n changes. These functions of n are the time complexities of the two operations. There is also a space complexity that becomes important in other algorithms, but not here.

In add_bigint, the sum = ... statement is executed n times, so a function describing the time complexity of add_bigint must be bounded from below by n. In plain English, this means the function takes about n steps to run.

It turns out that multiply_one_digit and shift_left also take about n steps, and we do an add_bigint in the middle of the loop in multiply_bigint. Each of these three functions is executed n times in the loop of multiply_bigint, so multiply_bigint has a time complexity bounded below by 3n². As n increases (i.e., asymptotically), the factor of 3 is much less important than the quadratic term, so we can say this multiplication takes about n² steps.

This is a somewhat lazy analysis of the algorithms; we haven't taken into account some other important operations that take time, and we have treated three different functions as though they take exactly the same amount of time when they actually differ (by constant factors). It turns out that we can develop rigorous tools to justify this kind of lazy analysis; n² in this case is really the important term, especially if we compare it to another algorithm that might take n log n steps (that's much better).