A quick intro to SQL

In 2003, a guy I know was taking a course about SQL, felt that the instructor was not providing any conceptual framework whatsoever for the topic, and asked me to tell him what was what. What follows is my response.

If you know something about SQL, or relational databases generally, you might notice a few omissions, errors, and strangenesses here. Some of that is because of the special context for which I was writing — e.g., I go nowhere near relational algebra, because I'm addressing a novice who needs a much more basic orientation — but some of it is because I just didn't know any better.

If you don't know much about SQL, and are reading this to learn more: welcome, and beware! These notes are an idiosyncratic supplement to a proper course or book on the subject.

Date: Mon, 17 Nov 2003 01:36:50 -0700
Subject: Conceptual model of SQL, part 1: basics

To a first approximation, data in SQL can be conceived this way:

1. Objects have properties.  For example, the properties of the
novel _Cryptonomicon_ include the fact that its title is
"Cryptonomicon", that its author is Neal Stephenson, and that it
was published in 1999.

2. Objects come in types; all objects of the same type have the
same set of properties.  For example, all books have a title, an
author, and a year of publication.

Every type of object gets a table; the table has a row for every
object of that type and a column for every property that objects
of that type have.

                Table "books"
    title          author              year
    ---------------------------------------
    Cryptonomicon  Neal Stephenson     1998
    The Cyberiad   Stanislaw Lem       1985
    Friday         Robert Heinlein     1982
    The Big U      Neal Stephenson     1988

There are four things you might want to do with this data:

    1. SELECT:  Retrieve information.
    2. INSERT:  Add a new row.
    3. DELETE:  Delete rows.
    4. UPDATE:  Change the contents of existing rows.

These four kinds of statement comprise the Data Manipulation
Language, which is half of SQL.  All of them involve doing things
with *objects* -- finding them, creating them, destroying them,
and changing them.

The other half of SQL is the Data Definition Language, which has
to do with doing things to *types* -- for example, the CREATE
TABLE statement creates a new table (= a new type of object),
specifying what columns the table has (= what properties objects
of that type have).

Examples:

1. An example of Data Definition Language.

English: "Books have titles, authors and years of publication;
titles are strings, authors are strings, and years are integers."

SQL:
    CREATE TABLE books (title varchar, author varchar, year int);

Pretty simple.  Creates an empty table with the specified columns.

2. Adding data.

English: "There is a book named 'Cryptonomicon', written by Neal
Stephenson, and published in 1998.  There is a book named 'The
Cyberiad', written by Stanislaw Lem, ... (etc.)"

SQL:
    INSERT INTO books (title, author, year)
        VALUES('Cryptonomicon', 'Neal Stephenson', 1998);
    INSERT INTO books (title, author, year)
        VALUES('The Cyberiad', 'Stanislaw Lem', 1985);
    INSERT INTO books (title, author, year)
        VALUES('Friday', 'Robert Heinlein', 1982);
    INSERT INTO books (title, author, year)
        VALUES('The Big U', 'Neal Stephenson', 1988);

Again, pretty simple.

3. Retrieving rows.

English: "What books were published in the 1980s?"

SQL:
    SELECT *
    FROM books
    WHERE 1980 <= year AND year < 1990;

Response:
    title          author              year
    ---------------------------------------
    The Cyberiad   Stanislaw Lem       1985
    Friday         Robert Heinlein     1982
    The Big U      Neal Stephenson     1988

Note that this response is just a subset of the original table --
the only difference is that the Cryptonomicon row was omitted
because it has year = 1998.

4. Retrieving specific columns.

You can also cause columns to be omitted, producing a different
kind of subset of the original table:

English: "What are the titles of books published in the 1980s?"

SQL:
    SELECT title
    FROM books
    WHERE 1980 <= year AND year < 1990;

Response:
    Title
    ------------
    The Cyberiad
    Friday
    The Big U

5. Computing extra columns.

It's also possible to *add* columns to the response, by having the
SQL server compute them on the fly:

English: "When is the 50th anniversary of the publication of each
book?"

SQL:
    SELECT *, year+50 AS anniversary
    FROM books;

Response:
    title          author              year  anniversary
    ----------------------------------------------------
    Cryptonomicon  Neal Stephenson     1998  2048
    The Cyberiad   Stanislaw Lem       1985  2035
    Friday         Robert Heinlein     1982  2032
    The Big U      Neal Stephenson     1988  2038

(As before, '*' means 'all columns'.  It's actually a bad idea to
use '*', since the SQL standard doesn't guarantee that you get the
columns in any particular order -- you might get different results
from different servers, or even from the same server at different
times.  For portability and robustness, ask for the columns you
want in the order you want.)

6. Aggregation.

And, for my last example of SELECT, you can suppress detail,
aggregating many rows into one:

English: "Group all the books by author.  For each group, what is
the author, the number of books in the group, and the first and
last years of publication of books in the group?"

SQL:
    SELECT author, COUNT(*), MIN(year), MAX(year)
    FROM books
    GROUP BY author;

Response:
    Author              count  min_year  max_year
    ---------------------------------------------
    Neal Stephenson     2      1988      1999
    Stanislaw Lem       1      1985      1985
    Robert Heinlein     1      1982      1982

Here each row of the response corresponds to one or more rows from
the original table.

7. Deleting and updating data.

English: "Forget about all of Neal Stephenson's books."

SQL:
    DELETE FROM books WHERE author = 'Neal Stephenson';

English: "Change 'Robert Heinlein' to 'Robert A. Heinlein'
everywhere."

SQL:
    UPDATE books
    SET author = 'Robert A. Heinlein'
    WHERE author = 'Robert Heinlein';

As these examples illustrate, you specify which rows to delete or
update in the same way as you specify which rows to retrieve in a
SELECT statement -- with a WHERE clause.  This means that to
specify a single row, you need to be able to express some unique
property it has.  My example table of books is badly designed in
this respect.  For example:

English: "Oops... Cryptonomicon was actually published in 1999."

SQL:
    UPDATE books
    SET year = 1999
    WHERE title = 'Cryptonomicon';

This SQL statement will affect *every* row whose title =
'Cryptonomicon'; in my example that's just one row, but we can
easily imagine multiple books having the same title.

It is thus common to put an 'id' column in such tables; it
contains a number unique to each row, even if all other values in
the row are duplicated elsewhere.  This way you can delete or
update individual rows with confidence that only the row you want
will be affected.  In an application, you might SELECT a bunch of
rows, show them to the user on the screen, have them highlight one
with the mouse and click a "Nuke" button, and then do a "DELETE
...  WHERE id = foo", with confidence that only that one row will
be deleted.

In future installment(s): constraints, relationships between
tables, multiple-table queries, and indexes.

Date: Tue, 18 Nov 2003 22:13:20 -0700
Subject: SQL, part 2: multiple tables; constraints

Today, we add to our developing example the idea of authors' years
of birth and death.  One of our objectives is to be able to
display books like this:

    Cryptonomicon  Neal Stephenson (1959-)
    The Cyberiad   Stanislaw Lem (1921-)
    Friday         Robert A. Heinlein (1907-1988)
    The Big U      Neal Stephenson (1959-)

The quickest and dirtiest way to achieve this objective is just to
consider the strings like "(1959-)" to be part of the authors'
names.  That is, in our books table we'd have

    title          author                          year
    ---------------------------------------------------
    Cryptonomicon  Neal Stephenson (1959-)         1999
    The Cyberiad   Stanislaw Lem (1921-)           1985
    Friday         Robert A. Heinlein (1907-1988)  1982
    The Big U      Neal Stephenson (1959-)         1988

One problem with this approach is that it makes it harder to
search for books by author.  For example, to find books written by
Neal Stephenson, you'd have to know that he was born in 1959 and
ask
    SELECT * FROM books WHERE author = 'Neal Stephenson (1959-)';
(Well, there are other ways you could find those rows, but never
mind.)  It would also be difficult to ask questions like "which
books were written by authors born before 1950?".

These problems can be solved by separating out the birth and death
information into their own columns:

                        Table "books"
    title          author              birth  death  year
    -----------------------------------------------------
    Cryptonomicon  Neal Stephenson     1959          1999
    The Cyberiad   Stanislaw Lem       1921          1985
    Friday         Robert A. Heinlein  1907   1988   1982
    The Big U      Neal Stephenson     1959          1988

Call this Schema 2.  (A "schema" is a database's set of table
definitions; it's what the Data Definition Language manipulates.
Schema 1 consists of the books table as it was defined in my
previous email.)

We've recovered the ability to find books by author name without
knowing when they were born, and added the ability to do queries
on the birth and death years of their authors.  To display things
like "Neal Stephenson (1959-)", we'll combine these three columns
in the application, using sprintf or something.

(Exercise: Write a SQL statement to find books written by authors
born before 1950.)

Things are getting better, but we're not doing the Right Thing
yet.  For example, in this setup, the simple-seeming question
    "What year was Neal Stephenson born?"
is troublesome.  Extrapolating from the examples of the previous
email, we might try
    SELECT birth FROM books WHERE author = 'Neal Stephenson';
The response would be
    birth
    -----
    1959
    1959
Since there are two books satisfying the WHERE clause, you get two
rows back, although you only wanted one fact.  To get just one
row, you could try aggregation, grouping by author,
    SELECT birth
    FROM books
    WHERE author = 'Neal Stephenson'
    GROUP BY author;
but SQL doesn't even allow this query.  The problem is that the
birth column could have different values in the two rows where
author = 'Neal Stephenson', and the server has no way to know
which one you want for that group of rows.  If we persist in this
approach, we might end up with something like
    SELECT MIN(birth)
    FROM books
    WHERE author = 'Neal Stephenson'
    GROUP BY author;
which actually asks "What is the minimum year of birth recorded
for Neal Stephenson?"... not quite what we want.

This mess results from having made a conceptual error.  According
to the model from last email, having a birth column in the books
table means conceptually that the author's year of birth is a
property that books have.  But it's really a property of the
authors, not the books.  That is, authors form another type of
objects, with their own properties, and so deserve their own table.

So, schema 3: scrap the birth and death columns in the books table
and create a new table for authors.

            Table "authors"
    name                birth  death
    --------------------------------
    Neal Stephenson     1959
    Stanislaw Lem       1921
    Robert A. Heinlein  1907   1988

Now we can find Neal Stephenson's birth year easily, and we can
pose questions like "which authors were born before 1950?" and
"how many authors were born in each year?"  (Exercise: What are
the SQL statements for these questions, now that we have an
authors table?)

Meeting our original objective, displaying things like

    Cryptonomicon  Neal Stephenson (1959-)
    The Cyberiad   Stanislaw Lem (1921-)
    Friday         Robert A. Heinlein (1907-1988)
    The Big U      Neal Stephenson (1959-)

now takes a bit more work, to combine the information from the two
tables:

    Book Display Algorithm (for Schema 3)

    1.  Get a list of books by using
            SELECT title, author FROM books;

    2.  For each book in that list, use
            SELECT birth, death FROM authors WHERE name = ?;
        to obtain the author's birth and death years.  (Fill in
        the ? with the value of author for the current book.)

    3.  Assemble this information into text and display it.

This is a fine plan, but there are three things to consider.

First, what if, in step 2, the SELECT statement produces more than
one row?  (This would occur whenever two authors have the same
name.)  How would we know which birth/death years apply to the
author of the book we're interested in?

Fixing this calls for some way to identify authors uniquely, so as
suggested at the end of the previous email, let's add an 'id'
column:

              Table "authors"
    id  name                birth  death
    ------------------------------------
     1  Neal Stephenson     1959
     2  Stanislaw Lem       1921
     3  Robert A. Heinlein  1907   1988

Then, in the books table, we can replace the author name column
with an 'authorid' column, so we know exactly which author is
associated with each book:

           Table "books"
    title          authorid  year
    -----------------------------
    Cryptonomicon         1  1999
    The Cyberiad          2  1985
    Friday                3  1982
    The Big U             1  1988

This is Schema 4.  (Exercise: Update the Book Display Algorithm to
work under Schema 4.  Verify that your revision actually obtains
all the information needed in step 3.)

But wait -- how does this actually solve the problem of possibly
getting more than one row back in step 2?  What if two rows in the
authors table have the same id?

The answer is, that should never happen.  Identifiers by their
nature and function should be unique; in other words, we want to
prevent rows with the same id from being added to the table in the
first place, so that we won't have to worry about it in step 2.

This brings us to the topic of constraints.  SQL lets you restrict
the values that may appear in a column in various ways.
Constraints are statements about what values are permitted in
certain columns; as such they concern the entire table, and are in
the purview of the Data Definition Language.

To make sure author ids are unique, we need the UNIQUE constraint:

    CREATE TABLE authors (
        id int UNIQUE,
        name varchar,
        birth int,
        death int
        );

With the authors table created this way, if we try to INSERT two
rows with the same id, the second insertion will fail.  Also, an
UPDATE statement can't change the id column in a way which would
cause a collision.  Thus we can rely on never getting more than
one row back in step 2.

There's one gotcha with the UNIQUE constraint: NULL is not equal
to anything, not even itself.  That is, we could make two authors
with NULL ids, and the UNIQUE constraint would not complain, since
those values are not equal to each other.  (We might say that
UNIQUE requires *values* to be unique, and that NULL is exempt
because it's not a value but the absence of a value.  NULL is so
weird that it makes good sense to adopt this view.)

But every author should have an id number -- in other words, we
want to enforce the constraint that the values in the id column
are not NULL.

    CREATE TABLE authors (
        id int UNIQUE NOT NULL,
        name varchar,
        birth int,
        death int
        );

The name column should also be NOT NULL, and maybe the birth
column too.  (We could also have just declared that the name
column to be UNIQUE; this would also have ensured that step 2 of
the Book Display Algorithm produces only one row, and we wouldn't
need the id column.  The problem with this is that it just ain't
so -- name collisions do happen in the real world, and our
database should be ready for them.  It's the difference between
"should never happen" and "probably won't happen".)

Constraints aren't restricted to canned things like "UNIQUE" and
"NOT NULL"; the CHECK constraint lets you state an arbitrary
condition which must always hold.  For example, we probably want
    CREATE TABLE authors (
        ...
        birth int NOT NULL,
        death int,
        CHECK (birth <= death OR death IS NULL)
        );
because people can't die before they're born (or rather, those
that do don't become authors).  This example also illustrates that
constraints can involve more than one column.

(Note that we have to check for NULL death year specially.  This
is because "foo <= NULL" is false for any foo.  Also "foo > NULL"
is false for any foo, and so is "foo <> NULL"; that is, not only
is NULL not equal to anything, it's not not-equal to anything
either.  Did I mention that NULL is weird?)

Second consideration:  What if we can't find *any* rows for a
given author id in step 2?

This is other side of the first consideration, where we were
worried about getting more than one row.  The answer is the same:
it should never happen.  That is, we wish to enforce the
constraint that each value in the authorid column of the books
table matches some value in the id column of the author table.
For this we need the REFERENCES constraint:

    CREATE TABLE books (
        title varchar,
        authorid int REFERENCES author(id),
        year int
        );

Now if we try to INSERT a book with a bogus author id, it'll fail,
as will an attempt to UPDATE a book's author id to something
bogus.  Conversely, in the authors table, we can DELETE an author
or UPDATE its id only if there are no books for that author (since
otherwise the change would leave those books in violation of the
REFERENCES constraint).

(Many SQL servers allow you to make other things happen when
deleting or updating an author.  For example, when updating an
author id, you might want all the associated books to be updated
to the new authorid -- this is called "cascading".  How to make
this happen, if you want it, depends on the implementation.)

(Exercise: Add an id column to the above CREATE TABLE statement
for the books table.  Also add any appropriate constraints to the
other columns.)

(Exercise:  Suppose that, by error, Heinlein was added to the
authors table twice, once with id 3 and once with id 10.  Suppose
further that, when adding Heinlein's books to the book table,
sometimes they were associated with Heinlein 3 and sometimes with
Heinlein 10.  Write a sequence of SQL statements which fixes the
data, without at any stage violating any constraints.)

The third and final consideration has nothing to do with
constraints.  In the Book Display Algorithm, we loop through the
list of books, looking up each authorid in the authors table.
This is somewhat inefficient, since we ask the server for an
author's row once for each book that author wrote, duplicating
effort for prolific authors.

We could deal with this in the application by keeping a list of
which authors we've already asked about, and for each book,
checking that list before asking the server.  (In other words,
cache the author data in the application, at least for the
duration of the loop.)

But SQL can do better, since it allows you to SELECT from more
than one table at once.  For example:

    SELECT books.title, books.authorid, authors.id, authors.name
    FROM books, authors;

(Note that we specify columns by the combination of table name and
column name.  For convenience, you can omit the table name if
there's no ambiguity.)

Response:

    books.title    books.authorid  authors.id  authors.name
    -------------------------------------------------------
    Cryptonomicon               1           1  Neal Stephenson
    Cryptonomicon               1           2  Stanislaw Lem
    Cryptonomicon               1           3  Robert A. Heinlein
    The Cyberiad                2           1  Neal Stephenson
    The Cyberiad                2           2  Stanislaw Lem
    The Cyberiad                2           3  Robert A. Heinlein
    Friday                      3           1  Neal Stephenson
    Friday                      3           2  Stanislaw Lem
    Friday                      3           3  Robert A. Heinlein
    The Big U                   1           1  Neal Stephenson
    The Big U                   1           2  Stanislaw Lem
    The Big U                   1           3  Robert A. Heinlein

This result set has one row for every possible combination of a
row from the books table and a row from the authors table.  That's
what "FROM books, authors" means -- the Cartesian product of the
two tables.

Useless?  No!  By construction, this result contains as a subset
all the correct combinations of books with their authors.  And we
already know how to pick just a subset of a result -- with a WHERE
clause.

Exercise (and this one you should actually consider doing): What
WHERE clause is needed to reduce the above result set to just the
ones with the correct combinations of book and author?  Put it all
together by writing a SQL statement to answer the question "for
each book, what is its title, its author's name, and its author's
years of birth and death?".  (Note that this is exactly the
information we need to complete the objective set at the beginning
of this email.)

(Exercise: Write a SQL statement to pose the question "how many
books has each author written?".)

In future installments: many-to-many relationships, indexes,
transactions... and I might have something to say about schema
design too.  I also now realize that I forgot to talk about the
interaction of WHERE and GROUP BY in my last email, so I'll stick
that in somewhere.

Date: Fri, 21 Nov 2003 08:20:01 -0700
Subject: SQL, part 3: many-to-many relationships; more on GROUP BY

Previously, on "Steven's Correspondence Crash Course in SQL":

              Table "authors"
    id  name                birth  death
    ------------------------------------
     1  Neal Stephenson     1959
     2  Stanislaw Lem       1921
     3  Robert A. Heinlein  1907   1988

             Table "books"
    id  title          authorid  year
    ---------------------------------
     1  Cryptonomicon         1  1999
     2  The Cyberiad          2  1985
     3  Friday                3  1982
     4  The Big U             1  1988

(This is Schema 5; it's the same as Schema 4 from the previous
email except it has books.id, as suggested in one of the
exercises.  Not shown, it also includes all the constraints
previously discussed.)

Today we add to our developing catalogue of science fiction novels
_The Mote in God's Eye_, by Larry Niven and Jerry Pournelle.

Niven and Pournelle are both authors, so we give them each a row
in the authors table.  Likewise the Mote gets a row in the books
table.

            New rows in "authors"
    id  name                birth  death
    ------------------------------------
     4  Larry Niven          1938
     5  Jerry Pournelle      1933

               New row in "books"
    id  title                  authorid  year
    -----------------------------------------
     5  The Mote in God's Eye         ?  1974

But we have a problem in the authorid field for this book.  This,
of course, is the new feature we're adding today -- support for
collaborations.

Having an authorid column in the books table which REFERENCES
author.id indicates that there is a one-to-many relationship
between these classes of object: for each book there is exactly
one author -- the one identified in the books.authorid column --
while for each author there are (possibly) many books -- all those
with that author's id in their authorid column.  What we want is a
many-to-many relationship: each author may have many books, each
book may have many authors.

We can achieve this in SQL by adding an intermediate table:

    Table "authorship"
    authorid  bookid
    ----------------
           1       1
           1       4
           2       2
           3       3
           4       5
           5       5

Each row of this table represents the fact that the specified
author was involved in writing the specified book.  When we have
an author and want to find their books, we do
    "What are the ids of books written by the author whose id is foo?"
    SELECT bookid FROM authorship WHERE authorid = foo;
Then we look the bookids up in the books table.  Or rather, we let
the server combine the tables for us,
    "What are the titles of the books written by author id foo?"
    SELECT books.title
    FROM books, authorship
    WHERE books.id = authorship.bookid AND authorship.authorid = foo;
as described at the end of part 2.  The first condition in the
WHERE clause restricts the all-possible-combinations result set to
just those with the correct combinations of book with authorship
entry, and the second restricts it further to just those
concerning the specified author.

(The authorship table breaks the conceptual model from part 1; its
rows do not indicate objects.  We can recover by adopting the view
that rows represent facts or statements; the "row = object" kind
of table holds facts of the type "this object has such-and-such
properties", while tables like authorship hold facts of the type
"this object has such-and-such a relationship to that object".)

As a first cut, the authorship table can be created by

    CREATE TABLE authorship (
        authorid int,
        bookid int
        );

But suppose this table had two identical rows, say, two rows with
authorid 4 and bookid 5 (representing Niven's authorship of The
Mote).  Then we'd have a nasty problem when combining this table
with the other tables; for example,
    "Which books did Niven write?"
    SELECT books.title
    FROM books, authorship
    WHERE books.id = authorship.bookid AND authorship.authorid = 4;
would result in
    books.title
    ---------------------
    The Mote in God's Eye
    The Mote in God's Eye
The book appears twice because, in the all-possible-combinations
result set that "FROM books, authorship" makes, the one row with
authorid 4 and bookid 5 is combined with The Mote, and so is the
other row, making two satisfying the WHERE condition.

To prevent this, we want each row in the authorship table to be
unique.  It's fine for an author to appear more than once in that
table, and it's fine for a book to appear more than once, but not
the same combination of author and book.  Happily, the UNIQUE
constraint can also be applied to a collection of columns, with
exactly that effect:

    CREATE TABLE authorship (
        authorid int,
        bookid int,
        UNIQUE(authorid, bookid)
        );

(Exercise: What other constraints does this table need?)

Having the book-to-author relationship in its own table means the
books table no longer needs an authorid column:

             Table "books"
    id  title                  year
    -------------------------------
     1  Cryptonomicon          1999
     2  The Cyberiad           1985
     3  Friday                 1982
     4  The Big U              1988
     5  The Mote in God's Eye  1974

This change, along with the new authorship table, makes Schema 6.

(Exercise: What SQL statements are needed to add the novel
_Zodiac_, written by Neal Stephenson and published in 1988, and
then to delete it again?)

(Exercise: Given the data above, what is the result set produced
by this query?
    SELECT books.title, author.name
    FROM books, authorship, authors
    WHERE books.bookid = authorship.bookid
    AND authorship.authorid = authors.id
    AND books.year < 1985;
)

So now our database supports collaborations.  We've already seen
the basic "who wrote what" kinds of questions; we can also ask
questions such as
    "How many authors wrote each book?"
    SELECT books.id, COUNT(*) AS author_count
    FROM books, authorship
    WHERE books.id = authorship.bookid
    GROUP BY books.id;
Of course, we probably want the titles; but just replacing "SELECT
books.id..." with "SELECT books.title..." won't work here.  The
problem is that, when grouping, every column in the output (i.e.,
everything in the SELECT list) must be either one of the things
grouped by, an aggregate (e.g., a COUNT, a MIN), or something
computed from these.  So we have to either group on the title too,
    SELECT books.title, COUNT(*) AS author_count
    FROM books, authorship
    WHERE books.id = authorship.bookid
    GROUP BY books.id, books.title;
or aggregate the titles,
    SELECT MIN(books.title), COUNT(*) AS author_count
    FROM books, authorship
    WHERE books.id = authorship.bookid
    GROUP BY books.id;
(This may seem silly.  After all, there can only be one value of
books.title per books.id, since there's only one *row* per
books.id, by the UNIQUE constraint.  But SQL doesn't make that
kind of inference.  Don't ask me why.)
Let's try
    "Which books are collaborations?"
Extrapolating from previous examples, we might try
    SELECT books.id
    FROM books, authorship
    WHERE books.id = authorship.bookid AND COUNT(*) > 1
    GROUP BY books.id;
But alas, this won't work.  The problem is that the WHERE clause
operates before grouping; that is, here it filters the
all-possible-combinations result set, not the grouped result set
which the "GROUP BY books.id" produces.  Thus, at the time the
WHERE filtering occurs, COUNT(*) cannot be determined.

(If WHERE filtering happened after grouping, the COUNT(*) > 1 part
would be fine, but the other part -- the test books.id =
authorship.booksid -- wouldn't work.  The data starts as
all-possible-combinations; if we group by books.id before
filtering, the groups would include all the incorrect combinations
of book with authorship.  Try
    SELECT books.id, COUNT(*)
    FROM books, authorship
    GROUP BY books.id;
for example.  We can't filter these incorrect combinations out
after grouping because, by grouping on books.id, we've thrown away
all the other columns, retaining only summary information such as
MAX() and COUNT().)

What we need is the HAVING clause.  This is just like WHERE, but
it operates after grouping, so it can refer to aggregate
information such as counts, maxima, averages, etc., but it can't
refer to individual values from columns which have been aggregated
by grouping.  In the case at hand, we need
    "Which books are collaborations?"
    SELECT books.id
    FROM books, authorship
    WHERE books.id = authorship.bookid
    GROUP BY books.id
    HAVING COUNT(*) > 1;

For my next trick, the question
    "Who has Niven collaborated with?"
This one is a bit complicated, but we can sneak up on it by
considering how we'd answer this question in the application.

    Niven Collaborators Algorithm

    1.  Use
            SELECT bookid FROM authorship WHERE authorid = 4;
        to find the books that Niven had a part in.

    2.  For each bookid, use
            SELECT authorid FROM authorship WHERE bookid = ?;
        to find all the authors of that book.

    3.  Combine all the lists of authors obtained in step 2
        and display the result.

Notice how steps 1 and 2 here have the same structure as steps 1
and 2 in the Book Display Algorithm from last email; we retrieve
some rows using one query, then loop through them, running an
additional query for each.  This suggests that we can use a
multiple-table query to do these two steps at once, just as we did
in that book display example.

The tricky bit is that here the two tables being joined are the
same table, authorship.  The SQL feature which lets us pull this
off is the ability to assign, in the FROM clause, an alias for a
table, and refer to it by that alias in the rest of the SELECT
statement:
    SELECT spam.title
    FROM books AS spam;
Thus, in the case at hand, we can do
    SELECT ...
    FROM authorship, authorship AS otherauthor
    ... ;
Then it's just as if we're joining two different tables, one named
authorship and one named otherauthor.  So:
    SELECT otherauthor.authorid
    FROM authorship, authorship AS otherauthor
    WHERE authorship.authorid = 4               -- Niven wrote the book
    AND authorship.bookid = otherauthor.bookid  -- so did the other author
    AND otherauthor.authorid <> 4;              -- other author is not Niven

(Exercise: Explain why we have to test that the other author is
not Niven.)

(Exercise: Explain why the output of this query might contain
duplicate rows, and fix it.  Also, while you're in there, arrange
for the output to include the other author's name.)

One problem with this approach to many-to-many relationships is
that you can't enforce any restrictions on how many objects are on
either side.  For example, we can't enforce the constraint that
each book must have at least one author.  (Imagine you could
enforce such a constraint.  When adding a new book, you'd have to
INSERT a book row and at least one authorship row; but you
wouldn't be able to INSERT the book row first, since that would
produce a book with no authors, and you wouldn't be able to INSERT
an authorship row first, since that would produce an authorship
with a bogus book id.  Deadlock.)

Instead, such constraints, if you need them, are enforced by the
application; it just never creates a book without immediately
creating an authorship to go with it, and never deletes the sole
authorship for a book.  (More on this when I talk about
transactions.)

(Exercise: At the beginning of all this I stated, on conceptual
grounds alone, that Niven and Pournelle should each get a row in
the authors table.  What if we consider them to be a single
entity, some kind of hive mind perhaps, and give that entity a row
in the authors table?  Then we can leave books.authorid in place
and forget about the authorship table.  Would any practical
problems then arise?  E.g., would there then be questions which
are difficult or impossible to ask?)

(Exercise: What about giving The Mote two rows in the books table,
one with authorid = 4 and one with authorid = 5?)

(Exercise: What about just adding an authorid2 column to the books
table for the second author?)

In future installments: indexes and transactions.  (I won't talk
about schema design separately after all.  Those last three
exercises are enough.)  The example database will not be developed
further.

Date: Mon, 24 Nov 2003 18:07:45 -0700
Subject: SQL, part 4: indices

Usually you can use SQL without worrying about how the SQL server
does what you tell it to do; this is one of the benefits of a
high-level language such as SQL.  Occasionally, however, it is
worthwhile to have at least a broad idea of the sorts of things a
server might do when processing SQL statements.  (The usual
scenario is that a query is taking too long to execute, and you
want to speed it up.)

First let's consider a statement such as
    SELECT * FROM authors WHERE id = 3;
The worst way to answer this question is to go through the authors
table row by row, checking each row's id, and outputting just
those rows whose id is 3.  This is called "linear search"; if the
table doubles in size, such a search takes twice as long.  (That
is, its running time is proportional to n, the number of rows in
the table.)

But suppose the rows of the authors table are on disk in order of
id.  Then we can use "binary search":  First look at the row in
the middle of the table; if that row has id > 3, the desired rows
are in the first half of the table; otherwise they are in the
second half.  Then look at the row in the middle of the
appropriate half, and repeat until you find rows with id = 3.
Under this strategy, if the table doubles in size, we just need
one more step at the beginning, to determine which half of the
enlarged table the rows we seek are in.  (Thus binary search takes
time proportional to log n.)

So in general we'd prefer that the rows of a table be on disk in
an order which makes searches by id fast.  This can usually be
achieved by designating the id column to be the PRIMARY KEY of the
table:
    CREATE TABLE authors (
        id int PRIMARY KEY,
        name varchar,
        birth int,
        death int
    );
(Formally, PRIMARY KEY is the same as UNIQUE NOT NULL, but it
further suggests that the server ought to organize the table on
disk in a way which makes searches by that column fast.)

But what if we want to do other kinds of search?  For example,
we'll probably want queries such as
    SELECT * FROM authors WHERE name = 'Robert A. Heinlein';
to be fast too.  (Users of our application will presumably specify
authors by name, not by id, and we'll have to use this kind of
query to find out which author they're referring to.)  With the
table sorted by id, we'd have to use linear search to answer this
question, since Heinlein might be anywhere in the table.  We could
sort the table by name instead, but then searches by id would need
linear search.

(The situation is analogous to a phone book, which is sorted in a
way which makes finding entries by name easy but finding entries
by phone number hard.)

One strategy is to create an auxiliary table sorted by name.  This
table might look like
    Table "authors_by_name"
     name               id
     ---------------------
     Neal Stephenson     1
     Stanislaw Lem       2
     Robert A. Heinlein  3
Then, to look up an author by id, we do binary search in the main
table, while to look up an author by name we first do binary
search in authors_by_name to find the id, then binary search in
the main table to find the complete row.  This way we get both
searches by id and searches by name to run in log n time.  Of
course, we slightly increase the time needed to INSERT, UPDATE or
DELETE rows in the authors table, since we also have to keep
authors_by_name up to date when making such changes.

As usual, SQL can do this for us:
    CREATE INDEX authors_by_name ON authors(name);
Once you've done this, a statement such as
    SELECT * FROM authors WHERE name = 'Robert A. Heinlein';
will automagically use the index so it can be executed faster.

Most SQL servers provide a way of finding out how the server has
chosen to answer a query -- if memory serves, MS SQL Server has a
tool called "Query Analyzer" for this purpose.  (Exercise: create
an authors table and use this tool to experiment with various
queries as you fiddle with the indexes.)

In complicated queries, there may be several ways of using the
available indices to answer the question, and which one is fastest
depends on statistical properties of the data which each index
indexes.  Accordingly, many SQL servers keep track of such
statistics, and use them to decide among alternative search
strategies.  Such servers would usually provide special commands
for manipulating the statistics (e.g., to throw away the current
statistics and recompute them).

(Real SQL servers, incidentally, don't use binary search; they
usually store tables and indices using btrees or hashing.  This is
not usually a detail which will matter to you.  The important bit
as a SQL user is that creating an index on a given column or set
of columns makes searches involving those columns faster, at some
expense to inserts, deletes, and updates which touch that column.)

The end.

(The text promises information on transactions; I never got around to writing that. Sorry.)

Mail Steven: steven@amotlpaa.org