In 2003, a guy I know was taking a course about SQL, felt that the instructor was not providing any conceptual framework whatsoever for the topic, and asked me to tell him what was what. What follows is my response.
If you know something about SQL, or relational databases generally, you might notice a few omissions, errors, and strangenesses here. Some of that is because of the special context for which I was writing — e.g., I go nowhere near relational algebra, because I'm addressing a novice who needs a much more basic orientation — but some of it is because I just didn't know any better.
If you don't know much about SQL, and are reading this to learn more: welcome, and beware! These notes are an idiosyncratic supplement to a proper course or book on the subject.
Date: Mon, 17 Nov 2003 01:36:50 -0700 Subject: Conceptual model of SQL, part 1: basics To a first approximation, data in SQL can be conceived this way: 1. Objects have properties. For example, the properties of the novel _Cryptonomicon_ include the fact that its title is "Cryptonomicon", that its author is Neal Stephenson, and that it was published in 1999. 2. Objects come in types; all objects of the same type have the same set of properties. For example, all books have a title, an author, and a year of publication. Every type of object gets a table; the table has a row for every object of that type and a column for every property that objects of that type have. Table "books" title author year --------------------------------------- Cryptonomicon Neal Stephenson 1998 The Cyberiad Stanislaw Lem 1985 Friday Robert Heinlein 1982 The Big U Neal Stephenson 1988 There are four things you might want to do with this data: 1. SELECT: Retrieve information. 2. INSERT: Add a new row. 3. DELETE: Delete rows. 4. UPDATE: Change the contents of existing rows. These four kinds of statement comprise the Data Manipulation Language, which is half of SQL. All of them involve doing things with *objects* -- finding them, creating them, destroying them, and changing them. The other half of SQL is the Data Definition Language, which has to do with doing things to *types* -- for example, the CREATE TABLE statement creates a new table (= a new type of object), specifying what columns the table has (= what properties objects of that type have). Examples: 1. An example of Data Definition Language. English: "Books have titles, authors and years of publication; titles are strings, authors are strings, and years are integers." SQL: CREATE TABLE books (title varchar, author varchar, year int); Pretty simple. Creates an empty table with the specified columns. 2. Adding data. English: "There is a book named 'Cryptonomicon', written by Neal Stephenson, and published in 1998. There is a book named 'The Cyberiad', written by Stanislaw Lem, ... (etc.)" SQL: INSERT INTO books (title, author, year) VALUES('Cryptonomicon', 'Neal Stephenson', 1998); INSERT INTO books (title, author, year) VALUES('The Cyberiad', 'Stanislaw Lem', 1985); INSERT INTO books (title, author, year) VALUES('Friday', 'Robert Heinlein', 1982); INSERT INTO books (title, author, year) VALUES('The Big U', 'Neal Stephenson', 1988); Again, pretty simple. 3. Retrieving rows. English: "What books were published in the 1980s?" SQL: SELECT * FROM books WHERE 1980 <= year AND year < 1990; Response: title author year --------------------------------------- The Cyberiad Stanislaw Lem 1985 Friday Robert Heinlein 1982 The Big U Neal Stephenson 1988 Note that this response is just a subset of the original table -- the only difference is that the Cryptonomicon row was omitted because it has year = 1998. 4. Retrieving specific columns. You can also cause columns to be omitted, producing a different kind of subset of the original table: English: "What are the titles of books published in the 1980s?" SQL: SELECT title FROM books WHERE 1980 <= year AND year < 1990; Response: Title ------------ The Cyberiad Friday The Big U 5. Computing extra columns. It's also possible to *add* columns to the response, by having the SQL server compute them on the fly: English: "When is the 50th anniversary of the publication of each book?" SQL: SELECT *, year+50 AS anniversary FROM books; Response: title author year anniversary ---------------------------------------------------- Cryptonomicon Neal Stephenson 1998 2048 The Cyberiad Stanislaw Lem 1985 2035 Friday Robert Heinlein 1982 2032 The Big U Neal Stephenson 1988 2038 (As before, '*' means 'all columns'. It's actually a bad idea to use '*', since the SQL standard doesn't guarantee that you get the columns in any particular order -- you might get different results from different servers, or even from the same server at different times. For portability and robustness, ask for the columns you want in the order you want.) 6. Aggregation. And, for my last example of SELECT, you can suppress detail, aggregating many rows into one: English: "Group all the books by author. For each group, what is the author, the number of books in the group, and the first and last years of publication of books in the group?" SQL: SELECT author, COUNT(*), MIN(year), MAX(year) FROM books GROUP BY author; Response: Author count min_year max_year --------------------------------------------- Neal Stephenson 2 1988 1999 Stanislaw Lem 1 1985 1985 Robert Heinlein 1 1982 1982 Here each row of the response corresponds to one or more rows from the original table. 7. Deleting and updating data. English: "Forget about all of Neal Stephenson's books." SQL: DELETE FROM books WHERE author = 'Neal Stephenson'; English: "Change 'Robert Heinlein' to 'Robert A. Heinlein' everywhere." SQL: UPDATE books SET author = 'Robert A. Heinlein' WHERE author = 'Robert Heinlein'; As these examples illustrate, you specify which rows to delete or update in the same way as you specify which rows to retrieve in a SELECT statement -- with a WHERE clause. This means that to specify a single row, you need to be able to express some unique property it has. My example table of books is badly designed in this respect. For example: English: "Oops... Cryptonomicon was actually published in 1999." SQL: UPDATE books SET year = 1999 WHERE title = 'Cryptonomicon'; This SQL statement will affect *every* row whose title = 'Cryptonomicon'; in my example that's just one row, but we can easily imagine multiple books having the same title. It is thus common to put an 'id' column in such tables; it contains a number unique to each row, even if all other values in the row are duplicated elsewhere. This way you can delete or update individual rows with confidence that only the row you want will be affected. In an application, you might SELECT a bunch of rows, show them to the user on the screen, have them highlight one with the mouse and click a "Nuke" button, and then do a "DELETE ... WHERE id = foo", with confidence that only that one row will be deleted. In future installment(s): constraints, relationships between tables, multiple-table queries, and indexes.
Date: Tue, 18 Nov 2003 22:13:20 -0700 Subject: SQL, part 2: multiple tables; constraints Today, we add to our developing example the idea of authors' years of birth and death. One of our objectives is to be able to display books like this: Cryptonomicon Neal Stephenson (1959-) The Cyberiad Stanislaw Lem (1921-) Friday Robert A. Heinlein (1907-1988) The Big U Neal Stephenson (1959-) The quickest and dirtiest way to achieve this objective is just to consider the strings like "(1959-)" to be part of the authors' names. That is, in our books table we'd have title author year --------------------------------------------------- Cryptonomicon Neal Stephenson (1959-) 1999 The Cyberiad Stanislaw Lem (1921-) 1985 Friday Robert A. Heinlein (1907-1988) 1982 The Big U Neal Stephenson (1959-) 1988 One problem with this approach is that it makes it harder to search for books by author. For example, to find books written by Neal Stephenson, you'd have to know that he was born in 1959 and ask SELECT * FROM books WHERE author = 'Neal Stephenson (1959-)'; (Well, there are other ways you could find those rows, but never mind.) It would also be difficult to ask questions like "which books were written by authors born before 1950?". These problems can be solved by separating out the birth and death information into their own columns: Table "books" title author birth death year ----------------------------------------------------- Cryptonomicon Neal Stephenson 1959 1999 The Cyberiad Stanislaw Lem 1921 1985 Friday Robert A. Heinlein 1907 1988 1982 The Big U Neal Stephenson 1959 1988 Call this Schema 2. (A "schema" is a database's set of table definitions; it's what the Data Definition Language manipulates. Schema 1 consists of the books table as it was defined in my previous email.) We've recovered the ability to find books by author name without knowing when they were born, and added the ability to do queries on the birth and death years of their authors. To display things like "Neal Stephenson (1959-)", we'll combine these three columns in the application, using sprintf or something. (Exercise: Write a SQL statement to find books written by authors born before 1950.) Things are getting better, but we're not doing the Right Thing yet. For example, in this setup, the simple-seeming question "What year was Neal Stephenson born?" is troublesome. Extrapolating from the examples of the previous email, we might try SELECT birth FROM books WHERE author = 'Neal Stephenson'; The response would be birth ----- 1959 1959 Since there are two books satisfying the WHERE clause, you get two rows back, although you only wanted one fact. To get just one row, you could try aggregation, grouping by author, SELECT birth FROM books WHERE author = 'Neal Stephenson' GROUP BY author; but SQL doesn't even allow this query. The problem is that the birth column could have different values in the two rows where author = 'Neal Stephenson', and the server has no way to know which one you want for that group of rows. If we persist in this approach, we might end up with something like SELECT MIN(birth) FROM books WHERE author = 'Neal Stephenson' GROUP BY author; which actually asks "What is the minimum year of birth recorded for Neal Stephenson?"... not quite what we want. This mess results from having made a conceptual error. According to the model from last email, having a birth column in the books table means conceptually that the author's year of birth is a property that books have. But it's really a property of the authors, not the books. That is, authors form another type of objects, with their own properties, and so deserve their own table. So, schema 3: scrap the birth and death columns in the books table and create a new table for authors. Table "authors" name birth death -------------------------------- Neal Stephenson 1959 Stanislaw Lem 1921 Robert A. Heinlein 1907 1988 Now we can find Neal Stephenson's birth year easily, and we can pose questions like "which authors were born before 1950?" and "how many authors were born in each year?" (Exercise: What are the SQL statements for these questions, now that we have an authors table?) Meeting our original objective, displaying things like Cryptonomicon Neal Stephenson (1959-) The Cyberiad Stanislaw Lem (1921-) Friday Robert A. Heinlein (1907-1988) The Big U Neal Stephenson (1959-) now takes a bit more work, to combine the information from the two tables: Book Display Algorithm (for Schema 3) 1. Get a list of books by using SELECT title, author FROM books; 2. For each book in that list, use SELECT birth, death FROM authors WHERE name = ?; to obtain the author's birth and death years. (Fill in the ? with the value of author for the current book.) 3. Assemble this information into text and display it. This is a fine plan, but there are three things to consider. First, what if, in step 2, the SELECT statement produces more than one row? (This would occur whenever two authors have the same name.) How would we know which birth/death years apply to the author of the book we're interested in? Fixing this calls for some way to identify authors uniquely, so as suggested at the end of the previous email, let's add an 'id' column: Table "authors" id name birth death ------------------------------------ 1 Neal Stephenson 1959 2 Stanislaw Lem 1921 3 Robert A. Heinlein 1907 1988 Then, in the books table, we can replace the author name column with an 'authorid' column, so we know exactly which author is associated with each book: Table "books" title authorid year ----------------------------- Cryptonomicon 1 1999 The Cyberiad 2 1985 Friday 3 1982 The Big U 1 1988 This is Schema 4. (Exercise: Update the Book Display Algorithm to work under Schema 4. Verify that your revision actually obtains all the information needed in step 3.) But wait -- how does this actually solve the problem of possibly getting more than one row back in step 2? What if two rows in the authors table have the same id? The answer is, that should never happen. Identifiers by their nature and function should be unique; in other words, we want to prevent rows with the same id from being added to the table in the first place, so that we won't have to worry about it in step 2. This brings us to the topic of constraints. SQL lets you restrict the values that may appear in a column in various ways. Constraints are statements about what values are permitted in certain columns; as such they concern the entire table, and are in the purview of the Data Definition Language. To make sure author ids are unique, we need the UNIQUE constraint: CREATE TABLE authors ( id int UNIQUE, name varchar, birth int, death int ); With the authors table created this way, if we try to INSERT two rows with the same id, the second insertion will fail. Also, an UPDATE statement can't change the id column in a way which would cause a collision. Thus we can rely on never getting more than one row back in step 2. There's one gotcha with the UNIQUE constraint: NULL is not equal to anything, not even itself. That is, we could make two authors with NULL ids, and the UNIQUE constraint would not complain, since those values are not equal to each other. (We might say that UNIQUE requires *values* to be unique, and that NULL is exempt because it's not a value but the absence of a value. NULL is so weird that it makes good sense to adopt this view.) But every author should have an id number -- in other words, we want to enforce the constraint that the values in the id column are not NULL. CREATE TABLE authors ( id int UNIQUE NOT NULL, name varchar, birth int, death int ); The name column should also be NOT NULL, and maybe the birth column too. (We could also have just declared that the name column to be UNIQUE; this would also have ensured that step 2 of the Book Display Algorithm produces only one row, and we wouldn't need the id column. The problem with this is that it just ain't so -- name collisions do happen in the real world, and our database should be ready for them. It's the difference between "should never happen" and "probably won't happen".) Constraints aren't restricted to canned things like "UNIQUE" and "NOT NULL"; the CHECK constraint lets you state an arbitrary condition which must always hold. For example, we probably want CREATE TABLE authors ( ... birth int NOT NULL, death int, CHECK (birth <= death OR death IS NULL) ); because people can't die before they're born (or rather, those that do don't become authors). This example also illustrates that constraints can involve more than one column. (Note that we have to check for NULL death year specially. This is because "foo <= NULL" is false for any foo. Also "foo > NULL" is false for any foo, and so is "foo <> NULL"; that is, not only is NULL not equal to anything, it's not not-equal to anything either. Did I mention that NULL is weird?) Second consideration: What if we can't find *any* rows for a given author id in step 2? This is other side of the first consideration, where we were worried about getting more than one row. The answer is the same: it should never happen. That is, we wish to enforce the constraint that each value in the authorid column of the books table matches some value in the id column of the author table. For this we need the REFERENCES constraint: CREATE TABLE books ( title varchar, authorid int REFERENCES author(id), year int ); Now if we try to INSERT a book with a bogus author id, it'll fail, as will an attempt to UPDATE a book's author id to something bogus. Conversely, in the authors table, we can DELETE an author or UPDATE its id only if there are no books for that author (since otherwise the change would leave those books in violation of the REFERENCES constraint). (Many SQL servers allow you to make other things happen when deleting or updating an author. For example, when updating an author id, you might want all the associated books to be updated to the new authorid -- this is called "cascading". How to make this happen, if you want it, depends on the implementation.) (Exercise: Add an id column to the above CREATE TABLE statement for the books table. Also add any appropriate constraints to the other columns.) (Exercise: Suppose that, by error, Heinlein was added to the authors table twice, once with id 3 and once with id 10. Suppose further that, when adding Heinlein's books to the book table, sometimes they were associated with Heinlein 3 and sometimes with Heinlein 10. Write a sequence of SQL statements which fixes the data, without at any stage violating any constraints.) The third and final consideration has nothing to do with constraints. In the Book Display Algorithm, we loop through the list of books, looking up each authorid in the authors table. This is somewhat inefficient, since we ask the server for an author's row once for each book that author wrote, duplicating effort for prolific authors. We could deal with this in the application by keeping a list of which authors we've already asked about, and for each book, checking that list before asking the server. (In other words, cache the author data in the application, at least for the duration of the loop.) But SQL can do better, since it allows you to SELECT from more than one table at once. For example: SELECT books.title, books.authorid, authors.id, authors.name FROM books, authors; (Note that we specify columns by the combination of table name and column name. For convenience, you can omit the table name if there's no ambiguity.) Response: books.title books.authorid authors.id authors.name ------------------------------------------------------- Cryptonomicon 1 1 Neal Stephenson Cryptonomicon 1 2 Stanislaw Lem Cryptonomicon 1 3 Robert A. Heinlein The Cyberiad 2 1 Neal Stephenson The Cyberiad 2 2 Stanislaw Lem The Cyberiad 2 3 Robert A. Heinlein Friday 3 1 Neal Stephenson Friday 3 2 Stanislaw Lem Friday 3 3 Robert A. Heinlein The Big U 1 1 Neal Stephenson The Big U 1 2 Stanislaw Lem The Big U 1 3 Robert A. Heinlein This result set has one row for every possible combination of a row from the books table and a row from the authors table. That's what "FROM books, authors" means -- the Cartesian product of the two tables. Useless? No! By construction, this result contains as a subset all the correct combinations of books with their authors. And we already know how to pick just a subset of a result -- with a WHERE clause. Exercise (and this one you should actually consider doing): What WHERE clause is needed to reduce the above result set to just the ones with the correct combinations of book and author? Put it all together by writing a SQL statement to answer the question "for each book, what is its title, its author's name, and its author's years of birth and death?". (Note that this is exactly the information we need to complete the objective set at the beginning of this email.) (Exercise: Write a SQL statement to pose the question "how many books has each author written?".) In future installments: many-to-many relationships, indexes, transactions... and I might have something to say about schema design too. I also now realize that I forgot to talk about the interaction of WHERE and GROUP BY in my last email, so I'll stick that in somewhere.
Date: Fri, 21 Nov 2003 08:20:01 -0700 Subject: SQL, part 3: many-to-many relationships; more on GROUP BY Previously, on "Steven's Correspondence Crash Course in SQL": Table "authors" id name birth death ------------------------------------ 1 Neal Stephenson 1959 2 Stanislaw Lem 1921 3 Robert A. Heinlein 1907 1988 Table "books" id title authorid year --------------------------------- 1 Cryptonomicon 1 1999 2 The Cyberiad 2 1985 3 Friday 3 1982 4 The Big U 1 1988 (This is Schema 5; it's the same as Schema 4 from the previous email except it has books.id, as suggested in one of the exercises. Not shown, it also includes all the constraints previously discussed.) Today we add to our developing catalogue of science fiction novels _The Mote in God's Eye_, by Larry Niven and Jerry Pournelle. Niven and Pournelle are both authors, so we give them each a row in the authors table. Likewise the Mote gets a row in the books table. New rows in "authors" id name birth death ------------------------------------ 4 Larry Niven 1938 5 Jerry Pournelle 1933 New row in "books" id title authorid year ----------------------------------------- 5 The Mote in God's Eye ? 1974 But we have a problem in the authorid field for this book. This, of course, is the new feature we're adding today -- support for collaborations. Having an authorid column in the books table which REFERENCES author.id indicates that there is a one-to-many relationship between these classes of object: for each book there is exactly one author -- the one identified in the books.authorid column -- while for each author there are (possibly) many books -- all those with that author's id in their authorid column. What we want is a many-to-many relationship: each author may have many books, each book may have many authors. We can achieve this in SQL by adding an intermediate table: Table "authorship" authorid bookid ---------------- 1 1 1 4 2 2 3 3 4 5 5 5 Each row of this table represents the fact that the specified author was involved in writing the specified book. When we have an author and want to find their books, we do "What are the ids of books written by the author whose id is foo?" SELECT bookid FROM authorship WHERE authorid = foo; Then we look the bookids up in the books table. Or rather, we let the server combine the tables for us, "What are the titles of the books written by author id foo?" SELECT books.title FROM books, authorship WHERE books.id = authorship.bookid AND authorship.authorid = foo; as described at the end of part 2. The first condition in the WHERE clause restricts the all-possible-combinations result set to just those with the correct combinations of book with authorship entry, and the second restricts it further to just those concerning the specified author. (The authorship table breaks the conceptual model from part 1; its rows do not indicate objects. We can recover by adopting the view that rows represent facts or statements; the "row = object" kind of table holds facts of the type "this object has such-and-such properties", while tables like authorship hold facts of the type "this object has such-and-such a relationship to that object".) As a first cut, the authorship table can be created by CREATE TABLE authorship ( authorid int, bookid int ); But suppose this table had two identical rows, say, two rows with authorid 4 and bookid 5 (representing Niven's authorship of The Mote). Then we'd have a nasty problem when combining this table with the other tables; for example, "Which books did Niven write?" SELECT books.title FROM books, authorship WHERE books.id = authorship.bookid AND authorship.authorid = 4; would result in books.title --------------------- The Mote in God's Eye The Mote in God's Eye The book appears twice because, in the all-possible-combinations result set that "FROM books, authorship" makes, the one row with authorid 4 and bookid 5 is combined with The Mote, and so is the other row, making two satisfying the WHERE condition. To prevent this, we want each row in the authorship table to be unique. It's fine for an author to appear more than once in that table, and it's fine for a book to appear more than once, but not the same combination of author and book. Happily, the UNIQUE constraint can also be applied to a collection of columns, with exactly that effect: CREATE TABLE authorship ( authorid int, bookid int, UNIQUE(authorid, bookid) ); (Exercise: What other constraints does this table need?) Having the book-to-author relationship in its own table means the books table no longer needs an authorid column: Table "books" id title year ------------------------------- 1 Cryptonomicon 1999 2 The Cyberiad 1985 3 Friday 1982 4 The Big U 1988 5 The Mote in God's Eye 1974 This change, along with the new authorship table, makes Schema 6. (Exercise: What SQL statements are needed to add the novel _Zodiac_, written by Neal Stephenson and published in 1988, and then to delete it again?) (Exercise: Given the data above, what is the result set produced by this query? SELECT books.title, author.name FROM books, authorship, authors WHERE books.bookid = authorship.bookid AND authorship.authorid = authors.id AND books.year < 1985; ) So now our database supports collaborations. We've already seen the basic "who wrote what" kinds of questions; we can also ask questions such as "How many authors wrote each book?" SELECT books.id, COUNT(*) AS author_count FROM books, authorship WHERE books.id = authorship.bookid GROUP BY books.id; Of course, we probably want the titles; but just replacing "SELECT books.id..." with "SELECT books.title..." won't work here. The problem is that, when grouping, every column in the output (i.e., everything in the SELECT list) must be either one of the things grouped by, an aggregate (e.g., a COUNT, a MIN), or something computed from these. So we have to either group on the title too, SELECT books.title, COUNT(*) AS author_count FROM books, authorship WHERE books.id = authorship.bookid GROUP BY books.id, books.title; or aggregate the titles, SELECT MIN(books.title), COUNT(*) AS author_count FROM books, authorship WHERE books.id = authorship.bookid GROUP BY books.id; (This may seem silly. After all, there can only be one value of books.title per books.id, since there's only one *row* per books.id, by the UNIQUE constraint. But SQL doesn't make that kind of inference. Don't ask me why.) Let's try "Which books are collaborations?" Extrapolating from previous examples, we might try SELECT books.id FROM books, authorship WHERE books.id = authorship.bookid AND COUNT(*) > 1 GROUP BY books.id; But alas, this won't work. The problem is that the WHERE clause operates before grouping; that is, here it filters the all-possible-combinations result set, not the grouped result set which the "GROUP BY books.id" produces. Thus, at the time the WHERE filtering occurs, COUNT(*) cannot be determined. (If WHERE filtering happened after grouping, the COUNT(*) > 1 part would be fine, but the other part -- the test books.id = authorship.booksid -- wouldn't work. The data starts as all-possible-combinations; if we group by books.id before filtering, the groups would include all the incorrect combinations of book with authorship. Try SELECT books.id, COUNT(*) FROM books, authorship GROUP BY books.id; for example. We can't filter these incorrect combinations out after grouping because, by grouping on books.id, we've thrown away all the other columns, retaining only summary information such as MAX() and COUNT().) What we need is the HAVING clause. This is just like WHERE, but it operates after grouping, so it can refer to aggregate information such as counts, maxima, averages, etc., but it can't refer to individual values from columns which have been aggregated by grouping. In the case at hand, we need "Which books are collaborations?" SELECT books.id FROM books, authorship WHERE books.id = authorship.bookid GROUP BY books.id HAVING COUNT(*) > 1; For my next trick, the question "Who has Niven collaborated with?" This one is a bit complicated, but we can sneak up on it by considering how we'd answer this question in the application. Niven Collaborators Algorithm 1. Use SELECT bookid FROM authorship WHERE authorid = 4; to find the books that Niven had a part in. 2. For each bookid, use SELECT authorid FROM authorship WHERE bookid = ?; to find all the authors of that book. 3. Combine all the lists of authors obtained in step 2 and display the result. Notice how steps 1 and 2 here have the same structure as steps 1 and 2 in the Book Display Algorithm from last email; we retrieve some rows using one query, then loop through them, running an additional query for each. This suggests that we can use a multiple-table query to do these two steps at once, just as we did in that book display example. The tricky bit is that here the two tables being joined are the same table, authorship. The SQL feature which lets us pull this off is the ability to assign, in the FROM clause, an alias for a table, and refer to it by that alias in the rest of the SELECT statement: SELECT spam.title FROM books AS spam; Thus, in the case at hand, we can do SELECT ... FROM authorship, authorship AS otherauthor ... ; Then it's just as if we're joining two different tables, one named authorship and one named otherauthor. So: SELECT otherauthor.authorid FROM authorship, authorship AS otherauthor WHERE authorship.authorid = 4 -- Niven wrote the book AND authorship.bookid = otherauthor.bookid -- so did the other author AND otherauthor.authorid <> 4; -- other author is not Niven (Exercise: Explain why we have to test that the other author is not Niven.) (Exercise: Explain why the output of this query might contain duplicate rows, and fix it. Also, while you're in there, arrange for the output to include the other author's name.) One problem with this approach to many-to-many relationships is that you can't enforce any restrictions on how many objects are on either side. For example, we can't enforce the constraint that each book must have at least one author. (Imagine you could enforce such a constraint. When adding a new book, you'd have to INSERT a book row and at least one authorship row; but you wouldn't be able to INSERT the book row first, since that would produce a book with no authors, and you wouldn't be able to INSERT an authorship row first, since that would produce an authorship with a bogus book id. Deadlock.) Instead, such constraints, if you need them, are enforced by the application; it just never creates a book without immediately creating an authorship to go with it, and never deletes the sole authorship for a book. (More on this when I talk about transactions.) (Exercise: At the beginning of all this I stated, on conceptual grounds alone, that Niven and Pournelle should each get a row in the authors table. What if we consider them to be a single entity, some kind of hive mind perhaps, and give that entity a row in the authors table? Then we can leave books.authorid in place and forget about the authorship table. Would any practical problems then arise? E.g., would there then be questions which are difficult or impossible to ask?) (Exercise: What about giving The Mote two rows in the books table, one with authorid = 4 and one with authorid = 5?) (Exercise: What about just adding an authorid2 column to the books table for the second author?) In future installments: indexes and transactions. (I won't talk about schema design separately after all. Those last three exercises are enough.) The example database will not be developed further.
Date: Mon, 24 Nov 2003 18:07:45 -0700 Subject: SQL, part 4: indices Usually you can use SQL without worrying about how the SQL server does what you tell it to do; this is one of the benefits of a high-level language such as SQL. Occasionally, however, it is worthwhile to have at least a broad idea of the sorts of things a server might do when processing SQL statements. (The usual scenario is that a query is taking too long to execute, and you want to speed it up.) First let's consider a statement such as SELECT * FROM authors WHERE id = 3; The worst way to answer this question is to go through the authors table row by row, checking each row's id, and outputting just those rows whose id is 3. This is called "linear search"; if the table doubles in size, such a search takes twice as long. (That is, its running time is proportional to n, the number of rows in the table.) But suppose the rows of the authors table are on disk in order of id. Then we can use "binary search": First look at the row in the middle of the table; if that row has id > 3, the desired rows are in the first half of the table; otherwise they are in the second half. Then look at the row in the middle of the appropriate half, and repeat until you find rows with id = 3. Under this strategy, if the table doubles in size, we just need one more step at the beginning, to determine which half of the enlarged table the rows we seek are in. (Thus binary search takes time proportional to log n.) So in general we'd prefer that the rows of a table be on disk in an order which makes searches by id fast. This can usually be achieved by designating the id column to be the PRIMARY KEY of the table: CREATE TABLE authors ( id int PRIMARY KEY, name varchar, birth int, death int ); (Formally, PRIMARY KEY is the same as UNIQUE NOT NULL, but it further suggests that the server ought to organize the table on disk in a way which makes searches by that column fast.) But what if we want to do other kinds of search? For example, we'll probably want queries such as SELECT * FROM authors WHERE name = 'Robert A. Heinlein'; to be fast too. (Users of our application will presumably specify authors by name, not by id, and we'll have to use this kind of query to find out which author they're referring to.) With the table sorted by id, we'd have to use linear search to answer this question, since Heinlein might be anywhere in the table. We could sort the table by name instead, but then searches by id would need linear search. (The situation is analogous to a phone book, which is sorted in a way which makes finding entries by name easy but finding entries by phone number hard.) One strategy is to create an auxiliary table sorted by name. This table might look like Table "authors_by_name" name id --------------------- Neal Stephenson 1 Stanislaw Lem 2 Robert A. Heinlein 3 Then, to look up an author by id, we do binary search in the main table, while to look up an author by name we first do binary search in authors_by_name to find the id, then binary search in the main table to find the complete row. This way we get both searches by id and searches by name to run in log n time. Of course, we slightly increase the time needed to INSERT, UPDATE or DELETE rows in the authors table, since we also have to keep authors_by_name up to date when making such changes. As usual, SQL can do this for us: CREATE INDEX authors_by_name ON authors(name); Once you've done this, a statement such as SELECT * FROM authors WHERE name = 'Robert A. Heinlein'; will automagically use the index so it can be executed faster. Most SQL servers provide a way of finding out how the server has chosen to answer a query -- if memory serves, MS SQL Server has a tool called "Query Analyzer" for this purpose. (Exercise: create an authors table and use this tool to experiment with various queries as you fiddle with the indexes.) In complicated queries, there may be several ways of using the available indices to answer the question, and which one is fastest depends on statistical properties of the data which each index indexes. Accordingly, many SQL servers keep track of such statistics, and use them to decide among alternative search strategies. Such servers would usually provide special commands for manipulating the statistics (e.g., to throw away the current statistics and recompute them). (Real SQL servers, incidentally, don't use binary search; they usually store tables and indices using btrees or hashing. This is not usually a detail which will matter to you. The important bit as a SQL user is that creating an index on a given column or set of columns makes searches involving those columns faster, at some expense to inserts, deletes, and updates which touch that column.)
The end.
(The text promises information on transactions; I never got around to writing that. Sorry.)