I love the idea of code generation without using IDEs just because I know if my IDEs of choice are not present
at a particular moment, I've got another swiss army knife to rely on.
I particularly like SQL code generation schemes that rely on a database to do it. When you are trying to
generate SQL code, you often need the metadata of your table structures to do so and using the database to get that info is often the easiest approach.
This post is in the spirit of Hubert Lubaczewski's recent PostgreSQL post grant XXX on * ? which details how to grant access to tables in PostgreSQL using some SQL code generation tricks. As it turns out I had a similar situation I had to deal with recently, but involved updating fields to null.
Empty String verses Null
Now there have been long debates about when to use Null vs. Empty string and the various Gotchas involved. If you want to know about these, this series on Nulls
What if null if null is null null null is null?
by Hugo Kornelis is a pretty good one. Hugo is predominately a Microsoft SQL Server blogger, but for the most part how null
is handled is pretty much the same across most pseudo ANSI standard relational databases. So regardless of your database poisons of choice, his comments have
equal merit.
I generally prefer the use of Null instead of empty string, all gotchas aside, mostly for philosophical reasons;
I believe that the absence of data should be a black hole and that is what NULL is. When you do a lot of statistical and financial reporting as we do, its immensely useful, but you have to be cognizant of the difference between 0, empty string and NULL and that it is used consistently within your database.
There is one other reason I prefer NULL over using an empty string and that is that
NULL can be cast to any data type because it represents the existence of nothing where as an empty string can not because an empty string is a string. This gets me to my particular dilemma.
Updating Empty String to NULL
Often times when I get property parcel data they send it to me as all varchar or some such thing and they are always changing
the field names on me (the evils of DBF/ESRI shape as a transport mechanism) and I then have to massage this data into my superbly structured tables where a number is a number and can be tabulated without casting. Now take
the case if you have say land property's assessed value or some other numeric field that comes as varchar and you need to stuff it into an integer or float (double precision) field and a lot of these fields come thru as empty string.
You get an error like this - ERROR: invalid input syntax for double precision: "" in PostgreSQL (or SQL Server) when you do something like.
SELECT CAST('' AS float)
But if you do something like SELECT CAST(NULL as float)
that works just fine,
because all kinds of data can have black holes. NULL is the universal thing that is a party of all data sets.
It is equivalent to the empty set in set theory.
So the way I get around this unpleasantry is to set all varchar fields
that are empty to null before I try to insert it into my final table structure. You can imagine this gets pretty repetitive if you have to do this for say
20 fields. So here is my trick to generate the SQL to do this in PostgreSQL.
SELECT 'UPDATE ' || table_name || ' SET ' || column_name || ' = NULL WHERE ' || column_name || ' = '''';' As sqlupdate
FROM information_schema.columns
WHERE table_name = 'sometable' AND data_type LIKE '%char%'
The above will return a row for each column in your table that is a character varying or char field and will contain the update statement to update all empty strings in that column to NULL. If you want to go one better,
you can create a custom aggregate for strings (as I mentioned in More generate series) and use it to get a single row containing all your update statements.
Note that this same trick works in other databases that support the ISO-SQL:1999+ information_schema such as (SQL Server 2005, SQL Server 2000, MySQL 5, PostgreSQL 7.4 +) - (sadly Oracle apparently doesn't support information_schema - perhaps someday. ).
In SQL Server 2000/2005 you would replace the || with + so your code would look like
SELECT 'UPDATE ' + table_name + ' SET ' + column_name + ' = NULL WHERE ' + column_name + ' = '''';' As sqlupdate
FROM information_schema.columns
WHERE table_name = 'sometable' AND data_type LIKE '%char%'
In MySQL you would have to use the CONCAT function if you are not in ANSI mode. For MySQL 5+ you can do
set sql_mode = 'ANSI';
and use the standard ||.
If you want to have the above return a single row in SQL Server 2005, similarly you would create an aggregate function for strings
as described here - granted a bit more involved than doing the same in PostgreSQL.
Or use SQL Server's XPath syntax supported in SQL Server 2005 as described here.
Note that if you are making heavy use of database schemas,
then you will need to qualify your tables with the schema name and do a where on the table_schema as well.
SELECT 'UPDATE ' || table_schema || '.' || table_name || ' SET ' || column_name || ' = NULL WHERE ' || column_name || ' = '''';' As sqlupdate
FROM information_schema.columns
WHERE table_name = 'sometable' AND data_type LIKE '%char%' AND table_schema = 'assessing';
Explore the INFORMATION_SCHEMA
The INFORMATION_SCHEMA is chuck-full of all sorts of useful metadata about your database objects. While some tables may not exist
in some database management systems (DBMS) (e.g. PostgreSQL has a table information_schema.sequences (defined in SQL:2003 standard which you won't find in MYSQL and SQL Server because those databases don't have sequence objects),
for the tables that exist, the tables are consistently named across all DBMS's that support them and so are the field names.
There are 3 tables (views) I find most useful in the information schema. These are all available in the DBMS's I mentioned, and those are
- INFORMATION_SCHEMA.TABLES - contains listing of all tables and views in a database
- INFORMATION_SCHEMA.COLUMNS - contains listing of all columns in a database, the table they belong to, the data type, length, ordinal position in table
- INFORMATION_SCHEMA.VIEWS - contains names of views and the SQL that defines the view
- These ones I haven't explored but look useful for doing security audits on your tables - INFORMATION_SCHEMA.TABLE_PRIVILEGES, INFORMATION_SCHEMA.COLUMN_PRIVILEGES
Note - for the most part if not in all cases, the INFORMATION_SCHEMA set of information are pretty much implemented as views on top of the proprietary DBMS system tables. The reasons I find most compelling to use them instead of using the direct system tables are the following
- It is often easier to navigate for basic metadata than trying to navigate the DBMS system tables. I have found it very rare to need information that I couldn't find in the information_schema and needed to resort to system tables.
- Since it is purely there for informational purposes and is a standard, you don't have to worry too much about it changing as you would
a system table - whose purpose is more for direct management of database internals.
- Since it is a standard, for the databases that comply with the standard you can feel a bit more at home, since there is one less thing you need to know about to migrate. Thus less feeling of Vendor Lock-In.