Often you will receive data in a non-spatial form such as comma delimited data with latitude and longitude fields. To take full advantage of PostGIS spatial abilities, you will want to create geometry fields in your new table and update that field using the longitude latitude fields you have available.

General Note: All the command statements that follow should be run from the PgAdminIII Tools - Query Tool or any other PostGreSQL Administrative tool you have available. If you are a command line freak - you can use the psql command line tool packaged with PostGreSQL.

Getting the data

For this exercise, we will use US zip code tabulation areas instead of just Boston data. The techniques here will apply to any data you get actually.

First step is to download the data from US Census. http://www.census.gov/geo/www/gazetteer/places2k.html

Importing the Data into PostGreSQL

PostgreSQL comes with a COPY function that allows you to import data from a delimited text file. Since the ZCTAs data is provided in fixed-width format, we can't import it easily without first converting it to a delimited such as the default tab-delimited format that COPY works with. Similarly for data in other formats such as DBF, you'll either want to convert it to delimited using tools such as excel, use a third party tool that will import from one format to another, or one of my favorite tools Microsoft Access that allows you to link any tables or do a straight import and export to any ODBC compliant database such as PostgreSQL.

For fixed width data such as ZCTA, we demonstrate a more generic way of importing fixed with data into PostgreSQL that doesn't require any additional tools beyond what is packaged with PostgreSQL.

Create the table to import to

First you will need to create the table in Postgres. You want to make sure the order of the fields is in the same order as the data you are importing.


CREATE TABLE zctas
(
  state char(2),
  zcta char(5),
  junk varchar(100),
  population_tot int8,
  housing_tot int8,
  water_area_meter float8,
  land_area_meter float8,
  water_area_mile float8,
  land_area_mile float8,
  latitude float8,
  longitude float8
) 
WITHOUT OIDS;

Convert from Fixed-width to Tab-Delimited

For this part of the exercise, I'm going to use Microsoft Excel because it has a nice wizard for dealing with fixed-width and a lot of windows users have it already. If you open the zcta file in Excel, it should launch the Text Import Wizard. MS Access has a similarly nice wizard and can deal with files larger than excels 65000 some odd limitation. Note there are trillions of ways to do this step so I'm not going to bother going over the other ways. For non-MS Office users other office suites such as Open-Office probably have similar functionality.

Open the file in Excel.
Import Text Wizard should launch automatically and have Fixed-Width as an option
Look at the ZCTA table layout spec http://www.census.gov/geo/www/gazetteer/places2k.html#zcta and set your breakouts the same as specified. For the above I broke out the Name field further into first 5 for zcta and the rest for a junk field.
Next File->Save As ->Text (Tab delimited)(*.txt) -give it name of zcta5.tab
Copy the file to somewhere on your PostGreSQL server.

The COPY command

Now copy the data into the table using the COPY command. Note the Copy command works using the PostGreSQL service so the file location must be specified relative to the Server.


     COPY zctas FROM 'C:/Downloads/GISData/zcta5.tab';

Creating and Populating the Geometry Field

Create the Geometry Field

To create the Geometry field, use the AddGeometryColumn opengis function. This will add a geometry field to the specified table as well as adding a record to the geometry_columns meta table and creating useful constraints on the new field. A summary of the function can be found here http://postgis.refractions.net/docs/ch06.html#id2526109.


SELECT AddGeometryColumn( 'public', 'zctas', 'thepoint_lonlat', 4269, 'POINT', 2 );

The above code will create a geometry column named thepoint_longlat in the table zctas that validates to make sure the inputs are 2-dimensional points in SRID 4269 (NAD83 longlat).

Populate the Geometry Field using the Longitude and Latitude fields


UPDATE zctas
        SET thepoint_lonlat = ST_SetSRID(ST_Point( longitude, latitude),4269) )

The above code will generate a PostGIS point geometry object of spatial reference SRID 4269.

There are a couple of things I would like to point out that may not be apparently clear to people not familiar with PostGreSQL or PostGis

You can't just put any arbitrary SRID in there and expect the system to magically transform to that. The SRID you specify has to be the reference system that your data is in.

Transforming to Another spatial reference system

The above is great if you want your geometry in longlat spatial reference system. In many cases, longlat is not terribly useful. For example if you want to do distance queries with your data, you don't want your distance returned back in longlat. You want it in a metric that you normally measure things in.

In the code below, we will create a new geometry field that holds points in the WGS 84 North Meter reference system and then updates that field accordingly.


SELECT AddGeometryColumn( 'public', 'zctas', 'thepoint_meter', 32661, 'POINT', 2 );

UPDATE zctas SET thepoint_meter = transform(setsrid(makepoint(longitude, latitude),4269), 32661);

Please note in earlier versions of this article I had suggested doing the above with


UPDATE zctas
SET thepoint_meter = transform(PointFromText('POINT(' || longitude || ' ' || latitude || ')',4269),32661) ;

On further experimentation, it turns out that PointFromText is perhaps the slowest way of doing this in Postgis. Using a combination of ST_Setsrid and ST_Point is on the magnitude of 7 to 10 times faster at least for versions of PostGIS 1.2 and above. ST_GeomFromText comes in second (replace ST_PointFromText with ST_GeomFromText) at about 3 to 1.5 times slower than ST_SetSRID ST_Point. See about PointFromText on why PointFromText is probably slower. In ST_GeomFromText, there appears to be some caching effects so on first run with similar datasets ST_GeomFromText starts out about 3-4 times slower than the ST_makepoint (ST_Point) way and then catches up to 1.5 times slower. This I tested on a dataset of about 150,000 records and all took - 26 secs for ST_PointFromText fairly consistently, 10.7 secs for first run of GeomFromText then 4.1 secs for each consecutive run, 3.5 secs fairly consistently for setsrid,makepoint on a dual Xeon 2.8 Ghz, Windows 2003 32-bit with 2 gig RAM).

Index your spatial fields

One of the number one reasons for poor query performance is lack of attention to indexes. Putting in an index can make as much as a 100 fold difference in query speed depending on how many records you have in the table. For large updates and imports, you should put your indexes in after the load, because while indexes help query speed, updates against indexed fields can be very slow because they need to create index records for the updated/inserted data. In the below, we will be putting in GIST indexes against our spatial fields.


CREATE INDEX idx_zctas_thepoint_lonlat ON zctas
  USING GIST (thepoint_lonlat);

CREATE INDEX idx_zctas_thepoint_meter ON zctas
  USING GIST (thepoint_meter);

ALTER TABLE zctas ALTER COLUMN thepoint_meter SET NOT NULL;
CLUSTER idx_zctas_thepoint_meter ON zctas;

VACUUM ANALYZE zctas;

In the above after we create the indexes, we put in a constraint to not allow nulls in the thepoint_meter field. The not null constraint is required for clustering since as of now, clustering is not allowed on gist indexes that have null values. Next we cluster on this index. Clustering basically physically reorders the table in the order of the index. In general spatial queries are much slower than attribute based queries, so if you do a fair amount of spatial queries especially bounding box queries, you get a significant gain.

Please note: (for those familiar with Cluster in SQL Server - Cluster in PostgreSQL pretty much means the same thing, except in PostgreSQL - at least for versions below 8.3 (not sure about 8.3 and above), if you cluster and then add new data or update data, the system does not ensure that your new or updated data is ordered according to the cluster as it does in SQL Server. The upside of this is that there is no additional penalty in production mode with clustering as you have in SQL Server. The downside is you must create a job of some sort to maintain the cluster for tables that are constantly updated.

In the above we vacuum analyze the table to insure that index statistics are updated for our table.

Post Comments About Part 3: PostGIS Loading Data from Non-Spatial Sources

Purpose of BostonGIS