pg_bulkload is a high-speed data loading utility for PostgreSQL.
The pg_bulkload project is a PostgreSQL community project hosted on pgFoundry. It is produced by NTT OSS Center.
The pgFoundry page for the project is at http://pgfoundry.org/projects/pgbulkload, where you can find downloads, documentation, bug reports, mailing lists, and a whole lot more.
pg_bulkload provides high-speed data loading capability to PostgreSQL users.
When loading a huge amount of data into a database, it is a common situation that the data set to be loaded is already valid and consistent. For example, dedicated tools are often used to prepare such data, performing data validation in advance. In such cases, we want to bypass any overhead inside the database system so that the data is loaded as quickly as possible. pg_bulkload was developed to help in such situations.
Therefore, it is not pg_bulkload's goal to provide detailed data validation. Rather, pg_bulkload assumes that the loaded data set has been validated by separate means. If you are not in such a situation, you should use the COPY command in PostgreSQL.
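For comparison, a plain COPY load through psql might look like this (a minimal sketch; the table name customer, the file path, and database_name are placeholders):

$ psql -c "\copy customer FROM '/tmp/customer.csv' WITH CSV" database_name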
pg_bulkload provides two programs to users: the pg_bulkload command itself and the postgresql script.
The postgresql script is a wrapper command for pg_ctl, which starts and stops the PostgreSQL server; the script invokes pg_ctl internally. It provides a very important piece of pg_bulkload functionality: recovery. For performance, pg_bulkload bypasses some of PostgreSQL's internal functionality, such as WAL. Therefore, pg_bulkload needs its own recovery procedure, performed before the usual PostgreSQL recovery. The postgresql script provides this feature.
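In day-to-day use, the server lifecycle through the wrapper then looks like the following sketch (the start form appears in the installation steps below; the stop form is assumed to mirror pg_ctl, so confirm it against your copy of the script):

$ postgresql start   # performs pg_bulkload recovery if needed, then "pg_ctl start"
$ postgresql stop    # assumed to stop the server via "pg_ctl stop"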
Be sure to read the "Reminder" section below, especially if you use the DIRECT load mode (bypassing WAL; the default setting).
The installation sequence is shown below. $PGDATA refers to a database cluster created with initdb.
$ cd [directory where the postgresql-8.2.X or 8.3.X tarball is untarred]/contrib/
$ tar zxvf pg_bulkload-2.3.X.tar.gz
$ cd pg_bulkload
$ make
$ make install
$ mkdir $PGDATA/pg_bulkload
$ postgresql start
$ psql -f $PGHOME/share/contrib/pg_bulkload.sql database_name
You can use pg_bulkload by the following three steps:

1. Edit a control file that describes the load. The sample control files sample_csv.ctl and sample_bin.ctl included in the pg_bulkload package show the load options you can specify (a sketch follows this list).
2. Make sure the directory $PGDATA/pg_bulkload exists; load status files are created in it.
3. Invoke pg_bulkload with the control file as its argument:

$ pg_bulkload sample_csv.ctl
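As a rough sketch, a control file is a set of key = value lines such as the following. Only LOAD = DIRECT is named in this document; the other key names here are illustrative guesses, so take the real spellings from sample_csv.ctl:

TABLE = customer              # target table (illustrative key name)
INFILE = /tmp/customer.csv    # input CSV file (illustrative key name)
TYPE = CSV                    # input file type (illustrative key name)
LOAD = DIRECT                 # load mode; DIRECT is the default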
Reminder: if you use the direct load mode (LOAD=DIRECT, which is the default), you have to be aware of the following.
When pg_bulkload crashes and some .loadstatus files remain in $PGDATA/pg_bulkload, the database must be recovered by pg_bulkload's own recovery, using the "pg_bulkload -r" command, before you invoke "pg_ctl start".
You must start and stop PostgreSQL using the postgresql script, which invokes "pg_bulkload -r" and "pg_ctl start" in the correct order. We recommend not using pg_ctl directly.
If you use pg_bulkload on a Windows operating system, the postgresql script is not included in the pg_bulkload package, so you have to invoke "pg_bulkload -r" manually.
Because WAL is bypassed, archive recovery by PITR is not available. If you would like to use PITR, take a full backup of the database after loading with pg_bulkload.
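A full backup for this purpose can be taken with the standard online backup procedure (a sketch; the backup label and paths are placeholders):

$ psql -c "SELECT pg_start_backup('after_pg_bulkload');" database_name
$ tar cf /backup/base_after_load.tar $PGDATA
$ psql -c "SELECT pg_stop_backup();" database_name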
You must not remove the load status files (*.loadstatus) found in the $PGDATA/pg_bulkload directory. These files are needed for pg_bulkload crash recovery.
Avoid terminating the pg_bulkload command with "kill -9" as much as possible. If you do, you must invoke the postgresql script to perform pg_bulkload recovery and restart PostgreSQL to continue.
In addition to pg_bulkload, a user-defined function is provided to skip the parsing overhead of timestamp strings; it is included in the pg_bulkload package. The function, pg_timestamp_in, handles timestamp strings in the fixed format "2007-01-01 12:34:56".
The installation sequence is shown below. Make sure the installed directories are given the correct permissions.
$ cd [directory where the postgresql-8.2.X or 8.3.X tarball is untarred]/contrib/
$ tar zxvf pg_bulkload-2.3.X.tar.gz
$ cd pg_bulkload
$ make
$ make install
$ postgresql start
$ psql -f $PGHOME/share/contrib/pg_timestamp.sql database_name
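As a quick sanity check after installation, a timestamp in the fixed format that pg_timestamp_in accepts should parse without error (database_name is your database):

$ psql -c "SELECT '2007-01-01 12:34:56'::timestamp;" database_name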
Timestamp values with a time zone attribute are outside the scope of pg_timestamp_in. If you provide data in such a format, you must use the usual PostgreSQL feature to read the data; in this case, loading may take longer. Although pg_timestamp_in provides much faster loading of timestamp data, it replaces the internal PostgreSQL function normally used to read timestamp data. That is, use of pg_timestamp_in influences the data syntax accepted by PostgreSQL SQL statements such as INSERT, COPY, and UPDATE. To avoid such influence, use pg_timestamp_in only during data loading and uninstall it afterwards.
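Assuming the package ships an uninstall script next to pg_timestamp.sql (the file name below is a guess; check the package contents), removing it would look like:

$ psql -f $PGHOME/share/contrib/uninstall_pg_timestamp.sql database_name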
NTT Open Source Software Center
Copyright (c) 2007-2008 Nippon Telegraph and Telephone Corporation
Comparison with COPY to an indexed table:

item               |  8.1.8 |  8.2.3
COPY (sec.)        | 1601.4 | 1586.2
pg_bulkload (sec.) |  147.7 |  131.8
ratio              |   10.8 |   12.0

Comparison with COPY to a non-indexed table followed by CREATE INDEX:

item               |  8.1.8 |  8.2.3
COPY (sec.)        |  548.9 |  596.5
pg_bulkload (sec.) |  147.7 |  131.8
ratio              |   4.17 |   4.04

Breakdown by operation (sec.):

item                               |  8.1.8 |  8.2.3
pg_bulkload to an indexed table    |  147.7 |  131.8
pg_bulkload to a non-indexed table |   72.5 |   72.9
COPY to an indexed table           | 1601.4 | 1586.2
COPY to a non-indexed table        |  127.7 |  140.2
CREATE INDEX                       |  468.8 |  408.7
Measurement environment:

Hardware:
Machine                           | PowerEdge 1900
CPU                               | Dual-Core Intel(R) Xeon(R) Processor 5050, 3.0GHz
Memory                            | 2GB (512MB * 4)
Disk (operating system installed) | Serial ATA II, 80GB
Storage (database cluster stored) | RAID0, 1.2TB
RAID controller cache             | 128MB
Hyper-Threading                   | ON

Operating system:
Version | RHEL ES release 4 update 4 (32bit)
Kernel  | 2.6.9-42.ELsmp
libc    | 2.3.4

PostgreSQL:
Version              | 8.1.8 / 8.2.3
shared_buffers       | 1024
checkpoint_segments  | 1000
checkpoint_timeout   | 3600
work_mem             | 1024
maintenance_work_mem | 16384

Data:
Table definition    | DBT-2 customer table
Index columns       | c_id (PRIMARY KEY), c_d_id (non-unique B-Tree)
Constraint          | NOT NULL (all columns)
Existing data       | 16,777,216 tuples (4GB)
Loading data        | 4,194,304 tuples (1GB)
Input file type     | CSV
pg_bulkload version | pg_bulkload-2.1.2 (PG 8.1.8) / 2.2.0 (PG 8.2.3)