pg_bulkload is a high-speed data loading utility for PostgreSQL. Downloads, documentation, bug reports, and mailing lists are available on the project's pgFoundry page.
The results below compare COPY with pg_bulkload. Performance was measured on a PostgreSQL server with basic tuning.
The measurements cover initial data loading, appended data loading, and the PARALLEL writer and FILTER features.
COPY performs better on initial data loading if the destination table is TRUNCATEd in the same transaction as the COPY. COPY is also faster when the data is loaded without indexes and the indexes are created after loading. Even so, pg_bulkload completed the load in about 85% of the time taken by the tuned COPY.
Item | PostgreSQL 8.4.4 + pg_bulkload 2.4 | PostgreSQL 9.0b2 + pg_bulkload 3.0 |
---|---|---|
COPY with indexes | 1133.4 sec | 1105.8 sec |
COPY without indexes + CREATE INDEX | 717.9 sec | 705.3 sec |
pg_bulkload (DIRECT) with indexes | 603.2 sec | 598.9 sec |
Duration comparison | 84.0 % | 84.9 % |
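The "tuned COPY" pattern described above can be sketched as a sequence of SQL statements. This is only an illustration, not the script used for the benchmark: the table name `customer`, the file path, and the index name are assumptions, and the choice to add both the primary key and the `c_d_id` index after loading is a guess based on the columns listed in the measurement environment.

```python
def tuned_copy_statements(table, csv_path):
    """Return, in order, the SQL statements for a TRUNCATE + COPY initial load.

    Assumes the table currently has no indexes; they are created after loading.
    """
    return [
        "BEGIN;",
        # TRUNCATE runs in the same transaction as COPY, as the text describes.
        f"TRUNCATE {table};",
        f"COPY {table} FROM '{csv_path}' WITH CSV;",
        "COMMIT;",
        # Indexes are built once the data is in place (hypothetical names).
        f"ALTER TABLE {table} ADD PRIMARY KEY (c_id);",
        f"CREATE INDEX {table}_c_d_id_idx ON {table} (c_d_id);",
    ]


if __name__ == "__main__":
    for stmt in tuned_copy_statements("customer", "/path/to/customer.csv"):
        print(stmt)
```

The statements are returned as strings so they can be reviewed or piped to `psql`; nothing here connects to a database.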
The TRUNCATE hack is not available when appending data to an already populated table. Creating indexes after loading is also not always faster than loading with the indexes in place. COPY therefore has fewer optimizations to draw on, and pg_bulkload completed the load in about 35% of the time taken by COPY on appended data loading.
Item | PostgreSQL 8.4.4 + pg_bulkload 2.4 | PostgreSQL 9.0b2 + pg_bulkload 3.0 |
---|---|---|
COPY with indexes | 520.4 sec | 549.3 sec |
COPY without indexes + CREATE INDEX | 805.3 sec | 799.6 sec |
pg_bulkload (DIRECT) with indexes | 185.2 sec | 191.7 sec |
Duration comparison | 35.6 % | 34.9 % |
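The "Duration comparison" rows in the two tables are consistent with dividing the pg_bulkload time by the faster of the two COPY variants in each scenario. The sketch below reproduces that arithmetic for the 9.0b2 + 3.0 column; the function name is hypothetical, and the choice of the fastest COPY as the baseline is inferred from the numbers rather than stated in the text.

```python
def duration_comparison(bulkload_sec, copy_secs):
    """Return the pg_bulkload time as a percentage of the fastest COPY variant."""
    return round(100.0 * bulkload_sec / min(copy_secs), 1)


# Timings in seconds, taken from the 9.0b2 + 3.0 column of the tables above.
initial = duration_comparison(598.9, [1105.8, 705.3])   # baseline: COPY without indexes
appended = duration_comparison(191.7, [549.3, 799.6])   # baseline: COPY with indexes

if __name__ == "__main__":
    print(f"initial load:  {initial} %")
    print(f"appended load: {appended} %")
```

Note that the baseline differs between the scenarios: deferring index creation wins for the initial load, while loading with indexes in place wins for the appended load.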
The parallel loader is used when WRITER = PARALLEL is specified. Performance improves on a multi-CPU server because reading the input file and writing rows to the table are handled by two separate processes. In the measurements, loading time was reduced to about 70% of the DIRECT loading time.
The FILTER feature can transform input data in various ways, but it is not free. In the measurements, loading time increased to 250-300% of the DIRECT loading time with SQL functions and to roughly 140-150% with C functions.
Item | Initial (4GB) | Appended (1GB) |
---|---|---|
pg_bulkload (DIRECT) | 598.9 sec | 191.7 sec |
pg_bulkload (PARALLEL) | 413.5 sec | 133.0 sec |
Duration comparison | 69.0 % | 69.4 % |
pg_bulkload (SQL-FILTER) | 1813.9 sec | 484.6 sec |
Duration comparison | 302.9 % | 252.7 % |
pg_bulkload (C-FILTER) | 918.4 sec | 263.7 sec |
Duration comparison | 153.3 % | 137.6 % |
PostgreSQL 9.0b2 + pg_bulkload 3.0b1 was used for all measurements, with indexes.
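As a rough illustration of how the parallel loader might be selected, the sketch below writes a minimal pg_bulkload control file. Only `WRITER = PARALLEL` and the CSV input format come from this page; the other keys (`OUTPUT`, `INPUT`, `TYPE`) are recalled from the pg_bulkload documentation and should be checked against the installed version. The table name and file paths are placeholders.

```python
def control_file(table, csv_path, writer="PARALLEL"):
    """Build the text of a minimal pg_bulkload control file (keys are assumptions)."""
    lines = [
        f"OUTPUT = {table}",      # destination table
        f"INPUT = {csv_path}",    # input data file
        "TYPE = CSV",             # input format used in the measurements
        f"WRITER = {writer}",     # PARALLEL selects the parallel loader
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = control_file("customer", "/path/to/customer.csv")
    with open("customer.ctl", "w") as f:
        f.write(text)
    print(text, end="")
```

Passing `writer="DIRECT"` would produce the configuration used as the baseline in the table above.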
Item | Value |
---|---|
Server | Dell PowerEdge 1900 |
CPU | Dual Core Xeon 5050 (3.0GHz) |
Hyper-Threading | off |
Memory | 2GB |
Storage Subsystem | Dell PowerVault 221S |
Disks | SCSI 7x146GB (RAID 0) |
RAID Controller | PERC 4e/DC DRAM=128MB |
OS | CentOS 5.5 (64bit) |
shared_buffers | 256MB |
checkpoint_segments | 300 |
checkpoint_timeout | 5min |
Table definition | DBT-2 customer table |
Indexed columns | c_id (PRIMARY KEY), c_d_id (non-unique B-Tree) |
Constraints | NOT NULL for all columns |
Input file format | CSV |