

Welcome to the pg_bulkload Project Home Page


pg_bulkload is a high-speed data loading utility for PostgreSQL.

The pg_bulkload project is a PostgreSQL community project hosted on pgFoundry. It is developed by NTT OSS Center.

The pgFoundry page for the project is at http://pgfoundry.org/projects/pgbulkload, where you can find downloads, documentation, bug reports, mailing lists, and a whole lot more.



README

pg_bulkload - High speed data loading utility.

Introduction

pg_bulkload provides high-speed data loading capability to PostgreSQL users.

When loading a huge amount of data into a database, it is common for the data set to be valid and consistent already; for example, dedicated tools may have prepared and validated the data in advance. In such cases, we would like to bypass overheads within the database system and load the data as quickly as possible. pg_bulkload was developed for these situations. It is therefore not a goal of pg_bulkload to provide detailed data validation; rather, pg_bulkload assumes that the data set to be loaded has been validated by separate means. If that is not your situation, you should use the COPY command in PostgreSQL.

Lineup

pg_bulkload provides two programs to users.

  1. pg_bulkload
     This program is used to load the data. Internally, it invokes a PostgreSQL user-defined function called pg_bulkload() to perform the loading. The pg_bulkload() function is installed during pg_bulkload installation.

  2. postgresql script
     This is a wrapper command for pg_ctl, which starts and stops the PostgreSQL server. The postgresql script invokes pg_ctl internally. It also provides a very important piece of pg_bulkload functionality: recovery. For performance, pg_bulkload bypasses some of PostgreSQL's internal functionality, such as WAL. pg_bulkload therefore needs its own recovery procedure, which must run before PostgreSQL's usual recovery is performed. The postgresql script provides this feature.

    Be sure to read the "Reminder" section below, especially if

    or

Installation

  1. Environment
     pg_bulkload installation assumes the following:

  2. Installation procedure
     The installation sequence is shown below. Permissions for the installation directories are assumed to be set correctly.
    
    $ cd [directory where postgresql-8.2or3.X.tar.gz is untarred]/contrib/
    $ tar zxvf pg_bulkload-2.3.X.tar.gz
    $ cd pg_bulkload
    $ make
    $ make install
    $ mkdir $PGDATA/pg_bulkload
    $ postgresql start
    $ psql -f $PGHOME/share/contrib/pg_bulkload.sql database_name
    

Usage

You can use pg_bulkload in the following three steps:

  1. Edit the control file "sample_csv.ctl" or "sample_bin.ctl", which contains the settings for data loading. You can specify the table name, the absolute path of the input file, a description of the input file, and so on.
  2. Make sure the directory $PGDATA/pg_bulkload exists; load status files are created in it.
  3. Execute the command with the control file as its argument. A relative path may be used for the argument.
    
    $ pg_bulkload sample_csv.ctl 
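
Before running the command, it can be useful to confirm that the prerequisites from the steps above are in place. The following is only a minimal sketch and is not part of the pg_bulkload package: the function name check_bulkload_prereqs is hypothetical, and it checks nothing beyond the existence of $PGDATA/pg_bulkload and the readability of the control file.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with pg_bulkload): checks the
# prerequisites described in the steps above before a load is started.
# Usage: check_bulkload_prereqs CONTROL_FILE
check_bulkload_prereqs() {
    ctl_file=$1

    # PGDATA must be set so the load status directory can be located.
    if [ -z "${PGDATA:-}" ]; then
        echo "PGDATA is not set" >&2
        return 1
    fi

    # Step 2: the load status directory must exist.
    if [ ! -d "$PGDATA/pg_bulkload" ]; then
        echo "missing directory: $PGDATA/pg_bulkload" >&2
        return 1
    fi

    # Step 1: the control file must exist and be readable.
    if [ ! -r "$ctl_file" ]; then
        echo "cannot read control file: $ctl_file" >&2
        return 1
    fi

    return 0
}
```

With the function defined, a load could then be started as `check_bulkload_prereqs sample_csv.ctl && pg_bulkload sample_csv.ctl`, so that pg_bulkload runs only when both checks pass.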
    

Control File

You can specify the following load options. See also sample_csv.ctl and sample_bin.ctl, which are included in the pg_bulkload package.

Common

CSV format

Fixed format

Reminder

Optional tool : pg_timestamp_in

In addition to pg_bulkload, the following user-defined function is provided to avoid the overhead of parsing timestamp strings. It is included in the pg_bulkload package.

pg_timestamp_in

This user-defined function provides very fast loading of timestamp data. In exchange for the speed, the timestamp data must be in the following 19-byte format:
2007-01-01 12:34:56
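
A value can be checked against this layout before loading. The sketch below is an illustration only, not something provided by pg_bulkload: the function name is_fast_timestamp is hypothetical, and it assumes the layout is exactly "YYYY-MM-DD HH:MM:SS" as in the example above. It checks the shape of the string, not whether the date itself is valid.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the argument has the 19-byte
# shape "YYYY-MM-DD HH:MM:SS" (digits and fixed separators only).
# Anything longer, such as a value with a time zone suffix, is rejected.
is_fast_timestamp() {
    case $1 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' '[0-9][0-9]:[0-9][0-9]:[0-9][0-9])
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

For example, `is_fast_timestamp "2007-01-01 12:34:56"` succeeds, while a value with a time zone suffix such as "2007-01-01 12:34:56+09" fails because the whole string must match the pattern.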

Installation

The installation sequence is shown below. Permissions for the installation directories are assumed to be set correctly.


$ cd [directory where postgresql-8.2or3.X.tar.gz is untarred]/contrib/
$ tar zxvf pg_bulkload-2.3.X.tar.gz
$ cd pg_bulkload
$ make
$ make install
$ postgresql start
$ psql -f $PGHOME/share/contrib/pg_timestamp.sql database_name

Reminder

Timestamp values with a time zone are outside the scope of pg_timestamp_in. If you provide data in such a format, you must use the usual PostgreSQL feature to read it, and loading may take longer.

Although pg_timestamp_in provides much faster loading of timestamp data, it replaces the internal function that PostgreSQL normally uses to read timestamp data. That is, using pg_timestamp_in affects the data syntax accepted by PostgreSQL SQL statements such as INSERT, COPY, and UPDATE. To avoid this, use pg_timestamp_in only during data loading, and uninstall it afterwards.

Author

NTT Opensource Software Center
Copyright (c) 2007-2008 Nippon Telegraph and Telephone Corporation


Performance Result

Overview

Result