Generating the CSV and binary formats from Python is easy enough. Here are samples for each.
CSV format
You can use code like the following to generate a tab-separated CSV where the first row is a header. PostgreSQL skips the header line rather than matching columns by name, so make sure the column ordering matches up with the table definition.

import csv

f_table1 = open(filebase + '_table1.csv', 'wb')
table1 = csv.DictWriter(f_table1,
                        ['col1', 'col2'],
                        dialect=csv.excel_tab)
# writeheader() was added in Python 2.7; for 2.6, see below
table1.writeheader()
# Sample - one row
table1.writerow({
    'col1': 1,
    'col2': 2,
})
f_table1.close()
The code above uses a writeheader() method added to csv.DictWriter in Python 2.7. If you're stuck with 2.6, see this Stack Overflow post.
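The usual 2.6 workaround boils down to emitting the header row by hand through writerow(), mapping each field name to itself (a minimal sketch reusing the table1 writer from above):

# Python 2.6: DictWriter has no writeheader(), so write the header manually
table1.writerow(dict((fn, fn) for fn in table1.fieldnames))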
To load the CSV into psql:
\copy table1 FROM 'yyyymm_table1.csv' CSV HEADER DELIMITER E'\t';
Timestamps in CSV format
For timestamp and timestamptz columns PostgreSQL is smart enough to handle a date string like "2014-02-28 12:34:45.123456". For timestamptz it will interpret the value in the session's timezone, applying the correct UTC offset (including daylight saving time) for the given date and time. To format the timestamp string you can use something like:

row['timestamp'] = "%04d-%02d-%02d %02d:%02d:%09.6f" % (
    year, month, day, hour, minute, second)
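If you're starting from a datetime object rather than individual components, strftime's %f directive (supported on datetime objects since Python 2.6) produces the same zero-padded six-digit fraction; a sketch:

from datetime import datetime

dt = datetime(2014, 2, 28, 12, 34, 45, 123456)
row['timestamp'] = dt.strftime("%Y-%m-%d %H:%M:%S.%f")
# -> "2014-02-28 12:34:45.123456"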
Binary format
The PostgreSQL binary format is partly documented, but once you're past the header and into the field values you're essentially on your own - the advice given is to read the source under src/backend/utils/adt, and to an outsider it looks like the field formats are heavily dependent on compile-time options. I normally wouldn't touch it with a ten-foot pole, but it's the most efficient way to load tables with large bytea columns.

Luckily, the tables I had to worry about only have three columns: a varchar, a bytea and a timestamptz. The binary format of each is straightforward:
- in all cases the field is preceded by a 32-bit field length
- varchar and other text fields, and bytea, are encoded as a run of bytes
- timestamptz could be a 64-bit signed integer or double-precision floating point depending on compile-time options, but realistically it's always a signed integer (integer datetimes have been the default since 8.4): the number of seconds since 2000-01-01 00:00:00 UTC, multiplied by one million, plus microseconds - in other words, microseconds since the PostgreSQL epoch (see the sketch below)
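Before the full sample, here's a minimal struct-based sketch of that framing (my own cross-check, not part of the writer below): each field is a big-endian 32-bit length followed by the raw bytes, and a timestamptz payload is a big-endian signed 64-bit microsecond count:

import struct

def frame_field(data):
    # 32-bit big-endian field length, then the field's raw bytes
    return struct.pack('!i', len(data)) + bytes(data)

# 2014-03-12 22:48:41.267917-04 is 447994121267917 microseconds
# (0x19772c275aacd) after the PostgreSQL epoch
ts_field = frame_field(struct.pack('!q', 0x19772c275aacd))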
Here's a sample:
f_table2 = open(filebase + '_table2.bin', 'wb')
table2 = pgbinwriter(f_table2)
table2.writeheader()
# Columns: varchar, bytea, timestamptz
table2.writerow([
    pgbintext("abcdefg"),
    pgbinbytea(some_bytearray),
    pgbintimestamptz(
        string_to_pythontime('2014-02-28 12:34:45.123456')),
])
table2.writetrailer()
f_table2.close()
And the underlying functions:
############################################################################
# PostgreSQL binary format routines
# http://56bytes.blogspot.ca/2014/04/creating-postgresql-bulk-import-files.html

import time

# 11-byte file signature
PGCOPY_HEADER = "PGCOPY\n\377\r\n\0"

# 2000-01-01 00:00:00 UTC in Unix (Python) time
PG_DATE_EPOCH = 946684800.0
class pgbinwriter(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def writeheader(self):
        self.fileobj.write(PGCOPY_HEADER)
        # Flags field (32 bits, all zero: no OIDs)
        self.fileobj.write(bytearray([0, 0, 0, 0]))
        # Header extension area length (32 bits, zero)
        self.fileobj.write(bytearray([0, 0, 0, 0]))

    def writerow(self, fields):
        # 16 bits: number of fields in the tuple
        self._write16(len(fields))
        for field in fields:
            self.fileobj.write(field.encode())

    def writetrailer(self):
        # The file trailer is a tuple field count of -1
        self._write16(-1)

    # Write a 16-bit integer in network byte order
    def _write16(self, u16):
        self.fileobj.write(bytearray([(u16 >> 8) & 0xff, u16 & 0xff]))
class pgbinfield(object):
    def __init__(self, length):
        self.length = length

    def encode(self):
        return self._encode32(self.length) + self._encode_data()

    # Encode a 32-bit integer in network byte order
    def _encode32(self, length):
        return bytearray([
            (length >> 24) & 0xff,
            (length >> 16) & 0xff,
            (length >> 8) & 0xff,
            length & 0xff])

    def _encode_data(self):
        raise NotImplementedError
# NULL is encoded as a field length of -1 with no data bytes
class pgbinnull(pgbinfield):
    def __init__(self):
        super(pgbinnull, self).__init__(-1)

    def _encode_data(self):
        return bytearray()
# text must already be a byte string (encode unicode to UTF-8 first) so
# that len() gives the length in bytes
class pgbintext(pgbinfield):
    def __init__(self, text):
        super(pgbintext, self).__init__(len(text))
        self.text = text

    def _encode_data(self):
        return bytearray(self.text)
class pgbinbytea(pgbinfield):
    def __init__(self, bytearr):
        super(pgbinbytea, self).__init__(len(bytearr))
        self.bytearr = bytearr

    def _encode_data(self):
        return self.bytearr
# Depending on compile-time options, PostgreSQL uses either 64-bit signed
# integers or double precision floating point numbers - the default is
# 64-bit ints.  Values are stored as the number of seconds since
# 2000-01-01 00:00:00 UTC, multiplied by one million, plus microseconds.
# Example: 2014-03-12 22:48:41.267917-04 -> 0x19772c275aacd
class pgbintimestamptz(pgbinfield):
    def __init__(self, pythontime):
        super(pgbintimestamptz, self).__init__(8)
        self.pythontime = pythontime

    def _encode_data(self):
        pgtime = long((self.pythontime - PG_DATE_EPOCH) * 1000000.0)
        return bytearray([
            (pgtime >> 56) & 0xff,
            (pgtime >> 48) & 0xff,
            (pgtime >> 40) & 0xff,
            (pgtime >> 32) & 0xff,
            (pgtime >> 24) & 0xff,
            (pgtime >> 16) & 0xff,
            (pgtime >> 8) & 0xff,
            pgtime & 0xff])
# Given a string like "2014-02-28 12:34:45.123456", returns a floating point
# number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC).  The
# string is interpreted in the local timezone.
def string_to_pythontime(s):
    # time.mktime() interprets the struct_time as local time and is smart
    # enough to figure out daylight saving time.  To get time.strptime() to
    # work we just have to strip off the fractional part of the second.
    i = s.index('.')
    nofrac = s[0:i]
    frac = float("0." + s[i+1:])
    return time.mktime(time.strptime(nofrac, "%Y-%m-%d %H:%M:%S")) + frac
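The pgbinnull class isn't exercised by the sample above; a NULL column is written the same way as any other field. A sketch reusing the writer and helpers defined here, with a NULL in the bytea column:

table2.writerow([
    pgbintext("abcdefg"),
    pgbinnull(),    # NULL bytea: field length -1, no data bytes
    pgbintimestamptz(
        string_to_pythontime('2014-02-28 12:34:45.123456')),
])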
To load into psql:
\copy table2 FROM 'yyyymm_table2.bin' BINARY;
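As a sanity check on the writer, you can also go the other way: dump a table with known contents using \copy table2 TO 'reference.bin' BINARY and compare the bytes against a file generated by the code above.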
