A better way to move data between databases

April 30th, 2008

In this post, Moving data between databases I used psql to copy data between databases. I had OK success with this. I recently had to copy data between two physical servers and I came up with a better solution.

Create a file called migrate.sh with this:

SOURCE_HOST=$1
TARGET_HOST=$2
SCHEMA=$3
SOURCE_DB=enterprise
TARGET_DB=enterprise

if [$SCHEMA = ""]; then
echo
echo
echo You must pass in the correct parameters!
echo Syntax: migrate.sh \<source_host\> \<target_host\> \<schema_name\>
echo Example: migrate.sh gp test-gp demo_schema
echo
echo
else
echo cleanup target
psql -h $TARGET_HOST -c "drop schema $SCHEMA cascade" $TARGET_DB


echo dump output to stdout and pipe to restore
pg_dump -c -h $SOURCE_HOST --gp-syntax -i -F c -v -n $SCHEMA $SOURCE_DB | pg_restore -F c -i -d $TARGET_DB -v -h $TARGET_HOST
fi

Now execute:
chmod 700 migrate.sh

I copied my jon schema from Production to Test with this:
./migrate.sh gp test-gp jon

pljava

March 28th, 2008

I’m using pljava to reach out to SQL Server and Oracle databases to gather information about it to transform the data into Greenplum. I’m using a PostgreSQL version 8.3 for this on Windows and recently installed this on the server. I’ve had it running on my laptop for a while and it took some time to configure it there so I knew it would be fun when I installed it on the server.

I had already installed PostgreSQL on our server and included the pljava option. From there, there is still configuration needed. Here are the steps I took on our Windows server to get pljava installed.

1. install java

Our server didn’t have Java installed yet.

2. add java to path

C:\Program Files\Java\jre1.6.0_05\bin
C:\Program Files\Java\jre1.6.0_05\bin\client

It really just needs the \bin\client directory because it has the location for jvm.dll but I also add \bin to make it easier to run and compile Java from the command line.

3. create CLASSPATH with this

.;C:\Program Files\PostgreSQL\8.3\share\pljava\pljava.jar

On a fresh install, I had to create this environment variable.

4. add this to postgresql.conf

#--------------------------------------------------------------------------
# pljava changes
#--------------------------------------------------------------------------

custom_variable_classes = 'pljava'
pljava.classpath='C:\\Program Files\\PostgreSQL\\8.3\\share\\pljava.jar'
dynamic_library_path = '$libdir;C:\\Program Files\\PostgreSQL\\8.3\\share\\pljava'

Of course, modify this for your install locations.

5. restart pg

6. run script

C:\Program Files\PostgreSQL\8.3\share\pljava\install.sql

Moving data between databases

February 13th, 2008

Here is a quick trick to move data between two different PostgreSQL or Greenplum databases.

First, we have a table in the postgres database:

C:\>psql postgres -c "select * from test_date" -U scheduler
id | datestamp
----+-------------------------
1 | 2008-02-12 20:19:10.709
1 | 2008-02-12 20:19:11.881
2 | 2008-02-12 20:19:14.631
3 | 2008-02-12 20:19:15.834
4 | 2008-02-12 20:19:17.131
(5 rows)

Next, drop the table (if it exists) in the target sandbox database:

C:\>psql sandbox -c "drop table if exists test_date" -U scheduler
DROP TABLE

Create the table using the DDL from the source database:

C:\>psql sandbox -c "create table test_date (id numeric, datestamp timestamp)" -U scheduler
CREATE TABLE

Now use the copy command twice. Once pulling the data to standard out, pipe, then copy from standard in. BTW, this works in Windows or bash exactly the same.


C:\>psql postgres -c "copy test_date to stdout" -U scheduler | psql sandbox -c "copy test_date from stdin" -U scheduler

Look at the data in the target database:

C:\>psql sandbox -c "select * from test_date" -U scheduler
id | datestamp
----+-------------------------
1 | 2008-02-12 20:19:10.709
1 | 2008-02-12 20:19:11.881
2 | 2008-02-12 20:19:14.631
3 | 2008-02-12 20:19:15.834
4 | 2008-02-12 20:19:17.131
(5 rows)

Bug in timestamp

February 1st, 2008

This post applies to PostgreSQL too. I even went so far as to point out this bug to the PostgreSQL hackers list and in usual Open Source fashion, I was told that it was a “feature”, not a “bug”. Yeah right.

select sub.t1,
to_char(t1, 'yyyy-mm-dd hh24:mi:ss.ms') as char_t1
from (select timestamp'2008-01-31 12:00:40.5' as t1) sub

2008-01-31 12:00:40.50 | 2008-01-31 12:00:40.500

Notice how the millisecond shows .50 for the timestamp column and .500 for the character representation of the same date.

Now notice the documentation found here indicates ms should be padded like month, day, and year are. BUG!

What really bugs me (no pun intended) about this simple, rather obvious bug is how the PostgreSQL group responds to it. If you get them to admit it is a bug, they want you to fix it. Huh? If you don’t have the time or knowledge to fix the problem, they want you to sponsor the fix. Again, huh?

Now many vendors jump on the shoulders of Open Source products like PostgreSQL and add more functionality. Do they offer to fix the core product? I haven’t heard an offer yet!

I don’t get Macs

November 30th, 2007

I really don’t. I don’t understand why in the world anyone would get excited about an operating system.

Windows User
I’ve been using Windows since version 3.0 and along with it DOS. I remember edlin, config.sys, autoexec.bat, etc. That was painful. I remember loving getting Norton Desktop on Windows 3.1. I even downloaded Windows 95 beta versions from my dial-up PPP account in college from a warez site. I know the product very well.

UNIX
I worked for an ISP that embraced Linux (slackware actually) and I learned a lot about Unix (Irix) while there. I moved on to working with databases and worked on HP/UX, AIX, and Solaris. For years, I used elm as my email editor. I like Unix.

However, I prefer greatly prefer Windows for my desktop and laptop. I never want to change. What is the benefit? It is sort of like asking a Unix person to switch shells. What is the draw for a bash user of 20 years to switch to csh? He/she will question the benefit and most likely refuse to switch.  Just like me.  Why should I bother learning a new shell?

Open Source?
Now I see open source people loving Macs. The OS uses some open source code and it has a Unix feel to it. However, it is anything but open. It is actually the opposite of open. Where can I buy or download the Mac OS for a bare bones machine I build? Oh yeah, Apple closes the OS to their hardware only. It reminds me of IBM back in the day.

Speaking of closed, what about the iPhone? That’s right, closed too.

iTunes
My brother bought me an iPod for Christmas last year which I never would have bought for myself. I never would have paid that much for an MP3 player and I was leaning towards a different brand.

Anyway, I took my entire MP3 collection (about 4 GBs) and get iTunes to sync to my iPod. Next I rip another song or two and I didn’t use iTunes to do that. Now I have to delete everything on my iPod and resync my collection. User friendly my ass.

Later on, I take the iPod to work and I load iTunes so I can transfer the songs to my hard drive. Whoa! You can’t do that! I can only sync to one machine? WTF?

The easy fix on my Windows XP machine was to get Winamp. This great software lets me upload/download songs from the iPod with much greater ease. The iTunes interface is terrible. I can’t think of a less user friendly application. Winamp also lets you stream rip which is awesome.

Utility
An operating system has great utility. It isn’t sexy or something new. I don’t get excited about the utlity of a pickup truck either and no truck has ever been sexy. It has great utility but nothing else.

I think I would be using Windows 2000 right now had I not bought a computer with XP already loaded. Windows 2000 is stable and does everything I need. Windows XP does it better and it comes with some more free applications that are nice.

M$ Haters
It seems there are a lot of Microsoft haters out there that portray Bill Gates as evil. I don’t buy it. MS has a great OS that will run on any x86 hardware I want to buy. There are a ton of applications for it and a ton of development going on for it. Conversely, Apple has very few games, very few applications, very little development, and you must buy expensive Apple hardware only to use it. That sucks.

If you really hate M$ and want an alternative, use Linux. Go to Wal-Mart and buy yourself a $200 pre-loaded Linux desktop and then take the extra $1500 you saved from not buying a Mac and buy something else!

Loading data into Greenplum

November 21st, 2007

Once you get Greenplum, you’ll run into the problem of loading data.  It isn’t straight forward and it isn’t easy.

Insert Statements
If you try to load your data with a typical insert statement, it will take forever!  While GP is great at extracting data much faster than other databases because everything runs in parallel, this method is very slow at loading data. 

Copy
The copy command is a better way to load data.  It loads data from a flat file usually delimited by something.  It isn’t perfect though.

Delimiters
The copy command doesn’t support multi-character delimiters.  This means if you have a ”|” or ”~” somewhere in your data, you can’t use it. 

Backslash “\”
The backslash character is an escape character.  The means if you have a trailing backslash in a field, it will escape the delimiter and the data will not load.

All or Nothing
Unlike Oracle’s sqlldr or M$/Sybase BCP, you don’t have the option of putting errors in a separate file.  Either all of the data will be loaded or it will error out the entire batch.  This is extremely frustrating when you are trying to load gigabytes of data only to have one row cause a problem.

Character Set
GP uses UTF8 which is probably the most flexible character set to use.  Most of my data was coming from SQL Server and I was creating files in Windows.  The code page on my server was set to 1252 which does support high ascii characters but not all of them.  For me to load a Windows file on a 1252 code page server, I had to set the character set in my GP session to WIN1252 in order for me to load data correctly.  That was  lot of fun figuring out!

CSV and Double Quotes
I had the best luck using CSV format and double quotes around text fields.  This let me use | as the delimiter and not worry about some field containing a backslash that would escape this field.

\Copy
 Didn’t I just talk about copy?  Well, yes but there is another copy.  The first copy requires the file be loaded on the master node.  The “\copy” command allows you to load a file remotely.  It pretty much behaves the same as copy except I found it to be slower.  It was actually faster to create a file in Windows, transfer it to the master node using FTP, and then load it with Copy.

Both \copy and copy have the ability to load data from standard in.  I was able to issue a select statement in SQL Server and pipe the output to \copy.  It was able to load data into GP without creating a file.  Unfortunately, I found that it was slower than creating a file locally in Windows, FTP, and then copy the data in.  Grrr.

External Tables
External tables offer the ability to load data even faster than copy using gpfdist. Gpfdist is basically multiple copy commands running at the same time to load the data faster.  It takes more up front configuration too but it will load data from files the fastest.

Copy with Standard In
Another cool way to load data is to create a client to your source database on the master node and the pipe the output to copy.   This runs pretty fast too but you are still living in a bash session.

External Web Tables
This option lets you pull in data from a web server.  I was thinking that this might be nice but it will be slow because the data has to go through a web server.  However, if you look at the details of the “create external web table” command, you’ll see that you can pick a command to run instead of a URL. 

Now I can run a Unix program (like Java or Perl), that connects to a database to get data.  I can use the external web table to execute the command to get the data.

This last technique has proven very fast and flexible.  A regular external table that goes to a flat file is still probably faster but with an external web table pointing to a program is the most flexible solution for my data loading needs.  I no longer have to worry about creating and moving files around!  I highly recommend this method.

Hey!

November 17th, 2007

I just converted my site to a blog.  Over the years, I’ve hosted a couple of different sites mainly dedicated towards cars.  One of those sites lives on still with a new owner. 

Instead of cars, I’ll mainly blog about my work.  I’ve been working with database technologies for several years and I started a new job recently working with Greenplum.  I intend on sharing topics about GP as I learn more and more…