PostgreSQL Performance Tips
This page is for collecting tips on tuning PostgreSQL, especially when using it with Chado. Feel free to place information from your own experience on this page. Unless otherwise indicated the tips on this page refer to PostgreSQL servers in the 8.x series.
- PostgreSQL performance FAQ
- PostgreSQL performance mailing list
- PostgreSQL performance checklist
- PostgreSQL annotated configuration option guide
- Another annotated configuration guide
One step that can be taken to optimize your PostgreSQL database is to properly initialize it with initdb. If you are using a PostgreSQL server that came with your OS or have installed it from some package system it may have already been initialized. In these cases you are probably better off removing the PostgreSQL data directory and reinitializing it manually.
To do this you need to execute the following command as the postgres user, or whatever user you use to run postgres under.
initdb -D /var/lib/pgsql/data --locale=C
Please note that the directory location is entirely dependent on your system or your specific setup and should be substituted with whatever convention is used in your situation. What this command does is explicitly initialize the PostgreSQL database with a C locale. If you do not specify the C locale it defaults to whatever your system is set to, which is set to 'en_US.UTF8' on many linux distributions. In these cases, queries with "LIKE" will take a performance hit because PostgreSQL will not use the indices as you would expect it to. Setting it to the C locale from the start eliminates this.
From the Locale Support page in the manual:
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.
If for some reason you cannot reinitialize your database you can also drop specific indexes that involve text fields and recreate them using operator classes.
One other option you should consider is what the default encoding for all your databases should be. If your data requires UTF-8 or just plain ASCII characters it might be a wise choice to specify that via the initdb command:
initdb -D /var/lib/pgsql/data --encoding=UTF8 --locale=C or initdb -D /var/lib/pgsql/data --encoding=SQL_ASCII --locale=C
If you do not do this you need to be sure that you specify the correct encoding when you create the database instance for Chado via createdb.
Tuning PostgreSQL configuration options requires some serious voodoo and patience. In addition to the links above you might find useful information on these pages.
In general you will want to pay special attention to the following configuration variables. These can be found in the postgresql.conf file located in the PostgreSQL server data directory. The comments below are pulled from some of the above URLs for convenience. If you don't actually read the full descriptions you are doing yourself a disservice :-).
Setting shared_buffers requires greater discussion than is suitable for this space. Please see other articles on the topic.
For quick rules-of-thumb: on a dedicated PostgreSQL server, this value should likely be between 1000 and 50,000 (8MB and 400MB). Factors that raise the recommended amount are more connections, larger active portions of your database, long complex queries, and large tables. Available RAM limits the maximum shared_buffers; you should never use more than 1/3 of your available RAM.
Formerly sort_mem, this setting name has been changed to reflect its expanded role in governing more than just sorts.
Work_mem is a direct tradeoff. Adjust it upwards for: large databases, complex queries, lots of available RAM. Adjust it downwards for: low available RAM, or many concurrent users. Finding the right balance spot can be hard.
Another way to set this value is to monitor the Postgres temp files (in PGDATA/base/DB_OID/pgsql_tmp) and adjust sort_mem upward if you see a lot of queries swapping from these temp files.
Also keep in mind that this parameter can be adjusted per connection. So if you only have a few really large queries, you can increase the work_mem for them before query execution, and leave it low for the rest of the connections.
A proper FSM setting can eliminate or at least postpone your need to run VACUUM FULL and REINDEX. The best way to set it is as follows: 1) figure out the VACUUM (regular) frequency of your database based on write activity; 2) run the database under normal production load, and run VACUUM VERBOSE ANALYZE instead of VACUUM, saving the output to a file; 3) calculate the maximum total number of pages reclaimed between VACUUMs based on the output, and use that.
Alternately, if you are using Autovacuum, you can base this as a percentage of the total data pages in your database, to match the autovacuum percent. Regardless, little memory is required per page (about 6 bytes) so it's better to be generous than stingy.
Please note that databases with high peak activity (bursts of 1 million updates but nothing else for minutes or hours) this number can be impossible to tune perfectly. Inserted rows are not significant for FSM. Finally, if your database server is short on RAM, increasing FSM to needed values may be counter-productive.
Few users will need to adjust this number, but it's worth checking. You should have at least as many FSM_relations as you have tables in all databases, including template databases and system schema. Postgres develops odd performance quirks if it is does not have enough FSM_relations.
Turn WAL off (fsync=false) only for a read-only database or one where the database can be regenerated from external software. While RAID plus UPSes can do a lot to protect your data, turning off fsync means that you will be restoring from backup in the event of hardware or power failure.
On the other hand, WAL imposes significant penalties on database writes, especially in single-disk systems. Essentially you are doubling the amount of read/write activity required for each update, plus requiring you to disable performance-enhancing disk-caching features of your OS and hardware. So, if your data is disposable, turing Fsync off is worth consideration.
This is the most effective setting for dealing with large updates, data loading, and heavy OLTP activity. For any system with heavy write activity, you'll want to raise this to at least 8; on systems with very large write loads (such as loads of several GB of data), as much as 128 (and we've used 256 for DBT2 testing). However, this does require a significant amount of disk space for the xlog ( ( 2 x segments + 1 ) x 16mb, to be precise), and is a limited improvement if your xlog is not on a separate disk resource from the data.
Increase this setting dramatically (up to 30 minutes) for large data loads. For other purposes, settings between 3min and 10min is the useful range; use higher settings for write activity which comes in bursts. Increasing checkpoint timeouts is currently limited by the increased impact that disk sync has with longer times.
Defaults to a “pg_log” directory in your PGDATA, which is probably not a wise choice if you have other disks/arrays available. /var/pg_log is popular.
Vacuuming your database after loading or after a large update is very important to database performance. To vacuum your database issue this command
vacuumdb -d DBNAME -f -z -v
See vacuumdb --help for more options. Doing a verbose vacuum is important because it will sometimes inform you when configuration options needs to be increased.
PostgreSQL 8.1 and up has the ability to automate vacuming. Users who are in read/write environments should utilize this for maintaining performance over the long term.
This section should be expanded with information from Chado users who actually use auto vacuuming.
- Full Text Search in PostgreSQL, 2010/05