30 March 2015
Ignoring duplicate inserts with Postgres when processing a batch
Users are expected to be able to rerun batches and there is overlap between different datasets. For example: the dataset of "last year" overlaps with the dataset of "all time". This means that we need an elegant way to handle duplicate updates.
Searching if a record exists (by PK) is fine until the row count in the table gets significant. At just over 2 million records it was taking my development machine 30 seconds to process 10,000 records. This number steadily increased as the row count increased.
I had to find a better way to do this and happened across the option of using a database rule to ignore duplicates. While using the rule there is a marked improvement in the performance as I no longer need to search the database for a record. Tip