|
133 | 133 | (<acronym>BBU</>) disk controllers. In such setups, the synchronize |
134 | 134 | command forces all data from the controller cache to the disks, |
135 | 135 | eliminating much of the benefit of the BBU. You can run the |
136 | | - <xref linkend="pgtestfsync"> module to see |
| 136 | + <xref linkend="pgtestfsync"> program to see |
137 | 137 | if you are affected. If you are affected, the performance benefits |
138 | 138 | of the BBU can be regained by turning off write barriers in |
139 | 139 | the file system or reconfiguring the disk controller, if that is |
|
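The hunk above recommends running pg_test_fsync to see whether forced cache flushes are hurting you. That utility times repeated write-and-sync cycles; the standalone Python sketch below imitates the same measurement against a scratch file (the function name, iteration count, and block size are arbitrary choices for illustration, not pg_test_fsync's actual parameters or output format):

```python
import os
import tempfile
import time

def avg_fsync_seconds(iterations=50, block=b"\0" * 8192):
    """Average time to write an 8kB block and force it to stable storage,
    roughly the kind of number pg_test_fsync reports."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, block)
            os.lseek(fd, 0, os.SEEK_SET)   # overwrite the same 8kB each time
            os.fsync(fd)                   # push data through the OS cache
        return (time.perf_counter() - start) / iterations
    finally:
        os.close(fd)
        os.remove(path)
```

On a system with a battery-backed write cache and barriers off, this number should be far smaller than on one where every fsync reaches the platters.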
372 | 372 | asynchronous commit, but it is actually a synchronous commit method |
373 | 373 | (in fact, <varname>commit_delay</varname> is ignored during an |
374 | 374 | asynchronous commit). <varname>commit_delay</varname> causes a delay |
375 | | - just before a synchronous commit attempts to flush |
376 | | - <acronym>WAL</acronym> to disk, in the hope that a single flush |
377 | | - executed by one such transaction can also serve other transactions |
378 | | - committing at about the same time. Setting <varname>commit_delay</varname> |
379 | | - can only help when there are many concurrently committing transactions. |
| 375 | + just before a transaction flushes <acronym>WAL</acronym> to disk, in |
| 376 | + the hope that a single flush executed by one such transaction can also |
| 377 | + serve other transactions committing at about the same time. The |
| 378 | + setting can be thought of as a way of increasing the time window in |
| 379 | + which transactions can join a group about to participate in a single |
| 380 | + flush, to amortize the cost of the flush among multiple transactions. |
380 | 381 | </para> |
381 | 382 |
|
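The amortization argument in the rewritten paragraph above can be made concrete with a toy calculation (the helper name and the numbers below are illustrative, not anything from PostgreSQL):

```python
def per_txn_flush_cost(flush_cost_us, group_size):
    """Amortized flush cost per transaction when group_size concurrently
    committing transactions share a single WAL flush."""
    if group_size < 1:
        raise ValueError("need at least one committing transaction")
    return flush_cost_us / group_size

# One flush costing 4000 microseconds, shared by 8 transactions that joined
# the same group: each effectively pays 500 microseconds instead of 4000.
```

Widening the join window with commit_delay increases group_size at the price of added latency for the leader, which is exactly the trade-off the paragraph describes.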
382 | 383 | </sect1> |
|
394 | 395 | <para> |
395 | 396 | <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></> |
396 | 397 | are points in the sequence of transactions at which it is guaranteed |
397 | | - that the heap and index data files have been updated with all information written before |
398 | | - the checkpoint. At checkpoint time, all dirty data pages are flushed to |
399 | | - disk and a special checkpoint record is written to the log file. |
400 | | - (The changes were previously flushed to the <acronym>WAL</acronym> files.) |
| 398 | + that the heap and index data files have been updated with all |
| 399 | + information written before that checkpoint. At checkpoint time, all |
| 400 | + dirty data pages are flushed to disk and a special checkpoint record is |
| 401 | + written to the log file. (The change records were previously flushed |
| 402 | + to the <acronym>WAL</acronym> files.) |
401 | 403 | In the event of a crash, the crash recovery procedure looks at the latest |
402 | 404 | checkpoint record to determine the point in the log (known as the redo |
403 | 405 | record) from which it should start the REDO operation. Any changes made to |
404 | | - data files before that point are guaranteed to be already on disk. Hence, after |
405 | | - a checkpoint, log segments preceding the one containing |
| 406 | + data files before that point are guaranteed to be already on disk. |
| 407 | + Hence, after a checkpoint, log segments preceding the one containing |
406 | 408 | the redo record are no longer needed and can be recycled or removed. (When |
407 | 409 | <acronym>WAL</acronym> archiving is being done, the log segments must be |
408 | 410 | archived before being recycled or removed.) |
|
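The recycling rule above — segments wholly preceding the one that contains the redo record are no longer needed — reduces to a simple filter. A hypothetical sketch, using bare segment numbers rather than real WAL file names:

```python
def recyclable_segments(existing_segments, redo_segment):
    """Segments before the one holding the redo record can be recycled or
    removed (after archiving, if WAL archiving is in use)."""
    return [s for s in existing_segments if s < redo_segment]
```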
411 | 413 | <para> |
412 | 414 | The checkpoint requirement of flushing all dirty data pages to disk |
413 | 415 | can cause a significant I/O load. For this reason, checkpoint |
414 | | - activity is throttled so I/O begins at checkpoint start and completes |
415 | | - before the next checkpoint starts; this minimizes performance |
| 416 | + activity is throttled so that I/O begins at checkpoint start and completes |
| 417 | + before the next checkpoint is due to start; this minimizes performance |
416 | 418 | degradation during checkpoints. |
417 | 419 | </para> |
418 | 420 |
|
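In PostgreSQL the spreading of checkpoint I/O described above is governed by the checkpoint_completion_target parameter. A much-simplified pacing sketch (assuming evenly spaced page writes, which the real checkpointer does not guarantee):

```python
def write_schedule(n_dirty_pages, interval_s, completion_target=0.5):
    """Offsets (seconds from checkpoint start) at which to issue each dirty-page
    write, so the writes finish within completion_target of the checkpoint
    interval rather than arriving in one burst."""
    budget = interval_s * completion_target
    gap = budget / n_dirty_pages
    return [i * gap for i in range(n_dirty_pages)]
```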
419 | 421 | <para> |
420 | 422 | The server's checkpointer process automatically performs |
421 | | - a checkpoint every so often. A checkpoint is created every <xref |
| 423 | + a checkpoint every so often. A checkpoint is begun every <xref |
422 | 424 | linkend="guc-checkpoint-segments"> log segments, or every <xref |
423 | 425 | linkend="guc-checkpoint-timeout"> seconds, whichever comes first. |
424 | 426 | The default settings are 3 segments and 300 seconds (5 minutes), respectively. |
425 | | - In cases where no WAL has been written since the previous checkpoint, new |
426 | | - checkpoints will be skipped even if checkpoint_timeout has passed. |
427 | | - If WAL archiving is being used and you want to put a lower limit on |
428 | | - how often files are archived in order to bound potential data |
429 | | - loss, you should adjust archive_timeout parameter rather than the checkpoint |
430 | | - parameters. It is also possible to force a checkpoint by using the SQL |
| 427 | + If no WAL has been written since the previous checkpoint, new checkpoints |
| 428 | + will be skipped even if <varname>checkpoint_timeout</> has passed. |
| 429 | + (If WAL archiving is being used and you want to put a lower limit on how |
| 430 | + often files are archived in order to bound potential data loss, you should |
| 431 | + adjust the <xref linkend="guc-archive-timeout"> parameter rather than the |
| 432 | + checkpoint parameters.) |
| 433 | + It is also possible to force a checkpoint by using the SQL |
431 | 434 | command <command>CHECKPOINT</command>. |
432 | 435 | </para> |
433 | 436 |
|
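The triggering rule in the paragraph above — a checkpoint begins every checkpoint_segments log segments or checkpoint_timeout seconds, whichever comes first, and is skipped entirely if no WAL has been written — reduces to a small predicate. This is a simplified model for illustration, not server code:

```python
def checkpoint_due(segments_since, seconds_since, wal_written,
                   checkpoint_segments=3, checkpoint_timeout=300):
    """True when a new checkpoint should begin, per the rule described in
    the documentation text (defaults: 3 segments, 300 seconds)."""
    if not wal_written:
        return False          # nothing to checkpoint: skip even past timeout
    return (segments_since >= checkpoint_segments
            or seconds_since >= checkpoint_timeout)
```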
434 | 437 | <para> |
435 | 438 | Reducing <varname>checkpoint_segments</varname> and/or |
436 | 439 | <varname>checkpoint_timeout</varname> causes checkpoints to occur |
437 | | - more often. This allows faster after-crash recovery (since less work |
438 | | - will need to be redone). However, one must balance this against the |
| 440 | + more often. This allows faster after-crash recovery, since less work |
| 441 | + will need to be redone. However, one must balance this against the |
439 | 442 | increased cost of flushing dirty data pages more often. If |
440 | 443 | <xref linkend="guc-full-page-writes"> is set (as is the default), there is |
441 | 444 | another factor to consider. To ensure data page consistency, |
|
450 | 453 | Checkpoints are fairly expensive, first because they require writing |
451 | 454 | out all currently dirty buffers, and second because they result in |
452 | 455 | extra subsequent WAL traffic as discussed above. It is therefore |
453 | | - wise to set the checkpointing parameters high enough that checkpoints |
| 456 | + wise to set the checkpointing parameters high enough so that checkpoints |
454 | 457 | don't happen too often. As a simple sanity check on your checkpointing |
455 | 458 | parameters, you can set the <xref linkend="guc-checkpoint-warning"> |
456 | 459 | parameter. If checkpoints happen closer together than |
|
498 | 501 | altered when building the server). You can use this to estimate space |
499 | 502 | requirements for <acronym>WAL</acronym>. |
500 | 503 | Ordinarily, when old log segment files are no longer needed, they |
501 | | - are recycled (renamed to become the next segments in the numbered |
| 504 | + are recycled (that is, renamed to become future segments in the numbered |
502 | 505 | sequence). If, due to a short-term peak of log output rate, there |
503 | 506 | are more than 3 * <varname>checkpoint_segments</varname> + 1 |
504 | 507 | segment files, the unneeded segment files will be deleted instead |
|
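The 3 * checkpoint_segments + 1 bound above, combined with the usual 16MB segment size, gives a quick upper-bound space estimate for pg_xlog (helper name is made up):

```python
def max_wal_bytes(checkpoint_segments=3, segment_size=16 * 1024 * 1024):
    """Upper bound on retained WAL, per the 3 * checkpoint_segments + 1 rule;
    segment_size assumes the default 16MB (it can be altered at build time)."""
    return (3 * checkpoint_segments + 1) * segment_size

# Defaults: 10 segments of 16MiB, i.e. 160MiB of retained WAL.
```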
507 | 510 |
|
508 | 511 | <para> |
509 | 512 | In archive recovery or standby mode, the server periodically performs |
510 | | - <firstterm>restartpoints</><indexterm><primary>restartpoint</></> |
| 513 | + <firstterm>restartpoints</>,<indexterm><primary>restartpoint</></> |
511 | 514 | which are similar to checkpoints in normal operation: the server forces |
512 | 515 | all its state to disk, updates the <filename>pg_control</> file to |
513 | 516 | indicate that the already-processed WAL data need not be scanned again, |
514 | | - and then recycles any old log segment files in <filename>pg_xlog</> |
515 | | - directory. A restartpoint is triggered if at least one checkpoint record |
516 | | - has been replayed and <varname>checkpoint_timeout</> seconds have passed |
517 | | - since last restartpoint. In standby mode, a restartpoint is also triggered |
518 | | - if <varname>checkpoint_segments</> log segments have been replayed since |
519 | | - last restartpoint and at least one checkpoint record has been replayed. |
| 517 | + and then recycles any old log segment files in the <filename>pg_xlog</> |
| 518 | + directory. |
520 | 519 | Restartpoints can't be performed more frequently than checkpoints in the |
521 | 520 | master because restartpoints can only be performed at checkpoint records. |
| 521 | + A restartpoint is triggered when a checkpoint record is reached if at |
| 522 | + least <varname>checkpoint_timeout</> seconds have passed since the last |
| 523 | + restartpoint. In standby mode, a restartpoint is also triggered if at |
| 524 | + least <varname>checkpoint_segments</> log segments have been replayed |
| 525 | + since the last restartpoint. |
522 | 526 | </para> |
523 | 527 |
|
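The restartpoint conditions in the reorganized paragraph above can likewise be sketched as a predicate (simplified; the real logic lives in the startup and checkpointer processes):

```python
def restartpoint_due(at_checkpoint_record, seconds_since, segments_replayed,
                     standby, checkpoint_timeout=300, checkpoint_segments=3):
    """Restartpoints can only happen at checkpoint records; they are triggered
    by elapsed time, or additionally in standby mode by the number of log
    segments replayed since the last restartpoint."""
    if not at_checkpoint_record:
        return False
    if seconds_since >= checkpoint_timeout:
        return True
    return standby and segments_replayed >= checkpoint_segments
```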
524 | 528 | <para> |
525 | 529 | There are two commonly used internal <acronym>WAL</acronym> functions: |
526 | | - <function>LogInsert</function> and <function>LogFlush</function>. |
527 | | - <function>LogInsert</function> is used to place a new record into |
| 530 | + <function>XLogInsert</function> and <function>XLogFlush</function>. |
| 531 | + <function>XLogInsert</function> is used to place a new record into |
528 | 532 | the <acronym>WAL</acronym> buffers in shared memory. If there is no |
529 | | - space for the new record, <function>LogInsert</function> will have |
| 533 | + space for the new record, <function>XLogInsert</function> will have |
530 | 534 | to write (move to kernel cache) a few filled <acronym>WAL</acronym> |
531 | | - buffers. This is undesirable because <function>LogInsert</function> |
| 535 | + buffers. This is undesirable because <function>XLogInsert</function> |
532 | 536 | is used on every database low level modification (for example, row |
533 | 537 | insertion) at a time when an exclusive lock is held on affected |
534 | 538 | data pages, so the operation needs to be as fast as possible. What |
535 | 539 | is worse, writing <acronym>WAL</acronym> buffers might also force the |
536 | 540 | creation of a new log segment, which takes even more |
537 | 541 | time. Normally, <acronym>WAL</acronym> buffers should be written |
538 | | - and flushed by a <function>LogFlush</function> request, which is |
| 542 | + and flushed by an <function>XLogFlush</function> request, which is |
539 | 543 | made, for the most part, at transaction commit time to ensure that |
540 | 544 | transaction records are flushed to permanent storage. On systems |
541 | | - with high log output, <function>LogFlush</function> requests might |
542 | | - not occur often enough to prevent <function>LogInsert</function> |
| 545 | + with high log output, <function>XLogFlush</function> requests might |
| 546 | + not occur often enough to prevent <function>XLogInsert</function> |
543 | 547 | from having to do writes. On such systems |
544 | 548 | one should increase the number of <acronym>WAL</acronym> buffers by |
545 | | - modifying the configuration parameter <xref |
546 | | - linkend="guc-wal-buffers">. When |
| 549 | + modifying the <xref linkend="guc-wal-buffers"> parameter. When |
547 | 550 | <xref linkend="guc-full-page-writes"> is set and the system is very busy, |
548 | | - setting this value higher will help smooth response times during the |
549 | | - period immediately following each checkpoint. |
| 551 | + setting <varname>wal_buffers</> higher will help smooth response times |
| 552 | + during the period immediately following each checkpoint. |
550 | 553 | </para> |
551 | 554 |
|
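The interplay described above — XLogInsert forced into writing when the shared WAL buffers fill up, versus XLogFlush draining them at commit — can be modeled with a toy class. None of these names or structures are PostgreSQL's; the point is only to show why a larger buffer count reduces writes on the insert path:

```python
class WalBuffers:
    """Toy model: inserts fill a fixed pool of buffers; a full pool forces a
    write during insert (the undesirable case), while flush() is the normal
    commit-time drain."""
    def __init__(self, n_buffers):
        self.capacity = n_buffers
        self.pending = []
        self.forced_writes = 0      # writes taken on the insert path

    def insert(self, record):       # analogous to XLogInsert
        if len(self.pending) >= self.capacity:
            self._write_out()       # no room: write while holding page locks
            self.forced_writes += 1
        self.pending.append(record)

    def flush(self):                # analogous to XLogFlush at commit
        self._write_out()

    def _write_out(self):
        self.pending.clear()        # stand-in for moving data to the kernel
```

With two buffers, a third insert before any commit forces a write; with a larger pool (a bigger wal_buffers, in the analogy) the same workload forces none.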
552 | 555 | <para> |
553 | 556 | The <xref linkend="guc-commit-delay"> parameter defines for how many |
554 | | - microseconds the server process will sleep after writing a commit |
555 | | - record to the log with <function>LogInsert</function> but before |
556 | | - performing a <function>LogFlush</function>. This delay allows other |
557 | | - server processes to add their commit records to the log so as to have all |
558 | | - of them flushed with a single log sync. No sleep will occur if |
559 | | - <xref linkend="guc-fsync"> |
560 | | - is not enabled, or if fewer than <xref linkend="guc-commit-siblings"> |
561 | | - other sessions are currently in active transactions; this avoids |
562 | | - sleeping when it's unlikely that any other session will commit soon. |
563 | | - Note that on most platforms, the resolution of a sleep request is |
564 | | - ten milliseconds, so that any nonzero <varname>commit_delay</varname> |
565 | | - setting between 1 and 10000 microseconds would have the same effect. |
566 | | - Good values for these parameters are not yet clear; experimentation |
567 | | - is encouraged. |
| 557 | + microseconds a group commit leader process will sleep after acquiring a |
| 558 | + lock within <function>XLogFlush</function>, while group commit |
| 559 | + followers queue up behind the leader. This delay allows other server |
| 560 | + processes to add their commit records to the WAL buffers so that all of |
| 561 | + them will be flushed by the leader's eventual sync operation. No sleep |
| 562 | + will occur if <xref linkend="guc-fsync"> is not enabled, or if fewer |
| 563 | + than <xref linkend="guc-commit-siblings"> other sessions are currently |
| 564 | + in active transactions; this avoids sleeping when it's unlikely that |
| 565 | + any other session will commit soon. Note that on some platforms, the |
| 566 | + resolution of a sleep request is ten milliseconds, so that any nonzero |
| 567 | + <varname>commit_delay</varname> setting between 1 and 10000 |
| 568 | + microseconds would have the same effect. Note also that on some |
| 569 | + platforms, sleep operations may take slightly longer than requested by |
| 570 | + the parameter. |
| 571 | + </para> |
| 572 | + |
| 573 | + <para> |
| 574 | + Since the purpose of <varname>commit_delay</varname> is to allow the |
| 575 | + cost of each flush operation to be amortized across concurrently |
| 576 | + committing transactions (potentially at the expense of transaction |
| 577 | + latency), it is necessary to quantify that cost before the setting can |
| 578 | + be chosen intelligently. The higher that cost is, the more effective |
| 579 | + <varname>commit_delay</varname> is expected to be in increasing |
| 580 | + transaction throughput, up to a point. The <xref |
| 581 | + linkend="pgtestfsync"> program can be used to measure the average time |
| 582 | + in microseconds that a single WAL flush operation takes. A value of |
| 583 | + half of the average time the program reports it takes to flush after a |
| 584 | + single 8kB write operation is often the most effective setting for |
| 585 | + <varname>commit_delay</varname>, so this value is recommended as the |
| 586 | + starting point to use when optimizing for a particular workload. While |
| 587 | + tuning <varname>commit_delay</varname> is particularly useful when the |
| 588 | + WAL is stored on high-latency rotating disks, benefits can be |
| 589 | + significant even on storage media with very fast sync times, such as |
| 590 | + solid-state drives or RAID arrays with a battery-backed write cache; |
| 591 | + but this should definitely be tested against a representative workload. |
| 592 | + Higher values of <varname>commit_siblings</varname> should be used in |
| 593 | + such cases, whereas smaller <varname>commit_siblings</varname> values |
| 594 | + are often helpful on higher latency media. Note that it is quite |
| 595 | + possible that a setting of <varname>commit_delay</varname> that is too |
| 596 | + high can increase transaction latency by so much that total transaction |
| 597 | + throughput suffers. |
| 598 | + </para> |
| 599 | + |
| 600 | + <para> |
| 601 | + When <varname>commit_delay</varname> is set to zero (the default), it |
| 602 | + is still possible for a form of group commit to occur, but each group |
| 603 | + will consist only of sessions that reach the point where they need to |
| 604 | + flush their commit records during the window in which the previous |
| 605 | + flush operation (if any) is occurring. At higher client counts a |
| 606 | + <quote>gangway effect</> tends to occur, so that the effects of group |
| 607 | + commit become significant even when <varname>commit_delay</varname> is |
| 608 | + zero, and thus explicitly setting <varname>commit_delay</varname> tends |
| 609 | + to help less. Setting <varname>commit_delay</varname> can only help |
| 610 | + when (1) there are some concurrently committing transactions, and (2) |
| 611 | + throughput is limited to some degree by commit rate; but with high |
| 612 | + rotational latency this setting can be effective in increasing |
| 613 | + transaction throughput with as few as two clients (that is, a single |
| 614 | + committing client with one sibling transaction). |
568 | 615 | </para> |
569 | 616 |
|
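The tuning rule in the new text above — start from half the average single-8kB-write flush time that pg_test_fsync reports — is simple arithmetic; a trivial helper (name is made up) for the suggested starting point:

```python
def suggested_commit_delay_us(avg_flush_us):
    """Starting-point heuristic from the documentation text: half the average
    flush time measured for a single 8kB write. A value to experiment from,
    not a guaranteed optimum for any given workload."""
    return avg_flush_us / 2

# A measured 3000-microsecond average flush suggests trying commit_delay = 1500.
```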
570 | 617 | <para> |
|
574 | 621 | All the options should be the same in terms of reliability, with |
575 | 622 | the exception of <literal>fsync_writethrough</>, which can sometimes |
576 | 623 | force a flush of the disk cache even when other options do not do so. |
577 | | - However, it's quite platform-specific which one will be the fastest; |
578 | | - you can test option speeds using the <xref |
579 | | - linkend="pgtestfsync"> module. |
| 624 | + However, it's quite platform-specific which one will be the fastest. |
| 625 | + You can test the speeds of different options using the <xref |
| 626 | + linkend="pgtestfsync"> program. |
580 | 627 | Note that this parameter is irrelevant if <varname>fsync</varname> |
581 | 628 | has been turned off. |
582 | 629 | </para> |
|
585 | 632 | Enabling the <xref linkend="guc-wal-debug"> configuration parameter |
586 | 633 | (provided that <productname>PostgreSQL</productname> has been |
587 | 634 | compiled with support for it) will result in each |
588 | | - <function>LogInsert</function> and <function>LogFlush</function> |
| 635 | + <function>XLogInsert</function> and <function>XLogFlush</function> |
589 | 636 | <acronym>WAL</acronym> call being logged to the server log. This |
590 | 637 | option might be replaced by a more general mechanism in the future. |
591 | 638 | </para> |
|