
PostgreSQL Dynamic Trace

Digoal.zhou
Monday, October 14, 2013
2013 PostgreSQL China Conference
SKYMOBI, Hangzhou, Zhejiang

Agenda

systemtap

PostgreSQL

PostgreSQL

PostgreSQL

PostgreSQLLinux

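stap processes a script in 5 passes:

1. Parse: read the script (a .stp file or -e text) and the tapset scripts, including any directories added with -I, and check the syntax.

2. Elaborate: resolve probes, functions, variables, and types against the tapsets and debuginfo.

3. Translate: generate C source, for example

/root/.systemtap/cache/e8/stap_e878009262b7836eb07f0b5a0bf0705e_970.c

4. Build: compile the C source into a Linux kernel module, for example

/root/.systemtap/cache/e8/stap_e878009262b7836eb07f0b5a0bf0705e_970.ko

5. Run: stap calls staprun to load the .ko module; the probes run until the script exits, and the module is unloaded when stap ends.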

stapPostgreSQL, , ().

<stap PROCESSING 5 steps introduce>

http://blog.163.com/digoal@126/blog/static/163877040201391434530674/

systemtap

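Types: strings, integers (64bit), associative arrays, statistics

Operators:

* / + - % (arithmetic)   >> << & ^ | (bitwise)   && || (logical)

+ - ! ~ ++ -- (unary)

< > <= >= == != (comparison), =~ !~ (regular expression match)

[idx1, idxn] in var_arr (array membership test)

foreach([idx1, idxn]+- in var_array [@count|@avg] +-) (array iteration, optionally sorted)

delete var_arr (clear the whole array), delete var_arr[idx1, idxn] (delete one element)

<<< (add a value to a statistic), delete static_var (reset a statistic)

a ? b : c (ternary: b if a is true, otherwise c)

. (string concatenation)

-> field (pointer dereference)

Comments: /* */ (C style), # (shell style), // (C++ style)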

systemtap

break and continue

try/catch

delete

EXP (expression)

for

foreach

if

next

; (null statement)

return

{ } (statement block)

while

systemtap

function name:ret_type (par1:type1, ...)

function thatfn:string(arg1:long, arg2) {

return sprintf("%d%s", arg1, arg2)

}

Embedded C

function <name>:<type> ( <arg1>:<type>, ... ) %{ <C_stmts> %}

Inside the C body, arguments are read as STAP_ARG_foo (for an argument named foo) and the result is written to STAP_RETVALUE.

%{

#include <linux/in.h>

#include <linux/ip.h>

%} /* <-- top level */

/* Reads the char value stored at a given address: */

function __read_char:long(addr:long) %{ /* pure */

STAP_RETVALUE = kderef(sizeof(char), STAP_ARG_addr);

CATCH_DEREF_FAULT();

%} /* <-- function body */

/* Determines whether an IP packet is TCP, based on the iphdr: */

function is_tcp_packet:long(iphdr) {

protocol = @cast(iphdr, "iphdr")->protocol

return (protocol == %{ IPPROTO_TCP %}) /* <-- expression */

}

systemtap

()

(@var(varname), $varname )

[root@db-172-16-3-150 ~]# stap -e '

global var1

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__done") {

lv=$1

var1=@2

printf("lv: %d, var1:%s\n", lv, var1)

println($$locals)

println("context var: ", user_string($parsetree_list->head->data->ptr_value))

exit()

}' 10 "Hello I am digoal."

lv: 10, var1:Hello I am digoal.

dest=0x2 oldcontext=? parsetree_list=0x1d4fbe8 parsetree_item=?


save_log_statement_stats=0x0 was_logged=0x0 isTopLevel=? msec_str=[...] __func__=[...]

context var: B

systemtap

MAXNESTING - The maximum number of recursive function call levels. The default is 10.

MAXSTRINGLEN - The maximum length of strings. The default is 256 bytes for 32 bit machines and 512 bytes for all other
machines.

MAXTRYLOCK - The maximum number of iterations to wait for locks on global variables before declaring possible deadlock
and skipping the probe. The default is 1000.

MAXACTION - The maximum number of statements to execute during any single probe hit. The default is 1000.

MAXMAPENTRIES - The maximum number of rows in an array if the array size is not specified explicitly when declared.
The default is 2048.

MAXERRORS - The maximum number of soft errors before an exit is triggered. The default is 0.

MAXSKIPPED - The maximum number of skipped reentrant probes before an exit is triggered. The default is 100.

MINSTACKSPACE - The minimum number of free kernel stack bytes required in order to run a probe handler. This number
should be large enough for the probe handler's own needs, plus a safety margin. The default is 1024.

These limits can be overridden on the command line with stap -D NAME=VALUE.

systemtap

stap.

@define foo %( x %)

@define add(a,b) %( ((@a)+(@b)) %)

@foo = @add(2,2)
@define foo %(
%( CONFIG_UTRACE == "y" %? process.syscall %: **ERROR** %)
%)

%( CONDITION %? TRUE-TOKENS %)

%( CONDITION %? TRUE-TOKENS %: FALSE-TOKENS %)

[root@db-172-16-3-39 ~]# stap -e '
%( 1!=1
%?
probe begin {printf("true\n"); exit();}
%:
probe begin {printf("false\n"); exit();}
%)'

false

The general probe point syntax is a dotted-symbol sequence.

This divides the event namespace into parts, analogous to the style of the Domain Name System.

Each component identifier is parameterized by a string or number literal, with a syntax analogous to a function call.

kernel.function("foo")

kernel.function("foo").return

module("ext3").function("ext3_*")

kernel.function("no_such_function") ?

syscall.*

end

timer.ms(5000)

DWARF(debuginfo)

kernel.function(PATTERN)

kernel.function(PATTERN).call

kernel.function(PATTERN).return

kernel.function(PATTERN).return.maxactive(VALUE)

kernel.function(PATTERN).inline

kernel.function(PATTERN).label(LPATTERN)

module(MPATTERN).function(PATTERN)

module(MPATTERN).function(PATTERN).call

module(MPATTERN).function(PATTERN).return.maxactive(VALUE)

module(MPATTERN).function(PATTERN).inline

kernel.statement(PATTERN)

kernel.statement(ADDRESS).absolute

module(MPATTERN).statement(PATTERN)

Context variables: $var, @var("varname@src.c"), $$vars, $$locals, $$parms; the $ and $$ suffixes pretty-print structures.

DWARF-LESS(debuginfo, kprobe)

In the absence of debugging information, you can still use the kprobe family of probes to examine the entry
and exit points of kernel and module functions. You cannot look up the arguments or local variables of a
function using these probes.

asmlinkage ssize_t sys_read (unsigned int fd, char __user * buf, size_t count)

You can obtain the values of fd, buf, and count, respectively, as uint_arg(1), pointer_arg(2),
and ulong_arg(3). In this case, your probe code must first call asmlinkage(), because on some architectures
the asmlinkage attribute affects how the function's arguments are passed.

In an entry probe, call asmlinkage() before reading the arguments.

In an exit (.return) probe, read the return value with returnval(); $return is not available.

[root@db- ~]# stap -e 'probe kprobe.function("tcp_v4_connect") {asmlinkage(); printf("%s, %d, %d, 0x%x, 0x%x, %d\n", pp(), pid(), cpu(), pointer_arg(1), pointer_arg(2), uint_arg(3));}'

kprobe.function("tcp_v4_connect"), 32372, 11, 0xffff81024ce51380, 0xffff81038aa2fec8, 16

[root@db- ~]# stap -e 'probe kprobe.function("tcp_v4_connect").return {asmlinkage(); printf("%d\n", returnval());}'

userspace probe

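process.begin
Fires when a new process is created. Without a PATH or PID, all user processes are probed (only the target process when stap is run with -c or -x; only the current user's processes with stap --unprivileged).

process("PATH").begin
PATH names an executable; a name without "/" is searched for in $PATH.

process(PID).begin
Matches the process whose ID is PID.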

process.thread.begin

process("PATH").thread.begin

process(PID).thread.begin

.end fires when the process or thread exits.

process.end

process("PATH").end

process(PID).end

process.thread.end

process("PATH").thread.end

process(PID).thread.end

userspace probe

.(debuginfo)

process("PATH").function("NAME")

process("PATH").statement("*@FILE.c:123")

process("PATH").function("*").return

process("PATH").function("myfun").label("foo")

Full symbolic source-level probes in userspace programs and shared libraries are supported.

These are exactly analogous to the symbolic DWARF-based kernel or module probes described previously
and expose similar contextual $-variables.

Here is an example of prototype symbolic userspace probing support:

# stap -e 'probe process("ls").function("*").call {

log (probefunc()." ".$$parms)

}' \

-c 'ls -l'

To run, this script requires debugging information for the named program and utrace support in the kernel.

If you see a "pass 4a-time" build failure, check that your kernel supports utrace.

userspace probe

process.syscall

process("PATH").syscall

process(PID).syscall

process.syscall.return

process("PATH").syscall.return

process(PID).syscall.return

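The system call number is available in $syscall, and the first six arguments in $arg1 - $arg6.

In .return probes the return value is available in $return.

In .return probes, @entry() evaluates an expression at entry and saves the value, for example to measure elapsed time: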

gettimeofday_ns() - @entry(gettimeofday_ns())

userspace probe - Target process mode

Target process mode (invoked with stap -c CMD or -x PID) implicitly restricts all process.* probes to the
given child process.

It does not affect kernel.* or other probe types.

[root@db-172-16-3-39 ~]# ps -ewf|grep postgres|grep pg94

pg94 15967 1 0 Sep29 ? 00:00:00 /home/pg94/pgsql9.4devel/bin/postgres

process.* probes are restricted to the target process:

[root@db-172-16-3-39 ~]# stap -x 15967 -e 'probe process.syscall {printf("%s, %s, %d\n", pp(), execname(), $syscall); }'

process.syscall, postgres, 23

Other probe types are not restricted by -x; here the first hit comes from stapio:

[root@db-172-16-3-39 ~]# stap -x 15967 -e 'probe syscall.* {printf("%s, %s\n", pp(), execname()); exit();}'

kernel.function("sys_fcntl@fs/fcntl.c:357").call?, stapio

userspace probe - Target process mode

A shared library can be probed directly (process("PATH"), where PATH is the library file):

[root@db-172-16-3-39 lib]# stap --download-debuginfo=yes -e 'probe process("/usr/local/lib/libevent-1.4.so.2.2.0").function("*") { printf ("%s, %s\n", pp(), execname()); } probe timer.s(1) {exit();}'

WARNING: abrt-action-install-debuginfo-to-abrt-cache is not installed. Continuing without downloading debuginfo.

process("/usr/local/lib/libevent-1.4.so.2.2.0").function("evsignal_process@/opt/soft_bak/libevent-1.4.14bstable/signal.c:314"), rpc.idmapd

process("/usr/local/lib/libevent-1.4.so.2.2.0").function("timeout_process@/opt/soft_bak/libevent-1.4.14bstable/event.c:927"), rpc.idmapd

Or restricted to one executable with .library():

[root@db-172-16-3-39 lib]# stap -e 'probe process("/usr/sbin/rpc.idmapd").library("/usr/local/lib/libevent-1.4.so.2.2.0").function("*") { printf ("%s, %s, %s\n", pp(), execname(), $$vars); } probe timer.s(1) {exit();}'

userspace probe - |

process("PATH").insn

process(PID).insn

process("PATH").insn.block

process(PID).insn.block

0.9.5

* What's new in version 0.9.5, 2009-03-27

- New probes process().insn and process().insn.block that allows

inspection of the process after each instruction or block of

instructions executed. So to count the total number of instructions

a process executes during a run do something like:

$ stap -e 'global steps; probe process("/bin/ls").insn {steps++}

probe end {printf("Total instructions: %d\n", steps);}' \

-c /bin/ls
This feature can slow down execution of a process somewhat.

userspace probe -

process("PATH").mark("LABEL")

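Static marks are compiled into the program with the following macros, which take up to 12 arguments:

STAP_PROBE[1-12](handle,LABEL[,arg1-12])

DTRACE_PROBE[1-12](handle,LABEL[,arg1-12])

In the probe handler the mark arguments are $arg1 - $arg12; context variables can also be read as $varname or @var("varname@src.c").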

/usr/include/sys/sdt.h

http://blog.163.com/digoal@126/blog/static/163877040201383044341926/

(systemtap), :

http://blog.163.com/digoal@126/blog/static/163877040201396882559/

http://blog.163.com/digoal@126/blog/static/1638770402013971009843/

http://blog.163.com/digoal@126/blog/static/16387704020139710289441/

http://blog.163.com/digoal@126/blog/static/163877040201397112435514/

http://blog.163.com/digoal@126/blog/static/16387704020139752612312/

http://blog.163.com/digoal@126/blog/static/1638770402013978959440/

userspace probe -

1. /usr/include/sys/sdt.h

/* DTrace compatible macro names. */
#define DTRACE_PROBE(provider,probe) \
  STAP_PROBE(provider,probe)
#define DTRACE_PROBE1(provider,probe,parm1) \
  STAP_PROBE1(provider,probe,parm1)
#define DTRACE_PROBE2(provider,probe,parm1,parm2) \
  STAP_PROBE2(provider,probe,parm1,parm2)
...
#define DTRACE_PROBE12(provider,probe,parm1,parm2,parm3,parm4,parm5,parm6,parm7,parm8,parm9,parm10,parm11,parm12) \
  STAP_PROBE12(provider,probe,parm1,parm2,parm3,parm4,parm5,parm6,parm7,parm8,parm9,parm10,parm11,parm12)

userspace probe -

2. PostgreSQL - src/backend/utils/probes.h (probes.d)

#include <sys/sdt.h>

/* TRACE_POSTGRESQL_TRANSACTION_START ( unsigned int) */
#if defined STAP_SDT_V1
#define TRACE_POSTGRESQL_TRANSACTION_START_ENABLED() __builtin_expect (transaction__start_semaphore, 0)
#define postgresql_transaction__start_semaphore transaction__start_semaphore
#else
#define TRACE_POSTGRESQL_TRANSACTION_START_ENABLED() __builtin_expect (postgresql_transaction__start_semaphore, 0)
#endif
__extension__ extern unsigned short postgresql_transaction__start_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TRACE_POSTGRESQL_TRANSACTION_START(arg1) \
DTRACE_PROBE1(postgresql,transaction__start,arg1)

TRACE_POSTGRESQL_TRANSACTION_START(arg1) is the macro that the PostgreSQL source code calls to place the transaction__start mark.

userspace probe -

3. PostgreSQL - src/include/pg_trace.h

#include "utils/probes.h" /* pgrminclude ignore */

4. PostgreSQL - where the transaction__start mark is placed

src/backend/access/transam/xact.c

#include "pg_trace.h"
...
static void
StartTransaction(void)
{
TransactionState s;
VirtualTransactionId vxid;
...
Assert(MyProc->backendId == vxid.backendId);
MyProc->lxid = vxid.localTransactionId;
TRACE_POSTGRESQL_TRANSACTION_START(vxid.localTransactionId);
...
}

userspace probe -

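The probe point for the transaction__start mark:

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__start")

For example, print the local variables, two MyProc fields, and the mark argument: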

[root@db-172-16-3-150 ~]# stap -e '

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__start") {

println($$locals$$)

println($MyProc->backendId, " ", $MyProc->lxid)

println($arg1)

}'

s={.transactionId=0, .subTransactionId=1, .name="<unknown>", .savepointLevel=0, .state=1, .blockState=0, .nestingLevel=0, .gucNestLevel=0, .curTransactionContext=0x1dc8400, .curTransactionOwner=0x1dc8510, .childXids=0x0, .nChildXids=0, .maxChildXids=0, .prevUser=10, .prevSecContext=0, .prevXactReadOnly='\000', .startedInRecovery='\000', .parent=0x0} vxid={.backendId=?, .localTransactionId=?} __func__=?

2 239116

239116

PostgreSQL

http://blog.163.com/digoal@126/blog/static/163877040201391684012713/

Name (Parameters): Description

transaction-start (LocalTransactionId)
Probe that fires at the start of a new transaction. arg0 is the transaction ID.

transaction-commit (LocalTransactionId)
Probe that fires when a transaction completes successfully. arg0 is the transaction ID.

transaction-abort (LocalTransactionId)
Probe that fires when a transaction completes unsuccessfully. arg0 is the transaction ID.
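The probe table uses DTrace-style names (transaction-start, arg0), while the stap scripts in these slides use the mark name transaction__start and $arg1 for the same argument. The following Python sketch only illustrates that naming convention; the helper names are hypothetical and are not part of SystemTap or PostgreSQL.

```python
import re


def dtrace_to_stap_mark(name: str) -> str:
    """Turn a documented probe name (transaction-start) into the mark name (transaction__start)."""
    return name.replace("-", "__")


def dtrace_arg_to_stap(arg: str) -> str:
    """Turn a 0-based documented argument (arg0) into the 1-based stap variable ($arg1)."""
    match = re.fullmatch(r"arg(\d+)", arg)
    if match is None:
        raise ValueError(f"not an argN name: {arg!r}")
    return f"$arg{int(match.group(1)) + 1}"


def probe_point(postgres_path: str, name: str) -> str:
    """Build the stap probe point string for a mark in the given postgres binary."""
    return f'process("{postgres_path}").mark("{dtrace_to_stap_mark(name)}")'


if __name__ == "__main__":
    print(probe_point("/home/pg93/pgsql9.3.1/bin/postgres", "transaction-start"))
    print(dtrace_arg_to_stap("arg0"))
```

The off-by-one matters in the later examples: arg7 of buffer-read-done in the table is read as $arg8 in a script.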

Example: count transaction starts, commits, and aborts per second. Test function and table:

digoal=# CREATE OR REPLACE FUNCTION public.f_test(i_id integer)

RETURNS void

LANGUAGE plpgsql

STRICT

AS $function$

declare

begin

update test set info=md5(random()::text), crt_time=clock_timestamp() where id=i_id;

if not found then

insert into test(id,info,crt_time) values(i_id,md5(random()::text),clock_timestamp());

end if;

return;

exception when others then

return;

end;

$function$;

digoal=# create table test(id int primary key, info text, crt_time timestamp);


vi test.sql

\setrandom id 1 5000000

select f_test(:id);

stap -e '

global var1
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__start") {
var1["START"]++
}
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__commit") {
var1["COMMIT"]++
}
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__abort") {
var1["ABORT"]++
}
probe timer.s(1) {
printf("START/s:%d, COMMIT/s:%d, ABORT/s:%d\n", var1["START"], var1["COMMIT"], var1["ABORT"])
var1["START"]=0
var1["COMMIT"]=0
var1["ABORT"]=0
}'


pgbench -M prepared -n -r -f ./test.sql -c 8 -j 1 -T 10

transaction type: Custom query

scaling factor: 1

query mode: prepared

number of clients: 8

number of threads: 1

duration: 10 s

number of transactions actually processed: 175602

tps = 17559.485329 (including connections establishing)

tps = 17601.731285 (excluding connections establishing)

statement latencies in milliseconds:

0.001609 \setrandom id 1 5000000

0.451537 select f_test(:id);

stap output:

START/s:0, COMMIT/s:0, ABORT/s:0

START/s:0, COMMIT/s:0, ABORT/s:0

START/s:7484, COMMIT/s:7483, ABORT/s:0

START/s:18035, COMMIT/s:18032, ABORT/s:0

START/s:17345, COMMIT/s:17346, ABORT/s:0

START/s:17151, COMMIT/s:17150, ABORT/s:0

START/s:17517, COMMIT/s:17520, ABORT/s:0

START/s:18048, COMMIT/s:18046, ABORT/s:0

START/s:17597, COMMIT/s:17600, ABORT/s:0

START/s:17648, COMMIT/s:17645, ABORT/s:0

START/s:17728, COMMIT/s:17724, ABORT/s:0

START/s:17346, COMMIT/s:17348, ABORT/s:0

START/s:9720, COMMIT/s:9725, ABORT/s:0

START/s:0, COMMIT/s:0, ABORT/s:0

START/s:0, COMMIT/s:0, ABORT/s:0
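Because the timer fires once per second and resets the counters, summing a column of this output gives the total number of events. A small Python sketch of that check follows; the function names are made up for illustration, and the sample holds only the first three non-trivial lines of the output above.

```python
import re

LINE_RE = re.compile(r"START/s:(\d+), COMMIT/s:(\d+), ABORT/s:(\d+)")


def parse_rates(text):
    """Return one (start, commit, abort) tuple per output line."""
    return [tuple(int(n) for n in m.groups()) for m in LINE_RE.finditer(text)]


def totals(samples):
    """Sum each column; with a 1-second timer this is the event total."""
    starts = sum(s for s, _, _ in samples)
    commits = sum(c for _, c, _ in samples)
    aborts = sum(a for _, _, a in samples)
    return starts, commits, aborts


SAMPLE = """\
START/s:0, COMMIT/s:0, ABORT/s:0
START/s:7484, COMMIT/s:7483, ABORT/s:0
START/s:18035, COMMIT/s:18032, ABORT/s:0
"""

if __name__ == "__main__":
    print(totals(parse_rates(SAMPLE)))
```

Summing the COMMIT/s column over the whole run shown above gives 175619, within a few transactions of the 175602 that pgbench reports.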

Example 2: record the duration of each transaction and, every $1 seconds, print the commit and abort rates with a histogram of durations in milliseconds:

stap -e '
global var1%[819200], var2, var3, var4

probe begin {
  var4=gettimeofday_ms()
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__start") {
  var1[pid(),$arg1] = gettimeofday_ms()
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__commit") {
  if (var1[pid(),$arg1] != 0)
    var2 <<< (gettimeofday_ms()-var1[pid(),$arg1])
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("transaction__abort") {
  if (var1[pid(),$arg1] != 0)
    var3 <<< (gettimeofday_ms()-var1[pid(),$arg1])
}

probe timer.s($1) {
  now=gettimeofday_ms()
  if (@count(var2) != 0) {
    printf("COMMIT/s:%d\n", (1000*@count(var2)) / (now-var4))
    println(@hist_log(var2))
    delete var2
  }
  if (@count(var3) != 0) {
    printf("ABORT/s:%d\n", (1000*@count(var3)) / (now-var4))
    println(@hist_log(var3))
    delete var3
  }
  var4=now
}' 3

Output:

COMMIT/s:16364

value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  39936
    1 |@@@@@@@@@@@                                         9137
    2 |                                                      20
    4 |
    8 |

COMMIT/s:16427

value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  39927
    1 |@@@@@@@@@@@                                         9346
    2 |                                                      10
    4 |
    8 |

...
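To read the histogram: @hist_log groups values into power-of-two buckets, so the row labelled 2 holds durations of 2 to 3 ms. The Python sketch below mimics that bucketing for non-negative values only and repeats the script's rate formula; it is an illustration, not SystemTap's implementation, and it assumes the reporting interval was exactly 3000 ms.

```python
from collections import Counter


def log2_bucket(value: int) -> int:
    """Lower bound of the power-of-two bucket that holds a non-negative value."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    return 0 if value == 0 else 1 << (value.bit_length() - 1)


def hist_log(samples):
    """Count samples per bucket, sorted by bucket value."""
    return dict(sorted(Counter(log2_bucket(v) for v in samples).items()))


def rate_per_second(count: int, elapsed_ms: int) -> int:
    """Same integer arithmetic as the script: (1000 * count) / elapsed_ms."""
    return (1000 * count) // elapsed_ms


if __name__ == "__main__":
    print(hist_log([0, 0, 1, 2, 3, 9]))
    # Bucket counts from the first histogram, assumed 3000 ms interval.
    print(rate_per_second(39936 + 9137 + 20, 3000))
```

Under that assumption, the bucket counts of each histogram reproduce the COMMIT/s figure printed above it.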

PostgreSQL

http://blog.163.com/digoal@126/blog/static/1638770402013916101117367/

query-start (const char *)
Probe that fires when the processing of a query is started. arg0 is the query string.

query-done (const char *)
Probe that fires when the processing of a query is complete. arg0 is the query string.

query-parse-start (const char *)
Probe that fires when the parsing of a query is started. arg0 is the query string.

query-parse-done (const char *)
Probe that fires when the parsing of a query is complete. arg0 is the query string.

query-rewrite-start (const char *)
Probe that fires when the rewriting of a query is started. arg0 is the query string.

query-rewrite-done (const char *)
Probe that fires when the rewriting of a query is complete. arg0 is the query string.

query-plan-start ()
Probe that fires when the planning of a query is started.

query-plan-done ()
Probe that fires when the planning of a query is complete.

query-execute-start ()
Probe that fires when the execution of a query is started.

query-execute-done ()
Probe that fires when the execution of a query is complete.

Example: compare which probes fire in the simple, extended, and prepared query modes.

vi test.sql

select clock_timestamp();

pgbench -M simple -n -r -f ./test.sql -c 1 -j 1 -t 6

pgbench -M extended -n -r -f ./test.sql -c 1 -j 1 -t 6

pgbench -M prepared -n -r -f ./test.sql -c 1 -j 1 -t 6

stap -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start") {
  println(pn(), user_string($arg1), pid())
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start") {
  println(pn(), user_string($arg1), pid())
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__rewrite__start") {
  println(pn(), user_string($arg1), pid())
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start") {
  println(pn(), pid())
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start") {
  println(pn(), pid())
}'

Output in simple mode:

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__rewrite__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__rewrite__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__rewrite__start")select clock_timestamp();14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14146

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14146

...

In simple mode, every SQL statement goes through query start, parse, rewrite, plan, and execute.

Output in extended mode:

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14140

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14140

...

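In extended mode the query start and rewrite marks do not fire. Every SQL statement goes through parse, plan, and execute. (The SQL text is reported at the parse step.)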

Output in prepared mode:

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__parse__start")select clock_timestamp();14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__plan__start")14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14143

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__execute__start")14143

In prepared mode the query is parsed and planned once; after that each execution of the SQL fires only execute.
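The three traces can be summarised as a small model of which of the five traced marks fire for N executions of one statement. This Python sketch encodes only what the traces above show for this test (select clock_timestamp(), 6 transactions); the function name is hypothetical, and other workloads may behave differently, for example if a prepared statement is planned again.

```python
SIMPLE = [
    "query__start",
    "query__parse__start",
    "query__rewrite__start",
    "query__plan__start",
    "query__execute__start",
]
EXTENDED = ["query__parse__start", "query__plan__start", "query__execute__start"]
PREPARED_SETUP = ["query__parse__start", "query__plan__start"]


def expected_marks(mode: str, executions: int):
    """List the marks seen in the traces for the given pgbench -M mode."""
    if executions < 1:
        raise ValueError("executions must be at least 1")
    if mode == "simple":
        return SIMPLE * executions
    if mode == "extended":
        return EXTENDED * executions
    if mode == "prepared":
        # Parse and plan once, then one execute per transaction.
        return PREPARED_SETUP + ["query__execute__start"] * executions
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("simple", "extended", "prepared"):
        print(mode, len(expected_marks(mode, 6)))
```

For the 6 transactions run with -t 6, the model gives 30, 18, and 8 trace lines respectively; the prepared trace above indeed has 8 lines.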

PostgreSQL

checkpoint

http://blog.163.com/digoal@126/blog/static/163877040201391622459221/

checkpoint-start (int)
Probe that fires when a checkpoint is started. arg0 holds the bitwise flags used to distinguish different checkpoint types, such as shutdown, immediate or force.

checkpoint-done (int, int, int, int, int)
Probe that fires when a checkpoint is complete. (The probes listed next fire in sequence during checkpoint processing.) arg0 is the number of buffers written. arg1 is the total number of buffers. arg2, arg3 and arg4 contain the number of xlog file(s) added, removed and recycled respectively.

clog-checkpoint-start (bool)
Probe that fires when the CLOG portion of a checkpoint is started. arg0 is true for normal checkpoint, false for shutdown checkpoint.

clog-checkpoint-done (bool)
Probe that fires when the CLOG portion of a checkpoint is complete. arg0 has the same meaning as for clog-checkpoint-start.

subtrans-checkpoint-start (bool)
Probe that fires when the SUBTRANS portion of a checkpoint is started. arg0 is true for normal checkpoint, false for shutdown checkpoint.

subtrans-checkpoint-done (bool)
Probe that fires when the SUBTRANS portion of a checkpoint is complete. arg0 has the same meaning as for subtrans-checkpoint-start.

multixact-checkpoint-start (bool)
Probe that fires when the MultiXact portion of a checkpoint is started. arg0 is true for normal checkpoint, false for shutdown checkpoint.

multixact-checkpoint-done (bool)
Probe that fires when the MultiXact portion of a checkpoint is complete. arg0 has the same meaning as for multixact-checkpoint-start.

buffer-checkpoint-start (int)
Probe that fires when the buffer-writing portion of a checkpoint is started. arg0 holds the bitwise flags used to distinguish different checkpoint types, such as shutdown, immediate or force.

buffer-sync-start (int, int)
Probe that fires when we begin to write dirty buffers during checkpoint (after identifying which buffers must be written). arg0 is the total number of buffers. arg1 is the number that are currently dirty and need to be written.

buffer-sync-written (int)
Probe that fires after each buffer is written during checkpoint. arg0 is the ID number of the buffer.

buffer-sync-done (int, int, int)
Probe that fires when all dirty buffers have been written. arg0 is the total number of buffers. arg1 is the number of buffers actually written by the checkpoint process. arg2 is the number that were expected to be written (arg1 of buffer-sync-start); any difference reflects other processes flushing buffers during the checkpoint.

buffer-checkpoint-sync-start ()
Probe that fires after dirty buffers have been written to the kernel, and before starting to issue fsync requests.

buffer-checkpoint-done ()
Probe that fires when syncing of buffers to disk is complete.

twophase-checkpoint-start ()
Probe that fires when the two-phase portion of a checkpoint is started.

twophase-checkpoint-done ()
Probe that fires when the two-phase portion of a checkpoint is complete.
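The description of buffer-sync-done implies a simple calculation: the gap between the expected and the actually written buffer counts is the work done by other processes during the checkpoint. A minimal Python sketch follows, with a hypothetical function name and made-up sample numbers (only 262144 comes from these slides, as the NBuffers value printed in the checkpoint example).

```python
def summarize_buffer_sync(total_buffers: int, written: int, expected: int):
    """Interpret the three buffer-sync-done arguments (arg0, arg1, arg2)."""
    if total_buffers <= 0:
        raise ValueError("total_buffers must be positive")
    return {
        # Share of the whole buffer pool written by the checkpoint process.
        "written_fraction": written / total_buffers,
        # Buffers that other processes flushed before the checkpoint got to them.
        "flushed_by_others": expected - written,
    }


if __name__ == "__main__":
    # 90 written and 100 expected are illustrative values, not measured ones.
    print(summarize_buffer_sync(262144, 90, 100))
```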

Example: print variables at the checkpoint__done mark.

[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {printf("%s\n", $NBuffers$$)}'

262144

[root@db-172-16-3-150 ~]# stap -D MAXSTRINGLEN=100000 -e 'probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("checkpoint__done") {printf("%s\n", $CheckpointStats$$)}'

{.ckpt_start_t=435381006985866, .ckpt_write_t=435381006985916, .ckpt_sync_t=435381006990623, .ckpt_sync_end_t=435381006990626, .ckpt_end_t=435381006991807, .ckpt_bufs_written=0, .ckpt_segs_added=0, .ckpt_segs_removed=0, .ckpt_segs_recycled=0, .ckpt_sync_rels=0, .ckpt_longest_sync=0, .ckpt_agg_sync_time=0}
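The timestamps in CheckpointStats can be turned into phase durations by subtraction. The Python sketch below parses the integer fields of the printed structure and computes write, sync, and total time. It assumes the timestamps are in microseconds (integer TimestampTz values) and that the phases are delimited as the field names suggest; both are assumptions, not something the output itself states.

```python
import re

# A shortened copy of the structure printed above.
SAMPLE = (
    "{.ckpt_start_t=435381006985866, .ckpt_write_t=435381006985916, "
    ".ckpt_sync_t=435381006990623, .ckpt_sync_end_t=435381006990626, "
    ".ckpt_end_t=435381006991807, .ckpt_bufs_written=0}"
)


def parse_struct(text: str):
    """Parse '.name=integer' fields as printed by the $$ suffix."""
    return {name: int(value) for name, value in re.findall(r"\.(\w+)=(-?\d+)", text)}


def checkpoint_phases(stats):
    """Phase durations, in the same unit as the timestamps (assumed microseconds)."""
    return {
        "write": stats["ckpt_sync_t"] - stats["ckpt_write_t"],
        "sync": stats["ckpt_sync_end_t"] - stats["ckpt_sync_t"],
        "total": stats["ckpt_end_t"] - stats["ckpt_start_t"],
    }


if __name__ == "__main__":
    print(checkpoint_phases(parse_struct(SAMPLE)))
```

For this idle checkpoint (no buffers written), the write phase accounts for most of the total.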

EXP : checkpointvfs.write ,

PostgreSQL

buffer

http://blog.163.com/digoal@126/blog/static/1638770402013916488761/

buffer-read-start (ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)
Probe that fires when a buffer read is started. arg0 and arg1 contain the fork and block numbers of the page (but arg1 will be -1 if this is a relation extension request). arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer. arg6 is true for a relation extension request, false for normal read.

buffer-read-done (ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)
Probe that fires when a buffer read is complete. arg0 and arg1 contain the fork and block numbers of the page (if this is a relation extension request, arg1 now contains the block number of the newly added block). arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer. arg6 is true for a relation extension request, false for normal read. arg7 is true if the buffer was found in the pool, false if not.

buffer-flush-start (ForkNumber, BlockNumber, Oid, Oid, Oid)
Probe that fires before issuing any write request for a shared buffer. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation.

buffer-flush-done (ForkNumber, BlockNumber, Oid, Oid, Oid)
Probe that fires when a write request is complete. (Note that this just reflects the time to pass the data to the kernel; it's typically not actually been written to disk yet.) The arguments are the same as for buffer-flush-start.

buffer-write-dirty-start (ForkNumber, BlockNumber, Oid, Oid, Oid)
Probe that fires when a server process begins to write a dirty buffer. (If this happens often, it implies that shared_buffers is too small or the bgwriter control parameters need adjustment.) arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation.

buffer-write-dirty-done (ForkNumber, BlockNumber, Oid, Oid, Oid)
Probe that fires when a dirty-buffer write is complete. The arguments are the same as for buffer-write-dirty-start.

wal-buffer-write-dirty-start ()
Probe that fires when a server process begins to write a dirty WAL buffer because no more WAL buffer space is available. (If this happens often, it implies that wal_buffers is too small.)

wal-buffer-write-dirty-done ()
Probe that fires when a dirty WAL buffer write is complete.

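The buffer probes fall into four groups:

1. Buffer read: start and done.

2. Shared buffer flush: start and done. These fire in FlushBuffer whenever a shared buffer is flushed out.

3. Dirty shared buffer flush by a server process: start and done. In BufferAlloc, when a backend needs a shared buffer and the one it picks is dirty, the backend has to flush the dirty buffer itself (work normally left to bgwriter); the flush is done by calling FlushBuffer.

4. Dirty WAL buffer flush: start and done. This happens when no WAL buffer space is left (consider a larger wal_buffers).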

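What information do the buffer probes provide?

1. Buffer read start: the fork number (file type), block ID, tablespace OID, database OID, and the relation's relfilenode (pg_class.relfilenode); the backend_id (local buffer) or -1 (shared buffer); and true for a relation extension request, false for a normal read.

2. Buffer read done: the same, plus a bool (found) that tells whether the block was found in the shared pool (a shared buffer hit).

3. Shared buffer flush start and done: the fork number (file type), block ID, tablespace OID, database OID, and the relation's relfilenode (pg_class.relfilenode); the done probe marks the end of the flush.

4. Dirty shared buffer flush start and done: the same arguments as the shared buffer flush probes.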

1. ForkNumber

The fork number identifies the file type: main data file, FSM, VM, or init (used by nologging, that is unlogged, tables).

(For the init fork of nologging tables see

http://blog.163.com/digoal@126/blog/static/163877040201382341433512/ )

src/include/storage/relfilenode.h

typedef enum ForkNumber
{
InvalidForkNumber = -1,
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
INIT_FORKNUM
} ForkNumber;
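When a probe prints arg0 as a bare integer, it helps to map it back to the enum. The Python sketch below mirrors the C enum above (C assigns 1, 2, 3 to the members after MAIN_FORKNUM = 0); the class and function names are illustrative, not PostgreSQL APIs.

```python
from enum import IntEnum


class ForkNumber(IntEnum):
    """Mirror of the ForkNumber enum in src/include/storage/relfilenode.h."""
    INVALID = -1
    MAIN = 0
    FSM = 1
    VISIBILITYMAP = 2
    INIT = 3


def fork_name(value: int) -> str:
    """Return a readable name for a fork number printed by a probe."""
    try:
        return ForkNumber(value).name
    except ValueError:
        return f"unknown({value})"


if __name__ == "__main__":
    for n in (0, 1, 2, 3):
        print(n, fork_name(n))
```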


2. BlockNumber

block id.

src/include/storage/block.h

/*

* each data file (heap or index) is divided into postgres disk blocks

* (which may be thought of as the unit of i/o -- a postgres buffer

* contains exactly one disk block). the blocks are numbered

* sequentially, 0 to 0xFFFFFFFE.

*/

typedef uint32 BlockNumber;

3. Tablespace, database, and relation OIDs.

These correspond to pg_tablespace.oid, pg_database.oid, and pg_class.relfilenode.

How can the buffer probes be used?

1. Buffer read probes: combined with the query probes, they can count the buffer reads of each SQL statement.

2. Dirty buffer flush and buffer flush probes can be combined with the query probes in the same way.

3. Dirty WAL buffer probes.

Example: for each SQL statement, count the buffer reads that hit shared buffers and those that did not.

stap -e '
global var;

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start") {
  var[pid(),0]=0
  var[pid(),1]=0
}

# $arg8 is the found flag (arg7 in the probe table)
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("buffer__read__done") {
  if ($arg8)
    var[pid(),1]++
  if (! $arg8)
    var[pid(),0]++
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__done") {
  printf("query: %s\n", user_string($arg1))
  printf("shared buffer hit: %d\n", var[pid(),1])
  printf("shared buffer nonhit: %d\n", var[pid(),0])
}'


query: explain (analyze,verbose,costs,buffers,timing) select count(*) from test where id<100;

shared buffer hit: 202

shared buffer nonhit: 1

query: explain (analyze,verbose,costs,buffers,timing) select count(*) from test where id<1000;

shared buffer hit: 984

shared buffer nonhit: 0

query: explain (analyze,verbose,costs,buffers,timing) select count(*) from test ;

shared buffer hit: 47321

shared buffer nonhit: 0
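
The hit and nonhit counters printed above can be turned into a hit ratio per query. The following is a minimal sketch in Python, assuming the two counters are copied by hand from the stap output; the helper name hit_ratio is hypothetical and is not part of the original script.

```python
def hit_ratio(hit: int, nonhit: int) -> float:
    """Return the fraction of buffer reads served from the shared buffers."""
    total = hit + nonhit
    # A query with no buffer reads at all reports 0.0 instead of dividing by zero.
    return hit / total if total else 0.0


# Counters copied from the first query in the stap output (id<100).
print(f"{hit_ratio(202, 1):.4f}")
```

A ratio close to 1.0 means that almost every block the query touched was already in the shared buffers.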

PostgreSQL

read|write relation

http://blog.163.com/digoal@126/blog/static/163877040201391653616103/

These probes trace physical reads and writes of relation files (index, table, temporary table, and their fsm and vm forks).

smgr-md-read-start (ForkNumber, BlockNumber, Oid, Oid, Oid, int)

Probe that fires when beginning to read a block from a relation. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer.

smgr-md-read-done (ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int)

Probe that fires when a block read is complete. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer. arg6 is the number of bytes actually read, while arg7 is the number requested (if these are different it indicates trouble).

smgr-md-write-start (ForkNumber, BlockNumber, Oid, Oid, Oid, int)

Probe that fires when beginning to write a block to a relation. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer.

smgr-md-write-done (ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int)

Probe that fires when a block write is complete. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. arg5 is the ID of the backend which created the temporary relation for a local buffer, or InvalidBackendId (-1) for a shared buffer. arg6 is the number of bytes actually written, while arg7 is the number requested (if these are different it indicates trouble).

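1. Relation read start: forkNum, blocknum, tbs_oid, db_oid, pg_class.relfilenode, and the backend ID (which tells whether the block is read into a local or a shared buffer).

These arguments have the same meaning as for the buffer probes; see:

http://blog.163.com/digoal@126/blog/static/1638770402013916488761/

2. Relation read done: the same values, plus 2 more: the number of bytes actually read and the number of bytes requested.

3. Relation write start: the same values as relation read start.

4. Relation write done: the same values, plus 2 more: the number of bytes actually written and the number of bytes requested.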

Example 1: trace reads and writes of relation files.

stap -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__read__done") {
  printdln("***", pn(), $arg1, $arg2, $arg3, $arg4, $arg5, $arg6, $arg7, $arg8)
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done") {
  printdln("***", pn(), $arg1, $arg2, $arg3, $arg4, $arg5, $arg6, $arg7, $arg8)
}'

digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from test ;
                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=106075.99..106076.00 rows=1 width=0) (actual time=1561.052..1561.053 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=47319
   -> Seq Scan on public.test (cost=0.00..94324.59 rows=4700559 width=0) (actual time=0.011..883.486 rows=4676559 loops=1)
        Output: id, info, crt_time
        Buffers: shared hit=47319
 Total runtime: 1561.094 ms
(7 rows)

Every block was found in the shared buffers, so stap prints nothing.

digoal=# create table t1(id int, info text);
CREATE TABLE

digoal=# insert into t1 select generate_series(1,100000),md5(random()::text);
INSERT 0 100000

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__read__done")***1***2***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__read__done")***1***0***1663***16384***24726***-1***8192***8192
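
Each line printed by the script is the probe name followed by the eight probe arguments, separated by "***". The following is a minimal sketch in Python of splitting such a line into named fields, assuming the line has already been joined into one piece; the names SmgrEvent and parse_smgr_line are hypothetical helpers, not part of the original material.

```python
from typing import NamedTuple


class SmgrEvent(NamedTuple):
    probe: str
    fork: int
    block: int
    tablespace_oid: int
    database_oid: int
    relfilenode: int
    backend_id: int        # -1 means a shared buffer
    bytes_done: int        # bytes actually read or written
    bytes_requested: int


def parse_smgr_line(line: str, sep: str = "***") -> SmgrEvent:
    """Split one printdln() line from the stap script into named fields."""
    probe, *numbers = line.strip().split(sep)
    if len(numbers) != 8:
        raise ValueError(f"expected 8 numeric fields, got {len(numbers)}")
    return SmgrEvent(probe, *(int(n) for n in numbers))


line = ('process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__read__done")'
        "***1***2***1663***16384***24726***-1***8192***8192")
event = parse_smgr_line(line)
print(event.fork, event.block, event.relfilenode,
      event.bytes_done == event.bytes_requested)
```

Comparing bytes_done with bytes_requested is the check the probe description suggests: a difference between the two indicates trouble.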

digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 ;
                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate (cost=21.50..21.51 rows=1 width=0) (actual time=0.316..0.316 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=9
   -> Seq Scan on public.t1 (cost=0.00..19.00 rows=1000 width=0) (actual time=0.010..0.164 rows=1000 loops=1)
        Output: id, info
        Buffers: shared hit=9
 Total runtime: 0.353 ms
(7 rows)

Every block was found in the shared buffers, so stap prints nothing.

digoal=# checkpoint;
CHECKPOINT

After the checkpoint, stap shows the blocks written during the checkpoint.

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***0***1663***16384***12658***-1***8192***8192

...

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***8***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***7***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***6***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***5***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***4***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***3***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***2***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***1***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***0***1663***16384***24726***-1***8192***8192

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("smgr__md__write__done")***0***16***1663***16384***12649***-1***8192***8192

...

digoal=# select oid,relname from pg_class where relfilenode in (12658,12660,12682,12685,12637,12638,12661,12650,12631);
 oid  |          relname
------+---------------------------
 2840 | pg_toast_2619
 2679 | pg_index_indexrelid_index
 2610 | pg_index
(3 rows)

Table t1 had 9 blocks flushed (blocks 0 through 8), which matches the stap output.

digoal=# select relfilenode from pg_class where relname='t1';
 relfilenode
-------------
 24726
(1 row)

PostgreSQL

lock

http://blog.163.com/digoal@126/blog/static/163877040201391674922879/

lwlock-acquire (LWLockId, LWLockMode)

Probe that fires when an LWLock has been acquired. arg0 is the LWLock's ID. arg1 is the requested lock mode, either exclusive or shared.

lwlock-release (LWLockId)

Probe that fires when an LWLock has been released (but note that any released waiters have not yet been awakened). arg0 is the LWLock's ID.

lwlock-wait-start (LWLockId, LWLockMode)

Probe that fires when an LWLock was not immediately available and a server process has begun to wait for the lock to become available. arg0 is the LWLock's ID. arg1 is the requested lock mode, either exclusive or shared.

lwlock-wait-done (LWLockId, LWLockMode)

Probe that fires when a server process has been released from its wait for an LWLock (it does not actually have the lock yet). arg0 is the LWLock's ID. arg1 is the requested lock mode, either exclusive or shared.

lwlock-condacquire (LWLockId, LWLockMode)

Probe that fires when an LWLock was successfully acquired when the caller specified no waiting. arg0 is the LWLock's ID. arg1 is the requested lock mode, either exclusive or shared.

lwlock-condacquire-fail (LWLockId, LWLockMode)

Probe that fires when an LWLock was not successfully acquired when the caller specified no waiting. arg0 is the LWLock's ID. arg1 is the requested lock mode, either exclusive or shared.

lock-wait-start (unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE)

Probe that fires when a request for a heavyweight lock (lmgr lock) has begun to wait because the lock is not available. arg0 through arg3 are the tag fields identifying the object being locked. arg4 indicates the type of object being locked. arg5 indicates the lock type being requested.

lock-wait-done (unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE)

Probe that fires when a request for a heavyweight lock (lmgr lock) has finished waiting (i.e., has acquired the lock). The arguments are the same as for lock-wait-start.

deadlock-found ()

Probe that fires when a deadlock is found by the deadlock detector.

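probe lwlock__acquire(LWLockId, LWLockMode);
Fires when an LWLock has been acquired.

probe lwlock__release(LWLockId);
Fires when an LWLock has been released.

probe lwlock__wait__start(LWLockId, LWLockMode);
Fires when a process begins to wait for an LWLock.

probe lwlock__wait__done(LWLockId, LWLockMode);
Fires when the wait ends.

probe lwlock__condacquire(LWLockId, LWLockMode);
Fires when an LWLock is acquired in nowait mode: if the lock is not available the caller does not wait. Otherwise the same as lwlock__acquire.

probe lwlock__condacquire__fail(LWLockId, LWLockMode);
Fires when the nowait acquisition fails.

probe lwlock__wait__until__free(LWLockId, LWLockMode);
Fires in LWLockAcquireOrWait. If the lock is free, it is acquired and the function returns true; if not, the function waits until the lock is released and returns false without acquiring it. Currently used for WALWriteLock.

probe lwlock__wait__until__free__fail(LWLockId, LWLockMode);
Fires when the caller had to wait and did not acquire the lock.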

probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
Fires when a heavyweight lock request begins to wait. The first 5 arguments are 5 fields of LOCKTAG.

probe lock__wait__done(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
Fires when the wait ends.

Deadlock detection:
probe deadlock__found();

typedef enum LWLockId
{

BufFreelistLock,

ShmemIndexLock,

OidGenLock,

XidGenLock,

ProcArrayLock,

SInvalReadLock,

SInvalWriteLock,

WALInsertLock,

WALWriteLock,

ControlFileLock,

CheckpointLock,

CLogControlLock,

SubtransControlLock,

MultiXactGenLock,

MultiXactOffsetControlLock,


MultiXactMemberControlLock,

RelCacheInitLock,

CheckpointerCommLock,

TwoPhaseStateLock,

TablespaceCreateLock,

BtreeVacuumLock,

AddinShmemInitLock,

AutovacuumLock,

AutovacuumScheduleLock,

SyncScanLock,

RelationMappingLock,

AsyncCtlLock,

AsyncQueueLock,

SerializableXactHashLock,

SerializableFinishedListLock,

SerializablePredicateLockListLock,

OldSerXidLock,


SyncRepLock,

/* Individual lock IDs end here */

FirstBufMappingLock,

FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,

FirstPredicateLockMgrLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,

/* must be last except for MaxDynamicLWLock: */

NumFixedLWLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,

MaxDynamicLWLock = 1000000000

} LWLockId;

typedef enum LWLockMode
{

LW_EXCLUSIVE,

LW_SHARED,

LW_WAIT_UNTIL_FREE

} LWLockMode;

typedef enum LockTagType
{
    LOCKTAG_RELATION,           /* whole relation */
    /* ID info for a relation is DB OID + REL OID; DB OID = 0 if shared */
    LOCKTAG_RELATION_EXTEND,    /* the right to extend a relation */
    /* same ID info as RELATION */
    LOCKTAG_PAGE,               /* one page of a relation */
    /* ID info for a page is RELATION info + BlockNumber */
    LOCKTAG_TUPLE,              /* one physical tuple */
    /* ID info for a tuple is PAGE info + OffsetNumber */
    LOCKTAG_TRANSACTION,        /* transaction (for waiting for xact done) */
    /* ID info for a transaction is its TransactionId */
    LOCKTAG_VIRTUALTRANSACTION, /* virtual transaction (ditto) */
    /* ID info for a virtual transaction is its VirtualTransactionId */
    LOCKTAG_OBJECT,             /* non-relation database object */
    /* ID info for an object is DB OID + CLASS OID + OBJECT OID + SUBID */
    LOCKTAG_USERLOCK,           /* reserved for old contrib/userlock code */
    LOCKTAG_ADVISORY            /* advisory user locks */
} LockTagType;

typedef struct LOCKTAG
{
    uint32      locktag_field1;         /* a 32-bit ID field */
    uint32      locktag_field2;         /* a 32-bit ID field */
    uint32      locktag_field3;         /* a 32-bit ID field */
    uint16      locktag_field4;         /* a 16-bit ID field */
    uint8       locktag_type;           /* see enum LockTagType */
    uint8       locktag_lockmethodid;   /* lockmethod indicator */
} LOCKTAG;

/* NoLock is not a lock mode, but a flag value meaning "don't get a lock" */
#define NoLock                   0

#define AccessShareLock          1   /* SELECT */
#define RowShareLock             2   /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock         3   /* INSERT, UPDATE, DELETE */
#define ShareUpdateExclusiveLock 4   /* VACUUM (non-FULL),ANALYZE, CREATE
                                      * INDEX CONCURRENTLY */
#define ShareLock                5   /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock    6   /* like EXCLUSIVE MODE, but allows ROW
                                      * SHARE */
#define ExclusiveLock            7   /* blocks ROW SHARE/SELECT...FOR
                                      * UPDATE */
#define AccessExclusiveLock      8   /* ALTER TABLE, DROP TABLE, VACUUM
                                      * FULL, and unqualified LOCK TABLE */
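
The lock probes report the lock mode as a plain integer, so a lookup table makes the stap output easier to read. The following is a minimal sketch in Python, assuming the numbering 0 through 8 of the lock mode definitions in the PostgreSQL headers; the names LOCK_MODES and lock_mode_name are hypothetical helpers.

```python
# Heavyweight lock modes, numbered as in the PostgreSQL lock mode definitions.
LOCK_MODES = {
    0: "NoLock",
    1: "AccessShareLock",
    2: "RowShareLock",
    3: "RowExclusiveLock",
    4: "ShareUpdateExclusiveLock",
    5: "ShareLock",
    6: "ShareRowExclusiveLock",
    7: "ExclusiveLock",
    8: "AccessExclusiveLock",
}


def lock_mode_name(mode: int) -> str:
    """Translate the LOCKMODE integer reported by a lock probe into its name."""
    return LOCK_MODES.get(mode, f"unknown({mode})")


print(lock_mode_name(8))
```

The same approach works for the locktag_type argument by listing the LockTagType members in declaration order.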

Example 1: count LWLock waits by lock ID and lock mode, printed at a fixed interval.

stap -e '
global var1

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("lwlock__wait__done") {
  var1[$arg1, $arg2]++
}

probe timer.s($1) {
  println("*******************")
  # ascending by wait count
  foreach(v=[x,y] in var1+)
    printf("lockid:%d, lockmode:%d, wait_count:%d\n", x,y,v)
  delete var1
}' 5

Example 2: heavyweight lock waits, showing the 5 locks with the largest total wait time, their wait count, and their average wait time.

[root@db-172-16-3-150 postgresql-9.3.1]# stap -v -D MAXSKIPPED=10000000 -e '
global var1%[120000], var2%[120000]

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("lock__wait__start") {
  var1[pid()] = gettimeofday_us()
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("lock__wait__done") {
  p=pid()
  t=gettimeofday_us()
  if (p in var1)
    var2[$arg1, $arg2, $arg3, $arg4, $arg5, $arg6] <<< (t - var1[p])
}

probe timer.s($1) {
  println("*******************")
  # top 5 by total wait time; times are printed in milliseconds
  foreach([a,b,c,d,e,f] in var2 @sum - limit 5)
    printdln("**",a,b,c,d,e,f,@sum(var2[a,b,c,d,e,f])/1000,@count(var2[a,b,c,d,e,f]),@avg(var2[a,b,c,d,e,f])/1000)
  delete var2
}' 5
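
The timer probe relies on statistical aggregates: each `<<<` adds one sample, and @sum, @count, and @avg are computed per key when the report is printed. The following is a minimal sketch in Python of the same aggregation, assuming integer arithmetic as in SystemTap; the function name top_lock_waits and the sample data are hypothetical.

```python
from collections import defaultdict


def top_lock_waits(samples, limit=5):
    """Aggregate (lock_key, wait_us) samples the way the timer probe reports them.

    Returns up to `limit` rows of (lock_key, total_ms, count, avg_ms),
    ordered by total wait time, largest first.
    """
    waits = defaultdict(list)
    for lock_key, wait_us in samples:
        waits[lock_key].append(wait_us)      # plays the role of: var2[...] <<< wait
    # Sort on the raw microsecond sum, like "@sum -" in the foreach.
    ordered = sorted(waits.items(), key=lambda kv: sum(kv[1]), reverse=True)
    rows = []
    for lock_key, values in ordered[:limit]:
        total = sum(values)
        avg = total // len(values)           # integer average
        rows.append((lock_key, total // 1000, len(values), avg // 1000))
    return rows


# Hypothetical samples: (five locktag values + lock mode, wait in microseconds).
samples = [
    ((16384, 24726, 0, 0, 0, 8), 3000),
    ((16384, 24726, 0, 0, 0, 8), 5000),
    ((16384, 12658, 0, 0, 0, 5), 1000),
]
for row in top_lock_waits(samples):
    print(row)
```

Keeping the samples keyed by the full lock tag plus mode is what lets the report separate contention on different objects.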

PostgreSQL

statement, xlog, sort

http://blog.163.com/digoal@126/blog/static/1638770402013916221518/

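1. Statement status: fires whenever pg_stat_activity.status changes; the argument is the new status string, that is, the SQL being run.

2. xlog probe 1: fires when a WAL record is inserted. It reports the record's rmid and info flags.

3. xlog probe 2: fires when an xlog segment switch is requested.

4. Sort start: reports the sort type (heap, index, datum), whether unique values are enforced, the number of key columns, the allowed work_mem in kbytes, and whether random access to the result is required.

5. Sort done: reports whether the sort was external or internal, and the space used (disk blocks or kbytes of memory).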

statement-status (const char *)

Probe that fires anytime the server process updates its pg_stat_activity.status. arg0 is the new status string.

xlog-insert (unsigned char, unsigned char)

Probe that fires when a WAL record is inserted. arg0 is the resource manager (rmid) for the record. arg1 contains the info flags.

xlog-switch ()

Probe that fires when a WAL segment switch is requested.

sort-start (int, bool, int, int, bool)

Probe that fires when a sort operation is started. arg0 indicates heap, index or datum sort. arg1 is true for unique-value enforcement. arg2 is the number of key columns. arg3 is the number of kilobytes of work memory allowed. arg4 is true if random access to the sort result is required.

sort-done (bool, long)

Probe that fires when a sort is complete. arg0 is true for external sort, false for internal sort. arg1 is the number of disk blocks used for an external sort, or kilobytes of memory used for an internal sort.

Where the statement probe fires:

src/backend/postmaster/pgstat.c

/* Called from tcop/postgres.c to report what the backend is actually doing */
void
pgstat_report_activity(BackendState state, const char *cmd_str)
{
    volatile PgBackendStatus *beentry = MyBEEntry;
    TimestampTz start_timestamp;
    TimestampTz current_timestamp;
    int         len = 0;

    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
    ...

Example: trace status changes (pg_stat_activity.status) and print the new status string.

[root@db-172-16-3-150 postgresql-9.3.1]# stap -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status") {
  # $arg1 is NULL when the backend reports no command string
  printdln("**", pn(), $arg1 ? user_string($arg1) : "0")
}'

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status")**begin;

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status")**0

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status")**end;

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status")**0

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status")**select * from t1;

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("statement__status")**0

Example: trace WAL record inserts and print the resource manager ID (rmid) and the info flags of each record.

[root@db-172-16-3-150 postgresql-9.3.1]# stap -e '

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("xlog__insert") {

printf("rmid:%u, info:%d\n", $arg1 ,$arg2)

}'

digoal=# insert into t1 values (1,'test');

INSERT 0 1

OUTPUT :

rmid:10, info:0

rmid:1, info:96

rmid=10 is the heap resource manager (heap_redo); for heap_redo, info=0 is an INSERT.

rmid=1 is the transaction resource manager (xact_redo); for xact_redo, info=96 is XLOG_XACT_COMMIT_COMPACT.
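
The (rmid, info) pairs can be decoded with small lookup tables. The following is a minimal sketch in Python that covers only the two records seen in this output; the table is deliberately partial, the constant name XLOG_HEAP_INSERT and the masking of the low four bits are assumptions taken from the PostgreSQL 9.3 headers, and describe_record is a hypothetical helper.

```python
# Partial lookup tables: only the two records seen in the example output.
RMGR_NAMES = {1: "xact", 10: "heap"}
INFO_NAMES = {
    (10, 0x00): "XLOG_HEAP_INSERT",
    (1, 0x60): "XLOG_XACT_COMMIT_COMPACT",
}


def describe_record(rmid: int, info: int) -> str:
    # Assumption: the high four bits of info carry the resource manager's
    # own operation code, so the low four bits are masked off first.
    op = info & 0xF0
    rmgr = RMGR_NAMES.get(rmid, f"rmid {rmid}")
    name = INFO_NAMES.get((rmid, op), f"info 0x{op:02x}")
    return f"{rmgr}: {name}"


print(describe_record(10, 0))
print(describe_record(1, 96))
```

Unknown pairs fall back to the raw numbers, so the helper can be extended one resource manager at a time.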

Example: trace xlog segment switches.

[root@db-172-16-3-150 postgresql-9.3.1]# stap -e '

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("xlog__switch") {

println(pn())

}'

SQL :

digoal=# select pg_switch_xlog();

----------------

3/45000728

digoal=# select pg_switch_xlog();

----------------

3/5EE76270

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("xlog__switch")

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("xlog__switch")

Example: trace sorts.

[root@db-172-16-3-150 postgresql-9.3.1]# stap --vp 10000 -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__start") {
  if ($arg1 == 0) st="HEAP_SORT"
  if ($arg1 == 1) st="INDEX_SORT"
  if ($arg1 == 2) st="DATUM_SORT"
  if ($arg1 == 3) st="CLUSTER_SORT"
  printdln("**",pn(),st,$arg2,$arg3,$arg4,$arg5)
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__done") {
  if ($arg1) st="EXTERNAL_SORT"
  if (! $arg1) st="MEM_SORT"
  printdln("**",pn(),st,$arg2)
}'

digoal=# explain (analyze,verbose,costs,buffers,timing) select * from t1 order by id;
                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Sort (cost=884772.20..898852.12 rows=5631970 width=11) (actual time=11457.870..13425.913 rows=5635072 loops=1)
   Output: id, info
   Sort Key: t1.id
   Sort Method: external merge Disk: 118696kB
   Buffers: shared hit=31965, temp read=57480 written=57480
   -> Seq Scan on public.t1 (cost=0.00..88281.70 rows=5631970 width=11) (actual time=0.010..969.990 rows=5635072 loops=1)
        Output: id, info
        Buffers: shared hit=31962
 Total runtime: 14040.272 ms
(9 rows)

# The external sort used 14837 disk blocks, as reported by stap.

digoal=# select 14837*8;
 ?column?
----------
   118696
(1 row)

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__start")**HEAP_SORT**0**1**1024**0

Unique enforcement is 0 (off), there is 1 key column, 1024 KB of work mem is allowed, and random access is 0 (not required).

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__done")**EXTERNAL_SORT**14837

The sort used 14837 blocks, which agrees with the explain analyze output.
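
The conversion between the block count from sort__done and the "Disk:" figure in the plan is a single multiplication. The following is a minimal sketch in Python, assuming PostgreSQL was built with the default 8 kB block size; the helper name external_sort_kb is hypothetical.

```python
BLOCK_KB = 8  # assumption: default 8 kB block size


def external_sort_kb(blocks: int) -> int:
    """Convert the disk block count reported by sort__done into kilobytes."""
    return blocks * BLOCK_KB


print(external_sort_kb(14837))  # 118696, the same value as select 14837*8
```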


digoal=# explain (analyze,verbose,costs,buffers,timing) select * from t1 order by id desc,info;

Sort Key: t1.id, t1.info

Sort Method: external merge Disk: 118704kB

digoal=# set work_mem='1024MB';

digoal=# explain (analyze,verbose,costs,buffers,timing) select * from t1 order by id desc,info;

Sort Key: t1.id, t1.info

Sort Method: quicksort Memory: 476753kB

digoal=# explain (analyze,verbose,costs,buffers,timing) select * from t1 group by info,id order by id desc,info;

Sort Key: t1.id, t1.info

Sort Method: quicksort Memory: 931kB

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__start")**HEAP_SORT**0**2**1024**0

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__done")**EXTERNAL_SORT**14838

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__start")**HEAP_SORT**0**2**1048576**0

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__done")**MEM_SORT**476753

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__start")**HEAP_SORT**0**2**1048576**0

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("sort__done")**MEM_SORT**931

1. ,

2. probes.d, ()

3. probes.d,

4. probes.h

5.

6. pg_trace.h

7. PostgreSQL

8.

9.

Example: trace MD5 authentication. The server sends a salt, the client replies with the salted md5 hash, and the server receives enc(salted md5).

1.

src/backend/libpq/auth.c

2. probes.d, ,

vi src/backend/utils/probes.d

#define bool char

// add by digoal: custom argument types

#define salt char[4]

#define pwd char *

// Add 2 probes, fired from sendAuthRequest and recv_password_packet

provider postgresql {

// add by digoal

probe test1(salt);

probe test2(pwd);

3. probes.h

cd src/backend/utils

gmake


5.

6. pg_trace.h

7. PostgreSQL

vi src/backend/libpq/auth.c

// add by digoal

#include "pg_trace.h"

...

/*

* Send an authentication request packet to the frontend.

*/

static void
sendAuthRequest(Port *port, AuthRequest areq)
{
    StringInfoData buf;

    pq_beginmessage(&buf, 'R');
    pq_sendint(&buf, (int32) areq, sizeof(int32));

    /* Add the salt for encrypted passwords. */
    if (areq == AUTH_REQ_MD5) {
        pq_sendbytes(&buf, port->md5Salt, 4);
        TRACE_POSTGRESQL_TEST1(port->md5Salt);
    }

...

static char *
recv_password_packet(Port *port)
{
    StringInfoData buf;

    if (PG_PROTOCOL_MAJOR(port->proto) >= 3)
    {
        /* Expect 'p' message type */
        int         mtype;

        mtype = pq_getbyte();
        if (mtype != 'p')

...

    /*
     * Return the received string. Note we do not attempt to do any
     * character-set conversion on it; since we don't yet know the client's
     * encoding, there wouldn't be much point.
     */
    // add by digoal
    TRACE_POSTGRESQL_TEST2(buf.data);

    return buf.data;
}

cd ../../..

gmake && gmake install


8.

9.

[root@db-172-16-3-150 postgresql-9.3.1]# su - pg93

pg93@db-172-16-3-150-> pg_ctl restart -m fast

[root@db-172-16-3-150 postgresql-9.3.1]# stap -e '
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test1"), process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test2") {
  printdln("---", pn(), cmdline_str(), user_string($arg1))
}'

pg93@db-172-16-3-150-> psql -h 172.16.3.150

Password:

digoal=# \q

pg93@db-172-16-3-150-> psql -h 172.16.3.150

Password:

digoal=# \q

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test1")---postgres: postgres digoal 172.16.3.150(62448) authentication---?#

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test1")---postgres: postgres digoal 172.16.3.150(62449) authentication---<9?

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test2")---postgres: postgres digoal 172.16.3.150(62449) authentication---md59ec1063988718e99ee11e3933614232e

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test1")---postgres: postgres digoal 172.16.3.150(62450) authentication---??'D

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test1")---postgres: postgres digoal 172.16.3.150(62451) authentication---???

process("/home/pg93/pgsql9.3.1/bin/postgres").mark("test2")---postgres: postgres digoal 172.16.3.150(62451) authentication---md57dcdfd2b2810d62651ab46d29159a2ac

The test1 probe prints the salt sent by the server, and the test2 probe prints the salted md5 hash returned by the client.
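
For reference, the value captured by the test2 probe is built on the client from the password, the user name, and the 4-byte salt captured by test1. The following is a minimal sketch in Python of that computation as PostgreSQL's MD5 authentication defines it; the function names and the credentials in the usage line are hypothetical, not values from the trace above.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the lowercase hexadecimal MD5 digest of data."""
    return hashlib.md5(data).hexdigest()


def pg_md5_response(user: str, password: str, salt: bytes) -> str:
    """Build the string an MD5-authenticating client sends back to the server."""
    # Inner hash: md5(password + username), the form stored in pg_authid.
    inner = md5_hex(password.encode() + user.encode())
    # Outer hash: md5(inner hex digest + salt), prefixed with "md5".
    return "md5" + md5_hex(inner.encode() + salt)


# Hypothetical credentials and salt, for illustration only.
print(pg_md5_response("postgres", "secret", b"\x01\x02\x03\x04"))
```

Because the salt changes on every connection attempt, the captured response differs each time even for the same password, which is what the two test2 lines in the output show.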

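Example: custom probes that count connections and disconnections per second and measure session duration.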

vi src/backend/utils/probes.d

provider postgresql {

// add by digoal

probe client_conn();

probe client_close();

gmake

less probes.h

#define TRACE_POSTGRESQL_CLIENT_CONN() DTRACE_PROBE(postgresql,client_conn)

#define TRACE_POSTGRESQL_CLIENT_CLOSE() DTRACE_PROBE(postgresql,client_close)

[root@db-172-16-3-39 utils]# cd /opt/soft_bak/postgresql-9.3.1/src/backend/libpq

[root@db-172-16-3-39 libpq]# vi pqcomm.c

// Include pg_trace.h

// add by digoal

#include "pg_trace.h"

....

Add TRACE_POSTGRESQL_CLIENT_CONN(); to pq_init:

void
pq_init(void)
{
    // add by digoal
    TRACE_POSTGRESQL_CLIENT_CONN();

....

Add TRACE_POSTGRESQL_CLIENT_CLOSE(); to pq_close:

static void
pq_close(int code, Datum arg)
{
    // add by digoal
    TRACE_POSTGRESQL_CLIENT_CLOSE();

[root@db-172-16-3-39 libpq]# cd /opt/soft_bak/postgresql-9.3.1

[root@db-172-16-3-39 postgresql-9.3.1]# gmake && gmake install

pg_ctl restart -m fast


vi test.stp

global var1, var11, var2, var3, var4, var44;

probe begin {
  if ($1<1 || $2<1) {
    println("Please enter $interval and $count larger than one")
    exit()
  }
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("client_conn") {
  var1 <<< 1
  var11 <<< 1
  var2[pid()]=gettimeofday_ms()
}

probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("client_close") {
  # Only time sessions whose client_conn was seen; a backend that connected
  # before tracing started has no start time recorded.
  if (var2[pid()] != 0) {
    var3 <<< (gettimeofday_ms() - var2[pid()])
    delete var2[pid()]
  }
  var4 <<< 1
  var44 <<< 1
}

probe timer.s($1) {
  printf("conn per sec: %d, close per sec: %d\n", @count(var1)/$1, @count(var4)/$1)
  delete var1
  delete var4
}

probe timer.s($2) {
  # var1 and var4 are reset by the interval timer above, so these two counts
  # cover only the time since it last fired.
  printf("end conn per sec: %d, end close per sec: %d\n", @count(var1)/$2, @count(var4)/$2)
  printf("session times min:%d, max:%d, avg:%d\n", @min(var3), @max(var3), @avg(var3))
  println("hist_linear : ms")
  println(@hist_linear(var3,0,5000,50))
  delete var11
  delete var44
  delete var3
  delete var1
  delete var2
  delete var4
  exit()
}


\setrandom seed 1 300

select pg_sleep(0.01 * :seed);

pg93@db-172-16-3-39-> pgbench -M extended -f ./test.sql -n -r -C -h 127.0.0.1 -c 16 -j 4 -T 13

[root@db-172-16-3-39 ~]# stap ./test.stp 1 10

conn per sec: 11, close per sec: 11

conn per sec: 5, close per sec: 5

conn per sec: 13, close per sec: 13

conn per sec: 13, close per sec: 13

conn per sec: 12, close per sec: 12

conn per sec: 8, close per sec: 8

conn per sec: 15, close per sec: 15

conn per sec: 8, close per sec: 8

conn per sec: 11, close per sec: 11

conn per sec: 11, close per sec: 11

end conn per sec: 0, end close per sec: 0

session times min:14, max:2994, avg:1429

hist_linear : ms

value |-------------------------------------------------- count
    0 |@@@@@                                                  5
   50 |@@@                                                    3
  100 |@@                                                     2
  150 |@@                                                     2
  200 |                                                       0
  250 |@@                                                     2
  300 |@@                                                     2
  350 |@                                                      1
  400 |                                                       0
  450 |                                                       0
  500 |@@@@                                                   4
  550 |                                                       0
  600 |@                                                      1
  650 |@@@                                                    3
  700 |@@                                                     2
  750 |                                                       0
  800 |@                                                      1
  850 |@@                                                     2
  900 |@@                                                     2
  950 |@@                                                     2
 1000 |@@                                                     2
 1050 |@@@                                                    3
 1100 |@                                                      1
 1150 |@@@                                                    3
 1200 |@@                                                     2
 1250 |                                                       0
 1300 |@@@                                                    3
 1350 |                                                       0
 1400 |@                                                      1
 1450 |@                                                      1
 1500 |                                                       0
 1550 |@                                                      1
 1600 |@                                                      1
 1650 |                                                       0
 1700 |@                                                      1
 1750 |@@@                                                    3
 1800 |@@@                                                    3
 1850 |@                                                      1
 1900 |@                                                      1
 1950 |                                                       0
 2000 |@                                                      1
 2050 |@                                                      1
 2100 |@                                                      1
 2150 |@@@@                                                   4
 2200 |@@                                                     2
 2250 |                                                       0
 2300 |@                                                      1
 2350 |@@                                                     2
 2400 |                                                       0
 2450 |@                                                      1
 2500 |                                                       0
 2550 |@@@                                                    3
 2600 |                                                       0
 2650 |@@@@                                                   4
 2700 |                                                       0
 2750 |@                                                      1
 2800 |@                                                      1
 2850 |@                                                      1
 2900 |@@@                                                    3
 2950 |@@@@                                                   4
 3000 |                                                       0
 3050 |                                                       0
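
The histogram groups session durations into 50 ms buckets labelled by their lower bound. The following is a simplified sketch in Python of that bucketing, assuming values outside the requested range are counted separately; the function name hist_linear here is a hypothetical Python helper, not SystemTap's own implementation.

```python
def hist_linear(values, low, high, width):
    """Count values into fixed-width buckets labelled by their lower bound."""
    buckets = {start: 0 for start in range(low, high + 1, width)}
    below = above = 0
    for v in values:
        if v < low:
            below += 1            # smaller than the first bucket
        elif v > high:
            above += 1            # larger than the last bucket
        else:
            buckets[low + (v - low) // width * width] += 1
    return buckets, below, above


# The minimum and maximum session times reported by the script (14 and 2994 ms).
buckets, below, above = hist_linear([14, 2994], 0, 5000, 50)
print(buckets[0], buckets[2950], below, above)
```

With these bounds the shortest session (14 ms) lands in bucket 0 and the longest (2994 ms) in bucket 2950, which is the last non-empty row of the histogram.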

PostgreSQL

Trace PostgreSQL iostat per SQL statement

http://blog.163.com/digoal@126/blog/static/16387704020139152191581/

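For each SQL statement, the vfs.read and vfs.write probes are used to collect I/O statistics: number of requests, kilobytes requested, and the per-second rates.

I/O served from the OS cache is reported separately from I/O that is not.

Only vfs.read and vfs.write calls issued by PostgreSQL processes are counted.

The stp script is available in the blog post.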


digoal=# drop table t;

digoal=# create table t(id int, info text, crt_time timestamp);

digoal=# insert into t select generate_series(1,1000000), md5(random()::text), clock_timestamp();

digoal=# create index idx_t_1 on t(id);

digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t;

QUERY PLAN
Aggregate (cost=21846.00..21846.01 rows=1 width=0) (actual time=467.932..467.932 rows=1 loops=1)

Output: count(*)

Buffers: shared hit=9346

-> Seq Scan on public.t (cost=0.00..19346.00 rows=1000000 width=0) (actual time=0.011..253.487 rows=1000000 loops=1)

Output: id, info, crt_time

Buffers: shared hit=9346

digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t where id<10000;

QUERY PLAN
Aggregate (cost=16928.08..16928.09 rows=1 width=0) (actual time=8.509..8.509 rows=1 loops=1)

Output: count(*)

Buffers: shared hit=94 read=30

-> Index Only Scan using idx_t_1 on public.t (cost=0.42..16094.75 rows=333333 width=0) (actual time=0.146..6.010 rows=9999 loops=1)

Output: id

Index Cond: (t.id < 10000)

Heap Fetches: 9999

Buffers: shared hit=94 read=30

digoal=# explain (analyze,verbose,costs,buffers,timing) select count(*) from generate_series(1,1000000);

QUERY PLAN
Aggregate (cost=12.50..12.51 rows=1 width=0) (actual time=918.138..918.138 rows=1 loops=1)

Output: count(*)

Buffers: temp read=1710 written=1709

-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=320.753..684.423 rows=1000000 loops=1)

Output: generate_series

Function Call: generate_series(1, 1000000)

Buffers: temp read=1710 written=1709

I/O statistics reported by stap for each query (part of the I/O is absorbed by the shared buffers and the OS cache):

query: drop table t;

cache

cache

query: create table t(id int, info text, crt_time timestamp);

cache

cache

query: insert into t select generate_series(1,1000000), md5(random()::text), clock_timestamp();

cache

-R-devname:sdc1, reqs:4, reqKbytes:32, reqs/s:76923, reqKbytes/s:615384

-W-devname:sdc1, reqs:4, reqKbytes:32, reqs/s:37735, reqKbytes/s:301886

cache

query: create index idx_t_1 on t(id);

cache

-R-devname:sdc1, reqs:2, reqKbytes:16, reqs/s:105263, reqKbytes/s:842105

-W-devname:sdc1, reqs:2, reqKbytes:16, reqs/s:51282, reqKbytes/s:410256

cache

query: explain (analyze,verbose,costs,buffers,timing) select count(*) from t;

cache

-R-devname:sdc1, reqs:3, reqKbytes:24, reqs/s:90909, reqKbytes/s:727272

-W-devname:sdc1, reqs:3, reqKbytes:24, reqs/s:46153, reqKbytes/s:369230

cache

query: explain (analyze,verbose,costs,buffers,timing) select count(*) from t where id<10000;

cache

-R-devname:sdc1, reqs:32, reqKbytes:256, reqs/s:101265, reqKbytes/s:810126

-W-devname:sdc1, reqs:32, reqKbytes:256, reqs/s:64128, reqKbytes/s:513026

cache

query: explain (analyze,verbose,costs,buffers,timing) select count(*) from generate_series(1,1000000);

cache

-R-devname:sdc1, reqs:1711, reqKbytes:13687, reqs/s:104969, reqKbytes/s:839693

-W-devname:sdc1, reqs:1711, reqKbytes:13687, reqs/s:69103, reqKbytes/s:552786

cache

^C----------END----------

cache

-R-devname:sdc1, reqs:1836, reqKbytes:14238, reqs/s:106477, reqKbytes/s:825726

-W-devname:sdc1, reqs:1836, reqKbytes:14238, reqs/s:69445, reqKbytes/s:538543

cache

-W-devname:N/A, reqs:8, reqKbytes:0, reqs/s:5242, reqKbytes/s:0

-R-devname:N/A, reqs:7, reqKbytes:0, reqs/s:20771, reqKbytes/s:0
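To compare statements it can help to turn the trace lines into numbers. The following is a minimal Python sketch that parses one `-R-`/`-W-` line and derives the average request size; the field layout is taken from the output above, and the function names are hypothetical.

```python
import re

# Field layout as it appears in the trace output above.
LINE_RE = re.compile(
    r"-(?P<dir>[RW])-devname:(?P<dev>[^,]+), reqs:(?P<reqs>\d+), "
    r"reqKbytes:(?P<kb>\d+), reqs/s:(?P<rps>\d+), reqKbytes/s:(?P<kbps>\d+)"
)


def parse_io_line(line: str) -> dict:
    """Parse one -R-/-W- line; raise ValueError if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    rec = {"dir": m["dir"], "dev": m["dev"]}
    for key in ("reqs", "kb", "rps", "kbps"):
        rec[key] = int(m[key])
    return rec


def avg_request_kb(rec: dict) -> float:
    """Average size of one request in KB (0.0 when there were no requests)."""
    return rec["kb"] / rec["reqs"] if rec["reqs"] else 0.0


# Read line reported for the generate_series query above.
sample = "-R-devname:sdc1, reqs:1711, reqKbytes:13687, reqs/s:104969, reqKbytes/s:839693"
rec = parse_io_line(sample)
print(rec["dir"], rec["dev"], round(avg_request_kb(rec), 2))
```

For the sample line the average is very close to 8 KB per request, which is also what the smaller entries show (32 KB in 4 requests, 256 KB in 32 requests).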

Trace PostgreSQL netflow per session or per sql

http://blog.163.com/digoal@126/blog/static/16387704020139153195701/

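Trace the network packets and traffic volume of each SQL statement, to analyse PostgreSQL network usage.

The stp script is in the blog post.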

Server: 172.16.3.150 (port 1921); client: 172.16.3.39 (db-172-16-3-39).

digoal=# drop table t;

digoal=# create table t(id int, info text, crt_time timestamp);

digoal=# insert into t select generate_series(1,1000000), md5(random()::text), clock_timestamp();

digoal=# \dt+

                    List of relations
 Schema | Name | Type  |  Owner   | Size  | Description
--------+------+-------+----------+-------+-------------
 public | t    | table | postgres | 73 MB |

pg93@db-172-16-3-39-> psql -h 172.16.3.150 -p 1921 -c "copy t to stdout"|psql -h 172.16.3.150 -p 1921 -c "copy t from stdin"

pg93@db-172-16-3-39-> psql -h 172.16.3.150 -p 1921 -c "copy t to stdout"|psql -h 172.16.3.150 -p 1921 -c "copy t from stdin"

stap output:

query: drop table t;

-R-from:172.16.3.39:37919-to:172.16.3.150:1921, pkgs:2, Kbytes:0, pkgs/s:5141, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37919, pkgs:2, Kbytes:0, pkgs/s:60606, Kbytes/s:0

query: create table t(id int, info text, crt_time timestamp);

-R-from:172.16.3.39:37919-to:172.16.3.150:1921, pkgs:2, Kbytes:0, pkgs/s:5141, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37919, pkgs:2, Kbytes:0, pkgs/s:60606, Kbytes/s:0

query: insert into t select generate_series(1,1000000), md5(random()::text), clock_timestamp();

-R-from:172.16.3.39:37919-to:172.16.3.150:1921, pkgs:2, Kbytes:0, pkgs/s:5141, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37919, pkgs:3, Kbytes:0, pkgs/s:53571, Kbytes/s:0

query: SELECT n.nspname as "Schema",

c.relname as "Name",
CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized
view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f'
THEN 'foreign table' END as "Type",

pg_catalog.pg_get_userbyid(c.relowner) as "Owner",

pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid)) as "Size",

pg_catalog.obj_description(c.oid, 'pg_class') as "Description"

FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_na

-R-from:172.16.3.39:37920-to:172.16.3.150:1921, pkgs:1, Kbytes:0, pkgs/s:83333, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37920, pkgs:1, Kbytes:0, pkgs/s:83333, Kbytes/s:0

query: copy t to stdout

-S-from:172.16.3.150:1921-to:172.16.3.39:37921, pkgs:8761, Kbytes:70088, pkgs/s:7313, Kbytes/s:58505

query: copy t from stdin

-R-from:172.16.3.39:37922-to:172.16.3.150:1921, pkgs:8765, Kbytes:70094, pkgs/s:72554, Kbytes/s:580224

-S-from:172.16.3.150:1921-to:172.16.3.39:37922, pkgs:2, Kbytes:0, pkgs/s:7662, Kbytes/s:0

query: copy t to stdout

-S-from:172.16.3.150:1921-to:172.16.3.39:37933, pkgs:17523, Kbytes:140184, pkgs/s:4478, Kbytes/s:35829

query: copy t from stdin

-R-from:172.16.3.39:37932-to:172.16.3.150:1921, pkgs:8636, Kbytes:69078, pkgs/s:64990, Kbytes/s:519848

-S-from:172.16.3.150:1921-to:172.16.3.39:37932, pkgs:1, Kbytes:0, pkgs/s:38461, Kbytes/s:0

^C----------END----------

-S-from:172.16.3.150:1921-to:172.16.3.39:37933, pkgs:17527, Kbytes:140190, pkgs/s:4479, Kbytes/s:35831

-R-from:172.16.3.39:37932-to:172.16.3.150:1921, pkgs:17530, Kbytes:140190, pkgs/s:55968, Kbytes/s:447589

-S-from:172.16.3.150:1921-to:172.16.3.39:37921, pkgs:8765, Kbytes:70095, pkgs/s:7316, Kbytes/s:58509

-R-from:172.16.3.39:37922-to:172.16.3.150:1921, pkgs:8769, Kbytes:70095, pkgs/s:71634, Kbytes/s:572610

-R-from:172.16.3.39:37919-to:172.16.3.150:1921, pkgs:9, Kbytes:0, pkgs/s:0, Kbytes/s:0

-R-from:172.16.3.39:37920-to:172.16.3.150:1921, pkgs:5, Kbytes:0, pkgs/s:0, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37919, pkgs:9, Kbytes:0, pkgs/s:56603, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37920, pkgs:4, Kbytes:0, pkgs/s:70175, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37922, pkgs:5, Kbytes:0, pkgs/s:16722, Kbytes/s:0

-S-from:172.16.3.150:1921-to:172.16.3.39:37932, pkgs:5, Kbytes:0, pkgs/s:16501, Kbytes/s:0

-R-from:172.16.3.39:37921-to:172.16.3.150:1921, pkgs:5, Kbytes:0, pkgs/s:87, Kbytes/s:0

-R-from:172.16.3.39:37933-to:172.16.3.150:1921, pkgs:5, Kbytes:0, pkgs/s:61, Kbytes/s:0
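The per-connection lines can be summed by direction to see how much the server sent and received. This is a minimal Python sketch, not the author's tooling: the regular expression follows the line layout shown above, the sample lines are taken from the first COPY pair, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Line layout as printed by the trace above: -R- is traffic received by the
# server, -S- is traffic sent by the server.
FLOW_RE = re.compile(
    r"-(?P<dir>[RS])-from:(?P<src>[\d.]+:\d+)-to:(?P<dst>[\d.]+:\d+), "
    r"pkgs:(?P<pkgs>\d+), Kbytes:(?P<kb>\d+), pkgs/s:\d+, Kbytes/s:\d+"
)

SAMPLE = """\
-S-from:172.16.3.150:1921-to:172.16.3.39:37921, pkgs:8761, Kbytes:70088, pkgs/s:7313, Kbytes/s:58505
-R-from:172.16.3.39:37922-to:172.16.3.150:1921, pkgs:8765, Kbytes:70094, pkgs/s:72554, Kbytes/s:580224
-S-from:172.16.3.150:1921-to:172.16.3.39:37922, pkgs:2, Kbytes:0, pkgs/s:7662, Kbytes/s:0
"""


def totals_by_direction(text: str) -> dict:
    """Sum packets and KB per direction over all matching lines in `text`."""
    totals = defaultdict(lambda: {"pkgs": 0, "kb": 0})
    for m in FLOW_RE.finditer(text):
        entry = totals[m["dir"]]
        entry["pkgs"] += int(m["pkgs"])
        entry["kb"] += int(m["kb"])
    return dict(totals)


totals = totals_by_direction(SAMPLE)
for direction, t in sorted(totals.items()):
    kb_per_pkg = t["kb"] / t["pkgs"] if t["pkgs"] else 0.0
    print(f"{direction}: {t['pkgs']} pkgs, {t['kb']} KB, {kb_per_pkg:.2f} KB/pkg")
```

In both directions the bulk transfer averages close to 8 KB per counted packet.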

Redirecting the COPY output to /dev/null on the client:

psql -h 172.16.3.150 -p 1921 -c "copy t to stdout" >/dev/null 2>&1

The server then sends at about 433 MB/s:

query: copy t to stdout

-S-from:172.16.3.150:1921-to:172.16.3.39:37961, pkgs:70096, Kbytes:560752, pkgs/s:54195, Kbytes/s:433552

-R-from:172.16.3.39:37961-to:172.16.3.150:1921, pkgs:2, Kbytes:0, pkgs/s:4889, Kbytes/s:0

Trace PostgreSQL instruction or block of instructions per sql or per session

http://blog.163.com/digoal@126/blog/static/16387704020139153455311/

Count the instructions executed per SQL statement.

Specify the target process (stap -x PID) when using the insn probe.

digoal=# select pg_backend_pid();

pg_backend_pid

----------------

31531

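stap -e '
global var1%[60000], var2%[60000]

# Fires for every single-stepped instruction of the postgres binary.
# var1 counts per query (reset below), var2 counts per process for the whole run.
probe process("/home/pg93/pgsql9.3.1/bin/postgres").insn {
  var1[pid()]++
  var2[pid()]++
}

# Reset the per-query counter when a query starts.
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start") {
  delete var1[pid()]
}

# Print the query text and its instruction count when the query finishes.
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__done") {
  printf("query:%s, insn:%d\n", user_string($arg1), var1[pid()])
  delete var1[pid()]
}

# On exit, print the per-process totals in descending order.
probe end {
  foreach(x in var2-)
    printf("pid:%d, insn:%d\n", x, var2[x])
  delete var1
  delete var2
}
' -x 31531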

query:select count(*) from pg_class;, insn:488264

query:select count(*) from pg_class;, insn:488252

query:select count(*) from pg_class;, insn:488252

PostgreSQL bulk COPY load Bottleneck by extend lock waiting

http://blog.163.com/digoal@126/blog/static/163877040201392641033482/

digoal=# copy t to '/home/pg93/t.dmp' with (header off);

COPY 0 610000

digoal=# \! ls -lh /home/pg93/t.dmp

-rw-r--r-- 1 pg93 pg93 250M Oct 26 15:07 /home/pg93/t.dmp

Load the file concurrently with pgbench:

pg93@db-172-16-3-150-> vi test.sql

copy t from '/home/pg93/t.dmp' with (header off);

8 connections, 4 threads, 4 transactions per connection:

pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 8 -j 4 -t 4

tps = 0.365815 (including connections establishing)

tps = 0.365846 (excluding connections establishing)

statement latencies in milliseconds:

21834.036437 copy t from '/home/pg93/t.dmp' with (header off);


About 91 MB/s, 223,000 rows per second.
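The throughput figure follows from the pgbench result, since each transaction runs one COPY of the whole dump file. A minimal Python sketch of that arithmetic, assuming the file size is the rounded 250M shown by ls -lh and that the dump holds 610000 rows; the function name is hypothetical.

```python
# Figures copied from the run above.
TPS = 0.365815          # pgbench tps (including connections establishing)
FILE_MB = 250           # size of /home/pg93/t.dmp as shown by ls -lh (rounded)
ROWS_PER_COPY = 610000  # rows loaded by one COPY of the dump file


def copy_throughput(tps: float, file_mb: float, rows: int) -> tuple:
    """Return (MB per second, rows per second) when every transaction
    loads the whole file once."""
    return tps * file_mb, tps * rows


mb_s, rows_s = copy_throughput(TPS, FILE_MB, ROWS_PER_COPY)
print(f"{mb_s:.1f} MB/s, {rows_s:,.0f} rows/s")
```

The same function can be applied to the tps of the later runs to compare them on equal terms.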

Using an unlogged table:

digoal=# update pg_class set relpersistence='u' where relname='t';

digoal=# update pg_class set relpersistence='u' where relname='idx_t_id';

tps = 0.423626 (including connections establishing)

tps = 0.423670 (excluding connections establishing)

statement latencies in milliseconds:

18879.816156 copy t from '/home/pg93/t.dmp' with (header off);

About 106 MB/s, 258,000 rows per second.

The process list shows COPY backends in the waiting state:

21551 pg93 20 0 2271m 1.9g 1.9g S 24.1 2.0 0:12.54 postgres: postgres digoal [local] COPY waiting

21553 pg93 20 0 2271m 1.9g 1.9g S 24.1 2.0 0:12.51 postgres: postgres digoal [local] COPY waiting

Query pg_locks to see which lock the backends are waiting for.

digoal=# select * from pg_locks where not granted;

 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |     mode      | granted | fastpath
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------+---------+----------
 extend   |    16384 |    65765 |      |       |            |               |         |       |          | 6/30               | 21596 | ExclusiveLock | f       | f
 extend   |    16384 |    65765 |      |       |            |               |         |       |          | 9/30               | 21599 | ExclusiveLock | f       | f
 extend   |    16384 |    65765 |      |       |            |               |         |       |          | 8/30               | 2159

Use stap to trace the lock waits.

pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -t 1

tps = 0.261183 (including connections establishing)

tps = 0.261247 (excluding connections establishing)

statement latencies in milliseconds:

61234.802062 copy t from '/home/pg93/t.dmp' with (header off);

Each COPY takes about 61 seconds.

About 47.5 seconds of that is lock wait time.

stap -e '
global arr1%[12000], arr2%[12000]

# Record the start time of each lock wait, keyed by pid and the six probe arguments.
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("lock__wait__start") {
  arr1[pid(),$arg1,$arg2,$arg3,$arg4,$arg5,$arg6] = gettimeofday_us()
}

# When the wait ends, add the elapsed microseconds to the statistics for the same key.
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("lock__wait__done") {
  t = gettimeofday_us()
  pid = pid()
  lv = arr1[pid,$arg1,$arg2,$arg3,$arg4,$arg5,$arg6]
  if ( lv )
    arr2[pid,$arg1,$arg2,$arg3,$arg4,$arg5,$arg6] <<< t - lv
}

# Every second, print the wait count and total wait time for each key.
probe timer.s(1) {
  println("-----")
  foreach([a,b,c,d,e,f,g] in arr1)
    printf("pid: %d; obj: %d, %d, %d, %d; objtype:%d, locktype: %d, waitcnt:%d, waitus:%d\n", a, b, c, d, e, f, g, @count(arr2[a,b,c,d,e,f,g]), @sum(arr2[a,b,c,d,e,f,g]))
}'
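The script pairs each lock__wait__start with the matching lock__wait__done and accumulates the elapsed time per key. The Python sketch below models that bookkeeping only, with made-up timestamps. It is an illustration, not a replacement for the probes, and unlike the script it removes the start entry once the wait is done. The class and method names are hypothetical.

```python
from collections import defaultdict


class LockWaitStats:
    """Pair wait-start and wait-done events and accumulate wait time per key."""

    def __init__(self):
        self.started = {}                    # key -> start timestamp (us)
        self.count = defaultdict(int)        # key -> number of completed waits
        self.total_us = defaultdict(int)     # key -> summed wait time (us)

    def wait_start(self, key, now_us):
        self.started[key] = now_us

    def wait_done(self, key, now_us):
        start = self.started.pop(key, None)
        if start is None:
            return  # done event without a recorded start: ignore it
        self.count[key] += 1
        self.total_us[key] += now_us - start


# Key layout mirrors the script: pid followed by the six probe arguments.
key = (21549, 16384, 65765, 0, 0, 1, 7)

stats = LockWaitStats()
stats.wait_start(key, 1_000)   # illustrative timestamps in microseconds
stats.wait_done(key, 2_400)
stats.wait_start(key, 5_000)
stats.wait_done(key, 6_000)
stats.wait_done(key, 9_000)    # unmatched done event, ignored

print(stats.count[key], stats.total_us[key])
```

Keying on the pid plus all probe arguments keeps waits on different locks, and by different backends, in separate buckets.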

pid: 21549; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33978, waitus:47513070

pid: 21552; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33969, waitus:47575317

pid: 21553; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33967, waitus:47946887

pid: 21551; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33981, waitus:48020588

pid: 21556; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33971, waitus:47723644

pid: 21554; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33984, waitus:47569972

pid: 21558; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33973, waitus:47941600

pid: 21560; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33966, waitus:47576161

pid: 21562; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33967, waitus:47137802

pid: 21550; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33981, waitus:47695491

pid: 21548; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33971, waitus:47685924

pid: 21555; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33972, waitus:47581269

pid: 21561; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33981, waitus:47614633

pid: 21557; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33971, waitus:47644666

pid: 21559; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33986, waitus:47157181

pid: 21563; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33977, waitus:47592679
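To relate these totals to the pgbench result, the wait time per backend can be set against the 61-second statement latency. This is a minimal Python sketch over three of the lines above; the regular expression is written for this output format and the variable names are hypothetical.

```python
import re

# Average statement latency reported by pgbench for the 16-connection run.
LATENCY_MS = 61234.802062

# Three of the lines printed by the stap script above.
SAMPLE = """\
pid: 21549; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33978, waitus:47513070
pid: 21552; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33969, waitus:47575317
pid: 21553; obj: 16384, 65765, 0, 0; objtype:1, locktype: 7, waitcnt:33967, waitus:47946887
"""

ROW_RE = re.compile(r"pid: (\d+);.*waitcnt:(\d+), waitus:(\d+)")

# (pid, wait count, total wait in microseconds) per line.
rows = [tuple(int(x) for x in m.groups()) for m in ROW_RE.finditer(SAMPLE)]

for pid, waitcnt, waitus in rows:
    wait_s = waitus / 1e6
    share = wait_s / (LATENCY_MS / 1e3)   # fraction of the COPY spent waiting
    per_wait_us = waitus / waitcnt        # average duration of one wait
    print(f"pid {pid}: {wait_s:.1f}s waiting, {share:.0%} of the COPY, "
          f"{per_wait_us:.0f}us per wait")
```

For these lines roughly 78% of each COPY is spent waiting on the lock, with an average wait of about 1.4 ms.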

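PostgreSQL extends the main fork of the table 1 block (8 KB) at a time, so the number of waits corresponds to roughly this many MB: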

digoal=# select 33978*8/1024;

?column?

----------

265

(1 row)
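The same arithmetic can be checked outside the database. A minimal Python sketch, assuming (as the calculation above does) that each wait corresponds to one 8 KB extension; the comparison with the 250M dump file is only approximate because that size is rounded.

```python
BLOCK_KB = 8       # assumption: default 8 KB block size
WAITS = 33978      # waitcnt reported for one backend
FILE_MB = 250      # dump file size from ls -lh (rounded)

# SQL integer division truncates, so 33978*8/1024 yields 265.
extended_mb_int = WAITS * BLOCK_KB // 1024
extended_mb = WAITS * BLOCK_KB / 1024

# Blocks needed to hold the raw file contents, ignoring page and row overhead.
blocks_for_file = FILE_MB * 1024 // BLOCK_KB

print(extended_mb_int, round(extended_mb, 2), blocks_for_file)
```

The wait count is slightly higher than the number of blocks the raw file would fill, which is plausible since table pages also carry header and per-row overhead.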

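To avoid the extension, load the data, delete it, and vacuum, so that the main fork keeps its allocated size and the free space can be reused. Keeping the one row on the last page prevents vacuum from truncating the file.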

digoal=# select 2168900*8/1024/1024||'GB';

?column?

----------

16GB

(1 row)

digoal=# select max(ctid) from t;

max

--------------

(2168900,16)

(1 row)

digoal=# delete from t where ctid<>'(2168900,16)';

digoal=# vacuum (freeze,verbose,analyze) t;

INFO: vacuuming "postgres.t"

INFO: scanned index "idx_t_id" to remove 39039999 row versions

DETAIL: CPU 0.41s/23.03u sec elapsed 24.84 sec.

INFO: "t": removed 39039999 row versions in 2168901 pages

DETAIL: CPU 15.37s/7.40u sec elapsed 180.92 sec.

INFO: index "idx_t_id" now contains 1 row versions in 135228 pages

DETAIL: 39039999 index row versions were removed.

134262 index pages have been deleted, 0 are currently reusable.

CPU 0.00s/0.00u sec elapsed 0.00 sec.

INFO: "t": found 39039999 removable, 1 nonremovable row versions in 2168901 out of 2168901 pages

DETAIL: 0 dead row versions cannot be removed yet.

There were 0 unused item pointers.

0 pages are entirely empty.

CPU 31.58s/41.81u sec elapsed 337.35 sec.

INFO: vacuuming "pg_toast.pg_toast_65765"

INFO: index "pg_toast_65765_index" now contains 0 row versions in 1 pages

DETAIL: 0 index row versions were removed.

0 index pages have been deleted, 0 are currently reusable.

CPU 0.00s/0.00u sec elapsed 0.00 sec.

INFO: "pg_toast_65765": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages

DETAIL: 0 dead row versions cannot be removed yet.

digoal=# checkpoint;

CHECKPOINT

digoal=# \dt+ t

List of relations
Schema | Name | Type | Owner | Size | Description

----------+------+-------+----------+-------+-------------

postgres | t | table | postgres | 17 GB |

(1 row)

pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -t 4

tps = 0.750399 (including connections establishing)

tps = 0.750523 (excluding connections establishing)

statement latencies in milliseconds:

20898.632172 copy t from '/home/pg93/t.dmp' with (header off);

With no main fork extension needed for the table (the index main fork is still extended):

about 188 MB/s, 457,700 rows per second.
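The two 16-connection runs can be compared directly. A minimal Python sketch using only figures reported above; the dictionary keys are labels chosen for this illustration, and the 250 MB file size is the rounded value from ls -lh.

```python
FILE_MB = 250  # dump file size from ls -lh (rounded)

# pgbench results for the two runs with -c 16.
runs = {
    "extending": {"tps": 0.261183, "latency_ms": 61234.802062},
    "pre-extended": {"tps": 0.750399, "latency_ms": 20898.632172},
}

for name, r in runs.items():
    print(f"{name}: {r['tps'] * FILE_MB:.1f} MB/s, {r['latency_ms'] / 1e3:.1f}s per COPY")

speedup = runs["pre-extended"]["tps"] / runs["extending"]["tps"]
saved_s = (runs["extending"]["latency_ms"] - runs["pre-extended"]["latency_ms"]) / 1e3
print(f"speedup: {speedup:.2f}x, {saved_s:.1f}s saved per COPY")
```

The time saved per COPY is of the same order as the lock wait time measured by stap, which supports the extend lock being the main bottleneck in the first run.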

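A larger blocksize also reduces the number of extend lock acquisitions.

At 32 (presumably the 32 KB block size), the result is about 1097.3 MB/s, 2.676 million rows per second (455).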
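Why a larger block size helps can be seen from the number of extensions needed for the same data. This is a rough Python sketch under explicit assumptions: one extension per block, a 250 MB load, and "32" read as a 32 KB block size. Page and row overhead are ignored, and the function name is hypothetical.

```python
def blocks_needed(data_mb: int, block_kb: int) -> int:
    """Blocks (and therefore extensions) needed to hold data_mb of data,
    rounded up, ignoring page header and row overhead."""
    data_kb = data_mb * 1024
    return -(-data_kb // block_kb)  # ceiling division


DATA_MB = 250  # assumption: one load of the dump file

for block_kb in (8, 16, 32):
    print(f"{block_kb} KB blocks: {blocks_needed(DATA_MB, block_kb)} extensions")
```

Under these assumptions a 32 KB block size needs a quarter of the extensions of the default 8 KB, so the extend lock is taken a quarter as often for the same volume of data.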

http://blog.163.com/digoal@126/blog/static/163877040201392641033482/

https://sourceware.org/systemtap/examples/

https://sourceware.org/systemtap/tapsets/

https://sourceware.org/systemtap/langref/

https://sourceware.org/systemtap/man/stap.1.html

https://sourceware.org/systemtap/man/index.html

http://blog.163.com/digoal@126/blog/ Systemtap

https://sourceware.org/systemtap/SystemTap_Beginners_Guide/

https://wiki.postgresql.org/wiki/Performance_Analysis_Tools

https://sourceware.org/systemtap/SystemTap_Beginners_Guide/useful-systemtap-scripts.html

About ME

name : digoal.zhou

Corp. : SKYMOBI

QQ : 276732431

BLOG : http://blog.163.com/digoal@126

EMAIL : digoal@126.com

2013 PostgreSQL China Conference


SKYMOBI, Hangzhou, Zhejiang
