SQL Tuning Crash Course for Rails Engineers

SQL Tuning Crash Course for Rails Engineers @ Wantedly
2015-12-10
Nao Minami (@south37)

What we'll cover today
• 1. What happens inside the RDB when SQL is executed
• 2. How to read EXPLAIN output and how to add appropriate indexes
• 3. What to watch out for when tuning

Setup
$ git clone https://github.com/south37/sql-tuning
$ cd sql-tuning
$ git checkout sql-tuning
$ bin/rake db:create
$ pg_restore -j 4 --verbose --no-acl --no-owner -d sql-tuning-dev db.dump

ActiveRecord::Relation#explain
$ Job.joins(:company).group('companies.country').where('companies.id < 1000')
    .select('companies.country', 'COUNT(jobs.id)').explain
=> EXPLAIN for: SELECT companies.country, COUNT(jobs.id) FROM "jobs" INNER JOIN "companies" ON "companies"."id" = "jobs"."company_id" WHERE (companies.id < 1000) GROUP BY companies.country
QUERY PLAN
-------------------------------------------------------------------------------------------------------
HashAggregate (cost=1213.79..1220.12 rows=634 width=16)
  -> Hash Join (cost=54.28..1188.79 rows=5000 width=16)
       Hash Cond: (jobs.company_id = companies.id)
       -> Seq Scan on jobs (cost=0.00..897.00 rows=50000 width=8)
       -> Hash (cost=41.78..41.78 rows=1000 width=16)
            -> Index Scan using companies_pkey on companies (cost=0.29..41.78 rows=1000 width=16)
                 Index Cond: (id < 1000)

How to read EXPLAIN: tree structure

The execution plan is a tree structure
HashAggregate
  Hash Join
    Seq Scan
    Hash
      Index Scan

How to read the cost
Seq Scan on jobs (cost=0.00..897.00 rows=50000 width=8)
Index Scan using companies_pkey on companies (cost=0.29..41.78 rows=1000 width=16)
cost=<startup cost>..<total cost> rows=<number of rows returned> width=<size of one row in bytes>
Total cost = startup cost + (rows scanned × cost of fetching one row)
Using an index incurs a startup cost (0.29 above, versus 0.00 for the Seq Scan)
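
As a rough worked example of the formula above, the sketch below recomputes the two Seq Scan costs on jobs that appear in this deck (897.00 here, and 1022.00 for the filtered scan shown later) from PostgreSQL's default planner constants. The page count of 397 is an assumption inferred from those numbers, not something the slides state, and `seq_scan_total_cost` is a hypothetical helper name.

```ruby
# PostgreSQL default planner constants.
SEQ_PAGE_COST     = 1.0
CPU_TUPLE_COST    = 0.01
CPU_OPERATOR_COST = 0.0025

# Hypothetical helper: estimated total cost of a sequential scan.
# `pages` is an assumption (397 is inferred so the results match the slides).
def seq_scan_total_cost(pages:, rows:, filter_operators: 0)
  io_cost  = pages * SEQ_PAGE_COST
  cpu_cost = rows * (CPU_TUPLE_COST + filter_operators * CPU_OPERATOR_COST)
  (io_cost + cpu_cost).round(2)
end

# Plain scan of jobs, then the same scan with one filter condition per row.
puts seq_scan_total_cost(pages: 397, rows: 50_000)
puts seq_scan_total_cost(pages: 397, rows: 50_000, filter_operators: 1)
```

The startup cost is 0.00 in both cases because a sequential scan can return its first row immediately.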

Adding ANALYSE actually executes the query
$ ActiveRecord::Base.connection.execute("EXPLAIN ANALYSE #{Job.joins(:company).group('companies.country').where('companies.id < 1000').select('companies.country', 'COUNT(jobs.id)').to_sql}").each { |row| print row['QUERY PLAN'] + "\n" }
HashAggregate (cost=1213.79..1220.12 rows=634 width=16) (actual time=20.290..20.465 rows=950 loops=1)
  -> Hash Join (cost=54.28..1188.79 rows=5000 width=16) (actual time=1.018..18.102 rows=4983 loops=1)
       -> Seq Scan on jobs (cost=0.00..897.00 rows=50000 width=8) (actual time=0.009..6.352 rows=50000 loops=1)
       -> Hash (cost=41.78..41.78 rows=1000 width=16) (actual time=0.995..0.995 rows=999 loops=1)
            Buckets: 1024 Batches: 1 Memory Usage: 51kB
            -> Index Scan using companies_pkey on companies (cost=0.29..41.78 rows=1000 width=16) (actual time=0.022..0.527 rows=999 loops=1)

How to read EXPLAIN
For more details, see:
http://www.postgresql.org/docs/current/static/sql-explain.html

The first step is fetching the data
HashAggregate
  Hash Join
    Seq Scan
    Hash
      Index Scan

Understanding indexes
HashAggregate
  Hash Join
    Seq Scan
    Hash
      Index Scan

How an index works
B-tree index
• Several hundred elements per node
• With 300 elements per node, 3 levels hold 27 million entries
Fast data retrieval
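
The 27 million figure is simply 300 × 300 × 300. A minimal sketch of that arithmetic, assuming every node is completely full (a simplification of a real B-tree; the method names are hypothetical):

```ruby
# Maximum number of keys reachable through `levels` levels when every node
# holds `fanout` entries (a simplification of a real B-tree).
def btree_capacity(fanout, levels)
  fanout**levels
end

# Smallest number of levels whose capacity covers `rows` entries.
def levels_needed(rows, fanout)
  levels = 1
  levels += 1 while btree_capacity(fanout, levels) < rows
  levels
end

puts btree_capacity(300, 3)          # 300 * 300 * 300
puts levels_needed(50_000, 300)      # e.g. a table of 50,000 rows
puts levels_needed(27_000_000, 300)
```

Because the depth grows only logarithmically with the row count, a lookup touches just a handful of nodes even for very large tables.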

Using an index
With an index:
$ Job.where(id: 1).explain
=> EXPLAIN for: SELECT "jobs".* FROM "jobs" WHERE "jobs"."id" = $1 [["id", 1]]
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using jobs_pkey on jobs (cost=0.29..8.31 rows=1 width=28)
  Index Cond: (id = 1)
Without an index (Seq Scan):
$ Job.where(id_without_index: 1).explain
=> EXPLAIN for: SELECT "jobs".* FROM "jobs" WHERE "jobs"."id_without_index" = $1 [["id_without_index", 1]]
QUERY PLAN
--------------------------------------------------------
Seq Scan on jobs (cost=0.00..1022.00 rows=1 width=28)
  Filter: (id_without_index = 1)

Index bad pattern 1
"Applying an operation to an indexed column"
$ Profile.where('lower(email) = ?', 'minami@wantedly.com').limit(1).explain
=> EXPLAIN for: SELECT "profiles".* FROM "profiles" WHERE (lower(email) = 'minami@wantedly.com') LIMIT 1
QUERY PLAN
------------------------------------------------------------------
Limit (cost=0.00..5.08 rows=1 width=54)
  -> Seq Scan on profiles (cost=0.00..254.00 rows=50 width=54)
       Filter: (lower(email) = 'minami@wantedly.com'::text)
An index is kept sorted by comparing key values, so it cannot be used once an operation is applied to the column
Fix: rewrite the query, or use Indexes on Expressions
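
To see why the sort order matters, here is a toy model in which a sorted Ruby array stands in for the B-tree index on email. The sample addresses and helper names are made up for illustration; this is not how PostgreSQL stores an index.

```ruby
# A sorted array stands in for a B-tree index on profiles.email (toy model).
EMAIL_INDEX = ["Bob@example.com", "alice@example.com", "carol@example.com"].sort

# Binary search only works because the keys are sorted by the stored value.
def index_lookup(sorted_keys, target)
  sorted_keys.bsearch { |key| target <=> key }
end

# lower(email) = ? : the order of the raw keys says nothing about the
# lowercased values, so every entry has to be checked (a sequential scan).
def scan_by_lower(keys, target)
  keys.find { |key| key.downcase == target }
end

# An index on the expression: the keys are the lowercased values themselves.
LOWER_EMAIL_INDEX = EMAIL_INDEX.map(&:downcase).sort

p index_lookup(EMAIL_INDEX, "alice@example.com")      # found by binary search
p index_lookup(EMAIL_INDEX, "bob@example.com")        # not found: raw keys are ordered differently
p scan_by_lower(EMAIL_INDEX, "bob@example.com")       # found, but only by checking each key
p index_lookup(LOWER_EMAIL_INDEX, "bob@example.com")  # found by binary search again
```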

Index bad pattern 2
"A WHERE clause that filters only loosely"
$ Profile.where(gender: 'female').explain
=> EXPLAIN for: SELECT "profiles".* FROM "profiles" WHERE "profiles"."gender" = $1 [["gender", "female"]]
QUERY PLAN
--------------------------------------------------------------
Seq Scan on profiles (cost=0.00..229.00 rows=5038 width=54)
  Filter: (gender = 'female'::text)
(Figure: distribution of profiles.gender, male vs. female)
With the default settings, the condition has to narrow the rows down to about a quarter or fewer before the index is used

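Why isn't the index used when the filter condition is loose?

The cause is the speed difference between random access and sequential access on an HDD
(Figure: a Seq Scan reads blocks 1, 2, 3, 4 in order; an Index Scan (Random Access) visits them in scattered order)
Fetching one element at a time is expensive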
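
The "about a quarter" rule of thumb can be illustrated with a deliberately crude model built on two PostgreSQL defaults, seq_page_cost = 1.0 and random_page_cost = 4.0. The real planner considers much more (CPU costs, caching, correlation), so treat this as a sketch of the intuition only; `cheaper_scan` is a hypothetical helper.

```ruby
SEQ_PAGE_COST    = 1.0  # PostgreSQL default
RANDOM_PAGE_COST = 4.0  # PostgreSQL default

# Crude model: a Seq Scan reads every page sequentially, while an Index Scan
# reads only the matching fraction of pages, each one by random access.
def cheaper_scan(total_pages, selectivity)
  seq_cost   = total_pages * SEQ_PAGE_COST
  index_cost = total_pages * selectivity * RANDOM_PAGE_COST
  index_cost < seq_cost ? :index_scan : :seq_scan
end

[0.5, 0.25, 0.1].each do |selectivity|
  puts "#{selectivity}: #{cheaper_scan(1_000, selectivity)}"
end
```

In this model the index only wins once fewer than a quarter of the pages have to be read, because each random read costs four times as much as a sequential one.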

It's fine if the condition really narrows things down
$ BoxerProfile.where(gender: 'female').explain
=> EXPLAIN for: SELECT "boxer_profiles".* FROM "boxer_profiles" WHERE "boxer_profiles"."gender" = $1 [["gender", "female"]]
QUERY PLAN
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on boxer_profiles (cost=28.08..114.66 rows=1006 width=25)
  Recheck Cond: (gender = 'female'::text)
  -> Bitmap Index Scan on index_boxer_profiles_on_gender (cost=0.00..27.83 rows=1006 width=0)
       Index Cond: (gender = 'female'::text)
(Figure: distribution of profiles.gender, male vs. female)
The data distribution, i.e. the "statistics", is what matters

Aside: data layout inside PostgreSQL
For details, see:
http://www.postgresql.org/docs/current/static/storage.html
or the book 「内部構造から学ぶPostgreSQL 設計・運用計画の鉄則」

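Disadvantages of indexes
• 1. Updates take longer
• 2. HOT does not apply

1. Updates take longer
The B-tree index has to be updated as well

2. HOT does not apply
HOT (Heap-Only Tuples) is a PostgreSQL mechanism that speeds up updates
(only the necessary parts are updated); it cannot be used when an updated column is indexed
For details, see:
http://lets.postgresql.jp/documents/tutorial/hot_1/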

Various kinds of indexes
• 1. Unique Indexes
• 2. Multicolumn Indexes
• 3. Indexes on Expressions
• 4. Partial Indexes

2. Multicolumn Indexes
An index over multiple columns
create_table "tourist_spots", force: :cascade do |t|
  t.text "country"
  t.text "city"
end
add_index "tourist_spots", ["country", "city"],
  name: "index_tourist_spots_on_country_and_city", using: :btree

With a multicolumn index:
$ TouristSpot.where(country: 'japan', city: 'tokyo').explain
=> EXPLAIN for: SELECT "tourist_spots".* FROM "tourist_spots" WHERE "tourist_spots"."country" = $1 AND "tourist_spots"."city" = $2 [["country", "japan"], ["city", "tokyo"]]
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Index Scan using index_tourist_spots_on_country_and_city on tourist_spots (cost=0.42..8.44 rows=1 width=52)
  Index Cond: ((country = 'japan'::text) AND (city = 'tokyo'::text))
Without a multicolumn index:
$ TouristSpotWithoutMultipleIndex.where(country: 'japan', city: 'tokyo').explain
=> EXPLAIN for: SELECT "tourist_spot_without_multiple_indices".* FROM "tourist_spot_without_multiple_indices" WHERE "tourist_spot_without_multiple_indices"."country" = $1 AND "tourist_spot_without_multiple_indices"."city" = $2 [["country", "japan"], ["city", "tokyo"]]
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Index Scan using index_tourist_spot_without_multiple_indices_on_city on tourist_spot_without_multiple_indices (cost=0.42..8.44 rows=1 width=52)
  Index Cond: (city = 'tokyo'::text)
  Filter: (country = 'japan'::text)

It also works as an index on the leading column
For more details, see:
http://www.postgresql.org/docs/current/static/indexes-multicolumn.html
$ TouristSpot.where(country: 'japan').explain
=> EXPLAIN for: SELECT "tourist_spots".* FROM "tourist_spots" WHERE "tourist_spots"."country" = $1 [["country", "japan"]]
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tourist_spots (cost=4.50..41.67 rows=10 width=52)
  Recheck Cond: (country = 'japan'::text)
  -> Bitmap Index Scan on index_tourist_spots_on_country_and_city (cost=0.00..4.49 rows=10 width=0)
       Index Cond: (country = 'japan'::text)
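
Why the leading column benefits can be sketched with a toy model: the index entries are sorted by country first and by city second, so all entries for one country sit next to each other, while the cities on their own are not in order. The sample data and helper names below are made up for illustration.

```ruby
# Toy model of an index on (country, city): entries sorted by country, then city.
SPOT_INDEX = [
  %w[japan tokyo], %w[france paris], %w[japan kyoto],
  %w[china beijing], %w[france nice]
].sort

# The leading column is sorted, so a binary search finds the first match and
# the remaining matches follow it contiguously.
def lookup_by_country(index, country)
  start = index.bsearch_index { |(entry_country, _city)| entry_country >= country }
  return [] if start.nil?
  index.drop(start).take_while { |(entry_country, _city)| entry_country == country }
end

# The second column alone is not in sorted order across the whole index.
def sorted_by_city?(index)
  cities = index.map(&:last)
  cities == cities.sort
end

p lookup_by_country(SPOT_INDEX, "japan")
p sorted_by_city?(SPOT_INDEX)
```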

3. Indexes on Expressions
You can build an index whose key is the return value of a function or other expression
# db/migrate/20151210065304_add_indexes_on_~.rb
def up
  execute <<-SQL
    CREATE INDEX index_profiles_with_indexes_on_expressions_on_lower_email
    ON profiles_with_indexes_on_expressions(lower(email));
  SQL
end

def down
  execute <<-SQL
    DROP INDEX index_profiles_with_indexes_on_expressions_on_lower_email
  SQL
end

3. Indexes on Expressions
lower(email) is used as the index key
For details, see:
http://www.postgresql.org/docs/current/static/indexes-expressional.html
$ ProfilesWithIndexesOnExpression.where("lower(email) = 'minami@wantedly.com'").explain
=> EXPLAIN for: SELECT "profiles_with_indexes_on_expressions".* FROM "profiles_with_indexes_on_expressions" WHERE (lower(email) = 'minami@wantedly.com')
QUERY PLAN
------------------------------------------------------------------------------------------------------
Index Scan using index_profiles_with_indexes_on_expressions_on_lower_email on profiles_with_indexes_on_expressions (cost=0.29..8.30 rows=1 width=48)
  Index Cond: (lower(email) = 'minami@wantedly.com'::text)

The next step is joining the data (JOIN)
HashAggregate
  Hash Join
    Seq Scan
    Hash
      Index Scan

JOIN algorithms
The optimal algorithm is chosen based on whether indexes exist and on the statistics (data volume and distribution)
• 1. Nested Loop Join (slow)
• 2. Hash Join
• 3. Merge Join (fast)

1. Nested Loop
Try every combination of rows from table 1 (N records) and table 2 (M records)
O(N × M) … extremely slow
• Fast when the number of records is small
• Can be sped up by adding an index on table 2
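
A minimal in-memory sketch of the nested loop, using made-up job and company rows. It counts comparisons to make the O(N × M) behaviour visible; it is an illustration, not PostgreSQL's implementation.

```ruby
JOBS      = [{ id: 1, company_id: 10 }, { id: 2, company_id: 20 }, { id: 3, company_id: 10 }]
COMPANIES = [{ id: 10, country: "japan" }, { id: 20, country: "china" }]

# Returns the joined pairs and the number of comparisons performed.
def nested_loop_join(outer, inner)
  comparisons = 0
  joined = []
  outer.each do |job|
    inner.each do |company|
      comparisons += 1
      joined << [job[:id], company[:country]] if job[:company_id] == company[:id]
    end
  end
  [joined, comparisons]
end

joined, comparisons = nested_loop_join(JOBS, COMPANIES)
p joined
p comparisons # N * M comparisons for N jobs and M companies
```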

2. Hash Join
Scan table 2 once in full and build a hash map from it
O(N + M) … building the hash has a cost, but it still beats a Nested Loop
All records of table 2 have to fit in memory
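
The build and probe phases can be sketched in a few lines. This assumes the join key is unique on the build side (as a primary key would be) and uses made-up rows; it is an illustration rather than PostgreSQL's implementation.

```ruby
JOBS      = [{ id: 1, company_id: 10 }, { id: 2, company_id: 20 }, { id: 3, company_id: 10 }]
COMPANIES = [{ id: 10, country: "japan" }, { id: 20, country: "china" }]

def hash_join(probe_rows, build_rows)
  # Build phase: one full scan of the build side (table 2), held in memory.
  table = build_rows.each_with_object({}) { |company, h| h[company[:id]] = company }

  # Probe phase: one pass over the other side with constant-time lookups.
  probe_rows.each_with_object([]) do |job, joined|
    company = table[job[:company_id]]
    joined << [job[:id], company[:country]] if company
  end
end

p hash_join(JOBS, COMPANIES)
```

Nothing can be returned until the hash table has been built, which is why the build work appears as startup cost in the plan.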

Cost of a Hash Join
The cost of building the hash is the startup cost
(in the first plan of this deck, Hash Join (cost=54.28..1188.79 rows=5000 width=16) has a startup cost of 54.28)

3. Merge Join
With tables 1 and 2 already sorted, scan each of them in full just once
O(N+M) … the fastest
Add an index on the columns used for JOIN
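
A minimal sketch of the merge step, assuming both inputs are already sorted on the join key (which is what an index can provide) and that the key is unique on the companies side. The rows are made up for illustration.

```ruby
# Both inputs must already be sorted by the join key.
JOBS_SORTED      = [{ id: 1, company_id: 10 }, { id: 3, company_id: 10 }, { id: 2, company_id: 20 }]
COMPANIES_SORTED = [{ id: 10, country: "japan" }, { id: 15, country: "britain" }, { id: 20, country: "china" }]

def merge_join(jobs, companies)
  joined = []
  i = 0
  j = 0
  while i < jobs.size && j < companies.size
    job = jobs[i]
    company = companies[j]
    if job[:company_id] == company[:id]
      joined << [job[:id], company[:country]]
      i += 1 # company ids are unique, so only the jobs side advances
    elsif job[:company_id] < company[:id]
      i += 1
    else
      j += 1
    end
  end
  joined
end

p merge_join(JOBS_SORTED, COMPANIES_SORTED)
```

Each cursor only ever moves forward, so both inputs are read once.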

Cases where a JOIN is slow even with an index
No matter how much it is optimized, a JOIN is at best O(N+M)
It gets slow when N is large

Cases where a JOIN is slow even with an index
$ User.joins(:profile).select('COUNT(*)').explain
=> EXPLAIN for: SELECT COUNT(*) FROM "users" INNER JOIN "profiles" ON
"profiles"."user_id" = "users"."id"
QUERY PLAN
--------------------------------------------------------------------------------
Aggregate (cost=23288.72..23288.73 rows=1 width=0)
Hash Cond: (users.id = profiles.user_id)
-> Seq Scan on users (cost=0.00..11441.64 rows=698964 width=4)

Narrow down the left relation of the JOIN beforehand
$ User.where(registered: true).joins(:profile).select('COUNT(*)').explain
=> EXPLAIN for: SELECT COUNT(*) FROM "users" INNER JOIN "profiles" ON "profiles"."user_id" = "users"."id"
WHERE "users"."registered" = $1 [["registered", "t"]]
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Aggregate (cost=8131.17..8131.18 rows=1 width=0)
Hash Cond: (users.id = profiles.user_id)
-> Bitmap Heap Scan on users (cost=1496.35..6639.86 rows=69151 width=4)
Filter: registered
-> Bitmap Index Scan on index_users_on_registered (cost=0.00..1479.06 rows=69151 width=0)
Index Cond: (registered = true)

The last step is aggregating the data (Aggregate)
HashAggregate
  Hash Join
    Seq Scan
    Hash
      Index Scan

Two algorithms for GROUP BY
• 1. Group Aggregate
• 2. Hash Aggregate

1. Group Aggregate
Sort the input by the group key, then process each group in order
(if an index already provides sorted input, this can be pipelined)

2. Hash Aggregate
Build a temporary hash table keyed by the group key
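
The two strategies can be compared on a COUNT per country. The sketch below uses made-up (country, job id) rows and hypothetical method names; it illustrates the idea rather than PostgreSQL's implementation.

```ruby
ROWS = [["japan", 1], ["china", 2], ["japan", 3], ["china", 4], ["japan", 5]]

# Group Aggregate: sort by the group key, then fold consecutive equal keys.
def group_aggregate(rows)
  counts = []
  rows.sort_by { |country, _job_id| country }.each do |country, _job_id|
    if counts.last && counts.last[0] == country
      counts.last[1] += 1
    else
      counts << [country, 1]
    end
  end
  counts
end

# Hash Aggregate: no sort, one pass that updates a temporary hash table.
def hash_aggregate(rows)
  rows.each_with_object(Hash.new(0)) { |(country, _job_id), table| table[country] += 1 }
end

p group_aggregate(ROWS)
p hash_aggregate(ROWS)
```

The group variant produces its output in key order and only needs the current group in memory; the hash variant skips the sort but keeps one entry per group in memory.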

When the last step is Sort and Limit
Specifying ORDER BY adds a Sort step
$ PageViewLog.order(:viewed_at).limit(20).explain
=> EXPLAIN for: SELECT "page_view_logs".* FROM "page_view_logs"
ORDER BY "page_view_logs"."viewed_at" ASC LIMIT 20
QUERY PLAN
-----------------------------------------------------------------------------------
-> Sort (cost=22026.31..23278.87 rows=501024 width=28)
Sort Key: viewed_at
-> Seq Scan on page_view_logs (cost=0.00..8694.24 rows=501024 width=28)
Once it turns into a disk sort, it is very slow

Use an index for ORDER BY
With an index the data is already sorted, so no sort step is needed
$ PageViewLogWithIndex.order(:viewed_at).limit(20).explain
=> EXPLAIN for: SELECT "page_view_log_with_indices".* FROM "page_view_log_with_indices"
ORDER BY "page_view_log_with_indices"."viewed_at" ASC LIMIT 20
QUERY PLAN
---------------------------------------------------------------------------------------------------
-> Index Scan using index_page_view_log_with_indices_on_viewed_at on page_view_log_with_indices
(cost=0.42..16698.78 rows=501024 width=28)

Other fun friends that are characteristic of PostgreSQL
• 1. Window Functions
• 2. Json Type
• 3. Hstore
• 4. Materialized View
• 5. Stored Procedure (PL/pgSQL)

1. Window Functions
http://www.postgresql.org/docs/current/static/tutorial-window.html
A powerful kind of aggregate function: it computes a value per partition
$ Company.select('country, rank() OVER (PARTITION BY country ORDER BY id DESC)').explain
=> EXPLAIN for: SELECT country, rank() OVER (PARTITION BY country ORDER BY id DESC) FROM "companies"
QUERY PLAN
----------------------------------------------------------------------------
WindowAgg (cost=936.35..1155.63 rows=10964 width=16)
  -> Sort (cost=936.35..963.76 rows=10964 width=16)
       Sort Key: country, id
       -> Seq Scan on companies (cost=0.00..200.64 rows=10964 width=16)
 country   | rank
-----------+------
 britain   |    1
 china     |    1
 china     |    2
 china     |    3
 country_0 |    1
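
What rank() OVER (PARTITION BY country ORDER BY id DESC) computes can be mimicked in plain Ruby: sort, split into partitions, then number the rows inside each partition. The rows are made up, and because ids are unique this sketch does not handle ties (where SQL's rank() would repeat a value).

```ruby
COMPANY_ROWS = [
  { id: 1, country: "china" }, { id: 2, country: "britain" },
  { id: 3, country: "china" }, { id: 4, country: "china" }
]

# Returns [country, id, rank] rows, ranking by id descending within a country.
def rank_by_country(companies)
  companies
    .sort_by { |c| [c[:country], -c[:id]] }                       # the Sort step
    .chunk_while { |a, b| a[:country] == b[:country] }            # one chunk per partition
    .flat_map do |partition|
      partition.each_with_index.map { |c, i| [c[:country], c[:id], i + 1] }
    end
end

rank_by_country(COMPANY_ROWS).each { |row| p row }
```

Unlike GROUP BY, every input row is kept; the rank is just an extra value attached to each row.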

2. Json Type
JSON data can be stored
Already supported by ActiveRecord
$ Event.create(payload: { kind: "user_renamed", change: ["jack", "john"]})
(0.1ms) BEGIN
SQL (1.7ms) INSERT INTO "events" ("payload", "created_at", "updated_at") VALUES ($1, $2, $3) RETURNING "id" [["payload", "{\"kind\":\"user_renamed\",\"change\":[\"jack\",\"john\"]}"], ["created_at", "2015-12-10 09:57:52.294809"], ["updated_at", "2015-12-10 09:57:52.294809"]]
(0.4ms) COMMIT
# db/migrate/~.rb
def change
  create_table :events do |t|
    t.json :payload
  end
end

2. Json Type
http://www.postgresql.org/docs/current/static/functions-json.html
There are operators for extracting values from JSON
$ Event.where("payload->>'name' = ?", "test1").explain
=> EXPLAIN for: SELECT "events".* FROM "events" WHERE (payload->>'name' = 'test1')
QUERY PLAN
--------------------------------------------------------
Seq Scan on events (cost=0.00..24.85 rows=5 width=52)
Filter: ((payload ->> 'name'::text) = 'test1'::text)

3. Hstore
http://www.postgresql.org/docs/current/static/hstore.html
Key/value pairs can be stored in a single column
Query operators are available

4. Materialized View
http://www.postgresql.org/docs/current/static/sql-creatematerializedview.html
A cached view
Can be expected to speed things up, but has to be refreshed manually (REFRESH MATERIALIZED VIEW)

5. Stored Procedure (PL/pgSQL)
http://www.postgresql.org/docs/current/static/plpgsql.html
Lets you define functions that run inside PostgreSQL

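Summary
The execution plan chosen when SQL runs depends on whether indexes exist and on the statistics (data volume and distribution)
Speed things up by choosing an appropriate schema, indexes, and queries
• Put indexes on the keys used in WHERE, JOIN, ORDER BY, and GROUP BY
• Filter as much as possible before a JOIN
• Consider the JSON type and similar features case by case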
