Azure Search 大全

#azurejp
https://www.facebook.com/dahatake/
https://twitter.com/dahatake/
https://github.com/dahatake/
https://daiyuhatakeyama.wordpress.com/
https://www.slideshare.net/dahatake/

https://customers.microsoft.com/Pages/CustomerStory.aspx?recid=18596

「モバイルアプリで、
もっと売れると
思ったんですよねぇ」
「欲しいものは、
なんか違うのよね」

“AdventureWorks Cycle Shop”
車種型番色価格画像URL
マウンテン MS-01 赤 10万 http://xxx/a01.png
マウンテン MS-02 青 8万 http://xxx/b01.png
「スマホでの
入力は大変ね」 ”文字入力” ”アクション”

絞り込み
あいまい検索


マウンテン MS-02 青 8万 http://xxx/b01.png「運動したくて、
自転車はどうかなって」
運動

“Azure Sports Store”
カテゴリ商品名価格画像URL
ランニング Azureシューズ 1万円 http://yyy/as01.png
フィットネス Cloudワンダー 2万円 http://xxxx/fc04.png

“AdventureWorks Work-Out Services”
“Azure Sports Store”
カテゴリ商品名価格画像URL
ランニング Azureシューズ 1万円 http://yyy/as01.png
フィットネス Cloudワンダー 2万円 http://xxxx/fc04.png
Search
大カテゴリ中カテゴリ商品名価格画像URL Blog タイトル
自転車マウンテン MS-01 10万 http://xxx/a01.png {ランとバイクの相乗効果…}
自転車マウンテン MS-02 8万 http://xxx/b01.png {富士山からの… , 2時間で..}
ランニング “シューズ” Azureシューズ 1万円 http://yyy/as01.png {ランとサイクリングの…}
フィットネ
ス
“筋トレ” Cloudワンダー 2万円 http://yyy/fc04.png サイクリングで鍛えられな
Work-Out
Data
Index
サービス用のデータ構造

Oracle CouchDBDB2Postgres MongoDBCassandra
RavenDBMySQLSQLDB RedisDocument
DB
Relational No-SQL
Azure Search

販売チャネルを増やす事も
考えないとね

“AdventureWorks Work-Out Services”
Search
Massive scalability

Separation of "reads" and "updates"
• Data partitioning and replication
Advanced data management
• Index management, etc.

全文検索エンジンなら…
取得することができ
る

Inverted index: a data structure that maps each token to the documents that contain it
Text analysis → indexing

Doc#  Document content
1     Microsoft is introducing SQL Server
2     Windows Server on Azure
3     Microsoft is introducing Azure
4     Application programming on Microsoft Azure

Term (token)   Documents containing it
microsoft      1, 3, 4
introducing    1, 3
sql            1
server         1, 2
windows        2
azure          2, 3, 4
application    4
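
The table above can be reproduced with a few lines of code. The following is a minimal sketch, not how Azure Search or Lucene implements it: the function names, the regular-expression tokenizer, and the stop-word set ("is", "on", chosen only because the slide omits them) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: these stop words are chosen only to match the slide's table.
STOP_WORDS = {"is", "on"}

def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search_and(index, *terms):
    """Return the ids of documents that contain every given term."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

DOCS = {
    1: "Microsoft is introducing SQL Server",
    2: "Windows Server on Azure",
    3: "Microsoft is introducing Azure",
    4: "Application programming on Microsoft Azure",
}
INDEX = build_inverted_index(DOCS)
print(INDEX["microsoft"])                       # [1, 3, 4]
print(search_and(INDEX, "Microsoft", "Azure"))  # [3, 4]
```

Looking up a term is a dictionary access, and an AND query is a set intersection of posting lists, which is why this structure suits full-text search better than scanning every row.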

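Searching for "キング" (king) ⇒ also matches "バーガーキング", "ライオンキング", and "Azureでのセキュアネットワーキング"
Searching for "京都" (Kyoto) ⇒ also matches "東京都庁" and "京都観光"
Searching for "ダイアモンド" ⇒ should also hit "ダイヤモンド"

Key techniques and solutions for improving search accuracy
• N-gram
• Morphological analysis
• Stemming
• Lemmatization
• Synonym expansion
• Normalization
• Stop-word removal
• Anti-phrasing
• Spell checking
• Query suggestions
• Facets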

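• Results are ordered by evaluating the relevance between the query and each document
• This is a completely different approach from sorting results with a database ORDER BY clause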

Microsoft
Bot
Framework
Knowledge
Base
Azure
Search

Azure
Search
Document
DB Azure Media Services
Microsoft Cognitive
Service
Azure Machine
Learning

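Search is the primary way users interact with many applications, and expectations for full-text search in particular are high. Users are accustomed to web search engines, sophisticated e-commerce sites, and social applications that return relevant results, together with features such as search suggestions as they type, faceted navigation, and hit highlighting, all with almost no lag.
In building Azure Search, Microsoft wanted developers without search expertise to be able to build a great search experience into their applications.
Delivering a robust search experience is challenging both on the information-retrieval front, which requires text analysis and ranking, and on the distributed-systems front, where scalability and reliability must be managed. Offering search as a service addresses these challenges naturally and lets developers concentrate on building their applications.
Source: Azure Search Scenarios and Capabilities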

https://github.com/Azure-
Samples/search-dotnet-asp-
net-mvc-jobs

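Search service features
• Search features
• Faceted navigation
• Hit highlighting
• Suggestions
• Filtering and sorting
• Geospatial search
• Fuzzy and similarity search
• Rank/score tuning
• Tag boosting
• Text analyzers
• Data feed features
• Push updates through the API
• Polling by indexers
Operations and management features
• Service provisioning
• Capacity changes
• Basic index statistics
• Schema creation and modification
• Data source creation and modification
• Search traffic visualization
• Search log retention
• Backup (automatic)
General
• Search/management
• REST API or management portal
• API
• Protocol: HTTP
• Format: JSON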

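Main purpose of enterprise search
Provide a single point of search over structured and unstructured information scattered across the organization
• Support for multiple information sources
• Crawling and connector features are needed to cover sources such as Office documents, RDBMSs, and groupware
• These are not built in, so a custom implementation is required
• Per-item access control
• The information is not public, so access control per viewer and per item is needed
• This requires a custom implementation or a proxy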

https://github.com/Azure/azure-quickstart-
templates/tree/master/101-azure-search-create

                          Free     Basic   Standard S1       Standard S2       Standard S3       Standard S3 HD
Max services              1        12      12                6                 6                 6
Max indexes/service       3        5       50                200               200               1000/P (3000/service)
Max documents/service     10,000   1M      15M/P             60M/P             120M/P            20M/P
                                           (180M/service)    (720M/service)    (1.4B/service)    (1M/index)
Max storage/service       50MB     2GB     25GB/P            100GB/P           200GB/P           200GB/P
                                           (300GB/service)   (1.2TB/service)   (2.4TB/service)   (600GB/service)
Max partitions/service    N/A      1       12                12                12                3
Max replicas/service      N/A      3       12                12                12                12
Max search units/service  N/A      3       36                36                36                36
Queries/sec (QPS) guide   N/A      ~3/R    ~15/R             ~60/R             >60/R             >60/R
P = partition, R = replica

Base URL: https://<account name>.search.windows.net

Operation                        Path                                  Method
Create or update an index        /indexes/<indexname>                  PUT
List indexes                     /indexes                              GET
Get index statistics             /indexes/<indexname>/stats            GET
Delete an index                  /indexes/<indexname>                  DELETE
Add or delete documents          /indexes/<indexname>/docs/index       POST
Search                           /indexes/<indexname>/docs             GET
Look up a document               /indexes/<indexname>/docs/<key>       GET
Count documents                  /indexes/<indexname>/docs/$count      GET
Suggestions                      /indexes/<indexname>/docs/suggest     GET
{
"@odata.context":
"https://yoichikademo0.search.windows.net/in
dexes('messages')/$metadata#Collection(Micro
soft.Azure.Search.V2015_02_28_Preview.IndexR
esult)",
"value": [
{ "errorMessage": null, "key": "1", "status":
true, "statusCode": 201 },
true, "statusCode": 201 },
true, "statusCode": 201 }
]
}
Note: this is the feature list for API version 2015-02-28-Preview. See the API reference for the features available in each version.

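[Diagram: query and indexing pipeline]
Retrieve (search processing): query text → QueryParser (Simple | Lucene) → query tree → query terms → Analyzer → analyzed terms → Search Engine
• QueryParser: parses the query text and converts it into the internal query form
• Search Engine: looks up tokens based on the query and performs ranking
Ingest (index generation): documents → terms → Analyzer → analyzed terms → IndexWriter → index (inverted index)
• Analyzer: performs text analysis, expanding, transforming, and removing tokens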

データ
ソース
リージョンA リージョンB

Search RDB
Index Index
Document Row
Field Column
Crawling (“indexing”) Data Import
≒

SERVICE_NAME='yoichikademo0'
API_VER='2015-02-28-Preview'
API_KEY='5694051B97CC6A115D1FXA700B9033C1'
URL="https://$SERVICE_NAME.search.windows.net/indexes?api-version=$API_VER"
# Create the index "myindex" (POST /indexes)
curl -s \
  -H "Content-Type: application/json" \
  -H "api-key: $API_KEY" \
  -XPOST "$URL" -d '{
  "name": "myindex",
  "fields": [
    { "name":"id", "type":"Edm.String", "key": true, "searchable": false, "filterable":false, "facetable":false },
    { "name":"title", "type":"Edm.String", "searchable": true, "filterable":true, "sortable":true, "facetable":false, "analyzer":"ja.microsoft" },
    { "name":"speakername", "type":"Edm.String", "searchable": true, "filterable":true, "sortable":true, "facetable":false, "analyzer":"ja.microsoft" },
    { "name":"speakerid", "type":"Edm.String", "searchable": false, "filterable":false, "sortable":true, "facetable":false },
    { "name":"url", "type":"Edm.String", "searchable": false, "filterable":false, "sortable":true, "facetable":false },
    { "name":"thumbnail", "type":"Edm.String", "searchable": false, "filterable":false, "sortable":true, "facetable":false },
    { "name":"description", "type":"Edm.String", "searchable": true, "filterable":false, "sortable":false, "facetable":false, "analyzer":"ja.microsoft" }
  ],
  "suggesters": [
    { "name":"sessionsg", "searchMode":"analyzingInfixMatching", "sourceFields":["title"] }
  ]
}'

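{
  "name":"description",
  "type":"Edm.String",
  "searchable": true,
  "filterable":false,
  "sortable":false,
  "facetable":false,
  "retrievable":true,
  "analyzer":"ja.microsoft"
}
• analyzer: decides how tokens are created for the index
• searchable: the field is tokenized and added to the inverted index
• retrievable: the document value is stored so it can be returned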

https://msdn.microsoft.com/library/azure/dn798941.aspx/
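Attribute     Description
searchable    Makes the field full-text searchable. Word breaking and language analysis are applied at indexing time.
retrievable   Whether the field is included in search results.
filterable    Whether the field can be referenced in $filter queries. Matching is exact only (for example true / false values).
sortable      Whether results can be ordered by this field instead of the default ranking algorithm.
facetable     Whether the field is used for facets (search results that include hit counts per category).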

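Index schema changes
〇 Allowed
• Adding a new field
• When a field is added, existing documents get NULL for that field
✖ Not allowed
• Changing the type of, or deleting, an existing field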

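Partitions
• All documents are split and stored across the partitions (distributes I/O)
• All partitions together make up a single index
Replicas
• The same content is replicated and kept in sync on every replica
• Each query is assigned to one of the replicas

       par1  par2  par3  par4  par5
rep1   p1    p2    p3    p4    p5
rep2   p1    p2    p3    p4    p5
rep3   p1    p2    p3    p4    p5
rep4   p1    p2    p3    p4    p5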

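[Diagram: search requests handled by 1 replica × 2 partitions versus 5 replicas × 2 partitions; adding replicas increases query throughput]
QPS guide: Standard1 15 QPS / replica, Standard2 60 QPS / replica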

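                Standard1                              Standard2
Documents       15M/partition, or 180M per service     60M/partition, or 720M per service
Storage size    25GB/partition, or 300GB per service   100GB/partition, or 1.2TB per service
[Diagram: 1 replica × 5 partitions versus 1 replica × 1 partition; adding partitions increases capacity]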

https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
              1 partition  2 partitions  3 partitions  4 partitions  6 partitions  12 partitions
12 replicas   12 SU        24 SU         36 SU         N/A           N/A           N/A
6 replicas    6 SU         12 SU         18 SU         24 SU         36 SU         N/A
3 replicas    3 SU         6 SU          9 SU          12 SU         18 SU         36 SU
2 replicas    2 SU         4 SU          6 SU          8 SU          12 SU         24 SU
1 replica     1 SU         2 SU          3 SU          4 SU          6 SU          12 SU
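
Every cell in the table is simply replicas × partitions, and combinations above the 36-SU limit listed for the Standard tiers are marked N/A. A minimal sketch of that arithmetic follows; the function name and the `MAX_SEARCH_UNITS` constant are illustrative, not part of any Azure SDK.

```python
# Assumption: the 36 search-unit cap is the per-service limit listed for Standard tiers.
MAX_SEARCH_UNITS = 36

def search_units(replicas, partitions):
    """Return replicas x partitions, or None when the combination exceeds the cap."""
    units = replicas * partitions
    return units if units <= MAX_SEARCH_UNITS else None

print(search_units(3, 12))   # 36
print(search_units(12, 4))   # None (shown as N/A in the table)
```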

Text analysis → indexing → inverted index

Doc#  Document content
1     Microsoft is introducing SQL Server
2     Windows Server on Azure
3     Microsoft is introducing Azure
4     Application programming on Microsoft Azure

Terms         Doc#
microsoft     1, 3, 4
introducing   1, 3
sql           1
server        1, 2
windows       2
azure         2, 3, 4
application   4
programming   4

Query: Microsoft
⇒ the posting list for "microsoft" gives documents 1, 3, 4

Query: Microsoft AND Azure
⇒ "microsoft" matches 1, 3, 4 and "azure" matches 2, 3, 4; the intersection is documents 3 and 4

Positional index: the index also stores the character offset of each token within its document

Doc 4: Application programming on Microsoft Azure
(0)application (12)programming (27)microsoft (37)azure

Doc#  Document content
1     Microsoft is introducing SQL Server
2     Windows Server on Azure
3     Microsoft is introducing Azure
4     Application programming on Microsoft Azure

Terms         Doc#:offset
microsoft     1:0, 3:0, 4:27
introducing   1:13, 3:13
sql           1:25
server        1:29, 2:8
windows       2:0
azure         2:18, 3:25, 4:37
application   4:0
programming   4:12

Phrase query: enclose the phrase in double quotes
Query: "Microsoft Azure"
Find documents in which the offset of keyword 1, plus the length of keyword 1, plus one space, equals the offset of keyword 2.
For Doc#4:
k1len: length of keyword 1 ("Microsoft") = 9
k1off: offset of keyword 1 = 27
k2off: offset of keyword 2 ("Azure") = 37
⇒ k1off + (k1len + 1) = k2off
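
The offset rule above can be checked with a short script. This is an illustrative sketch only: the function names are hypothetical, and real engines such as Lucene match phrases on token positions rather than on raw character offsets.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map token -> {doc_id: [character offsets of that token]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].setdefault(doc_id, []).append(match.start())
    return index

def phrase_match(index, first, second):
    """Return docs where `second` starts exactly one space after `first` ends."""
    first, second = first.lower(), second.lower()
    hits = []
    for doc_id, offsets in index.get(first, {}).items():
        second_offsets = set(index.get(second, {}).get(doc_id, []))
        # k1off + (k1len + 1) == k2off, as on the slide
        if any(off + len(first) + 1 in second_offsets for off in offsets):
            hits.append(doc_id)
    return sorted(hits)

DOCS = {
    1: "Microsoft is introducing SQL Server",
    2: "Windows Server on Azure",
    3: "Microsoft is introducing Azure",
    4: "Application programming on Microsoft Azure",
}
POS_INDEX = build_positional_index(DOCS)
print(POS_INDEX["azure"])                             # {2: [18], 3: [25], 4: [37]}
print(phrase_match(POS_INDEX, "Microsoft", "Azure"))  # [4]
```

Document 3 contains both words but not adjacently (0 + 9 + 1 = 10, while "Azure" starts at 25), so only document 4 matches the phrase.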

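Data sources: DocumentDB, Blob Storage, Microsoft Azure SQL Database, Table Storage, on-premises
PULL (indexer) or PUSH (API) → Azure Search
Pull model: use an indexer
• 4 types of data source
• Scheduled execution (minimum interval 5 minutes)
• Incremental updates
• The data source acts as the master DB
• A full re-index starts from here
Push model: update directly through the API
• upload, merge, delete, etc.
• Up to 1,000 documents per batch
• Near-real-time data updates
Who ingests the data
• Crawlers
• Batch jobs
• Custom tools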
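
Because a push request carries at most 1,000 documents, a client has to split larger uploads into batches. The sketch below only builds the request bodies and sends nothing; the helper name `batches` is hypothetical, and it does not check the separate limit on request body size.

```python
def batches(documents, size=1000):
    """Yield request bodies of the form {"value": [...]} with at most `size` documents each."""
    for start in range(0, len(documents), size):
        yield {"value": documents[start:start + size]}

# Illustrative documents; "@search.action" and "id" follow the push examples in this deck.
docs = [{"@search.action": "upload", "id": str(i)} for i in range(2500)]
sizes = [len(body["value"]) for body in batches(docs)]
print(sizes)  # [1000, 1000, 500]
```

Each yielded body could then be POSTed to /indexes/<indexname>/docs/index with the api-key header.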

API_KEY='5694051B97CC6A115D1FXA700B903X'
# Documents are posted to /indexes/<indexname>/docs/index; the index here is "messages"
URL="https://$SERVICE_NAME.search.windows.net/indexes/messages/docs/index?api-version=2015-02-28-Preview"
curl -s \
  -H "Content-Type: application/json" \
  -H "api-key: $API_KEY" \
  -XPOST "$URL" -d '{
  "value": [
    { "@search.action": "upload", "id": "1", "user_name": "taylorswift13", "message":"post by taylorswift13", "created_at":"2016-04-29T00:00:00Z" },
    { "@search.action": "upload", "id": "2", "user_name": "katyperry", "message":"post by katyperry", "created_at":"2016-04-30T00:00:00Z" },
    { "@search.action": "upload", "id": "3", "user_name": "ladygaga", "message":"post by ladygaga", "created_at":"2016-04-29T00:00:00Z" }
  ]
}'
# Limits: at most 1,000 documents per request; the request body is at most 16 MB
{
"@odata.context":
"https://yoichikademo0.search.windows.net/indexes
('messages')/$metadata#Collection(Microsoft.Azure.
Search.V2015_02_28_Preview.IndexResult)",
"value": [
{ "errorMessage": null, "key": "1", "status": true,
"statusCode": 201 },
"statusCode": 201 },
"statusCode": 201 }
]
}

Indexer                 Data source                       Description
DocumentDB indexer      DocumentDB                        Updates an Azure Search index from data in DocumentDB
SQL DB indexer          SQL Database, SQL Server on VM    Updates an Azure Search index from data in SQL Database or in SQL Server running on a VM
Blob storage indexer    Blob storage                      Builds an index from documents (PDF, Office files, etc.) stored in Azure Blob Storage
Table storage indexer   Table storage                     Builds an index from the entities stored in Azure Table Storage; currently in preview

# 1. Create the data source (DocumentDB)
API_KEY='5694051B97CC6A115D1FXA700B903X'
URL="https://$SERVICE_NAME.search.windows.net/datasources?api-version=2015-02-28-Preview"
curl -s \
  -H "Content-Type: application/json" \
  -H "api-key: $API_KEY" \
  -XPOST "$URL" -d '{
  "name": "docdbds-article",
  "type": "documentdb",
  "credentials": {
    "connectionString": "AccountEndpoint=https://yoichikademo0.documents.azure.com;AccountKey=Tl1+ikQtnExUisJ+BXwbbaC8NtUqYVE9kUDXCNust5aYBduhui29Xtxz3DLP88PayjtgtnARc1PW+2wlA6jXJw==;Database=feeddb"
  },
  "container": {
    "name": "article_collection",
    "query": "SELECT s.id, s.title, s.content, s.permalink, s.postdate, s._ts FROM Sessions s WHERE s._ts > @HighWaterMark"
  },
  "dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "_ts"
  }
}'

# 2. Create the indexer
#    name: indexer name, dataSourceName: data source name, targetIndexName: index to update
API_KEY='2E73D2456052A9AD21E54CB03C3ABF6A'
URL="https://$SERVICE_NAME.search.windows.net/indexers?api-version=2015-02-28-Preview"
curl -s \
  -H "Content-Type: application/json" \
  -H "api-key: $API_KEY" \
  -XPOST "$URL" -d '{
  "name": "docdbindexer",
  "dataSourceName": "docdbds-article",
  "targetIndexName": "articles",
  "schedule": {
    "interval": "PT5M",
    "startTime": "2016-05-01T00:00:00Z"
  }
}'

# 3. Run the indexer on demand (the path contains the indexer name)
POST https://yoichikademo0.search.windows.net/indexers/docdbindexer/run?api-version=2015-02-28-Preview
api-key: <Search Service API KEY>

Field mapping through the data source query

DocumentDB collection   Azure Search index
id                  →   itemno
title               →   subject
content             →   body
permalink           →   url
postdate            →   date

Data Source Query
"SELECT s.id AS itemno,
s.title AS subject,
s.content AS body,
s.permalink AS url,
s.postdate AS date,
s._ts
FROM Sessions s WHERE
s._ts > @HighWaterMark"

$filter=geo.distance(location, geography'POINT(-122.131577 47.678581)') le 10
$filter=geo.intersects(location, geography'POLYGON((-122.031577 47.578581, -
122.031577 47.678581, -122.131577 47.678581, -122.031577 47.578581))')
https://msdn.microsoft.com/library/azure/dn798921.aspx/

https://<account name>.search.windows.net/indexes/<index name>/docs
  ?api-version=2016-09-01
  &search="xxx"
  &searchMode=all
  &queryType=full
  &$filter=xxx
  &$count=true
  &$top=5
  &$skip=10
  &$select=title,speaker
  &$orderby=level desc
  &facet=tag
  &highlight=title

search       • The search query string
             • The analyzer is applied to the query string
searchMode   • Determines how Boolean queries are evaluated (all | any)
queryType    • Selects the query parser (simple | full)
$filter      • Used to narrow down results
             • Neither analysis nor ranking is applied
             • Subset of the OData expression syntax
             • and, or, not, eq, lt, any, all

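search query
• search parameter
• Applies to searchable fields
• Query for full-text search
• Scored
Query syntax
• Simple query
• Lucene query
filter query
• $filter parameter
• Applies to filterable fields
• Query for narrowing down search results
• Case-sensitive
• Not scored
OData expression syntax
• Subset of the OData expression syntax
• Logical operators (and, or, not)
• Comparison expressions (eq, ne, gt, lt, ge, le)
• any, all, etc.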

/<index name>/docs?...
&$filter=session eq 'DEV'
&search=Azure

Simple query syntax
AND "+"        A+B: both A and B                        query: Azure+Search
OR "|"         A|B: A, B, or both                       query: Azure|Search
NOT "-"        A-B: A, with B negated (see NOTE)        query: Azure-Search
Wildcard "*"   prefix match, case-insensitive           query: Azu*
Phrase " "     "A B": only documents with A then B      query: "Azure-Search"
Grouping "()"  A+(B|C): A+B or A+C                      query: Azure+(AD|Search)

NOTE: NOT combined with searchMode
(1) search=A-B&searchMode=any ⇒ A OR (NOT B)
(2) search=A-B&searchMode=all ⇒ A AND (NOT B)

Example: search=A B
(1) search=A B&searchMode=any ⇒ A OR B
(2) search=A B&searchMode=all ⇒ A AND B

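Field scope "field:term": restricts the search to the given field
query: session:Azure AND Search
query: session:"Azure Search" AND "Azure AD"
Fuzzy search "term~" or "term~N" (N = 0 to 2, default 2): matches every term that is within N single-character edits
query: Azure~1
Proximity search ""A B"~N": A and B appear within N words of each other
query: "Azure Search"~3 (matches when "Azure" and "search" are at most 3 words apart)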

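Boosting
"term^N" or "phrase^N" (N: boost value, default 1): boosts the word or phrase marked with ^ by N so that documents matching it rank as more relevant
query: apache lucene^2
query: "Azure Search"^3 "SharePoint Search"
Regular expression search
"/regex/": regular expression syntax. See the Lucene RegExp class documentation for details
query: /[hm]otel/
Wildcard search
"*" matches multiple characters, "?" matches a single character. A wildcard may appear in the middle or at the end of a term; a wildcard at the start of a term is not supported
query: te?t
query: test*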

Example: kitten and sitting
Levenshtein distance = 3
• kitten → sitten (substitute "s" for "k")
• sitten → sittin (substitute "i" for "e")
• sittin → sitting (append "g")
Source: Wikipedia, Levenshtein distance
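
The distance in this example can be computed with the classic dynamic-programming algorithm. The sketch below implements plain Levenshtein distance (insert, delete, substitute); it is for illustration and is not claimed to be the exact metric or implementation the search engine uses internally.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute (or keep)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Azure", "Azur"))      # 1
```

With a fuzzy query such as Azure~1, terms at distance 1 from "azure" are candidates for a match.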

1. Fuzzy mode of search suggestions
/indexes/myindex/docs/suggest?api-version=2015-02-28-Preview
&search=Azure&$select=title,author&$top=5
&suggesterName=mysuggester
&fuzzy=true
2. Fuzzy search in the Lucene query syntax
/indexes/myindex/docs?api-version=2015-02-28-Preview
&$select=title,author&$top=5
&search=Azure~1
&queryType=full

"suggesters": [
{
"name":"sessionsg",
"searchMode":"analyzingInfixMatching",
"sourceFields":["title"]
}
],
/indexes/myindex/docs/suggest
suggesterName
fuzzy
search Azu
{
"@odata.context":
"https://yoichikademo0.search.windows.net/i
ndexes(‘myindex')/$metadata#docs(title)",
"value": [
{ "@search.text": "Azure DevOps at
Rakuten",
"title": "Azure DevOps at Rakuten"},
{"@search.text": "Azure IaaS 最新動向",
"title": "Azure IaaS 最新動向"},
…
]
}

{ "name":"color", "type":"Edm.String", "searchable": false,
"filterable":true, "sortable":true, "facetable":true },
{ "name":"size", "type":" Edm.Int32", "searchable": false,
{ "name":"price", "type":" Edm.Int32", "searchable": false,
/indexes/myindex/docs
facet
facet
facet
search
"@search.facets": {
"color@odata.type":
"#Collection(Microsoft.Azure.Search.V2015_02_2
8.QueryResultFacet)",
"color": [
{ "count": 4, "value": "Red“ },
{ "count": 3, "value": "Black“ },
{ "count": 3, "value": "Yellow“ }
],
"size@odata.type":
"#Collection(Microsoft.Azure.Search.V2015_02_2
8.QueryResultFacet)",
"size": [
{"count": 2, "value": 62 },
{"count": 2, "value": 60 },
..
],
},

Search for documents within 5 km of my location
/indexes/myindex/docs?...
&search=engineer
&$filter=geo.distance(loc, geography'POINT(-127.21 42)') lt 5
Sort by distance from my location
/indexes/myindex/docs?...
&search=engineer
&$orderby=geo.distance(loc, geography'POINT(-127.21 42)')

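Text analysis is built on Lucene Core
The unit of processing is the analyzer

• Text analysis runs both at indexing time and at query time
• Configurable per field
• Custom analyzers let you define your own analyzer
Example input: <b>Azure Search</b> allows you to easily add a robust search experience
Indexing / query processing

Character filters (Char Filters)
Character-level processing before tokenization
An analyzer can define zero or more character filters
Tokenizer
Defines how a string is split into tokens (words)
An analyzer has exactly one tokenizer
Token filters (Token Filters)
Processing applied to tokens after tokenization
An analyzer can define zero or more token filters

Example pipeline
HtmlStripCharFilter: removes HTML tags from the string
Tokenizer: splits the string into tokens
Lowercase filter: lowercases the tokens
Stop filter: removes stop words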
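
The three-stage structure (character filters, then one tokenizer, then token filters) can be mimicked in a few lines. This is a toy sketch under stated assumptions: the function names, the regular expressions, and the stop-word set are all illustrative, and real Lucene components behave in more detail than this.

```python
import re

# Assumption: a tiny stop-word list chosen for the example sentence only.
STOP_WORDS = {"a", "an", "the", "to", "you"}

def html_strip(text):
    """Character filter: remove HTML tags before tokenization."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Tokenizer: split the string into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Token filter: drop stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    """Run char filters -> tokenizer -> token filters, in that order."""
    for char_filter in (html_strip,):
        text = char_filter(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>Azure Search</b> allows you to easily add a robust search experience")
print(tokens)
# ['azure', 'search', 'allows', 'easily', 'add', 'robust', 'search', 'experience']
```

Running the same `analyze` function on both documents and queries is what keeps indexed tokens and query tokens comparable.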

Azure Search Built-in モジュール一覧
https://docs.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-
azure-search#property-reference
Analyzer
• <lang>.microsoft (50言語)
• <lang>.lucene (35言語)
• keyword
• pattern
• simple
• standard
• standardasciifolding.lucene
• stop
• whitespace
CharFilter
• html_strip
• mapping
• pattern_replace
Tokenizer
• classic
• edgeNGram
• keyword_v2
• letter
• lowercase
• microsoft_language_tokenizer
(43言語)
• microsoft_language_stemming_tokenizer (＊)
• nGram
• path_hierarchy_v2
• pattern
• standard_v2
• uax_url_email
• whitespace
TokenFilter
arabic_normalization
apostrophe
asciifolding
cjk_bigram
cjk_width
classic
common_grams
dictionary_decompounder
edgeNGram_v2
elision
keep
keyword_marker
keyword_repeat
kstem
length
limit
lowercase
nGram_v2
pattern_capture
pattern_replace
phonetic
porter_stem
reverse
shingle
snowball
stemmer (＊)
stemmer_override
stopwords (＊)
synonym
trim
truncate
unique
uppercase
word_delimiter
(＊) Supports multiple languages, but Japanese is not supported
Support status as of May 2017


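Japanese search results on a field that uses StandardAnalyzer (the standard analyzer)
• Low-importance tokens (noise) are being hit
• The text is not split into tokens in a way that suits Japanese sentences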

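Main text analysis steps (techniques for improving the recall and precision of search results)
• Tokenizing strings → approaches: morphological analysis, N-gram
• Stemming / lemmatization
• Normalization
• Stop-word removal
• Anti-phrasing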

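                        N-gram                          Morphological analysis
Tokenization speed      Fast                            Slow
Index size              Large                           Small
Precision               Low                             High
Hits (recall)           Many                            Few
Search speed            Slow                            Fast
Operating cost          Low (no dictionary needed)      High (a dictionary must be prepared and maintained)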

テキスト「経済新聞をかいにいく」の解析
https://www.atilika.com/ja/products/kuromoji.html
http://atilika.org/kuromoji/

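ja.lucene
• Uses kuromoji
• Configured with the default (search) segmentation mode
ja.microsoft
• Microsoft Japanese NLP
• Processing details are not public
Sample text (the same text is shown for both analyzers):
吾輩はここで始めて人間というものを見た。
掌の上で少し落ちついて書生の顔を見たのがいわゆる人間というものの見始みはじめであろう。
[Figure: hit highlighting for each analyzer; the highlighted tokens on the slide are 人間, 見た, and 見]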

en.lucene
• Extension of StandardAnalyzer
• Stemming (Porter Stemming)
• Stop-word removal
en.microsoft
• Microsoft English NLP
• Lemmatization rather than stemming
• Processing details are not public
Sample text (the same text is shown for both analyzers):
after such a fall as this, I shall think nothing of tumbling down stairs!, Why, I wouldn't say anything about it, even if I fell off the top of the house!
or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next
[Figure: hit highlighting for each analyzer; the highlighted tokens on the slide are she, fell, and fall]

Analyze API
POST https://<account name>.search.windows.net/indexes/<index name>/analyze

Request body (specifying an analyzer)
{
"text": "text",
"analyzer":"analyzer name"
}

Response
{
"tokens": [
{ "token" : "token 1",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{ "token": "token 2",
"startOffset": 5,
"endOffset": 7,
"position": 1
},
....
]}

Request body (specifying a tokenizer and filters)
{
"text": "text",
"tokenizer": "tokenizer name",
"tokenFilters":(optional)[filters (one or more)],
"charFilters":(optional)[filters (one or more)]
}

Query processing with synonym maps
Search for "Microsoft"
→ Synonym map: Microsoft, MSFT, MS, マイクロソフト …
→ Expanded query: Microsoft OR MSFT OR MS OR マイクロソフト …
→ Index
A query for "Microsoft" also hits documents that contain "マイクロソフト", "MSFT", or "MS"

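Synonym map definition (one rule per line, separated by \n inside the JSON string)
{
  "name": "mysynonymmap",
  "format":"solr",
  "synonyms": "MS, MSFT, Microsoft\nWashington, Wash., WA => WA\npet => cat, dog, puppy, pet"
}
Field definition that references the synonym map
{
  "name":"myfieldname",
  "type":"Edm.String",
  "searchable":true,
  "analyzer":"en.lucene",
  "synonymMaps":[ "mysynonymmap" ]
}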

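Format details: see the Lucene SolrSynonymParser API reference
Explicit mapping (terms on the left are replaced by the term on the right):
i-pod, i pod => ipod
Equivalent synonyms:
i-pod, i pod, ipod
Multiple rules for the same term are merged, so these two rules
foo => foo, bar
foo => baz
are equivalent to
foo => foo, bar, baz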
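
The rule formats above are simple enough to parse by hand, which helps when checking a synonym list before uploading it. The following is a simplified sketch, not the Lucene parser: `parse_synonyms` is a hypothetical helper, it ignores escaping and comments, and its `expand` flag follows the expand=true/false behavior described for the synonym token filter in this deck.

```python
def parse_synonyms(rules, expand=True):
    """Turn Solr-style rules into {source term: [replacement terms]}, merging repeated sources."""
    mapping = {}
    for rule in rules:
        rule = rule.strip()
        if not rule:
            continue
        if "=>" in rule:
            left, right = rule.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            terms = [t.strip() for t in rule.split(",")]
            sources = terms
            # expand=True: all terms are mutually equivalent; otherwise map to the first term
            targets = terms if expand else terms[:1]
        for source in sources:
            merged = mapping.setdefault(source, [])
            for target in targets:
                if target not in merged:
                    merged.append(target)
    return mapping

print(parse_synonyms(["foo => foo, bar", "foo => baz"]))
# {'foo': ['foo', 'bar', 'baz']}
print(parse_synonyms(["i-pod, i pod => ipod"]))
# {'i-pod': ['ipod'], 'i pod': ['ipod']}
```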

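String splitting approaches
• N-gram
• Morphological analysis
Other techniques
• Stemming / lemmatization
• Normalization
• Stop-word removal
• Anti-phrasing
• Synonym expansion
• Spell checking
• Query suggestions
• Facets / navigation
• Clustering / entity extraction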

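N-gram
Splits the whole text mechanically into N-character pieces, ignoring context and word boundaries. No dictionary is needed.
Morphological analysis
Analyzes the context and breaks the text into words to extract tokens. A dictionary is needed for the analysis.

2-gram, English example: When in Rome
"Wh", "he", "en", "n ", " i", "in", "n ", " R", "Ro", "om", "me"
2-gram, Japanese example: 東京都 ルパン上映時間
"東京", "京都", "都 ", " ル", "ルパ", "パン", "ン上", "上映", "映時", "時間"

Morphological analysis, English example: When in Rome
"When", "in", "Rome"
Morphological analysis, Japanese example: 東京都 ルパン上映時間
"東京都", "ルパン", "上映", "時間"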
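
The N-gram side of this comparison needs no dictionary, which is easy to see in code. A minimal sketch follows; `char_ngrams` is a hypothetical helper name, and production tokenizers add options such as minimum and maximum gram sizes.

```python
def char_ngrams(text, n=2):
    """Slide a window of n characters over the text, ignoring word boundaries."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("When in Rome"))
# ['Wh', 'he', 'en', 'n ', ' i', 'in', 'n ', ' R', 'Ro', 'om', 'me']
print(char_ngrams("東京都"))
# ['東京', '京都']
```

The second call shows why a search for "京都" matches text about "東京都": the bigram "京都" is produced even though no such word boundary exists in the sentence.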

Stemming
Cuts off word endings to unify words to their stem. Handles inflectional and derivational endings.
• engineering, engineers, engineered → engineer
• car, cars, car's, cars' → car
• compressing, compressed → compress
• コンピューター → コンピュータ
• コーナー → コーナ
Lemmatization
The process of converting a word to its dictionary form (lemma).
• am, are, is → (to) be
• gone, going, goes, went → go
• 行われ → 行う

Normalization examples
• U.S.A → USA
• Co-education → coeducation
• Half-width katakana → full-width katakana
• Katakana → hiragana
• Alphabētikós Katálogos → Alphabetikos Katalogos (removing non-spacing marks)
• Αλφαβητικός Κατάλογος → Alphabētikós Katálogos (transliteration to Latin)
• 簡化字 → 简化字

Stop-word removal example
Instructions are applicable to these Adventure Works Cycles models
↓
Instructions applicable Adventure Works Cycles models

Anti-phrasing example
Who is Miles Davis?
↓
Miles Davis?

Synonym expansion examples
• 二酸化炭素 → 二酸化炭素, co2, 炭酸ガス
• マイクロソフト → マイクロソフト, MS, 日本MS, 日本マイクロソフト, Microsoft Japan, Microsoft
• ヴァーチャル → ヴァーチャル, バーチャル
• ダイヤモンド → ダイヤモンド, ダイアモンド

"analyzers":(optional)[
{
"name":"analyzer_name_1",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters":[ "char_filter_name_1", "char_filter_name_2" ],
"tokenizer":"tokenizer_name",
"tokenFilters":[ "token_filter_name_1", "token_filter_name_2" ]
},
{
"name":"analyzer_name_2",
"@odata.type":"#analyzer_type",
...
}
],
"charFilters":(optional)[
{
"name":"char_filter_name",
"@odata.type":"#char_filter_type",
"option1":"value1", "option2":"value2", ...
}
],
"tokenizers":(optional)[
{
"name":"tokenizer_name",
"@odata.type":"#tokenizer_type",
}
],
"tokenFilters":(optional)[
{
"name":"token_filter_name",
"@odata.type":"#token_filter_type",
}
]
Analysis in Azure Search
https://msdn.microsoft.com/en-
us/library/azure/mt605304.aspx
文字フィルタ
トークナイザ
トークンフィルタ

"analyzers":[
  {
    "name":"my_ngram_ja",
    "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
    "charFilters": ["html_strip"],
    "tokenizer":"my_tokenizer",
    "tokenFilters":[ "cjk_width", "lowercase", "my_synonym_filter" ]
  }
],
"tokenizers":[
  {
    "name":"my_tokenizer",
    "@odata.type":"#Microsoft.Azure.Search.NGramTokenizer",
    "minGram":1,
    "maxGram":3
  }
],
"tokenFilters":[
  {
    "name":"my_synonym_filter",
    "@odata.type":"#Microsoft.Azure.Search.SynonymTokenFilter",
    "synonyms": [
      "吾輩,わがはい,私,自分",
      "猫,ねこ,ネコ,CAT"
    ],
    "ignoreCase": true,
    "expand": true
  }
]
Synonym settings
• "吾輩, わがはい, 私, 自分"
• "猫, ねこ, ネコ, CAT"

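[Diagram: how the custom analyzer processes text]
Indexing: html_strip (removes HTML tags) → tokenization (3-gram) → half-width katakana converted to full-width → synonym expansion of "吾輩" and "猫": (吾輩|わがはい|私|自分) (猫|ねこ|ネコ|CAT)
Query: tokenization (3-gram) → synonym expansion of "ネコ": (猫|ねこ|ネコ|CAT)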

name              type (char_filter_type)     Description and Options
html_strip        HtmlStripCharFilter         Character filter that removes HTML tags
mapping           MappingCharFilter           Filter that maps characters to characters
pattern_replace   PatternReplaceCharFilter    Filter that rewrites strings matching a regular-expression pattern

name        type (tokenizer_type)         Description and Options
nGram       NGramTokenizer                Tokenizer that splits strings using N-grams
edgeNGram   EdgeNGramTokenizer            Tokenizer that splits strings using edge N-grams
-           MicrosoftLanguageTokenizer    • maxTokenLength: maximum token length
                                          • isSearchTokenizer: used as search or index tokenizer (depending on the language the behavior may be different)
                                          • language: allowed values: "bengali", "bulgarian", "catalan",
"chinese_simplified", "chinese_traditional", "croatian", "czech", "danish", "dutch",
"english", "french", "german", "greek", "gujarati", "hindi", "icelandic", "indonesian", "italian",
"japanese", "kannada", "korean", "malay", "malayalam", "marathi", "norwegian_bokmaal",
"polish", "portuguese", "portuguese_brazilian", "punjabi", "romanian", "russian", "serbian_cyrillic",
"serbian_latin", "slovenian", "spanish", "swedish", "tamil", "telugu", "thai", "ukrainian", "urdu",
"vietnamese"

name        type (token_filter_type)   Description and Options
cjk_width   CjkWidthTokenFilter        Token filter that unifies full-width and half-width characters in CJK languages (Chinese, Japanese, Korean) so they match more easily. Converts full-width alphanumerics to half-width and half-width katakana to full-width. No options
lowercase   LowercaseTokenFilter       Token filter that converts all uppercase characters to lowercase. No options
synonym     SynonymTokenFilter         Token filter for synonym expansion
Options:
• synonyms: list of synonyms, as an array […]
• "A, B => C" form (A and B are converted to C internally): incredible, unbelievable, fabulous => amazing
• "A, B, C" form (with expand=true, A, B, and C are mutually equivalent): incredible, unbelievable, fabulous, amazing
• ignoreCase: true|false (default false)
• expand: true|false
• When expand=true: A, B, C defined in synonyms are treated as mutually equivalent
• When expand=false: A, B, C defined in synonyms are treated as A, B, C => A

[Diagram: what a user searching for machine learning may actually want]
User intent: Want to study it? (Offline? Online?) / Want to adopt it? / Troubleshooting? / On a PC? / In the cloud?
Data: machine learning, regression, SVM, Decision Tree, Deep Learning, CNN, RNN, reinforcement learning, Q Learning
Relevancy: linking the user's input to the data

companies
Google
Microsoft
Facebook
record-id companies
1 [
“v02”,
“v01”,
“v05”
]
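Natural language processing (NLP) is hard…
Sample text: Deep Learning made a dramatic technical leap in 2012 thanks to researchers at Google. Companies such as Microsoft and Facebook then entered the field in earnest. Since 2015, Microsoft has developed models that perform on par with humans in several areas, such as photo recognition.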
Olgaも入っておりますが、

Speech / Language / Vision
The road to automatic metadata extraction is now open!

Tags
“throwing”, “ball”, “girl”, “grass”,
“basketball”
Caption
“A girl throwing a ball”

Entities
Persons
“Anita Christiansen”,
“Conrad Nuber”,
Locations
“Bothell”, “Woodinville”
Organization
“Litware Insurance Corp.”

John F. Kennedy (JFK)
November 22, 1963

JFK FILES COGNITIVE SEARCH ARCHITECTURE
[Architecture diagram] Components: Web App (azsearch.js), Blob Storage, Azure Search, Cosmos DB, Azure Machine Learning, Azure Function
Cognitive Skill Set
• Skills: Computer Vision, OCR + Handwriting, Entity Linking
• Custom skills: CIA Cryptonyms, Topics
At its core, this is information retrieval (search)

多様なファイルフォーマットへ

Natural language processing is built in! (See the Azure Search built-in module list earlier in this deck.)

{ "name":"color", "type":"Edm.String", "searchable": false,
{ "name":"size", "type":" Edm.Int32", "searchable": false,
{ "name":"price", "type":" Edm.Int32", "searchable": false,
/indexes/myindex/docs
facet
facet
facet
search
"@search.facets": {
"color": [
{ "count": 4, "value": "Red“ },
{ "count": 3, "value": "Black“ },
{ "count": 3, "value": "Yellow“ }
],
"size@odata.type":
"size": [
{"count": 2, "value": 62 },
{"count": 2, "value": 60 },
..
],
},
Metadata | representing structured data

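[Diagram: search pipeline with content extraction]
Ingest (index generation): documents → Content Extraction → terms → Analyzer → analyzed terms → IndexWriter → Index (inverted index)
• Content Extraction: extracts text from files and file metadata
• Analyzer: performs text analysis, expanding, transforming, and removing tokens
Retrieve (search processing): query text → QueryParser (Simple | Lucene) → query tree → query terms → Analyzer → analyzed terms → Search Engine
• QueryParser: parses the query text and converts it into the internal query form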

[Diagram: the same pipeline with an "ENRICH" stage added to ingestion]
Ingest: documents → Content Extraction → "ENRICH" (skills produce annotations) → Analyzer → IndexWriter → Index
ドキュメ
ント
Index
{
"name":"01-hellodoc",
"dataSourceName" : "01-hellodoc",
"targetIndexName" : "01-hellodoc",
"skillsetName" : "01-hellodoc-skillset",
"fieldMappings" : [
{
"sourceFieldName" : "metadata_storage_path",
"targetFieldName" : "metadata_storage_path",
"mappingFunction" : { "name" : "base64Encode" }
}
],
"outputFieldMappings" :
[
{
"sourceFieldName" : "/document/organizations",
"targetFieldName" : "organizations"
},
…

ドキュメ
ント
Index
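0. [Optional] Prepare the data (Blob storage, etc.)
1. Create the Azure Search service
2. Create the data source
3. Create the skillset
4. Create the index
5. Create the indexer
   1. It references the data source, skillset, and index
   2. Set the run schedule
      1. Without a schedule, the indexer runs when it is created
6. Check the behavior through the indexer status
7. Confirm the stored results with a search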
https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-intro#where-do-i-start


Explore
Azure
Storage
Azure Functions
-Cryptonyms
-Redactions
Cognitive Skills
-OCR + Handwriting
-Computer Vision
-Entities
Azure ML
Search Index
Azure Search
Content
Extraction
JFK FILES
COGNITIVE SEARCH
ARCHITECTURE

• Fully managed (PaaS)
• Indexer add-in
  • Pull model only
• Pre-built skills
  • Azure Cognitive Services plus extras
• Region
  • South Central US or West Europe
• API version
  • api-version=2017-11-11-Preview
• Extensibility
  • Can call any REST API
• Currently no additional charge!

Skillset
ドキュメ
ント
Index
outputFieldMappings

Key Phrase Extraction
Sentiment Analysis
Organization Entity Extraction
Location Entity Extraction
Persons Entity Extraction
Language Detection
Face Detection
Tag Extraction
Celebrity Recognition
Landmark Detection
Handwriting Recognition (Preview)
Printed Text Recognition
https://docs.microsoft.com/ja-jp/azure/search/cognitive-
search-predefined-skills

Sentiment Analysis
Language Detection
Face Detection
Tag Extraction
Landmark Detection
https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-skill-textsplit
https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-skill-textmerger
文字数
Split

Sentiment Analysis
Language Detection
Face Detection
Tag Extraction
Landmark Detection
サイズ変更
https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-image-scenarios

{
"fields": [
// other fields go here.
{
"name": "enriched",
"type": "Edm.String",
"searchable": false,
"sortable": false,
"filterable": false,
"facetable": false
}
]
}

Azure Machine
Learning
3rd Party

“Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed
do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut
enim ad minim veniam, quis
nostrud exercitation ullamco
laboris nisi…”
Class A
Class B
Class C

laboris nisi…”
laboris nisi…”
Entity type A
Entity type B

Labeled
Data
Named
Entity
Extraction
Azure ML
Annotated
Documents
Customer
Data
Search
Index

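Search starts (search=surface)
→ Document relevance is quantified (scoring)
→ Results are shown in rank-score order
If a sort condition is given (orderby=price), results are ordered by that condition instead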

Rank score = Σ (TF-IDF based score + adjustment by the scoring profile)
Ranking is tuned through scoring profiles

Term Frequency (TF): how often a term appears
Inverse Document Frequency (IDF): how distinctive (rare) a term is
https://ja.wikipedia.org/wiki/Tf-idf
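As a rough illustration of the two factors, the sketch below computes a textbook TF-IDF over the four sample documents used on the inverted-index slide earlier in this deck. The whitespace tokenizer and the exact formula are assumptions made for illustration; the service's actual scoring is a Lucene-style variant and will not produce these numbers.

```python
import math
from collections import Counter

# The four sample documents from the inverted-index slide.
docs = {
    1: "Microsoft is introducing SQL Server",
    2: "Windows Server on Azure",
    3: "Microsoft is introducing Azure",
    4: "Application programming on Microsoft Azure",
}


def tokenize(text):
    # Simplistic stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}


def tf(term, tokens):
    # Term frequency: share of the document's tokens that are this term.
    return Counter(tokens)[term] / len(tokens)


def idf(term):
    # Inverse document frequency: rarer terms get a larger value.
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / df) if df else 0.0


def tf_idf(term, doc_id):
    return tf(term, tokenized[doc_id]) * idf(term)


if __name__ == "__main__":
    for term in ("sql", "azure"):
        print(term, [round(tf_idf(term, d), 3) for d in docs])
```

"sql" appears in only one document, so it scores higher there than "azure", which appears in three of the four.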

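• Profile name: selected at query time with search=<keyword>&scoringProfile=<profile name>
• Field weight settings (valid only for searchable fields)
• Functions (valid only for filterable fields)
– freshness: boost by degree of freshness
– magnitude: boost by degree of a numeric value or range
– distance: boost by degree of distance
– tag: boost depending on whether the values specified by tag are included
• How function values are aggregated: sum (default) | average | minimum | maximum | firstMatching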

Document
Title = "Azure Search Deep Dive"
Description = Many applications use search as the primary interaction … Microsoft …
LastUpdate = 2016-04-28
Rating = 5
Query
/indexes/myindex/docs?search=Azure%20Search&scoringProfile=myScoreProfile
Score calculation (Σ)
Default scoring: TF-IDF-based score
Scoring functions: tag boost, distance boost, freshness boost, magnitude boost
Values shown in the diagram: +0.3, 0, +0.2, +0.2, +0.5
functionAggregation = sum (default) | average | minimum | maximum | firstMatching
How the boost values from the profile functions are aggregated is determined by functionAggregation.
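To make the aggregation modes concrete, here is a toy model of how several function boosts could be combined under each functionAggregation value. The function name and the boost values are hypothetical, and this is only a sketch of the idea, not the service's internal formula.

```python
def aggregate_boosts(boosts, mode="sum"):
    """Combine per-function boost values according to an aggregation mode.

    `boosts` is an ordered list of boost values, one per scoring function.
    """
    if not boosts:
        return 0.0
    if mode == "sum":
        return sum(boosts)
    if mode == "average":
        return sum(boosts) / len(boosts)
    if mode == "minimum":
        return min(boosts)
    if mode == "maximum":
        return max(boosts)
    if mode == "firstMatching":
        # Use the first function in the profile's order.
        return boosts[0]
    raise ValueError(f"unknown aggregation mode: {mode}")


if __name__ == "__main__":
    # Hypothetical boosts for tag, distance, freshness, magnitude.
    boosts = [0.0, 0.2, 0.2, 0.5]
    for mode in ("sum", "average", "minimum", "maximum", "firstMatching"):
        print(mode, aggregate_boosts(boosts, mode))
```

With sum, every function contributes to the score; with maximum or firstMatching, a single function decides the boost, which makes ranking easier to reason about when functions overlap.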

{
"name":"tags",
"type":"Collection(Edm.String)",
"searchable":false,
"filterable":true,
"sortable":false,
"facetable":false
}
{
"name": "personalizedBoost",
"functions": [
{
"type": "tag",
"boost": 5,
"fieldName": "tags",
"tag": {
"tagsParameter":"featuredtags"
}
}
]
}
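search=keyword
&scoringProfile=personalizedBoost
&scoringParameter=featuredtags:TAG1,TAG2,TAG3..
• fieldName: the name of the tag field
• tagsParameter: the parameter name to pass at query time
• scoringParameter: the tags personalized for each user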
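The query string for the tag boost can be assembled per user. The sketch below is one way to do it with the standard library; the helper name is hypothetical, and it follows the name:value syntax shown on the slide, so check the separator expected by the API version you target.

```python
from urllib.parse import parse_qs, quote, urlencode


def build_tag_boost_query(keyword, tags,
                          profile="personalizedBoost",
                          parameter="featuredtags"):
    """Return a URL-encoded query string that boosts documents by tag.

    `profile` and `parameter` match the scoring profile definition above.
    """
    params = {
        "search": keyword,
        "scoringProfile": profile,
        # Slide syntax: <tagsParameter name>:<comma-separated tags>
        "scoringParameter": f"{parameter}:{','.join(tags)}",
    }
    return urlencode(params, quote_via=quote)


if __name__ == "__main__":
    query = build_tag_boost_query("search engine", ["Azure", "Cloud"])
    print(query)
    # Round-trip check: decoding gives back the original values.
    print(parse_qs(query))
```

Keeping the user's interest tags outside the index and passing them only at query time means the same index serves every user; only the scoringParameter changes.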

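Normalization examples
• U.S.A → USA
• Co-education → coeducation
• Half-width katakana → full-width katakana
• Katakana → hiragana
• Alphabētikós Katálogos → Alphabetikos Katalogos (diacritical marks)
• 簡化字 → 简化字 (traditional → simplified Chinese characters)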
import unicodedata

# NFKC (Normalization Form Compatibility Composition) in unicodedata.normalize
# normalizes half-width katakana, full-width symbols, voiced sound marks,
# special characters, and so on.
data = "㈱㍉㌶（％＆！？＠＃）ｶﾀｶﾅｻﾞｻﾞｻﾞｻﾞｻﾞｱ"
normal = unicodedata.normalize("NFKC", data)
print(normal)
# => (株)ミリヘクタール(%&!?@#)カタカナザザザザザア
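Two of the normalization examples listed above are not covered by Unicode compatibility normalization alone: removing diacritical marks and converting katakana to hiragana. The helpers below are a minimal sketch of both, with hypothetical function names, for preprocessing done outside Azure Search. Note that strip_diacritics is meant for Latin text; applied to Japanese it would also strip voiced sound marks.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def katakana_to_hiragana(text):
    """Map full-width katakana to hiragana.

    Katakana U+30A1..U+30F6 sit exactly 0x60 above hiragana U+3041..U+3096.
    """
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


if __name__ == "__main__":
    print(strip_diacritics("Alphabētikós Katálogos"))
    print(katakana_to_hiragana("カタカナ"))
```

Whatever normalization is chosen has to be applied both to the indexed text and to the query keywords, otherwise the two sides will no longer match.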

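Component | Role | Where it is configured | Cost of change
Analyzer | Tokenizes text | Index schema | High (low when adding a new field with its own analyzer setting)
Query | Narrows down tokens; controls matching behavior and result evaluation | Query parameters | Low
Ranking | Computes relevance (score) | Scoring profile, query parameters | Low
Synonym dictionary | Dictionary-based keyword expansion (query side only) | Synonym dictionary, index schema | Low (high when a new definition must be added to an existing field)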

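First steps toward reasonably good results in a short time
Define optimal field attributes
• Enable only the features you need. In particular, avoid unnecessary linguistic analysis (searchable).
Choose the right analyzer
• It is the foundation of text analysis, so choose carefully
• For Japanese, it is basically a choice between ja.lucene and ja.microsoft
Scoring: tune field weights
• Set field weights on searchable fields
Choose query parameters
• searchMode, queryType, $filter, search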

{
"name": "qnakb",
"fields": [
{ "name":"id", "type":"Edm.String", "key":true,
"searchable":false, "filterable":false, "sortable":false, "facetable":false },
{ "name":"question", "type":"Edm.String", "searchable":true, "filterable":false,
"sortable":false, "facetable":false, "analyzer":"ja.lucene" },
{ "name":"answer", "type":"Edm.String", "searchable":true, "filterable":false,
"sortable":false, "facetable":false, "analyzer":"ja.lucene" },
{ "name":"category", "type":"Edm.String", "searchable":false,
"filterable":true },
{ "name":"url", "type":"Edm.String", "searchable":false,
"filterable":false, "sortable":false, "facetable":false },
{ "name":"tags", "type":"Collection(Edm.String)",
"searchable":false, "filterable":true, "sortable":false, "facetable":false }
],
…
}
The question and answer fields are searchable and use the ja.lucene analyzer.

{
"fields": […],
"scoringProfiles": [
{
"name": "weightedFields",
"text": {
"weights": {
"question": 9,
"answer": 1
}
}
}
]
}
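&searchMode=any
&queryType=full
&search="keyword"
&scoringProfile=weightedFields
(&$filter=category eq 'category name')
• search: to restrict the search to a specific field, use field scoping (question:keyword). It is not used here because the answer field should also be considered.
• $filter: add this when narrowing by category.
• scoringProfile: the field weights are set to 9 for question and 1 for answer.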
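The query parameters above can be put together as follows. This is a sketch using only the standard library; the helper names are hypothetical, and the api-version value is taken from the version table later in this deck. The one non-obvious detail is that a single quote inside an OData string literal has to be doubled.

```python
from urllib.parse import parse_qs, quote, urlencode


def odata_string(value):
    # OData string literals are single-quoted; embedded quotes are doubled.
    return "'" + value.replace("'", "''") + "'"


def build_query(keyword, category=None, api_version="2016-09-01"):
    """Build the query string for the weightedFields scoring profile."""
    params = {
        "api-version": api_version,
        "searchMode": "any",
        "queryType": "full",
        "search": keyword,
        "scoringProfile": "weightedFields",
    }
    if category is not None:
        # Optional narrowing by category; the field must be filterable.
        params["$filter"] = "category eq " + odata_string(category)
    return urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(build_query("password reset"))
    print(parse_qs(build_query("password reset", category="O'Reilly")))
```

Leaving the search text unscoped lets both question and answer match, while the 9:1 field weights still push question matches to the top.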

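What you can do to further improve accuracy and usability
Handling misspellings and typos
• Fuzzy search and proximity search
Personalizing ranking
• Change the ranking according to the user's location and interests
– Distance / tag boost
Running your own text analysis
• An approach that covers processing Azure Search does not support with processing outside Azure Search
• Example: normalize keyword strings and remove noise beforehand
Customizing analyzers (△)
• Customize analyzer behavior with a custom analyzer. However, as of May 2017 the Japanese modules are insufficient, so little benefit can be expected for Japanese search.
Supporting synonyms and near-synonyms
• Enable the synonym dictionary feature (Public Preview) on fields where you want higher recall. The dictionary can be updated incrementally.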
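User X searches for "〇✖△", specifying the tag boost profile (*) and the tags of interest.
Keywords user X is interested in: Azure, 検索 (search), クラウド (cloud), サーチ (search)
Documents matched by the search
• Document A (score: 0.312), Tags: none
• Document B (score: 0.291), Tags: Azure
• Document C (score: 0.164), Tags: none
Final result order after the score boost
• Document B (score: 0.91), Tags: Azure
• Document A (score: 0.312), Tags: none
• Document C (score: 0.164), Tags: none
The results that interest user X came out on top!
* See the APPENDIX for a complete configuration example for tag boost.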

70% of search terms are ones nobody could have anticipated

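Item | Operation logs | Metrics
Storage container | insights-logs-operationlogs | insights-metrics-pt1m
Contents | Index creation, search queries, suggest queries, etc. | Query latency, queries per second (QPS); per-minute granularity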

{
  "time": "2016-05-07T09:15:24.3901416Z",
  "resourceId": "/SUBSCRIPTIONS/87C7C7F9-0C9F-47D1-A856-1305A0CBFD7A/RESOURCEGROUPS/RG-SEARCH-DEMO/PROVIDERS/MICROSOFT.SEARCH/SEARCHSERVICES/YOICHIKADEMO0",
  "operationName": "Query.Search",
  "operationVersion": "2015-02-28",
  "category": "OperationLogs",
  "resultType": "Success",
  "resultSignature": 200,
  "durationMS": 41,
  "properties": {
    "Description": "GET /indexes('decodesessions2016')/docs",
    "Query": "?$top=12&$select=id,title,url,thumbnail,description&api-version=2015-02-28&search=Azure",
    "Documents": 12,
    "IndexName": "decodesessions2016"
  }
}
{
  "resourceId": "/SUBSCRIPTIONS/87C7C7F9-0C9F-47D1-A856-1305A0CBFD7A/RESOURCEGROUPS/RG-SEARCH-DEMO/PROVIDERS/MICROSOFT.SEARCH/SEARCHSERVICES/YOICHIKADEMO0",
  "metricName": "SearchQueriesPerSecond",
  "time": "2016-05-13T13:14:00Z",
  "average": 0.05,
  "minimum": 0,
  "maximum": 2,
  "total": 3,
  "count": 60,
  "timeGrain": "PT1M"
}
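Because the operation log keeps the raw query string, the search terms users actually typed can be recovered from it. The sketch below parses a record abbreviated from the sample above; the summarize helper and its output keys are hypothetical.

```python
import json
from urllib.parse import parse_qs

# Operation-log record abbreviated from the sample above (resourceId omitted).
SAMPLE = """
{
  "time": "2016-05-07T09:15:24.3901416Z",
  "operationName": "Query.Search",
  "resultSignature": 200,
  "durationMS": 41,
  "properties": {
    "Description": "GET /indexes('decodesessions2016')/docs",
    "Query": "?$top=12&$select=id,title,url,thumbnail,description&api-version=2015-02-28&search=Azure",
    "Documents": 12,
    "IndexName": "decodesessions2016"
  }
}
"""


def summarize(record):
    """Pull the fields needed for search-term analysis out of one record."""
    props = record["properties"]
    query = parse_qs(props["Query"].lstrip("?"))
    return {
        "index": props["IndexName"],
        "search": query.get("search", [""])[0],
        "duration_ms": record["durationMS"],
        "documents": props["Documents"],
    }


summary = summarize(json.loads(SAMPLE))

if __name__ == "__main__":
    print(summary)
```

Aggregating the search value over many records gives the list of real search terms, including the ones that returned zero documents.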

The collected search operation logs and metrics logs can easily be visualized through Power BI integration.

Azure Search
Index
SQL Server /
SQL Database
Admin Key: management
Query Key: search
DocumentDB

var q = encodeURIComponent($("#q").val());
var searchAPI =
  "https://yoichikademo0.search.windows.net/indexes/decodesessions2016/docs" +
  "?$top=12&$select=id,title,track,url,thumbnail,description" +
  "&api-version=2015-02-28&search=" + q;
inSearch = true;
$.ajax({
  url: searchAPI,
  beforeSend: function (request) {
    request.setRequestHeader("api-key", "A86C8C8929A5225D5120A151B584C5B6");
    request.setRequestHeader("Content-Type", "application/json");
    request.setRequestHeader("Accept", "application/json; odata.metadata=none");
  },
  type: "GET",
  success: function (data) {
    // …
  }
});

Interface | Latest version | Status
.NET SDK | 3.0 | Generally Available, released November 2016
.NET SDK Preview | 2.0-preview | Preview, released August 2016
Service REST API | 2016-09-01 | Generally Available
Service REST API Preview | 2015-02-28-Preview | Preview
.NET Management SDK | 2015-08-19 | Generally Available
Management REST API | 2015-08-19 | Generally Available
https://docs.microsoft.com/en-us/azure/search/search-api-versions

Recall ↑: more search hits, but more search noise
Precision ↑: higher search accuracy, but more missed results
Find the optimal balance point between recall and precision.
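The trade-off can be measured once a set of relevant documents is known for a test query. The sketch below uses the standard definitions of precision and recall; the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant hits / all retrieved documents
    recall    = relevant hits / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = {"A", "B", "C", "D"}
    # A strict query: little noise, but half of the relevant documents missed.
    print(precision_recall({"A", "B"}, relevant))
    # A loose query: everything relevant is found, along with noise.
    print(precision_recall({"A", "B", "C", "D", "E", "F", "G", "H"}, relevant))
```

Running the same test queries before and after a change to analyzers, synonyms, or query parameters shows in which direction the change moved the balance.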

Product purchases, review posting
Product catalog updates
DocumentDB
SQL Database
Azure Search
Azure Table
AdventureWorks
Azurewebsites.net
Product catalog, reviews, ratings
Purchases
Product catalog search
Shopping cart

https://docs.microsoft.com/ja-jp/azure/search/search-get-started-portal

© 2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Editor's Notes