Building Structured Data from
Product Descriptions
Keiji Shinzato
Product information extraction

An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany.

Type           Red
Grape variety  Sangiovese
Region         Italy, Tuscany
2
Background

• Structured data plays a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and
market analysis.

ベリンダ・コーリー キアンティ (Chianti) 2011 750ml
One of Italy's representative red wines, made mainly from Sangiovese grapes from the Chianti area of Tuscany.

Attribute  Value
Type       Red
Region     Italy, Chianti area of Tuscany
Grape      Sangiovese
Vintage    2011

3
Faceted navigation

Reference: http://www.amazon.com/
4
Background

• Structured data plays a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and market analysis.

• An unsupervised methodology is required.
– 100 million products / 40,000 categories.

5
A table is a useful clue, but…

Product page including a table:
WINE > CHILE
Montes Alpha M 2009

Type    Red
Region  Chile
Grape   Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Year    2009

38%

Product page consisting of sentences:
WINE > CHILE
Montes Alpha M 2009
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a …
6
Product information extraction
Product page (unstructured):
WINE > CHILE
Montes Alpha M 2009
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …

Structured data:

Attribute  Value
Type       Red
Region     Chile
Grape      Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Vintage    2009
Company    Montes

• Issue 1: How do we know the attributes for a category?
• Issue 2: How do we extract attribute values from full texts?
7
Attribute name collection
Analyze a large amount of table data to collect the attributes of an object.

(Figure: a product-page table with the attribute names and attribute values of wine marked.)

Reference: http://item.rakuten.co.jp/redbox/odm3000728/
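
A minimal sketch of this table-analysis step, under stated assumptions: each product page has already been reduced to a list of (attribute name, value) rows, and an attribute name is kept only if it appears on at least `min_pages` pages. The function name `collect_attributes`, the `min_pages` threshold, and the sample rows are all hypothetical; the slides do not say how names are filtered.

```python
from collections import Counter, defaultdict


def collect_attributes(pages, min_pages=2):
    """Collect attribute names and their values from table rows.

    `pages` is a list of product pages; each page is a list of
    (attribute_name, value) rows taken from its table.
    The `min_pages` cut-off is an assumption made for this sketch.
    """
    name_counts = Counter()
    values = defaultdict(set)
    for rows in pages:
        # Count each name once per page so one long table cannot dominate.
        for name in {name for name, _ in rows}:
            name_counts[name] += 1
        for name, value in rows:
            values[name].add(value)
    kept = {name for name, count in name_counts.items() if count >= min_pages}
    return {name: values[name] for name in kept}


# Hypothetical table rows from three wine pages.
pages = [
    [("Type", "Red"), ("Region", "Chile"), ("Year", "2009")],
    [("Type", "White"), ("Region", "France")],
    [("Type", "Red"), ("Region", "Italy"), ("Shop note", "Free shipping")],
]
wine_db = collect_attributes(pages)
print(sorted(wine_db))
```

Names that occur on a single page ("Year", "Shop note" in this toy input) are dropped, which is one simple way to keep shop-specific rows out of the attribute list.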
8
Attribute value database (wine)
Columns: ぶどう品種 (Grape variety), 内容量 (Volume), 産地 (Region), 生産者 (Winery), 味わい (Taste)

Grape variety    Volume  Region     Winery            Taste
Chardonnay       750ML   France     Farnese           Dry
Chardonnay 100%  720ML   Italy      Mas de Monistrol  Full body
Merlot           375ML   Spain      Leroy             Medium body
Riesling         500ML   Chile      M. Chapoutier     Slightly sweet
Syrah            1500ML  German     Mastroberardino   Sweet
Grenache         360ML   Australia  Santero           Medium dry
Merlot           200ML   America    Saltarelli        Extremely sweet
Tempranillo      3000ML  Bordeaux   Cavicchioli       Medium dry
Sangiovese       1800ML  Champagne  Fontodi           Red Full body
Syrah100%        1000ML  Argentina  Ca'Rugate         Middle sweet

Precision is high, but coverage is low.
9
Product information extraction

• Issue 1: How do we know the attributes for each category?
• Issue 2: How do we extract attribute values from product descriptions?
10
Unsupervised attribute value extraction
- distant supervision approach

• Construction: build a database of attribute-value pairs from semi-structured data, e.g. <Region, Margaux>, <Color, White>.
• Annotation: annotate product pages that include entries in the database, e.g. the page for "Chateau d’Issan 1994": "This is a wine from Margaux. ..."
• Generation: generate extraction rules from the annotated pages, e.g. "wine from x ⇒ x is a Region". The rule is generated through a machine learning algorithm.
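
The annotation step can be sketched as plain dictionary matching. This is only an illustration under assumptions: the function name `annotate`, the longest-match-first policy, and the tiny `database` are hypothetical, and the slides do not describe the matching procedure in this detail.

```python
def annotate(text, database):
    """Wrap every database value found in `text` with its attribute tag.

    `database` maps an attribute name to a set of known values.
    Longest-match-first is an assumption made for this sketch.
    """
    # Try longer values first so a long value wins over its own prefix.
    entries = sorted(
        ((value, attr) for attr, vals in database.items() for value in vals),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    out, i = [], 0
    while i < len(text):
        for value, attr in entries:
            if text.startswith(value, i):
                out.append(f"<{attr}>{value}</{attr}>")
                i += len(value)
                break
        else:
            # No database value starts here; copy one character.
            out.append(text[i])
            i += 1
    return "".join(out)


database = {"Region": {"Margaux", "Tuscany"}, "Color": {"White", "Red"}}
print(annotate("This is a wine from Margaux.", database))
```

Because this matches raw strings, it will also tag a value that happens to occur inside an unrelated word, so the resulting training data is noisy.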
11
Corpus with attribute-value annotations (wine)
• J: <産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。
  E: A spicy and gorgeous wine that is said to be the most aromatic in <production_area> Alsace </production_area>.
• J: 最もお手頃で、<生産者>ドメーヌ・ペゴー</生産者>の美味しさを気軽に楽しめる、とっても嬉しい一本なのです
  E: This is a very nice wine because we can easily enjoy the taste of <winery> Domaine Pegau </winery> at the best price.
• J: <ぶどう品種>ソーヴィニヨン・ブラン</ぶどう品種>種の特長がよく表れたワイン。
  E: A wine in which the characteristics of the <grape_variety> Sauvignon Blanc </grape_variety> variety are well expressed.
• J: <タイプ>白</タイプ>身魚の塩焼きやシンプルな味付けのソテー、焼き牡蠣、豚のしょうが焼き、ボンゴレビアンコなどと。
  E: With salt-grilled <type> white </type>-fleshed fish, simply seasoned sautés, grilled oysters, ginger pork, vongole bianco, and so on.
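
Annotated sentences in this inline-tag form can be turned into plain text plus character spans before training. The sketch below assumes tags are never nested and that whitespace just inside a tag is not part of the value; `parse_annotated` and the `TAG` pattern are hypothetical helpers, not part of the original system.

```python
import re

# Matches <name>value</name>; the closing tag must repeat the opening name.
TAG = re.compile(r"<([^<>/]+)>(.*?)</\1>")


def parse_annotated(sentence):
    """Return (plain_text, spans); spans are (start, end, attribute)
    character offsets into plain_text, with `end` exclusive."""
    plain, spans = [], []
    pos = 0      # read position in the tagged sentence
    offset = 0   # write position in the untagged text
    for m in TAG.finditer(sentence):
        before = sentence[pos:m.start()]
        plain.append(before)
        offset += len(before)
        value = m.group(2).strip()
        spans.append((offset, offset + len(value), m.group(1)))
        plain.append(value)
        offset += len(value)
        pos = m.end()
    plain.append(sentence[pos:])
    return "".join(plain), spans


text, spans = parse_annotated(
    "<産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。"
)
print(text)
print(spans)
```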

12
Extraction rule generation
• Algorithm: Conditional random fields [Lafferty+ 2001]
• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
• Features:
– Token: Surface form of the token.
– Base: Base form of the token.
– PoS: Part-of-Speech tag of the token.
– Char. type: Types of characters in the token.
– Prefix: Double character prefix of the token.
– Suffix: Double character suffix of the token.
– The above features of ±3 tokens surrounding the token.

They are frequently employed in the task of Japanese named entity recognition.
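
A sketch of how the chunk tags and features listed above might be computed for one sentence. It assumes tokens arrive as (surface, base form, PoS) triples from some morphological analyzer, which the slides do not name; the function names, the feature-key format, and the character-type classes are hypothetical. The resulting tag sequence and feature dictionaries would then be passed to a CRF toolkit.

```python
def iobes_tags(n_tokens, spans):
    """IOBES tags for token spans (start, end, label), `end` exclusive."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"      # single-token value
        else:
            tags[start] = f"B-{label}"      # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"      # inside
            tags[end - 1] = f"E-{label}"    # end
    return tags


def char_type(token):
    """Character classes in a token (a coarse, assumed inventory)."""
    kinds = set()
    for ch in token:
        if ch.isdigit():
            kinds.add("digit")
        elif "\u3040" <= ch <= "\u309f":
            kinds.add("hiragana")
        elif "\u30a0" <= ch <= "\u30ff":
            kinds.add("katakana")
        elif "\u4e00" <= ch <= "\u9fff":
            kinds.add("kanji")
        elif ch.isalpha():
            kinds.add("alpha")
        else:
            kinds.add("other")
    return "+".join(sorted(kinds))


def token_features(tokens, i, window=3):
    """Features of token i and of the tokens within ±window of it."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(tokens):
            surface, base, pos = tokens[j]
            feats[f"{off}:token"] = surface
            feats[f"{off}:base"] = base
            feats[f"{off}:pos"] = pos
            feats[f"{off}:char_type"] = char_type(surface)
            feats[f"{off}:prefix"] = surface[:2]
            feats[f"{off}:suffix"] = surface[-2:]
    return feats


# Hypothetical analyzer output for the start of an annotated sentence.
tokens = [("アルザス", "アルザス", "noun"), ("で", "で", "particle"), ("最も", "最も", "adverb")]
tags = iobes_tags(len(tokens), [(0, 1, "産地")])
feats = token_features(tokens, 0)
print(tags)
```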
14
Unsupervised attribute value extraction
- distant supervision approach

Apply the rules to a product page, e.g. "Terre di matraja Bianco 2012": "This is a wine from Tuscany. ..."

Rule: wine from x ⇒ x is a Region
Rule: 1800 < x <= 2013 ⇒ x is a Vintage

Attribute  Value
Region     Tuscany
Vintage    2012
Grape      Chardonnay
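
The two example rules can be written out directly to show what applying them looks like. This is only an illustration: in the approach described here the rules are learned by a CRF rather than hand-written, and the regular expressions and the `extract` function below are assumptions made for this sketch.

```python
import re


def extract(title, description):
    """Apply the two example rules to a product title and description."""
    result = {}
    # Rule: "wine from x" => x is a Region (x taken as one capitalized word).
    m = re.search(r"wine from ([A-Z][\w'-]*)", description)
    if m:
        result["Region"] = m.group(1)
    # Rule: 1800 < x <= 2013 => x is a Vintage (first four-digit number in range).
    for num in re.findall(r"\b\d{4}\b", title + " " + description):
        if 1800 < int(num) <= 2013:
            result["Vintage"] = num
            break
    return result


print(extract("Terre di matraja Bianco 2012", "This is a wine from Tuscany. ..."))
```

The Grape value in the table is not recoverable from the sample sentence alone; it would come from other sentences on the page.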
16
Performance (F-score)

Category  Without ML  With ML
Wine      43.8 pt.    60.1 pt.
Shampoo   24.1 pt.    71.5 pt.
17
Wine / Japanese

An Italian product. This is a fruity red wine that mainly consists of sangiovese grapes of Tuscany.

Type           Red
Grape variety  Sangiovese
Region         Italy, Tuscany
18
Shampoo / Japanese

"MCH Natural shampoo 1000ml" is a shampoo consisting of cypress oil and charcoal.

Category      Shampoo
Product name  MCH Natural shampoo 1000ml
Ingredient    Cypress oil, Charcoal

19
Video game / French

Product type  Nintendo 64, Nintendo DS
Saga          Mario

20
Conclusion
• Developing a technique for extracting product
information from unstructured data.
– Independent of any category and language.

• Useful services can be realized on structured
product data.
• Our paper is available on the web.
– ACL anthology: http://aclweb.org/anthology//I/I13/

21
Thank you for listening!

22


[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
