홍은기
PYTHON LEARNING FOR
NATURAL LANGUAGE
PROCESSING (2ND)
1. Learning Sequence
2. Lists and Functions
3. Loops
4. Processing Raw Text with NLTK
CONTENTS
• 1. Python Syntax
• 2. Strings and Console Output
• 3. Conditionals and Control Flow
• 4. Functions
• 5. Lists & Dictionaries
• 6. Student Becomes the Teacher(test)
• 7. Lists and Functions
• 8. Loops
• 9. Exam Statistics(test)
• 10. Advanced Topic in Python
• 11. Introduction to Classes
• 12. File Input and Output
LEARNING SEQUENCE
(WWW.CODECADEMY.COM)
LISTS AND FUNCTIONS
LOOPS
PROCESSING RAW TEXT
WITH NLTK
(http://www.nltk.org/book/)
After extracting the text from an HTML document on the
web, I tried to extract keywords from the text with
NLTK.
EXAMPLES
'Multicultural Korean Society' That Turns Away Migrant Children
(http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea)
[('', 65), ('(', 9), (')', 9), (' 한다 ', 6), ("'", 6), (' 있다 ', 5),
(' 아동 ', 5), (' 큰 ', 5), (' 모든 ', 5), (' 일 ', 5), (' 국제 ', 4),
(' 대한민국 ', 4), (' 나라 ', 4), (' 땅 ', 4), (' 국제사회 ', 4),
(' 인권 ', 4), (' 의원 ', 3), (' 세계 ', 3), (' 여의 ', 3), (' 수 ', 3),
(' 안 ', 3), (' 강한 ', 3), (' 불문 ', 2), (' 이주 ', 2), (' 법무부 ', 2)]
1. HTML TO RAW TEXT
# -*- coding: utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup
from nltk import word_tokenize, FreqDist

url = 'http://www.huffingtonpost.kr/kyongwhan-ahn/story_b_6927970.html?utm_hp_ref=korea'
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()
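The snippet above needs third-party packages (bs4, nltk). As a rough standard-library-only sketch of the same HTML-to-text step (Beautiful Soup is far more robust on real-world markup), `html.parser` can collect the text nodes while skipping script and style blocks:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)

text = html_to_text("<p>Hello <b>world</b></p><script>var x;</script>")
print(text)  # Hello world
```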
2. RAW TEXT TO LIST
raw = raw[30123:32364]  # slice out the article body (offsets found manually for this page)
print (type(raw))
-> <class 'str'>
tokens = word_tokenize(raw)
print (type(tokens))
-> <class 'list'>
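word_tokenize is NLTK's tokenizer; as a minimal illustration of what this raw-text-to-list step produces, a crude regex tokenizer (an assumption for illustration, not NLTK's actual algorithm) splits runs of word characters and single punctuation marks:

```python
import re

def simple_tokenize(text):
    # Match runs of word characters (including Hangul) or any single
    # non-word, non-space character -- a rough stand-in for word_tokenize.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("이주 아동 외면하는 '다문화 한국사회'")
print(type(tokens))  # <class 'list'>
```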
3. LIST TO VOCABULARIES
words = Trial.NounExtractor(tokens)  # Trial.NounExtractor: custom helper, not an NLTK API
3. LIST TO VOCABULARIES
token = ['철수는', '동생에게', '자전거를', '빌려주었다']
words = Trial.NounExtractor(token)
-> words = ['철수', '동생', '자전거', '빌려주었다']
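Trial.NounExtractor appears to be a custom helper rather than an NLTK API. Its behavior on this slide, stripping case particles such as 는/에게/를 from each token, can be mimicked with a crude suffix-stripping sketch (illustration only; real Korean noun extraction needs a proper morphological analyzer):

```python
# Hypothetical stand-in for the slide's Trial.NounExtractor:
# strip one common Korean case particle from the end of each token.
PARTICLES = ("에게", "는", "를", "은", "을", "가", "이")

def noun_extractor(tokens):
    out = []
    for t in tokens:
        for p in PARTICLES:
            if t.endswith(p) and len(t) > len(p):
                t = t[:-len(p)]
                break
        out.append(t)
    return out

token = ['철수는', '동생에게', '자전거를', '빌려주었다']
words = noun_extractor(token)
print(words)  # ['철수', '동생', '자전거', '빌려주었다']
```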
4. FREQUENCY DISTRIBUTION
fdist = FreqDist(words)
print (fdist.most_common(25))
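FreqDist comes from NLTK; the same counting and most_common behavior is available in the standard library's collections.Counter, which makes this step easy to try without NLTK installed:

```python
from collections import Counter

# Toy word list standing in for the extracted vocabulary.
words = ['아동', '인권', '아동', '국제', '아동', '인권']

fdist = Counter(words)       # counts occurrences, like nltk.FreqDist
print(fdist.most_common(2))  # [('아동', 3), ('인권', 2)]
```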
EXAMPLES
Lotte World, Which Dragged Me Off
(http://www.huffingtonpost.kr/seungjoon-ahn/story_b_6928016.html?utm_hp_ref=korea)
[('', 63), (' 그 ', 19), (' 것 ', 12), (' 우리 ', 10), ('!', 8), (' 없 ', 8),
(' 놀이기구 ', 8), (' 직원 ', 8), (' 수 ', 8), (' 시각장애인 ', 8), (' 안 ', 6),
(' 내 ', 5), (' 있었다 ', 5), (' 않 ', 5), (' 매뉴얼 ', 5), (' 근거 ', 5), (' 사람 ', 5),
(' 롯데월드 ', 5), (' 다른 ', 5), (' 있던 ', 4), (' 한 ', 4), (' 장애인 ', 4),
(' 설명 ', 4), (' 때 ', 4), (' 상황 ', 4)]
POS TAGGED
Thank_VB You_PRP !_.
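The word_TAG format above is how NLTK renders tagged tokens. Given (word, tag) pairs, e.g. as returned by nltk.pos_tag, the same string can be produced directly (the pairs are hard-coded here, since tagging itself requires NLTK's trained models):

```python
# (word, tag) pairs as nltk.pos_tag would return them for "Thank You !"
tagged = [("Thank", "VB"), ("You", "PRP"), ("!", ".")]

# Join each pair into NLTK's word_TAG display format.
line = " ".join(f"{word}_{tag}" for word, tag in tagged)
print(line)  # Thank_VB You_PRP !_.
```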