Elegant Concurrency
Why Concurrency?
Be a Good Machine Tamer!
© Eduardo Woo
As a Good Machine Tamer
Why Concurrency?
• Get the machine into full play!
• The capacities:
• CPU
• IO
• Disk
• Network bandwidth
• Network connections
• etc.
Concurrency Is Hard?
∵ The Various Ways?
Concurrency Is Hard?
• threading
• queue
• multiprocessing
• concurrent.futures
• asyncio
• thread
• process
• coroutine
• gevent
• lock
• rlock
• condition
• semaphore
• event
• barrier
• manager
• …
• ???
With Today's Sharing
Concurrency Is Hard?
★ queue
★ thread
Plus Some
Concurrency Is Hard?
★ queue
★ thread
★ process
★ coroutine
★ gevent
❤ Python & open source
Mosky
• Python Charmer at Pinkoi.
• Has spoken at
• PyCons in TW, KR, JP, SG, HK
• COSCUPs & TEDx, etc.
• Has spent countless hours teaching Python.
• Has several Python packages:
• ZIPCodeTW, MoSQL, Clime, etc.
• http://mosky.tw/
We're looking for Frontend & Backend Engineers
Outline
• Why Concurrency?
• Concurrency Is Hard?
★ Communicating Sequential Processes (CSP)
★ Channel-Based Concurrency
★ Concurrent Units
★ CSP vs. X
Communicating Sequential Processes
Communicating Sequential Processes
Is a Formal Language
Communicating Sequential Processes
• A formal language for describing concurrent systems.
• The main ideas:
• “Processes”, which
• interact with each other solely through channels.
• But why CSP?
“Do not communicate by sharing memory; instead, share memory by communicating.”
– Effective Go
“Using channels to control access makes it easier to write clear, correct programs.”
– Effective Go
“Use locks and shared memory to shoot yourself in the foot in parallel.”
– The Python Wiki
In Python
Communicating Sequential Processes
• “Processes”
• → threads, processes, coroutines, etc.
• → concurrent units
• Interact with each other solely through channels.
• → concurrent units' channels
• → usually the queues
Channel-Based Concurrency
Channel-Based Concurrency
• Not going to cover CSP exactly.
• Just adapt the concepts.
• → Use channels to communicate between concurrent units.
• Will continue with the code: http://bit.ly/econcurrency.
But The Traditional Way
NOT Channel-Based Concurrency
import requests
from queue import Queue

# call_in_daemon_thread and urls are defined elsewhere (not shown on the slides)

def consume(url_q):
    while True:
        url = url_q.get()
        content = requests.get(url).content
        print('Queried', url)
        # mark the task as done
        url_q.task_done()

url_q = Queue()
for url in urls:
    url_q.put(url)

for _ in range(2):
    # "daemon" here is not a Unix daemon;
    # daemon threads are stopped at shutdown
    call_in_daemon_thread(consume, url_q)

# block until all tasks are done
url_q.join()
# when the main thread exits, Python shuts down
But the Traditional Way
NOT Channel-Based Concurrency
• The queue is a thread-safe queue.
• .task_done()
• Decrements the count of unfinished tasks; if it reaches 0, notifies all waiters by releasing their locks.
• .join()
• Blocks on a doubly acquired lock until that notification arrives.
• Daemon threads – are stopped abruptly at shutdown. 🔪
• How do I know? The unclear docs & the Python source code. 😆
• Let's make it simpler.
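To make the bookkeeping above concrete, here is a minimal offline sketch of the same task_done()/join() pattern. It rests on a few assumptions: call_in_daemon_thread is a hypothetical stand-in for the helper named on the slides (its real body is not shown), and the URLs are placeholder strings that are never fetched.

```python
import threading
from queue import Queue


def call_in_daemon_thread(func, *args):
    # Hypothetical helper: the slides use this name but do not show its body.
    t = threading.Thread(target=func, args=args, daemon=True)
    t.start()
    return t


done = []  # list.append is atomic in CPython


def consume(url_q):
    while True:
        url = url_q.get()
        done.append(url)    # stands in for the real HTTP request
        url_q.task_done()   # decrements the unfinished-task counter


urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
url_q = Queue()
for url in urls:
    url_q.put(url)          # each put() increments the counter

for _ in range(2):
    call_in_daemon_thread(consume, url_q)

url_q.join()                # returns once the counter is back to 0
print(sorted(done))
```

The consumers never return on their own; they stay blocked in get() and are only stopped because they are daemon threads, which is the part the next slides simplify.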
The Channel-Based Concurrency
# TO_RETURN, N, urls and call_in_thread are defined elsewhere (not shown on the slides)

def consume(url_q):
    while True:
        url = url_q.get()
        if url is TO_RETURN:
            return
        content = requests.get(url).content
        print('Queried', url)

url_q = Queue()
for url in urls:
    url_q.put(url)

# one TO_RETURN per consumer thread
for _ in range(N):
    url_q.put(TO_RETURN)

for _ in range(N):
    call_in_thread(consume, url_q)
Much easier!
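A runnable variant of the sentinel approach might look like the sketch below. The definitions of TO_RETURN, N and call_in_thread are assumptions (the slides use the names without showing them), and the network request is replaced by a string operation so the example runs offline.

```python
import threading
from queue import Queue

TO_RETURN = object()  # assumed sentinel; compared by identity
N = 2                 # assumed number of consumer threads


def call_in_thread(func, *args):
    # Hypothetical helper, named after the one used on the slides.
    t = threading.Thread(target=func, args=args)
    t.start()
    return t


def consume(url_q, out):
    while True:
        url = url_q.get()
        if url is TO_RETURN:
            return
        out.append(url.upper())  # stands in for requests.get(url)


url_q = Queue()
out = []
for url in ['a', 'b', 'c']:
    url_q.put(url)
for _ in range(N):
    url_q.put(TO_RETURN)         # one sentinel per consumer

threads = [call_in_thread(consume, url_q, out) for _ in range(N)]
for t in threads:
    t.join()                     # plain thread join; no Queue.join() needed
print(sorted(out))
```

Because every consumer returns after reading its sentinel, the threads need not be daemons and the program ends without task_done() bookkeeping.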
Layered Channel-Based Concurrency
• Models more complex concurrent systems.
• Use 3 layers:
• Atomic Utils
• Each function must be concurrency-safe.
• Channel Operators
• Functions interact with each other solely through channels.
• Graph Initializer
• A function initializes the whole graph.
Concurrency-Safe?
Layered Channel-Based Concurrency
• Depends on the concurrent unit, e.g., thread-safe for threads.
• Tips for keeping a function atomic:
• Access only its own frame (local variables).
• Use atomic operations – http://bit.ly/aoperations.
• Redesign with channels.
• Use a lock – the last resort.
The Crawler
Layered Channel-Based Concurrency
• A crawler that crawls all pages of the PyCon TW website.
• f1: url → text via channel
• f2: text → url via channel
• Plus a channel to break the loops at the end.
• And run concurrently, of course!
Atomic Utils
Layered Channel-Based Concurrency
import requests
from bs4 import BeautifulSoup

# conforms to "access only its own frame"
def query_text(url):
    return requests.get(url).text

def parse_out_href_gen(text):
    soup = BeautifulSoup(text, 'html.parser')
    return (a_tag.get('href', '')
            for a_tag in soup.find_all('a'))

def is_relative_href(url):
    return (not url.startswith('http') and
            not url.startswith('mailto:'))

# conforms to "use atomic operations"
url_visted_map = {}

def is_visited_or_mark(url):
    visited = url_visted_map.get(url, False)
    if not visited:
        url_visted_map[url] = True
    return visited
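Note that is_visited_or_mark reads and then writes the map in two separate steps, so two threads can both see the same URL as unvisited; in the crawler the worst case is a duplicate fetch. A possible single-step variant is sketched below. It is an alternative, not the talk's code, and it assumes CPython, where dict.setdefault on string keys completes as one operation.

```python
import threading

url_visited_map = {}


def is_visited_or_mark(url):
    # A fresh token per call: setdefault returns it only if this call
    # inserted the key, i.e. the URL had not been seen before.
    token = object()
    return url_visited_map.setdefault(url, token) is not token


assert not is_visited_or_mark('/')
assert is_visited_or_mark('/')

# Several threads race on the same URL; only one should see "not visited".
first_seen = []


def probe():
    if not is_visited_or_mark('/race'):
        first_seen.append(1)


threads = [threading.Thread(target=probe) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(first_seen))
```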
Channel Operators
Layered Channel-Based Concurrency
• Function put_text_q operates
• url_q → text_q
• run_q
• Function put_url_q operates
• text_q → url_q
• run_q
from urllib.parse import urljoin

# RUNNING, TO_RETURN and PYCON_TW_ROOT_URL are defined elsewhere (not shown on the slides)

def put_text_q(url_q, text_q, run_q):
    while True:
        url = url_q.get()
        run_q.put(RUNNING)
        if url is TO_RETURN:
            url_q.put(TO_RETURN)  # broadcast
            return
        text = query_text(url)
        text_q.put(text)
        run_q.get()

def put_url_q(text_q, url_q, run_q):
    while True:
        text = text_q.get()
        run_q.put(RUNNING)
        if text is TO_RETURN:
            text_q.put(TO_RETURN)  # broadcast
            return
        href_gen = parse_out_href_gen(text)
        for href in href_gen:
            if not is_relative_href(href):
                continue
            url = urljoin(PYCON_TW_ROOT_URL, href)
            if is_visited_or_mark(url):
                continue
            url_q.put(url)
        # no more URLs and only this unit is running -> end
        if run_q.qsize() == 1 and url_q.qsize() == 0:
            url_q.put(TO_RETURN)
            text_q.put(TO_RETURN)
        run_q.get()
Graph Initializer
Layered Channel-Based Concurrency
url_q = Queue()
text_q = Queue()
run_q = Queue()

init_url_q(url_q)

for _ in range(8):
    call_in_thread(put_text_q,
                   url_q, text_q, run_q)

for _ in range(4):
    call_in_thread(put_url_q,
                   text_q, url_q, run_q)
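The three layers can also be seen in a much smaller, offline pipeline. The sketch below is not the crawler: the graph is linear (numbers → squares → total), all names are made up for illustration, and TO_RETURN is an assumed sentinel. It reuses the broadcast idea from put_text_q, where a worker that reads the sentinel puts it back for its siblings.

```python
import threading
from queue import Queue

TO_RETURN = object()  # assumed sentinel


# Layer 1: atomic utils (touch only their own frame)
def square(n):
    return n * n


# Layer 2: channel operators (interact only through queues)
def put_square_q(num_q, square_q):
    while True:
        n = num_q.get()
        if n is TO_RETURN:
            num_q.put(TO_RETURN)  # broadcast to sibling workers
            return
        square_q.put(square(n))


def put_total_q(square_q, total_q):
    total = 0
    while True:
        s = square_q.get()
        if s is TO_RETURN:
            total_q.put(total)
            return
        total += s


# Layer 3: graph initializer (wires queues and threads together)
def run_graph(numbers, n_workers=3):
    num_q, square_q, total_q = Queue(), Queue(), Queue()
    for n in numbers:
        num_q.put(n)
    num_q.put(TO_RETURN)          # a single sentinel, placed last

    workers = [threading.Thread(target=put_square_q, args=(num_q, square_q))
               for _ in range(n_workers)]
    reducer = threading.Thread(target=put_total_q, args=(square_q, total_q))
    for t in workers + [reducer]:
        t.start()
    for t in workers:
        t.join()
    square_q.put(TO_RETURN)       # all squares are queued; stop the reducer
    reducer.join()
    return total_q.get()


print(run_graph(range(1, 5)))
```

Since the sentinel is queued after all the numbers, every number has been taken by some worker before any worker sees the sentinel, which is why a single broadcast sentinel is enough here.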
The Output
Layered Channel-Based Concurrency
$ py3 graph_initializer.py 2 1 # even 1 1 when debug
Thread-1 put_text_q:52 url_q.get() -> https://P/a
Thread-1 put_text_q:54 run_q.put(RUNNING) # query
Thread-1 put_text_q:65 run_q.get() # done
...
Thread-3 put_url_q:75 len(text_q.get()) -> 12314
Thread-3 put_url_q:78 run_q.put(RUNNING) # parse
Thread-3 put_url_q:98 url_q: 14 # more url -> not the end
Thread-3 put_url_q:99 run_q: 1
Thread-3 put_url_q:104 run_q.get() # done
...
Thread-2 put_text_q:49 url_q.get() -> https://P/b
...
Thread-3 put_url_q:98 url_q: 0 # no more url and
Thread-3 put_url_q:99 run_q: 1 # only 1 running -> end
Thread-3 put_url_q:103 url_q.put(TO_RETURN) # signal to return
Thread-3 put_url_q:104 text_q.put(TO_RETURN)
Not so easy, but clear.
The Crawler With Error Handling
Layered Channel-Based Concurrency
• A new function: get errors for further handling
Concurrent Units
The Standard Options
• threading.Thread
• queue.Queue
• multiprocessing.Process
• multiprocessing.Queue
• @asyncio.coroutine ≡ async def
• asyncio.Queue
• gevent.Greenlet
• gevent.queue.Queue
Concurrent Units
Pro Tip: DO NOT mix them!
               threading  multiprocessing  asyncio  gevent
CPU            ❌         ⭐               ❌       ❌
IO             ⭐         ⭐               ⭐       ⭐
Run-Time Cost  🏠         🏢               ⚡       ⚡
Notes:
• threading: Easy!
• multiprocessing: Note that processes' memories are isolated.
• asyncio: IMO, the API is too basic.
• gevent: The API is rich and similar to threading.
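The same channel-based design carries over when the concurrent unit changes. As an illustration, here is a minimal sketch of the sentinel consumer using coroutines and asyncio.Queue; TO_RETURN is an assumed sentinel and asyncio.sleep(0) stands in for an awaitable HTTP request.

```python
import asyncio

TO_RETURN = object()  # assumed sentinel


async def consume(url_q, out):
    while True:
        url = await url_q.get()
        if url is TO_RETURN:
            return
        await asyncio.sleep(0)  # stands in for an awaitable HTTP request
        out.append(url)


async def main(urls, n=2):
    url_q = asyncio.Queue()
    out = []
    for url in urls:
        url_q.put_nowait(url)
    for _ in range(n):
        url_q.put_nowait(TO_RETURN)  # one sentinel per consumer
    await asyncio.gather(*(consume(url_q, out) for _ in range(n)))
    return out


result = asyncio.run(main(['a', 'b', 'c']))
print(sorted(result))
```

Only the unit (coroutine) and the channel (asyncio.Queue) were swapped; in line with the tip above, the sketch does not mix in queue.Queue or threads.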
Scale Out
• The channel can also be
• RabbitMQ.
• Redis.
• Apache Kafka.
• Scale out from a single machine with a similar design. 🚀
Concurrent Units
CSP vs. X
CSP vs. X
• X:
• Lock
• Parallel Map
• Actor Model
• Reactive Programming
• MapReduce
CSP vs. Lock
• A channel is just a lock plus message passing.
• Locks and their variants add complexity.
• Channels provide a better abstraction to control the complexity.
• Just like Python vs. C.
• Design with channels first, and then transform to locks if needed.
CSP vs. Parallel Map
• Level: Lock < CSP < Parallel Map
• If your problem fits parallel map, just use it,
• e.g., concurrent.futures.
• i.e., if you don't need to share memory, why communicate?
• If it doesn't fit perfectly, consider using CSP to model it.
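For the parallel-map case, a minimal sketch with concurrent.futures could look like this. The function fetch_length and the URLs are placeholders invented for the example; a real version would put the I/O-bound call (for instance an HTTP request) inside the mapped function.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_length(url):
    # Placeholder for an I/O-bound call such as requests.get(url);
    # it only measures the string so the sketch runs offline.
    return len(url)


urls = ['https://example.com/a', 'https://example.com/bb']

with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in the order of the inputs.
    lengths = list(executor.map(fetch_length, urls))

print(lengths)
```

No queue or sentinel appears because the tasks are independent: each input maps to one output and nothing needs to be communicated between workers.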
CSP vs. Actor Model
• Both are mathematical models.
• Can you model CSP with the Actor model? Yes.
• Can you model the Actor model with CSP? Yes.
• The major differences are
• The Actor model emphasizes the “worker”.
• CSP emphasizes the “channel”.
• When implementing, the Actor model tends toward
• classes.
• private state.
• CSP tends toward
• functions
• which implies a more functional style, so simpler testing.
• explicit channels
• which implies easier visualization, so simpler optimizing.
• IMO, I prefer CSP.
print('Testing query_text ... ', end='')
text = query_text('https://tw.pycon.org')
print(repr(text[:40]))
print('Testing parse_out_href_gen ... ', end='')
href_gen = parse_out_href_gen(text)
print(repr(list(href_gen)[:3]))
print('Testing is_relative_href ...')
assert is_relative_href('2017/en-us')
assert is_relative_href('/2017/en-us')
assert not is_relative_href('https://tw.pycon.org')
assert not is_relative_href('mailto:organizers@pycon.tw')
print('Testing is_visited_or_mark ...')
assert not is_visited_or_mark('/')
assert is_visited_or_mark('/')
$ py3 atomic_utils.py
...
Benchmarking query_text ... 0.7407s
Benchmarking parse_out_href_gen ... 0.01298s
# optimize by the ratio
$ py3 graph_initializer.py 40 1
...
CSP vs. Reactive Programming
• Both support multiple concurrent units.
• CSP can build flexible data flows more easily.
• In reactive, the default one-way stream may limit you.
• CSP makes concurrency easier to use ∵ it is old-school.
• In reactive, you have to understand its flat_map and/or schedulers.
• Reactive has richer APIs, especially for UI events.
CSP vs. MapReduce
• CSP is more lightweight.
• CSP is more flexible.
• In MapReduce,
• The algorithm must fit the MapReduce form, and
• The data flow is even more fixed than in reactive.
• MapReduce systems were designed for PB-level data in the first place.
At the End
At the End
• Channel-based concurrency from CSP consists of
• Concurrent units.
• Channels.
• CSP helps to avoid the pitfalls, but not all of them.
• Logging and visualizing help with debugging.
• When your problem fits a higher-level model? Use it!
• But you can always model it with CSP.
Notes
At the End
• The crawler is for showing the flexibility of using channels.
• Looks good, but not perfect, since
• url_q.get() and
• run_q.put(RUNNING)
• must be synchronized, i.e., happen as one step.
• The issue occurs at high thread ratios.
• Keep your graph simple!
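One possible way to avoid depending on two separate queue operations for end detection is to let a single queue count its own unfinished work. The sketch below is an assumption-laden alternative, not the talk's crawler: it uses one queue, a made-up in-memory site instead of HTTP, and the task_done()/join() pair only to detect the end before sending sentinels.

```python
import threading
from queue import Queue

TO_RETURN = object()  # assumed sentinel

# A made-up in-memory "site": page -> links found on it.
FAKE_SITE = {'/': ['/a', '/b'], '/a': ['/b', '/c'], '/b': ['/'], '/c': []}

visited = {}


def crawl(url_q):
    while True:
        url = url_q.get()
        if url is TO_RETURN:
            return
        for href in FAKE_SITE.get(url, []):
            token = object()
            if visited.setdefault(href, token) is token:
                url_q.put(href)  # counted before the parent is marked done
        url_q.task_done()


url_q = Queue()
visited['/'] = True
url_q.put('/')

threads = [threading.Thread(target=crawl, args=(url_q,)) for _ in range(4)]
for t in threads:
    t.start()

url_q.join()                     # no queued or in-progress page is left
for _ in threads:
    url_q.put(TO_RETURN)         # now it is safe to stop the workers
for t in threads:
    t.join()
print(sorted(visited))
```

A page's child links are put on the queue before the page itself is marked done, so the unfinished-task counter cannot reach zero while work remains, regardless of the number of threads.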
