Five Whys - Devoxx UK 2014

The document discusses the concept of the 'Five Whys' as a method for addressing and solving problems within engineering and development environments. Key themes include identifying and solving problems, embracing immutability, leveraging open source practices, and ensuring effective metrics and monitoring. The presentation concludes by encouraging organizations to challenge existing processes and find ways to improve their development practices.

@al94781#5whys
The Five “Whys”
Andrew Harmel-Law
@al94781
the-music-of-time.blogspot.com
&
scalaeyeforthejavaguy.blogspot.com

• Andrew Harmel-Law
Andrew.Harmel-Law@Capgemini.com
• Dev Lead / Developer / OSS Advocate @ Capgemini UK
• We build complicated stuff for other people. (We’re hiring. Email me.)

TL;DL
We’re engineering ourselves into a right mess;
but we can engineer ourselves out of it

Disclaimer:
No “Hype Cycles”,
“Magic Quadrants” or
“Technology Radars” were harmed (or
consulted) in the course of preparing this
presentation

“… and they also
fail in ways that
are beyond the
comprehension of
a single person.”

How Do We Cope?
• Identify the problem. Then solve it.

How Can We Cope?
• Identify the problem. Then solve it.
• Identify the problem. Then solve it.
• Identify the problem. Then solve it.
• Identify the problem. Then solve it.

DOES THIS MEAN WE’RE
ACTUALLY PART OF THE
PROBLEM?

1. The Bakery and NoOps
2. Anti-Fragility and the Simian Army
3. Throw Everything Away
4. The Church of Graphs
5. Open Source (Almost) Everything

1. The Bakery and NoOps

Immutability makes many things easier:
•Maintenance (e.g. SunRay)
•Multi-Threading
•Scaling (e.g. Pizza Boxes)
•Caching
•Development in general (e.g. “no changes in a
Sprint”)

• What if you had immutable deployables?
• And what if you deployed to immutable
environments?

“Netflix is a developer oriented culture.”
“We decided to leverage developer oriented tools such as Perforce for version control, Ivy for dependencies, Jenkins to automate the build process, Artifactory as the binary repository and to construct a “Bakery” that produces complete AMIs that contain all the code for a service.”
http://perfcap.blogspot.co.uk/2012/03/ops-devops-and-noops-at-netflix.html

“[Your Company] is a developer oriented culture.”
“We decided to leverage developer oriented tools such as Perforce for version control, Ivy for dependencies, Jenkins to automate the build process, Artifactory as the binary repository and to construct a “Bakery” that produces complete AMIs that contain all the code for a service.”
“Several hundred development engineers use these tools to build code, run it in a test account in AWS, then deploy it to production themselves.”
http://perfcap.blogspot.co.uk/2012/03/ops-devops-and-noops-at-netflix.html

http://www.infoq.com/presentations/Building-for-the-Cloud-at-Netflix

http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

2. Anti-Fragility & the Simian Army

“Put all your eggs in one basket, and then watch that basket very closely”
(Andrew Carnegie)

http://queue.acm.org/detail.cfm?id=2499552

3. Throw Everything Away.
Everything? Yes, Everything.
Oh, & Do It Regularly Too

“Plan to throw one away”
Fred Brooks, The Mythical Man-Month

What if We Were Always Throwing Away?
• But little pieces at a time
• And then only the ASCII-file implementations

What Are We Not Throwing Away?
• The knowledge gained from writing the previous version
• The specs (preferably executable)

4. The Church of Graphs

“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.
In general, we tend to measure at three levels: network, machine, and application”
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/

“Application metrics are usually the hardest, yet most important, of the three. They’re very specific to your business, and they change as your applications change (and Etsy changes a lot).
Instead of trying to plan out everything we wanted to measure, we decided to make it ridiculously simple for any engineer to get anything they can count or time into a graph with almost no effort.”
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/

http://code.flickr.net/2008/10/27/counting-timing/
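The “count or time anything” idea quoted above can be sketched in a few lines. This is only an illustration, assuming a StatsD-style collector that accepts plain-text lines over UDP; the collector address, the metric names and the helper functions (`counter_line`, `timer_line`, `send`, `timed`) are all hypothetical and not taken from the talk.

```python
import socket
import time
from contextlib import contextmanager

# Hypothetical collector address; adjust to wherever a StatsD-style daemon listens.
COLLECTOR = ("127.0.0.1", 8125)


def counter_line(name: str, value: int = 1) -> str:
    """Format a counter increment in the StatsD plain-text style."""
    return f"{name}:{value}|c"


def timer_line(name: str, millis: float) -> str:
    """Format a timing sample (in milliseconds) in the StatsD plain-text style."""
    return f"{name}:{int(round(millis))}|ms"


def send(line: str, address=COLLECTOR) -> None:
    """Fire-and-forget over UDP, so a missing collector never breaks the app."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), address)
    except OSError:
        pass  # metrics are best-effort in this sketch
    finally:
        sock.close()


@contextmanager
def timed(name: str, emit=send):
    """Time the wrapped block and emit one timing sample when it ends."""
    start = time.perf_counter()
    try:
        yield
    finally:
        emit(timer_line(name, (time.perf_counter() - start) * 1000.0))


if __name__ == "__main__":
    send(counter_line("checkout.completed"))   # count something
    with timed("checkout.render"):             # time something
        sum(range(100_000))
```

The design choice worth noticing is how little the caller has to do: one line to count, one `with` block to time, and no up-front registration of metric names.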

5. Open Source (Almost) Everything

AUDIENCE POLL
- Who here uses Open Source?
- Who here contributes to Open Source?
- Who here creates Open Source?

AUDIENCE POLL
- Who here uses Open Source at Work?
- Who here contributes to Open Source at Work?
- Who here creates Open Source at Work?

Why Bother to Open Source?
• Great advertising
• More work done, faster and more cheaply
• Attract talent
• Best technical interview possible
• Retain talent
• Effortless modularisation
• Reduce duplication of effort

Why Should I Open Source?
• Hire & retain top engineers
• Good PR
• Make Netflix solutions common standards
• Give back to the Apache OSS community
• Motivate
• Peer pressure, code clean-up and documentation

AUDIENCE POLL
- Who here would like to use Open Source at Work?
- Who here would like to contribute to Open Source at Work?
- Who here would like to create Open Source at Work?

Any General Conclusions?

There are Some Themes (1)
You don’t get any of this for free, so:
• Design for build-ability
• Design for deploy- and undeploy-ability
• Design for modularity
• Design for monitor-ability
• Design for automate-ability

There are Some Themes (2)
• Reduce variation (or manage the sources of variation)
• Trust and enable developers to share and collaborate (inside and outside the firewall)

So, What About Us?
We can’t apply all these ideas on all of our projects.
But we could apply some of them on some of our projects.

Think: What’s the Effect?
• On how we architect and design?
• On how we build and test?
• On how we deploy and run?
• On how we structure our teams?
• On how we interact with our customers?
• On how we use and share code?

We Too Can Ask “Why?”
What else can we:
• challenge?
• find that sucks, and then remove?
• find that is good, and then amplify?
• share, and get famous for?

Thanks, Q & A

Thanks / Creative Commons
•Presentation Template — Guillaume LaForge
•The Queen — A prestigious heritage with some
inspiration from The Sex Pistols and funny Devoxxians
•Girl with a Balloon — Banksy
•Tube — Michael Keen


Editor's Notes

#6: While the details of this article are wildly misleading, we can’t ignore that the general public is starting to notice when the things we do go badly wrong.
#7: “… and they also fail in ways that are beyond the comprehension of a single person.” NOTE: That’s not to say we can’t understand the reason something failed _after_ it happens; too frequently it’s blindingly obvious (“Why didn’t we see that?”). It’s just that it’s now _impossible_ to predict these failures in advance (you wouldn’t have seen it, because of a combinatorial explosion of complexity).
#8: For IT workers, complexity is on the increase. We have:
• more and more scale
• bigger and bigger volumes
• more and more devices, trying to access systems more and more of the time
• bigger and bigger teams working on things, spread across more and more countries, speaking more and more languages
Lots of Simple Things Combined -> Complexity -> Chaos and Unpredictability -> Emergent Behaviour
#11: But there is another set of forces at work: there are many new underpinnings, and these new underpinnings are affecting everything we do as IT workers.
• What was costly is now cheap: disk, memory, CPU, bandwidth
• What was cheap is now costly: time to think and plan (so-called “internet speed” or “internet time”)
• What was hard is now easy: starting small and growing gradually to meet global demand; reaching all your customers everywhere, all the time; with a 3-man team
• What was easy is now hard: space, cooling, security, regulatory compliance
• What was true is now false: “no one ever got fired for buying IBM”; hardware won’t become commoditised; ownership / control (“it’s mine”)
• What was false is now true: free software won’t win over proprietary
#12: What does this mean for us as software developers?… It means that things get exciting.
#13: So how do we cope? We coped in the past, and cope right now, by identifying and then solving our problems:
• Single Point of Failure? -> N+1
• Dev too slow? -> Hire more people
• Too many bugs? -> Do more testing
And when we came across new problems, we solved those too:
• Sessions across multiple machines? -> Cluster
• Programming language too hard? -> Abstract up a bit
... -> ... -> Complexity
#14: But how can we cope? We coped in the past by solving our problems:
• Single Point of Failure? -> N+1
• Dev too slow? -> Hire more people
• Too many bugs? -> Do more testing
And when we came across new problems, we solved those too:
• Sessions across multiple machines? -> Cluster
• Programming language too hard? -> Abstract up a bit
... -> ... -> Complexity
#15: But how can we cope? We coped in the past by solving our problems:
• Single Point of Failure? -> N+1
• Dev too slow? -> Hire more people
• Too many bugs? -> Do more testing
• Programming language too low level? -> Abstract up a bit
And when we came across new problems, we solved those too:
• Sessions across multiple machines? -> Cluster
• Programming language too hard? -> Abstract up a bit
Repeat, Repeat. Build, Build. Add, Add. Think, Think. Solve, Solve.
... -> ... -> Complexity
#24: -> Complexity -> Inter-dependency -> Fragility
#26: It’s getting to the point where I personally was beginning to ask: “Are there any alternatives?” And then I thought: when I interview new developers, I always ask a question (“It’s not a one-man project in your bedroom for yourself …”) to see what their understanding of “Enterprise Development” is. But perhaps that wasn’t a great question. Should I instead have been asking: “How can you keep the simplicity you have when working on your own project, on your own in your bedroom, when you’re scaling it up?”
#27: Let’s introduce a book. This is “The One Straw Revolution”, a seminal volume about “Natural Farming”, written in 1975 by Masanobu Fukuoka. It’s all about doing less, about taking things away, and about stopping. And the strange thing was, it was scarily effective. Many scientific studies have been done, and are still being done, to find out why it works so well, and many of the ideas he (re)pioneered are being brought into “modern” agriculture as a result. But really he had looked at the process of growing crops from the bottom up. He quite literally returned to the roots and asked some fundamental questions about why things had ended up being done the way they were being done. And as a result, he could change things.
#28: I think we’re now seeing his philosophy appearing in IT, and I’ve picked a few ideas and approaches which I wanted to draw your attention to. In every case the originators (and most of these ideas are not my own; I’m just looking at them from this certain angle because it seems to help) have looked right back at the fundamental drivers for the things that we do, in light of the new underpinnings we looked at, and come to some interesting (to me anyway) conclusions.
#29: But this sounds mad? Why should I listen to this? How can it apply to what we do? Well, the folks that are pathfinding in this are some big names…
#30: Does it scale? All this is taken from some of the largest and fastest-growing companies in the world. Will it last? None of them are ones which sprang up yesterday.
#31: So what are these topics we’re going to look at? Because they come largely from Silicon Valley, as you’d expect, they have some pretty great names… (Again, not mine I’m afraid)
#33: Let’s confront our first truth about what we do: Doing Things Causes Problems. Whether we do them manually or automatically, we end up with problems. And what I mean us to focus on here is not the technical specifics, but the overall Zen-like inevitability of it all. Depressing, isn’t it?
#34: (even more depression)
#35: (and still more…)
#36: So what can we do? In the One Straw Revolution there is the admonishment to “Do Less” and to “Leave things alone.” That sounds like a good idea in this case. But how can we do that in our sphere? One way is to further embrace the concept of immutability…
#37: Immutability Makes Many Things Easier:
• Maintenance (SunRay)
• Multi-Threading
• Scaling (Pizza Boxes)
• Caching
• Development
#38: What if you had Immutable Deployables? And what if you deployed them to Immutable Environments?
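To make the two questions above concrete, here is a minimal sketch, assuming we model a deployable and an environment as frozen records. The names `Deployable`, `Environment` and `roll_forward` are invented for illustration and do not correspond to any Netflix tooling.

```python
from dataclasses import dataclass, replace
import hashlib


@dataclass(frozen=True)
class Deployable:
    """A built artefact: once created, its fields cannot be reassigned."""
    service: str
    version: str
    payload: bytes

    @property
    def fingerprint(self) -> str:
        # Content hash: identical bits always yield the same identity.
        return hashlib.sha256(self.payload).hexdigest()[:12]


@dataclass(frozen=True)
class Environment:
    """A running environment pinned to exactly one deployable (hypothetical model)."""
    name: str
    deployable: Deployable


def roll_forward(env: Environment, new: Deployable) -> Environment:
    """'Changing' an environment means building a replacement, never patching in place."""
    return replace(env, deployable=new)
```

Because nothing is patched in place, the old environment value still exists after a roll-forward, which is what makes switching back cheap.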
#39: Here’s a wordy bit (for a change). This is taken from a long blog post by Adrian Cockcroft entitled “Ops, DevOps and PaaS (NoOps) at Netflix”. It caused quite a stir in some communities, but we’re going to ignore that and look at some of the ideas it articulates. #1 – They brought developer tools and processes as much to the centre of everything as possible.
#40: #2 – Engineers then use these tools, and some more they built on top of them, to build and package…
#41: #3 - … and then deploy to PROD themselves
#42: Let’s consider: many of us are working in “developer oriented cultures” too. So how might we do this?
#43: This is “The Bakery” Adrian referred to. See the immutable base AMI images, and notice how simple they are (Linux, Apache, Java and Tomcat; that’s it). Then observe them baking on the immutable “app bundle” in the bakery, and notice the taking of a snapshot ready for launch. I don’t have time to go into it any more really, but you can read more at: http://www.infoq.com/presentations/Building-for-the-Cloud-at-Netflix
#44: Next up: “They use a web based portal to deploy hundreds of new instances running their new code.” And: “Pushes to the cloud are as frequent as each team of developers needs them to be” Here’s their web-based tool for that: Asgard
#45: Finally: “…running their new code alongside the old code, put one “canary” instance into traffic, if it looks good the developer flips all the traffic to the new code. If there are any problems they flip the traffic back to the previous version (in seconds) and if it’s all running fine, some time later the old instances are automatically removed.” Again, I’ve not got time to go into this in detail, but you can find out loads more by Googling “Asgard”, following them on Twitter (@asgardoss) and attending their monthly Google hangouts. The point here: by building immutable deployables, using standard developer tools and tool chains, and by engineering them to be simply and automatically deployed and undeployed via a tool like Asgard, Netflix have made their life far, far easier.
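The canary flow in that quote can be sketched as a single decision function. This is a toy model under stated assumptions: `Router`, `canary_release`, the error-rate callback and the 1% threshold are all hypothetical, and none of it is Asgard’s actual API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Tracks which version receives live traffic (hypothetical stand-in)."""
    live: str


def canary_release(
    router: Router,
    old: str,
    new: str,
    canary_error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # assumed threshold, purely illustrative
) -> str:
    """Try `new` on one canary instance; flip all traffic only if it looks healthy.

    Returns the version serving traffic afterwards.
    """
    observed = canary_error_rate(new)  # one instance of the new code in traffic
    if observed <= max_error_rate:
        router.live = new              # looks good: flip all traffic forward
    else:
        router.live = old              # problems: traffic goes back to the old version
    return router.live
```

Flipping back is only this cheap because the old instances are still running untouched alongside the new ones; they are removed later, once the new version has proved itself.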
#47: Let’s move on. You’re all probably thinking “that’s great for reducing failure, but it’ll still fail, so what happens then?” Great. I did too. Let’s confront that head on. Failure:
• Failure is Inevitable
• Failure is Unpredictable
• And we control less and less (massive scale, Cloud computing, etc.)
So how can we build systems which our users can rely on?
#48: Option (a): We Could Duplicate It.
• We could set up a test environment
• Create exhaustive test suites
• Design architecture so that each component can maintain proper functioning as well as the entire system when individual components fail
But this doesn’t scale, and I don’t mean scale to volume, I mean scale to humans.
#49: Option (b): Predict it!
• We could model it (failure)
• Or conduct rigorous analysis
• Or even simulate it
But we’re just not that mature yet.
#50: The Third Way: Put all your eggs in one basket.
#51: Ways to Increase System Resilience (1) Let’s take a step back. Despite sounding incredibly Californian, this particular approach has again come largely from a Brit – and it’s Adrian Cockcroft again. And as with all good Brits, he sees a lot of sense in what we’re already doing. “Build your application with redundancy and fault tolerance. In a service-oriented architecture, components are encapsulated in services. Services are made up of redundant execution units (instances) that protect clients from single- or multiple-unit failure. When an entire service fails, clients of that service need fault tolerance to localize the failure and continue to function.” But we’re not here to hear about what we already do (quite well as it happens). What else can we do?
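As a reminder of what client-side fault tolerance looks like, here is a minimal sketch of a call that degrades to a fallback when a dependency fails. The function names and the fallback data are hypothetical, and this is far simpler than a real circuit breaker.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       max_attempts: int = 2) -> T:
    """Try the primary call a few times, then degrade to the fallback."""
    for _ in range(max_attempts):
        try:
            return primary()
        except Exception:
            # The dependency's failure is contained here rather than
            # propagating to our own callers.
            continue
    return fallback()


def broken_recommendations() -> list:
    """Hypothetical dependency that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def popular_titles() -> list:
    """Hypothetical static answer served when the dependency fails."""
    return ["popular-1", "popular-2"]
```

Calling `call_with_fallback(broken_recommendations, popular_titles)` returns the static list, so the client keeps functioning even though the whole dependency has failed.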
#52: Well, what does One Straw Revolution say in this matter? Leave nature alone. Embrace the chaos What if we did a similar thing in our sphere?
#53: Ways to Increase System Resilience (2) But here’s the scary bit. As well as “doing what we always do”, Cockcroft suggests we also do something “completely different”. What if we caused it? What, failure? Yes, cause it. “Reduce uncertainty by regularly inducing failure. Increasing the frequency of failure reduces its uncertainty and the likelihood of an inappropriate or unexpected response. Each unique failure can be induced while observing the application. For each undesirable response to an induced failure, the first approach can be applied to prevent its recurrence. Although in practice it is not feasible to induce every possible failure, the exercise of enumerating possible failures and prioritizing them helps in understanding tolerable operating conditions and classifying failures when they fall outside those bounds.” But because he’s spent too long in California, he goes further than that (is he mad?). What if we induce failures in the running system? What if that running system was PROD? Then we empirically demonstrate resilience and validate intended behaviour. And remove the need to have multiple, complete copies of PROD. And have no need to replicate data, and configuration, and deployment model. And we are also testing the Problem Identification / Resolution Processes.
#54: So how do we do this? Amazon have “GameDays” [sic] which you may have heard of. But Netflix have automated it and do it all the time, with the Simian Army, and that’s what I’ll introduce a little here: Chaos Monkey (-> Chaos Gorilla (availability zone) -> Chaos Kong), Latency Monkey, Conformity Monkey, Janitor Monkey, Doctor Monkey. “This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice. Inspired by the success of the Chaos Monkey, we’ve started creating new simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them; a virtual Simian Army to keep our cloud safe, secure, and highly available. Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down. For example, we know that if we find instances that don’t belong to an auto-scaling group, that’s trouble waiting to happen. We shut them down to give the service owner the opportunity to re-launch them properly. Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and after giving the service owners time to root-cause the problem, are eventually terminated. Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them. Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal. 10-18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets. Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention. With the ever-growing Netflix Simian Army by our side, constantly testing our resilience to all sorts of failures, we feel much more confident about our ability to deal with the inevitable failures that we'll encounter in production and to minimize or eliminate their impact to our subscribers.
The cloud model is quite new for us (and the rest of the industry); fault-tolerance is a work in progress and we have ways to go to fully realize its benefits. Parts of the Simian Army have already been built, but much remains an aspiration.” Again I don’t have time to go into this in any more detail, but there’s loads of information out there on the internet about these guys, and the concepts of Antifragility and GameDays. I encourage you to read more.
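The core Chaos Monkey idea, stripped right down, is "kill something at random and check you still have capacity". The sketch below is an in-memory illustration under that assumption only; it is not Netflix's implementation, and `chaos_round`, the group names and the instance ids are hypothetical.

```python
import random
from typing import Dict, List, Set


def chaos_round(groups: Dict[str, Set[str]],
                rng: random.Random,
                min_healthy: int = 1) -> List[str]:
    """Terminate one random instance per group.

    Returns the names of groups left with fewer than `min_healthy`
    instances, i.e. the services that would not have survived.
    """
    at_risk = []
    for name, instances in groups.items():
        if instances:
            # Sort first so the choice is reproducible for a seeded rng.
            victim = rng.choice(sorted(instances))
            instances.remove(victim)
        if len(instances) < min_healthy:
            at_risk.append(name)
    return at_risk


# Hypothetical fleet: "api" is redundant, "billing" is a single point of failure.
fleet = {
    "api": {"i-1", "i-2", "i-3"},
    "billing": {"i-9"},
}
survivors_at_risk = chaos_round(fleet, random.Random(42))
```

Here `survivors_at_risk` names `billing`, the one service with no redundancy. Finding that out at 2 pm on a weekday, with engineers watching, is the whole point.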
#57: Reminder: where the (top part of the) quote came from. But we never do: And the code rots: Changes that used to be quick to add aren’t any more… And a small change here breaks that thing way over there… And breaks it spectacularly… And developers have to become more expert in the codebase… Which they grow to hate more and more… And so they leave…
#58: So again, can we get inspiration from the One Straw Revolution? (Not such a nice calming picture this one). Yes we can. The Old Fertilises the New In The One Straw Revolution, there is no weeding, but also neither is there any clearing of the fields after a harvest. Sounds like the opposite of what I’m proposing here. But they do harvest. And when they do, they leave the sheaves, and the leaves, and everything else in the field, to be the fertiliser for the next crop, and they do this again and again, year after year. Oh, and they don’t harvest all the field’s crops at one time (more on that next) What if we did a similar thing in our world?
#61: Remember these guys? This is what a majority of the companies moving fast in Silicon Valley do right now. But not the ones that sprung up yesterday. And yet they still maintain an agility that belies their size. It’s because they largely take this approach.
#63: Does this look familiar? It seems that PHP warnings spiked suddenly at about 16:05, then died down a bit around 16:20, and calmed completely by 16:35. But what happened? Something must have happened. But how can we find out? How do we find out the root cause?
#64: One way (the way we do now) is to look at the logs. “Something obviously happened here… but what was it? We might [be lucky and] correlate this sudden spike in PHP warnings with a drop in member logins or a drop in traffic on our web servers, but these point to effects and not to a root cause.” In the process of this you'll have to look across multiple files, from multiple machines, hopefully all with their times set to be exactly the same. But usually you're unlucky, and you never find out what happened, because the logging wasn't detailed enough, and you end up setting your best testers and developers to look at the code and test suites to try and find out the cause of the problem. How do we find out the root cause? How do we just make it easier to find the effects?
#65: So again, what does One Straw Revolution say in this matter? In the book, Fukuoka-san describes how a lot of scientists come to study him and his methods. They always come with pre-conceived notions and pre-built frameworks of thought. They then try to apply these to the methods he is using, looking at them through their specific lenses. He says they miss things because they’re failing to look at the whole picture. For us this means, why are we only writing things down and then reading them later? Even worse, why are we writing them down in multiple places, and then having to mentally correlate them after the fact? But most importantly, why are we trying to guess in advance what will go wrong and set up logging for that? If we know anything, it is that the unpredictable is the enemy here. So why don't we just look? Our eyes and even our intuition can tell us a lot. In our world, that means we should stop reading and writing and start drawing, aka the Church of Graphs.
#66: First let’s debunk a few myths: 1) Logging / Monitoring is expensive and slows things down. Wrong, not any more, with new non-blocking loggers (google for them). 2) You can record too much information, so how do we make sense of it? Well, with Big Data, we're looking at this problem all the time, but we're also forgetting something: our eyes. We are very good at seeing and recognising patterns. But if we have to read everything then we're removing the chance to see these bigger patterns.
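To make "non-blocking logger" concrete, here is a minimal sketch using the Python standard library's `QueueHandler` and `QueueListener`: the calling thread only puts a record on a queue, and a background thread does the slow writing. It stands in for whichever non-blocking logger your own platform offers; `ListHandler`, `build_async_logger` and the "orders" logger name are hypothetical.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Hypothetical slow sink; here it just collects messages in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def build_async_logger(name: str, sink: logging.Handler):
    """Return a logger whose callers only enqueue, plus the background listener."""
    log_queue = queue.Queue(-1)  # unbounded, so logging calls never block
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()  # the sink now runs on its own thread
    return logger, listener


sink = ListHandler()
logger, listener = build_async_logger("orders", sink)
for order_id in range(3):
    logger.info("order %d accepted", order_id)
listener.stop()  # drains the queue and joins the background thread
```

The request path pays only for a queue insert, so the "logging slows things down" objection mostly goes away.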
#67: Welcome the “Church of Graphs”: We spend a lot of time gathering metrics for our network, servers, and many things going on within the code that drives Etsy. It’s no secret that this is one of our keys to moving fast. We use a variety of monitoring tools to help us correlate issues across our architecture. But what most monitoring tools achieve is correlating the effects of change, rather than the causes. From: http://codeascraft.com/2010/12/08/track-every-release/
#68: Step 1: “We need to track changes that we make to the system. Change to application code (deploys) are opportunities for failure. Tweaking pages and features on your web site cause ripples throughout the metrics you monitor, including database load, cache requests, web server requests, and outgoing bandwidth. When you break something on your site, those metrics will typically start to skew up or down. Different companies track change in ways that are reflective of their release cycle. A company that only releases new software or services once or twice a year might literally do this by distributing of a press release. Companies that move more quickly and release new products every few weeks might rely on company-wide emails to track changes. The faster the iteration schedule, the smaller and less formal the announcement becomes. When you reach the point of releasing changes a couple of times a day, this needs to be automated and needs to be distributed to places where it is quickly accessible, such as your monitoring tools and IRC channels. At Etsy, we are releasing changes to code and application configs over 25 times a day. When the system metrics we monitor start to skew we need to be able to immediately identify whether this is a human-induced change (application code) or not (hardware failure, third-party APIs, etc.). We do this by tracking the time of every single change we ship to our production servers.” http://codeascraft.com/2010/12/08/track-every-release/
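Tracking every release can be as small as emitting one data point per deploy and then asking "what shipped just before this graph went wrong?". The sketch below formats a deploy marker as a Graphite plaintext-protocol line (`<path> <value> <timestamp>`) and finds recent deploys for an anomaly. The metric path `deploys.<app>`, the function names and the ten-minute window are assumptions for illustration, not Etsy's actual setup.

```python
import time
from typing import Iterable, List, Optional


def deploy_event_line(app: str, timestamp: Optional[float] = None) -> str:
    """Format one deploy marker as a Graphite plaintext line.

    The deploy script would send this line to the metrics server
    at the moment the release goes out.
    """
    ts = int(time.time() if timestamp is None else timestamp)
    return f"deploys.{app} 1 {ts}\n"


def deploys_before(anomaly_ts: int,
                   deploy_times: Iterable[int],
                   window_seconds: int = 600) -> List[int]:
    """Deploys that shipped within `window_seconds` before the anomaly, newest first."""
    recent = [t for t in deploy_times
              if 0 <= anomaly_ts - t <= window_seconds]
    return sorted(recent, reverse=True)
```

If `deploys_before` comes back empty, the skew is probably not human-induced, and you go and look at hardware or third-party APIs instead.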
#69: It’s Reassuring to See Nothing Wrong “Equally useful is the reassurance we have that we can deploy many times a day without disrupting core functionality on the site. Across the 16 code deploys shown below, not a single one caused an unexpected blip in our member logins.” http://codeascraft.com/2010/12/08/track-every-release/
#70: Measure Everything (revisited) “These tools highlight the good events along with the bad. Ian Malpass, who works on our customer support tools, uses Graphite to monitor the number of new posts written in our forums, where Etsy members discuss selling, share tips, report bugs, and ask for help. When we correlate these with deploys, you can see the flurry of excitement in our forums after one of our recent product launches.” http://codeascraft.com/2010/12/08/track-every-release/
#71: TL;DR: Be able to move fast
#72: Counting and Timing From: http://code.flickr.net/2008/10/27/counting-timing/
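Counting and timing needs very little code at the call site. Here is a minimal in-process sketch, assuming you would later ship these numbers to whatever draws your graphs; the `Metrics` class and the metric names are hypothetical and this is not the implementation described in the Flickr post.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory store for counters and timings."""

    def __init__(self) -> None:
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def count(self, name: str, n: int = 1) -> None:
        """Record that something happened `n` times."""
        self.counters[name] += n

    @contextmanager
    def timer(self, name: str):
        """Record how long the wrapped block took, in seconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the block raises, so failures show up in the graph too.
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()
metrics.count("logins")
metrics.count("logins", 2)
with metrics.timer("db.query"):
    sum(range(1000))  # stand-in for real work
```

One line to count and one `with` to time is cheap enough that you can measure everything rather than guessing in advance what will matter.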
#74: This one’s a bit different, so let’s start a bit differently, BY DOING A POLL. Who here uses Open Source? Who here contributes to Open Source? Who here creates Open Source? [Shout out the results]
#75: Let’s be clear, there’s not so much a problem here in this last one as in all the ones which went before it. Lots of us (all of us?), I’m guessing, use Open Source on everything we do. LET’S CHECK AND DO ANOTHER POLL: - Who here uses Open Source AT WORK? But we’re not talking about that here.
#76: MORE QUESTIONS: - Who here contributes to Open Source AT WORK? - Who here creates Open Source AT WORK? FAR FEWER. What does that indicate?
#77: We’re (and Tom Preston-Werner in his 2011 post on the GitHub blog) talking about creating new Open Source – about releasing it, so that others can see it, and use it, and contribute fixes and new features to it, and perhaps even share responsibility with you for driving it. But closed source for our code is working just fine thank you very much. New products get delivered as closed source every day, and new sites, and new tools and frameworks. Being open source doesn’t seem directly related to their success in any comprehensible way. So why are we talking about it here? Why would (GitHub) exhort you to "Open Source (Almost) Everything?" and why would Netflix and many others be heading off down the same line? Let’s look at what this “Why” does share with all the previous "whys": its seeming counter-intuitiveness.
#78: In his post, Tom is pretty clear about the benefits he thinks are gained by open sourcing: it’s these… And that’s a pretty nice list. And they all seem like great goals, but let’s remind ourselves for one moment who also potentially gains from this open sourcing – GitHub, as it’s the likely place for us to put all of this OSS code. If you keep reading his post, it turns a little into a how-to, and then a little into an ad for GitHub. Also, it’s questionable as to whether your mileage may vary. “Effortless” always sounds alarm bells to me for starters. But the rest seem fair enough, and in this light, is this enough evidence to make you start doing it? Or to convince your (skeptical) manager? So let’s broaden it out and bring Netflix into the mix too…
#79: Here’s the Netflix version, from a presentation by Joe Sondow who works on the Asgard team at Netflix (it’s available on slideshare if you want to look for it). Firstly, let’s note that there’s some commonality here: - retaining and hiring - good PR (aka advertising). Next let’s consider the third one. That’s the money right there, right? As a skeptical manager, if I could achieve that, then I’d give you all the money and time you wanted. But most of us are probably not in the position Netflix is, as such a heavy user of a resource, also used by so many, to drive standards in such a way. But fair play to them for being honest about it. But this has led us to the problem, and that is, because we’re not running a popular code repository benefitting from OSS projects being hosted with us, and nor are we the biggest users of a massive yet proprietary cloud platform, in these cases, all we’re left with is the soft stuff… But let’s look at it anyway: - First, the really soft one – giving back to the Apache licence OSS community. Netflix use a lot of Apache licensed stuff, they’ve benefitted greatly. This is their way of saying “thank you”. It in many ways is the most honourable, but in the harsh world of commercial reality, no matter how ethical, if there weren’t other benefits then they wouldn’t be doing it, let’s be honest. - Next things get a little more interesting – “motivate”. It seems Joe’s implying that engineers aren’t just motivated by money. - And this links onto the last one – peer pressure, code clean-up, and documentation. These are interesting and the ones I’d like to take forward into our final meeting with Fukuoka-San.
#80: Before we do, let’s take our LAST POLL: - Who here would like to use Open Source AT WORK? - Who here would like to contribute to Open Source AT WORK? - Who here would like to create Open Source AT WORK?
#81: So again, what does One Straw Revolution say in this matter? What do they all share? Again, as with all the other "Why" scenarios, we're embracing a force that has come from the ground up, from the stuff we work with, and turning it to our advantage. The difference with this one is that it’s a cultural force. In this case, it’s not what he says, but what he does. He is very free and open. He invites others to come and work with him, and to study how he works and the results he achieves without impediment. He does this because he knows that this is not where the value lies, and so he is not harmed by “giving away his knowledge” and in the case of those who come and work with him, he gains their labour, and also their input and expertise. But deep down, it might just be simple human interaction. What I want to take from him here is his way of working. He’s breaking down the barriers between himself and others. By doing so he makes his work more enjoyable, and more rewarding for him. He gets to share what he has learned, and that gives him satisfaction. What if we did a similar thing in our sphere? Would the benefits be enough to outweigh the perceived costs?
#82: Let’s add another data point – Capgemini. In the final part of the final “why”, let’s talk about some of my personal experiences. I work at a Consultancy, and so most of the code that I write doesn’t belong to me, and neither does it belong to the company, it belongs to my clients. But that code is designed, written, fixed, documented and supported by myself and my colleagues, and whether we like it or not, this is a highly social activity. And yet the way we typically organise our teams, and our code, and our tooling, and our projects, and even ourselves, means we typically are putting up barriers all over the place. Wouldn’t it be better to break these barriers down? And wouldn’t adopting an Open Source model of working be the way to do that? What do I mean? I’ll tell you a story. WHAT WE DID ON OUR PROJECT: Running a team doing integration work – as we worked we quickly realised that we could break things down into small pieces of work (we too had a microservices architecture) and give them to small 1-3 man teams. We also quickly realised, as we built a few services, that we had some common patterns. Things like: expose as a REST service, call an external service, package as a FatJar, capture Monitoring etc. And, perhaps because we were so many small teams, many folks kept having to get these same patterns to work. Now we were lucky – we got to make a few OSS contributions as a result of what we were doing, bug-fixes, documentation contributions, etc, but we also started to produce small pieces of code in a reusable state. We didn’t mandate that other teams use these re-usable bits, but we did put them in separate git repositories, and release them as separate Jars, so that folks who did want to use them could, easily.
It was around the time that we had a few of these, and that, just within our team, we were using them in multiple places, and all contributing enhancements and documentation updates as we needed to, that we realised we were slowly creeping up on doing something in an Open Source way. The only bits missing were OSS licences, and the code and Jars on public repos. Before we go any further, let’s get the “almost” bit out of the way – at Netflix they aim to open source everything that they term “undifferentiated heavy lifting” (aka Infrastructure, Caching, Database, Cloud, Building, Deployment, Configuration, Testing, Monitoring, Networking, Robustness, Security.) They don’t OSS their Streaming, Encoding, Merchandising, Movie metadata, Recommendations, or UI. That’s competitive advantage. Make sense? I thought that, sure, for them that’ll be tons, but we’ll never have anything like that? Well, it turns out that we don’t, but we did have some bits which, because we’re a Systems Integrator, basically glue one thing to another – in our case, one framework (typically Spring) into something else. We had quite a bit of that. And so we looked at making that truly open source. We took it slowly. First we made things available to other projects, and we encouraged them to do the same. In some cases what we’d built was popular, and others could see a use for it. We also began to benefit when they began to share things that they’d likewise come up with. If it was useful, we took it. We kept pushing the OSS way of working – we encouraged them to fork, to report bugs, to clean up the code and documentation, to become contributors, and it worked. We’d open-sourced it, just not to the wider world. So now we’re in the process of making this final leap, on some of the key things we’ve made. Some are tiny, some are a bit bigger, but all are useful, to us anyway. And the best bit?
Everything about motivation and staff retention was true – getting to work with great folks, on something which is really great, and really reusable, and really clean at a code level, is great. And I now don’t even need to work with just the folks on my project – I can work with them from anywhere. And others really do find and fix our bugs, or implement features we’ve had on the backlog for ages but haven’t ever really managed to make time for. And the final hurdle – getting great developers? Well, we’re not there yet, and without it we know that finding great developers is hard, but we’re hoping to see this too. MORE NOTES: For me as a manager or business owner I gain no advantage from having teams (re)building components which are effectively “undifferentiated heavy lifting” (Carl Quinn). “Ready for production is not the same as ready for Github” (Carl Quinn). Open Sourcing improves your quality. But also: We want to innovate. Why do we assume all the brightest people are in the room? We want coverage. The more people who use our stuff, the better it will become. We want as many eyes on our code as possible. The quality will improve. You can hear a similar story on Java Posse Roundup Podcast XXX (Build Pipelines) when Justin Ryan talks about the genesis of the Hystrix Circuit Breaker.

Editor's Notes