DevOps Transformations
at National Instruments
and Bazaarvoice
ERNEST MUELLER @ERNESTMUELLER
THEAGILEADMIN.COM ERNEST.MUELLER@GMAIL.COM
Who Am I?
 B.S. Electrical Engineering, Rice University
 Programmer, Webmaster at FedEx
 VP of IT at Towery Publishing
 Web Systems Manager, SaaS Systems Architect at National Instruments
 Release Manager, Engineering Manager at Bazaarvoice
National Instruments
 NATI – founded in 1976, $1.24bn
revenue, 7249 employees, 10
sites worldwide (HQ: Austin)
 Makes hardware and software for
data acquisition, virtual
instrumentation, automated test,
instrument control
 Frequently on Fortune’s 100 Best
Companies to Work For list
NI Web Systems (2002-2008)
 Managed systems team supporting NI’s Web presence (ni.com) and other
Web technology-based assets.
 ni.com is the primary source of product information (site visits were a key
KPI on the overall lead to sales conversion pipeline), half of all lead
generation, and a primary sales channel – about 12% of revenue was direct
e-commerce.
 Key goals were all growth - globalization, driving traffic, increasing
conversion, and e-commerce sales.
 Team grew from 4 to 8 sysadmins, supporting 80-100 developers – fulfilling
their needs and coordinating with the other IT Infrastructure teams to build
out Web systems.
The Problems
 Previous team philosophy had been actively anti-automation. All systems
manually built and deployed.
 Stability was low and site uptime hovered around 97% with constant work.
 Web Admin quality of life was poor and turnover was high. Some weeks we
had more than 200 oncall pages (resulting in a total of two destroyed oncall
pagers).
 Application teams were siloed by business initiative, infrastructure teams
siloed by technology.
 Large and diverse tech stack – ni.com had hundreds of applications built with
Java, Oracle PL/SQL, Lotus Notes/Domino, and more.
 Monthly code releases were “all hands on deck,” beginning late Friday evening
and going well into Saturday.
What We Did
Immediate (3-6 mo):
 Stop the bleeding – triage issues, get QoL metrics, operations shield
Mid-term (1-2 years):
 Automate – “Monolith,” “Autodeploys,” “Redirect Manager,” APM
 Process – “Systems Development Framework”, operational rotation
Long term (2+ years):
 Partner with Apps and Business
 Drive overall goal roadmap (performance, availability, budget, maintenance%,
agility)
What We Did, Part 2
Security Practice
 NI had no full-time security staff of any sort
 We developed expertise and implemented both infrastructure and application security
programs
 Hosted local OWASP user group
Application Performance Management Practice
 After basic improvements, the majority of stability and performance issues were
application-side
 We implemented a wide range of monitoring tools (synthetic, RUM, log analysis, APM) – a minimal synthetic check is sketched after this list
 Reduced our production issues by 30% and reduced troubleshooting time by 90%
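As a concrete illustration of the simplest of the monitoring categories named above, here is a minimal synthetic-check sketch. It is not NI's actual tooling (those were commercial synthetic, RUM, log analysis, and APM products); the target URL and latency budget are placeholder assumptions.

```python
"""A minimal synthetic-check sketch, assuming a placeholder URL and latency
budget. This is NOT NI's actual tooling (those were commercial synthetic,
RUM, log analysis, and APM products); it only illustrates the idea of
probing an endpoint from outside and flagging failures and slowness."""
import time
import urllib.error
import urllib.request

TARGET_URL = "https://www.example.com/"   # placeholder, not a real NI endpoint
LATENCY_BUDGET_SECONDS = 2.0              # hypothetical threshold


def probe(url: str) -> tuple[bool, float]:
    """Issue one synthetic request; return (healthy, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            healthy = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        healthy = False
    return healthy, time.monotonic() - start


if __name__ == "__main__":
    healthy, elapsed = probe(TARGET_URL)
    if not healthy:
        print(f"ALERT: {TARGET_URL} is down")               # would page on-call
    elif elapsed > LATENCY_BUDGET_SECONDS:
        print(f"WARN: {TARGET_URL} slow ({elapsed:.2f}s)")  # trend in dashboards
    else:
        print(f"OK: {TARGET_URL} in {elapsed:.2f}s")
```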
Results
“The value of the Web Systems group to the Web
Team is that it understands what we’re trying to
achieve, not just what tasks need completing.”
-Christer Ljungdahl, Director, Web Marketing
But…
We were still “the bottleneck.”
NI R&D Cloud Architect (2009-2011)
 About this time R&D decided to branch out into SaaS
products.
 Formed a greenfield team with me as systems architect and
5 other key devs and ops from IT
 Experimental, had some initial product ideas but also wanted
to see what we could develop
 LabVIEW Cloud UI Builder (Silverlight client, save/compile/share in
cloud)
 LabVIEW FPGA Compile Cloud (FPGA compiles in cloud from LV
product)
 Technical Data Cloud (IoT data upload)
What We Did - Initial Decisions
Cloud Hosting
 Do not use existing IT systems or processes – go with an integrated, self-contained team
 Adapt R&D NPI (new product introduction) and design review processes to agile
 None of our old tools would work in a dynamic, REST-based environment
Security First
 We knew security would be one of the key barriers to adoption
 Doubled down on threat modeling, scanning, documentation
Automation First
 PIE, the Programmable Infrastructure Environment
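For context on PIE (the editor's notes below describe its XML model and ZooKeeper runtime registry), here is a hypothetical sketch of what a declarative environment model with service dependencies can look like. The XML schema, service names, and ordering helper are invented for illustration and are not PIE's actual format.

```python
"""A hypothetical sketch of a PIE-style declarative environment model.
The XML schema and service names are invented for illustration; per the
talk notes, the real PIE model described systems, services, and their
interconnections, lived in version control, and fed a ZooKeeper-backed
runtime registry (all omitted here)."""
import xml.etree.ElementTree as ET

MODEL = """
<environment name="prod">
  <service name="web" instances="4" depends="api"/>
  <service name="api" instances="2" depends="db"/>
  <service name="db"  instances="1"/>
</environment>
"""


def startup_order(model_xml: str) -> list[str]:
    """Order services so that dependencies come up before dependents."""
    root = ET.fromstring(model_xml)
    deps = {
        svc.get("name"): [d for d in (svc.get("depends") or "").split(",") if d]
        for svc in root.findall("service")
    }
    ordered: list[str] = []

    def visit(name: str) -> None:       # simple DFS; no cycle detection
        for dep in deps.get(name, []):
            visit(dep)
        if name not in ordered:
            ordered.append(name)

    for name in deps:
        visit(name)
    return ordered


if __name__ == "__main__":
    print(startup_order(MODEL))         # ['db', 'api', 'web']
```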
What We Did, Part 2
 Close dev/ops collaboration was initially rocky, but we powered
through it
 REST not-quite-microservices approach
 Educate desktop software devs on Web-scale operational needs
 Used “Operations as the Secret Sauce” mentality to add value:
 Transparent uptime/status page
 Rapid deployment anytime
 Incident handling, follow-the-sun ops
 Monitoring/logging metrics feedback to developers and business
Results
 Average NI new software product time to market –
3 years
 Average NI Cloud new product time to market –
1 year
Meanwhile… DevOps!
 Velocity 2009, June 2009
 DevOpsDays Ghent, Nov 2009
 OpsCamp Austin, Feb 2010
 Velocity 2010/DevOpsDays Mountain View, June
2010
 Helped launch CloudAustin July 2010
 We hosted DevOpsDays Austin 2012 at NI!
Bazaarvoice
 BV – founded in 2005, $191M revenue,
826 employees, Austin Ventures backed
startup went public in 2012
 SaaS provider of ratings and reviews plus
analytics, syndication, and media products
 Very large scale – customers include 30% of
leading brands, retailers like Walmart, and
services like OpenTable – 600M
impressions/day, 450M unique
users/month
BV Release Manager (2012)
 Bazaarvoice was making the transition into an engineering-heavy company,
adopting Agile and looking to rearchitect their aging core platform into a
microservice architecture.
 They were on 10-week release cycles that were delayed virtually 100% of
the time
 They moved to 2-week sprints, but when they tried to release to production
at the end of them they had downtime and a variety of issues (44
customers reported breakages)
 I was brought on with the goal of “get us to two week releases in a month”
(I started Jan 30, we launched March 1)
 I had no direct reports, just broad cooperation and some key engineers
seconded to me
The Problems
 Slow testing (low percentage of automated regression testing)
 Lots of branching
 Lots of branch checkins after testing begins
 Creates vicious cycle of more to test/more delays
 Monolithic architecture (15,000 files, 4.9M lines of code, running on 1200
hosts in 4 datacenters)
 Unclear ownership of components
 Release pipeline very “throw over the wall”-y
 Concerns from Implementation, Support, and other teams about “not
knowing what’s going on”
What We Did
 With Product support, product teams took several (~3) sprints off to automate
regression testing – “as many as you need to hit the bar”
 Core QA tooling investment (JUnit, TeamCity CIT, Selenium)
 Review of assets to determine ownership (start of a service catalog)
 New branch strategy – trunk and single release branch per release only, no
checkins to release branch except to fix critical bugs
 “If the build is broken and you had a change in, it’s your job to fix it”
 Release process requiring explicit signoffs in Confluence/JIRA (every svn checkin
was linked to a ticket – a minimal check is sketched after this list), promoting ownership “through the pipeline”
 Testing process changed (feature test in dev, regress in QA, smoke in stage)
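One mechanism behind "every svn checkin linked to a ticket" can be sketched as a simple commit-message check. The JIRA-style key pattern and the hook wiring noted in the comments are assumptions for illustration, not Bazaarvoice's actual enforcement code.

```python
"""A sketch of enforcing "every checkin links to a ticket." The JIRA-style
key pattern and the hook wiring described here are assumptions for
illustration, not Bazaarvoice's actual enforcement code. In practice a
check like this runs from an svn pre-commit hook (reading the message via
`svnlook log -t TXN REPO`) or as a CI step."""
import re
import sys

TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")   # e.g. PLAT-1234 (hypothetical key)


def has_ticket_reference(commit_message: str) -> bool:
    """True if the commit message mentions at least one JIRA-style issue key."""
    return bool(TICKET_PATTERN.search(commit_message))


if __name__ == "__main__":
    message = sys.stdin.read()
    if not has_ticket_reference(message):
        sys.stderr.write("Rejected: commit message must reference a ticket (e.g. PLAT-1234)\n")
        sys.exit(1)        # non-zero exit blocks the commit in a pre-commit hook
```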
What We Did, Part 2
 We also added a feature flagging system, so new features a) could
be included in builds immediately and b) would be launched dark (see the sketch after this list).
 “Just enough process” – go/no-go meetings and master signoffs
were important early
 Communication, communication, communication. Met with all
stakeholders (in a company whose only product is the service,
that’s everyone) and set up all-company service status and release
notification emails.
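As an illustration of the dark-launch behavior described in the first bullet above, here is a minimal feature-flag sketch. The flag store, flag names, and client IDs are hypothetical, not Bazaarvoice's actual system.

```python
"""A minimal sketch of the dark-launch behavior behind feature flagging:
new code ships in every build but stays off until its flag is enabled,
optionally for an allowlist of clients first. The flag store, flag names,
and client IDs are hypothetical, not Bazaarvoice's actual system."""

# In production the flags would live in a database or config service so they
# can be flipped without a deploy; a dict keeps this sketch self-contained.
FLAGS = {
    "new_review_widget": {"enabled": False, "allow_clients": {"internal-test"}},
}


def is_enabled(flag: str, client_id: str = "") -> bool:
    """A flag is on globally, or dark-launched to an allowlist of clients."""
    cfg = FLAGS.get(flag, {})
    if cfg.get("enabled"):
        return True
    return client_id in cfg.get("allow_clients", set())


def render_reviews(client_id: str) -> str:
    if is_enabled("new_review_widget", client_id):
        return "new widget"       # code is in the build, launched dark
    return "legacy widget"        # default path for everyone else


if __name__ == "__main__":
    print(render_reviews("internal-test"))   # new widget
    print(render_reviews("acme-retail"))     # legacy widget
```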
Results
 First release: “no go” – delayed 5 days, 5 customer issues
 Second release: on time, 1 customer issue
 Third release: on time, 0 customer issues
 Kept detailed metrics on # checkins, delays, process faults (mainly branch
checkins), support tickets
 We had issues (big blowup in May from a major change) and then we’d change
the process
 Running the releases was handed off to a rotation through the dev teams
 After all the heavy lifting to get to biweekly releases, we went to weekly releases
with little fanfare in September – with an average of 0 customer-reported issues
BV Engineering Manager (2012-2014)
 We had a core team working on the legacy ratings & reviews system
while new microservice teams embarked on the rewrite
 We still had a centralized operations team serving them all and knew
we wanted to distribute them into the teams so that all those teams
were operationally ready from early in their cycle
 We declared “the death of the ops team” mid-2012 and disseminated
them into the dev teams, initially reporting to a DevOps manager
 I took over the various engineers working on the existing product –
about 40 engineers, 2/3 of whom were offshore outsourcers from
SoftServe
The Problems
 On the assumption that “the legacy stack” would only be around a short
time, these teams had not been moved to agile and were very understaffed.
 Volume (hits and data volume) continued to grow at a blistering pace,
necessitating continuous expansion and rearchitecture
 Black Friday peaks stressed our architecture, with us serving 2.6 billion review
impressions on Black Friday and Cyber Monday.
 Relationships with our outsourcers were strained
 Escalated support tickets were hitting SLA about 72% of the time
 Hundreds to thousands of alerts a day, 750-item product backlog
 Growing security and privacy compliance needs – SOX, ISO, TÜV, AFNOR
What We Did
 Moved the teams to agile, embedded DevOps
 Started onshore rotations and better communication with
outsourcers, empowered their engineers
 Trying to break the team into 4 sprint teams failed initially because
there weren’t enough trained leads/scrum masters; we had to regroup
and try again before it succeeded
 Balancing four split product backlogs with a Kanban-style “triage
process” to handle incidents and support tickets
 Metrics and custom visualizations across the board – from support
SLA percentage to current requests and performance per cluster
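As an example of the kind of metric tracked, here is a minimal sketch of computing the escalated-ticket SLA percentage. The field names and the 48-hour SLA window are illustrative assumptions, not BV's actual ticketing schema or SLA terms.

```python
"""A sketch of the support-SLA metric: the percentage of escalated tickets
resolved within their SLA window. The field names and the 48-hour window
are illustrative, not BV's actual ticketing schema or SLA terms."""
from datetime import datetime, timedelta

SLA_WINDOW = timedelta(hours=48)   # hypothetical SLA for escalated tickets


def sla_attainment(tickets: list[dict]) -> float:
    """Percent of tickets whose open-to-resolve time fits inside SLA_WINDOW."""
    if not tickets:
        return 100.0
    met = sum(
        1 for t in tickets
        if t["resolved_at"] - t["opened_at"] <= SLA_WINDOW
    )
    return 100.0 * met / len(tickets)


if __name__ == "__main__":
    sample = [
        {"opened_at": datetime(2013, 11, 1, 9), "resolved_at": datetime(2013, 11, 2, 9)},
        {"opened_at": datetime(2013, 11, 1, 9), "resolved_at": datetime(2013, 11, 6, 9)},
    ]
    print(f"{sla_attainment(sample):.0f}% within SLA")   # 50% within SLA
```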
What We Did, Part 2
 Enhanced communication with Support, embedded Product Manager
 “Merit Badge” style crosstraining program to bring up new SMEs to
reduce load on “the one person that knows how to do that”
 “Dynamic Duo” day/night dev/ops operational rotation
 Joint old team/new team Black Friday planning, game days
 Worked with auditors and InfoSec team – everything automated and
auditable in source control, documentation in wiki. Integrated security
team findings into product backlog.
Results
 Customer support SLA went steadily up quarter by quarter until it
hit 100%
 Steadily increasing velocity across all 4 sprint teams
 Great value from outsourcer partner
 We continued to weather Black Friday spikes year after year on
the old stack
 Incorporation of “new stack” services, while slower than initially
desired, happened in a more measured manner
 Some retention issues with completely embedded DevOps on
“new stack” teams, however
Questions?
ERNEST MUELLER @ERNESTMUELLER
THEAGILEADMIN.COM ERNEST.MUELLER@GMAIL.COM


Editor's Notes

  • #2: Hello from Austin, TX! I’m here to share with you a series of four DevOps transformations I’ve been involved in leading in two very different companies. I don’t claim any of these ideas or techniques we used are unique, but I wanted to demonstrate how DevOps concepts can get traction to solve various business problems in a variety of environments.
  • #3: I spent the 1990s in Memphis, TN, first working in the giant IT shop at FedEx and then taking on technical leadership at a publishing company growing into the Internet, getting experience in coding, system administration, management, and running large scale Web properties. In 2001 I came to Austin and started in on a series of what would become DevOps transformations in four distinct roles, two at National Instruments and two at Bazaarvoice.
  • #4: NI products help scientists and engineers power everything from the CERN supercollider to LEGO robots. Has anyone used LabVIEW?
  • #5: My previous company went under in the post-9/11 bust. My passion for Web technology and technical management led to a role back home in Texas at National Instruments. IT at NI was organized along standard enterprise lines: reporting to the CFO, with separate applications and infrastructure organizations. The two primary parts of the applications organization were the groups working on manufacturing and ERP systems based on Oracle E-Business Suite and the Web Team that supported the Web site.
  • #7: So keep in mind this was “pre-DevOps” (and even pre-Agile, mostly). We had to dig ourselves out of the hole and then work forward while we were growing and adding new technologies and systems all the time. Like many folks, we were discovering DevOps best practices as we went, by the sweat of our brow.
  • #8: In fact some of my fondest achievements from this gig are that the people I had working for me have all gone on to blossom into the areas they wanted to – Peco was our APM COE and now he’s a PM for Riverbed, Josh and James both wanted to get into security and now Josh is head of security for NI and James is notable in the security world, creator of the open source security/CI integration tool gauntlt and working at a security startup. (see http://www.riverbed.com/customer-stories/National-Instruments.html for the source on the APM benefit stats.)
  • #9: We got to where we had measurable availability (three nines) and performance goals that we were hitting along with hitting the yearly site growth and sales goals. We had an operational rotation that balanced project work with reactive work for the team. Over 6 years we fixed a lot, started performance and security centers of excellence, had very skilled engineers.
  • #10: I knew in my heart something was wrong. We were throwing a lot of smart people at the work we had and had good process, good automation, good collaboration – but we had plateaued. Keep in mind, there hadn’t been the first DevOpsDays yet. Continuous Delivery hadn’t been published. “Dev and Ops Cooperation at Flickr” hadn’t been presented at Velocity yet. We had happened on to a lot of the DevOps approach. The dev teams were just starting to use Agile and it was giving us heartache. Discussion of infrastructure test was largely met with blank looks. Infrastructure teams especially were not receptive to a faster cadence. We had gone to the first Velocity conference in 2008 and had been filled with all kinds of new ideas… Also, interaction within Infrastructure was a problem. As our approach and goals began to differ from those of the other teams we experienced conflict – we needed OpsOps, you might say.
  • #12: We didn’t know exactly what to do, but we knew that the existing way we worked IT systems would never be fast or agile enough for product purposes. When we got VMware in IT, it turned a 6-week server procurement process into 4 weeks, only removing the Dell fulfillment time. So we went all in on cloud up front, even though it meant displacing every single tool we were familiar with (none were cloud-friendly at the time). Security-wise, we got ahead of that by doing security work proactively and being transparent about doing so. The first question every LabVIEW FPGA customer asked was “wait but what about the security of my IP?” And we had a prepared response assuring them “your data is yours,” with our CISSP talking about how we do threat modeling and regular security testing and have incident reporting in place. And that satisfied every asker!
  • #13: We also embarked on writing our own configuration management system, PIE. Why? Simply, chef and puppet existed at the time but didn’t do what we need – Windows support in particular, but also we knew we wanted something that could instantiate and manage an entire system at runtime and understand service dependencies. We figured we’d need to mix Amazon cloud, Azure cloud, NI on premise services, and other SaaS services and a CM system that “puts bits on machines” doesn’t do that. We knew this would be a lot of work that a small team would not be putting towards direct product development and we had to really all look at each other and decide that yes, this was a necessary part of achieving our goal.
  • #14: So we went in with a very simple but strict goal of “have that entire physical/logical diagram be code, and have it be dynamic as systems change, code is deployed, etc.” PIE had an XML model detailing the systems, the services running on those systems, and their interconnections. It used ZooKeeper as a runtime registry. Dev, test, and production systems were always identical. Changes were made exclusively through version control. We even used PIE for ad hoc command dispatch so that could be logged. This approach is finally gaining currency, especially with Docker’s more explicit support for service dependencies, but nothing else did this well for a long time.
  • #15: We had some design reviews where the app architect came to me and said “these ops guys are asking questions we just don’t understand… Maybe we should have separate reviews.” I explained that we could learn each other’s language and set up a virtuous cycle of collaboration or continue to split everything and go down the vicious cycle of tickets and process and bottlenecks. So we kept at it.
  • #16: Even with building our own automation platform (and other plumbing like login and licensing services), we delivered a product to market each year (plus some other prototypes and internal services), both directly web-facing and integrated with NI’s desktop products. Besides this, our uptime was 100% (vs 98.59% for our online catalog during the same period), we had security we’d never been able to get done (DMZs for example)… It may not be “fair” to compare a greenfield app with our legacy site, but we’d been working on that site for 6 years, and in the end it’s the results that matter. We didn’t have to balance resiliency and innovation, we were getting both with our approach. When we got told “You need to put some of that on Microsoft Azure” it was pretty straightforward to extend PIE to do so. The primary hurdle was more in the sales and marketing of these new solutions – they were coming faster than they were prepared for!
  • #17: I think it’s important to note that during this period is when DevOps was coined as a term and took off. I was very, very lucky in that many of the principals happened to come to OpsCamp Austin right after the first DevOpsDays happened and were all abuzz about it. Getting plugged into this wave of sharing was extremely energizing – we kept wondering if we were completely crazy and off track until we found out that so many people were finding the same issues and headed in the same direction.
  • #18: Bazaarvoice provides hosted ratings and reviews, along with related services like authenticity, analytics, content syndication between retailers and brands, etc. They’re one of the big Austin tech startup success stories.
  • #21: We took a strong stance on dev empowerment and responsibility, and also each team setting their own specific metrics.
  • #24: A lot of the existing ops team went to the PRR team. I started with them, got them running on Scrum, worked on oncall, incident command, etc. processes.
  • #25: This “legacy stack” is still around now, of course.
  • #26: Pretty much it’s all about collaboration, and putting just enough process in place to aid collaboration.
  • #29: In the end we had large scale integrated DevOps doing a lot of sustaining work as well as continuing heavy infrastructure development successfully.