Time Series Data Storage in MongoDB

ajackson @ skylineinnovations.com
a tale of rapid prototyping, data warehousing, solar power, an architecture designed for data analysis at “scale” ...and arduinos!

So here’s what I’d like to talk about: who we are, how we got started, and most importantly, how we’ve been able to use MongoDB to help us. We’re not a traditional startup -- and while I know this is not a “startups” talk but a Mongo one, I’d like to show how Mongo’s flexible nature really helped us as a business, and how Mongo specifically has been a good choice for us as we build some of our tools. Here are some themes:
Scaling

Mongo has come to have a pretty strong association with the word “scaling.”

Scaling is a word we throw around a lot, and it almost always means “software performance,
as inputs grow by orders of magnitude.”

But scaling also means performance as the variety of inputs increases. I’d argue that it’s
scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to
a hundred.

There’s another word for this.
Scaling
Flexibility

Particularly when you scale in the real world, you start to find that it’s complicated and messy and entropic in ways that software isn’t always equipped to handle. So for us, when we say “Mongo helps us scale,” we don’t necessarily mean scaling to petabytes of data (though we’ll come back to that as well).
Business-first development

This generally means flexible, lightweight processes. Things that become fixed & unchangeable quickly become obsolete and sad :’(
When Does “Context” become “Yak Shaving”?

When I read new things or hear about new stuff, I’m always trying to put it in context. So, sometimes I put too much context in my talks :( To avoid that, I sometimes go a little too fast over the context that *is* important. So please stop me to ask questions! Also, the problem domain here is a little different from what we might be used to, so bear with me as we go into plumbing & construction.
Preliminaries

Est. 8/2009
Project Development + Technology

“Project Development”
finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings.
We’ll pay to put stuff on your roof, and we’ll keep it maximally awesome.
Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that money back. What we do is charge you for the energy it saved you -- but here’s the twist. Other companies have done similar things, where they say “we’ll pay for a system/retrofit/whatever, you’ll agree to pay us an arbitrary number, and we say you’ll get savings, but you won’t actually be able to tell, really.” That always seemed sketchy to us. So, we actually measure the performance of this stuff, collect the data, and guarantee that you save money.
(not webapps)
Topics not covered:
• Why solar thermal?
• Why hasn’t anyone else done this before?
• Pivots? Iterations?
• What’s the market size?
• Funding? Capital structures?
• Wait, how do you guys make money?

Oh, right, this isn’t a startup talk. But feel free to ask me these later!
Solar Thermal in Five Minutes
( Mongo next, I promise! )
Municipal => Roof => Tank => Customer
Relevant Data to Track
Temperatures (about a dozen)
Flow Rates (at least two)
Parallel data streams (hopefully many)

e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
how much data?
20 data points @ 4 bytes
1 minute intervals
at 1000 projects (I wish!)
for 10 years
80 bytes * 60 * 24 * 365 * 10 * 1000 ≈ 420 GB
...not much, really, “in the raw”

Unfortunately, we can’t really store it with maximal efficiency, because of things like timestamps, metadata, etc., but still.
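A quick back-of-envelope check of that arithmetic, as a Python sketch (just the slide’s assumptions, nothing more):

bytes_per_minute = 20 * 4                  # 20 four-byte points per minute
minutes = 60 * 24 * 365 * 10               # ten years of minutes
projects = 1000
total_bytes = bytes_per_minute * minutes * projects
print(total_bytes / 1e9)                   # ~420 GB of raw samples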
I hope this provides enough context on the business problems we’re trying to solve. It looks like we’ll need a data pipeline, and we’ll need one fast.

We’ve got data that we’ll need to use to build, monitor, and monetize these energy technologies. Having worked at other smart grid companies before, I’ve seen some good data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff I have to build, the better.
As I do some research, I find that a lot of these data pipelines have a few well-defined areas of responsibility.
Acquisition, Storage, Search, Retrieval, Analytics.

These should be self-explanatory. What’s interesting is not only that most of the end-users of the system are analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
Acquisition,
Storage,
Search,          } Designed for these
Retrieval,

Analytics.       <= Users are here
                    Business value is here!

It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved.”
So, here’s how I started thinking about things. This is a design diagram from the early days of the company.
Easy -- Python, no problem. There are some interesting topics here, but they’re not MongoDB related. I was pretty sure I knew how to build this part, and I was pretty sure I knew what the data would look like.
This part was also easy -- e-mail reports, CSVs, maybe some fancy graphs, possibly some light webapps for internal use. These would be dictated by business goals first, but the technological questions were straightforward.
Here was the real question.

What would an analyst having a good experience look like? What would they expect the tools to do?
Now we can think about what the data looks like

So, let’s think about what this data looks like, how it’s structured, and what it is. Then, after that, we can look at the best ways to organize it for future usefulness.
Time series?
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614
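A minimal sketch of reading one of these files, assuming Python’s standard csv module and this exact header layout (read_rows is a hypothetical helper, not part of our pipeline as shown here):

import csv
from datetime import datetime

def read_rows(path):
    """Yield (timestamp, {sensor: value}) pairs from one logger CSV."""
    with open(path) as f:
        reader = csv.reader(f)
        header = next(reader)          # "Time", then the sensor names
        for row in reader:
            ts = datetime.strptime(row[0], "%a %b %d %H:%M:%S %Y")
            yield ts, dict(zip(header[1:], map(float, row[1:])))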
TIME SERIES DATA

So what is time series data?
Features, Over Time

Multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
[sketch: a “Thing” (feature vector, v) plotted against Time (t)]
A couple of ideas:
• sampling rates, “regularity,” “completeness”
• analog vs. digital
• instantaneous vs. cumulative (tradeoffs)
[sketch: a time axis with a known interval, tn to tn+1]

Finding known interesting ranges (definitely the most common).
[sketch: a series y over time, with thresholds y’ picking out interesting ranges at t, t’, etc.]

Using features to find interesting ranges.

These two ways to look for things should inform our design decisions.
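A sketch of what those two access patterns look like against the daily fact collection described later (pymongo syntax; the database name is hypothetical, the collection and field names follow the documents shown below):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient().warehouse           # hypothetical database name

# 1. A known time range: pull a window of daily facts.
window = db.facts_daily.find(
    {"_id.timestamp": {"$gte": datetime(2010, 8, 1),
                       "$lt": datetime(2010, 9, 1)}})

# 2. A feature threshold: find days where a measure crossed a line.
hot = db.facts_daily.find({"measures.Energy Delivered": {"$gt": 300000}})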
(more complicated stuff can be thought of as transformations...)

e.g., frequency analysis, wavelets, whatever.

At this point, I go off and do a bunch of research on existing technologies. I really hate
reinventing the wheel, and we really don’t have the manpower.
Time series specific tools

Scientific tools & libraries

Traditional data-warehousing approaches

So, these were some of the options I looked at. I want to quickly point out why I eliminated the first two classes of tools.
Time series specific tools

RRDtool -- Round Robin Database

There are surprisingly few of these. One of the best is RRDtool. It’s pretty sweet, and I highly recommend it. Unfortunately, it’s really designed for applications that are highly regular, and that are already pretty digital -- for instance, sampling latencies, or temperatures in a datacenter. It’s not really good for unreliable sensors, nor is it really designed for long-term persistence. It also has really high lock-in, with legacy data formats, etc. Don’t get me wrong, it’s totally rad, but I didn’t think it was for us.
Scientific tools & libraries

e.g., PyTables

Pretty cool, but not many of these were mature & ready for primetime. Some that were, like
PyTables, didn’t really match our business use-case.
Traditional data-warehousing approaches

That leaves us with the traditional approaches. This represents a pretty well-established field, but very few of the tools are free, lightweight, and mature.
Enterprise buzzwords
(Just google for OLAP)



But the biggest idea I learned is that most data warehousing revolves around the idea of a “fact table.” They call it a “multidimensional OLAP cube,” but basically it exists as a totally denormalized SQL table.
“Measures” and their “Dimensions”

(or facts)
pretty neat!

“how elegant!”
in practice...

[screenshot] (from “How to Build OLAP Application Using Mondrian + XMLA + SpagoBI”)

to which the only acceptable response is:

ha! Yeah right.
Time series are not relational!

Even extracted features are not inherently relational!

Also: you don’t know what you’re looking for, you don’t know when you’ll find it, and you won’t know when you’ll have to start looking for something different.
Why would you lock yourself into a schema?
We don’t know what we’ll want to know.

We won’t know what we want to know. Not only are we warehousing time-series of
multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in
yet!
natural fit for documents

This makes a schema-less database a natural fit for these sorts of things. Think about all the alter-table calls I’ve avoided...
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

isn’t this better?
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,      “measures”
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                                                                         “dimensions”
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                                                                                         ...right?
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently
we’ll look for measures by other measures -- i.e., each measure serves as a dimension.
...actually, not a good model.

The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure
provides another dimension.
Anyway!
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011

How do we build these quickly & efficiently?
the goal: good numbers!

Remember, the goal here is to make it easy for analysts to get comparable numbers, so when I ask for the delivered energy for one system, compared to the delivered energy from another, I can just get the time-series data, without having to worry about whether sensors changed, when the network was out, when a logger was replaced with another one, etc.
So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV series. It doesn’t really provide a lot of intelligence; it’s basically just the raw numbers.
from rows to columns

So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful
way. I’m gonna walk through that process, quickly.
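As a minimal sketch of that pivot, reusing the hypothetical read_rows helper from above:

from collections import defaultdict

def columns_for_chunk(rows):
    """Pivot row-major samples into one column (a list) per sensor."""
    cols = defaultdict(list)
    for ts, sample in rows:            # rows from read_rows(path)
        cols["timestamp"].append(ts)
        for sensor, value in sample.items():
            cols[sensor].append(value)
    return cols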
"_id" : {
                                "install.name" : "agni-3501",
                                "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                "frequency" : "daily" },
                        "measures" : {


                                                                       Let’s just look at one
                                "total-delta" : -85.78773442284201,
                                "Energy Sold" : 450087.1186574721,
                                "Generation" : 57273.159890170136,
                                "consumed-delta" : 12.569841951556597,
                                "lbs-sold" : 18848.4,
                                "Gallons Loop" : 740.5,
                                "Coincident Usage" : 400,
                                "Stored Energy" : 1306699.6439737699,
                                "Gallons Sold" : 2260,
                                "Energy Delivered" : 360069.6949259777,
                                "Total Usage" : -1605086.7261496289,
                                "Stratification" : -4.905050370111111,
                                "gen-delta-roof" : 4.819865854785763,
                                "lbs-loop" : 6520.1025 },
                        "day_of_year" : 218,
                        "day_of_week" : 4,
                        "month" : 8,
                        "week_of_year" : 31,
                        "install" : {
                                "panels" : 32,
                                "name" : "agni-3501",
                                "num_files" : "3744",
                                "heater_efficiency" : 0.8,
                                "storage" : 1612,
                                "install_completed" : ISODate("2010-08-06T00:00:00Z"),
                                "logger_type" : "emerald",
                                "_id" : ObjectId("4d2905536edfdb022f000212"),
                                "polysun_proj" : [
                                        22863.7, 24651.7, 30301.7,
                                        30053.5, 29640.5, 27806.4,
                                        27511, 28563.1, 27840.7,
                                        26470.9, 21718.9, 19145.4 ],
                                "last_seen" : "2011-01-08 05:26:35.352782" },
                        "year" : 2010,
                        "day" : 6
Sunday, July 24, 2011
row-major data
(the same row-major logger CSV as before)
“Functional”

import functools

class Mass(BasicMeasure):
    def __init__(self, density, volume):
        ...

        self._result_func = functools.partial(
            lambda data, density, volume: density * volume(data),
            density=density, volume=volume)

    def __call__(self, data):
        return self._result_func(data)

quasi-functional classes that describe how to calculate a value from data.
"_id" : {
                                        "install.name" : "agni-3501",
                                        "timestamp" : ISODate("2010-08-06T00:00:00Z"),
                                        "frequency" : "daily" },
                                "measures" : {
                                        "total-delta" : -85.78773442284201,
                                        "Energy Sold" : 450087.1186574721,
                                        "Generation" : 57273.159890170136,
                                        "consumed-delta" : 12.569841951556597,




                                                        A formula:

                                                      E = ∆t × F
                        #pseudocode
                        class LoopEnergy(BasicMeasure):
                            def __init__(self, heat_cap, delta, mass):
                                ...
                                def result_func(data):
                                    return self.delta(data) * self.mass(data) * self.heat_cap
                                self._result_func = result_func

                            def __call__(self, data):
                                return self._result_func(data)




Sunday, July 24, 2011
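A hypothetical composition of these measures -- Delta, GallonsVolume, the constants, and the sensor names are all illustrative, not from our actual codebase:

# Hypothetical wiring; every name below is illustrative.
delta = Delta("solar heated water out T", "municipal water in T")
mass = Mass(density=8.33, volume=GallonsVolume("customer flow meter"))
loop_energy = LoopEnergy(heat_cap=1.0, delta=delta, mass=mass)

value = loop_energy(columns)           # columns from columns_for_chunk(...)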
Creating a Cube

For each install, for each chunk of data:
    apply all known formulas to get values
    make some convenience keys (e.g., day_of_year)
    stuff it in mongo

Then, map/reduce to whatever dimensionalities you’re interested in: e.g., downsampling.

Here’s some pseudocode for how to make a cube of multidimensional data.
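A minimal pymongo sketch of that loop, under the assumptions above. (One caveat: the fact documents shown here use a literal “install.name” key inside _id; modern drivers reject dots in key names, so this sketch nests it instead.)

from pymongo import MongoClient

db = MongoClient().warehouse                   # hypothetical names throughout

def build_daily_fact(install, day, columns, measures):
    """Apply every known measure to one day's columns and upsert a fact."""
    doc = {
        "_id": {"install": {"name": install["name"]},
                "timestamp": day,
                "frequency": "daily"},
        "measures": {name: m(columns) for name, m in measures.items()},
        "install": install,
        # convenience keys for slicing later
        "year": day.year, "month": day.month, "day": day.day,
        "day_of_week": day.weekday(),
        "day_of_year": day.timetuple().tm_yday,
        "week_of_year": day.isocalendar()[1],
    }
    # idempotent: rebuilding a chunk just overwrites the same fact
    db.facts_daily.replace_one({"_id": doc["_id"]}, doc, upsert=True)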
So, what’s the payoff?

How much water did [x] use, monthly?

> db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1})

Complicated analytical queries can be boiled down to nearly single-line Mongo queries. Here are some examples:
What were our highest production days?

> db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1})

How does the distribution of [x] on the weekend compare to its distribution on the weekdays?

> weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}})
> weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}})
> do stuff

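One way to fill in “do stuff”, sketched client-side (assumes the same two queries issued through pymongo, and scipy as an arbitrary choice of stats library):

from scipy.stats import ks_2samp

weekends = db.facts_daily.find({"day_of_week": {"$in": [5, 6]}})
weekdays = db.facts_daily.find({"day_of_week": {"$nin": [5, 6]}})

def values(cursor, field="Gallons Sold"):
    return [doc["measures"][field] for doc in cursor
            if field in doc.get("measures", {})]

# Two-sample test: are the weekend and weekday distributions different?
stat, p_value = ks_2samp(values(weekends), values(weekdays))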
What’s the production of installs north of a certain latitude, with a certain class of panel, on Tuesdays?

For hours where the average delivered temperature delta was above [x], what was our generation efficiency?

Normalize by number of panels? (map/reduce)

Normalize by distance from equinox? (map/reduce)

...etc.
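We do these normalizations with Mongo map/reduce; here is the per-panel transform sketched client-side for clarity (field names as in the fact documents above):

# Generation per panel, for every daily fact.
per_panel = {}
for doc in db.facts_daily.find({}, {"measures.Generation": 1, "install.panels": 1}):
    gen = doc.get("measures", {}).get("Generation")
    panels = doc.get("install", {}).get("panels")
    if gen is not None and panels:
        per_panel[str(doc["_id"])] = gen / panels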
• Building a cube can be done in parallel
• Map/reduce is an easy way to think about transforms.
• Not maximally efficient, but parallelizes on commodity hardware.

Some advantages.
re #3 -- so what? It’s not a webapp.
mongoDB: the future of enterprise business intelligence.
(they just don’t know it yet)

So, here’s my thesis: document databases are far superior to relational databases for business intelligence use cases. Not only that, but mongoDB and some common sense let you replace multimillion-dollar, IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
Lastly...

Mongo expands in an organization.

It’s cool -- don’t fight it. Once we started using it for our analytics, we realized there was a lot of other schema-loose data we could use it for -- like the definitions of the measures themselves, or the details about an install, etc.
Final Thoughts

OK, I want to close with a few jumping-off points.
“Business Intelligence” no longer requires megabucks
Flexible tools mean business responsiveness should be easy
“Scaling” doesn’t just mean depth-first.

Businesses grow deep, in the sense of adding more users, but they also grow broad.
Questions?
Epilogue: Quest for Logging Hardware
This’ll be easy!
This is such an obvious and well-explored problem space, I’m sure we’ll be able to find a solution that matches our needs without breaking the bank!
Shopping List!
16 temperature sensors
4 flow sensors
maybe some miscellaneous ones
internet backhaul
no software/data lock-in
Conventions FTW!
And since we’ve walked a couple of convention floors and paged through product catalogs from major industrial supply vendors, I’m sure it’s in here somewhere!
derp derp “internet”?
I’m sure there’s a reason why all of these loggers have to connect via USB...

Pace Scientific XR5:
8 analog
3 pulse
ONE MB
no internet?
$500?!?
yay windows?
...and they require proprietary (Windows!) software or subscription plans that route my data through their servers
(basically all of them!)
Maybe the gov’t can help!
Perhaps there’s some kind of standard that governments require for solar thermal monitoring systems to be eligible for incentives or tax credits.
Vive la France!
An obscure standard by the Organisation Internationale de Métrologie Légale appears! Neat!
A “Certified” Logger
two temperature sensors
one pulse
no increase in accuracy
no data backhaul -- at all
...
what’s the price?
$1,000
Hmm...
I can solder, and Arduinos are pretty cheap
It’s on!

Arduino + netbook!
TL;DR: Existing loggers are terrible.

Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
Image credits:
•   http://www.flickr.com/photos/rknight/4358119571/
•   http://4.bp.blogspot.com/_8vNzwxlohg0/TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/s320/turtles-all-the-way-down.jpg
•   http://www.flickr.com/photos/rhk313/3801302914/
•   http://www.flickr.com/photos/benny_lin/481411728/
•   http://spagobi.blogspot.com/2010_08_01_archive.html
•   http://community.qlikview.com/forums/t/37106.aspx

More Related Content

PDF
Grse project vishal
PDF
Microbiologically influenced corrosion (mic) 2019
PPTX
Chemical treatment methods
PPTX
Introduction to Marine Pollution Control
PPTX
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
PPTX
MongoDB for Time Series Data
PPTX
MongoDB for Time Series Data Part 1: Setting the Stage for Sensor Management
PPTX
MongoDB for Time Series Data: Schema Design
Grse project vishal
Microbiologically influenced corrosion (mic) 2019
Chemical treatment methods
Introduction to Marine Pollution Control
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data
MongoDB for Time Series Data Part 1: Setting the Stage for Sensor Management
MongoDB for Time Series Data: Schema Design

Viewers also liked (20)

PPTX
MongoDB for Time Series Data Part 3: Sharding
PPTX
MongoDB for Time Series Data: Setting the Stage for Sensor Management
PPTX
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
PPTX
The Aggregation Framework
PPTX
Using MongoDB As a Tick Database
PDF
Time series storage in Cassandra
PPT
MongoDB Tick Data Presentation
PPTX
Data Modeling IoT and Time Series data in NoSQL
PDF
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
PDF
Webinar: Working with Graph Data in MongoDB
PPTX
Big Data, NoSQL with MongoDB and Cassasdra
PPTX
Back to Basics Webinar 1: Introduction to NoSQL
PDF
Resilience an engineering construction perspective
PPTX
Riak TS
PDF
International Journal of Industrial Engineering and Design vol 2 issue 1
PDF
Con8862 no sql, json and time series data
PDF
MongoDB in the Big Data Landscape
PDF
Riakはなぜ良いのか
PDF
VMT 11 - Artikel BioValGroup
PDF
No sql e as vantagens na utilização do mongodb
MongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data: Setting the Stage for Sensor Management
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
The Aggregation Framework
Using MongoDB As a Tick Database
Time series storage in Cassandra
MongoDB Tick Data Presentation
Data Modeling IoT and Time Series data in NoSQL
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Working with Graph Data in MongoDB
Big Data, NoSQL with MongoDB and Cassasdra
Back to Basics Webinar 1: Introduction to NoSQL
Resilience an engineering construction perspective
Riak TS
International Journal of Industrial Engineering and Design vol 2 issue 1
Con8862 no sql, json and time series data
MongoDB in the Big Data Landscape
Riakはなぜ良いのか
VMT 11 - Artikel BioValGroup
No sql e as vantagens na utilização do mongodb
Ad

Similar to Time Series Data Storage in MongoDB (20)

PDF
STI Summit 2011 - Linked services
PDF
Commercialization of OpenStack Object Storage
PDF
Final Year Project Guidance
PPT
EDF2013: Keynote Knut Sebastian Tungland: We need to understand (our) data
PDF
Grid Observatory @ CCGrid 2011
PDF
MongoDB at Sailthru: Scaling and Schema Design
PPTX
Zero to ten million daily users in four weeks: sustainable speed is king
PDF
Global Geothermal Summit, Oct. 12, 2011
PDF
20100301icde
PDF
On Failure and Resilience
PDF
Reliability & Scale in AWS while letting you sleep through the night
PDF
eBay’s Challenges and Lessons
PDF
John D. Rowell - Scaling heterogeneous systems on the cloud
PDF
Big Data @ Bodensee Barcamp 2010
PDF
ANALYZING LARGE-SCALE USER DATA from Structure:Data 2012
PDF
Panasonic search
 
PDF
Cloud Log Analysis and Visualization
PPT
It challenge
PDF
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
PDF
Faster Cheaper Better-Replacing Oracle with Hadoop & Solr
STI Summit 2011 - Linked services
Commercialization of OpenStack Object Storage
Final Year Project Guidance
EDF2013: Keynote Knut Sebastian Tungland: We need to understand (our) data
Grid Observatory @ CCGrid 2011
MongoDB at Sailthru: Scaling and Schema Design
Zero to ten million daily users in four weeks: sustainable speed is king
Global Geothermal Summit, Oct. 12, 2011
20100301icde
On Failure and Resilience
Reliability & Scale in AWS while letting you sleep through the night
eBay’s Challenges and Lessons
John D. Rowell - Scaling heterogeneous systems on the cloud
Big Data @ Bodensee Barcamp 2010
ANALYZING LARGE-SCALE USER DATA from Structure:Data 2012
Panasonic search
 
Cloud Log Analysis and Visualization
It challenge
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster Cheaper Better-Replacing Oracle with Hadoop & Solr
Ad

Recently uploaded (20)

PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Big Data Technologies - Introduction.pptx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Approach and Philosophy of On baking technology
PDF
KodekX | Application Modernization Development
PDF
cuic standard and advanced reporting.pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Spectral efficient network and resource selection model in 5G networks
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Digital-Transformation-Roadmap-for-Companies.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
MIND Revenue Release Quarter 2 2025 Press Release
Unlocking AI with Model Context Protocol (MCP)
Per capita expenditure prediction using model stacking based on satellite ima...
Programs and apps: productivity, graphics, security and other tools
Mobile App Security Testing_ A Comprehensive Guide.pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Building Integrated photovoltaic BIPV_UPV.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Big Data Technologies - Introduction.pptx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Approach and Philosophy of On baking technology
KodekX | Application Modernization Development
cuic standard and advanced reporting.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx

Time Series Data Storage in MongoDB

  • 2. ajackson @ skylineinnovations.com Sunday, July 24, 2011
  • 3. a tale of rapid prototyping, data warehousing, solar power, an architecture designed for data analysis at “scale” ...and arduinos! Sunday, July 24, 2011 So here’s what i’d like to talk about: Who we are, how we got started, and most importantly, how we’ve been able to use MongoDB to help us. We’re not a traditional startup -- and while i know that this is not a “startups” talk, but a Mongo one, i’d like to show how Mongo’s flexible nature really helped us as a business, and how Mongo specifically has been a good choice for us as we build some of our tools. Here are some themes:
  • 4. Scaling Sunday, July 24, 2011 Mongo has come to have a pretty strong association with the word “scaling.” Scaling is a word we throw around a lot, and it almost always means “software performance, as inputs grow by orders of magnitude.” But scaling also means performance as the variety of inputs increases. I’d argue that it’s scaling to go from 10 users to 10,000, and it’s also scaling to go from ten ‘kinds’ of input to a hundred. There’s another word for this.
  • 5. Scaling Flexibility Sunday, July 24, 2011 Particularly when you scale in the real world, you start to find that it’s complicated and messy and entropic in ways that software isn’t always equipped to handle. So for us, when we say “mongo helps us scale”, we don’t necessarily mean scaling to petabytes of data. We’ll come back to them as well.
  • 6. Business-first development Sunday, July 24, 2011 This generally means flexibile, lightweight processes. Things that become fixed & unchangable quickly become obsolete and sad :’(
  • 7. When Does “Context” become “Yak Shaving”? Sunday, July 24, 2011 When i read new things or hear about new stuff, I’m always trying to put it in context. So, sometimes i put too much context in my talks :( To avoid it, I sometimes go a little too fast over the context that *is* important. So please stop me to ask questions! Also, the problem domain here is a little different than what we might be used to, so bear with me as we go into plumbing & construction.
  • 10. Project Development + Technology Sunday, July 24, 2011
  • 12. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011
  • 13. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 We’ll pay to put stuff on your roof, and we’ll keep it at its maximally awesome.
  • 14. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 Right now, this means solar thermal, more efficient lighting retrofits, and maybe HVAC.
  • 15. finance, develop, and operate renewable energy and efficiency installations, for measurable, guaranteed savings. Sunday, July 24, 2011 So, here’s the interesting part. Since we put stuff on your roof for free, we need to get that money back. What we do is, we’ll charge you for the energy that it saved you, but, here’s the twist. Other companies have done similar things, where they say “we’ll pay for a system/ retrofit/whatever, and you’ll agree to pay us an arbitrary number, and we say you’ll get savings, but you won’t actually be able to tell, really.” That always seemed sketchy to us. So, we actually measure the performance of this stuff, collect the data, and guarantee that you save money.
  • 18. • Why solar thermal? • Why hasn’t anyone else done this before? • Pivots? Iterations? • What’s the market size? • Funding? Capital structures? • Wait, how do you guys make money? Sunday, July 24, 2011 Oh, right, this isn’t a startup talk. But feel free to ask me these later!
  • 19. Solar Thermal in Five Minutes ( mongo next, i promise! ) Sunday, July 24, 2011
  • 20. Municipal => Roof => Tank => Customer Sunday, July 24, 2011
  • 21. Relevant Data to Track Sunday, July 24, 2011
  • 22. Temperatures (about a dozen) Sunday, July 24, 2011
  • 23. Flow Rates (at least two) Sunday, July 24, 2011
  • 24. Parallel data streams (hopefully many) Sunday, July 24, 2011 e.g., weather data, insolation data. It’d be nice if we didn’t have to collect it all ourselves.
  • 25. how much data? 20 data points @ 4 bytes 1 minute intervals at 1000 projects (I wish!) for 10 years 80 * 60 * 24 * 365 * 10 * 1000 = 400 GB? ...not much, really, “in the raw” Sunday, July 24, 2011 unfortunately, we can’t really store it with maximal efficiency, because of things like timestamps, metadata, etc., but still.
  • 26. Sunday, July 24, 2011 I hope this provides enough context on the business problems we’re trying to solve. It looks like we’ll need a data pipeline, and we’ll need one fast. We’ve got data that we’ll need to use to build, monitor, and monetize these energy technologies. Having worked at other smart grid companies before, I’ve seen some good data pipelines and some bad data pipelines. I’d like to build a good one. The less stuff i have to build, the better.
  • 27. Sunday, July 24, 2011 As i do some research, i find that a lot of these data pipelines have a few well-defined areas of responsibility.
  • 28. Acquisition, Storage, Search, Retrieval, Analytics. Sunday, July 24, 2011 These should be self explanatory. What’s interesting is that not only are most of the end- users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
  • 29. Acquisition, Storage, Search, Retrieval, } Designed for these Analytics. <= Users are here Sunday, July 24, 2011 These should be self explanatory. What’s interesting is that not only are most of the end- users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before.
  • 30. Acquisition, Storage, Search, Retrieval, Analytics. Sunday, July 24, 2011 These should be self explanatory. What’s interesting is that not only are most of the end- users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved”
  • 31. Acquisition, Storage, Search, Retrieval, } Designed for these Analytics. <= Users are here Business value is here! Sunday, July 24, 2011 These should be self explanatory. What’s interesting is that not only are most of the end- users of the system analysts, interested in analyzing, but that most systems seem to be designed for the other functionality. More importantly, they’re not very well decoupled: by the time the analysts get to start building tools, the design decisions from the beginning are inextricable from the systems that came before. It’s important to remember that, while you can’t get good analytics without the other stuff, the analytics is where almost all of the value is! Search & retrieval are approaching “solved”
  • 32. Sunday, July 24, 2011 so, here’s how i started thinking about things. This is a design diagram from the early days of the company.
  • 33. Sunday, July 24, 2011 easy, python, no problem. There are some interesting topics here, but they’re not mongoDB related. I was pretty sure i knew how to build this part, and i was pretty sure i knew what the data would look like.
  • 34. Sunday, July 24, 2011 This part was also easy -- e-mail reports, csvs, maybe some fancy graphs, possibly some light webapps for internal use. These would be dictated by business goals first, but the technological questions were straightforward.
  • 35. Sunday, July 24, 2011 Here was the real question. What would be some use cases of an analyst having a good experience look like? What would they expect the tools to do?
  • 36. Now we can think about what the data looks like Sunday, July 24, 2011 So, let’s think about what this data looks like, how it’s structured and what it is. Then, after that, we can look at what the best ways to organize it for future usefulness.
  • 37. Time series? Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458 Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462 Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468 Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471 Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472 Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477 Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581 Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614 Sunday, July 24, 2011
  • 38. TIME SERIES DATA Sunday, July 24, 2011 So what is time series data?
  • 39. Features, Over Time Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 40. Features, Over Time Thing (Feature vector, v) Time (t) Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
  • 41. Features, Over Time Thing (Feature vector, v) Time (t) Sunday, July 24, 2011 multi-dimensional features. What’s fun in a business like this is that we’re not really sure what the features we study will be. -- Flexibility callout
• 42. Sunday, July 24, 2011 A couple of ideas: sampling rates; “regularity”; “completeness”; analog vs. digital; instantaneous vs. cumulative (tradeoffs)
  • 43. tn tn+1 Sunday, July 24, 2011 Finding known interesting ranges (definitely the most common)
  • 44. tn tn+1 Sunday, July 24, 2011 Finding known interesting ranges (definitely the most common)
  • 45. t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 46. y t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 47. y Thresholds y’ t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
  • 48. y Thresholds y’ t t’ etc. Sunday, July 24, 2011 Using features to find interesting ranges. These two ways to look for things should inform our design decisions.
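To make those two lookup patterns concrete, here’s a minimal pymongo sketch, assuming fact documents shaped like the ones shown later in the talk; the database name, the collection name "facts_daily", and the measure field "measures.y" are illustrative assumptions, not the real schema.

    from datetime import datetime
    from pymongo import Connection  # pymongo's 2011-era client class

    db = Connection()["warehouse"]  # hypothetical database name

    # Pattern 1: a known interesting range [tn, tn+1).
    t0, t1 = datetime(2010, 3, 9), datetime(2010, 3, 10)
    in_range = db.facts_daily.find(
        {"_id.timestamp": {"$gte": t0, "$lt": t1}})

    # Pattern 2: use a feature threshold to *find* interesting ranges:
    # every document where the (illustrative) measure y exceeded y'.
    above_threshold = db.facts_daily.find(
        {"measures.y": {"$gt": 100.0}}).sort("_id.timestamp", 1)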
  • 49. (more complicated stuff can be thought of as transformations...) Sunday, July 24, 2011 e.g., frequency analysis, wavelets, whatever.
  • 50. Sunday, July 24, 2011 At this point, I go off and do a bunch of research on existing technologies. I really hate reinventing the wheel, and we really don’t have the manpower.
  • 51. Time series specific tools Scientific tools & libraries Traditional data-warehousing approaches Sunday, July 24, 2011 So, these were some of the options i looked at. I want to quickly point out why i eliminated the first two classes of tools.
• 52. Time series specific tools RRDtool -- Round Robin Database Sunday, July 24, 2011 There are really surprisingly few of these. One of the best is RRDtool. It’s pretty sweet, and i highly recommend it. Unfortunately, it’s really designed for applications that are highly regular, and that are already pretty digital -- for instance, sampling latencies, or temperatures in a datacenter. It’s not really good for unreliable sensors, nor is it really designed for long-term persistence. It also has really high lock-in, with legacy data formats, etc. Don’t get me wrong, it’s totally rad, but i didn’t think it was for us.
  • 53. Scientific tools & libraries e.g., PyTables Sunday, July 24, 2011 Pretty cool, but not many of these were mature & ready for primetime. Some that were, like PyTables, didn’t really match our business use-case.
• 54. Traditional data-warehousing approaches Sunday, July 24, 2011 That leaves us with the traditional approaches. This represents a pretty well-established field, but very few of the tools are free, lightweight, and mature.
  • 55. Enterprise buzzwords (Just google for OLAP) Sunday, July 24, 2011 But the biggest idea i learned is that most data warehousing revolves around the idea of a “fact table”. They call it a “multidimensional OLAP cube”, but basically it exists as a totally denormalized SQL table.
  • 56. “Measures” and their “Dimensions” Sunday, July 24, 2011 (or facts)
  • 61. (from “How to Build OLAP Application Using Mondrian + XMLA + SpagoBI”) Sunday, July 24, 2011 to which the only acceptable response is:
  • 62. Sunday, July 24, 2011 ha! Yeah right.
  • 63. Time series are not relational! Sunday, July 24, 2011 even extracted features are not inherently relational! Also: you don’t know what you’re looking for, you don’t know when you’ll find it, you won’t know when you’ll have to start looking for something different. Why would you lock yourself into a schema?
  • 64. We don’t know what we’ll want to know. Sunday, July 24, 2011 We won’t know what we want to know. Not only are we warehousing time-series of multidimensional feature vectors, we don’t even know the dimensions we’ll be interested in yet!
  • 65. natural fit for documents Sunday, July 24, 2011 This makes a schema-less database a natural fit for these sorts of things. Think about all the alter-table calls i’ve avoided...
  • 66. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011 isn’t this better?
• 67. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, “measures” => "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, “dimensions” => "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 ...right? Sunday, July 24, 2011 measures & dimensions. This would be a nice, clean division, except that it isn’t. Frequently we’ll look for measures by other measures -- i.e., each measure serves as a dimension.
  • 68. ...actually, not a good model. Sunday, July 24, 2011 The line gets pretty blurry, in practice. Multi-dimensional vectors mean every measure provides another dimension. Anyway!
  • 69. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011 How do we build these quickly & efficiently?
• 70. the goal: good numbers! Sunday, July 24, 2011 Remember, the goal here is to make it easy for analysts to get comparable numbers, so when i ask for the delivered energy for one system, compared to the delivered energy from another, i can just get the time-series data, without having to worry about whether sensors changed, when the network was out, when a logger was replaced with another one, etc.
• 71. Sunday, July 24, 2011 So, the OLTP layer serving as our inputs essentially serves up timestamped data as CSV series. It doesn’t really provide a lot of intelligence; it’s basically just the raw numbers.
  • 72. from rows to columns Sunday, July 24, 2011 So, most of what our pipeline does is turn things from rows to columns, in a flexible, useful way. I’m gonna walk through that process, quickly.
• 73. Let’s just look at one: "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597, "lbs-sold" : 18848.4, "Gallons Loop" : 740.5, "Coincident Usage" : 400, "Stored Energy" : 1306699.6439737699, "Gallons Sold" : 2260, "Energy Delivered" : 360069.6949259777, "Total Usage" : -1605086.7261496289, "Stratification" : -4.905050370111111, "gen-delta-roof" : 4.819865854785763, "lbs-loop" : 6520.1025 }, "day_of_year" : 218, "day_of_week" : 4, "month" : 8, "week_of_year" : 31, "install" : { "panels" : 32, "name" : "agni-3501", "num_files" : "3744", "heater_efficiency" : 0.8, "storage" : 1612, "install_completed" : ISODate("2010-08-06T00:00:00Z"), "logger_type" : "emerald", "_id" : ObjectId("4d2905536edfdb022f000212"), "polysun_proj" : [ 22863.7, 24651.7, 30301.7, 30053.5, 29640.5, 27806.4, 27511, 28563.1, 27840.7, 26470.9, 21718.9, 19145.4 ], "last_seen" : "2011-01-08 05:26:35.352782" }, "year" : 2010, "day" : 6 Sunday, July 24, 2011
• 74. row-major data
Time,municipal water in T,solar heated water out T,solar tank bottom taped to side,solar tank top taped to side,array in/out,array in/out,tank room ambient t,array supply temperature,array return temperature,solar energy sensor,customer flow meter,customer OIML btu meter,solar collector array flow meter,solar collector array OIML btu meter,Cycle Count
Tue Mar 9 23:01:44 2010,14.7627064834,53.7822899383,12.1642527206,51.1436001456,6.40476190476,8.9582972583,22.6857033228,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333458
Tue Mar 9 23:02:44 2010,14.958038343,53.764889193,12.1642527206,51.0925345058,6.40476190476,8.85184138407,22.5716100982,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:03:45 2010,15.1145934976,53.6986641192,12.1642527206,50.8692901812,6.40476190476,8.78519002979,22.5673674246,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333462
Tue Mar 9 23:04:45 2010,15.2512207824,53.5955190752,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333468
Tue Mar 9 23:05:45 2010,15.3690229715,53.5534492867,12.1642527206,50.8293877551,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333471
Tue Mar 9 23:06:46 2010,15.5253261193,53.5534492867,12.1642527206,50.8658228816,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.3083978559,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:07:46 2010,15.6676270005,53.5534492867,12.1642527206,50.9177829276,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.293277114,0.0,0.0,0.0,0.0,0.0,333472
Tue Mar 9 23:08:47 2010,15.7915083121,53.4761516976,12.1642527206,50.8398031014,6.40476190476,8.78519002979,22.5652456306,24.0728390462,22.1826467404,0.0,0.0,0.0,0.0,0.0,333477
Tue Mar 9 23:09:47 2010,15.9763741003,53.693428918,12.1642527206,50.7859446809,6.40476190476,8.78519002979,22.5461357574,24.0728390462,22.1782915595,0.0,1.0,0.0,0.0,0.0,333581
Tue Mar 9 23:10:47 2010,16.1650984572,54.0547534088,12.1642527206,50.725,6.40476190476,8.78519002979,22.4544906773,24.0728390462,22.1782915595,0.0,0.0,0.0,0.0,0.0,333614
Sunday, July 24, 2011
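The core of that transformation, stripped of all the real pipeline’s bookkeeping, is just a pivot. Here’s a minimal sketch -- the timestamp format matches the sample above, but everything else is simplified (note the real headers even repeat, e.g. "array in/out" appears twice, which real code would have to disambiguate):

    import csv
    from collections import defaultdict
    from datetime import datetime

    def pivot(csv_path):
        """Turn row-major sensor CSV into one (t, value) series per column."""
        series = defaultdict(list)
        with open(csv_path) as f:
            reader = csv.reader(f)
            header = next(reader)  # first row names the sensors
            for row in reader:
                t = datetime.strptime(row[0], "%a %b %d %H:%M:%S %Y")
                for name, raw in zip(header[1:], row[1:]):
                    series[name].append((t, float(raw)))
        return series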
• 75. “Functional”

    class Mass(BasicMeasure):
        def __init__(self, density, volume):
            ...
            self._result_func = functools.partial(
                lambda data, density, volume: density * volume(data),
                density=density, volume=volume)

        def __call__(self, data):
            return self._result_func(data)

Sunday, July 24, 2011 quasi-functional classes that describe how to calculate a value from data.
• 76. "_id" : { "install.name" : "agni-3501", "timestamp" : ISODate("2010-08-06T00:00:00Z"), "frequency" : "daily" }, "measures" : { "total-delta" : -85.78773442284201, "Energy Sold" : 450087.1186574721, "Generation" : 57273.159890170136, "consumed-delta" : 12.569841951556597,

A formula: E = ∆t × F

    #pseudocode
    class LoopEnergy(BasicMeasure):
        def __init__(self, heat_cap, delta, mass):
            ...
            def result_func(data):
                return self.delta(data) * self.mass(data) * self.heat_cap
            self._result_func = result_func

        def __call__(self, data):
            return self._result_func(data)

Sunday, July 24, 2011
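These compose: a measure built from other measures is just a callable built from other callables. A hypothetical composition -- Flow and TempDelta are made-up measure classes here, and the constants are just water’s 8.33 lbs/gallon and 1 BTU/(lb·°F):

    # All names below are illustrative, not our production definitions.
    flow = Flow(column="solar collector array flow meter")
    delta = TempDelta("array supply temperature", "array return temperature")
    mass = Mass(density=8.33, volume=flow)              # lbs of water moved
    energy = LoopEnergy(heat_cap=1.0, delta=delta, mass=mass)

    btus = energy(chunk_of_data)                        # E = dT * m * c, per chunk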
• 77. Creating a Cube
For each install, for each chunk of data:
  - apply all known formulas to get values
  - make some convenience keys (e.g., day_of_year)
  - stuff it in mongo
Then, map/reduce to whatever dimensionalities you’re interested in: e.g., downsampling.
Sunday, July 24, 2011 Here’s the recipe for how to make a cube of multidimensional data (a sketch follows below). So, what’s the payoff?
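A minimal pymongo sketch of that loop; chunks_by_day(), ALL_MEASURES, and the collection names are hypothetical stand-ins for the real pipeline:

    from pymongo import Connection

    db = Connection()["warehouse"]

    for install in db.installs.find():
        for day, data in chunks_by_day(install):       # hypothetical loader
            measures = dict((m.name, m(data)) for m in ALL_MEASURES)
            db.facts_daily.update(
                {"_id": {"install.name": install["name"],
                         "timestamp": day,
                         "frequency": "daily"}},
                {"$set": {"measures": measures,
                          "install": install,
                          # convenience dimensions, cheap to query later
                          "day_of_year": day.timetuple().tm_yday,
                          "day_of_week": day.weekday(),
                          "week_of_year": day.isocalendar()[1],
                          "month": day.month,
                          "year": day.year,
                          "day": day.day}},
                upsert=True)

Since each (install, chunk) pair is independent, the loop parallelizes trivially across workers.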
• 78. How much water did [x] use, monthly? > db.facts_monthly.find({"install.name": [foo]}, {"measures.Gallons Sold": 1}).sort({"_id": 1}) Sunday, July 24, 2011 Complicated analytical queries boil down to nearly single-line mongo queries. Here are some examples:
• 79. What were our highest production days? > db.facts_daily.find({}, {"measures.Energy Sold": 1}).sort({"measures.Energy Sold": -1}) Sunday, July 24, 2011
• 80. How does the distribution of [x] on the weekend compare to its distribution on the weekdays? > weekends = db.facts_daily.find({"day_of_week": {$in: [5,6]}}) > weekdays = db.facts_daily.find({"day_of_week": {$nin: [5,6]}}) > do stuff Sunday, July 24, 2011
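The "do stuff" step is ordinary Python once the cursors come back. A sketch -- the field names match the example documents, everything else is illustrative:

    from pymongo import Connection

    facts = Connection()["warehouse"].facts_daily

    def values(cursor, key="Gallons Sold"):
        return [d["measures"][key] for d in cursor if key in d["measures"]]

    weekends = values(facts.find({"day_of_week": {"$in": [5, 6]}}))
    weekdays = values(facts.find({"day_of_week": {"$nin": [5, 6]}}))

    mean = lambda xs: sum(xs) / float(len(xs))
    print mean(weekends), mean(weekdays)    # then histograms, tests, etc.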
  • 81. What’s the production of installs north of a certain latitude, with a certain class of panel, on Tuesdays? For hours where the average delivered temperature delta was above [x], what was our generation efficiency? Normalize by number of panels? (map/reduce) Normalize by distance from equinox? (map/reduce) ...etc. Sunday, July 24, 2011
  • 82. • Building a cube can be done in parallel • Map/reduce is an easy way to think about transforms. • Not maximally efficient, but parallelizes on commodity hardware. Sunday, July 24, 2011 Some advantages. re #3 -- so what? It’s not a webapp.
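For instance, the downsampling transform from a few slides back maps naturally onto Mongo’s map/reduce, shown here through pymongo’s 2011-era map_reduce call; the output collection name and the summed measure are illustrative:

    from bson.code import Code
    from pymongo import Connection

    facts = Connection()["warehouse"].facts_daily

    mapper = Code("""
        function () {
            emit({install: this._id["install.name"],
                  year: this.year, month: this.month},
                 this.measures["Energy Sold"]);
        }""")

    reducer = Code("""
        function (key, values) {
            var total = 0;
            values.forEach(function (v) { total += v; });
            return total;
        }""")

    monthly = facts.map_reduce(mapper, reducer, "energy_sold_monthly")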
  • 83. mongoDB: The future of enterprise business intelligence. (they just don’t know it yet) Sunday, July 24, 2011 So, here’s my thesis: document-databases are far superior to relational databases for business intelligence cases. Not only that, but mongoDB and some common sense lets you replace multimillion dollar IBM-level enterprise solutions with open-source awesomeness. All this in a rapid, agile way.
  • 85. Mongo expands in an organization. Sunday, July 24, 2011 it’s cool, don’t fight it. Once we started using it for our analytics, we realized there was a lot of other schema-loose data that we could use it for -- like the definitions of the measures themselves, or the details about an install, etc., etc.
  • 86. Final Thoughts Sunday, July 24, 2011 Ok, i want to close up with a few jumping-off points.
  • 87. “Business Intelligence” no longer requires megabucks Sunday, July 24, 2011
• 88. Flexible tools mean business responsiveness should be easy Sunday, July 24, 2011
  • 89. “Scaling” doesn’t just mean depth-first. Sunday, July 24, 2011 businesses grow deep, in the sense of adding more users, but they also grow broad.
• 91. Epilogue: Quest for Logging Hardware Sunday, July 24, 2011
  • 92. This’ll be easy! This is such an obvious and well explored problem space, i’m sure we’ll be able to find a solution that matches our needs without breaking the bank! Sunday, July 24, 2011
  • 93. Shopping List! 16 temperature sensors 4 flow sensors maybe some miscellaneous ones internet backhaul no software/data lock in. Sunday, July 24, 2011
• 94. Conventions FTW! And since we’ve walked a couple of convention floors and paged through product catalogs from major industrial supply vendors, i’m sure it’s in here somewhere! Sunday, July 24, 2011
  • 95. derp derp “internet”? I’m sure there’s a reason why all of these loggers have to connect via USB... Pace Scientific XR5: 8 analog 3 pulse ONE MB no internet? $500?!? Sunday, July 24, 2011
  • 96. yay windows? ...and require proprietary (windows!) software or subscription plans that route my data through their servers (basically all of them!) Sunday, July 24, 2011
  • 97. Maybe the gov’t can help! Perhaps there’s some kind of standard that the governments require for solar thermal monitoring systems to be eligible for incentives or tax credits. Sunday, July 24, 2011
  • 98. Vive la France! An obscure standard by the Organisation Internationale de Métrologie Légale appears! Neat! Sunday, July 24, 2011
  • 99. A “Certified” Logger two temperature sensors one pulse no increase in accuracy no data backhaul -- at all ... what’s the price? Sunday, July 24, 2011
  • 102. Hmm... I can solder, and arduinos are pretty cheap Sunday, July 24, 2011
  • 104. arduino + netbook! Sunday, July 24, 2011
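The netbook side really is about this small. A hedged sketch with pyserial -- the port, baud rate, and the arduino’s output format are assumptions:

    import time
    import serial   # pyserial

    port = serial.Serial("/dev/ttyUSB0", 9600, timeout=5)
    while True:
        line = port.readline().strip()
        if line:
            # time.ctime() gives the same style of stamp as the CSVs
            # shown earlier; append each reading to the day's file.
            print "%s,%s" % (time.ctime(), line)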
• 105. TL;DR: Existing loggers are terrible. Sunday, July 24, 2011 Also, existing industries aren’t really ready for rapid prototyping and its destructive effects.
• 106. Image credits:
• http://www.flickr.com/photos/rknight/4358119571/
• http://4.bp.blogspot.com/_8vNzwxlohg0/TJoUWqsF4LI/AAAAAAAABMg/QaUiKwCEZn8/s320/turtles-all-the-way-down.jpg
• http://www.flickr.com/photos/rhk313/3801302914/
• http://www.flickr.com/photos/benny_lin/481411728/
• http://spagobi.blogspot.com/2010_08_01_archive.html
• http://community.qlikview.com/forums/t/37106.aspx
Sunday, July 24, 2011