Week 8 – Big Data and Hadoop

This week we covered Big Data and Hadoop, a topic dear to me, as I try to understand what to do with all the electricity smart meter reads we receive as a company. We used to receive one meter read every two months. Now we receive 48 meter reads a day, or 2880 every two months. That’s quite a volume increase, and we’ll increasingly need to rely on big data techniques to process this data.

Which brings me to my first task for this week, which was to look at other potential or existing use cases for big data. As you can see, the increase in electricity meter reads is quite significant. But it’s still not enough. To start to analyse how people consume electricity, we’ll need to move towards minute-by-minute reads for each device in the household. At 1440 reads per device per day, a household with, say, five devices could generate 7200 meter reads a day, or 432,000 meter reads every two months. As you can imagine, that’s quite a volume increase from one meter read every two months!
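The arithmetic behind those volumes can be checked with a short sketch. The 60-day billing period and the five devices per household are assumptions (they are what the 2880, 7200 and 432,000 figures imply), not numbers from any real meter specification.

```python
# Back-of-the-envelope read volumes, using the figures quoted in the post.
DAYS_PER_BILLING_PERIOD = 60     # assumption: "two months" taken as 60 days
HALF_HOURLY_READS_PER_DAY = 48   # current smart meter interval
MINUTES_PER_DAY = 24 * 60        # one read per minute, per device
DEVICES_PER_HOUSEHOLD = 5        # assumption: implied by the 7200 figure


def reads_per_period(reads_per_day: int, days: int = DAYS_PER_BILLING_PERIOD) -> int:
    """Total reads in one billing period for a given daily read count."""
    return reads_per_day * days


half_hourly = reads_per_period(HALF_HOURLY_READS_PER_DAY)
per_minute_daily = MINUTES_PER_DAY * DEVICES_PER_HOUSEHOLD
per_minute = reads_per_period(per_minute_daily)

print(half_hourly)       # 2880
print(per_minute_daily)  # 7200
print(per_minute)        # 432000
```

Changing `DEVICES_PER_HOUSEHOLD` shows how quickly the volume scales with the number of monitored devices.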

The second task for the week was to check out http://www.kdnuggets.com/2015/07/big-data-big-profits-google-nest-lesson.html, which is a Google Nest case study. Google’s Nest is a thermostat for heating and air conditioning systems in the USA. Nest learns people’s patterns of behaviour in terms of the cooling and heating they want, and delivers that more efficiently than existing ‘dumb’ thermostats. Nest is more efficient since it can figure out that no one’s home and reduce heating, therefore saving power and money. Of course, to do that, it needs to remember and process a lot of data points, which makes it an example of big data similar to the smart meter scenario I pointed out earlier.

The third task for the week was to read an IBM White Paper on the Top Five Ways to get started with big data (http://public.dhe.ibm.com/common/ssi/ecm/im/en/imw14710usen/IMW14710USEN.PDF), which are:

  1. Big Data exploration, which is exploring information from sensors, and extracting trends. The company I work for currently does this, by extracting information from Power Station sensors, and doing trend analysis, using software called OSIsoft PI Historian (http://www.automatedresults.com/PI/pi-historian.aspx).
  2. Getting a 360 degree view of the customer, which is something very important to the company I work for. The more information we know about a customer, the more finely we can tailor our products and pricing to that customer, which in turn is designed to improve service and reduce churn. Of course, a counterpoint to that is that some people view it as creepy when large organisations collect a large amount of information about customers, and therefore there is a responsibility to make sure that we do that collection with good intentions, i.e. for the purpose of delivering better products and services. More and more, big data needs to be combined with in-memory databases such as SAP HANA (http://hana.sap.com/abouthana.html) to allow us to process data in a timely manner.
  3. Security and intelligence extension, another valuable use case for the company I work for, since the number of cyber attacks against us continues to grow. Being able to sort through the logs of hundreds of servers and thousands of desktops allows us to spot trends, such as malicious attacks running over multiple months. Without big data, we wouldn’t be able to process this volume of logs. Tools like Splunk (http://www.splunk.com/) allow us to analyse this.
  4. Operations analysis, which is the optimisation of our business using sensor data. I’d argue this is a pretty similar use case for us to big data exploration, though I understand one is about exploring new trends, and the other is about optimising existing patterns in the data.
  5. Data warehouse optimisation, which is particularly important considering the massive increase in data processing (see my original point about smart meter data).

The big implication that I already touched on was the creepiness factor of large organisations knowing more and more information about you. My view is that the mass personalisation of products and pricing just for you delivers better service, though I also understand why some people would want to opt out of this data-utopia. I do think, though, that it’ll increasingly become difficult, if not impossible, to opt out. It’s a bit like not using Facebook: sure, you don’t have to, but eventually you’ll never get invited to events because they’re all hosted on Facebook, where you’ll never see them. So I don’t think all the implications of big data are positive, but then again, all technology has positive and negative consequences.

Finally, we were tasked to think about whether big data is the right phrase. Personally, I think it’s just data, rather than big data. There is an explosion of data everywhere, and it is growing exponentially. Therefore, there won’t be any processing other than big data processing.

As a side note, we also went through how MapReduce works. My advice is to check out:

which is an excellent video describing how MapReduce splits tasks across nodes, then combines the results of those tasks into a single result.
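To make the split-then-combine idea concrete, here is a minimal single-machine sketch of the three MapReduce phases in plain Python. It is not Hadoop code: the meter IDs, the readings, and the function names are all made up for illustration, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical input record: (meter_id, kWh) for one reading.
Read = tuple[str, float]


def map_phase(chunk: Iterable[Read]) -> Iterator[Read]:
    """Map: emit a (key, value) pair for every reading in one chunk."""
    for meter_id, kwh in chunk:
        yield meter_id, kwh


def shuffle(pairs: Iterable[Read]) -> dict[str, list[float]]:
    """Shuffle: group together all values that share the same key."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[float]]) -> dict[str, float]:
    """Reduce: combine each key's values into a single result (here, a sum)."""
    return {key: sum(values) for key, values in grouped.items()}


def map_reduce(chunks: Iterable[Iterable[Read]]) -> dict[str, float]:
    """Run map over every chunk, then shuffle and reduce the combined output."""
    mapped = (pair for chunk in chunks for pair in map_phase(chunk))
    return reduce_phase(shuffle(mapped))


# Two chunks stand in for data split across two nodes.
chunks = [
    [("meter-A", 0.5), ("meter-B", 1.0)],
    [("meter-A", 0.25), ("meter-B", 2.0)],
]
totals = map_reduce(chunks)
print(totals)  # {'meter-A': 0.75, 'meter-B': 3.0}
```

The point of the structure is that `map_phase` never needs to see more than its own chunk, which is what lets the work be spread across nodes.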

3 Replies to “Week 8 – Big Data and Hadoop”

  1. Oh you use SAP, explains the billing issues 😉 j/k
    seriously though, could save a few bucks by getting rid of SAP.. Hadoop adding support for memory as a storage tier..
    Release 2.6.0 – Hadoop HDFS. Heterogeneous Storage Tiers – Phase 2
    *Application APIs for heterogeneous storage
    *SSD storage tier
    *Memory as a storage tier (beta)

    But if you need a good M2M price for all your meters and in home future ones let me know 😉 I know a few people at 2degrees lol

    Actually, in all seriousness, the data increases you mention don’t seem too extreme. We process (mediate) about 16M EDR records a day on 4 or so CPUs. Good thing is telcos have already dealt with large data issues.. good thing about NZ is we’re a fraction the size of America, India, China etc, who have much larger data issues, but we get to leverage off the same technology for our smaller volumes. I’m surprised your industry doesn’t leverage off telco type IT solutions..

    Actually check out CitusDB, from CitusData https://www.citusdata.com
    and the benchmark test

    But this may have changed since hadoop has introduced SSD and Memory support as mentioned above.

    I agree that there is the creepiness factor of collecting data. I think it would help if people realised that the data company A has is the same as company B, C, D etc have anyway, that commercially no-one can really sell any data off to anyone else as we all kind of have the same basic stuff as each other, and that the Govt provides a lot of open source info as well anyway.

    We have Splunk too, but again cloud services.. Data up and data down and the associated price is what gets you…

  2. Any idea on how to set up a Hadoop system to process 400+K meter reads? That would be what you need in your application.

    Can anything from the Nest case be used/learned for the meter reads case? The cases look similar, so why not solve them in a similar way?

    The MapReduce video is funny, but in my view it misses some points, in particular at the beginning, how to set up the initial input.

  3. I haven’t personally played with Hadoop, but from an abstracted perspective, we would take all the different readings for a customer, and start to ask questions such as, what is the typical daily pattern of consumption for this user. A year’s worth of data would be split up into 365 daily jobs, which could then be farmed out to the cluster. The cluster would then take the daily reads and summarize those into a simpler pattern, i.e. converting from half hour intervals, into 2 hour intervals. Finally, these would all be returned back and reduced across every day to provide an aggregate view of a week, a month, and a year.

    Eventually the goal of reducing energy consumption is consistent with Nest’s idea of digital feedback used to monitor the environment, and reduce consumption. We’d like to do that too, but the information gathered from a smart meter isn’t fine grained enough for us to recommend changes in consumption behaviour to customers just yet.
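The daily-job idea described in reply 3 can be sketched roughly as follows. This is plain Python rather than Hadoop, the readings are made up, and `summarise_day` and `typical_profile` are hypothetical names: each day is summarised independently (so it could be farmed out to a cluster node), and the summaries are then averaged across days.

```python
from statistics import mean

READS_PER_DAY = 48     # half-hourly reads in one day
READS_PER_BUCKET = 4   # four half-hours make one 2-hour interval


def summarise_day(half_hourly: list[float]) -> list[float]:
    """Per-day job: collapse 48 half-hourly reads into 12 two-hour totals."""
    if len(half_hourly) != READS_PER_DAY:
        raise ValueError("expected 48 half-hourly reads")
    return [
        sum(half_hourly[i:i + READS_PER_BUCKET])
        for i in range(0, READS_PER_DAY, READS_PER_BUCKET)
    ]


def typical_profile(days: list[list[float]]) -> list[float]:
    """Reduce step: average each 2-hour slot across all summarised days."""
    # Each call to summarise_day is independent of the others.
    summaries = [summarise_day(day) for day in days]
    return [mean(slot) for slot in zip(*summaries)]


# Made-up data: two days of flat consumption (kWh per half hour).
day_one = [0.5] * READS_PER_DAY
day_two = [1.0] * READS_PER_DAY
profile = typical_profile([day_one, day_two])
print(profile)  # twelve values of 3.0
```

Passing a week, a month, or a year of days to `typical_profile` gives the aggregate view for that period.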
