This is my personal blog and anything I write here in no way reflects the opinion of Cisco Systems, my employer. If it does, it is only by pure coincidence :) Nothing here constitutes investment advice either, so you can't sue me.
I hadn't been keeping up on the Kinect mods that are coming out because I've been too busy doing far less glamorous backend work lately. But I came across this montage put together by Johnny Lee that compiles some of the more interesting ones.
This is truly a glimpse into the future. Some of this stuff is mind-blowing.
I can't wait to see what the next 5 months will bring. I am so excited to have access to the SDK now, I'm really looking forward to building some interactive data visualizations with it.
The skunkworks project that I've been working on for the past month or so incorporates the idea of relationships between entities to enable automatic discovery and data recommendations.
In said project there is a bit of code that watches the database for changes to an entity, and when it sees a change to that entity it automatically re-evaluates it against nearby, similar entities. I want to keep a record of the entities that are being compared against one another, so I create a relationship and store it for future reference.
This is pretty standard stuff. It's basically exactly what a relational database does. However, I ran up against one concept which I haven't seen much about.
That is, I realized that there is a whole gradient of similarities between entities, from "exactly the same" to "completely different". If you turn this into a binary value, either related or not, you lose that entire piece of information.
This seems like quite a waste because if you're looking at entities that are related to a specific one that you're looking at, it makes sense to prioritize them based on their similarity. And if you're doing something like, say, using those similar entities to come up with a market price for the entity you're interested in, you probably want to use something like a weighted average based on how similar the other entities are.
As far as I know there is no way right now to store that relationship strength in existing data systems. Relational databases in particular do nothing for you here: you can choose either related or not related. RDF-based systems could, I suppose, accommodate this, but what you really want in this case is a "strength" attribute on the "sameAs" property, and I have never encountered such a thing in the wild.
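To make the idea concrete, here is a minimal sketch of a relationship that carries a strength, plus the weighted-average pricing mentioned above. The `Relationship` class, the `weighted_price` function, the entity ids, and the prices are all made up for illustration; this is not the code from my project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A link between two entities that keeps the similarity instead of a yes/no flag."""
    source_id: str
    target_id: str
    strength: float  # assumed scale: 0.0 = completely different, 1.0 = exactly the same


def weighted_price(relationships, prices):
    """Estimate a price as the strength-weighted average of the related entities' prices."""
    weighted_sum = 0.0
    total_weight = 0.0
    for rel in relationships:
        price = prices.get(rel.target_id)
        if price is None or rel.strength <= 0:
            continue  # skip entities with no known price or no similarity
        weighted_sum += rel.strength * price
        total_weight += rel.strength
    if total_weight == 0:
        return None  # nothing comparable to price against
    return weighted_sum / total_weight


# Hypothetical data: house-2 is very similar to house-1, house-3 barely is.
relationships = [
    Relationship("house-1", "house-2", 0.9),
    Relationship("house-1", "house-3", 0.1),
]
prices = {"house-2": 200_000.0, "house-3": 100_000.0}
estimate = weighted_price(relationships, prices)
```

With these numbers the very similar house dominates the estimate, which is exactly the information a binary related/not-related flag would throw away.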
Anyway, I thought it was an interesting and useful idea. Since I'm da boss on my project and it sounded fun I went ahead and spent a day adding it in while I was in that bit of the code, so I'll let you know how it works out.
One of the ideas I've been playing with in my data analysis skunkworks project is the concept of "Data Recommendations".
The gist of it is that you can pretty easily show people data that they'll probably find interesting if you know about a few instances of data that they already found interesting. It's the same concept as Amazon showing you recommended items you might be interested in buying, but on homegrown data. In fact, the implementation could be as simple as a Bayesian filter, but the data has to be in a common format that it'll work against, which is the hard part.
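As a rough sketch of what "as simple as a Bayesian filter" could look like, here is a tiny naive Bayes scorer. It assumes the "common format" is just a set of string tokens per record, and it only looks at the features a record has (not the ones it lacks) to keep things short. The class name, the tokens, and the training data are all hypothetical.

```python
import math
from collections import Counter


class InterestFilter:
    """Tiny naive Bayes filter over records expressed as sets of feature tokens."""

    def __init__(self):
        self.feature_counts = {True: Counter(), False: Counter()}
        self.record_counts = {True: 0, False: 0}

    def train(self, features, interesting):
        """Record one example the person did (True) or did not (False) find interesting."""
        self.record_counts[interesting] += 1
        self.feature_counts[interesting].update(set(features))

    def score(self, features):
        """Return log-odds that a record is interesting (positive = probably interesting)."""
        yes, no = self.record_counts[True], self.record_counts[False]
        log_odds = math.log((yes + 1) / (no + 1))  # smoothed prior
        for feature in set(features):
            # Laplace smoothing so unseen features never produce a zero probability.
            p_yes = (self.feature_counts[True][feature] + 1) / (yes + 2)
            p_no = (self.feature_counts[False][feature] + 1) / (no + 2)
            log_odds += math.log(p_yes / p_no)
        return log_odds


interest_filter = InterestFilter()
interest_filter.train({"city:austin", "type:duplex"}, True)
interest_filter.train({"city:austin", "type:condo"}, True)
interest_filter.train({"city:dallas", "type:condo"}, False)

candidates = {
    "a": {"city:austin", "type:duplex"},
    "b": {"city:dallas", "type:condo"},
}
ranked = sorted(candidates, key=lambda name: interest_filter.score(candidates[name]), reverse=True)
```

The filter itself is the easy part. Getting every data source into those feature tokens is where the real work is.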
I don't think this is completely new, a few smart folks like Jeff Jonas over at IBM seem to be working on something similar, but it's going to make analytics a whole lot more interesting for a whole lot more people. It really puts relationships front and center because that's really what you're setting up. A relationship from a person to anything that matches x, y and z, and leaving that relationship open to find new data as soon as it sees any that matches.
I'm a little surprised that this hasn't been explored at all in any of the BI products that I've seen.
Recently I have been playing with a concept that I just made up called Data Lenses.
The idea is that you can construct scenarios based on information you're interested in and see how it affects everything else in the system. The data that you're using as a what-if scenario supersedes any "real" underlying data so that you can see what effect it has on everything else. A data lens.
For example, what effect would it have on ROI if the local market is trending up or down? What if the property had an additional 1000 square feet? What effect would it have had on your metrics for the last year if you had routed traffic differently?
This is a very heavy question to answer though because, once again, it requires that you calculate everything in advance, but not only for the original scenario but for every other scenario you're watching as well.
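Here is a minimal sketch of the lens idea using Python's `collections.ChainMap`, where the what-if values shadow the real ones without changing them. The field names and the toy `roi` formula are assumptions for illustration only, and this ignores the hard part, which is recalculating everything downstream for every lens.

```python
from collections import ChainMap

# "Real" underlying data for one property (made-up numbers).
actual = {"square_feet": 1500, "price_per_sqft": 120.0, "purchase_price": 150_000.0}


def roi(data):
    """Toy ROI: (estimated value - purchase price) / purchase price."""
    value = data["square_feet"] * data["price_per_sqft"]
    return (value - data["purchase_price"]) / data["purchase_price"]


# The lens: any key present here supersedes the real value, everything else falls through.
lens = ChainMap({"square_feet": 2500}, actual)

baseline_roi = roi(actual)  # computed on the real data
what_if_roi = roi(lens)     # same calculation, seen through the lens
```

The same `roi` function runs against both views, so the calculation doesn't need to know whether it is looking at real data or a scenario.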
Another score for big data columnar data stores IMO.
This is a continuation of my thoughts the other day on what makes data interesting. The gist of it was that the really interesting and impactful data is hard to get to, while the stuff that's easy to get to is common knowledge and therefore not as valuable. A much bigger impact is going to be made by going after the other stuff.
If that is true then the most important and impactful task at hand is bringing up the percentage of stuff that you do know. Going after the long tail inch-by-inch.
One of the primary advantages that Big Data provides in my opinion is the ability it gives you to not just store every bit of data you receive but to spend additional time analyzing each bit, researching it, and then storing useful meta-data about it. This is the work that makes faster than real-time analytics possible.
Which brings me to my point, which is that analytics about analytics (meta analytics) is going to be increasingly important. You'll need to know what percentage of bits you receive you're able to successfully mine meta-data about, and this will be the most important metric you see.
This is the number of customers you can identify, the number of calls you route successfully, the number of visitors you successfully profile, the number of assets you can confidently price. The number of decisions you can accurately make in advance.
I'm thinking of a dashboard something like this:
It's not about what you might be able to know, it's about what you actually do know. If the potential is there but the execution is not, what's the point?
I suspect this type of information will become increasingly valuable and prominent in the near future as big data proliferates.
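The metric itself is simple to compute. Here is a minimal sketch, assuming each record is a dict and that "successfully mined" just means a particular field got filled in. The `coverage` function and the visitor records are hypothetical.

```python
def coverage(records, is_enriched):
    """Return the fraction of records for which useful meta-data was successfully mined."""
    records = list(records)
    if not records:
        return 0.0
    enriched = sum(1 for record in records if is_enriched(record))
    return enriched / len(records)


# Made-up visitor records: two were profiled, two were not.
visitors = [
    {"id": 1, "profile": {"segment": "vip"}},
    {"id": 2, "profile": None},
    {"id": 3, "profile": {"segment": "new"}},
    {"id": 4, "profile": None},
]
profiled = coverage(visitors, lambda visitor: visitor["profile"] is not None)
```

The same function works for customers identified, calls routed, or assets priced. Only the `is_enriched` test changes.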
There are a ton of data sources out there that can easily be pulled in and learned from. Sources like Twitter, LinkedIn, Facebook, and even enterprise data sources like internal CRM, bug tracking systems, customer support, and communication systems. It's all very possible today and being used for some very interesting things.
That is the low-hanging fruit at this point. Pull in a data source or two and use the data in them to enrich what you know about your customer/visitor/market/etc. It is easy to learn what Twitter knows about a person, a $15/hr coder from a former Soviet bloc country can easily get that for you.
But if you've ever tried this you quickly realized that the number of members that Twitter advertises is nothing like the number that actually participates. You are really only able to learn about the tiny fraction of users that actively participate. No, it's finding out about the people that Twitter doesn't know about that's the real trick. This plays out in any data source that you're hoping will be as comprehensive as possible.
It seems to break down that you can usually easily find out 50% of what you need to know from easily accessible data sources. The other 50% is REALLY hard to get to.
The real magic is in knowing something that is really hard to find out right now. Pulling in new, more difficult data sources and combining them with the data that everyone already knows about to fill in the missing 50%. The land of screen scrapers, Mechanical Turk, maddeningly complex ETL processes, etc. That's the really interesting stuff.
But the only way to operate in faster than realtime is to deal in probabilities, using as much data as possible.
The more data you have the more confidence you can have that something should be handled in a certain way.
Real estate auctions made me think about it in this way. The problem is that you don't know all of the information about the property until the split second when you have to decide to bid or not. Which is almost impossible at scale, it's like needing to put a price on every single stock, every day, just in case it comes up for sale at the right price.
This is a symptom of a broken market, of course, the stock market could never function like this. Yet that's what I'm dealing with. You need a decent number of participants and adequate information to have a liquid market, the housing market has neither.
Anyway, the only way to deal with this problem is to pre-calculate everything, giving a probability that the house/customer/call/problem you're dealing with falls into a given category. And re-calculate every time you get more information. The output is an acceptable range of values for everything else. If you re-price every property in existence when you get new data you can be adequately prepared to act when you get that last critical piece of information (such as the price).
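One way to picture the "re-calculate every time you get more information" step is a plain Bayes' rule update per entity. This is a sketch under assumptions: the `beliefs` table, the "good deal" category, and the likelihood numbers are invented, and a real system would estimate those likelihoods from data.

```python
def update_probability(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule: revise P(category) after seeing one new piece of evidence."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1.0 - prior) * likelihood_if_false
    if denominator == 0:
        return prior  # evidence is impossible under both hypotheses; keep the old belief
    return numerator / denominator


# Pre-calculated belief per entity: P(this house is a good deal). Made-up starting values.
beliefs = {"house-1": 0.10, "house-2": 0.10}


def on_new_data(entity_id, likelihood_if_true, likelihood_if_false):
    """Re-price one entity as soon as a new piece of data about it arrives."""
    beliefs[entity_id] = update_probability(
        beliefs[entity_id], likelihood_if_true, likelihood_if_false
    )
    return beliefs[entity_id]


# Two pieces of evidence, each four times more likely to show up for a good deal.
first = on_new_data("house-1", 0.8, 0.2)
second = on_new_data("house-1", 0.8, 0.2)
```

By the time the last critical piece of information shows up, the belief is already sitting there and the decision is a lookup rather than an analysis.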
Maintaining an accurate data-mined model for each entity you're interested in is quite a feat. If you're interested in Twitter users, for example, it's like monitoring every user on the service and then flagging them when they become interesting to your purpose based on everything they've done in the past.
Treating every single instance of a property/customer/prospect/whatever as a unique case to be considered on its own instead of as one of a group is the only way to operate in faster than realtime.
However, doing things this way requires an immense data store because you're crunching everything you know about every single entity you know about whenever you get new data related to it. Every bit of information you get triggers a process that generates more data. That means every detail record you have becomes not just a detail record, but it becomes the subject of a detailed analysis, often very frequently.
I've seen lots of enterprise-grade relational databases fall flat when faced with a large number of detail records, and you have to start throwing away data very quickly. What a tragedy, no? Most importantly, it doesn't allow you to store and crunch as much data as you need to in order to get to faster than realtime, operating in the realm of probabilities. This goes back to the reason that I'm so excited about NoSQL and big data systems such as Cassandra and Hadoop; they really enable this.
I've come to realize that it doesn't really matter what kind of whiz-bang analysis you can do within seconds. It will never hold up against brute force analysis of everything you're looking at.
I almost feel like this generated meta-data about the detail data deserves its own name. It's like an identity probability card or something--the probabilities that the customer/call/property/twitterer you're looking at falls into a variety of categories. Ultimately this is what you're interested in, not the individual bits of data that decide the probabilities. This type of aggregate information about anything is going to be much more valuable than the bits themselves.
Mined data? Data probabilities? Data gold? I don't know what the right name is for this meta-data, but it's very interesting to me right now.
I've been thinking a lot lately about the fundamentals of analytics and how to break them down to their core components from a business-needs perspective. I have been dealing with this directly for years in the contact center market, and most recently by way of a real estate investment business that is trying to sort out the chaos in the housing market. So I'm kind of thinking out loud here about commonalities that I've seen across industries.
Based on my experience, I believe there are 3 fundamental products that people want to buy when they buy a business intelligence or analytics solution, and people buy these solutions hoping that they're actually getting one of these products. If these boxes were sitting on the shelf next to Cognos or whatever they'd get bought every time as they directly address the core need:
The Workflow Optimizer. Every business in the world is built on one or more workflows, the actual work that needs to be done. Actual work like handling a phone call, buying a property, following up on inquiries, etc. People buy this product to make their workflows work better, faster, and cheaper.
The Workflow Assigner. When you're dealing with massive quantities of data like financial markets, Web traffic, etc, a manual process will not cut it here. Do you send the call to IVR or a live agent? Do you buy the property or ignore it? Do you allow the transaction or decline it? If you cannot assign workflows in real-time or faster, you've created a bottleneck. If you don't prioritize instantly and assign accurately you're. This must be instantaneous.
The Opportunity Finder. Again, mountains of data means that you not only have to assign workflows properly, but the number of different workflows that you have to create and implement mushrooms as well. You can implement a white glove service for your VIP customers, but only if you can identify them instantly and assign the correct VIP processes. You can buy properties in many different places, but the profit potential is different in each.
Reports, dashboards, scorecards, all of the existing analysis products only exist to support these core needs.
If I had to guess I'd say that 99% of analytics products are Product #1 in different clothes--they help optimize existing workflows. Rather, they help trained professionals figure out how to do this; the optimization is typically not part of the analytics process itself. This is typically done by looking at the past using metrics and trends, and is pretty standard at this point. SMS alerts, wall boards, etc are all there to alert you when one of these workflows breaks down in some way. Advanced analytics products implement data mining algorithms to help identify patterns in the data to help these manual processes work better.
Most of the exciting new data-related opportunities that I see are built around products #2 and #3. They simply weren't even needed before, in the Digital World 1.0, because there was a manageable amount of data generated, and people could manually do the jobs of Products #2 and #3. The amount of data is now getting to be unmanageable.
As more and more data streams come online--generated by smart phones, RFID tags, user-generated content, computer vision, etc--you can start to see patterns that you couldn't before. These patterns expose the underlying trends that you really want to take advantage of as a business. Things like illiquid markets, pricing inefficiencies, data-starved verticals. Every bit of data you add to your pool gives you the ability to see these more clearly.
Selecting the proper workflow or process becomes more accurate the more data you have available to support the decision. Buy the house? Only if you know what it's worth and how much you'd make relative to other houses that you're not buying. You can figure it out, but only if you have enough data. Prioritize the call to a live agent? Maybe, but you'd better make sure you're not bumping someone more important, and you need data for that.
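A minimal sketch of what an instant assignment could look like for the call example, assuming the heavy lifting (a per-caller probability of being a high-value customer) has already been calculated and stored ahead of time. The scores, the threshold, and the workflow names are all hypothetical.

```python
# Pre-calculated scores, refreshed offline whenever new data arrives (made-up values).
vip_probability = {"caller-17": 0.92, "caller-42": 0.15}


def assign_workflow(caller_id, scores, live_agent_threshold=0.7, default_probability=0.0):
    """Pick a workflow with a single lookup so the decision itself adds no delay."""
    probability = scores.get(caller_id, default_probability)  # unknown callers get the default
    return "live_agent" if probability >= live_agent_threshold else "ivr"


known_vip = assign_workflow("caller-17", vip_probability)
known_regular = assign_workflow("caller-42", vip_probability)
unknown = assign_workflow("caller-99", vip_probability)
```

The decision is only as good as the scores behind it, which is why the amount of data feeding them matters so much.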
When you have more extensive information on the entire world of data it becomes much easier to identify groups of properties/customers/whatever that are unique in some way but not being treated as such. This is when you'd want to focus your efforts on taking advantage of them by sending them through a special process that plays to their unique characteristics. This is almost pure data mining, but it becomes much more accurate with more data.
The more data you're able to crunch, the bigger your competitive advantage, but you need Products #2 and #3 in order to do this. These products either are not sold or are almost completely manual right now.
I especially think there is a lot of potential in point solutions that deliver products #1, #2, and #3 pre-tailored to specific verticals in an easy to use package. Particularly as a cloud-based solution that offers pre-packaged integration with existing data sets such as Twitter, news, the 2010 Census data, weather, etc.
More thoughts to follow as I continue to work on and think about this.
Most reporting and analytics tools are excellent at telling you the norm, but don't help you find outliers, or give you extremely crude tools for using metrics to identify crisis situations.
The outliers are the most interesting part. 99% of the opportunity is in outliers:
Above-average customers are outliers
Underpriced assets are outliers
Influencers are outliers
Hits (music/games/movies/whatever) are outliers
Recognizing outliers is the most important part. If you have the ability to skim the cream off the top, where you get 80% of the bang for 20% of the buck, why wouldn't you?
The interesting part lies in the fact that the more data you have to work with--the more you know about everyone--the easier it gets to recognize outliers. The patterns that can be mined from the data improve, improving your ability to spot those outliers in the crowd.
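For a single metric, one common way to flag outliers is the modified z-score, which uses the median and the median absolute deviation so the outliers themselves don't skew the baseline. This is a minimal sketch with made-up spend figures, not a substitute for real data mining across many attributes; the 3.5 cutoff is the conventional default for this score.

```python
import statistics


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    values = list(values)
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that a few extreme values can't inflate.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Hypothetical monthly spend per customer; one customer is clearly above average.
monthly_spend = [100, 110, 95, 105, 98, 102, 900]
outliers = find_outliers(monthly_spend)
```

More data tightens the estimate of what "normal" looks like, which is what makes the one unusual customer stand out.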
This is why I am excited about teaming up Big Data (Cassandra being my store of choice) with data mining. Build a mineable data warehouse in anticipation of unknown influences and links between data, and build it in such a way that it can intelligently link them, and that's money in the bank. In any industry.
The reason I've been so quiet lately is because I've been doing some hard-core immersion in this stuff, and have been in straight-up learning mode. But it's really cool stuff, the potential is very exciting.
When I first started blogging I signed up with TypePad because hey, it was 2006 and WordPress was still pretty new. I went the fast and easy route. Needless to say, I outgrew their platform within a year or two; I just couldn't make it do what I needed it to do. Life happened and I never got the chance to investigate further, and now I'm finally taking the dive.
All I want is a platform that makes it easy to create content, lay out a front page in a grid-like format, and remove all the clutter from the site that I don't want to distract people with (menus etc). Just something basic like this:
I thought for sure WordPress would fit the bill; after all, it's apparently white-hot in the CMS market and it's been the blog platform standard for years now. But after checking it out and even trying some of the premium themes, it just isn't customizable enough unless you want to write PHP code; and, well, no thanks, I don't really want to keep my blog in source control and deal with upgrading code as the platform changes. (I'm not doing anything too off-the-wall here IMO.) Shockingly, their layout scheme does not seem to do the simple things I want to do out of the box, and if I'm starting from scratch I'm not going to compromise. (As far as I can tell only the sidebar, header, and footer support widgets, and the widgets cannot exist more than once on a page.)
I was hearing good things about a new CMS called Squarespace, so I decided to take them for a whirl. After watching the videos I thought it would surely be customizable enough to do what I wanted to do, but yet again I was thwarted. While the service makes it easy to create some great-looking sites it seems that you are still stuck with the Squarespace-blessed layout. (It looks like you may be able to do some custom coding to get around this, but I'm not sure if that could even do it, I'm not interested in investing the time in another proprietary platform, and then again I'm back to custom-coding my blog platform.)
The only platform that I know can do what I need to do is Drupal. Somehow I thought some of the contenders might win out in the end, but I just can't find Drupal's flexibility anywhere else right now. It's a little more heavyweight than I'd like, but I know that it will at least accommodate my apparently weird-ass layout.
If anyone knows something that I missed on the other two platforms please do speak up, but otherwise I guess I'm plugging ahead with Drupal. I hope to have the new site up and resume regular blogging soon!