
by Lee Provoost

This is a Headshift blog post by Lee Provoost, written on January 28, 2010 in Enterprise Future Trends Workforce Collaboration. It has 4 comments, the latest of which was on January 28, 2010.

The structured vs. unstructured data dilemma

One of the first things you learn at university in your first year of computer science is data normalisation. I don't know about the other people out there, but I found it such an utterly boring course. Mankind has such an obsession with categorising every single piece of data that this behaviour is crammed into the minds of naïve and unknowing computer science students, just fresh from high school.

Why is that? Well, it does make it easier for us mortals to grasp the vast amount of data that is around us. Think about the early days of Yahoo!, which used a directory approach and manual work to build a whole categorised database of web links.

But then cracks start appearing in our approach. Familiar with the "in which folder am I going to put this email?"-problem? First of all you start to have a gazillion folders and then the problem pops up that an email can belong to two different folders. After a while you start to wonder "where the hell is that message that I filed?".

Social Networks and Cloud driving the change

As a technologist, I also understand the underlying technical reasons why we want to normalise and categorise data. You first start with "objectifying" your view of the world in an object-oriented language like Java so it is easier to program and maintain those systems. This is persisted in classic row-column structure in a database where you create different tables for different data concepts. You have the student object with its properties stored in the Student table and course object with its properties in the Course table. Now you can create meaningful relations between the two and develop your software accordingly.
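To make the Student and Course example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Enrolment link table, the column names and the sample rows are assumptions added for illustration; the post only names the Student and Course tables.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with normalised Student and Course tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        -- Hypothetical link table: one row per (student, course) relation.
        CREATE TABLE Enrolment (
            student_id INTEGER NOT NULL REFERENCES Student(id),
            course_id INTEGER NOT NULL REFERENCES Course(id),
            PRIMARY KEY (student_id, course_id)
        );
        INSERT INTO Student VALUES (1, 'Alice');
        INSERT INTO Student VALUES (2, 'Bob');
        INSERT INTO Course VALUES (10, 'Databases');
        INSERT INTO Course VALUES (11, 'Algorithms');
        INSERT INTO Enrolment VALUES (1, 10);
        INSERT INTO Enrolment VALUES (1, 11);
        INSERT INTO Enrolment VALUES (2, 10);
        """
    )
    return conn


def courses_for(conn, student_name):
    """Follow the relation from a student to the titles of their courses."""
    rows = conn.execute(
        """
        SELECT Course.title
        FROM Course
        JOIN Enrolment ON Enrolment.course_id = Course.id
        JOIN Student ON Student.id = Enrolment.student_id
        WHERE Student.name = ?
        ORDER BY Course.title
        """,
        (student_name,),
    )
    return [title for (title,) in rows]


conn = build_demo_db()
alice_courses = courses_for(conn, "Alice")
bob_courses = courses_for(conn, "Bob")
```

Each concept lives in exactly one table, and the relation between them is only reconstructed at query time through the joins.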

When your application starts to get a huge load, you will either spend an awful lot of money on expensive Oracle cluster software or use concepts like sharding, splitting the data over several databases, to keep things performing. But still, you stick to your normalised Student and Course table structures.
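Sharding can be sketched in a few lines. The example below is only an illustration under simple assumptions: plain dictionaries stand in for separate database servers, and the shard is picked by hashing the student id. The names shard_for and ShardedStudents are hypothetical, and real systems typically add rebalancing and replication on top.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a record key to a shard index in a stable, repeatable way."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStudents:
    """The same Student 'table', split over several stores (dicts stand in for databases)."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard(self, student_id):
        return self.shards[shard_for(student_id, len(self.shards))]

    def insert(self, student_id, name):
        # The row keeps its normalised shape; only its location changes.
        self._shard(student_id)[student_id] = {"id": student_id, "name": name}

    def find(self, student_id):
        # A lookup by id touches exactly one shard.
        return self._shard(student_id).get(student_id)


students = ShardedStudents(shard_count=4)
for student_id in range(1, 1001):
    students.insert(student_id, f"student-{student_id}")
```

The table structure is untouched, which is the point made above: sharding spreads the load but the data model stays relational.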

An approach that started to get traction in some of the successful internet startups was the concept of key/value pairs. While not a novel concept at all, it got more popular because more and more websites couldn't cope with their number of users. The idea is that you don't normalise data in Course and Student tables, but just have a key and an associated value in one huge table. This approach scales extremely well because it can easily be spread over a cluster of machines. Google's BigTable uses this, as does Facebook's Cassandra. It became more "mainstream" with Amazon's introduction of SimpleDB, which has this key/value concept, as well as Microsoft's key/value storage in Azure.
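As a rough sketch of the idea, the student and course data can be flattened into one big key/value table. The "entity:id:field" key convention and the helper names flatten and scan are assumptions made for this example; they are not how BigTable, Cassandra or SimpleDB lay out their data.

```python
def flatten(entity, entity_id, properties):
    """Turn one object into key/value pairs, one pair per property."""
    return {
        f"{entity}:{entity_id}:{field}": value
        for field, value in properties.items()
    }


def scan(store, prefix):
    """Return every pair whose key starts with the given prefix."""
    return {key: value for key, value in store.items() if key.startswith(prefix)}


# One huge "table": no Student or Course schema, just keys and values.
store = {}
store.update(flatten("student", 1, {"name": "Alice", "year": "1"}))
store.update(flatten("course", 10, {"title": "Databases"}))
# A relation is just another key (hypothetical convention).
store["enrolment:1:10"] = "enrolled"

alice = scan(store, "student:1:")
alice_enrolments = scan(store, "enrolment:1:")
```

Because every lookup goes through a single key, each pair can live on whichever machine its key maps to, which is what makes the approach easy to spread over a cluster.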

SSD discs to the rescue!

The only problem is... that developers have been so brainwashed with the relational row/column concept that the average corporate developer has a hard time grasping this key/value concept and building meaningful apps on top of it.

Both Amazon and Microsoft have realised that and have added a relational layer on top of their key/value data storage engine. This means that the vendors can use key/value storage to achieve extreme scalability, while letting developers interact through a relational model interface. (It does create a performance penalty, but a negligible one for most users, I think.)
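What such a relational layer does can be hinted at with a toy example. This is a sketch under assumptions, not a description of Amazon's or Microsoft's implementation: keys follow a hypothetical "table:id:column" convention, and the rows and select helpers are made up for illustration.

```python
from collections import defaultdict


def rows(store, table):
    """Regroup flat key/value pairs into row dictionaries for one table."""
    grouped = defaultdict(dict)
    for key, value in store.items():
        name, row_id, column = key.split(":", 2)
        if name == table:
            grouped[row_id][column] = value
    return [{"id": row_id, **columns} for row_id, columns in sorted(grouped.items())]


def select(store, table, where=None):
    """A tiny SELECT ... WHERE over the key/value store."""
    return [row for row in rows(store, table) if where is None or where(row)]


kv = {
    "student:1:name": "Alice",
    "student:1:year": "1",
    "student:2:name": "Bob",
    "student:2:year": "2",
    "course:10:title": "Databases",
}

first_years = select(kv, "student", lambda row: row["year"] == "1")
all_courses = select(kv, "course")
```

The full scan inside rows is where a performance penalty of this kind comes from: the layer has to reassemble rows that the storage engine never kept together.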

But still, even this key/value approach requires you to model your apps and data in a particular way. What if we didn't have to make ANY effort at all? What if we could just take the vast amount of data we have in our company, dump it on a disc and be able to find meaningful information AS IS? Yes, we do have the Google Appliance for that, or Microsoft Enterprise Search, but they merely index in a batch process and store the results in a cache. What if we want a real-time, extremely fast search that gives you on-the-spot information regardless of the data structures?

That's exactly the gap Oracle wants to fill with their Exadata v2 server. Consider it a beast of a server, crammed with more than a terabyte of Solid-State Drive (flash) discs that hold your data in full. It's fairly comparable to having ALL your data hot in memory, ready to be queried.

Can you grasp the opportunity this presents us? We can achieve real-time search inside our enterprise on vast amounts of unstructured data!

Be like the kitteh

This is again an example that we shouldn't limit ourselves by the current state of technology. In the past 10 years I've seen again and again that eventually, technology will catch up to make things possible that we couldn't do before.

Most people will quote Einstein and say "imagination is more important than knowledge". I'd rather tell you to be more like the kitteh below. I'm pretty sure that one day, he/she will be able to fax him/herself to the Burger King to get some delicious "cheezburgerz".

funny-pictures-cat-faxes-himself.jpg

Impact on Enterprise 2.0

So how does this relate to Enterprise 2.0, I hear you ask? Well, I argue that we shouldn't lose too much time and effort on meticulously categorising and tagging information and knowledge across all the different systems we have. I believe that it is almost a lost cause anyway: it doesn't scale and it often doesn't deliver the quality you want.

My point is that the tools and technology we have at our disposal are advancing at such a rapid speed that we are slowly reaching the point where we can rely on software and hardware to do this boring heavy lifting for us.

I shared my view in a previous blog post that knowledge should be a happy by-product of your daily work. However, this creates such a vast amount of information that the overload becomes a problem in itself. Having intelligent software that can make sense of this, automatically tag and categorise it, and automatically make relationships and assumptions, would dramatically increase our efficiency.

At Headshift we're doing a lot of research on helping companies stay ahead of the pack, and we've been looking at software solutions to manage your knowledge more intelligently and automatically, as described. It seems that now we finally have the right hardware innovations to make it even juicier...

4 Comments

user-pic

Lee,

the problem with a big Oracle data silo like this is that you're still trapped in a relational database system for querying the data. Certainly Oracle databases support more BI discovery tools and mechanisms, but they don't allow the freedom and scalability that a true key-value data storage system provides. I'm still unconvinced that we'll ever have systems that are able to find relationships and meaningful patterns in data given the problems that *people* have with understanding correlation vs causation.

The really interesting tools are going to be those that can provide good visualisation of data; people can spot patterns and outliers in data far faster and more effectively than rule based systems, and tools that can highlight stuff that 'seems' interesting and present it well to experts will be much more valuable in the foreseeable future.

user-pic

Hi Felix, the Exadata v2 only shows what is possible with new hardware advancements. There seems to be quite low adoption of the Exadata machines in the market (I think partly because of their cost). MySpace is one of the big Web 2.0 companies that have adopted this concept for their own data analysis servers. Previously they needed a huge farm to cope with the vast amount of data and to process it at a reasonable speed; now they can process it faster with a much lower number of machines.

I'm thinking in terms of having very smart software that can analyse patterns, give meaning to data, identify relations etc., the tools you are describing, which currently can't run in real time because the hardware is lacking. Mixing that with SSD discs and faster data pipes between the different hardware components will eventually give us the speed needed for doing this "magic" on completely unstructured data.

Shall we start dreaming about the quantum computers already? :-)

user-pic

This isn't a 'quantum' problem; it's a matter of finding good heuristics for analysing the data. Connectivity speeds and storage speeds aren't the limiting factor in this; the algorithms for doing pattern analysis lack sufficient rigour now, with both people and machines.

Fast and frugal heuristics, potentially modelled on cognitive methods, will lead us part of the way down this garden path, but we are still faced with the problem that data analysis needs the touch of a real domain expert, someone who can make the links based on implicit knowledge, not the explicit knowledge exposed in, say, triples. This is why small domain expert systems are successful; they are able to internalise at least some of the knowledge a 'real' practitioner has. Throwing lots of data at lots of teraflops will give you results, but it's like cracking a crypto message; you could find a decryption that finds Shakespeare out of any message, but it's probably not the right answer.

Could we design a system that extends the concept of fuzzy logic far enough, one able to evaluate a situation based on what has gone before, extracting and evaluating useful information and categories to compare? Bayesian reasoning should be easily implemented in such a system, but the extraction of useful information and categorisation thereof is the issue; it's clear that the categorisation of knowledge is something that grows out of social interactions, and analysing those is, in my opinion, one of those things where you just had to be there!

user-pic

Thanks Felix for the great insight! I think your thinking is way ahead of what I had in mind with this post :)

You are completely correct that throwing "just" raw hardware at your data won't solve the problem. That's why, at the moment, I see it working for some of the existing software solutions that can analyse data and add metadata through entity recognition, facial recognition (video), etc. That does require loads of processing power and super fast hardware. Standard hardware isn't fast enough to offer near real-time insight into raw data that hasn't been analysed before. A solution where all the data is available more or less hot in memory, avoiding the latency of traditional discs along with the swapping and referencing algorithms, has a dramatic impact on current solutions.

But that still doesn't give us what you are describing. You think we will ever get there with the current approaches and knowledge?
