
中吴南顾惟一笑

SQL and Hadoop  

2010-07-20 14:59:52 | Category: dbms

http://www.nettakeaway.com/tp/article/410/sql-and-hadoop
11/20/2008 12:23 PM, Database Analysis

I don’t know why there is so much confusion over the role of MapReduce oriented databases like Hadoop vs. SQL oriented databases. It’s actually pretty simple.

There are 2 things people want to do with databases: Select and Aggregate/Report, aka Process.

The Select portion is filtering: finding specific data points based on attributes like time, category, etc. The Aggregate/Report is the most common form of data processing: once you have all those rows, you want to do something with them.

So, how do we tell databases to do these 2 things? For the past 30 years, we’ve used a language called SQL, “Structured Query Language”, to access the data. SQL worked best when the data was organized in “relational tables”. SQL as a language has some cool features, including the ability to create tables, modify and insert data, and return aggregations in a set-oriented fashion. It’s also over 30 years old, is wordy, and cannot easily deal with any world other than sets of textual relational tables.
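To make the two tasks concrete, here is a minimal sketch in SQL, run through Python's built-in sqlite3 module against an in-memory database. The `visits` table, its columns, and the sample rows are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical demo data: (user_id, category, seconds spent).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, category TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(12, "news", 30), (12, "sports", 45), (7, "news", 10), (7, "news", 20)],
)

# Select: filter rows by an attribute.
selected = conn.execute(
    "SELECT user_id, category, seconds FROM visits "
    "WHERE user_id = 12 ORDER BY seconds"
).fetchall()

# Aggregate/Report: summarize the rows as a set.
report = conn.execute(
    "SELECT category, SUM(seconds) FROM visits "
    "GROUP BY category ORDER BY category"
).fetchall()

print(selected)
print(report)
```

Note that neither query says how to loop over the rows; that is the set-oriented part.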

While some programmers immediately get what SQL can do, others find it to be “YAL”, “Yet Another Language”. Object-oriented databases and other “persistent storage” systems have popped up to help these programmers treat the database as just another portion of their program, by “integrating” persistent storage systems into their current programming approach. Python has “pickling”, Perl used the DBM tied hash, etc.
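As a small illustration of storage that lives inside the programming language, here is a sketch using Python's standard pickle module. The `profile` dictionary and the temporary file are made up for the example.

```python
import os
import pickle
import tempfile

# Hypothetical object we want to keep between program runs.
profile = {"user_id": 12, "tags": ["news", "sports"]}

# Write the object to disk as-is, with no table design and no query language.
fd, path = tempfile.mkstemp(suffix=".pkl")
with os.fdopen(fd, "wb") as f:
    pickle.dump(profile, f)

# Read it back as an ordinary Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)

os.remove(path)
print(restored == profile)
```

The trade-off is that there is nothing to query: you get back the whole object or nothing.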

MapReduce is a programming concept that’s been around for a while in the functional-programming world (the map and reduce primitives of Lisp and similar languages), but has recently become more popular as scripting languages rise and as processors become more parallel. The MapReduce paradigm basically forces/allows the programmer to pick a way to split a task across various “compute groups”, have those groups compute something, and then fold it all back up at the end. This approach maps nicely to the way many modern languages treat data, so having the database handle the heavy lifting is a nice touch.
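The split, compute, and fold steps can be sketched in plain Python. This is a single-process illustration of the idea, not the Hadoop API; the helper names `split`, `map_chunk`, and `merge` and the sample lines are assumptions made for the example.

```python
from collections import Counter
from functools import reduce

def split(records, n_groups):
    """Deal records round-robin into n_groups chunks (the 'compute groups')."""
    return [records[i::n_groups] for i in range(n_groups)]

def map_chunk(chunk):
    """Each group computes a partial result on its own chunk: word counts."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def merge(left, right):
    """Fold two partial results into one."""
    return left + right

lines = ["sql and hadoop", "hadoop and pig", "sql and hive"]

# The per-chunk work is independent, so a real system could run it in parallel.
partials = [map_chunk(chunk) for chunk in split(lines, 2)]
total = reduce(merge, partials, Counter())

print(total["and"], total["hadoop"], total["sql"])
```

Because `merge` is associative, the partial results can be folded in any order, which is what lets the work be spread across machines.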

Therefore, if you think about it, both Hadoop and SQL databases are doing the same thing: Selecting some data (the Map phase) and Processing it (the Reduce phase).
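Here is a rough side-by-side of that claim: the same question answered once with SQL and once with a hand-rolled map and reduce. The table, the data, and the `map_row` helper are hypothetical, and the map/reduce half is a toy in plain Python rather than a Hadoop job.

```python
import sqlite3
from functools import reduce

# Hypothetical rows: (user_id, category, seconds).
rows = [(12, "news", 30), (12, "sports", 45), (7, "news", 10)]

# SQL form: WHERE does the selecting, SUM does the processing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, category TEXT, seconds INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
sql_total = conn.execute(
    "SELECT SUM(seconds) FROM visits WHERE category = 'news'"
).fetchone()[0]

# MapReduce form: the map step emits a value only for matching rows,
# and the reduce step folds the emitted values together.
def map_row(row):
    user_id, category, seconds = row
    return [seconds] if category == "news" else []

emitted = [value for row in rows for value in map_row(row)]
mr_total = reduce(lambda a, b: a + b, emitted, 0)

print(sql_total, mr_total)
```

Both totals come out the same; the difference is who decides how the work gets done, the database or the programmer.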

So, why the sturm und drang? A couple of things; I’ll mention a few here:

  • Hadoop and its ilk are really programmer tools: they don’t have a SQL access language, but instead rely on Java and other bindings. You do kind of need to be a programmer, not just a data guy, to use them.
  • Hadoop is not focused on the traditional old school of “store a row, retrieve a row, ACID compliant, etc.” Instead, Hadoop systems focus on the processing of massive data. Hadoop is probably the wrong tool for the traditional “bank of accounts” or “music catalog” exercises we all did to learn table normalization, for example. If you just want to “SELECT * WHERE USERID=12”, Hadoop is a bit of a pita.
  • Similarly, as much as the relational DB guys have hacked SQL into PL/SQL and Transact-SQL and cursors and all sorts of ways to add processing features to the SQL database world, the type of massive processing that modern data mining requires tends to strain the SQL world. The relational database model has struggled to parallelize itself cheaply, and the SQL language gives the developer very little control over how to optimize the parallelism in complex cases.
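To see why the single-row lookup in the second point is awkward for a batch system, compare a keyed lookup with a full scan. This is a simplified sketch in plain Python: the `users` list, the dictionary standing in for a primary-key index, and the `scan_for` helper are all assumptions, and real systems on both sides are more sophisticated.

```python
# Hypothetical data set of 1000 user records.
users = [{"user_id": i, "name": f"user{i}"} for i in range(1, 1001)]

# Row-store style: a primary-key index takes you straight to one row.
index = {u["user_id"]: u for u in users}
by_index = index[12]

# Batch-scan style: every record is read and tested, even if only one matches.
def scan_for(user_id, records):
    matches, examined = [], 0
    for record in records:
        examined += 1
        if record["user_id"] == user_id:
            matches.append(record)
    return matches, examined

matches, examined = scan_for(12, users)
print(by_index == matches[0], examined)
```

The scan is fine when you want to touch every record anyway, which is exactly the processing-heavy case where Hadoop earns its keep.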

There are efforts underway to put a pretty face on the MapReduce systems. Facebook has contributed Hive, and Business.com has released a Hadoop variant called CloudBase which looks really nice in its SQL support. Other approaches are in the “not-SQL-but-still-easier-than-raw-MapReduce” language area: Microsoft has created Dryad for their cloud systems, and Yahoo! Research has a language called Pig.

Some database players have also started to combine MapReduce engines for processing with SQL/relational engines for the storage layer. Two examples are Greenplum, which has had a parallelized PostgreSQL for a few years now (and open sourced its now abandoned Bizgres, a BI-oriented PostgreSQL), and Aster Data, which is less well known but is well regarded for high-capacity database systems.

Look, there is no shortage of distracting things and buzzwords here: When you parallelize, you can distribute across the “cloud”, you can run your analyses in the cloud using “Software as a Service (SaaS)”, yadda yadda yadda.

At the end of the day, ask what you are trying to solve with your program: if it’s massive processing of data, then a Hadoop solution may be your best bet. If the reporting and storage aspects are relatively simple, just persistent storage and simple sums of reasonably sized data, then a relational database will be easier to get going with.

And yes, these will eventually converge such that you won’t have to decide which tool to use: all of the major database systems will have a SQL layer with multiple engines and a controller which optimizes which engine to use for which query; you will also have the ability to use direct MapReduce or SQL, as you see fit.

But we aren’t there yet. So, don’t just assume that Hadoop is the answer to all data processing problems: if you aren’t processing the data, it’s really the wrong tool. And don’t just assume that an Oracle “grid” or a Teradata system is the only way to solve your massive data processing. You might be surprised how easily Hadoop can solve your needs.

Some things to watch:
Data Mining in Hadoop
Hama: Matrix libraries with an emphasis on compute-intensive operations like inversion… all within Hadoop
Mahout : Mahout’s goal is to build scalable, Apache licensed machine learning libraries. Initially, we are interested in building out the ten machine learning libraries detailed in http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf using Hadoop.

SQL vs. Hadoop articles
A dime a dozen. Here’s a recent one: The Commoditization of Massive Data Analysis. At the end of the day, almost every article is by either
1) a traditional DB guy who doesn’t understand the fuss because Hadoop can’t seem to do basic SQL or other relational stuff out of the box, and so misses the sea change that easy-to-access parallelized processing represents, or
2) Hadoop lovers who never understood how SQL can simplify data queries (because it’s yet another language to learn) and who see all data as something to process, not as a valuable resource in its own right.

So, read each SQL vs. Hadoop article with a grain of salt, including this one.
