Svetlana Sicular

A member of the Gartner Blog Network

Svetlana Sicular
Research Director
1 year at Gartner
19 years IT industry

Svetlana Sicular has a uniquely combined experience of Fortune 500 IT and business leadership, product management at world-class software vendors, and Big Four consulting. She primarily handles inquiries in the areas of data management strategy, ...

5+ Big Data Companies to Watch

by Svetlana Sicular  |  June 17, 2014  |  6 Comments

People often ask me if there is a magic quadrant for big data. There isn’t. What we have is a Hype Cycle for Big Data with an abundance of big data technologies, some of which are just nascent, some are on the plateau of productivity, and some, like Hadoop distributions, are in the unfairly dreaded and largely misunderstood trough of disillusionment.

Gartner also has annual cool vendor reports where analysts write about up and coming companies with innovative ideas, services and technologies. Many reports cover awesome big data vendors, for example, Cool Vendors in Big Data, Cool Vendors in Data Science and Cool Vendors in Information Innovation.

At Gartner for Technical Professionals (where I am), we usually publish vendor-neutral research and do not write for cool vendor reports (to be fair, we submit our choices and peer-review these reports). Yet, our clients constantly ask me and my colleagues about vendors. Last week, Fortune magazine published my opinion on big data companies to watch. My opinion was not about the best or the most prominent, most hyped or most intriguing, most funded or most profitable companies, but about the companies to watch.

Katherine Noyes, the author of the Fortune magazine article, asked me to name five big data companies to watch and to comment on some published lists of big data vendors. Below is my full response, which explains my choices:

Hi Katherine,

Well, selecting just five companies is a challenge since there are many more companies that do interesting things around big data.  I have technical and non-technical considerations for giving my list of five.  My top noteworthy big data companies would be:

  • Neo Technology is the force behind the open source graph database Neo4j – I think graphs have a great future since they show data in its connections rather than as a traditional view of atomic elements.  Graph technologies are mostly unexplored by enterprises, but they are the solution that can deliver truly new insights from data. I wrote about graphs some time ago in my blog post Think Graph.
  • Splunk has excellent technology, and it was among the first big data companies to go public.  Now, Splunk also has a strong product called Hunk (Splunk on Hadoop) that directly delivers big data solutions more mature than most products in the market.  Hunk is easy to use compared to many big data products, and most customers I spoke with expressed their love for Splunk without any prompting on my part.
  • MemSQL – an in-memory relational database that would be effective for mixed workloads and for analytics.  While SAP draws so much attention to in-memory databases by marketing their Hana database, MemSQL seems to be a less expensive and more agile solution in this space.
  • Pivotal – while it might not be the perfect big data solution, Pivotal is solving a much larger problem – the convergence of cloud, mobile, social and big data forces (which Gartner calls the Nexus of Forces). Ultimately, big data is not a standalone technology; it should deliver actionable insights about the rapidly changing modern world with its social interactions, mobility, Internet of Things, etc.  That’s why GE is one of the major investors in Pivotal, with the purpose of building the Industrial Internet.
  • Teradata – it might be a surprising choice for the many big data aficionados who chose Teradata as a target in the religious wars of new big data technologies against the data warehouse, where Teradata is easy prey because it’s a pure play in data warehousing (as opposed to Oracle, IBM or Microsoft, which have many more products). Meanwhile, Teradata delivers a unified data architecture that combines the best of both worlds, and enterprises need both.

As you may have noticed, I am covering various segments of big data technologies. If I had more than five companies to choose from, I’d also add companies in other segments:

  • Big data analytics: Actian and Datameer
  • Predictive analytics: Revolution Analytics and Ayasdi
  • Data integration: Pentaho and Denodo (particularly for data virtualization)
  • Big data cloud providers: Qubole and Altiscale
  • Hadoop: Cloudera – “first in space” for big data, huge recent investments from an interesting set of investors, most notably, Intel.
  • Development framework: Concurrent, with its open source product called Cascading, now included in some Hadoop distributions. Given that applications are about to explode on Hadoop, Concurrent should do very well.

Please note, this is not comprehensive research, and there are more very good companies.  The companies I listed are “to watch” rather than best overall.

Now, to comment on the list of 100 big data companies you pointed me to.  Out of this list, the following companies look appealing to me: Dataguise, MapR, MatterSight, Manhattan Software (an excellent player in real estate!) and Data Tamer (I would prefer Paxata though) – see my blog post Big Data Quantity vs. Quality.

I’d like to especially dwell on The Hive. This VC company specializing in data has an unusual approach that I personally greatly appreciate.  It conducts weekly live meetups, which cover diverse subjects and draw diverse people who are interested in big data.  The Hive has become one of the most well-known gatherings, attracting the brightest minds in big data as speakers (and as attendees). It has become a social “big data hub” in Silicon Valley and, I believe, in India too.  Being at the center of big data life, The Hive has a great opportunity to make successful investments in early-stage data companies.

Finally, I’d like to remind you again: Gartner cool vendor reports are a much more comprehensive and pointed reading than my casual overview.


Follow Svetlana on Twitter @Sve_Sic


Category: "Data Scientist" Big Data big data market data paprazzi Gartner hype cycle Hadoop Information Everywhere innovation The Era of Data Trough of Disillusionment Uncategorized

The Era of Data

by Svetlana Sicular  |  June 3, 2014  |  Comments Off

In 2009, CRM icon Tom Siebel was attacked by a charging elephant during an African safari. Ominously, this was exactly the time of changing epochs signified by another elephant, Hadoop. It was not obvious back in 2009 that the era of CRM had come to an end and the era of data had begun. That very year, Cloudera announced the availability of the Cloudera Distribution of Hadoop, and MapReduce and HDFS became separate subprojects of Apache Hadoop. This was the year when people started talking about beautiful data.

The era of data is about the process of data commoditization, where data is becoming an independently valuable asset that is freely available on the market.  A “commodity” is defined as:

  • Something useful that can be turned to commercial or other advantage
  • An article of trade or commerce
  • An advantage or benefit

Look familiar? That’s what we want data to become. And we are getting there, not very fast but steadily.  Information patterns derived from data are already changing the status quo; they disrupt industries and affect lives. Sometimes, data is useful, yet not turned to commercial advantage.  For sure, data is increasingly becoming an article of trade or commerce. Notice that the number of new Web APIs that give public access to data started growing explosively around the beginning of the era of data.

Source: ProgrammableWeb


I am not the first one to point to the commoditization of data. Bob Grossman, who epitomizes a data scientist to me, gave a detailed account of commoditization in his outstanding book The Structure of Digital Computing: From Mainframes to Big Data. In particular, the commoditization of time took most of the 17th century. We take our watches, clocks and phone timers for granted. Put this in perspective: in the future, someone will take access to all kinds of data for granted.  The last chapter of Grossman’s book is entitled “The Era of Data.”

Open data is a strong manifestation of this new era.  The first government open-data websites were launched in 2009. Government mandates and open data policies from multiple countries and public entities continue to contribute to the process of data commoditization. Openness has the benefit of increasing the size of the market. The greater the size of the market and the demand for a resource, the greater the competitive pressure on price and, hence, the increase in commoditization of the resource.

When data gets free or inexpensive (as a result of commoditization), the opportunity exists to unite people over data sets to make new discoveries and build new business models. Many companies choose Hadoop because it is cheap data storage. This entry point is the first step on the journey to the data operating system, a term that I heard three times in the past five days, notably from Doug Cutting, who brought the world both Hadoop the elephant and the data operating system. This year’s Hadoop Summit starts today. It brings together 3,000 people from 1,000 organizations.

The last part of the “commodity” definition is “an advantage or benefit.” Gartner analysts Mark Beyer and Donald Feinberg predicted several years ago:

By 2014, organizations which have deployed analytics to support new complex data types and large volumes of data in analytics will outperform their market peers by more than 20% in revenue, margins, penetration and retention.

According to my observations, it’s true. Is this true for you? If not, be patient: an elephant’s pregnancy is almost two years long.

P.S. Tom Siebel survived the elephant attack. He now runs a big data company, C3 Energy.




Category: Big Data big data market data paprazzi Hadoop open data The Era of Data Uncategorized

Big Data Quantity vs. Quality

by Svetlana Sicular  |  May 22, 2014  |  Comments Off

Increasing adoption of big data technologies brings about the big data dilemmas:

  • Quality vs. quantity
  • Truth vs. trust
  • Correction vs. curation
  • Ontology vs. anthology

Data profiling, cleansing or matching in Hadoop or elsewhere are all good but they don’t resolve these dilemmas.  My favorite semantic site Twinword pictures what a dilemma is.

What is "dilemma?"


You get the picture. New technologies promote sloppiness. People do stupid things because now they can.

Why store all data? — Because we can.

What’s in this data? — Who knows?

Remember the current state of big data analytics? — “It’s not just about finding the needle, but getting the hay in the stack.”

Big data technologies are developing fast. Silicon Valley is excited about new capabilities (which very few are using). In my mind, the best thing to do right now is to enable vast and vague data sources that are commingling in the new and immature data stores, and are confined in mature data stores.  Companies store more data than they can process or even fathom. My imagination fails at a quintillion rows (ask Cloudera).  Instead, it paints a continuous loop: data enables analysis, analytics boosts the value of data. How to do this? It starts dawning on the market — through information quality and information governance!

My subject today is just the information quality piece.  It continues my previous blog post BYO Big Data Quality.   (I explained the whole information loop on this picture in Big Data Analytics Will Drive the Visible Impact of the Nexus of Forces.)

Data liberation means more people accessing and changing data. Innovative information quality approaches — visualization, exception handling, data enrichment — are needed to transform raw data into a trusted source suitable for analysis. Some companies use crowdsourcing for data enrichment and validation. Social platforms provide a crowdsourced approach to cleaning up data and facilitate finding armies of workers with diverse backgrounds.  Consequently, the quality of crowdsourcing is another new task.

Big data is a way to preserve context that is missing in the refined structured data stores. This means a balance between intentionally “dirty” data and data cleaned from unnecessary digital exhaust, sampling or no sampling. A capability to combine multiple data sources creates new expectations for consistent quality; for example, to accurately account for differences in granularity, velocity of changes, life span, perishability and dependencies of participating datasets.  Convergence of social, mobile, cloud and big data technologies presents new requirements: getting the right information to the consumer quickly, ensuring reliability of external data you don’t have control over, validating the relationships among data elements, looking for data synergies and gaps, creating provenance of the data you provide to others, and spotting skewed and biased data.

In reality, a data scientist’s job is 80% data quality engineer and just 20% researcher, dreamer and scientist. A data scientist spends an enormous amount of time on data curation and exploration to determine whether s/he can get value out of the data. The immediate practical answer: work with dark data confined in relational data stores.  Well, it’s structured; therefore, it is not really new. But at least you get enough untapped sources of reasonable quality, and you can extract enough value right away, while new technologies are being developed. Are they?

While Silicon Valley is excited about Drill, Spark and Shark, I am watching a nascent trend: big data quality and data-enabling. Coincidentally, I got two briefings last week, from Peaxy (which I liked a lot for its strength in the Industrial Internet) and from Viking-FS in Europe, with its product called K-Pax, mainly for the financial industry.  A briefing with Paxata is on Friday, and a briefing with Trifacta is scheduled too.  Earlier this month, I got a briefing from Waterline Data Science, a stealth startup with lots of cool ideas on enabling big data. Earlier, I had encounters with Data Tamer, Ataccama and Cambridge Semantics, among others.  Finally, have you heard about G2 and sensemaking? Take a look at this intriguing video. All these solutions are very different. The only quality they have in common is immaturity. For now, you are on your own, but hold on: help is coming!



Category: "Data Scientist" analytics Big Data big data quality Crossing the Chasm data governance data paprazzi Humans Information Everywhere innovation Inquire Within Uncategorized

BYO Big Data Quality

by Svetlana Sicular  |  May 16, 2014  |  2 Comments

In the absence of best practices for big data quality, individual companies are coming up with their own solutions. Of course, these organizations first have problems.  Let’s look at the example of Paytronix, a cool company providing loyalty management for restaurant chains, including my favorite, Panera Bread. Paytronix is converging social, mobile, cloud and big data for its business (aha! The Nexus of Forces!).  And, by the way, cutting-edge technologies help a lot to attract top talent to the company. But first things first: Paytronix had a big data quality problem. Here is the description:

  • Over a quarter of their clients, restaurants, do not ask for age
  • Of those who ask age, 18% leave it blank
  • Of those who answer, approximately 10% are blatant liars
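To see how quickly these gaps compound, here is a back-of-the-envelope sketch. It is my own illustration, not Paytronix's method: it assumes the three losses apply one after another and independently, that every restaurant contributes an equal share of customers, and it rounds "over a quarter" down to exactly 25%, so the result is an upper bound. The function and parameter names are hypothetical.

```python
def usable_age_fraction(not_asked: float, left_blank: float, lied: float) -> float:
    """Estimate the fraction of customer records with a trustworthy age.

    Assumes the three losses apply in sequence and independently,
    which is an illustrative simplification, not Paytronix's method.
    """
    asked = 1.0 - not_asked                # records where age was requested at all
    answered = asked * (1.0 - left_blank)  # of those, records with any answer
    return answered * (1.0 - lied)         # of those, answers that are not blatant lies


# Rates taken from the list above; "over a quarter" is rounded down to 0.25.
fraction = usable_age_fraction(not_asked=0.25, left_blank=0.18, lied=0.10)
print(f"Usable age data: at most {fraction:.0%} of records")
```

Under these assumptions, 0.75 × 0.82 × 0.90 gives roughly 55%, so close to half of the records carry no reliable age.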

All of the above means that identifying families with kids is a huge challenge (spoiler: Paytronix successfully met the challenge). People with kids are younger.  They tend to fill the restaurants earlier in the evening. Check average is higher when orders include a kids meal (I confirm for our orders in Panera). That’s why restaurants often want to market to people with children: when they offer a kids meal coupon they get 25% more redemptions.  But!  What customers say is different than what they do. (Aren’t we all customers?)  In other words, here is the picture, instructive for parents:

Source: Paytronix

Big data quality is new and different: Traditional models do not work, familiar standards do not apply, typical metrics miss the mark. Most important, people’s mentality has to change when they assure quality of big data.  My colleague Martin Reynolds likes to cite, “most people are woefully muddled information processors who often stumble along ill-chosen shortcuts to reach bad conclusions.”  This quote appeared in Newsweek in 1987, BC (before Cloudera, the first commercial Hadoop distribution vendor). I.e., the problem is eternal, although it wasn’t so widespread in data management because there was not much data management in 1987.  That issue of Newsweek still advertised typewriters, the best in the world. Wikipedia gives a daunting list of cognitive biases. Each bias is a big data quality factor because quality applies to the resulting analysis, to intermediate results, and to iterative data science.  In the case of Paytronix, segmentation was biased. Biases also apply to data mashups: to evaluating granularity, trustworthiness and dependencies of participating data sets.  And sometimes, biases matter even to the absence or presence of particular data sources. Martin Reynolds shared with me the most astonishing example of cognitive bias.

Paytronix solved its big data quality problem by deciding not to change how people think.  They validated data by presenting it in cubes in a familiar BI tool to good old people. By the way, crowdsourcing is another excellent big data quality method that relies on people. But that is the subject of my next post, where I will tell what vendors are doing about big data quality, and maybe even about big data governance. As DBAs like to say, stay tuned.




Category: Big Data big data market Crossing the Chasm crowdsourcing data data governance data paprazzi Hadoop Information Everywhere innovation Inquire Within skills Uncategorized

Wall Street is Hiring for Big Data

by Svetlana Sicular  |  April 28, 2014  |  1 Comment

Big data finally reached Wall Street. Not for small science experiments, but seriously, in grand style, with exorbitant salaries, for production and DevOps. A couple of years ago, Silicon Valley companies were bragging to each other about hiring Wall Street quants as data scientists. Now Wall Street is happy to grab whom? No, not data scientists (New York has plenty of its own quants) but data architects, the breed who can come up with new architecture to combine structured and unstructured data, where architecture for one use case usually (still) does not apply to another. A job search returned just 34 data scientist jobs vs. 645 data architect jobs in the New York area tonight. Even my search for “data architect + Hadoop” returned twice as many data architecture jobs as data scientist jobs. Sorry, data scientists: data architects are sexy!  My client inquiries have shifted lately to no-nonsense big data architecture, management and real-time use cases. Big data vendors are hinting left and right about “Wall Street customers in alpha,” i.e., newly signed contracts.  My friend, a Wall Street recruiter, had to cut his vacation short: Wall Street is in a hurry to get data architects right now.  And also performance engineers, and developers, and administrators. And did I say DevOps? Yes, for agility, my friends.

This hiring means that Wall Street is ready to use big data strategically.  The picture below shows typical stages of big data adoption described in my research note The Road Map for Successful Big Data Adoption.  The red dot on this picture — a stabilized infrastructure — is the most prominent milestone. After the infrastructure has been built, a capability to derive value from big data technologies leaps to a new level.  Nonbelievers turn into believers.


At the red dot, big data becomes the new normal. It eventually gets related to other information sources. (When using the term “big data,” analytics and information management professionals (across the globe!) first say that they don’t like it.) At the red dot, companies substantially expand the number of nodes or totally rebuild the earlier small layouts.

A widespread myth is that Hadoop is inexpensive to implement. Really? With Wall Street salaries? An initial implementation is usually more expensive than expected. It involves a lot of unanticipated technical and nontechnical difficulties. By the way, another myth is that big data infrastructure usually takes advantage of commodity hardware. Maybe, but not on Wall Street. Enterprises buy high-end hardware.

I gave my friend, a Wall Street recruiter, a t-shirt stating “My data is bigger than yours,” and he wears it to work (on Fridays). I should make one for myself, with the text “Data is what used to be big data.” Wall Street is getting there.



Category: "Data Scientist" Big Data big data market Crossing the Chasm data Hadoop market analysis Uncategorized

How Many Degrees Are in the 360° View of the Customer?

by Svetlana Sicular  |  March 18, 2014  |  3 Comments

I’ve been watching the CRM space since the term CRM was coined. The view of the customer remained at an invariable 360° while new ideas, methods and companies kept adding degree by degree to the full view.  Back in 2009, CRM icon Tom Siebel was attacked by a charging elephant during an African safari. Ominously, this was exactly the time of changing epochs: another elephant, Hadoop, signified a new era in the 360° view of the customer.  That very year, Cloudera announced the availability of Cloudera Distribution Including Apache Hadoop, and MapReduce and HDFS became separate subprojects of Apache Hadoop. The era of data has begun.

Massive amounts of data about the interactions of people open the door to observing and understanding human behavior at an unprecedented scale. Big data technology capabilities lead to new data-driven and customer-centric business models and revenues. Organizations change because of new insights about customers. Depending on a use case, “customer” could mean consumer, employee, voter, patient, criminal, student or all of the above.  Last Sunday, I became a “skier.”  That’s what they call customers at Mt Rose Ski Tahoe.  One more degree.  The most successful innovators are primarily guided by a focus on meeting the needs of the end users whom their solutions serve: the customer, the client, the employee. Our recent research note Focus on the Customer or Employee to Innovate With Cloud, Mobile, Social and Big Data speaks about it in great depth.

User experience that supports people’s personal goals and lifestyles, whether they are customers or employees, is key to success more than ever. Personal analytics is a noteworthy and totally new type of analytics, quite distinct from the well-known business analytics. Personal analytics empowers individuals to make better decisions in their private lives, within their personal circumstances, anytime, anywhere. How many more degrees does that add to the 360° view of the customer?

Siebel Analytics was the first customer analytics solution. It ended up as OBIEE.  (By the way, Oracle just acquired BlueKai: degrees and degrees of “audience” data!)  Siebel Analytics nourished many analytics leaders; off the top of my head, Birst, Facebook Analytics, Splice Machine and even Cognos.  “I was very fortunate to have survived something you might not think was survivable,” said Tom Siebel about the elephant attack. Tom Siebel is now running a big data company called C3. Data is pouring in from more and more sources. Beacon devices for indoor positioning are gaining more attention. This means imminent customer tracking in retail stores and ballparks.

The bottom line: When declaring a 360° view of the customer, count carefully.  It could be 315°, or it could be 370°. Any angle greater than 360° means that the customer view is not expanding.




Category: analytics Big Data big data market data data paprazzi Hadoop Humans Uncategorized

Big Data Analytics is a Rocket Ship Fueled by Social, Mobile and Cloud

by Svetlana Sicular  |  March 7, 2014  |  3 Comments

The rocket ship of big data analytics is launched and on its way to orbit.  Data and analytics are gaining importance at cosmic speed.  The rocket ship is fueled by cloud, mobile and social forces. Information is the single force that comes to the foreground over time, while cloud and mobility, once implemented, become less visible.  Then big data and analytics turn into a long-lasting focus of enterprises.  Information architects and analytics gurus, get ready for a much greater demand for your expertise within the next several years!


Last fall, my fellow analysts (covering social, mobile and cloud) and I (big data coverage) interviewed 33 people from truly innovative companies that have implemented social, mobile, cloud and information together (a.k.a. the Nexus of Forces).  These were the brilliant innovators who were not just thinking about it, but those who have already done it.  They were not implementing each force individually, but were taking advantage of technologies in combination.  One visionary told us,

The secret sauce is optimization and trade-off to achieve the best whole, bringing it all together for a unique user experience.

Fascinating things are happening:  companies in different industries think of themselves as data companies, information quality is ripe for disruption, everybody is craving information governance, and personal analytics is born and growing quickly (my colleague Angela McIntyre predicts: “By 2016, wearable smart electronics in shoes, tattoos and accessories will emerge as a $10 billion industry”). Convergence of forces surfaces my favorite subjects: big data, open data, crowdsourcing, and the human factor in technology.

We will talk about Lessons Learned From Real-World Nexus Innovators in a webinar on 11 March.

Three research notes describe our findings, in this order:

  1. Exploit Cloud, Mobile, Data and Social Convergence for Disruptive Innovation — analyzes how the Nexus of Forces is a platform for disruptive innovation and provides Key Insights for the entire Field Research project.
  2. Focus on the Customer or Employee to Innovate with Cloud, Mobile, Social and Big Data  — analyzes how enterprises focus on the individual to capitalize on the Nexus opportunities.
  3. Big Data Analytics Will Drive the Visible Impact of Nexus of Forces — analyzes how big data analytics will be key to enabling transformative business model disruption.

And here is a quote from one of the interviews about the state of big data analytics:

“It’s not just about finding the needle, but getting the hay in the stack.”

The rocket ship is launched.  Get ready for orbit.




Category: analytics Big Data cloud crowdsourcing data governance Information Everywhere innovation open data Uncategorized

The Genius of the Crowd

by Svetlana Sicular  |  February 5, 2014  |  1 Comment

If Russian is Greek to you, use translation tools such as Google Translate or — they will express the gist of my text. But if you want nuances, crowdsourced translation could be a better solution.  Learn more about crowdsourcing in my webinar Crowd Sorcery for Turning Data into Information on Thursday, 6 February.

Krasnaya Burda, once beloved by me (and I by it), came up with a funny phrase in the last millennium: “processing of fields by soldiers.” The phrase proved prophetic, though not in its own country.  Take, for example, the almost classic paper Crowdsourced Databases: Query Processing with People.  As the title suggests, the mass effort of scattered individuals is called crowdsourcing, and the scattered individuals themselves are called the crowd. This crowd, however, is not a close-knit one: you couldn’t drive it out into the street with a broom; it sits at home and crowds. And instead of watching television in the evenings, it works. Often not for gain, but to keep from being bored.  Or even to do what it loves, if that is not always possible at work.

People devote their leisure time to protecting our planet from asteroids or to predicting how many people will be hospitalized this year. Some prefer scientific or intellectual work, others mechanical work.  They pile in all together and get everything done quickly, and cheaply too. One Oxford scholar reports that some legionnaires of the crowd even keep working despite suffering from isolation. That is, the crowd has advanced so far that it experiences not only elation but the whole remaining gamut of feelings typical of an ordinary worker. So if anyone wants to use the crowd, the time has come. The crowd helps not because your people are all stupid and it is smart, but because it is detached, not mired in your routine. The longer you work in one place, the harder it is to come up with new solutions or reach unexpected conclusions.

When pedestrians walk across a bridge in step, the bridge begins to sway and may even collapse because of resonance.  That is why soldiers crossing a bridge are ordered to break step.   So it is with our crowd of workers: in some cases the resonance effect bears astonishing fruit, and in others breaking the resonance leads to no less miraculous results.

Here is an example of resonance: for the second year now, the wonderful organization Coursera has been offering anyone interested a variety of free courses taught by professors from the world’s leading universities.  More than a hundred thousand people signed up for the very first course, taught by the organization’s founder, Stanford professor Andrew Ng. How can one professor, even with assistants, grade that many people? It turned out that, with the help of certain methods, people can grade their classmates perfectly well themselves (within ±10% of the professor’s grades).  A pleased Professor Ng concluded that in its first year Coursera gathered more information about how people learn than all universities had in the entire previous history of higher education. For example, in the very first course, 2,000 people answered the same question incorrectly, and all two thousand gave the same wrong answer! So something at the university needs fixing.

The absence of resonance is especially easy to illustrate with a map of the world: every man contributes his own unique information about the place where he lives and which he knows well. And every woman too. By the way, there are all sorts of narrowly specialized crowds: women, designers, retired military.  The last are good not only for processing fields but also for tasks that require a security clearance.

Voicing a point of view of your own, unlike anyone else’s, is nearly impossible in our enlightened age: try to figure out whose point of view it is when it is drawn from a few very popular mass media outlets.  And the crowd is the masses. But the beauty of the crowd is that it always contains exceptions who will say something new and profound. This is called the “wisdom of the crowd,” or, as we say, folk wisdom, which often turns questions upside down. For example, buying (selling, of course) perfume not for how it smells but for how long the scent lasts.  I too, though I am not a crowd, will do things backwards and end with an epigraph:

And Schubert on the water, and Mozart in the birds’ din,
And Goethe, whistling on the winding path,
And Hamlet, who thought in timid steps,
All took the pulse of the crowd and believed the crowd.



Follow Svetlana on Twitter @Sve_Sic


Category: crowdsourcing Humans Information Everywhere innovation skills translation Uncategorized

Word of the Next Year

by Svetlana Sicular  |  December 27, 2013  |  Comments Off

Selfie is the word of 2013. Or maybe, it isn’t. There are other winners: (data) science, privacy and geek.  The latter wins for acquiring a more positive image, together with hacker, I assume. These words signify growing people centricity as a result of the convergence of four powerful forces: big data, mobility, cloud and social computing (yeh-yeh, the Nexus of Forces, 1,540 results if you search for it).  There is another permeating superforce: user experience.  The words of the year signify this too. The question (of the impatient, like me) is: what’s next? Ultra-personalization? No, this is not the word of the next year, read on.

Depending on where you stand or where you look, or who looks at where you stand, there are two scenarios:

1. Greater good
2. Cutthroat business

I see both. And I see a lot through my first-hand encounters with clients and vendors, VCs and scientists, geeks and selfies (sorry for misusing the word, it’s too new).  I just don’t see yet which scenario outweighs the other.  In the spirit of the season, I’m hoping for the greater good.  This year, open data truly inspired me.  I’ve been writing my research note on this subject for the last several months, with a lot of rewrites and rethinking.  I met amazing people and saw astonishing possibilities at the intersection of open data and crowdsourcing. Although a contender, crowdsourcing is not the word of the next year either.

The word of the next year is — airbnb.

Airbnb is a community marketplace for people to list and book accommodations that are someone’s spare property, from apartments and rooms to treehouses and boats. There is an airbnb for cars, dogs, storage, parking and office space. There are airbnbs for food — Cookening, EatWith and Feastly — you can come to a stranger’s home and have dinner made just for you. That’s hyper-personalization! My colleague Federico De Silva Leon wrote about airbnb for IT — Maverick* Research: Peer-to-Peer Sharing of Excess IT Resources Puts Money in the Bank, where he says: Internet technologies, such as cloud computing, are a central element to the transformation and rebalancing of IT resources, allowing organizations to monetize their excess IT capacity via a direct peer-to-peer (P2P) sharing model, enabled by brokers — a sort of “Airbnb of IT sharing.”

Anything peer-to-peer is airbnb.  It’s the ability to do what was not possible before the social, mobile and cloud forces started dating or even big-dating.  The only question left for the dictionaries that select words of the year: What is correct – “airbnb for” or “airbnb of”?


Category: Big Data cloud crowdsourcing data fashion Humans open data Uncategorized

The Charging Elephant — IBM

by Svetlana Sicular  |  September 23, 2013  |  Comments Off

To allow our readers to compare Hadoop distribution vendors side-by-side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts. I have published the vendor responses in these blog posts:

Cloudera, August 21, 2013
MapR, August 26, 2013
Pivotal, August 30, 2013

In addition to the Catalyst panelists, IBM also volunteered to answer the questions. Today, I publish responses from Paul Zikopoulos, Vice President, Worldwide Technical Sales, IBM Information Management Software.

1. How specifically are you addressing variety, not merely volume and velocity?

While Hadoop itself can store a wide variety of data, the challenge is really the ability
to derive value out of this data and integrate it within the enterprise – connecting the
dots if you will. After all, big data without analytics is…well…just a bunch of data.
Quite simply, lots of people are talking Hadoop – but we want to ensure people are
talking analytics on Hadoop. To facilitate analytics on data at rest in Hadoop, IBM
has added a suite of analytic tools and we believe this to be one of our significant
differentiators in IBM’s Hadoop distribution, InfoSphere BigInsights, when compared
to those from other vendors.

I’ll talk about working with text from a variety perspective in a moment, but a fun
one to talk about is the IBM Multimedia Analysis and Retrieval System (IMARS).
It’s one of the world’s largest image classification systems – and Hadoop powers it.
Consider a set of pictures with no metadata: you search for those pictures that include
“Wintersports,” and the system returns them. It’s done by creating
training sets that can perform feature extraction and analysis. You can check it out for
yourself. We’ve also got acoustic modules that our
partners are building around our big data platform. If you consider our rich partner
ecosystem, our Text Analytics Toolkit, the work we’re doing with IMARS, I think IBM
is leading the way when it comes to putting a set of arms around the variety component
of big data.

Pre-built customizable application logic
With more than two dozen pre-built Hadoop Apps, InfoSphere BigInsights helps firms
quickly benefit from their big data platform. Apps include web crawling, data
import/export, data sampling, social media data collection and analysis, machine data
processing and analysis, ad hoc queries, and more. All of these Apps are shipped with full
access to their source code, serving as a launching pad to customize, extend or develop
your own. What’s more, we empower the delivery of these Apps to the enterprise; quite
simply, if you’re familiar with how you invoke and discover Apps on an iPad, then
you’ve got the notion here. We essentially allow you to set up your own “App Store”
where users can invoke Apps and manage their run-time, thereby flattening the
deployability costs and driving up the democratization effect. And it goes a level further!
While your core Hadoop experts and programmers can create these discrete Apps, much
like SOA, we enable power users to build their own logic by orchestrating these
Apps into full blown applications. For example, if you wanted to grab data from a service
such as BoardReader and merge that with some relational data, then perform a
transformation on that data and run an R script on that – it’s as easy as dragging these
Apps from the Apps Palette and connecting them.
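
The drag-and-connect flow described above is, in effect, function composition. Here is a minimal Python sketch of that idea. Every name in it (`chain`, `fetch_posts`, `merge_with_customers`, `normalize`, `score`) is a hypothetical stand-in for a discrete App, not a BigInsights API, and the scoring step only imitates what an R script might do.

```python
from functools import reduce

def chain(*steps):
    """Compose steps left to right, feeding each output to the next step."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

# Hypothetical stand-ins for discrete "Apps"; none are real product names.
def fetch_posts(_):
    return [{"user": "ann", "text": "Great Service"},
            {"user": "bob", "text": "too slow"}]

CUSTOMERS = {"ann": "gold", "bob": "silver"}  # stands in for relational data

def merge_with_customers(posts):
    return [{**p, "tier": CUSTOMERS.get(p["user"], "unknown")} for p in posts]

def normalize(rows):
    return [{**r, "text": r["text"].lower()} for r in rows]

def score(rows):
    # Stands in for the R script step: a toy sentiment flag.
    return [{**r, "positive": "great" in r["text"]} for r in rows]

pipeline = chain(fetch_posts, merge_with_customers, normalize, score)
result = pipeline(None)
```

Because each step takes rows and returns rows, steps can be reordered or swapped without touching the others, which is the property the orchestration idea relies on.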

There’s also a set of solution accelerators: these are extensive toolkits with dozens of pre-built
software artifacts that can be customized and used together to quickly build tailor-made
solutions for some of the more common kinds of big data analysis. There are
solution accelerators for machine data (handles data ingest, indexing and search,
sessionization of log data, and statistical analysis), social data (handles ingestion from
social sources like Twitter, analyzes historical messages to build social profiles, and
provides a framework for in-motion analysis based on historical profiles), and
telco-based CDR data.

Spreadsheet-style analysis tool
What’s the most popular BI tool in the world? Likely some form of spreadsheet software.
So to keep business analysts and non-programmers in their comfort zone BigInsights
includes a Web-based spreadsheet-like discovery and visualization facility called
BigSheets. BigSheets lets users visually combine and explore various types of data to
identify “hidden” insights without having to understand the complexities of parallel
computing. Under the covers, it’s generating Pig jobs. BigSheets also lets you run jobs on
a sample of the data, which is really important because in the big data world, you could
be dealing with a lot of data, so why chew up that resource until you’re sure the job
you’ve created works as intended?
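
The idea of trying a job on a sample before spending cluster resources on the full data set can be sketched in a few lines of Python. This is only an illustration of the principle, not how BigSheets samples data: it uses reservoir sampling (a standard technique for sampling a stream of unknown length), and `job` is a hypothetical analysis.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = row         # replace with probability k / (i + 1)
    return sample

def job(rows):
    """Hypothetical analysis job: average of a numeric column."""
    rows = list(rows)
    return sum(r["value"] for r in rows) / len(rows)

# Validate the job on 1,000 sampled rows before running it on everything.
data = ({"value": v} for v in range(1_000_000))
preview = job(reservoir_sample(data, k=1000))
```

If the preview looks wrong, the job can be fixed at the cost of one pass over the data rather than a full run.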

Advanced Text Analytics Toolkit
BigInsights helps customers analyze large volumes of documents and messages with its
built-in text-processing engine and library of context-sensitive extractors. This includes a
complete end-to-end development environment that plugs into the Eclipse IDE and offers
a perspective to build text extraction programs that run on data stored in Hadoop. In the
same manner that declarative SQL transformed the ability to work with
relational data, IBM has introduced the Annotation Query Language (AQL) for text
extraction, which is similar in look and feel to SQL. Using AQL and the accompanying
IDE, power users can build all kinds of text extraction processes for whatever the task
at hand (social media analysis, classification of error messages, blog analysis, and so on).
Finally, accompanying this is a run optimizer that compiles the AQL into optimized
execution code that’s run on the Hadoop cluster. This optimizer was built from the
ground up to process and optimize text extraction, which requires a different set of
optimization techniques than typical. In the same manner as BigInsights ships a number
of Apps, the Text Analytics Toolkit includes a number of pre-built extractors that allow
you to pull things such as “Person/Phone”, “URL”, “City” (among many others) from
text (which could be a social media feed, financial document, call record logs, log
files…you name it). The magic behind these pre-built extractors is that they are really
compiled rules – hundreds of them – from thousands of client engagements. Quite
simply, when it comes to text extraction, we think our platform is going to let you build
your applications fifty percent faster, make them run up to ten times faster than some of
the alternatives we’ve seen, and most of all, provide more right answers. After all,
everyone talks about just how fast an answer was returned, and that’s important, but
when it comes to text extraction, how often the answer is right is just as important. This
platform is going to be “more right”.
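
To make the notion of a rule-based extractor concrete, here is a small Python sketch. It is not AQL and does not reflect the toolkit's rules: the two regular expressions are deliberately simplistic assumptions, and `EXTRACTORS` and `extract` are hypothetical names.

```python
import re

# Illustrative patterns only; production extractors encode far more rules.
EXTRACTORS = {
    "URL": re.compile(r"https?://[^\s,;]+"),
    "Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def extract(text):
    """Return (label, matched_text, start_offset) tuples ordered by position."""
    hits = [
        (label, m.group(), m.start())
        for label, pattern in EXTRACTORS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(hits, key=lambda h: h[2])

text = "Call 555-123-4567 or visit http://example.com/help today."
hits = extract(text)
```

Even this toy shows why "more right" matters: a pattern that is too loose returns answers quickly but wrongly, so precision depends on the quality of the rules rather than on the speed of the engine.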

Indexing and Search facility
Discovery and exploration often involve search, and as Hadoop is well suited to store
large varieties and volumes of data, a robust search facility is needed. Included with the
BigInsights product is InfoSphere Data Explorer (technology that came into the IBM big
data portfolio through the Vivisimo acquisition). Search is all the ‘rage’ these days with
some announcements by other Hadoop vendors in this space. BigInsights includes Data
Explorer for searching Hadoop data; however, it can be extended to search all data
assets. So unlike some of the announcements we’ve heard, I think we’re setting a higher
bar. What’s more, the indexing technology behind Data Explorer is positional-based as
opposed to vector-based – and that provides a lot of the differentiating benefits that are
needed in a big data world. Finally, Data Explorer understands security policies. If you’re
only granted access to a portion of a document and it comes up in your search, you still
only have access to the portion that was defined on the source systems. There’s so much
more to this exploration tool – such as automated topical clusters, a portal-like
development environment, and more. Let’s just say we have some serious leadership in
this space.
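
One way to read “positional-based” indexing is an inverted index that records where each token occurs, which is what makes exact phrase search possible. The Python sketch below illustrates that reading only; it is an assumption about the term, not a description of Data Explorer's internals, and `build_index` and `phrase_search` are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents containing the phrase tokens consecutively."""
    tokens = phrase.lower().split()
    if not tokens or any(t not in index for t in tokens):
        return []
    matches = []
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            # Every following token must sit exactly i positions after the first.
            if all(start + i in index[t].get(doc_id, []) for i, t in enumerate(tokens)):
                matches.append(doc_id)
                break
    return sorted(matches)

docs = {1: "big data without analytics", 2: "analytics without big data", 3: "data big"}
index = build_index(docs)
```

An index that stores only term weights per document could tell you that document 3 mentions both "big" and "data", but not that the words appear in the wrong order.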

Integration tools
Data integration is a huge issue today and only gets more complicated with the
variety of big data that needs to be integrated into the enterprise. Mandatory
requirements for big data integration on diverse data sources include: 1) the ability to
extract data from any source, and cleanse and transform the data before moving it into
the Hadoop environment; 2) the ability to run the same data integration processes
outside the Hadoop environment and within the Hadoop environment, wherever most
appropriate (or perhaps run it on data that resides in HDFS without using MapReduce);
3) the ability to deliver unlimited data integration scalability both outside of the
Hadoop environment and within the Hadoop environment, wherever most
appropriate; and 4) the ability to overcome some of Hadoop’s limitations for big data
integration. Relying solely on hand coding within the Hadoop environment is not a
realistic or viable big data integration solution.

Gartner had some comments about how solely relying on Hadoop as an ETL engine
leads to more complexity and costs, and we agree. We think Hadoop has to be a first
class citizen to such an environment, and it is with our Information Server product
set. Information Server is not just for ETL. It is a powerful parallel application
framework that can be used to build and deploy broad classes of big data applications.
For example, it includes design canvas, metadata management, visual debuggers, the
ability to share reusable components, and more. So Information Server can
automatically generate MapReduce code, but it’s also smart enough to know if a
different execution environment might better serve the transformation logic. We like
to think of the work you do in Information Server as “Design the Flow Once – Run
and Scale Anywhere”.

I recently worked with a Pharma-based client that ran blindly down the “Hadoop for
all ETL” path. One of the transformation flows had more than 2,000 lines of code; it
took 30 days to write. What’s more, it had no documentation and was difficult to reuse
and maintain. The exact same logic flow in Information Server was implemented
in just 2 days; it was built graphically and was self-documenting. Performance was
improved, and it was more reusable and maintainable.
Think about it – long before “ETL solely via Hadoop” it was “ETL via [programming
language of the year]”. I saw clients do the same thing 10 years ago
with Perl scripts, and they ended up in the same place as clients that use only
Hadoop for ETL – with higher costs and complexity. Hand coding was
replaced by commercial data integration tools for these very reasons, and we
think most customers do not want to go backwards in time by adopting the high costs,
risks, and limitations of hand coding for big data integration. We think our
Information Server offering is the only commercial data integration platform that
meets all of the requirements for big data integration outlined above.

2.  How do you address security concerns?

To secure data stored in Hadoop, the BigInsights security architecture uses both private
and public networks to ensure that sensitive data, such as authentication information
and source data, is secured behind a firewall. It uses reverse proxies, LDAP and PAM
authentication options, options for on-disk encryption (including hardware, file
system, and application based implementations), and built-in role based authorization.
User roles are assigned distinct privileges so that critical Hadoop components cannot
be altered by users without access.
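
Role-based authorization of the kind mentioned above reduces to a lookup from roles to privileges. The Python sketch below is a generic illustration; the role names, privilege names and the `is_allowed` helper are all hypothetical and do not come from BigInsights.

```python
# Hypothetical roles and privileges, for illustration only.
ROLE_PRIVILEGES = {
    "admin": {"read", "write", "configure"},
    "analyst": {"read", "run_job"},
    "viewer": {"read"},
}

def is_allowed(user_roles, action):
    """A user may perform an action if any of their roles grants it."""
    return any(action in ROLE_PRIVILEGES.get(role, set()) for role in user_roles)

can_configure = is_allowed(["viewer"], "configure")   # a viewer cannot alter components
can_run = is_allowed(["analyst"], "run_job")
```

Unknown roles grant nothing, so the check fails closed rather than open.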

But there is more. After all, whether you are storing data in an RDBMS or HDFS, you’re
still storing data. Yet so many don’t stop to think about it. We have a rich portfolio of
data security services that work with Hadoop (or soon will): for example, test data
management, data masking, and archiving services in our Optim family. In addition,
InfoSphere Guardium provides enterprise-level security auditing capabilities and
database activity monitoring (DAM) for every mainstream relational database I can
think of; well, don’t you need a DAM solution for data stored in Hadoop to support
the many current compliance mandates in the same manner as traditional structured
data sources? Guardium provides a number of pre-built reports that help you
understand who is running what job, whether they are authorized to run it, and more. If you
ever tried to work with Hadoop’s ‘chatty’ protocol, you know how tedious this would
be to do on your own, if at all. Guardium makes it a snap.

3.  How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

While Hadoop itself is moving toward better support of multi-tenancy with Hadoop
2.0 and YARN, IBM already offers some unique capabilities in this area.

Specifically, BigInsights customers can deploy Platform Symphony, which extends
capabilities for multi-tenancy. Business application owners frequently have particular
SLAs that they must achieve. For this reason, many organizations run dedicated
cluster infrastructures for each application because this is the only way they can be
assured of having resources when they need them. Platform Symphony solves this
problem with a production-proven multi-tenant architecture. It allows ownership to be
expressed on the grid, ensuring that each tenant is guaranteed a contracted SLA,
while also allowing resources to be shared dynamically so that idle resources are fully
utilized to the benefit of all.
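
The combination of guaranteed ownership and dynamic sharing can be sketched as a two-step allocation: first honor each tenant's owned share, then lend whatever is idle. The Python below is a simplified illustration under stated assumptions, not Platform Symphony's scheduler; `allocate`, the tenant names and the slot counts are hypothetical.

```python
def allocate(owned, demand):
    """owned: slots each tenant is guaranteed; demand: slots each tenant wants now."""
    # Step 1: every tenant gets what it asks for, up to its guaranteed share.
    grant = {t: min(owned[t], demand.get(t, 0)) for t in owned}
    idle = sum(owned.values()) - sum(grant.values())
    # Step 2: lend idle slots one at a time, round-robin, to tenants still short.
    while idle > 0:
        hungry = [t for t in sorted(owned) if demand.get(t, 0) > grant[t]]
        if not hungry:
            break
        for t in hungry:
            if idle == 0:
                break
            grant[t] += 1
            idle -= 1
    return grant

owned = {"marketing": 4, "risk": 6}
quiet = allocate(owned, {"marketing": 9, "risk": 2})   # risk is mostly idle
busy = allocate(owned, {"marketing": 9, "risk": 6})    # risk needs its full share
```

Recomputing the allocation when demand changes returns borrowed slots to their owner, which is the guarantee part of the arrangement; a real scheduler also has to decide how to reclaim slots from running work, which this sketch ignores.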

For more complex use cases involving issues like multi-tenancy, IBM offers an
alternative file system to HDFS: GPFS-FPO (General Parallel File System: File
Placement Optimizer). One compelling capability of GPFS that enables Hadoop clusters
to support different data sets and workloads is the concept of storage volumes.
This enables you to dedicate specific hardware in your cluster to specific data sets. In
short, you could store colder data not requiring heavy analytics on less expensive
hardware. I like to call this “Blue Suited” Hadoop. There’s no need to change your
Hadoop applications if you use GPFS-FPO and there is a host of other benefits it offers
such as snapshots, large block I/O, removal of the need for a dedicated name node,
and more. I want to stress the choice is up to you. You can use BigInsights with all
the open source components, and it’s “round-tripable” in that you can go back to open
source Hadoop whenever, or you can use some of these embrace and extend
capabilities that harden Hadoop for the enterprise. I think what separates us from
some of the other approaches is we are giving you the choice of the file system, and
the alternative we offer has a long standing reputation in the high performance
computing (HPC) arena.

4.   What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?

Disaster Recovery
One focus area will be a shift from customers simply looking for High Availability to
a requirement for Disaster Recovery. Today DR in Hadoop is primarily limited to
snapshots of HBase and HDFS file datasets, and distributed copy. Customers are starting
to ask for automated replication to a second site and automated cluster failover for
true disaster recovery. For customers who require disaster recovery today, GPFS-FPO
(IBM’s alternative to HDFS) includes flexible replication and recovery capabilities.

Data Governance and Security
As businesses begin to depend on Hadoop for more of their analytics, data
governance issues become increasingly critical. Regulatory compliance for many
industries and jurisdictions demands strict data access control, the ability
to audit data reads, and the ability to exactly track data lineage – just to name a few
criteria. IBM is a leader in data governance, and has a rich portfolio to enable
organizations to exert tighter control over their data. This portfolio has been
significantly extended to factor in big data, effectively taming Hadoop.

GPFS can also enhance security by providing ACLs (access control lists) with file- and
disk-level access control, so that applications, or the data itself, can be isolated to
privileged users, applications or even physical nodes in highly secured physical
environments. For example, if you have a set of data that can’t be deleted because of
some regulatory compliance, you can create that policy in GPFS. You can also create
immutability policies, and more.

Business-Friendly Analytics
Hadoop has for most of its history been the domain of parallel computing experts. But
as it’s becoming an increasingly popular platform for large-scale data storage and
processing, Hadoop needs to be accessible by users with a wide variety of skillsets.
BigInsights is responding to this need by providing BigSheets (the spreadsheet-based
analytics tool we mentioned in question 1), and an App-store style framework to
enable business users to run analytics functions, mix, match, and chain apps together,
and visualize their results.

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?

One example of allowing data to be accessed directly in Hadoop is the industry shift
to improve SQL access to data stored in Hadoop; it’s ironic isn’t it? The biggest
movement in the NoSQL space is … SQL! Despite all of the media coverage of
Hadoop’s ability to manage unstructured data, there is pent-up demand to use Hadoop
as a lower cost platform to store and query traditional structured data. IBM is taking a
unique approach here with the introduction of Big SQL, which is included in
BigInsights 2.1. Big SQL extends the value of data stored within Hive or HBase by
making it immediately query-able with a richer SQL interface than that provided by
HiveQL. Specifically, Big SQL provides full ANSI SQL 92 support for sub-queries,
more SQL 99 OLAP aggregation functions and common table expressions, SQL 2003
windowed aggregate functions, and more. Together with the supplied ODBC and
JDBC drivers, this means that a broader selection of end-user query tools can
generate standard SQL and directly query Hadoop. Similarly, Big SQL provides
optimized SQL access to HBase with support for secondary indexes and predicate
pushdown. We think Big SQL is the ‘closest to the pin’ of all the SQL solutions out
there for Hadoop, and there are a number of them. Of course, Hive is shipped in
BigInsights, so the work that’s being done there is by default part of BigInsights. We
have some big surprises coming in this space, so stay close to it. But the underlying
theme here is that Hadoop needs SQL interfaces to help democratize it and all
vendors are shooting for a hole in one here.
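
For readers who want to see the SQL features named above side by side, here is a small runnable illustration of a sub-query, a common table expression and a windowed aggregate. It uses Python's built-in `sqlite3` module purely as a stand-in engine (window functions need SQLite 3.25 or later); it says nothing about Big SQL's own syntax or behavior, and the `sales` table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10), ("east", 2, 20), ("west", 1, 5), ("west", 2, 15), ("north", 1, 3)],
)

QUERY = """
WITH monthly AS (                                  -- common table expression
    SELECT region, month, SUM(amount) AS total
    FROM sales
    GROUP BY region, month
)
SELECT region, month, total,
       SUM(total) OVER (PARTITION BY region ORDER BY month) AS running_total  -- windowed aggregate
FROM monthly
WHERE region IN (                                  -- sub-query
    SELECT region FROM sales GROUP BY region HAVING SUM(amount) >= 20
)
ORDER BY region, month
"""
rows = conn.execute(QUERY).fetchall()
```

The point for end-user tools is that a query like this is ordinary standard SQL, so a reporting tool that generates it does not need to know what kind of store sits underneath.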

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

Given that BigInsights includes analytic tools in addition to the open source Apache
Hadoop components, many of our customers are doing analytics in Hadoop. An
example of a customer doing analytics is Constant Contact Inc., which is using
BigInsights to analyze 35 billion annual emails to guide customers on the best dates and
times to send emails for maximum response. This has increased the
performance of their customers’ email campaigns by 15 to 25%.

A health bureau in Asia is using BigInsights to develop a centralized medical imaging
diagnostics solution. An estimated 80 percent of healthcare data is medical imaging
data, in particular radiology imaging. The medical imaging diagnostics platform is
expected to significantly improve patient healthcare by allowing physicians to exploit
the experience of other physicians in treating similar cases, and inferring the
prognosis and the outcome of treatments. It will also allow physicians to see
consensus opinions as well as differing alternatives, helping reduce the uncertainty
associated with diagnosis. In the long run, these capabilities will lower diagnostic
errors and improve the quality of care.

Another example is a global media firm concerned about piracy of their digital
content. The firm monitors social media sites (like Twitter, Facebook, and so on) to
detect unauthorized streaming of their content, quantify the annual revenue loss due
to piracy, and analyze trends. For them, the technical challenges involved processing
a wide variety of unstructured and semi-structured data. The company selected
BigInsights for its text analytics and scalability. The firm is
relying on IBM’s services and technical expertise to help it implement its aggressive
application requirements.

Vestas models weather to optimize the placement of turbines, maximizing generation
and longevity of their turbines. They’ve been able to take a month out of preparation
and placement work with their models and reduce turbine placement identification
from weeks to hours using their 1400+ node Hadoop cluster. If you were to take the
wind history they store of the world and capture it as HD TV, you’d be sitting down
and watching your television for 70 years to get through all the data they have – over
2.5 PB and growing to 6 PB more. How impactful is their work? In 2012, they were
awarded Computerworld’s Honors Program in its Search for New Heroes for “the
innovative application of IT to reduce waste, conserve energy, and the creation of
new product to help solve global environment problems”. IBM is really proud to be a
part of that.

Having said that, a substantial number of customers are using BigInsights as a landing
zone or staging area for a data warehouse, meaning it is being deployed as a data
integration or ETL platform. In this regard our customers are seeing additional value
in our Information Server offering combined with BigInsights. Not all data
integration jobs are well suited for MapReduce, and Information Server has the added
advantage of being able to move and transform data from anywhere in the enterprise
and across the internet to the Hadoop environment. Information Server allows them to
build data integration flows quickly and easily, and deploy the same job both within
and external to the Hadoop environment, wherever most appropriate. It delivers the
data governance and operational and administrative management required for
enterprise-class big data integration.

7.  What does it take to get a CIO to sign off on a Hadoop deployment?

We strive to establish a partnership with the customer and CIO to reduce risk and
ensure success in this new and exciting opportunity for their company. Because the
industry is still in the early adopter phase, IBM is focused on helping CIOs
understand the concrete financial and competitive advantages that Hadoop can bring
to the enterprise. Sometimes this involves some form of pilot project – but as
deployments increase across different industries, customers are starting to understand
the value right away. We are starting to see more large deals for enterprise licenses of
BigInsights as an example.

8.  What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

For customers that want optimal “out of the box” performance without having to
worry about tuning parameters we have announced PureData System for Hadoop, our
appliance offering that ships with BigInsights pre-integrated.

For customers who want better performance than Hadoop provides natively, we offer
Adaptive MapReduce, which optionally can replace the Hadoop scheduler with
Platform Symphony. IBM has completed significant big data benchmarks
employing BigInsights and Platform Symphony. These benchmarks include the
SWIM benchmark (Statistical Workload Injector for MapReduce), which is a
benchmark representing a real-world big data workload developed by the University of
California at Berkeley in close cooperation with Facebook. This test provides rigorous
measurements of the performance of MapReduce systems comprised of real industry
workloads. Platform Symphony Advanced Edition accelerated SWIM/Facebook
workload traces by approximately 6 times.

9. How much of Hadoop 2.0 do you support, given that it is still Alpha per Apache?

Our current release 2.1 of BigInsights ships with Hadoop V 1.1.1, and was released in
mid-2013, when Hadoop 2 was still in alpha state. Once Hadoop 2 is out of alpha/beta
state, and production ready, we will include it in a near-future release of BigInsights.
IBM strives to provide the most proven, stable versions of Apache Hadoop.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

It depends on the use case. Earlier we discussed Constant Contact Inc., whose use of
BigInsights is driving revenue enhancement through text analytics.

An example of cost reduction is a large automotive manufacturer who has deployed
BigInsights as the landing zone for their EDW environment. In traditional terms, we
might think of the landing zone as the atomic level detail of the data warehouse, or
the system of record. It also serves as the ETL platform, enhanced with Information
Server for data integration across the enterprise. The cost savings result from
replicating only an active subset of the data to the EDW, thereby reducing the size and cost
of the EDW platform. So 100 percent of the data is stored on Hadoop, and about
40% gets copied to the warehouse. Over time, Hadoop also serves as the historical
archive for EDW data and remains query-able.

As Hadoop becomes more widely deployed in the enterprise, regardless of the use case, the ROI increases.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

The deployments we are seeing are predominantly being done within IT, although
with an objective of better serving their customers. Once deployed, tools such as
BigSheets make it easier for end users by providing analytics capability without the
need to write MapReduce code.

12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL access to Hadoop is already important, with a lot of the demand coming from
customers who are looking to reduce the costs of their data warehouse platform. IBM
introduced Big SQL earlier this year and this is an area where we will continue to
invest. It affects the broader Hadoop ecosystem by turning Hadoop into a first class
citizen of a modern enterprise data architecture and offloading some of the work
traditionally done by an EDW. EDW costs can be reduced by using Hadoop to store
the atomic level detail, perform data transformations (rather than in-database), and
serve as a query-able archive for older data. IBM can provide a reference architecture
and assist customers looking to incorporate Hadoop into their existing data warehouse
environment in a cost effective manner.

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

Yes, it will extend the use cases for Hadoop. In fact, we have already seen the start of
that with the addition of SQL access to Hadoop and the move toward interactive SQL
queries. We expect to see a broader adoption of Hadoop as an EDW landing zone and
associated workloads.

14. What are some of the key implementation partners you have used?
[Answer skipped.]

15. What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size, and so on)? What are the barriers and opportunities to scale?

IBM is in a unique position of being able to leverage our vast portfolio of hardware
offerings for optimal Hadoop deployments. We are now shipping PureData System
for Hadoop, which takes all of the guesswork out of choosing the components and
configuring a cluster for Hadoop. We are leveraging what we have learned through
delivering the Netezza appliance, and the success of our PureData Systems to
simplify Hadoop deployments and improve time to value.

For customers who want more flexibility, we offer Hadoop hardware reference
architectures that can start small and grow big with options for those who are more
cost conscious, or for those who want the ultimate in performance. These reference
architectures are very prescriptive by defining the exact hardware components needed
to build the cluster and are defined in the IBM Labs by our senior architects based on
our own testing and customer experience.




Category: Big Data big data market Catalyst Crossing the Chasm data paparazzi Information Everywhere Uncategorized