Svetlana Sicular

A member of the Gartner Blog Network

Svetlana Sicular
Research Director
1 year at Gartner
19 years IT industry

Svetlana Sicular has a uniquely combined experience of Fortune 500 IT and business leadership, product management at world-class software vendors, and Big Four consulting. She primarily handles inquiries in the areas of data management strategy, ...

How Many Degrees Are in the 360° View of the Customer?

by Svetlana Sicular  |  March 18, 2014  |  2 Comments

I’ve been watching the CRM space since the term CRM was coined. The view of the customer has remained an invariable 360° while new ideas, methods and companies kept adding degree by degree to the full view. Back in 2009, CRM icon Tom Siebel was attacked by a charging elephant during an African safari. Ominously, this was exactly the time of changing epochs: another elephant, Hadoop, signified a new era in the 360° view of the customer. That very year, 2009, Cloudera announced the availability of Cloudera Distribution Including Apache Hadoop, and MapReduce and HDFS became separate subprojects of Apache Hadoop. The era of data had begun.

Massive amounts of data about the interactions of people open the door to observing and understanding human behavior at an unprecedented scale. Big data technology capabilities lead to new data-driven and customer-centric business models and revenues. Organizations change because of new insights about customers. Depending on the use case, “customer” could mean consumer, employee, voter, patient, criminal, student or all of the above. Last Sunday, I became a “skier.” That’s what they call customers at Mt Rose Ski Tahoe. One more degree. The most successful innovators are primarily guided by a focus on meeting the needs of the end users whom their solutions serve — the customer, the client, the employee. Our recent research note Focus on the Customer or Employee to Innovate With Cloud, Mobile, Social and Big Data speaks about this in great depth.

User experience that supports people’s personal goals and lifestyles, whether they are customers or employees, is more important to success than ever. Personal analytics is a noteworthy and totally new type of analytics, quite distinct from the well-known business analytics. Personal analytics empowers individuals to make better decisions in their private lives, within their personal circumstances, anytime, anywhere. How many more degrees does that add to the 360° view of the customer?

Siebel Analytics was the first customer analytics solution; it ended up as OBIEE. (By the way, Oracle just acquired BlueKai — degrees and degrees of “audience” data!) Siebel Analytics nourished many analytics leaders, off the top of my head — Birst, Facebook Analytics, Splice Machine and even Cognos. “I was very fortunate to have survived something you might not think was survivable,” said Tom Siebel about the elephant attack. Tom Siebel is now running a big data company called C3. Data is pouring in from more and more sources. Beacon devices for indoor positioning are gaining more attention. This means imminent customer tracking in retail stores and ballparks.

The bottom line: When declaring a 360° view of the customer, count carefully.  It could be 315°, or it could be 370°. Any angle greater than 360° means that the customer view is not expanding.

 

Follow Svetlana on Twitter @Sve_Sic



Big Data Analytics is a Rocket Ship Fueled by Social, Mobile and Cloud

by Svetlana Sicular  |  March 7, 2014  |  3 Comments

The rocket ship of big data analytics has launched and is on its way to orbit. Data and analytics are gaining importance at cosmic speed. The rocket ship is fueled by the cloud, mobile and social forces. Information is the one force that comes to the foreground over time, while cloud and mobility, once implemented, become less visible. Then big data and analytics turn into a long-lasting focus of enterprises. Information architects and analytics gurus, get ready for a much greater demand for your expertise within the next several years!


Last fall, my fellow analysts (covering social, mobile and cloud) and I (covering big data) interviewed 33 people from truly innovative companies that have implemented social, mobile, cloud and information together (a.k.a. the Nexus of Forces). These were the brilliant innovators who were not just thinking about it but had already done it. They were not implementing each force individually; they were taking advantage of the technologies in combination. One visionary told us,

The secret sauce is optimization and trade-off to achieve the best whole, bringing it all together for a unique user experience.

Fascinating things are happening: companies in different industries think of themselves as data companies, information quality is ripe for disruption, everybody is craving information governance, and personal analytics has been born and is growing quickly (my colleague Angela McIntyre predicts that by 2016, wearable smart electronics in shoes, tattoos and accessories will emerge as a $10 billion industry). The convergence of forces surfaces my favorite subjects: big data, open data, crowdsourcing, and the human factor in technology.

We will talk about Lessons Learned From Real-World Nexus Innovators in a webinar on 11 March.

Three research notes describe our findings, in this order:

  1. Exploit Cloud, Mobile, Data and Social Convergence for Disruptive Innovation — analyzes how the Nexus of Forces is a platform for disruptive innovation and provides Key Insights for the entire Field Research project.
  2. Focus on the Customer or Employee to Innovate with Cloud, Mobile, Social and Big Data  — analyzes how enterprises focus on the individual to capitalize on the Nexus opportunities.
  3. Big Data Analytics Will Drive the Visible Impact of Nexus of Forces — analyzes how big data analytics will be key to enabling transformative business model disruption.

And here is a quote from one of the interviews about the state of big data analytics:

“It’s not just about finding the needle, but getting the hay in the stack.”

The rocket ship has launched. Get ready for orbit.

 

Follow Svetlana on Twitter @Sve_Sic



The Genius of the Crowd (Гений толпы)

by Svetlana Sicular  |  February 5, 2014  |  1 Comment

If Russian is Greek to you, use translation tools such as Google Translate or Translate.com — they will express the gist of my text. But if you want nuances, crowdsourced translation could be a better solution.  Learn more about crowdsourcing in my webinar Crowd Sorcery for Turning Data into Information on Thursday, 6 February.

Krasnaya Burda («Красная бурда»), a humor magazine I once loved (and which loved me back), produced a funny phrase back in the last millennium: “field processing by soldiers.” The phrase turned out to be prophetic, though not in its own homeland. Take, for example, the almost classic manuscript Crowdsourced Databases: Query Processing with People. As the title suggests, the mass effort of scattered individuals is called crowdsourcing, and the scattered individuals themselves are called the crowd. This crowd, however, is not a close-knit one: you cannot sweep it out into the street with a broom; it sits at home and crowds there. And instead of watching television in the evenings, it works. Often not for profit, but to stave off boredom. Or even to do the work it loves, when the day job does not always allow it.

People devote their leisure time to defending our planet from asteroids or to predicting how many people will end up in the hospital this year. Some prefer scientific or intellectual work, others the mechanical kind. They pile on in a throng and get everything done quickly, and cheaply at that. One Oxford scholar reports that some legionnaires of the crowd even keep working despite suffering from isolation. In other words, the crowd has advanced so far that it experiences not only enthusiasm but the entire range of feelings of an ordinary employee. So if anyone wants to use the crowd, it is already possible. The crowd helps not because everyone at your organization is dumb and the crowd is smart, but because the crowd is detached, not mired in your routine. The longer you work in one place, the harder it is to come up with new solutions or arrive at unexpected conclusions.

When pedestrians walk across a bridge in step, the bridge starts to sway and can even collapse from resonance. That is why soldiers crossing a bridge are ordered to break step. It is the same with our crowd of workers: in some cases the resonance effect yields astonishing fruit; in others, breaking the resonance leads to results no less miraculous.

Here is an example of resonance: the remarkable organization Coursera has, for the second year now, been offering free courses to anyone interested, taught by professors from the world’s leading universities. More than a hundred thousand people signed up for the very first course, taught by the organization’s founder, Stanford professor Andrew Ng. How can one professor, even with assistants, grade that many people? It turned out that, with the help of certain methods, people can grade their classmates perfectly well themselves (within ±10% of the professor’s grade). A pleased Andrew Ng summed it up: in its first year Coursera gathered more data about how people learn than all universities had in the entire prior history of higher education. For instance, in that very first course, 2,000 people answered the same question incorrectly, and all two thousand got it wrong in exactly the same way! Which means something at the university needs fixing.
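A minimal sketch of the idea, not Coursera’s actual algorithm: assume each submission receives several peer scores on a 0–100 scale, aggregate them with a median, and check whether the crowd lands within ±10 points of the instructor. All names and numbers here are illustrative.

```python
# Illustrative sketch only -- not Coursera's actual peer-grading algorithm.
# Assumes each submission gets several peer scores on a 0-100 scale; the
# aggregate is the median, checked against the instructor within +/-10 points.
from statistics import median

def aggregate_peer_scores(peer_scores):
    """Combine several peer grades into one score (median is robust to outliers)."""
    return median(peer_scores)

def within_tolerance(peer_agg, instructor_score, tolerance=10):
    """True if the crowd's grade is within +/-tolerance of the instructor's."""
    return abs(peer_agg - instructor_score) <= tolerance

if __name__ == "__main__":
    submissions = {
        "student_001": {"peers": [82, 78, 85, 90], "instructor": 84},
        "student_002": {"peers": [55, 70, 62], "instructor": 75},
    }
    for sid, s in submissions.items():
        agg = aggregate_peer_scores(s["peers"])
        ok = within_tolerance(agg, s["instructor"])
        print(f"{sid}: peers={agg:.1f} instructor={s['instructor']} within ±10: {ok}")
```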

The absence of resonance is especially easy to illustrate with a world map: everyone contributes his own unique information about the place where he lives and knows well. And every woman does too. Incidentally, there are all kinds of highly specialized crowds: women, designers, retired military personnel. The latter are useful not only for “field processing” but also for tasks that require security clearance.

In our enlightened times it is almost impossible to voice a point of view unlike any other: go figure whose point of view it really is, when it is drawn from a handful of very popular mass media. And the crowd is the masses. Yet that is exactly what makes the crowd good: it always contains exceptions who will say something new and profound. This is called “the wisdom of the crowd,” in our parlance, folk wisdom, which often turns questions on their head. For example, buying (selling, of course) perfume not for how it smells, but for how long the scent lasts. I, too, though not a crowd, will do the opposite and finish with an epigraph:

And Schubert on the water, and Mozart in the birds’ clamor,
And Goethe whistling on the winding path,
And Hamlet, who thought in timorous steps,
All took the pulse of the crowd and trusted in the crowd.

 

 

Follow Svetlana on Twitter @Sve_Sic



Word of the Next Year

by Svetlana Sicular  |  December 27, 2013  |  Comments Off

Selfie is the word of 2013. Or maybe it isn’t. There are other winners: (data) science, privacy and geek. The latter wins for acquiring a more positive image, together with hacker, I assume. These words signify growing people-centricity as a result of the convergence of four powerful forces — big data, mobility, cloud and social computing (yeh-yeh, the Nexus of Forces: 1,540 results if you search for it on gartner.com). There is another permeating superforce — user experience. The words of the year signify this too. The question (from the impatient, like me) is: what’s next? Ultra-personalization? No, that is not the word of the next year. Read on.

Depending on where you stand or where you look, or who looks at where you stand, there are two scenarios:

1. Greater good
2. Cut throat business

I see both. And I see a lot through my first-hand encounters with clients and vendors, VCs and scientists, geeks and selfies (sorry for misusing the word, it’s too new). I just don’t see yet which scenario outweighs the other. In the spirit of the season, I’m hoping for the greater good. This year, open data truly inspired me. I’ve been writing my research note on this subject for the last several months, with a lot of rewrites and rethinking. I met amazing people and saw astonishing possibilities at the intersection of open data and crowdsourcing. Although a contender, crowdsourcing is not the word of the next year either.

The word of the next year is — airbnb.

Airbnb is a community marketplace for people to list and book accommodations that are someone’s spare property, from apartments and rooms to treehouses and boats. There is an airbnb for cars, dogs, storage, parking and office space. There are airbnbs for food — Cookening, EatWith and Feastly — where you can come to a stranger’s home and have dinner made just for you. That’s hyper-personalization! My colleague Federico De Silva Leon wrote about an airbnb for IT in Maverick* Research: Peer-to-Peer Sharing of Excess IT Resources Puts Money in the Bank, where he says: “Internet technologies, such as cloud computing, are a central element to the transformation and rebalancing of IT resources, allowing organizations to monetize their excess IT capacity via a direct peer-to-peer (P2P) sharing model, enabled by brokers — a sort of ‘Airbnb of IT sharing.’”

Anything peer-to-peer is airbnb.  It’s the ability to do what was not possible before the social, mobile and cloud forces started dating or even big-dating.  The only question left for the dictionaries that select words of the year: What is correct – “airbnb for” or “airbnb of”?



The Charging Elephant — IBM

by Svetlana Sicular  |  September 23, 2013  |  Comments Off

To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions come from Catalyst attendees and from Gartner analysts. I have published the vendor responses in these blog posts:

Cloudera, August 21, 2013
MapR, August 26, 2013
Pivotal, August 30, 2013

In addition to the Catalyst panelists, IBM also volunteered to answer the questions. Today, I publish responses from Paul Zikopoulos, Vice President, Worldwide Technical Sales, IBM Information Management Software.

1. How specifically are you addressing variety, not merely volume and velocity?

While Hadoop itself can store a wide variety of data, the challenge is really the ability to derive value from this data and integrate it within the enterprise – connecting the dots, if you will. After all, big data without analytics is… well… just a bunch of data. Quite simply, lots of people are talking Hadoop, but we want to ensure people are talking analytics on Hadoop. To facilitate analytics on data at rest in Hadoop, IBM has added a suite of analytic tools, and we believe this to be one of the significant differentiators of IBM’s Hadoop distribution, InfoSphere BigInsights, when compared to those from other vendors.

I’ll talk about working with text from a variety perspective in a moment, but a fun one to talk about is the IBM Multimedia Analysis and Retrieval System (IMARS). It’s one of the world’s largest image classification systems – and Hadoop powers it. Consider a set of pictures with no metadata: you search for pictures that include “Wintersports,” and the system returns such pictures. This is done by creating training sets that can perform feature extraction and analysis. You can check it out for yourself at: http://mp7.watson.ibm.com/imars/. We’ve also got acoustic modules that our partners are building around our big data platform. If you consider our rich partner ecosystem, our Text Analytics Toolkit, and the work we’re doing with IMARS, I think IBM is leading the way when it comes to putting a set of arms around the variety component of big data.

Pre-built customizable application logic
With more than two dozen pre-built Hadoop Apps, InfoSphere BigInsights helps firms quickly benefit from their big data platform. Apps include web crawling, data import/export, data sampling, social media data collection and analysis, machine data processing and analysis, ad hoc queries, and more. All of these Apps are shipped with full access to their source code, serving as a launching pad to customize, extend or develop your own. What’s more, we empower the delivery of these Apps to the enterprise; quite simply, if you’re familiar with how you invoke and discover Apps on an iPad, then you’ve got the notion here. We essentially allow you to set up your own “App Store” where users can invoke Apps and manage their run-time, thereby flattening the deployability costs and driving up the democratization effect. And it goes another level yet: while your core Hadoop experts and programmers can create these discrete Apps, much like SOA, we enable power users to build their own logic by orchestrating these discrete Apps into full-blown applications. For example, if you want to grab data from a service such as BoardReader, merge it with some relational data, then perform a transformation on that data and run an R script on it, it’s as easy as dragging these Apps from the Apps Palette and connecting them.
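A plain-Python analogy of that orchestration idea, for readers who think in code rather than drag-and-drop: discrete stages are chained so each consumes the previous one’s output. The stage names and data are hypothetical stand-ins, not BigInsights Apps.

```python
# A plain-Python analogy of orchestrating discrete "Apps" into a pipeline.
# Stage names and data are hypothetical; BigInsights does this graphically.

def fetch_social_mentions():
    # Stand-in for an App that pulls data from a feed such as BoardReader.
    return [{"product_id": 1, "text": "love it"}, {"product_id": 2, "text": "meh"}]

def merge_with_relational(mentions, products):
    # Stand-in for an App that joins feed data with relational reference data.
    return [dict(m, name=products.get(m["product_id"], "unknown")) for m in mentions]

def transform(records):
    # Stand-in for a transformation App (here: a crude sentiment flag).
    return [dict(r, positive=("love" in r["text"])) for r in records]

def run_pipeline(*stages):
    """Chain the stages so each one consumes the previous one's output."""
    data = None
    for stage in stages:
        data = stage(data) if data is not None else stage()
    return data

if __name__ == "__main__":
    products = {1: "Widget", 2: "Gadget"}
    result = run_pipeline(
        fetch_social_mentions,
        lambda mentions: merge_with_relational(mentions, products),
        transform,
    )
    print(result)
```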

There’s also a set of solution accelerators: these are extensive toolkits with dozens of pre-built software artifacts that can be customized and used together to quickly build tailor-made solutions for some of the more common kinds of big data analysis. There is a solution accelerator for machine data (handling data ingest, indexing and search, sessionization of log data, and statistical analysis), one for social data (handling ingestion from social sources like Twitter, analyzing historical messages to build social profiles, and providing a framework for in-motion analysis based on historical profiles), and one for telco-based CDR data.

Spreadsheet-style analysis tool
What’s the most popular BI tool in the world? Likely some form of spreadsheet software. So, to keep business analysts and non-programmers in their comfort zone, BigInsights includes a Web-based, spreadsheet-like discovery and visualization facility called BigSheets. BigSheets lets users visually combine and explore various types of data to identify “hidden” insights without having to understand the complexities of parallel computing. Under the covers, it generates Pig jobs. BigSheets also lets you run jobs on a sample of the data, which is really important because in the big data world you could be dealing with a lot of data, so why chew up that resource until you’re sure the job you’ve created works as intended?
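An illustrative sketch of the sample-first workflow described above. The toy job here is a word count in plain Python; in BigInsights the generated job would be Pig, so treat this only as an analogy.

```python
# Illustrative sketch of the "run on a sample first" workflow.
# The job is a toy word count; in BigInsights the generated job would be Pig.
import random
from collections import Counter

def word_count(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def run_on_sample(lines, fraction=0.01, seed=42):
    """Dry-run the job on a small random sample before committing to the full data."""
    rng = random.Random(seed)
    sample = [line for line in lines if rng.random() < fraction]
    return word_count(sample)

if __name__ == "__main__":
    data = ["big data needs analytics"] * 50_000 + ["hadoop stores big data"] * 50_000
    print("sample result:", run_on_sample(data).most_common(3))
    print("full result:  ", word_count(data).most_common(3))  # only after the sample looks right
```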

Advanced Text Analytics Toolkit
BigInsights helps customers analyze large volumes of documents and messages with its built-in text-processing engine and library of context-sensitive extractors. This includes a complete end-to-end development environment that plugs into the Eclipse IDE and offers a perspective to build text extraction programs that run on data stored in Hadoop. In the same manner that SQL declarative languages transformed the ability to work with relational data, IBM has introduced the Annotated Query Language (AQL) for text extraction, which is similar in look and feel to SQL. Using AQL and the accompanying IDE, power users can build all kinds of text extraction processes for whatever the task at hand is (social media analysis, classification of error messages, blog analysis, and so on). Finally, accompanying this is an optimizer that compiles the AQL into optimized execution code that runs on the Hadoop cluster. This optimizer was built from the ground up to process and optimize text extraction, which requires a different set of optimization techniques than is typical. In the same manner as BigInsights ships a number of Apps, the Text Analytics Toolkit includes a number of pre-built extractors that allow you to pull things such as “Person/Phone”, “URL” and “City” (among many others) from text (which could be a social media feed, a financial document, call record logs, log files… you name it). The magic behind these pre-built extractors is that they are really compiled rules – hundreds of them – from thousands of client engagements. Quite simply, when it comes to text extraction, we think our platform is going to let you build your applications fifty percent faster, make them run up to ten times faster than some of the alternatives we’ve seen, and, most of all, provide more right answers. After all, everyone talks about just how fast an answer was returned, and that’s important, but when it comes to text extraction, how often the answer is right is just as important. This platform is going to be “more right”.
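To make the idea of named, pre-built extractors concrete, here is a deliberately simple, regex-based stand-in in plain Python. It is not AQL and not IBM’s toolkit; production extractors encode far richer rules.

```python
# Not AQL -- a plain-Python, regex-based stand-in for the idea of
# pre-built text extractors such as "Phone" or "URL".
import re

EXTRACTORS = {
    # Deliberately simple patterns; real extractors compile hundreds of rules.
    "Phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "URL":   re.compile(r"https?://[^\s]+"),
}

def extract(text):
    """Run every extractor over the text and return labeled matches."""
    return {label: pattern.findall(text) for label, pattern in EXTRACTORS.items()}

if __name__ == "__main__":
    doc = "Call 555-123-4567 or see http://example.com/support for details."
    print(extract(doc))
    # {'Phone': ['555-123-4567'], 'URL': ['http://example.com/support']}
```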

Indexing and Search facility
Discovery and exploration often involve search, and as Hadoop is well suited to store large varieties and volumes of data, a robust search facility is needed. Included with the BigInsights product is InfoSphere Data Explorer (technology that came into the IBM big data portfolio through the Vivisimo acquisition). Search is all the ‘rage’ these days, with some announcements by other Hadoop vendors in this space. BigInsights includes Data Explorer for searching Hadoop data; however, it can be extended to search all data assets. So, unlike some of the announcements we’ve heard, I think we’re setting a higher bar. What’s more, the indexing technology behind Data Explorer is positional-based as opposed to vector-based – and that provides a lot of the differentiating benefits that are needed in a big data world. Finally, Data Explorer understands security policies. If you’re only granted access to a portion of a document and it comes up in your search, you still only have access to the portion that was defined on the source systems. There’s so much more to this exploration tool – such as automated topical clusters, a portal-like development environment, and more. Let’s just say we have some serious leadership in this space.

Integration tools
Data integration is a huge issue today and only gets more complicated with the variety of big data that needs to be integrated into the enterprise. Mandatory requirements for big data integration on diverse data sources include: 1) the ability to extract data from any source, and cleanse and transform the data before moving it into the Hadoop environment; 2) the ability to run the same data integration processes outside the Hadoop environment and within the Hadoop environment, wherever most appropriate (or perhaps run them on data that resides in HDFS without using MapReduce); 3) the ability to deliver unlimited data integration scalability both outside of the Hadoop environment and within the Hadoop environment, wherever most appropriate; and 4) the ability to overcome some of Hadoop’s limitations for big data integration. Relying solely on hand coding within the Hadoop environment is not a realistic or viable big data integration solution.

Gartner had some comments about how relying solely on Hadoop as an ETL engine leads to more complexity and costs, and we agree. We think Hadoop has to be a first-class citizen in such an environment, and it is with our Information Server product set. Information Server is not just for ETL. It is a powerful parallel application framework that can be used to build and deploy broad classes of big data applications. For example, it includes a design canvas, metadata management, visual debuggers, the ability to share reusable components, and more. So Information Server can automatically generate MapReduce code, but it’s also smart enough to know if a different execution environment might better serve the transformation logic. We like to think of the work you do in Information Server as “Design the Flow Once – Run and Scale Anywhere”.

I recently worked with a pharma-based client that ran blindly down the “Hadoop for all ETL” path. One of the transformation flows had more than 2,000 lines of code; it took 30 days to write. What’s more, it had no documentation and was difficult to reuse and maintain. The exact same logic flow was implemented in Information Server in just 2 days; it was built graphically and was self-documenting. Performance was improved, and it was more reusable and maintainable. Think about it – long before “ETL solely via Hadoop” it was “ETL via [programming language of the year].” I saw clients do the same thing 10 years ago with Perl scripts, and they ended up in the same place that clients who rely solely on Hadoop for ETL end up: higher costs and complexity. Hand coding was replaced by commercial data integration tools for these very reasons, and we think most customers do not want to go backwards in time by adopting the high costs, risks, and limitations of hand coding for big data integration. We think our Information Server offering is the only commercial data integration platform that meets all of the requirements for big data integration outlined above.

2.  How do you address security concerns?

To secure data stored in Hadoop, the BigInsights security architecture uses both private and public networks to ensure that sensitive data, such as authentication information and source data, is secured behind a firewall. It uses reverse proxies, LDAP and PAM authentication options, options for on-disk encryption (including hardware-, file-system- and application-based implementations), and built-in role-based authorization. User roles are assigned distinct privileges so that critical Hadoop components cannot be altered by users without access.

But there is more. After all, if you are storing data in an RDBMS or HDFS, you’re still storing data, yet so many don’t stop to think about it. We have a rich portfolio of data security services that work with Hadoop (or soon will) – for example, the test data management, data masking, and archiving services in our Optim family. In addition, InfoSphere Guardium provides enterprise-level security auditing capabilities and database activity monitoring (DAM) for every mainstream relational database I can think of; well, don’t you need a DAM solution for data stored in Hadoop to support the many current compliance mandates in the same manner as for traditional structured data sources? Guardium provides a number of pre-built reports that help you understand who is running what job, whether they are authorized to run it, and more. If you have ever tried to work with Hadoop’s ‘chatty’ protocol, you know how tedious this would be to do on your own, if at all. Guardium makes it a snap.
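For flavor, here is what a “who ran what, and were they authorized” report looks like in miniature. This is not Guardium; the audit records, user names and authorization list are hypothetical.

```python
# Not Guardium -- an illustrative report over hypothetical job-submission
# audit records: who ran which job, and was the user authorized to run it.
from collections import defaultdict

AUTHORIZED = {"etl_svc", "analyst_kim"}          # hypothetical policy

audit_records = [
    {"user": "etl_svc",     "job": "nightly_ingest"},
    {"user": "analyst_kim", "job": "churn_model"},
    {"user": "intern_bob",  "job": "export_all_customers"},
]

def report(records):
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["user"]].append(rec["job"])
    for user, jobs in by_user.items():
        status = "ok" if user in AUTHORIZED else "NOT AUTHORIZED"
        print(f"{user:12s} {status:15s} jobs: {', '.join(jobs)}")

if __name__ == "__main__":
    report(audit_records)
```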

3.  How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

While Hadoop itself is moving toward better support of multi-tenancy with Hadoop 2.0 and YARN, IBM already offers some unique capabilities in this area.

Specifically, BigInsights customers can deploy Platform Symphony, which extends capabilities for multi-tenancy. Business application owners frequently have particular SLAs that they must achieve. For this reason, many organizations run a dedicated cluster infrastructure for each application, because this is the only way they can be assured of having resources when they need them. Platform Symphony solves this problem with a production-proven multi-tenant architecture. It allows ownership to be expressed on the grid, ensuring that each tenant is guaranteed a contracted SLA, while also allowing resources to be shared dynamically so that idle resources are fully utilized to the benefit of all.

For more complex use cases involving issues like multi-tenancy, IBM offers an alternative file system to HDFS: GPFS-FPO (General Parallel File System: File Placement Optimizer). One compelling capability of GPFS for enabling Hadoop clusters to support different data sets and workloads is the concept of storage volumes. This enables you to dedicate specific hardware in your cluster to specific data sets. In short, you could store colder data not requiring heavy analytics on less expensive hardware. I like to call this “Blue Suited” Hadoop. There’s no need to change your Hadoop applications if you use GPFS-FPO, and there are a host of other benefits it offers, such as snapshots, large block I/O, removal of the need for a dedicated name node, and more. I want to stress that the choice is up to you. You can use BigInsights with all the open source components, and it’s “round-trippable” in that you can go back to open source Hadoop whenever you want, or you can use some of these embrace-and-extend capabilities that harden Hadoop for the enterprise. I think what separates us from some of the other approaches is that we are giving you the choice of the file system, and the alternative we offer has a long-standing reputation in the high performance computing (HPC) arena.

4.   What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?


Disaster Recovery
One focus area will be a shift from customers simply looking for high availability to a requirement for disaster recovery. Today, DR in Hadoop is primarily limited to snapshots (of HBase and of HDFS file datasets) and distributed copy. Customers are starting to ask for automated replication to a second site and automated cluster failover for true disaster recovery. For customers who require disaster recovery today, GPFS-FPO (IBM’s alternative to HDFS) includes flexible replication and recovery capabilities.
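A minimal sketch of the “distributed copy” approach the answer refers to: replicate an HDFS directory to a second cluster by wrapping the standard `hadoop distcp` command. The NameNode addresses and path are placeholders, and production DR also requires failover, not just periodic copies.

```python
# Minimal DR-by-distributed-copy sketch: replicate an HDFS directory to a
# second cluster with the standard `hadoop distcp` tool. Addresses and paths
# are placeholders; real DR also needs automated failover.
import subprocess

def replicate(src_namenode, dst_namenode, path):
    cmd = [
        "hadoop", "distcp",
        "-update",                          # copy only files that changed
        f"hdfs://{src_namenode}{path}",
        f"hdfs://{dst_namenode}{path}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    replicate("nn-primary:8020", "nn-dr-site:8020", "/warehouse/landing_zone")
```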

Data Governance and Security
As businesses begin to depend on Hadoop for more of their analytics, data governance issues become increasingly critical. Regulatory compliance for many industries and jurisdictions demands strict data access controls, the ability to audit data reads, and the ability to exactly track data lineage – just to name a few criteria. IBM is a leader in data governance and has a rich portfolio to enable organizations to exert tighter control over their data. This portfolio has been significantly extended to factor in big data, effectively taming Hadoop.

GPFS can also enhance security by providing access control lists (ACLs) with file- and disk-level access control, so that applications, or the data itself, can be isolated to privileged users, applications or even physical nodes in highly secured physical environments. For example, if you have a set of data that cannot be deleted because of some regulatory compliance requirement, you can create that policy in GPFS. You can also create immutability policies, and more.

Business-Friendly Analytics
Hadoop has, for most of its history, been the domain of parallel computing experts. But as it becomes an increasingly popular platform for large-scale data storage and processing, Hadoop needs to be accessible to users with a wide variety of skill sets. BigInsights responds to this need by providing BigSheets (the spreadsheet-based analytics tool mentioned in question 1) and an App Store-style framework that enables business users to run analytics functions; mix, match, and chain apps together; and visualize their results.

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?

One example of allowing data to be accessed directly in Hadoop is the industry shift to improve SQL access to data stored in Hadoop; it’s ironic, isn’t it? The biggest movement in the NoSQL space is… SQL! Despite all of the media coverage of Hadoop’s ability to manage unstructured data, there is pent-up demand to use Hadoop as a lower-cost platform to store and query traditional structured data. IBM is taking a unique approach here with the introduction of Big SQL, which is included in BigInsights 2.1. Big SQL extends the value of data stored within Hive or HBase by making it immediately query-able with a richer SQL interface than that provided by HiveQL. Specifically, Big SQL provides full ANSI SQL-92 support for sub-queries, more SQL-99 OLAP aggregation functions and common table expressions, SQL-2003 windowed aggregate functions, and more. Together with the supplied ODBC and JDBC drivers, this means that a broader selection of end-user query tools can generate standard SQL and directly query Hadoop. Similarly, Big SQL provides optimized SQL access to HBase with support for secondary indexes and predicate pushdown. We think Big SQL is the ‘closest to the pin’ of all the SQL solutions out there for Hadoop, and there are a number of them. Of course, Hive is shipped in BigInsights, so the work being done there is by default part of BigInsights. We have some big surprises coming in this space, so stay close to it. But the underlying theme here is that Hadoop needs SQL interfaces to help democratize it, and all vendors are shooting for a hole in one here.
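A hedged sketch of the kind of SQL-2003 windowed aggregate the answer refers to, submitted over a generic ODBC connection with pyodbc. The DSN, credentials and table are hypothetical, not an exact Big SQL sample; consult the product documentation for real connection details.

```python
# Hedged sketch: a SQL-2003 windowed aggregate submitted over a generic ODBC
# connection. The DSN, credentials and table are hypothetical.
import pyodbc

SQL = """
SELECT customer_id,
       order_ts,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_ts) AS running_total
FROM   sales_orders
"""

conn = pyodbc.connect("DSN=BigSQL;UID=analyst;PWD=secret")   # hypothetical DSN
cursor = conn.cursor()
for customer_id, order_ts, amount, running_total in cursor.execute(SQL):
    print(customer_id, order_ts, amount, running_total)
conn.close()
```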

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

Given that BigInsights includes analytic tools in addition to the open source Apache Hadoop components, many of our customers are doing analytics in Hadoop. One example is Constant Contact Inc., which is using BigInsights to analyze 35 billion annual emails to guide customers on the best dates and times to send emails for maximum response. This has increased the performance of their customers’ email campaigns by 15 to 25%.

A health bureau in Asia is using BigInsights to develop a centralized medical imaging
diagnostics solution. An estimated 80 percent of healthcare data is medical imaging
data, in particular radiology imaging. The medical imaging diagnostics platform is
expected to significantly improve patient healthcare by allowing physicians to exploit
the experience of other physicians in treating similar cases, and inferring the
prognosis and the outcome of treatments. It will also allow physicians to see
consensus opinions as well as differing alternatives, helping reduce the uncertainty
associated with diagnosis. In the long run, these capabilities will lower diagnostic
errors and improve the quality of care.

Another example is a global media firm concerned about piracy of their digital content. The firm monitors social media sites (Twitter, Facebook, and so on) to detect unauthorized streaming of their content, quantify the annual revenue loss due to piracy, and analyze trends. For them, the technical challenges involved processing a wide variety of unstructured and semi-structured data. The company selected BigInsights for its text analytics and scalability. The firm is relying on IBM’s services and technical expertise to help it implement its aggressive application requirements.

Vestas models weather to optimize the placement of wind turbines, maximizing the generation and longevity of their turbines. They have been able to take a month out of the preparation and placement work with their models and reduce turbine placement identification from weeks to hours using their 1,400+ node Hadoop cluster. If you were to take the wind history of the world that they store and capture it as HD TV, you would be sitting down and watching your television for 70 years to get through all the data they have – over 2.5 PB, and growing by 6 PB more. How impactful is their work? In 2012, they were recognized by Computerworld’s Honors Program in its Search for New Heroes for “the innovative application of IT to reduce waste, conserve energy, and the creation of new product to help solve global environment problems”. IBM is really proud to be a part of that.

Having said that, a substantial number of customers are using BigInsights as a landing
zone or staging area for a data warehouse, meaning it is being deployed as a data
integration or ETL platform. In this regard our customers are seeing additional value
in our Information Server offering combined with BigInsights. Not all data
integration jobs are well suited for MapReduce, and Information Server has the added
advantage of being able to move and transform data from anywhere in the enterprise
and across the internet to the Hadoop environment. Information Server allows them to
build data integration flows quickly and easily, and deploy the same job both within
and external to the Hadoop environment, wherever most appropriate. It delivers the
data governance and operational and administrative management required for
enterprise-class big data integration.

7.  What does it take to get a CIO to sign off on a Hadoop deployment?

We strive to establish a partnership with the customer and CIO to reduce risk and
ensure success in this new and exciting opportunity for their company. Because the
industry is still in the early adopter phase, IBM is focused on helping CIOs
understand the concrete financial and competitive advantages that Hadoop can bring
to the enterprise. Sometimes this involves some form of pilot project – but as
deployments increase across different industries, customers are starting to understand
the value right away. We are starting to see more large deals for enterprise licenses of
BigInsights as an example.

8.  What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

For customers who want optimal “out of the box” performance without having to worry about tuning parameters, we have announced PureData System for Hadoop, our appliance offering that ships with BigInsights pre-integrated.

For customers who want better performance than Hadoop provides natively, we offer Adaptive MapReduce, which can optionally replace the Hadoop scheduler with Platform Symphony. IBM has completed significant big data benchmarking using BigInsights and Platform Symphony. These benchmarks include the SWIM benchmark (Statistical Workload Injector for MapReduce), a benchmark representing a real-world big data workload developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems on real industry workloads. Platform Symphony Advanced Edition accelerated SWIM/Facebook workload traces by approximately 6 times.

9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?

Our current release, BigInsights 2.1, ships with Hadoop 1.1.1 and was released in mid-2013, when Hadoop 2 was still in an alpha state. Once Hadoop 2 is out of alpha/beta and production-ready, we will include it in a near-future release of BigInsights. IBM strives to provide the most proven, stable versions of the Apache Hadoop components.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

It depends on the use case. Earlier we discussed Constant Contact Inc., whose use of BigInsights is driving revenue enhancement through text analytics.

An example of cost reduction is a large automotive manufacturer that has deployed BigInsights as the landing zone for their EDW environment. In traditional terms, we might think of the landing zone as the atomic-level detail of the data warehouse, or the system of record. It also serves as the ETL platform, enhanced with Information Server for data integration across the enterprise. The cost savings come from replicating only an active subset of the data to the EDW, thereby reducing the size and cost of the EDW platform. So 100 percent of the data is stored on Hadoop, and about 40 percent gets copied to the warehouse. Over time, Hadoop also serves as the historical archive for EDW data, which remains query-able.
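Back-of-the-envelope arithmetic for that landing-zone pattern. The per-terabyte costs below are invented for illustration; only the 40% figure comes from the example above. The point is the shape of the saving when Hadoop holds 100% of the data and the EDW holds only the active subset.

```python
# Back-of-the-envelope only: per-TB costs are invented; 40% comes from the example above.
total_tb           = 500       # all data, landed in Hadoop
active_fraction    = 0.40      # portion replicated to the EDW
edw_cost_per_tb    = 30_000    # hypothetical fully loaded $/TB for the EDW
hadoop_cost_per_tb = 2_000     # hypothetical fully loaded $/TB for Hadoop

all_in_edw = total_tb * edw_cost_per_tb
tiered     = total_tb * hadoop_cost_per_tb + total_tb * active_fraction * edw_cost_per_tb

print(f"everything in the EDW:            ${all_in_edw:,.0f}")
print(f"Hadoop landing zone + 40% in EDW: ${tiered:,.0f}")
print(f"saving:                           ${all_in_edw - tiered:,.0f}")
```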

As Hadoop becomes more widely deployed in the enterprise, regardless of the use case, the ROI increases.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

The deployments we are seeing are predominantly being done within IT, although
with an objective of better serving their customers. Once deployed, tools such as
BigSheets make it easier for end users by providing analytics capability without the
need to write MapReduce code.

12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL access to Hadoop is already important, with a lot of the demand coming from customers who are looking to reduce the costs of their data warehouse platform. IBM introduced Big SQL earlier this year, and this is an area where we will continue to invest. It affects the broader Hadoop ecosystem by turning Hadoop into a first-class citizen of a modern enterprise data architecture and by offloading some of the work traditionally done by an EDW. EDW costs can be reduced by using Hadoop to store the atomic-level detail, perform data transformations (rather than doing them in-database), and serve as a query-able archive for older data. IBM can provide a reference architecture and assist customers looking to incorporate Hadoop into their existing data warehouse environment in a cost-effective manner.

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

Yes, it will extend the use cases for Hadoop. In fact, we have already seen the start of
that with the addition of SQL access to Hadoop and the move toward interactive SQL
queries. We expect to see a broader adoption of Hadoop as an EDW landing zone and
associated workloads.

14. What are some of the key implementation partners you have used?
[Answer skipped.]

15. What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size, and so on)? What are the barriers and opportunities to scale?

IBM is in a unique position of being able to leverage our vast portfolio of hardware
offerings for optimal Hadoop deployments. We are now shipping PureData System
for Hadoop, which takes all of the guesswork out of choosing the components and
configuring a cluster for Hadoop. We are leveraging what we have learned through
delivering the Netezza appliance, and the success of our PureData Systems to
simplify Hadoop deployments and improve time to value.

For customers who want more flexibility, we offer Hadoop hardware reference architectures that can start small and grow big, with options for those who are more cost conscious and for those who want the ultimate in performance. These reference architectures are very prescriptive, defining the exact hardware components needed to build the cluster, and are defined in the IBM Labs by our senior architects based on our own testing and customer experience.

 

 

Follow Svetlana on Twitter @Sve_Sic

 

 



The Charging Elephant — Pivotal

by Svetlana Sicular  |  August 30, 2013  |  Comments Off

To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions come from Catalyst attendees and from Gartner analysts.

Today, I publish responses from Susheel Kaushik, Chief Architect of Pivotal Data Fabrics.


1. How specifically are you addressing variety, not merely volume and velocity?

The term “big data” as used in the industry is primarily associated with the volume of data: a longer historical perspective, or a very high velocity (the frequency at which events are generated), results in a large volume of data. Handling various types of data from various data sources is crucial to establishing a complete business application in the enterprise. Customers implementing the data lake use case use the Pivotal HD and HAWQ platform for storing and analyzing all types of data – structured and unstructured. Some examples of data that can be stored and analyzed on the platform are:

  1. Events, log files, CDRs, mobile data, legacy system and app files
    Stored in Sequence, Text, Comma Separated, JSON, Avro, Map, Set and Array files and analyzed via standard interfaces such as SQL, Pig, HiveQL, HBase and MapReduce.
  2. Data from other databases or structured sources
    Stored in Hive RCFile/ORC/Parquet and HFile file formats and analyzed via standard interfaces such as SQL, Pig, HiveQL, HBase and MapReduce.
  3. Social network feeds
    The ability to access and store feeds from social network sources, such as Twitter, and to enable text analytics on the data.
  4. Images
    Stored on HDFS and analyzed using image analytical algorithms.
  5. Video
    Stored on HDFS and analyzed using complex video and image analytical algorithms.
  6. Time series
    Storage and analysis of time series data and generation of insights from it.

In addition, Pivotal HD allows users to extend the supported formats by providing custom input and output formats, and customers can also extend HAWQ to support proprietary data formats.
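To illustrate the variety problem in miniature (this is not PXF or a Hadoop input format, just plain Python with hypothetical field names): different sources arrive as JSON and CSV and are normalized into one record shape before analysis.

```python
# Not PXF or a Hadoop InputFormat -- a plain-Python illustration of normalizing
# JSON and CSV inputs into one record shape before analysis. Field names are hypothetical.
import csv
import io
import json

def from_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def from_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def normalize(record):
    """Map differing field names onto one schema."""
    return {
        "user":  record.get("user") or record.get("user_id"),
        "event": record.get("event") or record.get("action"),
    }

if __name__ == "__main__":
    json_feed = '{"user": "u1", "event": "click"}\n{"user": "u2", "event": "view"}'
    csv_feed = "user_id,action\nu3,click\nu4,purchase\n"
    records = [normalize(r) for r in from_json_lines(json_feed) + from_csv(csv_feed)]
    print(records)
```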

2. How do you address security concerns?

At a very high level, data security is about user management and data protection.

  1. User management is primarily focused on the creation and deletion of user accounts and the definition of access policies and role permissions, along with mechanisms for authorization and authentication.
  2. Data protection, on the other hand, is focused on:
    • Governing access to data.
      Managing the various actions a user can take on data.
    • Encrypting data at rest.
      Using standard encryption techniques to encrypt data at rest – in this case, the files on HDFS.
    • Encrypting data in motion.
      Applying standard encryption techniques to encrypt data in motion as well – when it is sent over the wire from one Java process to another.
    • Masking/tokenizing data at load time.
      Tokenization is a concept where users can get back to the original data if they have access to the correct ‘key’, whereas masking is a one-way transformation and users cannot get back to the original data (a minimal sketch of the difference follows this answer).

Pivotal HD controls access to the cluster using the Kerberos authentication mechanism. Pivotal HD, along with partner products, supports data encryption at rest and in motion as well as masking/tokenization. HAWQ provides security at the table and view level on HDFS, thereby bringing enterprise-class database security to Hadoop.
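A minimal sketch of the masking-versus-tokenization distinction described above. It is illustrative only: real deployments use a hardened tokenization service and vetted cryptography, not an in-memory dictionary.

```python
# Illustrative only: masking is one-way; tokenization is reversible for whoever
# holds the vault/key. Real deployments use a hardened tokenization service.
import hashlib
import secrets

def mask(value, salt="static-salt"):
    """One-way: nobody can recover the original from the masked value."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

class TokenVault:
    """Reversible: whoever holds the vault (the 'key') can map tokens back."""
    def __init__(self):
        self._forward, self._reverse = {}, {}

    def tokenize(self, value):
        if value not in self._forward:
            token = secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        return self._reverse[token]

if __name__ == "__main__":
    ssn = "123-45-6789"
    print("masked:   ", mask(ssn))              # irreversible
    vault = TokenVault()
    tok = vault.tokenize(ssn)
    print("token:    ", tok)
    print("recovered:", vault.detokenize(tok))  # reversible with the vault
```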

3. How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

  1. Pivotal VRP allows fine-grained control of system resources (I/O and CPU) for multi-tenancy management across Pivotal HD, HAWQ and the Greenplum Database.
  2. HAWQ provides workload multi-tenancy. Queries run in interactive times (100x faster), which in turn allows for higher multi-tenancy.
    • A cost-based optimizer generates optimum query plans to deliver better query execution times.
    • Scatter-gather technology significantly improves data loading rates.
  3. Pivotal HD includes Hadoop Virtual Extensions to enable better execution on VMware vSphere. vSphere provides very strong multi-tenancy and resource isolation for virtual environments.

4. What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?

  1. Security standardization

    • Kerberos integration is a starting point for enterprises. It is easy to scale and to integrate with their existing policy management tools (e.g., RSA NetWitness for monitoring policy violations).
    • Support for access control lists rather than the traditional authorization model of security attributes at the user level.
  2. Unified metadata repository
    • A single metadata store across HCatalog (HCatalog supports MapReduce, Pig and Hive), HBase, HAWQ and other data.
    • The current implementation of the metadata server has scale challenges that need to be resolved for adoption in enterprise environments.
  3. NameNode scalability
    Enterprise environments need to store a larger number of files on the platform (see the rough arithmetic after this list).

    • The current NameNode has a limitation on the number of files and directories (around 150 million objects).
    • The current option is to run the NameNode server with more memory to circumvent the physical memory limitation.
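Rough arithmetic behind that memory ceiling. A commonly cited rule of thumb is that each HDFS namespace object (file, directory, block) costs on the order of 150 bytes of NameNode heap; the exact figure varies by version and configuration, so treat this as an estimate only.

```python
# Rough arithmetic only: ~150 bytes of NameNode heap per namespace object is a
# widely cited rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150            # rule-of-thumb assumption

def heap_gb(num_objects, bytes_per_object=BYTES_PER_OBJECT):
    return num_objects * bytes_per_object / 1024**3

for objects in (150e6, 500e6, 1e9):
    print(f"{objects/1e6:6.0f} M objects -> roughly {heap_gb(objects):5.1f} GB of NameNode heap")
```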

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?
Data replication can be categorized in two ways: interplatform (data copies made across multiple platforms) and intraplatform (multiple copies of the data within a platform).

    Interplatform:

  • Unified Storage Services (USS) allows users access to data residing on multiple platforms without data copying. USS does the metadata translation, and the data is streamed directly from the source platform. USS gives applications access to data residing on multiple platforms without increasing infrastructure costs at the customer end.
  • HAWQ stores data directly on HDFS and eliminates the need for multiple copies of the data. Traditionally, customers had a data warehouse for their ETL and operational workload data and made a copy of the data on Hadoop for complex ad hoc analytical processing.
  • The PXF framework (part of HAWQ) also allows users to extend HAWQ SQL support to proprietary data formats. Native support for other proprietary formats reduces the need to make multiple copies of the data for analytics.
    Intraplatform:

  • Hadoop has a storage efficiency of ~30% because it maintains 3 copies of the data to prevent data loss in the case of multiple node failures. EMC Isilon storage improves the raw storage efficiency for data stored on HDFS to 80%. Isilon OneFS natively supports multiple protocols (NFS, CIFS and many more) along with HDFS, thereby enabling the same storage platform to be used for multiple access methods (a quick arithmetic sketch follows).
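The arithmetic behind those efficiency figures is simple: three-way replication leaves roughly one-third of raw disk as usable capacity, versus the ~80% the answer cites for Isilon OneFS. The raw capacity below is an arbitrary example.

```python
# Simple arithmetic behind the efficiency figures quoted above.
raw_tb = 900   # arbitrary example of raw disk capacity

def usable(raw, efficiency):
    return raw * efficiency

print(f"3x replication (~1/3 efficient): {usable(raw_tb, 1/3):6.1f} TB usable")
print(f"OneFS as cited (~80% efficient): {usable(raw_tb, 0.80):6.1f} TB usable")
```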

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

All of our customers today are doing analytics on the data stored in Hadoop, and more and more data is coming to the Hadoop platform. We find customers gravitating toward the data lake use case, where all data is stored and analyzed on the same platform.

  1. ETL offload
    Enterprise customers are looking for a scalable platform to reduce the time, resources and costs of ETL processing on their existing enterprise data warehouse systems. Pivotal HD/HAWQ scalability and parallelism, along with SQL support, enables easy migration of customer applications and reduces ETL processing time.
  2. Batch analytics
    Enterprise customers are analyzing events, log files, call data records, mobile data, legacy system and app files for security, fraud and usage insights. Earlier, some of this data was not analyzed because the cost of analysis was prohibitive and scalable tools were not available. Pivotal HD/HAWQ allows customers to run batch analytical workloads.
  3. Interactive analytical applications
    Advanced enterprises are now leveraging the availability of structured data and of unstructured data from other non-traditional sources to deliver next-generation applications. The key enabler in this case is the data lake use case, where all kinds of data are available on a single platform with capabilities to perform advanced analytics.

The trends are:

  1. Cost savings
    Enterprises are looking at ways to do more with less. They are looking to reduce their infrastructure spend and at the same time reduce the time for ETL processing.
  2. Enabling new applications and insights
    Advanced enterprises are building next-generation applications that merge legacy data with non-traditional data. Agility (time to market) is the key business driver for these initiatives, which we find to be more business-driven.
  3. Real-time decision making
    New business models that leverage the latency benefits along with the ability to analyze historical data are being experimented with.

7. What does it take to get a CIO to sign off on a Hadoop deployment?

Grass-roots adoption of Hadoop is very high. All the developers want to get current with big data platforms and skills and are experimenting with Hadoop.

CIOs are convinced that Hadoop is a scalable platform and understand the long-term impact of the technology on their business. CIOs are more worried about the short-term impact of Hadoop on their organizations and the technology integration costs. They are concerned about their cost structure, the integration and people-training costs, and the timing of their capital spend.

  1. Right partner
    CIOs want to partner with a vendor with experience in building enterprise applications, delivering solutions, supporting products, and serving existing enterprise customers. They are looking for a partner with a long-term vision that enables their business and technology needs of the future.
  2. Minimize disruption
    CIOs are interested in reducing the migration and application rewrite costs at their end. We find that CIOs also take monitoring and management costs into account as part of the return on investment (RoI) analysis for the immediate use cases before signing off on Hadoop deployments.

    Investment extension is crucial for enterprises. Continuing to run existing applications without re-architecture or significant changes is very appealing to CIOs.

  3. Scale at business pace
    CIOs prefer the scaling paradigm of Pivotal HD/HAWQ – adding more hardware scales both the storage and processing platforms.

8. What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

Hadoop has many configuration parameters, and the optimum parameter values for a given cluster configuration are not easy to derive. Most of these configuration parameters are spread all over – some are part of the application/job configuration, some are part of the job system configuration, some are storage configuration, and quite a few are environment variables and service configuration parameters. Some of these configuration parameters are interrelated, and customers need to understand the cross-variable impact before making any updates. Pivotal provides the following options for customers to tune their Pivotal HD environments.

  1. Technical consulting
    The Pivotal professional services team helps our Hadoop customers optimize and tune their Pivotal HD and HAWQ environments.
  2. Pivotal VRP
    Pivotal VRP allows enterprises to manage system resources at the physical level (I/O and CPU) to optimize their environments.
  3. Pivotal HD Vaidya
    Vaidya, an Apache Hadoop component and part of the Pivotal distribution, guides users in improving their MapReduce job performance.
  4. HAWQ is already optimized for Pivotal HD
    HAWQ leverages advanced workload and resource management capabilities to deliver interactive query performance.
    • A cost-based optimizer delivers optimized query plans for query execution.
    • Dynamic pipelining capabilities improve data loading times.

9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?

Pivotal HD already supports Hadoop 2.0.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

  1. Cost reductions
    Enterprises usually start their Hadoop journey with cost-reduction justifications: reductions in their infrastructure spend and reductions in the execution time of their ETL jobs. Rather than focusing on overall RoI, we advise our customers to focus on a single use case from an end-to-end perspective (including resource leverage and application migration costs).
  2. Revenue enhancements
    Once the benefits are realized, customers focus on new revenue-enhancement use cases and their associated RoI justification for scaling the environments. Once the data lake use case is in place, customers find even more use cases and justifications for their existing investment as well.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

Hadoop is a complex environment, and we advise our enterprise customers that deploying and leveraging Hadoop without assistance from their IT teams is a recipe for failure. Irrespective of where the business need for Hadoop originated, the IT team is best positioned to manage Pivotal HD and HAWQ, as opposed to the business attempting it themselves.

12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL is the most expressive language for manipulating data, and enterprises have a long history of managing structured data with it. SQL support is absolutely essential for enterprise customers. Enterprises are looking for:

  1. Resource leverage

    • The ability to leverage existing resources to deliver next-generation business applications.
    • SQL is the standard language for manipulating data within most enterprises.
  2. Tools
    Enterprises have an established ecosystem of tools for analytics and for delivering insights to their existing user base. Integrating with those tools extends their existing investments.
  3. Existing Applications
    Application migration is a very expensive undertaking. Enterprises are looking to scale and speed up their existing applications without making significant architectural and code changes.
True SQL compatibility is a must, and it will extend the partner ecosystem along with the business applications available on the platform. This will enable a new class of business applications that leverage legacy data along with non-traditional data.
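To illustrate the resource-leverage point, the sketch below queries HAWQ from Python using a standard PostgreSQL driver (HAWQ descends from Greenplum and speaks the PostgreSQL wire protocol, so existing SQL tools and skills generally carry over). The host, database, table and credentials are hypothetical placeholders, not a real environment.

```python
# Hypothetical example: any PostgreSQL-compatible client should be able to
# query HAWQ, since it is derived from Greenplum. Connection details are
# placeholders for illustration only.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="analyst", password="secret")
try:
    with conn.cursor() as cur:
        # The same ANSI SQL an analyst would run against a traditional warehouse.
        cur.execute("""
            SELECT customer_id, SUM(amount) AS total_spend
            FROM sales
            GROUP BY customer_id
            ORDER BY total_spend DESC
            LIMIT 10
        """)
        for customer_id, total_spend in cur.fetchall():
            print(customer_id, total_spend)
finally:
    conn.close()
```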

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

New projects (YARN, Hive, HBase) are making Hadoop easier to use. YARN specifically is improving the performance of the platform and enabling latency-sensitive workloads on Hadoop. Enterprises are looking to move a significant portion of their ETL and SQL analysis onto the platform. HAWQ, along with in-database analytics, makes it easier for enterprises to migrate their existing applications and start building new models and analytics.

Adding support for other execution frameworks (such as the Message Passing Interface, MPI) will bring more analytical and scientific workloads to the platform.

14. What are some of the key implementation partners you have used?

We collaborate with many partners across the world. In addition to using EMC's specialized services, we have worked with third-party partners such as Zaloni, Accenture, Think Big Analytics, Capgemini, Tata Consultancy Services, CSC and Impetus to assist our enterprise customers with their application implementations.

15. What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size and so on)? What are the barriers and opportunities to scale?

Pivotal HD, HAWQ and Isilon are all proven petabyte-scale technologies.

We are investigating the impact of SSDs, larger memory, Remote Direct Memory Access (RDMA), high-speed interconnects (InfiniBand) and TCP offload engines in a scaled 1,000-node environment to identify optimization and performance improvements.

 

Susheel Kaushik leads the Technical Product Marketing team at Pivotal. At Pivotal he has helped many enterprise customers become predictive enterprises leveraging Big Data & Analytics in their decision making. Prior to Pivotal, he led the Hadoop Product Management team at Yahoo! and also led the Data Systems Product Management team for the online advertising systems at Yahoo!. He has extensive technical experience in delivering scalable, mission-critical solutions for customers. Susheel holds an MBA from Santa Clara University and a B.Tech in Computer Science from the Institute of Technology – Banaras Hindu University.

 

 

Follow Svetlana on Twitter @Sve_Sic

 

Comments Off

Category: Big Data Catalyst data paprazzi Hadoop Information Everywhere Uncategorized     Tags: , , , , , , , ,

The Charging Elephant — MapR

by Svetlana Sicular  |  August 26, 2013  |  Comments Off

To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts.

Today, I publish responses from Jack Norris, Chief Marketing Officer at MapR.

Jack Norris

 

1. How specifically are you addressing variety, not merely volume and velocity?

MapR has invested heavily in innovations to provide a unified data platform that can be used for a variety of data sources and workloads, ranging from clickstream analysis to real-time applications that leverage sensor data.

The MapR platform for Hadoop also integrates a growing set of functions including MapReduce, file-based applications, interactive SQL, NoSQL databases, search and discovery, and real-time stream processing. With MapR, data does not need to be moved to specialized silos for processing; data can be processed in place.

This full range of applications and data sources benefits from MapR’s enterprise-grade platform and unified architecture for files and tables. The MapR platform provides high availability, data protection and disaster recovery to support mission-critical applications.

MapReduce:

MapR provides world-record performance for MapReduce operations on Hadoop. MapR holds the Minute Sort world record by sorting 1.5 TB of data in one minute; the previous Hadoop record was less than 600 GB. With an advanced architecture that is built in C/C++ and that harnesses distributed metadata and an optimized shuffle process, MapR delivers consistently high performance.

File-Based Applications:

MapR is a 100% POSIX-compliant system that fully supports random read-write operations. Because MapR supports industry-standard NFS, users can mount a MapR cluster and execute any file-based application, written in any language, directly on the data residing in the cluster. All standard enterprise tools, including browsers, UNIX tools, spreadsheets and scripts, can access the cluster directly without any modifications.
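As a small illustration of the NFS point: once a cluster is mounted like any other file system, ordinary file I/O works directly against cluster data. The mount point and log directory below are hypothetical.

```python
# Hypothetical: a cluster exported over NFS and mounted at /mapr/cluster1
# looks like any other directory, so standard file APIs work unmodified.
import os

LOG_DIR = "/mapr/cluster1/apps/weblogs"   # assumed NFS mount point, not a real path

def count_error_lines(directory: str) -> int:
    errors = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:          # plain POSIX read on cluster data
                errors += sum(1 for line in handle if "ERROR" in line)
    return errors

if __name__ == "__main__":
    print("error lines:", count_error_lines(LOG_DIR))
```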

SQL:

A number of applications support SQL access against data contained in MapR, including Hive, Hadapt and others. MapR is also spearheading the development of Apache Drill, which brings ANSI SQL capabilities to Hadoop. Apache Drill, inspired by Google’s Dremel project, delivers low-latency interactive query capability for large-scale distributed datasets. Apache Drill supports nested/hierarchical data structures and schema discovery, and it is capable of working with NoSQL stores, Hadoop and traditional RDBMSs. With ANSI SQL compatibility, Drill supports all of the standard tools that the enterprise uses to build and implement SQL queries.
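For a flavor of the queries Drill targets, the sketch below shows an ANSI-style SQL statement over nested JSON without an up-front schema. The file path and field names are invented, and in practice the statement would be submitted through a Drill client interface (such as JDBC or ODBC) rather than printed.

```python
# Illustrative only: the kind of ANSI-style SQL Apache Drill aims at, run
# directly over nested JSON with no schema declared ahead of time. File path
# and field names are made up; FLATTEN is Drill's array-unnesting function.
DRILL_QUERY = """
SELECT t.customer.name  AS customer_name,
       FLATTEN(t.items) AS item
FROM   dfs.`/data/orders.json` t
WHERE  t.customer.country = 'US'
"""
print(DRILL_QUERY)   # in practice, send this string through a Drill connection
```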

Database:

MapR has removed the trade-offs organizations face when looking to deploy a NoSQL solution. Specifically, MapR delivers ease of use, dependability and performance advantages for HBase applications. MapR provides scale, strong consistency, reliability and continuous low latency with an architecture that does not require compactions or background consistency checks. From a performance standpoint, MapR delivers over a million operations per second from just a 10-node cluster.

Search:

MapR is the first Hadoop distribution to integrate enterprise-grade search. On a single platform, customers can now perform predictive analytics, full search and discovery, and advanced database operations. The MapR enterprise-grade search capability works directly on Hadoop data but can also index and search standard files without any conversion or transformation. All search content and results are protected with enterprise-grade high availability and data protection, including snapshots and mirrors that enable a full restore of search capabilities.

By integrating the search technology of the industry leader, LucidWorks, MapR and its customers benefit from the added value that LucidWorks Search delivers in the areas of security, connectivity and user management for Apache Lucene/Solr.

Stream Processing:

MapR provides a dramatically simplified architecture for real-time stream computation engines such as Storm. Streaming data feeds can be written directly to the MapR platform for Hadoop for long-term storage and MapReduce processing. Because MapR enables data streams to be written directly to the MapR cluster, administrators can eliminate queuing systems such as Kafka or Kestrel and perform publish-subscribe models within the data platform. Storm can then ‘tail’ a file to which it wishes to subscribe, and as soon as new data hits the file system, it is injected into the Storm topology. This allows for strong Storm/Hadoop interoperability, and a unification and simplification of technologies onto one platform.
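Here is a minimal sketch of the “tail a file” pattern described above: a reader that seeks to the end of a growing file and yields new lines as they arrive, which a Storm spout (or any other consumer) could then emit into a topology. The file path and polling interval are illustrative assumptions.

```python
# Illustrative sketch of tailing a growing file (as a Storm spout might do
# against a file on the cluster). The path and polling interval are made up.
import time

def tail(path: str, poll_seconds: float = 0.5):
    """Yield new lines appended to `path`, starting from the current end."""
    with open(path) as handle:
        handle.seek(0, 2)                 # jump to the end of the file
        while True:
            line = handle.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_seconds)  # wait for the producer to append more

if __name__ == "__main__":
    for event in tail("/mapr/cluster1/streams/clicks.log"):
        print("new event:", event)        # a spout would emit this into the topology
```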

2.  How do you address security concerns?

MapR is pushing the envelope on Hadoop security. MapR integrates with Linux security (PAM) and works with any user directory: Active Directory, LDAP, NIS. This enables access control and selective administrative access for portions of the cluster. Only MapR supports logical volumes, so individual directories or groups of directories can be grouped into volumes. Not only access control but also data protection policies, quotas and other management aspects can be set by volume.

MapR has also delivered strong wire-level authentication and encryption that is currently in beta. This powerful capability includes both Kerberos and non-Kerberos options. MapR security features include fine-grained access control: full POSIX permissions on files and directories; ACLs on tables, column families, columns and cells; ACLs on MapReduce jobs and queues; and administration ACLs on clusters and volumes.

3.  How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

MapR includes advanced monitoring, management, isolation and security for Hadoop that leverage and build on volume support within the MapR Distribution. MapR clusters provide powerful features to logically partition a physical cluster for separate administration, data placement, job execution and network access. These capabilities enable organizations to meet the needs of multiple users, groups and applications within the same cluster.

MapR enables users to specify exactly which nodes will run each job, to take advantage of different performance profiles or limit jobs to specific physical locations. Powerful wildcard syntax lets users define groups of nodes and assign jobs to any combination of groups. Administrators can leverage MapReduce queues and ACLs to restrict specific users and groups to a subset of the nodes in the cluster. Administrators can also control data placement, enabling them to isolate specific datasets to a subset of the nodes.
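As a conceptual illustration of the wildcard-based node grouping idea (this is not MapR’s actual placement syntax or API), the sketch below matches node names against wildcard patterns to decide where a job is allowed to run; the node names and group definitions are invented.

```python
# Conceptual sketch only, not a vendor API. It shows the idea of defining
# node groups by wildcard pattern and restricting a job to those groups.
from fnmatch import fnmatch

NODES = ["rack1-node01", "rack1-node02", "rack2-node01", "ssd-node01", "ssd-node02"]

# Hypothetical group definitions: each group is a list of wildcard patterns.
NODE_GROUPS = {
    "fast_storage": ["ssd-*"],
    "rack1":        ["rack1-*"],
}

def eligible_nodes(groups):
    """Return nodes matching any pattern of any requested group."""
    patterns = [p for g in groups for p in NODE_GROUPS.get(g, [])]
    return [n for n in NODES if any(fnmatch(n, p) for p in patterns)]

if __name__ == "__main__":
    # e.g. pin a latency-sensitive job to the SSD nodes only
    print(eligible_nodes(["fast_storage"]))   # ['ssd-node01', 'ssd-node02']
```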

4.   What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?

1) YARN – will help expand the use of non-MapReduce jobs on the Hadoop platform.

2) HA (High Availability) – There has been some great work done, and there are plans in place, but there are still many issues to be addressed. These include the ability to automatically recover from multiple failures, and the automatic restarting of failed jobs so tasks pick up where they left off.

3) Snapshots – This is a step in the right direction, but the snapshots are akin to a copy function that is not coordinated across the cluster. In other words, there is no point-in-time consistency across the cluster. Many applications will not be able to work properly, and any file that is open will not include its latest data in the snapshot.

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?

The upcoming NFS support for HDFS is not POSIX compliant and will not work with most applications. HDFS is an append-only (effectively write-once) file system, and the NFS protocol can only be fully supported by a file system that allows random writes, which HDFS does not.

The gateway has to temporarily save all the data to its local disk (/tmp/.hdfs-nfs) before writing it to HDFS. This is needed because HDFS doesn’t support random writes, and the NFS client on many operating systems will reorder write operations even when the application is writing sequentially, so sequential writes can arrive at the gateway out of order. The gateway directory temporarily stores these out-of-order writes before they are written to HDFS, which impacts performance and the ability to effectively support multiple users and read/write workloads.
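A toy model of the buffering problem described above (not the actual HDFS NFS gateway code): writes that arrive out of order are held back until the next expected offset shows up and are then flushed sequentially, which is why the gateway needs temporary local storage.

```python
# Toy model of why an append-only store needs a reorder buffer in front of it:
# NFS clients may deliver sequential writes out of order, so the gateway holds
# early-arriving chunks until the next expected offset appears. Purely
# illustrative, not real gateway code.

def apply_writes(writes):
    """writes: iterable of (offset, data) chunks, possibly out of order."""
    pending = {}           # offset -> data, buffered until contiguous
    expected = 0
    output = bytearray()   # stands in for the append-only file
    for offset, data in writes:
        pending[offset] = data
        while expected in pending:                 # flush every contiguous chunk
            chunk = pending.pop(expected)
            output.extend(chunk)
            expected += len(chunk)
    return bytes(output), pending                  # non-empty pending means gaps remain

if __name__ == "__main__":
    # Chunks of "hello world" arriving out of order.
    reordered = [(6, b"world"), (0, b"hello ")]
    data, leftover = apply_writes(reordered)
    print(data)       # b'hello world'
    print(leftover)   # {}
```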

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

Hadoop is used today for a wide variety of applications, most of which involve analytics; some use cases, such as data warehouse offload, use Hadoop to perform ETL and to provide long-term data storage. We have customers today that have transformed their business operations with our distribution for Hadoop. Some examples include redefining fraud detection, using new data sources to create new products and services, improving revenue optimization through deeper segmentation and recommendation engines, and integrating online and on-premise purchase paths for better customer targeting.

The biggest challenge to Big Data adoption is the need for organizations to abandon some long held assumptions about data analytics. These assumptions include:

  • The cost of analytics is exponential
  • Project success requires the definition of the questions you need answered ahead of time
  • Specialized analytic clusters are required for in-depth analysis
  • Complex models are the key to better analysis

Big Data, and specifically Hadoop, expands linearly by leveraging commodity servers in a cluster that can grow to thousands of nodes. If you need additional capacity, simply add servers.

Hadoop supports a variety of data sources and doesn’t require an administrator to define or enforce a schema. Users have broad flexibility with respect to the analytic and programming techniques they can employ, and they can change the level of granularity of their analysis. Unlike in a data warehouse environment, these analytic changes do not result in downtime for modifying the system or redefining tables.

One of the interesting insights that Google has shared is that simple algorithms on large data outperform more complex models on smaller data sets. In other words, rather than investing time in complex model development, expand your data set first.

All of these changes make it much easier and faster to benefit from Hadoop than from other approaches, so the message to organizations is: deploy Hadoop to process and store data, and the analytics will follow.

7.  What does it take to get a CIO to sign off on a Hadoop deployment?

Given the advantages of Hadoop, it’s fairly easy to get sign-off on the initial deployment. The growth from there is due to the proven benefits that the platform provides.

8.  What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

This is an area that MapR has focused on, with underlying architecture improvements that drive much higher performance (2-5 times) on the same hardware.

9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?

MapR fully participates in the open source community and includes open source packages as part of our distribution. That said, we test and harden packages and wait until they are production ready before including them directly in our distribution. Hadoop 2.0 is not yet at the stage where we have released it as part of our distribution to customers.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

We see tremendous payback on Hadoop deployments. Because MapR provides all of the HA, data protection and standard file access delivered by enterprise storage environments at a much lower cost, we’re driving ROI on investments that pay back within the year. One of our large telecom clients generated an eight-figure ROI with their MapR deployment.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

There’s a difference between test and lab development and production. We see a lot of experimentation outside of IT, and cloud deployments being utilized, but on-premises production deployments are directed by IT. The experience of IT is helpful when organizations are trying to assess data protection requirements, SLAs and integration procedures.

12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL is important, but the impact of Hadoop is on the next generation of analytics. SQL makes it easier for business analysts to interact with data contained in a Hadoop cluster, but SQL is only one piece; integration with search, machine learning and HBase is also important, and the use cases and applications driving business results go far beyond SQL. The ultimate query language that works across structured and unstructured data is search. In a sense, it is an unstructured query that is simple for a broad range of users to understand. A search index paired with machine learning constructs is a powerful tool for driving business results.

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

Absolutely, but YARN does not address the fundamental limitation of Hadoop: the data storage layer (HDFS). Even though YARN opens up the types of jobs that can be run on Hadoop, it does nothing to address the batch-load requirements of the write-once storage layer. At MapR, we are excited about running YARN on our next-generation file system, which provides POSIX random read/write support.

14. What are some of the key implementation partners you have used?

We’ve used integrators that are focused on data as well as large system integrators.

15. What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size and so on)? What are the barriers and opportunities to scale?

MapR has made significant innovations to reduce the footprint and memory requirements for Hadoop processing. We take advantage of greater disk density per node, and many of our customers have 16 disks of 3 TB drives associated with each node. We also take advantage of next-generation flash and SSD storage, with performance comparisons showing up to 7x faster results than other distributions.

Jack Norris, Chief Marketing Officer at MapR

Jack has over 20 years of enterprise software marketing experience. He has demonstrated success from defining new markets for small companies to increasing sales of new products for large public companies. Jack’s broad experience includes launching and establishing analytic, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity (now EMC), Brio Technology, SQRIBE, and Bain and Company. Jack earned an MBA from UCLA Anderson and a BA in Economics with honors and distinction from Stanford University.

 

Follow Svetlana on Twitter @Sve_Sic

 

 

Comments Off

Category: Big Data Catalyst data paprazzi Hadoop Information Everywhere open source Uncategorized     Tags: , , , ,

The Charging Elephant — Cloudera

by Svetlana Sicular  |  August 21, 2013  |  Comments Off

To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts.

Today, I publish responses from Amr Awadallah, CTO of Cloudera.

Amr Awadallah

 

1. How specifically are you addressing variety, not merely volume and velocity?

Variety is inherently addressed by the fact that Hadoop at its heart is a file system that can store anything. There is also additional functionality to map structure on top of that variety (e.g. HCatalog, Avro, Parquet, etc.). We are evolving Cloudera Navigator to aid with meta-data tracking for all that variety.

2.  How do you address security concerns?

Hadoop has a very strong security story. Hadoop supports authentication through Kerberos, supports fine-grained access control (through Sentry), natively supports on-the-wire encryption, and supports perimeter access control through HttpFS and Oozie gateways. Furthermore, Cloudera Navigator supports audit reporting (to see who accessed what and when). We also have a number of technology partners offering on-disk encryption as well. CDH4 is already deployed in a number of financial, health, and federal organizations with very strict security requirements.

3.  How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?

First, you have been able to run multiple MapReduce jobs in Hadoop for a number of years now. CDH ships with the fair scheduler, which arbitrates resources across a number of jobs with different priorities. Second, as of CDH4, and soon in CDH5, we are including YARN (from Hadoop 2), which allows for arbitration of resources across multiple frameworks (not just MapReduce). In preparation for CDH5, we are doing a lot of work in both CDH and Cloudera Manager to make it easy to arbitrate resources across multiple isolated workloads on the same hardware infrastructure.
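To make the resource-arbitration idea concrete, here is a toy sketch of weighted fair sharing across queues (the concept behind the fair scheduler, not Cloudera’s or Hadoop’s actual implementation); the queue names, weights and cluster size are invented.

```python
# Toy illustration of weighted fair sharing across workloads: the concept
# a fair scheduler implements, not its actual code. Queue names, weights
# and the cluster size are invented for the example.

def fair_shares(total_slots: int, queue_weights: dict) -> dict:
    """Split cluster capacity across queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    return {q: round(total_slots * w / total_weight)
            for q, w in queue_weights.items()}

if __name__ == "__main__":
    queues = {"etl": 3, "adhoc_sql": 2, "data_science": 1}
    print(fair_shares(600, queues))   # {'etl': 300, 'adhoc_sql': 200, 'data_science': 100}
```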

4.   What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?

  • Resource Management for multiple workloads
  • Even stronger security
  • Overall stability/hardening.

5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?

Hadoop is not a query federation engine; rather, it is a storage and processing engine. You can run query federation engines on top of Hadoop, and that will allow you to mix data from inside Hadoop with data from external sources. Note, however, that you will only get the performance, scalability and fault-tolerance benefits of Hadoop when the data is stored in it (as it tries to move compute as close as possible to the disks holding the data). So while federation might be OK for small amounts of data, it isn’t advised for big data per se.

6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?

We can’t reveal exact percentages, as that is confidential, but it is fair to say that the most common use case is still batch transformations (the ETL offload use case). Frequently, customers see ETL/ELT jobs speed up from hours to sub-minute once moved to Hadoop (for a fraction of the cost). The second most common use case is what we call Active Archive, i.e. the ability to run analytics over a long history of detail-level data (which requires an economical solution to justify keeping all that data accessible versus moving it to tape). While data science is typically thought of as synonymous with Hadoop/Big Data, it is actually not yet the most common use case, as it comes later in the maturity cycle of adopting Hadoop.

7.  What does it take to get a CIO to sign off on a Hadoop deployment?

Typically it is easiest to argue the operational efficiency advantages to the CIO/CFO, i.e. you will be able to do your existing ETL in 1/10th the time at 1/10th the cost. Once Hadoop is deployed inside the enterprise, they start to see all the new capabilities that it brings (dynamic schemas, the ability to consolidate all data types, and the ability to go beyond SQL). That is when the Hadoop clusters start to expand, as the CIO sees that Hadoop can create more value for the business by doing things that couldn’t be done before.

8.  What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?

Tuning Hadoop is a very complicated task to do manually. One of the core features of Cloudera Manager, and of our support knowledge base, is to help you with exactly that problem. This can span anything from changing the hardware itself and adjusting operating system parameters to tuning Hadoop configs.

9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?

Hadoop 2.0 has two parts: all the improvements to HDFS, which are actually much more stable than Hadoop 1.0 (let’s call that HDFS 2.0), and then the new additions for YARN and MapReduce NG (let’s call that MapReduce 2.0). We have been shipping Hadoop 2.0 for more than a year now in CDH4. This allowed our customers to get the better fault-tolerance and higher availability features of HDFS in Hadoop 2.0.

That said, we didn’t replace MapReduce 1 with MapReduce 2; rather, we shipped both in CDH4 and marked MapReduce 2 as a technology preview. This allowed our partners (and early-adopter customers) to start experimenting and building newer applications on the MapReduce 2 API while continuing to use MapReduce 1 as is. As the Apache community releases new code improving YARN and MapReduce 2 stability, we continue to add that code to CDH4, and eventually it will be rolled out with CDH5.

10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?

This obviously varies by customer, especially in cases where you are extracting new value from your data vs. just saving costs. On a cost-saving basis, Hadoop systems typically cost a few hundred dollars per TB, which is about 1/10th the typical cost of RDBMS systems.

11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?

The majority of our production Hadoop deployments are operated by the IT teams. We offer training specifically for system administrators and DBAs in IT departments to get comfortable managing CDH clusters.

12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?

SQL will be one of the key mechanisms for accessing data in Hadoop, given the wide ecosystem of tools and developers that know how to speak SQL. That said, it will not be the only way. For example, Cloudera now offers search on top of Hadoop (still in beta), which is much more suitable for textual data. We also have a strong partnership with SAS for statistical analysis, which again is better suited than SQL to such problems.

13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?

Absolutely, Hadoop (and CDH in particular) is evolving to be a platform that can support many types of workloads. We have MapReduce/Pig/Hive for batch processing and transformations. We have Impala for interactive SQL and analytics. We have Search for unstructured/textual data. We have partnerships with SAS and R for data mining  and statistics. I expect us to see more workloads/applications move to the platform in the future.

14. What are some of the key implementation partners you have used?

TCS, Capgemini, Deloitte, Infosys, T-Systems and Accenture.

15. What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size and so on)? What are the barriers and opportunities to scale?

The most important factor is to make sure that the cluster is spec’ed correctly (both servers and network), the operating system is configured correctly, and all the Hadoop parameters are set to match the workloads that you will be running. We offer a zero-to-Hadoop consulting engagement, which helps our customers get up and running in the best way possible. We also put a lot of our smarts about running clusters efficiently into Cloudera Manager, which tunes the cluster for you.

Hadoop really scales well as a technology once configured correctly. We have a number of customers running at the scale of thousands of nodes with tens of petabytes of data.

 

Amr Awadallah, Ph.D., Chief Technology Officer at Cloudera

Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel, he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo! after it acquired his first startup, VivaSmart, in July 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.

 

 

Follow Svetlana on Twitter @Sve_Sic

 

 

Comments Off

Category: Big Data Catalyst Hadoop Uncategorized     Tags: , , , , , , ,

The Charging Elephant

by Svetlana Sicular  |  August 15, 2013  |  Comments Off

Hadoop is the most popular elephant of the Big Data Age. Its ecosystem and habitat are of interest even to those who can hardly spell “Hadoop.” Our clients are trying to get oriented in the jungle of choices and in the rapidly changing landscape of big data technologies. At the recent Catalyst conference, we decided to “show” Hadoop distribution providers side-by-side, and have them “tell” about their views on the elephant’s whereabouts. The Catalyst panel included (left to right, if you were looking at the stage):

  • Microsoft — Eron Kelly, General Manager of SQL Server and Data Platform
  • Pivotal — Susheel Kaushik, Chief Architect of Pivotal Data Fabrics
  • Amazon Web Services — Paul Duffy, Principal, Product Marketing
  • MapR — Jack Norris, Chief Marketing Officer
  • Cloudera — Amr Awadallah, Chief Technology Officer
  • Hortonworks — Bob Page, VP of Products

The video of this panel is available on Gartner Events on Demand (along with a plethora of fantastic video presentations from Gartner events).  Under the Catalyst conference time constraints, the panelists were able to cover just a few topics.  But they graciously agreed to follow up in writing on the rest of the questions from the audience and analysts.  I will start publishing their responses to the following questions via this blog space early next week — stay tuned:

  • How specifically are you addressing variety, not merely volume and velocity?
  • How do you address security concerns?
  • How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?
  • What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?
  • What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?
  • What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?
  • What does it take to get a CIO to sign off on a Hadoop deployment?

Knowing that each panelist is busy, I offered them additional questions that they could choose to answer if they have time.

  • What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?
  • How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?
  • What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?
  • What factors most affect YOUR Hadoop deployments (e.g., SSDs, memory size and so on)? What are the barriers and opportunities to scale?
  • What hardware enhancements are you looking for to make Hadoop run better?

At last year’s Hadoop Summit, I took this picture of a baby elephant, and I’m happy to report — it is charging now.


Comments Off

Category: Big Data Hadoop Uncategorized     Tags: , , , , , , , , , , , , , ,

The Illusions of Big Data

by Svetlana Sicular  |  July 24, 2013  |  1 Comment

I’m afraid that more data and more analytics will create an illusion of solutions while the problems still persist. Many problems are intrinsically impossible to solve; thousands of years of philosophy testify to that.  Philosophy aside, people cannot agree on elementary things: some of us have firm reasons to vote for Democrats, some of us have iron-clad arguments in favor of Republicans. Given that, it is ridiculous to expect big data technologies to solve many, if not all, problems.

Companies are increasingly interested in expanding to predictive and prescriptive analytics – answers will be shifting from deterministic to probabilistic.  But probability itself is an illusion. It’s easy to forget that statistics doesn’t apply to a single person or event because this person or event can be an outlier. Conversely, while an individual usually behaves rationally, an aggregated crowd is irrational.  Making sense of the irrational is the eternal challenge.

When we estimate probabilistic answers, we tend to use the 80/20 rule, which works as described by my colleague Mark Beyer, who says that 80% of 80/20 rules are garbage.  We assume that 80% probability is good enough for predictive or prescriptive analysis.  Some 80/20 rules even assume that 90% is good, and 80% is not.  In reality, most probabilistic answers have a very low probability of being “correct” (think philosophically): 20% of the 80/20 rule would be a very decent number. And there are simpler mistakes: I’ve just recently seen an example of 24 successes vs. 12 successes reported as a 50% improvement, when going from 12 to 24 is in fact a 100% improvement (the example was more complicated because of the accompanying unrelated big numbers and complex descriptions that distracted from the essence of 24 vs. 12).

At last year’s Gartner BI Summit, a panel of the leading BI vendors was asked about the percentage of BI failures.  The first panelist answered 50%; the second, a competitor of the first, decreased the estimated failures to 40% (↑); and the last one, facing an obvious lower bound, said that most BI implementations do not fail (↑).  There was a similar vendor panel at this year’s Gartner BI and Analytics Summit: the vendors varied, but they got the same question – about the rate of BI implementation failures.  The first vendor gave an answer of 50%, the second one gave 60% (↓), and so on (↓), until the last vendor, who didn’t have much wiggle room, said that 99% of BI implementations fail (↓).   Both years, the vendors had really good explanations for their opinions. But what drove their answers? Data and analysis? Mostly pressure, not the actual reasons, which are subjective too. This illustrates the illusion of getting the right answers.

An analytical insight leads to different conclusions depending on the decision maker, too.  Some people cannot make a decision and enjoy the process, postponing decisions as long as they can, until circumstances change and the insight becomes obsolete.  Some people naturally make decisions quickly and then accept the implications, good or bad; often, hasty decisions are not very good. Every one of us, quick or indecisive, has made bad judgments in the presence of all the necessary information. As suggested by my colleague Nigel Rayner, machines could make a better choice in certain cases, but machines are programmed by the very people I described above.  Or what if machines discover something obvious to people?  We have seen many times that the answer is a non-event.

By having more and more, and bigger and bigger data, and by delivering more and more analytics, including the current favorites at the world’s leading whiteboards and blueprints — predictive and prescriptive types — we might conclude that many more problems are already solved.  And this will be just an illusion.  Look!

The Optical Illusion

P.S. Don’t give up solving problems though. “You are never given a wish without also being given the power to make it true. You may have to work for it however,” said Richard Bach in Illusions.

 

Follow Svetlana on Twitter @Sve_Sic

1 Comment »

Category: "Data Scientist" analytics Big Data data data paprazzi Humans Information Everywhere Inquire Within Uncategorized     Tags: , , , , ,