by Svetlana Sicular | December 23, 2014 | 6 Comments
Shortly after joining Gartner, I noticed subtle magic in the air. Not Magic Quadrants – they were too obvious. It was I told you so repeated by many analysts on many occasions. It seemed almost mystical – how could they know? At some point, I caught myself saying I told you so too — it became natural after seeing and researching so much. For example, I talked about personal analytics, about multidisciplinary teams instead of a single data scientist, and about Hadoop being a live archive — these came to fruition and became commonplace. I told you so. (Mom, I hate when you say it.)
In our big data field research back in 2012, we saw a big data maturity gap that needed a couple of years to close. I told you so. I am glad to report that 2014 was the first year when enterprises became serious about big data: They started asking questions beyond “how do I begin my big data initiative?” or “how do I select a Hadoop distribution?” Hortonworks even went public, the first Hadoop vendor to do so. My colleague Merv Adrian went into great depth on the Hortonworks IPO in his blog post Hortonworks IPO – Why Now?
The question is — what’s next for big data? First of all, big data will become the new normal sometime between 2016 and 2018. My colleagues Donald Feinberg and Mark Beyer will say (with well-earned pride and an air of mysticism), I told you so.
Organizations are finally ready for big data in the cloud. In the second half of 2014, my clients started asking about a data warehouse and Hadoop in the cloud. I am an analyst in Gartner for Technical Professionals — 90% of our clients are practitioners who are doing things right now. Therefore, I anticipate many interesting developments around big data in the cloud soon.
In 2015, I expect a plethora of big data applications on top of data platforms (remember, these platforms already demonstrate acceptable maturity). Big data applications will be mostly analytical, and they will be small, in the “app store” style, with few customizations — that makes support and maintenance relatively easy. People will be able to download the big data apps they need and use them like Lego blocks for their own customizations. Big data apps will put a process or a workflow into the spotlight.
If big data apps proliferate, they will need… data. This means a focus on data governance, data preparation and an ability to painlessly load data into big data stores. This also means self-service, or rather SELF-SERVICE. It will be a bigger and bigger subject from the data management and analytics perspectives — and from the process perspective too. I already told you so.
People are impatient. Those who want self-service are demanding increasingly real-time data access, response and gratification. This will lead to in-memory and streaming advances, but I don’t think it will be next: for practitioners, it will be next after next.
The Internet of Things (IoT) is at the peak of inflated expectations of the hype cycle. Organizations collect more data than they can process: for them, it’s still not just about finding the needle, but getting the hay in the stacks. The Internet of Data sustains the IoT. Companies are collecting new data, asking about new external data sources and searching for dark data within. Many new data sources are personal data – privacy and ethics accompany them. This year, I wrote Maverick* Research: Put Your Data in the Bank, Get Dividends, where I foresee intermediaries — personal data banks — representing individuals. A personal data bank will keep deposited data and multiply its wealth through commercialized digital markets. The Internet of Data and personal data banks are not reality yet, but I am sure that soon I will be able to say I told you so about them.
The Internet of Data sustains all other Internets
Follow Svetlana on Twitter @Sve_Sic
Category: "Data Scientist" analytics Big Data data data paprazzi Hadoop Information Everywhere innovation Inquire Within Uncategorized Tags: analytics, big data, data paprazzi, end users, hadoop, Hadoop distribution, Information Everywhere, market analysis
by Svetlana Sicular | October 23, 2014 | 2 Comments
Suddenly, I realized: fluids are in, animals are out. The big data ecosystem has given up on its elephants, impalas and pigs in favor of aquatics. Perhaps the shift started with “data lakes,” or perhaps data lakes just reflected the state of big data (pun intended). Or maybe Cascading was the one that signified the shift: Cascading was the first to enable data application development on Apache Hadoop, giving developers Driven, Lingual and Scalding. It is now obviously Fluid.
According to Metanautix, navigating data has never been so fluid. Did you notice not just nautix, but also meta in the name? This is for a good reason. Metadata is key to deriving value from big data. Tonight (23 October), I am moderating a panel on enabling Hadoop data, so that a data lake doesn’t turn into a data marsh. Incidentally, one of the panelists is from Waterline Data Science. If you are in the Bay Area, come see the panel in person in Sunnyvale — everyone is welcome.
Big data is flowing through all kinds of data pipelines and is streaming from all things, up to the point of becoming a DataTorrent. It can run freely in its purest state — H2O, or get sublimated into Snowflake in the cloud. If weather permits, data could even pour like FirstRain (a company that trademarked the term Personal Business Analytics!).
You might wonder whether the current big data darling Spark fits into the Age of Aquarius picture. It does — how about a killer app Sparkling Water? The new wave of big data technologies is rising!
Follow Svetlana on Twitter @Sve_Sic
Category: "Data Scientist" Big Data data paprazzi Hadoop Information Everywhere Tags: data paprazzi, vendors
by Svetlana Sicular | October 14, 2014 | 1 Comment
The Hadoop ecosystem is like a kaleidoscope, where particles keep colliding, tumbling and forming mesmerizing patterns created by the reflections. My research note What Matters When Comparing Hadoop Distributions is finally out. I’ve been writing it for four months. As soon as I felt it was ready, there were some principal points that I had to resolve, because the Hadoop kaleidoscope kept turning. Hadoop distribution vendors were changing their stances, and clients were seeking guidance on more and more Hadoop-related subjects. What’s even more interesting, over this time a whole wave of Hadoop ecosystem products became more visible in the kaleidoscope: Databricks / Apache Spark, 0xdata H2O and Adatao are examples.
I’d like to offer the main points from my research, which can help enterprises get a snapshot of the Hadoop kaleidoscope. Be aware, many new announcements will come from Strata / Hadoop World this week to keep beautiful and evanescent pictures in motion.
- Commercial Hadoop distributions eliminate the complexity of building a Hadoop stack on your own. They ensure indemnification and provide support for open-source software.
- There are more similarities than differences among commercial Hadoop distributions: All Hadoop distributions include the core open-source Apache Hadoop projects, many other open-source projects and a smaller set of distribution-specific components. Most distribution-specific components deliver functionality comparable with that of other distributions. This makes vendor lock-in concerns unfounded for the majority of use cases.
- Hadoop distributions will improve and will deviate from their current state. Gartner expects new technologies in the Hadoop ecosystem in the near future.
- Hadoop’s value is not only in its features and capabilities: given the growing maturity of YARN resource management, Hadoop is also becoming the de facto standard for cluster management.
- Given that Hadoop is engineering-driven, certain gaps important to the business could get low priority or may be overlooked.
- For many organizations, big data initiatives are the cutting edge of their innovation. Talented and experienced distribution vendors are often not just service providers but innovation partners and the source of new ideas in the enterprise.
- Cost should not be a key factor in deciding to implement Hadoop on your own. Acquire a commercial Hadoop distribution for your on-premises implementation to address unavoidable technology challenges.
- Partnerships between Hadoop distribution vendors and your key software or hardware suppliers are a main Hadoop distribution selection factor. Determine how a Hadoop distribution fits into your overall architecture.
- The majority of your time would be better spent on determining the value of Hadoop to your enterprise, rather than on choosing among Hadoop distributions.
- Your long-term architectures will evolve along with Hadoop. In the light of rapid changes and upcoming Hadoop improvements, focus your architecture on your immediate use cases.
Follow Svetlana on Twitter @Sve_Sic
by Svetlana Sicular | August 28, 2014 | 3 Comments
As a Gartner analyst, I am fortunate to frequently meet amazing people. Qaizar Hassonjee from adidas is not only one of them, but one of the most memorable. He is at the heart of miCoach, including miCoach Elite, the system developed in partnership with top soccer players, coaches and teams from the parts of the world where soccer is known as football. For instance, the German national team practiced with miCoach all last year.
We invited Qaizar Hassonjee to talk at our Catalyst conference earlier this month, and he accepted our invitation! I was tweeting like crazy, “Everyone, drop everything, go to End-User Case Study: Smart Soccer With adidas miCoach Elite Team System!” The session was recorded by Gartner Events On Demand, which offers analyst and guest speaker presentations from all our conferences, woo-hoo!
Qaizar Hassonjee is a passionate leader who knows how to focus and what to focus on. He leads fantastic innovations, like the creation of a sensor t-shirt that monitors an athlete’s heart rate and performance. And this sensor t-shirt is washable! I am writing this blog post because Qaizar Hassonjee and his team got big data right. Here is Gartner’s definition of big data (which I explained in the past):
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
This is how the big data definition plays out in digital sports.
Part 1. High-volume, high-velocity and high-variety information assets.
(Screenshot from adidas VP of Innovation Qaizar Hassonjee’s talk at Catalyst.)
miCoach collects players’ heart rate, physiological parameters, geolocation and much more in real time, with a lot of unexpected uses of data. For example, a location heat map was important to people who maintained the field.
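For a feel of how such a heat map can fall out of the raw position stream, here is a minimal sketch in Python. The grid size, field names and sample data are invented for illustration — this is not the actual miCoach pipeline:

```python
# A minimal sketch: bin player (x, y) positions (in meters) into grid
# cells to see which parts of the pitch take the most wear.
from collections import Counter

CELL_M = 5  # 5x5-meter grid cells

def heat_map(readings):
    """Count position samples per grid cell."""
    cells = Counter()
    for x, y in readings:
        cells[(int(x // CELL_M), int(y // CELL_M))] += 1
    return cells

# Hypothetical samples from one training session:
samples = [(52.1, 33.8), (52.4, 34.0), (10.2, 5.5)]
for cell, hits in heat_map(samples).most_common():
    print(cell, hits)
```

The same records that tell a coach about a player’s workload tell the groundskeeper which patches of grass to reseed.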
Part 2 of the definition. Information assets that demand cost-effective, innovative forms of information processing.
The miCoach team was focusing on serving the right analytics at the right time. They did not make the typical mistake of relying exclusively on their own expertise; instead, they involved cardiologists, physiologists, equipment managers and, of course, coaches and players.
And finally, part 3 of the definition: Information processing for enhanced insight and decision making.
These are the main points that led to the success of miCoach’s big data insights:
- Don’t overload with data and information.
- Don’t sacrifice performance.
- Don’t over-engineer.
- Focus on integration of different components to bring a unique user experience and test, test, test.
Follow Svetlana on Twitter @Sve_Sic
by Svetlana Sicular | July 29, 2014 | 7 Comments
Everybody talks about successes in big data. And everybody is curious about failures. Today, I want to illustrate some typical causes of big data project failures with real-life examples — no company logos to show, sorry. These are not necessarily “fail fast” scenarios; some are the uneventful and painful “fail slow.” Let’s start with an amazing success story.
Management inertia. Our client, a household name among early internet travel companies and an early adopter of big data technologies, ran click-stream analysis to find out how people navigate its travel site and how they make purchases. It turned out that the buying patterns were exactly opposite to the sales approach of the company’s upper management. This is the verbatim quote about this rare happy ending:
“We’ve had great success with this technology. The insights we’ve had changed the business dramatically. To capitalize on these insights we brought in new management.”
How many companies are in a position to get rid of their upper management?
Selecting wrong use cases. Many companies start with advanced use cases that require a better understanding of the technologies, which comes only with experience. Other companies select the same use cases that they are used to implementing on traditional technologies and, consequently, don’t see benefits. My blog post The Top Mistake in Evaluating Big Data Initiatives describes this situation.
Asking wrong questions. An automobile manufacturer with thousands of dealerships ran a sentiment analysis project to learn about its customers. Six months and $10M later, the findings from big data were distributed to all those thousands of dealerships — and all of them laughed out loud: every dealership had known all along what the big data project had been digging for all this time.
Lacking the right skills. Every one of us considers him/herself an expert in human behavior, our native language or our own social life. So do people running big data analytics projects. A financial services company started a project to detect how people’s habits affect their propensity to buy retirement plans. Humans are creatures of habit — of too many habits. The people who ran the project decided (little by little, failing slowly) to narrow all habits down to just smoking vs. non-smoking. And failed again. It turned out (from my conversations with a healthcare company, which happened to coincide with this case) that instead of a black-and-white “do you smoke?” healthcare professionals would have asked, “How many years did you smoke? How many times did you quit smoking? When was the last time you smoked?” The bottom line: look for professionals who know the field you analyze — healthcare experts, linguists, behavioral psychologists, social anthropologists and others who normally don’t belong to IT.
Unanticipated problems that are wider than just big data technology. One large retailer ran a big data project in the cloud. Network congestion to the stores was the problem that derailed the whole project. A team member summarized their learning from the failure:
“Supporting any new platforms on a remote site is more than a technology problem. It must factor in personnel, training, upgrades, maintenance and real estate.”
Disagreement on the enterprise strategy. There are many schools of thought in a large company. Here is an eloquent quote from a client, an information architect:
“We see information as the heart. Others believe cloud is the heart of our strategy.”
As a result, there is no enterprise-wide strategy, but a lot of unrelated initiatives, big data being rather small.
Siloed big data negates the whole idea of having it. This reason for failure relates to the previous one. A client who learned this from his own mistakes said:
“Prioritization of business projects is a bit more difficult because we are so siloed in business units. We do not do a good job justifying the platform as a whole. Whoever screams loudest gets it.”
Solution avoidance. The most typical example is the pharmaceutical industry, which is required to report any known adverse drug effects. The whole industry avoids sentiment analysis, because it would have to report to the FDA any event where, for example, a patient complains about a headache in the same paragraph where a particular drug is mentioned.
My list of big data failures can go on, and on, and on. I especially want to stress the need to understand the data, no matter if it’s big or not. There are tons of cases of not knowing the data and, as a result, being unable to deliver anything new — or of having so much data and no experience in managing, analyzing or querying it. I will talk about data, big data and greater data two weeks from now, at our Catalyst conference in San Diego. Come over!
Follow Svetlana on Twitter @Sve_Sic
by Svetlana Sicular | June 17, 2014 | 6 Comments
People often ask me if there is a magic quadrant for big data. There isn’t. What we have is a Hype Cycle for Big Data with an abundance of big data technologies, some of which are just nascent, some are on the plateau of productivity, and some, like Hadoop distributions, are in the unjustly dreaded and largely misunderstood trough of disillusionment.
Gartner also has annual cool vendor reports where analysts write about up-and-coming companies with innovative ideas, services and technologies. Many reports cover awesome big data vendors, for example, Cool Vendors in Big Data, Cool Vendors in Data Science and Cool Vendors in Information Innovation.
At Gartner for Technical Professionals (where I am), we usually publish vendor-neutral research and do not write for cool vendor reports (to be fair, we submit our choices and peer-review these reports). Yet, our clients constantly ask me and my colleagues about vendors. Last week, Fortune magazine published my opinion on big data companies to watch. My opinion was not about the best or the most prominent, most hyped or most intriguing, most funded or most profitable companies, but about the companies to watch.
Katherine Noyes, the author of the Fortune magazine article, asked me to name five big data companies to watch and to comment on some published big data vendor lists. Below is my full response; it explains my choices:
Well, selecting just five companies is a challenge since there are many more companies that do interesting things around big data. I have technical and non-technical considerations for giving my list of five. My top noteworthy big data companies would be:
- Neo Technology is the force behind the open-source graph database Neo4j. I think graphs have a great future, since they show data in its connections rather than as a traditional view of atomic elements (see the sketch after this list). Graph technologies are mostly unexplored by enterprises, but they are a solution that can deliver truly new insights from data. I wrote about graphs some time ago in my blog post Think Graph.
- Splunk has excellent technology, and it was among the first big data companies to go public. Now Splunk also has a strong product called Hunk (Splunk on Hadoop), directly delivering big data solutions that are more mature than most products in the market. Hunk is easy to use compared to many big data products, and generally, most customers I spoke with expressed their love for Splunk without any prompting on my side.
- MemSQL – an in-memory relational database that is effective for mixed workloads and for analytics. While SAP draws so much attention to in-memory databases by marketing its HANA database, MemSQL seems to be a less expensive and more agile solution in this space.
- Pivotal – while it might not be the most polished big data solution, Pivotal is solving a much larger problem: the convergence of the cloud, mobile, social and big data forces (which Gartner calls the Nexus of Forces). Ultimately, big data is not a standalone technology; it should deliver actionable insights about the rapidly changing modern world, with its social interactions, mobility, Internet of Things, etc. That’s why GE is one of the major investors in Pivotal, with the purpose of building the Industrial Internet.
- Teradata – it might be a surprising choice for the many big data aficionados who have chosen Teradata as a target in the religious wars of new big data technologies against the data warehouse, where Teradata is an easy prey because it is a pure play in data warehousing (as opposed to Oracle, IBM or Microsoft, which have many more products). Meanwhile, Teradata delivers a unified data architecture that combines the best of both worlds — and enterprises need both.
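To make the “data in its connections” point concrete, here is a minimal sketch of a relationship-first query. It assumes a running Neo4j instance and the Neo4j Python driver; the graph model, credentials and names are invented for illustration:

```python
# A minimal sketch of a relationship-first question against Neo4j:
# "Which non-customers do my customers know?" -- a join-heavy query
# in a relational store, a short pattern match in a graph.
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    result = session.run(
        "MATCH (c:Customer)-[:KNOWS]->(p:Person) "
        "WHERE NOT p:Customer "
        "RETURN DISTINCT p.name AS prospect LIMIT 10"
    )
    for record in result:
        print(record["prospect"])
driver.close()
```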
As you may have noticed, I am covering various segments of big data technologies. If I could name more than five companies, I’d also add companies in other segments:
- Big data analytics: Actian and Datameer
- Predictive analytics: Revolution Analytics and Ayasdi
- Data integration: Pentaho and Denodo (particularly for data virtualization)
- Big data cloud providers: Qubole and Altiscale
- Hadoop: Cloudera – “first in space” for big data, huge recent investments from an interesting set of investors, most notably, Intel.
- Development framework: Concurrent, with the open-source product called Cascading, now included in some Hadoop distributions. Given that applications are about to explode on Hadoop, Concurrent should do very well.
Please note, this is not comprehensive research, and there are more very good companies. The companies I listed are “to watch” rather than best overall.
Now, to comment on the list of 100 big data companies you pointed me to. Out of this list, the following companies look appealing to me: Dataguise, MapR, MatterSight, Manhattan Software (an excellent player in real estate!) and Data Tamer (I would prefer Paxata though) – see my blog post Big Data Quantity vs. Quality.
I’d like to dwell on The Hive in particular. This VC company specializing in data has an unusual approach that I personally greatly appreciate. It conducts weekly live meetups, which cover diverse subjects and draw diverse people who are interested in big data. The Hive became one of the most well-known gatherings that attract the brightest minds in big data as speakers (and as attendees). It became a social “big data hub” in Silicon Valley, and I believe, in India too. Being in the center of big data life, The Hive has a great opportunity to make successful investments at the early stages of data companies.
Finally, I’d like to remind you again: Gartner cool vendor reports are a much more comprehensive and pointed reading than my casual overview.
Follow Svetlana on Twitter @Sve_Sic
Category: "Data Scientist" Big Data big data market data paprazzi Gartner hype cycle Hadoop Information Everywhere innovation The Era of Data Trough of Disillusionment Uncategorized Tags: big data, cloudera, data paprazzi, data scientist, Information Everywhere, innovation, MapR, Silicon Valley, vendors
by Svetlana Sicular | June 3, 2014 | Comments Off
In 2009, CRM icon Tom Siebel was attacked by a charging elephant during an African safari. Ominously, this was exactly the time of changing epochs, signified by another elephant, Hadoop. It was not obvious back in 2009 that the era of CRM had come to an end and the era of data had begun. That very year, Cloudera announced the availability of the Cloudera Distribution of Hadoop, and MapReduce and HDFS became separate subprojects of Apache Hadoop. This was the year when people started talking about beautiful data.
The era of data is about the process of data commoditization, where data is becoming an independently valuable asset that is freely available on the market. A “commodity” is defined as:
- Something useful that can be turned to commercial or other advantage
- An article of trade or commerce
- An advantage or benefit
Look familiar? That’s what we want data to become. And we are getting there, not very fast but steadily. Information patterns derived from data are already changing the status quo; they disrupt industries and affect lives. Sometimes, data is useful yet not turned to commercial advantage. For sure, data is increasingly becoming an article of trade or commerce. Notice that the number of newly available Web APIs giving public access to data started growing explosively around the beginning of the era of data.
(Chart: growth in the number of new Web APIs. Source: ProgrammableWeb)
I am not the first one pointing to the commoditization of data. Bob Grossman, who epitomizes a data scientist to me, gave a detailed account of commoditization in his outstanding book The Structure of Digital Computing: From Mainframes to Big Data. In particular, the commoditization of time took most of the 17th century. We take our watches, clocks and phone timers for granted — now put this in perspective: in the future, someone will take for granted access to all kinds of data. The last chapter of Grossman’s book is entitled “The Era of Data.”
Open data is a strong manifestation of this new era. The first government open-data websites — data.gov and data.gov.uk — were launched in 2009. The government mandates and open data policies from multiple countries and public entities continue to contribute to the process of data commoditization. Openness has the benefit of increasing the size of the market. The greater the size of the market and the demand for a resource, the greater the competitive pressure on price and, hence, the increase in commoditization of the resource.
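For a taste of how commoditized access to open data has already become, here is a minimal sketch that searches the data.gov catalog. The endpoint and response fields follow the standard CKAN search API as I understand it — treat them as assumptions:

```python
# A minimal sketch: search the data.gov catalog via its CKAN-style API.
import json
import urllib.request

url = "https://catalog.data.gov/api/3/action/package_search?q=energy&rows=5"
with urllib.request.urlopen(url) as response:
    payload = json.load(response)

result = payload["result"]
print(result["count"], "matching datasets, for example:")
for dataset in result["results"]:
    print("-", dataset["title"])
```

A few lines of code, and anyone can shop a government data catalog the way they shop an app store — that is commoditization.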
When data gets free or inexpensive (as a result of commoditization), the opportunity exists to unite people over data sets to make new discoveries and build new business models. Many companies choose Hadoop because it is cheap data storage. This entry point is the first step on the journey to the data operating system — a term that I heard three times during the past five days, notably from Doug Cutting, who brought the world Hadoop the elephant and the data operating system. This year’s Hadoop Summit starts today; it has brought together 3,000 people from 1,000 organizations.
The last part of the “commodity” definition is “an advantage or benefit.” Gartner analysts Mark Beyer and Donald Feinberg predicted several years ago:
By 2014, organizations which have deployed analytics to support new complex data types and large volumes of data in analytics will outperform their market peers by more than 20% in revenue, margins, penetration and retention.
According to my observations, it’s true. Is it true for you? If not, be patient — an elephant’s pregnancy is almost two years long.
P.S. Tom Siebel survived the elephant attack. He now runs a big data company, C3 Energy.
Follow Svetlana on Twitter @Sve_Sic
by Svetlana Sicular | May 22, 2014 | Comments Off
Increasing adoption of big data technologies brings about the big data dilemmas:
- Quality vs. quantity
- Truth vs. trust
- Correction vs. curation
- Ontology vs. anthology
Data profiling, cleansing or matching in Hadoop or elsewhere are all good, but they don’t resolve these dilemmas. My favorite semantic site Twinword pictures what a dilemma is.
You get the picture. New technologies promote sloppiness. People do stupid things because now they can.
Why store all data? — Because we can.
What’s in this data? — Who knows?
Remember the current state of big data analytics? — “It’s not just about finding the needle, but getting the hay in the stack.”
Big data technologies are developing fast. Silicon Valley is excited about new capabilities (which very few are using). In my mind, the best thing to do right now is to enable the vast and vague data sources that are commingling in the new and immature data stores, and are confined in the mature ones. Companies store more data than they can process or even fathom. My imagination fails at a quintillion rows (ask Cloudera). Instead, it paints a continuous loop: data enables analysis, and analytics boosts the value of data. How to do this? It is starting to dawn on the market — through information quality and information governance!
My subject today is just the information quality piece. It continues my previous blog post BYO Big Data Quality. (I explained the whole information loop in Big Data Analytics Will Drive the Visible Impact of the Nexus of Forces.)
Data liberation means more people accessing and changing data. Innovative information quality approaches — visualization, exception handling, data enrichment — are needed to transform raw data into a trusted source suitable for analysis. Some companies use crowdsourcing for data enrichment and validation. Social platforms provide a crowdsourced approach to cleaning up data and facilitate finding armies of workers with diverse backgrounds. Consequently, assuring the quality of the crowdsourcing itself is another new task.
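Here is a minimal sketch of what such an exception-handling pass could look like: suspect rows are routed to a review queue (human or crowdsourced) rather than silently “cleansed.” The field names and thresholds are invented for illustration:

```python
# A minimal sketch of exception handling for raw records: keep the
# clean rows, and queue suspect rows for (crowdsourced) review.
def triage(records):
    clean, exceptions = [], []
    for row in records:
        problems = []
        if not row.get("customer_id"):
            problems.append("missing customer_id")
        amount = row.get("amount")
        if amount is None or not (0 < amount < 1_000_000):
            problems.append("amount missing or out of range")
        if problems:
            exceptions.append((row, problems))
        else:
            clean.append(row)
    return clean, exceptions

rows = [{"customer_id": "c1", "amount": 42.0},
        {"customer_id": "", "amount": -5.0}]
clean, exceptions = triage(rows)
print(len(clean), "clean;", len(exceptions), "routed to review")
```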
Big data is a way to preserve the context that is missing in refined structured data stores — this means a balance between intentionally “dirty” data and data cleaned of unnecessary digital exhaust, sampling or no sampling. A capability to combine multiple data sources creates new expectations for consistent quality; for example, to accurately account for differences in granularity, velocity of changes, life span, perishability and dependencies of participating datasets. The convergence of social, mobile, cloud and big data technologies presents new requirements — getting the right information to the consumer quickly, ensuring reliability of external data you don’t have control over, validating the relationships among data elements, looking for data synergies and gaps, creating provenance of the data you provide to others, spotting skewed and biased data.
In reality, a data scientist’s job is 80% data quality engineer and just 20% researcher, dreamer and scientist. A data scientist spends an enormous amount of time on data curation and exploration to determine whether s/he can get value out of the data. The immediate practical answer: work with the dark data confined in relational data stores. Well, it’s structured; therefore, it is not really new. But at least you get enough untapped sources of reasonable quality, and you can extract enough value right away, while new technologies are being developed. Are they?
While Silicon Valley is excited about Drill, Spark and Shark, I am watching a nascent trend — big data quality and data-enabling. Coincidentally, I got two briefings last week: Peaxy (which I liked a lot for its strength in the Industrial Internet) and Viking-FS in Europe, with a product called K-Pax, mainly for the financial industry. A briefing with Paxata is on Friday, and a briefing with Trifacta is scheduled too. Earlier this month, I got a briefing from Waterline Data Science, a stealth startup with lots of cool ideas on enabling big data. Earlier, I had encounters with Data Tamer, Ataccama and Cambridge Semantics, among others. Finally, have you heard about G2 and sensemaking? Take a look at this intriguing video. All these solutions are very different. The only quality they have in common is immaturity. For now, you are on your own, but hold on — help is coming!
Follow Svetlana on Twitter @Sve_Sic
Category: "Data Scientist" analytics Big Data big data quality Crossing the Chasm data governance data paprazzi Humans Information Everywhere innovation Inquire Within Uncategorized Tags: big data, crossing the chasm, data, data janitor, data paprazzi, data scientist, data spy, Information Everywhere
by Svetlana Sicular | May 16, 2014 | 2 Comments
In the absence of best practices for big data quality, individual companies are coming up with their own solutions. Of course, these organizations have the problems first. Let’s look at an example from Paytronix, a cool company providing loyalty management for restaurant chains, including my favorite Panera Bread. Paytronix is converging social, mobile, cloud and big data for its business (aha! The Nexus of Forces!). And — by the way — cutting-edge technologies help a lot to attract top talent to the company. But first things first: Paytronix had a big data quality problem. Here is the description:
- Over a quarter of its clients — restaurants — do not ask customers for their age
- Where age is asked, 18% of customers leave it blank
- Of those who answer, approximately 10% are blatant liars
All of the above means that identifying families with kids is a huge challenge (spoiler: Paytronix successfully met the challenge). People with kids are younger. They tend to fill the restaurants earlier in the evening. The average check is higher when orders include a kids meal (I can confirm for our orders at Panera). That’s why restaurants often want to market to people with children: when they offer a kids meal coupon, they get 25% more redemptions. But! What customers say is different from what they do. (Aren’t we all customers?)
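As a minimal sketch of that triage — with invented field names and thresholds — stated age gets flagged rather than trusted, and the “families with kids” segment leans on behavior instead:

```python
# A minimal sketch: don't trust stated age; segment on behavior.
def age_is_plausible(age):
    # Blank ages and blatant liars both fail this check.
    return age is not None and 5 <= age <= 100

def likely_family_with_kids(guest):
    # Kids meals on the check and early dinner hours say more
    # about a household than a blank (or lying) age field.
    return guest["kids_meals_ordered"] > 0 or guest["avg_dinner_hour"] < 18

guests = [
    {"stated_age": None, "kids_meals_ordered": 3, "avg_dinner_hour": 17},
    {"stated_age": 142,  "kids_meals_ordered": 0, "avg_dinner_hour": 21},
]
for g in guests:
    print("age trusted:", age_is_plausible(g["stated_age"]),
          "| family-with-kids segment:", likely_family_with_kids(g))
```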
Big data quality is new and different: Traditional models do not work, familiar standards do not apply, typical metrics miss the mark. Most importantly, people’s mentality has to change when they assure the quality of big data. My colleague Martin Reynolds likes to cite, “most people are woefully muddled information processors who often stumble along ill-chosen shortcuts to reach bad conclusions.” This quote appeared in Newsweek in 1987, BC (before Cloudera, the first commercial Hadoop distribution vendor). That is, the problem is eternal, although it wasn’t so widespread in data management because there was not much data management in 1987. That issue of Newsweek with the quote still advertised typewriters, best in the world. Wikipedia gives a daunting list of cognitive biases — each bias is a big data quality factor, because quality applies to the resulting analysis, and to intermediate results, and to iterative data science. In the case of Paytronix, segmentation was biased. Biases also apply to data mashups: to evaluating granularity, trustworthiness and dependencies of participating data sets. And sometimes, biases matter even to the absence or presence of particular data sources. Martin Reynolds shared with me the most astonishing example of cognitive bias.
Paytronix solved its big data quality problem by deciding not to change how people think: they validated the data by delivering it as cubes in a familiar BI tool to good old humans. By the way, crowdsourcing is another excellent big data quality method that relies on people. But this is a subject for my next post — I will tell you what vendors are doing about big data quality, and maybe even about big data governance. As DBAs like to say, stay tuned.
Follow Svetlana on Twitter @Sve_Sic
by Svetlana Sicular | April 28, 2014 | 1 Comment
Big data finally reached Wall Street. Not for small science experiments, but seriously, in grand style, with exorbitant salaries, for production and DevOps. A couple of years ago, Silicon Valley companies were bragging to each other about hiring Wall Street quants as data scientists. Now Wall Street is happy to grab — whom? No, not data scientists (New York has plenty of its own quants) — data architects, the breed who can come up with new architectures to combine structured and unstructured data, where the architecture for one use case usually (still) does not apply to another.
Dice.com returned just 34 data scientist jobs vs. 645 data architect jobs in the New York area tonight. Even my search for “data architect + Hadoop” returned twice as many data architect jobs as data scientist jobs. Sorry, data scientists: data architects are sexy! My client inquiries have shifted lately to no-nonsense big data architecture, management and real-time use cases. Big data vendors are hinting left and right about “Wall Street customers in alpha,” i.e., newly signed contracts. My friend, a Wall Street recruiter, had to cut his vacation short – Wall Street is in a hurry to get data architects right now. And also performance engineers, and developers, and administrators. And did I say DevOps? Yes, for agility, my friends.
This hiring means that Wall Street is ready to use big data strategically. The picture below shows typical stages of big data adoption described in my research note The Road Map for Successful Big Data Adoption. The red dot on this picture — a stabilized infrastructure — is the most prominent milestone. After the infrastructure has been built, a capability to derive value from big data technologies leaps to a new level. Nonbelievers turn into believers.
At the red dot, big data becomes the new normal. It eventually gets related to other information sources. (When they use the term “big data,” analytics and information management professionals — across the globe! — first say that they don’t like it.) At the red dot, companies substantially expand the number of nodes or totally rebuild the earlier small layouts.
A widespread myth is that Hadoop is inexpensive to implement. Really? With Wall Street salaries? An initial implementation is usually more expensive than expected. It involves a lot of unanticipated technical and nontechnical difficulties. By the way, another myth is that big data infrastructure usually takes advantage of commodity hardware. Maybe, but not on Wall Street. Enterprises buy high-end hardware.
I gave my friend the Wall Street recruiter a t-shirt stating “My data is bigger than yours” — he wears it to work (on Fridays). I should make one for myself, with the text Data is what used to be big data. Wall Street is getting there.
Follow Svetlana on Twitter @Sve_Sic
Category: "Data Scientist" Big Data big data market Crossing the Chasm data Hadoop market analysis Uncategorized Tags: big data, big data adoption, crossing the chasm, data, data paprazzi, end users, hadoop, hiring, market analysis