Wednesday, March 27, 2019

Integrate Kafka with Spark for consuming streaming data



Learn how to integrate Kafka with Spark for consuming streaming data, and discover how to address your streaming analytics needs.
Kafka is a message broker system that facilitates the passing of messages between producers and consumers. Spark Structured Streaming, on the other hand, consumes static and streaming data from various sources (such as Kafka, Flume, and Twitter), processes and analyzes it with high-level algorithms such as those used in machine learning, and pushes the result out to an external storage system. The main advantage of Structured Streaming is that the result is updated incrementally and continuously as streaming data arrive.
Kafka has its own streams library (Kafka Streams), which is best suited to topic-to-topic transformations within Kafka, whereas Spark Streaming can be integrated with almost any type of system. For more detail, you can refer to this blog.
In this blog, I’ll cover an end-to-end integration of Kafka with Spark Structured Streaming, using Kafka as the source and Spark Structured Streaming as the sink.
Let’s create a Maven project and add the following dependencies to pom.xml.
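The original post’s dependency list isn’t reproduced here, so the fragment below is a sketch: the artifact versions are assumptions (Spark 2.4.0 on Scala 2.11 and Kafka clients 2.1.0 were current at the time of writing) and should be matched to your installation.

<dependencies>
    <!-- Spark SQL: the core of Structured Streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <!-- Kafka source/sink connector for Structured Streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <!-- Kafka client library, used by the producer -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.1.0</version>
    </dependency>
</dependencies>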

Now we will create a Kafka producer that produces messages and pushes them to the topic. The consumer will be a Spark Structured Streaming DataFrame.
First, set the properties for the Kafka producer.
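The original snippet isn’t shown here, so the following is a minimal Scala sketch; the broker address and the topic name "test-topic" are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Properties for the Kafka producer: broker address and string serializers
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Push a few test messages to the topic
for (i <- 1 to 10) {
  producer.send(new ProducerRecord[String, String]("test-topic", i.toString, s"message $i"))
}
producer.close()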




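On the consuming side, the Spark Structured Streaming DataFrame subscribes to the same topic. Again a sketch, with the same placeholder broker and topic names as above; the console sink is used only for a quick check of the continuously updated result.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KafkaStructuredStreaming")
  .master("local[*]")
  .getOrCreate()

// Subscribe to the Kafka topic as a streaming source
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .load()

// Kafka delivers key and value as binary, so cast them to strings
val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write each incremental result to the console as new data arrives
val query = messages.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()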
Read more at: https://morioh.com/p/33ee2699c283/integrating-kafka-with-spark-structured-streaming

Tuesday, February 5, 2019

Where is Artificial Intelligence headed next?

Where is Artificial Intelligence headed next? Deep learning has been at the forefront, but which technique will generate the next wave?


Almost everything you hear about artificial intelligence today is thanks to deep learning. This category of algorithms works by using statistics to find patterns in data, and it has proved immensely powerful in mimicking human skills such as our ability to see and hear. To a very narrow extent, it can even emulate our ability to reason. These capabilities power Google’s search, Facebook’s news feed, and Netflix’s recommendation engine—and are transforming industries like health care and education.


But though deep learning has singlehandedly thrust AI into the public eye, it represents just a small blip in the history of humanity’s quest to replicate our own intelligence. It’s been at the forefront of that effort for less than 10 years. When you zoom out on the whole history of the field, it’s easy to realize that it could soon be on its way out.

https://www.technologyreview.com/…/we-analyzed-16625-paper…/

Tuesday, December 4, 2018

Is your AI project a nonstarter?

Here’s a reality check(list) to help you avoid the pain of learning the hard way


If you’re about to embark on a machine learning or AI project, here’s a checklist to cover before you dive into algorithms, data, and engineering. Think of it as your friendly consultant-in-a-box.
Don’t waste your time on AI for AI’s sake. Be motivated by what it will do for you, not by how sci-fi it sounds.

Step 1 of ML/AI in 22 parts: Outputs, objectives, and feasibility

  1. Correct delegation: Does the person running your project and completing this checklist really understand your business? Delegate decision-making to the business-savvy person, not the garden-variety algorithms nerd.
  2. Output-focused ideation: Can you explain what your system’s outputs will be and why they’re worth having? Focus first on what you’re making, not how you’re making it; don’t confuse the end with the means.
  3. Source of inspiration: Have you at least considered data-mining as an approach for getting inspired about potential use cases? Though not mandatory, it can help you find a good direction.
  4. Appropriate task for ML/AI: Are you automating many decisions/labels, in situations where you can’t just look the answer up perfectly each time? Answering “no” is a fairly loud sign that ML/AI is not for you.
  5. UX perspective: Can you articulate who your intended users are? How will they use your outputs? You’ll suffer from shoddy design if you’re not thinking about your users early.
  6. Ethical development: Have you thought about all the humans your creation might impact? This is especially important for all technologies with the potential to scale rapidly.
  7. Reasonable expectations: Do you understand that your system might be excellent, but it will not be flawless? Can you live with the occasional mistake? Have you thought about what this means from an ethics standpoint?
  8. Possible in production: Regardless of where those decisions/labels come from, will you be able to serve them in production? Can you muster the engineering resources to do it at the scale you’re anticipating?
  9. Data to learn from: Do potentially useful inputs exist? Can you gain access to them? (It’s okay if the data don’t exist yet as long as you have a plan to get them soon.)
  10. Enough examples: Have you asked a statistician or machine learning engineer whether the amount of data you have is enough to learn from? Enough isn’t measured in bytes, so grab a coffee with someone whose intuition is well-trained and run it by them.
  11. Computers: Do you have access to enough processing power to handle your dataset size? (Cloud technologies make this an automatic yes for anyone who’s open to considering them.)
  12. Team: Are you confident you can assemble a team with the necessary skills?

Read more at:   https://hackernoon.com/ai-reality-checklist-be34e2fdab9




Posted by Jayne Merdith, Tendron Systems Ltd

Wednesday, November 28, 2018

Why AI and machine learning are driving data lakes to data hubs


Data lakes were built for big data and batch processing, but AI and machine learning models need more flow and third-party connections. Enter the data hub concept, which will likely pick up steam.

The data lake was a critical concept for companies looking to put information in one place and then tap it for business intelligence, analytics and big data. But the promise never quite played out. Enter the data hub concept, which is starting to become a rallying point for technology vendors as enterprises realize they have to connect to more than their own data to enable their algorithms.

Pure Storage last month outlined its data hub architecture in a bid to ditch data silos and enable more artificial intelligence, machine learning and cloud applications. On Oct. 9, MarkLogic, an enterprise NoSQL database provider, launched its Data Hub Service to offer better curated data for Internet of Things, AI and machine learning workloads. MarkLogic claimed that its Data Hub Service is actually "data lakes done right."
Meanwhile, SAP also has a data hub that's focused on moving data around. And you could argue that the $5.2 billion merger of Cloudera and Hortonworks will put the combined company on a path to be a broad enterprise platform that will eventually have data hub features.

Rest assured that "data hub" is a term you'll be hearing often from enterprise technology vendors. It may also be in the running for 2019's buzzword of the year.

So what's driving this data hub buzz? AI and machine learning workloads. Simply put, the data lake is a concept designed for big data: you can analyze the lake, but you may not find all the signals needed to learn over time.
Jeremy Barnes, chief architect of Element AI, said "the data lake is not dead from our perspective." But the data lake model "doesn't take into account AI and the ability to learn. It needs to adapt to something that enables intelligent systems to evolve," said Barnes.

Element AI's mission is to take research and turn it into products for businesses. Based in Montreal, Element AI leverages its own research as well as a network of academics to help clients develop their AI strategy.


Read more at:  https://www.zdnet.com/article/why-ai-machine-learning-is-driving-data-lakes-to-data-hubs/

-- -- -- -- -- -- -- -- -- --

Posted by Jayne Merdith


Thursday, November 8, 2018

The Chairman of Nokia on Ensuring Every Employee Has a Basic Understanding of Machine Learning — Including Him



I’ve long been both paranoid and optimistic about the promise and potential of artificial intelligence to disrupt — well, almost everything. Last year, I was struck by how fast machine learning was developing and I was concerned that both Nokia and I had been a little slow on the uptake. What could I do to educate myself and help the company along?

As chairman of Nokia, I was fortunate to be able to worm my way onto the calendars of several of the world’s top AI researchers. But I only understood bits and pieces of what they told me, and I became frustrated when some of my discussion partners seemed more intent on showing off their own advanced understanding of the topic than truly wanting me to get a handle on “how does it really work.”

I spent some time complaining. Then I realized that as a long-time CEO and Chairman, I had fallen into the trap of being defined by my role: I had grown accustomed to having things explained to me. Instead of trying to figure out the nuts and bolts of a seemingly complicated technology, I had gotten used to someone else doing the heavy lifting.

Why not study machine learning myself and then explain what I learned to others who were struggling with the same questions? That might help them and raise the profile of machine learning in Nokia at the same time.
Going back to school



Read more at: https://hbr.org/2018/10/the-chairman-of-nokia-on-ensuring-every-employee-has-a-basic-understanding-of-machine-learning-including-him



Posted by: Jayne Merdith, Tendron Systems Ltd

Thursday, November 1, 2018

Project Tycho 2.0: a repository to improve the integration and reuse of data for global population health

BACKGROUND AND SIGNIFICANCE

Decisions in global population health can affect the lives of millions of people and can change the future of entire communities. For example, the decision to declare an influenza pandemic and stockpile vaccines could save millions of lives if a pandemic of highly pathogenic influenza actually occurred, or could waste millions of dollars if the decision was based on a false alarm.1 Decision making in global health often happens under a high degree of uncertainty and with incomplete information.

New data are rapidly emerging from mobile technology, electronic health records, and remote sensing.2 These new data can expand opportunities for data-driven decision making in global health. In reality, multiple layers of challenges, ranging from technical to ethical barriers, can limit the effective (re)use of data in global health.3,4 For example, composing an epidemic model to inform decisions about vaccine stockpiling requires the integration of existing data from a wide range of data sources, such as a population census, disease surveillance, environmental monitoring, and research studies.5

Integrating data can be a daunting task, especially since global health data are often stored in domain-specific data silos that can each use different formats and content standards, i.e., they can be syntactically and semantically heterogeneous. The heterogeneity of data in global health can slow down scientific progress, as researchers have to spend much time on data discovery and curation.6
To improve access to standardized data in global health, the Project Tycho data repository was launched in 2013.7 The first version of Project Tycho (v1) comprised over a century of infectious disease surveillance data for the United States, published in weekly reports between 1888 and 2014.7

Read more at: https://doi.org/10.1093/jamia/ocy123

Willem G van Panhuis, Anne Cross, Donald S Burke
Journal of the American Medical Informatics Association, ocy123, https://doi.org/10.1093/jamia/ocy123
Published: 15 October 2018
 
 
 
 
 
Posted by:  Jayne Merdith, Tendron Systems Ltd, London, UK
 
 

Wednesday, October 24, 2018

Astronomers report success with machine deep learning




Machine learning continues its successes in astronomy.

Classifying galaxies. On April 23, 2018, astronomers at UC Santa Cruz reported using deep learning techniques to analyze images of galaxies, with the goal of understanding how galaxies form and evolve. The study has been accepted for publication in the peer-reviewed Astrophysical Journal and is available online. In the study, researchers used computer simulations of galaxy formation to train a deep learning algorithm, which then:
… proved surprisingly good at analyzing images of galaxies from the Hubble Space Telescope.
The researchers said they used output from their simulations to generate mock images of simulated galaxies as they would look in ordinary Hubble observations. The mock images were used to train the deep learning system to recognize three key phases of galaxy evolution. The researchers then gave their artificial neural network a large set of actual Hubble images to classify.
The results showed a remarkable level of consistency, the astronomers said, in the classifications of simulated and real galaxies. Joel Primack of UC Santa Cruz said:
We were not expecting it to be all that successful. I’m amazed at how powerful this is. We know the simulations have limitations, so we don’t want to make too strong a claim. But we don’t think this is just a lucky fluke.


Posted by: Jayne Merdith, Tendron Systems Ltd, London, UK.