Tuesday, December 4, 2018

Is your AI project a nonstarter?

Here’s a reality check(list) to help you avoid the pain of learning the hard way


If you’re about to dive into a machine learning or AI project, here’s a checklist to work through before you get into algorithms, data, and engineering. Think of it as your friendly consultant-in-a-box.
Don’t waste your time on AI for AI’s sake. Be motivated by what it will do for you, not by how sci-fi it sounds.

Step 1 of ML/AI in 22 parts: Outputs, objectives, and feasibility

  1. Correct delegation: Does the person running your project and completing this checklist really understand your business? Delegate decision-making to the business-savvy person, not the garden-variety algorithms nerd.
  2. Output-focused ideation: Can you explain what your system’s outputs will be and why they’re worth having? Focus first on what you’re making, not how you’re making it; don’t confuse the end with the means.
  3. Source of inspiration: Have you at least considered data-mining as an approach for getting inspired about potential use cases? Though not mandatory, it can help you find a good direction.
  4. Appropriate task for ML/AI: Are you automating many decisions/labels, where you can’t just look the answer up perfectly each time? Answering “no” is a fairly loud sign that ML/AI is not for you.
  5. UX perspective: Can you articulate who your intended users are? How will they use your outputs? You’ll suffer from shoddy design if you’re not thinking about your users early.
  6. Ethical development: Have you thought about all the humans your creation might impact? This is especially important for all technologies with the potential to scale rapidly.
  7. Reasonable expectations: Do you understand that your system might be excellent, but it will not be flawless? Can you live with the occasional mistake? Have you thought about what this means from an ethics standpoint?
  8. Possible in production: Regardless of where those decisions/labels come from, will you be able to serve them in production? Can you muster the engineering resources to do it at the scale you’re anticipating?
  9. Data to learn from: Do potentially useful inputs exist? Can you gain access to them? (It’s okay if the data don’t exist yet as long as you have a plan to get them soon.)
  10. Enough examples: Have you asked a statistician or machine learning engineer whether the amount of data you have is enough to learn from? Enough isn’t measured in bytes, so grab a coffee with someone whose intuition is well-trained and run it by them.
  11. Computers: Do you have access to enough processing power to handle your dataset size? (Cloud technologies make this an automatic yes for anyone who’s open to considering using them.)
  12. Team: Are you confident you can assemble a team with the necessary skills?

Read more at:   https://hackernoon.com/ai-reality-checklist-be34e2fdab9




Posted by Jayne Merdith, Tendron Systems Ltd

Wednesday, November 28, 2018

Why AI and machine learning are driving data lakes to data hubs


Data lakes were built for big data and batch processing, but AI and machine learning models need more flow and third party connections. Enter the data hub concept that'll likely pick up steam.

The data lake was a critical concept for companies looking to put information in one place and then tap it for business intelligence, analytics and big data. But the promise never quite played out. Enter the data hub concept, which is starting to become a rallying point for technology vendors as enterprises realize they have to connect to more than their own data to enable their algorithms.

Pure Storage last month outlined its data hub architecture in a bid to ditch data silos and enable more artificial intelligence, machine learning and cloud applications. On Oct. 9, MarkLogic, an enterprise NoSQL database provider, launched its Data Hub Service to offer better curated data for Internet of Things, AI and machine learning workloads. MarkLogic claimed that its Data Hub Service is actually "data lakes done right."
Meanwhile, SAP also has a data hub that's focused on moving data around. And you could argue that the $5.2 billion merger of Cloudera and Hortonworks will put the combined company on a path to be a broad enterprise platform that will eventually have data hub features.

Rest assured that the term "data hub" is going to be mentioned frequently by enterprise technology vendors, and it may well be in the running for 2019's buzzword of the year.

So what's driving this data hub buzz? AI and machine learning workloads. Simply put, the data lake is more like a concept designed for big data. You can analyze the lake, but you may not find all the signals needed to learn over time.
Jeremy Barnes, chief architect of ElementAI, said "the data lake is not dead from our perspective." But the data lake model "doesn't take into account AI and the ability to learn. It needs to adapt to something that enables intelligence systems to evolve," said Barnes.

ElementAI's mission is to take research and turn it into a product for businesses. Based in Montreal, Element AI leverages its own research as well as a network of academics to help clients develop their AI strategy.


Read more at:  https://www.zdnet.com/article/why-ai-machine-learning-is-driving-data-lakes-to-data-hubs/

-- -- -- -- -- -- -- -- -- --

Posted by Jayne Merdith


Thursday, November 8, 2018

The Chairman of Nokia on Ensuring Every Employee Has a Basic Understanding of Machine Learning — Including Him



I’ve long been both paranoid and optimistic about the promise and potential of artificial intelligence to disrupt — well, almost everything. Last year, I was struck by how fast machine learning was developing and I was concerned that both Nokia and I had been a little slow on the uptake. What could I do to educate myself and help the company along?

As chairman of Nokia, I was fortunate to be able to worm my way onto the calendars of several of the world’s top AI researchers. But I only understood bits and pieces of what they told me, and I became frustrated when some of my discussion partners seemed more intent on showing off their own advanced understanding of the topic than truly wanting me to get a handle on “how does it really work.”

I spent some time complaining. Then I realized that as a long-time CEO and Chairman, I had fallen into the trap of being defined by my role: I had grown accustomed to having things explained to me. Instead of trying to figure out the nuts and bolts of a seemingly complicated technology, I had gotten used to someone else doing the heavy lifting.

Why not study machine learning myself and then explain what I learned to others who were struggling with the same questions? That might help them and raise the profile of machine learning in Nokia at the same time.
Going back to school



Read More at:    https://hbr.org/2018/10/the-chairman-of-nokia-on-ensuring-every-employee-has-a-basic-understanding-of-machine-learning-including-him



Posted by: Jayne Merdith, Tendron Systems Ltd

Thursday, November 1, 2018

Project Tycho 2.0: a repository to improve the integration and reuse of data for global population health

BACKGROUND AND SIGNIFICANCE

Decisions in global population health can affect the lives of millions of people and can change the future of entire communities. For example, the decision to declare an influenza pandemic and stockpile vaccines can save millions of lives if a pandemic of highly pathogenic influenza actually occurs, or could waste millions of dollars if the decision was based on a false alarm.1 Decision making in global health often occurs under a high degree of uncertainty and with incomplete information.

New data are rapidly emerging from mobile technology, electronic health records, and remote sensing.2 These new data can expand opportunities for data-driven decision making in global health. In reality, multiple layers of challenges, ranging from technical to ethical barriers, can limit the effective (re)use of data in global health.3,4 For example, composing an epidemic model to inform decisions about vaccine stockpiling requires the integration of existing data from a wide range of data sources, such as a population census, disease surveillance, environmental monitoring, and research studies.5

Integrating data can be a daunting task, especially since global health data are often stored in domain-specific data siloes that can each use different formats and content standards, ie, they can be syntactically and semantically heterogeneous. The heterogeneity of data in global health can slow down scientific progress, as researchers have to spend much time on data discovery and curation.6
To improve access to standardized data in global health, the Project Tycho data repository was launched in 2013.7 The first version of Project Tycho (v1) comprised over a century of infectious disease surveillance data for the United States that had been published in weekly reports between 1888 and 2014.7

Read More at:     https://doi.org/10.1093/jamia/ocy123 

Willem G van Panhuis, Anne Cross, Donald S Burke
Journal of the American Medical Informatics Association, ocy123, https://doi.org/10.1093/jamia/ocy123
Published: 15 October 2018
 
 
 
 
 
Posted by:  Jayne Merdith, Tendron Systems Ltd, London, UK
 
 

Wednesday, October 24, 2018

Astronomers report success with machine deep learning




Machine learning continues its successes in astronomy.

Classifying galaxies. On April 23, 2018, astronomers at UC Santa Cruz reported using machine deep learning techniques to analyze images of galaxies, with the goal of understanding how galaxies form and evolve. This new study has been accepted for publication in the peer-reviewed Astrophysical Journal and is available online. In the study, researchers used computer simulations of galaxy formation to train a deep learning algorithm, which then:
… proved surprisingly good at analyzing images of galaxies from the Hubble Space Telescope.
The researchers said they used output from their simulations to generate mock images of simulated galaxies as they would look in ordinary Hubble observations. The mock images were used to train the deep learning system to recognize three key phases of galaxy evolution. The researchers then gave their artificial neural network a large set of actual Hubble images to classify.
The results showed a remarkable level of consistency, the astronomers said, in the classifications of simulated and real galaxies. Joel Primack of UC Santa Cruz said:
We were not expecting it to be all that successful. I’m amazed at how powerful this is. We know the simulations have limitations, so we don’t want to make too strong a claim. But we don’t think this is just a lucky fluke.
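
For readers who want a feel for the workflow described above, here is a minimal sketch of that kind of pipeline: train a small convolutional network on labelled mock images rendered from simulations, then run it on real telescope images. This is not the UC Santa Cruz team's code; the directory names, image size and three-class setup are assumptions for illustration.

```python
# Sketch only: a small CNN trained on simulated galaxy images, then used to
# classify real observations. Paths, image size and class count are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = (64, 64)
NUM_PHASES = 3  # three key phases of galaxy evolution

# Labelled mock images rendered from simulations (hypothetical directory layout).
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "mock_galaxies/", image_size=IMG_SIZE, batch_size=32)

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_PHASES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=10)

# Classify real Hubble images with the simulation-trained network.
hubble_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "hubble_images/", labels=None, image_size=IMG_SIZE, batch_size=32)
predictions = model.predict(hubble_ds)
```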


Jane Merdith, Tendron Systems Ltd, London, UK.

Sunday, October 21, 2018

DeepMind Open-Sources Reinforcement Learning Library TRFL






DeepMind announced today that it is open-sourcing its TRFL (pronounced ‘truffle’) library, which contains a variety of building blocks useful for developing reinforcement learning (RL) agents in TensorFlow. Created by the DeepMind Research Engineering team, TRFL is a collection of major algorithmic components that DeepMind has used internally for many of their successful agents, including DQN, DDPG and the Importance Weighted Actor Learner Architecture.

Deep reinforcement learning agents are usually composed of a large number of components that can interact in subtle ways, making it difficult for researchers to identify flaws within the large computational graphs. One way DeepMind has addressed this is by open-sourcing complete agent implementations, such as its scalable distributed implementation of the V-trace agent. These large agent codebases make a considerable contribution to reproducing research, but they are hard to modify. Hence, a complementary approach is needed: reliable, well-tested implementations of the building blocks that can be reused across many different RL agents.

The TRFL library contains functions for implementing both classical RL algorithms and more advanced techniques. Rather than complete agents, these are building blocks that can be combined when assembling a fully functional RL agent.
As the TRFL library is still broadly used within DeepMind, the team will continue to maintain it and add new functionality over time.
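
To give a flavour of those building blocks, the sketch below follows the Q-learning loss example from the library's README at release (TensorFlow 1.x style); exact names and signatures should be checked against the current documentation.

```python
# Sketch based on the TRFL README (TensorFlow 1.x style); verify against current docs.
import tensorflow as tf
import trfl

# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
q_tm1 = tf.get_variable(
    "q_tm1", initializer=[[1., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
q_t = tf.get_variable(
    "q_t", initializer=[[0., 1., 0.], [1., 2., 0.]], dtype=tf.float32)

# Action indices, discounts and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
pcont_t = tf.constant([0, 1], dtype=tf.float32)  # discount; 0 marks end of episode
r_t = tf.constant([1, 1], dtype=tf.float32)

# The Q-learning building block returns a per-example loss plus auxiliary
# data (TD targets and TD errors).
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

# The loss can be reduced and minimised with any TensorFlow optimiser.
reduced_loss = tf.reduce_mean(loss)
train_op = tf.train.AdamOptimizer(learning_rate=0.1).minimize(reduced_loss)
```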




Posted by:   Jayne Merdith, Tendron Systems Ltd, London, UK

Wednesday, October 10, 2018

The rise of machine learning in astronomy




When mapping the universe, it pays to have some smart programming. Experts share how machine learning is changing the future of astronomy.
Astronomy is one of the oldest sciences and the first science to incorporate maths and geometry. It sits at the centre of humankind's search for its place in the universe.
As we delve deeper into the space surrounding our planet, the tools we use become more complex. Astronomers have come a long way from tracking the night sky with the naked eye or cataloguing the stars with a pen and paper.
Modern astronomers use advanced computer programming techniques in their work—from programming satellites to teaching computers to analyse data like a researcher.
So what do astronomers do with their computers?
Mo' data, mo' problems
Big data is a big problem in astronomy. The next generation of radio and optical telescopes will be able to map huge chunks of the night sky. The Square Kilometre Array (SKA) will push data processing to its limits.
Built in two phases, the SKA will have over 2000 radio dishes and 2 million low-frequency antennas once finished. Combined, these antennas will produce over an exabyte of data each day, more than the entire world's daily internet traffic. That raw stream is then processed down to a manageable size, so the data astronomers actually work with is far smaller.
Dr. Aidan Hotan, project scientist for the Australian SKA Pathfinder, explains.
"Data from a radio telescope array is very much like the flow of water through an ecosystem. The individual antennas each produce data, which is then transmitted over some distance and combined with other antennas in various stages—like smaller tributaries combining into a larger river," says Aidan.
"The largest data rate you can consider is the total raw output from each individual antenna, but in reality, we reduce that total rate to more manageable numbers as we flow through the system. We can combine the signals in ways that retain only the information we want or can make use of."



Jane Merdith, Tendron Systems Ltd, London, UK.

Tuesday, October 9, 2018

Using pandas with large data


Tips for reducing memory usage by up to 90%

When working in pandas with small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer and can cause code to fail entirely due to insufficient memory.

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas, rather than switching to a different tool.

In this post, we'll learn about memory usage in pandas and how to reduce a dataframe's memory footprint by almost 90% simply by selecting the appropriate data types for columns.

Working with baseball game logs

We'll be working with data from 130 years of major league baseball games, originally sourced from Retrosheet.

Originally the data was in 127 separate CSV files; however, we used csvkit to merge the files and added column names in the first row. If you'd like to download our version of the data to follow along with this post, we have made it available here.

Let's start by importing our data and taking a look at the first five rows.
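
Before following the link, here is a minimal sketch of the dtype-based approach the post describes. The filename is a placeholder for the merged game-log CSV, and the exact savings depend on the data, but the pattern (downcast numeric columns, convert low-cardinality strings to the category dtype) is what delivers the large reductions.

```python
# Sketch of the dtype-based optimisation described in the post.
# "game_logs.csv" is a placeholder name for the merged Retrosheet file.
import pandas as pd

df = pd.read_csv("game_logs.csv")
print(df.head())
print(f"before: {df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB")

optimized = df.copy()

# Downcast numeric columns to the smallest subtype that holds their values.
int_cols = optimized.select_dtypes(include=["int64"]).columns
optimized[int_cols] = optimized[int_cols].apply(pd.to_numeric, downcast="integer")
float_cols = optimized.select_dtypes(include=["float64"]).columns
optimized[float_cols] = optimized[float_cols].apply(pd.to_numeric, downcast="float")

# Convert low-cardinality string columns (team codes, park IDs, etc.)
# to the category dtype, which stores each distinct value only once.
for col in optimized.select_dtypes(include=["object"]).columns:
    if optimized[col].nunique() / len(optimized) < 0.5:
        optimized[col] = optimized[col].astype("category")

print(f"after:  {optimized.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB")
```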


Read More at:   https://www.dataquest.io/blog/pandas-big-data/

Jane Merdith, Tendron Systems Ltd, London, UK.

Tuesday, September 25, 2018

Why building your own deep learning PC is 10x cheaper than Amazon

If you’ve used, or are considering, AWS/Azure/GCloud for Machine Learning, you know how crazy expensive GPU time is. And turning machines on and off is a major disruption to your workflow. There’s a better way. Just build your own Deep Learning Computer. It’s 10x cheaper and also easier to use.

Building an expandable Deep Learning Computer w/ 1 top-end GPU only costs $3K of computer parts before tax. You’ll be able to drop the price to about $2k by using cheaper components, which is covered in the next post.
Building is 10x cheaper than renting on AWS / EC2 and is just as performant

Assuming your 1 GPU machine depreciates to $0 in 3 years (very conservative), the chart below shows that if you use it for up to 1 year, it’ll be 10x cheaper, including costs for electricity. Amazon discounts pricing if you have a multi-year contract, so the advantage is 4–6x for multi-year contracts. If you are shelling out tens of thousands of dollars for a multi-year contract, you should seriously consider building at 4–6x less money. The math gets more favorable for the 4 GPU version at 21x cheaper within 1 year!
Cost comparisons for building your own computer versus renting from AWS. 1 GPU builds are 4–10x cheaper and 4 GPU builds are 9–21x cheaper, depending on how long you use the computer. AWS pricing includes discounts for full-year and 3-year leases (35%, 60%). Electricity is assumed at $0.20/kWh, with the 1 GPU machine drawing 1 kW and the 4 GPU machine drawing 2 kW. Depreciation is conservatively estimated as linear, with full depletion in 3 years. Additional GPUs at $700 each, before tax.
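
To make the arithmetic behind those ratios concrete, here is a rough back-of-the-envelope calculation based on the assumptions stated above; the utilisation figure is an assumption for illustration, and the ratio scales with how heavily the machine is used.

```python
# Back-of-the-envelope version of the comparison above, using the stated
# assumptions. The ~10x result corresponds to near-continuous use for one year.
BUILD_COST = 3000.0        # dollars, 1 GPU machine before tax
DEPRECIATION_YEARS = 3     # linear depreciation, fully depleted after 3 years
CLOUD_RATE = 3.0           # dollars per hour for a 1 GPU cloud instance
POWER_KW = 1.0             # approximate draw of the 1 GPU machine
ELECTRICITY_RATE = 0.20    # dollars per kWh

def own_cost_one_year(gpu_hours):
    """One year of linear depreciation plus electricity for the hours used."""
    return BUILD_COST / DEPRECIATION_YEARS + gpu_hours * POWER_KW * ELECTRICITY_RATE

def cloud_cost(gpu_hours):
    """On-demand cloud cost for the same number of GPU-hours."""
    return gpu_hours * CLOUD_RATE

hours = 24 * 365  # near-continuous use for a year (assumption)
own, rent = own_cost_one_year(hours), cloud_cost(hours)
print(f"own:   ${own:,.0f}")        # roughly $2,750
print(f"cloud: ${rent:,.0f}")       # roughly $26,000 on demand
print(f"ratio: {rent / own:.1f}x")  # about 9-10x
```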

There are some drawbacks: download speeds to your machine are slower because it isn’t on a data-center backbone, a static IP is needed to access it away from your house, and you may want to refresh the GPUs in a couple of years. But the cost savings are so ridiculous that it’s still worth it.

If you’re thinking of using the 2080 Ti for your Deep Learning Computer, it’s $500 more and still 4-9x cheaper for a 1 GPU machine.
Cloud GPU machines are expensive at $3 / hour and you have to pay even when you’re not using the machine.

The reason for this dramatic cost discrepancy is that Amazon Web Services EC2 (or Google Cloud or Microsoft Azure) is expensive for GPUs at $3 / hour or about $2100 / month. At Stanford, I used it for my Semantic Segmentation project and my bill was $1,000. I’ve also tried Google Cloud for a project and my bill was $1,800. This is with me carefully monitoring usage and turning off machines when not in use — major pain in the butt!

Even when you shut your machine down, you still have to pay storage for the machine at $0.10 per GB per month, so I got charged a hundred dollars / month just to keep my data around.

The machine I built costs $3k and has the parts shown below. There’s one 1080 Ti GPU to start (you can just as easily use the new 2080 Ti for Machine Learning at $500 more — just be careful to get one with a blower fan design), a 12 Core CPU, 64GB RAM, and 1TB M.2 SSD. You can add three more GPUs easily for a total of four.

Read More at:   https://medium.com/the-mission/why-building-your-own-deep-learning-computer-is-10x-cheaper-than-aws-b1c91b55ce8c

Sunday, September 16, 2018

Machine Learning with Decision Trees


Introduction

This article shows you how to get started with machine learning by applying decision trees in Python to an established dataset. The code used in this article is available on GitHub. A popular library for creating decision trees is scikit-learn, and with it you can get your first machine learning model running with just a few lines of code. In subsequent articles you will apply the SparkML library for machine learning.

Decision trees have influenced the development of machine learning algorithms, including Classification and Regression Tree (CART) models. Their divide-and-conquer approach has attracted many practitioners, who have used them successfully.
A tree-like model of decisions is built that can be presented visually and saved to file, either as an image or in pseudo-code form.
A decision tree is drawn like an upside-down tree. We start from the root node, then split the nodes at each level until we reach leaf nodes, which represent outcomes or decisions. At each internal node a decision is taken that leads to further nodes.

Model From Iris Data

Figure 1 shows a decision tree for the famous Iris dataset. This dataset is available for download from the UCI website, which lists hundreds of datasets for machine learning applications.
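
For readers who want to try this immediately, here is a minimal sketch of the kind of model described above, using scikit-learn's bundled copy of the Iris dataset. It is an illustration of the approach rather than the article's exact code.

```python
# Minimal decision-tree example on the Iris dataset (sketch, not the article's code).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(f"test accuracy: {clf.score(X_test, y_test):.2f}")

# The fitted tree can also be exported in pseudo-code form, as noted above.
print(export_text(clf, feature_names=list(iris.feature_names)))
```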

read more at:   https://dzone.com/articles/machine-learning-with-decision-trees-1

Posted by Alan, Tendron Systems Ltd