Wednesday, October 24, 2018

Astronomers report success with deep learning

Machine learning continues its run of successes in astronomy.

Classifying galaxies. On April 23, 2018, astronomers at UC Santa Cruz reported using deep learning techniques to analyze images of galaxies, with the goal of understanding how galaxies form and evolve. The study has been accepted for publication in the peer-reviewed Astrophysical Journal and is available online. In the study, researchers used computer simulations of galaxy formation to train a deep learning algorithm, which then:
… proved surprisingly good at analyzing images of galaxies from the Hubble Space Telescope.
The researchers said they used output from their simulations to generate mock images of simulated galaxies as they would look in ordinary Hubble observations. The mock images were used to train the deep learning system to recognize three key phases of galaxy evolution. The researchers then gave their artificial neural network a large set of actual Hubble images to classify.
The results showed a remarkable level of consistency, the astronomers said, in the classifications of simulated and real galaxies. Joel Primack of UC Santa Cruz said:
We were not expecting it to be all that successful. I’m amazed at how powerful this is. We know the simulations have limitations, so we don’t want to make too strong a claim. But we don’t think this is just a lucky fluke.
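To make the training setup concrete, here is a minimal, hypothetical sketch in TensorFlow/Keras. The paper's actual network architecture, image sizes, and label definitions are not given here, so every name and number below is illustrative.

    # Hypothetical sketch only: not the study's actual architecture or pipeline.
    # Train a small CNN on mock galaxy images labelled with three evolutionary
    # phases, then apply it to real Hubble images.
    import tensorflow as tf

    NUM_PHASES = 3  # the three key phases of galaxy evolution

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_PHASES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # mock_images: simulated galaxies rendered as they would appear to Hubble;
    # phase_labels: integer phase (0-2) taken from the simulation timeline.
    # model.fit(mock_images, phase_labels, epochs=10, validation_split=0.1)
    # predictions = model.predict(real_hubble_images)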


Jane Merdith, Tendron Systems Ltd, London, UK.

Sunday, October 21, 2018

DeepMind Open-Sources Reinforcement Learning Library TRFL

DeepMind announced today that it is open-sourcing its TRFL (pronounced ‘truffle’) library, which contains a variety of building blocks useful for developing reinforcement learning (RL) agents in TensorFlow. Created by the DeepMind Research Engineering team, TRFL is a collection of major algorithmic components that DeepMind has used internally for many of their successful agents, including DQN, DDPG and the Importance Weighted Actor Learner Architecture.

Deep reinforcement learning agents are usually composed of a large number of components that interact in subtle ways, making it difficult for researchers to identify flaws within their large computational graphs. One way DeepMind has previously addressed this is by open-sourcing complete agents, such as its scalable distributed implementation of the V-trace agent. Such large agent codebases make a considerable contribution to reproducing research, but they are hard to modify and extend. Hence, a complementary approach is needed: providing reliable, well-tested implementations of building blocks that can be reused across many different RL agents.

The TRFL library contains functions implementing both classical RL algorithms and more cutting-edge techniques. These are not complete algorithms but implementations of RL-specific mathematical operations, such as loss functions, which can be combined when building a fully-functional RL agent.
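As an illustration of that building-block style, here is a minimal sketch of computing a Q-learning loss with TRFL, written in the TensorFlow 1.x idiom the library targets; the tensor shapes and optimiser below are our own illustrative choices, not DeepMind's code.

    import tensorflow as tf
    import trfl

    batch_size, num_actions = 32, 4

    # Q-values for the current and next timestep, as produced by some network.
    q_tm1 = tf.placeholder(tf.float32, [batch_size, num_actions])
    q_t = tf.placeholder(tf.float32, [batch_size, num_actions])
    a_tm1 = tf.placeholder(tf.int32, [batch_size])      # actions taken
    r_t = tf.placeholder(tf.float32, [batch_size])      # rewards received
    pcont_t = tf.placeholder(tf.float32, [batch_size])  # discount; 0 at episode end

    # trfl.qlearning returns a namedtuple: .loss is the TD loss to minimise,
    # .extra carries intermediate quantities such as the TD error.
    loss, extra = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(tf.reduce_mean(loss))

Because each such function is an ordinary TensorFlow op, it can be dropped into a larger agent graph and unit-tested in isolation.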
As the TRFL library is still used extensively within DeepMind, the team will continue to maintain it and add new functionality over time.




Jane Merdith, Tendron Systems Ltd, London, UK.

Wednesday, October 10, 2018

The rise of machine learning in astronomy


When mapping the universe, it pays to have some smart programming. Experts share how machine learning is changing the future of astronomy.
Astronomy is one of the oldest sciences and the first science to incorporate maths and geometry. It sits at the centre of humankind's search for its place in the universe.
As we delve deeper into the space surrounding our planet, the tools we use become more complex. Astronomers have come a long way from tracking the night sky with the naked eye or cataloguing the stars with a pen and paper.
Modern astronomers use advanced computer programming techniques in their work—from programming satellites to teaching computers to analyse data like a researcher.
So what do astronomers do with their computers?
Mo' data, mo' problems
Big data is a big problem in astronomy. The next generation of radio and optical telescopes will be able to map huge chunks of the night sky. The Square Kilometre Array (SKA) will push data processing to its limits.
Built in two phases, the SKA will have over 2,000 radio dishes and 2 million low-frequency antennas once finished. Combined, these antennas will produce over an exabyte of data each day—more than the world's daily internet traffic. That raw output is then processed into a more manageable form, so the data volume astronomers actually work with is far smaller.
Dr Aidan Hotan, project scientist for the Australian SKA Pathfinder, explains.
"Data from a radio telescope array is very much like the flow of water through an ecosystem. The individual antennas each produce data, which is then transmitted over some distance and combined with other antennas in various stages—like smaller tributaries combining into a larger river," says Aidan.
"The largest data rate you can consider is the total raw output from each individual antenna, but in reality, we reduce that total rate to more manageable numbers as we flow through the system. We can combine the signals in ways that retain only the information we want or can make use of."



Jane Merdith, Tendron Systems Ltd, London, UK.

Tuesday, October 9, 2018

Using pandas with large data
Tips for reducing memory usage by up to 90%

When working with pandas on small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer and can cause code to fail entirely due to insufficient memory.

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high-quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas rather than switching to a different tool.

In this post, we'll learn about memory usage with pandas and how to reduce a dataframe's memory footprint by almost 90%, simply by selecting the appropriate data types for its columns.
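As a preview of the core technique (a hedged sketch; the reduce_memory helper below is our own illustration, not code from the post), the idea is to downcast numeric columns to the smallest type that fits and to convert low-cardinality object columns to the category dtype:

    import pandas as pd

    def reduce_memory(df):
        # Return a copy of df with smaller dtypes where that is safe.
        result = df.copy()
        # Downcast integers and floats to the smallest type that holds them.
        for col in result.select_dtypes(include=["int"]).columns:
            result[col] = pd.to_numeric(result[col], downcast="integer")
        for col in result.select_dtypes(include=["float"]).columns:
            result[col] = pd.to_numeric(result[col], downcast="float")
        # The category dtype pays off when unique values are few relative to rows.
        for col in result.select_dtypes(include=["object"]).columns:
            if result[col].nunique() / len(result) < 0.5:
                result[col] = result[col].astype("category")
        return result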

Working with baseball game logs

We'll be working with data from 130 years of major league baseball games, originally sourced from Retrosheet.

The data was originally spread across 127 separate CSV files; however, we used csvkit to merge the files and added column names in the first row. If you'd like to download our version of the data to follow along with this post, we have made it available here.

Let's start by importing our data and taking a look at the first five rows.
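Assuming the merged file has been saved locally as game_logs.csv (the filename is our assumption; the actual download may be named differently), that step might look like:

    import pandas as pd

    gl = pd.read_csv("game_logs.csv")
    print(gl.head())

    # deep=True makes pandas measure the true memory cost of object columns.
    gl.info(memory_usage="deep")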


Read more at: https://www.dataquest.io/blog/pandas-big-data/

Jane Merdith, Tendron Systems Ltd, London, UK.