Big data, big deal? An introduction to big data at Green Templeton Outreach Talks

The second talk in 2019’s Green Templeton Outreach Talk series, ‘Big data, big deal?’, was delivered by Rosemary Walmsley, PhD student at the Big Data Institute.

Big Data, Big Deal? Outreach Talk by Rosemary Walmsley at Green Templeton College on 31 May 2019

Green Templeton’s Carlos Outeiral (DPhil in Statistics), who organised the series, reviews the evening:

On Friday 31 May, Green Templeton College held its second Outreach Talk, a thought-provoking introduction to big data led by Rosemary Walmsley, PhD student at the Big Data Institute of the University of Oxford. The lecture set out the foundations of machine learning in a broad overview, with the particular aim of building awareness and starting a critical discussion.

Rosemary completed an MMathPhil at Oxford before starting her doctorate, in which she analyses biometric data obtained from fitness trackers. As both a trained mathematician and a relative newcomer to big data, she is in an ideal position to deliver a gentle introduction to the topic. Although she claims not to be a seasoned expert in this field, she rightly points out that “it is too important a topic not to talk about it”.

The concept of big data can be traced back to the beginning of the century, with the “three Vs” model described by Doug Laney in 2001: big data is characterised by volume, variety and velocity. For example, Facebook has access to a lot of data (volume) in the form of images, written text and even clicks (variety), which is generated very quickly all around the world (velocity). In effect, every time you use your smartphone, you generate a cascade of information that is mined by technology companies.

Big data is closely linked to another popular buzzword: machine learning. Machine learning is a paradigm of computer science in which, instead of designing an explicit procedure to solve a problem, we employ an algorithm that improves with experience. This is very useful in cases where we cannot spell out such a procedure – for example, it is far from trivial to describe how to recognise handwritten digits. There are three main classes of machine learning algorithms, each of which can be understood in simple terms.

The first paradigm is supervised learning, the most common and perhaps the most successful. In this approach, an algorithm is fed with a large amount of labelled data. For example, if the aim is to classify pictures as showing cats or dogs, the training set will be a collection of pictures with a “cat” or “dog” label. The hope is that the algorithm will pick up the features that make a picture “cat-containing” and use them to classify further images.
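To make the idea concrete, here is a minimal sketch of supervised learning. It is not taken from the talk: it assumes each picture has already been reduced to two made-up numbers (how pointy the ears are and how long the snout is), and it uses a deliberately simple nearest-centroid rule in place of the far richer models used in practice. The function names train and predict are illustrative.

```python
# Toy supervised learning: a nearest-centroid classifier.
# The two features per "picture" (ear pointiness, snout length) and all
# numbers are made up for illustration; real systems learn from raw pixels.

def train(examples):
    """Average the feature vectors of each label to get one centroid per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Labelled training set: every example comes with the right answer.
training_set = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
]
centroids = train(training_set)

# Classify two pictures the algorithm has not seen before.
print(predict(centroids, (0.85, 0.25)))
print(predict(centroids, (0.25, 0.7)))
```

The labels do all the work here: the algorithm only summarises what the labelled examples of each class look like and then assigns new pictures to the closer summary.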

The second is unsupervised learning, where the data is not labelled and the algorithm has to figure out the structure by itself. In the case of the previous example, the computer would have to weigh which features of an image allow pictures to be separated into two classes, and attempt to use these features to label the data. Unsupervised learning is often less successful than supervised learning – which does not mean it is unsuccessful, by any means.
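As a rough illustration of learning without labels, the sketch below groups points into two clusters with a basic k-means loop. It is an assumption-laden toy rather than anything shown in the talk: the two features per picture are invented, the function name two_means is hypothetical, and the starting guess (first and last point) is chosen only to keep the example deterministic.

```python
# Toy unsupervised learning: k-means with k=2 on unlabelled points.
# The feature values are made up; no "cat" or "dog" label is ever supplied.

def two_means(points, iterations=10):
    """Split unlabelled points into two groups with a basic k-means loop."""
    centres = [points[0], points[-1]]  # naive starting guess: first and last point
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            # Assign each point to the nearer of the two current centres.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            groups[distances.index(min(distances))].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centre
            for group, centre in zip(groups, centres)
        ]
    return groups

pictures = [(0.9, 0.2), (0.3, 0.8), (0.8, 0.3), (0.2, 0.9)]
group_a, group_b = two_means(pictures)
print(group_a)
print(group_b)
```

The algorithm finds two groups, but it has no way of knowing which one should be called "cat" and which "dog"; naming the groups still takes a human.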

The third approach is reinforcement learning, which has had important successes in recent years. In contrast to the previous methods, we do not provide the computer with labels, but rather with a reward: the computer is rewarded when it classifies the image correctly and penalised otherwise. The most popular example is Google DeepMind’s AlphaGo, the program that defeated the best human player of Go, Ke Jie, in 2017.
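The reward idea can be sketched in a few lines. This is a heavily simplified illustration and not how AlphaGo works: the situations, the reward function and the learning rate are all hypothetical, and the learner simply keeps a running score for each answer and nudges it towards the reward it receives.

```python
# Toy reward-driven learning: the learner is never told the right answer,
# only whether its last guess earned a reward (+1) or a penalty (-1).
# Situations, rewards and the learning rate are made up for illustration.

ACTIONS = ["cat", "dog"]

def reward(situation, action):
    """Hypothetical environment: +1 for the right answer, -1 otherwise."""
    correct = {"pointy ears": "cat", "long snout": "dog"}
    return 1.0 if correct[situation] == action else -1.0

def learn(situations, rounds=5, learning_rate=0.5):
    """Keep a score per (situation, action) and nudge it towards each reward."""
    values = {s: {a: 0.0 for a in ACTIONS} for s in set(situations)}
    for _ in range(rounds):
        for s in situations:
            # Pick the action that currently looks best (ties go to the first).
            action = max(ACTIONS, key=lambda a: values[s][a])
            r = reward(s, action)
            values[s][action] += learning_rate * (r - values[s][action])
    return values

values = learn(["pointy ears", "long snout"])
policy = {s: max(ACTIONS, key=lambda a: values[s][a]) for s in values}
print(policy)
```

A penalty pushes the score of a wrong answer below zero, so on the next round the learner tries the other answer and, once rewarded, sticks with it.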

After setting forth the foundations of machine learning, Rosemary briefly discussed some of the most common buzzwords in the field. For example, she provided a clear explanation of neural networks, often-mystified mathematical models that are in fact quite simple, and of deep learning, which in essence stacks many layers of such networks together.

The last part of the talk discussed a crucial question: what about the hype? There are many successful applications of machine learning – many seemingly scary, like face recognition. Facebook can learn who you are and what you are like, Google can monitor your interests and locations, and Amazon can predict what you want to buy. But where are the limits to this technology?

Clearly, big data cannot solve everything. In the first place, the data available is not always “big”: it may be scarce or too complex, as in biomedical applications, or it might refer to situations that have only happened a few times in history, like financial crashes. Moreover, there are many situations where the problem is so complex that machine learning algorithms are simply unable to learn enough to reach useful conclusions. Above all, machine learning algorithms are not good at understanding causality: their mathematical structure makes them prone to finding meaningless correlations that, while effective for some applications, are not effective for all.

Finally, while the prowess of machines is impressive for many tasks, they still lag behind what humans are capable of doing. While they have achieved many successes, they have done so at the expense of vast amounts of data, immense computational power and, ultimately, enormous amounts of energy. In contrast, a human baby can identify patterns from just a few examples, and learn complex processes much faster, and with less energy, than current computers ever could. Also, as Rosemary remarked, “they do all of this while being powered by milk”.

Given the growing importance of emerging technologies like big data and machine learning, it is essential to engage in an open conversation about them. This conversation needs to be broad, and encompass not only experts in the topic, but also anyone who may be affected by it. The potentially catastrophic consequences of algorithmic bias (when a machine learns gender or race biases, sometimes due to lack of data) make it all the more urgent. Hopefully this talk will have provided some tools to critically engage with machine learning. To repeat the words that started it, “this is a topic too important not to talk about it.”

The talk was followed by a very productive discussion with the engaged audience. Some of the topics discussed were the ethics of data collection, the actual capabilities of big data in fields like healthcare or technology, and the many dangers of misused data.

Find out more about the 2019 Green Templeton Outreach Talks.

Upcoming talks in this series:

Talk 4: Thursday 27 June, 17:30
Should we vaccinate children against addictions? Some ethical considerations

Speaker: Lovro Savić, PhD student in Public Health Ethics, University of Oxford

More information and register here

The fast-paced technological development of recent decades has brought many crucial ethical questions to the table. Many of these questions are as fascinating as they are difficult, creating strong controversies both in the medical profession and in public opinion. How do we find solutions? How can we understand what is right? In this talk, Lovro Savić, PhD student in Public Health Ethics at the University of Oxford, will provide a gentle but captivating overview of public health ethics, explaining how researchers study these questions and try to come up with useful solutions. In particular, Lovro will be examining a real-life research question: is state-mandated vaccination of children against addictions morally admissible?

Previous talks in this series:

Talk 1: Wednesday 15 May, 17:30
Quantum computing – why you should be interested (and why not)
Speaker: Carlos Outeiral, Oxford PhD student, National Quantum Technologies Hub

Read full report from the lecture

Quantum computing is poised to become one of the most disruptive technologies of the coming years. Harnessing the power of quantum effects to do computation promises to solve several open problems in many areas, from engineering to drug design. Unfortunately, it is often difficult to separate fact from unrealistic expectation and, even more importantly, to tell what will be available in the coming years from what will arrive only after many decades of research. In this talk, Carlos Outeiral, Oxford PhD student at the National Quantum Technologies Hub, will deliver an uncomplicated introduction to quantum computing and outline what we can expect to see in the coming decade.

Talk 3: Monday 10 June, 17:30
Introducing the economics of happiness
Speaker: Karl Overdick, PhD student in Management, University of Oxford

More information

In recent decades, a new field of economics has been built upon the principle that everyone tries to be as happy as they can be. It aims to quantify and study happiness, and to understand what can be done to maximise it from both a personal and a policy point of view. In this talk, Karl Overdick, PhD student in Management at the University of Oxford, will provide an introductory and balanced view of what the field of subjective wellbeing has found since its inception.

For more information about the GTC Outreach Talks please contact:

Carlos Outeiral

Created: 11 June 2019