E-tika podcast: Of Data and Men

Sometimes new technology is released, and only then do we discover its effects on people and society. Sometimes, it may be too late. That’s why we discuss the social and ethical dimensions of digital technologies with our guests. We are now turning these conversations into articles, and we hope you will enjoy the written format as well.

It is not too far-fetched to say that, to a large extent, our lives are determined by data. Data are used in our social and financial systems and in traffic control, and they are fed into algorithms that determine the content displayed on our favorite social media channels.

When it comes to artificial intelligence systems, we often say they can only be as good as the data we feed into them. More importantly, it is crucial to keep in mind that we humans are often not only the ones who make use of the data, but also its source.

In the fifth episode of the E-tika podcast, we considered the ethical and societal aspects of working with data. Juraj Podroužek and Tomáš Gál were joined by Jakub Šimko, who leads a research team at KInIT that studies web and user data processing.

When it comes to artificial intelligence systems today, there is a certain expectation that such systems can simply harvest data and make sense of it. This is often far from the truth: making sense of the data is usually accompanied by human oversight, which also ensures the quality of the outcomes we are able to get from such systems.

From the perspective of data ethics, we should be concerned with the whole data cycle, from collection to use and sharing. Starting with data collection, it is important to consider where the data is coming from, and how reliable and trustworthy it is. Some methods of data collection might be considered unfair or outright dangerous, especially if we base important decisions on them. Today, many of us are aware of instances where collected data were misused, sometimes with terrifying consequences. Our readers are probably aware of the Cambridge Analytica scandal, or the plethora of data leaks from social media that make newspaper headlines on a regular basis.

On the other hand, there are methods of data collection which we call crowdsourcing. In this case, people provide data or solve certain tasks for a financial incentive. However, crowdsourcing is not without its limitations and pitfalls. 

To better understand what crowdsourcing is, imagine a kind of virtual marketplace where people can solve rather simple tasks, e.g. answering whether there is a squirrel in a particular set of photographs. For annotating photographs in this way, people are paid a small sum of money. This brings us to the broader societal implications of this booming business. In general, the labor force on such crowdsourcing platforms comes from developing countries. Being part of the grey economy, unregulated and with no workers’ rights, such practices often border on a modern form of labor slavery or exploitation, where the benefits of the trained artificial intelligence models are often reaped by corporations.

The quality of data collected from crowdsourcing platforms can also be brought into question. Given the financial incentive, such platforms are often filled with spammers or dishonest workers who do not complete the tasks mindfully. There are, however, certain practices that can ensure a higher quality of the data collected, such as basic training, task redundancy, where the same tasks are completed by multiple workers to ensure a correct outcome, or splitting tasks into multiple, easier tasks.
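To make the idea of task redundancy more concrete, here is a minimal sketch in Python of how answers from several workers on the same task might be combined by majority vote. It is only an illustration, not a method described in the podcast: the function name `aggregate_labels`, the `min_agreement` threshold and its default value of 0.6 are all assumptions made for this example.

```python
from collections import Counter


def aggregate_labels(answers, min_agreement=0.6):
    """Combine redundant answers for ONE task by majority vote.

    `answers` is a list of labels given by different workers.
    `min_agreement` is a hypothetical threshold (an assumption of this
    sketch): if the winning label has a smaller share of the votes,
    the task is treated as unresolved.

    Returns (label, agreement), with label set to None when unresolved.
    """
    if not answers:
        return None, 0.0

    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]  # most frequent answer
    agreement = votes / len(answers)         # share of workers who gave it

    if agreement < min_agreement:
        return None, agreement               # too much disagreement
    return label, agreement


# Three workers annotate the same photograph.
label, agreement = aggregate_labels(["squirrel", "squirrel", "no squirrel"])
```

Tasks that come back unresolved could be sent to additional workers or reviewed by hand. A careless or dishonest worker then has less influence on the final dataset, at the cost of paying for the same task several times.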

When we speak of data, we are often led to believe that they are just some abstract matter, existing somewhere “out there”. Even if we are aware that some data is being collected, we tend to accept this in exchange for being able to use certain services, such as social media, for free. But we humans are the most common source of this data. This of course carries certain ethical risks and considerations that people working with data, in this case researchers, should keep in mind. What are the limits on what data researchers should collect, and how can we ensure that data about us is not misused?

When it comes to scientific research, there are certain standards and rules put in place to ensure that data are not misused. In general, researchers are required to obtain informed consent from their data sources, and ethical assessments often take place to ensure that researchers reflect on the ethical and societal consequences of their research. There are also technical measures that can address some risks, e.g. the risk that data is traced back to its source, which can lead to a loss of privacy.

We might think of anonymization of the research data or of monetary tokens as a precondition for providing human data. However, at the end of the day, we find that such measures are often unattainable. Large-scale data collection often takes place not within scientific research but in the commercial sphere. This brings us to another broader consequence of large-scale data collection and computing: it requires a lot of hardware and energy. Artificial intelligence can thus be perceived both as a technological solution to environmental issues and as part of the problem.

How can scientists and researchers working with web and user data contribute to creating resilience against malicious use of data? Long-term education seems to be the most crucial aspect that can increase people’s resilience and awareness of how their data can be misused to manipulate them. Education coupled with transparency requirements, such as showing people the many ways in which their data are being used and by whom, as well as effective regulation and continuous auditing of data processing, could lead us to fairer data practices.