Machine Learning meets Data-Driven Journalism: Boosting International Understanding and Transparency in News Coverage

Authors

  • Elena Erdmann
  • Karin Boczek
  • Lars Koppers
  • Gerret von Nordheim
  • Christian Pölitz
  • Alejandro Molina
  • Katharina Morik
  • Henrik Müller
  • Jörg Rahnenführer
  • Kristian Kersting
Abstract

Migration crisis, climate change or tax havens: global challenges need global solutions. But agreeing on a joint approach is difficult without common ground for discussion. Public spheres are highly segmented because news is mainly produced and received at the national level. Gaining a global view of international debates on important issues is hindered by the enormous quantity of news and by language barriers. Media analysis usually focuses only on qualitative research. In this position statement, we argue that it is imperative to pool methods from machine learning, journalism studies and statistics to help bridge the segmented data of the international public sphere, using the Transatlantic Trade and Investment Partnership (TTIP) as a case study.

2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, NY, USA. Copyright by the author(s).

1. The need for cross-national analysis

The recently published news on the Panama Papers leak demonstrates, first, that tax fraud is an international phenomenon and, second, that cross-national cooperation can benefit investigation and reporting. Admittedly, this is an exceptional case. Global events are still “primarily covered in accordance with the traditional national outlook, i.e. national domestications and the ‘domestic vs. foreign news logic’” (Berglez, 2008, 847). A global public sphere for addressing globally relevant issues has not yet been established, and national biases impede possible international approaches. This way “the global sociopolitical order becomes defined by the realpolitik of nation-states that cling to the illusion of sovereignty despite the realities wrought by globalization” (Castells, 2008, 80). Reciprocal knowledge about controversial issues across national borders is necessary to provide common ground for fruitful global discussions and proposals.
In this position statement, we provide evidence that joining forces improves media transparency on a global scale: combining machine learning with statistics and journalism studies contributes to bridging the segmented data of the international public sphere. Following an interdisciplinary approach, we tackle the question of how methods from machine learning help to deepen our understanding of the discussion of cross-national issues. LDA, @TM, PDNs and word2vec are used to enhance the transparency of international media coverage: the range, amount and framing of issues can be compared with less translation effort. Differences in perception become obvious, and divergent evaluations can be interpreted.

This will be demonstrated by an analysis of the coverage of the controversial Transatlantic Trade and Investment Partnership (TTIP) between the United States of America (U.S.) and the European Union (E.U.). TTIP was designed to facilitate trade between the U.S. and the E.U. However, TTIP’s actual impact on economies and societies has been the subject of controversy in both the U.S. and Europe, and media perception has differed in many respects. The comparison of a U.S. newspaper (New York Times) and a German newspaper (Süddeutsche Zeitung) reveals that TTIP is more hotly debated in Germany than in the U.S., see Fig. 1(a). The New York Times highlighted the need for bank regulation and the threat that exporting nations pose to local markets, whereas Süddeutsche Zeitung focused largely on consumer protection. There, TTIP was criticized on the one hand for its implications for environmental and food standards, and on the other for negotiation proceedings that were characterized by democratic deficits and insufficient transparency.

Intricate questions arise from the simple comparison of word frequencies: Why does the range of reported arguments differ so considerably? Are the German media opposed to the trade partnership in general?
And what are the reasons for the German obsession with the ‘chlorinated chicken’, as evidenced by how often these words appear in articles on the TTIP issue?

The TTIP example illustrates that media coverage can differ widely across nations. Public discussion is still strongly influenced by national media, and multiple languages add to the difficulty of framing a common global perspective. However, in a globalized world, crises and political decisions have become far too complex to be dealt with at the national level only. Combining machine learning and data-driven journalism enables researchers to investigate large corpora of texts to reveal national patterns of argumentation, which in turn can promote international understanding.

2. ML meets Data-Driven Journalism

There is an arms race to ‘deeply’ understand text data, and consequently a range of different techniques has been developed for media analysis. However, using them for data-driven journalism, e.g. to gain a deeper understanding of the news reception of important political and societal issues, also poses challenges. To name just a few of the recent ML techniques: DeepDive (Niu et al., 2012) aims to extract structured data from texts, Metro Maps (Shahaf et al., 2012) extract easy-to-understand networks of news stories, and word2vec (Mikolov et al., 2013) computes Euclidean embeddings of words. word2vec is trained on a corpus of documents and transforms each word into a vector by calculating word correlations. Similarity measures can then be applied to the resulting vectors. In particular, word2vec can be used to compute those words that are most likely to occur in the same context as a given word.
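As a rough illustration of applying a similarity measure to word vectors, the sketch below ranks words by cosine similarity to a query word. Everything in it is an assumption made for illustration: the three-dimensional vectors are invented rather than learned, the names `TOY_VECTORS`, `cosine_similarity` and `most_similar` are hypothetical, and cosine similarity is simply one common choice, since the text does not say which measure is meant.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# Real word2vec embeddings are learned from a corpus and usually have
# far more dimensions.
TOY_VECTORS = {
    "ttip": [0.9, 0.8, 0.1],
    "trade": [0.8, 0.9, 0.2],
    "chicken": [0.7, 0.2, 0.9],
    "bank": [0.1, 0.9, 0.3],
    "weather": [-0.5, 0.1, -0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=3):
    """Return the top_n other words, ranked by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    for neighbour, score in most_similar("ttip", TOY_VECTORS):
        print(f"{neighbour}: {score:.3f}")
```

With vectors trained on a real news corpus, the same ranking step would list the words that tend to share contexts with a query term such as ‘TTIP’.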
word2vec thus has the potential to reveal which words are most closely linked to a given issue, and hence provides a semantically enriched alternative to the classical keyword searches that journalists commonly use.

Finally, topic models have been used successfully in many scenarios, in particular to model discourses. Most prominent among them is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which characterizes each topic as a list of words and their respective probabilities of appearing in the topic. Topics over Time (TOT) (Wang & McCallum, 2006) follows this paradigm but introduces a temporal component. In TOT, each document has a timestamp, and the probability of a topic grows and declines over time. Thus, TOT can be employed to analyze trends in news.

Given this rich machine learning toolbox for analyzing news articles, it is tempting to put a stack of news articles on a data journalist’s desk and say ‘Enjoy’. Unfortunately, data-driven journalism is not that simple. Reconsider topic models, the main focus of the present paper. TOT does not model the attention of the crowd in a physically plausible way. Inspired by models from communication studies (Kolb, 2005) and by the observation that the Shifted Gompertz distribution models attentional curves (Bauckhage et al., 2014), we developed a novel Attentional Topic Model (@TM) (Pölitz et al., 2016). It captures the growth and decline of the popularity of topics in a physically plausible way.

Moreover, multinomial word distributions, such as those in LDA and TOT, capture the most common words used in each topic. However, they often fail to give the deeper understanding of topics that is required when investigating media discourse. That is why Admixtures of Poisson MRFs (APMs) (Inouye et al., 2014), which discover word dependencies in each topic, have been introduced; essentially, they encode topics as weighted undirected graphs. Often, however, word dependencies are asymmetric.
If the word ‘treaty’ appears in a text, it is very likely that the text will refer to the ‘secretary of state’, too. The phrase ‘secretary of state’, on the other hand, is a very general term and can be used in many different contexts. Thus, it does not make the word ‘treaty’ per se more likely. In Erdmann et al. (2016), we therefore extended APMs to directed dependencies using Poisson Dependency Networks (Hadiji et al., 2015). Moreover, longer chains of directed dependencies may provide interesting clues for understanding a topic.

Finally, topic models have traditionally been evaluated using intrinsic measurements such as the likelihood and the perplexity of topics (Wallach et al., 2009). As these measurements do not necessarily correspond to human judgment (Chang et al., 2009), we pool the talents of journalists, machine learners and statisticians to obtain a better understanding of what makes a good topic. If we use topic models to create subcorpora, e.g. for content analysis, we have to ensure that the subcorpora are at least as good as those obtained by other methods such as keyword searches. The gold standard is evaluation based on human judgment (Stryker et al., 2006). Statistical methods help to reduce the time human coders need to arrive at a significant statement about the quality of a subcorpus.

Moreover, our interdisciplinary research led to several interesting observations about the quality of topics: while researchers from a mathematical background tended to focus on topics linked to large quantities of documents, journalists often preferred topics that were created from only a few meaningful documents. Likewise, words such as ‘can’, ‘need’ and ‘do’, which machine learners considered stopwords, really caught the journalists’ attention. We believe that these small and seemingly insignificant observations can help to improve the application and pave the way for new computational models.
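The asymmetry in the ‘treaty’ / ‘secretary of state’ example can be made concrete with document-level co-occurrence counts. The sketch below is not a Poisson Dependency Network and not the authors’ method; it is a much simpler stand-in that estimates, from an invented toy corpus, how often one term appears given that another does. The corpus `TOY_DOCS` and the function name `conditional_cooccurrence` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def conditional_cooccurrence(documents):
    """Estimate P(b appears in a document | a appears in it) for ordered pairs."""
    doc_counts = Counter()   # number of documents containing each term
    pair_counts = Counter()  # number of documents containing both a and b
    for doc in documents:
        terms = set(doc)
        doc_counts.update(terms)
        # Ordered pairs (a, b) with a != b, so both directions are counted.
        pair_counts.update(permutations(terms, 2))
    return {(a, b): n / doc_counts[a] for (a, b), n in pair_counts.items()}


# Invented toy corpus: each document is a list of already-extracted terms.
TOY_DOCS = [
    ["treaty", "secretary_of_state", "trade"],
    ["treaty", "secretary_of_state"],
    ["secretary_of_state", "visit"],
    ["secretary_of_state", "election"],
    ["secretary_of_state", "speech"],
]

probs = conditional_cooccurrence(TOY_DOCS)

if __name__ == "__main__":
    print(probs[("treaty", "secretary_of_state")])
    print(probs[("secretary_of_state", "treaty")])
```

In this toy corpus every document mentioning ‘treaty’ also mentions ‘secretary_of_state’, while most documents mentioning ‘secretary_of_state’ do not mention ‘treaty’. An undirected edge weight cannot express that difference, which is the motivation for directed dependencies.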
Can new topic models be developed to cater better to the specific needs of journalists? Are there different approaches to gain deeper insight into each topic? We will illustrate this using the case of

Similar resources

Data-Driven News Generation for Automated Journalism

Despite increasing amounts of data and ever improving natural language generation techniques, work on automated journalism is still relatively scarce. In this paper, we explore the field and challenges associated with building a journalistic natural language generation system. We present a set of requirements that should guide system design, including transparency, accuracy, modifiability and t...

Transparency and Trust in Journalism: An Examination of Values, Practices and Effects

Michael Koliska, Doctor of Philosophy, 2015. Directed by: Professor Linda Steiner, Philip Merrill College of Journalism. Journalism scholars and practitioners have repeatedly argued that transparency is crucial to generate trust in the news media, which, over the years, has faced continues dec...

Civic journalism meets civic social science: foregrounding social determinants in health coverage

Many of the intricacies of health feature regularly in news reports depicting medical practices, specific diseases, breakthroughs in treatment, and lifestyle-orientated interventions. Despite social scientists also demonstrating the importance of economic prosperity, community cohesion, stress, material hardship and stigma for health, such social determinants are often absent from health news....

and journalism : toward new frameworks for imagining news

Journalists and technologists increasingly are organizing and collaborating, both formally and informally, across major news organizations and via grassroots networks on an international scale. This intersection of so-called ‘hacks and hackers’ carries with it a shared interest in finding technological solutions for news, particularly through opensource software programming. This article critic...

Be In The Know: Connecting News Articles to Relevant Twitter Conversations

In the era of data-driven journalism, data analytics can deliver tools to support journalists in connecting to new and developing news stories, e.g., as echoed in microblogs such as Twitter, the new citizen-driven media. In this paper, we propose a framework for tracking and automatically connecting news articles to Twitter conversations as captured by Twitter hashtags. For example, such a syst...

Journal:
  • CoRR

Volume abs/1606.05110

Publication date: 2016