The Global Database of Events, Language and Tone, or GDELT project, collects large volumes of global media production on an ongoing basis. The project shares near-realtime data in a number of formats, including via an API that can be used to query recently collected documents. GDELT processes the media content it collects, extracting information on the entities (e.g. people, organisations) and events mentioned, and makes this processed ‘knowledge graph’ data available.
I’m interested in what media data GDELT collects from media in New Zealand and the Pacific, including the volume of media articles GDELT is collecting on a daily basis and what media outlets are represented in this data. I’ve put together a quick Jupyter notebook to profile what data is available via the GDELT Doc 2.0 API for New Zealand, Fiji, Cook Islands, Tonga, Samoa and Nauru. I realise this is not an exhaustive list of Pacific states (or even all South Pacific states), but this list includes states of varying sizes and media landscapes. I’m sharing some notes on my scoping here for future reference and in case this is relevant for anyone else.
Based on population and the number of media outlets, it seemed likely that New Zealand should be well represented. I’ve looked into Pacific media recently and so I’m aware there are multiple news websites for Fiji, Cook Islands and Samoa.
How many articles?
Table 1: Number of articles collected by GDELT from New Zealand and Pacific media on June 10th and June 3rd
Country | Articles for 10/6 | Articles for 3/6 |
---|---|---|
New Zealand | 479 | 355 |
Fiji | 14 | 36 |
Tonga | 3 | 4 |
Samoa | 11 | 7 |
Cook Islands | 0 | 0 |
Nauru | 0 | 0 |
Table 1 provides an overview of the data GDELT collected over two 24-hour periods. Over the 24 hours of June 10th (GMT), GDELT collected 479 articles from New Zealand media outlets. As expected, it collected far fewer from Pacific-based media. I sampled a second day (June 3rd GMT) to see if there was a consistent pattern. On both days there were articles for Fiji, Samoa and Tonga, but none for the Cook Islands or Nauru.
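For anyone wanting to reproduce a single-day count, here is a minimal sketch of how a one-day query URL for the Doc 2.0 API could be assembled. It is not necessarily how the notebook does it. The endpoint, the parameter names (`query`, `mode`, `format`, `maxrecords`, `startdatetime`, `enddatetime`) and the `sourcecountry:` operator are as I understand them from GDELT’s API documentation and should be checked there. The function name, the country value `newzealand` and the year are placeholders (the year is not stated in this post).

```python
from datetime import date, datetime, timedelta
from urllib.parse import urlencode

# Assumed endpoint for the GDELT Doc 2.0 API; verify against GDELT's documentation.
DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def day_query_url(country, day, max_records=250):
    """Build an article-list query URL covering one GMT day for one source country.

    `country` is passed straight to the `sourcecountry:` operator; the accepted
    values (country name without spaces, or a country code) are an assumption.
    """
    start = datetime(day.year, day.month, day.day)          # 00:00:00
    end = start + timedelta(days=1) - timedelta(seconds=1)  # 23:59:59
    params = {
        "query": f"sourcecountry:{country}",
        "mode": "artlist",
        "format": "json",
        "maxrecords": max_records,
        "startdatetime": start.strftime("%Y%m%d%H%M%S"),
        "enddatetime": end.strftime("%Y%m%d%H%M%S"),
    }
    return f"{DOC_API}?{urlencode(params)}"


# Placeholder year: substitute the year you are actually sampling.
example_url = day_query_url("newzealand", date(2022, 6, 10))
print(example_url)
```

As far as I know the API caps the number of records returned per request, so a day with several hundred articles would probably need to be split into shorter time windows and the results combined.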
Table 2: Number of articles collected from Pacific media over a seven-day period (June 4th to June 10th GMT)
Country | 4/6 | 5/6 | 6/6 | 7/6 | 8/6 | 9/6 | 10/6 |
---|---|---|---|---|---|---|---|
Fiji | 25 | 23 | 34 | 38 | 13 | 19 | 14 |
Cook Islands | 4 | 0 | 0 | 9 | 2 | 0 | 0 |
Tonga | 3 | 7 | 5 | 0 | 1 | 2 | 3 |
Samoa | 11 | 6 | 15 | 8 | 1 | 5 | 11 |
Nauru | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
I checked a longer seven-day period to see if GDELT’s Pacific media collection varied from day to day and whether, given the low volumes, the two days sampled were unusual. As indicated in Table 2, there appears to be regular collection of media from Fiji, Tonga, and Samoa, and intermittent collection from the Cook Islands (three out of the seven days). There were no articles for Nauru in the seven-day period.
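The regular versus intermittent distinction can be checked directly from the Table 2 figures. This small sketch counts, for each country, the days with at least one article and the weekly total. The counts are copied from Table 2; the variable and function names are just for illustration.

```python
# Daily article counts from Table 2 (June 4th to June 10th GMT).
table2 = {
    "Fiji": [25, 23, 34, 38, 13, 19, 14],
    "Cook Islands": [4, 0, 0, 9, 2, 0, 0],
    "Tonga": [3, 7, 5, 0, 1, 2, 3],
    "Samoa": [11, 6, 15, 8, 1, 5, 11],
    "Nauru": [0, 0, 0, 0, 0, 0, 0],
}


def summarise(counts):
    """Return (days with at least one article, total articles)."""
    days_with_articles = sum(1 for c in counts if c > 0)
    return days_with_articles, sum(counts)


summary = {country: summarise(counts) for country, counts in table2.items()}

for country, (days, total) in summary.items():
    n_days = len(table2[country])
    print(f"{country}: articles on {days} of {n_days} days, {total} in total")
```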
What media outlets?
GDELT’s API returns the domain of each article and this provides a way to report on the media outlets they collect.
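A per-domain breakdown like the one in Table 3 can be produced by tallying that domain value across the returned articles. The sketch below assumes each article is a dictionary with a `domain` key, which is how I understand the API’s JSON article list to be shaped. The sample records and the function name are made up for illustration and are not GDELT data.

```python
from collections import Counter


def count_domains(articles):
    """Tally articles per domain, skipping records with no domain value."""
    return Counter(a["domain"] for a in articles if a.get("domain"))


# Made-up records standing in for an API response (not real GDELT data).
sample_articles = [
    {"url": "https://example.org/story-1", "domain": "example.org"},
    {"url": "https://example.net/story-2", "domain": "example.net"},
    {"url": "https://example.org/story-3", "domain": "example.org"},
    {"url": "https://example.com/story-4"},  # no domain field, so skipped
]

domain_counts = count_domains(sample_articles)

# most_common() orders domains from most to fewest articles.
for domain, count in domain_counts.most_common():
    print(domain, count)
```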
Table 3: Breakdown of New Zealand domains represented in GDELT data for June 10th (GMT Time)
Index | Domain | Count |
---|---|---|
0 | nzherald.co.nz | 88 |
1 | home.nzcity.co.nz | 67 |
2 | scoop.co.nz | 63 |
3 | odt.co.nz | 28 |
4 | foreignaffairs.co.nz | 27 |
5 | auckland.scoop.co.nz | 19 |
6 | livenews.co.nz | 18 |
7 | community.scoop.co.nz | 16 |
8 | sunlive.co.nz | 14 |
9 | thedailyblog.co.nz | 12 |
10 | newstalkzb.co.nz | 11 |
11 | business.scoop.co.nz | 10 |
12 | waateanews.com | 10 |
13 | interest.co.nz | 9 |
14 | nzdoctor.co.nz | 9 |
15 | kiwiblog.co.nz | 9 |
16 | indianweekender.co.nz | 6 |
17 | pacific.scoop.co.nz | 5 |
18 | nbr.co.nz | 5 |
19 | newzealandstar.com | 5 |
20 | localmatters.co.nz | 4 |
21 | wellington.scoop.co.nz | 4 |
22 | hauraki.co.nz | 4 |
23 | info.scoop.co.nz | 3 |
24 | thestandard.org.nz | 3 |
25 | thecoast.net.nz | 3 |
26 | itbrief.co.nz | 3 |
27 | sharechat.co.nz | 3 |
28 | morefm.co.nz | 2 |
29 | scene.co.nz | 2 |
30 | newshub.co.nz | 2 |
31 | theedge.co.nz | 2 |
32 | greystar.co.nz | 1 |
33 | undertheradar.co.nz | 1 |
34 | thehits.co.nz | 1 |
35 | metromag.co.nz | 1 |
36 | reseller.co.nz | 1 |
37 | voxy.co.nz | 1 |
38 | indiannewslink.co.nz | 1 |
39 | otago.ac.nz | 1 |
40 | fishing.net.nz | 1 |
41 | asiapacificreport.nz | 1 |
42 | aardvark.co.nz | 1 |
43 | maifm.co.nz | 1 |
44 | salient.org.nz | 1 |
In the case of New Zealand, on June 10th (GMT) there were 45 different domains in the GDELT data. Table 3 lists the domains and the number of articles per domain. The domains represented on June 10th were similar to those on June 3rd.
The New Zealand Herald is prominent, accounting for almost one fifth of the NZ articles. Other major media producers with significant audiences are represented, including Otago Daily Times and NewstalkZB, but there are noticeable omissions. For example, no articles from Stuff, RNZ, or TVNZ were collected during the days sampled.
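The ‘almost one fifth’ figure is simple arithmetic on Table 3: a domain’s count divided by the 479 New Zealand articles for June 10th. A quick sketch using a few of the Table 3 rows (the variable names are just for illustration):

```python
# Total New Zealand articles for June 10th (Table 1) and selected Table 3 rows.
total_nz = 479
selected = {
    "nzherald.co.nz": 88,
    "home.nzcity.co.nz": 67,
    "scoop.co.nz": 63,
    "odt.co.nz": 28,
    "newstalkzb.co.nz": 11,
}

# Percentage share of the day's New Zealand articles, rounded to one decimal place.
shares = {domain: round(100 * count / total_nz, 1) for domain, count in selected.items()}

for domain, share in shares.items():
    print(f"{domain}: {share}%")
```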
The second most frequent domain, home.nzcity.co.nz, appears to be syndicating news from an Australian media company (ABC). Anyone using GDELT data for analysis of New Zealand media will be getting a large volume of Australian news noise.
There are a range of different kinds of media represented, including political blogs like thedailyblog.co.nz, thestandard.org.nz and kiwiblog.co.nz. Māori media is under-represented. Only waateanews.com appeared in the two days sampled. Prominent NZ-based Pacific media outlets (e.g. Tagata Pasifika) are also not represented.
The number of articles for each domain is revealing. Media outlets that publish a large number of articles, like NewstalkZB, are represented with very low volumes. This could indicate that their websites are crawled very infrequently but, whatever the cause, it appears that GDELT is not collecting all the articles available on each site. It is unclear whether this is by design (for example, intentional filtering) or not.
I haven’t explored the specific media outlets for the Pacific in any detail (see the Jupyter notebook for some reporting), but there are only a few outlets and some obvious omissions. For example, Fiji Times and Cook Islands News do not seem to be included. Again, for Pacific media outlets that are represented, the volumes in GDELT’s dataset appear much lower than the volume being published.
Thoughts …
GDELT’s website indicates there are big ideas driving this project:
GDELT 2.0 is an index over global society, an open dataset that attempts to make human society itself “computable,” leveraging the enormous power of Google Cloud to fundamentally reimagine how we study the human world in realtime at a planetary scale.
GDELT is an interesting and impressive project, but the assumptions here are striking. Is data collection really ‘global’? Can media data really speak for ‘global society’ / ‘human society’ / ‘the human world … at a planetary scale’? Are computations enabled by realtime monitoring, in keeping with these humanistic ideals, desirable and beneficial for everyone?
This quick scoping exercise speaks to the first of these questions: whether GDELT’s data collection is ‘global’. Anyone making use of GDELT data for research on New Zealand or Pacific media should be aware of the limited and arbitrary coverage, and would be advised to spend some time making sense of what ends up in GDELT’s datasets, including what is included and excluded from the domains GDELT crawls.
The specific media outlets included and excluded are, in part at least, based on human judgment. GDELT indicates they’re trying to increase their coverage of the “non-Western world”:
Over the last few months we’ve embarked upon an ambitious initiative to vastly expand GDELT’s knowledge of the media systems of the non-Western world. Working closely with governments, think tanks, academics, NGO’s, and citizens on the ground throughout the world we have been working country-by-country to try to build the highest resolution inventory possible of the media systems of the non-Western world.
To allow researchers to make sense of this data source, it would be helpful to know more about the coverage of different regions and states in GDELT data, whether (and how often) the inventory of media outlets being crawled is updated, and who is making the decisions that shape what data is collected.
GitHub Repository
Here is a GitHub repository with a Jupyter notebook if you want to play around with this yourself:
https://github.com/polsci/coverage