Large-Scale TV Dataset

The STVD-FC dataset addresses the fact-checking problem. It focuses on French political discourse and covers the 2022 French presidential election, over the period from 1 February to 1 May 2022. The process used to build the dataset is detailed in [1]. The dataset contains about 1,330 fact-checked claims scraped from the fact-checking service Factoscope. For the video counterpart, nearly 6,730 TV programs, representing a total duration of 6,540 hours, have been captured with a TV workstation, along with their metadata.

The dataset is composed of different parts provided as:

The following naming convention is applied within the dataset:

For visualization and testing, some samples (audio/video files with the related fact-checked claims) are given in the table below.

Sample    Audio  Video  Claim  Topics
sample 1  a_1    v_1    f_1    New Caledonian, Guadeloupe
sample 2  a_2    v_2    f_2    François-Xavier Bellamy
sample 3  a_3    v_3    f_3    War in Ukraine
sample 4  a_4    v_4    f_4    Abstention
sample 5  a_5    v_5    f_5    Political battle Macron-Le Pen

The files constituting the dataset are given below and are password-protected. The dataset is available for non-commercial research purposes. Before downloading the dataset, get the agreement (English or French version) and sign it. Then send the scanned version to Mathieu Delalandre by email. After verifying your request, we will send you the password to unzip the dataset.

The files constituting the dataset are listed here. We first provide the global file containing the fact-checked claims, together with its XML schema. Parts 1 to 8 are given in the table below.
For easier access, a CSV index file is provided for every part, in the format
Hashcode; Channel; Program where

e.g. e13d...875b; Franceinfo; Le fil info
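As an illustration of the index format above, here is a minimal Python sketch that reads semicolon-separated `Hashcode; Channel; Program` rows into a dictionary. The function name `read_index`, the sample hashcodes and the handling of an optional header row are assumptions made for this example; they are not part of the dataset documentation.

```python
import csv
import io


def read_index(stream):
    """Return a dict mapping hashcode -> (channel, program)."""
    index = {}
    reader = csv.reader(stream, delimiter=";", skipinitialspace=True)
    for row in reader:
        # Skip blank lines and rows that do not have exactly three fields.
        if len(row) != 3:
            continue
        hashcode, channel, program = (field.strip() for field in row)
        # Assumption: if the file has a header row, it starts with "Hashcode".
        if hashcode.lower() == "hashcode":
            continue
        index[hashcode] = (channel, program)
    return index


# Made-up rows for illustration; real hashcodes are longer.
sample = io.StringIO(
    "abc123; Franceinfo; Le fil info\n"
    "def456; Franceinfo; Le fil info\n"
)
index = read_index(sample)
print(index["abc123"])
```

With a real index file, the same function could be called on a file opened with `open(path, newline="", encoding=...)`; the file encoding is not stated on this page and would need to be checked.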

Part   Duration (h)  Hashcodes  Index     Files  Size (GB)  Link
1      815.6         28         download  16     245.8      download
2      815.9         19         download  16     246.0      download
3      805.5         27         download  16     242.6      download
4      814.7         21         download  16     244.7      download
5      812.2         14         download  16     242.4      download
6      828.4         17         download  16     249.5      download
7      808.4         12         download  16     241.0      download
8      806.5         13         download  16     243.8      download
Total  6507.2        151                  128    1,956

NB. Our storage service at the UT delivers 3-16 MB/s for downloads (on a low-speed and a high-speed connection, respectively) under concurrent access.
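To plan a download, the rates quoted above can be turned into a rough time estimate. The sketch below is only indicative: it assumes a sustained rate, takes 1 GB as 1000 MB, and uses the size of part 1 (245.8 GB) and the total size (1,956 GB) from the table. The function name `download_hours` is introduced here for illustration.

```python
def download_hours(size_gb, rate_mb_per_s):
    """Estimated transfer time in hours, assuming 1 GB = 1000 MB."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


part_1_gb = 245.8  # size of part 1, from the table
total_gb = 1956    # total size of parts 1 to 8, from the table

for label, size in (("part 1", part_1_gb), ("all parts", total_gb)):
    slow = download_hours(size, 3)   # low-speed connection
    fast = download_hours(size, 16)  # high-speed connection
    print(f"{label}: {fast:.1f} h to {slow:.1f} h")
```

Under these assumptions, a single part takes roughly 4 to 23 hours and the eight parts together roughly 34 to 181 hours.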

To help users get started, the STVD-FC dataset is provided with a "hello world" index. This index gives baseline results of NLP and CV methods for a first analysis of the dataset. Its directories are organized with the same naming convention as the root dataset,
i.e. \\PartX\Hashcode\ts\
where every directory contains the following index files:

The archive of the index (≃ 216 MB) and the list of reference keywords can be accessed at the following links: index, keywords.
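When walking the index, the `\\PartX\Hashcode\ts\` directory convention mentioned above can be split into its three components. The following is a minimal sketch that treats each component as an opaque string. The helper name `split_index_path` and the example values are hypothetical, and the exact form of the part and `ts` components is not specified here.

```python
def split_index_path(path):
    """Return (part, hashcode, ts) from a backslash-separated directory path."""
    # Drop the empty strings produced by leading, doubled or trailing backslashes.
    components = [c for c in path.split("\\") if c]
    if len(components) != 3:
        raise ValueError(f"expected 3 path components, got {len(components)}")
    part, hashcode, ts = components
    return part, hashcode, ts


# Hypothetical values; real part names, hashcodes and ts values come from the dataset.
example = "\\\\Part1\\abc123\\ts_value\\"
print(split_index_path(example))
```

The hashcode returned here could then be looked up in the CSV index of the corresponding part to recover the channel and program.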

Please cite the following paper [1] if you use this dataset.

  1. F. Rayar, M. Delalandre and V.H. Le. A large-scale TV video and metadata database for French political content analysis and fact-checking. Conference on Content-Based Multimedia Indexing (CBMI), pp. 181-185, 2022.