Imagine this scenario: A region’s stability is in decline due to unrest, crime and terrorism. To decide how best to respond, we need a better understanding of this humanitarian crisis, which can be gained from the information contained within a set of reports.
The challenge: We have acquired news articles containing potentially relevant information. We need you to use the historical reports to determine the topics of new articles so that they can be classified and prioritised. This will allow analysts to focus on only the most pertinent details of this developing crisis.
The data for this challenge has been acquired from a major international news provider, the Guardian. The training data represents the past historical record, and the test data represents new documents that require classification.
This dataset consists of:
- Training data (TrainingData.zip): All the news articles published between 1999 and 2014. [2.3GB]
- Test data (TestData.zip): A sample of the news articles published between 2015 and 2016. [13.8MB]
- Topic dictionary (topicDictionary.txt): A list of topics for classifying articles that improve awareness of the developing crisis. [2.1KB]
- Sample submission (sampleSubmission.csv): A sample submission file with the correct format but random topic predictions. [2.4MB]
The data is available to download from the Data Download page once you have entered the challenge.
You are required to classify each test article by predicting its topics. Your solution must:
- Classify each test article by predicting the presence or absence of only those topics that are provided in the topic dictionary.
- For each test article, predict a ‘1’ or ‘0’ for each topic in the dictionary where ‘1’ predicts the topic is present and ‘0’ predicts it is absent.
Each article may be classified by predicting that it has multiple topics, only one topic, or no topics from the topic dictionary.
The training data can be used in any way you wish (subject to the data terms in the Official Rules) in order to build your solution and predict topics for the test articles as accurately as you can.
Submitted solutions for the test articles will be evaluated with respect to the ground truth for those articles, which is exactly known. This assessment will be done using the well-known F1 score to produce an overall measure of performance.
The F1 score is defined as:
F1 = 2×TP / (2×TP + FP + FN)
where:
- TP is the number of true positives
- FP is the number of false positives
- FN is the number of false negatives
There are two main methods in practice for averaging TPs, FPs and FNs to calculate an F1 score for multi-label classification problems. For this challenge we will use the micro-averaged F1 score. This is obtained by summing the TPs, FPs and FNs over each individual decision for the test examples, to produce a global average in which each test example (document) is weighted equally.
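The micro-averaged score described above can be illustrated with a minimal Python sketch. It is not the organisers' scoring code: the function name `micro_f1`, the toy data, and the choice to return 0.0 when there are no positive labels or predictions at all are assumptions made for illustration.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]],
             y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over every
    (article, topic) decision, then apply the formula once."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, guess in zip(true_row, pred_row):
            if guess == 1 and truth == 1:
                tp += 1  # topic predicted and actually present
            elif guess == 1 and truth == 0:
                fp += 1  # topic predicted but actually absent
            elif guess == 0 and truth == 1:
                fn += 1  # topic missed
    denominator = 2 * tp + fp + fn
    # Assumption: define the score as 0.0 when the denominator is zero.
    return 2 * tp / denominator if denominator else 0.0


# Toy example: 3 articles, 3 topics (rows are articles, columns are topics).
truth = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
preds = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
score = micro_f1(truth, preds)  # TP=2, FP=1, FN=1, so 4 / 6
```

Because the counts are pooled before the ratio is taken, frequent topics influence the score more than rare ones.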
The submission file must be a CSV file (standardised for upload to the website), structured as follows:
- A header row containing a label for the test article reference (labelled ‘id’) and column labels for each topic in the order they are listed in the dictionary
- Id column: This contains the unique article reference ids for each article in the test dataset, e.g. TestData_000003
- Topic columns: These contain your prediction for each topic. You must enter either a ‘1’ or ‘0’ for each topic with respect to each test article.
Submissions will need to be ordered by unique article reference ‘id’.
The predictions for each test article may contain any combination of ‘0’s and ‘1’s, including multiple ‘1’s or all ‘0’s.
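As a rough illustration of the submission layout described above, the following Python sketch writes a header row, one row per article sorted by id, and a 0 or 1 for every topic. The function name `write_submission` and the topic names `topic_a`, `topic_b` and `topic_c` are hypothetical placeholders; real column names must come from topicDictionary.txt, in the order listed there. Sorting ids as plain strings assumes they are zero-padded like TestData_000003.

```python
import csv
import io


def write_submission(out, topics, predictions):
    """Write a submission CSV to the file-like object `out`.

    topics:      list of topic names, in dictionary order
    predictions: dict mapping article id -> set of predicted topics
    """
    allowed = set(topics)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", *topics])  # header row
    for article_id in sorted(predictions):  # rows ordered by article id
        predicted = predictions[article_id]
        unknown = predicted - allowed
        if unknown:
            # Only topics from the dictionary may be predicted.
            raise ValueError(f"{article_id}: unknown topics {sorted(unknown)}")
        row = [1 if topic in predicted else 0 for topic in topics]
        writer.writerow([article_id, *row])


# Hypothetical topics and predictions, for illustration only.
topics = ["topic_a", "topic_b", "topic_c"]
predictions = {
    "TestData_000007": {"topic_b"},
    "TestData_000003": {"topic_a", "topic_c"},
    "TestData_000005": set(),  # an article with no topics is valid
}
buffer = io.StringIO()
write_submission(buffer, topics, predictions)
csv_text = buffer.getvalue()
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="")`.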
The number of submissions will be limited to 3 submissions per day.
Public/private leader board
The public leader board will display scores which have been calculated for a statistically representative subsample (30%) of the test articles.
A private leader board will calculate the score for the remaining (70%) test articles. The private score will be used to assess the competition winner.
This competition will start on Monday 3rd April 2017 and run for 6 weeks.
Participants with the top score on the private leader board are candidates to be awarded:
- 1st Place – £20,000
- 2nd Place – £12,000
- 3rd Place – £8,000
Note that the cash prizes will only be paid out to a bank account which is not in a country with a score of 37 or less according to Transparency International’s Corruption Perceptions Index 2014.
Similarly, prizes will not be paid out to an individual who is a national of, or located in, one of these countries, or where the UK Government is not reasonably satisfied as to the potential recipient’s identity. Please see the Official Rules.
There are guidance notes in the Challenge Guide that can be downloaded once you have signed up to the challenge.
The Challenge Master is our expert for the challenge and will be available to offer guidance through the challenge forum.
With regard to the use of data, the following points should be considered:
- There are no restrictions on how you use the training data. You may use all of it, a sample of it, or even none of it.
- You may augment the provided training data with your own training data if you wish, providing you have permission to do so from the data owner.
- There are no preferred techniques for this challenge. For example, traditional text analysis techniques and/or machine learning approaches can be used.
The following instructions for your approach must be followed:
- You must predict the presence or absence of only those topics that are provided in the topic dictionary. You must predict a firm ‘1’ (present) or ‘0’ (absent) for each topic.
- You must not use web crawlers to look up the topics for published Guardian articles and employ these in your solution.
- You must not use Guardian data other than what we have provided to build your solution, e.g. by exploiting other Guardian meta-data in your solution.
- You must not manually predict topics based on simply reading the documents yourself and/or crowdsourcing topics from other readers. An automated software-based solution must be developed.
- You must be able to demonstrate that the solution can be run without any human intervention.
- Your submissions are assessed against the ground truth (as we define it). The judges’ decision shall be final and no correspondence shall be entered into with regard to it.
- Your final solution must operate in an environment without internet access. This is so it can be independently validated as a discrete component (without use of other online services).
- You must not cheat. Cheating is strictly prohibited, and any attempt to enter the competition deliberately in order to disrupt it is against the rules and the spirit of the competition. We reserve the right to disqualify any participant that does not comply with the above instructions or the Official Rules.
The Challenge Master is our expert for the challenge and will be available to discuss different approaches for solving the challenge, through the challenge forum.
The candidate winner(s) (1st, 2nd & 3rd) may be asked to host their solution on an independent environment (e.g. Amazon Web Services), re-train it and then re-run against the test data to reproduce the results obtained during the competition.
This solution will also be executed against a hold-out dataset (approximately 100 news articles roughly consistent with the test articles). This is to validate the candidate winning solution and ensure challenge rules have not been violated.
Compared to the test data, the hold-out data:
- comes from the same source
- is in the same format
- relates to the same broad crisis theme.
The solution must ingest this hold-out dataset and produce topic predictions in the same format as the submission file.
The results will be assessed against the ground truth in the same way as the competition (using the micro-averaged F1 score).
The aim is to obtain results that are consistent with previous results from the main datasets. Solution evaluation will be conducted by a panel led by the Challenge Master and including other experienced data science SMEs. If there is a major discrepancy then this will be discussed with the participant and if it cannot be explained to the satisfaction of the panel, then this may lead to disqualification. The decision of the evaluation panel is final and no further correspondence on the result will be entered into.
Following successful evaluation, the winning solution shall be delivered to the challenge host in the form of object code and source code (unless restricted by software licence terms). The winning solution must also be accompanied by documentation which describes the approach, resources required and instructions necessary to build and run the solution successfully. Please see Official Rules for further details.
Frequently Asked Questions
Each submission will be ranked by score on the public leader board. For two submissions with equal scores, the one that was first submitted will be ranked higher.
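The ranking rule above (higher score first, earlier submission first on a tie) can be sketched in Python as a single sort. The function name `rank_submissions`, the competitor names and the timestamps are all made up for illustration; this is not the leader board's actual implementation.

```python
from datetime import datetime


def rank_submissions(submissions):
    """Order (competitor, score, submitted_at) tuples for the leader board.

    Higher scores come first; equal scores are ordered by earlier
    submission time.
    """
    return sorted(submissions, key=lambda s: (-s[1], s[2]))


# Hypothetical submissions: "alice" and "carol" tie on score.
submissions = [
    ("alice", 0.59, datetime(2017, 4, 22, 9, 44)),
    ("bob", 0.60, datetime(2017, 4, 19, 21, 9)),
    ("carol", 0.59, datetime(2017, 4, 20, 8, 0)),
]
ranked = rank_submissions(submissions)
names = [competitor for competitor, _, _ in ranked]
```

Here "carol" is placed above "alice" because her equal-scoring submission arrived first.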
Top 10 entries
| 1 | cjmcmurtrie | 0.6045 | 19 Apr 2017, 9:09PM BST | 18 |
| 2 | agcaci | 0.5911 | 23 Apr 2017, 12:38PM BST | 7 |
| 3 | Param-eter | 0.5905 | 22 Apr 2017, 9:44AM BST | 11 |
| 4 | chirag.mahapatra | 0.5879 | 25 Apr 2017, 3:12AM BST | 36 |
| 5 | mariofilho | 0.5856 | 25 Apr 2017, 12:10AM BST | 3 |
| 6 | DataGeek | 0.5699 | 14 Apr 2017, 12:58PM BST | 9 |
| 7 | mtilley | 0.5684 | 17 Apr 2017, 5:39PM BST | 9 |
| 8 | Seeff | 0.5483 | 09 Apr 2017, 9:21AM BST | 1 |
| 9 | qingchenwang | 0.5363 | 24 Apr 2017, 11:21PM BST | 9 |
| 10 | schorlton | 0.5204 | 18 Apr 2017, 11:13AM BST | 20 |
Definitions and Interpretation
- “BAE Systems”
- means BAE Systems Applied Intelligence Limited (company number 1337451) whose registered address is Surrey Research Park, Guildford, Surrey GU2 7YP.
- “Data Challenge”
- means a data challenge competition held on the Website.
- “Challenge Materials”
- means the images and data provided to Competitors as part of a Data Challenge as updated from time to time.
- “Competitor”
- means the party uploading an Entry to a Data Challenge via the Website.
- “COTS IPR”
- means Intellectual Property Rights that are commonly used and provided in a standard form and generally made commercially available on standard licence terms which are not typically negotiated by the licensor.
- “COTS Software”
- means software (including open source software) that is commonly used and provided in a standard form and generally made commercially available on standard licence terms which are not typically negotiated by the licensor.
- “Entry”
- means data uploaded to the Website by a Competitor describing that Competitor's response to a Data Challenge.
- “Intellectual Property Rights”
- means (a) copyright, rights related to or affording protection similar to copyright, rights in databases, patents and rights in inventions, semi-conductor topography rights, trade marks, rights in internet domain names and website addresses and other rights in trade names, designs, know-how, trade secrets and other rights in confidential information; (b) applications for registration, and the right to apply for registration, for any of the rights listed at (a) that are capable of being registered in any country or jurisdiction; and (c) all other rights having equivalent or similar effect in any country or jurisdiction.
- “Non-COTS IPR”
- means Intellectual Property Rights that are not COTS IPR.
- “Non-COTS Software”
- means software that is not COTS Software.
- “Solution”
- means the software used to create an Entry.
- “Sponsoring Agencies”
- means the Defence Science and Technology Laboratory (Dstl), the Government Office for Science, MI5 and SIS.
- “UK Government”
- means the government of the United Kingdom acting through the Sponsoring Agencies.
- “Website”
- means the website found at www.datasciencechallenge.org
Data Challenges are hosted, run and judged by BAE Systems as a supplier of services to UK Government. BAE Systems shall be UK Government's authorised representative for these purposes.
These Official Rules govern the relationship between UK Government and each Competitor and are applicable to all Data Challenges.
Additional, specific terms and rules of participation will apply to individual Data Challenges. Competitors should ensure that they are familiar with all of the terms and rules of participation that apply to a particular Data Challenge.
Failure to adhere to these Official Rules and any specific terms and rules of participation applicable to a particular Data Challenge may result in disqualification.
Eligibility to take part in Data Challenges
Data Challenges are open to individuals aged 18 and over. Entries made by or on behalf of corporate entities will not be accepted.
Officers, directors, employees and their immediate families of the Sponsoring Agencies, BAE Systems, Capgemini UK PLC, Roke Manor Research Limited and their respective group companies, contractors and agents may not participate in Data Challenges.
No payment shall be made (whether directly or via a third party/country) to:
any bank account registered and maintained in any country with a score of 37 or less according to Transparency International's Corruption Perceptions Index 2014; or
an individual who is a national and/or resident of, or located in, any country with a score of 37 or less according to Transparency International's Corruption Perceptions Index 2014; or
an individual who UK Government knows or has reason to suspect (or UK Government's authorised representative knows or has reason to suspect) appears on:
the sanctions list maintained by the United Kingdom Foreign and Commonwealth Office (as amended from time to time); or
the Consolidated List of persons, groups and entities subject to EU financial sanctions, as maintained by the European External Action Service (as amended from time to time); or
the Consolidated Screening List as maintained by the United States Government (as amended from time to time).
If UK Government is not (or its authorised representative is not) reasonably satisfied as to the potential recipient's identity, no payment shall be made to that person.
UK Government reserves the right (with or without notice) to update Data Challenges and Challenge Materials during their running. Competitors should regularly check the Website for such updates in order to ensure that they remain familiar with the Data Challenge and are using the latest Challenge Materials. UK Government accepts no liability for any failure on the part of a Competitor in this regard.
Assessment of Entries and Solutions
A maximum of 3 Entries per day per Competitor will be assessed; any Entries in excess of this limit will be disregarded. Entries will be assessed in the order in which they are submitted.
Entries shall be assessed electronically against UK Government's model response (referred to as the “ground truth”) for the relevant Data Challenge.
A Competitor whose Entry is selected as a potential winning Entry shall at its own expense install, configure and make available the Solution in a non-internet facing environment (such that the Solution runs without access to online resources) on an Amazon Web Services Inc. cloud platform (or such other PaaS as UK Government or its authorised representative may approve) for evaluation by UK Government or its authorised representative.
Solutions shall be assessed by, inter alia, their ability to automatically respond to previously unseen data sets and the proximity of their responses to the UK Government's ground truth for the relevant data set.
The winning Competitor will be notified within four weeks of the closing date of the relevant Data Challenge. The judges' decision shall be final and no correspondence shall be entered into with regard to it.
Delivery of Winning Solution
Following successful evaluation of a Solution pursuant to Clause 4.1 above, and to the extent that the same does not comprise COTS Software, the winning Competitor shall deliver a copy of the winning Solution in source code and object code form to UK Government or its authorised representative, together with a description of the resources required and the instructions necessary to build and run the Solution successfully.
As a condition of the award of a prize, the delivered Solution must be capable of being built and run by UK Government or its authorised representative in a non-internet facing environment and generating the winning Entry.
Ownership of and IP in Entries and Solutions
Entries to Data Challenges shall comprise CSV files describing the Competitor's response to the relevant Data Challenge. Once uploaded to the Website, Entries shall become the property of UK Government.
Competitors warrant that their Entries and Solutions are their own original work, or where third party material is incorporated this is used with the necessary permissions, licences or consents, in which case the relevant material and third parties shall be clearly identified.
In respect of any COTS Software and COTS IPR used by a Competitor in a Solution:
the Competitor warrants that it has title to the same or the licences necessary to lawfully use the same for that purpose; and
in respect of a winning Entry, the Competitor shall provide UK Government or its authorised representative with a list of all such COTS Software and COTS IPR together with evidence of its title to or rights to lawfully use the same for that purpose.
In respect of any Non-COTS Software and Non-COTS IPR used by a Competitor in a Solution:
in respect of a winning Entry:
the Competitor shall provide UK Government or its authorised representative with a list of all such Non-COTS Software and Non-COTS IPR; and
the Competitor shall make all such Non-COTS Software and Non-COTS IPR available either generally on an MIT open source licence or by granting UK Government a worldwide, perpetual, irrevocable, royalty-free, non-exclusive, sub-licensable licence to use, modify, adapt, enhance, create derivative works of and exploit all such Non-COTS Software and Non-COTS IPR for any purpose relating to the exercise of UK Government's (or any central UK Government body's) business or function.
Intellectual Property in Data Challenges and Challenge Materials
Intellectual property rights in Data Challenges and Challenge Materials belong to UK Government, its contractors and their respective licensors. Competitors are hereby authorised to download and use the Data Challenge Materials for the purposes of taking part in Data Challenges only.
Competitors shall not reproduce, publish, resell or distribute the Challenge Materials and shall delete the same at the end of the Data Challenge.
Competitors shall not use any Challenge Materials in a defamatory or deceptive context, or in a manner that could be considered libellous, obscene or illegal, or give rise to a claim for unfair competition.
Competitors shall use suitable measures to prevent persons who have not agreed to these Official Rules from gaining access to the Challenge Materials.
Aside from the limited rights described in Clause 7.1 above, participation in a Data Challenge shall not be construed as granting or conferring on the Competitor any title, rights or interests in the Challenge Materials.
Third Party IPR Indemnity
A Competitor shall at all times on written demand indemnify UK Government and BAE Systems (each an “Indemnified Person”) and keep them indemnified against all losses, liabilities, damages, costs and expenses (including legal fees) incurred by or awarded against them arising from any claim or action by a third party that:
the relevant Indemnified Person’s possession and/or use of an Entry or a Solution originating from the Competitor infringes the intellectual property rights of a third party; or
the Competitor’s use of the Challenge Materials is contrary to Clause 7 of these Official Rules.
Exclusion of Warranties and Liability
Subject always to Clause 9.2:
The Website, the Data Challenges and the Challenge Materials are provided “as is” and without warranty as to accuracy, completeness, availability, suitability, or fitness for any particular purpose. All implied conditions, warranties and representations in relation to the provision of the Website, the Data Challenges and the Challenge Materials are hereby excluded.
UK Government excludes all liability, whether in contract, tort (including negligence), breach of statutory duty or otherwise, even if foreseeable, arising under or in connection with use of, inability to use, or reliance on, the Website or any of its content, one or more Data Challenges, or any Challenge Materials.
Nothing in these Official Rules excludes or limits UK Government’s liability:
for death or personal injury caused by its negligence or the negligence of its employees, agents or subcontractors;
for fraud or fraudulent misrepresentation; or
where it would be unlawful to do so.
Third Party Rights
The indemnity at Clause 8 shall be enforceable against a Competitor by BAE Systems. Apart from that, a person who is not a party to this agreement shall not have any rights under the Contracts (Rights of Third Parties) Act 1999 to enforce any term of this agreement, but this does not affect any right or remedy of a third party which exists, or is available, apart from that Act.
Governing law and jurisdiction
These Official Rules shall be governed by and construed in accordance with the laws of England and the courts of England and Wales shall have exclusive jurisdiction in respect of any dispute or claim that arises out of or in connection with them, provided always that where a Competitor is a consumer (that is, an individual acting for purposes which are wholly or mainly outside their trade, business, craft or profession) resident in Northern Ireland, Scotland or another EU Member State they may also bring proceedings in their home jurisdiction.