Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only. If you make use of these datasets please consider citing the publication:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.

All rights, including copyright, in the content of the original articles are owned by the BBC.

BBC news dataset:

- Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
- Class Labels: 5 (business, entertainment, politics, sport, tech).

BBC Sport dataset:

- Consists of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas from 2004-2005.
- Class Labels: 5 (athletics, cricket, football, rugby, tennis).

The datasets have been pre-processed as follows: stemming (Porter algorithm), stop-word removal (stop word list) and low term frequency filtering (count < 3) have already been applied to the data.

The files contained in the archives have the following formats:

- *.mtx: Original term frequencies stored in a sparse data matrix in Matrix Market format.
- *.terms: List of content-bearing terms in the corpus, with each line corresponding to a row of the sparse data matrix.
- *.docs: List of document identifiers, with each line corresponding to a column of the sparse data matrix.
- *.classes: Assignment of documents to natural classes, with each line corresponding to a document.
- *.urls: Links to original articles, where appropriate.

For further information please contact Derek Greene.
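The file formats described above (terms as matrix rows, documents as matrix columns) can be read with a few lines of standard-library Python. The following is a minimal sketch, not the authors' own tooling: the helper names `read_matrix_market`, `read_lines` and `term_counts` are hypothetical, the tiny in-memory sample is made up for illustration, and only the sparse "coordinate" layout of Matrix Market is handled.

```python
import io


def read_matrix_market(handle):
    """Parse a Matrix Market coordinate file into (n_rows, n_cols, entries).

    entries maps zero-based (row, col) pairs to float values.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("missing Matrix Market header line")
    if "coordinate" not in header.lower():
        raise ValueError("only the sparse coordinate layout is handled here")

    size = None
    entries = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        fields = line.split()
        if size is None:
            # First data line: number of rows, columns and non-zero entries.
            size = (int(fields[0]), int(fields[1]), int(fields[2]))
            continue
        # Matrix Market indices are one-based; convert to zero-based.
        row, col = int(fields[0]) - 1, int(fields[1]) - 1
        entries[(row, col)] = float(fields[2])

    if size is None:
        raise ValueError("missing size line")
    n_rows, n_cols, n_nonzero = size
    if len(entries) != n_nonzero:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


def read_lines(handle):
    """Return the non-empty lines of a *.terms or *.docs style file."""
    return [line.strip() for line in handle if line.strip()]


# Made-up sample data standing in for the *.mtx, *.terms and *.docs files.
SAMPLE_MTX = """%%MatrixMarket matrix coordinate real general
3 2 4
1 1 2.0
2 1 1.0
2 2 3.0
3 2 5.0
"""
SAMPLE_TERMS = "goal\nmatch\nwicket\n"
SAMPLE_DOCS = "football.001\ncricket.001\n"

n_terms, n_docs, counts = read_matrix_market(io.StringIO(SAMPLE_MTX))
terms = read_lines(io.StringIO(SAMPLE_TERMS))
docs = read_lines(io.StringIO(SAMPLE_DOCS))

# Rows correspond to terms and columns to documents, as described above.
assert len(terms) == n_terms and len(docs) == n_docs


def term_counts(doc_index):
    """Return {term: frequency} for the document in the given column."""
    return {terms[r]: v for (r, c), v in counts.items() if c == doc_index}


for index, doc_id in enumerate(docs):
    print(doc_id, term_counts(index))
```

To use this on the real archives, replace the `io.StringIO(...)` objects with files opened in text mode. If SciPy is installed, `scipy.io.mmread` reads the same Matrix Market format directly into a sparse matrix.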