Cora dataset

anna_zr

Reposted from: http://blog.sina.com.cn/s/blog_4c98b96001000boc.html (苯苯的小田园)

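This took a lot of searching, so I am writing it down. Thanks to the paper "Object Identification with Attribute-Mediated Dependences" for pointing to the source of the Cora dataset:
http://www.cs.umass.edu/~mccallum/data/ (if the copied link does not open, type it into the address bar by hand)
   The paper "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification" provided the hint that Cora has 6 top-level categories and 36 subcategories. That finally resolved the relevance problem.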

(a) cora-refs.tar.gz dataset
Cora Citation Matching [reference matching, object correspondence] Text of citations hand-clustered into groups referring to the same paper.
(b) cora-ie.tar.gz dataset
Cora Information Extraction [information extraction] Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
(c) cora-classify.tar.gz dataset
Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
(d) cora-hmm.tar.gz
   Cora HMM is the C implementation of HMMs used for information extraction in Cora. It was written by Kristie Seymore.

Cora readme

   Note that in Cora there are two types of papers: those we found on the
Web, and those that are referenced in bibliography sections. It is
possible that a paper we found on the Web is also referenced by other
papers.

FILE SUMMARY:

* The file 'papers' contains limited information on the papers we found
on the Web.

* The file 'citations' contains the citation graph.

* The file 'classifications' contains class labels.

* The directory 'extractions' contains the extracted authors, title,
abstract, etc., plus the references (and in some cases surrounding
text), from the postscript papers we found on the Web.

PAPERS

The file `papers' has a list of all the postscript file papers.
Three fields, tab separated:

   <id> <filename> <citation string>

There are about 52000 lines in this file, but there are a bunch
of papers that have more than one postscript file. If you eliminate
lines with duplicate ids there are about 37000 papers. Note the
citation string is either (1) an arbitrary bibliography reference to
the paper, if one was made or (2) a constructed entry based on the
authors and title extracted from the postscript file.
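The de-duplication described above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `read_papers` and the sample lines are made up, and keeping the first line seen for each id is one arbitrary choice among several.

```python
from typing import Dict, Iterable, Tuple


def read_papers(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each paper id to (filename, citation string), keeping the first line per id."""
    papers: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first two tabs only, so the citation string stays intact.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed lines rather than guess at their meaning
        paper_id, filename, citation = parts
        # setdefault keeps the first entry and ignores later duplicates of the id.
        papers.setdefault(paper_id, (filename, citation))
    return papers


# Made-up sample lines, for illustration only (not taken from the real file).
sample = [
    "1\thttp:##example.org#a.ps\tA. Author. Some title. 1997.\n",
    "1\thttp:##example.org#a2.ps\tA. Author. Some title. 1997.\n",
    "2\thttp:##example.org#b.ps\tB. Author. Another title. 1998.\n",
]
papers = read_papers(sample)
print(len(papers))
```

With the real file, the same function can be given an open file handle instead of the sample list; the text encoding of the file is not stated in the readme, so it would have to be chosen by inspection.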

CITATIONS

The file 'citations' has the citation graph. Two fields, tab
separated:

   <referring_id> <cited_id>

The referring_id is the id of the paper that has the bibliography
section (always one we have postscript for). The cited_id is the
paper referenced (we may or may not have postscript for it). There
are about 715000 citations.
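As a rough sketch of how the two-column format above could be loaded, the following Python builds an adjacency map and counts how often each paper is cited. The names `read_citations` and `citation_counts` and the sample ids are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Set


def read_citations(lines: Iterable[str]) -> DefaultDict[str, Set[str]]:
    """Build referring_id -> set of cited_ids from tab-separated lines."""
    cites: DefaultDict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # ignore lines that do not have exactly two fields
        referring_id, cited_id = fields
        cites[referring_id].add(cited_id)  # a set drops repeated edges
    return cites


def citation_counts(cites: DefaultDict[str, Set[str]]) -> Counter:
    """Count how many distinct papers cite each cited_id."""
    counts: Counter = Counter()
    for cited in cites.values():
        counts.update(cited)
    return counts


# Made-up edges, for illustration only.
sample_edges = ["10\t20\n", "10\t30\n", "11\t20\n", "10\t20\n"]
graph = read_citations(sample_edges)
counts = citation_counts(graph)
print(sorted(graph["10"]))
print(counts["20"])
```

Because a cited paper may have no postscript file, some ids in `counts` would never appear as keys of `graph` when run on the real data.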

CITATIONS.WITHAUTHORS

The file 'citations.withauthors' contains another copy of the
citation graph. This time we have also included authors and file
names of each paper in addition to each paper's unique paper_id and
the paper_ids of the references it makes. The format of this file
is:

   ***
   this_paper_id
   filename
   id_of_first_cited_paper
   id_of_second_cited_paper
   .
   .
   .
   *
   Author#1 (of this paper)
   Author#2
   .
   .
   .
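A minimal Python sketch of reading this record layout follows. It assumes that a line containing only `***` starts a record, that a line containing only `*` separates the cited ids from the authors, and that leading indentation is not significant; the `Record` class and the sample data are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Record:
    paper_id: str
    filename: str
    cited_ids: List[str] = field(default_factory=list)
    authors: List[str] = field(default_factory=list)


def _build(block: List[str]) -> Optional[Record]:
    """Turn the lines between two '***' markers into a Record."""
    if "*" in block:
        split = block.index("*")
        head, authors = block[:split], block[split + 1:]
    else:
        head, authors = block, []
    if len(head) < 2:
        return None  # need at least a paper id and a filename
    return Record(head[0], head[1], list(head[2:]), list(authors))


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per '***'-delimited block."""
    block: Optional[List[str]] = None
    for raw in lines:
        line = raw.strip()
        if line == "***":
            if block is not None:
                record = _build(block)
                if record is not None:
                    yield record
            block = []
        elif block is not None and line:
            block.append(line)
    if block is not None:
        record = _build(block)
        if record is not None:
            yield record


# Made-up sample, for illustration only.
sample = [
    "***\n", "100\n", "http:##example.org#a.ps\n", "200\n", "300\n",
    "*\n", "A. Author\n", "B. Author\n",
    "***\n", "101\n", "http:##example.org#b.ps\n", "*\n", "C. Author\n",
]
records = list(read_records(sample))
print(records[0].cited_ids, records[0].authors)
```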

CLASSIFICATIONS

The file `classifications' contains the research topic classifications
for each of the files. The format of the file is:
"filename"+"\t"+"classification". For example:

http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps    /Information_Retrieval/Retrieval/

The file name is the URL the paper came from, translated to a file
name by changing / to #. The classification is the label name in the
Cora directory hierarchy.
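The example line above can be taken apart with a few lines of Python. This is a sketch only: `parse_classification` is a made-up name, and turning every # back into / is an assumption that would be wrong for any URL that originally contained a # character.

```python
from typing import List, Tuple


def parse_classification(line: str) -> Tuple[str, List[str]]:
    """Split a classifications line into (url, list of topic levels)."""
    filename, label = line.rstrip("\n").split("\t", 1)
    # Undo the / -> # translation described in the readme (lossy if the
    # original URL itself contained '#').
    url = filename.replace("#", "/")
    # "/Information_Retrieval/Retrieval/" -> ["Information_Retrieval", "Retrieval"]
    topics = [part for part in label.strip().split("/") if part]
    return url, topics


line = (
    "http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps"
    "\t/Information_Retrieval/Retrieval/"
)
url, topics = parse_classification(line)
print(url)
print(topics)
```

The first element of `topics` is the top-level category and the last is the leaf, which is convenient when grouping papers at different depths of the hierarchy.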

Note that the class labels were not perfectly assigned.

EXTRACTIONS

The directory 'extractions' contains 52906 files, one for each
postscript paper that we found on the Web. The directory contains so
many files that you probably don't want to 'ls' it. Commands like
`find extractions -print' will probably work more efficiently.
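From Python, a comparable way to walk a very large directory without first building and sorting the full listing is `os.scandir`, which yields entries lazily. The sketch below runs against a temporary directory so it is self-contained; the helper name `iter_extraction_files` is made up, and on the real data the argument would be the path to the 'extractions' directory.

```python
import os
import tempfile
from typing import Iterator


def iter_extraction_files(directory: str) -> Iterator[str]:
    """Yield the path of each regular file in directory, one at a time."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


# Demonstration on a throwaway directory with three empty files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a", "b", "c"):
        open(os.path.join(tmp, name), "w").close()
    count = sum(1 for _ in iter_extraction_files(tmp))
print(count)
```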

Each filename in the 'papers' file should have a file here. I believe
there are also some (perhaps many?) extra files in this tarball that
are not in paper-data that you can just ignore.

Each line of each file corresponds to some bit of data about the
postscript file. Most of the MIME-like field tags are
straightforward and self-explanatory. A few notes:

The fields URL, Refering-URL, and Root-URL are given by the spider.
All other fields are extracted automatically from the text, some by
hand-coded regular expressions and some by an HMM information
extractor.

The fields Abstract-found and Intro-found are binary valued indicators
of whether Abstract and/or Introduction sections were found by some
regular expression matching in the paper.

Each Reference field is one bibliography entry found at the end of the
paper. Note they are marked up using SGML-like tags. Each Reference
field is optionally followed by one (and possibly more?)
Reference-context fields that are snippets of the postscript file
around where the reference was cited.
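To make the field layout concrete, here is a hedged Python sketch that reads one extraction file into a dictionary of fields and attaches each Reference-context to the Reference before it. It assumes every line has the form `Tag: value`, which the readme implies by calling the tags MIME-like but does not spell out; the function name, the sample lines, and the SGML-like tag names in them are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_extraction(lines: Iterable[str]) -> Tuple[Dict[str, List[str]], List[dict]]:
    """Return (fields, references) parsed from 'Tag: value' lines."""
    fields: Dict[str, List[str]] = defaultdict(list)
    references: List[dict] = []
    for raw in lines:
        # Split at the first colon only, so URLs in the value stay intact.
        tag, sep, value = raw.rstrip("\n").partition(":")
        if not sep:
            continue  # not a 'Tag: value' line under our assumption
        tag, value = tag.strip(), value.strip()
        if tag == "Reference":
            references.append({"text": value, "contexts": []})
        elif tag == "Reference-context":
            if references:  # attach to the most recent Reference
                references[-1]["contexts"].append(value)
        else:
            fields[tag].append(value)
    return dict(fields), references


# Made-up sample lines, for illustration only.
sample = [
    "URL: http://example.org/a.ps\n",
    "Abstract-found: 1\n",
    "Reference: <author> A. Author </author> <title> Some title </title>\n",
    "Reference-context: as shown in [1]\n",
    "Reference: <author> B. Author </author> <title> Another title </title>\n",
]
fields, references = parse_extraction(sample)
print(fields["URL"], len(references))
```

If the real files turn out to wrap long values across several lines, this line-by-line approach would need a continuation rule, which is left out here.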

2009-06-23 09:55