anna_zr

Cora dataset

Reposted from: http://blog.sina.com.cn/s/blog_4c98b96001000boc.html -- 苯苯的小田园

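This took a lot of searching, so I am writing it down. Thanks to the paper Object Identification with
Attribute-Mediated Dependences for pointing to the source of the Cora dataset:
http://www.cs.umass.edu/~mccallum/data/ (if copying and pasting the link does not open it, type it into the address bar by hand)
   The paper A Pitfall and Solution in Multi-Class Feature Selection for Text Classification gave me the hint that Cora has 6 top-level categories and 36 subcategories. That finally resolved my relevance problem.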

(a) cora-refs.tar.gz dataset
Cora Citation Matching [reference matching, object correspondence] Text of citations hand-clustered into groups referring to the same paper.
(b) cora-ie.tar.gz dataset
Cora Information Extraction [information extraction] Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
(c) cora-classify.tar.gz dataset
Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
(d) cora-hmm.tar.gz
   Cora HMM is the C implementation of HMMs used for information extraction in Cora. It was written by Kristie Seymore.


Cora readme

   Note that in Cora there are two types of papers: those we found on the
Web, and those that are referenced in bibliography sections.  It is
possible that a paper we found on the Web is also referenced by other
papers.


FILE SUMMARY:

* The file 'papers' contains limited information on the papers we found
on the Web.

* The file 'citations' contains the citation graph.

* The file 'classifications' contains class labels.

* The directory `extractions' contains the extracted authors, title,
abstract, etc., plus the references (and in some cases surrounding
text), from the postscript papers we found on the Web.


PAPERS

The file `papers' has a list of all the postscript file papers.
Three fields, tab separated:

   <id> <filename> <citation string>

There are about 52000 lines in this file, but there are a bunch
of papers that have more than one postscript file.  If you eliminate
lines with duplicate ids there are about 37000 papers.  Note the
citation string is either (1) an arbitrary bibliography reference to
the paper, if one was made or (2) a constructed entry based on the
authors and title extracted from the postscript file.
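
The deduplication described above can be sketched in Python. This is a minimal sketch, not part of the original tarball: the function name `load_papers` is made up for illustration, and keeping the first line seen for each id is an assumption, since the README does not say which duplicate to prefer.

```python
from typing import Dict, Iterable, Tuple


def load_papers(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each paper id to the (filename, citation string) of its first line."""
    papers: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Three tab-separated fields: <id> <filename> <citation string>.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed lines instead of failing
        paper_id, filename, citation = parts
        # Assumption: keep the first postscript file seen for a duplicated id.
        papers.setdefault(paper_id, (filename, citation))
    return papers
```

To use it, pass an open file handle, for example `load_papers(open("papers", encoding="latin-1"))`. The encoding is a guess; adjust it if the file does not decode cleanly.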


CITATIONS

The file 'citations' has the citation graph.  Two fields, tab
separated:

   <referring_id> <cited_id>

The referring_id is the id of the paper that has the bibliography
section (always one we have postscript for).  The cited_id is the
paper referenced (we may or may not have postscript for it).  There
are about 715000 citations.
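
As a rough sketch, the two-column file can be read into an adjacency map, from which per-paper citation counts follow directly. The function names here are hypothetical, and ids are kept as strings because the README does not state their type.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set


def load_citation_graph(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each referring_id to the set of cited_ids in its bibliography."""
    cites: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        referring_id, cited_id = fields
        cites[referring_id].add(cited_id)
    return cites


def citation_counts(cites: Dict[str, Set[str]]) -> Counter:
    """Count how many distinct papers cite each cited_id."""
    return Counter(cited for targets in cites.values() for cited in targets)
```

Because a cited_id may belong to a paper with no postscript file, some ids in the counts will not appear as keys of the adjacency map.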


CITATIONS.WITHAUTHORS

The file 'citations.withauthors' contains another copy of the
citation graph.  This time we have also included authors and file
names of each paper in addition to each paper's unique paper_id and
the paper_ids of the references it makes. The format of this file
is:

   ***
   this_paper_id
   filename
   id_of_first_cited_paper
   id_of_second_cited_paper
   .
   .
   .
   *
   Author#1 (of this paper)
   Author#2
   .
   .
   .
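
A record-oriented format like this is easiest to read with a small state machine. The sketch below assumes that `***` and `*` each appear alone on a line and that leading indentation is not significant; the function name and dictionary keys are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_with_authors(lines: Iterable[str]) -> List[Dict]:
    """Parse citations.withauthors-style records into dictionaries."""
    records: List[Dict] = []
    current = None
    in_authors = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "***":
            # Start of a new record.
            current = {"paper_id": None, "filename": None,
                       "cited_ids": [], "authors": []}
            records.append(current)
            in_authors = False
        elif current is None:
            continue  # ignore anything before the first record
        elif line == "*":
            # Separator between cited ids and author names.
            in_authors = True
        elif in_authors:
            current["authors"].append(line)
        elif current["paper_id"] is None:
            current["paper_id"] = line
        elif current["filename"] is None:
            current["filename"] = line
        else:
            current["cited_ids"].append(line)
    return records
```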

CLASSIFICATIONS

The file `classifications' contains the research topic classifications
for each of the files. The format of the file is:
"filename"+"\t"+"classification".  For example:

  http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps    /Information_Retrieval/Retrieval/


The file name is the URL where the paper came from, translated to a
file name by changing / to #.  The classification is the label name
in the Cora directory hierarchy.

Note that the class labels were not perfectly assigned.
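
One way to read a classification line is to split it on the tab, reverse the / to # substitution, and break the label into its path components. This is a sketch with a hypothetical function name. The reverse mapping is lossy if an original URL already contained a # character, so treat the recovered URL as approximate.

```python
from typing import List, Tuple


def parse_classification(line: str) -> Tuple[str, str, List[str]]:
    """Return (filename, recovered URL, topic path) for one line."""
    filename, label = line.rstrip("\n").split("\t", 1)
    filename = filename.strip()
    # Undo the / -> # translation used to build the file name.
    url = filename.replace("#", "/")
    # "/Information_Retrieval/Retrieval/" -> ["Information_Retrieval", "Retrieval"]
    topics = [part for part in label.strip().split("/") if part]
    return filename, url, topics
```

The first element of the topic path is the top-level category and the last is the leaf of the hierarchy.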


EXTRACTIONS

The directory 'extractions' contains 52906 files, one for each
postscript paper that we found on the Web.  The directory contains so
many files, that you probably don't want to 'ls' it.  Commands like
`find extractions -print' will probably work more efficiently.
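
From Python, the same concern applies: a lazy directory scan avoids building and sorting one large list of names. A minimal sketch using the standard library's `os.scandir` follows; the function name is made up for illustration.

```python
import os
from typing import Iterator


def iter_extraction_files(directory: str) -> Iterator[str]:
    """Yield the names of regular files in a directory, one at a time."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name
```

The names come back in arbitrary order, so sort them only if a later step needs a stable order.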

Each filename in the 'papers' file should have a file here.  I believe
there are also some (perhaps many?) extra files in this tarball that
are not in paper-data that you can just ignore.

Each line of each file corresponds to some bit of data about the
postscript file.  Most of the MIME-like field tags are
straightforward and explanatory.  A few notes:

The fields URL, Refering-URL, and Root-URL are given by the spider.
All other fields are extracted automatically from the text, some by
hand-coded regular expressions and some by an HMM information
extractor.

The fields Abstract-found and Intro-found are binary valued indicators
of whether Abstract and/or Introduction sections were found by some
regular expression matching in the paper.

Each Reference field is one bibliography entry found at the end of the
paper.  Note they are marked up using SGML-like tags.  Each Reference
field is optionally followed by one (and possibly more?)
Reference-context fields that are snippets of the postscript file
around where the reference was cited.
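
The sketch below shows one way to read such a file and attach each Reference-context to the Reference before it. It assumes every line has the form `Tag: value`, which is only a guess based on the "MIME-like" description, so check a real file before relying on it. Both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_extraction(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each line into a (tag, value) pair at the first colon."""
    fields: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if ":" not in line:
            continue  # assumption: lines without a tag carry no field
        tag, value = line.split(":", 1)
        fields.append((tag.strip(), value.strip()))
    return fields


def group_references(fields: List[Tuple[str, str]]) -> List[Dict]:
    """Attach each Reference-context to the Reference that precedes it."""
    references: List[Dict] = []
    for tag, value in fields:
        if tag == "Reference":
            references.append({"reference": value, "contexts": []})
        elif tag == "Reference-context" and references:
            references[-1]["contexts"].append(value)
    return references
```

Fields are kept as an ordered list rather than a dictionary because tags such as Reference repeat within a file.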