Search engine and digital library for scientific and academic papers
CiteSeerX (formerly called CiteSeer) is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science.
CiteSeer's goal is to improve the dissemination of and access to academic and scientific literature. As a non-profit service that anyone can use freely, it has been considered part of the open access movement, which is attempting to change academic and scientific publishing to allow greater access to scientific literature. CiteSeer freely provided Open Archives Initiative metadata for all indexed documents and, where possible, linked indexed documents to other sources of metadata such as DBLP and the ACM Portal. To promote open data, CiteSeerX shares its data for non-commercial purposes under a Creative Commons license.[1]
CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search.[2] CiteSeer-like engines and archives usually only harvest documents from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.
CiteSeer changed its name to ResearchIndex at one point and then changed it back.[3]
History
CiteSeer and CiteSeer.IST
CiteSeer was created by researchers Lee Giles, Kurt Bollacker and Steve Lawrence in 1997 while they were at the NEC Research Institute (now NEC Labs), Princeton, New Jersey, US. CiteSeer's goal was to actively crawl and harvest academic and scientific documents on the web and use autonomous citation indexing to permit querying by citation or by document, ranking them by citation impact. At one point, it was called ResearchIndex.
CiteSeer became public in 1998 and had many new features unavailable in academic search engines at that time. These included:
- Autonomous Citation Indexing automatically created a citation index that can be used for literature search and evaluation.
- Citation statistics and related documents were computed for all articles cited in the database, not just the indexed articles.
- Reference linking, allowing browsing of the database using citation links.
- Citation context showed the context of citations to a given paper, allowing a researcher to quickly and easily see what other researchers have to say about an article of interest.
- Related documents were shown using citation-based and word-based measures, and an active, continuously updated bibliography was shown for each document.
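The ideas behind citation indexing, citation statistics for non-indexed articles, and citation context can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not CiteSeer's actual implementation: the corpus, the paper keys (`P1`, `P2`, `X9`), the bracketed-number citation style, and the function `build_citation_index` are all hypothetical, and references are assumed to be already resolved to keys.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus. Each indexed paper has body text and a mapping
# from in-text citation markers to the key of the cited work.
papers = {
    "P1": {
        "text": "Earlier work [1] introduced the crawler. We extend [2] with ranking.",
        "refs": {"1": "P2", "2": "X9"},  # X9 is cited but not itself indexed
    },
    "P2": {
        "text": "Our parser follows [1] closely.",
        "refs": {"1": "X9"},
    },
}


def build_citation_index(papers, window=30):
    """Map each cited key to a list of (citing paper, context snippet)."""
    index = defaultdict(list)
    for paper_id, paper in papers.items():
        text = paper["text"]
        for match in re.finditer(r"\[(\d+)\]", text):
            cited = paper["refs"].get(match.group(1))
            if cited is None:
                continue  # marker with no resolved reference
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            index[cited].append((paper_id, text[start:end].strip()))
    return index


index = build_citation_index(papers)

# Citation counts cover every cited work, including the non-indexed X9.
counts = {key: len(entries) for key, entries in index.items()}

# Rank by citation count, breaking ties by key.
ranked = sorted(counts, key=lambda key: (-counts[key], key))

for key in ranked:
    print(key, counts[key], [context for _, context in index[key]])
```

In this sketch the work `X9` receives a citation count and a list of citation contexts even though it is not in the corpus, which mirrors the point that statistics were computed for all cited articles, not just the indexed ones.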
CiteSeer was granted United States patent # 6289342, titled "Autonomous citation indexing and literature browsing using citation context", on September 11, 2001. The patent was filed on May 20, 1998, and has priority to January 5, 1998. A continuation patent (US Patent # 6738780) was filed on May 16, 2001, and granted on May 18, 2004.[citation needed]
After NEC, in 2004 it was hosted as CiteSeer.IST on the World Wide Web at the College of Information Sciences and Technology, The Pennsylvania State University, and had over 700,000 documents. For enhanced access, performance and research, similar versions of CiteSeer were supported at universities such as the Massachusetts Institute of Technology, University of Zurich and the National University of Singapore. However, these versions of CiteSeer proved difficult to maintain and are no longer available. Because CiteSeer only indexes freely available papers on the web and does not have access to publisher metadata, it returns lower citation counts than sites such as Google Scholar that have publisher metadata.
CiteSeer had not been comprehensively updated since 2005 due to limitations in its architecture design. It had a representative sampling of research documents in computer and information science, but its coverage was restricted to papers that were publicly available, usually at an author's homepage, or that were submitted by an author. To overcome some of these limitations, a modular and open source architecture for CiteSeer was designed: CiteSeerX.
CiteSeerX
CiteSeerX replaced CiteSeer and all queries to CiteSeer were redirected. CiteSeerX is a public search engine, digital library and repository for scientific and academic papers, primarily with a focus on computer and information science.[4] More recently, CiteSeerX has been expanding into other scholarly domains such as economics and physics. Released in 2008, it was loosely based on the previous CiteSeer search engine and digital library and is built with a new open source infrastructure, SeerSuite, and new algorithms and their implementations. It was developed by researchers Isaac Councill and C. Lee Giles at the College of Information Sciences and Technology, Pennsylvania State University. It continues to support the goals outlined by CiteSeer: to actively crawl and harvest academic and scientific documents on the public web, and to permit querying by citation and ranking of documents by citation impact. Lee Giles, Prasenjit Mitra, Susan Gauch, Min-Yen Kan, Pradeep Teregowda, Juan Pablo Fernandez Ramirez, Pucktada Treeratpituk, Jian Wu, Douglas Jordan, Steve Carman, Jack Carroll, Jim Jansen, and Shuyi Zheng are or have been actively involved in its development. A table search feature was later introduced.[5] It has been funded by the National Science Foundation, NASA, and Microsoft Research.
CiteSeerX continues to be rated as one of the world's top repositories, and was rated number 1 in July 2010.[6] It currently has over 6 million documents with nearly 6 million unique authors and 120 million citations.[timeframe?]
CiteSeerX also shares its software, data, databases and metadata with other researchers, currently via Amazon S3 and rsync.[7] Its new modular open source architecture and software (previously available on SourceForge but now on GitHub) are built on Apache Solr and other Apache and open source tools, which allows it to be a testbed for new algorithms in document harvesting, ranking, indexing, and information extraction.
CiteSeerX caches some PDF files that it has scanned. As such, each page includes a DMCA link which can be used to report copyright violations.[8]
Current features
CiteSeerX uses automated information extraction tools, usually built on machine learning methods such as ParsCit, to extract scholarly document metadata such as title, authors, abstract and citations. As a result, there are sometimes errors in authors and titles. Other academic search engines have similar errors.
Focused crawling
CiteSeerX crawls publicly available scholarly documents, primarily from author webpages and other open resources, and does not have access to publisher metadata. As a result, citation counts in CiteSeerX are usually lower than those in Google Scholar and Microsoft Academic Search, which have access to publisher metadata.
Usage
CiteSeerX has nearly one million users worldwide based on unique IP addresses and has millions of hits daily. Annual downloads of document PDFs were nearly 200 million for 2015.
Data
CiteSeerX data is regularly shared under a Creative Commons BY-NC-SA license with researchers worldwide and has been used in many experiments and competitions.
Thanks to its OAI-PMH endpoint,[9] CiteSeerX is an open archive, and its content is indexed like that of an institutional repository by academic search engines and other consumers, for instance BASE and Unpaywall.
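As a rough illustration of how a harvester might query a single record over OAI-PMH, the sketch below builds a `GetRecord` request and reads Dublin Core titles from a response. The `verb`, `identifier` and `metadataPrefix` parameters and the `oai_dc` format come from the OAI-PMH specification; everything else is an assumption for illustration: the base URL is a placeholder rather than the real CiteSeerX endpoint, the record identifier is hypothetical, and the sample XML is hand-written rather than captured from any server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the repository's real base URL.
BASE_URL = "https://repository.example.org/oai2"

# Namespace of unqualified Dublin Core elements used by the oai_dc format.
DC = "{http://purl.org/dc/elements/1.1/}"


def get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def dc_titles(xml_text):
    """Return the text of every Dublin Core <title> element in a response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{DC}title")]


# Hand-written, heavily trimmed sample response (not real server output).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Title</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
""".strip()

url = get_record_url("oai:example.org:record-1")  # hypothetical identifier
print(url)
print(dc_titles(SAMPLE_RESPONSE))
```

The sketch deliberately makes no network request, so it runs offline; a real harvester would fetch the URL, handle OAI-PMH error responses, and follow resumption tokens when listing many records.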
Other SeerSuite-based search engines
The CiteSeer model had been extended to cover academic documents in business with SmealSearch and in e-business with eBizSearch. However, these were not maintained by their sponsors. An older version of both could once be found at BizSeer.IST but is no longer in service.
Other Seer-like search and repository systems have been built for chemistry (ChemXSeer) and for archaeology (ArchSeer). Another, BotSeer, was built for robots.txt file search. All of these are built on the open source tool SeerSuite, which uses the open source indexer Lucene.
References
- [1] "CiteSeerX Data Policy". Archived from the original on 2012-01-05. Retrieved 2015-11-10.
- [2] Kodakateri Pudhiyaveetil, Ajith; Gauch, Susan; Luong, Hiep; Eno, Josh (2009). "Conceptual recommender system for CiteSeerX". Proceedings of the third ACM conference on Recommender systems. New York, New York, US: ACM Press. p. 241. doi:10.1145/1639714.1639758. ISBN 978-1-60558-435-5. S2CID 13900679.
- [3] Lawrence, Steve (2001). "ResearchIndex: Inside the world's largest free full-text index of scientific literature". Proceedings of the international conference on Knowledge capture - K-CAP 2001. p. 3. doi:10.1145/500737.500740. ISBN 1-58113-380-4. S2CID 19592721.
- [4] "About CiteSeerX". Archived from the original on 2010-07-22. Retrieved 2010-05-07.
- [5] "The CiteSeerX Team". Pennsylvania State University. Archived from the original on 2018-07-26. Retrieved 2018-05-01.
- [6] "Ranking Web of World Repositories: Top 800 Repositories". Cybermetrics Lab. July 2010. Archived from the original on 2010-07-24. Retrieved 2010-07-24.
- [7] "About CiteSeerX Data". Pennsylvania State University. Archived from the original on 2012-01-05. Retrieved 2012-01-25.
- [8] For example, "CiteSeerx - DMCA Notice". CiteSeerX 10.1.1.604.4916. Archived from the original on 2022-03-18. "The document with the identifier "10.1.1.604.4916" has been removed due to a DMCA takedown notice. If you believe the removal has been in error, please contact us through the feedback page, along with the identifier mentioned in this page."
- [9] Hirst, Tony (2011-12-08). "Using OAI-PMH as a Single Record Level Query Interface to Citeseer". Archived from the original on 2020-11-24. Retrieved 2020-04-25.