Wikipedia how-to guide about sourcing
"Wikipedia:GOOGLETEST" redirects here. For the argument about "many google hits" in deletion discussions, see
WP:GOOGLEHITS
.
This page in a nutshell:
Measuring is easy. What's hard is knowing
what
it is you're measuring and what your measurement can
mean
. Web searches test the understanding of the
WP:Five pillars
of Wikipedia.
A
search engine
lists
web pages
on the
Internet
. This facilitates research by offering an immediate variety of applicable options. Possibly useful items on the results list include the source material or the electronic tools that a web site can provide, such as a dictionary, but the list itself, as a whole, can also indicate important information. However, discerning that information may require insight.
Search engine results can help editors retain (what is
notable
) or delete (what is not
verifiable
) source material, depending on their reliability. There is a high demand for
reliability on Wikipedia
. Discerning the reliability of the source material is a core skill for using the web, while the wiki itself only facilitates the creation of multiple drafts. As presentations and deletions progress, this
variety
of choices for input tends to produce the desired objective:
a neutral viewpoint
. Depending on the type of query and kind of search engine, this variety can open up to a single author.
Some search engine tests
- Popularity: See Google's trending tool below.
- Usage: Identify a term's notability. (See for example Google's ngram tool.)
- Genuineness: Identify a spurious hoax or an urban legend.
- Notability: Decide whether a page should be nominated for deletion.
- Existence: Discover what sources (including websites) actually exist for possible presentation.
- Information: Review the reliability of facts and citations.
- Names and terminology: Identify the names used for things (including alternative names and terminology).
- Copyrighting: Identify whether material is copied, and if so, check the licensing.
This page describes both these web search tests and the web search tools that can help develop Wikipedia, and it describes their biases and their limitations.
The advantages of a specific search engine can be distinguished by
using
a variety of common search engines. The distinct advantages of each are their user interface and, less obviously, their algorithms for compiling and searching their own indexes.
Because web crawlers can be blocked (specific ones or all of them), different search engines can list different web sites, and there are more web sites available by URL than are indexed in any database.
The most common search engines are
Google
,
Bing
, and
Yahoo
.
Specialized search engines
exist for
medicine
,
science
,
news
and
law
amongst others. Several generalized search engines exist. These adapt your query to many search engines. See
§ Common search engines
below. This page mostly uses
Google
instead of
Bing
or
Yahoo
, but aims for generality where it can. For example, it describes
Google Groups
(usenet groups),
Google Scholar
(academia),
Google News
, and
Google Books
.
Good-faith searching: a rule of thumb
If an unsourced addition to an article appears plausible, consider taking a moment to use a suitable search engine to find a reliable source before deciding whether to revert.
Search engine tests
Depending on the subject matter, and how carefully it is used, a search engine test can be very effective and helpful, or produce misleading or non-useful results. In most cases, a search engine test is a first-pass
heuristic
or "
rule of thumb
".
What a search test can do, and what it can't
A search engine can index pages and text which others have placed on the internet, just like a big index at the back of a book.
Search engines can:
- Provide information and lead to pages that assist with the above goals
- Confirm "who's reported to have said what" according to sources (useful for neutral citing)
- Often provide full cited copies of source documents
- Confirm roughly how popularly referenced an expression is.
Note, however, that Google searches may report vastly more hits than will ever be returned to the user, especially for exact quoted expressions.
For example, a Google search for "the green goldfish", with quotes, in 2021 initially reported around 209,000 results, yet paging through to the last results page showed that only 303 hits were actually returned. See also
here
to calculate statistical significance.
[1]
- Search more specifically within certain websites, or for combined and alternative phrases (or excluding certain words and phrases that would otherwise confuse the results).
Search engines cannot:
- Guarantee the results are reliable or "true" (search engines index whatever text people choose to put online, true or false).
- Guarantee
why
something is mentioned a lot, and that it isn't due to
marketing
, reposting as an
internet meme
,
spamming
, or self-promotion, rather than importance.
- Guarantee that the results reflect the uses you mean, rather than other uses. (E.g., a search for a specific John Smith may pick up many "John Smiths" who aren't the one meant, many pages containing "John" and "Smith" separately,
and also
miss out all the useful references indexed under "J. Smith" or, if the term is put in quotes, "John Michael Smith" and "Smith, John")
- Guarantee you aren't missing crucial references through choice of search expression.
- Guarantee that little-mentioned or unmentioned items are automatically unimportant.
- Guarantee that a particular result is the
original
instance of a piece of text and not a reprint, excerpt, quotation, misquotation, or copyright violation.
Search engines often will not:
- Provide the latest research in depth to the same extent as journals and books, for rapidly developing subjects.
- Be
neutral
.
A search engine test
cannot help you avoid
the work of
interpreting your results
and deciding what they really show. Appearance in an index alone is not usually proof of anything.
Search engine tests and Wikipedia policies
Verifiability
Search engine tests may return results that are fictitious, biased, hoaxes or similar. It is important to consider whether the information used derives from
reliable sources
before using or citing it. Less reliable sources may be unhelpful, or need their status and basis clarified, so that other readers gain a neutral and informed understanding to judge how reliable the sources are.
Neutrality
Google (and other search systems) do not aim for a
neutral point of view
. Wikipedia does. Google indexes self-created pages and media pages which do not have a neutrality policy. Wikipedia has a neutrality policy that is
mandatory
and applies to all articles, and all article-related editorial activity.
As such, Google is specifically
not
a source of neutral titles, only of popular ones. Neutrality is mandatory on Wikipedia (including deciding what things are called) even if not elsewhere, and specifically, neutrality trumps popularity.
(See
WP:NPOV § Neutrality and Verifiability
for information on balancing the policies on
verifiability
and neutrality, and
WP:NPOV § Article naming
on how articles should be named)
Notability
Raw "hit" (search result) count is a very crude measure of importance. Some unimportant subjects have many "hits", some notable ones have few or none, for reasons discussed further down this page.
Hit-count numbers alone can only rarely "prove" anything about
notability
, without further discussion of the type of hits, what's been searched for, how it was searched, and what interpretation to give the results. On the other hand, examining the
types
of hit arising
(or their lack) often
does
provide useful information related to notability.
Additionally, search engines do not disambiguate, and tend to match partial searches. (However, as described below, you can eliminate partial matches by
quoting
the phrase to be matched.) While
Madonna of the Rocks
is certainly an encyclopedic and notable entry, it's not a pop culture icon. However, due to
Madonna
matching as a partial match, as well as other Madonna references not related to the painting, the result count of a Google or Bing search will be disproportionately high compared with that of any equally notable Renaissance painting. To exclude partial matches when Googling for the phrase,
quote
the phrase to be matched as follows:
"Madonna of the Rocks"
.
Using search engines
Search engine expressions (examples and tutorial)
This section explains some
search expressions used in Google web search
.
[2]
Similar approaches will work in many other search engines, and other Google searches, but always read their
help
pages for further information as search engines' capabilities and operation often differ. Note that if you are signed in to a Google account when searching on Google then this may affect the results that you get, based on your search history.
[3]
Also be sure to check "Languages for Displaying (Search) Results" in "Search Settings".
[4]
The single most useful search engine tool may be the use of quotation marks to find an exact match for a phrase. Search engines such as Google also offer both a basic search and an advanced search form; the advanced form makes further search options easier to enter. The following collapsible sections cover basic examples and help for using search engines with Wikipedia.
Specialized search engines such as medical paper archives have their own specialized search structure not covered here.
Basic searches.
|
Most searches allow searching for words (
acid
), expressions (
war on terrorism
), and combinations (
"war on terror" OR "war on terrorism"
;
John AND Smith
), as well as excluding certain items (
Bush NOT George
). An expression is given in "double quote" marks, and expressions can be grouped with parentheses. Expressions are not usually case-sensitive. So the following are all valid texts to search for, on Google:
Search:
John Smith
|
Since this isn't in quotes, Google looks for pages containing all of these terms. It finds all pages that contain "john" and "smith". This will return pages that contain "john smith", "john michael smith" but also pages that contain both terms separately, such as "The secretary, john arnold, and treasurer, mike smith..."
|
Search:
"John Smith"
|
The name is in double quotes. Google will look for pages containing the exact expression "John smith", or the two words next to each other ("The author was John. Smith was the composer..."). But it won't pick up name variants such as "John M. Smith".
|
Search:
"John Smith" OR "John M Smith" OR "John Michael Smith"
|
Search:
"Ahmed Abu-Sayed" OR "Ahmed Abusayed"
|
Looks for pages with
any
of these expressions. Note the use of
OR
(which
must
be given in upper case) to find possible alternate spellings when it isn't clear whether or not words are joined by page authors.
|
|
Use of
NOT
|
The term
NOT
(in Google represented by
-
) means: exclude pages that contain this term. The danger is that pages will be excluded because of a term that actually has nothing to do with the search in hand.
NOT
always means "and also not" in Google. The best use of
NOT
(or
-
in Google) is in two circumstances:
- There is a clear expression or term and a page that contains that meaning probably will
not
be relevant to the meaning you are after.
- There are many references and you want to narrow down the search by excluding less likely page suggestions.
Search for a term with a 2nd meaning v1:
George Bush NOT president
|
Search for a term with a 2nd meaning v2:
"George Bush" NOT president
|
Search for a term with a 2nd meaning v3:
George Bush NOT president NOT "White House"
|
You want references to George Bush, but not the one who's the president. Given that 90% of
George Bush
references will be about the US president, it makes sense to rule out all pages with that word, or to filter even more tightly, even though some pages may contain both references to non-presidential George Bushes and the word president.
Two variations are shown; one looks for the expression
"George Bush"
, and one has a second exclusion to rule out pages with the term
"
White House
"
|
Narrow down widely used terms:
(flavor OR flavour) (quark OR quantum OR physics) -eat -food -drink -cooking -culinary
|
An example of a more complex search. The author is looking for the term
flavor
, in the sense of a property in
quantum physics
. Sources may spell it the
American
way or
British
/
Commonwealth
way, so the first expression is to look for one
OR
the other. Also the page must contain some other words likely to be related to subatomic physics, thus
(quark OR quantum OR physics)
. Last, pages containing references related to food and cooking are explicitly excluded, since most references to "flavor" will be of this kind.
|
|
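The operators described above combine mechanically, so a search expression can be assembled from lists of alternatives and exclusions. The following Python sketch is illustrative only: the names `quote_phrase` and `build_query` are hypothetical, it uses the `-` form of NOT that this page describes for Google, and it assumes the terms contain no double quotes of their own.

```python
def quote_phrase(term):
    """Wrap multi-word terms in double quotes so they match as an exact phrase."""
    return f'"{term}"' if " " in term else term


def build_query(any_of_groups, exclude=()):
    """Build a search expression from OR-groups and excluded terms.

    any_of_groups: list of lists; a page must match at least one term
                   from every inner list.
    exclude:       terms that must not appear (emitted with a leading '-').
    """
    parts = []
    for group in any_of_groups:
        terms = [quote_phrase(t) for t in group]
        if len(terms) == 1:
            parts.append(terms[0])
        else:
            # Parentheses keep each set of alternatives together.
            parts.append("(" + " OR ".join(terms) + ")")
    parts.extend("-" + quote_phrase(t) for t in exclude)
    return " ".join(parts)


if __name__ == "__main__":
    # Rebuilds the "flavor" example from this section.
    print(build_query(
        [["flavor", "flavour"], ["quark", "quantum", "physics"]],
        exclude=["eat", "food", "drink", "cooking", "culinary"],
    ))
```

Keeping alternatives and exclusions in separate lists makes it easy to add or drop a single term and repeat the search, which is how these expressions are usually refined in practice.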
Advanced searches and copyvio checks.
|
Google allows all sorts of combinations of words, expressions,
OR
,
NOT
, and parentheses, which can be used to make quite detailed searches.
Search:
linux (grub OR lilo) (boot OR startup OR "start-up") kernel init process
|
A search for a person who wants to write an article on the
Linux
start-up (or boot) process, but doesn't know where on the net to look for reliable sources.
This search looks for pages that contain references to Linux, references to the two most common boot loaders with
(grub OR lilo)
, references to start-up under three common terms that might be used, and other words that hopefully will be commonly related to start-up in Linux.
|
Copyvio search:
("zytox is the worlds leading producer of widgets" OR "merger with IBM in 1929" OR "exports radar components to over fifty countries") NOT Wikipedia NOT wiki
|
Looks for any of three memorable phrases from a suspected copyright violation, which do
not
appear on the same page as a reference to
Wikipedia
. Also excludes the term
wiki
, to weed out not only many
Wikipedia mirrors
but also other wikis, which are not the sorts of sites we're looking for.
If this text is copied from a website, a search like this will often help to locate the source.
|
|
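A copyvio search of the kind shown above can be put together from a handful of hand-picked phrases. The sketch below is one possible helper, not an established tool: the function name `copyvio_query` is hypothetical, the phrases still have to be chosen by the editor, and the exclusions are written with `-`, which this page gives as Google's form of NOT.

```python
def copyvio_query(phrases, exclude=("Wikipedia", "wiki")):
    """Build an exact-phrase OR search that skips pages mentioning excluded terms.

    phrases: distinctive snippets copied from the suspect text.
    exclude: single-word terms to filter out (Wikipedia itself and mirrors).
    """
    # Strip stray double quotes so each phrase can be safely re-quoted.
    cleaned = [p.replace('"', "").strip() for p in phrases]
    quoted = [f'"{p}"' for p in cleaned if p]
    if not quoted:
        raise ValueError("at least one non-empty phrase is required")
    query = "(" + " OR ".join(quoted) + ")"
    for term in exclude:
        query += " -" + term
    return query


if __name__ == "__main__":
    print(copyvio_query([
        "zytox is the worlds leading producer of widgets",
        "merger with IBM in 1929",
        "exports radar components to over fifty countries",
    ]))
```

Phrases are copied exactly as they appear in the suspect text, including any errors such as a missing apostrophe, because the aim is to find the same string elsewhere.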
Finding vaguely remembered information and unfamiliar terms.
|
Search for a vaguely known term:
biology reproduction cell nucleus chromosome helix
|
A search for someone who wants to find the name of the molecule that replicates (
DNA
) and knows some terms it might be associated with but can't remember the term itself. Use associated terms to try and find pages that mention it.
|
Search for a term with unknown spelling:
piometra OR pieometra OR pyametra OR pymetra
|
A search for
pyometra
by someone who can't remember the spelling. Again, they could equally search using connected terms (Google: bitch womb spay open closed antibiotic; these are all terms associated with the veterinary condition pyometra). The odds are good that someone else has already misspelt it as you did and that the misspelling has been indexed, so you can look up more information from there.
|
Search for ambiguous terms:
DNA
(as in, the
cell biology
meaning)
|
An example of a problematic search. The obvious term
DNA
may pull up many unhelpful answers, such as companies with these initials. So it is likely that a person who wants to look up this item and doesn't know much already, will have to search like this:
- Search
DNA
and find that it has many meanings.
- Search
DNA cell biology helix
using words commonly associated with that meaning of DNA, to get pages covering that meaning.
- Use those pages to find that the correct term is "deoxyribonucleic acid", sometimes written "deoxyribo-nucleic acid".
- Do a final search for
"Deoxyribonucleic acid" OR "Deoxyribo nucleic acid"
|
Search:
("she's got" OR "she has") "do right by me" ticket ride lyrics
|
A search for a song title ("
Ticket to Ride
"), for a person who knows some phrases and
thinks
they might know others, including useful words that might help narrow it down.
|
|
Searches restricted to news, newsgroups, and other sources.
|
To search all news use
Google News
|
Search for a term within a certain site:
"George Bush" site:www.bbc.com
|
Search for a term in a site's URL:
allinurl:bbc George Bush
|
If searching using
site:
isn't enough, using
allinurl:
will specify that the search terms must appear in the page's URL itself, not just as a term on the page. This is mostly helpful for blogs and news sites that use blog-based
CMSes
that use a lot of plain language in article URLs.
|
|
Specialized options, including searches to include or exclude Wikipedia itself.
|
Google has options to specify web sites to search or not search, and where in the page to search. These can be added to the end of any search and will restrict the locations Google will report matches from. Examples of useful searches, using "(Atom OR Bomb)" as the example text being searched for:
To search like this
|
Enter a search string like this
|
Only report pages from websites ending in "en.wikipedia.org", the English Wikipedia.
|
(atom OR bomb) site:en.wikipedia.org
|
Only report pages from websites ending in "wikipedia.org", Wikipedia in any language
|
(atom OR bomb) site:wikipedia.org
|
Only report pages from websites that do
not
end with "wikipedia.org", i.e. pages that are NOT on a Wikipedia website
|
(atom OR bomb) -site:wikipedia.org
|
Avoid pages that mention
Wikipedia
.
(This is a good way to avoid a deluge of results which are all either from Wikipedia, or from copies and mirrors of Wikipedia articles.)
|
(atom OR bomb) NOT Wikipedia NOT wiki
|
Find the phrase
atom bomb
, avoid pages that mention
Wikipedia
or
wiki
or are on
Wikipedia.org
, and
link to
the Google search that you performed, so that others can repeat it.
|
[http://www.google.com/search?q=%22atom+bomb%22+-Wikipedia+-wiki+-site%3Awikipedia.org "atom bomb" -Wikipedia -wiki -site:wikipedia.org]
|
Search for
atom bomb
on a specific list of sites.
|
[http://www.google.com/search?q=%22atom+bomb%22+site%3Abritannica.com+OR+site%3Abbc.co.uk+OR+site%3Anytimes.com+OR+site%3Aguardian.co.uk+OR+site%3Asmh.com.au+OR+site%3Aamazon.com http://www.google.com/search?q="atom bomb" site:britannica.com OR site:bbc.co.uk OR site:nytimes.com OR site:guardian.co.uk OR site:smh.com.au OR site:amazon.com]
|
For the tennis player
Facundo Arguello
from (Spanish-speaking) Argentina, research how his name is spelled in reliable English sources. The search results should include articles with the word "tennis" but
not
the word "tenis" (the Spanish-language spelling), omit
Spanish-language web sites prefixed with
es
(
http://es.
etc., like Spanish Wikipedia), omit web sites with the
Argentine top-level domain name
ar
, and omit pages that mention
Wikipedia
or are on
Wikipedia.org
. It's possible to greatly simplify such a search by using the template
{{
Google LC
}}
(though it does not auto-exclude the term
wiki
):
{{subst:google LC|Facundo Arguello|es|ar}}
displays as a clickable external link:
Sources for Facundo Arguello on Google, excluding language(es)/country(ar)
Simply click the template-generated link then add the positive and negative match terms
tennis
and
-tenis -wiki
to the search string and repeat the search.
|
[http://www.google.com/search?q=%22Facundo+Arguello%22+tennis+-tenis+-site%3Aes.*+-site%3A*.ar+-site%3Awikipedia.org+-Wikipedia+-wiki http://www.google.com/search?q="Facundo Arguello" tennis -tenis -site:es.* -site:*.ar -site:wikipedia.org -Wikipedia -wiki]
|
Researching the preferred spelling of the soccer player
Facundo Arguello
from Argentina requires a much longer search string in order to eliminate a flood of results from his tennis namesake (see above). Simply click the link, then add the positive and negative match terms
soccer
,
football
,
-futbolista
(and so on) to the search string and repeat the search.
|
[http://www.google.com/search?q=%22Facundo+Arguello%22+soccer+football+-futbolista+-tennis+-tenis+-ATP+-Wimbledon+-court+-site%3Aes.*+-site%3A*.ar+-site%3Awikipedia.org+-wikipedia http://www.google.com/search?q="Facundo Arguello" soccer football -futbolista -tennis -tenis -ATP -Wimbledon -court -site:es.* -site:*.ar -site:wikipedia.org -wikipedia]
|
Find pages which link to a particular page, such as Wikipedia's
Main Page
|
link:http://en.wikipedia.org/wiki/Main_Page
|
Specify that the expression must appear in the HTML
<title>
of the page.
|
allintitle: (atom OR bomb)
|
allintitle
and
site:
(or
-site:
) can be
combined
, to find pages on a website (or not on the website) with the given expression in a title
|
allintitle: (atom NOT bomb) site:en.wikipedia.org
|
Specify that the page's URL must contain a particular expression.
|
inurl:(atom OR bomb)
|
Site inclusion/exclusion is often very useful to get views either
from
a named website, or from
any other
websites. For example, it can be used
- To find pages on
Microsoft
terminology that are not self-published by Microsoft (not ending in
microsoft.com
),
- To find pages that are official US or UK government sources (end in
.gov
and
.gov.uk
, respectively),
- To find sites from a given country (more likely to end with that country's country code, such as
.fr
for
France
),
- Or particular media publishers (e.g.,
cnn.com
or
bbc.co.uk
)
Specialized searches work on the same principles and same basic search expressions as the above, but might be used to check in specialized archives, or with unusual options.
|
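The repeatable search links in the table above are ordinary URLs with the search expression percent-encoded into the `q` parameter. The following Python sketch shows one way to produce them with the standard library. The helper names (`search_url`, `restrict_to_sites`, `wikitext_link`) are hypothetical, and the base address is simply the one used in this page's examples.

```python
from urllib.parse import urlencode

# Base address as used in the examples on this page.
GOOGLE_SEARCH = "http://www.google.com/search"


def search_url(query):
    """Return a URL that repeats the given search expression."""
    # urlencode escapes quotes and colons and turns spaces into '+'.
    return GOOGLE_SEARCH + "?" + urlencode({"q": query})


def restrict_to_sites(query, sites):
    """Append 'site:a OR site:b ...' so only the listed sites are searched."""
    return query + " " + " OR ".join("site:" + s for s in sites)


def wikitext_link(query):
    """Format the search as a wikitext external link labelled with the query."""
    return f"[{search_url(query)} {query}]"


if __name__ == "__main__":
    q = '"atom bomb" -Wikipedia -wiki -site:wikipedia.org'
    print(search_url(q))
    print(wikitext_link(restrict_to_sites('"atom bomb"', ["britannica.com", "bbc.co.uk"])))
```

Generating the link from the visible search expression keeps the label and the URL in step, so other editors who follow the link run exactly the search that is described.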
Specific uses of search engines in Wikipedia
- Google Trends
can allow you to find which rendering of a word or name is most searched for,
like this
(note: sports category) or
like this
.
"Tidal wave" vs. "Tsunami" example
, see also the Google Books example below.
- Google Books
has a pattern of coverage that is in closer accord with traditional encyclopedia content than is the Web, taken as a whole; if it has systemic bias, it is a very different systemic bias from Google Web searches. Multiple hits on an exact phrase in Google Book search provide convincing evidence for the real use of the phrase or concept. You can compare usage of terms, such as
"Tidal wave" vs. "Tsunami"
. Google Book search can locate print-published testimony to the importance of a person, event, or concept. It can also be used to replace an unsourced "common knowledge" fact with a print-sourced version of the same fact.
[5]
- Google Groups
or other date-stamped media can help establish the timing and context of early references to a word or phrase.
Google Groups search
.
- Google News
can help assess whether something is newsworthy.
Google News
used to be less susceptible to manipulation by self-promoters, but with the advent of pseudo-news sites designed to collect ad revenues or to promote specific agendas, this test is often no more reliable than others in areas of popular interest, and indexes many "news" sources that reflect specific points of view. The news archive goes back many years but may not be free beyond a limited period. News results often include press releases, which are not neutral, independent sources.
- Google Scholar
provides evidence of how many times a publication, document, or author has been cited or quoted by others. Best for scientific or academic topics. Can include master's and doctoral theses, patents, and legal documents.
Google Scholar search
.
- Topics alleged to be notable by popular reference can have the type of reference, and popularity, checked. An alleged notable issue that only has a few hundred references on the Internet may not be very notable; truly popular
Internet memes
can have millions or even tens of millions of references.
[6]
Note, however, that in some areas a notable subject may have very few references; for example, one might only expect a handful of references to some
archaeological
matter, and some matters will not be reflected online at all.
- Topics alleged to be genuine can be checked to test if they are referenced by reliable independent sources; this is a good test for hoaxes and the like.
- Copyright violations from websites can often be identified (as described above).
- Alternative spellings and usages can have their relative frequencies checked (e.g., for a debate over which is the more common of two equally neutral and acceptable terms). Google Trends can compare usage in the "News" category (
"Tidal wave" vs "Tsunami" example
), but this may not be reliable for older news.
[7]
Interpreting results
General
A raw hit count should never be relied upon to prove notability. Attention should instead be paid to what is found (the books, news articles, scholarly articles, and web pages), and whether those sources actually
do
demonstrate notability or non-notability, case by case. Hit counts have always been, and very likely always will remain, an extremely error-prone tool for measuring notability, and should not be considered either definitive or conclusive. A manageable sample of the results found should be opened individually and read, to verify their relevance.
In the case of Google (and other search engines such as Bing and Yahoo!), the hit count at the top of the page is unreliable and should usually not be reported. The hit count reported on the penultimate (second-to-last) page of results may be slightly more accurate. For searches with few reported hits (fewer than 1,000) the actual count of hits needed to reach the bottom of the last page of results may be more accurate, but even this is not a sure thing. Google returns different search results depending on factors such as your previous search history and on which Google server you happen to hit.
[8]
[9]
Other useful considerations in interpreting results are:
- Article scope: If narrow, fewer references are required. Try to categorize the point of view, whether it is NPoV, or other; e.g., notice the difference between
Ontology
and
Ontology (computer science)
.
- Article subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet
neologism
or a
pop song
, it may be on 700 pages and might still not be considered 'existing' enough to show any notability, for Wikipedia's purposes.
Biases to be aware of
In most cases, search results should be reviewed with awareness and careful skepticism before relying upon them. Common biases include:
General biases
General (the Internet or people as a whole):
- Personal bias: Tendency to be more receptive to beliefs that one is familiar with, agrees with, or that are common in one's daily culture, and to discount beliefs and views that contradict one's preferred views.
- Cultural and computer-usage bias: Biased towards information from Internet-using developed countries and affluent parts of society (those with internet access). Countries where computer use is not so common will often have lower rates of reference to equally notable material, which may therefore appear (mistakenly) non-notable.
- Undue weight: May disproportionately represent some matters, especially those related to popular culture (some matters may be given far more space, and others far less, than fairly represents their standing): popularity is not notability.
- Sources not readily accessible: Some sources are accessible to all, but many are payment only, or not reported online. This may, for example, affect the search results you get for a historical topic that achieved its peak media prominence 50 or 100 years ago; valid sources may very well exist, but would be found on microfilms or subscription news archiving sites like ProQuest or Newspapers.com rather than in a general Google search.
General web search engines (Google, Bing web search etc.):
- Dark net: Search engines exclude a vast number of pages, and this may introduce systematic bias so that some matters are excluded disproportionately (for example, because they are commonly visible on sites that do not allow Google indexing, or because the content cannot be indexed for technical reasons, as with Flash- or image-based websites).
- Search engines as promotion tool: An industry exists seeking to influence site position, popularity, and ratings in such searches, or to sell advertising space related to searches and search positions. Some subjects, such as pornographic actors, are so dominated by these that searches cannot be reliably used to establish popularity.
- Review process varies; some sites accept any information, while others have some form of review or checking system in place.
- Self-mirroring: Sometimes other sites clone Wikipedia content, which is then passed around the Internet, and more pages are built up based upon it (often without citation), so that in reality much of what the search engine finds is just copies of Wikipedia's own previous text, not genuine sources.
- Popular usage bias: Popular usage and urban legend are often reported in preference to correct information.
- Popular views and perceptions are likely to be more reported. For example, there may be many references to acupuncture, and many confirming that people are often allergic to animal fur, but it may only be with careful research that it is revealed there are medical peer-reviewed assessments of the former, and that people are usually not allergic to fur, but to the sticky skin and saliva particles (dander) within the fur.
- Language selection bias: For example, an Arabic speaker searching for information on homosexuality in Arabic will likely find pages which reflect a different bias than an English speaker searching in English on the same subject, since popular and media views and beliefs about homosexuality can differ widely between English-speaking countries (US, UK, Australia, etc.) that tend to include a higher proportion of homosexuality-accepting groups, and Arabic-speaking countries (Middle East) that tend to include a lower proportion.
Other:
- Note that other Google searches, particularly
Google Book Search
, have a different systemic bias from Google Web searches and give an interesting cross-check and a somewhat independent view.
Foreign languages, non-Latin scripts, and old names
Often for items of non-English origin, or in non-Latin scripts, a considerably larger number of hits result from searching in the correct script or for various transcriptions. Be sure to check "
Languages for Displaying (Search) Results
" in "
Search Settings
".
[4]
An
Arabic
name, for instance, needs to be searched for in the original script, which is easily done with Google (provided one knows what to search for), but problems may arise if, for example, English, French and German webpages transcribe the name using different conventions. Even for English-only webpages there may be many variants of the same Arabic or
Russian
name. Personal names in other languages (Russian,
Anglo-Saxon
) may have to be searched for both including and excluding the
patronymic
, and searches for names and other words in strongly
inflected
languages should take into account that arriving at the total number of hits may require searching for forms with varying
case
-endings or other grammatical variations not obvious for someone who does not know the language. Names from many cultures are traditionally given together with titles that are considered part of the name, but may also be omitted (as in
Gazi
Mustafa Kemal
Pasha
).
Even in
Old English
, the spelling and rendering of older names may allow dozens of variations for the same person. A simplistic search for one particular variant may underrepresent the web presence by an order of magnitude.
A search like this requires a certain linguistic competence which not every individual Wikipedian possesses, but the Wikipedia community as a whole includes many bilingual and multilingual people and it is important for nominators and voters on AfD at least to
be aware of their own limitations
and not make untoward assumptions when language or transcription bias may be a factor.
Google distinct page count issues
Note also that the number of search string matches reported by search engines is only an estimate. For example, Google will only calculate the actual number of matches once the user navigates through all result pages to the last one, and even then it places restrictions on the figure. At times, the "match" count estimate can differ significantly (by one or more
orders of magnitude
) from the total count of results shown on the last results page.
A site-specific search may help determine if most of the matches are coming from the same web site; a single web site can account for hundreds of thousands of hits.
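The per-site check described above can also be approximated offline, once a list of result URLs has been collected by hand. The following is a minimal sketch only: the function name `hits_per_site` and the sample URLs are hypothetical, invented for illustration, and the code does not query any search engine.

```python
from collections import Counter
from urllib.parse import urlparse


def hits_per_site(result_urls):
    """Count how many result URLs come from each host name."""
    hosts = []
    for url in result_urls:
        host = urlparse(url).netloc.lower()
        # Treat "www.example.org" and "example.org" as the same site.
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    return Counter(hosts)


# Hypothetical result list, invented for illustration only.
sample_results = [
    "https://www.example.org/page1",
    "https://example.org/page2",
    "https://example.org/archive/page3",
    "https://another.example.net/article",
]

counts = hits_per_site(sample_results)
top_site, top_count = counts.most_common(1)[0]
share = top_count / len(sample_results)
print(f"{top_site} accounts for {share:.0%} of the sampled hits")
```

If one host dominates the tally, the raw hit count says more about that single site than about how widely the term is used.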
For search terms that return many results, Google uses a process that eliminates results which are "very similar" to other results listed, both by disregarding pages with substantially similar content and by limiting the number of pages that can be returned from any given domain. For example, a search on "Taco Bell" will give only a couple of pages from tacobell.com even though many in that domain will certainly match. Further, Google's list of distinct results is constructed by first selecting the top 1000 results and then eliminating duplicates without replacements. Hence the list of distinct results will never contain more than 1000 results, regardless of how many webpages actually matched the search terms. For example, as of 14 December 2010, of the roughly 742 million pages related to "Microsoft", Google was returning 572 "distinct" results.
[10]
Caution must be used in judging the relative importance of websites yielding well over 1000 search results.
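The effect of "select the top 1000, then remove duplicates without replacement" can be seen in a small simulation. This is a sketch under stated assumptions, not Google's actual method: the `fingerprint` value is a hypothetical stand-in for whatever similarity measure the search engine uses, the per-domain limit is ignored, and the data is invented.

```python
def distinct_results(ranked_results, limit=1000):
    """Take the top `limit` results, then drop near-duplicates without refilling.

    Each result is a (url, fingerprint) pair; results sharing a fingerprint
    are treated as "very similar". The fingerprint is an assumption made for
    this illustration only.
    """
    top = ranked_results[:limit]
    seen = set()
    distinct = []
    for url, fingerprint in top:
        if fingerprint in seen:
            continue  # duplicate is dropped; no lower-ranked page replaces it
        seen.add(fingerprint)
        distinct.append(url)
    return distinct


# Hypothetical data: 5000 matching pages sharing only 600 fingerprints.
ranked = [(f"https://example.org/page{i}", i % 600) for i in range(5000)]

shown = distinct_results(ranked)
print(len(ranked), "pages matched, but only", len(shown), "distinct results are listed")
```

Because the removed duplicates are never replaced by lower-ranked pages, the listed total cannot exceed the cut-off, however many pages matched.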
Search engine limitations – technical notes
Many, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.
The estimated size of the
World Wide Web
is at least 11.5 billion pages,
[11]
but a much
deeper (and larger) Web
, estimated at over 3 trillion pages, exists within databases whose contents the search engines do not index. These
dynamic web pages
are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The
United States Patent and Trademark Office
website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.
[12]
Google, like all Internet search engines, can only find information that has actually been made available on the Internet. There is still a sizable amount of information that is not on the Internet.
Google, like all major Web search services, follows the
robots.txt protocol
and can be
blocked
by sites that do not wish their content to be indexed or cached by Google. Sites that contain large amounts of copyrighted content (image galleries, subscription newspapers, webcomics, movies, video, help desks), usually involving membership, will often block Google and other search engines. Other sites may also block Google because of load or bandwidth concerns on the server hosting the content.
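As a rough illustration of how such blocking works, the sketch below feeds a made-up robots.txt file to Python's standard `urllib.robotparser` module and asks which URLs a given crawler may fetch. The file contents, the example.org URLs and the "OtherBot" crawler name are assumptions invented for this example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /members/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

checks = [
    ("Googlebot", "https://example.org/members/profile"),
    ("Googlebot", "https://example.org/news/today"),
    ("OtherBot", "https://example.org/news/today"),
]

for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:10} {url:40} {verdict}")
```

A page that is blocked in this way can exist and be perfectly accessible to human readers while being absent from search results.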
Search engines also might not be able to read links or metadata that normally require a browser plugin,
Adobe PDF
, or Macromedia Flash, or where a website is displayed as part of an image. Search engines also cannot listen to podcasts or other audio streams, or to video that mentions a search term. Similarly, search engines cannot read PDF files consisting of photoscans or look inside compressed (.zip) files.
Forums, membership-only and subscription-only sites (since Googlebot does not sign up for site access) and sites that cycle their content are not cached or indexed by any search engine. With more sites moving to AJAX/Web 2.0 designs, this limitation will become more prevalent as search engines only simulate following the links on a web page. AJAX page setups (like Google Maps) dynamically return data based on real-time manipulation of JavaScript.
Google has also been the victim of
redirection exploits
that may cause it to return more results for a specific search term than there are actual content pages.
Google and other popular search engines are also a target for "search result enhancement" by
search engine optimizers
, so there may also be many results returned that lead to a page that only serves as an advertisement. Sometimes pages contain hundreds of keywords designed specifically to attract search engine users to that page, but in fact serve an advertisement instead of a page with content related to the keyword.
Hit counts reported by Google are only estimates, which in some cases have been shown to necessarily be off by nearly an order of magnitude, especially for hit counts above a few thousand.
[13]
[14]
For words common enough to yield several thousand Google hits, freely available
text corpora
such as the
British National Corpus
(for British English) and the
Corpus of Contemporary American English
(for American English) can provide a more accurate estimate of the relative frequencies of two words.
Example of the limitations
The
Economic Crime Summit
site is a rather Google- and
Internet Archive-unfriendly
site. It is very graphics-heavy, giving Google little or nothing to index, and many pages are missing from the Internet Archive version. So while you can bring up the
2002 Economic Crime Summit Conference
, the overview link that would tell you who presented what does not work. The
2004 Economic Crime Summit Conference archive
is even worse, as it was spread over three places and none of the archived links tells you anything about the papers presented.
Via the Internet Archive you have proof that some information regarding "Impact of Advances in Computer Technology in Evidence Processing" existed on the Internet.
[15]
Yet today
Google cannot find that information!
A program known to have been part of the 2002 Economic Crime Summit Conference, and at one time listed on a website, currently cannot be found by Google.
Common search engines
The most common search engines are Google, Bing, Yahoo, and DuckDuckGo, but the most useful search engine, which depends on the context, may not be one of the most common.
Specialized search engines
Google Scholar
works well for fields that are paper-oriented and have an online presence in all (or nearly all) respected venues. This search engine is a good complement to the commercially available Thomson ISI Web of Knowledge, especially in areas that are not well covered by the latter, including books, conference papers, non-American journals, and general journals in the fields of strategy, management, international business,
[16]
English language education and educational technology.
[17]
The analysis of the
PageRank
algorithm used by Google Scholar showed that this search engine, like its commercial analogs, provides adequate information about the popularity of a given source,
[18]
although that does not automatically reflect the real scientific contribution of a given publication.
[18]
MedLine
, now part of
PubMed
, is the original broadly based search engine, originating over four decades ago and indexing even earlier papers. Thus, especially in biology and medicine, PubMed "associated articles" is a Google Scholar proxy for older papers with no online presence. For example, the journal
Stroke
puts papers online going back to the 1970s. For this 1978 paper
[1]
, Google Scholar
lists 100 citing articles
, while PubMed
lists 89 associated articles.
There are a large number of
law libraries
online, in many countries, including:
Library of Congress
,
Library of Congress (THOMAS)
,
Indiana Supreme Court
,
FindLaw
(US);
Kent University Law Library and sources
(UK).
See also this
list of search engines
.
Generalized search engines
Several generalized search engines exist. These adapt your query to many search engines.
Web browsers offer a choice of search engines for the search box, and these can be used one at a time to experiment with search results. Meta-search engines use several search engines at once. A web browser
plugin
can add a search engine or a meta-search engine to your list of choices.
References
- ^
For example, if there are 16 hits at Google Books under one name, and 24 under another, there is only a 70% confidence that the second name is actually more common.
- ^
Google Search Operators and more search help
- ^
Search history personalization
- ^
Google Search Settings
- ^
Avoid inauthor:"Books, LLC", as Books, LLC 'publishes' raw printouts of Wikipedia articles.
- ^
Google search for: AYB OR AYBABTU OR "All your base"
- ^
Google Answers question on word frequency in news sources
- ^
Takuya, Funahashi; Hayato, Yamana (2010).
"Reliability Verification of Search Engines' Hit Counts"
(PDF)
.
Proceedings of the 10th international conference on Current trends in web engineering
. Computer Science and Engineering Division, Waseda University
. Retrieved
5 May
2015
.
- ^
Sullivan, Danny (21 October 2010).
"Why Google Can't Count Results Properly"
.
SearchEngineLand.com
. Retrieved
5 May
2015
.
- ^
Google search for "Microsoft"
- ^
Gulli, Antonio; Signorini, Alessio (28 August 2005).
"The Indexable Web is more than 11.5 billion pages"
.
- ^
More, Alvin; Murray, Brian H. (2000). "Sizing the Internet". Cyveillance.
- ^
Mark Liberman (2009), "
Quotes with and without quotes
",
Language Log
.
- ^
Liberman, Mark (2005), "
Questioning reality
",
Language Log
; and other
Language Log
posts linked from there.
- ^
http://web.archive.org/web/20011212161658/http://www.summit.nw3c.org/Programs_Agenda.htm
- ^
Harzing, A. W. K.; van der Wal, R. (2008). Google Scholar as a new source for citation analysis?
Ethics in Science and Environmental Politics
, vol. 8, no. 1, pp. 62–71
- ^
van Aalst, Jan. (2010) Using Google Scholar to Estimate the Impact of Journal Articles in Education.
Educational Researcher
39: 387.
- ^
Maslov, S.; Redner, S. (2008). Promise and pitfalls of extending Google's PageRank algorithm to citation networks. Journal of Neuroscience, 28, 11103–11105
Further reading
- Joe Meert (30 April 2006).
"Argumentum ad Googlum"
.
Science, AntiScience and Geology
.
Meert observes that "The temptation to find a quick retort means that, many times, people don't bother to check the source carefully." and that "people will look for a specific phrase that may be taken out-of-context to support their argument". He states that it is "dangerous and irresponsible to think that we can Google away a complex discussion" and that he has "learned long ago that there is no substitute for detailed research on a topic".
- Rich Turner (29 February 2004).
"Argumentum ad Googlum; Why Getting a Million Hits on Google Doesn't Prove Anything"
.
Grumbles
. Archived from
the original
on 3 March 2016.
Turner points out that "that something gets hits on Google does not make it correct" and gives several examples of things that are incorrect that garner thousands of hits on Google search results.
- Thelwall, M. (2008). Quantitative comparisons of search engine results, Journal of the American Society for Information Science and Technology, 59(11), 1702–1710.
http://www.scit.wlv.ac.uk/~cm1993/papers/SearchEngineComparisons_preprint.doc
- Thelwall, M. (2008). Extracting accurate and complete results from search engines: Case study Windows Live. Journal of the American Society for Information Science and Technology, 59(1), 38–50.
http://www.scit.wlv.ac.uk/~cm1993/papers/2007_Accurate_Complete_preprint.doc
- Gomes, et al. (2000). Detecting query-specific duplicate documents.
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=6615209.PN.&OS=pn/6615209&RS=PN/6615209
- Nakov, Preslav and Hearst, Marti (2005). A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies, Proceedings of Recent Advances in Natural Language Processing 2005
http://biotext.berkeley.edu/papers/nakov_ranlp2005.pdf
- Baroni, Marco and Ueyama, Motoko (2006) Building general- and special-purpose corpora by Web crawling, Proceedings of the 13th NIJL International Symposium Language Corpora Their Compilation and Application.
http://tokuteicorpus.jp./result/pdf/2006_004.pdf