What Is an Index?
An index is a database containing the webpages and other content a search engine wants to display on its search results pages.
Search engines do not search the web in real time. Instead, they periodically crawl the web to find new and updated content. When they identify content they may want to include in search results, they add it to their index.
When a visitor enters a search query into the search engine, it retrieves the relevant result from its index and displays it on the search results page.
The process of adding content to the index is called indexing. While indexing brings content one step closer to appearing in search results, there is no guarantee that indexed content will actually appear there.
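The idea of answering queries from a prebuilt index rather than from the live web can be sketched with a toy inverted index. This is only an illustration, not how any real search engine is implemented; the function names `build_index` and `search` and the sample URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the pages matching the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawled pages: URL -> extracted text.
pages = {
    "https://example.com/a": "how search engines index pages",
    "https://example.com/b": "crawl budget explained",
}
index = build_index(pages)
print(search(index, "index pages"))
```

The lookup only ever touches the index, which is why a page that was never added to it cannot be returned, however relevant it is.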
How Search Engines Work
To understand indexing, it is essential to know how search engines operate. Discovering content online and presenting it to visitors takes three steps: crawling, indexing, and serving search results.
Crawling is the process of discovering and browsing content available on the web. Google uses crawlers, also known as bots or spiders, to crawl the web.
These crawlers regularly look for new content published on the web. They also visit previously crawled content to check if it has been updated since their last visit. When the crawler identifies content that it wants to appear in search results, it adds it to its database. This database is called an index, and a webpage’s ability to be indexed is referred to as indexability.
Google reviews and evaluates the indexed content for relevance and quality. When a visitor enters a query into Google, the search engine retrieves the relevant content from its index and then displays it on its search results pages.
Content that is not indexed cannot be displayed on search results pages. Therefore, if you want your content to appear on Google’s search results pages, you must ensure that Google can index it.
Factors That Affect Indexing
Many factors can affect indexing. Some originate on your end, and others on Google's end. While you typically cannot do anything about indexing issues on Google's end, here are a few factors on your end that may cause them.
1 Robots.txt File
Your robots.txt file may contain rules restricting search engine bots from crawling certain parts of your site. You may have set up these rules yourself, but they can also result from misconfigurations in the file.
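One way to check whether a rule blocks a given URL is Python's standard `urllib.robotparser` module. The sketch below parses an in-memory robots.txt rather than fetching one. The rules, the URLs, and the helper name `is_crawl_allowed` are illustrative assumptions, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post"))
print(is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page"))
```

Running a check like this against the pages you expect to be indexed can surface an overly broad `Disallow` rule before it causes trouble.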
2 Meta Robots Tags
Google and many other search engines do not index webpages with a noindex meta tag. The tag is typically added to the head section of the webpage. So, if you want Google to index a webpage, ensure that you have not applied the noindex meta tag to the page.
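A quick way to audit a page for this tag is to parse its HTML and look for a robots meta tag whose content includes `noindex`. The following is a minimal sketch using Python's standard `html.parser`. The class and function names are hypothetical, and it does not cover other ways of setting the directive, such as HTTP response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots (or googlebot) meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


blocked = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
allowed = '<html><head><meta name="robots" content="index, follow"></head></html>'
print(has_noindex(blocked), has_noindex(allowed))
```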
3 Server Errors
Google typically reduces the rate at which it crawls a site when it encounters a slow server or excessive server-side errors. These errors have status codes in the 5xx range, such as the HTTP 500 Internal Server Error and the HTTP 503 Service Unavailable error. Google will also reduce the rate at which it crawls your site when it repeatedly encounters the 429 Too Many Requests error.
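To see how often your server returns these responses, you could scan the status codes in your access logs. The sketch below flags the codes named above (any 5xx, plus 429) and computes their share of a sample. The function names and the sample data are hypothetical, and no threshold is implied, since the point at which Google slows down is not stated here.

```python
def slows_crawling(status_code):
    """Return True for status codes described above: any 5xx, or 429."""
    return status_code == 429 or 500 <= status_code <= 599


def problem_rate(status_codes):
    """Return the fraction of responses that are 5xx or 429."""
    if not status_codes:
        return 0.0
    problems = sum(1 for code in status_codes if slows_crawling(code))
    return problems / len(status_codes)


# Hypothetical status codes extracted from a server log.
sample = [200, 200, 503, 429]
print(problem_rate(sample))
```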
4 Missing Content
Google cannot crawl or index content that is not present on the web. This sort of content is missing from its expected location, causing the webpage to return a 404 Not Found error. Google will typically deindex content that keeps returning a 404 Not Found error over a considerable period.
5 Crawl Budget
The crawl budget refers to the number of pages a search engine bot will crawl within a specified period. Google assigns a crawl budget to every site it crawls. The assigned crawl budget is not based on the site’s actual server capacity but on Google’s assessment of the site’s capacity.
6 XML Sitemap
The XML sitemap helps search engines find and understand a website’s structure. While Google can find your content without a sitemap, it sometimes relies on it to discover your content. In such situations, a missing, incomplete, or poorly structured sitemap can hinder the discovery and indexability of certain pages.
7 Duplicate Content
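For reference, a minimal sitemap is a `urlset` element containing one `url` entry, with a `loc` child, per page. The sketch below generates one with Python's standard `xml.etree.ElementTree`. The function name `build_sitemap` and the example URLs are hypothetical, and optional fields such as `lastmod` are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap = build_sitemap(["https://example.com/", "https://example.com/blog/"])
print(sitemap)
```

Most content management systems and SEO plugins generate this file for you, so a script like this is mainly useful for understanding or validating what they produce.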
You can end up with duplicate content when you have the same or similar content on multiple URLs. These URLs could be on the same or multiple sites. To avoid this, Google recommends specifying the preferred version of the page using the rel="canonical" link tag.
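To confirm which URL a page declares as canonical, you can read the `href` of its `rel="canonical"` link tag. The following is a small sketch using Python's standard `html.parser`. The class name, function name, and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


page = '<head><link rel="canonical" href="https://example.com/shoes"></head>'
print(find_canonical(page))
```

If duplicate URLs all report the same canonical URL, they are consistently pointing search engines at one preferred version.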
8 Internal Linking Structure
Google relies on links to find your content. A well-organized internal linking structure would allow Google to discover every page on your site. Orphaned pages, that is, pages with no internal links pointing to them, may not be easily discovered and indexed.
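Orphaned pages can be found by comparing the list of pages on a site with the set of pages that receive at least one internal link. The sketch below assumes you already have a map from each page to the pages it links to, for example from a site crawl. The function name and the sample data are hypothetical.

```python
def find_orphan_pages(links, home="/"):
    """Return pages that no other page links to, excluding the home page.

    `links` maps each page path to the set of internal paths it links to.
    """
    linked_to = set()
    for source, targets in links.items():
        # A page linking to itself does not make it discoverable.
        linked_to.update(t for t in targets if t != source)
    return sorted(p for p in links if p != home and p not in linked_to)


# Hypothetical internal link map for a small site.
site_links = {
    "/": {"/blog", "/about"},
    "/blog": {"/"},
    "/about": set(),
    "/old-landing": {"/"},
}
print(find_orphan_pages(site_links))
```

Here `/old-landing` links out to the home page but nothing links to it, so a crawler following internal links would not reach it.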
9 HTTPS Protocol
Google prefers sites to use the more secure HTTPS protocol over the less secure HTTP protocol. You can easily convert an HTTP site to HTTPS by installing a Secure Sockets Layer (SSL) certificate on your site.
10 Thin or Low-Quality Content
Google only wants to display helpful pages in its search results. So, it may not index pages with thin content or content that provides little value to visitors.
11 Domain or URL Structure
Multiple URLs can lead to the same content. This may affect indexing, particularly when the site has an unclear, long, or complex URL structure. This may also occur with dynamic URLs and URLs with excessive parameters.
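One way to spot URLs that likely point to the same content is to normalize them and compare the results. The sketch below lowercases the host, drops the fragment, removes tracking parameters, and sorts the remaining parameters, using Python's standard `urllib.parse`. Treating `utm_` parameters as ignorable is an assumption made for illustration, as are the function name and example URLs. Which parameters actually change the content depends on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters with these prefixes never change the page content.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url):
    """Return a normalized form of the URL for duplicate detection."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(params))
    path = parts.path or "/"
    # The fragment is dropped because it is not sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


first = normalize_url("https://Example.com/shoes?utm_source=news&color=red&size=9#reviews")
second = normalize_url("https://example.com/shoes?size=9&color=red")
print(first)
print(first == second)
```

URLs that normalize to the same string are candidates for consolidation, for example through a canonical tag or a redirect.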