6/29/2023

Crunchbase has an enormous business dataset that can be used in a variety of forms of market analytics and business intelligence. For example, the company dataset contains the company's summary details (like description, website and address), public financial information (like acquisitions and investments), as well as leadership and used-technology data.

Additionally, Crunchbase contains a lot of data points used in lead generation, like the company's contact details, the leadership's social profiles and events aggregation.

For more on scraping use cases, see our extensive web scraping use case article.

Project Setup

In this tutorial we'll be using Python and two major community packages:

- httpx - an HTTP client library which will let us communicate with Crunchbase's servers.
- parsel - an HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly be working with JSON data directly instead.

Optionally, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on via nice colorful logs.

These packages can be easily installed via the pip command:

```shell
$ pip install httpx parsel loguru
```

Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.

Crunchbase contains several data types: acquisitions, people, events, hubs, funding rounds and companies. You can explore the available data types by taking a look at the /discover page.

discovery page shows all available dataset types

In this tutorial we'll focus on company and people data, though we'll be using generic parsing techniques which can be applied to all of the Crunchbase pages.

To start scraping content we need a way to find all of the company or people URLs. Crunchbase does offer a search system; however, it's only for its premium users. Fortunately, since Crunchbase wants to be crawled and indexed by search engines, it offers a sitemap directory that contains all of its target URLs.

Let's start by taking a look at the /robots.txt endpoint:

```
User-agent: *
```

The /robots.txt page indicates crawling suggestions for various web crawlers (like Google etc.). We can see that there's a sitemap index that contains indexes for various target pages: this page contains sitemap index pages for acquisitions, events, funding rounds and hubs, as well as companies (aka organizations) and people.

Each sitemap index can contain a maximum of 50 000 URLs, so currently, using this index, we can find over 2 million companies and almost 1.5 million people! Further, there's also the last update date indicated by the <lastmod> node, so we also have the information for when each index was last updated.

Let's take a look at how we can scrape all of this. To scrape the sitemaps, we'll download the sitemap indexes using our httpx client and parse the URLs using parsel:

```python
import gzip
from typing import Iterator, List, Literal, Tuple

import httpx
from loguru import logger as log
from parsel import Selector


async def _scrape_sitemap_index(session: httpx.AsyncClient) -> List[str]:
    """scrape Crunchbase Sitemap index for all sitemap urls"""
    log.info("scraping sitemap index for sitemap urls")
    # the sitemap index location is advertised in /robots.txt
    # (check /robots.txt for the current index url)
    response = await session.get("https://www.crunchbase.com/www-sitemaps/sitemap-index.xml")
    sel = Selector(text=response.text)
    urls = sel.xpath("//sitemap/loc/text()").getall()
    log.info(f"found {len(urls)} sitemaps")
    return urls


def parse_sitemap(response) -> Iterator[Tuple[str, str]]:
    """parse sitemap for location urls and their last modification times"""
    # individual sitemap files are served gzip-compressed
    sel = Selector(text=gzip.decompress(response.content).decode())
    urls = sel.xpath("//url")
    log.info(f"found {len(urls)} in sitemap")
    for url in urls:
        yield url.xpath("loc/text()").get(), url.xpath("lastmod/text()").get()
```

A discovery function (the discover_target seen in the logs below) can then select the matching sitemaps from the index and iterate through their entries:

```python
for url, mod_time in parse_sitemap(response):
    ...
```

Running this, we should see logs similar to:

```
INFO | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO | _scrape_sitemap_index - found 89 sitemaps
INFO | discover_target - found 30 matching sitemap urls (from total of 89)
INFO | discover_target - scraping sitemap:
INFO | parse_sitemap - found 50000 in sitemap
```

We can see that by exploring the Crunchbase sitemap we can easily and quickly discover the profiles listed on the website.
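The sitemap parsing hinges on just two nodes per entry: <loc> and <lastmod>. As a quick self-contained sanity check of that structure, here is a sketch using only Python's standard library and a made-up sample document (not real Crunchbase data); real sitemaps also declare an XML namespace, which is omitted here for brevity:

```python
import gzip
import xml.etree.ElementTree as ET

# made-up sample sitemap; real files declare
# xmlns="http://www.sitemaps.org/schemas/sitemap/0.9",
# which ElementTree would require in tag lookups
sample = """<urlset>
  <url><loc>https://example.com/organization/acme</loc><lastmod>2023-06-01</lastmod></url>
  <url><loc>https://example.com/person/jane</loc><lastmod>2023-06-02</lastmod></url>
</urlset>"""

# simulate the gzip round trip that real sitemap files go through
compressed = gzip.compress(sample.encode())
root = ET.fromstring(gzip.decompress(compressed).decode())

# extract the same (location, last-modified) pairs that parse_sitemap yields
entries = [(u.findtext("loc"), u.findtext("lastmod")) for u in root.iter("url")]
print(entries)
```

Since these are the same loc/lastmod pairs our parse_sitemap generator yields, this is a convenient way to exercise the parsing logic without hitting the live site.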