In my previous post, Datasets for Machine Learning, I introduced many datasets for machine learning. However, you might not find one that suits your own research or project: sometimes the data you need are published online but not archived as a downloadable dataset. In that case, you need to scrape the data yourself. Of course, if you want to use scraped data for a commercial purpose, proceed carefully and make sure that what you do is legal.
Please pay attention to the following:
- Read the terms and conditions on data usage of the website you want to scrape.
- Once you have permission, be polite when scraping. DO NOT send requests too frequently, so that you avoid putting unnecessary load on the server.
- After scraping the data, use them legally.
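The politeness rule above can be sketched with the standard library alone. This is only a minimal sketch, not a complete crawler: the robots.txt rules, the `my-research-bot` user agent, the `example.com` URLs and the `Throttle` helper are all made up for illustration. A real scraper would load the site's actual robots.txt, and it would still have to respect the site's terms and conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would fetch the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class Throttle:
    """Hypothetical helper: keep consecutive calls at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self._last = None  # time of the previous call, if any

    def wait(self):
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


parser = build_robot_parser(ROBOTS_TXT)
user_agent = "my-research-bot"  # assumed name, pick your own

# Ask whether each URL may be fetched before requesting it.
allowed = parser.can_fetch(user_agent, "https://example.com/data/page1.html")
blocked = parser.can_fetch(user_agent, "https://example.com/private/page.html")

# Use the site's requested delay if it declares one, else fall back to 1 second.
delay = parser.crawl_delay(user_agent) or 1
throttle = Throttle(delay)
# In a crawl loop, call throttle.wait() before every request.
```

With the made-up rules above, the first URL is allowed, the second is disallowed, and the throttle waits two seconds between requests.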
Here are some tools and libraries for web scraping that are written in Python or can be used from Python:
- BeautifulSoup: a Python package for parsing HTML and XML documents.
- Scrapy: an open source, collaborative, high-level web crawling and scraping framework for extracting data from websites in a fast, simple, yet extensible way.
- pyspider: a powerful spider (web crawler) system in Python.
- pyquery: a jQuery-like library that lets you make jQuery queries on XML documents.
- webscraping: a library for web scraping or website navigation.
- Selenium: a suite of tools to automate web browsers across many platforms.
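To give a feel for one of the libraries listed above, here is a small BeautifulSoup sketch. It assumes the `beautifulsoup4` package is installed, and it parses a made-up HTML snippet (the team names, scores and links are invented) instead of a downloaded page, so it runs without any network access.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page you have already downloaded.
HTML = """
<html><body>
  <h1>Match results</h1>
  <ul id="matches">
    <li class="match"><a href="/match/1">Team A 2-1 Team B</a></li>
    <li class="match"><a href="/match/2">Team C 0-0 Team D</a></li>
  </ul>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser is required.
soup = BeautifulSoup(HTML, "html.parser")

# Find a single element by tag name and read its text.
title = soup.find("h1").get_text(strip=True)

# Use a CSS selector to collect the text and link of every match entry.
matches = [
    (link.get_text(strip=True), link["href"])
    for link in soup.select("li.match a")
]

print(title)
for text, href in matches:
    print(text, href)
```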
The general steps for scraping are:
- Before scraping, make sure that the data are actually useful for your analysis. Otherwise, you will collect useless data and waste your time.
- Explore the structure of the website or page. Here are two websites that provide data about soccer: KassiesA: UEFA European Cup Football and Football-Data.co.uk.
- Based on that structure, write your own spider with the libraries above to extract the data that you want.
- Save the data locally for analysis.
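The last two steps can be sketched as follows. To stay dependency-free, this sketch uses the standard library's `html.parser` as a stand-in for the scraping libraries, and it works on a made-up results table (the dates, teams and scores are invented and do not come from either website). The names `TableParser`, `scrape_table`, `write_rows` and the file name `matches.csv` are hypothetical.

```python
import csv
from html.parser import HTMLParser

# Made-up page fragment; a real page would need its own structure explored first.
PAGE = """
<table>
  <tr><th>Date</th><th>Home</th><th>Away</th><th>Score</th></tr>
  <tr><td>2020-01-01</td><td>Team A</td><td>Team B</td><td>2-1</td></tr>
  <tr><td>2020-01-08</td><td>Team C</td><td>Team D</td><td>0-0</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def write_rows(rows, fileobj):
    """Write the rows as CSV to an already opened text file."""
    csv.writer(fileobj).writerows(rows)


rows = scrape_table(PAGE)

if __name__ == "__main__":
    # Save the data locally so the analysis never has to hit the website again.
    with open("matches.csv", "w", newline="", encoding="utf-8") as f:
        write_rows(rows, f)
```

Keeping the extraction (`scrape_table`) separate from the storage (`write_rows`) means you can change the output format later without touching the parsing code.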
These libraries and tools are powerful and easy to use. I will describe how to use some of them in detail in future posts.