Web Scraping Service

by GrabLab

Technology Stack

I have extensive experience with web scraping, data processing and building web dashboards. I’ve used those skills to build complete data-driven solutions such as price monitoring systems, product databases, content aggregators and data pipelines, and to extract value from data using machine learning.

Here’s a list of technologies I have worked with and can consult on:

Languages

  • Python
  • JavaScript

Web Scraping and Web Automation

  • Grab Framework - an asynchronous web scraping framework (it uses pycurl and urllib3)
  • Scrapy - also an asynchronous web scraping framework, but heavier and more general-purpose
  • Selenium - for web automation, general testing and scraping projects that require JS execution
  • PhantomJS - a headless browser, used if there is a need to execute JavaScript

Databases

  • MongoDB - a schema-less database well suited for use as an intermediate data warehouse
  • Redis - an in-memory database, used as a cache, task queue, pub/sub and for bloom filters
  • PostgreSQL - a “traditional” RDBMS (tables, schema)
  • ElasticSearch - search index
  • InfluxDB - time series database (e.g. prices, quantities etc.)

Backend

  • Django - the most popular framework in the Python ecosystem
  • Flask - a lightweight Python framework
  • aiohttp - an asynchronous client/server framework for Python 3.5+ based on the asyncio library
  • LoopBack - a JavaScript REST framework based on Express.js

Frontend

  • ReactJS - a front-end framework from Facebook
  • AngularJS 1.x - a front-end framework from Google
  • RiotJS - a tiny component framework (for when React/Angular would be over-engineering)
  • Webpack - a module bundler and build tool for front-end apps
  • ES6 - the modern JavaScript standard (ECMAScript 2015)

Testing

  • Unit tests
  • Property-based tests
  • Integration tests
  • BDD

DevOps

  • Docker - In most projects I use containers for services (API, Dashboard etc.) and for the deployment of applications
  • Kubernetes - a cluster orchestration solution from Google (limited experience)
  • Amazon AWS - I am proficient with Amazon’s cloud solutions
  • Heroku - PaaS cloud (limited experience)
  • OpenShift - a cloud computing solution from Red Hat that recently moved to Docker containers

GitHub

Thanks for reading!