Web crawling in Ruby with Capybara
In a Rails project, we use Capybara for feature (end-to-end) testing. However, Capybara is also good for crawling pages outside of Rails: any data that matches a specific pattern can be extracted with ease.
We can create a simple web crawler in a single file with the Capybara DSL.
How To
Create a folder with a Gemfile, since we need several gems.
$ mkdir crawler
$ cd crawler
Create a Gemfile:
source "https://rubygems.org"
gem 'capybara'
gem 'selenium-webdriver'
Run bundle install after that.
Set up your crawler:
require 'capybara'
Capybara.run_server = false
Capybara.current_driver = :selenium
Capybara.app_host = "https://google.com.tw"
You can pick other drivers from the list in the Capybara repo; just install the corresponding gems if they are required.
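For example, a headless Chrome driver can be registered through Selenium and then picked as the current driver. The sketch below is one possible setup; the :headless_chrome name is arbitrary, and it assumes Chrome and chromedriver are installed locally:
require 'capybara'
require 'selenium-webdriver'

# Register a custom driver named :headless_chrome (the name is our choice).
Capybara.register_driver :headless_chrome do |app|
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
end

Capybara.current_driver = :headless_chrome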
After setup, create a class and include the DSL from Capybara:
module MyCapybara
  class Crawler
    include Capybara::DSL
  end
end
crawler = MyCapybara::Crawler.new
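With the DSL included, methods such as visit, find, and page become available on the instance itself. A small illustration (the h1 selector is just an assumption about the page, not part of the original example):
crawler.visit("/")            # opens Capybara.app_host + "/"
puts crawler.page.title       # print the title of the current page
puts crawler.find("h1").text  # print the first <h1>, assuming the page has one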
Once an instance is initialized, create a method, fill it in with your patterns, and process the data. The following is a complete example:
require 'capybara'

Capybara.run_server = false
Capybara.current_driver = :selenium
Capybara.app_host = "https://google.com"

module MyCapybara
  class Crawler
    include Capybara::DSL

    def query(params)
      visit("/")
      fill_in "search", with: params   # fill_in locates by label, name, or id (no CSS "#")
      click_button "search"
      find("#result").text             # return the text of the result element
    end
  end
end

crawler = MyCapybara::Crawler.new
item = crawler.query("capybara")

File.open("query.txt", "a") do |file|
  file.write("#{item}\n")
end
More complex operations can be done through other methods offered by Capybara.
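For instance, all returns every node matching a selector, which makes it easy to iterate over a list of results. The sketch below assumes a hypothetical ".result a" selector and writes the matched links to a CSV file:
require 'csv'

crawler = MyCapybara::Crawler.new
crawler.visit("/")

CSV.open("links.csv", "w") do |csv|
  # `all` returns every matching node; each element exposes text and attributes.
  crawler.all(".result a").each do |link|
    csv << [link.text, link[:href]]
  end
end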