Data, data everywhere: content consumption strategies using Nutch and Solr data import handler

Apache Nutch is an open source web crawler that lets you do fine-grained or Internet-wide web crawling. In this session I will introduce you to the Drupal Nutch module, which helps with the setup and control of your crawls. We will combine this with some of the new features in the Apache Solr, Views 3 and Apache Solr Views modules to create a hybrid search engine vertical that interleaves your content with supporting web content.

The agenda will be:

1. An introduction to the Apache Nutch crawler
2. An introduction to the features of the Drupal Nutch module
3. Technical design decisions on combining crawled data with your Drupal data in Apache Solr
4. Aggregating other data into Solr using the Data Import Handler (Wikipedia, anyone?)
5. Questions

Schedule info
Status: Proposed

Session Info
Speaker(s): 
Track: Site Building
Experience level: Intermediate

Comments

I've tried to implement Lucene before, so to have this in Apache is pretty cool. Would like to learn more; sounds like a good talk to me.