I recently launched a little site for finding Catholic parishes and mass times called parish.io. It takes a very different approach to gathering parish info and mass times than other sites in this category. This approach benefits users (more accurate and complete mass schedules), and I think it’s also of particular interest to other software developers.
Rather than relying on manual data entry, parish.io gathers all of its info by scraping diocese and parish sites. It took months of hacking to validate the concept and develop the scraping logic, and while it’s not perfect (some parishes just don’t have sites, or don’t provide mass times, or put them in unparseable formats), overall I’m quite happy with the results I’m seeing. Here’s what I used to build it:
Python: My programming language of choice. Database aside, everything that follows is a Python library.
lxml: Don’t be fooled by the name. lxml is just as capable of parsing HTML as XML, especially given its support for CSS selectors (similar to jQuery). It’s very fast, and ably handles most poorly formed HTML. Some people are partial to the API in BeautifulSoup, but a few small hangups aside, lxml has performed so well that I’ve never been very tempted to switch.
That toolset accounts for most of the heavy lifting in scraping sites. As for the site itself, I used Flask, a simple and very well-documented web framework; the SQLAlchemy ORM, talking to a PostgreSQL database; and, critically, PostGIS for geo queries (i.e. lookup by zip code, city, or nearby).
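As a rough illustration of the "nearby" lookup, here is one way such a query could be expressed with SQLAlchemy and PostGIS. The `parishes` table, its `location` column (assumed to be a PostGIS `geography` point), and the `nearby_query` helper are hypothetical names invented for this sketch, not the actual parish.io schema. `ST_DWithin`, `ST_Distance`, `ST_MakePoint`, and `ST_SetSRID` are standard PostGIS functions; on `geography` values, distances are in meters.

```python
from sqlalchemy import text

METERS_PER_MILE = 1609.344

# Hypothetical schema: parishes(id, name, location geography(Point, 4326)).
# ST_MakePoint takes (longitude, latitude), in that order.
NEARBY_SQL = text("""
    SELECT id, name,
           ST_Distance(
               location,
               CAST(ST_SetSRID(ST_MakePoint(:lon, :lat), 4326) AS geography)
           ) AS meters
    FROM parishes
    WHERE ST_DWithin(
              location,
              CAST(ST_SetSRID(ST_MakePoint(:lon, :lat), 4326) AS geography),
              :radius_m
          )
    ORDER BY meters
    LIMIT :max_rows
""")


def nearby_query(lat, lon, miles=10.0, max_rows=20):
    """Bind a search point and radius to the (hypothetical) nearby query."""
    return NEARBY_SQL.bindparams(
        lat=lat,
        lon=lon,
        radius_m=miles * METERS_PER_MILE,
        max_rows=max_rows,
    )
```

In a Flask view, the returned statement could then be passed to something like `session.execute(nearby_query(lat, lon))`, with the coordinates coming from a zip code or city lookup.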
All of the above are open source, with very friendly licensing terms that will work for any project, whether open or closed source.
If any of this is of interest to you, check out these projects. And be sure to give parish.io a try too!