
Job Opportunities in the Internet Archive Web Group

Web Wide Crawl Engineering - Technical Lead

The Internet Archive seeks a Senior Engineer to lead technical program definition and implementation of IA's web wide harvesting program.

Come help us archive the Internet and preserve it for future generations. You'll be responsible for driving our efforts to retrieve the highest quality content from the Web and for building the largest and most useful Web archive in existence.

Our Web Wide Harvesting program goals are:

  1) to grow and augment the 150+ billion Web captures contained within IA's historic Web archive, using open source tools and platforms, through ongoing, regular captures of Web content of interest;
  2) to analyze our Web collections to ensure we are harvesting a representative sample of what's happening on the Web in each calendar year; and
  3) to experiment with harvest techniques and tools that enable the archival capture and re-rendering of rich media, streaming content, social media, and so forth, in addition to traditional Web page content, as publishing practices and interactive online media evolve.

In this role you will work with the Web Collections Manager to design the strategy and implementation of the program and lead the operation of ongoing Web-scale harvests. You will also assist the program by creating tools and services as needed to improve the crawl (this may include analysis, reporting, QA, data import, etc.). You will:

  • Help identify program requirements and define technical, operational, and data analysis requirements
  • Lead efforts to define deployment architecture and workflows
  • Help implement, deploy, maintain and update data harvesting and analysis services

Your responsibilities will include:

  • Running the Heritrix open source web crawler to collect content from the Internet. You can find out more about the crawler at crawler.archive.org
  • Analysis of content collected from the Internet to ensure it is complete and of highest quality
  • Development of tools for automated and human-directed analysis and reporting of crawl material
  • Monitoring all production systems using automated tools

Experience Needed:

  • Experience in software or algorithm design, preferably in the areas of data analytics or distributed systems.
  • Strong foundation in distributed system architecture and development desired. Experience building distributed systems and/or large scale applications.
  • Should be comfortable with an open source development environment (Linux, Apache, Java, MySQL, etc.)
  • Direct experience with large-scale crawling or search systems is a plus
  • Ability to understand and extend an existing application's architecture and design.
  • Experience coding with Java a plus
  • Experience with Internet protocols (HTTP is a must)
  • Knowledge of HTML, JavaScript and Web technologies in general (required)
  • Knowledge of basic Linux system administration is a plus
  • Must be flexible and able to work in, and enjoy, a loosely structured startup work environment
  • Must be able to work independently and as part of a team.
  • Should be able to manage projects, communicate readily and clearly and stay on top of deadlines to ensure desired results.

Education: Computer Science, Math BS/BA or equivalent work experience

The Internet Archive, based in San Francisco's Presidio, is an entrepreneurial and technologically innovative nonprofit that serves as a public repository for born-digital and digitized materials. IA works closely with libraries, archives, museums, and educational institutions from around the globe to promote web archiving best practices and to ensure collections include culturally significant and relevant materials. IA makes all data freely and publicly accessible from www.archive.org, where you can also find out more about our organization and web archive.

We are an equal opportunity employer. Please send your resume and cover letter to jobs at archive dot org with the subject line "WWW Crawl Tech Lead". The Archive thanks all applicants for their interest, but advises that only those selected for an interview will be contacted. No phone calls, please.

Crawl Engineer

The Internet Archive is seeking a Crawl Engineer to run large-scale crawls for our partners. Our crawl engineering team is responsible for capturing and managing the highest quality content from the web for our 90+ library and archive partners around the world. An ideal candidate demonstrates independence and initiative, is a problem solver, works well autonomously, and is technologically savvy. Additionally, the ideal candidate is open to being trained on best practices and standards around large-scale web harvests.

Find out more about our organization and web archiving at www.archive.org as well as our tools and services at http://wa.archive.org/

Your responsibilities include:

  • Working directly with libraries, archives, and universities to collect specific born-digital content for preservation.
  • Running web harvests on specific topics, themes and/or domains using Heritrix, our open source Web crawler. You can find out more about the crawler at crawler.archive.org
  • Troubleshooting and running interference during the crawl to ensure its on-time and successful completion.
  • Analysis and QA of content collected to ensure it is complete and of highest quality
  • Development of tools for automated analysis and reporting of crawl material
  • Contributing to the development of the open source crawler and related access/analysis tools
  • Delivering on commitments to partners within deadlines and project timelines.

Experience Needed:

  • Solid experience with Internet protocols (HTTP is a must)
  • Strong knowledge of HTML, JavaScript and Web technologies in general
  • Experience coding with Java
  • Experience with open source technology and/or Heritrix
  • Knowledge of basic Linux system administration
  • Basic knowledge of building and deploying web applications
  • Ability to work in, and enjoy, a loosely structured work environment
  • Flexibility and a sense of humor

Education: Computer Science, Math BS/BA or equivalent work experience

We are an equal opportunity employer. Please send your resume and cover letter to kristine at archive dot org with the subject line "Crawl Engineer". The Archive thanks all applicants for their interest, but advises that only those selected for an interview will be contacted. No phone calls, please.


