System for automatic crawling and parsing sites

Web Development

A system that automatically crawls and parses websites with differing structures. It has to support many languages and uses the Google Translate API to translate the extracted data into English. It is mostly used for parsing HTML, but it can also extract text from PDF, DOC, and other attached files. It can check sites for updates at a set interval and ignore unnecessary content. The collected data is used to create posts for the system's own resources. All posts are sorted by category and made available to users.
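The description does not say how updates are detected between checks. One common approach, sketched here purely as an illustration, is to store a hash of the text extracted from each page and compare it on the next visit. The class name `ChangeDetector`, its methods, and the whitespace normalization step are hypothetical and not taken from the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers one fingerprint per URL and reports changes. */
class ChangeDetector {
    // Last seen fingerprint for each URL (in memory only; an assumption,
    // a real system would more likely persist this in its database).
    private final Map<String, String> fingerprints = new HashMap<>();

    /** Collapses whitespace so formatting-only edits do not count as updates. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the SHA-256 digest of the normalized text as lowercase hex. */
    static String fingerprint(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the text is new or differs from the last text seen for this URL. */
    boolean hasChanged(String url, String extractedText) {
        String current = fingerprint(extractedText);
        String previous = fingerprints.put(url, current);
        return !current.equals(previous);
    }
}
```

A periodic job could call `hasChanged` with the text extracted from each crawled page and create or update a post only when it returns `true`.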

Industries: Legal services, Software manufacturers, Industry

Solution: Media content management

Technologies and tools:

Java, PostgreSQL, Spring, Jsoup, JUnit, Git, Kitematic, Docker, RestTemplate, Spring Boot
