A system for automatically crawling and parsing websites with differing structures. It has to support many languages, so it uses the Google Translate API to translate the collected data into English. It is mainly used to parse HTML, but it can also extract text from PDF, DOC, and other attached files. It can check sources for updates at a set interval and ignore unnecessary content. The collected data is used to create posts for the system's own resources. All posts are sorted by category and are available to users.
Industries: Legal services, Software manufacturers, Industry
Solution: Media content management
Technologies and tools:
Java, PostgreSQL, Spring, Spring Boot, Jsoup, JUnit, Git, Kitematic, Docker, RestTemplate
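
The description does not say how updates are detected, so the following is only a minimal sketch of one common approach: hash the text extracted from a page and compare it with the hash stored from the previous check. The class name UpdateChecker, its methods, the use of SHA-256, and the in-memory map are all assumptions made for illustration; a real system would more likely persist the digests, for example in PostgreSQL.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from the project: detects content changes by digest.
class UpdateChecker {
    // Last seen digest per source URL (in-memory here purely for illustration).
    private final Map<String, String> lastDigest = new HashMap<>();

    // Collapses runs of whitespace so purely cosmetic changes are not reported as updates.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Returns the SHA-256 digest of the text as a lowercase hex string.
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns true when the URL is seen for the first time or its text has changed,
    // and remembers the new digest for the next check.
    boolean hasChanged(String url, String extractedText) {
        String digest = sha256(normalize(extractedText));
        String previous = lastDigest.put(url, digest);
        return !digest.equals(previous);
    }

    public static void main(String[] args) {
        UpdateChecker checker = new UpdateChecker();
        String url = "https://example.com/news"; // placeholder URL
        System.out.println(checker.hasChanged(url, "First version of the page"));
        System.out.println(checker.hasChanged(url, "First   version of the page "));
        System.out.println(checker.hasChanged(url, "Second version of the page"));
    }
}
```

To run such a check at an interval, hasChanged could be called from a task scheduled with ScheduledExecutorService.scheduleAtFixedRate, with a new post created only when it returns true.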