Malicious Project: PHP Web Scraper with an Admin Dashboard Using WordPress as a Headless Server
In data management and automation, web scraping plays a crucial role in extracting and processing large amounts of information. I developed a subdomain website that serves as a dashboard for managing large datasets, leveraging WordPress as a headless server for authentication and user management. The application secures access with JWT Authentication for the WordPress REST API, so that only verified users can log in and interact with the system.
Authentication and User Access
To access the dashboard, users must authenticate using their WordPress credentials. The integration of JWT Authentication ensures that only authorized individuals can manage and process data. This setup simplifies user management while leveraging WordPress's robust authentication mechanisms.
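To make the login flow concrete, here is a minimal sketch of the client side of that handshake. It assumes the popular JWT Authentication for WP REST API plugin, whose default token route is POST /wp-json/jwt-auth/v1/token; the helper function names are my own and not from the original project.

```php
<?php
// Sketch of talking to the JWT Authentication for WP REST API plugin.
// The endpoint path is the plugin's default; the helper names are assumptions.

/**
 * Build the JSON body for the plugin's token endpoint
 * (POST /wp-json/jwt-auth/v1/token).
 */
function buildTokenRequestBody(string $username, string $password): string
{
    return json_encode(['username' => $username, 'password' => $password]);
}

/**
 * Pull the JWT out of the plugin's JSON response, or return null on failure
 * (the plugin returns a "token" field on a successful login).
 */
function extractToken(string $responseJson): ?string
{
    $data = json_decode($responseJson, true);
    return is_array($data) && isset($data['token']) ? $data['token'] : null;
}

// Example: a successful response carries the token alongside user details.
$sample = '{"token":"eyJ0eXAi...","user_email":"admin@example.com"}';
var_dump(extractToken($sample)); // string "eyJ0eXAi..."
```

Subsequent dashboard requests would then send the token in an `Authorization: Bearer <token>` header, which the plugin validates against the WordPress user database.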
HTML File Upload and Scraping Process
The primary function of the dashboard is to process HTML files that users upload manually. These files contain data extracted from external sources that would typically be fetched programmatically. However, automated requests to those external web servers led to repeated IP blocking. To work around this, users download the HTML files themselves, upload them to the dashboard, and start the scraping process by clicking the “Proceed” button.
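A server-side check along these lines could guard the upload step before “Proceed” hands the file to the parser; the function name and the size limit are illustrative assumptions, not taken from the original project.

```php
<?php
// Hypothetical validation run on each uploaded file before parsing:
// accept only .html/.htm files within a size cap (10 MB here, an assumption).

function isAcceptableHtmlUpload(string $fileName, int $sizeBytes, int $maxBytes = 10485760): bool
{
    $ext = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));
    return in_array($ext, ['html', 'htm'], true)
        && $sizeBytes > 0
        && $sizeBytes <= $maxBytes;
}

var_dump(isAcceptableHtmlUpload('report.html', 2048)); // bool(true)
var_dump(isAcceptableHtmlUpload('report.exe', 2048));  // bool(false)
```

In a real deployment the file name and size would come from PHP's `$_FILES` superglobal after the form submission.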
Technical Implementation
The scraping logic is implemented in plain PHP with no external scraping libraries, which keeps the data processing pipeline fast and fully under our control. A key component of the system is a PHP class called ProcessHtmlFiles, designed to:
Differentiate between new and existing files.
Prevent duplicate file uploads to conserve storage space.
Maintain a historical database of processed files, ensuring that previously processed data is not re-included, even if the original files are deleted.
Automate the cleanup and management of storage, reducing manual intervention.
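The deduplication and history behavior described above can be sketched as follows. The class name matches the article, but the methods and the choice of SHA-256 content hashing as the dedup key are my assumptions; the original implementation may differ.

```php
<?php
// A minimal sketch of the ProcessHtmlFiles idea: duplicates are detected by
// content hash, and every processed hash is kept in a history set, so a
// re-upload is rejected even after the original file has been deleted.
// Method names and the hashing scheme are assumptions, not the actual code.

class ProcessHtmlFiles
{
    /** @var array<string, true> hashes of every file ever processed */
    private array $history = [];

    /** Differentiate new files from previously seen ones (by content hash). */
    public function isNew(string $path): bool
    {
        return !isset($this->history[hash_file('sha256', $path)]);
    }

    /** Record a file in the historical database; returns false for duplicates. */
    public function register(string $path): bool
    {
        $hash = hash_file('sha256', $path);
        if (isset($this->history[$hash])) {
            return false; // duplicate upload: skip it to conserve storage
        }
        $this->history[$hash] = true;
        return true;
    }
}
```

In production the history set would be persisted in the database rather than held in memory, which is what lets the system reject previously processed data even when the source files no longer exist on disk.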
Data Processing and Storage
Once uploaded, the HTML files undergo structured parsing to extract the relevant information. The extracted data is then posted to an external server for integration with other data management systems. The database stores not only the extracted information but also metadata about each processed file, enabling efficient tracking and validation.
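A structured-parsing step of this kind can be sketched with PHP's built-in DOM extension. The table-row selector below is purely illustrative; the real markup and XPath expressions would depend on the scraped sites.

```php
<?php
// Hedged sketch of the parsing step using DOMDocument/DOMXPath from PHP's
// bundled DOM extension. The "table rows" shape is an illustrative assumption.

function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings from real-world, imperfect HTML.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Only rows with <td> cells, so header rows (<th>) are skipped.
    foreach ($xpath->query('//table//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

The resulting arrays can then be JSON-encoded and posted to the external server (for example with cURL) as a separate step.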
Challenges and Solutions
The major challenge in this project was the inability to automate the retrieval of HTML files due to frequent IP bans. This limitation necessitated a manual file upload process, which, although less efficient, ensured compliance with server restrictions. The extensive storage optimization mechanisms implemented in ProcessHtmlFiles mitigated potential issues related to redundant data and inefficient disk usage.
Conclusion
The development of this PHP-based web scraper with a WordPress headless server for authentication showcases the power of integrating traditional CMS platforms with custom data processing solutions. While direct automation was not feasible due to IP restrictions, the manual upload process provided a viable workaround. The implementation of robust storage management and authentication mechanisms ensured a secure, scalable, and efficient solution for handling large datasets.