Methodological Note: Automated Visual Verification for Large-Scale Demographic Inference
January 21, 2026
If you utilize this tool for your own research, please cite the OSF repository.
In quantitative research on higher education, analyzing the demographics of the academic workforce often relies on name-based gender inference algorithms. While tools such as gender_guesser or Ethnea allow for population-scale analysis, they carry inherent error rates, particularly regarding ambiguous names or specific cultural naming conventions.
Our current project investigates women's representation among over 300,000 faculty members across U.S. higher education institutions. To ensure the reliability of our findings, we required a robust mechanism to validate our name-based classifiers against a ground-truth dataset. Visual verification is the gold standard for this validation; however, manually collecting and verifying photographs for a statistically significant stratified sample is prohibitively labor-intensive.
To address this, we developed a Python-based pipeline that automates the retrieval and verification of faculty headshots using computer vision. This tool allows for the rapid generation of high-fidelity validation datasets. Today, we are releasing the source code on the Open Science Framework (OSF) to support reproducibility in bibliometric and demographic research.
The Challenge: Automated Disambiguation
The primary challenge in automating faculty image retrieval is not identification, but disambiguation. A standard search query for a faculty member often returns departmental news pages containing multiple faces—colleagues, students, or guest speakers—alongside the target subject.
Naive scraping approaches (such as selecting the largest face in a returned image) can introduce substantial noise into the dataset. For this validation study, where the goal is to measure the error rate of another algorithm, the ground-truth data must be accurate: false positives in the image set would invalidate the error analysis.
The Solution: Consensus Clustering & Entropy Filtering
We implemented a multi-stage verification algorithm designed to minimize false positives by mimicking human verification logic. The pipeline prioritizes precision over recall: it is better to return no data (skip a faculty member) than to return an incorrect image.
- Domain-Specific Querying
To help confirm each subject's identity, the search is strictly limited to the faculty member's institutional .edu domain (e.g., site:fordham.edu). This approach leverages the university's official web presence as an authoritative source, excluding same-named individuals at other institutions and on third-party sites.
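As a rough illustration, the domain restriction can be expressed as a query builder for the Google Custom Search JSON API. This is a minimal sketch, not the released pipeline: the function name and the placeholder credentials (`api_key`, `cse_id`) are ours, while `key`, `cx`, `q`, `searchType`, and `num` are standard parameters of that API.

```python
def build_search_params(name: str, institution_domain: str,
                        api_key: str, cse_id: str, num: int = 10) -> dict:
    """Build query parameters for the Google Custom Search JSON API,
    restricting image results to the faculty member's .edu domain."""
    return {
        "key": api_key,                              # API credential
        "cx": cse_id,                                # Custom Search Engine ID
        "q": f'"{name}" site:{institution_domain}',  # exact name + domain lock
        "searchType": "image",                       # image results only
        "num": num,                                  # candidates per request
    }

params = build_search_params("Jane Doe", "fordham.edu", "YOUR_KEY", "YOUR_CX")
```

The resulting parameters would be sent to `https://www.googleapis.com/customsearch/v1`; because the `site:` operator is embedded in the query string itself, off-domain matches never enter the candidate pool.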
- Signal Processing (Shannon Entropy Filter)
Before facial analysis begins, the algorithm calculates the Shannon Entropy of the candidate image's histogram. High entropy indicates photographic data; low entropy characterizes synthetic graphics. By setting a strict threshold, the system automatically discards university logos, placeholders, and vector graphics that frequently appear in search results.
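The entropy filter above can be sketched in a few lines of NumPy. The threshold value below is illustrative only; the calibrated value used in the pipeline is documented in the repository.

```python
import numpy as np

def shannon_entropy(gray: np.ndarray) -> float:
    """Shannon entropy (in bits) of an 8-bit grayscale image's histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins (0 * log 0 is treated as 0)
    return float(-np.sum(p * np.log2(p)))

# A flat placeholder image concentrates all pixels in one bin (entropy 0);
# a natural photograph spreads mass across many bins and scores much higher.
PHOTO_ENTROPY_THRESHOLD = 4.0  # illustrative, not the pipeline's calibrated value

def is_photograph(gray: np.ndarray) -> bool:
    return shannon_entropy(gray) >= PHOTO_ENTROPY_THRESHOLD
```

Solid-color logos and vector placeholders fail this check immediately, so no face detector ever runs on them.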
- Consensus Clustering (Identity Verification)
The core innovation of the pipeline is the use of Consensus Clustering to resolve ambiguity. For every subject, the script analyzes the top 20 candidate images found on the target domain.
Using the OpenFace model (a deep neural network run through OpenCV's DNN module), the system generates a 128-dimensional embedding vector for every detected face. It then performs pairwise clustering on these vectors to determine whether a visual consensus exists:
- Consensus Found: If a cluster of faces (e.g., 3 out of 5 candidates) is statistically similar (pairwise Euclidean distance < 0.9), the system infers that this recurring face is the target faculty member.
- Ambiguity Rejection: If the candidate images contain only dissimilar faces (no cluster forms), the system infers that the search results are ambiguous and excludes the subject from the dataset.
This method ensures that the resulting dataset consists only of individuals whose visual identity could be cross-verified across multiple independent sources within the university domain.
Open Science & Reproducibility
At AARC, we believe that transparent methodology is essential for demographic research. By open-sourcing this image collection tool, we aim to assist other researchers in auditing their own name-based inference methods.
The repository includes:
- The Search Pipeline: A resumable, idempotent script that manages API quotas and prevents redundant processing.
- The Verification Logic: The implementation of OpenFace clustering and entropy filtering.
- Documentation: Detailed methodology on the computer vision thresholds used.
You can access the entire project (and the code for this specific image collection technique) here: DOI 10.17605/OSF.IO/UYTF2
Acknowledgements
This tool uses the Google Custom Search API for retrieval and OpenCV's Deep Neural Network module for facial analysis. Code generation and debugging assistance were provided by Large Language Models (Google Gemini). All logic and parameters were manually verified by the research team.
