biblioshiny: the shiny app for bibliometrix
Biblioshiny 5.0 now includes Biblio AI – a powerful AI assistant for your science mapping analyses.
biblioshiny and bibliometrix are open-source and freely available for use, distributed under the MIT license.
When they are used in a publication, we ask that authors cite the following reference:
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975.
Failure to properly cite the software is considered a violation of the license.
For an introduction and live examples, visit the bibliometrix website.
SAAS Workflow
Search - Appraisal - Analysis - Synthesis
About the SAAS Workflow
The SAAS workflow represents the comprehensive process of bibliometrix and biblioshiny for conducting scientific bibliometric analysis. Each phase is designed to ensure methodological rigor and reliable results:
- Search: Systematic collection of bibliographic data from academic databases
- Appraisal: Quality assessment and filtering of collected data
- Analysis: Application of advanced bibliometric techniques and AI
- Synthesis: Results synthesis and scientific report generation
The iterative cycle allows continuous refinement of the analysis by returning to previous phases based on obtained results.
SAAS Workflow developed by:
Massimo Aria & Corrado Cuccurullo
University of Naples Federico II, Italy
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959-975.
🧠 Biblio AI: AI-Powered Bibliometric Analysis
Starting from version 5.0, Biblioshiny introduces Biblio AI, a new suite of features powered by Google's Gemini models. This integration allows users to receive automatic interpretations, critical insights, and narrative summaries of their bibliometric results – directly within the platform.
Note: Biblio AI requires a Chrome-based browser (such as Google Chrome or Microsoft Edge) installed on your computer to work correctly.
✨ What does Biblio AI do?
Biblio AI enhances the core analytical modules of Biblioshiny by providing contextual, AI-generated commentary on several results, such as:
- Overview: High-level summaries of key bibliometric indicators and collection features.
- Three-Field Plot: Interpretation of the connections among sources, authors, and keywords.
- Authors' Production over Time: Insights on temporal dynamics and productivity patterns of key authors.
- Corresponding Author's Countries Collaboration: Discussion of international scientific collaboration patterns.
- Most Local Cited Documents: Evaluation of the most influential documents within the dataset.
- Reference Publication Year Spectroscopy: Identification and interpretation of historical citation peaks.
- Trend Topics: Explanation of thematic evolution and detection of emerging research trends.
- Knowledge Structures: Analysis of conceptual maps and networks such as co-citation and co-word analysis.
- Country Collaboration World Map: AI-assisted reading of global co-authorship and geographical patterns.
In each of these sections, users can activate the Biblio AI panel to access dynamic text explanations, perfect for use in scientific writing, presentations, or reporting.
🔧 How to enable Biblio AI?
To enable Biblio AI, follow these simple steps:
- Register at Google AI Studio (free access available).
- Generate an API Key enabled for Gemini model access (Free Tier supported).
- Enter your API Key in the Settings section of Biblioshiny.
The interface will guide you through the secure and local setup. Your API key is used only on your device to interact with the AI model.
🎯 Why use Biblio AI?
- Reduces time spent interpreting complex outputs.
- Supports scientific writing and research reporting.
- Helps users better understand bibliometric patterns and dynamics.
- Delivers explanations in natural language, accessible to both experts and newcomers.
📚 Supported Bibliographic Databases and Suggested File Formats
Biblioshiny imports and analyzes collections exported from the following bibliographic databases:
Web of Science, Scopus, and OpenAlex allow users to export the complete set of metadata, making it possible to perform all analyses implemented in Biblioshiny.
Some other databases, such as Dimensions, PubMed, and Cochrane Library, provide only a limited set of metadata. This may impose restrictions on the range of analyses that can be conducted using those datasets.
The following table (not included here) reports, for each supported database:
- The file formats supported by the export interface
- The types of metadata contained in each export option
- The suggested file format to use with Biblioshiny
📖 Main Authors' References (Bibliometrics)
- Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
- Aria, M., Cuccurullo, C., D'Aniello, L., Misuraca, M., & Spano, M. (2024). Comparative science mapping: a novel conceptual structure analysis with metadata. Scientometrics. https://doi.org/10.1007/s11192-024-05161-6
- Aria, M., Le, T., Cuccurullo, C., Belfiore, A., & Choe, J. (2023). openalexR: An R-Tool for Collecting Bibliometric Data from OpenAlex. R Journal, 15(4). https://doi.org/10.32614/rj-2023-089
- Aria, M., Misuraca, M., & Spano, M. (2020). Mapping the evolution of social research and data science on 30 years of Social Indicators Research. Social Indicators Research. https://doi.org/10.1007/s11205-020-02281-3
- Aria, M., Cuccurullo, C., D'Aniello, L., Misuraca, M., & Spano, M. (2022). Thematic Analysis as a New Culturomic Tool: The Social Media Coverage on COVID-19 Pandemic in Italy. Sustainability, 14(6), 3643. https://doi.org/10.3390/su14063643
- Aria, M., Alterisio, A., Scandurra, A., Pinelli, C., & D'Aniello, B. (2021). The scholar's best friend: research trends in dog cognitive and behavioural studies. Animal Cognition. https://doi.org/10.1007/s10071-020-01448-2
- Cuccurullo, C., Aria, M., & Sarto, F. (2016). Foundations and trends in performance management: A twenty-five years bibliometric analysis in business and public administration domains. Scientometrics. https://doi.org/10.1007/s11192-016-1948-8
- Cuccurullo, C., Aria, M., & Sarto, F. (2015). Twenty years of research on performance management in business and public administration domains. Presented at CARME 2015. Link
- Sarto, F., Cuccurullo, C., & Aria, M. (2014). Exploring healthcare governance literature: systematic review and paths for future research. Mecosan. Link
- Cuccurullo, C., Aria, M., & Sarto, F. (2013). Twenty years of research on performance management in business and public administration domains. Academy of Management Proceedings, Vol. 2013, No. 1, p. 14270. https://doi.org/10.5465/AMBPP.2013.14270abstract
- Belfiore, A., Salatino, A., & Osborne, F. (2022). Characterising Research Areas in the field of AI. arXiv preprint. https://doi.org/10.48550/arXiv.2205.13471
- Belfiore, A., Cuccurullo, C., & Aria, M. (2022). IoT in healthcare: A scientometric analysis. Technological Forecasting and Social Change, 184, 122001. https://doi.org/10.1016/j.techfore.2022.122001
- D'Aniello, L., Spano, M., Cuccurullo, C., & Aria, M. (2022). Academic Health Centers' configurations, scientific productivity, and impact: insights from the Italian setting. Health Policy. https://doi.org/10.1016/j.healthpol.2022.09.007
- Belfiore, A., Scaletti, A., Lavorato, D., & Cuccurullo, C. (2022). The long process by which HTA became a paradigm: A longitudinal conceptual structure analysis. Health Policy. https://doi.org/10.1016/j.healthpol.2022.12.006
Converting data to Bibliometrix format
Import or Load
The use of bibliometric approaches in business and management disciplines.
Dataset 'Management'
Period: 1985 - 2020, Source WoS.
Export Collection
📥 Import or Load: Building Your Bibliometric Collection
The Import or Load module is the starting point for any bibliometric analysis in Biblioshiny. This section allows users to build their bibliographic collection by either importing raw files from supported databases or loading pre-processed bibliometrix files saved in previous sessions.
📂 Three Import Options
Biblioshiny offers three flexible ways to create or load a bibliographic collection:
1. Import Raw File(s)
Import bibliographic data directly from supported databases in their native export formats.
Supported Databases:
- Web of Science (.txt, .bib format)
- Scopus (.bib, .csv format)
- OpenAlex (via API integration or pre-downloaded files)
- Dimensions (.csv, .xlsx format)
- Lens (.csv format)
- PubMed (.txt format)
- Cochrane Library (.txt format)
Import Process:
- Click Browse to select one or more raw export files from your computer
- Biblioshiny automatically detects the database format and parses the metadata
- The system converts the raw data into a standardized bibliometrix data frame
- A Conversion Results summary displays the number of documents successfully imported
- View a preview table showing key metadata fields (DOI, Authors, Title, Journal, etc.)
Important Notes:
- Files from different databases can be merged later using the Merge Collections module
- For best results, export the full record with cited references from the source database
- Some databases (e.g., Web of Science, Scopus) have export limits—download data in batches if necessary
- Always check the file format requirements in the Info section before exporting from databases
2. Load Bibliometrix File(s)
Resume work on a previously processed collection by loading .rdata or .xlsx files generated by Biblioshiny or the bibliometrix R package.
Use Cases:
- Continue analysis from a previous session
- Load collections pre-processed using the bibliometrix R package
- Share standardized datasets with collaborators
- Work with large collections that have already undergone data cleaning and filtering
Supported Formats:
- .rdata: R Data Serialization format (preserves full metadata and structure)
- .xlsx: Excel format (compatible with bibliometrix exports)
3. Use a Sample Collection
Perfect for testing and learning Biblioshiny's features without preparing your own data.
- Select from pre-loaded example datasets covering various research domains
- Ideal for exploring the platform's analytical capabilities
- No file upload required—start analyzing immediately
🔍 Post-Import Features
After successfully importing or loading a collection, you can:
- View Collection Metadata: Preview document details in a sortable, filterable table
- Add Brief Description: Write a custom description of your collection for documentation purposes
- Export Collection: Save your processed collection as .rdata, .xlsx, or .csv for backup or sharing
- Start Analysis: Click the blue Start button to proceed to filtering and analysis modules
💾 Exporting Collections
Once your collection is loaded, you can export it in multiple formats:
- .rdata: Recommended for preserving all metadata and R-specific structures
- .xlsx: Excel-compatible format for sharing with non-R users
⚠️ Best Practices
- Always save your processed collections after importing raw files to avoid re-conversion
- Use descriptive filenames when exporting (e.g., management_wos_1990-2020.rdata)
- Check conversion results carefully—some database exports may have formatting issues that require manual correction
- For large collections (>5,000 documents), consider applying filters early to improve performance
📚 References
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
OpenAlex Data Collection
Date Range Filter
PubMed Data Collection
Date Range Filter
Load Collections
Merge collections in Excel or R format coming from different DBs
Export Collection
🔀 Merge Collections: Combining Data from Multiple Sources
The Merge Collections module allows users to combine bibliographic datasets from different databases (Web of Science, Scopus, OpenAlex, PubMed, etc.) into a single unified collection. This functionality is essential for comprehensive literature reviews, cross-database validation, and maximizing metadata coverage by leveraging the strengths of multiple sources.
🎯 Why Merge Collections?
- Broader Coverage: Different databases index different journals and document types—merging increases the comprehensiveness of your dataset
- Complementary Metadata: Scopus may provide detailed affiliation data, while Web of Science offers comprehensive citation links—combining them enriches your analysis
- Validation: Cross-referencing records from multiple sources improves data quality and identifies discrepancies
- Deduplication: Automatically removes duplicate records that appear in multiple databases
🔧 How to Merge Collections
The merge process in Biblioshiny is straightforward:
- Navigate to Merge Collections: Select Data > Merge Collections from the main menu
- Select Collection Files: Click Browse and select two or more bibliometrix files to merge:
  - Supported formats: .rdata, .xlsx
  - Files can originate from different databases (e.g., wos_collection.rdata + scopus_collection.xlsx)
  - Files must be valid bibliometrix data frames (created via Import or Load, or the R package)
- Configure Merge Options:
  - Remove Duplicates: Enable (recommended) to automatically detect and remove duplicate records
  - Verbose Output: Enable to display detailed information about the merge process and duplicates removed
- Click Start: The merge algorithm combines the collections, standardizes metadata fields, and removes duplicates
- Review Results: A summary displays the total number of documents and how many duplicates were removed
- Export Merged Collection: Save the unified dataset for future analysis
🔬 Merge Algorithm Overview
The merge process follows a sophisticated multi-stage algorithm implemented by the mergeDbSources() function:
Stage 1: Database Identification and Ordering
- Each collection is tagged with its source database (DB field: ISI, SCOPUS, OPENALEX, LENS, DIMENSIONS, PUBMED, COCHRANE)
- Collections are ordered by database priority to preserve the most reliable metadata when conflicts arise
- Order: Web of Science (ISI) > Scopus > OpenAlex > Lens > Dimensions > PubMed > Cochrane
Stage 2: Field Alignment
- Common metadata fields are identified and aligned across databases (e.g., TI = Title, AU = Authors, DI = DOI)
- Database-specific fields are preserved when possible
- Missing fields in one database are filled from another when duplicates are detected
- A unified KW_Merged field is created by combining keywords from all sources
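The construction of a merged keyword field can be sketched as an ordered, case-insensitive union of the keyword strings from each source. This is an illustration of the described behavior, not the actual bibliometrix code:

```python
def merge_keywords(*keyword_fields, sep="; "):
    """Combine semicolon-separated keyword strings from several sources
    into one field, dropping duplicates case-insensitively while
    preserving first-seen order."""
    seen, merged = set(), []
    for field in keyword_fields:
        for kw in (field or "").split(";"):
            kw = kw.strip()
            if kw and kw.lower() not in seen:
                seen.add(kw.lower())
                merged.append(kw.upper())
    return sep.join(merged)
```

For example, merging "Bibliometrics; Science Mapping" (Scopus) with "science mapping; R" (OpenAlex) yields a single deduplicated field.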
Stage 3: Duplicate Detection
Duplicates are identified using a two-step matching strategy:
Step 3.1: DOI-Based Matching
- Documents with identical DOIs are flagged as duplicates
- This is the most reliable method, as DOIs are unique identifiers
- Empty or missing DOIs ('' or NA) are ignored to avoid false positives
- Only the first occurrence is retained; subsequent matches are removed
Step 3.2: Title-Year Matching
- For records without DOIs, duplicates are detected using normalized titles and publication years
- Title Normalization:
  - Remove all punctuation and special characters
  - Convert to lowercase
  - Remove extra whitespace
  - Example: 'Science Mapping: A Review' → 'science mapping a review'
- Matching Criterion: Two documents are duplicates if they have:
  - Identical normalized titles AND
  - Identical publication years (PY)
- This method captures ~95% of duplicates but may miss records with minor title variations
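The two-step duplicate detection described above can be sketched in a few lines. This is an illustrative Python reimplementation of the stated logic, not the bibliometrix source; the field tags DI (DOI), TI (title), and PY (year) follow the conventions used elsewhere on this page:

```python
import re

def normalize_title(title):
    """Lowercase, strip punctuation, collapse whitespace (Step 3.2)."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def deduplicate(records):
    """Keep the first occurrence of each document.

    records: list of dicts with keys DI (DOI), TI (title), PY (year).
    DOI matches are checked first (Step 3.1); records without a DOI
    fall back to normalized title + publication year (Step 3.2).
    """
    seen_dois, seen_title_year, kept = set(), set(), []
    for rec in records:
        doi = (rec.get("DI") or "").strip().lower()
        key = (normalize_title(rec["TI"]), rec["PY"])
        if doi:
            if doi in seen_dois:
                continue  # duplicate by DOI
            seen_dois.add(doi)
        elif key in seen_title_year:
            continue  # duplicate by normalized title + year
        seen_title_year.add(key)
        kept.append(rec)
    return kept
```

Running the sketch on the 'Science Mapping: A Review' example above removes both a DOI duplicate and a DOI-less title-year duplicate in a single pass.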
Stage 4: Author Name Standardization
When merging collections from multiple databases, author name formats are standardized:
- Format: LASTNAME INITIALS (e.g., 'Aria M; Cuccurullo C')
- Commas in author names are removed to ensure consistency
- Middle initials are condensed to single letters
- This standardization improves author-based analyses (e.g., collaboration networks, productivity rankings)
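A minimal sketch of the name standardization step, assuming input in 'Lastname, Given Names' form; the actual parsing rules in mergeDbSources() may differ:

```python
def standardize_author(name):
    """Normalize an author string to 'LASTNAME INITIALS' (Stage 4 sketch).

    Commas are dropped and given names are condensed to initials,
    e.g. 'Aria, Massimo' -> 'ARIA M'. Illustrative only: it assumes the
    surname comes first, which is the usual database export format.
    """
    parts = name.replace(",", " ").split()
    last, given = parts[0], parts[1:]
    initials = "".join(p[0] for p in given)  # condense to single letters
    return f"{last} {initials}".strip().upper()
```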
Stage 5: Metadata Integration
- The DB_Original field stores each document's source database
- The DB field is set to 'ISI' (Web of Science format) for compatibility with downstream analyses
- Cited references (CR) are preserved but stored in CR_raw to allow re-processing later
- A unique SR (Short Reference) identifier is generated for each document
📊 Merge Statistics and Validation
After merging, the system provides detailed statistics:
- Total Documents Before Merge: Sum of all input collections
- Duplicates Removed: Number of records eliminated (broken down by DOI matches and title-year matches)
- Total Documents After Merge: Final collection size
- Coverage by Database: Proportion of documents from each source (visible in the DB_Original field)
Example Output:
Merging 3 collections:
- WoS: 1,500 documents
- Scopus: 1,800 documents
- OpenAlex: 2,000 documents
Total: 5,300 documents
Removing duplicates...
- 450 duplicates removed by DOI
- 320 duplicates removed by title-year match
Final collection: 4,530 documents
📌 Best Practices
- Always enable duplicate removal unless you have a specific reason to retain duplicates
- Prioritize Web of Science or Scopus as the primary source—these databases generally have the most complete metadata
- Use OpenAlex to supplement coverage for open-access publications or gray literature
- Validate merge results by checking the distribution of DB_Original values—extreme imbalances may indicate incomplete data from one source
- Save merged collections immediately to avoid re-processing
⚠️ Important Considerations
- Citation Data: Merged collections reset the CR (Cited References) field—you'll need to run Reference Matching again after merging
- Field Coverage: Some databases provide richer metadata than others—merging doesn't 'fill in' missing fields unless duplicates are detected
- Large Collections: Merging collections >10,000 documents may take several minutes—be patient and avoid interrupting the process
- Database-Specific Analyses: Some analyses are database-specific—merged collections may lose this granularity
🔍 Example Use Cases
- Systematic Literature Review: Combine Web of Science, Scopus, and PubMed to ensure no relevant publications are missed
- Open Science Research: Merge OpenAlex with traditional databases to include preprints and institutional repositories
- Validation Study: Compare overlap between databases to assess index coverage and bias
- Longitudinal Analysis: Merge historical Web of Science data with recent OpenAlex records to extend temporal coverage
📚 References
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
Visser, M., van Eck, N. J., & Waltman, L. (2021). Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic. Quantitative Science Studies, 2(1), 20–41. https://doi.org/10.1162/qss_a_00112
Martín-Martín, A., Thelwall, M., Orduna-Malea, E., & Delgado López-Cózar, E. (2021). Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations' COCI: a multidisciplinary comparison of coverage via citations. Scientometrics, 126(1), 871–906. https://doi.org/10.1007/s11192-020-03690-4
Reference Matching
This tool helps identify and merge duplicate citations in your bibliographic dataset. It uses string similarity algorithms to find variants of the same reference, allowing you to clean and standardize your data for more accurate analysis.
Matching Statistics
Manual Merge
Top Cited References (After Normalization)
Citation Variants Examples
The table below shows all variants of the selected citation that were matched together.
Matching Options
• 0.90-0.95: Conservative (fewer false positives)
• 0.85-0.90: Balanced (recommended)
• 0.75-0.80: Aggressive (more matching)
Matching in progress...
Please wait while citations are being normalized.
Apply to Data
Reset will restore the original CR field from your initial dataset.
Export Results
Download Normalized Data
The exported data will contain the bibliometric data with normalized citations in the CR field.
Advanced Options
🔗 Reference Matching: Algorithm and Usage
The Reference Matching module implements an advanced algorithm to identify and link cited references within a bibliometric collection to the actual documents present in the dataset. This process enables accurate citation network analysis, co-citation studies, and identification of highly-cited works within the collection.
🔬 Algorithm Overview
The reference matching algorithm follows a multi-step procedure designed to maximize accuracy while handling noisy and incomplete bibliographic data:
- Reference Extraction: Cited references are parsed from the reference list (CR field) of each document in the collection. Each reference string is decomposed into structured components: first author surname, publication year, journal/source, volume, page, and DOI (when available).
- Data Normalization: Both references and documents undergo extensive normalization to reduce variability:
- Author names are standardized (e.g., removing accents, abbreviations, and middle initials)
- Journal titles are normalized using abbreviation lookup tables and string similarity methods
- Years, volumes, and page numbers are cleaned and formatted uniformly
- Blocking Strategy: To improve computational efficiency, references are grouped into blocks based on first author surname and publication year. Only references and documents within the same block are compared, reducing the search space significantly.
- Similarity Computation: For each reference-document pair within a block, a matching score is calculated using a weighted combination of similarity measures:
- DOI matching (if available): exact match = 100% confidence
- First author similarity: string distance (Jaro-Winkler or Levenshtein)
- Year match: exact or within ±1 year tolerance
- Journal/source similarity: string distance between normalized titles
- Volume and page matching: exact or fuzzy comparison
- Threshold-Based Assignment: A reference is matched to a document if the combined similarity score exceeds a predefined threshold (typically 0.85–0.95). The threshold can be adjusted by the user to balance precision and recall.
- Ambiguity Resolution: In cases where a reference matches multiple documents, the algorithm selects the candidate with the highest similarity score. If scores are nearly identical, the match is flagged for manual review.
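The weighted similarity computation and threshold-based assignment above can be illustrated as follows. Python's difflib.SequenceMatcher stands in for Jaro-Winkler or Levenshtein here, and the weights are illustrative placeholders, not the values used by the actual implementation; field tags (AU, PY, SO, VL, DI) follow the conventions on this page:

```python
from difflib import SequenceMatcher

def sim(a, b):
    """String similarity in [0, 1]; stands in for Jaro-Winkler here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(ref, doc, weights=(0.4, 0.2, 0.3, 0.1)):
    """Weighted combination of the similarity components described above.

    ref/doc: dicts with keys AU (first author), PY (year), SO (source),
    VL (volume), DI (DOI). A shared DOI short-circuits to full
    confidence; the weights are made-up for illustration.
    """
    if ref.get("DI") and ref["DI"] == doc.get("DI"):
        return 1.0  # exact DOI match = 100% confidence
    w_au, w_py, w_so, w_vl = weights
    author = sim(ref["AU"], doc["AU"])
    year = 1.0 if abs(ref["PY"] - doc["PY"]) <= 1 else 0.0  # ±1 year tolerance
    source = sim(ref["SO"], doc["SO"])
    volume = 1.0 if ref.get("VL") == doc.get("VL") else 0.0
    return w_au * author + w_py * year + w_so * source + w_vl * volume

def is_match(ref, doc, threshold=0.90):
    """Threshold-based assignment (Step 5); 0.90 mirrors the default."""
    return match_score(ref, doc) >= threshold
```

Lowering the threshold toward 0.75 increases recall at the cost of false positives, which is exactly the trade-off exposed by the Matching Options panel.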
💡 Usage in Biblioshiny
To perform reference matching in Biblioshiny, follow these steps:
- Load your bibliographic collection: Ensure that your dataset includes the CR (Cited References) field, which is available in full exports from Web of Science, Scopus, and OpenAlex.
- Navigate to the Reference Matching module: Access the module from the Analysis menu or the Citation Network section.
- Configure matching parameters:
- Similarity threshold: Adjust the matching threshold to control precision (higher values = stricter matching, fewer false positives).
- Normalization options: Enable or disable specific normalization rules (e.g., journal abbreviation matching, fuzzy year tolerance).
- Run the algorithm: Click Start Matching to initiate the process. Depending on the collection size, this may take several minutes.
- Review results: The output includes:
- A summary table of matched and unmatched references
- A list of ambiguous matches for manual inspection
- Network visualization options (e.g., co-citation network, historiograph)
- Export matched data: The matched citation network can be exported for further analysis in external tools (e.g., Gephi, Pajek) or used directly in Biblioshiny for advanced network analysis.
⚙️ Key Parameters and Options
- Matching Threshold: Minimum similarity score (0–1) required for a match. Default: 0.90. Lower values increase recall but may introduce false positives.
- Fuzzy Year Matching: Allows matches within ±1 year (useful for handling publication date discrepancies). Default: enabled.
- DOI Priority: When a DOI is available, it overrides other matching criteria. Default: enabled.
- Manual Review Mode: Flags ambiguous matches (score between 0.85–0.90) for user verification. Default: disabled.
📊 Applications
Reference matching is essential for several bibliometric analyses:
- Co-citation analysis: Identify documents frequently cited together, revealing intellectual structure.
- Historiograph: Trace the historical development of research topics through citation linkages.
- Most Cited Local Documents: Rank documents by the number of times they are cited within the collection.
- Citation networks: Construct directed citation graphs for network-based metrics (PageRank, betweenness centrality).
⚠️ Important Notes
- Reference matching quality depends heavily on the completeness and accuracy of the CR field in the original data export.
- Incomplete or poorly formatted references (e.g., missing author names, incorrect years) may result in lower matching rates.
- For very large collections (>10,000 documents), consider using subsets or increasing the matching threshold to improve performance.
- Always verify ambiguous matches manually, especially for high-stakes analyses.
📚 References
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
Garfield, E. (1979). Citation indexing: Its theory and application in science, technology, and humanities. New York: Wiley.
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4), 265–269. https://doi.org/10.1002/asi.4630240406
1. General
2. (J) Journal
3. (AU) Author's Country
4. (DOC) Documents
🔍 Filters: Refining Your Bibliometric Collection
The Filters module provides a comprehensive set of tools to refine and subset your bibliographic collection based on multiple metadata criteria. By applying filters, you can focus your analysis on specific document types, time periods, geographic regions, journals, or citation thresholds—enabling more targeted and meaningful bibliometric insights.
Filters are organized into four thematic panels, each addressing different aspects of bibliographic metadata. At the top of the page, a real-time summary displays how many documents, sources, and authors remain after applying your filter selections.
📊 Real-Time Filter Summary
Located at the top of the Filters page, this summary updates dynamically as you adjust filter settings:
- Documents: Shows the number of documents currently selected (e.g., '898 of 898' means all documents are included; '450 of 898' means 450 documents match your filter criteria).
- Sources: The number of distinct journals, books, or conferences represented in the filtered subset.
- Authors: The total number of unique authors contributing to the filtered documents.
These indicators help you assess the impact of your filters before applying them, ensuring your subset maintains sufficient size for robust analysis.
1️⃣ General Filters
The General panel provides fundamental filters applicable to most bibliometric collections:
Document Type
- Function: Filters documents by publication type (e.g., Article, Book Chapter, Proceedings Paper, Review, Editorial, Letter, Note).
- How to Use:
- By default, all document types are selected (shown in the filter box).
- To exclude a document type, click on its name in the filter box—it will be removed.
- To include a previously excluded type, click on it in the list below the filter box.
- Use Cases:
- Focus on peer-reviewed articles by excluding editorials, letters, and notes.
- Analyze conference proceedings separately from journal articles.
- Include only review articles for systematic literature reviews.
Language
- Function: Filters documents by publication language (e.g., English, Spanish, French, German, Chinese).
- Interaction: Similar to Document Type—click to select/deselect languages.
- Note: Most bibliometric databases predominantly index English-language publications. Non-English documents may represent a small fraction (<5%) of typical collections.
Publication Year
- Function: Restricts the collection to documents published within a specific time range.
- How to Use:
- Use the slider to adjust the start and end years.
- The selected range is displayed below the slider (e.g., '1985 - 2020').
- The histogram shows the distribution of publications across years, helping you identify periods of high activity.
- Use Cases:
- Temporal segmentation: Analyze different decades separately (e.g., 1990-2000 vs. 2010-2020).
- Exclude recent publications: Remove documents <2 years old to avoid citation lag bias.
- Focus on historical literature: Study foundational works from earlier periods.
2️⃣ Journal (J) Filters
The Journal panel enables filtering based on publication venues, journal rankings, or Bradford's Law zones:
Upload a List of Journals
- Function: Restricts the collection to documents published in a user-defined list of journals.
- How to Use:
  - Prepare a file (.csv, .txt, or .xlsx) with journal titles listed in the first column.
  - Click Browse... and select your file.
  - Only documents from journals matching the uploaded list (case-insensitive, partial matching) will be retained.
- Use Cases:
- Focus on core journals in your field (e.g., top 10 management journals).
- Analyze publications from open-access journals only.
- Exclude predatory or low-quality journals identified via external blacklists.
- Example File Format:
Journal of Informetrics
Scientometrics
Journal of the Association for Information Science and Technology
Research Policy
Upload a Journal Ranking List
- Function: Filters journals based on quality rankings (e.g., Q1, Q2, Q3, Q4 quartiles; A*, A, B, C grades).
- How to Use:
  - Prepare a file (.csv or .xlsx) with two columns and headers:
    - Column 1: Journal titles (must match exactly or closely)
    - Column 2: Ranking categories (e.g., Q1, Q2, A*, B)
  - Upload the file via Browse...
  - Select which ranking categories to include in your filtered collection.
- Use Cases:
- Focus on top-tier journals (e.g., Q1 only) for high-impact analysis.
- Compare publication patterns across journal tiers (e.g., Q1 vs. Q2-Q4).
- Filter by national rankings (e.g., Italian VQR, Australian ABDC, UK ABS).
- Example File Format:
Journal,Quartile
Journal of Informetrics,Q1
Scientometrics,Q1
Library Quarterly,Q2
Online Information Review,Q3
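Applied to the example file above, the ranking filter amounts to a per-document dictionary lookup. A hedged sketch, where the SO field name for the source title and the case-insensitive exact matching rule are assumptions:

```python
import csv
import io

def filter_by_ranking(records, ranking_csv, keep=("Q1",)):
    """Keep documents whose journal appears in the ranking file with an
    accepted category. Mirrors the two-column 'Journal,Quartile' file
    format shown above; matching here is case-insensitive and exact.
    """
    ranks = {}
    for row in csv.DictReader(io.StringIO(ranking_csv)):
        ranks[row["Journal"].strip().lower()] = row["Quartile"].strip()
    # Journals absent from the ranking file are dropped along with
    # journals in non-selected categories.
    return [r for r in records if ranks.get(r["SO"].strip().lower()) in keep]
```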
Source by Bradford Law Zones
- Function: Filters journals based on Bradford's Law, which divides sources into three productivity zones: Core, Zone 2, and Zone 3.
- Theory: Bradford's Law states that:
- Core journals (Zone 1): A small number of highly productive sources publishing ~1/3 of all documents.
- Zone 2: A moderate number of sources contributing another ~1/3.
- Zone 3: A large number of peripheral sources producing the final ~1/3.
- How to Use:
- Select 'All Sources' to include everything (default).
- Select 'Core' to focus on the most productive journals.
- Select 'Zone 2' or 'Zone 3' to analyze mid-tier or peripheral journals.
- Use Cases:
- Identify the core journals dominating a research field.
- Compare citation impact between core and peripheral sources.
- Exclude low-productivity journals (Zone 3) to streamline analysis.
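The three-zone partition follows directly from ranked source frequencies; a minimal sketch of the cumulative one-third cutoffs described above (not the bibliometrix implementation):

```python
from collections import Counter

def bradford_zones(journals):
    """Partition sources into three zones, each covering ~1/3 of documents.

    journals: list of journal names, one entry per document.
    Sources are ranked by productivity; zone boundaries fall where the
    cumulative document count crosses 1/3 and 2/3 of the total.
    """
    counts = Counter(journals).most_common()  # ranked by productivity
    total, cumulative = len(journals), 0
    zones = {"Core": [], "Zone 2": [], "Zone 3": []}
    for source, n in counts:
        cumulative += n
        if cumulative <= total / 3:
            zones["Core"].append(source)
        elif cumulative <= 2 * total / 3:
            zones["Zone 2"].append(source)
        else:
            zones["Zone 3"].append(source)
    return zones
```

Note how a single prolific journal can fill the Core zone on its own while many peripheral journals share Zone 3, which is the skew Bradford's Law predicts.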
3️⃣ Author's Country (AU) Filters
The Author's Country panel enables geographic filtering based on author affiliations:
Region
- Function: Filters documents by broad geographic regions (e.g., Africa, Asia, Europe, North America, South America, Oceania, Seven Seas, Unknown).
- How to Use: Click on region buttons to toggle selection. Selected regions are highlighted in blue.
- Note: 'Seven Seas' represents international waters or unclassified regions; 'Unknown' indicates missing affiliation data.
Country
- Function: Filters documents by specific author countries (e.g., USA, China, UK, Germany, Italy).
- How to Use:
- Use the search box to quickly find countries.
- Countries are displayed in two columns: left (available), right (selected).
- Click a country in the left column to add it; click in the right column to remove.
- Use Cases:
- Analyze national research outputs (e.g., Italian contributions to bibliometrics).
- Study international collaboration by including multiple countries.
- Compare regional trends (e.g., Europe vs. Asia vs. North America).
- Identify emerging research nations in a field.
- Important: Multi-country documents (with authors from different countries) are included if any selected country is represented among the authors.
4️⃣ Documents (DOC) Filters
The Documents panel provides citation-based filters with interactive histograms:
Total Citations
- Function: Filters documents by their cumulative citation count (from database records).
- How to Use:
- Use the slider below the histogram to set minimum and maximum citation thresholds.
- The histogram shows the distribution of citation counts across documents, helping you identify highly-cited outliers.
- Example: Set minimum = 50 to include only documents with ≥50 citations.
- Use Cases:
- Focus on high-impact documents (e.g., citations >100) for influence analysis.
- Exclude uncited documents (citations = 0) for citation network studies.
- Identify the citation elite (top 1% most-cited papers).
Total Citations per Year
- Function: Filters documents by their average annual citation rate, calculated as: Total Citations / (Current Year - Publication Year).
- Why Use This? Raw citation counts are biased toward older publications. Citations per year normalizes for document age, enabling fairer comparison between recent and historical works.
- How to Use: Adjust the slider to set citation-per-year thresholds (e.g., ≥5 citations/year).
- Use Cases:
- Identify rapidly accumulating citations (indicators of emerging influence).
- Compare citation velocity across time periods.
- Find recent high-impact papers that haven't yet accumulated large total citation counts but show strong annual growth.
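The normalization above can be sketched as follows. The records are hypothetical, and the age floor of 1 for current-year papers is an assumption to avoid division by zero, not necessarily biblioshiny's exact convention.

```python
from datetime import date

def citations_per_year(total_citations, pub_year, current_year=None):
    """Average annual citation rate: TC / (current year - publication year).

    Documents published in the current year are given an age of 1
    (an assumed convention to avoid division by zero).
    """
    if current_year is None:
        current_year = date.today().year
    age = max(current_year - pub_year, 1)
    return total_citations / age

# Keep only documents accruing at least 5 citations per year (hypothetical records)
docs = [
    {"title": "A", "TC": 120, "PY": 2010},
    {"title": "B", "TC": 30,  "PY": 2018},
    {"title": "C", "TC": 4,   "PY": 2019},
]
kept = [d for d in docs if citations_per_year(d["TC"], d["PY"], current_year=2024) >= 5]
```

Note how document C, despite being recent, is filtered out, while the older document A passes comfortably: the metric rewards sustained annual impact, not mere age.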
🎛️ Filter Workflow
Follow these steps to apply filters effectively:
- Review the initial collection: Check the summary counts (Documents, Sources, Authors) before applying filters.
- Select filter criteria: Adjust settings across the four panels based on your research objectives.
- Monitor real-time updates: The summary at the top updates dynamically as you change selections, showing how many documents remain.
- Click 'Apply': Once satisfied with your selections, click the blue Apply button to activate the filters.
- Verify results: Check the updated summary to ensure your filters produced the expected subset size.
- Proceed to analysis: Navigate to other modules (Overview, Sources, Authors, etc.) to analyze the filtered collection.
- Reset if needed: Click the Reset button to clear all filters and restore the original dataset.
💡 Best Practices
- Avoid over-filtering: Very small subsets (<100 documents) may not provide robust results for network or clustering analyses. Aim for at least 200-300 documents when possible.
- Document your filters: Record which filters you applied for reproducibility and transparency in research reporting (e.g., 'Filtered to Q1 journals, 2010-2020, English-language articles only').
- Iterative refinement: Start with broad filters and gradually narrow your criteria while monitoring the summary counts.
- Combine filters strategically: Use multiple filter types together (e.g., specific countries + high citations + recent years) for highly targeted analyses.
- Save filtered collections: After applying filters, export your refined collection using the Data button (top right) to preserve your work.
- Compare filtered vs. unfiltered: Run key analyses on both the full and filtered collections to assess how filters impact results.
⚠️ Important Considerations
- Citation Data Availability: Citation counts depend on database indexing. Web of Science and Scopus provide citation data; PubMed and some other databases do not. Missing citation data will result in empty histograms in the Documents panel.
- Affiliation Data Quality: Author country filters rely on affiliation metadata, which may be incomplete or inconsistent, especially in older publications or non-WoS/Scopus databases.
- Subject Category Coverage: Subject categories are database-specific. Scopus categories differ from Web of Science categories; merged collections may have inconsistent classification.
- Filter Order Independence: Filters are applied simultaneously, not sequentially. The order in which you select filters does not affect the final result.
- Bradford Zone Recalculation: Bradford's Law zones are calculated based on the current collection. If you merge collections or upload new data, zones may shift.
🔍 Use Case Examples
Example 1: Analyzing Top-Tier Recent Research
- Goal: Focus on high-impact, recent publications in core journals.
- Filters Applied:
- Document Type: Article, Review
- Publication Year: 2015-2020
- Source by Bradford Law Zones: Core
- Total Citations per Year: ≥10
- Outcome: A curated subset of influential papers from leading journals, suitable for identifying emerging research fronts.
Example 2: National Research Assessment
- Goal: Evaluate research output from Italian universities in Computer Science.
- Filters Applied:
- Author's Country: Italy
- Subject Category: Computer Science, Information Systems
- Document Type: Article
- Outcome: A collection focused on Italian contributions to CS, enabling analysis of national productivity, collaboration patterns, and impact.
Example 3: Historical Foundational Literature
- Goal: Study the intellectual foundations of a field by examining seminal works.
- Filters Applied:
- Publication Year: 1970-1990
- Total Citations: ≥100
- Document Type: Article
- Outcome: A set of highly-cited historical documents representing foundational contributions.
📚 References
Bradford, S. C. (1934). Sources of information on specific subjects. Engineering, 137, 85–86.
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
Garfield, E. (2009). From the science of science to Scientometrics: visualizing the history of science with HistCite software. Journal of Informetrics, 3(3), 173–179. https://doi.org/10.1016/j.joi.2009.03.009
Main Information
📊 Main Information: Overview of Your Bibliometric Collection
The Main Information page provides a comprehensive, at-a-glance summary of the key bibliometric indicators for your collection. This dashboard-style interface displays 12 core metrics organized into visual cards, allowing you to quickly assess the scope, composition, and characteristics of your dataset.
This section is the ideal starting point for understanding your collection before diving into more detailed analyses. It answers fundamental questions such as: How large is my dataset? What is the temporal coverage? How collaborative is the research? How impactful are the documents?
📈 Core Metrics Explained
The Main Information dashboard displays the following indicators:
1. Timespan
- Definition: The temporal range covered by the collection, from the earliest to the most recent publication year.
- Example: 1985-2020 indicates documents published between 1985 and 2020.
- Interpretation: A wider timespan enables longitudinal trend analysis and historical perspectives. Collections spanning decades are suitable for studying research evolution and paradigm shifts.
2. Sources
- Definition: The total number of distinct publication venues (journals, conferences, books) represented in the collection.
- Interpretation: A higher number of sources suggests a multidisciplinary or dispersed research field, while a lower number indicates concentration in a few core journals. This metric is useful for identifying dominant publication venues via Bradford's Law analysis.
3. Documents
- Definition: The total number of bibliographic records (articles, reviews, proceedings, etc.) in the collection.
- Interpretation: This is the fundamental sample size for all subsequent analyses. Larger collections (>1,000 documents) provide more robust insights, especially for network and clustering analyses.
4. Annual Growth Rate
- Definition: The average percentage increase in the number of publications per year over the collection's timespan.
- Formula: Compound Annual Growth Rate (CAGR), calculated as [(N_final / N_initial)^(1/years) - 1] × 100
- Interpretation: A positive growth rate indicates an expanding research field, while negative or near-zero values suggest maturity or decline. High growth rates (>10%) often signal emerging topics attracting increasing scholarly attention.
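The CAGR formula can be checked directly with a short sketch (the publication counts below are illustrative values, not taken from any real collection):

```python
def annual_growth_rate(n_initial, n_final, years):
    """Compound annual growth rate of publications, in percent:
    [(N_final / N_initial)^(1/years) - 1] * 100."""
    return ((n_final / n_initial) ** (1 / years) - 1) * 100

# e.g. growing from 12 papers in the first year to 95 in the last, over 20 years
rate = annual_growth_rate(12, 95, 20)   # roughly 11% per year
```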
5. Authors
- Definition: The total number of unique authors who contributed to the documents in the collection.
- Interpretation: This metric reflects the size of the research community. A high author-to-document ratio suggests collaborative research, while a low ratio may indicate a field dominated by a few prolific researchers.
6. Authors of Single-Authored Docs
- Definition: The number of authors who published at least one single-authored document in the collection.
- Interpretation: Single-authored papers are more common in humanities and theoretical disciplines. A low proportion suggests high collaboration intensity, typical of experimental sciences and interdisciplinary fields.
7. International Co-Authorship
- Definition: The percentage of documents authored by researchers from multiple countries.
- Interpretation: High international collaboration (>30%) indicates global research networks and is often associated with higher citation impact. This metric is a proxy for research globalization and cross-border knowledge exchange.
8. Co-Authors per Document
- Definition: The average number of authors per document in the collection.
- Interpretation: Values typically range from 2 (social sciences, humanities) to 5+ (biomedical sciences, physics). Increasing values over time reflect the trend toward team science and large-scale collaborative projects.
9. Author's Keywords (DE)
- Definition: The total number of unique keywords provided by authors (DE = Descriptors) across all documents.
- Interpretation: A rich keyword set (>1,000 unique terms) enables robust thematic analysis and topic modeling. The diversity of keywords reflects the conceptual breadth of the research field.
10. References
- Definition: The total number of cited references listed in the bibliographies of all documents in the collection.
- Interpretation: This metric is essential for citation-based analyses (co-citation, bibliographic coupling, reference publication year spectroscopy). Larger reference pools enable more comprehensive intellectual structure mapping.
11. Document Average Age
- Definition: The average number of years elapsed since publication, calculated relative to the current year.
- Formula: Current Year - Mean(Publication Years)
- Interpretation: Lower values (<5 years) indicate a focus on recent research, while higher values suggest inclusion of foundational or historical literature. This metric helps assess whether the collection is contemporary or retrospective.
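As a minimal sketch of this formula (with made-up publication years):

```python
def average_document_age(pub_years, current_year):
    """Mean document age: current year minus the mean publication year."""
    return current_year - sum(pub_years) / len(pub_years)

# Four hypothetical documents, evaluated in 2024
age = average_document_age([2015, 2017, 2019, 2021], 2024)  # mean PY = 2018 -> age 6.0
```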
12. Average Citations per Document
- Definition: The mean number of citations received by documents in the collection (based on database citation counts).
- Interpretation: Higher values indicate high-impact research. Average citation rates vary widely by discipline (e.g., biomedical sciences >20, social sciences ~10). This metric is influenced by document age, field norms, and database coverage.
🧠 Biblio AI Integration
If Biblio AI is enabled, you can click the Biblio AI tab to receive an automated narrative summary of these indicators. The AI-generated text provides contextualized interpretations, highlights notable patterns, and offers insights suitable for inclusion in research reports or presentations.
Example AI-generated insights:
- 'The collection exhibits a strong annual growth rate of 14.05%, suggesting an emerging and rapidly expanding research domain.'
- 'With 36.41% international co-authorship, the field demonstrates moderate global collaboration, indicating opportunities for further cross-border partnerships.'
- 'The average of 37.12 citations per document reflects high scholarly impact, placing this collection above typical citation rates for the social sciences.'
📋 Viewing Options
The Main Information page offers three viewing modes via tabs at the top:
- Plot: Visual card-based dashboard (default view) with color-coded metrics
- Table: Tabular representation of all indicators for easy export to reports
- Biblio AI: AI-generated narrative summary and interpretation (requires Gemini API key)
💡 How to Use Main Information
This section is designed for multiple purposes:
- Initial Data Assessment: Quickly validate that your collection has been imported correctly and contains the expected number of documents and metadata fields.
- Research Reporting: Extract summary statistics for the 'Methods' or 'Data' section of a systematic review or bibliometric study.
- Comparative Analysis: Compare indicators across different datasets (e.g., two time periods, competing research streams) to identify differences in growth, collaboration, or impact.
- Presentation Material: Export the dashboard or AI-generated text for use in slides, posters, or grant proposals.
📌 Best Practices
- Always review Main Information first before proceeding to advanced analyses—it helps identify potential data quality issues (e.g., missing years, incomplete author data).
- Compare with field benchmarks: Contextualize your indicators by comparing them with known norms for your discipline (e.g., citation rates, collaboration patterns).
- Document your collection: Use the 'Brief Description' text box (visible in the Import/Load section) to record search queries, inclusion criteria, and data sources for reproducibility.
- Export summary statistics: Save the table view as a reference for your research documentation or supplementary materials.
⚠️ Important Considerations
- Database Bias: Indicators reflect the coverage and indexing policies of the source database(s). Web of Science and Scopus have different journal lists, which affects metrics like citation counts and international co-authorship.
- Citation Lag: Recent documents (<2 years old) typically have lower citation counts due to insufficient time for accumulation. Average citations per document may be biased downward if your collection includes many recent papers.
- Incomplete Metadata: Some databases (e.g., PubMed, Dimensions) provide limited metadata, which may result in missing or incomplete values for certain indicators (e.g., author affiliations for international co-authorship calculation).
- Growth Rate Sensitivity: Annual growth rate calculations are sensitive to the start and end years of the collection. Unusual spikes or drops in specific years can distort the overall trend.
🔍 Next Steps
After reviewing the Main Information dashboard, proceed to more detailed analyses:
- Filters: Refine your collection by applying metadata filters (e.g., document type, time range, subject category)
- Sources: Identify the most productive journals and analyze publication patterns
- Authors: Examine author productivity, collaboration networks, and impact metrics
- Conceptual Structure: Explore thematic evolution and topic clustering via keyword co-occurrence and thematic maps
- Intellectual Structure: Investigate citation networks through co-citation analysis and historiography
📚 References
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
Zupic, I., & Čater, T. (2015). Bibliometric methods in management and organization. Organizational Research Methods, 18(3), 429–472. https://doi.org/10.1177/1094428114562629
Annual Scientific Production
Average Citations Per Year
Life Cycle of Scientific Production
📈 Life Cycle of Scientific Production: Modeling Research Topic Evolution
The Life Cycle of Scientific Production module implements a logistic growth model to analyze the temporal dynamics of research topics. This approach, grounded in the theory of scientific paradigms and innovation diffusion, allows researchers to identify the current developmental stage of a field, predict future trends, and estimate when a topic will reach maturity or saturation.
By fitting a logistic curve to the annual publication counts in your collection, this analysis reveals whether a research area is in its emergence phase, rapid growth phase, maturity phase, or decline phase.
📐 The Logistic Growth Model
The life cycle analysis is based on the logistic growth function, which models how the cumulative number of publications evolves over time:
Formula:
P(t) = K / (1 + exp(-b(t - t₀)))
Where:
- P(t): Cumulative number of publications at time t
- K: Saturation level (maximum total publications the topic will produce)
- b: Growth rate parameter (determines the steepness of the curve)
- t₀: Inflection point (time when growth rate is highest)
The annual publication rate is derived as the first derivative of P(t), producing a bell-shaped curve that peaks at the inflection point and gradually declines as the topic approaches saturation.
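Under these definitions, the model can be fitted to cumulative counts with a standard nonlinear least-squares routine. The sketch below uses Python's SciPy on synthetic, noiseless data purely for illustration; biblioshiny performs the fit internally in R, and the parameter values here are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, b, t0):
    """Cumulative publications: P(t) = K / (1 + exp(-b * (t - t0)))."""
    return K / (1 + np.exp(-b * (t - t0)))

# Synthetic cumulative counts following a logistic trend (illustration only)
years = np.arange(2000, 2021)
cumulative = logistic(years, 5000.0, 0.4, 2015.0)

# Nonlinear least-squares fit; starting values p0 are rough guesses
(K, b, t0), _ = curve_fit(logistic, years, cumulative,
                          p0=[2 * cumulative[-1], 0.3, years.mean()])

# Annual output as first differences of the fitted cumulative curve
annual = np.diff(logistic(np.arange(1999, 2021), K, b, t0))
```

On real data, always inspect the fit quality (R², residuals) before trusting the recovered K, b, and t₀; noisy or non-logistic series can converge to misleading parameters.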
🔬 Model Overview: Key Parameters
The Model Overview section displays four fundamental indicators derived from the fitted logistic model:
1. Saturation (K)
- Definition: The estimated maximum total number of publications that will ever be produced on this research topic.
- Interpretation:
- High K values (>5,000) indicate a broad, impactful research domain with sustained long-term interest.
- Low K values (<1,000) suggest a niche topic with limited scope or a specialized subtopic within a larger field.
- The current cumulative total as a percentage of K reveals how close the topic is to exhaustion.
- Example: K = 8,980 publications suggests the topic will produce approximately 8,980 total documents before reaching saturation.
2. Peak Year (Tm)
- Definition: The year when annual publication output is predicted to reach its maximum.
- Interpretation:
- If the peak year is in the future, the topic is still in a growth phase and attracting increasing attention.
- If the peak year is in the past, the topic has entered a maturity or decline phase, with decreasing annual output.
- If the peak year is near the present, the topic is at the zenith of its popularity.
- Example: Peak Year = 2029 indicates the topic will reach maximum annual productivity in 2029, suggesting it is currently in an accelerating growth phase.
3. Peak Annual
- Definition: The maximum number of publications per year predicted to occur at the Peak Year.
- Interpretation: This metric reflects the intensity of research activity at the topic's peak. Higher values indicate greater scholarly attention and resource allocation.
- Example: Peak Annual = 592 pubs/year means the topic will generate approximately 592 publications annually at its zenith.
4. Growth Duration (Δt)
- Definition: The estimated time span (in years) from the topic's emergence (10% of K) to near-saturation (90% of K).
- Interpretation:
- Short duration (<10 years): Rapid maturation, typical of hot topics, technological innovations, or crisis-driven research (e.g., COVID-19 studies).
- Medium duration (10-20 years): Typical of mainstream research domains with sustained but gradual growth.
- Long duration (>20 years): Slow-developing fields, foundational topics, or interdisciplinary areas requiring extensive infrastructure.
- Example: Growth Duration = 16.7 years suggests the topic will take approximately 17 years to mature from its early stage to near-saturation.
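For the logistic model above, the 10%-to-90% duration has a closed form: setting P(t) = f·K and solving gives t = t₀ + ln(f/(1-f))/b, so the span between the 10% and 90% crossings is 2·ln(9)/b = ln(81)/b, independent of K and t₀. A minimal check (the growth rate b = 0.263 is a hypothetical value chosen to reproduce the 16.7-year example):

```python
import math

def growth_duration(b):
    """Years from 10% to 90% of saturation for a logistic curve: ln(81) / b."""
    return math.log(81) / b

dt = growth_duration(0.263)   # about 16.7 years
```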
✅ Model Fit Quality
The Model Fit Quality section assesses how well the logistic curve fits the observed publication data using four statistical metrics:
1. R² (Coefficient of Determination)
- Range: 0 to 1 (higher is better)
- Interpretation: Proportion of variance in publication counts explained by the model.
- R² > 0.90: Excellent fit—the logistic model accurately captures the publication trend.
- 0.70 < R² < 0.90: Good fit—the model is reasonable but may not capture all nuances (e.g., fluctuations due to external events).
- R² < 0.70: Poor fit—the logistic model may not be appropriate for this dataset (non-logistic growth pattern, data quality issues).
- Example: R² = 0.953 indicates an excellent fit, with 95.3% of publication variance explained by the model.
2. RMSE (Root Mean Squared Error)
- Definition: Average deviation between observed and predicted annual publications.
- Interpretation: Lower values indicate better fit. RMSE should be interpreted relative to the scale of annual publications (e.g., RMSE = 10 is negligible for topics with 500+ annual pubs, but significant for topics with <50 pubs/year).
3. AIC (Akaike Information Criterion)
- Purpose: Balances model fit against complexity (penalizes overfitting).
- Interpretation: Lower AIC values indicate a better model. AIC is most useful for comparing alternative models rather than assessing absolute fit quality.
4. BIC (Bayesian Information Criterion)
- Purpose: Similar to AIC but applies a stronger penalty for model complexity.
- Interpretation: Lower BIC values indicate better models. BIC is more conservative than AIC and favors simpler models.
Overall Assessment: Biblioshiny automatically classifies model fit as Excellent, Good, or Poor based primarily on R² values. An 'Excellent' fit (R² > 0.90) validates the use of logistic growth assumptions for forecasting.
📍 Current Status
This section provides a snapshot of the topic's present state relative to its life cycle trajectory:
- Last Observed Year: The most recent year with publication data in your collection.
- Annual Publications: The number of publications in the last observed year.
- Cumulative Total: The total number of publications from the collection's start to the last observed year.
- Progress to Saturation: The percentage of K (saturation level) already reached.
- 0-30%: Emergence or early growth phase.
- 30-70%: Rapid growth phase (the topic is 'hot').
- 70-90%: Late growth phase, approaching maturity.
- >90%: Maturity or decline phase, nearing exhaustion.
Example Interpretation: If Progress to Saturation = 10.0%, the topic is in the rapid growth phase, with 90% of its publication potential still ahead. This signals a promising emerging field attracting increasing scholarly attention.
🏁 Milestone Years
The Milestone Years section predicts when the topic will reach specific saturation thresholds:
- 10% of K: Emergence milestone—marks the topic's transition from niche to recognized research area.
- 50% of K (Midpoint): The inflection point where growth rate is highest. This coincides with the Peak Year (Tm).
- 90% of K: Maturity milestone—indicates the topic is approaching saturation, with declining annual growth.
- 99% of K: Near-complete saturation—the topic has exhausted most of its research potential.
Example:
10% of K: 2021.0
50% of K: 2029.3 (+9 years)
90% of K: 2037.6 (+18 years)
99% of K: 2046.7 (+27 years)
This indicates the topic emerged around 2021, will peak in 2029, and approach saturation by 2038, with a full life cycle spanning approximately 25 years.
The system also classifies the topic's current phase (e.g., 'rapid growth phase' if between 10-50% of K) to aid interpretation.
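The milestone years follow from inverting the logistic function: P(t) = f·K gives t = t₀ + ln(f/(1-f))/b. A short sketch reproducing figures close to the example above (the parameters t₀ = 2029.3 and b = 0.263 are hypothetical values, not output from biblioshiny):

```python
import math

def milestone_year(t0, b, fraction):
    """Year at which cumulative output reaches `fraction` of saturation K.

    Inverting P(t) = K / (1 + exp(-b * (t - t0))) gives
    t = t0 + ln(fraction / (1 - fraction)) / b.
    """
    return t0 + math.log(fraction / (1 - fraction)) / b

# Hypothetical fit: inflection point 2029.3, growth rate b = 0.263
milestones = {f: round(milestone_year(2029.3, 0.263, f), 1)
              for f in (0.10, 0.50, 0.90, 0.99)}
```

By construction the 50% milestone coincides with t₀ (the Peak Year), and the 10% and 90% milestones sit symmetrically around it.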
🚀 Forecast
The Forecast section projects future publication output based on the fitted logistic model:
- Forecast Period: The time range for predictions (typically 5-50 years into the future).
- Projection for 2025: Estimated cumulative total publications by 2025 (includes annual projection in parentheses).
- Projection for 2030: Estimated cumulative total publications by 2030 (includes annual projection in parentheses).
Example:
Projection for 2025: 2183 cumulative (436 annual)
Projection for 2030: 4898 cumulative (587 annual)
This suggests the topic will grow from ~900 publications (current) to over 4,800 by 2030, with annual output peaking around 587 publications per year.
Important: Forecasts assume the logistic model remains valid (no disruptive events, paradigm shifts, or external shocks). Long-term forecasts (>10 years) should be interpreted with caution.
📊 Visualizations
The Plot tab provides two complementary graphs:
1. Life Cycle - Annual Publications
- Blue solid line: Logistic fit to observed data
- Blue dashed line: Forecasted annual publications
- Blue dots: Observed annual publications from your collection
- Red dashed vertical line: Peak Year (Tm)
Interpretation: This bell-shaped curve shows how publication activity rises, peaks, and eventually declines. The shape reveals the topic's maturity:
- Steep ascent, pre-peak: Emerging or rapidly growing topic.
- Near or at peak: Mature topic at maximum attention.
- Descending curve, post-peak: Declining topic losing relevance.
2. Cumulative Growth Curve
- Green solid line: Logistic fit to observed cumulative data
- Green dashed line: Forecasted cumulative publications
- Green dots: Observed cumulative publications
- Horizontal dashed lines: Saturation thresholds (50%, 90%)
Interpretation: This S-shaped curve illustrates the topic's total knowledge accumulation over time. The curve's position and steepness reveal:
- Lower left (shallow slope): Emergence phase with slow initial growth.
- Middle (steep slope): Rapid growth phase with exponential accumulation.
- Upper right (flattening): Maturity phase approaching saturation asymptote (K).
🧠 Biblio AI Integration
The Biblio AI tab allows you to generate AI-powered narrative interpretations of the life cycle analysis. Key features include:
- Customizable Prompts: Edit the default prompt to add context-specific details (e.g., research domain, database source, filter criteria).
- Graph-Based Analysis: Biblio AI analyzes the visualizations to identify trends, anomalies, and key transition points.
- Automatic Interpretation: Generates text suitable for research reports, explaining model parameters, growth phases, and forecasts in natural language.
Example Prompt Enhancement:
The analysis was performed on a collection downloaded from WOS focusing on machine learning applications in healthcare from 1990-2020.
This contextual information helps Biblio AI produce more accurate and domain-relevant interpretations.
💡 Use Cases
- Identifying Emerging Topics: Detect rapidly growing fields in their early stages (10-30% of K) for strategic research investment.
- Timing Research Entry: Avoid entering saturated fields (>90% of K) where novelty is harder to achieve.
- Forecasting Resource Needs: Predict future publication volumes to plan journal submissions, conferences, or funding opportunities.
- Comparative Life Cycle Analysis: Run the analysis on multiple subtopics to identify which are growing vs. declining.
- Paradigm Shift Detection: Poor model fit (R² < 0.70) may signal non-logistic patterns caused by disruptive innovations or paradigm shifts.
📌 Best Practices
- Ensure sufficient data: Logistic models require at least 10-15 years of publication data for reliable fitting. Collections with <10 years may produce unstable forecasts.
- Check model fit: Always review R² and visual fit before interpreting forecasts. Poor fits (R² < 0.70) indicate the logistic model may not be appropriate.
- Consider external events: The model assumes smooth, uninterrupted growth. Real-world shocks (e.g., pandemics, funding cuts, technological breakthroughs) can invalidate long-term forecasts.
- Use relative comparisons: Life cycle parameters (K, Peak Year) are most informative when comparing multiple topics or time periods within the same field.
- Validate forecasts periodically: Re-run the analysis with updated data every 2-3 years to recalibrate predictions.
⚠️ Important Considerations
- Database Coverage: The model reflects only publications indexed in your source database(s). Incomplete coverage (e.g., missing journals, preprints) can distort saturation estimates.
- Definition Drift: Topic boundaries may shift over time (e.g., 'artificial intelligence' in 1990 vs. 2020), affecting the validity of K estimates.
- Multiple Life Cycles: Some broad topics exhibit multiple overlapping life cycles as subtopics emerge and decline independently. In such cases, aggregate logistic fits may be misleading.
- Self-Fulfilling Prophecies: Publishing forecasts may influence researcher behavior (e.g., avoiding 'saturated' topics), potentially altering actual trajectories.
- Model Limitations: The logistic model assumes a single saturation point and smooth growth. Topics experiencing resurgence (e.g., due to new technologies) may not fit this pattern.
🔍 Interpreting Fit Quality Issues
If your model shows poor fit (R² < 0.70), consider these potential causes:
- Insufficient Data: Too few years or highly irregular publication patterns.
- Non-Logistic Growth: The topic may exhibit exponential, linear, or cyclic growth rather than logistic.
- Recent Disruptions: External shocks (e.g., COVID-19 boosting health research) create anomalies that deviate from smooth curves.
- Topic Too Broad: Aggregating multiple subtopics with different life cycles can obscure individual patterns.
- Data Quality Issues: Missing years, database indexing changes, or inconsistent metadata.
Solution: Try narrowing your collection (e.g., focusing on a specific subtopic or time range) or exploring alternative growth models.
📚 References
Aria, M., Misuraca, M., & Spano, M. (2020). Mapping the evolution of social research and data science on 30 years of Social Indicators Research. Social Indicators Research, 149, 803–831. https://doi.org/10.1007/s11205-020-02281-3
Bettencourt, L. M., Kaiser, D. I., & Kaur, J. (2009). Scientific discovery and topological transitions in collaboration networks. Journal of Informetrics, 3(3), 210–221. https://doi.org/10.1016/j.joi.2009.03.001
Rogers, E. M. (2003). Diffusion of Innovations (5th ed.). New York: Free Press.
Small, H., & Upham, S. P. (2009). Citation structure of an emerging research area on the verge of application. Scientometrics, 79(2), 365–375. https://doi.org/10.1007/s11192-009-0424-0
Wang, Q. (2018). A bibliometric model for identifying emerging research topics. Journal of the Association for Information Science and Technology, 69(2), 290–304. https://doi.org/10.1002/asi.23930
Three-Field Plot
🔀 Three-Field Plot
The Three-Field Plot is an advanced visualization tool that reveals the relationships among three distinct bibliographic dimensions through an interactive Sankey diagram. This plot enables researchers to explore the complex connections between different metadata fields, making it particularly useful for understanding how research topics, authors, sources, and references are interconnected within a scientific domain.
🎯 Purpose and Application
The Three-Field Plot serves multiple analytical purposes:
- Relationship Mapping: Visualizes how elements from three different bibliographic fields are associated with each other
- Knowledge Flow: Tracks the flow of ideas and citations across different dimensions (e.g., from cited references through authors to keywords)
- Thematic Connections: Identifies which keywords or topics are most strongly associated with specific authors or sources
- Author-Topic Associations: Shows which authors are working on which topics and citing which foundational works
📊 How It Works
The visualization consists of three vertical columns representing different bibliographic fields:
- Left Field: Typically represents sources (cited references, journals) or temporal information
- Middle Field: Usually displays authors or intermediary elements that connect the other two fields
- Right Field: Often shows keywords, topics, or other thematic elements
The width of each flow (colored band) is proportional to the frequency of co-occurrence between elements. Thicker flows indicate stronger associations, while thinner ones represent weaker connections.
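The counting behind the flow widths can be sketched as follows. This is an illustrative Python sketch of the co-occurrence idea, not Biblioshiny's actual R implementation, and the document records are invented:

```python
from collections import Counter

# Each document links one element per field; the width of a Sankey band
# between two elements is their co-occurrence count across documents.
documents = [
    {"source": "J. Informetrics", "author": "Aria M", "keyword": "bibliometrics"},
    {"source": "J. Informetrics", "author": "Aria M", "keyword": "science mapping"},
    {"source": "Scientometrics",  "author": "Small H", "keyword": "co-citation"},
    {"source": "J. Informetrics", "author": "Small H", "keyword": "bibliometrics"},
]

# Left-to-middle and middle-to-right flows.
left_middle = Counter((d["source"], d["author"]) for d in documents)
middle_right = Counter((d["author"], d["keyword"]) for d in documents)

# The (source, author) pair seen twice gets a band twice as thick.
print(left_middle[("J. Informetrics", "Aria M")])  # 2
```

Two co-occurrence tables, one per adjacent pair of columns, are all a three-field Sankey needs.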
⚙️ Configuration Options
The Options panel allows you to customize the plot:
- Left Field: Select from available metadata fields (e.g., Cited References, Sources, Authors' Countries)
- Middle Field: Choose the central connecting field (e.g., Authors, Sources, Keywords)
- Right Field: Define the destination field (e.g., Author's Keywords, Keywords Plus, Subject Categories)
- Number of Items: Control how many top elements to display for each field (typically 10-30 items per field)
💡 Common Field Combinations
Some particularly insightful field combinations include:
- References → Authors → Keywords: Shows which foundational works are cited by which authors working on which topics
- Sources → Authors → Countries: Maps the geographical distribution of authors publishing in specific journals
- Keywords → Authors → Cited References: Reveals the intellectual foundations of different research themes
- Authors' Countries → Authors → Keywords: Identifies national research specializations and thematic focuses
- Publication Year → Authors → Keywords: Tracks temporal evolution of author productivity and topic emergence
🔍 Interpretation Guidelines
- Flow Thickness: A thick flow between two elements indicates a strong association (high co-occurrence frequency)
- Multiple Connections: Elements with many outgoing or incoming flows are central nodes in the network
- Isolated Flows: Thin, isolated connections may represent niche specializations or emerging topics
- Color Coding: Colors help distinguish different elements in the left field, making it easier to trace specific flows
- Cross-field Patterns: Look for patterns where multiple elements from one field connect to the same element in another field, indicating convergence or interdisciplinarity
📌 Best Practices
- Start Simple: Begin with a small number of items (10-15 per field) to avoid visual clutter, then increase if needed
- Logical Sequences: Arrange fields in a logical flow (e.g., past → present, source → output, context → content)
- Interactive Exploration: Hover over flows and nodes to see exact frequencies and connections
- Export Results: Use the plot in presentations to illustrate complex relationships in an accessible way
- Combine with Networks: Use Three-Field Plots alongside network analyses for complementary perspectives on your data
- Context Matters: Always interpret the plot in the context of your research question and domain knowledge
⚠️ Limitations
- Aggregation Effects: The plot shows aggregate patterns and may obscure individual document-level details
- Top-N Selection: Only the most frequent items are displayed; rare but potentially important connections may be hidden
- Direction Ambiguity: While flows suggest relationships, they don't always imply causal or temporal direction
- Visual Complexity: With too many items, the plot can become difficult to interpret; reduce the number of items if necessary
🤖 Biblio AI Integration
When Biblio AI is enabled, you can generate automatic interpretations of the Three-Field Plot. The AI will:
- Identify the most important flows and connections
- Highlight dominant patterns and relationships
- Provide narrative explanations suitable for research reports and presentations
- Suggest potential interpretations based on the observed patterns
📚 Key References
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
Chen, C. (2017). Science Mapping: A Systematic Review of the Literature. Journal of Data and Information Science, 2(2), 1–40. https://doi.org/10.1515/jdis-2017-0006
Most Relevant Sources
Options:
Main Configuration
Most Local Cited Sources
Options:
Main Configuration
Core Sources by Bradford's Law
Sources' Local Impact
Options:
Main Configuration
Sources' Production over Time
Options:
Main Configuration
Most Relevant Authors
Options:
Main Configuration
Author Profile
👤 Author Profile Overview
The Author Profile page provides a dual-perspective bibliometric overview of each author included in the collection:
🔹 Global Profile
The Global Profile presents the author's complete scientific output, based on metadata retrieved from OpenAlex via the openalexR R package. This profile includes all publications authored by the researcher, regardless of whether they are part of the current collection.
Main features of the Global Profile include:
- Total Publications and Citations
- H-Index and i10-Index
- 2-Year Mean Citation Rate
- Publication Trends over the last 10 years
- Main Research Topics extracted from OpenAlex concepts
Data Source: OpenAlex API (via openalexR)
Unique Identifier: OpenAlex Author ID (e.g., A5014455237)
🔸 Local Profile
The Local Profile focuses exclusively on the subset of the author's publications that are included in the user-defined collection currently under analysis in the project.
Main features of the Local Profile include:
- Number of Publications, Total Citations, and Local H-Index
- Average Citations per Work
- Recent Activity: Number of publications in the last 5 years
- Publication Trends (based only on local data)
- Main Keywords derived from the local collection
- List of Publications with full metadata (title, year, journal, DOI, citations)
This local profile helps contextualize the author's role and impact within the specific research topic or dataset under investigation.
🔄 Interpretation and Use
The Global Profile offers a broad, external view of the author's overall scholarly influence, while the Local Profile highlights their specific relevance within the current study.
This dual visualization is particularly useful for:
- Identifying influential researchers in the topic area
- Comparing local vs. global impact
- Evaluating thematic alignment of authors with the collection's focus
📚 References
Priem, J. et al. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. Retrieved from https://openalex.org
Aria, M., Le, T., Cuccurullo, C., Belfiore, A., & Choe, J. (2024). openalexR: An R-Tool for Collecting Bibliometric Data from OpenAlex. R Journal, 15(4), 167–180. https://doi.org/10.32614/RJ-2023-089
Aria, M. et al. (2023). openalexR: An R package for programmatic access to OpenAlex metadata. CRAN. Retrieved from https://cran.r-project.org/package=openalexR
Hirsch, J.E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569–16572. https://doi.org/10.1073/pnas.0507655102
Most Local Cited Authors
Options:
Main Configuration
Authors' Production over Time
Options:
Main Configuration
Author Productivity through Lotka's Law
Authors' Local Impact
Options:
Main Configuration
Most Relevant Affiliations
Options:
Main Configuration
Affiliations' Production over Time
Options:
Main Configuration
Corresponding Author's Countries
Options:
Main Configuration
Countries' Scientific Production
Countries' Production over Time
Options:
Main Configuration
Most Cited Countries
Options:
Main Configuration
Most Global Cited Documents
Options:
Main Configuration
Most Local Cited Documents
Options:
Main Configuration
Most Local Cited References
Options:
Reference Spectroscopy
Options:
Main Configuration
Time Slice
Most Frequent Words
Options:
Main Configuration
Text Editing
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Upload a TXT or CSV file containing terms and their respective synonyms.
Each row must contain a term and its synonyms, separated by a standard separator (comma, semicolon, or tab).
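For example, a synonyms file might look like this (the terms are hypothetical; each row's entries are merged into the row's first term):

```text
artificial intelligence; ai; machine intelligence
co-citation; cocitation; co citation
```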
WordCloud
Options:
Main Configuration
Text Editing
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term (the first term in the row).
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Parameters
TreeMap
Options:
Main Configuration
Text Editing
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Upload a TXT or CSV file containing terms and their respective synonyms.
Each row must contain a term and its synonyms, separated by a standard separator (comma, semicolon, or tab).
Words' Frequency over Time
Options:
Main Configuration
Text Editing
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term (the first term in the row).
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Parameters
Trend Topics
Options:
Main Configuration
Text Editing
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term (the first term in the row).
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Parameters
Clustering by Coupling
Options:
Parameters
Co-occurrence Network
Options:
Main Configuration
Text Editing
Stop Words
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Synonyms
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term.
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Method Parameters
Network Size
Filtering Options
Graphical Parameters
Visual Appearance
Label Settings
Node & Edge Settings
Export Network
Thematic Map
Options:
Main Configuration
Text Editing
Stop Words
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Synonyms
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term.
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Parameters
Data Parameters
Display Parameters
Network Parameters
Thematic Evolution
Options:
Main Configuration
Text Editing
Stop Words
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Synonyms
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term.
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Parameters
Data Parameters
Weight Parameters
Display Parameters
Time Slices
Factorial Analysis
Options:
Main Configuration
Text Editing
Stop Words
Upload a TXT or CSV file containing a list of terms you want to remove from the analysis.
Terms must be separated by a standard separator (comma, semicolon, or tab).
Synonyms
Upload a TXT or CSV file in which each row contains a list of synonyms that will be merged into a single term.
Terms must be separated by a standard separator (comma, semicolon, or tab); rows must be separated by line breaks.
Method Parameters
Graphical Parameters
Co-citation Network
Options:
Main Configuration
Method Parameters
Network Size
Filtering Options
Graphical Parameters
Visual Appearance
Label Settings
Node & Edge Settings
Export Network
Historiograph
Options:
Main Configuration
Graphical Parameters
Label Configuration
Filtering Options
Visual Settings
Collaboration Network
Options:
Main Configuration
Method Parameters
Network Size
Filtering Options
Graphical Parameters
Visual Appearance
Label Settings
Node & Edge Settings
Export Network
Countries' Collaboration World Map
Options:
Method Parameters
Filtering Options
Graphical Parameters
Edge Settings
Scientific Article Content Analysis
Upload a PDF file and analyze citation patterns, context, and co-occurrence networks.
Readability Indices
Readability indices will appear here after analysis.
Text Statistics
Text statistics will appear here after analysis.
N-grams Analysis
Top Unigrams
Top Bigrams
Top Trigrams
Citation Types Distribution
Citations by Section
Word Distribution Analysis
Word Distribution Over Document
No visualization available
Select words from the list above and click 'Update Visualization' to see their distribution across the document.
Distribution Statistics
Statistics will appear here after visualization is generated.
In-Context Citation Analysis
Citation Contexts Visualization
Citation Co-occurrence Network
Network Information
Strongest Connections
Bibliography
Total References
From PDF
From Crossref
From OpenAlex
No references available
References will appear here after the analysis is complete.
References can be extracted from the PDF or fetched from Crossref using the document's DOI.
AI-Powered Document Summarization
AI-Generated Summary
No summary generated yet
Select your summary type above and click 'Generate Summary' to start.
Make sure you have uploaded a PDF document first.
Upload a PDF file and start the analysis
Select a scientific article in PDF format and configure the analysis parameters to begin.
Choose Import Method
1. Import PDF File ▼
Extracting Text from PDF...
Please wait while we extract the document content.
1. Load Saved Text File ▼
Load a .txt file that was previously saved from this tool. The file should have DOI and citation format info in the first lines for automatic configuration.
2. Analysis Parameters ▼
Advanced Options
3. Run Analysis
Analyzing content...
📄 Scientific Article Content Analysis
Content Analysis is a specialized feature in Biblioshiny that enables researchers to perform deep, AI-enhanced analysis of individual scientific articles in PDF format. This tool goes beyond traditional bibliometric analysis by examining the full text of documents, extracting citations with their surrounding context, and revealing patterns in how research is cited and discussed within scholarly narratives.
This menu brings into Biblioshiny the functions of the contentanalysis R package by Aria and Cuccurullo (https://cran.r-project.org/package=contentanalysis). The module is built on the bibliometrix ecosystem and adds advanced text-mining capabilities to support:
- Extraction and analysis of in-text citations with context windows
- Citation co-occurrence network analysis
- Readability and linguistic quality assessment
- Word distribution and trend analysis across document sections
- AI-powered document summarization through Biblio AI
- Comprehensive reference list extraction and matching
🎯 Purpose and Applications
Content Analysis is particularly valuable for:
- Understanding Citation Context: Examining how and where references are cited within a paper, distinguishing between peripheral mentions and substantive discussions.
- Identifying Citation Clusters: Detecting which references are frequently cited together, revealing the conceptual structure and intellectual foundations of the research.
- Quality Assessment: Evaluating document readability, lexical diversity, and linguistic complexity using established metrics like Flesch-Kincaid, ARI, and Gunning Fog Index.
- Thematic Flow Analysis: Tracking how key terms and concepts are distributed across different sections of the paper (Introduction, Methods, Results, Discussion).
- Literature Review Enhancement: Using AI-powered summarization to quickly extract key insights, research questions, methodologies, and findings from lengthy documents.
- Citation Practice Research: Analyzing citation patterns and practices for methodological or meta-research studies.
📥 Step 1: Import PDF File
The analysis begins by uploading a scientific article in PDF format. The system supports both single-column and multi-column layouts (specify the number of columns for accurate text extraction).
Citation Format Detection: The tool uses AI-enhanced extraction to identify citations in multiple formats:
- Author-year format: (Smith, 2020) or Smith et al. (2015)
- Numeric brackets: [1] or [15-17]
- Numeric superscripts: ¹ or ²³
- Mixed formats: The system can handle documents with inconsistent citation styles (though results may be less reliable)
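The two most common formats above can be matched with simple patterns. This is an illustrative Python sketch of the pattern-matching idea, not the app's actual extraction code (which also uses AI-enhanced detection):

```python
import re

# Minimal regexes for author-year and numeric-bracket in-text citations.
author_year = re.compile(r"\(([A-Z][A-Za-z'-]+(?:\s+et\s+al\.)?),?\s+(\d{4})\)")
numeric_bracket = re.compile(r"\[(\d+(?:[-,]\s*\d+)*)\]")

text = "Prior work (Smith, 2020) refined earlier models [1], [15-17]."
print(author_year.findall(text))      # [('Smith', '2020')]
print(numeric_bracket.findall(text))  # ['1', '15-17']
```

Real extraction is harder than this: narrative citations ("Smith et al. (2015) showed..."), multi-author lists, and superscripts all need extra handling, which is where the AI-enhanced mode helps.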
AI-Enhanced Extraction: When enabled, this feature uses advanced AI models to improve citation detection accuracy, particularly useful for:
- PDFs with complex layouts or formatting issues
- Documents with non-standard citation formats
- Multi-column articles where citation boundaries are ambiguous
- Papers with extensive in-text author listings (e.g., 'Smith, Jones, Williams, and Brown, 2020')
Note: AI-enhanced extraction requires a configured Google Gemini API key in Settings. See the Biblio AI help section for setup instructions.
⚙️ Step 2: Analysis Parameters
Users can customize the extraction and analysis through several key parameters:
Context Window Size (words)
Defines the number of words to extract before and after each citation. Default is 20 words.
- Smaller windows (5-10 words): Capture only immediate context, useful for identifying direct citation purposes (e.g., methodology references).
- Medium windows (15-30 words): Balance between context richness and data volume. Suitable for most analyses.
- Larger windows (40-50 words): Capture broader argumentative context, useful for discourse analysis or understanding how citations are integrated into narrative flow.
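The context-window extraction amounts to taking N words on either side of each matched citation. A minimal Python sketch of that idea (not the package's R implementation):

```python
import re

def context_window(text, citation, n_words=20):
    """Return (before, after): n_words of context on each side of the
    first occurrence of `citation` in `text`."""
    pos = text.find(citation)
    if pos == -1:
        return None
    before = re.findall(r"\S+", text[:pos])[-n_words:]
    after = re.findall(r"\S+", text[pos + len(citation):])[:n_words]
    return " ".join(before), " ".join(after)

text = ("Network models of science have matured considerably "
        "(Smith, 2020) and now underpin most mapping tools.")
before, after = context_window(text, "(Smith, 2020)", n_words=5)
print(before)  # 'of science have matured considerably'
print(after)   # 'and now underpin most mapping'
```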
Max Distance for Network (characters)
Defines the maximum character distance between two citations to be considered 'co-occurring' in the network analysis. Default is 800 characters (roughly 120-150 words).
- Shorter distances (300-500 chars): Identify only tightly co-cited references, revealing core conceptual links.
- Medium distances (600-1000 chars): Capture paragraph-level co-citations, showing related but distinct concepts.
- Longer distances (1200-2000 chars): Include section-level co-citations, useful for broad thematic analysis but may create noisy networks.
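The co-occurrence rule itself is simple: two citations are linked when their character positions are within the threshold. A Python sketch with invented offsets:

```python
from itertools import combinations

# Citations as (label, character offset in the text).
citations = [("Smith 2020", 120), ("Jones 2019", 400), ("Lee 2021", 2500)]
max_dist = 800  # the default Max Distance parameter

# Keep every pair whose offsets differ by at most max_dist characters.
edges = [(a, b) for (a, pa), (b, pb) in combinations(citations, 2)
         if abs(pa - pb) <= max_dist]
print(edges)  # [('Smith 2020', 'Jones 2019')]
```

Lowering `max_dist` prunes the weaker, longer-range links first, which is why shorter distances yield tighter core networks.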
Advanced Options
- Parse complex multiple citations: Attempts to separate compound citations like '(Smith 2020; Jones 2019; Williams et al. 2018)' into individual references. This improves network accuracy but may increase processing time.
- Remove stopwords from analysis: Excludes common words (e.g., 'the', 'and', 'of') from word frequency and trend analyses, focusing on substantive terms.
- Custom stopwords (comma-separated): Add domain-specific stopwords (e.g., 'study', 'research', 'analysis') to refine term extraction for your field.
📊 Analysis Results and Tabs
1️⃣ Descriptive Statistics
Provides an overview of the document's structural and linguistic characteristics:
Document Metrics:
- Total Words: Overall word count (excluding references section if detected).
- Citations Found: Number of unique in-text citations identified.
- Narrative Citations: Citations that include author names in the sentence (e.g., 'As Smith (2020) demonstrated...').
- Citation Density: Citations per 1000 words, indicating reference saturation. Typical ranges:
- 0-5: Light citation (review articles often have 10-20+)
- 5-10: Moderate (common in empirical studies)
- 10+: Heavy (common in systematic reviews or theoretical papers)
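Citation density is a straightforward normalization, sketched here for clarity (the counts are invented):

```python
def citation_density(n_citations, total_words):
    """Citations per 1000 words, as reported in the Document Metrics panel."""
    return 1000 * n_citations / total_words

# A hypothetical 6000-word paper with 48 in-text citations.
print(citation_density(48, 6000))  # 8.0 -> moderate density
```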
Readability Indices:
- Flesch-Kincaid Grade: Estimates U.S. grade level required to understand the text. Values of 12-14 indicate college-level reading, while 16+ suggests graduate-level complexity.
- Reading Ease (Flesch): 0-100 scale where higher scores indicate easier readability. Scientific papers typically score 20-40 (difficult to very difficult).
- ARI Index (Automated Readability Index): Another grade-level estimate based on character count rather than syllables. Generally correlates with Flesch-Kincaid but may differ for technical texts.
- Gunning Fog Index: Estimates years of formal education needed. Values above 17 indicate very complex, technical prose common in specialized research.
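Three of these indices follow well-known published formulas, sketched below in Python. The app's own implementation may differ in tokenization and syllable-counting details, and the example counts are invented:

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # U.S. grade level estimate.
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59

def flesch_reading_ease(words, sentences, syllables):
    # 0-100 scale; higher = easier.
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words

def gunning_fog(words, sentences, complex_words):
    # Years of formal education needed; complex_words = words with 3+ syllables.
    return 0.4 * (words / sentences + 100 * complex_words / words)

# A hypothetical 6000-word paper: 300 sentences, 10800 syllables,
# 1900 complex words.
print(round(flesch_kincaid_grade(6000, 300, 10800), 2))
print(round(flesch_reading_ease(6000, 300, 10800), 2))
print(round(gunning_fog(6000, 300, 1900), 2))
```

Note how all three are driven by the same two ratios, words per sentence and "hard word" rate, which is why they usually agree in ranking documents.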
Text Statistics:
- Characters, Words, Sentences: Basic text volume metrics.
- Syllables: Used in readability calculations.
- Complex Words: Words with 3+ syllables, expressed as count and percentage. Higher percentages (>30%) indicate technical vocabulary.
- Avg words/sentence: Sentence length. Scientific writing typically ranges from 15-25 words/sentence. Very long sentences (>30) may reduce readability.
- Lexical Diversity: Ratio of unique words to total words. Higher diversity (>0.5) suggests varied vocabulary; lower values (<0.4) may indicate repetitive or formulaic writing.
N-grams Analysis: Displays the most frequent unigrams (single words), bigrams (two-word phrases), and trigrams (three-word phrases) in the document. This reveals key concepts and repeated terminology.
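N-gram counting is a sliding window over the token sequence; a compact Python sketch (illustrative, not the app's R code):

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "science mapping analysis supports science mapping research".split()
print(ngrams(tokens, 2).most_common(1))  # [(('science', 'mapping'), 2)]
```

Stopword removal (previous section) happens before this step, so the top n-grams reflect substantive terminology rather than function words.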
2️⃣ Word Trends
Visualizes how selected terms are distributed across the document's sections. This analysis helps understand the thematic flow and identify where specific concepts are emphasized.
Features:
- Track up to 10 terms: Select from the most frequent words or enter custom terms (e.g., domain-specific keywords).
- Segmentation options:
- Auto (use sections if available): Automatically detects standard sections (Abstract, Introduction, Methods, Results, Discussion, Conclusion) if structured.
- Document sections: Manually defined sections based on detected headers.
- Equal-length segments: Divides the document into uniform chunks (e.g., quartiles) regardless of logical structure.
- Visualization types:
- Line chart: Shows temporal trends for each term across segments.
- Area chart: Emphasizes volume changes with filled areas under lines.
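The "equal-length segments" option reduces to splitting the token stream into uniform chunks and counting a term in each. A Python sketch of that segmentation (invented tokens):

```python
def term_trend(tokens, term, n_segments=4):
    """Occurrences of `term` in each of n_segments equal-length chunks."""
    size = max(1, len(tokens) // n_segments)
    segments = [tokens[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(tokens[(n_segments - 1) * size:])  # remainder goes last
    return [seg.count(term) for seg in segments]

tokens = ["model"] * 3 + ["data"] * 5 + ["model"] * 2 + ["data"] * 6
print(term_trend(tokens, "model"))  # [3, 0, 2, 0]
```

Each list becomes one line (or area) in the chart, with segments on the x-axis.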
Interpretation Examples:
- A term peaking in the Introduction but absent in Results may indicate a concept discussed in literature review but not directly addressed in the study.
- Uniform distribution suggests a central theme integrated throughout the paper.
- Concentration in Methods indicates technical or procedural terminology.
- Terms appearing only in Discussion may represent emerging implications or future directions.
Distribution Statistics: Provides detailed frequency counts and statistical measures (mean, standard deviation, skewness) for each tracked term across segments.
3️⃣ In-Context Citations
Displays each extracted citation with its surrounding context window, enabling qualitative citation analysis. This is one of the most powerful features for understanding why and how sources are cited.
Features:
- Searchable list: Filter citations by searching for specific authors, keywords, or phrases.
- Minimum context words: Set a threshold to exclude citations with insufficient context (useful for filtering out reference-only lists or captions).
- Grouping options:
- Auto (use sections if available): Groups citations by paper section (e.g., all Introduction citations together).
- Document sections: Uses manually detected section headers.
- Equal-length segments: Groups citations by position in the document (e.g., first quartile, second quartile).
- Citation type identification: The interface distinguishes between:
- Parenthetical citations: References in parentheses, often used for supporting evidence.
- Narrative citations: Author names integrated into sentence structure, typically indicating more substantive engagement.
- Reference matching: When possible, the system attempts to match in-text citations to the corresponding full reference in the bibliography. A green indicator shows successful matches; hover to view full reference details.
Analytical Uses:
- Citation Function Analysis: Categorize citations by their role (e.g., establishing theoretical framework, justifying methodology, supporting findings, contrasting results).
- Author Authority: Identify which authors are cited most frequently and in what contexts (e.g., some authors may be cited exclusively for methods, others for theory).
- Hedging and Certainty: Examine the language surrounding citations to assess how authors express confidence or uncertainty (e.g., 'Smith (2020) demonstrated...' vs. 'some studies suggest... (Smith 2020)').
- Self-Citation Patterns: Identify author self-citations and analyze whether they serve substantive or promotional purposes.
Export Contexts: Use the 'Export Contexts' button to download all citation contexts as a structured dataset (CSV format) for external qualitative coding or further analysis in tools like NVivo, ATLAS.ti, or custom R scripts.
4️⃣ Network Analysis
Generates a citation co-occurrence network that visualizes which references are cited near each other in the text. This reveals the intellectual structure and conceptual clusters within the paper.
Network Construction:
- Nodes: Each node represents a cited reference (identified by first author and year).
- Edges: A link is created between two references if they appear within the specified Max Distance parameter (default: 800 characters).
- Node Size: Proportional to the number of times each reference is cited in the document (total citation frequency).
- Node Color: Represents the document section where the reference appears most frequently, helping identify whether certain clusters are methodological, theoretical, or results-focused.
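Putting the pieces together, node sizes and edge weights come from the same list of citation mentions. An illustrative Python sketch (invented offsets; not the app's R implementation):

```python
from collections import Counter
from itertools import combinations

# Each mention of a reference: (label, character offset in the text).
mentions = [("Smith 2020", 120), ("Jones 2019", 400), ("Smith 2020", 900),
            ("Jones 2019", 1100), ("Lee 2021", 5000)]
max_dist = 800

# Node size: how often each reference is cited.
node_size = Counter(label for label, _ in mentions)

# Edge weight: co-occurrences of two distinct references within max_dist.
edge_weight = Counter()
for (a, pa), (b, pb) in combinations(mentions, 2):
    if a != b and abs(pa - pb) <= max_dist:
        edge_weight[tuple(sorted((a, b)))] += 1

print(node_size["Smith 2020"])                    # 2
print(edge_weight[("Jones 2019", "Smith 2020")])  # 3
```

An edge list in this form exports directly to GraphML or plain edge-list files for Gephi, Cytoscape, or igraph.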
Interpretation:
- Densely connected clusters: Groups of tightly co-cited references indicate core conceptual or methodological frameworks that the paper builds upon.
- Bridge references: Nodes connecting otherwise separate clusters represent interdisciplinary links or integrative studies that synthesize multiple research traditions.
- Peripheral isolates: References cited alone without nearby co-citations may serve specific, standalone purposes (e.g., citing a statistical test or a single example).
- Section-based coloring: If most nodes are blue (Introduction), the paper heavily relies on literature review. If red (Methods) dominates, it's methodology-focused.
Network Metrics: While not explicitly displayed, users can infer centrality and clustering qualitatively:
- Central nodes (many connections): Foundational references that anchor the paper's argument.
- Betweenness (bridge position): References connecting distinct themes, suggesting synthesis or interdisciplinary work.
Export Network: Download the network as a graph file (GraphML or edge list format) for advanced analysis in specialized network software like Gephi, Cytoscape, or igraph in R.
Legend: The right panel displays a legend showing the color coding for different document sections. This helps quickly identify which parts of the paper contribute most to the citation network.
5️⃣ References
Displays the complete bibliography extracted from the document. References are automatically parsed and enriched with metadata from Crossref and OpenAlex databases when DOIs are available.
Features:
- Search functionality: Find specific references by author name, title, year, or DOI.
- Source indicators: Icons show the data source:
- PDF icon: Reference extracted directly from the PDF.
- Crossref icon: Metadata retrieved from Crossref (indicates DOI-based matching).
- OpenAlex icon: Metadata enriched from OpenAlex (provides additional fields like citation counts, authors' affiliations).
- View Details: Click on any reference to open a detailed modal showing:
- Full author list
- Publication year and journal/venue
- DOI (with direct link to the publisher)
- Abstract (if available)
- Citation counts and impact metrics from OpenAlex
- Export References: Download the bibliography in various formats (BibTeX, RIS, CSV) for import into reference managers like Zotero, Mendeley, or EndNote.
Reference Matching Quality:
- High confidence (green): Reference successfully matched to Crossref/OpenAlex with high certainty (exact DOI or strong title/author match).
- Low confidence (yellow): Partial match based on fuzzy title/author similarity; manual verification recommended.
- No match (red): Reference could not be matched to external databases. Possible reasons include:
- Missing or incorrect DOI
- Pre-print or non-indexed publication
- Parsing errors in reference extraction
- Non-standard formatting in the original PDF
Total References: The header shows the breakdown:
- Total References: All unique references found in the document.
- From PDF: References extracted directly from the PDF (may have parsing imperfections).
- From Crossref: References successfully matched to the Crossref database.
- From OpenAlex: References matched to OpenAlex (often overlaps with Crossref but provides broader coverage for non-DOI works).
6️⃣ BiblioAI Summary
Uses Google Gemini AI models to generate intelligent, context-aware summaries of the analyzed document. This feature transforms raw PDF content into structured, actionable insights.
Summary Types:
- Short Abstract (250 words): A concise overview covering the main research question, methodology, key findings, and conclusions. Ideal for quick reference or creating study cards.
- Narrative Abstract (500-600 words): A detailed, paragraph-form summary that places the research in context, explains the study's rationale, describes methods and results, and discusses implications. Suitable for grant applications or research summaries.
- IMRaD Structure Summary: A structured summary organized by the traditional scientific paper format:
- Introduction: Background, research gap, and objectives
- Methods: Study design, data sources, analytical approach
- Results: Key findings with quantitative details where applicable
- Discussion: Interpretation, limitations, and implications
- Thematic Bibliography: Generates a thematic categorization of the paper's references, grouping cited works by conceptual topic (e.g., 'Theoretical Foundations', 'Methodological Approaches', 'Empirical Evidence', 'Critiques and Limitations'). This is invaluable for:
- Understanding how the author organizes their intellectual framework
- Identifying key reference clusters for literature review purposes
- Discovering reading lists organized by research theme
- Research Questions & Context: Extracts and articulates:
- The main research question(s) or hypotheses
- The broader research context and motivation
- Key theoretical or empirical gaps the study addresses
- The study's positioning within its field
Customization:
- Edit Prompt: The AI summary generation is driven by an editable prompt template. Users can:
- Add specific questions to the prompt (e.g., 'What statistical methods were used?')
- Specify output format preferences
- Include contextual information (e.g., 'This paper is from a special issue on climate change')
- Request focus on particular sections (e.g., 'Emphasize methodological contributions')
- Language Selection: Summaries can be generated in multiple languages (depending on Gemini model support), making international collaboration and translation easier.
Requirements:
- A valid Google Gemini API key must be configured in Settings. Free tier limits allow for moderate usage (typically 15-60 requests per minute depending on the model).
- The 'API Key Configured' indicator must show green for the feature to work.
Best Practices:
- Verify AI outputs: While Gemini models are highly capable, always review generated summaries for accuracy, especially for technical details or numerical results.
- Use appropriate summary types: Short abstracts are great for skimming large numbers of papers, while IMRaD summaries are better for in-depth study or quality appraisal.
- Combine with manual review: AI summaries should complement, not replace, human reading. Use them to prioritize which papers to read in full.
- Iterate prompts: If the initial summary misses key information, refine your prompt with more specific instructions.
🧠 Integration with Biblio AI
Content Analysis is fully integrated with the Biblio AI ecosystem. Throughout the analysis tabs, users can activate AI-assisted interpretation panels that:
- Explain patterns in the citation network (e.g., why certain clusters form)
- Interpret readability scores in the context of the target audience and field norms
- Suggest alternative analyses or parameters based on detected document characteristics
- Generate publication-ready text descriptions of figures and results
These dynamic interpretations adapt to your specific document and parameters, providing contextualized guidance rather than generic explanations.
💡 Advanced Use Cases
1. Citation Context Mining for Meta-Research
Researchers studying citation practices can use Content Analysis to:
- Quantify the proportion of narrative vs. parenthetical citations across papers
- Analyze sentiment or evaluative language in citation contexts (requires exporting contexts for sentiment analysis)
- Study how often citations are backed by direct quotes vs. paraphrases
- Examine self-citation contexts to distinguish between necessary methodological references and gratuitous self-promotion
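The first of these tasks can be approximated with simple pattern matching on exported citation contexts. The sketch below is a minimal, hypothetical illustration: the regexes and the `citation_mix` helper are assumptions for demonstration, not part of Biblioshiny's export format or the contentanalysis package, and real citation parsers are far more robust.

```python
import re

# Crude heuristics for the two citation styles (illustrative only):
# - parenthetical: "(Smith et al., 2020)" — authors and year inside parentheses
# - narrative:     "Smith et al. (2020)"  — authors in the sentence, year in parentheses
PARENTHETICAL = re.compile(r"\([A-Z][A-Za-z'-]+(?: et al\.)?,\s*\d{4}\)")
NARRATIVE = re.compile(r"[A-Z][A-Za-z'-]+(?: et al\.)?\s*\(\d{4}\)")

def citation_mix(contexts):
    """Return the share of narrative vs. parenthetical citations
    in a list of citation-context strings."""
    counts = {"narrative": 0, "parenthetical": 0}
    for text in contexts:
        if PARENTHETICAL.search(text):
            counts["parenthetical"] += 1
        elif NARRATIVE.search(text):
            counts["narrative"] += 1
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

contexts = [
    "This extends earlier work (Smith et al., 2020) on topic modeling.",
    "Aria and Cuccurullo (2017) introduced bibliometrix for science mapping.",
]
print(citation_mix(contexts))  # {'narrative': 0.5, 'parenthetical': 0.5}
```

For serious meta-research, such counts should be validated against a manually coded sample, since citation formats vary widely across journals.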
2. Quality and Transparency Assessment
Evaluate methodological transparency by:
- Checking if key methodological references appear in the Methods section (using In-Context Citations and section grouping)
- Identifying whether data sources and statistical tests are properly cited
- Assessing whether limitations are discussed with appropriate citations to prior critiques
3. Comparative Literature Review
Analyze multiple papers on the same topic by:
- Running Content Analysis on several key papers
- Comparing their citation networks to identify consensus foundational references
- Examining differences in thematic emphasis using Word Trends
- Creating a merged bibliography of all thematic bibliographies from Biblio AI summaries
4. Pedagogical Applications
Use Content Analysis to teach:
- Citation Skills: Show students examples of effective narrative vs. parenthetical citations.
- Literature Review Structure: Demonstrate how successful papers organize their conceptual frameworks using network analysis.
- Writing Clarity: Use readability scores to illustrate the balance between technical precision and accessibility.
- Source Integration: Highlight how professional academics weave citations into argumentative flow.
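The readability point above rests on standard formulas such as Flesch Reading Ease. As a classroom illustration (not Biblioshiny's internal implementation), the score can be computed with a crude vowel-group syllable heuristic; production tools use pronunciation dictionaries instead.

```python
import re

def count_syllables(word):
    """Crude vowel-group heuristic; real tools use pronunciation dictionaries."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a silent final 'e'
    return max(n, 1)

def flesch_reading_ease(text):
    """Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text (90+ very easy, below 30 very difficult)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# Scores above 100 can occur for very short, simple sentences.
print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```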
5. Editorial and Peer Review Support
Editors and reviewers can use Content Analysis to:
- Quickly assess whether a manuscript cites the appropriate foundational literature (using network analysis to check for missing clusters)
- Identify potential plagiarism or over-reliance on a single source (using n-gram analysis and citation frequency)
- Evaluate whether methods and results are properly supported by citations
- Generate constructive feedback on readability for papers that score poorly on standard metrics
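The n-gram check mentioned above can be sketched in a few lines. The helper below is an assumption for illustration, not how Biblioshiny implements it; unusually frequent long n-grams shared with a single source are a signal worth inspecting, not proof of plagiarism.

```python
from collections import Counter

def top_ngrams(text, n=3, k=5):
    """Return the k most frequent word n-grams in a text.
    Repeated long n-grams shared with one source can flag
    recycled phrasing or over-reliance on that source."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams).most_common(k)

sample = "the results confirm the hypothesis and the results confirm the trend"
print(top_ngrams(sample, n=3, k=2))  # [('the results confirm', 2), ('results confirm the', 2)]
```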
📌 Best Practices
- PDF Quality Matters: Text-based PDFs (not scanned images) produce the most accurate results. Use OCR pre-processing for image-based PDFs.
- Check Citation Parsing: Always review the 'Citations Found' count. If it seems unusually low, try enabling AI-enhanced extraction or manually specifying the citation format.
- Balance Context Windows: Larger windows provide richer qualitative data but increase processing time and data volume. Start with default settings (20 words) and adjust based on your analytical needs.
- Export for Deep Dives: For complex citation function analysis or qualitative coding, export the citation contexts to CSV and work with specialized qualitative data analysis software.
- Combine with Traditional Bibliometrics: Content Analysis is designed to complement, not replace, traditional bibliometric methods. Use it alongside tools like Data → Overview or Conceptual Structure for a complete picture.
- Mind the Model Limits: AI-powered features (enhanced extraction, Biblio AI summaries) have rate limits and token constraints. For very long documents (>50 pages), summaries may truncate; consider analyzing sections separately.
⚠️ Limitations and Considerations
- Citation Format Variability: Non-standard or inconsistent citation formats may result in incomplete extraction. Manual verification is recommended for critical analyses.
- Automatic Section Detection: The system attempts to identify standard sections (Introduction, Methods, etc.) using heuristics. Papers with unconventional structures may be segmented incorrectly. Use 'Equal-length segments' as a fallback.
- Language Support: While the tool supports PDFs in any language, AI-powered features (enhanced extraction, summaries) are optimized for English. Other languages may produce less accurate results.
- Reference Matching Accuracy: Matching in-text citations to bibliography entries relies on fuzzy matching when DOIs are unavailable. Ambiguous references (e.g., 'Smith et al., 2020' when multiple Smiths exist) may match incorrectly.
- Network Interpretation: Citation co-occurrence does not imply intellectual similarity. References may appear together for diverse reasons (e.g., contrasting studies, citing a methods paper alongside an application). Always combine network analysis with context reading.
- AI Hallucinations: Although rare, Biblio AI summaries may occasionally include plausible but incorrect details that are not present in the original text. Critical applications (e.g., systematic reviews) should verify AI outputs against the source document.
- Database Coverage: Reference enrichment from Crossref/OpenAlex is limited to indexed works with DOIs. Books, preprints, gray literature, or very recent papers may not be matched.
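The fuzzy-matching limitation above can be made concrete with a toy example. The sketch uses Python's difflib for similarity scoring; it is not the tool's actual matching algorithm, and the `best_match` helper and its threshold are arbitrary assumptions for illustration.

```python
from difflib import SequenceMatcher

def best_match(citation, bibliography, threshold=0.6):
    """Match an in-text citation key (e.g. 'Smith et al., 2020') to the
    most similar bibliography entry; return None below the threshold."""
    scored = [(SequenceMatcher(None, citation.lower(), entry.lower()).ratio(), entry)
              for entry in bibliography]
    score, entry = max(scored)
    return entry if score >= threshold else None

bib = [
    "Smith et al. (2020) Topic models",
    "Smith (2020) Citation parsing",
]
# Two 'Smith 2020' entries: string similarity alone cannot disambiguate reliably.
print(best_match("Smith et al., 2020", bib))
```

This is exactly why ambiguous references should be spot-checked by hand when DOIs are unavailable.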
🔄 Integration with Biblioshiny Workflow
Content Analysis complements other Biblioshiny modules:
- Import Phase: After collecting a bibliographic dataset (e.g., from Web of Science), use Content Analysis to deeply examine a few key highly cited papers identified in Most Local Cited Documents.
- Conceptual Structure Analysis: Once you've identified thematic clusters using co-word analysis or MCA, select representative papers from each cluster and use Content Analysis to understand how those themes are discussed and cited within individual papers.
- Intellectual Structure: After running co-citation or bibliographic coupling networks, use Content Analysis on the central nodes to see why they're central—are they cited together because they address the same method, theory, or empirical finding?
- Trend Topics: When you identify an emerging trend, analyze a seminal paper from that trend to understand its intellectual roots and citation context.
📚 Key References
Main References on Bibliometrix and Content Analysis Tools:
Aria, M., & Cuccurullo, C. (2025). contentanalysis: Scientific Content and Citation Analysis from PDF Documents. [R package]. https://doi.org/10.32614/CRAN.package.contentanalysis
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. https://doi.org/10.1016/j.joi.2017.08.007
Aria, M., Cuccurullo, C., D'Aniello, L., Misuraca, M., & Spano, M. (2024). Comparative science mapping: a novel conceptual structure analysis with metadata. Scientometrics. https://doi.org/10.1007/s11192-024-05161-6
Aria, M., Cuccurullo, C., D'Aniello, L., Misuraca, M., & Spano, M. (2022). Thematic Analysis as a New Culturomic Tool: The Social Media Coverage on COVID-19 Pandemic in Italy. Sustainability, 14(6), 3643. https://doi.org/10.3390/su14063643
Aria, M., Misuraca, M., & Spano, M. (2020). Mapping the evolution of social research and data science on 30 years of Social Indicators Research. Social Indicators Research, 149, 803–831. https://doi.org/10.1007/s11205-020-02281-3
Quantitative Content Analysis - Foundational Works:
Berelson, B. (1952). Content Analysis in Communication Research. Glencoe, IL: Free Press.
Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). Thousand Oaks, CA: SAGE Publications.
Neuendorf, K. A. (2017). The Content Analysis Guidebook (2nd ed.). Thousand Oaks, CA: SAGE Publications.
Weber, R. P. (1990). Basic Content Analysis (2nd ed.). Newbury Park, CA: SAGE Publications.
🎓 Further Reading
For more information on using Content Analysis and related techniques, see:
- bibliometrix documentation: https://www.bibliometrix.org
- contentanalysis documentation: https://massimoaria.github.io/contentanalysis-website/
- Content analysis in communication research: Riffe, D., Lacy, S., & Fico, F. (2014). Analyzing Media Messages: Using Quantitative Content Analysis in Research (3rd ed.). Routledge.
Report
Select results to include in the Report
TALL - Text Analysis for All
Biblioshiny now includes a dedicated export tool that allows you to prepare and extract textual data (Titles, Abstracts, and Keywords) from your bibliographic collection in a format ready to be used in TALL.
TALL is a user-friendly R Shiny application designed to support researchers in performing textual data analysis without requiring advanced programming skills.
TALL offers a comprehensive workflow for the cleaning, pre-processing, statistical analysis, and visualization of textual data, combining state-of-the-art text analysis techniques into a single R Shiny app.
TALL includes a wide set of methodologies tailored to a variety of text analysis tasks, providing a versatile, general-purpose tool that lets researchers extract valuable insights from textual data efficiently and accessibly.
Learn more at: www.tall-app.com
Export a corpus for TALL
Select textual metadata:
Select additional metadata:
Select at least one textual field to export, click 'Play' to generate the dataset, then save and import it into TALL.
Settings
Configure global settings for plots, analysis reproducibility, and AI features.