
Web information gathering
Web Reconnaissance Guide
1. Overview of Web Reconnaissance
- Definition: Web reconnaissance is the preliminary phase of a security assessment, focusing on systematically gathering information about a target website or web application to understand its structure and potential weaknesses.
- Objectives:
- Asset Discovery: Identify public-facing elements such as web pages, subdomains, IP addresses, and technologies in use.
- Uncover Sensitive Data: Detect exposed information like configuration files or backup data.
- Map Attack Surface: Pinpoint vulnerabilities, misconfigurations, or entry points for potential exploitation.
- Gather Intelligence: Collect data for social engineering, such as key personnel details or email addresses.
- Significance:
- Attackers leverage reconnaissance to craft targeted attacks and evade defenses.
- Defenders use it to proactively identify and mitigate vulnerabilities.
- Reconnaissance Types:
- Active Reconnaissance:
- Involves direct interaction with the target (e.g., port scanning or vulnerability scanning).
- Techniques: Port scanning, vulnerability scanning, network mapping, banner grabbing, OS fingerprinting, service enumeration, web crawling.
- Tools: Nmap, Nessus, Nikto, Burp Suite Spider, curl.
- Risk: High chance of detection due to triggering intrusion detection systems (IDS) or firewalls.
- Passive Reconnaissance:
- Relies on publicly available data without direct target interaction.
- Techniques: Search engine queries, WHOIS lookups, DNS analysis, web archive reviews, social media scraping, code repository analysis.
- Tools: Google, WHOIS CLI, dig, Wayback Machine, LinkedIn, GitHub.
- Risk: Minimal detection risk, as it resembles typical internet activity.
2. WHOIS Protocol
- Definition: WHOIS is a query/response protocol used to retrieve registration details for internet resources like domains, IP addresses, and autonomous systems from public databases.
- Purpose: Acts as the internet’s directory, providing ownership and technical details for online assets.
- Key WHOIS Record Elements:
- Domain Name: e.g., example.com.
- Registrar: The entity managing the domain (e.g., GoDaddy).
- Registrant Contact: The domain owner (individual or organization).
- Administrative Contact: Manages domain operations.
- Technical Contact: Handles technical configurations.
- Creation/Expiration Dates: Registration and expiry dates.
- Name Servers: Resolve domain to IP addresses.
- Historical Context:
- Developed in the 1970s by Elizabeth Feinler and her team at SRI International’s Network Information Center (NIC) for ARPANET.
- Originally tracked network users, hostnames, and domains.
- Relevance to Reconnaissance:
- Personnel Insights: Exposes names, emails, or phone numbers for social engineering or phishing.
- Infrastructure Mapping: Name servers and IPs reveal hosting providers or misconfigurations.
- Historical Analysis: Tools like WhoisFreaks track changes in ownership or configurations.
- Use Cases:
- Phishing Detection:
- A WHOIS lookup on a suspicious email domain reveals recent registration, hidden ownership, or shady hosting, indicating phishing.
- Action: Block the domain, alert users, investigate the hosting provider.
- Malware Investigation:
- A malware C2 server’s WHOIS shows anonymous emails, high-risk hosting countries, or lax registrars, suggesting a compromised server.
- Action: Notify the provider, escalate investigation.
- Threat Intelligence:
- Analyzing WHOIS data across threat actor domains uncovers patterns like clustered registrations or shared name servers.
- Action: Develop threat profiles, share indicators of compromise (IOCs).
- Using WHOIS:
- Installation: sudo apt update && sudo apt install whois -y (Linux).
- Command: whois example.com (e.g., whois google.com).
- Sample Output (google.com):
- Registrar: MarkMonitor Inc.
- Creation Date: 1997-09-15.
- Expiry Date: 2028-09-14.
- Registrant: Google LLC, Domain Admin.
- Domain Status: Protected (clientDeleteProhibited, clientTransferProhibited).
- Name Servers: ns1.google.com, ns2.google.com, etc.
- Insight: Well-secured, long-standing domain with Google-managed DNS.
- Limitations: May not reveal specific vulnerabilities or employee details; supplement with other recon methods.
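The same lookup can be scripted for triage of many domains at once. A minimal sketch, assuming the system whois client is installed and that the registrar output contains a "Creation Date:" field (formats vary widely by registrar); the domain list is hypothetical:

```python
import re
import subprocess

# Hypothetical domains pulled from suspicious emails
domains = ["example.com", "example.org"]

for domain in domains:
    # Call the system whois client and capture its text output
    output = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
    created = re.search(r"Creation Date:\s*(\S+)", output, re.IGNORECASE)
    registrar = re.search(r"Registrar:\s*(.+)", output)
    print(f"{domain} | created: {created.group(1) if created else 'unknown'} "
          f"| registrar: {registrar.group(1).strip() if registrar else 'unknown'}")
```

Recently created domains with privacy-protected registrants are a common phishing indicator worth flagging for manual review.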
3. Domain Name System (DNS)
- Definition: DNS translates user-friendly domain names (e.g., example.com) into IP addresses (e.g., 93.184.216.34), serving as the internet’s navigation system.
- DNS Resolution Process:
- Query Initiation: A device checks its cache, then queries a DNS resolver (e.g., ISP server).
- Recursive Query: Resolver contacts a root name server.
- Root Response: Directs to a top-level domain (TLD) server (e.g., .com).
- TLD Response: Points to the authoritative name server.
- Authoritative Response: Provides the IP address.
- Resolver Delivery: Returns the IP to the device and caches it.
- Connection: Device connects to the target server.
- Hosts File:
- Location: /etc/hosts (Linux/macOS), C:\Windows\System32\drivers\etc\hosts (Windows).
- Format: <IP> <Hostname> [<Alias>] (e.g., 127.0.0.1 localhost).
- Purpose: Local DNS overrides for testing, development, or blocking (e.g., 0.0.0.0 ads.example.com).
- Editing: Requires admin/root access; changes are immediate.
- Core DNS Concepts:
- Zone: A managed segment of a domain’s namespace (e.g., example.com and its subdomains).
- Zone File: Contains resource records for a zone (e.g., A, MX, NS).
- DNS Record Types:
- A: Links hostname to IPv4 (e.g., www.example.com IN A 93.184.216.34).
- AAAA: Links hostname to IPv6.
- CNAME: Aliases a hostname to another (e.g., blog.example.com IN CNAME server1.example.net).
- MX: Defines mail servers (e.g., example.com IN MX 10 mail.example.com).
- NS: Lists authoritative name servers.
- TXT: Stores arbitrary text (e.g., SPF records).
- SOA: Specifies zone authority (e.g., serial number, refresh intervals).
- SRV: Indicates service locations.
- PTR: Maps IP to hostname for reverse DNS.
- IN: Denotes the Internet class in resource records.
- Reconnaissance Value:
- Asset Identification: Reveals subdomains, mail servers, and hosting infrastructure.
- Infrastructure Mapping: NS/A records pinpoint hosting providers or load balancers.
- Change Monitoring: New subdomains (e.g., api.example.com) or TXT records (e.g., security tools) signal new services or vulnerabilities.
- Sample Zone File:
$TTL 3600
@    IN SOA ns1.example.com. admin.example.com. (2025010101 3600 900 604800 86400)
@    IN NS  ns1.example.com.
@    IN NS  ns2.example.com.
@    IN MX  10 mail.example.com.
www  IN A   93.184.216.34
mail IN A   198.51.100.10
ftp  IN CNAME www.example.com.
4. DNS Reconnaissance Techniques
- Objective: Extract detailed infrastructure insights using DNS-focused tools.
- Key Tools:
- dig: Robust tool for detailed DNS queries.
- nslookup: Simple DNS lookup utility.
- host: Quick tool for A/AAAA/MX queries.
- dnsenum, fierce, dnsrecon: Advanced tools for subdomain enumeration and zone transfers.
- theHarvester: OSINT tool for collecting emails, subdomains, and hosts.
- Online Platforms: Web-based DNS lookup services for ease of use.
- Using dig:
- Commands:
- dig example.com: Retrieves A record.
- dig example.com MX: Lists mail servers.
- dig example.com NS: Shows name servers.
- dig example.com TXT: Displays text records.
- dig example.com CNAME: Finds aliases.
- dig example.com SOA: Gets zone authority.
- dig @8.8.8.8 example.com: Queries a specific server.
- dig +trace example.com: Traces the resolution path.
- dig -x 93.184.216.34: Performs reverse lookup.
- dig +short example.com: Provides minimal output.
- dig +noall +answer example.com: Shows only the answer section.
- dig example.com ANY: Requests all records (often restricted per RFC 8482).
- Caution: Excessive queries may trigger rate limits or detection; always obtain permission.
- Sample dig Output (example.com):
dig example.com

; <<>> DiG 9.18.24 <<>> example.com
; Got answer:
; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345

; QUESTION SECTION:
;example.com.    IN    A

; ANSWER SECTION:
example.com.    3600    IN    A    93.184.216.34

; Query time: 10 msec
; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
; WHEN: Wed May 14 13:32:00 +06 2025
- Analysis:
- Header: Confirms successful query (NOERROR).
- Question: Requested A record for example.com.
- Answer: IP address 93.184.216.34.
- Metadata: Query time, server used.
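When several record types are needed at once, the individual dig queries above can be wrapped in a small script. A minimal sketch, assuming dig is on PATH and using example.com as a placeholder target:

```python
import subprocess

domain = "example.com"

for rtype in ["A", "AAAA", "MX", "NS", "TXT", "SOA"]:
    # +short limits dig's output to the answer values only
    result = subprocess.run(["dig", "+short", domain, rtype],
                            capture_output=True, text=True, check=False)
    answer = result.stdout.strip() or "(no answer)"
    print(f"{rtype}:\n{answer}\n")
```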
5. Advanced Reconnaissance: Subdomains, Zone Transfers, Virtual Hosts, and Certificate Transparency
5.1 Subdomains
- Definition: Subdomains are extensions of a primary domain (e.g., app.example.com for example.com), used to segment services like email, blogs, or admin portals.
- Reconnaissance Value:
- Development Environments: Subdomains like staging.example.com may be less secure, exposing sensitive data.
- Administrative Portals: Hidden subdomains (e.g., admin.example.com) may host login interfaces.
- Legacy Systems: Forgotten subdomains may run outdated, exploitable software.
- Data Exposure: Misconfigured subdomains may leak configurations or internal documents.
- Enumeration Methods:
- Active Enumeration:
- Directly queries target DNS servers.
- Techniques:
- Zone Transfers: Attempts to retrieve the full zone file (rarely successful due to modern security).
- Brute-Forcing: Tests subdomain names using wordlists.
- Tools: dnsenum, fierce, gobuster.
- Risk: Detectable by security systems.
- Passive Enumeration:
- Leverages external data sources without contacting the target.
- Techniques:
- Certificate Transparency Logs: Public SSL certificate records reveal subdomains.
- Search Engines: Queries like site:*.example.com uncover subdomains.
- DNS Databases: Aggregate historical DNS data.
- Risk: Low detection risk, highly stealthy.
- Best Practice: Combine active and passive techniques for thorough coverage.
5.2 Subdomain Brute-Forcing
- Definition: An active method that tests potential subdomain names against a domain using wordlists to identify valid subdomains.
- Workflow:
- Select Wordlist:
- Generic: Common terms (e.g., dev, mail, admin).
- Targeted: Industry-specific or based on observed naming conventions.
- Custom: Derived from recon data or keywords.
- Query Execution: Tool appends wordlist entries to the domain (e.g., test.example.com).
- DNS Resolution: Verifies if subdomains resolve to IPs via A/AAAA records.
- Validation: Filters results, optionally checks accessibility.
- Tools:
- dnsenum: Multifaceted DNS recon with brute-forcing, zone transfers, and WHOIS lookups.
- fierce: Streamlined for subdomain discovery with wildcard detection.
- dnsrecon, amass, assetfinder, puredns: Specialized for efficient subdomain enumeration.
- Example (dnsenum):
dnsenum --enum example.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt
- Output: Identifies www.example.com, mail.example.com resolving to 93.184.216.34.
- Features: Recursive brute-forcing, DNS record enumeration, leverages SecLists.
- Considerations:
- Generates noticeable DNS traffic; may trigger alerts.
- Use focused wordlists to minimize noise and improve accuracy.
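Under the hood, brute-forcing is just repeated resolution attempts. A minimal threaded sketch using only the standard library; the inline word list is a stand-in for a SecLists file, and real runs should also account for wildcard DNS (which makes every candidate resolve):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

domain = "example.com"
words = ["www", "mail", "dev", "staging", "vpn", "admin"]  # stand-in for a wordlist file

def resolve(sub):
    fqdn = f"{sub}.{domain}"
    try:
        return fqdn, socket.gethostbyname(fqdn)  # A-record lookup via the OS resolver
    except socket.gaierror:
        return None

with ThreadPoolExecutor(max_workers=20) as pool:
    for hit in filter(None, pool.map(resolve, words)):
        print(f"{hit[0]} -> {hit[1]}")
```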
5.3 DNS Zone Transfers
- Definition: A process for syncing DNS records between primary and secondary name servers to ensure consistency.
- Mechanism:
- AXFR Request: Secondary server requests a full zone transfer (AXFR).
- SOA Delivery: Primary sends the Start of Authority record.
- Record Transfer: Sends all records (A, MX, NS, etc.).
- Completion: Primary signals transfer completion.
- Confirmation: Secondary acknowledges receipt.
- Security Risk:
- Misconfigured servers may allow unauthorized AXFR requests, exposing the entire zone file.
- Exposed Data: Subdomains, IPs, mail servers, hosting details, and misconfigurations.
- Historical Note: Common in the early internet; now rare but misconfigurations persist.
- Exploitation Example:
- Tool: dig.
- Command:
dig axfr @nsztm1.digi.ninja zonetransfer.me
- Output (zonetransfer.me):
zonetransfer.me.          7200 IN SOA nsztm1.digi.ninja. robin.digi.ninja. ...
zonetransfer.me.          7200 IN A   5.196.105.14
zonetransfer.me.          7200 IN MX  0 ASPMX.L.GOOGLE.COM.
www.zonetransfer.me.      7200 IN A   5.196.105.14
internal.zonetransfer.me. 7200 IN A   127.0.0.1
...
- Insight: Exposes subdomains (internal.zonetransfer.me), IPs, and mail servers.
- Note: zonetransfer.me is a test domain for educational purposes.
- Mitigation:
- Limit zone transfers to authorized secondary servers.
- Regularly audit DNS configurations.
- Recon Value:
- Provides a complete DNS infrastructure map.
- Uncovers hidden subdomains (e.g., internal or staging servers).
- Even failed transfers reveal configuration details.
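The dig axfr command above can also be reproduced programmatically. A sketch assuming the third-party dnspython package is installed (pip install dnspython), again pointed at the intentionally vulnerable zonetransfer.me test domain:

```python
import socket

import dns.query
import dns.zone

ns_host = "nsztm1.digi.ninja"      # authoritative name server for the test domain
domain = "zonetransfer.me"

ns_ip = socket.gethostbyname(ns_host)  # dns.query.xfr expects an IP address
try:
    zone = dns.zone.from_xfr(dns.query.xfr(ns_ip, domain, timeout=10))
    for name in zone.nodes:
        print(zone[name].to_text(name))
except Exception as exc:
    # Properly configured servers refuse AXFR from unauthorized hosts
    print(f"Zone transfer refused or failed: {exc}")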
5.4 Virtual Hosts (VHosts)
- Definition: Virtual hosts enable multiple websites or domains to operate on a single server or IP, differentiated by the HTTP Host header.
- VHosts vs. Subdomains:
- Subdomains: Managed via DNS records (e.g., blog.example.com).
- VHosts: Server-side configurations, may not have DNS entries.
- Example (Apache):
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>
<VirtualHost *:80>
    ServerName app.example.org
    DocumentRoot /var/www/app
</VirtualHost>
- Operation:
- Browser sends an HTTP request with a Host header (e.g., app.example.com).
- Server matches the Host header to a VHost configuration.
- Serves content from the corresponding document root.
- Accessing Undocumented VHosts:
- Edit the local hosts file (e.g., /etc/hosts or C:\Windows\System32\drivers\etc\hosts).
- Example: 93.184.216.34 hidden.example.com to bypass DNS.
- Virtual Hosting Types:
- Name-Based: Relies on Host header; efficient, widely used, but limited for some protocols (e.g., older SSL).
- IP-Based: Assigns unique IPs per site; flexible but IP-intensive.
- Port-Based: Uses different ports (e.g., :8080); less common due to user inconvenience.
- VHost Discovery (Fuzzing):
- Technique: Tests various hostnames against a server’s IP to identify active VHosts.
- Tool: Gobuster.
- Command:
gobuster vhost -u http://example.com -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-110000.txt --append-domain
- Output: Discovers app.example.com (Status: 200).
- Flags:
- -u: Target URL or IP.
- -w: Wordlist path.
- --append-domain: Adds base domain to queries.
- -t: Adjusts thread count for speed.
- -k: Skips SSL verification.
- -o: Exports results to a file.
- Considerations:
- Generates significant HTTP traffic; may trigger WAF/IDS.
- Requires explicit authorization to avoid legal issues.
- Review results for hidden portals or internal systems.
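Gobuster’s vhost mode boils down to sending the same request with different Host headers and watching for responses that differ from the default site. A minimal standard-library sketch; the IP, base domain, and candidate list are placeholders, and results should be compared against a baseline request to spot real VHosts:

```python
import http.client

target_ip = "93.184.216.34"        # placeholder target IP
base_domain = "example.com"        # placeholder base domain
candidates = ["app", "admin", "staging", "dev"]   # stand-in for a wordlist

for sub in candidates:
    vhost = f"{sub}.{base_domain}"
    conn = http.client.HTTPConnection(target_ip, 80, timeout=5)
    try:
        conn.putrequest("GET", "/", skip_host=True)  # suppress the automatic Host header
        conn.putheader("Host", vhost)                # supply the candidate VHost instead
        conn.endheaders()
        resp = conn.getresponse()
        print(f"{vhost}: {resp.status} ({len(resp.read())} bytes)")
    except OSError as exc:
        print(f"{vhost}: error ({exc})")
    finally:
        conn.close()
```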
5.5 Certificate Transparency (CT) Logs
- Definition: Certificate Transparency logs are public, tamper-proof records of SSL/TLS certificate issuances, maintained by independent entities to ensure transparency.
- Purpose:
- Identify Unauthorized Certificates: Detect rogue or misissued certificates.
- Ensure CA Accountability: Monitor certificate authorities for improper practices.
- Enhance Web Security: Strengthen trust in the Public Key Infrastructure (PKI).
- Reconnaissance Value:
- Exposes subdomains listed in certificate Subject Alternative Name (SAN) fields.
- Reveals historical or expired subdomains (e.g., forgotten development servers).
- Offers reliable subdomain discovery without relying on brute-forcing or wordlists.
- Tools:
- crt.sh:
- Web-based platform and API for querying certificate data.
- Pros: Free, no signup required, intuitive interface.
- Cons: Basic filtering capabilities.
- Censys:
- Comprehensive platform for certificate and device discovery.
- Pros: Rich dataset, API support.
- Cons: Requires account (free tier available).
- Example (crt.sh):
curl -s "https://crt.sh/?q=example.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u
- Output: dev.example.com, secure.dev.example.com, .dev.example.com.
- Breakdown:
- curl: Retrieves JSON data from crt.sh for example.com.
- jq: Filters subdomains containing “dev” in name_value, extracts unique entries.
- sort -u: Sorts and removes duplicates.
- Advantages:
- Passive and stealthy; no direct interaction with the target.
- Uncovers obscure or historical subdomains missed by other methods.
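The curl/jq pipeline above translates directly into a few lines of Python, which is convenient when the results feed further tooling. A sketch using only the standard library; crt.sh’s JSON endpoint and its name_value field are as used in the example above, and rate limits or timeouts may apply:

```python
import json
import urllib.request

domain = "example.com"
url = f"https://crt.sh/?q={domain}&output=json"

with urllib.request.urlopen(url, timeout=30) as resp:
    entries = json.load(resp)

names = set()
for entry in entries:
    # name_value may hold several SAN entries separated by newlines
    names.update(n.strip() for n in entry.get("name_value", "").splitlines())

for name in sorted(names):
    print(name)
```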
Practical Reconnaissance Strategies
- Comprehensive Approach:
- Use zone transfers (if misconfigured) for complete DNS insights, brute-forcing for active subdomain discovery, CT logs for passive enumeration, and VHost fuzzing to identify server-side configurations.
- Stealth Techniques:
- Prioritize passive methods like CT logs and search engine queries to avoid detection.
- For active methods, use targeted wordlists and rate-limited queries to minimize noise.
- Tool Integration:
- Leverage dig for zone transfers, dnsenum or gobuster for brute-forcing, and crt.sh for CT log analysis.
- Automate data extraction with scripts (e.g., curl and jq for CT logs).
- Validation:
- Verify discovered subdomains and VHosts with HTTP requests or manual inspection.
- Investigate anomalies, such as development servers or internal portals.
- Ethical Guidelines:
- Obtain explicit permission before performing active reconnaissance (e.g., brute-forcing, VHost fuzzing, zone transfer attempts).
- Adhere to rate limits to prevent server disruption.
- Continuous Monitoring:
- Regularly query CT logs for newly issued certificates and subdomains.
- Track DNS changes (e.g., via historical zone transfer data, if available).
Key Tools and Commands
- dig (Zone Transfer):
dig axfr @<nameserver> <domain>
- dnsenum (Subdomain Brute-Forcing):
dnsenum --enum <domain> -f <wordlist>
- gobuster (VHost Fuzzing):
gobuster vhost -u http://<IP> -w <wordlist> --append-domain
- crt.sh (CT Logs):
curl -s "https://crt.sh/?q=<domain>&output=json" | jq -r '.[] | .name_value' | sort -u
6. Web Crawling, Fingerprinting, robots.txt, and Well-Known URIs
6.1 Fingerprinting
- Definition: The process of identifying a website’s technical components (e.g., web server, OS, CMS, frameworks) to understand its technology stack and potential vulnerabilities.
- Reconnaissance Value:
- Targeted Exploitation: Enables attacks tailored to specific software versions.
- Misconfiguration Detection: Uncovers outdated software or insecure settings.
- Prioritization: Guides focus toward vulnerable systems.
- Holistic Profiling: Builds a detailed picture of the target’s infrastructure.
- Techniques:
- Banner Grabbing: Extracts software/version details from server responses.
- HTTP Header Analysis: Inspects headers like Server or X-Powered-By.
- Response Probing: Sends crafted requests to trigger unique responses.
- Content Inspection: Analyzes page structure, scripts, or metadata for clues.
- Tools:
- Wappalyzer: Browser extension for detecting CMS and frameworks.
- BuiltWith: Comprehensive tech stack analysis (free/paid tiers).
- WhatWeb: CLI tool for fingerprinting web technologies.
- Nmap: Network scanner with NSE scripts for OS and service detection.
- Netcraft: Provides tech, hosting, and security insights.
- wafw00f: Identifies Web Application Firewalls (WAFs).
- Example (example.com):
- Banner Grabbing (curl):
curl -I example.com
- Output: Server: nginx/1.14.2, redirects to HTTPS, X-Powered-By: PHP/7.4.
- Insight: Reveals nginx server, PHP backend.
- wafw00f:
pip3 install wafw00f
wafw00f example.com
- Output: Detects Cloudflare WAF.
- Implication: Indicates robust security; adjust recon to bypass WAF restrictions.
- Nikto:
nikto -h example.com -Tuning b
- Output:
- IP: 93.184.216.34.
- Server: nginx/1.14.2 (potentially outdated).
- CMS: Drupal detected via /CHANGELOG.txt.
- Headers: Missing Content-Security-Policy.
- Issues: Potential Drupal vulnerabilities, insecure headers.
- Considerations:
- WAFs may block aggressive probes; use subtle techniques.
- Combine fingerprinting with crawling for contextual insights.
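Header-based fingerprinting is easy to script across many hosts. A minimal standard-library sketch using a HEAD request; the header names checked are common examples, not an exhaustive list, and hardened servers may strip them:

```python
import urllib.request

url = "http://example.com"   # placeholder target
req = urllib.request.Request(url, method="HEAD",
                             headers={"User-Agent": "recon-sketch/0.1"})

with urllib.request.urlopen(req, timeout=10) as resp:
    for header in ("Server", "X-Powered-By", "X-Generator", "Set-Cookie"):
        value = resp.headers.get(header)
        if value:
            print(f"{header}: {value}")
    # urlopen follows redirects, so the final URL exposes HTTP->HTTPS upgrades
    print("Final URL:", resp.geturl())
```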
6.2 Web Crawling
- Definition: An automated process (spidering) that systematically navigates a website by following links to collect data like pages, files, and metadata.
- Process:
- Begin with a seed URL (e.g., homepage).
- Fetch and parse the page, extracting links.
- Queue and crawl links iteratively.
- Crawling Approaches:
- Breadth-First: Explores all links on a page before diving deeper; ideal for mapping site structure.
- Depth-First: Pursues one link path deeply; suited for targeting specific content.
- Collected Data:
- Links: Internal (site hierarchy) and external (third-party connections).
- Comments: May reveal sensitive details (e.g., developer notes, software versions).
- Metadata: Includes titles, descriptions, keywords, or authorship.
- Sensitive Files: Configs (config.php), backups (.bak), logs (access.log), or credentials.
- Reconnaissance Value:
- Maps site architecture and uncovers hidden pages.
- Identifies exploitable files or comments.
- Enables contextual analysis (e.g., linking comments to exposed directories).
- Example:
- Crawling reveals /backups/ with directory listing enabled, exposing database.sql.
- A comment referencing “legacy API” combined with /api/ discovery suggests outdated endpoints.
- Considerations:
- Analyze findings holistically to connect data points.
- Avoid server overload by limiting request rates.
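A breadth-first crawler of the kind described above fits in a short script. A minimal standard-library sketch limited to the seed’s host and a small page budget; it omits robots.txt handling and rate limiting, which a real crawl should add:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.request

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seed, max_pages=20):
    host = urlparse(seed).netloc
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode(errors="replace")
        except Exception:
            continue
        print(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same host and visit each page once (breadth-first)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl("http://example.com")
```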
6.3 Web Crawling Tools
- Purpose: Automate crawling to streamline data collection and focus on analysis.
- Key Tools:
- Burp Suite Spider: Active crawler for mapping web applications and identifying vulnerabilities.
- OWASP ZAP: Open-source scanner with a spider for manual or automated vulnerability discovery.
- Scrapy: Python framework for building custom crawlers tailored to specific needs.
- Apache Nutch: Scalable Java crawler for large or focused crawls; requires configuration expertise.
- Scrapy Example (example.com):
- Setup:
pip3 install scrapy
- Custom Spider:
scrapy crawl recon -a url=http://example.com -o results.json
- Output (results.json):
- emails: info@example.com, support@example.com.
- links: Internal (/about), external (cdn.example.net).
- external_files: report.pdf.
- js_files: main.js, vendor.js.
- form_fields, images, videos, audio, comments (e.g., <!-- debug mode -->).
- Data Structure:
| Key | Description |
| --- | --- |
| emails | Email addresses on the site |
| links | Internal/external URLs |
| external_files | Downloadable files (e.g., PDFs) |
| js_files | JavaScript files |
| form_fields | Form input fields |
| images | Image URLs |
| videos | Video URLs |
| audio | Audio URLs |
| comments | HTML comments |
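The scrapy crawl recon command assumes a spider named recon exists inside a Scrapy project. A minimal stand-in for such a spider, with field names mirroring part of the output table above (place it in the project’s spiders/ directory):

```python
import re
import scrapy

class ReconSpider(scrapy.Spider):
    name = "recon"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url] if url else []

    def parse(self, response):
        yield {
            "url": response.url,
            "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", response.text),
            "links": response.css("a::attr(href)").getall(),
            "js_files": response.css("script::attr(src)").getall(),
            "comments": response.xpath("//comment()").getall(),
        }
        # Follow discovered links so the crawl continues beyond the seed page
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run as shown above with -o results.json to get one JSON record per crawled page.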
- Ethical Considerations:
- Secure permission before crawling.
- Respect server limits to avoid disruption.
- Reconnaissance Value:
- Provides structured data for mapping site functionality.
- Highlights entry points like forms or sensitive files.
6.4 robots.txt
- Definition: A text file located at a website’s root (e.g., example.com/robots.txt) that adheres to the Robots Exclusion Standard, instructing crawlers on allowed or restricted paths.
- Format:
- User-agent: Specifies bots (e.g., * for all, Bingbot for Bing).
- Directives:
- Disallow: Blocks paths (e.g., /private/).
- Allow: Permits paths (e.g., /public/).
- Crawl-delay: Sets delay between requests (e.g., Crawl-delay: 5).
- Sitemap: Links to sitemap (e.g., Sitemap: https://example.com/sitemap.xml).
- Example:
User-agent: *
Disallow: /admin/
Disallow: /internal/
Allow: /blog/
User-agent: Googlebot
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
- Insight: Suggests /admin/ and /internal/ may contain sensitive content.
- Purpose of robots.txt:
- Prevents server overload from excessive crawling.
- Protects sensitive areas from search engine indexing.
- Ensures compliance with site policies.
- Reconnaissance Value:
- Hidden Paths: Disallow entries (e.g., /admin/) hint at sensitive directories.
- Site Layout: Allowed/disallowed paths reveal structure.
- Security Awareness: Traps or honeypot paths indicate defensive measures.
- Considerations:
- Respect robots.txt during ethical reconnaissance.
- Manually explore Disallow paths for potential insights.
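Pulling the Disallow entries out of robots.txt is a one-screen script. A standard-library sketch with naive line parsing (it ignores User-agent scoping and wildcards):

```python
import urllib.request
from urllib.parse import urljoin

base = "https://example.com"   # placeholder target

with urllib.request.urlopen(urljoin(base, "/robots.txt"), timeout=10) as resp:
    body = resp.read().decode(errors="replace")

for line in body.splitlines():
    line = line.split("#", 1)[0].strip()        # drop comments
    if line.lower().startswith("disallow:"):
        path = line.split(":", 1)[1].strip()
        if path:
            print(urljoin(base, path))           # candidate paths for manual review
```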
6.5 Well-Known URIs
- Definition: A standardized directory (/.well-known/) defined by RFC 8615 and hosted at a site’s root, containing metadata, configurations, and service details; the registry of well-known URIs is maintained by IANA.
- Common URIs:
- security.txt (RFC 9116): Provides security contact information.
- change-password: Points to the password reset page.
- openid-configuration: Supplies OpenID Connect metadata.
- assetlinks.json: Verifies app or asset ownership.
- mta-sts.txt: Defines email security policies (MTA-STS).
- OpenID Connect Example:
- URL: https://example.com/.well-known/openid-configuration.
- JSON Output:
- Endpoints for authorization, token issuance, and user info.
- jwks_uri for cryptographic keys.
- Supported scopes, response types, and algorithms.
- Recon Value:
- Maps authentication infrastructure.
- Exposes security configurations (e.g., signing algorithms).
- Reconnaissance Value:
- Reveals critical endpoints and configurations.
- Provides structured metadata for understanding site functionality.
- Methodology:
- Consult IANA’s well-known URI registry.
- Probe paths like curl https://example.com/.well-known/security.txt.
- Considerations:
- Passive method with minimal detection risk.
- Combine with crawling to map site features comprehensively.
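Probing a handful of registered well-known URIs can be automated in the same spirit as the curl command above. A standard-library sketch; the path list is a small sample, not the full IANA registry:

```python
import urllib.error
import urllib.request

base = "https://example.com"   # placeholder target
paths = ["security.txt", "change-password", "openid-configuration",
         "assetlinks.json", "mta-sts.txt"]

for p in paths:
    url = f"{base}/.well-known/{p}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"[{resp.status}] {url} ({resp.headers.get('Content-Type')})")
    except urllib.error.HTTPError as err:
        print(f"[{err.code}] {url}")
    except urllib.error.URLError as err:
        print(f"[err] {url}: {err.reason}")
```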
Practical Strategies for Crawling and Analysis
- Integrated Workflow:
- Fingerprinting: Identify technologies (e.g., nginx, Drupal) to prioritize vulnerability research.
- Crawling: Map site structure and extract links, files, or comments using tools like Scrapy.
- robots.txt: Investigate Disallow paths (e.g., /internal/) for sensitive content.
- Well-Known URIs: Check /.well-known/ for security or authentication details.
- Stealth Techniques:
- Focus on passive methods (e.g., robots.txt, well-known URIs) to avoid detection.
- Limit crawl intensity and respect Crawl-delay directives.
- Tool Synergy:
- curl: Fetch headers (curl -I) for quick fingerprinting.
- Scrapy: Automate structured data collection.
- curl: Fetch headers (
- Validation:
- Manually verify sensitive paths from robots.txt or crawling results.
- Test well-known URIs to confirm active endpoints.
- Ethical Guidelines:
- Secure explicit authorization for active reconnaissance (e.g., crawling, fingerprinting).
- Avoid excessive requests to respect server resources.
- Contextual Analysis:
- Combine findings (e.g., Drupal from fingerprinting, /backups/ from crawling, /admin/ from robots.txt) to uncover exploitable weaknesses.
Key Tools and Commands
- curl (Banner Grabbing):
curl -I https://example.com
- wafw00f (WAF Detection):
wafw00f example.com
- Nikto (Fingerprinting):
nikto -h example.com -Tuning b
- Scrapy (Crawling):
scrapy crawl recon -a url=http://example.com -o results.json
- robots.txt Check:
curl https://example.com/robots.txt
- Well-Known URIs:
curl https://example.com/.well-known/security.txt
7. Search Engine Discovery, Web Archives, and Automation
7.1 Search Engine Discovery
- Definition: Using search engines for Open Source Intelligence (OSINT) to collect data on targets (e.g., websites, organizations) through advanced query techniques.
- Reconnaissance Value:
- Accessibility: Public, legal, and cost-free.
- Broad Coverage: Indexes extensive web content.
- Simplicity: Requires minimal technical expertise.
- Applications:
- Security Assessments: Identify exposed data, vulnerabilities, or entry points.
- Competitive Analysis: Gather insights on competitors’ strategies or technologies.
- Investigations: Uncover hidden relationships or activities.
- Threat Intelligence: Monitor malicious actors and predict attack patterns.
- Limitations:
- Incomplete indexing of web content.
- Restricted access to protected or unindexed data.
- Search Operators:
| Operator | Description | Example | Use Case |
| --- | --- | --- | --- |
| site: | Restricts to a domain | site:example.com | Map all pages on a domain |
| inurl: | Searches URL for term | inurl:admin | Find admin panels |
| filetype: | Targets file type | filetype:pdf | Locate documents |
| intitle: | Searches page title | intitle:"login portal" | Find login pages |
| intext: | Searches page content | intext:"confidential" | Find sensitive content |
| cache: | Views cached page | cache:example.com | Access past content |
| link: | Finds linking pages | link:example.com | Discover external links |
| related: | Finds similar sites | related:example.com | Identify comparable sites |
| numrange: | Searches number range | site:example.com numrange:2020-2025 | Find pages with specific numbers |
| allintext: | All terms in content | allintext:admin password | Precise content search |
| allinurl: | All terms in URL | allinurl:login panel | URLs with multiple terms |
| AND, OR, NOT | Logical operators | site:example.com NOT inurl:blog | Refine queries |
| * | Wildcard | site:example.com user*guide | Match variations |
| .. | Range search | site:example.com "price" 100..500 | Find price ranges |
| "" | Exact phrase | "security policy" | Exact matches |
| - | Excludes term | site:example.com -inurl:signup | Exclude irrelevant pages |
- Google Dorking:
- Advanced queries to uncover sensitive data or vulnerabilities.
- Examples:
- Login Pages: site:example.com inurl:(login | dashboard)
- Exposed Files: site:example.com filetype:(pdf | xlsx)
- Config Files: site:example.com inurl:config
- Backups: site:example.com filetype:bak
- Resource: Exploit-DB’s Google Hacking Database for curated dorks.
- Considerations:
- Passive method with low detection risk.
- Combine with web archives or crawling for deeper insights.
- Manually verify results to filter false positives.
7.2 Web Archives
- Definition: Repositories like the Internet Archive’s Wayback Machine that preserve historical snapshots of websites, capturing content, design, and functionality.
- Wayback Machine Mechanics:
- Crawling: Bots capture webpages, including HTML, CSS, JavaScript, and media.
- Storage: Snapshots are timestamped and archived.
- Retrieval: Users access snapshots by URL and date.
- Snapshot Frequency:
- Varies by site popularity and archive resources.
- High-traffic sites: Frequent snapshots (e.g., daily).
- Niche sites: Infrequent snapshots (e.g., yearly).
- Limitations:
- Incomplete capture of dynamic or restricted content.
- Site owners may request exclusion (not always enforced).
- Reconnaissance Value:
- Hidden Assets: Exposes old subdomains, directories, or files.
- Change Analysis: Tracks site evolution (e.g., tech upgrades, design changes).
- OSINT Insights: Reveals past strategies, personnel, or technologies.
- Stealth: Passive with no target interaction.
- Example (example.com):
- Access Wayback Machine, enter example.com, select a 2018 snapshot.
- Insight: Reveals discontinued /forum/ subdirectory or outdated CMS.
- Considerations:
- Analyze snapshots for forgotten assets or vulnerabilities.
- Compare historical and current data to identify changes.
- Cross-reference with search engine results for anomalies.
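Snapshot history can also be pulled programmatically through the Wayback Machine’s CDX API rather than the web UI. A standard-library sketch; the parameters and field names follow the public CDX documentation, and exact limits or fields may change:

```python
import json
import urllib.parse
import urllib.request

domain = "example.com"
params = urllib.parse.urlencode({
    "url": f"{domain}/*",
    "output": "json",
    "collapse": "urlkey",                  # one row per unique URL
    "fl": "timestamp,original,statuscode",
    "limit": "50",
})

with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}",
                            timeout=30) as resp:
    rows = json.load(resp)

# The first row is the field-name header
for timestamp, original, status in rows[1:]:
    print(timestamp, status, original)
```

Archived URLs that no longer resolve on the live site (e.g., a forgotten /forum/ path) are good candidates for follow-up.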
7.3 Automating Reconnaissance
- Definition: Employing tools and frameworks to automate repetitive reconnaissance tasks for efficiency and consistency.
- Benefits:
- Speed: Accelerates data collection.
- Scalability: Supports multiple targets or domains.
- Accuracy: Minimizes human error.
- Versatility: Covers DNS, subdomains, crawling, and scanning.
- Integration: Combines with other tools for streamlined workflows.
- Key Frameworks:
- FinalRecon: Python tool for headers, WHOIS, SSL, crawling, DNS, subdomains, and web archives.
- Recon-ng: Modular framework for DNS, subdomains, crawling, and exploit discovery.
- theHarvester: Collects emails, subdomains, and hosts from public sources.
- SpiderFoot: OSINT tool for domains, emails, social media, and scanning.
- OSINT Framework: Curated toolset for search engines, social media, and public records.
- FinalRecon Example (example.com):
- Setup:
git clone https://github.com/thewhiteh4t/FinalRecon.git
cd FinalRecon
pip3 install -r requirements.txt
chmod +x finalrecon.py
- Command:
./finalrecon.py --headers --whois --url http://example.com
- Output:
- Headers: Server: nginx/1.14.2, Content-Type: text/html.
- WHOIS:
- Domain: example.com.
- Registrar: Example Registrar.
- Creation: 1995-08-13.
- Expiry: 2026-08-12.
- Name Servers: ns1.example.com, ns2.example.com.
- Export: Results saved to ~/.local/share/finalrecon/dumps/.
- Options:
- --headers: Fetches HTTP headers.
- --whois: Performs WHOIS lookup.
- --sslinfo: Analyzes SSL certificates.
- --crawl: Crawls the site.
- --dns: Enumerates DNS records.
- --sub: Discovers subdomains.
- --dir: Scans directories.
- --wayback: Queries Wayback Machine.
- --ps: Conducts port scanning.
- --full: Runs all modules.
- Additional: -w (wordlist), -e (file extensions), -o (output format).
- Considerations:
- Active methods (e.g., scanning, crawling) may trigger detection; use cautiously.
- Obtain authorization to ensure legal and ethical compliance.
- Tailor modules to the target’s context for optimal results.
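Frameworks like FinalRecon essentially chain the individual techniques covered earlier and serialize the results. A toy orchestrator sketch combining three passive-leaning checks (the system whois client, a single HTTP request, and crt.sh); it is illustrative only, not a substitute for the frameworks above:

```python
import json
import subprocess
import urllib.request

target = "example.com"   # placeholder target
report = {"target": target}

# HTTP response headers from a single request
try:
    with urllib.request.urlopen(f"http://{target}", timeout=10) as resp:
        report["headers"] = dict(resp.headers)
except Exception as exc:
    report["headers"] = {"error": str(exc)}

# WHOIS via the system client
report["whois"] = subprocess.run(["whois", target],
                                 capture_output=True, text=True).stdout

# Certificate transparency names from crt.sh
try:
    with urllib.request.urlopen(f"https://crt.sh/?q={target}&output=json",
                                timeout=30) as resp:
        report["ct_names"] = sorted({e["name_value"] for e in json.load(resp)})
except Exception as exc:
    report["ct_names"] = [f"error: {exc}"]

with open("recon_report.json", "w") as fh:
    json.dump(report, fh, indent=2)
print("Wrote recon_report.json")
```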
Practical Strategies for Search and Automation
- Integrated Workflow:
- Search Engine Discovery: Use Google Dorks to identify login pages or exposed files; validate findings manually.
- Web Archives: Query Wayback Machine for historical subdomains or technologies; compare with current data.
- Automation: Leverage FinalRecon for broad reconnaissance, supplemented by targeted tools like Nikto or Scrapy.
- Stealth Techniques:
- Emphasize passive methods (e.g., search engines, web archives) to minimize detection.
- Apply rate-limiting to automated scans to respect server limits.
- Tool Synergy:
- Google: site:example.com filetype:pdf to find documents.
- Wayback Machine: Access via archive.org for historical snapshots.
- FinalRecon: Use --sub, --crawl, --wayback for comprehensive data collection.
- Validation:
- Verify dork results (e.g., config files) through manual checks.
- Confirm Wayback findings against the live site for relevance.
- Review automated outputs for actionable vulnerabilities.
- Ethical Guidelines:
- Secure explicit permission for active reconnaissance (e.g., FinalRecon scans).
- Adhere to robots.txt and site terms during crawling.
- Contextual Analysis:
- Combine dork findings (e.g., /dashboard/ from inurl:dashboard), Wayback data (e.g., old CMS), and FinalRecon headers (e.g., nginx) to uncover exploitable insights.