There are some use cases and scenarios where you need to retrieve
SharePoint site collection information quickly without performance suffer
through object model that will iterate all sub-sites and lists. One of the use
cases is to evaluate third party search crawling index process accuracy on the
site collection. We would like to get the accurate list of all documents exclude
system ones in a site collection to compare the crawling result. The key here is to exclude system documents.
There are many different ways including using server object model,
client object model, PowerShell, search API, SOAP/REST web service, or 3rd
party tools. Another way, and a really quick one, is querying directly the SP
content db tables or views. Since reading
from a SharePoint DB (with the
exception of the logging database) might
leave it in an unsupported state indicated in KB article, I would suggest doing
this in non-production environment. You could also confirm the script with
Microsoft before run on production.
If you read the SP content db schema, you might find several SQL scripts to get list of documents from SharePoint database. Some of them are from SP 2007 that will no longer work and some of them will list many system files/documents. I modified the SQL script and added several enhancements so we could get accurate list of user created documents exclude system ones from SharePoint database without blocking other queries. Here are some tips.
- Use Docs view instead of AllDocs table to simplify query
- Join Docs view with UserData view to exclude system files/documents
- All (NOLOCK) hint to both Docs and UserData views to prevent performance degradation
- Filter out file type to exclude system like files. Example is LeafName LIKE '%.stp' to exclude solution files
- Filter out directory type to exclude system like files. Example is filesDirName LIKE '%_catalogs%' as branding files
SELECT
distinct DirName AS Directory,
UserData.nvarchar7 AS Title,
LeafName AS DocName,
Docs.Size AS Size,
Docs.TimeCreated AS Time
FROM Docs AS Docs WITH (NOLOCK) INNER JOIN UserData(NOLOCK) as UserData ON (Docs.DoclibRowId = UserData.tp_ID) AND Docs.ListId = UserData.tp_ListId
WHERE (DoclibRowId > 0)
AND Docs.ListID = UserData.tp_ListID
AND Docs.DoclibRowId = UserData.tp_ID
AND Docs.Type <> 1
AND (LeafName NOT LIKE '%.stp')
AND (LeafName NOT LIKE '%.aspx')
AND (LeafName NOT LIKE '%.xfp')
AND (LeafName NOT LIKE '%.dwp')
AND (LeafName NOT LIKE '%template%')
AND (LeafName NOT LIKE '%.inf')
AND (LeafName NOT LIKE '%.css')
AND (LeafName NOT LIKE '%.master')
AND (LeafName NOT LIKE '%.xml')
AND (Docs.Size > 1)
AND (DirName NOT LIKE '%_catalogs%')
AND (DirName NOT LIKE 'personal%')
AND (DirName NOT LIKE '%Style%')
AND (DirName NOT LIKE '%SiteCollectionImages')
AND (DirName NOT LIKE 'mysites%')
AND (DirName NOT LIKE '%Publishing%')
AND (DirName NOT LIKE '%OntolicaStyles%')
AND (DirName NOT LIKE '%Data Connections for PerformancePoint%')
AND (DirName NOT LIKE '%ProjectBICenter/Templates%')
AND (DirName NOT LIKE '%Data Connections%')
AND (DirName NOT LIKE '%Reporting Templates%')
AND (DirName NOT LIKE '%FormServerTemplates%')
AND (DirName NOT LIKE '%ProjectBICenter/Sample Reports%')
AND (DirName NOT LIKE '%Pages%')
ORDER BY Directory
There are some other useful SQL query on users on the site collection and list data you could play around. Please remember to use these on non-production environment and consultant with Microsoft support before execute on production.