CC technical blog

New CreativeCommons.org launched 2023 September 2024-05-28 ['sara', 'shafiya', 'TimidRobot']
<p>Creative Commons (CC) launched a new <a href="https://creativecommons.org/">CreativeCommons.org</a> website on 2023 September 27th. This relaunch included not just the website, but the entire technology stack (platform, server, and website components).</p>
<h2 id="improved-platform">Improved platform</h2><p>The new website is hosted on AWS. This allowed us to design a more secure network architecture between services and deploy/manage the services using infrastructure as code.</p>
<h2 id="improved-services">Improved services</h2><p>The services running the website were simplified and updated. The number of distinct servers was reduced from six down to two. Previously, loading the homepage required five services (HAProxy, Varnish, Apache2, PHP+FPM, and MariaDB). The complexity of the old services made troubleshooting more difficult. They were designed before Cloudflare began supporting us through <a href="https://www.cloudflare.com/galileo/">Project Galileo</a>. The new website requires only two services (Apache2 and MariaDB).</p>
<h2 id="improved-website-components">Improved website components</h2><h3 id="vocabulary">Vocabulary</h3><p>The website consists of a variety of components that use the Vocabulary design system (<a href="https://github.com/creativecommons/vocabulary">creativecommons/vocabulary</a>) to present a unified user experience. This relaunch was the first implementation of the new Vocabulary. It has returned to web core principles, favoring semantic HTML and appropriately scoped CSS styling. It keeps the style layer responsibilities firmly within the CSS, rather than utilizing a framework like Bootstrap to add a myriad of style-based classes to the HTML layer. Furthermore, JavaScript use has been kept incredibly minimal, reserved for behavior that can't already be accomplished via HTML and/or CSS, letting HTML and CSS do what they do best. This simplicity improves performance and also lowers barriers for community contributions.</p> <p>Accessibility was a priority: making the code more semantic already helps, but we went further, ensuring that all the affordances you get from HTML aren't blocked or altered via opinionated (and often non-standard) frameworks. The site performs better generally, and is much kinder to slower connection speeds.</p> <p>The new implementation of Vocabulary includes a new Information Architecture and a more stable UX approach for better visitor experiences. CC-licensed media is one of our strengths, and as such it was important to allow proper attribution to be baked into every instance of media rendering within the design. This means that while the image or video may be important to the flow of content, its attribution also gets a level of appropriate importance, highlighting ways in which others might handle attribution and following through on our own mission in the pursuit of better sharing at large.</p>
<h3 id="wordpress">WordPress</h3><p>The project utilizes a custom WordPress theme (<a href="https://github.com/creativecommons/vocabulary-theme">creativecommons/vocabulary-theme</a>) that implements the new Vocabulary design system.</p> <p>The theme utilizes the WordPress Classic Editor because of its long-term stability and more consistent UX.
Gutenberg still does not adhere to adequate Accessibility approaches, nor does it have a sense of stable feature-completeness, which creates an unreliable landscape to build upon. Gutenberg also requires one to build Block composition through React.js to accomplish tasks that are far easier and more approachable with the standard PHP templates the Classic Editor is compatible with. Sticking with those templates dramatically improves the ability of a new contributor to help, and speeds up the development process.</p> <p>To allow a degree of more varied page composition, Advanced Custom Fields was used to more easily add, update, and version control custom fields across pages and page templates. This strikes a balance, allowing more complex page composition within a more controllable set of circumstances.</p> <p>Plugins in general were cut dramatically. The legacy site contained 20 active plugins, while this project relies on fewer than half, at 9, with hopeful pathways to eventually cut that number even further.</p> <p>The site utilizes several custom content types and better taxonomies to split up the UX flow of varied kinds of content creation, allowing for smoother multi-author attribution, site-wide notices for fundraising and event announcements, and better blog post organization and way-finding overall.</p>
<h3 id="cc-legal-tools">CC Legal Tools</h3><p>With the deployment of our new website, we also replaced the legacy ccEngine with the new CC Legal Tools. The current legal tool landscape is refreshingly simple, with only seven tools (CC BY 4.0, CC BY-NC 4.0, CC BY-NC-ND 4.0, CC BY-NC-SA 4.0, CC BY-ND 4.0, CC BY-SA 4.0, CC0 1.0). However, since previous versions of the licenses were adapted to specific jurisdictions (ported) and we collaborate with the community to support many translations, the new CC Legal Tools app manages over 30,000 documents!</p> <p>The project to rewrite the CC Legal Tools and replace the legacy ccEngine began in 2020 with a request for proposals (<a href="https://docs.google.com/document/d/1mlgmjDorTEwgIRRrvILK3v0pTJbGx8fB5SE1yplrz3Y/edit">RFP: License Infrastructure - Google Docs</a>). The <a href="https://www.caktusgroup.com/">Caktus Group</a> began the new CC Legal Tools using the Django Python web framework. The work was continued by Timid Robot. Saurabh helped with RDF/XML generation (<a href="/blog/entries/2023-08-25-machine-layer/">CC Legal Tools: Machine-Readable Layer - Creative Commons Open Source</a>).</p> <p>The new CC Legal Tools consist of two repositories:</p> <ol> <li><a href="https://github.com/creativecommons/cc-legal-tools-app">creativecommons/cc-legal-tools-app</a>: <em>Static site generator using Django</em></li> <li><a href="https://github.com/creativecommons/cc-legal-tools-data">creativecommons/cc-legal-tools-data</a>: <em>Inputs and outputs of the application</em></li> </ol> <p>The legacy ccEngine consists of around 15,960 lines of Python 2. It was developed and extended organically over time, resulting in a less coherent codebase. The new CC Legal Tools has the benefit of hindsight and was architected as a single application to meet all of CC's current requirements. It consists of around 17,400 lines of Python 3 (including around 4,000 lines of tests).
Benefits of the new CC Legal Tools include:</p> <ul> <li>Currently supported software (Python 3, Django 4.2, etc.)</li> <li>Simplified data model</li> <li>Improved translation handling</li> <li>Improved RDF/XML generation/management</li> </ul> <p>In particular, the fact that the new CC Legal Tools generate static assets is noteworthy. Static assets can be hosted performantly with a very simple service setup.</p>
<h3 id="chooser">Chooser</h3><p>The new chooser beta (<a href="https://github.com/creativecommons/chooser">creativecommons/chooser</a>) was promoted to production with the new header and footer from the Vocabulary design system for a more uniform user experience.</p>
<h3 id="faq-platform-toolkit">FAQ &amp; Platform Toolkit</h3><p>The FAQ (<a href="https://github.com/creativecommons/faq">creativecommons/faq</a>) and Platform Toolkit (<a href="https://github.com/creativecommons/mp">creativecommons/mp</a>) were updated to use the new header and footer from the Vocabulary design system for a more uniform user experience.</p>
<h2 id="improved-development">Improved development</h2><p>Utilizing infrastructure as code, we now have a much more robust staging environment. This allows us to preview larger changes so that they can be deployed to production with minimum risk. We also improved our local development environment and content synchronization tooling (<a href="https://github.com/creativecommons/index-dev-env">creativecommons/index-dev-env</a>). This means that not only did we fix many old bugs, but when new bugs are identified, we can fix them more rapidly!</p>
<h2 id="thank-you">Thank you</h2><p>Thank you to the people who directly contributed to the success of the new website!</p> <ul> <li>Nate, former Director of Communications &amp; Community</li> <li>Sara, Full Stack Engineer</li> <li>Shafiya, Systems Engineer</li> <li>Timid Robot, Director of Technology</li> <li><em>as well as many other previous staff, community contributors, and other <a href="/community/supporters/">supporters</a>!</em></li> </ul>

CC Legal Tools: Machine-Readable Layer 2023-08-28 ['saurabh']
<p>Greetings, readers! I'm excited to share that as part of Google Summer of Code (GSoC) 2023, I had the incredible opportunity to contribute to the exciting project "CC Legal Tools: Machine-Readable Layer." This journey has been a remarkable blend of learning, coding, and collaboration, and I'm thrilled to share the highlights of this journey with you all.</p> <p><img src="/blog/entries/2023-08-25-machine-layer/gsoc2023cc.png" alt="GSoC 2023 and CC"></p>
<h2 id="project-overview">Project Overview</h2><p>The project's core focus was to enhance the Creative Commons (CC) <a href="https://github.com/creativecommons/cc-legal-tools-app">Legal Tools app</a> by introducing a robust machine-readable layer. The machine-readable layer enables computers to understand the intricacies of CC licenses, making it easier for legal professionals, developers, and enthusiasts to work with CC licenses programmatically.</p>
<h2 id="getting-started">Getting Started</h2><p>My journey began with delving into the existing codebase and understanding the project's requirements; understanding the app's architecture, its components, and how it currently handled CC licenses was crucial for what lay ahead.</p> <p>RDF, or Resource Description Framework, emerged as a crucial player in the project.</p>
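<p>To make that concrete, here is a minimal sketch (not the actual cc-legal-tools-app code) of how a CC license can be described as RDF triples with Python's rdflib, using terms from the CC REL vocabulary:</p>
<pre><code># Minimal illustrative sketch: describing CC BY 4.0 as RDF triples
# with rdflib and the CC REL vocabulary (not the app's actual code).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

CC = Namespace("http://creativecommons.org/ns#")

g = Graph()
g.bind("cc", CC)
g.bind("dcterms", DCTERMS)

license_uri = URIRef("https://creativecommons.org/licenses/by/4.0/")
g.add((license_uri, RDF.type, CC.License))
g.add((license_uri, DCTERMS.title, Literal("Attribution 4.0 International", lang="en")))
g.add((license_uri, CC.permits, CC.Reproduction))
g.add((license_uri, CC.permits, CC.Distribution))
g.add((license_uri, CC.permits, CC.DerivativeWorks))
g.add((license_uri, CC.requires, CC.Attribution))
g.add((license_uri, CC.requires, CC.Notice))

# Serialize the graph as RDF/XML, the machine-readable format
# discussed in this post.
print(g.serialize(format="pretty-xml"))
</code></pre>
<p>The real generated files carry more metadata (jurisdictions, versions, translations), but the triple structure above is the core idea.</p>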
<p>Grasping the intricacies of RDF and its role in representing licenses was a necessary step in the journey.</p>
<h2 id="challenges-and-learning-opportunities">Challenges and Learning Opportunities</h2><p>One of my early challenges was unraveling the complexities of the legacy RDF/XML files. How did they differ from the new RDF/XML files we aimed to generate? This exploration led me to discover improvements in structure, updated license information, and additional metadata.</p> <p>Generating RDF files for various licenses and versions became a puzzle to solve. Crafting RDF triples, understanding licensing nuances, and weaving this logic into the app's views became both a learning opportunity and a rewarding challenge.</p>
<h2 id="contributions-and-the-work">Contributions and The Work</h2><p>As the project evolved, I worked to dynamically generate RDF/XML files, allowing the app to generate machine-readable licenses on-the-fly.</p> <p>To maintain an organized approach, the generated RDF files are kept sorted; all the credit for that goes to <a href="/blog/authors/TimidRobot/">Timid Robot</a>.</p> <p>The newly generated RDF/XML aims to enhance the clarity, accuracy, compatibility, and standardization of Creative Commons license representation in RDF format. These improvements boost machine-readability and semantic understanding, fostering seamless integration and interpretation in digital systems.</p>
<h2 id="overview-of-changes">Overview of changes:</h2><ul> <li><strong>Improved Structure and Consistency:</strong><ul> <li>The new RDF/XML boasts a more organized, standardized structure, aligning with RDF standards. This enhances machine comprehension and accurate data processing.</li> </ul> </li> <li><strong>Updated License Information</strong>:<ul> <li>License information has been updated to reflect the latest permissions and restrictions. This ensures users and systems are informed accurately.</li> </ul> </li> <li><strong>Alignment with RDF Best Practices</strong>:<ul> <li>Changes align the representation with RDF best practices. This boosts interoperability and compatibility, thanks to standardized namespaces, consistent naming, and proper relationship definitions.</li> </ul> </li> </ul> <p>Throughout the journey, I had the privilege of working closely with my mentor, engaging in collaborative discussions and receiving insightful code reviews.</p> <p>As my GSoC journey draws to a close, I'm excited about the foundation we've laid for the CC Legal Tools app. The machine-readable layer opens doors to a future of smarter, automated legal processes.</p> <p>The improvements made during GSoC will continue to ripple through the CC Legal Tools app, benefiting users and the broader open-source community.</p>
<h2 id="mentor-and-support">Mentor and Support</h2><p>A heartfelt thanks to my mentor, <a href="/blog/authors/TimidRobot/">Timid Robot</a>, for guiding me through this incredible journey. Your unwavering support, wisdom, feedback, and willingness to share knowledge have truly been invaluable. I'm deeply grateful for the opportunity to learn and grow under your mentorship. Thank you for making this journey unforgettable.</p>
<h2 id="takeaways-and-conclusion">Takeaways and Conclusion</h2><p>GSoC became a platform for me to acquire new skills, dive into complex concepts, and broaden my horizons. The learning experience was immersive and transformative.</p> <p>Being part of the open-source community was a revelation.
Interacting with like-minded individuals, contributing to a shared goal, and experiencing the true essence of collaboration was a highlight.</p> <p>My GSoC journey has been a remarkable adventure of exploration, discovery, and growth. The project's mission to create a machine-readable layer for CC Legal Tools has left an indelible mark on my journey as a developer.</p> <p>Thank you for joining me on this expedition. Here's to the future of open-source contributions and the endless possibilities they hold.</p> <p>Cheers,</p> <p>Saurabh Kumar</p>

New Chapter of My Professional Life 2023-06-20 ['shafiya']
<p>Greetings, readers! I'm Shafiya Heena, from Hyderabad, India, now immersed in the vibrant city of Toronto, Canada. After spending six fruitful years as a DevOps Engineer, I recently embarked on a new professional journey with Creative Commons, a nonprofit organization. Today, I want to share my experiences and thoughts on the stark differences in culture between these two organizations and shed light on my recent encounter with an event called InTown Week (ITW).</p> <p>Before joining Creative Commons, I had limited exposure to open source initiatives and nonprofit organizations. However, upon stepping foot into Creative Commons, I found myself captivated by its unique culture. The emphasis on collaboration, transparency, and fostering a sense of responsibility among staff members left me awestruck. The organization's commitment to open source and its associated ethos ignited a newfound passion within me.</p> <p>Prior to ITW, I had heard positive whispers of this event, but had little knowledge about its significance. Little did I know that this week-long gathering would prove to be an enlightening and transformative experience. Over the course of five days, I delved deep into self-discovery, learning more about my team members, and gaining profound insights into Creative Commons as an organization.</p> <p>One aspect that particularly struck me during ITW was the sense of equality among staff members. Everyone was encouraged to share their opinions, even if they contradicted the prevailing decisions. This environment fostered a culture of inclusivity and collective growth. Additionally, I had the opportunity to take an insightful discovery test that provided valuable insights into my personality traits and how I could enhance my contributions within the team.</p> <p>During all the conversations and activities, I found myself occasionally getting lost in the whirlwind of information. Thankfully, I was fortunate to have a dedicated mentor who skillfully guided me back on track, ensuring that I comprehended the nuances of the happenings around me. Through this mentorship, I discovered how to boost my energy levels and become an even better fit within the team.</p> <p>While the experience of ITW was enriching, it was not without its challenges. Due to visa issues, I was not able to attend in person. Engaging in extended virtual calls throughout the week was demanding, but the active participation from every individual in all activities made it all worthwhile. One particularly heartwarming moment was bidding farewell, as everyone stood in a row to say goodbye to me personally on the screen.
This gesture made me feel truly present, transcending the virtual realm, and I express my heartfelt gratitude to my mentor for facilitating such connections.</p> <p>A notable contrast I observed during my time at Creative Commons was the vivaciousness and approachability of the CEO. In stark contrast to my previous organization, where CEO communication was primarily limited to formal emails announcing changes or decisions, I was pleasantly surprised by the CEO's warmth and genuine interest in engaging with employees in a personable manner. This refreshing leadership style evoked a sense of enthusiasm and bolstered my commitment to the organization's goals.</p> <p>To encapsulate my thoughts on the open source culture at Creative Commons in a single word, I would choose "likable." My mentor, in particular, played a crucial role in establishing transparency by designing a comprehensive three-month onboarding document, which laid out expectations and goals. Here, although the infrastructure may be smaller, the workflow is streamlined, and there is an absence of restrictive IT teams that hinder access to websites or prioritize hardware security over staff responsibility. Instead, Creative Commons embraces a culture where employees take ownership of their hardware, fostering an environment of trust and empowerment.</p>

Many Mona Lisas? Artistic Data Quantification and Assessment 2023-04-26 ['grace_coleman', 'anthony_ho', 'tyler_phillips', 'claire_wan']
<p>Quantifying the Commons</p> <p>University of Michigan, School of Information</p>
<h2 id="project-objective-and-problem-statement">Project Objective and Problem Statement</h2><p>Creative Commons (CC) has over one billion licensed works. However, there is no central record or organization of CC's licensed works, making it difficult to quantify the number of works and to analyze which licenses are useful or should be retired. The goal of this project is to help CC staff identify redundant licenses and use quantitative data in marketing its impact. It focuses on Open Educational Resources (OER).</p>
<h2 id="data-collection">Data Collection</h2><p>Data was collected from <a href="https://www.oercommons.org/">OER Commons</a>, which is one of CC's platforms and a library containing digital education resources. The first step in data collection was identifying which licenses this data source uses and how many works are under each license within OER Commons. OER Commons uses the licenses CC-BY, CC-BY-SA, CC-BY-ND, CC-BY-NC, CC-BY-NC-SA, and CC-BY-NC-ND, which cover both 'fair use' and 'commercial use' assets. The next step in data collection was querying the Application Programming Interface (API) by license. In order to retrieve all works for a license, queries are batched, retrieving a maximum of 50 works at a time. This process is repeated until all works for a license are retrieved. These steps are run for every license. For every API call, the response is XML, which is parsed for features including education level, subject area, material type, media format, languages, primary user, and educational use. The results are output to a tab-separated values (TSV) file.</p>
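<p>As an illustration of that batching, here is a hedged sketch of the collection loop; the endpoint, parameter names, and XML fields below are assumptions for the example, not the documented OER Commons API:</p>
<pre><code># Illustrative sketch of the batched retrieval described above.
# The endpoint, parameters, and XML fields are assumptions.
import csv
import xml.etree.ElementTree as ET

import requests

API_URL = "https://www.oercommons.org/api/search"  # hypothetical endpoint
BATCH_SIZE = 50  # maximum works retrieved per query


def fetch_works_for_license(license_slug):
    """Page through the API, BATCH_SIZE works at a time, until exhausted."""
    offset, works = 0, []
    while True:
        response = requests.get(API_URL, params={
            "license": license_slug,   # hypothetical parameter names
            "batch_size": BATCH_SIZE,
            "batch_start": offset,
        })
        response.raise_for_status()
        records = ET.fromstring(response.text).findall(".//record")
        if not records:
            break  # all works for this license have been retrieved
        for record in records:
            works.append({
                "license": license_slug,
                "education_level": record.findtext("education_level", ""),
                "subject_area": record.findtext("subject_area", ""),
                "material_type": record.findtext("material_type", ""),
            })
        offset += BATCH_SIZE
    return works


# Repeat for every license and write the combined rows to a TSV file.
rows = []
for slug in ["cc-by", "cc-by-sa", "cc-by-nd", "cc-by-nc", "cc-by-nc-sa", "cc-by-nc-nd"]:
    rows.extend(fetch_works_for_license(slug))
if rows:
    with open("oer_works.tsv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
</code></pre>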
<h2 id="exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</h2><p>After collecting all of our data, we began exploring the different columns in our dataframe. In particular, we looked at the distribution of different languages, the distribution of items by license type, and when items were added to the OER Commons API. Through this exploration, we were able to further specify our analysis and dig deeper into the different relationships of the data.</p>
<h3 id="diagram-1">Diagram #1:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_01.png" alt="Diagram #1: Percentage of Items per License Type"></p> <p>Diagram #1 shows the distribution of items taken from OER Commons by license type. It is clear that the CC-BY license type is the most popular, with 43% of the items having that license type. The CC-BY-SA license is also fairly popular, accounting for 27% of the items collected.</p>
<h3 id="diagram-2">Diagram #2:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_02.png" alt="Diagram #2: Number of Items by Month since Dec 2015"></p> <p>Diagram #2 shows when items have been added to the OER Commons API. There is little activity from December 2015 up to the beginning of 2023. However, close to 30,000 items were added to the API in early 2023.</p>
<h3 id="diagram-3">Diagram #3:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_03.png" alt="Diagram #3: Percentage of Items by Language"></p> <p>Diagram #3 shows the percentage of items by language. English is the most used language, with about 86% of the items being in English. The other languages each have a small share of the items.</p>
<h3 id="diagram-4">Diagram #4:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_04.png" alt="Diagram #4: Percentage of Items in English per License Type"></p> <p>Since English is clearly the most popular language, we decided to see the license distribution for items that are in English. Diagram #4 shows a distribution similar to the pie chart depicting the overall license distribution; this is to be expected, since items in English account for 86% of all items.</p>
<h3 id="diagram-5">Diagram #5:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_05.png" alt="Diagram #5: Percentage of Items in French per License Type"></p> <p>We continued to look at the distribution of licenses by each language. Diagram #5 shows that for the items in French, the CC-BY license is the most popular at 49%, with CC-BY-SA being right behind it at 32%.</p>
<h2 id="visualizations">Visualizations</h2><h3 id="diagram-6">Diagram #6:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_06.png" alt="Diagram #6: License Type Breakdown by Primary User"></p> <p>Diagram #6 shows the distribution of items on OER Commons by primary user, broken down by license type. The platform predominantly contains items designed for teachers and students, with the rest for parents, administrators, and librarians, among others.
The breakdown of licenses for each primary user is relatively consistent with the overall breakdown of the platform, as seen from the charts below (Diagram #7 and Diagram #8).</p>
<h3 id="diagram-7">Diagram #7:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_07.png" alt="Diagram #7: Percentage of Items Used by Teachers per License Type"></p>
<h3 id="diagram-8">Diagram #8:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_08.png" alt="Diagram #8: Percentage of Items Used by Students per License Type"></p>
<h3 id="diagram-9">Diagram #9:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_09.png" alt="Diagram #9: Subject Area by License"></p> <p>Another aspect analyzed was the subject areas and the licenses they hold, as shown in Diagram #9. Some preliminary data cleaning had to be conducted, as there were too many subjects on the platform, while some subjects had very low counts. The team grouped similar subjects into nine different categories; for example, social science, anthropology, sociology, communication, world cultures, psychology, women's studies, and social work were grouped into social sciences.</p> <p>It can be seen from Diagram #9 that the most popular subject areas on the platform are health sciences, language/arts, and other sciences. Diving deeper into these subject areas, health sciences and language/arts have a higher proportion of items with the CC-BY-NC-SA license.</p>
<h3 id="diagram-10">Diagram #10:</h3><p><img src="/blog/entries/2023-04-26-umsi-how-many-mona-lisas/diagram_10.png" alt="Diagram #10: Material Type Breakdown by Education Level"></p> <p>Finally, the team analyzed the material types of the items and sorted them by the education level the items were created for. Again, some data cleaning was required, as there were too many material types to analyze and some also had very small data counts. The seven material types shown in Diagram #10 were the most popular, and represented roughly 2/3 of the total.</p> <p>After sorting the education levels in ascending order, an interesting trend that emerged is that the number of items increases with education level from preschool, hits a peak at the community college level, and then decreases afterwards. A shift in the material types can also be drawn from the graph, as lesson plans represent a large proportion of items from preschool to high school, but become insignificant from the college level onwards; they are replaced by a higher proportion of readings. Textbooks also make up a higher proportion of items at the college level.</p>
<h2 id="key-value">Key Value</h2><p>The insights created through the analysis of this project will be helpful for CC's marketing efforts. The ability to understand the distribution of license types in different contexts, such as education level, will help CC be better equipped to target their marketing toward key demographics, such as preschool education materials. Another takeaway in terms of key value was CC's initiative toward long-term preservation. CC's need to centralize their collaborators' content into a database warehouse system has been an identified direction since the start of this project. Our prototype database of OER Commons has contributed to these efforts, both as a small-scale implementation and in scoping our database system modeling.
As other CC cohort chapters contribute their own databases of licensed works, there is a hopeful expectation that these databases will be merged with those of other CC chapters in the future.</p>
<h2 id="next-steps">Next Steps</h2><p>As CC expands its contributing members into the open-source initiative of bringing licensed works to the world, other internal systems of data preservation and maintenance start to become a point of serious interest as the databases become an integrated endeavor in the future. Running our prototype case study of the OER Commons database has given us insights into the direction of CC's current database system and how this system will be better suited to evolve into a data warehouse hub as a long-term solution. When we started the process of data mining and data analysis, Python 3 was a staple of both our group's efforts and CC's previous protocols with Git. So, complementing this framework with other Python libraries that allow for easier database querying will be a step in the right direction for the next cohort of CC contributors to further this process along. An example of this library integration would be pandasql, which pairs the familiar pandas library methods with SQL query logic, making database maintenance easy and manageable (see the sketch below). Besides updating the data storage, future work can continue to collect data from other sources with CC-licensed works, including GLAM institutions and the Internet Archive.</p>
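<p>As a hedged sketch of that pandasql integration (the file and column names below are illustrative, reusing features collected earlier in this post):</p>
<pre><code># Sketch of querying collected OER Commons data with SQL via pandasql.
# File and column names are illustrative assumptions.
import pandas as pd
from pandasql import sqldf

works = pd.read_csv("oer_works.tsv", sep="\t")

# Count works per license for a given education level, in plain SQL.
query = """
    SELECT license, COUNT(*) AS n_works
    FROM works
    WHERE education_level = 'Community College'
    GROUP BY license
    ORDER BY n_works DESC;
"""
print(sqldf(query, locals()))
</code></pre>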
<h2 id="acknowledgements">Acknowledgements</h2><p>We would like to express our gratitude towards Timid Robot Zehta, our client, for working on behalf of CC, as well as <a href="https://www.oercommons.org/">OER Commons</a> for their valuable contributions towards the development of digital licensing and open source databasing initiatives. Without them, this project would not have been possible. Their efforts have been instrumental in giving us the tools and resources to help progress in the open-source initiative by allowing us to promote the free exchange of ideas, knowledge, and resources within the art, health, and education sectors of non-profit endeavors. Open source projects are important because they allow the public to use and work on projects without restrictions or keys. Since this initiative is open source, our efforts can be added to and built upon, allowing the project to continue through the addition of new contributors with fresh perspectives. Their shared commitment to promoting accessible and inclusive content has enabled individuals and organizations to create and distribute digital assets around the world without facing legal restrictions. It has been an absolute pleasure to work with these organizations and be a part of their mission to democratize access to information.</p>

Considering Community Contributions at Creative Commons 2023-03-24 ['sara']
<p>Different open source communities work differently, and so everyone may arrive at Creative Commons' projects with their own set of individual expectations. Someone might expect to directly submit a Pull Request to a project without an Issue. Or they may submit an Issue and then immediately an associated Pull Request. At Creative Commons we have a process we hope to follow so there's a chance for consideration, community participation, and discussion along the way, making collaborative, well documented, and informed work more possible!</p> <p>Things usually begin with an idea for new functionality, new/revised documentation, or an encountered error of sorts. That idea or error is then captured as a GitHub Issue used to describe its details. Think of this as the Abstract that comes before the Implementation.</p> <p>It's important to first look through all the existing Issues, including ones that have been Closed, to determine if someone else has already made an overlapping Issue. If they have, it's best to add any new information you've discovered or thought of as a comment (or series of comments) to that Issue, rather than create a new one.</p> <p>Errors (often referred to as Bugs) should be verified, and reproducible if possible. Things like screenshots, steps to reproduce, a video, and environment details are all incredibly helpful for others when they want to review the error. All that information is gathered and placed in a succinct, but detailed Issue on the associated repository. It's worth noting that the documented Issue alone is a valued contribution. It will provide guidance and documentation for whoever works on resolving or implementing it, so it's just as important as the eventual code that will be written. That means it should be done well, because the better an Issue describes an error and provides a clear way to reproduce it, the easier it will be for anyone to address it.</p> <p>Functionality and Feature proposals are often a little more involved. Errors are some aberration in the existing expectations or functionality of the codebase's state, but new/changed functionality or features introduce larger planning considerations. They have to take into account the current state of things and the proposed future state they're introducing as an Issue. This is an exercise in communication and description first and foremost, and that means that having a detailed writeup, wireframes, mockups, and evidence to support the proposal is vital to its success. Where Errors might be able to consider a more isolated set of consequences to fixing something, introducing new features/functionality may have unintended side effects; it may require multiple parts of the codebase to be changed or altered. All of these larger picture considerations should be taken into account and addressed within the Issue. One should expect that a Feature Issue may on average take longer to introduce, and longer to adequately document in a clear and concise way to get the point across to the rest of the community.</p> <p>Documentation can always use improvements, whether within code comments, a project's README.md, or associated documentation. These would largely be considered a "Feature Issue" technically, but it's worth pointing them out separately because they're as important, if not more so, than fixing errors or adding codebase level functionality. Good documentation makes the project strong and the community more informed. Improvements here should document where there's a gap or where revisions are needed, and how they should be corrected.</p> <p>Whether an Error or Feature/functionality Issue, once it's been submitted, in accordance with the <a href="/contributing-code/">Contribution Guidelines</a>, it will move to a status of "awaiting triage". This means that it is waiting to be reviewed by one of the core codebase contributors. While it's in this state, no implementation work should be done (no PRs, no code work to add or correct the behavior).
An Issue submitted is largely the start of a process, and a conversation. Core contributors will review the Issue and see if it adequately describes the appropriate details, and if its objectives fit within the larger pattern and goals of the codebase itself. It's entirely possible that a well thought through Feature Issue that adds some new menu functionality is in isolation a good idea, but that it doesn't fit within the goals of the project in question and won't move forward. And that's OK: even if an Issue doesn't move forward, it can now stand as documentation for the community on what won't be worked on at this time, which is just as important as what will. It's a contribution whether it moves forward or not, so long as it describes itself well enough.</p> <p>If this happens, the Issue will be moved to a status of "discarded", and will be closed with a comment explaining why. The other reason an Issue might be moved to "discarded" is that it duplicates the work in another Issue, which is why it's important to first check all the existing Issues prior to submitting a new one.</p> <p>Sometimes an Issue might describe something much broader than can be easily contained within itself and may be converted to a status of "discussion". This means that the Issue should spark a larger conversation within the community to consider all the angles of the abstract idea, and possibly split the idea up into more manageable pieces across multiple Issues. Another outcome might be a discussion that concludes that, while the idea is sound, it's not implementable at this time and won't move forward.</p> <p>Some Issues are solid ideas, but they are not something that can move forward until work on other Issues is completed first. As such they tend to move to a status of "blocked". They'll sit in that state until they're unblocked and the work can happen.</p> <p>If an Issue seems like it doesn't have enough information to determine what to do with it, then it will likely move to a status of "ticket work required", and a comment will usually be left describing what needs to be worked on.</p> <p>Remember, an Issue is a form of documentation, and in a way it's a conversation, and that means that until it moves forward it's very much a work in progress.</p> <p>If an Issue passes through this period as implementable, then it'll move to a status of "ready for work". This is the point at which it can be implemented, and a contributor can submit a Pull Request addressing it. (See the <a href="/contributing-code/repo-labels/#status">Repository Labels Status section</a> for more information.)</p>
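<p>As a toy summary (not actual CC tooling), the statuses described above can be captured as a small transition map:</p>
<pre><code># Toy summary of the Issue lifecycle described above; this is an
# illustration, not code used by Creative Commons.
ISSUE_STATUS_FLOW = {
    "awaiting triage": [
        "discarded",             # out of scope, or duplicates another Issue
        "discussion",            # broader conversation needed first
        "blocked",               # sound idea, but other Issues must land first
        "ticket work required",  # not enough information yet
        "ready for work",        # triaged and implementable
    ],
    "discussion": ["ready for work", "discarded"],
    "blocked": ["ready for work"],
    "ticket work required": ["awaiting triage"],
    "ready for work": [],  # a contributor is assigned and a PR can follow
}
</code></pre>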
<p>During this process it is worth noting that there will be multiple types of contribution. For example:</p> <ul> <li>The Issue itself is a contribution.</li> <li>Comments on the Issue from the community refining it are each contributions.</li> <li>Someone's comment on the Issue helping another person sort out why the Error is occurring is a contribution.</li> <li>Someone finding another related Issue and linking it as relevant to that Issue is a contribution.</li> </ul> <p>All of these contributions occurred before a Pull Request was ever initiated. Once an Issue enters a status of "ready for work", someone who has indicated interest on that Issue will be assigned to it and can then fork the repository, make a branch to work within, and once settled submit a Pull Request. That process alone may involve several contributions as well, such as:</p> <ul> <li>The code work encounters a problem, someone asks for assistance within their draft PR, and several members offer help as comments.</li> <li>Someone reviews the final PR and leaves a detailed review on what might need addressing.</li> <li>A discussion breaks out on the best way to resolve an encountered problem with the PR; each of these comments is a contribution.</li> <li>And, of course, the PR itself is a contribution.</li> </ul> <p>If the PR passes Review, it'll be marked as Approved and merged into the codebase; that will trigger the associated Issue to close as complete, and the Error fix or Functionality in question will be fully implemented in the project.</p> <p>Getting here took multiple contributions from different community members; that's the power of open source!</p>

Outreachy Internship Mid-point Progress Update 2023-02-01 ['precious']
<p><img src="https://res.cloudinary.com/dexcmkxjl/image/upload/v1675262087/1157214599-I-may-not-be-there-yet_s3cjxm.webp" alt="quote image"></p> <p><strong>Outreachy Internship - Refactor CC Meta Search - Mid-point Progress Update</strong></p> <p>As an intern at Creative Commons, my original project timeline was to refactor the old CC search website to use semantic and modern HTML and CSS. This project was intended to improve the user experience and make the website more accessible to a broader audience. The old CC Meta Search website is built on PHP and JavaScript. My major goal for this internship is to convert the PHP to semantic HTML and modern CSS while ensuring that all necessary functionalities are intact.</p> <p>I have met several of my goals in the first half of my internship. So far, I have successfully refactored the website's HTML to use semantic elements, which improves the website's accessibility and makes it easier for users to understand the content. I achieved this by creating a new index.html file and rebuilding the site with semantic HTML. Additionally, I have also implemented modern CSS techniques to improve the website's visual design and make it more responsive on various devices. All of this is currently being reviewed by my mentors for feedback on additional changes that might be needed.</p> <p>However, there were some project goals that took longer than expected to complete. One of the main reasons for this was that the website's codebase was not well-organized, which made it difficult for me to navigate and understand. Additionally, the site had a pre-determined CSS file that I was supposed to follow or incorporate while building the new site, but this file was cumbersome, and most of the styles did not give the desired result. I spent a lot of time trying to understand and navigate through this; eventually I had to speak to my mentor about it, and suggested writing my own CSS styling, which she agreed to. Thus the original goal of incorporating the CC Vocabulary CSS file was modified.</p> <p>Additionally, I had to prioritize certain tasks over others and make adjustments to my plan as necessary.</p> <p>The new CSS I have written so far already makes the website's layout responsive. I have also created a new script.js file and started working on the necessary functionalities of the website. I plan to implement all feedback from my mentors and debug any remaining issues.
Additionally, I will be working on improving the website's overall performance by implementing several optimization techniques as necessary.</p> <p>Overall, my aim is to ensure that the website is fully functional and user-friendly for all users.</p>

How I Landed My First Internship With Outreachy 2023-01-04 ['precious']
<p><img src="https://res.cloudinary.com/dexcmkxjl/image/upload/v1671657493/blog_image_kjvep8.jpg" alt="quote image"></p> <p><em>Take a moment to resonate with the words above. This is a principle I always follow in life: "Doing it scared"</em></p> <p><strong>Get To Know Me</strong></p> <p>My name is Precious Oritsedere. I am a Nigerian. I am a software engineer who is known for her dedication to her work and her strong core values. I believe in the power of love, contribution, and empathy, hence I use these values as my guide in both my personal and professional life.</p> <p>As a young girl, I was always fascinated by technology, gadgets, and even games. However, I didn't study any technology-related courses at university. I graduated with a degree in International Studies and Diplomacy, but this didn't stop my curiosity about technology. I really wanted to pursue my passion for technology but I was unsure of the means to go about it because I had no guide.</p> <p>Fast-forward to January 2022: I spoke to a friend about how I really wanted to know and learn more about Software Engineering, and he introduced me to AltSchool Africa. This was the starting point of my Tech Career. Subsequently, I began to learn all about Frontend Engineering, and I also got introduced to open source.</p> <p><strong>Why I applied to Outreachy</strong></p> <p>As a software engineer, I am committed to using my skills to make a positive impact on the world. I believe in the power of collaboration and teamwork and I am always willing to lend a helping hand to my colleagues. This was why the concept of <a href="https://opensource.com/resources/what-open-source">open source</a> really appealed to my personality.</p> <p>One of the things that motivates me is my desire to give back to my community. I am passionate about making a difference, and this is why I applied for the <a href="https://www.outreachy.org/">Outreachy</a> internship. Outreachy is a program that provides paid internships in open source and open science. Outreachy provides internships to people subject to systemic bias and impacted by underrepresentation in the technology industry where they are living. Interns are paid a stipend of $7,000 for a period of 3 months.</p> <p>The Outreachy application process consists of three stages: the initial application, the contribution period, and the final application. I was determined to succeed in each stage and eventually secure an internship. Which I did! Here's how:</p> <p>The first stage of the application process was the <strong>initial application</strong>. I carefully reviewed the requirements and made sure I met all of them. We were asked to write 3 essays centered around how we have been underrepresented in the tech industry. I took quite some time to carefully think this through before writing my essays,
after which I submitted my application on time and waited for a response.</p> <p><img src="https://res.cloudinary.com/dexcmkxjl/image/upload/v1671658423/initial_applic_bnriee.png" alt="initial application mail"></p> <p>A few weeks later, I received an email from the Outreachy organizers stating that I had been selected to move on to the next stage: the contribution period. This stage involved making a contribution to an open-source project. I was excited to have the opportunity to make a real impact and I spent hours researching the different projects. I eventually chose to contribute to the <a href="https://creativecommons.org/">Creative Commons</a> project "Refactor CC Meta Search".</p> <p><a href="https://creativecommons.org/">Creative Commons</a> is an American non-profit organization and international network devoted to educational access and expanding the range of creative works available for others to build upon legally and to share. They also help overcome legal obstacles to the sharing of knowledge and creativity to address the world's most pressing challenges.</p> <p>I was determined to make a high-quality contribution to this project. I learnt a lot during this stage. I got introduced to new tools like Docker, Linux, and even PHP. I spent countless hours learning the codebase and working on my contribution. In the spirit of love and collaboration, I spent a lot of hours helping my fellow Outreachy applicants who were beginners or confused about what to do. I also reached out to the project mentors for guidance when I got stuck and feedback on the contributions I was making.</p> <p>My hard work paid off and I successfully submitted my contribution on time. For the final stage, I made a final application to be considered for an internship position with Creative Commons.</p> <p><img src="https://res.cloudinary.com/dexcmkxjl/image/upload/v1671658873/accepted_chmuad.png" alt="acceptance mail"></p> <p>I was thrilled when, some weeks later, I received the congratulatory mail saying that I had been selected for the internship. My success in the Outreachy application process was a result of my dedication and hard work. I put in the time and effort to ensure I met all the requirements and made a valuable contribution to the project. And I am happy to announce that my determination and perseverance paid off and I was rewarded with a valuable internship opportunity.</p> <p><img src="https://res.cloudinary.com/dexcmkxjl/image/upload/v1671659054/interns_tedoag.png" alt="interns page"></p> <p>In conclusion, Precious Oritsedere is a talented software engineer who is dedicated to her work and driven by her core values of love, contribution, and empathy. She is motivated by her desire to make a difference, and this is evident in her commitment to her work and her participation in the Outreachy internship.</p>

Thinking More Openly About Working in The Open 2022-12-16 ['sara']
<p>I began working at Creative Commons (CC) as the Full Stack Engineer this year and it's been amazing to get to work in the open at CC.
But as someone who has been working in closed, internal source environments for a very long time, it's definitely been a learning experience and a perspective shift.</p> <p>For years I benefited from, observed, and offered up personal work into the world of open source, but I was never deeply involved in other projects in a big way, nor was I able to contribute anything I did at my professional day job back into the open source world (despite the benefit open source afforded the work I did every day). It had been a hope of mine, something I had advocated for, but had ultimately not worked out. Now at CC I finally get to participate in projects that operate in the open, and a larger community of contributors around the world.</p> <p>It's been refreshing and rewarding, but it's also been enlightening. There's so much that's different now. Working in the open doesn't just shift the terms under which your code is licensed or how many people can contribute; it requires a significant shift in both approach and process.</p> <p>For example, working in the open means that while there may be community members eager to contribute, they may lack contextual understanding that someone more intimately familiar with a project might develop over time and rely upon. To support contributions well you need to have a heavily documentation-first strategy that affords new contributors key information in understandable and clear instructions.</p> <p>That also means that documenting <em>issues</em> isn't just an item on a todo list you'll get to later. There's extreme value in writing out detailed information both for your future self, but also for any would-be community contributors to understand the problem and address it. Setup instructions, contextual documentation about the codebase, as well as detailed known issues, roadmaps, etc.: all of it needs to be documented and written out, which not only benefits the community contributors, but also benefits the project as a whole. It means key information has to live in the open alongside the code it informs. It's truly a win-win all around.</p> <p>The process also has to shift: you can't just make a list of things you want to tackle and get to work; you have to consider how each item can be smoothly adopted as granular and iterative Pull Requests that might all be worked on by entirely different individuals. The level of care in how the work is divided and scoped matters even more in this situation than it would have with an internal team. Working in the open doesn't just mean coding in the open; it also means planning in the open, and that means having a clearer view on the overall roadmap and goals the project hopes to meet.</p> <p>If you are the steward of a codebase, any task list you create or <em>issues</em> you identify are ultimately not just for you alone. Putting an item on your list when you're working alone isn't enough; you've also got to find time to work on that item, and work your way through completing it.</p> <p>In the open source context, working with a community of contributors, creating an <em>issue</em> is just as important and meaningful as writing code; in many cases it might actually be MORE important, because <em>issues</em> are often the way in which contributors first offer up help and insight; they're the first contact they have with your project. Furthermore, any <em>issue</em> you create may end up getting completed by one or more people that are not you, which means it doesn't just sit on a list till you do it.
It's a small, but significant shift in how you think about planning and breaking down work on a codebase in the open.</p> <p>It’s certainly new, but incredibly rewarding. Even on days where I might not get to submit a Pull Request myself, or squash a bug in a meaningful way, I can still feel I offered up meaningful contributions to the community and the codebase through better documentation, answering someone’s question, reworking a process, or reviewing someone else’s generous contribution. Open Source means opening up your definition of what contribution means, and it’s a lot broader and more meaningful than I thought.</p> Data Science Discovery: Quantifying the Commons 2022-12-07T00:00:00Z ['Dun-MingHuang', 'ShuranYang'] urn:uuid:b5ca9376-727d-3a82-97f0-d150f1827d77 <p>University of California, Berkeley, Data Science Discovery Program Fall 2022</p> <h2 id="project-objective">Project Objective</h2><h3 id="problem-statement">Problem Statement</h3><p>In the previous years, from 2014 to 2017, Creative Commons (CC) have been releasing public reports detailing the growth, size, and usage of Creative Commons, demonstrating the significance and influences of Creative Commons. However, the effort to quantity Creative Commons has ceased at the proceeding year. This is the preincarnation of our current open-source project: <a href="https://github.com/creativecommons/quantifying">Quantifying the Commons</a>.</p> <p>An example visualization from the previous report in 2017: <img src="/blog/entries/2022-12-07-berkeley-quantifying/2017_state_of_the_commons_data.png" alt="2017 State of the Commons data graph"></p> <p>The reason is that prior efforts to generate usage reports suffered unreliable data retrieval methods; while prone to malfunction over the updates of website architecture from data sources, these data extraction methods are not particularly rigorous in performance and have a significantly low (compared to current methods, at the scale or an hour v.s. 
</p> <p>To advance and continue the work of quantifying CC product states, the student researchers were delegated the design and implementation of reliable data retrieval processes for the CC data employed in previous reports, to replicate the past efforts of this project's preincarnation and quantify the size and diversity of CC product usage on the Internet.</p>
<h2 id="data-retrieval">Data Retrieval</h2><h3 id="how-to-detect-county-of-cc-licensed-documents">How to detect count of CC-Licensed Documents?</h3><p>If an online document uses a CC tool to protect it, then it will either be labeled as licensed under that tool or contain a hyperlink towards a creativecommons.org webpage that explains the license's rules (the deed).</p> <p>Therefore, we may use the following approach to identify and count CC-licensed documents (sketched below):</p> <ol> <li>Select a list of CC tools to inspect (provided by CC).</li> <li>Use APIs of different online platforms to detect and count documents that are labeled as licensed by the platform and/or contain a hyperlink towards CC license webpages.</li> <li>Store these data in tabular form to contain the count of documents protected under each type of CC tool.</li> </ol>
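<p>As a simplified sketch of steps 2 and 3 for a single platform (Google), the Custom Search JSON API can report an estimated total of pages mentioning a legal tool's URL; the query strategy below is an assumption for illustration, and the real project scripts are more involved:</p>
<pre><code># Simplified sketch: counting documents per CC tool with the Google
# Custom Search JSON API. The query strategy is an assumption.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credentials
CSE_ID = "YOUR_CSE_ID"    # placeholder Programmable Search Engine ID


def count_pages_mentioning(license_url):
    """Return Google's estimated count of pages containing license_url."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CSE_ID, "q": f'"{license_url}"'},
    )
    response.raise_for_status()
    return int(response.json()["searchInformation"]["totalResults"])


# Store the per-tool counts in tabular form (step 3).
counts = {
    tool: count_pages_mentioning(f"creativecommons.org/licenses/{tool}/4.0/")
    for tool in ["by", "by-sa", "by-nc", "by-nc-sa", "by-nd", "by-nc-nd"]
}
print(counts)
</code></pre>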
<h3 id="what-platforms-to-collect-counts-from">What platforms to collect counts from?</h3><p>Here is a list of online platforms that we sampled document counts from, as well as the delegations for platforms' data collection, visualization, and modeling in this project:</p> <table class="table table-striped"> <thead class="thead-dark"><tr> <th>Platforms Containing Webpages</th> <th>Platforms Containing Photos</th> <th>Platforms Containing Videos</th> </tr> </thead> <tbody> <tr> <td>Google (Dun-Ming Huang)</td> <td>DeviantArt (Dun-Ming Huang)</td> <td>Vimeo (Dun-Ming Huang)</td> </tr> <tr> <td>Internet Archive (Dun-Ming Huang)</td> <td>Flickr (Shuran Yang)</td> <td>YouTube (Dun-Ming Huang)</td> </tr> <tr> <td></td> <td>MetMuseum (Dun-Ming Huang)</td> <td></td> </tr> <tr> <td></td> <td>WikiCommons (Dun-Ming Huang)</td> <td></td> </tr> </tbody> </table>
<h3 id="exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</h3><p>Here are some significant defects found in datasets across sampled platforms during EDA:</p>
<h4 id="flickr">Flickr</h4><ul> <li>Sampled document counts from this dataset deviate from official statistics by 35,000% ~ 100,000% per CC product (license) investigated.</li> <li>Sampling frame locked at 4,000 available searched photos for each license.</li> <li>Significant duplication issue (resolved).</li> </ul>
<h4 id="google-custom-search-api">Google Custom Search API</h4><ul> <li>The Programmable Search Engine only reaches a subset of the websites Google indexes. The impact was not significant (and was further resolved via sampling frame adjustments in the PSE).</li> <li>Accidentally used deprecated operators and parameters, causing faithfulness problems (resolved).</li> </ul>
<h4 id="youtube-data-api">YouTube Data API</h4><ul> <li>The API caps its reported total count of YouTube videos, causing a severe underestimate.<ul> <li>Resolved via implementing custom granularity on the data to enable honest responses, conserve development cost, and introduce imputations in visualization.</li> </ul> </li> </ul>
<h3 id="expanding-the-dataset">Expanding the Dataset</h3><p>Here are the reasons for and efforts of dataset expansion on the platforms that received more data:</p>
<h4 id="google-custom-search-api-expansion">Google Custom Search API</h4><ul> <li>Revised the data sampling process to solve EDA-discovered inaccuracies.</li> <li>To expand the horizons of CC product usage analyses beyond past boundaries, where visualization was only conducted to compare cross-product performance, I incorporated further CC-product usage data across the temporal axis and geographical demographics.</li> </ul>
<h4 id="youtube-data-api-expansion">YouTube Data API</h4><ul> <li>Revised the data sampling process to solve EDA-discovered inaccuracies.</li> <li>To perform unprecedented analyses of media-specific, time-respective developments of CC options on popular platforms, I collected YouTube's CC-licensed video counts across two-month periods.</li> <li>Introduced imputation to alleviate unresolvable capped responses from YouTube and mitigate development cost in response to the YouTube API's capping behaviour.</li> </ul>
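<p>That imputation can be sketched as fitting a linear trend to the uncapped periods and replacing the capped readings with fitted values; the numbers below are made up for illustration:</p>
<pre><code># Sketch of linear-regression imputation for capped API responses.
# The data values here are made up.
import numpy as np

periods = np.arange(10)  # consecutive two-month periods
observed = np.array([120, 155, 190, 232, 268, 305, 341, 380, 380, 380])
capped = observed >= 380  # suppose the API caps responses at 380

# Fit the linear growth on trustworthy (uncapped) periods only.
slope, intercept = np.polyfit(periods[~capped], observed[~capped], deg=1)

# Replace capped readings with the fitted linear trend.
imputed = observed.astype(float)
imputed[capped] = slope * periods[capped] + intercept
print(imputed)
</code></pre>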
<h2 id="visualization">Visualization</h2><h3 id="philosophies-and-principles">Philosophies and Principles</h3><p>The visualizations of Quantifying the Commons are meant to be communicative and exhibitory.</p> <p>Some new aesthetics and principles we adopted (enhancing prior efforts) are to:</p> <ul> <li>Present length in place of area for comprehensibility</li> <li>Analyze product development beyond license-wise comparisons</li> <li>Utilize color to present tendencies in the data, working with Pandas, Seaborn, NumPy, GeoPandas, and spaCy</li> </ul> <h3 id="exhibiting-a-selection-of-visualizations">Exhibiting a Selection of Visualizations</h3><h4 id="diagram-1c">Diagram 1C</h4><p>Trend Chart of Creative Commons Usage on Google <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_1c.png" alt="Trend Chart of Creative Commons Usage on Google"></p> <p>There are now <strong>more than 2.7 billion webpages protected by Creative Commons</strong> indexed by Google!</p> <h4 id="diagram-2">Diagram 2</h4><p>Heatmap on density of CC-licensed Google indexed webpages over country <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_2.png" alt="Heatmap on density of CC-licensed Google indexed webpages over country"></p> <p>Notably, <strong>Western Europe and the Americas enjoy much more robust use</strong> of Creative Commons documents in terms of quantity. Development in Asia and Africa should be encouraged.</p> <h4 id="diagram-3c">Diagram 3C</h4><p>Barplot for number of webpages protected by six primary CC licenses <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_3c.png" alt="Barplot for number of webpages protected by six primary CC licenses"></p> <p>We can see that <strong>Attribution</strong> (BY) and <strong>Attribution-NoDerivs (BY-ND) are popular licenses</strong> among the 3 billion documents sampled across the dataset.</p> <h4 id="diagram-6">Diagram 6</h4><p>Barplot of CC-licensed documents across Free Culture and Non Free Culture licenses <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_6.png" alt="Barplot of CC-licensed documents across Free Culture and Non Free Culture licenses"></p> <p>Roughly <strong>45.3% of the documents under CC protection are covered by Free Culture</strong> legal tools.</p> <h4 id="flickr-diagrams">Flickr Diagrams</h4><p>Usage of CC licenses on Flickr is concentrated in Australia, Brazil, and the United States of America, while it is quite low in Asian countries.</p> <p><strong>Note:</strong> The sampling frame of these visualizations is locked at the first 4,000 search results for photos under each general license type.</p> <h5 id="diagram-7a">Diagram 7A</h5><p>Analysis of Creative Commons Usage on Flickr</p> <p><img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_7a.png" alt="CC BY-SA 2.0 license usage in Flickr pictures taken during 1962-2022"></p> <h5 id="diagram-7b">Diagram 7B</h5><p><img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_7b.png" alt="Flickr maximum views of pictures under all licenses"></p> <p>Photos on Flickr under the Attribution-NonCommercial-NoDerivs (BY-NC-ND) license have gained the highest maximum views, while the Public Domain Mark has shown the strongest upward usage trend in recent years.</p> <h5 id="diagram-7c">Diagram 7C</h5><p><img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_7c.png" alt="Flickr yearly trend of all licenses 2018-2022"></p> <h5 id="diagram-7d">Diagram 7D</h5><p><img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_7d.png" alt="Flickr Photos under CC-BY-NC-SA 2.0 and CC BY-NC 2.0: Categories Keywords"></p> <h4 id="diagram-8">Diagram 8</h4><p>Number of works under Creative Commons Tools across Platforms <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_8.png" alt="Number of works under Creative Commons Tools across Platforms"></p> <p>DeviantArt hosts the largest number of works under Creative Commons licenses and tools, followed by Wikipedia and WikiCommons. The video count for YouTube is underestimated, as demonstrated in Diagram 11B.</p>
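<p>As a small illustration of the "length in place of area" principle, a bar chart like Diagram 8 can be produced with Pandas and Seaborn (the numbers below are made up):</p> <div class="hll"><pre>
# Illustrative sketch of a platform-comparison bar chart; made-up counts.
import pandas as pd
import seaborn as sns

data = pd.DataFrame({
    "platform": ["DeviantArt", "Wikipedia", "WikiCommons"],
    "works": [120_000_000, 90_000_000, 60_000_000],  # illustrative only
})
ax = sns.barplot(data=data, x="works", y="platform", color="#ed592f")
ax.set_xlabel("CC-licensed works")  # bar length encodes the count
</pre></div>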
<h4 id="diagram-9b">Diagram 9B</h4><p>Barplot of Creative Commons Protected Documents across Countries <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_9b.png" alt="Barplot of Creative Commons Protected Documents across Countries"></p> <h4 id="diagram-10">Diagram 10</h4><p>Barplot of Creative Commons Protected Documents across languages <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_10.png" alt="Barplot of Creative Commons Protected Documents across languages"></p> <h4 id="diagram-11b">Diagram 11B</h4><p>Trend Chart of Cumulative Count of CC-Licensed YouTube Videos across Each Two-Months <img src="/blog/entries/2022-12-07-berkeley-quantifying/diagram_11b.png" alt="Trend Chart of Cumulative Count of CC-Licensed YouTube Videos across Each Two-Months"></p> <p>The <strong>orange line stands for the imputed values of new CC-licensed YouTube video counts based on linear regression,</strong> which was chosen as the imputation method because the CC-licensed document counts of most media also grow linearly.</p>
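<p>A minimal sketch of this imputation, with made-up numbers: fit a linear regression on the periods that returned honest counts, then impute the periods whose responses were capped:</p> <div class="hll"><pre>
# Illustrative linear-regression imputation for capped per-period counts.
import numpy as np

periods = np.arange(10)              # two-month period index
counts = 5000.0 + 1000.0 * periods   # observed counts (made-up numbers)
capped = periods >= 7                # periods hit by the API's response cap

slope, intercept = np.polyfit(periods[~capped], counts[~capped], deg=1)
imputed = slope * periods[capped] + intercept  # values for the orange line
</pre></div>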
<h2 id="modeling">Modeling</h2><p>(A side track)</p> <h3 id="objectives-of-modeling">Objectives of Modeling</h3><p>The models of this project aim to answer: "What is the license type of a webpage/web document, given its content?"</p> <p>Each researcher attempted their own solution, using different resources and metrics under different modeling contexts:</p> <h4 id="model-of-google-webpages-dun-ming-huang">Model of Google Webpages (Dun-Ming Huang)</h4><ul> <li>Modeling Context: Multiclass Classifier (7 classes).</li> <li>Modeling Training set: text contents of webpages collected via the Google API (Common Crawl, the original choice, was unavailable due to source code corruption).</li> <li>Main Model Metric: top-k accuracy, as this model is considered the backend of a license recommendation system that receives webpage content and recommends 2 to 3 licenses to the user.</li> </ul> <h4 id="model-for-flickr-photos-shuran-yang">Model for Flickr Photos (Shuran Yang)</h4><ul> <li>Modeling Context: Binary Classifier (BY vs. BY-SA)</li> <li>Modeling Training set: text photo descriptions acquired from the Flickr API (using the same sampling frame as the visualizations)</li> <li>Main Model Metric: Accuracy</li> </ul> <h3 id="training-process-summary-google-model">Training Process Summary: Google Model</h3><h4 id="preprocessing-pipeline">Preprocessing Pipeline</h4><ol> <li>Deduplication</li> <li>Remove Non-English Characters</li> <li>URL, <code>[^\w\s]</code>, Stopword Removal</li> <li>Remove Non-English Words</li> <li>Remove Short Words, Short Contents</li> <li>TF-IDF + SVD</li> <li>SMOTE</li> </ol> <h4 id="model-selection">Model Selection</h4><div class="hll"><pre><span></span><span class="n">LogisticRegression</span><span class="p">(</span> <span class="n">penalty</span><span class="o">=</span><span class="s2">&quot;l2&quot;</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s2">&quot;liblinear&quot;</span><span class="p">,</span> <span class="n">class_weight</span><span class="o">=</span><span class="s2">&quot;balanced&quot;</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="p">)</span> </pre></div> <div class="hll"><pre><span></span><span class="n">SVC</span><span class="p">(</span> <span class="n">C</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">probability</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">kernel</span><span class="o">=</span><span class="s2">&quot;poly&quot;</span><span class="p">,</span> <span class="n">degree</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">class_weight</span><span class="o">=</span><span class="s2">&quot;balanced&quot;</span><span class="p">,</span> <span class="p">)</span> </pre></div> <div class="hll"><pre><span></span><span class="n">RandomForestClassifier</span><span class="p">(</span> <span class="n">class_weight</span><span class="o">=</span><span class="s2">&quot;balanced_subsample&quot;</span><span class="p">,</span> <span class="n">n_estimators</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="p">)</span> </pre></div> <div class="hll"><pre><span></span><span class="n">GradientBoostingClassifier</span><span class="p">(</span> <span class="n">n_estimators</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="p">)</span> </pre></div> <div class="hll"><pre><span></span><span class="n">MultinomialNB</span><span class="p">(</span> <span class="n">fit_prior</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="p">)</span> </pre></div> <p>We also evaluated a BERT-based neural classifier with the following layer structure:</p> <ol> <li>text : InputLayer</li> <li>preprocessing : KerasLayer</li> <li>BERT_encoder : KerasLayer</li> <li>dropout : Dropout</li> <li>classifier : Dense</li> </ol> <h4 id="training-results">Training Results</h4><p><img src="/blog/entries/2022-12-07-berkeley-quantifying/training_performance.png" alt="Testing Performances across Models by Top-k Accuracy"></p>
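<p>As a consolidated (illustrative) sketch of the preprocessing and model selection described above, the TF-IDF + SVD + SMOTE steps can be chained with one candidate classifier using scikit-learn and imbalanced-learn; the TF-IDF and SVD hyperparameters here are assumptions, not the exact project configuration:</p> <div class="hll"><pre>
# Illustrative sketch of the pipeline above (TF-IDF -> SVD -> SMOTE ->
# classifier); not the exact project code.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=100, random_state=1)),  # assumed size
    ("smote", SMOTE(random_state=1)),  # resampling is applied only on fit
    ("clf", LogisticRegression(penalty="l2", solver="liblinear",
                               class_weight="balanced", C=0.1)),
])
# pipeline.fit(train_texts, train_labels)
# pipeline.predict_proba(test_texts)  # probabilities for top-k accuracy
</pre></div>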
<h3 id="training-process-summary-flickr-model">Training Process Summary: Flickr Model</h3><h4 id="preprocessing-pipeline">Preprocessing Pipeline</h4><ol> <li>Deduplication</li> <li>Translation</li> <li>Stopword Removal, Lemmatization</li> <li>TF-IDF</li> </ol> <h4 id="model-selection">Model Selection</h4><div class="hll"><pre><span></span><span class="n">SVC</span><span class="p">(</span> <span class="n">C</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">kernel</span><span class="o">=</span><span class="s2">&quot;linear&quot;</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="s2">&quot;auto&quot;</span><span class="p">,</span> <span class="p">)</span> </pre></div> <h4 id="training-results">Training Results</h4><p>An accuracy of 66.87% was reached.</p> <h2 id="next-steps">Next Steps</h2><h3 id="from-preincarnation-to-present">From Preincarnation to Present</h3><p>Via the efforts described above, we have transformed the data retrieval process from one that was unstable, unexplored, and unavailable into an algorithmic, deterministic process that is reliable, documented, and interpretable. The visualizations have become more exhibitory, concentrate on carefully extracted insights, and look at Creative Commons in greater depth and more remarkable breadth.</p> <p>With the significant re-implementation of, and new design policies for, the data retrieval process of Quantifying the Commons, visualizations can now be produced immediately on command; and with the conceptual transformation of visualization production, Creative Commons will gain new insights into product development, and eventually policy, along the axes from which data was extracted. Furthermore, we expect the models to serve beyond the bounds of a machine learning product, as a means of drawing inferences about product usage.</p> <p><strong>Such efforts are a short jump start to the long-term reincarnation of Quantifying the Commons.</strong></p> <h3 id="from-reincarnation-onto-baton-touches">From Reincarnation onto Baton Touches</h3><p>The current team would encourage the future team to improve the availability and user experience of our open source data extraction method via automation and batched data extraction, for which Dun-Ming has written a design policy. For modeling, the team also encourages building inference pipelines that use ELI5 to interpret the Logistic Regression models, as well as experimenting more with the loss function options of the Gradient Boosting Classifier.
For Flickr, the writer of this poster would suggest exploring a data extraction method outside the Flickr API that still has access to Flickr media, say the Google Custom Search API.</p> <h2 id="additional-reading">Additional Reading</h2><ul> <li>Dun-Ming Huang blogs:<ul> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-0-10-d1844092fc7a">DSD Fall 2022: Quantifying the Commons (0/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-1-10-970dc24626b">DSD Fall 2022: Quantifying the Commons (1/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-2-10-537a5b204d7b">DSD Fall 2022: Quantifying the Commons (2/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-3-10-79bbfeb90daa">DSD Fall 2022: Quantifying the Commons (3/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-4-10-9bc90ec98262">DSD Fall 2022: Quantifying the Commons (4/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-5-10-475334a8895">DSD Fall 2022: Quantifying the Commons (5/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-6-10-961de95ef3aa">DSD Fall 2022: Quantifying the Commons (6/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-7a-10-ea011b9e05ee">DSD Fall 2022: Quantifying the Commons (7A/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-7b-10-e8bd8ba1c18a">DSD Fall 2022: Quantifying the Commons (7B/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-8a-10-6f5336c00d11">DSD Fall 2022: Quantifying the Commons (8A/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-8b-10-aa1ec8e2ae63">DSD Fall 2022: Quantifying the Commons (8B/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-9-10-536617bdcbb0">DSD Fall 2022: Quantifying the Commons (9/10) | by Bransthre | Nov, 2022 | Medium</a></li> <li><a href="https://medium.com/@bransthre/dsd-fall-2022-quantifying-the-commons-10-10-47cbcb9bc8c2">DSD Fall 2022: Quantifying the Commons (10/10) | by Bransthre | Nov, 2022 | Medium</a></li> </ul> </li> <li>Shuran Yang blog:<ul> <li><a href="https://medium.com/@shuran1030/quantifying-the-commons-data-science-discovery-program-fall-2022-8e8c15b1ace3">Quantifying the Commons - Data Science Discovery Program Fall 2022 | by Shuran Yang | Nov, 2022 | Medium</a></li> </ul> </li> </ul> CalVer to SemVer 2022-11-11T00:00:00Z ['TimidRobot'] urn:uuid:70d61d34-1664-30a4-81f1-8cf222f5b31f <p>Creative Commons (CC) tried to use CalVer (calendar versioning), but encountered too many issues and decided on SemVer (semantic versioning) instead.</p> <h2 id="why-we-chose-calver">Why we chose CalVer</h2><p>Years ago, the CC technology team standardized on using <a href="https://calver.org/">CalVer</a> as our versioning scheme. Specifically, we selected <code>YYYY.0M.MICRO</code>.
<a href="https://calver.org/">CalVer</a>:</p> <blockquote><ol> <li><strong><code>YYYY</code></strong> - Full year - 2006, 2016, 2106</li> <li><p><strong><code>0M</code></strong> - Zero-padded month - 01, 02 ... 11, 12</p> </li> <li><p><strong><code>Micro</code></strong> - The third and usually final number in the version. Sometimes referred to as the "patch" segment.</p> </li> </ol> </blockquote> <p>The use of CalVer was inspired by Ubuntu, pip, SaltStack, and others. It was thought that CalVer not only matched <a href="https://semver.org/">SemVer</a> in communicating potential risks to users, but also gave additional temporal context. Also, many argue that the promises of SemVer’s <code>MAJOR.MINOR.PATCH</code> go unfulfilled often enough that they lose meaning and that the differences between MINOR/PATCH are too poorly defined (more on these later).</p> <h2 id="issues-encountered-with-calver">Issues Encountered with CalVer</h2><h3 id="time/duration-is-not-primarily-relevant">Time/Duration Is Not Primarily Relevant</h3><p>CalVer is often favored by projects for which time/duration is of primary relevance (ex. Ubuntu releases which have a limited support window). However, none of CC’s projects have time/duration as a primary relevance.</p> <h3 id="major-expectations-and-slow-iteration"><code>MAJOR</code> Expectations and Slow Iteration</h3><p>SemVer is a formalization of longstanding convention. Many many users, especially developers, expect the first number of a versioning scheme to indicate change severity. With <code>YYYY</code> indicating current release year, the <code>YYYY.0M.MICRO</code> versioning scheme might set an expectation of significant changes or improvements (ex. <code>2021.09.1</code> to <code>2022.02.1</code>) even when the content of the changes are trivial. With <code>YYYY</code> indicating original release year, a slow moving but stable and functional release might appear abandoned or insecure (ex. <code>2019.03.2</code> in 2022).</p> <h3 id="poor-support-for-calver">Poor Support for CalVer</h3><p>We also encountered poor support for CalVer in software and systems. For example, NPM currently strips leading zeros which breaks CDN integration (<a href="https://github.com/cc-archive/vocabulary-legacy/issues/588.">CalVer and CDN compatibility · Issue #588 · creativecommons/vocabulary</a>).</p> <h2 id="using-semver">Using SemVer</h2><p>Our experiment with CalVer is a win for the scientific method. We can be more confident, today, that SemVer will treat both the developers and users of CC software better than CalVer.</p> <h3 id="semvers-promises-commitments">SemVer’s <del>Promises</del> Commitments</h3><p>The CC Technology team sees SemVer as a set of commitments we are making to the users and developers of CC open source software. We may not achieve perfection in fulfilling those commitments, but they outline expectations and we hope you’ll open an issue if we make a mistake.</p> <h3 id="cc-semver-specifics">CC SemVer Specifics</h3><p>We will be using <a href="https://semver.org/">SemVer</a> (semantic versioning) going forward. 
Building the CC Global Components Library 2022-03-17T00:00:00Z ['MuluhGodson'] urn:uuid:1140a314-fe2d-30a3-aeb4-19a4ba942e21 <h3 id="introduction">Introduction</h3><p>During the course of my Outreachy internship with Creative Commons, I got to work on some cool projects, one of which is the CC Global Components library, supervised by my mentor <a href="/blog/authors/brylie/">Brylie Christopher Oxley</a>.</p> <p>Having a unified design theme/look or experience across the different CC websites has always been an important factor while developing these websites. With this in mind, there are several components which are part of most CC web properties. The three components in particular are:</p> <ul> <li><strong>The Global navigation menu</strong>: displayed on sub-paths of the main creativecommons.org website, such as /licenses</li> <li><strong>The Global footer</strong>: displayed on most Creative Commons properties</li> <li><strong>The Explore CC component</strong>: displayed on all CC web properties, such as the Global Summit site</li> </ul> <p>Instead of having each project implement these components, leading to code duplication across projects and maintenance issues, we decided it was preferable to have a separate library of these components, which finally led to the CC Global Components project.</p> <h3 id="choosing-a-technology">Choosing a technology</h3><p>The goal of the Global components library was to build a custom web component that can be served via CDN. While planning, we needed to decide on the technology to use. Admittedly, most web frameworks like React and Vue could be used to develop this, but we wanted a simple implementation with fewer dependencies. We were looking for a technology that meets the following criteria:</p> <ul> <li>Web Standards oriented</li> <li>Clean separation of HTML, CSS, and JavaScript (structure, aesthetics, and functionality)</li> <li>Lightweight / small bundle size</li> <li>Loosely coupled (no tight or unrelated dependencies)</li> </ul> <p>The two primary technologies we were considering were <a href="https://v3.vuejs.org">Vue JS</a> and <a href="https://lwc.dev">Lightning Web Components</a>, but we finally decided to use Vue JS since we already had other projects developed in Vue (such as the Chooser project).</p> <h3 id="building-the-components">Building the components</h3><p>To scaffold the project, we used <a href="https://www.npmjs.com/package/vue-sfc-rollup">Vue SFC rollup</a>, which is a CLI templating utility that scaffolds a minimal setup for compiling a library of multiple Vue SFCs (Single File Components) into a form ready to share via npm.
With this, we could just focus on building the templates. We used <a href="https://cc-vocabulary.netlify.app/">Vocabulary CSS</a>, our own CC design package, to style the components.</p> <h4 id="1-cc-global-footer">1) CC Global Footer</h4><p>The CC Global Footer component was the easiest given that it's mostly static HTML. This component takes two attributes:</p> <ul> <li><code>logo-url</code>: which should point to the logo of the website it is used on.</li> <li><code>donation-url</code>: which is used for the donation button.</li> </ul> <p>After importing the CDN script for the CC Global components, we can then use the CC Global Footer in any page as such:</p> <div class="hll"><pre><span></span><span class="p">&lt;</span><span class="nt">cc-global-footer</span> <span class="na">donation-url</span><span class="o">=</span><span class="s">&quot;http://example.com&quot;</span> <span class="na">logo-url</span><span class="o">=</span><span class="s">&quot;/example/logo-white.png&quot;</span> <span class="p">/&gt;</span> </pre></div> <p>and this renders as shown below:</p> <p><img src="/blog/entries/building-the-cc-global-components-library/cc_global_footer.png" alt="CC Global Footer"></p> <h4 id="2-cc-explore">2) CC Explore</h4><p>The CC Explore component is an expandable banner which contains links to all the CC web properties. This component uses a click listener that toggles the expandable banner to show or hide when it is clicked. As with the CC Global Footer component, the CC Explore component takes two attributes.</p> <div class="hll"><pre><span></span><span class="p">&lt;</span><span class="nt">cc-explore</span> <span class="na">donation-url</span><span class="o">=</span><span class="s">&quot;http://example.com&quot;</span> <span class="na">logo-url</span><span class="o">=</span><span class="s">&quot;/example/logo-white.png&quot;</span> <span class="p">/&gt;</span> </pre></div> <p>and this renders as shown below:</p> <p><img src="/blog/entries/building-the-cc-global-components-library/cc_explore.gif" alt="CC Explore"></p> <h4 id="3-cc-global-header">3) CC Global Header</h4><p>The CC Global Header was an important component given that we had to make API calls to be able to render the menu items for downstream projects such as the <a href="https://github.com/creativecommons/cc-legal-tools-app">Licenses and Tools</a>. We used the Axios library for the API calls to the WordPress backend of the parent project <a href="https://github.com/creativecommons/project_creativecommons.org">project_creativecommons.org</a>.</p> <p>The CC Global Header has three required attributes, <code>base-url</code>, <code>donation-url</code> and <code>logo-url</code>, which are the URLs used for the API call, Donation button and Logo respectively. There is one additional attribute, <code>use-menu-placeholders</code>, you can set, which renders placeholder menu items if you are in a development environment.
However, for a staging/production setup we do not use this attribute.</p> <div class="hll"><pre><span></span><span class="p">&lt;</span><span class="nt">cc-global-header</span> <span class="na">base-url</span><span class="o">=</span><span class="s">&quot;http://127.0.0.1:8000&quot;</span> <span class="na">donation-url</span><span class="o">=</span><span class="s">&quot;http://example.com&quot;</span> <span class="na">use-menu-placeholders</span> <span class="na">logo-url</span><span class="o">=</span><span class="s">&quot;/example/logo-black.png&quot;</span> <span class="p">/&gt;</span> </pre></div> <p>and this renders as shown:</p> <p><img src="/blog/entries/building-the-cc-global-components-library/cc_global_header.png" alt="CC Global Header"></p> <h3 id="conclusion">Conclusion</h3><p>The first version of this library (0.1.1) was released and published to NPM on Dec 10, 2021. To date [the time of this writing], we have made several changes and optimizations to the code and are currently on version <code>0.5.0</code>. This was a really enriching experience for me as it was my first time working with Vue JS. We've also had additional code review and optimizations from <a href="/blog/authors/TimidRobot/">Timid Robot</a>.</p> <p>The CC Global Components with all 3 components used renders as:</p> <p><img src="/blog/entries/building-the-cc-global-components-library/cc_global_components.gif" alt="CC global components"></p> <p>You can find the CC Global Components project at:</p> <ul> <li>GitHub: <a href="https://github.com/cc-archive/cc-global-components">CC Global Components</a></li> <li>NPM: <a href="https://www.npmjs.com/package/@creativecommons/cc-global-components">cc-global-components</a></li> </ul> CC Messaging Update 2022Q1 (Dropping IRC) 2022-01-06T00:00:00Z ['TimidRobot'] urn:uuid:043a7604-a0bf-3890-b20d-de0f99b67f8c <h2 id="past-moved-to-slack">Past: Moved to Slack</h2><p>In 2016, Creative Commons (CC) moved to Slack as our primary messaging platform (<a href="https://creativecommons.org/2016/10/18/slack-announcement/">We're on Slack! Join us! - Creative Commons</a>). We are very thankful for the generous support that Slack has provided. The Slack messaging platform is far more accessible than IRC. We saw an immediate and sustained increase in our messaging community (<a href="https://creativecommons.org/2016/12/09/a-month-of-slack/">A month of Slack: Growing global communities every day - Creative Commons, Lessons learned from a year of Slack, 1000 members, and immeasurable community growth - Creative Commons</a>). We currently have 10,293 members in our Slack workspace. Of those, we see daily activity from an average of 250 of them spread across almost 70 public channels. The Slack platform is not without valid criticisms, but those will be addressed in the Future: Open Source section, below.</p> <h2 id="present-dropping-irc">Present: Dropping IRC</h2><p>When CC moved to Slack, we also set up a bridge with our three IRC channels on Freenode. However, those channels saw only a few active users and tens of messages per year. With the hostile takeover of Freenode in 2021, the Free/Libre and Open Source (FOSS) community has largely moved to <a href="https://libera.chat/">libera.chat</a>. However, we will not be moving our Slack/IRC bridge there. <strong>Effective 2022-01-24 we are dropping IRC as an officially supported messaging platform.</strong> In addition to there having been very few active users on IRC, many of the active IRC users also have active Slack accounts.
Dropping IRC will allow us to reallocate our technical resources to better serve the community as a whole.</p> <h2 id="future-open-source">Future: Open Source</h2><p>Over the years, Slack has had performance and UX issues. It is also designed around assumptions that do not fit a large open community. Those issues have not prevented it from being a strong and capable messaging platform that has served our community well. However, an Open Source messaging platform would better align with the Creative Commons community and the values we champion. The Open Source and Open Content communities have long enjoyed a significant overlap and collaboration. With regards to messaging, we hope to increase that overlap in the next year or two.</p> Upcoming Changes to the CC Open Source Community 2020-12-07T00:00:00Z ['kgodey'] urn:uuid:f2584ce6-4f24-3ccb-97d8-9f68e62bc65a <p>Creative Commons (CC) is adopting a brand new organizational strategy in 2021, just in time for our 20th anniversary. As part of the organization's evolution in alignment with the new strategy, <a href="/blog/authors/aldenpage/">Alden Page</a>, <a href="/blog/authors/mathemancer/">Brent Moran</a>, <a href="/blog/authors/hugosolar/">Hugo Solar</a>, and I (<a href="/blog/authors/kgodey/">Kriti Godey</a>) will have departed Creative Commons by the end of December. Moving forward, the CC staff engineering team of <a href="/blog/authors/TimidRobot/">Timid Robot Zehta</a> and <a href="/blog/authors/zackkrida/">Zack Krida</a> will focus on supporting a smaller set of core projects.</p> <p>We are extremely proud of the work we have done together to build CC's vibrant open source community over the past two years. And of course, we're thankful for all the amazing contributions that all our community members have made. We've made significant improvements to existing tools, and launched entirely new projects with your help. <a href="/blog/categories/cc-vocabulary/">We created Vocabulary,</a> a design system for Creative Commons and launched half a dozen sites using it. We added <a href="/blog/categories/cc-catalog/">dozens of new sources to CC Search</a> and improved <a href="/blog/authors/AyanChoudhary/">its accessibility</a>. We released tools such as the <a href="/blog/authors/ahmadbilaldev/">CC WordPress plugin</a> and <a href="/blog/authors/makkoncept/">CC Search browser extension</a> that integrated CC licensing with widely used software. And, there's so much more.</p> <h3 id="community-changes">Community Changes</h3><p>The CC Open Source community remains central to our engineering work, and we will continue to support you in every way we can. However, based on the new staff capacity, we will be making a few changes to our community processes:</p> <ul> <li>Community Team members will no longer have access to CC's Asana.
Most tasks are tracked on GitHub, and managing Asana adds unnecessary complexity to the community team.</li> <li>We will invite all Community Team members to meetings and documents open to the community, regardless of role.</li> <li>We will deprecate the "community-team-core" mailing list in favor of a single "community-team" mailing list.</li> <li>We will have a new monthly Open Source Community meeting and cancel the existing biweekly Engineering Meeting.</li> <li>We will no longer have a paid Open Source Community Coordinator, <a href="/community/community-team/community-building-roles/">relying instead on volunteers</a> to help assist new community members, maintain our Twitter account, etc.</li> </ul> <p>We welcome new Community Team members and we will continue to participate in internship programs such as Google Summer of Code.</p> <h3 id="project-changes">Project Changes</h3><p>With a smaller engineering team, we will need to support fewer projects. Please see below for the current status of all projects with at least one Community Team member.</p> <p><strong>Active Development</strong></p> <p>We will continue to actively develop the following projects:</p> <ul> <li><a href="https://github.com/creativecommons/ccsearch-browser-extension">CC Search Browser Extension</a> (maintainer: Mayank Nader)</li> <li><a href="https://github.com/creativecommons/creativecommons.github.io-source">CC Open Source website</a> (maintainers: Zack Krida &amp; Timid Robot Zehta)</li> <li><a href="https://github.com/creativecommons/creativecommons-base">CC WordPress base</a> &amp; child themes (new maintainer: Zack Krida)</li> <li><a href="https://github.com/creativecommons/legaldb">CC Legal Database</a> (maintainer: Timid Robot Zehta)</li> <li><a href="https://github.com/creativecommons/chooser">CC Chooser</a> (maintainer: Zack Krida)</li> <li><a href="https://github.com/creativecommons/cc-link-checker/">CC Link Checker</a> (maintainer: Timid Robot Zehta)</li> <li><a href="https://github.com/creativecommons/licensebuttons/">License Buttons</a> (maintainer: Timid Robot Zehta)</li> <li><a href="https://github.com/creativecommons/mp/">Platform Toolkit</a> (maintainer: Timid Robot Zehta)</li> <li><a href="https://github.com/creativecommons/vocabulary">Vocabulary</a> (maintainers: Zack Krida &amp; Dhruv Bhanushali)</li> <li><a href="https://github.com/creativecommons/wp-plugin-creativecommons">WordPress Plugin</a> (new maintainer: Zack Krida)</li> </ul> <p><strong>Maintenance Mode</strong></p> <p>The following projects are entering maintenance mode. The services will remain online, but we will not accept any new pull requests or deploy new code after Dec 15, 2020.</p> <ul> <li><a href="https://github.com/cc-archive/cccatalog">CC Catalog</a></li> <li><a href="https://github.com/cc-archive/cccatalog-api">CC Catalog API</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/">CC Search</a></li> <li><a href="https://github.com/cc-archive/cccatalog-dataviz/">Linked Commons</a></li> </ul> <p>Catalog, API, and Linked Commons contributors are encouraged to contribute to our other Python projects such as the <a href="https://github.com/creativecommons/legaldb">CC Legal Database</a> or the upcoming <a href="https://github.com/creativecommons/cc-licenses">CC Licenses</a> project. 
If you are a CC Search contributor, we recommend checking out frontend projects such as the <a href="https://github.com/creativecommons/chooser">CC Chooser</a> or <a href="https://github.com/creativecommons/vocabulary">Vocabulary</a>.</p> <h3 id="thank-you">Thank You!</h3><p>We cannot express our gratitude to our community enough. You are all an absolute pleasure to work with, and we're looking forward to continuing to collaborate with you for years to come.</p> Vocabulary Landing Page & Usage Guide Final Report 2020-12-03T00:00:00Z ['nimishbongale'] urn:uuid:9c4ce7a3-30fa-397d-bbcd-c2b9ba3658c3 <p>We have reached the end of this wonderful journey. Let's comprehensively recap all my contributions during the GSoD internship period!</p> <h2 id="vocabulary-site-updates-edition-4/4">Vocabulary Site Updates (Edition 4/4)</h2><p>After securing acceptance, I received the necessary GitHub invites. I was given write access to the <a href="https://github.com/creativecommons/vocabulary">Vocabulary GitHub repository</a> as a <strong>CC Vocabulary Core Committer</strong>.</p> <h3 id="proposed-initial-plan">Proposed Initial Plan</h3><h4 id="project-synopsis">Project Synopsis</h4><p>Vocabulary has immense potential to be used as a primary UI component library for website building. What it needs is a robust yet layman-friendly how-to guide. Important developer information such as component guides, usage specifications and configuration tweaks form an essential part of any documentation. This will not only encourage existing users to get a feel of how Vocabulary continues to grow and reach new milestones, but also promote the usage of Vocabulary in comparatively newer projects. The desired outcomes of my stint as an intern would involve not only penning a no-nonsense guide to using the pre-existing components but also designing and developing a home page (leading to integrated documentation for each) for Vocabulary, Vue-Vocabulary and Fonts.</p> <h3 id="proposed-improvised-timelines-deliverables">Proposed &amp; Improvised Timelines &amp; Deliverables</h3><p>Here's a list of all the weekly goals that I met:</p> <p><strong>Pre-Internship</strong></p> <ul> <li>Understood Creative Commons as an organisation, its work and related ethics.</li> <li>Had a look at CC’s GitHub repositories and understood the code structure.</li> <li>Opened issues and PRs to get acquainted with the repository workflows.</li> <li>Interacted with my mentor and established the basic ideas regarding the project in question.</li> <li>Further researched the needs of the project, and pondered its potential impact after implementation.</li> </ul> <p><strong>Week 1</strong> (09/14 - 09/21)</p> <ul> <li>Understood Vocabulary, Vue-Vocabulary and Fonts in greater depth, and their existing components.</li> <li>Designed a first-look unified landing page for Vocabulary, Vue-Vocabulary and Fonts based on Vocabulary components.</li> <li>Interacted with my mentor and other team members and established a rapport.</li> </ul> <p><strong>Week 2</strong> (09/22 - 09/28)</p> <ul> <li>Tackled queries regarding the choice of design, page structure etc., and sought approval from CC’s UX Designer.</li> <li>Began to write the content which will need to fill up the main landing page.</li> </ul> <p><strong>Week 3</strong> (09/29 - 10/06)</p> <ul> <li>Finalized the headings, sub-headings and other sections which will need to be present in the landing site &amp; documentation.</li> <li>Kept the code ready for accepting documentation
contents. Configured GitHub Pages/Netlify/Surge for continuous integration and deployment.</li> </ul> <p><strong>Week 4</strong> (10/07 - 10/14)</p> <ul> <li>Began to write under the “Introduction”, “Getting Started” and “Grid Components” sub-headings of the documentation.</li> <li>Started developing the main landing page using Vocabulary components.</li> </ul> <p><strong>Week 5</strong> (10/15 - 10/22)</p> <ul> <li>Got complete approval for the main page contents.</li> <li>Worked on coding the “Dark Theme”.</li> <li>Facilitated Hacktoberfest contributors and spoke at a CCOS event.</li> </ul> <p><strong>Week 6</strong> (10/23 - 10/30)</p> <ul> <li>Wrote a mid-internship blog post describing the work done and how the experience had been so far with CC.</li> <li>Started compiling the document guides for all the components in Vocabulary. Made revamps where necessary.</li> </ul> <p><strong>Week 7</strong> (10/31 - 11/07)</p> <ul> <li>Integrated the main page contents and the main landing page itself, and had it up and running.</li> </ul> <p><strong>Week 8</strong> (11/08 - 11/15)</p> <ul> <li>Finished writing the Vocabulary usage guide and sought initial approval.</li> </ul> <p><strong>Week 9</strong> (11/16 - 11/23)</p> <ul> <li>Finalized the guides and the main page contents.</li> <li>Carried out the necessary landing-page-to-doc integration.</li> <li>Published a sample build using Surge for viewing and surveying purposes.</li> </ul> <p><strong>Week 10</strong> (11/24 - 11/30)</p> <ul> <li>Surveyed development builds for accessibility using WAVE and Accessibility Insights for Web.</li> <li>Surveyed the site for responsiveness using Chrome DevTools.</li> <li>Generated Lighthouse reports.</li> <li>Optimised for search engines using meta tags and external links.</li> </ul> <p><strong>Week 11</strong> (11/30 - 12/05)</p> <ul> <li>Worked towards improving the report statistics until they reached a respectable target.</li> <li>Wrote a blog post summarizing everything, including my performance and involvement in CC.</li> </ul> <p><strong>Week 12</strong> (12/06 - 12/12)</p> <ul> <li>Sought daily approvals until everything was finalised.</li> <li>Went through my writings and code umpteen times for any minuscule errors.</li> </ul> <p><strong>Week 13</strong> (12/13 - 12/19)</p> <ul> <li>Cleaned code, making sure everything was properly linted and ready before the final closing commits.</li> <li>Published the “Concluding Internship” blog post, rounding up my wholesome journey.</li> <li>Sought final closing approval.</li> </ul> <p><strong>Post-Internship</strong></p> <ul> <li>Promote the use of CC attributed works.</li> <li>Interact with the community, answer queries or doubts regarding CC.</li> <li>Carry out community work for the repositories I’ve contributed to.</li> <li>Leverage experience gained during this internship for future endeavours.</li> </ul> <h3 id="the-vocabulary-site">The Vocabulary Site</h3><p>Here's the link to <a href="https://cc-vocab-draft.web.app">the landing site</a>.</p> <ul> <li>Went through <strong>3</strong> design iterations.</li> <li>Designed the mockups in <a href="https://figma.com">Figma</a>.</li> <li>Wrote the content filling up the landing page.</li> <li>After approval from the UX Designer, waited for approval from the Frontend Engineer.</li> <li>Sought continuous approval from my mentor <a href="/blog/authors/dhruvkb/">dhruvkb</a>.</li> <li>Used <a href="https://vuejs.org">Vue.js</a> + <a href="https://www.npmjs.com/package/@creativecommons/vocabulary">CC Vocabulary</a> to build
a highly modularised site.</li> <li>Went through a couple of iterations of the website itself.</li> <li>Made about <strong>112</strong> commits (<strong>15,000</strong> lines of code) in my <em>gsod-nimish</em> branch.</li> </ul> <pre> <center> <img alt="Contributions to CC" src="github.png"/><br> <small class="muted">All my contributions to Creative Commons!</small> </center> </pre><ul> <li>Used the GitHub API to display repository statistics.</li> </ul> <pre> <center> <img alt="Fetch stats from GitHub API" src="stats.png"/><br> <small class="muted">Fetching dynamic stats from the GitHub API</small> </center> </pre><ul> <li>My PR was reviewed and merged on the <strong>25th of November</strong>.</li> </ul> <p>Here's how the site looks right now:</p> <pre> <center> <img alt="The final website!" src="website.png"/><br> <small class="muted">Snapshot of the final website!</small> </center> </pre><ul> <li>Used <a href="https://surge.sh">Surge</a> &amp; <a href="https://web.app">Firebase</a> for draft deploys.</li> <li>Carried out <a href="https://developers.google.com/web/tools/lighthouse">Lighthouse</a> testing.</li> </ul> <pre> <center> <img alt="Lighthouse reports" src="light.png"/><br> <small class="muted">Lighthouse reports for our live site</small> </center> </pre><ul> <li>Prompted changes to improve accessibility, SEO and PWA characteristics.</li> </ul> <h3 id="core-documentation">Core Documentation</h3><p>Here's the link to the <a href="https://cc-vocabulary.netlify.app">documentation site</a>.</p> <ul> <li>Used <a href="https://storybook.js.org/">StorybookJS</a>.</li> <li>Modified the existing overview page.</li> <li>Removed highly verbose sections from the docs.</li> <li>Documented the Vocabulary sprint planning workflow.</li> <li>Documented how to use a markdown component with CC Vocabulary.</li> <li>Embedded hyperlinks to other open source projects to improve SEO.</li> <li>Increased uniformity across documentation present in the storybooks.</li> <li>Added alt descriptions &amp; aria labels for certain images to improve accessibility.</li> </ul> <h3 id="my-learnings-and-challenges">My Learnings And Challenges</h3><ul> <li>Design is more than just picking colors and placing components on a grey screen.</li> <li>It's important to read your own writing from an unbiased perspective to actually understand how well it will be perceived.</li> <li>Publishing to npmjs is not difficult!</li> <li>Knowing the previously existing code in your project is essential. It's important to understand the code styles, structure &amp; activity of the code that you are dealing with.</li> <li>Be patient! It's fine to delay something if it makes sense to have it logically accomplished only after certain other tasks are done &amp; dusted with.</li> <li>How essential it is to write neat code is something that's not spoken of too often. (I wonder why...)</li> <li>I always thought Vue.js sets up SPAs by default. I'm surprised you need to configure it additionally to do just that!</li> <li>Storybook is just a really nifty OSS with great community support!</li> <li>Vue.js is fantastic. Maybe I'm a Vue.js fan now. Should I remain loyal to React? I don't know.</li> <li>Making a site responsive isn't the easiest of tasks, but it's certainly doable after a lot of stretching &amp; compressing; let's say that.</li> <li>"Code formatting is essential" would be an understatement to make.</li> <li>Monorepos have their own pros and cons.
But in our case the cons were negligible, thankfully!</li> <li>GSoD isn't just about documentation; there's some serious amount of coding too!</li> <li>You don't have to sit and write code for hours together. Take breaks, come back, and the fix will strike you sooner than ever.</li> <li>Timelines change; improvisation is an essential aspect of any project!</li> <li>MDX is a neat little format to code in! Documenting code is just so much easier.</li> <li>Things become obsolete. Versions become outdated. Code maintenance is therefore easier said than done!</li> </ul> <h3 id="issues-pr-s-raised-during-gsod-period">Issues &amp; PRs raised during the GSoD period</h3><table> <thead> <tr> <th>Repository</th> <th>Contribution</th> <th>Relevant links</th> </tr> </thead> <tbody> <tr> <td rowspan=14><a href="https://github.com/creativecommons/vocabulary">@creativecommons/vocabulary</a></td> <td>Developed the CC Vocabulary Landing Page</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/747">https://github.com/cc-archive/vocabulary-legacy/pull/747</a><br><a href="https://cc-vocab-draft.web.app">https://cc-vocab-draft.web.app</a></td> </tr> <tr> <td>Implemented dark mode for our storybooks</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/806">https://github.com/cc-archive/vocabulary-legacy/pull/806</a><br><a href="https://cc-vocabulary.netlify.app">https://cc-vocabulary.netlify.app</a></td> </tr> <tr> <td>Carried out a monorepo wide documentation revamp</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/813">https://github.com/cc-archive/vocabulary-legacy/pull/813</a></td> </tr> <tr> <td>Wrote the Monorepo Documentation Story</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/785">https://github.com/cc-archive/vocabulary-legacy/pull/785</a><br><a href="https://cc-vocabulary.netlify.app/?path=/docs/vocabulary-structure--page#why-is-vocabulary-a-monorepo">https://cc-vocabulary.netlify.app/?path=/docs/vocabulary-structure--page#why-is-vocabulary-a-monorepo</a></td> </tr> <tr> <td>Wrote the Grid Documentation Story</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/802">https://github.com/cc-archive/vocabulary-legacy/pull/802</a><br><a href="https://cc-vocabulary.netlify.app/?path=/docs/layouts-grid--fullhd#grid-system">https://cc-vocabulary.netlify.app/?path=/docs/layouts-grid--fullhd#grid-system</a></td> </tr> <tr> <td>Wrote the "Getting Started" Usage Guide</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/774">https://github.com/cc-archive/vocabulary-legacy/pull/774</a><br><a href="https://cc-vocabulary.netlify.app/?path=/story/vocabulary-getting-started--page#getting-started">https://cc-vocabulary.netlify.app/?path=/story/vocabulary-getting-started--page#getting-started</a></td> </tr> <tr> <td>Added a CHANGELOG.md to adhere to OSS conventions</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/671">https://github.com/cc-archive/vocabulary-legacy/pull/671</a><br><a href="https://github.com/cc-archive/vocabulary-legacy/blob/main/CHANGELOG.md">https://github.com/cc-archive/vocabulary-legacy/blob/main/CHANGELOG.md</a></td> </tr> <tr> <td>Unified README.md and updated monorepo build process</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/649">https://github.com/cc-archive/vocabulary-legacy/pull/649</a><br><a href="https://www.npmjs.com/package/@creativecommons/vocabulary">https://www.npmjs.com/package/@creativecommons/vocabulary</a><br><a
href="https://www.npmjs.com/package/@creativecommons/fonts">https://www.npmjs.com/package/@creativecommons/fonts</a><br><a href="https://www.npmjs.com/package/@creativecommons/vue-vocabulary">https://www.npmjs.com/package/@creativecommons/vue-vocabulary</a></td> </tr> <tr> <td>Configured GitHub native dependabot</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/452">https://github.com/cc-archive/vocabulary-legacy/pull/452</a></td> </tr> <tr> <td>Added phone screen backgrounds</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/pull/445">https://github.com/cc-archive/vocabulary-legacy/pull/445</a></td> </tr> <tr> <td>Introduce Snapshot Testing to Vocabulary using Chromatic</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/issues/735">https://github.com/cc-archive/vocabulary-legacy/issues/735</a></td> </tr> <tr> <td>Add a maintained with Lerna badge</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/issues/807">https://github.com/cc-archive/vocabulary-legacy/issues/807</a><br><a href="https://github.com/cc-archive/vocabulary-legacy/blob/main/README.md">https://github.com/cc-archive/vocabulary-legacy/blob/main/README.md</a></td> </tr> <tr> <td>Add new install size badges for our packages</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/issues/776">https://github.com/cc-archive/vocabulary-legacy/issues/776</a><br><a href="https://github.com/cc-archive/vocabulary-legacy/blob/main/README.md">https://github.com/cc-archive/vocabulary-legacy/blob/main/README.md</a</td> </tr> <tr> <td>Customise individual README's for our packages</td> <td><a href="https://github.com/cc-archive/vocabulary-legacy/issues/736">https://github.com/cc-archive/vocabulary-legacy/issues/736</a></td> </tr> <tr> <td rowspan=5><a href="https://github.com/creativecommons/creativecommons.github.io-source">@creativecommons/creativecommons.github.io-source</a></td> <td>Introductory First Blog Post</td> <td><a href="https://github.com/creativecommons/creativecommons.github.io-source/pull/530">https://github.com/creativecommons/creativecommons.github.io-source/pull/530</a><br><a href="/blog/entries/cc-vocabulary-docs-intro/">/blog/entries/cc-vocabulary-docs-intro/</a></td> </tr> <tr> <td>Vocabulary Site Update v1</td> <td><a href="https://github.com/creativecommons/creativecommons.github.io-source/pull/549">https://github.com/creativecommons/creativecommons.github.io-source/pull/549</a><br><a href="/blog/entries/cc-vocabulary-docs-updates-1/">/blog/entries/cc-vocabulary-docs-updates-1/</a></td> </tr> <tr> <td>Vocabulary Mid Internship Update v2</td> <td><a href="https://github.com/creativecommons/creativecommons.github.io-source/pull/555">https://github.com/creativecommons/creativecommons.github.io-source/pull/555</a><br><a href="/blog/entries/cc-vocabulary-docs-updates-2/">/blog/entries/cc-vocabulary-docs-updates-2/</a></td> </tr> <tr> <td>Vocabulary Site Update v3</td> <td><a href="https://github.com/creativecommons/creativecommons.github.io-source/pull/561">https://github.com/creativecommons/creativecommons.github.io-source/pull/561</a><br><a href="/blog/entries/cc-vocabulary-docs-updates-3/">/blog/entries/cc-vocabulary-docs-updates-3/</a></td> </tr> <tr> <td>Vocabulary Site Final Update</td> <td><a href="https://github.com/creativecommons/creativecommons.github.io-source/pull/564">https://github.com/creativecommons/creativecommons.github.io-source/pull/564</a><br><a href="/">/blog/entries/cc-vocabulary-docs-updates-closing/</a></td> </tr> <tr> 
<td><a href="https://github.com/cc-archive/cccatalog-api">@cc-archive/cccatalog-api</a></td> <td>Configured GitHub native dependabot</td> <td><a href="https://github.com/cc-archive/cccatalog-api/pull/53">https://github.com/cc-archive/cccatalog-api/pull/53</a></td> </tr> <tr> <td><a href="https://github.com/creativecommons/ccos-scripts">@creativecommons/ccos-scripts</a></td> <td>Fix file extension in README.md docs</td> <td><a href="https://github.com/creativecommons/ccos-scripts/pull/100">https://github.com/creativecommons/ccos-scripts/pull/100</a></td> </tr> </tbody> </table><p>Follow along my complete GSoD journey through <a href="/blog/series/gsod-2020-vocabulary-usage-guide/">these series of posts</a>.</p> <h3 id="memorable-milestones-screenshots">Memorable Milestones Screenshots</h3><pre> <center> <img alt"Merged!" src="merged747.png"/><br> <small class="muted">GSoD PR merged!</small> </center> </pre> <br> <pre> <center> <img alt"Dark Mode" src="darkmode.png"/><br> <small class="muted">Behold the dark theme!</small> </center> </pre> <br> <pre> <center> <img alt"Grid Docs" src="grid.png"/><br> <small class="muted">Grid Documenation Story</small> </center> </pre> <br> <pre> <center> <img alt"Monorepo Document Story" src="structure.png"/><br> <small class="muted">Monorepo Structure Story</small> </center> </pre><h3 id="conclusion">Conclusion</h3><p>My GSoD internship has been by far, a very successful and a fruitful one. I thank the the GSoD team for all their efforts in oragnising it this year. I would also like to thank the entire Creative Commons team for all their motivation and support. The onboarding &amp; see-off was very smooth indeed!</p> <p align="center"> <strong>Thank you for all your time! This was the final blog post under the Vocabulary docs series. I'll be around for times to come, but until then, sayonara!</strong> </p> Summary: My GSoD 2020 Journey 2020-12-02T00:00:00Z ['ariessa'] urn:uuid:8ef2254d-edd9-37d9-a144-5b539249b19f <p>Thank you for the wonderful experience, Creative Commons!</p> <p>This blog post serves as a project report for ‘Improve CC Catalog API Usage Guide’. It describes the work that I’ve done during my Google Season Of Docs (GSOD) 2020. My mentors for this project are Alden Page and Kriti Godey from Creative Commons.</p> <p>In total, there are 12 weeks in the Doc Development Phase. Every 2 weeks, I would publish a blog post to update my progress to my mentors and organization.</p> <h3 id="week-1">Week 1</h3><p>So, the first two weeks of Google Season of Docs have passed. For the first week, I added examples to perform the query using curl command. I hit some problem with a Forbidden error. Turns out my access key got expired. My problem was solved after obtaining a new access key.</p> <h3 id="week-2">Week 2</h3><p>For the second week, I started to write response samples. It was tough as I have a hard time understanding drf-yasg, which is an automatic Swagger generator. It can produce Swagger / OpenAPI 2.0 specifications from a Django Rest Framework API. I tried to find as many examples as I could to increase my understanding. Funny, but it took me awhile to realise that drf-yasg is not made up of random letters. The DRF part stands for Django Rest Framework while YASG stands for Yet Another Swagger Generator.</p> <h3 id="week-3">Week 3</h3><p>Week 3 was quite hectic. I moved back to my hometown during week 3. Took 3 days off to settle my stuff and set up a workspace. I worked on my GSoD project for only 2 days, Monday and Tuesday. 
I managed to create response samples for most API endpoints. I also had a monthly video call with Kriti this week.</p> <h3 id="week-4">Week 4</h3><p>I reviewed what I had done and what I hadn't, to estimate a new completion time. Thank god, I have a buffer week in my GSoD timeline and deliverables. So yeah, all is good in terms of completion time. I started to write descriptions for API endpoints. Submitted my first PR and published a blog entry.</p> <h3 id="week-5">Week 5</h3><p>I managed to add a lot of stuff into the documentation. I figured out how to add help texts to classes and how to create serializers. I also managed to move all code examples under response samples. In order to do this, I created a new class called CustomAutoSchema to add x-code-samples. Other stuff that I did included creating new sections such as “Register and Authenticate” and “Glossary”. The hardest part of this week was probably trying to figure out how to add request body examples and move code examples.</p> <h3 id="week-6">Week 6</h3><p>I added another section called Contribute that provides a to-do list to start contributing on GitHub. I also wrote and published this blog post.</p> <h3 id="week-7">Week 7</h3><p>I restructured the README file in the CC Catalog API repository. I added a step-by-step guide on how to run the server locally. I hope new users will be less intimidated to contribute to this project with the updated guide on how to run the server locally.</p> <h3 id="week-8">Week 8</h3><p>I created Documentation Guidelines, which provide steps on how to contribute to the CC Catalog API documentation, documentation styles, and a cheat sheet for drf-yasg. I also wrote and published this blog post.</p> <h3 id="week-9">Week 9</h3><p>I had completed all GSoD tasks by week 9. So, I took a couple of days off and fixed last week's PR. Kriti assigned me a new task: porting the CC Catalog documentation from the internal wiki into the GitHub repository. Brent, the CC Catalog maintainer, explained what needed to be done.</p> <h3 id="week-10">Week 10</h3><p>I started exploring CC Catalog and its documentation. It reminded me a lot of the first and second weeks of GSoD: trying to understand new stuff and having an "aha" moment when the dots finally connect. I started to move the documentation from the internal wiki to CC Catalog’s GitHub repository. I also wrote and published this blog post.</p> <h3 id="week-11">Week 11</h3><p>I finished working on porting the CC Catalog documentation from the internal wiki to CC Catalog’s GitHub repository. Kriti told me that there would be a meeting in which I had to present what I'd done for GSoD. Since the meeting would take place at 1 AM my local time, Kriti told me I should send a video presentation instead.</p> <h3 id="week-12">Week 12</h3><p>I submitted a video presentation to Kriti. Finished writing the project report and evaluation for GSoD. I published 2 blog posts this week. One for updates on Week 11 and Week 12.
The other is this blog post.</p> <p><br/></p> <hr> <p>You can view the latest CC Catalog API documentation <a href="https://api.creativecommons.engineering/v1/">here</a>.</p> Finish Video Presentation, Project Report and Evaluation Form 2020-12-01T00:00:00Z ['ariessa'] urn:uuid:3c10027c-ae79-392a-961f-ef9a2362be2a <p>For weeks 11 and 12, I finished porting the CC Catalog documentation, submitted a video presentation, and wrapped up my GSoD 2020 journey.</p> <h3 id="week-11">Week 11</h3><p>For Week 11, I finished porting the CC Catalog documentation from the internal wiki to CC Catalog’s GitHub repository. Kriti told me that there would be a meeting in which I had to present what I'd done for GSoD. Since the meeting would take place at 1 AM in my local time, Kriti told me that I should send a video presentation instead.</p> <h3 id="week-12">Week 12</h3><p>For this week, I submitted a video presentation to Kriti and finished writing the project report and evaluation for GSoD. I published 2 blog posts this week. One was for updates on Week 11 and Week 12. The other is a summary of my GSoD 2020 journey, which also serves as a project report.</p> <hr> <p>Signing off.</p> Presenting CC Base docs - A WordPress Base Theme Usage Guide for the CC Base Theme 2020-11-27T00:00:00Z ['JackieBinya'] urn:uuid:c27f3a10-a8a1-3cd9-8c66-89cb70d26f58 <p>We are live!</p> <p>The CC Base documentation is live, and it's available at this <a href="https://cc-wp-theme-base.netlify.app/">link</a>.</p> <p>The docs were successfully migrated from Google Docs to the site! One of the most notable changes in the theme, consequently reflected in the documentation, is the product name change: the CC WP Theme Base has been renamed to CC Base.</p> <p>But, as the old adage says, good documentation is never complete; we hope to engage the Creative Commons community and perform usability tests. Any feedback gathered from the usability tests will then be used to further improve the CC Base docs.</p> <p>In future iterations of the docs development we hope to include the following features:</p> <ul> <li>Increase the quantity of illustrative media so as to make the docs more intuitive. This will involve adding video tutorials on how to use certain features of the CC Base theme, as well as illustrative tree diagrams to explain the hierarchy of key directories and files in the CC Base project structure.</li> <li>Integration of <a href="https://www.algolia.com/">Algolia</a>, a software tool used to power search functionality in statically generated sites.</li> <li>We also hope to improve SEO for the site.</li> </ul> <p>All the above-mentioned improvements are geared towards improving the overall user experience of the docs, as well as ensuring faster onboarding for community members getting started with the CC Base theme.</p> <p>In conclusion, I would like to thank all members of the Creative Commons engineering team, with special mention to Hugo Solar and Kriti Godey. Thank you for your guidance and faith in my abilities as a technical writer and software developer.</p> Vocabulary Site Updates (Part 3/n) 2020-11-25T00:00:00Z ['nimishbongale'] urn:uuid:9b0d8e22-8041-3b9a-a0dc-66d44cdb924f <p>Excited to know more about this week's vocabulary site updates?
Read on to find out!</p> <h2 id="vocabulary-site-updates-edition-3/many-more-to-come">Vocabulary Site Updates (Edition 3/many more to come)</h2><h3 id="what-i-ve-been-up-to">What I've been up to</h3><center> <img alt="Merged!" src="merged.png"/><br> <small class="muted">The surreal feeling...</small> </center><p>Merged? Yes. <strong>Merged</strong>. Here's my story!</p> <ul> <li>After getting a thumbs up from the UX Designer, I put up my <a href="https://github.com/cc-archive/vocabulary-legacy/pull/747">GSoD Website PR</a> for review.</li> <li>I was confident there would be changes, and I let them roll in. It's important to note here that what seems perfect to you may not be so to others, and only experience teaches you right from wrong.</li> <li>There were a few of them, mainly dealing with spacing, textual content and colors. I resolved them as soon as I could.</li> <li><a href="/blog/authors/zackkrida/">zackkrida</a> has been kind enough to point out and enumerate all of them for me!</li> <li>After receiving a final approval from the engineering team, my PR was finally merged!</li> <li>The final draft of the vocabulary site is live! It will soon be deployed (on <a href="https://netlify.com">Netlify</a>) and be made available for public viewing.</li> <li>For my readers, here's an <a href="https://cc-vocab-draft.web.app">exclusive preview</a> of the final draft.</li> <li>I've tried making it as optimised as possible, but if you have any inputs whatsoever, feel free to raise issues over on our <a href="https://github.com/creativecommons/vocabulary">GitHub repository</a>.</li> <li>The famed <a href="https://developers.google.com/web/tools/lighthouse">Lighthouse report</a> suggests that it's a pretty good start! I've also taken care of the <a href="https://www.w3.org/standards/webdesign/accessibility">accessibility aspect</a> wherever applicable.</li> </ul> <center> <img alt="Lighthouse report" src="light.png"/><br> <small class="muted">Aiming high!</small> </center><h3 id="what-i-ve-learnt">What I've learnt</h3><ul> <li>GSoD isn't just about documentation; there's some serious amount of coding too!</li> <li>You don't have to sit and write code for hours together. Take breaks, come back, and the fix will strike you sooner than ever.</li> <li>Timelines change; improvisation is an essential aspect of any project!</li> <li><a href="https://mdxjs.com/">MDX</a> is a neat little format to code in! Documenting code is just so much easier.</li> <li>Things become obsolete. Versions become outdated. Code maintenance is, therefore, easier said than done!</li> </ul> <h3 id="other-community-work-tidbits">Other community work tidbits</h3><p>Being a part of an open source organisation also means that I must try to bring in contributions from existing &amp; first-time contributors. Here's a peek into my efforts for the same:</p> <ul> <li>The <a href="https://github.com/cc-archive/vocabulary-legacy/pull/806">dark mode PR</a> started off as a Hacktoberfest contribution, and it is now complete!</li> <li>Created a <code>/shared</code> package to house common files between packages (such as the dark &amp; light themes) after referring to the <a href="https://reactjs.org/">React</a> documentation.</li> <li>The automated npm <a href="https://github.com/cc-archive/vocabulary-legacy/pull/746">README.md customisation</a> is now up and running.
(I really had a blast solving that issue!)</li> <li>If the snapshot testing stands approved, we'll have it running on Chromatic!</li> <li>Raised issues to add multiple badges to the root README.md file; namely <code>maintained with Lerna</code> &amp; custom badges for package sizes from <a href="https://packagephobia.com/">packagephobia</a>.</li> </ul> <p align="center"> <strong>Thank you for your time! Stay put for the season finale!</strong> </p> Finish GSoD Tasks and Explore CC Catalog Documentation 2020-11-20T00:00:00Z ['ariessa'] urn:uuid:d5ea49f0-63f2-3466-93e0-7c513b4dc2d6 <p>Today marks my fifth blog entry on Creative Commons. For weeks 9 and 10, I explored the CC Catalog documentation and began improving it by removing keys and generalizing instructions.</p> <h3 id="week-9">Week 9</h3><p>I had completed all GSoD tasks by week 9. So, I took a couple of days off and fixed last week's PR. Kriti assigned me a new task: porting the CC Catalog documentation from the internal wiki into the GitHub repository. Brent, the CC Catalog maintainer, explained what needed to be done.</p> <h3 id="week-10">Week 10</h3><p>For week 10, I started exploring CC Catalog and its documentation. It reminded me a lot of the first and second weeks of GSoD: trying to understand new stuff and having an "aha" moment when the dots finally connect. I started to move the documentation from the internal wiki to CC Catalog’s GitHub repository. I also wrote and published this blog post.</p> <hr> <p>End of blog entry.</p> Content Creation Phase: WordPress Base Theme Usage Guide 2020-11-10T00:00:00Z ['JackieBinya'] urn:uuid:bb3168cf-7243-320e-a8e2-0586030dc3a7 <p>For the past couple of weeks we have been actively creating content for the Creative Commons WordPress Base Theme Usage Guide. Currently, the draft content is under final review before it is migrated to the main docs site.</p> <h2 id="our-strategy">Our Strategy</h2><p>Our main goal is to create rich, intuitive, engaging, and beautifully presented community-facing documentation for the Creative Commons WordPress Base Theme.</p> <p>In alignment with the defined goal, our core focus is to create the docs collaboratively.</p> <p>The CC WordPress team consists of me, Jacqueline Binya, Hugo Solar and Timid Robot Zehta. Although our team is small, it is quite diverse. It consists of a mix of technical skills: I am a junior developer, whereas Hugo and Timid are far more senior. We also have non-native and native English speakers.</p> <p>Diversity is important, as we hope to create a high quality product that caters for everyone.</p> <p>My role as the tech writer/frontend developer is to create the content: write the documentation, build the docs site and also create all illustrative media.</p> <p>During the content creation phase, the first step involved creating the skeleton of the actual docs site. We created a git branch called <em>docs</em> within the <a href="https://github.com/creativecommons/wp-theme-base">creative-commons/wp-base-theme</a> repository. All content related to the documentation is persisted in that branch. So, please feel free to contribute. We then used <a href="https://gridsome.org/starters/jamdocs/">JamDocs</a>, a <a href="https://gridsome.org/">Gridsome</a> theme, to quickly scaffold the site. We had to adapt the theme to make it meet our own specific needs; this involved overhauling the styles and changing the functionality of some of the features in the theme.
After that was completed, we then created a <a href="https://docs.google.com/document/d/1yfAQGG70T8BUhZYWglAlQ_lTo4_tYpyjhPN5FsZnSvI/edit?usp=sharing">Google Doc</a> that we use for collaboratively writing the draft content for the docs site.</p> <h2 id="tech-stack">Tech Stack</h2><p>As mentioned, we used <a href="https://gridsome.org/">Gridsome</a>, a static site generator for <a href="https://vuejs.org/">Vuejs</a>. We chose Gridsome because:</p> <ul> <li><p>We wanted to lower the barrier of entry to contributing:</p> <ul> <li>The Gridsome/Vuejs community is very active; help is but a click away.</li> <li>The Gridsome official documentation is very resourceful and well maintained.</li> </ul> </li> <li><p>Gridsome is highly flexible: The content for the actual documentation is written in <a href="https://www.markdownguide.org/getting-started/">Markdown</a>, but using <a href="https://gridsome.org/plugins/@gridsome/vue-remark">@gridsome/vue-remark</a>, a Gridsome plugin, we are able to use JavaScript in Markdown. We intend to include a copy-to-clipboard Vuejs component in the site.</p> </li> <li><p>Time constraints: This is a short-running project which has to be completed in a 3-month period. Through the use of JamDocs, a Gridsome templating theme, as well as various plugins, it was easy and fast to get started, and we were able to add more functionality to the theme with minimal effort.</p> </li> <li><p>Ease of integrating <a href="https://cc-vocabulary.netlify.app/">CC Vocabulary</a> with Gridsome: it is a requirement that the general aesthetics of all front-facing Creative Commons applications are derived from the CC Vocabulary design system. Major benefits of using a design system include ensuring uniformity in design across all front-facing CC products.</p> </li> </ul> <h2 id="tools-used">Tools Used</h2><ul> <li><a href="https://www.figma.com/">Figma</a>:</li> </ul> <p><img src="/blog/entries/cc-wp-base-theme-docs-content-creation/image.png" alt="An example of illustrative media"></p> <p>Figma was used to make the assets (banners, logos and illustrations) in the theme. The illustrative media was created with accessibility in mind, and all the typography used in the illustrative assets was derived from the CC Vocabulary.</p> <ul> <li><p><a href="https://linuxecke.volkoh.de/vokoscreen/vokoscreen.html">VokoScreenNG</a>: an open source screencast recording tool used to record all the screencast demos available in the docs site.</p> </li> <li><p><a href="https://shotcut.org/">Shotcut</a>: an open source video editing tool.</p> </li> </ul> <h2 id="what-comes-next">What comes next?</h2><p>After the final review is completed and all feedback implemented, we will migrate all the content to the main docs site.</p> <p><em>Stay tuned for an update about the launch of the CC WP Base Theme Docs site.</em></p> Vocabulary Site Mid-Internship Update (v2) 2020-11-09T00:00:00Z ['nimishbongale'] urn:uuid:9d7836e4-5dca-3b69-b77e-ba196ec923cc <p>This is a mid-internship blog post. Wait, what!? Already? Let's glance over my progress, shall we?</p> <h2 id="vocabulary-site-updates-edition-2/many-more-to-come">Vocabulary Site Updates (Edition 2/many more to come)</h2><p>Oh boy! 1.5 months have passed since I started investing time in building a landing site &amp; usage guide for CC Vocabulary. A lot has changed since my last blog post.
<strong>A lot</strong>.</p> <center> <img alt="Full speed ahead" src="speed.gif"/><br> <small class="muted">Hitting "the point of no return" has never been this exciting! Time to step on the throttle! Source: <a href="https://cliply.co">Cliply</a></small> </center><h3 id="what-i-ve-been-up-to">What I've been up to</h3><blockquote><p><strong>Designing</strong> → <strong>Drafting</strong> → <strong>Developing</strong> → <strong>Debugging</strong> → <strong>Deploying</strong></p> </blockquote> <p>And the cycle continues. I guess that sums it all up very nicely. <em>Can somebody appreciate the alliteration though?</em></p> <p>Here's a gist of what I've achieved so far:</p> <ul> <li>I've gone through <strong>2</strong> iterations of the design. I'm happy with how the new site looks (and I genuinely hope the design team does too!).</li> <li>I've drafted around <strong>5+</strong> writeups dealing with the Monorepo Migration, the Getting Started guide, the Vocabulary Overview and, of course, these blog posts.</li> <li>My branch on the vocabulary repository now has over <strong>50+</strong> commits &amp; over <strong>13,000</strong> lines of code (not that I've written all of them, but you know, just for the stats).</li> <li>The first draft of the vocabulary site is now live! I'm expecting a whole bunch of changes still, but here it is if you want to have a sneak peek: <a href="https://cc-vocab-draft.surge.sh">https://cc-vocab-draft.surge.sh</a></li> <li>I've consumed the <a href="https://docs.github.com/en/free-pro-team@latest/rest">GitHub API</a> to get the live release history, fork count and stargazer count. I think it adds a really nice touch to the site in general.</li> <li>I've used <a href="https://surge.sh">surge.sh</a> to deploy the draft site. I believe it's a really simple tool for having your site deployed within seconds!</li> </ul> <center> <img alt="GitHub contribution chart" src="github.png"/><br> <small class="muted">My GitHub contribution chart is filling up!</small> </center><h3 id="what-i-ve-learnt">What I've learnt</h3><p>Some say it's hard to learn through virtual internships. Well, let me prove you wrong. Here are my learnings from the past few weeks:</p> <ul> <li>It's surprising how subjective (&amp; yet objective) designing really is.</li> <li>Vue.js is <em>fantastic</em>. Maybe I'm a Vue.js fan now. Should I remain loyal to React? I don't know.</li> <li>Making a site responsive isn't the <em>easiest</em> of tasks, but it's certainly doable after a lot of stretching &amp; compressing; let's say that.</li> <li>"Code formatting is essential" would be an <em>understatement</em>.</li> <li>Monorepos have their own pros and cons. But in our case the cons were negligible, thankfully!</li> <li>I'll be following up with some performance &amp; accessibility testing this coming week, so let's see how that plays out!</li> <li>A mentor plays a vital role in any project. My mentor <code>@dhruvkb</code> has been very supportive and has made sure I stick to my timeline!</li> </ul> <h3 id="other-community-work-tidbits">Other community work tidbits</h3><p>I believe that, apart from the internship work I'm engaged in, I should also help out with some community PR work. I've been told I'm always welcome to, which is great!</p> <ul> <li>I got the opportunity to speak at a CCOS event along with fellow speakers <a href="/blog/authors/dhruvkb/">dhruvkb</a> &amp; <a href="/blog/authors/dhruvi16/">dhruvi16</a>.
I had a blast talking to budding students from DSC-IIT Surat &amp; DSC-RIT.</li> <li>The dark mode (as promised) should be out before my next blog post.</li> <li>Deployed the vocabulary storybook on <a href="https://chromatic.com">Chromatic</a> and compared &amp; contrasted the pros &amp; cons. Snapshot testing in the near future, maybe?</li> <li>Completed the Hacktoberfest challenge.</li> </ul> <h3 id="bonus-content">Bonus content</h3><p>Not many of you may know this, but this site uses the <a href="https://getlektor.com">Lektor</a> CMS. I needed to have it installed on my system (Windows 10) to run the code in our site repository. Lektor suggests running the following code in PowerShell as an installation step:</p> <div class="hll"><pre><span></span><span class="p">(</span><span class="nb">new-object</span> <span class="n">net</span><span class="p">.</span><span class="n">webclient</span><span class="p">).</span><span class="n">DownloadString</span><span class="p">(</span><span class="s1">&#39;https://www.getlektor.com/installer.py&#39;</span><span class="p">)</span> <span class="p">|</span> <span class="n">python</span> </pre></div> <p>I just didn't think this was a very elegant way. Being an ardent <a href="https://chocolatey.org/">chocolatey.org</a> fan, I just had to have it up on there! Now the installation step for Lektor is simply:</p> <div class="hll"><pre><span></span><span class="n">choco</span> <span class="n">install</span> <span class="n">lektor</span> </pre></div> <p>in Windows PowerShell!</p> <p>Have a look at the package here:</p> <p><a href="https://chocolatey.org/packages/lektor">https://chocolatey.org/packages/lektor</a></p> Restructure README and Add Documentation Guidelines 2020-11-05T00:00:00Z ['ariessa'] urn:uuid:98db0051-b46f-3e4b-ae18-a56418565d50 <p>This is my fourth blog entry on Creative Commons. For weeks 7 and 8, I restructured the README file to be more digestible for new users and created Documentation Guidelines for the CC Catalog API documentation.</p> <h3 id="week-7">Week 7</h3><p>For this week, I restructured the README file in the CC Catalog API repository. I added a step-by-step guide on how to run the server locally. I hope new users will be less intimidated to contribute to this project with the updated guide.</p> <h3 id="week-8">Week 8</h3><p>For week 8, I created Documentation Guidelines, which provide steps on how to contribute to the CC Catalog API documentation, documentation styles, and a cheat sheet for drf-yasg. I also wrote and published this blog post.</p> <hr> <p>Finis.</p> Vocabulary Site Updates (v1) 2020-10-26T00:00:00Z ['nimishbongale'] urn:uuid:51a44307-9725-3d59-bfcd-9badc7cce229 <p>Hello there! Well well well. It has been an eventful first few weeks, to say the least! Let's gauge my progress, shall we?</p> <h2 id="vocabulary-site-updates-edition-1/many-more-to-come">Vocabulary Site Updates (Edition 1/many more to come)</h2><h3 id="what-i-ve-been-upto">What I've been up to</h3><p>I've mainly got myself invested in a survey of the existing documentation that vocabulary currently possesses, and in finding places where it could be made better. After clearing those issues out, I began building the main landing site for <code>Vocabulary</code>, <code>Vue-vocabulary</code> and <code>Fonts</code>.
It wasn't particularly difficult to establish the necessary workflows, as I had done something similar before. During the process of designing the basic structure of the site, I came across a few instances where I felt we needed new or improved components &amp; I discussed the same with my team over on the sprint calls. The design of the site is nearly done. I'm also building the site in parallel &amp; seeking approval from the CC Design Team. I've gotten myself involved in multiple other community contributions to CC as well, across several of our repositories.</p> <h3 id="what-i-ve-learnt">What I've learnt</h3><ul> <li>Knowing the existing code in your project is essential. It's important to understand the code styles, structure &amp; activity of the code that you are dealing with.</li> <li>Be patient! It's fine to delay something if it makes sense to have it logically accomplished only after certain other tasks are done &amp; dusted.</li> <li>How essential it is to write <em>neat code</em> is something that's not spoken about too often. (I wonder why...)</li> <li>I always thought VueJS sets up SPAs by default. I'm surprised you need additional configuration to do just that!</li> <li>Storybook is just a really nifty OSS with great community support!</li> </ul> <h3 id="other-community-work-tidbits">Other community work tidbits</h3><ul> <li>I've been working on the Dark Mode (a much awaited feature, at least for me!) for our storybooks with some support from our community. It should be up and running shortly!</li> <li>Fixed some formatting bugs in the <code>README.md</code> &amp; suggested changes with respect to <code>npm v7</code> considerations.</li> <li>Fixed storybook components docs for 2 features.</li> <li>Raised a ticket for a component to render Markdown text within vocabulary itself.</li> <li>Raised a few other issues for potential Hacktoberfest contributions.</li> </ul> <p align="center"> <strong>Thank you for your time! To be continued...</strong> </p> Add New Sections, Descriptions, Help Texts, Code Examples, Schemas, and Serializers 2020-10-21T00:00:00Z ['ariessa'] urn:uuid:b832956d-0c04-37e5-bde6-eb53722a4ba3 <p>Welcome to my third blog entry! For weeks 5 and 6, I added new sections, descriptions, help texts, code examples, schemas, and serializers. I was very productive these past two weeks.</p> <h3 id="week-5">Week 5</h3><p>For this week, I managed to add a lot of stuff to the documentation. I figured out how to add help texts to classes and how to create serializers. I also managed to move all code examples under response samples. In order to do this, I created a new class called CustomAutoSchema to add <a href="https://github.com/Redocly/redoc/blob/master/docs/redoc-vendor-extensions.md#x-codesamples">x-code-samples</a>. Other things I did included creating new sections such as “Register and Authenticate” and “Glossary”. The hardest part of this week was probably figuring out how to add request body examples and move code examples.</p> <h3 id="week-6">Week 6</h3><p>For week 6, I added another section called Contribute that provides a to-do list for starting to contribute on GitHub. I also wrote and published this blog post.</p> <hr> <p>All caught up!</p> Add Response Samples and Descriptions for API Endpoints 2020-10-09T00:00:00Z ['ariessa'] urn:uuid:23d6eb6e-e157-34fb-af99-ddfcad7c5044 <p>Well, hello again! For week 3 and week 4, I added response samples and descriptions for API endpoints.</p>
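<p>To give a flavour of what adding a response sample involves, here is a minimal drf-yasg sketch. It is hypothetical code for illustration only, not the actual CC Catalog API implementation; the view name and the example payload are made up.</p> <pre><code>from drf_yasg import openapi
from drf_yasg.utils import swagger_auto_schema
from rest_framework.views import APIView

# A reusable response sample; the JSON payload below is illustrative only.
image_search_response = openapi.Response(
    description="A single page of image search results",
    examples={
        "application/json": {
            "result_count": 1,
            "results": [
                {"id": "abc-123", "license": "by", "license_version": "4.0"}
            ],
        }
    },
)

class ImageSearchView(APIView):
    # swagger_auto_schema attaches the sample to the generated OpenAPI spec.
    @swagger_auto_schema(responses={200: image_search_response})
    def get(self, request):
        ...
</code></pre>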
<p>Writing documentation feels a bit like coding at this point, because I need to read a lot about drf-yasg and dig through issues and questions on GitHub / Stack Overflow to ensure that I don’t ask redundant (or even stupid) questions.</p> <h3 id="week-3">Week 3</h3><p>Week 3 was quite hectic. I moved back to my hometown during week 3. I took 3 days off to settle my stuff and set up a workspace. I worked on my GSoD project for only 2 days, Monday and Tuesday. I managed to create response samples for most API endpoints. I had a monthly video call with Kriti this week.</p> <h3 id="week-4">Week 4</h3><p>For this week, I reviewed what I’d done and what I hadn’t to estimate a new completion time. Thankfully, I have a buffer week in my GSoD timeline and deliverables. So yeah, all is good in terms of completion time. I started to write descriptions for API endpoints. I submitted my first PR and published a blog entry.</p> <hr> <p>Over and out.</p> Vocabulary Site & Usage Guide Introduction (GSoD'20) 2020-10-02T00:00:00Z ['nimishbongale'] urn:uuid:4974b78b-a4f9-3bbc-8875-d595042b954c <p>Hey there! I'm Nimish Bongale, a Technical Writer &amp; Software Developer based out of Bangalore, India. My other hobbies include playing chess and the guitar. I look forward to building the CC Vocabulary site and usage guides as a part of GSoD'20.</p> <h2 id="but-what-is-gsod">But what is GSoD?</h2><p>GSoD, or Google Season of Docs, is a program that stresses the importance of the documentation aspect of Open Source projects. It invites technical writers from across the world to submit proposals based on projects floated by the participating Open Source organisations. The selected technical writers then work with their respective organisations and look to complete their work by the end of their internship period. More information about the same can be found <a href="https://developers.google.com/season-of-docs">here</a>.</p> <p>Let's talk a bit about my project, shall we?</p> <h2 id="vocabulary-site-usage-guide">Vocabulary Site &amp; Usage Guide</h2><h3 id="introduction">Introduction</h3><p><a href="https://github.com/creativecommons/vocabulary">CC Vocabulary</a> is a cohesive design system &amp; Vue component library to unify the web-facing Creative Commons. It currently comprises 3 packages, namely Vocabulary, Vue-Vocabulary &amp; Fonts. My contribution to this project majorly involves building the landing site for CC Vocabulary and refactoring the documentation wherever necessary.</p> <h3 id="what-drives-me">What drives me</h3><p>Documentation is one of the primary factors that determines how successful a certain open source library will be. The major questions that developers ask while choosing a suitable tech stack to build their applications are:</p> <ul> <li>Is the library <em>well documented</em>?</li> <li>Is it <em>well maintained</em>?</li> <li>Does it have some <em>considerable usage and error support</em>?</li> </ul> <p>These are exactly the questions I should be asking myself while going about this project idea.</p> <p>As aforementioned, there is a pressing need for concise and consolidated documentation. A lack of documentation hurts the future prospects of open source applications; documentation is an essential and non-negligible component. Linking to this documentation should be an appealing home page, which captures the interest of people in an instant.
The documentation should be well organised, thereby enabling a seamless flow through it.</p> <h3 id="tech-stack-of-the-project">Tech stack of the project</h3><p>We have decided to move forward with <a href="https://vuejs.org/">Vuejs</a> for building the site, and to continue work on the existing <a href="https://storybook.js.org/">storybooks</a> of Vocabulary, Vue-Vocabulary and Fonts. Storybook has seen some great improvements in recent times, and the new addons on offer will greatly support my work. Besides these, I will also be using <a href="https://stackedit.io/">StackEdit</a> to write and share Markdown files of my writings.</p> <h3 id="progress-baby-steps">Progress - Baby Steps</h3><p>I have contributed to CC in the past. It would now be my first time contributing to a specific project within CC, while being a member of CC Open Source. Some tasks that I've been able to initiate/accomplish so far:</p> <ul> <li>Look at Open Source documentation conventions, and see if we violate any.</li> <li>Understand the level of existing documentation currently present in our storybooks.</li> <li>Discuss the Monorepo migration and help out with the implementation.</li> <li>Migrate <code>storybookjs</code> to the latest version.</li> <li>Implement <code>addon-controls</code> for vocabulary.</li> <li>Design the vocabulary site.</li> <li>Promote the involvement of CC Open Source in <a href="https://hacktoberfest.digitalocean.com/">Hacktoberfest</a> 2020.</li> </ul> <h3 id="what-did-i-learn">What did I learn?</h3><ul> <li>Design is more than just picking colors and placing components on a grey screen.</li> <li>It's important to read your own writing from an unbiased perspective to actually understand how well it will be perceived.</li> <li>Interacting with your mentor on a regular basis is absolutely essential.</li> <li>Publishing to <a href="https://www.npmjs.com/">npmjs</a> is not difficult!</li> </ul> <p align="center"> <strong>Thank you for your time!</strong> </p> Creative Commons WordPress plugin: attribution for images 2020-10-01T00:00:00Z ['rczajka'] urn:uuid:818016d3-7344-3937-9fd6-7e5ffad98071 <p>As a part of <a href="https://centrumcyfrowe.pl">Centrum Cyfrowe</a>'s <a href="https://otwartakultura.org/noworries/">#NoWorries project</a> funded by EUIPO, I have had the pleasure of enhancing the Creative Commons WordPress plugin. The new version of CC's WordPress plugin has a feature called “attribution information for images”. It works like this:</p> <ol> <li>You upload an image to the WordPress Media Library and fill out the correct attribution information there.</li> <li>You then insert the image into a page using the Image Gutenberg block.</li> <li>When the image is then displayed on site, the plugin will show the attribution information (the name of the author, the image's title and link to source, and the CC license used) right there, in a nice semi-transparent overlay over the image.</li> </ol> <h2 id="how-does-it-work">How does it work?</h2><p>To find the relevant information in the Media Library, the plugin reuses the information already provided by Gutenberg Image blocks. Each time an image is inserted using such a block, WordPress adds a special CSS class to it, in the form of <code>wp-image-{id}</code>, containing the image's identifier in the Media Library. It can be used to add individual styles to a specific image; we're using it to find the relevant entry in the Media Library and add individual attribution information.</p>
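<p>The plugin itself is written in PHP, but the core lookup idea is simple enough to sketch. Here it is in Python, purely as an illustration; the regex and function name are hypothetical, not the plugin's actual code:</p> <pre><code>import re

# Gutenberg tags every inserted image with a class like "wp-image-1234";
# the number is the attachment's identifier in the Media Library.
WP_IMAGE_CLASS = re.compile(r"wp-image-(\d+)")

def media_library_id(class_attr):
    """Return the Media Library id embedded in an img tag's class attribute."""
    match = WP_IMAGE_CLASS.search(class_attr)
    return int(match.group(1)) if match else None

# media_library_id("size-large wp-image-1234")  ->  1234
</code></pre>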
<p>With this approach, we avoid the need for any custom markup, while also only hitting the database with a query when an actual image from the Media Library is found on the page.</p> <p>All you need to do is make sure the licensing information is there in the Media Library, and that the images are inserted using the Image block.</p> <p>This wasn't the first attempt at adding a similar function to the CC WordPress plugin. The previous attempt used a <code>[license]</code> shortcode wrapping the image, which is unwieldy with the current WordPress Gutenberg editor. It also used multiple calls to <code>attachment_url_to_postid</code> to locate the image in the Media Library, which meant executing more database queries for each image. With the new approach, the user doesn't have to change their posts at all: all they need to do is install the plugin and add attribution information in the Media Library, and it will automatically start working for their normally inserted images.</p> <p>See here how to install the plugin:</p> <video src="install.mp4" controls></video><p>See here how to use the image attribution function:</p> <video src="use.mp4" controls></video> WordPress Base Theme Usage Guide (GSOD-2020): Hello World! 2020-09-30T00:00:00Z ['JackieBinya'] urn:uuid:8ca33f24-33d1-3bdb-a55d-5fb6381316f5 <p>My name is Jacqueline Binya. I am a software developer and technical writer from Zimbabwe. I am going to write a series of blog posts documenting my experience and lessons as I contribute to the <a href="https://github.com/creativecommons/wp-theme-base">Creative Commons WordPress Base Theme (CC WP Base Theme)</a> during the <a href="https://developers.google.com/season-of-docs">Google Season of Docs (GSOD-2020)</a> as a technical writer.</p> <h2 id="what-is-google-season-of-docs">What is Google Season of Docs?</h2><p>Google Season of Docs was born out of a need to improve the quality of open-source documentation, as well as to advocate for open source, for documentation, and for technical writing. Annually during GSOD, technical writers are invited to contribute to open-source projects through a highly intensive process geared at ensuring that the technical writers and the projects they contribute to are a good fit; after that has been determined, the GSOD work begins.</p> <h2 id="building-the-docs">Building the docs</h2><p>The CC WP Base theme is a WordPress theme used to create front-facing Creative Commons (CC) websites. My task is to collaborate with the engineering team to create community-facing docs for the theme.</p> <h3 id="guiding-principles">Guiding principles</h3><p>The docs should be inclusive, meaning they should be written in an easy-to-understand manner, taking care to avoid excessive technical jargon; they should be accessible; and they should have support for internationalization. We hope to provide our users with a smooth and memorable experience whilst using the docs, hence the docs site should be fast and easy to navigate.</p> <h3 id="technical-stack-of-the-project">Technical stack of the project</h3><p>We decided to build the docs using <a href="https://jamstack.org/">Jamstack</a>; to be specific, we are using <a href="https://gridsome.org/">Gridsome</a>, a static site generator for <a href="https://vuejs.org/">Vuejs</a>. We are using Gridsome as it is highly performant, and it also integrates smoothly with the <a href="https://cc-vocabulary.netlify.app/">CC Vocabulary</a>.
Gridsome also has out-of-the-box support for important features like Google Analytics and <a href="https://www.algolia.com/">Algolia</a>; these features will obviously be useful in future iterations of the docs. To quickly scaffold the docs, we used a Gridsome theme called <a href="https://gridsome.org/starters/jamdocs/">JamDocs</a>.</p> <h3 id="progress">Progress</h3><p>Currently, the project is on track. As stated, we are creating the docs collaboratively. The very first step in our workflow is to create draft content using Google Docs. That task is assigned to me; it involves doing lots of research, reading and also testing out the theme. Afterwards, my mentors Hugo Solar and Timid Robot Zehta give me feedback on the draft. Then I implement the feedback and continuously work on improvements. The final step is migrating the approved draft content to the docs project in Markdown format.</p> <h3 id="my-lessons-so-far">My lessons so far:</h3><ul> <li>Always ask questions: frankly, the only way you can create good content is when you have a solid understanding of the subject matter.</li> <li>It's better to over-communicate than under-communicate, especially when working remotely. This is even more important if you encounter blockers whilst executing your work.</li> <li>Push that code and open a PR quickly, then go ahead and ask for a review. Don't procrastinate: this ensures a fast turnaround, so you get feedback quickly and can work on improvements.</li> </ul> <p><em>Thank you for reading, watch out for the next update which will be posted soon.</em></p> Add Query Using curl Command and Provide Response Samples 2020-09-25T00:00:00Z ['ariessa'] urn:uuid:2841326a-a1a6-3593-8236-10fc75b59aa0 <p>First of all, I’m very thankful to have been selected as a Google Season of Docs participant under Creative Commons. My project is named Improve CC Catalog API Usage Guide. The project aims to revamp the existing CC Catalog API documentation to include more narrative elements and increase user friendliness. As the focal point of this project will potentially be delivered before the end of the GSOD period, this project will also improve the CC Catalog API repo documentation for potential contributors, and produce guidelines for contributing to documentation. For this project, my mentor is Alden Page.</p> <h3 id="week-1">Week 1</h3><p>So, the first two weeks of Google Season of Docs have passed. For the first week, I added examples of performing queries using the curl command. I hit a problem with a Forbidden error. It turned out my access key had expired. My problem was solved after obtaining a new access key.</p> <h3 id="week-2">Week 2</h3><p>For the second week, I started to write response samples. It was tough, as I had a hard time understanding <a href="https://github.com/axnsan12/drf-yasg">drf-yasg</a>, which is an automatic Swagger generator. It can produce Swagger / OpenAPI 2.0 specifications from a Django Rest Framework API. I tried to find as many examples as I could to increase my understanding. Funny, but it took me a while to realise that drf-yasg is not made up of random letters.
The DRF part stands for Django Rest Framework, while YASG stands for Yet Another Swagger Generator.</p> <hr> <p>That’s all!</p> The specifics - Revamping CCOS 2020-09-02T00:00:00Z ['dhruvi16'] urn:uuid:013d291c-a7ee-3a05-8aef-87f357322155 <p>In this blog post, I will be talking about how I managed to use Vocabulary (Creative Commons's design library) efficiently in our Open Source website.</p> <h3 id="what-is-vocabulary">What is Vocabulary?</h3><p><a href="https://cc-vocabulary.netlify.app/?path=/story/vocabulary-introduction--page">Vocabulary</a> is a cohesive design system to unite the web-facing Creative Commons. In essence, Vocabulary is a component library that uses and extends the Bulma CSS library. Vocabulary makes it easier to develop Creative Commons apps while ensuring a consistently familiar experience. This project is still under development.</p> <h3 id="why-vocabulary">Why Vocabulary?</h3><p>Vocabulary is used to describe the overall visual design of our digital products. At first glance, it appears to be an amalgamation of component designs with a consistent visual aesthetic and brand, typically accompanied by usage guidelines in the form of online documentation. But there is a lot more to it. When it comes to a large software community with a huge range of products, certain problems come along. One of those problems is maintaining the level of harmony across all the products of the network. So, there comes a need for a unified visual language that heightens the level of harmony in a digital ecosystem. And in our case, Vocabulary solves this problem. This design system is well built and helps us bring the following aspects to the table:</p> <ol> <li>Recognizability</li> <li>Consistency</li> <li>Authenticity</li> <li>Efficiency</li> </ol> <p>And many more.</p> <h3 id="how-did-i-use-it-examples">How did I use it? Some examples</h3><p>As I stated before, I added Vocabulary by updating all the templates in the CCOS <a href="https://www.getlektor.com/">Lektor</a> project.</p> <p>As far as components are concerned, I just had to paste in the code snippets given on Vocabulary’s website, with the required changes.</p> <h4 id="integration-of-breadcrumb">Integration of Breadcrumb</h4><figure style="text-align: center;"> <img src="breadcrumb.png" alt="Breadcrumb"> <figcaption>Screenshot:
<a href="https://cc-vocabulary.netlify.app/?path=/docs/navigation-breadcrumb--default-story">Breadcrumb</a> (Vocabulary)</figcaption> </figure><p>The code for integration ?</p> <pre><code>&lt;!-- Breadcrumb --&gt; {% if this._path != '/'%} &lt;div class="breadcrumb-container"&gt; &lt;nav class="container breadcrumb caption bold" aria-label="breadcrumbs"&gt; &lt;ul&gt; {% set crumbs = [] %} {% set current = {'crumb': this} %} &lt;!-- Extracting the slugs of URL --&gt; {% for i in this._path.split("/") %} {% if current.crumb is not none %} {% if crumbs.insert(0, current.crumb._slug) %}{% endif %} {% if current.update({"crumb": current.crumb.parent}) %}{% endif %} {% endif %} {% endfor %} {% for crumb in crumbs %} &lt;!-- Active link --&gt; {% if this._slug == crumb %} &lt;li class="is-active"&gt;&lt;a aria-current="page displayed"&gt;{{ crumb | title | replace('-', ' ') }}&lt;/a&gt;&lt;/li&gt; {% else %} &lt;!-- Forming the URL using extracted slugs --&gt; {% set i = loop.index %} {% set ns = namespace (link = '') %} {% for j in range(i) %} {% set ns.link = ns.link + crumbs[j] + '/' %} {% endfor %} &lt;li&gt;&lt;a class="link" href="{{ ns.link|url }}"&gt; {% if crumb != '' %} {{ crumb | title | replace('-', ' ') }} {% else %} Home {% endif %} &lt;/a&gt;&lt;/li&gt; {% endif %} {% endfor %} &lt;/ul&gt; &lt;/nav&gt; &lt;/div&gt; {% endif %} </code></pre> <p>Other than the components, there are other visual elements like typography, colors, spacing, and others that are extensively used in CCOS.</p> <p>This is code for the Hero section of the home page.</p> <h5 id="the-block-template">The block template -</h5><pre><code>&lt;section class="hero"&gt; &lt;div class="container"&gt; &lt;div class="hero-title column is-12 is-paddingless"&gt; &lt;h1&gt; {{ this.title }} &lt;/h1&gt; &lt;/div&gt; &lt;div class="columns"&gt; &lt;div class="column is-5"&gt; &lt;p class="hero-description"&gt; {{ this.description }} &lt;/p&gt; {{ this.links }} &lt;/div&gt; &lt;/div&gt; &lt;/div&gt; &lt;div class="level-right hero-image"&gt; &lt;img class="image" src="./github.svg" /&gt; &lt;/div&gt; &lt;/section&gt; </code></pre> <h5 id="the-block-styling">The block styling -</h5><pre><code>// Hero section - Home page .hero { @extend .margin-top-large; .hero-title { @extend .padding-horizontal-big; } .hero-description { @extend .body-bigger; @extend .padding-top-big; @extend .padding-horizontal-big; } .hero-links { @extend .margin-vertical-normal; @extend .padding-horizontal-big; .button { @extend .margin-top-normal; text-decoration: none; .icon { @extend .margin-right-small; @extend .padding-vertical-smaller; } } } .hero-image { @include from($fullhd) { margin-top: -20rem; .image { width: 50%; } } @include until($fullhd) { .image { width: 100%; } } } } </code></pre> <figure> <img src="output.png" alt="Output"> <figcaption>Output</figcaption> </figure><h3 id="improvements-in-the-lektor-project">Improvements in the Lektor project -</h3><p>I tried to write the perfect code that is cleaner and readable. I would try to demonstrate my effort using the home page code where I used <a href="https://www.getlektor.com/docs/models/flow/">Lektor Flowblocks</a>. The new homepage design have four sections where each section communicated something and I realized they were all independent and building the whole page through one single template would become a bit messy and hard to handle. 
So I did some research and found a way to build sub-templates and use them together to develop a single page; Lektor’s flowblocks allowed me to do just that. Here is one of the flowblocks; if you want to check out the whole thing in action, you can go to the <a href="https://github.com/creativecommons/creativecommons.github.io-source">CCOS Repository</a>.</p> <h4 id="recent-blog-post-block">Recent Blog Post block</h4><h5 id="the-block-template">The block Template</h5><pre><code>{% from "macros/author_name.html" import render_author_name %} &lt;section class="recent-posts"&gt; &lt;div class="container"&gt; &lt;div class="level"&gt; &lt;h2 class="is-paddingless level-left"&gt; {{ this.title }} &lt;/h2&gt; &lt;span class="level-right"&gt; &lt;a class="posts-link" href="/blog"&gt;See all posts &lt;i class="icon angle-right"&gt;&lt;/i&gt;&lt;/a&gt; &lt;/span&gt; &lt;/div&gt; &lt;div class="columns"&gt; {% for post in site.query('/blog/entries') %} {% if loop.index &lt;= 3 %} {% set author = post.parent.parent.children.get('authors').children.get(post.author) %} &lt;div class="column is-one-third is-paddingless padding-horizontal-big padding-top-bigger"&gt; &lt;article class="card entry-post horizontal no-border blog-entry"&gt; &lt;header&gt; &lt;figure class="image blog-image"&gt; {% if author.about %} {% if author.md5_hashed_email %} &lt;img class="profile" src="https://secure.gravatar.com/avatar/{{ author.md5_hashed_email }}?size=200" alt="gravatar" /&gt; {% endif %} {% endif %} &lt;/figure&gt; &lt;/header&gt; &lt;div class="blog-content"&gt; &lt;h4 class="b-header"&gt;&lt;a class="blog-title" href="{{ post|url }}"&gt;{{ post.title }}&lt;/a&gt;&lt;/h4&gt; &lt;span class="blog-author"&gt;by &lt;a class="author-name" href="{{ author|url }}"&gt;{{ render_author_name(author) }}&lt;/a&gt; on {{ post.pub_date|dateformat("YYYY-MM-dd") }}&lt;/span&gt; &lt;div class="excerpt"&gt; {{ post.body | excerpt | string | striptags() | truncate(100) }} &lt;/div&gt; &lt;/div&gt; &lt;/article&gt; &lt;/div&gt; {% endif %} {% endfor %} &lt;/div&gt; &lt;/div&gt; &lt;/section&gt; </code></pre> <h5 id="the-block-model">The block Model</h5><pre><code>[block] name = Recent Posts [fields.title] label = Title type = string </code></pre> <h5 id="the-block-styling">The block styling</h5><pre><code>// Recent-posts section - Home page .recent-posts { background-color: rgba(4, 166, 53, 0.1); .container { @extend .padding-vertical-xl; @extend .padding-horizontal-big; .columns { @extend .padding-top-bigger; @extend .padding-bottom-xl; } } .blog-title { @extend .has-color-dark-slate-gray; } .posts-link { @extend .has-color-forest-green; @extend .body-normal; font-weight: bold; line-height: 1.5; text-decoration: none; .icon { @extend .has-color-forest-green; @extend .padding-left-small; } } } </code></pre> <figure> <img src="output2.png" alt="Output"> <figcaption>Output: recent blog posts.</figcaption> </figure><p>I would also like to point out the amazing Query functionality provided by Lektor, where you can access the child pages of the root. Here I am accessing blog posts from our Blog page and limiting the count of posts to three.</p> <h3 id="difference-in-experience">Difference in Experience</h3><p>The level of user experience has been significantly elevated by the use of Vocabulary. I would like to point out one of the major experience changes here. The major part of the website is guidelines: we have guidelines for contributing, guidelines for how to join a community, guidelines for how to write a blog, and many more.
The new website has cleaner, more readable guidelines with a proper hierarchy, and every piece of information is made accessible using secondary navigation.</p> <h5 id="below-are-the-images-of-some-guidelines-pages-from-new-website">Below are the images of some guidelines pages from the new website.</h5><figure> <img width="300" height="300" src="new1.png" alt="Screenshot"> <img width="300" height="300" src="new2.png" alt="Screenshot"> <img width="300" height="300" src="new3.png" alt="Screenshot"> <figcaption>Screenshots from the new website</figcaption> </figure><h5 id="below-are-the-images-of-some-guidelines-pages-from-old-website-you-can-see-the-difference-of-experience-in-both-cases">Below are the images of some guidelines pages from the old website. You can see the difference in experience between the two.</h5><figure> <img width="400" src="old1.png" alt="Screenshot"> <img width="400" src="old2.png" alt="Screenshot"> <figcaption>Screenshots from the old website</figcaption> </figure><h3 id="how-you-can-use-vocabulary-and-also-contribute-to-it">How can you use Vocabulary and contribute to it?</h3><p>Vocabulary is very easy to use. It is intuitive, consistent and highly reusable. Vocabulary uses Storybook to present each visual element, which makes it very convenient for a user to integrate Vocabulary into their project. The code snippets attached to every element can be copied as-is and used. The code snippets above indicate how the library can be used and how easily you can achieve the desired web pages. For more details, you can visit the <a href="https://cc-vocabulary.netlify.app/?path=/docs/vocabulary-usage--page">usage guidelines</a>.</p> <p>Vocabulary is still under development; feedback and bug reports are welcome, fixes and patches even more so. Here is the link to the <a href="https://cc-vocabulary.netlify.app/?path=/docs/vocabulary-contribution--page">contribution guidelines</a>.</p> Accessibility and Internationalization: WrapUp GSoC 2020 2020-08-31T00:00:00Z ['AyanChoudhary'] urn:uuid:8dd9da39-50da-34b2-91ac-bbe3eee35aee <p>This is the final blog post of my internship with CC. I worked on improving the accessibility of cc-search and internationalizing it as well. This blog post is the conclusion of my work. These past 10 weeks with CC have taught me a lot and I am really grateful to have got this opportunity.
The experience was just amazing and the people are so helpful. I really enjoyed working with them and am looking forward to continuing to work with the CC team.</p> <p>You can glance through my work via these blog posts:</p> <ol> <li><a href="/blog/entries/cc-search-accessibility-and-internationalization/">CC Search, Proposal Drafting and Community Bonding</a></li> <li><a href="/blog/entries/cc-search-accessibility-week1-2/">CC Search, Setting up vue-i18n and internationalizing homepage</a></li> <li><a href="/blog/entries/cc-search-accessibility-week3-4/">Internationalization Continued: Handling strings in the store</a></li> <li><a href="/blog/entries/cc-search-accessibility-week5-6/">Internationalization continued: Modifying tests</a></li> <li><a href="/blog/entries/cc-search-accessibility-week7-8/">CC Search, Initial Accessibility Improvements</a></li> <li><a href="/blog/entries/cc-search-accessibility-week9-10/">Accessibility Improvements: Final Changes and Modal Accessibility</a></li> </ol> <p>The progress of the project can be tracked on <a href="https://github.com/cc-archive/cccatalog-frontend">cc-search</a>.</p> <p>CC Search Accessibility is my GSoC 2020 project, carried out under the guidance of <a href="https://creativecommons.org/author/zackcreativecommons-org/">Zack Krida</a> and <a href="/blog/authors/akmadian/">Ari Madian</a>, the primary mentor for this project. <a href="https://creativecommons.org/author/annacreativecommons-org/">Anna Tumadottir</a> helped all along, and engineering director <a href="https://creativecommons.org/author/kriticreativecommons-org/">Kriti Godey</a> has been very supportive.</p> Linked Commons: GSoC'20 Wrap Up 2020-08-28T00:00:00Z ['subhamX'] urn:uuid:511904ae-a464-335a-bcdf-76f80c5309d1 <p>Time flies faster when you are having fun! I didn't believe it back then, but now I do after experiencing it. Fittingly, here I am writing the concluding blog post of the <strong>GSoC 2020: The Linked Commons series</strong> just when I had started enjoying things.</p> <p>In this post, I will give a brief overview of the Linked Commons and my GSoC contributions. It was an exciting journey, and I loved working on this project.</p> <p>Before I go any further, just for ritual, let me share a one-liner on <strong>what The Linked Commons is</strong>, although I highly recommend reading the other posts in this series; who knows, you might join our team!</p> <blockquote><p>The CC catalog data visualization, or Linked Commons, is a web application which finds and explores relationships between Creative Commons licensed content on the web.</p> </blockquote> <p>My primary contributions to the project during the GSoC timeline were threefold.</p> <p><strong>Firstly</strong>, revamping the design and migrating the project to React.js for fast and scalable rendering performance.</p> <p><strong>Secondly</strong>, adding graph filtering methods and scaling the data to enable users to visualize massive datasets more efficiently.</p> <p><strong>Lastly</strong>, making developer onboarding easy by dockerizing the project and bringing more application portability.</p> <h2 id="gsoc-work-product">GSoC Work Product</h2><p>The live version of the linked commons can be found <a href="http://dataviz.creativecommons.engineering/">here</a>.
You can interact with it and <strong>"explore the creative commons in graphs"</strong>.</p> <p>If you wish to access the raw or filtered data, then here is a brief documentation of our new API.</p> <div class="hll"><pre><span></span><span class="nt">URL</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/api/graph-data</span> <span class="nt">Method</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GET</span> <span class="nt">Description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Returns a randomized graph having around 500 nodes and links.</span> </pre></div> <p>&nbsp;</p> <div class="hll"><pre><span></span><span class="nt">URL</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/api/graph-data/?name={node_name}</span> <span class="nt">Method</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GET</span> <span class="nt">Description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Returns the filtered graph with a set of nodes which are either immediate neighbours to {node_name} in the original graph or the transpose graph.</span> </pre></div> <p>&nbsp;</p> <div class="hll"><pre><span></span><span class="nt">URL</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/api/suggestions/?q={query}</span> <span class="nt">Method</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GET</span> <span class="nt">Description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Returns a set of nodes which contains the {query} pattern in their nodeid</span> </pre></div> <h3 id="demo">Demo</h3><div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="graph-filtering.gif" alt="demo" style="border: 1px solid black"> <figcaption>Linked Commons: Filtering the Graph ??</figcaption> </figure> </div><h3 id="my-code-contributions">My Code Contributions</h3><p><strong>Repository:</strong> <a href="https://github.com/cc-archive/cccatalog-dataviz/">https://github.com/cc-archive/cccatalog-dataviz/</a></p> <p><strong>Commits:</strong> <a href="https://github.com/cc-archive/cccatalog-dataviz/commits/master">https://github.com/cc-archive/cccatalog-dataviz/commits/master</a></p> <p><strong>Contributors:</strong> <a href="https://github.com/cc-archive/cccatalog-dataviz/graphs/contributors">https://github.com/cc-archive/cccatalog-dataviz/graphs/contributors</a></p> <p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/28"><strong>Migrate frontend to React #28</strong></a></p> <ul> <li>Migrated the frontend to a web application using React.js for smooth rendering performance.</li> <li>Add client-side graph filtering method to enable users to interact with the loaded graph.</li> </ul> <p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/29"><strong>Add server-side filtering #29</strong></a></p> <ul> <li>We realized that the client-side graph filtering method is not very scalable. 
This PR adds the basic structure for the backend server and adds server-side graph filtering logic.</li> <li>Added a parser to convert the input JSON file from <code>{nodes:[], links:[]}</code> schema to the distance list format.</li> </ul> <div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="api-call.png" alt="API call" style="border: 1px solid black"> <figcaption>API call to filter graph data</figcaption> </figure> </div><p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/33"><strong>Design upgrade #33</strong></a></p> <ul> <li>It revamped the design of the frontend.</li> <li>Added both primary light theme and secondary dark theme</li> </ul> <div style="text-align: center; font-style: normal; width: 80%; margin-left: 10%;"> <figure> <img src="design-dark.png" alt="Dark Theme" style="border: 1px solid black"> <figcaption><em>Linked Commons: Dark Theme</em></figcaption> </figure> </div><p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/35"><strong>Add node suggestions feature #35</strong></a></p> <ul> <li>Added query autocomplete feature, to enable users to explore all the nodes in the database.</li> <li>This functionality aims to minimize the number of misspelt filtering tries from the client.</li> <li>Refer to <a href="/blog/entries/linked-commons-autocomplete-feature/">this blog</a> for the motivation and detailed report on why we added autocomplete aka node suggestions feature.</li> </ul> <p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/38"><strong>Fix filtering module #38</strong></a></p> <ul> <li>Optimizes the build-dB-script to run efficiently on the larger and newer dataset of the cc-catalog.</li> <li>Added the basic form of the randomized graph filtering method.</li> <li>Refer to <a href="/blog/entries/linked-commons-data-update/">this blog</a> for a piece of detailed information on data update.</li> </ul> <p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/39"><strong>Database upgrade and core enhancements #39</strong></a></p> <ul> <li>It upgrades the primary database from shelve to MongoDB for higher performance.</li> <li>Dockerizes the frontend and backend for both dev and prod environments for higher application portability.</li> </ul> <p><a href="https://github.com/cc-archive/cccatalog-dataviz/pull/40"><strong>Frontend enhancements #40</strong></a></p> <ul> <li>Fixes common UI bugs, updates the frontend design and enhances the mobile and smaller devices experience with the linked commons</li> <li>Modularizes and updates the code documentation</li> </ul> <div style="text-align: center; width: 80%; margin-left: 10%;"> <figure> <img src="design-light.png" alt="Theme Light" style="border: 1px solid black"> <figcaption>Linked Commons: Light Theme</figcaption> </figure> </div><div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="lighthouse-audit.png" alt="Lighthouse Audit" style="border: 1px solid black"> <figcaption>Lighthouse Stats of the latest version of the Linked Commons</figcaption> </figure> </div><h3 id="whats-next">What’s Next?</h3><p>Throughout this internship period, we, the Linked Commons team, aimed to make this version the best among all. But there is still scope for improvement.</p> <p>Just to give you some insights; currently, the complete graph contains 235k nodes and 4.14million links. During the preprocessing, we dropped a lot of the nodes. Additionally, we removed more than 3 million nodes which didn't have cc_licenses information. 
So, in general, the current version shows only those nodes which are soundly linked with other domains and whose licence information is available. To give a complete picture of the massive "cc-catalog", the Linked Commons needs to "gird up its loins".</p> <p>After seeing the tremendous potential it has, I will undoubtedly continue working on it and help the Linked Commons in this quest.</p> <h3 id="ending-note">Ending Note</h3><p>In the end, I would like to thank my mentors, Maria and Brent, for their unconditional guidance throughout this internship period. The insights I got from them will truly help me in the days to come.</p> <p>Special thanks to Francisco, Anna and Kriti for the awesome brainstorming ideas in the UX meet, which helped us build an incrementally superior version of the Linked Commons.</p> <p>It is not the end, rather a new beginning. Cheers!</p> CC Search Extension: Wrapping up GSoC 2020 2020-08-27T00:00:00Z ['makkoncept'] urn:uuid:a836222c-bec0-3f28-a8cd-9699e8ba778f <p>In this post, I'll give an overview of the improvements and features that were added to the CC Search browser extension. I am delighted to state that the goals that were set for Google Summer of Code 2020 have been successfully completed.</p> <h2 id="widen-the-integration-with-cc-catalog-api">Widen the integration with CC Catalog API</h2><p>Both <a href="https://search.creativecommons.org">CC Search</a> and the CC Search Extension are powered by the <a href="https://api.creativecommons.engineering/v1/">CC Catalog REST API</a>. The API allows programmatic access to search for CC-licensed and public domain digital media. Better integration with the API was one of the major targets during this internship because it significantly improves and adds new searching workflows to the extension.</p> <p>This can be sub-divided into <em>new filters</em>, <em>browse by sources</em>, <em>search by tags</em>, and <em>related images</em>.</p> <h3 id="new-filters">New Filters</h3><p>The <code>/image</code> endpoint of the API is used for searching. We can also provide several query parameters that can filter the result. Previously, the extension only supported filtering the content using <code>license</code>, <code>sources</code>, and <code>use case</code>. Now, besides these filters, the extension also supports filtering by <code>image type</code>, <code>file type</code>, <code>aspect ratio</code>, and <code>image size</code>.</p> <p><figure > <img src="old-extension-filters.gif" style="width: 70%"> <figcaption> <em>Filters in the old version</em> </figcaption> </figure></p> <figure> <img src="new-extension-filters.gif" style="width: 70%"> <figcaption> <em>Filters in the new version</em> </figcaption> </figure><p><em>Rationale</em>: This will allow users to be more precise in their queries when searching.</p> <h3 id="browsing-by-source">Browsing by source</h3><p>The extension now has a dynamically updated "sources" section. Clicking a source link triggers a request to the <code>/image</code> endpoint to get the images associated with it.</p> <p><figure > <img src="source-section-light.gif" style="width: 70%"> <figcaption> <em>Source section</em> </figcaption> </figure></p> <figure> <img src="source-section-dark.gif" style="width: 70%"> <figcaption> <em>Source section in dark mode</em> </figcaption> </figure><p><em>Rationale</em>: This opens an avenue for exploration of all the different sources which are available in the catalog. This is advantageous for users who are not familiar with the type of content a particular source provides. They might run into a source that has a huge catalog of high-quality images that they are looking for.</p>
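<p>As a rough sketch of what such a filtered request looks like from code (the parameter names below are illustrative assumptions; consult the <a href="https://api.creativecommons.engineering/v1/">API documentation</a> for the authoritative query fields):</p> <div class="hll"><pre># Illustrative sketch of a filtered image search against the CC Catalog API.
# Parameter names are assumptions for illustration; see the API docs for
# the authoritative fields.
import requests

response = requests.get(
    "https://api.creativecommons.engineering/v1/images",
    params={
        "q": "geometry",            # free-text query
        "license": "by",            # license filter
        "source": "sciencemuseum",  # restrict results to one source
    },
)
for result in response.json().get("results", []):
    print(result.get("title"), result.get("url"))
</pre></div>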
<h3 id="search-by-tags">Search by tags</h3><p>Most of the images have some tags associated with them, which are also sent along with the image data by the API. This, and the flexibility of the <code>/image</code> endpoint, paved the way for the addition of searching for images using image tags.</p> <p><figure > <img src="search-by-image-tag.gif" style="width: 70%"> <figcaption> <em>Search by image tag</em> </figcaption> </figure></p> <p><em>Rationale</em>: Image tags will allow users to incrementally make their queries better and more specific.</p> <h3 id="related-images">Related images</h3><p>In the image detail section of any particular image, you can now see several recommendations. This has been made possible by adding support for the <a href="https://api.creativecommons.engineering/v1/#tag/recommendations"><code>/recommendations/images/{identifier}</code></a> endpoint of the API.</p> <p><figure > <img src="related-images.gif" style="width: 70%"> <figcaption> <em>Image recommendations</em> </figcaption> </figure></p> <p><em>Rationale</em>: This will help users find a variety of images that fit their requirements and also explore the images that would not usually show up on the initial pages of the search result.</p> <h2 id="improvements-to-bookmarks-section">Improvements to bookmarks section</h2><p>The bookmarks section has great prominence in the CC Search Extension because the export/import workflow is tied to it and, unlike the search result data, the bookmarks data is preserved across user sessions (closing the extension does not wipe out the bookmarks). It has undergone some crucial improvements, like caching, voluntary loading, and an increase in the number of bookmarks that it can hold (the limit is now 300, up from ~50).</p> <p>The bookmarks section is significantly faster now, as caching has eliminated the need to make many simultaneous network requests to the API when bookmarks are loaded. Voluntary loading also helps reduce perceived lag by reducing the number of bookmarks that load at once.</p> <p>Though the performance improvement is better appreciated when actually using the extension, I have tried to demonstrate it by comparing the rendering of the bookmarked images.</p> <p><figure > <img src="bookmarks-in-old-version.gif" style="width: 70%"> <figcaption> <em>Bookmark section in the old version</em> </figcaption> </figure></p> <p><figure > <img src="bookmarks-in-new-version.gif" style="width: 70%"> <figcaption> <em>Bookmarks section in the new version</em> </figcaption> </figure></p> <h2 id="a-better-use-of-sync-storage">A better use of sync storage</h2><p>The bookmarks and the user settings are synced between user systems. There are very tight write limits and byte quotas associated with this storage (<a href="https://developer.chrome.com/apps/storage#properties">documentation link</a>). Due to this, the way the extension uses this storage, and the assumptions it makes about its schema, were improved multiple times. Since the extension was already in production and had around 5,000 weekly users, the code for migrating the user's sync storage was pushed along with these updates.
Support was also added for legacy bookmark files that some users might still be using.</p> <h2 id="integration-with-vocabulary">Integration with Vocabulary</h2><p>The extension now supports the latest version of <a href="https://github.com/creativecommons/vocabulary">CC Vocabulary</a>. The challenging part of this was to rethink, mold, and update each and every workflow of the extension according to the new design.</p> <p><figure > <img src="image-detail-old-version.gif" style="width: 70%"> <figcaption> <em>Old version: Image detail</em> </figcaption> </figure></p> <p><figure > <img src="image-detail-new-version.gif" style="width: 70%"> <figcaption> <em>New version: Image detail</em> </figcaption> </figure></p> <p><figure > <img src="deletion-in-old-version.gif" style="width: 70%"> <figcaption> <em>Old version: Deleting bookmarks</em> </figcaption> </figure></p> <p><figure > <img src="deletion-in-new-version.gif" style="width: 70%"> <figcaption> <em>New version: Deleting bookmarks</em> </figcaption> </figure></p> <p><figure > <img src="dark-mode-old-version.gif" style="width: 70%"> <figcaption> <em>Old version: Dark mode</em> </figcaption> </figure></p> <p><figure > <img src="dark-mode-new-version.gif" style="width: 70%"> <figcaption> <em>New version: Dark mode</em> </figcaption> </figure></p> <h2 id="release-on-microsoft-edge">Release on Microsoft Edge</h2><p>I am also testing the extension on Microsoft Edge, and we have it <a href="https://microsoftedge.microsoft.com/addons/detail/cc-search/djolilnbndifmlfmcdnifdfjfbglipgc">listed</a> on the Edge store. You can soon expect the latest version of the CC Search Extension to be available for install there.</p> <h2 id="code">Code</h2><p>The project repository is hosted on <a href="https://github.com/creativecommons/ccsearch-browser-extension">GitHub</a>. During this period, I have made <a href="https://github.com/creativecommons/ccsearch-browser-extension/compare/v1.3.0...master">more than 320</a> commits.</p> <p>The major pull requests: <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/249">#249</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/255">#255</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/268">#268</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/270">#270</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/271">#271</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/272">#272</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/275">#275</a>, <a href="https://github.com/creativecommons/ccsearch-browser-extension/pull/276">#276</a></p> <p>Also, during this period, 5 updates of the extension were pushed to the extension stores. You can check out the <a href="https://github.com/creativecommons/ccsearch-browser-extension/releases">releases page</a>.</p> <h2 id="acknowledgements">Acknowledgements</h2><p>I would like to thank <a href="https://creativecommons.org/author/aldencreativecommons-org/">Alden</a> and <a href="https://creativecommons.org/author/kriticreativecommons-org/">Kriti</a> for their valuable guidance during this journey.
Special thanks to <a href="https://github.com/panchovm">Francisco</a> for designing the mockups of the extension, and to the wonderful contributors of CC Vocabulary.</p> Overview of the GSoC 2020 Project 2020-08-26T00:00:00Z ['charini'] urn:uuid:4cdfc111-c714-33e1-b521-390d487d46da <p>This is my final blog post under the <a href="/blog/entries/overview-of-the-gsoc-2020-project/#series">GSoC 2020: CC catalog</a> series, where I will highlight and summarize my contributions to Creative Commons (CC) as part of my GSoC project. The CC Catalog project collects and stores CC licensed images scattered across the internet, such that they can be made accessible to the general public via the <a href="https://ccsearch.creativecommons.org/">CC Search</a> and <a href="https://api.creativecommons.engineering/v1/">CC Catalog API</a> tools. I got the opportunity to work on different aspects of the CC Catalog repository, which ultimately enhances the user experience of the CC Search and CC Catalog API tools. My primary contributions over the duration of GSoC, and the related pull requests (PRs), are as follows.</p> <ol> <li><p><strong>Sub-provider retrieval</strong>: The first task I completed as part of my GSoC project was the retrieval of sub-providers (also known as <em>source</em>), such that images could be categorised under these sources, ensuring an enhanced search experience for users. I completed the implementation of sub-provider retrieval for three providers: Flickr, Europeana, and Smithsonian. If you are interested in learning how the retrieval logic works, please check my <a href="/blog/entries/flickr-sub-provider-retrieval/">initial blog post</a> of this series. The PRs related to this task are as follows.</p> <ul> <li>PR #<a href="https://github.com/cc-archive/cccatalog/pull/420">420</a>: Retrieve sub-providers within Flickr</li> <li>PR #<a href="https://github.com/cc-archive/cccatalog/pull/442">442</a>: Retrieve sub-providers within Europeana</li> <li>PR #<a href="https://github.com/cc-archive/cccatalog/pull/455">455</a>: Retrieve sub-providers within Smithsonian</li> <li>PR #<a href="https://github.com/cc-archive/cccatalog/pull/461">461</a>: Add new source as a sub-provider of Flickr</li> </ul> </li> <li><p><strong>Alert updates to Smithsonian unit codes</strong>: For the Smithsonian provider, we rely on the field known as <em>unit code</em> to determine the sub-provider (for Smithsonian it is often a museum) each image belongs to. However, it is possible for the <em>unit code</em> values to change upstream over time, and if CC is unaware of these changes, it could hinder the successful categorisation of Smithsonian images under unique sub-provider values. I have therefore introduced a mechanism for alerting the CC code maintainers of potential upstream changes to <em>unit code</em> values. More information is provided in my <a href="/blog/entries/smithsonian-unit-code-update/">second blog post</a> of this series. The PR related to this task is #<a href="https://github.com/cc-archive/cccatalog/pull/465">465</a>.</p> </li> <li><p><strong>Improvements to the Smithsonian provider API script</strong>: Smithsonian is an important provider which aggregates images from 19 museums. However, because the different museums have different data models, and the JSON responses returned by the Smithsonian API are correspondingly inconsistent, it is difficult to know which fields to rely on to obtain the information necessary for CC.
This results in CC missing out on certain important information. As part of my GSoC project, I improved the completeness of <em>creator</em> and <em>description</em> information by identifying previously unknown fields from which these details could be retrieved. Even though my improvements did not result in the identification of a comprehensive list of fields, the completeness of the data was considerably improved for some Smithsonian museums compared to how it was before. For more context about this issue please refer to the ticket #<a href="https://github.com/cc-archive/cccatalog/issues/397">397</a>. Apart from improving the Smithsonian data, I was also able to identify issues with certain Smithsonian API responses which did not contain mandatory information for some of the museums. We have informed the Smithsonian technical team of these issues, and they are highlighted in ticket #<a href="https://github.com/cc-archive/cccatalog/issues/397">397</a> as well. The PRs related to this task are as follows.</p> <ul> <li>PR #<a href="https://github.com/cc-archive/cccatalog/pull/474">474</a>: Improve the creator and description information of the Smithsonian source <em>National Museum of Natural History</em> (NMNH). This is the largest museum (source) under the Smithsonian provider.</li> <li>PR #<a href="https://github.com/cc-archive/cccatalog/pull/476">476</a>: Improve the <em>creator</em> and <em>description</em> information of other sources coming under the Smithsonian provider.</li> </ul> </li> <li><p><strong>Expiration of outdated images</strong>: The final task I completed as part of my GSoC project was implementing a strategy for expiring outdated images in the CC database. CC has a mechanism for keeping the images it has retrieved from providers up-to-date, based on how old an image is. This is called the <a href="/blog/entries/date-partitioned-data-reingestion/">re-ingestion strategy</a>, where newer images are updated more frequently compared to older images. However, this re-ingestion strategy does not detect images which have been deleted at the upstream. Thus, it is possible that some of the images stored in the CC database are obsolete, which could result in broken links being presented via the <a href="https://ccsearch.creativecommons.org/">CC Search</a> tool. As a solution, I have implemented a mechanism for identifying whether images in the CC database are obsolete by looking at the <em>updated_on</em> column value of the CC image table. Given the re-ingestion strategy of each provider, we know the oldest <em>updated_on</em> value an image can legitimately assume. If the <em>updated_on</em> value is older than the oldest valid value, we flag the corresponding image record as obsolete. The PR related to this task is #<a href="https://github.com/cc-archive/cccatalog/pull/483">483</a>.</p> </li> </ol>
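<p>A simplified sketch of the idea follows. The table and column names are only illustrative (the actual implementation lives in PR #483); the point is that each provider's re-ingestion schedule implies a hard lower bound on <em>updated_on</em> for any image that still exists upstream.</p> <div class="hll"><pre># Simplified sketch of the obsolescence check; table and column names are
# illustrative, not the actual implementation from PR #483.
from datetime import datetime, timedelta

def oldest_valid_updated_on(max_reingestion_interval_days):
    # An image still present upstream must have been re-ingested (and thus
    # touched) at least once within the provider's slowest cycle.
    return datetime.utcnow() - timedelta(days=max_reingestion_interval_days)

def flag_obsolete_images(cursor, provider, max_reingestion_interval_days):
    cutoff = oldest_valid_updated_on(max_reingestion_interval_days)
    cursor.execute(
        "UPDATE image SET removed_from_source = true "
        "WHERE provider = %s AND updated_on &lt; %s",
        (provider, cutoff),
    )
</pre></div>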
<p>I will continue to take responsibility for maintaining my code in the CC Catalog repository, and I hope to continue contributing to the CC codebase. It has been a wonderful GSoC journey for me, and special thanks go to my supervisor Brent for his guidance.</p> Automate GitHub for more than CI/CD 2020-08-26T00:00:00Z ['zackkrida'] urn:uuid:1f2b4dad-de07-33ab-b1c7-394778548e55 <blockquote><p><em>Get started using GitHub bots and actions for community management and repository health.</em></p> </blockquote> <p>In late 2018, in the midst of being acquired by Microsoft, GitHub <a href="https://github.blog/2018-10-16-future-of-software/">launched GitHub Actions</a> into public beta, allowing users to run code on the popular development platform for the first time. With a straightforward <code>YAML</code> configuration syntax and the power of Microsoft's Azure cloud, GitHub Actions quickly rose to compete with existing Continuous Integration (CI) and Continuous Deployment (CD) platforms like <strong>Circle CI</strong> and <strong>Travis CI</strong>. GitHub Actions made it easier than ever for developers to test and deploy software in the cloud, but from the beginning GitHub had bigger plans for the service.</p> <p>In a <a href="https://techcrunch.com/2018/10/16/github-launches-actions-its-workflow-automation-tool/">2018 TechCrunch interview</a>, GitHub's then head of platform, Sam Lambert, acknowledged the usefulness of actions for more than CI/CD. "I see CI/CD as one narrow use case of actions. It’s so, so much more,” Lambert stressed. “And I think it’s going to revolutionize DevOps because people are now going to build best in breed deployment workflows for specific applications and frameworks, and those become the de facto standard shared on GitHub. […] It’s going to do everything we did for open source again for the DevOps space and for all those different parts of that workflow ecosystem."</p> <p>At Creative Commons, we use GitHub Actions and bots on many of <a href="https://github.com/creativecommons?type=source">our open-source projects</a> for more than CI/CD: to manage our <a href="/community/community-team/">community team</a>, to automate repository health, and to automate tedious but frequent tasks. The following examples are just a small snapshot of our existing and in-progress automations.</p> <h2 id="example-automations">Example automations</h2><p><!-- no toc --></p> <ul> <li><a href="/blog/entries/automate-github-for-more-than-CI CD/#automatic-release-note-generation">Release note generation</a></li> <li><a href="/blog/entries/automate-github-for-more-than-CI CD/#repository-normalization">Repository normalization</a></li> <li><a href="/blog/entries/automate-github-for-more-than-CI CD/#automatic-dependency-updates">Dependency updates</a></li> </ul> <h3 id="release-note-generation">Release note generation</h3><p>Our frontend Vue.js application for CC Search gets released weekly, and is subject to constant pull requests from myself, one-time volunteers making their first open source contribution, and long-term, dedicated community members who frequently contribute. It's important for us to highlight <em>all</em> of these contributions in our release notes, regardless of size or scope.
Additionally, we find it useful to group changes into categories, so our users have a clear sense of what kinds of updates we've made.</p> <div style="text-align: center;"> <figure class="margin-bottom-large"> <img src="release-notes-screenshot.png" alt="GitHub screenshot of release notes for CC Search" /> <figcaption> <em> An example of CC Search release notes generated by the <a href="https://github.com/marketplace/actions/release-drafter">Release Drafter</a> GitHub Action. </em> </figcaption> </figure> </div><p>Release notes of this quality were quite tedious to generate manually. With the <a href="https://github.com/marketplace/actions/release-drafter">release drafter action</a>, we're able to automatically update a draft release note on every pull request to CC Search. The action lets us configure the line added for each pull request with some basic templating which includes variables for the PR number, title, and author (among others):</p> <div class="hll"><pre><span></span><span class="nt">change-template</span><span class="p">:</span><span class="w"> </span><span class="s">&#39;-</span><span class="nv"> </span><span class="s">$TITLE:</span><span class="nv"> </span><span class="s">#$NUMBER</span><span class="nv"> </span><span class="s">by</span><span class="nv"> </span><span class="s">@$AUTHOR&#39;</span> </pre></div> <p><br />This means each pull request gets a line like this in our release notes:</p> <blockquote><p>Enable web monetization on single result pages: <strong>#1191</strong> by <strong>@zackkrida</strong></p> </blockquote> <p>Perfect! We can also map GitHub labels on our pull requests to the sections of our generated release notes, like so:</p> <div class="hll"><pre><span></span><span class="nt">categories</span><span class="p">:</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">title</span><span class="p">:</span><span class="w"> </span><span class="s">&#39;New</span><span class="nv"> </span><span class="s">Features&#39;</span> <span class="w"> </span><span class="nt">label</span><span class="p">:</span><span class="w"> </span><span class="s">&#39;feature&#39;</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">title</span><span class="p">:</span><span class="w"> </span><span class="s">&#39;Bug</span><span class="nv"> </span><span class="s">Fixes&#39;</span> <span class="w"> </span><span class="nt">label</span><span class="p">:</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">&#39;bug&#39;</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">&#39;critical&#39;</span> </pre></div> <p>The resulting release notes require no manual editing at release time, have saved us hours over time, and allow our developers to focus on DevOps work instead of copywriting on release days. We also never miss a contribution or expression of gratitude to one of our contributors.
You can read the <a href="https://github.com/cc-archive/cccatalog-frontend/releases/latest">latest CC Search release notes</a> or <a href="https://github.com/cc-archive/cccatalog-frontend/blob/develop/.github/release-drafter.yml">see our full release-drafter.yml file here</a>.</p> <h3 id="repository-normalization">Repository Normalization</h3><p>Within a private repository of internal helper scripts, the CC technical team has a number of GitHub Actions which trigger Python scripts to keep configuration standardized across our repositories. We casually call this process "repository normalization". One such script ensures that we use a standard set of GitHub labels across all of our projects. This consistency helps us do things like direct users to <a href="https://github.com/search?q=org%3Acreativecommons+label%3A%22help+wanted%22+state%3Aopen&amp;type=Issues">open issues in need of assistance</a> across the organization, or issues <a href="https://github.com/search?q=org%3Acreativecommons+label%3A%22good+first+issue%22+state%3Aopen&amp;type=Issues">good for first-time open source contributors</a>. With GitHub Actions, it's easy to set up scheduled tasks with only a few lines of human-readable configuration. Here's the gist of running a Python script daily, for example:</p> <div class="hll"><pre><span></span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Example scheduled python action</span> <span class="nt">on</span><span class="p">:</span> <span class="w"> </span><span class="nt">schedule</span><span class="p">:</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">cron</span><span class="p">:</span><span class="w"> </span><span class="s">&#39;0</span><span class="nv"> </span><span class="s">0</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*&#39;</span> <span class="w"> </span><span class="nt">push</span><span class="p">:</span> <span class="w"> </span><span class="nt">branches</span><span class="p">:</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">master</span> <span class="nt">jobs</span><span class="p">:</span> <span class="w"> </span><span class="nt">build</span><span class="p">:</span> <span class="w"> </span><span class="nt">runs-on</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ubuntu-latest</span> <span class="w"> </span><span class="nt">steps</span><span class="p">:</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/checkout@v2</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Set up Python 3.7</span> <span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/setup-python@v1</span> <span class="w"> </span><span class="nt">with</span><span class="p">:</span> <span class="w"> </span><span class="nt">python-version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3.7</span> <span class="w"> 
</span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Install dependencies</span> <span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span> <span class="w"> </span><span class="no">python -m pip install --upgrade pip</span> <span class="w"> </span><span class="no">python -m pip install pipenv</span> <span class="w"> </span><span class="no">pipenv install</span> <span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Export token to env and run our script</span> <span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span> <span class="w"> </span><span class="no">pipenv run python our-script.py</span> <span class="w"> </span><span class="nt">env</span><span class="p">:</span> <span class="w"> </span><span class="nt">ADMIN_GITHUB_TOKEN</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.ADMIN_GITHUB_TOKEN }}</span> </pre></div> <p>Internally and publicly, we use <a href="https://github.com/orgs/creativecommons/projects">GitHub Projects</a> to manage our bi-weekly sprints and backlogs. The <a href="https://github.com/subhamX/github-project-bot">GitHub Project Bot</a> action was built by <a href="https://github.com/subhamX">one of our community contributors</a> and allows us to add pull requests to our project columns. Here's an example step in such a job:</p> <div class="hll"><pre><span></span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Handle cccatalog-frontend Repo</span> <span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">subhamX/github-project-bot@v1.0.0</span> <span class="w"> </span><span class="nt">with</span><span class="p">:</span> <span class="w"> </span><span class="nt">ACCESS_TOKEN</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.ADMIN_GITHUB_TOKEN }}</span> <span class="w"> </span><span class="nt">COLUMN_NAME</span><span class="p">:</span><span class="w"> </span><span class="s">&quot;In</span><span class="nv"> </span><span class="s">Progress</span><span class="nv"> </span><span class="s">(Community)&quot;</span> <span class="w"> </span><span class="nt">PROJECT_URL</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/orgs/creativecommons/projects/7</span> <span class="w"> </span><span class="nt">REPO_URL</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/cc-archive/cccatalog-frontend</span> </pre></div> <p>We have additional scripts that sync our community team members across our open source website and GitHub, and several others that do even more of this cross-platform synchronization work. All of these scripts relieve our engineering manager and open source community coordinator of significant burden.</p>
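<p>For a flavor of what one of these normalization scripts can look like, here is a hypothetical sketch of label syncing via the GitHub REST API. The repository list, label set, and colors are illustrative assumptions, not our actual private script.</p> <div class="hll"><pre># Hypothetical sketch of a label-normalization script; repositories, labels,
# and colors are illustrative, not our actual private script.
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": "token " + os.environ["ADMIN_GITHUB_TOKEN"]}

STANDARD_LABELS = {"help wanted": "008672", "good first issue": "7057ff"}

def ensure_labels(org, repo):
    # Create any standard label the repository is missing (pagination elided).
    url = GITHUB_API + "/repos/" + org + "/" + repo + "/labels"
    existing = {label["name"] for label in requests.get(url, headers=HEADERS).json()}
    for name, color in STANDARD_LABELS.items():
        if name not in existing:
            requests.post(url, headers=HEADERS, json={"name": name, "color": color})

for repo in ("vocabulary", "ccsearch-browser-extension"):
    ensure_labels("creativecommons", repo)
</pre></div>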
<h3 id="dependency-updates">Dependency Updates</h3><p>Modern JavaScript projects are built atop piles of 3rd party dependencies. This frees developers to focus on product code instead of writing the same utility code over and over again, but exposes projects to issues of security and dependency management. To help alleviate these issues, GitHub <a href="https://github.blog/2019-05-23-introducing-new-ways-to-keep-your-code-secure/#automated-security-fixes-with-dependabot">acquired a startup called Dependabot</a> which initially focused on automatic security updates for repositories. Dependabot creates pull requests that update third-party code with known security vulnerabilities to the latest safe and stable versions.</p> <p>This summer (June 2020), GitHub <a href="https://github.blog/2020-06-01-keep-all-your-packages-up-to-date-with-dependabot/">expanded Dependabot's scope</a> to keep <em>all</em> third-party code up to date, regardless of security. By adding a <code>dependabot-config.yml</code> file to any repo, developers no longer need to keep track of dependency updates on their own.</p> <div style="text-align: center;"> <figure class="margin-bottom-large"> <img src="dependabot-example.png" alt="GitHub screenshot of a Dependabot PR message" /> <figcaption> <em> Dependabot writes pull requests to bump JavaScript dependencies and will automatically resolve merge conflicts and keep the PR up to date. </em> </figcaption> </figure> </div><p>If your project has strong test coverage and a solid quality control process for release management, Dependabot pull requests can be made even more powerful with the <a href="https://github.com/ridedott/merge-me-action">Merge Me Action.</a> Merge Me can be added to the end of any series of GitHub Actions to automatically merge pull requests authored by a particular user that pass all CI tests (the action assumes <code>dependabot</code> by default). This means your repository can have highly-configurable, fully-automated dependency updates in just a few lines of <code>YAML</code>.</p> <h2 id="here-s-a-few-more">Here are a few more</h2><p>Here are some smaller and simpler automations that can make a huge difference in your workflows.</p> <ul> <li><a href="https://github.com/probot/stale">Automatically close old PRs after a period of inactivity</a></li> <li><a href="https://github.blog/2020-08-24-automate-releases-and-more-with-the-new-sentry-release-github-action/">Automate security releases on Sentry</a></li> <li><a href="https://github.com/probot/reminders">Add reminders to issues and pull requests</a></li> </ul> <p>These examples are a small sample of the non-CI/CD capabilities of GitHub Actions. You can peek in the <code>.github/</code> directory of any of our open source repositories to see the actions we're using, and feel free to make an issue on any project if you have an idea for an automation of your own.
As we increase the number and quality of integrations in our open source repositories, we may update this article or create follow-up posts with more examples.</p> <p>If you're interested in learning more about GitHub Actions, GitHub has a wonderful <a href="https://github.com/marketplace?type=actions">marketplace</a> of available actions you can explore, and the <a href="https://docs.github.com/actions">documentation for actions</a> is available in several languages.</p> Linked Commons: Data Update 2020-08-25T00:00:00Z ['subhamX'] urn:uuid:bf2032b7-d1c2-320b-9cf1-92ad64320a02 <p>In this blog, I will be explaining the task we were working on for the last 3-4 weeks. It will take you on a journey of optimizations, from a million graph traversals in building the database down to just a few traversals in the end. We will also cover the new architecture for the upcoming version of the Linked Commons and the reason behind the change.</p> <h2 id="where-does-it-fit">Where does it fit?</h2><p>So far, the Linked Commons has been using a tiny subset of the data available in the CC Catalog. One of our team's primary targets was to update the data. If you observe closely, all the tasks so far, from adding "Graph Filtering Methods" to the "Autocomplete Feature", were actually bringing us closer to this much-awaited task: <strong>"Scale the Data of Linked Commons"</strong>. We aim to add around <strong>235k nodes and 4.14 million links</strong> to the Linked Commons project, up from around <strong>400 nodes and 500 links</strong> in the current version. This drastic addition of new data is unprecedented for the project, which makes this task very challenging and exciting.</p> <h2 id="pilot">Pilot</h2><p>The raw CC Catalog data cannot be used directly in the Linked Commons. Our first task involves processing it, which includes removing isolated nodes, etc. You can read more about it in the data processing series <a href="/blog/entries/cc-datacatalog-data-processing/">blog</a> written by my mentor Maria. After this, we need to build a database which stores the <strong>"distance list"</strong> of all the nodes.</p> <h3 id="what-is-distance-list">What is a "distance list"?</h3><div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="distance-list.png" alt="Distance List" style="border: 1px solid black"> <figcaption>Distance list representation* of the node 'icij', part of a hypothetical graph</figcaption> </figure> </div><hr> <p>A <strong>distance list</strong> is a method of graph representation. It is similar to the <a href="https://en.wikipedia.org/wiki/Adjacency_list">Adjacency List</a> representation of graphs, but instead of storing only the immediately neighbouring nodes, a "distance list" groups all vertices by their distance from the root node and stores this grouped data for every vertex in the graph. In short, the "distance list" is a more general form of the Adjacency List representation.</p> <p>To build this "distance list", we created a script, let's name it <strong>build-dB-script.py</strong>, which runs the <a href="https://en.wikipedia.org/wiki/Breadth-first_search">Breadth-First Search (BFS)</a> algorithm from every node to traverse the graph and gradually build the distance list. The node-filtering feature of our web page connects to the server, which uses the aforementioned database to serve a smaller chunk of data.</p>
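<p>As a minimal sketch of the idea (the real build-dB-script handles a far larger graph and a richer schema), a BFS from a given root groups the reachable vertices by distance:</p> <div class="hll"><pre># Minimal sketch of building one node's "distance list" with BFS, as
# described above; the real build-dB-script operates on a far larger graph.
from collections import deque

def distance_list(adjacency, root):
    """Group every vertex reachable from `root` by its BFS distance."""
    distances = {root: 0}
    groups = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                groups.setdefault(distances[neighbour], []).append(neighbour)
                queue.append(neighbour)
    return groups

# e.g. distance_list({"icij": ["a"], "a": ["b"]}, "icij") == {1: ["a"], 2: ["b"]}
</pre></div>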
<h2 id="problem">Problem</h2><p>Now that we know where the <em>build-dB-script</em> is used, let’s discuss the problems with it. The new graph data we are going to use is enormous, with nodes and links numbering in the millions. A full traversal of a graph with a million nodes, repeated a million times, is very slow. Just to give some helpful numbers, the script was taking around 10 minutes to process a hundred nodes. Assuming the growth is linear (in the best case), it would take more than <strong>15 days</strong> to complete the computations. <strong>That is scary, and thus optimizations in the <em>build-dB-script</em> were the need of the hour!</strong></p> <h2 id="optimizations">Optimizations</h2><p>In this section, we will talk about the different versions of the build-database script, starting from the brute force BFS method.</p> <p>The brute force BFS was the simplest and technically correct solution, but as the name suggests it was slow. In the next iteration, I stored the details of the last n nodes (10, to be precise) and performed the same BFS. It was faster, but it had a logic error: if there was a link from a node to an already visited node, the script failed to include the nodes that could have been explored along that path. After a few more leaps between Depth-First Search, Breadth-First Search, and other methods, with the help of my mentors we eventually built a new approach: <strong>"Sequential dB Build"</strong>.</p> <p>To keep this blog short, I won’t be going too much into implementation details, but here are some of the critical points.</p> <h3 id="key-points-of-the-sequential-db-build">Key points of the Sequential dB Build:</h3><ul> <li>It was the fastest of all its predecessors and reduced the script's running time significantly.</li> <li>In this approach, we aimed to build all the distance lists for distances [1, 2, 3, ..., k-1] before building the kth distance list.</li> </ul> <p>Unfortunately, it was still not enough for our current requirements. Just to give you some insight, the distance-two list computation was taking around <strong>4 hours</strong>, and the <strong>distance-three list</strong> computation was taking <strong>20+ hours</strong>. It shows that all these optimizations were not enough and were incapable of handling this big dataset.</p> <h2 id="new-architecture">New Architecture</h2><p>As the optimizations in the <em>build-dB-script</em> weren’t enough, we started looking to simplify the current architecture. In the end, we want a viable product which can scale to this massive dataset. Although we are not dropping multi-distance filtering, we will continue our research on it and hopefully will have it in <strong>Linked Commons 3.0</strong>.</p> <p>For any node, it is most likely that a person would wish to know the immediate neighbours linking to some arbitrary node. Nodes at a distance greater than one convey much less information about the reach and connectivity of the root node. Because of this, we decided to change our logic of storing distance lists up to distance 10; instead, we reduced it to 1 and also stored the list of immediate incoming nodes (nodes which are at distance 1 in the <a href="https://en.wikipedia.org/wiki/Transpose_graph">transpose graph</a>).</p> <p>This small change in the design simplified a lot of things, and the new graph build now takes around 2 minutes. By the time of writing this blog, we have upgraded our database from <strong>shelve to MongoDB</strong>, where the build time is further reduced.</p>
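<p>Under the new design, the stored record for each node reduces to something like the following sketch (illustrative only; as noted in the footnote below, the real Linked Commons schema is more complex):</p> <div class="hll"><pre># Illustrative sketch of the simplified per-node record described above;
# the actual Linked Commons schema is more complex.
node_record = {
    "node_id": "icij",
    # Immediate neighbours in the original graph (outgoing links).
    "links_to": ["example-a.org", "example-b.org"],
    # Immediate neighbours in the transpose graph (incoming links).
    "linked_from": ["example-c.org"],
}
</pre></div>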
<div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="graph.png" alt="Light Theme" style="border: 1px solid black"> <figcaption>Graph showing neighbouring nodes. Incoming links are coloured with turquoise and outgoing links with red.</figcaption> </figure> </div><h2 id="conclusion">Conclusion</h2><p>This task was really challenging, and I learnt a lot. It was mesmerizing to see the <strong>Linked Commons grow and evolve</strong>. I hope you enjoyed reading this blog. You can follow the project development <a href="https://github.com/cc-archive/cccatalog-dataviz/">here</a>, and access the stable version of the Linked Commons <a href="http://dataviz.creativecommons.engineering/">here</a>.</p> <p>Feel free to report bugs and suggest features; it will help us improve this project. If you wish to join our team, consider joining our <a href="https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz">Slack</a> channel. Read more about our community teams <a href="/community/">here</a>. See you in my next blog!</p> <hr> <p>*<em>The Linked Commons uses a more complex schema. The picture is just for illustration.</em></p> CC Catalog: wrapping up GSoC20 2020-08-25T00:00:00Z ['srinidhi'] urn:uuid:ba947438-0d00-32ec-8ba7-acf6f5f15eb5 <p>With the Summer of Code coming to an end, this blog post summarises the work done during the last three months. The project I have been working on is to add more provider API scripts to the CC Catalog. The CC Catalog project is responsible for collecting CC licensed images hosted across the web.</p> <p>The internship journey has been great, and I was glad to get the opportunity to understand more about the workings of the data pipeline. My work during the internship mainly involved researching new API providers and checking if they meet the necessary conditions; we then decided on a strategy to crawl each API. The strategy varies between APIs: some can be partitioned based on date, while others have to be paginated. A script is then written for the API according to the strategy. During the later phase of the internship, I worked on the reingestion strategy for Europeana and on a script to merge Common Crawl tags and metadata into the corresponding image in the image table.</p> <p>Provider APIs implemented:</p> <ul> <li>Science Museum: The Science Museum collection has around 60,000 images; it was initially crawled through Common Crawl and then shifted to an API-based crawl.<ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/302">Science Museum ticket</a></li> <li>Related PRs: <a href="https://github.com/cc-archive/cccatalog/pull/400">Science Museum script</a>, <a href="https://github.com/cc-archive/cccatalog/pull/411">Science Museum workflow</a></li> </ul> </li> </ul> <ul> <li>Statens Museum: Statens Museum for Kunst is Denmark’s leading museum for artwork. This is a new integration, and 39,115 images have been collected.<ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/393">Statens Museum ticket</a></li> <li>Related PRs: <a href="https://github.com/cc-archive/cccatalog/pull/428">Statens Museum implementation</a></li> </ul> </li> </ul> <ul> <li>Museums Victoria: It was initially ingested from Common Crawl and later shifted to an API-based crawl.
It has around 140,000 images.<ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/291">Museums Victoria ticket</a></li> <li>Related PRs: <a href="https://github.com/cc-archive/cccatalog/pull/447">Museums Victoria implementation</a></li> </ul> </li> </ul> <ul> <li>NYPL: The New York Public Library is a new integration; as of now it has around 1,296 images.<ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/147">NYPL ticket</a></li> <li>Related PRs: <a href="https://github.com/cc-archive/cccatalog/pull/462">NYPL implementation</a></li> </ul> </li> </ul> <ul> <li>Brooklyn Museum: This was an existing integration; changes were made to follow the new <code>ImageStore</code> and <code>DelayedRequestor</code> classes. It has 61,503 images.<ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/348">Brooklyn Museum ticket</a></li> <li>Related PRs: <a href="https://github.com/cc-archive/cccatalog/pull/355">Brooklyn Museum implementation</a></li> </ul> </li> </ul> <p>Iconfinder is a provider of icons that could not be integrated, as the current ingestion strategy is very slow and we need a better one.</p> <ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/396">Iconfinder ticket</a></li> </ul> <h2 id="europeana-reingestion-strategy">Europeana reingestion strategy</h2><p>Data from Europeana was collected on a daily basis, and it needed to be refreshed. The idea is that new data should be refreshed more frequently, and as the data gets older, refreshing should become less frequent. While developing the strategy, the API key limit and the maximum expected collection size have to be kept in mind. Considering these factors, a workflow was set up such that each day it crawls 59 days of data. The 59 days are split up into layers: the DAG crawls data up to one week old daily; data more than a week old but less than a year old monthly; and anything older than a year every three months.</p> <ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/412">Europeana reingestion ticket</a></li> <li>Related PR: <a href="https://github.com/cc-archive/cccatalog/pull/473">Europeana reingestion strategy</a></li> </ul> <p>More details regarding the math of reingestion: <a href="/blog/entries/date-partitioned-data-reingestion/">Data reingestion</a></p> <div style="text-align:center;"> <img src="dag_image_1.png" width="1000px"/> <img src="dag_image_2.png" width="1000px"/> <img src="dag_image_3.png" width="1000px"/> <p>Europeana reingestion workflow</p> </div>
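<p>As an illustration of the layering described above, the 59 daily recrawl dates can be generated along these lines. The exact offsets are assumptions for illustration; the real schedule lives in the cccatalog DAG configuration.</p> <div class="hll"><pre># Illustrative sketch of the layered reingestion schedule described above;
# the exact offsets are assumptions, not the production DAG configuration.
from datetime import date, timedelta

def reingestion_dates(today):
    # 7 daily + 12 roughly-monthly + 40 roughly-quarterly offsets = 59 days.
    offsets = list(range(7))
    offsets += [7 + 30 * i for i in range(12)]
    offsets += [367 + 90 * i for i in range(40)]
    return [today - timedelta(days=n) for n in offsets[:59]]

print(reingestion_dates(date(2020, 8, 25))[:3])
</pre></div>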
<h2 id="merging-common-crawl-tags">Merging Common Crawl tags</h2><p>When a provider is shifted from Common Crawl to an API-based crawl, the new data from the API doesn’t have the tags and metadata that were generated using Clarifai, and hence there is a need to associate the new data with the tags corresponding to that image in the Common Crawl data. A direct URL match is not possible, as the Common Crawl URLs and the API image URLs are different, so we try to match on the number or identifier that is associated with the URL.</p> <p>Currently, the merging logic is applied to the Science Museum, Museums Victoria and the Met Museum.</p> <p>In the Science Museum, the API URL in the image table looks like <a href="https://coimages.sciencemuseumgroup.org.uk/images/240/862/large_BAB_S_1_02_0017.jpg">https://coimages.sciencemuseumgroup.org.uk/images/240/862/large_BAB_S_1_02_0017.jpg</a> and the Common Crawl URL looks like <a href="https://s3-eu-west-1.amazonaws.com/smgco-images/images/369/541/medium_SMG00096855.jpg">https://s3-eu-west-1.amazonaws.com/smgco-images/images/369/541/medium_SMG00096855.jpg</a>. The idea is to reduce each URL down to its trailing identifier, so after modification by the modify_urls function the URLs look like <code>gpj.1700_20_1_S_BAB_</code> (API URL) and <code>gpj.55869000GMS_</code> (Common Crawl URL). Similar logic has been applied to the Met Museum and Museums Victoria.</p> <ul> <li>Issue: <a href="https://github.com/cc-archive/cccatalog/issues/468">https://github.com/cc-archive/cccatalog/issues/468</a></li> <li>Related PR: <a href="https://github.com/cc-archive/cccatalog/pull/478">https://github.com/cc-archive/cccatalog/pull/478</a></li> </ul> <h2 id="acknowledgement">Acknowledgement</h2><p>I would like to thank my mentors Brent and Anna for their guidance throughout the internship.</p> X5GON Using CC Catalog API for Image Results 2020-08-24T00:00:00Z ['annatuma'] urn:uuid:ffcc37e0-31ad-3231-b583-73749555ba0b <p>A few months ago, the Open Education team at Creative Commons made an introduction between the folks working on X5GON and CC Search.</p> <p>Over a few conversations, we quickly discovered that there are many parallels in how we're approaching our work, and some important differences that would allow each of us to benefit from cooperation.</p> <p><a href="https://www.x5gon.org/">X5GON</a> is building an AI-driven platform focused on the delivery of open education resources (OER). At its core, it is building a catalog of OER, upon which other <a href="https://www.x5gon.org/platforms/services/">services</a> are based, such as analytics for personalized recommendations and a discovery engine. By aggregating relevant content, curating it with the use of artificial intelligence and machine learning, and personalizing the experience for each learner, they're making OER more accessible and relevant.</p> <p>CC Search is not yet ready to ingest content types beyond images, but when we are able to do so, we plan to integrate via API with X5GON in order to serve OER that is made available in formats we will support in the future, starting with audio.</p> <p>The <a href="https://discovery.x5gon.org/">X5GON Discovery search engine</a> allows users to find OER in video, audio, and text formats, and now, with the integration of results powered by the CC Catalog API, which also powers CC Search, users can also find openly licensed images for relevant educational queries.
This is a great resource for educators and learners from all over the world.</p> <p>Try it for yourself, or look at these results for making <a href="https://discovery.x5gon.org/search?q=geometry&amp;type=Image">geometry</a> visual and fun!</p> How to politely crawl and analyze 500 million images 2020-08-17T00:00:00Z ['aldenpage'] urn:uuid:c4bde8a8-0a5d-324a-b450-571f43e3af02 <h4 id="background">Background</h4><p>The goal of <a href="https://search.creativecommons.org">CC Search</a> is to index all of the Creative Commons works on the internet, starting with images. We have indexed over 500 million images, which we believe is roughly 36% of all CC licensed content on the internet by <a href="https://creativecommons.org/2018/05/08/state-of-the-commons-2017/">our last count</a>. To further enhance the usefulness of our search tool, we recently started crawling and analyzing images for improved search results. This article will discuss the process of taking a paper design for a large scale crawler, implementing it, and putting it in production, with a few idealized code snippets and diagrams along the way. The full source code can be viewed on <a href="https://github.com/creativecommons/image-crawler">GitHub</a>.</p> <p>Originally, when we discovered an image and inserted it into CC Search, we didn't even bother downloading it; we stuck the URL in our database and embedded the image in our search results. This approach has a lot of problems:</p> <ol> <li>We don't know the dimensions or compression quality of images, which is useful both for relevance purposes (de-ranking low quality images) and for filtering. For example, some users are only interested in high resolution images and would like to exclude content below a certain size.</li> <li>We can't run any type of computer vision analysis on any of the images, which could be useful for enriching search metadata through object recognition.</li> <li>Embedding third party content is fraught with problems. What if the other party's server goes down, the images disappear due to link rot, or their TLS certificates expire? Each of these situations results in broken images appearing in the search results or browser alerts about degraded security.</li> </ol> <p>We solved (3) by setting up a <a href="https://github.com/willnorris/imageproxy">caching thumbnail proxy</a> between images in the search results and their 3rd party origin, as well as some last-minute liveness checks to make sure that the image hasn't 404'd.</p> <p>(1) and (2), however, are not possible to solve without actually downloading the image and performing some analysis on the contents of the file. For us to reproduce the features that users take for granted in image search, we're going to need a fairly powerful crawling system.</p> <p>On the scale of several thousand images, it would be easy to cobble together a few scripts to spit out this information, but with half a billion images, there are a lot of hurdles to overcome.</p> <ul> <li>We want to crawl <a href="https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy">politely</a>; however, the concentration and quantity of images means that we have to hit some sources with a high crawl rate in order to have any hope of finishing the crawl in a reasonable period of time. Our data sources range from non-profit museums with a single staff IT person to tech companies with their own data centers and thousands of employees; the crawl rate has to be tailored to download quickly from the big players but not overwhelm small sources. 
At the same time, we need to be sure that we are not overestimating any source's capacity and watch for signs that our crawler is straining the server.</li> <li>We need to keep the time to process each image as low as possible to make it feasible to finish the crawling and analysis task in a reasonable period of time. This means that the crawling and analysis tasks need to be distributed to multiple machines in parallel.</li> <li>A lot of metadata will be produced by this crawler. The step of integrating it with our internal systems must not block resizing tasks. That suggests that a message bus will be necessary to buffer messages before they are written into our data layer, where writes can be expensive.</li> <li>We want to have a basic idea of how the crawl is progressing in the form of summaries of error counts, status codes, and crawl rates for each source.</li> </ul> <p>In summary, the challenge here isn't so much making a really fast crawler as much as it is tailoring the crawl speed to each source. At a minimum, we'll need to deal with concurrency and parallelism, provisioning and managing the life cycle of crawler infrastructure, pipelines for capturing output data, a way to monitor the progress of the crawl, a suite of tests to make sure the system behaves as expected, and a reliable way to enforce a "politeness" policy. That's not a trivial project, particularly for our tiny three-person tech team (of which only one person is available to do all of the crawling work). Can't we just use an off-the-shelf open source crawler?</p> <h4 id="what-about-existing-open-source-crawlers">What about existing open source crawlers?</h4><p>Any decent software engineer will consider existing options before diving into a project and reinventing the wheel. My assessment was that although there are a lot of open source crawling frameworks available, few of them focus on images, some are not actively maintained, and all would require extensive customization to meet the requirements of our crawl strategy. Further, many solutions are more complex than our use case demands and would significantly expand our use of cloud infrastructure, resulting in higher expenses and more operational headaches. I experimented with Apache Nutch, Scrapy Cluster, and Frontera; none of the existing options looked quite right for our use case.</p> <p>As a reminder, we want to eventually crawl every single Creative Commons work on the internet. Effective crawling is central to the capabilities that our search engine is able to provide. In addition to being central to achieving high quality image search, crawling could also be useful for discovering new Creative Commons content of any type on any website. In my view, that's a strong argument for spending some time designing a custom crawling solution where we have complete end-to-end control of the process, as long as the feature set is limited in scope. In the next section, we'll assess the effort required to build a crawler from the ground up.</p> <h4 id="designing-the-crawler">Designing the crawler</h4><p>We know we're not going to be able to crawl 500 million images with one virtual machine and a single IP address, so it is obvious from the start that we are going to need a way to distribute the crawling and analysis tasks over multiple machines. A basic queue-worker architecture will do the job here; when we want to crawl an image, we can dispatch the URL to an inbound images queue, and a worker eventually pops that task out and processes it. Kafka will handle all of the hard work of partitioning and distributing the tasks between workers.</p>
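<p>The dispatch side of that design is small. Here is a minimal sketch using the <code>kafka-python</code> client; the topic name and broker address are illustrative.</p> <div class="hll"><pre># Minimal sketch of dispatching crawl tasks to the inbound images queue
# with kafka-python; topic name and broker address are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda msg: json.dumps(msg).encode("utf-8"),
)

# Some worker in the consumer group will eventually pop this task out
# and process the image.
producer.send("inbound_images", {"url": "https://example.org/image.jpg"})
producer.flush()
</pre></div>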
<p>The worker processes do the actual analysis of the images, which essentially entails downloading the image, extracting interesting properties, and sticking the resulting metadata back into a Kafka topic for later downstream processing. The worker will also have to include some instrumentation for conforming to rate limits and error reporting.</p> <p>We also know that we will need to share some information about crawl progress between worker processes, such as whether we've exceeded our prescribed rate limit for a website, the number of times we've seen a status code in the last minute, how many images we've processed so far, and so on. Since we're only interested in sharing application state and aggregate statistics, a lightweight key/value store like Redis seems like a good fit.</p> <p>Finally, we need a supervising process that centrally controls the crawl. This key governing process will be responsible for making sure our crawler workers are behaving properly by moderating crawl rates for each source, taking action in the face of errors, and reporting statistics to the operators of the crawler. We'll call this process the crawl monitor.</p> <p>Here's a rough sketch of how things will work:</p> <p><img src="/blog/entries/crawling-500-million/image_crawler_simplified.png" alt="Diagram"></p> <p>At a high level, the problem of building a fast crawler seems solvable for our team, even on the scale of several hundred million images. If we can sustain a crawl and analysis rate of 200 images per second, we could crawl all 500 million images in about a month.</p> <p>In the next section, we'll examine some of the key components that make up the crawler.</p> <h4 id="detailed-breakdown">Detailed breakdown</h4><h5 id="concurrency-with-asyncio">Concurrency with <code>asyncio</code></h5><p>Crawling is a massively IO bound task. The workers need to maintain lots of simultaneous open connections with internal systems like Kafka and Redis as well as 3rd party websites holding the target images. Once we have the image in memory, performing our actual analysis task is easy and cheap. For these reasons, an asynchronous approach seems more attractive than using multiple threads of execution. Even if our image processing task grows in complexity and becomes CPU bound, we can get the best of both worlds by offloading heavyweight tasks to a process pool. See "<a href="https://docs.python.org/3/library/asyncio-dev.html#running-blocking-code">Running Blocking Code</a>" in the <code>asyncio</code> docs for more details.</p> <p>Another reason that an asynchronous approach may be desirable is that we have several interlocking components which need to react to events in real-time: our crawl monitoring process needs to simultaneously control the rate limiting process and also interrupt crawling if errors are detected, while our worker processes need to consume crawl events, process images, upload thumbnails, and produce events documenting the metadata of each image. Coordinating all of these components through inter-process communication could be difficult, but breaking up tasks into small pieces and yielding to the event loop is comparatively easy.</p>
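<p>A sketch of the process-pool offloading pattern mentioned above, with a stand-in for the heavyweight analysis step:</p> <div class="hll"><pre># Sketch of offloading a CPU-bound step to a process pool from within an
# asyncio coroutine; analyze_pixels is a stand-in for whatever heavyweight
# image analysis might be needed.
import asyncio
from concurrent.futures import ProcessPoolExecutor

def analyze_pixels(image_bytes):
    # Placeholder for CPU-heavy work (decoding, hashing, vision models...).
    return {"size": len(image_bytes)}

async def process_image(image_bytes, pool):
    loop = asyncio.get_running_loop()
    # The event loop stays free to service other downloads while the
    # analysis runs in a separate process.
    return await loop.run_in_executor(pool, analyze_pixels, image_bytes)
</pre></div>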
As established previously, we need to execute this task concurrently, so everything needs to be defined with <code>async</code>/<code>await</code> syntax to allow the event loop to multitask. The actual task itself is otherwise straightforward.</p> <ol> <li>Download the remote image and load it into memory.</li> <li>Extract the resolution and compression quality.</li> <li>Thumbnail the image for later computer vision analysis and upload it to S3.</li> <li>Write the information we've discovered to a Kafka topic.</li> <li>Report success/errors to Redis in aggregate.</li> </ol> <p>See <a href="https://github.com/creativecommons/image-crawler/blob/master/worker/image.py">image.py</a> for the nitty-gritty details.</p> <h4 id="rate-limiting-with-token-buckets-and-error-circuit-breakers">Rate limiting with token buckets and error circuit breakers</h4><h5 id="how-do-we-determine-the-rate-limit">How do we determine the rate limit?</h5><p>Often, when designing highly concurrent software, the goal is to maximize the throughput and push servers to their absolute limit. The opposite is true with a web crawler, particularly when you are operating a non-profit organization completely reliant on the goodwill of others to exist. We want to be as certain as reasonably possible that we aren't going to knock a resource off of the internet with an accidental <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">DDoS</a>. At the same time, we need to crawl as quickly as possible against sources with adequate resources to withstand a heavy crawl, or else we'll never finish. How can we match our crawl rate to a site's capabilities?</p> <p>Originally, my plan was to determine this through an adaptive rate limiting strategy, where we would start with a low rate limit and use a hill climbing algorithm to determine the optimal rate. We could track metrics like <a href="https://en.wikipedia.org/wiki/Time_to_first_byte">time to first byte</a> (TTFB) and bandwidth speed to determine the exact moment that we have started to strain upstream servers. However, there are a lot of drawbacks here:</p> <ol> <li>It may not be correct to assume that performance will steadily degrade instead of failing all at once.</li> <li>We can't detect whether we are the cause of a performance issue or if the host is simply experiencing server trouble due to configuration errors or high traffic. We could get stuck at a suboptimal rate limit due to normal fluctuations in traffic.</li> <li>Recording TTFB in Python is difficult because it requires low-level access to connection data. We might have to write an extension to <code>aiohttp</code> to get it.</li> </ol> <p>Eventually I decided that this was too much hassle. Can we get the job done with a simpler strategy?</p> <p>It turns out that the size of a website is typically correlated with its infrastructure capabilities. The reasoning behind this is that if you are capable of hosting 450MM images, you are probably able to handle at least a couple hundred requests per second for serving traffic. In our case, we already know how many images a source has, so it's easy for us to peg our rate limit between a low minimum for small websites and a reasonable maximum for large websites, and then interpolate the rate limit for everything in between.</p> <p>Of course, it's important to note that this is only a rough heuristic that we use to make a reasonable guess about what a website can handle.
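Concretely, the interpolation could look something like the sketch below; the constants and the choice of a log scale are illustrative assumptions, not the crawler's exact values:</p> <div class="hll"><pre>
import math

MIN_RPS = 0.2          # floor for tiny sources (illustrative value)
MAX_RPS = 200.0        # ceiling for the very largest sources (illustrative)
SMALL = 1_000          # at or below this many images, use MIN_RPS
LARGE = 500_000_000    # at or above this many images, use MAX_RPS

def rate_limit(image_count):
    """Guess a polite requests-per-second budget from a source's size."""
    if image_count &lt;= SMALL:
        return MIN_RPS
    if image_count &gt;= LARGE:
        return MAX_RPS
    # Interpolate on a log scale, since source sizes span many
    # orders of magnitude.
    frac = (math.log10(image_count) - math.log10(SMALL)) \
        / (math.log10(LARGE) - math.log10(SMALL))
    return MIN_RPS + frac * (MAX_RPS - MIN_RPS)
</pre></div> <p>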
We have to allow the possibility that we set our rate limit too aggressively in spite of our precautions.</p> <h5 id="backing-off-with-circuit-breakers">Backing off with circuit breakers</h5><p>If our heuristic fails to correctly approximate the bandwidth capabilities of a site, we are going to start encountering problems. For one, we might exceed the server-side rate limit, which means we will see <code>429 Too Many Requests</code> and <code>403 Forbidden</code> errors instead of the images we're trying to crawl. Worse yet, the upstream source might continue to happily serve requests while we suck up all of their traffic capacity, resulting in other users being unable to view the images. Clearly, in either scenario, we need to either reduce our crawl rate or even give up crawling the source entirely if it appears that we are impacting their uptime.</p> <p>To handle these situations, we have two tools in our toolbox: a sliding window recording the status code of every request we've made to each domain in the last 60 seconds, and a list of the last 50 statuses for each website. If the number of errors in our one-minute window exceeds 10%, something is wrong; we should wait a minute before trying again. If we have encountered many errors in a row, however, that suggests that we're having trouble with a particular site, so we ought to give up crawling the source and raise an alert.</p> <p>Workers can keep track of this information in sorted sets in Redis. For the sliding error window, we'll sort each request by its timestamp, which will make it easy and cheap for us to expire status codes beyond the sliding window interval. Maintaining a list of the last N response codes is even easier; we just stick the status code in a list associated with the source.</p> <div class="hll"><pre>
class StatsManager:
    def __init__(self, redis):
        self.redis = redis
        self.known_sources = set()

    @staticmethod
    async def _record_window_samples(pipe, source, status):
        """ Insert a status into all sliding windows. """
        now = time.monotonic()
        # Time-based sliding windows
        for stat_key, interval in WINDOW_PAIRS:
            key = f'{stat_key}{source}'
            await pipe.zadd(key, now, f'{status}:{time.monotonic()}')
            # Delete events from outside the window
            await pipe.zremrangebyscore(key, '-inf', now - interval)
        # "Last n requests" window
        await pipe.rpush(f'{LAST_50_REQUESTS}{source}', status)
        await pipe.ltrim(f'{LAST_50_REQUESTS}{source}', -50, -1)
</pre></div> <p><em><center>Collecting status codes in aggregate</center></em></p> <p>Meanwhile, the crawl monitor process can keep tabs on each source's error rates.</p> <p>When more than 10% of the requests made to a source in the last minute are errors, we'll set a halt condition in Redis and stop replenishing rate limit tokens (more on that below).</p> <div class="hll"><pre>
now = time.monotonic()
one_minute_window = await redis.zrangebyscore(
    one_minute_window_key, now - 60, '+inf'
)
errors = 0
successful = 0
for member in one_minute_window:
    # Window members look like b'200:&lt;timestamp&gt;' (see above)
    status = member.split(b':')[0]
    if status not in EXPECTED_STATUSES:
        errors += 1
    else:
        successful += 1
tolerance = ERROR_TOLERANCE_PERCENT / 100
if not successful or errors / successful &gt; tolerance:
    await redis.sadd(TEMP_HALTED_SET, source)
</pre></div> <p><em><center>Detecting elevated crawl errors for a source</center></em></p> <p>For detecting "serious" errors, where we've seen 50 failed requests in a row, we'll set a permanent halt condition.
Someone will have to manually troubleshoot the situation and switch the crawler back on for that source.</p> <div class="hll"><pre>
last_50_statuses_key = f'statuslast50req:{source}'
last_50_statuses = await redis.lrange(last_50_statuses_key, 0, -1)
if len(last_50_statuses) &gt;= 50 and _every_request_failed(last_50_statuses):
    await redis.sadd(HALTED_SET, source)
</pre></div> <p><em><center>Detecting persistent crawl errors</center></em></p> <p>In practice, keeping a sliding window for tracking error thresholds and setting reasonable minimum and maximum crawl rates has worked well enough that the circuit breaker has never been tripped.</p> <h5 id="enforcing-rate-limits-with-token-buckets">Enforcing rate limits with token buckets</h5><p>It's one thing to set a policy for crawling; it's another thing entirely to actually enforce it. How can we coordinate our multiple crawling processes to prevent them from overstepping our rate limit?</p> <p>The answer is to implement a distributed token bucket system. The idea behind this is that each crawler has to obtain a token from Redis before making a request. Every second, the crawl monitor sets a variable containing the number of requests that can be made against a source. Each crawler process decrements the counter before making a request. If the decremented result is zero or above, the worker is cleared to crawl. Otherwise, the rate limit has been reached, and the worker waits until tokens are replenished.</p> <p>The beauty of token buckets is their simplicity, performance, and resilience against failure. If our crawler monitor process dies, crawling halts completely; making a request is not possible without first acquiring a token. This is a much better alternative to the guard rails completely disappearing with the crawl monitor and allowing unbounded crawling. Further, since decrementing a counter and retrieving the result is an atomic operation in Redis, there's no risk of race conditions and therefore no need for locking.
This is a huge boon for performance, as the overhead of coordinating and blocking on every single request would rapidly bog down our crawling system.</p> <p>To ensure that all crawling is performed at the correct speed, I wrapped <code>aiohttp.ClientSession</code> with a rate-limited version of the class.</p> <div class="hll"><pre>
class RateLimitedClientSession:
    def __init__(self, aioclient, redis):
        self.client = aioclient
        self.redis = redis

    async def _get_token(self, source):
        token_key = f'{CURRTOKEN_PREFIX}{source}'
        tokens = int(await self.redis.decr(token_key))
        if tokens &gt;= 0:
            token_acquired = True
        else:
            # Out of tokens
            await asyncio.sleep(1)
            token_acquired = False
        return token_acquired

    async def get(self, url, source):
        token_acquired = False
        while not token_acquired:
            token_acquired = await self._get_token(source)
        return await self.client.get(url)
</pre></div> <p>Meanwhile, the crawl monitor process refills each bucket every second.</p>
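<p>The refill side is a simple loop. Here is a sketch of it; the function shape, the <code>rate_limits</code> mapping, and the halt-set checks are illustrative assumptions, while <code>CURRTOKEN_PREFIX</code> matches the key prefix used by the worker above:</p> <div class="hll"><pre>
import asyncio

async def replenish_tokens(redis, rate_limits):
    """Crawl monitor side: top up each source's token bucket once per
    second. `rate_limits` maps each source to its requests per second.
    """
    while True:
        for source, limit in rate_limits.items():
            # Halted sources get no tokens, which stops the workers
            # without any extra coordination.
            if await redis.sismember(HALTED_SET, source) or \
                    await redis.sismember(TEMP_HALTED_SET, source):
                continue
            # Overwrite instead of incrementing so unused tokens don't
            # pile up and let workers burst past the rate limit.
            await redis.set(f'{CURRTOKEN_PREFIX}{source}', int(limit))
        await asyncio.sleep(1)
</pre></div>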
<h5 id="scheduling-tasks-somewhat-intelligently">Scheduling tasks (somewhat) intelligently</h5><p>The final gotcha in the design of our crawler is that we want to crawl every single website at the same time at its prescribed rate limit. That sounds almost tautological, like something that we should be able to take for granted after implementing all of this logic for preventing our crawler from working too quickly, but it turns out our crawler's processing capacity itself is a limited and contended resource. We can only schedule so many tasks simultaneously on each worker, and we need to ensure that tasks from a single website aren't starving other sources of crawl capacity.</p> <p>For instance, imagine that each worker is able to handle 5000 simultaneous crawling tasks, and every one of those tasks is tied to a tiny website with a very low rate limit. That means that our entire worker, which is capable of handling hundreds of crawl and analysis jobs per second, is stuck making one request per second until some faster tasks appear in the queue.</p> <p>In other words, we need to make sure that each worker process isn't jamming itself up with a single source. We have a <a href="https://en.wikipedia.org/wiki/Scheduling_(computing)">scheduling problem</a>. We've naively implemented first-come-first-served and need to switch to a different scheduling strategy.</p> <p>There are innumerable ways to address scheduling problems. Since there are only a few dozen sources in our system, we can get away with using a stupid scheduling algorithm: give each source equal capacity in every worker. In other words, if there are 5000 tasks to distribute and 30 sources, we can allocate 166 simultaneous tasks to each source per worker. That's plenty for our purposes. The obvious drawback of this approach is that eventually there will be so many sources that we start starving high-rate-limit sources of work. We'll cross that bridge when we come to it; it's better to use the simplest possible approach we can get away with instead of spending all of our time on solving hypothetical future problems.</p> <div class="hll"><pre>
    async def _schedule(self, task_schedule):
        raw_sources = await self.redis.smembers('inbound_sources')
        sources = [str(x, 'utf-8') for x in raw_sources]
        num_sources = len(sources)
        # A source never gets more than 1/4th of the worker's capacity. This
        # helps prevent starvation of lower rate limit requests and ensures
        # that the first few sources to be discovered don't get all of the
        # initial task slots.
        max_share = settings.MAX_TASKS / 4
        share = min(math.floor(settings.MAX_TASKS / num_sources), max_share)
        to_schedule = {}
        for source in sources:
            num_unfinished = self._get_unfinished_tasks(task_schedule, source)
            num_to_schedule = share - num_unfinished
            consumer = self._get_consumer(source)
            source_msgs = self._consume_n(consumer, num_to_schedule)
            to_schedule[source] = source_msgs
        return to_schedule
</pre></div> <p><em><center>Scheduling tasks for every source</center></em></p> <p>The one implementation detail to deal with here is that our workers can't draw from a single inbound images queue anymore; we need to partition each source into its own queue so we can pull tasks from each source when we need it. This partitioning process can be handled transparently by the crawl monitor.</p> <p><img src="/blog/entries/crawling-500-million/image_crawler.png" alt="A more complete diagram"></p> <p><em><center>A more complete diagram showing the system with a queue for each source</center></em></p> <h5 id="designing-for-testability">Designing for testability</h5><p>It's quite difficult to test IO-heavy systems because of their need to interact with lots of external dependencies. Often, it is necessary to write complex integration tests or run manual tests to be certain that key functionality works as expected. This is no good because integration tests are much more expensive to maintain and take far longer to execute.
We certainly wouldn't go to production without running a smoke test to verify correctness in real-world conditions, but it's still critical to have unit tests in place for catching bugs quickly during the development process.</p> <p>The solution to this problem is to use dependency injection, which is a fancy way of saying that we never do IO directly from within our application. Instead, we delegate IO to external objects that can be passed in at runtime. This makes it easy to pass in fake objects that approximate real-world behavior without real-world consequences.</p> <p>For example, the crawl monitor usually has to talk to our CC Search API (for assessing source size), Redis, and Kafka to do its job of regulating the crawl; instead of setting up a brittle and complicated integration test with all of those dependencies, we just instantiate some mock objects and pass them in. Now we can easily test individual components such as the error circuit breaker.</p> <div class="hll"><pre>
@pytest.fixture
def source_fixture():
    """ Mocks the /v1/sources endpoint response. """
    return [
        {
            "source_name": "example",
            "image_count": 5000000,
            "display_name": "Example",
            "source_url": "example.com"
        },
        {
            "source_name": "another",
            "image_count": 1000000,
            "display_name": "Another",
            "source_url": "whatever"
        }
    ]

def create_mock_monitor(sources):
    response = FakeAioResponse(status=200, body=sources)
    session = FakeAioSession(response=response)
    redis = FakeRedis()
    regulator_task = asyncio.create_task(rate_limit_regulator(session, redis))
    return redis, regulator_task

@pytest.mark.asyncio
async def test_error_circuit_breaker(source_fixture):
    sources = source_fixture
    redis, monitor = create_mock_monitor(sources)
    redis.store['statuslast50req:example'] = [b'500'] * 50
    redis.store['statuslast50req:another'] = [b'200'] * 50
    await run_monitor(monitor)
    assert b'example' in redis.store['halted']
    assert b'another' not in redis.store['halted']
</pre></div> <p><em><center>Testing our crawl monitor's circuit breaking functionality with mock dependencies</center></em></p> <p>The main drawback of dependency injection is that initializing your objects will take some more ceremony. See the <a href="https://github.com/creativecommons/image-crawler/blob/00b59aba9a15faccf203a53d73a98e8c06cb69e8/worker/scheduler.py#L162">initialization of the crawl scheduler</a> for an example of wiring up an object with a lot of dependencies. You might also find that constructors and other functions with many dependencies accumulate long argument lists if care isn't taken to bundle external dependencies together. In my opinion, the price of a few extra lines of initialization code is well worth the benefits gained from testability and modularity.</p> <h4 id="smoke-testing">Smoke testing</h4><p>Even with our unit test coverage, we still need to do some basic small-scale manual tests to make sure our assumptions hold up in the real world.
We'll need to write <a href="https://www.terraform.io/">Terraform</a> modules that provision a working version of the real system. Sadly, our Terraform infrastructure repository is private for now, but here's a taste of what the infra code looks like.</p> <div class="hll"><pre>
module "image-crawler" {
  source                = "../../modules/services/image-crawler"
  environment           = "prod"
  docker_tag            = "0.25.0"
  aws_access_key_id     = "${var.aws_access_key_id}"
  aws_secret_access_key = "${var.aws_secret_access_key}"
  zookeeper_endpoint    = "${module.kafka.zookeeper_brokers}"
  kafka_brokers         = "${module.kafka.kafka_brokers}"
  worker_instance_type  = "m5.large"
  worker_count          = 5
}
</pre></div> <p><em><center>Initialization of crawler Terraform module in our production environment</center></em></p> <div class="hll"><pre>
resource "aws_instance" "crawler-workers" {
  ami                    = "${var.ami}"
  instance_type          = "${var.worker_instance_type}"
  user_data              = "${data.template_file.worker_init.rendered}"
  subnet_id              = "${element(data.aws_subnet_ids.subnets.ids, 0)}"
  vpc_security_group_ids = ["${aws_security_group.image-crawler-sg.id}"]
  count                  = "${var.worker_count}"

  tags {
    Name             = "image-crawler-worker-${var.environment}"
    environment      = "${var.environment}"
    "cc:environment" = "${var.environment == "dev" ? "staging" : "production"}"
    "cc:product"     = "cccatalog-api"
    "cc:purpose"     = "Image crawler worker"
    "cc:team"        = "cc-search"
  }
}

resource "aws_instance" "crawler-monitor" {
  ami                    = "${var.ami}"
  instance_type          = "c5.large"
  user_data              = "${data.template_file.monitor_init.rendered}"
  subnet_id              = "${element(data.aws_subnet_ids.subnets.ids, 0)}"
  vpc_security_group_ids = ["${aws_security_group.image-crawler-sg.id}"]

  tags {
    Name             = "image-crawler-monitor-${var.environment}"
    environment      = "${var.environment}"
    "cc:environment" = "${var.environment == "dev" ? "staging" : "production"}"
    "cc:product"     = "cccatalog-api"
    "cc:purpose"     = "Image crawler monitor"
    "cc:team"        = "cc-search"
  }
}
</pre></div> <p><em><center>An excerpt of the crawler module definition</center></em></p> <p>One <code>terraform plan</code> and <code>terraform apply</code> cycle later, we're ready to feed a few million test URLs to the inbound image queue and see what happens. By my recollection, this uncovered many glaring issues:</p> <ul> <li>Basic network security configuration problems preventing communication between key components</li> <li>The need for our scheduling algorithm to be overhauled (already discussed)</li> <li>Workers exceeding the Redis maximum connection limit</li> <li>Workers crashing after hitting the open file limit because of the huge number of concurrent connections</li> <li>Probably a half-dozen other problems</li> </ul> <p>After fixing all of those issues and performing a larger smoke test, we're ready to start crawling on a large scale.</p> <h5 id="monitoring-the-crawl">Monitoring the crawl</h5><p>Unfortunately, we can't just kick back and relax while the crawler does its thing for a few weeks. We need some transparency about what the crawler is doing so we can be alerted when something breaks.</p> <ul> <li>How fast are we crawling each website? What's our target rate limit?</li> <li>How many errors have occurred? How many images have we successfully processed?</li> <li>Are we crawling right now, or are we finished?</li> </ul> <p>It would be nice to build a reporting dashboard for this, but in the interest of time, we'll dump a giant JSON blob to <code>STDOUT</code> every 5 seconds and call it a day. When we want to check on crawl progress, we <code>ssh</code> into the crawl monitoring virtual machine and <code>tail</code> the logs (we could also use our Graylog instance if we're feeling lazy).
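Since each log line is a self-contained JSON blob (see the example below), quick ad-hoc tallies are only a few lines of Python away. A sketch, assuming we pipe the monitor's log stream to stdin:</p> <div class="hll"><pre>
import json
import sys

# Tally per-source success counts from monitoring updates, e.g.:
#   ssh crawl-monitor 'tail -f crawler.log' | python tally.py
for line in sys.stdin:
    try:
        blob = json.loads(line)
    except ValueError:
        continue  # skip any non-JSON lines
    if blob.get('event') != 'monitoring_update':
        continue
    for source, stats in blob.get('specific', {}).items():
        print(source, stats.get('successful', 0))
</pre></div> <p>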
Fortunately, JSON is both trivially human and machine readable, so we can build a more sophisticated monitoring system later by parsing the logs.</p> <p>Here's an example log line from one of our smoke tests, indicating that we've crawled 13,224 images successfully and nothing else is happening.</p> <div class="hll"><pre><span></span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;event&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;monitoring_update&quot;</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;time&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;2020-04-17T20:22:56.837232&quot;</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;general&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;global_max_rps&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mf">193.418869804698</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;error_rps&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;processing_rate&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;success_rps&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;circuit_breaker_tripped&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">[],</span> <span class="w"> </span><span class="nt">&quot;num_resized&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">13224</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;resize_errors&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;split_rate&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="nt">&quot;specific&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;flickr&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;successful&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">13188</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;last_50_statuses&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;200&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">50</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="nt">&quot;rate_limit&quot;</span><span class="w"> 
</span><span class="p">:</span><span class="w"> </span><span class="mf">178.375147633876</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;error&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="nt">&quot;animaldiversity&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;last_50_statuses&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;200&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">18</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="nt">&quot;successful&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;error&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;rate_limit&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mf">0.206215440554406</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="nt">&quot;phylopic&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;rate_limit&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mf">0.2</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;error&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;successful&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;last_50_statuses&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;200&quot;</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">18</span> <span class="w"> </span><span class="p">}</span> <span class="w"> </span><span class="p">}</span> <span class="w"> </span><span class="p">}</span> <span class="p">}</span> </pre></div> <p>Now that we can see what the crawler is up to, we can schedule the larger crawl and start collecting production quality data.</p> <h4 id="takeaways">Takeaways</h4><p>The result here is that we have a lightweight, modular, highly concurrent, and polite distributed image crawler with only a handful of lines of code.</p> <div class="hll"><pre><span></span>alden:~/code/image_crawler$<span class="w"> </span>cloc<span class="w"> </span>. <span class="w"> </span><span class="m">48</span><span class="w"> </span>text<span class="w"> </span>files. <span class="w"> </span><span class="m">43</span><span class="w"> </span>unique<span class="w"> </span>files. <span class="w"> </span><span class="m">25</span><span class="w"> </span>files<span class="w"> </span>ignored. 
github.com/AlDanial/cloc<span class="w"> </span>v<span class="w"> </span><span class="m">1</span>.81<span class="w"> </span><span class="nv">T</span><span class="o">=</span><span class="m">0</span>.02<span class="w"> </span>s<span class="w"> </span><span class="o">(</span><span class="m">1667</span>.4<span class="w"> </span>files/s,<span class="w"> </span><span class="m">130887</span>.8<span class="w"> </span>lines/s<span class="o">)</span> ------------------------------------------------------------------------------ Language<span class="w"> </span>files<span class="w"> </span>blank<span class="w"> </span>comment<span class="w"> </span>code ------------------------------------------------------------------------------ Python<span class="w"> </span><span class="m">16</span><span class="w"> </span><span class="m">244</span><span class="w"> </span><span class="m">242</span><span class="w"> </span><span class="m">1324</span> Markdown<span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="m">79</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">219</span> YAML<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="m">61</span> XML<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">18</span> Bourne<span class="w"> </span>Shell<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">4</span> ------------------------------------------------------------------------------ SUM:<span class="w"> </span><span class="m">28</span><span class="w"> </span><span class="m">325</span><span class="w"> </span><span class="m">247</span><span class="w"> </span><span class="m">1626</span> ------------------------------------------------------------------------------ alden:~/code/image_crawler$<span class="w"> </span>tree<span class="w"> </span>. . 
├── architecture.png
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── crawl_monitor
│   ├── __init__.py
│   ├── monitor.py
│   ├── rate_limit.py
│   ├── README.md
│   ├── settings.py
│   ├── source_splitter.py
│   ├── structured_logging.py
│   └── tsv_producer.py
├── docker-compose.yml
├── Dockerfile-monitor
├── Dockerfile-worker
├── __init__.py
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── publish_release.sh
├── README.md
├── test
│   ├── corrupt.jpg
│   ├── __init__.py
│   ├── mocks.py
│   ├── test_image.jpg
│   ├── test_monitor.py
│   └── test_worker.py
└── worker
    ├── image.py
    ├── __init__.py
    ├── message.py
    ├── rate_limit.py
    ├── scheduler.py
    ├── settings.py
    ├── stats_reporting.py
    └── util.py

3 directories, 34 files
</pre></div> <p>We now have a lot of useful information about images that we were lacking before. The next step is to take this metadata and integrate it into our search engine, as well as perform deeper analysis of images using computer vision.</p> Say Hello To Our Community Team 2020-08-14T00:00:00Z ['dhruvkb'] urn:uuid:ddeb63da-7771-357d-8299-ad51defeaf4a <p>Creative Commons is committed to open-source software. We have over two dozen projects, spanning three times as many repositories on GitHub, each with its small, but extremely enthusiastic, subcommunity. With only a few full-time employees working on these projects, it is vital that we enable members from the community to take increased responsibility in developing and maintaining them, and growing the community of which they are a part.</p> <p>With that goal in mind, we've launched our Community Team initiative.</p> <h3 id="what-is-the-community-team">What is the Community Team?</h3><p>Communities that grow organically around open source projects tend to be a bit disorganised, and the frequency of contributions and degree of involvement tend to vary from member to member.
Our goal is to identify contributors who are actively involved within their communities and give them increased permissions over the codebase and access to more information channels and tools in an effort to empower them to participate more fully in the project.</p> <p>This is not restricted to code, though. We're also looking for people who work with the community on other aspects of the projects, such as design, documentation, evangelism, and onboarding, to name a few.</p> <ul> <li>The Community Team establishes a framework for formalising the level of involvement, which is a spectrum, into discrete levels, or 'roles'.</li> <li>Each role is mapped to a set of responsibilities that a member holding the role is encouraged to take up.</li> <li>Each role also entrusts the members holding it with certain privileges, accesses, and permissions, to help them execute these responsibilities.</li> </ul> <p>Roles also progressively include members in our roadmaps and planning meetings to ensure that the community is aligned with our long-term goals.</p> <h3 id="what-s-in-it-for-me">What's in it for me?</h3><p>The Community Team is not just a one-sided deal. Your membership in the Community Team is just as beneficial for you as it is for us. While there is a <a href="/community/community-team/#benefits-of-joining-the-community-team">laundry list of benefits</a> that you're entitled to, I'll just mention some notable ones here.</p> <ul> <li>You gain real-world practical experience of working on open-source projects.</li> <li>You gain both soft skills and technical skills by interacting with other developers from both the community and CC staff.</li> <li>Since we've already seen the quality of your work and involvement with the community, you get priority in internship applications*.</li> </ul> <p>Oh and, lest I forget, you'll receive CC swag!</p> <p><blockquote class="twitter-tweet" data-align="center"> <p lang="en" dir="ltr"> Thanks for the goodies!! <a href="https://twitter.com/creativecommons?ref_src=twsrc%5Etfw">@creativecommons</a> <a href="https://twitter.com/hashtag/OpenSource?src=hash&amp;ref_src=twsrc%5Etfw">#OpenSource</a> <a href="https://twitter.com/hashtag/creativecommons?src=hash&amp;ref_src=twsrc%5Etfw">#creativecommons</a> <a href="https://twitter.com/hashtag/GSoC?src=hash&amp;ref_src=twsrc%5Etfw">#GSoC</a> <a href="https://t.co/DFvpXCs8uu">pic.twitter.com/DFvpXCs8uu</a> </p> &mdash; Mayank Nader (@MayankNader) <a href="https://twitter.com/MayankNader/status/1137995920866390016?ref_src=twsrc%5Etfw">June 10, 2019</a> </blockquote></p> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script><h3 id="what-are-these-roles">What are these 'roles'?</h3><p>If you've reached this point, I assume you see the potential of the Community Team. Let's see where you'd fit in.</p> <p>We have two kinds of roles: code-oriented <a href="/community/community-team/project-roles/">Project roles</a>, which give you responsibilities and permissions related to one CC project, and non-code-oriented <a href="/community/community-team/community-building-roles/">Community Building roles</a>, which give you responsibilities and permissions related to improving the community of all CC projects as a whole.</p> <p>Each type has a few levels, but I'll just link them for you to read on your own.
While your eligibility for any role depends on how involved you have been in the past, the role you choose reflects how involved you would like to be in the future.</p> <p>Start by asking yourself a simple question, "Do I code?"</p> <h4 id="sure-i-can-code.."">"Sure, I can code..."</h4><p><em>That's awesome!</em> We have projects in a diverse array of languages, using myriad tools and frameworks. Depending on the skills you have, or are planning to acquire, you can pick a project and start contributing to it. Based on your contributions and your familiarity with the codebase, you can then apply for the role that matches your desired level of involvement.</p> <p>So if you want to be lightly involved with code reviews and would like to know about our plans in advance, you can start off as a Project Contributor. This is a fantastic role to get started with and ensures that you get excellent mentorship as you start your FOSS journey.</p> <p>As your familiarity with the codebase increases, you might want to triage incoming issues or block certain PRs that you've reviewed. You could escalate your role to Project Collaborator. Want to be more involved? You can apply to be a Project Core Committer, or even a Project Maintainer.</p> <h4 id="no-i-can-t-code.."">"No, I can't code..."</h4><p><em>That's cool too!</em> We realise that open source communities are never just about the code. If you're passionate about growing the CC community by enabling new contributors to get started or by spreading the word, you can apply for one of the Community Building roles. Like the Project roles, there are a couple of levels to choose from.</p> <p>Community builders have a whole different set of responsibilities and privileges, specifically catered to the unique task of cultivating a healthy community around our many open source projects.</p> <p>So if you want to be lightly involved with onboarding new contributors to the repositories and the workflows, you could start off as a Community Contributor. This is a fantastic role to help new contributors get a head start in their journey with FOSS.</p> <p>As your familiarity with the community increases, you might want to suggest tweets for our Twitter account, or participate in long-term community building tasks from Asana. You could escalate your role to Community Collaborator. Want to be more involved? You can even apply to be a Community Maintainer.</p> <h3 id="what-s-next">What's next?</h3><p>The Community Team is a fairly novel idea for us and we're still tweaking things along the way. For example, we recently merged two Project roles, namely Project Member and Project Collaborator, when we realised they weren't so different. As we internalise these roles more and more, we'll find more scope for improvement and we'll continue to refine these roles over time.</p> <p>We're excited about the Community Team. If you're interested in joining us on this ride, it's really easy to <a href="/community/community-team/">get started</a>.</p> <p><small>*We do not guarantee that you will be accepted if you apply for an internship!</small></p> Accessibility Improvements: Final Changes and Modal Accessibility 2020-08-12T00:00:00Z ['AyanChoudhary'] urn:uuid:4b846809-9764-37d3-9551-f5c639064470 <p>These are the last two weeks of my internship with CC. I am working on improving the accessibility of cc-search and internationalizing it as well.
This post contains details of my work done to make accessibility improvements to the search result page and the image detail page, and also covers some advanced accessibility improvement details.</p> <p>The topics included in this post cover:</p> <ol> <li>Tooltip accessibility and keyboard interactions</li> <li>Improve modal accessibility and implement trap focus</li> <li>Fix <code>&lt;label&gt;</code> for form elements</li> </ol> <p>The first stage involved fixing the license explanation tooltips. These tooltips worked fine on click but did not respond to keypress events. The solution was to add a keypress event listener on the element that executes the same <code>toggleLicenseExplanationVisibility</code> function as the click handler. Luckily, <code>VueJS</code> provides this built in via the <code>v-on:keyup</code> attribute. So after the change, the code looks as follows:</p> <div class="hll"><pre><span></span><span class="p">&lt;</span><span class="nt">img</span> <span class="na">:aria-label</span><span class="o">=</span><span class="s">&quot;$t(&#39;browse-page.aria.license-explanation&#39;)&quot;</span> <span class="na">tabindex</span><span class="o">=</span><span class="s">&quot;0&quot;</span> <span class="na">v-if</span><span class="o">=</span><span class="s">&quot;filterType == &#39;licenses&#39;&quot;</span> <span class="na">src</span><span class="o">=</span><span class="s">&quot;@/assets/help_icon.svg&quot;</span> <span class="na">alt</span><span class="o">=</span><span class="s">&quot;help&quot;</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;license-help is-pulled-right padding-top-smallest padding-right-smaller&quot;</span> <span class="err">@</span><span class="na">click</span><span class="err">.</span><span class="na">stop</span><span class="o">=</span><span class="s">&quot;toggleLicenseExplanationVisibility(item.code)&quot;</span> <span class="na">v-on:keyup</span><span class="err">.</span><span class="na">enter</span><span class="o">=</span><span class="s">&quot;toggleLicenseExplanationVisibility(item.code)&quot;</span> <span class="p">/&gt;</span> </pre></div> <p>A similar change was made to all the tooltips. The reason behind this error is that non-semantic element representation (i.e. using <code>&lt;div&gt;</code>, <code>&lt;span&gt;</code> or <code>&lt;img&gt;</code> instead of a <code>&lt;button&gt;</code>) does not register a keypress listener for these tags, and hence they don't respond to keypresses.</p> <p>The second change is related to modals. Modals have some stringent accessibility requirements that have to be carefully handled. The criteria are:</p> <ol> <li>On opening the modal, the remaining elements should get disabled.</li> <li>The modal should have trap focus (the user should not exit the modal when using tab to navigate).</li> <li>The modal should close on pressing <strong>esc</strong> or on clicking the overlay.</li> </ol> <p>To meet the criteria we developed a new <a href="https://github.com/cc-archive/cccatalog-frontend/blob/develop/src/components/AppModal.vue">modal component</a>. This modal has an overlay and closes when we press the <strong>esc</strong> key or click on the overlay. The modal also disables other elements when it is opened.</p> <p>The final task achieved in the modal was the implementation of trap focus. For this we used the <a href="https://github.com/posva/focus-trap-vue">focus-trap-vue library</a>. The library exposes a <code>&lt;focus-trap&gt;</code> component which acts as a wrapper to enable focus trapping.
The implementation we used was:</p> <div class="hll"><pre><span></span><span class="p">&lt;</span><span class="nt">focus-trap</span> <span class="na">:active</span><span class="o">=</span><span class="s">&quot;true&quot;</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;modal relative&quot;</span> <span class="na">aria-modal</span><span class="o">=</span><span class="s">&quot;true&quot;</span> <span class="na">role</span><span class="o">=</span><span class="s">&quot;dialog&quot;</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">header</span> <span class="na">v-if</span><span class="o">=</span><span class="s">&quot;title&quot;</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;modal-header padding-top-bigger padding-left-bigger padding-right-normal padding-bottom-small&quot;</span> <span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">slot</span> <span class="na">name</span><span class="o">=</span><span class="s">&quot;header&quot;</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">h3</span><span class="p">&gt;</span>{{ title }}<span class="p">&lt;/</span><span class="nt">h3</span><span class="p">&gt;</span> <span class="p">&lt;/</span><span class="nt">slot</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">button</span> <span class="na">type</span><span class="o">=</span><span class="s">&quot;button&quot;</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;close-button has-color-gray is-size-6 is-size-4-touch&quot;</span> <span class="err">@</span><span class="na">click</span><span class="o">=</span><span class="s">&quot;$emit(&#39;close&#39;)&quot;</span> <span class="na">:aria-label</span><span class="o">=</span><span class="s">&quot;$t(&#39;browse-page.aria.close&#39;)&quot;</span> <span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">i</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;icon cross&quot;</span> <span class="p">/&gt;</span> <span class="p">&lt;/</span><span class="nt">button</span><span class="p">&gt;</span> <span class="p">&lt;/</span><span class="nt">header</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">slot</span> <span class="na">default</span> <span class="p">/&gt;</span> <span class="p">&lt;/</span><span class="nt">div</span><span class="p">&gt;</span> <span class="p">&lt;/</span><span class="nt">focus-trap</span><span class="p">&gt;</span> </pre></div> <p>Apart from these, the modal also has the <code>aria-modal</code> attribute and the <code>role="dialog"</code> attribute. These attributes direct our screen readers to recognise this component as a modal and declare it whenever the modal opens.</p> <p>The last improvement involves using appropriate label tags for the form elements. A lot of elements did not have proper labels or were nested in the wrong way. These elements were fixed, and after correcting the nesting they had proper labels which screen readers were able to identify.
An example of a proper input element with correct label nesting is:</p> <div class="hll"><pre><span></span><span class="p">&lt;</span><span class="nt">label</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;checkbox&quot;</span> <span class="na">:for</span><span class="o">=</span><span class="s">&quot;item.code&quot;</span> <span class="na">:disabled</span><span class="o">=</span><span class="s">&quot;block(item)&quot;</span><span class="p">&gt;</span> <span class="p">&lt;</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">&quot;checkbox&quot;</span> <span class="na">class</span><span class="o">=</span><span class="s">&quot;filter-checkbox margin-right-small&quot;</span> <span class="na">:id</span><span class="o">=</span><span class="s">&quot;item.code&quot;</span> <span class="na">:key</span><span class="o">=</span><span class="s">&quot;index&quot;</span> <span class="na">:checked</span><span class="o">=</span><span class="s">&quot;item.checked&quot;</span> <span class="na">:disabled</span><span class="o">=</span><span class="s">&quot;block(item)&quot;</span> <span class="err">@</span><span class="na">change</span><span class="o">=</span><span class="s">&quot;onValueChange&quot;</span> <span class="p">/&gt;</span> <span class="p">&lt;</span><span class="nt">license-icons</span> <span class="na">v-if</span><span class="o">=</span><span class="s">&quot;filterType == &#39;licenses&#39;&quot;</span> <span class="na">:license</span><span class="o">=</span><span class="s">&quot;item.code&quot;</span> <span class="p">/&gt;</span> {{ $t(item.name) }} <span class="p">&lt;/</span><span class="nt">label</span><span class="p">&gt;</span> </pre></div> <p>Notice how the input is a child of the <code>&lt;label&gt;</code> tag, which has the <code>for</code> attribute to indicate which element it labels.</p> <p>Apart from these changes, the ESLint configuration of the project was also changed to include a11y linting for the elements. We used <a href="https://github.com/maranran/eslint-plugin-vue-a11y">eslint-plugin-vue-a11y</a> to enforce accessibility guidelines for our components via lint checks. Furthermore, all the aria-labels were internationalized to enforce the i18n standard in our repo that we had set up earlier this summer.</p> <p>After all these changes we had the following improvements in the accessibility scores (computed from Lighthouse):</p> <ol> <li>Browse Page: 76 -&gt; 98 | +22</li> <li>Collections Browse Page: 86 -&gt; 96 | +10</li> <li>Photo Detail Page: 75 -&gt; 95 | +20</li> </ol> <p>And we are officially done with our work for the summer internship.
The next blog will be the culmination of this series.</p> <p>You can track the work done for these weeks through these PRs:</p> <ol> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1072">Accessibility Improvements</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1121">setup vue-a11y for eslint</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1123">Aria labels and internationalization</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1120">internationalize aria-labels for about page and feedback page</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1153">add trap focus to modals</a></li> </ol> <p>The progress of the project can be tracked on <a href="https://github.com/cc-archive/cccatalog-frontend">cc-search</a>.</p> <p>CC Search Accessibility is my GSoC 2020 project under the guidance of <a href="https://creativecommons.org/author/zackcreativecommons-org/">Zack Krida</a> and <a href="/blog/authors/akmadian/">Ari Madian</a>, the primary mentor for this project. <a href="https://creativecommons.org/author/annacreativecommons-org/">Anna Tumadottir</a> has been helping all along, and engineering director <a href="https://creativecommons.org/author/kriticreativecommons-org/">Kriti Godey</a> has been very supportive.</p> CC Legal Database: Developing features 2020-08-07T00:00:00Z ['krysal'] urn:uuid:22d82fa8-ac33-3574-89ba-98937b7365f7 <p>In this post, I want to give an update on the progress of reimplementing the CC Legal Database site, my Outreachy project. Several features have been added over the last month.</p> <h3 id="submission-forms">Submission forms</h3><p>The first thing I wanted to implement was the respective forms, so that anyone can submit a case or article to the database. These forms were slightly modified in the redesign (discussed in the previous articles), so now they have fewer mandatory fields, to lower the bar and make it easier for users to contribute.</p> <figure style="text-align: center;"> <img src="scholarship-form.png" alt="Form to submit an article related to CC licenses" style="border: 1px solid black; width: 60%;"> <figcaption>Scholarship form to submit an article.</figcaption> </figure><p>For the Scholarship form, for example, you only need to share your name, email, and a link to propose an article related to any of the CC licenses, although the more information you can provide us, the better. In any case, each contribution is reviewed by the staff before publishing.</p> <h3 id="search">Search</h3><p>The second important task was to allow searching in each of the listings, a basic function to start making use of the exposed information. On the <a href="https://labs.creativecommons.org/caselaw/">current site</a>, this function is delegated to an external service, a certain famous search engine. Filtering is now performed in the backend based on the keywords entered by the user, thus returning the reduced list. Later this will be combined with filtering by tags or topics that are associated with each entry (case or scholarship).</p> <h3 id="automated-tests">Automated tests</h3><p>While developing the mentioned functionalities I was also in charge of adding automated unit tests, to ensure that future changes to the code base do not damage already functional parts of the site.
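<p>To give a flavour of what one of these tests can look like, here is a minimal sketch in the spirit of the 404 behaviour pictured below; the <code>legal_db</code> app label, the <code>Case</code> fields, and the <code>case_detail</code> URL name are illustrative assumptions rather than the project's actual code:</p> <pre><code>from django.test import TestCase
from django.urls import reverse

# Hypothetical names: the app label, model fields, and URL name are
# assumptions for illustration, not the project's actual code.
from legal_db.models import Case


class CaseDetailTests(TestCase):
    def test_unpublished_case_returns_404(self):
        # An unpublished record must not be reachable on the public site.
        case = Case.objects.create(name="Doe v. Roe", status="UNPUBLISHED")
        response = self.client.get(reverse("case_detail", args=[case.pk]))
        self.assertEqual(response.status_code, 404)
</code></pre>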
In addition to giving more confidence to future contributors, tests provide value immediately: at the time of writing them you have to think about possible edge cases, which allowed me to notice a missing validation in a couple of routes and then correct it.</p> <figure style="text-align: center;"> <img src="404-page.png" alt="404 page" style="border:1px solid black; width:70%;"> <figcaption>Example of page obtained when requesting a case detail that is not published or doesn't exist.</figcaption> </figure><p>In this process of adding automated tests I wanted them to run on every pull request created, so I learned how to write a GitHub Action with a PostgreSQL service, the DBMS used in this project. Previously, I had already created a job for linting, so I needed to add another one to run in parallel to save time. This service provided by GitHub is pretty cool and useful; it opens up a world of possibilities, from running third-party services like <a href="https://github.com/GoogleChrome/lighthouse-ci">Lighthouse tests</a> to even <a href="https://github.com/gr2m/twitter-together">sending tweets</a>! If you want to see the GitHub Action file configured for this project, check it out: <a href="https://github.com/creativecommons/legaldb/blob/31c3002a7860d78f3fdb464150c5c1b2f8bb86fc/.github/workflows/main.yml"><code>.github/workflows/main.yml</code></a>.</p> <h3 id="accessibility">Accessibility</h3><p>To check if the site had shortcomings I ran the Lighthouse test on the homepage, discovering that there were indeed some issues to tackle. Initially, the results were these:</p> <figure style="text-align: center;"> <img src="lighthouse-before.png" alt="" style="border:1px solid black; width:70%;"> <figcaption>Initial Lighthouse test measurements.</figcaption> </figure><p>The good thing about this test is that it offers suggestions on how to fix the issues found, so after adding certain missing attributes and labels, the following results were achieved.</p> <figure style="text-align: center;"> <img src="lighthouse-after.png" alt="" style="border:1px solid black; width:70%;"> <figcaption>Lighthouse test measurements after corrections.</figcaption> </figure><p>There is still room for improvement, but at least we are within a quite acceptable green range.</p> <h3 id="other-features-and-tweaks">Other features and tweaks</h3><p>Some other features were implemented that are only relevant to our registered users, that is, the Legal Staff. They consist of Django admin customizations, such as filtering records by status, and one particular request: the answers to frequently asked questions need to be displayed with formatting, so they are now saved as Markdown text and transformed to styled HTML on the public site, showing lists, bold text, links, etc. The admin can also see a preview while editing.</p> <h3 id="conclusion">Conclusion</h3><p>Reviewing all that was done this last month, I see significant progress has been made, and I have learned many things along the way: more of what Django and its ecosystem offer, about accessibility, continuous integration with Heroku and GitHub, and more.
One of the things that makes me most happy is being able to contribute and be part of an Open Source organization, learning how it moves and works inside, something I had never imagined before.</p> <p>Time flies and there are less than two weeks left to finish, so if you want to follow the project, here is the repository to suggest improvements or report bugs, or if you prefer something less technical you can join us on the <a href="https://creativecommons.slack.com/channels/cc-dev-legal-database">slack channel</a>.</p> Smithsonian Unit Code Update 2020-08-03T00:00:00Z ['charini'] urn:uuid:23985c56-007a-3707-9f9c-66bde8adae7e <h2 id="introduction">Introduction</h2><p>The Creative Commons (CC) Catalog project collects and stores CC licensed images scattered across the internet, such that they can be made accessible to the general public via the <a href="https://ccsearch.creativecommons.org/">CC Search</a> and <a href="https://api.creativecommons.engineering/v1/">CC Catalog API</a> tools. Various pieces of information associated with each image, which help in the image search and categorisation process, are stored via CC Catalog in the CC database.</p> <p>In my <a href="/blog/entries/flickr-sub-provider-retrieval/">previous blog post</a> of this series, entitled 'Flickr Sub-provider Retrieval', I discussed how the images from a certain provider (such as Flickr) can be categorised based on the sub-provider values (which reflect the underlying organisation or entity that published the images through the provider). We have similarly implemented the sub-provider retrieval logic for the Europeana and Smithsonian providers. Unlike in Flickr and Europeana, every single image from Smithsonian is categorised under some sub-provider value, where the sub-providers are identified based on a <em>unit code</em> value contained in the API response (for more information please refer to the pull request <a href="https://github.com/cc-archive/cccatalog/pull/455">#455</a>). The unit code values and the corresponding sub-provider values are maintained in the dictionary <em>SMITHSONIAN_SUB_PROVIDERS</em>. However, there is the possibility of the <em>unit code</em> values being updated at the Smithsonian API level, and it is important that we have a mechanism for reflecting those updates in the <em>SMITHSONIAN_SUB_PROVIDERS</em> dictionary as well. In this blog post, we discuss how we learn of potential changes to the <em>unit code</em> values and keep the <em>SMITHSONIAN_SUB_PROVIDERS</em> dictionary up-to-date.</p> <h2 id="implementation">Implementation</h2><h3 id="retrieving-the-latest-unit-codes">Retrieving the latest unit codes</h3><p>We are required to obtain the latest <em>unit codes</em> supported by the Smithsonian API to achieve this task. Furthermore, since we are only interested in image data, only the <em>unit codes</em> which are associated with images need to be retrieved. The latest Smithsonian <em>unit codes</em> corresponding to images can be retrieved by calling the endpoint <a href="https://api.si.edu/openaccess/api/v1.0/terms/unit_code?q=online_media_type:Images&amp;api_key=REDACTED">https://api.si.edu/openaccess/api/v1.0/terms/unit_code?q=online_media_type:Images&amp;api_key=REDACTED</a>.</p>
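<p>A rough sketch of this retrieval, together with the comparison described in the next section, could look as follows; the helper names, the sample dictionary contents, and the exact shape of the JSON response are assumptions for illustration:</p> <pre><code>import requests

# Placeholder key; the real key should be kept out of source control.
SMITHSONIAN_API_KEY = "REDACTED"

# Illustrative contents; assumed shape: sub-provider name -> set of unit codes.
SMITHSONIAN_SUB_PROVIDERS = {
    "smithsonian_national_museum_of_natural_history": {"NMNHANTHRO", "NMNHBOTANY"},
}


def get_image_unit_codes():
    # Query the terms endpoint for unit codes associated with images only.
    response = requests.get(
        "https://api.si.edu/openaccess/api/v1.0/terms/unit_code",
        params={"q": "online_media_type:Images", "api_key": SMITHSONIAN_API_KEY},
    )
    # Assumption: the endpoint lists the values under response['terms'].
    return set(response.json()["response"]["terms"])


def get_unit_code_updates():
    api_codes = get_image_unit_codes()
    known_codes = set().union(*SMITHSONIAN_SUB_PROVIDERS.values())
    to_add = api_codes - known_codes     # rows with action 'add'
    to_delete = known_codes - api_codes  # rows with action 'delete'
    return to_add, to_delete
</code></pre>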
<h3 id="check-for-unit-code-updates">Check for unit code updates</h3><p>In order to identify whether changes have occurred to the collection of <em>unit codes</em> supported by the Smithsonian API (in the form of additions and/or deletions), we compare the values retrieved from the previously mentioned endpoint with the values contained in the <em>SMITHSONIAN_SUB_PROVIDERS</em> dictionary. All changes are reflected in a table named <em>smithsonian_new_unit_codes</em>, which contains two fields: 'new_unit_code' and 'action'. If a new <em>unit code</em> is introduced at the API level, we store that <em>unit code</em> value with the corresponding action value 'add' in the table. This reflects that the given <em>unit code</em> value needs to be added to the <em>SMITHSONIAN_SUB_PROVIDERS</em> dictionary. If a <em>unit code</em> that appears in the <em>SMITHSONIAN_SUB_PROVIDERS</em> dictionary does not appear at the API level, we store the <em>unit code</em> value with the corresponding action value 'delete' in the table, reflecting that it needs to be deleted from the dictionary.</p> <h3 id="triggering-the-unit-code-update-workflow">Triggering the unit code update workflow</h3><p>A separate workflow named <em>check_new_smithsonian_unit_codes_workflow</em> allows executing the logic we discussed via the Airflow UI. For each execution, the table <em>smithsonian_new_unit_codes</em> is completely cleared of previous data, and the latest updates to reflect in the <em>SMITHSONIAN_SUB_PROVIDERS</em> dictionary are stored. Note that the actual updates to the dictionary (as reflected in the table) need to be carried out by a person, since editing the dictionary is not automated. Furthermore, this workflow is expected to be executed at least once a week, preferably prior to running the Smithsonian image retrieval script, so that the Smithsonian sub-provider retrieval task can be run with no issue.</p> <h2 id="acknowledgement">Acknowledgement</h2><p>I express my gratitude to my GSoC supervisor Brent Moran for assisting me with this task.</p> Linked Commons: Autocomplete Feature 2020-07-31T00:00:00Z ['subhamX'] urn:uuid:0e278e85-748f-3d35-a640-24daab837875 <p>The following blog intends to explain the most recent feature integrated into the Linked Commons. Be it the giant Google Search or any small website having a form field, everyone wishes to predict what’s on the user’s mind. For every keystroke, a nice search bar always renders some possible options the user could be looking for. The core ideology behind having this feature is: <em>do as much work as possible for the user!</em></p> <div style="text-align: center; width: 100%;"> <figure> <img src="autocomplete-feat-in-action.gif" alt="autocomplete-feature" style="border: 1px solid black"> <figcaption style="font-weight: 500;">Autocomplete feature in action</figcaption> </figure> </div><h2 id="motivation">Motivation</h2><p>One of the newest features integrated last month into the Linked Commons is filtering by node name. Here a user can search for his/her favourite node and explore all its neighbours.
Since the list is very big, it was self-evident for us to have a text box (and not a drop-down) where the user is supposed to type the node name.</p> <p>Some of the reasons to have an autocomplete feature in the filtering by node name:</p> <ul> <li>Some of the node names are very uncommon and lengthy. There is a high probability of misspelling them.</li> <li>Submitting the form and getting a response of “Node doesn’t exist” isn’t a very good user flow, and we want to minimise such incidents.</li> </ul> <p>Also, on a side note, giving the user a search bar with no hints is ruthless. We all need recommendations, and guess what: the Linked Commons has got you covered! Now for every keystroke, we load a bunch of node names which you might be looking for. ;)</p> <h2 id="problem">Problem</h2><p>The autocomplete feature, on a very basic level, aims to predict the rest of a word the user is typing. A possible implementation is a linear traversal of all the nodes in the list, which has <strong>linear time complexity</strong>. That is not very good, so it is natural to look for a faster and more efficient way. Also, even if we neglect the <strong>time complexity</strong> for a moment, looking for the best 10 nodes out of these millions on the client's machine is not at all a good idea; it will cause throttling and will result in performance drops. On the other hand, a <strong>trie based solution</strong> is certainly more efficient, but we still cannot do this indexing on the client machine for the same reasons stated above. By now it is apparent that we should implement this feature on the server, and also aim for at least something better than linear time complexity.</p> <h2 id="a-non-conventional-solution">A non-conventional solution</h2><p>We could have used Elasticsearch, which is very powerful and has a ton of functionality, but since our needs are very small we wanted to look for simpler alternatives. Moreover, we didn't want to complicate our current architecture by adding additional frameworks and libraries.</p> <p>Taking the above points into consideration, we went ahead with the following solution: we store all node data in an SQL database and search for all the nodes whose domain name matches the query string. After slicing the queryset and some randomization, we send the payload to the client. To make it more robust, we are caching the results in the frontend to avoid multiple calls for the same query. This surely reduces the load on the server and also gives faster responses.</p> <h2 id="results">Results</h2><p>To make sure our solution works well, we performed load tests, aiming for response times not exceeding 1000 ms. We used Locust, which is a user load testing tool. We simulated <strong>1000 users</strong> with a hatch rate of <strong>10</strong>.
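<p>For reference, a minimal locustfile for this kind of test might look like the sketch below; the <code>/suggest</code> route and its query parameter are hypothetical, and the Locust 1.x API is assumed:</p> <pre><code>from locust import HttpUser, between, task


class AutocompleteUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def fetch_suggestions(self):
        # Hypothetical suggestions route; the real endpoint may differ.
        self.client.get("/suggest", params={"q": "creativecommons"})
</code></pre> <p>Running <code>locust -f locustfile.py</code> then lets you enter the number of users and the hatch rate in the web UI.</p>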
The following test was performed on the local machine to ensure that the server location wasn't affecting the results.</p> <p>Here are some aggregated result statistics.</p> <table class="table table-striped"> <thead class="thead-dark"><tr> <th>Field Name</th> <th>Value</th> </tr> </thead> <tbody> <tr> <td>Request Count</td> <td><strong> 23323 </strong></td> </tr> <tr> <td>Failure Count</td> <td><strong> 0 </strong></td> </tr> <tr> <td>Median Response Time</td> <td><strong> 360 ms </strong></td> </tr> <tr> <td>Average Response Time</td> <td><strong> 586.289 ms</strong></td> </tr> <tr> <td>Min Response Time</td> <td><strong> 4.03094 ms</strong></td> </tr> <tr> <td>Max Response Time</td> <td><strong> 4216 ms </strong></td> </tr> <tr> <td>Average Content Size</td> <td><strong> 528.667 bytes</strong></td> </tr> <tr> <td>Requests/s</td> <td><strong> 171.754 </strong></td> </tr> <tr> <td>Max Requests/s</td> <td><strong> 214 </strong></td> </tr> <tr> <td>Failures/s</td> <td><strong> 0 </strong></td> </tr> </tbody> </table> <h2 id="next-steps">Next steps</h2><p>In the next blog, we will be covering the long-awaited data update and the new architecture.</p> <h2 id="conclusion">Conclusion</h2><p>Overall, I enjoyed working on this feature and it was a great learning experience. This feature has been successfully integrated into the development version, so do check it out. Now that you have read this blog till the end, I hope that you enjoyed it. For more information please visit our <a href="https://github.com/cc-archive/cccatalog-dataviz/">Github repo</a>. We are looking forward to hearing from you about the Linked Commons. Our <a href="https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz">slack</a> doors are always open to you, see you there. :)</p> CC Search, Initial Accessibility Improvements 2020-07-25T00:00:00Z ['AyanChoudhary'] urn:uuid:b523aa13-3429-3303-b7cb-3c66beeed723 <p>These are the seventh and eighth weeks of my internship with CC. I am working on improving the accessibility of cc-search and internationalizing it as well. This post contains details of my work done to make initial accessibility improvements to the homepage and the other static pages.</p> <p>With the internationalization work complete, our next target was the accessibility improvements. So I decided to tackle the homepage and the static pages first. The aforementioned pages had the following accessibility issues:</p> <ol> <li>No aria-label on links</li> <li>Improper landmarks</li> <li>Improper aria-control nestings</li> <li>Some elements not being read by the screen reader</li> <li>Color contrast issues (to be covered later)</li> </ol> <p>But before working, I ran another set of audit tests to pinpoint these issues exactly. I used <a href="https://www.nvaccess.org/">NVDA</a> for running these audits. Let's go through the fixes one at a time.</p> <p>The first issue of no aria-label was predominantly found in the footer. We had some links such as:</p> <pre><code>&lt;a href="https://www.instagram.com/creativecommons" class="social has-text-white" target="_blank" rel="noopener" &gt; </code></pre> <p>These links did not contain any aria-label and were read as <strong>cc link</strong>. So an aria-label had to be added (<code>aria-label="instagram link"</code> in this case), which fixed this problem.</p> <p>The next issue was of improper landmarks. Most of the pages had no <strong>main</strong> landmark, and some had no <strong>complementary</strong> or <strong>region</strong> landmarks even though they were required in those pages.
These landmarks had to be added after carefully scrutinising the pages in the audits.</p> <p>The next issue was of improper aria-control nestings. This is interesting, as it involves some deeper understanding of the roles involved, so I will explain it in a little depth. The area where we had this issue was the feedback page. The code involved was:</p> <pre><code>&lt;ul&gt; &lt;li :class="tabClass(0, 'tab')"&gt; &lt;a href="#panel0" :aria-selected="activeTab == 0" @click.prevent="setActiveTab(0)" &gt; Help us Improve &lt;/a&gt; &lt;/li&gt; &lt;li :class="tabClass(1, 'tab')"&gt; &lt;a href="#panel1" :aria-selected="activeTab == 1" @click.prevent="setActiveTab(1)" &gt; Report a Bug &lt;/a&gt; &lt;/li&gt; &lt;/ul&gt; </code></pre> <p>The reason this is an error is that the <code>aria-selected</code> attribute can only be applied to an element having the role <strong>tab</strong>, nested inside a <strong>tablist</strong> element. For reference, in the above example the <code>&lt;ul&gt;</code> should have the role <strong>tablist</strong> and each <code>&lt;li&gt;</code> element should have the role <strong>tab</strong>. And so the <code>aria-selected</code> attribute should be on the <code>&lt;li&gt;</code> element instead of the <code>&lt;a&gt;</code> tag.</p> <p>The corrected code is:</p> <pre><code>&lt;ul role="tablist"&gt; &lt;li role="tab" :class="tabClass(0, 'tab')" :aria-selected="activeTab == 0"&gt; &lt;a aria-label="help us improve form" href="#panel0" @click.prevent="setActiveTab(0)" &gt; {{ $t('feedback.improve') }} &lt;/a&gt; &lt;/li&gt; &lt;li role="tab" :class="tabClass(1, 'tab')" :aria-selected="activeTab == 1"&gt; &lt;a aria-label="report a bug form" href="#panel1" @click.prevent="setActiveTab(1)" &gt; {{ $t('feedback.bug') }} &lt;/a&gt; &lt;/li&gt; &lt;/ul&gt; </code></pre> <p>Another interesting finding involved the screen readers not reading particular special characters such as <code>~</code> and <code>|</code>. This issue was quite pronounced in the search guide page, where these symbols were used plentifully in both links and text. So I had to phonetically write these out in the aria-labels of the links to make the screen reader read them out loud. The corresponding changes are:</p> <pre><code>&lt;a aria-label="dog vertical bar cat" href="https://search.creativecommons.org/search?q=dog%7Ccat" &gt; &lt;em&gt;dog|cat&lt;/em&gt; &lt;/a&gt; </code></pre> <p>After all these changes we had some increase in the accessibility scores (computed from Lighthouse):</p> <ol> <li>About Page: 78 -&gt; 97 | +19</li> <li>Search-Guide Page: 76 -&gt; 97 | +23</li> <li>Feedback Page: 75 -&gt; 97 | +22</li> </ol> <p>Whoosh!! That was quite a lot. We are done with these two weeks for now.
Hope to see you in the next post as well.</p> <p>You can track the work done for these weeks through these PRs:</p> <ol> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1068">Accessibility</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1072">Accessibility Improvements</a></li> </ol> <p>The progress of the project can be tracked on <a href="https://github.com/cc-archive/cccatalog-frontend">cc-search</a>.</p> <p>CC Search Accessibility is my GSoC 2020 project under the guidance of <a href="https://creativecommons.org/author/zackcreativecommons-org/">Zack Krida</a> and <a href="/blog/authors/akmadian/">Ari Madian</a>, the primary mentor for this project. <a href="https://creativecommons.org/author/annacreativecommons-org/">Anna Tumadottir</a> has been helping all along, and engineering director <a href="https://creativecommons.org/author/kriticreativecommons-org/">Kriti Godey</a> has been very supportive.</p> Data flow: from API to DB 2020-07-22T00:00:00Z ['srinidhi'] urn:uuid:44db39c0-d562-3a62-b5c5-ac4babcbbe40 <h2 id="introduction">Introduction</h2><p>The CC Catalog project handles the flow of image metadata from the source or provider, loading it into the database, from which it is surfaced to the <a href="https://ccsearch.creativecommons.org/about">CC Search</a> tool. Workflows are set up for each provider to gather metadata about CC licensed images. These workflows are handled with the help of Apache Airflow, an open source tool that helps us schedule and monitor workflows.</p> <h2 id="airflow-intro">Airflow intro</h2><p>Apache Airflow provides an easy-to-use UI that makes managing tasks simple. In Airflow, the tasks we want to schedule are organised in DAGs (Directed Acyclic Graphs). A DAG consists of a collection of tasks and the relationships defined among them, so that they run in an organised manner. DAG files are standard Python files that are loaded from the defined <code>DAG_FOLDER</code> on a host. Airflow selects all the Python files in the <code>DAG_FOLDER</code> that have a DAG instance defined globally, and executes them to create the DAG objects.</p> <h2 id="cc-catalog-workflow">CC Catalog Workflow</h2><p>In the CC Catalog, Airflow is set up inside a Docker container along with other services. The loader and provider workflows are inside the <code>dags</code> directory in the repo (<a href="https://github.com/cc-archive/cccatalog/tree/dacb48d24c6ae9b532ff108589b9326bde0d37a3/src/cc_catalog_airflow/dags">dag folder</a>). Provider workflows are set up to pull metadata about CC licensed images from the respective providers; the data pulled is structured into a standardised format and written to a TSV (Tab Separated Values) file locally. These TSV files are then loaded into S3 and finally into the PostgreSQL DB by the loader workflow.</p> <h2 id="provider-api-workflow">Provider API workflow</h2><p>The provider workflows are usually scheduled at one of two frequencies: daily or monthly.</p> <p>Providers such as Flickr or Wikimedia Commons that are filtered using the date parameter are usually scheduled for daily jobs. These providers have a large volume of continuously changing data, and so daily updates are required to keep the data in sync.</p> <p>Providers that are scheduled for monthly ingestion are ones with a relatively low volume of data, or for which filtering by date is not possible. This means we need to ingest the entire collection at once. Examples are museum providers like the <a href="https://collection.sciencemuseumgroup.org.uk/">Science museum UK</a> or <a href="https://www.smk.dk/">Statens Museum for Kunst</a>. We don’t expect museum providers to change data on a daily basis.</p> <p>The scheduling of the DAGs by the scheduler daemons depends on a few parameters.</p> <ul> <li><code>start_date</code> - it denotes the starting date from which the task should begin running.</li> <li><code>schedule_interval</code> - it denotes the interval between subsequent runs; it can be specified with Airflow keyword strings like “@daily”, “@weekly”, “@monthly”, and “@yearly”, or with a cron expression.</li> </ul> <p>Example: the Cleveland Museum is currently scheduled for a monthly crawl with a start date of <code>2020-01-15</code> (<a href="https://github.com/cc-archive/cccatalog/blob/dacb48d24c6ae9b532ff108589b9326bde0d37a3/src/cc_catalog_airflow/dags/cleveland_museum_workflow.py">cleveland_museum_workflow</a>).</p>
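<p>For illustration, a skeleton of such a DAG might be declared as follows (Airflow 1.10-style imports are assumed, and the callable body is a placeholder rather than the actual provider script):</p> <pre><code>from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def pull_cleveland_data():
    # Placeholder for the provider script that writes image metadata to a TSV file.
    pass


dag = DAG(
    dag_id="cleveland_museum_workflow",
    start_date=datetime(2020, 1, 15),
    schedule_interval="@monthly",
)

pull_data = PythonOperator(
    task_id="pull_cleveland_data",
    python_callable=pull_cleveland_data,
    dag=dag,
)
</code></pre>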
<h2 id="loader-workflow">Loader workflow</h2><p>The data from the provider scripts are not directly loaded into S3. Instead, they are stored in a TSV file on the local disk, and the tsv_postgres workflow handles loading the data to S3 and eventually to PostgreSQL. The DAG starts by calling the task to stage the oldest TSV file from the output directory of the provider scripts to the staging directory. Next, two tasks run in parallel: one loads the TSV file in the staging directory to S3, while the other creates the loading table in the PostgreSQL database. Once the data is loaded to S3 and the loading table has been created, the data from S3 is loaded to the intermediate loading table and then finally inserted into the image table. If loading from S3 fails, the data is loaded to PostgreSQL from the locally stored TSV file. When the data has been successfully transferred to the image table, the intermediate loading table is dropped and the TSV files in the staging directory are deleted. If copying the TSV files to S3 fails, those files are moved to the failure directory for future inspection.</p> <div style="text-align:center;"> <img src="loader_workflow.png" width="1000px"/> <p> Loader workflow </p> </div><h2 id="acknowledgement">Acknowledgement</h2><p>I would like to thank Brent Moran for helping me write this blog post.</p> What is up? - CCOS Revamp 2020-07-20T00:00:00Z ['dhruvi16'] urn:uuid:db2b95cd-c221-3f56-9ffb-8b165676fdbe <p>In my previous blog, I demonstrated what my Outreachy project was about. Here I will talk about my progress in the past 7 weeks.</p> <h3 id="the-set-up">The Set-Up -</h3><p>The <a href="/">Creative Commons Open Source</a> website is built using <a href="https://www.getlektor.com/">Lektor</a>. I was not very familiar with it, so I started by going through the documentation and the official website code. I learned how awesome it is and how it can also be used by non-coders. I got familiar with <a href="https://palletsprojects.com/p/jinja/">Jinja templates</a> and the workings of themes in a Lektor app. For integrating new styles from Vocabulary, I replaced the <code>templates/</code> folder with a <code>theme/</code> folder. Here is the link to how <a href="https://www.getlektor.com/docs/templates/">templates</a> work in Lektor.</p> <p>As the revamping process is gradual, there was a need to set up a staging environment where we could test the website.
Deploying the branch that contains the ongoing changes was pretty easy: I just followed the official <a href="https://www.netlify.com/blog/2016/05/25/lektor-on-netlify-a-step-by-step-guide/">documentation</a> provided by Netlify and deployed it.</p> <h3 id="adding-new-components-to-vocabulary">Adding New Components to Vocabulary -</h3><p>The <a href="https://www.figma.com/file/mttcugI1UxvCJRKE5sdpbO/Mockups">mock-ups</a> for the new CCOS website extensively use Vocabulary components, styles, and patterns, and they included components that were not yet available in Vocabulary. So, I worked on building them from scratch. I enjoyed this part a bit too much. This was also a part of the project I did not think would take two weeks, but it did. I enjoyed questioning the scope, the design, and the experience of the components, and getting satisfactory answers. Maintaining good practices and focusing on details were fun things to do. It felt like I owned those components. You can check them out <a href="https://cc-vocabulary.netlify.app/?path=/docs/vocabulary-introduction--page">here</a> and also use them wherever needed.</p> <h3 id="updating-templates-of-the-theme">Updating Templates of the Theme -</h3><p>I started by updating the home page template, trying to make the code cleaner and more readable. Going through the <a href="https://www.getlektor.com/">Lektor</a> documentation, I came across different ways to do so. One of them was <a href="https://www.getlektor.com/">flow blocks</a>; I like how they make a template more modular and readable, so I implemented the home page using them. One after another, I started updating every template. For now, I have updated 10 templates and I plan to update the remaining ones in upcoming weeks.</p> <h3 id="my-experience-so-far">My Experience so far -</h3><p>This has been one heck of a journey for me. I have never collaborated with such a huge open-source organization before, so that was something new for me. I have learned a lot of things, both technical and non-technical, so far. I have become more alert about the code I write; this journey has helped me improve the questions I ask myself while writing code or thinking about a solution; and I got to learn about new technologies such as <a href="https://www.getlektor.com/">Lektor</a>, <a href="https://webpack.js.org/">Webpack</a>, <a href="https://sass-lang.com/documentation/syntax">SCSS</a>, and many more. I am just very glad to be a part of this.</p> Linked Commons: What's new? 2020-07-16T00:00:00Z ['subhamX'] urn:uuid:cf56c986-4df1-344a-a15e-e852c66f895d <p><strong>Linked Commons</strong> is a visualization project which aims to showcase and establish relationships between millions of data points of licensed content metadata using graphs. Since this is the first blog of this new series, let’s discuss the core ideology behind the project and then the new features. Development of all components mentioned in this blog is complete and they have been successfully integrated, so do check out the development version. Happy Reading!</p> <h2 id="motivation-and-why-does-visualization-matter">Motivation and why does visualization matter?</h2><p>The number of websites using Creative Commons licensed content is huge and growing rapidly. The CC Catalog hosts these millions of data points, and each node contains information about the website’s URL and the licenses used.
One can surely do rigorous data analysis, but this would only be interpretable by a few people with a technical background. On the other hand, by visualizing data, it becomes incredibly easier to identify patterns and trends. As the old saying goes, a picture is worth a thousand words. That’s the core ideology of the Linked Commons, i.e. to show the millions of licensed content metadata nodes and how they are connected, in a visually appealing form on the HTML canvas, like a picture.</p> <h2 id="task-1-code-refactoring">Task 1: Code Refactoring</h2><p>My first task was to refactor the code and migrate it to React. The existing codebase had all the core functionality, but we wanted to make it more modular, improve the design and code readability, and reduce complexity. This will help us maintain this project in the long run. Also, it will be easier for the community to contribute and understand the logic.</p> <h2 id="task-2-graph-filtering">Task 2: Graph Filtering</h2><h3 id="need-for-filtering-methods">Need for Filtering Methods</h3><div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="big-graph.png" alt="Large Graph" style="border: 1px solid black"> <figcaption style="font-weight: 500;">Clusters of a graph with 9982 nodes and 5000 links</figcaption> </figure> </div><p>The aggregate data that CC Catalog has is in the hundreds of millions. Rendering a graph with this many nodes would be a nightmare for the browser’s rendering and JavaScript engines. Just like we divide any standard textbook into chapters, we thought about adding filtering options that enable the user to retrieve precise information according to certain criteria selected by them. Hence, we need to have a way in which we can filter the aggregate data into smaller chunks.</p> <h3 id="what-filtering-methods">What filtering methods?</h3><p>After brainstorming for a while, we converged and agreed to have <strong>filtering based on node name and distance</strong>. The primary reason behind this was that it is quite natural for a person to look for his/her favourite node and its neighbours. This is not the end for sure, and many more filtering methods will be added, maybe with support for chaining one after another. This is just a baby step!</p> <h3 id="server-side-filtering-vs-client-side-filtering">Server-side Filtering vs Client-Side Filtering?</h3><p>Now that we know on what query params the filter should work, we need to decide where to do the filtering. Should we do it on the client machine, or do it on our server and pass the processed and filtered data to the client? In any filtering method, we need to traverse the whole graph. The JS engine in the browser is already busy with rendering, complex calculations, etc. With all these processes, doing a full traversal of a dataset having more than a million nodes is going to take a lot of time and memory. The above claim assumes that we have a moderately dense graph. On the other hand, another strategy to accomplish graph filtering could be to delegate that load to a server, and the client’s browser can ask for a fresh copy of the filtered data whenever needed. As mentioned above, client-side filtering has serious shortcomings, and the user experience won’t be very good, with browser freezing and frame drops.
That’s why we decided to go with the latter option, i.e. server-side filtering.</p> <div style="text-align: center;"> <figure> <img src="filtering-in-action.gif" alt="Filtering In Action" style="border: 1px solid black"> <figcaption>Filtering In Action</figcaption> </figure> </div><h2 id="task-3-new-design">Task 3: New Design</h2><p>My third task was to upgrade the front-end design of the project. It now has a very clean and refreshing look, along with support for both light and dark themes. Check out our webpage in dark mode and do let us know if it reduces your PC’s energy consumption (as claimed by some websites). Now you can visit the Linked Commons webpage at midnight too, with no strain on the eyes. ;)</p> <div style="text-align: center; width: 90%; margin-left: 5%;"> <figure> <img src="new-design-light.png" alt="Light Theme" style="border: 1px solid black"> <figcaption>Linked Commons - Light Theme</figcaption> </figure> </div><h2 id="next-steps">Next steps</h2><p>In the next two weeks, I will be working on the following features.</p> <ul> <li>Implement the suggestions API on the server and integrate it with the frontend</li> <li>Update the visualization with a more recent and bigger dataset</li> </ul> <h2 id="conclusion">Conclusion</h2><p>Overall, it was a fantastic and rejuvenating experience working on these tasks. Now that you have read this blog till the end, I hope that you enjoyed it. For more information visit our <a href="https://github.com/cc-archive/cccatalog-dataviz/">Github repo</a>. We are looking forward to hearing from you about the Linked Commons. Our <a href="https://creativecommons.slack.com/channels/cc-dev-cc-catalog-viz">slack</a> doors are always open to you, see you there. :)</p> Internationalization continued: Modifying tests 2020-07-10T00:00:00Z ['AyanChoudhary'] urn:uuid:91d205ed-b94e-3d2a-9c30-5af7f04506ac <p>These are the fifth and sixth weeks of my internship with CC. I am working on improving the accessibility of cc-search and internationalizing it as well. This post covers yet another important aspect to be taken care of while internationalizing the Vue components, i.e. modifying tests to include the changes.</p> <p>The components which were left are the two pages displaying the most content:</p> <ol> <li><a href="https://github.com/cc-archive/cccatalog-frontend/blob/develop/src/pages/BrowsePage.vue">Browse page</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/blob/develop/src/pages/PhotoDetailPage.vue">ImageDetail page</a></li> </ol> <p>The above two pages were handled similarly to the remaining pages; special care had to be taken in the case of the ImageDetail page, since there are many components in different files. By this point we also have our JSON structure mostly figured out and are ready to push the JSON for fetching further translations.</p> <p>Now let's look at the modifications required in the tests. We generally use <code>$t</code> to access strings from the locales JSON, but this method/custom component is not present in the testing Vue instance, so we had to inject it using localVue and a custom i18n instance.</p> <pre><code>const localVue = createLocalVue(); localVue.use(Vuex); localVue.use(VueI18n); const messages = require('@/locales/en.json'); const i18n = new VueI18n({ locale: 'en', fallbackLocale: 'en', messages, }); </code></pre> <p>Now we inject this i18n instance into our Vue instance and we have access to our <code>$t</code>, but there is still one more step left.
We still need to mock its functionality in the tests, so we create a mock <code>$t</code> to use in our component. The final code is given below:</p> <pre><code>const $t = (key) =&gt; i18n.messages[key]; options = { mocks: { $t, }, }; </code></pre> <p>Now we are ready to render our component using these custom options with mocks for testing.</p> <p>And, <em>drum roll</em>, we have successfully completed internationalization of the complete CC Search. Below are images of some of the completed pages:</p> <p><img src="/blog/entries/cc-search-accessibility-week5-6/final.png" alt="final.png"></p> <p><img src="/blog/entries/cc-search-accessibility-week5-6/finalAbout.png" alt="finalAbout.png"></p> <p><img src="/blog/entries/cc-search-accessibility-week5-6/finalImageDetail.png" alt="finalImageDetail.png"></p> <p>The issues closed with the completion of internationalization are:</p> <ol> <li><a href="https://github.com/cc-archive/cccatalog-frontend/issues/487">[META] Internationalisation (i18n) Setup</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/issues/941">Set up vue-i18n infrastructure</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/issues/942">Create locale messages format JSON structure</a></li> <li><a href="https://github.com/cc-archive/cccatalog-frontend/issues/943">Allow users to change locale on the client</a></li> </ol> <p>You can track the work done for these weeks through this PR:</p> <ol> <li><a href="https://github.com/cc-archive/cccatalog-frontend/pull/1040">Localize browsepage and single-result page</a></li> </ol> <p>The progress of the project can be tracked on <a href="https://github.com/cc-archive/cccatalog-frontend">cc-search</a>.</p> <p>CC Search Accessibility is my GSoC 2020 project under the guidance of <a href="https://creativecommons.org/author/zackcreativecommons-org/">Zack Krida</a> and <a href="/blog/authors/akmadian/">Ari Madian</a>, the primary mentor for this project. <a href="https://creativecommons.org/author/annacreativecommons-org/">Anna Tumadottir</a> has been helping all along, and engineering director <a href="https://creativecommons.org/author/kriticreativecommons-org/">Kriti Godey</a> has been very supportive.</p> CC Legal Database: Coding and Mid-term status 2020-07-08T00:00:00Z ['krysal'] urn:uuid:0283bfb3-3e0e-3c6a-a70b-02df13f31235 <p>We are already in the second half of the time stipulated for the project, so it is time to pause to review the initial plan, celebrate the objectives achieved, and think about what remains to be done.</p> <h2 id="initial-plan">Initial plan</h2><p>Initially, two weeks were allocated to do the redesign for the new site. I thought there would be plenty of time here, <em>it's just design</em> I said to myself, despite not having done any serious project in Figma before beyond a few sketches. Later we will see that I was wrong. This included creating new Vocabulary components if necessary. Between the second and third weeks, I would create the data models (for Django and therefore for the database as well), and from the fourth week onwards I would start to implement all this in code: the homepage, the listing and detail pages, and the others.</p> <h2 id="issues-in-the-way">Issues in the way</h2><p>One task that took longer than expected was finishing the designs, a key point because the other tasks depended on it. Though the initial scheme was ready on time, as it was discussed with the stakeholders new requirements became evident, so more modifications had to be made.
For example, on the <a href="https://labs.creativecommons.org/caselaw/">current site</a>, the way to explore cases and scholarship is by country, and in principle this would stay the same, so I designed with that in mind; but after talking to our internal user (who acts as a <em>product owner</em> here), it turned out to be better to change this scheme to one of labels or categories that are more closely related to both entities. A highlight is the case of the Scholarship model, in which the country attribute was eliminated because it is not so relevant; although it seemed a small thing, this also caused changes in the design of the home page, the listings, and how the content of the database will be explored in general. Designing for a good user experience is not as easy as a non-designer may think. There were times when ideas were lacking, but the important thing is to make decisions and move forward; it will be improved in later iterations.</p> <p>As in all software development, unexpected things happen and errors appear no matter how much you plan ahead. For the fourth week I had planned to build a continuous integration system, to have a server where anyone can see the progress of my changes; however, there were a few inconveniences that had me googling for a couple of days. Publishing a Django project on Heroku can be tricky, especially regarding static files (assets like style sheets and scripts): if they are generated by Heroku at some point in the deployment pipeline, depending on the phase in which that is carried out, they can be lost in Heroku's ephemeral file system. I will not delve into the process here, but it seems important to highlight in case anyone else has similar problems.</p> <h2 id="progress-so-far">Progress so far</h2><p>I have managed to finish the main tasks and I would say that even the initially expected result has been improved. So I can list the following achievements:</p> <ul> <li>Redesigned the entire website using the Figma Design Library</li> <li>Built the first pages: Home, listing and detail pages for both Cases and Scholarship, and one for the FAQs</li> <li>Created a GitHub Action to lint every PR and check if it follows the project's code style</li> <li>Deployed the Django project on Heroku with a CI process linked to a GitHub repository; see the live development site <a href="https://cc-caselaw.herokuapp.com/">here</a></li> </ul> <p>It is said quickly, but each task carries a considerable workload. It's been a good result so far, and I've learned a lot of things along the way: basic use of Figma, use of Storybook (related to Vocabulary components), good code security practices, some accessibility details, and more.</p> <h2 id="plan-for-the-second-half-of-the-timeline">Plan for the second half of the timeline</h2><p>There are some tasks due from past weeks, such as building the forms for Case and Scholarship submissions, but I am confident that, now that the project has reached a stable state, I can do them quickly in the next few days. Other tasks were moved to later: searching records and filtering by tags will come after the forms are created, so I can finish the visual parts of the site first and focus on functional work without shifting between types of tasks.</p> <p>The tasks and their order have changed; as I mentioned earlier, requirements were modified (a bit), so some tasks I planned for the last weeks are not necessary anymore or are already handled out of the box by the Django admin (benefits of choosing a batteries-included framework!).
In general, I don't think the initial plan was wrong; we just went through the natural evolution of a software product. The mentors have also been very helpful in keeping a reasonable scope and adjusting priorities.</p> <p>After the main functionalities are done we can start making improvements, as we have already identified some nice-to-have features that are not so important at the moment. Stay tuned for more to come.</p>
- "漢字路" 한글한자자동변환 서비스는 교육부 고전문헌국역지원사업의 지원으로 구축되었습니다.
- "漢字路" 한글한자자동변환 서비스는 전통문화연구회 "울산대학교한국어처리연구실 옥철영(IT융합전공)교수팀"에서 개발한 한글한자자동변환기를 바탕하여 지속적으로 공동 연구 개발하고 있는 서비스입니다.
- 현재 고유명사(인명, 지명등)을 비롯한 여러 변환오류가 있으며 이를 해결하고자 많은 연구 개발을 진행하고자 하고 있습니다. 이를 인지하시고 다른 곳에서 인용시 한자 변환 결과를 한번 더 검토하시고 사용해 주시기 바랍니다.
- 변환오류 및 건의,문의사항은 juntong@juntong.or.kr로 메일로 보내주시면 감사하겠습니다. .
Copyright ⓒ 2020 By '전통문화연구회(傳統文化硏究會)' All Rights reserved.
 한국   대만   중국   일본